Configuring an Oozie job with a HDInsight Hadoop cluster
February 17, 2014
Oozie is widely used in the Hadoop world as a workflow scheduler. Microsoft HDInsight supports Oozie “out of the box” and comes with all necessary bits and examples which should help you to successfully configure Oozie in your Microsoft HDInsight environment.
While playing around with HDInsight and trying to reproduce a customer problem, I had to launch a Hive action from an Oozie job. To keep things simple, I looked at the example Oozie job that executes some Hive commands, and then adapted that example to my own needs. The path to the example after you have made a remote connection to the head node is:
If you look at the folder content, you will see there are 3 files:
· job.properties
· script.q
· workflow.xml
Now let’s look at what is going on in these sample files and where we need to modify them.
Job.properties
This file’s role is to set up the configuration environment for the Oozie job and it has the following configuration parameters:
#
nameNode=hdfs://localhost:8020 <- points to the default file system
jobTracker=localhost:8021 <- this is where the jobTracker service runs
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true <- tells Oozie whether to use the default system libraries
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/hive
A classic scenario for the HDInsight Service in Azure is storing your data in the cloud and provisioning Hadoop clusters only when you need to run particular calculations. In other words, you scale your compute separately from your storage. HDInsight defaults to Windows Azure Storage (called WASB when referenced from HDInsight), so your data is stored in blob storage and not in the HDFS file system.
Therefore, to tell Oozie about this fact you need to change the line
nameNode=hdfs://localhost:8020
to something like
nameNode=wasb://container_name@storage_name.blob.core.windows.net
You may want to store the scripts executed during the Oozie job, along with related configuration such as the workflow definition, in separate folders. The job must therefore specify where to find this information and data. In my case this line looks like the following:
oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs
Two more lines to pay attention to are:
oozie.use.system.libpath=true
jobTracker=jobtrackerhost:9010 <- the default value was something like localhost:8021
So, the final version of the job.properties file will look like the following:
nameNode=wasb://container_name@storage_name.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs
outputDir=ooziejobs-out
oozie.use.system.libpath=true
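As a quick sanity check, the file above can be generated and inspected from the command line. This is just a sketch: container_name and storage_name are placeholders that must be replaced with your own container and storage account names.

```shell
# Write the job.properties shown above; container_name and storage_name
# are placeholders -- substitute your own container and storage account.
cat > job.properties <<'EOF'
nameNode=wasb://container_name@storage_name.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs
outputDir=ooziejobs-out
oozie.use.system.libpath=true
EOF
# Every line should be a key=value pair; count them.
grep -c '=' job.properties
```

A malformed line (a stray comment marker or a missing value) is one of the most common reasons an Oozie submission fails, so a check like this is cheap insurance.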
Now, let’s look at the workflow configuration XML file.
Workflow.xml
The structure of workflow.xml is well described in the Oozie documentation. What we need to know is that this is the place where all the actions of the Oozie job are specified. In our example of executing a sample Hive job, we can expect that the specified Hive script script.q will be executed, based on the code below:
<action name="hive-node">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/hive"/>
<mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>script.q</script> <- this is where our Hive script will be called
<param>INPUT=/user/${wf:user()}/${examplesRoot}/input-data/table</param>
<param>OUTPUT=/user/${wf:user()}/${examplesRoot}/output-data/hive</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
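The action element shown above lives inside a complete workflow definition. For orientation, here is a minimal sketch of the surrounding workflow.xml; the workflow name and the kill-node message are illustrative, not taken verbatim from the shipped example:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
    <start to="hive-node"/>
    <action name="hive-node">
        <!-- the <hive> block shown above goes here -->
    </action>
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The ok and error transitions of the action point at the end and fail nodes defined here, which is why those names appear in the snippet above.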
Script.q
This is a script file that simply executes whatever Hive commands we specify. I created my own version of script.q. In my script I will create a Hive table called test, by replacing the sample contents of script.q with the following:
CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;
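Oozie substitutes ${INPUT} and ${OUTPUT} with the values from the <param> elements in workflow.xml before running the script. As a rough way to preview the result locally (the paths below match the workflow’s parameters for user admin; the sed substitution is only an illustration, not something Oozie itself requires):

```shell
# Recreate script.q, then preview it with the workflow's parameter
# values substituted in (paths match the <param> elements above).
cat > script.q <<'EOF'
CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;
EOF
sed -e 's#${INPUT}#/user/admin/examples/input-data/table#' \
    -e 's#${OUTPUT}#/user/admin/examples/output-data/hive#' script.q
```

Previewing the expanded script is a quick way to catch quoting mistakes (for example, the smart quotes that word processors insert around '${INPUT}') before submitting the job.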
Preparing and running the job
Now that we are done with the configuration we need to complete a couple more steps. If you recall, in the configuration we set the application folder as wasb:///user/admin/examples/apps/ooziejobs
Let’s upload the Hive script and the workflow to that location:
c:\apps\dist\hadoop-1.2.0.1.3.2.0-05>hadoop fs -copyFromLocal C:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\examples\apps\hive ///user/admin/examples/apps/ooziejobs
After this step is completed, we can attempt to run the Oozie job (finally).
oozie job -oozie http://localhost:11000/oozie -config C:\ooziejobs\job.properties -run
If everything runs successfully, right after the command line you just executed you will see a message indicating the ID of the Oozie job:
c:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\bin>oozie job -oozie http://namenodehost:11000/oozie -config C:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\examples\apps\hive\job.properties -run
job: 0000000-140130144826022-oozie-hdp-W
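That last line is the handle you need for everything else. A small sketch of capturing it in a script (the variable names are mine; oozie job -info is the standard CLI command for checking a job’s status):

```shell
# The CLI prints "job: <id>" on success; strip the prefix to get the id.
OOZIE_OUTPUT="job: 0000000-140130144826022-oozie-hdp-W"
JOB_ID="${OOZIE_OUTPUT#job: }"
echo "$JOB_ID"
# The id can then be used to poll the job's status, e.g.:
#   oozie job -oozie http://localhost:11000/oozie -info "$JOB_ID"
```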
The MapReduce administrative console will also show that the Map/Reduce job was submitted. In the job’s “Name” column you will see the job ID, which matches the job ID we saw on the command line.
To verify that the job executed successfully, we can check the log, or use an even simpler method: run a Hive command and see whether the table was created:
hive> show tables;
OK
hivesampletable
test
Time taken: 1.925 seconds, Fetched: 2 row(s)
hive>
All worked well! We took a sample Oozie script, customized it for our own needs, and executed it to create a Hive table.
Hi Alexei,
Thanks for your post!
I’m a .NET guy with 8 years of experience, but I won’t hesitate to learn other technologies that best fit a concept, knowing the fact that the “assembly language” of Big Data is Java. I would appreciate it if you could help me with how to start working on Hadoop. I have found HDInsight, Cloudera, etc. Can you help me choose a platform?
Hi,
The big differentiator between HDInsight and the rest is that the Microsoft distribution is a more or less classical PaaS platform, with all its advantages and disadvantages. In other words, we are talking about a pure cloud environment, whereas others such as Cloudera, Hortonworks, etc. are suited to running your workloads on-premises. The emulator can help you get started with the product and see whether you feel comfortable with it.