Configuring an Oozie job with an HDInsight Hadoop cluster

Oozie is widely used in the Hadoop world as a workflow scheduler. Microsoft HDInsight supports Oozie out of the box and comes with all the necessary bits and examples to help you configure Oozie successfully in your HDInsight environment.

While playing around with HDInsight and trying to reproduce a customer problem, I had to launch a Hive action from an Oozie job. To keep things simple, I looked at the example Oozie job that executes some Hive commands, then adapted that example to my own needs. The path to the example, after you have made a remote connection to the head node, is:

PathToExample

 

If you look at the folder contents, you will see there are 3 files:

·         job.properties

·         script.q

·         workflow.xml

Now let’s look at what is going on in these sample files and where we need to modify them.

Job.properties

This file’s role is to set up the configuration environment for the Oozie job and it has the following configuration parameters:

#
nameNode=hdfs://localhost:8020          # points to the default file system
jobTracker=localhost:8021               # this is where our JobTracker service runs
queueName=default
examplesRoot=examples

oozie.use.system.libpath=true           # tells Oozie whether to use its default libraries
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/hive
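To see how the `${...}` placeholders in the last line expand, here is a rough sketch of the substitution in Python. This is only an illustration, not Oozie's actual EL resolver (which is far richer); the property values are the defaults from the example above:

```python
import re

# Simplified sketch of how Oozie-style ${...} placeholders resolve.
props = {
    "nameNode": "hdfs://localhost:8020",
    "user.name": "admin",        # normally supplied by the submitting user
    "examplesRoot": "examples",
}

def resolve(value, props):
    # Replace each ${key} with its value from the property map.
    return re.sub(r"\$\{([^}]+)\}", lambda m: props[m.group(1)], value)

path = resolve("${nameNode}/user/${user.name}/${examplesRoot}/apps/hive", props)
print(path)  # hdfs://localhost:8020/user/admin/examples/apps/hive
```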

 

One classic scenario where HDInsight Service in Azure should be used is when you store your data in the cloud and provision Hadoop clusters only when you need to run particular calculations. In other words, you are scaling your compute separately from your storage. HDInsight defaults to Windows Azure Storage (called WASB when referenced from HDInsight), so your data is stored in blob storage and not in the HDFS file system.

Therefore, in order to tell Oozie about this fact you need to change the line

nameNode=hdfs://localhost:8020

 

to something like

nameNode=wasb://container_name@storage_name.blob.core.windows.net
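The WASB URI follows a fixed pattern, so it can help to assemble it programmatically when scripting cluster setup. A minimal sketch (the container and account names below are placeholders, not real resources):

```python
def wasb_uri(container, account):
    """Build the WASB URI used as the nameNode value.

    container/account are placeholders for your own
    blob container and storage account names.
    """
    return f"wasb://{container}@{account}.blob.core.windows.net"

print(wasb_uri("mycontainer", "mystorage"))
# wasb://mycontainer@mystorage.blob.core.windows.net
```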

 

You may want to store the scripts that will be executed during the Oozie job, along with related configuration such as the workflow definition, in separate folders. Therefore, the job should also specify where to find this information and data. In my case this line looks like the following:

 

oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs

 

Two other lines to pay attention to are:

 

oozie.use.system.libpath=true 

jobTracker=jobtrackerhost:9010          # the default value was something like localhost:8021

 

 

 

So, the final version of the job.properties file will look like the following:

 

nameNode=wasb://container_name@storage_name.blob.core.windows.net

jobTracker=jobtrackerhost:9010

queueName=default

 

oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs

outputDir=ooziejobs-out

oozie.use.system.libpath=true
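Before submitting the job, it can be worth sanity-checking that the properties file actually contains the keys the job depends on. A minimal sketch of such a check (this is a simplistic `key=value` parser, not Oozie's own configuration loader):

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# The final job.properties from above, inlined for the check.
final = """
nameNode=wasb://container_name@storage_name.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs
outputDir=ooziejobs-out
oozie.use.system.libpath=true
"""

props = parse_properties(final)
for required in ("nameNode", "jobTracker", "oozie.wf.application.path"):
    assert required in props, f"missing {required}"
print(props["jobTracker"])  # jobtrackerhost:9010
```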

 

 

Now, let’s look at the workflow configuration XML file.

 

Workflow.xml

 

The structure of workflow.xml is well described in the Oozie documentation. What we need to know is that this is the place where all the actions of the Oozie job are specified. In our example of executing a sample Hive job, we can expect that the specified Hive script script.q will be executed based on the code below:

 

    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/hive"/>
                <mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.q</script> <!-- this is where our Hive script is called -->
            <param>INPUT=/user/${wf:user()}/${examplesRoot}/input-data/table</param>
            <param>OUTPUT=/user/${wf:user()}/${examplesRoot}/output-data/hive</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
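For orientation, the hive action normally sits inside a workflow skeleton along these lines. The node names below match the `ok`/`error` transitions of the action; treat this as a sketch based on the standard Oozie hive example, and check it against your own workflow.xml:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
    <start to="hive-node"/>
    <!-- the hive action shown above goes here -->
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```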

 

Script.q

This is a script file that simply executes whatever Hive commands we specify. I created my own version of script.q. In my script I will create a Hive table called test by replacing the sample contents of script.q with the following:

 

CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';

INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;

Preparing and running the job

 

Now that we are done with the configuration we need to complete a couple more steps. If you recall, in the configuration we set the application folder as wasb:///user/admin/examples/apps/ooziejobs

Let’s upload the Hive script and the workflow to that location:

c:\apps\dist\hadoop-1.2.0.1.3.2.0-05>hadoop fs -copyFromLocal C:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\examples\apps\hive ///user/admin/examples/apps/ooziejobs

After this step is completed, we can attempt to run the Oozie job (finally). :)

 

oozie job -oozie http://localhost:11000/oozie -config  C:\ooziejobs\job.properties -run

 


If everything runs successfully, right after the command line you just executed you will see a message indicating the ID of the Oozie job:

 

c:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\bin>oozie job -oozie http://namenodehost:11000/oozie -config C:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\examples\apps\hive\job.properties -run

job: 0000000-140130144826022-oozie-hdp-W
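If you want to script a follow-up status check (for example, feeding the ID into `oozie job -info <id>`), the workflow ID can be pulled out of that output line. A small sketch, using the output shown above as sample input:

```python
import re

# Sample output line from "oozie job ... -run", as shown above.
output = "job: 0000000-140130144826022-oozie-hdp-W"

# Capture the token after "job:" -- that is the workflow ID.
match = re.search(r"job:\s*(\S+)", output)
job_id = match.group(1)
print(job_id)  # 0000000-140130144826022-oozie-hdp-W
```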

 

Also, in the MapReduce administrative console there will be an indication that the MapReduce job was submitted:

[Screenshot: MapReduce administrative console]

 

 

In the “Name” column of the job, you will see the job ID, which matches the job ID we saw on the command line.

To test whether the job executed successfully, we can check the log, or use another simple way: run a Hive command and see if the table was created:

 

hive> show tables;

OK

hivesampletable

test

Time taken: 1.925 seconds, Fetched: 2 row(s)

hive>

 

All worked well! We took a sample Oozie script, customized it for our own needs, and executed it to create a Hive table.


 

2 Responses to Configuring an Oozie job with an HDInsight Hadoop cluster

  1. shafitrumboo says:

    Hi Alexei,

    Thanks for your post!
    I’m .net Guy with 8 years exp but will not hesitate to learn other technologies that fits best for any concept. knowing the fact that the “assembly language” of BigData is Java. It will be appreciated if you can help me how I can start working on Hadoop. As I found Hdinsight , Cloudera etc. Can you help me in choosing the platform.

    • Hi,

      The big differentiator between HDI and the rest is that the MS distribution is a more or less classical PaaS platform, with all its advantages and disadvantages. That said, we are talking about a pure cloud environment, whereas others such as Cloudera, Hortonworks, etc. are suited for running your workloads on-premises. The emulator can help you get started with the product and see if you feel comfortable with it.
