Configuring an Oozie job with an HDInsight Hadoop cluster

Oozie is widely used in the Hadoop world as a workflow scheduler. Microsoft HDInsight supports Oozie out of the box and ships with all the bits and examples you need to configure Oozie successfully in your HDInsight environment.

While playing around with HDInsight and trying to reproduce a customer problem, I had to launch a Hive action from an Oozie job. To keep things simple, I looked at the example Oozie job that executes some Hive commands and then adapted that example to my own needs. The path to the example, after you have made a remote connection to the head node, is:

PathToExample

 

If you look at the folder content, you will see three files:

·         job.properties

·         script.q

·         workflow.xml

Now let’s look at what is going on in these sample files and where we need to modify them.

job.properties

This file sets up the configuration environment for the Oozie job and contains the following parameters:

#

nameNode=hdfs://localhost:8020                    <- points to the default file system

jobTracker=localhost:8021                         <- this is where our JobTracker service runs

queueName=default

examplesRoot=examples

oozie.use.system.libpath=true                     <- tells us whether to use the default Oozie libraries

oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/hive

 

One classic scenario for using the HDInsight Service in Azure is storing your data in the cloud and provisioning Hadoop clusters only when you need to run particular calculations. In other words, you scale your compute separately from your storage. HDInsight defaults to Windows Azure Storage (called WASB when referenced from HDInsight), so your data is stored in blob storage and not in the HDFS file system.

Therefore, in order to tell Oozie about this, you need to change the line

nameNode=hdfs://localhost:8020

 

to something like

nameNode=wasb://container_name@storage_name.blob.core.windows.net

 

You may want to store the scripts that the Oozie job executes, along with other related configuration such as the workflow definition, in a separate folder. The job therefore also has to specify where to find this information. In my case the line looks like the following:

 

oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs

 

Two other lines to pay attention to are:

 

oozie.use.system.libpath=true 

jobTracker=jobtrackerhost:9010                    <- the default value was something like localhost:8021

 

 

 

So, the final version of the job.properties file will look like the following:

 

nameNode=wasb://container_name@storage_name.blob.core.windows.net

jobTracker=jobtrackerhost:9010

queueName=default

 

oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs

outputDir=ooziejobs-out

oozie.use.system.libpath=true

 

 

Now, let’s look at the workflow configuration XML file.

 

workflow.xml

 

The structure of workflow.xml is described in detail in the Oozie documentation. What we need to know is that this is where all the actions of the Oozie job are specified. In our example of executing a sample Hive job, we can expect the specified Hive script, script.q, to be executed based on the code below:

 

    <action name="hive-node">

        <hive xmlns="uri:oozie:hive-action:0.2">

            <job-tracker>${jobTracker}</job-tracker>

            <name-node>${nameNode}</name-node>

            <prepare>

                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/hive"/>

                <mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>

            </prepare>

            <configuration>

                <property>

                    <name>mapred.job.queue.name</name>

                    <value>${queueName}</value>

                </property>

            </configuration>

            <script>script.q</script> <!-- this is where our Hive script is called -->

            <param>INPUT=/user/${wf:user()}/${examplesRoot}/input-data/table</param>

            <param>OUTPUT=/user/${wf:user()}/${examplesRoot}/output-data/hive</param>

        </hive>

        <ok to="end"/>

        <error to="fail"/>

    </action>
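
For context, the hive-node action above sits inside a larger workflow document. A minimal skeleton, assuming the node names from the sample (the workflow name is illustrative), looks roughly like this:

<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">

    <start to="hive-node"/>

    <!-- the <action name="hive-node"> element shown above goes here -->

    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>

</workflow-app>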

 

script.q

This is a script file that simply executes whatever Hive commands we specify. I created my own version of script.q. In my script I will create a Hive table called test by replacing the sample contents of script.q with the following:

 

CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';

INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;

Preparing and running the job

 

Now that we are done with the configuration, a couple more steps remain. If you recall, in the configuration we set the application folder to wasb:///user/admin/examples/apps/ooziejobs

Let’s upload the Hive script and the workflow to that location:

c:\apps\dist\hadoop-1.2.0.1.3.2.0-05>hadoop fs -copyFromLocal C:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\examples\apps\hive /user/admin/examples/apps/ooziejobs

After this step is completed, we can (finally) attempt to run the Oozie job:

 

oozie job -oozie http://localhost:11000/oozie -config  C:\ooziejobs\job.properties -run

 


If everything runs successfully, the command prints a message with the ID of the Oozie job right after the command line you just executed:

 

c:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\bin>oozie job -oozie http://namenodehost:11000/oozie -config C:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\examples\apps\hive\job.properties -run

job: 0000000-140130144826022-oozie-hdp-W
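
You can also poll the job from the same console. A quick sketch using the standard Oozie CLI options (substitute your own job ID):

oozie job -oozie http://namenodehost:11000/oozie -info 0000000-140130144826022-oozie-hdp-W
oozie job -oozie http://namenodehost:11000/oozie -log  0000000-140130144826022-oozie-hdp-W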

 

The MapReduce administrative console will also show that the MapReduce job was submitted.

[screenshot: MapReduce administrative console]

 

 

In the "Name" column of the job you will see the job ID, which matches the job ID we saw on the command line.

To test whether the job executed successfully, we can check the log, or use an even simpler way: run a Hive command and see if the table was created:

 

hive> show tables;

OK

hivesampletable

test

Time taken: 1.925 seconds, Fetched: 2 row(s)

hive>

 

All worked well! We took a sample Oozie script, customized it for our own needs, and executed it to create a Hive table.


 


Expanding the HDP Hadoop file system to Azure Blob Storage

If you are building a cloud-based Big Data solution, an HDInsight cluster in Windows Azure could be one of the first platforms you look at. Hortonworks HDP running on IaaS VMs is another option you may want to consider. Even though both solutions share the same code base underneath, a number of factors may influence the final choice, and the key one is how the cluster is going to be used: do you need it only for temporary calculations, do you depend on additional Hadoop-related tools and therefore on Java, and so on.

I am not going to focus on those questions here. What interests me at the moment is that HDI relies on Azure Storage for saving data. Azure Storage is cheap and scales very well. If you look at an HDP cluster in Azure VMs, it also uses Azure Storage: all the VHDs that you attach as data drives to your VMs are stored in Azure Blob storage. The largest Azure VM supports up to 16 drives of 1 TB each, so you can easily get 16 TB of data stored in your Hadoop cluster. What do you do if you need more?

Disclaimer: the following configuration works, but it is not yet officially supported by either Microsoft or Hortonworks. If you are going to use it, you do so at your own risk!

At a very early stage of HDI validation, Cindy Gross published instructions on how to connect Azure Blob storage (asv, the Azure Storage Vault) to an HDI cluster.

So, assuming you have an HDP cluster (either in the cloud or on-premises) up and running and an Azure Storage account created, let’s try to make these two technologies friends.

Following Cindy’s recommendation, let’s look at core-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">

<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>hdfs://HDPServer:8020</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

Currently only the default HDFS file system is configured. In my solution I want ASV to be my default file system, so I need to change this part of core-site.xml to re-point Hadoop to Azure Storage.

In order to communicate with Azure Storage successfully, Hadoop needs to know:

  • the storage account to connect to
  • the security key to use for the connection

Let’s add this information to the configuration file. First, I will change “default file system” config already mentioned above:

<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>asv://hdp@YourStorageAccount.blob.core.windows.net</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

Now, the next property doesn’t exist in core-site.xml yet – you need to add it manually:

<property>
<name>fs.azure.account.key.YourStorageAccount.blob.core.windows.net</name>
<value>The_security_key_value </value>
</property>

Now, let’s restart the Hadoop cluster and try to copy something from a local folder into the asv:// path:

F:\hdp\hadoop>hadoop fs -copyFromLocal c:/Test asv://
F:\hdp\hadoop>hadoop fs -ls asv:/
Found 3 items
drwxr-xr-x – HDPAdmin supergroup 0 2014-01-09 18:18 /Test
drwxr-xr-x – hadoop supergroup 0 2013-11-14 10:04 /mapred

You can also use something like Azure Storage Explorer to see whether the data was copied into the Azure blob.

Now, I also want to keep some data in the HDFS file system. However, if I try to open the Hadoop Name Node console, the browser fails to connect to the head node. Well, no wonder – I changed the default file system configuration and never told the cluster how I want HDFS to be treated.

There is one more configuration entry I need to add to core-site.xml to fix this:

<property>
  <name>dfs.namenode.rpc-address</name>
  <value>hdfs://namenodehost:9000</value>
</property>

Now, after restarting Hadoop again, I can also try copying files into HDFS:

F:\hdp\hadoop>hadoop fs -copyFromLocal c:/Test hdfs://HDPServer:9000/
F:\hdp\hadoop>hadoop fs -ls hdfs://HDPServer:9000/
Found 7 items

drwxr-xr-x – HDPAdmin supergroup 0 2014-01-09 16:29 /Test
drwxr-xr-x – hadoop supergroup 0 2013-08-23 16:22 /apps
drwxr-xr-x – hadoop supergroup 0 2013-09-04 13:28 /hive
drwxr-xr-x – hadoop supergroup 0 2013-12-17 14:36 /mapred
drwxr-xr-x – HDPAdmin supergroup 0 2013-10-09 16:32 /tmp
drwxr-xr-x – HDPAdmin supergroup 0 2013-09-03 16:40 /tpch_1gb
drwxr-xr-x – HDPAdmin supergroup 0 2013-10-09 16:27 /user

So, everything works, and even my Hadoop Name Node console is back again.

The very last question left: what if I need to add more Storage Accounts? Just follow the steps Cindy described in her blog:

  • Add the information that associates the key value with your default storage account

<property>
  <name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
  <value>YOURActualStorageKeyValue</value>
</property>

  • Add any additional storage accounts you plan to access

<property>
  <name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
  <value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>
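
Once the second key is in place, you can address that account explicitly from the command line. A quick sketch (the container name data is hypothetical):

hadoop fs -ls asv://data@YOUR_SECOND_StorageAccount.blob.core.windows.net/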

Have fun!

Setting up a multi-node HDP cluster

As you may already be aware, Hortonworks announced support for Hadoop clusters on the Windows platform. In the project we are involved in at the moment, we were lucky enough to play with a multi-node HDP cluster. Personally, I felt as if I were back at the beginning of the ’90s, with all the configuration files and the amount of work done from the command line. However, HDP is really a lot of fun and easy to set up and run. Here is a list of things you may need to know in order to save time setting up HDP in your environment:

1. Even if it sounds very obvious, follow the documentation. The software listed in the requirements has to be installed, and parameters like JAVA_HOME have to be configured (see the sketch below).
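
A minimal sketch of setting JAVA_HOME machine-wide from an elevated command prompt; the JDK path is hypothetical, and a path without spaces is commonly recommended:

setx JAVA_HOME "C:\Java\jdk1.6.0_31" /m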

2. Pay attention to the firewall ports that have to be opened! HDP needs a lot of ports; in some cases I opened a range instead of individual ports (for example, 50000-60000 instead of every single port in the 50K range), as in the sketch below.
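
A single inbound rule for a whole range can be added like this (rule name and range are illustrative; open only what your environment actually needs):

netsh advfirewall firewall add rule name="HDP 50K range" dir=in action=allow protocol=TCP localport=50000-60000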

3. Leave only one network card enabled and disable the rest on your box.

4. Disable IPv6. You can do this by executing the following command for each transition interface:

netsh interface teredo set state disabled
netsh interface 6to4 set state disabled
netsh interface isatap set state disabled

5. In clusterproperties.txt, make sure you specify the paths for setting up HDP. Pay attention: a path with subfolders is not allowed, and no spaces in the folder name are allowed. The documentation is misleading here. Example of correct entries:

#Log directory

HDP_LOG_DIR=C:\HDP_LOG

#Data directory

HDP_DATA_DIR=C:\HDP_DATA

6. In the same configuration file, #Hosts section: a full UNC path will not work! Use short names instead, like below:

NAMENODE_HOST=server02

7. Edit the hosts file and add the node names with their IP addresses there, for example as in the sketch below.
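
A sketch of what the entries in C:\Windows\System32\drivers\etc\hosts might look like; the IP addresses and host names are purely illustrative:

10.0.0.2    server02
10.0.0.3    server03
10.0.0.4    server04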

8. Now you can run the installation. Example of the setup string:

C:\Windows\system32>msiexec /i "C:\HDP_SETUP\hdp-1.1.0-160.winpkg.msi" /lv "C:\HDP_SETUP\installer.log" HDP_LAYOUT="c:\HDP_SETUP\clusterproperties.txt" HDP_DIR="C:\hdp" DESTROY_DATA="Yes"

You will need to run this command on every node of the cluster!

9. In case the setup failed, make sure you run the uninstall command before trying to re-run the installation process.

msiexec /x "C:\HDP_SETUP\hdp-1.1.0-160.winpkg.msi" /lv "C:\HDP_SETUP\installer.log" DESTROY_DATA="no"

10. Before running setup you will also need to enable remote scripting on every node. It looks like this doesn’t always work correctly, so even if setup was successful you may see something like this in the error log (the commands typically used to enable remote scripting are sketched after the log excerpt):

CAQuietExec:  WINPKG: rd /s /q G:\HadoopInstallFiles\HadoopPackages\hdp-1.1.0-winpkg\resources\hadoop-1.1.0-SNAPSHOT.winpkg

CAQuietExec:  powershell.exe : powershell.exe : powershell.exe : Stop-Service : Cannot find

CAQuietExec:  any service with

CAQuietExec:  At G:\HadoopInstallFiles\HadoopSetupTools\winpkg.ps1:118 char:9

CAQuietExec:  +     $out = powershell.exe -InputFormat none -Command "$command" 2>&1

CAQuietExec:  +            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAQuietExec:      + CategoryInfo          : NotSpecified: (powershell.exe …y service with 

CAQuietExec:     :String) , RemoteException

CAQuietExec:      + FullyQualifiedErrorId : NativeCommandError
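
For reference, a sketch of the standard PowerShell commands usually used to enable remote scripting, run from an elevated PowerShell prompt on each node (adjust the execution policy to your own security requirements):

Set-ExecutionPolicy RemoteSigned -Force
Enable-PSRemoting -Force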

11. On every node manually start the services by calling start_local_hdp_services.cmd

12. If everything is done right, you will be able to run a smoke test with no errors and enjoy the HDP cluster.

Big Data on Azure tutorials and scenarios

Microsoft released a couple of tutorials that help in understanding Hadoop and Hadoop on Azure. Some interesting information was also published covering typical scenarios, such as storage of structured and unstructured data, as well as extraction of BI information.

Hadoop on Windows Azure – Working With Data

Big Data Scenarios

Enjoy reading!

Using SQOOP for moving data between HADOOP and SQL Server

 

Following the post on how to load data from Hadoop into SQL Server, here is an example of how you can use a different technology, Sqoop, to transfer data.

 

Move data from a Hive table in HADOOP into SQL Server.

 

1. The Hadoop service at https://www.hadooponazure.com/ comes with a Hive sample table called HiveSampleTable. For simplicity I will use this table. If you want to create your own, you can follow the steps in the appendix, which describe how to create your own source table.

2. On the SQL Server side we need to make sure we have a table that will receive the data.

Here is the T-SQL syntax for the TARGET table:

CREATE TABLE [dbo].[TARGET](
    [clientid] [nvarchar](334) NULL,
    [querytime] [nvarchar](334) NULL,
    [market] [nvarchar](334) NULL,
    [deviceplatform] [nvarchar](334) NULL,
    [devicemake] [nvarchar](334) NULL,
    [devicemodel] [nvarchar](334) NULL,
    [state] [nvarchar](334) NULL,
    [country] [nvarchar](334) NULL,
    [querydwelltime] [float] NULL,
    [sessionid] [bigint] NULL,
    [sessionpagevieworder] [bigint] NULL
) ON [PRIMARY]

3. You also need to make sure that you can move data in and out of Hadoop and that the appropriate ODBC ports are opened:

[screenshot: port configuration in the Hadoop on Azure portal]

4. Open the Hadoop command prompt and type the following command to initiate the Sqoop data movement:

c:\Apps\dist\sqoop>bin\sqoop.cmd export --connect "jdbc:sqlserver://alexeik0N:1433;database=RelationalDB;user=sa;password=Password" --table TARGET --export-dir /hive/warehouse/hivesampletable --input-fields-terminated-by "\t"

where:

jdbc:sqlserver://alexeik0N:1433 – your destination SQL Server

database=RelationalDB – destination database

user=sa;password=Password – credentials that allow Sqoop to connect to the database

--table TARGET – destination table. Sqoop doesn’t like spaces in the name!

--export-dir /hive/warehouse/hivesampletable – the Hive table directory

--input-fields-terminated-by "\t" – the delimiter symbol. In this particular case we say that the original table uses a horizontal tab delimiter

5. Execute this command and monitor the execution in the Hadoop Command Prompt:

[screenshot: Sqoop export running in the Hadoop command prompt]

Verify that the records were exported

6. Now, on the SQL Server side, make sure the data successfully landed in the TARGET table:

[screenshot: rows in the TARGET table in SQL Server Management Studio]
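
A trivial way to check this from SSMS (standard T-SQL; adjust the database and schema names to your environment):

SELECT COUNT(*) FROM [RelationalDB].[dbo].[TARGET];
SELECT TOP 10 * FROM [RelationalDB].[dbo].[TARGET];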

Move data from a SQL Server table into HADOOP.

 

1. Select the SQL Server table that will be used as the source. Here is an example of the table that I will be using:

SELECT [clientid]
      ,[querytime]
      ,[market]
      ,[deviceplatform]
      ,[devicemake]
      ,[devicemodel]
      ,[state]
      ,[country]
      ,[querydwelltime]
      ,[sessionid]
      ,[sessionpagevieworder]
  FROM [RelationalDB].[dbo].[SOURCE]

2. Now, let’s write the Sqoop command to load data from SQL Server into HDFS:

c:\Apps\dist\sqoop\bin>sqoop.cmd import --connect "jdbc:sqlserver://alexeik0N:1433;database=RelationalDB;user=sa;password=Password" --table SOURCE --target-dir /hive/warehouse/SQLDATA/ --as-textfile --split-by clientid --fields-terminated-by \t -m 1

Where:

jdbc:sqlserver://alexeik0N:1433 – source SQL Server

database=RelationalDB – database on the server which stores the source table and data

user=sa;password=Password – user credentials which Sqoop will use to connect to SQL Server

--table SOURCE – name of the source table in SQL Server. Sqoop doesn’t like spaces in the name!

--target-dir /hive/warehouse/SQLDATA/ – HDFS directory which will be created for the uploaded data

--split-by clientid – column of the table used to split work units

--fields-terminated-by \t – here you define how the columns are separated in the text file

-m 1 – number of map tasks

 

Important note: you may be concerned about submitting the password in the connection string. In that case you can use the -P option, which will prompt you to enter the password.

-- HIVE IMPORT APPEND with -P (console password)

sqoop.cmd import --connect "jdbc:sqlserver://sqlserver.database.windows.net:1433;database=someDB1gb" --username username@sqlserver -P --table lineitem --split-by l_orderkey --hive-import --append

3. Monitor execution of the script in the HADOOP Command Prompt

[screenshot: Sqoop import running in the Hadoop command prompt]

You can also check that the folder was created and the data arrived successfully there:

[screenshot: the SQLDATA folder created in HDFS]
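
The same check can be done from the Hadoop command prompt (with -m 1 the import usually produces a single part file; the exact file name may differ):

hadoop fs -ls /hive/warehouse/SQLDATA/
hadoop fs -cat /hive/warehouse/SQLDATA/part-m-00000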

And then, in the folder below SQLDATA, you find your text file:

[screenshot: the imported text file inside the SQLDATA folder]

That’s all!

Moving data from a Hive table in MS Hadoop into SQL Server using SSIS.

 

I am going to dedicate some blog posts to the Microsoft distribution of Hadoop. If you are interested in learning more about it, you can get more information from https://www.hadooponazure.com/

Some people believe that data magically appears in data warehouses or in the Hadoop cluster. Since I have a different opinion on that, I decided to learn how to move data between Hadoop and SQL, and the next couple of posts will be dedicated to this subject.

Here is a step-by-step description of how you can move data from a Hive table residing in the Hadoop cluster into SQL Server using an SSIS package.

I am using a Hadoop cluster at https://www.hadooponazure.com/ for storing the Hive table, and SQL Server 2012 RC0 as the relational database and ETL tool.

Preparing the environment

 

1. The Hadoop service at https://www.hadooponazure.com/ comes with a Hive sample table called HiveSampleTable. For simplicity I will use this table. If you want to create your own, you can follow the steps in the appendix, which describe how to create your own source table.

2. In order to access your data from ‘outside’ the Hadoop cluster, make sure that you have all the necessary ports opened. In the hadooponazure.com portal, make sure that you opened the ODBC port like this:

[screenshot: opening the ODBC port in the Hadoop on Azure portal]

3. You will need the Hive ODBC driver to connect to the Hive table in Hadoop. Instructions on how to install and configure this driver can be found here: How To Connect Excel to Hadoop on Azure via HiveODBC

 

Creating and configuring the SSIS package

 

4. Now it is time to switch to the SSIS environment and create the package for loading the Hive data into a SQL table. To do that, start SQL Server Data Tools and create a new package.

5. Right-click in the Connection Managers area and select New Connection.

[screenshot: adding a new connection in the Connection Managers area]

6. Choose the ODBC connection manager, click Add, and in the Configure ODBC Connection Manager dialog click New. In the new window, specify the data source name as well as your Hadoop credentials.

IMPORTANT: in the login information you should provide the account information you used when creating the cluster at hadooponazure.com

[screenshot: ODBC connection manager configuration]

7. Create a new Data Flow Task by dragging and dropping the Data Flow Task component from the SSIS Toolbox onto the Control Flow surface.

8. Drag and drop an ADO NET Source from the SSIS Toolbox onto the Data Flow surface.

9. Edit the ADO NET Source: specify the connection manager we just created and then choose HiveSampleTable from the drop-down list.

[screenshot: ADO NET Source editor with HiveSampleTable selected]

10. Assuming that you already created the target database on your SQL Server, create the destination in Connection Manager.

[screenshot: destination connection manager]

11. You could add some more transformations at this point, but we will skip that in this document. To complete the exercise, drag and drop an OLE DB Destination onto the Data Flow surface.

[screenshot: Data Flow with ADO NET Source and OLE DB Destination]

You can use SSIS functionality to create the destination table by clicking the New button, which will generate the CREATE statement for you.

12. To verify that everything works, you can try to debug the SSIS package you have just created:

[screenshot: SSIS package running in debug mode]

And see how the rows are moving

Done.

Appendix

Creating Hive table in Hadoop cluster:

1. Browse to the folder C:\Apps\dist\examples\hive\ in your Hadoop cluster installation and locate CreateSampleTable.hql

2. Start the Hadoop Command Shell

3. In the command shell, switch to Hive by running the hive command

[screenshot: the hive command prompt]

4. Copy the CREATE TABLE statement from CreateSampleTable.hql, paste it into the Hive command prompt, and execute the statement:

[screenshot: executing the CREATE TABLE statement in Hive]

5. After the execution completes successfully, load HiveSampleTable with the data. To do that, copy the LOAD DATA statement into the Hive command prompt and execute it:

[screenshot: executing the LOAD DATA statement in Hive]
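
For reference, the pair of statements in CreateSampleTable.hql looks roughly like this (a sketch reconstructed from the column list used for the TARGET table earlier; the data file path is hypothetical, and the file on your cluster is the authoritative version):

CREATE TABLE hivesampletable (
    clientid string,
    querytime string,
    market string,
    deviceplatform string,
    devicemake string,
    devicemodel string,
    state string,
    country string,
    querydwelltime double,
    sessionid bigint,
    sessionpagevieworder bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'C:/Apps/dist/examples/hive/HiveSampleData.txt' OVERWRITE INTO TABLE hivesampletable;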