Configuring an Oozie job with an HDInsight Hadoop cluster

Oozie is widely used in the Hadoop world as a workflow scheduler. Microsoft HDInsight supports Oozie out of the box and comes with all the necessary bits and examples to help you successfully configure Oozie in your HDInsight environment.

While playing around with HDInsight and trying to reproduce a customer problem, I had to launch a Hive action from an Oozie job. To keep things simple, I looked at the example Oozie job that executes some Hive commands and adapted it to my own needs. The path to the example, after you have made a remote connection to the head node, is:

PathToExample

 

If you look at the folder contents, you will see three files:

- job.properties
- script.q
- workflow.xml

Now let's look at what is going on in these sample files and where we need to modify them.

Job.properties

This file’s role is to set up the configuration environment for the Oozie job and it has the following configuration parameters:

nameNode=hdfs://localhost:8020                    # points to the default file system

jobTracker=localhost:8021                         # this is where our JobTracker service runs

queueName=default

examplesRoot=examples

oozie.use.system.libpath=true                     # tells Oozie whether to use the default system libraries

oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/hive

 

A classic scenario for the HDInsight Service in Azure is storing your data in the cloud and provisioning Hadoop clusters only when you need to run particular calculations. In other words, you scale your compute separately from your storage. HDInsight defaults to Windows Azure Storage (referenced as WASB from HDInsight), so your data is stored in blob storage and not in the HDFS file system.

Therefore, in order to tell Oozie about this fact, you need to change the line

nameNode=hdfs://localhost:8020

 

to something like

nameNode=wasb://container_name@storage_name.blob.core.windows.net

 

You may want to store the scripts that will be executed during the Oozie job, and other related configuration such as the workflow description, in separate folders. Therefore, the job should also specify where to find this information. In my case this line looks like the following:

 

oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs

 

Two more lines to pay attention to are:

 

oozie.use.system.libpath=true 

jobTracker=jobtrackerhost:9010                    # the default value was something like localhost:8021

 

 

 

So, the final version of the job.properties file will look like the following:

 

nameNode=wasb://container_name@storage_name.blob.core.windows.net

jobTracker=jobtrackerhost:9010

queueName=default

 

oozie.wf.application.path=wasb:///user/admin/examples/apps/ooziejobs

outputDir=ooziejobs-out

oozie.use.system.libpath=true

 

 

Now, let’s look at the workflow configuration XML file.

 

Workflow.xml

 

The structure of workflow.xml is very well described in the Oozie documentation. What we need to know is that this is where all the actions of the Oozie job are specified. In our example of executing a sample Hive job, the specified Hive script script.q will be executed by the action below:

 

    <action name="hive-node">

        <hive xmlns="uri:oozie:hive-action:0.2">

            <job-tracker>${jobTracker}</job-tracker>

            <name-node>${nameNode}</name-node>

            <prepare>

                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/hive"/>

                <mkdir path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data"/>

            </prepare>

            <configuration>

                <property>

                    <name>mapred.job.queue.name</name>

                    <value>${queueName}</value>

                </property>

            </configuration>

            <script>script.q</script> <!-- this is where our Hive script is called -->

            <param>INPUT=/user/${wf:user()}/${examplesRoot}/input-data/table</param>

            <param>OUTPUT=/user/${wf:user()}/${examplesRoot}/output-data/hive</param>

        </hive>

        <ok to="end"/>

        <error to="fail"/>
    </action>
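
For reference, the action above sits inside a complete workflow definition with start, kill, and end nodes. A minimal sketch of what the surrounding workflow.xml can look like is shown below; the workflow name and the kill-node message are placeholders of mine, not values taken from the sample:

<workflow-app xmlns="uri:oozie:workflow:0.2" name="hive-wf">
    <start to="hive-node"/>

    <!-- the hive-node action shown above goes here -->

    <kill name="fail">
        <message>Hive action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>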

 

Script.q

This is simply a script containing whatever Hive commands we want to execute. I created my own version of script.q. In my script I will create a Hive table called test by replacing the sample contents of script.q with the following:

 

CREATE EXTERNAL TABLE test (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';

INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM test;

Preparing and running the job

 

Now that we are done with the configuration, we need to complete a couple more steps. If you recall, in the configuration we set the application folder to wasb:///user/admin/examples/apps/ooziejobs

Let's upload the Hive script and the workflow to that location:

c:\apps\dist\hadoop-1.2.0.1.3.2.0-05>hadoop fs -copyFromLocal C:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\examples\apps\hive ///user/admin/examples/apps/ooziejobs

After this step is completed, we can attempt to run the Oozie job (finally) 🙂

 

oozie job -oozie http://localhost:11000/oozie -config  C:\ooziejobs\job.properties -run

 


If everything runs successfully, right after the command you just executed you will see a message indicating the ID of the Oozie job:

 

c:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win-distro\bin>oozie job -oozie http:/

/namenodehost:11000/oozie -config  C:\apps\dist\oozie-3.3.2.1.3.2.0-05\oozie-win

-distro\examples\apps\hive\job.properties -run

job: 0000000-140130144826022-oozie-hdp-W
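
You can also check the job status later from the command line using this ID; for example (assuming the same Oozie endpoint as above):

oozie job -oozie http://namenodehost:11000/oozie -info 0000000-140130144826022-oozie-hdp-W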

 

Also, in the MapReduce administrative console there will be an indication that the MapReduce job was submitted:

[Screenshot: MapReduce administrative console showing the submitted job]

 

 

In the job's "Name" column you will see the job ID, which matches the job ID we saw on the command line.

To test whether the job executed successfully, we can check the log, or take an even simpler route: run a Hive command and see if the table was created:

 

hive> show tables;

OK

hivesampletable

test

Time taken: 1.925 seconds, Fetched: 2 row(s)

hive>

 

All worked well! We took a sample Oozie script, customized it for our own needs, and executed it to create a Hive table.


 

Expanding the HDP Hadoop file system to Azure Blob Storage

If you are building a cloud-based Big Data solution, an HDInsight cluster in Windows Azure could be one of the first platforms you look at. Hortonworks HDP running on IaaS VMs is another option you may want to consider. Even though both solutions share the same code base underneath, a number of factors may influence the final choice; the key one is _how_ the cluster is going to be used: do you need it only for temporary calculations, do you depend on additional Hadoop-related tools and therefore on Java, and so on.

I am not going to focus on those questions here. What is interesting to me at the moment is that HDI relies on Azure Storage for saving data. Azure Storage is cheap and scales very well. If you look at an HDP cluster in Azure VMs, it also uses Azure Storage: all the VHDs that you attach as data drives to your VMs are stored in Azure Blob storage. The largest Azure VM can support up to 16 drives of 1 TB each, so you can easily get 16 TB of data stored in your Hadoop cluster. What do you do if you need more?

Disclaimer: the following configuration works, but it is not yet officially supported by either Microsoft or Hortonworks. If you are going to use it, you do so at your own risk!

At the very early stages of HDI validation, Cindy Gross published instructions on how to connect Azure Blob storage (asv, the Azure Storage Vault) to an HDI cluster.

So, assuming you have an HDP cluster (either in the cloud or on-premises) up and running and an Azure Storage account created, let's try to make these two technologies friends.

Following Cindy’s recommendation let’s look at the core-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">

<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>hdfs://HDPServer:8020</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

Currently we have only the default HDFS file system. In my solution I want ASV to be my default file system, so I need to change this part of core-site.xml to re-point Hadoop to Azure Storage.

In order to communicate successfully with Azure Storage, Hadoop needs to know:

- the storage account to connect to
- the security key to connect with

Let’s add this information to the configuration file. First, I will change “default file system” config already mentioned above:

<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>asv://hdp@YourStorageAccount.blob.core.windows.net</value>
<description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

Now, the next property doesn't exist in core-site.xml yet; you need to add it manually:

<property>
<name>fs.azure.account.key.YourStorageAccount.blob.core.windows.net</name>
<value>The_security_key_value </value>
</property>

Now, let's restart the Hadoop cluster and try to copy something from a local folder into the asv:// path:

F:\hdp\hadoop>hadoop fs -copyFromLocal c:/Test asv://
F:\hdp\hadoop>hadoop fs -ls asv:/
Found 3 items
drwxr-xr-x - HDPAdmin supergroup 0 2014-01-09 18:18 /Test
drwxr-xr-x - hadoop supergroup 0 2013-11-14 10:04 /mapred

You can also use something like Azure Storage Explorer to see whether the data was copied into the Azure blob container.

Now, I also want to keep some data in the HDFS file system. However, if I try to launch the Hadoop NameNode console, the browser fails to connect to the head node. Well, no wonder: I have changed the default file system configuration and have not told the cluster anything about how I want HDFS to be treated.

There is one more configuration setting I need to add to core-site.xml to fix it:

<property>
  <name>dfs.namenode.rpc-address</name>
  <value>hdfs://namenodehost:9000</value>
</property>

Now, after restarting Hadoop again, I can also try copying files into HDFS:

F:\hdp\hadoop>hadoop fs -copyFromLocal c:/Test hdfs://HDPServer:9000/
F:\hdp\hadoop>hadoop fs -ls hdfs://HDPServer:9000/
Found 7 items

drwxr-xr-x - HDPAdmin supergroup 0 2014-01-09 16:29 /Test
drwxr-xr-x - hadoop supergroup 0 2013-08-23 16:22 /apps
drwxr-xr-x - hadoop supergroup 0 2013-09-04 13:28 /hive
drwxr-xr-x - hadoop supergroup 0 2013-12-17 14:36 /mapred
drwxr-xr-x - HDPAdmin supergroup 0 2013-10-09 16:32 /tmp
drwxr-xr-x - HDPAdmin supergroup 0 2013-09-03 16:40 /tpch_1gb
drwxr-xr-x - HDPAdmin supergroup 0 2013-10-09 16:27 /user

So, everything works, and even my Hadoop NameNode console is back again.

The very last question left: what if I need to add more storage accounts? Just follow the steps Cindy described in her blog:

- Add the information that associates the key value with your default storage account:

<property>
  <name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
  <value>YOURActualStorageKeyValue</value>
</property>

- Add any additional storage accounts you plan to access:

<property>
  <name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
  <value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>
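
Once the additional key is in place, you can address containers in that account explicitly by spelling out both the container and the account in the asv URI. A hypothetical example (the container name is a placeholder of mine):

hadoop fs -ls asv://mycontainer@YOUR_SECOND_StorageAccount.blob.core.windows.net/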

Have fun!

Setting up a multi-node HDP cluster

As you may already be aware, Hortonworks announced support for Hadoop clusters on the Windows platform. In the project we are involved in at the moment, we were lucky enough to play with a multi-node HDP cluster. Personally, I felt as if I were back at the beginning of the '90s, with all the configuration and so much that has to run from the command line. However, HDP is really a lot of fun and easy to set up and run. Here is a list of things you may need to know in order to save time setting up HDP in your environment:

1. Even if it sounds very obvious, follow the documentation. The software listed in the requirements has to be installed, and parameters like JAVA_HOME have to be configured.

2. Pay attention to the firewall ports which have to be opened! HDP needs a lot of ports; in some cases I opened a whole range instead of individual ports (for example, 50000-60000 instead of each single port in the 50K range), as in the example below.
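
A hypothetical example of opening such a range with the built-in Windows Firewall (the rule name and the exact port range are placeholders of mine; use the ports your HDP layout actually needs):

netsh advfirewall firewall add rule name="HDP 50K range" dir=in action=allow protocol=TCP localport=50000-60000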

3. Leave only one network card enabled and disable the rest that you have on your box.

4. Disable IPv6. You can do it by executing the following commands for each interface:

netsh interface teredo set state disabled
netsh interface 6to4 set state disabled
netsh interface isatap set state disabled

5. In clusterproperties.txt, make sure you specify the paths for setting up HDP. Pay attention: a path with subfolders is not allowed, and no spaces in the folder name are allowed. The documentation is misleading here. Example of correct entries:

#Log directory

HDP_LOG_DIR=C:\HDP_LOG

#Data directory

HDP_DATA_DIR=C:\HDP_DATA

6. In the same configuration file, in the #Hosts section, a full UNC path will not work! Use short names instead, like below:

NAMENODE_HOST=server02

7. Edit the hosts file and add the node names with their IP addresses there, as in the sketch below.
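
A hypothetical hosts file entry (on Windows the file lives under %SystemRoot%\System32\drivers\etc\hosts; the IP address below is a placeholder):

10.0.0.12    server02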

8. Now you can run the installation. Example of the setup string:

C:\Windows\system32>msiexec /i "C:\HDP_SETUP\hdp-1.1.0-160.winpkg.msi" /lv "C:\HDP_SETUP\installer.log" HDP_LAYOUT="c:\HDP_SETUP\clusterproperties.txt" HDP_DIR="C:\hdp" DESTROY_DATA="Yes"

You will need to run this script on every node of the cluster!

9. In case setup failed, make sure you run the uninstall command before trying to re-run the installation process:

msiexec /x "C:\HDP_SETUP\hdp-1.1.0-160.winpkg.msi" /lv "C:\HDP_SETUP\installer.log" DESTROY_DATA="no"

10. Also, before running setup you will need to enable Remote Scripting on every node. It looks like this doesn't work correctly, so even if setup was successful, you will see something like this in the error log:

CAQuietExec:  WINPKG: rd /s /q G:\HadoopInstallFiles\HadoopPackages\hdp-1.1.0-winpkg\resources\hadoop-1.1.0-SNAPSHOT.winpkg

CAQuietExec:  powershell.exe : powershell.exe : powershell.exe : Stop-Service : Cannot find

CAQuietExec:  any service with

CAQuietExec:  At G:\HadoopInstallFiles\HadoopSetupTools\winpkg.ps1:118 char:9

CAQuietExec:  +     $out = powershell.exe -InputFormat none -Command "$command" 2>&1

CAQuietExec:  +            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAQuietExec:      + CategoryInfo          : NotSpecified: (powershell.exe …y service with 

CAQuietExec:     :String) , RemoteException

CAQuietExec:      + FullyQualifiedErrorId : NativeCommandError

11. On every node, manually start the services by calling start_local_hdp_services.cmd.

12. If everything is done right, you will be able to run a smoke test with no errors and enjoy the HDP cluster.

Using SSIS in Azure solutions

We finished work on these papers some time before the beginning of 2013; however, I only just found time to link to them from my blog.

Here we go:

Tips Tricks and Best Practices: SSIS Operational and Tuning Guide

and

SSIS Tips Tricks and Best Practices: SSIS for Azure and Hybrid Data Movement

Enjoy reading 🙂

Working around “Cannot start the columnstore index build because…”

 

Recently I ran into an issue while trying to improve query performance by building a columnstore index on one of the tables. This is the message I got while building the columnstore index:

Cannot start the columnstore index build because it requires at least 2458240 KB, while the maximum memory grant is limited to 854208 KB per query in workload group ‘default’ (2) and resource pool ‘default’ (2). Retry after modifying columnstore index to contain fewer columns, or after increasing the maximum memory grant limit with Resource Governor.

And this is very strange: I have 8 GB of RAM on the box, with 7 GB dedicated to SQL Server, and nothing else is running on this computer. If we look at the error message, we see that the query is asking for roughly 2.4 GB. So there should be enough memory to build the index successfully.

Well, the error message tells us quite directly to check the Resource Governor settings on the server. You can use SQL Server Management Studio for this, or a script like the following:

sp_configure 'show advanced options', 1

go

reconfigure

go

sp_configure

go

select * from sys.resource_governor_resource_pools

select * from sys.resource_governor_workload_groups

Running those queries, I got the following output:

Query pool configuration:

pool_id  name      min_cpu_percent  max_cpu_percent  min_memory_percent  max_memory_percent  cap_cpu_percent
1        internal  0                100              0                   100                 100
2        default   0                100              0                   100                 100

Workload groups configuration:

group_id  name      importance  request_max_memory_grant_percent  request_max_cpu_time_sec  pool_id
1         internal  Medium      25                                0                         1
2         default   Medium      100                               0                         2

So, according to the output, the default pool can hope for all of the memory available to SQL Server, but for whatever reason it is not getting it. What is really going on on the box can be checked with DBCC MEMORYSTATUS.

Alternatively, and this is what I did, it is possible to force the server not to release memory by setting a minimum memory value (I set it to 45% of the memory dedicated to the server). And voila! The columnstore index was built successfully!
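
For completeness, the other route the error message itself suggests is to raise the maximum memory grant limit through Resource Governor. A minimal sketch, assuming you want to allow larger grants for the default workload group (the 50 percent value is an arbitrary example, not the setting I used):

-- allow larger per-query memory grants in the default workload group
ALTER WORKLOAD GROUP [default]
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 50);
GO

-- apply the new Resource Governor configuration
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO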

The conference season!

It seems to be a great tradition to run SQL events once the vacation season is over, when people come back from vacation and try to get back up to speed.

The European SQL CAT team (and I, as part of that team) will be speaking at some of these conferences on different subjects, covering SQL in the cloud, ETL tuning, and ROLAP solutions.

My schedule looks as follows:

19-21 September: SQL U Summit Bratislava. I have never been to Bratislava and am looking forward to meeting people there and learning more about the city!

24-25 September: SQLTuneIn in Zagreb. Looking forward to seeing familiar faces there!

1-3 October: SQL RALLY Nordic. Another great conference; I am looking forward to meeting a lot of old friends there.

A nice schedule, but it looks like I will not have much time for visiting Oktoberfest this year 🙂

Big Data on Azure tutorials and scenarios

Microsoft released a couple of tutorials to help you understand Hadoop and Hadoop on Azure. Some interesting information was also published covering typical scenarios, such as storing structured and unstructured data and extracting BI information.

Hadoop on Windows Azure – Working With Data

Big Data Scenarios

Enjoy reading!