Expanding HDP Hadoop file system to Azure Blob Storage.

If you are building a cloud-based big data solution, an HDInsight (HDI) cluster in Windows Azure is likely to be one of the first platforms you consider. Hortonworks HDP running on IaaS VMs is another option. Both solutions share the same code base, so a number of other factors may influence the final choice, but the key one is _how_ the cluster is going to be used: do you need it only for temporary calculations, do you depend on additional Hadoop-related tools and therefore on Java, and so on.

I am not going to focus on those questions here. What interests me at the moment is that HDI relies on Azure Storage for saving data. Azure Storage is cheap and scales very well. An HDP cluster running on Azure VMs also uses Azure Storage: all the VHDs that you attach as data drives to your VMs are stored in Azure Blob storage. The largest Azure VM supports up to 16 data drives of 1 TB each, so you can easily get 16 TB of data stored in your Hadoop cluster. But what if I need more?

Disclaimer: the following configuration works, but it is not yet officially supported by either Microsoft or Hortonworks. If you use it, you do so at your own risk!

In the very early stages of HDI validation, Cindy Gross published instructions on how to connect Azure Blob storage (ASV, Azure Storage Vault) to an HDI cluster.

So, assuming you have an HDP cluster (either in the cloud or on-prem) up and running and an Azure Storage account created, let’s try to make these two technologies friends.

Following Cindy’s recommendations, let’s look at core-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">

<!-- cluster variant -->
<description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>

Currently we have only the default HDFS file system. In my solution I want ASV to be the default file system, so I need to change this part of core-site.xml to point Hadoop at Azure Storage.

To communicate with Azure Storage successfully, Hadoop needs to know:

  • The storage account to connect to
  • The security key to connect with

Let’s add this information to the configuration file. First, I will change the “default file system” setting mentioned above:

<!-- cluster variant -->
<description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>

The next property doesn’t exist in core-site.xml by default, so you need to add it manually:

<value>The_security_key_value</value>
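Because the listings above show only fragments, here is a minimal sketch of how the two entries might look when written out in full. It assumes the Hadoop 1.x property name fs.default.name and the fs.azure.account.key.<account>.blob.core.windows.net naming convention for the key. YourContainer and YourStorageAccount are placeholders, not values from this setup.

```xml
<!-- Sketch only: the property names are assumptions, and YourContainer /
     YourStorageAccount are hypothetical placeholders. -->
<property>
  <name>fs.default.name</name>
  <!-- cluster variant -->
  <value>asv://YourContainer@YourStorageAccount.blob.core.windows.net</value>
  <description>The name of the default file system.</description>
</property>

<!-- Access key for the storage account named in the URI above -->
<property>
  <name>fs.azure.account.key.YourStorageAccount.blob.core.windows.net</name>
  <value>The_security_key_value</value>
</property>
```

Both entries would go inside the existing <configuration> element of core-site.xml.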

Now let’s restart the Hadoop cluster and try to copy something from a local folder to the asv:// path:

F:\hdp\hadoop>hadoop fs -copyFromLocal c:/Test asv://
F:\hdp\hadoop>hadoop fs -ls asv:/
Found 3 items
drwxr-xr-x - HDPAdmin supergroup 0 2014-01-09 18:18 /Test
drwxr-xr-x - hadoop supergroup 0 2013-11-14 10:04 /mapred

You can also use a tool such as Azure Storage Explorer to check that the data was copied to Azure Blob storage.

Now, I also want to keep some data in the HDFS file system. However, if I try to open the Hadoop NameNode console, the browser fails to connect to the head node. No wonder: I changed the default file system configuration and have not told the cluster how I want HDFS to be treated.

There is one more setting that I need to add to core-site.xml to fix it:
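The listing for this setting is not shown here. A sketch of what such an entry could look like follows, assuming the setting in question is the NameNode RPC address. The property name dfs.namenode.rpc-address and the URI form of the value are assumptions; the host and port are the ones used in the commands in this post.

```xml
<!-- Sketch only: the property name and value format are assumptions;
     HDPServer:9000 is the NameNode address used in the commands in this post. -->
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>hdfs://HDPServer:9000</value>
</property>
```

The idea is that once the default file system points at ASV, the NameNode can no longer take its own address from the default file system setting, so it has to be given explicitly.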


Now I restart Hadoop again and try copying files into HDFS:

F:\hdp\hadoop>hadoop fs -copyFromLocal c:/Test hdfs://HDPServer:9000/
F:\hdp\hadoop>hadoop fs -ls hdfs://HDPServer:9000/
Found 7 items

drwxr-xr-x - HDPAdmin supergroup 0 2014-01-09 16:29 /Test
drwxr-xr-x - hadoop supergroup 0 2013-08-23 16:22 /apps
drwxr-xr-x - hadoop supergroup 0 2013-09-04 13:28 /hive
drwxr-xr-x - hadoop supergroup 0 2013-12-17 14:36 /mapred
drwxr-xr-x - HDPAdmin supergroup 0 2013-10-09 16:32 /tmp
drwxr-xr-x - HDPAdmin supergroup 0 2013-09-03 16:40 /tpch_1gb
drwxr-xr-x - HDPAdmin supergroup 0 2013-10-09 16:27 /user

So everything works, and even my Hadoop NameNode console is back.

One last question remains: what if I need to add more storage accounts? Just follow the steps Cindy describes in her blog:

  • Add the information that associates the key value with your default storage account


  • Add any additional storage accounts you plan to access
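As a rough illustration of these two steps, each storage account would get its own key entry in core-site.xml. The fs.azure.account.key.<account>.blob.core.windows.net naming convention is an assumption here, and the account names and key values are placeholders.

```xml
<!-- Sketch only: account names and key values are hypothetical placeholders. -->

<!-- Key for the default storage account -->
<property>
  <name>fs.azure.account.key.YourStorageAccount.blob.core.windows.net</name>
  <value>The_security_key_value</value>
</property>

<!-- One more entry per additional storage account -->
<property>
  <name>fs.azure.account.key.AnotherStorageAccount.blob.core.windows.net</name>
  <value>Another_security_key_value</value>
</property>
```

Data in an additional account would then be addressed with a fully qualified path such as asv://SomeContainer@AnotherStorageAccount.blob.core.windows.net/, where the container and account names are again placeholders.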


Have fun!