The basics of data warehouse modeling


Working on many different large-scale data warehousing projects, we have observed some common patterns that have absolutely nothing to do with SQL Server's technical ability to scale and perform, yet have a significant impact on whether the projects succeed or fail.


‘Building Blocks’ and MBAs

We all live in a world where (especially in big companies) many great business managers with MBA degrees define the internal processes. One of the key concepts here is that all processes can be represented as building blocks, and that you can construct and optimize your business and your company by moving those blocks around, adding new ones, and removing old ones.

This leads to a situation where the business and IT departments are connected through an absolute minimum of interfaces, and very often IT has no idea what the business is doing. The role of IT drifts toward being a ‘procurement department for computers and printers’, and when the business becomes big enough and demands a large and very specialized solution, the people in IT face a huge challenge: what should we offer the business to address their demand?

Here they do the best they have learned to do: start looking for ‘building blocks’ again. Speaking of data warehouses, which blocks can you pick to build one? Of course: ‘industrial models’ and hardware that can support the scale of the project.


Hardware sizing

It is clear that the hardware should simply be big. The storage should hold all the data the business will need in the next few years, and the processors should be so fast that their queries will fly! If money is not a problem, then the biggest storage, along with the server with the most cores ever, will be ordered. If IT is short on money, they will start calling the vendors around them, requesting proposals for their size of data warehouse.



Industrial data models

Now that the hardware is purchased and will probably be delivered very soon, the next question becomes: how should we model our data warehouse? Well, if people are familiar with the Kimball approach, they know what the next steps should be. But what do we do when we have no idea how to deal with something new that we have never experienced? We look at what others do! Therefore, the next activity those MBA IT managers will start is looking for existing data models, hoping that if somebody else used them, they will fit here as well.


Is it good or bad to buy a data model for the warehouse?

Actually, there is nothing wrong with data models themselves, if you know how to use them. There are a few pros and cons to data models developed by third-party companies:


1. You can consider those data models a “super-inventory”, which will give you an overview of the types of data your business works with.


2. Very often those models will not fit your business processes 100%, and you will need to redesign them to adjust to your business demands. Sometimes redesign is a more complicated process than building from scratch. Not to mention that the amount of information you can recycle for your needs is somewhere between 15% and 25%.

3. Those models often do not tell you anything about the physical data model that you will actually build in your data warehouse.

One of the ‘best’ examples of how it should not work I saw nearly a year ago, when a customer was trying to build the ULTIMATE data warehouse (they called it the ‘Enterprise Data Warehouse’), which was intended to store all the data ever produced in the company, with a predicted size of nearly 100TB. Just in case. What if you need to generate some reports and the data is not in the EDW? But before starting to build this ULTIMATE EDW, they decided to invest a few hundred thousand euros in buying a data model from a well-known vendor specialized in producing those things.

For many months, specially trained people moved from room to room interviewing the business. One day THE LOGICAL DATA MODEL was presented to the management. It was presented as a printout: several meters long and full of figures a few centimeters wide. You had to be a scientist to understand this diagram and what you could do with this model.

This wouldn’t have been a big problem if, at nearly the same time, we hadn’t been asked to help optimize performance at one of the banks, where the same vendor had built the model and then built a 1:1 physical model out of it: a highly normalized schema. The performance we saw there was an absolute disaster: loading 50 GB of test data was taking 36 hours!

Is this the model’s fault? No! But this is what we see quite a lot: people simply do not think about the next step they have to take after buying a logical model. They fail to understand how one model is mapped onto the other.

How do we approach those problems in our projects?


A different approach

Time is money!

What is typically the key problem that the business tries to solve in the first place? It is all about money! Usually, there are a number of information sources (data sources) that help the business earn money if their information is delivered on time.


If we speak about large data warehouses, we can often see that the most important challenge is moving data from the data sources into the data warehouse and making it available to the clients, such as direct queries or reporting and analysis engines.

Understanding the time window that satisfies the business demand will then give you the basic information you need for the architectural decisions that you can apply.
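As a back-of-the-envelope sketch (the volume and window below are made-up numbers, not from any particular project), the business time window translates directly into the minimum sustained throughput the load pipeline must deliver:

```python
def required_throughput_mb_s(data_gb: float, window_hours: float) -> float:
    """Minimum sustained throughput (MB/sec) needed to land
    `data_gb` of new data within the business time window."""
    return (data_gb * 1024) / (window_hours * 3600)

# Example: 500 GB of new data must be query-ready within 4 hours.
print(round(required_throughput_mb_s(500, 4), 1))  # ~35.6 MB/sec
```

This single number is what you then compare against the measured limits of your hardware and ETL components.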

Know your limits

It is extremely important to understand the limits of the technologies that will be used to build a large data warehouse. Obviously, you won’t expect from a 10 Gb network card a throughput higher than 10 Gb/sec, will you?

We often see people get confused when correlating the compute power of a server, the number of CPUs it has, and how fast a workflow can run on that box. I have observed situations where customers replaced an old 4-core server with a new 32-core box and saw significant workflow performance degradation! Well, if your workload was using a single core only and the old server had a much higher CPU clock speed, it should not surprise you that the workload slowed down!

Analyze and find the limits of the hardware components that you use: e.g. one CPU core can process approximately 200 MB/sec, and a 10 Gb network card will not deliver more than 10 Gb/sec of throughput. Also measure and create baselines for the process level above the hardware: what is the highest ETL speed you can reach? How fast can you build an index?

We know, as an example, that most of the data transfer operations in SQL Server that rely on the BULK API are single-stream and give us somewhere around 25 MB/sec of write throughput and ~35 MB/sec of read performance.
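Turning such a baseline into a concrete expectation is simple arithmetic; here is a sketch using the rough single-stream write figure above (your own measured numbers will differ):

```python
def single_stream_load_hours(data_gb: float, write_mb_s: float = 25.0) -> float:
    """Hours needed to load `data_gb` through one BULK API stream
    at the ~25 MB/sec single-stream write throughput noted above."""
    return (data_gb * 1024) / write_mb_s / 3600

# Loading 1 TB over a single 25 MB/sec stream:
print(round(single_stream_load_hours(1024), 1))  # ~11.7 hours
```

If that number does not fit into the business time window, you know before buying any hardware that a single-stream design will not work.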


Not enough speed? Run parallel

Now we know the limits, but we see that within the restrictions we recognized, we can’t meet the business requirements. Is there any way to scale the throughput of the system? Simply multiply the number of ‘streams’ so that together they deliver the performance we are looking for. If reading data from the data source at 35 MB/sec is too slow, add more ‘read’ components. If processing data on a single core doesn’t give the necessary speed, make sure you design your workflow so that it runs on multiple cores.
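The sizing arithmetic behind this is a one-liner; a minimal sketch (the per-stream figure is the rough read limit discussed above, not a guarantee):

```python
import math

def streams_needed(target_mb_s: float, per_stream_mb_s: float) -> int:
    """Number of parallel streams required to reach the target
    throughput, given the measured per-stream limit."""
    return math.ceil(target_mb_s / per_stream_mb_s)

# Reading at ~35 MB/sec per stream, while the window demands 200 MB/sec:
print(streams_needed(200, 35))  # 6 parallel read streams
```

In practice you would then verify that six concurrent streams do not saturate some other component (network, storage, CPU) before settling on the design.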


Rough model

If we agreed on using the Kimball approach to build our data warehouse model, we had better follow it precisely. Just to refresh the memory, the very first step in this approach is to identify the business problem you are about to solve: which questions the business really wants to ask.

Sit together with the business departments whose data you will reflect in your data warehouse and sketch a rough logical data model, like this:


This will give you an idea of which fact and dimension tables you will have in the data warehouse; later, these will be reflected in a schema like this:



Hardware sizing

So, now let’s talk about hardware sizing. Microsoft, together with some other vendors, developed a data warehouse reference architecture called Fast Track. This reference architecture reflects years of experience building large data warehouses and provides a balanced hardware and software configuration at a scale of nearly 100TB. One thing to notice, though: you need to make sure that your data warehouse’s characteristics comply with what we usually understand as a ‘classical data warehouse’:

· Star schema

· Large amount of scan operations

· Write and read operations dominate over updates (updates happen mostly on the dimension tables)

What if you need to scale above 100TB? You may then want to consider Parallel Data Warehouse, which (as of now) can host up to 500TB of data.


Physical Data Model for scale

Now that we understand which hardware will be used and how we will model the data warehouse, and we know the limits and baselines for the ETL and other processes, it is time to think about the architectural decisions that will help address the business challenges. Very often, one of the biggest customer concerns we hear is something like: ‘My business grows; I need to process more and more data in a short period of time and make it available to the data consumers! What options do I have?’

This tells us immediately that loading in a single stream at 35 MB/sec might actually not work. So we need more speed, which means we need more streams. Our target fact table, especially if the business is interested in working with historical data (this is why we often build a data warehouse, correct?), will be a partitioned table. Often, partitioning by time (day, month, year) may not give you enough performance when populating the table, especially if the source data allows a better distribution by some other key, such as a data source ID, customer ID, or area code.

In this situation you may think of ‘two-dimensional partitioning’. The base principle is that you create partitioned fact tables optimized for data loading (let’s say one partition is loaded at a time, and each partition corresponds to an area code), and each fact table carries a specific time range, such as a day, a month, or anything else, often defined by the time range the business needs to analyze. Let’s say you have year ranges; then, in order to give the business an outlook on what was happening from a historical perspective, you will need to combine all those partitioned fact tables in a VIEW, which is then exposed to the business users or applications as the single object where all the data appears to be stored.
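As an illustration only (the table and view names here are made up for this sketch), the per-range fact tables are stitched together by a UNION ALL view; a small helper shows the shape of the DDL such a design produces:

```python
def build_union_view(view_name: str, fact_tables: list[str]) -> str:
    """Generate the CREATE VIEW statement that exposes a set of
    per-range fact tables (e.g. one per year) as a single object."""
    union = "\nUNION ALL\n".join(f"SELECT * FROM {t}" for t in fact_tables)
    return f"CREATE VIEW {view_name} AS\n{union};"

# Hypothetical per-year fact tables behind the historical view:
tables = [f"dbo.FactSales_{year}" for year in (2010, 2011, 2012)]
print(build_union_view("dbo.FactSalesAll", tables))
```

In a real system you would list the columns explicitly and add constraints that allow the optimizer to eliminate member tables; this only sketches the structure of the view.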


I am intentionally not diving into a deeper discussion of the particular possible designs and their benefits, since the goal of this blog was to get on the same page about the approach to building data warehouses.

Thomas, Marcel and I have already posted, and will definitely be writing more, on some particular techniques (I see that Marcel has started writing a series of blogs on the ETL scale techniques we presented together at SQL Rally Nordic 2012). Stay tuned, and have a lot of fun developing data warehouses with SQL Server!


Appliances and reference architectures: another proof of the Russian wisdom ‘Trust, but verify’

Together with my friend Marcel, I worked on validating a new data warehouse concept onsite at one of our big and very well-known customers. Even though we got a nicely balanced, pre-configured hardware architecture, we learned again that sometimes it is better to double-check whether everything runs as you expect. Generally speaking, I would recommend this as one of the mandatory steps whenever you are about to tune your system anyway: first make sure you get the performance you expect from the hardware, then think of any further optimizations.

Here is the link to the blog published at SQLCAT.COM:  Fast Track: improving performance through correct LUN Mapping and Storage Enclosure configuration

Using SQOOP for moving data between HADOOP and SQL Server


Following the post on how to load data from HADOOP into SQL Server, here is an example of how you can use a different technology for transferring data: SQOOP.


Move data from a HIVE table in HADOOP into SQL.


1. The Hadoop service comes with a Hive sample table called HiveSampleTable. For simplicity I will use this table. If you want to create your own, you can follow the steps in the appendix, which describe how to create your own source table.

2. On the SQL Server side, we need to make sure we have a table that will receive the data.

Here is the T-SQL syntax for the TARGET table (the CREATE TABLE wrapper is reconstructed around the column list):

CREATE TABLE [dbo].[TARGET] (
    [clientid] [nvarchar](334) NULL,
    [querytime] [nvarchar](334) NULL,
    [market] [nvarchar](334) NULL,
    [deviceplatform] [nvarchar](334) NULL,
    [devicemake] [nvarchar](334) NULL,
    [devicemodel] [nvarchar](334) NULL,
    [state] [nvarchar](334) NULL,
    [country] [nvarchar](334) NULL,
    [querydwelltime] [float] NULL,
    [sessionid] [bigint] NULL,
    [sessionpagevieworder] [bigint] NULL
)


3. You also need to make sure that you can move data in and out of HADOOP and that the appropriate ODBC ports are opened:


4. Open the Hadoop command prompt and type the following command to initiate the SQOOP data movement:

c:\Apps\dist\sqoop>bin\sqoop.cmd export --connect "jdbc:sqlserver://alexeik0N:1433;database=RelationalDB;user=sa;password=Password" --table TARGET --export-dir /hive/warehouse/hivesampletable --input-fields-terminated-by "\t"

where:

jdbc:sqlserver://alexeik0N:1433 – your destination SQL Server

database=RelationalDB – the destination database

user=sa;password=Password – credentials that allow SQOOP to connect to the database

--table TARGET – the destination table. SQOOP doesn’t like spaces in the name!

--export-dir /hive/warehouse/hivesampletable – the Hive table directory

--input-fields-terminated-by "\t" – the delimiter symbol. In this particular case we say that the original table uses the horizontal TAB delimiter

5. Execute this command and monitor the execution in the Hadoop Command Prompt:


Verify that the records were exported

6. Now on the SQL Server side make sure that data successfully landed in the TARGET table:


Move data from SQL Table into HADOOP table.


1. Select the SQL Server table that will be used as the source. Here is an example of the table that I will be using:

SELECT [clientid]
      ,[querytime]
      ,[market]
      ,[deviceplatform]
      ,[devicemake]
      ,[devicemodel]
      ,[state]
      ,[country]
      ,[querydwelltime]
      ,[sessionid]
      ,[sessionpagevieworder]
  FROM [RelationalDB].[dbo].[SOURCE]

2. Now, let’s write the SQOOP command to load data from SQL Server into HDFS:

c:\Apps\dist\sqoop\bin>sqoop.cmd import --connect "jdbc:sqlserver://alexeik0N:1433;database=RelationalDB;user=sa;password=Password" --table SOURCE --target-dir /hive/warehouse/SQLDATA/ --as-textfile --split-by clientid --fields-terminated-by \t -m 1


jdbc:sqlserver://alexeik0N:1433 – the source SQL Server

database=RelationalDB – the database on the server which stores the source table and data

user=sa;password=Password – the user credentials which SQOOP will use to connect to SQL Server

--table SOURCE – the name of the source table in SQL Server. SQOOP doesn’t like spaces in the name!

--target-dir /hive/warehouse/SQLDATA/ – the HDFS directory which will be created for the uploaded data

--split-by clientid – the column of the table used to split work units

--fields-terminated-by \t – here you can define how the columns are separated in the text file

-m 1 – the number of MAP tasks


Important note: you may be concerned about submitting the password in the connection string. In that case you can use the option -P, which will prompt you to enter the password.

-- HIVE IMPORT APPEND with -P (console password)

sqoop.cmd import --connect "jdbc:sqlserver://;database=someDB1gb" --username username@sqlserver -P --table lineitem --split-by l_orderkey --hive-import --append

3. Monitor execution of the script in the HADOOP Command Prompt


You can also check that the folder was created and the data arrived successfully there:


And then, in the folder below SQLDATA, you find your text file:


That’s all!

Moving data from a Hive table in MS Hadoop into SQL Server using an SSIS package.


I am going to dedicate some blogs to the Microsoft distribution of HADOOP. If you are interested in learning more about it, you can get more information from the

Some people believe that data magically appears in data warehouses or in a HADOOP cluster. Since I have a different opinion on that, I simply decided to learn how I can move data between HADOOP and SQL, and the next couple of posts will be dedicated to this subject.

Here is a step-by-step description of how you can move data from a Hive table residing in a HADOOP cluster into SQL Server using an SSIS package.

I am using here a HADOOP cluster storing the HIVE table, and SQL Server 2012 RC0 as the relational database and ETL tool.

Preparing environment


1. The Hadoop service comes with a Hive sample table called HiveSampleTable. For simplicity I will use this table. If you want to create your own, you can follow the steps in the appendix, which describe how to create your own source table.

2. In order to access your data from ‘outside’ the Hadoop cluster, make sure that you have all the necessary ports opened. In particular, make sure that you opened the ODBC port, like this:


3. You will need the Hive ODBC driver to connect to the Hive table in Hadoop. Instructions on how to install and configure this driver can be found here: How To Connect Excel to Hadoop on Azure via HiveODBC


Creating and configuring SSIS Package


4. Now it is time to switch to the SSIS environment and create the package for loading Hive data into a SQL table. To do that, start SQL Server Data Tools and create a new package.

5. Right-click in the Connection Manager field and select New Connection.


6. Choose the ODBC Connection Manager, click Add, and in the Configure ODBC Connection Manager dialog click New. In the new window, specify the data source name as well as your HADOOP credentials.

IMPORTANT: in the login information you should provide the account information which you used while creating the cluster.


7. Create new Data Flow Task by dragging and dropping the Data Flow Task component from the SSIS Toolbox into the Control Flow field.

8. Drag and drop the ADO NET Source from the SSIS Toolbox into the Data Flow field.

9. Edit the ADO NET Source: specify the connection manager that we just created and then choose HiveSampleTable from the drop-down list.


10. Assuming that you have already created the target database on your SQL Server, create the destination in the Connection Manager.


11. You could then add some more transformations, but we will skip that in this document. To complete the exercise, drag and drop the OLE DB Destination into the Data Flow field.


You can use SSIS functionality to create the destination table by clicking the New button; this will generate the CREATE statement for you.

12. To verify that everything works, you can try to debug SSIS package you have just created:


And see how the rows are moving



Creating Hive table in Hadoop cluster:

1. Browse to the file C:\Apps\dist\examples\hive\CreateSampleTable.hql in your Hadoop cluster installation

2. Start the Hadoop Command Shell

3. In the command shell switch to hive


4. Copy the CREATE TABLE statement from CreateSampleTable.hql, paste it into the hive command prompt, and execute the statement:


5. After execution completes successfully, load HiveSampleTable with the data. To do that, copy the LOAD DATA statement into the hive command prompt and execute it:


Let’s get started…


I was thinking about how I could introduce my blog, and couldn’t find anything better than to steal part of the speaker bio used when I am invited to present something:

Alexei has been with Microsoft for more than 10 years. He started with SQL Server by collecting globalization requirements from European countries during the development of SQL Server 2005. For about 4 years he has been a member of the European part of the SQL Customer Advisory Team. He focuses on large-scale data warehouses, such as those of well-known European telecommunication organizations, and on architecture principles that are well reflected in the Fast Track Reference Architecture guide. He has published articles on different database topics. Alexei graduated from the Moscow Engineering Physics Institute.

As is clearly stated above, my main focus is on anything related to large SQL data warehouses; right now I am also trying to get up to speed with the Microsoft distribution of HADOOP.
What am I going to post here? I think I will be focusing on the subjects where I found myself thinking ‘it would be helpful if this were documented somewhere’. Different people prefer different ways of having information delivered to them. I hope my way of showing things finds its audience.