Introduction
Very often working on the different large scale data warehousing projects we observed some common patterns, which had absolutely nothing to do with the technical abilities of the SQL Server to scale and perform, however had significant impact on the success or not success of the projects.
‘Building Bricks’ and MBA
We all live in the world, where (especially in big companies) a lot of the great business managers with MBA degree define internal process. One of the key concepts here is that all processes can be presented as the building blocks and you can construct and optimize your business and the company by moving those blocks around, adding new ones, removing old.
This leads to the situation, where business and IT departments connected with each other due to the absolutely minimum of the interfaces and very often IT has no idea of what business is doing. The role of IT is getting more into direction of the ‘procurement department for computers and printers’ and when business becomes big enough and demands large and very special solution, people in IT face the huge challenge: what should we offer to business to address their demand?
Here they do the best that learned to do – start looking for the ‘building blocks’ again. Speaking of the data warehouses, which block you, can pick to build it? Of course it is ‘industrial models’ and the hardware, which can support the scale of the project.
Hardware sizing.
It is clear, that the hardware should be simply big. Storage should store all the data that business will need in the next few years and processors so fast that they queries will fly! If money is not the problem – then the biggest storage as well as the largest with the most of the cores ever server will be ordered. If IT is short on money, they will start calling vendors around them requesting proposals for their size of data warehouses.
Industrial data models
Now, hardware is purchased, it will probably be delivered very soon, the next question becomes: how should we model our data warehouse? Well, if people are familiar with the Kimball approach – they know that the next steps should be. What do we do if we have no idea regarding how to deal with something new, that we never experienced? We will look at what others do! Therefore, the next activity those MBA IT managers will start – looking up for the existing data models hoping that if somebody else used that – this will fit me too as well.
Is it good or bad if I buy the data model for the warehouse?
Actually, there is nothing wrong about data models themselves, if you know how to use them. There few pro’s and con’s about data models developed by third-party companies:
Pro:
1. You can consider those data models as the “super-inventory”, which will give you the outlook of what types of data your business works with
Cons:
1. Very often those models will not 100% fit into your business processes and you will need to re-design them to adjust to your business demands. Sometimes re-design is more complicated process rather building from scratch. And not to mention, that the amount of the information which you can recycle for your needs is somewhere between 15-25%.
2. Those models often do not tell you anything about physical data model, which you will build in your data warehouse
One of the ‘best’ examples of how it should not work I saw nearly a year ago, where the customer was trying to build the ULTIMATE data warehouse (they called it the ’Enterprise Data Warehouse’), which was intended to store all the data ever produced in the company with a predicted size of nearly 100TB. Just in case. What if you need to generate some reports and data is not in the EDW? But before starting building this ULIMATE EDW, they decided to invest few hundreds thousands Euro in buying data model for one well known vendor, which specialized on producing those things.
For many months, specially educated people were moving from room to room interviewing business. One day THE LOGICAL DATA MODEL was presented to the management. This model was presented as the print out – several meters long and full with few centimeters wide figures. You had to be the scientist to understand this diagram and what you could do with this model.
This wouldn’t be the a big problem if nearly at the same time we would get asked to help optimizing performance in one of the Banks, where same vendor build the model and build 1:1 physical model out of it: a highly normalized schema. The Performance we saw there was an absolute disaster: loading 50 GB of test data was taking 36 hours!!
Is this the model’s fault? No! But this is that we see quite a lot – people simply do not think about the next step which they have do after buying logical model. They fail to understand how model is mapped to another one.
How do we approach those problems in our projects?
Different approach.
Time is money!
What is typically the key problem that business tries always to solve in the first place? It is all about money! Usually, there is a number of information sources (Data Sources), which helps business to earn money if this information delivered on time.
If we speak about large data warehouses, we can often see that the most important challenge here is related to moving data from the data sources into data warehouse and make it available to the clients, such as direct queries, or some reporting and analysis engines.
Understanding the time window which satisfies business demand will give you then the necessary basic information regarding the architectural decisions that you can apply.
Know your limits
It is extremely important to understand the limits of the technologies, which will be used in building large data warehouse. Obviously, you won’t expect form the 10 Gb network card the throughput higher, than 10Gb/sec?
We often see people get confused in correlating compute power of the server, the number of CPUs it has and how fast can workflow run on this box. I have observed situations, where customers replaced an old 4 core serer with new 32 cores box and saw significant workflow performance degradation! Well, if your workload was using single core only and the old server had much higher CPU clock speed, it should not surprise, what the workload will slow down!
Analyze and find the limits of the hardware components that you use: e.g. one CPU core can process approximately 200 MB/sec, 10Gb network card will not deliver more than 10Gb/sec throughput. Also measure and create the baseline for the processes level above hardware: what is the highest ETL speed you can reach, how fast can you build an index?
We know, as an example, what most of the data transfer operations in SQL server, which rely on the BULK API are single stream and give us somewhere around 25MB/sec for the write operations throughput and ~35 MB/sec read operations performance.
Not enough speed? Run parallel
Now, we know the limits, but we see that with the restrictions that we recognized, we can’t meet the business requirements. Is there any solution how we could scale the throughput of the system? Simply multiply the amount of the ‘streams’, so that all together they deliver performance that we were looking for. If reading data from the data source with the speed of 35MB/sec is too slow – make more of the ‘read’ components. If processing data with the single core doesn’t give the necessary speed – make sure your design your workflow in the way it runs on multiple cores.
Rough model
If we agreed on using Kimball model in building our data warehouse model, we had better follow this approach precisely. Just to refresh the memory, the very first step in this approach to understand the limits of the technologies, which will be used –identify the business problem which you are about to solve: which questions business really wants to ask.
Sit together with your business departments, whose data you will reflect in your data warehouse and sketch the rough logical data model, like this:
This will give you the idea about which Fact and dimension tables you will have in the data warehouse and later will be reflected in the schema like this:
Hardware sizing
So, now let’s talk about hardware sizing. Microsoft together with some other vendors developed data warehouse reference architecture called Fast Track. This reference architecture reflects the years of experience building large data warehouses and provides with balanced hardware and software configuration on the scale of the nearly 100TB. One thing to notice though – you need to make sure that your data warehouse characteristics comply with that we usually understand under ‘classical data warehouse’:
· Star schema
· Large amount of scan operations
· Write and Read operations dominate over UPDATES (Updates happen mostly on the Dimension tables)
What if you need to scale above 100TB? You may want then to consider Parallel Data Warehouse which can host (as of now) up to 500TB of data.
Physical Data Model for scale
Now, that we understand, which hardware will be used, This model was presented as the print out – several meters long and full with few centimeters big. how we will model data warehouse, know limits and the baselines for the ETL and other processes, it is time to think of the architectural decisions, which will help to address the business challenges. Very often, one of the biggest customer’s concerns we are hearing is something like: ‘my business grows; I need to process more and more data in the short period of time and make it available to the data consumers! What options do I have?’
This tells us immediately, that loading single stream with the speed of 35 MB/sec might actually not work. So, we need more speed – means we need more streams. Our target fact table, especially if business is interested in working with historical data (this is why we often build data warehouse, correct?), will be partitioned table. Often, partitioning by time (day, month, year) may not give you enough performance populating the table, especially, if source data could deliver better distribution by some other key. This can be data source ID, customer ID or area code. In this situation you may think of ‘two dimensional partitioning’, where the base principle is that you create partitioned fact tables optimized for data loading ( let’s say one partition will be loaded at the time and each partition corresponds to the area code) and this fact table carries specific time range, such as day, month or anything else, which is often defined by the time range which business needs to analyze. Let’s say you will have year ranges, therefore in order to give business outlook on that was happening in historical perspective – you will need to join those entire partitioned fact tables in the VIEW structure, which will be exposed then to the business users or application as the object, where all data will be stored.
I am intentionally not diving into the deeper discussion about the particular possible designs and their benefits, since the goal of this blog was to get on the same page about approach in building data warehouses.
Thomas, Marcel and I, we already posted and will be definitely writing more on some particular techniques (I see that Marcel started writing series of blogs regarding the ETL scale which we presented together at the SQL Rally Nordic 2012). Stay tuned and lot of fun developing data warehouses with SQL Server!