Working around “Cannot start the columnstore index build because…”

 

Recently I ran into an issue while trying to improve query performance by building a columnstore index on one of the tables. This is the message I got while building the columnstore index:

Cannot start the columnstore index build because it requires at least 2458240 KB, while the maximum memory grant is limited to 854208 KB per query in workload group ‘default’ (2) and resource pool ‘default’ (2). Retry after modifying columnstore index to contain fewer columns, or after increasing the maximum memory grant limit with Resource Governor.

And this is very strange: I have 8 GB of RAM on the box, with 7 GB dedicated to SQL Server. Nothing else is running on this computer. If we look at the error message, we see that the query is asking for about 2.4 GB. So, there is enough memory to successfully build the index.

Well, the error message tells you quite directly to check the Resource Governor settings on the server. You can use SQL Server Management Studio for this, or a script like this one:

sp_configure 'show advanced options', 1
go
reconfigure
go
sp_configure
go
select * from sys.resource_governor_resource_pools
select * from sys.resource_governor_workload_groups

Running those queries, I got the following output.

Resource pool configuration:

pool_id | name     | min_cpu_percent | max_cpu_percent | min_memory_percent | max_memory_percent | cap_cpu_percent
1       | internal | 0               | 100             | 0                  | 100                | 100
2       | default  | 0               | 100             | 0                  | 100                | 100

Workload group configuration:

group_id | name     | importance | request_max_memory_grant_percent | request_max_cpu_time_sec | pool_id
1        | internal | Medium     | 25                               | 0                        | 1
2        | default  | Medium     | 100                              | 0                        | 2

So, according to the output, the default pool may claim all the memory available to SQL Server, but for whatever reason it is not getting it. You can check what is really going on on the box with DBCC MEMORYSTATUS.

Alternatively, and this is what I did, it is possible to force the server not to release memory by setting the minimum server memory setting (I set it to 45% of the memory dedicated to the server). And voila! The columnstore index was built successfully!
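For reference, here is a minimal sketch of both options. The value 3200 MB is an assumption (roughly 45% of the 7 GB dedicated to SQL Server on my box), and the 50% grant limit is just an example – adjust both to your environment:

-- option 1 (what I did): raise the minimum server memory so SQL Server holds on to the memory
-- 'show advanced options' was already enabled in the script above
exec sp_configure 'min server memory (MB)', 3200
reconfigure
go
-- option 2 (what the error message suggests): raise the per-query memory grant limit
-- of the default workload group via Resource Governor; only relevant if your group is still capped
alter workload group [default] with (request_max_memory_grant_percent = 50)
alter resource governor reconfigure
go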

SQL Server SSIS Balanced Data Distributor for SQL Server 2012

From the description on the Download Center:

“Microsoft® SSIS Balanced Data Distributor (BDD) is a new SSIS transform. This transform takes a single input and distributes the incoming rows to one or more outputs uniformly via multithreading. The transform takes one pipeline buffer worth of rows at a time and moves it to the next output in a round robin fashion. It’s balanced and synchronous so if one of the downstream transforms or destinations is slower than the others, the rest of the pipeline will stall so this transform works best if all of the outputs have identical transforms and destinations. The intention of BDD is to improve performance via multi-threading. Several characteristics of the scenarios BDD applies to: 1) the destinations would be uniform, or at least be of the same type. 2) the input is faster than the output, for example, reading from flat file to OleDB. “

Download Center

I am going to try it. Looks very promising.

The basics of the data warehouse modeling

Introduction

Working on different large-scale data warehousing projects, we have very often observed some common patterns which had absolutely nothing to do with SQL Server's technical ability to scale and perform, yet had a significant impact on the success or failure of the projects.

 

‘Building Bricks’ and MBA

We all live in a world where (especially in big companies) a lot of great business managers with MBA degrees define the internal processes. One of the key concepts here is that all processes can be represented as building blocks, and you can construct and optimize your business and your company by moving those blocks around, adding new ones and removing old ones.

This leads to a situation where the business and IT departments are connected with each other through an absolute minimum of interfaces, and very often IT has no idea what the business is doing. The role of IT drifts more and more in the direction of a 'procurement department for computers and printers', and when the business becomes big enough and demands a large and very special solution, the people in IT face a huge challenge: what should we offer the business to address their demand?

Here they do what they have learned to do best – start looking for 'building blocks' again. Speaking of data warehouses, which blocks can you pick to build one? Of course it is 'industrial data models' and hardware which can support the scale of the project.

 

Hardware sizing.

It is clear that the hardware should simply be big. The storage should hold all the data the business will need in the next few years, and the processors should be so fast that the queries will fly! If money is not a problem, then the biggest storage and the server with the most cores ever will be ordered. If IT is short on money, they will start calling the vendors around them, requesting proposals for their size of data warehouse.


 

Industrial data models

Now the hardware is purchased and will probably be delivered very soon, so the next question becomes: how should we model our data warehouse? Well, if people are familiar with the Kimball approach, they know what the next steps should be. But what do we do when we have no idea how to deal with something new that we have never experienced? We look at what others do! Therefore, the next activity those MBA IT managers will start is looking for existing data models, hoping that if somebody else used one, it will fit them as well.


Is it good or bad if I buy a data model for the warehouse?

Actually, there is nothing wrong with the data models themselves, if you know how to use them. There are a few pros and cons to data models developed by third-party companies:

Pro:

1. You can consider those data models as a "super-inventory" which will give you an overview of what types of data your business works with.

Cons:

1. Very often those models will not fit your business processes 100%, and you will need to re-design them to adjust to your business demands. Sometimes a re-design is a more complicated process than building from scratch. And not to mention that the amount of information you can actually recycle for your needs is somewhere between 15-25%.

2. Those models often do not tell you anything about the physical data model which you will build in your data warehouse.

One of the 'best' examples of how it should not work I saw nearly a year ago, where a customer was trying to build the ULTIMATE data warehouse (they called it the 'Enterprise Data Warehouse'), which was intended to store all the data ever produced in the company, with a predicted size of nearly 100 TB. Just in case. What if you need to generate some reports and the data is not in the EDW? But before starting to build this ULTIMATE EDW, they decided to invest a few hundred thousand Euros in buying a data model from one well-known vendor which specializes in producing those things.

For many months, specially educated people were moving from room to room interviewing the business. One day THE LOGICAL DATA MODEL was presented to the management. The model was presented as a printout – several meters long and full of figures a few centimeters wide. You had to be a scientist to understand this diagram and what you could do with this model.

This wouldn't have been a big problem if, at nearly the same time, we hadn't been asked to help optimize performance at one of the banks, where the same vendor had built the model and then built a 1:1 physical model out of it: a highly normalized schema. The performance we saw there was an absolute disaster: loading 50 GB of test data was taking 36 hours!!

Is this the model's fault? No! But this is what we see quite a lot – people simply do not think about the next step they have to take after buying a logical model. They fail to understand how one model has to be mapped to another.

How do we approach those problems in our projects?

 

Different approach.

Time is money!

What is typically the key problem that the business always tries to solve in the first place? It is all about money! Usually, there are a number of information sources (data sources) which help the business earn money – if the information is delivered on time.


If we speak about large data warehouses, we often see that the most important challenge is moving data from the data sources into the data warehouse and making it available to the consumers, such as direct queries or reporting and analysis engines.

Understanding the time window which satisfies the business demand will then give you the basic information necessary for the architectural decisions that you can apply.

Know your limits

It is extremely important to understand the limits of the technologies which will be used in building a large data warehouse. Obviously, you wouldn't expect more than 10 Gb/sec of throughput from a 10 Gb network card, would you?

We often see people get confused when correlating the compute power of a server, the number of CPUs it has and how fast a workflow can run on that box. I have observed situations where customers replaced an old 4-core server with a new 32-core box and saw significant workflow performance degradation! Well, if your workload was using a single core only and the old server had a much higher CPU clock speed, it should not be a surprise that the workload slows down!

Analyze and find the limits of the hardware components that you use: e.g. one CPU core can process approximately 200 MB/sec, and a 10 Gb network card will not deliver more than 10 Gb/sec of throughput. Also measure and create baselines for the processes one level above the hardware: what is the highest ETL speed you can reach, and how fast can you build an index?
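For example, a quick and dirty way to get a single-stream load baseline is simply to time a bulk load of a known amount of data. The staging table and file path below are just placeholders, not a prescribed setup:

-- minimal sketch: time a single-stream bulk load to get a write throughput baseline
set statistics time on
bulk insert dbo.StageTable
from 'C:\staging\sample.dat'
with (tablock)
set statistics time off
-- divide the file size by the elapsed time reported above to get MB/sec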

We know, as an example, that most of the data transfer operations in SQL Server which rely on the BULK API are single-stream and give us somewhere around 25 MB/sec of write throughput and ~35 MB/sec of read performance.

 

Not enough speed? Run parallel

Now we know the limits, but we see that with the restrictions we have recognized, we can't meet the business requirements. Is there any way to scale the throughput of the system? Simply multiply the number of 'streams', so that together they deliver the performance we are looking for. If reading data from the data source at a speed of 35 MB/sec is too slow – add more 'read' components. If processing data with a single core doesn't give you the necessary speed – make sure you design your workflow in a way that it runs on multiple cores.
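As a back-of-the-envelope sketch (the data volume and loading window below are made-up numbers; the 25 MB/sec figure is the single-stream write baseline mentioned above), you can estimate the required number of parallel streams like this:

-- rough sizing: how many parallel bulk streams are needed to hit the loading window?
-- @data_gb and @window_hours are illustrative assumptions
declare @data_gb int = 500, @window_hours int = 4, @stream_mb_per_sec int = 25
select ceiling((@data_gb * 1024.0) / (@window_hours * 3600) / @stream_mb_per_sec) as streams_needed
-- 500 GB in 4 hours needs ~36 MB/sec, so at least 2 parallel 25 MB/sec streams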

 

Rough model

If we have agreed on using the Kimball approach in building our data warehouse model, we had better follow this approach precisely. Just to refresh the memory, the very first step in this approach is to identify the business problem which you are about to solve: which questions the business really wants to ask.

Sit together with the business departments whose data you will reflect in your data warehouse and sketch a rough logical data model, like this:

[Figure: rough logical data model sketch]

This will give you an idea of which fact and dimension tables you will have in the data warehouse, which will later be reflected in a schema like this:

[Figure: fact and dimension table schema]
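In T-SQL terms, such a schema eventually ends up as plain fact and dimension tables. A minimal sketch (all table and column names here are purely illustrative, not taken from any real model) could look like this:

-- minimal star schema sketch: one fact table and two dimension tables
create table dbo.DimDate
(
    DateKey int primary key,   -- e.g. 20120131
    CalendarDate date not null
)

create table dbo.DimCustomer
(
    CustomerKey int identity(1,1) primary key,
    CustomerName nvarchar(100) not null
)

create table dbo.FactOrders
(
    DateKey int not null references dbo.DimDate (DateKey),
    CustomerKey int not null references dbo.DimCustomer (CustomerKey),
    Quantity int not null,
    Amount money not null
)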

 

Hardware sizing

So, now let's talk about hardware sizing. Microsoft, together with some other vendors, developed a data warehouse reference architecture called Fast Track. This reference architecture reflects years of experience building large data warehouses and provides a balanced hardware and software configuration at a scale of up to nearly 100 TB. One thing to note though – you need to make sure that your data warehouse's characteristics match what we usually understand as a 'classical data warehouse':

· Star schema

· Large amount of scan operations

· Write and Read operations dominate over UPDATES (Updates happen mostly on the Dimension tables)

What if you need to scale beyond 100 TB? You may then want to consider Parallel Data Warehouse, which (as of now) can host up to 500 TB of data.

 

Physical Data Model for scale

Now that we understand which hardware will be used and how we will model the data warehouse, and we know the limits and the baselines for the ETL and other processes, it is time to think about the architectural decisions which will help address the business challenges. Very often, one of the biggest customer concerns we hear is something like: 'my business grows; I need to process more and more data in a short period of time and make it available to the data consumers! What options do I have?'

This tells us immediately that loading a single stream at a speed of 35 MB/sec might actually not work. So we need more speed, which means we need more streams. Our target fact table, especially if the business is interested in working with historical data (this is why we often build a data warehouse, correct?), will be a partitioned table. Often, partitioning by time (day, month, year) may not give you enough performance when populating the table, especially if the source data could deliver a better distribution by some other key. This can be a data source ID, a customer ID or an area code. In this situation you may think of 'two dimensional partitioning'. The basic principle is that you create partitioned fact tables optimized for data loading (let's say one partition is loaded at a time and each partition corresponds to an area code), and each such fact table carries a specific time range, such as a day, a month or anything else, often defined by the time range the business needs to analyze. Let's say you have year ranges; therefore, in order to give the business an outlook on what was happening from a historical perspective, you will need to join all those partitioned fact tables in a VIEW, which will then be exposed to the business users or applications as the single object where all the data appears to be stored.

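A minimal sketch of this 'two dimensional partitioning' idea, assuming yearly fact tables partitioned by area code (all object names and boundary values are illustrative only):

-- partition each yearly fact table by the load-friendly key (area code in this sketch)
create partition function pfAreaCode (int)
as range right for values (100, 200, 300)

create partition scheme psAreaCode
as partition pfAreaCode all to ([PRIMARY])

create table dbo.FactSales_2011
(
    AreaCode int not null,
    DateKey int not null,
    CustomerKey int not null,
    Amount money not null
) on psAreaCode (AreaCode)

create table dbo.FactSales_2012
(
    AreaCode int not null,
    DateKey int not null,
    CustomerKey int not null,
    Amount money not null
) on psAreaCode (AreaCode)
go

-- expose all yearly fact tables to the business users as one object
create view dbo.FactSalesAll
as
select AreaCode, DateKey, CustomerKey, Amount from dbo.FactSales_2011
union all
select AreaCode, DateKey, CustomerKey, Amount from dbo.FactSales_2012
go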

I am intentionally not diving into a deeper discussion of the particular possible designs and their benefits, since the goal of this blog post was to get on the same page about the approach to building data warehouses.

Thomas, Marcel and I have already posted, and will definitely be writing more, on some particular techniques (I see that Marcel has started writing a series of blog posts regarding the ETL scaling approach which we presented together at SQL Rally Nordic 2012). Stay tuned, and have a lot of fun developing data warehouses with SQL Server!


Appliances and reference architectures: another proof of the Russian wisdom: 'Trust, but verify'

Together with my friend Marcel, I worked on validating a new data warehouse concept onsite at one of our big and very well-known customers. Even though we got a nicely balanced, pre-configured hardware architecture, we learned once again that sometimes it is better to double-check whether everything runs as you expect. Generally speaking, I would recommend this as one of the mandatory steps whenever you are about to tune your system: first make sure you get the performance you expect from the hardware, then think about any further optimizations.

Here is the link to the blog post published at SQLCAT.COM: Fast Track: improving performance through correct LUN Mapping and Storage Enclosure configuration