Following on from my previous post about Update Queries in Azure SQL Data Warehouse, I thought I would put together a mini-series of blogs related to my ‘struggles’ working with the Azure SQL DW. Don’t get me wrong, its great, just has some teething issues of which there are work-arounds!
This blog post is going to look at what Statistics in the database world are, the differences between them on-prem (SQL Server) and in the cloud (Azure SQL Data Warehouse) and also how to use them in Azure Data Warehouse.
What are statistics?
Statistics are great, they provide information about your data which in turn helps queries execute faster, The more information that is available about your data, the quicker your queries will run as it will create the most optimal plan for the query.
Think of the statistics as you would the mathematical ones- they give us information regarding the distribution of values in a table, column(s) or indexes. The statistics are stored in a histogram which shows the distribution of values, range of values and selectivity of values. Statistics objects on multiple columns store information regarding correlation of values among the columns. They are most important with queries that have JOINS and GROUP BY, HAVING, and WHERE clauses.
In SQL Server, you can get information about the statistics by querying the catalog views sys.stats and sys.stats_columns. By default, SQL Server automatically creates statistics for each index, and single columns.
See here for more information.
How does it work in Azure SQL Data Warehouse?
In Azure SQL Data Warehouse, statistics have to be created manually. On previous SQL Server projects, creating and maintaining statistics wasn’t something that we had to incorporate into our design (and really think about!) however with SQL DW we need to make sure we think about how to include it in our process in order to make sure we take advantage of the benefits of working with Azure DW.
The major selling point of Azure SQL Data Warehouse is that it is capable of processing huge volumes of data, one of the specific performance optimisations that has been made is the distributed query optimiser. Using the information obtained from the statistics (information on data size and distribution), the service is able to optimize queries by assessing the cost of specific distributed query operations. Therefore, since the query optimiser is cost-based, SQL DW will always choose the plan with the lowest cost.
Statistics are important for minimising data movement within the warehouse i.e. moving data from distributions to satisfy a query. If we don’t have statistics, azure data warehouse could end up performing data movement on the larger (perhaps fact) table instead of the smaller (dimension) table as it wouldn’t know any information about the size of them and would just have to guess!
How do we implement statistics in Azure Data Warehouse?
Microsoft have actually provided the code of how to generate the statistics so its just a case of deciding when in your process you want to create them or maintain.
In my current project, we have created a stored procedure which will create statistics and another that will update them if they already exists. Once data has ben loaded into a table, we call the stored procedure and then the statistics will be created or updated (depending on what is needed).
See the documentation for more information and the code.
Tip: On my current project, we were getting errors when running normal stored procedures to load the data.
Error message:
‘Number of Requests per session had been reached’.
Upon investigation in the system tables,’Show Statistics’ was treated as a request which was also evaluated for each node causing the number of requests to blow up. By increasing the data warehouse units (DWUs) and also the resource group allocation this problem went away. So, take advantage of the extra power available to you!
There is a big list on the Microsoft Azure website of features not supported in Azure SQL Data Warehouse, take a look ‘here‘. I will cover further issues in my next few blogs 
For more information, check out our Azure Data Warehouse Health Check, or feel free to contact us if you have any questions or queries.
How Artificial Intelligence and Data Add Value to Businesses
Knowledge is power. And the data that you collect in the course of your business
May
Databricks Vs Synapse Spark Pools – What, When and Where?
Databricks or Synapse seems to be the question on everyone’s lips, whether its people asking
1 Comment
May
Power BI to Power AI – Part 2
This post is the second part of a blog series on the AI features of
Apr
Geospatial Sample architecture overview
The first blog ‘Part 1 – Introduction to Geospatial data’ gave an overview into geospatial
Apr
Data Lakehouses for Dummies
When we are thinking about data platforms, there are many different services and architectures that
Apr
Enable Smart Facility Management with Azure Digital Twins
Before I started writing this blog, I went to Google and searched for the keywords
Apr
Migrating On-Prem SSIS workload to Azure
Goal of this blog There can be scenario where organization wants to migrate there existing
Mar
Send B2B data with Azure Logic Apps and Enterprise Integration Pack
After creating an integration account that has partners and agreements, we are ready to create
Mar