With SQL Server 2014 due for release soon, there’s a lot of talk about how little is changing for those of us more interested in Business Intelligence. We’ll be getting clustered, updatable columnstore, which is very convenient, but I’ve been looking into the potential benefits of other new features. The real gem for me is the potential of memory-optimised tables or in-memory OLTP (previously known as Project Hekaton).
I’ve had a few questions around how columnstore and memory-optimised tables differ – they’re both in-memory solutions for data storage right?
The big difference comes in disk usage – in order for a columnstore index to exist, there must also be a physical table written to the disk. Any updates to the table, even in SQL Server 2014, will require disk writes, log writes too if you’re not bulk loading. Memory-optimised tables don’t have this overhead – we can push the staging data straight into memory and skip the disk entirely.
There are plenty of uses for this new type of table, but it’s the traditional ETL routine I’m interested in. We would normally pull data from our source systems, place it in a staging database while we clean, validate etc, then write the cleaned data to the warehouse:
The data written to the staging database is transient, it’s only of use during the load process. It is therefore redundant to write this data to disk each time; I believe this is where memory-optimised tables will really have an impact.
To prove this theory, I set up a quick demonstration. I started by creating two identical tables, one physical, one memory-optimised, both schema copies of [AdventureWorks2012].[Sales].[SalesOrderDetail].
The memory optimised table was created using the following script:
I then built the simplest ofs, a data flow to move data from AdventureWorks2012.Sales.SalesOrderDetail to a ‘staging’ table, then read the data from the staging table and write it back to a disk-based table, representing our final warehouse.
I configured two versions of the package, one where the staging table was disk-based, one where it is our memory-optimised table. I deployed and scheduled these packages as alternating jobs and kicked them off for a few hours to see how performance varies over time. The results are as follows:
The initial results are very promising – we see a huge reduction in execution time when using in-memory tables for our transient ETL data. We’re still performing minimal logging with this setup – you can reduce this even further by switching to SCHEMA_ONLY durability, but by bypassing writing the data to disk we’ve cut our ETL window in half.
There are obvious hardware limits to how this can be used, and in-memory tables have their own limitations, but implemented carefully it could change the face of warehouse loading drastically.
Celebrating International Women’s Day: from Classroom to Code
As we celebrate International Women’s Day, I want to share my journey of breaking stereotypes
Mar
Pretty Power BI – Adding Pagination to Bar Charts
Good User Experience (UX) design is crucial in enabling stakeholders to maximise the insights that
Feb
Pretty Power BI – Creating Dynamic Histograms
Good User Experience (UX) design is crucial in enabling stakeholders to maximise the insights that
Feb
Top Tips to Pass the Databricks Certified Data Engineer Professional Exam
Having recently passed the Databricks Certified Data Engineer Professional exam, this blog post covers some
Jan
Python vs. PySpark Navigating Data Analytics in Databricks – Part 1
Introduction When it comes to conquering the data analytics landscape in Databricks, two heavyweights, Python
Jan
Impact of AI on Business Analysis
Artificial intelligence (AI) is rapidly transforming our world, and this blog post concentrates on the
Jan
Creating Clickbait Using Python
In 2023, about 5 billion people used the internet. With so many people contributing and
Dec
A Brief Overview of Security in Microsoft Fabric
Where Fabric Sits in the Hierarchy As you are probably aware, Microsoft Fabric is Microsoft’s
Dec