In this blog we look at how to optimise SSIS lookups with HASHBYTES. I was recently working on a load for a Dimension table which worked like the below.
The package follows a fairly standard routine. Data for the Dimension is transformed into the required format using SSIS, following this it is inserted into the target table. On the way an initial check is made to determine if the record is new ‘LKU – Check Exists’ and a secondary check to determine if the record has changed, which is subsequently used in an update routine.
Although the package was incrementally loading, it was having to sometimes check several million rows to determine if these had changed, as the table was quite wide I decided to implement the HASHBYTES() function in order to take a hash of the columns for comparison rather than compare each of the columns.
The initial results were quite promising with the package running in around half the time, however on a particularly big load I noticed the package failed as it had run out of memory for the second Lookup (full cache was used). I found this odd because the HASHBYTES function only returned a single value and I anticipated this would be smaller than the sum of the columns, however on a little deeper investigation I found that by default any row returned using the HASHBYTES function are all of a default size 8000 bytes.
A quick check of the SQL Documentation of the HASHBYTES function at http://msdn.microsoft.com/en-GB/library/ms174415.aspx states that the size of the data returned for the HASHBYTES function when used with the SHA2-256 algorithm is 32 bytes meaning that most of this space was being wasted.
Therefore changing the formula from 1 to 2 below significantly increases the efficiency of the lookup and also make the cache size required smaller by a factor of around 250!
1. HASHBYTES(‘SHA2_256’,[ProductName] + ‘|‘ + [CategoryName]) AS HashValue
2. CONVERT(VARBINARY(32),HASHBYTES(‘SHA2_256’,[ProductName] + ‘|‘ + [CategoryName])) AS NewHash
Upon doing this the size of each row was significantly reduced, the package is running faster than before and most importantly there are no memory issues 🙂
Meet the Team – Jason Bonello, Senior Consultant
Meet Jason Bonello! Jason has been with us for just over two years and works
Apr
Meet the Team – Matt How, Principal Consultant
Next up in our series of meet the team blogs is Matt How. Matt has
Apr
MLFlow: Introduction to MLFlow Tracking
MLFlow is an open-source MLOps platform designed by Databricks to enable organisations to easily manage
Apr
Adatis are pleased to announce expansion plans into India
Adatis has offices in London, Surrey and Bulgaria – and has some big expansion plans
Mar
Querying and Importing Data from Excel to SSMS
Introduction There are couple of ways to import data from Excel to SSMS – via
Mar
Data Engineering for Graduates and New Starters
This blog aims to give potential graduates and other new starters in the industry some
Mar
Passing DP-900 Azure Data Fundamentals
Back in December, I took the DP-900 Azure Data Fundamentals exam which is one of
Feb
Real-time Dashboards Provide Transparency to Everyone
Real-time dashboards enable data analytics firm Adatis to be agile and transparent with its team
Feb