This is a quick post to highlight how making some small changes to the SSIS Fuzzy Lookup settings can make a significant difference to the matches that are returned.
As I’ve mentioned before, the SSIS Fuzzy matching relies on matching substrings within a string (known as q-grams) in order to produce a result. A given input string (and reference strings) is split up into several different substrings. Simply put, the more substrings, or q-grams, that the input string has in common with the reference string, then the better the match. For example, a word such as ‘sandwich’ would be split up into ‘sand’, ‘andw’ and ‘dwic’, in order to aid the matching process.
I’ve hit an issue recently whereby the sub strings being produced where causing false matches – lets start by going through a quick example, loosely based on Adventure Works data. Here is a package that takes data from a data source, before using a Fuzzy lookup:
I’ve narrowed down the data source to have just two rows for this example, which are:
There are a few more so called Master Products, these are what we want to match against:
By default, if we run the SSIS package then the following results will appear in a data viewer after the fuzzy lookup. Note – the lookup is intentionally returning the top 2 matches to illustrate this point.
The Source Product column reflects the actual situation I experienced recently, whereby the Brand name (Apollo and RS in this case) appears in the column to match. Although this is made up, it closely follows the situation I experienced whereby the wrong match is produced in both cases – “Apollo X.C. Large” looks very similar to “Apollo XC”, yet its not the top match.
The reason this occurs is actually the Token Delimiters setting on the Fuzzy Lookup. By default they include both ampersands and full stops, meaning in the case of ‘Apollo X.C. Large’ the following tokens/substrings are created by default:
Note there is no token created for XC, meaning this will not match well against a substring of ‘XC’. One option is to remove the full stops and ampersands as delimiters as shown below in the advanced tab:
This produces the following results in the data viewer:
In this case, the correct master product per row (namely ‘Apollo XC’ and ‘RS Shorts’) is returned correctly, so the settings in this case have made a difference.
Conclusion
This shows the effect that the token delimiters can have on the matching process. Unfortunately, what works in one situation may not work in another, so one option may be to try an initial match, followed by a second match for any low matching rows. Data quality will always have a big impact on matching, so cleaning and enriching data before matching is always a good idea.
How Artificial Intelligence and Data Add Value to Businesses
Knowledge is power. And the data that you collect in the course of your business
May
Databricks Vs Synapse Spark Pools – What, When and Where?
Databricks or Synapse seems to be the question on everyone’s lips, whether its people asking
1 Comment
May
Power BI to Power AI – Part 2
This post is the second part of a blog series on the AI features of
Apr
Geospatial Sample architecture overview
The first blog ‘Part 1 – Introduction to Geospatial data’ gave an overview into geospatial
Apr
Data Lakehouses for Dummies
When we are thinking about data platforms, there are many different services and architectures that
Apr
Enable Smart Facility Management with Azure Digital Twins
Before I started writing this blog, I went to Google and searched for the keywords
Apr
Migrating On-Prem SSIS workload to Azure
Goal of this blog There can be scenario where organization wants to migrate there existing
Mar
Send B2B data with Azure Logic Apps and Enterprise Integration Pack
After creating an integration account that has partners and agreements, we are ready to create
Mar