This is a quick post to highlight how making some small changes to the SSIS Fuzzy Lookup settings can make a significant difference to the matches that are returned.
As I’ve mentioned before, the SSIS Fuzzy matching relies on matching substrings within a string (known as q-grams) in order to produce a result. A given input string (and reference strings) is split up into several different substrings. Simply put, the more substrings, or q-grams, that the input string has in common with the reference string, then the better the match. For example, a word such as ‘sandwich’ would be split up into ‘sand’, ‘andw’ and ‘dwic’, in order to aid the matching process.
I’ve hit an issue recently whereby the sub strings being produced where causing false matches – lets start by going through a quick example, loosely based on Adventure Works data. Here is a package that takes data from a data source, before using a Fuzzy lookup:
I’ve narrowed down the data source to have just two rows for this example, which are:
There are a few more so called Master Products, these are what we want to match against:
By default, if we run the SSIS package then the following results will appear in a data viewer after the fuzzy lookup. Note – the lookup is intentionally returning the top 2 matches to illustrate this point.
The Source Product column reflects the actual situation I experienced recently, whereby the Brand name (Apollo and RS in this case) appears in the column to match. Although this is made up, it closely follows the situation I experienced whereby the wrong match is produced in both cases – “Apollo X.C. Large” looks very similar to “Apollo XC”, yet its not the top match.
The reason this occurs is actually the Token Delimiters setting on the Fuzzy Lookup. By default they include both ampersands and full stops, meaning in the case of ‘Apollo X.C. Large’ the following tokens/substrings are created by default:
Note there is no token created for XC, meaning this will not match well against a substring of ‘XC’. One option is to remove the full stops and ampersands as delimiters as shown below in the advanced tab:
This produces the following results in the data viewer:
In this case, the correct master product per row (namely ‘Apollo XC’ and ‘RS Shorts’) is returned correctly, so the settings in this case have made a difference.
Conclusion
This shows the effect that the token delimiters can have on the matching process. Unfortunately, what works in one situation may not work in another, so one option may be to try an initial match, followed by a second match for any low matching rows. Data quality will always have a big impact on matching, so cleaning and enriching data before matching is always a good idea.
Introduction to Data Wrangler in Microsoft Fabric
What is Data Wrangler? A key selling point of Microsoft Fabric is the Data Science
Jul
Autogen Power BI Model in Tabular Editor
In the realm of business intelligence, Power BI has emerged as a powerful tool for
Jul
Microsoft Healthcare Accelerator for Fabric
Microsoft released the Healthcare Data Solutions in Microsoft Fabric in Q1 2024. It was introduced
Jul
Unlock the Power of Colour: Make Your Power BI Reports Pop
Colour is a powerful visual tool that can enhance the appeal and readability of your
Jul
Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2
Part 2: Exploring Advanced Functionalities in Databricks Welcome back to our Databricks journey! In this
May
GPT-4 with Vision vs Custom Vision in Anomaly Detection
Businesses today are generating data at an unprecedented rate. Automated processing of data is essential
May
Exploring DALL·E Capabilities
What is DALL·E? DALL·E is text-to-image generation system developed by OpenAI using deep learning methodologies.
May
Using Copilot Studio to Develop a HR Policy Bot
The next addition to Microsoft’s generative AI and large language model tools is Microsoft Copilot
Apr