A standard practice in the data warehousing world is the process of delivering conformed dimensions, which is an activity that can involve data cleaning, standardisation, de-duplicating and matching, with a goal of ensuring that data from multiple different locations can be analysed in a consistent manner. For example, System A may have a customer called William Smith, whereas System B might hold the exact same customer as Mr W Smith. The data warehouse ultimately needs to output one version of William Smith to the end user in this example, which will allow all data associated with this customer to be easily analysed.
I’m currently facing the challenge of de-duplicating and matching across data sources, although in my case most of the data sources are not in English. Its one thing to match customers or products that have slightly different descriptions using fuzzy logic in your ETL, but how do you do this when your products are in multiple different languages? Although we’ve not actually ended up needing to use it, the experience got me thinking, could SSIS be used to translate values in one language to another?
Microsoft Translator API
Enter Microsoft Translator, a “statistical machine translation system”, that’s available on the Azure Marketplace for what looks to be a fairly reasonable price. For example, you can translate up to 2 million characters a month for free, whereas 4 million a month will set you back £26.36 a month. As an example I’ve translated the word ‘Bonjour’ here, without even telling it what the input language is (the service has auto detect capability):
SSIS and the Microsoft Translator API
Using an example from AdventureWorks, I’d like to see if its possible to translate the French descriptions found in dbo.DimProduct into English. To do this, I’ve set up a very simple data flow that starts off with extracting the top 100 products from Adventure Works that have a French description. There is then a Script Component, which makes the all important call to the Microsoft Translator API, then finally a Union All just to allow me to catch the rows while debugging. Overall the data flow is as follows:
The script component is where all the magic happens, as this contains the call to the API using following MS translator example code. The MS translator code needs adapting to fit into the SSIS environment and rather than re-posting every line of code, I’m just going to highlight the changes I made to get this working with SSIS. My first step was to add a script component with the following output column:
Then within the Edit Script window I added the following additional references:
I then setup my using section as follows:
using System; using System.Data; using System.IO; using Microsoft.SqlServer.Dts.Pipeline.Wrapper; using Microsoft.SqlServer.Dts.Runtime.Wrapper; using System.Xml.Linq; using System.Collections.Generic; using System.Web; using System.Web.UI; using System.Web.UI.WebControls; using System.Net;
The next step is to call the PreExecute() method once in order to get an access token from Azure Marketplace. The key is to pass in a Client ID and a Client Secret, both of which can be setup here.
Finally ProcessInputRow now needs to carry out the translation, which is carried out with the following code:
public override void Input0_ProcessInputRow(Input0Buffer Row) { if (!Row.FrenchInputName_IsNull) { //Speficy from french, to English string uri = "https://api.microsofttranslator.com/v2/Http.svc/Translate?text=" + System.Web.HttpUtility.UrlEncode(Row.FrenchInputName.ToString()) + "&from=fr&to=en"; System.Net.WebRequest translationWebRequest = System.Net.WebRequest.Create(uri); translationWebRequest.Headers.Add("Authorization", headerValue); System.Net.WebResponse response = null; response = translationWebRequest.GetResponse(); Stream stream = response.GetResponseStream(); System.Text.Encoding encode = System.Text.Encoding.GetEncoding("utf-8"); StreamReader translatedStream = new System.IO.StreamReader(stream, encode); System.Xml.XmlDocument xTranslation = new System.Xml.XmlDocument(); xTranslation.LoadXml(translatedStream.ReadToEnd()); Row.Translation = xTranslation.InnerText; } }
Results
Et voila! After compiling the script code and running the package, we can see the following results in the data viewer:
The third column called Translation is the translated English column name from the API, whereas the last column is the correct English translation from Adventure Works. As you can see, for the first few rows its only missing hyphens and capitalisation and the other rows are very close to the correct English translation from Adventure Works. It’s not perfect in all scenarios that I tested, and this approach won’t be suitable for large data volumes, but certainly for the above data it gives a very good match – better than I expected.
Microsoft Translator seems to have done a very good job as a first pass of translating the data. There would be nothing stopping you enhancing this further of course, perhaps with something like DQS or similar cleansing tools, which in turn would mean that you could get the data into good shape before attempting to carry out fuzzy matching.
Introduction to Data Wrangler in Microsoft Fabric
What is Data Wrangler? A key selling point of Microsoft Fabric is the Data Science
Jul
Autogen Power BI Model in Tabular Editor
In the realm of business intelligence, Power BI has emerged as a powerful tool for
Jul
Microsoft Healthcare Accelerator for Fabric
Microsoft released the Healthcare Data Solutions in Microsoft Fabric in Q1 2024. It was introduced
Jul
Unlock the Power of Colour: Make Your Power BI Reports Pop
Colour is a powerful visual tool that can enhance the appeal and readability of your
Jul
Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2
Part 2: Exploring Advanced Functionalities in Databricks Welcome back to our Databricks journey! In this
May
GPT-4 with Vision vs Custom Vision in Anomaly Detection
Businesses today are generating data at an unprecedented rate. Automated processing of data is essential
May
Exploring DALL·E Capabilities
What is DALL·E? DALL·E is text-to-image generation system developed by OpenAI using deep learning methodologies.
May
Using Copilot Studio to Develop a HR Policy Bot
The next addition to Microsoft’s generative AI and large language model tools is Microsoft Copilot
Apr