It’s been a while since I wrote my last blog post, so I decided to share one of the latest challenges I faced in a project.
Requirements:
- Anonymise any Personally Identifiable Information (PII) data stored in an Azure Data Lake Store (ADLS);
- Anonymise any PII data from customers identified in a configuration file.
Details:
- All PII data should be anonymised as soon as the files land in the DEV or UAT ADLS;
- The PII data landing in the PROD ADLS should only be anonymised if identified in the configuration file.
Goal:
- Create a single U-SQL pattern to achieve the requirements.
Step 1
Reference the assemblies and declare the variables. Pay special attention to the variable @Environment. It is dynamically populated by the Azure Data Factory (ADF) pipeline activity and identifies the environment in which the U-SQL script is executed.
REFERENCE ASSEMBLY [USQL.Core];

USING [USQL].[Core].[Utilities];
USING [USQL].[Core].[Anonymisation];

//Set variables
DECLARE @month string = DateTime.UtcNow.ToString("MM");
DECLARE @day string = DateTime.UtcNow.ToString("dd");
DECLARE @year string = DateTime.UtcNow.ToString("yyyy");
DECLARE @schemaVersion int = 1;

//EXTERNAL lets ADF override the default value at run time
DECLARE EXTERNAL @Environment string = "DEV";

DECLARE @inputLocation string = "RAW/Sensitive/" + @schemaVersion + "/" + @year + "/" + @month + "/" + @day + "/{*}.csv";
DECLARE @outputLocation string = "RAW/Anonymized/" + @schemaVersion + "/" + @year + "/" + @month + "/" + @day + "/Customers.csv";
DECLARE @configLocation string = "RAW/Config/Configuration.csv";
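To show how @Environment can be populated dynamically, here is a sketch of the relevant fragment of an ADF Data Lake Analytics U-SQL activity. The names (activity, script path, pipeline parameter) are illustrative assumptions, not taken from the project:

```json
{
    "name": "AnonymisePII",
    "type": "DataLakeAnalyticsU-SQL",
    "typeProperties": {
        "scriptPath": "scripts/AnonymisePII.usql",
        "parameters": {
            "Environment": "@pipeline().parameters.Environment"
        }
    }
}
```

ADF injects each entry in "parameters" as a DECLARE statement ahead of the script, which is why the script itself uses DECLARE EXTERNAL for @Environment.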
Step 2
Extract the data from the source and configuration file. The configuration file only includes an ID that identifies a customer.
//Extract data from source file
@ExtractSourceData =
    EXTRACT [CustomerId] string,
            [FirstName] string,
            [LastName] string,
            [EmailAddress] string,
            [HomeTel] string,
            [MobileNumber] string,
            [Address] string,
            [PostalCode] string
    FROM @inputLocation
    USING Extractors.Text(delimiter : '|', silent : false, quoting : true, skipFirstNRows : 1);

//Extract data from the configuration file
@ExtractConfigurationData =
    EXTRACT [Id] string
    FROM @configLocation
    USING Extractors.Text(silent : true, quoting : true, skipFirstNRows : 1);
Step 3
Create two rowsets, one to include the distinct list of CustomerId from the source file and the other to include the distinct list of Id from the configuration file.
//Obtain a list of distinct CustomerId from source file
@SelectSourceData =
    SELECT DISTINCT [CustomerId]
    FROM @ExtractSourceData;

//Obtain a list of distinct Ids from configuration file
@SelectConfigurationData =
    SELECT DISTINCT [Id]
    FROM @ExtractConfigurationData;
Step 4
This is one of the most important details in this script. U-SQL does not support self-joins, yet one is needed to ensure we anonymise all data when we are in a non-production environment. To overcome this limitation, we create a new rowset that unions the IDs from the source and configuration files.
//Create a new rowset to use in place of a self-join
@UnionIds =
    SELECT [CustomerId], "" AS [Id]
    FROM @SelectSourceData
    UNION
    SELECT "" AS [CustomerId], [Id]
    FROM @SelectConfigurationData;
Step 5
In this step, we identify which records should and shouldn’t be anonymised. If you remember the requirements: in a non-production environment we have to anonymise all PII data, whereas in production we should only anonymise the records identified in the configuration file. This could easily be achieved with a self-join; because U-SQL doesn’t support one, we use the rowset from the previous step instead.
//Identify records to be anonymised
@FlagAnonymiseRecords =
    SELECT DISTINCT A.[CustomerId],
           [FirstName],
           [LastName],
           [EmailAddress],
           [HomeTel],
           [MobileNumber],
           [Address],
           [PostalCode]
    FROM @ExtractSourceData AS A
    JOIN @UnionIds AS B
        ON A.[CustomerId] == (@Environment == "PROD" ? B.[Id] : B.[CustomerId]);

//Identify records that shouldn't be anonymised
//ANTISEMIJOIN works like a SQL NOT IN
@FlagDoNotAnonymiseRecords =
    SELECT DISTINCT [CustomerId],
           [FirstName],
           [LastName],
           [EmailAddress],
           [HomeTel],
           [MobileNumber],
           [Address],
           [PostalCode]
    FROM @ExtractSourceData AS A
    ANTISEMIJOIN
    (
        SELECT DISTINCT [CustomerId]
        FROM @FlagAnonymiseRecords
    ) AS B
    ON A.[CustomerId] == B.[CustomerId];
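The union trick is easier to see outside U-SQL. Here is a minimal Python sketch (with made-up customer IDs) that mimics the unioned key set and the conditional join, showing that every row is flagged in non-production while only configured customers are flagged in production:

```python
def flag_for_anonymisation(env, source_ids, config_ids):
    """Mimic the U-SQL union + conditional join.

    Each source row contributes (CustomerId, "") and each config row
    contributes ("", Id) to the unioned key set, so joining on Id in
    PROD and on CustomerId elsewhere flips the behaviour.
    """
    union_ids = [(cid, "") for cid in source_ids] + [("", i) for i in config_ids]
    flagged = set()
    for cid in source_ids:
        for src_key, cfg_key in union_ids:
            join_key = cfg_key if env == "PROD" else src_key
            if cid == join_key:
                flagged.add(cid)
    return flagged

source = {"C1", "C2", "C3"}
config = {"C2"}

# Non-production: every customer is flagged.
print(sorted(flag_for_anonymisation("DEV", source, config)))   # ['C1', 'C2', 'C3']
# Production: only customers listed in the configuration file.
print(sorted(flag_for_anonymisation("PROD", source, config)))  # ['C2']
```

The empty-string placeholders are what make the single join condition work for both cases: a real CustomerId never equals "", so the rows contributed by the "other" side of the union simply never match.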
Step 6
Now that we have identified the records that should be anonymised, we can apply the correct mask to each field. This is achieved by using different classes created in an assembly that is registered in Azure Data Lake Analytics (ADLA).
//Anonymise data
@AnonymiseData =
    SELECT [CustomerId],
           Utilities.ReturnLenght([FirstName]) == "0" ? [FirstName] : Anonymisation.AnonymiseForename([CustomerId], [FirstName]) AS [FirstName],
           Utilities.ReturnLenght([LastName]) == "0" ? [LastName] : Anonymisation.AnonymiseSurname([CustomerId], [LastName]) AS [LastName],
           Utilities.ReturnLenght([EmailAddress]) == "0" ? [EmailAddress] : Anonymisation.AnonymiseEmail([EmailAddress]) AS [EmailAddress],
           Utilities.ReturnLenght([HomeTel]) == "0" ? [HomeTel] : Anonymisation.AnonymiseNumbers([HomeTel]) AS [HomeTel],
           Utilities.ReturnLenght([MobileNumber]) == "0" ? [MobileNumber] : Anonymisation.AnonymiseNumbers([MobileNumber]) AS [MobileNumber],
           [Address],
           Utilities.ReturnLenght([PostalCode]) == "0" ? [PostalCode] : Anonymisation.AnonymisePostalCode([PostalCode]) AS [PostalCode]
    FROM @FlagAnonymiseRecords;
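The anonymisation classes live in a custom assembly that isn’t shown in this post. To illustrate the kind of masking they might perform, here is a hypothetical Python sketch of deterministic pseudonymisation: the same customer/value pair always maps to the same token, so anonymised data can still be joined, and empty values pass through like the ReturnLenght guard above:

```python
import hashlib

def anonymise_value(customer_id, value):
    """Deterministic pseudonym: hash the customer ID and value together
    and keep a short token. Empty values are returned unchanged."""
    if not value:  # mirrors the ReturnLenght == "0" guard in the U-SQL
        return value
    digest = hashlib.sha256(f"{customer_id}|{value}".encode()).hexdigest()
    return digest[:8].upper()

# The same input always yields the same mask, so joins survive anonymisation.
print(anonymise_value("C1", "Alice") == anonymise_value("C1", "Alice"))  # True
```

This is only a sketch of one plausible approach; the real AnonymiseForename, AnonymiseSurname, etc. may well use a different technique (format-preserving masking, lookups, and so on).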
Step 7
The last step in this process is to union the anonymised and non-anonymised rowsets and output the file to the ADLS.
//Union anonymised and non-anonymised data
@FullData =
    SELECT [CustomerId], [FirstName], [LastName], [EmailAddress], [HomeTel], [MobileNumber], [Address], [PostalCode]
    FROM @AnonymiseData
    UNION
    SELECT [CustomerId], [FirstName], [LastName], [EmailAddress], [HomeTel], [MobileNumber], [Address], [PostalCode]
    FROM @FlagDoNotAnonymiseRecords;

//Select data for output
@Output =
    SELECT [CustomerId], [FirstName], [LastName], [EmailAddress], [HomeTel], [MobileNumber], [Address], [PostalCode]
    FROM @FullData;

//Output data to destination
OUTPUT @Output
TO @outputLocation
USING Outputters.Text(outputHeader : true, delimiter : '|', quoting : true);
As always, if you have any questions or comments do let me know.