Slowly-Changing Parent-Child Dimensions – Part 2: The Implementation

Distributing the slides from my talk last night, it was well received and I think everyone was able to take something away.

Credit goes to Marco Russo and Alberto Ferrari for their data modelling ideas, visit www.sqlbi.com

http://dl.dropbox.com/u/25925341/Learning%27s%20from%20large%20scale%20dw%20project.pptx

Hopefully not my last talk as i enjoyed doing it.  Look out for further Bath based events at:

http://avonim.wordpress.com/

All comments welcome

Calvin

First of all, an apology. It’s always hard to recognise whether your posts are useful for people out there and this one slipped through the cracks. After several requests, I thought I’d finally get around to posting the related solution to a post I wrote 2 years ago! Apologies to anyone this could have helped quite some time ago.

To set the scene, I had to build a slowly-changing parent child dimension and manage the relationships accordingly. I described the theory behind the solution in this post, probably best give that a skim if you’re not familiar with the approach.

To summarise – if a parent node changes, you create new SCD records for all descendants, essentially you create historic “branches” as each change happens.

Initial Data Setup

For the example, we’ll use the AdventureWorksDW2012 sample DB as it has an org chart we can very easily play with. You can find the download here.

We’re going to create a new parent-child SCD dimension using dbo.DimOrganization by running the script below:

CREATE TABLE [dbo].[DimOrganizationSCD](
    [OrgKey]         [int]    IDENTITY(1,1),
    [ParentOrgKey]     [int]    NULL,
    [OrganizationID] [int]  NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL,
    [StartDate] datetime NOT NULL,
    [EndDate] datetime NULL,
    [CurrentFlag] [int] NOT NULL
) ON [PRIMARY]

GO

INSERT INTO [dbo].[DimOrganizationSCD]
(ParentOrgKey, OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT -1 [ParentOrgKey]
      ,[OrganizationKey]
      ,[ParentOrganizationKey]
      ,[PercentageOfOwnership]
      ,[OrganizationName]
      ,[CurrencyKey]
      ,'2014-01-01 00:00:000' [StartDate]
      ,NULL [EndDate]
      ,1 [CurrentFlag]
FROM [dbo].[DimOrganization]

UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
    INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID

Take a look in the table and you’ll see something like this:

clip_image001

We now have a separation between the unique business identifier for each organisational entity (OrganizationID), and its historical version (OrganizationKey).

The Org Change Occurs

Next let’s make a change to track – maybe the North American operation has been split off as a subsidiary and we need to insert a hierarchy level for the new brand “SuperCycles US”. We’ll create an ‘updates’ table to represent the delta coming from our ETL and put in two records – the new organisation record for SuperCycles, and the update to North American Operation to represent the new Parent ID.

CREATE TABLE [dbo].[DimOrganization_Updates](
    [OrganizationID] [int]  NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
) ON [PRIMARY]

GO

INSERT INTO [dbo].[DimOrganization_Updates]
VALUES  (15, 1, 1, 'SuperCycles US', 100),
        (2, 15, 1, 'North America Operations', 100)

The ETL Logic

For this example, I’ve written it as one big stored procedure, will all the logic occurring in one script. I’d normally break this into separate procedures with SSIS handling data flows, lookups etc and logging results of individual scripts, but a single SQL flow is easier to demonstrate.

Firstly, we create a couple of variables used within the proc:

--Createa a table used to hold all records impacted by the hierarchy changes
DECLARE @Updates TABLE  (
    [OrganizationID] [int]  NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
);

--Use a common timestamp across all SCD updates
DECLARE @SCDDate datetime = getdate();

Now, as far as any children of the North American Operation go, they technically haven’t changed and so we won’t see an update for them. So our ETL logic needs to take that into account.

We take our update rows and build a recursive CTE that finds all active child rows in our dimension:

--Use a recursive CTE to find all updates AND all descendants of updated records
WITH OrgUpdates AS
(
    --Our Changed Records
    SELECT 
        U.OrganizationID,
        U.ParentOrganizationID, 
        U.PercentageOfOwnership, 
        U.OrganizationName, 
        U.CurrencyKey
    FROM [dbo].[DimOrganization_Updates] U
        LEFT JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID
                                                    and D.CurrentFlag = 1
UNION ALL
    --Our Descendants (this will recurse until no more descendants are found)
    SELECT 
        D.OrganizationID,
        D.ParentOrganizationID, 
        D.PercentageOfOwnership, 
        D.OrganizationName, 
        D.CurrencyKey
    FROM [dbo].[DimOrganizationSCD] D
        INNER JOIN OrgUpdates U on D.ParentOrganizationID = U.OrganizationID
                                                    and D.CurrentFlag = 1
)
--Store the results of the CTE in a table variable
INSERT INTO @Updates
SELECT 
    OrganizationID,
    ParentOrganizationID, 
    PercentageOfOwnership, 
    OrganizationName, 
    CurrencyKey
FROM OrgUpdates

This will loop a maximum of 100 times by default, you’ll need to override this if your parent-child structure has more than 100 levels. Hopefully this isn’t an issue for anyone!

We now have a table variable populated with each of the business IDs that relates to a changed record, or a descendant of a changed record.

From here on, we can follow a standard SCD Type 2 method – we stamp the existing records with an end date & amend the current flag, then insert the new records.

The end-dating of the old records needs to be done first, as it is easier to identify the new records by their current flag once they have been inserted. So the historical update can be something like:

--Mark the existing records as old by finding all related records that are currently active.
--It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new!
UPDATE D
SET EndDate = @SCDDate,
    CurrentFlag = 0
FROM @Updates U
    INNER JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID
                                                    and D.CurrentFlag = 1

And then we insert new records:

--Insert the new rows into the dimension, these will each have new surrogate IDs associated
--These will all have NULL ParentOrganizationKeys as we need to wait for all surrogates to be generated
INSERT INTO [dbo].[DimOrganizationSCD]
(OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT     OrganizationID,
    ParentOrganizationID, 
    PercentageOfOwnership, 
    OrganizationName, 
    CurrencyKey,
    @SCDDate    StartDate,
    NULL        EndDate,
    1            CurrentFlag
FROM @Updates

Note that at this point, we’re taking the full cut of records from our “Updates” table, so this will include any new records. Our “SuperCycles US” record will have been inserted here, along with the updated records.

The newly inserted records have deliberately not had their ParentOrganizationKey populated as for many of the record, the parent’s surrogate key has not yet been generated. We need to run an update on the dimension table after the insert to go through and fill this in.

--Finally, update the ParentOrganizationKey for all Current records
UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
    INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID
WHERE C.CurrentFlag = 1
    AND P.CurrentFlag = 1

And that’s it, we now have a new branch of the organisation chart to represent today’s changes.

Taking another look at our dimension table, we now have a new set of active records representing the current Org hierarchy.

image

Improvements

This particular logic has been written to demonstrate an approach, there are many improvements you may want to make if including in a large, resilient ETL process.

Firstly, as mentioned above, I would separate the logic into individual stored procedures that each returned rowcounts for better logging.

You can also amend the insertion procedure to take original values if you need to mix and match SCD types and include a Type 3 “original value” column.

Finally, you may want to include a HierarchyId type column and populate using the full grandparent>parent>child surrogate key path, this allows you to query for such things as “all descendants” without having to build in recursion. This can vastly improve performance if you’re going to be running aggregations over huge volumes directly on the SQL layer.

Further Reading

Recursive Query using Common Table Expressions

Hierarchal Data

Code Extract

Putting all the code together, our ETL query is as follows:

--Create a table used to hold all records impacted by the hierarchy changes
DECLARE @Updates TABLE  (
    [OrganizationID] [int]  NOT NULL,
    [ParentOrganizationID] [int] NULL,
    [PercentageOfOwnership] [nvarchar](16) NULL,
    [OrganizationName] [nvarchar](50) NULL,
    [CurrencyKey] [int] NULL
);

--Use a common timestamp across all SCD updates
DECLARE @SCDDate datetime = getdate();

--Use a recursive CTE to find all updates AND all descendants of updated records
WITH OrgUpdates AS
(
    --Our Changed Records
    SELECT 
        U.OrganizationID,
        U.ParentOrganizationID, 
        U.PercentageOfOwnership, 
        U.OrganizationName, 
        U.CurrencyKey
    FROM [dbo].[DimOrganization_Updates] U
        LEFT JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID
                                                    and D.CurrentFlag = 1
UNION ALL
    --Our Descendants (this will recurse until no more descendants are found)
    SELECT 
        D.OrganizationID,
        D.ParentOrganizationID, 
        D.PercentageOfOwnership, 
        D.OrganizationName, 
        D.CurrencyKey
    FROM [dbo].[DimOrganizationSCD] D
        INNER JOIN OrgUpdates U on D.ParentOrganizationID = U.OrganizationID
                                                    and D.CurrentFlag = 1
)
--Store the results of the CTE in a table variable
INSERT INTO @Updates
SELECT 
    OrganizationID,
    ParentOrganizationID, 
    PercentageOfOwnership, 
    OrganizationName, 
    CurrencyKey
FROM OrgUpdates

--Mark the existing records as old by finding all related records that are currently active.
--It is very important that we do this before inserting the new records as otherwise we would not be able to distinguish between old and new!
UPDATE D
SET EndDate = @SCDDate,
    CurrentFlag = 0
FROM @Updates U
    INNER JOIN [dbo].[DimOrganizationSCD] D on U.OrganizationID = D.OrganizationID
                                                    and D.CurrentFlag = 1

--Insert the new rows into the dimension, these will each have new surrogate IDs associated
--These will all have NULL ParentOrganizationKeys as we need to wait for all surrogates to be generated
INSERT INTO [dbo].[DimOrganizationSCD]
(OrganizationID, ParentOrganizationID, PercentageOfOwnership, OrganizationName, CurrencyKey, StartDate, EndDate, CurrentFlag)
SELECT     OrganizationID,
    ParentOrganizationID, 
    PercentageOfOwnership, 
    OrganizationName, 
    CurrencyKey,
    @SCDDate    StartDate,
    NULL        EndDate,
    1            CurrentFlag
FROM @Updates

--Finally, update the ParentOrganizationKey for all Current records
UPDATE C
SET ParentOrgKey = P.OrgKey
FROM [dbo].[DimOrganizationSCD] C
    INNER JOIN [dbo].[DimOrganizationSCD] P ON C.ParentOrganizationID = P.OrganizationID
WHERE C.CurrentFlag = 1
    AND P.CurrentFlag = 1