If like me, you’ve been keeping taps on what Microsoft has been up to on the Power Platform world, you would have noticed that there are two concepts that are regularly referenced in their architectures and generally associated to each other, Azure Data Lake Storage (ADLS) Gen 2 and Common Data Model (CDM).
As Francesco referred in his blog, Microsoft ultimate vision is for the CDM to be the de facto standard data model, however, although there is a fair amount of resources talking about the capabilities and features, it can be a bit confusing to understand how you can actually store your data in the CDM format in ADLS and use it to run data analytics such as data warehousing, Power BI reporting and Machine Learning.
In this blog series I’ll cover 5 different ways to instantiate a CDM model in ADLS:
- Export to data lake (Common Data Service)
- Power BI Dataflows
- Azure Data Services – Databricks
- Azure Data Services – Data Factory Data Flows
- CDM SDK
As John Macintyre brilliantly summarised back in late 2018, “Azure Data Services enable advanced analytics that let you maximize the business value of data stored in CDM folders in the data lake. Data engineers and data scientists can use Azure Databricks and Azure Data Factory dataflows to cleanse and reshape data, ensuring it is accurate and complete. Data from different sources and in different formats can be normalized, reformatted, and merged to optimize the data for analytics processing. Data scientists can use Azure Machine Learning to define and train machine learning models on the data, enabling predictions and recommendations that can be incorporated into BI dashboards and reports, and used in production applications. Data engineers can use Azure Data Factory to combine data from CDM folders with data from across the enterprise to create an historically accurate, curated enterprise-wide view of data in Azure SQL Data Warehouse. At any point, data processed by any Azure Data Service can be written back to new CDM folders, to make the insights created in Azure accessible to Power BI and other CDM-enabled apps or tools.”
In my previous blog, I demonstrated how to ingest, expand and write CDM entities to ADLS using Azure Databricks. This time, I will show how to achieve similar results with Azure Data Factory Data Flow and cover a few limitations and workarounds. Mapping data flows are visually designed data transformations in Azure Data Factory that allows data engineers to apply data transformations without writing code.
First, we create a new Data Flow transformation in Data Factory and configure the Integration Runtime. When enabling data flow debug, we can interactively see the results of each transformation step while building and debugging the data flow.
The CDM public preview connector can read CDM entity files, using manifest files and model.json, however, it can only write using manifest files. The other caveat is that the writing operation will only work if the entity definition is already available. It is not possible, at time of writing, to create new CDM entities using this method, as opposed to the CDM library available in Azure Databricks.
First, I decided to read an entity that was populated in ADLS via the Export to Data Lake service from Power Apps. When selecting a source type, we select the inline dataset Common Data Model.
Next, we need to select Model.json as the Metadata format and populate the Root Location and Entity. When exporting from CDS, the model.json file is stored in the root folder. This caused issues to many people trying to use Data Flows to transform the data since the folder path can’t be left empty. This solution worked for me, but, I can’t guarantee it’ll work for everyone. Simply add / to the File Path, as shown below.
Because we are using an inline dataset, we have to manually import the schema to apply further mappings and transformations, as opposed to using datasets. This can easily be done by importing the projection.
When opening Data preview, we can see all conditions were set accordingly, because we can visualise the data stored in ADLS.
In the second test I decided to try and load a CDM entity produced by Power BI Dataflows. In terms of configuration, the only difference is that this time the model.json is not stored in the root folder and I could provide a folder path.
All seemed very promising until I attempted to preview the data. Although I could import the projection no data was actually being consumed. After some digging, I found that consuming entities produced by Power BI Dataflow is still not possible as they are not compatible.
In the third test, I attempted to load a CDM entity produced by Azure Databricks. Loading entities based on the manifest file require more details, so before jumping into the configuration, let’s have a quick look at the entity structure in ADLS.
The mandatory fields are Root Location, Schema linked service, Container, Corpus Folder and Entity. Although not marked as required, if the corpus folder is not provided, you’ll get an error when trying to import projection or visualise the data. In my scenario, the manifest file was saved with the default name. If it had a different name then the details would have to be provided as well.
Similar to what happened with the previous test, I could import the projection but couldn’t preview the data. I thought it could be incompatible with the structure created by Databricks, however, that shouldn’t be a problem considering I could write to the same entity. In the next blog, I’ll populate the entity using the CDM SDK and will re-validate this scenario, but for now, it remains a mystery that might be solved once this feature goes GA.
In the final test, I tried to expand the CDS entity (the only one that was working so far) by joining Customer data extracted from Adventure Works.
1 – Read the CDS Entity
2 – Read the Customer data from Adventure Works. In this scenario, I select the columns I needed, renamed equivalent columns according to the CDS naming convention and left with the original name the columns that I wanted to add to the CDM entity.
SELECT CAST(CustomerID AS NVARCHAR(10)) AS contactid ,Title AS salutation ,FirstName AS firstname ,MiddleName AS middlename ,LastName AS lastname ,Suffix AS suffix ,CompanyName ,SalesPerson ,EmailAddress AS emailaddress1 ,Phone AS telephone1 FROM [SalesLT].[Customer]
3 – LEFT JOIN the two sources using Contact ID. This produces 255 column (245 columns from the CDS +10 columns from Adventure Works)
4 – Aggregate the columns with the same name. This produces 247 columns (245 columns with same name + 2 new columns)
5 – Write the entity to ADLS in the parquet format. As mentioned, for this to work, we need to have the entity definition previously stored in ADLS. Although it’s impossible to write a new entity without a definition, we can still change the entity by adding/removing columns, as shown in this test.