Data Engineering

The Common Data Model in Azure Data Lake Storage – Export to Data Lake

Posted on 26th February 202013th October 2023 by Jose Mendes

26
Feb

If like me, you’ve been keeping tabs on what Microsoft has been up to on the Power Platform world, you would have noticed that there are two concepts that are regularly referenced in their architectures and generally associated to each other – Azure Data Lake Storage (ADLS) Gen 2 and Common Data Model (CDM).

As my colleague, Francesco Sbrescia, referenced in his blog, Microsoft’s ultimate goal is for the CDM to be the de facto standard data model. However, although there is a fair amount of resources talking about the capabilities and features, it can be a bit confusing to understand how you can actually store your data in the CDM format in ADLS and use it to run data analytics such as data warehousing, Power BI reporting and Machine Learning.

In this blog series I’ll cover 6 different ways to instantiate a CDM model in ADLS:

Export to data lake (Common Data Service)
Power BI Dataflows
Azure Data Services – Databricks
Azure Data Services – Data Factory Data Flows
Synapse Link for Dataverse – Delta Lake Export
Dataverse integration with Microsoft Fabric

Let’s start by having a look at the first option and understand how it works.

Export to data lake (code name Athena) is a Power Apps service that allows Common Data Service (CDS) customers to continuously push their data to ADLS in the CDM format (further details here). The greatest thing about this service is the ability to incrementally replicate the CUD operations (create, update, delete) in ADLS without having to run manual/scheduled triggers. Any data or metadata changes in CDS are automatically pushed to ADLS without any additional action. Read-only snapshots are created every hour in ADLS, allowing data analytics consumers to reliably consume the data.

There are two major aspects to bear in mind before attempting to use this service. First, the storage account must be created in the same Azure AD tenant as your Power Apps tenant and second, the storage account must be created in the same region as the Power Apps environment that will host the applications.

To test this service, I went ahead and created an application based on the CDM Contacts entity (you can follow this link to understand how to create an app using CDM).

The next step is to create a new link to the data lake and select which entities you want to export. As previously mentioned, you will be asked to attach a storage account in the same location of the Power Apps tenant. You will also need to be the Owner of the storage account and the administrator of the CDS.

Once you press save, the CDS environment is linked to ADLS and the selected entities are replicated to the storage account.

On the Power Apps portal you will see a dashboard showing the status (Initial sync and Last sync status), the last synchronised timestamp and the count of records for each selected entity.

On the storage account, you will see two new containers, one for the link you just created and another for the power platform dataflows. Within the first container, you will have a model.json file, a folder for the Trickle Feed Service and a folder for each selected entity.

Model.json

The model.json file describes the data in the CDM folders, metadata and location (further details here). The file allows data consumers to easily understand which entities exist in the storage account and how they are structured and partitioned. The file also contains annotations that provide details about the initial sync status and completed time.

The entities replicated to the data lake are partitioned by year in order to make it more efficient for the consumer applications to consume the files. As per Microsoft, the plan is to eventually partition based on the amount of the data.

The model.json file will only include the location of the latest snapshot. Whenever a new entity snapshot is created, a new model.json snapshot is taken to ensure that the original snapshot viewer may continue to work on the older snapshot files while newer viewers can read the latest updates.

Trickle Feed Service

Trickle feed is the process of supplying continuous small amounts of something. In this scenario, changes in CDS are continuously pushed to the corresponding csv files using the trickle feed engine. Below, is the content of the TrickleFeedService folder.

In this scenario, we have a contact folder, an EntityMetadata and a model json files. The contact folder contains files used by the Trickle Feed service, that are updated every time a CUD operation happens in CDS.

The contact-model.json file is the same as the model.json file, but, will only contain the details about the contact entity. This file is updated whenever there are structural changes in CDS, while the model.json file in the root folder, will only be updated on an hourly basis.

The contact-EntityMetadata.json file contains multiple metadata properties as demonstrated below.

The ModelJsonSnapshots folder contains the model.json snapshots. A read-only snapshot copy is taken every hour, independently of changes in CDS occur.

Entity folder

Each entity will have a folder containing the partitioned csv files and the snapshot copies (at some point, it will also support parquet files). Each file is named after it’s partition, which in this case is year. The 2020 file contains the most up to date version of the CDS data. Whenever there’s a CUD operation, this file is updated, however, the snapshot read-only copy is created only after the hour cycle is completed (as demonstrated in the below screenshots).

Jose Mendes

Head of Data Engineering for Adatis Consulting with over 12 years’ experience in delivering Microsoft Azure/ SQL Data Analytics solutions. Presented for the first time in 2017 at SQLBits and since then has been blogging and speaking about the Bot Framework, Modern Data Platforms, and other cool Azure services.

Introduction to Data Wrangler in Microsoft Fabric

What is Data Wrangler? A key selling point of Microsoft Fabric is the Data Science

25
Jul

Autogen Power BI Model in Tabular Editor

In the realm of business intelligence, Power BI has emerged as a powerful tool for

12
Jul

Microsoft Healthcare Accelerator for Fabric

Microsoft released the Healthcare Data Solutions in Microsoft Fabric in Q1 2024. It was introduced

09
Jul

Unlock the Power of Colour: Make Your Power BI Reports Pop

Colour is a powerful visual tool that can enhance the appeal and readability of your

09
Jul

Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2

Part 2: Exploring Advanced Functionalities in Databricks Welcome back to our Databricks journey! In this

20
May

GPT-4 with Vision vs Custom Vision in Anomaly Detection

Businesses today are generating data at an unprecedented rate. Automated processing of data is essential

08
May

Exploring DALL·E Capabilities

What is DALL·E? DALL·E is text-to-image generation system developed by OpenAI using deep learning methodologies.

03
May

Using Copilot Studio to Develop a HR Policy Bot

The next addition to Microsoft’s generative AI and large language model tools is Microsoft Copilot

29
Apr

23 thoughts on “The Common Data Model in Azure Data Lake Storage – Export to Data Lake”

Amir says:
As the model.json file contains the schema definition and CSV files have no column names.
Then how to consume the Data Lake data for reporting?
Appreciate you answer

5th June 2020 at 5:26 am
- Jose Mendes says:
  Hi Amir,
  
  Is there a specific reporting tool you want to use?
  
  To consume the CDM data from ADLS in Power BI you need to create a Power BI Dataflow as demonstrated in this blog (https://adatis.co.uk/the-common-data-model-in-azure-data-lake-storage-power-bi-dataflows/).
  
  9th June 2020 at 8:29 am
Vigneshvaran says:
Hi Sir,

Am trying to load the on premise SQL DB tables to Azure Data lake.
i have created the corresponding linked services and exported the data from SQL DB to Azure Data Lake.

Now i need to load the incremental push records. This one also i have achieved using change tracking technology.
Now both files are seperate files.
How i need to update the incremental push records into an existing file without disturbing the other records which are not modified.
Because if am trying to merge those two files and it will create the new seperate file only. So finally extra data count occurs (data which are not modified also need to write the new file).
So is there any way to push the incremental records in the existing file without disturbing the unmodified records in existing file.

31st July 2020 at 10:01 am
- Jose Mendes says:
  Hi Vigneshvaran,
  
  Apologies for my late response.
  
  I’m not sure I fully understood your question. How did you export the data from the on premise SQL DB to the Data Lake and how did you export the data from Azure SQL DB to the Data Lake? What is your final goal, migrate the data from an on premise SQL DB to Azure?
  
  19th August 2020 at 2:14 pm
Chandru says:
Hi Sir,
I want to load the data using ADF through common data model. I have exported the data to adls from power apps and have the same folder structure shown by you. The ms website doc converts the model.json to manifest file. The package provided are not accessible and missing in nuget as well. Can u temme how to perform data load thru ADF.

14th October 2020 at 7:04 pm
- Jose Mendes says:
  Apologies, I wasn’t notified of your first post.
  
  16th October 2020 at 4:35 pm
Milind Vengurlekar says:
Hello Jose,

How to update model.json file to accommodate snapshot details?

21st January 2021 at 3:44 am
- Jose Mendes says:
  Hi Milind,
  
  The model.json file is managed by the service and you should not update it. If you update the file, it will be overwritten the next time there’s a structural change in your CDS model.
  
  Is there a particular reason for your request?
  
  21st January 2021 at 9:29 am
Nilesh says:
Hi Jose,

How to create the manifest.json from model.json file ?

25th March 2021 at 9:26 pm
- Jose Mendes says:
  As far as I’m aware there’s no easy straight way to make such conversion. In this series I showed how to do that by using Databricks and ADF Dataflows, but there are limitations in both approaches.
  
  26th March 2021 at 9:58 am
Shashi says:
Hi Josh,
I am working on pulling ADLS v2 entities (dynamics 365 exported data to) to onprem using spark. I am not sure how metadata changes (new field addition, dropping few fields, reorder of fields in dynamics) in dynamics get reflected in ADLS data and how to handle that. for e.g. lets say 2 new field is added to contact entity, now what happens to historical data.

29th May 2021 at 8:35 pm
- Jose Mendes says:
  If you are using the Export to Data Lake feature then all metadata changes are reflected in the lake, providing such entities in Dynamics are configured accordingly. The partitioned files containing the data and the model.json will be updated shortly after the changes are applied in dynamics.
  The partitioned files will always contain the most recent structure. If you require to have access to history changes then you would have to look at the snapshot folders.
  If you are uncertain how often the schema will change, then you could build a process that reads the model.json file and validates the schema before you attempt to copy the data to your final destination from the lake.
  
  2nd June 2021 at 8:39 am
Bhanu says:
Hi ,
As the model.json file contains the schema definition and CSV files have no column names.
Then how do we load data from ADLSG2 to Azure SQL (We have 2020 as one file , 2021 as another file ) . how do we load ?
Appreciate you answer

31st May 2021 at 1:33 pm
- Jose Mendes says:
  It depends how you plan to load your data in Azure SQL. If you are using ADF Data Flow, you have access to the schema due to the Common Data Model source type(as shown here).
  If you are using ADF, then there will be some manual work to ensure you can map the columns accordingly.
  Alternatively, you can always use the Dataverse (Common Data Service for Apps) dataset and connect directly to your dynamics application and skip the data lake.
  
  2nd June 2021 at 9:03 am
Matthew Jones says:
After the initial link between Power apps and the Data lake – the model.json is incomplete and is not updating every hour as you suggest.

When i try to connect to an entity I can see a folder for in the Data lake, using ADF i get an error
Could not read ‘/sbr_area’ from the ‘cdm’ namespace. Reason ‘com.microsoft.commondatamodel.objectmodel.storage.StorageAdapterException: Could not read content at path: /logical/sbr_area

Any ideas?

21st December 2021 at 3:02 pm
- Jose Mendes says:
  Hi Matthew,
  
  Apologies for my late response but I was OOO.
  
  Without understanding your scenario and how it was setup, I can’t offer much support unfortunately. I would suggest to check your setup in Power Apps and see if you can find any error or if you are using a custom entity and you didn’t configured it to track the changes. Without this step, you will not see it correctly synced to the data lake.
  
  With regards the error you are seeing in Data Factory. Please refer to this blog. As you will see, at the time I was testing, Data Factory could not read CDM entity files based on model.json, only the ones using the most recent manifest files. This could be the reason why you are having such errors.
  
  12th January 2022 at 9:32 am
veronica says:
Hi Jose,

I really like your blog and detailed description i have a challenge i need to connect to cdm via databricks can you suggest how i can achieve this

21st May 2022 at 8:35 am
- Jose Mendes says:
  Hi Veronica,
  Apologies for the delay in my response. The post has not been updated for a while, but the principle should still be there (Ingest CDM from Databricks)
  
  24th May 2022 at 9:31 am
Ram says:
I do have similar issue facing. I could see entry in the model.json for a entity but no files inside the entity folder in ADLS container. Is there way to handle if there are no files to load skip that entity for processing. I have seen property “allow no files found” on Dataflow setting but that seems not working.

Please let me know how you handled this scenario.

24th May 2022 at 8:26 am
- Jose Mendes says:
  Hi Ram,
  I have been really busy with work but I will try to find some time and see if I can come back to you with an answer.
  
  24th May 2022 at 9:36 am
Elshan says:
Hi,
is it possible somehow export CDM formated files into SQL server as External Tables in the way that SQL extracts the metadata from the ‘model.json’? If yes, could you give any idea how to do that?

28th June 2022 at 8:40 am
- Jose Mendes says:
  Hi Elshan,
  Apologies for the delay getting back to you. If you want to extract the data from dynamics to Synapse, then you can use Azure Synapse Link for Dataverse
  
  19th July 2022 at 9:28 pm
Wajih says:
Hi Jose,
Thanks for great knowledge sharing. I am using Azure Synapse Link for Dataverse along with ADF to copy data in Azure SQL. However, on each sync there is error that Objects already exists in database. I have deleted Tables and ten executing that, but still getting the same error. I am using Copy Dataverse data into Azure SQL template to do all this.

Regards

Wajih

5th October 2022 at 10:42 am

Comments are closed.

Data Engineering

The Common Data Model in Azure Data Lake Storage – Export to Data Lake

Model.json

Trickle Feed Service