The Common Data Model in Azure Data Lake Storage – Export to Data Lake

If like me, you’ve been keeping taps on what Microsoft has been up to on the Power Platform world, you would have noticed that there are two concepts that are regularly referenced in their architectures and generally associated to each other, Azure Data Lake Storage (ADLS) Gen 2 and Common Data Model (CDM).

As Francesco referred in his blog, Microsoft ultimate vision is for the CDM to be the de facto standard data model, however, although there is a fair amount of resources talking about the capabilities and features, it can be a bit confusing to understand how you can actually store your data in the CDM format in ADLS and use it to run data analytics such as data warehousing, Power BI reporting and Machine Learning.

In this blog series I’ll cover 5 different ways to instantiate a CDM model in ADLS:

Let’s start by having a look at the first option and understand how it works.

Export to data lake (code name Athena) is a Power Apps service that allows Common Data Service (CDS) customers to continuously push their data to ADLS in the CDM format (further details here). The greatest thing about this service is the ability to incrementally replicate the CUD operations (create, update, delete) in ADLS without having to run manual/scheduled triggers. Any data or metadata changes in CDS are automatically pushed to ADLS without any additional action. Read-only snapshots are created every hour in ADLS, allowing data analytics consumers to reliably consume the data.

There are two major aspects to bear in mind before attempting to use this service. First, the storage account must be created in the same Azure AD tenant as your Power Apps tenant and second, the storage account must be created in the same region as the Power Apps environment that will host the applications.

To test this service, I went ahead and created an application based on the CDM Contacts entity (you can follow this link to understand how to create an app using CDM).

The next step is to create a new link to the data lake and select which entities you want to export. As previously mentioned, you will be asked to attach a storage account in the same location of the Power Apps tenant. You will also need to be the Owner of the storage account and the administrator of the CDS.

Once you press save, the CDS environment is linked to ADLS and the selected entities are replicated to the storage account.

On the Power Apps portal you will see a dashboard showing the status (Initial sync and Last sync status), the last synchronised timestamp and the count of records for each selected entity.

On the storage account, you will see two new containers, one for the link you just created and another for the power platform dataflows. Within the first container, you will have a model.json file, a folder for the Trickle Feed Service and a folder for each selected entity.

Model.json

The model.json file describes the data in the CDM folders, metadata and location (further details here). The file allows data consumers to easily understand which entities exist in the storage account and how they are structured and partitioned. The file also contains annotations that provide details about the initial sync status and completed time.

The entities replicated to the data lake are partitioned by year in order to make it more efficient for the consumer applications to consume the files. As per Microsoft, the plan is to eventually partition based on the amount of the data.

The model.json file will only include the location of the latest snapshot. Whenever a new entity snapshot is created, a new model.json snapshot is taken to ensure that the original snapshot viewer may continue to work on the older snapshot files while newer viewers can read the latest updates.

Trickle Feed Service

Trickle feed is the process of supplying continuous small amounts of something. In this scenario, changes in CDS are continuously pushed to the corresponding csv files using the trickle feed engine. Below, is the content of the TrickleFeedService folder.

In this scenario, we have a contact folder, an EntityMetadata and a model json files. The contact folder contains files used by the Trickle Feed service, that are updated every time a CUD operation happens in CDS.

The contact-model.json file is the same as the model.json file, but, will only contain the details about the contact entity. This file is updated whenever there are structural changes in CDS, while the model.json file in the root folder, will only be updated on an hourly basis.

The contact-EntityMetadata.json file contains multiple metadata properties as demonstrated below.

The ModelJsonSnapshots folder contains the model.json snapshots. A read-only snapshot copy is taken every hour, independently of changes in CDS occur.

Entity folder

Each entity will have a folder containing the partitioned csv files and the snapshot copies (at some point, it will also support parquet files). Each file is named after it’s partition, which in this case is year. The 2020 file contains the most up to date version of the CDS data. Whenever there’s a CUD operation, this file is updated, however, the snapshot read-only copy is created only after the hour cycle is completed (as demonstrated in the below screenshots).

 

6 thoughts on “The Common Data Model in Azure Data Lake Storage – Export to Data Lake

  1. Avatar
    Amir says:

    As the model.json file contains the schema definition and CSV files have no column names.
    Then how to consume the Data Lake data for reporting?
    Appreciate you answer

  2. Avatar
    Vigneshvaran says:

    Hi Sir,

    Am trying to load the on premise SQL DB tables to Azure Data lake.
    i have created the corresponding linked services and exported the data from SQL DB to Azure Data Lake.

    Now i need to load the incremental push records. This one also i have achieved using change tracking technology.
    Now both files are seperate files.
    How i need to update the incremental push records into an existing file without disturbing the other records which are not modified.
    Because if am trying to merge those two files and it will create the new seperate file only. So finally extra data count occurs (data which are not modified also need to write the new file).
    So is there any way to push the incremental records in the existing file without disturbing the unmodified records in existing file.

    • Jose Mendes
      Jose Mendes says:

      Hi Vigneshvaran,

      Apologies for my late response.

      I’m not sure I fully understood your question. How did you export the data from the on premise SQL DB to the Data Lake and how did you export the data from Azure SQL DB to the Data Lake? What is your final goal, migrate the data from an on premise SQL DB to Azure?

  3. Avatar
    Chandru says:

    Hi Sir,
    I want to load the data using ADF through common data model. I have exported the data to adls from power apps and have the same folder structure shown by you. The ms website doc converts the model.json to manifest file. The package provided are not accessible and missing in nuget as well. Can u temme how to perform data load thru ADF.

Leave a Reply

Your email address will not be published. Required fields are marked *