Exploring the Potential of Managed Airflow in Azure Data Factory
Microsoft recently announced the integration of Apache Airflow within its Azure cloud offering. Following much anticipation, a public preview of Managed Airflow in Azure Data Factory is now available for testing across several Azure regions. This blog post explains the basics of Airflow, how the Azure integration works and how you can potentially use it to your advantage.
What is Apache Airflow?
Apache Airflow is a community-driven open-source platform to programmatically author, schedule and monitor data workflows. Since its initial development by Airbnb and subsequent adoption by many companies worldwide including Yahoo, Intel and PayPal, it has become an important and popular feature of the Apache ecosystem.
The platform is Python-based and allows for the visual management of DAGs. DAG stands for Directed Acyclic Graph, and users of platforms like Apache Spark and Hadoop might already be familiar with the concept. A DAG resembles a flow chart featuring nodes connected by edges, which represent the various tasks to be executed along with their dependencies. The ‘acyclic’ part of the name refers to the fact that tasks cannot be self-referential or depend on themselves, thus avoiding infinite cycles.
Example of a DAG flow
Within Airflow, DAGs are declared in Python and define tasks for execution. The ability to configure tasks in Python using various operators and libraries makes Airflow dynamic and extensible, and this flexibility is one of the main reasons for its popularity. Another advantage of Airflow is its scalability, which allows for the scheduling and orchestration of various worker nodes on demand. The interface makes it easy to manage and monitor workflows, as well as troubleshoot problems via logs.
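To make this concrete, here is a minimal sketch of how a DAG and its task dependencies are declared in Python. The DAG name, task functions and schedule below are illustrative assumptions, not taken from the preview; the code assumes an Airflow 2.x environment with the standard `PythonOperator`.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder task body - in practice this would pull data from a source
    print("extracting data...")


def transform():
    # Placeholder task body - in practice this would reshape the extracted data
    print("transforming data...")


with DAG(
    dag_id="example_etl",           # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The >> operator declares an edge in the graph:
    # transform only runs once extract has succeeded
    extract_task >> transform_task
```

Because a task cannot appear downstream of itself, the resulting graph is guaranteed to be acyclic, which is what lets the scheduler resolve an execution order.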
Why use Managed Airflow in Azure?
The main advantage of Managed Airflow is combining the best of both worlds – using the flexibility and extensibility of Airflow powered by an active open community alongside the convenience and security of an Azure-managed environment. ADF jobs can now be executed within Airflow DAGs, thus giving a welcome extension to ADF’s orchestration capabilities. Managed Airflow in ADF has the following key features:
- Ease of set-up – Managed Airflow allows for a quick, automated setup of a fully managed environment in Data Factory. Airflow normally has a complex setup involving several components (scheduler, web server, metadata database, workers) and is often run on Kubernetes for that reason, so this is a significant step up in terms of convenience.
- Built-in security and authentication – Secure automatic encryption and Azure Active Directory role-based authentication makes it easier to address security concerns.
- Autoscaling – automatic scaling of worker nodes between a configurable minimum and maximum number of allowed nodes.
- Automatic upgrades – the ADF Managed Airflow version will be upgraded periodically, following the upgrade and patch schedule of open-source Airflow.
- Integration with Azure – Managed Airflow in ADF makes it easy to use other components of Azure such as ADF pipelines, Cosmos DB, Azure Key Vault, Azure Monitor, ADLS Gen 2 to name a few.
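As an illustration of the Azure integration, the Microsoft Azure provider package for Airflow includes an operator for triggering ADF pipeline runs from within a DAG. The sketch below assumes the `apache-airflow-providers-microsoft-azure` package is installed and that an Airflow connection to the Data Factory has been configured; the pipeline, factory and connection names are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.microsoft.azure.operators.data_factory import (
    AzureDataFactoryRunPipelineOperator,
)

with DAG(
    dag_id="run_adf_pipeline",       # illustrative name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,          # trigger manually for this sketch
    catchup=False,
) as dag:
    run_pipeline = AzureDataFactoryRunPipelineOperator(
        task_id="run_copy_pipeline",
        # These values are placeholders - substitute your own resources
        azure_data_factory_conn_id="azure_data_factory_default",
        pipeline_name="CopySalesData",
        resource_group_name="my-resource-group",
        factory_name="my-data-factory",
        # Block until the pipeline run completes so downstream
        # tasks only start after the ADF job has succeeded
        wait_for_termination=True,
    )
```

This is what makes ADF jobs schedulable within Airflow DAGs: the pipeline run becomes an ordinary task node whose success or failure gates the rest of the graph.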
Creating a new Managed Airflow environment.
Why is Managed Airflow important?
This is a welcome announcement and we’re excited to see how Managed Airflow develops within ADF. This feature will give Azure users more customisation and extensibility beyond what ADF’s orchestration normally provides, and the convenience of the managed environment could resolve some of the problems that existing Apache Airflow users are facing. Those who might benefit the most are users seeking a ‘lift and shift’ of their existing DAGs into Azure. We’ll continue exploring and posting about this tool as it is developed further by the Azure and open-source communities.