Data Engineering

Getting Started With Databricks on Azure

Posted on 12th February 201816th December 2019 by Hugh Freestone

12
Feb

Databricks is a managed platform running Apache Spark. Spark is a fast general-purpose cluster computing system which provides high-level APIs in Java, Scala, Python and R. Spark programs have a driver program which contain a SparkContext object which co-ordinates processes running independently distributed across worker nodes in the cluster. Spark uses a cluster manager such as YARN to allocate resources to applications.

Databricks allows a user to access the functionality of Spark without any of the hassle of managing a Spark cluster. It also provides the convenience to be able to create and tear down a clusters at will.

Starting Databricks

Creating a Databricks service in Azure is simple. In the Azure portal select “New” then “Data + Analytics” then “Azure Databricks”:

Enter a workspace name, select the subscription, resource group and subscription and click create

Wait a few minutes and the Service will be read for use. Clicking on the Databricks icon will take you the Azure Databricks blade. Click “Launch Workspace” to access the Databricks portal which is something of a different experience than other services in Azure, the screen looks like this:

Note the icons on the dark background on the left which are useful to jump to links

Databricks comes with some excellent documentation so take a moment to review the “Databricks Guide” documentation. We are going to start with something simple, using a notebook. When you select Notebook you will be asked for a name and to choose a language, options are Python, Scala, SQL and R. However one of the features of notebooks in Databricks is that the language is only a default and can be overridden by specifying an alternative language within the notebook.

Having selected Notebook and provided a name and default language you will be presented with an empty notebook. In the top left hand corner you will see the word “Detached”. To use the notebook you you will need to attach the notebook to a Spark cluster or more if you haven’t done this before create a cluster. The dropdown on “Detached” provides this option:-

This will take you to a page such as the on below, Clearly if you are just investigating you will want to minimise the cluster size for now.

Having created a cluster (which will take a few minutes) you can navigate back to the workbook, Attach the workbook to the now running cluster, type a command and using the small arrow on the right hand size execute the command to test everything is working

What to do now, so many options. Well lets load some data and view it.

Databricks has its own file system which will have been deployed for you called the Databricks File System (DBFS). You can instead access Data Lake Store or Blob storage but for now this will do.

Click on “Data” on the left hand side then the “+” icon by tables. This is a little counter intuitive as it doesn’t look like it will lead to an upload option, but it does.

Browse to the file you want to upload, and the UI conveniently tells you where the file can be found in the file system.

Now the file can be accessed from the notebook, the syntax differs slightly depending on what language you choose, here using python the data is read into a dataframe and then output to the screen

As the latest addition to the Azure Analytics stables Databricks comes with great promise. It’s notably well documented already despite only being in preview and the UI is mainly intuitive even if it differs in style somewhat from other Azure analytics options. If you have used Jupyter notebooks before you will appreciate the notebooks interface as a great way to dive in and investigate data. Also once its all deployed its in-memory operation makes it feel fast compared to running small queries on for instance running Hive queries in HDI clusters and USQL queries in Azure Data Lake Analytics.

If you have any questions or comments let me know

Hugh Freestone

Introduction to Data Wrangler in Microsoft Fabric

What is Data Wrangler? A key selling point of Microsoft Fabric is the Data Science

25
Jul

Autogen Power BI Model in Tabular Editor

In the realm of business intelligence, Power BI has emerged as a powerful tool for

12
Jul

Microsoft Healthcare Accelerator for Fabric

Microsoft released the Healthcare Data Solutions in Microsoft Fabric in Q1 2024. It was introduced

09
Jul

Unlock the Power of Colour: Make Your Power BI Reports Pop

Colour is a powerful visual tool that can enhance the appeal and readability of your

09
Jul

Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2

Part 2: Exploring Advanced Functionalities in Databricks Welcome back to our Databricks journey! In this

20
May

GPT-4 with Vision vs Custom Vision in Anomaly Detection

Businesses today are generating data at an unprecedented rate. Automated processing of data is essential

08
May

Exploring DALL·E Capabilities

What is DALL·E? DALL·E is text-to-image generation system developed by OpenAI using deep learning methodologies.

03
May

Using Copilot Studio to Develop a HR Policy Bot

The next addition to Microsoft’s generative AI and large language model tools is Microsoft Copilot

29
Apr

One thought on “Getting Started With Databricks on Azure”

sandeep says:
Thanks @ HUGH.
This is one such good blog I found,If anyone is looking to expand your skill set on data side using Azure Data bricks.It is simple and nice to understand and that’s all you need to go a head with your work.

@HUGH,please give me some info,What storage service should be used with Databricks and why? ADLS Gen1, ADLS Gen2, general storage account (blob container)?

22nd September 2019 at 9:49 am

Comments are closed.

Data Engineering