Data Architecture, Data Engineering

Parsing nested JSON lists in Databricks using Python

Posted on 8th June 201816th December 2019 by Ivan Vazharov

08
Jun

Parsing complex JSON structures is usually not a trivial task. When your destination is a database, what you expect naturally is a flattened result set. Things get more complicated when your JSON source is a web service and the result consists of multiple nested objects including lists in lists and so on. Things get even more complicated if the JSON schema changes over time, which is often a real-life scenario.

We have these wonderful Azure Logic Apps, which help us consistently get the JSON results from various sources. However, Logic Apps are not so good at parsing more complex nested structures. And they definitely don’t like even subtle source schema changes.

Enter Databricks!

With Databricks you get:

An easy way to infer the JSON schema and avoid creating it manually
Subtle changes in the JSON schema won’t break things
The ability to explode nested lists into rows in a very easy way (see the Notebook below)
Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe. Here you go:

from pyspark.sql.functions import explode, col

source_json = """
{
    "persons": [
        {
            "name": "John",
            "age": 30,
            "cars": [
                {
                    "name": "Ford",
                    "models": [
                        "Fiesta",
                        "Focus",
                        "Mustang"
                    ]
                },
                {
                    "name": "BMW",
                    "models": [
                        "320",
                        "X3",
                        "X5"
                    ]
                }
            ]
        },
        {
            "name": "Peter",
            "age": 46,
            "cars": [
                {
                    "name": "Huyndai",
                    "models": [
                        "i10",
                        "i30"
                    ]
                },
                {
                    "name": "Mercedes",
                    "models": [
                        "E320",
                        "E63 AMG"
                    ]
                }
            ]
        }
    ]
}
"""

dbutils.fs.put("/tmp/source.json", source_json, True)

source_df = spark.read.option("multiline", "true").json("/tmp/source.json")

# Explode all persons into different rows
persons = source_df.select(explode("persons").alias("persons"))

# Explode all car brands into different rows
persons_cars = persons.select(
   col("persons.name").alias("persons_name")
 , col("persons.age").alias("persons_age")
 , explode("persons.cars").alias("persons_cars_brands")
 , col("persons_cars_brands.name").alias("persons_cars_brand")
)

# Explode all car models into different rows
persons_cars_models = persons_cars.select(
   col("persons_name")
 , col("persons_age")
 , col("persons_cars_brand")
 , explode("persons_cars_brands.models").alias("persons_cars_model")
)

display(persons_cars_models)

We’ve seen here how we can use Databricks to flatten JSON with just a few lines of code.

Keep your eyes open for future Databricks related blogs, which will demonstrate more of the versatility of this great platform.

More on some of the used functions (PySpark 2.3.0 documentation):

Ivan Vazharov

Introduction to Data Wrangler in Microsoft Fabric

What is Data Wrangler? A key selling point of Microsoft Fabric is the Data Science

25
Jul

Autogen Power BI Model in Tabular Editor

In the realm of business intelligence, Power BI has emerged as a powerful tool for

12
Jul

Microsoft Healthcare Accelerator for Fabric

Microsoft released the Healthcare Data Solutions in Microsoft Fabric in Q1 2024. It was introduced

09
Jul

Unlock the Power of Colour: Make Your Power BI Reports Pop

Colour is a powerful visual tool that can enhance the appeal and readability of your

09
Jul

Python vs. PySpark: Navigating Data Analytics in Databricks – Part 2

Part 2: Exploring Advanced Functionalities in Databricks Welcome back to our Databricks journey! In this

20
May

GPT-4 with Vision vs Custom Vision in Anomaly Detection

Businesses today are generating data at an unprecedented rate. Automated processing of data is essential

08
May

Exploring DALL·E Capabilities

What is DALL·E? DALL·E is text-to-image generation system developed by OpenAI using deep learning methodologies.

03
May

Using Copilot Studio to Develop a HR Policy Bot

The next addition to Microsoft’s generative AI and large language model tools is Microsoft Copilot

29
Apr

16 thoughts on “Parsing nested JSON lists in Databricks using Python”

Jeffry says:
Wonderful work! That is the type of info that are
meant to be shared around the internet. Shame
on Google for now not positioning this put up higher!
Come on over and talk over with my web site . Thanks =)

15th September 2019 at 9:19 am
เที่ยวเวียดนาม says:
What’s up, the whole thing is going perfectly here and ofcourse every one is sharing data, that’s in fact excellent, keep up writing.

20th September 2019 at 9:26 pm
tennesse volunteers says:
Good day! This post could not be written any better! Reading through this post reminds me of my good old room mate!
He always kept talking about this. I will forward this
page to him. Pretty sure he will have a good read. Many thanks for sharing!

21st September 2019 at 10:59 pm
Achyuta says:
Thanks Man. this is really helpful.

9th October 2019 at 10:36 am
metal roofing says:
Greetings from Idaho! I’m bored to tears at work so I decided to browse your site on my iphone during lunch break.

I love the info you provide here and can’t wait to
take a look when I get home. I’m surprised at how fast your blog loaded on my cell phone ..
I’m not even using WIFI, just 3G .. Anyways, amazing site!

10th October 2019 at 4:27 am
v buck generator says:
It’s the best time to make some plans for the future and
it is time to be happy. I have read this post and if I
could I want to suggest you few interesting things or advice.
Maybe you could write next articles referring to this article.
I wish to read more things about it!

16th October 2019 at 5:36 pm
church fundraiser says:
Keep this going please, great job!

23rd October 2019 at 6:32 pm
UltraToned Keto Reviews says:
Hi there! I simply want to give you a big thumbs up for your great info you have got here on this post.
I’ll be returning to your web site for more soon.

27th October 2019 at 9:28 am
Magnumax Male Enhancement Review says:
Incredible! This blog looks just like my old one!
It’s on a totally different topic but it has pretty much the same layout and
design. Wonderful choice of colors!

27th October 2019 at 4:09 pm
Sunday Scaries CBD Reviews says:
You have observed very interesting points! ps nice website.

27th October 2019 at 4:46 pm
Ultra Fast Keto Boost Reviews says:
This site really has all the info I needed about this subject and
didn’t know who to ask.

28th October 2019 at 2:30 pm
http://wonderfullketo.org/ says:
I am impressed with this site, rattling I am a fan.

29th October 2019 at 4:37 am
Mahesh says:
Awesome help! Thank you very much.

9th January 2020 at 5:47 am
Denise says:
I just like the helpful info you supply for your articles.
I’ll bookmark your weblog and take a look at once more here frequently.
I’m rather certain I will be informed a lot of new stuff proper
here! Good luck for the next!

25th January 2020 at 5:05 pm
is bitstamp trustworthy says:
Woah! I’m really loving the template/theme of this site.
It’s simple, yet effective. A lot of times it’s very hard to get that “perfect balance” between user friendliness and visual appeal.
I must say that you’ve done a very good job with this.
Also, the blog loads extremely fast for me on Opera.

Exceptional Blog!

27th January 2020 at 5:28 am
Drake says:
Thanks for sharing these tips! I’m always interested in learning more about this.

12th May 2020 at 2:33 pm