Parsing nested JSON lists in Databricks using Python

Parsing complex JSON structures is usually not a trivial task. When your destination is a database, what you expect naturally is a flattened result set. Things get more complicated when your JSON source is a web service and the result consists of multiple nested objects including lists in lists and so on. Things get even more complicated if the JSON schema changes over time, which is often a real-life scenario.

We have these wonderful Azure Logic Apps, which help us consistently get the JSON results from various sources. However, Logic Apps are not so good at parsing more complex nested structures. And they definitely don’t like even subtle source schema changes.

Enter Databricks!

With Databricks you get:

  • An easy way to infer the JSON schema and avoid creating it manually
  • Subtle changes in the JSON schema won’t break things
  • The ability to explode nested lists into rows in a very easy way (see the Notebook below)
  • Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe. Here you go:

from pyspark.sql.functions import explode, col
source_json = """
{
    "persons": [
        {
            "name": "John",
            "age": 30,
            "cars": [
                {
                    "name": "Ford",
                    "models": [
                        "Fiesta",
                        "Focus",
                        "Mustang"
                    ]
                },
                {
                    "name": "BMW",
                    "models": [
                        "320",
                        "X3",
                        "X5"
                    ]
                }
            ]
        },
        {
            "name": "Peter",
            "age": 46,
            "cars": [
                {
                    "name": "Huyndai",
                    "models": [
                        "i10",
                        "i30"
                    ]
                },
                {
                    "name": "Mercedes",
                    "models": [
                        "E320",
                        "E63 AMG"
                    ]
                }
            ]
        }
    ]
}
"""
dbutils.fs.put("/tmp/source.json", source_json, True)
source_df = spark.read.option("multiline", "true").json("/tmp/source.json")
# Explode all persons into different rows
persons = source_df.select(explode("persons").alias("persons"))

# Explode all car brands into different rows
persons_cars = persons.select(
   col("persons.name").alias("persons_name")
 , col("persons.age").alias("persons_age")
 , explode("persons.cars").alias("persons_cars_brands")
 , col("persons_cars_brands.name").alias("persons_cars_brand")
)

# Explode all car models into different rows
persons_cars_models = persons_cars.select(
   col("persons_name")
 , col("persons_age")
 , col("persons_cars_brand")
 , explode("persons_cars_brands.models").alias("persons_cars_model")
)
display(persons_cars_models)

We’ve seen here how we can use Databricks to flatten JSON with just a few lines of code.

Keep your eyes open for future Databricks related blogs, which will demonstrate more of the versatility of this great platform.

More on some of the used functions (PySpark 2.3.0 documentation):

16 thoughts on “Parsing nested JSON lists in Databricks using Python

  1. Jeffry says:

    Wonderful work! That is the type of info that are
    meant to be shared around the internet. Shame
    on Google for now not positioning this put up higher!
    Come on over and talk over with my web site . Thanks =)

  2. tennesse volunteers says:

    Good day! This post could not be written any better! Reading through this post reminds me of my good old room mate!
    He always kept talking about this. I will forward this
    page to him. Pretty sure he will have a good read. Many thanks for sharing!

  3. metal roofing  says:

    Greetings from Idaho! I’m bored to tears at work so I decided to browse your site on my iphone during lunch break.

    I love the info you provide here and can’t wait to
    take a look when I get home. I’m surprised at how fast your blog loaded on my cell phone ..
    I’m not even using WIFI, just 3G .. Anyways, amazing site!

  4. v buck generator says:

    It’s the best time to make some plans for the future and
    it is time to be happy. I have read this post and if I
    could I want to suggest you few interesting things or advice.
    Maybe you could write next articles referring to this article.
    I wish to read more things about it!

  5. Denise says:

    I just like the helpful info you supply for your articles.
    I’ll bookmark your weblog and take a look at once more here frequently.
    I’m rather certain I will be informed a lot of new stuff proper
    here! Good luck for the next!

  6. is bitstamp trustworthy says:

    Woah! I’m really loving the template/theme of this site.
    It’s simple, yet effective. A lot of times it’s very hard to get that “perfect balance” between user friendliness and visual appeal.
    I must say that you’ve done a very good job with this.
    Also, the blog loads extremely fast for me on Opera.

    Exceptional Blog!

Comments are closed.