Nested JSON to DataFrame example(Scala)

Loading...

This example notebook shows you how to flatten nested JSON, using only $"column.*" and explode methods.

Start by passing the sample JSON string to the reader.

val json ="""
{
    "id": "0001",
    "type": "donut",
    "name": "Cake",
    "ppu": 0.55,
    "batters":
        {
            "batter":
                [
                    { "id": "1001", "type": "Regular" },
                    { "id": "1002", "type": "Chocolate" },
                    { "id": "1003", "type": "Blueberry" },
                    { "id": "1004", "type": "Devil's Food" }
                ]
        },
    "topping":
        [
            { "id": "5001", "type": "None" },
            { "id": "5002", "type": "Glazed" },
            { "id": "5005", "type": "Sugar" },
            { "id": "5007", "type": "Powdered Sugar" },
            { "id": "5006", "type": "Chocolate with Sprinkles" },
            { "id": "5003", "type": "Chocolate" },
            { "id": "5004", "type": "Maple" }
        ]
}
"""
json: String = " { "id": "0001", "type": "donut", "name": "Cake", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, { "id": "1003", "type": "Blueberry" }, { "id": "1004", "type": "Devil's Food" } ] }, "topping": [ { "id": "5001", "type": "None" }, { "id": "5002", "type": "Glazed" }, { "id": "5005", "type": "Sugar" }, { "id": "5007", "type": "Powdered Sugar" }, { "id": "5006", "type": "Chocolate with Sprinkles" }, { "id": "5003", "type": "Chocolate" }, { "id": "5004", "type": "Maple" } ] } "

Add the JSON string as a collection type and pass it as an input to spark.createDataset. This converts it to a DataFrame. The JSON reader infers the schema automatically from the JSON string.

This sample code uses a list collection type, which is represented as json :: Nil. You can also use other Scala collection types, such as Seq (Scala Sequence).

import org.apache.spark.sql.functions._
import spark.implicits._
val DF= spark.read.json(spark.createDataset(json :: Nil))
import org.apache.spark.sql.functions._ import spark.implicits._ DF: org.apache.spark.sql.DataFrame = [batters: struct<batter: array<struct<id:string,type:string>>>, id: string ... 4 more fields]

Display the DataFrame to view the current state.

display(DF)
 
batters
id
name
ppu
topping
type
1
{"batter": [{"id": "1001", "type": "Regular"}, {"id": "1002", "type": "Chocolate"}, {"id": "1003", "type": "Blueberry"}, {"id": "1004", "type": "Devil's Food"}]}
0001
Cake
0.55
[{"id": "5001", "type": "None"}, {"id": "5002", "type": "Glazed"}, {"id": "5005", "type": "Sugar"}, {"id": "5007", "type": "Powdered Sugar"}, {"id": "5006", "type": "Chocolate with Sprinkles"}, {"id": "5003", "type": "Chocolate"}, {"id": "5004", "type": "Maple"}]
donut

Showing all 1 rows.

Use $"column.*" and explode methods to flatten the struct and array types before displaying the flattened DataFrame.

display(DF.select($"id" as "main_id",$"name",$"batters",$"ppu",explode($"topping")) // Exploding the topping column using explode as it is an array type
        .withColumn("topping_id",$"col.id") // Extracting topping_id from col using DOT form
        .withColumn("topping_type",$"col.type") // Extracting topping_tytpe from col using DOT form
        .drop($"col")
        .select($"*",$"batters.*") // Flattened the struct type batters tto array type which is batter
        .drop($"batters")
        .select($"*",explode($"batter"))
        .drop($"batter")
        .withColumn("batter_id",$"col.id") // Extracting batter_id from col using DOT form
        .withColumn("battter_type",$"col.type") // Extracting battter_type from col using DOT form
        .drop($"col")
       )
 
main_id
name
ppu
topping_id
topping_type
batter_id
battter_type
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
0001
Cake
0.55
5001
None
1001
Regular
0001
Cake
0.55
5001
None
1002
Chocolate
0001
Cake
0.55
5001
None
1003
Blueberry
0001
Cake
0.55
5001
None
1004
Devil's Food
0001
Cake
0.55
5002
Glazed
1001
Regular
0001
Cake
0.55
5002
Glazed
1002
Chocolate
0001
Cake
0.55
5002
Glazed
1003
Blueberry
0001
Cake
0.55
5002
Glazed
1004
Devil's Food
0001
Cake
0.55
5005
Sugar
1001
Regular
0001
Cake
0.55
5005
Sugar
1002
Chocolate
0001
Cake
0.55
5005
Sugar
1003
Blueberry
0001
Cake
0.55
5005
Sugar
1004
Devil's Food
0001
Cake
0.55
5007
Powdered Sugar
1001
Regular
0001
Cake
0.55
5007
Powdered Sugar
1002
Chocolate
0001
Cake
0.55
5007
Powdered Sugar
1003
Blueberry
0001
Cake
0.55
5007
Powdered Sugar
1004
Devil's Food
0001
Cake
0.55
5006
Chocolate with Sprinkles
1001
Regular

Showing all 28 rows.