JSON Files

If your cluster runs Databricks Runtime 4.0 or above, you can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel.

Single-line mode

In this example, there is one JSON object per line:

{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}

To read the JSON data, use code like the following:

val df = spark.read.json("example.json")

Spark infers the schema automatically.

df.printSchema
root
 |-- array: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- dict: struct (nullable = true)
 |    |-- extra_key: string (nullable = true)
 |    |-- key: string (nullable = true)
 |-- int: long (nullable = true)
 |-- string: string (nullable = true)

Multi-line mode

If a JSON object occupies multiple lines, you must enable multi-line mode for Spark to load the file. In this mode, each file is loaded as a whole and cannot be split.

[
    {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
    {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
    {
        "string": "string3",
        "int": 3,
        "array": [
            3,
            6,
            9
        ],
        "dict": {
            "key": "value3",
            "extra_key": "extra_value3"
        }
    }
]

To read the JSON data, enable multi-line mode:

val mdf = spark.read.option("multiline", "true").json("multi.json")
mdf.show(false)
+---------+---------------------+---+-------+
|array    |dict                 |int|string |
+---------+---------------------+---+-------+
|[1, 2, 3]|[null,value1]        |1  |string1|
|[2, 4, 6]|[null,value2]        |2  |string2|
|[3, 6, 9]|[extra_value3,value3]|3  |string3|
+---------+---------------------+---+-------+
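Because dict is inferred as a struct, its nested fields can be pulled out with dot notation. A minimal sketch using the mdf DataFrame above:

import org.apache.spark.sql.functions.col

// Select the top-level string column plus the nested struct fields
mdf.select(col("string"), col("dict.key"), col("dict.extra_key")).show(false)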

Charset auto-detection

By default, Spark detects the charset of input files automatically, but you can always specify the charset explicitly via this option:

spark.read.option("charset", "UTF-16BE").json("fileInUTF16.json")

Some supported charsets include: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.
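Options can be combined. For example, the sketch below reads a multi-line JSON file encoded in UTF-16LE; the path fileInUTF16LE.json is a hypothetical placeholder:

// Read a multi-line JSON file with an explicitly specified charset
// ("fileInUTF16LE.json" is a hypothetical path).
val df16 = spark.read
  .option("charset", "UTF-16LE")
  .option("multiline", "true")
  .json("fileInUTF16LE.json")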

Read JSON Files Example Notebook