Reading JSON Files

Reading JSON files is as easy as reading CSV files in Apache Spark.

If your cluster is running Databricks Runtime 4.0 and above, JSON files can be read in single-line or multi-line mode. Previous versions support only single-line mode. In single-line mode, a file can be split into many parts and read in parallel.

Single-line mode

In this example, there is one JSON object per line:

{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}

To read the JSON data, use code like the following:

val df = spark.read.json("example.json")
df.printSchema()

Spark infers the schema automatically.

root
 |-- array: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- dict: struct (nullable = true)
 |    |-- extra_key: string (nullable = true)
 |    |-- key: string (nullable = true)
 |-- int: long (nullable = true)
 |-- string: string (nullable = true)
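
The single-line read can be tried end to end with a sketch like the one below. It assumes a local `SparkSession`; the object name `SingleLineJsonExample`, the helper `load`, and the temporary file are illustrative and not part of the original example. It also shows one way to reach a nested field of the inferred `dict` struct with dot notation.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object SingleLineJsonExample {
  // The three sample records, one JSON object per line.
  val records: Seq[String] = Seq(
    """{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}""",
    """{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}""",
    """{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}"""
  )

  // Writes the records to a temporary file and reads them back in single-line mode.
  def load(spark: SparkSession): DataFrame = {
    val path = Files.createTempFile("example", ".json")
    Files.write(path, records.mkString("\n").getBytes(StandardCharsets.UTF_8))
    spark.read.json(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-line-json")
      .master("local[*]") // assumption: local run, not a cluster
      .getOrCreate()

    val df = load(spark)
    df.printSchema()

    // Nested struct fields are addressed with dot notation;
    // extra_key is null for records that do not define it.
    df.select("string", "dict.key", "dict.extra_key").show(false)

    spark.stop()
  }
}
```

Because `extra_key` appears in only one record, Spark adds it to the inferred struct as a nullable field rather than rejecting the other records.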

Multi-line mode

If a JSON object occupies multiple lines, you must enable multi-line mode for Spark to load the file. In this mode, each file is loaded as a whole and cannot be split.

[
    {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
    {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
    {
        "string": "string3",
        "int": 3,
        "array": [
            3,
            6,
            9
        ],
        "dict": {
            "key": "value3",
            "extra_key": "extra_value3"
        }
    }
]

To read the JSON data, you should enable multi-line mode:

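val mdf = spark.read.option("multiline", "true").json("multi.json")
mdf.show(false)

+---------+---------------------+---+-------+
|array    |dict                 |int|string |
+---------+---------------------+---+-------+
|[1, 2, 3]|[null,value1]        |1  |string1|
|[2, 4, 6]|[null,value2]        |2  |string2|
|[3, 6, 9]|[extra_value3,value3]|3  |string3|
+---------+---------------------+---+-------+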
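
The multi-line read can be reproduced with a sketch like the one below. It assumes a local `SparkSession`; the object name `MultiLineJsonExample`, the helper `load`, and the temporary file are illustrative and not part of the original example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiLineJsonExample {
  // A top-level JSON array whose last element spans several lines.
  val json: String =
    """[
      |  {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
      |  {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
      |  {
      |    "string": "string3",
      |    "int": 3,
      |    "array": [3, 6, 9],
      |    "dict": {"key": "value3", "extra_key": "extra_value3"}
      |  }
      |]""".stripMargin

  // Writes the document to a temporary file and reads it back in multi-line mode.
  def load(spark: SparkSession): DataFrame = {
    val path = Files.createTempFile("multi", ".json")
    Files.write(path, json.getBytes(StandardCharsets.UTF_8))
    spark.read.option("multiline", "true").json(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi-line-json")
      .master("local[*]") // assumption: local run, not a cluster
      .getOrCreate()

    // Each element of the top-level array becomes one row.
    load(spark).show(false)

    spark.stop()
  }
}
```

Since the whole file is parsed as one JSON document, this mode trades parallelism within a file for the ability to handle pretty-printed input.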

Charset auto-detection

By default, Spark detects the charset of input files automatically, but you can always specify the charset explicitly via this option:

spark.read.option("charset", "UTF-16BE").json("fileInUTF16.json")

Some supported charsets include: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.
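
An explicit charset can be exercised with a sketch like the one below, which writes a small file in a chosen encoding and reads it back. It assumes a local `SparkSession`; the object name `CharsetJsonExample` and the helper `load` are illustrative. The sketch also enables multi-line mode, on the assumption that some open-source Spark versions require either multi-line mode or an explicit line separator when a non-UTF-8 charset is set; on a runtime without that restriction the extra option may be unnecessary.

```scala
import java.nio.charset.Charset
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object CharsetJsonExample {
  // Writes one JSON object in the given charset and reads it back with that charset set explicitly.
  def load(spark: SparkSession, charsetName: String): DataFrame = {
    val json = """{"string":"string1","int":1}"""
    val path = Files.createTempFile("fileInCharset", ".json")
    // Encode the text in the requested charset (UTF-16BE output carries no byte order mark).
    Files.write(path, json.getBytes(Charset.forName(charsetName)))

    spark.read
      .option("charset", charsetName)
      .option("multiline", "true") // assumption: avoids needing an explicit line separator
      .json(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-charset")
      .master("local[*]") // assumption: local run, not a cluster
      .getOrCreate()

    load(spark, "UTF-16BE").show(false)

    spark.stop()
  }
}
```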