Reading JSON Files

Reading JSON files is as easy as reading CSV files in Apache Spark, but there are some specific caveats. Your JSON files should be line-delimited JSON; that is, there should be one complete JSON object per line. More concretely, take a look at the examples below.

Here is an example of well-formed, line-delimited JSON:

{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}

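With line-delimited JSON like this, Spark's built-in JSON reader can load the data directly, producing one row per line. Here is a minimal sketch; the path /tmp/line_delimited.json is just an example.

# save the line-delimited records above to a file, then read them back
# with Spark's built-in JSON reader
line_delimited = """{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}"""

dbutils.fs.put("/tmp/line_delimited.json", line_delimited, True)

df = spark.read.json("/tmp/line_delimited.json")
display(df)
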
Here is an example of malformed (multi-line) JSON:

[
  {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
  {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
  {"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
]

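If you point the default reader at JSON like this, none of the physical lines parses as a standalone JSON object, so Spark typically surfaces the input under a single _corrupt_record column instead of the expected fields. Here is a hypothetical sketch; the path /tmp/not_line_delimited.json is just an example.

# write the multi-line array above to a file, then try to read it with
# the default (line-delimited) JSON reader
malformed = """[
  {"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}},
  {"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}},
  {"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
]"""

dbutils.fs.put("/tmp/not_line_delimited.json", malformed, True)

df = spark.read.json("/tmp/not_line_delimited.json")
df.printSchema()  # typically shows only a single _corrupt_record column
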
In order to read malformed JSON data, you must use something like the code below.

First, create a test file containing a multi-line JSON array:

json_text = """[
{
"id": 1,
"name": "a"
},
{
"id": 2,
"name": "b"
}
]"""

dbutils.fs.put("/tmp/file1.json", json_text, True)

If the data is small enough to parse on the driver, use this approach.

# option 1, small enough or single file to parse on the driver
import json

# parse the JSON string into a list of dictionaries
parsed = json.loads(json_text)

# build a Spark DataFrame from the parsed records
df = spark.createDataFrame(parsed)
display(df)

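If the document lives on DBFS rather than in a local Python variable, you can usually read it on the driver first through the local /dbfs FUSE mount (available on most Databricks clusters) and then parse it the same way. A sketch, using the test file written above:

# read the whole file on the driver via the /dbfs mount, then parse it
import json

with open("/dbfs/tmp/file1.json") as f:
    parsed = json.load(f)

df = spark.createDataFrame(parsed)
display(df)
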
If you have to read it in with Spark, use this approach.

# option 2, multiple files on S3 or too large to bring all down to the driver
import json

# wholeTextFiles returns (path, file_contents) pairs; parse each file's
# contents and flatten the resulting lists of records into a single DataFrame
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
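
On Spark 2.2 and later, the built-in JSON reader also accepts a multiLine option that parses each file as a single JSON document (a top-level array becomes one row per element). This can be a simpler alternative to the wholeTextFiles approach, as long as each individual file fits comfortably in memory. A sketch, using the test file written above:

# alternative: let the built-in reader parse each file as one JSON document
df = spark.read.option("multiLine", "true").json("/tmp/file1.json")
display(df)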