JSON files
You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.
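Why the split matters can be sketched in plain Python (not Spark): with one complete JSON object per line, a reader can cut the file at any newline and parse each piece independently, while a multi-line document must be parsed as a whole. The file contents below are hypothetical.

```python
import json

# Single-line mode: one complete JSON object per line, so the input can
# be split at any newline and each part parsed independently. This is
# what makes parallel reads possible.
single_line = '{"id": 1}\n{"id": 2}\n{"id": 3}'
records = [json.loads(line) for line in single_line.splitlines()]

# Multi-line mode: one JSON value spans several lines, so the input must
# be parsed as a whole entity and cannot be split.
multi_line = '[\n  {"id": 1},\n  {"id": 2}\n]'
whole = json.loads(multi_line)

print(len(records), len(whole))  # → 3 2
```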
For more information, see the Apache Spark documentation on JSON Files.
Options
See the following Apache Spark reference articles for supported read and write options.
Rescued data column
This feature is supported in Databricks Runtime 8.2 (EoL) and above.
The rescued data column ensures that you never lose data during ETL. It contains any data that wasn't parsed, either because it was missing from the given schema, because there was a type mismatch, or because the casing of the column name in the record or file didn't match the schema. The rescued data column is returned as a JSON blob containing the rescued columns and the source file path of the record.

To enable the rescued data column, set the option rescuedDataColumn to a column name, such as _rescued_data, with spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load(<path>).

To remove the source file path from the rescued data column, set the SQL configuration spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false").
The JSON parser supports three modes when parsing records: PERMISSIVE, DROPMALFORMED, and FAILFAST. When used together with rescuedDataColumn, data type mismatches do not cause records to be dropped in DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records—that is, incomplete or malformed JSON—are dropped or throw errors. If you use the option badRecordsPath when parsing JSON, data type mismatches are not considered bad records when using the rescuedDataColumn. Only incomplete and malformed JSON records are stored in badRecordsPath.
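The behavior described above can be illustrated with a minimal pure-Python sketch (not Spark's implementation): a record's fields are checked against a declared schema, and anything that doesn't fit, whether an unknown column or a type mismatch, lands in a `_rescued_data` JSON blob together with the file path. The schema, function name, and file path below are hypothetical.

```python
import json

# Hypothetical declared schema: column name -> expected Python type.
SCHEMA = {"string": str, "int": int}

def parse_with_rescue(record: dict, source_file: str) -> dict:
    """Parse one record, collecting unparseable data instead of dropping it."""
    parsed, rescued = {}, {}
    for key, value in record.items():
        expected = SCHEMA.get(key)
        if expected is None:
            # Column is missing from the schema: rescue it.
            rescued[key] = value
        elif not isinstance(value, expected):
            # Type mismatch: the typed column becomes null, the raw value is rescued.
            parsed[key] = None
            rescued[key] = value
        else:
            parsed[key] = value
    if rescued:
        rescued["_file_path"] = source_file  # source file path of the record
        parsed["_rescued_data"] = json.dumps(rescued)
    return parsed

row = parse_with_rescue({"string": "a", "int": "oops", "extra": 7}, "/data/f.json")
print(row["_rescued_data"])
```

Note that the record itself is well-formed JSON, so nothing is dropped or raised; only truly malformed JSON would be treated as a corrupt record.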
Examples
The following examples demonstrate how to read and write JSON files using the DataFrame API and SQL.
Single-line mode
Each line contains one JSON object:
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}
To read the JSON data, use:
- Python
- Scala
df = spark.read.format("json").load("example.json")
df.printSchema()
val df = spark.read.format("json").load("example.json")
df.printSchema
Spark infers the schema automatically:
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- dict: struct (nullable = true)
| |-- extra_key: string (nullable = true)
| |-- key: string (nullable = true)
|-- int: long (nullable = true)
|-- string: string (nullable = true)
Multi-line mode
The following JSON object spans multiple lines:
[
{ "string": "string1", "int": 1, "array": [1, 2, 3], "dict": { "key": "value1" } },
{ "string": "string2", "int": 2, "array": [2, 4, 6], "dict": { "key": "value2" } },
{
"string": "string3",
"int": 3,
"array": [3, 6, 9],
"dict": {
"key": "value3",
"extra_key": "extra_value3"
}
}
]
To read this object, enable multi-line mode:
- Python
- Scala
- SQL
mdf = spark.read.option("multiline", "true").format("json").load("/tmp/multi-line.json")
mdf.show(truncate=False)
val mdf = spark.read.option("multiline", "true").format("json").load("/tmp/multi-line.json")
mdf.show(false)
CREATE TEMPORARY VIEW multiLineJsonTable
USING json
OPTIONS (path="/tmp/multi-line.json",multiline=true)
Read JSON files using SQL
You can use the read_files table-valued function in SQL to read JSON files.
SELECT * FROM read_files(
'<path to json file or folder>',
format => 'json',
multiLine => true)
You can also use USING JSON to read JSON files. However, Databricks recommends read_files over USING JSON because read_files lets you specify a schema and additional file processing options.
DROP TABLE IF EXISTS json_data;
CREATE TABLE json_data
USING JSON
OPTIONS (path "/path/to/file.json", multiline true);
SELECT * FROM json_data;
Charset auto-detection
By default, the charset of input files is detected automatically. You can specify the charset explicitly using the charset option:
spark.read.option("charset", "UTF-16BE").format("json").load("fileInUTF16.json")
Some supported charsets include: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.
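A small pure-Python sketch (not Spark) of why the charset matters: the same JSON payload encoded in UTF-8 and UTF-16BE produces different byte streams, and a reader that assumes the wrong charset cannot parse the bytes correctly.

```python
import json

# The same JSON text encoded in two charsets. UTF-16BE uses two bytes
# per ASCII character, so the byte streams differ even though the
# decoded content is identical.
payload = '{"string": "string1"}'
utf8_bytes = payload.encode("utf-8")
utf16be_bytes = payload.encode("utf-16-be")

assert json.loads(utf8_bytes.decode("utf-8")) == json.loads(utf16be_bytes.decode("utf-16-be"))
print(len(utf8_bytes), len(utf16be_bytes))  # → 21 42
```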
Notebook example: Read JSON files
The following notebook demonstrates single-line mode and multi-line mode.