JSON files

You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.

For more information, see the Apache Spark documentation on JSON Files.

Options

See the following Apache Spark reference articles for supported read and write options.

Rescued data column

note

This feature is supported in Databricks Runtime 8.2 (EoL) and above.

The rescued data column ensures that you never lose data during ETL. It contains any data that wasn't parsed, either because it was missing from the given schema, because there was a type mismatch, or because the casing of the column name in the record or file didn't match the schema. The rescued data column is returned as a JSON blob containing the rescued columns and the source file path of the record.

To remove the source file path from the rescued data column, set the SQL configuration spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false"). To enable the rescued data column, set the option rescuedDataColumn to a column name, such as _rescued_data, with spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load(<path>).

The JSON parser supports three modes when parsing records: PERMISSIVE, DROPMALFORMED, and FAILFAST. When used together with rescuedDataColumn, data type mismatches do not cause records to be dropped in DROPMALFORMED mode or throw an error in FAILFAST mode. Only corrupt records (that is, incomplete or malformed JSON) are dropped or throw errors. Similarly, if you use the option badRecordsPath when parsing JSON together with rescuedDataColumn, data type mismatches are not treated as bad records; only incomplete and malformed JSON records are stored in badRecordsPath.

Examples

The following examples demonstrate how to read and write JSON files using the DataFrame API and SQL.

Single-line mode

Each line contains one JSON object:

JSON
{"string":"string1","int":1,"array":[1,2,3],"dict": {"key": "value1"}}
{"string":"string2","int":2,"array":[2,4,6],"dict": {"key": "value2"}}
{"string":"string3","int":3,"array":[3,6,9],"dict": {"key": "value3", "extra_key": "extra_value3"}}

To read the JSON data, use:

Python
df = spark.read.format("json").load("example.json")
df.printSchema()

Spark infers the schema automatically:

root
 |-- array: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- dict: struct (nullable = true)
 |    |-- extra_key: string (nullable = true)
 |    |-- key: string (nullable = true)
 |-- int: long (nullable = true)
 |-- string: string (nullable = true)

Multi-line mode

The following JSON object spans multiple lines:

JSON
[
  { "string": "string1", "int": 1, "array": [1, 2, 3], "dict": { "key": "value1" } },
  { "string": "string2", "int": 2, "array": [2, 4, 6], "dict": { "key": "value2" } },
  {
    "string": "string3",
    "int": 3,
    "array": [3, 6, 9],
    "dict": {
      "key": "value3",
      "extra_key": "extra_value3"
    }
  }
]

To read this object, enable multi-line mode:

Python
mdf = spark.read.option("multiline", "true").format("json").load("/tmp/multi-line.json")
mdf.show(truncate=False)

Read JSON files using SQL

You can use the read_files table-valued function in SQL to read JSON files.

SQL
SELECT * FROM read_files(
  '<path to json file or folder>',
  format => 'json',
  multiLine => true)

You can also use USING JSON to read JSON files. However, Databricks recommends read_files over USING JSON because read_files lets you specify a schema and additional file processing options.

SQL
DROP TABLE IF EXISTS json_data;

CREATE TABLE json_data
USING JSON
OPTIONS (path "/path/to/file.json", multiline true);

SELECT * FROM json_data;

Charset auto-detection

By default, the charset of input files is detected automatically. You can specify the charset explicitly using the charset option:

Python
spark.read.option("charset", "UTF-16BE").format("json").load("fileInUTF16.json")

Some supported charsets include: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.

Notebook example: Read JSON files

The following notebook demonstrates single-line mode and multi-line mode.

Read JSON files notebook
