Skip to main content

Read and write JSON files

JSON (JavaScript Object Notation) is a widely used semi-structured format for data exchange and storage. Databricks supports JSON for both reading and writing with Apache Spark, including single-line and multi-line modes, automatic schema inference, and rescued data. You can read JSON files from cloud storage using the Spark DataFrame API or SQL, and write DataFrames back to JSON.

Prerequisites

Databricks does not require additional configuration to use JSON files.

Options

Use the .option() and .options() methods of DataFrameReader and DataFrameWriter to configure JSON data sources. For a complete list of supported options, see DataFrameReader JSON options and DataFrameWriter JSON options.

Usage

The following examples use the Wanderbricks sample dataset to demonstrate reading and writing JSON files in single-line and multi-line modes using the Spark DataFrame API and SQL.

Write and read JSON files

In single-line mode (the default), each line of the output contains one complete JSON object. Write Wanderbricks reviews to JSON format, then read them back.

Python
# Write wanderbricks reviews to JSON format
df = spark.read.table("samples.wanderbricks.reviews")
df.write.format("json").save("/Volumes/<catalog>/<schema>/<volume>/reviews_json")

# Read the JSON files into a DataFrame
df = spark.read.format("json").load("/Volumes/<catalog>/<schema>/<volume>/reviews_json")
df.printSchema()
display(df)

Read multi-line JSON files

In multi-line mode, a single JSON object can span multiple lines. Enable multi-line mode to read JSON files where records are formatted across multiple lines.

Python
mdf = spark.read.option("multiline", "true").format("json").load("/Volumes/<catalog>/<schema>/<volume>/multi-line.json")
mdf.show(truncate=False)

Read JSON files using SQL

You can use the read_files table-valued function table-valued function in SQL to read JSON files.

SQL
SELECT * FROM read_files(
'/Volumes/<catalog>/<schema>/<volume>/reviews_json',
format => 'json',
multiLine => true)

You can also use USING JSON to read JSON files. However, Databricks recommends using read_files instead of USING JSON because read_files allows the specification of schema and additional file processing options.

SQL
DROP TABLE IF EXISTS reviews_json_table;

CREATE TABLE reviews_json_table
USING JSON
OPTIONS (path "/Volumes/<catalog>/<schema>/<volume>/reviews_json", multiline true);

SELECT * FROM reviews_json_table;

Specify character encoding

By default, the charset of input files is detected automatically. You can specify the charset explicitly using the charset option:

Python
spark.read.option("charset", "UTF-16BE").format("json").load("/Volumes/<catalog>/<schema>/<volume>/fileInUTF16.json")

Some supported charsets include: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.

Enable the rescued data column

The rescued data column ensures that you never lose data during ETL. It captures any data that wasn't parsed because one or more fields in a record have one of the following issues:

  • Absent from the provided schema.
  • Does not match the data type of the provided schema.
  • Has a case mismatch with the field names in the provided schema.

The rescued data column is returned as a JSON blob containing the rescued columns and the source file path of the record.

To enable the rescued data column, set the rescuedDataColumn option to a column name when reading:

Python
df = spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load("/Volumes/<catalog>/<schema>/<volume>/reviews_json")

To remove the source file path from the rescued data column, set:

Python
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false")

The JSON parser supports three modes when parsing records: PERMISSIVE, DROPMALFORMED, and FAILFAST. When used together with rescuedDataColumn, the following rules apply:

  • Data type mismatches do not cause records to be dropped in DROPMALFORMED mode or throw an error in FAILFAST mode.
  • Only corrupt records—that is, incomplete or malformed JSON—are dropped or throw errors.
  • If you use the badRecordsPath option, data type mismatches are not considered bad records. Only incomplete and malformed JSON records are stored in badRecordsPath.

Additional resources

  • Read and write Parquet files: If your workload is primarily analytical and read-heavy, Parquet's columnar layout offers more efficient query performance than JSON's row-based text format.
  • Read and write Avro files: If you are producing or consuming JSON from an event streaming system such as Apache Kafka, Avro provides a more compact binary encoding with schema evolution support.