Read and write JSON files
JSON (JavaScript Object Notation) is a widely used semi-structured format for data exchange and storage. Databricks supports JSON for both reading and writing with Apache Spark, including single-line and multi-line modes, automatic schema inference, and rescued data. You can read JSON files from cloud storage using the Spark DataFrame API or SQL, and write DataFrames back to JSON.
Prerequisites
Databricks does not require additional configuration to use JSON files.
Options
Use the .option() and .options() methods of DataFrameReader and DataFrameWriter to configure JSON data sources. For a complete list of supported options, see DataFrameReader JSON options and DataFrameWriter JSON options.
Usage
The following examples use the Wanderbricks sample dataset to demonstrate reading and writing JSON files in single-line and multi-line modes using the Spark DataFrame API and SQL.
Write and read JSON files
In single-line mode (the default), each line of the output contains one complete JSON object. Write Wanderbricks reviews to JSON format, then read them back.
- Python
- Scala
# Write wanderbricks reviews to JSON format
df = spark.read.table("samples.wanderbricks.reviews")
df.write.format("json").save("/Volumes/<catalog>/<schema>/<volume>/reviews_json")
# Read the JSON files into a DataFrame
df = spark.read.format("json").load("/Volumes/<catalog>/<schema>/<volume>/reviews_json")
df.printSchema()
display(df)
// Write wanderbricks reviews to JSON format
val reviews = spark.read.table("samples.wanderbricks.reviews")
reviews.write.format("json").save("/Volumes/<catalog>/<schema>/<volume>/reviews_json")
// Read the JSON files into a DataFrame
val df = spark.read.format("json").load("/Volumes/<catalog>/<schema>/<volume>/reviews_json")
df.printSchema()
df.show()
Read multi-line JSON files
In multi-line mode, a single JSON object can span multiple lines. Enable multi-line mode to read JSON files where records are formatted across multiple lines.
- Python
- Scala
- SQL
mdf = spark.read.option("multiline", "true").format("json").load("/Volumes/<catalog>/<schema>/<volume>/multi-line.json")
mdf.show(truncate=False)
val mdf = spark.read.option("multiline", "true").format("json").load("/Volumes/<catalog>/<schema>/<volume>/multi-line.json")
mdf.show(false)
CREATE TEMPORARY VIEW multiLineJsonTable
USING json
OPTIONS (path="/Volumes/<catalog>/<schema>/<volume>/multi-line.json",multiline=true)
Read JSON files using SQL
You can use the read_files table-valued function table-valued function in SQL to read JSON files.
SELECT * FROM read_files(
'/Volumes/<catalog>/<schema>/<volume>/reviews_json',
format => 'json',
multiLine => true)
You can also use USING JSON to read JSON files. However, Databricks recommends using read_files instead of USING JSON because read_files allows the specification of schema and additional file processing options.
DROP TABLE IF EXISTS reviews_json_table;
CREATE TABLE reviews_json_table
USING JSON
OPTIONS (path "/Volumes/<catalog>/<schema>/<volume>/reviews_json", multiline true);
SELECT * FROM reviews_json_table;
Specify character encoding
By default, the charset of input files is detected automatically. You can specify the charset explicitly using the charset option:
- Python
- Scala
- SQL
spark.read.option("charset", "UTF-16BE").format("json").load("/Volumes/<catalog>/<schema>/<volume>/fileInUTF16.json")
spark.read.option("charset", "UTF-16BE").format("json").load("/Volumes/<catalog>/<schema>/<volume>/fileInUTF16.json")
SELECT * FROM read_files(
'/Volumes/<catalog>/<schema>/<volume>/fileInUTF16.json',
format => 'json',
charset => 'UTF-16BE'
)
Some supported charsets include: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32. For the full list of charsets supported by Oracle Java SE, see Supported Encodings.
Enable the rescued data column
The rescued data column ensures that you never lose data during ETL. It captures any data that wasn't parsed because one or more fields in a record have one of the following issues:
- Absent from the provided schema.
- Does not match the data type of the provided schema.
- Has a case mismatch with the field names in the provided schema.
The rescued data column is returned as a JSON blob containing the rescued columns and the source file path of the record.
To enable the rescued data column, set the rescuedDataColumn option to a column name when reading:
- Python
- Scala
- SQL
df = spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load("/Volumes/<catalog>/<schema>/<volume>/reviews_json")
val df = spark.read.option("rescuedDataColumn", "_rescued_data").format("json").load("/Volumes/<catalog>/<schema>/<volume>/reviews_json")
SELECT * FROM read_files(
'/Volumes/<catalog>/<schema>/<volume>/reviews_json',
format => 'json',
rescuedDataColumn => '_rescued_data'
)
To remove the source file path from the rescued data column, set:
spark.conf.set("spark.databricks.sql.rescuedDataColumn.filePath.enabled", "false")
The JSON parser supports three modes when parsing records: PERMISSIVE, DROPMALFORMED, and FAILFAST. When used together with rescuedDataColumn, the following rules apply:
- Data type mismatches do not cause records to be dropped in
DROPMALFORMEDmode or throw an error inFAILFASTmode. - Only corrupt records—that is, incomplete or malformed JSON—are dropped or throw errors.
- If you use the
badRecordsPathoption, data type mismatches are not considered bad records. Only incomplete and malformed JSON records are stored inbadRecordsPath.
Additional resources
- Read and write Parquet files: If your workload is primarily analytical and read-heavy, Parquet's columnar layout offers more efficient query performance than JSON's row-based text format.
- Read and write Avro files: If you are producing or consuming JSON from an event streaming system such as Apache Kafka, Avro provides a more compact binary encoding with schema evolution support.