Handling Bad Files/Records

New in version runtime-3.0.

See Databricks Runtime 3.0 for more information.

Description

When reading data from a file-based data source, Spark SQL faces two typical error cases. First, the files may not be readable (for instance, they could be missing, inaccessible, or corrupted). Second, even if the files are processable, some records may not be parsable (for example, due to syntax errors or a schema mismatch).

Databricks provides a unified interface for handling bad records and files without interrupting Spark jobs. By setting the data source option badRecordsPath, you can obtain the bad records or files, together with the reasons they failed, from the exception logs. badRecordsPath specifies a path where exception files are stored. These files record information about bad records for CSV and JSON sources, and about bad files for all file-based built-in sources (e.g., Parquet).

Examples

val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .parquet("/input/parquetFile")

// Delete the input parquet file '/input/parquetFile'
dbutils.fs.rm("/input/parquetFile")

df.show()

In the example above, df.show() cannot find the input file, so Spark creates an exception file in JSON format to record the error. For example, /tmp/badRecordsPath/20170724T101153/bad_files/xyz is the path of the exception file. This file is under the user-specified badRecordsPath directory (i.e., /tmp/badRecordsPath). 20170724T101153 is the creation time of this DataFrameReader, bad_files is the exception type, and xyz is a file containing a JSON record with the path of the bad file and the exception/reason message.
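Because each exception file holds a JSON record, the reports for bad files can be loaded back into a DataFrame. The following is a minimal sketch, not an official API: exceptionGlob is a hypothetical helper introduced here, and the glob simply assumes the directory layout described above (badRecordsPath, then a timestamp directory, then the exception type). The field names of the records are not assumed; printSchema() shows what is actually present.

```scala
import org.apache.spark.sql.SparkSession

// Databricks notebooks predefine `spark`; getOrCreate() returns that same session.
val spark = SparkSession.builder().getOrCreate()

// Hypothetical helper (not part of any Databricks API): builds a glob matching
// the exception files of one type across every timestamped directory.
def exceptionGlob(badRecordsPath: String, exceptionType: String): String = {
  val root = badRecordsPath.stripSuffix("/")
  s"$root/*/$exceptionType/*"
}

// Load every bad-file report written under the path used in the example above.
// This fails with an AnalysisException if no exception file exists yet.
val badFiles = spark.read.json(exceptionGlob("/tmp/badRecordsPath", "bad_files"))

badFiles.printSchema()          // inspect which fields the reports contain
badFiles.show(truncate = false) // print full paths and reason messages
```

Globbing over the timestamp directory avoids hard-coding a value such as 20170724T101153, which changes with every DataFrameReader.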

// Create a JSON file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.text("/tmp/input/jsonFile")

val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .schema("a int, b int")
  .json("/tmp/input/jsonFile")

df.show()

In this example, the DataFrame contains only the first, parsable record ({"a": 1, "b": 2}). The second, bad record ({bad-record) is recorded in the exception file, a JSON file located at /tmp/badRecordsPath/20170724T114715/bad_records/xyz. This exception file contains the bad record, the path of the file containing the record, and the exception/reason message. After you locate the exception files, you can use a JSON reader to process them.
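The same option applies to CSV sources. The sketch below mirrors the JSON example under a few assumptions: the input path /tmp/input/csvFile is made up for illustration, and the row x,y is expected to be treated as a bad record because it cannot be read as two integers. The last lines read the exception files back with the JSON reader, assuming the directory layout shown above.

```scala
import org.apache.spark.sql.SparkSession

// Databricks notebooks predefine `spark`; getOrCreate() returns that same session.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Write a small CSV input: one row that matches the schema and one that does not.
// "/tmp/input/csvFile" is an illustrative path, not one from the original examples.
Seq("1,2", "x,y").toDF().write.mode("overwrite").text("/tmp/input/csvFile")

val csvDf = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .schema("a int, b int")
  .csv("/tmp/input/csvFile")

csvDf.show()

// Exception files are JSON, so the JSON reader can load them back.
// The glob covers every timestamped directory under badRecordsPath.
val badRecords = spark.read.json("/tmp/badRecordsPath/*/bad_records/*")
badRecords.show(truncate = false)
```

Because the glob spans all timestamp directories, the result also includes bad records from earlier reads that used the same badRecordsPath, such as the JSON example.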