Handling Bad Records and Files

When reading data from a file-based data source, Spark SQL faces two typical error cases. First, the files may not be readable (for instance, they could be missing, inaccessible, or corrupted). Second, even if the files are readable, some records may not be parsable (for example, due to syntax errors or schema mismatches).

Databricks provides a unified interface for handling bad records and files without interrupting Spark jobs. You can obtain the exception records/files and the failure reasons by setting the badRecordsPath data source option. badRecordsPath specifies a path where exception files are stored, recording information about bad records for CSV and JSON sources and about bad files for all the file-based built-in sources (for example, Parquet).
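The option is set on the DataFrameReader like any other data source option. The snippet below is a minimal sketch for a CSV source; the input path /tmp/input/csvFile and the two-column schema are hypothetical.

// A minimal sketch, assuming a CSV file at the hypothetical path /tmp/input/csvFile.
// Records that cannot be parsed against the schema are written under /tmp/badRecordsPath
// instead of failing the query.
val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .schema("a int, b int")
  .csv("/tmp/input/csvFile")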

In addition, transient errors, such as a network connection exception or an IO exception, may occur while reading files. These errors are ignored and also recorded under the badRecordsPath, and Spark continues to run the tasks.

The badRecordsPath data source option is supported on Databricks Runtime 3.0 and above.

Examples

Unable to find input file

val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .parquet("/input/parquetFile")

// Delete the input parquet file '/input/parquetFile'
dbutils.fs.rm("/input/parquetFile")

df.show()

In the above example, because df.show() is unable to find the input file, Spark creates an exception file in JSON format to record the error. For example, /tmp/badRecordsPath/20170724T101153/bad_files/xyz is the path of the exception file. This file is under the specified badRecordsPath directory, /tmp/badRecordsPath. 20170724T101153 is the creation time of this DataFrameReader, bad_files is the exception type, and xyz is a file that contains a JSON record, which holds the path of the bad file and the exception/reason message.
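To locate the exception files after a run, you can list the badRecordsPath directory. The sketch below is one way to do this in a Databricks notebook, using the same hypothetical path as above.

// A minimal sketch: list the timestamped subdirectories created under badRecordsPath.
// The directory name (20170724T101153 in the example) differs on every run.
dbutils.fs.ls("/tmp/badRecordsPath").foreach(f => println(f.path))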

Input file contains bad record

// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.text("/tmp/input/jsonFile")

val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .schema("a int, b int")
  .json("/tmp/input/jsonFile")

df.show()

In this example, the DataFrame contains only the first parsable record ({"a": 1, "b": 2}). The second, bad record ({bad-record) is recorded in the exception file, which is a JSON file located at /tmp/badRecordsPath/20170724T114715/bad_records/xyz. The exception file contains the bad record, the path of the file containing the record, and the exception/reason message. After you locate the exception files, you can use a JSON reader to process them, as shown below.
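For example, the sketch below reads the bad_records exception files back into a DataFrame. The wildcard stands in for the timestamped directory name, which changes on every run.

// A minimal sketch: read the bad_records exception files back with the JSON reader.
// Each row contains the bad record, the path of the source file, and the reason message.
val badRecords = spark.read.json("/tmp/badRecordsPath/*/bad_records/*")
badRecords.show(truncate = false)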