Handle bad records and files

Databricks provides a number of options for dealing with files that contain bad records. Examples of bad data include:

  • Incomplete or corrupt records: Mainly observed in text-based file formats like JSON and CSV. For example, a JSON record that doesn’t have a closing brace or a CSV record that doesn’t have as many columns as the header or first record of the CSV file.

  • Mismatched data types: When the value for a column doesn’t have the specified or inferred data type.

  • Bad field names: Can happen in all file formats, when the column name specified in the file or record has a different casing than the corresponding column in the specified or inferred schema.

  • Corrupted files: When a file cannot be read, which might be due to metadata or data corruption in binary file types such as Avro, Parquet, and ORC. On rare occasions, this might be caused by long-lasting transient failures in the underlying storage system.

  • Missing files: A file that was discovered during query analysis and no longer exists at processing time.

Use badRecordsPath

When you set the badRecordsPath option, Spark records exceptions for bad records or files encountered during data loading under the specified path.

In addition to corrupt records and files, errors indicating deleted files, network connection exceptions, IO exceptions, and so on are ignored and recorded under the badRecordsPath.
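
For example, here is a minimal sketch of setting the option so that a record with a mismatched data type is diverted instead of failing the load. The file path, schema, and sample values are illustrative, and it assumes mismatched-type records are handled the same way as the corrupt record in the second example below:

// Creates a JSON file in which the value of "b" in the second record does not match the declared int type
Seq("""{"a": 1, "b": 2}""", """{"a": 3, "b": "oops"}""").toDF().write.format("text").save("/tmp/input/typedJsonFile")

val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .schema("a int, b int")
  .format("json")
  .load("/tmp/input/typedJsonFile")

// Only the record that matches the schema is returned; the mismatched record is
// recorded in an exception file under /tmp/badRecordsPath
df.show()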

Note

Using the badRecordsPath option in a file-based data source has a few important limitations:

  • It is non-transactional and can lead to inconsistent results.

  • Transient errors are treated as failures.

Unable to find input file

val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .format("parquet").load("/input/parquetFile")

// Delete the input parquet file '/input/parquetFile'
dbutils.fs.rm("/input/parquetFile")

df.show()

In the above example, because df.show() cannot find the input file, Spark creates an exception file in JSON format to record the error. For example, /tmp/badRecordsPath/20170724T101153/bad_files/xyz is the path of the exception file. This file is under the specified badRecordsPath directory, /tmp/badRecordsPath. 20170724T101153 is the creation time of the DataFrameReader, bad_files is the exception type, and xyz is a file that contains a JSON record with the path of the bad file and the exception/reason message.
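
For example, a quick way to inspect this layout from a notebook (the timestamped directory name will differ on each run):

// List the timestamped exception directories created under badRecordsPath; each
// contains a bad_files (or bad_records) subdirectory holding the exception files
display(dbutils.fs.ls("/tmp/badRecordsPath"))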

Input file contains bad record

// Creates a json file containing both parsable and corrupted records
Seq("""{"a": 1, "b": 2}""", """{bad-record""").toDF().write.format("text").save("/tmp/input/jsonFile")

val df = spark.read
  .option("badRecordsPath", "/tmp/badRecordsPath")
  .schema("a int, b int")
  .format("json")
  .load("/tmp/input/jsonFile")

df.show()

In this example, the DataFrame contains only the first parsable record ({"a": 1, "b": 2}). The second bad record ({bad-record) is recorded in the exception file, which is a JSON file located in /tmp/badRecordsPath/20170724T114715/bad_records/xyz. The exception file contains the bad record, the path of the file containing the record, and the exception/reason message. After you locate the exception files, you can use a JSON reader to process them.
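
For example, a minimal sketch that loads all exception files back into a DataFrame (the glob pattern assumes the badRecordsPath used above and matches every timestamped run):

// Read every bad_records exception file written under badRecordsPath; each record
// includes the bad record, the source file path, and the exception/reason message
val exceptions = spark.read.json("/tmp/badRecordsPath/*/bad_records/*")
exceptions.show(false)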