Binary file

Databricks Runtime supports the binary file data source, which reads binary files and converts each file into a single record that contains the raw content and metadata of the file. The binary file data source produces a DataFrame with the following columns and possibly partition columns:

path (StringType): the path of the file
modificationTime (TimestampType): the modification time of the file
length (LongType): the length of the file, in bytes
content (BinaryType): the contents of the file

To read binary files, specify the data source format as binaryFile.
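
For example, a minimal read with no options (a sketch assuming <path-to-dir> points to a directory of binary files) looks like this:

df = spark.read.format("binaryFile").load("<path-to-dir>")
df.printSchema()  # path, modificationTime, length, content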

Options

To load files with paths matching a given glob pattern while keeping the behavior of partition discovery, you can use the pathGlobFilter option. The following code reads all JPG files from the input directory with partition discovery:

df = spark.read.format("binaryFile").option("pathGlobFilter", "*.jpg").load("<path-to-dir>")

If you want to ignore partition discovery and recursively search files under the input directory, use the recursiveFileLookup option. This option searches through nested directories even if their names do not follow a partition naming scheme like date=2019-07-01. The following code reads all JPG files recursively from the input directory and ignores partition discovery:

df = spark.read.format("binaryFile") \
  .option("pathGlobFilter", "*.jpg") \
  .option("recursiveFileLookup", "true") \
  .load("<path-to-dir>")

Similar APIs exist for Scala, Java, and R.

Note

To improve read performance when you load data back, Databricks recommends turning off compression when you save data loaded from binary files:

spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")
df.write.format("delta").save("<path-to-table>")
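
When you load the saved data back, a standard Delta read (a sketch assuming the same <path-to-table> used above) returns the records, including the binary content column:

df = spark.read.format("delta").load("<path-to-table>")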