1. ETL images into a Delta table
- Use the flowers dataset hosted under /databricks-datasets.
- Use Auto Loader with the binary file data source to load images into a Delta table.
- Extract image metadata and store it together with the image data.
- Use Delta Lake to simplify data management.
The flowers dataset
This example uses the flowers dataset from the TensorFlow team. It contains flower photos stored under five sub-directories, one per class, and is available in Databricks Datasets for easy access.
Use Auto Loader with the binary file data source to load images into a Delta table
Databricks Runtime supports the binary file data source, which reads binary files and converts each file into a single record that contains the raw content and metadata of the file.
Auto Loader (cloudFiles data source) incrementally and efficiently processes existing and new data files as they arrive.
Auto Loader supports two modes for detecting new files. This notebook demonstrates the default directory listing mode. The file notification mode might provide better performance. For more information, see the Auto Loader documentation (AWS|Azure|GCP).
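A minimal sketch of the streaming read described above, using the default directory listing mode. It assumes a notebook-provided `spark` session; the helper name `read_image_stream`, the `*.jpg` filter, and the `flower_photos` sub-path in the usage comment are assumptions for illustration, not taken from the text. With the binary file format, each record carries the columns `path`, `modificationTime`, `length`, and `content`.

```python
def read_image_stream(spark, source_dir, glob="*.jpg"):
    """Return a streaming DataFrame of binary files discovered by Auto Loader."""
    return (
        spark.readStream
        .format("cloudFiles")                        # Auto Loader
        .option("cloudFiles.format", "binaryFile")   # one record per file
        .option("recursiveFileLookup", "true")       # descend into class sub-directories
        .option("pathGlobFilter", glob)              # keep only matching file names
        .load(source_dir)
    )

# In a Databricks notebook, where `spark` is predefined (path is an assumption):
# images = read_image_stream(spark, "/databricks-datasets/flower_photos")
```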
Expand the DataFrame with extra metadata columns.
Extract some frequently used metadata from the images DataFrame:
- extract labels from file paths,
- extract image sizes.
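The two extraction steps can be sketched as plain Python helpers. This is an illustration under assumptions, not the notebook's own code: the label is taken to be the name of the file's parent directory (one sub-directory per class), the size helper assumes the Pillow library is installed, and the function names and example path are hypothetical. In a Spark job these helpers would typically be applied through column expressions or UDFs, which is omitted here.

```python
import io
import posixpath


def extract_label(path):
    """Return the class label, taken as the name of the file's parent directory."""
    return posixpath.basename(posixpath.dirname(path))


def extract_size(content):
    """Return (width, height) decoded from raw image bytes; requires Pillow."""
    from PIL import Image  # imported lazily so extract_label works without Pillow
    with Image.open(io.BytesIO(content)) as image:
        return image.size


# Hypothetical file path, for illustration only.
example = "dbfs:/databricks-datasets/flower_photos/daisy/example.jpg"
print(extract_label(example))  # daisy
```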
Save the DataFrame in Delta format.
The following code uses "trigger once" mode to run the streaming job. On the first run, it ingests all the existing image files into the Delta table and exits. If you have a static image folder, this is all that is required.
If you have continuously arriving new images:
- If the latency requirement is loose, you can schedule a job that runs every day (or at another appropriate interval); each subsequent run ingests only the new images into the Delta table.
- If the latency requirement is strict, you can run this notebook as a "real-time" streaming job by removing .trigger(once=True), so that new images are ingested into the Delta table as soon as they arrive.
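The two options above differ only in whether the trigger is set. A minimal sketch, assuming `images_df` is the streaming DataFrame produced by Auto Loader; the helper name `start_image_ingest` and both path parameters are hypothetical, and the checkpoint location is where the stream records which files it has already processed.

```python
def start_image_ingest(images_df, table_path, checkpoint_path, once=True):
    """Start writing the image stream to a Delta table and return the query handle."""
    writer = (
        images_df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)  # tracks already-ingested files
    )
    if once:
        # Process everything currently available, then stop (scheduled-job style).
        writer = writer.trigger(once=True)
    # Without the trigger, the query keeps running and picks up new files as they arrive.
    return writer.start(table_path)
```

Calling the helper with `once=True` matches the scheduled-job case; passing `once=False` corresponds to removing the trigger for continuous ingestion.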
For more information about Auto Loader and how to optimize its configuration, see the Auto Loader documentation (AWS|Azure|GCP) and this blog post.
On Databricks Runtime 9.0 and above, images in binaryFile format that are loaded or saved as Delta tables using Auto Loader have annotations attached so that the image thumbnails are shown when displayed. The command below shows an example. For more information, see the binary file documentation (AWS|Azure|GCP).
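A minimal sketch of reading a few rows back from the Delta table for preview. The helper name `load_image_sample` and the `table_path` parameter are hypothetical; `display` is assumed to be the Databricks notebook built-in, so the call is shown only as a comment.

```python
def load_image_sample(spark, table_path, n=5):
    """Return the first n rows of the Delta table of images as a batch DataFrame."""
    return spark.read.format("delta").load(table_path).limit(n)

# In a Databricks notebook, where `spark` and `display` are predefined:
# display(load_image_sample(spark, table_path))
```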