Ingest image data with Auto Loader
Note
Available in Databricks Runtime 9.0 and above.
Ingesting image data into Delta Lake with Auto Loader takes only a few lines of code. Auto Loader provides the following benefits:
Automatic discovery of new files to process: You don’t need special logic to handle late-arriving data or to track which files have already been processed.
Scalable file discovery: Auto Loader can ingest billions of files.
Optimized storage: Auto Loader can provide Delta Lake with additional information about the data to optimize how files are stored.
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "binaryFile") \
.load("<path_to_source_data>") \
.writeStream \
.option("checkpointLocation", "<path_to_checkpoint>") \
.start("<path_to_target")
spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "binaryFile")
.load("<path_to_source_data>")
.writeStream
.option("checkpointLocation", "<path_to_checkpoint>")
.start("<path_to_target")
The preceding code writes your image data to a Delta table in an optimized format.
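For example, you can read the table back to confirm the schema that the binaryFile format produces (path, modificationTime, length, and content columns). This is a minimal sketch; "<path_to_target>" is the same placeholder used above, and display() is available in Databricks notebooks.

Python

# Read the Delta table written by the Auto Loader stream above
df = spark.read.format("delta").load("<path_to_target>")

# Inspect the binaryFile schema: path, modificationTime, length, content
df.printSchema()

# Preview a few rows in a Databricks notebook
display(df.limit(10))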
Use a Delta table for machine learning
After the data is stored in Delta Lake, you can run distributed inference on it. See the reference article for more details.
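As an illustration, the following is a minimal sketch of distributed batch inference over the Delta table written above, using a pandas UDF. The load_model helper and the returned label are placeholder assumptions, not a specific library API; substitute your own model loading and prediction logic.

Python

import io

import pandas as pd
from PIL import Image
from pyspark.sql.functions import col, pandas_udf


@pandas_udf("string")
def classify(content: pd.Series) -> pd.Series:
    # Placeholder: load your trained model here (for example, from MLflow).
    model = load_model()  # hypothetical helper, not a real API

    def predict_one(data: bytes) -> str:
        # Decode the raw bytes from the "content" column into an image
        image = Image.open(io.BytesIO(data))
        return str(model.predict(image))

    return content.apply(predict_one)


# Apply the UDF across the cluster to score every image in the table
predictions = (
    spark.read.format("delta")
    .load("<path_to_target>")
    .withColumn("prediction", classify(col("content")))
)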