Reference solution for image applications

This article and its accompanying notebooks describe a reference solution for distributed image model inference, based on a setup shared by many real-world image applications. The setup assumes that you store many images in an object store and that, optionally, new images arrive continuously. Suppose you have several trained deep learning (DL) models for image classification and object detection (for example, MobileNetV2 for detecting people in user-uploaded photos to help protect privacy) and you want to apply these DL models to the stored images.

You might also retrain the models and need to update previously computed predictions. However, loading many images and applying DL models to them is both I/O-heavy and compute-heavy. Fortunately, the inference workload is embarrassingly parallel, so in principle it can be distributed easily. This guide walks you through a practical solution that contains two major stages:

  1. ETL images into a Delta table using Auto Loader
  2. Perform distributed inference using pandas UDF

ETL images into a Delta table using Auto Loader

For image applications, including training and inference tasks, Databricks recommends that you ETL images into a Delta table with Auto Loader. Auto Loader simplifies data management and automatically handles continuously arriving new images.
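As a rough illustration of this stage, the following sketch reads image files with Auto Loader and writes them to a Delta table. It is not taken from the accompanying notebook. The function name `ingest_images`, its parameters, and the `*.jpg` filter are assumptions made for this example, and the code needs a Databricks cluster, where the `cloudFiles` source is available, to actually run.

```python
def ingest_images(spark, source_dir, table_path, checkpoint_path):
    """Incrementally load image files from source_dir into a Delta table.

    `spark` is an active SparkSession on a Databricks cluster. All paths
    are placeholders supplied by the caller.
    """
    images = (
        spark.readStream.format("cloudFiles")        # Auto Loader source
        .option("cloudFiles.format", "binaryFile")   # one row per file, raw bytes in `content`
        .option("recursiveFileLookup", "true")       # descend into subdirectories
        .option("pathGlobFilter", "*.jpg")           # assumption: only JPEG files are wanted
        .load(source_dir)
    )
    return (
        images.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)  # records which files were already ingested
        .trigger(availableNow=True)                     # process all pending files, then stop
        .start(table_path)
    )
```

Because the checkpoint records which files have been processed, rerunning the same function later should pick up only images that arrived since the previous run.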

ETL image dataset into a Delta table notebook


Perform distributed inference using pandas UDF

The following notebooks use PyTorch and TensorFlow tf.Keras to demonstrate the reference solution.

Distributed inference via PyTorch and pandas UDF notebook


Distributed inference via Keras and pandas UDF notebook

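The general shape of an iterator-style pandas UDF for inference can be sketched as follows. This is a minimal illustration, not the code from the notebooks. `load_model` and `preprocess` are hypothetical stand-ins: a real version would load a PyTorch or Keras model and decode each image into a tensor. The point of the pattern is that the model is loaded once per task and reused across all batches.

```python
from typing import Iterator

import numpy as np
import pandas as pd


def load_model():
    """Hypothetical stand-in for loading a trained model such as MobileNetV2."""
    def model(features: np.ndarray) -> np.ndarray:
        # Placeholder "prediction": scale each feature into [0, 1].
        return features / 255.0
    return model


def preprocess(content: bytes) -> float:
    """Hypothetical preprocessing: reduce raw image bytes to one feature."""
    data = np.frombuffer(content, dtype=np.uint8)
    return float(data.mean()) if data.size else 0.0


def predict_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Apply the model to each batch of image bytes."""
    model = load_model()  # loaded once, then reused for every batch
    for contents in batches:
        features = np.array([preprocess(c) for c in contents], dtype=np.float64)
        yield pd.Series(model(features), index=contents.index)


# On a Spark cluster, the function could be registered and applied like this
# (not executed here; `table_path` is a placeholder):
#   from pyspark.sql.functions import col, pandas_udf
#   predict_udf = pandas_udf(predict_batches, returnType="double")
#   df = spark.read.format("delta").load(table_path)
#   predictions = df.select(col("path"), predict_udf(col("content")))
```

Keeping the prediction logic in a plain Python function makes it possible to check it locally on a few pandas Series before running it across a cluster.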

Limitations

For large image files (average image size greater than 100 MB), Databricks recommends using the Delta table only to manage the metadata (list of file names) and loading the images from the object store using their paths when needed.
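One way to follow this recommendation is to keep only the `path` column in the Delta table and read the bytes inside the inference function. The sketch below assumes that the files are reachable through local file APIs on each worker and that `dbfs:/` URIs map to the `/dbfs/` mount. The helper names `to_local_path` and `load_contents` are hypothetical.

```python
import pandas as pd


def to_local_path(path: str) -> str:
    """Map a dbfs:/ URI to its /dbfs/ local mount; leave other paths unchanged."""
    prefix = "dbfs:/"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):]
    return path


def load_contents(paths: pd.Series) -> pd.Series:
    """Read the image bytes for one batch of paths, only when they are needed."""
    def read(path: str) -> bytes:
        with open(to_local_path(path), "rb") as f:
            return f.read()
    return paths.map(read)
```

A function like `load_contents` could be called at the start of each batch in a pandas UDF, so that only one batch of large images is held in memory at a time.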