Reference solution for distributed image model inference

This article and its accompanying notebooks describe a reference solution for distributed image model inference, based on a setup common to many real-world image applications. The setup assumes that you store many images in an object store and have several trained deep learning (DL) models for image classification and object detection that you want to apply to those images. For example, you might use MobileNetV2 to detect people in user-uploaded photos to help protect privacy.

You might also retrain the models and need to update previously computed predictions. However, loading many images and applying DL models to them is both I/O-heavy and compute-heavy. Fortunately, the inference workload is embarrassingly parallel: each image can be scored independently of the others, so in theory the work can be distributed easily. This guide walks you through a practical solution that consists of two major stages:

  1. ETL images into a Delta table. A dedicated ETL job helps with data management and simplifies the inference task.
  2. Perform distributed inference using a pandas UDF.
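
The two stages above can be sketched roughly as follows. This is a minimal illustration, not the code from the accompanying notebooks. It assumes a runtime where the binaryFile data source, Delta Lake, and scalar-iterator pandas UDFs are available. The names load_dummy_model, predict_batches, etl_images_to_delta, and make_predict_udf are hypothetical helpers introduced for this sketch, and the "model" is a placeholder that returns each image's byte length instead of real predictions.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

# Hypothetical type aliases: a batch is a sequence of raw image bytes, and a
# model maps a batch to one score vector per image.
Batch = Sequence[bytes]
Model = Callable[[Batch], List[List[float]]]


def load_dummy_model() -> Model:
    """Stand-in for loading real weights (an assumption, not a real model)."""
    def model(batch: Batch) -> List[List[float]]:
        # Placeholder "prediction": the byte length of each image.
        return [[float(len(image))] for image in batch]
    return model


def predict_batches(batches: Iterable[Batch],
                    load_model: Callable[[], Model]) -> Iterator[List[List[float]]]:
    """Load the model once, then score every batch with it."""
    model = load_model()
    for batch in batches:
        yield model(batch)


def etl_images_to_delta(spark, image_dir: str, table_path: str) -> None:
    """Stage 1: read image files as binary records and save them as a Delta table."""
    images = spark.read.format("binaryFile").load(image_dir)
    (images.select("path", "length", "content")
           .write.format("delta").mode("overwrite").save(table_path))


def make_predict_udf(load_model: Callable[[], Model]):
    """Stage 2: wrap predict_batches in a scalar-iterator pandas UDF."""
    # Imported lazily so the pure-Python helpers above work without Spark.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import ArrayType, FloatType

    @pandas_udf(ArrayType(FloatType()), PandasUDFType.SCALAR_ITER)
    def predict_udf(content_iter):
        # Each element of content_iter is a pandas Series of image bytes.
        batches = (list(series) for series in content_iter)
        for scores in predict_batches(batches, load_model):
            yield pd.Series(scores)

    return predict_udf
```

On a cluster, you would call etl_images_to_delta once with your own input and output paths, read the resulting Delta table back with spark.read.format("delta"), and add a prediction column by applying make_predict_udf(load_dummy_model) to the content column. Keeping the scoring loop in a plain function such as predict_batches means the model is loaded once per task rather than once per batch, and the logic can be unit-tested without a Spark session.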

Requirements

Databricks Runtime 5.5 LTS ML.

Notebooks

The following notebooks use the installed PyTorch and TensorFlow tf.keras to demonstrate the reference solution.

ETL image dataset into a Delta table notebook


Distributed inference via PyTorch and Pandas UDF notebook


Distributed inference via Keras and Pandas UDF notebook
