

Do distributed model inference with TensorFlow tf.keras and Delta

  • Start from the Delta table /databricks-datasets/flowers/, a copy of the output table of the "ETL image dataset in a Delta table" notebook.
  • Use a scalar iterator Pandas UDF to make predictions.

Define a Pandas UDF for the inference task

PySpark has three kinds of UDF that provide 1:1 mapping semantics:

  • PySpark UDF: record -> record. Data serialization makes it slow, so it is not recommended.
  • Scalar Pandas UDF: pandas Series/DataFrame -> pandas Series/DataFrame. No state is shared among batches.
  • Scalar iterator Pandas UDF: initializes some state first, then iterates through the batches.

Databricks recommends the scalar iterator Pandas UDF for model inference, because state such as the model can be initialized once and reused for every batch.
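
As a rough sketch of the scalar iterator pattern, assuming Spark 3.0 or later and a column of raw bytes: the function below sets up its state once and then loops over the batches. The names TinyModel, load_model, preprocess, predict_batches, and make_predict_udf are hypothetical, and TinyModel is only a placeholder for a real tf.keras model so that the sketch runs without TensorFlow.

```python
from typing import Iterator

import numpy as np
import pandas as pd


class TinyModel:
    """Hypothetical stand-in for a tf.keras model; not a real classifier."""

    def predict(self, batch: np.ndarray) -> np.ndarray:
        # One "score" per row: the mean of that row's features.
        return batch.mean(axis=1, keepdims=True)


def load_model() -> TinyModel:
    # Assumption: a real job would build or load the tf.keras model here,
    # which is the expensive step worth doing only once per task.
    return TinyModel()


def preprocess(content: bytes) -> np.ndarray:
    # Assumption: each record holds at least 4 raw bytes. A real pipeline
    # would decode and resize the image instead of taking the first 4 bytes.
    return np.frombuffer(content[:4], dtype=np.uint8).astype(np.float32)


def predict_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()  # state initialized once, reused for all batches
    for content in batches:  # each batch is a pandas Series of records
        features = np.stack([preprocess(c) for c in content])
        scores = model.predict(features)
        # One output row per input row, as a 1:1 mapping requires.
        yield pd.Series(list(scores))


def make_predict_udf():
    # Imported lazily so the functions above can be tried without Spark.
    from pyspark.sql.functions import pandas_udf

    # Spark infers the scalar iterator UDF type from the type hints above.
    return pandas_udf(predict_batches, "array<float>")
```

Because predict_batches is a plain generator, it can be checked locally on a few pandas Series before it is wrapped with make_predict_udf() on a cluster.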

Do distributed inference with the DataFrame API

  • Declare a predictions column and how to compute it.
  • Let Spark optimize the execution.
  • To automatically apply inference to new data in the Delta table, use spark.readStream to load the Delta table as a stream source, and write the predictions to another Delta table.
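
The steps above might look roughly like the following sketch. It assumes an existing SparkSession and a Pandas UDF passed in by the caller. The function names, the column names content and prediction, and the output and checkpoint paths are all assumptions for illustration, not taken from the notebook.

```python
def add_predictions(df, predict_udf, input_col="content", output_col="prediction"):
    # Declarative step: name the new column and the UDF that computes it.
    # Spark decides how to partition and schedule the actual work.
    # Assumption: the input column is called "content".
    return df.withColumn(output_col, predict_udf(input_col))


def run_batch(spark, predict_udf, source_path, output_path):
    # One-off inference over the current contents of the Delta table.
    images = spark.read.format("delta").load(source_path)
    predictions = add_predictions(images, predict_udf)
    predictions.write.format("delta").mode("overwrite").save(output_path)


def run_streaming(spark, predict_udf, source_path, output_path, checkpoint_path):
    # Read the Delta table as a stream so newly appended rows are scored too.
    images = spark.readStream.format("delta").load(source_path)
    predictions = add_predictions(images, predict_udf)
    # The checkpoint location lets the query resume where it left off.
    return (
        predictions.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(output_path)
    )
```

The same add_predictions helper serves both the batch and the streaming path, since a streaming DataFrame accepts the same column expressions as a static one.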

You can do more (optional)

  • Filter images to predict based on some metadata.
  • Do inference directly in SQL.
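
Both optional steps could be sketched as follows, again assuming an existing SparkSession and a Pandas UDF supplied by the caller. The column names label, path, and content, the view name images_to_score, and the SQL function name predict are assumptions chosen for illustration.

```python
def filter_by_label(df, label):
    # Assumption: the table has a metadata column named "label".
    return df.filter(df["label"] == label)


def predict_with_sql(spark, predict_udf, df, view_name="images_to_score"):
    # Expose the DataFrame to SQL under a temporary view name.
    df.createOrReplaceTempView(view_name)
    # Register the Pandas UDF so SQL queries can call it by name.
    spark.udf.register("predict", predict_udf)
    # Assumption: the table has "path" and "content" columns.
    return spark.sql(
        f"SELECT path, predict(content) AS prediction FROM {view_name}"
    )
```

Filtering before inference means that only the selected rows are passed to the model, which is usually the most expensive part of the job.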