Do distributed model inference with TensorFlow tf.keras and Delta
- Start from the Delta table `/databricks-datasets/flowers/`, which is a copy of the output table of the ETL image dataset in a Delta table notebook.
- Use a scalar iterator Pandas UDF to make predictions.
Define a Pandas UDF for the inference task
There are three types of UDFs in PySpark that provide 1:1 (record-to-record) mapping semantics:
- PySpark UDF: record -> record; incurs per-record serialization overhead, so it is not recommended.
- Scalar Pandas UDF: pandas Series/DataFrame -> pandas Series/DataFrame; no shared state across batches.
- Scalar iterator Pandas UDF: initialize state once (for example, load the model), then iterate over batches.
Databricks recommends the scalar iterator Pandas UDF for model inference; a minimal sketch follows.
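The sketch below illustrates the scalar iterator pattern: the model is loaded once per executor process and reused across batches. The stock ResNet50 model, the `preprocess` helper, the `content` column of raw JPEG bytes, and the `array<float>` return type are illustrative assumptions, not details from the original notebook.

```python
from typing import Iterator

import numpy as np
import pandas as pd
import tensorflow as tf
from pyspark.sql.functions import pandas_udf

def preprocess(raw_bytes):
    # Hypothetical helper: decode JPEG bytes into a 224x224x3 array
    # matching ResNet50's expected input.
    image = tf.image.decode_jpeg(raw_bytes, channels=3)
    image = tf.image.resize(image, [224, 224])
    return tf.keras.applications.resnet50.preprocess_input(image).numpy()

@pandas_udf("array<float>")
def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # State is initialized once, before the batch loop: the model is
    # loaded a single time and reused for every incoming batch.
    model = tf.keras.applications.ResNet50()  # assumption: any tf.keras model works here
    for content in batches:
        images = np.stack(content.map(preprocess))
        preds = model.predict(images)
        yield pd.Series(list(preds))
```

Compared with a plain scalar Pandas UDF, the iterator form avoids reloading the model for every batch, which dominates the cost when the model is large.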
Do distributed inference with the DataFrames API
- You declare a prediction column and how to compute it; see the batch sketch after this list.
- Let Spark optimize and parallelize the execution.
- To automatically apply inference to new data in the Delta table, use `spark.readStream` to load the Delta table as a stream source, and write the predictions to another Delta table; see the streaming sketch below.
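A sketch of the batch case, reusing `predict_udf` from above; the `content` column name and the output path are assumptions:

```python
# Runs in a Databricks notebook, where `spark` is predefined.
df = spark.read.format("delta").load("/databricks-datasets/flowers/")

# Declare the prediction column; nothing executes yet.
predictions = df.withColumn("prediction", predict_udf(df["content"]))

# Spark plans and parallelizes the inference when the result is written out.
predictions.write.format("delta").mode("overwrite").save("/tmp/flowers_predictions")
```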
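And the streaming variant, assuming hypothetical checkpoint and output locations:

```python
from pyspark.sql.functions import col

stream = (
    spark.readStream.format("delta")
    .load("/databricks-datasets/flowers/")
    .withColumn("prediction", predict_udf(col("content")))
)

# New rows appended to the source table are scored incrementally.
(stream.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/flowers/_checkpoint")
    .start("/tmp/flowers_predictions_stream"))
```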
You can do more (optional)
Filter the images to run inference on, based on some metadata.
Do inference directly in SQL; a sketch of both follows.
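The sketch below assumes the table has `label`, `path`, and `content` columns as produced by the ETL notebook; the registered function name `predict` is arbitrary:

```python
from pyspark.sql.functions import col

# Option 1: filter on metadata before running inference.
daisies = (
    spark.read.format("delta").load("/databricks-datasets/flowers/")
    .filter(col("label") == "daisy")
    .withColumn("prediction", predict_udf(col("content")))
)

# Option 2: register the same Pandas UDF so it can be called from SQL.
spark.udf.register("predict", predict_udf)
result = spark.sql("""
  SELECT path, predict(content) AS prediction
  FROM delta.`/databricks-datasets/flowers/`
  WHERE label = 'daisy'
""")
```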