dist-img-infer-2-pandas-udf(Python)


2. Do distributed model inference from Delta


  • Start from the Delta table /databricks-datasets/flowers/, a copy of the output table produced by the "ETL image dataset in a Delta table" notebook.
  • Use a scalar iterator Pandas UDF to make batch predictions.

Define a Dataset that processes the input
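
A minimal sketch of such a dataset, assuming each batch arrives as a pandas Series of encoded image bytes. It follows the map-style protocol (`__len__` and `__getitem__`) used by PyTorch datasets. The names `ImageBytesDataset` and `decode` are hypothetical; in a real pipeline `decode` would open the image and apply the model's preprocessing.

```python
from typing import Any, Callable

import pandas as pd


class ImageBytesDataset:
    """Map-style dataset over a pandas Series of encoded image bytes.

    Hypothetical sketch: with PyTorch installed this would typically
    subclass torch.utils.data.Dataset; only __len__ and __getitem__ matter.
    """

    def __init__(self, contents: pd.Series, decode: Callable[[bytes], Any]):
        self.contents = contents
        self.decode = decode  # assumed: bytes -> preprocessed model input

    def __len__(self) -> int:
        return len(self.contents)

    def __getitem__(self, index: int) -> Any:
        # Decode lazily so only the requested image is processed.
        return self.decode(self.contents.iloc[index])


# Usage with a stand-in decoder that just measures the payload size.
dataset = ImageBytesDataset(pd.Series([b"abc", b"defgh"]), decode=len)
```

Decoding inside `__getitem__` rather than up front keeps memory use proportional to the batches a data loader actually requests.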

Define a Pandas UDF for the inference task

PySpark has three kinds of UDF that provide 1:1 mapping semantics:

  • PySpark UDF: record -> record. Data serialization makes it slow, so it is not recommended.
  • Scalar Pandas UDF: pandas Series/DataFrame -> pandas Series/DataFrame. No state is shared among batches.
  • Scalar iterator Pandas UDF: initialize some state once (for example, load the model), then iterate over the batches.

Databricks recommends the scalar iterator Pandas UDF for model inference.
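
The scalar iterator pattern can be sketched as a plain generator: state is created once, then reused for every batch. This is an illustration under stated assumptions, not the notebook's own code. `load_model` is a hypothetical stand-in that returns a dummy function, and `make_predict_udf` assumes PySpark 3.0 or later, where the type hints select the UDF variant.

```python
from typing import Callable, Iterator

import pandas as pd


def load_model() -> Callable[[bytes], int]:
    """Hypothetical stand-in for an expensive model load."""
    return lambda content: len(content) % 5  # dummy "class id"


def predict_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()  # state initialized once, before the first batch
    for contents in batches:  # then reused for every batch
        yield contents.map(model)


def make_predict_udf():
    """Wrap predict_batches as a pandas UDF (requires PySpark 3.0+)."""
    from pyspark.sql.functions import pandas_udf

    # The Iterator[pd.Series] -> Iterator[pd.Series] type hints select the
    # scalar iterator variant.
    return pandas_udf(predict_batches, returnType="long")
```

Keeping the generator separate from the Spark wrapper means the batch logic can be exercised locally with ordinary pandas Series before it runs on a cluster.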

Do distributed inference with the DataFrame API

  • Specify the required columns and how to compute each.
  • Let Spark optimize the execution instead of writing imperative RDD code.
  • To automatically apply inference to new data in the Delta table, use spark.readStream to load the Delta table as a stream source, and write the predictions to another Delta table.

You can do more (optional)

  • Filter images to predict based on some metadata.
  • Compare predictions from two models.
  • Do inference directly in SQL.