Batch inference using Foundation Model API provisioned throughput
This article provides an example notebook that performs batch inference on a provisioned throughput endpoint using Foundation Model APIs. It also includes an example notebook for determining the optimal concurrency for your endpoint based on your batch inference workload.
Requirements
A workspace in a Foundation Model APIs supported region.
Databricks Runtime 14.3 LTS ML or above.
Run batch inference
Setting up batch inference generally involves three steps:
Prepare sample data and set up a benchmark endpoint.
Run a load test with the sample data on the benchmark endpoint to determine the ideal endpoint configuration.
Create the endpoint to be used for batch inference and send the batch inference requests. A sketch of creating such an endpoint follows this list.
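For illustration only, the following minimal sketch creates a provisioned throughput endpoint with the Databricks SDK for Python. The endpoint name, Unity Catalog model path, and throughput limits are placeholder values rather than the example notebook's configuration; replace them with the model you benchmarked and the tokens-per-second band indicated by your load test.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

# Placeholder values: swap in the model you benchmarked and the
# throughput band suggested by your load test results.
endpoint = w.serving_endpoints.create(
    name="llama-batch-inference",  # hypothetical endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="system.ai.meta_llama_v3_1_70b_instruct",  # assumed Unity Catalog path
                entity_version="1",
                min_provisioned_throughput=0,
                max_provisioned_throughput=9500,  # example value; use your load test result
            )
        ]
    ),
).result()  # block until the endpoint is ready

print(endpoint.state)
```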
The example notebook uses the Meta Llama 3.1 70B model and PySpark to accomplish the following:
Sample the input data to build a representative dataset
Create a benchmark endpoint with the chosen model
Load test the benchmark endpoint using the sample data to determine latency and concurrency
Create a provisioned throughput endpoint for batch inference given load test results
Construct the batch requests and send them to the batch inference endpoint (see the sketch after this list)
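The notebook itself is not reproduced here. As a hedged sketch of the last step, the example below uses the ai_query SQL function from PySpark to send one request per row to the serving endpoint, rather than the notebook's own request-construction code. The endpoint and table names are hypothetical, and the snippet assumes it runs in a Databricks notebook where the spark session is available.

```python
from pyspark.sql import functions as F

ENDPOINT_NAME = "llama-batch-inference"        # hypothetical; the endpoint created above
INPUT_TABLE = "catalog.schema.batch_inputs"    # assumed table with a `prompt` column
OUTPUT_TABLE = "catalog.schema.batch_outputs"  # assumed destination table

# Read the input data. During benchmarking you can add .sample(fraction=...)
# to build a representative subset instead of scoring the full table.
df = spark.read.table(INPUT_TABLE)

# ai_query sends one request per row to the serving endpoint and, for a
# Foundation Model API endpoint, returns the response as a string column.
results = df.withColumn(
    "completion",
    F.expr(f"ai_query('{ENDPOINT_NAME}', prompt)"),
)

results.write.mode("overwrite").saveAsTable(OUTPUT_TABLE)
```

Running this over a small sample first is a cheap way to validate the prompt format and endpoint configuration before submitting the full batch.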
Determine optimal concurrency for your batch inference workload
The following notebook provides an alternative tool for load testing the benchmark endpoint using PySpark.
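That notebook is not reproduced here. As a rough driver-side alternative (a plain thread pool rather than the notebook's PySpark approach), the sketch below sweeps several concurrency levels against a benchmark endpoint and reports latency and throughput for each, so you can pick the largest concurrency that still meets your latency target. The endpoint name, prompts, and concurrency levels are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import mlflow.deployments

ENDPOINT_NAME = "llama-batch-benchmark"  # hypothetical benchmark endpoint name
SAMPLE_PROMPTS = ["Summarize the following text: ..."] * 64  # representative sample requests

client = mlflow.deployments.get_deploy_client("databricks")


def send_request(prompt: str) -> float:
    """Send one chat request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    client.predict(
        endpoint=ENDPOINT_NAME,
        inputs={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
    )
    return time.perf_counter() - start


# Sweep a few concurrency levels and report mean latency and throughput for each.
for concurrency in (4, 8, 16, 32):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(send_request, SAMPLE_PROMPTS))
    elapsed = time.perf_counter() - start
    print(
        f"concurrency={concurrency}: "
        f"mean latency={sum(latencies) / len(latencies):.2f}s, "
        f"throughput={len(SAMPLE_PROMPTS) / elapsed:.2f} req/s"
    )
```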