Large language model endpoint benchmarking script
To use this notebook, update the Databricks serving endpoint_name and the number of input_tokens and output_tokens in the next cell. At the end of the notebook, a latency versus throughput graph is plotted and the benchmark results are printed.
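As a rough illustration, the configuration cell might look like the following sketch; the endpoint name and token counts are placeholders to replace with your own values.

```python
# Sketch of the configuration cell; the values below are placeholders,
# not defaults from the notebook.
endpoint_name = "llama-2-7b-chat"   # hypothetical Databricks serving endpoint
input_tokens = 512                  # number of tokens in each prompt
output_tokens = 64                  # number of tokens to generate per query
```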
Initial setup
The following get_request function sets up the request for each query. The number of tokens in the prompt must match the number of tokens the model sees, so the prompt is built from a single token taken from the tokenizer of the model being benchmarked. The example in this notebook works for Llama models.
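A minimal sketch of such a function is shown below, assuming a Llama tokenizer from Hugging Face and a completions-style payload (prompt and max_tokens); the field names and the choice of repeated word are assumptions, not the notebook's exact code.

```python
from transformers import AutoTokenizer

# Assumed model identifier; swap in the model behind your endpoint.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def get_request(input_tokens: int, output_tokens: int) -> dict:
    # Repeat one word that the tokenizer maps to a single token so the
    # prompt has an exact, known token count.
    word = "hello"  # assumed to encode to a single Llama token
    prompt = " ".join([word] * input_tokens)
    n = len(tokenizer.encode(prompt, add_special_tokens=False))
    assert n == input_tokens, f"prompt has {n} tokens, expected {input_tokens}"
    return {"prompt": prompt, "max_tokens": output_tokens}
```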
Next, you can validate the number of input tokens. You might need to edit this manually, because the count depends on the tokenizer the model uses. The following example (sketched after this list):
- Runs 10 queries.
- Validates that the number of input tokens matches the number of tokens the model sees.
- Warms up the model.
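Here is a hedged sketch of that loop, reusing the get_request sketch above; the endpoint URL, token placeholder, and the "usage" field in the response are assumptions to adapt to your workspace and endpoint's response schema.

```python
import requests

# Placeholders: fill in your workspace URL, endpoint name, and token.
API_TOKEN = "<databricks-personal-access-token>"
ENDPOINT_URL = "https://<workspace-url>/serving-endpoints/<endpoint_name>/invocations"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

for i in range(10):  # 10 queries, which also warm up the model
    payload = get_request(input_tokens=512, output_tokens=64)
    resp = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload)
    resp.raise_for_status()
    usage = resp.json().get("usage", {})
    # Compare the server-reported prompt token count with the intended count.
    print(f"query {i}: prompt_tokens = {usage.get('prompt_tokens')}")
```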
Benchmarking library
Run the benchmark with different numbers of parallel queries
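The sweep itself could look like the following sketch, which reuses ENDPOINT_URL, HEADERS, and get_request from above; the parallelism levels, query counts, and summary statistics are illustrative assumptions, not the notebook's exact benchmarking library.

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def timed_query(_):
    # Send one request and return its wall-clock latency in seconds.
    payload = get_request(input_tokens=512, output_tokens=64)
    start = time.perf_counter()
    requests.post(ENDPOINT_URL, headers=HEADERS, json=payload).raise_for_status()
    return time.perf_counter() - start

results = {}
for parallel in (1, 2, 4, 8, 16):  # illustrative parallelism levels
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        start = time.perf_counter()
        latencies = list(pool.map(timed_query, range(parallel * 4)))
        elapsed = time.perf_counter() - start
    results[parallel] = {
        "median_latency_s": sorted(latencies)[len(latencies) // 2],
        "throughput_qps": len(latencies) / elapsed,
    }

# These per-level numbers are what the latency versus throughput graph plots.
for parallel, stats in results.items():
    print(parallel, stats)
```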