llm-benchmarking (Python)

Benchmarking script for large language model serving endpoints

To use this notebook, update the Databricks serving endpoint_name and the number of input_tokens and output_tokens in the next cell. At the end of the notebook, a latency versus throughput graph is plotted and the benchmark results are printed.
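
A minimal configuration cell might look like the following sketch. The variable names come from the description above; the endpoint name and token counts are placeholder values you should replace with your own.

```python
# Benchmark configuration. Replace these placeholder values with your own.
endpoint_name = "llama-2-70b-chat"  # your Databricks serving endpoint (placeholder)
input_tokens = 512                  # number of tokens in each prompt
output_tokens = 128                 # number of tokens to generate per query
```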

Initial setup

The following get_request function builds the request for each query. The number of tokens in the prompt must match the number of tokens the model sees. The prompt must also be constructed from a single token from the tokenizer of the model being benchmarked, so that the prompt's token count is predictable. The example in this notebook works for Llama models.
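
A sketch of such a function is shown below. It assumes a completions-style payload with prompt and max_tokens fields, and it assumes the word "hello" encodes to a single Llama token; verify both assumptions against your endpoint and tokenizer.

```python
def get_request(input_tokens: int, output_tokens: int) -> dict:
    """Build the request body for one benchmark query.

    Repeats a single word so the prompt length in tokens is predictable.
    The word "hello" is assumed to map to one Llama token; check this
    against the tokenizer of the model you are benchmarking.
    """
    prompt = " ".join(["hello"] * input_tokens)
    return {
        "prompt": prompt,             # completions-style payload (assumed)
        "max_tokens": output_tokens,  # number of tokens to generate
        "temperature": 0.0,           # deterministic output for stable timing
    }
```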

Next, validate the number of input tokens. You might need to edit this manually, because the token count depends on the tokenizer used by the model. The following example (sketched after this list):

  • Runs 10 queries.
  • Validates that the number of input tokens matches the number of tokens the model sees.
  • Warms up the model.
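
One way to implement this step is sketched below. It queries the endpoint through the Databricks serving invocations REST API and counts tokens with a Hugging Face tokenizer; the tokenizer name and the environment variables for the workspace host and token are assumptions, not values from the notebook.

```python
import os

import requests
from transformers import AutoTokenizer

# Workspace credentials; these environment variable names are assumptions.
DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. the workspace URL
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # personal access token

# Tokenizer matching the model behind the endpoint (Llama assumed here;
# this repository is gated, so any equivalent Llama tokenizer works).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

url = f"{DATABRICKS_HOST}/serving-endpoints/{endpoint_name}/invocations"
headers = {"Authorization": f"Bearer {DATABRICKS_TOKEN}"}

for _ in range(10):  # 10 queries: validation plus model warm-up
    request = get_request(input_tokens, output_tokens)
    # Count the prompt tokens as the model would see them.
    observed = len(tokenizer(request["prompt"])["input_ids"])
    if observed != input_tokens:
        print(f"Token count mismatch: built {input_tokens}, tokenizer saw {observed}")
    requests.post(url, headers=headers, json=request).raise_for_status()
```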

Benchmarking library
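
A minimal sketch of what a benchmarking helper could look like follows. It reuses the url, headers, get_request, input_tokens, and output_tokens defined in the earlier sketches (all assumptions, not the notebook's actual library) and measures median per-query latency and aggregate token throughput with a thread pool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median

import requests

def benchmark(parallel_queries: int, num_queries: int = 32) -> dict:
    """Run num_queries requests with parallel_queries in flight at once.

    Returns the median per-query latency in seconds and the aggregate
    throughput in generated tokens per second. Reuses url, headers,
    get_request, input_tokens, and output_tokens from earlier cells.
    """
    def one_query(_):
        start = time.perf_counter()
        response = requests.post(
            url, headers=headers,
            json=get_request(input_tokens, output_tokens),
        )
        response.raise_for_status()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallel_queries) as pool:
        latencies = list(pool.map(one_query, range(num_queries)))
    elapsed = time.perf_counter() - wall_start

    return {
        "parallel_queries": parallel_queries,
        "median_latency_s": median(latencies),
        "throughput_tok_s": num_queries * output_tokens / elapsed,
    }
```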

Run the benchmark with a varying number of parallel queries
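
For example, the sweep below runs the benchmark sketch above over several parallelism levels and plots the latency versus throughput graph mentioned at the top of the notebook. The parallelism values are arbitrary and should be tuned to your endpoint.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Arbitrary parallelism levels; adjust the sweep for your endpoint.
results = pd.DataFrame([benchmark(p) for p in (1, 2, 4, 8, 16, 32)])
print(results)  # print the benchmark table

# One point per parallelism level: throughput on x, latency on y.
plt.plot(results["throughput_tok_s"], results["median_latency_s"], marker="o")
plt.xlabel("Throughput (output tokens / second)")
plt.ylabel("Median latency (seconds)")
plt.title("Latency vs. throughput")
plt.show()
```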
