This article describes how and why Databricks measures tokens per second for provisioned throughput workloads for Foundation Model APIs.
Performance for large language models (LLMs) is often measured in terms of tokens per second. When configuring production model serving endpoints, it’s important to consider the number of requests your application sends to the endpoint. Doing so helps you understand if your endpoint needs to be configured to scale so as to not impact latency.
When configuring the scale-out ranges for endpoints deployed with provisioned throughput, Databricks found it easier to reason about the inputs going into your system using tokens.
LLMs read and generate text in terms of what is called a token. Tokens can be words or sub-words, and the exact rules for splitting text into tokens vary from model to model. For instance, you can use online tools to see how Llama’s tokenizer converts words to tokens.
Traditionally, serving endpoints are configured based on the number of concurrent requests per second (RPS). However, an LLM inference request takes a different amount of time based on how many tokens are passed in and how many it generates, which can be imbalanced across requests. Therefore, deciding how much scale out your endpoint needs really requires measuring endpoint scale in terms of the content of your request - tokens.
Different use cases feature different input and output token ratios:
Varying lengths of input contexts: While some requests might involve only a few input tokens, for example a short question, others may involve hundreds or even thousands of tokens, like a long document for summarization. This variability makes configuring a serving endpoint based only on RPS challenging since it does not account for the varying processing demands of the different requests.
Varying lengths of output depending on use case: Different use cases for LLMs can lead to vastly different output token lengths. Generating output tokens is the most time intensive part of LLM inference, so this can dramatically impact throughput. For example, summarization involves shorter, pithier responses, but text generation, like writing articles or product descriptions, can generate much longer answers.
Provisioned throughput serving endpoints are configured in terms of a range of tokens per second that you can send to the endpoint. The endpoint scales up and down to handle the load of your production application. You are charged per hour based on the range of tokens per second your endpoint is scaled to.
The best way to know what tokens per second range on your provisioned throughput serving endpoint works for your use case is to perform a load test with a representative dataset. See Conduct your own LLM endpoint benchmarking.
There are two important factors to consider:
How Databricks measures tokens per second performance of the LLM
Databricks benchmarks endpoints against a workload representing summarization tasks that are common for retrieval-augmented generation use cases. Specifically, the workload consists of:
2048 input tokens
256 output tokens
The token ranges displayed combine input and output token throughput and, by default, optimize for balancing throughput and latency.
Databricks benchmarks that users can send that many tokens per second concurrently to the endpoint at a batch size of 1 per request. This simulates multiple requests hitting the endpoint at the same time, which more accurately represents how you would actually use the endpoint in production.
How autoscaling works
Model Serving features a rapid autoscaling system that scales the underlying compute to meet the tokens per second demand of your application. Databricks scales up provisioned throughput in chunks of tokens per second, so you are charged for additional units of provisioned throughput only when you’re using them.