
Model units in provisioned throughput

Model units are units of throughput that determine how much work your endpoint can handle per minute. When you create a provisioned throughput endpoint, you specify how many model units to provision for each served model.
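
If you manage endpoints programmatically, the sketch below shows the general shape of that call against the serving-endpoints REST API. The endpoint name, the entity path, and the `provisioned_model_units` field name are illustrative assumptions, so verify the exact payload against the API reference for your release:

```python
# A minimal sketch of creating a provisioned throughput endpoint through the
# serving-endpoints REST API. The endpoint path and auth pattern are standard;
# the "provisioned_model_units" field name is an assumption.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

payload = {
    "name": "llama4-maverick-pt",       # hypothetical endpoint name
    "config": {
        "served_entities": [
            {
                "entity_name": "system.ai.llama-4-maverick",  # illustrative model path
                "entity_version": "1",
                "provisioned_model_units": 50,  # assumed field name for model units
            }
        ]
    },
}

resp = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```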

The amount of work required to process a request depends on the size of both the input and the generated output, and generating output tokens is more resource-intensive than processing input tokens. Because the work per request grows non-linearly as input and output token counts increase, a given number of model units lets your endpoint handle either of the following, as the sketch after this list illustrates:

  • Many small requests at a time.
  • Fewer long-context requests at a time before it runs out of capacity.
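
As a rough intuition for this tradeoff, the toy model below shows how a fixed work budget fits many small requests but only a few long-context ones. It is not Databricks' actual accounting: the 3x output weight and the 1.2 growth exponent are invented purely for illustration.

```python
# A toy cost model, not Databricks' actual accounting: it only illustrates
# why the same capacity budget fits many small requests but few long ones.
def request_cost(input_tokens: int, output_tokens: int,
                 output_weight: float = 3.0, growth: float = 1.2) -> float:
    """Relative work for one request: output tokens are weighted more heavily
    (assumed 3x), and total work grows super-linearly with token count."""
    return (input_tokens + output_weight * output_tokens) ** growth

# Budget sized to 100 medium requests (3,500 input / 300 output tokens).
budget = 100 * request_cost(3500, 300)

print(budget / request_cost(500, 50))        # ~990 small requests fit
print(budget / request_cost(30_000, 1_000))  # ~9 long-context requests fit
```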

For example, for a medium-sized workload of 3,500 input tokens and 300 output tokens per request, you can estimate the tokens-per-second throughput for a given number of model units:

Model            | Model Units | Estimated Tokens per Second
Llama 4 Maverick | 50          | 3,250
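
To turn an estimate like this into request capacity, divide the per-minute token budget by the tokens each request consumes. The back-of-envelope sketch below assumes the estimated throughput counts input and output tokens combined:

```python
# Rough conversion from the table's throughput estimate to request capacity.
# Assumes the 3,250 tokens/sec figure covers input and output tokens combined.
tokens_per_second = 3250          # Llama 4 Maverick at 50 model units (table above)
tokens_per_request = 3500 + 300   # medium workload: input + output tokens

requests_per_minute = tokens_per_second * 60 / tokens_per_request
print(f"~{requests_per_minute:.0f} requests per minute")  # ~51
```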

Models that use model units

The following models use model units to provision inference capacity:

  • Google Gemini 2.5 Pro
  • Google Gemini 2.5 Flash
  • OpenAI GPT-5
  • OpenAI GPT-5 mini
  • OpenAI GPT-5 nano
  • OpenAI GPT OSS 120B
  • OpenAI GPT OSS 20B
  • Google Gemma 3 12B
  • Meta Llama 4 Maverick (preview)

Note

Model serving endpoints that serve models from the following legacy model families provision inference capacity based on tokens-per-second bands:

  • Meta Llama 3.3
  • Meta Llama 3.2 3B
  • Meta Llama 3.2 1B
  • Meta Llama 3.1
  • GTE v1.5 (English)
  • BGE v1.5 (English)
  • DeepSeek R1 (not available in Unity Catalog)
  • Meta Llama 3
  • Meta Llama 2
  • DBRX
  • Mistral
  • Mixtral
  • MPT