Foundation Model APIs limits and quotas

This page describes the limits and quotas for Databricks Foundation Model APIs workloads.

Databricks Foundation Model APIs enforce rate limits to ensure reliable performance and fair resource allocation across all users. These limits vary based on the workspace platform tier, the foundation model type, and how you deploy your foundation model.

Pay-per-token endpoint rate limits

Pay-per-token endpoints are governed by token-based and query-based rate limits. Token-based rate limits control the maximum number of tokens that can be processed per minute and are enforced separately for input and output tokens.

  • Input tokens per minute (ITPM): The maximum number of input tokens (from your prompts) that can be processed within a 60-second window. An ITPM rate limit controls the input token throughput of an endpoint.
  • Output tokens per minute (OTPM): The maximum number of output tokens (from the model's responses) that can be generated within a 60-second window. An OTPM rate limit controls the output token throughput of an endpoint.
  • Queries per hour (QPH): The maximum number of queries or requests that can be processed within a 60-minute window. For production applications with sustained usage patterns, Databricks recommends provisioned throughput endpoints, which provide guaranteed capacity.

How limits are tracked and enforced

The most restrictive rate limit (ITPM, OTPM, or QPH) applies at any given time. For example, even if you haven't reached your ITPM limit, you might still be rate limited if you exceed the QPH or OTPM limit. When the ITPM or OTPM limit is reached, subsequent requests receive a 429 error indicating that too many requests were received. This message persists until the rate limit window resets.
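
As an illustration of how the most restrictive limit applies, the following sketch estimates the ITPM, OTPM, and QPH that a planned workload would consume and identifies which limit binds first. The limit values and the workload profile are placeholder assumptions; the limits shown are examples taken from the Enterprise tier tables later on this page.

Python
# Illustrative only: estimate which rate limit a planned workload hits first.
# Example limits from the tables below (e.g., Llama 3.3 70B Instruct).
ITPM_LIMIT = 200_000   # input tokens per minute
OTPM_LIMIT = 10_000    # output tokens per minute
QPH_LIMIT = 2_400      # queries per hour

# Assumed workload profile (placeholders).
requests_per_minute = 30
avg_input_tokens = 1_500
avg_output_tokens = 400

usage = {
    "ITPM": requests_per_minute * avg_input_tokens,   # 45,000
    "OTPM": requests_per_minute * avg_output_tokens,  # 12,000
    "QPH": requests_per_minute * 60,                  # 1,800
}
limits = {"ITPM": ITPM_LIMIT, "OTPM": OTPM_LIMIT, "QPH": QPH_LIMIT}

# The most restrictive limit is the one with the highest utilization.
utilization = {name: usage[name] / limits[name] for name in limits}
binding = max(utilization, key=utilization.get)
print(f"Binding limit: {binding} at {utilization[binding]:.0%} of capacity")
# Here OTPM is exceeded (120%), so requests would be throttled even though
# ITPM and QPH usage are below their limits.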

Databricks tracks and enforces tokens per minute (TPM) rate limits using the following features:

Token accounting and pre-admission checks

  • Input token counting: Input tokens are counted from your actual prompt at request time.

  • Output token estimation: If you provide max_tokens in your request, Databricks uses this value to estimate and reserve output token capacity before the request is admitted for processing.

  • Pre-admission validation: Databricks checks if your request would exceed ITPM or OTPM limits before processing begins. If max_tokens would cause you to exceed OTPM limits, Databricks rejects the request immediately with a 429 error.

  • Actual vs estimated output: After the response is generated, the actual output tokens are counted. Importantly, if the actual token usage is less than the reserved max_tokens, Databricks credits the difference back to your rate limit allowance, making those tokens immediately available for other requests.

  • No max_tokens specified: If you don't specify max_tokens, Databricks uses a default reservation, and the actual token count is reconciled after generation.

    Note: Claude Sonnet 4 specifically defaults to 1,000 output tokens when max_tokens is not set and returns a finish reason of "length" when that limit is reached. This default is not the model's maximum context length. Claude 3.7 Sonnet has no such default.

Burst capacity and smoothing

  • Burst buffer: The rate limiter includes a small buffer to accommodate short bursts of traffic above the nominal rate.
  • Sliding window: Token consumption is tracked using a sliding window algorithm that provides smoother rate limiting than hard per-minute boundaries.
  • Token bucket algorithm: Databricks uses a token bucket implementation that allows for some burst capacity while maintaining the average rate limit over time.
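
The following is a minimal sketch of the token bucket idea described above, not the service's actual limiter. The class, the capacity, and the refill values are illustrative assumptions chosen to mirror a 10,000 OTPM limit with a small burst buffer.

Python
import time

# Illustrative token bucket: refills at a steady rate, allows limited bursts.
class TokenBucket:
    def __init__(self, rate_per_minute: float, burst_capacity: float):
        self.rate = rate_per_minute / 60.0   # tokens added per second
        self.capacity = burst_capacity       # maximum tokens the bucket holds
        self.tokens = burst_capacity
        self.last_refill = time.monotonic()

    def try_consume(self, n: int) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # the caller should back off; the service returns 429 here

bucket = TokenBucket(rate_per_minute=10_000, burst_capacity=11_000)
print(bucket.try_consume(500))  # True while burst capacity remains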

The following is an example of how pre-admission checking and the credit-back behavior work.

Python
# Request with max_tokens specified
request = {
    "prompt": "Write a story about...",  # 10 input tokens
    "max_tokens": 500,  # System reserves 500 output tokens
}

# Pre-admission check:
# - Verifies 10 tokens against the ITPM limit
# - Reserves 500 tokens against the OTPM limit
# - If either would exceed limits, returns 429 immediately

# If admitted and the actual response uses only 350 tokens,
# the system credits back 150 tokens (500 - 350) to your OTPM allowance.
# These 150 tokens are immediately available for other requests.

Rate limits by model

The following tables summarize the ITPM, OTPM, and QPH rate limits for pay-per-token Foundation Model API endpoints for Enterprise tier workspaces:

Large language models | ITPM limit | OTPM limit | QPH limit | Notes
--- | --- | --- | --- | ---
GPT OSS 120B | 200,000 | 10,000 | 7,200 | General-purpose LLM
GPT OSS 20B | 200,000 | 10,000 | 7,200 | Smaller GPT variant
Gemma 3 12B | 200,000 | 10,000 | 7,200 | Google's Gemma model
Llama 4 Maverick | 200,000 | 10,000 | 2,400 | Latest Llama release
Llama 3.3 70B Instruct | 200,000 | 10,000 | 2,400 | Mid-size Llama model
Llama 3.1 8B Instruct | 200,000 | 10,000 | 7,200 | Lightweight Llama model
Llama 3.1 405B Instruct | 5,000 | 500 | 1,200 | Largest Llama model - reduced limits due to size

Anthropic Claude models | ITPM limit | OTPM limit | QPH limit | Notes
--- | --- | --- | --- | ---
Claude 3.7 Sonnet | 50,000 | 5,000 | 2,400 | Balanced Claude model
Claude Sonnet 4 | 50,000 | 5,000 | 60 | Latest Sonnet version
Claude Opus 4 | 50,000 | 5,000 | 600 | Most capable Claude model

Embedding models | ITPM limit | OTPM limit | QPH limit | Notes
--- | --- | --- | --- | ---
GTE Large (En) | N/A | N/A | 540,000 | Text embedding model - does not generate normalized embeddings
BGE Large (En) | N/A | N/A | 2,160,000 | Text embedding model

Best practices for managing TPM rate limits

Step 1. Monitor token usage

Track both input and output token counts separately in your applications:

Python
# Example: Track token usage
# Example limits for your chosen model, taken from the tables above
ITPM_LIMIT = 200_000
OTPM_LIMIT = 10_000

response = model.generate(prompt)
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens

# Check against limits
if input_tokens > ITPM_LIMIT or output_tokens > OTPM_LIMIT:
    # Implement backoff strategy
    pass

Step 2. Implement retry logic

Add exponential backoff when you encounter rate limit errors:

Python
import time
import random

def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
):
    """Retry a function with exponential backoff."""
    num_retries = 0
    delay = initial_delay

    while num_retries < max_retries:
        try:
            return func()
        except Exception as e:
            if "rate_limit" in str(e) or "429" in str(e):
                num_retries += 1

                if jitter:
                    delay *= exponential_base * (1 + random.random())
                else:
                    delay *= exponential_base

                time.sleep(delay)
            else:
                raise e

    raise Exception(f"Maximum retries {max_retries} exceeded")

Step 3. Optimize token usage

  • Minimize prompt length: Use concise, well-structured prompts
  • Control output length: Use max_tokens parameter to limit response size
  • Set max_tokens explicitly for Claude Sonnet 4: Always specify max_tokens when using Claude Sonnet 4 to avoid the default 1,000-token limit (see the sketch after this list)
  • Batch efficiently: Group related requests when possible while staying within limits
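
The following is a minimal sketch of setting max_tokens explicitly using an OpenAI-compatible client. The workspace URL, token, and endpoint name are placeholders; verify the Claude Sonnet 4 endpoint name in your workspace before use.

Python
from openai import OpenAI

# Placeholders: replace with your workspace hostname and a Databricks token.
client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",
    base_url="https://<workspace-hostname>/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-claude-sonnet-4",  # assumed endpoint name; confirm in your workspace
    messages=[{"role": "user", "content": "Summarize the quarterly report."}],
    max_tokens=4000,  # set explicitly so output is not cut off at the 1,000-token default
)
print(response.choices[0].message.content)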

Step 4. Consider model selection

  • Smaller models for high-volume tasks: Use models like Llama 3.1 8B for tasks that require higher throughput
  • Large models for complex tasks: Reserve Llama 3.1 405B for tasks that require maximum capability

Monitoring and troubleshooting

Monitor your token usage patterns to optimize performance:

Python
# Example: Log token usage for monitoring
import logging

logger = logging.getLogger(__name__)

def log_token_usage(response):
    usage = response.usage
    logger.info(f"Input tokens: {usage.prompt_tokens}")
    logger.info(f"Output tokens: {usage.completion_tokens}")
    logger.info(f"Total tokens: {usage.total_tokens}")

    # Alert if approaching limits
    if usage.prompt_tokens > ITPM_LIMIT * 0.8:
        logger.warning("Approaching ITPM limit")
    if usage.completion_tokens > OTPM_LIMIT * 0.8:
        logger.warning("Approaching OTPM limit")

Handle rate limit errors

When you exceed rate limits, the API returns a 429 Too Many Requests error:

JSON
{
  "error": {
    "message": "Rate limit exceeded: ITPM limit of 200,000 tokens reached",
    "type": "rate_limit_exceeded",
    "code": 429,
    "limit_type": "input_tokens_per_minute",
    "limit": 200000,
    "current": 200150,
    "retry_after": 15
  }
}

The error response includes:

  • limit_type: Which specific limit was exceeded (ITPM, OTPM, QPS, or QPH)
  • limit: The configured limit value
  • current: Your current usage
  • retry_after: Suggested wait time in seconds
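
The following is a minimal sketch of honoring retry_after when a request is throttled. ENDPOINT_URL and HEADERS are placeholders for your serving endpoint URL and authentication, and the exact error payload shape may vary, so the code falls back to a short wait if the field is absent.

Python
import time
import requests

# Placeholders: replace with your endpoint's invocation URL and a Databricks token.
ENDPOINT_URL = "https://<workspace-hostname>/serving-endpoints/<endpoint-name>/invocations"
HEADERS = {"Authorization": "Bearer <DATABRICKS_TOKEN>"}

def query_with_retry_after(payload, max_attempts=5):
    for _ in range(max_attempts):
        resp = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload)
        if resp.status_code != 429:
            return resp
        # Honor the suggested wait; fall back to 15 seconds if it is missing.
        wait = resp.json().get("error", {}).get("retry_after", 15)
        time.sleep(wait)
    raise RuntimeError("Still rate limited after maximum retries")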

Common issues and solutions

Issue | Solution
--- | ---
Frequent 429 errors | Implement exponential backoff, reduce request rate, and request higher rate limits
ITPM limit reached | Optimize prompt length
OTPM limit reached | Use max_tokens to limit response length
QPH limit reached | Distribute requests more evenly over time

Provisioned throughput limits

For production workloads that require higher limits, provisioned throughput endpoints offer:

  • No TPM restrictions: Processing capacity based on provisioned resources
  • Higher rate limits: Up to 200 queries per second per workspace
  • Predictable performance: Dedicated resources ensure consistent latency

The following are limitations for provisioned throughput workloads:

  • To use the DBRX model architecture for a provisioned throughput workload, your serving endpoint must be in us-east-1 or us-west-2.
  • For provisioned throughput workloads that use Llama 4 Maverick:
    • Support for this model on provisioned throughput workloads is in Public Preview.
    • Autoscaling is not supported.
    • Metrics panels are not supported.
    • Traffic splitting is not supported on an endpoint that serves Llama 4 Maverick. You cannot serve multiple models on an endpoint that serves Llama 4 Maverick.
  • To deploy a Meta Llama model from system.ai in Unity Catalog, you must choose the applicable Instruct version. Base versions of the Meta Llama models are not supported for deployment from Unity Catalog. See Deploy provisioned throughput endpoints.

Regional availability and data processing

For Databricks-hosted foundation model region availability, see Foundation Model overview.

For data processing and residency details, see Data processing and residency.

Additional resources