Foundation Model APIs limits and quotas
This page describes the limits and quotas for Databricks Foundation Model APIs workloads.
Databricks Foundation Model APIs enforce rate limits to ensure reliable performance and fair resource allocation across all users. These limits vary based on your workspace platform tier, the type of foundation model, and how you deploy that model.
Pay-per-token endpoint rate limits
Pay-per-token endpoints are governed by token-based and query-based rate limits. Token-based rate limits control the maximum number of tokens that can be processed per minute and are enforced separately for input and output tokens.
- Input tokens per minute (ITPM): The maximum number of input tokens (from your prompts) that can be processed within a 60-second window. An ITPM rate limit controls the input token throughput of an endpoint.
- Output tokens per minute (OTPM): The maximum number of output tokens (from the model's responses) that can be generated within a 60-second window. An OTPM rate limit controls the output token throughput of an endpoint.
- Queries per hour (QPH): The maximum number of queries or requests that can be processed within a 60-minute window. For production applications with sustained usage patterns, Databricks recommends provisioned throughput endpoints, which provide guaranteed capacity.
How limits are tracked and enforced
The most restrictive rate limit (ITPM, OTPM, or QPH) applies at any given time. For example, even if you haven't reached your ITPM limit, you might still be rate-limited if you exceed the QPH or OTPM limit. When either the ITPM or OTPM limit is reached, subsequent requests receive a 429 error that indicates too many requests were received. This message persists until the rate limit window resets.
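For illustration, the following sketch estimates which limit binds for a given workload. The limit values are the Enterprise tier numbers from the tables later on this page, and the per-request token averages are assumptions you would replace with your own measurements.

# Example limits (see the rate limit tables below) and assumed per-request averages
ITPM, OTPM, QPH = 200_000, 10_000, 7_200
avg_input_tokens, avg_output_tokens = 2_000, 500

requests_per_minute_by_limit = {
    "ITPM": ITPM // avg_input_tokens,   # 100 requests/minute before input tokens run out
    "OTPM": OTPM // avg_output_tokens,  # 20 requests/minute before output tokens run out
    "QPH": QPH // 60,                   # 120 requests/minute before the hourly query cap binds
}

# The most restrictive limit wins: here OTPM caps sustained traffic at about 20 requests/minute
print(min(requests_per_minute_by_limit, key=requests_per_minute_by_limit.get))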
Databricks tracks and enforces tokens per minute (TPM) rate limits using the following features:
Feature | Details |
---|---|
Token accounting and pre-admission checks | Input tokens are counted against the ITPM limit, and the request's `max_tokens` value is reserved against the OTPM limit before the request is admitted. Unused reserved output tokens are credited back to your allowance after the response completes. |
Burst capacity and smoothing | |
The following is an example of how pre-admission checking and the credit-back behavior work.
# Request with max_tokens specified
request = {
    "prompt": "Write a story about...",  # 10 input tokens
    "max_tokens": 500,                   # System reserves 500 output tokens
}

# Pre-admission check:
# - Verifies 10 tokens against ITPM limit
# - Reserves 500 tokens against OTPM limit
# - If either would exceed limits, returns 429 immediately

# If admitted, actual response uses only 350 tokens
# The system credits back 150 tokens (500 - 350) to your OTPM allowance
# These 150 tokens are immediately available for other requests
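The same mechanics can be expressed as a small, purely illustrative accounting sketch. This is not the service's implementation; the budget values are example Enterprise tier limits.

# Illustrative accounting sketch only; not the service's implementation
itpm_remaining, otpm_remaining = 200_000, 10_000   # example per-minute budgets

# Pre-admission: count the input tokens and reserve max_tokens of output
input_tokens, max_tokens = 10, 500
if input_tokens > itpm_remaining or max_tokens > otpm_remaining:
    raise RuntimeError("429: request would exceed the ITPM or OTPM limit")
itpm_remaining -= input_tokens    # 10 input tokens counted
otpm_remaining -= max_tokens      # 500 output tokens reserved

# After the response: credit back the unused part of the reservation
actual_output_tokens = 350
otpm_remaining += max_tokens - actual_output_tokens   # 150 tokens returned to the budget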
Rate limits by model
The following tables summarize the ITPM, OTPM and QPH rate limits for pay-per-token Foundation Model API endpoints for Enterprise tier workspaces:
Large language models | ITPM limit | OTPM limit | QPH limit | Notes |
---|---|---|---|---|
GPT OSS 120B | 200,000 | 10,000 | 7,200 | General-purpose LLM |
GPT OSS 20B | 200,000 | 10,000 | 7,200 | Smaller GPT variant |
Gemma 3 12B | 200,000 | 10,000 | 7,200 | Google's Gemma model |
Llama 4 Maverick | 200,000 | 10,000 | 2,400 | Latest Llama release |
Llama 3.3 70B Instruct | 200,000 | 10,000 | 2,400 | Mid-size Llama model |
Llama 3.1 8B Instruct | 200,000 | 10,000 | 7,200 | Lightweight Llama model |
Llama 3.1 405B Instruct | 5,000 | 500 | 1,200 | Largest Llama model - reduced limits due to size |
Anthropic Claude models | ITPM limit | OTPM limit | QPH limit | Notes |
---|---|---|---|---|
Claude 3.7 Sonnet | 50,000 | 5,000 | 2,400 | Balanced Claude model |
Claude Sonnet 4 | 50,000 | 5,000 | 60 | Latest Sonnet version |
Claude Opus 4 | 50,000 | 5,000 | 600 | Most capable Claude model |
Embedding models | ITPM limit | OTPM limit | QPH limit | Notes |
---|---|---|---|---|
GTE Large (En) | N/A | N/A | 540,000 | Text embedding model - does not generate normalized embeddings |
BGE Large (En) | N/A | N/A | 2,160,000 | Text embedding model |
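If you throttle requests on the client side, it can help to keep these limits in one place in your code. The following dictionary is only a convenience for your own tooling; the keys are illustrative labels (not endpoint names), and the values come from the Enterprise tier tables above.

# Enterprise tier pay-per-token limits from the tables above (client-side reference only)
PAY_PER_TOKEN_LIMITS = {
    "llama-3-1-8b-instruct":   {"itpm": 200_000, "otpm": 10_000, "qph": 7_200},
    "llama-3-3-70b-instruct":  {"itpm": 200_000, "otpm": 10_000, "qph": 2_400},
    "llama-3-1-405b-instruct": {"itpm": 5_000,   "otpm": 500,    "qph": 1_200},
    "claude-sonnet-4":         {"itpm": 50_000,  "otpm": 5_000,  "qph": 60},
    "gte-large-en":            {"itpm": None,    "otpm": None,   "qph": 540_000},
}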
Best practices for managing TPM rate limits
Step 1. Monitor token usage
Track both input and output token counts separately in your applications:
# Example: Track token usage
# Limits for the model you call (example values; see the rate limit tables above)
ITPM_LIMIT = 200_000   # input tokens per minute
OTPM_LIMIT = 10_000    # output tokens per minute

# model.generate is a placeholder for however you call your endpoint
response = model.generate(prompt)
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens

# Check against limits
if input_tokens > ITPM_LIMIT or output_tokens > OTPM_LIMIT:
    # Implement backoff strategy
    pass
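If you call pay-per-token endpoints through the OpenAI-compatible client, the usage fields above come directly from the chat completions response. The following is a minimal sketch; the workspace URL, token, and endpoint name are placeholders to replace with your own values.

from openai import OpenAI

# Placeholder credentials and workspace URL; replace with your own
client = OpenAI(
    api_key="<your-databricks-token>",
    base_url="https://<your-workspace>/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-8b-instruct",  # example endpoint name
    messages=[{"role": "user", "content": "Summarize rate limiting in one sentence."}],
    max_tokens=100,  # bounds the OTPM reservation for this request
)

print(response.usage.prompt_tokens, response.usage.completion_tokens)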
Step 2. Implement retry logic
Add exponential backoff when you encounter rate limit errors:
import time
import random

def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
):
    """Retry a function with exponential backoff."""
    num_retries = 0
    delay = initial_delay
    while num_retries < max_retries:
        try:
            return func()
        except Exception as e:
            if "rate_limit" in str(e) or "429" in str(e):
                num_retries += 1
                if jitter:
                    delay *= exponential_base * (1 + random.random())
                else:
                    delay *= exponential_base
                time.sleep(delay)
            else:
                raise e
    raise Exception(f"Maximum retries {max_retries} exceeded")
Step 3. Optimize token usage
- Minimize prompt length: Use concise, well-structured prompts
- Control output length: Use the `max_tokens` parameter to limit response size
- Set max_tokens explicitly for Claude Sonnet 4: Always specify `max_tokens` when using Claude Sonnet 4 to avoid the default 1,000 token limit, as shown in the sketch after this list
- Batch efficiently: Group related requests when possible while staying within limits
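For Claude Sonnet 4, setting `max_tokens` explicitly looks like the following sketch. It reuses the OpenAI-compatible client from the Step 1 example, and the endpoint name is only an example.

# Sketch: always pass max_tokens when calling Claude Sonnet 4
response = client.chat.completions.create(
    model="databricks-claude-sonnet-4",  # example endpoint name
    messages=[{"role": "user", "content": "Draft a one-paragraph summary."}],
    max_tokens=512,  # explicit cap: avoids the 1,000-token default and bounds the OTPM reservation
)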
Step 4. Consider model selection
- Smaller models for high-volume tasks: Use models like Llama 3.1 8B for tasks that require higher throughput
- Large models for complex tasks: Reserve Llama 3.1 405B for tasks that require maximum capability
Monitoring and troubleshooting
Monitor your token usage patterns to optimize performance:
# Example: Log token usage for monitoring
import logging

logger = logging.getLogger(__name__)

# ITPM_LIMIT and OTPM_LIMIT are the limits for your model, as defined in Step 1
def log_token_usage(response):
    usage = response.usage
    logger.info(f"Input tokens: {usage.prompt_tokens}")
    logger.info(f"Output tokens: {usage.completion_tokens}")
    logger.info(f"Total tokens: {usage.total_tokens}")

    # Alert if approaching limits
    if usage.prompt_tokens > ITPM_LIMIT * 0.8:
        logger.warning("Approaching ITPM limit")
    if usage.completion_tokens > OTPM_LIMIT * 0.8:
        logger.warning("Approaching OTPM limit")
Handle rate limit errors
When you exceed rate limits, the API returns a 429 Too Many Requests error:
{
  "error": {
    "message": "Rate limit exceeded: ITPM limit of 200,000 tokens reached",
    "type": "rate_limit_exceeded",
    "code": 429,
    "limit_type": "input_tokens_per_minute",
    "limit": 200000,
    "current": 200150,
    "retry_after": 15
  }
}
The error response includes:
- `limit_type`: Which specific limit was exceeded (ITPM, OTPM, QPS, or QPH)
- `limit`: The configured limit value
- `current`: Your current usage
- `retry_after`: Suggested wait time in seconds (see the sketch after this list)
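If your client surfaces the error body shown above, one simple strategy is to wait for the suggested interval before retrying. The following is a sketch that assumes you have already parsed the JSON body into a dictionary.

import time

def wait_for_retry(error_body: dict) -> None:
    """Sleep for the server-suggested interval before retrying."""
    retry_after = error_body.get("error", {}).get("retry_after", 1)  # default to 1 second
    time.sleep(retry_after)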
Common issues and solutions
Issue | Solution |
---|---|
Frequent 429 errors | Implement exponential backoff, reduce request rate, and request higher rate limits |
ITPM limit reached | Optimize prompt length |
OTPM limit reached | Use `max_tokens` to limit response length |
QPH limit reached | Distribute requests more evenly over time |
Provisioned throughput limits
For production workloads that require higher limits, provisioned throughput endpoints offer:
- No TPM restrictions: Processing capacity based on provisioned resources
- Higher rate limits: Up to 200 queries per second per workspace
- Predictable performance: Dedicated resources ensure consistent latency
The following are limitations for provisioned throughput workloads:
- To use the DBRX model architecture for a provisioned throughput workload, your serving endpoint must be in `us-east-1` or `us-west-2`.
- For provisioned throughput workloads that use Llama 4 Maverick:
  - Support for this model on provisioned throughput workloads is in Public Preview.
  - Autoscaling is not supported.
  - Metrics panels are not supported.
  - Traffic splitting is not supported on an endpoint that serves Llama 4 Maverick. You cannot serve multiple models on an endpoint that serves Llama 4 Maverick.
- To deploy a Meta Llama model from `system.ai` in Unity Catalog, you must choose the applicable Instruct version. Base versions of the Meta Llama models are not supported for deployment from Unity Catalog. See Deploy provisioned throughput endpoints.
Regional availability and data processing
For Databricks-hosted foundation model region availability, see Foundation Model overview.
For data processing and residency details, see Data processing and residency.