Foundation Model APIs limits and quotas

This page describes the limits and quotas for Databricks Foundation Model APIs workloads.

Databricks Foundation Model APIs enforce rate limits to ensure reliable performance and fair resource allocation across all users. These limits vary based on the workspace platform tier, the foundation model type, and how you deploy your foundation model.

Pay-per-token endpoint rate limits

Pay-per-token endpoints are governed by token-based and query-based rate limits. Token-based rate limits control the maximum number of tokens that can be processed per minute and are enforced separately for input and output tokens.

  • Input tokens per minute (ITPM): The maximum number of input tokens (from your prompts) that can be processed within a 60-second window. An ITPM rate limit controls the input token throughput of an endpoint.
  • Output tokens per minute (OTPM): The maximum number of output tokens (from the model's responses) that can be generated within a 60-second window. An OTPM rate limit controls the output token throughput of an endpoint.
  • Queries per hour (QPH): The maximum number of queries or requests that can be processed within a 60-minute window. For production applications with sustained usage patterns, Databricks recommends provisioned throughput endpoints, which provide guaranteed capacity.

How limits are tracked and enforced

The most restrictive rate limit (ITPM, OTPM, or QPH) applies at any given time. For example, even if you haven't reached your ITPM limit, you might still be rate limited if you exceed the QPH or OTPM limit. When the ITPM or OTPM limit is reached, subsequent requests receive a 429 error indicating that too many requests were received. This message persists until the rate limit window resets.
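
As an illustration of how the most restrictive limit applies, the following sketch estimates the ITPM, OTPM, and QPH that a planned workload would consume and identifies which limit binds first. The limit values and the workload profile are placeholder assumptions; the limits shown are examples taken from the Enterprise tier tables later on this page.

Python
# Illustrative only: estimate which rate limit a planned workload hits first.
# Example limits from the tables below (e.g., Llama 3.3 70B Instruct).
ITPM_LIMIT = 200_000   # input tokens per minute
OTPM_LIMIT = 10_000    # output tokens per minute
QPH_LIMIT = 2_400      # queries per hour

# Assumed workload profile (placeholders).
requests_per_minute = 30
avg_input_tokens = 1_500
avg_output_tokens = 400

usage = {
    "ITPM": requests_per_minute * avg_input_tokens,   # 45,000
    "OTPM": requests_per_minute * avg_output_tokens,  # 12,000
    "QPH": requests_per_minute * 60,                  # 1,800
}
limits = {"ITPM": ITPM_LIMIT, "OTPM": OTPM_LIMIT, "QPH": QPH_LIMIT}

# The most restrictive limit is the one with the highest utilization.
utilization = {name: usage[name] / limits[name] for name in limits}
binding = max(utilization, key=utilization.get)
print(f"Binding limit: {binding} at {utilization[binding]:.0%} of capacity")
# Here OTPM is exceeded (120%), so requests would be throttled even though
# ITPM and QPH usage are below their limits.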

Databricks tracks and enforces tokens per minute (TPM) rate limits using the following features:

Token accounting and pre-admission checks

  • Input token counting: Input tokens are counted from your actual prompt at request time.

  • Output token estimation: If you provide max_tokens in your request, Databricks uses this value to estimate and reserve output token capacity before the request is admitted for processing.

  • Pre-admission validation: Databricks checks if your request would exceed ITPM or OTPM limits before processing begins. If max_tokens would cause you to exceed OTPM limits, Databricks rejects the request immediately with a 429 error.

  • Actual vs estimated output: After the response is generated, the actual output tokens are counted. Importantly, if the actual token usage is less than the reserved max_tokens, Databricks credits the difference back to your rate limit allowance, making those tokens immediately available for other requests.

  • No max_tokens specified: If you don't specify max_tokens, Databricks uses a default reservation, and the actual token count is reconciled after generation.

    Note: Claude Sonnet 4 specifically defaults to 1,000 output tokens when max_tokens is not set and returns a finish reason of "length" when that limit is reached. This default is not the model's maximum context length. Claude 3.7 Sonnet has no such default.

Burst capacity and smoothing

  • Burst buffer: The rate limiter includes a small buffer to accommodate short bursts of traffic above the nominal rate.
  • Sliding window: Token consumption is tracked using a sliding window algorithm that provides smoother rate limiting than hard per-minute boundaries.
  • Token bucket algorithm: Databricks uses a token bucket implementation that allows for some burst capacity while maintaining the average rate limit over time.
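
The following is a minimal sketch of the token bucket idea described above, not the service's actual limiter. The class, the capacity, and the refill values are illustrative assumptions chosen to mirror a 10,000 OTPM limit with a small burst buffer.

Python
import time

# Illustrative token bucket: refills at a steady rate, allows limited bursts.
class TokenBucket:
    def __init__(self, rate_per_minute: float, burst_capacity: float):
        self.rate = rate_per_minute / 60.0   # tokens added per second
        self.capacity = burst_capacity       # maximum tokens the bucket holds
        self.tokens = burst_capacity
        self.last_refill = time.monotonic()

    def try_consume(self, n: int) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # the caller should back off; the service returns 429 here

bucket = TokenBucket(rate_per_minute=10_000, burst_capacity=11_000)
print(bucket.try_consume(500))  # True while burst capacity remains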

The following is an example of how pre-admission checking and the credit-back behavior work.

Python
# Request with max_tokens specified
request = {
    "prompt": "Write a story about...",  # 10 input tokens
    "max_tokens": 500,  # System reserves 500 output tokens
}

# Pre-admission check:
# - Verifies 10 tokens against the ITPM limit
# - Reserves 500 tokens against the OTPM limit
# - If either would exceed limits, returns 429 immediately

# If admitted and the actual response uses only 350 tokens,
# the system credits back 150 tokens (500 - 350) to your OTPM allowance.
# These 150 tokens are immediately available for other requests.

Rate limits by model

The following tables summarize the ITPM, OTPM, and QPH rate limits for pay-per-token Foundation Model API endpoints for Enterprise tier workspaces:

Large language models | ITPM limit | OTPM limit | QPH limit | Notes
--- | --- | --- | --- | ---
GPT OSS 120B | 200,000 | 10,000 | 7,200 | General-purpose LLM
GPT OSS 20B | 200,000 | 10,000 | 7,200 | Smaller GPT variant
Gemma 3 12B | 200,000 | 10,000 | 7,200 | Google's Gemma model
Llama 4 Maverick | 200,000 | 10,000 | 2,400 | Latest Llama release
Llama 3.3 70B Instruct | 200,000 | 10,000 | 2,400 | Mid-size Llama model
Llama 3.1 8B Instruct | 200,000 | 10,000 | 7,200 | Lightweight Llama model
Llama 3.1 405B Instruct | 5,000 | 500 | 1,200 | Largest Llama model - reduced limits due to size

Anthropic Claude models | ITPM limit | OTPM limit | QPH limit | Notes
--- | --- | --- | --- | ---
Claude 3.7 Sonnet | 50,000 | 5,000 | 2,400 | Balanced Claude model
Claude Sonnet 4 | 50,000 | 5,000 | 60 | Latest Sonnet version
Claude Opus 4 | 50,000 | 5,000 | 600 | Most capable Claude model

Embedding models | ITPM limit | OTPM limit | QPH limit | Notes
--- | --- | --- | --- | ---
GTE Large (En) | N/A | N/A | 540,000 | Text embedding model - does not generate normalized embeddings
BGE Large (En) | N/A | N/A | 2,160,000 | Text embedding model

Best practices for managing TPM rate limits

Step 1. Monitor token usage

Track both input and output token counts separately in your applications:

Python
# Example: Track token usage
# Example limits for your chosen model, taken from the tables above
ITPM_LIMIT = 200_000
OTPM_LIMIT = 10_000

response = model.generate(prompt)
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens

# Check against limits
if input_tokens > ITPM_LIMIT or output_tokens > OTPM_LIMIT:
    # Implement backoff strategy
    pass

Step 2. Implement retry logic

Add exponential backoff when you encounter rate limit errors:

Python
import time
import random

def retry_with_exponential_backoff(
    func,
    initial_delay: float = 1,
    exponential_base: float = 2,
    jitter: bool = True,
    max_retries: int = 10,
):
    """Retry a function with exponential backoff."""
    num_retries = 0
    delay = initial_delay

    while num_retries < max_retries:
        try:
            return func()
        except Exception as e:
            if "rate_limit" in str(e) or "429" in str(e):
                num_retries += 1

                if jitter:
                    delay *= exponential_base * (1 + random.random())
                else:
                    delay *= exponential_base

                time.sleep(delay)
            else:
                raise e

    raise Exception(f"Maximum retries {max_retries} exceeded")

Step 3. Optimize token usage

  • Minimize prompt length: Use concise, well-structured prompts
  • Control output length: Use max_tokens parameter to limit response size
  • Set max_tokens explicitly for Claude Sonnet 4: Always specify max_tokens when using Claude Sonnet 4 to avoid the default 1,000-token limit (see the sketch after this list)
  • Batch efficiently: Group related requests when possible while staying within limits
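
The following is a minimal sketch of setting max_tokens explicitly using an OpenAI-compatible client. The workspace URL, token, and endpoint name are placeholders; verify the Claude Sonnet 4 endpoint name in your workspace before use.

Python
from openai import OpenAI

# Placeholders: replace with your workspace hostname and a Databricks token.
client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",
    base_url="https://<workspace-hostname>/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-claude-sonnet-4",  # assumed endpoint name; confirm in your workspace
    messages=[{"role": "user", "content": "Summarize the quarterly report."}],
    max_tokens=4000,  # set explicitly so output is not cut off at the 1,000-token default
)
print(response.choices[0].message.content)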

Step 4. Consider model selection

  • Smaller models for high-volume tasks: Use models like Llama 3.1 8B for tasks that require higher throughput
  • Large models for complex tasks: Reserve Llama 3.1 405B for tasks that require maximum capability

Monitoring and troubleshooting

Monitor your token usage patterns to optimize performance:

Python
# Example: Log token usage for monitoring
import logging

logger = logging.getLogger(__name__)

def log_token_usage(response):
    usage = response.usage
    logger.info(f"Input tokens: {usage.prompt_tokens}")
    logger.info(f"Output tokens: {usage.completion_tokens}")
    logger.info(f"Total tokens: {usage.total_tokens}")

    # Alert if approaching limits
    if usage.prompt_tokens > ITPM_LIMIT * 0.8:
        logger.warning("Approaching ITPM limit")
    if usage.completion_tokens > OTPM_LIMIT * 0.8:
        logger.warning("Approaching OTPM limit")

Handle rate limit errors

When you exceed rate limits, the API returns a 429 Too Many Requests error:

JSON
{
  "error": {
    "message": "Rate limit exceeded: ITPM limit of 200,000 tokens reached",
    "type": "rate_limit_exceeded",
    "code": 429,
    "limit_type": "input_tokens_per_minute",
    "limit": 200000,
    "current": 200150,
    "retry_after": 15
  }
}

The error response includes:

  • limit_type: Which specific limit was exceeded (ITPM, OTPM, QPS, or QPH)
  • limit: The configured limit value
  • current: Your current usage
  • retry_after: Suggested wait time in seconds
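
The following is a minimal sketch of honoring retry_after when a request is throttled. ENDPOINT_URL and HEADERS are placeholders for your serving endpoint URL and authentication, and the exact error payload shape may vary, so the code falls back to a short wait if the field is absent.

Python
import time
import requests

# Placeholders: replace with your endpoint's invocation URL and a Databricks token.
ENDPOINT_URL = "https://<workspace-hostname>/serving-endpoints/<endpoint-name>/invocations"
HEADERS = {"Authorization": "Bearer <DATABRICKS_TOKEN>"}

def query_with_retry_after(payload, max_attempts=5):
    for _ in range(max_attempts):
        resp = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload)
        if resp.status_code != 429:
            return resp
        # Honor the suggested wait; fall back to 15 seconds if it is missing.
        wait = resp.json().get("error", {}).get("retry_after", 15)
        time.sleep(wait)
    raise RuntimeError("Still rate limited after maximum retries")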

Common issues and solutions

Issue | Solution
--- | ---
Frequent 429 errors | Implement exponential backoff, reduce request rate, and request higher rate limits
ITPM limit reached | Optimize prompt length
OTPM limit reached | Use max_tokens to limit response length
QPH limit reached | Distribute requests more evenly over time

Provisioned throughput limits

For production workloads that require higher limits, provisioned throughput endpoints offer:

  • No TPM restrictions: Processing capacity based on provisioned resources
  • Higher rate limits: Up to 200 queries per second per workspace
  • Predictable performance: Dedicated resources ensure consistent latency

The following are limitations for provisioned throughput workloads:

  • To use the DBRX model architecture for a provisioned throughput workload, your serving endpoint must be in us-east-1 or us-west-2.
  • For provisioned throughput workloads that use Llama 4 Maverick:
    • Support for this model on provisioned throughput workloads is in Public Preview.
    • Autoscaling is not supported.
    • Metrics panels are not supported.
    • Traffic splitting is not supported on an endpoint that serves Llama 4 Maverick. You cannot serve multiple models on an endpoint that serves Llama 4 Maverick.
  • To deploy a Meta Llama model from system.ai in Unity Catalog, you must choose the applicable Instruct version. Base versions of the Meta Llama models are not supported for deployment from Unity Catalog. See Deploy provisioned throughput endpoints.

Regional availability and data processing

For Databricks-hosted foundation model region availability, see Foundation Model overview.

For data processing and residency details, see Data processing and residency.

Additional resources