Model Serving limits and regions

This article summarizes the limitations and region availability for Mosaic AI Model Serving and supported endpoint types.

Resource and payload limits

Mosaic AI Model Serving imposes default limits to ensure reliable performance. If you have feedback on these limits, reach out to your Databricks account team.

The following table summarizes resource and payload limitations for model serving endpoints.

Feature	Granularity	Limit
Payload size	Per request	16 MB. For endpoints serving foundation models, external models, or AI agents the limit is 4 MB.
Request/response size	Per request	Any request/response over 1 MB will not be logged.
Queries per second (QPS)	Per workspace	200. For higher QPS, enable route optimization.
Model execution duration	Per request	120 seconds
CPU endpoint model memory usage	Per endpoint	4 GB
GPU endpoint model memory usage	Per endpoint	Greater than or equal to assigned GPU memory, depends on the GPU workload size
Provisioned concurrency	Per model and per workspace	200 concurrency. Can be increased by reaching out to your Databricks account team.
Overhead latency	Per request	Less than 50 milliseconds
Init scripts		Init scripts are not supported.
Foundation Model APIs (pay-per-token) rate limits	Per workspace	If the following limits are insufficient for your use case, Databricks recommends using provisioned throughput. Claude Sonnet 4 has a limit of 2 queries per second. Claude Opus 4 has a limit of 2 queries per second. Llama 4 Maverick has a limit of 4 queries per second and 2400 queries per hour. Claude 3.7 Sonnet has a limit of 4 queries per second and 2400 queries per hour. Llama 3.3 70B Instruct has a limit of 4 queries per second and 2400 queries per hour. Llama 3.1 405B Instruct has a limit of 1 query per second and 1200 queries per hour. Llama 3.1 8B Instruct has a limit of 2 queries per second. GTE Large (En) has a limit of 150 queries per second. BGE Large (En) has a limit of 600 queries per second.
Foundation Model APIs (provisioned throughput) rate limits	Per workspace	200 queries per second.
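
The limits in the table above can be guarded against on the client side before a request is sent. The following is a minimal sketch, not an official Databricks API: `check_payload` and `MultiWindowLimiter` are hypothetical helper names, the byte value of 1 MB is an assumption (the table does not define it), and the sliding-window logic only approximates however the service actually counts requests.

```python
import json
import time
from collections import deque

# Assumption: 1 MB = 2**20 bytes. The table does not say which definition applies.
MB = 1024 * 1024

PAYLOAD_LIMIT = 16 * MB            # default per-request payload limit
PAYLOAD_LIMIT_FOUNDATION = 4 * MB  # foundation models, external models, AI agents
LOGGING_LIMIT = 1 * MB             # larger requests/responses are not logged


def check_payload(body, foundation_like=False):
    """Estimate the serialized size of a JSON body and compare it to the limits."""
    size = len(json.dumps(body).encode("utf-8"))
    limit = PAYLOAD_LIMIT_FOUNDATION if foundation_like else PAYLOAD_LIMIT
    return {
        "size_bytes": size,
        "within_limit": size <= limit,
        "will_be_logged": size <= LOGGING_LIMIT,
    }


class MultiWindowLimiter:
    """Client-side limiter enforcing several (max_calls, window_seconds) pairs at once."""

    def __init__(self, limits, clock=time.monotonic):
        self.limits = list(limits)
        self.clock = clock
        self.calls = deque()  # timestamps of accepted calls, oldest first

    def try_acquire(self):
        """Return True and record the call if every window still has room."""
        now = self.clock()
        longest = max(window for _, window in self.limits)
        # Drop timestamps that can no longer count toward any window.
        while self.calls and now - self.calls[0] >= longest:
            self.calls.popleft()
        for max_calls, window in self.limits:
            recent = sum(1 for t in self.calls if now - t < window)
            if recent >= max_calls:
                return False
        self.calls.append(now)
        return True
```

For example, `MultiWindowLimiter([(4, 1.0), (2400, 3600.0)])` mirrors the 4 queries per second and 2400 queries per hour listed for Llama 4 Maverick. A caller would check `try_acquire()` before each request and back off when it returns `False`. The size reported by `check_payload` is an estimate, because the HTTP client may serialize the body slightly differently.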

Networking and security limitations

Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace, like IP allowlists and PrivateLink.
By default, Model Serving does not support PrivateLink to external endpoints. Support for this functionality is evaluated and implemented on a per-region basis. Reach out to your Databricks account team for more information.
Model Serving does not provide security patches to existing model images because of the risk of destabilization to production deployments. A new model image created from a new model version will contain the latest patches. Reach out to your Databricks account team for more information.
You can restrict outbound network access from Model Serving endpoints by configuring network policies. See Manage network policies for serverless egress control.

Compliance security profile standards: CPU and GPU workloads

The following table lists the region availability and the supported compliance security profile standards for model serving on CPU and GPU workloads.

note

These compliance standards require served containers to have been built within the last 30 days. Databricks automatically rebuilds outdated containers on your behalf. However, if this automated job fails, an event log message like the following appears and provides guidance on how to keep your endpoints within compliance requirements:

"Databricks couldn't complete a scheduled compliance check for model $servedModelName. This can happen if the system can't apply a required update. To resolve, try relogging your model. If the issue persists, contact support@databricks.com."
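
The 30-day requirement in the note above can also be monitored from your own tooling. The sketch below only shows the date arithmetic. `needs_rebuild` is a hypothetical helper, and how you obtain the build timestamp of a served container is an assumption left to the caller, since this article does not describe an API for it.

```python
from datetime import datetime, timedelta, timezone

# From the note above: served containers must have been built within the last 30 days.
MAX_IMAGE_AGE = timedelta(days=30)


def needs_rebuild(built_at, now=None):
    """Return True if a container built at `built_at` is older than 30 days.

    `built_at` must be a timezone-aware datetime. Where it comes from is
    up to the caller (hypothetical; not specified by this article).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - built_at > MAX_IMAGE_AGE
```

If the check returns `True` and the automated rebuild has failed, the remedy suggested by the event log message is to relog the model.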

Region	Location	HIPAA	PCI-DSS	FedRAMP Moderate	IRAP	CCCS Medium (Protected B)	UK Cyber Essentials Plus
`ap-northeast-1`	Asia Pacific (Tokyo)	✓	✓
`ap-northeast-2`	Asia Pacific (Seoul)	✓	✓
`ap-south-1`	Asia Pacific (Mumbai)	✓	✓
`ap-southeast-1`	Asia Pacific (Singapore)	✓	✓
`ap-southeast-2`	Asia Pacific (Sydney)	✓	✓		✓
`ca-central-1`	Canada (Central)	✓	✓			✓
`eu-central-1`	EU (Frankfurt)	✓	✓
`eu-west-1`	EU (Ireland)	✓	✓
`eu-west-2`	EU (London)	✓	✓				✓
`eu-west-3`	EU (Paris)
`sa-east-1`	South America (Sao Paulo)	✓	✓
`us-east-1`	US East (Northern Virginia)	✓	✓	✓
`us-east-2`	US East (Ohio)	✓	✓
`us-gov-west-1`	US Gov West (Pendleton)
`us-west-1`	US West (Northern California)
`us-west-2`	US West (Oregon)	✓	✓	✓
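
If you need to pick a region programmatically, the table above can be transcribed into a lookup. This is a minimal sketch: the dictionary copies the check marks from the CPU and GPU table as it appears in this article, so it goes stale whenever the table changes, and `regions_supporting` is a hypothetical helper name.

```python
# Transcribed from the CPU and GPU workloads table above.
BASE = frozenset({"HIPAA", "PCI-DSS"})

COMPLIANCE_BY_REGION = {
    "ap-northeast-1": BASE,
    "ap-northeast-2": BASE,
    "ap-south-1": BASE,
    "ap-southeast-1": BASE,
    "ap-southeast-2": BASE | {"IRAP"},
    "ca-central-1": BASE | {"CCCS Medium (Protected B)"},
    "eu-central-1": BASE,
    "eu-west-1": BASE,
    "eu-west-2": BASE | {"UK Cyber Essentials Plus"},
    "eu-west-3": frozenset(),
    "sa-east-1": BASE,
    "us-east-1": BASE | {"FedRAMP Moderate"},
    "us-east-2": BASE,
    "us-gov-west-1": frozenset(),
    "us-west-1": frozenset(),
    "us-west-2": BASE | {"FedRAMP Moderate"},
}


def regions_supporting(required):
    """Return the sorted regions whose row includes every standard in `required`."""
    required = set(required)
    return sorted(
        region
        for region, standards in COMPLIANCE_BY_REGION.items()
        if required <= standards
    )
```

For instance, `regions_supporting({"HIPAA", "IRAP"})` narrows the list to the regions whose row carries both check marks.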

Compliance security profile standards: Provisioned throughput

The following table lists the supported compliance security profile standards for Foundation Model APIs provisioned throughput workloads.

Region	Location	HIPAA	PCI-DSS	FedRAMP Moderate	IRAP	CCCS Medium (Protected B)	UK Cyber Essentials Plus
`ap-northeast-1`	Asia Pacific (Tokyo)	✓	✓
`ap-northeast-2`	Asia Pacific (Seoul)	✓	✓
`ap-south-1`	Asia Pacific (Mumbai)	✓	✓
`ap-southeast-1`	Asia Pacific (Singapore)	✓	✓
`ap-southeast-2`	Asia Pacific (Sydney)	✓	✓		✓
`ca-central-1`	Canada (Central)	✓	✓			✓
`eu-central-1`	EU (Frankfurt)	✓	✓
`eu-west-1`	EU (Ireland)	✓	✓
`eu-west-2`	EU (London)	✓	✓				✓*
`eu-west-3`	EU (Paris)
`sa-east-1`	South America (Sao Paulo)	✓	✓
`us-east-1`	US East (Northern Virginia)	✓	✓	✓
`us-east-2`	US East (Ohio)	✓	✓
`us-gov-west-1`	US Gov West (Pendleton)
`us-west-1`	US West (Northern California)
`us-west-2`	US West (Oregon)	✓	✓	✓

* Some models require cross geography routing for provisioned throughput and therefore are not UK Cyber Essentials Plus compliant. Reach out to your Databricks account team for more information.

Foundation Model APIs limits

note

As part of providing the Foundation Model APIs, Databricks might process your data outside of the region and cloud provider where your data originated.

For both pay-per-token and provisioned throughput workloads:

Only workspace admins can change governance settings, such as rate limits, for Foundation Model APIs endpoints. To change rate limits, use the following steps:
1. Open the Serving UI in your workspace to see your serving endpoints.
2. From the kebab menu on the Foundation Model APIs endpoint you want to edit, select View details.
3. From the kebab menu on the upper-right side of the endpoints details page, select Change rate limit.
The GTE Large (En) embedding models do not generate normalized embeddings.
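
Because GTE Large (En) embeddings are not normalized, you may want to normalize them yourself before comparing them with a dot product, which then equals cosine similarity. The following is a minimal sketch using only the standard library. `l2_normalize` is a hypothetical helper name, and the vectors are assumed to be plain lists of floats taken from the endpoint response.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Dot product of the two normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

For large batches, a vectorized library would be a better fit than this pure-Python loop.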

Pay-per-token limits

The following are limits relevant to Foundation Model APIs pay-per-token workloads:

Pay-per-token workloads are HIPAA compliant.
- For customers with the compliance security profile enabled, pay-per-token workloads are available provided that either the HIPAA compliance standard or None is selected. Other compliance standards are not supported for pay-per-token workloads.
The following pay-per-token models are supported only in the Foundation Model APIs pay-per-token supported US regions:
- Anthropic Claude Sonnet 4
- Anthropic Claude Opus 4
- Meta Llama 3.1 405B Instruct
- BGE Large (En)
If your workspace is in a Model Serving region but not a US or EU region, your workspace must be enabled for cross-Geo data processing. See Databricks Designated Services for the geographic areas that process pay-per-token workloads and for where workloads are routed when cross-Geo data processing is enabled.

Provisioned throughput limits

The following are limits relevant to Foundation Model APIs provisioned throughput workloads:

Provisioned throughput supports the HIPAA compliance profile and is recommended for workloads that require compliance certifications.
To use the DBRX model architecture for a provisioned throughput workload, your serving endpoint must be in us-east-1 or us-west-2.
For provisioned throughput workloads that use Llama 4 Maverick:
- Support for this model on provisioned throughput workloads is in Public Preview.
- Autoscaling is not supported.
- Metrics panels are not supported.
- Traffic splitting is not supported on an endpoint that serves Llama 4 Maverick. You cannot serve multiple models on an endpoint that serves Llama 4 Maverick.
To deploy a Meta Llama model from system.ai in Unity Catalog, you must choose the applicable Instruct version. Base versions of the Meta Llama models are not supported for deployment from Unity Catalog. See [Recommended] Deploy foundation models from Unity Catalog.

Region availability

note

If you require an endpoint in an unsupported region, reach out to your Databricks account team.

If your workspace is deployed in a region that supports model serving but is served by a control plane in an unsupported region, the workspace does not support model serving. If you attempt to use model serving in such a workspace, you see an error message stating that your workspace is not supported. Reach out to your Databricks account team for more information.

See Model serving feature availability for more information on regional availability of each Model Serving feature.

For Databricks-hosted foundation model region availability, see Foundation models hosted on Databricks.
