Model Serving limits and regions
This article summarizes the limitations and region availability for Databricks Model Serving and supported endpoint types.
Resource and payload limits
Model Serving imposes default limits to ensure reliable performance. If you have feedback on these limits, reach out to your Databricks account team.
The following table summarizes resource and payload limitations for model serving endpoints.
| Feature | Granularity | Limit |
|---|---|---|
| Payload size | Per request | 16 MB. For endpoints serving foundation models, external models, or AI agents, the limit is 4 MB. |
| Queries per second (QPS) | Per workspace | |
| Model execution duration | Per request | 297 seconds |
| CPU endpoint model memory usage | Per endpoint | 4 GB |
| Provisioned concurrency | Per workspace | 200 concurrency. Can be increased by reaching out to your Databricks account team. |
| Overhead latency | Per request | Less than 50 milliseconds |
| Init scripts | | Init scripts are not supported. |
| Foundation Model APIs rate limits | Per workspace | See Foundation Model APIs rate limits and quotas for detailed information about pay-per-token and provisioned throughput limits. |
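The payload limits in the table can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: the function name `payload_within_limit` and the `reduced_limit` flag are hypothetical, and it assumes that 1 MB means 1024 * 1024 bytes and that the limit applies to the JSON-encoded request body. Neither assumption is stated in this article.

```python
import json

# Assumption: 1 MB = 1024 * 1024 bytes. The article does not say whether
# the limit is measured in decimal or binary megabytes.
MB = 1024 * 1024
DEFAULT_LIMIT_BYTES = 16 * MB  # custom model endpoints
REDUCED_LIMIT_BYTES = 4 * MB   # foundation models, external models, AI agents


def payload_within_limit(payload: dict, reduced_limit: bool = False) -> bool:
    """Return True if the JSON-encoded payload fits under the size limit.

    Assumption: the limit is applied to the UTF-8 encoded JSON body.
    """
    body = json.dumps(payload).encode("utf-8")
    limit = REDUCED_LIMIT_BYTES if reduced_limit else DEFAULT_LIMIT_BYTES
    return len(body) <= limit
```

A caller might run this check before sending a large batch and split the batch into smaller requests when it returns `False`.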
Built-in metrics viewer limitations
Databricks provides a built-in viewer for Model Serving metrics. For longer retention and uninterrupted observability, Databricks recommends that you export serving endpoint metrics to external monitoring systems. The built-in viewer has the following limitations:
- Built-in metrics history is available for up to 14 days.
- After certain endpoint updates, the built-in viewer may display gaps in historical metrics. Endpoint updates are primarily caused by user actions. However, endpoint updates may also be caused by back-end infrastructure changes outside of the user's control. These gaps affect only the metrics display. Your serving endpoint remains fully operational during this time. To avoid gaps in monitoring history, export metrics to an external monitoring system.
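As an illustration of exporting metrics, the sketch below builds an authenticated request for an endpoint's metrics and filters the response down to sample lines that an external system could ingest. It rests on assumptions that should be verified against the Model Serving API reference: that the metrics export path is `/api/2.0/serving-endpoints/{name}/metrics`, that it accepts a bearer token, and that it returns a Prometheus-style text body in which comment lines start with `#`. The function names are hypothetical.

```python
import urllib.parse
import urllib.request


def build_metrics_request(workspace_url: str, endpoint_name: str,
                          token: str) -> urllib.request.Request:
    """Build a GET request for an endpoint's exported metrics.

    Assumption: the export path below matches the current API reference.
    """
    name = urllib.parse.quote(endpoint_name, safe="")
    url = f"{workspace_url.rstrip('/')}/api/2.0/serving-endpoints/{name}/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def sample_lines(text: str) -> list[str]:
    """Keep metric sample lines; drop blank lines and '#' comment lines."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def fetch_metrics(workspace_url: str, endpoint_name: str, token: str) -> list[str]:
    """Fetch metrics over the network and return the sample lines."""
    request = build_metrics_request(workspace_url, endpoint_name, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return sample_lines(response.read().decode("utf-8"))
```

Running `fetch_metrics` on a schedule and forwarding the result to a monitoring system keeps history beyond the built-in retention window and is unaffected by display gaps in the viewer.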
Networking and security limitations
- Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace.
- Model Serving does not provide security patches to existing model images because of the risk of destabilization to production deployments. A new model image created from a new model version will contain the latest patches. Reach out to your Databricks account team for more information.
Foundation Model APIs limits
For detailed information about Foundation Model APIs, see:
- Rate limits and quotas: Foundation Model APIs rate limits and quotas. Includes TPM limits, regional availability, and model-specific restrictions.
- Compliance and security: Foundation Model APIs compliance and security. Covers compliance standards, data processing, and security requirements.
Region availability
If you require an endpoint in an unsupported region, reach out to your Databricks account team.
If your workspace is deployed in a region that supports model serving but is served by a control plane in an unsupported region, the workspace does not support model serving. If you attempt to use model serving in such a workspace, you see an error message stating that your workspace is not supported. Reach out to your Databricks account team for more information.
For more information on regional availability of features, see Model serving features availability.
For Databricks-hosted foundation model region availability, see Foundation models hosted on Databricks.