
Model Serving limits and regions

This article summarizes the limitations and region availability for Databricks Model Serving and supported endpoint types.

Resource and payload limits

Model Serving imposes default limits to ensure reliable performance. If you have feedback on these limits, reach out to your Databricks account team.

The limits in this section apply to custom model and AI agent endpoints only. For resource and payload limits on Foundation Model APIs and external models, see Foundation Model APIs rate limits and quotas.

Custom models and AI agents

| Feature | Granularity | Limit |
| --- | --- | --- |
| Endpoints | Per workspace | 1000. Reach out to your Databricks account team to increase. |
| Queries per second (QPS) | Per endpoint | 300,000 using route optimization. If 1024 concurrency is not enough, reach out to your Databricks account team to increase. |
| Queries per second (QPS) | Per workspace | 300,000 using route optimization. 200 for non-route optimized, recommended only for small dev use-cases. |
| Provisioned concurrency | Per model | 1024 with custom option and route optimization. Reach out to your Databricks account team to increase. |
| Provisioned concurrency | Per workspace | 4096. Reach out to your Databricks account team to increase. |
| Create/update operations | Per workspace | 50 in 5 minutes. |
| Payload size | Per request | 16 MB. For AI agent endpoints the limit is 4 MB. |
| Model execution duration | Per request | 297 seconds |
| CPU endpoint model memory usage | Per endpoint | 4 GB |
| Environment variables | Per served model | 30. Reach out to your Databricks account team to increase. |
| Overhead latency | Per request | Less than 20 milliseconds with route optimization. |

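Two of the limits above, payload size per request and create/update operations per workspace, can be checked on the client before a call is made. The following is a minimal sketch, not part of any Databricks SDK: `payload_size_ok` and `WriteBudget` are hypothetical helper names. It assumes 1 MB means 1,000,000 bytes (the more conservative reading; this article does not say which definition applies) and that the request body is sent as JSON.

```python
import json
import time
from collections import deque

# Limits taken from the table above. Treating 1 MB as 1,000,000 bytes is an
# assumption; the article does not state whether MB is decimal or binary.
MB = 1_000_000
MAX_PAYLOAD_BYTES = {"custom_model": 16 * MB, "ai_agent": 4 * MB}
MAX_WRITES = 50                 # create/update operations per workspace
WRITE_WINDOW_SECONDS = 5 * 60   # ...in any 5-minute window


def payload_size_ok(payload, endpoint_kind="custom_model"):
    """Return (ok, size_in_bytes) for a JSON-serializable request body."""
    size = len(json.dumps(payload).encode("utf-8"))
    return size <= MAX_PAYLOAD_BYTES[endpoint_kind], size


class WriteBudget:
    """Client-side sliding-window counter for create/update calls.

    Hypothetical helper: it only tracks calls made through this object,
    so it cannot see operations issued by other clients in the workspace.
    """

    def __init__(self, max_calls=MAX_WRITES, window=WRITE_WINDOW_SECONDS,
                 clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self.calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self):
        """Record a call and return True, or return False if the budget is spent."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True
```

For example, `payload_size_ok({"inputs": [1, 2, 3]}, "ai_agent")` reports whether the serialized body fits under the 4 MB agent limit, and a deployment script can call `WriteBudget().try_acquire()` before each endpoint create or update and wait when it returns `False`. The server-side limits remain authoritative; these checks only help avoid requests that are likely to be rejected.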

Networking and security limitations

  • Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace.
  • Model Serving does not provide security patches to existing model images because of the risk of destabilization to production deployments. A new model image created from a new model version will contain the latest patches. Reach out to your Databricks account team for more information.

Foundation Model APIs limits

For detailed information about Foundation Model APIs, including resource and payload limits for foundation and external models, see Foundation Model APIs rate limits and quotas.

Region availability

Note: If you require an endpoint in an unsupported region, reach out to your Databricks account team.

If your workspace is deployed in a region that supports model serving but is served by a control plane in an unsupported region, the workspace does not support model serving. If you attempt to use model serving in such a workspace, you see an error message stating that your workspace is not supported. Reach out to your Databricks account team for more information.

For more information on regional availability of features, see Model serving features availability.

For Databricks-hosted foundation model region availability, see Foundation models hosted on Databricks.