Model Serving limits and regions

This article summarizes the limitations and region availability for Databricks Model Serving and supported endpoint types.

Limitations

Databricks Model Serving imposes default limits to ensure reliable performance. If you have feedback on these limits, please reach out to your Databricks account team.

The following table summarizes resource and payload limitations for model serving endpoints.

| Feature | Granularity | Limit |
| --- | --- | --- |
| Payload size | Per request | 16 MB |
| Queries per second (QPS) | Per workspace | 200 by default; can be increased to 25,000 or more by reaching out to your Databricks account team |
| Model execution duration | Per request | 120 seconds |
| CPU endpoint model memory usage | Per endpoint | 5 GB |
| GPU endpoint model memory usage | Per endpoint | Greater than or equal to the assigned GPU memory; depends on the GPU workload size |
| Provisioned concurrency | Per workspace | 200 concurrency; can be increased by reaching out to your Databricks account team |
| Overhead latency | Per request | Less than 50 milliseconds |
| Foundation Model APIs (pay-per-token) rate limits | Per workspace | 2 queries per second by default for chat and completion models; 300 embedding inputs per second by default for embedding models. Reach out to your Databricks account team to increase these limits |
| Foundation Model APIs (provisioned throughput) rate limits | Per workspace | Same as the Model Serving QPS limit above |
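Because the QPS and rate limits above are enforced per workspace, clients can receive HTTP 429 responses under load. A minimal retry sketch, assuming a hypothetical endpoint invocation URL and access token (both placeholders, not values from this article); it also checks the 16 MB payload limit before sending:

```python
import json
import time
import urllib.error
import urllib.request

def backoff_delays(retries, base=1.0, factor=2.0):
    """Deterministic exponential backoff schedule: base, base*factor, ..."""
    return [base * factor**i for i in range(retries)]

def query_endpoint(url, token, payload, retries=5):
    """POST a JSON payload to a serving endpoint, retrying on HTTP 429.

    `url` and `token` are placeholders for your workspace's endpoint
    invocation URL and a personal access token.
    """
    body = json.dumps(payload).encode()
    if len(body) > 16 * 1024 * 1024:  # stay under the 16 MB per-request limit
        raise ValueError("payload exceeds the 16 MB request limit")
    for delay in backoff_delays(retries) + [None]:
        req = urllib.request.Request(
            url,
            data=body,
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code == 429 and delay is not None:
                time.sleep(delay)  # back off, then retry
                continue
            raise
```

Exponential backoff keeps a bursty client from repeatedly hitting the workspace QPS ceiling once throttling begins.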

Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace, like IP allowlists and PrivateLink.

Additional limitations exist:

  • It is possible for a workspace to be deployed in a supported region but served by a control plane in a different region. Such workspaces do not support Model Serving, and you see an error message stating that your workspace is not supported. Reach out to your Databricks account team for more information.

  • Model Serving does not support init scripts.

  • Model Serving is not currently in compliance with HIPAA regulations.

Foundation Model APIs limits

Note

As part of providing the Foundation Model APIs, Databricks may process your data outside of the region and cloud provider where your data originated.

The following are limits relevant to Foundation Model APIs workloads:

  • Foundation Model APIs workloads are not compliant with HIPAA or the compliance security profile.

  • For Foundation Model APIs endpoints, only workspace admins can change governance settings, such as rate limits. To change a rate limit, use the following steps:

    1. Open the Serving UI in your workspace to see your serving endpoints.

    2. From the kebab menu on the Foundation Model APIs endpoint you want to edit, select View details.

    3. From the kebab menu on the upper-right side of the endpoint details page, select Change rate limit.
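The steps above go through the Serving UI; endpoint details can also be read programmatically via the documented `GET /api/2.0/serving-endpoints/{name}` REST API. A minimal sketch of building that request (the host, endpoint name, and token below are hypothetical placeholders):

```python
import urllib.request

def endpoint_details_request(host, endpoint_name, token):
    """Build a GET request for the serving endpoint details API.

    `host` is your workspace URL (e.g. "https://<workspace>.cloud.databricks.com"),
    `endpoint_name` is the serving endpoint's name, and `token` is a
    personal access token -- all placeholders here.
    """
    url = f"{host}/api/2.0/serving-endpoints/{endpoint_name}"
    headers = {"Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, headers=headers)

# Usage sketch (not executed here):
# req = endpoint_details_request("https://example.cloud.databricks.com",
#                                "my-endpoint", "<token>")
# with urllib.request.urlopen(req) as resp:
#     details = resp.read()
```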

Region availability

Note

If you require an endpoint in an unsupported region, reach out to your Databricks account team.

| Region | Location | Core Model Serving capability * | Foundation Model APIs (provisioned throughput) ** | Foundation Model APIs (pay-per-token) | External models |
| --- | --- | --- | --- | --- | --- |
| ap-northeast-1 | Asia Pacific (Tokyo) | | | | |
| ap-northeast-2 | Asia Pacific (Seoul) | | | | |
| ap-south-1 | Asia Pacific (Mumbai) | | | | |
| ap-southeast-1 | Asia Pacific (Singapore) | X | | | X |
| ap-southeast-2 | Asia Pacific (Sydney) | X | X | | X |
| ca-central-1 | Canada (Central) | X | X | | X |
| eu-central-1 | EU (Frankfurt) | X | X | | X |
| eu-west-1 | EU (Ireland) | X | X | | X |
| eu-west-2 | EU (London) | | | | |
| eu-west-3 | EU (Paris) | | | | |
| sa-east-1 | South America (Sao Paulo) | | | | |
| us-west-1 | US West (Northern California) | | | | |
| us-west-2 | US West (Oregon) | X | X | X | X |
| us-east-1 | US East (Northern Virginia) | X | X | X | X |
| us-east-2 | US East (Ohio) | X | X | X | X |

* CPU compute only

** Includes GPU support
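For client-side routing decisions, a capability column of the table above can be encoded as a simple lookup. The set below reflects the pay-per-token markings in this snapshot of the table; it is an assumption to verify against the current documentation before relying on it:

```python
# Regions marked with X in the Foundation Model APIs (pay-per-token)
# column of the table above -- assumed from this snapshot, verify
# against current docs before use.
PAY_PER_TOKEN_REGIONS = {"us-east-1", "us-east-2", "us-west-2"}

def supports_pay_per_token(region: str) -> bool:
    """Return True if the region is marked for pay-per-token support."""
    return region in PAY_PER_TOKEN_REGIONS
```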