# Model Serving limits and regions
This article summarizes the limitations and region availability for Mosaic AI Model Serving and supported endpoint types.
## Limitations
Mosaic AI Model Serving imposes default limits to ensure reliable performance. If you have feedback on these limits, please reach out to your Databricks account team.
The following table summarizes resource and payload limitations for model serving endpoints.
| Feature | Granularity | Limit |
| --- | --- | --- |
| Payload size | Per request | 16 MB. For endpoints serving foundation models or external models, the limit is 4 MB. |
| Queries per second (QPS) | Per workspace | 200, but can be increased to 25,000 or more by reaching out to your Databricks account team. |
| Model execution duration | Per request | 120 seconds |
| CPU endpoint model memory usage | Per endpoint | 4 GB |
| GPU endpoint model memory usage | Per endpoint | Greater than or equal to assigned GPU memory, depending on the GPU workload size |
| Provisioned concurrency | Per workspace | 200 concurrency. Can be increased by reaching out to your Databricks account team. |
| Overhead latency | Per request | Less than 50 milliseconds |
| Foundation Model APIs (pay-per-token) rate limits | Per workspace | If these limits are insufficient for your use case, Databricks recommends using provisioned throughput. |
| Foundation Model APIs (provisioned throughput) rate limits | Per workspace | 200 |
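To illustrate how the per-request limits above surface in client code, the following is a minimal sketch of querying a custom model serving endpoint over REST with client-side guards. The workspace URL, endpoint name, and input schema are hypothetical placeholders.

```python
# Minimal sketch: query a custom model serving endpoint while respecting
# the per-request limits above. Workspace URL, endpoint name, and input
# schema are hypothetical placeholders.
import json
import os

import requests

WORKSPACE_URL = "https://my-workspace.cloud.databricks.com"  # hypothetical
ENDPOINT_NAME = "my-endpoint"  # hypothetical

payload = json.dumps({"inputs": [[1.0, 2.0, 3.0]]}).encode("utf-8")

# Guard against the 16 MB per-request payload limit (4 MB for endpoints
# serving foundation models or external models).
if len(payload) > 16 * 1024 * 1024:
    raise ValueError("payload exceeds the 16 MB per-request limit")

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={
        "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
        "Content-Type": "application/json",
    },
    data=payload,
    # Model execution is capped at 120 seconds; allow a small margin for
    # the sub-50 ms overhead latency and network round trip.
    timeout=125,
)
response.raise_for_status()
print(response.json())
```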
Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace, like IP allowlists and PrivateLink.
Additional limitations exist:
- If your workspace is deployed in a region that supports model serving but is served by a control plane in an unsupported region, the workspace does not support model serving. If you attempt to use model serving in such a workspace, you see an error message stating that your workspace is not supported. Reach out to your Databricks account team for more information.
- Model Serving does not support init scripts.
- By default, Model Serving does not support PrivateLink to external endpoints. Support for this functionality is evaluated and implemented on a per-region basis. Reach out to your Databricks account team for more information.
- Model Serving does not provide security patches to existing model images because of the risk of destabilizing production deployments. A new model image created from a new model version contains the latest patches; to pick them up, deploy a new model version to your endpoint (see the sketch after this list). Reach out to your Databricks account team for more information.
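Because patches ship only in new model images, picking them up means rolling the endpoint to a new model version. Below is a minimal sketch using the Databricks Python SDK; the endpoint name, model name, version, and workload size are all hypothetical.

```python
# Minimal sketch: roll a serving endpoint to a new model version so the
# rebuilt model image includes the latest security patches. All names and
# versions below are hypothetical.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ServedEntityInput

w = WorkspaceClient()  # picks up credentials from the environment

w.serving_endpoints.update_config(
    name="my-endpoint",  # hypothetical endpoint name
    served_entities=[
        ServedEntityInput(
            entity_name="main.default.my_model",  # hypothetical UC model
            entity_version="2",  # the newly registered version
            workload_size="Small",
            scale_to_zero_enabled=True,
        )
    ],
)
```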
## Foundation Model APIs limits
Note
As part of providing the Foundation Model APIs, Databricks may process your data outside of the region and cloud provider where your data originated.
The following are limits relevant to Foundation Model APIs workloads:
- Provisioned throughput supports the HIPAA compliance profile and should be used for workloads requiring compliance certifications.
- Pay-per-token workloads are not HIPAA compliant and do not support the compliance security profile.
- For Foundation Model APIs endpoints, only workspace admins can change governance settings, such as rate limits. Rate limits can also be set programmatically (see the sketch after this list). To change them in the UI, use the following steps:
  1. Open the Serving UI in your workspace to see your serving endpoints.
  2. From the kebab menu on the Foundation Model APIs endpoint you want to edit, select View details.
  3. From the kebab menu on the upper-right side of the endpoint's details page, select Change rate limit.
- To use the DBRX model architecture for a provisioned throughput workload, your serving endpoint must be in us-east-1 or us-west-2.
- Only the GTE Large (En) and Meta Llama 3.1 70B Instruct models are available in pay-per-token EU and US supported regions.
- The following pay-per-token models are supported only in the Foundation Model APIs pay-per-token supported US regions:
  - Meta Llama 3.1 405B Instruct
  - DBRX Instruct
  - Mixtral-8x7B Instruct
  - BGE Large (En)
  - Llama 2 70B Chat
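As referenced in the rate-limit steps above, limits can also be managed programmatically. A minimal sketch, assuming the Databricks Python SDK's serving-endpoints rate-limit API; the endpoint name and limit values are hypothetical, and the call requires workspace admin permissions.

```python
# Minimal sketch: set a per-endpoint rate limit on a Foundation Model APIs
# endpoint (workspace admin only). Endpoint name and limit values are
# hypothetical; check the SDK reference for the current rate-limit API.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    RateLimit,
    RateLimitKey,
    RateLimitRenewalPeriod,
)

w = WorkspaceClient()

w.serving_endpoints.put(
    name="databricks-meta-llama-3-1-70b-instruct",  # hypothetical endpoint
    rate_limits=[
        RateLimit(
            calls=100,  # allowed calls per renewal period
            key=RateLimitKey.ENDPOINT,
            renewal_period=RateLimitRenewalPeriod.MINUTE,
        )
    ],
)
```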
## Region availability
Note
If you require an endpoint in an unsupported region, reach out to your Databricks account team.
For more information on regional availability of features, see Model serving feature availability.