Model Serving limits and regions

This article summarizes the limitations and region availability for Mosaic AI Model Serving and supported endpoint types.

Resource and payload limits

Mosaic AI Model Serving imposes default limits to ensure reliable performance. If you have feedback on these limits, reach out to your Databricks account team.

The following table summarizes resource and payload limitations for model serving endpoints.

| Feature | Granularity | Limit |
| --- | --- | --- |
| Payload size | Per request | 16 MB. For endpoints serving foundation models or external models, the limit is 4 MB. |
| Queries per second (QPS) | Per workspace | 200. Can be increased to 25,000 or more by reaching out to your Databricks account team. |
| Model execution duration | Per request | 120 seconds |
| CPU endpoint model memory usage | Per endpoint | 4 GB |
| GPU endpoint model memory usage | Per endpoint | Greater than or equal to assigned GPU memory; depends on the GPU workload size |
| Provisioned concurrency | Per model and per workspace | 200 concurrency. Can be increased by reaching out to your Databricks account team. |
| Overhead latency | Per request | Less than 50 milliseconds |
| Init scripts | | Init scripts are not supported. |
| Foundation Model APIs (pay-per-token) rate limits | Per workspace | Llama 3.3 70B Instruct: 2 queries per second and 1,200 queries per hour. Llama 3.1 405B Instruct: 1 query per second and 1,200 queries per hour. DBRX Instruct: 1 query per second. Mixtral-8x7B Instruct: 2 queries per second. GTE Large (En): 150 queries per second. BGE Large (En): 600 queries per second. If these limits are insufficient for your use case, Databricks recommends using provisioned throughput. |
| Foundation Model APIs (provisioned throughput) rate limits | Per workspace | 200 |
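
Because these QPS and rate limits are enforced per workspace, client code that queries a serving endpoint should expect HTTP 429 responses when a limit is exceeded. The following is a minimal sketch of retrying with exponential backoff, assuming the standard invocations URL for serving endpoints; the workspace URL, token handling, and endpoint name are illustrative placeholders, not values from this article.

```python
import os
import time

import requests

# Illustrative configuration -- replace with your own values.
WORKSPACE_URL = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]
ENDPOINT_NAME = "my-endpoint"  # hypothetical endpoint name


def query_with_backoff(payload: dict, max_retries: int = 5) -> dict:
    """POST to a serving endpoint, backing off when a rate limit returns HTTP 429."""
    url = f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations"
    headers = {"Authorization": f"Bearer {TOKEN}"}
    for attempt in range(max_retries):
        # The 120-second timeout mirrors the model execution duration limit above.
        response = requests.post(url, headers=headers, json=payload, timeout=120)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        time.sleep(2**attempt)  # wait 1s, 2s, 4s, ... between retries
    raise RuntimeError(f"Still rate limited after {max_retries} retries")
```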

Networking and security limitations

  • Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace, like IP allowlists and PrivateLink.

  • By default, Model Serving does not support PrivateLink to external endpoints. Support for this functionality is evaluated and implemented on a per-region basis. Reach out to your Databricks account team for more information.

  • Model Serving does not provide security patches to existing model images because of the risk of destabilizing production deployments. A new model image created from a new model version contains the latest patches; to pick them up, deploy a new model version, as sketched after this list. Reach out to your Databricks account team for more information.
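
Because patches ship only in new model images, the practical way to pick them up is to register a new model version and point the endpoint at it. The sketch below uses the serving endpoints config-update route; the route and request shape are assumptions to verify against the current Databricks REST API reference, and the endpoint and model names are hypothetical.

```python
import os

import requests

WORKSPACE_URL = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Assumed route and request shape -- verify against the REST API reference.
response = requests.put(
    f"{WORKSPACE_URL}/api/2.0/serving-endpoints/my-endpoint/config",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "served_entities": [
            {
                "entity_name": "main.default.my_model",  # hypothetical Unity Catalog model
                "entity_version": "2",  # new version, built into a freshly patched image
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }
        ]
    },
)
response.raise_for_status()
```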

Foundation Model APIs limits

Note

As part of providing the Foundation Model APIs, Databricks might process your data outside of the region and cloud provider where your data originated.

For both pay-per-token and provisioned throughput workloads:

  • Only workspace admins can change the governance settings, such as rate limits, for Foundation Model APIs endpoints. To change rate limits, use the following steps (a REST-based sketch follows this list):

    1. Open the Serving UI in your workspace to see your serving endpoints.

    2. From the kebab menu on the Foundation Model APIs endpoint you want to edit, select View details.

    3. From the kebab menu on the upper-right side of the endpoint details page, select Change rate limit.

  • The GTE Large (En) embedding models do not generate normalized embeddings, so normalize the returned vectors yourself if you need unit-length embeddings for cosine similarity (see the sketch after this list).
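
If you want to script the rate-limit change instead of using the UI, the serving endpoints REST API includes a rate-limits update operation. The route and payload shape below are assumptions to verify against the current Databricks REST API reference, and the endpoint name is only an example:

```python
import os

import requests

WORKSPACE_URL = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]
ENDPOINT_NAME = "databricks-meta-llama-3-3-70b-instruct"  # example endpoint name

# Assumed route and payload shape -- verify against the REST API reference.
response = requests.put(
    f"{WORKSPACE_URL}/api/2.0/serving-endpoints/{ENDPOINT_NAME}/rate-limits",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"rate_limits": [{"calls": 100, "key": "endpoint", "renewal_period": "minute"}]},
)
response.raise_for_status()
```

Separately, because GTE Large (En) returns unnormalized embeddings, a raw dot product between two embeddings conflates direction with magnitude. A small self-contained sketch of L2 normalization, after which dot products equal cosine similarities:

```python
import numpy as np


def l2_normalize(embedding: list[float]) -> np.ndarray:
    """Scale an embedding to unit length so dot products become cosine similarities."""
    vec = np.asarray(embedding, dtype=np.float64)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0.0 else vec
```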

Pay-per-token limits

The following are limits relevant to Foundation Model APIs pay-per-token workloads:

  • Pay-per-token workloads are not compliant with HIPAA or the compliance security profile.

  • Meta Llama 3.3 70B Instruct and GTE Large (En) models are available in the pay-per-token supported regions in both the EU and the U.S.

  • The following pay-per-token models are supported only in the Foundation Model APIs pay-per-token supported US regions:

    • Meta Llama 3.1 405B Instruct

    • DBRX Instruct

    • Mixtral-8x7B Instruct

    • BGE Large (En)

  • If your workspace is in a Model Serving region but not a U.S. or EU region, your workspace must be enabled for cross-Geo data processing. When enabled, your pay-per-token workload is routed to the U.S. Databricks Geo. To see which geographic regions process pay-per-token workloads, see Databricks Designated Services.

Provisioned throughput limits

The following are limits relevant to Foundation Model APIs provisioned throughput workloads:

  • Provisioned throughput supports the HIPAA compliance profile and is recommended for workloads that require compliance certifications.

  • To use the DBRX model architecture for a provisioned throughput workload, your serving endpoint must be in us-east-1 or us-west-2.

  • The following table shows the region availability of the supported Meta Llama 3.1, 3.2, and 3.3 models. See Deploy fine-tuned foundation models for guidance on how to deploy fine-tuned models. A sketch of creating a provisioned throughput endpoint follows the table.

| Meta Llama model variant | Regions |
| --- | --- |
| meta-llama/Llama-3.1-8B | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.1-8B-Instruct | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.1-70B | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.1-70B-Instruct | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.1-405B | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.1-405B-Instruct | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.2-1B | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.2-1B-Instruct | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.2-3B | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.2-3B-Instruct | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
| meta-llama/Llama-3.3-70B | us-east-1, us-east-2, us-west-2, ap-northeast-1, ap-southeast-1 |
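
These region constraints apply when the provisioned throughput endpoint is created. The following is a minimal creation sketch against the serving endpoints API; the endpoint name, registered model name, and throughput values are hypothetical, and the provisioned throughput field names are assumptions to verify against the current Databricks documentation.

```python
import os

import requests

WORKSPACE_URL = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]

# Hypothetical names and throughput values; field names are assumptions to
# verify against the serving-endpoints create API documentation.
payload = {
    "name": "llama-3-1-8b-instruct-pt",  # hypothetical endpoint name
    "config": {
        "served_entities": [
            {
                "entity_name": "system.ai.meta_llama_v3_1_8b_instruct",  # example registered model
                "entity_version": "1",
                "min_provisioned_throughput": 0,
                "max_provisioned_throughput": 9500,
            }
        ]
    },
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()
```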

Region availability

Note

If you require an endpoint in an unsupported region, reach out to your Databricks account team.

If your workspace is deployed in a region that supports model serving but is served by a control plane in an unsupported region, the workspace does not support model serving. If you attempt to use model serving in such a workspace, you see an error message stating that your workspace is not supported. Reach out to your Databricks account team for more information.

For more information on regional availability of features, see Model serving feature availability.