Model serving with Databricks

This article describes Databricks Model Serving, including its advantages and limitations.

Model Serving exposes your MLflow machine learning models as scalable REST API endpoints and provides a highly available and low-latency service for deploying models. The service automatically scales up or down to meet demand changes within the chosen concurrency range. This functionality uses Serverless compute. See the Model Serving pricing page for more details.
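As an illustration of what calling such a REST endpoint can look like, the following minimal sketch builds (but does not send) a scoring request using only the Python standard library. The workspace URL, endpoint name, token placeholder, and feature columns are hypothetical, and the URL layout and the "dataframe_records" input format are assumptions to confirm against your own endpoint's page.

```python
import json
import urllib.request


def build_scoring_request(workspace_url, endpoint_name, token, records):
    """Build (but do not send) a POST scoring request for a serving endpoint."""
    # Assumed URL layout; confirm it against the endpoint page in your workspace.
    url = f"{workspace_url.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    # Assumed input format: a list of row dictionaries under "dataframe_records".
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "my-endpoint",                           # hypothetical endpoint name
    "<personal-access-token>",               # placeholder, not a real token
    [{"feature_a": 1.0, "feature_b": 2.5}],  # hypothetical feature columns
)
# Sending it would look like: urllib.request.urlopen(request, timeout=60)
```

The request is only constructed here so that the sketch runs without network access or credentials.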

Why use Model Serving?

Model Serving offers:

  • Launch an endpoint with one click: Databricks automatically prepares a production-ready environment for your model and offers serverless configuration options for compute.

  • High availability and scalability: Model Serving is intended for production use and can support 3000+ queries per second (QPS). Model Serving endpoints scale up and down automatically based on the volume of scoring requests. You can also serve multiple models from a single endpoint.

  • MLflow integration: Model Serving connects natively to the MLflow Model Registry, which enables fast and easy deployment of models.

  • Dashboards: Use the built-in Model Serving dashboard to monitor the health of your model endpoints using metrics such as QPS, latency, and error rate.

  • Feature store integration: When your model is trained with features from Databricks Feature Store, the model is packaged with feature metadata. If you configure your online store, these features are incorporated in real time as scoring requests are received.



Limitations

The following limits apply:

  • Payload size limit of 16 MB per request.

  • Default limit of 200 QPS of scoring requests per workspace. You can increase this limit to up to 3000 QPS per workspace by reaching out to your Databricks support contact.

  • Latency overhead of less than 100 milliseconds and availability are supported on a best-effort basis.

  • A workspace can be deployed in a supported region but served by a control plane in a different region. Such workspaces do not support Model Serving, and attempts to use it result in a "Your workspace is not currently supported." message. To resolve this, create a new workspace in a supported region, or use the feature in a different workspace that does not have this issue. Reach out to your Databricks representative for more information.

  • Model Serving is not currently in compliance with HIPAA regulations.
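A client can guard against the payload size and QPS limits above before sending requests. The following is a minimal sketch, not an official client: the helper names are hypothetical, "16 MB" is assumed to mean 16 * 1024 * 1024 bytes (the limit's exact unit is not stated here), and the throttle only paces a single client, whereas the QPS limit applies to the whole workspace.

```python
import json
import time

# Assumption: "16 MB" is read as 16 * 1024 * 1024 bytes.
MAX_PAYLOAD_BYTES = 16 * 1024 * 1024


def encode_payload(payload, limit=MAX_PAYLOAD_BYTES):
    """Serialize a payload to JSON bytes, rejecting bodies above the size limit."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > limit:
        raise ValueError(
            f"payload is {len(body)} bytes, which exceeds the limit of {limit} bytes"
        )
    return body


class Throttle:
    """Space out calls so that at most max_qps of them start per second."""

    def __init__(self, max_qps, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_qps
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve its slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


# Usage sketch: call throttle.wait() before each request is sent.
throttle = Throttle(max_qps=200)
body = encode_payload({"dataframe_records": [{"feature_a": 1.0}]})
```

Because several clients usually share one workspace, a per-client rate well below the workspace limit is the safer choice.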

Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace, like IP allowlists and PrivateLink.

Region availability


If you require an endpoint in an unsupported region, reach out to your Databricks representative.

Model Serving is available in the following AWS regions:

  • eu-west-1

  • eu-central-1

  • us-east-1

  • us-east-2

  • us-west-2

  • ca-central-1

  • ap-southeast-1

  • ap-southeast-2
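For automation that provisions workspaces or endpoints, the region list above can be encoded as a simple lookup. This is a minimal sketch with a hypothetical helper name. A supported region is necessary but not sufficient, because a workspace can still be served by a control plane in a different region.

```python
# AWS regions listed in this article as supporting Model Serving.
SUPPORTED_AWS_REGIONS = frozenset({
    "eu-west-1",
    "eu-central-1",
    "us-east-1",
    "us-east-2",
    "us-west-2",
    "ca-central-1",
    "ap-southeast-1",
    "ap-southeast-2",
})


def is_model_serving_region(region):
    """Return True if the AWS region name appears in the supported list."""
    return region.strip().lower() in SUPPORTED_AWS_REGIONS
```

For example, `is_model_serving_region("us-west-2")` is true, while a region that is not in the list, such as `us-west-1`, yields false.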

Endpoint creation and update expectations

Deploying a newly registered model version involves packaging the model and its model environment and provisioning the model endpoint itself. This process can take approximately 10 minutes.
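Because deployment can take several minutes, automation typically polls for readiness instead of assuming the endpoint is available immediately. The following is a minimal sketch: `get_state` is a hypothetical callable that you supply, the state strings "READY" and "FAILED" are assumptions rather than documented values, and the 15-minute default timeout is an arbitrary margin over the approximate 10 minutes mentioned above.

```python
import time


def wait_until_ready(get_state, timeout_s=900.0, poll_interval_s=15.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it reports readiness, failure, or a timeout.

    get_state is a caller-supplied function returning a state string; the
    values "READY" and "FAILED" are assumed names used for illustration.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "READY":
            return state
        if state == "FAILED":
            raise RuntimeError("endpoint deployment failed")
        if clock() >= deadline:
            raise TimeoutError(f"endpoint not ready after {timeout_s} seconds")
        sleep(poll_interval_s)
```

The clock and sleep functions are parameters so the loop can be exercised in tests without real waiting.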

Databricks performs a zero-downtime update of endpoints by keeping the existing endpoint configuration running until the new one is ready. Doing so reduces the risk of interruption for endpoints that are in use.

If model computation takes longer than 60 seconds, requests will time out. If you believe your model computation will take longer than 60 seconds, reach out to your Databricks support contact.
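One way to stay under the 60-second timeout for batch scoring is to split large inputs into smaller requests. The following is a minimal sketch under stated assumptions: computation time is assumed to grow roughly linearly with the number of rows, the per-row time is a figure you measure yourself, and the helper names and the 0.75 safety fraction are hypothetical choices.

```python
REQUEST_TIMEOUT_S = 60.0  # timeout stated in this article


def max_rows_per_request(per_row_seconds, timeout_s=REQUEST_TIMEOUT_S,
                         safety_fraction=0.75):
    """Estimate how many rows fit in one request within the time budget.

    Assumes computation time scales linearly with row count, which may not
    hold for every model. Always allows at least one row per request.
    """
    budget_s = timeout_s * safety_fraction
    return max(1, int(budget_s // per_row_seconds))


def chunk(records, size):
    """Split a list of records into consecutive batches of at most size items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


# Usage sketch with a hypothetical measured cost of 0.25 seconds per row.
rows_per_request = max_rows_per_request(per_row_seconds=0.25)
batches = chunk(list(range(400)), rows_per_request)
```

If a single row already takes longer than the timeout, batching does not help, and contacting Databricks support as described above is the remaining option.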