Model serving with Databricks

This article describes Databricks Model Serving, including its advantages and limitations.

What is Model Serving?

Model Serving exposes your MLflow machine learning models as scalable REST API endpoints and provides a highly available and low-latency service for deploying models. The service automatically scales up or down to meet demand changes, saving infrastructure costs while optimizing latency performance. This functionality uses serverless compute. See the Model Serving pricing page for more details.

With Model Serving, you can centrally manage and govern all your models in one place, including those hosted on Databricks and those from external providers.

Model Serving supports serving the following model types:

  • Third-party models, referred to as external models.

  • State-of-the-art open models, referred to as foundation models. This also includes curated foundation models on Databricks Marketplace, an open marketplace for sharing third-party data and AI assets.

  • Custom models, meaning MLflow models that are registered in Unity Catalog or in the workspace model registry.

Model Serving offers a unified OpenAI-compatible API and MLflow Deployment API for CRUD and querying tasks. In addition, it provides a single UI to manage all your models and their respective serving endpoints. You can also access models directly from SQL using AI functions for easy integration into analytics workflows.
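As a sketch of what a scoring request looks like, the snippet below builds a payload in MLflow's dataframe_split format; the feature names and the endpoint URL shown in the comment are hypothetical placeholders, not values from this article.

```python
import json

def build_scoring_request(columns, rows):
    """Package rows of features into the dataframe_split payload
    accepted by Model Serving scoring endpoints."""
    return {"dataframe_split": {"columns": columns, "data": rows}}

payload = build_scoring_request(
    columns=["sepal_length", "sepal_width"],  # hypothetical feature names
    rows=[[5.1, 3.5], [6.2, 2.9]],
)
body = json.dumps(payload)

# With a real workspace, this body would be POSTed to the endpoint's
# invocations URL, for example (hypothetical endpoint name "my-model"):
#   POST https://<workspace-host>/serving-endpoints/my-model/invocations
#   Authorization: Bearer <token>, Content-Type: application/json
print(body)
```

The same payload can also be sent through the MLflow Deployment API's client rather than raw HTTP; either path accepts this request shape.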

For an introductory tutorial on how to serve models on Databricks, see Model serving tutorial.

Why use Model Serving?

Model Serving offers:

  • Launch an endpoint with one click: Databricks automatically prepares a production-ready environment for your model and offers serverless configuration options for compute.

  • High availability and scalability: Model Serving is intended for production use and can support over 25,000 queries per second (QPS). Endpoints automatically scale up and down based on the volume of scoring requests. You can also serve multiple custom models from a single endpoint.

  • Secure: Models are deployed within a secure network boundary. Models use dedicated compute that is terminated (and never reused) when the model is deleted or scaled down to zero.

  • Govern and monitor models: Centrally manage all model endpoints in one place, including those that are externally hosted. You can manage permissions, track and set usage limits, and monitor the quality of all types of models. This enables you to democratize access to SaaS and open LLMs within your organization while ensuring appropriate guardrails are in place.

  • Quality and diagnostics: Automatically capture requests and responses in a Delta table to monitor and debug your models.

  • MLflow integration:

    • Standard CRUD and query interface with the MLflow Deployment API.

    • Natively connects to the MLflow Model Registry, which enables fast and easy deployment of models.

  • Feature store integration: When your model is trained with features from Databricks Feature Store, the model is packaged with feature metadata. If you configure your online store, these features are incorporated in real time as scoring requests are received.

  • Vector Search integration: You can serve your preferred embedding model and use that endpoint to perform automatic embedding of queries for real-time retrieval. See Databricks Vector Search.


Enable Model Serving for your workspace

To use Model Serving, your account admin must read and accept the terms and conditions for enabling serverless compute in the account console.


If your account was created after March 28, 2022, serverless compute is enabled by default for your workspaces.

If you are not an account admin, you cannot perform these steps. Contact an account admin if your workspace needs access to serverless compute.

  1. As an account admin, go to the feature enablement tab of the account console settings page.

  2. A banner at the top of the page prompts you to accept the additional terms. Once you read the terms, click Accept. If you do not see the banner asking you to accept the terms, this step has been completed already.

After you’ve accepted the terms, your account is enabled for serverless.

No additional steps are required to enable Model Serving in your workspace.


Limitations

The following limits apply:

  • Payload size limit of 16 MB per request.

  • Default limit of 200 QPS of scoring requests per workspace. You can increase this limit to 25,000 QPS or more per workspace by reaching out to your Databricks support contact.

  • Model Serving supports models with evaluation latency up to 120 seconds.

  • The default limit for provisioned concurrency is 200. This limit is based on the maximum concurrency that can be allocated across your endpoints. For example, if an endpoint has one served model using a Large workload size that supports 16-64 concurrent requests, the maximum provisioned concurrency for this endpoint is 64. You can increase this default limit by reaching out to your Databricks support contact.

  • Latency overhead of less than 50 milliseconds and availability are supported on a best-effort basis.

  • The memory available to your model is 5 GB.

  • It is possible for a workspace to be deployed in a supported region but be served by a control plane in a different region. These workspaces do not support Model Serving and return a "Your workspace is not currently supported" message. To resolve this, create a new workspace in a supported region, or use the feature in a different workspace that is not affected. Reach out to your Databricks account team for more information.

  • Model Serving is not currently in compliance with HIPAA regulations.

  • Model Serving does not support init scripts.

  • Models trained using AutoML may fail on Model Serving due to package dependencies. See how to resolve package dependencies for serving AutoML trained models.
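As one way to stay within the 16 MB payload limit listed above, a client can check the serialized request size before sending it. A minimal sketch, where the payload shape is a hypothetical example:

```python
import json

MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # 16 MB per-request limit noted above

def payload_within_limit(payload: dict) -> bool:
    """Return True if the serialized request body fits the 16 MB limit."""
    return len(json.dumps(payload).encode("utf-8")) <= MAX_PAYLOAD_BYTES

small = {"dataframe_split": {"columns": ["x"], "data": [[1.0]]}}
print(payload_within_limit(small))  # True
```

Requests over the limit can then be split into smaller batches client side instead of being rejected by the service.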

Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace, like IP allowlists and PrivateLink.

Region availability


See which Databricks clouds and regions Model Serving is available in.

If you require an endpoint in an unsupported region, reach out to your Databricks account team.

Endpoint creation and update expectations


The information in this section does not apply to endpoints that serve external models or foundation models made available through Foundation Model APIs.

Deploying a newly registered model version involves packaging the model and its model environment and provisioning the model endpoint itself. This process can take approximately 10 minutes.

Databricks performs zero-downtime updates of endpoints by keeping the existing endpoint configuration active until the new one is ready. This reduces the risk of interruption for endpoints that are in use.

If model computation takes longer than 120 seconds, requests will time out. If you believe your model computation will take longer than 120 seconds, reach out to your Databricks support contact.

Endpoint scale up and scale down expectations


The information in this section does not apply to endpoints that serve external models or foundation models made available through Foundation Model APIs.

Serving endpoints scale up and down based on the volume of traffic coming into the endpoint and the capacity of the currently provisioned concurrency units.

Provisioned concurrency is the maximum number of parallel requests that the system can handle. You can estimate the required provisioned concurrency using the formula: provisioned concurrency = queries per second (QPS) * model execution time (s).
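The formula above can be applied directly. A small sketch, using a hypothetical workload of 100 QPS and a 250 ms model execution time:

```python
import math

def required_concurrency(qps: float, execution_time_s: float) -> int:
    """Estimate provisioned concurrency as QPS * model execution time,
    rounded up to whole parallel request slots."""
    return math.ceil(qps * execution_time_s)

# Hypothetical workload: 100 queries/sec, 250 ms per prediction.
print(required_concurrency(100, 0.25))  # 25
```

The result is the number of requests in flight at any moment, so an endpoint sized for at least this concurrency can keep up with the assumed traffic.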

When traffic increases, an endpoint attempts to scale up almost immediately, depending on the size of the traffic volume increase. When traffic decreases, Databricks makes an attempt every five minutes to scale down to a concurrency size that represents the current volume of traffic.

When an endpoint has scale to zero enabled, it scales down to zero after 30 minutes of observing no traffic. After the endpoint has scaled down to zero, the first request experiences what's known as a "cold start": its latency is higher than the median per-request latency. If this feature is used with a latency-sensitive application, Databricks recommends either not scaling to zero or sending warmup requests to the endpoint before user-facing traffic arrives at your service.
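To illustrate the cold-start behavior and the warmup-request recommendation, the toy simulation below stands in for a scale-to-zero endpoint; it is not a real client, and the latency numbers are arbitrary:

```python
class ScaleToZeroEndpoint:
    """Toy model of a scale-to-zero endpoint: the first request after
    scale-down pays a one-time cold-start penalty, later requests are warm."""

    def __init__(self, cold_start_s: float = 30.0, warm_latency_s: float = 0.05):
        self.warm = False
        self.cold_start_s = cold_start_s
        self.warm_latency_s = warm_latency_s

    def query(self) -> float:
        """Return the simulated latency of one request, in seconds."""
        latency = self.warm_latency_s + (0.0 if self.warm else self.cold_start_s)
        self.warm = True  # the endpoint stays warm after serving a request
        return latency

endpoint = ScaleToZeroEndpoint()
warmup = endpoint.query()        # warmup request absorbs the cold start
user_request = endpoint.query()  # user-facing request sees warm latency
print(warmup > user_request)  # True
```

The point of the sketch: by spending one inexpensive warmup request before user traffic arrives, the cold-start latency never reaches a user-facing request.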

Data protection in Model Serving

Databricks takes data security seriously and understands the importance of the data you analyze with Model Serving. Databricks implements the following security controls to protect your data.

  • Every customer request to Model Serving is logically isolated, authenticated, and authorized.

  • Databricks Model Serving encrypts all data at rest (AES-256) and in transit (TLS 1.2+).

For all paid accounts, Databricks Model Serving does not use user inputs submitted to the service or outputs from the service to train any models or improve any Databricks services.

For Databricks Foundation Model APIs, as part of providing the service, Databricks may temporarily process and store inputs and outputs for the purposes of preventing, detecting, and mitigating abuse or harmful uses. Your inputs and outputs are isolated from those of other customers, stored in the same region as your workspace for up to thirty (30) days, and only accessible for detecting and responding to security or abuse concerns.