Model serving with Databricks

This article describes Databricks Model Serving, including its advantages and limitations.

What is Model Serving?

Model Serving exposes your MLflow machine learning models as scalable REST API endpoints and provides a highly available and low-latency service for deploying models. The service automatically scales up or down to meet demand changes, saving infrastructure costs while optimizing latency performance. This functionality uses serverless compute. See the Model Serving pricing page for more details.

With Model Serving, you can centrally manage and govern all your models in one place, including those hosted on Databricks and those from external providers.

Model Serving supports serving the following model types:

  • Third-party models, referred to as external models.

  • State-of-the-art open models, referred to as foundation models. This also includes curated foundation models on Databricks Marketplace, an open marketplace for sharing third-party data and AI assets.

  • Custom models, meaning MLflow models that are registered in Unity Catalog or in the workspace model registry.

Model Serving offers a unified OpenAI-compatible API and MLflow Deployment API for CRUD and querying tasks. In addition, it provides a single UI to manage all your models and their respective serving endpoints. You can also access models directly from SQL using AI functions for easy integration into analytics workflows.
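As a sketch of what a scoring request looks like, the snippet below builds a payload in MLflow's dataframe_split format; the feature names and the endpoint URL shown in the comment are hypothetical placeholders, not values from this article.

```python
import json

def build_scoring_request(columns, rows):
    """Package rows of features into the dataframe_split payload
    accepted by Model Serving scoring endpoints."""
    return {"dataframe_split": {"columns": columns, "data": rows}}

payload = build_scoring_request(
    columns=["sepal_length", "sepal_width"],  # hypothetical feature names
    rows=[[5.1, 3.5], [6.2, 2.9]],
)
body = json.dumps(payload)

# With a real workspace, this body would be POSTed to the endpoint's
# invocations URL, for example (hypothetical endpoint name "my-model"):
#   POST https://<workspace-host>/serving-endpoints/my-model/invocations
#   Authorization: Bearer <token>, Content-Type: application/json
print(body)
```

The same payload can also be sent through the MLflow Deployment API's client rather than raw HTTP; either path accepts this request shape.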

For an introductory tutorial on how to serve models on Databricks, see Model serving tutorial.

Why use Model Serving?

Model Serving offers:

  • Launch an endpoint with one click: Databricks automatically prepares a production-ready environment for your model and offers serverless configuration options for compute.

  • High availability and scalability: Model Serving is intended for production use and can support over 25,000 queries per second (QPS). Endpoints automatically scale up and down based on the volume of scoring requests. You can also serve multiple custom models from a single endpoint.

  • Secure: Models are deployed within a secure network boundary. Models use dedicated compute that is terminated (and never reused) when the model is deleted or scaled down to zero.

  • Govern and monitor models: Centrally manage all model endpoints in one place, including those that are externally hosted. You can manage permissions, track and set usage limits, and monitor the quality of all types of models. This enables you to democratize access to SaaS and open LLMs within your organization while ensuring appropriate guardrails are in place.

  • Quality and diagnostics: Automatically capture requests and responses in a Delta table to monitor and debug your models.

  • MLflow integration:

    • Standard CRUD and query interface with the MLflow Deployment API.

    • Natively connects to the MLflow Model Registry, which enables fast and easy deployment of models.

  • Feature store integration: When your model is trained with features from Databricks Feature Store, the model is packaged with feature metadata. If you configure your online store, these features are incorporated in real time as scoring requests are received.

  • Vector Search integration: You can serve your preferred embedding model and use that endpoint to perform automatic embedding of queries for real-time retrieval. See Databricks Vector Search.


Enable Model Serving for your workspace

To use Model Serving, your account admin must read and accept the terms and conditions for enabling serverless compute in the account console.


If your account was created after March 28, 2022, serverless compute is enabled by default for your workspaces.

If you are not an account admin, you cannot perform these steps. Contact an account admin if your workspace needs access to serverless compute.

  1. As an account admin, go to the feature enablement tab of the account console settings page.

  2. A banner at the top of the page prompts you to accept the additional terms. Once you read the terms, click Accept. If you do not see the banner asking you to accept the terms, this step has been completed already.

After you’ve accepted the terms, your account is enabled for serverless.

No additional steps are required to enable Model Serving in your workspace.


Limitations

The following limits apply:

  • Payload size limit of 16 MB per request.

  • Default limit of 200 QPS of scoring requests per workspace. You can increase this limit to 25,000 QPS or more per workspace by reaching out to your Databricks support contact.

  • Model Serving supports models with evaluation latency up to 120 seconds.

  • The default limit for provisioned concurrency is 200. This limit is based on the maximum concurrency that can be allocated across your endpoints. For example, if an endpoint has one served model using a Large workload size that supports 16-64 concurrent requests, the maximum provisioned concurrency for this endpoint is 64. You can increase this default limit by reaching out to your Databricks support contact.

  • Latency overhead of less than 50 milliseconds and availability are supported on a best-effort basis.

  • The memory available to your model is 5 GB.

  • It is possible for a workspace to be deployed in a supported region but be served by a control plane in a different region. These workspaces do not support Model Serving and return a "Your workspace is not currently supported" message. To resolve this, create a new workspace in a supported region, or use the feature in a different workspace that is not affected. Reach out to your Databricks account team for more information.

  • Model Serving is not currently in compliance with HIPAA regulations.

  • Model Serving does not support init scripts.

  • Models trained using AutoML may fail on Model Serving due to package dependencies. See how to resolve package dependencies for serving AutoML trained models.
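As one way to stay within the 16 MB payload limit listed above, a client can check the serialized request size before sending it. A minimal sketch, where the payload shape is a hypothetical example:

```python
import json

MAX_PAYLOAD_BYTES = 16 * 1024 * 1024  # 16 MB per-request limit noted above

def payload_within_limit(payload: dict) -> bool:
    """Return True if the serialized request body fits the 16 MB limit."""
    return len(json.dumps(payload).encode("utf-8")) <= MAX_PAYLOAD_BYTES

small = {"dataframe_split": {"columns": ["x"], "data": [[1.0]]}}
print(payload_within_limit(small))  # True
```

Requests over the limit can then be split into smaller batches client side instead of being rejected by the service.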

Model Serving endpoints are protected by access control and respect networking-related ingress rules configured on the workspace, like IP allowlists and PrivateLink.

Region availability


See which Databricks clouds and regions Model Serving is available in.

If you require an endpoint in an unsupported region, reach out to your Databricks account team.

Endpoint creation and update expectations


The information in this section does not apply to endpoints that serve external models or foundation models made available through Foundation Model APIs.

Deploying a newly registered model version involves packaging the model and its model environment and provisioning the model endpoint itself. This process can take approximately 10 minutes.

Databricks performs zero-downtime updates of endpoints by keeping the existing endpoint configuration active until the new one is ready. This reduces the risk of interruption for endpoints that are in use.

If model computation takes longer than 120 seconds, requests will time out. If you believe your model computation will take longer than 120 seconds, reach out to your Databricks support contact.

Endpoint scale up and scale down expectations


The information in this section does not apply to endpoints that serve external models or foundation models made available through Foundation Model APIs.

Serving endpoints scale up and down based on the volume of traffic coming into the endpoint and the capacity of the currently provisioned concurrency units.

Provisioned concurrency is the maximum number of parallel requests that the system can handle. You can estimate the required provisioned concurrency using the formula: provisioned concurrency = queries per second (QPS) * model execution time (s).
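The formula above can be applied directly. A small sketch, using a hypothetical workload of 100 QPS and a 250 ms model execution time:

```python
import math

def required_concurrency(qps: float, execution_time_s: float) -> int:
    """Estimate provisioned concurrency as QPS * model execution time,
    rounded up to whole parallel request slots."""
    return math.ceil(qps * execution_time_s)

# Hypothetical workload: 100 queries/sec, 250 ms per prediction.
print(required_concurrency(100, 0.25))  # 25
```

The result is the number of requests in flight at any moment, so an endpoint sized for at least this concurrency can keep up with the assumed traffic.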

When traffic increases, an endpoint attempts to scale up almost immediately, depending on the size of the traffic volume increase. When traffic decreases, Databricks makes an attempt every five minutes to scale down to a concurrency size that represents the current volume of traffic.

When an endpoint has scale to zero enabled, it scales down to zero after 30 minutes of observing no traffic. After the endpoint has scaled down to zero, the first request experiences what's known as a "cold start": its latency is higher than the median per-request latency. If this feature is used with a latency-sensitive application, Databricks recommends either not scaling to zero or sending warmup requests to the endpoint before user-facing traffic arrives at your service.
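To illustrate the cold-start behavior and the warmup-request recommendation, the toy simulation below stands in for a scale-to-zero endpoint; it is not a real client, and the latency numbers are arbitrary:

```python
class ScaleToZeroEndpoint:
    """Toy model of a scale-to-zero endpoint: the first request after
    scale-down pays a one-time cold-start penalty, later requests are warm."""

    def __init__(self, cold_start_s: float = 30.0, warm_latency_s: float = 0.05):
        self.warm = False
        self.cold_start_s = cold_start_s
        self.warm_latency_s = warm_latency_s

    def query(self) -> float:
        """Return the simulated latency of one request, in seconds."""
        latency = self.warm_latency_s + (0.0 if self.warm else self.cold_start_s)
        self.warm = True  # the endpoint stays warm after serving a request
        return latency

endpoint = ScaleToZeroEndpoint()
warmup = endpoint.query()        # warmup request absorbs the cold start
user_request = endpoint.query()  # user-facing request sees warm latency
print(warmup > user_request)  # True
```

The point of the sketch: by spending one inexpensive warmup request before user traffic arrives, the cold-start latency never reaches a user-facing request.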

Data protection in Model Serving

Databricks takes data security seriously and understands the importance of the data you analyze with Model Serving. Databricks implements the following security controls to protect your data.

  • Every customer request to Model Serving is logically isolated, authenticated, and authorized.

  • Databricks Model Serving encrypts all data at rest (AES-256) and in transit (TLS 1.2+).

For all paid accounts, Databricks Model Serving does not use user inputs submitted to the service or outputs from the service to train any models or improve any Databricks services.

For Databricks Foundation Model APIs, as part of providing the service, Databricks may temporarily process and store inputs and outputs for the purposes of preventing, detecting, and mitigating abuse or harmful uses. Your inputs and outputs are isolated from those of other customers, stored in the same region as your workspace for up to thirty (30) days, and only accessible for detecting and responding to security or abuse concerns.