Model serving with Databricks
This article describes Mosaic AI Model Serving, including its advantages and limitations.
What is Mosaic AI Model Serving?
Mosaic AI Model Serving provides a unified interface to deploy, govern, and query AI models for real-time and batch inference. Each model you serve is available as a REST API that you can integrate into your web or client application.
Model Serving provides a highly available and low-latency service for deploying models. The service automatically scales up or down to meet demand changes, saving infrastructure costs while optimizing latency performance. This functionality uses serverless compute. See the Model Serving pricing page for more details.
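As a minimal sketch of what querying a serving endpoint over REST looks like, the following Python snippet posts a scoring request to an endpoint's invocations URL. The workspace URL, endpoint name, and dataframe_split payload shape are illustrative assumptions; adjust them to match your endpoint and your model's input signature.

```python
import os
import requests

# Illustrative values; replace with your workspace URL and serving endpoint name.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
ENDPOINT_NAME = "my-custom-model-endpoint"

# Authenticate with a Databricks personal access token (assumed to be in an env var).
headers = {
    "Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}",
    "Content-Type": "application/json",
}

# Example payload for a tabular model; the expected schema depends on your model signature.
payload = {
    "dataframe_split": {
        "columns": ["feature_1", "feature_2"],
        "data": [[1.0, 2.0], [3.0, 4.0]],
    }
}

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers=headers,
    json=payload,
)
response.raise_for_status()
print(response.json())
```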
Model serving supports serving:
Custom models. These are Python models packaged in the MLflow format. They can be registered either in Unity Catalog or in the workspace model registry. Examples include scikit-learn, XGBoost, PyTorch, and Hugging Face transformer models.
Agent serving is supported as a custom model. See Deploy an agent for generative AI applications.
State-of-the-art open models made available by Foundation Model APIs. These models are curated foundation model architectures that support optimized inference. Base models, like Meta-Llama-3.1-70B-Instruct, GTE-Large, and Mistral-7B are available for immediate use with pay-per-token pricing, and workloads that require performance guarantees and fine-tuned model variants can be deployed with provisioned throughput.
Databricks recommends using ai_query with Model Serving for batch inference. For quick experimentation, ai_query can be used with pay-per-token endpoints. When you are ready to run batch inference on large or production data, Databricks recommends using provisioned throughput endpoints for faster performance. See Perform batch LLM inference using ai_query. A minimal example follows the list of supported model types below.
External models. These are generative AI models that are hosted outside of Databricks. Examples include models like OpenAI’s GPT-4, Anthropic’s Claude, and others. Endpoints that serve external models can be centrally governed and customers can establish rate limits and access control for them.
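As a minimal sketch of the ai_query pattern mentioned above, the following notebook cell runs batch inference over a table by calling ai_query from Spark SQL. The catalog, schema, table, and column names are placeholders, and the endpoint name assumes a pay-per-token Foundation Model API endpoint is available in your workspace.

```python
# Minimal batch-inference sketch using ai_query from a Databricks notebook.
# The table and endpoint names below are placeholders.
results = spark.sql(
    """
    SELECT
      customer_review,
      ai_query(
        'databricks-meta-llama-3-1-70b-instruct',
        CONCAT('Summarize this review in one sentence: ', customer_review)
      ) AS summary
    FROM main.default.customer_reviews
    """
)
display(results)
```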
Note
You can interact with supported large language models using the AI Playground. The AI Playground is a chat-like environment where you can test, prompt, and compare LLMs. This functionality is available in your Databricks workspace.
Model serving offers a unified REST API and MLflow Deployment API for CRUD and querying tasks. In addition, it provides a single UI to manage all your models and their respective serving endpoints. You can also access models directly from SQL using AI functions for easy integration into analytics workflows.
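For example, the MLflow Deployments client can list endpoints and send queries programmatically. This is a hedged sketch that assumes a recent MLflow version with the Databricks deployments plugin available in your environment; the endpoint name and chat payload are placeholders.

```python
from mlflow.deployments import get_deploy_client

# Create a deployments client that targets Databricks Model Serving.
client = get_deploy_client("databricks")

# List the serving endpoints you can access.
for endpoint in client.list_endpoints():
    print(endpoint["name"])

# Query a chat-style endpoint (name and payload are illustrative).
response = client.predict(
    endpoint="databricks-meta-llama-3-1-70b-instruct",
    inputs={
        "messages": [{"role": "user", "content": "What is Mosaic AI Model Serving?"}],
        "max_tokens": 128,
    },
)
print(response)
```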
For an introductory tutorial on how to serve custom models on Databricks, see Tutorial: Deploy and query a custom model.
For a getting started tutorial on how to query a foundation model on Databricks, see Get started querying LLMs on Databricks.
Why use Model Serving?
Deploy and query any models: Model Serving provides a unified interface so that you can manage all models in one location and query them with a single API, regardless of whether they are hosted on Databricks or externally. This approach simplifies the process of experimenting with, customizing, and deploying models in production across various clouds and providers.
Securely customize models with your private data: Built on the Data Intelligence Platform, Model Serving simplifies the integration of features and embeddings into models through native integration with the Databricks Feature Store and Mosaic AI Vector Search. For improved accuracy and contextual understanding, models can be fine-tuned with proprietary data and deployed effortlessly on Model Serving.
Govern and monitor models: The Serving UI allows you to centrally manage all model endpoints in one place, including those that are externally hosted. You can manage permissions, track and set usage limits, and monitor the quality of all types of models. This enables you to democratize access to SaaS and open LLMs within your organization while ensuring appropriate guardrails are in place.
Reduce cost with optimized inference and fast scaling: Databricks has implemented a range of optimizations to ensure you get the best throughput and latency for large models. The endpoints automatically scale up or down to meet demand changes, saving infrastructure costs while optimizing latency performance. See Monitor model serving costs.
Note
For workloads that are latency sensitive or involve a high number of queries per second, Databricks recommends using route optimization on custom model serving endpoints. Reach out to your Databricks account team to ensure your workspace is enabled for high scalability.
Bring reliability and security to Model Serving: Model Serving is designed for high-availability, low-latency production use and can support over 25K queries per second with an overhead latency of less than 50 ms. The serving workloads are protected by multiple layers of security, ensuring a secure and reliable environment for even the most sensitive tasks.
Note
Model Serving does not provide security patches to existing model images because of the risk of destabilization to production deployments. A new model image created from a new model version will contain the latest patches. Reach out to your Databricks account team for more information.
Requirements
Registered model in Unity Catalog or the Workspace Model Registry. (A minimal registration sketch follows this list.)
Permissions on the registered models as described in Serving endpoint ACLs.
MLflow 1.29 or higher.
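As a minimal sketch of the first requirement, the following snippet logs a scikit-learn model with MLflow and registers it in Unity Catalog. The catalog, schema, and model names are placeholders, and the three-level name assumes your workspace is enabled for Unity Catalog; use the workspace model registry naming instead if it is not.

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Point the MLflow registry at Unity Catalog instead of the workspace model registry.
mlflow.set_registry_uri("databricks-uc")

# Train a small example model.
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Log the model and register it under a three-level Unity Catalog name (placeholder).
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        registered_model_name="main.default.iris_classifier",
        input_example=X[:2],
    )
```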
Enable Model Serving for your workspace
To use Model Serving, your account admin must read and accept the terms and conditions for enabling serverless compute in the account console.
Note
If your account was created after March 28, 2022, serverless compute is enabled by default for your workspaces.
If you are not an account admin, you cannot perform these steps. Contact an account admin if your workspace needs access to serverless compute.
As an account admin, go to the feature enablement tab of the account console settings page.
A banner at the top of the page prompts you to accept the additional terms. Once you read the terms, click Accept. If you do not see the banner asking you to accept the terms, this step has been completed already.
After you’ve accepted the terms, your account is enabled for serverless.
No additional steps are required to enable Model Serving in your workspace.
Limitations and region availability
Mosaic AI Model Serving imposes default limits to ensure reliable performance. See Model Serving limits and regions. If you have feedback on these limits or an endpoint in an unsupported region, reach out to your Databricks account team.
Data protection in Model Serving
Databricks takes data security seriously. To protect the data you analyze using Mosaic AI Model Serving, Databricks implements the following security controls.
Every customer request to Model Serving is logically isolated, authenticated, and authorized.
Mosaic AI Model Serving encrypts all data at rest (AES-256) and in transit (TLS 1.2+).
For all paid accounts, Mosaic AI Model Serving does not use user inputs submitted to the service or outputs from the service to train any models or improve any Databricks services.
For Databricks Foundation Model APIs, as part of providing the service, Databricks may temporarily process and store inputs and outputs for the purposes of preventing, detecting, and mitigating abuse or harmful uses. Your inputs and outputs are isolated from those of other customers, stored in the same region as your workspace for up to thirty (30) days, and only accessible for detecting and responding to security or abuse concerns. Foundation Model APIs is a Databricks Designated Service, meaning it adheres to data residency boundaries as implemented by Databricks Geos.