Create and manage model serving endpoints
This article describes how to create and manage model serving endpoints that use Databricks Model Serving.
Requirements
Model Serving is only available for Python-based MLflow models registered in Unity Catalog or the Workspace Model Registry. You must declare all model dependencies in the conda environment or requirements file.
If you don’t have a registered model, see the notebook examples for pre-packaged models you can use to get up and running with Model Serving endpoints.
Your workspace must be in a supported region.
If you use custom libraries or libraries from a private mirror server with your model, see Use custom Python libraries with Model Serving before you create the model endpoint.
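For example, the following is a minimal sketch of logging and registering a model that satisfies these requirements, assuming a Unity Catalog registry; the scikit-learn model, the catalog.schema.example-model name, and the pinned package version are illustrative placeholders.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a placeholder model to register.
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Point the MLflow client at Unity Catalog (omit this to use the Workspace Model Registry).
mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        # Full three-level name: catalog, schema, and model (placeholder values).
        registered_model_name="catalog.schema.example-model",
        # Declare all model dependencies explicitly.
        pip_requirements=["scikit-learn==1.3.0"],
    )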
Important
If you rely on Anaconda, review the terms of service notice for additional information.
Access control
To understand access control options for model serving endpoints and best practice guidance for endpoint management, see Serving endpoints access control.
Create model serving endpoints
You can create Model Serving endpoints with the Databricks Machine Learning API or the Databricks Machine Learning UI. An endpoint can serve any registered Python MLflow model in Unity Catalog or the Workspace Model Registry.
API workflow
The following example creates an endpoint that serves the first version of the ads1 model that is registered in the model registry. To specify a model from Unity Catalog, provide the full model name including its parent catalog and schema, such as catalog.schema.example-model.
Note
Databricks supports model serving for GPU workloads as a Public Preview functionality.
POST /api/2.0/serving-endpoints
{
"name": "feed-ads",
"config":{
"served_models": [{
"model_name": "ads1",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": true
}]
}
}
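The request above can be sent with any REST client. The following is a minimal sketch using Python; the workspace URL and the DATABRICKS_TOKEN environment variable are assumptions, not values defined by this article.
import os
import requests

# Placeholder workspace URL and a personal access token supplied via an environment variable.
workspace_url = "https://<workspace-instance>.cloud.databricks.com"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

endpoint_config = {
    "name": "feed-ads",
    "config": {
        "served_models": [{
            "model_name": "ads1",
            "model_version": "1",
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
        }]
    },
}

# Create the serving endpoint.
response = requests.post(
    f"{workspace_url}/api/2.0/serving-endpoints",
    headers=headers,
    json=endpoint_config,
)
response.raise_for_status()
print(response.json())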
The following is an example response. The endpoint’s config_update state is IN_PROGRESS and the served model is in a CREATING state. The pending_config field shows the details of the update that is in progress.
{
"name": "feed-ads",
"creator": "customer@example.com",
"creation_timestamp": 1666829055000,
"last_updated_timestamp": 1666829055000,
"state": {
"ready": "NOT_READY",
"config_update": "IN_PROGRESS"
},
"pending_config": {
"start_time": 1666718879000,
"served_models": [{
"name": "ads1-1",
"model_name": "ads1",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": true,
"state": {
"deployment": "DEPLOYMENT_CREATING",
"deployment_state_message": "Creating"
},
"creator": "customer@example.com",
"creation_timestamp": 1666829055000
}],
"config_version": 1,
"traffic_config": {
"routes": [
{
"served_model_name": "ads1-1",
"traffic_percentage": 100
}
]
}
},
"id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"permission_level": "CAN_MANAGE"
}
UI workflow
You can create an endpoint for model serving with the Serving UI.
Click Serving in the sidebar to display the Serving UI.
Click Create serving endpoint.
In the Serving endpoint name field provide a name for your endpoint.
In the Edit configuration section:
Select whether the model you want to serve is currently in the Workspace Model Registry or Unity Catalog.
Select which model and model version you want to serve.
Click Confirm.
Select what size compute to use.
Note
Databricks supports model serving for GPU workloads as a Public Preview functionality.
Specify if the endpoint should scale to zero when not in use, and the percentage of traffic to route to a served model.
Click Create serving endpoint. The Serving endpoints page appears with Serving endpoint state shown as Not Ready.
You can also access the Serving UI to create an endpoint from the registered model page in the Databricks Machine Learning UI.
Select the model you want to serve.
Click the Use model for inference button.
Select the Real-time tab.
Select the model version and provide an endpoint name.
Select the compute size for your endpoint, and specify if your endpoint should scale to zero when not in use.
Note
Databricks supports model serving for GPU workloads as a Public Preview functionality.
Click Create serving endpoint. The Serving endpoints page appears with Serving endpoint state shown as Not Ready. After a few minutes, Serving endpoint state changes to Ready.
Modify the compute configuration of an endpoint
After enabling a model endpoint, you can set the compute configuration as desired with the API or the UI. This configuration is particularly helpful if you need additional resources for your model. Workload size and compute configuration play a key role in what resources are allocated for serving your model.
Until the new configuration is ready, the old configuration keeps serving prediction traffic. While there is an update in progress, another update cannot be made.
You can also:
Configure your endpoint to access external resources using Databricks Secrets.
Enable inference tables to automatically capture incoming requests and outgoing responses to your model serving endpoints.
API workflow
Note
Databricks supports model serving for GPU workloads as a Public Preview functionality.
PUT /api/2.0/serving-endpoints/{name}/config
{
"served_models": [{
"model_name": "ads1",
"model_version": "2",
"workload_size": "Small",
"scale_to_zero_enabled": true
}]
}
The following is a response example:
{
"name": "feed-ads",
"creator": "cuastomer@example.com",
"creation_timestamp": 1666829055000,
"last_updated_timestamp": 1666946600000,
"state": {
"ready": true,
"update_state": "IN_PROGRESS"
},
"config": {
"served_models": [
{
"name": "ads1-1",
"model_name": "ads1",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": true,
"state": {
"deployment": "DEPLOYMENT_READY",
"deployment_state_message": ""
},
"creator": "customer@example.com",
"creation_timestamp": 1666887851000
}
],
"traffic_config": {
"routes": [
{
"served_model_name": "ads1-1",
"traffic_percentage": 100
}
]
},
"config_version": 2
},
"pending_update": {
"start_time": 1666946600000,
"served_models": [
{
"name": "ads1-2",
"model_name": "ads1",
"model_version": "2",
"workload_size": "Small",
"scale_to_zero_enabled": true,
"state": {
"deployment": "DEPLOYMENT_CREATING",
"deployment_state_message": "Created"
},
"creator": "customer@example.com",
"creation_timestamp": 1666946600000
}
],
"traffic_config": {
"routes": [
{
"served_model_name": "ads1-2",
"traffic_percentage": 100
}
]
},
"config_version": 3
},
"id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"permission_level": "CAN_MANAGE"
}
UI workflow
After you enable a model endpoint, select Edit configuration to modify the compute configuration of your endpoint.
You can do the following:
Choose from a few workload sizes; autoscaling is automatically configured within the selected workload size.
Specify if your endpoint should scale down to zero when not in use.
Modify the percent of traffic to route to your served model.
GPU workload types
Preview
This feature is in Public Preview.
Model serving on Databricks supports GPU deployment of PyTorch and TensorFlow models, as well as models logged with the mlflow.pyfunc, mlflow.pytorch, mlflow.tensorflow, and mlflow.transformers flavors.
This preview capability is compatible with the following package versions:
PyTorch 1.13.0 - 2.0.1
TensorFlow 2.5.0 - 2.13.0
MLflow 2.4.0 and above
To deploy your models using GPUs, include the workload_type field in your endpoint configuration during endpoint creation or as an endpoint configuration update using the API. To configure your endpoint for GPU workloads with the Serving UI, select the desired GPU type from the Compute dropdown.
{
"served_models": [{
"model_name": "ads1",
"model_version": "2",
"workload_type": "GPU_MEDIUM",
"workload_size": "Small",
"scale_to_zero_enabled": false
}]
}
The following table summarizes the available GPU workload types supported during the Public Preview.
Note
The concurrency supported per GPU depends on both the model’s size and its computational complexity.
GPU workload type | GPU instance | GPU memory
---|---|---
GPU_SMALL | 1xT4 | 16GB
GPU_MEDIUM | 1xA10G | 24GB
MULTIGPU_MEDIUM | 4xA10G | 96GB
GPU_MEDIUM_8 | 8xA10G | 192GB
GPU_LARGE_8 | 8xA100 | 320GB
Limitations
The following are limitations for serving endpoints with GPU workloads during Public Preview:
Container image creation for GPU serving takes longer than image creation for CPU serving due to model size and increased installation requirements for models served on GPU.
When deploying very large models, the deployment process might time out if the container build and model deployment together exceed 60 minutes. If this occurs, retrying the process should successfully deploy the model.
Autoscaling for GPU serving takes longer than for CPU serving.
Endpoints configured with GPU workloads do not support scale to zero.
This functionality is available in the following regions:
ap-southeast-2
ca-central-1
eu-central-1
eu-west-1
eu-west-2
us-east-1
us-east-2
us-west-2
Score a model endpoint
To score your model, you can send requests to the model serving endpoint. See Send scoring requests to serving endpoints to learn about recommendations and accepted formats.
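As a minimal sketch, the following scores the feed-ads endpoint from the earlier examples with a dataframe_split payload; the workspace URL, the DATABRICKS_TOKEN environment variable, and the feature columns are illustrative assumptions, and the linked article covers the full set of accepted formats.
import os
import requests

workspace_url = "https://<workspace-instance>.cloud.databricks.com"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Two example rows in the dataframe_split format; column names are placeholders.
payload = {
    "dataframe_split": {
        "columns": ["feature_1", "feature_2"],
        "data": [[1.0, 2.0], [3.0, 4.0]],
    }
}

response = requests.post(
    f"{workspace_url}/serving-endpoints/feed-ads/invocations",
    headers=headers,
    json=payload,
)
print(response.json())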
Get the status of the model endpoint
API workflow
You can check the status of an endpoint with the following:
GET /api/2.0/serving-endpoints/{name}
In the following example response, the state.ready field is "READY", which means the endpoint is ready to receive traffic. The state.config_update field is NOT_UPDATING, and pending_config is no longer returned because the update finished successfully.
{
"name": "feed-ads",
"creator": "customer@example.com",
"creation_timestamp": 1666829055000,
"last_updated_timestamp": 1666829055000,
"state": {
"ready": "READY",
"update_state": "NOT_UPDATING"
},
"config": {
"served_models": [
{
"name": "ads1-1",
"model_name": "ads1",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": false,
"state": {
"deployment": "DEPLOYMENT_READY",
"deployment_state_message": ""
},
"creator": "customer@example.com",
"creation_timestamp": 1666829055000
}
],
"traffic_config": {
"routes": [
{
"served_model_name": "ads1-1",
"traffic_percentage": 100
}
]
},
"config_version": 1
},
"id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"permission_level": "CAN_MANAGE"
}
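For example, a minimal sketch that polls this API until the endpoint reports READY; the workspace URL, the DATABRICKS_TOKEN environment variable, and the polling interval are assumptions.
import os
import time
import requests

workspace_url = "https://<workspace-instance>.cloud.databricks.com"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Poll the endpoint state for up to roughly 30 minutes.
for _ in range(60):
    state = requests.get(
        f"{workspace_url}/api/2.0/serving-endpoints/feed-ads",
        headers=headers,
    ).json()["state"]
    if state["ready"] == "READY" and state["config_update"] == "NOT_UPDATING":
        print("Endpoint is ready")
        break
    time.sleep(30)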
Delete a model serving endpoint
To disable serving for a model, you can delete the endpoint it’s served on.
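For example, a minimal sketch of deleting the feed-ads endpoint through the REST API; the workspace URL and the DATABRICKS_TOKEN environment variable are assumptions.
import os
import requests

workspace_url = "https://<workspace-instance>.cloud.databricks.com"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Delete the serving endpoint; this stops serving for all models on it.
response = requests.delete(
    f"{workspace_url}/api/2.0/serving-endpoints/feed-ads",
    headers=headers,
)
response.raise_for_status()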
Debug your model serving endpoint
To debug any issues with the endpoint, you can fetch:
Model server container build logs
Model server logs
These logs are also accessible from the Endpoints UI in the Logs tab.
For the build logs for a served model, you can use the following request:
GET /api/2.0/serving-endpoints/{name}/served-models/{served-model-name}/build-logs
{
"config_version": 1 // optional
}
For the model server logs for a served model, you can use the following request:
GET /api/2.0/serving-endpoints/{name}/served-models/{served-model-name}/logs
{
"config_version": 1 // optional
}
Notebook examples
The following notebooks include different models that you can use to get up and running with Model Serving endpoints. The model examples can be imported into the workspace by following the directions in Import a notebook. After you choose and create a model from one of the examples, register it in the MLflow Model Registry, and then follow the UI workflow steps for model serving.
The following notebook example demonstrates how to create and manage model serving endpoints using Python.
Anaconda licensing update
The following notice is for customers relying on Anaconda.
Important
Anaconda Inc. updated their terms of service for anaconda.org channels. Based on the new terms of service, you may require a commercial license if you rely on Anaconda’s packaging and distribution. See Anaconda Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.
MLflow models logged before v1.18 (Databricks Runtime 8.3 ML or earlier) were by default logged with the conda defaults channel (https://repo.anaconda.com/pkgs/) as a dependency. Because of this license change, Databricks has stopped the use of the defaults channel for models logged using MLflow v1.18 and above. The default channel logged is now conda-forge, which points at the community-managed https://conda-forge.org/.
If you logged a model before MLflow v1.18 without excluding the defaults channel from the conda environment for the model, that model may have a dependency on the defaults channel that you may not have intended.
To manually confirm whether a model has this dependency, you can examine the channels value in the conda.yaml file that is packaged with the logged model. For example, a model’s conda.yaml with a defaults channel dependency may look like this:
channels:
- defaults
dependencies:
- python=3.8.8
- pip
- pip:
  - mlflow
  - scikit-learn==0.23.2
  - cloudpickle==1.6.0
name: mlflow-env
Because Databricks cannot determine whether your use of the Anaconda repository to interact with your models is permitted under your relationship with Anaconda, Databricks is not forcing its customers to make any changes. If your use of the Anaconda.com repo through your use of Databricks is permitted under Anaconda’s terms, you do not need to take any action.
If you would like to change the channel used in a model’s environment, you can re-register the model to the model registry with a new conda.yaml. You can do this by specifying the channel in the conda_env parameter of log_model().
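For example, a minimal sketch of re-logging and re-registering a scikit-learn model with a conda environment that uses the conda-forge channel; the model object, the pinned package versions, and the registered model name are illustrative placeholders.
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a placeholder model to re-log.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Same environment as before, but with conda-forge instead of the defaults channel.
conda_env = {
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.8.8",
        "pip",
        {"pip": ["mlflow", "scikit-learn==0.23.2", "cloudpickle==1.6.0"]},
    ],
    "name": "mlflow-env",
}

with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        conda_env=conda_env,
        registered_model_name="ads1",  # placeholder registered model name
    )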
For more information on the log_model() API, see the MLflow documentation for the model flavor you are working with, for example, log_model for scikit-learn.
For more information on conda.yaml files, see the MLflow documentation.