Create and manage model serving endpoints
This article describes how to create and manage model serving endpoints that utilize Databricks Model Serving.
Requirements
Model Serving is only available for Python-based MLflow models registered in the MLflow Model Registry. You must declare all model dependencies in the conda environment or requirements file (see the logging sketch after these requirements).
If you don’t have a registered model, see the notebook examples for pre-packaged models you can use to get up and running with Model Serving endpoints.
Your workspace must be in a supported region.
If you use custom libraries or libraries from a private mirror server with your model, see Use custom Python libraries with Model Serving before you create the model endpoint.
Important
If you rely on Anaconda, review the terms of service notice for additional information.
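As an example of the requirement above about declaring model dependencies, the following is a minimal sketch that logs a scikit-learn model with its pip requirements pinned and registers it in the MLflow Model Registry. The model name iris-classifier and the pinned version are placeholders, not recommendations.

import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a simple placeholder model
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Log the model with its dependencies declared explicitly and register it
# in the MLflow Model Registry so it can be served.
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        pip_requirements=["scikit-learn==1.0.2"],  # declare all model dependencies
        registered_model_name="iris-classifier",   # placeholder registry name
    )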
Access control
To understand access control options for model serving endpoints and best practice guidance for endpoint management, see Serving endpoints access control.
Create model serving endpoints
You can create Model Serving endpoints with the Databricks Machine Learning API or the Databricks Machine Learning UI. An endpoint can serve any registered Python MLflow model in the Model Registry.
API workflow
You can create an endpoint with the following:
POST /api/2.0/serving-endpoints
{
"name": "feed-ads",
"config": {
"served_models": [{
"model_name": "ads1",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": true,
}]
}
}
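For example, a minimal sketch of sending this request with the Python requests library, assuming a placeholder workspace URL in DATABRICKS_HOST and a personal access token in DATABRICKS_TOKEN:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"  # placeholder workspace URL
DATABRICKS_TOKEN = "<personal-access-token>"       # placeholder token

endpoint_config = {
    "name": "feed-ads",
    "config": {
        "served_models": [{
            "model_name": "ads1",
            "model_version": "1",
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
        }]
    },
}

# Create the serving endpoint
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=endpoint_config,
)
print(response.json())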
The following is an example response. The endpoint's config_update state is IN_PROGRESS and the served model is in a CREATING state. The pending_config field shows the details of the update that is in progress.
{
"name": "feed-ads",
"creator": "customer@example.com",
"creation_timestamp": 1666829055000,
"last_updated_timestamp": 1666829055000,
"state": {
"ready": "NOT_READY",
"config_update": "IN_PROGRESS"
},
"pending_config": {
"start_time": 1666718879000,
"served_models": [{
"name": "ads1-1",
"model_name": "ads1",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": true,
"state": {
"deployment": "DEPLOYMENT_CREATING",
"deployment_state_message": "Creating"
},
"creator": "customer@example.com",
"creation_timestamp": 1666829055000
}],
"config_version": 1,
"traffic_config": {
"routes": [
{
"served_model_name": "ads1-1",
"traffic_percentage": 100
}
]
}
},
"id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"permission_level": "CAN_MANAGE"
}
UI workflow
You can create an endpoint for model serving with the Serving UI.
Click Serving in the sidebar to display the Serving UI.
Click Create serving endpoint.
In the Serving endpoint name field provide a name for your endpoint.
In the Edit configuration section select which model and model version you want to serve.
Select what size compute to use.
Specify if the endpoint should scale to zero when not in use, and the percentage of traffic to route to a served model.
Click Create serving endpoint. The Serving endpoints page appears with Serving endpoint state shown as Not Ready.
You can also access the Serving UI to create an endpoint from the registered model page in the Databricks Machine Learning UI.
Select the model you want to serve.
Click the Use model for inference button.
Select the Real-time tab.
Select the model version and provide an endpoint name.
Select the compute size for your endpoint, and specify if your endpoint should scale to zero when not in use.
Click Create serving endpoint. The Serving endpoints page appears with Serving endpoint state shown as Not Ready. After a few minutes, Serving endpoint state changes to Ready.
Modify the compute configuration of an endpoint
After enabling a model endpoint, you can set the compute configuration as desired with the API or the UI. This configuration is particularly helpful if you need additional resources for your model. Workload size and compute configuration play a key role in what resources are allocated for serving your model.
Until the new configuration is ready, the old configuration keeps serving prediction traffic. While there is an update in progress, another update cannot be made.
You can also configure your endpoint to serve multiple models. See Serve multiple models to a Model Serving endpoint.
API workflow
PUT /api/2.0/serving-endpoints/{name}/config
{
"served_models": [{
"model_name": "ads1",
"model_version": "2",
"workload_size": "Small",
"scale_to_zero_enabled": true,
}]
}
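As with endpoint creation, a minimal sketch of issuing this update from Python, with placeholder host and token values:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"  # placeholder workspace URL
DATABRICKS_TOKEN = "<personal-access-token>"       # placeholder token

new_config = {
    "served_models": [{
        "model_name": "ads1",
        "model_version": "2",
        "workload_size": "Small",
        "scale_to_zero_enabled": True,
    }]
}

# Update the compute configuration of an existing endpoint
response = requests.put(
    f"{DATABRICKS_HOST}/api/2.0/serving-endpoints/feed-ads/config",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=new_config,
)
print(response.json())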
The following is a response example:
{
"name": "feed-ads",
"creator": "cuastomer@example.com",
"creation_timestamp": 1666829055000,
"last_updated_timestamp": 1666946600000,
"state": {
"ready": true,
"update_state": "IN_PROGRESS"
},
"config": {
"served_models": [
{
"name": "ads1-1",
"model_name": "ads1",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": true,
"state": {
"deployment": "DEPLOYMENT_READY",
"deployment_state_message": ""
},
"creator": "customer@example.com",
"creation_timestamp": 1666887851000
}
],
"traffic_config": {
"routes": [
{
"served_model_name": "ads1-1",
"traffic_percentage": 100
}
]
},
"config_version": 2
},
"pending_update": {
"start_time": 1666946600000,
"served_models": [
{
"name": "ads1-2",
"model_name": "ads1",
"model_version": "2",
"workload_size": "Small",
"scale_to_zero_enabled": true,
"state": {
"deployment": "DEPLOYMENT_CREATING",
"deployment_state_message": "Created"
},
"creator": "customer@example.com",
"creation_timestamp": 1666946600000
}
],
"traffic_config": {
"routes": [
{
"served_model_name": "ads1-2",
"traffic_percentage": 100
}
]
},
"config_version": 3
},
"id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"permission_level": "CAN_MANAGE"
}
UI workflow
After you enable a model endpoint, select Edit configuration to modify the compute configuration of your endpoint.
You can do the following:
Choose from a few workload sizes, and autoscaling is automatically configured within the workload size.
Specify if your endpoint should scale down to zero when not in use.
Modify the percent of traffic to route to your served model.
Score a model endpoint
To score a deployed model, you can send a REST API request to the model URL or use the UI.
To score through the API, send a request to the following URI:
POST /serving-endpoints/{endpoint-name}/invocations
Request format
Send requests by constructing a JSON body with one of the following keys and a JSON object that corresponds to the input format.
There are four formats for the input JSON depending on your use case:
dataframe_split is a JSON-serialized pandas DataFrame in the split orientation.

{
  "dataframe_split": {
    "index": [0, 1],
    "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
    "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
  }
}

dataframe_records is a JSON-serialized pandas DataFrame in the records orientation.

Note

This format does not guarantee the preservation of column ordering, and the split format is preferred over the records format.

{
  "dataframe_records": [
    { "sepal length (cm)": 5.1, "sepal width (cm)": 3.5, "petal length (cm)": 1.4, "petal width (cm)": 0.2 },
    { "sepal length (cm)": 4.9, "sepal width (cm)": 3, "petal length (cm)": 1.4, "petal width (cm)": 0.2 },
    { "sepal length (cm)": 4.7, "sepal width (cm)": 3.2, "petal length (cm)": 1.3, "petal width (cm)": 0.2 }
  ]
}

instances is a tensor-based format that accepts tensors in row format. Use this format if all the input tensors have the same 0-th dimension. Conceptually, each tensor in the instances list could be joined with the other tensors of the same name in the rest of the list to construct the full input tensor for the model, which is only possible if all of the tensors have the same 0-th dimension.

{"instances": ["a", "b", "c"]}

or, with multiple named tensors. In the following example there are two instances, and each instance provides one value for each of the three named tensors (t1, t2, and t3).

{
  "instances": [
    { "t1": "a", "t2": [1, 2, 3, 4, 5], "t3": [[1, 2], [3, 4], [5, 6]] },
    { "t1": "b", "t2": [6, 7, 8, 9, 10], "t3": [[7, 8], [9, 10], [11, 12]] }
  ]
}

inputs sends queries with tensors in columnar format. This request is different from the instances example because there are a different number of tensor instances for t2 (three) than for t1 and t3 (two), so it is not possible to represent this input in the instances format.

{
  "inputs": {
    "t1": ["a", "b"],
    "t2": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]],
    "t3": [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
  }
}
Response format
The response from the endpoint is in the following format. The output from your model is wrapped in a predictions key.
{
"predictions": "<json-output-from-model>"
}
UI workflow
Sending requests using the UI is the easiest and fastest way to test the model. From the Serving endpoint page, select Query endpoint. You can insert the model input data in JSON format and click Send Request. If the model has been logged with an input example, click Show Example to load the input example.
API workflow
You can send a scoring request through the REST API using standard Databricks authentication. The following examples demonstrate authentication using a personal access token.
Note
As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. To create access tokens for service principals, see Manage access tokens for a service principal.
Given a MODEL_VERSION_URI like https://<databricks-instance>/serving-endpoints/iris-classifier/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN, the following are example snippets of how to score a served model.
Score a model accepting dataframe records input format.
curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
-H 'Content-Type: application/json' \
-d '{"dataframe_records": [
{
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2
}
]}'
Score a model accepting tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API docs.
curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
-H 'Content-Type: application/json' \
-d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'
import json

import numpy as np
import pandas as pd
import requests

def create_tf_serving_json(data):
    # Wrap tensor input (a dict of named arrays or a single array) in the tensor-based "inputs" format
    return {'inputs': {name: data[name].tolist() for name in data.keys()} if isinstance(data, dict) else data.tolist()}

def score_model(model_uri, databricks_token, data):
    headers = {
        "Authorization": f"Bearer {databricks_token}",
        "Content-Type": "application/json",
    }
    # Use dataframe_records for pandas DataFrames; otherwise use the tensor-based inputs format
    data_json = json.dumps({'dataframe_records': data.to_dict(orient='records')} if isinstance(data, pd.DataFrame) else create_tf_serving_json(data))
    response = requests.request(method='POST', headers=headers, url=model_uri, data=data_json)
    if response.status_code != 200:
        raise Exception(f"Request failed with status {response.status_code}, {response.text}")
    return response.json()
# Scoring a model that accepts pandas DataFrames
data = pd.DataFrame([{
"sepal_length": 5.1,
"sepal_width": 3.5,
"petal_length": 1.4,
"petal_width": 0.2
}])
score_model(MODEL_VERSION_URI, DATABRICKS_API_TOKEN, data)
# Scoring a model that accepts tensors
data = np.asarray([[5.1, 3.5, 1.4, 0.2]])
score_model(MODEL_VERSION_URI, DATABRICKS_API_TOKEN, data)
You can score a dataset in Power BI Desktop using the following steps:
Open the dataset you want to score.
Go to Transform Data.
Right-click in the left panel and select Create New Query.
Go to View > Advanced Editor.
Replace the query body with the code snippet below, after filling in an appropriate DATABRICKS_API_TOKEN and MODEL_VERSION_URI.

(dataset as table) as table =>
let
    call_predict = (dataset as table) as list =>
    let
        apiToken = DATABRICKS_API_TOKEN,
        modelUri = MODEL_VERSION_URI,
        responseList = Json.Document(Web.Contents(modelUri,
            [
                Headers = [
                    #"Content-Type" = "application/json",
                    #"Authorization" = Text.Format("Bearer #{0}", {apiToken})
                ],
                Content = {"dataframe_records": Json.FromValue(dataset)}
            ]
        ))
    in
        responseList,
    predictionList = List.Combine(List.Transform(Table.Split(dataset, 256), (x) => call_predict(x))),
    predictionsTable = Table.FromList(predictionList, (x) => {x}, {"Prediction"}),
    datasetWithPrediction = Table.Join(
        Table.AddIndexColumn(predictionsTable, "index"), "index",
        Table.AddIndexColumn(dataset, "index"), "index")
in
    datasetWithPrediction
Name the query with your desired model name.
Open the advanced query editor for your dataset and apply the model function.
See the Notebook examples section for an example of how to test your Model Serving endpoint with a Python model.
Get the status of the model endpoint
API workflow
You can check the status of an endpoint with the following:
GET /api/2.0/serving-endpoints/{name}
In the following example response, the state.ready field is "READY", which means the endpoint is ready to receive traffic. The state.update_state field is NOT_UPDATING and pending_config is no longer returned because the update finished successfully.
{
"name": "feed-ads",
"creator": "customer@example.com",
"creation_timestamp": 1666829055000,
"last_updated_timestamp": 1666829055000,
"state": {
"ready": "READY",
"update_state": "NOT_UPDATING"
},
"config": {
"served_models": [
{
"name": "ads1-1",
"model_name": "ads1",
"model_version": "1",
"workload_size": "Small",
"scale_to_zero_enabled": false,
"state": {
"deployment": "DEPLOYMENT_READY",
"deployment_state_message": ""
},
"creator": "customer@example.com",
"creation_timestamp": 1666829055000
}
],
"traffic_config": {
"routes": [
{
"served_model_name": "ads1-1",
"traffic_percentage": 100
}
]
},
"config_version": 1
},
"id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"permission_level": "CAN_MANAGE"
}
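A minimal polling sketch, assuming the same placeholder host and token, that waits until the endpoint reports READY:

import time
import requests

DATABRICKS_HOST = "https://<databricks-instance>"  # placeholder workspace URL
DATABRICKS_TOKEN = "<personal-access-token>"       # placeholder token

def wait_until_ready(endpoint_name, timeout_s=1200, poll_s=30):
    # Poll the endpoint status until state.ready is READY or the timeout expires
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        response = requests.get(
            f"{DATABRICKS_HOST}/api/2.0/serving-endpoints/{endpoint_name}",
            headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        )
        state = response.json().get("state", {})
        if state.get("ready") == "READY":
            return state
        time.sleep(poll_s)
    raise TimeoutError(f"Endpoint {endpoint_name} not ready after {timeout_s} seconds")

wait_until_ready("feed-ads")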
Delete a model serving endpoint
To disable serving for a model, you can delete the endpoint it’s served on.
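For example, a minimal sketch of deleting an endpoint through the REST API, assuming the serving-endpoints API accepts DELETE on the endpoint name and using the same placeholder host and token:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"  # placeholder workspace URL
DATABRICKS_TOKEN = "<personal-access-token>"       # placeholder token

# Delete the serving endpoint named "feed-ads"
# (assumes DELETE /api/2.0/serving-endpoints/{name} is available)
response = requests.delete(
    f"{DATABRICKS_HOST}/api/2.0/serving-endpoints/feed-ads",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
)
response.raise_for_status()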
Debug your model serving endpoint
To debug any issues with the endpoint, you can fetch:
Model server container build logs
Model server logs
These logs are also accessible from the Endpoints UI in the Logs tab.
To get the build logs for a served model, use the following request:
GET /api/2.0/serving-endpoints/{name}/served-models/{served-model-name}/build-logs
{
"config_version": 1 // optional
}
To get the model server logs for a served model, use the following request:
GET /api/2.0/serving-endpoints/{name}/served-models/{served-model-name}/logs
{
"config_version": 1 // optional
}
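A small sketch of fetching both kinds of logs with the requests library; the host, token, endpoint name, and served model name are placeholders:

import requests

DATABRICKS_HOST = "https://<databricks-instance>"  # placeholder workspace URL
DATABRICKS_TOKEN = "<personal-access-token>"       # placeholder token
headers = {"Authorization": f"Bearer {DATABRICKS_TOKEN}"}

base = f"{DATABRICKS_HOST}/api/2.0/serving-endpoints/feed-ads/served-models/ads1-1"

# Container build logs for the served model
build_logs = requests.get(f"{base}/build-logs", headers=headers)
print(build_logs.text)

# Model server logs for the served model
server_logs = requests.get(f"{base}/logs", headers=headers)
print(server_logs.text)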
Notebook examples
The following notebooks include different models that you can use to get up and running with Model Serving endpoints. The model examples can be imported into the workspace by following the directions in Import a notebook. After you choose and create a model from one of the examples, register it in the MLflow Model Registry, and then follow the UI workflow steps for model serving.
The following notebook example demonstrates how to create and manage model serving endpoints using Python.
Anaconda licensing update
The following notice is for customers relying on Anaconda.
Important
Anaconda Inc. updated their terms of service for anaconda.org channels. Based on the new terms of service you may require a commercial license if you rely on Anaconda’s packaging and distribution. See Anaconda Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.
MLflow models logged before v1.18 (Databricks Runtime 8.3 ML or earlier) were by default logged with the conda defaults channel (https://repo.anaconda.com/pkgs/) as a dependency. Because of this license change, Databricks has stopped the use of the defaults channel for models logged using MLflow v1.18 and above. The default channel logged is now conda-forge, which points at the community-managed https://conda-forge.org/.
If you logged a model before MLflow v1.18 without excluding the defaults channel from the conda environment for the model, that model may have a dependency on the defaults channel that you may not have intended.
To manually confirm whether a model has this dependency, you can examine the channel value in the conda.yaml file that is packaged with the logged model. For example, a model's conda.yaml with a defaults channel dependency may look like this:
channels:
- defaults
dependencies:
- python=3.8.8
- pip
- pip:
- mlflow
- scikit-learn==0.23.2
- cloudpickle==1.6.0
name: mlflow-env
Because Databricks cannot determine whether your use of the Anaconda repository to interact with your models is permitted under your relationship with Anaconda, Databricks is not forcing its customers to make any changes. If your use of the Anaconda.com repo through the use of Databricks is permitted under Anaconda’s terms, you do not need to take any action.
If you would like to change the channel used in a model's environment, you can re-register the model to the model registry with a new conda.yaml. You can do this by specifying the channel in the conda_env parameter of log_model().
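For example, a minimal sketch of re-logging a scikit-learn model with a conda environment that uses only the conda-forge channel. The environment contents mirror the example conda.yaml above and are illustrative; pin the versions your model actually needs, and the registered model name ads1 is a placeholder.

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a simple placeholder model to re-log
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=500).fit(X, y)

# Conda environment that uses conda-forge instead of the Anaconda defaults channel
conda_env = {
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.8.8",
        "pip",
        {"pip": ["mlflow", "scikit-learn==0.23.2", "cloudpickle==1.6.0"]},
    ],
    "name": "mlflow-env",
}

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        conda_env=conda_env,           # override the channels packaged with the model
        registered_model_name="ads1",  # placeholder name to re-register under
    )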
For more information on the log_model() API, see the MLflow documentation for the model flavor you are working with, for example, log_model for scikit-learn.
For more information on conda.yaml files, see the MLflow documentation.