Create and manage model serving endpoints

This article describes how to create and manage model serving endpoints that use Databricks Model Serving.

Requirements

  • Model Serving is only available for Python-based MLflow models registered in the MLflow Model Registry. You must declare all model dependencies in the conda environment or requirements file.

    • If you don’t have a registered model, see the notebook examples for pre-packaged models you can use to get up and running with Model Serving endpoints.

  • Your workspace must be in a supported region.

  • If you use custom libraries or libraries from a private mirror server with your model, see Use custom Python libraries with Model Serving before you create the model endpoint.

Important

If you rely on Anaconda, review the terms of service notice for additional information.

Access control

To understand access control options for model serving endpoints and best practice guidance for endpoint management, see Serving endpoints access control.

Create model serving endpoints

You can create Model Serving endpoints with the Databricks Machine Learning API or the Databricks Machine Learning UI. An endpoint can serve any registered Python MLflow model in the Model Registry.

API workflow

You can create an endpoint with the following:

POST /api/2.0/serving-endpoints

{
  "name": "feed-ads",
  "config": {
    "served_models": [{
      "model_name": "ads1",
      "model_version": "1",
      "workload_size": "Small",
      "scale_to_zero_enabled": true
    }]
  }
}
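
If you prefer to call the API from Python, the following is a minimal sketch using the requests library. The DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are assumptions standing in for your workspace URL and access token.

import os

import requests

# Assumed environment variables for your workspace URL and access token.
host = os.environ["DATABRICKS_HOST"]    # for example, https://<databricks-instance>
token = os.environ["DATABRICKS_TOKEN"]

endpoint_config = {
    "name": "feed-ads",
    "config": {
        "served_models": [{
            "model_name": "ads1",
            "model_version": "1",
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
        }]
    },
}

response = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config,
)
response.raise_for_status()
print(response.json())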

The following is an example response. The endpoint’s config_update state is IN_PROGRESS and the served model is in a CREATING state. The pending_config field shows the details of the update that is in progress.

{
  "name": "feed-ads",
  "creator": "customer@example.com",
  "creation_timestamp": 1666829055000,
  "last_updated_timestamp": 1666829055000,
  "state": {
    "ready": "NOT_READY",
    "config_update": "IN_PROGRESS"
  },
  "pending_config": {
    "start_time": 1666718879000,
    "served_models": [{
      "name": "ads1-1",
      "model_name": "ads1",
      "model_version": "1",
      "workload_size": "Small",
      "scale_to_zero_enabled": true,
      "state": {
        "deployment": "DEPLOYMENT_CREATING",
        "deployment_state_message": "Creating"
      },
      "creator": "customer@example.com",
      "creation_timestamp": 1666829055000
    }],
    "config_version": 1,
    "traffic_config": {
      "routes": [
        {
          "served_model_name": "ads1-1",
          "traffic_percentage": 100
        }
      ]
    }
  },
  "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "permission_level": "CAN_MANAGE"
}

UI workflow

You can create an endpoint for model serving with the Serving UI.

  1. Click Serving in the sidebar to display the Serving UI.

  2. Click Create serving endpoint.

  3. In the Serving endpoint name field provide a name for your endpoint.

  4. In the Edit configuration section select which model and model version you want to serve.

  5. Select what size compute to use.

  6. Specify if the endpoint should scale to zero when not in use, and the percentage of traffic to route to a served model.

  7. Click Create serving endpoint. The Serving endpoints page appears with Serving endpoint state shown as Not Ready.

You can also access the Serving UI to create an endpoint from the registered model page in the Databricks Machine Learning UI.

  1. Select the model you want to serve.

  2. Click the Use model for inference button.

  3. Select the Real-time tab.

  4. Select the model version and provide an endpoint name.

  5. Select the compute size for your endpoint, and specify if your endpoint should scale to zero when not in use.

  6. Click Create serving endpoint. The Serving endpoints page appears with Serving endpoint state shown as Not Ready. After a few minutes, Serving endpoint state changes to Ready.

Modify the compute configuration of an endpoint

After enabling a model endpoint, you can set the compute configuration as desired with the API or the UI. This configuration is particularly helpful if you need additional resources for your model. Workload size and compute configuration play a key role in what resources are allocated for serving your model.

Until the new configuration is ready, the old configuration keeps serving prediction traffic. While there is an update in progress, another update cannot be made.

You can also configure your endpoint to serve multiple models. See Serve multiple models to a Model Serving endpoint.

API workflow

PUT /api/2.0/serving-endpoints/{name}/config

{
  "served_models": [{
    "model_name": "ads1",
    "model_version": "2",
    "workload_size": "Small",
    "scale_to_zero_enabled": true
  }]
}
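
The same update can be sent from Python. This sketch reuses the assumed DATABRICKS_HOST and DATABRICKS_TOKEN environment variables from the creation example above.

import os

import requests

host = os.environ["DATABRICKS_HOST"]    # assumed workspace URL
token = os.environ["DATABRICKS_TOKEN"]  # assumed access token

new_config = {
    "served_models": [{
        "model_name": "ads1",
        "model_version": "2",
        "workload_size": "Small",
        "scale_to_zero_enabled": True,
    }]
}

response = requests.put(
    f"{host}/api/2.0/serving-endpoints/feed-ads/config",
    headers={"Authorization": f"Bearer {token}"},
    json=new_config,
)
response.raise_for_status()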

The following is a response example:

{
  "name": "feed-ads",
  "creator": "cuastomer@example.com",
  "creation_timestamp": 1666829055000,
  "last_updated_timestamp": 1666946600000,
  "state": {
    "ready": true,
    "update_state": "IN_PROGRESS"
  },
  "config": {
    "served_models": [
      {
        "name": "ads1-1",
        "model_name": "ads1",
        "model_version": "1",
        "workload_size": "Small",
        "scale_to_zero_enabled": true,
        "state": {
          "deployment": "DEPLOYMENT_READY",
          "deployment_state_message": ""
        },
        "creator": "customer@example.com",
        "creation_timestamp": 1666887851000
      }
    ],
    "traffic_config": {
      "routes": [
        {
          "served_model_name": "ads1-1",
          "traffic_percentage": 100
        }
      ]
    },
    "config_version": 2
  },
  "pending_update": {
    "start_time": 1666946600000,
    "served_models": [
      {
        "name": "ads1-2",
        "model_name": "ads1",
        "model_version": "2",
        "workload_size": "Small",
        "scale_to_zero_enabled": true,
        "state": {
          "deployment": "DEPLOYMENT_CREATING",
          "deployment_state_message": "Created"
        },
        "creator": "customer@example.com",
        "creation_timestamp": 1666946600000
      }
    ],
     "traffic_config": {
      "routes": [
        {
          "served_model_name": "ads1-2",
          "traffic_percentage": 100
        }
      ]
    }
    "config_version": 3
  },
  "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "permission_level": "CAN_MANAGE"
}

UI workflow

After you enable a model endpoint, select Edit configuration to modify the compute configuration of your endpoint.

You can do the following:

  • Choose from a few workload sizes; autoscaling is automatically configured within the selected workload size.

  • Specify if your endpoint should scale down to zero when not in use.

  • Modify the percent of traffic to route to your served model.

Score a model endpoint

To score a deployed model, you can send a REST API request to the model URL or use the UI.

To score through the API, send your request to the following URI:

POST /serving-endpoints/{endpoint-name}/invocations

Request format

Send requests by constructing a JSON body with one of the following keys and a JSON object corresponding to the input format.

There are four formats for the input JSON depending on your use case:

  • dataframe_split is a JSON-serialized Pandas DataFrame in the split orientation. The sketch after this list shows how to build both DataFrame formats from a pandas DataFrame.

    {
      "dataframe_split": {
        "index": [0, 1],
        "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
        "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
      }
    }
    
  • dataframe_records is a JSON-serialized Pandas DataFrame in the records orientation.

    Note

    This format does not guarantee the preservation of column ordering, and the split format is preferred over the records format.

    {
      "dataframe_records": [
      {
         "sepal length (cm)": 5.1,
         "sepal width (cm)": 3.5,
         "petal length (cm)": 1.4,
         "petal width (cm)": 0.2
      },
      {
         "sepal length (cm)": 4.9,
         "sepal width (cm)": 3,
         "petal length (cm)": 1.4,
         "petal width (cm)": 0.2
       },
       {
         "sepal length (cm)": 4.7,
         "sepal width (cm)": 3.2,
         "petal length (cm)": 1.3,
         "petal width (cm)": 0.2
       }
      ]
    }
    
  • instances is a tensor-based format that accepts tensors in row format. Use this format if all of the input tensors have the same 0-th dimension. Conceptually, each tensor in the instances list could be joined with the other tensors of the same name in the rest of the list to construct the full input tensor for the model, which is only possible if all of the tensors have the same 0-th dimension.

      {"instances": [ "a", "b", "c" ]}
    

    Or, for a model with multiple named input tensors, each object in the instances list provides one value for each named tensor. In the following example there are two instances, and each instance supplies a value for each of the three named tensors t1, t2, and t3.

    {
     "instances": [
      {
       "t1": "a",
       "t2": [1, 2, 3, 4, 5],
       "t3": [[1, 2], [3, 4], [5, 6]]
      },
      {
       "t1": "b",
       "t2": [6, 7, 8, 9, 10],
       "t3": [[7, 8], [9, 10], [11, 12]]
      }
     ]
    }
    
  • inputs sends queries with tensors in columnar format. This request is different because the number of tensor instances for t2 (three) differs from that for t1 and t3 (two), so it is not possible to represent this input in the instances format.

    {
     "inputs": {
      "t1": ["a", "b"],
      "t2": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]],
      "t3": [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
     }
    }
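
Both DataFrame formats can be produced directly from a pandas DataFrame, as shown in the following sketch. The column names and values are the same illustrative iris data used above.

import pandas as pd

# Illustrative DataFrame matching the iris examples above.
df = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]],
    columns=["sepal length (cm)", "sepal width (cm)",
             "petal length (cm)", "petal width (cm)"],
)

# split orientation: a single object with index, columns, and data keys.
split_payload = {"dataframe_split": df.to_dict(orient="split")}

# records orientation: one object per row, keyed by column name.
records_payload = {"dataframe_records": df.to_dict(orient="records")}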
    

Response format

The response from the endpoint is in the following format. The output from your model is wrapped in a predictions key.

{
  "predictions": "<json-output-from-model>"
}
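
For example, a classifier scored on the two iris rows shown earlier might return a response similar to the following; the actual values depend entirely on your model.

{
  "predictions": [0, 0]
}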

UI workflow

Sending requests using the UI is the easiest and fastest way to test the model. From the Serving endpoint page, select Query endpoint. You can insert the model input data in JSON format and click Send Request. If the model has been logged with an input example, click Show Example to load the input example.

API workflow

You can send a scoring request through the REST API using standard Databricks authentication. The following examples demonstrate authentication using a personal access token.

Note

As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. To create access tokens for service principals, see Manage access tokens for a service principal.

Given a MODEL_VERSION_URI like https://<databricks-instance>/serving-endpoints/iris-classifier/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN, the following example snippets show how to score a served model.

Score a model that accepts the dataframe_records input format:

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
  -H 'Content-Type: application/json' \
  -d '{"dataframe_records": [
    {
      "sepal_length": 5.1,
      "sepal_width": 3.5,
      "petal_length": 1.4,
      "petal_width": 0.2
    }
  ]}'

Score a model that accepts tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API docs.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
   -H 'Content-Type: application/json' \
   -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'

The following Python example defines a helper that scores either a pandas DataFrame or a tensor input:

import numpy as np
import pandas as pd
import requests

def create_tf_serving_json(data):
  # Convert a dict of named arrays (or a single array) into the TF Serving "inputs" format.
  return {'inputs': {name: data[name].tolist() for name in data.keys()} if isinstance(data, dict) else data.tolist()}

def score_model(model_uri, databricks_token, data):
  headers = {
    "Authorization": f"Bearer {databricks_token}",
    "Content-Type": "application/json",
  }
  # Use the dataframe_records format for pandas DataFrames and the TF Serving format for tensors.
  payload = {'dataframe_records': data.to_dict(orient='records')} if isinstance(data, pd.DataFrame) else create_tf_serving_json(data)
  response = requests.request(method='POST', headers=headers, url=model_uri, json=payload)
  if response.status_code != 200:
      raise Exception(f"Request failed with status {response.status_code}, {response.text}")
  return response.json()


# Scoring a model that accepts pandas DataFrames
data =  pd.DataFrame([{
  "sepal_length": 5.1,
  "sepal_width": 3.5,
  "petal_length": 1.4,
  "petal_width": 0.2
}])
score_model(MODEL_VERSION_URI, DATABRICKS_API_TOKEN, data)

# Scoring a model that accepts tensors
data = np.asarray([[5.1, 3.5, 1.4, 0.2]])
score_model(MODEL_VERSION_URI, DATABRICKS_API_TOKEN, data)

You can score a dataset in Power BI Desktop using the following steps:

  1. Open the dataset you want to score.

  2. Go to Transform Data.

  3. Right-click in the left panel and select Create New Query.

  4. Go to View > Advanced Editor.

  5. Replace the query body with the code snippet below, after filling in an appropriate DATABRICKS_API_TOKEN and MODEL_VERSION_URI.

    (dataset as table ) as table =>
    let
      call_predict = (dataset as table ) as list =>
      let
        apiToken = DATABRICKS_API_TOKEN,
        modelUri = MODEL_VERSION_URI,
        responseList = Json.Document(Web.Contents(modelUri,
          [
            Headers = [
              #"Content-Type" = "application/json",
              #"Authorization" = Text.Format("Bearer #{0}", {apiToken})
            ],
            Content = {"dataframe_records": Json.FromValue(dataset)}
          ]
        ))
      in
        responseList,
      predictionList = List.Combine(List.Transform(Table.Split(dataset, 256), (x) => call_predict(x))),
      predictionsTable = Table.FromList(predictionList, (x) => {x}, {"Prediction"}),
      datasetWithPrediction = Table.Join(
        Table.AddIndexColumn(predictionsTable, "index"), "index",
        Table.AddIndexColumn(dataset, "index"), "index")
    in
      datasetWithPrediction
    
  6. Name the query with your desired model name.

  7. Open the advanced query editor for your dataset and apply the model function.

See the following notebook for an example of how to test your Model Serving endpoint with a Python model:

Test Model Serving endpoint notebook

Open notebook in new tab

Get the status of the model endpoint

API workflow

You can check the status of an endpoint with the following:

GET /api/2.0/serving-endpoints/{name}

In the following example response, the state.ready field is "READY", which means the endpoint is ready to receive traffic. The state.config_update field is NOT_UPDATING, and pending_config is no longer returned because the update finished successfully.

{
  "name": "feed-ads",
  "creator": "customer@example.com",
  "creation_timestamp": 1666829055000,
  "last_updated_timestamp": 1666829055000,
  "state": {
    "ready": "READY",
    "update_state": "NOT_UPDATING"
  },
  "config": {
    "served_models": [
      {
        "name": "ads1-1",
        "model_name": "ads1",
        "model_version": "1",
        "workload_size": "Small",
        "scale_to_zero_enabled": false,
        "state": {
          "deployment": "DEPLOYMENT_READY",
          "deployment_state_message": ""
        },
        "creator": "customer@example.com",
        "creation_timestamp": 1666829055000
      }
    ],
    "traffic_config": {
      "routes": [
        {
          "served_model_name": "ads1-1",
          "traffic_percentage": 100
        }
      ]
    },
    "config_version": 1
  },
  "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "permission_level": "CAN_MANAGE"
}
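
Because configuration updates can take several minutes, a common pattern is to poll this endpoint until state.ready is READY. The following is a minimal sketch that assumes the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables used in the earlier examples.

import os
import time

import requests

host = os.environ["DATABRICKS_HOST"]    # assumed workspace URL
token = os.environ["DATABRICKS_TOKEN"]  # assumed access token

def wait_until_ready(endpoint_name, timeout_s=1200, poll_s=30):
    # Poll the endpoint status until it reports READY or the timeout expires.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        response = requests.get(
            f"{host}/api/2.0/serving-endpoints/{endpoint_name}",
            headers={"Authorization": f"Bearer {token}"},
        )
        response.raise_for_status()
        state = response.json()["state"]
        if state["ready"] == "READY":
            return state
        time.sleep(poll_s)
    raise TimeoutError(f"{endpoint_name} was not ready after {timeout_s} seconds")

wait_until_ready("feed-ads")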

UI workflow

In the UI, you can check the status of an endpoint from the Serving endpoint state indicator at the top of your endpoint’s details page.

Delete a model serving endpoint

To disable serving for a model, you can delete the endpoint it’s served on.

API workflow

To delete a serving endpoint for a model, use the following:

DELETE /api/2.0/serving-endpoints/{name}
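
From Python, the same delete call might look like the following sketch, again assuming the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables from the earlier examples.

import os

import requests

host = os.environ["DATABRICKS_HOST"]    # assumed workspace URL
token = os.environ["DATABRICKS_TOKEN"]  # assumed access token

response = requests.delete(
    f"{host}/api/2.0/serving-endpoints/feed-ads",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()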

UI workflow

You can delete an endpoint from your endpoint’s details page.

  1. Click Serving on the sidebar.

  2. Click the endpoint you want to delete.

  3. Click the kebab menu at the top and select Delete.

Debug your model serving endpoint

To debug any issues with the endpoint, you can fetch:

  • Model server container build logs

  • Model server logs

These logs are also accessible from the Endpoints UI in the Logs tab.

To fetch the build logs for a served model, use the following request:

GET /api/2.0/serving-endpoints/{name}/served-models/{served-model-name}/build-logs

{
  "config_version": 1  // optional
}

To fetch the model server logs for a served model, use the following request:

GET /api/2.0/serving-endpoints/{name}/served-models/{served-model-name}/logs

{
  "config_version": 1  // optional
}
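
For example, a minimal Python sketch that fetches both kinds of logs might look like the following. The endpoint name and served model name are taken from the earlier examples, and the host and token environment variables are the same assumptions as before.

import os

import requests

host = os.environ["DATABRICKS_HOST"]    # assumed workspace URL
token = os.environ["DATABRICKS_TOKEN"]  # assumed access token
headers = {"Authorization": f"Bearer {token}"}

endpoint_name = "feed-ads"      # endpoint name from the earlier examples
served_model_name = "ads1-1"    # served model name from the earlier examples

base = f"{host}/api/2.0/serving-endpoints/{endpoint_name}/served-models/{served_model_name}"

# Container build logs for the served model.
build_logs = requests.get(f"{base}/build-logs", headers=headers)
build_logs.raise_for_status()
print(build_logs.text)

# Model server logs for the served model.
server_logs = requests.get(f"{base}/logs", headers=headers)
server_logs.raise_for_status()
print(server_logs.text)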

Notebook examples

The following notebooks include different models that you can use to get up and running with Model Serving endpoints. The model examples can be imported into the workspace by following the directions in Import a notebook. After you choose and create a model from one of the examples, register it in the MLflow Model Registry, and then follow the UI workflow steps for model serving.

Train and register a scikit-learn model for model serving notebook

Open notebook in new tab

Train and register a Pytorch model for model serving notebook

Open notebook in new tab

Train and register a HuggingFace model for model serving notebook

Open notebook in new tab

The following notebook example demonstrates how to create and manage a model serving endpoint using Python.

Create and manage a serving endpoint with a Python notebook

Open notebook in new tab

Anaconda licensing update

The following notice is for customers relying on Anaconda.

Important

Anaconda Inc. updated their terms of service for anaconda.org channels. Based on the new terms of service you may require a commercial license if you rely on Anaconda’s packaging and distribution. See Anaconda Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.

MLflow models logged before v1.18 (Databricks Runtime 8.3 ML or earlier) were by default logged with the conda defaults channel (https://repo.anaconda.com/pkgs/) as a dependency. Because of this license change, Databricks has stopped the use of the defaults channel for models logged using MLflow v1.18 and above. The default channel logged is now conda-forge, which points at the community managed https://conda-forge.org/.

If you logged a model before MLflow v1.18 without excluding the defaults channel from the conda environment for the model, that model may have a dependency on the defaults channel that you did not intend. To manually confirm whether a model has this dependency, examine the channels value in the conda.yaml file that is packaged with the logged model. For example, a model’s conda.yaml with a defaults channel dependency may look like this:

channels:
- defaults
dependencies:
- python=3.8.8
- pip
- pip:
    - mlflow
    - scikit-learn==0.23.2
    - cloudpickle==1.6.0
name: mlflow-env

Because Databricks cannot determine whether your use of the Anaconda repository to interact with your models is permitted under your relationship with Anaconda, Databricks is not forcing its customers to make any changes. If your use of the Anaconda.com repo through your use of Databricks is permitted under Anaconda’s terms, you do not need to take any action.

If you would like to change the channel used in a model’s environment, you can re-register the model to the model registry with a new conda.yaml. You can do this by specifying the channel in the conda_env parameter of log_model().
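
For example, a minimal sketch for re-logging a scikit-learn model with a conda-forge based environment might look like the following; the training data, model, pinned package versions, and registered model name are illustrative assumptions.

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Illustrative model; substitute your own trained model.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Conda environment that uses conda-forge instead of the defaults channel.
conda_env = {
    "name": "mlflow-env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.8.8",
        "pip",
        {"pip": ["mlflow", "scikit-learn==0.23.2", "cloudpickle==1.6.0"]},
    ],
}

with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",
        conda_env=conda_env,
        registered_model_name="my-model",   # assumed registered model name
    )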

For more information on the log_model() API, see the MLflow documentation for the model flavor you are working with, for example, log_model for scikit-learn.

For more information on conda.yaml files, see the MLflow documentation.