Create and manage Serverless Real-Time Inference endpoints

Preview

This feature is in Public Preview.

This article describes how to create and manage endpoints that use Databricks Serverless Real-Time Inference. Learn more about Serverless Real-Time Inference.

Important

  • API definitions and workflows are subject to change during the public preview.

  • If you’re relying on Anaconda, please review the terms of service notice for additional information.

Requirements

  • Serverless Real-Time Inference is only available for Python-based MLflow models registered in the MLflow Model Registry. You must declare all model dependencies in the conda environment or requirements file.

    • If you don’t have a registered model, see the notebook examples for pre-packaged models you can use to get up and running with Serverless Real-Time Inference endpoints.

  • Your workspace must be enabled for Serverless Real-Time Inference. To enable Model Serving, you must have cluster creation permission.

  • If you use custom libraries or libraries from a private mirror with your model, see Use custom Python libraries with Serverless Real-Time Inference before you create the model endpoint.

Create model serving endpoints

You can create Serverless Real-Time Inference endpoints for model serving with the Databricks Machine Learning API or the Databricks Machine Learning UI. An endpoint can serve any Python MLflow model registered in the Model Registry.

API Workflow

You can use the Enable Serving API to create an endpoint for model serving. In the following example, ElasticNet is the name of the registered model.

POST /preview/mlflow/endpoints-v2/enable

{
   "registered_model_name": "ElasticNet"
}
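If you script endpoint management, the enable call above can be issued with Python's requests library. The helper name and the /api/2.0 URL prefix below are assumptions for illustration, not part of the documented API; adjust them to match your workspace:

```python
import requests

def enable_serving_request(host: str, token: str, model_name: str) -> requests.PreparedRequest:
    """Build (without sending) the Enable Serving call shown above.

    The /api/2.0 prefix is an assumption; the endpoint path comes from the
    documented API. Send the result with requests.Session().send(...).
    """
    req = requests.Request(
        "POST",
        f"{host.rstrip('/')}/api/2.0/preview/mlflow/endpoints-v2/enable",
        headers={"Authorization": f"Bearer {token}"},
        json={"registered_model_name": model_name},
    )
    return req.prepare()

# Inspect the prepared request before sending it
prepared = enable_serving_request(
    "https://example.cloud.databricks.com", "my-token", "ElasticNet"
)
```

To actually enable serving, pass the prepared request to requests.Session().send(), or call requests.post directly with the same URL, headers, and JSON body.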

UI Workflow

You enable a model for serving from its registered model page in the Databricks Machine Learning UI.

  1. Click the Serving tab. If the model is not already enabled for serving, the Enable Serverless Real-Time Inference button appears.

  2. Click Enable Serverless Real-Time Inference. The Serving tab appears with Status shown as Pending. After a few minutes, Status changes to Ready.

Modify the compute configuration of an endpoint

After enabling a model endpoint, you can set the compute configuration as desired with the API or the UI. This configuration is particularly helpful if you need additional resources for your model. Workload size and compute configuration play a key role in what resources are allocated for serving your model. Learn more about WorkloadConfigSpec objects.

API Workflow

The status of the configuration update can be tracked in the config_update_status field of the Endpoint Version status.

PUT /preview/model-serving-api/endpoints-v2/update-compute-config

In the following, populate desired_workload_config_spec with WorkloadConfigSpec properties.

{
  "registered_model_name": "ElasticNet",
  "stage": "Staging|Production",
  "desired_workload_config_spec": {"WorkloadConfigSpec"}
}
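For scripted configuration updates, the request body can be assembled in Python first. The helper below is hypothetical; the field names mirror the WorkloadConfigSpec object described later in this article:

```python
def compute_config_body(model_name: str, stage: str,
                        workload_size: str = "Small",
                        scale_to_zero: bool = False) -> dict:
    """Build the JSON body for the update-compute-config call above.

    Hypothetical helper for illustration; workload_size is one of
    "Small", "Medium", or "Large".
    """
    if stage not in ("Staging", "Production"):
        raise ValueError("stage must be 'Staging' or 'Production'")
    return {
        "registered_model_name": model_name,
        "stage": stage,
        "desired_workload_config_spec": {
            "workload_size_id": workload_size,
            "scale_to_zero_enabled": scale_to_zero,
        },
    }

body = compute_config_body("ElasticNet", "Staging", "Medium", scale_to_zero=True)
```

Send the resulting dictionary as the JSON body of the PUT request, using your usual Databricks authentication.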

UI Workflow

After you enable a model endpoint, you can set the desired compute configuration on the Compute Settings tab. You can set separate configurations for Staging and Production model versions.

You can choose from a few workload sizes, and autoscaling is automatically configured within the workload size. If you want your endpoint to scale down to zero, select the Scale to zero checkbox.

Scoring a model endpoint

To score a deployed model, you can send a REST API request to the model URL or use the UI.

Call a model through the API endpoint for its stage. For example, if version 1 is in the Production stage, it can be scored using this URL:

https://<databricks-instance>/model-endpoint/iris-classifier/Production/invocations

The list of available model URIs appears at the top of the Model Versions tab on the Serving tab.

Request format

Send requests by constructing a JSON object with one of the following keys and a body corresponding to the input format.

There are four formats for the input JSON depending on your use case:

  • dataframe_split is a JSON-serialized pandas DataFrame in the split orientation.

    {
      "dataframe_split": {
        "index": [0, 1],
        "columns": ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
        "data": [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
      }
    }
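If your input already lives in a pandas DataFrame, pandas can produce this payload directly: to_dict(orient="split") emits exactly the index/columns/data structure shown above.

```python
import pandas as pd

df = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]],
    columns=["sepal length (cm)", "sepal width (cm)",
             "petal length (cm)", "petal width (cm)"],
)

# to_dict(orient="split") yields the index/columns/data structure above
payload = {"dataframe_split": df.to_dict(orient="split")}
```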
    
  • dataframe_records is a JSON-serialized pandas DataFrame in the records orientation.

    Note

    This format does not guarantee the preservation of column ordering, and the split format is preferred over the records format.

    {
      "dataframe_records": [
      {
         "sepal length (cm)": 5.1,
         "sepal width (cm)": 3.5,
         "petal length (cm)": 1.4,
         "petal width (cm)": 0.2
      },
      {
         "sepal length (cm)": 4.9,
         "sepal width (cm)": 3,
         "petal length (cm)": 1.4,
         "petal width (cm)": 0.2
       },
       {
         "sepal length (cm)": 4.7,
         "sepal width (cm)": 3.2,
         "petal length (cm)": 1.3,
         "petal width (cm)": 0.2
       }
      ]
    }
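As with the split orientation, pandas can emit this payload directly, via to_dict(orient="records"):

```python
import pandas as pd

df = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]],
    columns=["sepal length (cm)", "sepal width (cm)",
             "petal length (cm)", "petal width (cm)"],
)

# Each row becomes one {"column name": value, ...} record
payload = {"dataframe_records": df.to_dict(orient="records")}
```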
    
  • instances is a tensor-based format that accepts tensors in row format. Use this format if all of the input tensors have the same 0-th dimension: conceptually, each tensor in the instances list can then be joined with the other tensors of the same name in the rest of the list to construct the full input tensor for the model.

      {"instances": [ "a", "b", "c" ]}
    

    or

    In the following example, there are two instances, so each of the three named tensors (t1, t2, and t3) appears exactly twice: once per instance.

    {
     "instances": [
      {
       "t1": "a",
       "t2": [1, 2, 3, 4, 5],
       "t3": [[1, 2], [3, 4], [5, 6]]
      },
      {
       "t1": "b",
       "t2": [6, 7, 8, 9, 10],
       "t3": [[7, 8], [9, 10], [11, 12]]
      }
     ]
    }
    
  • inputs sends queries with tensors in columnar format. This request differs from the instances example: here there are three tensor instances of t2 but only two of t1 and t3, so this input cannot be represented in the instances format.

    {
     "inputs": {
      "t1": ["a", "b"],
      "t2": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]],
      "t3": [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]
     }
    }
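The constraint separating the two tensor formats can be checked mechanically. The helper below is hypothetical (not part of any Databricks API); it converts a columnar inputs dict to row-format instances, and fails exactly when the 0-th dimensions disagree:

```python
def to_instances(columnar: dict) -> list:
    """Convert a columnar 'inputs' dict to row-format 'instances'.

    Raises ValueError when the named tensors have differing 0-th
    dimensions, i.e. when only the 'inputs' format can express the query.
    """
    lengths = {name: len(values) for name, values in columnar.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"tensors have differing 0-th dimensions: {lengths}")
    n_instances = next(iter(lengths.values()))
    return [{name: columnar[name][i] for name in columnar}
            for i in range(n_instances)]

# The two-instance example above round-trips between the formats:
rows = to_instances({"t1": ["a", "b"],
                     "t2": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]})
```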
    

Response Format

The response from the endpoint is in the following format. The output from your model is wrapped in a “predictions” key.

{
  "predictions": "<JSON output from model>"
}

UI Workflow

Sending requests using the UI is the easiest and fastest way to test the model. You can insert the model input data in JSON format and click Send Request. If the model was logged with an input example, click Show Example to load it.

API Workflow

You can send a scoring request through the REST API using standard Databricks authentication. The following examples demonstrate authentication using a personal access token.

Note

As a security best practice, when authenticating with automated tools, systems, scripts, and apps, Databricks recommends you use access tokens belonging to service principals instead of workspace users. To create access tokens for service principals, see Manage access tokens for a service principal.

Given a MODEL_VERSION_URI like https://<databricks-instance>/model-endpoint/iris-classifier/Production/invocations, where <databricks-instance> is the name of your Databricks instance, and a Databricks REST API token called DATABRICKS_API_TOKEN, the following are example snippets of how to score a served model.

Score a model accepting dataframe records input format.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
  -H 'Content-Type: application/json' \
  -d '{"dataframe_records": [
    {
      "sepal_length": 5.1,
      "sepal_width": 3.5,
      "petal_length": 1.4,
      "petal_width": 0.2
    }
  ]}'

Score a model accepting tensor inputs. Tensor inputs should be formatted as described in TensorFlow Serving’s API docs.

curl -X POST -u token:$DATABRICKS_API_TOKEN $MODEL_VERSION_URI \
   -H 'Content-Type: application/json' \
   -d '{"inputs": [[5.1, 3.5, 1.4, 0.2]]}'

import json

import numpy as np
import pandas as pd
import requests

def create_tf_serving_json(data):
  # Wrap a dict of named tensors, or a single array, in the "inputs" format
  return {'inputs': {name: data[name].tolist() for name in data.keys()} if isinstance(data, dict) else data.tolist()}

def score_model(model_uri, databricks_token, data):
  headers = {
    "Authorization": f"Bearer {databricks_token}",
    "Content-Type": "application/json",
  }
  payload = {'dataframe_records': data.to_dict(orient='records')} if isinstance(data, pd.DataFrame) else create_tf_serving_json(data)
  response = requests.request(method='POST', headers=headers, url=model_uri, data=json.dumps(payload))
  if response.status_code != 200:
    raise Exception(f"Request failed with status {response.status_code}, {response.text}")
  return response.json()


# Scoring a model that accepts pandas DataFrames
data =  pd.DataFrame([{
  "sepal_length": 5.1,
  "sepal_width": 3.5,
  "petal_length": 1.4,
  "petal_width": 0.2
}])
score_model(MODEL_VERSION_URI, DATABRICKS_API_TOKEN, data)

# Scoring a model that accepts tensors
data = np.asarray([[5.1, 3.5, 1.4, 0.2]])
score_model(MODEL_VERSION_URI, DATABRICKS_API_TOKEN, data)

You can score a dataset in Power BI Desktop using the following steps:

  1. Open the dataset you want to score.

  2. Go to Transform Data.

  3. Right-click in the left panel and select Create New Query.

  4. Go to View > Advanced Editor.

  5. Replace the query body with the code snippet below, after filling in an appropriate DATABRICKS_API_TOKEN and MODEL_VERSION_URI.

    (dataset as table ) as table =>
    let
      call_predict = (dataset as table ) as list =>
      let
        apiToken = DATABRICKS_API_TOKEN,
        modelUri = MODEL_VERSION_URI,
        response = Json.Document(Web.Contents(modelUri,
          [
            Headers = [
              #"Content-Type" = "application/json",
              #"Authorization" = Text.Format("Bearer #{0}", {apiToken})
            ],
            Content = Json.FromValue([dataframe_records = Table.ToRecords(dataset)])
          ]
        )),
        // The endpoint wraps model output in a "predictions" key
        responseList = response[predictions]
      in
        responseList,
      predictionList = List.Combine(List.Transform(Table.Split(dataset, 256), (x) => call_predict(x))),
      predictionsTable = Table.FromList(predictionList, (x) => {x}, {"Prediction"}),
      datasetWithPrediction = Table.Join(
        Table.AddIndexColumn(predictionsTable, "index"), "index",
        Table.AddIndexColumn(dataset, "index"), "index")
    in
      datasetWithPrediction
    
  6. Name the query with your desired model name.

  7. Open the advanced query editor for your dataset and apply the model function.

See the following notebook for an example of how to test your Serverless Real-Time Inference endpoint with a Python model:

Test Serverless Real-Time Inference endpoint notebook

Open notebook in new tab

Update the model version served by a model endpoint

A model version must be in either Staging or Production in the Model Registry before it can be served by the endpoint.

API Workflow

To serve a new model version, use the Model Registry to transition the model version you want to serve into the appropriate stage.

The following code example transitions version 2 of the model ElasticNet into Staging. By setting archive_existing_versions to true, any existing model versions are archived, which causes the Staging URL to point to the new model version after it’s ready for serving. Before the new version is ready, the Staging endpoint serves the old model version, so the transition is made with zero downtime.

POST /mlflow/databricks/model-versions/transition-stage

{
   "name": "ElasticNet",
   "version": "2",
   "stage": "Staging",
   "archive_existing_versions": true,
   "comment": "Deploying version 2 to Staging endpoint!"
}
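For scripted promotions, the request body can be built in Python; the helper below is hypothetical and simply mirrors the fields shown above:

```python
def transition_stage_body(name: str, version: str, stage: str,
                          archive_existing: bool = True,
                          comment: str = "") -> dict:
    """Build the body for the transition-stage call shown above.

    Set archive_existing=False to keep multiple versions serving
    in the same stage (see the next section).
    """
    return {
        "name": name,
        "version": version,
        "stage": stage,
        "archive_existing_versions": archive_existing,
        "comment": comment,
    }

body = transition_stage_body("ElasticNet", "2", "Staging",
                             comment="Deploying version 2 to Staging endpoint!")
```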

Keep multiple versions in a single stage

You can also choose to keep the previous Staging version in Staging. Multiple versions of a model can be in the same stage. In this scenario, both versions are served, but the Staging URL points only to the newest version. The older version is still accessible by its version URL.

If you want to try out a new version behind your staging endpoint, you can do the same as above, but set archive_existing_versions to false to ensure the previous Staging version doesn’t get archived.

POST /mlflow/databricks/model-versions/transition-stage

{
...
   "archive_existing_versions": false,
...
}

UI Workflow

To transition model versions to Staging or Production using the Databricks Machine Learning UI:

  1. Select models icon Models in the sidebar.

  2. Identify and select the registered model you want to update.

  3. Select the model version you want to transition to Staging or Production. The link opens that model version’s detail page.

  4. Use the Stage dropdown menu at the top right to transition the model version to Staging or Production.

Get the status of the model endpoint

API Workflow

Databricks provides the following API to check the status of an endpoint. Learn more about EndpointStatus objects.

POST /preview/mlflow/endpoints-v2/get-status
{
  "registered_model_name": "ElasticNet"
}

This returns the EndpointStatus object properties:

{
  "endpoint_status": {"EndpointStatus"}
}
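Because an endpoint's status moves from Pending to Ready a few minutes after enabling, deployment scripts typically poll get-status. Below is a minimal polling loop, where get_status is a zero-argument callable you supply that performs the get-status request above and returns its JSON response:

```python
import time

def wait_until_ready(get_status, timeout_s: float = 600, poll_s: float = 10) -> dict:
    """Poll until the endpoint reports READY.

    get_status: a zero-argument callable (supplied by you) that returns
    the JSON response of the get-status call above.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        response = get_status()
        state = response["endpoint_status"]["state"]
        if state == "READY":
            return response
        if state == "FAILED":
            raise RuntimeError(response["endpoint_status"].get("state_message", "endpoint FAILED"))
        time.sleep(poll_s)
    raise TimeoutError(f"endpoint not READY after {timeout_s}s")
```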

UI Workflow

In the UI, you can check the status of an endpoint from the Status indicator at the top of the Serving tab.

Get the status of model endpoint versions

You can get the status of a particular endpoint version that has been deployed. This lets you:

  • Track which versions are being served.

  • Track the status of those versions.

  • Verify whether a particular model version is ready for use.

API Workflow

Databricks provides two APIs to check the status of endpoint versions. To check the status of all endpoint versions for a particular registered model, you can use ListVersions. Learn more about EndpointVersionStatus objects.

GET /preview/mlflow/endpoints-v2/list-versions
{
  "registered_model_name": "ElasticNet"
}

This returns EndpointVersionStatus object properties:

{
  "endpoint_statuses": ["EndpointVersionStatus"]
}

Alternatively, if you already know the specific version whose status you want to know, you can use GetVersions.

GET /preview/mlflow/endpoints-v2/get-version-status
{
  "registered_model_name": "ElasticNet",
  "endpoint_version_name": "1"
}

This returns EndpointVersionStatus object properties:

{
  "endpoint_status": {"EndpointVersionStatus"}
}

Getting the status for a stage

You can also get the status for a particular stage. To do so, first determine which endpoint version is currently serving that stage. To retrieve that information, use ListVersionAliases.

GET /preview/mlflow/endpoints-v2/list-version-aliases
{
  "registered_model_name": "ElasticNet"
}

This returns:

{
  "aliases": [
   {
      "alias": "Staging",
      "endpoint_version_name": "2"
   },
   {
      "alias": "Production",
      "endpoint_version_name": "1"
   }
  ]
}

From there, use the get-version-status API described above to get the status of that endpoint version.
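For example, a small (hypothetical) helper can pull the serving version for a stage out of the response above, ready to feed into the get-version-status call:

```python
def version_for_stage(aliases_response: dict, stage: str) -> str:
    """Return the endpoint version name currently serving `stage`."""
    for entry in aliases_response["aliases"]:
        if entry["alias"] == stage:
            return entry["endpoint_version_name"]
    raise KeyError(f"no endpoint version is serving {stage}")

# The sample list-version-aliases response from above
response = {
    "aliases": [
        {"alias": "Staging", "endpoint_version_name": "2"},
        {"alias": "Production", "endpoint_version_name": "1"},
    ]
}
```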

UI Workflow

In the Serving tab of the UI, you can see each endpoint version with its own tab on the left. When you select each tab, detailed information about a particular version appears. The version that is currently serving a stage can be seen from the Staging or Production label on the endpoint version.

Disable model serving

API Workflow

You can use the Disable Serving API to disable model serving for any registered model in the Model Registry:

POST /preview/mlflow/endpoints-v2/disable

{
   "registered_model_name": "ElasticNet"
}

UI Workflow

You can disable a model for serving from its registered model page.

  1. Click the Serving tab. If the model is enabled for serving, the Disable Serving button appears.

  2. Click Disable Serving.

Debug your model endpoint

Note

You can only debug your model endpoint through the UI.

You can debug and troubleshoot your endpoint by viewing the model logs on the endpoint version’s tab in the Databricks Machine Learning UI. Logs for all replicas of the model are merged in the All Replicas tab.

In addition to the model’s logs, you can view significant serving events pertaining to the model in the Model Events tab.

Core API objects

This section contains design patterns and syntax for Serverless Real-Time Inference’s core API objects.

Important

API definitions are subject to change during the public preview.

Workload configuration

WorkloadConfigSpec describes the configuration used to scale the compute for a particular stage.

  "WorkloadConfigSpec":
  {
   "workload_size_id": "Small|Medium|Large",
   "scale_to_zero_enabled": false
  }

ComputeConfig represents the configuration used to scale the compute for a particular stage along with accompanying metadata.

In the following, populate workload_spec by replacing "WorkloadConfigSpec" with the previously defined properties of your WorkloadConfigSpec object.

  "ComputeConfig":
  {
   "stage":  "Staging|Production",
   "creation_timestamp": 12345678,
   "user_id": "first.last@databricks.com",
   "workload_spec": {"WorkloadConfigSpec"}
  }

Endpoint status

The health of an endpoint reflects whether any of the stages can be scored or have resources generated for particular versions of the model.

In the following EndpointStatus object, populate compute_config by reusing the previously defined properties of your ComputeConfig object and any other properties as an array.

  "EndpointStatus":
  {
   "registered_model_name": "ElasticNet",
   "state": "PENDING|READY|FAILED",
   "state_message": "State message",
   "compute_config": ["ComputeConfig and additional properties as an array"]
  }

Endpoint version status

An endpoint version has a particular URI that can be queried. The URI represents a single model version which is being served and whose compute is configured by the compute configurations set for its stage.

In the following EndpointVersionStatus object, populate the config field inside both service_status and config_update_status by replacing "ComputeConfig" with the previously defined properties of your ComputeConfig object.

  "EndpointVersionStatus":
  {
   "registered_model_name": "ElasticNet",
   "endpoint_version_name": "1",
   "service_status": {
      "state": "SERVICE_STATE_UNKNOWN|SERVICE_STATE_PENDING|SERVICE_STATE_READY|SERVICE_STATE_FAILED",
      "message": "Ready",
      "config": {"ComputeConfig"}
   },
   "config_update_status": {
      "state": "SERVICE_STATE_UNKNOWN|SERVICE_STATE_PENDING|SERVICE_STATE_READY|SERVICE_STATE_FAILED",
      "message": "Pending",
      "config": {"ComputeConfig"}
   }
  }

Notebook examples

The following notebooks include different models that you can use to get up and running with Serverless Real-Time Inference endpoints. The model examples can be imported into the workspace by following the directions in Import a notebook. After you choose and create a model from one of the examples, register it in the MLflow Model Registry, and then follow the UI workflow steps for model serving.

Train and register a scikit-learn model for model serving notebook

Open notebook in new tab

Train and register a Pytorch model for model serving notebook

Open notebook in new tab

Host multiple models in an endpoint notebook

Open notebook in new tab

Anaconda licensing update

The following notice is for customers relying on Anaconda.

Important

Anaconda Inc. updated their terms of service for anaconda.org channels. Based on the new terms of service you may require a commercial license if you rely on Anaconda’s packaging and distribution. See Anaconda Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.

MLflow models logged before v1.18 (Databricks Runtime 8.3 ML or earlier) were by default logged with the conda defaults channel (https://repo.anaconda.com/pkgs/) as a dependency. Because of this license change, Databricks has stopped the use of the defaults channel for models logged using MLflow v1.18 and above. The default channel logged is now conda-forge, which points at the community managed https://conda-forge.org/.

If you logged a model before MLflow v1.18 without excluding the defaults channel from the conda environment for the model, that model may have a dependency on the defaults channel that you may not have intended. To manually confirm whether a model has this dependency, you can examine the channels value in the conda.yaml file that is packaged with the logged model. For example, a model’s conda.yaml with a defaults channel dependency may look like this:

channels:
- defaults
dependencies:
- python=3.8.8
- pip
- pip:
    - mlflow
    - scikit-learn==0.23.2
    - cloudpickle==1.6.0
name: mlflow-env

Because Databricks cannot determine whether your use of the Anaconda repository to interact with your models is permitted under your relationship with Anaconda, Databricks is not forcing its customers to make any changes. If your use of the Anaconda.com repo through the use of Databricks is permitted under Anaconda’s terms, you do not need to take any action.

If you would like to change the channel used in a model’s environment, you can re-register the model to the model registry with a new conda.yaml. You can do this by specifying the channel in the conda_env parameter of log_model().
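As a sketch, such a conda environment can be built as a plain dictionary, here mirroring the example conda.yaml above with conda-forge in place of defaults; substitute your model's real dependencies:

```python
# Mirrors the example conda.yaml above, with conda-forge instead of defaults
conda_env = {
    "name": "mlflow-env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.8.8",
        "pip",
        {"pip": ["mlflow", "scikit-learn==0.23.2", "cloudpickle==1.6.0"]},
    ],
}

# Then re-log and re-register the model, for example (sketch, not run here):
# mlflow.sklearn.log_model(model, "model", conda_env=conda_env,
#                          registered_model_name="ElasticNet")
```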

For more information on the log_model() API, see the MLflow documentation for the model flavor you are working with, for example, log_model for scikit-learn.

For more information on conda.yaml files, see the MLflow documentation.