Create and manage model serving endpoints

This article describes how to create and manage model serving endpoints that utilize Databricks Model Serving.

Requirements

  • Model Serving is only available for Python-based MLflow models registered in Unity Catalog or the Workspace Model Registry. You must declare all model dependencies in the conda environment or requirements file.

    • If you don’t have a registered model, see the notebook examples for pre-packaged models you can use to get up and running with Model Serving endpoints.

  • Your workspace must be in a supported region.

  • If you use custom libraries or libraries from a private mirror server with your model, see Use custom Python libraries with Model Serving before you create the model endpoint.

Important

If you rely on Anaconda, review the terms of service notice for additional information.

Access control

To understand access control options for model serving endpoints and best practice guidance for endpoint management, see Serving endpoints access control.

Create model serving endpoints

You can create Model Serving endpoints with the Databricks Machine Learning API or the Databricks Machine Learning UI. An endpoint can serve any registered Python MLflow model in Unity Catalog or the Workspace Model Registry.

API workflow

The following example creates an endpoint that serves the first version of the ads1 model that is registered in the model registry. To specify a model from Unity Catalog, provide the full model name including the parent catalog and schema, such as catalog.schema.example-model.

Note

Databricks supports model serving for GPU workloads as a Public Preview feature.

POST /api/2.0/serving-endpoints

{
  "name": "feed-ads",
  "config":{
   "served_models": [{
     "model_name": "ads1",
     "model_version": "1",
     "workload_size": "Small",
     "scale_to_zero_enabled": true
    }]
  }
}

The following is an example response. The endpoint’s config_update state is IN_PROGRESS and the served model is in a CREATING state. The pending_config field shows the details of the update that is in progress.

{
  "name": "feed-ads",
  "creator": "customer@example.com",
  "creation_timestamp": 1666829055000,
  "last_updated_timestamp": 1666829055000,
  "state": {
    "ready": "NOT_READY",
    "config_update": "IN_PROGRESS"
  },
  "pending_config": {
    "start_time": 1666718879000,
    "served_models": [{
      "name": "ads1-1",
      "model_name": "ads1",
      "model_version": "1",
      "workload_size": "Small",
      "scale_to_zero_enabled": true,
      "state": {
        "deployment": "DEPLOYMENT_CREATING",
        "deployment_state_message": "Creating"
      },
      "creator": "customer@example.com",
      "creation_timestamp": 1666829055000
    }],
    "config_version": 1,
    "traffic_config": {
      "routes": [
        {
          "served_model_name": "ads1-1",
          "traffic_percentage": 100
        }
      ]
    }
  },
  "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "permission_level": "CAN_MANAGE"
}
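
For example, you can send this request from Python with the requests library. This is a minimal sketch, not part of the API itself; it assumes your workspace URL and a personal access token are available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.

import os
import requests

# Workspace URL and personal access token are assumed to be set in the environment.
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Endpoint configuration matching the request body shown above.
endpoint_config = {
    "name": "feed-ads",
    "config": {
        "served_models": [{
            "model_name": "ads1",
            "model_version": "1",
            "workload_size": "Small",
            "scale_to_zero_enabled": True
        }]
    }
}

response = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config,
)
response.raise_for_status()
print(response.json())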

UI workflow

You can create an endpoint for model serving with the Serving UI.

  1. Click Serving in the sidebar to display the Serving UI.

  2. Click Create serving endpoint.

    Model serving pane in Databricks UI
  3. In the Serving endpoint name field, provide a name for your endpoint.

  4. In the Edit configuration section:

    1. Select whether the model you want to serve is currently in the Workspace Model Registry or Unity Catalog.

    2. Select which model and model version you want to serve.

    3. Click Confirm.

  5. Select the compute size to use.

    Note

    Databricks supports model serving for GPU workloads as a Public Preview feature.

  6. Specify if the endpoint should scale to zero when not in use, and the percentage of traffic to route to a served model.

  7. Click Create serving endpoint. The Serving endpoints page appears with Serving endpoint state shown as Not Ready.

    Create a model serving endpoint

You can also access the Serving UI to create an endpoint from the registered model page in the Databricks Machine Learning UI.

  1. Select the model you want to serve.

  2. Click the Use model for inference button.

  3. Select the Real-time tab.

  4. Select the model version and provide an endpoint name.

  5. Select the compute size for your endpoint, and specify if your endpoint should scale to zero when not in use.

    Note

    Databricks supports model serving for GPU workloads as a Public Preview feature.

  6. Click Create serving endpoint. The Serving endpoints page appears with Serving endpoint state shown as Not Ready. After a few minutes, Serving endpoint state changes to Ready.

Modify the compute configuration of an endpoint

After you enable a model endpoint, you can modify its compute configuration as needed with the API or the UI. Modifying the configuration is particularly helpful if you need additional resources for your model; workload size and compute configuration play a key role in the resources allocated for serving your model.

Until the new configuration is ready, the old configuration keeps serving prediction traffic. While there is an update in progress, another update cannot be made.

API workflow

Note

Databricks supports model serving for GPU workloads as a Public Preview feature.

PUT /api/2.0/serving-endpoints/{name}/config

{
  "served_models": [{
    "model_name": "ads1",
    "model_version": "2",
    "workload_size": "Small",
    "scale_to_zero_enabled": true
  }]
}

The following is a response example:

{
  "name": "feed-ads",
  "creator": "cuastomer@example.com",
  "creation_timestamp": 1666829055000,
  "last_updated_timestamp": 1666946600000,
  "state": {
    "ready": true,
    "update_state": "IN_PROGRESS"
  },
  "config": {
    "served_models": [
      {
        "name": "ads1-1",
        "model_name": "ads1",
        "model_version": "1",
        "workload_size": "Small",
        "scale_to_zero_enabled": true,
        "state": {
          "deployment": "DEPLOYMENT_READY",
          "deployment_state_message": ""
        },
        "creator": "customer@example.com",
        "creation_timestamp": 1666887851000
      }
    ],
    "traffic_config": {
      "routes": [
        {
          "served_model_name": "ads1-1",
          "traffic_percentage": 100
        }
      ]
    },
    "config_version": 2
  },
  "pending_update": {
    "start_time": 1666946600000,
    "served_models": [
      {
        "name": "ads1-2",
        "model_name": "ads1",
        "model_version": "2",
        "workload_size": "Small",
        "scale_to_zero_enabled": true,
        "state": {
          "deployment": "DEPLOYMENT_CREATING",
          "deployment_state_message": "Created"
        },
        "creator": "customer@example.com",
        "creation_timestamp": 1666946600000
      }
    ],
     "traffic_config": {
      "routes": [
        {
          "served_model_name": "ads1-2",
          "traffic_percentage": 100
        }
      ]
    }
    "config_version": 3
  },
  "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "permission_level": "CAN_MANAGE"
}
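
As a minimal sketch, the same configuration update can be issued from Python; the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables and the endpoint name below are illustrative assumptions.

import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
endpoint_name = "feed-ads"

# New compute configuration; the old configuration keeps serving traffic
# until this update is ready.
new_config = {
    "served_models": [{
        "model_name": "ads1",
        "model_version": "2",
        "workload_size": "Small",
        "scale_to_zero_enabled": True
    }]
}

response = requests.put(
    f"{host}/api/2.0/serving-endpoints/{endpoint_name}/config",
    headers={"Authorization": f"Bearer {token}"},
    json=new_config,
)
response.raise_for_status()
print(response.json())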

UI workflow

After you enable a model endpoint, select Edit configuration to modify the compute configuration of your endpoint.

You can do the following:

  • Choose from several workload sizes; autoscaling is automatically configured within the selected workload size.

  • Specify if your endpoint should scale down to zero when not in use.

  • Modify the percent of traffic to route to your served model.

GPU workload types

Preview

This feature is in Public Preview.

Model serving on Databricks supports GPU deployment of PyTorch and TensorFlow models, as well as models logged with mlflow.pyfunc, mlflow.pytorch, mlflow.tensorflow, and mlflow.transformers flavors.

This preview capability is compatible with the following package versions:

  • PyTorch 1.13.0 - 2.0.1

  • TensorFlow 2.5.0 - 2.13.0

  • MLflow 2.4.0 and above

To deploy your models using GPUs, include the workload_type field in your endpoint configuration during endpoint creation or as an endpoint configuration update using the API. To configure your endpoint for GPU workloads with the Serving UI, select the desired GPU type from the Compute dropdown.

{
  "served_models": [{
    "model_name": "ads1",
    "model_version": "2",
    "workload_type": "GPU_MEDIUM",
    "workload_size": "Small",
    "scale_to_zero_enabled": false
  }]
}

The following table summarizes the available GPU workload types supported during the Public Preview.

Note

The concurrency available per GPU depends on both the model’s size and its computational complexity.

GPU workload type | GPU instance | GPU memory
GPU_SMALL         | 1xT4         | 16GB
GPU_MEDIUM        | 1xA10G       | 24GB
GPU_MEDIUM_4      | 4xA10G       | 96GB
GPU_MEDIUM_8      | 8xA10G       | 192GB
GPU_LARGE_8       | A10G         | 320GB

Limitations

The following are limitations for serving endpoints with GPU workloads during Public Preview:

  • Container image creation for GPU serving takes longer than image creation for CPU serving due to model size and increased installation requirements for models served on GPU.

  • When deploying very large models, the deployment process might time out if the container build and model deployment take longer than 60 minutes. If this occurs, retry the process; the model should then deploy successfully.

  • Autoscaling for GPU serving takes longer than for CPU serving.

  • Endpoints configured with GPU workloads do not support scale to zero.

  • This functionality is available in the following regions:

    • ap-southeast-2

    • ca-central-1

    • eu-central-1

    • eu-west-1

    • eu-west-2

    • us-east-1

    • us-east-2

    • us-west-2

Scoring a model endpoint

To score your model, you can send requests to the model serving endpoint. See Send scoring requests to serving endpoints to learn about recommendations and accepted formats.
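For illustration, a minimal scoring sketch from Python might look like the following. It assumes the endpoint accepts the dataframe_records input format described in Send scoring requests to serving endpoints; the host, token, endpoint name, and feature names are placeholders.

import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Example payload in the dataframe_records format; the column names are
# placeholders for your model's actual input schema.
payload = {"dataframe_records": [{"feature_1": 1.0, "feature_2": 0.5}]}

response = requests.post(
    f"{host}/serving-endpoints/feed-ads/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(response.json())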

Get the status of the model endpoint

API workflow

You can check the status of an endpoint with the following:

GET /api/2.0/serving-endpoints/{name}

In the following example response, the state.ready field is “READY”, which means the endpoint is ready to receive traffic. The state.config_update field is NOT_UPDATING, and pending_config is no longer returned because the update finished successfully.

{
  "name": "feed-ads",
  "creator": "customer@example.com",
  "creation_timestamp": 1666829055000,
  "last_updated_timestamp": 1666829055000,
  "state": {
    "ready": "READY",
    "update_state": "NOT_UPDATING"
  },
  "config": {
    "served_models": [
      {
        "name": "ads1-1",
        "model_name": "ads1",
        "model_version": "1",
        "workload_size": "Small",
        "scale_to_zero_enabled": false,
        "state": {
          "deployment": "DEPLOYMENT_READY",
          "deployment_state_message": ""
        },
        "creator": "customer@example.com",
        "creation_timestamp": 1666829055000
      }
    ],
    "traffic_config": {
      "routes": [
        {
          "served_model_name": "ads1-1",
          "traffic_percentage": 100
        }
      ]
    },
    "config_version": 1
  },
  "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "permission_level": "CAN_MANAGE"
}
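
As a small sketch, you can poll this endpoint from Python until it reports READY; the host, token, and endpoint name are assumptions for illustration.

import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
endpoint_name = "feed-ads"

# Poll until state.ready reports READY.
while True:
    response = requests.get(
        f"{host}/api/2.0/serving-endpoints/{endpoint_name}",
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    state = response.json()["state"]
    if state["ready"] == "READY":
        break
    time.sleep(30)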

UI workflow

In the UI, you can check the status of an endpoint from the Serving endpoint state indicator at the top of your endpoint’s details page.

Delete a model serving endpoint

To disable serving for a model, you can delete the endpoint it’s served on.

API workflow

To delete a serving endpoint for a model, use the following:

DELETE /api/2.0/serving-endpoints/{name}
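
From Python, a minimal sketch of the same call (the host and token environment variables and the endpoint name are assumptions):

import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Deleting the endpoint disables serving for all models served on it.
response = requests.delete(
    f"{host}/api/2.0/serving-endpoints/feed-ads",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()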

UI workflow

You can delete an endpoint from your endpoint’s details page.

  1. Click Serving on the sidebar.

  2. Click the endpoint you want to delete.

  3. Click the kebab menu at the top and select Delete.

Debug your model serving endpoint

To debug any issues with the endpoint, you can fetch:

  • Model server container build logs

  • Model server logs

These logs are also accessible from the Endpoints UI in the Logs tab.

To fetch the build logs for a served model, use the following request:

GET /api/2.0/serving-endpoints/{name}/served-models/{served-model-name}/build-logs

{
  "config_version": 1  // optional
}

To fetch the model server logs for a served model, use the following request:

GET /api/2.0/serving-endpoints/{name}/served-models/{served-model-name}/logs

{
  "config_version": 1  // optional
}
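
As a rough sketch, both log types can be fetched from Python. The served model name below follows the <model_name>-<version> pattern shown in the earlier responses; it, the endpoint name, and the environment variables are assumptions for illustration.

import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}
endpoint_name = "feed-ads"
served_model_name = "ads1-1"

# Container build logs for the served model.
build_logs = requests.get(
    f"{host}/api/2.0/serving-endpoints/{endpoint_name}/served-models/{served_model_name}/build-logs",
    headers=headers,
)
print(build_logs.json())

# Model server logs for the served model.
server_logs = requests.get(
    f"{host}/api/2.0/serving-endpoints/{endpoint_name}/served-models/{served_model_name}/logs",
    headers=headers,
)
print(server_logs.json())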

Notebook examples

The following notebooks include different models that you can use to get up and running with Model Serving endpoints. The model examples can be imported into the workspace by following the directions in Import a notebook. After you choose and create a model from one of the examples, register it in the MLflow Model Registry, and then follow the UI workflow steps for model serving.

Train and register a scikit-learn model for model serving notebook

Open notebook in new tab

Train and register a Pytorch model for model serving notebook

Open notebook in new tab

Train and register a HuggingFace model for model serving notebook

Open notebook in new tab

Serve a SparkML model notebook

Open notebook in new tab

The following notebook example demonstrates how to create and manage a model serving endpoint using Python.

Create and manage a serving endpoint with a Python notebook

Open notebook in new tab

Anaconda licensing update

The following notice is for customers relying on Anaconda.

Important

Anaconda Inc. updated their terms of service for anaconda.org channels. Based on the new terms of service, you may require a commercial license if you rely on Anaconda’s packaging and distribution. See Anaconda Commercial Edition FAQ for more information. Your use of any Anaconda channels is governed by their terms of service.

MLflow models logged before v1.18 (Databricks Runtime 8.3 ML or earlier) were by default logged with the conda defaults channel (https://repo.anaconda.com/pkgs/) as a dependency. Because of this license change, Databricks has stopped the use of the defaults channel for models logged using MLflow v1.18 and above. The default channel logged is now conda-forge, which points at the community-managed https://conda-forge.org/.

If you logged a model before MLflow v1.18 without excluding the defaults channel from the conda environment for the model, that model may have a dependency on the defaults channel that you did not intend. To manually confirm whether a model has this dependency, you can examine the channels value in the conda.yaml file that is packaged with the logged model. For example, a model’s conda.yaml with a defaults channel dependency may look like this:

channels:
- defaults
dependencies:
- python=3.8.8
- pip
- pip:
  - mlflow
  - scikit-learn==0.23.2
  - cloudpickle==1.6.0
name: mlflow-env

Because Databricks cannot determine whether your use of the Anaconda repository to interact with your models is permitted under your relationship with Anaconda, Databricks is not requiring its customers to make any changes. If your use of the Anaconda.com repo through the use of Databricks is permitted under Anaconda’s terms, you do not need to take any action.

If you would like to change the channel used in a model’s environment, you can re-register the model to the model registry with a new conda.yaml. You can do this by specifying the channel in the conda_env parameter of log_model().
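For example, a minimal sketch with scikit-learn follows; the placeholder model object, registered model name, and channel list are illustrative assumptions, not prescribed values.

import mlflow
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()  # placeholder model for illustration

# conda environment that uses conda-forge instead of the defaults channel
conda_env = {
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.8.8",
        "pip",
        {"pip": ["mlflow", "scikit-learn==0.23.2", "cloudpickle==1.6.0"]},
    ],
    "name": "mlflow-env",
}

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        conda_env=conda_env,
        registered_model_name="ads1",
    )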

For more information on the log_model() API, see the MLflow documentation for the model flavor you are working with, for example, log_model for scikit-learn.

For more information on conda.yaml files, see the MLflow documentation.