Create custom model serving endpoints

This article describes how to create model serving endpoints that serve custom models using Databricks Model Serving.

Model Serving provides the following options for serving endpoint creation:

  • The Serving UI
  • REST API
  • MLflow Deployments SDK

For creating endpoints that serve generative AI models, see Create foundation model serving endpoints.

Requirements

  • Your workspace must be in a supported region.
  • If you use custom libraries or libraries from a private mirror server with your model, see Use custom Python libraries with Model Serving before you create the model endpoint.
  • To create endpoints using the MLflow Deployments SDK, install MLflow, which includes the Deployments SDK, and then create a deployment client:

Python
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

Access control

To understand the access control options for managing model serving endpoints, see Manage permissions on your model serving endpoint.


Create an endpoint

MLflow Deployments provides APIs for create, update, and delete tasks. These APIs accept the same parameters as the REST API for serving endpoints. See POST /api/2.0/serving-endpoints for endpoint configuration parameters.

The following example creates an endpoint that serves the third version of the my-ads-model model registered in the Unity Catalog model registry. You must provide the full model name, including the parent catalog and schema, such as catalog.schema.example-model.

Python

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.create_endpoint(
    name="unity-catalog-model-endpoint",
    config={
        "served_entities": [
            {
                "name": "ads-entity",
                "entity_name": "catalog.schema.my-ads-model",
                "entity_version": "3",
                "workload_size": "Small",
                "scale_to_zero_enabled": True
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": "my-ads-model-3",
                    "traffic_percentage": 100
                }
            ]
        }
    }
)
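Because the SDK also covers the delete task mentioned above, tearing down an endpoint is a single call. A minimal sketch, reusing the endpoint name from the example above:

Python

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Delete the serving endpoint when it is no longer needed.
client.delete_endpoint(endpoint="unity-catalog-model-endpoint")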


GPU workload types

GPU deployment is compatible with the following package versions:

  • PyTorch 1.13.0 - 2.0.1
  • TensorFlow 2.5.0 - 2.13.0
  • MLflow 2.4.0 and above

To deploy your models using GPUs, include the workload_type field in your endpoint configuration during endpoint creation, or add it later as an endpoint configuration update using the API. To configure your endpoint for GPU workloads with the Serving UI, select the desired GPU type from the Compute Type dropdown.

JSON

{
  "served_entities": [{
    "entity_name": "catalog.schema.ads1",
    "entity_version": "2",
    "workload_type": "GPU_MEDIUM",
    "workload_size": "Small",
    "scale_to_zero_enabled": false
  }]
}
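The same configuration can be applied at creation time with the MLflow Deployments SDK. A sketch, where ads1-gpu-endpoint is a hypothetical endpoint name:

Python

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# workload_type selects the GPU type; "ads1-gpu-endpoint" is a placeholder name.
endpoint = client.create_endpoint(
    name="ads1-gpu-endpoint",
    config={
        "served_entities": [{
            "entity_name": "catalog.schema.ads1",
            "entity_version": "2",
            "workload_type": "GPU_MEDIUM",
            "workload_size": "Small",
            "scale_to_zero_enabled": False
        }]
    }
)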

The following table summarizes the available GPU workload types.

GPU workload type | GPU instance | GPU memory
--- | --- | ---
GPU_SMALL | 1xT4 | 16GB
GPU_MEDIUM | 1xA10G | 24GB
MULTIGPU_MEDIUM | 4xA10G | 96GB
GPU_MEDIUM_8 | 8xA10G | 192GB

Modify a custom model endpoint

After you enable a custom model endpoint, you can update the compute configuration as needed. This is particularly helpful if you need additional resources for your model. Workload size and compute configuration play a key role in what resources are allocated for serving your model.

Until the new configuration is ready, the old configuration keeps serving prediction traffic. While an update is in progress, another update cannot be made. However, you can cancel an in-progress update from the Serving UI.

The MLflow Deployments SDK uses the same parameters as the REST API; see PUT /api/2.0/serving-endpoints/{name}/config for request and response schema details.

The following code sample updates an endpoint to serve version 1 of a model from the Unity Catalog model registry:

Python
import mlflow
from mlflow.deployments import get_deploy_client

mlflow.set_registry_uri("databricks-uc")
client = get_deploy_client("databricks")

# Fill in the name of your endpoint and the registered model it serves.
endpoint_name = "unity-catalog-model-endpoint"
catalog = "catalog"
schema = "schema"
model_name = "my-ads-model"

endpoint = client.update_endpoint(
    endpoint=endpoint_name,
    config={
        "served_entities": [
            {
                "entity_name": f"{catalog}.{schema}.{model_name}",
                "entity_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": f"{model_name}-1",
                    "traffic_percentage": 100
                }
            ]
        }
    }
)
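To watch an in-progress update, you can poll the endpoint. A minimal sketch, assuming the endpoint name used above; the returned description includes the endpoint's state and any pending configuration update:

Python

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Returns the endpoint description, including its state and pending config update, if any.
status = client.get_endpoint(endpoint="unity-catalog-model-endpoint")
print(status["state"])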

Score a model endpoint

To score your model, send requests to the model serving endpoint.
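For example, you can query the endpoint through the MLflow Deployments SDK. A minimal sketch, assuming a tabular model served at the endpoint created above; the feature names are hypothetical, and the inputs format must match your model's signature:

Python

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# "dataframe_records" is one of the accepted input formats for tabular models;
# feature_1 and feature_2 are placeholder feature names.
response = client.predict(
    endpoint="unity-catalog-model-endpoint",
    inputs={"dataframe_records": [{"feature_1": 1.0, "feature_2": 2.0}]},
)
print(response)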

Additional resources

Notebook examples

The following notebooks include different Databricks registered models that you can use to get up and running with model serving endpoints. For additional examples, see Tutorial: Deploy and query a custom model.

The model examples can be imported into the workspace by following the directions in Import a notebook. After you choose and create a model from one of the examples, register it in Unity Catalog, and then follow the UI workflow steps for model serving.
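Registering a logged model in Unity Catalog is a single call. A minimal sketch, where the run ID and three-level model name are placeholders for your own values:

Python

import mlflow

mlflow.set_registry_uri("databricks-uc")

# "<run_id>" and the model name below are placeholders.
mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="catalog.schema.my-ads-model",
)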

Train and register a scikit-learn model for model serving notebook


Train and register a HuggingFace model for model serving notebook
