Create and manage serving endpoints using MLflow

This article describes how to create and manage model serving endpoints using the MLflow Deployments API.

This article includes examples for creating endpoints that serve:

  • Foundation models, such as external models hosted outside of Databricks (see External models in Databricks Model Serving).

  • Custom models. MLflow models that are registered in Unity Catalog or in the workspace model registry.

MLflow Deployments provides APIs for creating, updating, and deleting serving endpoints. These APIs accept the same parameters as the REST API for serving endpoints. See POST /api/2.0/serving-endpoints for endpoint configuration parameters.

Requirements

  • Databricks Runtime 13.0 ML or above.

  • MLflow 2.9 or above. To install it, run %pip install mlflow[genai]>=2.9.0.

  • Your workspace must be in a supported region.

  • The MLflow Deployments client. To create it, run:

    import mlflow.deployments
    
    client = mlflow.deployments.get_deploy_client("databricks")
    

Access control

To understand access control options for model serving endpoints and best practice guidance for endpoint management, see Serving endpoints access control.

Create a custom model endpoint
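
The following example creates an endpoint that serves version 3 of the registered model my-ads-model and routes all traffic to it: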


from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.create_endpoint(
    name="workspace-model-endpoint",
    config={
        "served_entities": [
            {
                "entity_name": "my-ads-model",
                "entity_version": "3",
                "workload_size": "Small",
                "scale_to_zero_enabled": true
            }
        ],
        "traffic_config": {
            "routes": [
                {
                    "served_model_name": "my-ads-model-3",
                    "traffic_percentage": 100
                }
            ]
        }
    }
)

GPU workload types

Preview

This feature is in Public Preview.

Model serving on Databricks supports GPU deployment of PyTorch and TensorFlow models, as well as models logged with mlflow.pyfunc, mlflow.pytorch, mlflow.tensorflow, and mlflow.transformers flavors.

GPU deployment also automatically supports optimized model serving for large language models. See Provisioned throughput Foundation Model APIs.

This preview capability is compatible with the following package versions:

  • PyTorch 1.13.0 - 2.0.1

  • TensorFlow 2.5.0 - 2.13.0

  • MLflow 2.4.0 and above

To deploy your models using GPUs, include the workload_type field in your endpoint configuration during endpoint creation or as an endpoint configuration update.

See GPU workload types (/machine-learning/model-serving/create-manage-serving-endpoints.md#gpu) for supported GPU sizes.
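
The following is a minimal sketch of a GPU endpoint configuration. The endpoint name, model name, and the GPU_SMALL workload type are placeholder assumptions; substitute the GPU size and model appropriate for your workload.

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.create_endpoint(
    name="gpu-model-endpoint",  # hypothetical endpoint name
    config={
        "served_entities": [
            {
                "entity_name": "my-gpu-model",  # hypothetical registered model
                "entity_version": "1",
                "workload_type": "GPU_SMALL",  # assumed GPU size; see supported GPU sizes above
                "workload_size": "Small",
                "scale_to_zero_enabled": False
            }
        ]
    }
)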

Create a foundation model endpoint

The following creates an endpoint that serves the OpenAI gpt-4 model for chat tasks.

For foundation model endpoints, you must provide API keys for the model provider you want to use. See POST /api/2.0/serving-endpoints in the REST API for request and response schema details.

You can also create endpoints for completions and embeddings tasks, as specified by the task field in the external_model section of the configuration. See External models in Databricks Model Serving for supported models and providers for each task.


from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.create_endpoint(
    name="chat",
    config={
        "served_entities": [
            {
                "name": "test",
                "external_model": {
                    "name": "gpt-4",
                    "provider": "openai",
                    "task": "llm/v1/chat",
                    "openai_config": {
                        "openai_api_key": "{{secrets/scope/key}}",
                    },
                },
            }
        ],
    },
)
assert endpoint == {
    "name": "chat",
    "creator": "alice@company.com",
    "creation_timestamp": 0,
    "last_updated_timestamp": 0,
    "state": {...},
    "config": {...},
    "tags": [...],
    "id": "88fd3f75a0d24b0380ddc40484d7a31b",
}
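
An embeddings endpoint follows the same pattern. The following sketch assumes the OpenAI text-embedding-ada-002 model and the llm/v1/embeddings task value; the endpoint and served entity names are placeholders.

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.create_endpoint(
    name="embeddings",  # hypothetical endpoint name
    config={
        "served_entities": [
            {
                "name": "embeddings-entity",  # hypothetical served entity name
                "external_model": {
                    "name": "text-embedding-ada-002",
                    "provider": "openai",
                    "task": "llm/v1/embeddings",  # assumed task value for embeddings
                    "openai_config": {
                        "openai_api_key": "{{secrets/scope/key}}",
                    },
                },
            }
        ],
    },
)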

Update a custom model endpoint

To update your custom model endpoint, use the following. See PUT /api/2.0/serving-endpoints/{name}/config for request and response schema details.

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.update_endpoint(
    name="workspace-model-endpoint",
    config={
        "served_entities": [
            {
                "entity_name": "my-ads-model",
                "entity_version": "3",
                "workload_size": "Small",
                "scale_to_zero_enabled": true
            }
        ]
    }
)
assert endpoint == {
    "name": "chat",
    "creator": "alice@company.com",
    "creation_timestamp": 0,
    "last_updated_timestamp": 0,
    "state": {...},
    "config": {...},
    "tags": [...],
    "id": "88fd3f75a0d24b0380ddc40484d7a31b",
}

Update a foundation model endpoint

To update your foundation model endpoint, use the following. See the REST API update configuration documentation for request and response schema details.

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.update_endpoint(
    endpoint="chat",
    config={
        "served_entities": [
            {
                "name": "test",
                "external_model": {
                    "name": "gpt-4",
                    "provider": "openai",
                    "task": "llm/v1/chat",
                    "openai_config": {
                        "openai_api_key": "{{secrets/scope/key}}",
                    },
                },
            }
        ],
    },
)
assert endpoint == {
    "name": "chat",
    "creator": "alice@company.com",
    "creation_timestamp": 0,
    "last_updated_timestamp": 0,
    "state": {...},
    "config": {...},
    "tags": [...],
    "id": "88fd3f75a0d24b0380ddc40484d7a31b",
}
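
You can also update rate limits for a foundation model endpoint with the same update_endpoint call. The following example limits each user to 10 calls per minute: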

rate_limits = client.update_endpoint(
    endpoint="chat",
    config={
        "rate_limits": [
            {
                "key": "user",
                "renewal_period": "minute",
                "calls": 10,
            }
        ],
    },
)
assert rate_limits == {
    "rate_limits": [
        {
            "key": "user",
            "renewal_period": "minute",
            "calls": 10,
        }
    ],
}

Get the status of the model endpoint

You can get the status and details of a specified endpoint using the following:

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
endpoint = client.get_endpoint(endpoint="chat")
assert endpoint == {
    "name": "chat",
    "creator": "alice@company.com",
    "creation_timestamp": 0,
    "last_updated_timestamp": 0,
    "state": {...},
    "config": {...},
    "tags": [...],
    "id": "88fd3f75a0d24b0380ddc40484d7a31b",
}

Delete a model serving endpoint

You can delete a serving endpoint for a model using the following:

from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
client.delete_endpoint(endpoint="chat")