Serve multiple models to a Model Serving endpoint

This article describes how to serve multiple models from a single serving endpoint using Databricks Model Serving.

Requirements

See Requirements for Model Serving endpoint creation.

To understand access control options for model serving endpoints and best practice guidance for endpoint management, see Serving endpoint ACLs.

Create an endpoint and set the initial traffic split

You can create Model Serving endpoints with the Databricks Machine Learning API. An endpoint can serve any Python MLflow model registered in the Model Registry.

The following API example creates a single endpoint with two models and sets the endpoint traffic split between those models. The served model, current, hosts version 1 of model-A and gets 90% of the endpoint traffic, while the other served model, challenger, hosts version 1 of model-B and gets 10% of the endpoint traffic.

POST /api/2.0/serving-endpoints

{
   "name":"multi-model"
   "config":{
      "served_entities":[
         {
            "name":"current",
            "entity_name":"model-A",
            "entity_version":"1",
            "workload_size":"Small",
            "scale_to_zero_enabled":true
         },
         {
            "name":"challenger",
            "entity_name":"model-B",
            "entity_version":"1",
            "workload_size":"Small",
            "scale_to_zero_enabled":true
         }
      ],
      "traffic_config":{
         "routes":[
            {
               "served_model_name":"current",
               "traffic_percentage":"90"
            },
            {
               "served_model_name":"challenger",
               "traffic_percentage":"10"
            }
         ]
      }
   }
}
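
The following Python sketch sends the same request with the requests library. It is an illustrative example rather than the only way to call the API: the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are assumptions, so substitute however you store your workspace URL and personal access token.

import os
import requests

# Assumed environment variables for the workspace URL and personal access token.
host = os.environ["DATABRICKS_HOST"].rstrip("/")
token = os.environ["DATABRICKS_TOKEN"]

# Same payload as the JSON body shown above.
endpoint_config = {
    "name": "multi-model",
    "config": {
        "served_entities": [
            {"name": "current", "entity_name": "model-A", "entity_version": "1",
             "workload_size": "Small", "scale_to_zero_enabled": True},
            {"name": "challenger", "entity_name": "model-B", "entity_version": "1",
             "workload_size": "Small", "scale_to_zero_enabled": True},
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "current", "traffic_percentage": 90},
                {"served_model_name": "challenger", "traffic_percentage": 10},
            ]
        },
    },
}

# Create the endpoint that serves both models with a 90/10 traffic split.
response = requests.post(
    f"{host}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {token}"},
    json=endpoint_config,
)
response.raise_for_status()
print(response.json())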

Update the traffic split between served models

You can also update the traffic split between served models. The following API example sets the served model, current, to get 50% of the endpoint traffic and the other model, challenger, to get the remaining 50% of the traffic.

You can also make this update from the Serving tab in the Databricks Machine Learning UI using the Edit configuration button.

PUT /api/2.0/serving-endpoints/{name}/config

{
   "served_entities":[
      {
         "name":"current",
         "entity_name":"model-A",
         "entity_version":"1",
         "workload_size":"Small",
         "scale_to_zero_enabled":true
      },
      {
         "name":"challenger",
         "entity_name":"model-B",
         "entity_version":"1",
         "workload_size":"Small",
         "scale_to_zero_enabled":true
      }
   ],
   "traffic_config":{
      "routes":[
         {
            "served_model_name":"current",
            "traffic_percentage":"50"
         },
         {
            "served_model_name":"challenger",
            "traffic_percentage":"50"
         }
      ]
   }
}
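
If you prefer to script the update, the following minimal Python sketch issues the same PUT request with the requests library, under the same DATABRICKS_HOST and DATABRICKS_TOKEN assumptions as the creation example above.

import os
import requests

host = os.environ["DATABRICKS_HOST"].rstrip("/")
token = os.environ["DATABRICKS_TOKEN"]

# Same body as the PUT example above: both served entities plus a 50/50 split.
updated_config = {
    "served_entities": [
        {"name": "current", "entity_name": "model-A", "entity_version": "1",
         "workload_size": "Small", "scale_to_zero_enabled": True},
        {"name": "challenger", "entity_name": "model-B", "entity_version": "1",
         "workload_size": "Small", "scale_to_zero_enabled": True},
    ],
    "traffic_config": {
        "routes": [
            {"served_model_name": "current", "traffic_percentage": 50},
            {"served_model_name": "challenger", "traffic_percentage": 50},
        ]
    },
}

# Update the configuration of the existing endpoint.
response = requests.put(
    f"{host}/api/2.0/serving-endpoints/multi-model/config",
    headers={"Authorization": f"Bearer {token}"},
    json=updated_config,
)
response.raise_for_status()
print(response.json())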

Query individual models behind an endpoint

In some scenarios, you may want to query individual models behind the endpoint.

You can do so by using:

POST /serving-endpoints/{endpoint-name}/served-models/{served-model-name}/invocations

This request queries the specified served model directly. The request format is the same as when querying the endpoint, but traffic settings are ignored, so every request goes to the served model you name.

In the context of the multi-model endpoint example, if all requests are sent to /serving-endpoints/multi-model/served-models/challenger/invocations, then all requests are served by the challenger served model.
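
As a sketch, the following Python snippet scores against the challenger served model directly. The dataframe_records payload and the feature names in it are illustrative assumptions; send whatever input format and schema your model actually expects.

import os
import requests

host = os.environ["DATABRICKS_HOST"].rstrip("/")
token = os.environ["DATABRICKS_TOKEN"]

# Hypothetical input row; replace with the schema your model expects.
payload = {"dataframe_records": [{"feature_1": 1.0, "feature_2": "a"}]}

# Query the challenger served model directly; the traffic split is bypassed.
response = requests.post(
    f"{host}/serving-endpoints/multi-model/served-models/challenger/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
response.raise_for_status()
print(response.json())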

Notebook: Package multiple models into one model

To save on compute costs, you can also package multiple models into one model.

Package multiple MLflow models into one MLflow model notebook

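The notebook walks through the complete workflow. As a rough outline of the pattern, you can wrap several logged models in a custom mlflow.pyfunc.PythonModel that loads them as artifacts and routes each request to one of them. The artifact names, routing column, and model URIs below are illustrative assumptions, not the notebook's exact code.

import mlflow
import mlflow.pyfunc


class MultiModel(mlflow.pyfunc.PythonModel):
    """Wraps two MLflow models behind a single pyfunc model."""

    def load_context(self, context):
        # Load the packaged models from the artifacts logged below.
        self.model_a = mlflow.pyfunc.load_model(context.artifacts["model_a"])
        self.model_b = mlflow.pyfunc.load_model(context.artifacts["model_b"])

    def predict(self, context, model_input):
        # Hypothetical routing rule: a "model" column selects which model scores the rows.
        if model_input["model"].iloc[0] == "model-B":
            return self.model_b.predict(model_input.drop(columns=["model"]))
        return self.model_a.predict(model_input.drop(columns=["model"]))


with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="multi_model",
        python_model=MultiModel(),
        # Assumed registered model URIs; point these at your own models.
        artifacts={
            "model_a": "models:/model-A/1",
            "model_b": "models:/model-B/1",
        },
        # Assumed name for the combined model.
        registered_model_name="multi-model-packaged",
    )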