Migrate to Model Serving

This article demonstrates how to enable Model Serving on your workspace and switch your models to the Mosaic AI Model Serving experience, which is built on serverless compute.

Requirements

Significant changes

  • In Model Serving, the format of the request to the endpoint and the response from the endpoint are slightly different from Legacy MLflow Model Serving. See Scoring a model endpoint for details on the new format protocol, and the query example after this list.

  • In Model Serving, the endpoint URL includes serving-endpoints instead of model.

  • Model Serving includes full support for managing resources with API workflows.

  • Model Serving is production-ready and backed by the Databricks SLA.
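
As an illustration of the new URL and scoring format, the following minimal sketch queries a Model Serving endpoint with the Python requests library. The workspace URL, token, and endpoint name are placeholders, and the feature names in the payload are hypothetical; dataframe_records is one of the input formats the new protocol accepts.

    import requests

    # Placeholders: substitute your workspace URL, access token, and endpoint name.
    WORKSPACE_URL = "https://<databricks-instance>"
    HEADERS = {"Authorization": "Bearer <databricks-personal-access-token>"}
    ENDPOINT_NAME = "my-endpoint"  # hypothetical endpoint name

    # Model Serving URLs use serving-endpoints instead of the legacy model path.
    url = f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations"

    # The new scoring protocol wraps tabular input under a named key such as
    # "dataframe_records" (a list of row dictionaries) or "dataframe_split".
    payload = {"dataframe_records": [{"feature_1": 1.0, "feature_2": 2.0}]}

    response = requests.post(url, headers=HEADERS, json=payload)
    response.raise_for_status()

    # The response body is a JSON object with a "predictions" key.
    print(response.json()["predictions"])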

Migrate Legacy MLflow Model Serving served models to Model Serving

You can create a Model Serving endpoint and flexibly transition model serving workflows without disabling Legacy MLflow Model Serving.

The following steps show how to accomplish this with the UI. For each model on which you have Legacy MLflow Model Serving enabled:

  1. Register your model to Unity Catalog, as shown in the sketch after these steps.

  2. Navigate to Serving endpoints on the sidebar of your machine learning workspace.

  3. Follow the workflow described in Create custom model serving endpoints on how to create a serving endpoint with your model.

  4. Transition your application to use the new URL provided by the serving endpoint to query the model, along with the new scoring format.

  5. When your models are transitioned over, you can navigate to Models on the sidebar of your machine learning workspace.

  6. Select the model for which you want to disable Legacy MLflow Model Serving.

  7. On the Serving tab, select Stop.

  8. A confirmation message appears. Select Stop Serving.
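
For step 1, the following is a minimal sketch of registering a model to Unity Catalog with the MLflow client. The run ID and the three-level catalog.schema.model name are hypothetical placeholders.

    import mlflow

    # Point the MLflow client at Unity Catalog instead of the workspace registry.
    mlflow.set_registry_uri("databricks-uc")

    # Hypothetical values: a run-relative model URI and a three-level UC name.
    model_uri = "runs:/<run_id>/model"
    uc_model_name = "main.default.modelA"

    # Registers the model under Unity Catalog, creating a new model version.
    mlflow.register_model(model_uri, uc_model_name)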

Migrate deployed model versions to Model Serving

In previous versions of the Model Serving functionality, the serving endpoint was created based on the stage of the registered model version: Staging or Production. To migrate your served models from that experience, you can replicate that behavior in the new Model Serving experience.

This section demonstrates how to create separate model serving endpoints for Staging model versions and Production model versions. The following steps show how to accomplish this with the serving endpoints API for each of your served models; a scripted version of the same workflow appears after the steps.

In this example, the registered model named modelA has version 1 in the Production stage and version 2 in the Staging stage.

  1. Create two endpoints for your registered model, one for Staging model versions and another for Production model versions.

    For Staging model versions:

    POST /api/2.0/serving-endpoints
      {
         "name":"modelA-Staging",
         "config":
         {
            "served_entities":
            [
               {
                  "entity_name":"modelA",
                  "entity_version":"2",  // Staging Model Version
                  "workload_size":"Small",
                  "scale_to_zero_enabled":true
               }
            ]
         }
      }
    

    For Production model versions:

    POST /api/2.0/serving-endpoints
      {
         "name":"modelA-Production",
         "config":
         {
            "served_entities":
            [
               {
                  "entity_name":"modelA",
                  "entity_version":"1",   // Production Model Version
                  "workload_size":"Small",
                  "scale_to_zero_enabled":true
               }
            ]
         }
      }
    
  2. Verify the status of the endpoints.

    For the Staging endpoint: GET /api/2.0/serving-endpoints/modelA-Staging

    For the Production endpoint: GET /api/2.0/serving-endpoints/modelA-Production

  3. Once the endpoints are ready, query them:

    For the Staging endpoint: POST /serving-endpoints/modelA-Staging/invocations

    For the Production endpoint: POST /serving-endpoints/modelA-Production/invocations

  4. Update the endpoint based on model version transitions.

    For example, when a new model version 3 is created, you can transition model version 2 to Production, transition model version 3 to Staging, and archive model version 1. Reflect these changes in the separate model serving endpoints as follows:

    For the Staging endpoint, update the endpoint to use the new model version in Staging.

    PUT /api/2.0/serving-endpoints/modelA-Staging/config
    {
       "served_entities":
       [
          {
             "entity_name":"modelA",
             "entity_version":"3",  // New Staging model version
             "workload_size":"Small",
             "scale_to_zero_enabled":true
          }
       ]
    }
    

    For the Production endpoint, update the endpoint to use the new model version in Production.

    PUT /api/2.0/serving-endpoints/modelA-Production/config
    {
       "served_entities":
       [
          {
             "entity_name":"modelA",
             "entity_version":"2",  // New Production model version
             "workload_size":"Small",
             "scale_to_zero_enabled":true
          }
       ]
    }
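
These steps can also be scripted. The following is a minimal sketch, assuming the Python requests library and a placeholder workspace URL and token, that issues the same serving endpoints API calls shown above.

    import requests

    # Placeholders: substitute your workspace URL and access token.
    WORKSPACE_URL = "https://<databricks-instance>"
    HEADERS = {"Authorization": "Bearer <databricks-personal-access-token>"}

    def create_endpoint(endpoint_name, model_name, model_version):
        """Step 1: create a serving endpoint pinned to a single model version."""
        body = {
            "name": endpoint_name,
            "config": {
                "served_entities": [{
                    "entity_name": model_name,
                    "entity_version": model_version,
                    "workload_size": "Small",
                    "scale_to_zero_enabled": True,
                }]
            },
        }
        resp = requests.post(f"{WORKSPACE_URL}/api/2.0/serving-endpoints",
                             headers=HEADERS, json=body)
        resp.raise_for_status()
        return resp.json()

    def get_endpoint(endpoint_name):
        """Step 2: fetch endpoint status to check whether it is ready."""
        resp = requests.get(
            f"{WORKSPACE_URL}/api/2.0/serving-endpoints/{endpoint_name}",
            headers=HEADERS)
        resp.raise_for_status()
        return resp.json()

    def update_endpoint(endpoint_name, model_name, model_version):
        """Step 4: point an existing endpoint at a new model version."""
        body = {
            "served_entities": [{
                "entity_name": model_name,
                "entity_version": model_version,
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }]
        }
        resp = requests.put(
            f"{WORKSPACE_URL}/api/2.0/serving-endpoints/{endpoint_name}/config",
            headers=HEADERS, json=body)
        resp.raise_for_status()
        return resp.json()

    # Mirror the example above: one endpoint per stage of modelA.
    create_endpoint("modelA-Staging", "modelA", "2")
    create_endpoint("modelA-Production", "modelA", "1")

Because each stage maps to its own endpoint, a model version transition becomes a single config update against the corresponding endpoint.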
    

Migrate MosaicML inference workflows to Model Serving

This section provides guidance on how to migrate your MosaicML inference deployments to Mosaic AI Model Serving and includes a notebook example.

The following table summarizes the parity between MosaicML inference and model serving on Databricks.

MosaicML Inference               Mosaic AI Model Serving
-------------------------------  ---------------------------------------
create_inference_deployment      Create a model serving endpoint
update_inference_deployment      Update a model serving endpoint
delete_inference_deployment      Delete a model serving endpoint
get_inference_deployment         Get status of a model serving endpoint
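
In REST terms, each MosaicML operation in the table corresponds to one call against the serving endpoints API. The following minimal sketch, assuming the Python requests library and placeholder workspace URL, token, endpoint, and model names, shows the four equivalents side by side.

    import requests

    WORKSPACE_URL = "https://<databricks-instance>"  # placeholder
    HEADERS = {"Authorization": "Bearer <databricks-personal-access-token>"}
    BASE = f"{WORKSPACE_URL}/api/2.0/serving-endpoints"

    served_entity = {
        "entity_name": "main.default.my_model",  # hypothetical model name
        "entity_version": "1",
        "workload_size": "Small",
        "scale_to_zero_enabled": True,
    }

    # create_inference_deployment -> create a model serving endpoint
    requests.post(BASE, headers=HEADERS,
                  json={"name": "my-endpoint",
                        "config": {"served_entities": [served_entity]}})

    # update_inference_deployment -> update a model serving endpoint
    requests.put(f"{BASE}/my-endpoint/config", headers=HEADERS,
                 json={"served_entities": [dict(served_entity, entity_version="2")]})

    # get_inference_deployment -> get status of a model serving endpoint
    requests.get(f"{BASE}/my-endpoint", headers=HEADERS)

    # delete_inference_deployment -> delete a model serving endpoint
    requests.delete(f"{BASE}/my-endpoint", headers=HEADERS)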

The following notebook provides a guided example of migrating a llama-13b model from MosaicML to Mosaic AI Model Serving.

Migrate from MosaicML inference to Mosaic AI Model Serving notebook
