Migrate to Serverless Real-Time Inference

Preview

This feature is in Public Preview.

This article demonstrates how to enable Serverless Real-Time Inference on your workspace and switch your models from Classic MLflow model serving to model serving with Serverless Real-Time Inference.

For general information about Serverless Real-Time Inference, see Model serving with Serverless Real-Time Inference.

Requirements

Significant changes

  • In Serverless Real-Time Inference, the format of the request to the endpoint and the response from the endpoint differ slightly from Classic MLflow model serving. See Scoring a model endpoint for details on the new format protocol, and see the example after this list.

  • In Serverless Real-Time Inference, the endpoint URL includes model-endpoint instead of model.

  • Serverless Real-Time Inference includes full support for managing resources with API workflows and is production-ready.

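For example, a minimal Python sketch of scoring a model served with Serverless Real-Time Inference might look like the following. The workspace URL, model name, version, and feature names are placeholders, and the exact payload keys (such as dataframe_records) should be confirmed against Scoring a model endpoint.

```python
import os
import requests

# Placeholder values for illustration; substitute your workspace URL,
# registered model name, and model version.
WORKSPACE_URL = "https://<databricks-instance>"
MODEL_NAME = "my-model"
MODEL_VERSION = "1"

# Serverless Real-Time Inference URLs include `model-endpoint` instead of `model`.
url = f"{WORKSPACE_URL}/model-endpoint/{MODEL_NAME}/{MODEL_VERSION}/invocations"

# Assumption: the new request protocol wraps tabular input in an explicit key
# such as `dataframe_records`; see Scoring a model endpoint for the full protocol.
payload = {"dataframe_records": [{"feature_a": 1.0, "feature_b": 2.0}]}

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
response.raise_for_status()
print(response.json())
```
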
Enable Serverless Real-Time Inference for your workspace

Important

Serverless Real-Time Inference must be enabled for your workspace. The first time it is enabled, an admin must read and accept the terms and conditions.

To enable Serverless Real-Time Inference for your workspace:

  1. Enroll in the preview.

    1. Email the model-serving-feedback team and request to join the Serverless Real-Time Inference public preview.

    2. Databricks sends you a Google form.

    3. Fill out the form and submit it to Databricks. The form asks which workspace you want to enroll.

    4. Wait until Databricks notifies you that your workspace is enrolled in the preview.

  2. As an admin, access the admin console.

  3. Select Workspace Settings.

  4. Select MLflow Serverless Real-Time Inference Enablement.

Disable Classic MLflow model serving on your models

Before you can enable Serverless Real-Time Inference for your models, you need to disable Classic MLflow model serving on your currently served models.

The following steps show how to accomplish this with the UI; a programmatic sketch follows the steps.

  1. Navigate to Models on the sidebar of your Machine Learning workspace.

  2. Select the model for which you want to disable Classic MLflow model serving.

  3. On the Serving tab, select Stop.

  4. A confirmation message appears. Select Stop Serving.
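
If you prefer to stop Classic MLflow model serving programmatically, a sketch along the following lines may work. The REST path /api/2.0/preview/mlflow/endpoints/disable and its registered_model_name field are assumptions about the Classic model serving API; verify them against your workspace's REST API reference before relying on this.

```python
import os
import requests

# Assumption: Classic MLflow model serving exposes a `disable` call at
# /api/2.0/preview/mlflow/endpoints/disable that takes the registered model name.
WORKSPACE_URL = "https://<databricks-instance>"  # placeholder
MODEL_NAME = "my-model"                          # placeholder registered model name

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/preview/mlflow/endpoints/disable",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"registered_model_name": MODEL_NAME},
)
response.raise_for_status()
print("Classic model serving disabled for", MODEL_NAME)
```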

Enable Serverless Real-Time Inference on your models

Once Serverless Real-Time Inference is enabled on your workspace, the following screen appears on the Serving tab of your registered models. To enable Serverless Real-Time Inference for a model, click the Enable Serverless Real-Time Inference button.

Serving pane

Important

If you instead see an Enable Serving button, the model is using Classic MLflow model serving endpoints, not Serverless Real-Time Inference endpoints. Contact an admin to enable the feature on this workspace.