Migrate to Serverless Real-Time Inference

Important

  • This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.

  • The guidance in this article is for the preview version of the Model Serving functionality, formerly Serverless Real-Time Inference. Databricks recommends that you migrate your model serving workflows to the generally available functionality. See Model serving with Databricks.

Preview

This feature is in Public Preview.

This article demonstrates how to enable Serverless Real-Time Inference on your workspace and switch your models from using Legacy MLflow Model Serving to model serving with Serverless Real-Time Inference.

For general information about Serverless Real-Time Inference, see Model serving with Serverless Real-Time Inference.

Requirements

  • A registered model in the MLflow Model Registry (a minimal registration sketch follows this list).

  • Cluster Create permissions in your workspace. See Manage entitlements.

  • CAN MANAGE PRODUCTION VERSIONS permissions on the registered model. See MLflow model ACLs.
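
If you do not already have a registered model, the following is a minimal sketch of logging and registering a scikit-learn model with MLflow. The model name my-serverless-model and the toy iris data are assumptions for illustration only and are not used elsewhere in this article.

```python
# Minimal sketch: log a model and register it in the MLflow Model Registry.
# Assumptions: scikit-learn is available and "my-serverless-model" is a
# placeholder registered model name.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.sklearn.log_model(model, artifact_path="model")

# Create the registered model, or add a new version if the name already exists.
result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name="my-serverless-model",
)
print(result.name, result.version)
```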

Significant changes

  • In Serverless Real-Time Inference, the format of requests to the endpoint and of responses from the endpoint differs slightly from Legacy MLflow Model Serving. See Scoring a model endpoint for details on the new format protocol, and the sketch after this list for an example request.

  • In Serverless Real-Time Inference, the endpoint URL includes model-endpoint instead of model.

  • Serverless Real-Time Inference includes full support for managing resources with API workflows and is production-ready.
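
For concreteness, the following is a hedged scoring sketch in Python. The workspace URL, the model name and version, the DATABRICKS_TOKEN environment variable, and the dataframe_records payload are assumptions for illustration; see Scoring a model endpoint for the authoritative request and response format.

```python
# Scoring sketch against a Serverless Real-Time Inference endpoint.
# Note the serverless URL uses model-endpoint where Legacy MLflow Model
# Serving used model (.../model/<model-name>/<version>/invocations).
import os
import requests

workspace_url = "https://<databricks-instance>"  # your workspace URL
endpoint_url = f"{workspace_url}/model-endpoint/my-serverless-model/1/invocations"

# Wrapped request body; Legacy MLflow Model Serving accepted unwrapped
# records or pandas-split JSON instead.
payload = {
    "dataframe_records": [
        {"sepal_length": 5.1, "sepal_width": 3.5,
         "petal_length": 1.4, "petal_width": 0.2}
    ]
}

response = requests.post(
    endpoint_url,
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json=payload,
)
response.raise_for_status()
print(response.json())  # responses are wrapped, for example {"predictions": [...]}
```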

Enable Serverless Real-Time Inference for your workspace

Important

Serverless Real-Time Inference must be enabled for your workspace. The first time it is enabled for the workspace, a workspace admin must read and accept the terms and conditions.

To enable Serverless Real-Time Inference for your workspace:

  1. Enroll in the preview.

    1. Reach out to your Databricks account team to request to join the Serverless Real-Time Inference public preview.

    2. Databricks sends you a Google form.

    3. Fill out the form, including which workspace to enroll, and submit it to Databricks.

    4. Wait until Databricks notifies you that your workspace is enrolled in the preview.

  2. As a workspace admin, access the admin settings page.

  3. Select Workspace Settings.

  4. Select MLflow Serverless Real-Time Inference Enablement.

Disable Legacy MLflow Model Serving on your models

Before you can enable Serverless Real-Time Inference for your models, you need to disable Legacy MLflow Model Serving on your currently served models.

The following steps show how to accomplish this with the UI; a REST API alternative is sketched after the steps.

  1. Navigate to Models on the sidebar of your Machine Learning workspace.

  2. Select the model for which you want to disable Legacy MLflow Model Serving.

  3. On the Serving tab, select Stop.

  4. A confirmation message appears. Select Stop Serving.
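
If you prefer to disable legacy serving programmatically, the following is a hedged sketch that calls the Legacy MLflow Model Serving REST API. The endpoint path /api/2.0/preview/mlflow/endpoints/disable, the model name, and the DATABRICKS_TOKEN environment variable are assumptions; verify them against the Legacy MLflow Model Serving API documentation before use.

```python
# Hedged sketch: disable Legacy MLflow Model Serving for one registered model
# via the REST API instead of the UI. The request path and body shape are
# assumptions based on the legacy preview API; confirm before relying on them.
import os
import requests

workspace_url = "https://<databricks-instance>"  # your workspace URL

resp = requests.post(
    f"{workspace_url}/api/2.0/preview/mlflow/endpoints/disable",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"registered_model_name": "my-serverless-model"},
)
resp.raise_for_status()
print("Legacy MLflow Model Serving disabled for my-serverless-model")
```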

Enable Serverless Real-Time Inference on your models

Once Serverless Real-Time Inference is enabled on your workspace, the following screen appears on the Serving tab of your registered models. To enable Serverless Real-Time Inference for a model, click the Enable Serverless Real-Time Inference button.

Serving pane

Important

If you do not see that button, but instead see an Enable Serving button, you are using endpoints for Legacy MLflow Model Serving, not serverless model endpoints. Contact a workspace admin to enable the feature on this workspace.