Migrate to Serverless Real-Time Inference
Important
This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.
The guidance in this article is for the preview version of the Model Serving functionality, formerly called Serverless Real-Time Inference. Databricks recommends that you migrate your model serving workflows to the generally available functionality. See Model serving with Databricks.
Preview
This feature is in Public Preview.
This article demonstrates how to enable Serverless Real-Time Inference on your workspace and switch your models from using Legacy MLflow Model Serving to model serving with Serverless Real-Time Inference.
For general information about Serverless Real-Time Inference, see Model serving with Serverless Real-Time Inference.
Requirements
Registered model in the MLflow Model Registry.
Cluster Create permissions in your workspace. See Configure cluster creation entitlement.
Can Manage Production Version permissions on the registered model. See MLflow Model permissions.
Significant changes
In Serverless Real-Time Inference, the format of requests to the endpoint and of responses from the endpoint differs slightly from Legacy MLflow Model Serving. See Scoring a model endpoint for details on the new format protocol, and the sketch after this list for an illustration.
In Serverless Real-Time Inference, the endpoint URL includes model-endpoint instead of model.
Serverless Real-Time Inference includes full support for managing resources with API workflows and is production-ready.
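The following Python sketch illustrates these differences, assuming a hypothetical registered model named my-model at version 1 and placeholder values for the workspace URL and personal access token. The request body keys shown (such as dataframe_records) are based on the new format protocol; see Scoring a model endpoint for the authoritative list of accepted formats.

```python
import requests

# All values below are placeholders -- substitute your workspace URL,
# registered model name, model version, and personal access token.
WORKSPACE_URL = "https://<databricks-instance>"
MODEL_NAME = "my-model"
MODEL_VERSION = "1"
TOKEN = "<personal-access-token>"

headers = {"Authorization": f"Bearer {TOKEN}"}

# Legacy MLflow Model Serving URL: the path segment is `model`.
legacy_url = f"{WORKSPACE_URL}/model/{MODEL_NAME}/{MODEL_VERSION}/invocations"

# Serverless Real-Time Inference URL: the path segment is `model-endpoint`.
serverless_url = f"{WORKSPACE_URL}/model-endpoint/{MODEL_NAME}/{MODEL_VERSION}/invocations"

# Example request body in one of the supported JSON formats; consult
# Scoring a model endpoint for all accepted keys and the response shape.
payload = {"dataframe_records": [{"feature_a": 1.0, "feature_b": 2.0}]}

response = requests.post(serverless_url, headers=headers, json=payload)
print(response.json())
```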
Enable Serverless Real-Time Inference for your workspace
Important
Serverless Real-Time Inference must be enabled for your workspace. The first time that it is enabled for the workspace, the workspace admin must read and accept the terms and conditions.
To enable Serverless Real-Time Inference for your workspace:
Enroll in the preview.
Reach out to your Databricks representative to request to join the Serverless Real-Time Inference public preview.
Databricks sends you a Google form.
Fill out the form and submit it to Databricks. The form asks which workspace you want to enroll.
Wait until Databricks notifies you that your workspace is enrolled in the preview.
As a workspace admin, access the admin settings page.
Select Workspace Settings.
Select MLflow Serverless Real-Time Inference Enablement.
Disable Legacy MLflow Model Serving on your models
Before you can enable Serverless Real-Time Inference for your models, you need to disable Legacy MLflow Model Serving on your currently served models.
The following steps show how to accomplish this with the UI; a scripted alternative is sketched after these steps.
Navigate to Models on the sidebar of your Machine Learning workspace.
Select the model for which you want to disable Legacy MLflow Model Serving.
On the Serving tab, select Stop.
A confirmation message appears. Select Stop Serving.
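If you prefer to script this step instead of using the UI, Legacy MLflow Model Serving can also be disabled through its REST API. The sketch below is illustrative only: the endpoint path and request field shown are assumptions, so confirm them against the Legacy MLflow Model Serving API documentation before relying on them.

```python
import requests

# Placeholder values -- substitute your workspace URL, token, and model name.
WORKSPACE_URL = "https://<databricks-instance>"
TOKEN = "<personal-access-token>"
MODEL_NAME = "my-model"

# Assumed legacy serving API path; verify it against the Legacy MLflow
# Model Serving documentation for your workspace before using.
disable_url = f"{WORKSPACE_URL}/api/2.0/preview/mlflow/endpoints/disable"

response = requests.post(
    disable_url,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"registered_model_name": MODEL_NAME},
)
response.raise_for_status()
```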
Enable Serverless Real-Time Inference on your models
Once Serverless Real-Time Inference is enabled on your workspace, the Serving tab of your registered models displays an Enable Serverless Real-Time Inference button. To enable Serverless Real-Time Inference for a model, click that button.

Important
If you do not see that button but instead see an Enable Serving button, you are using Legacy MLflow Model Serving endpoints, not Serverless model endpoints. Contact a workspace admin to enable the feature on this workspace.