Upgrade ML workflows to target models in Unity Catalog

This article explains how to migrate and upgrade ML workflows to target models in Unity Catalog.

Requirements

Before getting started, make sure you meet the requirements described in Requirements. In particular, make sure that the users or principals used to execute your model training, deployment, and inference workflows have the necessary privileges on a registered model in Unity Catalog:

  • Training: Ownership of the registered model (required to create new model versions), plus USE CATALOG and USE SCHEMA privileges on the enclosing catalog and schema.

  • Deployment: Ownership of the registered model (required to set aliases on the model), plus USE CATALOG and USE SCHEMA privileges on the enclosing catalog and schema.

  • Inference: EXECUTE privilege on the registered model (required to read and perform inference with model versions), plus USE CATALOG and USE SCHEMA privileges on the enclosing catalog and schema.

Creating parallel training, deployment, and inference workflows

To upgrade model training and inference workflows to Unity Catalog, Databricks recommends an incremental approach in which you create a parallel training, deployment, and inference pipeline that leverages models in Unity Catalog. When you’re comfortable with the results using Unity Catalog, you can switch downstream consumers to read the batch inference output, or increase the traffic routed to models in Unity Catalog in serving endpoints.
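One way to judge whether you are "comfortable with the results" is to compare the predictions of the existing pipeline with those of the parallel Unity Catalog pipeline on the same inputs. The following is a minimal sketch of such a check. The function name `agreement_rate` and the tolerance parameter are hypothetical and are not part of any Databricks or MLflow API.

```python
from typing import Sequence


def agreement_rate(
    legacy_preds: Sequence[float],
    uc_preds: Sequence[float],
    tol: float = 1e-6,
) -> float:
    """Return the fraction of rows where both pipelines agree within `tol`.

    Hypothetical helper: `legacy_preds` come from the existing pipeline and
    `uc_preds` from the parallel Unity Catalog pipeline, scored on the same rows.
    """
    if len(legacy_preds) != len(uc_preds):
        raise ValueError("prediction lists must have the same length")
    if len(legacy_preds) == 0:
        raise ValueError("no predictions to compare")
    matches = sum(1 for a, b in zip(legacy_preds, uc_preds) if abs(a - b) <= tol)
    return matches / len(legacy_preds)
```

For example, you might require an agreement rate above a threshold of your choosing before switching downstream consumers to the new output.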

Model training workflow

Clone your model training workflow. Then, ensure that:

  1. The workflow cluster has access to Unity Catalog and meets the requirements described in Requirements.

  2. The principal running the workflow has the necessary permissions on a registered model in Unity Catalog.

Next, modify model training code in the cloned workflow. You may need to clone the notebook run by the workflow, or create and target a new Git branch in the cloned workflow. Follow these steps to install the necessary version of MLflow, configure the client to target Unity Catalog in your training code, and update the model training code to register models to Unity Catalog.
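As a rough illustration of the code change, the sketch below builds a three-level Unity Catalog model name and registers a model that was logged in an MLflow run. The helper names `uc_model_name` and `register_run_model` are hypothetical. The MLflow calls `mlflow.set_registry_uri` and `mlflow.register_model` are the only library APIs used, and the run ID and artifact path are values you would supply from your own training code.

```python
def uc_model_name(catalog: str, schema: str, model: str) -> str:
    """Build a three-level name of the form catalog.schema.model (hypothetical helper)."""
    parts = (catalog, schema, model)
    if any(not p or "." in p for p in parts):
        raise ValueError("each name part must be non-empty and contain no dots")
    return ".".join(parts)


def register_run_model(run_id: str, artifact_path: str, full_name: str):
    """Register a model logged in an MLflow run as a new version in Unity Catalog."""
    # Imported lazily so that uc_model_name can be used without MLflow installed.
    import mlflow

    # Target Unity Catalog rather than the workspace model registry.
    mlflow.set_registry_uri("databricks-uc")
    return mlflow.register_model(f"runs:/{run_id}/{artifact_path}", full_name)
```

Calling `register_run_model(run_id, "model", uc_model_name("staging", "ml_team", "fraud_detection"))` from the cloned workflow would create a new version under that name, assuming the principal has the privileges listed under Requirements.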

Model deployment workflow

Clone your model deployment workflow, following similar steps as in Model training workflow to update its compute configuration to enable access to Unity Catalog.

Ensure the principal who owns the cloned workflow has the necessary permissions. If you have model validation logic in your deployment workflow, update it to load model versions from Unity Catalog. Use aliases to manage production model rollouts.
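The sketch below shows one possible shape for such a deployment step: load a specific model version from Unity Catalog, run a validation check, and then assign an alias. The function names and the row-count check are hypothetical placeholders for your own validation logic. The MLflow calls used are `mlflow.pyfunc.load_model` and `MlflowClient.set_registered_model_alias`.

```python
def version_uri(name: str, version) -> str:
    """URI for a specific version of a registered model."""
    return f"models:/{name}/{version}"


def alias_uri(name: str, alias: str) -> str:
    """URI for whichever version currently holds the given alias."""
    return f"models:/{name}@{alias}"


def validate_and_alias(name: str, version, sample_input, alias: str = "Champion"):
    """Load a version from Unity Catalog, sanity-check it, then set an alias on it."""
    # Imported lazily so the URI helpers work without MLflow installed.
    import mlflow

    mlflow.set_registry_uri("databricks-uc")
    model = mlflow.pyfunc.load_model(version_uri(name, version))

    # Placeholder validation (an assumption): one prediction per input row.
    predictions = model.predict(sample_input)
    if len(predictions) != len(sample_input):
        raise RuntimeError("validation failed: unexpected number of predictions")

    client = mlflow.tracking.MlflowClient()
    client.set_registered_model_alias(name=name, alias=alias, version=str(version))
```

Downstream consumers that load `alias_uri(name, "Champion")` pick up the new version after the alias moves, without any change to their own code.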

Model inference workflow

Batch inference workflow

Follow similar steps as in Model training workflow to clone the batch inference workflow and update its compute configuration to enable access to Unity Catalog. Ensure the principal running the cloned batch inference job has the necessary permissions to load the model for inference.
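A cloned batch inference job might load the model by alias and score its input in chunks, roughly as sketched below. `load_uc_model` and `score_in_batches` are hypothetical helper names, and plain Python lists are used only to keep the example small. A real job would more likely pass a pandas or Spark DataFrame to the model.

```python
from typing import Any, List


def load_uc_model(name: str, alias: str = "Champion"):
    """Load the model version that currently holds `alias` from Unity Catalog."""
    # Imported lazily so score_in_batches can be used without MLflow installed.
    import mlflow

    mlflow.set_registry_uri("databricks-uc")
    return mlflow.pyfunc.load_model(f"models:/{name}@{alias}")


def score_in_batches(model: Any, rows: List[Any], batch_size: int = 1000) -> List[Any]:
    """Call model.predict on consecutive slices of `rows` and collect the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    predictions: List[Any] = []
    for start in range(0, len(rows), batch_size):
        predictions.extend(model.predict(rows[start:start + batch_size]))
    return predictions
```

Loading the model requires the EXECUTE privilege on the registered model, plus USE CATALOG and USE SCHEMA on its catalog and schema, as listed under Requirements.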

Model serving workflow

If you are using Databricks Model Serving, you do not need to clone your existing endpoint. Instead, you can leverage the traffic split feature to route a small fraction of traffic to models in Unity Catalog.

First, ensure the principal who owns the model serving endpoint has the necessary permissions to load the model for inference. Then, update your cloned model deployment workflow to assign a small percentage of traffic to model versions in Unity Catalog.
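The sketch below builds a configuration that keeps most traffic on the existing model and routes a small share to a Unity Catalog model version. The field names (`served_entities`, `entity_name`, `entity_version`, `traffic_config`, `routes`, `served_model_name`, `traffic_percentage`) are an assumption modeled on the serving endpoint configuration API and should be checked against the current API reference. The function name and the entity labels `legacy` and `unity-catalog` are hypothetical.

```python
def traffic_split_config(
    legacy_model: str,
    legacy_version: str,
    uc_model: str,
    uc_version: str,
    uc_percent: int = 10,
) -> dict:
    """Build an endpoint config dict that sends `uc_percent` of traffic to the UC model."""
    if not 0 <= uc_percent <= 100:
        raise ValueError("uc_percent must be between 0 and 100")

    def entity(label: str, model: str, version: str) -> dict:
        # Workload size and scale-to-zero are illustrative defaults, not recommendations.
        return {
            "name": label,
            "entity_name": model,
            "entity_version": version,
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
        }

    return {
        "served_entities": [
            entity("legacy", legacy_model, legacy_version),
            entity("unity-catalog", uc_model, uc_version),
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": "legacy", "traffic_percentage": 100 - uc_percent},
                {"served_model_name": "unity-catalog", "traffic_percentage": uc_percent},
            ]
        },
    }
```

Your deployment workflow could raise `uc_percent` in steps as confidence in the Unity Catalog model grows, ending at 100.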

Promote a model across environments

Databricks recommends that you deploy ML pipelines as code. This eliminates the need to promote models across environments, as all production models can be produced through automated training workflows in a production environment.

However, in some cases, it may be too expensive to retrain models across environments. In such scenarios, you can copy model versions across registered models in Unity Catalog to promote them across environments.

You need the following privileges to execute the example code below:

  • USE CATALOG on the staging and prod catalogs.

  • USE SCHEMA on the staging.ml_team and prod.ml_team schemas.

  • EXECUTE on staging.ml_team.fraud_detection.

In addition, you must be the owner of the registered model prod.ml_team.fraud_detection.

The following code snippet uses the copy_model_version MLflow Client API, available in MLflow version 2.8.0 and above.

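import mlflow

# Point the MLflow client at the Unity Catalog model registry.
mlflow.set_registry_uri("databricks-uc")

client = mlflow.tracking.MlflowClient()

# Source: version 1 of the registered model in the staging catalog.
src_model_name = "staging.ml_team.fraud_detection"
src_model_version = "1"
src_model_uri = f"models:/{src_model_name}/{src_model_version}"

# Destination: the registered model in the prod catalog.
dst_model_name = "prod.ml_team.fraud_detection"

# Copy the version; the call returns the newly created model version.
copied_model_version = client.copy_model_version(src_model_uri, dst_model_name)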

After the model version is in the production environment, you can perform any necessary pre-deployment validation. Then, you can mark the model version for deployment using aliases.

client = mlflow.tracking.MlflowClient()
# Assign the "Champion" alias to the copied version to mark it for deployment.
client.set_registered_model_alias(
    name="prod.ml_team.fraud_detection",
    alias="Champion",
    version=copied_model_version.version,
)

In the example above, only users who can read from the staging.ml_team.fraud_detection registered model and write to the prod.ml_team.fraud_detection registered model can promote staging models to the production environment. The same users can also use aliases to manage which model versions are deployed within the production environment. You don’t need to configure any other rules or policies to govern model promotion and deployment.

You can customize this flow to promote the model version across multiple environments that match your setup, such as dev, qa, and prod. Access control is enforced as configured in each environment.
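As an illustration of a multi-environment flow, the sketch below copies a model version along a chain of catalogs such as dev, qa, and prod. It assumes that each environment is a separate catalog and that the schema and model names are the same in every catalog. The function name `promote_through` is hypothetical. The client is passed in and is expected to provide `copy_model_version`, as `MlflowClient` does in MLflow 2.8.0 and above.

```python
from typing import List, Sequence, Tuple


def promote_through(
    client,
    model_path: str,
    version,
    environments: Sequence[str],
) -> List[Tuple[str, str]]:
    """Copy a model version from each catalog in `environments` to the next one.

    `model_path` is the schema.model part of the name, e.g. "ml_team.fraud_detection".
    Returns a list of (destination model name, new version) pairs.
    """
    if len(environments) < 2:
        raise ValueError("need at least a source and a destination environment")

    current_version = str(version)
    promoted: List[Tuple[str, str]] = []
    for src_env, dst_env in zip(environments, environments[1:]):
        src_uri = f"models:/{src_env}.{model_path}/{current_version}"
        dst_name = f"{dst_env}.{model_path}"
        copied = client.copy_model_version(src_uri, dst_name)
        # The copy gets its own version number in the destination model.
        current_version = str(copied.version)
        promoted.append((dst_name, current_version))
    return promoted
```

In practice you would likely run validation between hops instead of copying through every environment in one call. The loop only shows how the source URI and destination name change at each step.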

Use job webhooks for manual approval for model deployment

Databricks recommends that you automate model deployment if possible, using appropriate checks and tests during the model deployment process. However, if you do need to perform manual approvals to deploy production models, you can use job webhooks to call out to external CI/CD systems to request manual approval for deploying a model, after your model training job completes successfully. After manual approval is provided, your CI/CD system can then deploy the model version to serve traffic, for example by setting the “Champion” alias on it.
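On the CI/CD side, the final step after an approval decision could look like the sketch below. How the approval is collected, and the shape of the webhook payload, depend on your CI/CD system and are not shown. The function name `deploy_if_approved` is hypothetical, and the client is passed in and is expected to provide `set_registered_model_alias`, as `MlflowClient` does.

```python
def deploy_if_approved(
    client,
    model_name: str,
    version,
    approved: bool,
    alias: str = "Champion",
) -> bool:
    """Set `alias` on the model version only if the manual approval was granted.

    Returns True if the alias was set, False if the deployment was skipped.
    """
    if not approved:
        return False
    client.set_registered_model_alias(name=model_name, alias=alias, version=str(version))
    return True
```

Keeping the approval decision as a plain boolean input means the same function can be driven by any approval mechanism your CI/CD system offers.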