%md # Models in Unity Catalog Example

This notebook illustrates how to use Models in Unity Catalog APIs to manage models. The notebook includes the following steps:

- Track and log models with MLflow.
- Register models to Unity Catalog.
- Use the API to add descriptions to models and model versions.
- Use aliases to deploy model versions.
- Use the API to load model versions for inference.
- Delete models.

This tutorial uses features from MLflow 3.0. For more details, see "Get started with MLflow 3.0" ([AWS](https://docs.databricks.com/aws/en/mlflow/mlflow-3-install) | [Azure](https://learn.microsoft.com/en-us/azure/databricks/mlflow/mlflow-3-install) | [GCP](https://docs.databricks.com/gcp/en/mlflow/mlflow-3-install)).
%md ## Requirements

- This notebook requires a workspace that has been enabled for Unity Catalog. Your workspace must be attached to a Unity Catalog metastore that supports privilege inheritance. This is true for all metastores created after August 25, 2022.
- The notebook must be attached to a cluster that has access to Unity Catalog and that is running Databricks Runtime for Machine Learning 13.3 LTS or above.
- This notebook creates models in the `main.default` schema by default. This requires the `USE CATALOG` privilege on the `main` catalog, plus the `CREATE MODEL` and `USE SCHEMA` privileges on the `main.default` schema. You can change the catalog and schema used in this notebook, as long as you have the same privileges on the catalog and schema you specify (an optional example of granting these privileges appears below).
- This notebook uses MLflow 3.0, which requires installing mlflow version 3.0 or above.
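%md The following optional cell is a minimal sketch of how an administrator (or the catalog and schema owner) could grant the privileges listed above using SQL. The principal `data-scientists` is a placeholder; replace it with your own user or group, and adjust the catalog and schema names if you changed them.
# Optional: grant the privileges required by this notebook.
# The principal `data-scientists` is a placeholder; replace it with your user or group.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-scientists`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.default TO `data-scientists`")
spark.sql("GRANT CREATE MODEL ON SCHEMA main.default TO `data-scientists`")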
# Upgrade to the latest MLflow version to use MLflow 3.0 features
%pip install "mlflow>=3.0" --upgrade
dbutils.library.restartPython()
%md ## Configure MLflow client to access models in Unity Catalog

By default, the MLflow Python client creates models in the Databricks workspace model registry. To upgrade to models in Unity Catalog, configure the MLflow client as shown:
import mlflow

mlflow.set_registry_uri("databricks-uc")
# You can update the catalog and schema name containing the model in Unity Catalog if needed
CATALOG_NAME = "main"
SCHEMA_NAME = "default"
MODEL_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.bike_share"
%md ## Load and pre-process dataset

This notebook uses bike sharing data from a dataset that is included in Databricks datasets. The prediction target in the data is the number of bicycles rented at a certain location each hour. Variables in the dataset provide information about the date, time, and weather.
df = spark.read.csv("/databricks-datasets/bikeSharing/data-001/hour.csv", header="true", inferSchema="true")
import pandas as pd

# Convert the Spark DataFrame loaded above to pandas and keep the feature and target columns
df = df.toPandas()
df = df[["season", "yr", "mnth", "hr", "holiday", "weekday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed", "cnt"]]
display(df)
from sklearn.model_selection import train_test_split

X = df.drop(columns=["cnt"]).astype("float64")
y = df["cnt"].astype("float64")

# Split out the training data
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.60, random_state=123)

# Split the remaining data equally into validation and test
X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)
%md ## Train, register, and deploy model

The following code trains a regression model using gradient boosting. It illustrates the use of MLflow APIs to register the fitted model to Unity Catalog.
import mlflow.sklearn

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

with mlflow.start_run() as run:
    gradient_booster = GradientBoostingRegressor()
    gradient_booster.fit(X_train, y_train)

    mse = mean_squared_error(gradient_booster.predict(X_val), y_val)
    print("Validation MSE: %d" % mse)
    mlflow.log_metric("mse", mse)

    example_input = X_val.iloc[[0]]

    # To register the model to Unity Catalog, specify the `registered_model_name` parameter
    # of the `mlflow.sklearn.log_model()` function. This automatically creates a new model version.
    # All metrics of the model are available in Unity Catalog. You can log additional metrics
    # to the model at any time with mlflow.log_metric() by passing the model_id argument;
    # those metrics also appear under the model version in Unity Catalog.
    model_info = mlflow.sklearn.log_model(
        sk_model=gradient_booster,
        name="sklearn-model",
        input_example=example_input,
        registered_model_name=MODEL_NAME
    )
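%md The comment in the previous cell mentions that you can attach additional metrics to an already-logged model by passing its `model_id` to `mlflow.log_metric()`. The following optional cell is a minimal sketch of that pattern; it assumes MLflow 3 and uses the `model_info` value returned by `mlflow.sklearn.log_model()` in the previous cell.
# Optional: log a test-set metric and link it to the logged model via its model_id
with mlflow.start_run():
    test_mse = mean_squared_error(gradient_booster.predict(X_test), y_test)
    mlflow.log_metric("test_mse", test_mse, model_id=model_info.model_id)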
%md ### Add model and model version descriptions using the API

You can use MLflow APIs to find the recently trained model version, then add descriptions to the model version and the registered model:
from mlflow.tracking.client import MlflowClient


# This function returns the latest model version.
def get_latest_model_version(model_name):
    client = MlflowClient()
    model_version_infos = client.search_model_versions("name = '%s'" % model_name)
    return max([int(model_version_info.version) for model_version_info in model_version_infos])
latest_version = get_latest_model_version(model_name=MODEL_NAME)
client = MlflowClient()

client.update_registered_model(
    name=MODEL_NAME,
    description="Bike share model."
)

client.update_model_version(
    name=MODEL_NAME,
    version=latest_version,
    description="This model version was built using the scikit-learn GradientBoostingRegressor."
)
%md ### View the model in the UI

You can view and manage registered models and model versions in Unity Catalog using Catalog Explorer. In the left sidebar, click **Catalog** and navigate the catalog directory to the catalog and schema where you created the model. If you did not change `CATALOG_NAME` and `SCHEMA_NAME` in the configuration cell earlier in this notebook, you can find the model you just created in the `main` catalog and `default` schema.

For more information about Catalog Explorer, see the documentation ([AWS](https://docs.databricks.com/data/index.html) | [Azure](https://learn.microsoft.com/azure/databricks/data/) | [GCP](https://docs.gcp.databricks.com/data/index.html)).
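%md You can also inspect the registered model and its latest version programmatically. The following optional cell is a minimal sketch using MLflow client APIs; it only reads metadata and does not modify the model.
# Optional: inspect the registered model and the latest model version programmatically
client = MlflowClient()

registered_model = client.get_registered_model(MODEL_NAME)
print(f"Registered model: {registered_model.name}")
print(f"Description: {registered_model.description}")

latest_version_info = client.get_model_version(name=MODEL_NAME, version=get_latest_model_version(MODEL_NAME))
print(f"Latest version: {latest_version_info.version}")
print(f"Version description: {latest_version_info.description}")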
%md ### Deploy a model version for inference

Models in Unity Catalog support aliases ([AWS](https://docs.databricks.com/mlflow/model-registry.html#model-registry-concepts) | [Azure](https://learn.microsoft.com/en-us/azure/databricks/mlflow/model-registry#model-registry-concepts) | [GCP](https://docs.gcp.databricks.com/mlflow/model-registry.html#model-registry-concepts)) for model deployment. Aliases provide mutable, named references (for example, "Champion" or "Challenger") to a particular version of a registered model that you can reference and target in downstream inference workflows. The following cell shows how to use MLflow APIs to assign the "Champion" alias to the newly trained model version.
client = MlflowClient()

latest_version = get_latest_model_version(MODEL_NAME)
client.set_registered_model_alias(MODEL_NAME, "Champion", latest_version)
%md ## Load model versions using the API

The MLflow Models component defines functions for loading models from different machine learning frameworks. For example, `mlflow.pyfunc.load_model()` is used to load models that were saved in the MLflow `pyfunc` format, and `mlflow.sklearn.load_model()` is used to load scikit-learn models that were saved in MLflow format.

These functions can load models from Models in Unity Catalog by version number or alias. The following cell shows examples.
import mlflow.pyfunc

model_version_uri = "models:/{model_name}/1".format(model_name=MODEL_NAME)

print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_version_uri))
model_version_1 = mlflow.pyfunc.load_model(model_version_uri)

model_champion_uri = "models:/{model_name}@Champion".format(model_name=MODEL_NAME)

print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_champion_uri))
champion_model = mlflow.pyfunc.load_model(model_champion_uri)
%md ### Make predictions using the champion model

In this section, the champion model is used to make predictions. The `load_and_predict()` function defined in the following cell loads the version of the model specified by `model_alias` (in this example, "Champion") and uses it to make predictions on the dataset specified by `new_data`.
import pandas as pd


def load_and_predict(model_name, model_alias, new_data):
    # Load the model version that the alias currently points to, then predict on new_data
    model_uri = "models:/{model_name}@{model_alias}".format(model_name=model_name, model_alias=model_alias)
    model = mlflow.pyfunc.load_model(model_uri)
    predictions = pd.DataFrame(model.predict(new_data))
    print(predictions)
    return predictions


gb_predictions = load_and_predict(MODEL_NAME, "Champion", X_val)
%md ## Create and deploy a new model version

The following code trains a random forest model using the scikit-learn RandomForestRegressor. It then registers the model to Unity Catalog, using the same `registered_model_name` that you used when you trained the GradientBoostingRegressor version of the model. This creates a new model version.
import mlflow.sklearn

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

with mlflow.start_run() as run:
    n_estimators = 300
    mlflow.log_param("n_estimators", n_estimators)

    rand_forest = RandomForestRegressor(n_estimators=n_estimators)
    rand_forest.fit(X_train, y_train)

    mse = mean_squared_error(rand_forest.predict(X_val), y_val)
    print("Validation MSE: %d" % mse)
    mlflow.log_metric("mse", mse)

    example_input = X_val.iloc[[0]]

    # Specify the `registered_model_name` parameter of the `mlflow.sklearn.log_model()`
    # function to register the model to Unity Catalog. This automatically
    # creates a new model version.
    mlflow.sklearn.log_model(
        sk_model=rand_forest,
        name="sklearn-model",
        input_example=example_input,
        registered_model_name=MODEL_NAME
    )
%md ### Add a description for the new model version
client.update_model_version(
    name=MODEL_NAME,
    version=get_latest_model_version(MODEL_NAME),
    description="This model version was built using the scikit-learn RandomForestRegressor."
)
%md ### Mark new model version as Challenger and test the model

The following code assigns the "Challenger" alias to the new model version, and uses that version to make predictions on the same dataset used previously, to compare the two models' performance.
client = MlflowClient()

latest_version = get_latest_model_version(MODEL_NAME)
client.set_registered_model_alias(MODEL_NAME, "Challenger", latest_version)

rf_predictions = load_and_predict(MODEL_NAME, "Challenger", X_val)
%md ### Compare the performance of the two model versions
# Convert y_val to a DataFrame
ground_truth = y_val.to_frame()

# Reset indices to ensure alignment
gb_predictions = gb_predictions.reset_index(drop=True)
rf_predictions = rf_predictions.reset_index(drop=True)
ground_truth = ground_truth.reset_index(drop=True)

# Combine the two sets of predictions and the ground truth into a single DataFrame
combined_df = pd.concat([gb_predictions, rf_predictions, ground_truth], axis=1)
combined_df.columns = ['gb_preds', 'rf_preds', 'ground_truth']

# Display the combined DataFrame
display(combined_df)
%md ### Calculate the mean squared error for the predictions of each model relative to the ground truth
from sklearn.metrics import mean_squared_error

mse_rf = mean_squared_error(combined_df['rf_preds'], combined_df['ground_truth'])
mse_gb = mean_squared_error(combined_df['gb_preds'], combined_df['ground_truth'])

print(f"Random Forest model mean squared error: {mse_rf}")
print(f"Gradient Booster model mean squared error: {mse_gb}")
%md ## Deploy the new model version using the "Champion" alias

The Random Forest model is performing better. The following code assigns the "Champion" alias to the Random Forest model version.
new_model_version = get_latest_model_version(MODEL_NAME)

client.set_registered_model_alias(
    name=MODEL_NAME,
    alias="Champion",
    version=new_model_version
)
%md There are now two model versions of the forecasting model: the Gradient Booster version and the Random Forest version. At this point, both the "Champion" alias and the "Challenger" alias are assigned to the Random Forest version. This ensures that any downstream workloads that target the "Challenger" model version continue to run successfully.
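%md The following optional cell is a minimal sketch showing how to confirm which model version each alias currently points to, using `MlflowClient.get_model_version_by_alias()`.
# Optional: verify that both aliases now resolve to the same (Random Forest) model version
client = MlflowClient()

champion_version = client.get_model_version_by_alias(MODEL_NAME, "Champion")
challenger_version = client.get_model_version_by_alias(MODEL_NAME, "Challenger")
print(f"Champion alias points to version {champion_version.version}")
print(f"Challenger alias points to version {challenger_version.version}")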
%md ## Remove a model alias

To remove an alias from a model version, use `delete_registered_model_alias`, as shown in the following cell.
client.delete_registered_model_alias(name=MODEL_NAME, alias="Challenger")
%md ## Delete model versions and models
%md When a model version is no longer being used, you can delete it. You can also delete an entire registered model; this removes all of its associated model versions. Note that deleting a model version clears any aliases assigned to the model version.
client.delete_model_version(
    name=MODEL_NAME,
    version=1,
)
client = MlflowClient()
client.delete_registered_model(name=MODEL_NAME)