feature-store-with-uc-basic-example(Python)

Basic example for Feature Engineering in Unity Catalog

This notebook illustrates how to use Databricks Feature Engineering in Unity Catalog to create, store, and manage Unity Catalog features, and then use them to train ML models and make batch predictions, including with features whose values are only available at prediction time. In this example, the goal is to predict wine quality using an ML model with a variety of static wine features and a real-time input.

This notebook shows how to:

  • Create a feature table and use it to build a training dataset for a machine learning model.
  • Modify the feature table and use the updated table to create a new version of the model.
  • Use the Databricks Features UI to determine how features relate to models.
  • Perform batch scoring using automatic feature lookup.

Requirements

  • Databricks Runtime 13.2 for Machine Learning or above.
    • If you do not have access to Databricks Runtime for Machine Learning, you can run this notebook on Databricks Runtime 13.2 or above. To do so, run %pip install databricks-feature-engineering at the start of this notebook.

Load dataset

The code in the following cell loads the dataset and does some minor data preparation: it creates a unique ID for each observation and removes spaces from the column names. The unique ID column (wine_id) is the primary key of the feature table and is used to look up features.
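The cell itself is not reproduced on this page. As a rough illustration of the two preparation steps, the plain-Python sketch below works on a list of dictionaries rather than a DataFrame. The helper names and the sample rows are hypothetical and are not taken from the notebook.

```python
def rename_columns(rows):
    """Replace spaces in column names with underscores."""
    return [
        {name.replace(" ", "_"): value for name, value in row.items()}
        for row in rows
    ]


def add_id_column(rows, id_column_name="wine_id"):
    """Prepend a unique integer ID to each row (hypothetical helper)."""
    return [{id_column_name: i, **row} for i, row in enumerate(rows)]


# Hypothetical sample rows, for illustration only.
raw_rows = [
    {"fixed acidity": 7.4, "residual sugar": 1.9, "quality": 5},
    {"fixed acidity": 7.8, "residual sugar": 2.6, "quality": 5},
]

prepared = add_id_column(rename_columns(raw_rows))
```

Each prepared row now starts with a wine_id value that no other row shares, and every column name is free of spaces.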

Create a new catalog or reuse an existing catalog

To create a new catalog, you must have the CREATE CATALOG privilege on the metastore. To use an existing catalog, you must have the USE CATALOG privilege on the catalog.

Create a new schema in the catalog

To create a new schema in the catalog, you must have the CREATE SCHEMA privilege on the catalog.
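A minimal sketch of the SQL behind the catalog and schema steps, assuming a catalog named ml and a schema named wine_db (both names are placeholders). The helper only builds the statements; in a Databricks notebook each one would be passed to spark.sql.

```python
def uc_setup_statements(catalog, schema):
    """Return SQL that creates (if needed) and selects a catalog and schema."""
    return [
        # Requires the CREATE CATALOG privilege on the metastore.
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        # Requires the USE CATALOG privilege on the catalog.
        f"USE CATALOG {catalog}",
        # Requires the CREATE SCHEMA privilege on the catalog.
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"USE SCHEMA {schema}",
    ]


statements = uc_setup_statements("ml", "wine_db")  # placeholder names

# In a Databricks notebook, where `spark` is predefined:
# for statement in statements:
#     spark.sql(statement)
```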

Create the feature table

The first step is to create a FeatureEngineeringClient.

Create the feature table. For a complete API reference, see (AWS|Azure|GCP).

You can also use create_table without providing a dataframe, and then later populate the feature table using fe.write_table.

Example:

fe.create_table(
    name=table_name,
    primary_keys=["wine_id"],
    schema=features_df.schema,
    description="wine features"
)

fe.write_table(
    name=table_name,
    df=features_df,
    mode="merge"
)

Train a model with Feature Engineering in Unity Catalog

The feature table does not include the prediction target. However, the training dataset needs the prediction target values. There may also be features that are not available until the time the model is used for inference.

This example uses the feature real_time_measurement to represent a characteristic of the wine that can only be observed at inference time. This feature is used in training and the feature value for a wine is provided at inference time.

Use a FeatureLookup to build a training dataset that uses the specified lookup_key to look up features from the feature table and combines them with the online feature real_time_measurement. If you do not specify the feature_names parameter, all features except the primary key are returned.
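To make the lookup behavior concrete, here is a conceptual sketch in plain Python. It is not the library's implementation: lookup_features, the column names other than wine_id and real_time_measurement, and all sample values are hypothetical.

```python
def lookup_features(input_rows, feature_table, lookup_key, feature_names=None):
    """Join each input row with its feature row, mimicking a feature lookup."""
    by_key = {row[lookup_key]: row for row in feature_table}
    joined = []
    for row in input_rows:
        features = by_key[row[lookup_key]]
        # With no feature_names, take every feature column except the primary key.
        names = feature_names or [c for c in features if c != lookup_key]
        joined.append({**row, **{name: features[name] for name in names}})
    return joined


# Hypothetical feature table rows keyed by wine_id.
feature_table = [
    {"wine_id": 0, "alcohol": 9.4, "pH": 3.51},
    {"wine_id": 1, "alcohol": 9.8, "pH": 3.20},
]

# The input carries the key, the inference-time feature, and the label.
labeled_input = [
    {"wine_id": 1, "real_time_measurement": 0.7, "quality": 5},
    {"wine_id": 0, "real_time_measurement": 0.2, "quality": 5},
]

training_rows = lookup_features(labeled_input, feature_table, lookup_key="wine_id")
```

The input supplies only the key, the inference-time feature, and the label; every other column comes from the feature table. In the notebook, the library performs this join when the FeatureLookup is passed to the client's create_training_set method.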

The code in the next cell trains a scikit-learn RandomForestRegressor model and logs the model with Feature Engineering in UC.

The code starts an MLflow experiment to track training parameters and results. Note that model autologging is disabled (mlflow.sklearn.autolog(log_models=False)); this is because the model is logged using fe.log_model.
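The training cell is not shown on this page. The sketch below outlines the pattern described above, with the client and the mlflow module passed in as arguments so that the function does not depend on a Databricks environment. The function name, the artifact path, and the argument layout are assumptions. The keyword arguments to fe.log_model follow the databricks-feature-engineering client as commonly documented; check them against the API reference for your runtime.

```python
def train_and_log(fe, mlflow, model, training_set, X_train, y_train,
                  registered_model_name):
    """Fit a model inside an MLflow run and log it through the feature client.

    fe is a FeatureEngineeringClient and mlflow is the imported mlflow module.
    """
    # Autologging still records parameters and metrics, but not the model,
    # because the model is logged explicitly with fe.log_model below.
    mlflow.sklearn.autolog(log_models=False)
    with mlflow.start_run():
        model.fit(X_train, y_train)
        fe.log_model(
            model=model,
            artifact_path="model",  # hypothetical artifact path
            flavor=mlflow.sklearn,
            training_set=training_set,  # lets the packaged model look up features
            registered_model_name=registered_model_name,
        )
    return model
```

Passing training_set to fe.log_model is what packages the feature lookup metadata with the model, so that batch scoring can later retrieve features by key.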

To view the logged model, navigate to the MLflow Experiments page for this notebook. To access the Experiments page, click the Experiments icon on the left navigation bar.

Find the notebook experiment in the list. It has the same name as the notebook, in this case, "Basic example for Feature Engineering in Unity Catalog".

Click the experiment name to display the experiment page. The packaged Feature Engineering in UC model, created when you called fe.log_model, appears in the Artifacts section of this page. You can use this model for batch scoring.

The model is also automatically registered in Unity Catalog.

Batch scoring

Use score_batch to apply a packaged Feature Engineering in UC model to new data for inference. The input data only needs the primary key column wine_id and the real-time feature real_time_measurement. The model automatically looks up all of the other feature values from the feature tables.
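A hedged sketch of batch scoring against the newest registered version of a model. The two helper names are hypothetical. mlflow_client is assumed to be an mlflow.tracking.MlflowClient and fe a FeatureEngineeringClient, both passed in so the functions can be read and exercised outside Databricks.

```python
def latest_model_version(mlflow_client, model_name):
    """Return the highest registered version number for model_name.

    Raises ValueError if the model has no registered versions.
    """
    versions = mlflow_client.search_model_versions(f"name='{model_name}'")
    return max(int(mv.version) for mv in versions)


def score_with_latest(fe, mlflow_client, model_name, batch_df):
    """Score batch_df with the newest version of a model logged via fe.log_model."""
    version = latest_model_version(mlflow_client, model_name)
    # batch_df only needs wine_id and real_time_measurement; the packaged
    # model looks up the remaining features from the feature table.
    return fe.score_batch(
        model_uri=f"models:/{model_name}/{version}",
        df=batch_df,
    )
```

Versions are compared as integers because comparing them as strings would rank "9" above "10".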

Modify feature table

Suppose you modify the dataframe by adding a new feature. You can use fe.write_table with mode="merge" to update the feature table.

Update the feature table using fe.write_table with mode="merge".
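The sketch below illustrates what a merge amounts to, assuming mode="merge" behaves as an upsert keyed on the primary key and that columns new to the table are added as features. It is plain Python for illustration, not the library's implementation; merge_rows, the is_strong column, and the sample values are hypothetical.

```python
def merge_rows(existing_rows, new_rows, primary_key):
    """Upsert new_rows into existing_rows by primary key (conceptual sketch)."""
    merged = {row[primary_key]: dict(row) for row in existing_rows}
    for row in new_rows:
        # Rows with a matching key are updated; rows with a new key are inserted.
        merged.setdefault(row[primary_key], {}).update(row)
    return list(merged.values())


existing = [
    {"wine_id": 0, "alcohol": 9.4},
    {"wine_id": 1, "alcohol": 9.8},
]

# The updated data carries a new feature column, is_strong.
updated = [
    {"wine_id": 1, "alcohol": 10.0, "is_strong": False},
    {"wine_id": 2, "alcohol": 12.5, "is_strong": True},
]

merged = merge_rows(existing, updated, primary_key="wine_id")
```

Row 0 is untouched, row 1 is updated, and row 2 is inserted. In this sketch row 0 simply lacks the new column; in a table it would be expected to hold a null value there.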

To read feature data from the feature tables, use fe.read_table().

Train a new model version using the updated feature table

Build a training dataset that uses the indicated key to look up features.

Apply the latest version of the registered MLflow model to features using score_batch.

Control permissions for and delete feature tables

  • To control who has access to a Unity Catalog feature table, use the Permissions button on the Catalog Explorer table details page.
  • To delete a Unity Catalog feature table, click the kebab menu on the Catalog Explorer table details page and select Delete. When you delete a Unity Catalog feature table using the UI, the corresponding Delta table is also deleted.