feature-store-with-uc-taxi-example(Python)

Advanced example for Feature Engineering in Unity Catalog

This notebook illustrates the use of Feature Engineering in Unity Catalog to create a model that predicts NYC Yellow Taxi fares. It includes these steps:

  • Compute and write time series features directly in Unity Catalog.
  • Train a model using these features to predict fares.
  • Evaluate that model on a fresh batch of data using existing features.

Requirements

  • Databricks Runtime 13.3 LTS for Machine Learning or above
    • If you do not have access to Databricks Runtime for Machine Learning, you can run this notebook on Databricks Runtime 13.3 LTS or above. To do so, run %pip install databricks-feature-engineering at the start of this notebook.

Compute features

Load the raw data used to compute features

Load the nyc-taxi-tiny dataset. This was generated from the full NYC Taxi Data, which can be found at /databricks-datasets/nyctaxi, by applying the following transformations:

  1. Apply a UDF to convert latitude and longitude coordinates into ZIP codes, and add a ZIP code column to the DataFrame.
  2. Subsample the full dataset into a smaller one, based on a date range query, using the .sample() method of the Spark DataFrame API.
  3. Rename certain columns and drop unnecessary columns.
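The loading cell itself is not included in this export; the following is a minimal sketch, assuming the subsampled dataset ships at the standard Databricks sample-data path:

    # Load the pre-transformed nyc-taxi-tiny dataset. The exact path is an
    # assumption based on the standard /databricks-datasets layout.
    raw_data = spark.read.format("delta").load(
        "/databricks-datasets/nyctaxi-with-zipcodes/subsampled"
    )
    display(raw_data)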

From the taxi fares transactional data, we compute two groups of features based on trip pickup and drop-off ZIP codes.

Pickup features

  1. Count of trips (time window = 1 hour, sliding window = 15 minutes)
  2. Mean fare amount (time window = 1 hour, sliding window = 15 minutes)

Drop off features

  1. Count of trips (time window = 30 minutes)
  2. Does the trip end on the weekend (custom feature using Python code)

Helper functions

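The helper cell is not shown in this export; below is a minimal sketch of the kind of utility the feature-computation code relies on, assuming a filter_df_by_ts helper that restricts a DataFrame to a timestamp range (the name and signature are illustrative):

    from pyspark.sql import DataFrame

    def filter_df_by_ts(df: DataFrame, ts_column: str, start_date, end_date) -> DataFrame:
        """Keep only rows whose ts_column falls in [start_date, end_date)."""
        if ts_column and start_date:
            df = df.filter(df[ts_column] >= start_date)
        if ts_column and end_date:
            df = df.filter(df[ts_column] < end_date)
        return df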

Data scientist's custom code to compute features

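The original feature-computation cells are not included here; the sketch below shows how the two feature groups described above could be computed with Spark window aggregations. The raw column names (pickup_zip, dropoff_zip, fare_amount, tpep_pickup_datetime, tpep_dropoff_datetime), the output feature names, and the date range are assumptions chosen to stay consistent with the tables referenced later in this notebook:

    from datetime import datetime

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    def pickup_features_fn(df, ts_column, start_date, end_date):
        """Pickup features: hourly trip count and mean fare per pickup ZIP,
        recomputed on a 15-minute sliding window."""
        df = filter_df_by_ts(df, ts_column, start_date, end_date)
        return (
            df.groupBy("pickup_zip", F.window(ts_column, "1 hour", "15 minutes"))
            .agg(
                F.mean("fare_amount").alias("mean_fare_window_1h_pickup_zip"),
                F.count("*").alias("count_trips_window_1h_pickup_zip"),
            )
            .select(
                F.col("pickup_zip").alias("zip"),
                F.col("window.end").alias("ts"),
                "mean_fare_window_1h_pickup_zip",
                "count_trips_window_1h_pickup_zip",
            )
        )

    @F.udf(returnType=IntegerType())
    def is_weekend(dt):
        """Custom feature in Python: 1 if the timestamp falls on Sat/Sun."""
        return int(dt.weekday() >= 5) if dt is not None else None

    def dropoff_features_fn(df, ts_column, start_date, end_date):
        """Dropoff features: 30-minute trip count plus the weekend flag."""
        df = filter_df_by_ts(df, ts_column, start_date, end_date)
        return (
            df.groupBy("dropoff_zip", F.window(ts_column, "30 minutes"))
            .agg(F.count("*").alias("count_trips_window_30m_dropoff_zip"))
            .select(
                F.col("dropoff_zip").alias("zip"),
                F.col("window.end").alias("ts"),
                "count_trips_window_30m_dropoff_zip",
                is_weekend(F.col("window.end")).alias("dropoff_is_weekend"),
            )
        )

    # Illustrative date range for the subsampled data.
    pickup_features = pickup_features_fn(
        raw_data, "tpep_pickup_datetime", datetime(2016, 1, 1), datetime(2016, 2, 1)
    )
    dropoff_features = dropoff_features_fn(
        raw_data, "tpep_dropoff_datetime", datetime(2016, 1, 1), datetime(2016, 2, 1)
    )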

Create new time series feature tables in Unity Catalog

First, create a new catalog or reuse an existing one and create the schema where the feature tables will be stored.

  • To create a new catalog, you must have the CREATE CATALOG privilege on the metastore.
  • To use an existing catalog, you must have the USE CATALOG privilege on the catalog.
  • To create a new schema in the catalog, you must have the CREATE SCHEMA privilege on the catalog.
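A sketch of the setup, using the ml catalog and taxi_example schema that the rest of this notebook references:

    # Requires CREATE CATALOG on the metastore (or USE CATALOG on an existing one).
    spark.sql("CREATE CATALOG IF NOT EXISTS ml")
    spark.sql("USE CATALOG ml")
    # Requires CREATE SCHEMA on the catalog.
    spark.sql("CREATE SCHEMA IF NOT EXISTS taxi_example")
    spark.sql("USE SCHEMA taxi_example")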

Next, create time series feature tables in Unity Catalog with Primary Key Constraints.

You can directly create a table in Unity Catalog using CREATE TABLE SQL syntax. Use the primary key constraint to specify primary key columns. For time series tables, use TIMESERIES to annotate the timeseries column (AWS|Azure|GCP).

The timeseries column must be of TimestampType or DateType.

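A sketch of the DDL for the pickup table; the dropoff table (zip, ts, count_trips_window_30m_dropoff_zip, dropoff_is_weekend) is created the same way. Column names and types are assumptions matching the feature-computation sketch above:

    spark.sql("""
        CREATE TABLE IF NOT EXISTS ml.taxi_example.trip_pickup_time_series_features (
            zip INT NOT NULL,
            ts TIMESTAMP NOT NULL,
            mean_fare_window_1h_pickup_zip FLOAT,
            count_trips_window_1h_pickup_zip INT,
            CONSTRAINT trip_pickup_time_series_features_pk
                PRIMARY KEY (zip, ts TIMESERIES)
        )
        COMMENT 'Taxi fares. Pickup time series features.'
    """)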

Write initial features to feature tables in Unity Catalog

Create an instance of the Feature Engineering client.

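Using the databricks-feature-engineering package installed per the requirements above:

    from databricks.feature_engineering import FeatureEngineeringClient

    fe = FeatureEngineeringClient()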

Use the write_table API to write features to the feature tables in Unity Catalog.

To write to a time series feature table, the DataFrame must contain a column that you designate as the timeseries column.

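A sketch, assuming the pickup_features and dropoff_features DataFrames computed earlier (ts is the timeseries column):

    fe.write_table(
        name="ml.taxi_example.trip_pickup_time_series_features",
        df=pickup_features,
    )
    fe.write_table(
        name="ml.taxi_example.trip_dropoff_time_series_features",
        df=dropoff_features,
    )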

Update features

Use the write_table function to update the feature table values.


When writing, merge mode is supported:

    fe.write_table(
        name="ml.taxi_example.trip_pickup_time_series_features",
        df=new_pickup_features,
        mode="merge",
    )

Data can also be streamed into feature tables by passing a DataFrame where df.isStreaming is True:

    fe.write_table(
        name="ml.taxi_example.trip_pickup_time_series_features",
        df=streaming_pickup_features,
        mode="merge",
    )

You can schedule a notebook to periodically update features using Databricks Jobs (AWS|Azure|GCP).

Analysts can interact with feature tables in Unity Catalog using SQL, for example:

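For example, totaling the 30-minute dropoff trip counts by weekend flag (feature names as assumed in the sketches above):

    display(spark.sql("""
        SELECT SUM(count_trips_window_30m_dropoff_zip) AS num_rides,
               dropoff_is_weekend
        FROM ml.taxi_example.trip_dropoff_time_series_features
        WHERE dropoff_is_weekend IS NOT NULL
        GROUP BY dropoff_is_weekend
    """))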

Feature search and discovery

You can now discover your feature tables in Unity Catalog in the Features UI. Search by "ml.taxi_example.trip_pickup_time_series_features" or "ml.taxi_example.trip_dropoff_time_series_features" and click the table name to see details such as table schema, metadata, and lineage in the Catalog Explorer UI. You can also edit the description for the feature table. For more information about feature discovery and tracking feature lineage, see (AWS|Azure|GCP).

You can also set feature table permissions in the Catalog Explorer UI. For details, see (AWS|Azure|GCP).

Train a model

This section illustrates how to create a training set from the time series pickup and dropoff feature tables using point-in-time lookups, and then train a model on that training set. It trains a LightGBM model to predict taxi fare.

Helper functions

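The helper cell is not shown in this export; one plausible utility, used later to build model URIs for scoring, is a lookup of the latest registered model version:

    from mlflow.tracking import MlflowClient

    def get_latest_model_version(model_name):
        """Return the highest version number registered for model_name."""
        latest_version = 1
        mlflow_client = MlflowClient()
        for mv in mlflow_client.search_model_versions(f"name='{model_name}'"):
            latest_version = max(latest_version, int(mv.version))
        return latest_version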

Understanding how a training dataset is created

To train a model, you need to create a training dataset. The training dataset consists of:

  1. Raw input data
  2. Features from the feature tables in Unity Catalog

The raw input data is needed because it contains:

  1. Primary key and timeseries columns, used to join with features with point-in-time correctness (AWS|Azure|GCP).
  2. Raw features like trip_distance that are not in the feature tables.
  3. Prediction targets like fare_amount that are required for model training.

Conceptually, the raw input data is combined with the features in Unity Catalog to produce the training dataset.

These concepts are described further in the Creating a Training Dataset documentation (AWS|Azure|GCP).

The next cell loads features from Unity Catalog for model training by creating a FeatureLookup for each needed feature.

To perform a point-in-time lookup for feature values from a time series feature table, you must specify a timestamp_lookup_key in the feature's FeatureLookup, which indicates the name of the DataFrame column that contains timestamps against which to look up time series features. For each row in the DataFrame, the feature values retrieved are the latest feature values whose timestamps are prior to the timestamp in the DataFrame's timestamp_lookup_key column and whose primary keys match the values in the DataFrame's lookup_key columns, or null if no such feature value exists.

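A sketch of the lookups; the raw-data column names (pickup_zip, dropoff_zip, tpep_pickup_datetime, tpep_dropoff_datetime) and feature names are assumptions consistent with the sketches above:

    from databricks.feature_engineering import FeatureLookup

    pickup_feature_lookups = [
        FeatureLookup(
            table_name="ml.taxi_example.trip_pickup_time_series_features",
            feature_names=[
                "mean_fare_window_1h_pickup_zip",
                "count_trips_window_1h_pickup_zip",
            ],
            lookup_key=["pickup_zip"],
            timestamp_lookup_key="tpep_pickup_datetime",
        ),
    ]

    dropoff_feature_lookups = [
        FeatureLookup(
            table_name="ml.taxi_example.trip_dropoff_time_series_features",
            feature_names=[
                "count_trips_window_30m_dropoff_zip",
                "dropoff_is_weekend",
            ],
            lookup_key=["dropoff_zip"],
            timestamp_lookup_key="tpep_dropoff_datetime",
        ),
    ]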

Configure MLflow client to access models in Unity Catalog

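Point the MLflow registry at Unity Catalog:

    import mlflow

    mlflow.set_registry_uri("databricks-uc")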

When fe.create_training_set(..) is invoked, the following steps take place:

  1. A TrainingSet object is created, which selects specific features from feature tables to use in training your model. Each feature is specified by the FeatureLookups created previously.

  2. Features are joined with the raw input data according to each FeatureLookup's lookup_key.

  3. Point-in-time lookup is applied to avoid data leakage problems. Only the most recent feature values, based on timestamp_lookup_key, are joined.

The TrainingSet is then transformed into a DataFrame for training. This DataFrame includes the columns of taxi_data, as well as the features specified in the FeatureLookups.

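A sketch, assuming taxi_data is the raw input DataFrame loaded earlier; the timestamp columns are excluded because they are only needed for the point-in-time join, which keeps the training data numeric:

    exclude_columns = ["tpep_pickup_datetime", "tpep_dropoff_datetime"]

    training_set = fe.create_training_set(
        df=taxi_data,
        feature_lookups=pickup_feature_lookups + dropoff_feature_lookups,
        label="fare_amount",
        exclude_columns=exclude_columns,
    )
    training_df = training_set.load_df()
    display(training_df)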

Train a LightGBM model on the data returned by TrainingSet.load_df, then log the model with FeatureEngineeringClient.log_model. The model will be packaged with feature metadata.

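A sketch of the training and logging step; the hyperparameters and registered model name are illustrative:

    import lightgbm as lgb
    import mlflow.lightgbm
    from sklearn.model_selection import train_test_split

    # Convert to pandas; columns are numeric after the exclusions above.
    data = training_df.toPandas()
    train, test = train_test_split(data, random_state=123)
    X_train = train.drop(["fare_amount"], axis=1)
    y_train = train["fare_amount"]

    mlflow.lightgbm.autolog()
    model = lgb.train(
        {"num_leaves": 32, "objective": "regression", "metric": "rmse"},
        lgb.Dataset(X_train, label=y_train.values),
        num_boost_round=100,
    )

    # log_model packages the model together with feature lookup metadata,
    # so batch scoring can retrieve the features automatically.
    fe.log_model(
        model=model,
        artifact_path="model_packaged",
        flavor=mlflow.lightgbm,
        training_set=training_set,
        registered_model_name="ml.taxi_example.fare_time_series_packaged",
    )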

See the model lineage in Catalog Explorer

Visit the table details page in Catalog Explorer. Go to the "Lineage" tab and click "See lineage graph". You can see that the feature table now has a downstream model lineage.

Build and log a custom PyFunc model

To add preprocessing or post-processing code to the model and generate processed predictions with batch inference, you can build a custom PyFunc MLflow model that encapsulates these methods. The following cell shows an example that returns a string output based on the numeric prediction from the model.

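A sketch; the class name, fare threshold, and registered model name are illustrative:

    import mlflow.pyfunc

    class FareClassifier(mlflow.pyfunc.PythonModel):
        """Post-processes the numeric fare prediction into a string label."""

        def __init__(self, trained_model):
            self.model = trained_model

        def predict(self, context, model_input):
            prediction = self.model.predict(model_input)
            return ["EXPENSIVE" if fare > 15 else "CHEAP" for fare in prediction]

    fe.log_model(
        model=FareClassifier(model),
        artifact_path="pyfunc_packaged_model",
        flavor=mlflow.pyfunc,
        training_set=training_set,
        registered_model_name="ml.taxi_example.pyfunc_fare_time_series_packaged",
    )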

Scoring: batch inference

Suppose another data scientist now wants to apply this model to a different batch of data.

Display the data to use for inference, reordered to highlight the fare_amount column, which is the prediction target.

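A sketch, assuming new_taxi_data is the fresh batch with the same schema as the raw input:

    # Put the prediction target first for easier comparison with predictions later.
    cols = ["fare_amount"] + [c for c in new_taxi_data.columns if c != "fare_amount"]
    display(new_taxi_data.select(cols))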

Use the score_batch API to evaluate the model on the batch of data, retrieving needed features from Feature Engineering in Unity Catalog.

When you score a model trained with features from time series feature tables, the appropriate features are retrieved using point-in-time lookups with metadata packaged with the model during training. The DataFrame you provide to FeatureEngineeringClient.score_batch must contain a timestamp column with the same name and DataType as the timestamp_lookup_key of the FeatureLookup provided to FeatureEngineeringClient.create_training_set.

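A sketch, reusing the registered model name and the version helper defined earlier:

    model_name = "ml.taxi_example.fare_time_series_packaged"
    model_uri = f"models:/{model_name}/{get_latest_model_version(model_name)}"

    with_predictions = fe.score_batch(model_uri=model_uri, df=new_taxi_data)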

To score using the logged PyFunc model:

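For example (result_type matches the string output of the custom PyFunc):

    pyfunc_model_name = "ml.taxi_example.pyfunc_fare_time_series_packaged"
    pyfunc_model_uri = (
        f"models:/{pyfunc_model_name}/{get_latest_model_version(pyfunc_model_name)}"
    )

    pyfunc_predictions = fe.score_batch(
        model_uri=pyfunc_model_uri,
        df=new_taxi_data,
        result_type="string",  # the custom PyFunc returns string labels
    )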

View the taxi fare predictions

This code reorders the columns to show the taxi fare predictions in the first column. Note that the predicted_fare_amount roughly lines up with the actual fare_amount, although more data and feature engineering would be required to improve the model accuracy.

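A sketch; score_batch appends a prediction column, renamed here for display:

    import pyspark.sql.functions as F

    other_cols = [
        c for c in with_predictions.columns if c not in ("prediction", "fare_amount")
    ]
    display(
        with_predictions.select(
            F.col("prediction").alias("predicted_fare_amount"),
            "fare_amount",
            *other_cols,
        )
    )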

View the PyFunc predictions

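For example:

    display(pyfunc_predictions.select("fare_amount", "prediction"))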

Next steps

  1. Explore the feature tables created in this example in the Features UI.
  2. Publish the feature tables to online stores (AWS|Azure).
  3. Deploy the model in Unity Catalog to Model Serving (AWS|Azure).
  4. Adapt this notebook to your own data and create your own feature tables.