Advanced example for Feature Engineering in Unity Catalog
This notebook illustrates the use of Feature Engineering in Unity Catalog to create a model that predicts NYC Yellow Taxi fares. It includes these steps:
- Compute and write time series features directly in Unity Catalog.
- Train a model using these features to predict fares.
- Evaluate that model on a fresh batch of data using existing features.
Requirements
- Databricks Runtime 13.3 LTS for Machine Learning or above
- If you do not have access to Databricks Runtime for Machine Learning, you can run this notebook on Databricks Runtime 13.3 LTS or above. To do so, run %pip install databricks-feature-engineering at the start of this notebook.

Compute features
Load the raw data used to compute features
Load the nyc-taxi-tiny dataset, which was generated from the full NYC Taxi Data (found at /databricks-datasets/nyctaxi) by applying the following transformations:
- Apply a UDF to convert latitude and longitude coordinates into ZIP codes, and add a ZIP code column to the DataFrame.
- Subsample the dataset into a smaller dataset based on a date range query using the .sample() method of the Spark DataFrame API.
- Rename certain columns and drop unnecessary columns.
From the transactional taxi fare data, we will compute two groups of features based on trip pickup and dropoff ZIP codes (a windowed-aggregation sketch follows the lists below).
Pickup features
- Count of trips (time window = 1 hour, sliding window = 15 minutes)
- Mean fare amount (time window = 1 hour, sliding window = 15 minutes)
Drop off features
- Count of trips (time window = 30 minutes)
- Whether the trip ends on the weekend (custom feature using Python code)
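As a sketch of the windowed aggregations described above (raw_df and the column names tpep_pickup_datetime, pickup_zip, and fare_amount are assumptions, not taken from the notebook's helper code):
from pyspark.sql import functions as F

# Sketch only: compute the pickup features over a 1-hour window that
# slides every 15 minutes, grouped by pickup ZIP code.
pickup_features = (
    raw_df.groupBy(
        "pickup_zip",
        F.window("tpep_pickup_datetime", "1 hour", "15 minutes"),
    )
    .agg(
        F.mean("fare_amount").alias("mean_fare_window_1h_pickup_zip"),
        F.count(F.lit(1)).alias("count_trips_window_1h_pickup_zip"),
    )
    .select(
        F.col("pickup_zip").alias("zip"),
        F.col("window.end").alias("ts"),
        "mean_fare_window_1h_pickup_zip",
        "count_trips_window_1h_pickup_zip",
    )
)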

Helper functions
Data scientist's custom code to compute features
Create new time series feature tables in Unity Catalog
First, create a new catalog or reuse an existing one and create the schema where the feature tables will be stored.
- To create a new catalog, you must have the CREATE CATALOG privilege on the metastore.
- To use an existing catalog, you must have the USE CATALOG privilege on the catalog.
- To create a new schema in the catalog, you must have the CREATE SCHEMA privilege on the catalog.
Next, create time series feature tables in Unity Catalog with Primary Key Constraints.
You can directly create a table in Unity Catalog using CREATE TABLE SQL syntax. Use the primary key constraint to specify primary key columns. For time series tables, use TIMESERIES to annotate the timeseries column (AWS|Azure|GCP). The timeseries column must be of TimestampType or DateType.
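For instance, a minimal sketch of the pickup feature table DDL (the ml.taxi_example schema and the column names are assumptions carried over from the sketch above):
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS ml.taxi_example.trip_pickup_time_series_features (
        zip STRING NOT NULL,
        ts TIMESTAMP NOT NULL,
        mean_fare_window_1h_pickup_zip FLOAT,
        count_trips_window_1h_pickup_zip INT,
        CONSTRAINT trip_pickup_time_series_features_pk
            PRIMARY KEY (zip, ts TIMESERIES)
    )
    COMMENT 'Taxi fares. Pickup time series features.'
    """
)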
Write initial features to feature tables in Unity Catalog
Create an instance of the Feature Engineering client.
Use the write_table API to write features to the feature tables in Unity Catalog. To write to a time series feature table, the DataFrame must contain a column that you designate as the timeseries column.
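A minimal sketch of the initial write, assuming pickup_features is the DataFrame computed earlier:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Sketch: pickup_features must include the primary key column (zip) and
# the timeseries column (ts) declared in the table DDL.
fe.write_table(
    name="ml.taxi_example.trip_pickup_time_series_features",
    df=pickup_features,
    mode="merge",
)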
Update features
Use the write_table function to update the feature table values. When writing, merge mode is supported.
fe.write_table(
    name="ml.taxi_example.trip_pickup_time_series_features",
    df=new_pickup_features,
    mode="merge",
)
Data can also be streamed into feature tables by passing a DataFrame where df.isStreaming is True:
fe.write_table(
    name="ml.taxi_example.trip_pickup_time_series_features",
    df=streaming_pickup_features,
    mode="merge",
)
You can schedule a notebook to periodically update features using Databricks Jobs (AWS|Azure|GCP).
Analysts can interact with feature tables in Unity Catalog using SQL, for example:
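The following query is a sketch; the aggregated column names are assumptions:
display(
    spark.sql(
        """
        SELECT SUM(count_trips_window_30m_dropoff_zip) AS num_rides
        FROM ml.taxi_example.trip_dropoff_time_series_features
        WHERE dropoff_is_weekend = 1
        """
    )
)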
Feature search and discovery
You can now discover your feature tables in Unity Catalog in the Features UI. Search for "ml.taxi_example.trip_pickup_time_series_features" or "ml.taxi_example.trip_dropoff_time_series_features" and click the table name to see details such as table schema, metadata, and lineage in the Catalog Explorer UI. You can also edit the description for the feature table. For more information about feature discovery and tracking feature lineage, see the documentation (AWS|Azure|GCP).
You can also set feature table permissions in the Catalog Explorer UI. For details, see the documentation (AWS|Azure|GCP).
Train a model
This section illustrates how to create a training set with the time series pickup and dropoff feature tables using point-in-time lookup and train a model using the training set. It trains a LightGBM model to predict taxi fare.
Helper functions
Understanding how a training dataset is created
To train a model, you need to create a training dataset. The training dataset consists of:
- Raw input data
- Features from the feature tables in Unity Catalog
The raw input data is needed because it contains:
- Primary keys and timeseries columns, which are used to join with features with point-in-time correctness (AWS|Azure|GCP).
- Raw features, like trip_distance, that are not in the feature tables.
- Prediction targets, like fare, that are required for model training.
Here's a visual overview that shows the raw input data being combined with the features in the Unity Catalog to produce the training dataset:

These concepts are described further in the Creating a Training Dataset documentation (AWS|Azure|GCP).
The next cell loads features from Unity Catalog for model training by creating a FeatureLookup for each needed feature.
To perform a point-in-time lookup for feature values from a time series feature table, you must specify a timestamp_lookup_key in the feature's FeatureLookup, which indicates the name of the DataFrame column that contains the timestamps against which to look up time series features. For each row in the DataFrame, the feature values retrieved are the latest feature values prior to the timestamp in the DataFrame's timestamp_lookup_key column whose primary keys match the values in the DataFrame's lookup_key columns, or null if no such feature value exists.
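For example, a sketch of a FeatureLookup against the pickup table (the feature and column names are assumptions):
from databricks.feature_engineering import FeatureLookup

# Sketch: look up pickup features by ZIP code, with point-in-time
# correctness against the trip's pickup timestamp. Names are hypothetical.
pickup_feature_lookups = [
    FeatureLookup(
        table_name="ml.taxi_example.trip_pickup_time_series_features",
        feature_names=[
            "mean_fare_window_1h_pickup_zip",
            "count_trips_window_1h_pickup_zip",
        ],
        lookup_key=["pickup_zip"],
        timestamp_lookup_key="tpep_pickup_datetime",
    )
]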
Configure MLflow client to access models in Unity Catalog
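This amounts to pointing the MLflow model registry at Unity Catalog:
import mlflow

# Use the Unity Catalog model registry instead of the workspace registry.
mlflow.set_registry_uri("databricks-uc")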
When fe.create_training_set(..) is invoked, the following steps take place:
1. A TrainingSet object is created, which selects specific features from feature tables to use in training your model. Each feature is specified by the FeatureLookups created previously.
2. Features are joined with the raw input data according to each FeatureLookup's lookup_key.
3. Point-in-time lookup is applied to avoid data leakage problems. Only the most recent feature values, based on timestamp_lookup_key, are joined.
The TrainingSet is then transformed into a DataFrame for training. This DataFrame includes the columns of taxi_data, as well as the features specified in the FeatureLookups.
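A sketch of these steps in code (taxi_data, the dropoff lookup list, and the label column name are assumptions):
# Sketch: build the training set and materialize it as a DataFrame.
training_set = fe.create_training_set(
    df=taxi_data,
    feature_lookups=pickup_feature_lookups + dropoff_feature_lookups,
    label="fare_amount",
)
training_df = training_set.load_df()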
Train a LightGBM model on the data returned by TrainingSet.load_df, then log the model with FeatureEngineeringClient.log_model. The model will be packaged with feature metadata.
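A condensed sketch of training and logging (the hyperparameters and registered model name are assumptions, and the remaining columns are assumed numeric):
import lightgbm as lgb
import mlflow

# Sketch: convert to pandas, split off the label, fit, then log the model
# together with the feature metadata from the training set.
data = training_df.toPandas()
X = data.drop(columns=["fare_amount"])
y = data["fare_amount"]

model = lgb.LGBMRegressor(num_leaves=32, n_estimators=100)
model.fit(X, y)

fe.log_model(
    model=model,
    artifact_path="model_packaged",
    flavor=mlflow.lightgbm,
    training_set=training_set,
    registered_model_name="ml.taxi_example.taxi_example_fare_time_series",
)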
See the model lineage in Catalog Explorer
Visit the table details page in Catalog Explorer. Go to "Lineage" tab and click "See lineage graph". You can see that the feature table now has a downstream model lineage.
Build and log a custom PyFunc model
To add preprocessing or post-processing code to the model and generate processed predictions with batch inference, you can build a custom PyFunc MLflow model that encapsulates these methods. The following cell shows an example that returns a string output based on the numeric prediction from the model.
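A sketch of such a wrapper (the threshold and string labels are arbitrary):
import mlflow.pyfunc

class FareClassifier(mlflow.pyfunc.PythonModel):
    # Hypothetical post-processing: map each numeric fare prediction
    # onto a string label.
    def __init__(self, trained_model):
        self.model = trained_model

    def predict(self, context, model_input):
        preds = self.model.predict(model_input)
        return ["EXPENSIVE" if p > 15 else "CHEAP" for p in preds]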
Scoring: batch inference
Suppose another data scientist now wants to apply this model to a different batch of data.
Display the data to use for inference, reordered to highlight the fare_amount column, which is the prediction target.
Use the score_batch API to evaluate the model on the batch of data, retrieving needed features from Feature Engineering in Unity Catalog.
When you score a model trained with features from time series feature tables, the appropriate features are retrieved using point-in-time lookups with metadata packaged with the model during training. The DataFrame you provide to FeatureEngineeringClient.score_batch must contain a timestamp column with the same name and DataType as the timestamp_lookup_key of the FeatureLookup provided to FeatureEngineeringClient.create_training_set.
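A sketch of the call (the model version and new_taxi_data are assumptions):
# Sketch: features are looked up automatically using the metadata logged
# with the model; new_taxi_data only needs the raw columns, lookup keys,
# and the timestamp column.
predictions_df = fe.score_batch(
    model_uri="models:/ml.taxi_example.taxi_example_fare_time_series/1",
    df=new_taxi_data,
)
display(predictions_df.select("fare_amount", "prediction"))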
To score using the logged PyFunc model:
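A sketch, assuming the wrapper above was logged and registered under a hypothetical name:
# Sketch: result_type reflects the string output of the PyFunc wrapper.
pyfunc_predictions = fe.score_batch(
    model_uri="models:/ml.taxi_example.pyfunc_taxi_fare_time_series/1",
    df=new_taxi_data,
    result_type="string",
)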

View the taxi fare predictions
This code reorders the columns to show the taxi fare predictions in the first column. Note that the predicted_fare_amount roughly lines up with the actual fare_amount, although more data and feature engineering would be required to improve the model accuracy.
View the PyFunc predictions
Next steps
- Explore the feature tables created in this example in the Features UI.
- Publish the feature tables to online stores (AWS|Azure).
- Deploy the model in Unity Catalog to Model Serving (AWS|Azure).
- Adapt this notebook to your own data and create your own feature tables.