Feature Engineering in Unity Catalog

This page describes how to create and work with feature tables in Unity Catalog.

This page applies only to workspaces that are enabled for Unity Catalog. If your workspace is not enabled for Unity Catalog, see Work with features in Workspace Feature Store.

Requirements

Feature Engineering in Unity Catalog requires Databricks Runtime 13.2 or above. In addition, the Unity Catalog metastore must have Privilege Model Version 1.0.

Install Feature Engineering in Unity Catalog Python client

Feature Engineering in Unity Catalog provides a Python client FeatureEngineeringClient. The class is available on PyPI with the databricks-feature-engineering package and is pre-installed in Databricks Runtime 13.2 ML and above. If you are using a non-ML Databricks Runtime, you need to install the client manually. Use the compatibility matrix to find the right version for your Databricks Runtime version.

%pip install databricks-feature-engineering

dbutils.library.restartPython()

Create a catalog and a schema for feature tables in Unity Catalog

You must create a new catalog or use an existing catalog for feature tables.

To create a new catalog, you must have the CREATE CATALOG privilege on the metastore.

CREATE CATALOG IF NOT EXISTS <catalog-name>

To use an existing catalog, you must have the USE CATALOG privilege on the catalog.

USE CATALOG <catalog-name>

Feature tables in Unity Catalog must be stored in a schema. To create a new schema in the catalog, you must have the CREATE SCHEMA privilege on the catalog.

CREATE SCHEMA IF NOT EXISTS <schema-name>

Create a feature table in Unity Catalog

Note

You can use an existing Delta table in Unity Catalog as a feature table. See Use an existing Delta table in Unity Catalog as a feature table.

You can also use an existing Delta Live Table in Unity Catalog as a feature table. See Use an existing streaming table or materialized view created by a Delta Live Tables pipeline as a feature table.

Feature tables in Unity Catalog are Delta tables or Delta Live Tables managed by Unity Catalog. Feature tables must have a primary key. The name of a feature table in Unity Catalog has a three-level structure, <catalog-name>.<schema-name>.<table-name>.

You can use Databricks SQL or the Python FeatureEngineeringClient to create feature Delta tables in Unity Catalog. You can use Delta Live Tables pipelines to create feature Delta Live Tables in Unity Catalog.

You can use any Delta table with a Primary Key Constraint as a feature table. The following code shows how to create a table with a primary key:

CREATE TABLE ml.recommender_system.customer_features (
  customer_id int NOT NULL,
  feat1 long,
  feat2 varchar(100),
  CONSTRAINT customer_features_pk PRIMARY KEY (customer_id)
);

To create a time series feature table, add a time column as a primary key column and specify the TIMESERIES keyword. The TIMESERIES keyword requires Databricks Runtime 13.3 LTS or above.

CREATE TABLE ml.recommender_system.customer_features (
  customer_id int NOT NULL,
  ts timestamp NOT NULL,
  feat1 long,
  feat2 varchar(100),
  CONSTRAINT customer_features_pk PRIMARY KEY (customer_id, ts TIMESERIES)
);

After the table is created, you can write data to it in the same way you write other Delta tables, and it can be used as a feature table.

For details about the commands and parameters used in the following examples, see the Feature Engineering Python API reference.

  1. Write the Python functions to compute the features. The output of each function should be an Apache Spark DataFrame with a unique primary key. The primary key can consist of one or more columns.

  2. Create a feature table by instantiating a FeatureEngineeringClient and using create_table.

  3. Populate the feature table using write_table.

from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Prepare feature DataFrame
def compute_customer_features(data):
  ''' Feature computation code returns a DataFrame with 'customer_id' as primary key'''
  pass

customer_features_df = compute_customer_features(df)

# Create feature table with `customer_id` as the primary key.
# Take schema from DataFrame output by compute_customer_features
customer_feature_table = fe.create_table(
  name='ml.recommender_system.customer_features',
  primary_keys='customer_id',
  schema=customer_features_df.schema,
  description='Customer features'
)

# An alternative is to use `create_table` and specify the `df` argument.
# This code automatically saves the features to the underlying Delta table.

# customer_feature_table = fe.create_table(
#  ...
#  df=customer_features_df,
#  ...
# )

# To use a composite primary key, pass all primary key columns in the create_table call

# customer_feature_table = fe.create_table(
#   ...
#   primary_keys=['customer_id', 'date'],
#   ...
# )

# To create a time series table, set the timeseries_columns argument

# customer_feature_table = fe.create_table(
#   ...
#   primary_keys=['customer_id', 'date'],
#   timeseries_columns='date',
#   ...
# )

Note

Table constraints on Delta Live Tables are in Public Preview. The following code examples must be run using the Delta Live Tables preview channel.

You can use any Delta Live Table with a Primary Key Constraint as a feature table. The following code shows how to create a Delta Live Table with a primary key:

CREATE LIVE TABLE customer_features (
  customer_id int NOT NULL,
  feat1 long,
  feat2 varchar(100),
  CONSTRAINT customer_features_pk PRIMARY KEY (customer_id)
) AS SELECT * FROM ...;

To create a time series feature table, add a time column as a primary key column and specify the TIMESERIES keyword.

CREATE LIVE TABLE customer_features (
  customer_id int NOT NULL,
  ts timestamp NOT NULL,
  feat1 long,
  feat2 varchar(100),
  CONSTRAINT customer_features_pk PRIMARY KEY (customer_id, ts TIMESERIES)
) AS SELECT * FROM ...;

After the table is created, you can write data to it in the same way you write other Delta Live Tables, and it can be used as a feature table.

Table constraints for Delta Live Tables are only supported in SQL. To set primary keys for Delta Live Tables that were declared in Python, see Use an existing streaming table or materialized view created by a Delta Live Tables pipeline as a feature table.

Use an existing Delta table in Unity Catalog as a feature table

Any Delta table in Unity Catalog with a primary key can be a feature table in Unity Catalog, and you can use the Features UI and API with the table.

Note

  • Only the table owner can declare Primary Key Constraints. The owner’s name is displayed on the table detail page of Catalog Explorer.

  • Verify the data type in the Delta table is supported by Feature Engineering in Unity Catalog. See Supported data types.

  • The TIMESERIES keyword requires Databricks Runtime 13.3 LTS or above.

If an existing Delta table does not have a Primary Key Constraint, you can create one as follows:

  1. Set primary key columns to NOT NULL. For each primary key column, run:

    ALTER TABLE <full_table_name> ALTER COLUMN <pk_col_name> SET NOT NULL
    
  2. Alter the table to add the Primary Key Constraint:

    ALTER TABLE <full_table_name> ADD CONSTRAINT <pk_name> PRIMARY KEY(pk_col1, pk_col2, ...)
    

    pk_name is the name of the Primary Key Constraint. By convention, you can use the table name (without schema and catalog) with a _pk suffix. For example, a table with the name "ml.recommender_system.customer_features" would have customer_features_pk as the name of its Primary Key Constraint.

    To make the table a time series feature table, specify the TIMESERIES keyword on one of the primary key columns, as follows:

    ALTER TABLE <full_table_name> ADD CONSTRAINT <pk_name> PRIMARY KEY(pk_col1 TIMESERIES, pk_col2, ...)
    

    After you add the Primary Key Constraint on the table, the table appears in the Features UI and you can use it as a feature table.

Use an existing streaming table or materialized view created by a Delta Live Tables pipeline as a feature table

Any streaming table or materialized view in Unity Catalog with a primary key can be a feature table in Unity Catalog, and you can use the Features UI and API with the table.

Note

  • Table constraints on Delta Live Tables are in Public Preview. The following code examples must be run using the Delta Live Tables preview channel.

  • Only the table owner can declare Primary Key Constraints. The owner’s name is displayed on the table detail page of Catalog Explorer.

  • Verify that the data type in the Delta table is supported by Feature Engineering in Unity Catalog. See Supported data types.

Add primary key to a streaming table or materialized view that was created using SQL

To set primary keys for an existing streaming table or materialized view that was creating using SQL, update the schema of the streaming table or materialized view in the notebook that manages the table. Then, refresh the table for the constraint to be applied in Unity Catalog.

The code should be similar to the following:

CREATE OR REFRESH MATERIALIZED VIEW existing_live_table(
  id int NOT NULL PRIMARY KEY,
  ...
) AS SELECT ...

Add primary key to a streaming table or materialized view that was created using Python

To create primary keys for an existing streaming table or materialized view that was created by a Delta Live Tables pipeline, you must use SQL, even if the streaming table or materialized view was created using Python. Create a new SQL notebook to define a new streaming table or materialized view that reads from the existing one. Then, run the notebook as a step of the existing Delta Live Tables pipeline or in a new pipeline.

The new SQL notebook should include code similar to the following:

CREATE OR REFRESH MATERIALIZED VIEW new_live_table_with_constraint(
  id int NOT NULL PRIMARY KEY,
  ...
) AS SELECT * FROM existing_live_table

Control access to feature tables in Unity Catalog

Access control for feature tables in Unity Catalog is managed by Unity Catalog. See Unity Catalog privileges.

Update a feature table in Unity Catalog

You can update a feature table in Unity Catalog by adding new features or by modifying specific rows based on the primary key.

The following feature table metadata should not be updated:

  • Primary key.

  • Partition key.

  • Name or data type of an existing feature.

Altering them will cause downstream pipelines that use features for training and serving models to break.

Add new features to an existing feature table in Unity Catalog

You can add new features to an existing feature table in one of two ways:

  • Update the existing feature computation function and run write_table with the returned DataFrame. This updates the feature table schema and merges new feature values based on the primary key.

  • Create a new feature computation function to calculate the new feature values. The DataFrame returned by this new computation function must contain the feature tables’s primary keys and partition keys (if defined). Run write_table with the DataFrame to write the new features to the existing feature table, using the same primary key.

Update only specific rows in a feature table

Use mode = "merge" in write_table. Rows whose primary key does not exist in the DataFrame sent in the write_table call remain unchanged.

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
fe.write_table(
  name='ml.recommender_system.customer_features',
  df = customer_features_df,
  mode = 'merge'
)

Schedule a job to update a feature table

To ensure that features in feature tables always have the most recent values, Databricks recommends that you create a job that runs a notebook to update your feature table on a regular basis, such as every day. If you already have a non-scheduled job created, you can convert it to a scheduled job to make sure the feature values are always up-to-date.

Code to update a feature table uses mode='merge', as shown in the following example.

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()

customer_features_df = compute_customer_features(data)

fe.write_table(
  df=customer_features_df,
  name='ml.recommender_system.customer_features',
  mode='merge'
)

Store past values of daily features

Define a feature table with a composite primary key. Include the date in the primary key. For example, for a feature table customer_features, you might use a composite primary key (date, customer_id) and partition key date for efficient reads.

CREATE TABLE ml.recommender_system.customer_features (
  customer_id int NOT NULL,
  `date` date NOT NULL,
  feat1 long,
  feat2 varchar(100),
  CONSTRAINT customer_features_pk PRIMARY KEY (`date`, customer_id)
)
PARTITIONED BY (`date`)
COMMENT "Customer features";
from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
fe.create_table(
  name='ml.recommender_system.customer_features',
  primary_keys=['date', 'customer_id'],
  partition_columns=['date'],
  schema=customer_features_df.schema,
  description='Customer features'
)

You can then create code to read from the feature table filtering date to the time period of interest.

You can also create a time series feature table which enables point-in-time lookups when you use create_training_set or score_batch. See Create a feature table in Unity Catalog.

To keep the feature table up to date, set up a regularly scheduled job to write features, or stream new feature values into the feature table.

Create a streaming feature computation pipeline to update features

To create a streaming feature computation pipeline, pass a streaming DataFrame as an argument to write_table. This method returns a StreamingQuery object.

def compute_additional_customer_features(data):
  ''' Returns Streaming DataFrame
  '''
  pass

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()

customer_transactions = spark.readStream.load("dbfs:/events/customer_transactions")
stream_df = compute_additional_customer_features(customer_transactions)

fe.write_table(
  df=stream_df,
  name='ml.recommender_system.customer_features',
  mode='merge'
)

Read from a feature table in Unity Catalog

Use read_table to read feature values.

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
customer_features_df = fe.read_table(
  name='ml.recommender_system.customer_features',
)

Search and browse feature tables in Unity Catalog

Use the Features UI to search for or browse feature tables in Unity Catalog.

  1. Click Feature Store Icon Features in the sidebar to display the Features UI.

  2. Select catalog with the catalog selector to view all of the available feature tables in that catalog. In the search box, enter all or part of the name of a feature table, a feature, or a comment. You can also enter all or part of the key or value of a tag. Search text is case-insensitive.

    Feature search example

Get metadata of feature tables in Unity Catalog

Use get_table to get feature table metadata.

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
ft = fe.get_table(name="ml.recommender_system.user_feature_table")
print(ft.features)

Use tags with feature tables and features in Unity Catalog

You can use tags, which are simple key-value pairs, to categorize and manage your feature tables and features.

For feature tables, you can create, edit, and delete tags using the Catalog Explorer UI, Databricks SQL, or the Feature Engineering Python API.

For features, you can create, edit, and delete tags using the Catalog Explorer UI or Databricks SQL.

The following example shows how to use the Feature Engineering Python API to create, update, and delete feature table tags.

from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()

# Create feature table with tags
customer_feature_table = fe.create_table(
  # ...
  tags={"tag_key_1": "tag_value_1", "tag_key_2": "tag_value_2", ...},
  # ...
)

# Upsert a tag
fe.set_feature_table_tag(name="customer_feature_table", key="tag_key_1", value="new_key_value")

# Delete a tag
fe.delete_feature_table_tag(name="customer_feature_table", key="tag_key_2")

Delete a feature table in Unity Catalog

You can delete a feature table in Unity Catalog by directly deleting the Delta table in Unity Catalog using Catalog Explorer or using the Feature Engineering Python API.

Note

  • Deleting a feature table can lead to unexpected failures in upstream producers and downstream consumers (models, endpoints, and scheduled jobs). You must delete published online stores with your cloud provider.

  • When you delete a feature table in Unity Catalog, the underlying Delta table is also dropped.

  • drop_table is not supported in Databricks Runtime 13.1 ML or below. Use SQL command to delete the table.

You can use Databricks SQL or FeatureEngineeringClient.drop_table to delete a feature table in Unity Catalog:

DROP TABLE ml.recommender_system.customer_features;
from databricks.feature_engineering import FeatureEngineeringClient
fe = FeatureEngineeringClient()
fe.drop_table(
  name='ml.recommender_system.customer_features'
)