This section describes concepts to help you use Databricks Feature Store and feature tables.
Features are organized as feature tables. Each table is backed by a Delta table and additional metadata.
A feature table must have a primary key. Features in a feature table are typically computed and updated using a common computation function.
Feature table metadata tracks the data sources from which a table was generated and the notebooks and jobs that created or wrote to the table.
You can publish a feature table to an online store for real-time model inference.
With Databricks Runtime 13.2 and above, if your workspace is enabled for Unity Catalog, you can use any Delta table in Unity Catalog with a primary key as a feature table. These feature tables are called “Feature tables in Unity Catalog”. See Feature Engineering in Unity Catalog.
Feature tables that are stored in the local Workspace Feature Store are called “Workspace feature tables”. See Work with features in Workspace Feature Store.
The data used to train a model often has time dependencies built into it. When you build the model, you must consider only feature values up until the time of the observed target value. If you train on features based on data measured after the timestamp of the target value, the model learns from information that would not be available at prediction time, and its performance may suffer.
Time series feature tables include a timestamp key column that ensures that each row in the training dataset represents the latest known feature values as of the row’s timestamp. You should use time series feature tables whenever feature values change over time, for example with time series data, event-based data, or time-aggregated data.
When you create a time series feature table, you specify time-related columns in your primary keys to be timestamp keys using the timestamp_keys argument. This enables point-in-time lookups when you use score_batch. The system performs an as-of timestamp join, using the timestamp_lookup_key you specify.
If you do not use the timestamp_keys argument and only designate a timestamp column as a primary key column, Feature Store does not apply point-in-time logic to the timestamp column during joins. Instead, it matches only rows with an exact timestamp match, rather than considering rows prior to the timestamp.
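The difference between an as-of join and an exact-match join can be illustrated with a minimal, library-free sketch. It does not use the Feature Store API. The feature_rows data and the as_of_lookup and exact_lookup helpers are hypothetical names introduced only for illustration, and the sketch assumes that a feature row stamped exactly at the lookup timestamp counts as known.

```python
from bisect import bisect_right

# Hypothetical feature rows per primary key, sorted by timestamp: (timestamp, value).
feature_rows = {
    "user_1": [(10, 0.2), (20, 0.5), (30, 0.9)],
}


def as_of_lookup(key, ts):
    """Point-in-time lookup: latest value with timestamp <= ts, or None."""
    rows = feature_rows.get(key, [])
    timestamps = [t for t, _ in rows]
    idx = bisect_right(timestamps, ts)  # number of rows with timestamp <= ts
    return rows[idx - 1][1] if idx > 0 else None


def exact_lookup(key, ts):
    """Exact-match lookup: a value only when some row has exactly this timestamp."""
    for t, value in feature_rows.get(key, []):
        if t == ts:
            return value
    return None


# A label observed at time 25 has no feature row at exactly 25.
as_of_25 = as_of_lookup("user_1", 25)  # falls back to the row at time 20
exact_25 = exact_lookup("user_1", 25)  # finds nothing
```

In this sketch the as-of lookup returns the value recorded at time 20 and never sees the later value from time 30, while the exact-match lookup returns nothing because no row has timestamp 25.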
The offline feature store is used for feature discovery, model training, and batch inference. It contains feature tables materialized as Delta tables.
An online store is a low-latency database used for real-time model inference. For a list of online stores that Databricks supports, see Work with online stores.
In addition to batch writes, Databricks Feature Store supports streaming. You can write feature values to a feature table from a streaming source, and feature computation code can utilize Structured Streaming to transform raw data streams into features.
You can also stream feature tables from the offline store to an online store.
A training set consists of a list of features and a DataFrame containing raw training data, labels, and primary keys by which to look up features. You create the training set by specifying features to extract from Feature Store, and provide the training set as input during model training.
See Create a training dataset for an example of how to create and use a training set.
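Conceptually, building a training set is a keyed lookup: each row of raw training data keeps its label and gains the requested feature columns, matched by primary key. The sketch below illustrates that idea in plain Python and is not the Feature Store API. The names feature_table, raw_rows, and build_training_set, and all column names, are hypothetical.

```python
# Hypothetical feature table keyed by user_id.
feature_table = {
    "u1": {"purchases_30d": 3, "avg_basket": 21.5},
    "u2": {"purchases_30d": 0, "avg_basket": 0.0},
}

# Raw training data: primary key, a raw column, and the label.
raw_rows = [
    {"user_id": "u1", "device": "ios", "churned": 0},
    {"user_id": "u2", "device": "web", "churned": 1},
]


def build_training_set(rows, table, key, feature_names):
    """Join only the requested features onto each raw row by primary key."""
    joined_rows = []
    for row in rows:
        features = table.get(row[key], {})
        joined = dict(row)  # copy so the raw data is left untouched
        for name in feature_names:
            joined[name] = features.get(name)  # None if the key has no features
        joined_rows.append(joined)
    return joined_rows


training_rows = build_training_set(
    raw_rows, feature_table, key="user_id", feature_names=["purchases_30d"]
)
```

Only the features named in feature_names are joined, so avg_basket stays out of the result even though it exists in the feature table.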
A machine learning model trained using features from Databricks Feature Store retains references to these features. At inference time, the model can optionally retrieve feature values from Feature Store. The caller only needs to provide the primary key of the features used in the model (for example, user_id), and the model retrieves all required feature values from Feature Store.
In batch inference, feature values are retrieved from the offline store and joined with new data prior to scoring. In real-time inference, feature values are retrieved from the online store.
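The following toy sketch illustrates how a model that keeps references to its features can score requests containing only a primary key. It is a plain-Python illustration under stated assumptions, not the Feature Store implementation: FeatureAwareModel, offline_store, online_store, and the purchases_30d feature are all hypothetical, and both stores are modeled as in-memory dictionaries.

```python
class FeatureAwareModel:
    """Toy model that remembers which features it was trained on."""

    def __init__(self, predict_fn, key_column, feature_names):
        self.predict_fn = predict_fn
        self.key_column = key_column
        self.feature_names = feature_names

    def score(self, rows, store):
        """Look up the referenced features by primary key, then predict."""
        predictions = []
        for row in rows:
            looked_up = store[row[self.key_column]]
            vector = [looked_up[name] for name in self.feature_names]
            predictions.append(self.predict_fn(vector))
        return predictions


# Hypothetical stores: same feature values, different serving paths.
offline_store = {"u1": {"purchases_30d": 3}, "u2": {"purchases_30d": 0}}
online_store = dict(offline_store)

# Toy rule standing in for a trained model: predict churn when there are no purchases.
model = FeatureAwareModel(lambda v: int(v[0] == 0), "user_id", ["purchases_30d"])

# Batch inference reads from the offline store; callers pass only the key.
batch_predictions = model.score([{"user_id": "u1"}, {"user_id": "u2"}], offline_store)

# Real-time inference reads from the online store.
realtime_prediction = model.score([{"user_id": "u2"}], online_store)[0]
```

The caller never supplies purchases_30d directly. The model resolves it from whichever store matches the inference mode.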
To package a model with feature metadata, use