Feature Views
This feature is in Public Preview. Workspace admins can control access to this feature from the Previews page. See Manage Databricks previews.
Feature Views enable you to define and compute features from data sources. Features can be defined using a variety of sources (Delta table, Kafka Stream, and request-time data) and computations (time-windowed aggregations, simple column selections, and more). This guide covers the following workflows:
- Feature development workflow
  - Use create_feature to define Unity Catalog feature objects that can be used in model training and serving workflows.
  - Alternatively, construct Feature objects locally and use register_feature to persist them to Unity Catalog later. Locally constructed features can be used with create_training_set before registration.
- Model training workflow
  - Use create_training_set to calculate point-in-time aggregated features for machine learning. For detailed documentation on training with Feature Views, see Train models with Feature Views.
- Feature materialization and serving workflow
  - After defining a feature with create_feature or retrieving it using get_feature, you can use materialize_features to materialize the feature or set of features to an offline store for efficient reuse, or to an online store for online serving.
  - Use create_training_set with the materialized view to prepare an offline batch training dataset.
For API details, see Feature Views API reference.
Requirements
- Serverless compute or a classic compute cluster running Databricks Runtime 17.0 ML or above.
- You must install the custom Python package. Run the following lines of code each time you run a notebook:
%pip install "databricks-feature-engineering>=0.16.0"
dbutils.library.restartPython()
Quickstart example
For a runnable quickstart notebook, see Example notebook.
import mlflow
from datetime import timedelta

from databricks.feature_engineering import FeatureEngineeringClient
from databricks.feature_engineering.entities import (
    CronSchedule, DeltaTableSource, Feature, AggregationFunction,
    Sum, Avg, ColumnSelection, TableTrigger,
    TumblingWindow, SlidingWindow,
    OfflineStoreConfig, OnlineStoreConfig,
)
CATALOG_NAME = "main"
SCHEMA_NAME = "feature_store"
TABLE_NAME = "transactions"
# 1. Create data source
source = DeltaTableSource(
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
    table_name=TABLE_NAME,
)
# 2. Define features locally (no catalog/schema needed yet)
avg_feature = Feature(
    source=source,
    entity=["user_id"],
    timeseries_column="transaction_time",
    function=AggregationFunction(
        Avg(input="amount"),
        TumblingWindow(window_duration=timedelta(days=30)),
    ),
    name="avg_transaction_30d",
)
sum_feature = Feature(
    source=source,
    entity=["user_id"],
    timeseries_column="transaction_time",
    function=AggregationFunction(
        Sum(input="amount"),
        SlidingWindow(window_duration=timedelta(days=7), slide_duration=timedelta(days=1)),
    ),
    # name auto-generated: "amount_sum_sliding_7d_1d"
)
fe = FeatureEngineeringClient()
# 3. Explore features with compute_features
feature_df = fe.compute_features(features=[avg_feature, sum_feature])
feature_df.display()
# 4. Create training set using local features
# `labeled_df` should have columns "user_id", "transaction_time", and "target".
training_set = fe.create_training_set(
    df=labeled_df,
    features=[avg_feature, sum_feature],
    label="target",
)
training_set.load_df().display()
# 5. Register features in Unity Catalog
avg_feature = fe.register_feature(
    feature=avg_feature,
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
)
sum_feature = fe.register_feature(
    feature=sum_feature,
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
)
# 6. Or use create_feature for a one-step define-and-register workflow
latest_amount = fe.create_feature(
    source=source,
    function=ColumnSelection("amount"),
    entity=["user_id"],
    timeseries_column="transaction_time",
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
    name="latest_amount",
)
# 7. Train model
with mlflow.start_run():
    training_df = training_set.load_df()
    # Training code goes here; it must produce the fitted `model` logged below.
    fe.log_model(
        model=model,
        artifact_path="recommendation_model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name=f"{CATALOG_NAME}.{SCHEMA_NAME}.recommendation_model",
    )
# 8. (Optional) Materialize features for serving
# Features must be registered in UC before calling materialize_features
online_config = OnlineStoreConfig(
    catalog_name=CATALOG_NAME,
    schema_name=SCHEMA_NAME,
    table_name_prefix="customer_features_serving",
    online_store_name="customer_features_store",
)
# Aggregation features use CronSchedule and support both offline and online configs
fe.materialize_features(
    features=[avg_feature, sum_feature],
    offline_config=OfflineStoreConfig(
        catalog_name=CATALOG_NAME,
        schema_name=SCHEMA_NAME,
        table_name_prefix="customer_features",
    ),
    online_config=online_config,
    trigger=CronSchedule(
        quartz_cron_expression="0 0 * * * ?",  # Hourly
        timezone_id="UTC",
    ),
)
# ColumnSelection features use TableTrigger and only support online config
fe.materialize_features(
    features=[latest_amount],
    online_config=online_config,
    trigger=TableTrigger(),
)
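The hourly trigger above is written as a Quartz cron expression, which has a leading seconds field and an optional trailing year field, unlike the five-field Unix cron format. The following is a minimal sketch in plain Python that labels the fields of such an expression. The helper describe_quartz is hypothetical and is not part of the Feature Views API.

```python
# Hypothetical helper for illustration only; not part of databricks-feature-engineering.
QUARTZ_FIELDS = (
    "seconds", "minutes", "hours", "day_of_month", "month", "day_of_week", "year",
)

def describe_quartz(expression: str) -> dict:
    """Map each field of a Quartz cron expression to its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("a Quartz cron expression has 6 fields plus an optional year")
    return dict(zip(QUARTZ_FIELDS, parts))

fields = describe_quartz("0 0 * * * ?")
# Second 0 of minute 0 of every hour: the schedule fires once per hour.
print(fields)
```

Because seconds and minutes are both 0 and the hours field is a wildcard, the expression fires at the top of every hour in the configured time zone.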
Example notebook
Feature Views quickstart notebook
Streaming features
In addition to batch features from Delta tables, you can define features from streaming sources for real-time use cases. Streaming features use the same Feature class as batch features, with the same Feature constructors, aggregation functions, and training and serving workflows, so upgrading from batch to real-time requires minimal code changes. Once materialized, streaming features deliver sub-second end-to-end freshness (p99 latency of 200 ms) directly to your model serving endpoints.
To use streaming features, first set up a Stream, then reference it using a StreamSource. Stream sources support Kafka as input and automatically maintain an ingestion (Delta) table as a historical copy of the data for training.
Define a streaming feature
A StreamSource references a Stream by its three-part name (catalog.schema.stream_name). A Stream is not a Unity Catalog securable object, but it is scoped to a Unity Catalog schema and access is governed by the Stream's ingestion table. Column references in entity, timeseries, and function definitions must be prefixed with value. or key. to indicate which part of the Kafka message to read. Nested fields are supported using dot notation (for example, value.user.address.city).
from databricks.feature_engineering import FeatureEngineeringClient
from databricks.feature_engineering.entities import (
    StreamSource,
    Feature,
    AggregationFunction,
    Sum,
    RollingWindow,
)
from datetime import timedelta
client = FeatureEngineeringClient()
stream_source = StreamSource(
    full_name="my_catalog.my_schema.my_stream",
)
feature = Feature(
    name="user_purchase_sum",
    source=stream_source,
    entity=["value.user_id"],
    timeseries_column="value.event_time",
    function=AggregationFunction(
        operator=Sum(input="value.amount"),
        time_window=RollingWindow(window_duration=timedelta(hours=1)),
    ),
)
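To make the value. and key. prefixes concrete, the following sketch shows how a dotted column reference could be resolved against a decoded Kafka message. It is plain Python for illustration only: the resolve_column helper and the sample message are hypothetical, and the actual resolution is performed by the service rather than by your code.

```python
from typing import Any, Mapping

def resolve_column(message: Mapping[str, Any], reference: str) -> Any:
    """Resolve a dotted reference such as 'value.user.address.city'.

    `message` is a hypothetical decoded record with 'key' and 'value' parts.
    """
    root, *path = reference.split(".")
    if root not in ("key", "value"):
        raise ValueError(f"{reference!r} must start with 'key.' or 'value.'")
    if not path:
        raise ValueError(f"{reference!r} must name a field after the prefix")
    node: Any = message[root]
    for field in path:
        node = node[field]  # raises KeyError if a nested field is missing
    return node

message = {
    "key": {"user_id": "u-17"},
    "value": {"amount": 42.5, "user": {"address": {"city": "Lisbon"}}},
}

print(resolve_column(message, "value.user.address.city"))
print(resolve_column(message, "key.user_id"))
```

A reference without a key. or value. prefix, such as amount, is rejected in this sketch, which mirrors the requirement that every column reference states which part of the message it reads.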
Filter conditions on StreamSource
Use filter_condition to filter rows from the stream before aggregation, just like on DeltaTableSource.
stream_source = StreamSource(
    full_name="my_catalog.my_schema.my_stream",
    filter_condition="value.event_type = 'purchase'",
)
Column selection from streams
ColumnSelection features work with streaming sources. The selected column represents the latest value from the Stream for each entity while respecting point-in-time accuracy.
from databricks.feature_engineering.entities import ColumnSelection
passenger_count = Feature(
    name="passenger_count",
    source=stream_source,
    entity=["value.user_id"],
    timeseries_column="value.event_time",
    function=ColumnSelection(column="value.passenger_count"),
)
Access nested fields
You can access nested JSON fields using dot notation (for example, value.nested_field.amount). At serving time, the request payload and response use leaf node names (for example, amount instead of value.amount). Leaf node names must be unique across all entity, timeseries, and feature output columns within a model or Feature Spec, because the serving endpoint uses leaf names to route values.
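Because leaf names must be unique, it can help to check your column references before you deploy. The following is a minimal sketch in plain Python, assuming that the leaf name is the last dot-separated segment of a reference. The helpers leaf_name and find_leaf_collisions are hypothetical and are not part of the Feature Views API.

```python
from collections import defaultdict

def leaf_name(reference: str) -> str:
    """Return the last dot-separated segment, for example 'amount' for 'value.amount'."""
    return reference.rsplit(".", 1)[-1]

def find_leaf_collisions(references):
    """Group distinct references by leaf name and keep only names used more than once."""
    by_leaf = defaultdict(list)
    for ref in dict.fromkeys(references):  # drop exact duplicates, keep order
        by_leaf[leaf_name(ref)].append(ref)
    return {leaf: refs for leaf, refs in by_leaf.items() if len(refs) > 1}

references = [
    "value.user_id",        # entity
    "value.event_time",     # timeseries column
    "value.amount",         # feature input
    "value.refund.amount",  # collides with value.amount on the leaf "amount"
]
collisions = find_leaf_collisions(references)
print(collisions)
```

In this example, value.amount and value.refund.amount share the leaf name amount, so they would need to be disambiguated before being used in the same model or Feature Spec.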
Time windows for streaming features
Streaming features support only RollingWindow for aggregations. Rolling windows continuously recompute over the most recent data, which aligns with the real-time nature of streaming sources. TumblingWindow and SlidingWindow are designed for batch computation over fixed historical intervals.
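The difference between the three window types can be sketched by computing the interval that each one would cover for a given evaluation time. This is an illustration in plain Python under common definitions of these window types. The alignment of tumbling and sliding windows to the Unix epoch and the half-open [start, end) convention are assumptions, not documented behavior of the library, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def rolling_window(at, duration):
    """Window that ends exactly at the evaluation time."""
    return at - duration, at

def tumbling_window(at, duration):
    """Most recent completed fixed-size window, aligned to the epoch (assumption)."""
    end = EPOCH + ((at - EPOCH) // duration) * duration
    return end - duration, end

def sliding_window(at, duration, slide):
    """Most recent completed window whose end falls on a slide boundary (assumption)."""
    end = EPOCH + ((at - EPOCH) // slide) * slide
    return end - duration, end

at = datetime(2024, 3, 15, 10, 40, tzinfo=timezone.utc)
print(rolling_window(at, timedelta(hours=1)))
print(tumbling_window(at, timedelta(hours=1)))
print(sliding_window(at, timedelta(days=7), timedelta(days=1)))
```

Under these assumptions, only the rolling window always includes the most recent events, which is why it suits streaming sources, while the tumbling and sliding windows only change at fixed boundaries.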
Streaming features example notebook
Streaming Feature Views quickstart notebook
Model training and inference
To train models and run batch inference with Feature Views, including log_model(), score_batch(), and create_training_set(), see Train models with Feature Views.
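The idea behind a point-in-time training set can be illustrated without the Feature Views API. The following plain-Python sketch computes, for each labeled row, an aggregate over only the events that occurred before that row's timestamp, so no future data leaks into the features. The half-open window [as_of - window, as_of), the helper point_in_time_sum, and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def point_in_time_sum(events, entity, as_of, window):
    """Sum amounts for `entity` in [as_of - window, as_of), excluding later events."""
    start = as_of - window
    return sum(
        amount
        for user_id, ts, amount in events
        if user_id == entity and start <= ts < as_of
    )

# (user_id, transaction_time, amount)
events = [
    ("u1", datetime(2024, 1, 1), 10.0),
    ("u1", datetime(2024, 1, 5), 20.0),
    ("u1", datetime(2024, 1, 9), 40.0),
    ("u2", datetime(2024, 1, 4), 5.0),
]
# (user_id, transaction_time, target)
labels = [
    ("u1", datetime(2024, 1, 6), 0),
    ("u1", datetime(2024, 1, 10), 1),
    ("u2", datetime(2024, 1, 3), 0),
]

training_rows = [
    (user_id, ts, point_in_time_sum(events, user_id, ts, timedelta(days=7)), target)
    for user_id, ts, target in labels
]
for row in training_rows:
    print(row)
```

The row for u2 on January 3 gets a sum of 0 because that user's only transaction happens a day later, which is exactly the kind of leakage that point-in-time computation is meant to prevent.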
Feature materialization
After you define features, you can materialize them to offline or online stores for efficient reuse in training and serving workflows. After materializing features, you can serve models using CPU model serving. For details, see Materialize Feature Views.
Best practices
Feature naming
- Use descriptive names for business-critical features.
- Follow consistent naming conventions across teams.
- Use auto-generated names as you begin developing features.
Time windows
- Align window boundaries with business cycles (daily, weekly).
- Shorter windows capture recent trends but can be noisy. Longer windows produce more stable feature distributions but might miss recent behavioral shifts. Choose based on how quickly the underlying signal changes for your use case. For example, a 7-day window smooths out daily fluctuations and produces consistent model inputs, while a 1-hour window reacts quickly to behavioral changes but might introduce variance that degrades model performance. If your model's accuracy degrades when the distribution shifts, use a longer window to stabilize inputs.
- Tumbling and sliding windows are more scalable than rolling (continuous) windows. Start with sliding windows for most use cases.
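The trade-off between short and long windows can be seen on synthetic data. The following sketch uses a made-up hourly signal with a pure daily cycle and compares how much a 1-hour and a 7-day rolling average vary over time. The data and the helper rolling_means are hypothetical, and real transaction data will be noisier.

```python
import math
from statistics import mean, pstdev

# Four weeks of synthetic hourly amounts with a daily cycle around 10.
hourly = [10 + 5 * math.sin(2 * math.pi * h / 24) for h in range(24 * 28)]

def rolling_means(values, window):
    """Mean of the last `window` values at every position where a full window exists."""
    return [mean(values[i - window:i]) for i in range(window, len(values) + 1)]

one_hour = rolling_means(hourly, 1)
seven_day = rolling_means(hourly, 24 * 7)

print("1-hour window spread:", pstdev(one_hour))
print("7-day window spread:", pstdev(seven_day))
```

The 1-hour average follows every swing of the daily cycle, while the 7-day average stays essentially flat because each window spans whole cycles. The flat signal gives a model stable inputs but would also be slow to reflect a genuine change in behavior.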
Performance
- Materialize features from the same data source in a single materialize_features call to minimize data scans.
- Use the same granularity (for example, all 1-hour or all 1-day slide durations) for features on the same data source to enable better grouping during materialization.
Entity columns vs. filter conditions
Use this decision guide when working with features from the same source table:
Use entity (on create_feature) when you need different aggregation levels:
- Customer-level features (one row per customer): entity=["customer_id"]
- Customer-merchant features (multiple rows per customer): entity=["customer_id", "merchant_id"]
- Different aggregation levels can share the same DeltaTableSource: specify different entity values on each feature definition.
Use filter_condition (on DeltaTableSource) when you need to filter rows at the same aggregation level:
- High-value transactions only: filter_condition="amount > 100" (still aggregated per customer)
- Completed orders only: filter_condition="status = 'completed'" (still aggregated per customer)
Rule of thumb: If your change would result in a different number of rows per entity value, use different entity values on your feature definitions. If you're just filtering which rows contribute to the same aggregation, use filter_condition on the source.
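The rule of thumb can be illustrated on a small in-memory table. The following plain-Python sketch is not the Feature Views API: total_amount is a hypothetical helper in which the entity argument controls the grouping key and the keep argument plays the role of a filter condition.

```python
from collections import defaultdict

transactions = [
    {"customer_id": "c1", "merchant_id": "m1", "amount": 250.0},
    {"customer_id": "c1", "merchant_id": "m2", "amount": 40.0},
    {"customer_id": "c1", "merchant_id": "m1", "amount": 120.0},
    {"customer_id": "c2", "merchant_id": "m1", "amount": 80.0},
]

def total_amount(rows, entity, keep=lambda row: True):
    """Sum `amount` per entity key, using only rows accepted by `keep`."""
    totals = defaultdict(float)
    for row in rows:
        if keep(row):
            totals[tuple(row[col] for col in entity)] += row["amount"]
    return dict(totals)

# Changing the entity changes the aggregation level (the number of output rows).
per_customer = total_amount(transactions, ["customer_id"])
per_customer_merchant = total_amount(transactions, ["customer_id", "merchant_id"])

# Filtering keeps the customer-level key but changes which rows contribute.
high_value = total_amount(transactions, ["customer_id"], keep=lambda r: r["amount"] > 100)

print(per_customer)
print(per_customer_merchant)
print(high_value)
```

Adding merchant_id to the entity turns one row for customer c1 into two, whereas the filter leaves the output keyed by customer and only changes the totals.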
Common patterns
Customer analytics
from datetime import timedelta

from databricks.feature_engineering import FeatureEngineeringClient
from databricks.feature_engineering.entities import AggregationFunction, Avg, Count, RollingWindow, Sum

# `transactions` is assumed to be a DeltaTableSource, defined as in the quickstart.
fe = FeatureEngineeringClient()
features = [
    # Recency: number of transactions in the last day
    fe.create_feature(
        catalog_name="main", schema_name="ecommerce", source=transactions,
        entity=["user_id"], timeseries_column="transaction_time",
        function=AggregationFunction(Count(input="transaction_id"), RollingWindow(window_duration=timedelta(days=1))),
    ),
    # Frequency: transaction count over the last 90 days
    fe.create_feature(
        catalog_name="main", schema_name="ecommerce", source=transactions,
        entity=["user_id"], timeseries_column="transaction_time",
        function=AggregationFunction(Count(input="transaction_id"), RollingWindow(window_duration=timedelta(days=90))),
    ),
    # Monetary: total spend in the last month
    fe.create_feature(
        catalog_name="main", schema_name="ecommerce", source=transactions,
        entity=["user_id"], timeseries_column="transaction_time",
        function=AggregationFunction(Sum(input="amount"), RollingWindow(window_duration=timedelta(days=30))),
    ),
]
Trend analysis
# Compare recent vs. historical behavior
fe = FeatureEngineeringClient()
recent_avg = fe.create_feature(
    catalog_name="main", schema_name="ecommerce",
    source=transactions, entity=["user_id"], timeseries_column="transaction_time",
    function=AggregationFunction(
        Avg(input="amount"),
        RollingWindow(window_duration=timedelta(days=7)),
    ),
)
historical_avg = fe.create_feature(
    catalog_name="main", schema_name="ecommerce",
    source=transactions, entity=["user_id"], timeseries_column="transaction_time",
    function=AggregationFunction(
        Avg(input="amount"),
        RollingWindow(window_duration=timedelta(days=7), delay=timedelta(days=7)),
    ),
)
Seasonal patterns
# Same day of week, 4 weeks ago
fe = FeatureEngineeringClient()
weekly_pattern = fe.create_feature(
    catalog_name="main", schema_name="ecommerce",
    source=transactions, entity=["user_id"], timeseries_column="transaction_time",
    function=AggregationFunction(
        Avg(input="amount"),
        RollingWindow(window_duration=timedelta(days=1), delay=timedelta(weeks=4)),
    ),
)
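The trend and seasonal patterns above rely on the delay parameter to shift a rolling window into the past. The following plain-Python sketch shows the interval each pattern would cover, assuming that the window ends at the evaluation time minus the delay and is half-open. That interpretation, and the helper delayed_window, are assumptions for illustration and not a statement of the library's exact boundary handling.

```python
from datetime import datetime, timedelta

def delayed_window(at, duration, delay=timedelta(0)):
    """Return (start, end) for a window of `duration` that ends `delay` before `at`."""
    end = at - delay
    return end - duration, end

now = datetime(2024, 3, 15)

recent = delayed_window(now, timedelta(days=7))
historical = delayed_window(now, timedelta(days=7), delay=timedelta(days=7))
seasonal = delayed_window(now, timedelta(days=1), delay=timedelta(weeks=4))

print("recent:    ", recent)
print("historical:", historical)
print("seasonal:  ", seasonal)
```

Under this assumption, the historical window ends where the recent window begins, so the two averages compare adjacent, non-overlapping weeks.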
Limitations
- Names of entity and timeseries columns must match between the training (labeled) dataset and the feature definitions when used in the create_training_set API.
- The column name used as the label column in the training dataset should not exist in the source tables used for defining Features.
- A limited list of functions (UDAFs) is supported in the create_feature API. See Supported functions.
- Entity columns cannot be of type DATE or TIMESTAMP.
- RequestSource supports only scalar data types defined in ScalarDataType (INTEGER, FLOAT, BOOLEAN, STRING, DOUBLE, LONG, TIMESTAMP, DATE, SHORT). Complex types such as arrays, maps, and structs are not supported.
- RequestSource does not support aggregation functions or time windows. Only ColumnSelection functions can be used.
- The set of entity column names, timeseries column names, and request feature column names must be globally unique across all sources in a training set or serving endpoint.
For materialization-specific limitations, see Limitations.
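Some of the column-name limitations can be checked before you call create_training_set. The following is a minimal sketch in plain Python that works on lists of column names, for example the columns attribute of a DataFrame. The helper check_training_inputs is hypothetical, is not part of the Feature Views API, and covers only the two naming rules for labeled datasets.

```python
def check_training_inputs(labeled_columns, label, entity, timeseries_column, source_columns):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    # Entity and timeseries column names must match between the labeled dataset and the feature.
    missing = [c for c in [*entity, timeseries_column] if c not in labeled_columns]
    if missing:
        problems.append(f"labeled dataset is missing columns: {missing}")
    if label not in labeled_columns:
        problems.append(f"label column {label!r} is not in the labeled dataset")
    # The label column should not also exist in the feature's source table.
    if label in source_columns:
        problems.append(f"label column {label!r} also exists in the source table")
    return problems

ok = check_training_inputs(
    labeled_columns=["user_id", "transaction_time", "target"],
    label="target",
    entity=["user_id"],
    timeseries_column="transaction_time",
    source_columns=["user_id", "transaction_time", "amount"],
)
bad = check_training_inputs(
    labeled_columns=["customer", "transaction_time", "amount"],
    label="amount",
    entity=["user_id"],
    timeseries_column="transaction_time",
    source_columns=["user_id", "transaction_time", "amount"],
)
print(ok)
print(bad)
```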