Train models with declarative features
This feature is Beta and is available in the following regions: us-east-1 and us-west-2.
This page describes how to use declarative features for model training. For information about defining declarative features, see Declarative features.
Requirements
- Features must be created with the declarative feature API. See Declarative features.
API methods
create_training_set()
After you create declarative features, the next step is to create training data for your model. To do this, pass a labeled dataset to create_training_set, which automatically ensures point-in-time accurate computation of each feature value.
For example:
FeatureEngineeringClient.create_training_set(
    df: DataFrame,                                 # DataFrame with training data
    features: Optional[List[Feature]],             # List of Feature objects
    label: Union[str, List[str], None],            # Label column name(s)
    exclude_columns: Optional[List[str]] = None,   # Optional: columns to exclude
) -> TrainingSet
Call TrainingSet.load_df to join the original training data with the dynamically computed, point-in-time correct feature values.
The df argument must meet the following requirements:
- Must contain all entity_columns from feature data sources.
- Must contain the timeseries_column from feature data sources.
- Should contain the label column(s).
Point-in-time correctness: Features are computed using only source data available before each row's timestamp, to prevent future data leakage into model training. Computations use Spark's windowing functions for efficiency.
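The point-in-time semantics resemble an as-of join: for each training row, only the latest feature source row with a timestamp at or before that row's timestamp contributes. The following is a conceptual sketch in pandas (illustrative only, not the library's Spark-based implementation; the column names are hypothetical):

```python
import pandas as pd

# Hypothetical labeled dataset and feature source.
labels = pd.DataFrame({
    "user_id": [1, 1],
    "transaction_time": pd.to_datetime(["2024-01-05", "2024-01-10"]),
    "is_fraud": [0, 1],
})
feature_source = pd.DataFrame({
    "user_id": [1, 1, 1],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-07", "2024-01-12"]),
    "avg_spend": [10.0, 20.0, 30.0],
})

# merge_asof picks, per label row, the latest feature row with ts <= transaction_time,
# so the 2024-01-12 value can never leak into either training row.
joined = pd.merge_asof(
    labels.sort_values("transaction_time"),
    feature_source.sort_values("ts"),
    left_on="transaction_time",
    right_on="ts",
    by="user_id",
)
print(joined["avg_spend"].tolist())  # [10.0, 20.0]
```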
log_model()
You can use MLflow to log a model with feature metadata for lineage tracking and automatic feature lookup during inference:
FeatureEngineeringClient.log_model(
    model,                                   # Trained model object
    artifact_path: str,                      # Path to store model artifact
    flavor: ModuleType,                      # MLflow flavor module (e.g., mlflow.sklearn)
    training_set: TrainingSet,               # TrainingSet used for training
    registered_model_name: Optional[str],    # Optional: register model in Unity Catalog
)
The flavor parameter specifies the MLflow model flavor module to use, such as mlflow.sklearn or mlflow.xgboost.
Models logged with a TrainingSet automatically track lineage to the features used in training. For details, see Train models with feature tables.
score_batch()
Perform batch inference with automatic feature lookup:
FeatureEngineeringClient.score_batch(
    model_uri: str,    # URI of logged model
    df: DataFrame,     # DataFrame with entity keys and timestamps
) -> DataFrame
score_batch uses the feature metadata stored with the model to automatically compute point-in-time correct features for inference, ensuring consistency with training. For details, see Train models with feature tables.
Example workflow
import mlflow
from databricks.feature_engineering import FeatureEngineeringClient
from sklearn.ensemble import RandomForestClassifier
fe = FeatureEngineeringClient()
# Assume `features` is a list of Feature objects created with fe.create_feature()
# labeled_df must have columns "user_id", "transaction_time", and "is_fraud"
# 1. Create training set using declarative features
training_set = fe.create_training_set(
    df=labeled_df,
    features=features,
    label="is_fraud",
)
# 2. Load training data with computed features
training_df = training_set.load_df()
# Drop the label, entity key, and timestamp columns so only feature values remain
X = training_df.drop("is_fraud", "user_id", "transaction_time").toPandas()
y = training_df.select("is_fraud").toPandas().values.ravel()
# 3. Train model
model = RandomForestClassifier().fit(X, y)
# 4. Log model with feature metadata
with mlflow.start_run():
    fe.log_model(
        model=model,
        artifact_path="fraud_model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name="main.ecommerce.fraud_model",
    )
# 5. Batch scoring with automatic feature lookup
# inference_df must contain the same entity_columns and timeseries_column
# used during training. Features are automatically computed.
predictions = fe.score_batch(
    model_uri="models:/main.ecommerce.fraud_model/1",
    df=inference_df,
)
predictions.display()