Train models with declarative features

Beta

This feature is Beta and is available in the following regions: us-east-1 and us-west-2.

This page describes how to use declarative features for model training. For information about defining declarative features, see Declarative features.

API methods

create_training_set()

After you create declarative features, the next step is to create training data for your model. To do this, pass a labeled dataset to create_training_set, which automatically ensures point-in-time accurate computation of each feature value.

For example:

Python
FeatureEngineeringClient.create_training_set(
    df: DataFrame,                                # DataFrame with training data
    features: Optional[List[Feature]],            # List of Feature objects
    label: Union[str, List[str], None],           # Label column name(s)
    exclude_columns: Optional[List[str]] = None,  # Optional: columns to exclude
) -> TrainingSet

Call TrainingSet.load_df to join the original training data with the dynamically computed, point-in-time correct feature values.

The df argument must meet the following requirements:

  • Must contain all entity_columns from feature data sources.
  • Must contain timeseries_column from feature data sources.
  • Should contain label column(s).
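As a concrete sketch, a labeled DataFrame satisfying these requirements might look like the following. The column names (user_id, transaction_time, is_fraud) are illustrative assumptions, and pandas is used here only to keep the sketch self-contained; the actual API takes a Spark DataFrame.

```python
# Minimal sketch of a labeled DataFrame meeting the requirements above.
# Column names are hypothetical; the real API expects a Spark DataFrame.
import pandas as pd

labeled_df = pd.DataFrame({
    "user_id": [101, 102, 101],                     # entity column
    "transaction_time": pd.to_datetime(
        ["2024-03-01 10:00", "2024-03-01 11:30", "2024-03-02 09:15"]
    ),                                              # timeseries column
    "is_fraud": [0, 1, 0],                          # label column
})

# All three kinds of required columns are present:
assert {"user_id", "transaction_time", "is_fraud"}.issubset(labeled_df.columns)
```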

Point-in-time correctness: Features are computed using only source data available before each row's timestamp, to prevent future data leakage into model training. Computations use Spark's windowing functions for efficiency.
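The point-in-time semantics can be illustrated with a small as-of join sketch. This is not the library's actual implementation (which uses Spark window functions); it only shows the rule that, for each labeled row, only source events strictly before the row's timestamp may contribute to a feature value.

```python
# Illustrative sketch of point-in-time correctness using pandas.merge_asof.
# Data and column names are hypothetical.
import pandas as pd

# Source events a feature might aggregate over.
events = pd.DataFrame({
    "user_id": [1, 1, 1],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-09"]),
    "amount": [10.0, 20.0, 30.0],
})

# Labeled rows; features must be evaluated as of transaction_time.
labels = pd.DataFrame({
    "user_id": [1, 1],
    "transaction_time": pd.to_datetime(["2024-01-06", "2024-01-10"]),
    "is_fraud": [0, 1],
})

# As-of join: picks the latest event strictly before each label timestamp,
# so future events never leak into a training row.
joined = pd.merge_asof(
    labels.sort_values("transaction_time"),
    events.sort_values("event_time"),
    left_on="transaction_time",
    right_on="event_time",
    by="user_id",
    allow_exact_matches=False,
)
# The 2024-01-06 row sees the 2024-01-05 event (20.0), not the later 30.0.
```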

log_model()

You can use MLflow to log a model with feature metadata for lineage tracking and automatic feature lookup during inference:

Python
FeatureEngineeringClient.log_model(
    model,                                 # Trained model object
    artifact_path: str,                    # Path to store model artifact
    flavor: ModuleType,                    # MLflow flavor module (e.g., mlflow.sklearn)
    training_set: TrainingSet,             # TrainingSet used for training
    registered_model_name: Optional[str],  # Optional: register model in Unity Catalog
)

The flavor parameter specifies the MLflow model flavor module to use, such as mlflow.sklearn or mlflow.xgboost.

Models logged with a TrainingSet automatically track lineage to the features used in training. For details, see Train models with feature tables.

score_batch()

Perform batch inference with automatic feature lookup:

Python
FeatureEngineeringClient.score_batch(
    model_uri: str,  # URI of logged model
    df: DataFrame,   # DataFrame with entity keys and timestamps
) -> DataFrame

score_batch uses the feature metadata stored with the model to automatically compute point-in-time correct features for inference, ensuring consistency with training. For details, see Train models with feature tables.

Example workflow

Python
import mlflow
from databricks.feature_engineering import FeatureEngineeringClient
from sklearn.ensemble import RandomForestClassifier

fe = FeatureEngineeringClient()

# Assume features were created with fe.create_feature()
# labeled_df should have columns "user_id", "transaction_time", and "is_fraud"

# 1. Create training set using declarative features
# 1. Create training set using declarative features
training_set = fe.create_training_set(
    df=labeled_df,
    features=features,
    label="is_fraud",
)

# 2. Load training data with computed features
training_df = training_set.load_df()
X = training_df.drop("is_fraud").toPandas()
y = training_df.select("is_fraud").toPandas().values.ravel()

# 3. Train model
model = RandomForestClassifier().fit(X, y)

# 4. Log model with feature metadata
with mlflow.start_run():
    fe.log_model(
        model=model,
        artifact_path="fraud_model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name="main.ecommerce.fraud_model",
    )

# 5. Batch scoring with automatic feature lookup
# inference_df must contain the same entity_columns and timeseries_column
# used during training. Features are automatically computed.
predictions = fe.score_batch(
    model_uri="models:/main.ecommerce.fraud_model/1",
    df=inference_df,
)
predictions.display()