client

Classes

class databricks.feature_store.client.FeatureStoreClient(feature_store_uri: Optional[str] = None, model_registry_uri: Optional[str] = None)

Bases: object

Client for interacting with the Databricks Feature Store.

Methods

create_feature_table(name: str, keys: Union[str, List[str]], features_df: pyspark.sql.dataframe.DataFrame = None, schema: pyspark.sql.types.StructType = None, partition_columns: Union[str, List[str]] = None, description: str = None, **kwargs) → databricks.feature_store.entities.feature_table.FeatureTable

Create and return a feature table with the given name and primary keys.

The feature table schema is either the provided schema or the schema inferred from features_df. If features_df is provided, its data will be saved in a Delta table. Supported data types for features are: IntegerType, LongType, FloatType, DoubleType, StringType, BooleanType, DateType, TimestampType, ShortType, and ArrayType.

Parameters:
  • name – A feature table name of the form <database_name>.<table_name>, for example dev.user_features.
  • keys – The primary keys. If multiple columns are required, specify a list of column names, for example ['customer_id', 'region'].
  • features_df – Data to insert into this feature table. The schema of features_df will be used as the feature table schema.
  • schema – Feature table schema. Either schema or features_df must be provided.
  • partition_columns

    Columns used to partition the feature table. If a list is provided, column ordering in the list will be used for partitioning.

    Note

    When choosing partition columns for your feature table, use columns that do not have high cardinality. Ideally, you should expect the data in each partition to be at least 1 GB. The most commonly used partition column is a date.

    Additional info: Choosing the right partition columns for Delta tables

  • description – Description of the feature table.
Other Parameters:
 
  • path (Optional[str]) – Path in a supported filesystem. Defaults to the database location.
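
For example, a minimal sketch of creating a partitioned feature table, assuming an existing Spark DataFrame customer_features_df with columns customer_id and dt (all names here are hypothetical):

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Schema is inferred from customer_features_df; partitioning on the
# low-cardinality date column dt follows the note above.
fs.create_feature_table(
    name='dev.user_features',
    keys='customer_id',
    features_df=customer_features_df,
    partition_columns='dt',
    description='Customer-level features, refreshed daily.'
)
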
create_training_set(df: pyspark.sql.dataframe.DataFrame, feature_lookups: List[databricks.feature_store.entities.feature_lookup.FeatureLookup], label: Union[str, List[str], None], exclude_columns: List[str] = []) → databricks.feature_store.training_set.TrainingSet

Create a TrainingSet.

Parameters:
  • df – The DataFrame to join feature data into.
  • feature_lookups – List of features to join into the DataFrame.
  • label – Names of column(s) in the DataFrame that contain training set labels. To create a training set without a label field, for example for unsupervised training, specify label=None.
  • exclude_columns – Names of the columns to drop from the TrainingSet DataFrame.
Returns:

A TrainingSet object.
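
A minimal sketch, assuming a DataFrame df containing customer_id and is_banned columns (the table, feature, and column names are hypothetical):

from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

feature_lookups = [
    FeatureLookup(
        table_name='dev.user_features',
        feature_name='num_purchases',
        lookup_key='customer_id'
    )
]

# Join the feature into df, then materialize the training DataFrame.
training_set = fs.create_training_set(
    df,
    feature_lookups=feature_lookups,
    label='is_banned',
    exclude_columns=['customer_id']
)
training_df = training_set.load_df()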

get_feature_table(name: str) → databricks.feature_store.entities.feature_table.FeatureTable

Get a feature table’s metadata.

Parameters:
  • name – A feature table name of the form <database_name>.<table_name>, for example dev.user_features.
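
For example (the table name is hypothetical):

ft = fs.get_feature_table('dev.user_features')
# Inspect returned metadata; reading the description attribute is an
# assumption about the FeatureTable entity, not part of this reference.
print(ft.description)
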
log_model(model: Any, artifact_path: str, *, flavor: module, training_set: databricks.feature_store.training_set.TrainingSet, registered_model_name: str = None, await_registration_for: int = 300, **kwargs)

Log an MLflow model packaged with feature lookup information.

Note

The DataFrame returned by TrainingSet.load_df() must be used to train the model. If it has been modified (for example, by normalizing data or adding a column), these modifications will not be applied at inference time, leading to training-serving skew.

Parameters:
  • model – Model to be saved. This model must be capable of being saved by flavor.save_model. See the MLflow Model API.
  • artifact_path – Run-relative artifact path.
  • flavor – MLflow module to use to log the model. flavor should have type ModuleType. The module must have a method save_model and must support the python_function flavor, for example mlflow.sklearn or mlflow.xgboost.
  • training_set – The TrainingSet used to train this model.
  • registered_model_name

    Note

    Experimental: This argument may change or be removed in a future release without warning.

    If given, create a model version under registered_model_name, also creating a registered model if one with the given name does not exist.

  • await_registration_for – Number of seconds to wait for the model version to finish being created and reach READY status. By default, the function waits for five minutes. Specify 0 or None to skip waiting.
Returns:

None
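
A minimal sketch, assuming model is a fitted scikit-learn estimator and training_set was returned by create_training_set() (the artifact path and registered model name are hypothetical):

import mlflow

with mlflow.start_run():
    fs.log_model(
        model,
        'model',
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name='example_model'
    )
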

publish_table(name: str, online_store: databricks.feature_store.online_store_spec.online_store_spec.OnlineStoreSpec, filter_condition: str = None, mode: str = 'merge', streaming: bool = False, checkpoint_location: Optional[str] = None, trigger: Dict[str, Any] = {'processingTime': '5 minutes'}) → Optional[pyspark.sql.streaming.StreamingQuery]

Publish a feature table to an online store.

Parameters:
  • name – Name of the feature table.
  • online_store – Specification of the online store.
  • filter_condition – A SQL expression using feature table columns that filters feature rows prior to publishing to the online store. For example, "dt > '2020-09-10'". This is analogous to running df.filter or a WHERE condition in SQL on a feature table prior to publishing.
  • mode

    Specifies the behavior when data already exists in this feature table in the online store. If "overwrite" mode is used, existing data is replaced by the new data. If "merge" mode is used, the new data will be merged in, under these conditions:

    • If a key exists in the online table but not the offline table, the row in the online table is unmodified.
    • If a key exists in the offline table but not the online table, the offline table row is inserted into the online table.
    • If a key exists in both the offline and the online tables, the online table row is updated.
  • streaming – If True, streams data to the online store.
  • checkpoint_location – Sets the Structured Streaming checkpointLocation option. By setting a checkpoint_location, Spark Structured Streaming will store progress information and intermediate state, enabling recovery after failures. This parameter is only supported when streaming=True.
  • trigger – If streaming=True, trigger defines the timing of stream data processing. The dictionary will be unpacked and passed to DataStreamWriter.trigger as arguments. For example, trigger={'once': True} will result in a call to DataStreamWriter.trigger(once=True).
Returns:

If streaming=True, returns a PySpark StreamingQuery, None otherwise.
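
A minimal sketch using an Amazon RDS MySQL online store (the hostname, secret prefixes, table name, and filter are placeholders; other OnlineStoreSpec subclasses follow the same pattern):

from databricks.feature_store.online_store_spec import AmazonRdsMySqlSpec

online_store = AmazonRdsMySqlSpec(
    hostname='<rds-hostname>',
    port=3306,
    read_secret_prefix='feature-store/read',
    write_secret_prefix='feature-store/write'
)

# Publish only recent rows; 'merge' upserts into existing online data.
fs.publish_table(
    name='dev.user_features',
    online_store=online_store,
    filter_condition="dt > '2020-09-10'",
    mode='merge'
)
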

read_table(name: str, as_of_delta_timestamp: str = None) → pyspark.sql.dataframe.DataFrame

Read the contents of a feature table.

Parameters:
  • name – A feature table name of the form <database_name>.<table_name>, for example dev.user_features.
  • as_of_delta_timestamp – If provided, reads the feature table as of this time. Only date or timestamp strings are accepted. For example, "2019-01-01" and "2019-01-01T00:00:00.000Z".
Returns:

The feature table contents, or None if the feature table does not exist.
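
For example, to read a snapshot of the table as of a past date (the table name is hypothetical):

features_df = fs.read_table(
    name='dev.user_features',
    as_of_delta_timestamp='2019-01-01'
)
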

score_batch(model_uri: str, df: pyspark.sql.dataframe.DataFrame, result_type: str = 'double') → pyspark.sql.dataframe.DataFrame

Evaluate the model on the provided DataFrame.

Additional features required for model evaluation will be automatically retrieved from Feature Store.

The model must have been logged with FeatureStoreClient.log_model(), which packages the model with feature metadata. Unless present in df, these features will be looked up from Feature Store and joined with df prior to scoring the model.

If a feature is included in df, the provided feature values will be used rather than those stored in Feature Store.

For example, if a model is trained on two features account_creation_date and num_lifetime_purchases, as in:

feature_lookups = [
    FeatureLookup(
        table_name = 'trust_and_safety.customer_features',
        feature_name = 'account_creation_date',
        lookup_key = 'customer_id',
    ),
    FeatureLookup(
        table_name = 'trust_and_safety.customer_features',
        feature_name = 'num_lifetime_purchases',
        lookup_key = 'customer_id'
    ),
]

with mlflow.start_run():
    training_set = fs.create_training_set(
        df,
        feature_lookups = feature_lookups,
        label = 'is_banned',
        exclude_columns = ['customer_id']
    )
    ...
    fs.log_model(
        model,
        "model",
        flavor=mlflow.sklearn,
        training_set=training_set,
        registered_model_name="example_model"
    )

Then at inference time, the caller of FeatureStoreClient.score_batch() must pass a DataFrame that includes customer_id, the lookup_key specified in the FeatureLookups of the training_set. If the DataFrame contains a column account_creation_date, the values of this column will be used in lieu of those in Feature Store. As in:

# batch_df has columns ['customer_id', 'account_creation_date']
predictions = fs.score_batch(
    'models:/example_model/1',
    batch_df
)
Parameters:
  • model_uri

    The location, in URI format, of the MLflow model logged using FeatureStoreClient.log_model(). One of:

    • runs:/<mlflow_run_id>/run-relative/path/to/model
    • models:/<model_name>/<model_version>
    • models:/<model_name>/<stage>

    For more information about URI schemes, see Referencing Artifacts.

  • df

    The DataFrame to score the model on. Feature Store features will be joined with df prior to scoring the model. df must:

    1. Contain columns for lookup keys required to join feature data from Feature Store, as specified in the feature_spec.yaml artifact.
    2. Contain columns for all source keys required to score the model, as specified in the feature_spec.yaml artifact.
    3. Not contain a column prediction, which is reserved for the model’s predictions.

    df may contain additional columns.

  • result_type – The return type of the model. See mlflow.pyfunc.spark_udf() result_type.
Returns:

A DataFrame containing:

  1. All columns of df.
  2. All feature values retrieved from Feature Store.
  3. A column prediction containing the output of the model.

write_table(name: str, df: pyspark.sql.dataframe.DataFrame, mode: str, checkpoint_location: Optional[str] = None, trigger: Dict[str, Any] = {'processingTime': '5 seconds'}) → Optional[pyspark.sql.streaming.StreamingQuery]

Write to a feature table.

If the input DataFrame is streaming, a write stream will be created.

Parameters:
  • name – A feature table name of the form <database_name>.<table_name>, for example dev.user_features. Raises an exception if this feature table does not exist.
  • df – Spark DataFrame with feature data. Raises an exception if the schema does not match that of the feature table.
  • mode

    Two supported write modes:

    • "overwrite" updates the whole table.
    • "merge" will upsert the rows in df into the feature table. If df contains columns not present in the feature table, these columns will be added as new features.
  • checkpoint_location – Sets the Structured Streaming checkpointLocation option. By setting a checkpoint_location, Spark Structured Streaming will store progress information and intermediate state, enabling recovery after failures. This parameter is only supported when the argument df is a streaming DataFrame.
  • trigger – If df.isStreaming, trigger defines the timing of stream data processing. The dictionary will be unpacked and passed to DataStreamWriter.trigger as arguments. For example, trigger={'once': True} will result in a call to DataStreamWriter.trigger(once=True).
Returns:

If df.isStreaming, returns a PySpark StreamingQuery. None otherwise.
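
A minimal sketch of a streaming merge write, assuming streaming_features_df is a streaming DataFrame whose schema matches the feature table (the table name and checkpoint path are placeholders):

# Upsert the stream into the feature table; the checkpoint location
# lets Structured Streaming recover after a failure.
query = fs.write_table(
    name='dev.user_features',
    df=streaming_features_df,
    mode='merge',
    checkpoint_location='/checkpoints/user_features',
    trigger={'processingTime': '1 minute'}
)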