feature-store-time-series-example(Python)


Feature Store Time Series Feature Table

In this notebook, you create time series feature tables based on simulated Internet of Things (IoT) sensor data. You then:

  • Generate a training set by performing a point-in-time lookup on the time series feature tables.
  • Use the training set to train a model.
  • Register the model.
  • Perform batch inference on new sensor data.

Requirements

  • This notebook is intended for workspaces that are not enabled for Unity Catalog. If your workspace is enabled for Unity Catalog, use the version of this notebook designed for Unity Catalog (AWS | Azure | GCP).
  • Databricks Runtime 10.4 LTS for Machine Learning or above.

Note: Starting with Databricks Runtime 13.2 ML, a change was made to the create_table API. Timestamp key columns must now be specified in the primary_keys argument. If you are using this notebook with Databricks Runtime 13.1 ML or below, use the commented-out code for the create_table call in Cmd 9.

Background

The data used in this notebook is simulated to represent this situation: you have a series of readings from a set of IoT sensors installed in different rooms of a warehouse. You want to use this data to train a model that can detect when a person has entered a room. Each room has a temperature sensor, a light sensor, and a CO2 sensor, each of which records data at a different frequency.

Database name: point_in_time_demo_50316e10366349078610bfb4b3bd28dd
Model name: pit_demo_model_50316e10366349078610bfb4b3bd28dd

Generate the simulated dataset

In this step, you generate the simulated dataset and then create four Spark DataFrames, one each for the light sensors, the temperature sensors, the CO2 sensors, and the ground truth.
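The generation step can be sketched as follows. This is a minimal stdlib-only illustration, not the notebook's actual code: the helper name `simulate_readings` and the column labels are hypothetical, and in the notebook the resulting rows become Spark DataFrames (for example via `spark.createDataFrame(...)`).

```python
# Hypothetical sketch: each sensor type emits readings at its own frequency,
# keyed by room and timestamp. Names and frequencies are illustrative only.
import random
from datetime import datetime, timedelta

def simulate_readings(rooms, start, periods, freq_seconds, value_fn, value_col):
    """Return a list of row dicts: one reading per room per tick."""
    rows = []
    for room in rooms:
        for i in range(periods):
            ts = start + timedelta(seconds=i * freq_seconds)
            rows.append({"room": room, "ts": ts, value_col: value_fn()})
    return rows

rooms = [0, 1, 2]
start = datetime(2024, 8, 11)
# Each sensor type records at a different frequency.
temp = simulate_readings(rooms, start, periods=60, freq_seconds=60,
                         value_fn=lambda: random.gauss(22.0, 1.5), value_col="temp_c")
light = simulate_readings(rooms, start, periods=12, freq_seconds=300,
                          value_fn=lambda: random.uniform(0, 800), value_col="lux")
co2 = simulate_readings(rooms, start, periods=6, freq_seconds=600,
                        value_fn=lambda: random.gauss(450, 50), value_col="ppm")
```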

Create the time series feature tables

In this step, you create the time series feature tables. Each table uses the room as the primary key and the reading timestamp as the timestamp key.

2024/08/11 22:32:12 INFO databricks.ml_features._compute_client._compute_client: Created feature table 'hive_metastore.point_in_time_demo_50316e10366349078610bfb4b3bd28dd.temp_sensors'.
2024/08/11 22:32:29 INFO databricks.ml_features._compute_client._compute_client: Created feature table 'hive_metastore.point_in_time_demo_50316e10366349078610bfb4b3bd28dd.light_sensors'.
2024/08/11 22:32:44 INFO databricks.ml_features._compute_client._compute_client: Created feature table 'hive_metastore.point_in_time_demo_50316e10366349078610bfb4b3bd28dd.co2_sensors'.
<FeatureTable: name='point_in_time_demo_50316e10366349078610bfb4b3bd28dd.co2_sensors', table_id='8a91fbbdf3614a22aed89e17439240bb', description='Readings from CO2 sensors', primary_keys=['r', 'co2_ts'], partition_columns=[], features=['co2_ts', 'ppm', 'r'], creation_timestamp=1723415553482, online_stores=[], notebook_producers=[notebook_id: 4161531970048598 revision_id: 1723415563747 creation_timestamp: 1723415564210 creator_id: "andrea.kress@databricks.com" notebook_workspace_id: 8498204313176882 feature_table_workspace_id: 8498204313176882 notebook_workspace_url: "https://db-sme-demo-docs.cloud.databricks.com" producer_action: CREATE ], job_producers=[], table_data_sources=[], path_data_sources=[], custom_data_sources=[], timestamp_keys=['co2_ts'], tags={}>

The time series feature tables are now visible in the Feature Store UI. The Timestamp Keys field is populated for these feature tables.

Update the time series feature tables

Suppose that after you create the feature table, you receive updated values. For example, maybe some temperature readings were incorrectly pre-processed and need to be updated in the temperature time series feature table.

When you write a DataFrame to a time series feature table, the DataFrame must specify all the features of the feature table. To update a single feature column in the time series feature table, you must first join the updated feature column with the other features in the table, specifying both a primary key and a timestamp key. Then, you can update the feature table.
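The join described above can be sketched with pandas standing in for Spark. This is an illustration only: the two-feature table (`temp_c`, `humidity`) and all names here are hypothetical, and the final write-back in the notebook would use the Feature Store client rather than pandas.

```python
# Hypothetical sketch: corrections arrive for temp_c only, so join them with
# the remaining features on the primary key + timestamp key before writing back.
import pandas as pd

feature_table = pd.DataFrame({
    "room": [0, 0, 1],
    "ts": pd.to_datetime(["2024-08-11 00:00", "2024-08-11 00:05", "2024-08-11 00:00"]),
    "temp_c": [21.0, 22.5, 20.0],
    "humidity": [0.40, 0.42, 0.38],
})

corrections = pd.DataFrame({
    "room": [0],
    "ts": pd.to_datetime(["2024-08-11 00:05"]),
    "temp_c": [23.0],  # re-processed reading
})

# The update DataFrame must carry every feature column, so join the corrected
# column with the untouched features on (room, ts).
update_df = corrections.merge(
    feature_table.drop(columns=["temp_c"]), on=["room", "ts"], how="left"
)
# In the notebook, the Spark equivalent of update_df is then written back
# to the feature table (merge semantics).
```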

Create a training set with point-in-time lookups on time series feature tables

In this step, you create a training set from the ground truth data by performing point-in-time lookups on the sensor data in the time series feature tables.

For each room and timestamp in the ground truth data, the point-in-time lookup retrieves the latest sensor value recorded at or before that timestamp.
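The as-of semantics of the lookup can be illustrated with `pandas.merge_asof`, which behaves like the point-in-time join the Feature Store performs. This is a conceptual stand-in with hypothetical column names, not the notebook's actual lookup code.

```python
# Illustration of point-in-time lookup semantics using pandas.merge_asof.
import pandas as pd

# Ground truth: labeled observations at arbitrary times per room.
ground_truth = pd.DataFrame({
    "room": [0, 0, 1],
    "ts": pd.to_datetime(["2024-08-11 00:05:30", "2024-08-11 00:12:10",
                          "2024-08-11 00:07:00"]),
    "person": [0, 1, 1],
})

# Temperature "feature table": readings every 5 minutes per room.
temps = pd.DataFrame({
    "room": [0, 0, 0, 1, 1],
    "ts": pd.to_datetime(["2024-08-11 00:00", "2024-08-11 00:05", "2024-08-11 00:10",
                          "2024-08-11 00:00", "2024-08-11 00:05"]),
    "temp_c": [21.0, 22.5, 23.1, 20.0, 20.4],
})

# merge_asof requires sorting by the time key; "by" matches the primary key.
# direction="backward" picks the latest reading at or before the label time.
train = pd.merge_asof(
    ground_truth.sort_values("ts"),
    temps.sort_values("ts"),
    on="ts", by="room", direction="backward",
)
```

Each label row ends up paired with the most recent temperature reading for its room, which is exactly what the point-in-time lookup does across all three sensor tables at once.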

Train the model

2024/08/11 22:33:23 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '345c93ec0df845a58dfbd3a0b4a5dc1b', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current lightgbm workflow
[LightGBM] [Info] Number of positive: 7021, number of negative: 36446
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001138 seconds. You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1020
[LightGBM] [Info] Number of data points in the train set: 43467, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.161525 -> initscore=-1.646926
[LightGBM] [Info] Start training from score -1.646926
2024/08/11 22:33:29 WARNING mlflow.utils.autologging_utils: MLflow autologging encountered a warning: "/databricks/python/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils."

2024/08/11 22:33:36 WARNING mlflow.models.model: Model logged without a signature. Signatures will be required for upcoming model registry features as they validate model inputs and denote the expected schema of model outputs. Please visit https://www.mlflow.org/docs/2.9.2/models.html#set-signature-on-logged-model for instructions on setting a model signature on your logged model.
2024/08/11 22:33:36 INFO mlflow.store.artifact.cloud_artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
Successfully registered model 'pit_demo_model_50316e10366349078610bfb4b3bd28dd'.
2024/08/11 22:33:38 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: pit_demo_model_50316e10366349078610bfb4b3bd28dd, version 1
Created version '1' of model 'pit_demo_model_50316e10366349078610bfb4b3bd28dd'.

Score data with point-in-time lookups on time series feature tables

The point-in-time lookup metadata used to create the training set is packaged with the model, so the same lookup can be performed during scoring.
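Conceptually, the logged model bundles the lookup specification with the prediction function, so callers pass only the keys and timestamps. The sketch below is a pure-Python stand-in with hypothetical names (`PackagedModel`, the stub occupancy predictor); the notebook itself uses the Feature Store client's batch-scoring API on the registered MLflow model.

```python
# Conceptual sketch: a "packaged" model that carries its own as-of lookup,
# so scoring only needs (room, ts) keys. All names here are hypothetical.
from bisect import bisect_right

class PackagedModel:
    """Bundles a prediction fn with the point-in-time lookup it was trained with."""
    def __init__(self, predict_fn, feature_rows):
        self.predict_fn = predict_fn
        # index: room -> sorted list of (ts, value) readings
        self.index = {}
        for room, ts, value in feature_rows:
            self.index.setdefault(room, []).append((ts, value))
        for readings in self.index.values():
            readings.sort()

    def lookup(self, room, ts):
        """Latest feature value at or before ts for this room (as-of semantics)."""
        readings = self.index[room]
        i = bisect_right(readings, (ts, float("inf")))
        return readings[i - 1][1]

    def score_batch(self, key_rows):
        # Caller supplies only (room, ts); features are looked up automatically.
        return [self.predict_fn(self.lookup(room, ts)) for room, ts in key_rows]

model = PackagedModel(
    predict_fn=lambda temp: int(temp > 23.0),   # stand-in "occupancy" model
    feature_rows=[(0, 0, 21.0), (0, 300, 23.5), (1, 0, 20.0)],  # ts in seconds
)
preds = model.score_batch([(0, 310), (1, 100)])
```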

2024/08/11 22:33:46 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
2024/08/11 22:33:47 WARNING mlflow.pyfunc: Calling `spark_udf()` with `env_manager="local"` does not recreate the same environment that was used during training, which may lead to errors or inaccurate predictions. We recommend specifying `env_manager="conda"`, which automatically recreates the environment that was used to train the model and performs inference in the recreated environment.
2024/08/11 22:33:47 INFO mlflow.models.flavor_backend_registry: Selected backend for flavor 'python_function'