feature-store-taxi-example (Python)


Feature Store taxi example with Point-in-Time Lookup

This notebook illustrates the use of Feature Store to create a model that predicts NYC Yellow Taxi fares. It includes these steps:

  • Compute and write time series features.
  • Train a model using these features to predict fares.
  • Evaluate that model on a new batch of data using existing features, saved to Feature Store.

Requirements

  • Databricks Runtime 10.4 LTS for Machine Learning or above.
    • If you do not have access to Databricks Runtime ML, you can run this notebook on Databricks Runtime 10.4 LTS or above. To do so, run %pip install databricks-feature-store at the start of this notebook.

Note: Starting with Databricks Runtime 13.2 ML, a change was made to the create_table API. Timestamp key columns must now be specified in the primary_keys argument. If you are using this notebook with Databricks Runtime 13.1 ML or below, use the commented-out form of the create_table call shown later in this notebook.

Compute features

Load the raw data used to compute features

Load the nyc-taxi-tiny dataset. It was generated from the full NYC Taxi Data, which can be found at /databricks-datasets/nyctaxi, by applying the following transformations:

  1. Apply a UDF to convert latitude and longitude coordinates into ZIP codes, and add a ZIP code column to the DataFrame.
  2. Subsample the dataset into a smaller dataset based on a date range query using the .sample() method of the Spark DataFrame API.
  3. Rename certain columns and drop unnecessary columns.
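A minimal sketch of loading the subsampled dataset; the Delta path below is an assumption and may differ in your workspace:

raw_data = spark.read.format("delta").load(
  "/databricks-datasets/nyctaxi-with-zipcodes/subsampled"  # assumed path; adjust for your workspace
)
display(raw_data)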

From the taxi fares transactional data, we will compute two groups of features based on trip pickup and drop-off ZIP codes.

Pickup features

  1. Count of trips (time window = 1 hour, sliding window = 15 minutes)
  2. Mean fare amount (time window = 1 hour, sliding window = 15 minutes)

Drop off features

  1. Count of trips (time window = 30 minutes)
  2. Does trip end on the weekend (custom feature using Python code)

Helper functions


Data scientist's custom code to compute features

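The original compute cells are not reproduced here; the following is a minimal sketch of what the pickup-feature computation could look like, assuming the raw data has pickup_zip, tpep_pickup_datetime, and fare_amount columns. The drop-off features can be computed analogously with a 30-minute tumbling window.

from pyspark.sql import functions as F

def pickup_features_fn(df, ts_column, start_date, end_date):
  """Aggregate trip counts and mean fares over 1-hour windows sliding every 15 minutes."""
  df = df.filter((F.col(ts_column) >= start_date) & (F.col(ts_column) < end_date))
  return (
    df.groupBy("pickup_zip", F.window(ts_column, "1 hour", "15 minutes"))
    .agg(
      F.mean("fare_amount").alias("mean_fare_window_1h_pickup_zip"),
      F.count("*").alias("count_trips_window_1h_pickup_zip"),
    )
    .select(
      F.col("pickup_zip").alias("zip"),
      F.col("window.end").alias("ts"),  # the window end serves as the timestamp key
      "mean_fare_window_1h_pickup_zip",
      "count_trips_window_1h_pickup_zip",
    )
  )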

Use the Feature Store library to create new time series feature tables

First, create the database where the feature tables will be stored.

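For example, using the feature_store_taxi_example database name used throughout this notebook:

spark.sql("CREATE DATABASE IF NOT EXISTS feature_store_taxi_example")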

Next, create an instance of the Feature Store client.

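For example:

from databricks import feature_store

fs = feature_store.FeatureStoreClient()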

To create a time series feature table, the DataFrame or schema must contain a column that you designate as the timestamp key. The timestamp key column must be of TimestampType or DateType and cannot also be a primary key.

Use the create_table API to define the schema, unique ID keys, and timestamp keys. If the optional argument df is passed, the API also writes the data to Feature Store.

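A sketch of the call, assuming a pickup_features DataFrame computed with the helper sketched earlier; the description is illustrative:

fs.create_table(
  name="feature_store_taxi_example.trip_pickup_time_series_features",
  primary_keys=["zip", "ts"],  # Databricks Runtime 13.2 ML and above: timestamp key included here
  timestamp_keys=["ts"],
  df=pickup_features,
  description="Taxi fares. Pickup time series features.",
)

# For Databricks Runtime 13.1 ML or below, specify the timestamp key
# only in timestamp_keys:
# fs.create_table(
#   name="feature_store_taxi_example.trip_pickup_time_series_features",
#   primary_keys=["zip"],
#   timestamp_keys=["ts"],
#   df=pickup_features,
#   description="Taxi fares. Pickup time series features.",
# )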

Update features

Use the write_table function to update the feature table values.

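For instance, you could recompute features for a later date range using the pickup_features_fn helper sketched earlier (the column name and dates are illustrative) and merge them into the table:

from datetime import datetime

new_pickup_features = pickup_features_fn(
  raw_data,
  ts_column="tpep_pickup_datetime",
  start_date=datetime(2016, 2, 1),
  end_date=datetime(2016, 3, 1),
)
fs.write_table(
  name="feature_store_taxi_example.trip_pickup_time_series_features",
  df=new_pickup_features,
  mode="merge",
)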

When writing, both merge and overwrite modes are supported.

fs.write_table(
  name="feature_store_taxi_example.trip_pickup_time_series_features",
  df=new_pickup_features,
  mode="overwrite",
)

Data can also be streamed into Feature Store by passing a DataFrame where df.isStreaming is True:

fs.write_table(
  name="feature_store_taxi_example.trip_pickup_time_series_features",
  df=streaming_pickup_features,
  mode="merge",
)

You can schedule a notebook to periodically update features using Databricks Jobs (AWS|Azure|GCP).

Analysts can interact with Feature Store using SQL, for example:

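A sketch of such a query, run from Python; the count_trips_window_30m_dropoff_zip column name is an assumption based on the 30-minute trip count feature described above:

display(spark.sql("""
  SELECT
    dropoff_zip,
    SUM(count_trips_window_30m_dropoff_zip) AS num_rides
  FROM feature_store_taxi_example.trip_dropoff_time_series_features
  GROUP BY dropoff_zip
  ORDER BY num_rides DESC
"""))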

Feature search and discovery

You can now discover your feature tables in the Feature Store UI. Search for "trip_pickup_time_series_features" or "trip_dropoff_time_series_features" and click the table name to see details such as table schema, metadata, data sources, producers, and online stores. You can also edit the description for the feature table. For more information about feature discovery and tracking feature lineage, see (AWS|Azure|GCP).

You can also set feature table permissions in the Feature Store UI. For details, see (AWS|Azure|GCP).

Train a model

This section illustrates how to create a training set from the time series pickup and dropoff feature tables using point-in-time lookup, and how to train a model on that training set. It trains a LightGBM model to predict taxi fare.

Helper functions


Understanding how a training dataset is created

To train a model, you first need to create a training dataset. The training dataset is composed of:

1. Raw input data
2. Features from the feature store

The raw input data is needed because it contains:

1. Primary keys and timestamp keys, which are used to join with features with point-in-time correctness (AWS|Azure|GCP).
2. Raw features like trip_distance that are not in the feature store.
3. Prediction targets like fare that are required for model training.

Conceptually, the raw input data is combined with the features in the Feature Store to produce the training dataset.

These concepts are described further in the Creating a Training Dataset documentation (AWS|Azure|GCP).

The next cell loads features from Feature Store for model training by creating a FeatureLookup for each needed feature.

To perform a point-in-time lookup for feature values from a time series feature table, you must specify a timestamp_lookup_key in the feature's FeatureLookup. This indicates the name of the DataFrame column that contains the timestamps against which to look up time series features. For each row in the DataFrame, Databricks Feature Store retrieves the latest feature values prior to the timestamp specified in the DataFrame's timestamp_lookup_key column whose primary keys match the values in the DataFrame's lookup_key columns, or null if no such feature value exists.

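A sketch of those lookups, assuming the feature and key column names from the computation sketched earlier:

from databricks.feature_store import FeatureLookup

pickup_feature_lookups = [
  FeatureLookup(
    table_name="feature_store_taxi_example.trip_pickup_time_series_features",
    feature_names=["mean_fare_window_1h_pickup_zip", "count_trips_window_1h_pickup_zip"],
    lookup_key=["pickup_zip"],
    timestamp_lookup_key="tpep_pickup_datetime",
  )
]

dropoff_feature_lookups = [
  FeatureLookup(
    table_name="feature_store_taxi_example.trip_dropoff_time_series_features",
    feature_names=["count_trips_window_30m_dropoff_zip", "dropoff_is_weekend"],
    lookup_key=["dropoff_zip"],
    timestamp_lookup_key="tpep_dropoff_datetime",
  )
]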

When fs.create_training_set(..) is invoked, the following steps take place:

1. A TrainingSet object is created, which selects specific features from Feature Store to use in training your model. Each feature is specified by the FeatureLookups created previously.

2. Features are joined with the raw input data according to each FeatureLookup's lookup_key.

3. Point-in-time lookup is applied to avoid data leakage problems. Only the most recent feature values prior to each row's timestamp_lookup_key are joined.

The TrainingSet is then transformed into a DataFrame for training. This DataFrame includes the columns of taxi_data, as well as the features specified in the FeatureLookups.

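A sketch of creating the training set and materializing it, assuming the raw trips are in a taxi_data DataFrame and the lookups defined above:

training_set = fs.create_training_set(
  taxi_data,
  feature_lookups=pickup_feature_lookups + dropoff_feature_lookups,
  label="fare_amount",
  exclude_columns=["tpep_pickup_datetime", "tpep_dropoff_datetime"],  # keys not needed as features
)

training_df = training_set.load_df()
display(training_df)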

Train a LightGBM model on the data returned by TrainingSet.load_df, then log the model with FeatureStoreClient.log_model. The model will be packaged with feature metadata.

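A minimal sketch, assuming training_df and training_set from the previous step; the hyperparameters and registered model name are illustrative:

import lightgbm as lgb
import mlflow
import mlflow.lightgbm
from sklearn.model_selection import train_test_split

data = training_df.toPandas()
train, test = train_test_split(data, random_state=123)
X_train = train.drop(["fare_amount"], axis=1)
y_train = train["fare_amount"]

with mlflow.start_run():
  # Train a simple regression model on the joined features
  model = lgb.train(
    {"num_leaves": 32, "objective": "regression"},
    lgb.Dataset(X_train, label=y_train),
    num_boost_round=100,
  )

  # Log the model together with feature metadata so score_batch can
  # look up features automatically at inference time
  fs.log_model(
    model,
    artifact_path="model_packaged",
    flavor=mlflow.lightgbm,
    training_set=training_set,
    registered_model_name="taxi_example_fare_time_series_packaged",
  )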

Build and log a custom PyFunc model

To add preprocessing or post-processing code to the model and generate processed predictions with batch inference, you can build a custom PyFunc MLflow model that encapsulates these methods. The following cell shows an example that returns a string output based on the numeric prediction from the model.

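A sketch of such a wrapper; the class name, $10 threshold, and registered model name are illustrative:

import mlflow
import mlflow.pyfunc

class FareClassifier(mlflow.pyfunc.PythonModel):
  """Wraps the trained model and post-processes numeric fare predictions into strings."""

  def __init__(self, trained_model):
    self.model = trained_model

  def postprocess_result(self, results):
    # Illustrative threshold: label predicted fares above $10 as "Expensive"
    return ["Expensive" if fare > 10 else "Regular" for fare in results]

  def predict(self, context, model_input):
    return self.postprocess_result(self.model.predict(model_input))

pyfunc_model = FareClassifier(model)

with mlflow.start_run():
  fs.log_model(
    pyfunc_model,
    artifact_path="pyfunc_packaged_model",
    flavor=mlflow.pyfunc,
    training_set=training_set,
    registered_model_name="pyfunc_taxi_fare_time_series_packaged",
  )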

Scoring: batch inference

Suppose another data scientist now wants to apply this model to a different batch of data.

Display the data to use for inference, reordered to highlight the fare_amount column, which is the prediction target.

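For example, assuming the new batch is in a new_taxi_data DataFrame:

cols = ["fare_amount"] + [c for c in new_taxi_data.columns if c != "fare_amount"]
display(new_taxi_data.select(cols))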

Use the score_batch API to evaluate the model on the batch of data, retrieving the needed features from Feature Store.

When you score a model trained with features from time series feature tables, Databricks Feature Store retrieves the appropriate features using point-in-time lookups with metadata packaged with the model during training. The DataFrame you provide to FeatureStoreClient.score_batch must contain a timestamp column with the same name and DataType as the timestamp_lookup_key of the FeatureLookup provided to FeatureStoreClient.create_training_set.

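A sketch, assuming version 1 of the registered model; use the version your fs.log_model call produced:

predictions = fs.score_batch(
  "models:/taxi_example_fare_time_series_packaged/1",
  new_taxi_data,
)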

To score using the logged PyFunc model:

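Again assuming version 1 of the registered PyFunc model:

pyfunc_predictions = fs.score_batch(
  "models:/pyfunc_taxi_fare_time_series_packaged/1",
  new_taxi_data,
  result_type="string",  # the PyFunc wrapper returns strings rather than numbers
)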

View the taxi fare predictions

This code reorders the columns to show the taxi fare predictions in the first column. Note that the predicted_fare_amount roughly lines up with the actual fare_amount, although more data and feature engineering would be required to improve the model accuracy.

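A sketch, assuming score_batch returned its output in a prediction column, renamed here to predicted_fare_amount:

cols = ["prediction", "fare_amount"] + [
  c for c in predictions.columns if c not in ("prediction", "fare_amount")
]
display(
  predictions.select(cols).withColumnRenamed("prediction", "predicted_fare_amount")
)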

View the PyFunc predictions


Next steps

1. Explore the feature tables created in this example in the Feature Store UI.
2. Adapt this notebook to your own data and create your own feature tables.