Lakehouse monitoring example notebook: InferenceLog classification analysis
User requirements
- You must have access to run commands on a cluster with access to Unity Catalog.
- You must have `USE CATALOG` privilege on at least one catalog, and you must have `USE SCHEMA` privileges on at least one schema. This notebook creates tables in the `main.default` schema. If you do not have the required privileges on the `main.default` schema, you must edit the notebook to change the default catalog and schema to ones that you do have privileges on.
System requirements:
- Your workspace must be enabled for Unity Catalog.
- Databricks Runtime 12.2 LTS ML or above.
- A Single user or Assigned cluster.
This notebook illustrates how to train and deploy a classification model and monitor its corresponding batch inference table.
For more information about Lakehouse monitoring, see the documentation (AWS | Azure).
Setup
- Verify cluster configuration
- Install Python SDK
- Define catalog, schema, model and table names
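A minimal sketch of this setup, with illustrative table and model names (assumptions, not the notebook's exact values; adjust them to a catalog and schema you have privileges on). The install lines shown in comments are the usual notebook pattern for the Python SDK rather than something prescribed here:

```python
# The Databricks Python SDK is typically installed in its own notebook cell, e.g.:
#   %pip install -U databricks-sdk
#   dbutils.library.restartPython()

# Illustrative names (assumptions; change to a catalog/schema you have privileges on)
CATALOG = "main"
SCHEMA = "default"
model_name = f"{CATALOG}.{SCHEMA}.adult_census_model"            # hypothetical registered model name
inference_table = f"{CATALOG}.{SCHEMA}.adult_census_inference"   # table the monitor will be attached to
baseline_table = f"{CATALOG}.{SCHEMA}.adult_census_baseline"     # baseline / reference table
```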
Helper methods
The functions here are for cleanup in case the notebook has been run multiple times. You would not typically use these functions in a normal setup.
Background
The following are required to create an inference log monitor:
- A Delta table in Unity Catalog that you own. The data can be batch scored data or inference logs. The following columns are required:
  - `timestamp` (TimeStamp): Used for windowing and aggregation when calculating metrics.
  - `model_id` (String): Model version/id used for each prediction.
  - `prediction` (String): Value predicted by the model.
- The following column is optional:
  - `label` (String): Ground truth label.
You can also provide an optional baseline table to track performance changes in the model and drifts in the statistical characteristics of features.
- To track performance changes in the model, consider using the test or validation set.
- To track drifts in feature distributions, consider using the training set or the associated feature tables.
- The baseline table must use the same column names as the monitored table, and must also have a `model_version` column.
Databricks recommends enabling Delta's Change-Data-Feed (AWS|Azure) table property for better metric computation performance for all monitored tables, including the baseline table. This notebook shows how to enable Change Data Feed when you create the Delta table.
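For example, the property can be set when the Delta table is first created via a writer option; this is a sketch using the illustrative names defined above, with the SQL `ALTER TABLE` form shown as an alternative for existing tables:

```python
# Sketch: enable Change Data Feed at table-creation time (df and table names are illustrative)
(df
    .write
    .format("delta")
    .mode("overwrite")
    .option("delta.enableChangeDataFeed", "true")
    .saveAsTable(inference_table))

# For a table that already exists, the property can be set with SQL instead:
# spark.sql(f"ALTER TABLE {inference_table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")
```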
User Journey
- Create Delta table: Read raw input and features data and create training and inference sets.
- Train a model and register it in the MLflow Model Registry.
- Generate predictions on the test set and create the baseline table.
- Generate predictions on `scoring_df1`. This is the inference table.
- Create the monitor on the inference table and analyze profile/drift metrics and fairness and bias metrics.
- Simulate drifts in 3 relevant features in `scoring_df2` and generate/materialize predictions.
- Add/Join ground-truth labels to the monitoring table and refresh the monitor.
- [Optional] Calculate custom metrics.
- [Optional] Delete the monitor.
1. Read dataset and prepare data
Dataset used for this example: UCI's Adult Census
- Add a dummy identifier
- Clean and standardize missing values
1.1 Split data
Split data into a training set, baseline test table, and inference table.
- The baseline test data will serve as the table with reference feature distributions.
- The inference table will then be split into two dataframes, `scoring_df1` and `scoring_df2`: they will function as new incoming batches for scoring. We will further simulate drifts on the `scoring_df`(s).
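A minimal sketch of such a split, assuming a prepared DataFrame named `data_df` (the fractions and seed are assumptions, not the notebook's exact values):

```python
# Split into training, baseline test, and inference sets (fractions and seed are illustrative)
train_df, baseline_test_df, inference_df = data_df.randomSplit([0.60, 0.20, 0.20], seed=42)
```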
2. Train a random forest model
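A minimal sketch of this step, assuming scikit-learn for the model and MLflow for registration; the feature lists, label column (`income`), hyperparameters, and registry URI are assumptions rather than the notebook's exact choices:

```python
import mlflow
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Illustrative feature lists for the Adult Census data
categorical_cols = ["workclass", "education", "marital_status", "occupation", "gender"]
numeric_cols = ["age", "hours_per_week"]

pipeline = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough")),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

train_pdf = train_df.toPandas()
mlflow.set_registry_uri("databricks-uc")  # assumption: registering to Unity Catalog
with mlflow.start_run():
    pipeline.fit(train_pdf[categorical_cols + numeric_cols], train_pdf["income"])
    mlflow.sklearn.log_model(pipeline, artifact_path="model", registered_model_name=model_name)
```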
4. Generate predictions on incoming scoring data
Example pre-processing step
- Extract ground-truth labels (in practice, labels might arrive later)
- Split into two batches
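A sketch of these two steps, assuming the `inference_df` from the split above has an `id` column and an `income` label (column names and fractions are assumptions):

```python
# Keep the ground-truth labels aside; in practice they often arrive later
labels_df = inference_df.select("id", "income")
scoring_features_df = inference_df.drop("income")

# Split into two batches that will be scored at different points in time
scoring_df1, scoring_df2 = scoring_features_df.randomSplit([0.5, 0.5], seed=42)
```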
4.1 Write scoring data with predictions out
- Add `model_version` column and write to the table that we will attach a monitor to.
- Add ground-truth `label_col` column with empty/NaN values.

Set `mergeSchema` to `True` to enable appending dataframes without the label column available.
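A sketch of this write, assuming a scored DataFrame named `scoring_df1_with_pred` and the illustrative table names defined above (the model version literal is a placeholder):

```python
from pyspark.sql import functions as F

(scoring_df1_with_pred
    .withColumn("model_version", F.lit("1"))                # placeholder model version/id
    .withColumn("label_col", F.lit(None).cast("string"))    # ground truth not yet available
    .write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable(inference_table))
```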
5. Create the monitor
Use `InferenceLog` type analysis.
Make sure to drop any column that you don't want to track or that doesn't make sense from a business or use-case perspective; alternatively, create a VIEW containing only the columns of interest and monitor that view instead.
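A minimal sketch of the monitor creation, assuming a recent `databricks-sdk` where monitors are exposed as `quality_monitors` (older SDK versions used `lakehouse_monitors`); the column names, granularity, output schema, and assets directory are assumptions:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorInferenceLog, MonitorInferenceLogProblemType

w = WorkspaceClient()
monitor_info = w.quality_monitors.create(
    table_name=inference_table,
    inference_log=MonitorInferenceLog(
        timestamp_col="timestamp",
        granularities=["1 day"],
        model_id_col="model_version",          # the column written out in section 4.1
        prediction_col="prediction",
        label_col="label_col",
        problem_type=MonitorInferenceLogProblemType.PROBLEM_TYPE_CLASSIFICATION,
    ),
    baseline_table_name=baseline_table,
    output_schema_name=f"{CATALOG}.{SCHEMA}",              # where the metrics tables are written
    assets_dir="/Workspace/Shared/lakehouse_monitoring",   # any workspace path you can write to
)
```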
5.1 Inspect the metrics tables
By default, the metrics tables are saved in the default database.
The `create_monitor` call created two new tables: the profile metrics table and the drift metrics table.
These two tables record the outputs of analysis jobs. The tables use the same name as the primary table to be monitored, with the suffixes `_profile_metrics` and `_drift_metrics`.
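For example, once the initial analysis has completed, the tables can be read directly; the names below simply follow the suffix convention described above and assume the metrics tables land next to the monitored table:

```python
# Inspect the generated metrics tables (names are illustrative)
profile_df = spark.table(f"{inference_table}_profile_metrics")
drift_df = spark.table(f"{inference_table}_drift_metrics")
display(profile_df)
display(drift_df)
```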
Orientation to the profile metrics table
The profile metrics table has the suffix `_profile_metrics`. For a list of statistics that are shown in the table, see the documentation (AWS|Azure).
- For every column in the primary table, the profile table shows summary statistics for the baseline table and for the primary table. The column `log_type` shows `INPUT` to indicate statistics for the primary table, and `BASELINE` to indicate statistics for the baseline table. The column from the primary table is identified in the column `column_name`.
- For `TimeSeries` type analysis, the `granularity` column shows the granularity corresponding to the row. For baseline table statistics, the `granularity` column shows `null`.
- The table shows statistics for each value of each slice key in each time window, and for the table as a whole. Statistics for the table as a whole are indicated by `slice_key` = `slice_value` = `null`.
- In the primary table, the `window` column shows the time window corresponding to that row. For baseline table statistics, the `window` column shows `null`.
- Some statistics are calculated based on the table as a whole, not on a single column. In the column `column_name`, these statistics are identified by `:table`.
Orientation to the drift metrics table
The drift metrics table has the suffix `_drift_metrics`. For a list of statistics that are shown in the table, see the documentation (AWS|Azure).
- For every column in the primary table, the drift table shows a set of metrics that compare the current values in the table to the values at the time of the previous analysis run and to the baseline table. The column `drift_type` shows `BASELINE` to indicate drift relative to the baseline table, and `CONSECUTIVE` to indicate drift relative to a previous time window. As in the profile table, the column from the primary table is identified in the column `column_name`.
  - At this point, because this is the first run of this monitor, there is no previous window to compare to, so there are no rows where `drift_type` is `CONSECUTIVE`.
- For `TimeSeries` type analysis, the `granularity` column shows the granularity corresponding to that row.
- The table shows statistics for each value of each slice key in each time window, and for the table as a whole. Statistics for the table as a whole are indicated by `slice_key` = `slice_value` = `null`.
- The `window` column shows the time window corresponding to that row. The `window_cmp` column shows the comparison window. If the comparison is to the baseline table, `window_cmp` is `null`.
- Some statistics are calculated based on the table as a whole, not on a single column. In the column `column_name`, these statistics are identified by `:table`.
6. Create data drift(s) in 3 features
Simulate distribution changes for `workclass`, `gender`, and `hours_per_week`, as sketched below.
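A sketch of such a simulation on `scoring_df2`; the exact transformations are assumptions, meant only to shift these three distributions:

```python
from pyspark.sql import functions as F

scoring_df2_drifted = (
    scoring_df2
    .withColumn("workclass", F.lit("Private"))                                    # collapse categories
    .withColumn("gender", F.when(F.rand(seed=42) < 0.8, F.lit("Female"))
                           .otherwise(F.col("gender")))                           # shift class proportions
    .withColumn("hours_per_week", (F.col("hours_per_week") * 1.5).cast("int"))    # scale a numeric feature
)
```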
6.1 Generate predictions on drifted observations and update inference tables
- Add the column `model_id`
7. (Ad-hoc) Join/Update ground-truth labels to inference table
Note: if the ground-truth value can change for a given id over time, consider also joining/merging on the timestamp column.
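A sketch of the merge, assuming a `labels_df` of ids and late-arriving labels as in section 4 (the key and column names are assumptions):

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, inference_table)
(target.alias("t")
    .merge(labels_df.alias("l"), "t.id = l.id")     # add "AND t.timestamp = l.timestamp" if labels can change over time
    .whenMatchedUpdate(set={"label_col": "l.income"})
    .execute())
```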
8. Refresh metrics and inspect the dashboard
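With the SDK client from section 5, a metrics refresh can be triggered roughly like this (a sketch, not the notebook's exact call):

```python
# Trigger a refresh after new data or labels land in the monitored table
refresh_info = w.quality_monitors.run_refresh(table_name=inference_table)
print(refresh_info.state)
```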
9. [Optional] Delete the monitor
Uncomment the following line of code to clean up the monitor (if you wish to run the quickstart on this table again).
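Under the same assumptions as the SDK sketch in section 5, the cleanup call would look roughly like the following (left commented, matching the note above):

```python
# w.quality_monitors.delete(table_name=inference_table)
```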