Data profiling example notebook: Snapshot analysis
User requirements
- You must have access to run commands on a cluster with access to Unity Catalog.
- You must have
USE CATALOGprivilege on at least one catalog, and you must haveUSE SCHEMAprivileges on at least one schema. This notebook creates tables in themain.defaultschema. If you do not have the required privileges on themain.defaultschema, you must edit the notebook to change the default catalog and schema to ones that you do have privileges on.
System requirements:
- Your workspace must be enabled for Unity Catalog.
- Databricks Runtime 12.2 LTS or above.
- A Single user or Assigned cluster.
This notebook illustrates how to create a snapshot profile, which calculates statistics over all data in the table each time the monitor is refreshed. If you want to calculate statistics separately for time windows within a table, use a TimeSeries monitor.
For more information about data profiling, see the documentation (AWS | Azure | GCP).
Setup
- Verify cluster configuration
- Install the Python SKD
- Define catalog, schema and table names
User Journey
- Create tables: Read raw data and create the primary table (the table to be monitored) and the baseline table (which contains data known to meet expected quality standards).
- Create a monitor on the primary table and refresh it to collect monitor data.
- Inspect the metrics tables.
- Apply changes to table and refresh metrics. Inspect the metrics tables.
- [Optional] Delete the monitor.
1. Create the primary and (optional) baseline tables in Unity Catalog
- The tables must be Delta tables registered in Unity Catalog and owned by the user running the notebook.
- The table to be monitored is also called the "primary table".
- The baseline table must have the same schema as the monitored table.
This example uses the winequality datasets.
You can access documentation for commands as shown in the following cell
Create the monitor
3. Inspect the metric tables
By default, the metric tables are saved in the default database.
The create_monitor call created two new tables: the profile metrics table and the drift metrics table.
These two tables record the outputs of analysis jobs. The tables use the same name as the primary table to be monitored, with the suffixes _profile_metrics and _drift_metrics.
Orientation to the profile metrics table
The profile metrics table has the suffix _profile_metrics. For a list of statistics that are shown in the table, see the documentation (AWS | Azure | GCP).
- For every column in the primary table, the analysis table shows summary statistics for the baseline table and for the primary table. The column
log_typeshowsINPUTto indicate statistics for the primary table, andBASELINEto indicate statistics for the baseline table. The column from the primary table is identified in the columncolumn_name. - For snapshot type analysis, the
granularitycolumn is alwaysexact, because the snapshot table does not include a timestamp.exactindicates that the statistics shown in the table correspond to the exact time that the analysis was run. - The table shows statistics for each value of each slice key, and for the table as whole. Statistics for the table as a whole are indicated by
slice_key=slice_value=null. - The
windowcolumn shows the time that the analysis was run for primary table statistics. For baseline table statistics, thewindowcolumn showsnull. - Some statistics are calculated based on the table as a whole, not on a single column. In the column
column_name, these statistics are identified by:table.
Orientation to the drift metrics table
The drift metrics table has the suffix _drift_metrics. For a list of statistics that are shown in the table, see the documentation (AWS | Azure | GCP).
- For every column in the primary table, the drift table shows a set of metrics that compare the current values in the table to the values at the time of the previous analysis run and to the baseline table. The column
drift_typeshowsBASELINEto indicate drift relative to the baseline table, andCONSECUTIVEto indicate drift relative to the previous analysis. As in the profile table, the column from the primary table is identified in the columncolumn_name. - For snapshot type analysis, the
granularitycolumn is alwaysexact, because the snapshot table does not include a timestamp.exactindicates that the statistics shown in the table correspond to the exact time that the analysis was run. - The table shows statistics for each value of each slice key, and for the table as whole. Statistics for the table as a whole are indicated by
slice_key=slice_value=null. - The
windowcolumn shows the time that the analysis was run. Thewindow_cmpcolumn shows the time that the current analysis is being compared to. If the comparison is to the baseline table,window_cmpisnull. - Some statistics are calculated based on the table as a whole, not on a single column. In the column
column_name, these statistics are identified by:table.
4. Refresh metrics after changes to the table data or schema
4.1 Change some values in the table
The command in the following cell changes some values in the table. For any row where the value of the alcohol column is less than 9.0, the value is replaced by null.
Refresh metrics
Open the monitoring dashboard to notice the changes.
5. [Optional] Delete the monitor
Uncomment the following line of code to clean up the monitor. Only a single monitor can exist for a table.