Anomaly detection
This feature is in Beta.
This page describes what anomaly detection is, what it monitors, and how to use it.
To provide feedback on anomaly detection, email lakehouse-monitoring-feedback@databricks.com.
What is anomaly detection?
With Lakehouse Monitoring anomaly detection, you can easily monitor the data quality of all of the tables in a schema. Databricks leverages data intelligence to automatically assess data quality, specifically evaluating the freshness and completeness of each table. Quality insights are populated in health indicators so consumers can understand health at a glance. Data owners have access to logging tables and dashboards so they can quickly identify, set alerts, and resolve anomalies across an entire schema.
Anomaly Detection does not modify any tables it monitors, nor does it add overhead to any jobs that populate these tables.
Requirements
- Unity Catalog enabled workspace.
- Serverless compute enabled. For instructions, see Connect to serverless compute.
- To enable anomaly detection on a schema, you must have the MANAGE privilege on the schema or on its parent catalog.
How does anomaly detection work?
Databricks monitors enabled tables for freshness and completeness.
Freshness refers to how recently a table has been updated. Anomaly detection analyzes the history of commits to a table and builds a per-table model to predict the time of the next commit. If a commit is unusually late, the table is marked as stale. For time series tables, you can specify event time columns. Anomaly detection then detects if the data's ingestion latency, defined as the difference between the commit time and the event time, is unusually high.
Completeness refers to the number of rows expected to be written to the table in the last 24 hours. Anomaly detection analyzes the historical row count, and based on this data, predicts a range of expected number of rows. If the number of rows committed over the last 24 hours is less than the lower bound of this range, a table is marked as incomplete.
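Both checks are learned from the table's own commit history rather than from a fixed rule. If you want to eyeball the same signals yourself on a Delta table, you can inspect its commit log directly; the table name below is only an illustration.
-- Show the commit history that the freshness model is learned from.
-- The timestamp column shows when each commit landed; for write commits,
-- operationMetrics includes numOutputRows, the per-commit row counts behind the completeness check.
DESCRIBE HISTORY main.sales.orders;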
Enable anomaly detection on a schema
To enable anomaly detection on a schema, navigate to the schema in Unity Catalog.
- On the schema page, click the Details tab.
- Click the Anomaly Detection toggle to enable it.
- A Databricks scheduled job is initiated to scan the schema. For the first job run, Databricks also runs anomaly detection on historical data ("backtesting") to check the quality of your tables as if anomaly detection had been enabled on your schema two weeks ago. These results are logged to the logging table.
- By default, the job runs every 6 hours. To change this setting, see Set parameters for freshness and completeness evaluation.
- To view the progress of the job or to configure the job, click the gear icon next to the toggle. In the dialog that appears, click View details.
- When the job is complete, you will see the detected anomalies logged in the output logging table, with insights populated in the Anomaly Detection Dashboard. You can access the dashboard at any time by clicking See results next to the anomaly detection toggle.
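If you prefer SQL over the dashboard, you can also query the logging table directly once the first run completes. A minimal sketch, assuming the default logging table name described in Review logged results (replace the catalog and schema placeholders):
-- List the most recently detected anomalies for the schema.
SELECT evaluated_at, table_name, quality_check_type, status, additional_debug_info
FROM <catalog>.<schema>._quality_monitoring_summary
WHERE status = 'Unhealthy'
ORDER BY evaluated_at DESC
LIMIT 100;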
Anomaly detection dashboard
The first anomaly detection run creates a dashboard to summarize results and trends derived from the logging table. The job run displays a button that you can use to access the dashboard. The dashboard is automatically populated with insights for the scanned schema. A single dashboard is created per workspace at this path: /Shared/Databricks Anomaly Detection/Anomaly Detection.
Quality overview
The Quality Overview tab shows a summary of the latest quality status of tables in your schema based on the most recent anomaly detection evaluation.
To get started, enter the name of the logging table for the schema you want to analyze; this populates the dashboard.
The top section of the dashboard shows an overview of the results of the scan.
The Breakdown of Quality Checks section shows the details of the freshness and completeness checks, along with plots of the results over time.
Below the summary is a table of anomaly detection results from the most recent run, along with a table of identified static tables that have not been updated in a long time. Only the recent incidents table is shown here.
Table quality details
The Table Quality Details tab allows you to dive deeper into trends and analyze specific tables in your schema. Again, enter the logging table name that corresponds to the schema your table is in, and select your table from the drop-down menu. Optionally, set the evaluated_at window to show only anomaly detection results from that timeframe.
Given a table, the tab shows summaries from each quality check for the table, with graphs of predicted and observed values at each evaluation timestamp. The graphs plot results starting from two weeks before your first anomaly detection run.
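You can reproduce the same drill-down in SQL by filtering the logging table to a single table and an evaluated_at window. The table name and window below are illustrative.
-- Quality-check history for one table over the last 7 days.
SELECT evaluated_at, quality_check_type, status, additional_debug_info
FROM <catalog>.<schema>._quality_monitoring_summary
WHERE table_name = '<table-name>'
  AND evaluated_at >= current_timestamp() - INTERVAL 7 DAY
ORDER BY evaluated_at;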
View health indicators
Anomaly detection provides data consumers with a quick visual confirmation of the data freshness of the tables they use.
On the schema page, in the Overview tab, tables that passed the most recent freshness scan are marked with a green dot. Tables that failed the scan are shown with an orange dot.
Click on the dot to see the time and status of the most recent scan.
As a data owner, you can easily assess the overall health of your schema by sorting tables based on quality. Use the Sort menu at the upper-right of the table list to sort tables by quality.
On the table page, in the Overview tab, a Quality indicator shows the status of the table and lists any anomalies that were identified in the most recent scan.
Set up alerts
To configure a Databricks SQL alert on the output results table, follow these steps in the Databricks alerts UI.
- Configure the query for the alert:
WITH rounded_data AS (
SELECT
DATE_TRUNC('HOUR', evaluated_at) AS evaluated_at,
CONCAT(catalog, ".", schema, ".", table_name) as full_table_name,
table_name,
status,
MAX(downstream_impact.num_queries_on_affected_tables) AS impacted_queries,
MAX(CASE WHEN quality_check_type = 'Freshness' AND additional_debug_info.commit_staleness.expectation IS NOT NULL
THEN additional_debug_info.commit_staleness.expectation END) AS commit_expected,
MAX(CASE WHEN quality_check_type = 'Freshness' AND additional_debug_info.commit_staleness.actual_value IS NOT NULL
THEN additional_debug_info.commit_staleness.actual_value END) AS commit_actual,
MAX(CASE WHEN quality_check_type = 'Freshness' AND additional_debug_info.event_staleness.expectation IS NOT NULL
THEN additional_debug_info.event_staleness.expectation END) AS event_expected,
MAX(CASE WHEN quality_check_type = 'Freshness' AND additional_debug_info.event_staleness.actual_value IS NOT NULL
THEN additional_debug_info.event_staleness.actual_value END) AS event_actual,
MAX(CASE WHEN quality_check_type = 'Completeness' AND additional_debug_info.daily_row_count.expectation IS NOT NULL
THEN additional_debug_info.daily_row_count.expectation END) AS completeness_expected,
MAX(CASE WHEN quality_check_type = 'Completeness' AND additional_debug_info.daily_row_count.actual_value IS NOT NULL
THEN additional_debug_info.daily_row_count.actual_value END) AS completeness_actual
FROM <catalog>.<schema>._quality_monitoring_summary
GROUP BY ALL
)
SELECT
evaluated_at,
full_table_name,
status,
commit_expected,
commit_actual,
event_expected,
event_actual,
completeness_expected,
completeness_actual,
impacted_queries,
CONCAT("<link-to-dashboard>&f_table-quality-details~table-quality-details-logging-table-name=<catalog>.<schema>._quality_monitoring_summary&f_table-quality-details~9d146eba=", table_name) AS dash_link
FROM rounded_data
WHERE
evaluated_at >= current_timestamp() - INTERVAL 6 HOUR AND
-- enter the minimum number of queries on affected tables before the alert is triggered
impacted_queries > <min-tables-affected> AND
  status = "Unhealthy"
- Configure the alert condition (for example, trigger the alert when the result row count is greater than 0).
- Customize the email template:
<h4>The following tables are failing quality checks in the last 6 hours</h4>
<table>
<tr>
<td>
<table>
<tr>
<th>Table</th>
<th>Expected Staleness</th>
<th>Actual Staleness</th>
<th>Expected Row Volume</th>
<th>Actual Row Volume</th>
<th>Impact (queries)</th>
</tr>
{{#QUERY_RESULT_ROWS}}
<tr>
<td><a href="{{dash_link}}">{{full_table_name}}</a></td>
<td>{{commit_expected}}</td>
<td>{{commit_actual}}</td>
<td>{{completeness_expected}}</td>
<td>{{completeness_actual}}</td>
<td>{{impacted_queries}}</td>
</tr>
{{/QUERY_RESULT_ROWS}}
</table>
</td>
</tr>
</table>
Now, you have an alert that triggers based on the downstream impact of the quality issue, and surfaces a link to the dashboard to help you debug the table that triggered the alert.
Disable anomaly detection
To disable anomaly detection, click the Anomaly Detection toggle to turn it off. The anomaly detection job is deleted, and all anomaly detection tables and information are deleted.
Limitations
Anomaly detection has the following limitations:
- Views, materialized views, and streaming tables are not supported.
- The determination of completeness does not take into account metrics such as the fraction of null, zero, or NaN values.
- Health indicators are not shown for completeness.
- "Backtesting" is not available for event-based freshness or completeness.
Advanced
Review logged results
By default, the results of an anomaly detection scan are saved in the schema in a table named _quality_monitoring_summary, to which only the user who enabled anomaly detection has access. To configure the name or location of the logging table, see Set parameters for freshness and completeness evaluation.
The table has the following information:
Column name | Type | Description |
---|---|---|
evaluated_at | timestamp | Start time of the anomaly scan run. |
catalog | string | Catalog containing the schema on which the anomaly scan was run. |
schema | string | Schema on which the anomaly scan was run. |
table_name | string | Name of the scanned table. |
quality_check_type | string | Type of quality check. One of Freshness or Completeness. |
status | string | Result of the quality check. One of Healthy, Unhealthy, or Unknown. |
additional_debug_info | map | This field provides the values that were used to determine the table's status. For details, see Debugging information. |
error_message | string | If the quality check could not be evaluated, the error message that explains why. |
table_lineage_link | string | Link to the table lineage tab in Catalog Explorer, to help with investigating the root cause of an anomaly. |
downstream_impact | struct | Impact of an anomaly on downstream assets. For details, see Downstream impact information. |
Debugging information
In the logged results table, the column additional_debug_info provides information in the following format:
{
  <metric_name>:
    actual_value: <value>,
    expectation: "actual_value < <predicted_value>",
    is_violated: true/false,
    error_code: <error_code>,
    from_backtesting: true/false
  ...
}
For example:
{
commit_staleness:
actual_value: "31 min"
expectation: "actual_value < 1 day"
is_violated: "false"
error_code: "None"
from_backtesting: "false"
}
Downstream impact information
In the logged results table, the column downstream_impact
is a struct
with the following fields:
Field | Type | Description |
---|---|---|
impact_level | int | Integer value between 1 and 4 indicating the severity of the anomaly. Higher values indicate greater disruption. |
num_downstream_tables | int | Number of downstream tables that might be affected by the anomaly. |
num_queries_on_affected_tables | int | Total number of queries that have referenced the affected and downstream tables in the past 30 days. |
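One way to use these fields is to triage unhealthy tables by how much downstream activity they touch. A minimal sketch, again assuming the default logging table name and the downstream_impact fields listed above:
-- Rank unhealthy tables from the last day by downstream query impact.
SELECT
  table_name,
  MAX(downstream_impact.impact_level) AS impact_level,
  MAX(downstream_impact.num_downstream_tables) AS downstream_tables,
  MAX(downstream_impact.num_queries_on_affected_tables) AS impacted_queries
FROM <catalog>.<schema>._quality_monitoring_summary
WHERE status = 'Unhealthy'
  AND evaluated_at >= current_timestamp() - INTERVAL 24 HOUR
GROUP BY table_name
ORDER BY impacted_queries DESC;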
Set parameters for freshness and completeness evaluation
To edit the parameters that control the anomaly detection job, such as how often the job runs or the name of the logged results table, you must edit the job parameters on the Tasks tab of the job page.
The following sections describe specific settings. For information about how to set task parameters, see Configure task parameters.
Schedule and notifications
To customize the schedule for the anomaly detection job, or to set up notifications, use the Schedules & Triggers settings on the jobs page. See Automating jobs with schedules and triggers.
Name of logging table
To change the name of the logging table, or save the table in a different schema, edit the job task parameter logging_table_name
and specify the desired name. To save the logging table in a different schema, specify the full 3-level name.
Customize freshness and completeness evaluations
All of the parameters in this section are optional. By default, anomaly detection determines thresholds based on an analysis of the table's history.
These parameters are fields inside the task parameter metric_configs. The format of metric_configs is a JSON string with the following default values:
[
{
"disable_check": false,
"tables_to_skip": null,
"tables_to_scan": null,
"table_threshold_overrides": null,
"table_latency_threshold_overrides": null,
"static_table_threshold_override": null,
"event_timestamp_col_names": null,
"metric_type": "FreshnessConfig"
},
{
"disable_check": true,
"tables_to_skip": null,
"tables_to_scan": null,
"table_threshold_overrides": null,
"metric_type": "CompletenessConfig"
}
]
The following parameters can be used for both freshness and completeness evaluations.
Field name | Description | Example |
---|---|---|
tables_to_scan | Only the specified tables are scanned for anomaly detection. | ["table1", "table2"] |
tables_to_skip | The specified tables are skipped during the anomaly detection scan. | ["table1", "table2"] |
disable_check | Anomaly scan is not run. Use this parameter if you want to disable only the freshness check or only the completeness check. | true |
The following parameters apply only to the freshness evaluation:
Field name | Description | Example |
---|---|---|
event_timestamp_col_names | List of event timestamp columns that tables in your schema might have. If a table has one of these columns, it is marked as stale when the data's ingestion latency (the difference between the commit time and the event time) is unusually high. | ["event_ts", "event_timestamp"] |
table_threshold_overrides | A dictionary consisting of table names and thresholds (in seconds) that specify the maximum interval since the last table update before marking a table as stale. | {"table1": 86400} |
table_latency_threshold_overrides | A dictionary consisting of table names and latency thresholds (in seconds) that specify the maximum interval since the last timestamp in the table before marking a table as stale. | {"table1": 3600} |
static_table_threshold_override | Amount of time (in seconds) before a table is considered a static table (that is, one that is no longer updated). | 604800 |
The following parameter applies only to the completeness evaluation:
Field name | Description | Example |
---|---|---|
table_threshold_overrides | A dictionary consisting of table names and row volume thresholds (specified as integers). If the number of rows added to a table over the previous 24 hours is less than the specified threshold, the table is marked as incomplete. | {"table1": 1000} |
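Putting these parameters together, a customized metric_configs value might look like the following. This is an illustrative example, not a default: every table name, column name, and threshold shown here is a placeholder to replace with your own values.
[
  {
    "disable_check": false,
    "tables_to_skip": ["tmp_scratch_table"],
    "tables_to_scan": null,
    "table_threshold_overrides": {"orders": 86400},
    "table_latency_threshold_overrides": {"orders": 7200},
    "static_table_threshold_override": 604800,
    "event_timestamp_col_names": ["event_ts", "event_timestamp"],
    "metric_type": "FreshnessConfig"
  },
  {
    "disable_check": false,
    "tables_to_skip": null,
    "tables_to_scan": ["orders", "payments"],
    "table_threshold_overrides": {"orders": 10000},
    "metric_type": "CompletenessConfig"
  }
]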