Aggregate data on Databricks
This article introduces the general semantics for aggregation and discusses the differences between results computed using batch queries, materialized views, and streaming.
Batch aggregates
Batch aggregation is the default behavior observed when running an ad hoc query in SQL or processing data with Apache Spark DataFrames.
An aggregate query written against a table or data source computes the aggregate statistics for all records in the data source. Databricks leverages metadata and query optimizations whenever possible, and can compute many aggregates efficiently even for large datasets.
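As an illustration, the following is a minimal PySpark sketch of a batch aggregate. It assumes the `spark` session that Databricks provides in notebooks; the `sales` table and its `region` and `amount` columns are hypothetical names used for illustration.

```python
from pyspark.sql import functions as F

# Batch read: the aggregate below is computed over all records in the table.
# `sales`, `region`, and `amount` are hypothetical names.
df = spark.read.table("sales")

agg_df = df.groupBy("region").agg(
    F.count("*").alias("order_count"),
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount"),
)

agg_df.show()
```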
Batch aggregation latency and compute costs increase with data size, so pre-computing frequently referenced aggregate values can save users substantial time and money. Databricks recommends using materialized views to incrementally update aggregate values. See Incremental aggregates.
Stateful aggregates
Aggregates defined in streaming workloads are stateful. Stateful aggregates track observed records over time and recompute results when processing new data.
You must use watermarks when computing stateful aggregates. Omitting a watermark from a stateful aggregate query allows state information to accumulate indefinitely, which slows processing over time and can lead to out-of-memory errors.
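For example, a windowed stateful aggregate with a watermark might look like the following sketch. It assumes a streaming table named `events` with an `event_time` timestamp column (both hypothetical).

```python
from pyspark.sql import functions as F

# Streaming read; `events` and `event_time` are hypothetical names.
events = spark.readStream.table("events")

# The watermark bounds how late data may arrive, letting Spark drop state
# for windows that can no longer change instead of retaining it forever.
windowed_counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"))
    .count()
)
```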
You should not use a stateful aggregate to calculate statistics over an entire dataset. Databricks recommends using materialized views for incremental aggregate calculation on an entire dataset. See Incremental aggregates.
Configuring workloads that compute stateful aggregates efficiently and correctly requires an understanding of how data arrives from source systems and how Databricks uses watermarks, output modes, and trigger intervals to control query state and result computation.
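Continuing the sketch above, the write side of the query is where the output mode and trigger interval are set. The checkpoint path and target table name here are hypothetical.

```python
# Append mode emits each window's final count only after the watermark
# passes the end of the window; the trigger sets the micro-batch cadence.
query = (
    windowed_counts.writeStream
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "/tmp/checkpoints/windowed_counts")
    .toTable("windowed_counts_output")
)
```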
Incremental aggregates
You can use materialized views to compute many aggregate values incrementally. Materialized views automatically track changes in the data source and apply appropriate updates to aggregate values on refresh. The results returned by a materialized view are equivalent to those returned by recomputing the aggregates on the source data with a batch job or ad hoc query.
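As a sketch, a materialized view that pre-computes daily aggregates might be defined as follows. All names are hypothetical, and the statement assumes an environment (such as Databricks SQL or a Unity Catalog-enabled workspace) where CREATE MATERIALIZED VIEW is supported.

```python
# Hypothetical materialized view definition; Databricks tracks changes to
# `sales` and applies incremental updates to the aggregates on refresh.
spark.sql("""
    CREATE MATERIALIZED VIEW daily_sales_totals AS
    SELECT
      date_trunc('DAY', order_ts) AS order_date,
      region,
      SUM(amount) AS total_amount,
      COUNT(*) AS order_count
    FROM sales
    GROUP BY date_trunc('DAY', order_ts), region
""")
```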
Approximate aggregates
While Databricks excels at computing on extremely large datasets, using approximation for aggregates can accelerate query processing and reduce costs when you don’t require precise results.
Using LIMIT statements is sometimes good enough for getting a quick snapshot of data, but LIMIT does not introduce randomness or guarantee that sampling is distributed across the dataset.
Spark SQL has the following native methods for approximating aggregations on numeric or categorical data:

- approx_count_distinct aggregate function
- approx_percentile aggregate function
- approx_top_k aggregate function
You can also specify a sample percent with TABLESAMPLE to generate a random sample from a dataset and calculate approximate aggregates. See TABLESAMPLE clause.
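A minimal sketch of these techniques follows, again assuming the hypothetical `sales` table with `customer_id` and `amount` columns.

```python
from pyspark.sql import functions as F

df = spark.read.table("sales")  # hypothetical table name

# Approximate distinct count: trades a small, bounded error for speed.
df.select(F.approx_count_distinct("customer_id").alias("approx_customers")).show()

# Approximate percentile (here, the median of `amount`).
df.select(F.percentile_approx("amount", 0.5).alias("approx_median")).show()

# TABLESAMPLE draws a random sample; aggregates on it are approximate.
spark.sql("SELECT AVG(amount) FROM sales TABLESAMPLE (1 PERCENT)").show()
```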
Monitor datasets using aggregate statistics
Lakehouse Monitoring uses aggregate statistics and data distributions to track data quality over time. You can generate reports to visualize trends and schedule alerts to flag unexpected changes in data. See Introduction to Databricks Lakehouse Monitoring.