histogram_numeric aggregate function

Applies to: check marked yes Databricks SQL check marked yes Databricks Runtime 10,2 and later

Computes a histogram on expr using numBins bins.

Syntax

histogram_numeric ( [ALL | DISTINCT ] expr, numBins ) [ FILTER ( WHERE cond ) ]

Arguments

  • expr: A numeric, TIMESTAMP, DATE, or INTERVAL expression that the function consumes and computes the histogram on.

  • numBins: An INTEGER literal that must be greater than 1, specifying the number of bins for the histogram computation.

  • cond: An optional BOOLEAN expression that filters the rows for aggregation.

Returns

The return value is an ARRAY of STRUCTS with fields x and y representing the centers of the histogram’s bins. The type of x is the same as the type of expr, and the type of y is DOUBLE. Increasing the value of numBins refines the histogram approximation, making it finer-grained. However, it can introduce artifacts around outliers. Typically, 20-40 bins are effective for histograms, though skewed or smaller datasets might require more bins. Note that this function creates a histogram with non-uniform bin widths. It offers no guarantees in terms of the mean-squared-error of the histogram, but in practice is comparable to the histograms produced by other computing packages.

Specifying DISTINCT makes the function operate only on a unique set of expr values.

Examples

> SELECT histogram_numeric(col, 5)
    FROM VALUES (0), (1), (2), (10) AS tab(col);
 [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (0L), (1L), (2L), (10L) AS tab(col);
 [{"x":0,"y":1.0},{"x":1,"y":1.0},{"x":2,"y":1.0},{"x":10,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (0F), (1F), (2F), (10F) AS tab(col);
 [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (0D), (1D), (2D), (10D) AS tab(col);
 [{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (INTERVAL 0 YEAR), (INTERVAL 1 YEAR), (INTERVAL 2 YEAR),
    (INTERVAL 3 YEAR) AS tab(col);
 [{"x":0-0,"y":1.0},{"x":1-0,"y":1.0},{"x":2-0,"y":1.0},{"x":3-0,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (INTERVAL 0 DAY), (INTERVAL 1 DAY), (INTERVAL 2 DAY),
    (INTERVAL 3 DAY) AS tab(col);
  [{"x":0 00:00:00.000000000,"y":1.0},{"x":1 00:00:00.000000000,"y":1.0},{"x":2 00:00:00.000000000,"y":1.0},{"x":3 00:00:00.000000000,"y":1.0}]

> SELECT histogram_numeric(col, 5)
    FROM VALUES (TIMESTAMP '2020-01-01'), (TIMESTAMP'2020-02-01'),
    (TIMESTAMP'2020-03-01'), (TIMESTAMP'2020-10-01') AS tab(col)
  [{"x":2020-01-01 00:00:00,"y":1.0},{"x":2020-02-01 00:00:00,"y":1.0},{"x":2020-03-01 00:00:00,"y":1.0},{"x":2020-10-01 00:00:00,"y":1.0}]