Higher Order Functions

New in version runtime-3.0.

Description

Databricks provides dedicated primitives for manipulating arrays in Spark SQL; these make working with arrays much easier and more concise, and do away with much of the boilerplate code typically required.

These primitives revolve around two functional programming constructs: higher order functions and anonymous (lambda) functions. These work together to let you define functions that manipulate arrays in SQL. The higher order function takes an array, determines how the array is processed, and what the result of the computation will be. It delegates to a lambda function how to process each item in the array.
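
The examples in this section query a table named nested_data with an integer key column, an array of integers (values), and an array of integer arrays (nested_values). The table is not defined in this document; the following is one minimal, hypothetical way to create such a table so the queries below can be run as-is:

CREATE OR REPLACE TEMPORARY VIEW nested_data AS
SELECT  1                               AS key,
        ARRAY(1, 2, 3)                  AS values,
        ARRAY(ARRAY(1, 2), ARRAY(3, 4)) AS nested_values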

A simple example

Let’s ground the previous concepts with a basic transformation. In this case, the higher order function, transform, iterates over the values array, applies the associated lambda function to each element, and creates a new array. The lambda function, value -> value + 1, specifies how each element is manipulated. In SQL this looks like the following:

SELECT  key,
        values,
        TRANSFORM(values, value -> value + 1) AS values_plus_one
FROM    nested_data
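
With the hypothetical single-row table sketched above (key = 1, values = [1, 2, 3]), values_plus_one would be [2, 3, 4].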

To be abundantly clear, the transformation TRANSFORM(values, value -> value + 1) has two components:

  1. TRANSFORM(values, ...) is the higher order function. This takes an array and an anonymous function as its input. Internally transform takes care of setting up a new array, applying the anonymous function to each element, and assigning the result to the output array.
  2. value -> value + 1 is the anonymous function. The function is divided into two components separated by a -> symbol:
     - The argument list. In this case we have only one argument: value. Multiple arguments are also supported by writing a comma-separated list of arguments enclosed by parentheses, for example: (x, y) -> x + y. A two-argument example is shown after this list.
     - The body. This is a SQL expression that can use the arguments and outer variables to calculate the new value. In this case we add 1 to the value argument.
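
As a small illustration of a two-argument lambda (a hypothetical sketch reusing the nested_data table above), the aggregate function documented later in this section takes a merge lambda that receives an accumulator and the current element; here it computes the product of the array elements:

SELECT  key,
        values,
        AGGREGATE(values, 1, (acc, value) -> acc * value) AS values_product
FROM    nested_data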

Capturing variables

We can also use variables other than the arguments in a lambda function; this is called capture. We can use variables defined at the top level, or variables defined in intermediate lambda functions. For example, the following transform adds the key (top level) column to each element in the values array:

SELECT  key,
        values,
        TRANSFORM(values, value -> value + key) AS values_plus_key
FROM    nested_data

Nesting

Sometimes data is deeply nested. If you want to transform such data, you can use nested lambda functions. The following example transforms an array of integer arrays, adding the key (top level) column and the size of the intermediate array to each element in the nested array.

SELECT   key,
         nested_values,
         TRANSFORM(nested_values,
           values -> TRANSFORM(values,
             value -> value + key + SIZE(values))) AS new_nested_values
FROM     nested_data
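
With the sample row sketched earlier (key = 1, nested_values = [[1, 2], [3, 4]]), each inner array has size 2, so every element is increased by key + SIZE(values) = 3 and new_nested_values would be [[4, 5], [6, 7]].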

Supported functions

We have added the following higher order functions to Databricks Runtime:

transform(array<T>, function<T, U>): array<U>

This produces an array by applying a function<T, U> to each element of an input array<T>.

Note that the equivalent operation in functional programming is map. This has been named transform in order to prevent confusion with the map expression (which creates a map from key/value expressions).
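
To illustrate the distinction, here is a minimal, standalone sketch: the map expression builds a map from key/value pairs, while transform applies a lambda to every element of an array:

SELECT  MAP('a', 1, 'b', 2)                    AS a_map,
        TRANSFORM(ARRAY(1, 2, 3), x -> x * 2)  AS doubled_array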

The following query transforms the values array by adding the key value to each element:

SELECT   key,
         values,
         TRANSFORM(values, value -> value + key) transformed_values
FROM     nested_data

exists(array<T>, function<T, Boolean>): Boolean

Returns true if the predicate function holds for any element in the input array.

The following example checks whether the values array contains an element for which value modulo 10 equals 1:

SELECT   key,
         values,
         EXISTS(values, value -> value % 10 == 1) flagged_values
FROM     nested_data

filter(array<T>, function<T, Boolean>): array<T>

Produces an output array<T> from an input array<T> by adding only those elements for which the predicate function<T, Boolean> holds.

The following example filters the values array so that only elements with a value > 50 are kept:

SELECT   key,
         values,
         FILTER(values, value -> value > 50) filtered_values
FROM     nested_data

aggregate(array<T>, B, function<B, T, B>, function<B, R>): R

Reduces the elements of array<T> into a single value R by merging the elements into a buffer B using function<B, T, B> and by applying a finish function<B, R> to the final buffer. The initial buffer B is determined by a zero expression. The finish function is optional; if you do not specify one, the identity function (id -> id) is used.

This is the only higher order function that takes two lambda functions.

The following example sums (aggregates) the values array into a single (sum) value. Both a version with a finish function (summed_values) and one without a finish function (summed_values_simple) are shown:

SELECT   key,
         values,
         AGGREGATE(values, 0, (acc, value) -> acc + value, acc -> acc) summed_values,
         AGGREGATE(values, 0, (acc, value) -> acc + value) summed_values_simple
FROM     nested_data
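
With the sample values [1, 2, 3] sketched earlier, both summed_values and summed_values_simple would evaluate to 6.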

You can also compute more complex aggregates. The code below shows the computation of the geometric mean of the array elements.

SELECT   key,
         values,
         AGGREGATE(values,
           (1.0 AS product, 0 AS N),
           (buffer, value) -> (value * buffer.product, buffer.N + 1),
           buffer -> Power(buffer.product, 1.0 / buffer.N)) geomean
FROM     nested_data
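
As one more illustration of the finish function (a hypothetical sketch against the same nested_data table), the following computes the arithmetic mean by accumulating a sum and a count in the buffer and dividing them in the finish step:

SELECT   key,
         values,
         AGGREGATE(values,
           (0.0 AS sum, 0 AS cnt),
           (buffer, value) -> (buffer.sum + value, buffer.cnt + 1),
           buffer -> buffer.sum / buffer.cnt) mean_value
FROM     nested_data

With the sample values [1, 2, 3] this would return 2.0.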