This functionality is supported on Databricks Runtime 3.0 and above.
Databricks provides dedicated primitives for manipulating arrays in Spark SQL; these make working with arrays much easier and more concise, and do away with the large amounts of boilerplate code typically required.
These primitives revolve around two functional programming constructs: higher order functions and anonymous (lambda) functions. These work together to allow you to define functions that manipulate arrays in SQL. The higher order function takes an array, and determines how the array is processed and what the result of the computation will be. It delegates to a lambda function how to process each item in the array.
Let's ground the previous concepts with a basic transformation. In this case, the higher order function, transform, will iterate over the array column values, apply the associated lambda function to each element, and create a new array. The lambda function, element + 1, specifies how each element is manipulated. In SQL this looks like the following:
SELECT key, values, TRANSFORM(values, value -> value + 1) AS values_plus_one FROM nested_data
To be abundantly clear, the transformation TRANSFORM(values, value -> value + 1) has two components:

- TRANSFORM(values, ..) is the higher order function. This takes an array and an anonymous function as its input. Internally, transform takes care of setting up a new array, applying the anonymous function to each element, and assigning the result to the output array.
- value -> value + 1 is the anonymous function. The function is divided into two components separated by a -> symbol:
  - The argument list. In this case we only have one argument: value. We also support multiple arguments by creating a comma-separated list of arguments enclosed by parentheses, for example: (x, y) -> x + y.
  - The body. This is a SQL expression that can use the arguments and outer variables to calculate the new value. In this case we add 1 to the value.
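To build intuition, a transform behaves like a plain map over the array. The following Python sketch mimics the semantics of TRANSFORM(values, value -> value + 1); the helper name transform is ours, not a Spark API:

```python
def transform(arr, fn):
    # Mimics Spark SQL's TRANSFORM: apply fn to each element, building a new array.
    return [fn(x) for x in arr]

values = [1, 2, 3]
values_plus_one = transform(values, lambda value: value + 1)
print(values_plus_one)  # [2, 3, 4]
```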
We can also use variables other than the arguments in a lambda function; this is called capture. We can use variables defined at the top level, or variables defined in intermediate lambda functions. For example, the following transform adds the key (top level) column to each element in the values array:
SELECT key, values, TRANSFORM(values, value -> value + key) AS values_plus_key FROM nested_data
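In plain Python terms, capture is an ordinary closure over a variable from an enclosing scope. The sketch below (again, illustrative names, not Spark code) mirrors how the SQL lambda captures the top-level key column:

```python
def transform(arr, fn):
    return [fn(x) for x in arr]

def values_plus_key(key, values):
    # The lambda "captures" key from the enclosing scope, just like
    # value -> value + key captures the top-level key column in SQL.
    return transform(values, lambda value: value + key)

print(values_plus_key(10, [1, 2, 3]))  # [11, 12, 13]
```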
Sometimes data is deeply nested. If you want to transform such data, you can use nested lambda functions. The following example transforms an array of integer arrays, and adds the key (top level) column and the size of the intermediate array to each element in the nested array:
SELECT key, nested_values, TRANSFORM(nested_values, values -> TRANSFORM(values, value -> value + key + SIZE(values))) AS new_nested_values FROM nested_data
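The nested transform can be sketched the same way: the outer lambda receives each inner array, and the inner lambda captures both key (top level) and values (the intermediate lambda's argument). A hypothetical plain-Python equivalent:

```python
def transform(arr, fn):
    return [fn(x) for x in arr]

def new_nested_values(key, nested_values):
    # Inner lambda captures key (top level) and values (intermediate), mirroring
    # TRANSFORM(nested_values, values ->
    #     TRANSFORM(values, value -> value + key + SIZE(values))).
    return transform(
        nested_values,
        lambda values: transform(values, lambda value: value + key + len(values)),
    )

print(new_nested_values(1, [[1, 2], [3, 4, 5]]))  # [[4, 5], [7, 8, 9]]
```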
We have added the following higher order functions to Databricks Runtime:
This produces an array by applying a function<T, U> to each element of an input array<T>. The functional programming equivalent operation is map. It has been named transform in order to prevent confusion with the map expression (which creates a map from a key-value expression).
The following query transforms the values array by adding the key value to each element:
SELECT key, values, TRANSFORM(values, value -> value + key) transformed_values FROM nested_data
Returns true if the predicate function holds for any element in the input array.

The following example checks whether the values array contains an element for which the value modulo 10 equals 1:
SELECT key, values, EXISTS(values, value -> value % 10 == 1) filtered_values FROM nested_data
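The semantics of exists match Python's built-in any over a predicate. A minimal sketch (the helper name exists is ours):

```python
def exists(arr, pred):
    # Mimics Spark SQL's EXISTS: true if pred holds for any element.
    return any(pred(x) for x in arr)

print(exists([12, 21, 33], lambda value: value % 10 == 1))  # True
print(exists([12, 22, 33], lambda value: value % 10 == 1))  # False
```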
Produces an output array<T> from an input array<T> by only adding elements for which the predicate function<T, boolean> holds.

The following example filters the values array; only elements with a value greater than 50 are retained:
SELECT key, values, FILTER(values, value -> value > 50) filtered_values FROM nested_data
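In plain Python this is a filtering comprehension; a sketch of the semantics (named filter_array here to avoid shadowing Python's built-in filter):

```python
def filter_array(arr, pred):
    # Mimics Spark SQL's FILTER: keep only elements for which pred holds.
    return [x for x in arr if pred(x)]

print(filter_array([10, 60, 55, 40], lambda value: value > 50))  # [60, 55]
```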
Reduces the elements of an array<T> into a single value R, by merging the elements into a buffer using a merge function<B, T, B>, and by applying a finalize function<B, R> on the final buffer. The initial value of the buffer B is determined by a zero expression. The finalize function is optional; if you do not specify one, the identity function (id -> id) is used.
This is the only higher order function that takes two lambda functions.
The following example sums (aggregates) the values array into a single (sum) value. Both a version with a finalize function (summed_values) and one without a finalize function (summed_values_simple) are shown:
SELECT key,
       values,
       REDUCE(values, 0, (value, acc) -> value + acc, acc -> acc) summed_values,
       REDUCE(values, 0, (value, acc) -> value + acc) summed_values_simple
FROM nested_data
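The zero/merge/finalize structure is a classic fold. A plain-Python sketch of the semantics (the helper name aggregate is ours; argument order follows the buffer-first convention of the geometric mean example below):

```python
def aggregate(arr, zero, merge, finish=lambda buf: buf):
    # Mimics Spark SQL's REDUCE/AGGREGATE: fold elements into a buffer
    # starting from zero, then apply the (optional) finalize function.
    buf = zero
    for x in arr:
        buf = merge(buf, x)
    return finish(buf)

# Summing with an explicit identity finalize, and without one:
print(aggregate([10, 20, 30], 0, lambda acc, value: acc + value, lambda acc: acc))  # 60
print(aggregate([10, 20, 30], 0, lambda acc, value: acc + value))                   # 60
```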
You can also compute more complex aggregates. The code below shows the computation of the geometric mean of the array elements.
SELECT key,
       values,
       AGGREGATE(values,
                 (1.0 AS product, 0 AS N),
                 (buffer, value) -> (value * buffer.product, buffer.N + 1),
                 buffer -> Power(buffer.product, 1.0 / buffer.N)) geomean
FROM nested_data
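The geometric mean follows the same fold shape: the buffer is a (product, count) pair, mirroring (1.0 AS product, 0 AS N), and the finalize function takes the N-th root. A plain-Python sketch with illustrative names:

```python
def aggregate(arr, zero, merge, finish=lambda buf: buf):
    buf = zero
    for x in arr:
        buf = merge(buf, x)
    return finish(buf)

# Buffer is a (product, count) tuple; finalize computes product ** (1/N).
geomean = aggregate(
    [2.0, 8.0],
    (1.0, 0),
    lambda buffer, value: (value * buffer[0], buffer[1] + 1),
    lambda buffer: buffer[0] ** (1.0 / buffer[1]),
)
print(geomean)  # 4.0
```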