Databricks provides dedicated primitives for manipulating arrays in Apache Spark SQL; these make working with arrays much easier and more concise and do away with the large amounts of boilerplate code typically required. The primitives revolve around two functional programming constructs: higher-order functions and anonymous (lambda) functions. These work together to allow you to define functions that manipulate arrays in SQL. A higher-order function takes an array, implements how the array is processed, and what the result of the computation will be. It delegates to a lambda function how to process each item in the array.
Apache Spark 2.4, included in Databricks Runtime 5.0, has 24 new built-in functions for manipulating complex types (for example, array types), including higher-order functions.
Before Spark 2.4, for manipulating the complex types directly, there were two typical solutions:
- Exploding the nested structure into individual rows, and applying some functions, and then creating the structure again.
- Building a user defined function (UDF).
In contrast, the new built-in functions can directly manipulate complex types and the higher-order functions can manipulate complex values with an anonymous lambda function similar to UDFs, but with much better performance.
The following notebook illustrates the new functions.