What are user-defined functions (UDFs)?
A user-defined function (UDF) is a function defined by a user, allowing custom logic to be reused in the user environment. Databricks has support for many different types of UDFs to allow for distributing extensible logic. This article introduces some of the general strengths and limitations of UDFs.
See the following articles for more information on UDFs:
When is custom logic not a UDF?
Not all custom functions are UDFs in the strict sense. You can safely define a series of Spark built-in methods using SQL or Spark DataFrames and get fully optimized behavior. For example, the following SQL and Python functions combine Spark built-in methods to define a unit conversion as a reusable function:
CREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)
RETURNS DOUBLE
RETURN CASE
WHEN unit = "F" THEN (temp - 32) * (5/9)
ELSE temp
END;
SELECT convert_f_to_c(unit, temp) AS c_temp
FROM tv_temp;
def convertFtoC(unitCol, tempCol):
from pyspark.sql.functions import when
return when(unitCol == "F", (tempCol - 32) * (5/9)).otherwise(tempCol)
from pyspark.sql.functions import col
df_query = df.select(convertFtoC(col("unit"), col("temp"))).toDF("c_temp")
display(df_query)
To run the above UDFs, you can create example data.
Which UDFs are most efficient?
UDFs might introduce significant processing bottlenecks into code execution. Databricks uses a number of different optimizers automatically for code written with included Apache Spark, SQL, and Delta Lake syntax. When custom logic is introduced by UDFs, these optimizers do not have the ability to efficiently plan tasks around this custom logic. In addition, logic that executes outside the JVM has additional costs around data serialization.
Some UDFs are more efficient than others. In terms of performance:
Built in functions will be fastest because of Databricks optimizers.
Code that executes in the JVM (Scala, Java, Hive UDFs) will be faster than Python UDFs.
Pandas UDFs use Arrow to reduce serialization costs associated with Python UDFs.
Python UDFs should generally be avoided, but Python can be used as glue code without any degradation in performance.
Note
Python UDF and UDAF (user-defined aggregate functions) are not supported in Unity Catalog on clusters that use shared access mode.
Type |
Optimized |
Execution environment |
---|---|---|
Hive UDF |
No |
JVM |
Python UDF |
No |
Python |
Pandas UDF |
No |
Python (Arrow) |
Scala UDF |
No |
JVM |
Spark SQL |
Yes |
JVM |
Spark DataFrame |
Yes |
JVM |
When should you use a UDF?
A major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code.
For ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.
Example data for example UDFs
The code examples in this article use UDFs to convert temperatures between Celcius and Farenheit. If you wish to execute these functions, you can create a sample dataset with the following Python code:
import numpy as np
import pandas as pd
Fdf = pd.DataFrame(np.random.normal(55, 25, 10000000), columns=["temp"])
Fdf["unit"] = "F"
Cdf = pd.DataFrame(np.random.normal(10, 10, 10000000), columns=["temp"])
Cdf["unit"] = "C"
df = spark.createDataFrame(pd.concat([Fdf, Cdf]).sample(frac=1))
df.cache().count()
df.createOrReplaceTempView("tv_temp")