What are user-defined functions (UDFs)?
A user-defined function (UDF) is a function defined by a user, allowing custom logic to be reused in the user environment. Databricks has support for many different types of UDFs to allow for distributing extensible logic. This article introduces some of the general strengths and limitations of UDFs.
Note
Not all forms of UDFs are available in all execution environments on Databricks. If you are working with Unity Catalog, see User-defined functions (UDFs) in Unity Catalog.
Defining custom logic without serialization penalties
Databricks inherits much of its UDF behaviors from Apache Spark, including the efficiency limitations around many types of UDFs. See Which UDFs are most efficient?.
You can safely modularize your code without worrying about potential efficiency tradeoffs associated with UDFs. To do so, you must define your logic as a series of Spark built-in methods using SQL or Spark DataFrames. For example, the following SQL and Python functions combine Spark built-in methods to define a unit conversion as a reusable function:
CREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)
RETURNS DOUBLE
RETURN CASE
WHEN unit = "F" THEN (temp - 32) * (5/9)
ELSE temp
END;
SELECT convert_f_to_c(unit, temp) AS c_temp
FROM tv_temp;
def convertFtoC(unitCol, tempCol):
from pyspark.sql.functions import when
return when(unitCol == "F", (tempCol - 32) * (5/9)).otherwise(tempCol)
from pyspark.sql.functions import col
df_query = df.select(convertFtoC(col("unit"), col("temp"))).toDF("c_temp")
display(df_query)
To run the above UDFs, you can create example data.
Which UDFs are most efficient?
UDFs might introduce significant processing bottlenecks into code execution. Databricks uses a number of different optimizers automatically for code written with included Apache Spark, SQL, and Delta Lake syntax. When custom logic is introduced by UDFs, these optimizers do not have the ability to efficiently plan tasks around this custom logic. In addition, logic that executes outside the JVM has additional costs around data serialization.
Note
Databricks optimizes many functions using Photon if you use Photon-enabled compute. Only functions that chain together Spark SQL of DataFrame commands can be optimized by Photon.
Some UDFs are more efficient than others. In terms of performance:
Built in functions will be fastest because of Databricks optimizers.
Code that executes in the JVM (Scala, Java, Hive UDFs) will be faster than Python UDFs.
Pandas UDFs use Arrow to reduce serialization costs associated with Python UDFs.
Python UDFs work well for procedural logic, but should be avoided for production ETL workloads on large datasets.
Note
In Databricks Runtime 12.2 LTS and below, Python scalar UDFs and Pandas UDFs are not supported in Unity Catalog on clusters that use shared access mode. These UDFs are supported in Databricks Runtime 13.3 LTS and above for all access modes.
In Databricks Runtime 14.1 and below, Scala scalar UDFs are not supported in Unity Catalog on clusters that use shared access mode. These UDFs are supported for all access modes in Databricks Runtime 14.2 and above.
In Databricks Runtime 13.3 LTS and above, you can register scalar Python UDFs to Unity Catalog using SQL syntax. See User-defined functions (UDFs) in Unity Catalog.
Type |
Optimized |
Execution environment |
---|---|---|
Hive UDF |
No |
JVM |
Python UDF |
No |
Python |
Pandas UDF |
No |
Python (Arrow) |
Scala UDF |
No |
JVM |
Spark SQL |
Yes |
JVM (Photon) |
Spark DataFrame |
Yes |
JVM (Photon) |
When should you use a UDF?
A major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code.
For ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.
Example data for example UDFs
The code examples in this article use UDFs to convert temperatures between Celcius and Farenheit. If you wish to execute these functions, you can create a sample dataset with the following Python code:
import numpy as np
import pandas as pd
Fdf = pd.DataFrame(np.random.normal(55, 25, 10000000), columns=["temp"])
Fdf["unit"] = "F"
Cdf = pd.DataFrame(np.random.normal(10, 10, 10000000), columns=["temp"])
Cdf["unit"] = "C"
df = spark.createDataFrame(pd.concat([Fdf, Cdf]).sample(frac=1))
df.cache().count()
df.createOrReplaceTempView("tv_temp")