What are user-defined functions (UDFs)?

A user-defined function (UDF) is a function defined by a user, allowing custom logic to be reused in the user environment. Databricks has support for many different types of UDFs to allow for distributing extensible logic. This article introduces some of the general strengths and limitations of UDFs.

Note

Not all forms of UDFs are available in all execution environments on Databricks. If you are working with Unity Catalog, see User-defined functions (UDFs) in Unity Catalog.

See the following articles for more information on UDFs:

Defining custom logic without serialization penalties

Databricks inherits much of its UDF behaviors from Apache Spark, including the efficiency limitations around many types of UDFs. See Which UDFs are most efficient?.

You can safely modularize your code without worrying about potential efficiency tradeoffs associated with UDFs. To do so, you must define your logic as a series of Spark built-in methods using SQL or Spark DataFrames. For example, the following SQL and Python functions combine Spark built-in methods to define a unit conversion as a reusable function:

CREATE FUNCTION convert_f_to_c(unit STRING, temp DOUBLE)
RETURNS DOUBLE
RETURN CASE
  WHEN unit = "F" THEN (temp - 32) * (5/9)
  ELSE temp
END;

SELECT convert_f_to_c(unit, temp) AS c_temp
FROM tv_temp;
def convertFtoC(unitCol, tempCol):
  from pyspark.sql.functions import when
  return when(unitCol == "F", (tempCol - 32) * (5/9)).otherwise(tempCol)

from pyspark.sql.functions import col

df_query = df.select(convertFtoC(col("unit"), col("temp"))).toDF("c_temp")
display(df_query)

To run the above UDFs, you can create example data.

Which UDFs are most efficient?

UDFs might introduce significant processing bottlenecks into code execution. Databricks uses a number of different optimizers automatically for code written with included Apache Spark, SQL, and Delta Lake syntax. When custom logic is introduced by UDFs, these optimizers do not have the ability to efficiently plan tasks around this custom logic. In addition, logic that executes outside the JVM has additional costs around data serialization.

Note

Databricks optimizes many functions using Photon if you use Photon-enabled compute. Only functions that chain together Spark SQL of DataFrame commands can be optimized by Photon.

Some UDFs are more efficient than others. In terms of performance:

  • Built in functions will be fastest because of Databricks optimizers.

  • Code that executes in the JVM (Scala, Java, Hive UDFs) will be faster than Python UDFs.

  • Pandas UDFs use Arrow to reduce serialization costs associated with Python UDFs.

  • Python UDFs work well for procedural logic, but should be avoided for production ETL workloads on large datasets.

Note

In Databricks Runtime 13.1 and below, Python scalar UDFs and Pandas UDFs are not supported in Unity Catalog on clusters that use shared access mode. These UDFs are supported in Databricks Runtime 13.2 and above for all access modes.

In Databricks Runtime 14.1 and below, Scala scalar UDFs are not supported in Unity Catalog on clusters that use shared access mode. These UDFs are supported for all access modes in Databricks Runtime 14.2 and above.

In Databricks Runtime 13.2 and above, you can register scalar Python UDFs to Unity Catalog using SQL syntax. See User-defined functions (UDFs) in Unity Catalog.

Type

Optimized

Execution environment

Hive UDF

No

JVM

Python UDF

No

Python

Pandas UDF

No

Python (Arrow)

Scala UDF

No

JVM

Spark SQL

Yes

JVM (Photon)

Spark DataFrame

Yes

JVM (Photon)

When should you use a UDF?

A major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code.

For ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.

Example data for example UDFs

The code examples in this article use UDFs to convert temperatures between Celcius and Farenheit. If you wish to execute these functions, you can create a sample dataset with the following Python code:

import numpy as np
import pandas as pd

Fdf = pd.DataFrame(np.random.normal(55, 25, 10000000), columns=["temp"])
Fdf["unit"] = "F"

Cdf = pd.DataFrame(np.random.normal(10, 10, 10000000), columns=["temp"])
Cdf["unit"] = "C"

df = spark.createDataFrame(pd.concat([Fdf, Cdf]).sample(frac=1))

df.cache().count()
df.createOrReplaceTempView("tv_temp")