
What are user-defined functions (UDFs)?

User-defined functions (UDFs) allow you to reuse and share code that extends built-in functionality on Databricks. Use UDFs to perform specific tasks like complex calculations, transformations, or custom data manipulations.

When to use a UDF vs. Apache Spark function?

Use UDFs for logic that is difficult to express with built-in Apache Spark functions. Built-in Apache Spark functions are optimized for distributed processing and offer better performance at scale. For more information, see Functions.

Databricks recommends UDFs for ad hoc queries, manual data cleansing, exploratory data analysis, and operations on small to medium-sized datasets. Common use cases for UDFs include data encryption, decryption, hashing, JSON parsing, and validation.

Use Apache Spark methods for operations on very large datasets and any workloads run regularly or continuously, including ETL jobs and streaming operations.
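To make the comparison concrete, the following is a minimal sketch that computes a name length two ways: with the built-in length function, which the Spark engine can optimize, and with an equivalent Python UDF for cases where no built-in fits. The DataFrame and column names are placeholders.

Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import length, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Preferred: the built-in length function is optimized by the Spark engine
df_builtin = df.withColumn("name_length", length("name"))

# A Python UDF expressing the same logic; use this pattern only when no built-in fits
@udf(returnType=IntegerType())
def name_length_udf(name):
    return len(name)

df_udf = df.withColumn("name_length", name_length_udf(df.name))

df_builtin.show()
df_udf.show()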

Understand UDF types

The following sections describe each UDF type with an example and a link to learn more.

Scalar UDFs operate on a single row and return a single result value for each row. They can be Unity Catalog governed or session-scoped.

The following example uses a scalar UDF to calculate the length of each name in a name column and add the value in a new column name_length.

+-------+-------+
| name  | score |
+-------+-------+
| alice | 10.0  |
| bob   | 20.0  |
| carol | 30.0  |
| dave  | 40.0  |
| eve   | 50.0  |
+-------+-------+
SQL
-- Create a SQL UDF for name length
CREATE OR REPLACE FUNCTION main.test.get_name_length(name STRING)
RETURNS INT
RETURN LENGTH(name);

-- Use the UDF in a SQL query
SELECT name, main.test.get_name_length(name) AS name_length
FROM your_table;
+-------+-------+-------------+
| name  | score | name_length |
+-------+-------+-------------+
| alice | 10.0  | 5           |
| bob   | 20.0  | 3           |
| carol | 30.0  | 5           |
| dave  | 40.0  | 4           |
| eve   | 50.0  | 3           |
+-------+-------+-------------+

To implement this in a Databricks notebook using PySpark:

Python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Define a scalar UDF that returns the length of a name
@udf(returnType=IntegerType())
def get_name_length(name):
    return len(name)

# Add the name_length column to the DataFrame
df = df.withColumn("name_length", get_name_length(df.name))

# Show the result
display(df)

See User-defined functions (UDFs) in Unity Catalog and User-defined scalar functions - Python.

Unity Catalog governed vs. session scoped UDFs

Unity Catalog Python UDFs and Batch Unity Catalog Python UDFs are persisted in Unity Catalog for improved governance, reuse, and discoverability. All other UDFs are session-based, which means they are defined in a notebook or job and are scoped to the current SparkSession. You can define and access session-scoped UDFs using Scala or Python.
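
To make the scoping distinction concrete, the following is a minimal sketch of a session-scoped UDF registered with spark.udf.register. The function name is illustrative; the registration lives only for the current SparkSession and is not persisted in Unity Catalog.

Python
from pyspark.sql.types import IntegerType

# Session-scoped: registered only on the current SparkSession, not persisted in Unity Catalog
spark.udf.register("get_name_length_session", lambda name: len(name), IntegerType())

# Usable from SQL in this session only; the registration disappears when the session ends
spark.sql("SELECT get_name_length_session('alice') AS name_length").show()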

Unity Catalog governed UDFs cheat sheet

Unity Catalog governed UDFs allow custom functions to be defined, used, securely shared, and governed across computing environments. See User-defined functions (UDFs) in Unity Catalog.

UDF type: Unity Catalog Python UDF

Supported compute:

  • Serverless notebooks and jobs
  • Classic compute with standard access mode (Databricks Runtime 13.3 LTS and above)
  • SQL warehouses (serverless, pro, and classic)
  • DLT (classic and serverless)

Description: Define a UDF in Python and register it in Unity Catalog for governance. Scalar UDFs operate on a single row and return a single result value for each row.

UDF type: Batch Unity Catalog Python UDF

Supported compute:

  • Serverless notebooks and jobs
  • Classic compute with standard access mode (Databricks Runtime 16.3 and above)
  • SQL warehouses (serverless, pro, and classic)

Description: Define a UDF in Python and register it in Unity Catalog for governance. Batch UDFs operate on batches of values and return multiple values, which reduces the overhead of row-by-row operations for large-scale data processing.
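
For example, the name-length logic from earlier can be registered as a Unity Catalog Python UDF so it is governed and reusable across sessions. The following is a minimal sketch that assumes the main.test schema exists and that you have permission to create functions in it.

Python
# Register a Unity Catalog Python UDF (assumes the main.test schema exists and you can create functions in it)
spark.sql("""
CREATE OR REPLACE FUNCTION main.test.get_name_length_py(name STRING)
RETURNS INT
LANGUAGE PYTHON
AS $$
return len(name) if name is not None else None
$$
""")

# The governed function can now be called from any SQL context with access to main.test
spark.sql("SELECT main.test.get_name_length_py('alice') AS name_length").show()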

Session scoped UDFs cheat sheet for user-isolated compute

Session scoped UDFs are defined in a notebook or job and are scoped to the current SparkSession. You can define and access session-scoped UDFs using Scala or Python.

UDF type: Python scalar

Supported compute:

  • Serverless notebooks and jobs
  • Classic compute with standard access mode (Databricks Runtime 13.3 LTS and above)
  • DLT (classic and serverless)

Description: Scalar UDFs operate on a single row and return a single result value for each row.

UDF type: Python non-scalar

Supported compute:

  • Serverless notebooks and jobs
  • Classic compute with standard access mode (Databricks Runtime 14.3 LTS and above)
  • DLT (classic and serverless)

Description: Non-scalar UDFs include pandas_udf, mapInPandas, mapInArrow, and applyInPandas. Pandas UDFs use Apache Arrow to transfer data and pandas to work with the data. Pandas UDFs support vectorized operations that can vastly increase performance over row-by-row scalar UDFs. See the pandas UDF sketch after this cheat sheet.

UDF type: Python UDTFs

Supported compute:

  • Serverless notebooks and jobs
  • Classic compute with standard access mode (Databricks Runtime 14.3 LTS and above)
  • DLT (classic and serverless)

Description: A UDTF takes one or more input arguments and returns multiple rows (and possibly multiple columns) for each input row.

UDF type: Scala scalar UDFs

Supported compute:

  • Classic compute with standard access mode (Databricks Runtime 13.3 LTS and above)

Description: Scalar UDFs operate on a single row and return a single result value for each row.

UDF type: Scala UDAFs

Supported compute:

  • Classic compute with standard access mode (Databricks Runtime 13.3 LTS and above)

Description: UDAFs operate on multiple rows and return a single aggregated result.
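
As a sketch of a Python non-scalar UDF, the following pandas UDF applies an element-wise operation to an entire batch of values at once. It assumes the same df with a numeric score column used in the earlier example.

Python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Vectorized pandas UDF: receives a pandas Series per batch instead of one value per row
@pandas_udf(DoubleType())
def add_bonus(score: pd.Series) -> pd.Series:
    # Element-wise operation applied to the whole batch at once
    return score * 1.1

# Assumes df has a numeric score column, as in the earlier example
df_with_bonus = df.withColumn("score_with_bonus", add_bonus(df.score))
display(df_with_bonus)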

Performance considerations

  • Built-in functions and SQL UDFs are the most efficient options.

  • Scala UDFs are generally faster than Python UDFs.

    • Unisolated Scala UDFs run in the Java Virtual Machine (JVM), so they avoid the overhead of moving data in and out of the JVM.
    • Isolated Scala UDFs have to move data in and out of the JVM, but they can still be faster than Python UDFs because they handle memory more efficiently.
  • Python UDFs and pandas UDFs tend to be slower than Scala UDFs because they need to serialize data and move it out of the JVM to the Python interpreter.

    • Pandas UDFs are up to 100x faster than Python UDFs because they use Apache Arrow to reduce serialization costs.