User-defined functions (UDFs) in Unity Catalog

Preview

This feature is in Public Preview.

User-defined functions (UDFs) in Unity Catalog extend SQL and Python’s capabilities within Databricks. They allow custom functions to be defined, used, and securely shared and governed across computing environments.

Python UDFs registered as functions in Unity Catalog differ in scope and support from PySpark UDFs scoped to a notebook or SparkSession. See User-defined scalar functions - Python.

See CREATE FUNCTION (SQL and Python) for complete SQL language reference.

Requirements

To use UDFs in Unity Catalog, the following requirements must be met:

  • To use Python code in UDFs that are registered in Unity Catalog, you must use a serverless or pro SQL warehouse or a cluster running Databricks Runtime 13.3 LTS or above.

  • If a view includes a UC Python UDF, it will fail on SQL Classic Warehouses.

  • Graviton instance support for Scala UDFs on enabled clusters is available in Databricks Runtime 15.2 and above.

Creating UDFs in Unity Catalog

To create a UDF in Unity Catalog, users need USAGE and CREATE permission on the schema and USAGE permission on the catalog. See Unity Catalog for more details.

To run a UDF, users need EXECUTE permission on the UDF. Users also need USAGE permission on the schema and catalog.

The following example registers a new function to the my_schema Unity Catalog schema:

CREATE OR REPLACE FUNCTION my_catalog.my_schema.calculate_bmi(weight DOUBLE, height DOUBLE)
RETURNS DOUBLE
LANGUAGE SQL
AS
SELECT weight / (height * height);

Python UDFs for Unity Catalog use statements offset by double dollar signs ($$). You also need to specify a data type mapping. The following example registers a UDF that calculates body mass index:

CREATE FUNCTION my_catalog.my_schema.calculate_bmi(weight_kg DOUBLE, height_m DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
return weight_kg / (height_m ** 2)
$$;

You can now use this Unity Catalog function in your SQL queries or PySpark code:

SELECT person_id, my_catalog.my_schema.calculate_bmi(weight_kg, height_m) AS bmi
FROM person_data;

Using the Unity Catalog UDF in PySpark

from pyspark.sql.functions import expr

result = df.withColumn("bmi", expr("my_catalog.my_schema.calculate_bmi(weight_kg, height_m)"))
display(result)

Upgrade a session-scoped UDF

Note

Syntax and semantics for Python UDFs in Unity Catalog differ from Python UDFs registered to the SparkSession. See user-defined scalar functions - Python.

Given the following session-based UDF in a Databricks notebook:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def greet(name):
    return f"Hello, {name}!"

# Using the session-based UDF
result = df.withColumn("greeting", greet("name"))
result.show()

To register this as a Unity Catalog function, use a SQL CREATE FUNCTION statement, as in the following example:

CREATE OR REPLACE FUNCTION my_catalog.my_schema.greet(name STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
return f"Hello, {name}!"
$$

Share UDFs in Unity Catalog

Permissions for UDFs are managed based on the access controls applied to the catalog, schema, or database where the UDF is registered. See Unity Catalog for more information.

Use the Databricks SQL or the Databricks workspace UI to give permissions to a user or group (recommended).

Permissions in the workspace UI

  1. Find the catalog and schema where your UDF is stored and select the UDF.

  2. Look for a Permissions option in the UDF settings. Add users or groups and specify the type of access they should have, such as EXECUTE or MANAGE.

Permissions in Workspace UI

Permissions using Databricks SQL

The following example grants a user the EXECUTE permission on a function:

GRANT EXECUTE ON FUNCTION my_catalog.my_schema.calculate_bmi TO user@example.com;

To remove permissions, use the REVOKE command as in the following example:

REVOKE EXECUTE ON FUNCTION my_catalog.my_schema.calculate_bmi FROM user@example.com;

Best Practices for UDFs

For UDFs that need to be accessible to all users, Databricks recommends creating a dedicated catalog and schema with appropriate access controls.

For team-specific UDFs, use a dedicated schema within the team’s catalog for storage and management.

Databricks recommends you include the following information in the UDF’s docstring:

  • The current version number

  • A changelog to track modifications across versions

  • The UDF’s purpose, parameters, and return value

  • An example of how to use the UDF

Here is an example of a UDF following best practices:

CREATE OR REPLACE FUNCTION my_catalog.my_schema.calculate_bmi(weight_kg DOUBLE, height_m DOUBLE)
RETURNS DOUBLE
COMMENT Calculates Body Mass Index (BMI) from weight and height.
LANGUAGE PYTHON
AS $$
 """
Parameters:
calculate_bmi (version 1.2):
- weight_kg (float): Weight of the individual in kilograms.
- height_m (float): Height of the individual in meters.


Returns:
- float: The calculated BMI.


Example Usage:

SELECT calculate_bmi(weight, height) AS bmi FROM person_data;


Change Log:
- 1.0: Initial version.
- 1.1: Improved error handling for zero or negative height values.
- 1.2: Optimized calculation for performance.


 Note: BMI is calculated as weight in kilograms divided by the square of height in meters.
 """
if height_m <= 0:
 return None  # Avoid division by zero and ensure height is positive
return weight_kg / (height_m ** 2)
$$;

UDFs for accessing external APIs

You can use UDFs to access external APIs from SQL. The following example uses the Python requests library to make an HTTP request.

Note

Python UDFs allow TCP/UDP network traffic over ports 80, 443, and 53 using serverless compute or compute configured with shared access mode.

CREATE FUNCTION my_catalog.my_schema.get_food_calories(food_name STRING)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
import requests


api_url = f"https://example-food-api.com/nutrition?food={food_name}"
response = requests.get(api_url)

if response.status_code == 200:
   data = response.json()
   # Assuming the API returns a JSON object with a 'calories' field
   calories = data.get('calories', 0)
   return calories
else:
   return None  # API request failed

$$;

UDFs for security and compliance

Use Python UDFs to implement custom tokenization, data masking, data redaction, or encryption mechanisms.

The following example masks the identity of an email address while maintaining length and domain:

CREATE OR REPLACE FUNCTION my_catalog.my_schema.mask_email(email STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
parts = email.split('@')
masked_username = username[0] + '*' * (len(username) - 2) + username[-1]
return f"{masked_username}@{domain}"
$$

The following example applies this UDF in a dynamic view definition:

-- First, create the view
CREATE OR REPLACE VIEW my_catalog.my_schema.masked_customer_view AS
SELECT
  id,
  name,
  my_catalog.my_schema.mask_email(email) AS email
FROM my_catalog.my_schema.customer_data;

-- Now you can query the view
SELECT * FROM my_catalog.my_schema.masked_customer_view;
+---+------------+------------------------+------------------------+
| id|        name|                   email|           masked_email |
+---+------------+------------------------+------------------------+
|  1|    John Doe|   john.doe@example.com |  j*******e@example.com |
|  2| Alice Smith|alice.smith@company.com |a**********h@company.com|
|  3|   Bob Jones|    bob.jones@email.org |   b********s@email.org |
+---+------------+------------------------+------------------------+

Limitations

  • You can define any number of Python functions within a Python UDF but all must return a scalar value.

  • Python functions must handle NULL values independently, and all type mappings must follow Databricks SQL language mappings.

  • You can import standard Python libraries included by Databricks, but you cannot include custom libraries or external dependencies.

  • If no catalog or schema is specified, Python UDFs are registered to the current active schema.

  • Python UDFs execute in a secure, isolated environment and do not have access to file systems or internal services.