User-defined functions (UDFs) in Unity Catalog
Preview
This feature is in Public Preview.
User-defined functions (UDFs) in Unity Catalog extend SQL and Python’s capabilities within Databricks. They allow custom functions to be defined, used, and securely shared and governed across computing environments.
Python UDFs registered as functions in Unity Catalog differ in scope and support from PySpark UDFs scoped to a notebook or SparkSession. See User-defined scalar functions - Python.
See CREATE FUNCTION (SQL and Python) for complete SQL language reference.
Requirements
To use UDFs in Unity Catalog, the following requirements must be met:
To use Python code in UDFs that are registered in Unity Catalog, you must use a serverless or pro SQL warehouse or a cluster running Databricks Runtime 13.3 LTS or above.
Views that include a Unity Catalog Python UDF fail on classic SQL warehouses.
Graviton instance support for Scala UDFs on Unity Catalog-enabled clusters is available in Databricks Runtime 15.2 and above.
Creating UDFs in Unity Catalog
To create a UDF in Unity Catalog, users need USAGE and CREATE permission on the schema and USAGE permission on the catalog. See Unity Catalog for more details.
To run a UDF, users need EXECUTE permission on the UDF. Users also need USAGE permission on the schema and catalog.
The following example registers a new function to the my_schema Unity Catalog schema:
CREATE OR REPLACE FUNCTION my_catalog.my_schema.calculate_bmi(weight DOUBLE, height DOUBLE)
RETURNS DOUBLE
LANGUAGE SQL
AS
SELECT weight / (height * height);
Python UDFs for Unity Catalog use statements offset by double dollar signs ($$). You also need to specify a data type mapping. The following example registers a UDF that calculates body mass index:
CREATE FUNCTION my_catalog.my_schema.calculate_bmi(weight_kg DOUBLE, height_m DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
return weight_kg / (height_m ** 2)
$$;
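The body between the $$ markers runs as the body of a Python function. Before registering, you can sanity-check the same logic locally as a plain Python function (a local stand-in for illustration, not the registered UDF itself):

```python
# Local stand-in for the UDF body registered above; same logic, plain Python.
def calculate_bmi(weight_kg: float, height_m: float) -> float:
    return weight_kg / (height_m ** 2)

print(calculate_bmi(70.0, 1.75))  # ≈ 22.86
```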
You can now use this Unity Catalog function in your SQL queries or PySpark code:
SELECT person_id, my_catalog.my_schema.calculate_bmi(weight_kg, height_m) AS bmi
FROM person_data;
Using the Unity Catalog UDF in PySpark
from pyspark.sql.functions import expr
result = df.withColumn("bmi", expr("my_catalog.my_schema.calculate_bmi(weight_kg, height_m)"))
display(result)
Upgrade a session-scoped UDF
Note
Syntax and semantics for Python UDFs in Unity Catalog differ from Python UDFs registered to the SparkSession. See User-defined scalar functions - Python.
Given the following session-based UDF in a Databricks notebook:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
@udf(StringType())
def greet(name):
    return f"Hello, {name}!"
# Using the session-based UDF
result = df.withColumn("greeting", greet("name"))
result.show()
To register this as a Unity Catalog function, use a SQL CREATE FUNCTION
statement, as in the following example:
CREATE OR REPLACE FUNCTION my_catalog.my_schema.greet(name STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
return f"Hello, {name}!"
$$;
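Because the $$ body is plain Python, you can verify the migrated logic locally before registering it, using an ordinary function with the same body (a sketch, independent of Unity Catalog):

```python
# Plain-Python stand-in for the body of the registered greet function.
def greet(name: str) -> str:
    return f"Hello, {name}!"

print(greet("World"))  # prints: Hello, World!
```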
UDFs for accessing external APIs
You can use UDFs to access external APIs from SQL. The following example uses the Python requests
library to make an HTTP request.
Note
Python UDFs allow TCP/UDP network traffic over ports 80, 443, and 53 using serverless compute or compute configured with shared access mode.
CREATE FUNCTION my_catalog.my_schema.get_food_calories(food_name STRING)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
import requests
api_url = f"https://example-food-api.com/nutrition?food={food_name}"
response = requests.get(api_url)
if response.status_code == 200:
    data = response.json()
    # Assuming the API returns a JSON object with a 'calories' field
    calories = data.get('calories', 0)
    return calories
else:
    return None # API request failed
$$;
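The API URL above is a placeholder, so the response-handling logic can be exercised locally with a stubbed response object instead of a live HTTP call (FakeResponse and extract_calories are hypothetical helpers for illustration only):

```python
# Stand-in for the UDF body's response handling, driven by a fake
# response object rather than requests.get over the network.
class FakeResponse:
    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload

def extract_calories(response):
    if response.status_code == 200:
        data = response.json()
        # Assumes the API returns a JSON object with a 'calories' field
        return data.get('calories', 0)
    return None  # API request failed

print(extract_calories(FakeResponse(200, {"calories": 95.0})))  # → 95.0
print(extract_calories(FakeResponse(404, {})))  # → None
```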
UDFs for security and compliance
Use Python UDFs to implement custom tokenization, data masking, data redaction, or encryption mechanisms.
The following example masks the identity of an email address while maintaining length and domain:
CREATE OR REPLACE FUNCTION my_catalog.my_schema.mask_email(email STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
username, domain = email.split('@')
masked_username = username[0] + '*' * (len(username) - 2) + username[-1]
return f"{masked_username}@{domain}"
$$;
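The masking logic can be checked locally as a plain Python function mirroring the UDF body (a local stand-in for illustration):

```python
# Local stand-in for the mask_email UDF body: keep the first and last
# character of the username, star out the rest, preserve the domain.
def mask_email(email: str) -> str:
    username, domain = email.split('@')
    masked_username = username[0] + '*' * (len(username) - 2) + username[-1]
    return f"{masked_username}@{domain}"

print(mask_email("john.doe@example.com"))  # → j******e@example.com
```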
The following example applies this UDF in a dynamic view definition:
-- First, create the view
CREATE OR REPLACE VIEW my_catalog.my_schema.masked_customer_view AS
SELECT
id,
name,
my_catalog.my_schema.mask_email(email) AS email
FROM my_catalog.my_schema.customer_data;
-- Now you can query the view
SELECT * FROM my_catalog.my_schema.masked_customer_view;
+---+------------+------------------------+
| id|        name|                   email|
+---+------------+------------------------+
|  1|    John Doe|    j******e@example.com|
|  2| Alice Smith| a*********h@company.com|
|  3|   Bob Jones|     b*******s@email.org|
+---+------------+------------------------+
Limitations
You can define any number of Python functions within a Python UDF, but the UDF must return a scalar value.
Python functions must handle NULL values independently, and all type mappings must follow Databricks SQL language mappings.
You can import standard Python libraries included by Databricks, but you cannot include custom libraries or external dependencies.
If no catalog or schema is specified, Python UDFs are registered to the current active schema.
Python UDFs execute in a secure, isolated environment and do not have access to file systems or internal services.
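The NULL-handling limitation above means SQL NULL arrives in the UDF body as Python None, so the body should guard for it explicitly. A plain-Python sketch of the pattern (a local stand-in, not the registered UDF):

```python
# NULL-safe stand-in for a UDF body: SQL NULL inputs arrive as None,
# so check for None before doing arithmetic to avoid a TypeError.
def calculate_bmi(weight_kg, height_m):
    if weight_kg is None or height_m is None:
        return None  # propagate NULL rather than raising an error
    return weight_kg / (height_m ** 2)

print(calculate_bmi(None, 1.75))  # → None
```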