Python user-defined table functions (UDTFs) in Unity Catalog
Registering Python UDTFs in Unity Catalog is in Public Preview.
A Unity Catalog user-defined table function (UDTF) allows you to register functions that return complete tables instead of scalar values. Unlike scalar functions, which return a single result value from each call, UDTFs are invoked in a SQL statement's FROM clause and can return multiple rows and columns.
UDTFs are particularly useful for:
- Transforming arrays or complex data structures into multiple rows
- Integrating external APIs or services into SQL workflows
- Implementing custom data generation or enrichment logic
- Processing data that requires stateful operations across rows
Each UDTF call can accept zero or more arguments. Arguments can be scalar expressions or table arguments representing entire input tables (see Limitations for what Unity Catalog UDTFs currently support).
UDTFs can be registered in two ways:
- Unity Catalog: Register the UDTF as a governed object in Unity Catalog.
- Session-scoped: Register the UDTF to the local SparkSession, isolated to the current notebook or job. See Python user-defined table functions (UDTFs).
Requirements
Unity Catalog Python UDTFs are supported on the following compute types:
- Classic compute with standard access mode (Databricks Runtime 17.1 and above)
- SQL warehouse (serverless, pro, and classic)
Create a UDTF in Unity Catalog
Use SQL DDL to create a governed UDTF in Unity Catalog. UDTFs are invoked using a SQL statement's FROM clause.
```sql
CREATE OR REPLACE FUNCTION square_numbers(start INT, end INT)
RETURNS TABLE (num INT, squared INT)
LANGUAGE PYTHON
HANDLER 'SquareNumbers'
DETERMINISTIC
AS $$
class SquareNumbers:
    """
    Basic UDTF that computes a sequence of integers
    and includes the square of each number in the range.
    """

    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)
$$;
```
```sql
SELECT * FROM square_numbers(1, 5);
```

```
+-----+---------+
| num | squared |
+-----+---------+
|   1 |       1 |
|   2 |       4 |
|   3 |       9 |
|   4 |      16 |
|   5 |      25 |
+-----+---------+
```
Databricks implements Python UDTFs as Python classes with a mandatory eval method that yields output rows.
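Because the handler is plain Python, you can prototype and unit test it outside Databricks before registering it. A minimal sketch using the SquareNumbers handler from the example above:

```python
# The handler body from the CREATE FUNCTION example, runnable as plain Python.
class SquareNumbers:
    def eval(self, start: int, end: int):
        # Yield one tuple per output row; tuple fields map positionally to
        # the RETURNS TABLE columns (num INT, squared INT).
        for num in range(start, end + 1):
            yield (num, num * num)

# Collect the yielded rows the same way the engine would.
rows = list(SquareNumbers().eval(1, 5))
print(rows)  # [(1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]
```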
Environment isolation
Shared isolation environments require Databricks Runtime 17.2 and above. In earlier versions, all Unity Catalog Python UDTFs run in strict isolation mode.
Unity Catalog Python UDTFs with the same owner and session can share an isolation environment by default. This improves performance and reduces memory usage by reducing the number of separate environments that need to be launched.
Strict isolation
To ensure a UDTF always runs in its own, fully isolated environment, add the STRICT ISOLATION characteristic clause.
Most UDTFs don't need strict isolation. Standard data processing UDTFs benefit from the default shared isolation environment and run faster with lower memory consumption.
Add the STRICT ISOLATION characteristic clause to UDTFs that:
- Run input as code using eval(), exec(), or similar functions.
- Write files to the local file system.
- Modify global variables or system state.
- Access or modify environment variables.
The following UDTF example sets a custom environment variable, reads the variable back, and multiplies a set of numbers using the variable. Because the UDTF mutates the process environment, run it with STRICT ISOLATION. Otherwise, it could leak or override environment variables for other UDFs and UDTFs in the same environment, causing incorrect behavior.
```sql
CREATE OR REPLACE TEMPORARY FUNCTION multiply_numbers(factor STRING)
RETURNS TABLE (original INT, scaled INT)
LANGUAGE PYTHON
STRICT ISOLATION
HANDLER 'Multiplier'
AS $$
import os

class Multiplier:
    def eval(self, factor: str):
        # Save the factor as an environment variable
        os.environ["FACTOR"] = factor

        # Read it back and convert it to a number
        scale = int(os.getenv("FACTOR", "1"))

        # Multiply 0 through 4 by the factor
        for i in range(5):
            yield (i, i * scale)
$$;
```
```sql
SELECT * FROM multiply_numbers("3");
```
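Running the handler logic as plain Python makes the side effect concrete: the environment variable written in eval() persists in the process after the call returns, which is exactly why this function needs strict isolation. A sketch:

```python
import os

# The Multiplier handler from the example above, as plain Python.
class Multiplier:
    def eval(self, factor: str):
        os.environ["FACTOR"] = factor          # mutates process-wide state
        scale = int(os.getenv("FACTOR", "1"))  # read the value back
        for i in range(5):
            yield (i, i * scale)

rows = list(Multiplier().eval("3"))
print(rows)  # [(0, 0), (1, 3), (2, 6), (3, 9), (4, 12)]

# The variable outlives the call; in a shared environment, other
# functions in the same process would now see it too.
print(os.environ["FACTOR"])  # 3
```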
Practical examples
The following examples demonstrate real-world use cases for Unity Catalog Python UDTFs, progressing from simple data transformations to complex external integrations.
Example: Re-implementing explode
While Spark provides a built-in explode function, creating your own version demonstrates the fundamental UDTF pattern of taking a single input and producing multiple output rows.
```sql
CREATE OR REPLACE FUNCTION my_explode(arr ARRAY<STRING>)
RETURNS TABLE (element STRING)
LANGUAGE PYTHON
HANDLER 'MyExplode'
DETERMINISTIC
AS $$
class MyExplode:
    def eval(self, arr):
        if arr is None:
            return
        for element in arr:
            yield (element,)
$$;
```
Use the function directly in a SQL query:
```sql
SELECT element FROM my_explode(array('apple', 'banana', 'cherry'));
```

```
+---------+
| element |
+---------+
|   apple |
|  banana |
|  cherry |
+---------+
```
Or apply it to existing table data with a LATERAL join:
```sql
SELECT s.*, e.element
FROM my_items AS s,
  LATERAL my_explode(s.items) AS e;
```
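The None check in the handler matters for the LATERAL join: returning without yielding means rows whose array is NULL simply produce no output rows instead of raising an error. You can verify that behavior locally:

```python
# The MyExplode handler from the example above, as plain Python.
class MyExplode:
    def eval(self, arr):
        if arr is None:
            return          # NULL input -> no output rows
        for element in arr:
            yield (element,)

h = MyExplode()
print(list(h.eval(["apple", "banana"])))  # [('apple',), ('banana',)]
print(list(h.eval(None)))                 # []
```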
Example: IP address geolocation via REST API
This example demonstrates how UDTFs can integrate external APIs directly into your SQL workflow. Instead of requiring separate ETL processes, analysts can enrich data with real-time API calls using familiar SQL syntax.
```sql
CREATE OR REPLACE FUNCTION ip_to_location(ip_address STRING)
RETURNS TABLE (city STRING, country STRING)
LANGUAGE PYTHON
HANDLER 'IPToLocationAPI'
AS $$
import requests

class IPToLocationAPI:
    def eval(self, ip_address):
        api_url = f"https://api.ip-lookup.example.com/{ip_address}"
        try:
            response = requests.get(api_url)
            response.raise_for_status()
            data = response.json()
            yield (data.get('city'), data.get('country'))
        except requests.exceptions.RequestException:
            # Return nothing if the API request fails
            return
$$;
```
Python UDTFs allow TCP/UDP network traffic over ports 80, 443, and 53 when using serverless compute or compute configured with standard access mode.
Use the function to enrich web log data with geographic information:
```sql
SELECT
  l.timestamp,
  l.request_path,
  geo.city,
  geo.country
FROM web_logs AS l,
  LATERAL ip_to_location(l.ip_address) AS geo;
```
This approach enables real-time geographic analysis without requiring pre-processed lookup tables or separate data pipelines. The UDTF handles HTTP requests, JSON parsing, and error handling, making external data sources accessible through standard SQL queries.
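Because the lookup runs once per input row, a per-request timeout and a separately testable parsing step make this kind of handler more robust. The following sketch restructures the example above; the URL and response fields are illustrative placeholders, as in the original:

```python
def parse_location(data: dict):
    """Pull (city, country) out of an API response payload."""
    return (data.get("city"), data.get("country"))

class IPToLocationAPI:
    def eval(self, ip_address):
        import requests  # imported inside eval, as in the example above
        api_url = f"https://api.ip-lookup.example.com/{ip_address}"
        try:
            # A timeout stops one slow lookup from stalling the whole query.
            response = requests.get(api_url, timeout=5)
            response.raise_for_status()
            yield parse_location(response.json())
        except requests.exceptions.RequestException:
            return  # skip rows whose lookup fails

# The parsing step can be checked without any network access:
print(parse_location({"city": "Berlin", "country": "DE"}))  # ('Berlin', 'DE')
print(parse_location({}))                                   # (None, None)
```

Keeping the JSON-to-tuple mapping in its own function means the only untestable part of the handler is the HTTP call itself.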
Set DETERMINISTIC if your function produces consistent results
Add DETERMINISTIC to your function definition if it produces the same outputs for the same inputs. This allows query optimizations that improve performance.
By default, Unity Catalog Python UDTFs are assumed to be non-deterministic unless explicitly declared otherwise. Examples of non-deterministic functions include generating random values, accessing current times or dates, and making external API calls.
See CREATE FUNCTION (SQL and Python).
Limitations
The following limitations apply to Unity Catalog Python UDTFs:
- Polymorphic table functions are not supported.
- TABLE arguments (passing an entire table as input) are not supported.
- Unity Catalog service credentials are not supported.
- Custom dependencies are not supported.