Skip to main content

Python user-defined table functions (UDTFs) in Unity Catalog

Preview

Registering Python UDTFs in Unity Catalog is in Public Preview.

A Unity Catalog user-defined table function (UDTF) allows you to register functions that return complete tables instead of scalar values. Unlike scalar functions that return a single result value from each call, UDTFs are invoked in a SQL statement's FROM clause and can return multiple rows and columns.

UDTFs are particularly useful for:

  • Transforming arrays or complex data structures into multiple rows
  • Integrating external APIs or services into SQL workflows
  • Implementing custom data generation or enrichment logic
  • Processing data that requires stateful operations across rows

Each UDTF call can accept zero or more arguments. These arguments can be scalar expressions or table arguments representing entire input tables.

UDTFs can be registered in two ways:

  • Unity Catalog: Register the UDTF as a governed object in Unity Catalog.
  • Session-scoped: Register to the local SparkSession, isolated to the current notebook or job. See Python user-defined table functions (UDTFs).

Requirements

Unity Catalog Python UDTFs are supported on the following compute types:

  • Classic compute with standard access mode (Databricks Runtime 17.1 and above)
  • SQL warehouse (serverless, pro, and classic)

Create a UDTF in Unity Catalog

Use SQL DDL to create a governed UDTF in Unity Catalog. UDTFs are invoked using a SQL statement’s FROM clause.

SQL
CREATE OR REPLACE FUNCTION square_numbers(start INT, end INT)
RETURNS TABLE (num INT, squared INT)
LANGUAGE PYTHON
HANDLER 'SquareNumbers'
DETERMINISTIC
AS $$
class SquareNumbers:
"""
Basic UDTF that computes a sequence of integers
and includes the square of each number in the range.
"""
def eval(self, start: int, end: int):
for num in range(start, end + 1):
yield (num, num * num)
$$;

SELECT * FROM square_numbers(1, 5);

Output
+-----+---------+
| num | squared |
+-----+---------+
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
| 4 | 16 |
| 5 | 25 |
+-----+---------+

Databricks implements Python UDTFs as Python classes with a mandatory eval method that yields output rows.

Environment isolation

note

Shared isolation environments require Databricks Runtime 17.2 and above. In earlier versions, all Unity Catalog Python UDTFs run in strict isolation mode.

Unity Catalog Python UDTFs with the same owner and session can share an isolation environment by default. This improves performance and reduces memory usage by reducing the number of separate environments that need to be launched.

Strict isolation

To ensure a UDTF always runs in its own, fully isolated environment, add the STRICT ISOLATION characteristic clause.

Most UDTFs don't need strict isolation. Standard data processing UDTFs benefit from the default shared isolation environment and run faster with lower memory consumption.

Add the STRICT ISOLATION characteristic clause to UDTFs that:

  • Run input as code using eval(), exec(), or similar functions.
  • Write files to the local file system.
  • Modify global variables or system state.
  • Access or modify environment variables.

The following UDTF example sets a custom environment variable, reads the variable back, and multiplies a set of numbers using the variable. Because the UDTF mutates the process environment, run it in STRICT ISOLATION. Otherwise, it could leak or override environment variables for other UDFs/UDTFs in the same environment, causing incorrect behavior.

SQL
CREATE OR REPLACE TEMPORARY FUNCTION multiply_numbers(factor STRING)
RETURNS TABLE (original INT, scaled INT)
LANGUAGE PYTHON
STRICT ISOLATION
HANDLER 'Multiplier'
AS $$
import os

class Multiplier:
def eval(self, factor: str):
# Save the factor as an environment variable
os.environ["FACTOR"] = factor

# Read it back and convert it to a number
scale = int(os.getenv("FACTOR", "1"))

# Multiply 0 through 4 by the factor
for i in range(5):
yield (i, i * scale)
$$;

SELECT * FROM multiply_numbers("3");

Practical examples

The following examples demonstrate real-world use cases for Unity Catalog Python UDTFs, progressing from simple data transformations to complex external integrations.

Example: Re-implementing explode

While Spark provides a built-in explode function, creating your own version demonstrates the fundamental UDTF pattern of taking a single input and producing multiple output rows.

SQL
CREATE OR REPLACE FUNCTION my_explode(arr ARRAY<STRING>)
RETURNS TABLE (element STRING)
LANGUAGE PYTHON
HANDLER 'MyExplode'
DETERMINISTIC
AS $$
class MyExplode:
def eval(self, arr):
if arr is None:
return
for element in arr:
yield (element,)
$$;

Use the function directly in a SQL query:

SQL
SELECT element FROM my_explode(array('apple', 'banana', 'cherry'));
Output
+---------+
|| element |
+---------+
|| apple |
|| banana |
|| cherry |
+---------+

Or apply it to existing table data with a LATERAL join:

SQL
SELECT s.*, e.element
FROM my_items AS s,
LATERAL my_explode(s.items) AS e;

Example: IP address geolocation via REST API

This example demonstrates how UDTFs can integrate external APIs directly into your SQL workflow. Instead of requiring separate ETL processes, analysts can enrich data with real-time API calls using familiar SQL syntax.

SQL
CREATE OR REPLACE FUNCTION ip_to_location(ip_address STRING)
RETURNS TABLE (city STRING, country STRING)
LANGUAGE PYTHON
HANDLER 'IPToLocationAPI'
AS $$
class IPToLocationAPI:
def eval(self, ip_address):
import requests
api_url = f"https://api.ip-lookup.example.com/{ip_address}"
try:
response = requests.get(api_url)
response.raise_for_status()
data = response.json()
yield (data.get('city'), data.get('country'))
except requests.exceptions.RequestException as e:
# Return nothing if the API request fails
return
$$;
note

Python UDTFs allow TCP/UDP network traffic over ports 80, 443, and 53 when using serverless compute or compute configured with standard access mode.

Use the function to enrich web log data with geographic information:

SQL
SELECT
l.timestamp,
l.request_path,
geo.city,
geo.country
FROM web_logs AS l,
LATERAL ip_to_location(l.ip_address) AS geo;

This approach enables real-time geographic analysis without requiring pre-processed lookup tables or separate data pipelines. The UDTF handles HTTP requests, JSON parsing, and error handling, making external data sources accessible through standard SQL queries.

Set DETERMINISTIC if your function produces consistent results

Add DETERMINISTIC to your function definition if it produces the same outputs for the same inputs. This allows query optimizations to improve performance.

By default, Batch Unity Catalog Python UDTFs are assumed to be non-deterministioc unless explicitly declared. Examples of non-deterministic functions include: generating random values, accessing current times or dates, or making external API calls.

See CREATE FUNCTION (SQL and Python)

Limitations

The following limitations apply to Unity Catalog Python UDTFs:

Next steps