arrow_udtf

Creates a PyArrow-native user-defined table function (UDTF). The eval method receives PyArrow RecordBatches or Arrays and returns an iterator of PyArrow Tables or RecordBatches, so the handler processes a whole batch at a time and avoids row-by-row processing overhead.

Syntax

Python
from pyspark.databricks.sql import functions as dbf

@dbf.arrow_udtf(returnType=<returnType>)
class MyUDTF:
    def eval(self, ...):
        ...

Parameters

Parameter	Type	Description
`cls`	`class`, optional	The Python user-defined table function handler class.
`returnType`	`pyspark.sql.types.StructType` or `str`, optional	The return type of the user-defined table function. The value can be either a StructType object or a DDL-formatted struct type string.

Examples

UDTF with PyArrow RecordBatch input:

Python
import pyarrow as pa
from pyspark.databricks.sql.functions import arrow_udtf

@arrow_udtf(returnType="x int, y int")
class MyUDTF:
    def eval(self, batch: pa.RecordBatch):
        # Process the entire batch vectorized
        x_array = batch.column('x')
        y_array = batch.column('y')
        result_table = pa.table({
            'x': x_array,
            'y': y_array
        })
        yield result_table

df = spark.range(10).selectExpr("id as x", "id as y")
MyUDTF(df.asTable()).show()

UDTF with PyArrow Array inputs:

Python
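import pyarrow as pa
from pyspark.databricks.sql.functions import arrow_udtf
from pyspark.sql.functions import lit

@arrow_udtf(returnType="x int, y int")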
class MyUDTF2:
    def eval(self, x: pa.Array, y: pa.Array):
        # Process arrays vectorized
        result_table = pa.table({
            'x': x,
            'y': y
        })
        yield result_table

MyUDTF2(lit(1), lit(2)).show()

Note

The eval method must accept PyArrow RecordBatches or Arrays as input.
The eval method must yield PyArrow Tables or RecordBatches as output.
