arrow_udtf

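Creates a PyArrow-native user-defined table function (UDTF). This function provides a PyArrow-native interface for UDTFs: the eval method receives PyArrow RecordBatches or Arrays and returns an Iterator of PyArrow Tables or RecordBatches. Because eval operates on whole batches, it enables true vectorized computation without row-by-row processing overhead.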

Syntax

Python
from pyspark.databricks.sql import functions as dbf

@dbf.arrow_udtf(returnType=<returnType>)
class MyUDTF:
    def eval(self, ...):
        ...

Parameters

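Parameter	Type	Description
`cls`	`class`, optional	The Python user-defined table function handler class.
`returnType`	`pyspark.sql.types.StructType` or `str`, optional	The return type of the user-defined table function. The value can be either a StructType object or a DDL-formatted struct type string.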

Examples

UDTF with PyArrow RecordBatch input:

Python
import pyarrow as pa
from pyspark.databricks.sql.functions import arrow_udtf

@arrow_udtf(returnType="x int, y int")
class MyUDTF:
    def eval(self, batch: pa.RecordBatch):
        # Process the entire batch vectorized
        x_array = batch.column('x')
        y_array = batch.column('y')
        result_table = pa.table({
            'x': x_array,
            'y': y_array
        })
        yield result_table

df = spark.range(10).selectExpr("id as x", "id as y")
MyUDTF(df.asTable()).show()

UDTF with PyArrow Array input:

Python
# Reuses pa and arrow_udtf imported in the previous example
from pyspark.sql.functions import lit

@arrow_udtf(returnType="x int, y int")
class MyUDTF2:
    def eval(self, x: pa.Array, y: pa.Array):
        # Process arrays vectorized
        result_table = pa.table({
            'x': x,
            'y': y
        })
        yield result_table

MyUDTF2(lit(1), lit(2)).show()

Notes

The eval method must accept PyArrow RecordBatches or Arrays as input.
The eval method must yield PyArrow Tables or RecordBatches as output.
