Skip to main content

mapInArrow

Maps an iterator of batches in the current DataFrame using a Python native function that is performed on pyarrow.RecordBatchs both as input and output, and returns the result as a DataFrame.

Syntax

mapInArrow(func: "ArrowMapIterFunction", schema: Union[StructType, str], barrier: bool = False, profile: Optional[ResourceProfile] = None)

Parameters

Parameter

Type

Description

func

function

a Python native function that takes an iterator of pyarrow.RecordBatchs, and outputs an iterator of pyarrow.RecordBatchs.

schema

DataType or str

the return type of the func in PySpark. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.

barrier

bool, optional, default False

Use barrier mode execution, ensuring that all Python workers in the stage will be launched concurrently.

profile

ResourceProfile, optional

The optional ResourceProfile to be used for mapInArrow.

Returns

DataFrame

Examples

Python
import pyarrow as pa
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))
def filter_func(iterator):
for batch in iterator:
pdf = batch.to_pandas()
yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])
df.mapInArrow(filter_func, df.schema).show()
# +---+---+
# | id|age|
# +---+---+
# | 1| 21|
# +---+---+

df.mapInArrow(filter_func, df.schema, barrier=True).collect()
# [Row(id=1, age=21)]