Optimizing Conversion between Spark and pandas DataFrames

Apache Arrow is an in-memory columnar data format used in Spark to efficiently transfer data between JVM and Python processes. This is beneficial to Python users who work with pandas and NumPy data. However, its use is not automatic and requires some minor changes to configuration or code to take full advantage of it and ensure compatibility.

Convert to/from pandas

Arrow is available as an optimization when converting a Spark DataFrame to a pandas DataFrame using the call toPandas() and when creating a Spark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow when executing these calls, set the Spark configuration spark.sql.execution.arrow.enabled to true. This configuration is disabled by default.

In addition, the optimizations enabled by spark.sql.execution.arrow.enabled can automatically fall back to a non-Arrow implementation if an error occurs before the computation starts within Spark. You can control this behavior with the Spark configuration spark.sql.execution.arrow.fallback.enabled.
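
For example, if you prefer that Arrow errors surface immediately rather than silently switching to the slower non-Arrow path, you can disable the fallback. This snippet assumes an active spark session, as in the example below:

# Raise errors instead of silently falling back to the non-Arrow conversion path
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")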

Example

import numpy as np
import pandas as pd

# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))

# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)

# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()

Using the above optimizations with Arrow produces the same results as when Arrow is not enabled. Note that even with Arrow, toPandas() collects all records in the DataFrame to the driver program, so it should be done only on a small subset of the data.
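
A common way to keep that collection bounded is to reduce the DataFrame first, for example with limit(); the row count below is arbitrary and chosen only for illustration:

# Collect only a bounded number of rows to the driver before converting
small_pdf = df.limit(1000).toPandas()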

In addition, not all Spark data types are supported, and an error can be raised if a column has an unsupported type. If an error occurs during createDataFrame(), Spark falls back to creating the DataFrame without Arrow.

Supported SQL types

PyArrow 0.8.0 is installed in Databricks Runtime 4.0 through 5.1. In Databricks Runtime 5.1 and above, all Spark SQL data types are supported by Arrow-based conversion except MapType, ArrayType of TimestampType, and nested StructType. BinaryType is supported only with PyArrow 0.10.0 and above.
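
As a minimal sketch of how an unsupported type behaves, the following builds a DataFrame with a MapType column (the schema and values here are hypothetical). With the fallback configuration enabled, toPandas() still succeeds, but it uses the slower non-Arrow conversion path:

from pyspark.sql.types import MapType, StringType, StructField, StructType

# MapType is not supported by Arrow-based conversion
schema = StructType([StructField("attrs", MapType(StringType(), StringType()))])
map_df = spark.createDataFrame([({"color": "red"},)], schema)

# With spark.sql.execution.arrow.fallback.enabled set to true,
# this falls back to the non-Arrow path instead of raising an error
map_pdf = map_df.toPandas()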