Apache Arrow is an in-memory columnar data format used in Apache Spark to efficiently transfer data between JVM and Python processes. This is beneficial to Python developers who work with pandas and NumPy data. However, its use is not automatic: some minor changes to configuration or code are required to take full advantage of it and to ensure compatibility.
PyArrow is installed in Databricks Runtime. For information on the version of PyArrow available in each Databricks Runtime version, see the Databricks runtime release notes.
All Spark SQL data types are supported by Arrow-based conversion except MapType, ArrayType of TimestampType, and nested StructType. StructType is represented as a pandas.DataFrame instead of pandas.Series. BinaryType is supported only when the PyArrow version is 0.10.0 or higher.
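As a rough illustration of the type restrictions, the sketch below scans a schema and lists the fields that fall under the exceptions MapType, ArrayType of TimestampType, and nested StructType. `unsupported_fields` is a hypothetical helper, not part of PySpark. It assumes the schema is given as a dict shaped like the output of `df.schema.jsonValue()`, and it assumes "nested StructType" means a struct column that itself contains a struct.

```python
def unsupported_fields(schema):
    """Return (field_name, reason) pairs for top-level fields excluded from Arrow conversion."""
    problems = []
    for field in schema["fields"]:
        dtype = field["type"]
        if not isinstance(dtype, dict):
            # Simple types such as "long" or "timestamp" are plain strings.
            continue
        kind = dtype["type"]
        if kind == "map":
            problems.append((field["name"], "MapType"))
        elif kind == "array" and dtype["elementType"] == "timestamp":
            problems.append((field["name"], "ArrayType of TimestampType"))
        elif kind == "struct":
            # Assumption: "nested" means a struct that contains another struct.
            has_inner_struct = any(
                isinstance(inner["type"], dict) and inner["type"]["type"] == "struct"
                for inner in dtype["fields"]
            )
            if has_inner_struct:
                problems.append((field["name"], "nested StructType"))
    return problems


# Illustrative schema; the column names are made up for this example.
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
        {"name": "seen_at", "type": "timestamp", "nullable": True, "metadata": {}},
        {"name": "tags",
         "type": {"type": "map", "keyType": "string",
                  "valueType": "string", "valueContainsNull": True},
         "nullable": True, "metadata": {}},
        {"name": "events",
         "type": {"type": "array", "elementType": "timestamp", "containsNull": True},
         "nullable": True, "metadata": {}},
    ],
}

report = unsupported_fields(example_schema)
```

In a notebook, the same helper could be called as `unsupported_fields(df.schema.jsonValue())` before converting `df`.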
Arrow is available as an optimization when converting a PySpark DataFrame to a pandas DataFrame with toPandas() and when creating a PySpark DataFrame from a pandas DataFrame with createDataFrame(pandas_df). To use Arrow for these methods, set the Spark configuration spark.sql.execution.arrow.enabled to true. This configuration is disabled by default.
In addition, optimizations enabled by spark.sql.execution.arrow.enabled can fall back to a non-Arrow implementation if an error occurs before the computation within Spark. You can control this behavior using the Spark configuration spark.sql.execution.arrow.fallback.enabled.
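One way to use the fallback setting is to turn it off while debugging, so that an Arrow failure is raised instead of being replaced by the non-Arrow path. The sketch below assumes the fallback key is `spark.sql.execution.arrow.fallback.enabled`. `apply_settings` and `ARROW_STRICT_SETTINGS` are hypothetical names introduced for illustration.

```python
# Spark configuration keys for Arrow-based conversion, with values as strings.
ARROW_STRICT_SETTINGS = {
    "spark.sql.execution.arrow.enabled": "true",
    # Assumption: fallback is turned off here so that Arrow errors are raised
    # rather than hidden by a switch to the non-Arrow implementation.
    "spark.sql.execution.arrow.fallback.enabled": "false",
}


def apply_settings(conf, settings):
    """Apply each key/value pair through conf.set(key, value)."""
    for key, value in settings.items():
        conf.set(key, value)


# In a Databricks notebook, where `spark` is predefined:
# apply_settings(spark.conf, ARROW_STRICT_SETTINGS)
```

Keeping the settings in one dict makes it easy to switch between a strict profile for debugging and the default behavior.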
    import numpy as np
    import pandas as pd

    # Enable Arrow-based columnar data transfers
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # Generate a pandas DataFrame
    pdf = pd.DataFrame(np.random.rand(100, 3))

    # Create a Spark DataFrame from a pandas DataFrame using Arrow
    df = spark.createDataFrame(pdf)

    # Convert the Spark DataFrame back to a pandas DataFrame using Arrow
    result_pdf = df.select("*").toPandas()
Using the Arrow optimizations produces the same results as when Arrow is not enabled. Even with Arrow, toPandas() collects all records in the DataFrame to the driver program, so it should be used only on a small subset of the data.
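Because toPandas() brings every record to the driver, it can help to cap the row count before converting. The following is a minimal sketch; `to_pandas_capped` and its default of 10,000 rows are illustrative choices, not values from the text above. It relies only on the DataFrame methods `limit()` and `toPandas()`.

```python
def to_pandas_capped(df, max_rows=10_000):
    """Collect at most max_rows rows of a Spark DataFrame to the driver as pandas."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is applied in Spark, so only the capped rows reach the driver.
    return df.limit(max_rows).toPandas()


# In a notebook with an existing Spark DataFrame `df`:
# preview_pdf = to_pandas_capped(df, max_rows=1_000)
```

Note that `limit()` returns an arbitrary subset unless the DataFrame is ordered first, so this suits previews and inspection rather than statistics.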
In addition, not all Spark data types are supported, and an error can be raised if a column has an unsupported type. If an error occurs during createDataFrame(), Spark falls back to creating the DataFrame without Arrow.
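A fallback during createDataFrame() can be easy to miss, because the call still succeeds. The sketch below records any Python warnings raised during a call, assuming PySpark reports the fallback as a Python warning, which the text above does not state. `call_and_collect_warnings` is a hypothetical helper built on the standard `warnings` module.

```python
import warnings


def call_and_collect_warnings(func, *args, **kwargs):
    """Run func and return (result, list of warning messages raised during the call)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones normally shown only once.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]


# In a notebook with a pandas DataFrame `pdf`:
# df, notes = call_and_collect_warnings(spark.createDataFrame, pdf)
# A non-empty `notes` list may indicate that the Arrow path was not used.
```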