DataFrame class
A distributed collection of data grouped into named columns.
A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.
A DataFrame should not be created directly through its constructor; use a SparkSession factory method such as createDataFrame instead.
Supports Spark Connect
Properties
| Property | Description |
|---|---|
| `sparkSession` | Returns the SparkSession that created this DataFrame. |
| `rdd` | Returns the content as an RDD of Row (Classic mode only). |
| `na` | Returns a DataFrameNaFunctions for handling missing values. |
| `stat` | Returns a DataFrameStatFunctions for statistic functions. |
| `write` | Interface for saving the content of the non-streaming DataFrame out into external storage. |
| `writeStream` | Interface for saving the content of the streaming DataFrame out into external storage. |
| `schema` | Returns the schema of this DataFrame as a StructType. |
| `dtypes` | Returns all column names and their data types as a list. |
| `columns` | Retrieves the names of all columns in the DataFrame as a list. |
| `storageLevel` | Gets the DataFrame's current storage level. |
| `isStreaming` | Returns True if this DataFrame contains one or more sources that continuously return data as it arrives. |
| `executionInfo` | Returns an ExecutionInfo object after the query has been executed. |
| `plot` | Returns a PySparkPlotAccessor for plotting functions. |
Methods
Data viewing and inspection
| Method | Description |
|---|---|
| `toJSON` | Converts a DataFrame into an RDD of string or a DataFrame. |
| `printSchema` | Prints out the schema in the tree format. |
| `explain` | Prints the (logical and physical) plans to the console for debugging purposes. |
| `show` | Prints the first n rows of the DataFrame to the console. |
| `collect` | Returns all the records in the DataFrame as a list of Row. |
| `toLocalIterator` | Returns an iterator that contains all of the rows in this DataFrame. |
| `take` | Returns the first num rows as a list of Row. |
| `tail` | Returns the last num rows as a list of Row. |
| `head` | Returns the first n rows. |
| `first` | Returns the first row as a Row. |
| `count` | Returns the number of rows in this DataFrame. |
| `isEmpty` | Checks whether the DataFrame is empty and returns a boolean value. |
| `describe` | Computes basic statistics for numeric and string columns. |
| `summary` | Computes specified statistics for numeric and string columns. |
Temporary views
| Method | Description |
|---|---|
| `createTempView` | Creates a local temporary view with this DataFrame. |
| `createOrReplaceTempView` | Creates or replaces a local temporary view with this DataFrame. |
| `createGlobalTempView` | Creates a global temporary view with this DataFrame. |
| `createOrReplaceGlobalTempView` | Creates or replaces a global temporary view using the given name. |
Selection and projection
| Method | Description |
|---|---|
| `select` | Projects a set of expressions and returns a new DataFrame. |
| `selectExpr` | Projects a set of SQL expressions and returns a new DataFrame. |
| `filter` | Filters rows using the given condition. |
| `where` | Alias for filter. |
| `drop` | Returns a new DataFrame without the specified columns. |
| `toDF` | Returns a new DataFrame with the newly specified column names. |
| `withColumn` | Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
| `withColumns` | Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names. |
| `withColumnRenamed` | Returns a new DataFrame by renaming an existing column. |
| `withColumnsRenamed` | Returns a new DataFrame by renaming multiple columns. |
| `withMetadata` | Returns a new DataFrame by updating an existing column with metadata. |
| `metadataColumn` | Selects a metadata column based on its logical column name and returns it as a Column. |
| `colRegex` | Selects columns based on the column name specified as a regex and returns them as a Column. |
Sorting and ordering
| Method | Description |
|---|---|
| `sort` | Returns a new DataFrame sorted by the specified column(s). |
| `orderBy` | Alias for sort. |
| `sortWithinPartitions` | Returns a new DataFrame with each partition sorted by the specified column(s). |
Aggregation and grouping
| Method | Description |
|---|---|
| `groupBy` | Groups the DataFrame by the specified columns so that aggregation can be performed on them. |
| `rollup` | Creates a multi-dimensional rollup for the current DataFrame using the specified columns. |
| `cube` | Creates a multi-dimensional cube for the current DataFrame using the specified columns. |
| `groupingSets` | Creates a multi-dimensional aggregation for the current DataFrame using the specified grouping sets. |
| `agg` | Aggregates on the entire DataFrame without groups (shorthand for df.groupBy().agg()). |
| `observe` | Defines (named) metrics to observe on the DataFrame. |
Joins
| Method | Description |
|---|---|
| `join` | Joins with another DataFrame, using the given join expression. |
| `crossJoin` | Returns the Cartesian product with another DataFrame. |
| `lateralJoin` | Lateral-joins with another DataFrame, using the given join expression. |
Set operations
| Method | Description |
|---|---|
| `union` | Returns a new DataFrame containing the union of rows in this and another DataFrame, resolving columns by position. |
| `unionByName` | Returns a new DataFrame containing the union of rows in this and another DataFrame, resolving columns by name. |
| `intersect` | Returns a new DataFrame containing rows only in both this DataFrame and another DataFrame. |
| `intersectAll` | Returns a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
| `subtract` | Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
| `exceptAll` | Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
Deduplication
| Method | Description |
|---|---|
| `distinct` | Returns a new DataFrame containing the distinct rows in this DataFrame. |
| `dropDuplicates` | Returns a new DataFrame with duplicate rows removed, optionally only considering certain columns. |
| `dropDuplicatesWithinWatermark` | Returns a new DataFrame with duplicate rows removed, optionally only considering certain columns, within a watermark. |
Sampling and splitting
| Method | Description |
|---|---|
| `sample` | Returns a sampled subset of this DataFrame. |
| `sampleBy` | Returns a stratified sample without replacement based on the fraction given on each stratum. |
| `randomSplit` | Randomly splits this DataFrame with the provided weights. |
Partitioning
| Method | Description |
|---|---|
| `coalesce` | Returns a new DataFrame that has exactly numPartitions partitions. |
| `repartition` | Returns a new DataFrame partitioned by the given partitioning expressions. |
| `repartitionByRange` | Returns a new DataFrame range-partitioned by the given partitioning expressions. |
| `repartitionById` | Returns a new DataFrame partitioned by the given partition ID expression. |
Reshaping
| Method | Description |
|---|---|
| `unpivot` | Unpivots a DataFrame from wide format to long format. |
| `melt` | Alias for unpivot. |
| `transpose` | Transposes a DataFrame such that the values in the specified index column become the new columns. |
Missing data handling
| Method | Description |
|---|---|
| `dropna` | Returns a new DataFrame omitting rows with null or NaN values. |
| `fillna` | Returns a new DataFrame in which null values are filled with a new value. |
| `replace` | Returns a new DataFrame replacing a value with another value. |
Statistical functions
| Method | Description |
|---|---|
| `approxQuantile` | Calculates the approximate quantiles of numerical columns of a DataFrame. |
| `corr` | Calculates the correlation of two columns of a DataFrame as a double value. |
| `cov` | Calculates the sample covariance for the given columns, specified by their names. |
| `crosstab` | Computes a pair-wise frequency table of the given columns. |
| `freqItems` | Finds frequent items for columns, possibly with false positives. |
Schema operations
| Method | Description |
|---|---|
| `to` | Returns a new DataFrame where each row is reconciled to match the specified schema. |
| `alias` | Returns a new DataFrame with an alias set. |
Iteration
| Method | Description |
|---|---|
| `foreach` | Applies the f function to all Rows of this DataFrame. |
| `foreachPartition` | Applies the f function to each partition of this DataFrame. |
Caching and persistence
| Method | Description |
|---|---|
| `cache` | Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER). |
| `persist` | Sets the storage level to persist the contents of the DataFrame across operations. |
| `unpersist` | Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
Checkpointing
| Method | Description |
|---|---|
| `checkpoint` | Returns a checkpointed version of this DataFrame. |
| `localCheckpoint` | Returns a locally checkpointed version of this DataFrame. |
Streaming operations
| Method | Description |
|---|---|
| `withWatermark` | Defines an event-time watermark for this DataFrame. |
Optimization hints
| Method | Description |
|---|---|
| `hint` | Specifies a hint on the current DataFrame. |
Limits and offsets
| Method | Description |
|---|---|
| `limit` | Limits the result count to the number specified. |
| `offset` | Returns a new DataFrame by skipping the first n rows. |
Advanced transformations
| Method | Description |
|---|---|
| `transform` | Returns a new DataFrame; concise syntax for chaining custom transformations. |
Conversion methods
| Method | Description |
|---|---|
| `toPandas` | Returns the contents of this DataFrame as a pandas.DataFrame. |
| `toArrow` | Returns the contents of this DataFrame as a PyArrow pyarrow.Table. |
| `pandas_api` | Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
| `mapInPandas` | Maps an iterator of batches in the current DataFrame using a Python native function. |
| `mapInArrow` | Maps an iterator of batches in the current DataFrame using a Python native function operating on pyarrow.RecordBatch. |
Writing data
| Method | Description |
|---|---|
| `writeTo` | Creates a write configuration builder for v2 sources. |
| `mergeInto` | Merges a set of updates, insertions, and deletions based on a source table into a target table. |
DataFrame comparison
| Method | Description |
|---|---|
| `sameSemantics` | Returns True when the logical query plans inside both DataFrames are equal. |
| `semanticHash` | Returns a hash code of the logical query plan of this DataFrame. |
Metadata and file information
| Method | Description |
|---|---|
| `inputFiles` | Returns a best-effort snapshot of the files that compose this DataFrame. |
Advanced SQL features
| Method | Description |
|---|---|
| `isLocal` | Returns True if the collect and take methods can be run locally. |
| `asTable` | Converts the DataFrame into a TableArg object, which can be used as a table argument in a TVF. |
| `scalar` | Returns a Column object for a scalar subquery containing exactly one row and one column. |
| `exists` | Returns a Column object for an EXISTS subquery. |
Examples
Basic DataFrame operations
```python
# Create a DataFrame
people = spark.createDataFrame([
    {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50},
    {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100},
    {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150},
    {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200}
])

# Select columns
people.select("name", "age").show()

# Filter rows
people.filter(people.age > 30).show()

# Add a new column
people.withColumn("age_plus_10", people.age + 10).show()
```
Aggregation and grouping
```python
# Group by and aggregate
people.groupBy("gender").agg({"salary": "avg", "age": "max"}).show()

# Multiple aggregations
from pyspark.sql import functions as F
people.groupBy("deptId").agg(
    F.avg("salary").alias("avg_salary"),
    F.max("age").alias("max_age")
).show()
```
Joins
```python
# Create another DataFrame
department = spark.createDataFrame([
    {"id": 1, "name": "PySpark"},
    {"id": 2, "name": "ML"},
    {"id": 3, "name": "Spark SQL"}
])

# Join DataFrames
people.join(department, people.deptId == department.id).show()
```
Complex transformations
```python
# Chained operations
result = people.filter(people.age > 30) \
    .join(department, people.deptId == department.id) \
    .groupBy(department.name, "gender") \
    .agg({"salary": "avg", "age": "max"}) \
    .sort("max(age)")  # dict-style agg names the column "max(age)"
result.show()
```