
DataFrame class

A distributed collection of data grouped into named columns.

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.

important

A DataFrame should not be created directly via its constructor; use SparkSession methods such as createDataFrame instead.

Supports Spark Connect

Properties

sparkSession: Returns the SparkSession that created this DataFrame.
rdd: Returns the content as an RDD of Row (Classic mode only).
na: Returns a DataFrameNaFunctions for handling missing values.
stat: Returns a DataFrameStatFunctions for statistic functions.
write: Interface for saving the content of the non-streaming DataFrame to external storage.
writeStream: Interface for saving the content of the streaming DataFrame to external storage.
schema: Returns the schema of this DataFrame as a StructType.
dtypes: Returns all column names and their data types as a list.
columns: Returns the names of all columns in the DataFrame as a list.
storageLevel: Gets the DataFrame's current storage level.
isStreaming: Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
executionInfo: Returns an ExecutionInfo object after the query has been executed.
plot: Returns a PySparkPlotAccessor for plotting functions.

Methods

Data viewing and inspection

toJSON(use_unicode): Converts the DataFrame into an RDD of strings, with each row serialized as a JSON document.
printSchema(level): Prints out the schema in tree format.
explain(extended, mode): Prints the (logical and physical) plans to the console for debugging.
show(n, truncate, vertical): Prints the first n rows of the DataFrame to the console.
collect(): Returns all the records in the DataFrame as a list of Row.
toLocalIterator(prefetchPartitions): Returns an iterator over all of the rows in this DataFrame.
take(num): Returns the first num rows as a list of Row.
tail(num): Returns the last num rows as a list of Row.
head(n): Returns the first n rows.
first(): Returns the first row as a Row.
count(): Returns the number of rows in this DataFrame.
isEmpty(): Returns True if the DataFrame is empty.
describe(*cols): Computes basic statistics for numeric and string columns.
summary(*statistics): Computes the specified statistics for numeric and string columns.

Temporary views

createTempView(name): Creates a local temporary view with this DataFrame.
createOrReplaceTempView(name): Creates or replaces a local temporary view with this DataFrame.
createGlobalTempView(name): Creates a global temporary view with this DataFrame.
createOrReplaceGlobalTempView(name): Creates or replaces a global temporary view using the given name.

Selection and projection

select(*cols): Projects a set of expressions and returns a new DataFrame.
selectExpr(*expr): Projects a set of SQL expressions and returns a new DataFrame.
filter(condition): Filters rows using the given condition.
where(condition): Alias for filter.
drop(*cols): Returns a new DataFrame without the specified columns.
toDF(*cols): Returns a new DataFrame with the specified new column names.
withColumn(colName, col): Returns a new DataFrame by adding a column or replacing the existing column of the same name.
withColumns(*colsMap): Returns a new DataFrame by adding multiple columns or replacing existing columns of the same names.
withColumnRenamed(existing, new): Returns a new DataFrame by renaming an existing column.
withColumnsRenamed(colsMap): Returns a new DataFrame by renaming multiple columns.
withMetadata(columnName, metadata): Returns a new DataFrame by updating an existing column with metadata.
metadataColumn(colName): Selects a metadata column by its logical column name and returns it as a Column.
colRegex(colName): Selects columns whose names match the given regex and returns them as a Column.

Sorting and ordering

sort(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s).
orderBy(*cols, **kwargs): Alias for sort.
sortWithinPartitions(*cols, **kwargs): Returns a new DataFrame with each partition sorted by the specified column(s).

Aggregation and grouping

groupBy(*cols): Groups the DataFrame by the specified columns so that aggregations can be performed on them.
rollup(*cols): Creates a multi-dimensional rollup for the current DataFrame using the specified columns.
cube(*cols): Creates a multi-dimensional cube for the current DataFrame using the specified columns.
groupingSets(groupingSets, *cols): Creates a multi-dimensional aggregation for the current DataFrame using the specified grouping sets.
agg(*exprs): Aggregates on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
observe(observation, *exprs): Defines (named) metrics to observe on the DataFrame.

Joins

join(other, on, how): Joins with another DataFrame, using the given join expression.
crossJoin(other): Returns the Cartesian product with another DataFrame.
lateralJoin(other, on, how): Lateral joins with another DataFrame, using the given join expression.

Set operations

union(other): Returns a new DataFrame containing the union of rows in this and another DataFrame.
unionByName(other, allowMissingColumns): Returns a new DataFrame containing the union of rows in this and another DataFrame, resolving columns by name.
intersect(other): Returns a new DataFrame containing only rows found in both this DataFrame and another DataFrame.
intersectAll(other): Returns a new DataFrame containing rows in both this DataFrame and another DataFrame, preserving duplicates.
subtract(other): Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame.
exceptAll(other): Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame, preserving duplicates.

Deduplication

distinct(): Returns a new DataFrame containing the distinct rows in this DataFrame.
dropDuplicates(subset): Returns a new DataFrame with duplicate rows removed, optionally considering only certain columns.
dropDuplicatesWithinWatermark(subset): Returns a new DataFrame with duplicate rows removed within the watermark, optionally considering only certain columns.

Sampling and splitting

sample(withReplacement, fraction, seed): Returns a sampled subset of this DataFrame.
sampleBy(col, fractions, seed): Returns a stratified sample without replacement, based on the fraction given for each stratum.
randomSplit(weights, seed): Randomly splits this DataFrame with the provided weights.

Partitioning

coalesce(numPartitions): Returns a new DataFrame that has exactly numPartitions partitions.
repartition(numPartitions, *cols): Returns a new DataFrame partitioned by the given partitioning expressions.
repartitionByRange(numPartitions, *cols): Returns a new DataFrame range-partitioned by the given partitioning expressions.
repartitionById(numPartitions, partitionIdCol): Returns a new DataFrame partitioned by the given partition ID expression.

Reshaping

unpivot(ids, values, variableColumnName, valueColumnName): Unpivots a DataFrame from wide format to long format.
melt(ids, values, variableColumnName, valueColumnName): Alias for unpivot.
transpose(indexColumn): Transposes the DataFrame so that the values in the specified index column become the new columns.

Missing data handling

dropna(how, thresh, subset): Returns a new DataFrame omitting rows with null or NaN values.
fillna(value, subset): Returns a new DataFrame whose null values are filled with the specified value.
replace(to_replace, value, subset): Returns a new DataFrame with a value replaced by another value.

Statistical functions

approxQuantile(col, probabilities, relativeError): Calculates approximate quantiles of numerical columns of the DataFrame.
corr(col1, col2, method): Calculates the correlation of two columns of the DataFrame as a double value.
cov(col1, col2): Calculates the sample covariance for the given columns, specified by their names.
crosstab(col1, col2): Computes a pair-wise frequency table of the given columns.
freqItems(cols, support): Finds frequent items for columns, possibly with false positives.

Schema operations

to(schema): Returns a new DataFrame where each row is reconciled to match the specified schema.
alias(alias): Returns a new DataFrame with an alias set.

Iteration

foreach(f): Applies the function f to each Row of this DataFrame.
foreachPartition(f): Applies the function f to each partition of this DataFrame.

Caching and persistence

cache(): Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER).
persist(storageLevel): Sets the storage level to persist the contents of the DataFrame across operations.
unpersist(blocking): Marks the DataFrame as non-persistent and removes all blocks for it from memory and disk.

Checkpointing

checkpoint(eager): Returns a checkpointed version of this DataFrame.
localCheckpoint(eager, storageLevel): Returns a locally checkpointed version of this DataFrame.

Streaming operations

withWatermark(eventTime, delayThreshold): Defines an event-time watermark for this DataFrame.

Optimization hints

hint(name, *parameters): Specifies a hint on the current DataFrame.

Limits and offsets

limit(num): Limits the result count to the specified number.
offset(num): Returns a new DataFrame that skips the first num rows.

Advanced transformations

transform(func, *args, **kwargs): Returns a new DataFrame; concise syntax for chaining custom transformations.

Conversion methods

toPandas(): Returns the contents of this DataFrame as a pandas DataFrame (pandas.DataFrame).
toArrow(): Returns the contents of this DataFrame as a PyArrow Table (pyarrow.Table).
pandas_api(index_col): Converts the DataFrame into a pandas-on-Spark DataFrame.
mapInPandas(func, schema, barrier, profile): Maps an iterator of batches in the current DataFrame using a native Python function.
mapInArrow(func, schema, barrier, profile): Maps an iterator of batches in the current DataFrame using a native Python function that operates on pyarrow.RecordBatch.

Writing data

writeTo(table): Creates a write configuration builder for v2 sources.
mergeInto(table, condition): Merges a set of updates, insertions, and deletions from a source table into a target table.

DataFrame comparison

sameSemantics(other): Returns True when the logical query plans inside both DataFrames are equal.
semanticHash(): Returns a hash code of this DataFrame's logical query plan.

Metadata and file information

inputFiles(): Returns a best-effort snapshot of the files that compose this DataFrame.

Advanced SQL features

isLocal(): Returns True if the collect and take methods can be run locally.
asTable(): Converts the DataFrame into a TableArg object, which can be used as a table argument in a TVF.
scalar(): Returns a Column object for a SCALAR subquery containing exactly one row and one column.
exists(): Returns a Column object for an EXISTS subquery.

Examples

Basic DataFrame operations

Python
# Create a DataFrame
people = spark.createDataFrame([
    {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50},
    {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100},
    {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150},
    {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200}
])

# Select columns
people.select("name", "age").show()

# Filter rows
people.filter(people.age > 30).show()

# Add a new column
people.withColumn("age_plus_10", people.age + 10).show()

Aggregation and grouping

Python
# Group by and aggregate
people.groupBy("gender").agg({"salary": "avg", "age": "max"}).show()

# Multiple aggregations
from pyspark.sql import functions as F
people.groupBy("deptId").agg(
    F.avg("salary").alias("avg_salary"),
    F.max("age").alias("max_age")
).show()

Joins

Python
# Create another DataFrame
department = spark.createDataFrame([
    {"id": 1, "name": "PySpark"},
    {"id": 2, "name": "ML"},
    {"id": 3, "name": "Spark SQL"}
])

# Join DataFrames
people.join(department, people.deptId == department.id).show()

Complex transformations

Python
# Chained operations
result = (
    people.filter(people.age > 30)
    .join(department, people.deptId == department.id)
    .groupBy(department.name, "gender")
    .agg({"salary": "avg", "age": "max"})
    .sort("max(age)")
)
result.show()