
DataFrame class

A distributed collection of data grouped into named columns.

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.

important

A DataFrame should not be created directly via its constructor; use SparkSession methods such as createDataFrame instead.

Supports Spark Connect

Properties

sparkSession: Returns the SparkSession that created this DataFrame.
rdd: Returns the content as an RDD of Row (Classic mode only).
na: Returns a DataFrameNaFunctions for handling missing values.
stat: Returns a DataFrameStatFunctions for statistic functions.
write: Interface for saving the content of the non-streaming DataFrame to external storage.
writeStream: Interface for saving the content of the streaming DataFrame to external storage.
schema: Returns the schema of this DataFrame as a StructType.
dtypes: Returns all column names and their data types as a list.
columns: Returns the names of all columns in the DataFrame as a list.
storageLevel: Gets the DataFrame's current storage level.
isStreaming: Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
executionInfo: Returns an ExecutionInfo object after the query has been executed.
plot: Returns a PySparkPlotAccessor for plotting functions.

Methods

Data viewing and inspection

toJSON(use_unicode): Converts the DataFrame into an RDD of strings, with each row serialized as a JSON document.
printSchema(level): Prints out the schema in tree format.
explain(extended, mode): Prints the (logical and physical) plans to the console for debugging.
show(n, truncate, vertical): Prints the first n rows of the DataFrame to the console.
collect(): Returns all the records in the DataFrame as a list of Row.
toLocalIterator(prefetchPartitions): Returns an iterator over all of the rows in this DataFrame.
take(num): Returns the first num rows as a list of Row.
tail(num): Returns the last num rows as a list of Row.
head(n): Returns the first n rows.
first(): Returns the first row as a Row.
count(): Returns the number of rows in this DataFrame.
isEmpty(): Returns True if the DataFrame is empty.
describe(*cols): Computes basic statistics for numeric and string columns.
summary(*statistics): Computes the specified statistics for numeric and string columns.

Temporary views

createTempView(name): Creates a local temporary view with this DataFrame.
createOrReplaceTempView(name): Creates or replaces a local temporary view with this DataFrame.
createGlobalTempView(name): Creates a global temporary view with this DataFrame.
createOrReplaceGlobalTempView(name): Creates or replaces a global temporary view using the given name.

Selection and projection

select(*cols): Projects a set of expressions and returns a new DataFrame.
selectExpr(*expr): Projects a set of SQL expressions and returns a new DataFrame.
filter(condition): Filters rows using the given condition.
where(condition): Alias for filter.
drop(*cols): Returns a new DataFrame without the specified columns.
toDF(*cols): Returns a new DataFrame with the specified new column names.
withColumn(colName, col): Returns a new DataFrame by adding a column or replacing the existing column of the same name.
withColumns(*colsMap): Returns a new DataFrame by adding multiple columns or replacing existing columns of the same names.
withColumnRenamed(existing, new): Returns a new DataFrame by renaming an existing column.
withColumnsRenamed(colsMap): Returns a new DataFrame by renaming multiple columns.
withMetadata(columnName, metadata): Returns a new DataFrame by updating an existing column with metadata.
metadataColumn(colName): Selects a metadata column by its logical column name and returns it as a Column.
colRegex(colName): Selects columns whose names match the given regex and returns them as a Column.

Sorting and ordering

sort(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s).
orderBy(*cols, **kwargs): Alias for sort.
sortWithinPartitions(*cols, **kwargs): Returns a new DataFrame with each partition sorted by the specified column(s).

Aggregation and grouping

groupBy(*cols): Groups the DataFrame by the specified columns so that aggregations can be performed on them.
rollup(*cols): Creates a multi-dimensional rollup for the current DataFrame using the specified columns.
cube(*cols): Creates a multi-dimensional cube for the current DataFrame using the specified columns.
groupingSets(groupingSets, *cols): Creates a multi-dimensional aggregation for the current DataFrame using the specified grouping sets.
agg(*exprs): Aggregates on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
observe(observation, *exprs): Defines (named) metrics to observe on the DataFrame.

Joins

join(other, on, how): Joins with another DataFrame, using the given join expression.
crossJoin(other): Returns the Cartesian product with another DataFrame.
lateralJoin(other, on, how): Lateral joins with another DataFrame, using the given join expression.

Set operations

union(other): Returns a new DataFrame containing the union of rows in this and another DataFrame.
unionByName(other, allowMissingColumns): Returns a new DataFrame containing the union of rows in this and another DataFrame, resolving columns by name.
intersect(other): Returns a new DataFrame containing only rows found in both this DataFrame and another DataFrame.
intersectAll(other): Returns a new DataFrame containing rows in both this DataFrame and another DataFrame, preserving duplicates.
subtract(other): Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame.
exceptAll(other): Returns a new DataFrame containing rows in this DataFrame but not in another DataFrame, preserving duplicates.

Deduplication

distinct(): Returns a new DataFrame containing the distinct rows in this DataFrame.
dropDuplicates(subset): Returns a new DataFrame with duplicate rows removed, optionally considering only certain columns.
dropDuplicatesWithinWatermark(subset): Returns a new DataFrame with duplicate rows removed within the watermark, optionally considering only certain columns.

Sampling and splitting

sample(withReplacement, fraction, seed): Returns a sampled subset of this DataFrame.
sampleBy(col, fractions, seed): Returns a stratified sample without replacement, based on the fraction given for each stratum.
randomSplit(weights, seed): Randomly splits this DataFrame with the provided weights.

Partitioning

coalesce(numPartitions): Returns a new DataFrame that has exactly numPartitions partitions.
repartition(numPartitions, *cols): Returns a new DataFrame partitioned by the given partitioning expressions.
repartitionByRange(numPartitions, *cols): Returns a new DataFrame range-partitioned by the given partitioning expressions.
repartitionById(numPartitions, partitionIdCol): Returns a new DataFrame partitioned by the given partition ID expression.

Reshaping

unpivot(ids, values, variableColumnName, valueColumnName): Unpivots a DataFrame from wide format to long format.
melt(ids, values, variableColumnName, valueColumnName): Alias for unpivot.
transpose(indexColumn): Transposes the DataFrame so that the values in the specified index column become the new columns.

Missing data handling

dropna(how, thresh, subset): Returns a new DataFrame omitting rows with null or NaN values.
fillna(value, subset): Returns a new DataFrame whose null values are filled with the specified value.
replace(to_replace, value, subset): Returns a new DataFrame with a value replaced by another value.

Statistical functions

approxQuantile(col, probabilities, relativeError): Calculates approximate quantiles of numerical columns of the DataFrame.
corr(col1, col2, method): Calculates the correlation of two columns of the DataFrame as a double value.
cov(col1, col2): Calculates the sample covariance for the given columns, specified by their names.
crosstab(col1, col2): Computes a pair-wise frequency table of the given columns.
freqItems(cols, support): Finds frequent items for columns, possibly with false positives.

Schema operations

to(schema): Returns a new DataFrame where each row is reconciled to match the specified schema.
alias(alias): Returns a new DataFrame with an alias set.

Iteration

foreach(f): Applies the function f to each Row of this DataFrame.
foreachPartition(f): Applies the function f to each partition of this DataFrame.

Caching and persistence

cache(): Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER).
persist(storageLevel): Sets the storage level to persist the contents of the DataFrame across operations.
unpersist(blocking): Marks the DataFrame as non-persistent and removes all blocks for it from memory and disk.

Checkpointing

checkpoint(eager): Returns a checkpointed version of this DataFrame.
localCheckpoint(eager, storageLevel): Returns a locally checkpointed version of this DataFrame.

Streaming operations

withWatermark(eventTime, delayThreshold): Defines an event-time watermark for this DataFrame.

Optimization hints

hint(name, *parameters): Specifies a hint on the current DataFrame.

Limits and offsets

limit(num): Limits the result count to the specified number.
offset(num): Returns a new DataFrame that skips the first num rows.

Advanced transformations

transform(func, *args, **kwargs): Returns a new DataFrame; concise syntax for chaining custom transformations.

Conversion methods

toPandas(): Returns the contents of this DataFrame as a pandas DataFrame (pandas.DataFrame).
toArrow(): Returns the contents of this DataFrame as a PyArrow Table (pyarrow.Table).
pandas_api(index_col): Converts the DataFrame into a pandas-on-Spark DataFrame.
mapInPandas(func, schema, barrier, profile): Maps an iterator of batches in the current DataFrame using a native Python function.
mapInArrow(func, schema, barrier, profile): Maps an iterator of batches in the current DataFrame using a native Python function that operates on pyarrow.RecordBatch.

Writing data

writeTo(table): Creates a write configuration builder for v2 sources.
mergeInto(table, condition): Merges a set of updates, insertions, and deletions from a source table into a target table.

DataFrame comparison

sameSemantics(other): Returns True when the logical query plans inside both DataFrames are equal.
semanticHash(): Returns a hash code of this DataFrame's logical query plan.

Metadata and file information

inputFiles(): Returns a best-effort snapshot of the files that compose this DataFrame.

Advanced SQL features

isLocal(): Returns True if the collect and take methods can be run locally.
asTable(): Converts the DataFrame into a TableArg object, which can be used as a table argument in a TVF.
scalar(): Returns a Column object for a SCALAR subquery containing exactly one row and one column.
exists(): Returns a Column object for an EXISTS subquery.

Examples

Basic DataFrame operations

Python
# Create a DataFrame
people = spark.createDataFrame([
    {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50},
    {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100},
    {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150},
    {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200}
])

# Select columns
people.select("name", "age").show()

# Filter rows
people.filter(people.age > 30).show()

# Add a new column
people.withColumn("age_plus_10", people.age + 10).show()

Aggregation and grouping

Python
# Group by and aggregate
people.groupBy("gender").agg({"salary": "avg", "age": "max"}).show()

# Multiple aggregations
from pyspark.sql import functions as F
people.groupBy("deptId").agg(
    F.avg("salary").alias("avg_salary"),
    F.max("age").alias("max_age")
).show()

Joins

Python
# Create another DataFrame
department = spark.createDataFrame([
    {"id": 1, "name": "PySpark"},
    {"id": 2, "name": "ML"},
    {"id": 3, "name": "Spark SQL"}
])

# Join DataFrames
people.join(department, people.deptId == department.id).show()

Complex transformations

Python
# Chained operations
result = (
    people.filter(people.age > 30)
    .join(department, people.deptId == department.id)
    .groupBy(department.name, "gender")
    .agg({"salary": "avg", "age": "max"})
    .sort("max(age)")
)
result.show()