PySpark reference

This page provides an overview of the reference documentation available for PySpark, the Python API for Apache Spark. For more information about PySpark, see PySpark on Databricks.

Data types

For a complete list of PySpark data types, see PySpark data types.

Classes

Reference

Description

Avro

Support for reading and writing data in Apache Avro format.

Catalog

Interface for managing databases, tables, functions, and other catalog metadata.

Column

Operations for working with DataFrame columns, including transformations and expressions.

Data Types

Available data types in PySpark SQL, including primitive types, complex types, and user-defined types.

DataFrame

Distributed collection of data organized into named columns, similar to a table in a relational database.

DataFrameNaFunctions

Functionality for working with missing data in a DataFrame.

DataFrameReader

Interface used to load a DataFrame from external storage systems.

DataFrameStatFunctions

Statistical functions for a DataFrame.

DataFrameWriter

Interface used to write a DataFrame to external storage systems.

DataFrameWriterV2

Interface used to write a DataFrame to external storage using the v2 API.

DataSource

APIs for implementing custom data sources that read from and write to external systems. For information about custom data sources, see PySpark custom data sources.

DataSourceArrowWriter

A base class for data source writers that process data using PyArrow's RecordBatch.

DataSourceRegistration

A wrapper for data source registration.

DataSourceReader

A base class for data source readers.

DataSourceStreamArrowWriter

A base class for data stream writers that process data using PyArrow's RecordBatch.

DataSourceStreamReader

A base class for streaming data source readers.

DataSourceStreamWriter

A base class for data stream writers.

GroupedData

Methods for grouping data and performing aggregation operations on grouped DataFrames.

Observation

Collects metrics and observes DataFrames during query execution for monitoring and debugging.

PlotAccessor

Accessor for DataFrame plotting functionality in PySpark.

ProtoBuf

Support for serializing and deserializing data using Protocol Buffers format.

Row

Represents a row of data in a DataFrame, providing access to individual field values.

RuntimeConfig

Runtime configuration options for Spark SQL, including execution and optimizer settings.

For information about configuration properties that are available only on Databricks, see Set Spark configuration properties on Databricks.

SparkSession

The entry point for reading data and executing SQL queries in PySpark applications.

Stateful Processor

Manages state across streaming batches for complex stateful operations in Structured Streaming.

UserDefinedFunction (UDF)

User-defined functions for applying custom Python logic to DataFrame columns.

UDFRegistration

Wrapper for user-defined function registration. Access the instance through spark.udf.

UserDefinedTableFunction (UDTF)

User-defined table functions that return multiple rows for each input row.

UDTFRegistration

Wrapper for user-defined table function registration. Access the instance through spark.udtf.

VariantVal

Represents semi-structured data with flexible schema, which supports dynamic types and nested structures.

Window

Window functions for performing calculations across a set of table rows related to the current row.

WindowSpec

Window specification that defines the partitioning, ordering, and frame boundaries used by window functions.

Functions

For a complete list of available built-in functions, see PySpark functions.