PySpark reference
This page provides an overview of the reference documentation available for PySpark, the Python API for Spark. For more information about PySpark, see PySpark on Databricks.
| Reference | Description |
|---|---|
| Core Classes | Main classes for working with PySpark SQL, including SparkSession and DataFrame fundamentals. |
| Spark Session | The entry point for reading data and executing SQL queries in PySpark applications. |
| Configuration | Runtime configuration options for Spark SQL, including execution and optimizer settings. For information on configuration that is only available on Databricks, see Set Spark configuration properties on Databricks. |
| DataFrame | A distributed collection of data organized into named columns, similar to a table in a relational database. See the DataFrame sketch after this table. |
| Input/Output | Methods for reading data from and writing data to various file formats and data sources. |
| Column | Operations for working with DataFrame columns, including transformations and expressions. |
| Data Types | Available data types in PySpark SQL, including primitive types, complex types, and user-defined types. |
| Row | Represents a row of data in a DataFrame, providing access to individual field values. |
| Functions | Built-in functions for data manipulation, transformation, and aggregation operations. |
| Window | Window functions for performing calculations across a set of table rows related to the current row. See the window sketch after this table. |
| Grouping | Methods for grouping data and performing aggregation operations on grouped DataFrames. |
| Catalog | Interface for managing databases, tables, functions, and other catalog metadata. |
| Avro | Support for reading and writing data in Apache Avro format. |
| Observation | Collects metrics and observes DataFrames during query execution for monitoring and debugging. |
| UDF | User-defined functions for applying custom Python logic to DataFrame columns. See the UDF sketch after this table. |
| UDTF | User-defined table functions that return multiple rows for each input row. |
| VariantVal | Handles semi-structured data with flexible schemas, supporting dynamic types and nested structures. |
| Protobuf | Support for serializing and deserializing data using the Protocol Buffers format. |
| Python Data Source | APIs for implementing custom data sources to read from external systems. For information about custom data sources, see PySpark custom data sources. |
| Stateful Processor | Manages state across streaming batches for complex stateful operations in Structured Streaming. |
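
To ground several of the entries above (Spark Session, DataFrame, Column, and Functions), the following is a minimal sketch. The application name, sample data, and column names are illustrative assumptions, not part of the reference itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession is the entry point to PySpark SQL.
spark = SparkSession.builder.appName("reference-demo").getOrCreate()

# Build a small DataFrame from local data; column names come from the schema list.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# Column expressions and built-in functions compose into transformations.
result = (
    df.filter(F.col("age") > 30)
      .withColumn("name_upper", F.upper(F.col("name")))
)
result.show()
```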
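The Window and Grouping entries both describe aggregation, but with different output shapes. This sketch, using hypothetical sales data, contrasts a grouped aggregate (one row per group) with a windowed running total (one row per input row).

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("east", "2024-01", 100), ("east", "2024-02", 150), ("west", "2024-01", 90)],
    ["region", "month", "amount"],
)

# Grouped aggregation collapses the rows: one output row per region.
totals = sales.groupBy("region").agg(F.sum("amount").alias("total"))

# A window function keeps every row, adding a running total per region.
w = Window.partitionBy("region").orderBy("month")
running = sales.withColumn("running_total", F.sum("amount").over(w))

totals.show()
running.show()
```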
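For the UDF entry, here is a short sketch of a Python UDF; the `initials` function and its sample data are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# A Python UDF applies arbitrary Python logic to each value in a column.
# `initials` is a hypothetical example, not a built-in function.
@F.udf(returnType=StringType())
def initials(name: str) -> str:
    return "".join(part[0].upper() for part in name.split())

people = spark.createDataFrame([("ada lovelace",), ("grace hopper",)], ["name"])
people.withColumn("initials", initials("name")).show()
```

Note that built-in functions are generally preferred over Python UDFs where possible, since UDFs require serializing rows between the JVM and a Python worker.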