# DataFrameReader class

Interface used to load a DataFrame from external storage systems (e.g. file systems, key-value stores, etc.).

Supports Spark Connect.

## Syntax

Use `SparkSession.read` to access this interface.
## Methods

| Method | Description |
|---|---|
| `format(source)` | Specifies the input data source format. |
| `schema(schema)` | Specifies the input schema. |
| `option(key, value)` | Adds an input option for the underlying data source. |
| `options(**options)` | Adds input options for the underlying data source. |
| `load([path, format, schema])` | Loads data from a data source and returns it as a DataFrame. |
| `json(path)` | Loads JSON files and returns the results as a DataFrame. |
| `table(tableName)` | Returns the specified table as a DataFrame. |
| `parquet(*paths)` | Loads Parquet files, returning the result as a DataFrame. |
| `text(paths)` | Loads text files and returns a DataFrame whose schema starts with a string column named "value". |
| `csv(path)` | Loads a CSV file and returns the result as a DataFrame. |
| `xml(path)` | Loads an XML file and returns the result as a DataFrame. |
|  | Loads Excel files, returning the result as a DataFrame. |
| `orc(path)` | Loads ORC files, returning the result as a DataFrame. |
| `jdbc(url, table[, ...])` | Constructs a DataFrame representing the database table named `table`, accessible via JDBC URL `url` and connection properties. |
## Examples

### Reading from different data sources

```python
# Access DataFrameReader through SparkSession
spark.read

# Read a JSON file
df = spark.read.json("path/to/file.json")

# Read a CSV file with options
df = spark.read.option("header", "true").csv("path/to/file.csv")

# Read a Parquet file
df = spark.read.parquet("path/to/file.parquet")

# Read from a table
df = spark.read.table("table_name")
```
### Using format and load

```python
# Specify the format explicitly
df = spark.read.format("json").load("path/to/file.json")

# With options
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("path/to/file.csv")
```
### Specifying a schema

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define a schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Read a CSV file with the schema
df = spark.read.schema(schema).csv("path/to/file.csv")

# Read a CSV file with a DDL-formatted string schema
df = spark.read.schema("name STRING, age INT").csv("path/to/file.csv")
```
### Reading from JDBC

```python
# Read a database table
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    properties={"user": "myuser", "password": "mypassword"}
)

# Read with partitioning for parallel loading
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    column="id",
    lowerBound=1,
    upperBound=1000,
    numPartitions=10,
    properties={"user": "myuser", "password": "mypassword"}
)
```
### Method chaining

```python
# Chain multiple configuration methods
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .schema("name STRING, age INT") \
    .load("path/to/file.csv")
```