DataFrameReader class

Interface used to load a DataFrame from external storage systems (e.g., file systems, key-value stores).

Supports Spark Connect

Syntax

Use SparkSession.read to access this interface.

Methods

format(source)
    Specifies the input data source format.

schema(schema)
    Specifies the input schema.

option(key, value)
    Adds an input option for the underlying data source.

options(**options)
    Adds input options for the underlying data source.

load(path, format, schema, **options)
    Loads data from a data source and returns it as a DataFrame.

json(path, schema, ...)
    Loads JSON files and returns the results as a DataFrame.

table(tableName)
    Returns the specified table as a DataFrame.

parquet(*paths, **options)
    Loads Parquet files, returning the result as a DataFrame.

text(paths, wholetext, lineSep, ...)
    Loads text files and returns a DataFrame whose schema starts with a string column named "value".

csv(path, schema, sep, encoding, ...)
    Loads a CSV file and returns the result as a DataFrame.

xml(path, rowTag, schema, ...)
    Loads an XML file and returns the result as a DataFrame.

excel(path, dataAddress, headerRows, ...)
    Loads Excel files, returning the result as a DataFrame.

orc(path, mergeSchema, pathGlobFilter, ...)
    Loads ORC files, returning the result as a DataFrame.

jdbc(url, table, column, lowerBound, upperBound, numPartitions, predicates, properties)
    Constructs a DataFrame representing the database table named table, accessible via the given JDBC URL and connection properties.

Examples

Reading from different data sources

Python
# Access DataFrameReader through SparkSession
spark.read

# Read JSON file
df = spark.read.json("path/to/file.json")

# Read CSV file with options
df = spark.read.option("header", "true").csv("path/to/file.csv")

# Read Parquet file
df = spark.read.parquet("path/to/file.parquet")

# Read from a table
df = spark.read.table("table_name")

Using format and load

Python
# Specify format explicitly
df = spark.read.format("json").load("path/to/file.json")

# With options
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("path/to/file.csv")

Specifying schema

Python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Read CSV with schema
df = spark.read.schema(schema).csv("path/to/file.csv")

# Read CSV with DDL-formatted string schema
df = spark.read.schema("name STRING, age INT").csv("path/to/file.csv")

Reading from JDBC

Python
# Read from database table
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    properties={"user": "myuser", "password": "mypassword"}
)

# Read with partitioning for parallel loading: the [lowerBound, upperBound]
# range of `column` is split into numPartitions stride ranges, and each
# partition issues its own query against the database
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    column="id",
    lowerBound=1,
    upperBound=1000,
    numPartitions=10,
    properties={"user": "myuser", "password": "mypassword"}
)

Method chaining

Python
# Chain multiple configuration methods
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("delimiter", ",") \
    .schema("name STRING, age INT") \
    .load("path/to/file.csv")