DataFrameReader class

Interface used to load a DataFrame from external storage systems (for example, file systems and key-value stores).

Supports Spark Connect

Syntax

Use `SparkSession.read` to access this interface.

Methods

Method	Description
`format(source)`	Specifies the input data source format.
`schema(schema)`	Specifies the input schema.
`option(key, value)`	Adds a single input option for the underlying data source.
`options(**options)`	Adds multiple input options for the underlying data source.
`load(path, format, schema, **options)`	Loads data from a data source and returns it as a DataFrame.
`json(path, schema, ...)`	Loads JSON files and returns the result as a DataFrame.
`table(tableName)`	Returns the specified table as a DataFrame.
`parquet(*paths, **options)`	Loads Parquet files and returns the result as a DataFrame.
`text(paths, wholetext, lineSep, ...)`	Loads text files and returns a DataFrame whose schema starts with a string column named "value".
`csv(path, schema, sep, encoding, ...)`	Loads a CSV file and returns the result as a DataFrame.
`xml(path, rowTag, schema, ...)`	Loads an XML file and returns the result as a DataFrame.
`excel(path, dataAddress, headerRows, ...)`	Loads an Excel file and returns the result as a DataFrame.
`orc(path, mergeSchema, pathGlobFilter, ...)`	Loads ORC files and returns the result as a DataFrame.
`jdbc(url, table, column, lowerBound, upperBound, numPartitions, predicates, properties)`	Constructs a DataFrame representing the database table named `table`, accessible via the JDBC URL `url` and connection properties.
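
The table lists both `option(key, value)` and `options(**options)`. As a minimal sketch of the difference, `options` accepts several settings at once, which is convenient when the settings live in a dictionary. The helper name `read_csv_with_options` and the `CSV_OPTIONS` dictionary are hypothetical and exist only for this illustration. `header`, `inferSchema`, and `sep` are standard CSV reader options.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Mapping

if TYPE_CHECKING:
    # Imported for type hints only; the helper works with any active SparkSession.
    from pyspark.sql import DataFrame, SparkSession

# Hypothetical option set used for illustration.
CSV_OPTIONS: dict[str, str] = {"header": "true", "inferSchema": "true", "sep": ","}


def read_csv_with_options(
    spark: SparkSession,
    path: str,
    reader_options: Mapping[str, Any] = CSV_OPTIONS,
) -> DataFrame:
    """Read a CSV file, passing all reader options in a single options() call."""
    # Equivalent to chaining .option(key, value) once per dictionary entry.
    return spark.read.options(**reader_options).csv(path)
```

With an active session, `read_csv_with_options(spark, "path/to/file.csv")` should behave like three chained `option` calls followed by `csv`.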

Examples

Read from different data sources

Python
# Access DataFrameReader through SparkSession
spark.read

# Read JSON file
df = spark.read.json("path/to/file.json")

# Read CSV file with options
df = spark.read.option("header", "true").csv("path/to/file.csv")

# Read Parquet file
df = spark.read.parquet("path/to/file.parquet")

# Read from a table
df = spark.read.table("table_name")

Use format and load

Python
# Specify format explicitly
df = spark.read.format("json").load("path/to/file.json")

# With options
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("path/to/file.csv")
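
Because `format(source)` takes the source name as a plain string, the format can be chosen at runtime. The following is a sketch under stated assumptions: the helpers `format_for_path` and `load_by_extension` and the `EXTENSION_TO_FORMAT` mapping are hypothetical, not part of the DataFrameReader API, and the mapping only covers a few built-in source names.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Imported for type hints only.
    from pyspark.sql import DataFrame, SparkSession

# Hypothetical mapping from file extension to a built-in source name.
EXTENSION_TO_FORMAT = {
    ".json": "json",
    ".csv": "csv",
    ".parquet": "parquet",
    ".orc": "orc",
    ".txt": "text",
}


def format_for_path(path: str) -> str:
    """Return the source name for a path based on its file extension."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return EXTENSION_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"No reader format known for {path!r}") from None


def load_by_extension(spark: SparkSession, path: str, **options: Any) -> DataFrame:
    """Pick the format from the extension, apply options, and load the path."""
    return spark.read.format(format_for_path(path)).options(**options).load(path)
```

For example, `load_by_extension(spark, "path/to/file.csv", header="true")` would resolve to the `csv` source before calling `load`.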

Specify a schema

Python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Read CSV with schema
df = spark.read.schema(schema).csv("path/to/file.csv")

# Read CSV with DDL-formatted string schema
df = spark.read.schema("name STRING, age INT").csv("path/to/file.csv")

Read from JDBC

Python
# Read from database table
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    properties={"user": "myuser", "password": "mypassword"}
)

# Read with partitioning for parallel loading
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="users",
    column="id",
    lowerBound=1,
    upperBound=1000,
    numPartitions=10,
    properties={"user": "myuser", "password": "mypassword"}
)
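
In the partitioned read above, `lowerBound` and `upperBound` do not filter rows. They only determine the partition stride, so rows outside the range still land in the first or last partition. The pure-Python sketch below approximates how the per-partition ranges can be derived. It is a simplified illustration, not Spark's actual implementation, which may differ between versions. The function name `jdbc_partition_predicates` is hypothetical, and the sketch assumes non-negative integer bounds and at least two partitions.

```python
def jdbc_partition_predicates(
    column: str, lower_bound: int, upper_bound: int, num_partitions: int
) -> list[str]:
    """Approximate the WHERE clause used for each partition of a JDBC read."""
    if num_partitions < 2:
        raise ValueError("this sketch covers only partitioned reads (num_partitions >= 2)")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")

    # Width of each partition's value range (integer arithmetic, as an assumption).
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None

        if lower is None:
            # First partition: open-ended below, and it also collects NULLs.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            # Last partition: open-ended above.
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


# Ranges for the example above: column="id", bounds 1..1000, 10 partitions.
for predicate in jdbc_partition_predicates("id", 1, 1000, 10):
    print(predicate)
```

Under these assumptions the first range is `id < 101 OR id IS NULL` and the last is `id >= 901`, which shows why skewed or out-of-range values can produce unevenly sized partitions.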

Method chaining

Python
# Chain multiple configuration methods.
# An explicit schema is supplied, so the inferSchema option is not needed.
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .schema("name STRING, age INT") \
    .load("path/to/file.csv")
