createDataFrame

Creates a DataFrame from an RDD, a list, a pandas.DataFrame, a numpy.ndarray, or a pyarrow.Table.

Syntax

createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Parameters

data : RDD or iterable
    An RDD of any kind of SQL data representation (Row, tuple, int, bool, dict, etc.), or a list, pandas.DataFrame, numpy.ndarray, or pyarrow.Table.

schema : DataType, str, or list, optional
    A DataType, a datatype string, or a list of column names. When a list of column names is given, the type of each column is inferred from data. When None, the schema is inferred from data, which must then consist of Row, namedtuple, or dict values. When a DataType or datatype string is given, it must match the actual data.

samplingRatio : float, optional
    The fraction of rows sampled for schema inference when data is an RDD. If None, only the first few rows are used.

verifySchema : bool, optional
    Verify the data types of every row against the schema; enabled by default. Not supported with pyarrow.Table input or Arrow-enabled pandas conversion.
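When schema is None, inference needs row objects that carry their own field names. A minimal stdlib sketch of the shapes that qualify (the names here are illustrative, not part of the API):

```python
from collections import namedtuple

# A namedtuple carries field names that can serve as column names.
Person = namedtuple("Person", ["name", "age"])
rows = [Person("Alice", 1), Person("Bob", 2)]
print(rows[0]._fields)  # the field names available for inference

# A dict carries column names as its keys.
row = {"name": "Alice", "age": 1}
print(sorted(row.keys()))
```

Plain tuples, by contrast, carry no names, which is why Spark falls back to positional column names such as _1 and _2.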

Returns

DataFrame

Notes

Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
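Enabling that configuration routes pandas.DataFrame conversion through Arrow. A minimal configuration sketch, assuming a running SparkSession named `spark` and pandas installed:

```python
import pandas as pd

# Experimental: convert pandas input via Arrow instead of row-by-row.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = pd.DataFrame({"name": ["Alice"], "age": [1]})
df = spark.createDataFrame(pdf)  # uses the Arrow path when the config is set
```

Note that verifySchema is not honored on this path, as described under Parameters.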

Examples

Python
# Create a DataFrame from a list of tuples.
spark.createDataFrame([('Alice', 1)]).show()
# +-----+---+
# |   _1| _2|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame from a list of dictionaries.
spark.createDataFrame([{'name': 'Alice', 'age': 1}]).show()
# +---+-----+
# |age| name|
# +---+-----+
# |  1|Alice|
# +---+-----+

# Create a DataFrame with column names specified.
spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame with an explicit schema.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
spark.createDataFrame([('Alice', 1)], schema).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame with a DDL-formatted schema string.
spark.createDataFrame([('Alice', 1)], "name: string, age: int").show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create an empty DataFrame (schema is required when data is empty).
spark.createDataFrame([], "name: string, age: int").show()
# +----+---+
# |name|age|
# +----+---+
# +----+---+

# Create a DataFrame from Row objects.
from pyspark.sql import Row
Person = Row('name', 'age')
spark.createDataFrame([Person("Alice", 1)]).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+