createDataFrame

Creates a DataFrame from an RDD, a list, a pandas.DataFrame, a numpy.ndarray, or a pyarrow.Table.

Syntax

createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Parameters

data : RDD or iterable
    An RDD of any kind of SQL data representation (Row, tuple, int, bool, dict, etc.), or a list, pandas.DataFrame, numpy.ndarray, or pyarrow.Table.

schema : DataType, str, or list, optional
    A DataType, a datatype string, or a list of column names. When a list of column names is given, the type of each column is inferred from data. When None, the schema is inferred from data, which must then consist of Row, namedtuple, or dict values. When a DataType or datatype string is given, it must match the actual data.

samplingRatio : float, optional
    The fraction of rows sampled for schema inference when data is an RDD. If None, only the first few rows are used.

verifySchema : bool, optional
    Verify the data types of every row against the schema; enabled by default. Not supported with pyarrow.Table input or Arrow-enabled pandas conversion.
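When schema is None, inference needs row objects that carry their own field names. A minimal stdlib sketch of the shapes that qualify (the names here are illustrative, not part of the API):

```python
from collections import namedtuple

# A namedtuple carries field names that can serve as column names.
Person = namedtuple("Person", ["name", "age"])
rows = [Person("Alice", 1), Person("Bob", 2)]
print(rows[0]._fields)  # the field names available for inference

# A dict carries column names as its keys.
row = {"name": "Alice", "age": 1}
print(sorted(row.keys()))
```

Plain tuples, by contrast, carry no names, which is why Spark falls back to positional column names such as _1 and _2.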

Returns

DataFrame

Notes

Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
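Enabling that configuration routes pandas.DataFrame conversion through Arrow. A minimal configuration sketch, assuming a running SparkSession named `spark` and pandas installed:

```python
import pandas as pd

# Experimental: convert pandas input via Arrow instead of row-by-row.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = pd.DataFrame({"name": ["Alice"], "age": [1]})
df = spark.createDataFrame(pdf)  # uses the Arrow path when the config is set
```

Note that verifySchema is not honored on this path, as described under Parameters.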

Examples

Python
# Create a DataFrame from a list of tuples.
spark.createDataFrame([('Alice', 1)]).show()
# +-----+---+
# |   _1| _2|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame from a list of dictionaries.
spark.createDataFrame([{'name': 'Alice', 'age': 1}]).show()
# +---+-----+
# |age| name|
# +---+-----+
# |  1|Alice|
# +---+-----+

# Create a DataFrame with column names specified.
spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame with an explicit schema.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
spark.createDataFrame([('Alice', 1)], schema).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create a DataFrame with a DDL-formatted schema string.
spark.createDataFrame([('Alice', 1)], "name: string, age: int").show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+

# Create an empty DataFrame (schema is required when data is empty).
spark.createDataFrame([], "name: string, age: int").show()
# +----+---+
# |name|age|
# +----+---+
# +----+---+

# Create a DataFrame from Row objects.
from pyspark.sql import Row
Person = Row('name', 'age')
spark.createDataFrame([Person("Alice", 1)]).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice|  1|
# +-----+---+