SparkSession

The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.

Syntax

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Properties

| Property | Description |
| --- | --- |
| `version` | The version of Spark on which this application is running. |
| `conf` | Runtime configuration interface for Spark. |
| `catalog` | Interface through which the user may create, drop, alter, or query underlying databases, tables, functions, etc. |
| `udf` | Returns a `UDFRegistration` for UDF registration. |
| `udtf` | Returns a `UDTFRegistration` for UDTF registration. |
| `dataSource` | Returns a `DataSourceRegistration` for data source registration. |
| `profile` | Returns a `Profile` for performance/memory profiling. |
| `sparkContext` | Returns the underlying `SparkContext`. Classic mode only. |
| `read` | Returns a `DataFrameReader` that can be used to read data in as a DataFrame. |
| `readStream` | Returns a `DataStreamReader` that can be used to read data streams as a streaming DataFrame. |
| `streams` | Returns a `StreamingQueryManager` that allows managing all active streaming queries. |
| `tvf` | Returns a `TableValuedFunction` for calling table-valued functions (TVFs). |

Methods

| Method | Description |
| --- | --- |
| `createDataFrame(data, schema, samplingRatio, verifySchema)` | Creates a DataFrame from an RDD, a list, a pandas DataFrame, a numpy ndarray, or a pyarrow Table. |
| `sql(sqlQuery, args, **kwargs)` | Returns a DataFrame representing the result of the given query. |
| `table(tableName)` | Returns the specified table as a DataFrame. |
| `range(start, end, step, numPartitions)` | Creates a DataFrame with a single `LongType` column named `id`, containing elements in a range. |
| `newSession()` | Returns a new SparkSession with separate SQLConf, registered temporary views, and UDFs, but a shared SparkContext and table cache. Classic mode only. |
| `getActiveSession()` | Returns the active SparkSession for the current thread. |
| `active()` | Returns the active or default SparkSession for the current thread. |
| `stop()` | Stops the underlying SparkContext. |
| `addArtifacts(*path, pyfile, archive, file)` | Adds artifact(s) to the client session. |
| `interruptAll()` | Interrupts all operations of this session currently running on the server. |
| `interruptTag(tag)` | Interrupts all operations of this session with the given tag. |
| `interruptOperation(op_id)` | Interrupts an operation of this session with the given operation ID. |
| `addTag(tag)` | Adds a tag to be assigned to all operations started by this thread in this session. |
| `removeTag(tag)` | Removes a tag previously added for operations started by this thread. |
| `getTags()` | Gets the tags currently set to be assigned to all operations started by this thread. |
| `clearTags()` | Clears the current thread's operation tags. |

Builder

| Method | Description |
| --- | --- |
| `config(key, value)` | Sets a config option. Options are automatically propagated to both SparkConf and SparkSession's own configuration. |
| `master(master)` | Sets the Spark master URL to connect to. |
| `remote(url)` | Sets the Spark remote URL to connect to via Spark Connect. |
| `appName(name)` | Sets a name for the application, which will be shown in the Spark web UI. |
| `enableHiveSupport()` | Enables Hive support, including connectivity to a persistent Hive metastore. |
| `getOrCreate()` | Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder. |
| `create()` | Creates a new SparkSession. |

Examples

Python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Word Count")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
Python
spark.sql("SELECT * FROM range(10) where id > 7").show()
Output
+---+
| id|
+---+
| 8|
| 9|
+---+
Python
spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
Output
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
Python
spark.range(1, 7, 2).show()
Output
+---+
| id|
+---+
| 1|
| 3|
| 5|
+---+