SparkSession

The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files.

Syntax

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Properties

| Property | Description |
| --- | --- |
| `version` | The version of Spark on which this application is running. |
| `conf` | Runtime configuration interface for Spark. |
| `catalog` | Interface through which the user may create, drop, alter, or query underlying databases, tables, functions, etc. |
| `udf` | Returns a `UDFRegistration` for UDF registration. |
| `udtf` | Returns a `UDTFRegistration` for UDTF registration. |
| `dataSource` | Returns a `DataSourceRegistration` for data source registration. |
| `profile` | Returns a `Profile` for performance/memory profiling. |
| `sparkContext` | Returns the underlying `SparkContext`. Classic mode only. |
| `read` | Returns a `DataFrameReader` that can be used to read data in as a DataFrame. |
| `readStream` | Returns a `DataStreamReader` that can be used to read data streams as a streaming DataFrame. |
| `streams` | Returns a `StreamingQueryManager` that allows managing all active streaming queries. |
| `tvf` | Returns a `TableValuedFunction` for calling table-valued functions (TVFs). |

Methods

| Method | Description |
| --- | --- |
| `createDataFrame(data, schema, samplingRatio, verifySchema)` | Creates a DataFrame from an RDD, a list, a pandas DataFrame, a numpy ndarray, or a pyarrow Table. |
| `sql(sqlQuery, args, **kwargs)` | Returns a DataFrame representing the result of the given query. |
| `table(tableName)` | Returns the specified table as a DataFrame. |
| `range(start, end, step, numPartitions)` | Creates a DataFrame with a single `LongType` column named `id`, containing elements in a range. |
| `newSession()` | Returns a new SparkSession with separate SQLConf, registered temporary views, and UDFs, but a shared SparkContext and table cache. Classic mode only. |
| `getActiveSession()` | Returns the active SparkSession for the current thread. |
| `active()` | Returns the active or default SparkSession for the current thread. |
| `stop()` | Stops the underlying SparkContext. |
| `addArtifacts(*path, pyfile, archive, file)` | Adds artifact(s) to the client session. |
| `interruptAll()` | Interrupts all operations of this session currently running on the server. |
| `interruptTag(tag)` | Interrupts all operations of this session with the given tag. |
| `interruptOperation(op_id)` | Interrupts an operation of this session with the given operation ID. |
| `addTag(tag)` | Adds a tag to be assigned to all operations started by this thread in this session. |
| `removeTag(tag)` | Removes a tag previously added for operations started by this thread. |
| `getTags()` | Gets the tags currently set to be assigned to all operations started by this thread. |
| `clearTags()` | Clears the current thread's operation tags. |

Builder

| Method | Description |
| --- | --- |
| `config(key, value)` | Sets a config option. Options are automatically propagated to both SparkConf and SparkSession's own configuration. |
| `master(master)` | Sets the Spark master URL to connect to. |
| `remote(url)` | Sets the Spark remote URL to connect to via Spark Connect. |
| `appName(name)` | Sets a name for the application, which will be shown in the Spark web UI. |
| `enableHiveSupport()` | Enables Hive support, including connectivity to a persistent Hive metastore. |
| `getOrCreate()` | Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder. |
| `create()` | Creates a new SparkSession. |

Examples

Python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Word Count")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
Python
spark.sql("SELECT * FROM range(10) where id > 7").show()
Output
+---+
| id|
+---+
| 8|
| 9|
+---+
Python
spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
Output
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
Python
spark.range(1, 7, 2).show()
Output
+---+
| id|
+---+
| 1|
| 3|
| 5|
+---+