A SparkSession provides access to the underlying PySpark functionality for programmatically creating a PySpark Resilient Distributed Dataset (RDD) and DataFrame.
In a PySpark application, we can create as many SparkSession objects as we like by calling SparkSession.builder.getOrCreate() or by calling newSession() on an existing session. Multiple session objects are useful when we want to keep PySpark tables (which are relational entities) logically isolated from one another.
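To make the isolation point concrete, here is a minimal sketch (the app name, view name, and local[2] setting are illustrative choices, not from the original) showing that a session created with newSession() shares the same SparkContext but does not see the first session's temporary views:

from pyspark.sql import SparkSession

# First session, created through the builder
spark1 = (
    SparkSession
    .builder
    .appName("IsolationDemo")
    .master("local[2]")
    .getOrCreate()
)

# Second session: same SparkContext, but isolated SQL state (temp views, configs)
spark2 = spark1.newSession()
print(spark1.sparkContext is spark2.sparkContext)  # True: the context is shared

# A temporary view registered in spark1 is not visible from spark2
spark1.range(3).createOrReplaceTempView("numbers")
print([t.name for t in spark1.catalog.listTables()])  # ['numbers']
print([t.name for t in spark2.catalog.listTables()])  # []

spark1.stop()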
Creating a SparkSession
To create a SparkSession in Python, we'll need the following methods:

- builder: the entry point used to construct a SparkSession.
- getOrCreate(): returns the existing SparkSession if there is one; otherwise, it creates a new session.
- appName(): sets the application name.
- master(): sets the master URL (the cluster manager's address when running on a cluster). When running in standalone (local) mode, we use local[x], where x should be an integer greater than 0. It represents the number of partitions created when we work with RDDs, DataFrames, and Datasets. Ideally, x should equal the number of available CPU cores.

Let's look at the code below:
from pyspark.sql import SparkSession
from dotenv import load_dotenv


def create_spark_session():
    """Create a Spark Session"""
    # Load environment variables from a .env file (if present)
    _ = load_dotenv()
    return (
        SparkSession
        .builder
        .appName("SparkApp")    # name shown in the Spark UI
        .master("local[5]")     # run locally with 5 partitions/threads
        .getOrCreate()          # reuse an existing session or create a new one
    )


spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')
In the code above, we use the SparkSession class from the pyspark.sql module to create a PySpark session.
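As a follow-up sketch (the column names and sample values are made up for illustration), the spark object returned above can be used to create both a DataFrame and an RDD, and should be stopped when it is no longer needed:

# Build a small DataFrame from in-memory rows
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()

# Build an RDD through the session's underlying SparkContext
rdd = spark.sparkContext.parallelize(range(10), numSlices=5)
print(rdd.count())  # 10

# Release resources once we're done
spark.stop()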