A SparkSession provides access to the underlying PySpark functionality for programmatically creating a PySpark Resilient Distributed Dataset (RDD) and DataFrame.
In a PySpark application, we can create as many SparkSession objects as we like by calling SparkSession.builder.getOrCreate() or by calling newSession() on an existing session. Multiple session objects are useful when we want to keep PySpark tables (which are relational entities) logically isolated from one another.
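To make the isolation point concrete, here is a minimal sketch (the app name, view name, and local[2] setting are illustrative choices, not from the original) showing that a session created with newSession() shares the same SparkContext but does not see the first session's temporary views:

from pyspark.sql import SparkSession

# First session, created through the builder
spark1 = (
    SparkSession
    .builder
    .appName("IsolationDemo")
    .master("local[2]")
    .getOrCreate()
)

# Second session: same SparkContext, but isolated SQL state (temp views, configs)
spark2 = spark1.newSession()
print(spark1.sparkContext is spark2.sparkContext)  # True: the context is shared

# A temporary view registered in spark1 is not visible from spark2
spark1.range(3).createOrReplaceTempView("numbers")
print([t.name for t in spark1.catalog.listTables()])  # ['numbers']
print([t.name for t in spark2.catalog.listTables()])  # []

spark1.stop()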
Creating a SparkSession
To create a SparkSession in Python, we'll need the following methods:

- builder: the entry point used to construct a SparkSession.
- getOrCreate(): returns the existing SparkSession if there is one; otherwise, it creates a new session.
- appName(): sets the application name.
- master(): sets the master URL (the cluster manager's address when running on a cluster). When running in standalone (local) mode, we use local[x], where x should be an integer greater than 0. It represents the number of partitions created when we work with RDDs, DataFrames, and Datasets. Ideally, x should equal the number of available CPU cores.

Let's look at the code below:
from pyspark.sql import SparkSession
from dotenv import load_dotenv


def create_spark_session():
    """Create a Spark Session"""
    # Load environment variables from a .env file (if present)
    _ = load_dotenv()
    return (
        SparkSession
        .builder
        .appName("SparkApp")    # name shown in the Spark UI
        .master("local[5]")     # run locally with 5 partitions/threads
        .getOrCreate()          # reuse an existing session or create a new one
    )


spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')
In the code above, we use the SparkSession class from the pyspark.sql module to create a PySpark session.
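As a follow-up sketch (the column names and sample values are made up for illustration), the spark object returned above can be used to create both a DataFrame and an RDD, and should be stopped when it is no longer needed:

# Build a small DataFrame from in-memory rows
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()

# Build an RDD through the session's underlying SparkContext
rdd = spark.sparkContext.parallelize(range(10), numSlices=5)
print(rdd.count())  # 10

# Release resources once we're done
spark.stop()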