A SparkSession is the entry point to PySpark's functionality: it provides access to the underlying features for programmatically creating PySpark Resilient Distributed Datasets (RDDs) and DataFrames.
In a PySpark application, we can create as many SparkSession objects as we like, either through the SparkSession.builder pattern or by calling newSession() on an existing session. Multiple session objects are useful when we want to keep PySpark tables (which are relational entities) logically isolated from one another.
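As a minimal sketch of that isolation (assuming an existing session named spark, created the way shown later in this answer), newSession() returns a second session that shares the same underlying SparkContext but keeps its own temporary views and SQL configuration:

# Assumes an existing SparkSession named `spark`
spark2 = spark.newSession()  # shares the SparkContext, but session state is isolated

# Register a temporary view on the first session only
spark.range(5).createOrReplaceTempView("numbers")

# The view is visible to `spark` but not to `spark2`
print([t.name for t in spark.catalog.listTables()])   # ['numbers']
print([t.name for t in spark2.catalog.listTables()])  # []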
To create a SparkSession in Python, we'll need the following methods:
- builder: the entry point for constructing a SparkSession. In PySpark, it is accessed as an attribute (SparkSession.builder).
- getOrCreate(): returns the existing SparkSession if one exists; otherwise, it creates a new session.
- appName(): sets the application name.
- master(): sets the master URL (for example, the cluster manager address when running on a cluster). When running locally in standalone mode, we pass local[x], where x is an integer greater than 0. It represents the number of worker threads, and hence the default number of partitions created when we use RDDs, DataFrames, and Datasets. Ideally, x should equal the number of CPU cores.

Let's look at the code below:
from pyspark.sql import SparkSession
from dotenv import load_dotenv


def create_spark_session():
    """Create a SparkSession."""
    _ = load_dotenv()  # load environment variables from a .env file, if one is present
    return (
        SparkSession
        .builder
        .appName("SparkApp")   # name shown in the Spark UI
        .master("local[5]")    # run locally with 5 worker threads
        .getOrCreate()         # reuse an existing session or create a new one
    )
spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')

Using the SparkSession library to create a PySpark session
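Once the session exists, it can be used to create DataFrames and RDDs, as mentioned at the start of this answer. The lines below are a small illustrative sketch; the column names and sample values are made up for the example:

# Create a DataFrame through the session
df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
df.show()

# Create an RDD through the session's underlying SparkContext
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.sum())  # 15

# Stop the session when the application is done
spark.stop()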