Spark allows you to configure the system according to your needs. One place to do so is Spark properties, which control most application parameters. They can be set using a SparkConf object or loaded dynamically at runtime.
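As a minimal sketch of the first approach (the application name and master reuse values from the spark-submit example further below; the memory value is an arbitrary illustration), setting properties directly on a SparkConf looks like this:

import org.apache.spark.{SparkConf, SparkContext}

// Hardcode the configuration on a SparkConf object.
val conf = new SparkConf()
  .setAppName("My app")                // application name
  .setMaster("local[4]")               // run locally with 4 threads
  .set("spark.executor.memory", "2g")  // example property value

val sc = new SparkContext(conf)
// ... application logic ...
sc.stop()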
Dynamically loading properties lets you decide how to start the application on the go instead of hardcoding the configuration. For instance, if you would like to run the same application with different masters or different amounts of memory, Spark allows you to simply create an empty SparkConf:
val sc = new SparkContext(new SparkConf())
Then, you can supply configuration values at runtime:
./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
You can use spark-submit to set any Spark property with the --conf/-c flag, supplying the value after the = sign. Running ./bin/spark-submit --help will show the entire list of these options.
You can also specify properties in a separate file. For example, bin/spark-submit also reads configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. The spark-defaults.conf file will look something like this:
spark.master spark://5.6.7.8:7077
spark.executor.memory 4g
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
Dynamically loaded properties are merged in a specific order of precedence: properties set directly on the SparkConf take the highest precedence, followed by flags passed to spark-submit or spark-shell, and finally the options in the spark-defaults.conf file.
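As a small illustration of this precedence (the application name is made up, and setMaster is included only so the snippet can run locally), a value set directly on the SparkConf wins over the same key passed via --conf or listed in spark-defaults.conf; you can confirm which values took effect by inspecting the merged configuration:

import org.apache.spark.{SparkConf, SparkContext}

// A value set here overrides --conf flags and spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("precedence-demo")          // hypothetical application name
  .setMaster("local[2]")                  // for local testing only
  .set("spark.eventLog.enabled", "true")  // highest-precedence source

val sc = new SparkContext(conf)

// Print the effective configuration the application ended up with.
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k = $v") }
println(sc.getConf.get("spark.eventLog.enabled"))  // expected: true
sc.stop()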
A few configuration keys have been renamed since earlier versions of Spark. In these cases, older key names are still accepted, but they take lower precedence than any instance of the newer key.
Spark properties can be divided into two kinds. The first kind relates to deployment, for example spark.driver.memory or spark.executor.instances. These properties may not take effect when set programmatically through SparkConf at runtime, or their behavior may depend on which cluster manager and deploy mode are chosen, so it is suggested to set them through a configuration file (spark-defaults.conf) or spark-submit command-line options. The second kind relates mainly to Spark runtime control, for example spark.task.maxFailures. These properties can be set either way.
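For instance, a runtime-control property such as spark.task.maxFailures can safely be set programmatically, while deploy-related properties are better left to spark-defaults.conf or the spark-submit command line. A minimal sketch (the application name and the value 8 are arbitrary, and setMaster is included only so the snippet can run locally):

import org.apache.spark.{SparkConf, SparkContext}

// Runtime-control property: safe to set programmatically.
val conf = new SparkConf()
  .setAppName("retry-demo")            // hypothetical application name
  .setMaster("local[2]")               // for local testing only
  .set("spark.task.maxFailures", "8")  // retry failed tasks up to 8 times on a cluster

// Deploy-related properties such as spark.executor.instances are left to
// spark-defaults.conf or the spark-submit command line instead.
val sc = new SparkContext(conf)
// ... application logic ...
sc.stop()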