Spark allows you to configure the system according to your needs. One place to do so is Spark properties, which control most application parameters. They can be set using a SparkConf object or loaded dynamically at runtime.
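As a minimal sketch of the first approach (the application name and master reuse values from the spark-submit example further below; the memory value is an arbitrary illustration), setting properties directly on a SparkConf looks like this:

import org.apache.spark.{SparkConf, SparkContext}

// Hardcode the configuration on a SparkConf object.
val conf = new SparkConf()
  .setAppName("My app")                // application name
  .setMaster("local[4]")               // run locally with 4 threads
  .set("spark.executor.memory", "2g")  // example property value

val sc = new SparkContext(conf)
// ... application logic ...
sc.stop()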
Dynamically loading properties lets you decide how to start the application on the go instead of hardcoding the configuration. For instance, if you would like to run the same application with different masters or different amounts of memory, Spark allows you to simply create an empty SparkConf:
val sc = new SparkContext(new SparkConf())
Then, you can supply configuration values at runtime:
./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
You can use spark-submit to set any Spark property with the --conf/-c flag, supplying the value after the = sign. Running ./bin/spark-submit --help will show the entire list of these options.
You can also specify properties in a separate file. For example, bin/spark-submit also reads configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. The spark-defaults.conf file will look something like this:
spark.master spark://5.6.7.8:7077
spark.executor.memory 4g
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer
Dynamically loaded properties are merged in a specific order of precedence: properties set directly on the SparkConf take the highest precedence, followed by flags passed to spark-submit or spark-shell, and finally the options in the spark-defaults.conf file.
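As a small illustration of this precedence (the application name is made up, and setMaster is included only so the snippet can run locally), a value set directly on the SparkConf wins over the same key passed via --conf or listed in spark-defaults.conf; you can confirm which values took effect by inspecting the merged configuration:

import org.apache.spark.{SparkConf, SparkContext}

// A value set here overrides --conf flags and spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("precedence-demo")          // hypothetical application name
  .setMaster("local[2]")                  // for local testing only
  .set("spark.eventLog.enabled", "true")  // highest-precedence source

val sc = new SparkContext(conf)

// Print the effective configuration the application ended up with.
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k = $v") }
println(sc.getConf.get("spark.eventLog.enabled"))  // expected: true
sc.stop()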
A few configuration keys have been renamed since earlier versions of Spark. In these cases, older key names are still accepted, but they take lower precedence than any instance of the newer key.
Spark properties can be divided into two kinds. The first kind relates to deployment, for example spark.driver.memory or spark.executor.instances. These properties may not take effect when set programmatically through SparkConf at runtime, or their behavior may depend on which cluster manager and deploy mode are chosen, so it is suggested to set them through a configuration file (spark-defaults.conf) or spark-submit command-line options. The second kind relates mainly to Spark runtime control, for example spark.task.maxFailures. These properties can be set either way.
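For instance, a runtime-control property such as spark.task.maxFailures can safely be set programmatically, while deploy-related properties are better left to spark-defaults.conf or the spark-submit command line. A minimal sketch (the application name and the value 8 are arbitrary, and setMaster is included only so the snippet can run locally):

import org.apache.spark.{SparkConf, SparkContext}

// Runtime-control property: safe to set programmatically.
val conf = new SparkConf()
  .setAppName("retry-demo")            // hypothetical application name
  .setMaster("local[2]")               // for local testing only
  .set("spark.task.maxFailures", "8")  // retry failed tasks up to 8 times on a cluster

// Deploy-related properties such as spark.executor.instances are left to
// spark-defaults.conf or the spark-submit command line instead.
val sc = new SparkContext(conf)
// ... application logic ...
sc.stop()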