Questions and Answers
What is the purpose of the --jars option when running Spark in client mode?
Which Spark distribution requires users to provide their own Hadoop installation?
What happens when a job is submitted to a Hadoop Yarn cluster using a with-hadoop Spark distribution?
What configuration must be set to override the default behavior of populating Yarn’s classpath if using a with-hadoop Spark distribution?
What does Spark do if neither spark.yarn.archive nor spark.yarn.jars is specified?
In YARN, what are executors and application masters referred to as?
Which mode does YARN provide for managing application logs after execution?
What is a necessary step before running Spark on YARN?
Which feature was added to Spark in version 0.6.0 to improve its functionality?
What is required for security when deploying a Spark cluster open to the internet?
What must the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables point to?
In cluster mode, where does the Spark driver run?
How does the Spark application on YARN determine the ResourceManager's address?
What happens to the client in cluster mode after initiating a Spark application?
What command should be used to launch a Spark application in YARN mode?
What is true about configuring Java system properties when using YARN?
Study Notes
Running Spark on YARN
- Support for running Spark on YARN (Hadoop NextGen) was added in version 0.6.0 and improved with subsequent releases.
- It's important to secure access to the cluster when deploying to an untrusted network, to prevent unauthorized applications from running.
- The HADOOP_CONF_DIR or YARN_CONF_DIR environment variable should point to the directory containing the Hadoop configuration files. These configuration files are used to write to HDFS and connect to the YARN ResourceManager.
- The configuration in this directory will be distributed to the YARN containers, ensuring consistency across the application.
- If the configuration references Java system properties or environment variables, they should also be set in the Spark application's configuration.
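A minimal sketch of pointing Spark at the Hadoop client configuration before submitting a job (the /etc/hadoop/conf path is an assumption; use wherever your cluster's *-site.xml files actually live):

```shell
# Point Spark at the directory holding core-site.xml, hdfs-site.xml,
# yarn-site.xml, etc. (the path here is an assumption for this example).
export HADOOP_CONF_DIR=/etc/hadoop/conf

# YARN_CONF_DIR works the same way if the YARN config is kept separately:
# export YARN_CONF_DIR=/etc/hadoop/conf
```

Spark reads this variable at submit time to locate HDFS and the YARN ResourceManager, which is why it must be set in the shell that runs spark-submit.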
Deploy Modes
- There are two modes for deploying Spark applications on YARN: cluster mode and client mode.
- In cluster mode, the driver program runs inside an application master process managed by YARN on the cluster, and the client can disconnect after initiating the application.
- In client mode, the driver runs in the client process, and the application master is only used to request resources from YARN.
- To launch a Spark application in cluster mode, use spark-submit --deploy-mode cluster --master yarn.
- To launch a Spark application in client mode, use spark-submit --deploy-mode client --master yarn.
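For example, the SparkPi example class that ships with Spark can be submitted in cluster mode like this (a sketch: the jar path and resource sizes are assumptions, and the command requires a running YARN cluster):

```shell
# Submit the bundled SparkPi example in cluster mode. The driver runs
# inside a YARN application master, so the terminal can be closed after
# the launch without killing the job.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 2g \
  --executor-cores 2 \
  examples/jars/spark-examples*.jar \
  10

# Client mode only changes the flag; the driver then runs locally and
# the terminal must stay open for the lifetime of the job:
# ./bin/spark-submit --master yarn --deploy-mode client ...
```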
Adding JARs
- In cluster mode, SparkContext.addJar won't work directly with files that are local to the client.
- To make client-side files available to SparkContext.addJar, include them in the launch command with the --jars option.
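Listing local jars with --jars ships them to the cluster alongside the application jar, so they are visible to SparkContext.addJar in cluster mode. A sketch with hypothetical jar and class names:

```shell
# my-app.jar, my-lib1.jar and my-lib2.jar are hypothetical local files;
# com.example.MyApp is a hypothetical main class. --jars takes a
# comma-separated list and uploads each jar with the application.
./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --jars my-lib1.jar,my-lib2.jar \
  my-app.jar
```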
Preparations
- Spark needs to be built with YARN support to run on YARN; you can also download a pre-built binary distribution from the Spark project website.
- There are two variants of Spark binary distributions: with-hadoop and no-hadoop.
- The with-hadoop distribution contains a built-in Hadoop runtime; the no-hadoop distribution is smaller and requires a separate Hadoop installation.
- For with-hadoop distributions, Yarn's classpath is not populated by default, to prevent JAR conflicts with the built-in Hadoop runtime. To override this behavior and populate Yarn's classpath, set the spark.yarn.populateHadoopClasspath=true property.
- For no-hadoop distributions, Yarn's classpath is populated by default so that Spark can access the Hadoop runtime.
- To build Spark yourself, refer to the Building Spark section in the documentation.
- You can make the Spark runtime jars accessible from YARN by specifying spark.yarn.archive or spark.yarn.jars; if neither is specified, Spark creates a zip file of all the jars under $SPARK_HOME/jars and uploads it to the distributed cache.
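Both properties can be passed with --conf at submit time. A sketch, assuming the runtime jars have been pre-staged on HDFS (the HDFS path, jar, and class names are assumptions):

```shell
# Pre-stage the Spark runtime jars on HDFS once, so every submission
# does not re-upload them (paths are assumptions for this example):
#   zip -j spark-jars.zip "$SPARK_HOME"/jars/*
#   hdfs dfs -put spark-jars.zip /spark/spark-jars.zip

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///spark/spark-jars.zip \
  --conf spark.yarn.populateHadoopClasspath=true \
  --class com.example.MyApp \
  my-app.jar
```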
Configuration
- Most configuration options are the same for Spark on YARN as for other deployment modes; refer to the Spark configuration page for more information.
- For debugging Spark applications on YARN, use the yarn.log-aggrega... property to enable log aggregation for containers.
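With log aggregation enabled, container logs for a finished application can be pulled from any node with the yarn CLI (the application ID below is a placeholder):

```shell
# Fetch the aggregated container logs for a completed application;
# application_1700000000000_0001 is a placeholder ID — use the real ID
# shown by spark-submit or the YARN ResourceManager UI.
yarn logs -applicationId application_1700000000000_0001
```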
Description
This quiz covers the essentials of running Spark applications on YARN, including configuration management and deployment modes. Understand the differences between cluster and client modes, as well as the importance of securing your applications. Equip yourself with the knowledge to optimize Spark on Hadoop's next-generation architecture.