Questions and Answers
What is the purpose of the --jars option when running Spark in client mode?
Which Spark distribution requires users to provide their own Hadoop installation?
What happens when a job is submitted to a Hadoop Yarn cluster using a with-hadoop Spark distribution?
What configuration must be set to override the default behavior of populating Yarn’s classpath if using a with-hadoop Spark distribution?
What does Spark do if neither spark.yarn.archive nor spark.yarn.jars is specified?
In YARN, what are executors and application masters referred to as?
Which mode does YARN provide for managing application logs after execution?
What is a necessary step before running Spark on YARN?
Which feature was added to Spark in version 0.6.0 to improve its functionality?
What is required for security when deploying a Spark cluster open to the internet?
What must the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables point to?
In cluster mode, where does the Spark driver run?
How does the Spark application on YARN determine the ResourceManager's address?
What happens to the client in cluster mode after initiating a Spark application?
What command should be used to launch a Spark application in YARN mode?
What is true about configuring Java system properties when using YARN?
Study Notes
Running Spark on YARN
- Support for running Spark on YARN (Hadoop NextGen) was added in version 0.6.0 and improved with subsequent releases.
- It's important to secure access to the cluster when deploying to an untrusted network, to prevent unauthorized applications from running.
- The HADOOP_CONF_DIR or YARN_CONF_DIR environment variable should point to the directory containing the Hadoop configuration files. These configuration files are used to write to HDFS and connect to the YARN ResourceManager.
- The configuration in this directory will be distributed to the YARN containers, ensuring consistency across the application.
- If the configuration references Java system properties or environment variables, they should also be set in the Spark application's configuration.
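A minimal sketch of pointing Spark at the Hadoop client configuration before submitting a job (the /etc/hadoop/conf path is an assumption; use wherever your cluster's *-site.xml files actually live):

```shell
# Point Spark at the directory holding core-site.xml, hdfs-site.xml,
# yarn-site.xml, etc. (the path here is an assumption for this example).
export HADOOP_CONF_DIR=/etc/hadoop/conf

# YARN_CONF_DIR works the same way if the YARN config is kept separately:
# export YARN_CONF_DIR=/etc/hadoop/conf
```

Spark reads this variable at submit time to locate HDFS and the YARN ResourceManager, which is why it must be set in the shell that runs spark-submit.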
Deploy Modes
- There are two modes for deploying Spark applications on YARN: cluster mode and client mode.
- In cluster mode, the driver program runs inside an application master process managed by YARN on the cluster, and the client can disconnect after initiating the application.
- In client mode, the driver runs in the client process, and the application master is only used to request resources from YARN.
- To launch a Spark application in cluster mode, use spark-submit --deploy-mode cluster --master yarn.
- To launch a Spark application in client mode, use spark-submit --deploy-mode client --master yarn.
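For example, the SparkPi example class that ships with Spark can be submitted in cluster mode like this (a sketch: the jar path and resource sizes are assumptions, and the command requires a running YARN cluster):

```shell
# Submit the bundled SparkPi example in cluster mode. The driver runs
# inside a YARN application master, so the terminal can be closed after
# the launch without killing the job.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 2g \
  --executor-cores 2 \
  examples/jars/spark-examples*.jar \
  10

# Client mode only changes the flag; the driver then runs locally and
# the terminal must stay open for the lifetime of the job:
# ./bin/spark-submit --master yarn --deploy-mode client ...
```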
Adding JARs
- In cluster mode, SparkContext.addJar won't work directly with files that are local to the client.
- To make client-side files available to SparkContext.addJar, include them in the launch command with the --jars option.
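Listing local jars with --jars ships them to the cluster alongside the application jar, so they are visible to SparkContext.addJar in cluster mode. A sketch with hypothetical jar and class names:

```shell
# my-app.jar, my-lib1.jar and my-lib2.jar are hypothetical local files;
# com.example.MyApp is a hypothetical main class. --jars takes a
# comma-separated list and uploads each jar with the application.
./bin/spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --jars my-lib1.jar,my-lib2.jar \
  my-app.jar
```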
Preparations
- Spark needs to be built with YARN support to run on YARN; you can also download a pre-built binary distribution from the Spark project website.
- There are two variants of Spark binary distributions: with-hadoop and no-hadoop.
- The with-hadoop distribution contains a built-in Hadoop runtime; the no-hadoop distribution is smaller and requires a separate Hadoop installation.
- For with-hadoop distributions, Yarn's classpath is not populated by default, to prevent JAR conflicts with the built-in Hadoop runtime. To override this behavior and populate Yarn's classpath, set the spark.yarn.populateHadoopClasspath=true property.
- For no-hadoop distributions, Yarn's classpath is populated by default so that Spark can access the Hadoop runtime.
- To build Spark yourself, refer to the Building Spark section in the documentation.
- You can make the Spark runtime jars accessible from YARN by specifying spark.yarn.archive or spark.yarn.jars; if neither is specified, Spark creates a zip file of all the jars under $SPARK_HOME/jars and uploads it to the distributed cache.
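Both properties can be passed with --conf at submit time. A sketch, assuming the runtime jars have been pre-staged on HDFS (the HDFS path, jar, and class names are assumptions):

```shell
# Pre-stage the Spark runtime jars on HDFS once, so every submission
# does not re-upload them (paths are assumptions for this example):
#   zip -j spark-jars.zip "$SPARK_HOME"/jars/*
#   hdfs dfs -put spark-jars.zip /spark/spark-jars.zip

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///spark/spark-jars.zip \
  --conf spark.yarn.populateHadoopClasspath=true \
  --class com.example.MyApp \
  my-app.jar
```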
Configuration
- Most configuration options are the same for Spark on YARN as for other deployment modes; refer to the Spark configuration page for more information.
- For debugging Spark applications on YARN, use the yarn.log-aggrega... property to enable log aggregation for containers.
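With log aggregation enabled, container logs for a finished application can be pulled from any node with the yarn CLI (the application ID below is a placeholder):

```shell
# Fetch the aggregated container logs for a completed application;
# application_1700000000000_0001 is a placeholder ID — use the real ID
# shown by spark-submit or the YARN ResourceManager UI.
yarn logs -applicationId application_1700000000000_0001
```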
Description
This quiz covers the essentials of running Spark applications on YARN, including configuration management and deployment modes. Understand the differences between cluster and client modes, as well as the importance of securing your applications. Equip yourself with the knowledge to optimize Spark on Hadoop's next-generation architecture.