Running Spark on YARN

Questions and Answers

What is the purpose of the --jars option when running Spark in cluster mode?

  • To define the memory allocation for Spark jobs
  • To specify the path of the Spark installation
  • To make local files available to SparkContext.addJar (correct)
  • To set the Hadoop runtime version

Which Spark distribution requires users to provide their own Hadoop installation?

  • Pre-built Spark distribution
  • No-hadoop Spark distribution (correct)
  • Cluster Spark distribution
  • With-hadoop Spark distribution

What happens when a job is submitted to a Hadoop Yarn cluster using the with-hadoop Spark distribution?

  • It populates Yarn’s classpath by default.
  • It prevents JAR conflicts by not populating Yarn’s classpath. (correct)
  • It automatically includes all local JARs.
  • It requires configuration of Hadoop classpath explicitly.

What configuration must be set to override the default behavior of not populating Yarn's classpath when using a with-hadoop Spark distribution?

spark.yarn.populateHadoopClasspath

What does Spark do if neither spark.yarn.archive nor spark.yarn.jars is specified?

Creates a zip file with all jars and uploads it to the distributed cache.

In YARN, what are executors and application masters referred to as?

Containers

Which mode does YARN provide for managing application logs after execution?

Log aggregation mode

What is a necessary step before running Spark on YARN?

Ensure Spark is built with YARN support

Which feature was added to Spark in version 0.6.0 to improve its functionality?

Support for running on YARN

What is required for security when deploying a Spark cluster open to the internet?

Access to the cluster should be restricted

What must the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable point to?

The directory containing the Hadoop configuration files

In cluster mode, where does the Spark driver run?

Inside the application master process managed by YARN

How does the Spark application on YARN determine the ResourceManager's address?

It is retrieved from the Hadoop configuration

What happens to the client in cluster mode after initiating a Spark application?

It exits once the application finishes running

What command should be used to launch a Spark application in YARN mode?

spark-submit --master yarn

What is true about configuring Java system properties when using YARN?

They should be set in the Spark application's configuration

    Study Notes

    Running Spark on YARN

    • Support for running Spark on YARN (Hadoop NextGen) was added in version 0.6.0 and improved with subsequent releases.

    • It's important to secure access to the cluster when deploying to an untrusted network, to prevent unauthorized applications from running.

    • The HADOOP_CONF_DIR or YARN_CONF_DIR environment variable should point to the directory containing the Hadoop configuration files.

    • These configuration files are used to write to HDFS and connect to the YARN ResourceManager.

    • The configuration in this directory will be distributed to the YARN containers, ensuring consistency across the application.

    • If the configuration references Java system properties or environment variables, they should also be set in the Spark application's configuration.
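
    A minimal sketch of the environment setup described in the bullets above, run before launching spark-submit (the configuration path is a placeholder, not a value from this lesson):

      # Point Spark at the directory holding the Hadoop/YARN client configuration.
      # /etc/hadoop/conf is a placeholder; use your cluster's actual config directory.
      export HADOOP_CONF_DIR=/etc/hadoop/conf
      # YARN_CONF_DIR can be used instead if the YARN configuration lives elsewhere.
      export YARN_CONF_DIR=/etc/hadoop/conf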

    Deploy Modes

    • There are two modes for deploying Spark applications on YARN: cluster mode and client mode.

    • In cluster mode, the driver program runs inside an application master process managed by YARN on the cluster, and the client can disconnect after initiating.

    • In client mode, the driver runs in the client process, and the application master requests resources from YARN.

    • To launch a Spark application in cluster mode, use the command spark-submit --deploy-mode cluster --master yarn.

    • To launch a Spark application in client mode, use the command spark-submit --deploy-mode client --master yarn.
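
    As a rough illustration of the two launch commands above (the main class, application JAR, and resource sizes are placeholders):

      # Cluster mode: the driver runs inside the YARN application master.
      spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --class org.example.MyApp \
        --executor-memory 2g \
        --num-executors 4 \
        my-app.jar

      # Client mode: the driver runs in the local spark-submit process.
      spark-submit --master yarn --deploy-mode client --class org.example.MyApp my-app.jar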

    Adding JARs

    • In cluster mode, SparkContext.addJar won't work directly with files local to the client.

    • To make client-side files available, include them in the launch command with the --jars option.
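
    A sketch of such a launch, with placeholder JAR paths and class name:

      # Ship client-local JARs with the application so SparkContext.addJar can find them.
      spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --jars /path/to/extra-lib.jar,/path/to/other-lib.jar \
        --class org.example.MyApp \
        my-app.jar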

    Preparations

    • Spark needs to be built with YARN support to run on YARN.

    • You can download a pre-built binary distribution from the Spark project website.

    • There are two variants of Spark binary distributions: with-hadoop and no-hadoop.

    • The with-hadoop distribution contains a built-in Hadoop runtime.

    • The no-hadoop distribution is smaller and requires a separate Hadoop installation.

    • For with-hadoop distributions, Yarn's classpath is not populated by default, which prevents JAR conflicts; to override this and populate Yarn's classpath, set spark.yarn.populateHadoopClasspath=true (see the sketches after this list).

    • For no-hadoop distributions, Yarn's classpath is populated by default to access the Hadoop runtime.

    • To build Spark yourself, refer to the Building Spark section in the documentation.

    • You can make Spark runtime jars accessible by specifying spark.yarn.archive or spark.yarn.jars.
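
    Two hedged sketches follow, with placeholder application details and illustrative HDFS paths. The first passes the spark.yarn.populateHadoopClasspath override mentioned above at submit time; the second shows one common way to publish the Spark runtime jars via spark.yarn.archive so containers fetch them from the distributed cache.

      # Sketch 1: opt a with-hadoop distribution back in to populating Yarn's classpath.
      spark-submit --master yarn --deploy-mode cluster \
        --conf spark.yarn.populateHadoopClasspath=true \
        --class org.example.MyApp my-app.jar

      # Sketch 2: bundle the Spark runtime jars, upload them to HDFS (illustrative paths),
      # and reference the archive so YARN containers reuse it from the distributed cache.
      jar cv0f spark-libs.jar -C "$SPARK_HOME/jars/" .
      hdfs dfs -mkdir -p /spark
      hdfs dfs -put spark-libs.jar /spark/
      spark-submit --master yarn --deploy-mode cluster \
        --conf spark.yarn.archive=hdfs:///spark/spark-libs.jar \
        --class org.example.MyApp my-app.jar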

    Configuration

    • Most configuration options are the same for Spark on YARN as for other deployment modes.

    • Refer to the Spark configuration page for more information.

    • For debugging Spark applications on YARN, enable YARN's log aggregation (the yarn.log-aggregation-enable property) so container logs are collected and can be reviewed after the application finishes.
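
    With log aggregation enabled, aggregated container logs can typically be pulled after the application finishes, for example (the application ID is a placeholder):

      # Fetch the aggregated container logs for a finished application.
      yarn logs -applicationId application_1234567890123_0001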

    Description

    This quiz covers the essentials of running Spark applications on YARN, including configuration management and deployment modes. Understand the differences between cluster and client modes, as well as the importance of securing your applications. Equip yourself with the knowledge to optimize Spark on Hadoop's next-generation architecture.
