Google Cloud Dataproc and Apache Spark

Questions and Answers

Which Google Cloud service is a managed Hadoop service?

  • Compute Engine
  • Dataproc (correct)
  • Cloud Storage
  • BigQuery

Which of the following is a component of the Hadoop ecosystem?

  • Docker
  • TensorFlow
  • Kubernetes
  • Spark (correct)

What does HDFS stand for?

  • High-performance Data Storage
  • Hybrid Data Filtering Service
  • Hierarchical Data Filing System
  • Hadoop Distributed File System (correct)

Which of the following is NOT a limitation of on-premises Hadoop clusters?

Answer: Automatic version management

Which of the following is a benefit of using Dataproc?

Answer: Built-in support for Hadoop

What type of programming model does Spark use?

Answer: Declarative

Which of the following features is associated with Dataproc?

Answer: Low cost

What is the average time Dataproc clusters take to start, scale, and shut down?

Answer: 90 seconds or less

Besides Spark, which other components are frequently updated in Dataproc?

Answer: Hadoop, Pig, and Hive

What is a key advantage of using Cloud Storage over HDFS in Dataproc?

Answer: Separation of compute and storage

Which of the following is a valid way to create a Dataproc cluster?

Answer: All of the above

What is the default minimum number of worker nodes in Dataproc?

Answer: 2

What must an application be designed for if using preemptible VMs?

Answer: Resilience to prevent data loss

What can custom machine types be used for in Dataproc?

Answer: To specify memory and CPU balance

Which of the following ways can jobs be submitted?

Answer: All of the above

Before 2006, what characterized big data storage and processing?

Answer: Cheap storage, expensive processing

What action is involved when setting up Dataproc?

Answer: Creating a cluster

As a best practice, how should Cloud Storage traffic be routed?

Answer: Directly between Cloud Storage and Compute Engine

Why should the number of input files and Hadoop partitions be controlled?

Answer: To enhance performance

In Dataproc, what does the primary node contain?

Answer: The HDFS NameNode

What are cluster properties used for?

Answer: Run-time values for dynamic startup options

For cost effectiveness, how should you treat Dataproc processing clusters?

Answer: Short-lived

How can you adapt existing Hadoop code to work with Cloud Storage?

Answer: Change the prefix from hdfs:// to gs://

What should be avoided when using Cloud Storage?

Answer: Small reads

What is the suggested action for on-prem data you know you will need?

Answer: A push-based model

What is the risk if Dataproc clusters are geographically distant?

Answer: Increased request latency

What type of cluster configuration is best suited for Dataproc?

Answer: Ephemeral clusters

What is the first step to using ephemeral clusters for Dataproc?

Answer: Create a configured cluster

Which configuration issue can cause benchmarks to run slower than expected?

Answer: A persistent disk sized too small for the quantity of data

What type of VMs should be used to create the smallest possible cluster?

Answer: Preemptible

Which of the following can be done to reduce costs with Cloud Storage?

Answer: Treat the clusters as ephemeral resources

What provides a consolidated and concise view of all logs?

Answer: Cloud Logging

Which is a key tool to be aware of for moving data?

Answer: DistCp

What type of data is Cloud Storage primarily designed to store?

Answer: Unstructured data

Which of the following is an alternative storage option to Cloud Storage?

Answer: Bigtable

What is a use for BigQuery?

Answer: Data warehousing

What is used with autoscaling to assist scaling?

Answer: Hadoop YARN Metrics

What period is available to let things settle before autoscaling evaluation occurs again?

Answer: The cooldown period

What is the amount of time in seconds to wait before automatically turning down the cluster?

Answer: Duration

Which of the following is a characteristic of a Dataproc Workflow Template?

Answer: YAML file

What is Dataproc?

Answer: A managed Hadoop service by Google Cloud

What is Spark?

Answer: A fast, in-memory data processing engine

Which of these components is part of the Hadoop ecosystem?

Answer: Apache Spark

What is the main function of HDFS in the Hadoop ecosystem?

Answer: Distributed data storage

What is one advantage of Dataproc over on-premises Hadoop clusters?

Answer: Requires less tuning

What is a key benefit of using Cloud Storage instead of HDFS in Dataproc?

Answer: Separation of compute and storage

What does 'elastic' refer to in the context of cloud computing?

Answer: Ability to quickly scale

What is a typical use case for Dataproc?

Answer: Running Hadoop and Spark workloads

What does second-by-second billing mean in Dataproc?

Answer: You are charged for the exact time the resources are used

Which of these is a key feature of Dataproc?

Answer: Built-in support for Hadoop

What is the purpose of initialization actions in Dataproc?

Answer: Customize cluster software

What is the recommendation for using Dataproc clusters?

Answer: Short-lived clusters only when needed

What happens to data stored in HDFS when a Dataproc cluster is turned off?

Answer: Data is lost

How can existing Hadoop code be adapted to work with Cloud Storage?

Answer: By changing the prefix

What does setting a regional endpoint offer?

Answer: Increased isolation

What is the default HDFS replication factor in Dataproc?

Answer: 2

What should be specified to tune a VM to the load and avoid wasting resources?

Answer: Custom machine types

What can be used to specify executables or scripts Dataproc will run on all nodes?

Answer: Initialization actions

Which is a way to monitor your job?

Answer: All of the above

What is the recommendation on where to put initial data in a big-data pipeline?

Answer: Cloud Storage

What type of data in Google Cloud is Cloud Storage designed to store?

Answer: Unstructured data

What is the key benefit of using ephemeral clusters?

Answer: Reduced costs

What is the purpose of a Dataproc Workflow Template?

Answer: It processes data via a Directed Acyclic Graph (DAG)

In Dataproc autoscaling, what determines how many nodes to launch?

Answer: scale_up.factor

In Google Cloud, what delivers over 1 petabit per second of bandwidth?

Answer: The Jupiter networking fabric

Is HDFS replication still needed if your cluster is highly available?

Answer: Yes, the replication factor is still 2

What allows you to reduce the disk requirements and save costs when using Dataproc?

Answer: Using Cloud Storage

Why shouldn't you use Hadoop's direct interfaces to submit jobs?

Answer: They bypass Dataproc security

What can happen if services are geographically distant?

Answer: Some Google services might copy all of the data from another zone

What action should you take when there are more than ten thousand input files?

Answer: Combine or union the data into larger files

What is one of the first optimization areas in Dataproc?

Answer: Lifting Hadoop workloads

What is a best practice for moving data?

Answer: Using DistCp

Using DistCp, what is the best way to move data you will always need?

Answer: Using a push-based model

If you need to pre-install software, what reduces time for customized code to be operational?

Answer: Custom images

If a workflow contains five jobs in series, where is the initial data retrieved from?

Answer: Cloud Storage

What should not be done with Cloud Storage?

Answer: Iterating sequentially

Flashcards

What is Dataproc?

Google Cloud's managed Hadoop service.

What is Apache Hadoop?

Open source software project that maintains the framework for distributed processing of large datasets across clusters of computers.

What is HDFS?

The main file system Hadoop uses for distributing work to nodes on the cluster.

What is Apache Spark?

An open source software project that provides a high performance analytics engine for processing batch and streaming data.

What are limitations of on-premises Hadoop clusters?

On-premises Hadoop clusters have no separation between storage and compute resources, are hard to scale fast, and have capacity limits.

How does Dataproc simplify Hadoop?

Dataproc simplifies Hadoop workloads on Google Cloud with built-in support for Hadoop, managed hardware, simplified version management and flexible job configuration.

What is declarative programming?

Declarative programming is where you tell the system what you want and it figures out how to get it done; with imperative programming, you tell the system how to do it.

Dataproc's main function?

Dataproc is a managed service for running Hadoop and Spark data processing workloads.

Dataproc's advantages?

Dataproc lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

How does Dataproc save money?

Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them.

Dataproc's Pricing?

Dataproc is priced at 1 cent per virtual CPU per cluster per hour with second-by-second billing.

Dataproc's open ecosystem?

Dataproc provides frequent updates to native versions of Spark, Hadoop, Pig, and Hive so there is no need to learn new tools or APIs.

Dataproc version control?

Dataproc allows image versioning to switch between different versions of Apache Spark, Apache Hadoop, and other tools.

Dataproc Availability

Dataproc lets you run clusters with multiple primary nodes and set jobs to restart on failure, ensuring your clusters and jobs are highly available.

Dataproc initialization actions?

Dataproc allows initialization actions to install or customize settings and libraries when a cluster is created.

Customizing Dataproc?

Dataproc has two ways to customize clusters: optional components and initialization actions.

Customize cluster

Use initialization actions to add other software to the cluster at startup, using the Cloud SDK.

Dataproc cluster components?

A Dataproc cluster has manager nodes, workers, and HDFS.

Cloud Storage instead of HDFS?

Instead of native HDFS on the cluster, you can simply use Cloud Storage via the connector.

Dataproc Workflow

Using Dataproc involves this sequence of events: Setup, Configuration, Optimization, Utilization, and Monitoring.

Creating a cluster

You can create a cluster through the Cloud Console, or from the command line using the gcloud command, or use the REST API.

Dataproc regions

Choose a region and zone, or select a 'global region' when configuring

Custom worker nodes

You can run initialization actions to further customize the worker nodes.

Use Preemptible VMs

Preemptible VMs can be used to lower costs, but they can be pulled from service at any time (and always within 24 hours).

Apache Spark

A simple explanation of Spark is that it is able to mix different kinds of applications and to adjust how it uses the available resources dynamically

Advantage of HDFS

You can use HDFS in the cloud just by lifting and shifting your Hadoop workloads to Dataproc, requiring no code changes.

Jupiter Fabric

The Jupiter networking fabric within a Google data center delivers over 1 petabit per second of bandwidth.

Dataproc and Colossus

Dataproc clusters scale VMs up and down as the compute requires, while persistent storage is handed off to Colossus, Google Cloud's storage layer, behind the scenes.

Bigtable and BigQuery

You can use Bigtable to store large amounts of sparse data and BigQuery can be used for data warehousing.

Advantage of Dataproc

With Cloud Storage as the backend, you can treat clusters themselves as ephemeral resources, which allows you not to pay for compute capacity when you're not running any jobs.

DistCp

DistCp is the key tool for moving data to Google Cloud.

Preemptible system

Preemptible VMs suit fault-tolerant systems at lower cost, and you have the option to schedule automatic deletion to improve utilization of your resources.

Key step

The key step is the shift away from monolithic, persistent clusters to specialized, ephemeral clusters that exist only as long as you need them.

Workflow template

A Dataproc Workflow Template is a YAML file that is processed through a Directed Acyclic Graph or DAG

Autoscaling

Dataproc autoscaling provides clusters that size themselves to the workload: jobs are "fire and forget," no manual intervention is needed, and resources and costs are saved.

Dataproc Errors

The best way to find what error caused a Spark job failure is to look at the driver output and the logs generated by the Spark executors

Setting log levels

You can set the driver log level at submission time using the gcloud command: gcloud dataproc jobs submit hadoop with the --driver-log-levels parameter.

Operations in cloud

Cloud operations include logging and monitoring

Regions impact solution

Geographical regions can impact the efficiency of your solution

Study Notes

  • This module discusses Google Cloud's Dataproc, a managed Hadoop service, with a focus on Apache Spark

Module Agenda

  • Coverage includes the Hadoop Ecosystem
  • Running Hadoop on Dataproc and understanding its operation
  • Benefits of Cloud Storage instead of HDFS will be covered
  • Optimization of Dataproc and completion of a hands-on lab with Apache Spark will be reviewed

The Hadoop Ecosystem

  • The Hadoop ecosystem was developed out of the need to analyze large datasets via distributed processing
  • Before 2006, storage was cheap, processing was expensive, so data was copied to the processor for analysis
  • Around 2006, distributed processing became practical with Hadoop for big data
  • Hadoop creates a computer cluster, and uses distributed processing
  • HDFS (Hadoop Distributed File System) stores data on cluster machines
  • MapReduce provides distributed data processing
  • An entire ecosystem of Hadoop-related software grew, such as Hive, Pig, and Spark

Big Data Workloads

  • Hadoop is used by organizations for on-premises big data workloads
  • Clusters run a range of applications, like Presto, and Spark
  • Apache Hadoop is an open-source software project that provides the framework for the distributed processing of large datasets
  • Apache Spark is an open-source project that provides a high-performance analytics engine for batch and streaming data
  • Spark can be 100x faster than Hadoop jobs using in-memory processing
  • Spark is powerful and expressive, is used for many workloads, and provides methods for dealing with Resilient Distributed Datasets (RDDs) and DataFrames
  • Open-source Hadoop carries complexity and overhead from its data-center design assumptions, but better options relieve those limitations
  • Two common issues with OSS Hadoop are tuning and utilization; Dataproc can overcome these limitations

Hadoop Elasticity

  • On-premises Hadoop clusters are not elastic due to their physical nature
  • On-premises clusters have no separation of storage and compute resources
  • On-premises clusters are difficult to scale fast and have capacity limits
  • The only way to increase capacity in an on-premises cluster is to add more physical servers

Dataproc on Google Cloud

  • Dataproc on Google Cloud simplifies Hadoop workloads
  • Built-in support for Hadoop and managed Hadoop and Spark environment
  • You specify cluster configuration, Dataproc allocates resources, and you can scale your cluster
  • Dataproc versioning manages open-source tools
  • You can focus on your use cases instead of managing dependencies and software configuration interactions

Apache Spark

  • Apache Spark is a popular, flexible, and powerful way to process large datasets
  • Spark avoids the need to tune the Hadoop system for efficient resource use
  • Spark mixes application types and adjusts how it uses the available resources dynamically
  • Spark uses a declarative programming model (see the sketch below)
  • Spark has a full SQL implementation
  • A common DataFrame model works across Scala, Java, Python, SQL, and R
  • There is a distributed machine learning library called Spark MLlib
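
To make the declarative model concrete, here is a minimal sketch of submitting a SQL query to a Dataproc cluster; the cluster name, region, and the "logs" table are hypothetical placeholders.

    # Declarative: describe the result you want, not how to compute it.
    gcloud dataproc jobs submit spark-sql \
        --cluster=example-cluster \
        --region=us-central1 \
        -e "SELECT status, COUNT(*) AS hits FROM logs GROUP BY status"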

Running Hadoop on Dataproc

  • Existing Hadoop job code can be processed in the cloud using Dataproc on Google Cloud

Dataproc Features

  • Dataproc is a managed service for running Hadoop and Spark data processing workloads
  • It enables the use of open-source data tools for batch processing, querying, streaming, and machine learning
  • It quickly creates clusters, manages them easily, and saves money by turning them off when not needed
  • It has unique advantages compared to competing cloud services and traditional on-premises products
  • There is no need to learn new tools or APIs, making it easy to move existing projects into Dataproc without redevelopment; Spark, Hadoop, Hive, and Pig are frequently updated

Additional Dataproc Features

  • Key features of Dataproc are listed
  • Low cost: Priced at 1 cent per virtual CPU per cluster per hour on top of Google Cloud resources used with second-by-second billing and one-minute minimum
  • Super-fast: Clusters start, scale, and shut down in 90 seconds or less on average
  • Resizable clusters: Created and scaled quickly with a variety of virtual machine types, disk sizes, number of nodes, and networking options available
  • Open source ecosystem: Use Spark and Hadoop tools, libraries, and documentation; frequent updates to native versions of Spark, Hadoop, Pig, and Hive; and move projects or ETL pipelines without redevelopment
  • Integrated: Built-in integration with Cloud Storage, BigQuery, and Bigtable ensures data will not be lost, and you can ETL terabytes of raw log data into these services
  • Managed: Easily interact with clusters and Spark or Hadoop jobs; turn off when complete
  • Versioning: Image versioning to switch between different versions of Apache Spark, Apache Hadoop, and other tools
  • Highly available: Run clusters with multiple primary nodes and set jobs to restart on failure
  • Developer tools: Multiple ways to manage a cluster are available
  • Initialization actions are available to customize the settings
  • Automatic or manual configuration option

OSS Options

  • Dataproc has options to customize clusters such as optional components and initialization actions
  • Pre-configured optional components can be selected when deploying from the console or via the command line
  • Initialization actions customize the Dataproc cluster immediately after setup

Cluster Customization

  • Initialization actions install additional components at startup
  • The Cloud SDK is used to create the Dataproc cluster with the actions attached
  • An HBase shell script, for example, can be specified to run at cluster initialization
  • Many pre-built startup scripts can be leveraged for Hadoop cluster setup tasks, like Flink, Jupyter, and more
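
A minimal sketch of attaching an initialization action at cluster creation; the bucket and script names are hypothetical placeholders.

    # The referenced script runs on each node as it starts up.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --initialization-actions=gs://example-bucket/install-hbase-shell.sh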

Dataproc Clusters

  • A Dataproc cluster contains manager nodes, workers, and HDFS
  • It can contain either preemptible secondary workers or non-preemptible secondary workers, but not both
  • The cluster is a set of virtual machines providing persistent storage through HDFS, plus manager and worker nodes
  • Worker nodes are part of a managed instance group where virtual machines share the same template allowing auto resizing based on demand
  • Google Cloud recommends a ratio of 60/40 as maximum between standard VMs and preemptible VMs
  • Spin up the cluster for computing, then turn it down after jobs complete
  • HDFS storage disappears when a cluster is turned down, so use off-cluster storage

HDFS Connectors

  • Instead of native HDFS on a cluster, Cloud Storage may be used via the connector
  • Adapt existing Hadoop code to use Cloud Storage instead of HDFS by changing the storage prefix from hdfs:// to gs://
  • Consider writing to Bigtable instead of HBase off-cluster
  • For analytical workloads, write the data into BigQuery

Using Dataproc

  • Using Dataproc involves setup, configuration, optimization, utilization, and monitoring

Dataproc Setup

  • Setup means creating a cluster: through the Cloud Console, from the command line using the gcloud command, from an exported YAML file, or with a Terraform configuration (see the sketch below)
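
As a sketch of two of these surfaces, the first command below creates a minimal cluster from the command line and the second exports its configuration as a YAML file for reuse; names and locations are placeholders.

    # Create a small cluster from the command line.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --zone=us-central1-a \
        --num-workers=2

    # Export the cluster's configuration to YAML for later reuse.
    gcloud dataproc clusters export example-cluster \
        --region=us-central1 \
        --destination=cluster-config.yaml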

Dataproc Configuration

  • A cluster can be set up as a single VM, which keeps expenses down for development and experimentation
  • Standard clusters include a single Primary Node; High Availability clusters have three Primary Nodes
  • Choose a region and zone, or select a "global region" to allow the service to choose the zone
  • The cluster defaults to a Global Endpoint, but defining a Regional Endpoint offers increased isolation and, in some cases, lower latency
  • The Primary Node is where the HDFS NameNode runs, as well as the YARN node and job drivers
  • The default HDFS replication factor in Dataproc is 2
  • Optional Hadoop-ecosystem components include Anaconda, Hive WebHCat, Jupyter Notebook, and Zeppelin Notebook
  • User labels can tag the cluster; cluster properties are run-time values used by configuration files for more dynamic startup options
  • vCPU, memory, and storage are configured separately for Primary, Worker, and Preemptible Worker nodes
  • Preemptible nodes include the YARN NodeManager but do not run HDFS
  • The minimum default number of worker nodes is 2; the maximum is determined by quota and the number of SSDs attached to each worker (see the sketch below)
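
A configuration sketch pulling several of these options together: machine shapes for the primary and worker nodes, the default two workers, and secondary workers (preemptible by default); all names and values are examples.

    # The custom machine type format custom-VCPUS-MEMORY_MB tunes CPU/memory balance.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --master-machine-type=n1-standard-4 \
        --worker-machine-type=custom-4-8192 \
        --num-workers=2 \
        --num-secondary-workers=2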

Configuration Additional Options

  • It is also possible to specify initialization actions and metadata
  • Initialization scripts can further customize the worker nodes
  • Metadata allows VMs to share state information

Dataproc Optimization

  • Preemptible VMs can be used to lower costs, but they can be pulled from service at any time (and always within 24 hours), so applications must be designed for resilience to prevent data loss
  • Custom machine types provide the balance of memory and CPU to tune the VM to the load
  • A custom image can be used to pre-install software
  • A Persistent SSD boot disk helps to quickly boot the cluster

Utilize Job Submission

  • Jobs can be submitted through cloud console, gcloud command, or REST APIs
  • Started by orchestration services like Dataproc Workflow and Cloud Composer
  • Do not use Hadoop's direct interfaces to submit jobs, because doing so bypasses Dataproc's default security
  • Restartable jobs must be designed to be idempotent and to detect successorship and restore state; they can only be enabled through the command line or REST API
  • Jobs are not restartable by default (see the sketch below)
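
A job-submission sketch using the SparkPi example jar that ships on Dataproc images; the --max-failures-per-hour flag opts the job into automatic restarts (cluster name and values are placeholders).

    # Submit a Spark job and allow up to five automatic restarts per hour.
    gcloud dataproc jobs submit spark \
        --cluster=example-cluster \
        --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        --max-failures-per-hour=5 \
        -- 1000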

Dataproc Monitoring

  • Monitoring can be done by Cloud Monitoring after submitting the job
  • Customized emails can be used to notify users of alerts
  • Details from HDFS, YARN and metrics about utilization can be all monitored with Cloud Monitoring

Google Cloud Monitoring

  • The Cloud Console shows job status without requiring connections to web interfaces
  • Debug log levels can be set when a job is submitted from the command line (see the sketch below)
  • YARN log aggregation is used by default, and Cloud Logging provides a consolidated view of all logs
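
A sketch of setting driver log levels at submission time, per the gcloud command mentioned in the flashcards; the jar location and package name are hypothetical.

    # Raise or lower log verbosity per package for the job driver.
    gcloud dataproc jobs submit hadoop \
        --cluster=example-cluster \
        --region=us-central1 \
        --driver-log-levels=root=FATAL,com.example=INFO \
        --jar=gs://example-bucket/my-job.jar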

Cloud Storage

  • Prefer Cloud Storage over the native Hadoop file system (HDFS) wherever possible

Hadoop & MapReduce

  • The original MapReduce paper was designed for a world where data was local to the compute machine
  • Petabit networking makes it possible to treat storage and compute independently and move traffic efficiently over the network

HDFS in the Cloud

  • HDFS in the cloud is a sub-par solution in the long run because of how the cluster has to work to serve it
  • You can run HDFS in the cloud just by lifting and shifting your Hadoop workloads to Dataproc, requiring no code changes; however, it is sub-par in the long run

Why HDFS in the Cloud Is Sub-Par

  • Block size ties input and output performance to the server's actual hardware
  • Storage is not elastic, which forces cluster resizes and extra compute
  • Data locality raises similar concerns, because data is tied to the cluster's disks
  • For HDFS to be highly available, a separate storage solution is the better choice

Google's Network

  • Google's network enables new solutions for Big Data, and provides one petabit per second of bandwidth within data centers
  • Full bisectional bandwidth means each machine can communicate with any server at full network speed
  • With enough network bandwidth, data can be used directly from where it is stored

Google Cloud

  • On Google Cloud, Jupiter and Colossus separate compute and storage to help Dataproc clusters scale VMs
  • Colossus is the internal name for its massively distributed storage layer
  • Jupiter is Google's network inside the data center

Data Management

  • Data management has followed a historical continuum
  • Before distributed processing, big data meant big databases
  • Database design reflected storage that was cheap and processing that was expensive
  • Around 2006, distributed processing of big data became practical with Hadoop
  • Around 2010, BigQuery was the first of many big data services developed by Google
  • Around 2015, Google launched Dataproc for creating Hadoop and Spark clusters and managing data processing workloads

Cloud Storage as Storage

  • In the cloud, Hadoop separates compute and storage, so clusters can be treated as ephemeral resources
  • Cloud Storage is completely scalable and can be connected to other projects

Drop-in Replacement

  • Cloud Storage is a drop-in replacement for HDFS backend for Hadoop
  • Replace "hdfs://" with the "gs://" prefix in your code when referencing objects in Cloud Storage
  • You can also install Cloud Storage Connector manually on non-cloud Hadoop clusters instead of migrating clusters to Cloud
  • With HDFS, you overprovision storage and pay for persistent disks throughout
  • With Cloud Storage, you use a pay-as-you-go model (an example of the prefix change follows below)
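
Because only the prefix changes, the same Hadoop tooling works against either file system once the connector is available; the bucket name below is a placeholder.

    # Identical command, two storage backends -- only the prefix differs.
    hadoop fs -ls hdfs:///data/logs/
    hadoop fs -ls gs://example-bucket/data/logs/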

Performance

  • Cloud Storage is optimized for bulk, parallel operations: it has high throughput but comes with per-request latency
  • Avoid iterating sequentially over many nested directories in a single job
  • Avoid small reads by using large block sizes (see the sketch below)
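
One way to favor larger reads is to set the connector's block-size property when creating the cluster. This is a sketch: it assumes the Cloud Storage connector's fs.gs.block.size setting, passed through Dataproc's --properties flag; the value shown is an example.

    # The core: prefix targets core-site.xml; 134217728 bytes = 128 MB.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --properties=core:fs.gs.block.size=134217728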

HDFS vs. Cloud Storage

  • Using Cloud Storage instead of HDFS benefits from its distributed nature, removing single points of failure
  • However, Cloud Storage lacks actual directories
  • Renaming objects is not supported the way directory renames are in HDFS

File Directories

  • Cloud Storage is at its core an object store that only simulates file directories
  • Renames in HDFS are not the same operation as renames in Cloud Storage
  • Prefer object-store-oriented output that writes new objects rather than renaming existing ones
  • Migrated code may need to handle list inconsistency during renames

Moving Data

  • DistCp is the key tool for moving data: use a push-based model for data you know you will need, and leave rarely used data to be pulled on demand (see the sketch below)
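
A minimal push-model sketch, run from the on-premises cluster (assuming the Cloud Storage connector is installed there); the host, paths, and bucket are placeholders.

    # Push data from on-premises HDFS into a Cloud Storage bucket.
    hadoop distcp hdfs://on-prem-namenode:8020/data/logs gs://example-bucket/data/logs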

Optimizing Dataproc

  • Dataproc can be optimized via configuration and setup
  • Dataproc's auto-zone feature can choose the zone for you, but you still need to consider the data's physical location
  • Make sure the Cloud Storage bucket is in the same region as the Dataproc region

Performance Questions

  • First, determine where your data is and where your cluster is
  • Ensure Cloud Storage traffic flows directly between Cloud Storage and Compute Engine, with no network rules or routes that funnel it elsewhere
  • Make sure you are not dealing with more than roughly 10,000 input files
  • If you are over that amount, combine or union the data into larger files
  • Persistent disks sized for only a small quantity of data can limit throughput
  • Allocate enough virtual machines, or resize the cluster

HDFS File Systems

  • The cluster's local HDFS is a good choice when:
    • Jobs require a lot of metadata operations
    • You modify the HDFS data continuously or rename directories
    • You heavily depend on the append operation for HDFS files
    • You have I/O-intensive workloads
    • You have workloads that are sensitive to latency

Benefits of Cloud Storage Implementation

  • Cloud Storage works well as both the initial and final destination of data in a big-data pipeline
  • For example, if a workflow contains five Spark jobs in series, the first job retrieves the initial data from Cloud Storage, intermediate data is written to HDFS, and the final job writes its results to Cloud Storage

Reduce Cost

  • Using Cloud Storage reduces disk requirements and saves costs with Dataproc
  • Keeping data on Cloud Storage instead of the local HDFS allows smaller disks and separates storage from compute
  • Even so, local HDFS is still used for certain operations, including:
    • Control and recovery files
    • Log aggregation
    • Shuffle space

HDFS Options

  • Resizing options include:
    • Decrease the total size of local HDFS by decreasing persistent disk size
    • Increase the total size of local HDFS by increasing persistent disk size
    • Attach SSDs for HDFS, depending on how I/O-intensive the workload is
    • SSD-backed HDFS can be paired with whatever CPU and memory configuration the workload requires

Geography Impacts

  • Geographic regions impact the efficiency of your solution, with repercussions across jobs:
    • Request latency increases
    • Data can proliferate across locations
    • Performance can degrade

Different Storage

  • Several storage options are available in Google Cloud:
  • Cloud Storage
    • The primary data store within Google Cloud for unstructured data
  • Bigtable
    • Stores large amounts of sparse data
    • HBase-compliant
    • Low latency and high scalability
  • BigQuery
    • A great option for data warehousing
    • Strong API connections, and data can be pushed into BigQuery

Replicating and Persisting

  • Replicating and persisting an on-premises setup in the cloud has drawbacks
  • A persistent Dataproc cluster might not resolve problems the way it seems to
  • That style of approach has inherent limitations
  • Keeping data in HDFS will always cost extra in the cloud
  • Keeping data in HDFS also limits the ability to combine it with other services

Hadoop Setup

  • The most effective way to migrate Hadoop to the cloud is to design small, short-lived clusters for specific jobs
  • The shift is away from multi-purpose, persistent clusters toward smaller, specialized ephemeral clusters

Data Storage

  • Cloud Storage supports multiple temporary processing clusters
    • In the ephemeral job-scoped model, clusters are allocated per job and deleted when the jobs finish

Time Limit

  • Set a time limit so clusters do not keep running when their resources are not being utilized

Recurring Timed Events

  • Automated turn-down can be configured for all types of jobs, based on:
    • A timer or timestamp
    • A duration in seconds to wait before turning the cluster down

Approach with Workflows

  • The recommended shift in approach and workflow on Google Cloud is:
    • Away from monolithic, persistent clusters
    • Toward specialized, ephemeral clusters and cloud-native workflows

More on Shifting

  • Customers should move to an ephemeral model: with Dataproc's fast boot times, persistent clusters become a waste, and clusters can be created or resized at any time

Workflow Storage

  • Separating storage from compute is what makes ephemeral clusters possible, and it is the recommended model for workflows

Clusters and jobs

  • Clusters and jobs can be split apart
  • Decompose work into job-scoped clusters
  • Isolated environments let each job run on its own cluster
  • Different clusters can read from the same data in Cloud Storage to serve their jobs

Cluster Lifetime

  • Scoping one cluster per job creates new opportunities between runs, since results persist in off-cluster storage

Dataproc in the Cloud

  • With correctly set-up clusters, jobs run over their own timeline:
    • Send input data up to Cloud Storage
    • Run the jobs, sending output to Cloud Storage
    • Delete the cluster when the jobs complete
    • Retrieve job output from Cloud Storage
    • Use Cloud Logging and Cloud Monitoring for diagnostics

Persistent Clusters

  • If persistent clusters are required, keep them lean:
    • Size clusters to the jobs
    • Scale with the workloads
    • Add clusters and jobs only as needed
    • Use autoscaling (see the sketch below)
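
A minimal autoscaling sketch: define a policy (field names follow Dataproc's basic YARN scaling algorithm; values are examples), import it, and attach it to a cluster.

    # Define the policy as YAML.
    cat > autoscaling-policy.yaml <<'EOF'
    workerConfig:
      minInstances: 2
      maxInstances: 10
    secondaryWorkerConfig:
      maxInstances: 20
    basicAlgorithm:
      cooldownPeriod: 2m          # settle time between evaluations
      yarnConfig:
        scaleUpFactor: 0.5        # fraction of pending YARN memory to add
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 1h
    EOF

    # Import the policy, then attach it at cluster creation.
    gcloud dataproc autoscaling-policies import example-policy \
        --region=us-central1 --source=autoscaling-policy.yaml
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 --autoscaling-policy=example-policy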

Workflow Template Example

  • A template file in YAML is processed as a Directed Acyclic Graph (DAG); it can:
    • Create a new cluster or select an existing one
    • Submit jobs
    • Hold on and let the jobs run
    • Delete the cluster when done
  • The workflow-templates commands and API let users view and use existing or new workflows
  • A workflow only becomes active when instantiated, at which point the DAG runs (see the sketch below)
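
A minimal workflow-template sketch: the YAML below describes a managed (ephemeral) cluster and one job, and instantiating it runs the DAG (create, submit, wait, delete). Field names follow the workflow-template schema; names and values are examples.

    # Template: managed cluster plus one example job.
    cat > workflow-template.yaml <<'EOF'
    placement:
      managedCluster:
        clusterName: ephemeral-cluster
        config:
          gceClusterConfig:
            zoneUri: us-central1-a
    jobs:
      - stepId: example-step
        hadoopJob:
          mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
          args: [teragen, "1000", hdfs:///gen/]
    EOF

    # Instantiate: the cluster is created, the job runs, the cluster is deleted.
    gcloud dataproc workflow-templates instantiate-from-file \
        --file=workflow-template.yaml --region=us-central1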
