Google Cloud Dataproc and Apache Spark

Questions and Answers

Which Google Cloud service is a managed Hadoop service?

  • Compute Engine
  • Dataproc (correct)
  • Cloud Storage
  • BigQuery

Which of the following is a component of the Hadoop ecosystem?

  • Docker
  • TensorFlow
  • Kubernetes
  • Spark (correct)

What does HDFS stand for?

  • High-performance Data Storage
  • Hybrid Data Filtering Service
  • Hierarchical Data Filing System
  • Hadoop Distributed File System (correct)

Which of the following is NOT a limitation of on-premises Hadoop clusters?

Answer: Automatic version management

Which of the following is a benefit of using Dataproc?

Answer: Built-in support for Hadoop

What type of programming model does Spark use?

Answer: Declarative

Which of the following features is associated with Dataproc?

Answer: Low cost

What is the average time Dataproc clusters take to start, scale, and shut down?

Answer: 90 seconds or less

Besides Spark, which other components are frequently updated in Dataproc?

Answer: Hadoop, Pig, and Hive

What is a key advantage of using Cloud Storage over HDFS in Dataproc?

Answer: Separation of compute and storage

Which of the following is a valid way to create a Dataproc cluster?

Answer: All of the above

What is the default minimum number of worker nodes in Dataproc?

Answer: 2

What must an application be designed for if using preemptible VMs?

Answer: Resilience to prevent data loss

What can custom machine types be used for in Dataproc?

Answer: To specify memory and CPU balance

Which of the following ways can jobs be submitted?

Answer: All of the above

Before 2006, what characterized big data storage and processing?

Answer: Cheap storage, expensive processing

What action is involved when setting up Dataproc?

Answer: Creating a cluster

As a best practice, how should Cloud Storage traffic be routed?

Answer: Directly between Cloud Storage and Compute Engine

Why should the number of input files and Hadoop partitions be controlled?

Answer: To enhance performance

In Dataproc, what does the primary node contain?

Answer: The HDFS NameNode

What are cluster properties used for?

Answer: Run-time values for dynamic startup options

For cost effectiveness, how should you treat Dataproc processing clusters?

Answer: Short-lived

How can you adapt existing Hadoop code to work with Cloud Storage?

Answer: Change the prefix from hdfs:// to gs://

What should be avoided when using Cloud Storage?

Answer: Small reads

What is the suggested action for on-prem data you know you will need?

Answer: A push-based model

What is the risk if Dataproc clusters are geographically distant?

Answer: Increased request latency

What type of cluster configuration is best suited for Dataproc?

Answer: Ephemeral clusters

What is the first step to using ephemeral clusters for Dataproc?

Answer: Create a configured cluster

Which configuration issue can cause benchmarks to run slower than expected?

Answer: A persistent disk sized too small for the quantity of data

What type of VMs should be used to create the smallest possible cluster?

Answer: Preemptible

Which of the following can be done to reduce costs with Cloud Storage?

Answer: Treat the clusters as ephemeral resources

What provides a consolidated and concise view of all logs?

Answer: Cloud Logging

Which is a key tool to be aware of for moving data?

Answer: DistCp

What type of data is Cloud Storage primarily designed to store?

Answer: Unstructured data

Which of the following is an alternative storage option to Cloud Storage?

Answer: Bigtable

What is a use for BigQuery?

Answer: Data warehousing

What is used with autoscaling to assist scaling?

Answer: Hadoop YARN Metrics

What period is available to let things settle before autoscaling evaluation occurs again?

Answer: The cooldown period

What is the amount of time in seconds to wait before automatically turning down the cluster?

Answer: Duration

Which of the following is a characteristic of a Dataproc Workflow Template?

Answer: YAML file

What is Dataproc?

Answer: A managed Hadoop service by Google Cloud

What is Spark?

Answer: A fast, in-memory data processing engine

Which of these components is part of the Hadoop ecosystem?

Answer: Apache Spark

What is the main function of HDFS in the Hadoop ecosystem?

Answer: Distributed data storage

What is one advantage of Dataproc over on-premises Hadoop clusters?

Answer: Requires less tuning

What is a key benefit of using Cloud Storage instead of HDFS in Dataproc?

Answer: Separation of compute and storage

What does 'elastic' refer to in the context of cloud computing?

Answer: Ability to quickly scale

What is a typical use case for Dataproc?

Answer: Running Hadoop and Spark workloads

What does second-by-second billing mean in Dataproc?

Answer: You are charged for the exact time the resources are used

Which of these is a key feature of Dataproc?

Answer: Built-in support for Hadoop

What is the purpose of initialization actions in Dataproc?

Answer: Customize cluster software

What is the recommendation for using Dataproc clusters?

Answer: Short-lived clusters only when needed

What happens to data stored in HDFS when a Dataproc cluster is turned off?

Answer: Data is lost

How can existing Hadoop code be adapted to work with Cloud Storage?

Answer: By changing the prefix

What does setting a regional endpoint offer?

Answer: Increased isolation

What is the default HDFS replication factor in Dataproc?

Answer: 2

What should be specified to tune a VM to the load and avoid wasting resources?

Answer: Custom machine types

What can be used to specify executables or scripts Dataproc will run on all nodes?

Answer: Initialization actions

Which is a way to monitor your job?

Answer: All of the above

What is the recommendation on where to put initial data in a big-data pipeline?

Answer: Cloud Storage

What type of data in Google Cloud is Cloud Storage designed to store?

Answer: Unstructured data

What is the key benefit of using ephemeral clusters?

Answer: Reduced costs

What is the purpose of a Dataproc Workflow Template?

Answer: It processes data via a Directed Acyclic Graph (DAG)

In Dataproc autoscaling, what determines how many nodes to launch?

Answer: scale_up.factor

In Google Cloud, what delivers over 1 petabit per second of bandwidth?

Answer: The Jupiter networking fabric

Is HDFS replication still needed if your cluster is highly available?

Answer: Yes, the replication factor is still 2

What allows you to reduce the disk requirements and save costs when using Dataproc?

Answer: Using Cloud Storage

Why shouldn't you use Hadoop's direct interfaces to submit jobs?

Answer: They bypass Dataproc security

What can happen if services are geographically distant?

Answer: Some Google services might copy all of the data from another zone

What action should you take when there are more than ten thousand input files?

Answer: Combine or union the data into larger files

What is one of the first optimization areas in Dataproc?

Answer: Lifting Hadoop workloads

What is a best practice for moving data?

Answer: Using DistCp

Using DistCp, what is the best way to move data you will always need?

Answer: Using a push-based model

If you need to pre-install software, what reduces time for customized code to be operational?

Answer: Custom images

If a workflow contains five jobs in series, where is the initial data retrieved from?

Answer: Cloud Storage

What should not be done with Cloud Storage?

Answer: Iterating sequentially

Flashcards

What is Dataproc?

Google Cloud's managed Hadoop service.

What is Apache Hadoop?

Open source software project that maintains the framework for distributed processing of large datasets across clusters of computers.

What is HDFS?

The main file system Hadoop uses for distributing work to nodes on the cluster.

What is Apache Spark?

An open source software project that provides a high performance analytics engine for processing batch and streaming data.

What are limitations of on-premises Hadoop clusters?

On-premises Hadoop clusters have no separation between storage and compute resources, are hard to scale fast, and have capacity limits.

How does Dataproc simplify Hadoop?

Dataproc simplifies Hadoop workloads on Google Cloud with built-in support for Hadoop, managed hardware, simplified version management and flexible job configuration.

What is declarative programming?

Declarative programming is where you tell the system what you want and it figures out how to get it done; with imperative programming, you tell the system how to do it.

Dataproc's main function?

Dataproc is a managed service for running Hadoop and Spark data processing workloads.

Dataproc's advantages?

Dataproc lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning.

How does Dataproc save money?

Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them.

Dataproc's Pricing?

Dataproc is priced at 1 cent per virtual CPU per cluster per hour with second-by-second billing.

Dataproc's open ecosystem?

Dataproc provides frequent updates to native versions of Spark, Hadoop, Pig, and Hive so there is no need to learn new tools or APIs.

Dataproc version control?

Dataproc allows image versioning to switch between different versions of Apache Spark, Apache Hadoop, and other tools.

Dataproc Availability

Dataproc lets you run clusters with multiple primary nodes and set jobs to restart on failure, ensuring your clusters and jobs are highly available.

Dataproc initialization actions?

Dataproc allows initialization actions to install or customize settings and libraries when a cluster is created.

Customizing Dataproc?

Dataproc has two ways to customize clusters: optional components and initialization actions.

Customize cluster

Use initialization actions to add other software to the cluster at startup, using the Cloud SDK.

Dataproc cluster components?

A Dataproc cluster has manager nodes, workers, and HDFS.

Cloud Storage instead of HDFS?

Instead of native HDFS on the cluster, you can simply use Cloud Storage via the connector.

Dataproc Workflow

Using Dataproc involves this sequence of events: Setup, Configuration, Optimization, Utilization, and Monitoring.

Creating a cluster

You can create a cluster through the Cloud Console, or from the command line using the gcloud command, or use the REST API.

Dataproc regions

Choose a region and zone, or select a 'global region' when configuring

Custom worker nodes

You can run initialization actions to further customize the worker nodes.

Use Preemptible VMs

Preemptible VMs can be used to lower costs, but they can be pulled from service at any time (and always within 24 hours).

Apache Spark

A simple explanation of Spark is that it is able to mix different kinds of applications and to adjust how it uses the available resources dynamically

Advantage of HDFS

You can use HDFS in the cloud just by lifting and shifting your Hadoop workloads to Dataproc, requiring no code changes.

Jupiter Fabric

The Jupiter networking fabric within a Google data center delivers over 1 petabit per second of bandwidth.

Dataproc and Colossus

Dataproc clusters scale VMs up and down as the compute requires, while persistent storage is handed off to Colossus, Google Cloud's storage layer, behind the scenes.

Bigtable and BigQuery

You can use Bigtable to store large amounts of sparse data and BigQuery can be used for data warehousing.

Advantage of Dataproc

With Cloud Storage as the backend, you can treat clusters themselves as ephemeral resources, which allows you not to pay for compute capacity when you're not running any jobs.

DistCp

DistCp is the key tool for moving data to Google Cloud.

Preemptible system

Preemptible VMs suit fault-tolerant systems at lower cost, and you have the option to schedule automatic deletion to improve utilization of your resources.

Key step

The key step is the shift away from monolithic, persistent clusters to specialized, ephemeral clusters that exist only as long as you need them.

Workflow template

A Dataproc Workflow Template is a YAML file that is processed through a Directed Acyclic Graph or DAG

Autoscaling

Dataproc autoscaling provides clusters that size themselves to the workload: jobs are "fire and forget," no manual intervention is needed, and resources and costs are saved.

Dataproc Errors

The best way to find what error caused a Spark job failure is to look at the driver output and the logs generated by the Spark executors

Setting log levels

You can set the driver log level at submission time using the gcloud command: gcloud dataproc jobs submit hadoop with the --driver-log-levels parameter.

Operations in cloud

Cloud operations include logging and monitoring

Regions impact solution

Geographical regions can impact the efficiency of your solution

Study Notes

  • This module discusses Google Cloud's Dataproc, a managed Hadoop service, with a focus on Apache Spark

Module Agenda

  • Coverage includes the Hadoop Ecosystem
  • Running Hadoop on Dataproc and understanding its operation
  • Benefits of Cloud Storage instead of HDFS will be covered
  • Optimization of Dataproc and completion of a hands-on lab with Apache Spark will be reviewed

The Hadoop Ecosystem

  • The Hadoop ecosystem was developed out of the need to analyze large datasets via distributed processing
  • Before 2006, storage was cheap, processing was expensive, so data was copied to the processor for analysis
  • Around 2006, distributed processing became practical with Hadoop for big data
  • Hadoop creates a computer cluster, and uses distributed processing
  • HDFS (Hadoop Distributed File System) stores data on cluster machines
  • MapReduce provides distributed data processing
  • An entire ecosystem of Hadoop-related software grew, such as Hive, Pig, and Spark

Big Data Workloads

  • Hadoop is used by organizations for on-premises big data workloads
  • Clusters run a range of applications, like Presto, and Spark
  • Apache Hadoop is an open-source software project that provides the framework for the distributed processing of large datasets
  • Apache Spark is an open-source project that provides a high-performance analytics engine for batch and streaming data
  • Spark can be 100x faster than Hadoop jobs using in-memory processing
  • Spark is powerful and expressive, is used for many workloads, and provides methods for dealing with Resilient Distributed Datasets (RDDs) and DataFrames
  • Open-source Hadoop carries complexity and overhead from its data-center design assumptions, but better options relieve those limitations
  • Two common issues with OSS Hadoop are tuning and utilization; Dataproc can overcome these limitations

Hadoop Elasticity

  • On-premises Hadoop clusters are not elastic due to their physical nature
  • On-premises clusters have no separation of storage and compute resources
  • On-premises clusters are difficult to scale fast and have capacity limits
  • The only way to increase capacity in an on-premises cluster is to add more physical servers

Dataproc on Google Cloud

  • Dataproc on Google Cloud simplifies Hadoop workloads
  • Built-in support for Hadoop and managed Hadoop and Spark environment
  • You specify cluster configuration, Dataproc allocates resources, and you can scale your cluster
  • Dataproc versioning manages open-source tools
  • You can focus on your use cases instead of managing dependencies and software configuration interactions

Apache Spark

  • Apache Spark is a popular, flexible, and powerful way to process large datasets
  • Spark avoids the need to tune the Hadoop system for efficient resource use
  • Spark mixes application types and adjusts how it uses the available resources dynamically
  • Spark uses a declarative programming model (see the sketch below)
  • Spark has a full SQL implementation
  • A common DataFrame model works across Scala, Java, Python, SQL, and R
  • There is a distributed machine learning library called Spark MLlib
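
To make the declarative model concrete, here is a minimal sketch of submitting a SQL query to a Dataproc cluster; the cluster name, region, and the "logs" table are hypothetical placeholders.

    # Declarative: describe the result you want, not how to compute it.
    gcloud dataproc jobs submit spark-sql \
        --cluster=example-cluster \
        --region=us-central1 \
        -e "SELECT status, COUNT(*) AS hits FROM logs GROUP BY status"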

Running Hadoop on Dataproc

  • Existing Hadoop job code can be processed in the cloud using Dataproc on Google Cloud

Dataproc Features

  • Dataproc is a managed service for running Hadoop and Spark data processing workloads
  • It enables the use of open-source data tools for batch processing, querying, streaming, and machine learning
  • It quickly creates clusters, manages them easily, and saves money by turning them off when not needed
  • It has unique advantages compared to competing cloud services and traditional on-premises products
  • There is no need to learn new tools or APIs, making it easy to move existing projects into Dataproc without redevelopment; Spark, Hadoop, Hive, and Pig are frequently updated

Additional Dataproc Features

  • Key features of Dataproc are listed
  • Low cost: Priced at 1 cent per virtual CPU per cluster per hour on top of Google Cloud resources used with second-by-second billing and one-minute minimum
  • Super-fast: Clusters start, scale, and shut down in 90 seconds or less on average
  • Resizable clusters: Created and scaled quickly with a variety of virtual machine types, disk sizes, number of nodes, and networking options available
  • Open source ecosystem: Use Spark and Hadoop tools, libraries, and documentation; frequent updates to native versions of Spark, Hadoop, Pig, and Hive; and move projects or ETL pipelines without redevelopment
  • Integrated: Built-in integration with Cloud Storage, BigQuery, and Bigtable ensures data will not be lost, and you can ETL terabytes of raw log data into these services
  • Managed: Easily interact with clusters and Spark or Hadoop jobs; turn off when complete
  • Versioning: Image versioning to switch between different versions of Apache Spark, Apache Hadoop, and other tools
  • Highly available: Run clusters with multiple primary nodes and set jobs to restart on failure
  • Developer tools: Multiple ways to manage a cluster are available
  • Initialization actions are available to customize the settings
  • Automatic or manual configuration option

OSS Options

  • Dataproc has options to customize clusters such as optional components and initialization actions
  • Pre-configured optional components can be selected when deploying from the console or via the command line
  • Initialization actions customize the Dataproc cluster immediately after setup

Cluster Customization

  • Initialization actions install additional components at startup
  • The Cloud SDK is used to create the Dataproc cluster with the actions attached
  • An HBase shell script, for example, can be specified to run at cluster initialization
  • Many pre-built startup scripts can be leveraged for Hadoop cluster setup tasks, like Flink, Jupyter, and more
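
A minimal sketch of attaching an initialization action at cluster creation; the bucket and script names are hypothetical placeholders.

    # The referenced script runs on each node as it starts up.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --initialization-actions=gs://example-bucket/install-hbase-shell.sh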

Dataproc Clusters

  • A Dataproc cluster contains manager nodes, workers, and HDFS
  • It can contain either preemptible secondary workers or non-preemptible secondary workers, but not both
  • The cluster is a set of virtual machines providing persistent storage through HDFS, plus manager and worker nodes
  • Worker nodes are part of a managed instance group where virtual machines share the same template allowing auto resizing based on demand
  • Google Cloud recommends a ratio of 60/40 as maximum between standard VMs and preemptible VMs
  • Spin up the cluster for computing, then turn it down after jobs complete
  • HDFS storage disappears when a cluster is turned down, so use off-cluster storage

HDFS Connectors

  • Instead of native HDFS on a cluster, Cloud Storage may be used via the connector
  • Adapt existing Hadoop code to use Cloud Storage instead of HDFS by changing the storage prefix from hdfs:// to gs://
  • Consider writing to Bigtable instead of HBase off-cluster
  • For analytical workloads, write the data into BigQuery

Using Dataproc

  • Using Dataproc involves setup, configuration, optimization, utilization, and monitoring

Dataproc Setup

  • Setup means creating a cluster: through the Cloud Console, from the command line using the gcloud command, from an exported YAML file, or with a Terraform configuration (see the sketch below)
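
As a sketch of two of these surfaces, the first command below creates a minimal cluster from the command line and the second exports its configuration as a YAML file for reuse; names and locations are placeholders.

    # Create a small cluster from the command line.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --zone=us-central1-a \
        --num-workers=2

    # Export the cluster's configuration to YAML for later reuse.
    gcloud dataproc clusters export example-cluster \
        --region=us-central1 \
        --destination=cluster-config.yaml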

Dataproc Configuration

  • A cluster can be set up as a single VM, which keeps expenses down for development and experimentation
  • Standard clusters include a single Primary Node; High Availability clusters have three Primary Nodes
  • Choose a region and zone, or select a "global region" to allow the service to choose the zone
  • The cluster defaults to a Global Endpoint, but defining a Regional Endpoint offers increased isolation and, in some cases, lower latency
  • The Primary Node is where the HDFS NameNode runs, as well as the YARN node and job drivers
  • The default HDFS replication factor in Dataproc is 2
  • Optional Hadoop-ecosystem components include Anaconda, Hive WebHCat, Jupyter Notebook, and Zeppelin Notebook
  • User labels can tag the cluster; cluster properties are run-time values used by configuration files for more dynamic startup options
  • vCPU, memory, and storage are configured separately for Primary, Worker, and Preemptible Worker nodes
  • Preemptible nodes include the YARN NodeManager but do not run HDFS
  • The minimum default number of worker nodes is 2; the maximum is determined by quota and the number of SSDs attached to each worker (see the sketch below)
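
A configuration sketch pulling several of these options together: machine shapes for the primary and worker nodes, the default two workers, and secondary workers (preemptible by default); all names and values are examples.

    # The custom machine type format custom-VCPUS-MEMORY_MB tunes CPU/memory balance.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --master-machine-type=n1-standard-4 \
        --worker-machine-type=custom-4-8192 \
        --num-workers=2 \
        --num-secondary-workers=2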

Configuration Additional Options

  • It is also possible to specify initialization actions and metadata
  • Initialization scripts can further customize the worker nodes
  • Metadata allows VMs to share state information

Dataproc Optimization

  • Preemptible VMs can be used to lower costs, but they can be pulled from service at any time (and always within 24 hours), so applications must be designed for resilience to prevent data loss
  • Custom machine types provide the balance of memory and CPU to tune the VM to the load
  • A custom image can be used to pre-install software
  • A Persistent SSD boot disk helps to quickly boot the cluster

Utilize Job Submission

  • Jobs can be submitted through cloud console, gcloud command, or REST APIs
  • Started by orchestration services like Dataproc Workflow and Cloud Composer
  • Do not use Hadoop's direct interfaces to submit jobs, because doing so bypasses Dataproc's default security
  • Restartable jobs must be designed to be idempotent and to detect successorship and restore state; they can only be enabled through the command line or REST API
  • Jobs are not restartable by default (see the sketch below)
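
A job-submission sketch using the SparkPi example jar that ships on Dataproc images; the --max-failures-per-hour flag opts the job into automatic restarts (cluster name and values are placeholders).

    # Submit a Spark job and allow up to five automatic restarts per hour.
    gcloud dataproc jobs submit spark \
        --cluster=example-cluster \
        --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        --max-failures-per-hour=5 \
        -- 1000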

Dataproc Monitoring

  • Monitoring can be done by Cloud Monitoring after submitting the job
  • Customized emails can be used to notify users of alerts
  • Details from HDFS, YARN and metrics about utilization can be all monitored with Cloud Monitoring

Google Cloud Monitoring

  • The Cloud Console shows job status without requiring connections to web interfaces
  • Debug log levels can be set when a job is submitted from the command line (see the sketch below)
  • YARN log aggregation is used by default, and Cloud Logging provides a consolidated view of all logs
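
A sketch of setting driver log levels at submission time, per the gcloud command mentioned in the flashcards; the jar location and package name are hypothetical.

    # Raise or lower log verbosity per package for the job driver.
    gcloud dataproc jobs submit hadoop \
        --cluster=example-cluster \
        --region=us-central1 \
        --driver-log-levels=root=FATAL,com.example=INFO \
        --jar=gs://example-bucket/my-job.jar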

Cloud Storage

  • Prefer Cloud Storage over the native Hadoop file system (HDFS) wherever possible

Hadoop & MapReduce

  • The original MapReduce paper was designed for a world where data was local to the compute machine
  • Petabit networking makes it possible to treat storage and compute independently and move traffic efficiently over the network

HDFS in the Cloud

  • HDFS in the cloud is a sub-par solution in the long run because of how the cluster has to work to serve it
  • You can run HDFS in the cloud just by lifting and shifting your Hadoop workloads to Dataproc, requiring no code changes; however, it is sub-par in the long run

Why HDFS in the Cloud Is Sub-Par

  • Block size ties input and output performance to the server's actual hardware
  • Storage is not elastic, which forces cluster resizes and extra compute
  • Data locality raises similar concerns, because data is tied to the cluster's disks
  • For HDFS to be highly available, a separate storage solution is the better choice

Google's Network

  • Google's network enables new solutions for Big Data, and provides one petabit per second of bandwidth within data centers
  • Full bisectional bandwidth means each machine can communicate with any server at full network speed
  • With enough network bandwidth, data can be used directly from where it is stored

Google Cloud

  • On Google Cloud, Jupiter and Colossus separate compute and storage to help Dataproc clusters scale VMs
  • Colossus is the internal name for its massively distributed storage layer
  • Jupiter is Google's network inside the data center

Data Management

  • Data management has followed a historical continuum
  • Before distributed processing, big data meant big databases
  • Database design reflected storage that was cheap and processing that was expensive
  • Around 2006, distributed processing of big data became practical with Hadoop
  • Around 2010, BigQuery was the first of many big data services developed by Google
  • Around 2015, Google launched Dataproc for creating Hadoop and Spark clusters and managing data processing workloads

Cloud Storage as Storage

  • In the cloud, Hadoop separates compute and storage, so clusters can be treated as ephemeral resources
  • Cloud Storage is completely scalable and can be connected to other projects

Drop-in Replacement

  • Cloud Storage is a drop-in replacement for HDFS backend for Hadoop
  • Replace "hdfs://" with the "gs://" prefix in your code when referencing objects in Cloud Storage
  • You can also install Cloud Storage Connector manually on non-cloud Hadoop clusters instead of migrating clusters to Cloud
  • With HDFS, you overprovision storage and pay for persistent disks throughout
  • With Cloud Storage, you use a pay-as-you-go model (an example of the prefix change follows below)
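
Because only the prefix changes, the same Hadoop tooling works against either file system once the connector is available; the bucket name below is a placeholder.

    # Identical command, two storage backends -- only the prefix differs.
    hadoop fs -ls hdfs:///data/logs/
    hadoop fs -ls gs://example-bucket/data/logs/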

Performance

  • Cloud Storage is optimized for bulk, parallel operations: it has high throughput but comes with per-request latency
  • Avoid iterating sequentially over many nested directories in a single job
  • Avoid small reads by using large block sizes (see the sketch below)
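
One way to favor larger reads is to set the connector's block-size property when creating the cluster. This is a sketch: it assumes the Cloud Storage connector's fs.gs.block.size setting, passed through Dataproc's --properties flag; the value shown is an example.

    # The core: prefix targets core-site.xml; 134217728 bytes = 128 MB.
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 \
        --properties=core:fs.gs.block.size=134217728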

HDFS vs. Cloud Storage

  • Using Cloud Storage instead of HDFS benefits from its distributed nature, removing single points of failure
  • However, Cloud Storage lacks actual directories
  • Renaming objects is not supported the way directory renames are in HDFS

File Directories

  • Cloud Storage is at its core an object store that only simulates file directories
  • Renames in HDFS are not the same operation as renames in Cloud Storage
  • Prefer object-store-oriented output that writes new objects rather than renaming existing ones
  • Migrated code may need to handle list inconsistency during renames

Moving Data

  • DistCp is the key tool for moving data: use a push-based model for data you know you will need, and leave rarely used data to be pulled on demand (see the sketch below)
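
A minimal push-model sketch, run from the on-premises cluster (assuming the Cloud Storage connector is installed there); the host, paths, and bucket are placeholders.

    # Push data from on-premises HDFS into a Cloud Storage bucket.
    hadoop distcp hdfs://on-prem-namenode:8020/data/logs gs://example-bucket/data/logs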

Optimizing Dataproc

  • Dataproc can be optimized via configuration and setup
  • Dataproc's auto-zone feature can choose the zone for you, but you still need to consider the data's physical location
  • Make sure the Cloud Storage bucket is in the same region as the Dataproc region

Performance Questions

  • First, determine where your data is and where your cluster is
  • Ensure Cloud Storage traffic flows directly between Cloud Storage and Compute Engine, with no network rules or routes that funnel it elsewhere
  • Make sure you are not dealing with more than roughly 10,000 input files
  • If you are over that amount, combine or union the data into larger files
  • Persistent disks sized for only a small quantity of data can limit throughput
  • Allocate enough virtual machines, or resize the cluster

HDFS File Systems

  • The cluster's local HDFS is a good choice when:
    • Jobs require a lot of metadata operations
    • You modify the HDFS data continuously or rename directories
    • You heavily depend on the append operation for HDFS files
    • You have I/O-intensive workloads
    • You have workloads that are sensitive to latency

Benefits of Cloud Storage Implementation

  • Cloud Storage works well as both the initial and final destination of data in a big-data pipeline
  • For example, if a workflow contains five Spark jobs in series, the first job retrieves the initial data from Cloud Storage, intermediate data is written to HDFS, and the final job writes its results to Cloud Storage

Reduce Cost

  • Using Cloud Storage reduces disk requirements and saves costs with Dataproc
  • Keeping data on Cloud Storage instead of the local HDFS allows smaller disks and separates storage from compute
  • Even so, local HDFS is still used for certain operations, including:
    • Control and recovery files
    • Log aggregation
    • Shuffle space

HDFS Options

  • Resizing options include:
    • Decrease the total size of local HDFS by decreasing persistent disk size
    • Increase the total size of local HDFS by increasing persistent disk size
    • Attach SSDs for HDFS, depending on how I/O-intensive the workload is
    • SSD-backed HDFS can be paired with whatever CPU and memory configuration the workload requires

Geography Impacts

  • Geographic regions impact the efficiency of your solution, with repercussions across jobs:
    • Request latency increases
    • Data can proliferate across locations
    • Performance can degrade

Different Storage

  • Several storage options are available in Google Cloud:
  • Cloud Storage
    • The primary data store within Google Cloud for unstructured data
  • Bigtable
    • Stores large amounts of sparse data
    • HBase-compliant
    • Low latency and high scalability
  • BigQuery
    • A great option for data warehousing
    • Strong API connections, and data can be pushed into BigQuery

Replicating and Persisting

  • Replicating and persisting an on-premises setup in the cloud has drawbacks
  • A persistent Dataproc cluster might not resolve problems the way it seems to
  • That style of approach has inherent limitations
  • Keeping data in HDFS will always cost extra in the cloud
  • Keeping data in HDFS also limits the ability to combine it with other services

Hadoop Setup

  • The most effective way to migrate Hadoop to the cloud is to design small, short-lived clusters for specific jobs
  • The shift is away from multi-purpose, persistent clusters toward smaller, specialized ephemeral clusters

Data Storage

  • Cloud Storage supports multiple temporary processing clusters
    • In the ephemeral job-scoped model, clusters are allocated per job and deleted when the jobs finish

Time Limit

  • Set a time limit so clusters do not keep running when their resources are not being utilized

Recurring Timed Events

  • Automated turn-down can be configured for all types of jobs, based on:
    • A timer or timestamp
    • A duration in seconds to wait before turning the cluster down

Approach with Workflows

  • The recommended shift in approach and workflow on Google Cloud is:
    • Away from monolithic, persistent clusters
    • Toward specialized, ephemeral clusters and cloud-native workflows

More on Shifting

  • Customers should move to an ephemeral model: with Dataproc's fast boot times, persistent clusters become a waste, and clusters can be created or resized at any time

Workflow Storage

  • Separating storage from compute is what makes ephemeral clusters possible, and it is the recommended model for workflows

Clusters and jobs

  • Clusters and jobs can be split apart
  • Decompose work into job-scoped clusters
  • Isolated environments let each job run on its own cluster
  • Different clusters can read from the same data in Cloud Storage to serve their jobs

Cluster Lifetime

  • Scoping one cluster per job creates new opportunities between runs, since results persist in off-cluster storage

Dataproc in the Cloud

  • With correctly set-up clusters, jobs run over their own timeline:
    • Send input data up to Cloud Storage
    • Run the jobs, sending output to Cloud Storage
    • Delete the cluster when the jobs complete
    • Retrieve job output from Cloud Storage
    • Use Cloud Logging and Cloud Monitoring for diagnostics

Persistent Clusters

  • If persistent clusters are required, keep them lean:
    • Size clusters to the jobs
    • Scale with the workloads
    • Add clusters and jobs only as needed
    • Use autoscaling (see the sketch below)
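
A minimal autoscaling sketch: define a policy (field names follow Dataproc's basic YARN scaling algorithm; values are examples), import it, and attach it to a cluster.

    # Define the policy as YAML.
    cat > autoscaling-policy.yaml <<'EOF'
    workerConfig:
      minInstances: 2
      maxInstances: 10
    secondaryWorkerConfig:
      maxInstances: 20
    basicAlgorithm:
      cooldownPeriod: 2m          # settle time between evaluations
      yarnConfig:
        scaleUpFactor: 0.5        # fraction of pending YARN memory to add
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 1h
    EOF

    # Import the policy, then attach it at cluster creation.
    gcloud dataproc autoscaling-policies import example-policy \
        --region=us-central1 --source=autoscaling-policy.yaml
    gcloud dataproc clusters create example-cluster \
        --region=us-central1 --autoscaling-policy=example-policy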

Workflow Template Example

  • A template file in YAML is processed as a Directed Acyclic Graph (DAG); it can:
    • Create a new cluster or select an existing one
    • Submit jobs
    • Hold on and let the jobs run
    • Delete the cluster when done
  • The workflow-templates commands and API let users view and use existing or new workflows
  • A workflow only becomes active when instantiated, at which point the DAG runs (see the sketch below)
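
A minimal workflow-template sketch: the YAML below describes a managed (ephemeral) cluster and one job, and instantiating it runs the DAG (create, submit, wait, delete). Field names follow the workflow-template schema; names and values are examples.

    # Template: managed cluster plus one example job.
    cat > workflow-template.yaml <<'EOF'
    placement:
      managedCluster:
        clusterName: ephemeral-cluster
        config:
          gceClusterConfig:
            zoneUri: us-central1-a
    jobs:
      - stepId: example-step
        hadoopJob:
          mainJarFileUri: file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
          args: [teragen, "1000", hdfs:///gen/]
    EOF

    # Instantiate: the cluster is created, the job runs, the cluster is deleted.
    gcloud dataproc workflow-templates instantiate-from-file \
        --file=workflow-template.yaml --region=us-central1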
