Questions and Answers
What best describes Resilient Distributed Datasets (RDDs)?
Which operation is used to create new RDDs with one-to-one mapping between elements?
How does the runtime manage function execution over RDDs?
Which of the following operations would select elements from an RDD based on a specified condition?
What does the 'groupBy' operation do in the context of RDDs?
What is a characteristic of iterative computations?
Which algorithm example is specifically mentioned in the context of iterative computations?
What is one drawback of interactive computations with Hadoop?
What advantage does Apache Spark offer over traditional disk writing methods?
What programming model does Apache Spark extend?
What is a Resilient Distributed Dataset (RDD) used for in Apache Spark?
Which of the following best describes the querying capabilities in Apache Spark?
Why is writing to disk considered slow and costly in data processing?
What method is used to extract hashtags from tweets in the provided example?
What is the purpose of the foreach method in the hashtag processing pipeline?
In the context of window-based transformations, what does the countByValue() function achieve?
What does the 'window' parameter represent in the window-based transformation?
How frequently does the window function operate in the provided example?
Which of the following correctly describes a DStream?
What is the result of applying the flatMap transformation on a DStream?
What is the first step in processing Twitter data as shown in the provided content?
What is the primary purpose of using state backends in Flink tasks?
What is the main drawback associated with writing to SSDs as mentioned in the content?
In which state backend is data stored solely in RAM?
What storage structure does RocksDB use to manage its data?
What mechanism does RocksDB support for maintaining efficiency in snapshotting?
What is the primary purpose of declaring state in a Flink stream operation?
Which of the following correctly describes managed state in Flink?
In what context does managed state typically operate in Flink?
How does the concept of 'per-key average' relate to managed state in Flink?
What is a significant characteristic of global snapshots in Flink?
What is a key advantage of using managed state in stream processing?
Which statement about the scope of managed state in Flink is accurate?
What role does state play in stream operations with respect to computations?
What is the primary function of global snapshots in dataflow systems?
How does alignment phase contribute to dataflow processing?
What type of storage is typically used for snapshots in dataflow systems?
What does the presence of markers in the dataflow graph indicate?
Why is it important to upgrade software during long-running jobs?
What does decentralized alignment achieve in dataflow systems?
What is the significance of partial channel logging in cyclic graphs?
What challenge does recovering computation after a failure address?
Study Notes
Cloud Computing - Lecture Notes
- Course: LINFO2145
- Lesson: 11 - Beyond Map/Reduce - In-Memory and Stream Big Data Processing
- Lecturer: Prof. Etienne Rivière
Lecture Objectives
- Introduce evolutions of Map/Reduce for:
- in-memory processing
- iterative computations
- interactive computations
- stream-based computations
- Discuss examples, algorithms, and system design using Apache Spark and Flink frameworks
Part 1: In-Memory Processing
- The Map/Reduce workflow processes Big Data in parallel.
- The map phase writes intermediate key/value pairs to local disk.
- The shuffle phase sorts by intermediate key, reading from remote disks.
- The reduce phase aggregates the set of values for each key and stores the result to disk.
K-Means with Map/Reduce
- Iterative job: each iteration writes its results to disk, and the next iteration reads them back from disk.
Part 2: Iterative Computations
- Certain algorithms require multiple iterations until a stopping condition is met.
- Examples include: K-Means (stop when no point is re-assigned to a new cluster), PageRank (ranking webpages based on incoming links), graph exploration problems (graph coloring, flow propagation), and numerous machine-learning algorithms.
Part 3: Interactive Computations
- Users make small modifications to a query and submit new queries over the same dataset.
- Examples include modifying a reduce operation.
- With Hadoop, everything must be recomputed even when the data has not changed.
Beyond Map/Reduce: Apache Spark
- Writing to disk is slow and costly, especially when the data must be re-read immediately.
- Spark introduces Resilient Distributed Datasets (RDDs)
- Shared-memory abstraction
- Programming model based on Map/Reduce
- Avoids re-reading from disk
- Allows queries under different models on top of the core engine
Spark Core Engine and Libraries
- Spark has a core engine with libraries on top: Streaming, SQL, Machine Learning (MLlib), and Graph processing (GraphX)
Resilient Distributed Datasets (RDDs)
- An abstraction for a large shared-memory object: partitioned over a cluster, manipulated in parallel, and immutable.
- A generalization of Map/Reduce data: input partitions, intermediate key → {values} lists, and final results can all be represented as RDDs.
- Create and manipulate RDDs using transformations
Operations on RDDs
- A functional-style programming API provides functions to the runtime.
- The runtime automatically runs these functions in parallel across the partitioned RDDs.
- Loaders create RDDs from storage.
- map creates a new RDD with a one-to-one mapping between elements.
- filter selects elements based on a condition.
- groupBy groups elements by a grouping key, producing a new RDD of (key, values) pairs (see the sketch below).
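A minimal sketch of these four operations in Spark's Scala API (the input path is hypothetical; `sc` is the SparkContext):

```scala
val nums  = sc.textFile("hdfs://.../numbers.txt") // loader: creates an RDD from storage
val ints  = nums.map(_.trim.toInt)                // map: one-to-one transformation
val evens = ints.filter(_ % 2 == 0)               // filter: keeps elements matching a condition
val byMod = ints.groupBy(_ % 10)                  // groupBy: RDD of (key, values) pairs
```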
Some Spark Operations
- Transformations (defining new RDDs): map, filter, flatMap, union, sample, groupByKey, reduceByKey, sortByKey, cogroup, mapValues
- Actions (return results to the driver program): collect, reduce, count, save, lookup, take
M/R vs Spark Terminology
- Map/Reduce runs map, shuffle, and reduce steps, with intermediate data on disk.
- Spark loads data into RDDs and applies operations such as map, groupBy, and filter in memory, persisting to disk only when requested.
Example of Use in Scala
- Defines an RDD backed by a file in HDFS.
- Derives a new RDD from the lines using a function literal.
- count is a final action that returns its result to the driver program (see the sketch below).
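A minimal sketch matching this description (the HDFS path and the filter predicate are illustrative):

```scala
val lines  = sc.textFile("hdfs://.../log.txt")  // RDD backed by a file in HDFS
val errors = lines.filter(_.contains("ERROR"))  // new RDD derived with a function literal
val n      = errors.count()                     // action: triggers evaluation, returns a Long
```

Nothing is read or computed until `count()` runs, which leads directly into the lazy evaluation discussed next.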
RDDs and Lazy Evaluation
- RDDs are evaluated lazily.
- Programs are transformed into a graph of relations between RDDs and transformations.
- Spark applies query optimizations, e.g., fusing consecutive filter and map operations into a single operation where applicable.
- It also optimizes task placement to minimize data movement after a groupBy operation.
Fault Recovery using Lineage
- Fault tolerance relies on tracking the graph of computations (the lineage) and re-running failed operations.
- Missing partitions can be recomputed if workers fail during shuffle operations.
- Rebuilding lost partitions in parallel is faster than restoring them from disk.
Example: PageRank
- PageRank computes a rank for every webpage from a web crawl.
- It approximates the probability of a random surfer reaching a given page.
- A page's rank is based on the number of incoming links and the ranks of the linking pages.
- Used in conjunction with keyword extraction for web search.
Input and Expected Output (PageRank)
- Input/output data for PageRank analysis
PageRank Principles
- Algorithm for calculating page importance.
- Formula for calculating a page's rank from the contributions of the pages that link to it (shown below).
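In the standard formulation (assuming the usual damping factor d = 0.85, as used in the well-known Spark example):

```latex
\mathrm{rank}(p) = (1 - d) + d \sum_{q \to p} \frac{\mathrm{rank}(q)}{|\mathrm{out}(q)|}
```

where the sum ranges over the pages q that link to p, and |out(q)| is the number of outgoing links of q.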
First Iteration
- Illustrative diagram showing page rank calculation in the first iteration.
Part 1: Reading the input file
- Creating RDDs by importing text files from HDFS and converting them to key/value pairs (a page and the list of pages it links to).
Part 2: Initial data ranks (Page Rank)
- Creating an initial RDD with all ranks set to 1.0.
Part 3: Looping and Calculating Contribution (PageRank)
- Calculating the contributions to a given page's rank from the pages that link to it.
- Implementing the PageRank formula as multiple steps over a series of RDDs.
Final Output Results (Page Rank)
- Collecting results from the PageRank sequence of RDDs into a final dataset.
- Implementing the final output stage that collects the results (see the sketch below).
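A condensed sketch of Parts 1-3 in Spark's Scala API, following the well-known Spark PageRank example (the input path and the fixed 10 iterations are illustrative):

```scala
// Part 1: read "page neighbor" pairs, group into (page, list of linked pages)
val lines = sc.textFile("hdfs://.../links.txt")
val links = lines.map { l => val p = l.split("\\s+"); (p(0), p(1)) }
                 .distinct().groupByKey().cache()

// Part 2: initial rank of 1.0 for every page
var ranks = links.mapValues(_ => 1.0)

// Part 3: each page contributes rank / #links to the pages it links to
for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

// Final output: collect the resulting ranks to the driver
ranks.collect().foreach(println)
```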
Illustration of Phase 3 (Page Rank)
- A visual illustration of the PageRank calculation steps.
Optimizing placement
- Optimizing data partitioning and join steps to avoid unnecessary reshuffles.
- Partitioning with a hash function over URLs or DNS names keeps related entries on the same workers.
Spark Programming Models
- Spark offers high-level libraries for common operations in various contexts such as SQL queries, graph processing, and intermixing different models.
- Enables mixing different models, such as extracting a graph structure from a relational dataset and applying graph algorithms on top of it.
Spark Summary
- Functional declaration of global aggregate computations.
- Resilient Distributed Datasets are a powerful abstraction for distributed shared-memory structures.
- Transparent partitioning, parallel computation, and handling of failures and reprocessing, including joins and other high-level operations.
- Computationally efficient for iterative loops compared to multi-stage MapReduce; also usable for interactive jobs.
Part 4: Stream Processing
- Many applications require constantly updated results in contrast to one-time computations
Stream Processing Requirements
- Statefulness (e.g., calculating averages or correlating streams with other datasets)
- Handling (bursty) high data rates, and ensuring low latency.
Spark Streaming
- Runs streaming computations as a series of very small, deterministic batch jobs.
- Batch sizes can be as small as roughly half a second (see the sketch below).
- Combining batch and stream processing in the same system.
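A minimal sketch of setting up such a micro-batch context (the application name, master, and 1-second batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamExample").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1)) // each batch covers 1 second of input

// ... declare DStream transformations here ...

ssc.start()             // start receiving and processing data
ssc.awaitTermination()
```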
Example: Get Hashtags from Twitter
- Using the Twitter Streaming API to continuously extract hashtags from an input stream.
- The stream arrives as a series of RDDs (a DStream), to which flatMap operations are applied.
Example: Get Hashtags from Twitter (Example Code)
- Example showing the declaration of `val tweets` and `val hashTags`, implementing the transformation with `.flatMap` on the stream input (see the sketch below).
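A sketch of those two declarations, following the classic Spark Streaming Twitter example (assumes the `spark-streaming-twitter` connector and the StreamingContext `ssc` from above):

```scala
import org.apache.spark.streaming.twitter.TwitterUtils

val tweets   = TwitterUtils.createStream(ssc, None) // DStream of tweet statuses
val hashTags = tweets.flatMap(status =>
  status.getText.split(" ").filter(_.startsWith("#"))) // DStream of hashtag strings

// Act on each micro-batch; printing stands in for saving to external storage.
hashTags.foreachRDD(rdd => rdd.take(10).foreach(println))
```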
Window-based Transformation
- Applying transformations that compute results over sliding windows of the stream (see the sketch below).
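For example, counting hashtags over a sliding window, assuming the `hashTags` DStream from the previous sketch (a 10-minute window sliding every second):

```scala
import org.apache.spark.streaming.{Minutes, Seconds}

val tagCounts = hashTags
  .window(Minutes(10), Seconds(1)) // hashtags from the last 10 minutes, updated every second
  .countByValue()                  // DStream of (hashtag, count) pairs
```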
Fault Tolerance
- Input data streams are replicated in memory for fault tolerance.
- Lost partitions can be recomputed by workers from the replicated data.
- RDDs remember the sequence of operations that created them (their lineage).
Apache Flink
- An open-source framework for stream and batch processing.
- Data flows are represented as graphs of transformations.
- Supports exactly-once processing of streams: the effect of each record on the computation is applied once and only once.
- Optimizes the placement of tasks for efficient execution.
Flink ≠ Spark Streaming
- Flink is record-centric, while Spark Streaming works on micro-batches.
- Flink allows one-event-at-a-time code, which is better for latency.
- Flink mixes batch and stream processing more easily and is potentially more efficient.
Examples of Flink Sources
- Flink can receive data from sources such as Apache Kafka, Amazon Kinesis, and Apache Pulsar.
- The same systems can also act as sinks that receive Flink's output data.
Flink APIs
- Flink provides APIs at several levels of abstraction: process functions with user-defined functions, the DataStream API, and the Table and SQL APIs.
- Lower-level user-defined functions give the most control over the performance characteristics of an application or workload.
Word Count Example with DataStream API and Process Functions
- Example illustrating how to perform word-counting in a streaming fashion using Flink's DataStream API and process functions
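A minimal sketch using Flink's Scala DataStream API (the socket source and port are illustrative):

```scala
import org.apache.flink.streaming.api.scala._

val env  = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("localhost", 9999) // illustrative text source

val counts = text
  .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
  .map((_, 1))
  .keyBy(_._1) // partition the stream by word
  .sum(1)      // running count per word

counts.print()
env.execute("Streaming WordCount")
```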
Example: Word Count using Table API
- Example illustrating how to perform word-counting in a batch-style fashion using Flink's Table API, showing a SQL query and subsequent execution.
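A sketch of the same counting expressed as a SQL query through the Table API; the inline VALUES table stands in for a real source:

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

val tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode())

tEnv.executeSql(
  """SELECT word, COUNT(*) AS cnt
    |FROM (VALUES ('to'), ('be'), ('or'), ('not'), ('to'), ('be')) AS t(word)
    |GROUP BY word""".stripMargin
).print() // prints each word with its count
```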
Another Example of Use of Table API in the TPC-DS Benchmark
- Example illustrating the use of Flink's Table API in a complex, real-world benchmark
Resulting Data Flow Graph
- Visual diagram showing the data flow steps resulting from processing structured data in a stream with multiple data sources using multiple joins and aggregations.
Execution of Dataflow Graphs in Flink
- Summary of how Flink orchestrates distributed data processing.
- Defining the unit of execution (task).
- Using a cluster manager like Kubernetes to manage the task managers.
- Distributed coordination using ZooKeeper, as in Hadoop.
Logical, Optimized, and Physical Dataflow Graphs
- The logical graph is transformed into a set of tasks.
- Consecutive transformations are merged into single tasks.
- Streams are established between tasks.
- A back-pressure mechanism, triggered by buffer thresholds, slows down upstream processing when downstream tasks cannot keep up.
Horizontal Scaling and Physical Mapping of the Dataflow Graph
- Tasks are scaled horizontally by running parallel instances, each handling a partition of the data and of the state, which minimizes redundant computation.
Global Snapshots
- Snapshots are taken of the dataflow computation graph. They enable fault tolerance (recovering the computation after a failure), re-configuring the parallelism, and upgrading software without interrupting long-running jobs.
- Implemented as global snapshots covering every operation step in the graph, with markers identifying the point from which to resume after a failure.
Snapshots and Marker Alignment
- Markers are tagged with an epoch and propagate along the streams, triggering local snapshots at each operator.
Usage of Snapshots
- Rescaling data partitions for different computation settings.
- Implementing checkpoints that store intermediate results during processing.
- Replaying channels during recovery, using replayable sources such as Apache Kafka.
- Exactly-once guarantees ensure that each record is handled once and only once during processing (see the sketch below).
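In code, periodic snapshots are enabled on the execution environment. A minimal sketch (the 10-second interval is illustrative):

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Take a global snapshot every 10 seconds with exactly-once semantics.
env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE)
```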
State Backend
- Flink tasks typically exhibit write-dominated workloads: state is updated with each processed event.
- Local snapshots use strategies that reduce the cost of copying large amounts of state to disk.
- Flink provides different types of state backends, including an in-memory backend and persistent local storage such as RocksDB (see the managed-state sketch below).
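A sketch of declaring managed keyed state, in the spirit of the per-key average example from the Flink documentation (class and state names are illustrative):

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Emits the running average per key; the configured state backend stores the state.
class PerKeyAverage extends RichFlatMapFunction[(String, Double), (String, Double)] {
  private var sumCount: ValueState[(Double, Long)] = _

  override def open(parameters: Configuration): Unit = {
    // Declare the state; Flink manages its storage and snapshotting.
    sumCount = getRuntimeContext.getState(
      new ValueStateDescriptor("sum-count", classOf[(Double, Long)]))
  }

  override def flatMap(in: (String, Double), out: Collector[(String, Double)]): Unit = {
    val (sum, count) = Option(sumCount.value()).getOrElse((0.0, 0L))
    val updated = (sum + in._2, count + 1)
    sumCount.update(updated) // state is written on every event
    out.collect((in._1, updated._1 / updated._2))
  }
}

// Usage (illustrative): stream.keyBy(_._1).flatMap(new PerKeyAverage)
```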
The RocksDB State Backend
- RocksDB is an embedded key-value store that Flink uses to persist state values on local disk; it organizes data in a log-structured merge tree (LSM tree) and supports incremental snapshots (configuration sketched below).
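Selecting this backend with incremental checkpoints enabled might look as follows (a sketch; API names as of Flink 1.13+):

```scala
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// `true` enables incremental snapshots: only files added since the last
// checkpoint are uploaded, reducing snapshot cost.
env.setStateBackend(new EmbeddedRocksDBStateBackend(true))
```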
Example at a Mobile Game Provider (King)
- Overview of how a mobile game provider like King uses Flink to process and manage in-game data
Snapshot Costs and Impact
- Describing the costs of taking snapshot operations in Flink's state management.
- The overhead associated with alignment and propagation of snapshots over the distributed dataflow graph.
Conclusions
- MapReduce pioneered the functional approach to large-scale Big Data processing.
- Apache Spark improves upon and extends the MapReduce principle, supporting iterative, interactive, and stream-processing computations.
- Flink's record-centric, one-event-at-a-time approach offers further advantages for latency-sensitive applications and use cases.
References
- List of key research papers and articles on Spark, Flink, and related technologies.