Summary

These notes provide an overview of Apache Storm, a real-time stream processing framework. The document explains how Storm handles unbounded data streams, emphasizing its low latency and scalability. Key concepts like tuples, streams, spouts, and bolts are also discussed.

Full Transcript

11/6/2024 Apache Storm 1 What is Apache Storm? Apache Storm is a powerful open-source real-time stream processing framework. It allows developers to process unbounded data streams in a scalable, fault-tolerant manner, making it ideal for real-time analytic...

11/6/2024 Apache Storm 1 What is Apache Storm? Apache Storm is a powerful open-source real-time stream processing framework. It allows developers to process unbounded data streams in a scalable, fault-tolerant manner, making it ideal for real-time analytics, monitoring, and computation tasks. Released as open-source in 2011 by Twitter. 2 1 11/6/2024 3 How Apache Storm Works: Processing Unbounded Streams of Data Real-Time Analytics Continuous Computation ETL (Extract-Transform-Load) Operations 4 2 11/6/2024 Processing Unbounded Streams of Data Apache Storm is designed to handle unbounded data streams, meaning there is no predefined end to the data. The data continues flowing into the system indefinitely, and Storm processes it in real-time as it arrives. 5 Processing Unbounded Streams of Data Characteristics Low Latency: Storm processes data with extremely low latency. This is crucial for use cases requiring real-time decision-making (e.g., fraud detection, financial transactions). Scalability: As data grows, Apache Storm scales horizontally by adding more worker nodes to distribute the processing load across multiple machines. 6 3 11/6/2024 Real-Time Analytics Apache Storm is often used for real-time analytics, where data needs to be processed and analyzed as soon as received. 7 Continuous Computation Continuous computation refers to the ability to perform ongoing calculations on a data stream, updating results dynamically as new data flows in. 8 4 11/6/2024 Companies Using Storm Twitter: Twitter originally developed Apache Storm to process its continuous data streams. Storm helps Twitter process millions of tweets in real-time, allowing for real-time content analysis, trending topic detection, and user engagement monitoring. Spotify: Spotify uses Apache Storm to process streams of user listening data, providing real-time recommendations and ensuring that playlists, top charts, and user experiences are updated as users interact with the platform. 9 KEY CONCEPTS 10 5 11/6/2024 Tuples An ordered list of values/objects (any type) Values/objects must be serializable Tuple = list of values/objects 11 Stream An unbounded sequence of tuples Core abstraction in Storm Stream = sequence of tuples 12 6 11/6/2024 Storm Data Processing Streams of tuples flowing through topologies Vertices represent computation and edges represent the data flow Vertices divided into Spouts – read tuples from external sources. Bolts – encapsulate the application logic. 13 Spouts Generate tuples from other sources event data log files Queues Sources of streams Read input data from an external source Release them as tuple streams into Storm Spouts can release more than one stream Example: read from Twitter streaming API 14 7 11/6/2024 Bolts Process input streams and produce new streams It can be any functionality: filtering, functions, aggregation, joins, etc. Complex transformations require multiple bolts 15 Topology Network of spouts and bolts Graph: node = spout or bolt, edge = which bolt subscribes to which stream 16 8 11/6/2024 Tasks Spouts and bolts execute as many tasks across the cluster 17 Data Flow in Apache Storm Topology 1. Spouts: Data Ingestion 1. Spouts serve as the entry points for data. 2. They pull data from external sources (e.g., Twitter API, Kafka, or log files). 3. Spouts convert the data into tuples and emit these tuples as streams to the topology. 4. Example: A Twitter Spout reads tweets from the Twitter Streaming API. 2. Bolts: Data Processing 1. Bolts receive the tuples emitted by the spouts and perform processing tasks. 2. Bolts can perform various operations such as: 1. Filtering: Removing unwanted data. 2. Transformation: Changing data formats or values. 3. Aggregation: Summing, counting, or averaging values. 4. Join operations: Combining data from multiple streams. 18 9 11/6/2024 Data Flow in Apache Storm Topology 3. Final Output (Result) 1. The final bolt collects the processed data and acts, such as saving it to a database or displaying results in real-time. 2. Example: A ReportBolt stores word counts or generates a report from the processed tweets. 4. Topology Structure 1. A Storm topology is a directed acyclic graph where spouts feed data into bolts, which may further pass processed data to downstream bolts. 2. The topology runs continuously, processing incoming streams in real time. 19 STORM ARCHITECTURE 20 10 11/6/2024 Storm Architecture 21 Storm Architecture 1.Nimbus: 1. Role: Central node responsible for managing and assigning tasks to worker nodes. 2. Responsibilities: Task assignment, topology submission, fault tolerance. Monitors the health of worker nodes and reassigns tasks if needed. High availability: Nimbus can run in a cluster for fault tolerance. 22 11 11/6/2024 Storm Architecture 2. Supervisor: 1. Role: Supervisor nodes run on the worker machines within the Storm cluster. 2. Responsibilities: 1.Task Execution: Supervisor nodes are responsible for executing the actual data processing tasks, which include running spouts and bolts. 2.Reporting: They communicate with Nimbus to request and receive tasks. They also report the status and health of the jobs they execute. 3.Scalability: Storm clusters can be easily scaled by adding or removing supervisor nodes to handle varying workloads. 23 Storm Architecture 3. ZooKeeper: Role: Storm uses Apache ZooKeeper for cluster coordination and configuration management. Responsibilities: Cluster Coordination: ZooKeeper helps Nimbus and supervisors coordinate their actions. It records the cluster’s state, including worker node availability and task assignments. Configuration Management: Storm’s configuration and topology details are stored and managed in ZooKeeper. This ensures that all nodes in the cluster have access to the same configuration and topology definitions. Reliability: ZooKeeper provides a highly reliable and distributed coordination service, which contributes to the overall reliability of the Storm cluster. 24 12 11/6/2024 Topology Example We use the well-known word count example to demonstrate Storm data processing. The example will also cover creating a basic Storm project, including spouts and bolts. The word count topology consists of a single spout connected to three downstream bolts. 25 Sentence Spout The SentenceSpout class will release a stream of single-value tuples with the key name "sentence" and a string value (a sentence), as shown in the following example. { "sentence" : “My dog has fleas" } The data source will be simplified as a static list of sentences, emitting a tuple for every sentence. In a real-world application, a spout would typically connect to a dynamic source, such as tweets retrieved from the Twitter API, files being uploaded, or any messaging queue. 26 13 11/6/2024 Split Sentence Bolt The SplitSentenceBolt will subscribe to the sentence spout's tuple stream. For each tuple received, it will look up the "sentence" object's value, split the value into words, and release a tuple for each word: { "word" : "my" } { "word" : "dog" } { "word" : "has" } { "word" : "fleas" } 27 Word Count Bolt The WordCountBolt subscribes to the output of the SplitSentenceBolt class. It keeps a running count of each word. Whenever it receives a tuple, it will increment the counter associated with a word and release a tuple containing the word and the current count: { "word" : "dog", "count" : 3 } 28 14 11/6/2024 Report Bolt The ReportBolt subscribes to the output of the WordCountBolt class. Itmaintains a table of all words and their corresponding counts, just like WordCountBolt. When it receives a tuple, it updates the table and prints the contents to the console. 29 Reliable Processing 30 15 11/6/2024 Reliable Processing 31 Hadoop vs. Storm Hadoop Storm batch processing  real-time processing runs jobs to completion  topologies run forever stateful nodes  stateless nodes scalable  scalable guarantees no data loss  guarantees no data loss open source  open source big batch processing  fast, reactive, real-time processing 32 16 11/6/2024 Storm: Pros and Cons Pros Cons Fault tolerance: High fault tolerance The use of native scheduler and Latency: very low resource management feature Processing Model: Real-time stream (Nimbus) becomes a bottleneck. processing model Difficulties debugging given the way Programming language the threads and data flows are split. dependency: any programming language Reliable: each tuple of data should be processed at least once Scalability: high scalability 33 17

Use Quizgecko on...
Browser
Browser