Summary

This document provides an overview of Spark Streaming, an extension of Apache Spark for handling real-time data streams. It describes how Spark Streaming integrates with Spark's batch processing capabilities and offers stateful stream processing, and it covers discretized stream processing, the high-level architecture of Spark Streaming, and window-based operations.

Full Transcript

What is Spark Streaming?
Spark Streaming is introduced as a robust extension of Apache Spark for handling real-time data streams. It offers stateful stream processing and integrates seamlessly with Spark's batch and interactive processing capabilities, enabling users to draw on Spark's ecosystem for comprehensive data handling. The simple API design allows users to write complex algorithms without requiring a separate stack for real-time analytics.

What is Spark Streaming?
- Extends Spark for doing large-scale stream processing
- Scales to 100s of nodes and achieves second-scale latencies
- Efficient and fault-tolerant stateful stream processing
- Simple batch-like API for implementing complex algorithms
- High throughput on large data streams

Integration with Batch Processing
Applications need to handle both real-time streaming data and batch-processed historical data. Many systems handle these separately, leading to duplicated work and higher maintenance costs. Spark Streaming's integration with Spark's batch processing framework enables a single stack for both live and historical data, reducing programming complexity, minimizing bugs, and increasing efficiency.

Stateful Stream Processing
Traditional streaming systems process each data record individually, updating mutable state held on each node. Spark Streaming can instead maintain state across input batches, and it ensures the processing state is preserved even if a node fails, providing fault tolerance. This design allows applications to perform continuous, complex updates, essential for tasks like aggregating data over time or detecting patterns. (Diagram: input records flow to nodes 1-3, each holding mutable state.)

Existing Streaming Systems
Storm processes each record at least once, which may lead to errors in mutable state during failures. Trident processes records exactly once using transactions, but can be slow due to the overhead of managing external transactions. Spark Streaming combines the benefits of both, achieving high throughput while also ensuring fault tolerance without relying on an external transaction system.

What is Spark Streaming?
Receive data streams from input sources, process them in a cluster, and push the results out to databases and dashboards. Scalable, fault-tolerant, second-scale latencies.

High-level Architecture
(Architecture diagram.)

Spark Streaming
- Incoming data is represented as Discretized Streams (DStreams)
- The stream is broken down into micro-batches
- Each micro-batch is an RDD, so code can be shared between batch and streaming

Discretized Stream Processing (1)
Run a streaming computation as a series of very small, deterministic batch jobs. Chop up the live stream into batches of X seconds. Spark treats each batch of data as an RDD and processes it using RDD operations. Finally, the processed results of the RDD operations are returned in batches. (Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results.)

Discretized Stream Processing (2)
Batch sizes can be as low as ½ second, with a latency of about 1 second. This opens the potential for combining batch processing and stream processing in the same system.

Working of Spark Streaming
Spark Streaming takes live input data streams and divides them into batches. The Spark engine then processes those batches and generates the final results, also in batches.
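To make the micro-batch flow above concrete, here is a minimal, self-contained sketch that is not from the slides: it assumes a plain text source on a local TCP socket, and the host, port, and word-count logic are chosen purely for illustration. The StreamingContext chops the live stream into 1-second batches, and each batch is processed with ordinary RDD-style operations.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
        // Each micro-batch covers 1 second of data and is handled as an RDD.
        val ssc = new StreamingContext(conf, Seconds(1))

        // Illustrative source: lines of text arriving on a local TCP socket.
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
        counts.print()          // output operation: push each batch's result to the console

        ssc.start()             // start receiving and processing
        ssc.awaitTermination()  // run until the job is stopped
      }
    }

Feeding such an example is as simple as running a local text server (for instance nc -lk 9999) and typing lines into it.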
Spark Streaming Programming Model
- Discretized Stream (DStream): represents a stream of data, implemented as a sequence of RDDs
- The DStream API is very similar to the RDD API: functional APIs in Scala, Java, and Python; create input DStreams from different sources; apply parallel operations

Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream()
A DStream is a sequence of RDDs representing a stream of data. Each batch from the Twitter Streaming API (batch @ t, batch @ t+1, batch @ t+2, ...) is stored in memory as an RDD (immutable, distributed).

Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
A transformation modifies the data in one DStream to create another DStream. Here flatMap is applied to every batch of the tweets DStream, creating new RDDs (e.g. [#cat, #dog, ...]) for every batch of the hashTags DStream.

"Micro-batch" Architecture
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status)).saveAsHadoopFiles("hdfs://...")
saveAsHadoopFiles is an output operation that pushes data to external storage. The stream is composed of small (1-10 s) batch computations, and every batch is saved to HDFS.

Example – Get hashtags from Twitter
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status)).foreach(hashTagRDD => { ... })
foreach lets you do whatever you want with the processed data: write to a database, update an analytics UI, and so on.

Window-based Operations
Apply transformations over a sliding window of data. There are two parameters:
- Window length: the duration of the window
- Sliding interval: the interval at which the window operation is performed

Window-based Transformations
    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
window(...) is the sliding window operation: Minutes(1) is the window length and Seconds(5) is the sliding interval.

Arbitrary Stateful Computations
Specify a function to generate new state based on the previous state and new data. Example: maintain per-user mood as state, and update it with their tweets.
    updateMood(newTweets, lastMood) => newMood
    moods = tweets.updateStateByKey(updateMood _)

Arbitrary Combinations of Batch and Streaming Computations
Inter-mix RDD and DStream operations! Example: join incoming tweets with a spam HDFS file to filter out bad tweets.
    tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })

DStream Input Sources
- Out of the box: Kafka, HDFS, Flume, Akka Actors, raw TCP sockets
- Very easy to write a receiver for your own data source: define what to do when the receiver is started and stopped (see the sketch after this list)
- You can also generate your own sequence of RDDs and push them in as a "stream"

Output Sinks
- HDFS, S3, etc. (Hadoop API compatible filesystems)
- Cassandra (using the Spark-Cassandra connector)
- HBase (the existing Spark-HBase connector can be used directly)
- Directly push the data anywhere

Spark Streaming
- Runs as a Spark job
- YARN or standalone for scheduling; YARN has KDC (Kerberos Key Distribution Center) integration
- Use the same code for real-time Spark Streaming and for batch Spark jobs
- Integrates natively with messaging systems such as Flume, Kafka, and ZeroMQ
- Easy to write "Receivers" for custom messaging systems
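As a companion to the input-source and receiver points above, here is a rough sketch of a custom receiver. It is not taken from the slides: it assumes a hypothetical line-oriented socket source, and the host, port, and restart policy are illustrative only.

    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket
    import java.nio.charset.StandardCharsets

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Receives newline-delimited records from a socket and hands them to Spark Streaming.
    class LineReceiver(host: String, port: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      // Called when the receiver starts: spawn a thread so onStart() returns quickly.
      def onStart(): Unit = {
        new Thread("LineReceiver") {
          override def run(): Unit = receive()
        }.start()
      }

      // Nothing to clean up: the reading thread checks isStopped() and exits on its own.
      def onStop(): Unit = {}

      private def receive(): Unit = {
        try {
          val socket = new Socket(host, port)
          val reader = new BufferedReader(
            new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
          var line = reader.readLine()
          while (!isStopped && line != null) {
            store(line)               // hand one record to Spark Streaming
            line = reader.readLine()
          }
          reader.close()
          socket.close()
          restart("Connection closed, restarting") // ask Spark to restart the receiver
        } catch {
          case t: Throwable => restart("Error receiving data", t)
        }
      }
    }

A DStream is then created from it with ssc.receiverStream(new LineReceiver("localhost", 9999)).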
DStreams + RDDs = Power
- Combine live data streams with historical data: generate historical data models with Spark, then use those models to process the live data stream
- Combine streaming with MLlib and GraphX algorithms: offline learning with online prediction, or online learning and prediction
- Interactively query streaming data using SQL: select * from table_from_streaming_data

Fault-tolerance: Worker
- RDDs remember the operations that created them
- Batches of input data are replicated in memory for fault tolerance
- Data lost due to a worker failure can be recomputed from the replicated input data
- All transformed data is fault-tolerant, with exactly-once transformations
(Diagram: the tweets input data RDD is replicated in memory; lost partitions of the derived hashTags RDD are recomputed on other workers.)

Fault-tolerance: Master
- The master saves the state of the DStreams to a checkpoint file, which is written to HDFS periodically
- If the master fails, it can be restarted using the checkpoint file
- More information is in the Spark Streaming guide (link later in the presentation)
- Automated master fault recovery is coming soon

Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency. Tested with 100 text streams on 100 EC2 instances with 4 cores each.

Real Applications: Mobile Millennium Project
- Traffic transit time estimation using online machine learning on GPS observations
- Markov chain Monte Carlo simulations on GPS observations
- Very CPU intensive; requires dozens of machines for useful computation
- Scales linearly with cluster size

Real Applications: Conviva
- Real-time monitoring and optimization of video metadata
- Aggregates performance from millions of active video sessions across thousands of metrics, with multiple stages of aggregation
- Successfully ported to run on Spark Streaming
- Scales linearly with cluster size

Vision - one stack to rule them all
- Explore data interactively using the Spark shell to identify problems
- Use the same code in Spark stand-alone programs to identify problems in production logs
- Use similar code in Spark Streaming to identify problems in live log streams

Vision - one stack to rule them all
(Summary diagram.)
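The "one stack" vision above can be sketched in code; this sketch is not from the slides, and the input paths, socket source, and ERROR-filter standing in for "problem detection" are all illustrative assumptions. A single RDD-level function serves both a batch pass over historical logs and a Spark Streaming job over a live log stream.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LogProblems {
      // Shared logic: works on any RDD of log lines, whether batch or micro-batch.
      def findProblems(logs: RDD[String]): RDD[String] =
        logs.filter(_.contains("ERROR"))

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LogProblems").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // Batch: identify problems in historical production logs (path is illustrative).
        val historical = findProblems(sc.textFile("hdfs://namenode:8020/logs/archive"))
        println(s"Problems in historical logs: ${historical.count()}")

        // Streaming: apply the same function to every micro-batch of a live log stream.
        val ssc = new StreamingContext(sc, Seconds(5))
        val liveLogs = ssc.socketTextStream("localhost", 9999)
        liveLogs.transform(findProblems _).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

The same findProblems function could equally be called from the Spark shell for interactive exploration, which is the first leg of the vision slide.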
