Introduction to Spark Streaming PDF
Document Details
Uploaded by PersonalizedBlueTourmaline2204
Sheridan College
Tags
Summary
This document provides an introduction to Spark Streaming, a powerful feature in Spark for handling continuous data streams. It covers core concepts like source, state, sink, and different output modes. The document also explores the differences between batch and stream processing, and highlights challenges like late events and code portability. Information about various data flow pipelines is covered within this guide.
Full Transcript
Introduction to Spark Streaming Streaming Concept Motivation Industry is showing a lot of interest in streaming applications A typical data flow pipeline Extract Analysis Transform Load Batch vs S...
Introduction to Spark Streaming Streaming Concept Motivation Industry is showing a lot of interest in streaming applications A typical data flow pipeline Extract Analysis Transform Load Batch vs Stream processing Batch Streaming Data Frequency Data Size Data Analysis Stream processing concept Stream processing Streaming applications basically reduces the time between acquisition of the data and calculation on that data Data is constantly changing New/changed data must be analyzed quickly Why bother with stream processing? Convenience Critical applications Spark Structured Streaming Module Spark Structured Streaming API Spark Structured Streaming Spark Structured Streaming is a powerful feature in Spark It is being adopted by a lot of companies in recent years To make the distinction, let’s first look at traditional or batch processing in Spark – What we have been working on so far Batch Processing Spark Have a dataset – Represent it as a DataFrame (or DataSet) Run queries against the DataFrame to perform any computation we need We say that the dataset is static and so the DataFrame is also static Now let’s look at Spark Structured Streaming Basic concept of structured streaming Forms of streaming There are two forms of streaming available in Spark Micro-batch processing Micro-batch processing Here, a “batch” of information can be processed as fast as 100 milliseconds Micro-batch processing is once-and-only- once guarantee Continuous stream processing Assume you are not satisfied with the 100ms latency Spark offers with micro- batch streaming There is another form of streaming called Continuous Stream Processing Latency in continuous processing goes down to 1 ms However, Continuous Processing does at- least-once guarantee – Duplicates are possible Spark Streaming vs. Kafka Streaming Challenges Streaming Challenges – Late Events trigger 1 trigger 2 trigger 3 Moe – 11 step Joe - 10 steps Joe - 15 Joe – 0 steps t = 0 Lisa – 15 steps Lisa - 15 Lisa - 5 steps (start streaming) time Joe 25 Joe 10 Joe 25 Lisa 35 Lisa 15 Lisa 30 Moe 11 Streaming Challenges – End to End Guarantee End to end guarantees – Tolerating failure – To achieve this we need a model – We have a source, sink and spark streaming in the middle Streaming Challenges – Code Portability Batch API code is different from Streaming API code But the good new is not by much! Most of your batch code is easily portable to streaming code with minor changes Spark Streaming Architecture and Output Modes Architecture of Spark Streaming Source trigger 1 trigger 2 trigger 3 t = 0 (start streaming) trigger time time State Result Table Sink Example Source (smart watch – steps taken) trigger 1 trigger 2 trigger 3 Moe – 11 step Joe - 10 steps Joe - 15 Joe – 0 steps t = 0 Lisa – 15 steps Lisa - 15 Lisa - 5 steps (start streaming) 10 minutes time Joe 25 Joe 10 Joe 25 State Lisa 15 Lisa 30 Lisa 35 Moe 11 Result Joe 25 Joe 10 Joe 25 Lisa 35 Table Lisa 15 Lisa 30 Moe 11 Sink Spark Streaming Output Modes Complete Output Mode Source (smart watch – steps taken) trigger 1 trigger 2 trigger 3 Moe – 11 step Joe - 10 steps Joe - 15 Joe – 0 steps t = 0 Lisa – 15 steps Lisa - 15 Lisa - 5 steps (start streaming) 10 minutes time Joe 25 Joe 10 Joe 25 State Lisa 15 Lisa 30 Lisa 35 Moe 11 Result Joe 25 Joe 10 Joe 25 Lisa 35 Table Lisa 15 Lisa 30 Moe 11 Sink Joe 25 Joe 10 Joe 25 Lisa 35 Lisa 15 Lisa 30 Moe 11 Update Output Mode Source (smart watch – steps taken) trigger 1 trigger 2 trigger 3 Moe – 11 step Joe - 10 steps Joe - 15 Joe – 0 steps t = 0 Lisa – 15 steps Lisa - 15 Lisa - 5 steps (start streaming) 10 minutes time Joe 25 Joe 10 Joe 25 State Lisa 15 Lisa 30 Lisa 35 Moe 11 Result Joe 25 Joe 10 Joe 25 Lisa 35 Table Lisa 15 Lisa 30 Moe 11 Sink Joe 10 Joe 25 Lisa 35 Lisa 15 Lisa 30 Moe 11 Append Output Mode With Append output mode, only records that are newly added to the results table will be sent to the sink Records in the result table are not allowed to be updated Only new records can be added or appended to the table with append mode Interview Question 1 Is Append mode possible with our example (aggregation)? – No, only new data can be sent to the results table in Append mode – No updates are allowed in Append mode, and aggregation will inevitably update results Aggregate queries are allowed in Append output if using watermarks Interview Question 2 With what you know with complete output mode, what kind of queries spark may not be able to support with complete output mode? – Aggregate or none-aggregate queries? – Answer: Spark does not support complete output mode with non-aggregate queries Summary We familiarized ourselves with some core concepts in Spark Streaming Source, state, sink, results table and output modes We also saw what kind of queries are supported and not supported for append and complete output mode Experiments with Spark Streaming Modes