Spark Streaming Late Events PDF
Document Details
Uploaded by ConsistentBugle
Sheridan College
Summary
This document discusses Spark Streaming, focusing on late events and watermarks. It explains how event time and processing time differ in streaming applications and why managing late events is crucial. The concept of watermarks is introduced as an elegant solution to limit processing overhead caused by late events.
Full Transcript
Spark Streaming Late Events

Event Time, Window & Late Events

Introduction
- So far we have tried Spark streaming with different output modes
  - Non-aggregate queries
  - Aggregate queries
- We understand what is possible and not possible with each output mode

Windowing
- So far we have computed the total action count at a given time (sitting, standing, biking, etc.)
- We don’t know the action count for a particular window
- It would be nice to know the action count for the last hour, the last 10 minutes, etc.

Windowing Example
- Windowing is a much-needed concept, especially in IoT applications
- As an example, assume you are tracking the number of orders on your website in real time
- Total orders is of great interest
- But wouldn’t it be nice to know the total number of orders by hour?
  - We could see the spikes and perhaps add resources

Timing in Streaming
- There are two types of timing in streaming applications: event time and processing time
- Event time
  - When the event originates at the source
- Processing time
  - Typically recorded outside of the source
  - Either when the event reaches Spark streaming for processing or just before
- Processing time indicates when the event is picked up for processing
- We are more interested in event time
  - This is when the event actually happened
- But as we will see later, there are complications with this
- The concept is called “late events”

Windowing function in Spark
- Before we look at the problem with calculating event times, we will take a look at how to window events
- So far all we have done is take the total number of events
- We now want to group events by 10-minute intervals
  - Take a look at the code for this section

Late Events

Problem with Late Events
- Event times are what we are really after in any streaming application
- But event times are hard to manage because of the need to support late events
- Essentially, we have to remember that events can arrive at the streaming or processing
  time significantly late
- This is a real problem
- For example, take an application where an accelerometer sends the number of steps you take to a server for processing
- This requires a Wi-Fi connection, and it is possible that the Wi-Fi is disconnected or not functioning for several hours
- So events generated at, say, 11:00 am can arrive at the processing centre at 8:00 pm!
- This is a significant problem
- Spark would need to update the 11:00 am window when the event arrives at 8:00 pm
- To support this, Spark needs to maintain the state for an indefinite amount of time
- The problem is that the result table will grow in size indefinitely

Solution for Late Events
- We need a cutoff time for late events
- If an incoming event is older than a certain threshold, we have to ignore it
- This way, Spark can drop older windows without any consequence and keep the result table from growing indefinitely
- Spark achieves this with a concept called watermarks

Concept of Watermarks

Review
- To manage late events, Spark has to maintain all the windows in its state indefinitely
- This makes the state difficult to manage
  - It can overwhelm the storage
- To mitigate this risk, Spark Streaming introduced the concept of watermarks

Watermarks concept
- Let’s assume that the watermark is set to 10 minutes
- Now assume that in the last micro-batch Spark processed events with event times up to 10:00 am
  - 10:00 am is the maximum event time processed so far
- Therefore, the threshold for considering events would be 9:50 am
  - Maximum event time minus the watermark delay
- Therefore, in the next micro-batch or trigger, Spark will not process events older than 9:50 am
- Now assume that in the next micro-batch the maximum event time was 10:10 am
- The new threshold for events will be 10:00 am
- So in the next trigger, events earlier than 10:00 am will not be considered
- And the process continues…
- We need to know the threshold for each trigger or
  micro-batch
- To find the threshold, we need to know the maximum event time after each trigger run
- Here is a more concrete example

Example

Watermarks concept
- At Trigger 1 there is no threshold
- While processing events from Trigger 1, Spark streaming takes note of the maximum event time → 10:39
- Therefore, with the 10-minute watermark, the threshold for Trigger 2 will be set to 10:29
- So any events with an event time before 10:29 will be ignored
- While processing events from Trigger 2 (file 2), Spark takes note of the maximum event time → 10:56
- The threshold for Trigger 3 is then set to 10:46
- Any event occurring before 10:46 will be ignored moving forward

Watermarks and complete output mode
- What happens with watermarks and complete output mode?
  - By definition, complete output mode outputs the entire contents of the result table
  - This means that complete mode implicitly guarantees that all data in the result table will be preserved
  - So windows older than the threshold are not deleted, and the watermark is not taken into account with complete output mode
- Complete output mode with watermarks behaves exactly the same as complete output mode without watermarks
- Essentially, the watermark is ignored in complete output mode
- Try it with the exercises

Append mode with watermarks
- What happens when you try append output mode with a watermark?
  - Append with aggregation is not supported (true!), therefore append with a watermark is not supported?
  - WRONG
- Append with aggregate queries AND watermarks is supported
- We’ll see why in the next section

Late Events and Append Mode

Append mode with watermarks
- Remember, append mode only allows appends (new data) to the result table
- You cannot have updates to the result table
- Therefore, aggregation queries cannot normally be run with this mode, because these types of queries require updates to the table
- But there is a twist
- Append mode supports aggregation when it is done with watermarks
- We’ll see how
- But first, why do you think append mode is supported with watermarks?
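
The threshold rule from the watermark slides (maximum event time seen so far, minus the watermark delay) can be traced trigger by trigger with a small simulation. This is a plain-Python sketch, not Spark code: the function `run_triggers`, the helper `t`, the fixed date, and the sample event times are all hypothetical and chosen only to echo the 10:39 and 10:56 example. It also simplifies Spark's behaviour by dropping every event older than the threshold, whereas Spark itself only guarantees that events within the delay are kept.

```python
from datetime import datetime, timedelta

# Assumed delay, matching the 10-minute watermark used in the slides.
WATERMARK_DELAY = timedelta(minutes=10)


def t(hhmm):
    """Hypothetical helper: build a datetime on an arbitrary fixed date."""
    return datetime.strptime(f"2024-01-01 {hhmm}", "%Y-%m-%d %H:%M")


def run_triggers(batches, delay=WATERMARK_DELAY):
    """Simulate watermark thresholds across micro-batches.

    batches: one list of event-time datetimes per trigger.
    Returns one (threshold_used, accepted, dropped) tuple per trigger.
    """
    max_event_time = None  # maximum event time seen so far
    threshold = None       # there is no threshold at the first trigger
    results = []
    for events in batches:
        # Events older than the current threshold are treated as too late.
        accepted = [e for e in events if threshold is None or e >= threshold]
        dropped = [e for e in events if threshold is not None and e < threshold]
        results.append((threshold, accepted, dropped))

        # After the trigger, update the maximum event time; it never moves back.
        if events:
            batch_max = max(events)
            if max_event_time is None or batch_max > max_event_time:
                max_event_time = batch_max
            # The threshold for the NEXT trigger is max event time minus delay.
            threshold = max_event_time - delay
    return results


if __name__ == "__main__":
    sample = [
        [t("10:25"), t("10:39")],             # Trigger 1: max event time 10:39
        [t("10:20"), t("10:31"), t("10:56")],  # Trigger 2: 10:20 arrives late
        [t("10:40"), t("10:50")],             # Trigger 3: 10:40 arrives late
    ]
    for i, (thr, ok, late) in enumerate(run_triggers(sample), start=1):
        label = thr.strftime("%H:%M") if thr else "none"
        print(f"Trigger {i}: threshold={label}, "
              f"accepted={[e.strftime('%H:%M') for e in ok]}, "
              f"dropped={[e.strftime('%H:%M') for e in late]}")
```

In this sketch the threshold for Trigger 2 comes out as 10:29 and the threshold for Trigger 3 as 10:46. Because the threshold only moves forward, any window that ends before it can no longer change, which is worth keeping in mind when thinking about the append-mode question.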