Lecture #4.1 - Spark Structured Streaming API II.pdf

Full Transcript

MODERN DATA ARCHITECTURES FOR BIG DATA II
APACHE SPARK STRUCTURED STREAMING API II

AGENDA
Structured Streaming II
  Event and Processing Time
  Windows Operations
  Late Data and Watermarking
  Join Operations
  Stream Deduplication
Structured Streaming II - API

TIME TO TURN OSBDET ON!
We'll use the course environment by the end of the lesson.

1. STRUCTURED STREAMING II

1.1 EVENT AND PROCESSING TIME

EVENT TIMING IN STRUCTURED STREAMING
Quick comparison of Event vs Processing time:
  Both are TimeStamp columns within the Input Table DataFrame
  Event Time → column coming from the source; it might not be of TimeStamp type originally
  Processing Time → column created with the function current_timestamp
  Windowing techniques are only applied to the event time column
(A PySpark sketch of both columns appears in the examples at the end of this transcript.)

1.2 WINDOWS OPERATIONS

WINDOWING* IN STRUCTURED STREAMING
With regards to window operations in Structured Streaming:
  Tumbling windows → windows cannot be defined by a number of events (count-based windows are not supported)
  Time-based windows → full support, with a minor nuance:
    window start is inclusive, window end is exclusive → [12:05, 12:10)
    duration & unit are given as text: millisecond, second, minute, hour, ...

* Naming convention as in "Streaming Data - Understanding the Real-Time Pipeline". You'll find different names in the official Spark documentation.

WINDOWING IN STRUCTURED STREAMING
The aggregation function window defines windows and bucketizes rows into one or more windows based on the window definition:
  For a Fixed Time Window → name of the event time column and window duration
  For a Sliding Time Window → Fixed Time Window definition + sliding duration
Example: 10-minute windows sliding every 5 minutes (see the sliding-window sketch in the examples at the end of this transcript).

1.3 LATE DATA & WATERMARKING

NEVER ASSUME EVENTS ARRIVE ON TIME
Events can arrive late to the analytics tier due to multiple factors.
Spark Structured Streaming keeps partial results while late data arrives.
It only waits for late data for a while → watermarks.

NEVER ASSUME EVENTS ARRIVE ON TIME
This is how watermarking deals with late events*: the watermark tracks the maximum event time seen so far minus a lateness threshold; events older than the watermark may be dropped, and the partial state kept for older windows can be finalized and released.

* More about this at Handling Late Data and Watermarking.

1.4 JOIN OPERATIONS

JOIN OPERATIONS AVAILABLE
The join function we already know can be used here too. Spark Structured Streaming supports joining:

A Streaming DataFrame with a Static one, supporting the following:

  Left DataFrame | Right DataFrame | Supported joins
  Stream         | Static          | Inner, Left Outer, Left Semi
  Static         | Stream          | Inner, Right Outer

A Streaming DataFrame with another Streaming one, which is challenging because:
  Late events on either DataFrame still have to be matched against events arriving on the other one
  Eventually, processing of both Input Tables will produce the expected insights
  Watermarking can be used to discard late events that are no longer valid

1.5 STREAM DEDUPLICATION

WHAT'S STREAM DEDUPLICATION?
Data deduplication* → technique to remove repeated data.
The function dropDuplicates enables this data deduplication:
  Typically used to turn at-least-once delivery into exactly-once
  Can be enabled or disabled based on the criticality of the data
  If enabled, the amount of resources needed increases
A unique identifier needs to exist in the DataFrame:
  It could be one single column or a combination of multiple columns
It can be combined with watermarking:
  With watermarking → the history to maintain is limited
  Without watermarking → all events are considered, late or not

* Data Deduplication definition on Wikipedia.

2. STRUCTURED STREAMING II - API

EXPLORE THE API IN JUPYTER NOTEBOOK
Jump to OSBDET and explore the Structured Streaming API. The sketches at the end of this transcript illustrate, in PySpark, each of the topics covered in section 1.

CONGRATS, WE'RE DONE!
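PYSPARK SKETCHES OF THE TOPICS ABOVE

The following sketches are illustrative only; source types, host/port, column names, thresholds and application names are assumptions, not taken from the course notebooks.

Event vs Processing time (1.1). A minimal sketch, assuming a socket source that emits lines like "word,2024-05-01 12:03:00"; it derives an event time column from the payload (cast to TimeStamp) and a processing time column with current_timestamp:

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, to_timestamp, split, col

spark = SparkSession.builder.appName("EventVsProcessingTime").getOrCreate()

# Hypothetical socket source emitting lines like "word,2024-05-01 12:03:00"
raw = (spark.readStream
       .format("socket")
       .option("host", "localhost")
       .option("port", 9999)
       .load())

events = (raw
          .withColumn("word", split(col("value"), ",").getItem(0))
          # Event time comes from the source and may arrive as a string,
          # so it is cast to a proper timestamp before any windowing
          .withColumn("event_time", to_timestamp(split(col("value"), ",").getItem(1)))
          # Processing time is generated by Spark when the row is processed
          .withColumn("processing_time", current_timestamp()))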
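Windows operations (1.2). A minimal sketch of the promised example, 10-minute windows sliding every 5 minutes, assuming the built-in rate source so it runs without any external data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("SlidingWindows").getOrCreate()

# Rate source: emits rows with `timestamp` (used as event time here) and `value`
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# 10-minute windows sliding every 5 minutes: each event lands in two windows,
# e.g. [12:00,12:10) and [12:05,12:15). Window start is inclusive, end exclusive.
windowed_counts = (events
                   .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"))
                   .count())

query = (windowed_counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("truncate", "false")
         .start())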
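Late data & watermarking (1.3). A minimal sketch assuming the rate source and a 10-minute lateness tolerance; the threshold is an arbitrary choice for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("Watermarking").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Watermark: accept events up to 10 minutes late relative to the maximum event
# time seen so far; state for windows older than that can be finalized and dropped
late_tolerant_counts = (events
                        .withWatermark("timestamp", "10 minutes")
                        .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"))
                        .count())

query = (late_tolerant_counts.writeStream
         .outputMode("update")   # emit only the windows updated in each trigger
         .format("console")
         .start())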
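Join operations (1.4). A minimal sketch of a stream-static join and a watermarked stream-stream join; the keys, time bounds and watermark delays are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("StreamingJoins").getOrCreate()

# Stream-static join: enrich a streaming DataFrame (left) with a small static lookup (right)
stream = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .selectExpr("value % 3 AS key", "timestamp AS event_time"))
static_lookup = spark.createDataFrame([(0, "red"), (1, "green"), (2, "blue")],
                                      ["key", "colour"])
enriched = stream.join(static_lookup, on="key", how="left_outer")

# Stream-stream join: both sides are unbounded, so watermarks plus a time-range
# condition let Spark eventually discard the state kept for late events
left = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
        .selectExpr("value AS left_id", "timestamp AS left_time"))
right = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
         .selectExpr("value AS right_id", "timestamp AS right_time"))

joined = (left.withWatermark("left_time", "10 minutes")
          .join(right.withWatermark("right_time", "20 minutes"),
                expr("left_id = right_id AND "
                     "right_time >= left_time AND "
                     "right_time <= left_time + interval 5 minutes")))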
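Stream deduplication (1.5). A minimal sketch assuming the rate source's value column can play the role of the unique event identifier; it is combined with a watermark so the deduplication state stays bounded:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamDeduplication").getOrCreate()

# Hypothetical stream where `value` acts as a unique event identifier
events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
          .selectExpr("value AS event_id", "timestamp AS event_time"))

# With a watermark, Spark only keeps deduplication state for the allowed
# lateness window; without it, state for every identifier grows without bound
deduped = (events
           .withWatermark("event_time", "10 minutes")
           .dropDuplicates(["event_id", "event_time"]))

query = (deduped.writeStream
         .outputMode("append")
         .format("console")
         .start())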
