BDS_12_StreamProcessing_I_2024 PDF
Document Details
Uploaded by TimelySweetPea
Hasso-Plattner-Institut
Martin Boissier
Tags
Related
Summary
This document provides details of instructions for a programming exercise on Big Data Systems and stream processing. It includes announcements, coding exercises and a grading scheme. The document also contains a code of conduct and exercise details.
Full Transcript
Big Data Systems Martin Boissier Stream Processing I Data Engineering Systems Hasso-Plattner-Institut Announcements § Once again: § Submission deadline of Quiz 2 postponed to 2024-12-09 § Programming exercise on B+-trees starts today...
Big Data Systems Martin Boissier Stream Processing I Data Engineering Systems Hasso-Plattner-Institut Announcements § Once again: § Submission deadline of Quiz 2 postponed to 2024-12-09 § Programming exercise on B+-trees starts today § More about that in a minute § Joint Database Systems Seminar with TU Darmstadt § When the Network Becomes the Computer § How can we program in-network computing? § Prof. Lin Wang (University of Paderborn) § Zoom information will be posted in Moodle 2 Programming Exercise 2 3 Grading » Recap quizzes » For each lecture topic » All self-assessment » 4 graded quizzes » All graded » 50% of points in total for exam participation » 3 programming exercises » Passing criterium defined per task » Pass all for exam participation » Exam » February 10 (100% of the grade) Code of Conduct » Asking questions is greatly encouraged » Discuss questions with each other (except exams) » Submit homework individually, but feel free to discuss » The limits of collaboration » Do not just share solutions with each other - explain your solutions » Plagiarism, copying, or other forms of dishonesty will result in failing the course » Communication » Use netiquette in forum, email, chats, etc. » Generally » Treat everyone with respect and consideration -> especially important for online settings » This course and HPI and the university should be safe spaces for everyone Programming Exercises » Via GitHub classroom » Link will be in Moodle » You will get a template git repo » Your submission: » Java code » Text file (README) with answers to the questions » Screenshots » Scoring the exercise: » Code → Autograded » Everything else → Corrected by us » Individual submissions » you may discuss solutions, you may not (!) copy Autograding » Open tests » test scenarios » transparent grading » Please don’t » alter the test code » remove tests » Adding tests » feel free » not included in the scoring » Run the tests locally before submitting » Make sure that you pushed your changes before the deadline 2nd Programming Exercise » Deadline: 17.12.2022 » Will be online by the end of today, latest tomorrow morning » Implement a B+ Tree » Task Description in PDF in Moodle » Framework and tests provided by us 2nd Programming Exercise public abstract class Node { protected Integer[] keys; public Node(Integer[] keys, int capacity) { … } public Integer[] getKeys() { … } public void setKeys(Integer[] keys) { … } } public class InnerNode extends Node { public class LeafNode extends Node { private Node[] children; private String[] values; public InnerNode(Integer[] keys, Node[] children, public LeafNode(Integer[] keys, String[] int capacity) { … } values, int capacity) { … } public Node[] getChildren() { … } public String[] getValues() { … } public void setChildren(Node[] children) { … } public void setValues(String[] values) { … } … … } } 9 Grading and Tests » 96 test cases in total, grouped to get points » 1 point per graded test in GitHub Classroom » Small tests: find keys (1p), insert (1p), delete (1p) » Full tests find keys (1p) » Full tests insert: easy (1p), medium (1p), hard (1p) » Full tests delete: easy (1p), medium (1p), hard (1p) » 10 points total, 6 points to pass » Deadline: 17.12.2022 Timeline I Date Tuesday Wednesday 15.10. /16.10 Intro / Organizational Use Case - Search Engines 22.10. / 23.10. Performance Management Intro to GitHub Classroom 29.10. / 30.10. Map Reduce I Map Reduce II 5.11. / 6.11. Map Reduce III Exercise 12.11. / 13.11. Data Centers Cloud 19.11 / 20.11. File Systems Exercise 26.11. / 27.11. Key Value Stores I Key Value Stores II 3.12 / 4.12. Key Value Stores III Stream Processing I 10.12. / 11.12. Stream Processing II Exercise 17.12. / 18.12. ML Systems I Exercise Christmas Break 11 Timeline II Date Tuesday Wednesday 7.1./ 8.1. ML Systems II Modern Hardware I 14.1. / 15.1. Modern Hardware II Exercise 21.1. / 22.1. TBD TBD 28.1. / 29.1. Industry Talk Exercise 4.2. / 5.2. Exam Prep Buffer/Exam prep 10.2. Exam 12 28.01.2025 Industry Talk – Markus Dreseler (Snowflake) § Markus Dreseler § Senior Software Engineer, Snowflake § Initiated and led development of Hyrise (v2) § 2021 PhD at chair of Prof. Plattner § Snowflake § The ”first real” cloud-based data warehouse § Now: “The AI Data Cloud” (warehouse, AI, and data lake in one) § Handles 6.3 billion queries each day § Talk: Performance Engineering at Snowflake § Benchmarking § Profiling § Optimizing § Telemetry 13 This Lecture § Stream Processing Motivation § Streams § Basic Stream Processing § Stream Processing Execution Models Sources § Tyler Akidau‘s blog and book on streaming - https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 § Vasiliki Kalavri: CS 591 K1: Data Stream Processing and Analytics - https://vasia.github.io/dspa21/ § Matthias Boehm: https://mboehm7.github.io/teaching/ss19_dbs/index.htm § Asterios Katsifodimos: Apache Flink Tutorial 14 Where are we? § Low latency data processing Application / Query Language / Analytics / Visualization Application § Handling unlimited data Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 15 Why Stream Processing Alibaba Singles Day 2019 § USD 1B$ in 1 min 8 sec min § USD 38B$ the whole day At this rate, stream processing is key! § For serving models § For monitoring §… § Use TU Berlin & HPI stream processing system: Apache Flink 16 Stream Processing Use Cases § Ubiquitous Data Streams § Event and message streams (e.g., click stream, twitter) § Sensor networks, IoT, and monitoring (traffic, environment, networks) § General Use Cases § Monitoring and alerting (notifications on events / patterns) § Real-time reporting (aggregate statistics for dashboards) § Real-time ETL and event-driven data updates § Real-time decision making (fraud detection) § Data stream mining (summary statistics w/ limited memory) 17 Stream Processing s e rie s Qu erie Qu eries Qu eries Qu Data Stream Stream Processor Result Stream ML System Stream Processor § Challenge § Potentially unlimited data set § Many different queries § Continuous results 18 Traditional Data Management Approaches static data Data Warehouse Online Transaction Management System Complex, offline queries Batched updates during DW OTMS Ad-hoc, one-time queries downtimes, e.g. every night Low rates of arbitrary updates Large, static historical data Data-driven processing model offline online Streaming Data Warehouse Data Stream Management System Light-weight on-the-fly materialized view updates SDW DSMS Continuous queries Pre-processed and historical High rates of append-only updates data streaming data Query-driven processing model Based on M. Tamer Özus Data Stream Management Book - https://www.morganclaypool.com/doi/abs/10.2200/s00284ed1v01y201006dtm005 19 8 Requirements of Big Streaming § Keep the data moving § Integrate stored and streaming data § Streaming architecture § Hybrid stream and batch § Declarative access § Data safety and availability § E.g. StreamSQL, CQL § Fault tolerance, durable state § Handle imperfections § Automatic partitioning and scaling § Late, missing, unordered items § Distributed processing § Predictable outcomes § Instantaneous processing and response § Consistency, event time The 8 Requirements of Real-Time Stream Processing – Stonebraker et al. 2005 20 Why is this hard? Stream Processing Image: Peter Pietzuch § Tension between performance and algorithmic expressiveness 21 Data Streams What is a Stream? Unbounded data § Conceptually infinite, ever growing set of data items / events § Practically continuous stream of data, which needs to be processed / analyzed Push model § Data production and procession is controlled by the source § Publish / subscribe model Concept of time § Often need to reason about when data is produced and when processed data should be output § Time agnostic, processing time, ingestion time, event time This part is largely based on Tyler Akidau‘s great blog on streaming - https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 23 Stream Models S = si, si+1, … si = Turnstile § Elements can come and go § Underlying model is a vector of elements (domain) § si is an update (increment or decrement) to a vector element § Traditional database model § Flexible model for algorithms Cash register § Similar to turnstile, but elements cannot leave Time series § si is is a new vector entry § Vector is increasing § This is what all big stream processing engines use 24 Turnstile Example S = si, si+1, … si = § Managing IP Open Connections § Monitor active IP network connections § Connections can be initiated and terminated Data Stream... Count (srcIP, destIP) 25 Cash Register Example S = si, si+1, … si = § Car Monitoring § Keep protocol of all cars that enter a street § Once added to protocol car cannot leave it L PF 76 H RR 87 P KL 11 B PQ 66 B RS 32 Data Stream... 26 Time Series Example S = si, si+1, … si = § Twitter Data Analytics § Most current statistics of user behavior § User generate data items with continuously growing timestamps Top 1000 used Data Stream hash tags Result Stream in last 10 hours 27 Event Time § Event time § Data item production time § Ingestion time Processing Time § System time when data item is received Real § Processing time Skew Ideal § System time when data item is processed § Typically, these do not match! § In practice, streams are unordered! Event Time 28 Processing Time vs Event Time Example Event Time Episode V Episode VI Episode I Episode II Episode III Episode VII Episode VIII Episode IX Episode IV The Empire Return of The Phantom Attack of Revenge of The Force The Last The Rise of A New Hope Strikes Back the Jedi Menace the Clones the Sith Awakens Jedi Skywalker 1977 1980 1983 1999 2002 2005 2015 2017 2019 § Event time Processing Time § Data item production time § Processing time § System time when data item is processed 29 Durability and Consistency Guarantees § At Most Once § “Send and forget”, ensure data is never counted twice § Will cause data loss on failures § At Least Once § “Store and forward” or acknowledgements from receiver, replay stream from a checkpoint on failures § Might create incorrect state (processed multiple times) § Exactly Once § “Store and forward” w/ guarantees regarding state updates and sent msgs § Often via dedicated transaction mechanisms 30 Stream Processing Streaming Processing Job Data Stream Source Filter Join Sink Result Stream Data Stream Source Aggregation Dataflow Operators Records Control events State 32 Time Agnostic Processing Filtering § Stateless § Can be done per data item 33 Stateful Processing § Word Count § Median § Requires a mechanism to determine when to return results -> windows 34 Processing Windows § Windows § Tumbling § Sliding § Session § … § Triggered by (ready for processing) § Event time, processing time, count, watermark § Eviction policy (remove events from window) § Window width / size 35 Tumbling Window l Key 1 Key 2 Key 3 12:00 12:10 12:20 Event Time § Characteristic § Splits time into segments of equal length l § End of one window is start of next one 36 Sliding Window l s Key 1 Key 2 Key 3 12:00 12:10 12:20 Event Time § Characteristic § Splits time into segments of equal length l and slide s § Windows overlap if l > s § Overlap can be huge 37 Session Window g Key 1 Key 2 Key 3 12:00 12:10 12:20 Event Time § Characteristic § Defines a period of activity followed by a period of inactivity § Session windows end when no events arrive for some time gap ≥ g 38 Processing Time Windows § System waits for x time units § System decides on stream partitioning § Simple, easy to implement § Ignores any time information in the stream -> aggregation result can be arbitrary § Similar: Counting Windows 39 Event Time Windows Input Output Time § Windows based on the time information in stream § Adheres to stream semantic § Correct calculations § Buffering required, potentially unordered 40 Windowing Measures § Time § Event time § Processing time §... § Count § Number of events § Data-dependent § Punctuation, signal § Exceeding a threshold §... 41 Stream Discretization The window After some time, Streams are represented as operator keeps a data-items expire FIFO-Queue of data-items FIFO-Buffer (they are deleted) 42 Stream Discretization The window operator is event driven by data-item arrivals 1.) Trigger Policies (TPs) 2.) Eviction Policies (EPs) Specify when the aggregate is Specify when data-items are executed on the current buffer removed from the buffer. content. Defines the size of a window. Define the moment that results are emitted. 43 Stream Discretization Query Example (tumbling/fixed window of size 3): dataStream.window(Count.of(3)) 1.) Trigger Policies (TPs) 2.) Eviction Policies (EPs) Specify when the aggregate is Specify when data-items are executed on the current buffer removed from the buffer. content. Defines the size of a window. Define the moment that results are emitted. 44 Stream Discretization 1.) Trigger Policies (TPs) 2.) Eviction Policies (EPs) Specify when the aggregate is Specify when data-items are executed on the current buffer removed from the buffer. content. Defines the size of a window. Define the moment that results are emitted. 45 Stream Discretization 1.) Trigger Policies (TPs) 2.) Eviction Policies (EPs) Specify when the aggregate is Specify when data-items are executed on the current buffer removed from the buffer. content. Defines the size of a window. Define the moment that results are emitted. 46 Basic Stream Operators Windowed Aggregation 4 8 § E.g., average speed 3 § Sum of URL accesses 5 § Daily high score 4 7 Aggregate 47 Basic Stream Operators Windowed Aggregation 4 8 § E.g., average speed 3 § Sum of URL accesses 5 Join § Daily high score 4 7 Aggregate Windowed Join § Correlated observations in timeframe § E.g., temperature in time 48 Window Aggregation Functions § Distributive Functions § Final values can be computed as the aggregation of partial aggregates with constant size § Sum, min, max, … § Algebraic Functions § Final values can be computed by applying function on partial aggregates of fixed size § Average, N largest values § Holistic Functions § Partial aggregates have unbounded size § Median, rank, mode (most frequent) 49 Windowed Join § Basic stream join § Tumbling window – use classic join methods § NL Join, Hash Join, … § Sliding window optimization (symmetric for both R and S) For each new r in R: 1. Scan window of stream S to find matching tuples 2. Insert new r into window of stream R 3. Invalidate expired tuples A in window of stream R B 50 Double-Pipelined Hash Join § Join of bounded streams (or unbounded w/ invalidation) § Equijoin predicate, symmetric and non-blocking emit 7(ghvu) emit 7(ghzy) § For every incoming tuple (e.g., left): emit 1(efxw) § probe (right)+emit emit 1(abxw) § build (left) HR,RID HS,SID 1 2 7 ⋈RID=SID 1 7 (1,ab) (7,zy) (2,cd) (1,xw) (1,ef) Stream R Stream S (7,vu) (7,gh) 51 Stream Processing Execution Model Stream Processing § Databases can process very large data since forever (see VLDB) § Why not use those? § Big data is not (fully) structured § No good for database L § We want to learn more from data than just § Select, project, join § First solution: MapReduce 53 MR / Batch Processing Image: Tyler Akidau 54 MR / Batch Window Processing Image: Tyler Akidau 55 MR Discussion Image: Tyler Akidau § Great for large amounts of static data § For streams: only for large windows § Data is not moving! § High latency, low efficiency 56 How to keep data moving? Discretized Streams (mini-batch) Stream discretizer while (true) { Job Job Job Job // get next few records // issue batch computation } Native streaming Long-standing while (true) { operators // process next record } 57 Discussion of Mini-Batch Easy to implement Easy consistency and fault-tolerance Hard to do event time and sessions Image: Tyler Akidau 58 True Streaming Architecture Data Stream Source Filter Join Sink Result Stream Data Stream Source Aggregation Dataflow Stream transformations § Basic transformations: Map, Reduce, Filter, Aggregations… Operators § Binary stream transformations: CoMap, CoReduce… Records § Windowing semantics: Policy based flexible windowing (Time, Count, Delta…) Control events § Temporal binary stream operators: Joins, Crosses… State § Native support for iterations 59 Summary § Stream Processing Motivation emit 7(ghvu) § Streams emit 7(ghzy) emit 1(efxw) emit 1(abxw) § Basic Stream Processing HR,RID HS,SID § Stream Processing Execution Models 1 2 7 ⋈RID=SID 1 7 (1,ab) (7,zy) (2,cd) (1,xw) (1,ef) Stream R Stream S (7,vu) (7,gh) Data Stream Source Filter Join Sink Result Stream Data Stream Source Aggregation 60 Next Part § Stream Processing II § Advanced concepts Application / Query Language / Analytics / Application § Systems Visualization Data Processing Data Management Big Data Systems File System Virtualization / Container OS / Scheduling Infrastructure Hardware 61 Thank you for your attention! § Questions? § In Moodle § Per email: [email protected] § In Q&A sessions 62