Lecture #2.1 - Streaming Data Introduction.pdf
IE University
MODERN DATA ARCHITECTURES FOR BIG DATA II
STREAMING DATA INTRODUCTION

AGENDA
What's a Real-time system?
  Soft & Near Real-time
Real-time vs Streaming Data systems
Streaming Data systems terminology
  Event vs Processing/Stream time
  Windows of data
  Message delivery semantics
Use cases

1. WHAT'S A REAL-TIME SYSTEM?

WHAT DO YOU UNDERSTAND BY REAL-TIME?

REAL-TIME SYSTEMS CLASSIFICATION
Real-time systems have been around for decades. Their recent popularity has caused ambiguity and debate about what the term means.
The following two dimensions help clarify it:
  Latency - how long it takes, on average, to process data
  Tolerance for delay - how critical it is when insights arrive late

REAL-TIME SYSTEMS CLASSIFICATION
In the Data Analytics space it's only soft and near real-time:

SOFT & NEAR REAL-TIME
Latency is measured in milliseconds, seconds or even minutes.
A delay in results is accepted most of the time: no lives are at risk.
Soft & Near Real-time is often simplified to Real-time.
Non Big Data Real-time Systems are designed as follows:

2. REAL-TIME VS STREAMING DATA SYSTEMS

STREAMING DATA SYSTEMS
A non-hard real-time system that makes data or insights available.
Data or insights are available when consumers need them.
Consumers may not need data or insights in real time due to:
  Network delays
  Application design (ex. scheduled synchronizations, not running 24x7)
Decoupling the Data Service from the Consumer is needed:

WHAT DOES DECOUPLING IN DISTRIBUTED SYSTEMS LIKE BIG DATA PLATFORMS MEAN?

STREAMING DATA SYSTEMS
Data stream → continuous flow of data:
  Usually modeled as a sequence of elements (ex. messages, events)
  Theoretically unbounded/infinite in size
Streaming Data systems process huge amounts of data which:
  is very expensive in terms of IT resources
  might need to be reduced by approximations or aggregations:
    Sampling the stream
    Filtering the stream to keep relevant elements
    Estimating number of different elements...
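
The three reduction techniques listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something taken from the lecture: the function names are hypothetical, sampling uses reservoir sampling, and the distinct-element estimate uses a K-Minimum-Values sketch, which is one of several possible choices.

```python
import hashlib
import heapq
import random
from typing import Callable, Iterable, Iterator, List, Set, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Sampling: keep a uniform random sample of k elements from a stream of unknown length."""
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # The (i + 1)-th element is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def filter_stream(stream: Iterable[T], is_relevant: Callable[[T], bool]) -> Iterator[T]:
    """Filtering: lazily pass through only the relevant elements."""
    for item in stream:
        if is_relevant(item):
            yield item


def estimate_distinct(stream: Iterable[object], k: int = 256) -> float:
    """Estimating distinct elements with O(k) memory (K-Minimum-Values sketch)."""
    heap: List[float] = []   # negated hashes, so heap[0] is the largest kept hash
    kept: Set[float] = set()  # the same hashes, for duplicate detection
    for item in stream:
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64  # hash mapped to [0, 1)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes seen: count is exact
    return (k - 1) / -heap[0]    # k-th smallest hash value sets the estimate
```

All three functions read the stream once and keep a bounded amount of state, which is the point of reducing an unbounded stream. The estimate trades accuracy for memory: a larger `k` gives a tighter estimate at the cost of a larger sketch.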
STREAMING DATA BLUEPRINT
Big Data Real-time Systems are designed* as follows:
* Does this ring a bell? Indeed! It's very similar to our beloved Big Data Pipeline/Data Value Chain.

3. STREAMING DATA SYSTEMS TERMINOLOGY

THREE MAIN CONCEPTS
Real-time Big Data projects usually deal with:
  Event vs Processing Time → two timestamps to consider
  Windows of data → technique to limit unbounded data
  Message Delivery Semantics → how to deal with critical data

EVENT VS PROCESSING/STREAM TIME
Messages/events can be timed in two different ways:
  Event time - when the event occurs
  Processing/Stream time - when the event enters the streaming system
More often than not, Event time != Processing time:
  Processing-time lag* - delay between when an event occurs and when it is processed
  Event-time skew* - how far behind the event is (in event time) when it is processed
* They’re just two ways of looking at the same thing; processing-time lag and event-time skew at any given point in time are identical.

WINDOWS OF DATA
Streams of data are infinite in size → the data doesn't fit in memory.
Window of data - amount of data processed at a given time.
Attributes present in all windowing techniques:
  Trigger policy - when all data in the window has to be processed
  Eviction policy - when an event has to be removed from the window

WINDOWS OF DATA
Two main categories of windowing strategies:
  Tumbling windows - based on the number of events in the window:
  Time based windows - three different techniques:

WINDOWS OF DATA - FIXED TIME WINDOW
Good example of Fixed Time Window → candlesticks:
Summarization of asset price (event) evolution over a period of time:
  Open price - first asset price measured within the period of time
  Close price - last asset price measured within the period of time
  Highest price - max aggregation function applied to asset price over the period of time
  Lowest price - min aggregation function applied to asset price over the period of time
Period of time = Fixed Time Window → 15m, 30m, 1h,...
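
The candlestick summarization above can be sketched as a small Python generator. This is a minimal illustration, not the lecture's implementation: the `Candle` type and `candlesticks` function are hypothetical names, ticks are assumed to be `(event_time_seconds, price)` pairs, and they are assumed to arrive ordered by event time (late or out-of-order events are not handled).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Candle:
    window_start: int  # start of the fixed time window, in seconds (inclusive)
    open: float        # first price seen in the window
    high: float        # max aggregation over the window
    low: float         # min aggregation over the window
    close: float       # last price seen in the window


def candlesticks(ticks: Iterable[Tuple[int, float]], window_seconds: int) -> Iterator[Candle]:
    """Summarize price ticks into one candle per fixed time window."""
    current: Optional[Candle] = None
    for event_time, price in ticks:
        # Align the event time to the start of its fixed window.
        start = event_time - event_time % window_seconds
        if current is not None and start != current.window_start:
            # Trigger policy (assumed): a tick for a later window closes the open one.
            yield current
            # Eviction policy (assumed): the closed window's state is dropped.
            current = None
        if current is None:
            current = Candle(start, price, price, price, price)
        else:
            current.high = max(current.high, price)
            current.low = min(current.low, price)
            current.close = price
    if current is not None:
        yield current  # flush the last, possibly incomplete, window
```

For 15-minute candles one would call `candlesticks(ticks, 900)`. Only one candle is held in memory at a time, so the unbounded stream never has to fit in memory.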
* More info about candlesticks at What Is a Candlestick Chart & How Do You Read One?.

WINDOWS OF DATA - SLIDING TIME WINDOW
An example of Sliding Time Window → Simple Moving Average:
  Window size = fixed number of candlesticks to consider (5, 10,...)
  Sliding step = one candlestick at a time
  Candlestick = time → Time Based Window, not a Tumbling Window
The SMA function for a given point in time is calculated like this:
  Consider the values of a metric (ex. close) for the last fixed number of candlesticks (5, 10,...)
  Apply an average aggregation function to all those values
* More info about SMA at Simple Moving Average (SMA) Define and How to use it.

MESSAGE DELIVERY SEMANTICS
Related to producers, brokers, consumers and messages.
Three semantic guarantees when it comes to message delivery:
  At most once - a message may get lost, but the consumer never sees duplicates
  At least once - a message always gets delivered, but the consumer may see duplicates
  Exactly once - a message always gets delivered and the consumer never sees duplicates

4. USE CASES

CLICKSTREAM/SITE ANALYTICS
Stream of visitors (identified by cookies)
Some potential applications:
  Find hot-links
  Access of frequent customers
  Probability of clicking over new content

HIGH-FREQUENCY TRADING
Stream of stock trades
Some potential applications:
  Trading opportunities that may open up for ms or secs
  Identify small price imbalances and generate sizable profits

PREDICTIVE MAINTENANCE
Stream of measures (whatever can be measured)
Some potential applications:
  Condition based maintenance
  Avoiding unnecessary maintenance and use of spare parts
  Guaranteeing levels of availability

FRAUD DETECTION
Stream of actions (whatever can be done)
Some potential applications:
  Credit card fraud
  Stock trading fraud
  Video game cheaters
  Cyber Security risks

CONGRATS, WE'RE DONE!