Data Processing (Batch vs. Stream)
Summary
These lecture notes cover data processing techniques, focusing on batch and stream processing. The document explains the characteristics, examples, and scaling methods for both types of processing. Real-world examples are discussed, providing a practical understanding of the concepts.
Full Transcript
Data Processing: Batch vs. Stream Processing (10/17/2024)

Batch Processing
- Processing data in groups (batches) from start to finish, with no data added in between.
- Typically run because of an interval starting event.
- Processed in a certain size (the batch size).
- An instance of a batch process is often referred to as a job.

Batch Processing: Key Characteristics
- Run periodically, triggered by an event or time interval.
- Common scenarios: reading log files, sending/receiving emails, generating reports.
- Examples: processing server logs after hours or generating daily sales reports.

Why batch?
- Simple.
- Generally consistent.
- Multiple ways to improve performance.

What is scaling?
Scaling means improving performance, in one of two senses:
- Processing more quickly: less time to process the same amount of data.
- Processing more data: more data processed in the same amount of time.

Horizontal Scaling
Adding more machines (or CPUs) to distribute the workload across multiple systems.
How it works: tasks are divided into smaller parts and processed by multiple systems in parallel. Each machine works on a portion of the overall task, increasing the system's capacity to handle more data.

Horizontal Scaling Characteristics
- Parallel processing: best suited for tasks that can be divided and executed simultaneously across multiple machines.
- Cost-effective: in many cases, adding low-cost servers is cheaper than investing in a single high-performance machine.
- Near-linear performance improvements: for certain processes, horizontal scaling can achieve near-linear speedups.
- Requires distributed systems: needs more sophisticated processing frameworks (e.g., Apache Spark, Hadoop, or Kafka for stream processing) to manage the distributed architecture.
- Increased complexity: horizontal scaling requires extensive networking, load balancing, and ongoing management.
  Systems must coordinate between multiple machines, handle network latency, and ensure data consistency.

Horizontal Scaling in Cybersecurity
- Distributed Intrusion Detection Systems (IDS): horizontal scaling can monitor vast amounts of network traffic by distributing the workload across several nodes, enabling faster real-time threat detection.
- Cloud-based security solutions: cloud service providers like AWS, Google Cloud, and Azure rely on horizontal scaling to expand their cybersecurity services (e.g., scaling for DDoS attack mitigation or malware detection).

Vertical Scaling
Improving the performance of a single machine by adding more resources (e.g., CPU, memory, storage).
How it works: in vertical scaling, you enhance the computing power of a single server by increasing its processing capacity, such as adding more RAM, upgrading to a faster CPU, or increasing I/O speed.

Vertical Scaling Characteristics
- Simpler to implement: vertical scaling doesn't require changes to the system architecture or software; you scale by upgrading hardware components.
- Easier management: no need for distributed systems or complex networking, since everything runs on a single, more powerful machine.
- Limited by hardware: there is a limit to how much you can upgrade a single machine; once its resources are maxed out, further scaling is impossible.
- Single point of failure: since everything relies on one machine, the entire system is at risk if that machine goes down.

Vertical Scaling in Cybersecurity
- On-premise security systems: vertical scaling is often used in on-premise cybersecurity setups, where upgrading a firewall, server, or other security infrastructure with more powerful hardware can handle higher data loads (e.g., processing more network packets in intrusion detection).
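The horizontal-scaling idea of dividing a batch across parallel workers can be sketched in miniature. This is a hypothetical toy, not a production pattern: Python threads stand in for separate machines, and the log lines are synthetic.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical batch of server log lines; in a real system these would be
# read from files or a distributed store.
logs = [
    f"2024-10-17 host{i % 4} status={'500' if i % 7 == 0 else '200'}"
    for i in range(1000)
]

def count_errors(chunk):
    """Worker: process one slice of the batch (one 'machine's share')."""
    return sum(1 for line in chunk if "status=500" in line)

def split(batch, n):
    """Divide the batch into n roughly equal chunks."""
    size = (len(batch) + n - 1) // n
    return [batch[i:i + size] for i in range(0, len(batch), size)]

# Horizontal scaling in miniature: four workers each handle one portion,
# and the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_errors, split(logs, 4)))

total_errors = sum(partial_counts)
```

The combine step at the end is the coordination cost the slide warns about: in a real distributed job, the partial results would travel over the network, and the system would also have to handle stragglers and failed workers.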
- Edge computing: in environments where data is processed locally, vertical scaling can improve processing speeds, for example on edge devices used for IoT security or smart-city surveillance.

Comparison Table: Horizontal vs. Vertical Scaling

| Aspect                    | Horizontal Scaling                            | Vertical Scaling                               |
|---------------------------|-----------------------------------------------|------------------------------------------------|
| Definition                | Adding more machines or servers               | Adding more resources (CPU, memory, storage)   |
| Best for                  | Distributed tasks, parallel processing        | Single-machine processing, improving hardware  |
| Cost                      | Can be cost-effective with commodity hardware | Can be more expensive due to hardware upgrades |
| Complexity                | More complex, requires distributed systems    | Simpler, fewer changes to architecture         |
| Failure points            | Redundancy through multiple machines          | Single point of failure                        |
| Scalability               | Virtually unlimited, just add more machines   | Limited by the capacity of a single machine    |
| Use case in cybersecurity | Distributed IDS, cloud security services      | On-premise systems, edge computing devices     |

Batch Processing Challenges
- Delays: high latency due to the time it takes to collect, process, and analyze data.
- Scalability: requires vertical or horizontal scaling strategies to handle growing data volumes.
- Case study: a company takes 23 hours to process 100 GB of logs per day; as data volume grows, the job exceeds its 24-hour processing window.

Batch Processing Workflow
- Data is collected, stored, and processed in batches at regular intervals.
- Suitable for analyzing data after it has been collected, which is crucial for large-scale log file analysis and malware detection.

Real-Life Examples
- Daily log file processing: companies often collect large volumes of security logs and process them at the end of the day to detect anomalies.
- Historical data analysis: batch processing effectively identifies long-term trends, such as studying past network attacks or data breaches.

Stream Processing: Basics
Continuous data processing as it arrives, on the fly, with no fixed size or end.
Characteristics: real-time data processing, often low-latency; suitable for IoT and surveillance.
Use cases: real-time fraud detection, live sensor monitoring, and social media sentiment analysis.

Major Components in Stream Processing
- Application (generating the stream of data)
- Message processor
- Stream processor
- Data storage (stores processed data, state, etc.)

Data Streaming Lifecycle
- Data is generated in real time by upstream sources (e.g., IoT devices, surveillance systems).
- Stream processors handle the data flow, applying real-time analytics and emitting results in real time.

Real-Life Applications
- Intrusion Detection Systems (IDS): monitor network traffic in real time, detecting anomalies as they occur.
- Fraud detection: streaming financial transaction data enables real-time fraud analysis.

Challenges of Stream Processing
- Handling out-of-order data: stream processors must manage late-arriving data.
- Scalability: the system must scale dynamically to handle spikes in data volume.

Comparing Batch vs. Stream Processing
Batch processing:
- Pros: simple, consistent, and easier to scale (both vertically and horizontally).
- Cons: high latency and delays in data availability.
Stream processing:
- Pros: real-time insights suited for time-sensitive operations.
- Cons: more complex; requires a robust architecture for fault tolerance and real-time handling of massive data streams.

Micro-Batches (Near Real-Time)
- A message is not processed immediately after delivery.
- Messages are processed together in small batches.
- Latency is at least the length of the batch interval (which usually leads to higher throughput).
- Output is available within seconds or tens of seconds.

Comparing Real-Time Streaming vs. Micro-Batching
- Real-time streaming: each data message is processed immediately, with low latency (milliseconds).
- Micro-batching: data is processed in small batches, with slightly higher latency but increased throughput.
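The micro-batching idea, grouping messages by batch interval instead of processing each one on arrival, can be illustrated with a minimal sketch. The message timestamps, values, and one-second interval below are hypothetical.

```python
from collections import defaultdict

# Hypothetical stream of (arrival_time_seconds, value) messages.
messages = [(0.2, 5), (0.9, 3), (1.1, 7), (1.8, 2), (2.4, 9), (3.7, 1)]

BATCH_INTERVAL = 1.0  # seconds; latency is at least this long

def micro_batches(stream, interval):
    """Group messages into small batches keyed by their interval number."""
    batches = defaultdict(list)
    for t, value in stream:
        batches[int(t // interval)].append(value)
    # Each batch is processed together once its interval has elapsed.
    return dict(batches)

batches = micro_batches(messages, BATCH_INTERVAL)

# Per-batch aggregation (here, a sum) replaces per-message processing,
# trading a little latency for higher throughput.
batch_sums = {k: sum(v) for k, v in batches.items()}
```

A pure real-time streamer would instead call a handler once per message as it arrives; the batching layer above is exactly what it omits.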
Time Windowing
- Time windows group data points together for processing within a specific period (e.g., a 5-second window to detect anomalous network traffic).
- Time windows are essential for aggregating and analyzing data in both batch and streaming contexts.

Event Time
- The time the source generated the data event (e.g., when an attack occurs).
- In cybersecurity, event time is crucial to understanding when an attack or security breach took place.

Processing Time
- The time the system processed the data event.
- Challenge: latency between event time and processing time can lead to delayed responses.

Practical Example
- Event time: a DDoS attack starts at 2:00 PM, generating abnormal traffic patterns.
- Processing time: the anomaly detection system processes the traffic at 2:05 PM, detecting the attack.
- Issue: the 5-minute gap between event time and processing time could cause a significant loss for a business.
- Solution: by reducing latency (via stream processing), processing time can be brought closer to event time, minimizing delays in detection and response.

IoT in Cybersecurity and Data Processing
- IoT data streaming: IoT devices generate continuous data streams that are highly time-sensitive.
- Time windowing: as above, time windows group data points for processing within a specific period (e.g., a 5-second window to detect anomalous network traffic) and are essential in both batch and streaming contexts.

Example: a smart city's traffic monitoring system uses a sliding time window to detect anomalies like unauthorized vehicle access in sensitive areas. The system aggregates data from cameras and sensors in 5-second intervals to monitor real-time activity.

Challenges in Time Windowing for Cybersecurity
- Out-of-order events: in real-time streaming, data often arrives out of order due to network delays.
  Time windows must account for this by waiting for delayed events or by using techniques like watermarking to manage late data.
- Example: in a network intrusion detection system, delayed packets may arrive after the window has closed, requiring the system to handle out-of-order events intelligently.

Challenges in Time Windowing for Cybersecurity (cont.)
- Selecting the right window size: small windows provide more granular real-time analysis but increase computational overhead; large windows allow more efficient processing but may introduce latency in detecting attacks.
- Example: in a real-time fraud detection system, smaller time windows (e.g., 10 seconds) allow immediate detection of suspicious financial transactions but require more resources for continuous monitoring.

Real-Time Stream Processing and Time Windows
- Sliding time windows: sliding windows (e.g., 1-minute intervals sliding every 10 seconds) allow real-time data to be continuously analyzed for anomalies, such as sudden spikes in network traffic. In intrusion detection systems (IDS), sliding windows group packets of data and evaluate them in near real time.
- Batch time windows: for systems using batch processing, time windows are larger, and data is collected over longer periods (e.g., hourly or daily). Example: analyzing logs for malware detection at the end of each day using a daily batch time window.
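The sliding-window anomaly detection described above can be sketched as follows. The event times, window length, slide, and threshold are all hypothetical, and the sketch operates on event time only; a real system would also need watermarking to hold windows open for late-arriving events.

```python
# Hypothetical packet event times in seconds (event time, not processing time).
events = [0.5, 1.2, 1.3, 1.4, 1.6, 1.8, 2.1, 5.0, 5.2, 9.7]

WINDOW = 2.0   # window length in seconds
SLIDE = 1.0    # how far each successive window's start advances
THRESHOLD = 4  # more events than this in one window counts as "anomalous"

def sliding_window_counts(times, window, slide, horizon):
    """Count events falling in each [start, start + window) sliding window."""
    counts = []
    start = 0.0
    while start < horizon:
        n = sum(1 for t in times if start <= t < start + window)
        counts.append((start, n))
        start += slide
    return counts

counts = sliding_window_counts(events, WINDOW, SLIDE, horizon=10.0)
anomalous = [start for start, n in counts if n > THRESHOLD]
```

Because the windows overlap (each event falls into window/slide = 2 windows here), a burst of traffic is flagged sooner than it would be with non-overlapping (tumbling) windows of the same length.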