Big Data Processing - 2024 EAE PDF
Document Details
EAE Business School
2024
Summary
This document provides an introduction to big data processing, including infrastructures, types, development, and applications. The document also explores batch and stream processing, cloud computing, and the related technologies.
Full Transcript
Big Data Processing - 2410
02. Introduction to massive data processing: infrastructures, types, development and applications (eae.es)

Agenda
1. INTRODUCTION TO BIG DATA INFRASTRUCTURES
2. DISTRIBUTED COMPUTING
3. HADOOP ECOSYSTEM
4. DATA PROCESSING: BATCH & STREAM PROCESSING
5. CLOUD COMPUTING FOR BIG DATA

1. INTRODUCTION TO BIG DATA INFRASTRUCTURES

HDFS (Hadoop Distributed File System)
Definition: HDFS is a distributed file system. It combines the storage capacity of multiple distributed machines into a single, very large file system, allowing directories to be created and data to be stored in files distributed across many machines. (A short client-side sketch of these operations appears after the V's overview below.)

Key features of HDFS:
- Distributed: data is stored across multiple machines, making the system scalable and fault-tolerant.
- Horizontally scalable: new machines can be added to increase storage and processing capacity.
- Cost-effective: you can start with simple, low-cost machines and scale up as needed.
- Fault-tolerant: designed to handle hardware failures without data loss or downtime.
- High throughput: optimized for large-scale data processing, ensuring efficient performance.

There has been an explosion in the amount of data generated by applications: social networks, web applications, mobile apps, IoT devices, log files, etc. It is therefore necessary to explore methods that enable:
1. Transmitting and distributing data between distributed applications (message brokers: RabbitMQ, ActiveMQ, Kafka).
2. Storing and securing data in a distributed manner (technologies: Hadoop HDFS, and NoSQL databases such as Cassandra, MongoDB, HBase, Elasticsearch, etc.).
3. Processing and analyzing data in a distributed way to extract knowledge for decision-making:
   - Big Data processing: batch processing (e.g., MapReduce, Spark) and stream processing (e.g., Kafka Streams, Flink, Storm, Samza).
   - Data mining and machine learning: TensorFlow, DeepLearning4J, Weka, etc., for knowledge extraction.
4. Analyzing and visualizing decision-making indicators (using Big Data visualization tools).

[Figure: the Big Data cycle - collect and produce (web, mobile, and IoT apps) -> transmit and distribute (brokers such as RabbitMQ, ActiveMQ, Kafka) -> store (HDFS, NoSQL, SQL) -> process (batch and stream processing) -> analyze and extract knowledge (data mining, machine learning).]

Big Data: the 3 V's
1. Volume: the quantity of data (measured in petabytes, PB).
2. Variety: the different formats of data (structured: 20%, unstructured: 80%). Examples: text, CSV, XML, JSON, binary files, databases, etc.
3. Velocity: the frequency at which data is generated. Example: Twitter - every second, approximately 5,900 tweets are sent on the microblogging platform. This represents roughly 504 million tweets per day, or 184 billion tweets per year. This mass of information contributes to the stream of "Big Data" published by humanity on the internet every day.

Big Data: the 5 V's
1. Volume: the amount of data.
2. Variety: the diversity of data formats and types.
3. Velocity: the speed at which data is generated and arrives.
4. Veracity: the reliability and credibility of the collected data (coming from trustworthy sources).
5. Value: the profit and knowledge that can be extracted from the data; transforming data into actionable value.
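Returning to the HDFS definition at the start of this section, the following is a minimal sketch of what "creating directories and storing files in a distributed manner" looks like from client code. It uses Python with pyarrow's HDFS bindings and assumes a reachable NameNode plus an installed Hadoop native client (libhdfs); the host, port, and paths are hypothetical, not values from the slides.

```python
from pyarrow import fs

# Connect to the cluster's NameNode (hypothetical host and port).
hdfs = fs.HadoopFileSystem("namenode-host", port=8020)

# Create a directory; data written under it will be split into blocks
# and replicated across DataNodes, even though the API looks local.
hdfs.create_dir("/data/raw/orders")

# Write a small file. HDFS favors large, write-once/read-many files,
# so real ingestion jobs write big batches rather than single records.
with hdfs.open_output_stream("/data/raw/orders/2024-01-01.csv") as out:
    out.write(b"order_id,amount\n1,99.50\n")

# List the files stored under the directory.
for info in hdfs.get_file_info(fs.FileSelector("/data/raw/orders")):
    print(info.path, info.size)
```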
1. INTRODUCTION TO BIG DATA INFRASTRUCTURES

Example problem: a transport company
1. Volume - the company generates a vast amount of data from its operations:
   - GPS tracking data from thousands of trucks.
   - Fuel consumption records.
   - Delivery schedules for hundreds of customers daily.
   - Sensor data (temperature, humidity) for perishable goods.
   - Electronic Proof of Delivery (ePOD) documents and invoices.
2. Variety - the data comes in diverse formats:
   - Structured data: customer orders, delivery schedules, financial transactions.
   - Semi-structured data: GPS logs, IoT sensor readings, and email communications.
   - Unstructured data: customer feedback, driver reports, and images of damaged goods.
3. Velocity - the speed at which the data is generated and must be processed is crucial:
   - Real-time GPS updates (every second) to track truck locations.
   - Continuous temperature monitoring for refrigerated goods.
   - Real-time alerts for delivery delays, accidents, or maintenance issues.
   - Daily route optimization based on traffic and weather data.
4. Veracity - ensuring data accuracy and reliability is critical:
   - Verifying sensor data for accurate temperature readings in cold-chain logistics.
   - Ensuring customer delivery addresses are correct to avoid delays.
   - Filtering out inaccurate GPS data caused by signal loss.
   - Using reliable data sources (e.g., government traffic databases) to avoid false routing decisions.
5. Value - extracting actionable insights from the data to drive profitability:
   - Optimization: use Big Data to reduce fuel costs by optimizing routes and eliminating empty miles.
   - Customer satisfaction: predict delivery times more accurately, improving customer trust and loyalty.
   - Predictive maintenance: use IoT data from the trucks to predict and prevent breakdowns, minimizing downtime.
   - Strategic decisions: analyze demand trends to reposition trucks more effectively, reducing idle time.

2. DISTRIBUTED COMPUTING

Big Data in everyday life: a chef processing orders
This analogy explains Big Data processing using the example of a chef handling orders in a restaurant:
1. Orders: just as a restaurant receives multiple orders from customers, Big Data systems collect large amounts of incoming data from various sources. Example: online orders, reservation data, or customer preferences.
2. Job: each order represents a "job" that needs to be completed. The chef (the Big Data system) processes these jobs based on priorities, resources, and capacity. Example: the chef prioritizes meals based on preparation time and delivery requirements.
3. Processing: the chef processes the orders by cooking the meals, just as Big Data systems turn raw data into actionable insights. Example: in Big Data, this might involve analyzing customer trends or optimizing operations.
4. Storage: just as the chef uses a refrigerator to store ingredients for future use, Big Data systems store processed or raw data for later analysis. Example: data stored in databases (HDFS, NoSQL) or in cold storage for long-term archiving.
5. Delivery: after processing, the chef delivers the meals to customers. Similarly, Big Data delivers results (insights) to end users or decision-makers. Examples: reports, dashboards, or notifications sent to stakeholders.

[Figure: the kitchen analogy mapped onto distributed processing - input data on distributed storage (the shelf) -> mapping phase (assigning tasks: one cook prepares the sauce, another the meat) -> shuffling and sorting phase (preparing for the next phase) -> reducing phase (assembling the dish / combining results) -> output written to distributed storage (HDFS).]
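To make the analogy and the figure above concrete, here is a minimal, single-machine sketch of the three MapReduce phases in Python. It is illustrative only: a real Hadoop or Spark job distributes the map and reduce tasks across many nodes, and the input lines used here are hypothetical.

```python
from collections import defaultdict

# Input "orders": in a real cluster these lines would live in HDFS blocks.
lines = ["spark hadoop spark", "hadoop hdfs", "spark streaming"]

# Mapping phase: each "cook" turns one line into (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling and sorting phase: group all values that share the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reducing phase: combine the grouped values into one result per key.
result = {word: sum(counts) for word, counts in groups.items()}

print(result)  # {'spark': 3, 'hadoop': 2, 'hdfs': 1, 'streaming': 1}
```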
3. BIG DATA ECOSYSTEM - APACHE HADOOP

Hadoop is a free and open-source framework written in Java, designed to facilitate the creation of massively distributed applications across thousands of nodes.
- At the storage level: distributed storage of data (petabytes) with HDFS (Hadoop Distributed File System).
- At the processing level: data processing with MapReduce.
It supports fundamental features: high availability, scalability, fault tolerance, recovery after failure, and security (HPC: High-Performance Computing).

The core Hadoop framework is composed of the following modules:
- Hadoop Distributed File System (HDFS): distributed file storage.
- Hadoop YARN (Yet Another Resource Negotiator): cluster resource management.
- Hadoop MapReduce: distributed data processing.

3. HADOOP ECOSYSTEM

[Figure: the Hadoop ecosystem grouped by layer - data management, data access, data processing, data storage, distributed programming, and distributed learning/scheduling - with tools such as Hadoop MapReduce, Apache HBase, Apache Spark, Apache Hive, Oozie, Mahout, Sqoop, Storm, and Flume placed in their respective layers.]

The most common components of the Hadoop ecosystem:
1. HDFS
2. Hadoop MapReduce
3. Apache HBase
4. Apache Spark
5. Apache Hive
6. Oozie
7. Mahout
8. Sqoop
9. Storm
10. Flume

4. DATA PROCESSING - BATCH AND STREAM

Batch processing
Definition: batch processing refers to the processing of blocks of already stored data over a specific period. Example: processing all the transactions performed by a financial company over the span of a week.
Characteristics: the data contains millions of records for each day, and these records can be stored as text files (e.g., CSV format) or as records in systems such as HDFS, SQL databases, or NoSQL databases.
Framework examples:
- MapReduce (Hadoop): a framework for batch data processing.
- Spark: another framework designed to process large batches of data efficiently.

Stream processing (real-time processing)
Unlike batch processing, where the data has a defined start and end and is processed once all of it is available, stream processing is designed for data that flows continuously in real time, arriving endlessly for days, months, years, or indefinitely.
Stream processing allows:
- Real-time data processing: data can be handled as soon as it is generated.
- Instant analysis results: stream processing feeds data directly into analytics tools, enabling immediate insights.

Stream processing approaches
1. Native streaming (real-time processing):
   - Definition: each incoming record is processed immediately as it arrives, without waiting for other data.
   - Example technologies: Storm, Flink, Kafka Streams, Samza.
2. Micro-batch processing:
   - Definition: incoming records over short intervals (e.g., seconds) are grouped into small batches and processed together with minimal delay.
   - Example technologies: Spark Streaming, Storm-Trident.
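As an illustration of the micro-batch approach just described, here is a minimal Spark Structured Streaming sketch in Python. It is a sketch under assumptions: it expects a local Spark installation and a text source on localhost:9999 (e.g., started with `nc -lk 9999`); the application name and trigger interval are arbitrary choices, not values from the slides.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("micro-batch-demo").getOrCreate()

# Read an unbounded stream of text lines from a socket source.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Classic streaming word count: split lines into words and count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Micro-batch processing: Spark groups incoming records into small batches,
# here triggered every 5 seconds, and re-emits the updated counts table.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="5 seconds")
         .start())

query.awaitTermination()
```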
4. DATA PROCESSING - STREAM PROCESSING - STRONG AND WEAK POINTS OF NATIVE STREAMING VS. MICRO-BATCH PROCESSING

Native streaming:
- Strong points:
  - Real-time results: immediate processing makes it ideal for applications requiring instant insights (e.g., fraud detection, live monitoring).
  - Low latency: data is processed as soon as it arrives, ensuring minimal delays.
  - Granular processing: each event is handled independently, which is useful for real-time applications.
- Weak points:
  - Complexity: requires more sophisticated programming models for event-by-event processing.
  - Resource intensive: constant processing can strain system resources.
  - Fault tolerance: ensuring data consistency in case of failures is more challenging.

Micro-batch processing:
- Strong points:
  - Efficiency: processing small batches reduces overhead compared to handling individual events.
  - Simpler model: easier to implement and manage, as it follows a more structured, batch-like approach.
  - Integration: compatible with existing batch processing systems (e.g., Hadoop) for hybrid workflows.
- Weak points:
  - Latency: slight delays are introduced by the batch intervals, making it less suitable for strict real-time requirements.
  - Granularity: lacks the fine-grained processing of native streaming, as events are grouped into batches.

4. DATA PROCESSING - STREAM PROCESSING - CHOOSING THE RIGHT APPROACH

- Native streaming: best for applications that demand immediate action, such as financial fraud detection or live stock market analysis.
- Micro-batch processing: suitable for near-real-time analytics and applications where small delays are acceptable, such as dashboard updates or periodic reports.

4. BIG DATA ECOSYSTEM ... KEEPS EXPANDING

The Big Data ecosystem has expanded with many tools, such as:

Stream processing:
- Apache Storm: a framework for computation and distributed processing of data streams (stream processing).
- Apache Flink: a framework for computation and distributed processing of data streams (stream processing).
- Apache Spark: a framework for distributed Big Data processing, offering an alternative to MapReduce and supporting stream processing.
- Apache Kafka Streams: a real-time streaming platform for distributed applications and messaging systems (a minimal producer/consumer sketch follows after this list).
- Apache Zookeeper: a system for managing the configuration of distributed systems and ensuring coordination between nodes.

NoSQL databases (NoSQL DBMSs):
- Apache HBase: a distributed NoSQL database for structured storage in large tables. Note: it requires HDFS.
- MongoDB: a NoSQL ("Not Only SQL") database where data is stored in a distributed manner across nodes as JSON documents.
- Cassandra: a NoSQL ("Not Only SQL") database where data is stored in a distributed manner across nodes, using a wide-column (column-family) model.
- Elasticsearch: a distributed, multitenant search engine accessible through a REST interface.
- Hazelcast: a distributed in-memory cache and NoSQL system, also functioning as a messaging system.
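Message brokers such as Kafka appear twice in this material: as the transport layer between distributed applications and as the backbone of stream processing (Kafka Streams). As a minimal sketch, assuming a broker running on localhost:9092 and the third-party kafka-python client, a producer and a consumer could look like this; the topic name and message payload are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: an application (web, mobile, IoT gateway...) publishes events.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("truck-gps", b'{"truck_id": 42, "lat": 41.39, "lon": 2.17}')
producer.flush()  # block until the message has been delivered to the broker

# Consumer side: a stream processing job (or any service) subscribes to the topic.
consumer = KafkaConsumer(
    "truck-gps",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the beginning of the topic
    consumer_timeout_ms=5000,       # stop iterating after 5 s of inactivity
)
for message in consumer:
    print(message.offset, message.value)
```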
4. BIG DATA ECOSYSTEM ... KEEPS EXPANDING

The Big Data ecosystem has expanded with many tools, such as:
- Apache Pig: a high-level platform for creating MapReduce applications using the Pig Latin language (which resembles SQL) instead of writing Java code.
- Apache Hive: a data warehouse infrastructure for analysis and querying using a language similar to SQL.
- Apache Phoenix: a relational database engine built on top of HBase.
- Apache Impala: a SQL query engine from Cloudera for systems based on HDFS and HBase.
- Apache Flume: a system for collecting and aggregating log files.
- Apache Sqoop: command-line tools for transferring data between relational databases (RDBMS) and Hadoop.
- Apache Oozie: a workflow scheduling tool for managing Hadoop data processing workflows.

5. INTRODUCTION TO CLOUD COMPUTING FOR BIG DATA

Cloud computing provides on-demand access to scalable computing resources, including storage, compute power, and applications.
1. Why the cloud for Big Data? Scalability, cost effectiveness, flexibility, global access.
2. Key cloud service models for Big Data: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS).
3. Benefits of cloud computing for Big Data: elastic scalability, cost optimization, built-in tools, reduced time to market, disaster recovery.
4. Cloud platforms for Big Data: AWS, Azure, GCP, IBM, Oracle, Cloudera.
5. Tools and frameworks for Big Data in the cloud: data storage, data processing, and data integration.
6. Real-world use cases: social media analysis, IoT data processing, e-commerce analytics.
7. Challenges and considerations:
   - Data security and privacy: concerns about storing sensitive data in the cloud; encryption and compliance solutions offered by providers.
   - Cost management: beware of potential overuse of resources without proper cost monitoring tools.
   - Data transfer: potential latency and costs associated with moving large datasets to and from the cloud.
8. Future trends:
   - Serverless computing: serverless architectures (e.g., AWS Lambda, Azure Functions) simplify Big Data workloads by abstracting the infrastructure (a minimal sketch follows at the end of this section).
   - AI and Big Data integration: cloud-based AI/ML tools such as Google AI Platform and AWS SageMaker.
   - Multi-cloud and hybrid models: using multiple cloud providers, or a combination of on-premises and cloud, for flexibility.

Service options in cloud environments
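To illustrate the serverless trend mentioned under Future Trends above, here is a minimal sketch of an AWS Lambda handler in Python that is triggered when a new object lands in an S3 bucket and counts its records. The bucket, key, and processing logic are hypothetical; a production pipeline would typically hand the object off to a managed service (a data warehouse, a queue, or a streaming job) rather than processing it inline.

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an S3 'object created' event (hypothetical bucket/key)."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Download the newly uploaded object and count its lines (e.g., CSV rows).
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    row_count = len(body.splitlines())

    # In a real pipeline the result would be written to a database, a queue,
    # or a dashboard; here we simply return it so it appears in the Lambda logs.
    return {"bucket": bucket, "key": key, "rows": row_count}
```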