Big Data Processing - 2410 - EAE Business School - Models, Architectures, and Tools

Summary

This document covers models, architectures, tools, and high-level languages for massive data processing. Topics include cluster computing, types of clusters, cluster structure, distribution models, replication models (master-slave and peer-to-peer), and examples of distributed systems. It also explores when distributed systems may not be the right solution and surveys high-level programming languages for data processing.

Full Transcript


05. Models, architectures, tools, and high-level languages for massive data processing

Cluster Computing: Types of Clusters

High availability: designed to minimize downtime and provide uninterrupted service when a node fails.

Load balancing: designed to distribute the workload among nodes, ensuring that tasks are shared and executed as quickly as possible. If any node fails, its workload is rebalanced across the remaining nodes so the task can still complete.

Cluster Computing: Cluster Structure

Symmetric: each node functions as an independent computer capable of running applications on its own. The nodes form a subnet, and additional computers can be added without issue.

Asymmetric: a head node connects to all the worker nodes. In this configuration the system depends on the head node, which acts as a gateway to the data/worker nodes.

Distribution Models

Replication: a copy of the same dataset is stored across different nodes.

Sharding: the process of partitioning data and distributing the partitions across different nodes. Two partitions of the same dataset are never placed on the same node.

Replication & Sharding: both can be used independently or together (a minimal sketch combining them appears after the examples below).

Distribution Models: Replication Models

Master-Slave Model: a master node controls the replication process, sending copies of the data to slave nodes and keeping track of where each replica is stored. If the master node fails, the entire system may become inoperable; to mitigate this risk, a secondary master (failover master) should be implemented to take over in case of failure, ensuring continuous operation. This model is commonly used for read-heavy workloads, where the master handles updates and the slaves serve read requests (see the routing sketch below).

Peer-to-Peer Model: there is no central master node. All nodes are equal, and replication is typically used for read operations rather than writes. This approach enhances redundancy and load balancing, making it ideal for distributed systems where high availability is a priority. However, without a central authority, consistency management is more complex and requires conflict-resolution mechanisms.

Key Considerations

Performance & Scalability: the Master-Slave model is efficient for structured workloads but introduces a single point of failure, whereas Peer-to-Peer systems provide more resilience at the cost of complexity.

Use Cases: Master-Slave is often used in traditional databases and enterprise applications, while Peer-to-Peer is more common in decentralized networks, content distribution systems, and blockchain-based architectures.

Distributed Systems: Examples

Storage: explanation of HDFS and how distributed systems work: link
Processing: ETL pipeline with Airflow and Spark [EMR (cloud computing)], loading into Snowflake [a data warehouse solution (cloud storage)]: link
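To make the distribution models above concrete, here is a minimal Python sketch of hash-based sharding combined with replication: each key hashes to a primary node, and replicas are placed on the following distinct nodes, so two copies of the same partition never share a node. The node names, shard function, and replication factor are illustrative assumptions, not any specific database's placement algorithm.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster nodes
REPLICATION_FACTOR = 2  # each record is stored on 2 distinct nodes


def shard_for(key: str) -> int:
    """Sharding: hash a key to the index of its primary node."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % len(NODES)


def placement(key: str) -> list[str]:
    """Replication: place copies on consecutive distinct nodes,
    so two copies of the same partition never share a node."""
    primary = shard_for(key)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


if __name__ == "__main__":
    for key in ["user:42", "user:7", "order:1001"]:
        print(key, "->", placement(key))
```

Run as-is, this prints the two nodes that would hold each key; increasing REPLICATION_FACTOR trades storage for redundancy, exactly the trade-off the slide describes.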
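The master-slave replication flow can likewise be sketched in a few lines: writes go to the master, which copies them to the slaves, while reads are served round-robin by the slaves. This is a toy in-memory illustration under those assumptions; real systems replicate asynchronously and handle failover, both omitted here.

```python
import itertools


class Node:
    """A node holding a simple in-memory key-value copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}


class MasterSlaveCluster:
    """Writes go to the master and are replicated to the slaves;
    reads are load-balanced across the slaves (read-heavy workloads)."""
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._read_cycle = itertools.cycle(slaves)

    def write(self, key, value):
        self.master.data[key] = value     # the master handles updates
        for slave in self.slaves:         # and controls replication
            slave.data[key] = value

    def read(self, key):
        replica = next(self._read_cycle)  # slaves serve read requests
        return replica.name, replica.data.get(key)


cluster = MasterSlaveCluster(Node("master"), [Node("slave-1"), Node("slave-2")])
cluster.write("user:42", {"plan": "pro"})
print(cluster.read("user:42"))  # served by slave-1
print(cluster.read("user:42"))  # served by slave-2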
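The processing example above names an Airflow/Spark ETL pipeline that loads into Snowflake. As a hedged sketch of the Spark transform step only, the snippet below reads raw CSV events, aggregates them per user per day, and writes the result to a staging area; the paths and column names are hypothetical, and the Airflow orchestration and Snowflake load steps are omitted.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw event data (hypothetical path and schema).
events = spark.read.csv("s3://raw-bucket/events/", header=True, inferSchema=True)

# Transform: count events per user per day.
daily = (
    events
    .withColumn("day", F.to_date("event_time"))
    .groupBy("user_id", "day")
    .agg(F.count("*").alias("event_count"))
)

# Load: write the aggregate to a staging area; a separate step
# (e.g., an Airflow task) would then copy it into Snowflake.
daily.write.mode("overwrite").parquet("s3://staging-bucket/daily_events/")
```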
When Distributed Systems May Not Be the Right Solution

Transactional workloads with random data: if the task involves processing jobs in a transactional manner with unpredictable data-access patterns, distributed systems may introduce unnecessary complexity and overhead.

Non-parallelizable workloads: when tasks cannot be broken down and executed in parallel, distributing them across multiple nodes provides no advantage and may even degrade performance.

Low-latency data access requirements: if the system demands extremely fast access to data with minimal delays, a centralized architecture or in-memory processing might be a better choice than a distributed approach.

Handling a large number of small files: distributed systems are optimized for large-scale data processing, but managing a high volume of small files can introduce inefficiencies due to metadata overhead and excessive disk I/O operations.

Intensive computation with minimal data: when workloads involve heavy computation but operate on small datasets, the cost of data transfer and synchronization across nodes can outweigh the benefits of distribution, making a local or specialized computing solution more efficient.

Use Case: SmartHome & Buildings - the IoT World

An introduction to SmartHome & Building systems: what data can we get, and what do we do with it? Let's think!

High-Level Programming Languages

Python
‒ Widely used due to its ease of use and extensive ecosystem of libraries for data processing (e.g., Pandas, Dask, PySpark).
‒ Supports machine learning and data analytics frameworks like TensorFlow and Scikit-learn.

SQL (Structured Query Language)
‒ Essential for querying large datasets in distributed databases like Hive, Presto, and Google BigQuery.
‒ Used for structured data transformations and aggregations in big data pipelines.

Julia
‒ A high-performance computing language optimized for numerical analysis.
‒ Growing adoption in big data analytics and machine learning.

Scala
‒ Designed for functional and object-oriented programming, making it ideal for distributed data processing.
‒ The native language of Apache Spark, ensuring high performance and efficient parallelism.

Go (Golang)
‒ Designed for high concurrency and performance, making it useful in distributed computing.
‒ Used in large-scale data processing systems like InfluxDB and CockroachDB.

Java
‒ Strongly typed and widely adopted for enterprise-level applications.
‒ Used in Hadoop and Apache Beam, providing scalability and robustness.

Rust
‒ Memory-safe and optimized for performance, making it suitable for large-scale real-time data processing.
‒ Used in distributed data platforms like Vector and Timely Dataflow.

R
‒ Preferred for statistical computing and data visualization.
‒ Integrates well with SparkR for large-scale analytics.

High-Level Programming Languages: Personal Recommendation

Start with Python and SQL.

Where to begin:
1. Learn Python and SQL: Python is versatile, easy to learn, and widely used in data processing; SQL is essential for querying and managing large datasets efficiently (a starter sketch follows at the end of this section).
2. Work on practical projects: apply what you learn to real-world projects that interest you. Start with simple data analysis, ETL pipelines, or automation scripts.
3. Build a structured programming mindset: understanding how to structure your code and solve problems is more important than just learning a language. A well-organized thought process will help you adapt to new tools and technologies.
4. Expand your knowledge when needed: if a project requires another language or tool, you will be prepared to learn it quickly, and AI-powered development tools will make learning and implementation even easier in the future.

Final thought: the idea matters most. The most critical aspect is having a clear idea of what you want to achieve. Technology is just a tool; what really matters is how you use it to solve problems and create value.

GOOD LUCK AND ENJOY YOUR RIDE!
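Following the recommendation to start with Python and SQL, here is a minimal, self-contained starter sketch that combines the two using Python's standard-library sqlite3 module; the sales table and its rows are invented purely for illustration.

```python
import sqlite3

# Create an in-memory database and a small, invented sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 45.0)],
)

# SQL handles the aggregation; Python handles the surrounding logic.
query = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
for region, total in conn.execute(query):
    print(f"{region}: {total:.2f}")

conn.close()
```

The same division of labor scales up directly: swap sqlite3 for a driver to Hive, Presto, or BigQuery and the pattern of "SQL for the heavy query, Python for everything around it" stays the same.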
