Summary

This document provides an overview of big data storage concepts, focusing on Hadoop, cluster computing, data distribution models (sharding and replication), and relational, NoSQL and NewSQL databases. It details how data is stored, processed and managed in a big data environment.

Full Transcript

Big Data Storage Concepts

Data reaches users through multiple organizational data structures, and this architecture has seen significant improvements due to the big data revolution. Hadoop, an open-source framework, plays a crucial role: it stores data on clusters of commodity hardware and enables organizations to store and analyze large volumes of data effectively. Hadoop acts as an online archive, as it is highly suitable for unstructured and semi-structured data; it can also handle structured data that would be expensive to store and process with traditional storage engines, such as call-center records. Data stored in Hadoop is then transferred to a data warehouse, which further distributes it to data marts and other downstream systems. End users access and analyze this data using query tools.

Flow: Hadoop → data warehouse → data marts & other downstream systems

MapReduce programs process the vast amounts of raw data stored in Hadoop, enabling the creation of applications for data analysis.

Architecture overview: big data often involves multiple stages of data flow.
- Data flow: data moves from raw sources to storage and analytics platforms (e.g. web data, machine data, audio/video data and external data).
- Hadoop: an open-source framework for storing massive data on clusters, ideal for unstructured and semi-structured data.
- Data warehousing: raw data stored in Hadoop can be transferred to data warehouses, then to data marts, for analytical processing.
- MapReduce: a programming model in Hadoop, used to write applications that process the massive data stored in Hadoop (a minimal sketch of the model appears at the end of this block).

Cluster computing

Definition: a distributed or parallel computing system comprising multiple standalone computers (servers or nodes) connected together and working as a single, integrated, highly available resource. Multiple computing resources are connected in a cluster to constitute a single, larger and more powerful virtual computer, with each computing resource running its own instance of the OS. Cluster components are connected through local area networks (LANs) and are used for high availability and load balancing to enhance system performance and reliability.

Massively parallel processors and cluster computers offer several benefits:
- High availability: the system remains operational even if some nodes fail.
- Fault tolerance: service interruptions are minimized.
- Cost-effective hardware: using commodity hardware keeps costs low.
- Scalable performance: nodes can be added or removed seamlessly based on demand without disrupting system operation, allowing flexible resource allocation.

Cluster computing can follow a client-server architecture or a peer-to-peer model. The servers in a cluster are called nodes.

Advantages of cluster computing
1. Beneficial for data-intensive applications related to big data technologies, providing high-speed computational power.
2. A cluster can dynamically change in size, shrinking or expanding based on operational needs, handling server shutdowns or adding servers to manage increased load. (This refers to scalability; "changing in size" means the number of computers, not the memory or size of individual computers.)
3. Clusters provide a cost-effective solution for big data because they allow multiple applications to share computing resources and can adapt to increased computing demand. (The system can adjust itself to handle more work when needed: if more people start using an app, more computers/resources can be added to handle the extra demand.)
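MapReduce itself runs on a Hadoop cluster, but the programming model can be illustrated in plain Python. The sketch below is a minimal, single-machine imitation of the map, shuffle and reduce phases for a word count; the function and variable names are invented for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict

# Toy input: each element stands for one line/record of raw data.
records = [
    "hadoop stores big data",
    "mapreduce processes big data",
    "clusters store data",
]

# Map phase: emit (key, value) pairs -- here (word, 1) for every word.
def map_phase(record):
    for word in record.split():
        yield word, 1

# Shuffle phase: group all emitted values by key, as the framework
# would do between the mappers and the reducers.
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

# Reduce phase: combine the values for each key -- here a simple sum.
def reduce_phase(key, values):
    return key, sum(values)

word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)  # e.g. {'hadoop': 1, 'big': 2, 'data': 3, ...}
```

In a real cluster, the map and reduce functions run in parallel on the nodes that hold the data, which is what makes the model suitable for very large data sets.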
4. The distributed computation infrastructure in cluster computing offers fast and reliable data processing for large-scale big data solutions, utilizing integrated and geographically separated resources. Explanation: in distributed systems, a large data set is split into several chunks to allow faster and more reliable processing. The nodes (computers) can be spread across different locations (geographically separated), with each node storing/processing specific data, for example data from its own region, while other nodes hold data from other regions. The nodes can be accessed in parallel, so you do not have to wait for one region's data before accessing another's.
5. Clusters adopt failover mechanisms to prevent interruptions and ensure minimal impact when failures occur, maintaining system stability.
   - Failover: the process of switching to a redundant node upon the abnormal termination or failure of a previously active node (if one node fails, another node is activated automatically).
   - Failover is an automatic mechanism that requires no human intervention, which differentiates it from a switch-over operation.
   - Example: during a mega sale on a platform such as Shopee, many users access the same data at the same time, so the network may go down or particular data may become unreachable; with cluster computing, such interruptions are minimized. (A small failover sketch appears at the end of this block.)

A typical cluster computing setup consists of multiple individual PCs connected via a dedicated switch. Login nodes act as the entry point for users accessing the cluster, particularly from a public network, to prevent unauthorized access. Cluster computing operates through two models: master-slave and peer-to-peer.

The two major types of clusters are high-availability and load-balancing clusters. Computer clusters are used to improve the overall performance of a system.

Types of clusters:
1. High-availability cluster: used when availability of the system is of high importance in case of node failure.
2. Load-balancing cluster: used when the computational workload has to be shared among the cluster nodes.
The configuration of a computer cluster is based on the specific business needs and requirements.

High-availability clusters vs load-balancing clusters:
- Aim / objective:
  - High availability: minimize downtime and maintain uninterrupted service by allowing nodes to take over if others fail; they require shared storage and are crucial for failover and backup. Without clustering, an application becomes unavailable when the server running it fails.
  - Load balancing: resource optimization, reducing response times, maximizing throughput and preventing overload on any single resource; efficient resource utilization is achieved by controlling how requests are routed.
- Focus:
  - High availability: fault tolerance with redundant nodes, ensuring high reliability and scalability; more redundancy means higher availability and eliminates single points of failure.
  - Load balancing: efficiently manages resources, especially in clusters with machines of varying capability; less powerful machines handle less work, ensuring balanced performance.
- Failure handling:
  - High availability: if a node fails, service seamlessly and automatically moves to another node, ensuring continuous operation without manual intervention. Data integrity during this process is essential to ensure continuous operation.
  - Load balancing: work is distributed across multiple nodes to share the service load evenly; if a node fails, its workload shifts to other nodes holding identical copies of the data.
- Industries using it:
  - High availability: billing, banking and e-commerce demand highly available systems to ensure zero loss of transaction data, safeguarding against the significant financial risks associated with system failures.
  - Load balancing: spreads the workload across several less costly, lower-performing systems, enhancing scalability and cost effectiveness; used in web applications and content delivery networks.
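Failover can be imitated in a few lines: a request goes to the active node, and if that node is down the client switches automatically to a redundant node. This is only a toy sketch of the idea, not a real cluster manager; the node names and the `query` function are invented for illustration.

```python
# Toy cluster: the first reachable node in the list answers the request.
nodes = [
    {"name": "node-1", "up": False},   # previously active node has failed
    {"name": "node-2", "up": True},    # redundant node takes over
    {"name": "node-3", "up": True},
]

def query(node, request):
    # Stand-in for a real remote call to the node.
    return f"{node['name']} served: {request}"

def query_with_failover(nodes, request):
    """Try nodes in order and fail over automatically, with no human intervention."""
    for node in nodes:
        if node["up"]:
            return query(node, request)
        print(f"{node['name']} is down, failing over...")
    raise RuntimeError("all nodes are down")

print(query_with_failover(nodes, "read customer 42"))
```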
Types of load-balancing algorithms
i) Round-robin load balancing: sequentially selects servers from a list and resets once the last server has been chosen.
ii) Weight-based load balancing: considers weights (1-100) assigned to the servers and distributes the load proportionally among them.
iii) Random load balancing: routes requests randomly; suitable for homogeneous clusters with similarly configured machines. It works best where machines have similar processing power, because it does not account for differences in processing capability.
iv) Server-affinity load balancing: remembers the server to which a client's initial request was routed and directs subsequent requests from that client to the same server.

Cluster structure
In a basic cluster setup, a group of computers is linked and works together as a single computer. There are two types of cluster structure, based on how the computers are connected:
- Symmetric clusters: each node functions as an individual computer capable of running applications. The setup is simple and straightforward; the individual machines form a sub-network, or the machines can join an existing network and have cluster-specific software installed. Additional machines can be added when needed.
- Asymmetric clusters: one machine acts as the primary/head node, connecting users to the other nodes. It serves as a gateway between the users and the remaining nodes.

DATA DISTRIBUTION MODELS
Purpose of distributing data: overcome data-processing difficulties and cut the cost of buying expensive servers.
Benefits:
- Handles growing data volumes effectively.
- Ensures the network is highly available and robust against failures.
Challenges / disadvantages:
- Introduces complexity, as managing distributed systems requires advanced coordination and handling.
Techniques of the data distribution model:
1. Sharding
2. Replication
3. Sharding and replication combined

Sharding
Definition: the process of horizontally partitioning very large data sets into smaller, easily manageable chunks called shards. The shards are stored by distributing them across multiple nodes (a node is a server or machine): each shard is stored on a separate node, and each node is responsible for the data stored on it (node → shard → data). Each shard shares the same schema, and all shards collectively represent the complete data set. Sharding improves fault tolerance because the failure of a node affects only the block of data stored on that node.

Example: a 1 GB data block is split into 4 chunks of 256 MB each. When the size of the data increases, a single node may become insufficient to store it; with sharding, more nodes are added to meet the demands of massive data growth.

Sharding reduces:
1. the number of transactions (work) each node handles, which increases throughput;
2. the amount of data each node needs to store.
(A small sketch of hash-based shard assignment follows; the worked employee example continues after it.)
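One common way to decide which node holds which record is to hash a key and take it modulo the number of shards. The sketch below is a simplified illustration of that idea in Python; real systems such as MongoDB or Cassandra use more elaborate schemes (range partitioning, consistent hashing), and the node names here are invented.

```python
import hashlib

# Four nodes, each responsible for one shard of the data.
nodes = ["node-A", "node-B", "node-C", "node-D"]

def shard_for(key: str, num_shards: int) -> int:
    """Stable hash of the key, reduced to a shard index."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

employee_ids = ["E001", "E002", "E003", "E004", "E005", "E006"]

placement = {}
for emp_id in employee_ids:
    node = nodes[shard_for(emp_id, len(nodes))]
    placement.setdefault(node, []).append(emp_id)

# Each node is responsible only for the employees hashed to its shard.
for node, ids in placement.items():
    print(node, ids)
```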
Example: a data set with employee details is split into four small blocks -- shard A, shard B, shard C and shard D -- and stored across four different nodes: node A, node B, node C and node D.

Replication
Definition: the process of creating copies of the same set of data across multiple servers. When a node crashes, the data stored on it is lost; when a node is down for maintenance, it is unavailable until the maintenance is over. To overcome these issues, a data block is copied across multiple nodes. A copy of a block is called a replica. Replication makes the system fault tolerant, since the data is not lost when an individual node fails, because it is redundant across the nodes. Replication also increases data availability, as the same copy of the data is available on multiple nodes. Replication is achieved through the master-slave model or the peer-to-peer model.

Master-slave model
Definition: a model in which one centralized device, known as the master, controls one or more devices known as slaves.
- In a master-slave configuration, a replica set consists of a master node (the main computer) and several slave nodes (helper computers).
- After the connections between master and slaves are set up, the master tells the slaves what to do; control flows only from master to slaves.
- When new information is added, it is saved on the master node and then copied to all slave nodes to keep everything updated and in sync.
- All external write requests (insert, update, delete) go to the master node, whereas read requests can be handled by any slave node. Writes are managed by the master, and data can be read from either Slave A or Slave B.
- This architecture supports read-intensive workloads, because increasing read demand can be handled by adding more slave nodes. (If many people want to read data at the same time, more slaves can be added to share the load, making reads faster and more efficient.)
- However, the cluster still suffers from a single point of failure: if the master fails, write requests cannot be fulfilled until the master node is restored or a new master is created from one of the slave nodes. Only the slaves are protected against a single point of failure. Writes are limited to the maximum capacity the master can handle, so the model provides only read scalability. These disadvantages are overcome by the peer-to-peer model.
- Example of master-slave replication: Master A is the single point of contact for all writes, and data can be read from Slave A and Slave B. (A small sketch of this pattern appears at the end of this block.)
- Master-slave replication is ideal for read-intensive loads (many people or processes reading data) rather than write-intensive loads, since growing read demand can be managed by horizontal scaling, i.e. adding more slave nodes. Writes are consistent, as all writes are coordinated by the master node; the implication is that if many changes happen at the same time, the master can become overwhelmed and the system may slow down -- as the volume of writes increases, write performance suffers.
- If the master node fails, reads are still possible via any of the slave nodes. A slave node can be configured as a backup node for the master, meaning one of the slaves can be set up to take over if the master fails. Until a master node is re-established, writes are not supported.
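The master-slave pattern above can be sketched as a tiny in-memory simulation: all writes go through one master object, which copies them to its slaves, and reads are served by any slave. This is an assumed illustration only; the class and method names are invented and no real database protocol is involved.

```python
import random

class Slave:
    def __init__(self, name):
        self.name = name
        self.data = {}          # replica of the master's data

    def read(self, key):
        return self.data.get(key)

class Master:
    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves

    def write(self, key, value):
        # All writes are coordinated by the master...
        self.data[key] = value
        # ...and then replicated to every slave to keep them in sync.
        for slave in self.slaves:
            slave.data[key] = value

slaves = [Slave("slave-A"), Slave("slave-B")]
master = Master(slaves)

master.write("customer:42", {"name": "Alice"})

# Reads can be served by any slave; adding slaves scales read capacity.
reader = random.choice(slaves)
print(reader.name, reader.read("customer:42"))
```

If the replication step to a slave were delayed, a read from that slave would return stale data, which is exactly the read-inconsistency problem discussed next.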
To fix this problem, either:
- the master is restored from a saved copy (backup) of its data, or
- a new master node is chosen from among the slave nodes.

One concern with master-slave replication is read inconsistency, which can occur if a slave node is read before an update to the master has been copied to it. To ensure read consistency, a voting system can be implemented in which a read is declared consistent if the majority of the slaves contain the same version of the record. Implementing such a voting system requires a reliable and fast communication mechanism between the slaves. (A sketch of a majority-vote read appears at the end of this block.)

Example of master-slave replication where a read inconsistency occurs:
1. User A updates the data.
2. The data is copied to Slave A by the master node.
3. Before the data is copied to Slave B, User B tries to read the data from Slave B, which results in an inconsistent read.
4. The data eventually becomes consistent when Slave B is updated by the master node.

Peer-to-peer model
Definition: sharing the same data across multiple nodes to avoid a single point of failure. In a peer-to-peer configuration all nodes have the same responsibility and are at the same level (there is no master-slave concept). The nodes act as both client and server; each node, known as a peer, is equally capable of handling reads and writes.

Peer-to-peer vs master-slave communication:
- Peer-to-peer: all computers are equal, and any of them can start communication. Each node (peer) is equally capable of handling reads and writes, and each write is copied to all peers.
- Master-slave: the master starts the conversation or sends instructions to the slaves; the slaves only listen and follow orders and do not initiate communication.

In the peer-to-peer model the workload (task) is partitioned among the nodes. The nodes both consume and donate resources -- storage space, memory, bandwidth -- so resources are shared among the nodes. The peers may be located in different parts of the world; they are not all in one place but spread across various locations globally, and despite the distance they communicate and work together over the internet.

Example: writes are copied to Peers A, B and C simultaneously. Data is read from Peer A, but it can also be read from Peer B or C.

Peer-to-peer replication is prone to write inconsistencies that occur because of simultaneous updates of the same data across multiple peers. This can be addressed by implementing either a pessimistic or an optimistic concurrency strategy.

Pessimistic vs optimistic concurrency:
- Type of strategy:
  - Pessimistic: a proactive strategy that prevents inconsistency.
  - Optimistic: a reactive strategy that allows inconsistency, with the knowledge that consistency will be achieved over time, after all updates have propagated.
- Mechanism:
  - Pessimistic: uses locking to ensure that only one update to a record can occur at a time.
  - Optimistic: does not use locking; consistency is reached eventually.

With optimistic concurrency, peers may remain inconsistent for some period before achieving consistency; however, the database remains available because no locking is involved. As with master-slave replication, reads can be inconsistent while some peers have completed their updates and others are still performing theirs, but reads eventually become consistent once the updates have been executed on all peers. To ensure read consistency, a voting system can be implemented in which a read is declared consistent if the majority of the peers contain the same version of the record.
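The voting idea used in both models -- declare a read consistent only when a majority of replicas agree on the record's version -- can be sketched as follows. This is an assumed, simplified illustration; real quorum reads (for example in Cassandra or MongoDB) are considerably more involved.

```python
from collections import Counter

def majority_read(replicas, key):
    """Return the value only if a strict majority of replicas agree on it."""
    values = [replica.get(key) for replica in replicas]
    value, votes = Counter(values).most_common(1)[0]
    if votes > len(replicas) // 2:
        return value          # consistent read: the majority agrees
    raise RuntimeError("inconsistent read: no majority version")

# Three replicas; the third has not yet received the latest update.
replica_a = {"customer:42": "v2"}
replica_b = {"customer:42": "v2"}
replica_c = {"customer:42": "v1"}   # stale copy

print(majority_read([replica_a, replica_b, replica_c], "customer:42"))  # 'v2'
```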
Implementation of the voting system requires a reliable and fast communication mechanism between the peers.

Example of peer-to-peer replication where an inconsistent read occurs:
1. User A updates the data.
2. The data is copied to Peer A.
3. The data is copied to Peer B.
4. Before the data is copied to Peer C, User B tries to read the data from Peer C, resulting in an inconsistent read.
5. The data is eventually updated on Peer C, and the database once again becomes consistent.

Sharding and replication combined
To improve on the limited fault tolerance offered by sharding and to benefit from the increased availability and scalability of replication, both techniques can be combined. Comparing sharding and replication shows how a data set is distributed between two nodes under the different approaches.

1. Combining sharding and master-slave replication
- When combined, multiple shards become slaves of a single master, and the master itself is a shard. Because there are multiple masters, a single slave shard can only be managed by a single master shard.
- Write consistency is maintained by the master shard, but if the master shard becomes non-operational or a network outage occurs, changes to the data (write operations) cannot be made. This reduces the system's fault tolerance (ability to handle failures) for writes.
- Replicas of a shard are kept on multiple slave nodes to provide scalability and fault tolerance for read operations.
- Example: each node acts as both a master and a slave for different shards. Node A is the master for Shard A, so writes (id = 2) go to Node A; Node B is the slave for Shard A, so Node A replicates the data (id = 2) to Node B. Reads (id = 4) can be served directly by either Node B or Node C, as they each contain Shard B.

2. Combining sharding and peer-to-peer replication
- When combined, each shard is replicated to multiple peers, and each peer is responsible only for a subset of the overall data set -- each peer handles only a portion of the data, not everything.
- Spreading the data across many computers and keeping copies provides:
  - increased scalability (the system can handle more users);
  - increased fault tolerance (work can proceed even if one computer fails).
- There is no single main computer (master), so the system does not rely on one machine to function; it handles failures better and can still serve both reads and writes even if some computers go offline.
- Example: each node contains replicas of two different shards. Writes (id = 3) are replicated to both Node A and Node C (peers), as they are responsible for Shard C. Reads (id = 6) can be served by either Node B or Node C, as they each contain Shard B.
(A small sketch combining sharding with replication appears at the end of this block.)
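The combination can be sketched by extending the earlier hash-sharding idea so that every shard is written to more than one node. This is an assumed, minimal illustration of sharding plus peer-to-peer replication; the ownership table, node names and the toy shard assignment are invented and do not match the ids in the figure described above.

```python
# Each shard is owned by two peers, so a write to a shard is
# replicated to both nodes responsible for it.
shard_owners = {
    "shard-A": ["node-A", "node-B"],
    "shard-B": ["node-B", "node-C"],
    "shard-C": ["node-C", "node-A"],
}

# In-memory stand-ins for each node's local storage.
nodes = {name: {} for name in ["node-A", "node-B", "node-C"]}

def shard_for(record_id: int) -> str:
    # Toy assignment: spread record ids over the three shards.
    return f"shard-{'ABC'[record_id % 3]}"

def write(record_id: int, value):
    shard = shard_for(record_id)
    for node in shard_owners[shard]:      # replicate to every owner of the shard
        nodes[node][record_id] = value

def read(record_id: int):
    shard = shard_for(record_id)
    for node in shard_owners[shard]:      # any owner of the shard can serve the read
        if record_id in nodes[node]:
            return nodes[node][record_id]

write(2, {"name": "Carol"})
write(4, {"name": "Dave"})
print(read(2), read(4))
```

Because each shard lives on two peers, losing one node still leaves a full copy of its shards available for both reads and writes.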
Relational and non-relational databases
Databases organize data into tables of rows and columns. A database with only one table is called a flat database, while a database with two or more related tables is called a relational database. The rows are called records and the columns are called attributes or fields.

Example: a simple table stores the details of students registering for the courses offered by an institution. The table holds the students' details and the course IDs of the courses for which they have registered. This table meets the basic need of keeping track of the courses for which each student has registered, but the design is inefficient and wastes storage space. For example, when a student registers for more than one course, the student's details must be entered again for every course registered. This can be overcome by dividing the data across multiple related tables, linked by unique primary and foreign keys.

Relational tables have attributes that uniquely identify each row: in a relational database with two or more related tables, each row carries specific attributes, and these attributes are used to identify the tuples uniquely. In the context of relational databases, a tuple is a row in a table; for example, in a table of students, one row (tuple) holds all the details about one student, such as ID, name and age, so a tuple is one entry in a table. The attribute that ensures no two rows are the same is called the primary key. StudentID is the primary key, so its value must be unique. CourseID in RegisteredCourse is a foreign key; CourseID in CourseOffered is a foreign key.

Relational databases are unsuitable when organizations collect vast amounts of customer data, transactions and other data, because this type of data often cannot be structured to fit a relational database. This has led to the evolution of non-relational databases, which are schema-less. NoSQL refers to these non-relational databases; a few frequently used NoSQL DBMSs are:
- MongoDB
- Cassandra
- Redis
- Neo4J

RDBMS
An RDBMS is vertically scalable and exhibits the ACID properties (atomicity, consistency, isolation, durability). Vertically scalable means the performance of an RDBMS can be improved by adding more powerful hardware, such as faster processors or more memory, to a single server. An RDBMS supports data that adheres to a specific schema, meaning the data must follow the fixed structure or format defined by that schema. This schema check is made when data is inserted or updated, so RDBMSs are not ideal for capturing and storing data arriving at high velocity: checking all the incoming data slows things down, making these systems a poor choice where data needs to be saved as fast as possible without delay.

Due to these architectural limitations, RDBMSs are not ideal as the primary storage for big data solutions. RDBMSs running in corporate data centres have stored the bulk of the world's data, but with the increase in data they can no longer keep up with the volume, velocity and variety of data being generated and consumed. Traditional data management tools cannot manage big data, yet conventional databases still exist and are used in a large number of applications. To resolve the problems with big data, modern alternative database technologies have emerged. These technologies, such as NoSQL and NewSQL databases, do not require fixed schemas to store data; instead they distribute the data across the storage infrastructure.

NoSQL databases (Not Only SQL)
- Include non-relational databases.
- Follow the CAP theorem: Consistency, Availability, Partition tolerance.
- Exhibit the BASE model: Basically Available, Soft state, Eventually consistent.
- The storage devices do not provide immediate consistency but provide eventual consistency, so these databases are not suitable for implementing large transactions.

Types of NoSQL databases:
- Key-value databases
- Document databases
- Column-oriented databases
- Graph databases
(A small sketch contrasting a fixed relational row with a schema-less document appears at the end of this block.)
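To make the "schema-less" point concrete, the sketch below contrasts a fixed relational-style row with document-style records, using plain Python dictionaries as a stand-in for a document store such as MongoDB. No real database driver is used, and the field names are invented for illustration.

```python
# Relational-style: every row must match one fixed schema.
student_row = ("S001", "Alice", 21)   # (StudentID, Name, Age) -- fixed columns

# Document-style: each record is a self-describing document, and different
# records may carry different fields without any schema change.
students = [
    {"_id": "S001", "name": "Alice", "age": 21},
    {"_id": "S002", "name": "Bob", "courses": ["C101", "C205"]},             # extra field
    {"_id": "S003", "name": "Chen", "contact": {"email": "c@example.com"}},  # nested field
]

# A key-value store reduces this even further: one opaque value per key.
kv_store = {doc["_id"]: doc for doc in students}
print(kv_store["S002"]["courses"])   # ['C101', 'C205']
```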
NewSQL databases
1. Offer scalability like NoSQL systems while maintaining the traditional ACID properties.
2. Examples: VoltDB, NuoDB, Clustrix, MemSQL and TokuDB.
3. They are distributed, horizontally scalable, fault tolerant and support a relational data model through administrative, transactional and storage layers.
4. Operate in a shared architecture and are highly scalable.
5. NewSQL has SQL-compliant syntax and uses the relational data model for storage. Because it supports SQL-compliant syntax, the transition from an RDBMS to this highly scalable system is easy.
6. NewSQL databases suit applications with repeated queries and numerous transactions.

Some commercial NewSQL products:
- Clustrix: distributed database; fault tolerant; high performance; scales out; used in applications with massive, high transactional volumes.
- NuoDB: distributed database; fault tolerant; scales out; cloud based; supports both batch and real-time SQL queries.
- VoltDB: distributed database; fault tolerant; high performance; in-memory; used to make real-time decisions to maximize business value.
- MemSQL: distributed database; fault tolerant; high performance; in-memory; known for its blazing-fast performance and useful for real-time analytics.

Scaling up and scaling out storage
Definition of scalability: the ability of a system to meet increasing demand for storage capacity. A system capable of scaling delivers increased performance and efficiency; in other words, a system that can expand keeps working fast -- if you add more users or data, it can handle them without slowing down. As we enter the big data era, it is essential to improve data storage platforms so they can handle this massive data growth. Storage platforms can be scaled in two ways:
1. Scaling up (vertical scalability): adds more resources to the existing server to increase its capacity to hold more data. The resources can be computational power, hard drives and RAM. Scaling is limited to the maximum capacity of the server.
2. Scaling out (horizontal scalability): adds new servers/components (called nodes) to the system to meet the increased demand. Big data technology primarily relies on this method for expanding storage. It uses low-cost commodity hardware and storage components, which allows easy addition of components without much complexity; multiple components work together as a single system to meet demand.

Big Data Storage Concepts Recap
- Cluster computing provides high availability and scalability.
- Distribution models like sharding and replication enhance data management.
- Distributed file systems, such as HDFS, offer resilience and efficiency.
- Non-relational databases (NoSQL) and hybrid databases (NewSQL) are evolving to handle modern big data requirements.
