Session 5: Big Data (Arabic) PDF
Uploaded by GlimmeringFantasy7783
Summary
This document provides an overview of big data storage technologies, including clusters, file systems, and NoSQL databases. It also explores topics like sharding and replication. The information is presented in a question-and-answer format, making it suitable for learning about these related topics.
Full Transcript
Computer Science, Third Year
Big Data

Q1. When is data storage typically required in the context of data wrangling?
1. When external datasets are acquired.
2. When data is manipulated to make it suitable for analysis.
3. When data is processed via an ETL (Extract, Transform, Load) activity.

Q2. State the big data storage technologies.
1. Clusters.
2. File Systems and Distributed File Systems.
3. NoSQL.
4. Sharding.
5. Replication.
6. Sharding and Replication.
7. CAP Theorem.
8. ACID.
9. BASE.

1. Cluster
In computing, a cluster is a collection of servers. Cluster servers usually have the same hardware specifications and are connected via a network to work as a single unit.

2. File Systems and Distributed File Systems

File Systems
A file system is the method of storing and organizing data on a storage device, such as a flash drive, DVD or hard drive. A file is an atomic unit of storage used by the file system to store data. A file system provides a logical view of the data and presents it as a tree structure of directories and files. Operating systems employ file systems to store and retrieve data on behalf of applications.

Distributed File Systems
A distributed file system is a file system that can store large files spread across the nodes of a cluster. To the client, files appear to be local; however, this is only a logical view. Physically, the files are distributed throughout the cluster.
Examples: Google File System (GFS) and Hadoop Distributed File System (HDFS).

3. NoSQL
A Not-only SQL (NoSQL) database is a non-relational database that is highly scalable, fault-tolerant and specifically designed to house semi-structured and unstructured data. NoSQL databases provide an API-based query interface that can be called from within an application. They also support query languages other than SQL, because SQL was designed to query structured data stored within a relational database.
For example, XML files will often use XQuery as the query language, and RDF data will use SPARQL to query the relationships it contains.

4. Sharding
Sharding is the process of horizontally partitioning a large dataset into a collection of smaller datasets called shards. The shards are distributed across multiple nodes.

Each shard:
o Is stored on a separate node, and each node is responsible for only the data stored on it.
o Shares the same schema as the other shards; all shards collectively represent the complete dataset.

Sharding allows the distribution of processing loads across multiple nodes to achieve horizontal scalability. Horizontal scaling is a method for increasing a system's capacity by adding similar resources alongside existing resources.

How sharding works in practice:
✓ Each shard can independently service reads and writes for the specific subset of data that it is responsible for.
✓ Depending on the query, data may need to be fetched from more than one shard.
✓ Benefit: in case of a node failure, only the data stored on that node is affected.

[Figure: an example where data is fetched from both Node A and Node B.]

5. Replication
Replication stores multiple copies of a dataset on various nodes. It provides scalability and availability since the same data is replicated on various nodes. Fault tolerance is also achieved, since data redundancy ensures that data is not lost when an individual node fails.

There are two different methods that are used to implement replication:
✓ Master-slave.
✓ Peer-to-peer.

Master-slave replication
Nodes are arranged in a master-slave configuration, and all data is written to a master node. Once saved, the data is replicated over to multiple slave nodes. All external write requests, including inserts, updates and deletes, occur on the master node, whereas read requests can be fulfilled by any slave node. If the master node fails, reads are still possible via any of the slave nodes.
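The master-slave behaviour described above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how any particular database implements replication: the names `Node`, `MasterSlaveCluster`, `write`, `replicate` and `read` are hypothetical, and replication is triggered by an explicit call so that the delay between saving on the master and copying to the slaves is visible. A read issued during that delay returns stale data.

```python
class Node:
    """Hypothetical node holding a key-value copy of the dataset."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True


class MasterSlaveCluster:
    """Toy model: writes go to the master, reads are served by slaves."""

    def __init__(self, slave_count):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(slave_count)]
        self._pending = []      # saved on the master, not yet on the slaves
        self._next_slave = 0    # round-robin pointer for reads

    def write(self, key, value):
        # Inserts, updates and deletes would all pass through the master.
        if not self.master.alive:
            raise RuntimeError("master is down: writes are unavailable")
        self.master.data[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Copy every pending write to every slave (assumed to be a
        # separate step here so the replication lag can be observed).
        for key, value in self._pending:
            for slave in self.slaves:
                slave.data[key] = value
        self._pending.clear()

    def read(self, key):
        # Any slave can answer a read; pick them in round-robin order.
        slave = self.slaves[self._next_slave % len(self.slaves)]
        self._next_slave += 1
        return slave.data.get(key)


if __name__ == "__main__":
    cluster = MasterSlaveCluster(slave_count=2)
    cluster.write("user:1", "Ali")
    print(cluster.read("user:1"))   # None: the slaves have not been updated yet
    cluster.replicate()
    print(cluster.read("user:1"))   # Ali
    cluster.master.alive = False
    print(cluster.read("user:1"))   # Ali: reads survive a master failure
```

In this sketch a master failure blocks writes but leaves reads available, which matches the trade-off stated in the notes.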
[Figure: an example of master-slave replication where read inconsistency occurs.]

Peer-to-peer replication
With peer-to-peer replication, all nodes operate at the same level; there is no master-slave relationship between the nodes. Peer-to-peer replication is prone to write inconsistencies that occur because of a simultaneous update of the same data across multiple peers. This can be addressed by implementing either a pessimistic or an optimistic concurrency strategy.

Pessimistic concurrency
Pessimistic concurrency prevents inconsistency. It uses locks so that only one update to a record can occur at a time. However, availability suffers, since the database record being updated remains unavailable until all locks are released.

Optimistic concurrency
Optimistic concurrency allows inconsistency to occur, with the knowledge that consistency will be achieved after all updates have propagated. Peers may remain inconsistent for some period before attaining consistency. However, the database remains available, as no locking is involved.

Reads can be inconsistent during the period when only some of the peers have completed their updates. To ensure read consistency, a voting system can be implemented; it requires reliable and fast communication between the peers.

[Figure: an example of peer-to-peer replication where an inconsistent read occurs.]
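The inconsistent read and the voting idea can be illustrated with a minimal sketch. The notes do not say how the vote is counted, so the rule used here is an assumption: a read is treated as consistent when a strict majority of peers hold the same version of the record. The names `Peer`, `apply` and `quorum_read`, and the use of integer version numbers, are hypothetical.

```python
from collections import Counter


class Peer:
    """Hypothetical peer holding a versioned copy of each record."""

    def __init__(self, name):
        self.name = name
        self.records = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Optimistic style: no locks, just keep the newest version seen.
        current = self.records.get(key)
        if current is None or version > current[0]:
            self.records[key] = (version, value)


def quorum_read(peers, key):
    """Poll every peer and vote on the (version, value) pair.

    Returns (value, consistent). `consistent` is True only when a strict
    majority of peers agree (assumed voting rule, not from the notes).
    """
    votes = Counter(peer.records.get(key) for peer in peers)
    winner, count = votes.most_common(1)[0]
    consistent = count > len(peers) // 2
    value = winner[1] if winner is not None else None
    return value, consistent


if __name__ == "__main__":
    peers = [Peer(f"peer-{i}") for i in range(3)]
    for peer in peers:
        peer.apply("stock", 1, "old")

    # An update has reached only one peer so far.
    peers[0].apply("stock", 2, "new")

    # Reading from a single peer depends on which peer answers.
    print(peers[0].records["stock"][1])  # new
    print(peers[1].records["stock"][1])  # old

    # Voting gives one agreed answer while the update propagates.
    print(quorum_read(peers, "stock"))   # ('old', True)
    peers[1].apply("stock", 2, "new")
    print(quorum_read(peers, "stock"))   # ('new', True)
```

Because every read polls all peers, the vote is only practical when communication between peers is reliable and fast, which is the requirement the notes attach to the voting system.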