Questions and Answers
Which of the following accurately describes a distributed database (DDB)?
- A single database stored in one central location for easy access and management.
- A database managed by software that hides data distribution, requiring manual user configuration.
- A database system where all nodes must be homogeneous to ensure data consistency.
- A collection of logically interrelated databases distributed over a computer network. (correct)
Which of the following is NOT typically considered a direct benefit of using distributed databases?
- Improved reliability through redundancy and fault tolerance.
- Enhanced data security due to centralized control and monitoring. (correct)
- Increased scalability to handle growing data volumes and user traffic.
- Better support for geographically distributed organizational structures.
Which aspect of a DDBMS do users typically not need to be aware of, reflecting a key principle of transparency?
- Data fragmentation. (correct)
- Data replication methods.
- Data language used for queries.
- Network configuration.
How do replicated components in a distributed system contribute to reliability and availability?
Which of the following performance benefits is primarily associated with data localization in distributed databases?
What is the primary difference between scaling up and scaling out a database system?
Why is distribution still desirable when you want to manage the whole system?
In the context of DDB architecture, what is the role of the Global Conceptual Schema (GCS)?
Which of the following factors is LEAST critical when designing a distributed database?
What is the main advantage of the top-down approach to distributed database design?
In the context of a top-down design process for a distributed database, what does 'data fragmentation' primarily involve?
Which of the following is an example of the 'View Design' step in a top-down distributed database design process?
What is the primary purpose of 'data allocation' in the context of distributed database design?
Which aspect of database design is primarily addressed during the 'physical design' phase?
What is the primary goal of fragmentation in a distributed database?
What are the 'CDR properties' of fragmentation?
Why is it important for data fragmentation to allow data reconstruction?
What is the main difference between horizontal and vertical fragmentation?
Which of the following is a benefit of Round Robin data distribution for different queries?
Which of the following is a drawback of data fragmentation?
What is a primary characteristic of primary horizontal fragmentation?
In the context of fragmentation, what is the purpose of the 'minterm predicates approach'?
What is derived horizontal fragmentation?
In the context of database fragmentation, what must derived fragmentation avoid?
In derived horizontal fragmentation, if relation R is the owner and relation S is a member, how is the fragmentation defined?
What must all fragments include when using vertical fragmentation?
When is fragmentation sometimes forced onto database administrators?
What does a semi-join operation reduce when used in a centralized database?
In the context of replication, what does updating replicated data mean?
How can fragments be allocated to sites?
What is the SSOT Property?
Which of the following is a goal of allocating fragments in distributed database design?
Which database fragmentation property is best described as the following: '$\forall F_i, F_j \in F,\ i \neq j \Rightarrow F_i \cap F_j = \emptyset$'?
In allocating data fragments to sites, how is total cost calculated?
Which of the following must be maintained in order for all copies of replicated data to be updated?
What is NOT an example of a question to consider when using SSOT?
What term refers to the number of tuples that need to be accessed to process a query?
Which of the following best describes what SSOT solves in DDB?
Which of the following is NOT a reason that Semi-join operations are used?
From the trade-off of Fragmentation with Replication image, does fragmentation or full replication have greater update problems?
Flashcards
Distributed Database (DDB)
A database spread across multiple locations or nodes.
DDBMS
Software that manages a distributed database and hides distribution details.
Why Distributed?
The organizational structure of distributed enterprises
Computing Power
Transparency Benefit
Reliability Benefit
Scalability Benefit
Fault Tolerance
Data Localization
Network Factors
Top-Down Approach
Bottom-Up Approach
Data Fragmentation
Data Allocation
Primary Fragmentation
Derived Fragmentation
Data Fragment
Completeness
Disjointness
Replication
Allocation
SSOT
Study Notes
Examples of Distributed Databases (DDB)
- Google Spanner, Apache Cassandra, CockroachDB, Amazon DynamoDB, Microsoft CosmosDB, Apache HBase, and Nebula Graph are examples of DDBs
Questions to Think About
- What are distributed databases?
- Why use distributed databases?
- What are the benefits of distributed databases?
- How can a distributed database be designed?
Centralized vs. Distributed Databases
- Centralized databases store data at a single physical site, whereas distributed databases spread it across multiple sites or nodes
- Data access happens from a single server in a centralized database, whereas distributed databases access data from multiple servers
- Centralized databases are less reliable than distributed databases, because the single site is a single point of failure
- Centralized databases have limited scalability, with performance degrading under load, whereas distributed databases can handle load with additional nodes for high scalability
- Centralized databases have minimal data redundancy and typically store a single copy of data; distributed databases have high redundancy, wherein multiple copies of data are kept to improve fault tolerance
- Consistency is easier to maintain in a centralized database because all data is in one place, whereas it is more complex to maintain in a distributed database because of data replication
- A centralized database may be slower under heavy loads due to a single access point; distributed databases have quicker response times in distributed environments
- Centralized databases are easier to secure, whereas distributed databases are more challenging because of data distribution across multiple locations
Topics Covered
- Distributed systems concepts
- Motivations and benefits of DDBs
- Types of transparency, reliability, and availability
- Performance issues and scalability
- DDB architecture, including schemas and components
- DDB design factors
- Two design approaches: top-down and bottom-up process with an example.
- Fragmentation properties
- Minterm predicates approach
- Derived horizontal fragmentation and semi-join
- Vertical fragmentation and properties
- Allocation of fragments and the SSOT (Single Source of Truth) property
Distributed Computing Systems
- Consist of autonomous, interconnected processing elements that cooperate over a computer network
- These elements are not necessarily homogeneous
- Google's search engine and Apache Hadoop are examples
Distributed Databases (DDB)
- DDBs are collections of logically interrelated databases distributed over a computer network
- They use a global schema
DDBMS
- DDBMS is software that manages a distributed database while maintaining transparency for the user.
Advantages of Distributed Systems
- Correspond to the organizational structure of distributed enterprises, including physically distributed companies, e-commerce businesses in different geographical centers, and supply chain management
- Offer economic benefits with easy management, a global schema, data integrity management, and prevention of data silos
- Offer better computing power by dividing tasks, which are then efficiently processed by different nodes
- Key feature is the ability to work towards a common goal
- Examples are Dropbox, OneDrive, and Amazon
DDBMS Benefits
- Transparency: hides the complexity of the distributed nature from users
- Reliability: systems continue working even if some parts fail
- Availability: resources are available when needed
- Performance: achieves faster processing
- Expansion: allows easy addition of resources
- Data Integrity: ensures the database remains consistent
Types of Transparency
- Language
- Fragmentation
- Replication
- Network
- Data Independence (logical and physical independence)
Reliability and Availability
- Resources available through a reliable system
- Replicated components (data and software) eliminate single points of failure
- Failure of a single site or communication link should not bring down the entire system
- Managing reachable data requires distributed transaction support to ensure correctness when failures occur
Performance Considerations
- Data localization reduces contention and communication overhead
- Parallel processing with inter-query and intra-query parallelism is also important
- Performance increases using indexing, query optimization, or SQL chips
Expansion and Scalability
- Scale-Up: Increase capacity by upgrading to larger storage or better computers
- Scale-Out: Increase capacity by adding computing, processor, memory, and storage nodes to the network
When to Opt for Distribution
- Manageability is easier when data is localized; however, distribution is still desirable for handling large data volumes and distributed data accesses
- Distribute when the whole system isn't under your control
- DDB theory is essential to make an educated decision
DDB Architecture: Schemas
- ANSI/SPARC model
- External level - external view
- Conceptual level - global conceptual schema
- Internal level - local internal schema
DDB Architecture: Components
- Global query
- Global query compiler
- Global query optimizer
- Global transaction manager
- Local transaction manager
- Local query transaction
- Stored Data
Key Points for DDB Design:
- Network topology, bandwidth, latency, and reliability
- Distribution of the nodes
- Distribution of application software
- Three dimensions: Level of sharing, Access patterns, Level of knowledge about site databases
Distributed Database Design - Two Main Approaches
- Top-down Approach: Starts with designing the Global Conceptual Schema (GCS) and then considers data fragmentation and allocation before designing component databases
- Bottom-up Approach: Focuses on integrating pre-existing Local Conceptual Schemas (LCSs) into a GCS, involving design views and external schemas; it can have interoperability issues
Distributed Database Design - Top-down vs Bottom-up Approach
Factor | Top-down Approach | Bottom-up Approach |
---|---|---|
Use case | New system design | Existing databases need integration |
Schema creation | Starts with Global Conceptual Schema (GCS) | Integrates Local Conceptual Schemas (LCS) into GCS |
Data consistency | Easier to enforce global consistency | More difficult due to pre-existing differences |
Performance | Optimized from the start | Might be suboptimal, depending on integration method |
Interoperability | Standardized from the beginning | May require additional middleware for compatibility |
Implementation complexity | Requires detailed planning but smooth execution | More complex due to schema and policy conflicts |
Best for | Banks, airlines, healthcare, stock trading | Mergers, multinational companies, legacy integration |
Top-Down Design Process
- Requirements analysis -> Objectives -> Conceptual Design -> GCS -> View Integration -> View Design -> Access information -> External Schema (ECS) -> Distribution Design -> LCS -> Physical Design -> Physical Schema -> Feedback -> Monitoring
Top-Down Design Process - Woolworths Example
- Requirement Analysis: Defining expectations and needs of the database system. For instance, handling nationwide transactions, real-time inventory updates, and providing data consistency
- Conceptual Design: Creating a unified database view including data entities, relationships, and constraints. Visualizing the global conceptual model (GCS) using ER diagrams
- View Design: Designing user-specific perspectives for stakeholders like store managers, supply chain team, and analytics team
- Distribution Design: Defining a strategy to distribute the database across multiple sites, including data fragmentation and allocation
- Physical Design: Creating the physical structure and access methods, choosing hardware and storage, using indexing for data retrieval, and using clustering to optimize disk usage
- Monitoring: Using dashboards to track system health, covering performance tuning, load balancing, data duplication, and data consistency
Fragmentation & Allocation
- Fragmentation and allocation split the database so that it can be stored in different locations
- The process of dividing the database into multiple smaller parts or sub-tables is called fragmentation
- The smaller parts or sub-tables are called fragments, and the fragments are stored at different locations
Factors for fragmentation:
- A table should only be split if it can be reconstructed from its fragments using UNION and/or JOIN operations
- Fragmentation can be horizontal, vertical, or hybrid
Example about location fragments
- Student(id, name, age, location, ...)
- Assume 75% of queries are about St Lucia and 25% of queries are about Gatton
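The Student example above can be sketched as a horizontal split on the location attribute. This is a minimal illustration, not a method taken from the notes: the sample rows and the `fragment_by_location` helper are hypothetical, and only the attribute names come from the example.

```python
# Hypothetical sample rows; only the attribute names come from the example above.
students = [
    {"id": 1, "name": "Ana", "age": 20, "location": "St Lucia"},
    {"id": 2, "name": "Ben", "age": 22, "location": "Gatton"},
    {"id": 3, "name": "Chi", "age": 21, "location": "St Lucia"},
]


def fragment_by_location(rows):
    """Group rows into one horizontal fragment per distinct location value."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row["location"], []).append(row)
    return fragments


fragments = fragment_by_location(students)
st_lucia_ids = [r["id"] for r in fragments["St Lucia"]]
gatton_ids = [r["id"] for r in fragments["Gatton"]]
```

If most queries concern St Lucia, storing that fragment at the St Lucia site would keep those queries local, which is the motivation the example points at.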
Why fragmentation?
- Application views are usually subsets of relations, so a fragment can correspond to what an application actually accesses
- Without fragmentation there are two options:
- The entire table is stored at one site: if the table is large, remote data access can be slow
- The entire table is replicated: executing updates becomes a problem
Cons of Fragmentation
- Fragmenting adds query overheads when fragments are across sites
- Complexity in managing fragmentation
- Difficulty in choosing the attributes to fragment on
- Replicated data that exists across multiple locations must be kept synchronized
Types of Fragmentation
- Horizontal fragmentation splits a table row-wise based on predicates.
- Primary fragmentation uses predicates defined on the table itself (single-table predicates)
- Derived fragmentation uses predicates defined on another, related table (linked through a foreign key)
- Vertical fragmentation splits a table column-wise into sets of attributes
Horizontal Partitioning Techniques
- Round Robin Partitioning: Distributes data evenly and performs well when scanning the entire table, but is not good for point or range queries.
- Hash Partitioning: Distributes data evenly when the hash function is well chosen. Efficient for point queries on the key, but not good for range queries.
- Range Partitioning: Performs well for queries on the partitioning attribute, but a good partitioning vector must be chosen to avoid data and execution skew
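The three techniques above can be compared side by side in a short sketch. The function names and the toy key list are assumptions made for illustration; Python's built-in `hash` stands in for whatever hash function a real system would use.

```python
from bisect import bisect_right


def round_robin(rows, n):
    """Assign row i to partition i mod n, ignoring the row's content."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Assign each row to the partition selected by hashing its key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def range_partition(rows, vector, key):
    """Assign each row by comparing its key with a sorted partitioning vector."""
    parts = [[] for _ in range(len(vector) + 1)]
    for row in rows:
        parts[bisect_right(vector, key(row))].append(row)
    return parts


keys = list(range(10))  # toy data: each row is just its key
rr = round_robin(keys, 3)
hp = hash_partition(keys, 3, key=lambda k: k)
rp = range_partition(keys, vector=[3, 7], key=lambda k: k)
```

With range partitioning, a range query such as keys 3 to 6 touches a single partition, whereas with round robin the same query has to scan every partition.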
Properties of Fragmentation (CDR)
- Completeness: All data elements from the original table must be included in at least one fragment.
- Disjointness: No data element should exist in multiple fragments, ensuring each piece of data only appears once.
- Reconstruction: The original table can be reconstructed from the fragments, through a union operation for horizontal fragments or a join for vertical fragments.
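For horizontal fragments, the three properties can be checked mechanically. The sketch below assumes rows are hashable tuples and that reconstruction is a plain union; the helper names and sample rows are hypothetical.

```python
def is_complete(table, fragments):
    """Every row of the original table appears in at least one fragment."""
    return set(table) <= set().union(*fragments)


def is_disjoint(fragments):
    """No row appears in more than one fragment."""
    total = sum(len(set(f)) for f in fragments)
    return total == len(set().union(*fragments))


def reconstruct(fragments):
    """Rebuild the table as the union of its horizontal fragments."""
    return set().union(*fragments)


# Hypothetical (id, location) rows
table = [(1, "St Lucia"), (2, "Gatton"), (3, "St Lucia")]
good = [[(1, "St Lucia"), (3, "St Lucia")], [(2, "Gatton")]]
overlapping = [[(1, "St Lucia"), (2, "Gatton")], [(2, "Gatton"), (3, "St Lucia")]]
missing = [[(1, "St Lucia")], [(2, "Gatton")]]
```

Here `good` satisfies all three properties, `overlapping` breaks disjointness, and `missing` breaks completeness.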
Minterm Predicates Approach
- Simple predicate: $P_i : a_j \,\theta\, \mathit{value}$, where $\theta \in \{<, =, \ldots\}$
- Given a set of simple predicates $P = \{P_1, P_2, \ldots, P_n\}$, a minterm predicate is $M_i = P_1^* \wedge P_2^* \wedge \cdots \wedge P_n^*$
- Each $P_i^*$ is either $P_i$ or $\neg P_i$
- Thus, there can be $2^n$ such minterm predicates in total; eliminate the useless ones (some are invalid)
- Outcome: $M = \{M_1, M_2, \ldots, M_k\}$, where each minterm predicate defines a fragment
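The enumeration of minterm predicates can be sketched as follows. The two simple predicates and the sample rows are hypothetical, and the sketch only generates all $2^n$ combinations; deciding which ones are invalid is left to the designer.

```python
from itertools import product

# Hypothetical simple predicates over a Student-like row
simple_predicates = {
    "location = 'St Lucia'": lambda r: r["location"] == "St Lucia",
    "age < 21": lambda r: r["age"] < 21,
}


def minterms(predicates):
    """Return (label, test) pairs for every conjunction of P_i or NOT(P_i)."""
    names = list(predicates)
    result = []
    for signs in product([True, False], repeat=len(names)):
        label = " AND ".join(
            n if s else f"NOT({n})" for n, s in zip(names, signs)
        )

        def test(row, signs=signs):
            # The row must agree with the chosen sign of every simple predicate.
            return all(predicates[n](row) == s for n, s in zip(names, signs))

        result.append((label, test))
    return result


rows = [
    {"id": 1, "location": "St Lucia", "age": 20},
    {"id": 2, "location": "Gatton", "age": 22},
    {"id": 3, "location": "St Lucia", "age": 21},
]
fragments = {
    label: [r["id"] for r in rows if test(r)]
    for label, test in minterms(simple_predicates)
}
```

Because each row makes every simple predicate either true or false, it satisfies exactly one minterm, which is why the resulting fragments are complete and disjoint.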
Derived Horizontal Fragmentation
- Revisited with Employee and Department example.
Fragmentation Using Semi-Join
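- The semi-join operator is defined as: $R_1 \ltimes R_2 = \Pi_{R_1}(R_1 \bowtie R_2)$
- Consider two relations $R$ and $S$, which can be fragmented by semi-join as follows:
- $R$ is fragmented as $F = \{F_1, F_2, \ldots\}$ and $S$ as $G = \{G_1, G_2, \ldots\}$, where $G_i = S \ltimes F_i$ (for example, $R$ is Department and $S$ is Employee). $R$ is called the owner of the fragmentation and $S$ is the member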
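A small sketch of derived horizontal fragmentation with a semi-join, using Department as the owner and Employee as the member. The sample rows, attribute names, and the `semi_join` helper are hypothetical.

```python
# Hypothetical owner (Department) and member (Employee) relations
departments = [
    {"dept_id": 10, "city": "St Lucia"},
    {"dept_id": 20, "city": "Gatton"},
]
employees = [
    {"emp_id": 1, "dept_id": 10},
    {"emp_id": 2, "dept_id": 20},
    {"emp_id": 3, "dept_id": 10},
]


def semi_join(s_rows, r_rows, attr):
    """Keep the rows of S that join with at least one row of R on attr."""
    keys = {r[attr] for r in r_rows}
    return [s for s in s_rows if s[attr] in keys]


# Primary fragmentation of the owner by city
owner_fragments = {
    city: [d for d in departments if d["city"] == city]
    for city in ("St Lucia", "Gatton")
}
# Derived fragmentation of the member: G_i = S semi-join F_i
member_fragments = {
    city: semi_join(employees, frag, "dept_id")
    for city, frag in owner_fragments.items()
}
```

Each employee fragment ends up alongside the department fragment it joins with, so a Department-Employee join can be evaluated fragment by fragment.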
Problems of Semi-Join
- Why do we need semi-join in fragmentation?
- The join operation is very expensive
- Data must remain consistent for reads when a failure occurs
- The semi-join has several applications, including defining fragments in the derived fragmentation process
Vertical Partitioning Properties
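- Completeness: Every attribute from the original relation must be present in at least one of the vertical fragments, so that the fragments' attribute sets combine to the full set: $\bigcup_{i=1}^{m} A_i = A$
- Disjointness: Attributes should be exclusive to one vertical fragment and not repeat, except for the primary key, which every fragment must include so that the relation can be reconstructed by a join.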
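Vertical fragmentation and its reconstruction can be sketched as a projection followed by a join on the key. The helper names and sample rows are hypothetical, and the sketch assumes `id` is the primary key repeated in every fragment.

```python
def project(rows, attrs):
    """Vertical fragment: keep only the listed attributes of every row."""
    return [{a: r[a] for a in attrs} for r in rows]


def join_on_key(left, right, key):
    """Reconstruct rows by matching the two fragments on the shared key."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left]


# Hypothetical Student rows; "id" is assumed to be the primary key
students = [
    {"id": 1, "name": "Ana", "age": 20, "location": "St Lucia"},
    {"id": 2, "name": "Ben", "age": 22, "location": "Gatton"},
]
v1 = project(students, ["id", "name"])
v2 = project(students, ["id", "age", "location"])
rebuilt = join_on_key(v1, v2, "id")
```

Apart from the key, no attribute appears in both fragments, and joining them on `id` returns the original rows.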
Trade-Off of Fragmentation with Replication
- Replication can cause update problems.
Allocation of Fragments to Sites
- The data has now been partitioned and possibly replicated; the following questions all affect which fragments are allocated where:
- Where do the queries originate?
- What are the communication costs?
- What is the storage cost?
- What is the query processing strategy?
- How are data joins done?
What about the ideal allocation?
- Minimize query time
- Maximize throughput at each site
- Minimize update and other costs
- All of these are constrained by the storage, bandwidth, and power available at the sites
Inputs to Fragment Allocation
- Database information: fragment selectivity, i.e. the number of tuples in fragment F that are accessed to process a query Q
- Application information: read access, the number of read accesses that a query Q makes to fragment F during execution, and update access, the number of update accesses that a query Q makes to fragment F during execution
- System information: site information (knowledge of storage and processing capacity) and network information (communication cost)
Types of Fragment Allocation
- The fragment assignment problem:
- Input: fragments $F = \{F_1, F_2, \ldots, F_m\}$ and sites $S = \{S_1, \ldots, S_n\}$
- Typical queries $Q_1, \ldots$, each described by its read and write information
- Output: an allocation, expressed as a set of mappings
- $x_{ij} = 1$ if fragment $F_i$ is assigned to site $S_j$; otherwise $x_{ij} = 0$
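The assignment problem above can be illustrated with a brute-force search over non-replicated allocations. Everything in the sketch is an assumption made for illustration: the access counts, the rule that only remote accesses cost anything, the `REMOTE_COST` constant, and the absence of storage limits. It is not the cost formula from the course material.

```python
from itertools import product

sites = ["S1", "S2"]
fragments = ["F1", "F2"]
# Hypothetical workload: accesses[(origin_site, fragment)] = number of accesses
accesses = {
    ("S1", "F1"): 75,
    ("S2", "F1"): 25,
    ("S1", "F2"): 10,
    ("S2", "F2"): 40,
}
REMOTE_COST = 1  # assumed cost per remote access; local accesses are free


def total_cost(assignment):
    """Sum the cost of every access whose origin differs from the fragment's site."""
    return sum(
        count * REMOTE_COST
        for (origin, frag), count in accesses.items()
        if assignment[frag] != origin
    )


def best_allocation():
    """Try every fragment-to-site mapping and keep the cheapest one."""
    candidates = (
        dict(zip(fragments, choice))
        for choice in product(sites, repeat=len(fragments))
    )
    return min(candidates, key=total_cost)


best = best_allocation()
# The same allocation as 0/1 variables: x[(F_i, S_j)] = 1 if F_i is stored at S_j
x = {(f, s): int(best[f] == s) for f in fragments for s in sites}
```

Exhaustive search tries every site for every fragment, so it only works for toy inputs; it is used here purely to make the input, output, and 0/1 encoding concrete.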
Summary of DDB Design
- DDB design covers the top-down and bottom-up approaches, schemas, the types of fragmentation (defined using minterm predicates), and allocation; fragments must satisfy the CDR properties and preserve a single source of truth
- There are different types of distributed design (top-down and bottom-up), but top-down design is mostly used in DDB design
- DDB has a Global Conceptual Schema
- Centralized vs distributed – understanding of main difference
- Transparency levels and problems associated with them
- Issues with provision of full-scale DBMS functionality
- Data fragmentation, replication and allocation
- Properties of Fragmentation: Completeness, Disjointness, Reconstruction (CDR)
- Minterm predicates are used to maintain the CDR properties
- The semi-join operator can be used in fragmentation
- SSOT is a property of DDB.