Distributed Databases (DDB) Explained

Questions and Answers

Which of the following accurately describes a distributed database (DDB)?

  • A single database stored in one central location for easy access and management.
  • A database managed by software that hides data distribution, requiring manual user configuration.
  • A database system where all nodes must be homogeneous to ensure data consistency.
  • A collection of logically interrelated databases distributed over a computer network. (correct)

Which of the following is NOT typically considered a direct benefit of using distributed databases?

  • Improved reliability through redundancy and fault tolerance.
  • Enhanced data security due to centralized control and monitoring. (correct)
  • Increased scalability to handle growing data volumes and user traffic.
  • Better support for geographically distributed organizational structures.

Which aspect of a DDBMS do users typically not need to be aware of, reflecting a key principle of transparency?

  • Data fragmentation. (correct)
  • Data replication methods.
  • Data language used for queries.
  • Network configuration.

How do replicated components in a distributed system contribute to reliability and availability?

<p>They eliminate the single point of failure, so failure of one node does not impact the system. (C)</p>

Which of the following performance benefits is primarily associated with data localization in distributed databases?

<p>Reduced contention for CPU and I/O services. (A)</p>

What is the primary difference between scaling up and scaling out a database system?

<p>Scaling up enhances the capacity of a single server, while scaling out distributes the load across multiple servers. (B)</p>

Why is distribution still desirable when you want to manage the whole system?

<p>Distribution is more expensive to manage, but it is still desirable for handling very large data volumes and distributed data accesses. (B)</p>

In the context of DDB architecture, what is the role of the Global Conceptual Schema (GCS)?

<p>To provide a unified, integrated view of the entire database. (D)</p>

Which of the following factors is LEAST critical when designing a distributed database?

<p>The brand of database software used at each site. (D)</p>

What is the main advantage of the top-down approach to distributed database design?

<p>Optimal data consistency can be achieved, starting with a comprehensive design. (C)</p>

In the context of a top-down design process for a distributed database, what does 'data fragmentation' primarily involve?

<p>Dividing data into smaller, manageable units/fragments. (B)</p>

Which of the following is an example of the 'View Design' step in a top-down distributed database design process?

<p>Defining that store managers can view inventory and sales data only for their specific location. (B)</p>

What is the primary purpose of 'data allocation' in the context of distributed database design?

<p>To assign database fragments to specific locations for optimal performance. (A)</p>

Which aspect of database design is primarily addressed during the 'physical design' phase?

<p>Choosing hardware, storage solutions, indexing and clustering strategy. (B)</p>

What is the primary goal of fragmentation in a distributed database?

<p>To allow for parallel processing by partitioning data and improve data locality. (C)</p>

What are the 'CDR properties' of fragmentation?

<p>Completeness, Disjointness, Reconstruction. (A)</p>

Why is it important for data fragmentation to allow data reconstruction?

<p>To enable complete recovery of the original data from fragments. (B)</p>

What is the main difference between horizontal and vertical fragmentation?

<p>Horizontal fragmentation divides a table into rows; vertical fragmentation divides a table into columns. (C)</p>

Which of the following is a benefit of Round Robin data distribution for different queries?

<p>It distributes data evenly. (D)</p>

Which of the following is a drawback of data fragmentation?

<p>Increased query overhead for global queries. (A)</p>

What is a primary characteristic of primary horizontal fragmentation?

<p>It is defined using simple conditions on a single primary table. (B)</p>

In the context of fragmentation, what is the purpose of the 'minterm predicates approach'?

<p>To automatically generate predicates with properties such as completeness and disjointness. (A)</p>

What is derived horizontal fragmentation?

<p>Fragmentation of a relation that is derived from the fragments of a primary (owner) relation, using predicates on a joined foreign relation. (A)</p>

In the context of database fragmentation, what must derived fragmentation avoid?

<p>Reconstructing or reconstructing tables (D)</p>

In derived horizontal fragmentation, if relation R is the owner and relation S is a member, how is the fragmentation defined?

<p>Fragments of S are defined in terms of the fragments of R, using the semi-join operator. (A)</p>

What must all fragments include when using vertical fragmentation?

<p>The primary key, for reconstruction. (D)</p>

When is fragmentation sometimes forced onto database administrators?

<p>When sites may own data (D)</p>

What does a semi-join operation reduce when used in a centralized database?

<p>The quantity of data loaded from the hard disk into memory. (C)</p>

In the context of replication, what does updating replicated data mean?

<p>All copies of replicated data must be updated in a single transaction to maintain atomicity and consistency (D)</p>

How can fragments be allocated to sites?

<p>An optimal mapping is hard to find, so fragments are allocated using heuristics-based algorithms. (C)</p>

What is the SSOT Property?

<p>Single Source of Truth (C)</p>

Which of the following is a goal of allocating fragments in distributed database design?

<p>Minimize query response time (D)</p>

Which database fragmentation property is best described as the following: '$\forall F_i, F_j \in F,\ i \neq j \Rightarrow F_i \cap F_j = \emptyset$'?

<p>Disjointness (B)</p>

In allocating data fragments to sites, how is total cost calculated?

<p>$\text{total cost} = \text{total local processing cost} + \text{total data exchange cost} + \text{total storage cost}$ (A)</p>

Which of the following must be maintained in order for all copies of replicated data to be updated?

<p>Atomicity and Consistency. (A)</p>

What is NOT an example of a question to consider when using SSOT?

<p>Where is the database cluster located? (D)</p>

What term refers to the number of tuples that need to be accessed to process a query?

<p>Fragment Selectivity (B)</p>

Which of the following best describes what SSOT solves in DDB?

<p>Duplication conflicts (D)</p>

Which of the following is NOT a reason that Semi-join operations are used?

<p>Semi-Join operators can be used in semi-structure fragmentation (D)</p>

From the trade-off of Fragmentation with Replication image, does fragmentation or full replication have greater update problems?

<p>Full replication (B)</p>

Flashcards

Distributed Database (DDB)

A database spread across multiple locations or nodes.

DDBMS

Software that manages a distributed database and hides distribution details.

Why Distributed?

Distribution matches the organizational structure of distributed enterprises.

Computing Power

Dividing tasks into smaller parts processed by different nodes.

Transparency Benefit

Users don't need to know the underlying complexities.

Reliability Benefit

System continues despite failures.

Scalability Benefit

Additional nodes can be added to handle load.

Fault Tolerance

The failure of a single site or communication link should not bring down the whole system.

Data Localization

Reduces contention for CPU, I/O, and communication overhead.

Network Factors

Topology, bandwidth, latency, and reliability

Top-Down Approach

Starts with the design of the global conceptual schema (GCS), followed by data fragmentation and allocation.

Bottom-Up Approach

Integration of pre-existing local conceptual schemas.

Data Fragmentation

Dividing data into smaller, manageable pieces.

Data Allocation

Assigning fragments to particular sites.

Primary Fragmentation

Dividing a single table row-wise, using predicates on that table alone.

Derived Fragmentation

Horizontal fragmentation of a member relation that is derived from the fragments of a primary (owner) relation.

Data Fragment

A subset of the tuples (horizontal fragment) or of the attributes (vertical fragment) of a relation.

Completeness

Fragmentation property: every data item of the original relation appears in at least one fragment.

Disjointness

Fragmentation property: no data item appears in more than one fragment (in vertical fragmentation, the primary key is repeated for reconstruction).

Replication

All copies of replicated data must be updated in a single transaction to maintain atomicity and consistency.

Allocation

Placement of fragments, and/or best number of copies, to a set of sites

SSOT

Single Source of Truth: avoids conflicting duplicates of data while keeping the data reliable and available.

Study Notes

Examples of Distributed Databases (DDB)

  • Google Spanner, Apache Cassandra, CockroachDB, Amazon DynamoDB, Microsoft CosmosDB, Apache HBase, and Nebula Graph are examples of DDBs

Questions to Think About

  • What are distributed databases?
  • Why use distributed databases?
  • What are the benefits of distributed databases?
  • How can a distributed database be designed?

Centralized vs. Distributed Databases

  • Centralized databases store data at a single physical site, whereas distributed databases spread it across multiple sites or nodes
  • Data access happens from a single server in a centralized database, whereas distributed databases access data from multiple servers
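  • Centralized databases are less reliable because the central site is a single point of failure, whereas distributed databases can keep working when individual nodes fail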
  • Centralized databases have limited scalability, with performance degrading under load, whereas distributed databases can handle load with additional nodes for high scalability
  • Centralized databases have minimal data redundancy and typically store a single copy of data; distributed databases have high redundancy, wherein multiple copies of data are kept to improve fault tolerance
  • Consistency is easier to maintain in a centralized database because all data is in one place, whereas in a distributed database it is more complex because replicated data must be kept in sync
  • A centralized database may be slower under heavy loads due to a single access point; distributed databases have quicker response times in distributed environments
  • Centralized databases are easier to secure, whereas distributed databases are more challenging because of data distribution across multiple locations

Topics Covered

  • Distributed systems concepts
  • Motivations and benefits of DDBs
  • Types of transparency, reliability, and availability
  • Performance issues and scalability
  • DDB architecture, including schemas and components
  • DDB design factors
  • Two design approaches: top-down and bottom-up process with an example.
  • Fragmentation properties
  • Minterm predicates approach
  • Derived horizontal fragmentation and semi-join
  • Vertical fragmentation and properties
  • Allocation of fragments and the SSOT (Single Source of Truth) property

Distributed Computing Systems

  • Consist of autonomous, interconnected processing elements that cooperate over a computer network
  • These elements are not necessarily homogeneous
  • The Google search engine and Apache Hadoop are examples

Distributed Databases (DDB)

  • DDBs are collections of logically interrelated databases distributed over a computer network
  • They use a global schema

DDBMS

  • DDBMS is software that manages a distributed database while maintaining transparency for the user.

Advantages of Distributed Systems

  • Correspond to the organizational structure of distributed enterprises, including physically distributed companies, e-commerce businesses in different geographical centers, and supply chain management
  • Offer economic benefits with easy management, a global schema, data integrity management, and prevention of data silos
  • Offer better computing power by dividing tasks, which are then efficiently processed by different nodes
  • Key feature is the ability to work towards a common goal
  • Examples are Dropbox, OneDrive, and Amazon

DDBMS Benefits

  • Transparency: hides the complexity of the distributed nature from users
  • Reliability: systems continue working even if some parts fail
  • Availability: resources are available when needed
  • Performance: achieves faster processing
  • Expansion: allows easy addition of resources
  • Data Integrity: ensures the database remains consistent

Types of Transparency

  • Language
  • Fragmentation
  • Replication
  • Network
  • Data Independence (logical and physical independence)

Reliability and Availability

  • A reliable system keeps its resources available when they are needed
  • Replicated components (data and software) eliminate single points of failure
  • Failure of a single site or communication link should not bring down the entire system
  • Managing reachable data requires distributed transaction support to ensure correctness when failures occur

Performance Considerations

  • Data localization reduces contention and communication overhead
  • Parallel processing with inter-query and intra-query parallelism is also important
  • Performance increases using indexing, query optimization, or SQL chips

Expansion and Scalability

  • Scale-Up: Increase capacity by upgrading to larger storage or better computers
  • Scale-Out: Increase capacity by adding computing, processor, memory, and storage nodes to the network

When to Opt for Distribution

  • Management is easier when data is localized; however, distribution is still desirable for handling very large data volumes and distributed data accesses
  • Distribute when the whole system isn't under your control
  • DDB theory is essential to make an educated decision

DDB Architecture: Schemas

  • ANSI/SPARC model
  • External level - external view
  • Conceptual level - global conceptual schema
  • Internal level - local internal schema

DDB Architecture: Components

  • Global query
  • Global query compiler
  • Global query optimizer
  • Global transaction manager
  • Local transaction manager
  • Local query transaction
  • Stored Data

Key Points for DDB Design:

  • Network topology, bandwidth, latency, and reliability
  • Distribution of the nodes
  • Distribution of application software
  • Three dimensions: Level of sharing, Access patterns, Level of knowledge about site databases

Distributed Database Design - Two Main Approaches

  • Top-down Approach: Starts with designing the Global Conceptual Schema (GCS) and then considers data fragmentation and allocation before designing the component databases
  • Bottom-up Approach: Focuses on integrating pre-existing Local Conceptual Schemas (LCSs) into a GCS, involving design views and external schemas; it can have issues with interoperability

Distributed Database Design - Top-down vs Bottom-up Approach

| Factor | Top-down Approach | Bottom-up Approach |
|---|---|---|
| Use case | New system design | Existing databases need integration |
| Schema creation | Starts with Global Conceptual Schema (GCS) | Integrates Local Conceptual Schemas (LCS) into GCS |
| Data consistency | Easier to enforce global consistency | More difficult due to pre-existing differences |
| Performance | Optimized from the start | Might be suboptimal, depending on integration method |
| Interoperability | Standardized from the beginning | May require additional middleware for compatibility |
| Implementation complexity | Requires detailed planning but smooth execution | More complex due to schema and policy conflicts |
| Best for | Banks, airlines, healthcare, stock trading | Mergers, multinational companies, legacy integration |

Top-Down Design Process

  • Requirements analysis -> Objectives -> Conceptual Design -> GCS -> View Integration -> View Design -> Access information -> External Schema (ECS) -> Distribution Design -> LCS -> Physical Design -> Physical Schema -> Feedback -> Monitoring

Top-Down Design Process - Woolworths Example

  • Requirement Analysis: Defining the expectations and needs of the database system, for instance handling nationwide transactions, providing real-time inventory updates, and ensuring data consistency
  • Conceptual Design: Creating a unified database view including data entities, relationships, and constraints. Visualizing the global conceptual model (GCS) using ER diagrams
  • View Design: Designing user-specific perspectives for stakeholders like store managers, supply chain team, and analytics team
  • Distribution Design: Defining a strategy to distribute the database across multiple sites, including data fragmentation and allocation
  • Physical Design: Creating the physical structure and access methods: choosing hardware and storage, using indexing for data retrieval, and using clustering to optimize disk usage
  • Monitoring: Dashboards monitor system health, covering performance tuning, load balancing, data duplication, and data consistency

Fragmentation & Allocation

  • Fragmentation and allocation split the database so that it can be stored at different locations
  • The process of dividing the database into multiple smaller parts or sub-tables is called fragmentation
  • The smaller parts or sub-tables are called fragments, and the fragments are stored at different locations

Factors for fragmentation:

  • A table should only be split if it can be reconstructed from its fragments using UNION and/or JOIN operations
  • Fragmentation can be horizontal, vertical, or hybrid

Example about location fragments

  • Student(id, name, age, location, ...)
  • Assume 75% of queries are about St Lucia and 25% of queries are about Gatton
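
As a minimal sketch of this example, the Python snippet below splits a Student table into one horizontal fragment per location. The sample rows and the function name `fragment_by_location` are illustrative assumptions, not part of the original lesson.

```python
from collections import defaultdict

# Hypothetical sample rows for Student(id, name, age, location).
students = [
    {"id": 1, "name": "Ana", "age": 20, "location": "St Lucia"},
    {"id": 2, "name": "Ben", "age": 22, "location": "Gatton"},
    {"id": 3, "name": "Chen", "age": 21, "location": "St Lucia"},
    {"id": 4, "name": "Dee", "age": 23, "location": "St Lucia"},
]


def fragment_by_location(rows):
    """Group rows into one horizontal fragment per distinct location."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row["location"]].append(row)
    return dict(fragments)


fragments = fragment_by_location(students)
for site, rows in fragments.items():
    print(site, [r["id"] for r in rows])
```

If the St Lucia fragment is stored at the St Lucia site, the 75% of queries that concern St Lucia can be answered from local data.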

Why fragmentation?

  • Application views are usually subsets of relations, so a subset (a fragment) is a natural unit of distribution that improves locality
  • Without fragmentation there are two options:
    1. The entire table is stored at one site: if the table is large, remote data access can be slow
    2. The entire table is replicated: executing updates becomes a problem, and parallel execution is better achieved by querying separate fragments

Cons of Fragmentation

  • Fragmenting adds query overheads when fragments are across sites
  • Complexity in managing fragmentation
  • Difficulty in choosing the attributes to fragment on
  • Replicated data that exists across multiple locations must be kept synchronized

Types of Fragmentation

  • Horizontal fragmentation splits a table row-wise based on predicates
  • Primary horizontal fragmentation uses predicates on a single table
  • Derived horizontal fragmentation uses predicates that involve a foreign (joined) relation
  • Vertical fragmentation divides a table into sets of columns

Horizontal Partitioning Techniques

  • Round Robin Partitioning: Distributes data evenly and performs well for scanning the entire table, but is not good for point or range queries.
  • Hash Partitioning: Distributes data evenly when the hash function is well chosen. Efficient for point queries on the partitioning key, but not good for range queries.
  • Range Partitioning: Performs well for queries on the partitioning attribute, but a good partitioning vector must be selected to avoid data and execution skew.
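
The three techniques can be sketched in a few lines of Python. This is only an illustration under simple assumptions: rows are dictionaries, `zlib.crc32` stands in for whatever hash function a real system would use, and the function names are hypothetical.

```python
import bisect
import zlib


def round_robin_partition(rows, n):
    """Assign the i-th row to partition i mod n."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Assign each row to partition hash(row[key]) mod n.

    zlib.crc32 is an assumed stand-in for a real system's hash function.
    """
    parts = [[] for _ in range(n)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[h % n].append(row)
    return parts


def range_partition(rows, vector, key):
    """Assign rows using a sorted partitioning vector.

    A vector [v1, v2] creates three partitions:
    key < v1, v1 <= key < v2, and key >= v2.
    """
    parts = [[] for _ in range(len(vector) + 1)]
    for row in rows:
        parts[bisect.bisect_right(vector, row[key])].append(row)
    return parts


# Hypothetical data: ten rows with ids 0..9 and ages 18..27.
rows = [{"id": i, "age": 18 + i} for i in range(10)]
print([len(p) for p in round_robin_partition(rows, 3)])
print([len(p) for p in hash_partition(rows, 3, "id")])
print([len(p) for p in range_partition(rows, [20, 25], "age")])
```

With range partitioning, a query such as "age between 20 and 24" touches a single partition, while the same query under round robin or hash partitioning has to visit every partition.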

Properties of Fragmentation (CDR)

  • Completeness: All data elements from the original table must be included in at least one fragment.
  • Disjointness: No data element should exist in multiple fragments, ensuring each piece of data only appears once.
  • Reconstruction: The original table can be reconstructed from the fragments, through a union operation for horizontal fragments or a join operation for vertical fragments.
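
The three properties can be checked mechanically for horizontal fragments. The sketch below assumes rows are hashable tuples and that the relation has no duplicate rows; `check_cdr` is a hypothetical helper name.

```python
def check_cdr(relation, fragments):
    """Check the CDR properties for a list of horizontal fragments."""
    original = set(relation)
    union = set().union(*fragments)
    total = sum(len(set(f)) for f in fragments)
    return {
        # Every original row appears in at least one fragment.
        "completeness": original <= union,
        # No row appears in more than one fragment.
        "disjointness": total == len(union),
        # The union of the fragments gives back the original relation.
        "reconstruction": union == original,
    }


relation = [(1, "St Lucia"), (2, "Gatton"), (3, "St Lucia")]
good = [[(1, "St Lucia"), (3, "St Lucia")], [(2, "Gatton")]]
bad = [[(1, "St Lucia")], [(1, "St Lucia"), (2, "Gatton")]]

print(check_cdr(relation, good))
print(check_cdr(relation, bad))
```

The second fragmentation fails all three checks: row 3 is missing, and row 1 is stored twice.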

Minterm Predicates Approach

  • Simple predicate: $P_i : a_j \ \theta \ \text{value}$, where $\theta \in \{<, =, \ldots\}$
  • Given a set of simple predicates $P = \{P_1, P_2, \ldots, P_n\}$, a minterm predicate is $M_i = P_1^* \wedge P_2^* \wedge \cdots \wedge P_n^*$
  • Each $P_i^*$ is either $P_i$ or $\neg P_i$
  • Thus, there can be $2^n$ such minterm predicates in total; eliminate the useless ones (some are invalid)
  • Outcome: $M = \{M_1, M_2, \ldots, M_k\}$; each minterm predicate defines a fragment
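
A small Python sketch can make the enumeration concrete. The two simple predicates and the sample rows are assumptions chosen for illustration. As a simplification, a minterm is treated as useless when it selects no rows of the sample data, whereas in general invalid minterms are those that are logically contradictory.

```python
from itertools import product

# Hypothetical simple predicates over rows with "age" and "location".
simple_predicates = {
    "age < 21": lambda r: r["age"] < 21,
    "location = 'St Lucia'": lambda r: r["location"] == "St Lucia",
}

rows = [
    {"id": 1, "age": 19, "location": "St Lucia"},
    {"id": 2, "age": 25, "location": "St Lucia"},
    {"id": 3, "age": 20, "location": "Gatton"},
]


def minterm_fragments(rows, predicates):
    """Enumerate all 2^n minterms and keep the non-empty fragments."""
    names = list(predicates)
    fragments = {}
    for signs in product([True, False], repeat=len(names)):
        # A row satisfies the minterm when every simple predicate
        # evaluates to the sign chosen for it (P_i or NOT P_i).
        matching = [
            r for r in rows
            if all(predicates[n](r) == s for n, s in zip(names, signs))
        ]
        if matching:  # drop minterms that select nothing on this data
            label = " AND ".join(
                n if s else f"NOT({n})" for n, s in zip(names, signs)
            )
            fragments[label] = matching
    return fragments


fragments = minterm_fragments(rows, simple_predicates)
for label, frag in fragments.items():
    print(label, [r["id"] for r in frag])
```

Because every row makes each simple predicate either true or false, it satisfies exactly one minterm, which is why the resulting fragments are complete and disjoint.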

Derived Horizontal Fragmentation

  • Revisited with Employee and Department example.

Fragmentation Using Semi-Join

  • The semi-join operator is defined as: $R_1 \ltimes R_2 = \Pi_{R_1}(R_1 \bowtie R_2)$
  • Consider two relations $R$ and $S$, where $R$ is fragmented as $F = \{F_1, F_2, \ldots\}$ and $S$ is fragmented as $G = \{G_1, G_2, \ldots\}$
  • Each derived fragment is $G_i = S \ltimes F_i$ (for example, $R$ is Department and $S$ is Employee); $R$ is called the owner of the fragmentation and $S$ is the member
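
The following sketch applies this definition to a Department (owner) and Employee (member) pair. The attribute names, sample rows, and the `semi_join` helper are assumptions made for illustration.

```python
def semi_join(s_rows, r_rows, key):
    """Return the rows of S that join with at least one row of R on `key`."""
    r_keys = {r[key] for r in r_rows}
    return [s for s in s_rows if s[key] in r_keys]


# Owner relation R (Department), already fragmented by location.
department_fragments = {
    "F1": [{"dept_id": 10, "location": "St Lucia"}],
    "F2": [{"dept_id": 20, "location": "Gatton"}],
}

# Member relation S (Employee), which references Department via dept_id.
employees = [
    {"emp_id": 1, "name": "Ana", "dept_id": 10},
    {"emp_id": 2, "name": "Ben", "dept_id": 20},
    {"emp_id": 3, "name": "Chen", "dept_id": 10},
]

# Each derived fragment G_i is the semi-join of S with owner fragment F_i.
employee_fragments = {
    name.replace("F", "G"): semi_join(employees, frag, "dept_id")
    for name, frag in department_fragments.items()
}

for name, frag in employee_fragments.items():
    print(name, [e["emp_id"] for e in frag])
```

Each employee fragment can then be stored at the same site as the department fragment it was derived from, so a join between the two stays local.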

Problems of Semi-Join

  • Why do we need semi-join in fragmentation?
  • The join operation is very expensive, and a semi-join reduces the amount of data that has to be processed
  • Data must remain consistent for reads even when a failure occurs
  • Semi-join has several applications, including defining the fragments in the derived fragmentation process

Vertical Partitioning Properties

  • Completeness: Every attribute of the original relation must be present in at least one of the vertical fragments, so that together they cover the relation: $\bigcup_{i=1}^{m} A_i = A$
  • Disjointness: Vertical fragments should not share attributes, except for the primary key, which is repeated in every fragment for reconstruction.
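
A short sketch of vertical fragmentation and its reconstruction follows. It assumes a Student table keyed by `id`; the attribute groups and helper names are hypothetical.

```python
def vertical_fragment(rows, key, attribute_groups):
    """Project each row onto every attribute group, always keeping the key."""
    return [
        [{a: row[a] for a in [key, *group]} for row in rows]
        for group in attribute_groups
    ]


def reconstruct(fragments, key):
    """Rebuild the original rows by joining the fragments on the key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# Hypothetical sample rows for Student(id, name, age, location).
students = [
    {"id": 1, "name": "Ana", "age": 20, "location": "St Lucia"},
    {"id": 2, "name": "Ben", "age": 22, "location": "Gatton"},
]
groups = [["name", "age"], ["location"]]

fragments = vertical_fragment(students, "id", groups)
print(fragments[0])
print(fragments[1])
print(reconstruct(fragments, "id") == students)
```

The key `id` is the only attribute that appears in both fragments, which is what makes the join in `reconstruct` possible.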

Trade-Off of Fragmentation with Replication

  • Replication can cause update problems.

Allocation of Fragments to Sites

Now that the data has been partitioned and possibly replicated, the following questions all affect which fragments are allocated to which sites:

  • Where do the queries originate?
  • What are the communication costs?
  • What is the storage cost?
  • What is the query processing strategy?
  • How are data joins done?

What about the ideal allocation?

  • Minimize query response time
  • Maximize throughput at each site
  • Minimize update and other costs

All of these are constrained by the storage, bandwidth, and processing power available at the sites.

Inputs to Fragment Allocation

  • Database information: fragment selectivity, the number of tuples in fragment F that need to be accessed to process a query Q
  • Application information:
    • Read access: the number of read accesses that a query Q makes to fragment F during its execution
    • Update access: the number of update accesses that a query Q makes to fragment F during its execution
  • System information:
    • Site information: knowledge of the storage and processing capacity of each site
    • Network information: the communication cost between sites

Types of Fragment Allocation

  • The fragment allocation (assignment) problem:
  • Input: fragments $F = \{F_1, F_2, \ldots, F_m\}$ and sites $S = \{S_1, S_2, \ldots, S_n\}$
  • Typical queries $Q_1, \ldots$, each detailed with its read and update access information
  • Output: an allocation, expressed as a set of mappings
  • $x_{ij} = 1$ if fragment $F_i$ is assigned to site $S_j$; otherwise $x_{ij} = 0$
Summary of DDB Design

  • DDB design covers the top-down and bottom-up approaches, schemas, the fragmentation types and the minterm predicates used to define them, the CDR properties that fragmentation must ensure, and allocation with the Single Source of Truth property
  • There are different types of distributed design (top-down and bottom-up), but top-down design is mostly used in DDB design
  • DDB has a Global Conceptual Schema
  • Centralized vs distributed: understanding of the main differences
  • Transparency levels and problems associated with them
  • Issues with provision of full-scale DBMS functionality
  • Data fragmentation, replication and allocation
  • Properties of Fragmentation: Completeness, Disjointness, Reconstruction (CDR)
  • Minterm predicates are used to maintain the CDR properties
  • Semi-Join operator can be used in fragmentation
  • SSOT is a property of DDB.
