Introduction to Data Architecture

Questions and Answers

What is meant by data architecture?

  • A type of data storage solution.
  • The process of collecting data manually.
  • The structure that governs the lifecycle of data management. (correct)
  • The art of designing buildings and software.

What does the term 'immutable' refer to in the context of object storage?

  • A type of transient data storage.
  • Data that is frequently updated.
  • Data that cannot be changed once stored. (correct)
  • Data that can be modified after it is created.

Which storage class is described as best for data that is accessed frequently?

  • Archive
  • Nearline
  • Standard (correct)
  • Coldline

What is the function of buckets in cloud storage?

  • To act as containers for storing objects.

What is the significance of legislative factors like GDPR in data management?

  • They influence the lifecycle processes of data.

Which statement best describes latency in data systems?

  • The delay between a request for data and the corresponding response.

What is one of the main reasons relational databases are considered reliable?

  • They guarantee certain properties during transactions.

In comparing AWS and GCP, what is a similarity between their data lifecycle processes?

  • They are based on the same underlying principles of cloud computing.

What does the term 'eventual consistency' in BASE refer to?

  • Data can be inconsistent for a period but will become consistent later.

What is the expected access frequency for data stored in the Archive storage class?

  • Once per 365 days or less.

Why might ACID properties be considered overly pessimistic for some use cases?

  • They can limit scalability and flexibility.

Which of the following is a challenge associated with sharding?

  • Increased complexity in backups and recoveries.

What is the primary purpose of a cache in data storage?

  • To temporarily store frequently accessed data for quick retrieval.

What does the concept of 'Isolation' in ACID entail?

  • Transactions do not interfere with each other, appearing to run in order.

Which aspect of the relational database model is primarily supported by the use of SQL?

  • Defining and manipulating structured data.

What is a potential drawback of choosing a poor sharding key?

  • Uneven distribution of data across shards.

    Study Notes

    Introduction to Data Architecture

    • This lesson is a basic introduction to data architecture: how data is managed throughout its lifecycle.
    • Data is any collection of discrete or continuous values that convey information.
    • A famous quote describes data as the new oil, highlighting its increasing importance.

    What is Architecture?

    • Architecture is the art of designing and constructing something, like buildings or software.

    What is Data Architecture?

    • It describes the entire life cycle of data management, from collection to transformation, distribution, and consumption.
    • It establishes a blueprint for how data flows within systems.
    • Key components include data pipelines, real-time analytics, cloud storage, cloud computing, APIs, Kubernetes, and data streaming.

    Data at Rest (Storage Services and Databases)

    • This section focuses on data storage services and databases.
    • Key components and characteristics described include Object File Stores, Random Access File Systems, data consistency, and different database types (RDBMS, NoSQL, and document stores).

    Filesystems and Filestores

    • Object File Stores focus on durability, availability, capacity, and cost.
    • Random Access File Systems are characterized by access times and throughput.
    • Data consistency models like ACID and BASE are mentioned in this context.
    • RDBMS, NoSQL, and document stores are different database types covered.
    • Scaling the data tier is a crucial aspect of data architecture.

    Cloud Storage Systems (Object Stores)

    • Cloud object storage (AWS S3, Azure Blob Storage, Google Cloud Storage) is a service for storing objects.
    • Objects are immutable pieces of data stored as files in various formats.
    • Objects are stored in containers called buckets.
    • Buckets are associated with projects and can be grouped under organizations.
    • Buckets may have lifecycles.

    Data Lifecycles

    • Managing static enterprise data involves considering data lifespans and lifecycles.
    • Time to Live (TTL) policies for objects and policies for retaining noncurrent versions of objects are considered.
    • Policies for downgrading storage classes of objects are included.

    Object Lifecycle Management

    • Not all data is created equal; some data has a limited lifespan.
    • Time to Live (TTL) policies define when objects expire.
    • Retention policies govern how noncurrent versions of objects are kept.
    • Cost policies move objects to cheaper storage classes as they age.
    • Legislation like GDPR also impacts data lifecycles.

    A Simple Example

    • Lifecycle management configurations can be assigned to object stores.
    • These configurations contain rules applied to objects within the bucket.
    • When an object meets a rule condition, automatic actions are triggered on the object.

    Policy Examples

    • An object's storage class can be downgraded to a colder class (e.g., Coldline Storage) after a certain age, such as 365 days.
    • Objects can be deleted based on their creation date, e.g., all objects created before January 1, 2013.
    • In a bucket with versioning enabled, a rule can keep only the most recent versions of each object and delete older ones.
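The three policy examples above can be sketched as a small rule evaluator. This is an illustrative model only, not any cloud provider's API: the class names, field names, and the choice of keeping 3 versions are assumptions made for the example.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class StoredObject:
    name: str
    created: date
    storage_class: str = "STANDARD"
    newer_versions: int = 0  # how many newer versions of this object exist


@dataclass(frozen=True)
class Rule:
    action: str                               # "Delete" or "SetStorageClass"
    target_class: Optional[str] = None        # used by SetStorageClass only
    age_days: Optional[int] = None            # object is at least this old
    created_before: Optional[date] = None     # object was created before this date
    num_newer_versions: Optional[int] = None  # at least this many newer versions exist

    def matches(self, obj: StoredObject, today: date) -> bool:
        # Every condition that is set must hold for the rule to match.
        if self.age_days is not None and (today - obj.created).days < self.age_days:
            return False
        if self.created_before is not None and obj.created >= self.created_before:
            return False
        if self.num_newer_versions is not None and obj.newer_versions < self.num_newer_versions:
            return False
        return True


def apply_rules(obj: StoredObject, rules: List[Rule], today: date) -> Optional[StoredObject]:
    """Return the object after lifecycle processing, or None if it was deleted."""
    matching = [r for r in rules if r.matches(obj, today)]
    if any(r.action == "Delete" for r in matching):
        return None  # in this sketch, deletion takes precedence over a class change
    for r in matching:
        if r.action == "SetStorageClass":
            obj = replace(obj, storage_class=r.target_class)
    return obj


# The three example policies (keeping 3 versions is an assumed value).
RULES = [
    Rule(action="SetStorageClass", target_class="COLDLINE", age_days=365),
    Rule(action="Delete", created_before=date(2013, 1, 1)),
    Rule(action="Delete", num_newer_versions=3),
]
```

Calling `apply_rules(obj, RULES, date.today())` for each object in a bucket mimics what a lifecycle configuration does automatically on the provider's side.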

    Examples of Lifecycles (GCP)

    • Different storage classes (Standard, Nearline, Coldline, Archive) are available, each optimized for different access frequencies.
    • Storage class appropriateness is determined by how often data needs to be accessed.
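As a rough illustration of matching a storage class to access frequency, the helper below picks a class from the expected number of days between accesses. The 365-day Archive threshold comes from the quiz above; the 30-day and 90-day thresholds for Nearline and Coldline are assumptions for this sketch, and the function name is hypothetical.

```python
def suggest_storage_class(days_between_accesses: float) -> str:
    """Suggest a storage class from how rarely the data is read."""
    if days_between_accesses >= 365:  # accessed once per year or less
        return "ARCHIVE"
    if days_between_accesses >= 90:   # assumed threshold: about once per quarter
        return "COLDLINE"
    if days_between_accesses >= 30:   # assumed threshold: about once per month
        return "NEARLINE"
    return "STANDARD"                 # frequently accessed data
```

In practice the choice also depends on retrieval fees and minimum storage durations, which this sketch ignores.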

    Examples of Lifecycles (AWS)

    • AWS and GCP storage lifecycles can be compared by how they serve different access frequencies.
    • AWS offers comparable storage classes, such as S3 Standard and S3 Glacier.
    • AWS lifecycle policies and GCP data lifecycles follow the same underlying principles.

    Random Access File Systems

    • Random Access File Systems (mounted network storage) extend file access beyond the local file system of a single machine's operating system.
    • Managed services are discussed in this context with examples from AWS, Google, and Azure.

    Filestores

    • Extending standard OS file systems may require managed file services.
    • Examples: AWS EFS, Google Filestore, and Azure File Storage.

    Performance Options

    • Different performance options for file storage are highlighted, along with how to measure performance.
    • Performance is measured in IOPS (Input/Output Operations Per Second) and throughput (data rate), which differ between service tiers (BASIC_HDD, BASIC_SSD, HIGH_SCALE_SSD).

    Managed Services

    • Managed services offer fully managed infrastructure for predictable performance.
    • Typical service speeds are discussed (480K IOPS and 16 GB/s) along with the services.
    • Use cases involving application migration, media rendering, electronic design automation (EDA), data analytics, and genomics processing are considered.

    Access Times and Throughput

    • Key performance indicators like 480K IOPS (Input/Output Operations Per Second) and 16 GB/s (high-scale SSD) are mentioned.
    • Low latency is also a key performance indicator.
    • Storage suitability as a cross-regional file system engine and standard cloud security are addressed.
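To make the relationship between these indicators concrete, the sketch below treats the IOPS and throughput figures as two separate ceilings and computes which one limits a workload for a given I/O size. The function names and the example block sizes are assumptions; only the 480K IOPS and 16 GB/s figures come from the notes above.

```python
def effective_throughput(iops_limit: float, throughput_limit: float, block_size_bytes: int) -> float:
    """Bytes per second achievable when both an IOPS cap and a throughput cap apply."""
    # Each operation moves one block, so IOPS * block size is the IOPS-bound rate.
    return min(iops_limit * block_size_bytes, throughput_limit)


def transfer_seconds(size_bytes: float, throughput_bytes_per_s: float, latency_s: float = 0.0) -> float:
    """Time to move size_bytes: fixed request latency plus streaming time."""
    return latency_s + size_bytes / throughput_bytes_per_s


IOPS_LIMIT = 480_000          # 480K IOPS, from the notes
THROUGHPUT_LIMIT = 16e9       # 16 GB/s, from the notes

small_io = effective_throughput(IOPS_LIMIT, THROUGHPUT_LIMIT, 4096)        # IOPS-bound
large_io = effective_throughput(IOPS_LIMIT, THROUGHPUT_LIMIT, 1_048_576)   # throughput-bound
```

With small 4 KiB operations the IOPS cap dominates, while large sequential operations hit the throughput cap first, which is why both numbers are quoted.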

    Beyond Relational Data (Structured and Queryable Data)

    • Data in relational formats is structured and queryable through languages like SQL.
    • Structured data is organized in tables with rows and columns, facilitating data querying.
    • An example table of employee data illustrates the relational format.
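A minimal sketch of such an employee table, using Python's built-in `sqlite3` module with an in-memory database. The column names and sample rows are invented for illustration and are not taken from the lecture.

```python
import sqlite3


def build_employee_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, made-up employees table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE employees (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            department TEXT NOT NULL,
            salary     REAL NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
        [
            ("Ada", "Engineering", 120000),
            ("Grace", "Engineering", 130000),
            ("Linus", "Support", 70000),
        ],
    )
    conn.commit()
    return conn


def department_summary(conn: sqlite3.Connection):
    """Headcount and average salary per department, queried with SQL."""
    return conn.execute(
        "SELECT department, COUNT(*), AVG(salary) "
        "FROM employees GROUP BY department ORDER BY department"
    ).fetchall()
```

Because the data sits in rows and columns with a declared schema, questions like "average salary per department" become a single declarative query.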

    Relational Data (Why Relational Matters)

    • Relational data models are widely used and employ languages like SQL, which is a query language used for manipulating data.
    • Many database models have been inspired by the SQL query language, including other data intensive systems.

    Non-relational Technologies (NoSQL)

    • NoSQL (Not Only SQL) databases are non-relational and prioritize distributed horizontal scalability over ACID properties.
    • Common NoSQL technologies like key-value storage, column-family databases, document stores, and graph databases are highlighted.

    ACID vs. BASE

    • ACID (Atomicity, Consistency, Isolation, Durability) describes properties required for reliable database transactions.
    • BASE (Basically Available, Soft state, Eventual consistency) describes alternative properties emphasizing scale and resilience over immediate consistency.
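The atomicity part of ACID can be illustrated with a money transfer in SQLite: either both balance updates happen or neither does. This is a sketch under assumed names (the `accounts` table and `transfer` function are hypothetical); it relies on the `sqlite3` connection context manager, which commits on success and rolls back on an exception.

```python
import sqlite3


def make_accounts() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    with conn:
        conn.execute(
            "CREATE TABLE accounts ("
            " id INTEGER PRIMARY KEY,"
            " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
        )
        conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    return conn


def transfer(conn: sqlite3.Connection, src: int, dst: int, amount: int) -> bool:
    """Move amount from src to dst atomically; return False if it was rolled back."""
    try:
        with conn:  # one transaction: commit on success, rollback on error
            # Credit first, so a failed debit shows the credit being undone.
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed; neither update is kept


def balances(conn: sqlite3.Connection):
    return conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
```

A BASE-style system would instead accept the write quickly and reconcile replicas later, trading this all-or-nothing guarantee for availability and scale.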

    Scaling the Data Tier (Scaling Storage)

    • Database server resizing is discussed through two approaches:
      • Unmanaged: Manual scaling of database servers.
      • Managed: Using services that resize servers for you, like AWS RDS, Google Cloud SQL, or Azure SQL Database.

    Read Replicas and Caches

    • Read replicas are used to offload read traffic, boosting read-heavy workloads and efficiency.
    • Caching solutions like Memcached or Redis (and managed caches like ElastiCache) can enhance database performance by storing frequently accessed data.
    • Caches matter because they temporarily hold frequently accessed data for quick retrieval, reducing load on the database.
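One common way to use such a cache is the cache-aside pattern: look in the cache first and fall back to the database on a miss. The sketch below uses a plain dictionary in place of Memcached or Redis; the class name, the `load_from_db` callback, and the TTL default are assumptions for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """In-process stand-in for a cache such as Redis, with per-entry expiry."""

    def __init__(self, load_from_db: Callable[[str], Any],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1              # served from the cache
            return entry[0]
        self.misses += 1                # missing or expired: go to the database
        value = self._load(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a cached entry, e.g. after the underlying row is updated."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a cached value can get; calling `invalidate` after a write keeps readers from seeing the old value until expiry.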

    Sharding (Data Distribution)

    • Data sharding breaks down large datasets into smaller, manageable chunks, improving performance.
    • Sharding can be used for efficiency in distributed systems.
    • The concept is explained with an example (using users by last names).
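The last-name example can be sketched as a range-based router that maps each user to a shard by the first letter of their last name. The four-shard split and the boundary letters are assumptions chosen for illustration.

```python
import bisect

# First letter at which shards 1, 2 and 3 begin (assumed split):
# shard 0: A-F, shard 1: G-M, shard 2: N-S, shard 3: T-Z
SHARD_START_LETTERS = ["G", "N", "T"]


def shard_for(last_name: str) -> int:
    """Return the index of the shard that stores users with this last name."""
    initial = last_name.strip().upper()[:1]
    # bisect_right counts how many shard boundaries are <= the initial.
    return bisect.bisect_right(SHARD_START_LETTERS, initial)
```

For example, `shard_for("Garcia")` and `shard_for("Miller")` both route to shard 1, so each query touches only the shard that can contain the user.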

    Risk Factors (Sharding-Risks)

    • Sharding makes data distribution harder to maintain and manage.
    • Risks include uneven data distribution (for example, from a poorly chosen sharding key), more complex transactions and joins, and substantial maintenance and operational overhead, such as more complex backups and recoveries.
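The uneven-distribution risk can be demonstrated by counting how many records land on each shard under two different keys. This sketch compares a naive first-letter split with a hash of the full name; the sample names and function names are made up, and the first-letter function assumes names that start with A to Z.

```python
import hashlib
from collections import Counter
from typing import Callable, List


def shard_by_initial(last_name: str, num_shards: int = 4) -> int:
    """Split the alphabet into equal letter ranges (prone to skew)."""
    index = ord(last_name[0].upper()) - ord("A")
    index = min(max(index, 0), 25)  # clamp anything outside A-Z
    return index * num_shards // 26


def shard_by_hash(last_name: str, num_shards: int = 4) -> int:
    """Hash the whole key so that similar names are scattered across shards."""
    digest = hashlib.sha256(last_name.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribution(names: List[str], shard_fn: Callable[[str, int], int],
                 num_shards: int = 4) -> List[int]:
    """Number of names stored on each shard."""
    counts = Counter(shard_fn(name, num_shards) for name in names)
    return [counts.get(shard, 0) for shard in range(num_shards)]


# A sample where one initial dominates, as often happens with real names.
NAMES = ["Smith", "Sanchez", "Singh", "Suzuki", "Schmidt",
         "Silva", "Stewart", "Scott", "Adams", "Nguyen"]
```

With this sample, the first-letter split puts nine of the ten names on one shard. Hashing tends to spread keys more evenly, at the cost of losing efficient range queries such as "all last names starting with S".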

    Read Replicas (Distribution and Scaling)

    • Horizontal scaling strategies are outlined for read-heavy workloads, improving overall system performance.
    • Asynchronous replication is also emphasized: the primary does not wait for replicas, so writes stay fast while replicas catch up shortly afterwards.
    • Diagrams showcase how read replicas improve system scaling.
    • Google Search is used as an example of applying data architecture methods to data retrieval and handling.
    • The steps of search engine operation (crawling, indexing, and query processing) are detailed, giving insight into the data management strategy employed by large search engines.
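The read-replica idea from this section can be sketched as a toy store in which writes go to a primary and reads are spread across replicas that are updated asynchronously. The class and method names are hypothetical, and replication is triggered manually here to make the lag visible.

```python
import itertools
from collections import deque
from typing import Any, Deque, Dict, List, Optional, Tuple


class ReplicatedStore:
    """Primary plus read replicas with asynchronous (deferred) replication."""

    def __init__(self, num_replicas: int = 2):
        self.primary: Dict[str, Any] = {}
        self.replicas: List[Dict[str, Any]] = [{} for _ in range(num_replicas)]
        # One pending-change queue per replica stands in for the replication log.
        self._pending: List[Deque[Tuple[str, Any]]] = [deque() for _ in range(num_replicas)]
        self._round_robin = itertools.cycle(range(num_replicas))

    def write(self, key: str, value: Any) -> None:
        """Writes hit the primary and return without waiting for replicas."""
        self.primary[key] = value
        for queue in self._pending:
            queue.append((key, value))

    def read(self, key: str) -> Optional[Any]:
        """Reads are load-balanced across replicas and may be stale."""
        return self.replicas[next(self._round_robin)].get(key)

    def read_from_primary(self, key: str) -> Optional[Any]:
        """Use when a caller must see its own latest write."""
        return self.primary.get(key)

    def replicate(self) -> None:
        """Apply all pending changes; in a real system this runs in the background."""
        for replica, queue in zip(self.replicas, self._pending):
            while queue:
                key, value = queue.popleft()
                replica[key] = value
```

A read issued between `write` and `replicate` returns the old value, which is the eventual consistency trade-off that asynchronous replication accepts in exchange for fast writes and scalable reads.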

    Description

    This quiz provides a foundational understanding of data architecture, covering the lifecycle of data management from collection to distribution. It explains the significance of data in modern contexts and explores key concepts like data pipelines, cloud storage, and more. Perfect for those looking to grasp the essentials of managing information infrastructure.
