Introduction to Data Architecture

Questions and Answers

What is meant by data architecture?

A type of data storage solution.
The process of collecting data manually.
The structure that governs the lifecycle of data management. (correct)
The art of designing buildings and software.

What does the term 'immutable' refer to in the context of object storage?

A type of transient data storage.
Data that is frequently updated.
Data that cannot be changed once stored. (correct)
Data that can be modified after it is created.

Which storage class is described as best for data that is accessed frequently?

Archive
Nearline
Standard (correct)
Coldline

What is the function of buckets in cloud storage?

To act as containers for storing objects. (C)

What is the significance of legislative factors like GDPR in data management?

They influence the lifecycle processes of data. (B)

Which statement best describes latency in data systems?

The delay between a request for data and the corresponding response. (D)

What is one of the main reasons relational databases are considered reliable?

They guarantee certain properties during transactions. (B)

In comparing AWS and GCP, what is a similarity between their data lifecycle processes?

They are based on the same underlying principles of cloud computing. (D)

What does the term 'eventual consistency' in BASE refer to?

Data can be inconsistent for a period but will become consistent later. (A)

What is the expected access frequency for data stored in the Archive storage class?

Once per 365 days or less. (B)

Why might ACID properties be considered overly pessimistic for some use cases?

They can limit scalability and flexibility. (B)

Which of the following is a challenge associated with sharding?

Increased complexity in backups and recoveries. (D)

What is the primary purpose of a cache in data storage?

To temporarily store frequently accessed data for quick retrieval. (B)

What does the concept of 'Isolation' in ACID entail?

Transactions do not interfere with each other, appearing to run in order. (B)

Which aspect of the relational database model is primarily supported by the use of SQL?

Defining and manipulating structured data. (D)

What is a potential drawback of choosing a poor sharding key?

Uneven distribution of data across shards. (C)

Flashcards

Data

Any collection of discrete or continuous values that convey information.

Data Architecture

A blueprint for managing data throughout its entire lifecycle, from collection to consumption.

Cloud Storage

A service for storing objects (files).

Data Lifecycle

The stages a piece of data goes through, from creation to archival.

Storage Classes

Different levels of storage with varying costs and access speeds.

Throughput

The rate at which data is processed.

Latency

The time it takes for data to be processed or accessed.

Managed Services Provider (MSP)

A company that handles IT infrastructure.

Relational Databases

A database model that stores data in tables with relationships between them, using Structured Query Language (SQL) for querying.

Non-relational Databases (NoSQL)

A database model designed for flexibility and scalability, often used for unstructured or semi-structured data, with different data structures and query languages.

ACID Properties

A set of database transaction properties ensuring reliability and consistency: Atomicity, Consistency, Isolation, Durability.

BASE Properties

Relaxed properties in NoSQL databases, prioritizing availability and scalability over strict ACID guarantees: Basic Availability, Soft-state, Eventual Consistency.

Sharding

A technique for distributing data across multiple servers (shards) to improve scalability and performance of a database.

Sharding Risks

Potential issues with sharding, including uneven data distribution, complex transactions and joins, and increased operational overhead.

Data Consistency

The degree to which data across different parts of a database remains consistent and synchronized.

Operational Complexity

Increased complexity in managing and maintaining a sharded database, due to factors like backup, recovery, and monitoring.

Study Notes

Introduction to Data Architecture

This lesson is a basic introduction to data architecture: how data is managed throughout its lifecycle.
Data is any collection of discrete or continuous values that convey information.
A famous quote describes data as the new oil, highlighting its increasing importance.

What is Architecture?

Architecture is the art of designing and constructing something, like buildings or software.

What is Data Architecture?

It describes the entire life cycle of data management, from collection to transformation, distribution, and consumption.
It establishes a blueprint for how data flows within systems.
Key components include data pipelines, real-time analytics, cloud storage, cloud computing, APIs, Kubernetes, and data streaming.

Data at Rest (Storage Services and Databases)

This section focuses on data storage services and databases.
Key components and characteristics described include Object File Stores, Random Access File Systems, data consistency, and different database types (RDBMS, NoSQL, and document stores).

Filesystems and Filestores

Object File Stores focus on durability, availability, capacity, and cost.
Random Access File Systems are characterized by access times and throughput.
Data consistency models like ACID and BASE are mentioned in this context.
RDBMS, NoSQL, and document stores are different database types covered.
Scaling the data tier is a crucial aspect of data architecture.

Cloud Storage Systems (Object Stores)

Cloud Storage (AWS S3, Azure Blob Storage, Google Cloud Storage) is a service for storing objects.
Objects are immutable pieces of data stored as files in various formats.
Objects are stored in containers called buckets.
Buckets are associated with projects and can be grouped under organizations.
Buckets may have lifecycles.

Data Lifecycles

Managing static enterprise data involves considering data lifespans and lifecycles.
Time to Live (TTL) policies for objects and policies for retaining noncurrent versions of objects are considered.
Policies for downgrading storage classes of objects are included.

Object Lifecycle Management

Not all data is created equal; some data has a limited lifespan.
Time to Live (TTL) policies exist for objects.
Policies are made for maintaining non-current versions of objects.
Policies are also needed for managing costs by moving objects to cheaper storage classes during their lifecycles.
Legislation like GDPR also impacts data lifecycles.

A Simple Example

Lifecycle management configurations can be assigned to object stores.
These configurations contain rules applied to objects within the bucket.
When an object meets a rule condition, automatic actions are triggered on the object.

Policy Examples

Storage classes can be downgraded to colder tiers (e.g., Coldline Storage) after a certain duration, such as 365 days.
Objects can be deleted based on their creation date, e.g., objects created before January 1, 2013.
With versioning enabled, a policy can keep only the most recent versions of each object in a bucket.
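
The three policy examples above can be sketched as a small rule evaluator. This is a minimal illustration, not any provider's API: the `StoredObject` and `Rule` classes, their field names, and the choice to let a delete rule win over a storage-class change are all assumptions made for the example. The cutoff of three versions is likewise hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class StoredObject:
    name: str
    created: date
    storage_class: str = "STANDARD"
    newer_versions: int = 0  # 0 means this is the live (current) version


@dataclass
class Rule:
    action: str                              # "SetStorageClass" or "Delete"
    target_class: Optional[str] = None
    min_age_days: Optional[int] = None
    created_before: Optional[date] = None
    min_newer_versions: Optional[int] = None

    def matches(self, obj: StoredObject, today: date) -> bool:
        # Every condition that is set must hold for the rule to fire.
        if self.min_age_days is not None and (today - obj.created).days < self.min_age_days:
            return False
        if self.created_before is not None and obj.created >= self.created_before:
            return False
        if self.min_newer_versions is not None and obj.newer_versions < self.min_newer_versions:
            return False
        return True


def apply_rules(obj: StoredObject, rules: list, today: date) -> Optional[str]:
    """Return the action taken: 'Delete', the new storage class, or None."""
    # Assumption: a matching delete rule takes precedence over a class change.
    for rule in sorted(rules, key=lambda r: r.action != "Delete"):
        if rule.matches(obj, today):
            if rule.action == "Delete":
                return "Delete"
            obj.storage_class = rule.target_class
            return rule.target_class
    return None


RULES = [
    Rule("SetStorageClass", target_class="COLDLINE", min_age_days=365),
    Rule("Delete", created_before=date(2013, 1, 1)),
    Rule("Delete", min_newer_versions=3),  # keeps the 3 most recent versions
]
```

Calling `apply_rules` once per object per day mimics the way a lifecycle configuration is evaluated against every object in a bucket.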

Examples of Lifecycles (GCP)

Different storage classes (Standard, Nearline, Coldline, Archive) are available, each optimized for different access frequencies.
Storage class appropriateness is determined by how often data needs to be accessed.
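
As a rough sketch of that decision, the function below maps an expected access interval to a class name. The 365-day boundary for Archive comes from the quiz above; the 30-day and 90-day boundaries for Nearline and Coldline are assumptions based on commonly cited guidance and should be checked against the provider's current documentation. The function name is hypothetical.

```python
def choose_storage_class(days_between_accesses: float) -> str:
    """Pick a storage class from how often the data is expected to be read."""
    if days_between_accesses < 30:    # assumed boundary: more than monthly
        return "STANDARD"
    if days_between_accesses < 90:    # assumed boundary: about monthly
        return "NEARLINE"
    if days_between_accesses < 365:   # about quarterly
        return "COLDLINE"
    return "ARCHIVE"                  # once per 365 days or less
```

A real choice would also weigh retrieval fees and minimum storage durations, which this sketch ignores.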

Examples of Lifecycles (AWS)

AWS and GCP offer comparable storage classes for different access frequencies.
AWS offers classes such as S3 Standard (frequent access) and S3 Glacier (archival).
AWS lifecycle policies are compared with GCP data lifecycles.

Random Access File Systems

Random Access File Systems (mounted network storage) extend file access beyond the local file system of a single machine's operating system.
Managed services are discussed in this context with examples from AWS, Google, and Azure.

Filestores

Extending standard OS file systems may require managed file services.
Examples: AWS EFS, Google Filestore, and Azure File Storage.

Performance Options

Different performance options for file storage are highlighted, along with how to measure performance.
IOPS (Input/Output Operations Per Second) and throughput data rates are mentioned with different storage types (BASIC_HDD, BASIC_SSD, HIGH_SCALE_SSD).

Managed Services

Managed services offer fully managed infrastructure for predictable performance.
Typical peak figures cited for these services are 480K IOPS and 16 GB/s.
Use cases involving application migration, media rendering, electronic design automation (EDA), data analytics, and genomics processing are considered.

Access Times and Throughput

Key performance indicators like 480K IOPS (Input/Output Operations Per Second) and 16 GB/s (high-scale SSD) are mentioned.
Low latency is also a key performance indicator.
Storage suitability as a cross-regional file system engine and standard cloud security are addressed.
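
IOPS and throughput are linked by the size of each I/O operation. The helper functions below are a hypothetical illustration of that arithmetic; the 4 KiB operation size is an assumption, not a figure from the lesson. Peak IOPS and peak throughput are usually measured under different workloads, so the two headline numbers need not be reachable at the same time.

```python
def throughput_bytes_per_s(iops: float, io_size_bytes: float) -> float:
    """Throughput achieved when every operation moves io_size_bytes."""
    return iops * io_size_bytes


def io_size_bytes(throughput: float, iops: float) -> float:
    """Average operation size implied by a throughput and an IOPS figure."""
    return throughput / iops


# 480K IOPS with small 4 KiB operations is roughly 1.97 GB/s, far below 16 GB/s.
small_io = throughput_bytes_per_s(480_000, 4096)

# Hitting 16 GB/s at 480K IOPS would need operations of about 33 KB each.
implied_size = io_size_bytes(16e9, 480_000)
```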

Beyond Relational Data (Structured and Queryable Data)

Data in relational formats is structured and queryable through languages like SQL.
Structured data is organized in tables with rows and columns, facilitating data querying.
An example table of employee data illustrates the relational table format.
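
A minimal sketch of such a table, using Python's built-in `sqlite3` module. The column names and the sample rows are invented for illustration; the lesson's own employee table is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each row is one employee; each column holds one typed attribute.
conn.execute(
    """
    CREATE TABLE employees (
        id         INTEGER PRIMARY KEY,
        name       TEXT    NOT NULL,
        department TEXT    NOT NULL,
        salary     INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [
        ("Ada", "Engineering", 120000),
        ("Grace", "Engineering", 130000),
        ("Linus", "Support", 80000),
    ],
)

# Because the data is structured, SQL can aggregate it declaratively.
rows = conn.execute(
    "SELECT department, COUNT(*), AVG(salary) "
    "FROM employees GROUP BY department ORDER BY department"
).fetchall()
```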

Relational Data (Why Relational Matters)

Relational data models are widely used and are queried and manipulated with SQL.
Many other data-intensive systems offer query languages inspired by SQL.

Non-relational Technologies (NoSQL)

NoSQL (Not Only SQL) databases are non-relational and prioritize distributed horizontal scalability over ACID properties.
Common NoSQL technologies like key-value storage, column-family databases, document stores, and graph databases are highlighted.

ACID vs. BASE

ACID (Atomicity, Consistency, Isolation, Durability) describes properties required for reliable database transactions.
BASE (Basic Availability, Soft state, Eventual consistency) describes alternative properties emphasizing scale and resilience over immediate consistency.
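
Atomicity, the "A" in ACID, can be illustrated with `sqlite3`. In this sketch the `accounts` table, the `transfer` function, and the rule that balances may not go negative are assumptions made for the example. The credit is applied first on purpose, so that a failing debit has something to undo.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src cannot cover the amount.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

When the debit fails, the rollback also removes the credit that had already been applied, so no money is created. A BASE-style system would typically accept the write and reconcile later instead.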

Scaling the Data Tier (Scaling Storage)

Database server resizing is discussed through two approaches:
- Unmanaged: Manual scaling of database servers.
- Managed: Using services that automatically resize servers, like AWS RDS or Google Cloud SQL.

Read Replicas and Caches

Read replicas are used to offload read traffic, boosting read-heavy workloads and efficiency.
Caching solutions like Memcached or Redis (and managed caches like ElastiCache) improve database performance by keeping frequently accessed data in memory, so repeated reads avoid a trip to the database.
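
One common way to put a cache in front of a database is the cache-aside pattern, sketched below with an in-process dictionary standing in for Memcached or Redis. The `CacheAside` class, its time-to-live default, and the injectable `clock` parameter are assumptions made for this illustration.

```python
import time


class CacheAside:
    """Read-through helper: check the cache first, fall back to the database."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db      # function that queries the real database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: load from the database and remember the result.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read sees fresh data."""
        self._entries.pop(key, None)
```

The time-to-live bounds how stale a cached value can become when a writer forgets to invalidate it.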

Sharding (Data Distribution)

Data sharding breaks down large datasets into smaller, manageable chunks, improving performance.
Sharding can be used for efficiency in distributed systems.
The concept is explained with an example (using users by last names).
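
A minimal sketch of sharding users by last name. The three letter ranges, the shard names, and the `ShardedUserStore` class are hypothetical; the lesson's own example may split the alphabet differently.

```python
# Assumed layout: three shards, each owning a range of last-name initials.
SHARD_RANGES = [("A", "H", "shard-0"), ("I", "P", "shard-1"), ("Q", "Z", "shard-2")]


def shard_for_last_name(last_name: str) -> str:
    initial = last_name.strip()[:1].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= initial <= high:
            return shard
    return "shard-2"  # assumed fallback for names that do not start with A-Z


class ShardedUserStore:
    def __init__(self):
        self.shards = {shard: [] for _, _, shard in SHARD_RANGES}

    def add(self, first: str, last: str) -> None:
        self.shards[shard_for_last_name(last)].append((first, last))

    def find_by_last_name(self, last: str) -> list:
        # Last name is the sharding key, so only one shard is consulted.
        shard = self.shards[shard_for_last_name(last)]
        return [user for user in shard if user[1] == last]

    def find_by_first_name(self, first: str) -> list:
        # Not the sharding key: every shard must be scanned (scatter-gather).
        return [u for users in self.shards.values() for u in users if u[0] == first]
```

Queries that include the sharding key stay cheap, while any other lookup has to visit every shard.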

Risk Factors (Sharding-Risks)

Difficulties in maintaining and managing data distribution in data sharding are mentioned.
Factors like uneven data distribution, intricate transactional and join complexity, and substantial maintenance and operational overhead are analyzed.
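
Uneven distribution can be measured. The sketch below compares a range-based key (first letter of the last name) with a hash-based key on a deliberately lopsided list of names. The function names, the four-shard setup, and the sample names are assumptions for illustration. Hash-based keys tend to spread data more evenly, at the price of losing cheap range scans.

```python
import hashlib
from collections import Counter


def range_shard(last_name: str, num_shards: int = 4) -> int:
    """Split A-Z into equal-width letter ranges."""
    idx = ord(last_name[0].upper()) - ord("A")
    idx = min(max(idx, 0), 25)  # clamp non A-Z initials into range
    return idx * num_shards // 26


def hash_shard(last_name: str, num_shards: int = 4) -> int:
    """Use a stable hash so the same name always lands on the same shard."""
    digest = hashlib.sha256(last_name.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def skew(names: list, shard_fn, num_shards: int = 4) -> float:
    """Size of the largest shard relative to a perfectly even split (1.0 is ideal)."""
    counts = Counter(shard_fn(name, num_shards) for name in names)
    return max(counts.values()) / (len(names) / num_shards)


NAMES = ["Smith", "Sanders", "Stewart", "Scott",
         "Sullivan", "Simmons", "Adams", "Young"]

range_skew = skew(NAMES, range_shard)  # six of eight names start with "S"
hash_skew = skew(NAMES, hash_shard)
```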

Read Replicas (Distribution and Scaling)

Horizontal scaling strategies are outlined for read-heavy workloads, improving overall system performance.
Asynchronous replication is also emphasized: the primary does not wait for replicas to apply a write, which keeps writes fast but lets replicas briefly lag behind.
Diagrams showcase how read replicas improve system scaling.
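
A toy simulation of asynchronous replication, assuming one primary and in-process replicas. The `Primary` and `Replica` classes and the explicit `catch_up` step are hypothetical stand-ins for a real replication stream.

```python
from collections import deque


class Primary:
    def __init__(self):
        self.data = {}
        self.replicas = []

    def write(self, key, value):
        self.data[key] = value                    # acknowledged immediately
        for replica in self.replicas:
            replica.pending.append((key, value))  # shipped later, not awaited


class Replica:
    def __init__(self, primary: Primary):
        self.data = {}
        self.pending = deque()                    # changes not yet applied
        primary.replicas.append(self)

    def read(self, key):
        return self.data.get(key)                 # may be stale

    def catch_up(self):
        """Apply every queued change, in the order the primary made them."""
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value
```

A read served by a replica before `catch_up` runs returns old data, which is the eventual consistency described under BASE.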

Real-World Example (Google Search)

Google search demonstrates how to apply data architecture methods for data retrieval and handling.
The steps of search engine operation are detailed, including crawling, indexing, and query processing, which gives insight into the data management strategy employed by large search engines.
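
The crawl, index, and query steps can be sketched with a toy inverted index. This is a simplified illustration and not a description of how Google Search is implemented: the `PAGES` dictionary stands in for crawled content, the URLs are placeholders, and there is no ranking.

```python
import re
from collections import defaultdict

# Stand-in for the output of a crawler: URL -> page text (placeholder data).
PAGES = {
    "https://example.com/a": "Data architecture and cloud storage",
    "https://example.com/b": "Cloud storage classes",
    "https://example.com/c": "Relational data and SQL",
}


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict) -> dict:
    """Indexing: map each term to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index: dict, query: str) -> list:
    """Query processing: return URLs containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


INDEX = build_index(PAGES)
```

Building the index once makes each query a set lookup instead of a scan over every page.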
