AWS Modern Data Architecture: Storage Types

Questions and Answers

Which of the following is a key objective when defining storage types in a modern data architecture?

  • To choose storage solutions that match specific storage needs. (correct)
  • To use only traditional on-premises storage solutions.
  • To limit the number of storage options to simplify management.
  • To select storage solutions based solely on cost considerations.

In the simplified iterative data pipeline, what is the step that immediately follows data ingestion?

  • Data Storage (correct)
  • Data Processing
  • Data Analysis
  • Data Visualization

Which of the following AWS services is primarily used for big data processing?

  • Amazon Redshift
  • Amazon DynamoDB
  • Amazon SageMaker
  • Amazon EMR (correct)

What is a key characteristic that distinguishes object storage from block storage?

  • Object storage uses a unique identifier for each object. (correct)

Which type of storage is MOST suitable for content repositories and media stores, given its high scalability?

  • File storage (correct)

What is a primary characteristic of a data lake regarding schema implementation?

  • Schema is written at the time of analysis (schema on read). (correct)

Which user group is MOST likely to leverage data lakes for discovering trends and creating predictive models?

  • Data scientists working with machine learning. (correct)

What primary benefit does a data lake provide in terms of data management?

  • Offers a centralized repository for structured and unstructured data. (correct)

What is the role of Amazon S3 in the context of data lakes?

  • It is the foundation upon which data lakes are built. (correct)

Which capability does AWS Lake Formation provide related to data management?

  • Fully managed service to build, secure, and manage data lakes. (correct)

What benefit does Amazon S3 provide regarding data integrity in data lakes?

  • Promotes data integrity through strong data consistency and multipart uploads. (correct)

What is a key advantage of using governed tables with AWS Lake Formation?

  • It allows concurrent data inserts and edits across tables. (correct)

What is a primary characteristic of data stored in a data warehouse?

  • Structured, curated, or transformed data. (correct)

Why would a company choose to use Amazon Redshift?

  • To provide a cloud-based data warehouse solution. (correct)

Which storage characteristic is typical for infrequently accessed data within a data warehouse?

  • Cheap storage. (correct)

How does Amazon Redshift Spectrum enhance data warehousing capabilities?

  • By enabling SQL queries that combine data from data lakes and data warehouses. (correct)

In the architecture of Amazon Redshift, what is the role of nodes?

  • Computing resources (correct)

Which factor is MOST important to consider when choosing a purpose-built database?

  • How well the database supports your application architecture. (correct)

Why is considering the data shape important when selecting a database?

  • It affects how the data will be accessed and updated. (correct)

Which database type is BEST suited for social networking applications that require fraud detection?

  • Graph database (correct)

Which AWS service is commonly used for managing content, catalogs, and user profiles in a document database setting?

  • Amazon DocumentDB (correct)

What is a key consideration when evaluating application workload to determine the best database choice?

  • Whether the workload is transactional or analytical. (correct)

What is a benefit of using resource-based policies when managing access to an AWS data lake?

  • Managing access through resource-based policies. (correct)

In the context of data lake security, what is the purpose of using tags?

  • To categorize and manage data and access permissions. (correct)

Which type of encryption is typically used in AWS-built data lakes for enhanced security?

  • Client-side and server-side encryption. (correct)

What benefit does using Amazon CloudWatch, AWS CloudTrail, and AWS Security Hub provide for Amazon Redshift?

  • Monitoring and alerting capabilities. (correct)

Which aspect of security is handled distinctly from service security in Amazon Redshift?

  • Database security. (correct)

A company needs to create a centralized repository, accessible by both data scientists and business analysts, where data is analyzed at the time it is accessed. Which type of storage is MOST appropriate?

  • Data lake (correct)

A company uses exclusively highly curated datasets and relational data for all of its data processing needs. What type of data storage would be MOST appropriate?

  • Data warehouse (correct)

A company uses computing resources called nodes for their data storage. Which service are they using?

  • Amazon Redshift (correct)

Which of the following database types is best suited for high-traffic web applications, ecommerce systems, and gaming applications?

  • Key-value (correct)

Your application requires storage that offers dedicated, low-latency access. Which type of cloud storage is the MOST appropriate?

  • Block storage (correct)

Your company needs a fully managed service to build, secure, and manage data lakes. What AWS service should they use?

  • AWS Lake Formation (correct)

You need to write SQL queries that combine data from both your data lake and your data warehouse. Which service makes this possible?

  • Amazon Redshift Spectrum (correct)

You want to secure your AWS data lake by categorizing and managing data and access permissions. What should you use?

  • Tags (correct)

Your team uses both client-side and server-side encryption to secure their data. What type of storage are they MOST likely using?

  • Data lakes (correct)

A data engineer needs to design an infrastructure where they need to be able to query data directly from files in the company's data lake, which is built on Amazon S3. Which service feature would enable this capability?

  • Amazon Redshift Spectrum (correct)

What is the MOST important advantage of a data lake compared to a data warehouse, in terms of the variety of data it can store?

  • A data lake can store any type of data in its raw format, while a data warehouse requires structured data. (correct)

Flashcards

Data Ingestion

The process of bringing data into a system for storage and processing.

Data Lake

A centralized repository that allows you to store both structured and unstructured data at scale.

AWS Lake Formation

A service by AWS to build, secure and manage data lakes.

Data Warehouse

A storage architecture that stores structured, curated, or transformed data in three tiers.

Amazon Redshift

A cloud-based data warehouse solution offered by AWS.

Purpose-built database

A database chosen to fit the specific needs of an application, such as its workload and data shape.

Object Storage

A type of cloud storage optimized for unstructured data that gives each object a unique identifier. Amazon S3 is an example.

File Storage

A type of cloud storage where data is stored as files.

Block Storage

A type of cloud storage that offers dedicated, low-latency, scalable, high-performance storage.

Amazon S3

Amazon Simple Storage Service, the AWS object storage service. It is secure, scalable, and durable.

Data Lake Data

Nonrelational and relational data from IoT devices, websites, social media, and corporate applications.

Data Warehouse data

Transactional data from operational databases.

Schema on Read

Written at the time of analysis.

Schema on Write

Designed prior to data warehouse implementation.

Data Lake Query Results

Query results that are getting faster, using low-cost storage.

Data Warehouse Query Results

Query results that are the fastest using higher cost storage.

Data lake users

Data scientists, data developers, and business analysts (using curated data).

Data warehouse users

Business analysts.

Data warehouse analytics

Batch reporting, business intelligence (BI), and visualizations.

Data Lake analytics

ML, predictive analytics, and data discovery and profiling.

AWS Lake Formation

A fully managed service that provides the ability to build, secure, and manage data lakes.

Amazon Redshift Spectrum

A feature of Amazon Redshift that you can use to write SQL queries that combine data from both your data lake and your data warehouse.

Amazon Redshift Integrations

Amazon CloudWatch, AWS CloudTrail, and AWS Security Hub, used for monitoring and alerting.

Securing AWS Data Lake

Benefit from the built-in security of Amazon S3, Manage access, Choose encryption options, Use tags, Use Lake Formation.

Study Notes

Module Objectives

  • Define storage types in modern data architecture.
  • Distinguish between data storage types.
  • Select data storage options matching storage needs.
  • Implement secure storage practices for cloud-based data.

Simplified Iterative Data Pipeline

  • The stages in a data pipeline are Ingestion, Storage, Processing, and Analysis & Visualization.
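
The four stages can be sketched as plain Python functions. This is only an illustration: the function names, the comma-separated input format, and the in-memory list standing in for a storage layer are all assumptions, not AWS APIs. The loop reruns the later stages after each new batch, which is what makes the pipeline iterative.

```python
def ingest(raw_lines):
    """Ingestion: parse raw 'device,reading' lines into records."""
    records = []
    for line in raw_lines:
        device, reading = line.split(",")
        records.append({"device": device, "reading": float(reading)})
    return records

def store(records, storage):
    """Storage: append records to a storage layer (a plain list here)."""
    storage.extend(records)
    return storage

def process(storage):
    """Processing: compute the average reading per device."""
    readings = {}
    for record in storage:
        readings.setdefault(record["device"], []).append(record["reading"])
    return {device: sum(vals) / len(vals) for device, vals in readings.items()}

def analyze(averages):
    """Analysis and visualization: render a simple text report."""
    return [f"{device}: {avg:.1f}" for device, avg in sorted(averages.items())]

storage = []
report = []
# Each new batch flows through every stage again.
for batch in (["a,1.0", "b,4.0"], ["a,3.0"]):
    store(ingest(batch), storage)
    report = analyze(process(storage))
```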

Storage in the AWS Modern Data Architecture

  • Amazon EMR is used for big data processing.
  • Amazon S3 can be used for data warehousing, machine learning, and log analytics.
  • Amazon Redshift is for data warehousing.
  • Amazon SageMaker can be used for machine learning.
  • Amazon OpenSearch Service assists with log analytics.
  • DynamoDB is for non-relational databases.
  • Aurora is for relational databases.

Types of Cloud Storage

Block Storage

  • Offers dedicated, low-latency storage.
  • It is scalable with high performance.
  • Similar to local direct attached storage or a SAN.
  • Amazon Elastic Block Store (Amazon EBS) is an example.

File Storage

  • Stores data as files.
  • Highly scalable.
  • Suited for content repositories and media stores.
  • Amazon Elastic File System (Amazon EFS) is an example.

Object Storage

  • Stores unstructured, semistructured, or structured data.
  • It is highly scalable.
  • Each object has a unique identifier.
  • Lower cost than traditional storage.
  • Amazon Simple Storage Service (Amazon S3) is an example.
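
To make the difference in addressing concrete, here is a toy comparison of object and block storage. Both classes are hypothetical in-memory models, not AWS APIs. The UUID is a stand-in for the unique identifier; in Amazon S3 an object is identified by its bucket and key.

```python
import uuid

class ObjectStore:
    """Toy object store: a flat namespace of unique key -> (data, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, data, metadata):
        key = str(uuid.uuid4())  # stand-in for a unique object identifier
        self._objects[key] = (data, dict(metadata))
        return key

    def get(self, key):
        return self._objects[key]

class BlockDevice:
    """Toy block device: fixed-size blocks addressed by position, no metadata."""

    def __init__(self, block_size, block_count):
        self.block_size = block_size
        self._blocks = [bytes(block_size) for _ in range(block_count)]

    def write_block(self, index, data):
        if len(data) != self.block_size:
            raise ValueError("data must fill exactly one block")
        self._blocks[index] = data

    def read_block(self, index):
        return self._blocks[index]

objects = ObjectStore()
key = objects.put(b"hello", {"content-type": "text/plain"})
data, metadata = objects.get(key)

disk = BlockDevice(block_size=4, block_count=2)
disk.write_block(1, b"abcd")
```

The object store returns whole objects together with their metadata, while the block device only knows about numbered, fixed-size blocks and leaves any higher-level structure to a file system or database on top of it.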

Comparing Data Lakes and Data Warehouses

Data

  • Data warehouses use relational data from transactional systems, operational databases, and line of business applications.
  • Data lakes use nonrelational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications.

Schema

  • Data warehouses use "schema on write," designed prior to data warehouse implementation.
  • Data lakes use "schema on read," written at the time of analysis.
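
A minimal sketch of the two approaches, assuming a hypothetical two-field order schema. With schema on write, a record is validated before it is stored. With schema on read, raw text is stored as-is and structure is applied only when the data is analyzed.

```python
import json

# Hypothetical schema used only for this illustration.
SCHEMA = {"order_id": int, "amount": float}

def write_with_schema(table, record):
    """Schema on write: reject a record that does not match before storing it."""
    if set(record) != set(SCHEMA):
        raise ValueError("record fields do not match the schema")
    for field, field_type in SCHEMA.items():
        if not isinstance(record[field], field_type):
            raise TypeError(f"{field} must be {field_type.__name__}")
    table.append(record)

def read_with_schema(raw_lines):
    """Schema on read: parse and coerce raw JSON lines at analysis time."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        rows.append({"order_id": int(doc["order_id"]),
                     "amount": float(doc["amount"])})
    return rows

warehouse = []
write_with_schema(warehouse, {"order_id": 1, "amount": 9.5})

# The lake keeps the raw line, including a field the schema does not mention.
lake = ['{"order_id": "2", "amount": "3.25", "note": "gift"}']
rows = read_with_schema(lake)
```

The same raw line would be rejected by `write_with_schema` because its values are strings and it carries an extra field, which is the trade-off the two bullets above describe.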

Price and Performance

  • Data warehouses provide the fastest query results using higher cost storage.
  • Data lakes provide query results that are getting faster, using low-cost storage.

Data Quality

  • Data warehouses use highly curated data that serves as the central version of the truth.
  • Data lakes can use any data, curated or not (e.g., raw data).

Users

  • Data warehouses are for business analysts.
  • Data lakes are for data scientists, data developers, and business analysts (who use curated data).

Analytics

  • Data warehouses support batch reporting, business intelligence (BI), and visualizations.
  • Data lakes support ML, predictive analytics, and data discovery and profiling.

Data Lakes

  • Provide a centralized repository.
  • Store both structured and unstructured data.
  • Catalog and index data for analysis without moving it.
  • Store, secure, and protect data at virtually unlimited scale.
  • Support in-place transformation and querying of data assets.
  • On AWS, data lakes are built using Amazon S3.

Amazon Simple Storage Service (Amazon S3)

  • Secure, scalable, and durable.
  • Provides low-cost storage.
  • Stores structured and unstructured data.
  • Supports in-place transformation and querying.
  • Offers multiple object storage classes.
  • Uses a strong data consistency model.
  • Supports multipart upload.
  • Serves as the basis for data lakes on AWS.
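
The idea behind multipart upload can be sketched without calling AWS. The helper names below are hypothetical and the part size is tiny for readability. In Amazon S3 the corresponding operations are CreateMultipartUpload, UploadPart, and CompleteMultipartUpload, and every part except the last must be at least 5 MiB.

```python
import hashlib

def split_into_parts(payload, part_size):
    """Return (part_number, chunk) pairs; part numbers start at 1."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [(offset // part_size + 1, payload[offset:offset + part_size])
            for offset in range(0, len(payload), part_size)]

def upload_parts(parts):
    """Pretend to upload each part independently, recording a checksum per part."""
    return {number: (hashlib.sha256(chunk).hexdigest(), chunk)
            for number, chunk in parts}

def complete_upload(uploaded):
    """Assemble the object in part-number order, whatever the arrival order."""
    return b"".join(uploaded[number][1] for number in sorted(uploaded))

payload = b"abcdefghij"
parts = split_into_parts(payload, part_size=4)
# Upload in reverse to show that arrival order does not matter.
uploaded = upload_parts(reversed(parts))
assembled = complete_upload(uploaded)
```

Because parts are independent, a failed part can be retried on its own, and the per-part checksum gives a simple way to detect corruption before the object is assembled.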

AWS Lake Formation

  • Fully managed service.
  • Provides the ability to build, secure, and manage data lakes.
  • Automates elements of data lake creation.
  • Augments the AWS Identity and Access Management (IAM) permissions model.
  • Supports atomic, consistent, isolated, and durable (ACID) transactions using governed tables.
  • Integrates with AWS analytics and ML services.

Key Takeaways: Data Lake Storage

  • Data lakes store data as-is, without needing structured data to begin running analytics.
  • Amazon S3 promotes data integrity through strong data consistency and multipart uploads.
  • Using Lake Formation, governed tables can enable concurrent data inserts and edits across tables.

Data Warehouses

  • A centralized repository is provided.
  • Structured and semistructured data is stored.
  • Data is tiered by access pattern: frequently accessed data in fast storage and infrequently accessed data in cheap storage.
  • Might contain multiple databases organized into tables and columns.
  • Analytics processing is separate from transactional databases.
  • Amazon Redshift is an example.

Amazon Redshift

  • Provides a cloud-based data warehouse solution.
  • Fully managed service.
  • Supports near real-time data analysis.
  • Uses columnar storage.
  • Offers multiple node types for tailored solutions, including DC2, DS2, and RA3.
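
Columnar storage is easiest to see by pivoting a few rows. The sketch below is a generic illustration with made-up sample data, not a description of how Amazon Redshift lays out data internally.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"region": "eu", "product": "a", "sales": 10},
    {"region": "us", "product": "b", "sales": 25},
    {"region": "eu", "product": "c", "sales": 5},
]

def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_sales = sum(columns["sales"])
```

Analytical queries typically read a few columns across many rows, so keeping each column's values together reduces the data that has to be scanned.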

Key Takeaways: Data Warehouse Storage

  • Data warehouses consist of three tiers and can store structured, curated, or transformed data.
  • Amazon Redshift is a fully managed data warehouse service that uses computing resources called nodes.
  • With Redshift Spectrum, you can write SQL queries that combine data from both your data lake and your data warehouse.
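
As a rough illustration of the last point, the helper below builds the two SQL statements such a setup usually involves: an external schema that points at a data catalog database, and a query that joins an external table with a local one. The helper itself, the schema, table, and column names, and the role ARN are all hypothetical placeholders, and the statements are only assembled as strings, not run against a cluster.

```python
def spectrum_statements(external_schema, catalog_database, iam_role_arn):
    """Build illustrative Redshift SQL for querying lake and warehouse data together."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {external_schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )
    # sales_events is assumed to be an external table over files in Amazon S3;
    # public.customers is assumed to be a regular table in the warehouse.
    combined_query = (
        "SELECT c.customer_name, SUM(e.amount) AS total_amount\n"
        f"FROM {external_schema}.sales_events AS e\n"
        "JOIN public.customers AS c\n"
        "  ON c.customer_id = e.customer_id\n"
        "GROUP BY c.customer_name;"
    )
    return create_schema, combined_query

create_schema, combined_query = spectrum_statements(
    external_schema="lake",
    catalog_database="example_lake_db",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
```

String formatting is used here only to keep the example short; values from untrusted input should not be interpolated into SQL this way.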

Choosing Your Purpose-Built Database

  • Choosing the right database is the key to supporting your application architecture.
  • The database will affect what the application can handle, how it will perform, and the operations you are responsible for.
  • Consider: Application workload, data shape, performance requirements, and operations burden.

Factors in Choosing a Purpose-Built Database

Application Workload

  • Consider whether the workload is transactional or analytical.
  • Consider if your workload needs caching to improve response times.

Data shape

  • How will the data be accessed?
  • How often will the data be updated?

Performance

  • How fast does the data access need to be?
  • What is the average size of the records that are being used?
  • How will end users use the service?

Operations burden

  • How will you prepare for instance failures?
  • How will you configure backups?
  • What future upgrades might be needed?

Common Database Use Cases

Relational

  • Traditional applications, enterprise resource planning (ERP), customer relationship management (CRM), and e-commerce.
  • AWS services include Aurora, Amazon RDS, and Amazon Redshift.

Key-Value

  • High-traffic web applications, e-commerce systems, and gaming applications.
  • AWS services include DynamoDB.

Document

  • Content management, catalogs, and user profiles.
  • AWS services include Amazon DocumentDB.

Graph

  • Fraud detection, social networking, and recommendation engines.
  • AWS services include Neptune.
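
The use cases above can be captured in a small lookup table. This is just a study aid built from the lists in these notes; the dictionary layout and the `suggest_database` helper are assumptions, and a real choice would also weigh workload, performance, and operations burden.

```python
# Database types, example AWS services, and use cases as listed in the notes.
DATABASE_TYPES = {
    "relational": {
        "services": ["Aurora", "Amazon RDS", "Amazon Redshift"],
        "use_cases": ["traditional applications", "erp", "crm", "e-commerce"],
    },
    "key-value": {
        "services": ["DynamoDB"],
        "use_cases": ["high-traffic web applications", "e-commerce", "gaming"],
    },
    "document": {
        "services": ["Amazon DocumentDB"],
        "use_cases": ["content management", "catalogs", "user profiles"],
    },
    "graph": {
        "services": ["Neptune"],
        "use_cases": ["fraud detection", "social networking",
                      "recommendation engines"],
    },
}

def suggest_database(use_case):
    """Return every database type whose listed use cases include the given one."""
    needle = use_case.lower()
    return [db_type for db_type, info in DATABASE_TYPES.items()
            if needle in info["use_cases"]]

fraud = suggest_database("Fraud detection")
shop = suggest_database("e-commerce")
```

Note that e-commerce maps to both relational and key-value databases, which is why the other factors matter when a use case alone does not settle the choice.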

Key Takeaways: Purpose-Built Databases

  • The database you choose affects what the application can handle, how it performs, and the operations you are responsible for.
  • When choosing a database, consider application workload, data shape, performance, and operations burden.

Secure, Protect, and Manage Your AWS Data Lake

  • Benefit from the built-in security of Amazon S3.
  • Manage access through resource-based and user policies.
  • Choose from multiple encryption options.
  • Use tags to categorize and manage data, and to manage access permissions.
  • Use Lake Formation for centralized governance and access control.
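
As one possible illustration of combining a resource-based policy with tags, the sketch below builds an S3 bucket policy document that lets a role read only objects carrying a given tag. The bucket name, role ARN, account ID, and tag are hypothetical placeholders, and the policy is only constructed and serialized, not applied to any bucket.

```python
import json

def build_bucket_policy(bucket, role_arn, tag_key, tag_value):
    """Build a bucket policy allowing reads of objects that carry a specific tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTaggedObjectReads",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Restrict access to objects whose tag matches the given value.
                "Condition": {
                    "StringEquals": {
                        f"s3:ExistingObjectTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }

policy = build_bucket_policy(
    bucket="example-data-lake",
    role_arn="arn:aws:iam::123456789012:role/ExampleAnalystRole",
    tag_key="classification",
    tag_value="public",
)
policy_json = json.dumps(policy, indent=2)
```

The policy is attached to the resource (the bucket) rather than to a user, and the tag condition narrows it to a category of data, which mirrors the two bullets on access policies and tags.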

Security for a Data Warehouse in Amazon Redshift

  • Amazon Redshift database security is separate from that of the service itself.
  • Amazon Redshift has extra features to manage database security.
  • Due to third-party auditing, Amazon Redshift can help support applications required to meet international compliance standards.
  • Amazon Redshift integrates with Amazon CloudWatch, AWS CloudTrail, and AWS Security Hub for monitoring and alerting.

Key Takeaways: Securing Storage

  • Security for data lake storage relies on the intrinsic security features of Amazon S3.
  • Access policies are highly customizable to provide access to resources in a data lake.
  • Data lakes on AWS rely on server-side and client-side encryption.
  • Amazon Redshift handles service and database security separately.

Sample Exam Question

Key Words and Phrases

  • Query data directly from files in the company's data lake, which is built on Amazon S3.

Correct Answer

  • Amazon Redshift Spectrum.
