Questions and Answers
- What does the Random Sampling data sampling technique involve? (Answer: every item has an equal chance of selection)
- Describe the concept of data skew in distributed systems. (Answer: unequal distribution of data across nodes or partitions)
- Which technique addresses data skew by introducing a random factor to distribute data more uniformly?
- True or false: Completeness in data validation ensures all required data is present and no essential parts are missing.
- Pivoting involves turning row-level data into ______ data.
- Which of the following statements is true about structured data?
- What are some characteristics of unstructured data?
- Define 'Volume' in the context of data.
- True or false: Data Lakes require a predefined schema before writing data (schema-on-write).
- ______ data is not as organized as structured data but has some level of structure in the form of tags, hierarchies, or other patterns.
- What does ETL stand for?
- Which of the following are characteristics of a Data Lakehouse architecture? (Select all that apply)
- Parquet is a columnar storage format optimized for __________.
- True or false: Schema Evolution refers to the ability to adapt and change the schema of a dataset without disrupting existing processes or systems.
- Match the following data formats with their descriptions.
- What operation is used to achieve the same result as pivoting without the specific PIVOT operation?
- What symbol is used as the regular expression operator in SQL for pattern matching?
- Which of the following symbols is used to match a pattern at the start of a string in regular expressions?
- Which git command is used to initialize a new Git repository?
- True or false: The command 'git revert' creates a new commit that undoes all changes from a previous commit.
- What is Amazon S3 advertised as?
- What are some use cases for Amazon S3? (Select all that apply)
- Buckets in Amazon S3 must have a globally unique ______ across all regions and accounts.
- What is the maximum object size in Amazon S3?
- True or false: Versioning in Amazon S3 is enabled at the object level.
- Match the following security access controls in Amazon S3.
- What is the recommended method for encrypting objects in Amazon S3 buckets?
- Which encryption method in Amazon S3 allows users to manage their own encryption keys?
- True or false: Amazon S3 object encryption using SSE-KMS does not rely on AWS Key Management Service (AWS KMS).
- Which Amazon S3 storage class offers instantaneous retrieval time?
- What is the cost of storage per GB per month for Standard-IA in Amazon S3?
- Objects in Amazon S3 can be transitioned to Glacier after _____ days using Lifecycle Rules.
- True or false: S3 Event Notifications can be used for generating thumbnails of images uploaded to S3.
- What is the benefit of versioning your S3 bucket?
- What happens to a file that is not versioned prior to enabling versioning in an S3 bucket?
- True or false: Suspending versioning in an S3 bucket deletes the previous versions of files.
- Match the following types of S3 Replication with their descriptions.
- What is the purpose of Amazon S3 Intelligent-Tiering?
- Which S3 storage class is meant for data that is accessed less frequently but requires rapid access when needed?
- What is the Performance Mode set at EFS creation time used for?
- Which EFS storage class is 50% cheaper and used for rarely accessed data?
- True or false: EBS volumes are locked at the Availability Zone (AZ) level.
- What process should be followed to migrate an EBS volume across Availability Zones?
- What is the purpose of SSE-S3 encryption in S3 buckets?
- Which encryption headers can be used to 'force encryption' for objects in an S3 bucket?
- True or false: Bucket Policies are evaluated before Default Encryption in S3.
- EBS stands for Elastic Block Store, and it is a ______ drive you can attach to your instances.
- Match the following EBS Volume attributes with their descriptions.
Study Notes
Data Engineering Fundamentals
- Data engineering is a broad field that encompasses various aspects of data processing, storage, and analytics.
- The AWS Certified Data Engineer - Associate exam (DEA-C01) is a challenging certification that requires knowledge of AWS data engineering services.
Data Types
- Structured data: organized in a defined manner or schema, typically found in relational databases, and is easily queryable.
- Unstructured data: does not have a pre-defined structure or schema, and is not easily queryable without preprocessing.
- Semi-structured data: has some level of structure, but not as organized as structured data, and is more flexible than structured data but not as chaotic as unstructured data.
Properties of Data
- Volume: refers to the amount or size of data that an organization is dealing with at any given time.
- Velocity: refers to the speed at which new data is generated, collected, and processed.
- Variety: refers to the different types, structures, and sources of data.
Data Warehouse vs. Data Lake
- Data Warehouse:
- A centralized repository optimized for analysis where data from different sources is stored in a structured format.
- Designed for complex queries and analysis.
- Data is cleaned, transformed, and loaded (ETL process).
- Typically uses a star or snowflake schema.
- Optimized for read-heavy operations.
- Data Lake:
- A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
- Can store large volumes of raw data without a pre-defined schema.
- Data is loaded as-is, without preprocessing.
- Supports batch, real-time, and stream processing.
- Can be queried for data transformation or exploration purposes.
Data Lakehouse
- A hybrid data architecture that combines the best features of data lakes and data warehouses.
- Supports both structured and unstructured data.
- Allows for schema-on-write and schema-on-read.
- Provides capabilities for both detailed analytics and machine learning tasks.
- Typically built on top of cloud or distributed architectures.
Data Mesh
- A governance and organization model that focuses on individual teams owning "data products" within a given domain.
- Domain-based data management with federated governance and central standards.
- Self-service tooling and infrastructure.
ETL Pipelines
- ETL stands for Extract, Transform, Load.
- Extract: retrieve raw data from source systems, ensuring data integrity.
- Transform: convert the extracted data into a format suitable for the target data warehouse.
- Load: move the transformed data into the target data warehouse or another data repository.
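The three ETL stages can be sketched as plain functions. This is a minimal illustration, not any AWS API; the function and field names are invented for the example.

```python
# Minimal ETL sketch: extract raw rows, transform them for the target schema,
# load them into a target store. All names here are illustrative.

def extract(source_rows):
    # Extract: retrieve raw records from a source system.
    return list(source_rows)

def transform(rows):
    # Transform: normalize field names and types for the target schema.
    return [{"name": r["name"].strip().title(), "salary": int(r["salary"])}
            for r in rows]

def load(rows, target):
    # Load: append transformed records to the target repository.
    target.extend(rows)

warehouse = []
raw = [{"name": "  ada lovelace ", "salary": "90000"}]
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'name': 'Ada Lovelace', 'salary': 90000}]
```

In practice each stage is a separate, retryable step managed by an orchestrator such as AWS Glue, as noted below.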
Managing ETL Pipelines
- Automation is necessary for reliable ETL pipeline management.
- AWS Glue and other orchestration services can be used to manage ETL pipelines.
Data Sources
- JDBC (Java Database Connectivity): platform-independent, language-dependent.
- ODBC (Open Database Connectivity): platform-dependent, language-independent.
- Raw logs, APIs, and streams are also data sources.
Common Data Formats
- CSV (Comma-Separated Values): text-based format, human-readable, and editable.
- JSON (JavaScript Object Notation): lightweight, text-based, human-readable, and suitable for data interchange between web servers and clients.
- Avro: binary format, stores both data and schema, allowing for efficient serialization and schema evolution.
Parquet
- A columnar storage format optimized for analytics.
- Allows for efficient compression and encoding schemes.
- Use cases:
- Analyzing large datasets with analytics engines.
- Reading specific columns instead of entire records.
- Storing data on distributed systems where I/O operations and storage need optimization.
- Compatible with systems:
- Hadoop ecosystem.
- Apache Spark.
- Apache Hive.
- Apache Impala.
- Amazon Redshift Spectrum.
Data Modeling
- A star schema consists of fact tables and dimensions.
- Primary and foreign keys are used in data modeling.
- An Entity Relationship Diagram (ERD) is a type of diagram that represents data models.
Data Lineage
- A visual representation that traces the flow and transformation of data through its lifecycle.
- Importance:
- Helps in tracking errors back to their source.
- Ensures compliance with regulations.
- Provides a clear understanding of how data is moved, transformed, and consumed within systems.
- Example: Using Spline Agent (for Spark) attached to Glue, dumping lineage data into Neptune via Lambda.
Schema Evolution
- The ability to adapt and change the schema of a dataset over time without disrupting existing processes or systems.
- Importance:
- Ensures data systems can adapt to changing business requirements.
- Allows for the addition, removal, or modification of columns/fields in a dataset.
- Maintains backward compatibility with older data records.
- Glue Schema Registry is used for schema discovery, compatibility, and validation.
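Backward compatibility can be sketched with a reader that fills in defaults for fields older records lack. The schema, field names, and defaults below are illustrative, not from any registry API.

```python
# Schema evolution sketch: a reader tolerates records written under an older
# schema by filling missing fields with defaults (backward compatibility).
SCHEMA_V2_DEFAULTS = {"id": None, "name": "", "department": "unassigned"}

def read_record(raw):
    # Old records lack the newer "department" field; defaults keep them readable.
    record = dict(SCHEMA_V2_DEFAULTS)
    record.update(raw)
    return record

old = read_record({"id": 1, "name": "Ada"})  # written before the schema change
new = read_record({"id": 2, "name": "Grace", "department": "eng"})
print(old["department"])  # unassigned
print(new["department"])  # eng
```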
Database Performance Optimization
- Indexing:
- Avoids full table scans.
- Enforces data uniqueness and integrity.
- Partitioning:
- Reduces the amount of data scanned.
- Helps with data lifecycle management.
- Enables parallel processing.
- Compression:
- Speeds up data transfer and reduces storage and disk reads.
- Examples: GZIP, LZOP, BZIP2, ZSTD (Redshift examples).
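The compression benefit is easy to demonstrate with GZIP from the standard library; the repetitive payload below is synthetic, standing in for a low-cardinality column.

```python
import gzip

# Compression sketch: GZIP shrinks repetitive column data, reducing storage
# and the bytes read from disk. The payload is synthetic.
payload = b"department=engineering;" * 1000
compressed = gzip.compress(payload)
print(len(payload), "->", len(compressed), "bytes")

# Decompression round-trips losslessly.
assert gzip.decompress(compressed) == payload
```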
Data Sampling Techniques
- Random Sampling:
- Every item has an equal chance.
- Stratified Sampling:
- Divides the population into homogeneous subgroups (strata).
- Randomly samples within each stratum.
- Ensures representation of each subgroup.
- Other techniques:
- Systematic.
- Cluster.
- Convenience.
- Judgmental.
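Stratified sampling can be sketched with the standard library: group by stratum, then sample within each group. The population and stratum key are illustrative.

```python
import random

# Stratified sampling sketch: split the population into strata, then sample
# randomly within each stratum so every subgroup stays represented.
def stratified_sample(population, key, per_stratum, seed=42):
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# 90% "eng" vs 10% "hr": a plain random sample could underrepresent "hr".
people = [{"dept": d, "id": i} for i, d in enumerate(["eng"] * 90 + ["hr"] * 10)]
sample = stratified_sample(people, key=lambda p: p["dept"], per_stratum=5)
print(sorted({p["dept"] for p in sample}))  # ['eng', 'hr']
```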
Data Skew Mechanisms
- Data skew refers to the unequal distribution of data across various nodes or partitions in distributed systems.
- Causes:
- Non-uniform distribution of data.
- Inadequate partitioning strategy.
- Temporal skew.
- Importance:
- Monitoring data distribution and alerting when skew issues arise.
Addressing Data Skew
- Adaptive Partitioning:
- Dynamically adjusts partitioning based on data characteristics.
- Salting:
- Introduces a random factor to the data to distribute it more uniformly.
- Regular Redistribution:
- Redistributes the data based on its current distribution characteristics.
- Sampling:
- Uses a sample of the data to determine the distribution and adjust the processing strategy accordingly.
- Custom Partitioning:
- Defines custom rules or functions for partitioning data based on domain knowledge.
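Salting in particular is easy to see in miniature: a hot key lands every row on one hash partition until a random salt is appended. The key names and partition count below are illustrative.

```python
import random
from collections import Counter

# Salting sketch: every row shares one hot key, so hash partitioning sends
# all of them to a single partition. Appending a random salt spreads them out.
NUM_PARTITIONS = 4
rows = ["user_42"] * 100  # heavily skewed: one key dominates

def partition(key):
    return hash(key) % NUM_PARTITIONS

rng = random.Random(0)
unsalted = Counter(partition(k) for k in rows)
salted = Counter(partition(f"{k}#{rng.randrange(16)}") for k in rows)

print(len(unsalted), "partition(s) used without salting")  # 1
print(len(salted), "partition(s) used with salting")
```

The trade-off: because one logical key is now split across several physical keys, per-partition aggregates must be recombined in a second pass.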
Data Validation and Profiling
- Completeness:
- Ensures all required data is present and no essential parts are missing.
- Checks: Missing values, null counts, percentage of populated fields.
- Importance:
- Missing data can lead to inaccurate analyses and insights.
- Consistency:
- Ensures data values are consistent across datasets and do not contradict each other.
- Checks: Cross-field validation, comparing data from different sources or periods.
- Importance:
- Inconsistent data can cause confusion and result in incorrect conclusions.
- Accuracy:
- Ensures data is correct, reliable, and represents what it is supposed to.
- Checks: Comparing with trusted sources, validation against known standards or rules.
- Importance:
- Inaccurate data can lead to false insights and poor decision-making.
- Integrity:
- Ensures data maintains its correctness and consistency over its lifecycle and across systems.
- Checks: Referential integrity (e.g., foreign key checks in databases), relationship validations.
- Importance:
- Ensures relationships between data elements are preserved, and data remains trustworthy over time.
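A completeness check of the kind described above (null counts, percentage of populated fields) can be sketched in a few lines; the dataset is illustrative.

```python
# Completeness check sketch: measure the fraction of populated (non-null)
# values per field. Rows and field names are illustrative.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]

def completeness(rows, field):
    populated = sum(1 for r in rows if r.get(field) is not None)
    return populated / len(rows)

print(f"email populated: {completeness(rows, 'email'):.0%}")  # 67%
assert completeness(rows, "id") == 1.0  # no missing ids
```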
SQL Review
- Aggregation:
- COUNT:
SELECT COUNT(*) AS total_rows FROM employees;
- SUM:
SELECT SUM(salary) AS total_salary FROM employees;
- AVG:
SELECT AVG(salary) AS average_salary FROM employees;
- MAX / MIN:
SELECT MAX(salary) AS highest_salary FROM employees;
- Aggregate with CASE:
- Allows for applying multiple filters to what is being aggregated.
- Example:
SELECT
  COUNT(CASE WHEN salary > 70000 THEN 1 END) AS high_salary_count,
  COUNT(CASE WHEN salary BETWEEN 50000 AND 70000 THEN 1 END) AS medium_salary_count,
  COUNT(CASE WHEN salary < 50000 THEN 1 END) AS low_salary_count
FROM employees;
- Grouping:
- Nested grouping, sorting.
Pivoting
- Pivoting is the act of turning row-level data into columnar data.
- Example:
- Using conditional aggregation to achieve pivoting.
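A pivot via conditional aggregation can be run end to end with sqlite3; the table, columns, and values below are invented for the example.

```python
import sqlite3

# Pivot sketch using conditional aggregation (no PIVOT keyword needed):
# turns row-level (year, quarter, amount) data into one column per quarter.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (year INT, quarter TEXT, amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(2023, "Q1", 100), (2023, "Q2", 150), (2024, "Q1", 120)])

rows = con.execute("""
    SELECT year,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY year
    ORDER BY year
""").fetchall()
print(rows)  # [(2023, 100, 150), (2024, 120, 0)]
```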
Joins
- INNER JOIN:
- Returns only the rows that have a match in both tables.
- Example:
SELECT * FROM employees INNER JOIN departments ON employees.department_id = departments.department_id;
- LEFT OUTER JOIN:
- Returns all the rows from the left table and the matched rows from the right table.
- Example:
SELECT * FROM employees LEFT OUTER JOIN departments ON employees.department_id = departments.department_id;
- RIGHT OUTER JOIN:
- Returns all the rows from the right table and the matched rows from the left table.
- Example:
SELECT * FROM employees RIGHT OUTER JOIN departments ON employees.department_id = departments.department_id;
- FULL OUTER JOIN:
- Returns all the rows from both tables.
- Example:
SELECT * FROM employees FULL OUTER JOIN departments ON employees.department_id = departments.department_id;
- CROSS JOIN:
- Returns the Cartesian product of both tables.
- Example:
SELECT * FROM employees CROSS JOIN departments;
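The join behaviors can be verified with sqlite3 on a toy employees/departments pair; the table contents are invented for the example.

```python
import sqlite3

# Join sketch: contrast INNER JOIN (matching rows only) with
# LEFT OUTER JOIN (all left-table rows kept, NULLs where unmatched).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE departments (department_id INT, name TEXT)")
con.execute("CREATE TABLE employees (name TEXT, department_id INT)")
con.execute("INSERT INTO departments VALUES (1, 'eng')")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [("Ada", 1), ("Grace", None)])  # Grace has no department

inner = con.execute("""
    SELECT e.name, d.name FROM employees e
    INNER JOIN departments d ON e.department_id = d.department_id
    ORDER BY e.name
""").fetchall()
left = con.execute("""
    SELECT e.name, d.name FROM employees e
    LEFT OUTER JOIN departments d ON e.department_id = d.department_id
    ORDER BY e.name
""").fetchall()
print(inner)  # [('Ada', 'eng')]
print(left)   # [('Ada', 'eng'), ('Grace', None)]
```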
SQL Regular Expressions
- Pattern matching:
- Uses the regular expression operator (~).
- Example:
SELECT * FROM name WHERE name ~ '^(fire|ice)';
- This selects any rows where the name starts with "fire" or "ice"; the ~ operator is case-sensitive, so use ~* for a case-insensitive match.
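The same anchored pattern works in Python's `re` module, which is a convenient way to test an expression before embedding it in SQL. The sample names are illustrative.

```python
import re

# Regex sketch: ^(fire|ice) anchors the match to the start of the string;
# re.IGNORECASE makes it case-insensitive (like SQL's ~* operator).
pattern = re.compile(r"^(fire|ice)", re.IGNORECASE)
names = ["Firefly", "icecap", "campfire", "Iceberg"]
matches = [n for n in names if pattern.search(n)]
print(matches)  # ['Firefly', 'icecap', 'Iceberg']  (campfire: no match at start)
```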
Git Review
- Setting Up and Configuration:
- git init: Initializes a new Git repository.
- git config: Sets configuration values for user info, aliases, and more.
- git config --global user.name "Your Name": Sets your name.
- git config --global user.email "[email protected]": Sets your email.
- Basic Commands:
- git clone <url>: Clones (downloads) a repository from an existing URL.
- git status: Checks the status of your changes in the working directory.
- git add <file>: Adds changes in the file to the staging area.
- git add .: Adds all new and changed files to the staging area.
- git commit -m "Commit message here": Commits the staged changes with a message.
- git log: Views commit logs.
- Branching with Git:
- git branch: Lists all local branches.
- git branch <branch-name>: Creates a new branch.
- git checkout <branch-name>: Switches to a specific branch.
- git checkout -b <branch-name>: Creates a new branch and switches to it.
- git merge <branch-name>: Merges the specified branch into the current branch.
- git branch -d <branch-name>: Deletes a branch.
- Remote Repositories:
- git remote add <name> <url>: Adds a remote repository.
- git remote: Lists all remote repositories.
- git push <remote> <branch>: Pushes a branch to a remote repository.
- git pull: Pulls changes from a remote repository branch into the current local branch.
- Undoing Changes:
- git reset: Resets the staging area to match the most recent commit, without affecting the working directory.
- git reset --hard: Resets the staging area and the working directory to match the most recent commit.
- git revert <commit>: Creates a new commit that undoes all of the changes from a previous commit.
Advanced Git
- git stash: Temporarily saves changes that are not yet ready for a commit.
- git stash pop: Restores the most recently stashed changes.
- git rebase <branch>: Reapplies commits from one branch onto another, often used to integrate changes from one branch into another.
- git cherry-pick <commit>: Applies changes from a specific commit to the current branch.
Git Collaboration and Inspection
- git blame <file>: Shows who made changes to a file and when.
- git diff: Shows changes between commits, a commit and the working tree, etc.
- git fetch: Fetches changes from a remote repository without merging them.
Git Maintenance and Data Recovery
- git fsck: Checks the object database for errors.
- git gc: Cleans up and optimizes the local repository.
- git reflog: Records when refs were updated in the local repository; useful for recovering lost commits.
Storage
- Amazon S3 is one of the main building blocks of AWS, providing infinitely scaling storage
- Many websites and AWS services use Amazon S3 as a backbone or integration
Amazon S3 Section
- Amazon S3 has various use cases, including:
- Backup and storage
- Disaster Recovery
- Archive
- Hybrid Cloud storage
- Application hosting
- Media hosting
- Data lakes and big data analytics
- Software delivery
- Static website
Amazon S3 - Buckets
- Buckets are directories in Amazon S3
- Bucket names must be globally unique, with a length of 3-63 characters, and follow specific naming conventions
- Buckets are defined at the region level, and S3 looks like a global service but buckets are created in a region
Amazon S3 - Objects
- Objects (files) have a key, which is the full path to the object
- The key is composed of a prefix and an object name
- There is no concept of directories within buckets, just keys with long names containing slashes
- Object values are the content of the body, with a max size of 5TB
- Objects can have metadata, tags, and version IDs
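The prefix/object-name split described above can be sketched with plain string handling; the bucket layout and key below are invented for the example (S3 itself treats the whole key as opaque).

```python
# S3 key sketch: a key is the full path; here we treat everything up to the
# last slash as the prefix and the remainder as the object name.
# The key is illustrative.
def split_key(key):
    prefix, _, object_name = key.rpartition("/")
    return (prefix + "/" if prefix else ""), object_name

prefix, name = split_key("my_folder1/another_folder/my_file.txt")
print(prefix)  # my_folder1/another_folder/
print(name)    # my_file.txt
```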
Amazon S3 – Security
- Security can be User-Based or Resource-Based
- User-Based security uses IAM policies to determine which API calls are allowed for a specific user
- Resource-Based security uses bucket policies and ACLs to control access to buckets and objects
- Encryption can be used to encrypt objects in Amazon S3 using encryption keys
Amazon S3 – Bucket Policies
- Bucket policies are JSON-based policies that define permissions for a bucket
- Policies can be used to grant public access to a bucket, force objects to be encrypted at upload, or grant access to another account
- Bucket policies have resources, effects, actions, and principals
Amazon S3 – Versioning
- Versioning allows multiple versions of an object to be stored in Amazon S3
- Versioning is enabled at the bucket level, and it is best practice to version buckets
- Versioning protects against unintended deletes and allows for easy rollbacks to previous versions
Amazon S3 – Replication
- Replication allows objects to be copied from one bucket to another
- Replication must be enabled on both the source and destination buckets
- Cross-Region Replication (CRR) and Same-Region Replication (SRR) are supported
- Replication can be used for compliance, lower latency access, and replication across accounts
Amazon S3 – Storage Classes
- Amazon S3 has various storage classes, including:
- S3 Standard
- S3 Standard-Infrequent Access
- S3 One Zone-Infrequent Access
- S3 Glacier Instant Retrieval
- S3 Glacier Flexible Retrieval
- S3 Glacier Deep Archive
- S3 Intelligent Tiering
S3 Durability and Availability
- Durability measures how many objects are lost in a year, with a high durability of 99.999999999% (11 9's) across multiple AZs
- Availability measures how readily available a service is, with varying levels of availability depending on storage class
S3 Storage Classes Comparison
- Comparison of various storage classes, including durability, availability, and minimum storage duration
Description
This quiz covers the fundamentals of data engineering, including storage, database, migration, and more, as part of the AWS Certified Data Engineer Associate Course.