Questions and Answers
- What does the Random Sampling data sampling technique involve? (Answer: every item has an equal chance of selection)
- Describe the concept of data skew in distributed systems. (Answer: unequal distribution of data across nodes or partitions)
- Which technique addresses data skew by introducing a random factor to distribute data more uniformly?
- True or false: Completeness in data validation ensures all required data is present and no essential parts are missing.
- Pivoting involves turning row-level data into ______ data.
- Which of the following statements is true about structured data?
- What are some characteristics of unstructured data?
- Define 'Volume' in the context of data.
- True or false: Data Lakes require a predefined schema before writing data (schema-on-write).
- ______ data is not as organized as structured data but has some level of structure in the form of tags, hierarchies, or other patterns.
- What does ETL stand for?
- Which of the following are characteristics of a Data Lakehouse architecture? (Select all that apply)
- Parquet is a columnar storage format optimized for __________.
- True or false: Schema Evolution refers to the ability to adapt and change the schema of a dataset without disrupting existing processes or systems.
- Match the following data formats with their descriptions.
- What operation is used to achieve the same result as pivoting without the specific PIVOT operation?
- What symbol is used as the regular expression operator in SQL for pattern matching?
- Which of the following symbols is used to match a pattern at the start of a string in regular expressions?
- Which git command is used to initialize a new Git repository?
- True or false: The command 'git revert' creates a new commit that undoes all changes from a previous commit.
- What is Amazon S3 advertised as?
- What are some use cases for Amazon S3? (Select all that apply)
- Buckets in Amazon S3 must have a globally unique ______ across all regions and accounts.
- What is the maximum object size in Amazon S3?
- True or false: Versioning in Amazon S3 is enabled at the object level.
- Match the following security access controls in Amazon S3.
- What is the recommended method for encrypting objects in Amazon S3 buckets?
- Which encryption method in Amazon S3 allows users to manage their own encryption keys?
- True or false: Amazon S3 object encryption using SSE-KMS does not rely on AWS Key Management Service (AWS KMS).
- Which Amazon S3 storage class offers instantaneous retrieval time?
- What is the cost of storage per GB per month for Standard-IA in Amazon S3?
- Objects in Amazon S3 can be transitioned to Glacier after _____ days using Lifecycle Rules.
- True or false: S3 Event Notifications can be used for generating thumbnails of images uploaded to S3.
- What is the benefit of versioning your S3 bucket?
- What happens to a file that is not versioned prior to enabling versioning in an S3 bucket?
- True or false: Suspending versioning in an S3 bucket deletes the previous versions of files.
- Match the following types of S3 Replication with their descriptions.
- What is the purpose of Amazon S3 Intelligent-Tiering?
- Which S3 storage class is meant for data that is accessed less frequently but requires rapid access when needed?
- What is the Performance Mode set at EFS creation time used for?
- Which EFS storage class is 50% cheaper and used for rarely accessed data?
- True or false: EBS volumes are locked at the Availability Zone (AZ) level.
- What process should be followed to migrate an EBS volume across Availability Zones?
- What is the purpose of SSE-S3 encryption in S3 buckets?
- Which encryption headers can be used to 'force encryption' for objects in an S3 bucket?
- True or false: Bucket Policies are evaluated before Default Encryption in S3.
- EBS stands for Elastic Block Store, and it is a ______ drive you can attach to your instances.
- Match the following EBS Volume attributes with their descriptions.
Study Notes
Data Engineering Fundamentals
- Data engineering is a broad field that encompasses various aspects of data processing, storage, and analytics.
- The AWS Certified Data Engineer - Associate exam (DEA-C01) is a challenging certification that requires knowledge of AWS data engineering services.
Data Types
- Structured data: organized in a defined manner or schema, typically found in relational databases, and is easily queryable.
- Unstructured data: does not have a pre-defined structure or schema, and is not easily queryable without preprocessing.
- Semi-structured data: has some level of structure, but not as organized as structured data, and is more flexible than structured data but not as chaotic as unstructured data.
Properties of Data
- Volume: refers to the amount or size of data that an organization is dealing with at any given time.
- Velocity: refers to the speed at which new data is generated, collected, and processed.
- Variety: refers to the different types, structures, and sources of data.
Data Warehouse vs. Data Lake
- Data Warehouse:
- A centralized repository optimized for analysis where data from different sources is stored in a structured format.
- Designed for complex queries and analysis.
- Data is cleaned, transformed, and loaded (ETL process).
- Typically uses a star or snowflake schema.
- Optimized for read-heavy operations.
- Data Lake:
- A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
- Can store large volumes of raw data without a pre-defined schema.
- Data is loaded as-is, without preprocessing.
- Supports batch, real-time, and stream processing.
- Can be queried for data transformation or exploration purposes.
Data Lakehouse
- A hybrid data architecture that combines the best features of data lakes and data warehouses.
- Supports both structured and unstructured data.
- Allows for schema-on-write and schema-on-read.
- Provides capabilities for both detailed analytics and machine learning tasks.
- Typically built on top of cloud or distributed architectures.
Data Mesh
- A governance and organization model that focuses on individual teams owning "data products" within a given domain.
- Domain-based data management with federated governance and central standards.
- Self-service tooling and infrastructure.
ETL Pipelines
- ETL stands for Extract, Transform, Load.
- Extract: retrieve raw data from source systems, ensuring data integrity.
- Transform: convert the extracted data into a format suitable for the target data warehouse.
- Load: move the transformed data into the target data warehouse or another data repository.
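The three ETL stages can be sketched as plain functions. This is a minimal illustration, not any AWS API; the function and field names are invented for the example.

```python
# Minimal ETL sketch: extract raw rows, transform them for the target schema,
# load them into a target store. All names here are illustrative.

def extract(source_rows):
    # Extract: retrieve raw records from a source system.
    return list(source_rows)

def transform(rows):
    # Transform: normalize field names and types for the target schema.
    return [{"name": r["name"].strip().title(), "salary": int(r["salary"])}
            for r in rows]

def load(rows, target):
    # Load: append transformed records to the target repository.
    target.extend(rows)

warehouse = []
raw = [{"name": "  ada lovelace ", "salary": "90000"}]
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'name': 'Ada Lovelace', 'salary': 90000}]
```

In practice each stage is a separate, retryable step managed by an orchestrator such as AWS Glue, as noted below.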
Managing ETL Pipelines
- Automation is necessary for reliable ETL pipeline management.
- AWS Glue and other orchestration services can be used to manage ETL pipelines.
Data Sources
- JDBC (Java Database Connectivity): platform-independent, language-dependent.
- ODBC (Open Database Connectivity): platform-dependent, language-independent.
- Raw logs, APIs, and streams are also data sources.
Common Data Formats
- CSV (Comma-Separated Values): text-based format, human-readable, and editable.
- JSON (JavaScript Object Notation): lightweight, text-based, human-readable, and suitable for data interchange between web servers and clients.
- Avro: binary format, stores both data and schema, allowing for efficient serialization and schema evolution.
Parquet
- A columnar storage format optimized for analytics.
- Allows for efficient compression and encoding schemes.
- Use cases:
- Analyzing large datasets with analytics engines.
- Reading specific columns instead of entire records.
- Storing data on distributed systems where I/O operations and storage need optimization.
- Compatible with systems:
- Hadoop ecosystem.
- Apache Spark.
- Apache Hive.
- Apache Impala.
- Amazon Redshift Spectrum.
Data Modeling
- A star schema consists of fact tables and dimensions.
- Primary and foreign keys are used in data modeling.
- An Entity Relationship Diagram (ERD) is a type of diagram that represents data models.
Data Lineage
- A visual representation that traces the flow and transformation of data through its lifecycle.
- Importance:
- Helps in tracking errors back to their source.
- Ensures compliance with regulations.
- Provides a clear understanding of how data is moved, transformed, and consumed within systems.
- Example: Using Spline Agent (for Spark) attached to Glue, dumping lineage data into Neptune via Lambda.
Schema Evolution
- The ability to adapt and change the schema of a dataset over time without disrupting existing processes or systems.
- Importance:
- Ensures data systems can adapt to changing business requirements.
- Allows for the addition, removal, or modification of columns/fields in a dataset.
- Maintains backward compatibility with older data records.
- Glue Schema Registry is used for schema discovery, compatibility, and validation.
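Backward compatibility can be sketched with a reader that fills in defaults for fields older records lack. The schema, field names, and defaults below are illustrative, not from any registry API.

```python
# Schema evolution sketch: a reader tolerates records written under an older
# schema by filling missing fields with defaults (backward compatibility).
SCHEMA_V2_DEFAULTS = {"id": None, "name": "", "department": "unassigned"}

def read_record(raw):
    # Old records lack the newer "department" field; defaults keep them readable.
    record = dict(SCHEMA_V2_DEFAULTS)
    record.update(raw)
    return record

old = read_record({"id": 1, "name": "Ada"})  # written before the schema change
new = read_record({"id": 2, "name": "Grace", "department": "eng"})
print(old["department"])  # unassigned
print(new["department"])  # eng
```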
Database Performance Optimization
- Indexing:
- Avoids full table scans.
- Enforces data uniqueness and integrity.
- Partitioning:
- Reduces the amount of data scanned.
- Helps with data lifecycle management.
- Enables parallel processing.
- Compression:
- Speeds up data transfer and reduces storage and disk reads.
- Examples: GZIP, LZOP, BZIP2, ZSTD (Redshift examples).
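The compression benefit is easy to demonstrate with GZIP from the standard library; the repetitive payload below is synthetic, standing in for a low-cardinality column.

```python
import gzip

# Compression sketch: GZIP shrinks repetitive column data, reducing storage
# and the bytes read from disk. The payload is synthetic.
payload = b"department=engineering;" * 1000
compressed = gzip.compress(payload)
print(len(payload), "->", len(compressed), "bytes")

# Decompression round-trips losslessly.
assert gzip.decompress(compressed) == payload
```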
Data Sampling Techniques
- Random Sampling:
- Every item has an equal chance.
- Stratified Sampling:
- Divides the population into homogeneous subgroups (strata).
- Randomly samples within each stratum.
- Ensures representation of each subgroup.
- Other techniques:
- Systematic.
- Cluster.
- Convenience.
- Judgmental.
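Stratified sampling can be sketched with the standard library: group by stratum, then sample within each group. The population and stratum key are illustrative.

```python
import random

# Stratified sampling sketch: split the population into strata, then sample
# randomly within each stratum so every subgroup stays represented.
def stratified_sample(population, key, per_stratum, seed=42):
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# 90% "eng" vs 10% "hr": a plain random sample could underrepresent "hr".
people = [{"dept": d, "id": i} for i, d in enumerate(["eng"] * 90 + ["hr"] * 10)]
sample = stratified_sample(people, key=lambda p: p["dept"], per_stratum=5)
print(sorted({p["dept"] for p in sample}))  # ['eng', 'hr']
```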
Data Skew Mechanisms
- Data skew refers to the unequal distribution of data across various nodes or partitions in distributed systems.
- Causes:
- Non-uniform distribution of data.
- Inadequate partitioning strategy.
- Temporal skew.
- Importance:
- Monitoring data distribution and alerting when skew issues arise.
Addressing Data Skew
- Adaptive Partitioning:
- Dynamically adjusts partitioning based on data characteristics.
- Salting:
- Introduces a random factor to the data to distribute it more uniformly.
- Regular Redistribution:
- Redistributes the data based on its current distribution characteristics.
- Sampling:
- Uses a sample of the data to determine the distribution and adjust the processing strategy accordingly.
- Custom Partitioning:
- Defines custom rules or functions for partitioning data based on domain knowledge.
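Salting in particular is easy to see in miniature: a hot key lands every row on one hash partition until a random salt is appended. The key names and partition count below are illustrative.

```python
import random
from collections import Counter

# Salting sketch: every row shares one hot key, so hash partitioning sends
# all of them to a single partition. Appending a random salt spreads them out.
NUM_PARTITIONS = 4
rows = ["user_42"] * 100  # heavily skewed: one key dominates

def partition(key):
    return hash(key) % NUM_PARTITIONS

rng = random.Random(0)
unsalted = Counter(partition(k) for k in rows)
salted = Counter(partition(f"{k}#{rng.randrange(16)}") for k in rows)

print(len(unsalted), "partition(s) used without salting")  # 1
print(len(salted), "partition(s) used with salting")
```

The trade-off: because one logical key is now split across several physical keys, per-partition aggregates must be recombined in a second pass.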
Data Validation and Profiling
- Completeness:
- Ensures all required data is present and no essential parts are missing.
- Checks: Missing values, null counts, percentage of populated fields.
- Importance:
- Missing data can lead to inaccurate analyses and insights.
- Consistency:
- Ensures data values are consistent across datasets and do not contradict each other.
- Checks: Cross-field validation, comparing data from different sources or periods.
- Importance:
- Inconsistent data can cause confusion and result in incorrect conclusions.
- Accuracy:
- Ensures data is correct, reliable, and represents what it is supposed to.
- Checks: Comparing with trusted sources, validation against known standards or rules.
- Importance:
- Inaccurate data can lead to false insights and poor decision-making.
- Integrity:
- Ensures data maintains its correctness and consistency over its lifecycle and across systems.
- Checks: Referential integrity (e.g., foreign key checks in databases), relationship validations.
- Importance:
- Ensures relationships between data elements are preserved, and data remains trustworthy over time.
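A completeness check of the kind described above (null counts, percentage of populated fields) can be sketched in a few lines; the dataset is illustrative.

```python
# Completeness check sketch: measure the fraction of populated (non-null)
# values per field. Rows and field names are illustrative.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]

def completeness(rows, field):
    populated = sum(1 for r in rows if r.get(field) is not None)
    return populated / len(rows)

print(f"email populated: {completeness(rows, 'email'):.0%}")  # 67%
assert completeness(rows, "id") == 1.0  # no missing ids
```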
SQL Review
- Aggregation:
- COUNT:
SELECT COUNT(*) AS total_rows FROM employees;
- SUM:
SELECT SUM(salary) AS total_salary FROM employees;
- AVG:
SELECT AVG(salary) AS average_salary FROM employees;
- MAX / MIN:
SELECT MAX(salary) AS highest_salary FROM employees;
- Aggregate with CASE:
- Allows for applying multiple filters to what is being aggregated.
- Example:
SELECT
  COUNT(CASE WHEN salary > 70000 THEN 1 END) AS high_salary_count,
  COUNT(CASE WHEN salary BETWEEN 50000 AND 70000 THEN 1 END) AS medium_salary_count,
  COUNT(CASE WHEN salary < 50000 THEN 1 END) AS low_salary_count
FROM employees;
- Grouping:
- Nested grouping, sorting.
Pivoting
- Pivoting is the act of turning row-level data into columnar data.
- Example:
- Using conditional aggregation to achieve pivoting.
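A pivot via conditional aggregation can be run end to end with sqlite3; the table, columns, and values below are invented for the example.

```python
import sqlite3

# Pivot sketch using conditional aggregation (no PIVOT keyword needed):
# turns row-level (year, quarter, amount) data into one column per quarter.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (year INT, quarter TEXT, amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(2023, "Q1", 100), (2023, "Q2", 150), (2024, "Q1", 120)])

rows = con.execute("""
    SELECT year,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY year
    ORDER BY year
""").fetchall()
print(rows)  # [(2023, 100, 150), (2024, 120, 0)]
```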
Joins
- INNER JOIN:
- Returns only the rows that have a match in both tables.
- Example:
SELECT * FROM employees INNER JOIN departments ON employees.department_id = departments.department_id;
- LEFT OUTER JOIN:
- Returns all the rows from the left table and the matched rows from the right table.
- Example:
SELECT * FROM employees LEFT OUTER JOIN departments ON employees.department_id = departments.department_id;
- RIGHT OUTER JOIN:
- Returns all the rows from the right table and the matched rows from the left table.
- Example:
SELECT * FROM employees RIGHT OUTER JOIN departments ON employees.department_id = departments.department_id;
- FULL OUTER JOIN:
- Returns all the rows from both tables.
- Example:
SELECT * FROM employees FULL OUTER JOIN departments ON employees.department_id = departments.department_id;
- CROSS JOIN:
- Returns the Cartesian product of both tables.
- Example:
SELECT * FROM employees CROSS JOIN departments;
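The join behaviors can be verified with sqlite3 on a toy employees/departments pair; the table contents are invented for the example.

```python
import sqlite3

# Join sketch: contrast INNER JOIN (matching rows only) with
# LEFT OUTER JOIN (all left-table rows kept, NULLs where unmatched).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE departments (department_id INT, name TEXT)")
con.execute("CREATE TABLE employees (name TEXT, department_id INT)")
con.execute("INSERT INTO departments VALUES (1, 'eng')")
con.executemany("INSERT INTO employees VALUES (?, ?)",
                [("Ada", 1), ("Grace", None)])  # Grace has no department

inner = con.execute("""
    SELECT e.name, d.name FROM employees e
    INNER JOIN departments d ON e.department_id = d.department_id
    ORDER BY e.name
""").fetchall()
left = con.execute("""
    SELECT e.name, d.name FROM employees e
    LEFT OUTER JOIN departments d ON e.department_id = d.department_id
    ORDER BY e.name
""").fetchall()
print(inner)  # [('Ada', 'eng')]
print(left)   # [('Ada', 'eng'), ('Grace', None)]
```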
SQL Regular Expressions
- Pattern matching:
- Uses the regular expression operator (~).
- Example:
SELECT * FROM name WHERE name ~ '^(fire|ice)';
- This selects any rows where the name starts with "fire" or "ice"; the ~ operator is case-sensitive, so use ~* for a case-insensitive match.
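The same anchored pattern works in Python's `re` module, which is a convenient way to test an expression before embedding it in SQL. The sample names are illustrative.

```python
import re

# Regex sketch: ^(fire|ice) anchors the match to the start of the string;
# re.IGNORECASE makes it case-insensitive (like SQL's ~* operator).
pattern = re.compile(r"^(fire|ice)", re.IGNORECASE)
names = ["Firefly", "icecap", "campfire", "Iceberg"]
matches = [n for n in names if pattern.search(n)]
print(matches)  # ['Firefly', 'icecap', 'Iceberg']  (campfire: no match at start)
```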
Git Review
- Setting Up and Configuration:
- git init: Initializes a new Git repository.
- git config: Sets configuration values for user info, aliases, and more.
- git config --global user.name "Your Name": Sets your name.
- git config --global user.email "[email protected]": Sets your email.
- Basic Commands:
- git clone <url>: Clones (downloads) a repository from an existing URL.
- git status: Checks the status of your changes in the working directory.
- git add <file>: Adds changes in the file to the staging area.
- git add .: Adds all new and changed files to the staging area.
- git commit -m "Commit message here": Commits the staged changes with a message.
- git log: Views commit logs.
- Branching with Git:
- git branch: Lists all local branches.
- git branch <branch-name>: Creates a new branch.
- git checkout <branch-name>: Switches to a specific branch.
- git checkout -b <branch-name>: Creates a new branch and switches to it.
- git merge <branch-name>: Merges the specified branch into the current branch.
- git branch -d <branch-name>: Deletes a branch.
- Remote Repositories:
- git remote add <name> <url>: Adds a remote repository.
- git remote: Lists all remote repositories.
- git push <remote> <branch>: Pushes a branch to a remote repository.
- git pull: Pulls changes from a remote repository branch into the current local branch.
- Undoing Changes:
- git reset: Resets the staging area to match the most recent commit, without affecting the working directory.
- git reset --hard: Resets the staging area and the working directory to match the most recent commit.
- git revert <commit>: Creates a new commit that undoes all of the changes from a previous commit.
Advanced Git
- git stash: Temporarily saves changes that are not yet ready for a commit.
- git stash pop: Restores the most recently stashed changes.
- git rebase <branch>: Reapplies commits from one branch onto another, often used to integrate changes from one branch into another.
- git cherry-pick <commit>: Applies changes from a specific commit to the current branch.
Git Collaboration and Inspection
- git blame <file>: Shows who made changes to a file and when.
- git diff: Shows changes between commits, a commit and the working tree, etc.
- git fetch: Fetches changes from a remote repository without merging them.
Git Maintenance and Data Recovery
- git fsck: Checks the object database for errors.
- git gc: Cleans up and optimizes the local repository.
- git reflog: Records when refs were updated in the local repository; useful for recovering lost commits.
Storage
- Amazon S3 is one of the main building blocks of AWS, providing infinitely scaling storage
- Many websites and AWS services use Amazon S3 as a backbone or integration
Amazon S3 Section
- Amazon S3 has various use cases, including:
- Backup and storage
- Disaster Recovery
- Archive
- Hybrid Cloud storage
- Application hosting
- Media hosting
- Data lakes and big data analytics
- Software delivery
- Static website
Amazon S3 - Buckets
- Buckets are directories in Amazon S3
- Bucket names must be globally unique, with a length of 3-63 characters, and follow specific naming conventions
- Buckets are defined at the region level, and S3 looks like a global service but buckets are created in a region
Amazon S3 - Objects
- Objects (files) have a key, which is the full path to the object
- The key is composed of a prefix and an object name
- There is no concept of directories within buckets, just keys with long names containing slashes
- Object values are the content of the body, with a max size of 5TB
- Objects can have metadata, tags, and version IDs
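The prefix/object-name split described above can be sketched with plain string handling; the bucket layout and key below are invented for the example (S3 itself treats the whole key as opaque).

```python
# S3 key sketch: a key is the full path; here we treat everything up to the
# last slash as the prefix and the remainder as the object name.
# The key is illustrative.
def split_key(key):
    prefix, _, object_name = key.rpartition("/")
    return (prefix + "/" if prefix else ""), object_name

prefix, name = split_key("my_folder1/another_folder/my_file.txt")
print(prefix)  # my_folder1/another_folder/
print(name)    # my_file.txt
```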
Amazon S3 – Security
- Security can be User-Based or Resource-Based
- User-Based security uses IAM policies to determine which API calls are allowed for a specific user
- Resource-Based security uses bucket policies and ACLs to control access to buckets and objects
- Encryption can be used to encrypt objects in Amazon S3 using encryption keys
Amazon S3 – Bucket Policies
- Bucket policies are JSON-based policies that define permissions for a bucket
- Policies can be used to grant public access to a bucket, force objects to be encrypted at upload, or grant access to another account
- Bucket policies have resources, effects, actions, and principals
Amazon S3 – Versioning
- Versioning allows multiple versions of an object to be stored in Amazon S3
- Versioning is enabled at the bucket level, and it is best practice to version buckets
- Versioning protects against unintended deletes and allows for easy rollbacks to previous versions
Amazon S3 – Replication
- Replication allows objects to be copied from one bucket to another
- Replication must be enabled on both the source and destination buckets
- Cross-Region Replication (CRR) and Same-Region Replication (SRR) are supported
- Replication can be used for compliance, lower latency access, and replication across accounts
Amazon S3 – Storage Classes
- Amazon S3 has various storage classes, including:
- S3 Standard
- S3 Standard-Infrequent Access
- S3 One Zone-Infrequent Access
- S3 Glacier Instant Retrieval
- S3 Glacier Flexible Retrieval
- S3 Glacier Deep Archive
- S3 Intelligent Tiering
S3 Durability and Availability
- Durability measures how many objects are lost in a year, with a high durability of 99.999999999% (11 9's) across multiple AZs
- Availability measures how readily available a service is, with varying levels of availability depending on storage class
S3 Storage Classes Comparison
- Comparison of various storage classes, including durability, availability, and minimum storage duration
Description
This quiz covers the fundamentals of data engineering, including storage, database, migration, and more, as part of the AWS Certified Data Engineer Associate Course.