AWS Certified Data Engineer Associate Course

Questions and Answers

What does the random sampling technique involve?

Every item has an equal chance of being selected.

Describe the concept of data skew in distributed systems.

Unequal distribution of data across nodes or partitions

Which technique addresses data skew by introducing a random factor to distribute data more uniformly?

  • Custom Partitioning
  • Sampling
  • Salting (correct)
  • Adaptive Partitioning

Completeness in data validation ensures all required data is present and no essential parts are missing.

True

Pivoting involves turning row-level data into ______ data.

columnar

Which of the following statements is true about structured data?

Structured data is found in relational databases.

What are some characteristics of unstructured data?

Comes in various formats.

Define 'Volume' in the context of data.

Volume refers to the amount or size of data that organizations are dealing with at any given time.

Data Lakes require a predefined schema before writing data (Schema-on-write).

False

______ data is not as organized as structured data but has some level of structure in the form of tags, hierarchies, or other patterns.

Semi-Structured

What does ETL stand for?

Extract, Transform, Load

Which of the following are characteristics of a Data Lakehouse architecture? (Select all that apply)

Supports both structured and unstructured data

Parquet is a columnar storage format optimized for __________.

analytics

Schema Evolution refers to the ability to adapt and change the schema of a dataset without disrupting existing processes or systems.

True

Match the following data formats with their descriptions:

  • CSV (Comma-Separated Values) = Text-based format that represents data in a tabular form
  • JSON (JavaScript Object Notation) = Lightweight, text-based, and human-readable data interchange format
  • Avro = Binary format that stores data and schema together
  • Parquet = Columnar storage format optimized for analytics

What operation is used to achieve the same result as pivoting without the specific PIVOT operation?

Conditional aggregation

What symbol is used as the regular expression operator in SQL for pattern matching?

~

Which of the following symbols is used to match a pattern at the start of a string in regular expressions?

^

Which git command is used to initialize a new Git repository?

git init

The command 'git revert' creates a new commit that undoes all changes from a previous commit.

True

What is Amazon S3 advertised as?

infinitely scaling storage

What are some use cases for Amazon S3? (Select all that apply)

Application hosting

Buckets in Amazon S3 must have a globally unique ______ across all regions and accounts.

name

What is the maximum object size in Amazon S3?

5TB

Versioning in Amazon S3 is enabled at the object level.

False

Match the following security access control in Amazon S3:

  • IAM Policies = User-Based
  • Bucket Policies = Resource-Based
  • Object Access Control List (ACL) = Resource-Based
  • Bucket Access Control List (ACL) = Resource-Based

What is the recommended method for encrypting objects in Amazon S3 buckets?

Server-Side Encryption (SSE)

Which encryption method in Amazon S3 allows users to manage their own encryption keys?

Server-Side Encryption with Customer-Provided Keys (SSE-C)

Amazon S3 Object Encryption using SSE-KMS does not rely on AWS Key Management Service (AWS KMS).

False

Which Amazon S3 storage class offers Retrieval Time of Instantaneous?

Standard

What is the cost of Storage per GB per month for Standard-IA in Amazon S3?

$0.0125

Objects in Amazon S3 can be transitioned to Glacier after _____ days using Lifecycle Rules.

60

S3 Event Notifications can be used for generating thumbnails of images uploaded to S3.

True

What is the benefit of versioning your S3 bucket?

Easy roll back to previous versions

What happens to a file that is not versioned prior to enabling versioning in an S3 bucket?

It will have version 'null'

Suspending versioning in an S3 bucket deletes the previous versions of files.

False

Match the following types of S3 Replication with their descriptions:

  • Cross-Region Replication (CRR) = Compliance, lower latency access, replication across accounts
  • Same-Region Replication (SRR) = Log aggregation, live replication between production and test accounts

What is the purpose of Amazon S3 Intelligent-Tiering?

To automatically move objects between access tiers based on usage

Which S3 Storage Class is meant for data that is accessed less frequently but requires rapid access when needed?

Amazon S3 Glacier Instant Retrieval

What is the Performance Mode set at EFS creation time used for?

Latency-sensitive use cases

Which EFS Storage Class is 50% cheaper and used for rarely accessed data?

Archive

EBS volumes are locked at the Availability Zone (AZ) level.

True

What process should be followed to migrate an EBS volume across Availability Zones?

Take a snapshot, Restore the snapshot to another AZ

What is the purpose of SSE-S3 encryption in S3 buckets?

Automatically apply encryption to new objects stored in the S3 bucket

Which encryption headers can be used to 'force encryption' for objects in an S3 bucket?

SSE-KMS

Bucket Policies are evaluated before Default Encryption in S3.

True

EBS stands for Elastic Block Store, and it is a ______ drive you can attach to your instances.

network

Match the following EBS Volume attribute with its description:

  • Delete on Termination attribute = Controls EBS behavior when an EC2 instance terminates
  • EBS Elastic Volumes = Allows volume modification without detaching or restarting instance

    Study Notes

    Data Engineering Fundamentals

    • Data engineering is a broad field that encompasses various aspects of data processing, storage, and analytics.
    • The AWS Certified Data Engineer - Associate exam (DEA-C01) is a challenging certification that requires knowledge of AWS data engineering services.

    Data Types

    • Structured data: organized in a defined manner or schema, typically found in relational databases, and is easily queryable.
    • Unstructured data: does not have a pre-defined structure or schema, and is not easily queryable without preprocessing.
    • Semi-structured data: has some level of structure, but not as organized as structured data, and is more flexible than structured data but not as chaotic as unstructured data.

    Properties of Data

    • Volume: refers to the amount or size of data that an organization is dealing with at any given time.
    • Velocity: refers to the speed at which new data is generated, collected, and processed.
    • Variety: refers to the different types, structures, and sources of data.

    Data Warehouse vs. Data Lake

    • Data Warehouse:
      • A centralized repository optimized for analysis where data from different sources is stored in a structured format.
      • Designed for complex queries and analysis.
      • Data is cleaned, transformed, and loaded (ETL process).
      • Typically uses a star or snowflake schema.
      • Optimized for read-heavy operations.
    • Data Lake:
      • A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
      • Can store large volumes of raw data without a pre-defined schema.
      • Data is loaded as-is, without preprocessing.
      • Supports batch, real-time, and stream processing.
      • Can be queried for data transformation or exploration purposes.

    Data Lakehouse

    • A hybrid data architecture that combines the best features of data lakes and data warehouses.
    • Supports both structured and unstructured data.
    • Allows for schema-on-write and schema-on-read.
    • Provides capabilities for both detailed analytics and machine learning tasks.
    • Typically built on top of cloud or distributed architectures.

    Data Mesh

    • A governance and organization model that focuses on individual teams owning "data products" within a given domain.
    • Domain-based data management with federated governance and central standards.
    • Self-service tooling and infrastructure.

    ETL Pipelines

    • ETL stands for Extract, Transform, Load.
    • Extract: retrieve raw data from source systems, ensuring data integrity.
    • Transform: convert the extracted data into a format suitable for the target data warehouse.
    • Load: move the transformed data into the target data warehouse or another data repository.
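
The three stages above can be sketched end to end with the Python standard library. This is a minimal illustration only: the inline CSV text stands in for a source system, an in-memory SQLite database stands in for the target warehouse, and the table name, column names, and cleaning rules are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for a real source system.
RAW_CSV = """id,name,salary
1, Alice ,70000
2,Bob,
3,Carol,52000
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim names, cast numbers, drop rows with no salary."""
    cleaned = []
    for row in rows:
        if not row["salary"].strip():
            continue  # skip incomplete records
        cleaned.append((int(row["id"]), row["name"].strip(), int(row["salary"])))
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS employees (id INTEGER, name TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT id, name, salary FROM employees ORDER BY id").fetchall()
print(loaded)
```

Keeping each stage as its own function makes it easier to test and to schedule the stages separately in an orchestration tool.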

    Managing ETL Pipelines

    • Automation is necessary for reliable ETL pipeline management.
    • AWS Glue and other orchestration services can be used to manage ETL pipelines.

    Data Sources

    • JDBC (Java Database Connectivity): platform-independent, language-dependent.
    • ODBC (Open Database Connectivity): platform-dependent, language-independent.
    • Raw logs, APIs, and streams are also data sources.

    Common Data Formats

    • CSV (Comma-Separated Values): text-based format, human-readable, and editable.
    • JSON (JavaScript Object Notation): lightweight, text-based, human-readable, and suitable for data interchange between web servers and clients.
    • Avro: binary format, stores both data and schema, allowing for efficient serialization and schema evolution.

    Parquet

    • A columnar storage format optimized for analytics.
    • Allows for efficient compression and encoding schemes.
    • Use cases:
      • Analyzing large datasets with analytics engines.
      • Reading specific columns instead of entire records.
      • Storing data on distributed systems where I/O operations and storage need optimization.
    • Compatible with systems:
      • Hadoop ecosystem.
      • Apache Spark.
      • Apache Hive.
      • Apache Impala.
      • Amazon Redshift Spectrum.
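
To see why a columnar layout helps when a query reads only specific columns, here is a small sketch in plain Python. It does not read or write the actual Parquet file format; the records and the `to_columns` helper are hypothetical and only illustrate the difference between row-oriented and column-oriented storage.

```python
# Three hypothetical records stored row by row (as in CSV or JSON lines).
rows = [
    {"id": 1, "name": "Alice", "salary": 70000},
    {"id": 2, "name": "Bob", "salary": 48000},
    {"id": 3, "name": "Carol", "salary": 52000},
]

def to_columns(records):
    """Regroup row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: every whole record is visited to compute one aggregate.
total_from_rows = sum(record["salary"] for record in rows)

# Column layout: only the "salary" column is read; "id" and "name" are untouched.
total_from_columns = sum(columns["salary"])

print(total_from_rows, total_from_columns)
```

Values in one column also share a type and often repeat, which is one reason columnar formats tend to compress well.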

    Data Modeling

    • A star schema consists of fact tables and dimensions.
    • Primary and foreign keys are used in data modeling.
    • An Entity Relationship Diagram (ERD) is a type of diagram that represents data models.

    Data Lineage

    • A visual representation that traces the flow and transformation of data through its lifecycle.
    • Importance:
      • Helps in tracking errors back to their source.
      • Ensures compliance with regulations.
      • Provides a clear understanding of how data is moved, transformed, and consumed within systems.
    • Example: Using Spline Agent (for Spark) attached to Glue, dumping lineage data into Neptune via Lambda.

    Schema Evolution

    • The ability to adapt and change the schema of a dataset over time without disrupting existing processes or systems.
    • Importance:
      • Ensures data systems can adapt to changing business requirements.
      • Allows for the addition, removal, or modification of columns/fields in a dataset.
      • Maintains backward compatibility with older data records.
    • Glue Schema Registry is used for schema discovery, compatibility, and validation.
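
Backward compatibility can be illustrated with a reader that fills in defaults for fields added after older records were written. The sketch below is a hypothetical, simplified model in plain Python; it is not the Glue Schema Registry API, and the schema, field names, and `read_with_schema` helper are assumptions for illustration.

```python
# Version 2 of a hypothetical schema adds "department" with a default value.
SCHEMA_V2 = {
    "id": None,               # required, so the default is never used
    "name": None,             # required, so the default is never used
    "department": "unknown",  # added in v2; the default keeps old records readable
}
REQUIRED = {"id", "name"}

def read_with_schema(record, schema=SCHEMA_V2, required=REQUIRED):
    """Return the record shaped to the schema, using defaults for missing fields."""
    missing = required - set(record)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return {field: record.get(field, default) for field, default in schema.items()}

old_record = {"id": 1, "name": "Alice"}                         # written under v1
new_record = {"id": 2, "name": "Bob", "department": "sales"}    # written under v2
upgraded = [read_with_schema(r) for r in (old_record, new_record)]
print(upgraded)
```

Adding an optional field with a default is the kind of change that older data survives; removing a required field would not be.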

    Database Performance Optimization

    • Indexing:
      • Avoids full table scans.
      • Enforces data uniqueness and integrity.
    • Partitioning:
      • Reduces the amount of data scanned.
      • Helps with data lifecycle management.
      • Enables parallel processing.
    • Compression:
      • Speeds up data transfer and reduces storage and disk reads.
      • Examples: GZIP, LZOP, BZIP2, ZSTD (Redshift examples).

    Data Sampling Techniques

    • Random Sampling:
      • Every item has an equal chance.
    • Stratified Sampling:
      • Divides the population into homogeneous subgroups (strata).
      • Randomly samples within each stratum.
      • Ensures representation of each subgroup.
    • Other techniques:
      • Systematic.
      • Cluster.
      • Convenience.
      • Judgmental.
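
The difference between random and stratified sampling can be sketched with Python's `random` module. The population, the "free"/"paid" strata, and the 10% fraction below are made-up values for illustration.

```python
import random
from collections import defaultdict

def random_sample(items, k, seed=0):
    """Simple random sampling: every item has an equal chance of selection."""
    rng = random.Random(seed)
    return rng.sample(items, k)

def stratified_sample(items, key, fraction, seed=0):
    """Group items into strata by `key`, then sample randomly within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 90 "free" users and 10 "paid" users.
population = [("free", i) for i in range(90)] + [("paid", i) for i in range(10)]

simple = random_sample(population, 10)
stratified = stratified_sample(population, key=lambda item: item[0], fraction=0.1)
paid_in_stratified = sum(1 for tier, _ in stratified if tier == "paid")
print(len(simple), len(stratified), paid_in_stratified)
```

A simple random sample of 10 may contain no "paid" users at all, while the stratified sample always includes the small subgroup.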

    Data Skew Mechanisms

    • Data skew refers to the unequal distribution of data across various nodes or partitions in distributed systems.
    • Causes:
      • Non-uniform distribution of data.
      • Inadequate partitioning strategy.
      • Temporal skew.
    • Importance:
      • Monitoring data distribution and alerting when skew issues arise.

    Addressing Data Skew

    • Adaptive Partitioning:
      • Dynamically adjusts partitioning based on data characteristics.
    • Salting:
      • Introduces a random factor to the data to distribute it more uniformly.
    • Regular Redistribution:
      • Redistributes the data based on its current distribution characteristics.
    • Sampling:
      • Uses a sample of the data to determine the distribution and adjust the processing strategy accordingly.
    • Custom Partitioning:
      • Defines custom rules or functions for partitioning data based on domain knowledge.
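
As a rough illustration of salting, the sketch below hashes keys to partitions with and without a random suffix. The partition count, number of salt buckets, and key names are assumptions chosen for the example, and `zlib.crc32` is used only as a stable stand-in for whatever hash a real engine applies.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 4   # hypothetical cluster size
SALT_BUCKETS = 8     # hypothetical number of salt values

def partition_for(key):
    """Assign a key to a partition using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def salted_key(key, rng):
    """Append a random salt so one hot key becomes several distinct keys."""
    return f"{key}#{rng.randrange(SALT_BUCKETS)}"

rng = random.Random(42)
keys = ["customer_1"] * 1000 + ["customer_2"] * 10  # customer_1 is a hot key

unsalted = Counter(partition_for(k) for k in keys)
salted = Counter(partition_for(salted_key(k, rng)) for k in keys)

print("largest partition without salting:", max(unsalted.values()))
print("largest partition with salting:   ", max(salted.values()))
```

Without salting, all records for the hot key land in one partition. When salted keys are used in a join, the other side of the join has to be expanded with every salt value so that matches are still found.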

    Data Validation and Profiling

    • Completeness:
      • Ensures all required data is present and no essential parts are missing.
      • Checks: Missing values, null counts, percentage of populated fields.
      • Importance:
        • Missing data can lead to inaccurate analyses and insights.
    • Consistency:
      • Ensures data values are consistent across datasets and do not contradict each other.
      • Checks: Cross-field validation, comparing data from different sources or periods.
      • Importance:
        • Inconsistent data can cause confusion and result in incorrect conclusions.
    • Accuracy:
      • Ensures data is correct, reliable, and represents what it is supposed to.
      • Checks: Comparing with trusted sources, validation against known standards or rules.
      • Importance:
        • Inaccurate data can lead to false insights and poor decision-making.
    • Integrity:
      • Ensures data maintains its correctness and consistency over its lifecycle and across systems.
      • Checks: Referential integrity (e.g., foreign key checks in databases), relationship validations.
      • Importance:
        • Ensures relationships between data elements are preserved, and data remains trustworthy over time.
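
Two of the checks above, a completeness check on populated fields and a referential integrity check, can be sketched in a few lines of Python. The sample tables and helper names are hypothetical.

```python
# Hypothetical sample data: one missing name and one dangling department reference.
employees = [
    {"id": 1, "name": "Alice", "department_id": 10},
    {"id": 2, "name": None, "department_id": 20},
    {"id": 3, "name": "Carol", "department_id": 99},  # no such department
]
departments = [{"department_id": 10}, {"department_id": 20}]

def completeness(rows, field):
    """Completeness: fraction of rows in which `field` is populated (not None)."""
    populated = sum(1 for row in rows if row.get(field) is not None)
    return populated / len(rows)

def orphaned_rows(child_rows, parent_rows, key):
    """Integrity: child rows whose `key` value has no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]

name_completeness = completeness(employees, "name")
orphans = orphaned_rows(employees, departments, "department_id")
print(f"name populated in {name_completeness:.0%} of rows")
print("rows failing the foreign key check:", [row["id"] for row in orphans])
```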

    SQL Review

    • Aggregation:
      • COUNT:
        • Example: SELECT COUNT(*) AS total_rows FROM employees;
      • SUM:
        • Example: SELECT SUM(salary) AS total_salary FROM employees;
      • AVG:
        • Example: SELECT AVG(salary) AS average_salary FROM employees;
      • MAX / MIN:
        • Example: SELECT MAX(salary) AS highest_salary FROM employees;
    • Aggregate with CASE:
      • Allows for applying multiple filters to what is being aggregated.
      • Example:
        SELECT
          COUNT(CASE WHEN salary > 70000 THEN 1 END) AS high_salary_count,
          COUNT(CASE WHEN salary BETWEEN 50000 AND 70000 THEN 1 END) AS medium_salary_count,
          COUNT(CASE WHEN salary < 50000 THEN 1 END) AS low_salary_count
        FROM employees;
        
    • Grouping:
      • Nested grouping, sorting.

    Pivoting

    • Pivoting is the act of turning row-level data into columnar data.
    • Example:
      • Using conditional aggregation to achieve pivoting.
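
A pivot via conditional aggregation can be tried out with Python's built-in `sqlite3` module. The `sales` table, its columns, and its values are invented for this sketch; the SQL itself uses only `SUM`, `CASE`, and `GROUP BY`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "Q1", 100),
        ("east", "Q1", 50),
        ("east", "Q2", 150),
        ("west", "Q1", 80),
        ("west", "Q2", 120),
    ],
)

# Each quarter (a row value) becomes its own column in the result.
pivoted = conn.execute(
    """
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1_total,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2_total
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for row in pivoted:
    print(row)
```

The input has one row per region and quarter; the output has one row per region with a column per quarter.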

    Joins

    • INNER JOIN:
      • Returns only the rows that have a match in both tables.
      • Example:
        SELECT *
        FROM employees
        INNER JOIN departments
        ON employees.department_id = departments.department_id;
        
    • LEFT OUTER JOIN:
      • Returns all the rows from the left table and the matched rows from the right table; right-table columns are NULL where there is no match.
      • Example:
        SELECT *
        FROM employees
        LEFT OUTER JOIN departments
        ON employees.department_id = departments.department_id;
        
    • RIGHT OUTER JOIN:
      • Returns all the rows from the right table and the matched rows from the left table; left-table columns are NULL where there is no match.
      • Example:
        SELECT *
        FROM employees
        RIGHT OUTER JOIN departments
        ON employees.department_id = departments.department_id;
        
    • FULL OUTER JOIN:
      • Returns all the rows from both tables, with NULLs on whichever side has no match.
      • Example:
        SELECT *
        FROM employees
        FULL OUTER JOIN departments
        ON employees.department_id = departments.department_id;
        
    • CROSS JOIN:
      • Returns the Cartesian product of both tables.
      • Example:
        SELECT *
        FROM employees
        CROSS JOIN departments;
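
The join types above can be compared on a tiny dataset using Python's built-in `sqlite3` module. The table contents are made up, and only INNER, LEFT OUTER, and CROSS joins are run here because RIGHT and FULL OUTER joins are available only in newer SQLite releases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (department_id INTEGER, department_name TEXT);
    CREATE TABLE employees (name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (10, 'Engineering'), (20, 'Sales');
    INSERT INTO employees VALUES ('Alice', 10), ('Bob', NULL);
    """
)

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Only employees with a matching department.
inner = run(
    "SELECT e.name, d.department_name FROM employees e "
    "INNER JOIN departments d ON e.department_id = d.department_id "
    "ORDER BY e.name"
)

# Every employee; department_name is NULL (None) when nothing matches.
left = run(
    "SELECT e.name, d.department_name FROM employees e "
    "LEFT OUTER JOIN departments d ON e.department_id = d.department_id "
    "ORDER BY e.name"
)

# Every employee paired with every department.
cross = run("SELECT e.name, d.department_name FROM employees e CROSS JOIN departments d")

print(inner)
print(left)
print(len(cross))
```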
        

    SQL Regular Expressions

    • Pattern matching:
      • Uses a regular expression operator (~), as found in PostgreSQL-style SQL dialects.
      • Example:
        SELECT * 
        FROM name 
        WHERE name ~ '^(fire|ice)';
        
      • This would select any rows where the name starts with "fire" or "ice". In PostgreSQL, ~ is case-sensitive; the ~* operator is the case-insensitive variant.
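
The same pattern can be explored outside a database with Python's `re` module. This is only an analogy for how the anchor and alternation behave, not the SQL engine itself, and the sample names are invented.

```python
import re

names = ["firefly", "Iceberg", "ice cream", "campfire"]

# ^ anchors the match at the start of the string; (fire|ice) matches either word.
pattern = r"^(fire|ice)"

case_sensitive = [n for n in names if re.search(pattern, n)]
case_insensitive = [n for n in names if re.search(pattern, n, re.IGNORECASE)]

print(case_sensitive)
print(case_insensitive)
```

"campfire" is excluded in both cases because "fire" does not appear at the start, and "Iceberg" matches only when case is ignored.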

    Git Review

    • Setting Up and Configuration:
      • git init: Initializes a new Git repository.
      • git config: Sets configuration values for user info, aliases, and more.
      • git config --global user.name "Your Name": Sets your name.
      • git config --global user.email "[email protected]": Sets your email.
    • Basic Commands:
      • git clone <url>: Clones (or downloads) a repository from an existing URL.
      • git status: Checks the status of your changes in the working directory.
      • git add <file>: Adds changes in the file to the staging area.
      • git add .: Adds all new and changed files to the staging area.
      • git commit -m "Commit message here": Commits the staged changes with a message.
      • git log: Views commit logs.
    • Branching with Git:
      • git branch: Lists all local branches.
      • git branch <branch>: Creates a new branch.
      • git checkout <branch>: Switches to a specific branch.
      • git checkout -b <branch>: Creates a new branch and switches to it.
      • git merge <branch>: Merges the specified branch into the current branch.
      • git branch -d <branch>: Deletes a branch.
    • Remote Repositories:
      • git remote add <name> <url>: Adds a remote repository.
      • git remote: Lists all remote repositories.
      • git push <remote> <branch>: Pushes a branch to a remote repository.
      • git pull <remote> <branch>: Pulls changes from a remote repository branch into the current local branch.
    • Undoing Changes:
      • git reset: Resets the staging area to match the most recent commit, without affecting the working directory.
      • git reset --hard: Resets the staging area and the working directory to match the most recent commit.
      • git revert <commit>: Creates a new commit that undoes all of the changes from a previous commit.

    Advanced Git

    • git stash temporarily saves changes that are not yet ready for a commit
    • git stash pop restores the most recently stashed changes
    • git rebase reapplies changes from one branch onto another, often used to integrate changes from one branch into another
    • git cherry-pick applies changes from a specific commit to the current branch

    Git Collaboration and Inspection

    • git blame shows who made changes to a file and when
    • git diff shows changes between commits, commit and working tree, etc.
    • git fetch fetches changes from a remote repository without merging them

    Git Maintenance and Data Recovery

    • git fsck checks the database for errors
    • git gc cleans up and optimizes the local repository
    • git reflog records when refs were updated in the local repository, useful for recovering lost commits

    Storage

    • Amazon S3 is one of the main building blocks of AWS, providing infinitely scaling storage
    • Many websites and AWS services use Amazon S3 as a backbone or integration

    Amazon S3 Section

    • Amazon S3 has various use cases, including:
      • Backup and storage
      • Disaster Recovery
      • Archive
      • Hybrid Cloud storage
      • Application hosting
      • Media hosting
      • Data lakes and big data analytics
      • Software delivery
      • Static website

    Amazon S3 - Buckets

    • Buckets are the top-level containers ("directories") that hold objects in Amazon S3
    • Bucket names must be globally unique, with a length of 3-63 characters, and follow specific naming conventions
    • Buckets are defined at the region level: S3 looks like a global service, but each bucket is created in a specific region

    Amazon S3 - Objects

    • Objects (files) have a key, which is the full path to the object
    • The key is composed of a prefix and an object name
    • There is no concept of directories within buckets, just keys with long names containing slashes
    • Object values are the content of the body, with a max size of 5TB
    • Objects can have metadata, tags, and version IDs
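
Because "folders" in S3 are just part of the key, the prefix and object name can be recovered with ordinary string handling. The sketch below works on plain strings and makes no AWS calls; the keys and helper names are hypothetical.

```python
def split_key(key):
    """Split an object key into (prefix, object name) at the last slash."""
    prefix, sep, name = key.rpartition("/")
    return prefix + sep, name

def keys_under(keys, prefix):
    """Emulate 'listing a folder': keep the keys that start with the prefix."""
    return sorted(k for k in keys if k.startswith(prefix))

# Hypothetical keys in one bucket.
keys = [
    "my_folder1/another_folder/my_file.txt",
    "my_folder1/report.csv",
    "my_file.txt",
]

print(split_key(keys[0]))
print(split_key(keys[2]))
print(keys_under(keys, "my_folder1/"))
```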

    Amazon S3 – Security

    • Security can be User-Based or Resource-Based
    • User-Based security uses IAM policies to determine which API calls are allowed for a specific user
    • Resource-Based security uses bucket policies and ACLs to control access to buckets and objects
    • Encryption can be used to encrypt objects in Amazon S3 using encryption keys

    Amazon S3 – Bucket Policies

    • Bucket policies are JSON-based policies that define permissions for a bucket
    • Policies can be used to grant public access to a bucket, force objects to be encrypted at upload, or grant access to another account
    • Bucket policies have resources, effects, actions, and principals
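
The four elements listed above can be seen in a small policy document. The sketch below builds a public-read policy as a Python dictionary and serializes it to JSON; the bucket name `examplebucket` and the `Sid` label are placeholders, and nothing is sent to AWS.

```python
import json

BUCKET = "examplebucket"  # hypothetical bucket name

public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",                       # optional statement label
            "Effect": "Allow",                         # Allow or Deny
            "Principal": "*",                          # who the statement applies to
            "Action": ["s3:GetObject"],                # which API calls
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],  # which objects
        }
    ],
}

policy_json = json.dumps(public_read_policy, indent=2)
print(policy_json)
```

The trailing `/*` in the resource ARN makes the statement apply to objects inside the bucket and not to the bucket itself.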

    Amazon S3 – Versioning

    • Versioning allows multiple versions of an object to be stored in Amazon S3
    • Versioning is enabled at the bucket level, and it is best practice to version buckets
    • Versioning protects against unintended deletes and allows for easy rollbacks to previous versions

    Amazon S3 – Replication

    • Replication allows objects to be copied from one bucket to another
    • Versioning must be enabled on both the source and destination buckets
    • Cross-Region Replication (CRR) and Same-Region Replication (SRR) are supported
    • CRR use cases: compliance, lower latency access, replication across accounts
    • SRR use cases: log aggregation, live replication between production and test accounts

    Amazon S3 – Storage Classes

    • Amazon S3 has various storage classes, including:
      • S3 Standard
      • S3 Standard-Infrequent Access
      • S3 One Zone-Infrequent Access
      • S3 Glacier Instant Retrieval
      • S3 Glacier Flexible Retrieval
      • S3 Glacier Deep Archive
      • S3 Intelligent Tiering

    S3 Durability and Availability

    • Durability describes how unlikely it is that a stored object is lost over a year; S3 is designed for 99.999999999% (11 9's) durability of objects across multiple AZs
    • Availability measures how readily available a service is, with varying levels of availability depending on storage class
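
To get a feel for what 11 9's means, the arithmetic can be worked through with exact fractions. The sketch assumes the durability figure is read as a per-object, per-year survival probability, and the count of 10,000,000 stored objects is just an example value.

```python
from fractions import Fraction

# 99.999999999% expressed exactly, avoiding floating-point rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)
objects_stored = 10_000_000  # example value

annual_loss_probability = 1 - DURABILITY  # per object, per year
expected_losses_per_year = objects_stored * annual_loss_probability
years_per_lost_object = 1 / expected_losses_per_year

print("expected objects lost per year:", float(expected_losses_per_year))
print("average years per lost object: ", years_per_lost_object)
```

Under these assumptions, storing 10,000,000 objects corresponds to losing about one object every 10,000 years on average.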

    S3 Storage Classes Comparison

    • Comparison of various storage classes, including durability, availability, and minimum storage duration

    Description

    This quiz covers the fundamentals of data engineering, including storage, database, migration, and more, as part of the AWS Certified Data Engineer Associate Course.