DynamoDB and Athena Quiz


Questions and Answers

When is it advised to use an Exponential Backoff Retry in DynamoDB?

  • When using Conditional Writes to ensure atomicity
  • When dealing with hot partitions or very large items (correct)
  • When using the UpdateItem API call to increment an atomic counter
  • When using the GetItem API call with strongly consistent reads to retrieve items
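
The retry strategy in the correct answer spaces attempts out so a hot partition is not hammered. Below is a minimal, self-contained sketch of exponential backoff with full jitter in Python; the `base`, `cap`, and retry-count values are illustrative assumptions, not AWS SDK defaults.

```python
import random

def backoff_delays(max_retries=5, base=0.1, cap=20.0, seed=None):
    """Yield exponentially growing delays with full jitter.

    Each delay is drawn uniformly from [0, min(cap, base * 2**n)],
    so the upper bound doubles per attempt while randomness spreads
    retries out and avoids thundering herds on a hot partition.
    """
    rng = random.Random(seed)
    for attempt in range(max_retries):
        yield rng.uniform(0, min(cap, base * 2 ** attempt))

delays = list(backoff_delays(max_retries=4, base=0.1, seed=42))
print(delays)  # four delays whose upper bound doubles each attempt
```

In practice the AWS SDKs already implement retries with backoff; tuning retry behavior is usually an SDK configuration setting rather than a hand-rolled loop.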

What is the main advantage of using DynamoDB On Demand mode?

  • It allows for greater flexibility and scalability, adapting to unpredictable workloads. (correct)
  • It offers a more efficient use of resources compared to Provisioned mode.
  • It provides higher read and write throughput compared to Provisioned mode.
  • It offers a cost-effective option for large, complex workloads.

Which API call is used to create a new item or completely replace an existing one in DynamoDB?

  • UpdateItem
  • Query
  • PutItem (correct)
  • GetItem

What is the primary purpose of Conditional Writes in DynamoDB?

Answer: Ensuring data consistency and preventing race conditions during concurrent access to items.
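
The idea behind conditional writes can be sketched with a plain in-memory dictionary standing in for a table. This is an illustrative simulation of optimistic locking, not the DynamoDB API itself; the `version` attribute and the helper names are assumptions for the example.

```python
class ConditionalWriteError(Exception):
    """Raised when the write's precondition does not hold."""

def conditional_put(table, key, new_item, expected_version):
    """Apply the write only if the stored item's version matches.

    Mirrors the idea behind DynamoDB's ConditionExpression: a
    concurrent writer that bumped the version first causes this
    write to fail instead of silently overwriting the item.
    """
    current = table.get(key)
    if current is not None and current["version"] != expected_version:
        raise ConditionalWriteError("version mismatch")
    table[key] = dict(new_item, version=expected_version + 1)

table = {"user#1": {"name": "Ada", "version": 1}}
conditional_put(table, "user#1", {"name": "Ada Lovelace"}, expected_version=1)
print(table["user#1"])  # version is now 2
try:
    # a stale writer still believes the version is 1
    conditional_put(table, "user#1", {"name": "stale write"}, expected_version=1)
except ConditionalWriteError:
    print("rejected stale write")
```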

Which DynamoDB API call allows retrieving items based on a KeyConditionExpression, requiring a partition key value and optionally a sort key value?

Answer: Query

What is the difference between eventually consistent reads and strongly consistent reads in DynamoDB?

Answer: Strongly consistent reads come with a performance penalty: they consume more RCUs and may take longer to complete.

Which of the following is NOT a recommended strategy for managing workloads in DynamoDB?

Answer: Creating a single, dedicated queue for all query requests, regardless of their type or duration, for simplicity.

What is the primary function of DynamoDB Accelerator (DAX)?

Answer: It enhances the performance of DynamoDB operations by providing a caching layer.

Why would one choose to store data in a small number of large files rather than a large number of small files?

Answer: A small number of large files is generally more performant, yielding faster queries.

Which data formats are directly supported by Athena?

Answer: JSON and CSV

Which of the following techniques does Athena use to enhance query performance?

Answer: All of the above, including predicate pushdown.

Which of the following is a benefit of partitioning large datasets in Athena?

Answer: All of the above; partitioning improves performance and reduces costs.

Which of the following is NOT a key feature of Athena?

Answer: Real-time data updates

What is the benefit of using 'MSCK REPAIR TABLE' command after adding partitions to your table?

Answer: It ensures that Athena can properly recognize the new partitions.
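
For partitions laid out in Hive style (`key=value` path segments), `MSCK REPAIR TABLE` can discover them automatically; otherwise each partition is registered with `ALTER TABLE ... ADD PARTITION`. Here is a small sketch of building those locations and statements in Python; the bucket and table names are made up.

```python
def partition_location(bucket, table, **parts):
    """Build a Hive-style partition prefix,
    e.g. s3://bucket/table/year=2024/month=03/."""
    suffix = "/".join(f"{k}={v}" for k, v in parts.items())
    return f"s3://{bucket}/{table}/{suffix}/"

def add_partition_ddl(table, location, **parts):
    """Render the ALTER TABLE statement that registers one partition."""
    spec = ", ".join(f"{k}='{v}'" for k, v in parts.items())
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")

loc = partition_location("my-data-lake", "events", year=2024, month="03")
print(loc)
print(add_partition_ddl("events", loc, year=2024, month="03"))
```

With the `key=value` layout above, running `MSCK REPAIR TABLE events` in Athena would pick the partition up without the explicit DDL.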

What does the 'splittable' feature of data formats like Parquet and ORC refer to?

Answer: Data can be processed in parallel by multiple processors.

Which of the following is a key feature of Apache Iceberg?

Answer: ACID transactions

What is the purpose of EMR?

Answer: To provide a managed service for processing and analyzing large datasets

Which of the following statements is TRUE about EMR notebooks?

Answer: EMR notebooks allow for interactive analysis using EMR.

Which of these is NOT a benefit of using EMR?

Answer: Support for automatic failover for master nodes.

Which of the following is a valid scaling strategy for EMR clusters?

Answer: Adding task nodes before core nodes.

What is the difference between transient and long-running clusters in EMR?

Answer: Transient clusters are terminated automatically, while long-running clusters require manual termination.

Why is termination protection enabled by default for long-running EMR clusters?

Answer: To prevent accidental termination and data loss.

What is the primary role of the master node in an EMR cluster?

Answer: To manage the cluster, track tasks, and monitor health.

What is the implication of using reserved instances on a long-running EMR cluster?

Answer: It reduces the cost of running the cluster.

What is the maximum throughput per shard for Kinesis Data Streams (KDS) when using standard consumers?

Answer: 2 MB/s

Which of the following is a benefit of using Kinesis Data Firehose (KDF) over Kinesis Data Streams (KDS)?

Answer: Fully managed and serverless

What is the maximum amount of data that can be returned in a single GetRecords call for Kinesis Data Streams (KDS)?

Answer: 10 MB

Which of the following can be used for data transformation within Kinesis Data Firehose?

Answer: AWS Lambda

What is the purpose of the Kinesis Client Library (KCL)?

Answer: Reading and de-aggregating records from Kinesis Data Streams.

Which of the following is NOT a supported target destination for Kinesis Data Firehose?

Answer: Amazon RDS and Amazon Aurora

In the context of Kinesis Data Firehose, what is the purpose of the 'buffer'?

Answer: To optimize the performance of the firehose by batching records.
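
The size-or-time buffering rule can be sketched in a few lines of Python. This is an illustrative model of the behavior Firehose's buffering hints describe, not Firehose itself; the thresholds and the explicit `now` parameter are assumptions made so the example is testable.

```python
class Buffer:
    """Accumulate records and flush when either a size threshold or
    an age threshold is reached: the same size-or-time rule that
    Firehose buffering hints express."""

    def __init__(self, max_bytes=1024, max_age_s=60.0):
        self.max_bytes, self.max_age_s = max_bytes, max_age_s
        self.records, self.size, self.opened_at = [], 0, None
        self.flushed = []  # stand-in for delivery to a destination

    def add(self, record: bytes, now: float):
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.records:
            self.flushed.append(list(self.records))
        self.records, self.size, self.opened_at = [], 0, None

buf = Buffer(max_bytes=10, max_age_s=60.0)
buf.add(b"aaaa", now=0.0)      # 4 bytes buffered, no flush yet
buf.add(b"bbbbbbb", now=1.0)   # 11 bytes total -> size-triggered flush
buf.add(b"c", now=100.0)       # new buffer opens at t=100
buf.add(b"d", now=170.0)       # buffer is 70s old -> age-triggered flush
print(buf.flushed)  # [[b'aaaa', b'bbbbbbb'], [b'c', b'd']]
```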

When should you use Enhanced Fan Out consumers instead of Standard consumers for Kinesis Data Streams?

Answer: When the latency of data processing needs to be minimized.

Which of the following is a common reason why a Kinesis Data Firehose delivery might fail?

Answer: All of the above.

What is the key difference between Kinesis Data Streams and Kinesis Data Firehose in terms of data storage?

Answer: KDS stores data for 1–365 days, while Firehose does not store data.

What is the primary function of the Kinesis Agent?

Answer: It allows you to monitor log files and send them to Kinesis Data Streams.

Which of the following is NOT a feature of the Kinesis Producer Library (KPL)?

Answer: Built-in compression for data.

When might using the AWS SDK directly be preferable to the Kinesis Producer Library?

Answer: When your application cannot tolerate the added delay introduced by the KPL's RecordMaxBufferedTime setting.

What are the two API options provided by the AWS SDK for sending data to Kinesis Streams?

Answer: PutRecord and PutRecords.

What is the primary advantage of using the PutRecords API (compared to PutRecord)?

Answer: It enables batching, increasing throughput and reducing HTTP requests.
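
A producer using PutRecords typically groups records into batches before each call. The sketch below is a plain-Python chunker with no AWS SDK involved; 500 is the documented per-call record limit for PutRecords.

```python
def chunk_records(records, max_records=500):
    """Split an iterable of records into PutRecords-sized batches.

    500 is the maximum number of records a single Kinesis
    PutRecords request may carry."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == max_records:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly partial batch

batches = list(chunk_records(range(1203), max_records=500))
print([len(b) for b in batches])  # [500, 500, 203]
```

Each resulting batch would be the `Records` payload of one PutRecords call, turning 1203 HTTP requests into 3.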

Which of these scenarios best describes a use case for the AWS SDK (instead of the Kinesis Producer Library)?

Answer: An application that requires low-latency requests and cannot tolerate the KPL's buffering delay.

Which of the following is NOT a managed AWS source for Kinesis Data Streams?

Answer: Amazon S3

What are the primary benefits of using the Kinesis Producer Library (KPL) for building high-performance, long-running data producers?

Answer: KPL offers advanced features such as automated retries, batching, and asynchronous operation, enhancing performance and reliability.

Which of the following can the Make_struct function create?

Answer: A structure containing data of different types

What does the Project function do?

Answer: Projects every data type to a single specified type, for example, project:string

What are some of the benefits of Glue job bookmarks?

Answer: They allow you to process only new data and avoid reprocessing old data.

What are some ways to trigger Glue jobs?

Answer: Using a scheduled event like CRON, using CloudWatch events, or using an SNS notification.

What are the primary benefits of Lake Formation?

Answer: It makes it easy to set up, secure, and manage data lakes, and it handles data loads, monitoring, and partitioning.

What are some of the troubleshooting notes for Lake Formation?

Answer: IAM permissions on KMS encryption keys are needed for encrypted data catalogs in Lake Formation; cross-account Lake Formation permissions can cause issues; and Lake Formation does not support manifests in Athena or Redshift queries.

What are some of the features of Governed Tables in Lake Formation?

Answer: Support for ACID transactions across multiple tables, automatic data compaction, and support for streaming data (Kinesis).

What are the different types of data permissions that can be assigned in Lake Formation?

Answer: Databases, tables, or columns with policy tags

What are the different levels of data filters that can be applied in Lake Formation?

Answer: Row-level security, column-level security, and cell-level security

What is Athena?

Answer: A serverless, interactive querying service for S3 data

Flashcards

Hot Partitions

Partitions that receive more requests than others, causing performance issues.

Exponential Backoff

A retry strategy that increases wait time between attempts after failures.

DynamoDB Accelerator (DAX)

A caching service for DynamoDB that improves read performance.

On-Demand Mode

Database mode that auto-scales reads/writes based on workloads.


RRU and WRU

Read Request Units and Write Request Units: the billing metrics for reads and writes in DynamoDB's on-demand mode.


PutItem

API call to create a new item or replace an existing one.


GetItem

API call to read an item using its primary key.


Workload Management (WLM)

System to prioritize and manage query execution for efficiency.


Parquet and ORC

Columnar storage formats that optimize data storage and retrieval.


Partitioning

Dividing large datasets into smaller parts to improve performance and reduce costs.


MSCK REPAIR TABLE

A command (from Hive, also used in Athena) that scans a table's storage location and adds any missing Hive-style partitions to the table metadata.


ACID transactions

Atomicity, Consistency, Isolation, Durability properties supported by Apache Iceberg.


Time travel operations

The ability to query historical snapshots of a table, which can recover recently deleted data via a SELECT statement.


Serverless query execution

Execution of queries without managing servers, improving scalability.


Federated queries

Ability to query across different data sources, including non-S3 sources.


Athena Query Editor

An interactive SQL editor with syntax highlighting and auto-completion features.


SDK

A software development kit that allows code or CLI to send data to Kinesis Streams.


PutRecords API

An API used to send multiple records to Kinesis Streams efficiently by batching them.


ProvisionedThroughputExceeded

An error indicating exceeding limits when sending data to Kinesis Streams.


Kinesis Producer Library (KPL)

A library for high-performance producers with automatic retries and batching capabilities.


Batching

Collecting multiple records and sending them in one API call to improve throughput and reduce costs.


RecordMaxBufferedTime

A KPL setting that trades added latency for better batching: records are buffered for up to this time before being sent.


Kinesis Agent

A tool for monitoring log files and sending them to Kinesis Data Streams, built on KPL.


AWS SDK vs KPL

AWS SDK is direct for lower latency, while KPL allows batching for higher throughput with delays.


Make_cols

Creates new columns for fields with the same name, e.g., price_double.


Cast

Changes all values to a specified data type.


Make_struct

Creates a structure that contains each data type.


Job bookmarks

Persists state from job runs to track processed data.


CloudWatch Events

Triggers Lambda functions or notifications on ETL outcomes.


Lake Formation

Builds secure data lakes and integrates with Glue services.


Governed tables

Supports ACID transactions and automatic data compaction.


Data permissions

Controls access to data using IAM roles or policy tags.


Cell level security

Specific permissions for certain columns and rows in data.


Athena

Enables serverless interactive queries on S3 data.


Kinesis Data Firehose

A fully managed service for loading streaming data into data lakes and warehouses in near real time; it buffers data in transit rather than storing it.


Kinesis Client Library (KCL)

Helps to read and process records from Kinesis streams by enabling checkpointing and shard management.


Enhanced Fan Out

A feature that gives each consumer its own dedicated throughput per shard, so multiple consumers can read the same stream at high throughput without contention or added latency.


Checkpointing

A method to resume progress in the processing of streaming data, ensuring that data isn't reprocessed unnecessarily.


Kinesis Connector Library

Utilizes KCL for writing data to S3, DynamoDB, Redshift, and OpenSearch.


Data buffering

Accumulating records in a temporary storage area before sending them to their destinations based on size or time.


Record de-aggregation

The process of separating combined records into individual records for processing.


Lambda

A serverless compute service that can process data in real time or trigger notifications.


Shard

A shard is a uniquely identified group of data records in a Kinesis stream, with specific throughput limitations.


Throughput

The amount of data that can be processed within a given timeframe in Kinesis services.


Athena SQL

A query service that makes it easy to analyze data in S3 using SQL.


DPU

Data Processing Unit, a measure of compute capacity in AWS services.


Elastic Map Reduce (EMR)

A managed framework using Hadoop on EC2 to process vast data sets.


EMR Notebooks

Managed environment for data preparation, visualization, and collaboration.


Cluster

A collection of EC2 instances running Hadoop for processing jobs.


Transient Cluster

A cluster that terminates after its tasks complete, saving costs.


Master Node

The central node in a cluster that manages tasks and monitors health.


Scaling Strategies

Methods to add/remove nodes in a cluster based on load.


Study Notes

Data Characteristics

  • Structured data is organized in a defined manner or schema, found in relational databases. It's easily queryable and organized in rows and columns. Examples include database tables, CSV files, and Excel spreadsheets.
  • Unstructured data lacks a predefined structure or schema, making it difficult to query without pre-processing. It comes in various formats, such as text files without a fixed format, video/audio files, images, emails, and word documents.
  • Semi-structured data is not as organized as structured data but has some structure, using tags, hierarchies, or other patterns. It's more flexible than structured but less chaotic than unstructured. Examples include XML, JSON, email headers, and log files.
  • Key properties of data include volume (amount of data), velocity (speed at which data is generated), and variety (different types and sources of data).

Data Repositories

  • Data warehouses are centralized repositories designed for complex queries and analysis. Data is stored in structured format, cleaned, transformed, and loaded (ETL). Often uses star or snowflake schema. Optimized for read-heavy operations. Example: Amazon Redshift.
  • Data lakes are repositories holding vast amounts of raw data in native, unpreprocessed format. They support batch, real-time, and stream processing. Example: Amazon S3.
  • Data Lakehouses combine the best of Data Warehouses and Data Lakes. They support both structured and unstructured data.
  • Data Mesh is an approach where individual teams own data products in a given domain with various uses.

Data Pipelines

  • Extract raw data from source systems, ensuring data integrity.
  • Transform data into suitable formats, handle missing values, and do transformations.
  • Load data into the target data warehouse or repository, ensuring data integrity during loading.
  • Data pipeline tasks are done automatically in a reliable manner, using services like AWS Glue.
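
The three steps above can be sketched as a toy ETL pipeline in Python. Everything here (the source rows, field names, and the in-memory "warehouse") is a made-up stand-in for real systems such as AWS Glue jobs.

```python
def extract():
    # Stand-in for reading raw rows from a source system.
    return [
        {"id": 1, "amount": "19.99", "country": "us"},
        {"id": 2, "amount": None, "country": "DE"},
    ]

def transform(rows):
    out = []
    for row in rows:
        out.append({
            "id": row["id"],
            # Handle missing values with an explicit default.
            "amount": float(row["amount"]) if row["amount"] is not None else 0.0,
            # Normalize inconsistent formatting.
            "country": row["country"].upper(),
        })
    return out

def load(rows, target):
    # Stand-in for writing to a data warehouse or repository.
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```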

Data Sources and Formats

  • JDBC (Java Database Connectivity): platform independent, language dependent.
  • ODBC (Open Database Connectivity): platform dependent, language independent.
  • Common data formats: CSV, JSON, Avro, Parquet. Each has its own use cases (e.g., CSV for small datasets, Avro for big data).

Data Modeling

  • Star Schema: a central fact table joined to dimension tables that branch out from it.
  • Data Lineage: Visualization of data flow from source to destination.
  • Schema Evolution: Adaptability of schema to changing requirements.

Data Validation and Profiling

  • Check completeness (missing values), consistency (cross-field validation), accuracy (comparing with trusted sources), and integrity (validating against rules/standards)
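
A minimal sketch of such checks in Python; the field names and rules are invented for illustration.

```python
def validate(rows, required, rules):
    """Return a list of (row_index, issue) findings.

    - completeness: every required field present and non-null
    - integrity: each field passes its rule (a predicate)
    """
    findings = []
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                findings.append((i, f"missing {field}"))
        for field, ok in rules.items():
            value = row.get(field)
            if value is not None and not ok(value):
                findings.append((i, f"invalid {field}: {value!r}"))
    return findings

rows = [{"age": 34, "email": "a@example.com"},
        {"age": -5, "email": None}]
issues = validate(rows,
                  required=["age", "email"],
                  rules={"age": lambda v: 0 <= v <= 120})
print(issues)  # row 1 is missing email and has an out-of-range age
```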

DynamoDB

  • A fully managed NoSQL database service for key-value and document storage.
  • Provides fast and predictable performance with seamless scalability.
  • Supports various data types (scalar, document, set).
  • Offers low-cost and auto-scaling options.

Read/Write Capacity Modes

  • Manage table capacity for read/write throughput.
  • Data stored in partitions (units of storage spread across servers, each with its own share of throughput).
  • WCUs and RCUs—Specify read/write units.
  • Capacity modes include Provisioned (predefined read/write throughput) and On-Demand (pay-per-request, scales automatically).

Basic API calls

  • CRUD operations for data manipulation in DynamoDB.
  • Conditional writes (e.g., PutItem).
  • Reading data (e.g., GetItem, Scan).
  • Deleting data (e.g. DeleteItem).
  • Batch operations to optimize operations.

Indexes

  • Local Secondary Indexes (LSIs) —Create alternative sort keys to the base table for data that can be more easily and quickly queried. Uses WCUs/RCUs of main table.
  • Global Secondary Indexes (GSIs) - Alternative primary keys that speed up queries on non-key attributes.

DynamoDB Accelerator (DAX)

  • A fully managed cache for DynamoDB.
  • Provides high-speed access to frequently read data by using in-memory caches.

Stream records

  • Data retention can last up to 24 hours.
  • Options to choose what is written to stream (metadata, new image or older image of items).

Data Sources and Formats (continued)

  • Avro, Parquet, and other data formats, each with attributes and use cases relevant to different data types.

Miscellaneous

  • Data modeling (star, snowflake, etc.) relevant details, like primary and foreign keys.
  • Data validation and profiling: Checking for completeness, consistency, accuracy and integrity.
  • Data sampling techniques like random, stratified, and systematic sampling.
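
The three sampling techniques above can be sketched in a few lines of Python; the example data and group key are made up.

```python
import random

def random_sample(rows, k, seed=0):
    """Simple random sampling: every row equally likely."""
    return random.Random(seed).sample(rows, k)

def systematic_sample(rows, step, start=0):
    """Systematic sampling: every step-th row from a start offset."""
    return rows[start::step]

def stratified_sample(rows, key, k_per_stratum, seed=0):
    """Stratified sampling: sample within each group separately."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    return {s: rng.sample(members, min(k_per_stratum, len(members)))
            for s, members in strata.items()}

data = [{"id": i, "group": "A" if i % 2 == 0 else "B"} for i in range(10)]
print([r["id"] for r in systematic_sample(data, step=3)])  # [0, 3, 6, 9]
print([r["id"] for r in random_sample(data, 3, seed=1)])
print({s: [r["id"] for r in v]
       for s, v in stratified_sample(data, key=lambda r: r["group"],
                                     k_per_stratum=2).items()})
```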

Relational Database Service (RDS) :

  • A managed relational database service, frequently used for smaller data volumes; handles backups and software patching automatically.
  • Full ACID compliance, so data consistency and reliability are guaranteed
  • Suitable for transactional apps and multi-user environments.

RDS Read Replicas

  • A feature that replicates the primary to other locations within the same or different regions
  • Optimized for scaling read-heavy workloads, which allows the main database to manage writing workload more efficiently and improves performance

Aurora

  • A fully managed MySQL- and PostgreSQL-compatible database service.
  • Offers high performance.
  • Offers compatibility with various data types.

DocumentDB

  • NoSQL database compatible with MongoDB.
  • Good for storing JSON data.
  • Offers automatic scaling, high availability, and replication.

MemoryDB for Redis

  • In-memory database
  • Optimized for fast read/write performance and used in Microservices architectures to minimize latency.

Keyspaces

  • A managed Apache Cassandra-compatible database service.
  • Allows you to define the structure of your data.

Neptune

  • Fully managed graph database for complex and highly interconnected data
  • Enables real time querying and analysis on graph data.

Timestream

  • A time series database.
  • Suitable for processing large amounts of time-stamped data.

Redshift

  • A massively parallel processing (MPP) database service.
  • Optimized for high-performance data warehousing use cases.
  • Ideal for supporting complex analytical work and data warehousing modernization.

Importing/Exporting Data

  • Use the COPY command to efficiently load large amounts of data into Redshift.
  • Use the UNLOAD command to unload data from a Redshift table.

Workload Management (WLM)

  • Workload management in Redshift.
  • Allows prioritization of certain types of queries by placing them into queues to avoid overloading the system.

Serverless

  • Auto scaling for workloads, pay-only usage, and easier management.
  • Suitable when workloads are unpredictable or variable - high activity sometimes, lower activity other times.

Data Sharing

  • Enables secure sharing of data across Redshift clusters.

Other Tools/Features

  • Query Performance Insights dashboards for understanding query performance.
  • Redshift Advisor, which suggests ways to run analytical queries more efficiently.
  • Window functions.

Materialized Views

  • Pre-calculated query results stored for efficient retrieval; useful for complex reports and can improve reporting performance.

Miscellaneous (continued)

  • Specific details on retry mechanisms/handling exceptions and issues or common tasks.

DataSync

  • Transfers large datasets between on-premises and AWS storage.
  • Useful for disaster recovery or for data lakes use cases.

Snow Family

  • Highly secure and portable devices for moving large amounts of data.
  • Use cases include data migration/data warehousing.

Edge Computing Services

  • Process data near the edge instead of routing it to the cloud.
  • Use cases include real-time analytics, media streaming, and machine learning at the Edge.

Transfer Family

  • Fully managed service for file transfers in AWS.
  • Supports FTP, FTPS, and SFTP protocols.

EC2

  • Virtual servers.
  • Manual provisioning or auto-scaling.

Lambda

  • Serverless compute service.
  • Ideal for running code in response to events.
  • Use cases: processing data from different sources, building microservices architecture and event-driven apps.

Integration

  • Different services within AWS integrate well with each other.

SQS

  • A secure, fully managed message queue service for asynchronous messaging; standard queues provide at-least-once delivery, while FIFO queues provide exactly-once processing.
  • A good choice for decoupling applications and buffering high volumes of requests.

FIFO Queue

  • FIFO queues guarantee message ordering and exactly-once processing.

SNS

  • A fully managed pub/sub messaging service.
  • Good for asynchronous messaging (push-based instead of pull-based).

Step Functions

  • A workflow orchestrator.
  • Coordinates AWS services and can include steps for processes such as deployments or ETL pipelines.

AppFlow

  • A fully managed integration service.
  • Transfers data between different SaaS apps and AWS services.

EventBridge

  • A managed event bus where data and messages are routed to designated apps/services.
  • Useful for enabling event-driven architectures.

Amazon Managed Workflows for Apache Airflow (MWAA)

  • Fully managed service to perform tasks based on Python-defined workflows (DAGs).
  • Can scale and manage airflow tasks.

General

  • Principles of least privilege and best practices

CloudTrail

  • Logging and security service.
  • For auditing, compliance, and analyzing operational events in AWS.

CloudWatch

  • Tracks and monitors data from AWS services and applications.
  • Ideal for performance monitoring, dashboards, notifications, and debugging.

Cost Explorer, Budgets

  • Provides high-level analysis of costs.
  • Budgets can set spending thresholds, with notifications when spending approaches or exceeds them.

CloudFormation

  • IaC service to manage and deploy AWS resources through templates

Secrets Manager

  • For securely storing and managing sensitive data, such as passwords or API keys.
  • Data is encrypted and kept securely.

WAF

  • Security service to protect web applications.

Shield

  • Protects against Distributed Denial of Service (DDoS) attacks.

Networking

  • Essential concepts of VPCs, Subnets, Route tables, Network ACLs and Security groups are covered.
  • Concepts of PrivateLink, Direct Connect (DX), Internet Gateway, NAT Gateway are covered for secure communication/networks.

Data

  • Structured, Unstructured, Semi-structured data.

Key Management Service (KMS)

  • Manage encryption keys for data stored in various AWS services

Key Types

  • Symmetric/asymmetric keys

Macie

  • Automates the discovery of sensitive data in AWS environment, mostly S3 data.

Additional Concepts

  • AWS services like EFS, S3, Lambda, DynamoDB, Glue and their use cases

