Questions and Answers
When is it advised to use an Exponential Backoff Retry in DynamoDB?
- When using Conditional Writes to ensure atomicity
- When dealing with hot partitions or very large items (correct)
- When using the UpdateItem API call to increment an atomic counter
- When using the GetItem API call with SCR to retrieve items
What is the main advantage of using DynamoDB On Demand mode?
- It allows for greater flexibility and scalability, adapting to unpredictable workloads. (correct)
- It offers a more efficient use of resources compared to Provisioned mode.
- It provides higher read and write throughput compared to Provisioned mode.
- It offers a cost-effective option for large, complex workloads.
Which API call is used to create a new item or completely replace an existing one in DynamoDB?
- UpdateItem
- Query
- PutItem (correct)
- GetItem
What is the primary purpose of Conditional Writes in DynamoDB?
Which DynamoDB API call allows retrieving items based on a KeyConditionExpression, requiring a partition key value and optionally a sort key value?
What is the difference between an eventually consistent read and a strongly consistent read (SCR) in DynamoDB?
Which of the following is NOT a recommended strategy for managing workloads in DynamoDB?
What is the primary function of DynamoDB Accelerator (DAX)?
Why would one choose to store data in a large number of small files rather than a small number of large files?
Which data format is directly supported by Athena?
Which of the following techniques does Athena use to enhance query performance?
Which of the following is a benefit of partitioning large datasets in Athena?
Which of the following is NOT a key feature of Athena?
What is the benefit of using 'MSCK REPAIR TABLE' command after adding partitions to your table?
What does the 'splittable' feature of data formats like Parquet and ORC refer to?
Which of the following is a key feature of Apache Iceberg?
What is the purpose of EMR?
Which of the following statements is TRUE about EMR notebooks?
Which of these is NOT a benefit of using EMR?
Which of the following is a valid scaling strategy for EMR clusters?
What is the difference between transient and long-running clusters in EMR?
Why is termination protection enabled by default for long-running EMR clusters?
What is the primary role of the master node in an EMR cluster?
What is the implication of using reserved instances on a long-running EMR cluster?
What is the maximum throughput per shard for Kinesis Data Streams (KDS) when using standard consumers?
Which of the following is a benefit of using Kinesis Data Firehose (KDF) over Kinesis Data Streams (KDS)?
What is the maximum number of records that can be returned in a single GetRecords call for Kinesis Data Streams (KDS)?
Which of the following can be used for data transformation within Kinesis Data Firehose?
What is the purpose of the Kinesis Client Library (KCL)?
Which of the following is NOT a supported target destination for Kinesis Data Firehose?
In the context of Kinesis Data Firehose, what is the purpose of the 'buffer'?
When should you use Enhanced Fan Out consumers instead of Standard consumers for Kinesis Data Streams?
Which of the following is a common reason why a Kinesis Data Firehose delivery might fail?
What is the key difference between Kinesis Data Streams and Kinesis Data Firehose in terms of data storage?
What is the primary function of the Kinesis Agent?
Which of the following is NOT a feature of the Kinesis Producer Library (KPL)?
When might using the AWS SDK directly be preferable to the Kinesis Producer Library?
What are the two API options provided by the AWS SDK for sending data to Kinesis Streams?
What is the primary advantage of using the PutRecords API (compared to PutRecord)?
Which of these scenarios best describes a use case for the AWS SDK (instead of the Kinesis Producer Library)?
Which of the following is NOT a managed AWS source for Kinesis Data Streams?
What are the primary benefits of using the Kinesis Producer Library (KPL) for building high-performance, long-running data producers?
Which of the following can the Make_struct function create?
What does the Project function do?
What are some of the benefits of Glue job bookmarks?
What are some ways to trigger Glue jobs?
What are the primary benefits of Lake Formation?
What are some of the troubleshooting notes for Lake Formation?
What are some of the features of Governed Tables in Lake Formation?
What are the different types of data permissions that can be assigned in Lake Formation?
What are the different levels of data filters that can be applied in Lake Formation?
What is Athena?
Flashcards
Hot Partitions
Partitions that receive more requests than others, causing performance issues.
Exponential Backoff
A retry strategy that increases wait time between attempts after failures.
DynamoDB Accelerator (DAX)
A caching service for DynamoDB that improves read performance.
On-Demand Mode
RRU and WRU
PutItem
GetItem
Workload Management (WLM)
Parquet and ORC
Partitioning
MSCK REPAIR TABLE
ACID transactions
Time travel operations
Serverless query execution
Federated queries
Athena Query Editor
SDK
PutRecords API
ProvisionedThroughputExceeded
Kinesis Producer Library (KPL)
Batching
RecordMaxBufferedTime
Kinesis Agent
AWS SDK vs KPL
Make_cols
Cast
Make_struct
Job bookmarks
CloudWatch Events
Lake Formation
Governed tables
Data permissions
Cell level security
Athena
Kinesis Data Firehose
Kinesis Client Library (KCL)
Enhanced Fan Out
Checkpointing
Kinesis Connector Library
Data buffering
Record de-aggregation
Lambda
Shard
Throughput
Athena SQL
DPU
Elastic Map Reduce (EMR)
EMR Notebooks
Cluster
Transient Cluster
Master Node
Scaling Strategies
Study Notes
Data Characteristics
- Structured data is organized in a defined manner or schema, found in relational databases. It's easily queryable and organized in rows and columns. Examples include database tables, CSV files, and Excel spreadsheets.
- Unstructured data lacks a predefined structure or schema, making it difficult to query without pre-processing. It comes in various formats, such as text files without a fixed format, video/audio files, images, emails, and word documents.
- Semi-structured data is not as organized as structured data but has some structure, using tags, hierarchies, or other patterns. It's more flexible than structured but less chaotic than unstructured. Examples include XML, JSON, email headers, and log files.
- Key properties of data include volume (amount of data), velocity (speed at which data is generated), and variety (different types and sources of data).
Data Repositories
- Data warehouses are centralized repositories designed for complex queries and analysis. Data is stored in structured format, cleaned, transformed, and loaded (ETL). Often uses star or snowflake schema. Optimized for read-heavy operations. Example: Amazon Redshift.
- Data lakes are repositories holding vast amounts of raw data in its native, unprocessed format. They support batch, real-time, and stream processing. Example: Amazon S3.
- Data Lakehouses combine the best of Data Warehouses and Data Lakes. They support both structured and unstructured data.
- Data Mesh is an approach where individual teams own data products in a given domain with various uses.
Data Pipelines
- Extract raw data from source systems, ensuring data integrity.
- Transform data into a suitable format, for example by handling missing values and converting types.
- Load data into the target data warehouse or repository, ensuring data integrity during loading.
- Data pipeline tasks are done automatically in a reliable manner, using services like AWS Glue.
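The extract, transform, and load steps above can be sketched in plain Python. This is only an illustration of the three stages, not how AWS Glue implements them; the sample CSV text, the column names, and the function names `extract`, `transform`, and `load` are all hypothetical.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from a source system.
RAW_CSV = """order_id,amount,country
1,19.99,us
2,,de
3,5.00,US
"""

def extract(text):
    """Extract: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types, normalise casing, fill a missing amount with 0.0."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]) if row["amount"] else 0.0,
            "country": row["country"].upper(),
        })
    return cleaned

def load(rows, target):
    """Load: append rows to the target store and check that none were dropped."""
    before = len(target)
    target.extend(rows)
    if len(target) - before != len(rows):
        raise RuntimeError("row count mismatch during load")
    return target

# A plain list stands in for the target repository.
warehouse = load(transform(extract(RAW_CSV)), [])
```

Filling a missing amount with 0.0 is just one possible policy; dropping the row or flagging it for review are equally valid choices depending on the use case.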
Data Sources and Formats
- JDBC (Java Database Connectivity): platform independent, language dependent.
- ODBC (Open Database Connectivity): platform dependent, language independent.
- Common data formats: CSV, JSON, Avro, Parquet. Each has its own use cases (e.g., CSV for small datasets, Avro for big data).
Data Modeling
- Star Schema: A central fact table linked to the dimension tables that branch out from it.
- Data Lineage: Visualization of data flow from source to destination.
- Schema Evolution: Adaptability of schema to changing requirements.
Data Validation and Profiling
- Check completeness (missing values), consistency (cross-field validation), accuracy (comparing with trusted sources), and integrity (validating against rules/standards)
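The four checks can be illustrated with a small Python function. This is a minimal sketch under assumed data: the record fields (`sku`, `quantity`, `unit_price`, `total`), the trusted price table, and the business rule are hypothetical and exist only to show one check of each kind.

```python
def validate(record, trusted_prices):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []

    # Completeness: every required field is present and non-empty.
    for field in ("sku", "quantity", "unit_price", "total"):
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    if problems:
        return problems  # the remaining checks need all fields

    # Consistency: cross-field check that total equals quantity * unit_price.
    if abs(record["quantity"] * record["unit_price"] - record["total"]) > 0.005:
        problems.append("total does not match quantity * unit_price")

    # Accuracy: compare against a trusted reference source.
    if trusted_prices.get(record["sku"]) != record["unit_price"]:
        problems.append("unit_price differs from trusted source")

    # Integrity: enforce a rule the data must always satisfy.
    if record["quantity"] <= 0:
        problems.append("quantity must be positive")

    return problems

trusted = {"A1": 2.50}
good = {"sku": "A1", "quantity": 4, "unit_price": 2.50, "total": 10.00}
bad = {"sku": "A1", "quantity": 4, "unit_price": 3.00, "total": 10.00}
```

Here `validate(good, trusted)` returns an empty list, while `validate(bad, trusted)` reports both a consistency problem and an accuracy problem.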
DynamoDB
- A fully managed NoSQL database service for key-value and document storage.
- Provides fast and predictable performance with seamless scalability.
- Supports various data types (scalar, document, set).
- Offers low-cost and auto-scaling options.
Read/Write Capacity Modes
- Manage table capacity for read/write throughput.
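- Data is stored in partitions; the partition key determines which partition an item lands in.
- WCUs and RCUs (write and read capacity units) specify the provisioned write and read throughput.
- Modes are provisioned (throughput defined in advance) and on-demand (pay per request, suited to unpredictable workloads). Separately, reads can be eventually consistent (less costly) or strongly consistent.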
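As a worked example of sizing provisioned capacity, the sketch below applies the standard DynamoDB rules: one RCU covers one strongly consistent read per second (or two eventually consistent reads) of an item up to 4 KB, and one WCU covers one write per second of an item up to 1 KB, with larger items rounded up. The function names are hypothetical, and transactional requests are deliberately left out.

```python
import math

def read_capacity_units(reads_per_second, item_size_kb, strongly_consistent=True):
    """RCUs needed for a steady read rate of items of the given size."""
    # Each read consumes one unit per 4 KB block, rounded up.
    units_per_read = math.ceil(item_size_kb / 4)
    total = reads_per_second * units_per_read
    if not strongly_consistent:
        # Eventually consistent reads cost half as much.
        total = math.ceil(total / 2)
    return total

def write_capacity_units(writes_per_second, item_size_kb):
    """WCUs needed for a steady write rate of items of the given size."""
    # Each write consumes one unit per 1 KB block, rounded up.
    return writes_per_second * math.ceil(item_size_kb)

strong = read_capacity_units(10, 6)                                # 10 * 2
eventual = read_capacity_units(10, 6, strongly_consistent=False)   # half of that
writes = write_capacity_units(5, 2.5)                              # 5 * 3
```

With these inputs, 10 strongly consistent reads per second of 6 KB items need 20 RCUs, the same load with eventually consistent reads needs 10 RCUs, and 5 writes per second of 2.5 KB items need 15 WCUs.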
Basic API calls
- CRUD operations for data manipulation in DynamoDB.
- Conditional writes (e.g., PutItem).
- Reading data (e.g., GetItem, Scan).
- Deleting data (e.g. DeleteItem).
- Batch operations to optimize operations.
Indexes
- Local Secondary Indexes (LSIs): Alternative sort keys on the base table, so data can be queried more easily and quickly. They use the WCUs/RCUs of the main table.
- Global Secondary Indexes (GSIs): Alternative primary keys that speed up queries on non-key attributes.
DynamoDB Accelerator (DAX)
- A fully managed cache for DynamoDB.
- Provides high-speed access to frequently read data by using in-memory caches.
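The idea behind an in-memory read cache can be sketched as a read-through cache with a time-to-live. This is a conceptual illustration only and does not use or mirror the DAX client API; the class name, the `fetch` callback, and the 300-second TTL are assumptions chosen for the example.

```python
import time

class ReadThroughCache:
    """Serve repeated reads from memory; fall back to the database on a miss."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch        # hypothetical function that reads from the table
        self._ttl = ttl_seconds    # how long a cached value stays valid
        self._clock = clock        # injectable clock, which makes expiry testable
        self._entries = {}         # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit: no database read
        value = self._fetch(key)   # cache miss or expired entry: read through
        self._entries[key] = (value, now + self._ttl)
        return value

# Usage: count how often the (simulated) table is actually read.
table_reads = []

def read_from_table(key):
    table_reads.append(key)
    return {"id": key}

cache = ReadThroughCache(read_from_table)
first = cache.get("user#1")
second = cache.get("user#1")   # served from memory
```

After the two calls above, the simulated table has been read only once, which is where the latency and read-capacity savings of a cache come from.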
Stream records
- Data retention can last up to 24 hours.
- Options control what is written to the stream: keys only, the new image, the old image, or both images of the item.
Data Sources and Formats (continued)
- Avro, Parquet, and other data formats, each with attributes and use cases relevant to different data types.
Miscellaneous
- Data modeling (star, snowflake, etc.) relevant details, like primary and foreign keys.
- Data validation and profiling: Checking for completeness, consistency, accuracy and integrity.
- Data sampling techniques like random, stratified, and systematic sampling.
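The three sampling techniques named above can be sketched with the standard library. The function names and the fixed seeds are assumptions made for reproducibility; a real systematic sample would usually also pick its starting offset at random.

```python
import random

def simple_random_sample(items, k, seed=0):
    """Random sampling: every item has the same chance of being picked."""
    return random.Random(seed).sample(items, k)

def systematic_sample(items, step, start=0):
    """Systematic sampling: take every step-th item from a starting offset."""
    return items[start::step]

def stratified_sample(items, key, per_group, seed=0):
    """Stratified sampling: split items into groups, then sample within each group."""
    rng = random.Random(seed)
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for name in sorted(groups):  # group labels must be sortable
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

population = list(range(20))
every_fifth = systematic_sample(population, 5)
by_parity = stratified_sample(population, key=lambda n: n % 2, per_group=2)
```

Stratifying by parity guarantees two even and two odd numbers in `by_parity`, whereas a simple random sample of four items gives no such guarantee.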
Relational Database Service (RDS)
- Managed relational database service, often used for smaller data volumes; it handles backups and software patching.
- Full ACID compliance, so data consistency and reliability are guaranteed
- Suitable for transactional apps and multi-user environments.
RDS Read Replicas
- A feature that replicates the primary to other locations within the same or different regions
- Optimized for scaling read-heavy workloads, which allows the main database to manage writing workload more efficiently and improves performance
Aurora
- A fully managed MySQL and PostgreSQL-compatible database service.
- Offers high performance.
- Offers compatibility with various data types.
DocumentDB
- NoSQL database compatible with MongoDB.
- Good for storing JSON data.
- Offers automatic scaling, high availability, and replication.
MemoryDB for Redis
- In-memory database
- Optimized for fast read/write performance and used in Microservices architectures to minimize latency.
Keyspaces
- A managed Apache Cassandra-compatible database service.
- Allows you to define the structure of your data.
Neptune
- Fully managed graph database for complex and highly interconnected data
- Enables real time querying and analysis on graph data.
Timestream
- A time series database.
- Suitable for processing large amounts of time-stamped data.
Redshift
- A massively parallel processing (MPP) database service.
- Optimized for high-performance data warehousing use cases.
- Ideal for supporting complex analytical work and data warehousing modernization.
Importing/Exporting Data
- COPY command to efficiently load large amounts of data into Redshift.
- Use UNLOAD command to unload data from a Redshift table.
Workload Management (WLM)
- A Redshift feature for managing query workloads.
- Prioritizes certain types of queries by placing them into queues, which avoids overloading the system.
Serverless
- Auto scaling for workloads, pay-only usage, and easier management.
- Suitable when workloads are unpredictable or variable - high activity sometimes, lower activity other times.
Data Sharing
- Enables secure sharing of data across Redshift clusters.
Other Tools/Features
- Query Performance Insights dashboards for understanding query performance.
- Redshift Advisor, to make suggestions about how to perform analytical queries more efficiently.
- Window functions.
Materialized Views
- Store precomputed query results so that complex reports can read them directly, which can improve reporting performance.
Miscellaneous (continued)
- Specific details on retry mechanisms/handling exceptions and issues or common tasks.
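One common retry mechanism is exponential backoff, where the wait between attempts grows after each failure. The sketch below is a generic illustration with random jitter, not the behaviour of any particular AWS SDK; `ThrottledError`, `call_with_backoff`, and all default values are hypothetical.

```python
import random
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error returned by a service."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with growing, jittered waits."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt: base, 2*base, 4*base, ...
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait a random time up to the ceiling so clients spread out.
            sleep(random.uniform(0, ceiling))

# Usage: an operation that is throttled twice and then succeeds.
attempts = {"count": 0}

def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError()
    return "ok"

recorded_delays = []
result = call_with_backoff(flaky_operation, sleep=recorded_delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in real use the default `time.sleep` would apply the waits.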
DataSync
- Transfers large datasets between on-premises and AWS storage.
- Useful for disaster recovery or for data lakes use cases.
Snow Family
- Highly secure and portable devices for moving large amounts of data.
- Use cases include data migration/data warehousing.
Edge Computing Services
- Process data near the edge instead of routing it to the cloud.
- Use cases include real-time analytics, media streaming, and machine learning at the Edge.
Transfer Family
- Fully managed service for file transfers in AWS.
- Supports FTP, FTPS, and SFTP protocols.
EC2
- Virtual servers.
- Manual provisioning or auto-scaling.
Lambda
- Serverless compute service.
- Ideal for running code in response to events.
- Use cases: processing data from different sources, building microservices architecture and event-driven apps.
Integration
- Different services within AWS integrate well with each other.
SQS
- A secure, fully managed message queue service for asynchronous messaging. Standard queues provide at-least-once delivery; FIFO queues provide exactly-once processing.
- A good choice for decoupling applications and buffering high volumes of requests.
FIFO Queue
- FIFO queues in SQS preserve message order and provide exactly-once processing.
SNS
- A fully managed pub-sub messaging service.
- Good for asynchronous messaging (push-based instead of pull-based).
Step Functions
- A workflow orchestrator.
- Coordinates AWS services and can define steps for processes such as deployments or ETL pipelines.
AppFlow
- A fully managed integration service.
- Transfers data between different SaaS apps and AWS services.
EventBridge
- A managed event bus where data and messages are routed to designated apps/services.
- Useful for enabling event-driven architectures.
Amazon Managed Workflows for Apache Airflow (MWAA)
- Fully managed service to perform tasks based on Python-defined workflows (DAGs).
- Can scale and manage airflow tasks.
General
- Principles of least privilege and best practices
CloudTrail
- Logging and security service.
- Used for auditing, compliance, and analyzing operational events in AWS.
CloudWatch
- Tracks and monitors data from AWS services and applications.
- Ideal for performance monitoring, dashboards and notifications, and debugging.
Cost Explorer, Budgets
- Cost Explorer provides high-level analysis of costs.
- Budgets set spending thresholds and send notifications when costs exceed, or are forecast to exceed, them.
CloudFormation
- IaC service to manage and deploy AWS resources through templates
Secrets Manager
- For securely storing and managing sensitive data, such as passwords or API keys.
- Data is encrypted and kept securely.
WAF
- Security service to protect web applications.
Shield
- Protects against Distributed Denial of Service (DDoS) attacks.
Networking
- Essential concepts of VPCs, Subnets, Route tables, Network ACLs and Security groups are covered.
- Concepts of PrivateLink, Direct Connect (DX), Internet Gateway, NAT Gateway are covered for secure communication/networks.
Data
- Structured, Unstructured, Semi-structured data.
Key Management Service (KMS)
- Manage encryption keys for data stored in various AWS services
Key Types
- Symmetric/asymmetric keys
Macie
- Automates the discovery of sensitive data in an AWS environment, mostly data stored in S3.
Additional Concepts
- AWS services like EFS, S3, Lambda, DynamoDB, Glue and their use cases