Questions and Answers
What is the purpose of EMR notebooks?
- To provide a managed environment for preparing and visualizing data
- To collaborate with peers on data analysis
- To build applications that utilize EMR
- All of the above (correct)
What is the primary advantage of using spot instances for temporary capacity on long-running EMR clusters?
- Reduced costs due to lower pricing for spot instances (correct)
- Improved data processing speed and efficiency
- Automatic scaling of the cluster based on workload demand
- Increased cluster stability and reliability
What happens when a core node fails in an EMR cluster?
- The cluster automatically scales down to compensate for the failed node
- EMR provisions new nodes to replace the failed node (correct)
- The cluster continues to operate with reduced processing capacity
- The cluster automatically shuts down and all data is lost
What is the role of the master node in an EMR cluster?
What is the primary difference between a transient and a long-running EMR cluster?
In what order does EMR scale up a cluster?
What is the purpose of instance groups and instance fleets in EMR scaling?
What happens when an EMR cluster is scaled down?
What is the purpose of using the ProjectionExpression parameter in the GetItem API call?
What is the key difference between PutItem and UpdateItem API calls?
Why is it recommended to distribute partition keys as much as possible in DynamoDB?
What action should be taken to address a 500 or 503 error from Kinesis?
Which of the following is a suggested action when records are processed too slowly?
What is a common cause of reading records too slowly in Kinesis?
Which factor may lead to the expiration of a shard iterator unexpectedly?
What should be monitored if record processing is falling behind?
What is an effective way to manage throttling errors in Kinesis?
When should unhandled exceptions in the KCL be checked?
Which aspect should not be neglected to avoid Lambda function invocation issues?
Which strategy can improve the error handling for Kinesis when getting 500 errors?
What should be increased to handle high latency in Kinesis effectively?
What does the Make_cols function do in data transformation?
Which of the following is NOT a method of connecting to a VPC endpoint?
What is the purpose of job bookmarks in Glue jobs?
Which feature does Lake Formation provide for access control?
What happens when Glue is configured for time-based scheduling?
What does the Project function do in data types?
In Lake Formation, how does it handle data access for external AWS accounts?
Which aspect of Glue ETL does NOT influence cost?
What type of transactions do governed tables in Lake Formation support?
What does automatic data compaction in Lake Formation help with?
What is a key requirement for cold storage in data management?
Which option best describes a method to handle memory pressure in JVM?
What is the function of index state management (ISM)?
Why is it recommended to have at least three dedicated master nodes?
What do OpenSearch Compute Units (OCUs) measure in a serverless setup?
In which situation is it advised to choose fewer shards?
What is a primary characteristic of snapshots in OpenSearch?
What is an example of an anti-pattern in data querying methods?
Which of the following describes the role of inverted indices?
Cross-cluster replication in OpenSearch is primarily used for what purpose?
Which statement is true about the handling of fine-grained access control?
What is the minimum number of nodes recommended for operational stability?
What is the outcome of using rollups in data management?
What is the primary advantage of using Kinesis over traditional batch processing systems?
How is the cost of using Kinesis primarily calculated?
What distinguishes MSK from Kinesis regarding message size?
What key function does RANDOM_CUT_FOREST provide?
In terms of managing security, how does MSK differ from Kinesis?
What is a significant feature of MSK that enhances its operational reliability?
What storage types does OpenSearch offer?
Which feature of MSK Connect allows for automatic scaling?
What is a major use case for Kinesis Data Streams?
What role do connectors play in MSK Connect?
How does Kinesis handle data storage as opposed to MSK?
What is a limitation when managing message sizes in Kinesis?
Which of the following describes the deployment of MSK clusters?
Flashcards
Athena SQL
A service for querying data stored in Amazon S3 using standard SQL.
DPU (Data Processing Units)
Units used to measure compute capacity in Athena for query executions.
Elastic MapReduce (EMR)
A managed Hadoop framework for processing vast amounts of data on EC2 instances.
EMR Notebooks
Cluster in EMR
Transient Cluster
Master Node
Scaling Strategy in EMR
Hot partitions
Exponential backoff retry
On demand mode
RRU and WRU
PutItem
UpdateItem
GetItem
Workload Management (WLM)
Make_cols
Cast
Project
Job bookmarks
Lake Formation
Governed tables
Time-travel queries
Data permissions
Data filters
CloudWatch Events
AmazonKinesisException
Retry Mechanism
Hot Shards
Exponential Backoff
GetRecords
IteratorAgeMilliseconds
Managed Service for Apache Flink (MSAF)
DataStream API
Flink Sinks
Lambda Function Invocation
Cold Storage
Dedicated Master Node
Data Migration
Shards
Index State Management (ISM)
Cross Cluster Replication
JVM Memory Pressure
Snapshots to S3
OpenSearch Compute Units (OCUs)
Index Rollups
DynamoDB Streams
Visualization Tool (QuickSight)
Encryption at Rest
Ad hoc Data Querying
Fine-Grained Access Control
Kinesis Processing Units (KPU)
Schema Discovery
RANDOM_CUT_FOREST
Managed Streaming for Apache Kafka (MSK)
Auto Recovery in MSK
EBS Volumes
Kafka Connect
Source Connectors
Sink Connectors
KDS Message Size Limit
TLS Encryption
OpenSearch
Hot Storage
UltraWarm Storage
AWS VPC
Study Notes
Data Characteristics
- Structured data is organized according to a defined schema, in rows and columns. Found in relational databases, CSV files, and Excel spreadsheets. Easily queryable.
- Unstructured data has no predefined format or schema. Examples include text files without fixed formats, video/audio files, images, emails, and Word docs. Not easily queryable without preprocessing.
- Semi-structured data is more organized than unstructured but lacks the rigid structure of relational data. Examples include XML, JSON, email headers, or log files with varied formats.
Properties of Data
- Volume: Amount/size of data generated, collected, and processed.
- Velocity: Speed at which new data is generated, collected, and processed. High velocity data requires real-time processing capabilities.
- Variety: Different types, structures, and sources of data.
Data Repositories
- Data warehouses are centralized repositories optimized for complex queries and analysis. Data is structured, cleaned, transformed, and loaded (ETL). Optimized for read-heavy operations.
- Data lakes are storage repositories holding vast amounts of raw data in native formats. Data is loaded without preprocessing, and the lake supports batch, real-time, and stream processing. Can be queried for data transformations or exploration.
- Data lakehouses combine the features of data warehouses and data lakes. They support both structured and unstructured data.
Data Pipelines
- Extract: Retrieves raw data from source systems, ensuring data integrity during extraction. Can be done in real-time or batches.
- Transform: Converts data into a suitable format for the target. Could include data cleaning, enrichment, format changes, and handling missing values.
- Load: Moves data into the target data warehouse or repository, ensuring data integrity during loading.
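The three stages above can be sketched with the Python standard library. This is a minimal illustration, not a production pipeline: the sample CSV, the `orders` table, and the rule that a missing amount becomes 0.0 are all assumptions made for the example, and an in-memory SQLite database stands in for the target warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,carol,5.00
"""

def extract(text):
    """Extract: read raw rows from the source without modifying them."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fix types, normalize names, and handle missing values."""
    cleaned = []
    for row in rows:
        # Assumed rule for this sketch: a missing amount is treated as 0.0.
        amount = float(row["amount"]) if row["amount"] else 0.0
        cleaned.append((int(row["order_id"]), row["customer"].title(), amount))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows to the target inside one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")  # stand-in for the target repository
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(total)
```

Keeping each stage a separate function makes it possible to test the transformation rules without touching the source or the target.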
Data Sources and Formats
- JDBC (Java Database Connectivity): A platform-independent way to connect Java applications to databases.
- ODBC (Open Database Connectivity): A platform-independent API for connecting apps to databases using SQL.
- CSV (Comma Separated Values): A simple data format for storage of small to medium datasets.
- JSON (JavaScript Object Notation): A human-readable data format used often in data interchange between web servers and web clients.
- Avro: A binary file format that stores the schema alongside the data. Used in big data and real-time processing applications.
- Parquet: A columnar storage format for large datasets. Good for analyzing specific columns.
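The difference between these formats can be illustrated with plain Python structures. The sketch below writes the same hypothetical records as CSV and JSON, then arranges them row by row and column by column. The last two layouts only illustrate the idea of record-oriented versus columnar storage; they are not the actual Avro or Parquet encodings.

```python
import csv
import io
import json

# Hypothetical sample records.
records = [
    {"id": 1, "city": "Lima", "temp_c": 18.5},
    {"id": 2, "city": "Oslo", "temp_c": 4.0},
    {"id": 3, "city": "Cairo", "temp_c": 29.0},
]

# CSV: flat text, one record per line; every value is written as text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city", "temp_c"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: human-readable text that keeps basic types and allows nesting.
json_text = json.dumps(records)

# Record-oriented layout: all fields of one record are stored together.
row_layout = [tuple(r.values()) for r in records]

# Column-oriented layout: all values of one column are stored together,
# so an aggregate over one column does not need to read the others.
column_layout = {key: [r[key] for r in records] for key in records[0]}

temps = column_layout["temp_c"]
avg_temp = sum(temps) / len(temps)
print(round(avg_temp, 2))
```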
Data Modeling
- Star Schema: A simple schema with a central fact table and connected dimension tables. Good for reporting and dashboards.
- Snowflake Schema: A more complex version of a star schema with further normalized dimension tables. Better for large datasets and complex queries.
- Flat Schema: A single table for all data, with no normalization. Suited to basic applications where the data is small and relationships are minimal.
- Relational Schema: Multiple normalized tables related via foreign keys. Good for transactional apps.
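As an illustration of the star schema described above, the following sketch builds one fact table and two dimension tables in an in-memory SQLite database and runs a typical reporting query. The table names, columns, and rows are hypothetical and chosen only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- The central fact table holds measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 1, 1, 2, 2000.0),
    (2, 2, 1, 1, 300.0),
    (3, 1, 2, 1, 1000.0);
""")

# Reporting query: join the fact table to its dimensions and aggregate.
report = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for row in report:
    print(row)
```

A snowflake schema would normalize the dimensions further, for example by moving `category` out of `dim_product` into its own table.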
Data Validation & Profiling
- Completeness: Checking for missing/null values.
- Consistency: Validation across data sources for possible differences.
- Accuracy: Comparison with trusted sources & known rules, sanity checks.
- Integrity: Validating the data integrity with relationship validations.
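The four checks above can be sketched as small functions over in-memory records. The data, the second source (`billing_totals`), and the rule that amounts must be non-negative are assumptions made for this example. A real profiling job would run equivalent checks against the actual data stores.

```python
# Hypothetical records from a primary source.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": None},
    {"order_id": 3, "customer_id": 99, "amount": -5.0},
]
billing_totals = {1: 25.0, 2: 40.0, 3: 5.0}  # second source, keyed by order_id
known_customers = {10, 11, 12}               # reference (parent) table keys

def completeness(rows, field):
    """Completeness: order IDs where a required field is missing or null."""
    return [r["order_id"] for r in rows if r.get(field) is None]

def consistency(rows, other):
    """Consistency: order IDs whose amount disagrees with the second source."""
    return [
        r["order_id"]
        for r in rows
        if r["amount"] is not None and other.get(r["order_id"]) != r["amount"]
    ]

def accuracy(rows):
    """Accuracy: sanity rule assumed for this sketch: amounts are non-negative."""
    return [r["order_id"] for r in rows if r["amount"] is not None and r["amount"] < 0]

def integrity(rows, valid_ids):
    """Integrity: every customer_id must reference an existing customer."""
    return [r["order_id"] for r in rows if r["customer_id"] not in valid_ids]

issues = {
    "completeness": completeness(orders, "amount"),
    "consistency": consistency(orders, billing_totals),
    "accuracy": accuracy(orders),
    "integrity": integrity(orders, known_customers),
}
print(issues)
```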
DynamoDB
- A NoSQL database service offering fast & predictable performance with seamless scalability, fault tolerance, and high availability.
- Fully managed, highly available with replication across multiple AZs.
- Scales horizontally to massive workloads.
- Millions of requests per second, trillions of rows, and 100s of TB of storage.
- Use for fast & consistent data retrieval, event-driven applications.
- Contains tables with primary keys that must be decided at creation.
- Data is stored in tables in rows. Partition & sort keys are used to efficiently organize data.
- Choose the right capacity mode (on-demand vs. provisioned) for your workload.
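To see why the choice of partition key matters, the sketch below simulates how hashing a partition key spreads requests across partitions, and how items sharing a partition key are ordered by sort key. It is only a model: DynamoDB's real hash function and partition count are internal to the service, so SHA-256 and `NUM_PARTITIONS = 4` are stand-ins chosen for illustration, and the key values are hypothetical.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical; the service manages the real number internally

def partition_for(partition_key):
    """Map a partition key to a partition by hashing it (SHA-256 is a stand-in)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def load_per_partition(keys):
    """Count how many requests land on each partition."""
    return Counter(partition_for(k) for k in keys)

# Low-cardinality key: every request uses the same key value,
# so a single partition receives all of the traffic (a hot partition).
skewed = load_per_partition(["status#ACTIVE"] * 1000)

# High-cardinality key: requests spread across the partitions.
spread = load_per_partition([f"user#{i}" for i in range(1000)])

# Items that share a partition key are kept together, ordered by sort key.
items = [("user#1", "2024-03-01"), ("user#1", "2024-01-15"), ("user#1", "2024-02-10")]
by_sort_key = sorted(items, key=lambda item: item[1])

print(dict(skewed))
print(dict(spread))
print(by_sort_key)
```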
Data Pipelines
- Data pipelines automate data transfer and transformations between data stores.
- Extract, transform, load (ETL) processes are crucial for data quality and reliability.
- The process needs to be automated for efficiency.
Data Validation and Analysis
- Identify discrepancies and enforce data integrity across different data sources and periods.
- Check for null or missing values, and inconsistencies in data entries.
Other concepts
- Data Mesh: A decentralized approach in which each domain owns and publishes its data as data products, under a federated governance structure.
Miscellaneous
- Schema Evolution: Adapt and change the schema of datasets over time while maintaining backward compatibility.
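One common way to keep a schema change backward compatible is to add new fields as optional with a default, so that records written under the old schema can still be read. The sketch below shows that idea with plain dictionaries. The schema representation, the field names, and the `read_with_schema` helper are all hypothetical and not tied to any particular library.

```python
# Version 2 of a hypothetical schema: "currency" was added with a default,
# so records written under version 1 (without it) remain readable.
SCHEMA_V2 = {
    "order_id": {"required": True},
    "amount": {"required": True},
    "currency": {"required": False, "default": "USD"},
}

def read_with_schema(record, schema):
    """Read a record under a newer schema, filling defaults for added fields."""
    result = {}
    for name, spec in schema.items():
        if name in record:
            result[name] = record[name]
        elif spec["required"]:
            raise ValueError(f"missing required field: {name}")
        else:
            result[name] = spec["default"]
    return result

old_record = {"order_id": 1, "amount": 9.5}                      # written under v1
new_record = {"order_id": 2, "amount": 12.0, "currency": "EUR"}  # written under v2

upgraded = [read_with_schema(r, SCHEMA_V2) for r in (old_record, new_record)]
print(upgraded)
```

Adding a new required field without a default would break this compatibility, because old records could no longer be read.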
Data Pipelines
- Extract, transform, load (ETL): Data extraction, transformation and loading within data pipelines.
Additional Services/Concepts
- DataSync: A managed service for moving large amounts of data to and from on-premises storage & AWS services. Good for batch or scheduled data transfer.
- Snow Family: Highly secure, portable devices that let you collect and process data at the edge for transfer to AWS.
- AppFlow: Enables secure data transfer between SaaS applications and AWS services. Flows can be scheduled or run on demand.
- AWS Backup: Centrally manages and automates backups across many AWS services.
- Application Discovery Service: Provides data about on-premises servers to make it easier to plan a migration to AWS.
- Application Migration Service: Converts physical, virtual, or cloud servers to run on AWS.
- DynamoDB Accelerator (DAX): A fully managed, high-performance accelerator for DynamoDB allowing microsecond read latency, compatible with existing DynamoDB APIs.
- Timestream: An AWS database service for time series data designed for high throughput and scalable querying.
- Redshift: A fully managed, petabyte-scale data warehouse with columnar storage, optimized for analytical workloads.
- CloudFront: A content delivery network used for faster and more efficient data delivery across the world.
- CloudWatch: A monitoring service for collecting and tracking metrics, logs and events regarding resources and applications.
- CloudTrail: A service that records API calls and account activity, which can be archived for security analysis.
- Kinesis Data Streams: Producers write records to a stream, and consumers can read them continuously.
- Kinesis Data Analytics: Provides a managed compute service for running real-time analytics on Kinesis Data Streams.
- Glue: Enables the extraction of schemas, building ETL pipelines, and managing data lakes.
- Glue Catalog: Centralized metadata repository for storing data schemas or metadata.
- Glue Studio: A visual interface for designing, building, and running ETL workflows with Glue services.
- Glue DataBrew: A visual interface for cleaning and preparing large datasets using transformations.
- Glue Workflows: Allow you to design and automate ETL in a visual interface using DAGs.
- Athena: A serverless query service that can read data from S3 directly using Structured Query Language.
- EMR: A fully managed Hadoop framework offering flexible deployments and scaling for large-scale data processing workloads. Supports batch & real-time data processing.
- EFS: A managed, elastic file system service that stores data as files and can be mounted by multiple compute resources, without requiring you to manage file servers or storage capacity.
- EKS: A managed Kubernetes service that provides the flexibility and scalability needed by containerized applications. Highly available & scalable.
- MSK: A fully managed Apache Kafka service for processing streaming data.
- Secrets Manager: A managed service for storing sensitive information like API keys, passwords, and database credentials.
- WAF: Protects web applications from common web exploits.
- Shield: A managed service that protects your applications from distributed denial-of-service (DDoS) attacks.