Questions and Answers
What is the purpose of EMR notebooks?
What is the primary advantage of using spot instances for temporary capacity on long-running EMR clusters?
What happens when a core node fails in an EMR cluster?
What is the role of the master node in an EMR cluster?
What is the primary difference between a transient and a long-running EMR cluster?
In what order does EMR scale up a cluster?
What is the purpose of instance groups and instance fleets in EMR scaling?
What happens when an EMR cluster is scaled down?
What is the purpose of using the ProjectionExpression parameter in the GetItem API call?
What is the key difference between PutItem and UpdateItem API calls?
Why is it recommended to distribute partition keys as much as possible in DynamoDB?
What action should be taken to address a 500 or 503 error from Kinesis?
Which of the following is a suggested action when records are processed too slowly?
What is a common cause of reading records too slowly in Kinesis?
Which factor may lead to the expiration of a shard iterator unexpectedly?
What should be monitored if record processing is falling behind?
What is an effective way to manage throttling errors in Kinesis?
When should unhandled exceptions in the KCL be checked?
Which aspect should not be neglected to avoid Lambda function invocation issues?
Which strategy can improve the error handling for Kinesis when getting 500 errors?
What should be increased to handle high latency in Kinesis effectively?
What does the Make_cols function do in data transformation?
Which of the following is NOT a method of connecting to a VPC endpoint?
What is the purpose of job bookmarks in Glue jobs?
Which feature does Lake Formation provide for access control?
What happens when Glue is configured for time-based scheduling?
What does the Project function do in data types?
How does Lake Formation handle data access for external AWS accounts?
Which aspect of Glue ETL does NOT influence cost?
What type of transactions do governed tables in Lake Formation support?
What does automatic data compaction in Lake Formation help with?
What is a key requirement for cold storage in data management?
Which option best describes a method to handle memory pressure in JVM?
What is the function of index state management (ISM)?
Why is it recommended to have at least three dedicated master nodes?
What do OpenSearch Compute Units (OCUs) measure in a serverless setup?
In which situation is it advised to choose fewer shards?
What is a primary characteristic of snapshots in OpenSearch?
What is an example of an anti-pattern in data querying methods?
Which of the following describes the role of inverted indices?
Cross-cluster replication in OpenSearch is primarily used for what purpose?
Which statement is true about the handling of fine-grained access control?
What is the minimum number of nodes recommended for operational stability?
What is the outcome of using rollups in data management?
What is the primary advantage of using Kinesis over traditional batch processing systems?
How is the cost of using Kinesis primarily calculated?
What distinguishes MSK from Kinesis regarding message size?
What key function does RANDOM_CUT_FOREST provide?
In terms of managing security, how does MSK differ from Kinesis?
What is a significant feature of MSK that enhances its operational reliability?
What storage types does OpenSearch offer?
Which feature of MSK Connect allows for automatic scaling?
What is a major use case for Kinesis Data Streams?
What role do connectors play in MSK Connect?
How does Kinesis handle data storage as opposed to MSK?
What is a limitation when managing message sizes in Kinesis?
Which of the following describes the deployment of MSK clusters?
Flashcards
Athena SQL
A service for querying data stored in Amazon S3 using standard SQL.
DPU (Data Processing Units)
Units used to measure compute capacity in Athena for query executions.
Elastic MapReduce (EMR)
A managed Hadoop framework to process vast amounts of data on EC2.
EMR Notebooks
Cluster in EMR
Transient Cluster
Master Node
Scaling Strategy in EMR
Hot partitions
Exponential backoff retry
On demand mode
RRU and WRU
PutItem
UpdateItem
GetItem
Workload Management (WLM)
Make_cols
Cast
Project
Job bookmarks
Lake Formation
Governed tables
Time-travel queries
Data permissions
Data filters
CloudWatch Events
AmazonKinesisException
Retry Mechanism
Hot Shards
Exponential Backoff
GetRecords
IteratorAgeMilliseconds
Managed Service for Apache Flink (MSAF)
DataStream API
Flink Sinks
Lambda Function Invocation
Cold Storage
Dedicated Master Node
Data Migration
Shards
Index State Management (ISM)
Cross Cluster Replication
JVM Memory Pressure
Snapshots to S3
OpenSearch Compute Units (OCUs)
Index Rollups
DynamoDB Streams
Visualization Tool (QuickSight)
Encryption at Rest
Ad hoc Data Querying
Fine-Grained Access Control
Kinesis Processing Units (KPU)
Schema Discovery
RANDOM_CUT_FOREST
Managed Streaming for Apache Kafka (MSK)
Auto Recovery in MSK
EBS Volumes
Kafka Connect
Source Connectors
Sink Connectors
KDS Message Size Limit
TLS Encryption
OpenSearch
Hot Storage
UltraWarm Storage
AWS VPC
Study Notes
Data Characteristics
- Structured data is organized according to a defined schema, in rows and columns. Found in relational databases, CSV files, and Excel spreadsheets. Easily queryable.
- Unstructured data has no predefined format or schema. Examples include text files without fixed formats, video/audio files, images, emails, and Word docs. Not easily queryable without preprocessing.
- Semi-structured data is more organized than unstructured but lacks the rigid structure of relational data. Examples include XML, JSON, email headers, or log files with varied formats.
Properties of Data
- Volume: Amount/size of data generated, collected, and processed.
- Velocity: Speed at which new data is generated, collected, and processed. High velocity data requires real-time processing capabilities.
- Variety: Different types, structures, and sources of data.
Data Repositories
- Data warehouses are centralized repositories optimized for complex queries and analysis. Data is structured, cleaned, transformed, and loaded (ETL). Optimized for read-heavy operations.
- Data lakes are storage repositories holding vast amounts of raw data in native formats. No data preprocessing, supports batch, real-time, and stream processing. Can be queried for data transformations or exploration.
- Data lakehouses combine the features of data warehouses and data lakes. They support both structured and unstructured data.
Data Pipelines
- Extract: Retrieves raw data from source systems, ensuring data integrity during extraction. Can be done in real-time or batches.
- Transform: Converts data into a suitable format for the target. Could include data cleaning, enrichment, format changes, and handling missing values.
- Load: Moves data into the target data warehouse or repository, ensuring data integrity during loading.
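A minimal sketch of the three stages above, assuming a small in-memory CSV as the source and an in-memory SQLite table as the target. The sample rows, the orders table, and the function names are hypothetical and chosen only for illustration; a real pipeline would read from and write to actual data stores.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: the second row is missing its amount.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,carol,7.25
"""

def extract(text):
    """Extract: read raw rows from the source without altering them."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types, normalize names, fill missing amounts with 0.0."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"]) if row["amount"] else 0.0
        cleaned.append((int(row["order_id"]), row["customer"].title(), amount))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Filling a missing amount with 0.0 is only one possible policy; rejecting or quarantining the row may be more appropriate depending on the target.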
Data Sources and Formats
- JDBC (Java Database Connectivity): A platform-independent way to connect Java apps to databases.
- ODBC (Open Database Connectivity): A platform-independent API for connecting apps to databases using SQL.
- CSV (Comma Separated Values): A simple data format for storage of small to medium datasets.
- JSON (JavaScript Object Notation): A human-readable data format used often in data interchange between web servers and web clients.
- Avro: A binary file format that stores the schema together with the data. Used in big data and real-time processing applications.
- Parquet: Columnar storage format for large datasets. Good for analyzing specific columns.
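The practical differences between these formats can be sketched with the Python standard library alone. The example below is illustrative only: the sample records are made up, and to_columns merely imitates the idea behind a columnar layout such as Parquet; it does not produce or read real Parquet or Avro files.

```python
import csv
import io
import json

# Hypothetical sample records.
records = [
    {"id": 1, "city": "Lima", "temp_c": 19.5},
    {"id": 2, "city": "Oslo", "temp_c": 4.0},
]

def to_csv(rows):
    """Serialize rows as CSV text; types are flattened to strings."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def from_csv(text):
    """Parse CSV text back into dictionaries (all values come back as str)."""
    return list(csv.DictReader(io.StringIO(text)))

def to_columns(rows):
    """Pivot row-oriented records into a column-oriented layout."""
    return {key: [row[key] for row in rows] for key in rows[0]}

csv_rows = from_csv(to_csv(records))            # numbers return as strings
json_rows = json.loads(json.dumps(records))     # numbers keep their types
columns = to_columns(records)
# Reading one column touches only that column's values.
avg_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
```

After the CSV round trip the id comes back as the string "1", while the JSON round trip preserves the numeric types. The column-oriented layout shows why analyzing a single column is cheap when data is stored by column.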
Data Modeling
- Star Schema: A simple schema with a central fact table and connected dimension tables. Good for reporting and dashboards.
- Snowflake Schema: A more complex version of a star schema with further normalized dimension tables. Better for large datasets and complex queries.
- Flat Schema: A single table for all data, no normalization. Great for basic apps where data is small & relationships are minimal.
- Relational Schema: Multiple normalized tables related via foreign keys. Good for transactional apps.
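As a rough illustration of the star schema described above, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and runs a typical reporting query. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
-- The central fact table holds measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
INSERT INTO dim_date VALUES (1, '2024-01-05', '2024-01'), (2, '2024-02-10', '2024-02');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 10, 15.0),
    (2, 2, 1, 1, 40.0),
    (3, 1, 2, 4, 6.0);
""")

def revenue_by_category(connection):
    """Join the fact table to one dimension and aggregate a measure."""
    query = """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """
    return connection.execute(query).fetchall()
```

A snowflake schema would further split dim_product, for example by moving category into its own table, at the cost of an extra join in the query.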
Data Validation & Profiling
- Completeness: Checking for missing/null values.
- Consistency: Validation across data sources for possible differences.
- Accuracy: Comparison with trusted sources & known rules, sanity checks.
- Integrity: Validating the data integrity with relationship validations.
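The four checks above could be sketched as small functions over lists of dictionaries. The function names, field names, and sample rows are assumptions made for illustration; production pipelines usually rely on a dedicated data quality tool instead.

```python
def check_completeness(rows, required):
    """Return indexes of rows with a missing or empty required field."""
    return [i for i, row in enumerate(rows)
            if any(row.get(field) in (None, "") for field in required)]

def check_consistency(source_a, source_b, key, field):
    """Return keys whose field value differs between two sources."""
    b_by_key = {row[key]: row for row in source_b}
    return [row[key] for row in source_a
            if row[key] in b_by_key and row[field] != b_by_key[row[key]][field]]

def check_accuracy(rows, field, low, high):
    """Sanity check: return indexes of rows whose value is outside [low, high]."""
    return [i for i, row in enumerate(rows)
            if row.get(field) is not None and not (low <= row[field] <= high)]

def check_integrity(child_rows, parent_rows, fk, pk):
    """Return foreign-key values in child rows that have no matching parent."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row[fk] for row in child_rows if row[fk] not in parent_keys]

# Hypothetical sample data.
customers = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": ""}]
crm = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 3, "amount": -5.0},
]
```

With this sample data, customer 2 fails the completeness and consistency checks, and order 11 fails both the accuracy check (negative amount) and the integrity check (unknown customer 3).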
DynamoDB
- A NoSQL database service offering fast & predictable performance with seamless scalability, fault tolerance, and high availability.
- Fully managed, highly available with replication across multiple AZs.
- Scales horizontally to massive workloads.
- Millions of requests per second, trillions of rows, and 100s of TB of storage.
- Use for fast & consistent data retrieval, event-driven applications.
- Contains tables with primary keys that must be decided at creation.
- Data is stored in tables in rows. Partition & sort keys are used to efficiently organize data.
- Choose the right capacity mode (on-demand vs. provisioned) for your workload.
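To see why the choice of partition key matters for organizing data efficiently, the following sketch simulates how keys spread across partitions. It is a simplified model, not DynamoDB's implementation: the service's internal hash function is not used here (MD5 is a stand-in), and the partition count of 4 and the key names are arbitrary assumptions.

```python
import hashlib
from collections import Counter

def partition_for(partition_key, num_partitions):
    """Map a partition key to a partition by hashing it (MD5 is a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def load_per_partition(keys, num_partitions):
    """Count how many requests land on each simulated partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# High-cardinality key: one distinct value per user.
spread = load_per_partition([f"user-{i}" for i in range(1000)], 4)
# Low-cardinality key: every request uses the same value.
hot = load_per_partition(["status-active"] * 1000, 4)
```

In this model the per-user key spreads requests over all partitions, while the single shared key sends every request to one partition, which is the hot partition pattern to avoid.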
Data Pipelines
- Data pipelines automate data transfer and transformations between data stores.
- Extract, transform, load (ETL) processes are crucial for data quality and reliability.
- The process needs to be automated for efficiency.
Data Validation and Analysis
- Identify discrepancies and enforce data integrity across different data sources and periods.
- Check for null or missing values, and inconsistencies in data entries.
Other concepts
- Data Mesh: A decentralized approach in which each domain owns its data as data products, under a federated governance structure.
Miscellaneous
- Schema Evolution: Adapt and change the schema of datasets over time while maintaining backward compatibility.
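One common way to keep backward compatibility is to give every newly added field a default, so that a reader using the new schema can still read old records. The sketch below is loosely modelled on how Avro-style schema resolution behaves, but it uses plain dictionaries rather than any Avro library; the schema layout and field names are hypothetical.

```python
# Version 2 adds a "country" field with a default value.
SCHEMA_V1 = {"fields": [{"name": "id"}, {"name": "name"}]}
SCHEMA_V2 = {"fields": [
    {"name": "id"},
    {"name": "name"},
    {"name": "country", "default": "unknown"},
]}

def read_with_schema(record, reader_schema):
    """Project a record onto the reader schema, filling defaults for absent fields."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"missing required field: {name}")
    return result

old_record = {"id": 1, "name": "Ada"}                    # written with V1
new_record = {"id": 2, "name": "Lin", "country": "SG"}   # written with V2

upgraded = read_with_schema(old_record, SCHEMA_V2)
downgraded = read_with_schema(new_record, SCHEMA_V1)
```

Here upgraded gains country set to "unknown", and downgraded simply drops the field the old schema does not know about. Adding a field without a default would make old records unreadable under the new schema.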
Data Pipelines
- Extract, transform, load (ETL): Data extraction, transformation and loading within data pipelines.
Additional Services/Concepts
- DataSync: A managed service for moving large amounts of data to and from on-premises storage & AWS services. Good for batch or scheduled data transfer.
- Snow Family: Highly secure, portable devices that let you collect & process data at the edge for transfer to AWS.
- AppFlow: Enables secure data transfer between SaaS apps & AWS services. Flows can be scheduled or run on demand.
- AWS Backup: Centrally manages and automates backups across many AWS services.
- Application Discovery Service: Provides data about on-premises servers to make planning a migration to AWS easier.
- Application Migration Service: Rehosts physical, virtual, or cloud servers so they run on AWS.
- DynamoDB Accelerator (DAX): A fully managed, high-performance accelerator for DynamoDB allowing microsecond read latency, compatible with existing DynamoDB APIs.
- Timestream: An AWS database service for time series data designed for high throughput and scalable querying.
- Redshift: A fully managed, petabyte-scale data warehouse with columnar storage, optimized for analytical workloads.
- CloudFront: A content delivery network used for faster and more efficient data delivery across the world.
- CloudWatch: A monitoring service for collecting and tracking metrics, logs and events regarding resources and applications.
- CloudTrail: A service that records API calls and account activity. The logs can be archived for security analysis and auditing.
- Kinesis Data Streams: Ingests streaming records, which consumers can then read continuously.
- Kinesis Data Analytics: Provides a managed compute service for running real-time analytics on Kinesis Data Streams. Now named Managed Service for Apache Flink.
- Glue: Enables the extraction of schemas, building ETL pipelines, and managing data lakes.
- Glue Catalog: Centralized metadata repository for storing data schemas or metadata.
- Glue Studio: A visual interface for designing, building, and running ETL workflows with Glue services.
- Glue DataBrew: A visual interface for cleaning and preparing large datasets using transformations.
- Glue Workflows: Allow you to design and automate ETL in a visual interface using DAGs.
- Athena: A serverless query service that can read data from S3 directly using Structured Query Language.
- EMR: A fully managed Hadoop framework offering flexible deployments and scaling for large-scale data processing workloads. Supports batch & real-time data processing.
- EFS: A fully managed, elastic file system (NFS) that can be shared across compute resources without provisioning or managing storage capacity.
- EKS: A managed Kubernetes service that provides the flexibility and scalability needed by containerized applications. Highly available & scalable.
- MSK: A fully managed Apache Kafka service for processing streaming data.
- Secrets Manager: A managed service for storing sensitive information like API keys, passwords, and database credentials.
- WAF: Protects web applications from common web exploits.
- Shield: A managed service that protects your applications from distributed denial-of-service (DDoS) attacks.
Description
Test your knowledge on Amazon EMR cluster concepts and functionalities. This quiz covers topics such as the roles of various nodes, advantages of spot instances, and the difference between transient and long-running clusters. Ideal for those seeking to enhance their understanding of EMR operations.