EMR Cluster Concepts Quiz

Questions and Answers

What is the purpose of EMR notebooks?

  • To provide a managed environment for preparing and visualizing data
  • To collaborate with peers on data analysis
  • To build applications that utilize EMR
  • All of the above (correct)

What is the primary advantage of using spot instances for temporary capacity on long-running EMR clusters?

  • Reduced costs due to lower pricing for spot instances (correct)
  • Improved data processing speed and efficiency
  • Automatic scaling of the cluster based on workload demand
  • Increased cluster stability and reliability

What happens when a core node fails in an EMR cluster?

  • The cluster automatically scales down to compensate for the failed node
  • EMR provisions new nodes to replace the failed node (correct)
  • The cluster continues to operate with reduced processing capacity
  • The cluster automatically shuts down and all data is lost

What is the role of the master node in an EMR cluster?

  • Manages the cluster, tracks task status, and monitors cluster health (correct)

What is the primary difference between a transient and a long-running EMR cluster?

  • A transient cluster is terminated after all steps are completed, while a long-running cluster continues to run until manually terminated (correct)

In what order does EMR scale up a cluster?

  • First adds core nodes, then task nodes, up to the maximum units specified (correct)

What is the purpose of instance groups and instance fleets in EMR scaling?

  • All of the above (correct)

What happens when an EMR cluster is scaled down?

  • Task nodes are removed first, followed by core nodes, down to the minimum constraints (correct)
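
The scale-up and scale-down order described in the two answers above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not EMR's actual scaling algorithm: the function names and the `max_core`, `max_total`, `min_core`, and `min_total` parameters are hypothetical, and every node counts as one unit.

```python
def scale_up(core, task, units, max_core, max_total):
    """Add up to `units` nodes: core nodes first, then task nodes."""
    # Clamp the request so the cluster never exceeds its maximum size.
    units = max(0, min(units, max_total - (core + task)))
    add_core = max(0, min(units, max_core - core))
    return core + add_core, task + (units - add_core)


def scale_down(core, task, units, min_core, min_total):
    """Remove up to `units` nodes: task nodes first, then core nodes."""
    # Clamp the request so the cluster never drops below its minimum size.
    units = max(0, min(units, (core + task) - min_total))
    remove_task = min(units, task)
    remove_core = max(0, min(units - remove_task, core - min_core))
    return core - remove_core, task - remove_task


print(scale_up(core=2, task=0, units=5, max_core=4, max_total=10))
print(scale_down(core=4, task=3, units=5, min_core=2, min_total=3))
```

Each function returns the new `(core, task)` pair, so the two calls can be chained to follow a cluster through a burst of work and back.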

What is the purpose of using the ProjectionExpression parameter in the GetItem API call?

  • To retrieve only specific attributes of an item. (correct)

What is the key difference between PutItem and UpdateItem API calls?

  • PutItem creates a new item or fully replaces an existing item with the same primary key, while UpdateItem changes only the specified attributes and creates the item if it does not exist. (correct)
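
The difference between the three calls is easier to see in a small in-memory model. The sketch below is not the DynamoDB API: `put_item`, `update_item`, and `get_item` are hypothetical helpers over a plain dict, the key attribute name `pk` is an assumption, and `projection` only mimics what ProjectionExpression does.

```python
def put_item(table, item):
    """PutItem-like: create the item, or replace the whole item with the same key."""
    table[item["pk"]] = dict(item)


def update_item(table, pk, updates):
    """UpdateItem-like: change only the named attributes; create the item if absent."""
    item = table.setdefault(pk, {"pk": pk})
    item.update(updates)


def get_item(table, pk, projection=None):
    """GetItem-like: read by primary key; `projection` mimics ProjectionExpression."""
    item = table.get(pk)
    if item is None:
        return None
    if projection is None:
        return dict(item)
    return {name: item[name] for name in projection if name in item}


table = {}
put_item(table, {"pk": "user#1", "name": "Ana", "plan": "free"})
update_item(table, "user#1", {"plan": "pro"})           # `name` is kept
print(get_item(table, "user#1", projection=["name", "plan"]))
put_item(table, {"pk": "user#1", "plan": "trial"})      # `name` is gone
print(get_item(table, "user#1"))
```

The second `put_item` call drops `name` because the whole item is replaced, which is the behavior that distinguishes it from `update_item`.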

Why is it recommended to distribute partition keys as much as possible in DynamoDB?

  • To avoid performance issues when accessing popular items. (correct)

What action should be taken to address a 500 or 503 error from Kinesis?

  • Implement a retry mechanism (correct)

Which of the following is a suggested action when records are processed too slowly?

  • Increase maxRecords per call (correct)

What is a common cause of reading records too slowly in Kinesis?

  • Insufficient number of shards (correct)

Which factor may lead to the expiration of a shard iterator unexpectedly?

  • Insufficient write capacity on DynamoDB (correct)

What should be monitored if record processing is falling behind?

  • IteratorAgeMilliseconds and MillisBehindLatest (correct)

What is an effective way to manage throttling errors in Kinesis?

  • Implement exponential backoff (correct)

When should unhandled exceptions in the KCL be checked?

  • When records get skipped (correct)

Which aspect should not be neglected to avoid Lambda function invocation issues?

  • Proper permissions on the execution role (correct)

Which strategy can improve the error handling for Kinesis when getting 500 errors?

  • Implement a retry mechanism (correct)

What should be increased to handle high latency in Kinesis effectively?

  • Number of shards and retention period (correct)

What does the Make_cols function do in data transformation?

  • Creates a new column for each instance of a field name. (correct)

Which of the following is NOT a method of connecting to a VPC endpoint?

  • Jupyter Notebook on local machine (correct)

What is the purpose of job bookmarks in Glue jobs?

  • They persist the state from job runs to process only new data. (correct)

Which feature does Lake Formation provide for access control?

  • Granular access control with row- and cell-level security. (correct)

What happens when Glue is configured for time-based scheduling?

  • Jobs follow a recurring schedule, like cron. (correct)

What does the Project function do in data types?

  • Projects all data types to a specified type. (correct)

In Lake Formation, how does it handle data access for external AWS accounts?

  • Involves IAM permissions and AWS Resource Access Manager. (correct)

Which aspect of Glue ETL does NOT influence cost?

  • Number of data sources accessed. (correct)

What type of transactions do governed tables in Lake Formation support?

  • ACID transactions across multiple tables. (correct)

What does automatic data compaction in Lake Formation help with?

  • Optimizing query performance by merging small files. (correct)

What is a key requirement for cold storage in data management?

  • It requires a dedicated master node and UltraWarm to be enabled. (correct)

Which option best describes a method to handle memory pressure in JVM?

  • Balance shard allocations across nodes. (correct)

What is the function of index state management (ISM)?

  • To manage index policies such as deleting old indices. (correct)

Why is it recommended to have at least three dedicated master nodes?

  • To avoid split-brain scenarios. (correct)

What does OpenSearch Compute Units (OCUs) measure in a serverless setup?

  • Capacity for indexing and searching. (correct)

In which situation is it advised to choose fewer shards?

  • When facing JVM memory pressure errors. (correct)

What is a primary characteristic of snapshots in OpenSearch?

  • They can be configured to store data in S3. (correct)

What is an example of an anti-pattern in data querying methods?

  • Using RDS for transaction-oriented workloads. (correct)

Which of the following describes the role of inverted indices?

  • They organize data efficiently for retrieval. (correct)

Cross-cluster replication in OpenSearch is primarily used for what purpose?

  • To enhance data availability and geographical redundancy. (correct)

Which statement is true about the handling of fine-grained access control?

  • Users must be mapped to the cold_manager role for access. (correct)

What is the minimum number of nodes recommended for operational stability?

  • Three nodes. (correct)

What is the outcome of using rollups in data management?

  • They reduce the size of the data by summarizing it. (correct)

What is the primary advantage of using Kinesis over traditional batch processing systems?

  • It offers real-time data processing. (correct)

How is the cost of using Kinesis primarily calculated?

  • By Kinesis Processing Units (KPUs) consumed per hour. (correct)

What distinguishes MSK from Kinesis regarding message size?

  • MSK has a default message size of 1 MB but can be configured for larger sizes. (correct)

What key function does RANDOM_CUT_FOREST provide?

  • Anomaly detection on numeric columns in a stream. (correct)

In terms of managing security, how does MSK differ from Kinesis?

  • MSK can use mutual TLS and Kafka ACLs for security. (correct)

What is a significant feature of MSK that enhances its operational reliability?

  • Auto recovery for common Kafka failures. (correct)

What storage types does OpenSearch offer?

  • Hot and UltraWarm storage types. (correct)

Which feature of MSK Connect allows for automatic scaling?

  • Auto scaling capabilities for workers. (correct)

What is a major use case for Kinesis Data Streams?

  • Streaming ETL. (correct)

What role do connectors play in MSK Connect?

  • They transport data between Kafka and external systems. (correct)

How does Kinesis handle data storage as opposed to MSK?

  • Kinesis streams data without dedicated storage options. (correct)

What is a limitation when managing message sizes in Kinesis?

  • Messages have a maximum size limit of 1 MB. (correct)

Which of the following describes the deployment of MSK clusters?

  • Deployable in multiple AZs for high availability. (correct)

Flashcards

Athena SQL

A service for querying data stored in Amazon S3 using standard SQL.

DPU (Data Processing Units)

Units used to measure compute capacity in Athena for query executions.

Elastic Map Reduce (EMR)

A managed Hadoop framework to process vast amounts of data on EC2.

EMR Notebooks

Managed environment in EMR for data preparation, visualization, and collaboration.

Cluster in EMR

A collection of EC2 instances running Hadoop for data processing.

Transient Cluster

An EMR cluster that terminates after completing all tasks, saving costs.

Master Node

The node that manages the EMR cluster, tracks task status, and monitors cluster health.

Scaling Strategy in EMR

Approach to adding or removing nodes in an EMR cluster for efficiency.

Hot partitions

Particularly popular items leading to uneven workload distribution across partitions.
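
A rough way to see why one popular key creates a hot partition, and why spreading keys helps, is to hash keys onto a fixed number of partitions. This is a sketch under assumptions: DynamoDB's real hash function and partition count are internal, so `partition_for`, the MD5 hash, the four partitions, and the `#0` to `#9` key suffix are all illustrative choices.

```python
import hashlib
from collections import Counter


def partition_for(key, num_partitions=4):
    """Map a partition key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def write_load(keys, num_partitions=4):
    """Count how many writes land on each partition."""
    return Counter(partition_for(k, num_partitions) for k in keys)


# 1,000 writes to one popular key all hit the same partition.
hot = write_load(["product#42"] * 1000)

# Appending a suffix (0-9) turns one key into ten, so the same writes
# can be spread over more than one partition.
sharded = write_load([f"product#42#{i % 10}" for i in range(1000)])

print("hot:    ", dict(hot))
print("sharded:", dict(sharded))
```

The trade-off of suffixing keys is that a reader has to query every suffix and merge the results.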

Exponential backoff retry

A strategy to increase wait time between retries to reduce load on services.

On-demand mode

DynamoDB adjusts reads/writes automatically based on workloads, charging for actual usage.

RRU and WRU

Read and Write Request Units, measuring capacity used for reads and writes in DynamoDB.

PutItem

An API call that creates a new item or replaces an existing item with the same primary key.

UpdateItem

An API call used to modify an item's attributes or add it if it doesn't exist.

GetItem

Reads data based on a primary key, retrieving attributes as specified.

Workload Management (WLM)

A strategy to manage concurrent queries, prioritizing faster queries over slower ones.

Make_cols

Creates a new column for each occurrence of the same field name.

Cast

Converts all values in a column to a specified data type.

Project

Projects every data type to a specific type, such as string.

Job bookmarks

Keeps track of the last processed state in ETL jobs to prevent reprocessing.
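
The bookmark idea can be sketched in a few lines. This is a toy model rather than how Glue implements bookmarks: `run_job` is a hypothetical function, the `state` dict stands in for whatever is persisted between job runs, and records are assumed to carry a monotonically increasing `id`.

```python
def run_job(records, state):
    """Process only records newer than the bookmark, then advance the bookmark.

    `state` stands in for the state persisted between job runs.
    """
    last_id = state.get("last_id")
    new = [r for r in records if last_id is None or r["id"] > last_id]
    if new:
        state["last_id"] = max(r["id"] for r in new)
    return new


state = {}
source = [{"id": 1}, {"id": 2}]
print(run_job(source, state))   # first run: everything is new
print(run_job(source, state))   # second run: nothing to reprocess
source.append({"id": 3})
print(run_job(source, state))   # third run: only the new record
```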

Lake Formation

A service built on AWS Glue to create and manage secure data lakes.

Governed tables

Supports ACID transactions and automatic data compaction in data management.

Time-travel queries

Queries that read data as it existed at a chosen point in the past, within a defined retention period.

Data permissions

Controls access to data at various levels (row, column, etc.) using IAM roles.

Data filters

Security applied to data when granting SELECT permissions on tables.

CloudWatch Events

Triggers notifications or functions when ETL jobs succeed or fail.

AmazonKinesisException

An exception returned with 500 or 503 errors from Kinesis; a sustained error rate above 1% calls for a retry mechanism.

Retry Mechanism

A strategy to attempt operations again after a failure.

Hot Shards

Shards that receive more traffic than others, causing throttling.

Exponential Backoff

A strategy that increases the wait time between retries exponentially.
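
A minimal sketch of the strategy, assuming a generic callable rather than any specific AWS SDK call: `backoff_delays` and `call_with_backoff` are hypothetical helper names, and the base delay, cap, and use of full jitter are illustrative defaults.

```python
import random
import time


def backoff_delays(max_retries=5, base=0.1, cap=5.0):
    """Return the un-jittered delay before each retry: base * 2**attempt, capped."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def call_with_backoff(operation, max_retries=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `operation`, retrying on failure with exponential backoff and jitter."""
    for delay in backoff_delays(max_retries, base, cap):
        try:
            return operation()
        except Exception:
            # A real client would catch only throttling or 5xx errors here.
            # Full jitter: wait a random time up to the current delay.
            sleep(random.uniform(0, delay))
    return operation()  # final attempt; let any exception propagate
```

Passing `sleep` as a parameter keeps the helper easy to test, and the jitter keeps many clients from retrying at the same instant.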

GetRecords

An API call to retrieve records from a Kinesis stream.

IteratorAgeMilliseconds

A metric showing how far behind the consumer is in record processing.

Managed Service for Apache Flink (MSAF)

AWS service to run Apache Flink applications with support for Python and Scala.

DataStream API

An API in Flink for processing data streams.

Flink Sinks

Destinations where processed data is sent in a Flink application.

Lambda Function Invocation

The process of triggering a Lambda function to execute in response to events.

Cold Storage

A storage option using S3 for infrequent access to old data.

Dedicated Master Node

A node dedicated to cluster management tasks, such as tracking cluster state and metadata, that holds no data itself.

Data Migration

Moving data between different storage types for efficiency.

Shards

Pieces of an index that allow for distribution across nodes in a cluster.

Index State Management (ISM)

Automates policies for managing the lifecycle of indices.

Cross Cluster Replication

Replicates indices across different domains for redundancy.

JVM Memory Pressure

Performance issues relating to memory overload in Java applications.

Snapshots to S3

Backing up indices to S3 for disaster recovery and storage management.

OpenSearch Compute Units (OCUs)

Metrics for measuring capacity in a serverless OpenSearch implementation.

Index Rollups

Summarizes old data into smaller, aggregated indices at a coarser granularity to save storage costs.
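
What a rollup does to the data can be illustrated outside OpenSearch. The sketch below is an assumption-laden toy, not the rollup API: `rollup_hourly` is a hypothetical function and the document fields (`timestamp`, `service`, `latency_ms`) are invented for the example.

```python
from collections import defaultdict


def rollup_hourly(docs):
    """Summarize per-request documents into one document per (hour, service)."""
    buckets = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for d in docs:
        hour = d["timestamp"][:13]  # keep "YYYY-MM-DDTHH"
        bucket = buckets[(hour, d["service"])]
        bucket["count"] += 1
        bucket["total_ms"] += d["latency_ms"]
    return [
        {
            "hour": hour,
            "service": service,
            "count": b["count"],
            "avg_latency_ms": b["total_ms"] / b["count"],
        }
        for (hour, service), b in sorted(buckets.items())
    ]


docs = [
    {"timestamp": "2024-01-01T10:05:00", "service": "api", "latency_ms": 100},
    {"timestamp": "2024-01-01T10:45:00", "service": "api", "latency_ms": 300},
    {"timestamp": "2024-01-01T11:02:00", "service": "api", "latency_ms": 50},
]
for summary in rollup_hourly(docs):
    print(summary)
```

The individual requests are no longer recoverable from the summary, which is the price paid for the smaller footprint.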

DynamoDB Streams

A feature for capturing changes in your DynamoDB tables.

Visualization Tool (QuickSight)

A tool for building visual reports and dashboards from big data sets.

Encryption at Rest

Security feature that protects data stored on disk from unauthorized access.

Ad hoc Data Querying

Temporary and specific queries made to retrieve data as needed.

Fine-Grained Access Control

Detailed management of user permissions to specific data or functions.

Kinesis Processing Units (KPU)

Billing unit for Kinesis services; 1 KPU = 1 vCPU + 4 GB of memory.

Schema Discovery

Analyzing incoming stream data while setting up the stream.

RANDOM_CUT_FOREST

SQL function for anomaly detection in numeric streams.

Managed Streaming for Apache Kafka (MSK)

AWS service for fully managed Apache Kafka for streaming data.

Auto Recovery in MSK

MSK's ability to recover from common Kafka failures automatically.

EBS Volumes

Elastic Block Store used for data storage in MSK.

Kafka Connect

Framework for moving data to/from Kafka using connectors.

Source Connectors

Imports data into Kafka topics from external systems.

Sink Connectors

Exports data from Kafka topics to external systems.

KDS Message Size Limit

Default message size limit of 1 MB in Kinesis Data Streams.

TLS Encryption

Transport Layer Security used for encrypting data in transit.

OpenSearch

A scalable search and analytics engine for large datasets.

Hot Storage

Fast storage solution used by standard data nodes in OpenSearch.

UltraWarm Storage

Storage that uses S3 and caching for indices with infrequent writes in OpenSearch.

AWS VPC

Virtual Private Cloud for deploying resources on AWS.

Study Notes

Data Characteristics

  • Structured data is organized according to a defined schema, typically in rows and columns. Found in relational databases, CSV files, and Excel spreadsheets. Easily queryable.
  • Unstructured data has no predefined format or schema. Examples include text files without fixed formats, video/audio files, images, emails, and Word docs. Not easily queryable without preprocessing.
  • Semi-structured data is more organized than unstructured but lacks the rigid structure of relational data. Examples include XML, JSON, email headers, or log files with varied formats.

Properties of Data

  • Volume: Amount/size of data generated, collected, and processed.
  • Velocity: Speed at which new data is generated, collected, and processed. High velocity data requires real-time processing capabilities.
  • Variety: Different types, structures, and sources of data.

Data Repositories

  • Data warehouses are centralized repositories optimized for complex queries and analysis. Data is structured, cleaned, transformed, and loaded (ETL). Optimized for read-heavy operations.
  • Data lakes are storage repositories holding vast amounts of raw data in native formats. No data preprocessing, supports batch, real-time, and stream processing. Can be queried for data transformations or exploration.
  • Data lakehouses combine the features of data warehouses and data lakes. They support both structured and unstructured data.

Data Pipelines

  • Extract: Retrieves raw data from source systems, ensuring data integrity during extraction. Can be done in real-time or batches.
  • Transform: Converts data into a suitable format for the target. Could include data cleaning, enrichment, format changes, and handling missing values.
  • Load: Moves data into the target data warehouse or repository, ensuring data integrity during loading.
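
The three stages above can be sketched end to end with the standard library. This is a minimal illustration, not a production pipeline: the sample CSV, the `orders` table, and the `extract`, `transform`, and `load` function names are all assumptions made for the example, and an in-memory SQLite database stands in for the target warehouse.

```python
import csv
import io
import sqlite3

RAW = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""


def extract(text):
    """Extract: read raw rows from the source (a CSV string here)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim and normalize names, default missing amounts to 0.0."""
    return [
        (int(r["order_id"]), r["customer"].strip().title(), float(r["amount"] or 0.0))
        for r in rows
    ]


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
print(conn.execute("SELECT * FROM orders").fetchall())
```

Keeping each stage as its own function makes it possible to test the cleaning rules without touching the source or the target.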

Data Sources and Formats

  • JDBC (Java Database Connectivity): A platform-independent way to connect Java applications to databases.
  • ODBC (Open Database Connectivity): A platform-independent API for connecting apps to databases using SQL.
  • CSV (Comma Separated Values): A simple data format for storage of small to medium datasets.
  • JSON (JavaScript Object Notation): A human-readable data format used often in data interchange between web servers and web clients.
  • Avro: A binary file format that stores both the schema and the data; used in big data and real-time processing applications.
  • Parquet: A columnar storage format for large datasets; well suited to analyzing specific columns.
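
The contrast between row-oriented text formats and a column-oriented layout can be shown with the standard library alone. The sketch below does not read or write real Avro or Parquet files; the `records` sample is invented, and the `columns` dict only imitates the idea behind columnar storage.

```python
import csv
import io
import json

records = [
    {"id": 1, "city": "Lima", "temp_c": 19.5},
    {"id": 2, "city": "Oslo", "temp_c": 4.0},
]

# CSV: plain text rows; the header carries names but no types.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city", "temp_c"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: human-readable and self-describing; numbers stay numbers.
json_text = json.dumps(records)

# Column-oriented layout (the idea behind Parquet): one array per column,
# so an analysis of `temp_c` never touches `id` or `city`.
columns = {name: [r[name] for r in records] for name in records[0]}
avg_temp = sum(columns["temp_c"]) / len(columns["temp_c"])

print(csv_text)
print(json_text)
print(columns, avg_temp)
```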

Data Modeling

  • Star Schema: A simple schema with a central fact table and connected dimension tables. Good for reporting and dashboards.
  • Snowflake Schema: A more complex version of a star schema with further normalized dimension tables. Better for large datasets and complex queries.
  • Flat Schema: A single table for all data, no normalization. Great for basic apps where data is small & relationships are minimal.
  • Relational Schema: Multiple normalized tables related via foreign keys. Good for transactional apps.
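
As an illustration of the star schema, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and runs a typical reporting query. The table names, columns, and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date    VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales  VALUES (1, 1, 2, 2000.0), (2, 1, 1, 300.0), (1, 2, 1, 1000.0);
""")


def revenue_by_category(conn):
    """Join the central fact table to a dimension and aggregate a measure."""
    return conn.execute("""
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()


print(revenue_by_category(conn))
```

A snowflake schema would split `dim_product` further, for example by moving `category` into its own table, at the cost of one more join per query.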

Data Validation & Profiling

  • Completeness: Checking for missing/null values.
  • Consistency: Validation across data sources for possible differences.
  • Accuracy: Comparison with trusted sources & known rules, sanity checks.
  • Integrity: Validating the data integrity with relationship validations.
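
Each of the four checks above can be expressed as a small function over rows. This is a sketch with hypothetical function names and an invented `orders` and `billing` sample, intended only to make the four categories concrete.

```python
def check_completeness(rows, required):
    """Completeness: (row index, field) pairs that are missing or null."""
    return [(i, f) for i, row in enumerate(rows)
            for f in required if row.get(f) in (None, "")]


def check_consistency(rows_a, rows_b, key, field):
    """Consistency: keys whose `field` differs between two sources."""
    other = {r[key]: r.get(field) for r in rows_b}
    return [r[key] for r in rows_a
            if r[key] in other and other[r[key]] != r.get(field)]


def check_accuracy(rows, field, low, high):
    """Accuracy: row indexes whose value falls outside a known valid range."""
    return [i for i, row in enumerate(rows)
            if row.get(field) is not None and not (low <= row[field] <= high)]


def check_integrity(rows, fk, valid_keys):
    """Integrity: row indexes whose foreign key has no matching parent row."""
    return [i for i, row in enumerate(rows) if row.get(fk) not in valid_keys]


orders = [
    {"id": 1, "customer_id": 10, "amount": 25.0},
    {"id": 2, "customer_id": 99, "amount": None},
    {"id": 3, "customer_id": 11, "amount": -5.0},
]
billing = [{"id": 1, "amount": 25.0}, {"id": 3, "amount": 5.0}]

print(check_completeness(orders, ["amount"]))
print(check_consistency(orders, billing, "id", "amount"))
print(check_accuracy(orders, "amount", 0, 10_000))
print(check_integrity(orders, "customer_id", {10, 11}))
```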

DynamoDB

  • A NoSQL database service offering fast & predictable performance with seamless scalability, fault tolerance, and high availability.
  • Fully managed, highly available with replication across multiple AZs.
  • Scales horizontally to massive workloads.
  • Handles millions of requests per second, trillions of rows, and hundreds of TB of storage.
  • Use for fast & consistent data retrieval, event-driven applications.
  • Contains tables with primary keys that must be decided at creation.
  • Data is stored in tables in rows. Partition & sort keys are used to efficiently organize data.
  • Choose the right capacity mode (on-demand vs. provisioned) for your workload.

Data Pipelines

  • Data pipelines automate data transfer and transformations between data stores.
  • Extract, transform, load (ETL) processes are crucial for data quality and reliability.
  • The process needs to be automated for efficiency.

Data Validation and Analysis

  • Identify discrepancies and enforce data integrity across different data sources and periods.
  • Check for null or missing values, and inconsistencies in data entries.

Other concepts

  • Data Mesh: An approach in which each domain owns and publishes its data as products under a federated governance structure.

Miscellaneous

  • Schema Evolution: Adapt and change the schema of datasets over time while maintaining backward compatibility.
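
Backward compatibility usually means a reader on the new schema can still read records written with the old one. A minimal sketch, assuming a schema expressed as a dict of field defaults: `SCHEMA_V2`, `read_record`, and the `email` field are hypothetical names for illustration.

```python
# Version 2 of the schema adds `email` with a default, so records written
# under version 1 (without `email`) remain readable.
SCHEMA_V2 = {"id": None, "name": None, "email": "unknown"}


def read_record(raw, schema=SCHEMA_V2):
    """Read a record with the current schema, filling defaults for new fields."""
    return {field: raw.get(field, default) for field, default in schema.items()}


old_record = {"id": 1, "name": "Ana"}                          # written with v1
new_record = {"id": 2, "name": "Bo", "email": "bo@example.com"}  # written with v2

print(read_record(old_record))
print(read_record(new_record))
```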

Data Pipelines

  • Extract, transform, load (ETL): Data extraction, transformation and loading within data pipelines.

Additional Services/Concepts

  • DataSync: A managed service for moving large amounts of data to and from on-premises storage & AWS services. Good for batch or scheduled data transfer.
  • Snow Family: Highly secure, portable devices that let you collect and process data at the edge for transfer to AWS.
  • AppFlow: Enables secure data transfer between cloud and SaaS apps; flows can be scheduled or run on demand.
  • AWS Backup: Centrally manages and automates backups across many AWS services.
  • Application Discovery Service: Provides data about on-premises servers to make planning a migration to AWS easier.
  • Application Migration: Converts physical/virtual/cloud servers to run on AWS.
  • DynamoDB Accelerator (DAX): A fully managed, high-performance accelerator for DynamoDB allowing microsecond read latency, compatible with existing DynamoDB APIs.
  • Timestream: An AWS database service for time series data designed for high throughput and scalable querying.
  • Redshift: A fully managed, petabyte-scale data warehouse with columnar storage, optimized for analytical workloads.
  • CloudFront: A content delivery network used for faster and more efficient data delivery across the world.
  • CloudWatch: A monitoring service for collecting and tracking metrics, logs and events regarding resources and applications.
  • CloudTrail: A service that logs all API calls and all activity that can be archived for security analysis.
  • Kinesis Data Streams: Receives records from producers and retains them so that consumers can read them continuously.
  • Kinesis Data Analytics: Provides a managed compute service for running real-time analytics on Kinesis Data Streams.
  • Glue: Enables the extraction of schemas, building ETL pipelines, and managing data lakes.
  • Glue Catalog: Centralized metadata repository for storing data schemas or metadata.
  • Glue Studio: A visual interface for designing, building, and running ETL workflows with Glue services.
  • Glue DataBrew: A visual interface for cleaning and preparing large datasets using transformations.
  • Glue Workflows: Allow you to design and automate ETL in a visual interface using DAGs.
  • Athena: A serverless query service that can read data from S3 directly using Structured Query Language.
  • EMR: A fully managed Hadoop framework offering flexible deployments and scaling for large-scale data processing workloads. Supports batch & real-time data processing.
  • EFS: A managed, elastic file system service that stores data as files without requiring you to provision or manage storage servers.
  • EKS: A managed Kubernetes service that provides the flexibility and scalability needed by containerized applications. Highly available & scalable.
  • MSK: A fully managed Apache Kafka service for processing streaming data.
  • Secrets Manager: A managed service for storing sensitive information like API keys, passwords, and database credentials.
  • WAF: Protects web applications from common web exploits.
  • Shield: A managed service that protects your applications from distributed denial-of-service (DDoS) attacks.