EMR Cluster Concepts Quiz
57 Questions

Questions and Answers

What is the purpose of EMR notebooks?

  • To provide a managed environment for preparing and visualizing data
  • To collaborate with peers on data analysis
  • To build applications that utilize EMR
  • All of the above (correct)

What is the primary advantage of using spot instances for temporary capacity on long-running EMR clusters?

  • Reduced costs due to lower pricing for spot instances (correct)
  • Improved data processing speed and efficiency
  • Automatic scaling of the cluster based on workload demand
  • Increased cluster stability and reliability

What happens when a core node fails in an EMR cluster?

  • The cluster automatically scales down to compensate for the failed node
  • EMR provisions new nodes to replace the failed node (correct)
  • The cluster continues to operate with reduced processing capacity
  • The cluster automatically shuts down and all data is lost

What is the role of the master node in an EMR cluster?

    Answer: Manages the cluster, tracks task status, and monitors cluster health

    What is the primary difference between a transient and a long-running EMR cluster?

    Answer: A transient cluster is terminated after all steps are completed, while a long-running cluster continues to run until manually terminated

    In what order does EMR scale up a cluster?

    Answer: First adds core nodes, then task nodes, up to the maximum units specified

    What is the purpose of instance groups and instance fleets in EMR scaling?

    Answer: All of the above

    What happens when an EMR cluster is scaled down?

    Answer: Task nodes are removed first, followed by core nodes, down to the minimum constraints

    What is the purpose of using the ProjectionExpression parameter in the GetItem API call?

    Answer: To retrieve only specific attributes of an item.
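
    For illustration, a minimal boto3 sketch (table, key, and attribute names are hypothetical) of how ProjectionExpression limits which attributes GetItem returns:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Fetch one item by primary key, but return only two of its attributes.
response = dynamodb.get_item(
    TableName="Orders",                      # hypothetical table
    Key={"OrderId": {"S": "order-123"}},     # hypothetical key value
    ProjectionExpression="CustomerName, OrderTotal",
)
print(response.get("Item"))
```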

    What is the key difference between PutItem and UpdateItem API calls?

    Answer: PutItem creates a new item, while UpdateItem only updates the existing item's attributes.
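
    A hedged boto3 sketch of the contrast (names are made up): put_item writes a complete item, replacing any existing item with the same key, while update_item changes only the attributes named in its update expression:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# PutItem: writes the full item; an existing item with this key is replaced.
dynamodb.put_item(
    TableName="Orders",
    Item={"OrderId": {"S": "order-123"}, "Status": {"S": "NEW"}},
)

# UpdateItem: modifies only the listed attributes of the stored item.
dynamodb.update_item(
    TableName="Orders",
    Key={"OrderId": {"S": "order-123"}},
    UpdateExpression="SET #s = :status",
    ExpressionAttributeNames={"#s": "Status"},   # Status is a reserved word
    ExpressionAttributeValues={":status": {"S": "SHIPPED"}},
)
```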

    Why is it recommended to distribute partition keys as much as possible in DynamoDB?

    Answer: To avoid performance issues when accessing popular items.

    What action should be taken to address a 500 or 503 error from Kinesis?

    Answer: Implement a retry mechanism

    Which of the following is a suggested action when records are processed too slowly?

    Answer: Increase maxRecords per call

    What is a common cause of reading records too slowly in Kinesis?

    Answer: Insufficient shard number

    Which factor may lead to the expiration of a shard iterator unexpectedly?

    Answer: Insufficient write capacity on DynamoDB

    What should be monitored if record processing is falling behind?

    Answer: IteratorAgeMilliseconds and MillisBehindLatest
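
    As a monitoring sketch (the stream name is hypothetical), the stream-level CloudWatch metric GetRecords.IteratorAgeMilliseconds can be polled with boto3 to see how far consumers have fallen behind:

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.datetime.utcnow()
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "my-stream"}],  # hypothetical
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,                 # 5-minute buckets
    Statistics=["Maximum"],
)
# A growing maximum iterator age means record processing is falling behind.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```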

    What is an effective way to manage throttling errors in Kinesis?

    Answer: Implement exponential backoff
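
    A minimal sketch of exponential backoff (with jitter) around a Kinesis write, assuming a hypothetical stream name; throttling and 5xx-style errors are retried with an increasing wait:

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

RETRIABLE = {"ProvisionedThroughputExceededException",
             "InternalFailure", "ServiceUnavailable"}

def put_with_backoff(data: bytes, partition_key: str, max_attempts: int = 5):
    """Retry throttling/service errors, doubling the wait on each attempt."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="my-stream",      # hypothetical stream
                Data=data,
                PartitionKey=partition_key,
            )
        except ClientError as err:
            if err.response["Error"]["Code"] not in RETRIABLE:
                raise                        # not retriable; surface it
            time.sleep(2 ** attempt + random.random())  # exponential + jitter
    raise RuntimeError("put_record still failing after retries")
```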

    When should unhandled exceptions in the KCL be checked?

    Answer: When records get skipped

    Which aspect should not be neglected to avoid Lambda function invocation issues?

    Answer: Proper permissions on the execution role

    Which strategy can improve the error handling for Kinesis when getting 500 errors?

    Answer: Implement a retry mechanism

    What should be increased to handle high latency in Kinesis effectively?

    Answer: Number of shards and retention period

    What does the Make_cols function do in data transformation?

    Answer: Creates a new column for each instance of a field name.
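
    Make_cols is one of the choices of the resolveChoice transform on a Glue DynamicFrame. A brief sketch, assuming a Glue job environment and a hypothetical catalog table whose price field arrives with mixed types:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog database/table whose "price" field arrives with
# mixed types (e.g. string and double).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# make_cols: split the ambiguous field into one column per observed type
# (e.g. price_double and price_string).
made_cols = dyf.resolveChoice(specs=[("price", "make_cols")])

# cast: convert every value in the column to a single type.
casted = dyf.resolveChoice(specs=[("price", "cast:double")])

# project: keep only values of the chosen type and discard the rest.
projected = dyf.resolveChoice(specs=[("price", "project:double")])
```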

    Which of the following is NOT a method of connecting to a VPC endpoint?

    Answer: Jupyter Notebook on local machine

    What is the purpose of job bookmarks in Glue jobs?

    Answer: They persist the state from job runs to process only new data.
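
    Job bookmarks are enabled per job or per run through the --job-bookmark-option argument; a short boto3 sketch (the job name is hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Start a run of a hypothetical job with bookmarks enabled, so only data
# arriving since the last successful run is processed.
glue.start_job_run(
    JobName="daily-orders-etl",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```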

    Which feature does Lake Formation provide for access control?

    Answer: Granular access control with row and cell level security.

    What happens when Glue is configured for time-based scheduling?

    Answer: Jobs follow a recurring schedule like CRON.
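
    A sketch of time-based scheduling with a Glue trigger and a CRON expression (trigger and job names are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Run the hypothetical job every day at 02:00 UTC on a CRON schedule.
glue.create_trigger(
    Name="daily-orders-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "daily-orders-etl"}],
    StartOnCreation=True,
)
```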

    What does the Project function do in data types?

    Answer: Projects all data types to a specified type.

    How does Lake Formation handle data access for external AWS accounts?

    Answer: Involves IAM permissions and AWS Resource Access Manager.

    Which aspect of Glue ETL does NOT influence cost?

    Answer: Number of data sources accessed.

    What type of transactions do governed tables in Lake Formation support?

    Answer: ACID transactions across multiple tables.

    What does automatic data compaction in Lake Formation help with?

    Answer: Optimizing query performance by merging small files.

    What is a key requirement for cold storage in data management?

    Answer: It requires a dedicated master node and UltraWarm to be enabled.

    Which option best describes a method to handle memory pressure in JVM?

    Answer: Balance shard allocations across nodes.

    What is the function of index state management (ISM)?

    Answer: To manage index policies such as deleting old indices.

    Why is it recommended to have at least three dedicated master nodes?

    Answer: To avoid split brain scenarios.

    What do OpenSearch Compute Units (OCUs) measure in a serverless setup?

    Answer: Capacity for indexing and searching.

    In which situation is it advised to choose fewer shards?

    Answer: When facing JVM memory pressure errors.

    What is a primary characteristic of snapshots in OpenSearch?

    Answer: They can be configured to store data in S3.

    What is an example of an anti-pattern in data querying methods?

    Answer: Using RDS for transaction-oriented workloads.

    Which of the following describes the role of inverted indices?

    Answer: They organize data efficiently for retrieval.

    Cross-cluster replication in OpenSearch is primarily used for what purpose?

    Answer: To enhance data availability and geographical redundancy.

    Which statement is true about the handling of fine-grained access control?

    Answer: Users must be mapped to the cold_manager role for access.

    What is the minimum number of nodes recommended for operational stability?

    Answer: Three nodes.

    What is the outcome of using rollups in data management?

    Answer: They reduce the size of the data by summarizing it.

    What is the primary advantage of using Kinesis over traditional batch processing systems?

    Answer: It offers real-time data processing.

    How is the cost of using Kinesis primarily calculated?

    Answer: By Kinesis Processing Units (KPUs) consumed per hour.

    What distinguishes MSK from Kinesis regarding message size?

    Answer: MSK has a default message size of 1MB but can be configured for larger sizes.

    What key function does RANDOM_CUT_FOREST provide?

    Answer: Anomaly detection on numeric columns in a stream.

    In terms of managing security, how does MSK differ from Kinesis?

    Answer: MSK can use mutual TLS and Kafka ACLs for security.

    What is a significant feature of MSK that enhances its operational reliability?

    Answer: Auto recovery for common Kafka failures.

    What storage types does OpenSearch offer?

    Answer: Hot and UltraWarm storage types.

    Which feature of MSK Connect allows for automatic scaling?

    Answer: Auto scaling capabilities for workers.

    What is a major use case for Kinesis Data Streams?

    Answer: Streaming ETL.

    What role do connectors play in MSK Connect?

    Answer: They transport data between Kafka and external systems.

    How does Kinesis handle data storage as opposed to MSK?

    Answer: Kinesis streams data without dedicated storage options.

    What is a limitation when managing message sizes in Kinesis?

    Answer: Messages have a maximum size limit of 1MB.

    Which of the following describes the deployment of MSK clusters?

    Answer: Deployable in multiple AZs for high availability.

    Flashcards

    Athena SQL

    A service for querying data stored in Amazon S3 using standard SQL.

    DPU (Data Processing Units)

    Units used to measure compute capacity in Athena for query executions.

    Elastic MapReduce (EMR)

    A managed Hadoop framework to process vast amounts of data on EC2.

    EMR Notebooks

    Managed environment in EMR for data preparation, visualization, and collaboration.

    Cluster in EMR

    A collection of EC2 instances running Hadoop for data processing.

    Transient Cluster

    An EMR cluster that terminates after completing all tasks, saving costs.

    Master Node

    The node that manages the EMR cluster, tracks status, and health.

    Scaling Strategy in EMR

    Approach to adding or removing nodes in an EMR cluster for efficiency.

    Hot partitions

    Particularly popular items leading to uneven workload distribution across partitions.

    Exponential backoff retry

    A strategy to increase wait time between retries to reduce load on services.

    On demand mode

    DynamoDB adjusts reads/writes automatically based on workloads, charging for actual usage.

    RRU and WRU

    Read and Write Request Units, measuring capacity used for reads and writes in DynamoDB.

    PutItem

    An API call that creates a new item or replaces an existing item with the same primary key.

    UpdateItem

    An API call used to modify an item's attributes or add it if it doesn't exist.

    GetItem

    Reads data based on a primary key, retrieving attributes as specified.

    Workload Management (WLM)

    A strategy to manage concurrent queries, prioritizing faster queries over slower ones.

    Make_cols

    Creates a new column for each occurrence of the same field name.

    Cast

    Converts all values in a column to a specified data type.

    Project

    Projects every data type to a specific type, such as string.

    Job bookmarks

    Keeps track of the last processed state in ETL jobs to prevent reprocessing.

    Lake Formation

    A service built on AWS Glue to create and manage secure data lakes.

    Governed tables

    Supports ACID transactions and automatic data compaction in data management.

    Time-travel queries

    Allows access to data that was modified within a defined time period.

    Data permissions

    Controls access to data at various levels (row, column, etc.) using IAM roles.

    Data filters

    Security applied to data when granting SELECT permissions on tables.

    CloudWatch Events

    Triggers notifications or functions when ETL jobs succeed or fail.

    AmazonKinesisException

    An error indicating an error rate above 1% in Kinesis streams.

    Retry Mechanism

    A strategy to attempt operations again after a failure.

    Hot Shards

    Shards that receive more traffic than others, causing throttling.

    Exponential Backoff

    A strategy that increases the wait time between retries exponentially.

    GetRecords

    An API call to retrieve records from a Kinesis stream.

    IteratorAgeMilliseconds

    A metric showing how far behind the consumer is in record processing.

    Managed Service for Apache Flink (MSAF)

    AWS service to run Apache Flink applications with support for Python and Scala.

    DataStream API

    An API in Flink for processing data streams.

    Flink Sinks

    Destinations where processed data is sent in a Flink application.

    Lambda Function Invocation

    The process of triggering a Lambda function to execute in response to events.

    Cold Storage

    A storage option using S3 for infrequent access to old data.

    Dedicated Master Node

    A node dedicated to managing cluster state and metadata in an OpenSearch domain.

    Data Migration

    Moving data between different storage types for efficiency.

    Shards

    Pieces of an index that allow for distribution across nodes in a cluster.

    Index State Management (ISM)

    Automates policies for managing the lifecycle of indices.

    Cross Cluster Replication

    Replicates indices across different domains for redundancy.

    JVM Memory Pressure

    Performance issues relating to memory overload in Java applications.

    Snapshots to S3

    Backing up indices to S3 for disaster recovery and storage management.

    OpenSearch Compute Units (OCUs)

    Metrics for measuring capacity in a serverless OpenSearch implementation.

    Index Rollups

    Summarizes old data into fewer, larger files to save storage costs.

    DynamoDB Streams

    A feature for capturing changes in your DynamoDB tables.

    Visualization Tool (QuickSight)

    A tool for building visual reports and dashboards from big data sets.

    Encryption at Rest

    Security feature that protects data stored on disk from unauthorized access.

    Ad hoc Data Querying

    Temporary and specific queries made to retrieve data as needed.

    Fine-Grained Access Control

    Detailed management of user permissions to specific data or functions.

    Kinesis Processing Units (KPU)

    Billing unit for Kinesis Data Analytics / Managed Service for Apache Flink; 1 KPU = 1 vCPU + 4 GB of memory.

    Schema Discovery

    Analyzing incoming stream data while setting up the stream.

    RANDOM_CUT_FOREST

    SQL function for anomaly detection in numeric streams.

    Managed Streaming for Apache Kafka (MSK)

    AWS service for fully managed Apache Kafka for streaming data.

    Auto Recovery in MSK

    MSK's ability to recover from common Kafka failures automatically.

    EBS Volumes

    Elastic Block Store used for data storage in MSK.

    Kafka Connect

    Framework for moving data to/from Kafka using connectors.

    Source Connectors

    Imports data into Kafka topics from external systems.

    Sink Connectors

    Exports data from Kafka topics to external systems.

    KDS Message Size Limit

    Maximum record size of 1 MB in Kinesis Data Streams (not configurable).

    TLS Encryption

    Transport Layer Security used for encrypting data in transit.

    OpenSearch

    A scalable search and analytics engine for large datasets.

    Hot Storage

    Fast storage solution used by standard data nodes in OpenSearch.

    UltraWarm Storage

    S3-backed storage in OpenSearch for older, less frequently accessed (read-only) indices.

    AWS VPC

    Virtual Private Cloud for deploying resources on AWS.

    Study Notes

    Data Characteristics

    • Structured data is organized in a defined manner with a fixed schema of rows and columns. Found in relational databases, CSV files, and Excel spreadsheets. Easily queryable.
    • Unstructured data has no predefined format or schema. Examples include text files without fixed formats, video/audio files, images, emails, and Word docs. Not easily queryable without preprocessing.
    • Semi-structured data is more organized than unstructured but lacks the rigid structure of relational data. Examples include XML, JSON, email headers, or log files with varied formats.

    Properties of Data

    • Volume: Amount/size of data generated, collected, and processed.
    • Velocity: Speed at which new data is generated, collected, and processed. High velocity data requires real-time processing capabilities.
    • Variety: Different types, structures, and sources of data.

    Data Repositories

    • Data warehouses are centralized repositories optimized for complex queries and analysis. Data is structured, cleaned, transformed, and loaded (ETL). Optimized for read-heavy operations.
    • Data lakes are storage repositories holding vast amounts of raw data in native formats. No data preprocessing, supports batch, real-time, and stream processing. Can be queried for data transformations or exploration.
    • Data lakehouses combine the features of data warehouses and data lakes. They support both structured and unstructured data.

    Data Pipelines

    • Extract: Retrieves raw data from source systems, ensuring data integrity during extraction. Can be done in real-time or batches.
    • Transform: Converts data into a suitable format for the target. Could include data cleaning, enrichment, format changes, and handling missing values.
    • Load: Moves data into the target data warehouse or repository, ensuring data integrity during loading.

    Data Sources and Formats

    • JDBC (Java Database Connectivity): A platform-independent way to connect Java applications to databases.
    • ODBC (Open Database Connectivity): A platform-independent API for connecting applications to databases using SQL.
    • CSV (Comma-Separated Values): A simple row-based text format suited to small and medium datasets.
    • JSON (JavaScript Object Notation): A human-readable format commonly used for data interchange between web servers and clients.
    • Avro: A binary format that stores both the schema and the data; common in big data and real-time processing applications.
    • Parquet: A columnar storage format for large datasets, efficient when queries read only specific columns (a small CSV-to-Parquet sketch follows this list).
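
    To make the row-based vs. columnar distinction concrete, a small pandas sketch (file and column names are arbitrary; writing Parquet assumes pyarrow or fastparquet is installed) converting a CSV file to Parquet and reading back a single column:

```python
import pandas as pd

# Read a row-oriented CSV file (illustrative path).
df = pd.read_csv("orders.csv")

# Write it as columnar Parquet; analytical queries can then scan only the
# columns they need instead of whole rows.
df.to_parquet("orders.parquet", index=False)

# Reading back just one column shows the columnar access pattern.
totals = pd.read_parquet("orders.parquet", columns=["order_total"])
print(totals.describe())
```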

    Data Modeling

    • Star Schema: A simple schema with a central fact table and connected dimension tables. Good for reporting and dashboards.
    • Snowflake Schema: A more complex version of a star schema with further normalized dimension tables. Better for large datasets and complex queries.
    • Flat Schema: A single table for all data, no normalization. Great for basic apps where data is small & relationships are minimal.
    • Relational Schema: Multiple normalized tables related via foreign keys. Good for transactional apps.

    Data Validation & Profiling

    • Completeness: Checking for missing/null values.
    • Consistency: Validation across data sources for possible differences.
    • Accuracy: Comparison with trusted sources & known rules, sanity checks.
    • Integrity: Validating the data integrity with relationship validations.
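
    A minimal pandas sketch of these checks (the datasets and column names are made up):

```python
import pandas as pd

orders = pd.read_csv("orders.csv")        # illustrative datasets
customers = pd.read_csv("customers.csv")

# Completeness: count missing/null values per column.
print(orders.isnull().sum())

# Accuracy: sanity check against a known rule (totals cannot be negative).
bad_totals = orders[orders["order_total"] < 0]
print(f"{len(bad_totals)} rows with negative order totals")

# Integrity: every order should reference a known customer id.
orphans = ~orders["customer_id"].isin(customers["customer_id"])
print(f"{orphans.sum()} orders reference unknown customers")
```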

    DynamoDB

    • A NoSQL database service offering fast & predictable performance with seamless scalability, fault tolerance, and high availability.
    • Fully managed, highly available with replication across multiple AZs.
    • Scales horizontally to massive workloads.
    • Millions of requests per second, trillions of rows, and 100s of TB of storage.
    • Use for fast & consistent data retrieval, event-driven applications.
    • Contains tables with primary keys that must be decided at creation.
    • Data is stored in tables in rows. Partition & sort keys are used to efficiently organize data.
    • Choose the right capacity mode (on-demand vs. provisioned) for your workload; a minimal table-creation sketch follows this list.
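
    A minimal boto3 sketch of creating a table with a partition key and a sort key in on-demand mode (table and attribute names are hypothetical):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table keyed by customer (partition key) and order date (sort key).
dynamodb.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "CustomerId", "AttributeType": "S"},
        {"AttributeName": "OrderDate", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "CustomerId", "KeyType": "HASH"},   # partition key
        {"AttributeName": "OrderDate", "KeyType": "RANGE"},   # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity mode
)
```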

    Data Pipelines

    • Data pipelines automate data transfer and transformations between data stores.
    • Extract, transform, load (ETL) processes are crucial for data quality and reliability.
    • The process needs to be automated for efficiency.

    Data Validation and Analysis

    • Identify discrepancies and enforce data integrity across different data sources and periods.
    • Check for null or missing values, and inconsistencies in data entries.

    Other Concepts

    • Data Mesh: A decentralized approach in which domain teams own and serve their data as products under federated governance.
    • Schema Evolution: Adapting and changing the schema of datasets over time while maintaining backward compatibility.

    Additional Services/Concepts

    • DataSync: A managed service for moving large amounts of data to and from on-premises storage & AWS services. Good for batch or scheduled data transfer.
    • Snow Family: Highly secure, portable devices that let you collect and process data at the edge and transfer it to AWS.
    • AppFlow: Enables secure data transfer between SaaS applications and AWS; can be scheduled or run on demand.
    • AWS Backup: Centrally manages and automates backups across many AWS services.
    • Application Discovery Service: Provides data about on-premises servers to help plan migration to AWS.
    • Application Migration: Converts physical/virtual/cloud servers to run on AWS.
    • DynamoDB Accelerator (DAX): A fully managed, high-performance accelerator for DynamoDB allowing microsecond read latency, compatible with existing DynamoDB APIs.
    • Timestream: An AWS database service for time series data designed for high throughput and scalable querying.
    • Redshift: A fully managed, petabyte-scale data warehouse with columnar storage, optimized for analytical workloads.
    • CloudFront: A content delivery network used for faster and more efficient data delivery across the world.
    • CloudWatch: A monitoring service for collecting and tracking metrics, logs and events regarding resources and applications.
    • CloudTrail: A service that logs all API calls and all activity that can be archived for security analysis.
    • Kinesis: Ingests streaming records from producers that consumers can read continuously in near real time.
    • Kinesis Data Analytics: Provides a managed compute service for running real-time analytics on Kinesis Data Streams.
    • Glue: Enables the extraction of schemas, building ETL pipelines, and managing data lakes.
    • Glue Catalog: Centralized metadata repository for storing data schemas or metadata.
    • Glue Studio: A visual interface for designing, building, and running ETL workflows with Glue services.
    • Glue DataBrew: A visual interface for cleaning and preparing large datasets using transformations.
    • Glue Workflows: Allow you to design and automate ETL in a visual interface using DAGs.
    • Athena: A serverless query service that reads data directly from S3 using standard SQL (a query sketch follows this list).
    • EMR: A fully managed Hadoop framework offering flexible deployments and scaling for large-scale data processing workloads. Supports batch & real-time data processing.
    • EFS: A fully managed, elastic network file system (NFS) that can be mounted by EC2 instances and other compute services.
    • EKS: A managed Kubernetes service that provides the flexibility and scalability needed by containerized applications. Highly available & scalable.
    • MSK: A fully managed Apache Kafka service for processing streaming data.
    • Secrets Manager: A managed service for storing sensitive information like API keys, passwords, and database credentials.
    • WAF: Protects web applications from common web exploits.
    • Shield: A managed service that protects your applications from denial-of-service (DoS) attacks.
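
    As referenced in the Athena entry above, a hedged boto3 sketch of running a SQL query against data in S3 (database, table, and bucket names are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Submit a query against a hypothetical Glue Catalog database; results are
# written to the given S3 location.
response = athena.start_query_execution(
    QueryString="SELECT order_id, order_total FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```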

    Description

    Test your knowledge of Amazon EMR cluster concepts and related AWS data engineering services. This quiz covers the roles of cluster nodes, the advantages of spot instances, transient versus long-running clusters, and supporting services such as DynamoDB, Kinesis, Glue, Lake Formation, OpenSearch, and MSK. Ideal for those seeking to deepen their understanding of AWS data processing operations.
