Glue

Questions and Answers

An organization needs to run several AWS Glue ETL jobs that require significant memory and processing power. They estimate each job needs approximately 8 vCPUs and 30 GB of memory. What is the minimum number of DPUs required to run each of these jobs?

  • 0.5 DPUs
  • 1 DPU
  • 2 DPUs (correct)
  • 3 DPUs

A company is using AWS Glue to catalog their data. Their data catalog consists of 1,250,000 objects. What will be their monthly cost for using the AWS Glue Data Catalog?

  • $3.00
  • $12.50
  • $1.00
  • $2.50 (correct)

An AWS Glue crawler runs for 5 minutes and 30 seconds. What duration will the crawler be billed for, assuming standard billing practices?

  • 10 minutes (correct)
  • 1 minute
  • 5 minutes and 30 seconds
  • 6 minutes

An AWS Glue ETL job using version 3.0 runs for 30 seconds. What duration will the user be billed for?

  • 1 minute (correct)

A company has several small datasets, each requiring 3 vCPUs and 10 GB of memory to process using AWS Glue. They want to optimize costs. What is the most cost-effective configuration regarding DPUs for processing each dataset?

  • Allocate 1 DPU for each dataset. (correct)

An Apache Spark ETL job is configured to run for 30 minutes and utilizes 8 DPUs. Given a DPU hourly cost of $0.44, what is the total cost for this job execution?

  • $1.76 (correct)

Which of the following scenarios would benefit most from using AWS Glue workflows?

  • A multi-step data processing pipeline involving several Glue jobs and crawlers with dependencies between them. (correct)

You have a stateful AWS Glue ETL job that processes data in an S3 bucket. A new file is added to the bucket. What is the expected behavior of the Glue crawler during the next job execution?

  • The crawler will process all files in the bucket, including the new file and the previously processed files. (correct)

Which AWS service is MOST suitable for orchestrating both stateful and stateless data ingestion workflows?

  • AWS Data Pipeline (correct)

When designing an ETL process with AWS Glue, what file format transformation sequence would generally result in the MOST efficient query performance, assuming the data is initially in CSV format?

  • CSV > Parquet > JSON > XML (correct)

Which type of AWS Glue trigger is MOST appropriate for initiating a workflow at specific, pre-defined times, such as daily at 5:00 AM?

  • Scheduled trigger (correct)

You need to implement a near-real-time data analysis pipeline. Which AWS Glue job type is BEST suited for analyzing data streams as they arrive?

  • Spark Streaming ETL jobs (correct)

For an ETL job where cost optimization is a primary concern and job execution time is less critical, which AWS Glue execution type should be chosen?

  • Flex execution (correct)

How does partitioning data in AWS Glue typically enhance the performance and cost-efficiency of ETL jobs?

  • By providing better query performance, reducing I/O operations, and enabling parallel processing. (correct)

Which AWS Glue DataBrew transformation would be MOST appropriate for converting multiple columns containing address components (street, city, state, zip) into a single, structured column?

  • Nest to struct (correct)

Flashcards

AWS Glue Data Catalog

Service for discovering, classifying, and managing metadata.

DPU (Data Processing Unit)

A unit of compute capacity used by AWS Glue; each DPU provides 4 vCPUs and 16 GB of memory.

AWS Glue Job Billing

AWS Glue jobs are billed per DPU-hour, charged by the second, with a 10-minute minimum (1-minute minimum for Glue 2.0 and later).

AWS Glue Crawler Billing

AWS Glue crawlers are billed per DPU-hour, charged by the second, with a 10-minute minimum.

AWS Glue Data Catalog Pricing

The Data Catalog stores up to 1 million objects for free; beyond that, AWS charges $1 per 100,000 objects per month.

Apache Spark

A distributed computing framework for large-scale data processing and analytics.

Spark Streaming

Extends Spark to enable real-time data stream processing.

Ray job

A job processing framework suitable for parallel processing tasks.

Glue Workflows

A service that orchestrates multi-step data processing jobs, managing executions and monitoring.

Scheduled Triggers (Glue)

Triggers workflows based on regular intervals.

On-Demand Triggers (Glue)

Triggers workflows manually from the AWS console.

EventBridge Triggers (Glue)

Launches workflows based on specific events captured by EventBridge.

Spark ETL Jobs

A type of Glue job for large-scale batch data processing, using between 2 and 100 DPUs.

Flex execution

A cost-effective Glue execution type for less time-sensitive ETL jobs, allowing for some delay in job start.

Glue Partitioning

A feature that processes each partition independently and in parallel, improving performance and reducing query efforts.

Study Notes

  • Glue is a managed AWS service for extract, transform, and load (ETL) jobs.

Glue Cost

Crawlers

  • Crawlers are billed at an hourly rate per Data Processing Unit (DPU), with charges calculated by the second and a 10-minute minimum.

Data Catalog

  • The Glue Data Catalog offers up to one million objects for free. Additional objects are charged at $1 per 100,000 objects over the million object limit per month.
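
As a worked example of the pricing above, a minimal sketch (the only assumption is the $1 per 100,000 objects rate stated here; actual AWS invoices may round differently):

```python
FREE_OBJECTS = 1_000_000      # first million Data Catalog objects are free
PRICE_PER_100K = 1.00         # USD per 100,000 objects beyond the free tier, per month

def monthly_catalog_cost(object_count: int) -> float:
    """Monthly Data Catalog cost for a given number of stored objects."""
    billable = max(0, object_count - FREE_OBJECTS)
    return (billable / 100_000) * PRICE_PER_100K

# 1,250,000 objects -> 250,000 billable -> $2.50/month (the quiz answer above)
print(monthly_catalog_cost(1_250_000))  # 2.5
```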

ETL Jobs

  • ETL jobs are billed at an hourly rate per DPU, with charges calculated by the second and a 10-minute minimum; Glue versions 2.0 and later have a 1-minute minimum.
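
The minimums translate into simple max() arithmetic; this sketch assumes only the per-second billing and minimum durations stated above:

```python
def billable_seconds(runtime_seconds: int, minimum_seconds: int) -> int:
    """Per-second billing with a minimum billed duration."""
    return max(runtime_seconds, minimum_seconds)

# Crawler (10-minute minimum): a 5 min 30 s run is billed as 10 minutes
print(billable_seconds(5 * 60 + 30, 10 * 60))  # 600

# Glue 2.0+ ETL job (1-minute minimum): a 30 s run is billed as 1 minute
print(billable_seconds(30, 60))                # 60
```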

Data Processing Units (DPUs)

  • A single DPU provides 4 vCPUs and 16 GB of memory.
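
Given that 4 vCPU / 16 GB figure, the minimum DPU count for a job can be estimated by taking whichever of CPU or memory is the binding constraint. This is a rough sizing sketch, not an official AWS formula:

```python
import math

def min_dpus(vcpus_needed: float, memory_gb_needed: float) -> int:
    """Smallest whole number of DPUs (4 vCPU / 16 GB each) covering both requirements."""
    return math.ceil(max(vcpus_needed / 4, memory_gb_needed / 16))

print(min_dpus(8, 30))  # 2 -> the 8 vCPU / 30 GB quiz question
print(min_dpus(3, 10))  # 1 -> the 3 vCPU / 10 GB quiz question
```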

Number of DPUs used

  • Apache Spark jobs need a minimum of 2 DPUs and default to 10.
  • Spark Streaming requires a minimum of 2 DPUs and defaults to 10.
  • Ray jobs (ML/AI) need a minimum of 2 M-DPUs (higher-memory DPUs) and default to 6.

DPU Cost

  • DPUs cost $0.44 per DPU hour, although this can vary by region.

Glue Notebooks/Interactive Sessions

  • Glue notebooks are used for interactively developing ETL code.
  • Billing is based on the session's active time and the number of DPUs used.
  • Configurable idle timeouts are available.
  • There is a 1-minute minimum billing period.
  • A minimum of 2 DPUs are required, with a default of 5.

ETL Job Example

  • An Apache Spark job running for 15 minutes with 6 DPUs, at $0.44 per DPU-hour, costs 6 DPUs × 0.25 hours × $0.44 = $0.66.
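
The same DPU-hour arithmetic answers the $1.76 quiz question (8 DPUs for 30 minutes). A minimal sketch, assuming only the $0.44 per DPU-hour rate above:

```python
def glue_job_cost(dpus: int, runtime_minutes: float, price_per_dpu_hour: float = 0.44) -> float:
    """Job cost = DPUs x runtime in hours x price per DPU-hour."""
    return dpus * (runtime_minutes / 60) * price_per_dpu_hour

print(glue_job_cost(6, 15))  # 0.66 -> the study-note example above
print(glue_job_cost(8, 30))  # 1.76 -> the quiz question above
```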

Stateful vs. Stateless

  • Stateful systems remember past interactions, which influence future ones. For example, a crawler will reload everything in a bucket when new files are added.
  • Stateless systems process each request independently, without relying on past interactions.

Data Ingestion in AWS

  • Amazon Kinesis supports both stateful and stateless data processing.
  • AWS Data Pipeline orchestrates workflows for both stateful and stateless data ingestion.
  • Glue offers ETL jobs with features like job bookmarks for tracking progress.
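
As an illustration of enabling bookmarks, the sketch below creates a Glue job with the job-bookmark default argument via boto3. The job name, IAM role, and script location are hypothetical, and the commented ExecutionClass line shows where Flex execution (covered later) would be selected:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job definition; with bookmarks enabled the job tracks what it has
# already processed, so only new data is picked up on the next run.
glue.create_job(
    Name="daily-orders-etl",                                  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",        # hypothetical IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",  # hypothetical script
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    # ExecutionClass="FLEX",  # cost-optimized Flex execution, see Execution Types below
)
```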

Glue - Extract, Transform, and Load

Extract

  • RDS, Aurora, DynamoDB
  • Redshift
  • S3, Kinesis

Transform

  • Filtering: removing unnecessary data
  • Joining: combining data
  • Aggregation: summarizing data
  • Find Matches (ML): identifying records that refer to the same entity (duplicates)
  • Detect PII (Personally Identifiable Information): identifying and managing sensitive information

Data Format Hierarchy

  • CSV > Parquet > JSON > XML

Glue Workflows

  • Used to orchestrate multi-step data processing jobs, managing executions and monitoring of jobs/crawlers.
  • Good for managing AWS Glue operations.
  • Provides a visual interface.
  • Workflows can be created manually.

Triggers

  • Triggers initiate jobs and crawlers.
  • Scheduled triggers: start the workflow at regular intervals (a boto3 sketch follows this list).
  • On-demand triggers: start the workflow manually from the AWS console.
  • EventBridge triggers: launch the workflow based on specific events captured by EventBridge.
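
For the scheduled-trigger case from the quiz (start a job daily at 5:00 AM), a minimal boto3 sketch; the trigger and job names are hypothetical and the cron expression is in UTC:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical scheduled trigger: fire every day at 05:00 UTC and start one Glue job.
glue.create_trigger(
    Name="daily-0500-trigger",                  # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 5 * * ? *)",               # every day at 05:00
    Actions=[{"JobName": "daily-orders-etl"}],  # hypothetical job name
    StartOnCreation=True,
)
```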

Glue Job Types

  • Spark ETL jobs: large-scale data processing (2 DPU to 100 DPU).
  • Spark Streaming ETL jobs: analyze data in real time (2 DPU to 100 DPU).
  • Python Shell jobs: suitable for lightweight tasks (0.0625 or 1 DPU).
  • Ray jobs: suitable for parallel processing tasks.

Execution Types

  • Standard: designed for predictable ETL jobs; jobs start immediately and have consistent execution times.
  • Flex execution: A cost-effective option for less time-sensitive ETL jobs that may start with some delay.

Glue Partitioning

  • Glue automatically partitions data if it's properly organized.
  • Improves query performance and reduces I/O operations.
  • Glue can skip over large segments within partitioned data and process each partition independently (parallel processing).
  • Provides cost efficiency by reducing query efforts.
  • Partitioning can be defined as part of the ETL job script or within the Glue Data Catalog.
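
A minimal PySpark sketch of defining partitions in the ETL script itself; the database, table, output path, and partition columns are hypothetical, and the awsglue libraries are assumed to be available in the Glue job environment:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table from the Data Catalog (hypothetical database/table names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Write as Parquet partitioned by year and month, so later queries can skip partitions.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={
        "path": "s3://example-bucket/curated/orders/",  # hypothetical output path
        "partitionKeys": ["year", "month"],
    },
    format="parquet",
)
```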

Glue DataBrew

  • A data preparation tool with a visual interface for cleaning and formatting data.
  • Offers 250+ pre-built transformations and no-code data preparation capabilities.
  • Automates and schedules data preparations.

Integration

  • Amazon S3 -> AWS Glue DataBrew -> Amazon Redshift

DataBrew Components

  • Project: Used to configure transformation tasks.
  • Step: An applied transformation to the dataset.
  • Recipe: A set of transformation steps that can be saved and reused.
  • Job: The execution of a recipe on a dataset, with output to locations such as S3.
  • Schedule: Schedules jobs to automate transformation.
  • Data profiling: Used to understand the quality and characteristics of your data.
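
Tying the components together, a DataBrew recipe job could be created with boto3 roughly as follows; the dataset, recipe, role, and bucket names are hypothetical:

```python
import boto3

databrew = boto3.client("databrew")

# Hypothetical recipe job: run a saved recipe against a registered dataset
# and write the cleaned output to S3.
databrew.create_recipe_job(
    Name="clean-addresses-job",                                # hypothetical job name
    DatasetName="raw-addresses",                               # hypothetical dataset
    RecipeReference={"Name": "address-cleanup", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123456789012:role/DataBrewJobRole",  # hypothetical IAM role
    Outputs=[{"Location": {"Bucket": "example-bucket", "Key": "databrew/clean-addresses/"}}],
)
```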

Transformations

  • Nest to Map: Converts columns into a map.
  • Nest to Array: Converts columns into an array.
  • Nest to Struct: Similar to Nest to Map but retains exact data and order.
  • Unnest Array: Converts an array to columns.
  • Pivot: Pivots columns and values to rotate data from rows into columns.
  • Unpivot: Converts columns into rows.
  • Transpose: Switches columns and rows.
  • Additional transformations include join, split, filter, sort, count distinct, and date/time conversions.

Cost

  • Interactive sessions cost $1 per session.
  • DataBrew jobs cost $0.48 per node-hour.
