Questions and Answers
An organization needs to run several AWS Glue ETL jobs that require significant memory and processing power. They estimate each job needs approximately 8 vCPUs and 30 GB of memory. What is the minimum number of DPUs required to run each of these jobs?
- 0.5 DPUs
- 1 DPU
- 2 DPUs (correct)
- 3 DPUs
A company is using AWS Glue to catalog their data. Their data catalog consists of 1,250,000 objects. What will be their monthly cost for using the AWS Glue Data Catalog?
- $3.00
- $12.50
- $1.00
- $2.50 (correct)
An AWS Glue crawler runs for 5 minutes and 30 seconds. What duration will the crawler be billed for, assuming standard billing practices?
- 10 minutes (correct)
- 1 minute
- 5 minutes and 30 seconds
- 6 minutes
An AWS Glue ETL job using version 3.0 runs for 30 seconds. What duration will the user be billed for?
A company has several small datasets, each requiring 3 vCPUs and 10 GB of memory to process using AWS Glue. They want to optimize costs. What is the most cost-effective configuration regarding DPUs for processing each dataset?
An Apache Spark ETL job is configured to run for 30 minutes and utilizes 8 DPUs. Given a DPU hourly cost of $0.44, what is the total cost for this job execution?
Which of the following scenarios would benefit most from using AWS Glue workflows?
You have a stateful AWS Glue ETL job that processes data in an S3 bucket. A new file is added to the bucket. What is the expected behavior of the Glue crawler during the next job execution?
Which AWS service is MOST suitable for orchestrating both stateful and stateless data ingestion workflows?
When designing an ETL process with AWS Glue, what file format transformation sequence would generally result in the MOST efficient query performance, assuming the data is initially in CSV format?
Which type of AWS Glue trigger is MOST appropriate for initiating a workflow at specific, pre-defined times, such as daily at 5:00 AM?
You need to implement a near-real-time data analysis pipeline. Which AWS Glue job type is BEST suited for analyzing data streams as they arrive?
For an ETL job where cost optimization is a primary concern and job execution time is less critical, which AWS Glue execution type should be chosen?
How does partitioning data in AWS Glue typically enhance the performance and cost-efficiency of ETL jobs?
Which AWS Glue DataBrew transformation would be MOST appropriate for converting multiple columns containing address components (street, city, state, zip) into a single, structured column?
Flashcards
AWS Glue Data Catalog
Service for discovering, classifying, and managing metadata.
DPU (Data Processing Unit)
Unit of compute capacity used by AWS Glue; one DPU provides 4 vCPUs and 16 GB of memory.
AWS Glue Job Billing
ETL jobs are billed at an hourly rate per DPU, charged by the second, with a 10-minute minimum (1-minute minimum for versions 2.0 and later).
AWS Glue Crawler Billing
Crawlers are billed at an hourly rate per DPU, charged by the second, with a 10-minute minimum.
AWS Glue Data Catalog Pricing
The first one million objects are free; additional objects cost $1 per 100,000 objects per month.
Apache Spark
Distributed processing engine behind Glue ETL jobs; Spark jobs need a minimum of 2 DPUs and default to 10.
Spark Streaming
Glue streaming ETL jobs that analyze data in near real time; minimum of 2 DPUs, default of 10.
Ray job
Glue job type for parallel ML/AI workloads; minimum of 2 M-DPUs (higher memory), default of 6.
Glue Workflows
Orchestrate multi-step data processing jobs, managing and monitoring jobs and crawlers through a visual interface.
Scheduled Triggers (Glue)
Start a workflow at regular, pre-defined intervals (e.g., daily at 5:00 AM).
On-Demand Triggers (Glue)
Start a workflow manually, for example from the AWS console.
EventBridge Triggers (Glue)
Launch a workflow based on specific events captured by EventBridge.
Spark ETL Jobs
Glue jobs for large-scale batch data processing (2 to 100 DPUs).
Flex execution
Cost-effective execution class for less time-sensitive ETL jobs; runs may start with some delay.
Glue Partitioning
Organizing data into partitions so Glue can skip irrelevant segments and process partitions in parallel, improving performance and cost efficiency.
Study Notes
- Glue is a service used for extract, transform, and load (ETL) jobs
Glue Cost
Crawlers
- Crawlers are billed hourly based on the number of Data Processing Units (DPUs) used, with charges calculated by the second and a 10-minute minimum.
Data Catalog
- The first one million objects in the Glue Data Catalog are free; additional objects are charged at $1 per 100,000 objects per month.
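As a quick check of this pricing, a minimal sketch (the 1,250,000-object figure is the quiz example above):

```python
def catalog_monthly_cost(num_objects: int) -> float:
    """Monthly Glue Data Catalog storage cost: first million objects free,
    then $1 per 100,000 objects per month."""
    free_objects = 1_000_000
    billable = max(0, num_objects - free_objects)
    return billable / 100_000 * 1.00

print(catalog_monthly_cost(1_250_000))  # 2.5 -> $2.50 per month, the quiz answer
```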
ETL Jobs
- ETL jobs are billed hourly based on the number of DPUs, with charges calculated by the second and a 10-minute minimum, though versions 2.0 and later have a 1-minute minimum.
Data Processing Units (DPUs)
- A single DPU provides 4 vCPUs and 16 GB of memory.
Number of DPUs used
- Apache Spark jobs need a minimum of 2 DPUs and default to 10.
- Spark Streaming requires a minimum of 2 DPUs and defaults to 10.
- Ray jobs (ML/AI) need a minimum of 2 M-DPUs (memory-optimized DPUs) and default to 6.
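Putting the DPU sizing rules together, a minimal sketch for estimating the smallest DPU count a Spark job needs (the 8 vCPU / 30 GB job from the quiz resolves to 2 DPUs):

```python
import math

def min_spark_dpus(vcpus: float, memory_gb: float) -> int:
    """Smallest DPU count covering both CPU and memory, respecting the 2-DPU Spark minimum."""
    by_cpu = math.ceil(vcpus / 4)       # 1 DPU = 4 vCPUs
    by_mem = math.ceil(memory_gb / 16)  # 1 DPU = 16 GB of memory
    return max(2, by_cpu, by_mem)       # Spark jobs need at least 2 DPUs

print(min_spark_dpus(8, 30))  # 2 -> the "8 vCPUs and 30 GB" quiz question
print(min_spark_dpus(3, 10))  # 2 -> small jobs still pay the Spark minimum
```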
DPU Cost
- DPUs cost $0.44 per DPU hour, although this can vary by region.
Glue Notebooks/Interactive Sessions
- Glue notebooks are used for interactively developing ETL code.
- Billing is based on the session's active time and the number of DPUs used.
- Configurable idle timeouts are available.
- There is a 1-minute minimum billing period.
- A minimum of 2 DPUs is required, with a default of 5.
ETL Job Example
- An Apache Spark job running for 15 minutes and using 6 DPUs at $0.44 per DPU-hour costs 6 DPUs × 0.25 hours × $0.44 = $0.66.
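The same arithmetic as a small sketch, using the billing minimums from these notes (the $0.44 rate is region-dependent):

```python
DPU_HOURLY_RATE = 0.44  # USD per DPU-hour; varies by region

def glue_run_cost(dpus: float, runtime_minutes: float, minimum_minutes: float = 1) -> float:
    """Cost of one Glue run: per-second billing with a minimum duration
    (1 minute for Glue 2.0+ jobs, 10 minutes for older jobs and crawlers)."""
    billed_minutes = max(runtime_minutes, minimum_minutes)
    return dpus * (billed_minutes / 60) * DPU_HOURLY_RATE

print(glue_run_cost(6, 15))                       # 0.66 -> the example above
print(glue_run_cost(8, 30))                       # 1.76 -> the 30-minute, 8-DPU quiz question
print(glue_run_cost(2, 5.5, minimum_minutes=10))  # crawler-style 10-minute minimum
```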
Stateful vs. Stateless
- Stateful systems remember past interactions and use that state in later runs; for example, a Glue job with job bookmarks processes only the files added to a bucket since its last run.
- Stateless systems process each request independently, without relying on past interactions; for example, a job without bookmarks reprocesses the entire bucket on every run.
Data Ingestion in AWS
- Amazon Kinesis supports both stateful and stateless data processing.
- AWS Data Pipeline orchestrates workflows for both stateful and stateless data ingestion.
- Glue offers ETL jobs with features like job bookmarks for tracking progress.
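A minimal boto3 sketch of turning on job bookmarks when starting a run; the job name is a placeholder and the job itself is assumed to already exist:

```python
import boto3

glue = boto3.client("glue")

# Enable job bookmarks so this run only processes data added since the last successful run.
glue.start_job_run(
    JobName="my-etl-job",  # placeholder name
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```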
Glue - Transform, Extract, and Load
Extract
- RDS, Aurora, DynamoDB
- Redshift
- S3, Kinesis
Transform
- Filtering: removing unnecessary data (see the PySpark sketch after this list)
- Joining: combining data
- Aggregation: summarizing data
- Find Matches (ML): identifying records that refer to the same entity (duplicates)
- Detect PII (Personally Identifiable Information): identifying and managing sensitive information
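As referenced in the list above, a minimal PySpark sketch of the filter and join transforms in a Glue script; the database, table, and column names are hypothetical:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Filter, Join

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Extract: read two tables that a crawler has already registered in the Data Catalog.
orders = glue_ctx.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
customers = glue_ctx.create_dynamic_frame.from_catalog(database="sales_db", table_name="customers")

# Transform: drop zero-value orders, then join on the customer key.
valid_orders = Filter.apply(frame=orders, f=lambda row: row["amount"] > 0)
enriched = Join.apply(valid_orders, customers, "customer_id", "customer_id")
```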
Data Format Hierarchy
- For query efficiency, columnar Parquet generally beats the row-based text formats: Parquet > CSV > JSON > XML. Converting CSV input to Parquet is the usual Glue optimization.
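A minimal sketch of that CSV-to-Parquet conversion in a Glue script; the bucket paths are placeholders:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

# Read raw CSV from S3 (placeholder path) ...
raw = glue_ctx.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/orders/"]},
    format="csv",
    format_options={"withHeader": True},
)

# ... and rewrite it as Parquet, the columnar format that queries most efficiently.
glue_ctx.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)
```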
Glue Workflows
- Used to orchestrate multi-step data processing jobs, managing executions and monitoring of jobs/crawlers.
- Good for managing AWS Glue operations.
- Provides a visual interface.
- Workflows can be created manually.
Triggers
- Triggers initiate jobs and crawlers.
- Scheduled triggers: start the workflow at regular intervals.
- On-demand triggers: start the workflow manually from the AWS console.
- EventBridge triggers: launch the workflow based on specific events captured by EventBridge.
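A minimal boto3 sketch of the scheduled case, firing a workflow's job daily at 5:00 AM UTC; the trigger, workflow, and job names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: run every day at 05:00 UTC and start an ETL job inside a workflow.
glue.create_trigger(
    Name="daily-5am-trigger",        # placeholder
    WorkflowName="nightly-ingest",   # placeholder; the workflow must already exist
    Type="SCHEDULED",
    Schedule="cron(0 5 * * ? *)",    # AWS cron: minute hour day-of-month month day-of-week year
    Actions=[{"JobName": "my-etl-job"}],
    StartOnCreation=True,
)
```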
Glue Job Types
- Spark ETL jobs: large-scale data processing (2 DPU to 100 DPU).
- Spark Streaming ETL jobs: analyze data in real time (2 DPU to 100 DPU).
- Python Shell jobs: suitable for lightweight tasks (0.0625 or 1 DPU).
- Ray jobs: suitable for parallel processing tasks.
Execution Types
- Standard: designed for predictable ETL jobs that start immediately and need consistent execution times.
- Flex execution: A cost-effective option for less time-sensitive ETL jobs that may start with some delay.
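A minimal boto3 sketch of choosing Flex for a run (the job name is a placeholder; Flex applies to Glue 3.0+ Spark jobs):

```python
import boto3

glue = boto3.client("glue")

# Flex uses spare capacity: cheaper, but the run may wait before it starts.
glue.start_job_run(
    JobName="my-etl-job",    # placeholder name
    ExecutionClass="FLEX",   # use "STANDARD" when predictable start times matter
)
```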
Glue Partitioning
- Glue automatically partitions data if it's properly organized.
- Improves query performance and reduces I/O operations.
- Glue can skip over large segments within partitioned data and process each partition independently (parallel processing).
- Improves cost efficiency because queries scan less data.
- Partitioning can be defined as part of the ETL job script or within the Glue Data Catalog.
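Both sides of partitioning in a Glue script, as a minimal sketch with placeholder database, table, and path names: writing with partition keys, then reading one partition back with a pushdown predicate so the rest of the data is skipped:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_ctx = GlueContext(SparkContext.getOrCreate())

events = glue_ctx.create_dynamic_frame.from_catalog(database="analytics_db", table_name="events")

# Write partitioned by year/month so queries only touch the folders they need.
glue_ctx.write_dynamic_frame.from_options(
    frame=events,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/events_curated/", "partitionKeys": ["year", "month"]},
    format="parquet",
)

# Read back only one partition: the predicate is pushed down, skipping all other partitions.
january = glue_ctx.create_dynamic_frame.from_catalog(
    database="analytics_db",
    table_name="events_curated",  # assumes a crawler has cataloged the partitioned output
    push_down_predicate="year == '2024' and month == '01'",
)
```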
Glue DataBrew
- A data preparation tool with a visual interface for cleaning and formatting data
- Offers 250+ pre-built transformations and no-code data preparation capabilities.
- Automates and schedules data preparations.
Integration
- Amazon S3 -> AWS Glue DataBrew -> Amazon Redshift
DataBrew Components
- Project: Used to configure transformation tasks.
- Step: An applied transformation to the dataset.
- Recipe: A set of transformation steps that can be saved and reused.
- Job: The execution of a recipe on a dataset that outputs to locations such as S3.
- Schedule: Schedules jobs to automate transformation.
- Data profiling: Used to understand the quality and characteristics of your data.
Transformations
- Nest to Map: Converts columns into a map.
- Nest to Array: Converts columns into an array.
- Nest to Struct: Similar to Nest to Map but retains exact data and order.
- Unnest Array: Converts an array to columns.
- Pivot: Pivots columns and values to rotate data from rows into columns.
- Unpivot: Converts columns into rows.
- Transpose: Switches columns and rows.
- Additional transformations include join, split, filter, sort, date/time conversions, and count distinct.
Cost
- DataBrew interactive sessions cost $1.00 per session (each session is 30 minutes).
- DataBrew jobs cost $0.48 per node-hour.
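A quick cost check for a DataBrew job, using the node-hour rate above (the node count and runtime are made-up values):

```python
NODE_HOUR_RATE = 0.48  # USD per DataBrew job node-hour

def databrew_job_cost(nodes: int, runtime_minutes: float) -> float:
    return nodes * (runtime_minutes / 60) * NODE_HOUR_RATE

print(databrew_job_cost(5, 10))  # 0.4 -> $0.40 for a 10-minute job on 5 nodes
```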