AWS Glue Overview and Database

Questions and Answers

What role does AWS Glue Data Catalog play in data integration?

It manages user permissions and access to data lakes.

It executes ETL jobs and performs data transformation.

It directly stores the actual data used for analytics.

It serves as a persistent metadata store for data schemas and types. (correct)

Which statement accurately describes the function of Glue Crawlers?

They are used for data preparation without any coding.

They store the actual data and its associated schemas.

They create metadata tables in the AWS Glue Data Catalog from various data sources. (correct)

They execute ETL jobs based on scheduled triggers.

What is the purpose of partitioning in AWS S3 when using AWS Glue?

To maintain the physical structure of the S3 storage.

To reduce the amount of data transferred during queries.

To improve data security by encrypting individual partitions.

To make queries faster and more efficient by avoiding brute force access. (correct)

How does AWS Glue facilitate data transformation in its jobs?

By autogenerating code and using DynamicFrame for data processing.

Which of the following correctly describes AWS Glue DataBrew?

It is a visual data preparation tool that allows for data cleaning without coding.

What capability does Glue provide that Lake Formation does not?

Control permissions at the table level

Which feature allows you to deduplicate a dataset in Glue?

FindMatches

How does the Pushdown Predicate optimization technique benefit data processing in Glue?

It retrieves only necessary data based on metadata.

What is the primary purpose of a Glue workflow?

Monitor and control multiple jobs and crawlers.

What does the Glue Schema Registry primarily facilitate?

Managing and enforcing data schemas across AWS services.

Which of the following describes the Glue PySpark Transform called Relationalize?

Flattens nested schemas and pivots out array columns.

What is the billing structure for using Glue's crawlers and ETL jobs?

Hourly rate billed by the second.

What does the groupSize parameter achieve when reading a larger number of small files in Glue?

Increases processing efficiency by adjusting partition sizes.

What role does Glue's Data Quality feature serve?

It detects data quality issues through machine learning.

Which distinguishing factor sets the Glue flex execution option apart from standard execution?

Flex execution utilizes spare capacity to reduce costs.

Which component of AWS Glue is responsible for creating metadata tables in the data catalog?

Glue Crawler

What primary function does the AWS Glue Data Catalog serve?

Acts as a persistent metadata store

Which statement accurately describes the purpose of an AWS Glue database?

Organizes metadata for a data store

What does the Glue connection facilitate for users in AWS Glue?

Connection to external data stores using connection properties

In the context of AWS Glue, what is the primary role of Glue triggers?

To schedule and execute jobs based on events or time

Which of the following is NOT a trigger type available for Glue workflows?

Manual trigger

What is the primary benefit of using the CatalogPartitionPredicate in Glue?

Optimizes data retrieval by filtering partition indexes

Which statement best describes Data Processing Units (DPUs) in AWS Glue?

Increasing the number of DPUs can lead to faster job execution.

Which feature allows you to monitor and execute multiple jobs and crawlers within Glue?

Glue Workflow

How does Glue handle the detection of PII data?

It utilizes machine learning anomaly detection for identification.

What is a key difference between Standard Execution and Flex Execution in Glue?

Standard Execution guarantees faster job start times.

What optimization technique is employed by Glue when using Pushdown Predicate?

It applies retrieval logic directly at the source.

Which Glue PySpark Transform is designed to apply declarative mapping to a DynamicFrame?

ApplyMapping

What happens when using the Repartition feature in Glue?

It aggregates smaller partitions into larger ones.

Which statement correctly describes the billing for Glue's data catalog services?

First million object accesses are free.

Study Notes

AWS Glue Overview

Fully managed Extract, Transform, Load (ETL) service that simplifies data integration.
Uses Apache Spark as its ETL engine for scalable data processing.
Serverless architecture allows for data discovery, preparation, and ETL tasks without managing infrastructure.

AWS Glue Data Catalog

Persistent metadata store that maintains information such as data locations, schemas, types, and classifications.
References source and target data used for ETL jobs, with actual data often stored externally (e.g., in S3).
Essential for building data warehouses or data lakes as it organizes and catalogs data.

AWS Glue Database

Organizes metadata to represent data stores, such as those in S3.
Consists of associated Data Catalog table definitions, grouped for easier management.

AWS Glue Table

Represents the schema of data stored in locations like S3.
Helps define how data can be queried and accessed.

AWS Glue Partitions

Organizes data in S3 under key prefixes such as year/month/day.
Improves query performance because a query can read only the matching partitions instead of scanning the whole dataset.
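
To make the pruning idea concrete, here is a minimal plain-Python sketch. It is not Glue code: it assumes Hive-style `key=value` prefixes, and the bucket name, object keys, and the helpers `partition_prefix`, `partition_values`, and `prune` are all hypothetical.

```python
from datetime import date

# Hypothetical object keys laid out with Hive-style year/month/day prefixes.
KEYS = [
    "s3://example-bucket/sales/year=2024/month=01/day=15/part-0000.parquet",
    "s3://example-bucket/sales/year=2024/month=01/day=16/part-0000.parquet",
    "s3://example-bucket/sales/year=2024/month=02/day=01/part-0000.parquet",
    "s3://example-bucket/sales/year=2023/month=12/day=31/part-0000.parquet",
]


def partition_prefix(base: str, d: date) -> str:
    """Build the prefix under which data for a given day is stored."""
    return f"{base}/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"


def partition_values(key: str) -> dict:
    """Extract the key=value partition segments from an object key."""
    parts = (seg.split("=", 1) for seg in key.split("/") if "=" in seg)
    return {name: value for name, value in parts}


def prune(keys, **wanted):
    """Keep only keys whose partition values match every requested value."""
    return [
        k for k in keys
        if all(partition_values(k).get(name) == value for name, value in wanted.items())
    ]


january = prune(KEYS, year="2024", month="01")
print(len(january), "of", len(KEYS), "objects need to be read")
# prints: 2 of 4 objects need to be read
```

The filter only inspects the key names, never the file contents, which is why a partitioned layout avoids a brute-force scan.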

Glue Crawler

A program that connects to data stores (e.g., S3) to automatically create metadata tables in the Glue Data Catalog.
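
As a rough illustration of what "creating metadata tables" means, the toy sketch below infers column types from a few sample records and returns a catalog-like table definition. It is an assumption-laden stand-in, not how a real crawler works internally: the functions `infer_type` and `crawl`, the type names chosen, and the sample data are all hypothetical.

```python
def infer_type(values):
    """Pick the narrowest type name that fits every non-null sample value."""
    seen = {type(v) for v in values if v is not None}
    if not seen:
        return "string"
    if seen <= {bool}:
        return "boolean"
    if seen <= {int}:
        return "bigint"
    if seen <= {int, float}:
        return "double"
    return "string"


def crawl(table_name, location, records):
    """Return a catalog-like table definition inferred from sample records."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return {
        "Name": table_name,
        "Location": location,  # the data itself stays where it is
        "Columns": [
            {"Name": name, "Type": infer_type(values)}
            for name, values in columns.items()
        ],
    }


samples = [
    {"order_id": 1, "amount": 19.99, "coupon": None},
    {"order_id": 2, "amount": 5, "coupon": "SPRING"},
]
table = crawl("orders", "s3://example-bucket/orders/", samples)
for column in table["Columns"]:
    print(column["Name"], column["Type"])
```

Note that the result holds only names, types, and a location; the records themselves are not copied into it.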

Glue Connections

Objects in the Data Catalog that hold the properties needed to connect to a data store, such as connection strings and credentials.

Glue Job

Handles all aspects of ETL tasks, including transformations, sources, and targets.
Automatically generates code to perform ETL operations using DynamicFrame.

Glue Triggers

Scheduled triggers allow for ETL jobs to run at defined times (e.g., daily at 8 PM).
Event-based triggers initiate jobs upon specific occurrences.
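
The "daily at 8 PM" example could be expressed roughly as follows. This sketch only builds the request dictionary, using the parameter names of the boto3 Glue `create_trigger` call; the helper `scheduled_trigger_request` and the trigger and job names are hypothetical, and the schedule is assumed to be in UTC.

```python
def scheduled_trigger_request(trigger_name: str, job_name: str, hour_utc: int) -> dict:
    """Build keyword arguments for a daily scheduled trigger.

    Glue schedules use a six-field cron expression evaluated in UTC:
    cron(minutes hours day-of-month month day-of-week year).
    """
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": f"cron(0 {hour_utc} * * ? *)",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


request = scheduled_trigger_request("nightly-sales-trigger", "sales-etl", hour_utc=20)
print(request["Schedule"])  # prints: cron(0 20 * * ? *)

# With boto3 installed and credentials configured, the dict could be passed as:
#   boto3.client("glue").create_trigger(**request)
```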

Glue DataBrew

A visual data preparation tool enabling users to clean and normalize data without requiring coding skills.
Supports data masking for protecting personally identifiable information (PII) and can eliminate outliers from datasets.

Glue Permissions

Access control at the table level can be managed through the Glue Data Catalog.
Finer-grained permissions (column, row, and cell level) require integration with Lake Formation.

Glue Studio

Provides a visual interface for creating and managing ETL jobs efficiently.

Glue Workflows

Facilitates the creation and visualization of complex ETL activities involving multiple crawlers, jobs, and triggers.
Supports both scheduled and on-demand job executions, as well as event-based triggers using EventBridge.

Glue Pricing

Crawlers and ETL jobs are charged at an hourly rate per DPU, billed by the second.
Data Catalog usage is billed monthly for objects stored and requests made; the first million objects stored and the first million requests each month are free.
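
A short worked example of "hourly rate, billed by the second". The rate of 0.44 per DPU-hour and the optional minimum billing duration are assumptions for illustration only; check the current AWS price list for real figures. The function name `job_cost` is hypothetical.

```python
def job_cost(dpus: float, runtime_seconds: int, rate_per_dpu_hour: float,
             minimum_seconds: int = 0) -> float:
    """Estimate the cost of one run: DPUs x billed hours x hourly rate."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour


# 10 DPUs for 15 minutes at an assumed 0.44 per DPU-hour.
cost = job_cost(dpus=10, runtime_seconds=900, rate_per_dpu_hour=0.44)
print(f"${cost:.2f}")  # prints: $1.10
```

Because billing is per second, a 15-minute run pays for a quarter of an hour rather than a full hour.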

Data Processing Units (DPUs)

Resources assigned to run ETL jobs; increased DPUs lead to faster processing but higher costs.

Glue Schema Registry

A data discovery feature designed to manage and enforce schemas across AWS services.

Glue PySpark Transforms

GlueTransform serves as a base class for ETL operations.
ApplyMapping applies mapping transformations to DynamicFrames.
FindMatches identifies and deduplicates matching records within DynamicFrames.
Join operation merges two DynamicFrames based on specified equality conditions.
Map function allows transformation of DynamicFrames by applying functions to each record.
Relationalize flattens nested frames and pivots out array columns.
SelectFields specifies column selections for DynamicFrames.
Spigot samples records, enabling verification of the transformations performed by the Glue Job.
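
To show what "flattens nested frames and pivots out array columns" means in practice, here is a simplified plain-Python sketch that works on lists of dicts rather than DynamicFrames. It is loosely modeled on the behavior described above and is not the Glue implementation: the functions `flatten` and `relationalize`, the child-table naming, and the `id`/`index`/`val` column names are assumptions.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; leave lists untouched."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def relationalize(records, root="root"):
    """Return {table_name: rows}: a flat root table plus one table per array column."""
    tables = {root: []}
    next_id = 1
    for record in records:
        row = flatten(record)
        for column, value in list(row.items()):
            if isinstance(value, list):
                child = tables.setdefault(f"{root}_{column}", [])
                for index, item in enumerate(value):
                    child.append({"id": next_id, "index": index, "val": item})
                # The root row keeps a join key in place of the array.
                row[column] = next_id
                next_id += 1
        tables[root].append(row)
    return tables


orders = [
    {"id": 7, "customer": {"name": "Ada", "city": "London"}, "items": ["pen", "ink"]},
]
tables = relationalize(orders)
print(tables["root"])
print(tables["root_items"])
```

The nested `customer` struct becomes the flat columns `customer.name` and `customer.city`, while the `items` array moves to its own table that joins back to the root row through `id`.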

Glue Jobs with Pushdown Predicate

Optimization technique that minimizes data retrieval load by pushing filtering logic closer to the data source, enhancing performance for large datasets.
Allows selective reading of necessary data instead of loading entire datasets.

Glue Flex Execution

Jobs run on spare capacity for cost savings, beneficial when immediate job start times are not critical.
Standard execution suits time-sensitive workloads that need fast job startup, at a higher cost.
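
The execution class is chosen per job run. The sketch below builds the request dictionary using the `JobName` and `ExecutionClass` parameter names of the boto3 Glue `start_job_run` call; the helper `job_run_request` and the job name are hypothetical.

```python
def job_run_request(job_name: str, time_sensitive: bool) -> dict:
    """Pick FLEX for runs that can wait for capacity, STANDARD otherwise."""
    return {
        "JobName": job_name,
        "ExecutionClass": "STANDARD" if time_sensitive else "FLEX",
    }


nightly = job_run_request("sales-etl", time_sensitive=False)
print(nightly["ExecutionClass"])  # prints: FLEX

# With boto3 installed and credentials configured, the dict could be passed as:
#   boto3.client("glue").start_job_run(**nightly)
```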

Glue Data Quality

Detects data quality issues using machine learning for anomaly detection.
Enables enforcement of data quality checks on the data catalog and ETL pipelines.

Detect PII Transforms

Facilitates the identification, masking, or removal of PII data during ETL processes.
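
The identify-then-mask flow can be sketched in plain Python with regular expressions. This is only an illustration of the idea, not Glue's detection logic: the two patterns, the helpers `detect_pii` and `mask_pii`, and the sample row are assumptions.

```python
import re

# Illustrative patterns only; real PII detection covers many more entity types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_pii(text):
    """Return the sorted names of the patterns that match the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


def mask_pii(text, replacement="*******"):
    """Replace every match of every pattern with a fixed mask string."""
    for pattern in PATTERNS.values():
        text = pattern.sub(replacement, text)
    return text


row = {"note": "Contact ada@example.com, SSN 123-45-6789", "amount": "42"}
masked = {column: mask_pii(value) for column, value in row.items()}
print(detect_pii(row["note"]))  # prints: ['EMAIL', 'US_SSN']
print(masked["note"])           # prints: Contact *******, SSN *******
```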

CatalogPartitionPredicate vs. PushdownPredicate

CatalogPartitionPredicate utilizes partition indexes for improved processing times on heavily partitioned tables.
PushdownPredicate filters on partition columns using catalog metadata, after the partitions have been listed, so that only matching partitions are read instead of the entire dataset.
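
The two predicates can be combined in a single catalog read. The sketch below only builds the keyword arguments, using the `push_down_predicate` parameter and the `catalogPartitionPredicate` key of `additional_options` from the Glue PySpark API; the helper `catalog_read_options`, the database and table names, and the assumption that the partition columns are zero-padded strings are all hypothetical.

```python
def catalog_read_options(database, table, year, month, min_day):
    """Build keyword arguments for create_dynamic_frame.from_catalog."""
    return {
        "database": database,
        "table_name": table,
        # Applied to partition columns after the partitions are listed.
        "push_down_predicate": f"day >= '{min_day:02d}'",
        # Evaluated in the Data Catalog, which can use partition indexes.
        "additional_options": {
            "catalogPartitionPredicate": f"year='{year}' and month='{month:02d}'"
        },
    }


options = catalog_read_options("sales_db", "orders", 2024, 1, 10)
print(options["push_down_predicate"])
print(options["additional_options"]["catalogPartitionPredicate"])

# Inside a Glue job, where glueContext is a GlueContext, this could be used as:
#   frame = glueContext.create_dynamic_frame.from_catalog(**options)
```

Filtering in the catalog first keeps the partition listing small on heavily partitioned tables; the pushdown predicate then narrows what is read from storage.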

Repartition

Changes how many partitions a dataset has; reducing the count yields fewer, larger partitions, which makes reads more efficient.
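
A minimal plain-Python sketch of merging many small partitions into fewer, larger ones. It illustrates only the merging case and is not Spark's algorithm (a real repartition shuffles data and can also increase the partition count); the function `coalesce` and the sample data are hypothetical.

```python
def coalesce(partitions, target_count):
    """Merge neighbouring partitions until only target_count remain."""
    if target_count < 1:
        raise ValueError("target_count must be at least 1")
    if target_count >= len(partitions):
        return [list(p) for p in partitions]
    merged = [[] for _ in range(target_count)]
    for index, partition in enumerate(partitions):
        # Map each source partition to a target bucket, preserving order.
        merged[index * target_count // len(partitions)].extend(partition)
    return merged


small = [[1], [2, 3], [4], [5, 6], [7], [8]]
large = coalesce(small, 2)
print(large)  # prints: [[1, 2, 3, 4], [5, 6, 7, 8]]
```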

Workflow vs. Job vs. Crawler

Workflow orchestrates multiple jobs and crawlers for monitoring and execution.
Job carries out specific tasks required for ETL processes.
Crawler scans data stores and populates the Glue Data Catalog with metadata.