# Glue deep dive
## Glue cost

## Crawlers
- hourly rate based on the number of DPUs used
- billed per second with a 10-minute minimum

## What are DPUs?
- DPU = Data Processing Unit
- a single DPU provides 4 vCPUs and 16 GB of memory
- Workers = DPUs

## Data catalog
- up to a million objects for free
- $1 per 100,000 objects over a million, per month

## ETL Jobs
- hourly rate based on the number of DPUs used
- billed per second with a 10-minute minimum
- AWS Glue versions 2.0 and later have a 1-minute minimum

How many DPUs are used?
- Apache Spark: minimum 2, default 10
- Spark Streaming: minimum 2, default 10
- Ray jobs (ML/AI): uses M-DPUs (higher memory); minimum 2, default 6

Cost of DPUs
- $0.44 per DPU-hour (may differ depending on region)

## Glue notebooks / interactive sessions
- used to interactively develop ETL code in notebooks
- billed on the time the session is active and the number of DPUs
- configurable idle timeouts
- 1-minute minimum billing
- minimum of 2 DPUs, default 5

ETL job example (a worked billing sketch follows the partitioning section below):
- Apache Spark job
- runs for 15 minutes
- uses 6 DPUs
- 1 DPU-hour is $0.44
- cost: 6 DPUs × 0.25 hours × $0.44 = $0.66

## Stateful vs Stateless
- Stateful: the system remembers past interactions and lets them influence future ones (job bookmarks in Glue: like a snapshot, only the new data is loaded)
- Stateless: the system processes each request independently, without relying on past interactions (a crawler without bookmarks loads everything in the bucket again, including files it has already processed)

## Data ingestion in AWS
- Amazon Kinesis: supports both stateful and stateless data processing
- AWS Data Pipeline: orchestrates workflows for both stateful and stateless data ingestion
- Glue: offers ETL jobs with features like job bookmarks for tracking progress

## Glue: Extract, Transform, and Load
Extract:
- RDS, Aurora, DynamoDB
- Redshift
- S3, Kinesis

Transform:
- Filtering: remove unnecessary data
- Joining: combine data
- Aggregation: summarize data
- Find Matches (ML): identify records that refer to the same entity (duplicates)
- Detect PII (Personally Identifiable Information): identify and manage sensitive info

CSV > Parquet > JSON > XML

## Glue workflows
- orchestrate multi-step data processing jobs; manage execution and monitoring of jobs and crawlers
- ideally used for managing AWS Glue operations
- provides a visual interface
- you can create workflows manually
- Triggers: initiate jobs and crawlers
  - Scheduled triggers: start the workflow at regular intervals
  - On-demand triggers: start the workflow manually from the AWS console
  - EventBridge: launches the workflow based on specific events captured by EventBridge

![[BDB1575-image001.jpg]]

## Glue job types
- Spark ETL jobs: large-scale data processing (2 to 100 DPUs)
- Spark Streaming ETL jobs: analyze data in real time (2 to 100 DPUs)
- Python shell jobs: suitable for lightweight tasks (0.0625 or 1 DPU)
- Ray jobs: suitable for parallel processing tasks

Execution types
- Standard: designed for predictable ETL jobs; jobs start running immediately; provides consistent job execution times
- Flex execution: cost-effective option for less time-sensitive ETL jobs; jobs may start with some delay

## Glue partitioning
- Glue does it automatically if the data is properly organized
- enhances the performance of Glue
- provides better query performance
- reduces I/O operations
- AWS Glue can skip over large segments within partitioned data
- Glue can process each partition independently (parallel processing)
- provides cost efficiency by reducing query effort
- in Glue, partitioning is defined as part of the ETL job script (it can also be configured in the Glue Data Catalog); a minimal sketch follows this list
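As a concrete illustration of defining partitioning in an ETL job script, here is a minimal PySpark sketch. The bucket paths and column names are hypothetical, and a real Glue job would normally obtain its Spark session from a GlueContext rather than building it directly:

```python
# Minimal sketch of writing partitioned output from an ETL script.
# Paths and column names are hypothetical; in an actual Glue job the
# Spark session usually comes from a GlueContext.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write-sketch").getOrCreate()

# Read raw CSV data (e.g. landed by an ingestion process).
events = spark.read.option("header", "true").csv("s3://example-bucket/raw/events/")

# Write Parquet partitioned by year and month. Each distinct (year, month)
# pair becomes its own S3 prefix, so later queries can skip whole partitions
# and Glue can process partitions in parallel.
(events.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3://example-bucket/curated/events/"))
```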
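Returning to the pricing rules above, this small Python sketch shows how per-second DPU billing with a minimum billed duration works out for the 15-minute, 6-DPU example. The rate and minimums are the ones stated in these notes; actual pricing is region-dependent:

```python
# Sketch of the Glue DPU-hour billing rules described in the cost notes above
# (rate and minimums as stated there; actual pricing varies by region).

def glue_job_cost(dpus: int, runtime_seconds: float,
                  rate_per_dpu_hour: float = 0.44,
                  minimum_seconds: int = 60) -> float:
    """Cost of one run: per-second billing with a minimum billed duration.

    minimum_seconds is 60 for Glue 2.0+ ETL jobs and 600 for crawlers or
    pre-2.0 jobs, per the notes above.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour

# The worked example above: a Spark job with 6 DPUs running for 15 minutes.
print(round(glue_job_cost(dpus=6, runtime_seconds=15 * 60), 2))  # -> 0.66
```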
## Glue DataBrew
- data preparation tool with a visual interface
- for cleaning and data formatting
- 250+ pre-built transformations
- no-code data preparation
- schedule and automate data preparation

Amazon S3 (data lake) -> AWS Glue DataBrew -> Amazon Redshift

Tool with a visual interface:
- Project: where you configure transformation tasks
- Step: a transformation applied to your dataset
- Recipe: a set of transformation steps; can be saved and reused
- Job: execution of a recipe on a dataset; output to locations such as S3
- Schedule: schedule jobs to automate transformations
- Data profiling: understand the quality and characteristics of your data

Transformations (a small pandas illustration of pivot, unpivot, and transpose appears after the cost list below):
- Nest to map: convert columns into a map
- Nest to array: convert columns into an array
- Nest to struct: like nest to map, but retains the exact data and order
- Unnest array: convert an array back into columns
- Pivot: pivot a column and its values to rotate data from rows into columns
- Unpivot: convert columns into rows
- Transpose: switch columns and rows
- also available: join, split, filter, sort, date/time conversions, count distinct

Cost
- interactive sessions: $1 per session
- DataBrew jobs: $0.48 per node-hour
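DataBrew applies these reshaping steps visually, without code. Purely for intuition, here is a small pandas sketch (hypothetical column names; this is not the DataBrew API) of what pivot, unpivot, and transpose do to a tiny dataset:

```python
# Illustration (not DataBrew itself) of three reshaping transformations
# from the list above, using pandas on a tiny hypothetical dataset.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 80, 95],
})

# Pivot: rotate row values into columns (one column per quarter).
pivoted = sales.pivot(index="region", columns="quarter", values="revenue")

# Unpivot: melt the quarter columns back into rows.
unpivoted = pivoted.reset_index().melt(id_vars="region",
                                       var_name="quarter",
                                       value_name="revenue")

# Transpose: switch rows and columns wholesale.
transposed = pivoted.T

print(pivoted, unpivoted, transposed, sep="\n\n")
```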