AWS Certified Data Engineer Associate PDF
Summary
These notes cover data characteristics, repositories (data warehouses and data lakes), data pipelines, data sources and formats, and database performance optimization. The document discusses structured, unstructured and semi-structured data, with examples of different formats such as JSON, CSV, Avro, and Parquet. It also touches on data modeling like the star schema, data lineage, schema evolution, and database performance optimizations such as indexing and partitioning.
Full Transcript
Data Characteristics Monday, April 22, 2024 10:33 PM
Types of data
Structured
○ Organized in a defined manner or schema
○ Found in relational db
○ Easily queryable
○ Organized in rows and columns
○ Has a consistent structure
○ Ex: database tables, CSV files, Excel
Unstructured
○ No predefined structure or schema
○ Not easily queryable w/o preprocessing
○ May come in various formats
○ Ex: text files w/o fixed format, video/audio files, images, emails, word docs
Semi-structured
○ Not as organized as structured data but has some level of structure in the form of tags, hierarchies, or other patterns
○ More flexible than structured but not as chaotic as unstructured
○ Ex: XML, JSON, email headers, log files w/ varied formats
Properties of data
Volume
○ Amount/size of data
Velocity
○ Speed at which new data is generated, collected, and processed
▪ High velocity → real time processing capabilities
Variety
○ Different types, structures, and sources of data
Data Engineering Fundamentals Page 1
Data Repositories Monday, April 22, 2024 10:41 PM
Data Warehouses
Centralized repo optimized for complex queries and analysis
Data is stored in structured format
○ Cleaned, transformed, and loaded (ETL)
Typically star or snowflake schema
Optimized for read-heavy operations
Ex: Amazon Redshift
Think of a data mart as a shop that is getting/using the data for a different purpose
Use when:
○ Have structured data that requires fast and complex queries
○ Data integration from different sources
○ BI and analytics are primary use cases
Data Lakes
Storage repo that holds vast amounts of raw data in native format, including all types of data
Can store large volumes of raw data w/o predefined schema
No preprocessing of data
Supports batch, real-time, and stream processing
Can be queried for data transformation or exploration purposes
Ex: Amazon S3
○ AWS Glue can be used to extract a structure/schema from S3 and then Amazon Athena can use the Glue catalog to figure out how to query
Use when:
○ Mix of data types
○ Need a scalable and cost effective solution to store data
○ Future needs of data are uncertain and want flexibility in storage/processing
○ Advanced analytics, ML, data discovery are key goals
Comparison of Data Warehouse vs Data Lake
○ Schema: DW – schema on write (predefined schema before writing data), ETL; DL – schema on read (schema defined at reading), ELT
○ Data types: DW – structured; DL – structured/unstructured
○ Agility: DW – less flexible (due to predefined schema); DL – more agile
○ Cost: DW – more expensive b/c of optimizations for complex queries; DL – cost effective but can rise when processing large amounts of data
Data Lakehouse
Hybrid data structure that combines best features of DL and DW
Supports both data types
Allows for schema on write and on read
Provides capabilities for detailed analytics and ML tasks
Built on top of cloud or distributed architectures
Ex: AWS Lake Formation (w/ S3 and Redshift Spectrum)
Data Mesh
Individual teams own data products w/in a given domain
○ Data products have various use cases around the organization
○ Federated governance w/ central standards
○ DL, DW may be a part of it
Data Engineering Fundamentals Page 2
Data Pipelines Monday, April 22, 2024 10:54 PM
Extract
Retrieve raw data from source systems
Ensure data integrity during extraction
Can be done in real time or in batches
Transform
Convert data into format suitable for target DW
Could include data cleaning, enrichment, format changes, computations, encoding/decoding, handling missing values
Load
Moving data into target DW or data repo
Can be done in batches or in streaming
Ensure data integrity during loading
Managing
Must be automated in a reliable way
○ AWS Glue
○ Orchestration services
▪ EventBridge
▪ Amazon MWAA
▪ AWS Step Functions
▪ Lambda
▪ Glue Workflows
Data Engineering Fundamentals Page 3
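To make the Extract / Transform / Load steps above concrete, here is a minimal sketch in Python, assuming pandas and pyarrow are installed; the file names and column names are hypothetical and only meant to illustrate the three stages.

```python
# Minimal ETL sketch: extract raw CSV, transform it, load it as Parquet.
# File and column names are hypothetical; requires pandas + pyarrow.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: retrieve raw data from the source system
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean, enrich, and compute before loading into the target
    df = df.dropna(subset=["order_id"])                   # handle missing values
    df["order_date"] = pd.to_datetime(df["order_date"])   # format change
    df["total"] = df["quantity"] * df["unit_price"]       # computation/enrichment
    return df

def load(df: pd.DataFrame, path: str) -> None:
    # Load: write to a columnar format suited to a warehouse or data lake
    df.to_parquet(path, index=False)

if __name__ == "__main__":
    load(transform(extract("orders.csv")), "orders.parquet")
```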
Data Sources and Formats Monday, April 22, 2024 10:56 PM
Common data sources
o JDBC – Java Database Connectivity
▪ Platform independent – based on Java
▪ Language dependent – based on Java
o ODBC – Open Database Connectivity
▪ Platform dependent b/c you need specific drivers
▪ Language independent
o Raw logs
o APIs
o Streams
Common data formats
o CSV
▪ Use for
Small to medium datasets
Data interchange between systems w/ different technologies
For human readable and editable data storage
Importing/exporting data from db or spreadsheets
▪ Systems: DB (SQL based), Excel, pandas, R, ETL tools
o JSON
▪ Human readable
▪ Can be structured or semi structured based on key-value pairs
▪ Use for
Data interchange between web server and web client
Config and settings for apps
Use cases that need flexible schema or nested data structures
▪ Systems: web browsers, programming languages, RESTful APIs, NoSQL
o Avro
▪ Binary format that stores both data and its schema
▪ Use for
Big data and real time processing systems
When schema changes
Efficient serialization for data transport between systems
▪ Systems: Kafka, Spark, Hadoop
o Parquet
▪ Columnar storage format optimized for analytics – when paying attention to specific columns and not necessarily all the rows
▪ Allows for efficient compression and encoding schemes
▪ Use for
Analyzing large datasets w/ analytics engines
Use cases where only specific columns are read
Storing data on distributed systems where I/O operations and storage need optimization
▪ Systems: Hadoop, Spark, Hive, Impala, Amazon Redshift Spectrum
Data Engineering Fundamentals Page 4
Misc Monday, April 22, 2024 10:59 PM
Data modeling
Star schema
○ Think: have a table with 'facts' and then tables that branch out to other dimension tables to give more info about each 'fact'
○ Fact tables
○ Dimensions
○ Primary/foreign keys
○ ERD
Data lineage
Visual rep that traces flow and transformation of data through its lifecycle, from source to final destination
Think: a workflow of data
Importance
○ Helps in tracking errors
○ Ensures compliance
○ Good documentation for others
Schema evolution
Ability to adapt and change the schema of a dataset over time w/o disrupting existing processes or systems
Importance
○ Ensures data systems can adapt to changing business reqs
○ Allows for addition/removal/mod of cols/fields in dataset
○ Maintains backward compatibility
Database performance optimization
Indexing
○ Avoid full table scans
○ Enforce data uniqueness and integrity
Partitioning
○ Reduce amount of data scanned
○ Helps w/ data lifecycle management
○ Enables parallel processing
Compression
○ Speed up data transfer, reduce storage, and disk reads
○ Columnar compression
Data sampling techniques
Random sampling
Stratified sampling
Systematic sampling
○ A rule for sampling (like every third item)
Data skew mechanisms
Data skew = unequal distribution or imbalance of data across various nodes/partitions in distributed computing systems
"The celebrity problem" – Brad Pitt could have more traffic and so whatever partition is processing his info would cause an imbalance
Data Engineering Fundamentals Page 5
Causes
○ Non-uniform distribution
○ Inadequate partitioning strategy
○ Temporal skew – recent partitions could be larger than older ones
Important to monitor data distribution
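As a small illustration of the "celebrity problem" above, the sketch below counts records per partition key and flags keys holding far more than their share; the sample records and the 2x-average threshold are made up for the example.

```python
# Illustrative data-skew check: count records per partition key and flag
# keys that hold a disproportionate share ("the celebrity problem").
from collections import Counter

records = [{"user": "brad_pitt"}] * 80 + [{"user": f"user_{i}"} for i in range(20)]

counts = Counter(r["user"] for r in records)
mean = len(records) / len(counts)          # average records per key

for key, n in counts.most_common():
    if n > 2 * mean:                       # simple skew rule: > 2x the average
        print(f"hot key: {key} -> {n} records (average is {mean:.1f})")
```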
Data validation and profiling
Completeness
○ Check missing/null values, % of populated fields
Consistency
○ Cross field validation, comparing against different sources/periods
Accuracy
○ Comparing w/ trusted sources, validate against known rules/standards, sanity check
Integrity
○ Relationship validations
Data Engineering Fundamentals Page 6
DynamoDB Monday, April 22, 2024 11:08 PM
Characteristics
A NoSQL database service that provides fast and predictable performance w/ seamless scalability
Fully managed, highly available w/ replication across multiple AZs
NoSQL
Non-relational databases and are distributed
Don't support query joins
All the data that is needed for a query is present on one row
Doesn't perform aggregations (e.g. sum, avg)
Scale horizontally
Scales to massive workloads, distributed database
Millions of requests per second, trillions of rows, 100s of TB of storage
Fast and consistent in performance (low latency on retrieval)
Integrated w/ IAM for security, authorization, and admin
Enables event driven programming
Low cost and auto scaling
Standard and IA table classes
Structure
Made of tables
Each table has a primary key – must be decided at creation
Option 1: partition key (HASH)
○ Must be unique for each item
○ Must be diverse so that data is distributed
Option 2: partition key + sort key (HASH + RANGE)
○ Combo must be unique
○ Data is grouped by partition key
○ Think like CMID (Primary) + PNUM (Sort)
Each table can have an infinite number of items (i.e. rows)
Each item has attributes
Max size of an item is 400KB
Data types supported:
Scalar – exactly one value
○ i.e. string, number, binary, Boolean, null
Document – complex structure w/ nested attributes
○ i.e. list, map
Set – multiple scalar values
○ i.e. string set, number set, binary set
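A hedged boto3 sketch of the structure described above: a table with a composite primary key (partition key + sort key), mirroring the CMID + PNUM example. The table and attribute names are hypothetical and AWS credentials are assumed to be configured.

```python
# Sketch: define a DynamoDB table with a partition key (HASH) + sort key (RANGE).
# Table/attribute names are hypothetical; assumes configured AWS credentials.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Customers",
    KeySchema=[
        {"AttributeName": "cmid", "KeyType": "HASH"},   # partition key
        {"AttributeName": "pnum", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "cmid", "AttributeType": "S"},
        {"AttributeName": "pnum", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand mode, so no RCU/WCU planning
)
```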
Anti-patterns
Prewritten app tied to a traditional relational database
Joins or complex transactions
Binary large object (BLOB) data
Large data with low I/O rate
Database Page 7
Read/Write Capacity Modes Monday, April 22, 2024 11:09 PM
o Read/Write capacity modes – control how you manage the table's capacity (read/write throughput)
o Can switch once every 24 hours
o Internal partitions
Data is stored in partitions – copies of data that live on specific servers
Partition keys go through a hashing algorithm to determine which partition they go to
WCUs and RCUs are spread evenly across partitions
Provisioned mode (default)
You specify number of reads/writes per second
Need to plan capacity beforehand
Pay for provisioned read/write capacity units
RCU = throughput for reads – one SCR per second or two ECR per second for an item up to 4KB
Strongly consistent read - SCR
○ If we read just after a write, will get correct data
○ Set "Consistent Read" parameter to True in API calls
○ Consumes twice as many RCUs
Eventually consistent read (default) - ECR
○ If we read just after a write, possible to get some stale data bc of replication
Ex 1: 10 SCR per sec, item size 4KB
○ 10 * (4KB/4KB) = 10 RCUs
Ex 2: 16 ECR per sec, item size 12 KB
○ (16/2) * (12KB/4KB) = 24 RCUs
Ex 3: 10 SCR, item size 6KB (round up to nearest 4 KB)
○ 10 * (8KB/4KB) = 20 RCUs
WCU = throughput for writes – one write per second for an item up to 1KB (more WCUs consumed if > 1KB)
Ex 1: write 10 items per sec, item size 2KB
○ Need 10 * (2KB/1KB) = 20 WCUs
Ex 2: write 6 items per sec, item size 4.5 KB (round up)
○ 6 * (5KB/1KB) = 30 WCUs
Ex 3: write 120 items per min, item size 2KB
○ (120/60) * (2KB/1KB) = 4 WCUs
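The worked RCU/WCU examples above follow a simple rule (reads round up to 4 KB blocks, writes to 1 KB blocks, and eventually consistent reads cost half); a small helper reproducing those numbers, for illustration only:

```python
# Reproduce the RCU/WCU arithmetic from the worked examples above.
import math

def rcus(reads_per_sec: float, item_kb: float, strongly_consistent: bool) -> float:
    blocks = math.ceil(item_kb / 4)                    # round up to nearest 4 KB
    return reads_per_sec * (blocks if strongly_consistent else blocks / 2)

def wcus(writes_per_sec: float, item_kb: float) -> float:
    return writes_per_sec * math.ceil(item_kb)         # round up to nearest 1 KB

print(rcus(10, 4, True))     # Ex 1 -> 10 RCUs
print(rcus(16, 12, False))   # Ex 2 -> 24 RCUs
print(rcus(10, 6, True))     # Ex 3 -> 20 RCUs
print(wcus(10, 2))           # Ex 1 -> 20 WCUs
print(wcus(6, 4.5))          # Ex 2 -> 30 WCUs
print(wcus(120 / 60, 2))     # Ex 3 -> 4 WCUs
```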
Throttling - prevents app from consuming too many capacity units
Throughput can be exceeded temporarily using 'burst capacity'
ProvisionedThroughputExceededException – if burst capacity is already consumed
▪ Reasons:
○ Hot keys – one partition key is being read too many times (e.g. a popular item)
○ Hot partitions
○ Very large items
▪ Advised to do an exponential backoff retry (already in SDK)
▪ Distribute partition keys as much as possible
▪ If RCU issue, can use DynamoDB Accelerator (DAX)
On demand mode
Reads/writes auto scale w/ workloads
No planning needed
Pay for what you use (2.5x more expensive) in terms of RRU and WRU
Database Page 8
RRU = Read request units – throughput for reads (same as RCU)
WRU = Write request units – throughput for writes (same as WCU)
Use cases: unknown workloads, unpredictable app traffic
Database Page 9
Basic API Calls Monday, April 22, 2024 11:09 PM
Writing data
PutItem – creates a new item or fully replaces an old item (same primary key)
Consumes WCUs
UpdateItem – edits an existing item's attributes (adds a new item if it doesn't exist)
Can be used to implement atomic counters – a numeric attribute that's unconditionally incremented
Conditional Writes
Accept a write/update/delete only if conditions are met, otherwise returns an error
Helps with concurrent access to items
No performance impact
Reading data
GetItem – read based on primary key
Primary key can be either option
ECR (default)
Option to use SCR (more RCU → might take longer)
ProjectionExpression can be specified to retrieve only certain attributes
Query – returns items based on KeyConditionExpression
○ Partition key value (must be = operator) – required
○ Sort key value (=, <, <=, >, >=, BETWEEN, begins_with) – optional
Redshift Streaming Ingestion – from Kinesis Data Streams or MSK [PS1]IMPORTANT [PS2]KEY CONCEPT
Database Page 26
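A short boto3 sketch of the basic calls above (PutItem, GetItem with a ProjectionExpression, and Query with a KeyConditionExpression), reusing the hypothetical Customers table from the structure example; the table and credentials are assumed to exist.

```python
# Sketch of PutItem / GetItem / Query against a hypothetical "Customers" table.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Customers")

# PutItem: create a new item or fully replace one with the same primary key
table.put_item(Item={"cmid": "C-100", "pnum": 1, "name": "Ada"})

# GetItem: read by primary key; strongly consistent reads are opt-in
resp = table.get_item(
    Key={"cmid": "C-100", "pnum": 1},
    ConsistentRead=True,
    ProjectionExpression="#n",               # retrieve only certain attributes
    ExpressionAttributeNames={"#n": "name"}, # "name" is a reserved word
)
print(resp.get("Item"))

# Query: partition key must use equality; the sort key can take a range condition
result = table.query(
    KeyConditionExpression=Key("cmid").eq("C-100") & Key("pnum").gte(1)
)
print(result["Count"])
```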
Workload Management (WLM) Tuesday, May 21, 2024 2:52 PM
▪ Prioritize short, fast queries vs. long, slow queries
▪ Query queues
▪ Via console, CLI, API
Automatic WLM
Creates up to 8 queues
Default 5 queues w/ even memory allocation
Large queries (e.g. big hash joins) → concurrency lowered
Small queries (e.g. inserts, scans, aggs) → concurrency raised
Configuring query queues
o Priority
o Concurrency scaling mode
o User groups
o Query groups (label, tag)
o Query monitoring rules
Manual WLM
One default queue with concurrency level of 5 (i.e. 5 queries at once)
Superuser queue w/ concurrency level 1
Define up to 8 queues, up to level 50
Each can have defined
o Concurrency scaling mode
o Concurrency level
o User groups
o Query groups
o Memory
o Timeout
o Query monitoring rules
Can also enable query queue hopping – timed out queries hop to the next queue and try again
Database Page 27
Serverless Tuesday, May 21, 2024 2:52 PM
Auto scaling and provisioning for your workload
Optimizes cost and performance – pay only when in use
Easy to spin up dev and test environments
Easy ad hoc business analysis
Getting started
Need an IAM role with this policy (see ppt)
Define
○ Db name
○ Admin user credentials
○ VPC
○ Encryption settings
○ Audit logging
Can manage snapshots and recovery points after creation
Resource scaling
Capacity measured in Redshift Processing Units (RPUs)
Pay for RPU hours per second plus storage
Base RPUs
○ Can adjust base capacity
○ Defaults to AUTO
Max RPUs
○ Can set a usage limit to control costs
○ Increase it to improve throughput
Monitoring
Monitoring views
CloudWatch logs and metrics
Data sharing
Securely share live data across Redshift clusters for read purposes
Purpose
○ Workload isolation
○ Cross group collaboration
○ Sharing data between dev/test/prod
○ Licensing data access in AWS Data Exchange
Can share
○ Dbs
○ Schemas
○ Tables, views
Producer/consumer architecture
○ Producer controls security
○ Isolation to ensure producer performance unaffected by consumers
○ Data is live and transactionally consistent
Both must be encrypted and use RA3 nodes
Cross region data sharing involves transfer charges
Types of data shares
○ Standard
○ AWS Data Exchange
○ AWS Lake Formation – managed
Database Page 28
Materialized views
Contain precomputed results based on SQL queries over one or more base tables
○ (unlike a standard view, which doesn't store the results of its query)
Provide a way to speed up complex queries in a DW environment, esp on large tables
Can query materialized views
Queries return faster results since they use precomputed results w/o accessing base tables
Beneficial for predictable/recurring queries
Can be built from other materialized views
○ Useful for reusing expensive joins
CREATE MATERIALIZED VIEW
Lambda UDF
Use custom functions in AWS Lambda inside SQL queries
CREATE EXTERNAL FUNCTION
○ Must GRANT USAGE ON LANGUAGE EXFUNC for permissions
Can use the AWSLambdaRole IAM policy to grant permissions to Lambda on the cluster's IAM role
○ Need to use this role in the CREATE EXTERNAL FUNCTION command
Redshift communicates with Lambda using JSON
Federated queries
▪ Query and analyze across DB, DW, and DL
▪ Ties Redshift to Amazon RDS or Aurora for PostgreSQL or MySQL
○ Incorporate live data in RDS into your Redshift queries
○ Avoids the need for ETL pipelines
▪ Offloads computation to the DB to reduce data movement
▪ Must establish connectivity between Redshift cluster and RDS/Aurora
○ Put them in same VPC subnet or use VPC peering
▪ Credentials must be in AWS Secrets Manager
▪ Include secrets in IAM role for Redshift cluster
▪ Connect using CREATE EXTERNAL SCHEMA
▪ Read only access to external data sources
▪ Costs will be incurred on external DBs
▪ Only a one way connection so can't query Redshift from RDS/Aurora
Database Page 29
Data Sharing & Serverless Sunday, January 19, 2025 1:22 PM
Data Sharing
Useful when you need to share data in a seamless, secure way between different Redshift clusters within the same or across AWS accounts without needing to copy/move data
Examples
○ Multi-team: Marketing and Finance teams use separate Redshift clusters for their
workloads, but need access to a shared customer dataset ○ Real-time data access: A central 'producer' cluster with live sales data is accessed by multiple 'consumer' clusters for regional reporting ○ Sharing only the necessary schemas or tables with external or internal partners ○ Cross account data sharing: A parent company sharing data with its subsidiaries ○ Workload isolation: running resource-intensive ML training on one cluster while sharing data with another cluster used for lightweight reporting ○ Cost and performance optimization When not to use ○ Latency is critical ○ Clusters are in different regions ○ Static datasets ○ Sharing with non-Redshift customers Serverless Ideal for when you want the benefits of Redshifts analytics capabilities without managing or provisioning infrastructure Examples ○ Unpredictable/variable workloads: a retail business analyzing sales trends, with high activity during holiday seasons but low activity the rest of the year ○ Ad hoc or on-demand analytics: a data science team experimenting with new datasets for ML projects ○ Startups or small teams: a startup analyzing customer data but lacking the resources to manage infrastructure ○ Data sharing across teams: using serverless as a consumer to access shared datasets from a central producer cluster via Redshift Data Sharing ○ Cost-efficient development and testing: a developer creating ETL pipelines or testing new analytics queries without needing to provision a dedicated cluster ○ Burst analytics: running daily or weekly batch analytics jobs that don't justify having a constantly running cluster ○ Simplified setup When not to use ○ Highly predictable, consistent workloads ○ Heavy workloads with long running queries ○ Need for multi-region or multi-VPC deployments since Serverless only works in a single region/VPC Federated queries Allows you to query data that is stored outside of Redshift (e.g. RDS, Aurora, MySQL, S3) without needing to load the data into Redshift Database Page 30 Other Tools/Features Monday, January 20, 2025 5:08 PM Query Performance Insights Provides a comprehensive view of query performance to help understand the performance characteristics of both individual queries and overall workload Key features ○ Query execution details (execution time, CPU usage, I/O, memory) ○ Query plans - access the execution plan to understand how queries are processed (EXPLAIN) ○ Resource metrics - monitor how cluster resources are utilized ○ Query priority - assign priorities using WLM ○ Slow query identification - highlight queries causing bottlenecks or taking longer than expected Integrations with CloudWatch and Redshift Advisor Redshift Advisor Provides automated recommendations to optimize the performance of Redshift clusters, such as distribution style changes, sort key additions, etc. 
Continuously monitors the cluster
Identifies tables and materialized views that are barely used
Suggests VACUUM to reclaim storage space or ANALYZE to refresh table stats
Window functions
SQL analytical functions that perform calculations across a set of table rows related to the current row
Used for data analytics tasks
Types of functions
○ Ranking functions: RANK, DENSE_RANK, ROW_NUMBER
○ Aggregate functions: SUM, AVG, MAX, MIN
○ Value functions: LEAD, LAG, FIRST_VALUE, LAST_VALUE
Redshift Query Editor
Web-based query editor for running SQL queries and managing the cluster
Integration with S3 for loading/exporting data
Redshift Data API
Allows you to execute SQL queries programmatically without a persistent database connection
Useful for serverless applications, integration with Lambda, and REST API workflows
Can connect via JDBC/ODBC
Built-in authentication (IAM)
Stores query results temporarily
Query Monitoring Rules
Allows you to define rules to identify and take action on long running or resource-intensive queries
Database Page 31
Materialized Views Sunday, January 19, 2025 12:49 PM
When to use:
○ Improve query performance for repetitive, complex aggregations or transformations
○ Ensure semi-real-time data availability (use REFRESH)
○ Optimize read-heavy workloads
REFRESH
○ Updates materialized views with current data
○ Manually run REFRESH MATERIALIZED VIEW or automate the process using EventBridge or Lambda
Database Page 32
Misc Wednesday, May 22, 2024 6:38 PM
Concurrency Scaling
▪ Auto add cluster capacity to handle increases in concurrent read queries
▪ Supports unlimited concurrent users and queries
▪ Use WLM queues to manage which queries are sent to the concurrency scaling cluster
Short Query Acceleration (SQA)
▪ Prioritize short running queries over longer running ones
▪ Short queries run in a dedicated space so they won't wait behind long queries
▪ Can be used in place of WLM queues for short queries
▪ Works with:
CREATE TABLE AS
Read only queries (SELECT)
▪ Uses ML to predict a query's execution time
▪ Can configure how many seconds counts as short
VACUUM
▪ Recovers space from deleted rows and restores sort order
▪ Four types of commands
VACUUM FULL
o Default
o Resorts all rows and reclaims space from deleted rows
VACUUM DELETE ONLY
o Only reclaims space (no re-sort)
VACUUM SORT ONLY
o Only re-sorts
VACUUM REINDEX
o Reinitializes interleaved sort indexes and then performs a full vacuum
Anti patterns
▪ Small datasets – use RDS instead
▪ OLTP – use RDS or DynamoDB
▪ Unstructured data – ETL first with EMR
▪ BLOB data – store a reference to large binary files in S3, not the files themselves
Resizing types
▪ Elastic
Quickly add or remove nodes of the same type
Cluster is down for a few minutes
Tries to keep connections open across the downtime
Limited to doubling or halving for some node types
▪ Classic
Change node type and/or number of nodes
Cluster is read only for hours to days
▪ Can take a snapshot of the cluster, restore it, and resize the new cluster
Used to keep the cluster available during classic resizing
Newer features
▪ RA3 nodes with managed storage
Database Page 33
New node type optimized for Redshift
Decoupled compute and storage to enable independent scaling
SSD-based
▪ Data lake export (see the sketch below)
Unload Redshift query results to S3 in Parquet format
Parquet is faster to unload and consumes less storage
Compatible with Spectrum, Athena, EMR, SageMaker
Auto partitioned
▪ Spatial data types
Geometry, geography
▪ Cross region data sharing
Share live data across Redshift clusters without copying
Requires new RA3 node type
Secure, across regions and accounts
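As a sketch of the data lake export feature in the list above, an UNLOAD-to-Parquet statement can be submitted through the Redshift Data API; the cluster, database, S3 bucket, and IAM role names below are hypothetical, and the role needs write access to the bucket.

```python
# Sketch: submit an UNLOAD ... FORMAT AS PARQUET statement via the Redshift Data API.
# Cluster, database, bucket, and IAM role names are hypothetical placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

unload_sql = """
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2024-01-01''')
TO 's3://example-data-lake/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftUnloadRole'
FORMAT AS PARQUET
PARTITION BY (region)
"""

resp = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=unload_sql,
)
print(resp["Id"])  # statement id; poll describe_statement(Id=...) for completion
```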
Security concerns
▪ Using a Hardware Security Module (HSM)
Must use a client and server cert to configure a trusted connection between Redshift and the HSM
If migrating an unencrypted cluster to an HSM-encrypted cluster, must create a new encrypted cluster and then move data to it
▪ Defining access privileges for a user or group
Use the GRANT or REVOKE commands in SQL
System Tables and Views
▪ Contain info about how Redshift is functioning
▪ Types of system tables/views
SYS views – monitor query and workload usage
STV tables – transient data containing snapshots of current system data
SVV views – metadata about DB objects that reference STV tables
STL views – generated from logs persisted to disk
SVCS views – details about queries on main and concurrency scaling clusters
SVL views – details about queries on main clusters
Database Page 34
ElastiCache Friday, January 24, 2025 6:05 PM
Allows you to set up, manage, and scale in-memory data stores in the cloud
Improves performance of apps by reducing latency of data retrieval and offloading read and write traffic from your primary database or other backend systems
Acts as a layer between app and database, storing frequently accessed data
Supports two engines
○ Redis - open source, highly available, and fast in-memory key-value store
○ Memcached - simpler caching layer optimized for storing key-value pairs
Key Features:
○ Low latency and high throughput - enables fast data access
○ High availability
▪ Redis - provides automatic failover in a multi-AZ setup
▪ Data is replicated across multiple nodes to ensure durability
○ Horizontal and vertical scaling
○ Backup and restore (Redis) - store automatic/manual snapshots of data
○ Data replication - allows you to create read replicas
○ Encryption at rest and in transit
○ Integration with CloudWatch
Feature comparison: Redis vs Memcached
○ Data structures: Redis – supports advanced data types (strings, hashes, sets, etc.); Memcached – only key-value pairs
○ Persistence: Redis – supports backups and persistence; Memcached – no persistence
○ Replication: Redis – yes, with automatic failover; Memcached – no replication
○ Clustering: Redis – horizontal scaling with shards; Memcached – horizontal scaling with consistent hashing
○ Use cases: Redis – complex caching, analytics, messaging; Memcached – simple caching, session storage
Use cases
○ Session management - store user sessions for web apps
○ Database query caching - cache the results of frequent, resource-intensive queries to improve app performance
○ Real-time analytics - store real time metrics, dashboards, or event streams
○ Leaderboards/gaming
○ Message queues - use Redis's pub/sub feature for building lightweight message queues
○ ML - use to store intermediate results or frequently used datasets for ML model inference
Database Page 35
Caching Strategies Saturday, January 25, 2025 10:50 AM
Lazy Loading
Data is loaded into the cache only when requested
Cache hit - if the data exists in the cache, it's returned
Cache miss - if data is not in the cache, the app fetches it from the database and stores it in the cache for future use
Advantages
○ Reduces memory usage
○ Ensures cache freshness
Disadvantages
○ The first request for any item results in a cache miss, which can introduce latency
○ May cause a cache stampede if multiple users request the same uncached data simultaneously
Use case
○ Apps with read-heavy workloads and predictable access patterns
▪ e.g.
product catalogs or user profile lookups Write-through Data is written to the cache whenever its updated in the database Ensures that the cache and database are always synchronized Advantages ○ Minimizes cache misses ○ Consistent performance Disadvantages ○ Writes are slower ○ May cache rarely accessed data Use case ○ Apps where data is frequently updated and needs to be immediately available ▪ E.g. leaderboards or shopping carts Cache-aside Manages the cache explicitly ○ Checks the cache for data ○ If data is not present, retrieves it from database, write it to cache Advantages ○ Flexible and allows custom logic for cache population/eviction ○ Avoids caching unnecessary or rarely access data Disadvantages ○ Requires app-level logic to manage cache ○ Similar risk of cache misses and stampedes Use case ○ Apps with complex caching needs or irregular access patterns ▪ e.g. ad targeting or recommendation systems Read-through Cache retrieves data from the database on behalf of the app ○ If the requested data is not in the cache, the cache fetches it from the database, stores it, and serves it to the app Advantages ○ Simplifies app logic Reduces risk of cache stampedes by centralizing control Database Page 36 ○ Reduces risk of cache stampedes by centralizing control Disadvantages ○ Slightly more complex to implement Use case ○ Apps needing transparent caching behavior ▪ e.g. analytics platforms or content delivery systems Write-behind/write-back Data is written only to the cache initially and asynchronously written back to the database later Database acts as a backup storage, while the cache serves as the primary write layer Advantages ○ Low write latency ○ Optimizes write-heavy workloads Disadvantages ○ Risk of data loss if the cache fails before the data is written to the database ○ Requires additional mechanisms to ensure database consistency Use case ○ Write heavy apps with less critical data consistency needs ▪ e.g. gaming leaderboards or session stores TTL Data in the cache is set to expire after a specified period of time ○ Once expired, data is removed from the cache and must be reloaded from the database on subsequent requests Advantages ○ Prevents stale data from lingering in the cache ○ Reduces memory usage Disadvantages ○ Data must be reloaded upon expiration, which can increase latency Use case ○ Apps requiring freshness guarantees ▪ e.g. 
live sports scores or stock picks Database Page 37 Glue Tuesday, May 21, 2024 3:08 PM o Serverless system that handles discovery and definition of table defs and schema ▪ A central metadata repo for data lakes ▪ Extract schemas/structure from unstructured data so you can query it ▪ The ‘glue’ between your unstructured data and data analytic tools o Main purpose: perform ETL on data for analytics ▪ Helps to transform data without moving it Custom ETL jobs ▪ Trigger driven, on a schedule, or on demand ▪ Fully managed ▪ Uses Apache Spark under the hood Hive ▪ Lets you run SQL-like queries from EMR ▪ Glue Data Catalog can serve as a hive ‘metastore’ (metadata store) ▪ Can also import a Hive metastore into Glue Analytics Page 38 Glue Crawler Thursday, May 23, 2024 7:00 PM ▪ Scans data in S3 and creates schema ▪ Can run periodically ▪ Populates Glue Data Catalog Stores only table def Original data stays in S3 Only one Glue Data Catalog per region ▪ Once catalogued, can treat data like it’s structured and can use analytic tools: Redshift Spectrum Athena EMR Quicksight ▪ Will extract partitions based on how S3 data is organized Think up front about how you will be querying data late in S3 and organize as such How it works ▪ Determines format, schema, and properties of raw data by classifying the data ▪ Groups the data into tables or partitions ▪ Writes metadata to Glue Data Catalog Analytics Page 39 ETL Tuesday, May 21, 2024 3:09 PM ▪ Automatic code generation for processing/transforming data Scala or python Can modify code Can provide own Spark/PySpark scripts Target can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog ▪ Encryption Server side at rest SSL in transit ▪ Can be event driven ▪ Can provision additional DPUs (data processing units) to increase performance of underlying Spark jobs Enabling job metrics can help figure out max capacity in DPUs needed ▪ Errors reported in CloudWatch Could tie into SNS for notification ▪ Pay only for resources consumed ▪ Glue Scheduler used to schedule jobs ▪ Glue Triggers to automate job runs based on events ETL scripts can update schema and partitions if necessary Adding new partitions o Rerun crawler or o Have script use enableUpdateCatalog and partitionKeys options Updating table schema o Rerun crawler or o Use enableUpdateCatalog / updateBehavior Creating new tables o enableUpdataCatalog / updateBehavior with setCatalogInfo Restrictions o Can only be done when S3 is the underlying data store o JSON, CSV, Avro, Parquet formats only o Parquet requires special code o Nested schemas not supported ▪ Supports serverless streaming ETL (NEW) Consumes from Kinesis or Kafka Clean and transform in flight Store results in S3 or other data stores Runs on Apache Spark Structured Streaming Transformations Bundled transformations o DropFields, DropNullFields – remove (null) fields o Filter – specify a function to filter records o Join – to enrich data o Map – add/delete fields, perform external lookups ML transformations o FindMatches ML – identify dups or matching records in dataset, even when the records to not have a common unique identifier and no fields match exactly Format conversions o CSV, JSON, Avro, Parquet, ORC, XML Apache Spark transformations (example: K-Means) Analytics Page 40 Glue Studio Tuesday, May 21, 2024 3:09 PM o Visual interface for ETL workflows o Visual job editor ▪ Create DAGs for complex workflows ▪ Sources include S3, Kinesis, Kafka, JDBC ▪ Transform/sample/join data ▪ Target to S3 or Glue Data Catalog ▪ Support partitioning o Visual 
job dashboard ▪ Overviews, status, run times Glue Data Quality ▪ Quality rules may be created manually or rec’d automatically ▪ Integrates into glue jobs ▪ Uses Data Quality Definition Language (DQDL) ▪ Results can be used to fail the job or just report to CloudWatch Analytics Page 41 Glue DataBrew Tuesday, May 21, 2024 3:09 PM Visual data prep tool for cleaning and normalizing data UI for pre processing large datasets Input from S3, DW, DB Output to Se Over 250 ready made transformations Create ‘recipes’ of transformations that can be saved as jobs w/in larger project May define data quality rules May create datasets with custom SQL from Redshift and Snowflake Security Can integrate with KMS (w/ customer master keys only) SSL in transit IAM CloudWatch and CloudTrail Handling PII[PS1] Substitution ○ RELATE_WITH_RANDOM Shuffling ○ SHUFFLE_ROWS Deterministic encryption ○ DETERMINISTIC_ENCRYPT Probabilitic encryption ○ ENCRYPT Decrpytion ○ DECRYPT Nulling out or deletion ○ DELETE Masking out ○ MASK_CUSTOM ○ MASK_DATE ○ MASK_DELIMITER ○ MASK_RANGE Hashing ○ CRYPTOGRAPHIC_HASH [PS1]important Analytics Page 42 Glue Workflows Tuesday, May 21, 2024 3:10 PM o Design multi job, multi crawler ETL processes that are run together o Create from AWS Glue blueprint, from the console or API o Only for orchestrating complex ETL operations using Glue o Triggers within workflows start jobs or crawlers or can be fired when jobs/crawlers complete ▪ Schedule based on CRON expression ▪ On demand ▪ EventBridge events Start on single event or batch of events Optional batch conditions o Size o Window – within X seconds Analytics Page 43 Misc Thursday, May 23, 2024 7:01 PM DynamicFrame A distributed data that supports nested data A collection of DynamicRecords[PS1] – self describing and have a schema Scala and Python APIs Can be used to provide a set of advanced transformations for data cleaning and ETL ResolveChoice Deals w/ ambiguities in a DynamicFrame and returns a new one ○ E.g. two fields w/ the same name like price Make_cols – creates a new column for each time ○ E.g. price_double, price_string Cast – casts all values to a specified type Make_struct – creates a structure that contains each data type Project – projects every type to a given type ▪ E.g. 
project:string Development Endpoints o Develop ETL scripts using a notebook then create an ETL job that runs your script (using Spark and Glue) for testing o Endpoint is in a VPC controlled by sec groups – connect via ▪ Apache Zeppelin on local machine ▪ Zeppelin notebook server on EC2 (via Glue console) ▪ SageMaker notebook ▪ Terminal window ▪ PyCharm professional edition ▪ Use Elastic IPs to access a private endpoint address Running Glue jobs o Time based schedules (like CRON) o Job bookmarks[PS2] ▪ Persists state from the job run – keeps track of where you left off so you can process new data and avoid reprocessing old data ▪ Works w/ S3 sources in a variety of formats ▪ Works w/ relational databases via JDBC if primary keys are in seq order Only handles new rows, not updated rows o CloudWatch Events ▪ Can fire off Lambda function or SNS notification when ETL succeeds/fails Costs o Billed by the second for crawler and ETL jobs o 1M objects stored and accesses are free for Glue Data Catalog o Dev endpoints for developing ETL charged by the minute Anti-patterns o Multiple ETL engines ▪ Glue ETL is based on Spark ▪ Use Data Pipeline EMR for other engines Analytics Page 44 [PS1]familiarize [PS2]important Analytics Page 45 Lake Formation Tuesday, May 21, 2024 3:10 PM Built on top of Glue so can integrate w/e Glue does Makes it easy to set up a secure data lake Loads data and monitors data flows Sets up partitions Has encryption and managing keys Defines transformation jobs and monitors them Access control Auditing Can source on premise or AWS No cost but pay for underlying services (S3, Glue, etc.) Storage optimization w/ auto compaction Avoids the overhead of accessing the tables from increasing Granular access control with row and cell level security Both for governed and S3 tables How it works Identifies existing data stores and move data to data lake Data is then crawled, catalogued, and prepared for analytics Provides users w/ data access Troubleshooting notes Cross account LF permission issues ○ Recipient must be set up as a data lake to allow access ○ Can use AWS Resource Access Manager (RAM) for accounts external to org ○ IAM permissions for cross account access LF does not support manifests [PS1] in Athena or Redshift queries IAM permissions on KMS encryption key are needed for encrypted data catalogs in LF IAM permissions needed to create blueprints and workflows Governed tables Supports ACID transactions across multiple tables Automatic data compaction Time-travel queries ○ Access data that was updated/deleted at any point w/in a defined period, New type of S3 table Can’t change choice of governed afterwards Works w/ streaming data (Kinesis) Can query with Athena Data permissions Can tie to IAM users/roles, SAML, or external AWS accounts Can use policy tags on databases, tables, or columns Can select specific permissions for tables or columns Data filters Column, row, or cell level security Applied when granting SELECT permissions on tables All columns + row filter = row level security Analytics Page 46 All columns + row filter = row level security All rows + specific columns = column level security Specific cols + specific rows = cell level security Create filters via console or via CreateDataCellsFilter API [PS1]what’s this Analytics Page 47 Athena Tuesday, May 21, 2024 3:10 PM Serverless interactive queries of S3 data i.e. 
a SQL interface for S3 no need to load data, stays in S3 ○ Can analyze data directly Executes queries in parallel Supports many data formats [PS1] CSV, TSV – human readable JSON – human readable ORC – columnar, splitable Parquet – columnar, splitable Avro – splitable (i.e. allows for parallel processing) Supports unstructured, semi-structured, structured Use cases Ad hoc queries of web logs Querying staging data before loading to RedShift Analyze CloudTrail / CloudFront / VPC/ ELB logs in S3 Costs Pay as you go Successful or cancelled queries count, but failed queries do not Not charge for DDL operations (CREATE, ALTER, DROP, etc.) Save a lot of money by using columnar formats bc limiting how is being scanned by Athena Glue and S3 have their own charges Per query limit - specifies a threshold for the total amount of data scanned per query ○ Only one can be created in a work group Per work group query limit - limits the total amount of data scanned by all queries running w/in a specific time frame ○ Can use multiple Security Access control ○ IAM, ACLs, S3 bucket policies Encrypt results at rest in S3 staging directory ○ SSE-S3, SSE-KMS, CSE-KMS If S3 files are encrypted, can perform queries on the encrypted data itself Cross account access in S3 bucket policy possible Transport Layer Security (TLS) encrypts in transit between Athena and S3 Anti-patterns Highly formatted reports/visualization ○ Use QuickSight instead ETL ○ Use Glue instead Performance optimization Use columnar data (e.g. Parquet, ORC) Small # of large files better than large # of small files Use partitions If adding partitions after the fact, use MSCK REPAIR TABLE command Analytics Page 48 ○ If adding partitions after the fact, use MSCK REPAIR TABLE command Compress files Make files splittable ACID transactions support Powered by Apache Iceberg Allows concurrent users safely make row level mods Compatible with EMR, Spark, anything that supports Iceberg table format Removes need for custom record locking Time travel operations ○ Recover data recently deleted w/ SELECT statement Benefits from periodic compaction to preserve performance ○ Have to do it yourself every once in a while ○ What to do if you notice it getting slower over time [PS1]good takeaway for exam – note the characteristics of each type splitable means you can run in parallel Analytics Page 49 Key Features Friday, January 24, 2025 7:27 PM Key features Serverless query execution Supports standard SQL Supports data formats such as CSV, JSON, Parquet, ORC, Avro as well as unstructured and semi- structured data Integration with S3, Glue, Quicksight, CloudTrail, SageMaker, Lambda, Federated queries Defines the schema when data is queried, which reduces the need for pre-processing or data transformation Allows you to partition large datasets into smaller subsets, improving performance and reducing costs Query results are automatically stored in S3 Query optimization for improved performance ○ Predicate pushdown ○ Partition pruning ○ Column projection Multiple users can query the same dataset simultaneously Performance monitoring in Athena console or CloudWatch View and manage query history Supports joins, aggregations, window functions, nested queries, CTEs and UNION ALL Creates views Athena Query Editor An interactive SQL query editor Provides syntax highlighting and auto-completion Custom Connectors Query non-S3 data sources via Lambda-based connections (e.g. 
MongoDB, Elasticsearch, API endpoints) Analytics Page 50 Workgroups Thursday, May 23, 2024 7:14 PM Workgroups Can organize users/teams/apps/workloads into Workgroups Control query access and track costs by Workgroup Integrates w/ IAM, CloudWatch, SNS Each workgroup can have its own ○ Query history ○ Data limits (can be controlled) ○ IAM policies ○ Encryption settings Analytics Page 51 Integration Thursday, May 23, 2024 7:15 PM Fine grained access to AWS Glue Data Catalog IAM based database and table level security ○ Broader than data filters in Lake Formation ○ Can’t restrict to specific table versions At min, must have a policy that grants access to your db and the Glue Data Catalog in each region Might have policies to restrict access to ○ ALTER or CREATE DATABASE ○ CREATE TABLE ○ DROP DATABASE ○ DROP TABLE ○ MSCK REPAIR TABLE ○ SHOW DATABASES ○ SHOW TABLES Just need to map these operations to their IAM actions QuickSight for easy visualization Analytics Page 52 Athena Notebook Friday, January 24, 2025 7:37 PM An interactive, notebook-style environment that allows users to write, organize, and execute queries while collaborating on data analysis Integrates with Athena - query data stored in S3 or federated sources Works seamlessly with Glue Data Catalogs Supports running single or multiple queries at once Displays query execution status, data preview, and query results Provide visualizations of query results Version control Supports advanced querying, data transformations, and advanced analytics Analytics Page 53 Spark Integration Friday, January 24, 2025 7:44 PM Provides a serverless, interactive environment for running Spark applications on large-scale data A flexible, powerful way to process and analyze complex datasets using Spark's distributed computing framework Combines flexibility of Spark for big data analytics with the serverless and on-demand nature of Athena Key features Provides Jupyter-style notebooks within the Athena console for writing, running, and iterating on Spark apps ○ Enables exploratory data analysis, data engineering, and ML workflows Ideal for processing large-scale datasets stored in S3 ○ ETL operations, data transformations, aggregations Integrates with Glue Data Catalog Supports PySpark, Spark SQL, Scala APIs Supports CSV, JSON, Parquet, ORC, Avro, unstructured, semi-structured, and structured data Automatic retries Analytics Page 54 Apache Spark Tuesday, May 21, 2024 3:10 PM o Distributed processing framework for big data o In memory caching, optimized query execution o Supports code reuse across ▪ Batch processing ▪ Interactive queries ▪ Real-type analytics ▪ ML ▪ Graph processing o Spark streaming ▪ Integrated w/ kinesis, kafka, EMR o Not meant for OLTP o Spark apps are run as independent processes on a cluster ▪ Spark Context Driver program that coordinates processes Sends app code and tasks to executors ▪ Cluster Manager Spark Context works through this to split them up ▪ Executors run computations and store data o CREATE TABLE AS SELECT ▪ Can convert data into a new underlying format ▪ Can be used to create a subsetted table Analytics Page 55 Components Thursday, May 23, 2024 7:16 PM Spark core (underlying component) Memory management Fault recovery Scheduling Distribute and monitor jobs Interact w/ storage Spark streaming Real type streaming analytics Structured streaming o An unbounded table – a constantly growing dataset o New data in stream → new rows appended to input table Increase productivity MLLib Library of algos for doing ML at a large scale 
Classification Clustering Collaborative filtering GraphX Graph processing ETL, analysis Not widely used Analytics Page 56 Integration Thursday, May 23, 2024 7:16 PM Kinesis ▪ There’s a library built on top of kinesis library to allow Spark streaming to import from AWS data library Redshift ▪ There’s a package that allows Spark datasets from RedShift ▪ A Spark SQL data source ▪ Useful for ETL using Spark Athena ▪ Can run Jupyter notebooks w/ Spark enabled within Athena console Notebooks may be encrypted automatically or w/ KMS ▪ Serverless ▪ Selectable as an alternative analytics engine (vs. Athena SQL) ▪ Programmatic API / CLI access as well ▪ Can adjust DPUs for coordinator and executor sizes Still want to consider capacity ▪ Pricing based on compute usage and DPU per hour Analytics Page 57 EMR Tuesday, May 21, 2024 3:10 PM o Elastic Map Reduce o A managed Hadoop framework on EC2 instances ▪ Used to process and analyze vast amounts of data o Includes Spark, HBase, etc. o EMR notebooks provide a managed environment to: ▪ Prepare and visualize data ▪ Collab w/ peers ▪ Build apps ▪ Perform interactive analysis using EMR o Several integration points w/ AWS Promises ▪ Charges by the hour + EC2 charges ▪ Provisions new nodes if a code node fails ▪ Can add/remove task nodes on the fly ▪ Can resize a running cluster’s core nodes Increases both processing and HDFS capacity ▪ Core nodes can be added/removed Risks data loss Managed scaling ▪ Support instance groups and instance fleets ▪ Scales spot, on demand, and instances in a Savings Plan w/in same cluster ▪ Available for Spark, Hive, YARN workloads ▪ Scale up strategy: First adds core nodes, then task nodes, up to max units specified ▪ Scale down strategy: first removes task nodes, then core nodes, no further than min constraints Spot nodes always removed before on demand instances Analytics Page 58 Clusters Thursday, May 23, 2024 7:18 PM A collection of EC2 instances (node) that is running Hadoop Frameworks and apps are specified at cluster launch Connect directly to master to run jobs directly OR submit ordered steps via console Transient cluster Terminates once all steps are done Saves money Long running clusters Must be manually terminated Basically a DW w/ periodic processing on large datasets Can spin up task nodes using spot instances for temporary capacity Can use reserved instances on long running clusters to save money Termination protection on by default, auto termination off Master node Manages cluster Tracks status of tasks Monitors cluster health Every cluster has one Single EC2 instance ○ Can be a single node cluster Does not support automatic failover ○ i.e. 
will not switch to another cluster if it fails Core node Hosts HDFS data and runs task Can be scaled up and down but with some risk Doing the actual work and distributing the work across the cluster Multi node clusters have at least one Fault tolerant - will continue job execution if node goes down Task node Runs tasks but does not host/store data Optional No risk of data loss when removing Good for adding more processing capacity but do not need more storage capacity Good use of spot instances[PS1] [PS1]key – good way to save money Analytics Page 59 Integration Thursday, May 23, 2024 7:19 PM ▪ Amazon EC2 for instances that comprise the nodes ▪ Amazon VPC to configure the virtual network in which you launch your instances ▪ Amazon S3 to store input/output data ▪ Amazon CloudWatch to monitor cluster performance and configure alarms ▪ AWS IAM to configure permissions ▪ AWS CloudTrail to audit requests made to the service ▪ AWS Data Pipeline to schedule/start clusters ▪ On EKS ▪ Allows submitting Spark job on EKS w/o provisioning clusters ▪ Fully managed ▪ Share resources between Spark and other apps on Kubernetes Analytics Page 60 Storage Thursday, May 23, 2024 7:19 PM HDFS Hadoop Distributed File System ○ Distributed and scalable Multiple copies stored across cluster instances for redundancy ○ Ensures no data is lost if an instance fails Files stored as blocks Ephemeral – data is lost when cluster is terminated Useful for caching intermediate results or workloads w/ significant random I/O Hadoop tries to process data where it’s stored on HDFS (i.e. the node where it’s stored) EMRFS Access S3 as if it were HDFS Allows persistent storage after cluster termination EMRFS Consistency View [PS1] – optional for S3 consistency Local file storage Locally connected disk Suitable for temp data (buffers, caches) Ephemeral EBS for HDFS Allows use of EMR on EBS only types (M4,C4) Deleted when cluster is terminated EBS volumes can only be attached when launching a cluster If you manually detach an EBS volume, EMR treats that as a failure and replaces it [PS1]no longer necessary but exam might not reflect that Analytics Page 61 Serverless Thursday, May 23, 2024 7:28 PM ▪ Decides how many nodes it needs ▪ Choose EMR release and runtime (Spark, Hive, Presto) ▪ Submit queries/scripts via job run requests ▪ Manages underlying capacity But you can specify default worker sizes and pre initialized capacity computes resources needed and schedules workers accordingly All within one region across multiple AZs ▪ Pre initialized capacity Spark adds 10% overhead to memory requested for drivers and executors Be sure initial capacity is at least 10% more than requested by job Security Same as EMR EMRFS o S3 encryption (SSE or CSE) at rest o TLS in transit between EMR nodes and S3 S3 o SSE-S3, SSE KMS Local disk encryption Spark communication between drives and executors is encrypted Hive communication between Glue Metastore and EMR uses TLS Force HTTPS (TLS) on S3 policies with aws:SecureTransport Analytics Page 62 Kinesis Data Streams Tuesday, May 21, 2024 3:11 PM o A way to stream data in system o A stream is made up of shards ▪ Contains an ordered sequence of records ordered by arrival time ▪ Can scale # of shards – must provision ahead of time o Retention between 1-365 days o Ability to reprocess/replay data o Once data is inserted in Kinesis, can’t be deleted (immutability) o Data that share the same partition go to same shard (ordering) Capacity modes ▪ Provisioned mode Choose number of shards provisioned o Scale 
manually or using API Each shard gets 1MB/s in, 2MB/s out Pay per shard provisioned per hour ▪ On demand mode No need to provision or manage capacity Default capacity provisioned – 4 MB/s Scales auto based on observed throughput peak during last 30 days Pay per stream per hour and data in/out per GB Security ▪ Control access/auth using IAM policies ▪ Encryption in fight using HTTPS endpoints ▪ Encryption at rest using KMS ▪ Can implement encryption/decryption of data on client side (harder) ▪ VPC endpoints available for kinesis to Access within VPC ▪ Monitor API calls using CloudTrail Scaling/Resharding ▪ Adding shards = ‘Shard splitting’ Can be used to increase stream capacity Can be used to divide a hot shard Old shard is closed and will be deleted once the data is expired ▪ Merging shards Decrease stream capacity and save costs Can be used to group two shards w/ low traffic Old shards are closed and deleted based on data expiration ▪ Out of order records after resharding After a reshard, can read from child shards However, data you haven’t read yet could still be in the parent If you start reading the child before completing reading the parent, you could read data for a particular hash key out of order Solution: read entirely from the parent until you don’t have new records KCL has this logic already built in ▪ Limitations Resharding can’t be done in parallel Analytics Page 63 Resharding can’t be done in parallel o Plan capacity in advanced Can only perform on resharding operation at a time and it takes a few secs Handling duplicates ▪ For producers Producer retries can create duplicates due to network timeouts Identical data would have unique sequence Fix: embed unique record ID in the data to deduplicate on consumer side ▪ For consumers Consumer retries can make app read same data twice Retries happen when record processors restart: o A worker terminates unexpectedly o Worker instances are added/removed o Shards are merged or split o App is deployed Fixes o Make consumer app idempotent – if data is read twice, it does not have twice the same side effects o If final destination can handle dups, handle them there Analytics Page 64 Producers Tuesday, May 21, 2024 3:12 PM Sends data into streams as records Made up of partition key and data blob (value itself) E.g. apps, client, SDK, Kinesis agent SDK Allows you to use code or CLI to send data directly to Kinesis Streams PutRecords o API that are used are PutRecord (one) and PutRecords (many recods) o PuRecords use batching and increases throughput → less HTTP requests o ProvisionedThroughputExceeded if we go over the limits ▪ Happens when sending more data ▪ Make sure you don’t have a hot shard (i.e. 
partition key is bad and too much data goes to that partition) ▪ Solution: Retries w/ backoff Increase shards (scaling) Ensure partition key is good o Use case: low throughput, higher latency, simple API, AWS Lambda o Managed AWS sources for Kinesis Data Streams ▪ CloudWatch Logs ▪ AWS IoT ▪ Kinesis Data Analytics Analytics Page 65 Kinesis Producer Library (KPL) Tuesday, May 21, 2024 3:11 PM Used for building high performance, long running producers Automated and configurable retry mechanism Synchronous API (same as SDK) and Asynchronous API (better performance for async) Submits metrics to CloudWatch for monitoring Batching[PS1] – increases throughput, decrease cost Collect records and write to multiple shards in same PutRecords API call (on by default) Aggregate – Capability to store multiple records in one record ○ Increased latency ○ Increase payload size and improve throughput ○ On by default Can influence batching efficiency by introducing some delay with RecordMaxBufferedTime Compression must be implemented manually KPL records must be decoded w/ KCL or special helper library When not to use Can incur an additional processing delay of up to RecrodMaxBufferedTime within library (user-config) Larger values of RMBT results in higher packing efficiencies and better performance Apps that can’t tolerate this additional delay may need to use AWS SDK directly [PS1]important Analytics Page 66 Kinesis Agent Tuesday, May 21, 2024 3:12 PM Monitor Log files and sends them to KDS Built on top of KPL Install in Linux based server environments Features o Write from multiple dirs and write to multiple streams o Routing feature based on dir/log file o Pre process data before sending to streams o Handles file rotation, checkpointing, and retry ipon failures o Emits metrics to CloudWatch for monitoring Analytics Page 67 Consumers Tuesday, May 21, 2024 3:12 PM Consumes data E.g. 
apps, Lambda, Kinesis Data Firehose, Kinesis Data Analytics Record received contains partition key, sequence number, data blob SDK Records are polled by consumers from a shard o Each shard has 2MB total aggregate throughput GetRecords o Returns up to 10MB of data o Max of 5 calls per shard per second = 200ms latency o More consumers → less throughput per consumers Analytics Page 68 Kinesis Client Library (KCL) Tuesday, May 21, 2024 3:13 PM Read records from Kinesis produced w/ the KPL (de-aggregation) Share multiple shards w/ multiple consumers in one ‘group’ o Shard discovery process Checkpointing features to resume progress Leverages DynamoDB for coordination and checkpointing (one row per shard) o Make sure you provision enough WCU/RCU or use on demand mode, otherwise may slow down KCL Record processors will process the data ExpiredIterationException → increase WCU Analytics Page 69 Kinesis Connector Library Tuesday, May 21, 2024 3:13 PM Leverages the KCL library Writes data to: o S3 o DynamoDB o Redshift o OpenSearch Replaced by Firehose and Lambda Analytics Page 70 Kinesis Data Firehose (KDF) Tuesday, May 21, 2024 3:13 PM Fully managed Near real time Buffer based on time Store data into target destinations Auto scaling Supports many data formats Data conversions from JSON to Parquet/ORC (only for S3) Data transformation through Lambda Supports compression when target is S3 Only GZIP if data is further loaded into Redshift Pay for amount of data going through firehose Spark/KCL do not read from KDF[PS1] Read from producers, use Lambda for data transformation, batch writes to destinations: AWS ○ S3 ○ Redshift ○ OpenSearch 3rd party Custom destination All/failed data can be sent to S3 backup bucket Buffer sizing Accumulates records in a buffer Buffer is flushed based on time and size rules ○ Ex: Buffer size = 32MB → if buffer size is reached, it’s flushed ○ Ex: Buffer time = 2 min → if time is reached, it’s flushed High throughput → buffer size will be hit Low throughput → buffer time will be hit [PS1]important Analytics Page 71 KDF vs KDS Monday, May 27, 2024 11:23 AM KDS Going to write custom code (producer/consumer) Real time Must manage scaling (shard splitting/merging) Data storage for 1-365 days Replay capability Multi consumers Firehose Fully managed Serverless data transformation w/ Lambda Near real time Auto scaling No data storage Analytics Page 72 Lambda Monday, May 27, 2024 11:25 AM Can source records from KDS Lambda consumer has a library to de-aggregate record from KPL Can be used to run lightweight ETL to o S3 o DynamoDB o RedShift o OpenSearch Can used to trigger notifications/send emails in real time Has configurable batch size Analytics Page 73 Enhanced Fan Out Monday, May 27, 2024 11:29 AM ▪ Each consumer gets 2MB/s of provisioned throughput per shard 20 consumers → 40 MB/s per shard aggregated No more 2MB/s limit ▪ Pushes data to consumers over HTTP/2 Consumer app subscribes to shard ▪ Reduced latency ▪ Vs Standard Consumers Standard Consumers o Low number of consuming apps o Can tolerate 200 ms latency o Minimize cost Enhanced Fan Out Consumers o Multiple consumer apps for same stream o Low latency reqs o Higher costs o Default limit of 20 consumers using enhanced fan out per data stream Analytics Page 74 Troubleshooting Monday, May 27, 2024 11:32 AM Producers performance Writing is too slow o Service limits may be exceeded o Check for throughput exceptions, what operations are being throttled o Shard limits for writes and reads o Other operations have stream level 
limits of 5-20 calls/sec o Select partition key to evenly distribute puts across shards Large producers o Batch things up o Use KPL, PutRecords, or aggregate records into larger files Small producers (i.e. apps) o Use PutRecords or Kinesis Recorder in AWS Mobile SDKs Stream returns a 500 or 503 error o Indicates an AmazonKinesisException error rate above 1% o Implement a retry mechanism Connection errors from Flink to Kinesis o Network issue or lack of resources in Flink’s env o Could be VPC misconfig Timeout errors from Flink to Kinesis o Adjust RequestTimeout and #setQueueLimit on FlinkKinesisProducer Throttling errors o Check for hot shards w/ enhanced monitoring (shard level) o Check logs for micro spikes or obscure metrics breaching limits o Try a random partition key or improve key’s distribution o Use exponential backoff o Rate-limit Consumers Records get skipped with KCL o Check for unhandled exceptions or processRercords Records in same shard are processed by more than one processor o May be due to failover on record processor workers o Adjust failover time o Handle shutdown methods with reason ‘Zombie’ Reading is too slow o Increase number of shards o maxRecords per call is too low o Code is too slow – test an empty processor vs. yours GetRecords returning empty results o Normal – just keep calling GetRecords Shard Iterator expires unexpectedly o May need more write capacity on the shard table in DyanmoDB Record processing falling behind o Increase retention period while troubleshooting o Usually insufficient resources o Monitor with GetRecords.IteratorAgeMilliseconds and MillisBehindLatest Lambda function isn’t getting invoked o Permissions issue on execution role o Function is timing out – check max execution time o Breaching concurrency limits o Monitor IteratorAge metric – will increase if this is a problem ReadProvisionedThroughputExceeded exception o Throttling o Reshard your stream o Reduce size of GetRecords requests Use enhanced fan out Analytics Page 75 o Use enhanced fan out o Use retries and exponential backoff High latency o Monitor w/ GetRecords.Latency and IteratorAge o Increase shards o Increase retention period o Check CPU and memory utilization – may need more mem 500 errors o Same a producers – indicates a high error rate of > 1% o Implement retry mechanism Blocked or stuck KCL app o Optimize your processRecords method o Increase maxLeasesPerWorker o Enable KCL debug logs Analytics Page 76 Managed Service for Apache Flink (MSAF) Tuesday, May 21, 2024 3:14 PM ▪ Supports Python and Scala ▪ Flink is a framework for processing data streams ▪ Integrates Flink with AWS Instead of using SQL, can develop own Flink app from scratch and load it into MSAF via S3 ▪ Has DataStream API and Table API for SQL access ▪ Serverless ▪ Flink sources KDS Managed Streaing for Apache Kafka ▪ Flink Sinks (i.e. 
targets) S3 Firehose KDS ▪ Use cases Streaming ETL Continuous metric generation Responsive analytics Cost model ▪ Pay for only resources consumed ▪ Charged by Kinesis Processing Units (KPUs) consumed per hour 1 KPU = 1vCPU + 4GB ▪ Serverless, scales auto ▪ Use IAM permissions to access streaming source and dest ▪ Schema discovery – analyze incoming data in your stream as you’re setting it up RANDOM_CUT_FOREST ▪ SQL function used for anomaly detection on numeric columns in a stream Analytics Page 77 Managed Streaming for Apache Kafka (MSK) Tuesday, May 21, 2024 3:14 PM o Alternative to Kinesis o Fully managed Apache Kafka on AWS to ingest and processing streaming data in real time o Allows you to create, update, delete clusters o Creates and manages Kafka brokers nodes and Zookeeper nodes for you o Deploy the MSK cluster in your VPC ▪ Multi AZ up to 3 for high availability o Auto recovery for common Kafka failures o Data stored on EBS volumes o Can build producers and consumers of data o Can create custom config for clusters ▪ Default message size is 1MB ▪ Possibilities of sending large messages into Kafka after custom config Config ▪ Number of AZ ▪ VPC and subnets ▪ Broker instance type ▪ Number of brokers per AZ Can add more later ▪ Size of EBS volums (1GB – 16TB) Security ▪ Encryption in flight (TLS) between brokers (optional) ▪ Encryption with TLS between clients (optional) ▪ At rest for EBS volumes using KMS ▪ Authorize specific security groups for Kafka clients ▪ Authentication (AuthN) and authorization (AuthZ)[PS1] Define who can read/write to which topics Mutual TLS (AuthN) + Kafka ACLs (AuthZ) SASL/SCRAM (AuthN) + Kafka ACLs (AuthZ) IAM Access control (both) Monitoring ▪ CloudWatch Metrics Basic monitoring - cluster and broker metrics Enhanced monitoring - includes enhanced broker metrics Topic level monitoring – includes enhanced topic level metrics ▪ Prometheus (open source monitoring) Opens a port on the broker to export cluster, broker, and topic-level metrics Setup the JMX exporter (metrics) or Node exporter (CPU and disk metrics) ▪ Broker log delivery to: CloudWatch Logs S3 KDS MSK Connect ▪ Managed Kafka Connect workers on AWS Analytics Page 78 ▪ Managed Kafka Connect workers on AWS Framework to take data from Kafka and put it somewhere else or vv Worker - JVM process that runs the connector logic □ Creates a set of tasks that can operate in parallel threads and copy data ▪ Stream data to/from Kafka clusters by using connectors Source connectors - import data from external systems into topics Sink connector - export data from topics to external systems ▪ Auto scaling capabilities for workers ▪ Can deploy any Kafka Connect connectors to MSK Connect as a plugin MSK Serverless ▪ Run Kafka on MSK w/o managing capacity ▪ Auto provisions resources and scales compute/storage ▪ You just define topics and partitions ▪ Security – IAM access control for all clusters [PS1]important Analytics Page 79 KDS vs MSK Monday, May 27, 2024 11:38 AM KDS MSK 1 MB message size limit 1 MB (default) but can configure for higher Data Streams w/ Shards Kafka Topics w/ Partitions Shard splitting and merging Can only add partitions to a topic TLS in-flight encryption PLAINTEXT or TLS in-flight encryption KMS at-rest encryption KMS at-rest encryption Security: IAM policies for AuthN/AuthZ Security: Mutual TLS (AuthN) + Kafka ACLs (AuthZ) SAS/SCRAM (AuthN) + Kafka ACLs (AuthZ) IAM Access Control (AuthN + AuthZ) Analytics Page 80 OpenSearch Tuesday, May 21, 2024 3:15 PM o Analysis and reporting at PB scale 
o A search engine o An analysis tool o A visualization tool (Dashboards) o A data pipeline ▪ Integrate with Kinesis Use cases ▪ Full text search ▪ Log analytics ▪ App monitoring ▪ Security analytics ▪ Clickstream analytics Documents ▪ Things you’re searching for ▪ Can be text or any structured JSON data ▪ Has unique ID ▪ Hashed to a particular shard Storage types ▪ Hot storage Used by standard data nodes Instance stores or EBS volumes Fastest performance ▪ UltraWarm (warm) storage Uses S3 + caching Best for indices w/ few writes (e.g. log data, immutable data) Slower performance but much lower cost Must have dedicated master node ▪ Cold storage Also uses S3 Even cheaper For periodic research or forensic analysis on older data Must have dedicated master node and have UltraWarm enabled too Not compatible with T2 or T3 instance types on data nodes If using fine-grained access control, must map users to cold_manager role in Dashboards ▪ Data may be migrated between storage types Stability ▪ Best to have 3 dedicated master nodes Avoids ‘split brain’ – 3rd node can determine who’s the master ▪ Don’t run out of disk space ▪ Choose correct # of shards Rare case: need to limit # of shards per node o Usually run out of disk space first ▪ Choose correct instance types At least 3 nodes Analytics Page 81 At least 3 nodes Mostly about storage reqs Performance ▪ Memory pressure in the JVM can result if You have unbalanced shard allocations across nodes You have too many shards in a cluster ▪ Fewer shards can yield better performance if JVMMemoryPressure errors are encountered Delete old or unused indices Analytics Page 82 Fully Managed Monday, May 27, 2024 11:46 AM ▪ Scale up/down w/o downtime – not auto ▪ Pay for what you use Instance hours, storage, data transfer ▪ Network isolation AWS integration S3 buckets (via Lambda to Kinesis) KDS DynamoDB streams CloudWatch / CloudTrail Zone awareness Options Dedicated master nodes – choice of count and instance types Domains – a collection of resources needed to run OpenSearch cluster Snapshots to S3 Security Resource based policies Identity based policies IP based policies Request signing VPC o Can not move in or out Cognito o Can use to allow Dashboards to access in VPC Anti-patterns OLTP o RDS or DynamoDB is better Ad hoc data querying o Athena is better Analytics Page 83 Serverless Monday, May 27, 2024 11:47 AM ▪ On demand auto scaling ▪ Works against collections instead of provisioned domains May be search or time series types ▪ Always encrypted w/ your KMS key Data access policies Encryption at rest is required May configure security policies across many collections ▪ Capacity measured in OpenSearch Compute Units (OCUs) Can set upper limit Lower limit is always 2 for indexing and searching Analytics Page 84 Indices Monday, May 27, 2024 11:47 AM ▪ Powers search into all docs w/in a collection of types ▪ Contain inverted indices that let you search across everything within them at once ▪ An index is split into shards Each shard may be on a different node in a cluster Self contained Lucene index of its own ▪ Index has two primary shards and two replicas Your app should round robin requests amongst nodes Write requests are rounded to the primary shard then replicated Read requests are routed to the primary or any replica Index state management (ISM) ▪ Automates index management policies ▪ Examples: Delete old indices after a period of time Move indices into read only state after a period of time Move between storage types Reduce replica count over time 
Automate index snapshots ▪ Run every 30-48 minutes Ensures they don’t run all at once ▪ Can send notifications when done ▪ Index rollups Periodically roll up old data into summarized indices Saves storage costs New index may have fewer files, coarser time buckets ▪ Index transforms Like rollups, but purpose is to create a diff view to analyze data differently Groupings and aggs ▪ Cross cluster replication Replicate indices/mappings/metadata across domains Ensures high availability in an outage Replicate data geographically for better latency Follower index pulls data from leader index Requires fine-grained access control and node to node encryption Remote reindex – allows copying indices from one cluster to another on demand Analytics Page 85 QuickSight Tuesday, May 21, 2024 3:15 PM o Visualization tool for big data o Scalable o A web app for anyone (not just for devs) o Use cases: ▪ Build visualizations ▪ Build paginated reports ▪ Perform ad hoc analysis ▪ Get alerts on detected anomalies ▪ Quickly get business insights from data o Don’t use for ▪ Highly formatted canned reports ▪ ETL Use Glue instead o Serverless Data sources ▪ Redshift ▪ Aurora / RDS ▪ Athena ▪ OpenSearch ▪ IoT analytics ▪ EC2-hosted databases ▪ Files (S3 or on premises) E.g. Excel, CSV, TSV, common or extended log format SPICE – Super-fast, parallel, in memory calculation engine ▪ Data sets are imported into SPICE Uses columnar storage, in mem, machine code generation Accelerates interactive queries on large datasets ▪ Each user gets 10GB of SPICE ▪ Highly available/durable ▪ Scales to hundreds of thousands of users ▪ Can accelerate large queries that would time out in direct query mode (hitting Athena directly) If it takes >30 min to import data, it will still time out Security ▪ Multi factor auth ▪ VPC connectivity Add QuickSight’s IP address range to DB security groups ▪ Row-level security ▪ Column-level security – enterprise edition only ▪ Private VPC access Via Elastic Network Interface, AWS Direct Connect ▪ Resource access Must ensure QuickSight is authorized to use Athena/S3/buckets Manged w/in QuickSight console ▪ Data access Create IAM policies to restrict what data in S3 that can be accessed Analytics Page 86 Create IAM policies to restrict what data in S3 that can be accessed ▪ With Redshift Default: QS can only access data stored in the same region as the one QS is running within A problem if QS in one region and Redshift in another A VPC configured to work across regions won’t work Solution: create a new security group with an inbound rule authorizing access from the IP range of QS servers in that region ▪ User management Users defined via IAM or email signup Active Directory connector with QS Enterprise o All keys managed by AWS o Can tweak security access using IAM if needed o Pricing ▪ Annual subscription By the user/month ▪ Extra SPICE capacity beyond 10GB costs extra ▪ Monthly option ▪ Enterprise edition Encytion at rest Access to Microsoft Active Directory o Dashboards ▪ Read only snapshots of analysis ▪ Can share w/ QS users ▪ Can share even more widely w/ embedded dashboards Embed w/in an app Authenticate w/ Active Directory/Cognito/SSO QS Javascript SDK/QS API Whitelist domains where embedding is allowed o ML insights ▪ ML-powered anomaly detection Uses random cut forest Identify top contributors to significant changes in metrics ▪ ML-powered forecasting Also uses random cut forest Detects seasonality and trends Excludes outliers and imputes missing values ▪ Autonarratives Adds ‘story of your 
data’ to your dashboards
▪ Suggested insights
 The ‘Insights’ tab displays ready-to-use suggested insights
Analytics Page 87

S3
Sunday, June 2, 2024 4:33 PM

Use cases
 Backup and storage
 Disaster recovery
 Archive
 Hybrid cloud storage
 Application and media hosting
 Data lakes and big data analytics
 Software delivery
 Static websites
Buckets
 Think: directories
 Must have a globally unique name (across all regions and accounts)
 Defined at the region level – can’t be changed
 Naming convention
○ No uppercase or underscore
○ 3-63 characters
○ Not an IP
○ Cannot start with xn-- or end in -s3alias
Objects
 Think: files
 Have a key = full path
 Object values are the content of the body
 Max size = 5 TB
 Must use multi-part upload if uploading > 5 GB
 Metadata
 Tags
 Version ID
S3 Select
 Pulls out only the data needed from an object
 Improves performance and decreases costs
 Works w/ various formats (CSV, JSON, Parquet, etc.)
Storage Page 88

Security
Sunday, June 2, 2024 4:35 PM

User based
 IAM policies – which API calls should be allowed for a specific user
Resource based
 Bucket policies – bucket-wide rules, allow cross-account access
○ JSON-based policies
○ Resources: buckets and objects
○ Effect: allow/deny
○ Actions: set of APIs to allow/deny
○ Principal: the account or user to apply the policy to – can use * to apply to everyone
 Object Access Control List (ACL) – finer grain (can be disabled)
 Bucket ACL – less common
An IAM principal can access an S3 object if (the user’s IAM permissions allow it or the resource policy allows it) and there is no explicit deny
Encryption
Use IAM roles for EC2 instances (for example)
Storage Page 89

Versioning
Sunday, June 2, 2024 4:35 PM

Enabled at the bucket level
Disabled by default
Protects against unintended deletes
Easy roll back to a previous version
Suspending versioning does not delete previous versions
Storage Page 90

Replication
Sunday, June 2, 2024 4:35 PM

Cross-region replication (CRR)
 Enables automatic, async copying of objects across buckets in different regions
 Use cases: compliance, lower latency access, replication across accounts
 Requirements (sketched in the example below):
○ Both source and destination buckets must have versioning enabled
○ Source and destination buckets must be in different regions
○ S3 must have permissions to replicate objects from source to destination
Same-region replication (SRR)
 Log aggregation, live replication between prod and test accounts
 Must enable versioning in source and destination buckets
Buckets can be in different AWS accounts
Copying is asynchronous
Must give proper IAM permissions
After replication is enabled, only new objects are replicated
Can replicate existing objects using S3 Batch Replication
For DELETE operations:
 Can replicate delete markers from source to target (optional setting)
 Deletions w/ a version ID are not replicated (to avoid malicious deletes)
No chaining of replication
 If bucket 1 has replication into bucket 2, which has replication into bucket 3, objects created in bucket 1 are not replicated to bucket 3
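As a rough sketch of the requirements listed above (versioning enabled on the source, an IAM role S3 can assume, replication of new objects only), the boto3 calls might look like the following; the bucket names and role ARN are placeholders, and the destination bucket needs versioning enabled the same way.

```python
import boto3

s3 = boto3.client("s3")

SOURCE = "analytics-source-bucket"                       # hypothetical source bucket
DEST_ARN = "arn:aws:s3:::analytics-replica-bucket"       # hypothetical destination bucket ARN
ROLE_ARN = "arn:aws:iam::123456789012:role/s3-crr-role"  # hypothetical role S3 assumes

# Versioning must be enabled on BOTH buckets before replication can be configured.
s3.put_bucket_versioning(
    Bucket=SOURCE,
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_replication(
    Bucket=SOURCE,
    ReplicationConfiguration={
        "Role": ROLE_ARN,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": ""},  # empty prefix = all new objects
                "Destination": {"Bucket": DEST_ARN},
                # Optional: replicate delete markers (deletes with a version ID are never replicated)
                "DeleteMarkerReplication": {"Status": "Enabled"},
            }
        ],
    },
)
```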
Storage Page 91

Storage Classes
Sunday, June 2, 2024 4:35 PM

Standard – general purpose
 99.99% availability
 Used for frequently accessed data
 Low latency and high throughput
 Sustains 2 concurrent facility failures
 Use cases: big data analytics, mobile/gaming apps, content distribution
IA storage classes
 Standard – infrequent access (IA)
○ Less frequently accessed data that requires rapid access when needed
○ Lower cost than Standard
○ High availability
○ Use cases: disaster recovery, backups
 One Zone – IA
○ High durability within a single AZ; data is lost if that AZ is destroyed
○ Lower availability (99.5%)
○ Use cases: secondary backup copies of on-premises data, or data you can recreate
Glacier storage classes
 Long-term archive
○ Not available for real-time access
○ Must restore objects first before accessing them
 Glacier Instant Retrieval
○ Millisecond retrieval
○ Great for data accessed once a quarter
○ Min storage duration of 90 days
 Glacier Flexible Retrieval
○ Good for data accessed 1–2 times a year
○ Expedited (1–5 min)
○ Standard (3–5 hrs)
○ Bulk (5–12 hrs)
○ Min storage duration of 90 days
 Glacier Deep Archive
○ Data is rarely accessed in a year
○ Standard (12 hrs)
○ Bulk (48 hrs)
○ Min storage duration of 180 days
Intelligent tiering
 Small monthly monitoring and auto-tiering fee
 Moves objects automatically between access tiers based on usage
 No retrieval charges
 Tiers
○ Frequent access (auto) – default
○ Infrequent access (auto) – objects not accessed for 30 days
○ Archive instant access (auto) – not accessed for 90 days
○ Archive access (optional) – configurable from 90 to 700+ days
○ Deep archive access (optional) – configurable from 180 to 700+ days
Can move between classes manually or using S3 Lifecycle configurations
Durability
 How often an object will be lost on S3
 High durability – if you store 10,000,000 objects, you can on average expect to incur a loss of a single object once every 10,000 years
 Same for all storage classes
Availability
 Measures how readily available a service is
 Varies depending on storage class
Storage Page 93

Performance
Sunday, June 2, 2024 4:35 PM

Auto scales to high request rates with low latency
High performance – 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket
Object path = prefix
No limit on the number of prefixes in a bucket
Optimization
 Multi-part upload
○ Recommended for files > 100 MB
○ Must use for files > 5 GB
○ Can help parallelize uploads (speeds up transfers)
 S3 Transfer Acceleration
○ Increases transfer speed by transferring the file to an edge location, which forwards the data to the bucket in the target region
○ Compatible with multi-part upload
Reading files in an efficient way
 S3 Byte-Range fetches
○ Parallelize GETs by requesting specific byte ranges
○ Better resilience in case of failures
○ Can be used to speed up downloads
○ Can be used to retrieve partial data (e.g. the head of a file)
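A small illustration of byte-range fetches with boto3: splitting a large object into fixed-size ranges, downloading them in parallel, and reassembling them. The bucket, key, and part size are arbitrary examples.

```python
import concurrent.futures

import boto3

s3 = boto3.client("s3")

BUCKET = "my-data-lake"                   # hypothetical bucket
KEY = "exports/large-file.parquet"        # hypothetical object key
PART_SIZE = 8 * 1024 * 1024               # 8 MB per byte range


def fetch_range(start, end):
    """GET a single byte range; each range can be fetched and retried independently."""
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()


def parallel_download():
    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    ranges = [(start, min(start + PART_SIZE, size) - 1)
              for start in range(0, size, PART_SIZE)]

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))

    # Reassemble the parts in offset order
    return b"".join(data for _, data in sorted(parts))


# Fetching only the first range also works as a cheap "head of the file" peek:
header_bytes = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=0-1023")["Body"].read()
```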
Select and Glacier Select
 Retrieve less data using SQL by performing server-side filtering
 i.e. filter data while getting it from S3 vs. getting all the data and then filtering client side
 Can filter by rows/columns
 Less network transfer, less CPU cost client side
Storage Page 94

Encryption
Sunday, June 2, 2024 4:35 PM

Server-side encryption (SSE)
 With Amazon S3 managed keys (SSE-S3) – enabled by default
○ Uses keys handled, managed, and owned by AWS
○ Object is encrypted server side
○ Encryption type is AES-256
○ Enabled by default for new buckets and new objects
 With KMS keys stored in AWS KMS (SSE-KMS) – see the sketch at the end of this section
○ Leverages AWS Key Management Service (KMS) to manage encryption keys
○ Advantage: user control and auditing of key usage using CloudTrail
○ Object is encrypted server side
○ Limitation: every upload calls the GenerateDataKey KMS API and every download calls the Decrypt KMS API
○ Both count towards the KMS quota per second
 With customer-provided keys (SSE-C)
○ When you want to manage your own encryption keys
○ Keys fully managed by the customer outside of AWS
○ S3 doesn’t store the encryption key you provide
○ HTTPS must be used
○ The encryption key must be provided in HTTP headers for every HTTP request made
Client-side encryption
 Use client libraries such as the Amazon S3 Client-Side Encryption Library
 Clients must encrypt data themselves before sending it to S3
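To make the SSE options concrete, a short sketch of uploading an object with SSE-KMS via boto3; the bucket, key, and KMS alias are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# SSE-S3 (AES-256, S3-managed keys) is the default for new objects, so a plain
# put_object is already encrypted at rest. To use a customer-managed KMS key instead:
s3.put_object(
    Bucket="my-secure-bucket",           # hypothetical bucket
    Key="reports/2024/q1.csv",           # hypothetical key
    Body=b"col_a,col_b\n1,2\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/my-data-key",     # hypothetical KMS key alias
)

# Note: each upload triggers a GenerateDataKey call and each download a Decrypt
# call against KMS, so heavy traffic counts against the KMS requests-per-second quota.
```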