Table of Contents: Data Engineering Fundamentals, Storage, Database, Migration and Transfer, Compute, Containers, Analytics, Application Integration, Security, Identity, and Compliance, Networking and Content Delivery, Management and Governance, Machine Learning, Developer Tools, Everything Else, Exam Tips

AWS Certified Data Engineer Associate Course DEA-C01 – Welcome! We're starting in 5 minutes.
We're going to prepare for the AWS Certified Data Engineer – Associate exam (DEA-C01). It's a challenging certification, so this course will be long and interesting. Previous AWS knowledge (EC2, networking...) is recommended, and some data engineering background is preferred. We will cover all the AWS data engineering services related to the exam. Take your time, it's not a race!

Services we'll learn
- Analytics: Amazon EMR, AWS Lake Formation, Amazon Redshift, Amazon Kinesis, AWS Glue, Amazon Managed Streaming for Apache Kafka, Amazon QuickSight, Amazon Athena, Amazon OpenSearch Service, Amazon Managed Workflows for Apache Airflow
- Application Integration: Amazon EventBridge, AWS Step Functions, Amazon AppFlow, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS)
- Cloud Financial Management: AWS Budgets, AWS Cost Explorer
- Compute: AWS Batch, Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, AWS Serverless Application Repository
- Containers: Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS)
- Database: Amazon DocumentDB (with MongoDB compatibility), Amazon DynamoDB, Amazon Keyspaces (for Apache Cassandra), Amazon MemoryDB for Redis, Amazon Neptune, Amazon Relational Database Service (Amazon RDS)
- Management and Governance: AWS CloudFormation, AWS CloudTrail, Amazon CloudWatch, AWS Config, Amazon Managed Grafana, AWS Systems Manager, AWS Well-Architected Tool
- Developer Tools: AWS Command Line Interface (AWS CLI), AWS Cloud9, AWS Cloud Development Kit (AWS CDK), AWS CodeBuild, AWS CodeCommit, AWS CodeDeploy, AWS CodePipeline
- Frontend Web: Amazon API Gateway
- Machine Learning: Amazon SageMaker
- Networking and Content Delivery: Amazon CloudFront, AWS PrivateLink, Amazon Route 53, Amazon Virtual Private Cloud (Amazon VPC)
- Security, Identity, and Compliance: AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), Amazon Macie, AWS Secrets Manager, AWS Shield, AWS WAF
- Migration and Transfer: AWS Application Discovery Service, AWS Application Migration Service, AWS Database Migration Service (AWS DMS), AWS DataSync, AWS Transfer Family, AWS Snow Family
- Storage: AWS Backup, Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), Amazon Simple Storage Service (Amazon S3)

Data Engineering Fundamentals – Beyond AWS, a review

Types of Data: Structured, Unstructured, Semi-Structured

Structured Data
Definition: Data that is organized in a defined manner or schema, typically found in relational databases.
Characteristics: Easily queryable; organized in rows and columns; has a consistent structure.
Examples: Database tables, CSV files with consistent columns, Excel spreadsheets.

Unstructured Data
Definition: Data that doesn't have a predefined structure or schema.
Characteristics: Not easily queryable without preprocessing; may come in various formats.
Examples: Text files without a fixed format, videos and audio files, images, emails and word processing documents.

Semi-Structured Data
Definition: Data that is not as organized as structured data but has some level of structure in the form of tags, hierarchies, or other patterns.
Characteristics: Elements might be tagged or categorized in some way; more flexible than structured data but not as chaotic as unstructured data.
Examples: XML and JSON files, email headers (which have a mix of structured fields like date, subject, etc., and unstructured data in the body), log files with varied formats.

Properties of Data: Volume, Velocity, Variety

Volume
Definition: Refers to the amount or size of data that organizations are dealing with at any given time.
Characteristics: May range from gigabytes to petabytes or even more; challenges in storing, processing, and analyzing high volumes of data.
Examples: A popular social media platform processing terabytes of data daily from user posts, images, and videos. Retailers collecting years' worth of transaction data, amounting to several petabytes.

Velocity
Definition: Refers to the speed at which new data is generated, collected, and processed.
Characteristics: High velocity requires real-time or near-real-time processing capabilities; rapid ingestion and processing can be critical for certain applications.
Examples: Sensor data from IoT devices streaming readings every millisecond. High-frequency trading systems where milliseconds can make a difference in decision-making.

Variety
Definition: Refers to the different types, structures, and sources of data.
Characteristics: Data can be structured, semi-structured, or unstructured; data can come from multiple sources and in various formats.
Examples: A business analyzing data from relational databases (structured), emails (unstructured), and JSON logs (semi-structured). Healthcare systems collecting data from electronic medical records, wearable health devices, and patient feedback forms.

Data Warehouses vs. Data Lakes

Data Warehouse
Definition: A centralized repository optimized for analysis, where data from different sources is stored in a structured format.
Characteristics: Designed for complex queries and analysis; data is cleaned, transformed, and loaded (ETL process); typically uses a star or snowflake schema; optimized for read-heavy operations.
Examples: Amazon Redshift, Google BigQuery, Microsoft Azure SQL Data Warehouse.

Data Warehouse example (diagram): clickstream, purchase, and catalog data feed a central data warehouse, which in turn feeds accounting, analysis, and machine learning data marts.

Data Lake
Definition: A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
Characteristics: Can store large volumes of raw data without a predefined schema; data is loaded as-is, with no need for preprocessing; supports batch, real-time, and stream processing; can be queried for data transformation or exploration purposes.
Examples: Amazon Simple Storage Service (S3) when used as a data lake (with AWS Glue and Amazon Athena), Azure Data Lake Storage, Hadoop Distributed File System (HDFS).
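To make the "query raw data in place" idea concrete, here is a minimal sketch using Amazon Athena over S3; the bucket path, table name, and columns are hypothetical and not taken from the course:

CREATE EXTERNAL TABLE IF NOT EXISTS clickstream_events (
  event_time  string,
  user_id     string,
  page        string,
  duration_ms int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/raw/clickstream/';

-- the raw JSON objects stay in S3 untouched; the schema is applied at read time
SELECT page, COUNT(*) AS views
FROM clickstream_events
GROUP BY page
ORDER BY views DESC;

The DDL only registers a schema over the existing files, which is the schema-on-read pattern contrasted with the warehouse's schema-on-write below.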
Comparing the two
Schema: Data Warehouse – schema-on-write (predefined schema before writing data), Extract – Transform – Load (ETL). Data Lake – schema-on-read (schema is defined at the time of reading data), Extract – Load – Transform (ELT).
Data Types: Data Warehouse – primarily structured data. Data Lake – both structured and unstructured data.
Agility: Data Warehouse – less agile due to the predefined schema. Data Lake – more agile, as it accepts raw data without a predefined structure.
Processing: Data Warehouse – ETL (Extract, Transform, Load). Data Lake – ELT (Extract, Load, Transform), or just Load for storage purposes.
Cost: Data Warehouse – typically more expensive because of optimizations for complex queries. Data Lake – cost-effective storage, but costs can rise when processing large amounts of data.

Choosing a Warehouse vs. a Lake
Use a Data Warehouse when: you have structured data sources and require fast and complex queries; data integration from different sources is essential; business intelligence and analytics are the primary use cases.
Use a Data Lake when: you have a mix of structured, semi-structured, or unstructured data; you need a scalable and cost-effective solution to store massive amounts of data; future needs for the data are uncertain and you want flexibility in storage and processing; advanced analytics, machine learning, or data discovery are key goals.
Often, organizations use a combination of both, ingesting raw data into a data lake and then processing and moving refined data into a data warehouse for analysis.

Data Lakehouse
Definition: A hybrid data architecture that combines the best features of data lakes and data warehouses, aiming to provide the performance, reliability, and capabilities of a data warehouse while maintaining the flexibility, scale, and low-cost storage of data lakes.
Characteristics: Supports both structured and unstructured data; allows for schema-on-write and schema-on-read; provides capabilities for both detailed analytics and machine learning tasks; typically built on top of cloud or distributed architectures; benefits from technologies like Delta Lake, which bring ACID transactions to big data.
Examples: AWS Lake Formation (with S3 and Redshift Spectrum); Delta Lake, an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads; Databricks Lakehouse Platform, a unified platform that combines the capabilities of data lakes and data warehouses; Azure Synapse Analytics, Microsoft's analytics service that brings together big data and data warehousing.

Data Mesh
Coined in 2019; it's more about governance and organization than technology. Individual teams own "data products" within a given data domain. These data products serve various "use cases" around the organization ("domain-based data management"). Federated governance with central standards. Self-service tooling and infrastructure. Data lakes, warehouses, etc. may be part of it, but a "data mesh" is about the data management paradigm, not the specific technologies or data architectures.
ETL Pipelines
Definition: ETL stands for Extract, Transform, Load. It's a process used to move data from source systems into a data warehouse.
Extract: Retrieve raw data from source systems, which can be databases, CRMs, flat files, APIs, or other data repositories. Ensure data integrity during the extraction phase. Can be done in real time or in batches, depending on requirements.
Transform: Convert the extracted data into a format suitable for the target data warehouse. Can involve various operations such as: data cleansing (e.g., removing duplicates, fixing errors); data enrichment (e.g., adding additional data from other sources); format changes (e.g., date formatting, string manipulation); aggregations or computations (e.g., calculating totals or averages); encoding or decoding data; handling missing values.
Load: Move the transformed data into the target data warehouse or another data repository. Can be done in batches (all at once) or in a streaming manner (as data becomes available). Ensure that data maintains its integrity during the loading phase.

Managing ETL Pipelines
This process must be automated in some reliable way. AWS Glue; orchestration services such as EventBridge, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Step Functions, Lambda, and Glue Workflows. We'll get into specific architectures later.
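As a rough illustration of the Transform step above, here is a minimal SQL sketch; the raw_orders staging table and its columns are hypothetical, assumed only for this example:

-- de-duplicate raw records, fix a date format, and aggregate before loading
WITH deduped AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ingested_at DESC) AS rn
  FROM raw_orders
)
SELECT
  CAST(order_date AS DATE) AS order_date,   -- format change
  customer_id,
  SUM(amount)  AS total_amount,             -- aggregation
  COUNT(*)     AS order_count
FROM deduped
WHERE rn = 1                                -- keep only the latest copy of each order
GROUP BY CAST(order_date AS DATE), customer_id;

The result of such a query is what the Load step would then write into the warehouse, in batch or streaming fashion.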
Data Sources
- JDBC (Java Database Connectivity): platform-independent, language-dependent.
- ODBC (Open Database Connectivity): platform-dependent (thanks to drivers), language-independent.
- Raw logs
- APIs
- Streams

Common Data Formats

CSV (Comma-Separated Values)
Description: Text-based format that represents data in tabular form, where each line corresponds to a row and values within a row are separated by commas.
When to use: For small to medium datasets; for data interchange between systems with different technologies; for human-readable and editable data storage; importing/exporting data from databases or spreadsheets.
Systems: Databases (SQL-based), Excel, Pandas in Python, R, many ETL tools.

JSON (JavaScript Object Notation)
Description: Lightweight, text-based, human-readable data interchange format that represents structured or semi-structured data based on key-value pairs.
When to use: Data interchange between a web server and a web client; configurations and settings for software applications; use cases that need a flexible schema or nested data structures.
Systems: Web browsers, many programming languages (JavaScript, Python, Java, etc.), RESTful APIs, NoSQL databases (like MongoDB).

Avro
Description: Binary format that stores both the data and its schema, allowing it to be processed later by different systems without needing the original system's context.
When to use: With big data and real-time processing systems; when schema evolution (changes in data structure) is needed; efficient serialization for data transport between systems.
Systems: Apache Kafka, Apache Spark, Apache Flink, Hadoop ecosystem.

Parquet
Description: Columnar storage format optimized for analytics. Allows for efficient compression and encoding schemes.
When to use: Analyzing large datasets with analytics engines; use cases where reading specific columns instead of entire records is beneficial; storing data on distributed systems where I/O operations and storage need optimization.
Systems: Hadoop ecosystem, Apache Spark, Apache Hive, Apache Impala, Amazon Redshift Spectrum.

A Very (Intentionally) Incomplete Overview of Data Modeling
The exam guide doesn't really talk about specific data models, but here's a star schema: a central fact table linked to dimension tables through primary/foreign keys. This sort of diagram is an Entity Relationship Diagram (ERD).
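A minimal sketch of such a star schema in SQL DDL, with hypothetical table and column names (the course's ERD itself is not reproduced in this transcript):

-- dimension tables hold descriptive attributes
CREATE TABLE dim_customer (
  customer_key INT PRIMARY KEY,
  name         VARCHAR(100),
  segment      VARCHAR(50)
);

CREATE TABLE dim_product (
  product_key INT PRIMARY KEY,
  name        VARCHAR(100),
  category    VARCHAR(50)
);

CREATE TABLE dim_date (
  date_key  INT PRIMARY KEY,   -- e.g. 20230815
  full_date DATE,
  year      INT,
  month     INT
);

-- the fact table holds measures plus foreign keys to the dimensions
CREATE TABLE fact_sales (
  sale_id      BIGINT PRIMARY KEY,
  customer_key INT REFERENCES dim_customer (customer_key),
  product_key  INT REFERENCES dim_product (product_key),
  date_key     INT REFERENCES dim_date (date_key),
  quantity     INT,
  amount       DECIMAL(10, 2)
);

Analytical queries then join the fact table to whichever dimensions they need and aggregate the measures.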
Data Lineage
Description: A visual representation that traces the flow and transformation of data through its lifecycle, from its source to its final destination.
Importance: Helps in tracking errors back to their source; ensures compliance with regulations; provides a clear understanding of how data is moved, transformed, and consumed within systems.
Example of capturing data lineage: a Spline Agent (for Spark) attached to Glue, with lineage data dumped into Neptune via Lambda. Image: AWS (https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/)

Schema Evolution
Description: The ability to adapt and change the schema of a dataset over time without disrupting existing processes or systems.
Importance: Ensures data systems can adapt to changing business requirements; allows for the addition, removal, or modification of columns/fields in a dataset; maintains backward compatibility with older data records.
Glue Schema Registry: schema discovery, compatibility, validation, registration...

Database Performance Optimization
- Indexing: avoid full table scans! Enforce data uniqueness and integrity.
- Partitioning: reduce the amount of data scanned; helps with data lifecycle management; enables parallel processing.
- Compression: speed up data transfer, reduce storage and disk reads. GZIP, LZOP, BZIP2, ZSTD (Redshift examples); various tradeoffs between compression ratio and speed; columnar compression. (Diagram: compute nodes with node slices.)

Data Sampling Techniques
- Random Sampling: everything has an equal chance of being selected.
- Stratified Sampling: divide the population into homogeneous subgroups (strata) and take a random sample within each stratum; ensures representation of each subgroup. (Image: Dan Kernler, CC BY-SA 4.0, via Wikimedia Commons)
- Others: systematic, cluster, convenience, judgmental. (Example slides: random sampling; stratified sampling across Books, Music, Home & Garden, Apparel; systematic sampling.)

Data Skew Mechanisms
Data skew refers to the unequal distribution or imbalance of data across various nodes or partitions in distributed computing systems.
"The celebrity problem": even partitioning doesn't work if your traffic is uneven. Imagine you're IMDb – actor clickstream data for Brad Pitt could overload his partition.
Causes: non-uniform distribution of data, inadequate partitioning strategy, temporal skew.
It is important to monitor data distribution and alert when skew issues arise.

Addressing Data Skew
1. Adaptive partitioning: dynamically adjust partitioning based on data characteristics to ensure a more balanced distribution.
2. Salting: introduce a random factor or "salt" into the data to distribute it more uniformly.
3. Repartitioning: regularly redistribute the data based on its current distribution characteristics.
4. Sampling: use a sample of the data to determine the distribution and adjust the processing strategy accordingly.
5. Custom partitioning: define custom rules or functions for partitioning data based on domain knowledge.

Data Validation and Profiling
1. Completeness. Definition: ensures all required data is present and no essential parts are missing. Checks: missing values, null counts, percentage of populated fields. Importance: missing data can lead to inaccurate analyses and insights.
2. Consistency. Definition: ensures data values are consistent across datasets and do not contradict each other. Checks: cross-field validation, comparing data from different sources or periods. Importance: inconsistent data can cause confusion and result in incorrect conclusions.
3. Accuracy. Definition: ensures data is correct, reliable, and represents what it is supposed to. Checks: comparing with trusted sources, validation against known standards or rules. Importance: inaccurate data can lead to false insights and poor decision-making.
4. Integrity. Definition: ensures data maintains its correctness and consistency over its lifecycle and across systems. Checks: referential integrity (e.g., foreign key checks in databases), relationship validations. Importance: ensures relationships between data elements are preserved, and data remains trustworthy over time.
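A minimal sketch of such validation checks in SQL, assuming hypothetical customers and orders tables (column names chosen for illustration only):

-- completeness and a simple accuracy rule
SELECT
  COUNT(*)                                          AS total_rows,
  COUNT(*) - COUNT(email)                           AS missing_email,
  100.0 * COUNT(email) / COUNT(*)                   AS pct_email_populated,
  COUNT(CASE WHEN age < 0 OR age > 120 THEN 1 END)  AS implausible_age
FROM customers;

-- referential integrity: orders that point at customers that don't exist
SELECT o.order_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

Queries like these can be scheduled as part of a pipeline so that completeness, accuracy, and integrity are profiled continuously rather than checked by hand.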
SQL Review

Aggregation
COUNT: SELECT COUNT(*) AS total_rows FROM employees;
SUM: SELECT SUM(salary) AS total_salary FROM employees;
AVG: SELECT AVG(salary) AS average_salary FROM employees;
MAX / MIN: SELECT MAX(salary) AS highest_salary FROM employees;

Aggregate with CASE
A WHERE clause is applied before aggregation, so a single query can only filter on one condition at a time:
SELECT COUNT(*) AS high_salary_count FROM employees WHERE salary > 70000;
One way to apply multiple filters to what you're aggregating is conditional aggregation with CASE:
SELECT
  COUNT(CASE WHEN salary > 70000 THEN 1 END) AS high_salary_count,
  COUNT(CASE WHEN salary BETWEEN 50000 AND 70000 THEN 1 END) AS medium_salary_count,
  COUNT(CASE WHEN salary < 50000 THEN 1 END) AS low_salary_count
FROM employees;

Grouping; nested grouping and sorting (GROUP BY with ORDER BY).

Pivoting
Pivoting is the act of turning row-level data into columnar data. How this works is very database-specific; some databases have a PIVOT command. For example, imagine we have a sales table that contains the sales amount and the salesperson in each row, but we want a report by salesperson. The same thing can be achieved with conditional aggregation, without requiring a specific PIVOT operation.
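A minimal sketch of both ideas, assuming a hypothetical sales(salesperson, quarter, amount) table rather than the course's exact example:

-- grouping: total and average sales per salesperson, largest first
SELECT salesperson,
       SUM(amount) AS total_sales,
       AVG(amount) AS avg_sale
FROM sales
GROUP BY salesperson
ORDER BY total_sales DESC;

-- "pivot" the same rows into one column per quarter with conditional aggregation
SELECT salesperson,
       SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1_sales,
       SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2_sales,
       SUM(CASE WHEN quarter = 'Q3' THEN amount ELSE 0 END) AS q3_sales,
       SUM(CASE WHEN quarter = 'Q4' THEN amount ELSE 0 END) AS q4_sales
FROM sales
GROUP BY salesperson;

Each SUM(CASE ...) column plays the role of one pivoted column; databases with a native PIVOT command express the same idea more directly.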
INNER JOIN (Venn diagram: GermanX, CC BY-SA 4.0, via Wikimedia Commons)
LEFT OUTER JOIN (Venn diagram: GermanX, CC BY-SA 4.0, via Wikimedia Commons)
RIGHT OUTER JOIN (Venn diagram: GermanX, CC BY-SA 4.0, via Wikimedia Commons)
FULL OUTER JOIN (Venn diagram: GermanX, CC BY-SA 4.0, via Wikimedia Commons)
CROSS JOIN (Venn diagram: GermanX, CC BY-SA 4.0, via Wikimedia Commons)

SQL Regular Expressions
Pattern matching; think of it as a much more powerful "LIKE".
~ is the regular expression operator; ~* is case-insensitive; !~* means "does not match the expression, case insensitive".
Regular expressions 101:
^ matches a pattern at the start of a string.
$ matches a pattern at the end of a string (boo$ would match boo but not book).
| alternates patterns (sit|sat matches both sit and sat).
Ranges: [a-z] matches any lowercase letter.
Repeats: [a-z]{4} matches any four-letter lowercase word.
Special metacharacters: \d – any digit; \w – any letter, digit, or underscore; \s – whitespace; \t – tab.
Example: SELECT * FROM name WHERE name ~* '^(fire|ice)'; selects any rows where the name starts with "fire" or "ice" (case insensitive).

Git review (image: Daniel Kinzler, CC BY 3.0, via Wikimedia Commons)

Common git commands
Setting up and configuration:
git init: initialize a new Git repository.
git config: set configuration values for user info, aliases, and more.
git config --global user.name "Your Name": set your name.
git config --global user.email "you@example.com": set your email.
Basic commands:
git clone <url>: clone (download) a repository from an existing URL.
git status: check the status of your changes in the working directory.
git add <file>: add changes in the file to the staging area.
git add .: add all new and changed files to the staging area.
git commit -m "Commit message here": commit the staged changes with a message.
git log: view commit logs.

Branching with git (image: Felix Dreissig, noris network AG)
git branch: list all local branches.
git branch <branch-name>: create a new branch.
git checkout <branch-name>: switch to a specific branch.
git checkout -b <branch-name>: create a new branch and switch to it.
git merge <branch-name>: merge the specified branch into the current branch.
git branch -d <branch-name>: delete a branch.

Remote repositories
git remote add <name> <url>: add a remote repository.
git remote: list all remote repositories.
git push <remote> <branch>: push a branch to a remote repository.
git pull <remote> <branch>: pull changes from a remote repository branch into the current local branch.

Undoing changes
git reset: reset your staging area to match the most recent commit, without affecting the working directory.
git reset --hard: reset the staging area and the working directory to match the most recent commit.
git revert <commit>: create a new commit that undoes all of the changes from a previous commit.

Advanced git
git stash: temporarily save changes that are not yet ready for a commit.
git stash pop: restore the most recently stashed changes.
git rebase <branch>: reapply changes from one branch onto another, often used to integrate changes from one branch into another.
git cherry-pick <commit>: apply changes from a specific commit to the current branch.

Git collaboration and inspection
git blame <file>: show who made changes to a file and when.
git diff: show changes between commits, between a commit and the working tree, etc.
git fetch: fetch changes from a remote repository without merging them.

Git maintenance and data recovery
git fsck: check the database for errors.
git gc: clean up and optimize the local repository.
git reflog: record of when refs were updated in the local repository; useful for recovering lost commits.
Storage – storing, accessing, and backing up data in AWS

Amazon S3 Section

Section introduction
Amazon S3 is one of the main building blocks of AWS. It's advertised as "infinitely scaling" storage. Many websites use Amazon S3 as a backbone, and many AWS services use Amazon S3 as an integration as well. We'll take a step-by-step approach to S3.

Amazon S3 use cases
Backup and storage; disaster recovery; archive (Nasdaq stores 7 years of data in S3 Glacier); hybrid cloud storage; application hosting; media hosting; data lakes and big data analytics (Sysco runs analytics on its data to gain business insights); software delivery; static websites.

Amazon S3 – Buckets
Amazon S3 allows people to store objects (files) in "buckets" (directories). Buckets must have a globally unique name (across all regions and all accounts). Buckets are defined at the region level: S3 looks like a global service, but buckets are created in a region.
Naming convention: no uppercase, no underscores; 3-63 characters long; not an IP address; must start with a lowercase letter or a number; must NOT start with the prefix xn--; must NOT end with the suffix -s3alias.

Amazon S3 – Objects
Objects (files) have a key. The key is the FULL path:
s3://my-bucket/my_file.txt
s3://my-bucket/my_folder1/another_folder/my_file.txt
The key is composed of prefix + object name. There's no concept of "directories" within buckets (although the UI will trick you into thinking otherwise), just keys with very long names that contain slashes ("/").

Amazon S3 – Objects (cont.)
Object values are the content of the body:
Maximum object size is 5 TB (5,000 GB); if uploading more than 5 GB, you must use "multi-part upload".
Objects also have metadata (a list of text key/value pairs – system or user metadata), tags (Unicode key/value pairs – up to 10, useful for security/lifecycle), and a version ID (if versioning is enabled).

Amazon S3 – Security
User-based: IAM policies – which API calls should be allowed for a specific user from IAM.
Resource-based: bucket policies – bucket-wide rules from the S3 console, allow cross-account access; object Access Control Lists (ACLs) – finer grain (can be disabled); bucket Access Control Lists (ACLs) – less common (can be disabled).
Note: an IAM principal can access an S3 object if the user's IAM permissions ALLOW it OR the resource policy ALLOWS it, AND there's no explicit DENY.
Encryption: encrypt objects in Amazon S3 using encryption keys.

S3 Bucket Policies
JSON-based policies. Resources: buckets and objects. Effect: Allow / Deny. Actions: set of APIs to Allow or Deny. Principal: the account or user to apply the policy to.
Use an S3 bucket policy to: grant public access to the bucket; force objects to be encrypted at upload; grant access to another account (cross-account).
Example – public access: a bucket policy that allows public access lets an anonymous website visitor read from the bucket.
Example – user access to S3: an IAM policy attached to an IAM user grants access to the bucket.
Example – EC2 instance access: use IAM roles; an EC2 instance role with the right IAM permissions grants the instance access to the bucket.
Advanced – cross-account access: a bucket policy that allows cross-account access lets an IAM user from another AWS account reach the bucket.

Bucket settings for Block Public Access
These settings were created to prevent company data leaks. If you know your bucket should never be public, leave these on. Can be set at the account level.

Amazon S3 – Versioning
You can version your files in Amazon S3. Versioning is enabled at the bucket level. Overwriting the same key will change the "version": 1, 2, 3...
It is best practice to version your buckets: it protects against unintended deletes (ability to restore a version) and makes it easy to roll back to a previous version.
Notes: any file that is not versioned prior to enabling versioning will have version "null"; suspending versioning does not delete the previous versions.

Amazon S3 – Replication (CRR & SRR)
Must enable versioning in the source and destination buckets. Cross-Region Replication (CRR) and Same-Region Replication (SRR). Buckets can be in different AWS accounts. Copying is asynchronous. Must give proper IAM permissions to S3.
Use cases: CRR – compliance, lower-latency access, replication across accounts. SRR – log aggregation, live replication between production and test accounts.

Amazon S3 – Replication (notes)
After you enable replication, only new objects are replicated. Optionally, you can replicate existing objects using S3 Batch Replication, which replicates existing objects and objects that failed replication.
For DELETE operations: you can replicate delete markers from source to target (optional setting); deletions with a version ID are not replicated (to avoid malicious deletes).
There is no "chaining" of replication: if bucket 1 has replication into bucket 2, which has replication into bucket 3, then objects created in bucket 1 are not replicated to bucket 3.

S3 Storage Classes
Amazon S3 Standard – General Purpose; Amazon S3 Standard-Infrequent Access (IA); Amazon S3 One Zone-Infrequent Access; Amazon S3 Glacier Instant Retrieval; Amazon S3 Glacier Flexible Retrieval; Amazon S3 Glacier Deep Archive; Amazon S3 Intelligent-Tiering.
You can move objects between classes manually or using S3 Lifecycle configurations.

S3 Durability and Availability
Durability: high durability (99.999999999%, "11 nines") of objects across multiple AZs. If you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years. Durability is the same for all storage classes.
Availability: measures how readily available a service is; varies depending on storage class. Example: S3 Standard has 99.99% availability, i.e., it is unavailable about 53 minutes a year.

S3 Standard – General Purpose
99.99% availability. Used for frequently accessed data. Low latency and high throughput. Sustains 2 concurrent facility failures. Use cases: big data analytics, mobile and gaming applications, content distribution...

S3 Storage Classes – Infrequent Access
For data that is less frequently accessed but requires rapid access when needed. Lower cost than S3 Standard.
Amazon S3 Standard-Infrequent Access (S3 Standard-IA): 99.9% availability. Use cases: disaster recovery, backups.
Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA): high durability (99.999999999%) in a single AZ; data is lost when the AZ is destroyed.
99.5% availability. Use cases: storing secondary backup copies of on-premises data, or data you can recreate.

Amazon S3 Glacier Storage Classes
Low-cost object storage meant for archiving/backup. Pricing: price for storage + object retrieval cost.
Amazon S3 Glacier Instant Retrieval: millisecond retrieval, great for data accessed once a quarter; minimum storage duration of 90 days.
Amazon S3 Glacier Flexible Retrieval (formerly Amazon S3 Glacier): Expedited (1 to 5 minutes), Standard (3 to 5 hours), Bulk (5 to 12 hours – free); minimum storage duration of 90 days.
Amazon S3 Glacier Deep Archive – for long-term storage: Standard (12 hours), Bulk (48 hours); minimum storage duration of 180 days.

S3 Intelligent-Tiering
Small monthly monitoring and auto-tiering fee. Moves objects automatically between access tiers based on usage. There are no retrieval charges in S3 Intelligent-Tiering.
Frequent Access tier (automatic): default tier. Infrequent Access tier (automatic): objects not accessed for 30 days. Archive Instant Access tier (automatic): objects not accessed for 90 days. Archive Access tier (optional): configurable from 90 days to 700+ days.

S3 Storage Classes Comparison (https://aws.amazon.com/s3/storage-classes/)
Durability is 99.999999999% (11 nines) for all classes.
- Standard: 99.99% availability, 99.9% availability SLA, >= 3 Availability Zones, no minimum storage duration charge, no minimum billable object size, no retrieval fee.
- Intelligent-Tiering: 99.9% availability, 99% SLA, >= 3 AZs, no minimum storage duration charge, no minimum billable object size, no retrieval fee.
- Standard-IA: 99.9% availability, 99% SLA, >= 3 AZs, 30-day minimum storage duration, 128 KB minimum billable object size, per-GB retrieval fee.
- One Zone-IA: 99.5% availability, 99% SLA, 1 AZ, 30-day minimum storage duration, 128 KB minimum billable object size, per-GB retrieval fee.
- Glacier Instant Retrieval: 99.9% availability, 99% SLA, >= 3 AZs, 90-day minimum storage duration, 128 KB minimum billable object size, per-GB retrieval fee.
- Glacier Flexible Retrieval: 99.99% availability, 99.9% SLA, >= 3 AZs, 90-day minimum storage duration, 40 KB minimum billable object size, per-GB retrieval fee.
- Glacier Deep Archive: 99.99% availability, 99.9% SLA, >= 3 AZs, 180-day minimum storage duration, 40 KB minimum billable object size, per-GB retrieval fee.
S3 Storage Classes – Price Comparison, example for us-east-1 (https://aws.amazon.com/s3/pricing/)
- Standard: storage $0.023 per GB-month; requests GET $0.0004, POST $0.005 per 1,000; instantaneous retrieval.
- Intelligent-Tiering: storage $0.0025 – $0.023 per GB-month; GET $0.0004, POST $0.005 per 1,000; instantaneous retrieval; monitoring cost $0.0025 per 1,000 objects.
- Standard-IA: storage $0.0125 per GB-month; GET $0.001, POST $0.01 per 1,000; instantaneous retrieval.
- One Zone-IA: storage $0.01 per GB-month; GET $0.001, POST $0.01 per 1,000; instantaneous retrieval.
- Glacier Instant Retrieval: storage $0.004 per GB-month; GET $0.01, POST $0.02 per 1,000; instantaneous retrieval.
- Glacier Flexible Retrieval: storage $0.0036 per GB-month; GET $0.0004, POST $0.03 per 1,000; retrieval per GB: Expedited $10, Standard $0.10, Bulk free; retrieval time: Expedited (1-5 minutes), Standard (3-5 hours), Bulk (5-12 hours).
- Glacier Deep Archive: storage $0.00099 per GB-month; GET $0.0004, POST $0.05 per 1,000; retrieval per GB: Standard $0.05, Bulk $0.025; retrieval time: Standard (12 hours), Bulk (48 hours).

Amazon S3 – Moving between Storage Classes
You can transition objects between storage classes. For infrequently accessed objects, move them to Standard-IA. For archive objects that you don't need fast access to, move them to Glacier or Glacier Deep Archive. Transitions flow downward from Standard through Standard-IA, Intelligent-Tiering, One Zone-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, and Glacier Deep Archive. Moving objects can be automated using Lifecycle Rules.

Amazon S3 – Lifecycle Rules
Transition actions – configure objects to transition to another storage class, e.g., move objects to the Standard-IA class 60 days after creation, or move to Glacier for archiving after 6 months.
Expiration actions – configure objects to expire (be deleted) after some time, e.g., access log files can be set to delete after 365 days. Can be used to delete old versions of files (if versioning is enabled) and to delete incomplete multi-part uploads.
Rules can be created for a certain prefix (example: s3://mybucket/mp3/*) or for certain object tags (example: Department: ...).

Amazon S3 – Lifecycle Rules (Scenario 1)
Your application on EC2 creates image thumbnails after profile photos are uploaded to Amazon S3. These thumbnails can be easily recreated and only need to be kept for 60 days. The source images should be immediately retrievable for those 60 days, and afterwards the user can wait up to 6 hours. How would you design this?
S3 source images can be on Standard, with a lifecycle configuration to transition them to Glacier after 60 days. S3 thumbnails can be on One Zone-IA, with a lifecycle configuration to expire (delete) them after 60 days.

Amazon S3 – Lifecycle Rules (Scenario 2)
A rule in your company states that you should be able to recover your deleted S3 objects immediately for 30 days, although this may happen rarely. After this time, and for up to 365 days, deleted objects should be recoverable within 48 hours.
Enable S3 Versioning in order to have object versions, so that "deleted objects" are in fact hidden by a "delete marker" and can be recovered. Transition the "noncurrent versions" of the objects to Standard-IA, then transition those noncurrent versions to Glacier Deep Archive.

Amazon S3 Analytics – Storage Class Analysis
Helps you decide when to transition objects to the right storage class. Recommendations for Standard and Standard-IA; does NOT work for One Zone-IA or Glacier. The report is updated daily, and it takes 24 to 48 hours to start seeing data analysis. A good first step to put together Lifecycle Rules (or improve them)! Example .csv report columns: Date, StorageClass, ObjectAge (e.g., 8/22/2022, STANDARD, 000-014).

S3 Event Notifications
Events: S3:ObjectCreated, S3:ObjectRemoved, S3:ObjectRestore, S3:Replication... Object name filtering is possible (e.g., *.jpg). Use case: generate thumbnails of images uploaded to S3. You can create as many "S3 events" as desired. S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer. Destinations include SNS, SQS, and Lambda functions.

S3 Event Notifications – IAM Permissions
The destination must authorize S3 through its own resource policy: an SNS resource (access) policy, an SQS resource (access) policy, or a Lambda resource policy.

S3 Event Notifications with Amazon EventBridge
All events from the S3 bucket can go to Amazon EventBridge, where rules route them to over 18 AWS services as destinations. Advanced filtering options with JSON rules (metadata, object size, name...). Multiple destinations – e.g., Step Functions, Kinesis Streams / Firehose... EventBridge capabilities – archive, replay events, reliable delivery.

S3 – Baseline Performance
Amazon S3 automatically scales to high request rates, with latency of 100-200 ms. Your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket.
Example (object path => prefix):
bucket/folder1/sub1/file => /folder1/sub1/
bucket/folder1/sub2/file => /folder1/sub2/
bucket/1/file => /1/
bucket/2/file => /2/
If you spread reads across all four prefixes evenly, you can achieve 22,000 requests per second for GET and HEAD.

S3 Performance
Multi-part upload: recommended for files > 100 MB, required for files > 5 GB. Helps parallelize uploads (speed up transfers) by dividing the file into parts uploaded in parallel.
S3 Transfer Acceleration: increase transfer speed by transferring the file to an AWS edge location, which forwards the data to the S3 bucket in the target region (fast public internet to the edge, fast private AWS network from the edge, e.g., a file in the USA uploaded to a bucket in Australia). Compatible with multi-part upload.

S3 Performance – S3 Byte-Range Fetches
Parallelize GETs by requesting specific byte ranges. Better resilience in case of failures. Can be used to speed up downloads (request parts in parallel) or to retrieve only partial data (for example, the head of a file, by requesting the first XX bytes).

S3 Select & Glacier Select
Retrieve less data using SQL by performing server-side filtering. Can filter by rows and columns (simple SQL statements). Less network transfer, less client-side CPU cost. (https://aws.amazon.com/blogs/aws/s3-glacier-select/)
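For instance, here is a minimal sketch of an S3 Select expression run against a single CSV object; the column names are hypothetical, and referencing columns by name assumes the request is configured to treat the first CSV line as a header:

SELECT s.employee_name, s.salary
FROM S3Object s
WHERE CAST(s.salary AS INT) > 70000

Only the matching rows and columns leave S3, which is what reduces network transfer and client-side CPU compared with downloading the whole object.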
Amazon S3 – Object Encryption
You can encrypt objects in S3 buckets using one of 4 methods.
Server-Side Encryption (SSE):
- Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3) – enabled by default: encrypts S3 objects using keys handled, managed, and owned by AWS.
- Server-Side Encryption with KMS keys stored in AWS KMS (SSE-KMS): leverage AWS Key Management Service (AWS KMS) to manage encryption keys.
- Server-Side Encryption with Customer-Provided Keys (SSE-C): when you want to manage your own encryption keys.
Client-Side Encryption.
It's important to understand which ones are for which situation for the exam.

Amazon S3 Encryption – SSE-S3
Encryption using keys handled, managed, and owned by AWS. The object is encrypted server-side. Encryption type is AES-256. Must set header "x-amz-server-side-encryption": "AES256". Enabled by default for new buckets and new objects.

Amazon S3 Encryption – SSE-KMS
Encryption using keys handled and managed by AWS KMS (Key Management Service). KMS advantages: user control over keys plus the ability to audit key usage using CloudTrail. The object is encrypted server-side. Must set header "x-amz-server-side-encryption": "aws:kms".

SSE-KMS Limitation
If you use SSE-KMS, you may be impacted by the KMS limits. When you upload, it calls the GenerateDataKey KMS API; when you download, it calls the Decrypt KMS API. These calls count towards the KMS quota per second (5,500, 10,000, or 30,000 requests/s depending on the region). You can request a quota increase using the Service Quotas console.

Amazon S3 Encryption – SSE-C
Server-side encryption using keys fully managed by the customer outside of AWS. Amazon S3 does NOT store the encryption key you provide. HTTPS must be used. The encryption key must be provided in HTTP headers for every HTTP request made.

Amazon S3 Encryption – Client-Side Encryption
Use client libraries such as the Amazon S3 Client-Side Encryption Library. Clients must encrypt data themselves before sending it to Amazon S3, and decrypt it themselves when retrieving it from Amazon S3. The customer fully manages the keys and the encryption cycle.

Amazon S3 – Encryption in Transit (SSL/TLS)
Encryption in flight is also called SSL/TLS. Amazon S3 exposes two endpoints: an HTTP endpoint (not encrypted) and an HTTPS endpoint (encryption in flight). HTTPS is recommended, and HTTPS is mandatory for SSE-C. Most clients use the HTTPS endpoint by default.

Amazon S3 – Force Encryption in Transit
Use a bucket policy with the aws:SecureTransport condition to deny any request to the bucket that does not use HTTPS.
Amazon S3 – Default Encryption vs. Bucket Policies
SSE-S3 encryption is automatically applied to new objects stored in an S3 bucket. Optionally, you can "force encryption" using a bucket policy and refuse any API call to PUT an S3 object without encryption headers (SSE-KMS or SSE-C). Note: bucket policies are evaluated before "Default Encryption".

S3 – Access Points
Access Points simplify security management for S3 buckets. Example: a bucket with /finance/... and /sales/... prefixes can expose a Finance access point (policy grants read/write to the /finance prefix), a Sales access point (policy grants read/write to the /sales prefix), and an Analytics access point (policy grants read access to the entire bucket), while the bucket itself keeps a simple bucket policy.
Each Access Point has its own DNS name (Internet Origin or VPC Origin) and an access point policy (similar to a bucket policy) – manage security at scale.

S3 – Access Points – VPC Origin
We can define the access point to be accessible only from within the VPC. You must create a VPC Endpoint (Gateway or Interface Endpoint) to access the Access Point. The VPC Endpoint policy must allow access to the target bucket and the Access Point.

S3 Object Lambda
Use AWS Lambda functions to change the object before it is retrieved by the caller application. Only one S3 bucket is needed, on top of which we create a supporting S3 Access Point and one or more S3 Object Lambda Access Points.
Use cases: redacting personally identifiable information for analytics or non-production environments (e.g., an e-commerce application writes the original object, while an analytics application reads a redacted object through a redacting Lambda function); enriching data (e.g., a marketing application reads an object enriched from a customer loyalty database through an enriching Lambda function); converting across data formats, such as converting XML to JSON; resizing and watermarking images on the fly using caller-specific details, such as the user who requested the object.

EC2 Instance Storage Section

What's an EBS Volume?
An EBS (Elastic Block Store) volume is a network drive you can attach to your instances while they run. It allows your instances to persist data, even after their termination. They can only be mounted to one instance at a time (at the CCP level). They are bound to a specific Availability Zone. Analogy: think of them as a "network USB stick". Free tier: 30 GB of free EBS storage of type General Purpose (SSD) or Magnetic per month.

EBS Volume
It's a network drive (i.e., not a physical drive). It uses the network to communicate with the instance, which means there might be a bit of latency. It can be detached from an EC2 instance and attached to another one quickly. It's locked to an Availability Zone (AZ): an EBS volume in us-east-1a cannot be attached to us-east-1b; to move a volume across AZs, you first need to snapshot it. It has a provisioned capacity (size in GBs, and IOPS); you get billed for all the provisioned capacity, and you can increase the capacity of the drive over time.

EBS Volume – Example (diagram): EBS volumes of 10 GB, 100 GB, and 50 GB attached to instances in us-east-1a, and volumes of 50 GB and 10 GB in us-east-1b, with one volume left unattached.

EBS – Delete on Termination attribute
Controls the EBS behaviour when an EC2 instance terminates. By default, the root EBS volume is deleted (attribute enabled); by default, any other attached EBS volume is not deleted (attribute disabled). This can be controlled via the AWS console / AWS CLI. Use case: preserve the root volume when an instance is terminated.

Amazon EBS Elastic Volumes
You don't have to detach a volume or restart your instance to change it! Just go to Actions / Modify Volume in the console. Increase volume size (you can only increase, not decrease). Change volume type (e.g., gp2 -> gp3); specify the desired IOPS or throughput performance (or it will guess). Adjust performance (increase or decrease).

Amazon EFS – Elastic File System
Managed NFS (network file system) that can be mounted on many EC2 instances. EFS works with EC2 instances across multiple AZs. Highly available, scalable, expensive (3x gp2), pay per use. (Diagram: EC2 instances in us-east-1a, us-east-1b, and us-east-1c mount the same EFS file system through a security group.)

Amazon EFS – Elastic File System (cont.)
Use cases: content management, web serving, data sharing, WordPress. Uses the NFSv4.1 protocol. Uses a security group to control access to EFS. Compatible with Linux-based AMIs (not Windows). Encryption at rest using KMS. POSIX file system (~Linux) that has a standard file API. The file system scales automatically, pay-per-use, no capacity planning!
EFS – Performance & Storage Classes
EFS scale: thousands of concurrent NFS clients, 10 GB+/s throughput; grows to a petabyte-scale network file system automatically.
Performance Mode (set at EFS creation time): General Purpose (default) – latency-sensitive use cases (web server, CMS, etc.); Max I/O – higher latency and throughput, highly parallel (big data, media processing).
Throughput Mode: Bursting – 1 TB = 50 MiB/s plus bursts of up to 100 MiB/s; Provisioned – set your throughput regardless of storage size, e.g., 1 GiB/s for 1 TB of storage; Elastic – automatically scales throughput up or down based on your workloads (up to 3 GiB/s for reads and 1 GiB/s for writes), used for unpredictable workloads.

EFS – Storage Classes
Storage tiers (a lifecycle management feature – move files after N days): Standard – for frequently accessed files; Infrequent Access (EFS-IA) – cost to retrieve files, lower price to store; Archive – rarely accessed data (a few times each year), 50% cheaper. Implement lifecycle policies to move files between storage tiers (for example, move a file to EFS-IA after 60 days of no access).
Availability and durability: Standard – multi-AZ, great for production; One Zone – one AZ, great for dev, backup enabled by default, compatible with IA (EFS One Zone-IA). Over 90% in cost savings.

EBS vs EFS – Elastic Block Storage
EBS volumes attach to one instance (except multi-attach io1/io2) and are locked at the Availability Zone (AZ) level. gp2: IO increases if the disk size increases; gp3 and io1: can increase IO independently. To migrate an EBS volume across AZs, take a snapshot and restore the snapshot in another AZ. EBS backups use IO, and you shouldn't run them while your application is