AWS Cloud Services PDF
TUM
Viktor Leis
Summary
This document is a case study of Amazon Web Services (AWS) from the Cloud Information Systems lecture by Prof. Dr. Viktor Leis. It covers the history of the platform, EC2 instance types, regions and networking, pricing models, outages and SLAs, Lambda, Fargate, storage services, and hardware trends.
Full Transcript
Cloud Information Systems
Prof. Dr. Viktor Leis
Chair for Decentralized Information Systems and Data Management

Case Study: Amazon Web Services

Modern Cloud Era
  Amazon: extreme growth led to sophisticated in-house IT infrastructure
  Why not sell internal IT services?
  Amazon Web Services (AWS) launched EC2 and S3 in 2006
  this triggered the start of the modern era of cloud computing
  Microsoft Azure followed in 2010, Google Cloud in 2012

Hyperscalers Market Share
  source: https://www.srgresearch.com/articles/cloud-spending-growth-rate-slows-but-q4-still-up-by-10-billion-from-2021-microsoft-gains-market-share

Market Share
  source: https://commission.europa.eu/document/download/ec1409c1-d4b4-4882-8bdd-3519f86bbb92_en

AWS Deep Dive
  we’ll take a detailed look at
    the available AWS services
    their pricing models
  why do we need to do that?
    cloud costs are often a big part of the cost of software products
    development, administration can be amortized and/or automated
    cloud costs are effectively marginal costs
    the offerings of public cloud vendors have now effectively become the hardware landscape
    seemingly minor changes can have very large cost effects
    cloud can be much cheaper or much more expensive than on-premise IT
    pricing structures and costs have implications for the best software architecture
  useful to
    use cloud effectively
    learn how to price cloud services

Case Study: Amazon Web Services
EC2

Amazon Elastic Compute Cloud (EC2)
  large and ever-growing set of heterogeneous virtual machines (“instances”)
    general purpose: m7g, mac, m6g, m6i, m6in, m6a, m5, m5n, m5zn, m5a, m4, a1, t4g, t3, t3a, t2
    compute optimized: c8g, c7g, c7gn, c7a, c7i, c6i, c6in, c6a, c6g, c6gn, c5, c5n, c5a, c4
    memory optimized: r8g, r7g, r7iz, r6g, r6i, r6in, r6a, r5, r5n, r5b, r5a, r4, x2gd, x2idn, x2iedn, x2iezn, x1, x1e, high memory, z1d
    accelerated computing: p4, p3, p2, dl1, trn1, inf2, inf1, g5, g5g, g4dn, g4ad, g3, f1, vt1
    storage optimized: im4gn, is4gen, i4i, i3, i3en, d2, d3, d3en, h1
    HPC optimized: hpc6id, hpc6a
  in the past: mainly 2-socket Intel servers
  now: also AMD and custom ARM chips (“Graviton”)
  https://aws.amazon.com/ec2/instance-types/
  https://instances.vantage.sh/

Selection Of Mainstream Instance Families With Intel CPUs (Outdated)
  instance       vCPUs         DRAM           instance SSD  network BW.    cost ↑
  API name       #      /$     GB     /$      TB     /$     Gbit/s  /$     $/h
  c5n.18xlarge   72     9.3    192    49.4    -      -      100     25.7   3.89
  c5.24xlarge    96     11.8   192    47.1    -      -      25      6.1    4.08
  c5d.24xlarge   96     10.4   192    41.7    4×0.9  0.78   25      5.4    4.61
  m5.24xlarge    96     10.4   384    83.3    -      -      25      5.4    4.61
  i3.16xlarge    64     6.4    488    97.8    8×1.9  3.04   25      5.0    4.99
  m5d.24xlarge   96     8.8    384    70.8    4×0.9  0.66   25      4.6    5.42
  m5n.24xlarge   96     8.4    384    67.2    -      -      100     17.5   5.71
  r5.24xlarge    96     7.9    768    127.0   -      -      25      4.1    6.05
  m5dn.24xlarge  96     7.4    384    58.8    4×0.9  0.55   100     15.3   6.53
  r5d.24xlarge   96     6.9    768    111.1   4×0.9  0.52   25      3.6    6.91
  r5n.24xlarge   96     6.7    768    107.4   -      -      100     14.0   7.15
  r5dn.24xlarge  96     6.0    768    95.8    4×0.9  0.45   100     12.5   8.02
  i3en.24xlarge  96     4.4    768    70.8    8×7.5  5.53   100     9.2    10.85
  x1e.32xlarge   128    2.4    3,904  146.3   2×1.9  0.14   25      0.9    26.69

Instance Slices
  physical servers are sliced into smaller virtual machines
  c5n.metal: same as c5n.18xlarge but without virtualization
  Intel and AMD (usually): 2 vCPUs = 2 hyperthreads = 1 core
  on ARM/Graviton: vCPU = core
  cost usually scales linearly with resources
  Google Cloud offers fine-granular RAM/CPU configuration

Burst Bandwidth: What Does “Up To 25 Gbit” Mean?
  on smaller slices, bandwidth unused by neighbors is given to others on the same machine (for a time)
  in steady state, bandwidth is forced to baseline bandwidth (roughly the paid-for fraction): https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/compute-optimized-instances.html
    c5n.4xlarge  14.3
    c5n.2xlarge  9.3
    c5n.xlarge   4.7

Changing Instances
  EC2 does not support live migration
  to change an instance (e.g., increase its size), it has to be stopped
  to avoid service disruption: load balancer in front of compute nodes
  Google Cloud and Azure support live migration

Case Study: Amazon Web Services
Regions, AZs, Networking

Regions
  27 physical locations of data centers, e.g., eu-central-1 = Frankfurt
  prices vary: eu-central-1 roughly 15% more expensive than us-east-1

Region Sizes
  https://github.com/patmyron/cloud/#compute--memory-unit-prices-by-virtual-machine-type

Network Latency
  median round-trip latency from eu-central-1 to
    eu-south-1 (Milan): 12ms
    eu-west-1 (Ireland): 27ms
    us-east-1 (N. Virginia): 93ms
    us-west-1 (N. California): 153ms
    ap-southeast-1 (Singapore): 209ms
  for comparison: 10,000km at speed of light (round trip) is 60ms
  https://www.cloudping.co/grid

Availability Zones (AZ)
  AZ ≈ data center, network latency within AZ can be as low as 0.03ms
  regions usually consist of several AZs that are geographically close, but not directly next to each other, e.g., eu-central-1a, eu-central-1b, and eu-central-1c
  [figure: measured latencies between euc1-az2 (eu-central-1a), euc1-az3 (eu-central-1b), and euc1-az1 (eu-central-1c): 0.835ms, 0.845ms, 0.900ms, 0.878ms, 0.666ms, 0.673ms]
  https://www.xkyle.com/Measuring-AWS-Region-and-AZ-Latency/

Internet Traffic Cost
  inbound data transfer is free
  outbound: $0.05-$0.09 / GB
  rate depends on transfer volume in each monthly billing period
  since March 2024, outbound traffic is free when closing account: https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-internet-when-moving-out-of-aws/
    Microsoft Azure, Google Cloud dropped this first
    probably triggered by European Data Act

Intra-AWS Transfer Cost
  free within the same AZ (using private, not public IP!)
  not free across regions: $0.01-$0.02 / GB outbound
    e.g., transferring 1TB from Ohio to Tokyo costs $0.02/GB * 1000GB = $20
  not free in same region across AZs: $0.01/GB in each direction
    transferring 1TB between AZs in same region costs $0.01/GB * 1000GB * 2 = $20
  in the same region, it may be cheaper to transfer via S3: http://databasearchitects.blogspot.com/2022/04/cloud-network-traffic-within-same.html
  Microsoft recently abolished this cost: https://azure.microsoft.com/en-us/updates/update-on-interavailability-zone-data-transfer-pricing/

Placement Groups
  generally one has little control over which physical machine one gets and in which rack it resides
  racks have correlated failures (e.g., ToR switch, power supply)
  placement groups offer some control:
    cluster: close together (same rack)
    spread: different racks
    partition: groups (control which instances are same/different racks)

Region/AZ Choice: Considerations
  legal reasons
  network bandwidth cost
  network speed
  fault tolerance
  availability

Case Study: Amazon Web Services
Pricing Models

EC2 Pricing Models
  EC2 offers several pricing models:
    on demand: per-second pricing
    spot: variable costs, interruption possible
    reserved: reservation for 1 or 3 years
    reserved instance market: buy/sell reserved instances with remaining contract duration
    savings plan: commit to certain yearly spending, get discount

On Demand
  most flexible and most expensive option
  per-second pricing
  minimum duration: 1 minute
  no guarantee that a particular instance is actually available

Reserved Instance, Reserved Instance Market
  reserve instance for long duration
  1 or 3 years
  50-70% savings
  (partial) upfront or monthly payment
  reserved instance market: seller pays service fee of 12% on the price

Reserved Instance Market Example (c5.24xlarge on demand: $4.608)

Savings Plans
  commit to certain spending over 1 or 3 year period (e.g., $427,200/year)
  commitment is then broken down to hourly spending (e.g., $50/hour)
  if hourly spending is below that amount, get discount
  if hourly spending is above, no discount above committed amount
  [figure: spending over time, showing on demand rates, hourly commitment, and unused commitment]

Savings Plans Variants
  two variants:
    Compute Savings Plans: can be used for any EC2 instance
    EC2 Instance Savings Plans: commit to certain instance family (e.g., m5.* or c5.*) in a certain region
  example 1-year, c5n.24xlarge: 28% vs. 37%
  example 3-year, c5n.24xlarge: 58% vs. 64%

Spot
  on demand pricing may lead to many instances being idle
  idea: sell unused capacity at low prices to flexible customers
  often 60-70% discount
  when creating a spot request, one can set a limit price (below the on demand price)
  price and availability vary across AZs
  spot instances used to be an actual market with very large price fluctuations, now it is algorithmically driven

Spot Example

Spot Instance Interruption
  spot instances may be forcefully stopped at any point
  but can often run for weeks without interruption
  interruption notice 2 minutes before forced stop
  AWS also publishes interruption frequencies of last 30 days: https://aws.amazon.com/ec2/spot/instance-advisor/

EC2 Pricing Models: Summary
  best strategy is probably a mix of models
  example:
    3-year reserved instances for base load and for load that is mission-critical
    spot instances for large jobs that are not latency critical
    on demand instances to fill unexpected gaps
  my rule of thumb: on average one should pay half the on demand cost

Why Are We Talking About Cost So Much?
  cloud-native systems could and should automatically pick instances and pricing models
  currently this is generally done manually
  better: software automatically picks the best suitable instance, taking current (spot) prices and discounts into account
  this is an open research question

Case Study: Amazon Web Services
Outages and SLAs

Things Can Go Wrong...
  https://thestack.technology/ovhcloud-fire-strasbourg/
  AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures: https://www.youtube.com/watch?v=swQbA4zub20

Service Level Agreements (SLAs)
  SLAs are a way to commit to quality standards for a service, e.g., availability, performance, durability
  example: 95% uptime per month (availability)
  SLAs are a way to communicate to service users what they can expect
  through monetary penalties, SLAs align the incentives of the service vendor and the user

EC2 SLAs
  despite server-grade hardware, individual instances fail occasionally
  Instance-Level SLA: refund if monthly instance availability is below 99.5%:
    below 95% (more than 36.0 hours/month downtime): 100% refund
    below 99% (more than 7.2 hours/month downtime): 30% refund
    below 99.5% (more than 3.6 hours/month downtime): 10% refund
  Region-Level SLA: refund if availability of two instances in two different AZs is below 99.99%:
    below 95% (more than 36.0 hours/month downtime): 100% refund
    below 99% (more than 7.2 hours/month downtime): 30% refund
    below 99.99% (more than 4.3 minutes/month downtime): 10% refund

AWS Outage History
  hyperscalers occasionally have major outages
  but probably fewer than most organizations running their own data centers
  example: S3 downtime in us-east-1 on February 28th, 2017 from 9:37AM PST to 1:54PM PST (https://aws.amazon.com/message/41926/)
  because S3 is used by many other services (including EC2), this has been called one of the biggest outages in cloud computing
  AWS Post-Event Summaries: https://aws.amazon.com/premiumsupport/technology/pes/

Case Study: Amazon Web Services
Burstable Instances

Burstable Instances
  many systems and applications have low average CPU utilization but occasional workload spikes
  burstable instances may be more economical in such cases
  https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances.html

Burstable Instances Implementation
  CPU oversubscription: co-locate many VMs on one physical server to smooth over spikes
  can be implemented easily through hypervisor: vCPUs of started VMs > hardware CPUs
  a bit like airline seat overbooking
  RAM, which is a significant contributor to cost, could also be oversubscribed, but generally is not
  available burstable instance families: t4g (ARM), t3 (Intel), t3a (AMD), t2 (Intel)

Burstable Instances Pricing Example
  example: t4g.2xlarge, 32 GB, 8 vCPUs, 40.00% base load, $0.27/h
  base load: 8 vCPUs * 40% = 3.2 vCPUs
  over 24-hour period, accrue credits if below base load
  if above base load, spend accrued credits first, then
    standard mode: slow down to base load
    unlimited mode: additional charges ($0.04 per vCPU-hour)

Steady-State Comparison (Graviton2, 32 GB RAM)
  t4g.2xlarge: 8 vCPUs (40.00% base load), $0.27/h
  m6g.2xlarge: 8 vCPUs, $0.31/h
  r6g.xlarge: 4 vCPUs, $0.20/h

Steady-State Comparison (Graviton2, 4 GB RAM)
  t4g.medium: 2 vCPUs (20.00% base load), $0.033/h
  c6g.large: 2 vCPUs, $0.068/h
  m6g.medium: 1 vCPU, $0.0385/h

Burstable Instance Analysis
  idea fits the spirit of cloud computing nicely
    save resources by smoothing load across different tenants
    hyperscalers can sell fraction of a CPU core, increase average CPU utilization
  for bursty workloads, practical savings in EC2
    can be substantial for small instances (2× for t4g.small vs. c6g.medium)
    limited for larger slices (only 15% for t4g.2xlarge vs. m6g.2xlarge)
  due to oversubscription, performance is less predictable

Hetzner (Per Hour Including 19% VAT)
  vCPUs  RAM  price (Shared vCPU, Dedicated vCPU, or Dedicated Server)  Setup Fee
  2      8    €0.0238
  4      8    €0.0261
  4      16   €0.0466
  8      16   €0.0496
  8      32   €0.0925
  8      64   €0.0877                                                   €46.41
  16     32   €0.1047
  48     192  €0.5501
  48     256  €0.3795                                                   €94.01
  Shared/Dedicated vCPU: hourly
  Dedicated Server: monthly + one-time setup fee
  outbound traffic: 0.001€/GB (50× cheaper than AWS!)

Case Study: Amazon Web Services
Lambda

Lambda: Function as a Service
  Function as a Service
  code is automatically assigned to compute hardware
  automatic scaling
  intended for short invocations, not suitable for long-running services
  often called “serverless”
  works best for stateless tasks
  complex jobs: break into a graph of stateless computations
  large state is often passed through some storage service (e.g., S3)

Lambda Pricing
  duration cost: price depends on main memory (RAM) capacity
    between $0.0000106667 and $0.0000166667 per GB-second
    from 128 MB to 10,240 MB in 1 MB steps
    CPU is proportional to main memory (1,769 MB = 1 vCPU)
    time is measured in millisecond granularity
    savings plans available, but only up to 17% discount
  request cost: $0.20 per 1M requests
    request cost equals about 10 CPU ms
  https://aws.amazon.com/lambda/pricing/

Limitations
  maximum duration of 15 minutes
  maximum of 10,240MB and 5.8 vCPUs
  no network communication with other Lambdas
  Serverless Computing: One Step Forward, Two Steps Back: https://www.cidrdb.org/cidr2019/papers/p119-hellerstein-cidr19.pdf

Lambda vs. EC2 On Demand Cost Example (Graviton)
  c6g.large: 4 GB RAM, 2 vCPUs
  Lambda: 4 GB RAM, 2.32 vCPUs
  for long durations: Lambda costs 2.26× more

How Is Lambda Implemented?
  originally: containers per function and EC2 VMs per customer
  now: workers use lightweight virtualization
    KVM + special virtual machine monitor (Firecracker instead of QEMU)
    guest OS: minimal Linux
    5 MB memory overhead per function, boots in less than 125ms
  scale-out frontend “sticky-routes” requests to compute workers
  small pool of pre-warmed VMs to reduce startup time
  https://www.usenix.org/system/files/nsdi20-paper-agache.pdf

Lambda (vs. EC2)
  + automatic elasticity and scalability, no server management/administration
  + low-latency startup
  + fine-grained pricing
  + available in very small sizes (low cost for small jobs)
  − more than 2× higher cost for long-running jobs (Lambda runs on EC2)
  − little hardware customization
  − limitations (duration, size, networking)

Case Study: Amazon Web Services
Fargate

Fargate
  container-based compute service
  alternative to EC2 for Elastic Container Service (ECS) or Elastic Kubernetes Service (EKS)
  user assigns vCPU and RAM constraints (e.g., 1 vCPU, 5 GB RAM)
  0.25 and 0.5 vCPUs are available
  compute price (ARM: $0.03238 per vCPU per hour) plus RAM price (ARM: $0.00356 per GB per hour)
  minimum duration: 1 minute
  spot and savings plan available
  https://aws.amazon.com/fargate/pricing/

Fargate On Demand Cost Example (Graviton)
  c6g.large: 4 GB RAM, 2 vCPUs
  Lambda: 4 GB RAM, 2.32 vCPUs
  Fargate: 4 GB RAM, 2 vCPUs
  Fargate 12% more expensive than EC2

Fargate vs. EC2
  + no manual instance management
  + finer CPU/RAM granularity than EC2, coarser than Lambda
  + CPU and RAM can be configured separately
  − slightly more expensive than EC2
  − requires ECS/EKS
  − fewer hardware options

Compute vs. Space Tradeoff
  often there are different ways of implementing the same functionality:
    algorithm  vCPU [hours]  memory [GB]
    A          1.0           10
    B          3.0           1
    C          1.5           2
  cloud prices suggest a unique solution:
    Fargate prices: $0.03238 per vCPU per hour, $0.00356 per GB per hour
    in other words: 1 CPU core is worth about 10 GB of RAM
    A: $0.07, B: $0.11, C: $0.06

Compute vs. Space Tradeoff (2)
  assuming perfect scalability:

Case Study: Amazon Web Services
Storage

Instance Storage
  examples:
    c5d.24xlarge: 4 * 900 GB NVMe SSD
    i3en.24xlarge: 8 * 7500 GB NVMe SSD
    d3en.12xlarge: 24 * 13980 GB HDD
  capacity and I/O throughput are usually scaled down proportionally for smaller instance slices
  instance storage not really persistent
  durability ≈ EC2 availability
  mainly useful for transient data or as a cache

Amazon Elastic Block Store (EBS)
  virtual disk
  block device, not a file system
  used as root volume
  additional volumes can be added
  maximum bandwidth also depends on instance
  pricing model: for storage and for provisioned (guaranteed) throughput
  EBS volume can usually only be accessed by one VM at any point in time
  but can attach multiple volumes to one instance
  EBS uses its own network (SAN)
  replicated across multiple servers in one AZ

EBS General Purpose Storage: gp3
  features:
    volume size: 1 GB - 16 TB
    max I/O operations per second (IOPS) per volume (instance): 16,000 (260,000)
    max throughput per volume (instance): 1 GB/s (10 GB/s)
    99.8% - 99.9% durability (= 0.1% - 0.2% annual volume failure rate)
  pricing:
    storage cost: $80/TB-month
    3,000 IOPS free and $0.005/provisioned IOPS-month over 3,000
    throughput: 125 MB/s free and $0.040/provisioned MB/s-month over 125

EBS Variants
  io1, io2:
    higher storage cost ($125/TB/month)
    higher IOPS cost
    higher performance
    higher durability (up to 99.999%)
  st1, sc1:
    disk-based
    lower storage cost ($45/TB/month, $15/TB/month)
    lower bandwidth
    high latency, low IOPS (500, 250)

Simple Storage Service (S3)
  foundational service, maybe most important after EC2
  redundant storage across AZs in one region
  terminology:
    “object” = file
    “prefix” = file path
    “bucket” = named collection of files in a region
  bucket lives in region, not AZ
  http(s) API: GET, PUT, LIST, DELETE
  buckets can be internet public or private

S3 Standard
  storage: $21-25 / TB / month
  requests:
    PUT, LIST: $5/million
    GET: $0.4/million
    DELETE is free
  within region no additional charges for request bandwidth

S3 Standard Durability and Availability SLAs
  99.999999999% (eleven 9’s) yearly durability per object
  with 1 million objects, the chance of some data loss over 10 years is 0.01%
  should be enough for most applications (except with billions of objects)
  survives loss of one AZ
  99.99% availability (=52 minutes downtime per year), refund if below

S3 Consistency
  until 2020, S3 used to have weak consistency
  one could read outdated data
  now the guarantee is stronger: “After a successful write of a new object, or an overwrite or delete of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with any changes reflected.”
  https://aws.amazon.com/s3/consistency/
  conditional writes are another recent feature: https://aws.amazon.com/about-aws/whats-new/2024/08/amazon-s3-conditional-writes/

S3 Variants
  Variant                       $/TB/month
  Standard                      21.00
  Standard - Infrequent Access  12.50
  One Zone - Infrequent Access  10.00
  One Zone - Express            160.00
  Glacier Instant Retrieval     4.00
  Glacier Flexible Retrieval    3.60
  Glacier Deep Archive          0.99
  Intelligent - Tiering         4-23.00
  variants also differ in access time, access cost, and durability

Other Storage Alternatives
  keep everything in main memory
  Amazon Elastic File System (EFS): network file system
  DynamoDB: distributed key/value store
  Relational Database Service (RDS): relational OLTP database system

Storage Summary
  many options with different tradeoffs: interface, latency, bandwidth, durability, availability, storage cost, per-operation cost

Case Study: Amazon Web Services
Other Services

AWS Marketplace
  marketplace for buying and selling third-party software and SaaS services through AWS
  customers can procure products through self-service or individually negotiate prices (private offerings)
  AWS takes fee on sales (3% of transaction value for SaaS)
  buyer’s perspective
    consolidate third-party spending on AWS bill
    sellers vetted by AWS
  seller’s perspective
    access to existing AWS customers
    billing process managed by AWS
  https://docs.aws.amazon.com/marketplace/latest/buyerguide/what-is-marketplace.html

AWS Marketplace: Subscription-based Pricing
  seller can define one or more pricing dimensions (e.g., data ingested in GB, number of requests, provisioned throughput in Gbits) and a charge per unit for each of them (e.g., $1.5 per GB)
  no upfront cost, customers are only billed for usage
  service with configured pricing appears on AWS Marketplace website
  technical details
    application is hosted in the seller’s AWS environment
    (un)subscribe events are captured via SNS notifications
    seller is responsible for metering and reporting resource consumption via the AWS Marketplace Metering API every hour
  https://docs.aws.amazon.com/marketplace/latest/userguide/saas-subscriptions.html

Over 200 Additional Services
  user management: Identity and Access Management (IAM)
  logging, monitoring, alarms: CloudWatch
  Domain Name System (DNS): Route 53
  private networking: Virtual Private Cloud (VPC)
  content delivery network (CDN): CloudFront
  queues: Simple Queue Service (SQS)
  publish/subscribe: Simple Notification Service (SNS)
  Lambda workflows: Step Functions
  machine learning: SageMaker
  PaaS: Beanstalk
  Big Data batch processing: EMR (Spark/MapReduce)
  ...
  we will discuss orchestration (ECS and EKS), databases, and some others

Case Study: Amazon Web Services
EC2 Trends

EC2 Hardware Landscape
  EC2 on demand prices rarely change
  new instances appear regularly
  older instances are deprecated

EC2 CPU Cost Evolution (Intel)
  [figure: CPU Core [h/$] by instance family (c5, c5d, c5n, m5n, r5, i3, x1e), 2016 to 2021]

EC2 CPU Cost Evolution

EC2 DRAM Cost Evolution
  [figure: RAM capacity [GB/h/$] by instance family (x1e, r5, i3, m5n, c5n, c5, c5d), 2016 to 2021]

EC2 I/O Bandwidth Cost Evolution
  [figure: I/O bandwidth [TB/$] by instance family (i3, c5d, x1e), 2016 to 2021; annotation: i3 large NVMe arrays introduced]

EC2 Storage Capacity Cost Evolution

EC2 Network Bandwidth Cost Evolution
  [figure: netw. bandwidth [TB/$] by instance family (c5n, m5n, c5, c5d, i3, r5, x1e), 2016 to 2021; annotation: c5n 100Gbit networking introduced]

Case Study: Amazon Web Services
Specialized Hardware and Accelerators

CPU Stagnation
  so far we looked at boring hardware: multi-core CPU, RAM, disk/SSD, Ethernet
  Moore’s law is dead: CPUs and DRAM are stagnating
  underlying reason: cost per transistor stagnates
  specialized hardware (“accelerators”) becomes more attractive
  the cloud makes trying/using specialized hardware easier
  The Decline of Computers as a General Purpose Technology, CACM 2021: https://cacm.acm.org/magazines/2021/3/250710-the-decline-of-computers-as-a-general-purpose-technology/fulltext?mobile=false

Semiconductor Fabrication Trends

CPU Stagnation

AMD CPU Cost Stagnation
  http://databasearchitects.blogspot.com/2023/04/the-great-cpu-stagnation.html

When Does Specialized Hardware Pay Off?

Custom AWS Hardware
  hyperscalers have custom hardware that cannot be bought
  AWS Nitro Cards for VPC, EBS, Instance Storage, Security Chip
  Nitro reduces virtualization overhead, hardware effectively virtualizes itself
  also: custom Ethernet NICs and switches

Machine Learning

Machine Learning Training FLOPs
  Compute Trends Across Three Eras of Machine Learning, 2022: https://arxiv.org/pdf/2202.05924.pdf
  https://ml-progress.com/visualization

Graphics Processing Unit (GPU)
  originally designed for graphical games
  happened to work well for machine learning
  two orders of magnitude more floating point operations per second (FLOPS) than CPUs
  one order of magnitude more memory bandwidth (but smaller capacity)
  instances with GPUs: g2, g3, g3s, g4ad, g4dn, g5, g5g, p2, p3, p3dn, p4d, p4de
  p4de.24xlarge: 8x NVIDIA A100 (≈ 600 16-bit TFLOPs), 640 GB GPU memory
  ML training is extremely compute intensive (GPU cluster may run for months)
  it may be cost effective to move to another cloud just for training: https://vast.ai/

Machine Learning Accelerators
  AWS Trainium accelerator
    trn1: 3 PFLOPS, 512 GB high bandwidth memory with 9.8 TB/s bandwidth
  AWS Inferentia2 accelerator
    inf2: 2.3 PFLOPS, 384 GB high bandwidth memory with 9.8 TB/s bandwidth
  Google Tensor Processing Unit (TPU): https://arxiv.org/abs/2304.01433

Field Programmable Gate Array (FPGA)
  sometimes CPUs, GPUs, and ML accelerators are not enough
  Application-Specific Integrated Circuit (ASIC) would be an option, but very high upfront cost
  FPGAs are programmable hardware
  good for pipeline parallelism
  also useful as prototyping platform for ASIC
  EC2 instances with FPGAs: f1.2xlarge, f1.4xlarge, f1.16xlarge: 1/4/8 Xilinx UltraScale+ VU9P FPGAs

FPGA Example for Analytical Query Processing
  https://aws.amazon.com/blogs/aws/new-aqua-advanced-query-accelerator-for-amazon-redshift/

Summary
  basic infrastructure services: EC2, EBS, S3
  EC2 has sophisticated discount models with implications on system architecture
  compute, storage, network can be cost bottleneck
  it is crucial to understand pricing structure (not just performance)
  otherwise running in cloud can be very expensive
  cloud simplifies running on specialized instance configurations
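
As a way to practice pricing cloud services, the following minimal sketch reproduces a few of the back-of-the-envelope numbers from the slides: the Fargate cost of algorithms A, B, and C from the compute vs. space tradeoff, the cost of moving 1TB between AZs, and the downtime that corresponds to an availability level. It is not part of the lecture. The function names are hypothetical, and two assumptions are made because they match the quoted figures: each job runs on a single vCPU (so it holds its memory for as many hours as it needs vCPU hours), and a month has 720 hours.

```python
# Illustrative sketch only: function names are hypothetical,
# prices are the ones quoted in the slides.

FARGATE_VCPU_HOUR = 0.03238  # $ per vCPU per hour (ARM)
FARGATE_GB_HOUR = 0.00356    # $ per GB of RAM per hour (ARM)
HOURS_PER_MONTH = 720        # assumption: 30-day month


def fargate_job_cost(vcpu_hours, memory_gb):
    """Cost of a job on Fargate.

    Assumption: the job runs on one vCPU, so it occupies its memory
    for vcpu_hours hours of wall-clock time.
    """
    compute = vcpu_hours * FARGATE_VCPU_HOUR
    memory = memory_gb * vcpu_hours * FARGATE_GB_HOUR
    return compute + memory


def cross_az_transfer_cost(gigabytes, price_per_gb=0.01):
    """Traffic between AZs of one region is charged in each direction."""
    return gigabytes * price_per_gb * 2


def allowed_downtime_hours(availability, hours=HOURS_PER_MONTH):
    """Downtime budget that still meets the given availability level."""
    return (1.0 - availability) * hours


# (vCPU hours, memory in GB) for the three algorithms
algorithms = {"A": (1.0, 10), "B": (3.0, 1), "C": (1.5, 2)}
costs = {name: fargate_job_cost(h, gb) for name, (h, gb) in algorithms.items()}
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")  # A: $0.07, B: $0.11, C: $0.06
print("cheapest:", cheapest)  # C
print(f"1TB across AZs: ${cross_az_transfer_cost(1000):.2f}")  # $20.00
print(f"99.99% per month: {allowed_downtime_hours(0.9999) * 60:.1f} minutes")  # 4.3
```

With these prices, algorithm C is the cheapest even though it is neither the fastest nor the most memory-efficient, which is the point of the tradeoff slide. Other prices (for example x86 Fargate or a different region) can be tried by changing the two constants.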