Cloud Information Systems: State Management and S3 Performance - PDF

Document Details


Prof. Dr. Viktor Leis

Tags

cloud computing, state management, S3 performance, database systems

Summary

This document provides a technical overview of Cloud Information Systems with a focus on state management. The material covers benchmarking, S3 performance characteristics (latency, bandwidth, and cost), and database systems in cloud environments, including OLTP and OLAP systems as well as the implementation of object stores. The content is likely useful for undergraduate CS students.

Full Transcript

Cloud Information Systems
Prof. Dr. Viktor Leis
Chair for Decentralized Information Systems and Data Management

State Management

State Management Is Hard
- the cloud promises easy elasticity and scalability
- FaaS (e.g., AWS Lambda) comes quite close to achieving these goals for stateless computation
- load balancers and web servers can also quite easily be scaled out – both are effectively stateless as well
- however, beneath these scalable components, there are almost always stateful storage services or database systems
- these often become the synchronization points and scaling bottlenecks
- scaling data management systems is hard: large data volumes, high update rates, strict durability requirements
- database systems are among the most challenging cloud systems and are often key components of other services (e.g., OLTP systems are often used to manage control plane state)

Benchmarking
- to use an existing service successfully, one needs to understand (for S3):
  - functionality (highly-durable object store, GET/PUT HTTP(S) API)
  - cost ($21/TB/month, $0.4/M GET, $5/M PUT)
  - performance characteristics (?)
- the documentation usually describes the first two points, but not performance (rarely covered by SLAs)
- it is very hard to design (cost-)efficient software architectures without knowing basic performance properties
- performance characteristics determine whether and how to use a service – this requires benchmarking
- experiments from: Exploiting Cloud Object Storage for High-Performance Analytics, Durner et al., VLDB 2023

S3 Performance

S3 Latency
[Figure: S3 request latency measurements]

(Inverse) Request Latency Over Time (16MB objects, eu-central-1, c5n.large)
- indirectly measures overall utilization
- workdays and weekends are visible
- unnatural upper bound: artificial throttling?

S3 Bandwidth (c5n.18xlarge)
- with many parallel requests, one can almost saturate the network bandwidth (a sketch of issuing such parallel requests follows below)

How Many Requests Do We Need?
- seek latency: 30ms, scan speed: 50MB/s
- with 16MB requests, to achieve 80Gbit/s (= 10GB/s), we need a request arrival rate of λ = 10GB/s ÷ 16MB = 640/s
- time in system for a 16MB request: W = 30ms + 16MB ÷ 50MB/s = 350ms = 0.35s
- Little's law gives the number of requests in the system: L = λ × W = 640/s × 0.35s = 224
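
A quick way to sanity-check the arithmetic on the last slide: a minimal Python sketch using only the constants given above.

```python
# Little's law: L = lambda * W
# How many concurrent 16MB S3 requests sustain 10 GB/s (80 Gbit/s)?

target_bandwidth = 10_000      # MB/s
request_size = 16              # MB
seek_latency = 0.030           # s   (first-byte latency per request)
scan_speed = 50                # MB/s (per-request transfer rate)

arrival_rate = target_bandwidth / request_size             # lambda = 640/s
time_in_system = seek_latency + request_size / scan_speed  # W = 0.35s
in_flight = arrival_rate * time_in_system                  # L = 224

print(f"lambda = {arrival_rate:.0f}/s, W = {time_in_system:.2f}s, "
      f"L = {in_flight:.0f} concurrent requests")
```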
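
To actually keep roughly that many requests in flight, an application issues range GETs concurrently. The following is a hedged sketch using boto3 and a thread pool; the bucket and key names are hypothetical, and error handling is omitted.

```python
# Sketch: drive S3 bandwidth with hundreds of concurrent 16MB range reads.
# Assumes boto3 is installed and AWS credentials are configured.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "big-object"   # hypothetical names
CHUNK = 16 * 1024 * 1024                  # 16MB requests, as in the experiment

def read_range(offset: int) -> bytes:
    # HTTP range request for bytes [offset, offset+CHUNK); S3 clamps the
    # final range to the object size.
    resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                         Range=f"bytes={offset}-{offset + CHUNK - 1}")
    return resp["Body"].read()

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
# ~224 requests in flight at any time, per Little's law above
with ThreadPoolExecutor(max_workers=224) as pool:
    for chunk in pool.map(read_range, range(0, size, CHUNK)):
        pass  # consume the data
```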

S3 GET Cost vs. c5n.18xlarge Cost
- with request sizes above 16MB, the EC2 cost dominates the S3 GET cost

S3 Performance Summary
- it is best to think of each access as having the latency (over 10ms) and the bandwidth (around 50MB/s) of one disk, except that one has thousands of disks available
- one can achieve very high bandwidth by scheduling hundreds of requests at any point in time
- request cost is quite high for small objects, but disappears in comparison with EC2 cost for large objects
- other vendors' object stores are quite similar to S3 in terms of cost and performance

Build Your Own S3

How is S3 implemented?
- load balancers, API servers, metadata storage, and object storage scale separately
- additionally, there is asynchronous/background storage management
- data is partitioned by object (not by customer)
[Figure: clients send GET/PUT requests over HTTP to load balancers, which forward them to API servers backed by metadata storage and object storage]

S3 In The Real World (2023)
- https://www.usenix.org/conference/fast23/presentation/warfield
- implemented using hundreds of internal microservices
- a single large customer stores 600PB = 600,000TB
- overall: 280 trillion objects and 100M requests per second
- at $0.4/M GETs, 100M GET/s alone would be worth about $1.2B/year (sanity-checked in the sketch below)

Is it possible to build a cost-competitive object store using EC2?
- S3 latencies and prices imply disk (not SSD) storage
- the cheapest disk instance in terms of GB/$ is d3en.12xlarge: 24 × 13,980GB = 335TB of HDD
- $6.31/h on demand; let's assume a 50% discount: $3.15/h
- $2,271/month for 335TB = $6.78/TB/month
- S3 (for comparison): $21/TB/month
- we probably need more than 1 copy, but d3en still seems feasible

Requests on d3en
- disks typically have about 100MB/s throughput, so reading a 13,980GB disk takes 38 hours
- 24 × 100MB/s = 2.4GB/s I/O bandwidth; the 75Gbit network link is fast enough to deliver full I/O bandwidth
- disks typically have about 10ms latency: 24 × 1s/10ms = 2,400 IOPS
- that does not seem like a lot, but it is 8.64M requests/hour
- since 1M GETs cost $0.4, these GET requests are worth $3.46/hour
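
A back-of-envelope check of the numbers on the last few slides, with all prices taken from the slides:

```python
# Economics of a d3en.12xlarge-based object store vs. S3.
hours_per_month = 24 * 30

# storage cost per TB/month (50% discount on the $6.31/h on-demand price)
price_per_hour = 6.31 * 0.50              # $3.15/h
capacity_tb = 24 * 13_980 / 1000          # ~335TB of HDD
per_tb_month = price_per_hour * hours_per_month / capacity_tb
print(f"${per_tb_month:.2f}/TB/month vs. $21/TB/month on S3")  # ~$6.8

# value of the GET capacity: 24 disks at 10ms latency = 2,400 IOPS
gets_per_hour = 24 * (1 / 0.010) * 3600   # 8.64M/hour
print(f"GET capacity worth ${gets_per_hour / 1e6 * 0.40:.2f}/hour")  # $3.46

# sanity check of the fleet-wide GET figure from the FAST'23 talk
per_year = 100e6 * 3600 * 24 * 365 / 1e6 * 0.40
print(f"100M GET/s at $0.4/M = ${per_year / 1e9:.2f}B/year")  # ~$1.26B
```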
What happens if objects are accessed frequently?
- if, on average, objects are accessed more frequently than every 38 hours, we have to duplicate them to be able to handle the workload
- in practice, workloads are skewed: most data is accessed very rarely, some objects are accessed very frequently
- it makes sense to cache the hot objects (in RAM, on SSD, or both)
- this could be an additional service in front of the disk storage servers, or one could use the API servers or metadata storage for caching

How To Achieve 11'9s?
- one approach would be to have 3 full copies across 2 or 3 AZs
- this is just barely feasible in our calculation ($6.78 × 3 = $20.34)
- an alternative is erasure coding (illustrated in the sketch below):

  total disks   data disks   erasure coding disks   space overhead   survives failures
       4            3                1                   1.33               1
       6            4                2                   1.50               2
      11            8                3                   1.38               3
      20           16                4                   1.25               4
      20           17                3                   1.18               3

- or: two full copies plus erasure coding
- recovery is probably a continuous background process
- note that S3 guarantees 11'9s durability, but only 3'9s availability
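
The first row of the table (3 data disks plus 1 erasure-coding disk, surviving 1 failure) is plain XOR parity, as in RAID 4/5. A toy illustration in Python, not a production code path:

```python
# Toy erasure coding: 3 data blocks + 1 XOR parity block survive any 1 loss.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"AAAA", b"BBBB", b"CCCC"]   # 3 equally-sized data blocks
parity = xor_blocks(xor_blocks(data[0], data[1]), data[2])

print(f"space overhead = {4 / 3:.2f}")   # 1.33, as in the table

# disk 1 dies: reconstruct its block from the survivors and the parity
recovered = xor_blocks(xor_blocks(data[0], data[2]), parity)
assert recovered == data[1]
```

Surviving more failures requires more general codes (e.g., Reed–Solomon), but the space accounting in the table works the same way: overhead = total disks ÷ data disks.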
Database Systems in the Cloud

DBMS Market: $80B/year
- source: https://web.archive.org/web/20230228074215/https://blogs.gartner.com/merv-adrian/2022/04/16/dbms-market-transformation-2021-the-big-picture/
- if we believe these numbers, 1/3 of all AWS revenue is databases!

OLTP and OLAP
OnLine Transaction Processing (OLTP):
- simple, latency-critical queries
- many inserts/deletes/updates
- heavy use of index structures
- write optimized (row store)
- example: store a new order, find customer 42
- systems: PostgreSQL, SQL Server, Aurora

OnLine Analytical Processing (OLAP):
- mostly reads + batch updates
- large table scans
- read optimized (column store)
- example: find the top 5 customers by revenue over the last 5 years
- systems: Vertica, ClickHouse, AWS Redshift

⇒ different requirements lead to specialized systems (though hybrid systems like SAP HANA exist)
⇒ an Extract, Transform, Load (ETL) process periodically moves data from the operational to the analytical system

OLTP Systems

DynamoDB
- multi-tenant distributed key/value store
- Create, Update, Read, Delete (CRUD) operations
- three types of reads: eventually consistent, strongly consistent, transactional
- transactional functionality is limited: read committed isolation only, limited size, read or write, one-shot
- provisioned capacity and on-demand pricing models
- write request unit: 1 for each write (up to 1KB), 2 for each transactional write
- read request unit: 1 for each strongly consistent read (up to 4KB), 2 for each transactional read, and 0.5 for each eventually consistent read
- on-demand pricing (a worked cost example follows this list):
  - $1.25 per million write request units
  - $0.25 per million read request units
  - $250 per TB/month
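
To make the request-unit arithmetic concrete, a small sketch that prices a hypothetical on-demand workload; the prices come from the slide, while the workload numbers are invented for illustration:

```python
# Hypothetical DynamoDB on-demand monthly bill.
import math

def write_units(item_kb: float, transactional: bool = False) -> int:
    units = math.ceil(item_kb / 1.0)   # 1 WRU per 1KB written
    return units * (2 if transactional else 1)

def read_units(item_kb: float, mode: str = "eventual") -> float:
    units = math.ceil(item_kb / 4.0)   # 1 RRU per 4KB read
    return units * {"eventual": 0.5, "strong": 1, "transactional": 2}[mode]

# invented workload: 50M writes, 500M eventually consistent reads, 1KB items
wru = 50e6 * write_units(1)
rru = 500e6 * read_units(1, "eventual")
storage_tb = 0.2

cost = wru / 1e6 * 1.25 + rru / 1e6 * 0.25 + storage_tb * 250
print(f"${cost:,.2f}/month")   # $62.50 + $62.50 + $50 = $175.00/month
```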
Classic DBMS Design for OLTP
- data is organized as fixed-size pages (e.g., 8KB)
- disk is the primary storage medium for pages
- B-trees are used as indexes to quickly find rows
- for performance, pages from disk are cached in RAM
- inserts/updates/deletes are applied to pages in the cache and logged in a write-ahead log (WAL)
- on commit: force the WAL to disk (but not necessarily the changed pages)
- additionally: asynchronously write the WAL to a log archive and back up database pages
- this is how PostgreSQL, MySQL, SQL Server, ... work (a minimal sketch of the commit protocol follows at the end of this section)

Classic DBMS in the Cloud (Lift and Shift)
- one can run the classical design on a VM in the cloud with instance storage
- however, when the instance fails, data is lost (RAID does not help, since we don't get physical access to the disks)
- we can recover from backup and log archive, but we will have downtime and may lose the most recent changes
- what about: scalability, elasticity, compute/storage disaggregation?

Remote Block Device (RBD)
- to improve durability, use virtual disks (e.g., EBS) instead of instance storage
- if the instance dies, the disk can be attached to a new instance
- better durability, but substantially higher cost than instance storage

Primary/Secondary Design (aka High Availability Disaster Recovery, HADR)
- run two database systems on two identical nodes; one is the primary
- all write transactions go to the primary node
- the primary ships its WAL to the secondary, which applies the entries eagerly
- if the primary fails, one can quickly switch to the secondary
- we improved availability and durability, but at twice the hardware cost
- what about: scalability, elasticity, compute/storage disaggregation?

Amazon Aurora
- dominant cloud-native OLTP system, introduced in 2014
- Aurora disaggregates storage and compute:
  1. compute: primary processing node (plus secondary for availability)
  2. storage: multi-tenant page and logging service
- the primary does not write changed pages to disk; instead, it writes WAL entries to 6 storage nodes (in different AZs); a toy model of this log shipping follows below
- on 3 of these nodes, the log records are replayed, and the primary node can read from one of these 3 nodes
- individual storage servers can go down without affecting the service
- backups and the log archive are on S3
⇒ the storage layer is basically a distributed, multi-tenant, elastic, fault-tolerant disk
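
As promised above, a minimal sketch of the commit protocol of the classic design. This is a deliberately simplified model (single-threaded, toy record format, no checkpointing or recovery), not how any particular system lays out its log:

```python
# Write-ahead logging in miniature: changes are logged before pages change,
# and commit only forces the log to disk, not the dirty pages.
import os

class WAL:
    def __init__(self, path: str):
        self.f = open(path, "ab")

    def append(self, record: str) -> None:
        self.f.write((record + "\n").encode())   # buffered, not yet durable

    def force(self) -> None:
        self.f.flush()
        os.fsync(self.f.fileno())                # durable after this returns

wal = WAL("db.wal")
cache = {}   # page id -> page contents (the buffer pool)

def update(page_id: int, value: str) -> None:
    wal.append(f"UPDATE {page_id} {value}")   # log first ...
    cache[page_id] = value                    # ... then modify the cached page

def commit() -> None:
    wal.append("COMMIT")
    wal.force()   # dirty pages in `cache` are written back lazily, later

update(42, "new order")
commit()
```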
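
Aurora replaces the local WAL write with network writes to the storage service; the Aurora paper describes acknowledging a log write once 4 of the 6 replicas have it. A toy model of that idea (node names, timings, and the quorum handling are invented for illustration):

```python
# Toy Aurora-style log shipping: commit is acknowledged once a write quorum
# (4 of 6) of storage nodes has received the WAL entry.
from concurrent.futures import ThreadPoolExecutor, as_completed
import random, time

NODES = [f"storage-node-{i}" for i in range(6)]   # 6 replicas across 3 AZs
WRITE_QUORUM = 4

def send_log_record(node: str, record: str) -> str:
    time.sleep(random.uniform(0.001, 0.010))   # simulated network + disk
    return node                                # acknowledgment

def replicate(record: str) -> None:
    acks = 0
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        futures = [pool.submit(send_log_record, n, record) for n in NODES]
        for future in as_completed(futures):
            acks += 1
            if acks >= WRITE_QUORUM:
                print(f"commit acked after {acks}/{len(NODES)} replicas")
                return   # the remaining sends still finish in the background

replicate("UPDATE page=42 ...")
```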
Aurora Pricing
- standard:
  - compute (depends on instance): ≈2× more expensive than EC2
  - storage capacity: $100/TB/month
  - storage I/O: $0.2/million
- I/O-Optimized:
  - compute (depends on instance): ≈3× more expensive than EC2
  - storage capacity: $225/TB/month
  - storage I/O: free

Microsoft Socrates
- commercial name: SQL Database Hyperscale
- observations regarding Aurora: storing 3 copies of each page is quite expensive, and conceptually, the log and page services are quite different from each other
- Socrates changes: store each page on only one page server; if that page server dies, recover its pages from backup (on S3) plus the log service
- the log service is implemented as a separate component

System Comparison
[Figure: side-by-side comparison of the system designs discussed above]

Fully Distributed OLTP Systems
- the designs described so far have a scalability bottleneck: all writes have to go through the primary
- they also require explicit provisioning of servers
- fully-distributed OLTP systems (e.g., DynamoDB) exist but are, interestingly, not as commonly used for general-purpose OLTP workloads
- discussion of tradeoffs: https://www.cidrdb.org/cidr2023/papers/p50-ziegler.pdf
- geo-distributed systems (e.g., FaunaDB) exist as well

OLAP Systems

OLAP
- optimized for large table scans
- columnar, compressed storage
- large datasets may not fit onto one machine
- parallelism and distribution are essential

Traditional OLAP Design: Horizontal Partitioning (Shared Nothing)
- data is partitioned horizontally (by row, not by column) across multiple nodes
- usually, users have to specify a partitioning key for each table
- each node has its own storage: storage and compute are scaled in lock-step by adding nodes
- example system: (original) Amazon Redshift

Modern OLAP Design: Disaggregated Storage/Compute
- cloud object stores (e.g., S3) allow disaggregating storage and computation; Snowflake pioneered this design for OLAP:
  1. control plane: multi-tenant service (access/tx-control, query compilation, scheduling), backed by an OLTP system
  2. query processing: per-tenant clusters of (e.g., EC2) instances ("virtual warehouses")
  3. storage: on a cloud object store
[Figure: two customers with virtual warehouses of different sizes (e.g., 4, 8, or 1 node); each node combines CPUs, DRAM caching, and SSD; all warehouses read shared databases DB1-DB3 from S3]

Snowflake Query Step-By-Step Example
1. control plane: query the OLTP system for all current blocks, filter unnecessary blocks using lightweight per-block indexes/filters, and send the query plan and the filtered block list to the query engine nodes
2. query engine: request data from cache/storage and execute the query
3. storage: implemented as immutable blocks on S3

Snowflake Details
- supports AWS/Azure/Google Cloud
- caching: the cloud object store holds immutable data blocks (e.g., 10K tuples); blocks can be cached on the query processing nodes; consistent hashing helps with elasticity (see the sketch below this list)
- updates/transactions: update transactions create new objects, coordinated in the control plane using the OLTP system's (FoundationDB) transaction functionality; read queries will see either the old or the new objects
- elasticity: the cluster size can be adjusted; a pool of worker nodes is maintained to make this quick; configurable auto-shutdown/startup (e.g., after 15 minutes of no queries); one can launch several virtual warehouses for the same database as needed
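
Consistent hashing assigns each block to a cache node such that resizing the cluster only remaps a small fraction of the blocks. A minimal, generic sketch of the technique (hash ring with virtual nodes; this is not Snowflake's actual implementation):

```python
# Minimal consistent hashing: a block maps to the first node clockwise on a
# hash ring, so adding a node only remaps ~1/n of all blocks.
import bisect, hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def node_for(self, block: str) -> str:
        i = bisect.bisect(self.keys, h(block)) % len(self.ring)
        return self.ring[i][1]

blocks = [f"block-{i}" for i in range(10_000)]
old = Ring(["worker-1", "worker-2", "worker-3"])
new = Ring(["worker-1", "worker-2", "worker-3", "worker-4"])  # scale out

moved = sum(old.node_for(b) != new.node_for(b) for b in blocks)
print(f"{moved / len(blocks):.0%} of cached blocks moved")    # ~25%, not 100%
```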

Snowflake Pricing
- compute: at least $2/hour ($3 or $4 for additional features), multiplied by the cluster size: 1 (X-Small), 2 (Small), 4 (Medium), ..., 512 (6X-Large)
- the hardware is not specified (was c5d.2xlarge in EC2 at $0.38/h: 8 vCPUs, 16 GiB, 200 GB NVMe)
- storage: on demand: $40/TB/month; capacity: $24/TB/month
- https://www.snowflake.com/pricing/
- https://docs.snowflake.com/en/user-guide/warehouses-overview

Redshift Pricing
- Redshift (traditional = shared nothing):

  name          vCPU   Memory    Storage   I/O         Price
  dc2.large        2    15 GiB    160 GB   0.60 GB/s   $0.25/h
  dc2.8xlarge     32   244 GiB    2.7 TB   7.50 GB/s   $4.80/h

- Redshift (managed storage), plus $24/TB/month for storage:

  name           vCPU   Memory    I/O         Price
  ra3.xlplus        4    32 GiB   0.65 GB/s   $1.09/h
  ra3.4xlarge      12    96 GiB   2.00 GB/s   $3.26/h
  ra3.16xlarge     48   384 GiB   8.00 GB/s   $13.04/h

Query as a Service Pricing
- so far, the user still has to decide on a cluster size and then pay for that cluster
- average cluster utilization is often low, so why not charge by query?
- examples: Google BigQuery, Athena, Redshift Spectrum
- $5/TB scanned (+ cloud object storage cost)
- no server administration necessary/possible
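
Whether per-query pricing is cheaper than a provisioned cluster depends on how much you scan. A tiny break-even sketch; the cluster price and workload are invented for illustration:

```python
# When does pay-per-query ($5/TB scanned) beat a provisioned cluster?
cluster_cost_per_hour = 16.0   # hypothetical: e.g., 8 nodes at $2/h
hours_per_month = 720
price_per_tb_scanned = 5.0

cluster_monthly = cluster_cost_per_hour * hours_per_month   # $11,520
break_even_tb = cluster_monthly / price_per_tb_scanned
print(f"QaaS is cheaper below {break_even_tb:,.0f} TB scanned/month")  # 2,304
```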
How Are QaaS Systems Implemented?
- large, multi-tenant cluster with disaggregated storage
- queries are scheduled dynamically onto available instances, which is good for overall resource utilization
- often a best-effort per-query latency guarantee (e.g., 10s): the number of nodes used for one query is scaled with the query size
- billing by scan size alone is exploitable: a CPU time limit may be necessary (https://arxiv.org/pdf/1809.00159.pdf)
- BigQuery: https://www.vldb.org/pvldb/vol13/p3461-melnik.pdf

Analytic Query Processing Using FaaS
- idea: execute each query using lots of serverless functions
- research prototypes exist: https://arxiv.org/pdf/1912.00937
- scale the number of Lambda calls with the amount of data processed
- each Lambda reads data from S3 and may have to write intermediate results to S3 as well (e.g., for large joins)
- probably more expensive than large, multi-tenant QaaS services
- but this approach can conceptually drive query latency toward zero, even for very small tenants

Discussion
- BigQuery launched in 2011, Redshift in 2013, Snowflake in 2014
- the pricing model has an impact on the software architecture, and vice versa
- Redshift now recommends managed storage and has a "serverless" variant with auto-shutdown (thereby becoming more similar to Snowflake)
- Snowflake's multi-cloud support was probably a big factor in its success
- the QaaS model is conceptually the most elastic and easy to use (but may not be easy for customers to understand)
- BigQuery now also offers capacity-based compute pricing (for more predictable costs)