Big Data Systems - Modern Hardware I - PDF
Document Details
Uploaded by TimelySweetPea
Hasso-Plattner-Institut
Martin Boissier
Tags
Summary
This document presents a lecture on Big Data Systems, focusing on modern hardware. It covers topics like hardware trends, memory access, caches, and vectorized execution. The summary also delves into why some data processing systems show poor cache behavior and how programmers can optimize code for cache performance.
Full Transcript
Martin Boissier Big Data Systems Data Engineering Systems Modern Hardware I Hasso Plattner Institute Timeline II Date Tuesday Wednesday 7.1./ 8.1. ML Systems II Modern Hard...
Martin Boissier Big Data Systems Data Engineering Systems Modern Hardware I Hasso Plattner Institute Timeline II Date Tuesday Wednesday 7.1./ 8.1. ML Systems II Modern Hardware I 14.1. / 15.1. Modern Hardware II Exercise 21.1. / 22.1. Modern Cloud Warehouses - 28.1. / 29.1. Industry Talk (Markus Dreseler, Snowflake) Exercise 4.2. / 5.2. Exam Prep Buffer/Exam prep 10.2. Exam 2 This Lecture § Hardware-Aware Data Processing § Brief Introduction to CPU Architecture § Caches § Data Layout Sources § Data Processing on Modern Hardware – Summer Semester 2022 § Structured Computer Organization, Andrew S. Tanenbaum, Todd Austin 3 Where are we? § Efficient use of current hardware Application / Query Language / Application Analytics / Visualization § Hardware trends Data Processing Big Data Data Management Systems § Hardware software codesign File System Virtualization / Container OS / Scheduling Infrastructure Hardware 4 Motivation Classic DBMS Architecture § Performance of database Register engines was limited by disk IO 0.3 ns § Most optimization efforts Caches focused on reducing disk IO 1.8 / 7.9 / 37.4 ns § Reflected by the classic Main Memory 90 - 110 ns database architecture: Access Time Gap: ~103 § Usage of a buffer pool Solid State Drive Access Time Gap 50 - 90 µs § Row-oriented data layout ~105 organized on pages Hard Disk Drive § Tuple-at-a-time processing 3 - 12 ms (Volcano Processing Model) Archive 100+ seconds to minutes 6 Intel Xeon Platinum 8352Y (two sockets) Game Changer – Cheap RAM § Over time, affordable RAM capacity increased § Nowadays, can get server machines with terabytes of main memory! § Most databases fit into main memory § However: Traditional DBMS architectures cannot utilize in-memory setups From: Harizopoulos et al., OLTP Through the Looking Glass, and What We Found There, SIGMOD 08. 7 Game Changer – Cheap RAM § Very few companies store “big data” § When they do, large fractions are rarely accessed (cold data) § One can rent machines with multiple TB of DRAM on AWS § “the vast majority of enterprises have data warehouses smaller than a terabyte” * § “90% of queries processed less than 100 MB of data.” § For most workloads, single-node processing provides the best performance and the best cost efficiency § However, other requirements like fault tolerance and elasticity need to be considered too Ø Back-of-the-envelope calculations * Big Data is Dead – Blogpost: https://motherduck.com/blog/big-data-is-dead/ 8 Multicore CPUs § High Parallelism: § Multiple cores (task parallelism): Multiple threads can perform different tasks at the same time § Vector units (data parallelism): The same instruction is performed on multiple data items at once § High Memory Bandwidth: § Aggregated memory bandwidth of 307.2 GB/s per CPU (DDR5 memory with eight channels, 38.4 GB/s per channel) § Multiple processors are organized in NUMA (Non-Uniform Memory Access) architecture § Cache coherent memory across all CPUs 9 Processor Trends 10 The Limitations of Multi-Core § Transistor density doubles -> power consumption stays the same (Dennard Scaling) § Since ~2006 Dennard Scaling stopped § Smaller surfaces increase the risks of current leakage § The peak frequency that a single processor can achieve is limited § Temporary solution: Adding more cores § Helps keeping the energy consumption per core low § More cores can still be added in a CPU, but they cannot all run simultaneously § Unused physical cores -> "dark silicon" 11 Co-Processors [Accelerators] CPU Multi-core CPU GPU FPGA ASIC Flexibility Efficiency "Accelerator – a special purpose device that supplements the main general-purpose CPU to speed up certain operations" Robey & Zamora: Parallel and High-Performance Computing Icon credits (Noun Project) CPU – fizae GPU – Azam Ishaq Motherboard – Matej Design AI – Eko Purnomo 12 Graphics Processing Units § Co-processors initially used for image rendering § General Purpose Graphics Processing Unit (GPGPU) § Unified execution model with general-purpose GPUs § Design inspired by CPUs, but focus on throughput not latency § 100k+ threads can run concurrently § High GPU-memory bandwidth (1.5 TB/s) § Main memory accesses are expensive § “Smart” thread scheduling § Asynchronous thread execution § Workloads need to be adapted § Kernel-based execution § Shared vs global memory utilization CPU GPGPU 13 Field Programmable Gate Arrays § Field-Programmable § Circuits can be developed and changed after production § Difference to ASICs (application-specific integrated circuit) § Gate Array § Logic gates and RAM blocks on the FPGA § Behavior and connections specified at runtime § Have been around since the 80s (prototyping, communication sector, networking) § Recent popularity even in the world of databases / data processing § Energy efficient, highly parallel, low clock rate, hard to program 14 Time for a Rewrite § To optimize a data processing system for fast in-memory performance, we need to build it practically from scratch: § Cache and processor efficient algorithms (parallel joins and aggregations) § Cache and processor efficient data structures (column stores, compression) § Cache and processor efficient processing models (vectorized processing, query compilation) 15 Scale Out vs. Scale Up Processing Scale-Up: Operate a small cluster of nodes, keep all data in distributed main memory Scale Out Systems Scale Up Systems 16 Industry Trend § TPU cloud, tight coupling, scale up, special hardware § “Only way to meet growing compute demands” - Amin Vahdat (Google) 17 Computer Architecture High Level Computer Organization CPU Core Core CPU DRAM L1 L1 L1I L1I D D NUMA L2 L2 L3 PMEM shared L3 cache PCIe GPU Network Disk FPGA § Modern computer architecture with PCIe Bus § Memory, PCIe, core and socket connection through internal bus (ring or mesh) 19 Hardware Trends dram is slow Source: Hennessy & Patterson, Computer Architecture, 6th edition § Memory performance increases slower than CPU performance § CPUs spend a lot of time to wait on memory loads and stores 20 Memory 6= Memory Dynamic RAM (DRAM) Static RAM (SR Memory ≠ Memory WL VDD Dynamic RAM (DRAM) § Stage kept in capacitator § Leakage, needs refreshing § Small, cheap Memory 6= Memory BL § Usage: DIMM Dynamic RAM (DRAM) Static RAM (SRAM) WL VDD Static RAM (SRAM) State kept in capacitor Bistable latch § Bistable latch (0 or 1) Leakage Cell state stable ! refreshing needed ! no refreshin § Cell state stable, no refresh BL BL § Larger, more expensive § Usage: CPU-cache Sebastian Breß In-Memory Databases On Modern Hardware State kept in capacitor Bistable latch (0 or 1) Leakage Cell state stable 21 DRAM Characteristics Dynamic RAM is comparably slow. DRAM Characteristics Memory needs to be refreshed periodically (⇡ every 64 (Dis-)charging a capacitor takes time. Dynamic RAM is comparably slow charge discharge § Memory needs to be refreshed periodically (≈ every 64ms) % charged § (Dis-)charging takes time § DRAM cells must be addressed and capacitor outputs amplified § Overall we’re talking about ≈ 200 CPU cycles per access time Under certain circumstances, DRAM can be reasonably fast DRAM cells must be addressed and capacitor outputs amplified. § DRAM cells are physically organized as a 2-d array § The discharge/amplify process is done for an entire row Overall we’re talking about ⇡ 200 CPU cycles per access. § Once this is done, more than one word can be read out Sebastian Breß In-Memory Databases On Modern Hardware § Several DRAM cells can be used in parallel § Read out even more words in parallel § We can exploit this by using sequential access patterns 22 SRAM Characteristics SRAM can be very fast § Transistors actively drive output lines, access almost instantaneous But SRAMs are significantly more expensive (chip space = money) Therefore: § Organize memory as a hierarchy § Small, fast memories used as caches for slower memory 23 Memory Hierarchy – Large Scale Capacity Latency 1 KB Registers 0.3 ns 64 KB – 10 MB Caches 1 – 40 ns 8 – 128 GB Main Memory 90 - 110 ns 0.5 – 8 TB Solid State Drive 50 - 90 µs 1 – 32 TB Hard Disk Drive 3 - 12 ms Up to 45 TB (compressed) Archive 100+ seconds to minutes 24 Caches Principle of Locality § Caches take advantage of the principle of locality Example sum = 0; § The hot set of data often fits into caches for (i = 0; i < n; i++) { § 90% execution time spent in 10% of the sum += a[i]; code } return sum; § Spatial Locality Spatial Locality § Related data is often spatially close § Code often contains loops □ a[i] accessed in stride-1 pattern □ Instructions referenced in sequence § Temporal Locality Temporal Locality § Programs tend to re-use data frequently □ sum referenced in each iteration § Code may call a function repeatedly, even if it is not spatially close □ Repeated loop cycle 26 Memory Access Example § Sum up all entries of a “two”-dimensional A Motivating Example (Memory Access) array cols int *src = new int[rows*cols]; 100s 109 108 107 106 105 104 103 102 101 100 100s 50s 50s total execution time 20s 20s § Alternative 1 (row-wise): 10s 10s for (r = 0; r < rows; r++) 5s 5s for (c = 0; c < cols; c++) sum += src[r * cols + c]; 2s 2s 1s 1s 100 101 102 103 104 105 106 107 108 109 § Alternative 2 (column-wise): rows column-wise traversal row-wise traversal for (c = 0; c < cols; c++) for (r = 0; r < rows; r++) Sebastian Breß In-Memory Databases On Modern Hardware 29/409 sum += src[r * cols + c]; 27 Memory Access § On every memory access, the CPU checks if the respective cache line is already cached § Cache Hit: § Read data directly from the cache § No need to access lower-level memory § Cache Miss: § Read full cache line from lower-level memory § Evict some cached block and replace it by the newly read cache line § CPU stalls until data becomes available 28 Cache Performance § Big difference between the cost of cache hit and a cache miss § Can be 100x speed difference between accessing cache and main memory (in clock cycles) § Miss rate (MR) #misses § Fraction of memory references not found in cache: = 1 − Hit rate #accesses § Hit time (HT) § Time to deliver a cache line from the cache to the processor § Miss penalty (MP) § Additional time required because of a miss § Average time to access memory (considering both hits and misses): HT + MR × MP § 99% hit rate is twice as good as 97% hit rate § Assume HT of 1 𝑐𝑦𝑐𝑙𝑒, and MP of 100 𝑐𝑙𝑜𝑐𝑘 𝑐𝑦𝑐𝑙𝑒𝑠 § 97%: 1 + (1−0.97) x 100 = 1 + 3 = 4 𝑐𝑦𝑐𝑙𝑒𝑠 § 99%: 1 + (1−0.99) x 100 = 1 + 1 = 2 𝑐𝑦𝑐𝑙𝑒𝑠 29 CPU Cache Internals Cache Internals § To guarantee speed, the overhead To guarantee ofoverhead speed, the cachingofmust bemust caching keptbereasonable. kept reasonable. 0 1 2 3 4 5 6 7 Organize cache in cache lines. Only load/evict full cache line size § Organize cache in cache lines. lines. § Only load/evict full cache Typicallines. cache line size: 64 bytes. § Typical cache line size: 64 bytes (M1: 128B) cache line The organization in cache lines is consistent with the principle of (spatial) locality. § The organization in cache lines is consistent with the principle of (spatial) locality. Sebastian Breß In-Memory Databases On Modern Hardware 43/414 30 Cache Organization - Example Example numbers Caches: Sample Organization of the Memory (AMD Opteron, 2.8 GHz, PC3200 DDR SDRAM Hierarchie ) Core 1 Core 2 Core 3 Core 4 § L1 cache: § separate data and instruction caches L1i L1d L1i L1d L1i L1d L1i L1d § each 64 kB, 64 B cache lines L2 § L2 cache: shared cache, 1 MB, 64 B cache lines L3 § L1 hit latency: 2 cycles (≈ 1 ns) MM § L2 hit latency: 7 cycles (≈ 3.5 ns) § L2 miss latency: 160–180 cycles (≈ 60 ns) 15/4 31 Caches Latencies § Approximate numbers Memory Latency [cycles] Register ≤1 L1 3-4 L2 ≈14 TLB 1 ≈12 TLB 2 ≈30 Main Memory ≈240 32 Caches on a Die – Intel i7 Intel Core i7-3960X The die is 21 by 21 mm and has 2.27 billion transistors 33 Caches on a Die II – M1 Performance Cores Perf Core § L1 Instruction Cache 192 KB § L1 Data Cache 128KB Perf L2 § Shared L2 Cache 12 MB DRAM Channel Efficiency Cores § L1 Instruction Cache 128 KB System § L1 Data Cache 64KB Cache § Shared L2 Cache 4 MB Eff L2 System Level Cache Eff Core § Shared with GPU: 16MB 34 Numbers on M1 § Pointer chasing benchmark 1 size_t run_bm(size_t element_count) { 2 auto data = std::vector(element_count); 3 // [..] 4 std::iota(data.begin(), data.end(), 0); 5 std::shuffle(data.begin(), data.end(), generator); 6 7 size_t next_position = size_t{0}; 8 for (auto index = size_t{0}; index < NUM_OPS; ++index) { 9 next_position = data[next_position]; 10 sum += next_position; 11 } 12 13 const auto end = std::chrono::steady_clock::now(); 14 const auto duration = (end - start); 15 // [..] 16 } 35 DELab Measurements § Pointer chasing benchmark 36 DELab Measurements § Pointer chasing benchmark 37 Performance (SPECint 2000) (SPECint 2000) Performance.. 20. misses per 1000 instructions L1 Instruction Cache 15 L2 Cache (shared) 10 5 0 gzip vpr gcc mcf crafty parser eon perlbmk gap vortex bzip2 twolf avg benchmark program 38 Performance (SPECint 2000) (SPECint 2000) Performance.. 20. misses per 1000 instructions L1 Instruction Cache 15 L2 Cache (shared) 10 5 0 gzip vpr gcc mcf crafty parser eon perlbmk gap vortex bzip2 twolf avg TPC-C benchmark program 39 Assessment § Why do database systems show such poor cache behavior? § Poor code locality: § Polymorphic functions (E.g., resolve attribute types for each processed tuple individually) § Volcano iterator model (pipelining) Each tuple is passed through a query plan composed of many operators § Poor data locality: § Database systems are designed to navigate through large data volumes quickly § Navigating an index tree, e.g., by design means to “randomly” visit any of the (many) child nodes 40 Cache Friendly Code § Programmer can optimize for cache performance § How data structures are organized § How data are accessed (e.g., nested loop structure) § All systems favor “cache-friendly code” § Getting absolute optimum performance is very platform specific § Cache sizes, cache block size, associativity, etc. § Can get most of the advantage with generic code: § Keep working set reasonably small (temporal locality) § Use small strides (spatial locality) § Focus on inner loop cycle § Do not optimize too much prematurely. Check the hotspots with a profiling tool like perf. 41 Data Layout Caches for Data Processing § How can we improve data cache usage? § Requires going back to different data storage models and query execution models § And thinking both in terms of temporal- and spatial-locality § Example SELECT COUNT (*) FROM lineitem WHERE l_shipdate = “2009-09-26” § Typically involves full table scan 43 Row-stores Data Storage Layout § Two principal approaches a.k.a. row-wise storage or n-ary storage model, NSM: § Row layout (n-ary storage model / NSM) 1 a 1 1 b c a4 b4 c4 a1 b1 c1 d1 Column-stores c1 d1 a2 c4 d4 a2 b2 c2 d2 b2 c2 d2 d2 a3 b3 a3 b3 c3 d3 c3 d3 a4 b4 c4 d4 page 0 page 1 § Column layout storage a.k.a. column-wise (decomposition storage or decomposition modelDSM: storage model, / DSM) a1 2a a3 b1 b2 b3 a1 b1 c1 d1 a3 a4 b3 b4 a2 b2 c2 d2 ··· a3 b3 c3 d3 a4 Sebastian Breß b4 c4 d4 In-Memory Databases On Modernpage Hardware 0 page 1 54/409 44 A full table scan in a row-store Full Table Scan In In aa row-store, all rows row-store, all rows of of aa table table are are stored stored sequentially sequentially on on aa Row Store §database database page. page. l_shipdate l_shipdate tuple tuple cache block boundaries With every access to a l_shipdate field, we load a large amount §ofWith every irrelevant access information into to the a l_shipdate field, we cache. load a large amount of irrelevant information into the cache Sebastian Sebastian Breß Breß In-Memory In-Memory Databases Databases On On Modern Modern Hardware Hardware 57/409 57/409 45 A A ”full ”full table table scan” scan” on on a a column-store column-store Full Table Scan In In aa column-store, column-store, all all values values of one column of one column are are stored stored sequentially sequentially §onColumn Store on aa database database page. page. l_shipdate l_shipdate tuple tuple cache block boundaries §AllAlldata dataloaded into loaded caches into by aby“l_shipdate caches a “l_shipdatescan” is is scan” now now relevant for the query actually relevant for the query. § Less data has to be fetched from memory. § Amortize cost for fetch over more tuples. § If we’re really lucky, Sebastian Breß the full (l_shipdate) data might now even fit58/409 In-Memory Databases On Modern Hardware into caches. Sebastian Breß In-Memory Databases On Modern Hardware 58/409 46 HybridTrade-off Column-Store approaches § Tuple recombination candata One can also store cause considerable in a hybrid format: cost. § Need to perform many random lookups or even joins § Workload-dependent trade-offAcross) layout: PAX (Partition Attributes Divide each page into mini-pages and group attributes into them Weaving Relations for Cache Performance by Ailamaki et al. (VLDB 2001) Hybrid approaches Hybrid storage model § PAX (Partition Attributes Across) layout: Store new data in NSM for fast OLTP § Divide page into mini-pages, group attributes Weaving Relations for to Migrate data Cache Performance DSM for byOLAP more efficient Ailamaki et al. (VLDB 2001) Fractured mirrors (Oracle, IBM), Delta Store (SAP Hana) § Hybrid storage Recentmodel research states that DSM can be used efficiently for hybrid workloads § Store new Optimal data Column in NSM forforfast Layout HybridOLTP Workloads by Athanassoulis et al. (VLDB 2019) § Migrate data to DSM for more efficient OLAP § Fractured mirrors (Oracle, IBM), Delta Store (SAP Hana) 47 62 Parallelism Pipeline Parallelism Hardware Parallelism § Pipelining is Pipelining one technique to leverage is one technique available to leverage available hardware parallelism hardware parallelism. chip die Task 1 Task 2 Task 3 Separate chip regions for individual tasks execute independently. § Separate chip regions Advantage:for Use individual tasks sequential parallelism, but maintain execute independently execution semantics at front-end (here: assembly instruction § Advantage: Usestream). parallelism, but maintain sequential execution semantics at front-end We discussed(here: problems assembly around hazards instruction in the previous stream) chapter. § VLSI technology limits VLSI thelimits technology degree up up the degree totowhich pipelining which pipelining is is feasible feasible. (% H. Kaeslin. Digital Sebastian Breß Integrated In-Memory Databases On Circuit Design. Cambridge Univ.245/409 Modern Hardware 49 Press.). Hardware Parallelism Chip area can as well be used for other types of parallelis Other Hardware Parallelism Hardware Parallelism Chip area can asin1well be used for other types ofout 1 parallelism Task 1 § Chip area can also be used for other in out1 2 in21 out types of parallelism Task Task 1 2 in in32 out2 3 out Task 23 Task in3 out3 Task 3 Computer systems typically use identical hardware circuits, their function may be controlled by di↵erent instruction st § Computer systems typically usesiComputer :identical systems typically use identical hardware circuits, b their function may be controlled by di↵erent instruction stre hardware circuits, but their function may s1 s2 s3 si : be controlled by different instruction in1 out1 s1 s2 s3 streams si PU in21 in out out1 2 PU PU § Example: Multicore CPUs, Multiple in32 out in out2 3 Instructions, Multiple Data (MIMD) PU PU in3 out3 PU Sebastian Breß In-Memory Databases On Modern Hardware 50 Vector Units (SIMD) § Most modern processors also include a SIMD unit § Execute same assembly instruction Special on(SIMD) Instances a set of values § Also called vector unit; vector processors are entire systems built on that idea Most modern processors also include a SIMD unit: s1 in1 out1 PU in2 out2 PU in3 out3 PU Execute same assembly instruction on a set of values. Also called vector unit; vector processors are entire systems 51 Vectorized Execution in a Nutshell § A scalar instruction processes a single operation (e.g., add) at a time § A vectorized instruction processes one operation on multiple data items (i.e., Vectorized addition (+) on a vector of elements) at a time 52 Vectorized Execution SIMD Instructions § SIMD (Single instruction multiple data) vs SISD (Single instruction single data) § SIMD instructions: A class of instructions in modern CPUs that allow a processor to perform the same operation on multiple values simultaneously § CPUs from major vendors have microarchitecture support for SIMD instructions: § Intel x86: MMX, SSE, AVX(2/512) § ARM: NEON and Scalable Vector Extension (SVE) § Sparc: VIS 54 Example: SIMD vs. SISD 8 𝑋+ 𝑌= 𝑍 7 6 𝑥! 𝑦! 𝑥! + 𝑦! 𝑋 5 ⋮ + ⋮ = ⋮ 4 𝑥" 𝑦" 𝑥" + 𝑦" 3 2 𝑍 1 SISD + 1 for (int i=0;i