Big Data Systems: Benchmarking & Measurement

Document Details


Hasso-Plattner-Institut

Prof. Tilmann Rabl

Tags

big data systems, data engineering, big data, performance analysis

Summary

This document is a lecture from the Big Data Systems course covering benchmarking and measurement. It details performance analysis, back-of-the-envelope calculations, measurement, benchmarks, and BigBench. It also includes comparisons across systems and discusses related benchmarks such as TPC-DS and TPC-C.

Full Transcript


Prof. Tilmann Rabl, Data Engineering Systems, Hasso Plattner Institute
Big Data Systems: Benchmarking & Measurement

This Lecture
1. Introduction to performance analysis
2. Back of the envelope calculations
3. Measurement
4. Benchmarks
5. BigBench
Source: Raj Jain (Washington University in St. Louis), Computer Systems Analysis, https://www.cse.wustl.edu/~jain/cse567-17/index.html

Where are we?
§ Measurement is required on all levels of the stack:
  § Application: Application / Query Language / Analytics / Visualization
  § Big Data Systems: Data Processing, Data Management, File System
  § Infrastructure: Virtualization / Container, OS / Scheduling, Hardware
§ The concrete level depends on the research question

Why Measurement and Benchmarking
§ Systems are increasingly complex (e.g., the interaction with a web server)
§ Single transactions can span 1,000 components / nodes
§ Consumers' page-load-time expectations keep decreasing:
  § 1999 – 8 sec
  § 2009 – 2 sec
  § 2018 – 3 sec -> 50% of consumers leave the page
§ Low performance or outages cost money (losing users)

Why Measurement and Benchmarking
Prof. Rabl's Famous 7-Step Paper/Thesis Recipe
1. Literature search
2. Identify a research problem
3. Describe a novel solution
4. Perform BotEC to show it might work
5. Perform experiments that show it does work
6. Write up the paper
7. Endure revision cycles

Benchmark vs Analysis
§ Analysis
  § Single system/algorithm
  § Individual optimizations
  § Micro benchmarks
§ Benchmark
  § Compare multiple systems
  § Standard or real workload
§ A good paper has both!
[The slide reproduces an excerpt from the evaluation of the Viper persistent key-value store as an example, including Figure 8 (variable-sized ~216 byte records), Figure 9 (total memory), Figure 10 (update strategy: in-place vs. copy-on-write), Figure 11 (Viper versions), Figure 12 (operation breakdown), Figure 13 (YCSB latency and throughput), and the recovery experiments.]

Understanding System Performance
How to get the values?
§ Modeling – use a model to predict what is going on
  § Back of the envelope calculation
  § Analytical model
§ Measurement
  § Experimental design
  § Benchmarks
§ Simulation / Emulation – useful when the system does not exist yet
  § Trace-driven
  § …
§ Rule of Validation: do not trust the result of a single technique, but validate it with another
  § Often: validate measurements with a model

Back of the Envelope Calculation
How to get good (enough) performance?
§ Understand your application
§ Back of the envelope calculation – a calculation so small that it fits on the back of an envelope
  § Estimate your system performance within an order of magnitude
  § A.k.a. B.S.-filter – filter out stupid ideas early
§ Want to know for sure? – Benchmark!
BotEC numbers from Jeff Dean (Google system guru)
(Envelope image: scan by de:Benutzer:Superbass, public domain, https://commons.wikimedia.org/w/index.php?curid=5205050)

Useful Latency Numbers (Jeff Dean)
L1 cache reference                        0.5 ns
Branch mispredict                         5 ns
L3 cache reference                        20 ns
Mutex lock/unlock                         25 ns
Main memory reference                     100 ns
Compress 1K bytes with Snappy             3,000 ns       (3 us)
Send 2K bytes over 10 GbE                 2,000 ns       (2 us)
Read 1 MB sequentially from memory        100,000 ns     (100 us)
Read 4 KB from NVMe Flash                 50,000 ns      (50 us)
Round trip within same datacenter         500,000 ns     (500 us)
Disk seek                                 10,000,000 ns  (10 ms)
Read 1 MB sequentially from disk          20,000,000 ns  (20 ms)
Send packet CA → Europe → CA              150,000,000 ns (150 ms)

DDR4 channel bandwidth     20 GB/sec
PCIe gen3 x16 channel      12.5 GB/sec
NVMe Flash bandwidth       2 GB/sec
GbE link bandwidth         10 – 100 Gbps
Disk bandwidth             6 Gbps
NVMe Flash 4 KB IOPS       500K – 1M
Disk 4K IOPS               100 – 200

§ The further down the memory hierarchy we go, the more expensive access becomes.
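To make these numbers concrete, here is a minimal sketch (not part of the slides; the 1 GB scan scenario and all variable names are illustrative assumptions) that plugs a few of the latencies above into a quick back-of-the-envelope estimate:

#include <cstdio>

int main() {
    // Rough numbers taken from the table above (order of magnitude only)
    const double disk_seek_ms    = 10.0;   // one random disk seek
    const double disk_read_mb_ms = 20.0;   // read 1 MB sequentially from disk
    const double mem_read_mb_ms  = 0.1;    // read 1 MB sequentially from memory

    const double size_mb = 1024.0;         // scan 1 GB of data

    // One seek followed by a sequential scan vs. 1024 random 1 MB reads vs. memory
    const double seq_disk_ms = disk_seek_ms + size_mb * disk_read_mb_ms;
    const double rnd_disk_ms = size_mb * (disk_seek_ms + disk_read_mb_ms);
    const double memory_ms   = size_mb * mem_read_mb_ms;

    std::printf("1 GB sequential from disk:      ~%.1f s\n", seq_disk_ms / 1000.0);
    std::printf("1 GB as 1024 random 1 MB reads: ~%.1f s\n", rnd_disk_ms / 1000.0);
    std::printf("1 GB from memory:               ~%.2f s\n", memory_ms / 1000.0);
    return 0;
}

The point is the order of magnitude, not precision: even a sequential disk scan comes out roughly 200x slower than the in-memory scan, and purely random access adds another ~50% on top of that.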
Basic Considerations
§ Do I have a big data problem?
  § My data fits in memory -> probably not
§ Example: find the max/avg/min element in a list of N integers (a small code sketch follows below)
  § N = 16,000 fits in L1
  § N = 64,000 fits in L2
  § N = 13,000,000 ≈ 52 MB ≈ 200 ms = instantaneous on an 8-year-old laptop*
*T. A. Limoncelli: 10 Optimizations on Linear Search. CACM 59(09), 2016
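A minimal sketch of the linear scan behind this example (not from the slides; the array size, random fill, and timing details are illustrative):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
    // N = 13,000,000 integers, i.e., roughly 52 MB, as in the example above
    const int n = 13000000;
    std::vector<int> data(n);
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 1000000);
    for (auto& x : data) x = dist(gen);

    // Single sequential pass: min, max, and average
    const auto start = std::chrono::steady_clock::now();
    int min_v = data[0];
    int max_v = data[0];
    long long sum = 0;
    for (int x : data) {
        min_v = std::min(min_v, x);
        max_v = std::max(max_v, x);
        sum += x;
    }
    const auto end = std::chrono::steady_clock::now();

    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "min=" << min_v << " max=" << max_v
              << " avg=" << static_cast<double>(sum) / n
              << " in " << ms << " ms" << std::endl;
    return 0;
}

The single pass over an in-memory array supports the point above: data that fits comfortably in memory is usually not a big data problem.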
Simple BotEC Example*
§ How long does it take to generate an image results page (30 thumbnails)?
§ Design 1: read serially, thumbnail 256 KB images on the fly
  § 30 seeks * 10 ms/seek + 30 * 256 KB / 30 MB/s = 560 ms   (the seek finds the right position on disk)
§ Design 2: issue the reads in parallel
  § 10 ms/seek + 256 KB / 30 MB/s = 18 ms
  § (Ignores variance, so really more like 30-60 ms, probably)
§ Lots of variations:
  § Caching (single images? whole sets of thumbnails?) – then we do not need to read from disk at all
  § Pre-computing thumbnails
§ Back of the envelope helps identify the most promising design.
§ Often, you need a simple prototype to get useful numbers.

Sorting Example
§ How long does it take to sort 1 GB of 4-byte numbers?
§ 1 GB = 2^28 int numbers (on most 32- or 64-bit machines)
§ Quicksort et al.: O(n log n)
  § log(2^28) = 28 passes over 2^28 numbers ≈ 2^33 comparisons
  § about half of them mispredict -> 2^32 mispredicts * 5 ns/mispredict = 21 secs
§ Memory bandwidth: mostly sequential streaming
  § 2^30 Bytes * 28 passes = 28 GByte, memory BW ≈ 4 GB/sec -> 7 sec
§ Roughly 30 seconds to sort 1 GB on one CPU.
§ Let's try!

Complete Sorting Program

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <vector>

int main(int argc, const char* argv[]) {
    int size = 268435456;              // 2^28 4-byte integers = 1 GB
    std::vector<int> data;
    data.reserve(size);
    uint64_t dur = 0;

    // Fill array with random numbers
    for (int i = 0; i < size; i++) {
        data.emplace_back(rand());     // % 10000;
    }

    // Measure and sort
    const auto start = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end());
    const auto end = std::chrono::steady_clock::now();

    // Output result
    dur += std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "Sort took " << dur << " ms" << std::endl;
    return 0;
}
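To compare the estimate with a real measurement, the program above can be compiled with optimizations enabled, for example with g++ -O2 -std=c++17 sort.cpp -o sort (the file name is just a placeholder), and run on the target machine. A back-of-the-envelope calculation only claims to be right within an order of magnitude, so a measured runtime anywhere near the ~30 second estimate counts as confirmation.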
