BDS Exam WiSe 23/24
Hasso-Plattner-Institut, 2023
Uploaded by TimelySweetPea
1 Multiple Choice (10 points)

1 point per correct answer. 0 points for wrong answers. There is always exactly one correct answer per question.

1. When we use the term "system under test" in benchmarking, we are referring specifically to:
- Software, workloads, and metrics
- Deployment of software where users can issue requests
- Deployment comprised of hardware, software, and data
- Software

2. Given the following cluster resources {CPUs: 32, RAM: 128 GB, GPUs: 8, SSD: 1 TB} and job {CPUs: 20, RAM: 30 GB, GPUs: 6, SSD: 200 GB}. In Dominant Resource Fair Scheduling, which is the dominant resource of the job?
- RAM
- SSD
- GPU
- CPU

3. Given the following three equivalent matrix multiplications, which execution order is the fastest?
- (((A(10x5) x B(5x10)) x C(10x1)) x D(1x7))
- ((A(10x5) x B(5x10)) x (C(10x1) x D(1x7)))
- (A(10x5) x (B(5x10) x (C(10x1) x D(1x7))))

4. What are the three main challenges of distributed file systems?
- Caching, multi-tenancy, and shareability
- Performance, scalability, and consistency
- Support for all file sizes, all operating systems, and all data types

5. Which of the following statements is true about RDDs upon failures?
- Only the lost partitions of an RDD need to be recomputed.
- The entire RDD is rebuilt using its lineage.
- The whole program is rolled back and replayed.

6. Which statement is true in the context of stream processing?
- Event time refers to the time when events occurred, while processing time is the time it takes to process the events.
- Event time and processing time are interchangeable terms, both referring to the time at which events are ingested into the stream processing system.
- Event time refers to the time when events occurred, while processing time refers to the time when events are processed by the stream processing system.

7. Assume you have a matrix that you represent as a two-dimensional int array of size rows*columns. You decide to represent it as a large one-dimensional array in a row-major layout. In code you write:

    int *src = new int[rows * columns];

Assume the matrix is very large and has a quadratic shape. You want to calculate the sum over all values in the matrix. Which code snippet is most likely the faster one?

    Listing 1: Snippet 1
    for (c = 0; c < cols; c++)
        for (r = 0; r < rows; r++)
            sum += src[r * cols + c];

    Listing 2: Snippet 2
    for (r = 0; r < rows; r++)
        for (c = 0; c < cols; c++)
            sum += src[r * cols + c];

- Snippet 1
- Snippet 2

8. Which statement is correct?
- In general, a GPU is less efficient but more flexible than a CPU.
- In general, a GPU is less efficient but more flexible than an ASIC.
- In general, an ASIC is less efficient but more flexible than an FPGA.

9. Assume you have two datasets: a dataset containing basic information about 10 products that a company sells, and a dataset that consists of a large number of historical transactions representing purchases of the products. Now assume you want to join the data by, for example, a product ID. Which join method is the most suitable?
- Repartition join
- Broadcast join

10. Which of the following functions is not a distributive function?
- maximum of all values
- most frequent value
- sum of all values

2 Stream Window Aggregation (8 points)

In your stream application you are performing windowed aggregations. Your application received the following event stream:

(1,2) (2,3) (8,4) (3,1) (WM,11) (9,9) (13,10) (16,8) (14,0) (15,2) (WM,22)

Each event is of the form timestamp,value; watermarks are marked as WM,timestamp. The same windowing assumptions as in the programming assignment are used:
- A window is a closed-open interval, i.e. an event with a timestamp equal to the window start belongs to the window, but an event with the window end as its timestamp does not.
- Time starts at 0. A tumbling count window [0,3) includes the events 0, 1, and 2.
- Count windows should be ordered, i.e. if you receive events with timestamps 10, 20, 40, and 30 for a count window of size 3, the window should include the events 10, 20, and 30.
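The count-window ordering rule above can be sketched in a few lines (my own illustration in Python, not part of the original exam): events are first ordered by timestamp, then grouped into consecutive windows of the given size.

```python
def tumbling_count_windows(events, size):
    """Group (timestamp, value) events into consecutive count-based
    windows of `size` events each, ordering by timestamp first
    (events may arrive out of order)."""
    ordered = sorted(events, key=lambda e: e[0])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

# Timestamps 10, 20, 40, 30 arrive in that order, but for a count
# window of size 3 the first window holds 10, 20, and 30.
events = [(10, 1), (20, 2), (40, 3), (30, 4)]
first = tumbling_count_windows(events, 3)[0]
print([t for t, _ in first])  # → [10, 20, 30]
```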
Use one decimal place precision for calculations and "null" for the result of an empty window if necessary.

Question 1
What are the windowed aggregate values after processing the watermark at 22? Note that for count-based windows, "start" and "end" refer to indices instead of timestamps.

(a) Sliding window of length 6 and slide 4 with AVG aggregation

window start | window end | aggregate value
             |            |

(b) Tumbling count window with a length of 3 (count based) with SUM aggregation

window start | window end | aggregate value
             |            |

3 B+ Tree Operations (21 points)

In the following questions you will be asked to perform operations on a B+ Tree. For all of the following questions, please draw the resulting B+ Tree. Adding drawings of the relevant parts for the intermediate steps (if there are any) may help you and might award partial points in case the final version is incorrect. If a subtree is not changed by the operation, you may indicate this in the drawing with an arrow pointing at [...] to avoid unnecessary copying.

For this task, the B+ Tree is defined with n = 4 keys and n+1 child pointers. When splitting a node, move n/2 + 1 keys to the new node. If a delete requires sibling nodes, perform the deletion with the following precedence: i) steal from right sibling, ii) steal from left sibling, iii) merge with right sibling, iv) merge with left sibling.

Question 2 (3 points)
Given the initial B+ Tree in the figure below, delete "27".

[Figure, levels top to bottom:
root:   [17]
inner:  [4 12] [26 33]
leaves: [1 2] [4 8] [12 13] [17 19 24] [26 27] [33 35]]

Question 3 (4 points)
Given the initial B+ Tree in the figure below, add "13".

[Figure, levels top to bottom:
root:   [3 5 7 9]
leaves: [1 2] [3 4] [5 6] [7 8] [9 10 11 12]]

Question 4 (4 points)
Given the initial B+ Tree in the figure below, delete "1".

[Figure, levels top to bottom:
root:   [7]
inner:  [3 5] [9 11]
leaves: [1 2] [3 4] [5 6] [7 8] [9 10] [11 12]]

The figure below describes a B+ Tree that indexes the primary key (PK) of a relation named employee. This B+ Tree has root, inner, and leaf nodes, all of which reside in memory.
The leaf nodes have pointers to data pages on disk, with each key pointer in a leaf pointing to the same data page and an extra pointer pointing to the next leaf node (if it exists). The relation records are stored on data pages row-wise, and data pages from contiguous leaf nodes are stored sequentially. Each page is 4 KiB. Retrieving a node (including its pointers) requires just one memory access.

[Figure, in memory, levels top to bottom:
root:   [66]
inner:  [21 51] [80 95]
leaves: [1 5 7 15] [21 25 27 40] [51 55 57 63] [66 67 70 73] [80 82 85 90] [95 97 99 100]]

Data page:
1  | rio      | gibbs   | 10/03/1982 | ...
5  | rayhan   | finley  | 11/05/1987 | ...
7  | natalya  | proctor | 10/08/1994 | ...
15 | adrianna | james   | 23/07/1954 | ...

Hardware Characteristics:
- memory access takes 100 ns
- disk seek takes 10 ms
- reading 1 KiB sequentially from disk takes 20,000 ns

Additional Information:
- 1 ns = 10^-9 seconds
- 1 us = 10^-6 seconds = 1,000 ns
- 1 ms = 10^-3 seconds = 1,000 us

Question 5 (3 points)
Perform a BotEC to estimate the latency of a query over employee that fetches the record with PK=40. Assume that the system's query optimizer decides to use the B+ Tree. Please express the answer in millions of nanoseconds rounded to one decimal place; for example, 2.07 ms = 2.1M ns.

Question 6 (3 points)
Perform a BotEC to estimate the latency of a query over employee that fetches the records with PK greater than or equal to 64 and less than 96. Assume that the system's query optimizer decides to use the B+ Tree. Please express the answer in millions of nanoseconds rounded to one decimal place; for example, 2.07 ms = 2.1M ns.

Question 7 (4 points)
Suppose a buffer manager in the system stores the two most recently accessed data pages in memory. Reading 1 KiB of data from memory takes 250 nanoseconds. Please perform a BotEC to estimate the latency of a query that retrieves records with primary keys 90, 170, 57, and 15 from the "employee" table. Assume the data pages containing the data for the leaf nodes 21-40 and 51-63 are cached.
Also, assume that the system's query optimizer selects the B+ Tree for this query. Please express the answer in millions of nanoseconds rounded to one decimal place; for example, 2.07 ms = 2.1M ns.

4 LSM-Trees (12 points)

LSM-Trees make use of SSTables. Consider the following SSTables, with the highest SSTable being the most recently written one and the lowest SSTable being the oldest. Further assume that these are the only SSTables that exist. We use the ꓕ symbol to represent a "Tombstone".

Most recent:
'each': ꓕ, 'eager': 3, 'ear': 34, 'earlier': 77, 'earlobe': ꓕ, 'early': 98, 'earn': 4, 'ease': 0, 'easy': 62, 'easygoing': 55

Middle:
'each': 67, 'eager': ꓕ, 'earlier': 54, 'earlobe': 27, 'early': 3, 'earn': 28, 'earth': ꓕ, 'ease': 63, 'easygoing': 56, 'eat': 97

Oldest:
'each': 95, 'eager': 42, 'ear': ꓕ, 'earlier': 82, 'early': 12, 'earn': 92, 'earth': 37, 'ease': 71, 'easygoing': 23, 'eat': 80

Question 8 (1 point)
If you request the value for the key "eager", what is the expected result (if the value does not exist, return "null")?

Question 9 (1 point)
If you request the value for the key "eat", what is the expected result (if the value does not exist, return "null")?

Question 10 (1 point)
If you request the value for the key "earth", what is the expected result (if the value does not exist, return "null")?

Given the following Bloom Filter that was created using these two hash functions: x mod 10 and x^2 mod 10.

index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
value | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1

Question 11 (2 points)
Can key 3 be in the DB? Explain in one sentence why or why not.

Question 12 (2 points)
Can key 5 be in the DB? Explain in one sentence why or why not.

Question 13 (3 points)
Plot the main trend (e.g. increasing, decreasing, parabolic, etc.) of the false positive probability of a Bloom filter as a function of i) the number of hash functions, ii) the filter size, and iii) the number of items in the filter. For i), assume a fixed filter size and a fixed number of items in the filter.
For ii), assume a fixed number of hash functions and a fixed number of items in the filter. Additionally, assume that the hash functions depend on the filter size, e.g. h_i = x^2 mod size. For iii), assume a fixed number of hash functions and a fixed filter size.

[Figure: three empty plots, each with "FP prob." from 0 to 1 on the y-axis, over i) the number of hash functions, ii) the filter size, and iii) the items in the filter on the x-axis]

Question 14 (2 points)
For a workload distribution of 95% sequential reads and 5% writes, which tree structure (LSM-Tree or B+ Tree) is the better choice for optimal performance on traditional disk-based data structures? Please explain your choice briefly.

5 Distributed Hash Tables (15 points)

Question 15 (5 points)
You are given the following hash-partitioned distributed hash table with 5 nodes, each with keys, using mod(5) as the hashing function. Node 4 is removed from the hash table. Update the hash function so that the load is shared across all nodes, and draw the resulting DHT after the deletion.

Node   | Keys
Node 0 | 25, 50, 55
Node 1 | 36
Node 2 | 2, 7
Node 3 | 8
Node 4 | 14, 19, 34

New hash function:
Updated DHT:

Question 16 (2 points)
What are the two major drawbacks of the approach used in the previous question? Please answer in a maximum of one sentence per drawback; bullet points are allowed!

Drawback (1):
Drawback (2):

Question 17 (6 points)
Consistent hashing together with virtual nodes is one way to address the drawbacks from above. Assume you have 2 nodes (Node 0 and Node 1) with 2 virtual nodes each (V0, V1, V2, V3) collectively storing the keys: 1, 5, 11, 13, 17, 21, 26, 30. Further assume:
- no replication
- the key range is 2^5 = 32
- the keys are laid out clockwise on the hash ring
- the allocation of keys to virtual nodes is counter-clockwise (same as in the quizzes; this means a node covers the key range in the counter-clockwise direction of its position)

Draw an ideal possible configuration of a hash ring with the keys and the virtual nodes on it.
Also, provide a mapping from physical nodes to virtual nodes and a mapping of virtual nodes to keys.

[Figure: empty hash ring, labeled 0 at the top and 16 at the bottom]

Node ID | Virtual Node ID
        |

Virtual Node ID | Key
                |

Question 18 (2 points)
Now assume two more nodes, Node 2 and Node 3, join the cluster, while keeping the number of virtual nodes on the ring constant. Update the figures and tables from above that will have to change after adding the nodes.

6 Map Reduce (18 points)

Question 19 (5 points)
You're working with a dataset containing information about books. Analyze the following MapReduce code and provide a concise explanation (in one sentence) of what the code is doing. Assume the function CATEGORY(book) returns the genre of a book, and the functions MAX(ITERABLE) and MIN(ITERABLE) calculate the maximum and minimum values of an iterable, respectively.

    def map(key, book):
        if book.language == 'English' and CATEGORY(book) == 'Science Fiction':
            emit(book.author, book.num_pages)

    def reduce(author, pages):
        emit(author, (MIN(pages), MAX(pages)))

Question 20 (5 points)
You are investigating which and how many badly rated books in a sales dataset were bought in Germany during the current year. A book is considered badly rated if its rating is below 2. The following SQL-like query reads data from the sales table and outputs the book_id along with a count of how many times this book was sold, only considering badly rated books. For this, the query includes a WHERE clause that filters out tuples that do not meet the conditions you are interested in, specifically only considering sales in Germany for the current year with ratings below 2. Finally, the GROUP BY clause aggregates the book sales per book_id.

    SELECT book_id, COUNT(sale_timestamp) AS count
    FROM sales
    WHERE country_iso = 'DE'
      AND rating < 2
      AND YEAR(sale_timestamp) = CURRENT_YEAR()
    GROUP BY book_id;

Translate the given SQL query, which analyzes book sales in Germany during the current year, into MapReduce pseudo code.
Assume the existence of the functions YEAR(TIMESTAMP) and CURRENT_YEAR() that return the calendar year of a TIMESTAMP and the current calendar year, respectively. Additionally, consider the availability of the function SUM(ITERABLE).

Question 21 (6 points)
(Transcriber's note: the exact text of this question is not available, but the gist, the numbers, and the layout are correct.)

You have to perform TPMMS (two-phase multiway merge sort) with the disk being in this state:

B0 | B1 | B2 | B3 | B4 | B5
14 | 21 | 11 |  2 | 23 | 22
24 | 15 | 12 |  5 |  8 |  1
16 | 20 |  3 | 28 |  9 | 30

Your memory is big enough to hold 3 blocks. In phase 1 you have access to the following instructions:
- read("Bi") or read("Bi", "Bj", "Bk"): Reads blocks from disk and puts their values into memory.
- sort_mem(): Sorts all values currently in memory.
- write("Bi") or write("Bi", "Bj", "Bk"): Writes from memory into the specified blocks.

Give the sequence of instructions that happen in phase 1 of TPMMS and the state of the disk after phase 1.

Sequence of instructions:

Disk after phase 1:
B0 | B1 | B2 | B3 | B4 | B5

(Transcriber's note: from here on the text is again original.)

Phase 2: For phase 2, describe what happens using the following instructions, but stop listing instructions after you have the first 6 elements sorted on disk.
- load("Bi") or load("Bi", "Bj", "Bk"): Allocates memory and loads a single block or multiple blocks from disk to memory.
- write_values(a, b, c): Writes a block holding the values a, b, and c to disk and clears the content of the block in memory.

You do not have to list instructions other than the ones given for phase 2. This means you do not have to list more fine-granular steps like copying single elements between blocks, allocating empty blocks, or clearing blocks from memory. Please stop listing instructions after you have the first 6 elements sorted on disk.

Question 22 (2 points)
Assume you have a main memory size of M, a tuple size of R, and a block size of B.
Explain why the maximum number of tuples that can be sorted with TPMMS can be calculated as follows:

(M / R) * ((M / B) - 1)

7 Spark (7 points)

The following directed acyclic graph (DAG) describes the operations across three stages of a Spark task. Boxes with solid outlines are RDDs.

[Figure: DAG with RDDs A-G spread across Stage 1, Stage 2, and Stage 3, connected by five operations labeled I-V]

Question 23 (2.5 points)
Given the candidate set O = {I, II, III, IV, V} and the operation names set C = {reduce, groupBy, union, join, map}, match each candidate with its corresponding operation name (each operation name should be used exactly once).

Candidate | Operation Name
I         |
II        |
III       |
IV        |
V         |

Question 24 (2.5 points)
List each operation from the operation names set C = {reduce, groupBy, union, join, map} as either a transformation or an action.

Type           | Operations
Transformation |
Action         |

Question 25 (2 points)
List each transformation as either wide or narrow.

Type   | Transformations
Narrow |
Wide   |

8 Dense and Sparse Matrices (9 points)

You are given the following 4x5 matrix:

[Figure: 4x5 matrix with non-zero values 2, 8, 6, 8, 4, 5, 1, 6; the cell positions were lost in the transcript]

Question 26 (3 points)
Draw the matrix in a modified compressed sparse rows (MCSR) representation.

Question 27 (6 points)
Assume you have a use case where you operate with very large but sparse matrices. Assume dimensions of 10^5 x 10^3 (rows x columns) and a filling degree of approx. 1%. Further, assume you currently use 32 bit for indexes, 32 bit for matrix values, and 64 bit for pointers. Currently, you are using the Coordinate Matrix Format (COO). You want to make your program memory efficient and seek a more space-efficient way to represent your matrices. A colleague suggests that using the Compressed Sparse Column (CSC) format would improve memory consumption by at least 30%. Perform a back-of-the-envelope calculation to assess your colleague's suggestion.
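A back-of-the-envelope calculation of this kind can be set up programmatically. The sketch below is my own illustration using the sizes stated in Question 27, not an official solution; it assumes CSC stores one value and one row index per non-zero entry plus one 64-bit pointer per column (plus one end pointer), which is the common CSC layout.

```python
# BotEC sketch: COO vs. CSC memory footprint for a 10^5 x 10^3 matrix
# with a filling degree of about 1%.
rows, cols, fill = 10**5, 10**3, 0.01
nnz = int(rows * cols * fill)              # 1,000,000 non-zero entries

idx_bits, val_bits, ptr_bits = 32, 32, 64

# COO: (row index, column index, value) per non-zero entry.
coo_bits = nnz * (2 * idx_bits + val_bits)

# CSC: one value and one row index per non-zero entry,
# plus (cols + 1) column pointers.
csc_bits = nnz * (idx_bits + val_bits) + (cols + 1) * ptr_bits

saving = 1 - csc_bits / coo_bits
print(f"COO: {coo_bits / 8 / 2**20:.1f} MiB, "
      f"CSC: {csc_bits / 8 / 2**20:.1f} MiB, saving: {saving:.1%}")
```

Under these assumptions the column-pointer array is negligible next to the per-entry cost, so the saving comes out slightly above 30%, which would make the colleague's claim plausible.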