Questions and Answers
When we use the term "system under test" in benchmarking, we are referring specifically to:
Given the following cluster resources {CPUs: 32, RAM: 128 GB, GPUs: 8, SSD: 1 TB} and job {CPUs: 20, RAM: 30 GB, GPUs: 6, SSD: 200 GB}: in Dominant Resource Fair Scheduling, which is the dominant resource of the job?
Given the following three equivalent matrix multiplications, which execution order is the fastest?
What are the three main challenges of distributed file systems?
Which of the following statements is true about RDDs upon failures:
Which statement is true in the context of stream processing?
Assume you have a matrix that you represent as a two-dimensional int array of size rows*columns. You decide to represent it as a large one-dimensional array in a row-major layout. In code you write:
int[] src = new int[rows * columns];
Assume the matrix is very large and has a quadratic shape. You want to calculate the sum over all values in the matrix. Which code snippet is most likely the faster one?
Which statement is correct?
Assume you have two datasets: a dataset containing basic information about 10 products that a company sells and a dataset that consists of a large number of historical transactions representing purchases of the products.
Now assume you want to join the data by, for example, a product ID. Which join method is the most suitable?
Which of the following functions is not a distributive function?
What are the windowed aggregate values and functions after processing the watermark at 22? Note that for count-based windows, "start" and "end" refer to indices instead of timestamps.
Given the initial B+ Tree in the figure below, delete "27".
Given the initial B+ Tree in the figure below, add "13".
If you request the value for the key "eager", what is the expected result (if the value does not exist, return "null")?
Can key 3 be in the DB? Explain in one sentence why or why not.
Perform a BotEC to estimate the latency of a query over employee that fetches the record with PK=40. Assume that the system's query optimizer decides to use the B+ Tree. Please express the answer in millions of nanoseconds rounded to one decimal place, for example, 2.07 ms = 2.1M ns.
Perform a BotEC to estimate the latency of a query over employee that fetches the records with PK greater than or equal to 64 and less than 96. Assume that the system's query optimizer decides to use the B+ Tree. Please express the answer in millions of nanoseconds rounded to one decimal place, for example, 2.07 ms = 2.1M ns.
Suppose a buffer manager in the system stores the two most recently accessed data pages in memory. Reading 1KiB of data from memory takes 250 nanoseconds. Please perform a BotEC to estimate the latency of a query that retrieves records with primary keys 90, 170, 57, and 15 from the "employee" table. Assume the data pages containing the data for the leaf nodes 21-40 and 51-63 are cached. Also, assume that the system's query optimizer selects the B+ Tree for this query. Please express the answer in millions of nanoseconds rounded to one decimal place, for example, 2.07 ms = 2.1M ns.
You're working with a dataset containing information about books. Analyze the following MapReduce code and provide a concise explanation (in one sentence) of what the code is doing. Assume the function CATEGORY(book) returns the genre of a book, and the functions MAX(ITERABLE) and MIN(ITERABLE) calculate the maximum and minimum values of an iterable, respectively.
def map(key, book):
    if book.language == 'English' and CATEGORY(book) == 'Science Fiction':
        emit(book.author, book.num_pages)

def reduce(author, pages):
    emit(author, (MIN(pages), MAX(pages)))
What are the two major drawbacks of the approach used in the previous question? Please answer in a maximum of one sentence per drawback, bullet points are allowed!
Now assume two more nodes Node 2 and Node 3 join the cluster, while keeping the number of virtual nodes on the ring constant.
Update the figures and tables from above that will have to change after adding nodes.
Assume you have a main memory size of M, a tuple size of R, and a block size of B. Explain why the maximum number of tuples that can be sorted with TPMMS can be calculated as follows:
(M/R) * ((M/B) - 1)
List each operation from the operation names set, C = {reduce, groupBy, union, join, map}, as either a transformation or an action.
List each transformation as either wide or narrow.
Draw the matrix in a modified compressed sparse rows (MCSR) representation.
Assume you have a use case where you operate with very large but sparse matrices. Assume dimensions of 10^5 x 10^3 (rows x columns) and a filling degree of approx. 1%. Further, assume you currently use 32 bit for indexes, 32 bit for matrix values, and 64 bit for pointers. Currently, you are using the Coordinate Matrix Format (COO). You want to make your program memory efficient and seek a more space-efficient way to represent your matrices. A colleague suggests that using the Compressed Sparse Column (CSC) format would improve memory consumption by at least 30%. Perform a back-of-the-envelope calculation to assess your colleague's suggestion.
Give the sequence of instructions that happen in phase 1 of TPMMS and the state of the disk after phase 1.
Study Notes
Multiple Choice Questions (BDS Exam)
- Benchmarking: The term "system under test" refers to the deployment being evaluated, i.e., the hardware, software, and data deployment that processes the user requests of the benchmark workload.
- Dominant Resource Fair Scheduling: For the given cluster and job, the GPU is the job's dominant resource, since the job's GPU demand is the largest fraction of cluster capacity (6/8 = 75%, versus 62.5% of the CPUs, about 23% of the RAM, and about 20% of the SSD); see the worked calculation after this list.
- Matrix Multiplication: For the given matrices, the fastest execution order is (((A x B) x C) x D).
- Distributed File Systems Challenges: The main challenges are performance, scalability, and consistency; such systems must also support various file sizes, operating systems, and data types.
- RDD Failures: Upon failure, only the lost partitions of an RDD are recomputed using its lineage; the entire RDD is not reconstructed.
- Stream Processing: Event time is the moment an event occurs at its source, while processing time is when the stream processing system actually handles the event.
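The dominant-resource answer can be verified with a small back-of-the-envelope sketch in Python; the capacities and demands are taken from the question, while the 1 TB = 1024 GB conversion and all identifiers are illustrative assumptions.

# Dominant Resource Fairness: a job's dominant resource is the one for which
# its demand is the largest fraction of the cluster's total capacity.
cluster = {"CPUs": 32, "RAM_GB": 128, "GPUs": 8, "SSD_GB": 1024}  # 1 TB taken as 1024 GB
job = {"CPUs": 20, "RAM_GB": 30, "GPUs": 6, "SSD_GB": 200}

shares = {resource: job[resource] / cluster[resource] for resource in cluster}
dominant = max(shares, key=shares.get)

print(shares)    # {'CPUs': 0.625, 'RAM_GB': 0.234..., 'GPUs': 0.75, 'SSD_GB': 0.195...}
print(dominant)  # GPUs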
Matrix Summation
- Matrix Representation: Representing the matrix as a one-dimensional array in row-major layout is possible, and the traversal order then determines speed: iterating so that the inner loop accesses consecutive array elements is cache-friendly, which is why snippet 2 is most likely the faster one for a large quadratic matrix; see the sketch below.
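The exam's answer options are not reproduced here, so the sketch below only illustrates why traversal order matters for a row-major flat array: the faster snippet is the one whose inner loop touches consecutive memory addresses. All names and the tiny matrix size are illustrative.

# Row-major flat layout: element (r, c) lives at index r * columns + c.
rows, columns = 4, 4
src = list(range(rows * columns))  # stand-in for the int[] array from the question

# Cache-friendly: the inner loop walks consecutive indices (stride 1).
row_major_sum = 0
for r in range(rows):
    for c in range(columns):
        row_major_sum += src[r * columns + c]

# Cache-unfriendly: the inner loop jumps by `columns` elements on every step.
column_major_sum = 0
for c in range(columns):
    for r in range(rows):
        column_major_sum += src[r * columns + c]

assert row_major_sum == column_major_sum  # same result, very different access pattern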
GPU/ASIC/FPGA Efficiency
- General Comparison: A GPU is generally less efficient than an ASIC but more flexible; compared with a general-purpose CPU it is the other way around, more efficient for data-parallel workloads but less flexible.
Data Analysis
- Join Method: A broadcast join is the most suitable, because the tiny product dataset (10 products) can be shipped to every worker and joined against the large transaction dataset without shuffling the transactions.
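As a rough illustration of the idea, the sketch below builds a hash table from the small product dataset (the part that would be broadcast to every worker) and streams the transactions past it; all data and field names are made up for this example.

# Broadcast (hash) join: ship the tiny table to every worker, build a hash
# table from it, and probe that table while scanning the large table.
products = {  # small dimension table keyed by product_id (illustrative)
    1: {"name": "Laptop"},
    2: {"name": "Phone"},
}

transactions = [  # stands in for a very large fact table (illustrative)
    {"product_id": 1, "amount": 999.0},
    {"product_id": 2, "amount": 499.0},
    {"product_id": 1, "amount": 1099.0},
]

joined = [
    {**txn, **products[txn["product_id"]]}
    for txn in transactions
    if txn["product_id"] in products  # probe the broadcast hash table
]
print(joined)

In a system like Spark this corresponds to a broadcast hash join, which avoids shuffling the large transaction dataset across the network.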
Stream Window Aggregation Notes
- Aggregate Values: Once the watermark at timestamp 22 is processed, every tumbling and sliding window whose end lies at or before the watermark is closed and emits its aggregate; the concrete values depend on the timestamps and values of the events that fell into each window.
B+ Tree Operations
- Deletion: Deleting "27" from the given B+ Tree involves checking whether the affected leaf underflows and can steal from its right or left sibling, or must be merged with a sibling, propagating the adjustment upward if necessary.
- Insertion: Inserting "13" into the given B+ Tree may overflow a leaf and therefore require one or more node splits, pushing a separator key up the tree.
- Deletion: Similarly, deleting "1" from the original tree requires adjustments through redistribution (stealing from a sibling) or merging of nodes.
B+ Tree (Employee Relation)
- Data Page Retrieval: The cost model for the employee B+ Tree assumes roughly 100 ns per in-memory access and roughly 20,000 ns to read 1 KiB sequentially from disk; the latency estimates below combine these two costs.
BotEC Calculations
- Query Latencies (employee table): These back-of-the-envelope calculations (BotEC) estimate the latencies of the point and range queries over the indexed employee relation; a generic sketch of such an estimate follows below.
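The page layout and tree height of the employee relation are given in the exam figure and are not reproduced here, so the sketch below only shows the shape of such an estimate; the node counts and page size passed in at the bottom are placeholders, and only the 100 ns memory access and 20,000 ns per 1 KiB sequential read come from the notes above.

# Back-of-the-envelope (BotEC) latency estimate for a B+ Tree lookup.
MEM_ACCESS_NS = 100        # per in-memory node access (from the notes above)
SEQ_READ_1KIB_NS = 20_000  # per 1 KiB read sequentially from disk (from the notes above)

def botec_lookup_ns(in_memory_node_accesses, pages_read_from_disk, kib_per_page):
    """Sum of in-memory index accesses plus sequential disk reads."""
    return (in_memory_node_accesses * MEM_ACCESS_NS
            + pages_read_from_disk * kib_per_page * SEQ_READ_1KIB_NS)

# Placeholder numbers only: 2 in-memory inner-node accesses plus 2 pages of
# 100 KiB read from disk; replace them with the values from the exam figure.
estimate_ns = botec_lookup_ns(in_memory_node_accesses=2,
                              pages_read_from_disk=2,
                              kib_per_page=100)
print(f"{estimate_ns / 1e6:.1f}M ns")  # rounded to one decimal place, as the exam asks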
LSM-Trees
- Key Value Retrieval (eager, eat, earth): Retrieve the associated value for each key and return "null" if a key is not found; the worked answer goes through all lookup steps of the LSM-Tree (memtable first, then the on-disk runs from newest to oldest).
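The concrete memtable and SSTable contents for the "eager"/"eat"/"earth" lookups are in the exam figure and are not reproduced here; the sketch below only shows the usual LSM-Tree read path with made-up data, checking the in-memory buffer first and then the on-disk runs from newest to oldest, and returning "null" when a key is absent.

# LSM-Tree read path: memtable first, then SSTables from newest to oldest.
memtable = {"eat": "v3"}          # in-memory write buffer (illustrative data)
sstables = [                      # on-disk runs, newest first (illustrative data)
    {"earth": "v2"},
    {"eager": "v1", "eat": "v0"},
]

def lsm_get(key):
    if key in memtable:
        return memtable[key]
    for run in sstables:          # newer runs shadow older ones
        if key in run:
            return run[key]
    return "null"                 # convention taken from the question

print(lsm_get("eager"), lsm_get("eat"), lsm_get("elephant"))  # v1 v3 null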
Bloom Filter
- Existence Check (Keys 3 and 5): Determine from the bit configuration of the filter whether key 3 or key 5 can be in the set; a Bloom filter can only rule a key out (when some of its bits are 0), never confirm its presence.
Bloom Filter Trends
- False Positive Probability: The probability of a false positive in a Bloom filter typically increases with more items, a smaller filter size, or fewer hash functions.
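These trends follow from the standard false-positive approximation p ≈ (1 - e^(-kn/m))^k for n inserted items, m bits, and k hash functions; the sketch below evaluates it for a few arbitrary example values.

import math

def bloom_fp_probability(n_items, m_bits, k_hashes):
    """Approximate Bloom filter false-positive rate: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

base = bloom_fp_probability(n_items=1_000, m_bits=10_000, k_hashes=7)
more_items = bloom_fp_probability(n_items=2_000, m_bits=10_000, k_hashes=7)
smaller_filter = bloom_fp_probability(n_items=1_000, m_bits=5_000, k_hashes=7)

print(base, more_items, smaller_filter)  # the rate rises with more items or fewer bits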
Tree Structure Selection
- Workloads (95% reads, 5% writes): For such a read-heavy workload a B+ Tree is generally the better choice, since it provides fast point and range lookups, whereas LSM-Trees primarily optimize for high write rates at the cost of read amplification; a full answer would weigh these tradeoffs explicitly.
Distributed Hash Tables (DHT)
- Partitioning: Removing Node 4 from the DHT requires remapping the keys it was responsible for; with simple modulo-based hashing the hash function itself must change (mod 5 no longer applies once five nodes are not available), whereas on a consistent-hashing ring with virtual nodes only the ranges owned by the removed node move to their successor nodes.
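The ring layout and virtual-node assignment from the exam figure are not reproduced here; the sketch below is a generic consistent-hashing ring with a fixed number of virtual nodes per physical node (note that the exam instead keeps the total number of virtual nodes constant), and the hash function, node names, and keys are all illustrative.

import bisect
import hashlib

def ring_hash(value, ring_size=2**16):
    """Illustrative hash onto a small ring (not the exam's hash function)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % ring_size

class ConsistentHashRing:
    def __init__(self, nodes, vnodes_per_node=4):
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def lookup(self, key):
        # The first virtual node clockwise from the key's position owns the key.
        idx = bisect.bisect_right(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["Node 0", "Node 1", "Node 4"])
print({k: ring.lookup(k) for k in ["alpha", "beta", "gamma", "delta"]})

# After Node 2 and Node 3 join, only keys falling into ranges now owned by the
# new nodes' virtual nodes move; all other keys keep their previous owner.
bigger = ConsistentHashRing(["Node 0", "Node 1", "Node 2", "Node 3", "Node 4"])
print({k: bigger.lookup(k) for k in ["alpha", "beta", "gamma", "delta"]})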
MapReduce Example
- Processing Books: The map phase emits (author, page count) pairs for English-language science-fiction books; the reduce phase then outputs the minimum and maximum page count per author.
SQL Query (Sales Data)
- Book Sales: This pseudocode translates the SQL query analyzing book sales in Germany during the current year into MapReduce, relying on the sale timestamps, helper functions such as YEAR() and CURRENT_YEAR(), and basic aggregation functions; a hedged sketch of such a translation follows below.
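The original SQL query is not reproduced in these notes, so the following is only one plausible translation of "book sales in Germany during the current year" into the same MapReduce pseudocode style used above; the record fields (sale.country, sale.timestamp, sale.book_id, sale.price) and the SUM() aggregate are assumptions of this sketch.

def map(key, sale):
    # keep only sales made in Germany during the current year (assumed fields)
    if sale.country == 'Germany' and YEAR(sale.timestamp) == CURRENT_YEAR():
        emit(sale.book_id, sale.price)

def reduce(book_id, prices):
    # total revenue per book; SUM() is an assumed aggregation helper
    emit(book_id, SUM(prices))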
TPMMS Algorithm Discussion
- Sorting process: In phase 1, TPMMS repeatedly fills main memory with blocks read from disk, sorts the buffered tuples in memory, and writes each sorted run back to disk; after phase 1 the disk holds a collection of sorted runs of at most M/R tuples each, ready to be merged in phase 2.
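A minimal sketch of phase 1, assuming the input is simply a list of tuples and that memory holds M/R tuples at a time; it also hints at why the capacity bound is (M/R) * ((M/B) - 1): each run holds at most M/R tuples, and phase 2 can merge at most (M/B) - 1 runs because one memory block is reserved for the merge output.

# Two-Phase Multiway Merge Sort (TPMMS), phase 1: produce sorted runs on disk.
def tpmms_phase1(input_tuples, memory_capacity_tuples):
    """Read memory-sized chunks, sort each in memory, write it out as a run."""
    disk_runs = []  # stand-in for the sorted runs written back to disk
    for start in range(0, len(input_tuples), memory_capacity_tuples):
        chunk = input_tuples[start:start + memory_capacity_tuples]  # read blocks into memory
        chunk.sort()                                                # sort in memory
        disk_runs.append(chunk)                                     # write the run to disk
    return disk_runs  # disk state after phase 1: sorted runs of at most M/R tuples

print(tpmms_phase1([5, 3, 8, 1, 9, 2, 7, 4], memory_capacity_tuples=3))
# [[3, 5, 8], [1, 2, 9], [4, 7]]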
Matrix Considerations
- MCSR (Modified Compressed Sparse Rows): Like CSR, this representation stores only the non-zero values together with their column indices, with row membership given by the per-row structure rather than explicit row indices, which greatly reduces storage for sparse matrices compared to a dense layout.
Large Sparse Matrix Efficient Storage
- Alternative Format (Compressed Sparse Column): For this matrix, switching from the coordinate format (COO) to compressed sparse column (CSC) saves roughly a third of the memory, so the colleague's claim of at least 30% is justified; see the back-of-the-envelope check below.
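A back-of-the-envelope check of the claim under the stated assumptions (a 10^5 x 10^3 matrix with about 1% non-zeros, 32-bit indices and values, 64-bit pointers); variable names are illustrative.

# BotEC: COO vs CSC memory footprint for a 10^5 x 10^3 matrix with ~1% non-zeros.
rows, cols, fill = 10**5, 10**3, 0.01
nnz = int(rows * cols * fill)             # 1,000,000 non-zero entries

IDX_BITS, VAL_BITS, PTR_BITS = 32, 32, 64

# COO: one (row index, column index, value) triple per non-zero.
coo_bits = nnz * (IDX_BITS + IDX_BITS + VAL_BITS)

# CSC: one (row index, value) pair per non-zero plus cols + 1 column pointers.
csc_bits = nnz * (IDX_BITS + VAL_BITS) + (cols + 1) * PTR_BITS

saving = 1 - csc_bits / coo_bits
print(f"COO: {coo_bits / 8 / 1e6:.1f} MB, CSC: {csc_bits / 8 / 1e6:.1f} MB, "
      f"saving: {saving:.1%}")            # roughly 33%, so the >= 30% claim holds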
Spark DAG Analysis
- Stage Matching: The given directed acyclic graph (DAG) shows the stages of a Spark job; each operation (reduce, groupBy, union, join, map) can be matched to one of the stages (I-V), with stage boundaries corresponding to the shuffles caused by the wide operations.
(Transformation/Action) Types
- Spark Operations: In the set {reduce, groupBy, union, join, map}, reduce is an action, while groupBy, union, join, and map are transformations; among the transformations, map and union are narrow, whereas groupBy and join are wide (join is only narrow when both inputs are co-partitioned).
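A small PySpark sketch that exercises these operations; it is an illustration only (it needs a local Spark installation, and the data is made up).

from pyspark import SparkContext

sc = SparkContext("local[*]", "transformations-vs-actions")

nums = sc.parallelize([1, 2, 3, 4])
pairs = sc.parallelize([("a", 1), ("b", 2)])
more = sc.parallelize([("a", 10), ("c", 30)])

doubled = nums.map(lambda x: 2 * x)            # map:     narrow transformation
combined = pairs.union(more)                   # union:   narrow transformation
grouped = combined.groupBy(lambda kv: kv[0])   # groupBy: wide transformation (shuffle)
joined = pairs.join(more)                      # join:    wide transformation (shuffle)

total = doubled.reduce(lambda a, b: a + b)     # reduce:  action, triggers execution
print(total, joined.collect(), grouped.keys().collect())

sc.stop()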
Description
Test your knowledge on key concepts related to benchmarking, resource scheduling, and matrix multiplication within distributed systems. This quiz covers critical challenges and operational principles including RDD failures and stream processing. Perfect for BDS students preparing for exams in the field of computer science.