Parallelism (PAR): Data Decomposition Techniques
Document Details
Universitat Politècnica de Catalunya
Josep Ramon Herrero, Gladys Utrera, Jordi Tubella, Eduard Ayguadé, Daniel Jiménez
Summary
This document presents data decomposition strategies for improving parallel computation efficiency. It explores strategies for reducing memory coherence traffic and avoiding false sharing, along with examples and code transformations. The context is within the broader subject of computer architecture, with a focus on cache-coherent NUMA shared-memory multiprocessors; distributed-memory architectures appear only as optional further material.
Full Transcript
Data-aware task decomposition strategies
(or... how to reduce memory coherence traffic in your parallelization)
Josep Ramon Herrero (T10), Gladys Utrera (T20), Jordi Tubella (T40), Eduard Ayguadé and Daniel Jiménez
Computer Architecture Department, Universitat Politècnica de Catalunya
Course 2024/25 (Fall semester)

Learning material for this Unit
- Atenea: Unit 5 Data decomposition
- Atenea quiz with motivation example
- Going further: distributed-memory architectures video lesson (OPTIONAL)
- These slides, to dive deeper into the concepts of this Unit
- Collection of Exercises: problems in Chapter 5

Task creation in OpenMP (summary)
- #pragma omp parallel: one implicit task is created for each thread in the team (and immediately executed)
- int omp_get_num_threads(): returns the number of threads in the current team; 1 if called outside a parallel region
- int omp_get_thread_num(): returns the identifier of the calling thread within the current team, between 0 and omp_get_num_threads()-1

Outline
- Reducing memory coherence traffic: improving locality by data decomposition
- Reducing memory coherence traffic: avoiding false sharing

Why should we consider data decomposition?
- Used to derive concurrency for problems that operate on large amounts of data, focusing on the multiplicity of data
- E.g. elements in vectors, rows/columns/slices in matrices, elements in a list, and subtrees in a tree
- ... for architectures in which memory plays a performance role
[Figure: multiprocessor memory organizations; processors with private L1 and L2 caches attached through hubs to per-node memory/directory modules, connected by coherent and non-coherent interconnection networks]

Task vs. data decompositions
We can imagine (1) data to be distributed across the multiple memories in our NUMA multiprocessor system... then, can we try to assign work so that tasks executed in a certain NUMA node access the data that is stored in the main memory of that NUMA node?
- Use of the implicit tasks created by parallel...
- ... and the identifier of the thread each one runs on to decide what to execute
(1) Easy to imagine if we remember first touch, which brings data to the memory of the NUMA node that first touches it.

What, how and who?
- What to decompose?: identify the data used and/or produced in the computations: output data, input data, or both
- How to decompose?: partition this data across the tasks, with a linear or geometric decomposition, or a recursive decomposition. In distributed-memory architectures, add the necessary data allocation and movement actions.
- Who? How to induce the computational partitioning that corresponds to the data partitioning?: the owner-computes rule (a small code sketch of the overall scheme follows below)
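As a minimal sketch of this scheme (not from the slides; the array x, its size N and the function name initialize_block are assumptions for illustration), each implicit task can use its thread identifier to pick the chunk of data it owns and touches first:

    #include <omp.h>

    #define N 1024
    double x[N];

    void initialize_block(void)
    {
        #pragma omp parallel
        {   /* one implicit task per thread in the team */
            int my_id   = omp_get_thread_num();
            int howmany = omp_get_num_threads();
            /* block decomposition: each implicit task owns one contiguous chunk of x */
            int chunk = (N + howmany - 1) / howmany;
            int lower = my_id * chunk;
            int upper = (lower + chunk < N) ? lower + chunk : N;
            for (int i = lower; i < upper; i++)
                x[i] = 0.0;   /* with first touch, this chunk is placed in this thread's NUMA node */
        }
    }

If later computations use the same lower/upper bounds, each thread keeps working on the data that first touch placed in its own NUMA node, which is exactly the locality effect the slide is after.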
Guidelines for data decomposition
- Data can be partitioned in various ways; this choice may critically impact performance
- Generate comparable amounts of work (for load balancing)
- Maximize data locality (or minimize the need for task interactions):
  - minimize the volume of data involved in task interactions
  - minimize the frequency of interactions
  - minimize contention and hot spots
  - overlap computation with interactions to "hide" their effect
- Parametrizable data partition: number of data chunks, size, ...
- Simplicity

Examples of "what"
Counting the instances of given itemsets in a database of transactions.
(a) Transactions (input), itemsets (input), and frequencies (output):

    Database transactions    Itemset    Frequency
    A, B, C, E, G, H         A, B, C    1
    B, D, E, F, K, L         D, E       3
    A, B, F, H, L            C, F, G    0
    D, E, F, H               A, E       2
    F, G, H, K               C, D       1
    A, E, F, K, L            D, K       2
    B, C, D, G, H, L         B, C, F    0
    G, H, L                  C, D, K    0
    D, E, F, K, L
    F, G, H, L

(b) Partitioning the frequencies (and itemsets) among the tasks: task 1 counts itemsets A,B,C (1), D,E (3), C,F,G (0) and A,E (2); task 2 counts C,D (1), D,K (2), B,C,F (0) and C,D,K (0); both tasks scan the whole database.

Output data decomposition
- Partition the output data structures across tasks. Input data structures may follow the same decomposition, or may require replication in order to avoid task interactions
- Example: the itemset frequencies are partitioned across tasks
  - The database of transactions needs to be replicated
  - The itemsets can be partitioned across tasks as well (reducing memory utilization)
[Figure: partitioning (b) above, with task 1 and task 2 each holding half of the itemsets and frequencies and a full copy of the database]

Input data decomposition
- Partition the input data structures across tasks. It may require combining partial results in order to generate the output data structures
- Example: the database transactions can be partitioned, but this requires the itemsets to be replicated, followed by a final aggregation of the partial counts for all itemsets (a code sketch contrasting the two decompositions follows below)
[Figure: partitioning the transactions among the tasks; task 1 scans the first five transactions and task 2 the last five, each with a replicated copy of all itemsets; e.g. D,E is counted 2 times by task 1 and 1 time by task 2, aggregating to 3]
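To make these two decompositions concrete, here is a minimal C/OpenMP sketch of the output-data variant of the itemset-counting example (not from the slides; the contains helper, the array names and the sizes are hypothetical):

    #include <omp.h>

    #define NTRANS    10   /* transactions in the database (matching the example) */
    #define NITEMSETS 8    /* itemsets whose frequencies are counted */

    int contains(int t, int s);   /* hypothetical helper: 1 if transaction t contains itemset s */

    int freq[NITEMSETS];          /* output data, partitioned across the implicit tasks */

    void count_itemsets(void)
    {
        #pragma omp parallel
        {
            int my_id   = omp_get_thread_num();
            int howmany = omp_get_num_threads();
            /* owner computes on the output: thread my_id updates only its block of freq[] */
            int lower = my_id * NITEMSETS / howmany;
            int upper = (my_id + 1) * NITEMSETS / howmany;
            for (int s = lower; s < upper; s++) {
                freq[s] = 0;
                for (int t = 0; t < NTRANS; t++)   /* every thread reads the whole (replicated) database */
                    if (contains(t, s))
                        freq[s]++;
            }
        }
    }

No synchronization is needed here because each element of freq has exactly one owner; an input decomposition would instead partition the t loop across threads and require the per-thread partial counts to be combined at the end (e.g. with a reduction).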
Input and output data decomposition
- Input and output data decompositions can be combined
- Example: the database and the itemsets (input) and the counts (output) can all be decomposed
[Figure: partitioning both transactions and frequencies among the tasks; splitting each in two yields four tasks (task 1 to task 4), each scanning half of the database for half of the itemsets, with partial counts aggregated afterwards]

How: Geometric Data Decomposition (1)
[Figure: block (left) and cyclic (right) data decompositions of an iteration space among processors P0 to P3]

How: Geometric Data Decomposition (2)
[Figure: block (left) and cyclic (right) data decompositions in a triangular iteration space; a block partition gives the processors unequal amounts of work, while a cyclic partition balances them]

How: Geometric Data Decomposition (3)
[Figure: cyclic (left) and block-cyclic (right) data decompositions]

Who: The Owner Computes rule
It defines who is responsible for doing the computations:
- In the case of output data decomposition, the owner-computes rule implies that the output is computed by the task to which the output data is assigned.
- In the case of input data decomposition, the owner-computes rule implies that all computations that use the input data are performed by the task to which the input is assigned.

Code transformations for data decompositions (1)
CYCLIC DATA DECOMPOSITION, BY ROWS

    #pragma omp parallel private(i, j)
    {
        int my_id = omp_get_thread_num();
        int howmany = omp_get_num_threads();
        for (int i = my_id; i < N; i += howmany)   /* this thread takes rows my_id, my_id+howmany, ... */
            for (int j = 0; j < N; j++)
                compute(i, j);   /* placeholder body: the slide's actual computation is not in the transcript */
    }
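For comparison with the cyclic transformation above, a block-by-rows counterpart might look as follows (a sketch, not part of the transcribed slides; N, compute and the enclosing function are assumed for illustration):

    #include <omp.h>

    #define N 1024

    void compute(int i, int j);   /* hypothetical per-element work, as in the cyclic version */

    void block_by_rows(void)
    {
        #pragma omp parallel
        {
            int my_id   = omp_get_thread_num();
            int howmany = omp_get_num_threads();
            int lower = my_id * N / howmany;         /* first row owned by this thread */
            int upper = (my_id + 1) * N / howmany;   /* one past the last owned row */
            for (int i = lower; i < upper; i++)      /* contiguous block of rows */
                for (int j = 0; j < N; j++)
                    compute(i, j);
        }
    }

The integer-division bounds make consecutive blocks cover all N rows even when N is not a multiple of howmany; the trade-off against the cyclic version is contiguity (better locality per thread) versus load balance when the per-row work is irregular, as in the triangular iteration space shown earlier.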