Principles of Parallel Algorithm Design


Transcript


Principles of Parallel Algorithm Design
Parallel & Distributed Processing - 1502412
Prof. Ali El-Moursy

Chapter Overview: Algorithms and Concurrency
Introduction to Parallel Algorithms
– Tasks and Decomposition
– Processes and Mapping
Decomposition Techniques
– Data Decomposition
– Recursive Decomposition
– Exploratory Decomposition
– Hybrid Decomposition
Characteristics of Tasks and Interactions
– Task Generation, Granularity, and Context
– Characteristics of Task Interactions

Chapter Overview: Concurrency and Mapping
Mapping Techniques for Load Balancing
– Static and Dynamic Mapping
Methods for Minimizing Interaction Overheads
– Maximizing Data Locality
– Minimizing Contention and Hot-Spots
– Overlapping Communication and Computation
– Replication vs. Communication
– Group Communications vs. Point-to-Point Communication
Parallel Algorithm Design Models
– Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models

Basic Steps for Developing a Parallel Algorithm
Algorithm development is a critical component of problem solving using computers. A sequential algorithm is essentially a recipe, or a sequence of basic steps, for solving a given problem using a serial computer. Similarly, a parallel algorithm is a recipe that tells us how to solve a given problem using multiple processors. A parallel algorithm has the added dimension of concurrency, and the algorithm designer must specify sets of steps that can be executed simultaneously.

Developing a parallel algorithm may involve some or all of the following:
– Identifying portions of the work that can be performed concurrently.
– Mapping the concurrent pieces of work onto multiple processes running in parallel.
– Distributing the input, output, and intermediate data associated with the program.
– Managing accesses to data shared by multiple processors.
– Synchronizing the processors at various stages of the parallel program execution.

Preliminaries: Decomposition, Tasks, and Dependency Graphs
The process of dividing a computation into smaller parts, some or all of which may potentially be executed in parallel, is called decomposition. Tasks are programmer-defined units of computation into which the main computation is subdivided by means of decomposition. Simultaneous execution of multiple tasks is the key to reducing the time required to solve the entire problem.

The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently. A given problem may be decomposed into tasks in many different ways, and the tasks may be of the same or different sizes. A decomposition can be illustrated in the form of a directed graph, with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph.

Example: Multiplying a Dense Matrix with a Vector
The computation of each element of the output vector y is independent of the other elements. Based on this, a dense matrix-vector product can be decomposed into n tasks, one per element of y; a minimal sketch follows.
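A minimal Python sketch of this fine-grained, one-task-per-output-element decomposition (illustrative only; the function names and the use of a thread pool are assumptions, not part of the slides):

```python
# Each task computes one element of y = A * b; tasks share the read-only
# vector b but have no control dependencies on one another.
from concurrent.futures import ThreadPoolExecutor

def row_task(A_row, b):
    # Task: dot product of one row of A with the shared vector b.
    return sum(a * x for a, x in zip(A_row, b))

def matvec_fine_grained(A, b):
    # n independent tasks, one per output element.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda row: row_task(row, b), A))

A = [[1, 2], [3, 4], [5, 6]]
b = [10, 1]
print(matvec_fine_grained(A, b))  # [12, 34, 56]
```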
The figure (not reproduced in this transcript) highlights the portion of the matrix and vector accessed by Task 1.
Observations:
– While the tasks share data (namely, the vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other.
– All tasks are of the same size in terms of number of operations.
– Is this the maximum number of tasks we could decompose this problem into?

Granularity of Task Decompositions
The number of tasks into which a problem is decomposed determines its granularity. Decomposition into a large number of small tasks is called fine-grained decomposition; decomposition into a small number of large tasks is called coarse-grained decomposition. A coarse-grained counterpart to the dense matrix-vector product example: each task corresponds to the computation of three elements of the result vector.

Preliminaries: Decomposition, Tasks, and Dependency Graphs
A decomposition can be illustrated in the form of a directed graph with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next, i.e., the dependencies among the tasks. Such a graph is called a task dependency graph (TDG). The task corresponding to a node can be executed when all tasks connected to this node by incoming edges have completed.

Example: Database Query Processing
Consider the execution of the query
MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
on the following database:

ID#    Model    Year  Color  Dealer  Price
4523   Civic    2002  Blue   MN      $18,000
3476   Corolla  1999  White  IL      $15,000
7623   Camry    2001  Green  NY      $21,000
9834   Prius    2001  Green  CA      $18,000
6734   Civic    2001  White  OR      $17,000
5342   Altima   2001  Green  FL      $19,000
3845   Maxima   2001  Blue   NY      $22,000
8354   Accord   2000  Green  VT      $18,000
4395   Civic    2001  Red    CA      $17,000
7352   Civic    2002  Red    WA      $18,000

The execution of the query can be divided into subtasks in various ways. Each task can be thought of as generating an intermediate table of entries that satisfy a particular clause. Decomposing the given query into a number of tasks gives a graph whose edges denote that the output of one task is needed to accomplish the next. Different task decompositions may lead to significant differences with respect to their eventual parallel performance.

Critical Path Length
A directed path in the task dependency graph represents a sequence of tasks that must be processed one after the other. The longest such path determines the shortest time in which the program can be executed in parallel. The length of the longest path in a task dependency graph is called the critical path length; a small sketch of how it can be computed follows.
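A minimal Python sketch of computing the critical path length of a task dependency graph from per-task costs (the graph and costs below are illustrative placeholders, not the query decompositions from the slides):

```python
# Critical path length = the largest "earliest finish time" over all tasks,
# where a task can only start after all of its predecessors have finished.
from functools import lru_cache

deps = {"A": [], "B": [], "C": ["A"], "D": ["A", "B"], "E": ["C", "D"]}
cost = {"A": 10, "B": 10, "C": 6, "D": 6, "E": 8}

@lru_cache(maxsize=None)
def finish_time(task):
    # Own cost plus the latest finish time among the tasks it depends on.
    return cost[task] + max((finish_time(p) for p in deps[task]), default=0)

critical_path_length = max(finish_time(t) for t in deps)
print(critical_path_length)  # 24, e.g. the path A -> C -> E (10 + 6 + 8)
```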
Degree of Concurrency
The number of tasks that can be executed in parallel is the degree of concurrency of a decomposition. Since the number of tasks that can be executed in parallel may change over the course of program execution, the maximum degree of concurrency is the maximum number of such tasks at any point during execution; it indicates the maximum number of processors needed. The average degree of concurrency is calculated to indicate how efficiently the processors are used. The degree of concurrency increases as the decomposition becomes finer in granularity, and vice versa. What is the maximum degree of concurrency of the database query examples? Assuming that each task in the database example takes identical processing time, what is the average degree of concurrency in each decomposition?

Example: Database Query Processing
The same query and database as before, with the records numbered 1 through 10 for reference.

Case study: Database Query Processing
Consider the task dependency graphs of the two database query decompositions. (A sequence of slides steps through the two graphs and their numeric annotations; the figures are not reproduced in this transcript.)

If each search task takes 1 time unit:
– What are the critical path lengths for the two task dependency graphs? a) 27, b) 34
– What is the shortest parallel execution time for each decomposition?

How is this related to what we learned in the previous chapter? The critical path length for a parallel execution is Tp:
– a) Tp = 27 time units, b) Tp = 34 time units
What about the serial time Ts?
– a) Ts = 10*4 + 6*2 + 3 + 8 = 63 time units, so Ts/Tp = 63/27 ≈ 2.33
– b) Ts = 10*4 + 6 + 11 + 7 = 64 time units, so Ts/Tp = 64/34 ≈ 1.88
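A quick check of this arithmetic in plain Python (the values are the ones given on the slides):

```python
# Serial work Ts and critical path length Tp for the two decompositions.
Tp_a, Tp_b = 27, 34
Ts_a = 10 * 4 + 6 * 2 + 3 + 8   # decomposition (a)
Ts_b = 10 * 4 + 6 + 11 + 7      # decomposition (b)
print(Ts_a, Ts_b)                                     # 63 64
print(round(Ts_a / Tp_a, 2), round(Ts_b / Tp_b, 2))   # 2.33 1.88
```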
Case study: Database Query Processing
Although the serial times are only slightly different, we compare each decomposition with its own serial time. Can you calculate the efficiency? E = Ts / (p * Tp), here with p = 4:
– a) E = 2.33/4 = 58.25%
– b) E = 1.88/4 = 47%

If each search task takes 1 time unit:
– How many processors are needed in each case to achieve this minimum parallel execution time?
– What is the maximum degree of concurrency?
– What is the average degree of concurrency? a) 63/27 ≈ 2.33, b) 64/34 ≈ 1.88

Limits on Parallel Performance
It would appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity, but there is an inherent bound on how fine the granularity of a computation can be.
– For example, in the case of multiplying a dense matrix with a vector, there can be no more than n² concurrent tasks.
Concurrent tasks may also have to exchange data with other tasks, which results in communication overhead. The tradeoff between the granularity of a decomposition and the associated overheads often determines performance bounds.

Task Interaction Graphs
Subtasks generally exchange data with other subtasks in a decomposition. For example, even in the trivial decomposition of the dense matrix-vector product, if the vector is not replicated across all tasks, the tasks will have to communicate elements of the vector. The graph of tasks (nodes) and their interactions/data exchanges (edges) is referred to as a task interaction graph. Note that task interaction graphs represent data dependencies, whereas task dependency graphs represent control dependencies.

Decomposition Techniques
So how does one decompose a task into various subtasks? While there is no single recipe that works for all problems, we present a set of commonly used techniques that apply to broad classes of problems. These include:
– Recursive decomposition
– Data decomposition
– Exploratory decomposition
– Hybrid decomposition

Recursive Decomposition
Generally suited to problems that are solved using the divide-and-conquer strategy. A given problem is first decomposed into a set of sub-problems. These sub-problems are recursively decomposed further until a desired granularity is reached.

Recursive Decomposition: Example
The problem of finding the minimum number in a given list (or indeed any other associative operation such as sum, AND, etc.) can be fashioned as a divide-and-conquer algorithm. We first start with a simple serial loop for computing the minimum entry in a given list:

procedure SERIAL_MIN(A, n)
begin
    min := A[0];
    for i := 1 to n - 1 do
        if (A[i] < min) min := A[i];
    endfor;
    return min;
end SERIAL_MIN
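For illustration, a minimal Python sketch (assumed code, not the slides' pseudocode) of the divide-and-conquer formulation of MIN, in which the two half-problems are independent tasks that could be executed concurrently:

```python
def recursive_min(A, lo, hi):
    # Find the minimum of A[lo:hi] by divide and conquer.
    if hi - lo == 1:          # base case: a single element
        return A[lo]
    mid = (lo + hi) // 2
    # The two recursive calls have no dependency on each other, so they
    # form two tasks that a parallel implementation could run concurrently.
    left = recursive_min(A, lo, mid)
    right = recursive_min(A, mid, hi)
    return left if left < right else right

print(recursive_min([4, 9, 1, 7, 8, 11, 2, 12], 0, 8))  # 1
```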
