Parallel Computing Unit 1 - Introduction to Parallel Computing PDF

Document Details

Universiti Tenaga Nasional

2022

Tags

parallel computing, computer science, programming, lecture notes

Summary

These lecture notes cover Parallel Computing Unit 1 from Universiti Tenaga Nasional. They address the history of processor performance, the limitations of serial computing, applications of parallel computing, parallel computing resources, and the motivation and approaches for writing parallel programs.

Full Transcript

Unit 1 – Introduction to Parallel Computing

History

From 1986 to 2003, the performance of microprocessors increased, on average, more than 50% per year. Since 2003, however, single-processor performance improvement has slowed to the point that in the period from 2015 to 2017, it increased at less than 4% per year. Furthermore, this difference in performance increase has been associated with a dramatic change in processor design. By 2005, most of the major manufacturers of microprocessors had decided that the road to rapidly increasing performance lay in the direction of parallelism. Rather than trying to continue to develop ever-faster monolithic processors, manufacturers started putting multiple complete processors on a single integrated circuit.

This change has a very important consequence for software developers: simply adding more processors will not magically improve the performance of the vast majority of serial programs, that is, programs that were written to run on a single processor. Such programs are unaware of the existence of multiple processors, and the performance of such a program on a system with multiple processors will be effectively the same as its performance on a single processor of the multiprocessor system. All of this raises a number of questions:

- Why do we care? Aren't single-processor systems fast enough?
- Why can't microprocessor manufacturers continue to develop much faster single-processor systems? Why build parallel systems? Why build systems with multiple processors?
- Why can't we write programs that will automatically convert serial programs into parallel programs, that is, programs that take advantage of the presence of multiple processors?

Serial Computation

Traditionally, software has been written for serial computation:
- run on a single computer
- instructions are run one after another
- only one instruction is executed at a time

Limitations of Serial Computing

- Speed: Serial computing can be slow for tasks that could be parallelized. Each instruction is executed one after the other, which can lead to slower overall execution times, especially for complex tasks.
- Resource utilization: Serial computing may not effectively utilize all available resources. Because only one instruction is executed at a time, the CPU may not be fully utilized: while one instruction is being processed, other parts of the CPU may remain idle, leading to inefficient use of computational resources.
- Scalability: Serial computing may not scale well to large or complex problems. As the size or complexity of a task increases, the time required to complete it may become impractical.
- Limited multitasking: With serial computing, it can be challenging to perform multiple tasks simultaneously or to switch between tasks efficiently. This limitation can impact the overall efficiency and responsiveness of a system.

Parallel Computing

Parallel computing is the simultaneous use of multiple compute resources to solve a single problem: the use of multiple processors or computers working together on a common task.

What is Parallel Computing?
In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem. To be run using multiple CPUs:
- A problem is broken into discrete parts that can be solved concurrently.
- Each part is further broken down into a series of instructions.
- Instructions from each part execute simultaneously on different CPUs.

Parallel Computing: Resources

The compute resources can include:
- A single computer with multiple processors
- A single computer with one or more processors and some specialized compute resources (GPU - graphics processing unit, FPGA - field-programmable gate array, etc.)
- An arbitrary number of computers connected by a network
- A combination of the above, for example multiple networked computers, each with multiple processors

Characteristics of a Computational Problem

A problem suited to parallel computing can typically be:
- broken apart into discrete pieces of work that can be solved simultaneously
- used to execute multiple program instructions at any moment in time
- solved in less time with multiple compute resources than with a single compute resource

Why we need ever-increasing performance

As our computational power increases, the number of problems that we can seriously consider solving also increases. Here are a few examples:

- Climate modeling. To better understand climate change, we need far more accurate computer models, models that include interactions between the atmosphere, the oceans, solid land, and the ice caps at the poles. We also need to be able to make detailed studies of how various interventions might affect the global climate.
- Energy research. Increased computational power will make it possible to program much more detailed models of technologies such as wind turbines, solar cells, and batteries. These programs may provide the information needed to construct far more efficient clean energy sources.
- Data analysis. We generate tremendous amounts of data. By some estimates, the quantity of data stored worldwide doubles every two years, but the vast majority of it is largely useless unless it's analyzed. As an example, knowing the sequence of nucleotides in human DNA is, by itself, of little use. Understanding how this sequence affects development and how it can cause disease requires extensive analysis.
In addition to genomics, huge quantities of data are generated by particle colliders such as the Large Hadron Collider at CERN, medical imaging, astronomical research, and Web search engines, to name a few.

Applications of Parallel Computing

Parallel computing suits problems with multiple complex, interrelated events happening at the same time, yet within a sequence, e.g. weather and ocean patterns, planetary and galactic orbits, and so on. Traditionally, parallel computing has been considered to be "the high end of computing" and has been motivated by numerical simulations of complex systems and "Grand Challenge Problems" such as:
- weather and climate
- chemical and nuclear reactions
- biology and the human genome
- geology and seismic activity
- mechanical devices, from prosthetics to spacecraft
- electronic circuits
- manufacturing processes

The real world is massively parallel.

Why we're building parallel systems

Much of the tremendous increase in single-processor performance was driven by the ever-increasing density of transistors, the electronic switches, on integrated circuits. As the size of transistors decreases, their speed can be increased, and the overall speed of the integrated circuit can be increased. However, as the speed of transistors increases, their power consumption also increases. Therefore, it is becoming impossible to continue to increase the speed of integrated circuits.

How, then, can we continue to build ever more powerful computers? The answer is parallelism. Rather than building ever-faster, more complex, monolithic processors, the industry has decided to put multiple, relatively simple, complete processors on a single chip. Such integrated circuits are called multicore processors.

Why we need to write parallel programs

Most programs that have been written for conventional, single-core systems cannot exploit the presence of multiple cores. We can run multiple instances of a program on a multicore system, but this is often of little help. For example, being able to run multiple instances of our favorite game isn't really what we want; we want the program to run faster with more realistic graphics. To do this, we need either to rewrite our serial programs so that they're parallel, so that they can make use of multiple cores, or to write translation programs, that is, programs that will automatically convert serial programs into parallel programs.

An efficient parallel implementation of a serial program may not be obtained by finding an efficient parallelization of each of its steps. Rather, the best parallelization may be obtained by devising an entirely new algorithm.

As an example, suppose that we need to compute n values and add them together. We know that this can be done with the following serial code:

    sum = 0;
    for (i = 0; i < n; i++) {
        x = Compute_next_value(...);
        sum += x;
    }

Now suppose we also have p cores and p ≤ n. Then each core can form a partial sum of approximately n/p values:

    my_sum = 0;
    my_first_i = ...;
    my_last_i = ...;
    for (my_i = my_first_i; my_i < my_last_i; my_i++) {
        my_x = Compute_next_value(...);
        my_sum += my_x;
    }

Here the prefix my_ indicates that each core is using its own, private variables, and each core can execute this block of code independently of the other cores.
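For concreteness, here is a minimal runnable sketch of this block-partitioned sum in C with OpenMP (one of the APIs introduced later in this unit). The Compute_next_value stub and the simple n/p block partition used for my_first_i and my_last_i are illustrative assumptions, not part of the original slides:

    /* Block-partitioned sum: OpenMP threads stand in for cores.
       Compile with: gcc -fopenmp sum.c */
    #include <stdio.h>
    #include <omp.h>

    static double Compute_next_value(int i) {   /* placeholder for real work */
        return (double) i;
    }

    int main(void) {
        int n = 24;
        double sum = 0.0;

        #pragma omp parallel
        {
            int p = omp_get_num_threads();
            int my_rank = omp_get_thread_num();
            /* Block partition: thread my_rank handles roughly n/p consecutive values. */
            int my_first_i = my_rank * n / p;
            int my_last_i = (my_rank + 1) * n / p;

            double my_sum = 0.0;
            for (int my_i = my_first_i; my_i < my_last_i; my_i++)
                my_sum += Compute_next_value(my_i);

            /* Combine the partial sums, one thread at a time. */
            #pragma omp critical
            sum += my_sum;
        }

        printf("sum = %f\n", sum);
        return 0;
    }

Each thread plays the role of a core: it computes a private my_sum over its own block of indices, and the partial sums are then combined into the shared variable sum.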
After each core completes execution of this code, its variable my_sum will store the sum of the values computed by its calls to Compute_next_value. For example, if there are eight cores, n = 24, and the 24 calls to Compute_next_value return the values

    1, 4, 3,   9, 2, 8,   5, 1, 1,   6, 2, 7,   2, 5, 0,   4, 1, 8,   6, 5, 1,   2, 3, 9,

then the values stored in my_sum might be:

    Core     0    1    2    3    4    5    6    7
    my_sum   8   19    7   15    7   13   12   14

When the cores are done computing their values of my_sum, they can form a global sum by sending their results to a designated "master" core, which can add their results:

    if (I'm the master core) {
        sum = my_sum;
        for each core other than myself {
            receive value from core;
            sum += value;
        }
    } else {
        send my_sum to the master;
    }

In our example, if the master core is core 0, it would add the values 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95.

But you can probably see a better way to do this, especially if the number of cores is large. Instead of making the master core do all the work of computing the final sum, we can pair the cores so that while core 0 adds in the result of core 1, core 2 can add in the result of core 3, core 4 can add in the result of core 5, and so on. Then we can repeat the process with only the even-ranked cores: 0 adds in the result of 2, 4 adds in the result of 6, and so on.

For both "global" sums, the master core (core 0) does more work than any other core, and the length of time it takes the program to complete the final sum should be the length of time it takes for the master to complete. However, with eight cores, the master will carry out seven receives and adds using the first method, while with the second method it will only carry out three. So the second method results in an improvement of more than a factor of two. The difference becomes much more dramatic with large numbers of cores: with 1000 cores, the first method will require 999 receives and adds, while the second will only require 10, an improvement of almost a factor of 100!
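As a rough illustration of this pairwise, tree-structured scheme, here is a small C sketch that holds the eight partial sums from the example above in an array and combines them over log2(p) rounds. On a real parallel system each core would hold only its own my_sum and the additions within a round would happen simultaneously; the array and the power-of-two core count are simplifying assumptions for illustration:

    #include <stdio.h>

    int main(void) {
        /* Partial sums from the eight-core example above. */
        double my_sum[8] = {8, 19, 7, 15, 7, 13, 12, 14};
        int p = 8;                        /* assumed to be a power of two */

        /* In each round, every core whose rank is a multiple of 2*step adds in
           the partial sum held by the core 'step' positions above it. */
        for (int step = 1; step < p; step *= 2)
            for (int k = 0; k + step < p; k += 2 * step)
                my_sum[k] += my_sum[k + step];

        printf("global sum = %.0f\n", my_sum[0]);   /* prints 95 */
        return 0;
    }

With p = 8 the outer loop runs for step = 1, 2, 4, so core 0 carries out only three additions instead of seven, matching the counts described above.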
How do we write parallel programs?

There are two widely used approaches: task-parallelism and data-parallelism.

Task-parallelism: partition the various tasks carried out in solving the problem among the cores. As an example, suppose that Prof P has to teach a section of "Survey of English Literature." She has one hundred students in her section, so she's been assigned four teaching assistants (TAs): Mr. A, Ms. B, Mr. C, and Ms. D. When the semester is over, Prof P makes up a final exam that consists of five questions. To grade the exam, she and her TAs might consider two options. In the first, each of them grades all one hundred responses to one of the questions; say, P grades question 1, A grades question 2, and so on.

Data-parallelism: partition the data used in solving the problem among the cores, and each core carries out more or less similar operations on its part of the data. In the second option, they divide the one hundred exams into five piles of twenty exams each, and each of them grades all the papers in one of the piles; P grades the papers in the first pile, A grades the papers in the second pile, and so on.

In both approaches the "cores" are the professor and her TAs. The first approach might be considered an example of task-parallelism. There are five tasks to be carried out: grading the first question, grading the second question, and so on. Presumably, the graders will be looking for different information in question 1, which is about Shakespeare, than in question 2, which is about Milton, and so on. So the professor and her TAs will be "executing different instructions." On the other hand, the second approach might be considered an example of data-parallelism. The "data" are the students' papers, which are divided among the cores, and each core applies more or less the same grading instructions to each paper.

Coordination

In both global sum examples, the coordination of the cores involves:

- Communication: one or more cores send their current partial sums to another core.
- Load balancing: in the first part of the global sum, it's clear that we want the amount of time taken by each core to be roughly the same as the time taken by the other cores. If the cores are identical, and each call to Compute_next_value requires the same amount of work, then we want each core to be assigned roughly the same number of values as the other cores. If, for example, one core has to compute most of the values, then the other cores will finish much sooner than the heavily loaded core, and their computational power will be wasted.
- Synchronization: in most systems the cores are not automatically synchronized; rather, each core works at its own pace. The problem is that we don't want the other cores to race ahead and start computing their partial sums before the master is done initializing the input data and making it available to them. That is, the cores need to wait before starting execution of the code.

What we'll be doing

We'll be focusing on learning to write programs that are explicitly parallel. Our purpose is to learn the basics of programming parallel computers using the C language and four different APIs, or application program interfaces:
- Message-Passing Interface (MPI)
- POSIX threads (Pthreads)
- OpenMP
- CUDA

MPI and Pthreads are libraries of type definitions, functions, and macros that can be used in C programs. OpenMP consists of a library and some modifications to the C compiler. CUDA consists of a library and modifications to the C++ compiler.

You may well wonder why we're learning about four different APIs instead of just one. The answer has to do with both the APIs and the kinds of parallel systems they target. Currently, there are two main ways of classifying parallel systems: by the memory that the different cores have access to, and by whether the cores can operate independently of each other.

Memory

In the memory classification, we'll be focusing on shared-memory systems and distributed-memory systems.

Shared-memory systems: cores can share access to the computer's memory; in principle, each core can read and write each memory location. In a shared-memory system, we can coordinate the cores by having them examine and update shared-memory locations.

Distributed-memory systems: each core has its own, private memory, and the cores communicate explicitly by doing something like sending messages across a network.
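As a small illustration of explicit message passing on a distributed-memory system, here is a sketch of the earlier master-core global sum written with MPI (introduced properly later in the course). The placeholder partial sum is an assumption for illustration; only the send/receive pattern mirrors the pseudocode above:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank, p;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        double my_sum = my_rank + 1.0;   /* placeholder for a real partial sum */

        if (my_rank == 0) {              /* I'm the master core */
            double sum = my_sum, value;
            for (int q = 1; q < p; q++) {
                MPI_Recv(&value, 1, MPI_DOUBLE, q, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                sum += value;
            }
            printf("global sum = %f\n", sum);
        } else {                         /* send my_sum to the master */
            MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Assuming an MPI installation, this could be compiled with mpicc and launched on, say, eight processes with mpiexec -n 8.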
The next classification divides parallel systems according to the number of independent instruction streams and the number of independent data streams.

In one type of system, the cores can be thought of as conventional processors: they have their own control units, and they are capable of operating independently of each other. Each core can manage its own instruction stream and its own data stream, so this type of system is called a Multiple-Instruction, Multiple-Data or MIMD system.

In the other type of parallel system, the cores are not capable of managing their own instruction streams: they can be thought of as cores with no control unit. Rather, the cores share a single control unit. However, each core can access either its own private memory or memory that's shared among the cores. In this type of system, all the cores carry out the same instruction on their own data, so this type of system is called a Single-Instruction, Multiple-Data or SIMD system.

In a MIMD system, it's perfectly feasible for one core to execute an addition while another core executes a multiplication. In a SIMD system, two cores either execute the same instruction (on their own data) or, if they need to execute different instructions, one executes its instruction while the other is idle, and then the second executes its instruction while the first is idle. So in a SIMD system we couldn't have one core executing an addition while another core executes a multiplication at the same time; the system would have to do something like this:

    Time 1:  core 0 executes the addition         core 1 is idle
    Time 2:  core 0 is idle                       core 1 executes the multiplication

Our different APIs are used for programming different types of systems:
- MPI is an API for programming distributed-memory MIMD systems.
- Pthreads is an API for programming shared-memory MIMD systems.
- OpenMP is an API for programming both shared-memory MIMD and shared-memory SIMD systems, although we'll be focusing on programming MIMD systems.
- CUDA is an API for programming Nvidia GPUs.

2022 Universiti Tenaga Nasional. All rights reserved.
