PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS
2nd Edition

BARRY WILKINSON
University of North Carolina at Charlotte
Western Carolina University

MICHAEL ALLEN
University of North Carolina at Charlotte

Upper Saddle River, NJ 07458

Library of Congress Cataloging-in-Publication Data: CIP DATA AVAILABLE.

Vice President and Editorial Director, ECS: Marcia Horton
Executive Editor: Kate Hargett
Vice President and Director of Production and Manufacturing, ESM: David W. Riccardi
Executive Managing Editor: Vince O'Brien
Managing Editor: Camille Trentacoste
Production Editor: John Keegan
Director of Creative Services: Paul Belfanti
Art Director: Jayne Conte
Cover Designer: Kiwi Design
Managing Editor, AV Management and Production: Patricia Burns
Art Editor: Gregory Dulles
Manufacturing Manager: Trudy Pisciotti
Manufacturing Buyer: Lisa McDowell
Marketing Manager: Pamela Hersperger

© 2005, 1999 Pearson Education, Inc.
Pearson Prentice Hall
Pearson Education, Inc.
Upper Saddle River, NJ 07458

All rights reserved. No part of this book may be reproduced in any form or by any means, without permission in writing from the publisher.

Pearson Prentice Hall® is a trademark of Pearson Education, Inc.

The author and publisher of this book have used their best efforts in preparing this book. These efforts include the development, research, and testing of the theories and programs to determine their effectiveness. The author and publisher make no warranty of any kind, expressed or implied, with regard to these programs or the documentation contained in this book. The author and publisher shall not be liable in any event for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of these programs.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN: 0-13-140563-2

Pearson Education Ltd., London
Pearson Education Australia Pty.
Ltd., Sydney
Pearson Education Singapore, Pte. Ltd.
Pearson Education North Asia Ltd., Hong Kong
Pearson Education Canada, Inc., Toronto
Pearson Educación de Mexico, S.A. de C.V.
Pearson Education—Japan, Tokyo
Pearson Education Malaysia, Pte. Ltd.
Pearson Education, Inc., Upper Saddle River, New Jersey

To my wife, Wendy, and my daughter, Johanna
Barry Wilkinson

To my wife, Bonnie
Michael Allen

Preface

The purpose of this text is to introduce parallel programming techniques. Parallel programming is programming multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than is possible with a single computer. It also offers the opportunity to tackle larger problems, that is, problems with more computational steps or larger memory requirements, the latter because multiple computers and multiprocessor systems often have more total memory than a single computer. In this text, we concentrate upon the use of multiple computers that communicate with one another by sending messages; hence the term message-passing parallel programming. The computers we use can be different types (PC, SUN, SGI, etc.) but must be interconnected, and a software environment must be present for message passing between computers. Suitable computers (either already in a network or capable of being interconnected) are very widely available as the basic computing platform for students, so that it is usually not necessary to acquire a specially designed multiprocessor system. Several software tools are available for message-passing parallel programming, notably several implementations of MPI, which are all freely available. Such software can also be used on specially designed multiprocessor systems should these systems be available for use. So far as practicable, we discuss techniques and applications in a system-independent fashion.

Second Edition.
Since the publication of the first edition of this book, the use of interconnected computers as a high-performance computing platform has become widespread. The term "cluster computing" has come to be used to describe this type of computing. Often the computers used in a cluster are "commodity" computers, that is, low-cost personal computers as used in the home and office. Although the focus of this text, using multiple computers and processors for high-performance computing, has not been changed, we have revised our introductory chapter, Chapter 1, to take into account the move towards commodity clusters and away from specially designed, self-contained, multiprocessors. In the first edition, we described both PVM and MPI and provided an appendix for each. However, only one would normally be used in the classroom. In the second edition, we have deleted specific details of PVM from the text because MPI is now a widely adopted standard and provides for much more powerful mechanisms. PVM can still be used if one wishes, and we still provide support for it on our home page.

Message-passing programming has some disadvantages, notably the need for the programmer to specify explicitly where and when the message passing should occur in the program and what to send. Data has to be sent to those computers that require the data through relatively slow messages. Some have compared this type of programming to assembly language programming, that is, programming using the internal language of the computer, a very low-level and tedious way of programming which is not done except under very specific circumstances. An alternative programming model is the shared memory model. In the first edition, shared memory programming was covered for computers with multiple internal processors and a common shared memory. Such shared memory multiprocessors have now become cost-effective and common, especially dual- and quad-processor systems. Thread programming was described using Pthreads.
Shared memory programming remains in the second edition, with significant new material added, including performance aspects of shared memory programming and a section on OpenMP, a thread-based standard for shared memory programming at a higher level than Pthreads. Any broad-ranging course on practical parallel programming would include shared memory programming, and having some experience with OpenMP is very desirable. A new appendix is added on OpenMP. OpenMP compilers are available at low cost to educational institutions.

With the focus on clusters, a major new chapter has been added on shared memory programming on clusters. The shared memory model can be employed on a cluster with appropriate distributed shared memory (DSM) software. Distributed shared memory programming attempts to obtain the advantages of the scalability of clusters and the elegance of shared memory. Software is freely available to provide the DSM environment, and we shall also show that students can write their own DSM systems (several of our students have done so). We should point out that there are performance issues with DSM. The performance of software DSM cannot be expected to be as good as true shared memory programming on a shared memory multiprocessor. But a large, scalable shared memory multiprocessor is much more expensive than a commodity cluster.

Other changes made for the second edition are related to programming on clusters. New material is added in Chapter 6 on partially synchronous computations, which are particularly important in clusters where synchronization is expensive in time and should be avoided. We have revised and added to Chapter 10 on sorting to include other sorting algorithms for clusters. We have added to the analysis of the algorithms in the first part of the book to include the computation/communication ratio because this is important to message-passing computing. Extra problems have been added.
The appendix on parallel computational models has been removed to maintain a reasonable page count.

The first edition of the text was described as a course text primarily for an undergraduate-level parallel programming course. However, we found that some institutions also used the text as a graduate-level course textbook. We have also used the material for both senior undergraduate-level and graduate-level courses, and it is suitable for beginning graduate-level courses. For a graduate-level course, more advanced materials, for example, DSM implementation and fast Fourier transforms, would be covered and more demanding programming projects chosen.

Structure of Materials. As with the first edition, the text is divided into two parts. Part I now consists of Chapters 1 to 9, and Part II now consists of Chapters 10 to 13. In Part I, the basic techniques of parallel programming are developed. In Chapter 1, the concept of parallel computers is now described with more emphasis on clusters. Chapter 2 describes message-passing routines in general and particular software (MPI). Evaluating the performance of message-passing programs, both theoretically and in practice, is discussed. Chapter 3 describes the ideal problem for parallelization, the embarrassingly parallel computation, where the problem can be divided into independent parts. In fact, important applications can be parallelized in this fashion. Chapters 4, 5, 6, and 7 describe various programming strategies (partitioning and divide and conquer, pipelining, synchronous computations, asynchronous computations, and load balancing). These chapters of Part I cover all the essential aspects of parallel programming with the emphasis on message-passing and using simple problems to demonstrate techniques. The techniques themselves, however, can be applied to a wide range of problems. Sample code is usually given first as sequential code and then as parallel pseudocode.
Often, the underlying algorithm is already parallel in nature and the sequential version has "unnaturally" serialized it using loops. Of course, some algorithms have to be reformulated for efficient parallel solution, and this reformulation may not be immediately apparent. Chapter 8 describes shared memory programming and includes Pthreads, an IEEE standard system that is widely available, and OpenMP. There is also a significant new section on timing and performance issues. The new chapter on distributed shared memory programming has been placed after the shared memory chapter to complete Part I, and the subsequent chapters have been renumbered.

Many parallel computing problems have specially developed algorithms, and in Part II problem-specific algorithms are studied in both non-numeric and numeric domains. For Part II, some mathematical concepts are needed, such as matrices. Topics covered in Part II include sorting (Chapter 10), numerical algorithms, matrix multiplication, linear equations, partial differential equations (Chapter 11), image processing (Chapter 12), and searching and optimization (Chapter 13). Image processing is particularly suitable for parallelization and is included as an interesting application with significant potential for projects. The fast Fourier transform is discussed in the context of image processing. This important transform is also used in many other areas, including signal processing and voice recognition.

A large selection of "real-life" problems drawn from practical situations is presented at the end of each chapter. These problems require no specialized mathematical knowledge and are a unique aspect of this text. They develop skills in the use of parallel programming techniques rather than simply teaching how to solve specific problems, such as sorting numbers or multiplying matrices.

Prerequisites. The prerequisite for studying Part I is a knowledge of sequential programming, as may be learned from using the C language.
The parallel pseudocode in the text uses C-like assignment statements and control flow statements. However, students with only a knowledge of Java will have no difficulty in understanding the pseudocode, because the syntax of the statements is similar to that of Java. Part I can be studied immediately after basic sequential programming has been mastered. Many assignments here can be attempted without specialized mathematical knowledge. If MPI is used for the assignments, programs are usually written in C or C++ calling MPI message-passing library routines. The descriptions of the specific library calls needed are given in Appendix A. It is possible to use Java, although students with only a knowledge of Java should not have any difficulty in writing their assignments in C/C++.

In Part II, the sorting chapter assumes that the student has covered sequential sorting in a data structure or sequential programming course. The numerical algorithms chapter requires the mathematical background that would be expected of senior computer science or engineering undergraduates.

Course Structure. The instructor has some flexibility in the presentation of the materials. Not everything need be covered. In fact, it is usually not possible to cover the whole book in a single semester. A selection of topics from Part I would be suitable as an addition to a normal sequential programming class. We have introduced our first-year students to parallel programming in this way. In that context, the text is a supplement to a sequential programming course text. All of Part I and selected parts of Part II together are suitable as a more advanced undergraduate or beginning graduate-level parallel programming/computing course, and we use the text in that manner.

Home Page. A Web site has been developed for this book as an aid to students and instructors. It can be found at www.cs.uncc.edu/par_prog.
Included at this site are extensive Web pages to help students learn how to compile and run parallel programs. Sample programs are provided for a simple initial assignment to check the software environment. The Web site has been completely redesigned during the preparation of the second edition to include step-by-step instructions for students using navigation buttons. Details of DSM programming are also provided. The new Instructor's Manual is available to instructors, and gives MPI solutions. The original solutions manual gave PVM solutions and is still available. The solutions manuals are available electronically from the authors. A very extensive set of slides is available from the home page.

Acknowledgments. The first edition of this text was the direct outcome of a National Science Foundation grant awarded to the authors at the University of North Carolina at Charlotte to introduce parallel programming in the first college year.1 Without the support of the late Dr. M. Mulder, program director at the National Science Foundation, we would not have been able to pursue the ideas presented in the text. A number of graduate students worked on the original project. Mr. Uday Kamath produced the original solutions manual. We should like to record our thanks to James Robinson, the departmental system administrator who established our local workstation cluster, without which we would not have been able to conduct the work. We should also like to thank the many students at UNC Charlotte who took our classes and helped us refine the material over many years. This included "teleclasses" in which the materials for the first edition were classroom tested in a unique setting.

1. National Science Foundation grant "Introducing parallel programming techniques into the freshman curricula," ref. DUE 9554975.
The teleclasses were broadcast to several North Carolina universities, including UNC Asheville, UNC Greensboro, UNC Wilmington, and North Carolina State University, in addition to UNC Charlotte. Professor Mladen Vouk of North Carolina State University, apart from presenting an expert guest lecture for us, set up an impressive Web page that included "real audio" of our lectures and "automatically turning" slides. (These lectures can be viewed from a link from our home page.) Professor John Board of Duke University and Professor Jan Prins of UNC Chapel Hill also kindly made guest-expert presentations to classes. A parallel programming course based upon the material in this text was also given at the Universidad Nacional de San Luis in Argentina by kind invitation of Professor Raul Gallard.

The National Science Foundation has continued to support our work on cluster computing, and this helped us develop the second edition. A National Science Foundation grant was awarded to us to develop distributed shared memory tools and educational materials.2 Chapter 9, on distributed shared memory programming, describes the work. Subsequently, the National Science Foundation awarded us a grant to conduct a three-day workshop at UNC Charlotte in July 2001 on teaching cluster computing,3 which enabled us to further refine our materials for this book. We wish to record our appreciation to Dr. Andrew Bernat, program director at the National Science Foundation, for his continuing support. He suggested the cluster computing workshop at Charlotte. This workshop was attended by 18 faculty from around the United States. It led to another three-day workshop on teaching cluster computing at Gujarat University, Ahmedabad, India, in December 2001, this time by invitation of the IEEE Task Force on Cluster Computing (TFCC), in association with the IEEE Computer Society, India. The workshop was attended by about 40 faculty.
We are also deeply indebted to several people involved in the workshop, and especially to Mr. Rajkumar Buyya, chairman of the IEEE Computer Society Task Force on Cluster Computing, who suggested it. We are also very grateful to Prentice Hall for providing copies of our textbook free of charge to everyone who attended the workshops.

We have continued to test the materials with student audiences at UNC Charlotte and elsewhere (including the University of Massachusetts, Boston, while on leave of absence). A number of UNC Charlotte students worked with us on projects during the development of the second edition. The new Web page for this edition was developed by Omar Lahbabi and further refined by Sari Ansari, both undergraduate students. The solutions manual in MPI was done by Thad Drum and Gabriel Medin, also undergraduate students at UNC Charlotte.

We would like to express our continuing appreciation to Petra Recter, senior acquisitions editor at Prentice Hall, who supported us throughout the development of the second edition. Reviewers provided us with very helpful advice, especially one anonymous reviewer whose strong views made us revisit many aspects of this book, thereby definitely improving the material.

Finally, we wish to thank the many people who contacted us about the first edition, providing us with corrections and suggestions. We maintained an on-line errata list which was useful as the book went through reprints. All the corrections from the first edition have been incorporated into the second edition. An on-line errata list will be maintained again for the second edition with a link from the home page. We always appreciate being contacted with comments or corrections.

2. National Science Foundation grant "Parallel Programming on Workstation Clusters," ref. DUE 995030.
3. National Science Foundation grant supplement for a cluster computing workshop, ref. DUE 0119508.
Please send comments and corrections to us at [email protected] (Barry Wilkinson) or [email protected] (Michael Allen).

BARRY WILKINSON
Western Carolina University

MICHAEL ALLEN
University of North Carolina, Charlotte

About the Authors

Barry Wilkinson is a full professor in the Department of Computer Science at the University of North Carolina at Charlotte, and also holds a faculty position at Western Carolina University. He previously held faculty positions at Brighton Polytechnic, England (1984–87), the State University of New York, College at New Paltz (1983–84), University College, Cardiff, Wales (1976–83), and the University of Aston, England (1973–76). From 1969 to 1970, he worked on process control computer systems at Ferranti Ltd. He is the author of Computer Peripherals (with D. Horrocks, Hodder and Stoughton, 1980, 2nd ed. 1987), Digital System Design (Prentice Hall, 1987, 2nd ed. 1992), Computer Architecture Design and Performance (Prentice Hall, 1991, 2nd ed. 1996), and The Essence of Digital Design (Prentice Hall, 1997). In addition to these books, he has published many papers in major computer journals. He received a B.S. degree in electrical engineering (with first-class honors) from the University of Salford in 1969, and M.S. and Ph.D. degrees from the University of Manchester (Department of Computer Science), England, in 1971 and 1974, respectively. He has been a senior member of the IEEE since 1983 and received an IEEE Computer Society Certificate of Appreciation in 2001 for his work on the IEEE Task Force on Cluster Computing (TFCC) education program.

Michael Allen is a full professor in the Department of Computer Science at the University of North Carolina at Charlotte.
He previously held faculty positions as an associate and full professor in the Electrical Engineering Department at the University of North Carolina at Charlotte (1974–85), and as an instructor and an assistant professor in the Electrical Engineering Department at the State University of New York at Buffalo (1968–74). From 1985 to 1987, he was on leave from the University of North Carolina at Charlotte while serving as the president and chairman of DataSpan, Inc. Additional industry experience includes electronics design and software systems development for Eastman Kodak, Sylvania Electronics, Bell of Pennsylvania, Wachovia Bank, and numerous other firms. He received B.S. and M.S. degrees in Electrical Engineering from Carnegie Mellon University in 1964 and 1965, respectively, and a Ph.D. from the State University of New York at Buffalo in 1968.

Contents

Preface
About the Authors

PART I BASIC TECHNIQUES

CHAPTER 1 PARALLEL COMPUTERS
1.1 The Demand for Computational Speed
1.2 Potential for Increased Computational Speed: Speedup Factor; What Is the Maximum Speedup?; Message-Passing Computations
1.3 Types of Parallel Computers: Shared Memory Multiprocessor System; Message-Passing Multicomputer; Distributed Shared Memory; MIMD and SIMD Classifications
1.4 Cluster Computing: Interconnected Computers as a Computing Platform; Cluster Configurations; Setting Up a Dedicated "Beowulf Style" Cluster
1.5 Summary
Further Reading; Bibliography; Problems

CHAPTER 2 MESSAGE-PASSING COMPUTING
2.1 Basics of Message-Passing Programming: Programming Options; Process Creation; Message-Passing Routines
2.2 Using a Cluster of Computers: Software Tools; MPI; Pseudocode Constructs
2.3 Evaluating Parallel Programs: Equations for Parallel Execution Time; Time Complexity; Comments on Asymptotic Analysis; Communication Time of Broadcast/Gather
2.4 Debugging and Evaluating Parallel Programs Empirically: Low-Level Debugging; Visualization Tools; Debugging Strategies; Evaluating Programs; Comments on Optimizing Parallel Code
2.5 Summary
Further Reading; Bibliography; Problems

CHAPTER 3 EMBARRASSINGLY PARALLEL COMPUTATIONS
3.1 Ideal Parallel Computation
3.2 Embarrassingly Parallel Examples: Geometrical Transformations of Images; Mandelbrot Set; Monte Carlo Methods
3.3 Summary
Further Reading; Bibliography; Problems

CHAPTER 4 PARTITIONING AND DIVIDE-AND-CONQUER STRATEGIES
4.1 Partitioning: Partitioning Strategies; Divide and Conquer; M-ary Divide and Conquer
4.2 Partitioning and Divide-and-Conquer Examples: Sorting Using Bucket Sort; Numerical Integration; N-Body Problem
4.3 Summary
Further Reading; Bibliography; Problems

CHAPTER 5 PIPELINED COMPUTATIONS
5.1 Pipeline Technique
5.2 Computing Platform for Pipelined Applications
5.3 Pipeline Program Examples: Adding Numbers; Sorting Numbers; Prime Number Generation; Solving a System of Linear Equations — Special Case
5.4 Summary
Further Reading; Bibliography; Problems

CHAPTER 6 SYNCHRONOUS COMPUTATIONS
6.1 Synchronization: Barrier; Counter Implementation; Tree Implementation; Butterfly Barrier; Local Synchronization; Deadlock
6.2 Synchronized Computations: Data Parallel Computations; Synchronous Iteration
6.3 Synchronous Iteration Program Examples: Solving a System of Linear Equations by Iteration; Heat-Distribution Problem; Cellular Automata
6.4 Partially Synchronous Methods
6.5 Summary
Further Reading; Bibliography; Problems

CHAPTER 7 LOAD BALANCING AND TERMINATION DETECTION
7.1 Load Balancing
7.2 Dynamic Load Balancing: Centralized Dynamic Load Balancing; Decentralized Dynamic Load Balancing; Load Balancing Using a Line Structure
7.3 Distributed Termination Detection Algorithms: Termination Conditions; Using Acknowledgment Messages; Ring Termination Algorithms; Fixed Energy Distributed Termination Algorithm
7.4 Program Example: Shortest-Path Problem; Graph Representation; Searching a Graph
7.5 Summary
Further Reading; Bibliography; Problems

CHAPTER 8 PROGRAMMING WITH SHARED MEMORY
8.1 Shared Memory Multiprocessors
8.2 Constructs for Specifying Parallelism: Creating Concurrent Processes; Threads
8.3 Sharing Data: Creating Shared Data; Accessing Shared Data
8.4 Parallel Programming Languages and Constructs: Languages; Language Constructs; Dependency Analysis
8.5 OpenMP
8.6 Performance Issues: Shared Data Access; Shared Memory Synchronization; Sequential Consistency
8.7 Program Examples: UNIX Processes; Pthreads Example; Java Example
8.8 Summary
Further Reading; Bibliography; Problems

CHAPTER 9 DISTRIBUTED SHARED MEMORY SYSTEMS AND PROGRAMMING
9.1 Distributed Shared Memory
9.2 Implementing Distributed Shared Memory: Software DSM Systems; Hardware DSM Implementation; Managing Shared Data; Multiple Reader/Single Writer Policy in a Page-Based System
9.3 Achieving Consistent Memory in a DSM System
9.4 Distributed Shared Memory Programming Primitives: Process Creation; Shared Data Creation; Shared Data Access; Synchronization Accesses; Features to Improve Performance
9.5 Distributed Shared Memory Programming
9.6 Implementing a Simple DSM System: User Interface Using Classes and Methods; Basic Shared-Variable Implementation; Overlapping Data Groups
9.7 Summary
Further Reading; Bibliography; Problems

PART II ALGORITHMS AND APPLICATIONS

CHAPTER 10 SORTING ALGORITHMS
10.1 General: Sorting; Potential Speedup
10.2 Compare-and-Exchange Sorting Algorithms: Compare and Exchange; Bubble Sort and Odd-Even Transposition Sort; Mergesort; Quicksort; Odd-Even Mergesort; Bitonic Mergesort
10.3 Sorting on Specific Networks: Two-Dimensional Sorting; Quicksort on a Hypercube
10.4 Other Sorting Algorithms: Rank Sort; Counting Sort; Radix Sort; Sample Sort; Implementing Sorting Algorithms on Clusters
10.5 Summary
Further Reading; Bibliography; Problems

CHAPTER 11 NUMERICAL ALGORITHMS
11.1 Matrices — A Review: Matrix Addition; Matrix Multiplication; Matrix-Vector Multiplication; Relationship of Matrices to Linear Equations
11.2 Implementing Matrix Multiplication: Algorithm; Direct Implementation; Recursive Implementation; Mesh Implementation; Other Matrix Multiplication Methods
11.3 Solving a System of Linear Equations: Linear Equations; Gaussian Elimination; Parallel Implementation
11.4 Iterative Methods: Jacobi Iteration; Faster Convergence Methods
11.5 Summary
Further Reading; Bibliography; Problems

CHAPTER 12 IMAGE PROCESSING
12.1 Low-level Image Processing
12.2 Point Processing
12.3 Histogram
12.4 Smoothing, Sharpening, and Noise Reduction: Mean; Median; Weighted Masks
12.5 Edge Detection: Gradient and Magnitude; Edge-Detection Masks
12.6 The Hough Transform
12.7 Transformation into the Frequency Domain: Fourier Series; Fourier Transform; Fourier Transforms in Image Processing; Parallelizing the Discrete Fourier Transform Algorithm; Fast Fourier Transform
12.8 Summary
Further Reading; Bibliography; Problems

CHAPTER 13 SEARCHING AND OPTIMIZATION
13.1 Applications and Techniques
13.2 Branch-and-Bound Search: Sequential Branch and Bound; Parallel Branch and Bound
13.3 Genetic Algorithms: Evolution and Genetic Algorithms; Sequential Genetic Algorithms; Initial Population; Selection Process; Offspring Production; Variations; Termination Conditions; Parallel Genetic Algorithms
13.4 Successive Refinement
13.5 Hill Climbing: Banking Application; Hill Climbing in a Banking Application; Parallelization
13.6 Summary
Further Reading; Bibliography; Problems

APPENDIX A BASIC MPI ROUTINES
APPENDIX B BASIC PTHREAD ROUTINES
APPENDIX C OPENMP DIRECTIVES, LIBRARY FUNCTIONS, AND ENVIRONMENT VARIABLES
INDEX

PART I BASIC TECHNIQUES

Chapter 1 Parallel Computers

In this chapter, we describe the demand for greater computational power from computers and the concept of using computers with multiple internal
processors and multiple interconnected computers. The prospects for increased speed of execution by using multiple computers or multiple processors and the limitations are discussed. Then, the various ways that such systems can be constructed are described, in particular by using multiple computers in a cluster, which has become a very cost-effective computer platform for high-performance computing.

1.1 THE DEMAND FOR COMPUTATIONAL SPEED

There is a continual demand for greater computational power from computer systems than is currently possible. Areas requiring great computational speed include numerical simulation of scientific and engineering problems. Such problems often need huge quantities of repetitive calculations on large amounts of data to give valid results. Computations must be completed within a "reasonable" time period. In the manufacturing realm, engineering calculations and simulations must be achieved within seconds or minutes if possible. A simulation that takes two weeks to reach a solution is usually unacceptable in a design environment, because the time has to be short enough for the designer to work effectively. As systems become more complex, it takes increasingly more time to simulate them. There are some problems that have a specific deadline for the computations, for example, weather forecasting. Taking two days to forecast the local weather accurately for the next day would make the prediction useless. Some areas, such as modeling large DNA structures and global weather forecasting, are grand challenge problems. A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers.

Weather forecasting by computer (numerical weather prediction) is a widely quoted example that requires very powerful computers. The atmosphere is modeled by dividing it into three-dimensional regions or cells. Rather complex mathematical equations are used to capture the various atmospheric effects.
In essence, conditions in each cell (temperature, pressure, humidity, wind speed and direction, etc.) are computed at time intervals using the conditions existing in the previous time interval in the cell and nearby cells. The calculations for each cell are repeated many times to model the passage of time. The key feature that makes the simulation significant is the number of cells that are necessary. For forecasting over days, the atmosphere is affected by very distant events, and thus a large region is necessary. Suppose we consider the whole global atmosphere divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high). A rough calculation leads to about 5 × 10⁸ cells. Suppose each calculation requires 200 floating-point operations (the type of operation necessary if the numbers have a fractional part or are raised to a power). In one time step, 10¹¹ floating-point operations are necessary. If we were to forecast the weather over seven days using 1-minute intervals, there would be 10⁴ time steps and 10¹⁵ floating-point operations in total. A computer capable of 1 Gflops (10⁹ floating-point operations/sec) would take 10⁶ seconds, or over 10 days, to perform the calculation. To perform the calculation in 5 minutes would require a computer operating at 3.4 Tflops (3.4 × 10¹² floating-point operations/sec).
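These figures are easy to reproduce; a back-of-the-envelope sketch using the cell count and rates given above:

```python
# Reproducing the weather-forecasting estimate from the text.
cells = 5e8                      # ~5 × 10^8 one-mile cells, 10 high
flops_per_cell = 200             # floating-point operations per cell per step
ops_per_step = cells * flops_per_cell          # 10^11 per time step

steps = 7 * 24 * 60              # 7 days at 1-minute intervals, ~10^4 steps
total_ops = ops_per_step * steps               # ~10^15 in total

days_at_1gflops = total_ops / 1e9 / 86400      # over 10 days at 1 Gflops
tflops_for_5min = total_ops / (5 * 60) / 1e12  # ~3.4 Tflops to finish in 5 min

print(f"{total_ops:.2e} ops, {days_at_1gflops:.1f} days, {tflops_for_5min:.2f} Tflops")
```

The exact step count (10,080) is slightly above the text's rounded 10⁴, which is why the computed rate comes out just under 3.4 Tflops.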
A snapshot of an undergraduate student's results for this problem, given as a programming assignment with a few bodies, is shown in Figure 1.1. However, there could be a huge number of bodies to consider. A galaxy might have, say, 10¹¹ stars. This suggests that 10²² calculations have to be repeated. Even using the efficient approximate algorithm described in Chapter 4, which requires N log₂ N calculations (but more involved calculations), the number of calculations is still enormous (10¹¹ log₂ 10¹¹). It would require significant time on a single-processor system. Even if each calculation could be done in 1 µs (10⁻⁶ seconds, an extremely optimistic figure, since it involves several multiplications and divisions), it would take 10⁹ years for one iteration using the N² algorithm and almost a year for one iteration using the N log₂ N algorithm. The N-body problem also appears in modeling chemical and biological systems at the molecular level and takes enormous computational power. Global weather forecasting and simulation of a large number of bodies (astronomical or molecular) are traditional examples of applications that require immense computational power, but it is human nature to continually envision new applications that exceed the capabilities of present-day computer systems and require more computational speed than is available. Recent applications, such as virtual reality, require considerable computational speed to achieve results with images and movements that appear real without any jerking. It seems that whatever the computational speed of current processors, there will be applications that require still more computational power.

[Figure 1.1 Astrophysical N-body simulation by Scott Linssen (undergraduate student, University of North Carolina at Charlotte).]

A traditional computer has a single processor for performing the actions specified in a program.
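The operation counts quoted above can be sanity-checked directly (the year figures in the text are order-of-magnitude estimates); a quick sketch:

```python
import math

N = 1e11                       # stars in a galaxy
direct_ops = N * N             # all-pairs force calculation: ~N^2 = 10^22
approx_ops = N * math.log2(N)  # approximate algorithm of Chapter 4: ~N log2 N

print(f"{direct_ops:.1e} vs {approx_ops:.1e} calculations per iteration")
print(f"ratio: {direct_ops / approx_ops:.1e}")  # ~9 orders of magnitude saved
```

Even with the enormous savings of the N log₂ N approach, a single iteration over 10¹¹ bodies remains far beyond a single processor.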
One way of increasing the computational speed, a way that has been considered for many years, is to use multiple processors within a single computer (a multiprocessor) or, alternatively, multiple computers operating together on a single problem. In either case, the overall problem is split into parts, each of which is performed by a separate processor in parallel. Writing programs for this form of computation is known as parallel programming. The computing platform, a parallel computer, could be a specially designed computer system containing multiple processors or several computers interconnected in some way. The approach should provide a significant increase in performance. The idea is that p processors/computers could provide up to p times the computational speed of a single processor/computer, no matter what the current speed of the processor/computer, with the expectation that the problem would be completed in 1/pth of the time. Of course, this is an ideal situation that is rarely achieved in practice. Problems often cannot be divided perfectly into independent parts, and interaction is necessary between the parts, both for data transfer and for synchronization of computations. However, substantial improvement can be achieved, depending upon the problem and the amount of parallelism in it. What makes parallel computing timeless is that continual improvements in the execution speed of processors simply make parallel computers even faster, and there will always be grand challenge problems that cannot be solved in a reasonable amount of time on current computers. Apart from obtaining the potential for increased speed on an existing problem, the use of multiple computers/processors often allows a larger problem, or a more precise solution of a problem, to be solved in a reasonable amount of time. For example, computing many physical phenomena involves dividing the problem into discrete solution points.
As we have mentioned, forecasting the weather involves dividing the air into a three-dimensional grid of solution points. Two- and three-dimensional grids of solution points occur in many other applications. A multiple-computer or multiprocessor solution will often allow more solution points to be computed in a given time, and hence a more precise solution. A related factor is that multiple computers very often have more total main memory than a single computer, enabling problems that require larger amounts of main memory to be tackled. Even if a problem can be solved in a reasonable time, situations arise when the same problem has to be evaluated multiple times with different input values. This situation is especially suited to parallel computers, since, without any alteration to the program, multiple instances of the same program can be executed on different processors/computers simultaneously. Simulation exercises often come under this category: the simulation code is simply executed on separate computers simultaneously but with different input values. Finally, the emergence of the Internet and the World Wide Web has spawned a new area for parallel computers. For example, Web servers must often handle thousands of requests per hour from users. A multiprocessor computer, or more likely nowadays multiple computers connected together as a "cluster," is used to service the requests. Individual requests are serviced by different processors or computers simultaneously. On-line banking and on-line retailers all use clusters of computers to service their clients. The parallel computer is not a new idea; in fact, it is a very old idea. For example, Gill wrote about parallel programming in 1958 (Gill, 1958). Holland wrote about a "computer capable of executing an arbitrary number of sub-programs simultaneously" in 1959 (Holland, 1959).
Conway described the design of a parallel computer and its programming in 1963 (Conway, 1963). Notwithstanding the long history, Flynn and Rudd (1996) write that "the continued drive for higher- and higher-performance systems . . . leads us to one simple conclusion: the future is parallel." We concur.

1.2 POTENTIAL FOR INCREASED COMPUTATIONAL SPEED

In the following and in subsequent chapters, the number of processes or processors will be identified as p. We will use the term "multiprocessor" to include all parallel computer systems that contain more than one processor.

1.2.1 Speedup Factor

Perhaps the first point of interest when developing solutions on a multiprocessor is how much faster the multiprocessor solves the problem under consideration. In making this comparison, one would use the best solution on the single processor, that is, the best sequential algorithm on the single-processor system, to compare against the parallel algorithm under investigation on the multiprocessor. The speedup factor, S(p),¹ is a measure of relative performance, defined as:

    S(p) = (Execution time using a single-processor system, with the best sequential algorithm) / (Execution time using a multiprocessor with p processors)

We shall use ts as the execution time of the best sequential algorithm running on a single processor and tp as the execution time for solving the same problem on a multiprocessor. Then:

    S(p) = ts / tp

S(p) gives the increase in speed obtained by using the multiprocessor. Note that the underlying algorithm for the parallel implementation might not be the same as the algorithm on the single-processor system (and usually is different).

¹ The speedup factor is normally a function of both p and the number of data items being processed, n, i.e., S(p, n). We will introduce the number of data items later. At this point, the only variable is p.
In a theoretical analysis, the speedup factor can also be cast in terms of computational steps:

    S(p) = (Number of computational steps using one processor) / (Number of parallel computational steps with p processors)

For sequential computations, it is common to compare different algorithms using time complexity, which we will review in Chapter 2. Time complexity can be extended to parallel algorithms and applied to the speedup factor, as we shall see. However, considering computational steps alone may not be useful, as parallel implementations may require expensive communication between the parallel parts, which is usually much more time-consuming than the computational steps. We shall look at this in Chapter 2. The maximum speedup possible is usually p with p processors (linear speedup). A speedup of p would be achieved when the computation can be divided into equal-duration processes, with one process mapped onto each processor and no additional overhead in the parallel solution:

    S(p) ≤ ts / (ts/p) = p

Superlinear speedup, where S(p) > p, may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm, a unique feature of the system architecture that favors the parallel formation, or an indeterminate nature of the algorithm. Generally, if a purely deterministic parallel algorithm were to achieve better than p times the speedup over the current sequential algorithm, the parallel algorithm could be emulated on a single processor, one parallel part after another, which would suggest that the original sequential algorithm was not optimal. One common reason for superlinear speedup is extra memory in the multiprocessor system. For example, suppose the main memory associated with each processor in the multiprocessor system is the same as that associated with the processor in a single-processor system.
Then the total main memory in the multiprocessor system is larger than that in the single-processor system, and it can hold more of the problem data at any instant, which leads to less disk memory traffic.

Efficiency. It is sometimes useful to know how long the processors are being used on the computation, which can be found from the (system) efficiency. The efficiency, E, is defined as

    E = (Execution time using one processor) / (Execution time using a multiprocessor × number of processors)
      = ts / (tp × p)

which leads to

    E = (S(p) / p) × 100%

when E is given as a percentage. For example, if E = 50%, the processors are being used half the time on the actual computation, on average. An efficiency of 100% occurs when all the processors are being used on the computation at all times and the speedup factor, S(p), is p.

1.2.2 What Is the Maximum Speedup?

Several factors will appear as overhead in the parallel version and limit the speedup, notably:

1. Periods when not all the processors can be performing useful work and are simply idle.
2. Extra computations in the parallel version not appearing in the sequential version; for example, to recompute constants locally.
3. Communication time between processes.

It is reasonable to expect that some part of a computation cannot be divided into concurrent processes and must be performed sequentially. Let us assume that during some period, perhaps an initialization period or the period before concurrent processes are set up, only one processor is doing useful work, and for the rest of the computation additional processors are operating on processes. Assuming there will be some parts that are only executed on one processor, the ideal situation would be for all the available processors to operate simultaneously for the other times.
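The definitions of speedup and efficiency above translate directly into code; a minimal sketch with made-up timings:

```python
def speedup(t_seq, t_par):
    """Speedup factor S(p) = ts / tp."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Efficiency E = ts / (tp * p), as a fraction of 1."""
    return speedup(t_seq, t_par) / p

# Example (invented numbers): a 100-second sequential job
# runs in 8 seconds on 16 processors.
S = speedup(100, 8)          # 12.5
E = efficiency(100, 8, 16)   # 0.78125, i.e., about 78%
print(S, E)
```

Note that S < p and E < 1 here, as is typical when overheads are present.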
If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, the time to perform the computation with p processors is given by fts + (1 − f)ts/p, as illustrated in Figure 1.2. Illustrated is the case with a single serial part at the beginning of the computation, but the serial part could be distributed throughout the computation. Hence, the speedup factor is given by

    S(p) = ts / (fts + (1 − f)ts/p) = p / (1 + (p − 1)f)

This equation is known as Amdahl's law (Amdahl, 1967). Figure 1.3 shows S(p) plotted against the number of processors and against f. We see that indeed a speed improvement is indicated. However, the fraction of the computation that is executed by concurrent processes needs to be a substantial fraction of the overall computation if a significant increase in speed is to be achieved. Even with an infinite number of processors, the maximum speedup is limited to 1/f; i.e.,

    S(p) → 1/f as p → ∞

For example, with only 5% of the computation being serial, the maximum speedup is 20, irrespective of the number of processors. Amdahl used this argument to promote single-processor systems in the 1960s. Of course, one can counter this by saying that even a speedup of 20 would be impressive.

[Figure 1.2 Parallelizing the sequential problem (Amdahl's law): (a) one processor takes time ts, consisting of a serial section fts and parallelizable sections totaling (1 − f)ts; (b) with p processors, the parallelizable part takes (1 − f)ts/p, giving tp = fts + (1 − f)ts/p.]

Orders-of-magnitude improvements are possible in certain circumstances. For example, superlinear speedup can occur in search algorithms. In search problems performed by exhaustively looking for the solution, suppose the solution space is divided among the processors for each one to perform an independent search.
[Figure 1.3 (a) Speedup against number of processors, p, for f = 0%, 5%, 10%, and 20%. (b) Speedup against serial fraction, f, for p = 256 and p = 16.]

In a sequential implementation, the different search spaces are attacked one after the other. In a parallel implementation, they can be searched simultaneously, and one processor might find the solution almost immediately. In the sequential version, suppose x sub-spaces are searched and then the solution is found in time Δt in the next sub-space search. The number of previously searched sub-spaces, x, is indeterminate and will depend upon the problem. In the parallel version, the solution is found immediately in time Δt, as illustrated in Figure 1.4. The speedup is then given by

    S(p) = (x × ts/p + Δt) / Δt

The worst case for the sequential search is when the solution is found in the last sub-space search, and then the parallel version offers the greatest benefit:

    S(p) = (((p − 1)/p) × ts + Δt) / Δt → ∞ as Δt tends to zero

The least advantage for the parallel version occurs when the solution is found in the first sub-space search of the sequential search:

    S(p) = Δt / Δt = 1

The actual speedup will depend upon which sub-space holds the solution but could be extremely large.

[Figure 1.4 Superlinear speedup: (a) searching each sub-space sequentially, x sub-space searches of ts/p each are made before the solution is found in a further time Δt; (b) searching each sub-space in parallel, the solution is found in time Δt.]

Scalability. The performance of a system will depend upon the size of the system, i.e., the number of processors, and generally the larger the system the better, but this comes with a cost. Scalability is a rather imprecise term.
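The search-speedup formula above can be tried with assumed values:

```python
def search_speedup(x, p, t_s, dt):
    # Sequential search scans x sub-spaces (t_s / p time each) before finding
    # the solution in a further time dt; the parallel search finds it in dt.
    return (x * t_s / p + dt) / dt

# Assumed values: p = 8 processors, t_s = 800 time units, dt = 1.
print(search_speedup(0, 8, 800, 1))  # solution in first sub-space: speedup 1.0
print(search_speedup(7, 8, 800, 1))  # solution in last sub-space: 701.0
```

With the solution in the last sub-space, the speedup of 701 far exceeds p = 8, illustrating why such superlinear results say more about where the solution happened to lie than about the algorithm itself.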
It is used to indicate a hardware design that allows the system to be increased in size and in doing so to obtain increased performance. This could be described as architecture or hardware scalability. Scalability is also used to indicate that a parallel algorithm can accommodate increased data items with a low and bounded increase in computational steps. This could be described as algorithmic scalability. Of course, we would want all multiprocessor systems to be architecturally scalable (and manufacturers will market their systems as such), but this will depend heavily upon the design of the system. Usually, as we add processors to a system, the interconnection network must be expanded. Greater communication delays and increased contention result, and the system efficiency, E, is reduced. The underlying goal of most multiprocessor designs is to achieve scalability, and this is reflected in the multitude of interconnection networks that have been devised. Combined architecture/algorithmic scalability suggests that increased problem size can be accommodated with increased system size for a particular architecture and algorithm. Whereas increasing the size of the system clearly means adding processors, increasing the size of the problem requires clarification. Intuitively, we would think of the number of data elements being processed in the algorithm as a measure of size. However, doubling the problem size would not necessarily double the number of computational steps. It will depend upon the problem. For example, adding two matrices, as discussed in Chapter 11, has this effect, but multiplying matrices does not; the number of computational steps for multiplying matrices quadruples. Hence, scaling different problems would imply different computational requirements. An alternative definition of problem size is to equate problem size with the number of basic steps in the best sequential algorithm.
Of course, even with this definition, if we increase the number of data points, we will increase the problem size. In subsequent chapters, in addition to the number of processors, p, we will also use n as the number of input data elements in a problem.² These two, p and n, usually can be altered in an attempt to improve performance. Altering p alters the size of the computer system, and altering n alters the size of the problem. Usually, increasing the problem size improves the relative performance because more parallelism can be achieved. Gustafson presented an argument based upon scalability concepts to show that Amdahl's law was not as significant as first supposed in determining the potential speedup limits (Gustafson, 1988). Gustafson attributed the formulation of the idea into an equation to E. Barsis. Gustafson makes the observation that in practice a larger multiprocessor usually allows a larger-size problem to be undertaken in a reasonable execution time. Hence in practice, the problem size selected frequently depends on the number of available processors. Rather than assume that the problem size is fixed, it is just as valid to assume that the parallel execution time is fixed. As the system size is increased (p increased), the problem size is increased to maintain constant parallel execution time. In increasing the problem size, Gustafson also makes the case that the serial section of the code is normally fixed and does not increase with the problem size. Using the constant parallel-execution-time constraint, the resulting speedup factor will be numerically different from Amdahl's speedup factor and is called a scaled speedup factor (i.e., the speedup factor when the problem is scaled). For Gustafson's scaled speedup factor, the parallel execution time, tp, is constant, rather than the serial execution time, ts, as in Amdahl's law.
For the derivation of Gustafson’s law, we shall use the same terms as for deriving Amdahl’s law, but it is necessary to separate out the serial and parallelizable sections of the sequential execution time, ts, into fts + (1 − f )ts as the serial section fts is a constant. For algebraic convenience, let the parallel execution time, tp = fts + (1 − f )ts/p = 1. Then, with a little algebraic manipulation, the serial execution time, ts, becomes fts + (1 − f ) ts = p + (1 − p)fts. The scaled speedup factor then becomes ft s + ( 1 – f )t s p + ( 1 – p )ft - = --------------------------------s = p + ( 1 – p )ft s S s ( p ) = -------------------------------------- ft s + ( 1 – f )t s ⁄ p 1 which is called Gustafson’s law. There are two assumptions in this equation: the parallel execution time is constant, and the part that must be executed sequentially, fts, is also constant and not a function of p. Gustafson’s observation here is that the scaled speedup factor is a line of negative slope (1 − p) rather than the rapid reduction previously illustrated in Figure 1.3(b). For example, suppose we had a serial section of 5% and 20 processors; the speedup is 0.05 + 0.95(20) = 19.05 according to the formula instead of 10.26 according to Amdahl’s law. (Note, however, the different assumptions.) Gustafson quotes examples of speedup factors of 1021, 1020, and 1016 that have been achieved in practice with a 1024- processor system on numerical and simulation problems. Apart from constant problem size scaling (Amdahl’s assumption) and time-constrained scaling (Gustafson’s assumption), scaling could be memory-constrained scaling. In memory-constrained scaling, the problem is scaled to fit in the available memory. As the number of processors grows, normally the memory grows in proportion. This form can lead to significant increases in the execution time (Singh, Hennessy, and Gupta, 1993). 2 For matrices, we consider n × n matrices. 12 Parallel Computers Chap. 
1.2.3 Message-Passing Computations

The analysis so far does not take account of message-passing, which can be a very significant overhead in message-passing programming. In this form of parallel programming, messages are sent between processes to pass data and for synchronization purposes. Thus,

    tp = tcomm + tcomp

where tcomm is the communication time and tcomp is the computation time. As we divide the problem into parallel parts, the computation time of the parallel parts generally decreases because the parts become smaller, and the communication time between the parts generally increases (as there are more parts communicating). At some point, the communication time will dominate the overall execution time, and the parallel execution time will actually increase. It is essential to reduce the communication overhead because of the significant time taken by interprocessor communication. The communication aspect of the parallel solution is usually not present in the sequential solution, and it is considered an overhead. The ratio

    Computation/communication ratio = (Computation time) / (Communication time) = tcomp / tcomm

can be used as a metric. In subsequent chapters, we will develop equations for the computation time and the communication time in terms of the number of processors (p) and the number of data elements (n) for the algorithms and problems under consideration, to get a handle on the potential speedup possible and the effect of increasing p and n. In a practical situation we may not have much control over the value of p, that is, the size of the system we can use (except that we could map more than one process of the problem onto one processor, although this is not usually beneficial). Suppose, for example, that for some value of p, a problem requires c1n computations and c2n² communications. Clearly, as n increases, the communication time increases faster than the computation time.
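This growth can be checked numerically with the c1n computations and c2n² communications example (the constants here are chosen arbitrarily for illustration):

```python
def comp_comm_ratio(n, c1=1.0, c2=1.0):
    # For c1*n computations and c2*n^2 communications,
    # the ratio tcomp/tcomm is c1 / (c2 * n).
    return (c1 * n) / (c2 * n * n)

# The ratio falls as n grows, so communication eventually dominates:
for n in (10, 100, 1000):
    print(n, comp_comm_ratio(n))
```

A falling ratio is the undesirable case: in this example, scaling up the problem makes the communication overhead relatively worse, not better.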
This can be seen clearly from the computation/communication ratio, c1/(c2n), which can be cast in time-complexity notation to remove the constants (see Chapter 2). Usually, we want the computation/communication ratio to be as high as possible, that is, some highly increasing function of n, so that increasing the problem size lessens the effects of the communication time. Of course, this is a complex matter with many factors. Finally, one can only verify the execution speed by executing the program on a real multiprocessor system, and it is assumed this would then be done. Ways of measuring the actual execution time are described in the next chapter.

1.3 TYPES OF PARALLEL COMPUTERS

Having convinced ourselves that there is potential for speedup with the use of multiple processors or computers, let us explore how a multiprocessor or multicomputer could be constructed. A parallel computer, as we have mentioned, is either a single computer with multiple internal processors or multiple computers interconnected to form a coherent high-performance computing platform. In this section, we shall look at specially designed parallel computers, and later in the chapter we will look at using off-the-shelf "commodity" computers configured as a cluster. The term parallel computer is usually reserved for specially designed components. There are two basic types of parallel computer:

1. Shared memory multiprocessor
2. Distributed-memory multicomputer

1.3.1 Shared Memory Multiprocessor System

A conventional computer consists of a processor executing a program stored in a (main) memory, as shown in Figure 1.5. Each main memory location in the memory is identified by a number called its address. Addresses start at 0 and extend to 2ᵇ − 1 when there are b bits (binary digits) in the address.
A natural way to extend the single-processor model is to have multiple processors connected to multiple memory modules, such that each processor can access any memory module, in a so-called shared memory configuration, as shown in Figure 1.6. The connection between the processors and memory is through some form of interconnection network. A shared memory multiprocessor system employs a single address space, which means that each location in the whole main memory system has a unique address that is used by each processor to access the location. Although not shown in these "models," real systems have high-speed cache memory, which we shall discuss later.

[Figure 1.5 Conventional computer having a single processor and memory: instructions flow from the main memory to the processor, and data passes between them.]

[Figure 1.6 Traditional shared memory multiprocessor model: memory modules forming one address space, connected to the processors through an interconnection network.]

Programming a shared memory multiprocessor involves having executable code stored in the shared memory for each processor to execute. The data for each program will also be stored in the shared memory, and thus each program could access all the data if needed. A programmer can create the executable code and shared data for the processors in different ways, but the final result is to have each processor execute its own program or code sequences from the shared memory. (Typically, all processors execute the same program.) One way for the programmer to produce the executable code for each processor is to use a high-level parallel programming language that has special parallel programming constructs and statements for declaring shared variables and parallel code sections. The compiler is responsible for producing the final executable code from the programmer's specification in the program. However, a completely new parallel programming language would not be popular with programmers.
More likely, when using a compiler to generate parallel code from the programmer's "source code," a regular sequential programming language would be used, with preprocessor directives to specify the parallelism. An example of this approach is OpenMP (Chandra et al., 2001), an industry-standard set of compiler directives and constructs added to C/C++ and Fortran. Alternatively, so-called threads can be used that contain regular high-level language code sequences for individual processors. These code sequences can then access shared locations. Another way that has been explored over the years, and is still finding interest, is to use a regular sequential programming language and modify the syntax to specify parallelism. A recent example of this approach is UPC (Unified Parallel C) (see http://upc.gwu.edu). More details on exactly how to program shared memory systems using threads and other ways are given in Chapter 8. From a programmer's viewpoint, the shared memory multiprocessor is attractive because of the convenience of sharing data. Small (two-processor and four-processor) shared memory multiprocessor systems based upon a bus interconnection structure, as illustrated in Figure 1.7, are common; for example, dual-Pentium® and quad-Pentium systems. Two-processor shared memory systems are particularly cost-effective. However, it is very difficult to implement the hardware to achieve fast access to all the shared memory by all the processors when there is a large number of processors. Hence, most large shared memory systems have some form of hierarchical or distributed memory structure. Then processors can physically access nearby memory locations much faster than more distant memory locations. The term nonuniform memory access (NUMA) is used in these cases, as opposed to uniform memory access (UMA).
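The thread-based shared memory model just described can be sketched in a few lines of Python (the book's own examples use C with Pthreads and OpenMP; this hypothetical worker function only illustrates threads updating a shared location under a lock):

```python
import threading

total = 0                      # shared variable, visible to all threads
lock = threading.Lock()        # guards updates to the shared variable

def worker(values):
    global total
    for v in values:
        with lock:             # one thread at a time updates `total`
            total += v

data = list(range(1000))
# Four threads, each summing an interleaved quarter of the data.
threads = [threading.Thread(target=worker, args=(data[i::4],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # 499500, the same answer a single thread would produce
```

Without the lock, concurrent read-modify-write updates to `total` could interleave and lose increments, which is exactly the kind of shared-data issue Chapter 8 addresses.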
Conventional single processors have fast cache memory to hold copies of recently referenced memory locations, thus reducing the need to access the main memory on every memory reference. Often there are two levels of cache memory between the processor and the main memory. Cache memory is carried over into shared memory multiprocessors by providing each processor with its own local cache memory. Fast local cache memory with each processor can somewhat alleviate the problem of different access times to different main memories in larger systems, but making sure that copies of the same data in different caches are identical becomes a complex issue that must be addressed. One processor writing to a cached data item often requires all the other copies of the cached item in the system to be made invalid. Such matters are briefly covered in Chapter 8.

[Figure 1.7 Simplistic view of a small shared memory multiprocessor: processors attached to a shared memory over a bus.]

1.3.2 Message-Passing Multicomputer

An alternative form of multiprocessor to a shared memory multiprocessor can be created by connecting complete computers through an interconnection network, as shown in Figure 1.8. Each computer consists of a processor and local memory, but this memory is not accessible by other processors. The interconnection network provides for processors to send messages to other processors. The messages carry data from one processor to another as dictated by the program. Such multiprocessor systems are usually called message-passing multiprocessors, or simply multicomputers, especially if they consist of self-contained computers that could operate separately. Programming a message-passing multicomputer still involves dividing the problem into parts that are intended to be executed simultaneously to solve the problem.
Programming could use a parallel or extended sequential language, but a common approach is to use message-passing library routines that are inserted into a conventional sequential program for message passing. Often, we talk in terms of processes. A problem is divided into a number of concurrent processes that may be executed on different computers. If there were six processes and six computers, we might have one process executed on each computer. If there were more processes than computers, more than one process would be executed on one computer, in a time-shared fashion. Processes communicate by sending messages; this will be the only way to distribute data and results between processes. The message-passing multicomputer will physically scale more easily than a shared memory multiprocessor. That is, it can more easily be made larger. There have been examples of specially designed message-passing processors. Message-passing systems can also employ general-purpose microprocessors.

Figure 1.8 Message-passing multiprocessor model (multicomputer).

16 Parallel Computers Chap. 1

Networks for Multicomputers. The purpose of the interconnection network shown in Figure 1.8 is to provide a physical path for messages sent from one computer to another computer. Key issues in network design are the bandwidth, latency, and cost. Ease of construction is also important. The bandwidth is the number of bits that can be transmitted in unit time, given as bits/sec. The network latency is the time to make a message transfer through the network. The communication latency is the total time to send the message, including the software overhead and interface delays. Message latency, or startup time, is the time to send a zero-length message, which is essentially the software and hardware overhead in sending a message (finding the route, packing, unpacking, etc.)
onto which must be added the actual time to send the data along the interconnection path. The number of physical links in a path between two nodes is an important consideration because it will be a major factor in determining the delay for a message. The diameter is the minimum number of links between the two farthest nodes (computers) in the network. Only the shortest routes are considered. How efficiently a parallel problem can be solved using a multicomputer with a specific network is extremely important. The diameter of the network gives the maximum distance that a single message must travel and can be used to find the communication lower bound of some parallel algorithms. The bisection width of a network is the minimum number of links (or sometimes wires) that must be cut to divide the network into two equal parts. The bisection bandwidth is the collective bandwidth over these links, that is, the maximum number of bits that can be transmitted from one part of the divided network to the other part in unit time. These factors can also be important in evaluating parallel algorithms. Parallel algorithms usually require numbers to be moved about the network. To move numbers across the network from one side to the other, we must use the links between the two halves, and the bisection width gives us the number of links available. There are several ways one could interconnect computers to form a multicomputer system. For a very small system, one might consider connecting every computer to every other computer with links. With c computers, there are c(c − 1)/2 links in all. Such exhaustive interconnections have application only for a very small system. For example, a set of four computers could reasonably be exhaustively interconnected. However, as the size increases, the number of interconnections clearly becomes impractical for economic and engineering reasons. Then we need to look at networks with restricted interconnections and switched interconnections.
There are two networks with restricted direct interconnections that have seen wide use: the mesh network and the hypercube network. Not only are these important as interconnection networks, the concepts also appear in the formation of parallel algorithms.

Mesh. A two-dimensional mesh can be created by having each node in a two-dimensional array connect to its four nearest neighbors, as shown in Figure 1.9. The diameter of a √p × √p mesh is 2(√p − 1), since to reach one corner from the opposite corner requires a path to be made across (√p − 1) nodes and down (√p − 1) nodes. The free ends of a mesh might circulate back to the opposite sides. Then the network is called a torus. The mesh and torus networks are popular because of their ease of layout and expandability. If necessary, the network can be folded; that is, rows are interleaved and columns are interleaved so that the wraparound connections simply turn back through the network rather than stretch from one edge to the opposite edge. Three-dimensional meshes can be formed where each node connects to two nodes in the x-plane, the y-plane, and the z-plane. Meshes are particularly convenient for many scientific and engineering problems in which solution points are arranged in two-dimensional or three-dimensional arrays. There have been several examples of message-passing multicomputer systems using two-dimensional or three-dimensional mesh networks, including the Intel Touchstone Delta computer (delivered in 1991, designed with a two-dimensional mesh), and the J-machine, a research prototype constructed at MIT in 1991 with a three-dimensional mesh. A more recent example of a system using a mesh is the ASCI Red supercomputer from the U.S. Department of Energy's Accelerated Strategic Computing Initiative, developed in 1995–97.

Figure 1.9 Two-dimensional array (mesh).
ASCI Red, sited at Sandia National Laboratories, consists of 9,472 Pentium-II Xeon processors and uses a 38 × 32 × 2 mesh interconnect for message passing. Meshes can also be used in shared memory systems.

Hypercube Network. In a d-dimensional (binary) hypercube network, each node connects to one node in each of the dimensions of the network. For example, in a three-dimensional hypercube, the connections in the x-direction, y-direction, and z-direction form a cube, as shown in Figure 1.10. Each node in a hypercube is assigned a d-bit binary address when there are d dimensions. Each bit is associated with one of the dimensions and can be a 0 or a 1, for the two nodes in that dimension. Nodes in a three-dimensional hypercube have a 3-bit address. Node 000 connects to nodes with addresses 001, 010, and 100. Node 111 connects to nodes 110, 101, and 011. Note that each node connects to nodes whose addresses differ by one bit. This characteristic can be extended for higher-dimension hypercubes. For example, in a five-dimensional hypercube, node 11101 connects to nodes 11100, 11111, 11001, 10101, and 01101.

Figure 1.10 Three-dimensional hypercube.

A notable advantage of the hypercube is that the diameter of the network is given by log2 p for a p-node hypercube, which has a reasonable (low) growth with increasing p. The number of links emanating from each node also only grows logarithmically. A very convenient aspect of the hypercube is the existence of a minimal distance deadlock-free routing algorithm. To describe this algorithm, let us route a message from a node X having a nodal address X = x_{n−1}x_{n−2} … x_1x_0 to a destination node having a nodal address Y = y_{n−1}y_{n−2} … y_1y_0. Each bit of Y that is different from that of X identifies one hypercube dimension that the route should take and can be found by performing the exclusive-OR function, Z = X ⊕ Y, operating on pairs of bits.
The dimensions to use in the routing are given by those bits of Z that are 1. At each node in the path, the exclusive-OR function between the current nodal address and the destination nodal address is performed. Usually the dimension identified by the most significant 1 in Z is chosen for the route. For example, the route taken from node 13 (001101) to node 42 (101010) in a six-dimensional hypercube would be node 13 (001101) to node 45 (101101) to node 41 (101001) to node 43 (101011) to node 42 (101010). This hypercube routing algorithm is sometimes called the e-cube routing algorithm, or left-to-right routing. A d-dimensional hypercube actually consists of two (d − 1)-dimensional hypercubes with dth-dimension links between them. Figure 1.11 shows a four-dimensional hypercube drawn as two three-dimensional hypercubes with eight connections between them. Hence, the bisection width is 8. (The bisection width is p/2 for a p-node hypercube.) A five-dimensional hypercube consists of two four-dimensional hypercubes with connections between them, and so forth for larger hypercubes. In a practical system, the network must be laid out in two or possibly three dimensions. Hypercubes are a part of a larger family of k-ary d-cubes; however, it is only the binary hypercube (with k = 2) that is really important as a basis for multicomputer construction and for parallel algorithms. The hypercube network became popular for constructing message-passing multicomputers after the pioneering research system called the Cosmic Cube was constructed at Caltech in the early 1980s (Seitz, 1985). However, interest in hypercubes has waned since the late 1980s. As an alternative to direct links between individual computers, switches can be used in various configurations to route the messages between the computers.

Crossbar switch. The crossbar switch provides exhaustive connections using one switch for each connection.
It is employed in shared memory systems more so than message-passing systems, for connecting processors to memories. The layout of the crossbar switch is shown in Figure 1.12.

Figure 1.11 Four-dimensional hypercube.

Figure 1.12 Cross-bar switch.

There are several examples of systems using crossbar switches at some level within the system, especially very high performance systems. One of our students built a very early crossbar switch multiple microprocessor system in the 1970s (Wilkinson and Abachi, 1983).

Tree Networks. Another switch configuration is to use a binary tree, as shown in Figure 1.13. Each switch in the tree has two links connecting to two switches below it as the network fans out from the root. This particular tree is a complete binary tree because every level is fully occupied. The height of a tree is the number of links from the root to the lowest leaves. A key aspect of the tree structure is that the height is logarithmic; there are log2 p levels of switches with p processors (at the leaves). The tree network need not be complete or based upon the base two. In an m-ary tree, each node connects to m nodes beneath it. Under uniform request patterns, the communication traffic in a tree interconnection network increases toward the root, which can be a bottleneck. In a fat tree network (Leiserson, 1985), the number of links is progressively increased toward the root. In a binary fat tree, we simply add links in parallel, as required between levels of a binary tree, and increase the number of links toward the root. Leiserson developed this idea into the universal fat tree, in which the number of links between nodes grows exponentially toward the root, thereby allowing increased traffic toward the root and reducing the communication bottleneck.
The most notable example of a computer designed with tree interconnection networks is the Thinking Machines' Connection Machine CM5 computer, which uses a 4-ary fat tree (Hwang, 1993). The fat tree has been used subsequently. For example, the Quadrics QsNet network (see http://www.quadrics.com) uses a fat tree.

Figure 1.13 Tree structure.

Multistage Interconnection Networks. The multistage interconnection network (MIN) is a classification covering a multitude of configurations with the common characteristic of having a number of levels of switches. Switches in one level are connected to switches in adjacent levels in various symmetrical ways such that a path can be made from one side of the network to the other side (and back sometimes). An example of a multistage interconnection network is the Omega network shown in Figure 1.14 (for eight inputs and outputs). This network has a very simple routing algorithm using the destination address. Inputs and outputs are given addresses as shown in the figure. Each switching cell requires one control signal to select either the upper output or the lower output (0 specifying the upper output and 1 specifying the lower). The most significant bit of the destination address is used to control the switch in the first stage; if the most significant bit is 0, the upper output is selected, and if