A Practical Introduction to Data Structures and Algorithms (Java) PDF
Virginia Tech
2009
Clifford A. Shaffer
Summary
This book is a practical introduction to data structures and algorithm analysis. It covers lists, trees, sorting, searching, indexing, graphs, and the theory of algorithms, with programming examples in Java.
A Practical Introduction to Data Structures and Algorithm Analysis
Third Edition (Java)

Clifford A. Shaffer
Department of Computer Science
Virginia Tech
Blacksburg, VA 24061

April 16, 2009

Copyright © 2008 by Clifford A. Shaffer. This document is the draft of a book to be published by Prentice Hall and may not be duplicated without the express written consent of either the author or a representative of the publisher.

Contents

Preface

I Preliminaries

1 Data Structures and Algorithms
1.1 A Philosophy of Data Structures
1.1.1 The Need for Data Structures
1.1.2 Costs and Benefits
1.2 Abstract Data Types and Data Structures
1.3 Design Patterns
1.3.1 Flyweight
1.3.2 Visitor
1.3.3 Composite
1.3.4 Strategy
1.4 Problems, Algorithms, and Programs
1.5 Further Reading
1.6 Exercises

2 Mathematical Preliminaries
2.1 Sets and Relations
2.2 Miscellaneous Notation
2.3 Logarithms
2.4 Summations and Recurrences
2.5 Recursion
2.6 Mathematical Proof Techniques
2.6.1 Direct Proof
2.6.2 Proof by Contradiction
2.6.3 Proof by Mathematical Induction
2.7 Estimating
2.8 Further Reading
2.9 Exercises

3 Algorithm Analysis
3.1 Introduction
3.2 Best, Worst, and Average Cases
3.3 A Faster Computer, or a Faster Algorithm?
3.4 Asymptotic Analysis
3.4.1 Upper Bounds
3.4.2 Lower Bounds
3.4.3 Θ Notation
3.4.4 Simplifying Rules
3.4.5 Classifying Functions
3.5 Calculating the Running Time for a Program
3.6 Analyzing Problems
3.7 Common Misunderstandings
3.8 Multiple Parameters
3.9 Space Bounds
3.10 Speeding Up Your Programs
3.11 Empirical Analysis
3.12 Further Reading
3.13 Exercises
3.14 Projects

II Fundamental Data Structures

4 Lists, Stacks, and Queues
4.1 Lists
4.1.1 Array-Based List Implementation
4.1.2 Linked Lists
4.1.3 Comparison of List Implementations
4.1.4 Element Implementations
4.1.5 Doubly Linked Lists
4.2 Stacks
4.2.1 Array-Based Stacks
4.2.2 Linked Stacks
4.2.3 Comparison of Array-Based and Linked Stacks
4.2.4 Implementing Recursion
4.3 Queues
4.3.1 Array-Based Queues
4.3.2 Linked Queues
4.3.3 Comparison of Array-Based and Linked Queues
4.4 Dictionaries and Comparators
4.5 Further Reading
4.6 Exercises
4.7 Projects

5 Binary Trees
5.1 Definitions and Properties
5.1.1 The Full Binary Tree Theorem
5.1.2 A Binary Tree Node ADT
5.2 Binary Tree Traversals
5.3 Binary Tree Node Implementations
5.3.1 Pointer-Based Node Implementations
5.3.2 Space Requirements
5.3.3 Array Implementation for Complete Binary Trees
5.4 Binary Search Trees
5.5 Heaps and Priority Queues
5.6 Huffman Coding Trees
5.6.1 Building Huffman Coding Trees
5.6.2 Assigning and Using Huffman Codes
5.7 Further Reading
5.8 Exercises
5.9 Projects

6 Non-Binary Trees
6.1 General Tree Definitions and Terminology
6.1.1 An ADT for General Tree Nodes
6.1.2 General Tree Traversals
6.2 The Parent Pointer Implementation
6.3 General Tree Implementations
6.3.1 List of Children
6.3.2 The Left-Child/Right-Sibling Implementation
6.3.3 Dynamic Node Implementations
6.3.4 Dynamic “Left-Child/Right-Sibling” Implementation
6.4 K-ary Trees
6.5 Sequential Tree Implementations
6.6 Further Reading
6.7 Exercises
6.8 Projects

III Sorting and Searching

7 Internal Sorting
7.1 Sorting Terminology and Notation
7.2 Three Θ(n²) Sorting Algorithms
7.2.1 Insertion Sort
7.2.2 Bubble Sort
7.2.3 Selection Sort
7.2.4 The Cost of Exchange Sorting
7.3 Shellsort
7.4 Mergesort
7.5 Quicksort
7.6 Heapsort
7.7 Binsort and Radix Sort
7.8 An Empirical Comparison of Sorting Algorithms
7.9 Lower Bounds for Sorting
7.10 Further Reading
7.11 Exercises
7.12 Projects

8 File Processing and External Sorting
8.1 Primary versus Secondary Storage
8.2 Disk Drives
8.2.1 Disk Drive Architecture
8.2.2 Disk Access Costs
8.3 Buffers and Buffer Pools
8.4 The Programmer’s View of Files
8.5 External Sorting
8.5.1 Simple Approaches to External Sorting
8.5.2 Replacement Selection
8.5.3 Multiway Merging
8.6 Further Reading
8.7 Exercises
8.8 Projects

9 Searching
9.1 Searching Unsorted and Sorted Arrays
9.2 Self-Organizing Lists
9.3 Bit Vectors for Representing Sets
9.4 Hashing
9.4.1 Hash Functions
9.4.2 Open Hashing
9.4.3 Closed Hashing
9.4.4 Analysis of Closed Hashing
9.4.5 Deletion
9.5 Further Reading
9.6 Exercises
9.7 Projects

10 Indexing
10.1 Linear Indexing
10.2 ISAM
10.3 Tree-based Indexing
10.4 2-3 Trees
10.5 B-Trees
10.5.1 B+-Trees
10.5.2 B-Tree Analysis
10.6 Further Reading
10.7 Exercises
10.8 Projects

IV Advanced Data Structures

11 Graphs
11.1 Terminology and Representations
11.2 Graph Implementations
11.3 Graph Traversals
11.3.1 Depth-First Search
11.3.2 Breadth-First Search
11.3.3 Topological Sort
11.4 Shortest-Paths Problems
11.4.1 Single-Source Shortest Paths
11.5 Minimum-Cost Spanning Trees
11.5.1 Prim’s Algorithm
11.5.2 Kruskal’s Algorithm
11.6 Further Reading
11.7 Exercises
11.8 Projects

12 Lists and Arrays Revisited
12.1 Multilists
12.2 Matrix Representations
12.3 Memory Management
12.3.1 Dynamic Storage Allocation
12.3.2 Failure Policies and Garbage Collection
12.4 Further Reading
12.5 Exercises
12.6 Projects

13 Advanced Tree Structures
13.1 Tries
13.2 Balanced Trees
13.2.1 The AVL Tree
13.2.2 The Splay Tree
13.3 Spatial Data Structures
13.3.1 The K-D Tree
13.3.2 The PR quadtree
13.3.3 Other Point Data Structures
13.3.4 Other Spatial Data Structures
13.4 Further Reading
13.5 Exercises
13.6 Projects

V Theory of Algorithms

14 Analysis Techniques
14.1 Summation Techniques
14.2 Recurrence Relations
14.2.1 Estimating Upper and Lower Bounds
14.2.2 Expanding Recurrences
14.2.3 Divide and Conquer Recurrences
14.2.4 Average-Case Analysis of Quicksort
14.3 Amortized Analysis
14.4 Further Reading
14.5 Exercises
14.6 Projects

15 Lower Bounds
15.1 Introduction to Lower Bounds Proofs
15.2 Lower Bounds on Searching Lists
15.2.1 Searching in Unsorted Lists
15.2.2 Searching in Sorted Lists
15.3 Finding the Maximum Value
15.4 Adversarial Lower Bounds Proofs
15.5 State Space Lower Bounds Proofs
15.6 Finding the ith Best Element
15.7 Optimal Sorting
15.8 Further Reading
15.9 Exercises
15.10 Projects

16 Patterns of Algorithms
16.1 Greedy Algorithms
16.2 Dynamic Programming
16.2.1 Knapsack Problem
16.2.2 All-Pairs Shortest Paths
16.3 Randomized Algorithms
16.3.1 Skip Lists
16.4 Numerical Algorithms
16.4.1 Exponentiation
16.4.2 Largest Common Factor
16.4.3 Matrix Multiplication
16.4.4 Random Numbers
16.4.5 Fast Fourier Transform
16.5 Further Reading
16.6 Exercises
16.7 Projects

17 Limits to Computation
17.1 Reductions
17.2 Hard Problems
17.2.1 The Theory of NP-Completeness
17.2.2 NP-Completeness Proofs
17.2.3 Coping with NP-Complete Problems
17.3 Impossible Problems
17.3.1 Uncountability
17.3.2 The Halting Problem Is Unsolvable
17.4 Further Reading
17.5 Exercises
17.6 Projects

Bibliography
Index

Preface

We study data structures so that we can learn to write more efficient programs. But why must programs be efficient when new computers are faster every year? The reason is that our ambitions grow with our capabilities. Instead of rendering efficiency needs obsolete, the modern revolution in computing power and storage capability merely raises the efficiency stakes as we computerize more complex tasks.

The quest for program efficiency need not and should not conflict with sound design and clear coding. Creating efficient programs has little to do with “programming tricks” but rather is based on good organization of information and good algorithms. A programmer who has not mastered the basic principles of clear design is not likely to write efficient programs. Conversely, “software engineering” cannot be used as an excuse to justify inefficient performance. Generality in design can and should be achieved without sacrificing performance, but this can only be done if the designer understands how to measure performance and does so as an integral part of the design and implementation process. Most computer science curricula recognize that good programming skills begin with a strong emphasis on fundamental software engineering principles. Then, once a programmer has learned the principles of clear program design and implementation, the next step is to study the effects of data organization and algorithms on program efficiency.

Approach: This book describes many techniques for representing data.
These techniques are presented within the context of the following principles:

1. Each data structure and each algorithm has costs and benefits. Practitioners need a thorough understanding of how to assess costs and benefits to be able to adapt to new design challenges. This requires an understanding of the principles of algorithm analysis, and also an appreciation for the significant effects of the physical medium employed (e.g., data stored on disk versus main memory).
2. Related to costs and benefits is the notion of tradeoffs. For example, it is quite common to reduce time requirements at the expense of an increase in space requirements, or vice versa. Programmers face tradeoff issues regularly in all phases of software design and implementation, so the concept must become deeply ingrained.
3. Programmers should know enough about common practice to avoid reinventing the wheel. Thus, programmers need to learn the commonly used data structures, their related algorithms, and the most frequently encountered design patterns found in programming.
4. Data structures follow needs. Programmers must learn to assess application needs first, then find a data structure with matching capabilities. To do this requires competence in principles 1, 2, and 3.

As I have taught data structures through the years, I have found that design issues have played an ever greater role in my courses. This can be traced through the various editions of this textbook by the increasing coverage for design patterns and generic interfaces. The first edition had no mention of design patterns. The second edition had limited coverage of a few example patterns, and introduced the dictionary ADT and comparator classes. With the third edition, there is explicit coverage of some design patterns that are encountered when programming the basic data structures and algorithms covered in the book.
Using the Book in Class: Data structures and algorithms textbooks tend to fall into one of two categories: teaching texts or encyclopedias. Books that attempt to do both usually fail at both. This book is intended as a teaching text. I believe it is more important for a practitioner to understand the principles required to select or design the data structure that will best solve some problem than it is to memorize a lot of textbook implementations. Hence, I have designed this as a teaching text that covers most standard data structures, but not all. A few data structures that are not widely adopted are included to illustrate important principles. Some relatively new data structures that should become widely used in the future are included.

Within an undergraduate program, this textbook is designed for use in either an advanced lower division (sophomore or junior level) data structures course, or for a senior level algorithms course. New material has been added in the third edition to support its use in an algorithms course. Normally, this text would be used in a course beyond the standard freshman level “CS2” course that often serves as the initial introduction to data structures. Readers of this book should have programming experience, typically two semesters or the equivalent of a structured programming language such as Pascal or C, and including at least some exposure to Java. Readers who are already familiar with recursion will have an advantage. Students of data structures will also benefit from having first completed a good course in Discrete Mathematics. Nonetheless, Chapter 2 attempts to give a reasonably complete survey of the prerequisite mathematical topics at the level necessary to understand their use in this book. Readers may wish to refer back to the appropriate sections as needed when encountering unfamiliar mathematical material.
A sophomore-level class where students have only a little background in basic data structures or analysis (that is, background equivalent to what would be had from a traditional CS2 course) might cover Chapters 1-11 in detail, as well as selected topics from Chapter 13. That is how I use the book for my own sophomore-level class. Students with greater background might cover Chapter 1, skip most of Chapter 2 except for reference, briefly cover Chapters 3 and 4, and then cover Chapters 5-12 in detail. Again, only certain topics from Chapter 13 might be covered, depending on the programming assignments selected by the instructor. A senior-level algorithms course would focus on Chapters 11 and 14-17. Chapter 13 is intended in part as a source for larger programming exercises.

I recommend that all students taking a data structures course be required to implement some advanced tree structure, or another dynamic structure of comparable difficulty such as the skip list or sparse matrix representations of Chapter 12. None of these data structures are significantly more difficult to implement than the binary search tree, and any of them should be within a student’s ability after completing Chapter 5.

While I have attempted to arrange the presentation in an order that makes sense, instructors should feel free to rearrange the topics as they see fit. The book has been written so that once the reader has mastered Chapters 1-6, the remaining material has relatively few dependencies. Clearly, external sorting depends on understanding internal sorting and disk files. Section 6.2 on the UNION/FIND algorithm is used in Kruskal’s Minimum-Cost Spanning Tree algorithm. Section 9.2 on self-organizing lists mentions the buffer replacement schemes covered in Section 8.3. Chapter 14 draws on examples from throughout the book. Section 17.2 relies on knowledge of graphs. Otherwise, most topics depend only on material presented earlier within the same chapter.
Most chapters end with a section entitled “Further Reading.” These sections are not comprehensive lists of references on the topics presented. Rather, I include books and articles that, in my opinion, may prove exceptionally informative or entertaining to the reader. In some cases I include references to works that should become familiar to any well-rounded computer scientist.

Use of Java: The programming examples are written in Java™. As with any programming language, Java has both advantages and disadvantages. Java is a small language. There usually is only one way to do something, and this has the happy tendency of encouraging a programmer toward clarity when used correctly. In this respect, it is superior to C or C++. Java serves nicely for defining and using most traditional data structures such as lists and trees. On the other hand, Java is quite poor when used to do file processing, being both cumbersome and inefficient. It is also a poor language when fine control of memory is required. As an example, applications requiring memory management, such as those discussed in Section 12.3, are difficult to write in Java. Since I wish to stick to a single language throughout the text, like any programmer I must take the bad along with the good. The most important issue is to get the ideas across, whether or not those ideas are natural to a particular language of discourse. Most programmers will use a variety of programming languages throughout their career, and the concepts described in this book should prove useful in a variety of circumstances.

I do not wish to discourage those unfamiliar with Java from reading this book. I have attempted to make the examples as clear as possible while maintaining the advantages of Java. Java is used here strictly as a tool to illustrate data structures concepts.
Fortunately, Java is an easy language for C or Pascal programmers to read with a minimal amount of study of the syntax related to object-oriented programming. In particular, I make use of Java’s support for hiding implementation details, including features such as classes, private class members, and interfaces. These features of the language support the crucial concept of separating logical design, as embodied in the abstract data type, from physical implementation as embodied in the data structure.

I make no attempt to teach Java within the text. An Appendix is provided that describes the Java syntax and concepts necessary to understand the program examples. I also provide the actual Java code used in the text through anonymous FTP.

Inheritance, a key feature of object-oriented programming, is used only sparingly in the code examples. Inheritance is an important tool that helps programmers avoid duplication, and thus minimize bugs. From a pedagogical standpoint, however, inheritance often makes code examples harder to understand since it tends to spread the description for one logical unit among several classes. Thus, some of my class definitions for objects such as tree or list nodes do not take full advantage of possible inheritance from earlier code examples. This does not mean that a programmer should do likewise. Avoiding code duplication and minimizing errors are important goals. Treat the programming examples as illustrations of data structure principles, but do not copy them directly into your own programs.

My Java implementations serve to provide concrete illustrations of data structure principles. They are not meant to be a series of commercial-quality Java class implementations. The code examples provide less parameter checking than is sound programming practice for commercial programmers. Some parameter checking is included in the form of calls to methods in class Assert.
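As a rough illustration of the kind of helper meant here, the following is a minimal sketch of such a class. It assumes only what this preface states about it: two static methods, notFalse and notNull, that throw IllegalArgumentException when a check fails. It is not the book's actual source; the message strings and the demonstration in main are invented for this example.

```java
/**
 * Sketch of a parameter-checking helper in the spirit of the Assert class
 * named in the preface. Only the two method names and the exception type
 * come from the text; the message strings are assumptions.
 */
final class Assert {
    private Assert() {} // static helpers only; not meant to be instantiated

    /** Throws IllegalArgumentException if the condition is false. */
    static void notFalse(boolean condition) {
        if (!condition) {
            throw new IllegalArgumentException("Assertion failed: condition is false");
        }
    }

    /** Throws IllegalArgumentException if the reference is null. */
    static void notNull(Object reference) {
        if (reference == null) {
            throw new IllegalArgumentException("Assertion failed: reference is null");
        }
    }

    public static void main(String[] args) {
        notFalse(1 + 1 == 2);      // check passes, nothing happens
        notNull("a real object");  // check passes, nothing happens
        try {
            notNull(null);         // check fails and throws
        } catch (IllegalArgumentException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

A data structure method would call such a check on its arguments before doing any work, so that a bad parameter stops the program at the point of misuse rather than corrupting the structure silently.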
These methods are modeled after the C standard library function assert. Method Assert.notFalse takes a Boolean expression. If this expression evaluates to false, then the program terminates immediately. Method Assert.notNull takes a reference to class Object, and terminates the program if the value of the reference is null. (To be precise, these functions throw an IllegalArgumentException, which typically results in terminating the program unless the programmer takes action to handle the exception.) Terminating a program when a function receives a bad parameter is generally considered undesirable in real programs, but is quite adequate for understanding how a data structure is meant to operate. In real programming applications, Java’s exception handling features should be used to deal with input data errors.

I make a distinction in the text between “Java implementations” and “pseudocode.” Code labeled as a Java implementation has actually been compiled and tested on one or more Java compilers. Pseudocode examples often conform closely to Java syntax, but typically contain one or more lines of higher level description. Pseudocode is used where I perceived a greater pedagogical advantage to a simpler, but less precise, description.

Exercises and Projects: Proper implementation and analysis of data structures cannot be learned simply by reading a book. You must practice by implementing real programs, constantly comparing different techniques to see what really works best in a given situation.

One of the most important aspects of a course in data structures is that it is where students really learn to program using pointers and dynamic memory allocation, by implementing data structures such as linked lists and trees. It is also where students truly learn recursion. In our curriculum, this is the first course where students do significant design, because it often requires real data structures to motivate significant design exercises.
Finally, the fundamental differences between memory-based and disk-based data access cannot be appreciated without practical programming experience. For all of these reasons, a data structures course cannot succeed without a significant programming component. In our department, the data structures course is arguably the most difficult programming course in the curriculum.

Students should also work problems to develop their analytical abilities. I provide over 400 exercises and suggestions for programming projects. I urge readers to take advantage of them.

Contacting the Author and Supplementary Materials: A book such as this is sure to contain errors and have room for improvement. I welcome bug reports and constructive criticism. I can be reached by electronic mail via the Internet at [email protected]. Alternatively, comments can be mailed to

Cliff Shaffer
Department of Computer Science
Virginia Tech
Blacksburg, VA 24061

A set of transparency masters for use in conjunction with this book can be obtained via the WWW at http://www.cs.vt.edu/~shaffer/book.html. The Java code examples are also available at this site. Online Web pages for Virginia Tech’s sophomore-level data structures class can be found at URL http://courses.cs.vt.edu/~cs3114

This book was originally typeset by the author with LaTeX. The bibliography was prepared using BibTeX. The index was prepared using makeindex. The figures were mostly drawn with Xfig. Figures 3.1 and 9.8 were partially created using Mathematica.

Acknowledgments: It takes a lot of help from a lot of people to make a book. I wish to acknowledge a few of those who helped to make this book possible. I apologize for the inevitable omissions.

Virginia Tech helped make this whole thing possible through sabbatical research leave during Fall 1994, enabling me to get the project off the ground.
My department heads during the time I have written the various editions of this book, Dennis Kafura and Jack Carroll, provided unwavering moral support for this project. Mike Keenan, Lenny Heath, and Jeff Shaffer provided valuable input on early versions of the chapters. I also wish to thank Lenny Heath for many years of stimulating discussions about algorithms and analysis (and how to teach both to students). Steve Edwards deserves special thanks for spending so much time helping me on various redesigns of the C++ and Java code versions for the second and third editions, and many hours of discussion on the principles of program design. Thanks to Layne Watson for his help with Mathematica, and to Bo Begole, Philip Isenhour, Jeff Nielsen, and Craig Struble for much technical assistance. Thanks to Bill McQuain, Mark Abrams and Dennis Kafura for answering lots of silly questions about C++ and Java.

I am truly indebted to the many reviewers of the various editions of this manuscript. For the first edition these reviewers included J. David Bezek (University of Evansville), Douglas Campbell (Brigham Young University), Karen Davis (University of Cincinnati), Vijay Kumar Garg (University of Texas – Austin), Jim Miller (University of Kansas), Bruce Maxim (University of Michigan – Dearborn), Jeff Parker (Agile Networks/Harvard), Dana Richards (George Mason University), Jack Tan (University of Houston), and Lixin Tao (Concordia University). Without their help, this book would contain many more technical errors and many fewer insights.

For the second edition, I wish to thank these reviewers: Gurdip Singh (Kansas State University), Peter Allen (Columbia University), Robin Hill (University of Wyoming), Norman Jacobson (University of California – Irvine), Ben Keller (Eastern Michigan University), and Ken Bosworth (Idaho State University). In addition, I wish to thank Neil Stewart and Frank J. Thesen for their comments and ideas for improvement.
Third edition reviewers included Randall Lechlitner (University of Houston, Clear Lake) and Brian C. Hipp (York Technical College). I thank them for their comments.

Without the hard work of many people at Prentice Hall, none of this would be possible. Authors simply do not create printer-ready books on their own. Foremost thanks go to Kate Hargett, Petra Rector, Laura Steele, and Alan Apt, my editors over the years. My production editors, Irwin Zucker for the second edition, Kathleen Caren for the original C++ version, and Ed DeFelippis for the Java version, kept everything moving smoothly during that horrible rush at the end. Thanks to Bill Zobrist and Bruce Gregory (I think) for getting me into this in the first place. Others at Prentice Hall who helped me along the way include Truly Donovan, Linda Behrens, and Phyllis Bregman. I am sure I owe thanks to many others at Prentice Hall for their help in ways that I am not even aware of.

I wish to express my appreciation to Hanan Samet for teaching me about data structures. I learned much of the philosophy presented here from him as well, though he is not responsible for any problems with the result. Thanks to my wife Terry, for her love and support, and to my daughters Irena and Kate for pleasant diversions from working too hard. Finally, and most importantly, to all of the data structures students over the years who have taught me what is important and what should be skipped in a data structures course, and the many new insights they have provided. This book is dedicated to them.

Clifford A. Shaffer
Blacksburg, Virginia

PART I Preliminaries

1 Data Structures and Algorithms

How many cities with more than 250,000 people lie within 500 miles of Dallas, Texas? How many people in my company make over $100,000 per year? Can we connect all of our telephone customers with less than 1,000 miles of cable? To answer questions like these, it is not enough to have the necessary information.
We must organize that information in a way that allows us to find the answers in time to satisfy our needs.

Representing information is fundamental to computer science. The primary purpose of most computer programs is not to perform calculations, but to store and retrieve information, usually as fast as possible. For this reason, the study of data structures and the algorithms that manipulate them is at the heart of computer science. And that is what this book is about: helping you to understand how to structure information to support efficient processing.

This book has three primary goals. The first is to present the commonly used data structures. These form a programmer’s basic data structure “toolkit.” For many problems, some data structure in the toolkit provides a good solution.

The second goal is to introduce the idea of tradeoffs and reinforce the concept that there are costs and benefits associated with every data structure. This is done by describing, for each data structure, the amount of space and time required for typical operations.

The third goal is to teach how to measure the effectiveness of a data structure or algorithm. Only through such measurement can you determine which data structure in your toolkit is most appropriate for a new problem. The techniques presented also allow you to judge the merits of new data structures that you or others might invent.

There are often many approaches to solving a problem. How do we choose between them? At the heart of computer program design are two (sometimes conflicting) goals:

1. To design an algorithm that is easy to understand, code, and debug.
2. To design an algorithm that makes efficient use of the computer’s resources.

Ideally, the resulting program is true to both of these goals.
We might say that such a program is “elegant.” While the algorithms and program code examples presented here attempt to be elegant in this sense, it is not the purpose of this book to explicitly treat issues related to goal (1). These are primarily concerns of the discipline of Software Engineering. Rather, this book is mostly about issues relating to goal (2).

How do we measure efficiency? Chapter 3 describes a method for evaluating the efficiency of an algorithm or computer program, called asymptotic analysis. Asymptotic analysis also allows you to measure the inherent difficulty of a problem. The remaining chapters use asymptotic analysis techniques for every algorithm presented. This allows you to see how each algorithm compares to other algorithms for solving the same problem in terms of its efficiency.

This first chapter sets the stage for what is to follow, by presenting some higher-order issues related to the selection and use of data structures. We first examine the process by which a designer selects a data structure appropriate to the task at hand. We then consider the role of abstraction in program design. We briefly consider the concept of a design pattern and see some examples. The chapter ends with an exploration of the relationship between problems, algorithms, and programs.

1.1 A Philosophy of Data Structures

1.1.1 The Need for Data Structures

You might think that with ever more powerful computers, program efficiency is becoming less important. After all, processor speed and memory size still seem to double every couple of years. Won’t any efficiency problem we might have today be solved by tomorrow’s hardware?

As we develop more powerful computers, our history so far has always been to use additional computing power to tackle more complex problems, be it in the form of more sophisticated user interfaces, bigger problem sizes, or new problems previously deemed computationally infeasible.
More complex problems demand more computation, making the need for efficient programs even greater. Worse yet, as tasks become more complex, they become less like our everyday experience. Today’s computer scientists must be trained to have a thorough understanding of the principles behind efficient program design, because their ordinary life experiences often do not apply when designing computer programs.

In the most general sense, a data structure is any data representation and its associated operations. Even an integer or floating point number stored on the computer can be viewed as a simple data structure. More typically, a data structure is meant to be an organization or structuring for a collection of data items. A sorted list of integers stored in an array is an example of such a structuring.

Given sufficient space to store a collection of data items, it is always possible to search for specified items within the collection, print or otherwise process the data items in any desired order, or modify the value of any particular data item. Thus, it is possible to perform all necessary operations on any data structure. However, using the proper data structure can make the difference between a program running in a few seconds and one requiring many days.

A solution is said to be efficient if it solves the problem within the required resource constraints. Examples of resource constraints include the total space available to store the data (possibly divided into separate main memory and disk space constraints) and the time allowed to perform each subtask. A solution is sometimes said to be efficient if it requires fewer resources than known alternatives, regardless of whether it meets any particular requirements. The cost of a solution is the amount of resources that the solution consumes.
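To make the point about structuring and cost concrete, here is a small sketch that counts comparisons (one possible measure of cost) for two ways of searching the same sorted array of integers. The class and method names are hypothetical and chosen for this illustration; they are not taken from the book's code.

```java
/** Compares the cost, in comparisons, of two ways to search a sorted int array. */
final class SearchCost {
    /** Sequential search: counts elements examined until the key is found or the array ends. */
    static int sequentialComparisons(int[] data, int key) {
        int comparisons = 0;
        for (int value : data) {
            comparisons++;
            if (value == key) break;
        }
        return comparisons;
    }

    /** Binary search (requires ascending order): counts positions probed. */
    static int binaryComparisons(int[] sorted, int key) {
        int lo = 0, hi = sorted.length - 1, probes = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2; // written this way to avoid int overflow
            probes++;
            if (sorted[mid] == key) break;
            if (sorted[mid] < key) lo = mid + 1; else hi = mid - 1;
        }
        return probes;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] sorted = new int[n];
        for (int i = 0; i < n; i++) sorted[i] = 2 * i; // even numbers, ascending
        int key = sorted[n - 1]; // last element: worst case for sequential search
        System.out.println("sequential: " + sequentialComparisons(sorted, key));
        System.out.println("binary:     " + binaryComparisons(sorted, key));
    }
}
```

For the last element of a million-item array, the sequential search examines every element, while the binary search needs at most 20 probes. Both operate on the same data; only the binary search exploits the fact that the array is sorted.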
Most often, cost is measured in terms of one key resource such as time, with the implied assumption that the solution meets the other resource constraints.

It should go without saying that people write programs to solve problems. However, it is crucial to keep this truism in mind when selecting a data structure to solve a particular problem. Only by first analyzing the problem to determine the performance goals that must be achieved can there be any hope of selecting the right data structure for the job. Poor program designers ignore this analysis step and apply a data structure that they are familiar with but which is inappropriate to the problem. The result is typically a slow program. Conversely, there is no sense in adopting a complex representation to “improve” a program that can meet its performance goals when implemented using a simpler design.

When selecting a data structure to solve a problem, you should follow these steps.

1. Analyze your problem to determine the basic operations that must be supported. Examples of basic operations include inserting a data item into the data structure, deleting a data item from the data structure, and finding a specified data item.
2. Quantify the resource constraints for each operation.
3. Select the data structure that best meets these requirements.

This three-step approach to selecting a data structure operationalizes a data-centered view of the design process. The first concern is for the data and the operations to be performed on them, the next concern is the representation for those data, and the final concern is the implementation of that representation.

Resource constraints on certain key operations, such as search, inserting data records, and deleting data records, normally drive the data structure selection process.
Many issues relating to the relative importance of these operations are addressed by the following three questions, which you should ask yourself whenever you must choose a data structure:

- Are all data items inserted into the data structure at the beginning, or are insertions interspersed with other operations?
- Can data items be deleted?
- Are all data items processed in some well-defined order, or is search for specific data items allowed?

Typically, interspersing insertions with other operations, allowing deletion, and supporting search for data items all require more complex representations.

1.1.2 Costs and Benefits

Each data structure has associated costs and benefits. In practice, it is hardly ever true that one data structure is better than another for use in all situations. If one data structure or algorithm is superior to another in all respects, the inferior one will usually have long been forgotten. For nearly every data structure and algorithm presented in this book, you will see examples of where it is the best choice. Some of the examples might surprise you.

A data structure requires a certain amount of space for each data item it stores, a certain amount of time to perform a single basic operation, and a certain amount of programming effort. Each problem has constraints on available space and time. Each solution to a problem makes use of the basic operations in some relative proportion, and the data structure selection process must account for this. Only after a careful analysis of your problem’s characteristics can you determine the best data structure for the task.

Example 1.1 A bank must support many types of transactions with its customers, but we will examine a simple model where customers wish to open accounts, close accounts, and add money or withdraw money from accounts. We can consider this problem at two distinct levels: (1) the requirements for the physical infrastructure and workflow process that the bank uses in its interactions with its customers, and (2) the requirements for the database system that manages the accounts.

The typical customer opens and closes accounts far less often than he or she accesses the account. Customers are willing to wait many minutes while accounts are created or deleted but are typically not willing to wait more than a brief time for individual account transactions such as a deposit or withdrawal. These observations can be considered as informal specifications for the time constraints on the problem.

It is common practice for banks to provide two tiers of service. Human tellers or automated teller machines (ATMs) support customer access to account balances and updates such as deposits and withdrawals. Special service representatives are typically provided (during restricted hours) to handle opening and closing accounts. Teller and ATM transactions are expected to take little time. Opening or closing an account can take much longer (perhaps up to an hour from the customer’s perspective).

From a database perspective, we see that ATM transactions do not modify the database significantly. For simplicity, assume that if money is added or removed, this transaction simply changes the value stored in an account record. Adding a new account to the database is allowed to take several minutes. Deleting an account need have no time constraint, because from the customer’s point of view all that matters is that all the money be returned (equivalent to a withdrawal). From the bank’s point of view, the account record might be removed from the database system after business hours, or at the end of the monthly account cycle.
When considering the choice of data structure to use in the database system that manages customer accounts, we see that a data structure that has little concern for the cost of deletion, but is highly efficient for search and moderately efficient for insertion, should meet the resource constraints imposed by this problem. Records are accessible by unique account number (sometimes called an exact-match query). One data structure that meets these requirements is the hash table described in Section 9.4. Hash tables allow for extremely fast exact-match search. A record can be modified quickly when the modification does not affect its space requirements. Hash tables also support efficient insertion of new records. While deletions can also be supported efficiently, too many deletions lead to some degradation in performance for the remaining operations. However, the hash table can be reorganized periodically to restore the system to peak efficiency. Such reorganization can occur offline so as not to affect ATM transactions.

Example 1.2 A company is developing a database system containing information about cities and towns in the United States. There are many thousands of cities and towns, and the database program should allow users to find information about a particular place by name (another example of an exact-match query). Users should also be able to find all places that match a particular value or range of values for attributes such as location or population size. This is known as a range query.

A reasonable database system must answer queries quickly enough to satisfy the patience of a typical user. For an exact-match query, a few seconds is satisfactory. If the database is meant to support range queries that can return many cities that match the query specification, the entire operation may be allowed to take longer, perhaps on the order of a minute.
To meet this requirement, it will be necessary to support operations that process range queries efficiently by processing all cities in the range as a batch, rather than as a series of operations on individual cities.

The hash table suggested in the previous example is inappropriate for implementing our city database, because it cannot perform efficient range queries. The B+-tree of Section 10.5.1 supports large databases, insertion and deletion of data records, and range queries. However, a simple linear index as described in Section 10.1 would be more appropriate if the database is created once, and then never changed, such as an atlas distributed on a CD-ROM.

1.2 Abstract Data Types and Data Structures

The previous section used the terms “data item” and “data structure” without properly defining them. This section presents terminology and motivates the design process embodied in the three-step approach to selecting a data structure. This motivation stems from the need to manage the tremendous complexity of computer programs.

A type is a collection of values. For example, the Boolean type consists of the values true and false. The integers also form a type. An integer is a simple type because its values contain no subparts. A bank account record will typically contain several pieces of information such as name, address, account number, and account balance. Such a record is an example of an aggregate type or composite type. A data item is a piece of information or a record whose value is drawn from a type. A data item is said to be a member of a type.

A data type is a type together with a collection of operations to manipulate the type. For example, an integer variable is a member of the integer data type. Addition is an example of an operation on the integer data type.

A distinction should be made between the logical concept of a data type and its physical implementation in a computer program.
For example, there are two traditional implementations for the list data type: the linked list and the array-based list. The list data type can therefore be implemented using a linked list or an array. Even the term “array” is ambiguous in that it can refer either to a data type or an implementation. “Array” is commonly used in computer programming to mean a contiguous block of memory locations, where each memory location stores one fixed-length data item. By this meaning, an array is a physical data structure. However, array can also mean a logical data type composed of a (typically homogeneous) collection of data items, with each data item identified by an index number. It is possible to implement arrays in many different ways. For example, Section 12.2 describes the data structure used to implement a sparse matrix, a large two-dimensional array that stores only a relatively few non-zero values. This implementation is quite different from the physical representation of an array as contiguous memory locations.

An abstract data type (ADT) is the realization of a data type as a software component. The interface of the ADT is defined in terms of a type and a set of operations on that type. The behavior of each operation is determined by its inputs and outputs. An ADT does not specify how the data type is implemented. These implementation details are hidden from the user of the ADT and protected from outside access, a concept referred to as encapsulation.

A data structure is the implementation for an ADT. In an object-oriented language such as Java, an ADT and its implementation together make up a class. Each operation associated with the ADT is implemented by a member function or method. The variables that define the space required by a data item are referred to as data members. An object is an instance of a class, that is, something that is created and takes up storage during the execution of a computer program.
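To make the split between ADT and data structure concrete, here is a minimal sketch in Java. The names IntList, ArrayIntList, and IntListDemo are hypothetical and chosen only for illustration; they are not the list classes developed later in the book. The interface plays the role of the ADT (a type plus operations, with no implementation), and the class is one possible array-based data structure behind it.

```java
import java.util.Arrays;

/** Logical form: the ADT names the operations but not how they work. */
interface IntList {
    void insert(int pos, int value); // insert value at position pos
    int remove(int pos);             // delete and return the value at pos
    int length();                    // number of integers currently stored
    boolean isEmpty();               // true if the list holds no integers
    void clear();                    // reinitialize the list
}

/** Physical form: one possible data structure, backed by an array. */
class ArrayIntList implements IntList {
    private int[] items = new int[4]; // data member: the storage
    private int size = 0;             // data member: the current length

    @Override public void insert(int pos, int value) {
        if (pos < 0 || pos > size) throw new IndexOutOfBoundsException("pos " + pos);
        if (size == items.length) items = Arrays.copyOf(items, size * 2); // grow when full
        System.arraycopy(items, pos, items, pos + 1, size - pos);          // shift right
        items[pos] = value;
        size++;
    }

    @Override public int remove(int pos) {
        if (pos < 0 || pos >= size) throw new IndexOutOfBoundsException("pos " + pos);
        int removed = items[pos];
        System.arraycopy(items, pos + 1, items, pos, size - pos - 1);      // shift left
        size--;
        return removed;
    }

    @Override public int length() { return size; }
    @Override public boolean isEmpty() { return size == 0; }
    @Override public void clear() { size = 0; }
}

class IntListDemo {
    public static void main(String[] args) {
        IntList list = new ArrayIntList();  // an object: an instance of the class
        list.insert(0, 10);
        list.insert(1, 30);
        list.insert(1, 20);                 // list is now 10, 20, 30
        System.out.println(list.remove(1)); // prints 20
        System.out.println(list.length());  // prints 2
    }
}
```

Code that declares its variable as IntList depends only on the logical form, so a linked implementation could be substituted without changing the caller.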
The term “data structure” often refers to data stored in a computer’s main memory. The related term file structure often refers to the organization of data on peripheral storage, such as a disk drive or CD-ROM.

Example 1.3 The mathematical concept of an integer, along with operations that manipulate integers, forms a data type. The Java int variable type is a physical representation of the abstract integer. The int variable type, along with the operations that act on an int variable, forms an ADT. Unfortunately, the int implementation is not completely true to the abstract integer, as there are limitations on the range of values an int variable can store. If these limitations prove unacceptable, then some other representation for the ADT “integer” must be devised, and a new implementation must be used for the associated operations.

Example 1.4 An ADT for a list of integers might specify the following operations:

- Insert a new integer at a particular position in the list.
- Return true if the list is empty.
- Reinitialize the list.
- Return the number of integers currently in the list.
- Delete the integer at a particular position in the list.

From this description, the input and output of each operation should be clear, but the implementation for lists has not been specified.

One application that makes use of some ADT might use particular member functions of that ADT more than a second application, or the two applications might have different time requirements for the various operations. These differences in the requirements of applications are the reason why a given ADT might be supported by more than one implementation.

Example 1.5 Two popular implementations for large disk-based database applications are hashing (Section 9.4) and the B+-tree (Section 10.5). Both support efficient insertion and deletion of records, and both support exact-match queries.
However, hashing is more efficient than the B+-tree for exact-match queries. On the other hand, the B+-tree can perform range queries efficiently, while hashing is hopelessly inefficient for range queries. Thus, if the database application limits searches to exact-match queries, hashing is preferred. On the other hand, if the application requires support for range queries, the B+-tree is preferred. Despite these performance issues, both implementations solve versions of the same problem: updating and searching a large collection of records.

The concept of an ADT can help us to focus on key issues even in non-computing applications.

Example 1.6 When operating a car, the primary activities are steering, accelerating, and braking. On nearly all passenger cars, you steer by turning the steering wheel, accelerate by pushing the gas pedal, and brake by pushing the brake pedal. This design for cars can be viewed as an ADT with operations “steer,” “accelerate,” and “brake.” Two cars might implement these operations in radically different ways, say with different types of engine, or front- versus rear-wheel drive. Yet, most drivers can operate many different cars because the ADT presents a uniform method of operation that does not require the driver to understand the specifics of any particular engine or drive design. These differences are deliberately hidden.

The concept of an ADT is one instance of an important principle that must be understood by any successful computer scientist: managing complexity through abstraction. A central theme of computer science is complexity and techniques for handling it. Humans deal with complexity by assigning a label to an assembly of objects or concepts and then manipulating the label in place of the assembly. Cognitive psychologists call such a label a metaphor. A particular label might be related to other pieces of information or other labels.
This collection can in turn be given a label, forming a hierarchy of concepts and labels. This hierarchy of labels allows us to focus on important issues while ignoring unnecessary details.

Example 1.7 We apply the label “hard drive” to a collection of hardware that manipulates data on a particular type of storage device, and we apply the label “CPU” to the hardware that controls execution of computer instructions. These and other labels are gathered together under the label “computer.” Because even small home computers have millions of components, some form of abstraction is necessary to comprehend how a computer operates.

Consider how you might go about the process of designing a complex computer program that implements and manipulates an ADT. The ADT is implemented in one part of the program by a particular data structure. While designing those parts of the program that use the ADT, you can think in terms of operations on the data type without concern for the data structure’s implementation. Without this ability to simplify your thinking about a complex program, you would have no hope of understanding or implementing it.

Example 1.8 Consider the design for a relatively simple database system stored on disk. Typically, records on disk in such a program are accessed through a buffer pool (see Section 8.3) rather than directly. Variable length records might use a memory manager (see Section 12.3) to find an appropriate location within the disk file to place the record. Multiple index structures (see Chapter 10) will typically be used to access records in various ways. Thus, we have a chain of classes, each with its own responsibilities and access privileges. A database query from a user is implemented by searching an index structure. This index requests access to the record by means of a request to the buffer pool.
If a record is being inserted or deleted, such a request goes through the memory manager, which in turn interacts with the buffer pool to gain access to the disk file. A program such as this is far too complex for nearly any human programmer to keep all of the details in his or her head at once. The only way to design and implement such a program is through proper use of abstraction and metaphors. In object-oriented programming, such abstraction is handled using classes.

Data types have both a logical and a physical form. The definition of the data type in terms of an ADT is its logical form. The implementation of the data type as a data structure is its physical form. Figure 1.1 illustrates this relationship between logical and physical forms for data types. When you implement an ADT, you are dealing with the physical form of the associated data type. When you use an ADT elsewhere in your program, you are concerned with the associated data type’s logical form. Some sections of this book focus on physical implementations for a given data structure. Other sections use the logical ADT for the data type in the context of a higher-level task.

Example 1.9 A particular Java environment might provide a library that includes a list class. The logical form of the list is defined by the public functions, their inputs, and their outputs that define the class. This might be all that you know about the list class implementation, and this should be all you need to know. Within the class, a variety of physical implementations for lists is possible. Several are described in Section 4.1.

Figure 1.1 The relationship between data items, abstract data types, and data structures. The ADT defines the logical form of the data type. The data structure implements the physical form of the data type. (Diagram: the data type appears as an ADT, with a type and operations, giving the logical form of the data items, and as a data structure, with storage space and subroutines, giving the physical form of the data items.)

1.3 Design Patterns

At a higher level of abstraction than ADTs are abstractions for describing the design of programs, that is, the interactions of objects and classes. Experienced software designers learn and reuse various techniques for combining software components. Such techniques are sometimes referred to as design patterns.

A design pattern embodies and generalizes important design concepts for a recurring problem. A primary goal of design patterns is to quickly transfer the knowledge gained by expert designers to newer programmers. Another goal is to allow for efficient communication between programmers. It is much easier to discuss a design issue when you share a vocabulary relevant to the topic.

Specific design patterns emerge from the discovery that a particular design problem appears repeatedly in many contexts. They are meant to solve real problems. Design patterns are a bit like generics: They describe the structure for a design solution, with the details filled in for any given problem. Design patterns are a bit like data structures: Each one provides costs and benefits, which implies that tradeoffs are possible. Therefore, a given design pattern might have variations on its application to match the various tradeoffs inherent in a given situation.

The rest of this section introduces a few simple design patterns that are used later in the book.

1.3.1 Flyweight

The Flyweight design pattern is meant to solve the following problem. You have an application with many objects. Some of these objects are identical in the information that they contain, and the role that they play. But they must be reached from various places, and conceptually they really are distinct objects.
Because so much information is shared, we would like to take advantage of the opportunity to reduce memory cost by sharing space. An example comes from representing the layout for a document. The letter “C” might reasonably be represented by an object that describes that character’s strokes and bounding box. However, we don’t want to create a separate “C” object everywhere in the document that a “C” appears. The solution is to allocate a single copy of the shared representation for the “C” object. Then, every place in the document that needs a “C” in a given font, size, and typeface will reference this single copy. The various instances of references to “C” are called flyweights. A flyweight includes the reference to the shared information, and might include additional information specific to that instance.

We could imagine describing the layout of text on a page by using a tree structure. The root of the tree is a node representing the page. The page has multiple child nodes, one for each column. The column nodes have child nodes for each row. And the rows have child nodes for each character. These representations for characters are the flyweights. For example, each instance for “C” will contain a reference to the shared information about strokes and shapes, and it might also contain the exact location for that instance of the character on the page.

Flyweights are used in the implementation for the PR quadtree data structure for storing collections of point objects, described in Section 13.3. In a PR quadtree, we again have a tree with leaf nodes. Many of these leaf nodes (the ones that represent empty areas) contain the same information. These identical nodes can be implemented using the Flyweight design pattern for better memory efficiency.
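The document-layout example can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the classes GlyphShape, GlyphFactory, and PlacedGlyph are hypothetical names invented here, and the shared shape data is reduced to a symbol and a font name. The factory keeps one shared shape per (symbol, font) pair, and each flyweight holds a reference to that shape plus its own position.

```java
import java.util.HashMap;
import java.util.Map;

/** Shared, immutable description of a character (stands in for strokes and bounding box). */
final class GlyphShape {
    final char symbol;
    final String font;

    GlyphShape(char symbol, String font) {
        this.symbol = symbol;
        this.font = font;
    }
}

/** Hands out a single shared GlyphShape per (symbol, font) pair. */
final class GlyphFactory {
    private final Map<String, GlyphShape> cache = new HashMap<>();

    GlyphShape get(char symbol, String font) {
        // Create the shape only the first time this pair is requested.
        return cache.computeIfAbsent(font + ":" + symbol,
                                     key -> new GlyphShape(symbol, font));
    }

    int sharedCount() { return cache.size(); }
}

/** A flyweight: a reference to the shared shape plus instance-specific data. */
final class PlacedGlyph {
    final GlyphShape shape; // shared among all placements of this character
    final int row;          // specific to this placement
    final int col;

    PlacedGlyph(GlyphShape shape, int row, int col) {
        this.shape = shape;
        this.row = row;
        this.col = col;
    }
}

class FlyweightDemo {
    public static void main(String[] args) {
        GlyphFactory factory = new GlyphFactory();
        PlacedGlyph a = new PlacedGlyph(factory.get('C', "Times"), 0, 0);
        PlacedGlyph b = new PlacedGlyph(factory.get('C', "Times"), 3, 7);
        System.out.println(a.shape == b.shape);    // prints true: one shared shape
        System.out.println(factory.sharedCount()); // prints 1
    }
}
```

The memory saving comes from the fact that the potentially large shape data is stored once, while each placement costs only a reference and two integers.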
1.3.2 Visitor

Given a tree of objects to describe a page layout, we might wish to perform some activity on every node in the tree. Section 5.2 discusses tree traversal, which is the process of visiting every node in the tree in a defined order. A simple example for our text composition application might be to count the number of nodes in the tree that represents the page. At another time, we might wish to print a listing of all the nodes for debugging purposes.

We could write a separate traversal function for each such activity that we intend to perform on the tree. A better approach would be to write a generic traversal function, and pass in the activity to be performed at each node. This organization constitutes the visitor design pattern. The visitor design pattern is used in Sections 5.2 (tree traversal) and 11.3 (graph traversal).

1.3.3 Composite

There are two fundamental approaches to dealing with the relationship between a collection of actions and a hierarchy of object types. First consider the typical procedural approach. Say we have a base class for page layout entities, with a subclass hierarchy to define specific subtypes (page, columns, rows, figures, characters, etc.). And say there are actions to be performed on a collection of such objects (such as rendering the objects to the screen). The procedural design approach is for each action to be implemented as a method that takes as a parameter a pointer to the base class type. Each such method will traverse through the collection of objects, visiting each object in turn. Each method contains something like a case statement that defines the details of the action for each subclass in the collection (e.g., page, column, row, character). We can cut the code down some by using the visitor design pattern so that we only need to write the traversal once, and then write a visitor subroutine for each action that might be applied to the collection of objects.
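The idea of writing the traversal once and passing in the activity can be sketched as follows. This is an illustrative sketch, not the book's traversal code: LayoutNode and Traversal are hypothetical names, and the standard java.util.function.Consumer interface stands in for whatever visitor interface a real design would define.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** A node in a page-layout tree (hypothetical, for illustration only). */
class LayoutNode {
    final String label;
    final List<LayoutNode> children = new ArrayList<>();

    LayoutNode(String label) { this.label = label; }

    LayoutNode add(LayoutNode child) {
        children.add(child);
        return this; // allow chained construction
    }
}

class Traversal {
    /** Generic preorder traversal, written once; the activity is passed in. */
    static void preorder(LayoutNode node, Consumer<LayoutNode> visitor) {
        visitor.accept(node);
        for (LayoutNode child : node.children) {
            preorder(child, visitor);
        }
    }
}

class VisitorDemo {
    public static void main(String[] args) {
        LayoutNode page = new LayoutNode("page").add(
            new LayoutNode("column").add(
                new LayoutNode("row").add(new LayoutNode("C"))));

        // Activity 1: count the nodes.
        int[] count = {0};
        Traversal.preorder(page, n -> count[0]++);
        System.out.println(count[0]); // prints 4

        // Activity 2: list the nodes for debugging.
        Traversal.preorder(page, n -> System.out.println(n.label));
    }
}
```

Both activities reuse the same preorder method; only the small function handed to it changes.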
But each such visitor subroutine must still contain logic for dealing with each of the possible subclasses.

In our page composition application, there are only a few activities that we would like to perform on the page representation. We might render the objects in full detail. Or we might want a “rough draft” rendering that prints only the bounding boxes of the objects. If we come up with a new activity to apply to the collection of objects, we do not need to change any of the code that implements the existing activities. But adding new activities won’t happen often for this application. In contrast, there could be many object types, and we might frequently add new object types to our implementation. Unfortunately, adding a new object type requires that we modify each activity, and the subroutines implementing the activities get rather long case statements to distinguish the behavior of the many subclasses.

An alternative design is to have each object subclass in the hierarchy embody the action for each of the various activities that might be performed. Each subclass will have code to perform each activity (such as full rendering or bounding box rendering). Then, if we wish to apply the activity to the collection, we simply call the first object in the collection and specify the action (as a method call on that object). In the case of our page layout and its hierarchical collection of objects, those objects that contain other objects (such as a row object that contains letters) will call the appropriate method for each child. If we want to add a new activity with this organization, we have to change the code for every subclass. But this is relatively rare for our text compositing application. In contrast, adding a new object into the subclass hierarchy (which for this application is far more likely than adding a new rendering function) is easy. Adding a new subclass does not require changing any of the existing subclasses. It merely requires that we define the behavior of each activity that can be performed on that subclass.

This second design approach of burying the functional activity in the subclasses is called the Composite design pattern. A detailed example for using the Composite design pattern is presented in Section 5.3.1.

1.3.4 Strategy

Our final example of a design pattern lets us encapsulate and make interchangeable a set of alternative actions that might be performed as part of some larger activity. Again continuing our text compositing example, each output device that we wish to render to will require its own function for doing the actual rendering. That is, the objects will be broken down into constituent pixels or strokes, but the actual mechanics of rendering a pixel or stroke will depend on the output device. We don’t want to build this rendering functionality into the object subclasses. Instead, we want to pass to the subroutine performing the rendering action a method or class that does the appropriate rendering details for that output device. That is, we wish to hand to the object the appropriate “strategy” for accomplishing the details of the rendering task. Thus, we call this approach the Strategy design pattern.

The Strategy design pattern will be discussed further in Chapter 7. There, a sorting function is given a class (called a comparator) that understands how to extract and compare the key values for records to be sorted. In this way, the sorting function does not need to know any details of how its record type is implemented.

One of the biggest challenges to understanding design patterns is that many of them appear to be pretty much the same. For example, you might be confused about the difference between the composite pattern and the visitor pattern.
The distinction is that the composite design pattern is about whether to give control of the traversal process to the nodes of the tree or to the tree itself. Both approaches can make use of the visitor design pattern to avoid rewriting the traversal function many times, by encapsulating the activity performed at each node.

But isn’t the strategy design pattern doing the same thing? The difference between the visitor pattern and the strategy pattern is more subtle. Here the difference is primarily one of intent and focus. In both the strategy design pattern and the visitor design pattern, an activity is being passed in as a parameter. The strategy design pattern is focused on encapsulating an activity that is part of a larger process, so that different ways of performing that activity can be substituted. The visitor design pattern is focused on encapsulating an activity that will be performed on all members of a collection so that completely different activities can be substituted within a generic method that accesses all of the collection members.

1.4 Problems, Algorithms, and Programs

Programmers commonly deal with problems, algorithms, and computer programs. These are three distinct concepts.

Problems: As your intuition would suggest, a problem is a task to be performed. It is best thought of in terms of inputs and matching outputs. A problem definition should not include any constraints on how the problem is to be solved. The solution method should be developed only after the problem is precisely defined and thoroughly understood. However, a problem definition should include constraints on the resources that may be consumed by any acceptable solution. For any problem to be solved by a computer, there are always such constraints, whether stated or implied. For example, any computer program may use only the main memory and disk space available, and it must run in a “reasonable” amount of time.
Problems can be viewed as functions in the mathematical sense. A function is a matching between inputs (the domain) and outputs (the range). An input to a function might be a single value or a collection of information. The values making up an input are called the parameters of the function. A specific selection of values for the parameters is called an instance of the problem. For example, the input parameter to a sorting function might be an array of integers. A particular array of integers, with a given size and specific values for each position in the array, would be an instance of the sorting problem. Different instances might generate the same output. However, any problem instance must always result in the same output every time the function is computed using that particular input.

This concept of all problems behaving like mathematical functions might not match your intuition for the behavior of computer programs. You might know of programs to which you can give the same input value on two separate occasions, and two different outputs will result. For example, if you type “date” to a typical UNIX command line prompt, you will get the current date. Naturally the date will be different on different days, even though the same command is given. However, there is obviously more to the input for the date program than the command that you type to run the program. The date program computes a function. In other words, on any particular day there can only be a single answer returned by a properly running date program on a completely specified input. For all computer programs, the output is completely determined by the program’s full set of inputs. Even a “random number generator” is completely determined by its inputs (although some random number generating systems appear to get around this by accepting a random input from a physical process beyond the user’s control). The relationship between programs and functions is explored further in Section 17.3.
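The claim about random number generators is easy to check in Java. The sketch below uses the standard java.util.Random class, whose output sequence is fully determined by its seed; the class name DeterminismDemo and the method draw are hypothetical names chosen for this illustration. Once the seed is treated as part of the input, the program behaves as a function.

```java
import java.util.Arrays;
import java.util.Random;

class DeterminismDemo {
    /** Draws n values in [0, 100) from a generator whose only input is the seed. */
    static int[] draw(long seed, int n) {
        Random rng = new Random(seed);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextInt(100);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] first = draw(42L, 5);
        int[] second = draw(42L, 5);
        // Same full input (seed and count), therefore the same output.
        System.out.println(Arrays.equals(first, second)); // prints true
    }
}
```

A generator created with the no-argument constructor appears unpredictable only because its seed is chosen for you, which is an input you did not supply yourself.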
Algorithms: An algorithm is a method or a process followed to solve a problem. If the problem is viewed as a function, then an algorithm is an implementation for the function that transforms an input to the corresponding output. A problem can be solved by many different algorithms. A given algorithm solves only one problem (i.e., computes a particular function). This book covers many problems, and for several of these problems I present more than one algorithm. For the important problem of sorting I present nearly a dozen algorithms!

The advantage of knowing several solutions to a problem is that solution A might be more efficient than solution B for a specific variation of the problem, or for a specific class of inputs to the problem, while solution B might be more efficient than A for another variation or class of inputs. For example, one sorting algorithm might be the best for sorting a small collection of integers, another might be the best for sorting a large collection of integers, and a third might be the best for sorting a collection of variable-length strings.

By definition, an algorithm possesses several properties. Something can only be called an algorithm to solve a particular problem if it has all of the following properties.

1. It must be correct. In other words, it must compute the desired function, converting each input to the correct output. Note that every algorithm implements some function, because every algorithm maps every input to some output (even if that output is a system crash). At issue here is whether a given algorithm implements the intended function.

2. It is composed of a series of concrete steps. Concrete means that the action described by that step is completely understood, and doable, by the person or machine that must perform the algorithm. Each step must also be doable in a finite amount of time.
Thus, the algorithm gives us a “recipe” for solving the problem by performing a series of steps, where each such step is within our capacity to perform. The ability to perform a step can depend on who or what is intended to execute the recipe. For example, the steps of a cookie recipe in a cookbook might be considered sufficiently concrete for instructing a human cook, but not for programming an automated cookie-making factory.

3. There can be no ambiguity as to which step will be performed next. Often it is the next step of the algorithm description. Selection (e.g., the if statements in Java) is normally a part of any language for describing algorithms. Selection allows a choice for which step will be performed next, but the selection process is unambiguous at the time when the choice is made.

4. It must be composed of a finite number of steps. If the description for the algorithm were made up of an infinite number of steps, we could never hope to write it down, nor implement it as a computer program. Most languages for describing algorithms (including English and “pseudocode”) provide some way to perform repeated actions, known as iteration. Examples of iteration in programming languages include the while and for loop constructs of Java. Iteration allows for short descriptions, with the number of steps actually performed controlled by the input.

5. It must terminate. In other words, it may not go into an infinite loop.

Programs: We often think of a computer program as an instance, or concrete representation, of an algorithm in some programming language. In this book, nearly all of the algorithms are presented in terms of programs, or parts of programs. Naturally, there are many programs that are instances of the same algorithm, because any modern computer programming language can be used to implement the same collection of algorithms (although some programming languages can make life easier for the programmer).
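As a rough illustration of a program as one concrete instance of an algorithm, here is a minimal sketch in Java of the simple algorithm "scan the array once, remembering the largest value seen so far." The class name LargestDemo and the method name largest are hypothetical, chosen only for this example. The comments point to where the properties listed above (selection, iteration, and termination) show up in the code.

```java
class LargestDemo {
    /** Returns the largest value in a non-empty array. */
    static int largest(int[] a) {
        if (a == null || a.length == 0) {
            throw new IllegalArgumentException("array must be non-empty");
        }
        int best = a[0];
        // Iteration: a short description whose number of executed steps
        // is controlled by the input (the length of the array).
        for (int i = 1; i < a.length; i++) {
            // Selection: which step comes next is unambiguous once
            // the comparison has been evaluated.
            if (a[i] > best) {
                best = a[i];
            }
        }
        // Termination: the loop counter only increases and is bounded by a.length.
        return best;
    }

    public static void main(String[] args) {
        System.out.println(largest(new int[] {7, 11, 42, 3})); // prints 42
    }
}
```

The same algorithm could equally be written with a while loop, or in another programming language; each such version would be a different program but an instance of the same algorithm.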
To simplify presentation throughout the remainder of the text, I often use the terms “algorithm” and “program” interchangeably, despite the fact that they are really separate concepts. By definition, an algorithm must provide sufficient detail that it can be converted into a program when needed.

The requirement that an algorithm must terminate means that not all computer programs meet the technical definition of an algorithm. Your operating system is one such program. However, you can think of the various tasks for an operating system (each with associated inputs and outputs) as individual problems, each solved by specific algorithms implemented by a part of the operating system program, and each one of which terminates once its output is produced.

To summarize: A problem is a function or a mapping of inputs to outputs. An algorithm is a recipe for solving a problem whose steps are concrete and unambiguous. The algorithm must be correct, of finite length, and must terminate for all inputs. A program is an instantiation of an algorithm in a computer programming language.

1.5 Further Reading

The first authoritative work on data structures and algorithms was the series of books The Art of Computer Programming by Donald E. Knuth, with Volumes 1 and 3 being most relevant to the study of data structures [Knu97, Knu98]. A modern encyclopedic approach to data structures and algorithms that should be easy to understand once you have mastered this book is Algorithms by Robert Sedgewick [Sed03]. For an excellent and highly readable (but more advanced) teaching introduction to algorithms, their design, and their analysis, see Introduction to Algorithms: A Creative Approach by Udi Manber [Man89]. For an advanced, encyclopedic approach, see Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein [CLRS01]. Steven S.
Skiena’s The Algorithm Design Manual [Ski98] provides pointers to many implementations for data structures and algorithms that are available on the Web. For a gentle introduction to ADTs and program specification, see Abstract Data Types: Their Specification, Representation, and Use by Thomas, Robinson, and Emms [TRE88].

The claim that all modern programming languages can implement the same algorithms (stated more precisely, any function that is computable by one programming language is computable by any programming language with certain standard capabilities) is a key result from computability theory. For an easy introduction to this field see James L. Hein, Discrete Structures, Logic, and Computability [Hei03].

Much of computer science is devoted to problem solving. Indeed, this is what attracts many people to the field. How to Solve It by George Pólya [Pól57] is considered to be the classic work on how to improve your problem-solving abilities. If you want to be a better student (as well as a better problem solver in general), see Strategies for Creative Problem Solving by Folger and LeBlanc [FL95], Effective Problem Solving by Marvin Levine [Lev94], and Problem Solving & Comprehension by Arthur Whimbey and Jack Lochhead [WL99].

See The Origin of Consciousness in the Breakdown of the Bicameral Mind by Julian Jaynes [Jay90] for a good discussion on how humans use the concept of metaphor to handle complexity. More directly related to computer science education and programming, see “Cogito, Ergo Sum! Cognitive Processes of Students Dealing with Data Structures” by Dan Aharoni [Aha00] for a discussion on moving from programming-context thinking to higher-level (and more design-oriented) programming-free thinking.

On a more pragmatic level, most people study data structures to write better programs. If you expect your program to work correctly and efficiently, it must first be understandable to yourself and your co-workers.
Kernighan and Pike’s The Practice of Programming [KP99] discusses a number of practical issues related to programming, including good coding and documentation style. For an excellent (and entertaining!) introduction to the difficulties involved with writing large programs, read the classic The Mythical Man-Month: Essays on Software Engineering by Frederick P. Brooks [Bro95].

If you want to be a successful Java programmer, you need good reference manuals close at hand. David Flanagan’s Java in a Nutshell [Fla05] provides a good reference for those familiar with the basics of the language.

After gaining proficiency in the mechanics of program writing, the next step is to become proficient in program design. Good design is difficult to learn in any discipline, and good design for object-oriented software is one of the most difficult of arts. The novice designer can jump-start the learning process by studying well-known and well-used design patterns. The classic reference on design patterns is Design Patterns: Elements of Reusable Object-Oriented Software by Gamma, Helm, Johnson, and Vlissides [GHJV95] (this is commonly referred to as the “gang of four” book). Unfortunately, this is an extremely difficult book to understand, in part because the concepts are inherently difficult. A number of Web sites are available that discuss design patterns, and which provide study guides for the Design Patterns book. Two other books that discuss object-oriented software design are Object-Oriented Software Design and Construction with C++ by Dennis Kafura [Kaf98], and Object-Oriented Design Heuristics by Arthur J. Riel [Rie96].

1.6 Exercises

The exercises for this chapter are different from those in the rest of the book. Most of these exercises are answered in the following chapters. However, you should not look up the answers in other parts of the book. These exercises are intended to make you think about some of the issues to be covered later on.
Answer them to the best of your ability with your current knowledge.

1.1 Think of a program you have used that is unacceptably slow. Identify the specific operations that make the program slow. Identify other basic operations that the program performs quickly enough.

1.2 Most programming languages have a built-in integer data type. Normally this representation has a fixed size, thus placing a limit on how large a value can be stored in an integer variable. Describe a representation for integers that has no size restriction (other than the limits of the computer’s available main memory), and thus no practical limit on how large an integer can be stored. Briefly show how your representation can be used to implement the operations of addition, multiplication, and exponentiation.

1.3 Define an ADT for character strings. Your ADT should consist of typical functions that can be performed on strings, with each function defined in terms of its input and output. Then define two different physical representations for strings.

1.4 Define an ADT for a list of integers. First, decide what functionality your ADT should provide. Example 1.4 should give you some ideas. Then, specify your ADT in Java in the form of an abstract class declaration, showing the functions, their parameters, and their return types.

1.5 Briefly describe how integer variables are typically represented on a computer. (Look up one’s complement and two’s complement arithmetic in an introductory computer science textbook if you are not familiar with these.) Why does this representation for integers qualify as a data structure as defined in Section 1.2?

1.6 Define an ADT for a two-dimensional array of integers. Specify precisely the basic operations that can be performed on such arrays. Next, imagine an application that stores an array with 1000 rows and 1000 columns, where fewer than 10,000 of the array values are non-zero.
Describe two different implementations for such arrays that would be more space efficient than a standard two-dimensional array implementation requiring one million positions.

1.7 You have been assigned to implement a sorting program. The goal is to make this program general purpose, in that you don’t want to define in advance what record or key types are used. Describe ways to generalize a simple sorting algorithm (such as insertion sort, or any other sort you are familiar with) to support this generalization.

1.8 You have been assigned to implement a simple sequential search on an array. The problem is that you want the search to be as general as possible. This means that you need to support arbitrary record and key types. Describe ways to generalize the search function to support this goal. Consider the possibility that the function will be used multiple times in the same program, on differing record types. Consider the possibility that the function will need to be used on different keys (possibly with the same or different types) of the same record. For example, a student data record might be searched by zip code, by name, by salary, or by GPA.

1.9 Does every problem have an algorithm?

1.10 Does every algorithm have a Java program?

1.11 Consider the design for a spelling checker program meant to run on a home computer. The spelling checker should be able to handle quickly a document of fewer than twenty pages. Assume that the spelling checker comes with a dictionary of about 20,000 words. What primitive operations must be implemented on the dictionary, and what is a reasonable time constraint for each operation?

1.12 Imagine that you have been hired to design a database service containing information about cities and towns in the United States, as described in Example 1.2. Suggest two possible implementations for the database.
1.13 Imagine that you are given an array of records that is sorted with respect to some key field contained in each record. Give two different algorithms for searching the array to find the record with a specified key value. Which one do you consider “better” and why?

1.14 How would you go about comparing two proposed algorithms for sorting an array of integers? In particular,
(a) What would be appropriate measures of cost to use as a basis for comparing the two sorting algorithms?
(b) What tests or analysis would you conduct to determine how the two algorithms perform under these cost measures?

1.15 A common problem for compilers and text editors is to determine if the parentheses (or other brackets) in a string are balanced and properly nested. For example, the string “((())())()” contains properly nested pairs of parentheses, but the string “)()(” does not; and the string “())” does not contain properly matching parentheses.
(a) Give an algorithm that returns true if a string contains properly nested and balanced parentheses, and false otherwise. Hint: At no time while scanning a legal string from left to right will you have encountered more right parentheses than left parentheses.
(b) Give an algorithm that returns the position in the string of the first offending parenthesis if the string is not properly nested and balanced. That is, if an excess right parenthesis is found, return its position; if there are too many left parentheses, return the position of the first excess left parenthesis. Return −1 if the string is properly balanced and nested.

1.16 A graph consists of a set of objects (called vertices) and a set of edges, where each edge connects two vertices. Any given pair of vertices can be connected by only one edge. Describe at least two different ways to represent the connections defined by the vertices and edges of a graph.

1.17 Imagine that you are a shipping clerk for a large company.
You have just been handed about 1000 invoices, each of which is a single sheet of paper with a large number in the upper right corner. The invoices must be sorted by this number, in order from lowest to highest. Write down as many different approaches to sorting the invoices as you can think of.

1.18 Imagine that you are a programmer who must write a function to sort an array of about 1000 integers from lowest value to highest value. Write down at least five approaches to sorting the array. Do not write algorithms in Java or pseudocode. Just write a sentence or two for each approach to describe how it would work.

1.19 Think of an algorithm to find the maximum value in an (unsorted) array. Now, think of an algorithm to find the second largest value in the array. Which is harder to implement? Which takes more time to run (as measured by the number of comparisons performed)? Now, think of an algorithm to find the third largest value. Finally, think of an algorithm to find the middle value. Which is the most difficult of these problems to solve?

1.20 An unsorted list of integers allows for constant-time insert simply by adding a new integer at the end of the list. Unfortunately, searching for the integer with key value X requires a sequential search through the unsorted list until you find X, which on average requires looking at half the list. On the other hand, a sorted array-based list of n integers can be searched in log n time by using a binary search. Unfortunately, inserting a new integer requires a lot of time because many integers might be shifted in the array if we want to keep it sorted. How might data be organized to support both insertion and search in log n time?

2 Mathematical Preliminaries

This chapter presents mathematical notation, background, and techniques used throughout the book. This material is provided primarily for review and reference.
You might wish to return to the relevant sections when you encounter unfamiliar notation or mathematical techniques in later chapters.

Section 2.7 on estimating might be unfamiliar to many readers. Estimating is not a mathematical technique, but rather a general engineering skill. It is enormously useful to computer scientists doing design work, because any proposed solution whose estimated resource requirements fall well outside the problem’s resource constraints can be discarded immediately.

2.1 Sets and Relations

The concept of a set in the mathematical sense has wide application in computer science. The notations and techniques of set theory are commonly used when describing and implementing algorithms because the abstractions associated with sets often help to clarify and simplify algorithm design.

A set is a collection of distinguishable members or elements. The members are typically drawn from some larger population known as the base type. Each member of a set is either a primitive element of the base type or is a set itself. There is no concept of duplication in a set. Each value from the base type is either in the set or not in the set. For example, a set named P might be the three integers 7, 11, and 42. In this case, P’s members are 7, 11, and 42, and the base type is integer.

Figure 2.1 shows the symbols commonly used to express sets and their relationships. Here are some examples of this notation in use. First define two sets, P and Q.

P = {2, 3, 5}, Q = {5, 10}.
{1, 4}                          A set composed of the members 1 and 4
{x | x is a positive integer}   A set definition using a set former
                                Example: the set of all positive integers
x ∈ P                           x is a member of set P
x ∉ P                           x is not a member of set P
∅                               The null or empty set
|P|                             Cardinality: size of set P or number of members for set P
P ⊆ Q, Q ⊇ P                    Set P is included in set Q, set P is a subset of set Q, set Q is a superset of set P
P ∪ Q                           Set Union: all elements appearing in P OR Q
P ∩ Q                           Set Intersection: all elements appearing in P AND Q
P − Q                           Set difference: all elements of set P NOT in set Q

Figure 2.1 Set notation.

|P| = 3 (because P has three members) and |Q| = 2 (because Q has two members). The union of P and Q, written P ∪ Q, is the set of elements in either P or Q, which is {2, 3, 5, 10}. The intersection of P and Q, written P ∩ Q, is the set of elements that appear in both P and Q, which is {5}. The set difference of P and Q, written P − Q, is the set of elements that occur in P but not in Q, which is {2, 3}. Note that P ∪ Q = Q ∪ P and that P ∩ Q = Q ∩ P, but in general P − Q ≠ Q − P. In this example, Q − P = {10}. Note that the set {5, 3, 2} is indistinguishable from set P, because sets have no concept of order. Likewise, set {2, 3, 2, 5} is also indistinguishable from P, because sets have no concept of duplicate elements.

The powerset of a set S is the set of all possible subsets for S. Consider the set S = {a, b, c}. The powerset of S is

{∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}.

Sometimes we wish to define a collection of elements with no order (like a set), but with duplicate-valued elements. Such a collection is called a bag.¹ To distinguish bags from sets, I use square brackets [] around a bag’s elements. For

¹ The object referred to here as a bag is