Introduction to Computing Science Lecture Notes PDF
Document Details
Uploaded by Deleted User
Northwestern Polytechnical University
2013
Xiuwei Zhang, Yanning Zhang
Tags
Related
- UGBS 102 (COMPUTING) LECTURE NOTES PDF
- COS 101 Introduction to Computer Science Lecture Notes PDF
- COS 101 Introduction to Computer Science Lecture Notes PDF
- Week 01 - Introduction to Computers and Computing PDF
- Intro to Computing Technology PDF
- COS 101: Introduction to Computing Sciences Lecture Notes PDF
Summary
This document is a set of lecture notes for an Introduction to Computing Science course, likely at the undergraduate level, at Northwestern Polytechnical University (China). It covers various aspects of computing science, including the representation of data, computing machines, algorithms, programming, operating systems, and networks. The notes were published in 2013.
Full Transcript
Introduction to Computing Science 张秀伟 张艳宁 主编 Introduction to Computing Science 计算科学概论讲义 Xiuwei Zhang, Yanning Zhang Northwestern Polytechnical University 2013 July ...
Introduction to Computing Science 张秀伟 张艳宁 主编 Introduction to Computing Science 计算科学概论讲义 Xiuwei Zhang, Yanning Zhang Northwestern Polytechnical University 2013 July Contents I Contents Contents...................................................................................................................................... I Chapter 1 Computing................................................................................................................ 1 1.1. Processes, Procedure, and Computers........................................................................ 1 1.2. Problem Solving and Algorithm................................................................................ 3 1.3. Science, Engineering, and the Liberal Arts.............................................................. 10 1.4. Overview.................................................................................................................. 12 Chapter 2 Binary Values and Data Representation................................................................. 13 2.1. Binary Values and Number System......................................................................... 13 2.2. Data and Computers................................................................................................. 19 2.3. Representing Numeric Data..................................................................................... 23 2.4. Representing Text.................................................................................................... 27 2.5. Representing Audio Information............................................................................. 33 2.6. Representing Images and Graphics.......................................................................... 35 2.7. Representing Video.................................................................................................. 37 Summary............................................................................................................................. 38 Exercises............................................................................................................................. 39 Chapter 3 Computing Machine............................................................................................... 43 3.1. Gates and Circuits.................................................................................................... 43 3.2. Individual Computer Components........................................................................... 58 3.3. Stored-Program Concept.......................................................................................... 60 3.4. Non-von Neumann Architectures............................................................................ 68 Summary............................................................................................................................. 70 Exercises............................................................................................................................. 71 Chapter 4 Algorithms and Raptor........................................................................................... 75 4.1. Algorithm Concept................................................................................................... 75 4.2. Raptor Introduction.................................................................................................. 80 4.3. Programming Control Structures............................................................................. 87 4.4. Array Variable......................................................................................................... 98 4.5. RAPTORgraph....................................................................................................... 108 4.6. Raptor Procedure and Sub Chart............................................................................ 117 Summary........................................................................................................................... 125 II Introduction to Computing Science Exercises............................................................................................................................ 126 Chapter 5 Programming with Raptor..................................................................................... 127 5.1 Brute Force............................................................................................................. 127 5.2 Modulo Operation.................................................................................................. 129 5.3 Char and String Operation...................................................................................... 132 5.4 Recursion................................................................................................................ 133 5.5 Sort......................................................................................................................... 139 5.6 Search..................................................................................................................... 150 Summary............................................................................................................................ 153 Exercises............................................................................................................................ 153 Chapter 6 Operating Systems................................................................................................ 155 6.1. Roles of an Operating System................................................................................ 155 6.2. Memory Management............................................................................................ 159 6.3. Process Management.............................................................................................. 165 6.4. CPU Scheduling..................................................................................................... 166 Summary............................................................................................................................ 169 Exercises............................................................................................................................ 170 Chapter 7 File Systems and Directories................................................................................ 175 7.1. File System............................................................................................................. 175 7.2. Directories.............................................................................................................. 180 7.3. Disk Scheduling..................................................................................................... 184 Summary............................................................................................................................ 187 Exercises............................................................................................................................ 188 Chapter 8 Networking and the Internet................................................................................. 191 8.1. Networking............................................................................................................. 191 8.2. Open System and Protocols.................................................................................... 195 8.3. Network Addresses................................................................................................ 199 8.4. The World Wide Web............................................................................................ 202 Summary............................................................................................................................ 210 Exercises............................................................................................................................ 211 Reference............................................................................................................................... 215 Chapter 1 Computing 1 Chapter 1 Computing The first million years of hominid history produced tools to amplify, and later mechanize, our physical abilities to enable us to move faster, reach higher, and hit harder. We have developed tools that amplify physical force by the trillions and increase the speeds at which we can travel by the thousands. Tools that amplify intellectual abilities are much rare. While some animals have developed tools to amplify their physical abilities, only humans have developed tools to substantially amplify our intellectual abilities and it is those advances that have enabled humans to dominate the planet. The first key intellect amplifier was language. Language provided the ability to transmit our thoughts to others, as well as to use our own minds more effectively. The next key intellect amplifier was writing, which enabled the storage and transmission of thoughts over time and distance. Computing is the ultimate mental amplifier—computers can mechanize any intellectual activity we can imagine. Automatic computing radically changes how humans solve problems, and even the kinds of problems we can imagine solving. Computing has changes the world more than any other invention of the past hundred years, and has come to pervade nearly all human endeavors. Yet, we are just at the beginning of the computing revolution; today‘s computing offers just a glimpse of the potential impact of computing. There are two reasons why everyone should study computing: 1. Nearly all of the most exciting and important technologies, arts, and sciences of today and tomorrow are driven by computing. 2. Understanding computing illuminates deep insights and questions into the nature of our minds, our culture, and our universe. Anyone who has submitted a query to Google, watched The CROODS, used a smart phone, seen a Cirque Du Soleil show, shopped with a credit card, or microwaved a pizza should be convinced of the first reason. None of these would be possible without the tremendous advances in computing over the past half century. Although this book will touch on some applications of computing, our primary focus is on the second reason, which may seem more surprising. Computing changes how we think about problems and how we understand the world. The goal of this book is to teach you that new way of thinking. 1.1. Processes, Procedure, and Computers Computer science is the study of information processes. A process is a sequence of steps. Each step changes the state of the world in some small way, and the result of all the steps produces some goal state. For example, baking a cake, mailing a letter, and planting a tree are all processes. Because they involve physical things like sugar and dirt, however, they are not pure information processes. Computer 2 Introduction to Computing Science science focuses on processes that involve abstract information rather than physical things. The boundaries between the physical world and pure information processes, however, are often fuzzy. Real computers operate in the physical world: they obtain input through physical means (e.g. a user pressing a key on a keyboard that produces an electrical impulse), and produce physical outputs (e.g. an image displayed on a screen). By focusing on abstract information, we simplify computation to its essence to better enable understanding and seasoning. A procedure is a description of a process. A simple process can be described just by listing the steps. The list of steps is the procedure; the act of following them is process. A procedure that can be followed without any thought is called a mechanical procedure. An algorithm is a mechanical procedure that is guaranteed to eventually finish. For example, here is procedure for making coffee, adapted from actual directions that come with a major coffeemaker: 1. Lift and open the coffeemaker lid. 2. Place a basket-type filter into the filter basket. 3. Add the desired amount of coffee and shake to level the coffee. 4. Fill the decanter with cold, fresh water to the desired capacity. 5. Pour the water into the water reservoir. 6. Close the lid. 7. Place the empty decanter on the warming plate. 8. Press the ON button. Describing processed by just listing steps like this has many limitations. First, natural languages are very imprecise and ambiguous. Following the steps correctly requires knowing lots of unstated assumptions. For example, step three assumes the operator understands the difference between coffee grounds and finished coffee, and can infer that this use of ―coffee‖ refers to coffee grounds since the end goal of this process is to make drinkable coffee. Other steps assume the coffeemaker is plugged in and sitting on a flat surface. One could, of course, add lots more details to our procedure and make the language more precise than this. Even when a lot of efforts are put into writing precisely and clearly, however, natural languages such as English are inherently ambiguous. This is why the United States tax code is 3.4 million words long, but lawyers can still spend years arguing over what it really means. Another problem with this way of describing a procedure is that the size of the description is proportional to the number of steps in the process. This is fine for simple processed that can be executed by humans in a reasonable amount of time, but the processed we want to execute on computers involve trillions of steps. This means we need more efficient ways to describe them than just listing each step one-by-one. To program computers, we need tools that allow us to describe processes precisely and succinctly. Since the procedures are carried out by a machine, every step needs to be described; we cannot rely on the operator having ―common sense‖ (for example, to know how to fill the coffeemaker with water without explaining that water comes from a faucet, and how to turn the faucet on). Instead, we need mechanical procedures that can be followed without any thinking. A computer is a machine that can: Chapter 1 Computing 3 1. Accept input. Input could be entered by a human typing at keyboard, received over a network, or provided automatically by sensors attached to the computer. 2. Execute a mechanical procedure, that is, a procedure where each step can be executed without any thought. 3. Product output. Output could be data displayed to a human, but it could also be anything that affects the world outside the computer such as electrical signals that control how a device operates. Computer exists in a wide range of forms, and thousands of computers are hidden in devices we use every day but don‘t think of as computers such as cars, phones, TVs, microwave ovens, and access cards. Our primary focus is on universal computers, which are computers that can perform all possible mechanical computations on discrete inputs except for practical limits on space and time. 1.2. Problem Solving and Algorithm We often have a misconception ―computer seems very smart‖. In fact, the computer can do nothing without being told what to do. A computer is not intelligent. It cannot analyze a problem and come up with a solution. Before a machine such as a computer can perform a task or solve a problem, an algorithm for performing that task must be discovered by a human (the programmer) and represented in a form that is compatible with the machine. A representation of an algorithm is called a program. For the convenience of humans, computer programs are usually printed on paper or displayed on computer screens. For the convenience of machines, programs are encoded in a manner compatible with the technology of the machine. The process of developing a program, encoding it in machine-compatible form, and inserting it into a machine is called programming. Programs, and the algorithms they represent, are collectively referred to as software, in contrast to the machinery itself, which is known as hardware. The advantage of using a computer is that, once we have written a solution for the computer, the computer can repeat the solution very quickly and consistently, again and again, for different situations and data. The computer frees people from repetitive and boring tasks. The study of algorithms began as a subject in mathematics. Indeed, the search for algorithms was a significant activity of mathematicians long before the development of today‘s computers. The goal was to find a single set of directions that described how all problems of a particular type could be solved. One of the best known examples of this early research is the long division algorithm for finding the quotient of two multiple-digit numbers. Another example is the Euclidean algorithm, discovered by the ancient Greek mathematician Euclid, for finding the greatest common divisor of two positive integers (Figure 1.1). Once an algorithm for performing a task has been found, the performance of that task no longer requires an understanding of the principles on which the algorithm is based. Instead, the performance of the task is reduced to the process of merely following directions. (We can follow the long division algorithm to find a quotient or the Euclidean algorithm to find a greatest common divisor without understanding why the algorithm works.) In a sense, the intelligence required to solve the problem at hand is encoded in the algorithm. 4 Introduction to Computing Science Description: This algorithm assumes that its input consists of two positive integers and proceeds to compute the greatest common divisor of these two values. Procedure: Step 1. Assign M and N the value of the larger and smaller of the two input values, respectively. Step 2. Divide M by N, and call the remainder R. Step 3. If R is not 0, then assign M the value of N, assign N the value of R, and return to step 2; otherwise, the greatest common divisor is the value currently assigned to N. Figure 1.1 The Euclidean algorithm for finding the greatest common divisor of two positive integers It is through this ability to capture and convey intelligence (or at least intelligent behavior) by means of algorithms that we are able to build machines that perform useful tasks. Consequently, the level of intelligence displayed by machines is limited by the intelligence that can be conveyed through algorithms. We can construct a machine to perform a task only if an algorithm exists for performing that task. In turn, if no algorithm exists for solving a problem, then the solution of that problem lies beyond the capabilities of machines. Identifying the limitations of algorithmic capabilities solidified as a subject in mathematics in the 1930s with the publication of Kurt Gödel‘s incompleteness theorem. This theorem essentially states that in any mathematical theory encompassing our traditional arithmetic system, there are statements whose truth or falseness cannot be established by algorithmic means. In short, any complete study of our arithmetic system lies beyond the capabilities of algorithmic activities. This realization shook the foundations of mathematics, and the study of algorithmic capabilities that ensued was the beginning of the field known today as computing science. Indeed, it is the study of algorithms that forms the core of computing science. To produce an algorithm to solve a problem, we should firstly know how to solve this problem and then convey our intelligence through algorithms. In 1945, G. Polya wrote a little book entitled How to Solve It: A New Aspect of Mathematical Method. Although this book was written over 50 years ago when computers were only experimental—it remains the classic description of the problem-solving process. The process is summarized in Figure 1.2. What has made Polya‘s book a classic is that his How to Solve It list is quite general. Although it was written in the context of solving mathematical problems, we can replace the words unknown with problem, data with information, and theorem with solution, and the list becomes applicable to all types of problems. Of course, it is the second step—finding the connection between the information and the solution—that is at the heart of problem solving. In computing, there are three phases in the problem-solving process: the algorithm development phase, the implementation phase, and the maintenance phase. See Figure 1.3. The output from the algorithm development phase is a plan for a general solution to the problem. The output from the second phase is a working computer program that implements the algorithm, that is, a specific solution to the problem. There is no output from the third phase, unless errors are detected or changes need to be made. If so, these errors or changes are sent back either to the first phase or second phase, whichever is appropriate. Chapter 1 Computing 5 See Figure 1.4. HOW TO SOLVE IT UNDERSTANDING THE PROBLEM First. What is the unknown? What are the data? What is the condition? Is it possible to satisfy the condition? Is the condition sufficient to determine You have to understand the the unknown? Or is it insufficient? Or redundant? Or contradictory? problem. Draw a figure. Introduce suitable notation. Separate the various parts of the condition. Can you write them down? DEVISING A PLAN Second. Have you seen it before? Or have you seen the same problem in a slightly different form? Find the connection between the data and the Do you know a related problem? Do you know a theorem that could unknown. You may be be useful? Look at the unknown! And try to think of a familiar problem obliged to consider auxiliary having the same or a similar unknown. problems if an immediate Here is a problem related to yours and solved before. Could you use it? connection cannot be found. Could you use its result? Could you use its method? Should you You should obtain introduce some auxiliary element in order to make its use possible? eventually a plan of the Could you restate the problem? Could you restate it still differently? Go solution. back to definitions. If you cannot solve the proposed problem, try to solve first some related problem. Could you imagine a more accessible related problem? A more general problem? A more special problem? An analogous problem? Could you solve a part of the problem? Keep only a part of the condition, drop the other part; how far is the unknown then determined; how can it vary? Could you derive something useful from the data? Could you think of other data appropriate to determine the unknown? Could you change the unknown or the data, or both if necessary, so that the new unknown and the new data are nearer to each other? Did you use all the data? Did you use the whole condition? Have you taken into account all essential notions involved in the problem? CARRYING OUT THE PLAN Third. Carrying out your plan of the solution, check each step. Can you see clearly that the step is correct? Can you prove that it is correct? Carry out your plan. LOOKING BACK Fourth. Can you check the result? Can you check the argument? Can you derive the result differently? Can you see it at a glance? Can you use the result, Examine the solution or the method, for some other problem? obtained. Figure 1.2 Polya‘s How to Solve It list Notice that all of Polya‘s phases are included in this outline of how we solve problems using the computer. The first step is always to understand the problem. You cannot write a computer solution to a problem you don‘t understand. The next step is to develop a plan (an algorithm) for the solution. There are various techniques for expressing algorithms. In Chapter 4, we will learn how to use flowcharts to express algorithm. Whatever form you use for your algorithm, you must test it by examining it 6 Introduction to Computing Science carefully to be sure that it does solve the problem. Figure 1.3 The computer problem-solving process Figure 1.4 The interaction between problem-solving phases The next step is to implement the plan in a way that the computer can execute it and test the results. In Polya‘s list, the human executes the plan and evaluates the results. In a computer solution, a program is written expressing the plan in a language that the computer can execute. But it is the human who takes the results of the computer program and checks them to be sure that they are correct. The maintenance phase maps to Polya‘s last stage, where the results are examined and perhaps modified. 1.3. The History of Computing Today‘s computers have an extensive genealogy. One of the earlier computing devices was the abacus. History tells us that it most likely had its roots in ancient China and was used in the early Greek and Roman civilizations. The machine is quite simple, consisting of beads strung on rods that are in turn mounted in a rectangular frame (Figure 1.5). As the beads are moved back and forth on the rods, their positions represent stored values. It is in the positions of the beads that this ―computer‖ represents and stores data. For control of an algorithm‘s execution, the machine relies on the human operator. Thus the Chapter 1 Computing 7 abacus alone is merely a data storage system; it must be combined with a human to create a complete computational machine. Figure 1.5 An abacus (photography by Wayne Chandler) In the time period after the Middle Ages and before the Modern Era the quest for more sophisticated computing machines was seeded. A few inventors began to experiment with the technology of gears. Among these were Blaise Pascal (1623-1662) of France, Gottfried Wilhelm Leibniz (1646-1716) of Germany, and Charles Babbage (1792-1871) of England. These machines represented data through gear positioning, with data being input mechanically by establishing initial gear positions. Output from Pascal‘s and Leibniz‘s machines was achieved by observing the final gear positions. Babbage, on the other hand, envisioned machines that would print results of computations on paper so that the possibility of transcription errors would be eliminated. As for the ability to follow an algorithm, we can see a progression of flexibility in these machines. Pascal‘s machine was built to perform only addition. Consequently, the appropriate sequence of steps was embedded into the structure of the machine itself. In a similar manner, Leibniz‘s machine had its algorithms firmly embedded in its architecture, although it offered a variety of arithmetic operations from which the operator could select. Babbage‘s Difference Engine (of which only a demonstration model was constructed) could be modified to perform a variety of calculations, but his Analytical Engine (the construction for which he never received funding) was designed to read instructions in the form of holes in paper cards. Thus Babbage‘s Analytical Engine was programmable. In fact, Augusta Ada Byron (Ada Lovelace), who published a paper in which she demonstrated how Babbage‘s Analytical Engine could be programmed to perform various computations, is often identified today as the world‘s first programmer. The idea of communicating an algorithm via holes in paper was not originated by Babbage. He got the idea from Joseph Jacquard (1752-1834), who, in 1801, had developed a weaving loom in which the steps to be performed during the weaving process were determined by patterns of holes in large thick cards made of wood (or cardboard). In this manner, the algorithm followed by the loom could be changed easily to produce different woven designs. Another beneficiary of Jacquard‘s idea was Herman Hollerith (1860-1929), who applied the concept of representing information as holes in paper cards to speed up the tabulation process in the 1890 U.S. census. (It was this work by Hollerith that led to the creation of IBM.) Such cards ultimately came to be known as punched cards and survived as a popular 8 Introduction to Computing Science means of communicating with computers well into the 1970s. Indeed, the technique lives on today, as witnessed by the voting issues raised in the 2000 U.S. presidential election. The technology of the time was unable to produce the complex gear-driven machines of Pascal, Leibniz, and Babbage in a financially feasible manner. But with the advances in electronics in the early 1900s, this barrier was overcome. Examples of this progress include the electromechanical machine of George Stibitz, completed in 1940 at Bell Laboratories, and the Mark I, completed in 1944 at Harvard University by Howard Aiken and a group of IBM engineers (Figure1.6). These machines made heavy use of electronically controlled mechanical relays. In this sense they were obsolete almost as soon as they were built, because other researchers were applying the technology of vacuum tubes to construct totally electronic computers. The first of these machines was apparently the Atanasoff-Berry machine, constructed during the period from 1937 to 1941 at Iowa State College (now Iowa State University) by John Atanasoff and his assistant, Clifford Berry. Another was a machine called Colossus, built under the direction of Tommy Flowers in England to decode German messages during the latter part of World War II. (Actually, as many as ten of these machines were apparently built, but military secrecy and issues of national security kept their existence from becoming part of the “computer family tree.”) Other, more flexible machines, such as the ENIAC (electronic numerical integrator and calculator) developed by John Mauchly and J. Presper Eckert at the Moore School of Electrical Engineering, University of Pennsylvania, soon followed. Figure 1.6 The Mark I compute From that point on, the history of computing machines has been closely linked to advancing technology, including the invention of transistors (for which physicists William Shockley, John Bardeen, and Walter Brattain were awarded a Nobel Prize) and the subsequent development of complete circuits constructed as single units, called integrated circuits (for which Jack Kilby also won a Nobel Prize in physics). With these developments, the room-sized machines of the 1940s were reduced over the decades to the size of single cabinets. At the same time, the processing power of computing machines Chapter 1 Computing 9 began to double every two years (a trend that has continued to this day). As work on integrated circuitry progressed, many of the circuits within a computer became readily available on the open market as integrated circuits encased in toy-sized blocks of plastic called chips. A major step toward popularizing computing was the development of desktop computers. The origins of these machines can be traced to the computer hobbyists who built homemade computers from combinations of chips. It was within this ―underground‖ of hobby activity that Steve Jobs and Stephen Wozniak built a commercially viable home computer and, in 1976, established Apple Computer, Inc. (now Apple Inc.) to manufacture and market their products. Other companies that marketed similar products were Commodore, Heathkit, and Radio Shack. Although these products were popular among computer hobbyists, they were not widely accepted by the business community, which continued to look to the well-established IBM for the majority of its computing needs. In 1981, IBM introduced its first desktop computer, called the personal computer, or PC, whose underlying software was developed by a newly formed company known as Microsoft. The PC was an instant success and legitimized the desktop computer as an established commodity in the minds of the business community. Today, the term PC is widely used to refer to all those machines (from various manufacturers) whose design has evolved from IBM‘s initial desktop computer, most of which continue to be marketed with software from Microsoft. At times, however, the term PC is used interchangeably with the generic terms desktop or laptop. As the twentieth century drew to a close, the ability to connect individual computers in a world-wide system called the Internet was revolutionizing communication. In this context, Tim Berners-Lee (a British scientist) proposed a system by which documents stored on computers throughout the Internet could be linked together producing a maze of linked information called the World Wide Web (often shortened to ―Web‖). To make the information on the Web accessible, software systems, called search engines, were developed to ―sift through‖ the Web, ―categorize‖ their findings, and then use the results to assist users researching particular topics. Major players in this field are Google, Yahoo, and Microsoft. These companies continue to expand their Web-related activities, often in directions that challenge our traditional way of thinking. At the same time that desktop computers (and the newer mobile laptop computers) were being accepted and used in homes, the miniaturization of computing machines continued. Today, tiny computers are embedded within various devices. For example, automobiles now contain small computers running Global Positioning Systems (GPS), monitoring the function of the engine, and providing voice command services for controlling the car‘s audio and phone communication systems. Perhaps the most potentially revolutionary application of computer miniaturization is found in the expanding capabilities of portable telephones. Indeed, what was recently merely a telephone has evolved into a small hand-held general purpose computer known as a smartphone on which telephony is only one of many applications. These ―phones‖ are equipped with a rich array of sensors and interfaces including cameras, microphones, compasses, touch screens, accelerometers (to detect the phone‘s orientation and motion), and a number of wireless technologies to communicate with other smartphones and computers. The potential is enormous. Indeed, many argue that the smartphone will have a greater effect on society than the PC. The miniaturization of computers and their expanding capabilities have brought computer technology to the forefront of today‘s society. Computer technology is so prevalent now that familiarity with it is 10 Introduction to Computing Science fundamental to being a member of modern society. Computing technology has altered the ability of governments to exert control; had enormous impact on global economics; led to startling advances in scientific research; revolutionized the role of data collection, storage, and applications; provided new means for people to communicate and interact; and has repeatedly challenged society‘s status quo. The result is a proliferation of subjects surrounding computer science, each of which is now a significant field of study in its own right. Moreover, as with mechanical engineering and physics, it is often difficult to draw a line between these fields and computer science itself. Thus, to gain a proper perspective, our study will not only cover topics central to the core of computer science but will also explore a variety of disciplines dealing with both applications and consequences of the science. Indeed, an introduction to computer science is an interdisciplinary undertaking. 1.4. Science, Engineering, and the Liberal Arts Much ink and many bits have been spent debating whether computer science is an art, an engineering discipline, or a science. The confusion stems from the nature of computing as a new field that does not fit well into existing silos. In fact, computer science fits into all three kingdoms, and it is useful to approach computing from all three perspectives. Science. Traditional science is about understanding nature through observation. The goal of science is to develop general and predictive theories that allow us to understand aspects of nature deeply enough to make accurate quantitative predication. For example, Newton‘s law of universal gravitation makes predictions about how masses will move. The more general a theory is the better. A key, as yet unachieved, a goal of science is to find a universal law that can describe all physical behavior at scales form the smallest subparticle to entire universe, and all the bosons, muons, dark matter, black holes, and galaxies in between. Science deals with real things (like bowling balls, planets, and electrons) and attempts to make progress toward theories that predict increasingly precisely how these real things will behave in different situations. Computer science focuses on artificial things like numbers, graphs, functions, and lists. Instead of dealing with physical things in the real world, computer science concerns abstract things in a virtual world. The numbers we use in computations often represent properties of physicals things in the real world, and with enough bits we can model real things with arbitrary precision. But, since our focus is on abstract, artificial things rather than physical things, computer science is not a traditional natural science but a more abstract field like mathematics. Like mathematics, computing is an essential tool for modern science, but when we study computing on artificial things it is not a natural science itself. In a deeper sense, computing pervades all of nature. A long term goal of computer science is to develop theories that explain how nature computes. One example of computing in nature comes from biology. Complex life exists because nature can perform sophisticated computing. People sometimes describe DNA as a ―blueprint‖, but it is really much better thought of as a program. Whereas a blueprint describes what a building should be when it is finished, giving the dimensions of walls and how they fit together, the DNA of an organism encodes a process of growing that organism. A human genome is not a blue print that describes the body plan of a human; it is a program that turns a single cell into a complex human given the appropriate environment. The process of evolution (which itself is an information process) produces new programs, and hence new species, through the process of natural selection on mutated DNA. Understanding how both these processes work is one of the most interesting Chapter 1 Computing 11 and important open scientific questions, and it involves deep questions in computer science, as well as biology, chemistry, and physics. Engineering. Engineering is about making useful things. Engineering is often distinguished from crafts in that engineers use scientific principles to create their designs, and focus on designing under practical constraints. As William Wulf and George Fisher put it: Whereas science is analytic in that it strives to understand nature or what is, engineering is synthetic in that in strives to create. Our own favorite description of what engineers do is “design under constraint”. Engineering is creativity constrained by nature, by cost, by concerns of safety, environmental impact, ergonomics, reliability, manufacturability, maintainability - the whole long list of such “ilities”. To be sure, the realities of nature is one of the constraint sets we work under, but it is far from the only one, it is seldom the hardest one, and almost never the limiting one. Computer scientists do not typically face the natural constraints faced by civil and mechanical engineers—computer program are massless and not exposed to the weather, so programmers do not face the kinds of physical constraints like gravity that impose limits on bridge designers. Computer scientists, however, do face many constraints. A primary constraint is the capacity of the human mind—there is a limit to how much information a human can keep in mind at one time. As computing systems get more complex, there is no way for a human to understand the entire system at once. To build complex systems, we need techniques for managing complexity. The primary tool computer scientists use to manage complexity is abstraction. Abstraction is a way of giving a name to something in a way that allows us to hide unnecessary details. By using carefully designed abstraction, we can construct complex systems with reliable properties while limiting the amount of information a human designer needs to keep in mind at any one time. Liberal Arts. The notion of the liberal arts emerged during the middle ages to distinguish education for the purpose of expanding the intellects of free people from the illiberal arts such as medicine and carpentry that were pursued for economic purposes. The liberal arts were intended for people who did not need to learn an art to make a living, but instead had the luxury to pursue purely intellectual activities for their own sake. The traditional seven liberal arts started with the Trivium (three roads), focused on language. Grammar—―the art of inventing symbols and combining them to express thought‖ Rhetoric—―the art of communicating thought from one mind to another, the adaptation of language to circumstance‖ Logic—―the art of thinking‖ The Trivium was followed by the Quadrivium, focused on numbers: Arithmetic—―theory of number‖ Geometry—―theory of space‖ Music—―application of the theory of number‖ Astronomy—―application of the theory of space‖ All of these have strong connections to computer science, and we will touch on each of them to some degree in this book. Language is essential to computing since we use the tools of language to describe information processes. Programming language will be discussed in Chapter 4. Rhetoric encompasses 12 Introduction to Computing Science communicating thoughts between minds. In computing, we are not typically communicating directly between minds. But we see many forms of communication between entities: interfaces between components of a program, as well as protocols used to enable multiple computing systems to communicate (for example, the HTTP protocol defines how a web browser and web server interact), and communication between computer programs and human users. The primary tool for understanding what computer programs mean, and hence, for constructing programs with particular meaning, is logic. Hence, the traditional trivium liberal arts of language and logic permeate computer science. The connections between computing and the quadrivium arts are also pervasive. We will see how computers use sequences of bits to represent numbers and how machines can perform basic arithmetic operations in Chapter 2. Geometry is essential for computer graphics, and graph theory is also important for computer networking. The harmonic structures in music have strong connections to the recursive definitions introduced in Chapter 4. Unlike the other six liberal arts, astronomy is not directly connected to computing, but computing is an essential tool for doing modern astronomy. Although learning about computing qualifies as an illiberal art (that is, it can have substantial economic benefits for those who learn it well), computing science also covers at least six of the traditional seven liberal arts. 1.5. Overview In this chapter, some basic concepts about computing and computer science are introduced, including process, procedure, algorithm and computer. Now we know algorithm plays an important role in computing science. We human convey our intelligence through algorithms that we are able to build machines that perform useful tasks. In fact, a computer is a machine that can just accept input, execute a mechanical procedure, and product output. The computer can do nothing without programs. We might have more confusion and questions. How does a computer compute mechanically and automatically? How do we design our algorithm to solve problem? How can our computer do so many tasks simultaneously, e.g. typing, playing music and receiving QQ messages? The following content will explain these questions in detail. It begins with the fundamentals of information encoding, data storage, computer components and computer architecture (Chapters 2 and 3); investigates the topics of algorithms, solving problems with programming in RAPTOR (Chapters 4, Chapter 5); progresses to the study of operating systems (Chapter 6 and 7) and computer networks (Chapter 8). Although the text follows this natural progression, the individual chapters and sections are surprisingly independent and can usually be read as isolated units or rearranged to form alternative sequences of study. One of these alternatives begins with material from Chapters 4 and 5 (Algorithms and Programming) and returns to the earlier chapters as desired. This book presents an introductory survey of computing science. It is for most science and engineering students, and including beginning computer science students. Chapter 2 Binary Values and Data Representation 13 Chapter 2 Binary Values and Data Representation This chapter describes binary values—the way in which computer hardware represents and manages information. This chapter puts binary values in the context of all number systems, reminding us of grade school principles that we now take for granted. You probably already know many of the concepts about binary numbers described in this chapter, but you might not realize that you know them! The rules of all number systems are the same; it‘s just a matter of going back to those underlying concepts and applying them in a new base. By making sure we have an under-standing of binary values, we pave the way to understanding how computing systems use the binary number system to accomplish their tasks. Building on the fundamental concepts of the established binary number system, then this chapter also explores how we represent and store the various kinds of information a computer manages. 2.1.Binary Values and Number System 2.1.1. Number Categories and Natural Numbers Numbers come in all sorts of categories. There are natural numbers, negative numbers, rational numbers, irrational numbers, and many others that are important in mathematics but not to the understanding of computing. Let‘s review the relevant category definitions briefly. First, let‘s define the general concept of a number: A number is a unit belonging to an abstract mathematical system and subject to specified laws of succession, addition, and multiplication. That is, a number is a representation of a value, and certain arithmetic operations can be consistently applied to such values. Now let‘s separate numbers into categories. A natural number is the number 0 or any number obtained by repeatedly adding 1 to this number. Natural numbers are the ones we use in counting. A negative number is less than zero, and is opposite in sign to a positive number. An integer is any of the natural numbers or any of the negatives of these numbers. A rational number is an integer or the quotient of two integers—that is, any value that can be expressed as a fraction. In this section we focus on natural numbers and how they are represented in various number systems. As part of that discussion we establish how all number systems relate to each other. Part of our goal in this section is to remind you of those underlying principles and show you that those principles apply to all number systems. Then, the idea that a computer uses binary values, 1s and 0s, to represent information should be less mysterious. How many ones are there in 943? That is, how many actual things does the number 943 represent? Well, in grade school terms you might say there are 9 hundreds plus 4 tens plus 3 ones. Or, said another way, there are 900 ones plus 40 ones plus 3 ones. So how many ones are there in 754? 700 ones plus 50 ones plus 4 ones. Right? Well, maybe. The answer depends on the base of the number system you are using. This answer is correct in the base-10, or decimal, number system, which is the number system we 14 Introduction to Computing Science humans use every day. But that answer is not correct in other number system. The base of a number system specifies the number of digits used in the system. The digits always begin with 0 and continue through one less than the base. For example, there are 2 digits in base 2: 0 and 1. There are 8 digits in base 8: 0 through 7. There are 10 digits in base 10: 0 through 9. The base also determines what the position of digits mean. When you add 1 to the last digit in the number system, you have a carry to the digit position to the left. 2.1.2. Positional Notation Numbers are written using positional notation. The rightmost digit represents its value times the base to the zeroth power. The digit to the left of that one represents its value times the base to the first power. The next digit represents its value times the base to the second power. The next digit represents its value times the base to the third power, and so on. You are so familiar with positional notation that you probably don‘t think about it. We used it instinctively to calculate the number of ones in 943. 9 * 102= 9 * 100 = 900 1 + 4 * 10 = 4 * 10 = 40 0 + 3 * 10 = 3 * 1= 3 943 A more formal way of defining positional notation is that the value is represented as a polynomial in the base of the number system. But what is a polynomial? A polynomial is a sum of two or more algebraic terms, each of which consists of a constant multiplied by one or more variables raised to a nonnegative integral power. When defining positional notation, the variable is the base of the number system. 943 is represented as a polynomial as follows, with x acting as the base. 2 1 0 9*x +4*x +3*x Let‘s express this idea formally. If a number in the base-R number system has n digits, it is represented as follows, where di represents the digit in the ith position in the number. n-1 n-2 d n* R +dn-1*R +... + d 2* R + d 1 Look complicated? Let‘s look at a concrete example: 63578 in base 10. n is 5 (the number has 5 digits), and R is 10 (the base). The formula says that the fifth digit (last digit on the left) is multiplied by the base to the fourth power; the fourth digit is multiplied by the base to the third power; the third digit is multiplied by the base to the second power; the second digit is multiplied by the base to the first power; and the first digit is not multiplied by anything. 4 3 2 1 6 * 10 + 3 * 10 + 5 * 10 + 7 * 10 + 8 In the previous calculation, we have assumed the number base is ten. This is a logical assumption since our number system is base ten. However, there is nothing about the number 943 that says it couldn‘t be representing a value in base 13. If so, to determine the number of ones, we would have to convert it to base 10. Chapter 2 Binary Values and Data Representation 15 2 + 9 * 13 = 9 * 169 = 1521 1 + 4 * 13 = 4 * 13 = 52 0 + 3 * 13 = 3 * 1= 3 1576 Therefore, 943 in base 13 is equal to 1576 in base 10. Keep in mind that these two numbers have an equivalent value. That is, they both represent the same number of things. If a bag contains 943 (base 13) beans, and another bag contains 1576 (base 10) beans, both bags contain the exact same number of beans. Number systems just allow us to represent values in various ways. Other bases, such as base 2 (binary), are particularly important in computer processing. Let‘s explore these bases in more detail. 2.1.3. Binary, Octal, and Hexadecimal The base-2 (binary) number system is important in computing. It is also helpful to be familiar with number systems that are powers of 2, such as base 8 (octal), and base 16 (hexadecimal). Remember that the base value specifies the number of digits in the number system. Base 10 has ten digits (0–9), base 2 has two digits (0–1), and base 8 has eight digits (0–7). Therefore, the number 943 could not represent a value in any base less than base 10, because the digit 9 doesn‘t exist in those bases. It is, however, a valid number in base 10 or any base higher than that. Likewise, the number 2074 is a valid number in base 8 or higher, but it simply does not exist (because it uses the digit 7) in any base lower than that. What are the digits in bases higher than 10? We need symbols to represent the digits that correspond to the decimal values 10 and beyond. In bases higher than 10, we use letters as digits. We use the letter A to represent the number 10, B to represent 11, C to represent 12, etc. Therefore, the 16 digits in base 16 are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F Let‘s look at values in octal, hexadecimal, and binary to see what they represent in base 10. For example, let‘s calculate the decimal equivalent of 754 in octal (base 8). As before, we just expand the number in its polynomial form and add up the numbers. 2 7 * 8 = 7 * 64 = 448 1 + 5 * 8 = 5 * 8 = 40 0 +4*8 =4* 1= 4 492 Let‘s convert the hexadecimal number ABC to decimal: 2 A * 16 = 10 * 256 = 2560 1 + B * 16 = 11 * 16 = 176 0 + C * 16 = 12 * 1= 12 2748 Note that we perform the exact same calculation to convert the number to base 10. We just use a base value of 16 this time, and we have to remember what the letter digits represent. After a little practice you won‘t find the use of letters as digits that strange. Finally, let‘s convert a binary (base-2) number 1010110 to decimal. Once again, we perform the same steps; only the base value changes: 16 Introduction to Computing Science 6 1 * 2 = 1 * 64 = 64 5 + 0 * 2 = 0 * 32 = 0 4 + 1 * 2 = 1 * 16 = 16 3 +0*2 =0* 8= 0 2 +1*2 =1* 4= 4 1 +1*2 =1* 2= 2 0 +0*2 =0* 1= 0 86 Recall that the digits in any number system go up to one less than the base value. To represent the base value in that base, you need two digits. In any base, a 0 in the rightmost position and a 1 in the second position represent the value of the base itself. So 10 is ten in base 10 and 10 is eight in base 8 and 10 is sixteen in base 16. Think about it. The consistency of number systems is actually quite elegant. Addition and subtraction of numbers in other bases are performed exactly like they are on decimal numbers. 2.1.4. Arithmetic in Other Bases Recall the basic idea of arithmetic in decimal. 0 + 1 is 1, 1 + 1 is 2, 2 + 1 is 3, and so on. Things get interesting when you try to add two numbers whose sum is equal to or larger than the base value. For example: 1 + 9. Because there isn‘t a symbol for 10, we reuse the same digits and rely on position. The rightmost digit reverts to 0, and there is a carry into the next position to the left. Thus 1 + 9 equals 10 in base 10. The rules of binary arithmetic are analogous, but we run out of digits much sooner. 0 + 1 is 1, and 1 + 1 is 0 with a carry. Then the same rule is applied to every column in a larger number, and the process continues until there are no more digits to add. The example below adds the binary values 101110 and 11011. The carry value is marked above each column in color. 11111 ← carry 101110 + 11011 1001001 We can convince ourselves that this answer is correct by converting both operands to base 10, adding them, and comparing the result. 101110 is 46, 11011 is 27, and the sum is 73. 101001 is 73 in base 10. The subtraction facts that you learned in grade school were that 9-1 is 8, 8-1 is 7, and so on until you try to subtract a larger digit from a smaller one, such as 0-1. To accomplish this, you have to ―borrow one‖ from the next left digit of the number from which you are subtracting. More precisely, you borrow one power of the base. So in base 10, when you borrow, you borrow 10. The same logic applies to binary subtraction. Every time you borrow in a binary subtraction, you borrow 2. Here is an example with the borrowed values marked above. 1 022 ← borrow 111001 - 110 110011 Chapter 2 Binary Values and Data Representation 17 Once again, you can check the calculation by converting all values to base 10 and subtract to see if the answers correspond. 2.1.5. Power of Two Number Systems Binary and octal numbers have a very special relationship to one another: Given a number in binary, you can read it off in octal and given a number in octal, you can read it off in binary. For example, take the octal number 754. If you replace each digit with the binary representation of that digit, you have 754 in binary. That is, 7 in octal is 111 in binary, 5 in octal is 101 in binary, 4 in octal is 100 in binary, so 754 in octal is 111101100 in binary. Binary Octal Decimal 000 0 0 001 1 1 010 2 2 011 3 3 100 4 4 101 5 5 110 6 6 111 7 7 1000 10 8 1001 11 9 1010 12 10 To facilitate this type of conversion, the table below shows counting in binary from 0 through 10 with their octal and decimal equivalents. To convert from binary to octal, you start at the rightmost binary digit and mark the digits in groups of threes. Then you convert each group of three to its octal value. 111 101 100 7 5 4 Let‘s convert the binary number 1010110 to octal, and then convert that octal value to decimal. The answer should be the equivalent of 1010110 in decimal, or 86. 1 010 110 1 2 6 2 1 * 8 = 1 * 64 = 64 1 + 2 * 8 = 2 * 8 = 16 0 +6*8 =6* 1= 6 86 The reason that binary can be immediately converted to octal and octal can be immediately converted to binary is that 8 is a power of 2. There is a similar relationship between binary and hexadecimal. Every hexadecimal digit can be represented in four binary digits. Let‘s take the binary number 1010110 and convert it to hexadecimal by marking of the digits from right to left in groups of four. 101 0110 5 6 18 Introduction to Computing Science 5 * 161= 5 * 16 = 80 + 6 * 160= 6 * 1 = 6 86 Now let‘s convert ABC in hexadecimal to binary. It takes four binary digits to represent each hex digit. A in hexadecimal is 10 in decimal and therefore is 1010 in binary. Likewise, B in hexadecimal is 1011 in binary, and C in hexadecimal is 1100 in binary. Therefore, ABC in hexadecimal is 101010111100 in binary. Rather than confirming that 10001001010 is 2748 in decimal directly, let‘s mark it off in octal and convert the octal. 5274 in octal is 2748 in decimal. 101 010 111 100 5 2 7 4 In the next section, we show how to convert base-10 numbers to the equivalent number in another base. 2.1.6. Converting from Base 10 to Other Bases The rules for converting base-10 numbers involve dividing by the base into which you are converting the number. From this division, you get a quotient and a remainder. The remainder becomes the next digit in the new number (from right to left), and the quotient replaces the number to be converted. The process continues until the quotient is zero. Let‘s write the rules in a different form. While the quotient is not zero Divide the decimal number by the new base Make the remainder the next digit to the left in the answer Replace the decimal number with the quotient These rules form an algorithm for converting from base 10 to another base. An algorithm is a logical sequence of steps that solves a problem. We have much more to say about algorithms in later chapters. Here we show one way of describing an algorithm and show how we can apply it to perform the conversions. The first line of the algorithm tells us to repeat the next three lines until the quotient from our division becomes zero. Let‘s convert the decimal number 2748 to hexadecimal. As we‘ve seen in previous examples, the answer should be ABC. The remainder (12) is the first digit in the hexadecimal answer, represented by the digit C. So the Chapter 2 Binary Values and Data Representation 19 answer so far is C. Since the quotient is not zero, we divide it (171) by the new base. The remainder (11) is the next digit to the left in the answer, which is represented by the digit B. Now the answer so far is BC. Since the quotient is not zero, we divide it (10) by the new base. The remainder (10) is the next digit to the left in the answer, which is represented by the digit A. Now the answer is ABC. The quotient is zero, so we are finished, and the final answer is ABC. 2.2.Data and Computers Although some of the early computers were decimal machines, modern computers are binary machines. That is, numbers within the computer are represented in binary form. In fact, all information is somehow represented using binary values. The reason is that each storage location within a computer either contains a low-voltage signal or a high-voltage signal. Because each location can have one of two states, it is logical to equate those states to 0 and 1. A low-voltage signal is equated with a 0, and a high- voltage signal is equated with a 1. In fact, you can forget about voltages and think of each storage location as containing either a 0 or a 1. Note that a storage location cannot be empty: It must contain either a 0 or a 1. Each storage unit is called a binary digit or bit for short. Bits are grouped together into bytes (8 bits), and bytes are grouped together into units called words. The number of bits in a word is known as the word length of the computer. For example, IBM 370 architecture in the late 1970s had half words (2 bytes or 16 bits), full words (4 bytes), and double words (8 bytes). Modern computers are often 32-bit machines or 64-bit machines. However, some microprocessors that are used in applications such as pagers are 8-bit machines. The computing machine you are using, whatever it is, is ultimately supported by the binary number system. In the remaining part of this chapter, we will examine many kinds of data and see how they are represented in a computer. In Chapter 3 we see how to control electrical signals that represent binary values. Without data, computers would be useless. Every task a computer undertakes deals with managing data in some way. Therefore, our need to represent and organize that data in appropriate ways is paramount. In the not-so-distant past, computers dealt almost exclusively with numeric and textual data, but now computers are truly multimedia devices, dealing with a vast array of information categories. Computers store, present, and help us modify many different types of data, including: Numbers 20 Introduction to Computing Science Text Audio Images and graphics Video Ultimately, all of this data is stored as binary digits. Each document, picture, and sound bite is somehow represented as strings of 1s and 0s. The following sections explore each of these types of data in turn and discuss the basic ideas behind the ways in which we represent these types of data on a computer. We cannot discuss data representation without talking about data compression—reducing the amount of space needed to store a piece of data. In the past we needed to keep data small because of storage limitations. Today, computer storage is relatively cheap; but now we have an even more pressing reason to shrink our data: the need to share it with others. The Web and its underlying networks have inherent bandwidth restrictions that define the maximum number of bits or bytes that can be transmitted from one place to another in a fixed amount of time. The compression ratio gives an indication of how much compression occurs. The compression ratio is the size of the compressed data divided by the size of the original data. The values could be in bits or characters or whatever is appropriate as long as both values are measuring the same thing. The ratio should result in a number between 0 and 1. The closer the ratio is to zero, the tighter the compression. A data compression technique can be lossless, which means the data can be retrieved without losing any of the original information. Or it can be lossy, in which case some information is lost in the process of compaction. Although we never want to lose information, in some cases the loss is acceptable. When dealing with data representation and compression, we always face a tradeoff between accuracy and size. 2.2.1 Analog and Digital Information The natural world, for the most part, is continuous and infinite. A number line is continuous, with values growing infinitely large and small. That is, you can always come up with a number larger or smaller than any given number. And the numeric space between two integers is infinite. For instance, any number can be divided in half. But the world is not just infinite in a mathematical sense. The spectrum of colors is a continuous rainbow of infinite shades. Objects in the real world move through continuous and infinite space. Theoretically, you could always close the distance between you and a wall by half, and you would never actually reach the wall. Computers, on the other hand, are finite. Computer memory and other hardware devices have only so much room to store and manipulate a certain amount of data. We always fail in our attempt to represent an infinite world on a finite machine. The goal, then, is to represent enough of the world to satisfy our computational needs and our senses of sight and sound. We want to make our representations good enough to get the job done, whatever that job might be. Information can be represented in one of two ways: analog or digital. Analog data is a continuous representation, analogous to the actual information it represents. Digital data is a discrete Chapter 2 Binary Values and Data Representation 21 representation, breaking the information up into separate elements. A mercury thermometer is an analog device. The mercury rises in a continuous flow in the tube in direct proportion to the temperature. We calibrate and mark the tube so that we can read the current temperature, usually as an integer such as 75 degrees Fahrenheit. However, a mercury thermometer is actually rising in a continuous manner between degrees. So at some point in time the temperature is actually 74.568 degrees Fahrenheit and the mercury is accurately indicating that, even if our markings are not fine enough to note such small changes. See Figure 2.1. Analog information is directly proportional to the continuous, infinite world around us. Computers, therefore, cannot work well with analog information. So instead, we digitize information by breaking it into pieces and representing those pieces separately. Each of the representations we discuss in this chapter has found an appropriate way to take a continuous entity and separate it into discrete elements. Those discrete elements are then individually represented using binary digits. But why do we use binary? We know from last section that binary is just one of many equivalent number systems. Couldn‘t we use, say, the decimal number system, with which we are already more familiar? We could. In fact, it‘s been done. Computers have been built that are based on other number systems. However, modern computers are designed to use and manage binary values because the devices that store and manage the data are far less expensive and far more reliable if they only have to represent one of two possible values. Also, electronic signals are far easier to maintain if they transfer only binary data. An analog signal continually fluctuates in voltage up and down. But a digital signal has only a high or low state, corresponding to the two binary digits. See Figure 2.2. Figure 2.1 A mercury thermometer continually rises in direct Figure 2.2 an analog and a digital signal proportion to the temperature All electronic signals (both analog and digital) degrade as they move down a line. That is, the voltage of the signal fluctuates due to environmental effects. The trouble is that as soon as an analog signal degrades, information is lost. Since any voltage level within the range is valid, it‘s impossible to know what the original signal state was or even that it changed at all. 22 Introduction to Computing Science Figure 2.3 Degradation of analog and digital signals Digital signals, on the other hand, jump sharply between two extremes. This is referred to as pulse-code modulation (PCM). A digital signal can degrade quite a bit before any information is lost, because any voltage value above a certain threshold is considered a high value, and any value below that threshold is considered a low value. Periodically, a digital signal is reclocked to regain its original shape. As long as it is reclocked before too much degradation occurs, no information is lost. Figure 2.3 shows the degradation effects of analog and digital signals. 2.2.2 Binary Representations As we undertake the details of representing particular types of data, it‘s important to remember the inherent nature of using binary. One bit can be either 0 or 1. There are no other possibilities. Therefore, one bit can represent only two things. For example, if we wanted to classify a food as being either sweet or sour, we would need only one bit to do it. We could say that if the bit is 0, the food is sweet, and if the bit is 1, the food is sour. But if we want to have additional classifications (such as spicy), one bit is not sufficient. To represent more than two things, we need multiple bits. Two bits can represent four things because there are four combinations of 0 and 1 that can be made from two bits: 00, 01, 10, and 11. So, for instance, if we want to represent which of four possible gears a car is in (park, drive, reverse, or neutral), we need only two bits. Park could be represented by 00, drive by 01, reverse by 10, and neutral by 11. The actual mapping between bit combinations and the thing each combination represents is sometimes irrelevant (00 could be used to represent reverse, if you prefer), though sometimes the mapping can be meaningful and important, as we discuss in later sections of this chapter. If we want to represent more than four things, we need more than two bits. Three bits can represent eight things because there are eight combinations of 0 and 1 that can be made from three bits. Likewise, four bits can represent 16 things; five bits can represent 32 things, and so on. n n In general, n bits can represent 2 things because there are 2 combinations of 0 and 1 that can be made from n bits. Note that every time we increase the number of available bits by 1, we double the number of things we can represent. Let‘s turn the question around. How many bits do you need to represent, say, 25 unique things? Well, four bits wouldn‘t be enough because four bits can represent only 16 things. We would have to use at least five bits, which would allow us to represent 32 things. Since we only need to represent 25 things, some of the bit combinations would not have a valid interpretation. Keep in mind that even though we may technically need only a certain minimum number of bits to represent a set of items, we may allocate more than that for the storage of that data. There are a minimum number of bits that a computer architecture can address and move around at one time, and it is usually a power of two, such as 8, 16, or 32 bits. Therefore, the minimum amount of storage given to Chapter 2 Binary Values and Data Representation 23 any type of data is in multiples of that value. 2.3.Representing Numeric Data Numeric values are the most prevalent type of data used in a computer system. Unlike other types of data, there may seem to be no need to come up with a clever mapping between binary codes and numeric data. Since binary is a number system, there is a natural relationship between the numeric information and the binary values that we store to represent them. This is true, in general, for positive integer data. The basic issues regarding integer conversions were covered in Section 2.1 in the general discussion of the binary system and its equivalence to other bases. However, there are other issues regarding the representation of numeric information to consider at this point. Integers are just the beginning in terms of numeric data. This section discusses the representation of negative and non-integer values. 2.3.1 Representing Negative Values Aren‘t negative numbers just numbers with a minus sign in front? Perhaps. That is certainly one valid way to think about them. Let‘s explore the issue of negative numbers, and discuss appropriate ways to represent them on a computer. Signed-Magnitude Representation You have used the signed-magnitude representation of numbers since you first learned about negative numbers in grade school. In the traditional decimal system, a sign (+ or-) is placed before a number‘s value, though the positive sign is often assumed. The sign represents the ordering, and the digits represent the magnitude of the number. The classic number line looks something like this, in which a negative sign meant that the number was to the left of zero and the positive number was to the right of zero. Performing addition and subtraction with signed integer numbers can be described as moving a certain number of units in one direction or another. To add two numbers you find the first number on the scale and move in the direction of the sign of the second as many units as specified. Subtraction was done in a similar way, moving along the number line as dictated by the sign and the operation. In grade school you soon graduated to doing addition and subtraction without using the number line. There is a problem with the sign-magnitude representation: There are two representations of zero. There is plus zero and minus zero. The idea of a negative zero doesn‘t necessarily bother us; we just ignore negative zero entirely. However, two representations of zero within a computer can cause unnecessary complexity, so other representations of negative numbers are used. Let‘s examine another alternative. 24 Introduction to Computing Science Fixed-Sized Numbers If we allow only a fixed number of values, we can represent numbers as just integer values, where half of them represent negative numbers. The sign is determined by the magnitude of the number. For example, if the maximum number of decimal digits we can represent is two, we can let 1 through 49 be the positive numbers 1 through 49 and let 50 through 99 represent the negative numbers -50 through -1. Let‘s take the number line and number the negative values on the top according to this scheme: To perform addition within this scheme, you just add the numbers together and discard any carry. Adding positive numbers should be ok; let‘s try adding a positive number and a negative number, a negative number and a positive number, and two negative numbers. These are shown below in sign-magnitude and in this scheme. (Note that the carries are discarded.) What about subtraction, using this scheme for representing negative numbers? The key is in the relationship between addition and subtraction: A-B = A + (-B). We can subtract one number from another by adding the negative of the second to the first. In this example, we have assumed a fixed size of 100 values, and kept our numbers small enough to use the number line to calculate the negative representation of a number. However, there is a formula that you can use to compute the negative representation. This representation of negative numbers is called the ten’s complement. k Negative(I) = 10 -I, where k is the number of digits Chapter 2 Binary Values and Data Representation 25 Two’s Complement Although humans tend to think in terms of sign and magnitude to represent numbers, the complement strategy is actually easier in some ways when it comes to electronic calculations. And since we store everything in a modern computer in binary, we use the binary equivalent of the ten‘s complement, called the two’s complement. Therefore, the formula used to compute the negative representation becomes: k Negative(I) = 2 -I, where k is the number of digits Let‘s assume that a number must be represented in eight bits. Addition and subtraction are accomplished the same way as in 10‘s complement arithmetic: – 127 10000001 + 1 00000001 – 126 10000010 Notice that with this representation, the leftmost bit in a negative number is always a 1. Therefore, you can tell immediately whether a binary number in two‘s complement is negative or positive. Number Overflow Overflow occurs when the value that we compute cannot fit into the number of bits we have allocated for the result. For example, if each value is stored using eight bits, adding 127 to 3 would overflow: 01111111 + 00000011 10000010 100000010 in our scheme represents -126, not +130. If, however, we were not representing negative numbers, the result would be correct. Overflow is a classic example of the type of problems we encounter by mapping an infinite world onto a finite machine. No matter how many bits we allocate for a number, there is always the potential need to represent a number that doesn‘t fit. How overflow problems are handled varies by computer hardware and by the differences in programming languages. 2.3.2 Representing Real Numbers In computing, we call all noninteger values (that can be represented) real values. For our purposes here, we define a real number as a value with a potential fractional part. That is, real numbers have a whole part and a fractional part, either of which may be zero. For example, some real numbers in base 10 are 104.32, 0.999999, 357.0, and 3.14159. As we explored in Section 2.1, the digits represent values according to their position, and those position values are relative to the base. To the left of the decimal point, in base 10, we have the 1s position, the 10s position, the 100s position, etc. These position values come from raising the base value to increasing powers (moving from the decimal point to the left). The positions to the right of the decimal point work the same way, except the powers are negative. So the positions to the right of the decimal -1 -2 point are the tenths position (10 or one tenth), the hundredths position (10 or one hundredth), etc. In binary, the same rules apply but the base value is 2. Since we are not working in base 10, the decimal point is referred to as a radix point, a term that can be used in any base. The positions to the right of 26 Introduction to Computing Science -1 -2 the radix point in binary are the halves position (2 or one half), the quarters position (2 or one quarter), etc. So how do we represent a real value in a computer? We store the value as an integer and include information showing where the radix point is. That is, any real value can be described by three properties: the sign (positive or negative one), the mantissa, which is made up of the digits in the value with the radix point assumed to be to the right, and the exponent, which determines how the radix point is shifted relative to the mantissa. A real value in base 10 can therefore be defined by the following formula: exp sign * mantissa * 10 The representation is called floating point because the number of digits is fixed but the radix point floats. When a value is in floating-point form, a positive exponent shifts the decimal point to the right, and a negative exponent shifts the decimal point to the left. Let‘s look at how to convert a real number expressed in our usual decimal notation into floating point. For example, consider the number 148.69. The sign is positive, and there are two digits to the right of 2 the decimal point. Thus, the exponent is 2, giving us 14869 * 10. Table 2.1 shows other examples. For the sake of this discussion, we assume that only five digits can be represented. Table 2.1 Values in decimal notation and floating-point notation (five digits) How do we convert a value in floating-point form back into decimal notation? The exponent on the base tells us how many positions to move the radix point. If the exponent is negative, we move the radix point to the left. If the exponent is positive, we move the radix point to the right. Apply this scheme to the floating-point values in Table 2.1. Notice that in the last example in Table 2.1, we lose information. Since we are only storing five digits to represent the significant digits (the mantissa), the whole part of the value is not accurately represented in floating-point notation. Likewise, a binary floating-point value is defined by the following formula: exp sign * mantissa * 2 Note that only the base value has changed. Of course, the mantissa would only contain binary digits. To store a floating-point number in binary on a computer, we can store the three values that define it. For example, according to a common standard, if we devote 64 bits to the storage of a floating-point value, we use 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa. Internally, this format is taken into account any time the value is used in a calculation or is displayed. Scientific notation is a term with which you may already be familiar, so we mention it here. Scientific Chapter 2 Binary Values and Data Representation 27 notation is a form of floating-point representation in which the decimal point is kept to the right of the leftmost digit. That is, there is one whole number. In many programming languages, if you print out a large real value without specifying how to print it, the value is printed in scientific notation. Because exponents could not be printed in early machines, an ―E‖ was used instead. For example, 12001.32708 would be written as 1.200132708E+4 in scientific notation. 2.4.Representing Text A text document can be decomposed into paragraphs, sentences, words, and ultimately individual characters. To represent a text document in digital form, we simply need to be able to represent every possible character that may appear. The document is the continuous (analog) entity, and the separate characters are the discrete (digital) elements that we need to represent and store in computer memory. We should distinguish at this point between the basic idea of representing text and the more involved concept of word processing. When we create a document in a word processing program such as Microsoft Word, we can specify all kinds of formatting: fonts, margins, tabs, color, and so on. Many word processors also let us add art, equations, and other elements. This extra information is stored with the rest of the text so that the document can be displayed and printed the way you want it. The core issue, however, is the way we represent the characters themselves; therefore, those techniques remain our focus at this point. There are a finite number of characters to represent. So the general approach for representing characters is to list them all and assign each a binary string. To store a particular letter, we store the appropriate bit string. So what characters do we have to worry about? There are the 26 letters in the English language. But uppercase and lowercase letters have to be treated separately, so that‘s really 52 unique characters. Various punctuation characters also have to be represented, as do the numeric digits (the actual characters ‗0‘, ‗1‘, through ‗9‘). Even the space character must have a representation. And what about languages other than English? The list of characters we may want to represent starts to grow quickly once you begin to think about it. Keep in mind that, as we discussed earlier in Section 2.2, the number of unique things (characters in this case) we want to represent determines how many bits we‘ll need to represent any one of them. A character set is simply a list of characters and the codes used to represent each one. There have been several character sets used over the years, though a few have dominated. By agreeing to use a particular character set, computer manufacturers have made the processing of text data easier. We explore two character sets in the following sections: ASCII and Unicode. 2.4.1 The ASCII Character Set ASCII stands for American Standard Code for Information Interchange. The ASCII character set originally used seven bits to represent each character, allowing for 128 unique characters. The eighth bit in each character byte was originally used as a check bit, which helped ensure proper data 28 Introduction to Computing Science transmission. Later ASCII evolved so that all eight bits were used to represent a character. This eight-bit version is formally called the Latin–1 Extended ASCII character set. The extended ASCII set allows for 256 characters and includes accented letters as well as several additional special symbols. The entire extended ASCII character set is shown in Figure 2.4. The codes in this chart are expressed as decimal numbers, but these values get translated to their binary equivalent for storage. Note that the ASCII characters have a distinct order based on the codes used to store them. Each character has a relative position (before or after) every other character. This property Figure 2.4 The ASCII character set is helpful in various ways. For example, note that both the uppercase and lowercase letters are in order. Therefore, we can use the character codes to help us put a list of words into alphabetical order. Also note that the first 32 characters in the ASCII character chart do not have a simple character representation that you could print to the screen. These characters are reserved for special purposes Chapter 2 Binary Values and Data Representation 29 such as carriage return and tab. These characters are usually interpreted in special ways by whatever program is processing the information. 2.4.2 The Unicode Character Set The extended version of the ASCII character set provides 256 characters, which is enough for English but not enough for international use. This limitation gave rise to the Unicode character set, which has a much stronger international influence. The goal of the people who created Unicode is nothing less than to represent every character in every language used in the entire world, including all of the Asian ideograms. It also represents many additional special-purpose characters such as scientific symbols. To accomplish this, the Unicode character set uses 16 bits per character. Therefore, the Unicode 16 character set can represent 2 , or over 65 thousand, characters. Compare that to the 256 characters represented in the extended ASCII set. The Unicode character set is gaining popularity and is used by many programming languages and computer systems today. However, the character set itself is still evolving. Not all of the available codes have been assigned to particular characters. Figure 2.5 shows a few select characters currently represented in the Unicode character set. For consistency, Unicode was designed to be a superset of ASCII. That is, the first 256 characters in the Unicode character set correspond exactly to the extended ASCII character set, including the codes used to represent them. Therefore, programs that assume ASCII values are unaffected even if the underlying system embraces the Unicode approach. 2.4.3 Text Compression Alphabetic information (text) is a fundamental type of data. Therefore, it is important that we find ways to store text efficiently and transmit text efficiently between one computer and another. The following sections examine three types of text compression: keyword encoding run-length encoding Huffman encoding As we discuss later in this chapter, some of the underlying ideas of these text compression techniques come into play when dealing with other types of data as well. Figure 2.5 A few characters in the Unicode character set 30 Introduction to Computing Science Keyword Encoding Consider how often you use words such as ―the,‖ ―and,‖ ―which,‖ ―that,‖ and ―what.‖ If these words took up less space (that is, had fewer characters), our documents would shrink in size. Even though the savings on each word would be small, they are used so often in a typical document that the combined savings would add up quickly. One fairly straightforward method of text compression is called keyword encoding, in which frequently used words are replaced with a single character. To decompress the document, you reverse the process: replace the single characters with the appropriate full word. For example, suppose we used the following chart to encode a few words: Let‘s encode the following paragraph: The human body is composed of many independent systems, such as the circulatory system, the respiratory system, and the reproductive system. Not only must all systems work independently, they must interact and cooperate as well. Overall health is a function of the well-being of separate systems, as well as how these separate systems work in concert. The encoded paragraph is: The human body is composed of many independent systems, such ^ ~ circulatory system, ~ respiratory system, + ~ reproductive system. Not only & all systems work independently, they & interact + cooperate ^ %. Overall health is a function of ~ %-being of separate systems, ^ % ^ how # separate systems work in concert. There are a total of 351 characters in the original paragraph including spaces and punctuation. The encoded paragraph contains 316 characters, resulting in a savings of 35 characters. The compression ratio for this example is 316/351 or approximately 0.9. There are several limitations to keyword encoding. First, note that the characters we use to encode the keywords cannot be part of the original text. If, for example, the ‗$‘ was part of the original paragraph, the resulting encoding would be ambiguous. We wouldn‘t know whether a ‗$‘ represented the word ―that‖ or if it was the actual dollar-sign character. This limits the number of words we can encode as well as the nature of the text that we are encoding. Also, note that the word ―The‖ in the example is not encoded by the ‗~‘ character because the word ―The‖ and the word ―the‖ contain different letters. Remember, the uppercase and lowercase versions of the same letter are different characters when it comes to storing them on a computer. A separate symbol for ―The‖ would have to be used if we wanted to encode it. Chapter 2 Binary Values and Data Representation 31 Finally, note that we would not gain anything to encode words such as ―a‖ and ―I‖ because it would simply be replacing one character for another. The longer the word, the more compression we get per word. Unfortunately, the most frequently used words are often short. On the other hand, some documents use certain words more frequently than others depending on the subject matter. For example, we would have some nice savings if we had chosen to encode the word ―system‖ in our example, but it might not be worth encoding in a general situation. An extension of keyword encoding is to replace specific patterns of text with special characters. The encoded patterns are generally not complete words, but rather parts of words such as common prefixes and suffixes—―ex‖, ―ing‖, and ―tion,‖ for instance. An advantage of this approach is that patterns being encoded generally appear more often than whole words (because they appear in many different words). A disadvantage, as before, is that they are generally short patterns and the replacement savings per word is small. Run-Length Encoding In some situations, a single character may be repeated over and over again in a long sequence. This type of repetition doesn‘t generally take place in English text, but often occurs in large data streams, such as DNA sequences. A text compression technique called run-length encoding capitalizes on these situations. Run-length encoding is sometimes called recurrence coding. In run-length encoding, a sequence of repeated characters is replaced by a flag character, followed by the repeated character, followed by a single digit that indicates how many times the character is repeated. For example, consider the following string of seven repeated ‗A‘ characters: AAAAAAA If we use the ‗*‘ character as our flag, this string would be encoded as: *A7 The flag character is the indication that the series of three characters (including the flag) should be decoded into the appropriate repetitious string. All other text is treated regularly. Therefore, the following encoded string: *n5*x9ccc*h6 some other text *k8eee would be decoded into the following original text: nnnnnxxxxxxxxxccchhhhhh some other text kkkkkkkkeee The original text contains 51 characters, and the encoded string contains 35 characters, giving us a compression ratio in this example of 35/51 or approximately 0.68. Note that in this example the three repeated ‗c‘ characters and the three repeated ‗e‘ characters are not encoded. Since it takes three chara