CO Book.pdf

INDEX S. No Topic Page No. Week 1 1 Evolution of Computer Systems 1 2 Basic Operation of a Computer 22 3 Memory Addressing and La...

INDEX S. No Topic Page No. Week 1 1 Evolution of Computer Systems 1 2 Basic Operation of a Computer 22 3 Memory Addressing and Languages 39 4 Software and Architecture Types 56 5 Instruction Set Architecture 72 Week 2 6 Number Representation 90 7 Instruction Format and Addressing Modes 112 8 CISC and RISC Architecture 131 9 MIPS32 Instruction Set 151 10 MIPS Programming Examples 167 11 SPIM â€“ A MIPS32 SIMULATOR 183 Week 3 12 Measuring Cpu Performance 202 13 Choice Of Benchmarks 221 14 Summarizing Performance Results 239 15 Amadahlâ€™S Law (Part 1) 252 16 Amadahlâ€™S Law (Part 2) 267 Week 4 17 Design Of Control Unit (Part 1) 280 18 Design Of Control Unit (Part 2) 297 19 Design Of Control Unit (Part 3) 313 20 Design Of Control Unit (Part 4) 325 21 Mips Implementation (Part 1) 345 22 Mips Implementation (Part 2) 361 Week 5 23 Processor Memory Interaction 372 24 Static And Dynamic Ram 388 25 Asynchronous Dram 404 26 Synchronous Dram 417 27 Memory Interfacing And Addressing 432 Week 6 28 Memory Hierarchy Design (Part 1) 448 29 Memory Hierarchy Design (Part 2) 468 30 Cache Memory (Part 1) 479 31 Cache Memory (Part 2) 497 32 Improving Cache Performance 511 Week 7 33 Design Of Adders (Part 1) 530 34 Design Of Adders (Part 2) 550 35 Design Of Multipliers (Part 1) 570 36 Design Of Multipliers (Part 2) 586 37 Design Of Dividers 604 Week 8 38 Floating-Point Numbers 625 39 Floating-Point Arithmetic 642 40 Basic Pipelining Concepts 657 41 Pipeline Scheduling 674 42 Arithmetic Pipeline 691 Week 9 43 Secondary Storage Devices 706 44 Input-Output Organization 726 45 Data Transfer Techniques 740 46 Interrupt Handling (Part 1) 754 47 Interrupt Handling (Part 2) 770 Week 10 48 Direct Memory Access 784 49 Some Example Device Interfacing 800 50 Exercises On I/O Transfer 815 51 Bus Standards 828 52 Bus Standards 845 Week 11 53 Pipelining The Mips32 Data Path 864 54 Mips Pipeline (Contd.) 879 55 Pipeline Hazards (Part 1) 890 56 Pipeline Hazards (Part 2) 907 57 Pipeline Hazards (Part 3) 925 58 Pipeline Hazards (Part 4) 939 Week 12 59 Multicycle Operations In Mips32 954 60 Exploiting Instruction Level Parallelism 966 61 Vector Processors 981 62 Multi-Core Processors 997 63 Some Case Studies 1011 64 Summarization Of The Course 1027 Computer Architecture and Organization Prof. Kamalika Datta Department of Computer Science and Engineering National Institute of Technology, Meghalaya Lecture - 01 Evaluation of Computer System I welcome you all to the MOOC course on Computer Architecture and Organization. In this particular course, we expect to cover various aspects of computer design where you will be seeing how we can make a computer faster, how a computer actually works, how the information data are stored there and various other aspects. The lectures will span over 12 weeks where we will cover the instruction set architecture, processor design, arithmetic and logic unit design, memory unit design, input-output system design and then we will also cover parallel processing and pipeline. (Refer Slide Time: 01:28) To start with this course, I will come first to evolution of computer system. So, we all know that computer has become a part and parcel of our daily lives. We cannot disagree to this fact. We see everywhere computers that is some kind of processing unit. When you think about a laptop which we use in our daily use, tablets, mobile phones which are used by one and all today and intelligent appliances like off course your smart phone is one apart from that you have smart watch, and various other appliances. So, computer has become a part and parcel of our life. So, we need to understand how a computer 1 actually works. So, what is there inside a computer? So, we in this particular course we will be seeing all these various aspects where the two terms computer architecture and computer organization will be taken care. (Refer Slide Time: 02:40) Now coming to what is computer architecture, and what is computer organization, the title of the course. So, here computer architecture consists of those attributes of the system that are visible to the programmer. By this what we mean is how the various components of a computer system are integrated to achieve the desired level of performance. In an analogy, I can say that you think of an architect who does who plans the entire design of your house, but it is ultimately the civil engineers who actually does the exact building like what kind of construction will be taken care of, how the construction will be taken care of, how much percentage of cement, bricks will be there , will be taken care by a civil engineer. So, in that respect the design of components and functional blocks using which computer systems are built comes to the organizational part. So, I will take a very small example like you have in your computer some functional blocks like your processor unit; inside processor unit we will be seeing that we have many other components like registers, ALU and other units. I will just take a small example let say I will have an adder, but what kind of adder I will be having that is to the discretion of the computer organization, whether we will have a carry save adder or a 2 carry look ahead adder or anything else. So, these are the two different aspects of computer organization and computer architecture that we will be seeing in this particular course. (Refer Slide Time: 04:49) Coming to the historical perspective, how computers have evolved over the years. So, whenever there is a need for doing certain things then only something comes up like a constant quest of building automatic computing machines have driven the development of computers. So, in the initial efforts, some mechanical devices like pulleys, levers, gears were built. During World War 2, mechanical relays were used to perform some kind of computation like using small relays people design circuits to carry out the operations. Then comes vacuum tubes using which the first electronic computer called ENIAC was developed. And from then semiconductor transistors were developed when semiconductor transistors came into picture then the journey of miniaturization started. First with small scale integration then people moved with medium scale integration then large scale integration then to very large scale integration and now the era of ultra large scale integration where we stand today. 3 (Refer Slide Time: 06:22) So, this is the first mechanical calculator that was invented by B Pascal. So, this particular calculator could add only two numbers, it can only add two numbers or it can only subtract two numbers. And if you wanted to do multiplication and division, it could have been done by repeated addition or repeated subtraction. (Refer Slide Time: 06:58) Then the Babbage engine came this was the first automatic computing engine designed by father of computer Charles Babbage in the 19th century, but he could not build that only he designed that. Later in 2002 that is 153 years after its design, it was built and it 4 consists of 8,000 parts and it weighted 5 tons and it was 11 feet in length. So, you can imagine how large it is. (Refer Slide Time: 07:43) The first electronic computer was built which is called ENIAC - Electrical Numerical Integrated and Calculator. It was developed by University of Pennsylvania, and it uses 18,000 vacuum tubes and weighted 30 tons, and it also occupied a very large space which is 30 feet cross 50 feet. So, what is a vacuum tube? Vacuum tube is a device that controls electric current between electrodes in an evacuated container in a tube. So, using those vacuum tubes the first computer ENIAC was built. It also dissipated a huge amount of heat. 5 (Refer Slide Time: 08:44) Next was Harvard mark 1. This was built at the University of Harvard in 1944, with support from IBM and it uses mechanical relays and off course, some electric signals were also used to work with the relays to represent the data. And it also weighted 35 tons, and required 500 miles of wiring. So, these are the computers which were built in the early stages. (Refer Slide Time: 09:20) Then comes in 60s and 70s where the popular mainframe computer came into picture. So, this popular computer IBM system 360 was introduced; it introduces many advanced 6 architectural concepts that we use today, but you can see from the picture that how big it was. But as I said some of the architectural aspects or concepts that were used in that computer appeared in today’s microprocessor several decades later. (Refer Slide Time: 10:03) Now, we stand, where the modern processor chips comes in dual-core, quad-core and also in 6-core variants. So, you can see that how many core today’s computer has, where each core is a 64-bit processor that comes with various micro architectures. These are the various micro architectures within an I7, where they are called Haskell, Haswell, Nehalem, Sandy Bridge etc. 7 (Refer Slide Time: 10:52) Now, coming to the generation of computers. So, broadly we can classify the growth of computer, how computer has evolved into some generations. So, these actually represent some features of operating system and representative machines. So, between the years 1945 to 54, vacuum tubes and relays were used; and the representative system uses machine language. So, if you want to enter some data into the computer for processing, you have to enter in machine language or either in assembly language, no high level language were used to enter the data. Then comes the second generation where transistors, memories, IO processors are the main technologies. Here batch processing is performed, and you could enter the data in high level language. Then comes the third generation, where we were into semiconductor industry where small-scale integration and medium-scale integration integrated circuits were used. And in that IBM 360, Intel 8008, where the first one to make the step into this computer market. Then in the fourth and fifth generation, this large scale integration and very large scale integration came into picture and we could see multi processors. Now, what has happened is like the space has become larger because the components that we use to build those basic blocks has become smaller. So, in a same space where earlier we could keep small number of components, now we could keep more number of components into the same space. And hence multiprocessors comes into picture. And now we are in the era of ultra large scale integration with many scalable architectures and post-CMOS 8 technologies. Now, we also have massively parallel processors like Pentium, SUN Ultra workstation etc. (Refer Slide Time: 13:34) Now, you see that evolution of the types of computer. So, initially mainframes came in the year 60s and late 70 in between 60s and 70s. Then came mini computers then came workstations, and finally we see personal computers. And then further we see laptops and smart phones. So, every smart phone today is having some processing capability that is why we call it smart phone with some smart features into it. And we also have now millimeter scale sensors and these are not only sensors that it will sense something rather these sensors are having some intelligent capability like some processing capability which can process those data that are collected from the sensors and also it has the communication feature, so that it can communicate to others if required. So, this millimeter scale sensors have both the capabilities of processing as well as communication. So, we can see how the evolution of types of computer systems have come up. And now where we stand in the future, we will see large scale IOT based systems where these sensors itself are very small, where they will have not only the sensing capability, but also the processing capability. 9 (Refer Slide Time: 15:25) Now, you see this evolution of PC form factors over the years. This is the first standard ATX where you can see the motherboard slot. You can see this is a motherboard, you can see the processor, you can see the slots for memories for all other components. Now, how that miniaturization has taken place. First it was standard-ATX, then micro-ATX then goes mini-ITX, now nano-ITX now we are in pico-ITX. So, we are in the era of miniaturization. And now with miniaturization of course, we are able to do some good thing for performance, but at the same time power issue is one of the important factors that we have to also handle which we cannot deny. (Refer Slide Time: 16:37) 10 Now, inside a laptop if you see what all components we have. So, we have shrunked each of the components. So, hard drive is now getting replaced by flash-based memory devices and because of miniaturization cooling is a major issue that has come up. (Refer Slide Time: 17:23) Now, this is the famous Moore’s law that refers to an observation made by Intel co- founder Gordon Moore in 1965. What he noticed is that the number of transistors per square inch on integrated circuits had doubled every year since invention. And this law predicts that this trend will continue into the foreseeable picture, but now Moore’s law stand here where it says that the number of transistor that can be placed in an integrated circuit gets doubled in every 18 months. And this has held over a period. 11 (Refer Slide Time: 18:17) Now, you can see this diagram which how Moore’s law works, this is the number of transistor per chip and this is over the years how it has grown. So, this straight line actually indicates that Moore’s law holds. And starting from 4004 we are now in ivy bridge, sandy bridge, and various architectural advancement has taken place and number of transistor that can be put in a chip has doubled in every 18 months. And as long as it is a straight line, we can say that Moore’s law holds. (Refer Slide Time: 19:16) 12 Now, coming to the simplified block diagram of a computer system. So, in this diagram what you can see is that we have a processor. The processor is having control unit and arithmetic logic unit. We have a memory; memory is divided into primary memory and secondary memory. And you have input devices and output devices. In this architecture course we will be taking each and every aspect of this design like we will be seeing the processor unit design; in the processor unit design there are two parts we will be seeing control unit design and arithmetic logic unit design. We will also look into the memory, memory is broadly divided into primary memory and secondary memory. What we have in primary memory we will be seeing, what we have in secondary memory we will be seeing. And of course, how the inputs are given to the computer and how we get the outputs from the computer that also we will be look into. So, all instructions and data are stored in memory. Whatever instruction or data that we want to execute that is stored in memory. And every time an instruction and data is required it is brought from the memory to the processor for execution, and input output devices are connected to it. So, if input is required input is taken from an input device processing takes place in the processor and the output is provided in the output device. This typical architecture where we store both program and data in the memory alongside we call it von-Neumann architecture we will be seeing this in more detail little later. (Refer Slide Time: 21:31) 13 Now, let us see what is there inside a processor. A processor is also called a Central Processing Unit; it consists of a Control Unit and it consists of an Arithmetic Logic Unit. All the calculations happens inside the CPU. So, any arithmetic computation, like we need to add two numbers, we need to divide two numbers or any arithmetic operation that are performed inside the ALU. The control unit basically generates the sequence of control signals to carry out all operations. So, all the operations that are performed, it is the control unit that generates those sequence of control signals i.e, when I say that we will execute an instruction. So, you have to instruct the computer that, execute this particular instruction. So, giving the computer some kind of instruction, some kind of control signals that, yes now you have to execute this now you have to execute that and finally, you have to store the result or you have to display the result. So, all these steps that we are instructing to a computer is generated by the control unit. In a processor an instruction actually specify the exact operation that is to be performed. So, the instruction will tell you ADD A, B. So, A, B are some operands that perform the required operation, like ADD. So, you have to instruct that this is an operation that has to be performed on some operands. It also specifies the data that are to be operated on. And now let us see what is a program. I have talked about a single instruction. Now, a program is a set of instruction, like I need to add ten numbers. So, adding ten numbers I have to write a set of some instructions; those set of some instructions are written to perform that particular task. A program is a set of instructions that constitute a program. 14 (Refer Slide Time: 24:35) Now, what is the role of ALU? ALU or the processor unit consists of several registers; some registers are called general purpose registers, and some are called special purpose registers, and some of them are temporary storage. So, what are these registers? Registers are some storage unit, and these registers are used to store data, then again we compute some operation and again store back that results into it. We store data for computation; and after the computation is performed, we also store back the data. It contains circuitry to carry out the logic operations. So, basically when we say we are adding two numbers, we are subtracting two numbers, we are doing some kind of operation, some kind of logic operation is performed to carry out that particular operation. So, it also contains circuitry to carry out arithmetic operations like addition, subtraction, multiplication, division. We will be seeing in more detail every aspect of ALU in a separate lecture unit. Now, during execution, the data or the operands are brought in and stored in some register, the desired operation is carried out and the result is stored back in some register or memory. So, what does it means that during an instruction execution the instructions and data are stored in some memory locations. So, we have to bring the data from those memory locations into some of the registers, perform the operation and then we store back the data into either register or into those memory locations. 15 (Refer Slide Time: 26:41) Finally what is the role of control unit? As I said earlier it generates the signals that is necessary to perform the task. So, it acts as a nerve center that senses the states of various functional units and sent control signals to control the states. Suppose you have to carry out an operation where you have to add R2 and R3 and store back in R1. So, what you need to do is that you need to enable the output of R2 and R3 such that the outputs of R2 and R3 are available in a place where you can do the operation. After the operation is performed it is stored in the register R3. So, you have to store the output in a circuit into the of the adder circuit into register R1. (Refer Slide Time: 28:04) 16 Let us say this is an instruction. ADD A,B: this is an instruction. ADD R1,R2 is also an instruction. So, this instruction consists of two parts: the first part we call it Opcode and next part is the operand. Opcode specifies what operation we will be doing and operand is on which we will be doing the operation. So, in this particular case the operation that we will be doing is ADD and on which we will be operating are R1 and R2, these are the two registers where we will be doing the operation. (Refer Slide Time: 29:10) Now, consider the memory unit. There are two types of memory subsystems, two main types. So, primary or main memory stores the active instructions and data for program being executed on the processor. And the secondary memory is used as backup and stores all active and inactive programs and data typically the files. Now the processor can only have a direct access to primary memory. As I said the programs and data is stored in your primary memory and whenever it is required the processor ask it from your primary memory and not from your secondary memory. And in reality, the memory system is implemented as a hierarchy of several levels. So, we have L1 cache, we have L2 cache, we have L3 cache, primary memory and secondary memory. What is the objective of all these things to make the processing faster? So, we will be seeing all these aspects in course of time, but for now you must know that in the memory unit we store instructions and data. And for processing of those 17 instructions on data, you have to bring those instructions and data to the processors to execute it. (Refer Slide Time: 30:59) Now, we have various different types of memory, Random Access Memory, Read Only Memory, we have Magnetic Disk, we have Flash Memories. Random Access Memory is used for cache and primary memory and Read and Write access times are independent of the location being accessed. This means, you either access location 1 or you access the last location or the middle location, the access time will be same. Read Only Memories are used as a part of primary memory to store some fixed data that is not required to be changed. Magnetic Disk uses direction of magnetization of tiny magnetic particles on a metallic surface to store the data and the access time vary depending on the location being accessed, and these are used in secondary memory. Now, Flash Memories are coming into market, which is replacing this magnetic disk as secondary memory. They are much faster and smaller in size, and they do not have any movable part. 18 (Refer Slide Time: 32:28) Now, these are the pictures showing these. The first picture shows the RAM primary memory. Next one is the ROM. This is a hard disk. If you open a hard disk it looks like this, this one is the SSD and so on. (Refer Slide Time: 32:53) Coming to the input unit, it is used to feed data to the computer system. The commonly used devices are keyboards, mouse, joystick and camera. 19 (Refer Slide Time: 33:03) These are the relevant pictures that are shown. (Refer Slide Time: 33:15) And the output unit is used to send the result of some computation to outside world like printer is used to print the data, LCD screen or LED screen is used to see the output on the screen, you have speakers, you have projection system that are also used as an output unit. 20 (Refer Slide Time: 33:42) So, these are some of the relevant pictures showing the output units. So, we have come to the end of lecture 1 where we have seen how computer systems have evolved over the years and what are the main functional components of a computer system, and how these components are required and how we can execute a particular instruction. Thank you. 21 Computer Architecture and Organization Prof. Kamalika Datta Department of Computer Science and Engineering National Institute of Technology, Meghalaya Lecture - 02 Basic Operation Of A Computer (Refer Slide Time: 00:27) Welcome to the next lecture on basic operation of a computer. So, the basic mechanism through which an instruction gets executed shall be illustrated. And just recall that what we discussed in the previous lecture. Your ALU is having some registers, some registers are called special purpose registers and some are general purpose registers. And firstly, what we will do we will discuss some function of special purpose register. (Refer Slide Time: 01:01) 22 Instructions and data are stored in memory. And from memory you have to bring the data and the instruction to the processors and then you have to execute it. For this purpose, we require two special purpose registers they are called Memory Address Register(MAR) and Memory Data Register (MDR). So, let us see first what is memory address register. So, memory address register holds the address of a memory location to be accessed. So, when I say that it holds the address of memory location that is to be accessed that means, I can access a memory location for reading an instruction, I can access a memory location for reading a data and I can also access a memory location for writing back the data. So, memory address register holds the address of the instruction that is to be read or it holds the address of the data that is to be read from the memory or the address of the memory where the data is to be written. The next register is called memory data register. Memory data register or MDR holds the data that is been written into memory or the data that, we will receive when read out from the memory location. So, when we write we always write a data into the memory, but when we read as I said, I can read an instruction or I can read a data. So, this memory data register will contain the instruction that is to be read from the memory or the data that is to be read from the memory or the data that is to be written into the memory, it can contain any of these three. Now, you can see this that addresses it is spanning from 0 to 1023. So, the number of memory locations that we can address is starting from 0 to 1023, hence 1024 memory locations in total. So, a memory is considered as a linear array of storage locations, bytes 23 or words, each with a unique address. So, these are the addresses of the memory location where it is incremented one by one and it has got 1024 locations. Now, what I am trying to say is that this memory address register will hold one of such address that is either 0, 1, 4 or 5 or anything 1011 anything and the memory data register will hold the content of that location. If it is 5, whatever content will be there in that particular location that will be present in MDR. (Refer Slide Time: 04:33) Now, just see this diagram it shows the connection between processor and main memory. So, I talked about MAR and MDR. MAR is Memory Address Register and MDR is Memory Data Register. Memory Address Register is connected with primary memory through address bus. Memory Data Register is connected with primary memory through data bus. And some control signals are also required. So, why control signal are required. So, when I say that this particular data will be read from the memory or this particular data will be written into the memory, so we need to specify that information to the memory that from this particular location we need to read a data or from this particular location we need to write back data. Control signals are for this purpose. First we provide the address in the address bus that hits to the primary memory then we provide the control signals either Read or Write depending on that, a particular value is read from that particular address and it comes to MDR. Or if I want to write a value, I have to put that value in MDR and the address where I want to write in MAR, and then I 24 activate the write control signal. Hence according to read or write a word is read or written from or into memory. (Refer Slide Time: 06:40) So, as I said this slide will summarize it. To read the data from memory, we first load the memory address into MAR then we issue the control signals i.e., read then the data from memory is read and it is stored into MDR. Now, to write into memory what we were doing as I said we have to load the memory address into MAR the data that is to be written must be loaded into MDR and then we issue write control signal. So, for reading these are the following steps that are required and for writing these are the following steps we have to issue. 25 (Refer Slide Time: 07:49) Now, for keeping track of the program what I said like let us say this is my memory, these are some locations. So, this is say 100 location, this is 101 location and in these locations we have some instructions. Now, suppose you are executing at this particular location how will you move to the next location. So, for that reason there must be some counter that will point to location 100. So, you will go to location 100, you will fetch the instruction, execute the instruction. Then your counter should automatically get incremented to 101; such that after executing this instruction, you move to the next instruction and execute the other instruction. (Refer Slide Time: 09:11) 26 So, now we see for this purpose we have two special purpose registers. One is called Program Counter (PC) that holds the memory address of the next instruction to be executed, automatically it is incremented to point to the next instruction when an instruction is fetched and about to be executed. So, PC holds the memory address of the next instruction to be executed. So, once we read and execute an instruction, PC now must point to the next instruction that I have to execute next and so on. Next, another register is the Instruction Register. So, now you see in this diagram, when I say that this is the location. So, PC will contain the location from where I have to read the instruction. Now, I have to read this instruction and store it somewhere, the register where it will get stored is known as Instruction Register where this entire instruction ADD R1,R2 is stored. So, this register Instruction Register temporarily holds an instruction that has been fetched from memory. Now, once an instruction is fetched from memory, you need to decode that instruction. See we understand this is add, but a computer is a layman. So, you have to instruct the computer do this, do that, then only the computer will do that. So, when I say that ADD R1,R2. So, add is an instruction saying that add the content of register R1 with R2, and store back in R1. So, the instruction needs to be decoded and then the computer must understand, now it has to execute the instruction add the content of R1 and R2 and, store back the result in R1. So, the instruction register temporarily holds an instruction that has been fetched from memory, needs to decode to find out the instruction type, which contains information about the location of the data. IR contains the entire instruction. So, after we get this instruction we need to decode to know this is add, and we also need to know locations of R1, R2 registers. So, ultimately after decoding we will understand this add R1, R2 is nothing but some bits of 0’s and 1’s. If a total of 16 registers are present then we need four bits to specify any one of the 16 registers and we can just name that R1 is a four bit register and it is represented by 1000. You can also say R2 is another register which is represented by 1001. And add is an opcode which is represented by some bits. So, ultimately this set of instruction is nothing but bits of zeros and ones. So, we will see this in near future. 27 (Refer Slide Time: 13:31) Now, we can see the architecture of an example processor or in other words we can say that how memory is connected with the processor. So, this is a processor where we have some special purpose registers MAR, MDR, PC and IR you already know the functions of these registers. We have some general purpose registers and we have control unit and ALU. Now, what is the function of this ALU? When I say that ADD R1,R2, R1 and R2 are registers that are present inside your processor. So, these registers we need to bring to ALU so R1 must be brought to ALU and R2 must be brought to ALU and then the particular instruction like add or mul is executed on R1 and R2 and we get the result. So, this is an example processor how it looks like, but there will be many more things that we will see later. 28 (Refer Slide Time: 15:02) Now, we shall illustrate the process of instruction execution with the help of following two examples. So, the two examples that we will be taking the first one is ADD R1, LOCA. (Refer Slide Time: 15:24) The instruction ADD R1, LOCA will add the content of R1 and location A and store back the result in R1. LOCA is a location in memory. So, this is your memory and this is some location. The content of this will be added with R 1 and will get stored back in R 29 1. The next one is simply add the two registers that is R1 and R2 both are present in processor and store back the result in R1. (Refer Slide Time: 16:45) Now, let us see how we will execute this instruction. So, we have to do some assumption here. As I said that firstly, when we say this instruction, this instruction is stored in some memory location. We assume here that the instruction is stored in location 1000, and the initial value of R 1 is 50, and LOCA value is 5000. Before the instruction is executed PC contains the value 1000. Now, the content of PC is transferred to MAR as we need to read the instruction. To read the instruction from memory what I said earlier that the address needs to be loaded in MAR, and we need to activate the read control. Hence 1000 now should be transferred to MAR, then a read request is issued to memory unit. After the reading is performed, the instruction is in MDR; and then from MDR, it should be transferred to IR Instruction Register. And while doing these steps, we have to increment the PC to point to the next instruction. As I said in computer there will be sequential execution of instruction and the instructions are stored one by one. So, here first PC was pointing to 1000, we fetch the instruction from location 1000 and then the PC should point to the next location that is 1001. So, PC will now point to the next location which is 1001. And next whatever instruction that we have read from memory is loaded in MDR; and from MDR, it will be transferred to IR. And we know after it has 30 come to IR the instruction needs to be decoded and then it has to be executed. We will see step by step how it happens. So, from this we can say that firstly, PC value is transferred to MAR. Read signal is activated. The content specified by the memory location(MAR) is read and it is stored in MDR. From MDR, it is transferred to IR and at the same time PC is incremented to 4. Why we have said here 4, it is depending on the word size, we will be coming to this little later, but it depends on what will be the word size. So, I will just explain it once. Let us consider a byte addressable. So, let us say if this location is 2000, and the next location is 2004 that means, we are using a 32-bit machine; and the PC is incremented by 4. If it is a 64-bit machine, the PC will get incremented by 8. So, here the PC is incremented by 4, so it is a 32-bit machine. Now, once the instruction comes to IR it needs to be decoded and executed. Let us see how it is done. Now, location is for example, it is 5000 it is transferred from IR to MAR. So, if you consider the same instruction that is ADD R1, LOCA. Now, this is in IR. After decoding it has understood that one of the operand is in memory. If one of the operand is from memory then you have to read that particular operand the data which is there in that particular location from the memory. (Refer Slide Time: 21:54) So, you have to load that location into MAR, then activate the read control signal then the data that is present in that location LOCA that is 5000 will be read and will be 31 transferred to MDR. Now the MDR contains one of the operand and R1 contains one of the operand. These two will be brought into your ALU which will be added and stored back in R1. So, these are the steps that we follow to execute the instruction ADD R1, LOCA. In this instruction, how many read operation we have to do the first read operation we perform to fetch the instruction. And after decoding we again see that one of the operand is a memory location. So, we again have to read that particular location from the memory, we again read that particular data from that memory location, we execute the instructions and store back the result. So, two memory operations were required in this particular instruction. Let us move on with the next one. So, these are the steps that are being carried out and these are called micro operation. We will see in more detail later, but just to see first PC value is transferred to MAR. Then whatever value of MAR is read from memory and it is stored in MDR. Now, from MDR the value is moved to IR. The PC gets incremented to point to the next memory location. If we see that there are some operands that is to be read from memory, then it has to be read and then transferred to MAR. And then read signal is activated and data is read and is brought into MDR. And finally, now all my data are present in processor both R1 is present in processor after reading MDR the data of LOCA is also present in MDR. So, both are processor register we need to add this and store the result in R 1. (Refer Slide Time: 24:32) 32 So, whatever I have explained is that PC contain this value, MAR contain this value. PC get incremented it points to 1004 then MDR initially have the instruction ADD R1, LOCA that is moved to IR. Then after decoding LOCA value will be in MAR the data will be read and in that location the value is 75 which is stored in MDR. And finally, after adding this the value is stored back in register R 1. (Refer Slide Time: 25:14) In a similar way, let us see execution of another instruction that is ADD R1,R2. Here you can see that in this instruction both R1 and R2 are processor register. So, you need not have to bring anything from memory. So, both the operands are present in your processor register; all you need to do is that you have to read this particular instruction from the memory, and then you will execute this instruction. So, assume that the instruction is stored in memory location 1500, the initial value of R1 and R2 are 50 and 200. Before the execution, PC contains 1500 and then the content of PC is transferred to MAR. We activate the read control signal the instruction is fetched and it is stored in MDR. Then the content of MDR is transferred to IR, it is decoded. And then finally, the content of R1 and R2 that is 50 and 200 are added and stored back in R1. 33 (Refer Slide Time: 26:34) So, these are the steps that are shown. PC will contain 1500 then MAR will also contain that. PC will get incremented by 4. And then the entire instruction is brought into MDR. IR will also contain this instruction, this instruction will now get decoded and the value of R1 and R2 will be added, and will be stored back in R1 that is 250. (Refer Slide Time: 27:39) Now, coming to the bus architecture. So, we were discussing about how an instruction will get executed. When we say that processor is a module, memory is a module, input output is a module, so all these are communicating between each other. So, they need a 34 communicating pathway to communicate between each of these functional units. So, the different functional modules must be connected in an organized manner to form an operational system. By bus what we refer is to a group of lines that serve as a connecting path for several devices. So, the simplest way to connect all these is through a single bus architecture, where only one data transfer will be allowed in one cycle. For multi bus architecture parallelism in data transfer is allowed. (Refer Slide Time: 28:15) So, let us see this. So, this is a single bus that we can see where all our modules, memory, processor, input and output are connected to a single bus. So, if the processor wants to communicate with memory, it has to use this bus and no other modules will be using that at that moment. In the same way if from memory something needs to be sent to output device, no other module can communicate. This is a bottleneck of a single level system bus. 35 (Refer Slide Time: 28:54) Now, coming to the system-level two-bus architecture. In this two-bus architecture, what we can see is that there is a bus dedicated to processor memory and of course, the IO processor is also connected to it. But for input and output there is a separate bus and this bus will be communicating with the IO processor in turn and the IO processor will then be communicating with the processor or the memory as and when it is required. So, when there is more communication between processor and memory then we must have a bus dedicated for this. (Refer Slide Time: 29:43) 36 Now we can also see a bus that is required within the processor. Within the processor, we have seen that there are many components many transfers are taking place. So, whenever we are moving some register value within the processor, we also need a bus within the processor. So, in that bus, there are other components like, ALU and the registers that are all connected via this single bus. The bus is basically internal to the processor. So, I am talking about a bus which is inside the processor. Typical single bus processor architecture is shown in the next slide. (Refer Slide Time: 30:49) This is an internal processor bus. So, all these things are inside the processor PC, MAR, MDR, Y register is there and then ALU is there. There are some temporary registers Z, Y and there are some general purpose registers; IR is also there and there is a instruction decoding and control unit. So, information is getting transferred within all these modules. If I have to transfer a data from R1 to R2 or I have to transfer a data from PC to MAR or have to transfer a data from MDR to R1 or R2 or to ALU, some control signals needs to be generated. So, how they will be communicating, they will be communicating through this internal processes bus. 37 (Refer Slide Time: 31:44) Now, multi-bus architectures are also there. So, modern processors use this multi-bus architecture. Here, within your processor to communicate between various registers you have multiple-bus. The advantage we get is that more operations can be performed and we get results much faster. So, there will be a overall improvement in instruction execution time. Some smaller parasitic capacitance can be there and hence smaller delay. So, we have come to the end of lecture 2. In lecture 2, what we have studied how an instruction actually get executed, various processor registers, various registers that it present inside the processor. And finally, how we can increase the performance using either single-bus architecture or multi-bus architecture. Thank you. 38 Computer Architecture and Organization Prof. Kamalika Datta Department of Computer Science and Engineering National Institute of Technology, Meghalaya Lecture - 03 Memory Addressing and Languages (Refer Slide Time: 00:29) Welcome to the third lecture on memory addressing and languages. So, let us know about the overview of memory organization. What is memory? Memory is one of the most important subsystems of a computer that determines the overall performance. What do you mean by that? See as you have seen in the previous lecture that we are storing the instruction and data in the memory. If your memory is slower then loading the data from the memory will be slower. So, in that case we need to have a good speed memory. The conceptual view of memory is it is an array of storage locations with each storage location having a unique address. So, it is an array of memory locations. 39 (Refer Slide Time: 01:30) So, as I said it is an array of memory locations. So, we have first location as 0 0 0 0, next location as 0 0 0 1 and so on, maybe the last location is 1 1 1 1. So, it is an array of storage location each with a unique address. So, these are individual locations and this is the address associated with each location. And each storage location can hold a fixed amount of information, which can be multiple of bits which is the basic unit of data storage. A memory system with M locations and N bits per location is referred to as an M x N memory, where both M and N are typically some powers of 2. An example: 1024 x 8. So, if I say 1024 x 8, it means we have 10-bit in the address, and each location is having 8-bit. So, you can store these many locations in these many locations; this shows how a memory will look like. 40 (Refer Slide Time: 03:53) Now, some terminologies you must know when we talk about memory. What is a bit, we all know a bit is a single binary digit either 0 or 1. Nibble is a collection of 4 bits. Byte is a collection of 8 bits. And word does not have a unique definition because we can either have a 32 bit word length or 64 bit word length. So, word does not have a unique definition. (Refer Slide Time: 04:29) Now let us see how is memory organized. Memory is often byte organized. So, we never say that each bit is having an address, rather we say each byte is having an address, that 41 means every byte of the memory has a unique address. And multiple bytes of a data can be accessed by an instruction. I will just take an example: ADD R1,LOCA. So, if you consider this instruction, it is depending on how many bits this ADD will have, how many bits this register will have, and how many bits this location will have; this will define that how many words this instruction will have or how many bytes this instruction will have. So, in that sense what I am trying to say is that how many bytes this instruction will take is dependent on various other factors like the total number of instructions available in your computer. The total number of registers present in your computer, and also the number of locations you are having based on which you can determine the number of bytes required to represent this particular instruction. For higher data transfer rate, memory is often organized such that multiple bytes can be read or written simultaneously. This is basically needed to bridge the processor memory speed gap; we shall of course discuss this later, but I will just tell very briefly about this memory processor speed gap. So, as you know that processor speed is increasing memory speed is also increasing, but not at this pace the processor is increasing. (Refer Slide Time: 07:06) So, this picture will show you the processor memory performance gap. See with technological advancement both processor speed is increasing and also memory speed is increasing; however, there is a speed gap which is steadily increasing. So, earlier the 42 speed gap was much less, but now with technological advancement CPU speed has increased to a greater extent; memory speed has also increased, but not at the same pace as the CPU. So, we can see this. So, some special techniques are used to bridge this gap. We will see this in the memory module design, where the concept of cache memory and memory interleaving will be talked about, but from this we can see what we can say is that there is a still huge gap between the speed of a processor and speed of your memory. So, this is where we have to agree upon that. We are still in the phase we are growing we are trying to make certain techniques to bridge this gap, but still this gap exists. (Refer Slide Time: 08:40) Now, how do you specify memory sizes. Memory sizes can be 8 bit which is a byte. It can be kilobyte 210; it can be megabyte 220; gigabyte 230; terabytes 240, and many more like petabyte, exabyte and zettabyte. 43 (Refer Slide Time: 09:09) Now, you see if there are n bits in an address, the maximum number of storage locations that can be accessed is 2n. (Refer Slide Time: 09:31) This is a small example, we will take n = 3. So we can say how many locations we can access; 23 = 8. So, the first location will be 0 0 0, the next location will be 0 0 1, next will be 0 1 0, 0 1 1, 1 0 0, 1 0 1, 1 1 0 and 1 1 1. So, with n bits we can have 2n locations that can be accessed. 44 So, if we have 3 bit in the address, so maximum location that can be accessed is 8. So, for n = 8, 256 locations (28); for n = 16, 216 = 64K locations; for n = 20, 1M locations, etc. can be accessed. So, this diagram shows the address bits; if you have n bit address we can have 2 to the power n locations that can be accessed. And modern-day memory chips can store several gigabytes of data that is our dynamic RAM. We will be looking into more details about each and every aspect of memory module. (Refer Slide Time: 11:33) Now, as I said for an 8 bit address, 2 to the power 8 unique locations will be there. The first locations will be all 0s, and the last location will be all 1s; and each of these locations again will have some content. So, consider an example of 28 x 16 memory. So, in each of these locations we will have some data which is 16 bits. 45 (Refer Slide Time: 12:02) Let us see a computer with 64 MB of byte addressable memory. How many bits are needed in the memory address? As I already said, that 64 MB = 2 26 ; that is, we need 26 bits to represent the address. Now, let us take another example where we say a computer has 1 GB of memory. So, we are saying total of 1 GB of memory, each word in this computer is 32 bit. (Refer Slide Time: 13:18) 46 So, 1 GB = 230. If each word is 32 bits, that means 8, 8, 8, 8. So, total words possible will be 230 / 4 = 228. So, we require 28 bits, with address from 0 to 2 28-1. If it is byte addressable, each byte can be accessed with address from 0 to 230-1. (Refer Slide Time: 15:02) Now, let us also understand what is byte ordering convention. Many data items require multiple bytes for storage. And different computers use different data ordering convention, it is known as low order byte first and high order byte first. So, these two are called basically Little Endian and Big Endian. So, you see this data type character is 1 byte, integer is 4 byte, long integer is 8, floating point 4 and double precision is 8. Thus if you have a 16 bit number which is represented like this, so in one way this is the total number high order bit is stored in high order address, and the low order is stored here and so on here it is stored differently. This is stored first and this is stored next. 47 (Refer Slide Time: 16:13) So, let us see the convention that has been named as little endian. The least significant byte is stored at lower address followed by most significant byte. So, Intel processors DEC alpha they all use little endian method, where again I repeat least significant byte is stored at lower address. Now, same concept follows for arbitrary multi-byte. So, if you also want to store multi byte then the same thing will follow there also. Now, in big endian the most significant byte is stored at lower address followed by least significant byte. So, the most significant byte is stored at lower address followed by least significant byte. IBM’s 370 mainframe uses big endian concept. (Refer Slide Time: 17:14) 48 Now, let us see this representation with this example of a 32-bit number both in little endian and big endian. So, this is the number and lower byte is stored in lower address, then the next byte, then the next byte, and then the next byte. And here the higher order address higher order byte is stored lower address, then the next byte then the next byte and then this one. So, just see the difference between these two in little endian what we are storing; this is the least significant byte, we are storing in the least address that is 2000 and then the next one, next one, next one. In a similar way, the most significant byte we are storing it in the lower address and so on. (Refer Slide Time: 18:23) Let us see memory access by instructions. Now, as I said that the program and instruction in the program instructions and the data are stored in memory. So, there are two basic ways how this can be stored. One is von-Neumann architecture, they are stored same in the same memory both the program and the data are stored in same memory. In Harvard architecture, they are stored in different memories. So, for executing the program two basic operations are required. What are the two basic operation we need to know this. Load the content of specified memory location. So, LOAD R1,2000 means load the content from memory location 2000 into processor register R1. Another one is STORE R3,2010 that means store the content of R3 into this specific location 2020. So, either we load a data from memory into one of the processor registers, or we store the data from processor register to some location in memory. 49 (Refer Slide Time: 20:11) Now, let us take an example. Suppose we need to execute this particular instruction. We need to compute this. S, A, B, C and D are stored in memory. So, what we need to do is we need to load individual data, that is A and B and then only we can execute it. So, let us see what are the steps that are required to execute this. First of all let us do A + B. For this, I need to load first A into some processor register, B into some processor register and then add tyem. So, LOAD R1,A will load the content of A into R1; LOAD R2,B will load the content of B into R2. The ADD instruction will add the content of R1 and R2, and store the result in R3. So, this portion is done. Next I load C into R1, D into R2, then I subtract C, I subtract D from C, and then I store the result in R4. Now, A + B is stored in R3, C - D is stored in R4, and now I need to subtract them. SUB R3,R3,R4 will cause the result to be stored in R3. But finally, I have to store the result in another location, that is S. So, what I should do now I have to store the content of R3 into S. 50 (Refer Slide Time: 22:58) Now, let us also understand what is machine language, assembly language, and high level language. Machine language is native to a processor and it is executed directly by in hardware. So, it only consists of binary code 1s and 0s. (Refer Slide Time: 23:29) So, I have talked about this if you remember that let say this is my instruction. Again I take the same example ADD R1,LOCA and as I said this can be 01001 (5-bit), this can be a 4-bit number, and this can be a 12-bit number. So, what I am specifying I am specifying the entire instruction in a sequence of 0s and 1s. So, when I specify an 51 instruction in the form of 0s and 1s, is called machine language, which is native to processor executed directly by hardware. Next is assembly language. It is also a low-level symbolic version of machine language. Instead of 0s and 1s, I write it symbolically as ADD R1,LOCA. When I represent something symbolically; instead of writing 0s and 1s, I am writing it with some mnemonics. So, this language is a low-level symbolic version of machine language; one to one correspondence with machine language is there. Pseudo instructions are used that are much more readable and easier to use. What do you mean by much more readable and easier to use? That means it will be difficult for me to remember 0110 is add, but it will be very much easy for me to remember the mnemonic ADD. So, I can write an instruction ADD R1,R2. So, this is much easier way to represent a program. So, assembly language is nothing but some kind of one-to-one correspondence with machine language. Next comes to high-level language. Now, you see if you have to add ten numbers or you have to sort some list in an array, you need to perform certain operation. Again writing such kind of language where we say that first sum of 10 numbers you have to initialize it and then you have to repeatedly add it. So, there will be set of more instruction that are required to perform an add operation; rather for adding 10 numbers, I can very easily write let say for (i=0; i, but the value of the number is minus 1 to the power S into M multiplied by 2 to the power E. 629 (Refer Slide Time: 11:39) Coming again to this representation as I said s is the sign bit. This indicates whether the number is negative (for that s is 1) or it is a positive (s is 0). This M is conventionally known as the mantissa and I will explain a little later why this mantissa is usually a fraction, which is in the range 1.0 to 2.0. And E is called the exponent, which is weighted as you can see by a power of 2. 2 to the power E; and in the IEEE floating point format you can have either single precision numbers or double precision numbers. The general format looks like this. The number starts with the sign bit, followed by some number of bits for the exponent, and finally some bits for the mantissa. For single precision numbers the total size is 32 bits, for E you have 8 bits and for M you have 23 bits. 630 (Refer Slide Time: 13:09) But for double precision numbers you have 64 bits in total, where E is 11 bits and M is 52 bits. Now just going back once the number of significant digits in your representation will depend on how many bits you are using for the mantissa. This is the number of significant digits, it will depend on how many bits you have in M. Now, although for single precision number we have kept 23 bits in the mantissa, but actually the idea is as follows. The mantissa is represented as a number that always starts with a 1. (Refer Slide Time: 13:44) 631 And there is a implied radix point here and after that you can have anything, 0 1 0 1 1. The first digit is always 1. We assume that this is an implied bit and because it is always 1 you do not store this; you store only the remaining number of bits. So, here for single precision there are 23 bits here, but if you also count this 1 it becomes 24. So, the number actually is a 24-bit number, but because it always starts with a 1 you are actually storing 23 bits. So, when you are calculating you will be assuming 24-bit mantissa because that implied 1 bit is also there. Now how do you calculate number of significant digits is very simple. In binary you have 24 bits, in decimal how many bits? You use this; this equation 2 to the power 24 equal 10 to the power x. You take logarithm on both sides, and get x = 7.2. This means in decimal you can have 7 significant digits, but if you think of double precision number where for mantissa we have 52 bits. 52 plus 1 will be 53; 53 multiplied by log 2. So, it will be much larger --- 15 or 16 digits you will have for significant digits. Similarly for exponent in single precision number you have 8 bits, and this exponent can be either positive or negative 2 to the power plus something or 2 to the power minus something. This 8 bit is actually you can regard it that is the 2’s compliment signed integer; range will be -128 to +127. If you want to find out the range in decimal you follow a similar calculation, 2 to the power 127 is equal to 10 to the power y. Finally y comes to about 38 point something. So, in decimal equivalently you can have maximum exponent value as 38 which means in single precision you can represent up to 7 significant decimal places and the range of the number can be 10 to the power 38 to 10 to the power -38. 632 (Refer Slide Time: 16:50) Now in floating point number there is an interesting concept called normalization; let us look at the encoding part of it that how the exponent and mantissa are actually stored or encoded. Let us assume that the actual exponent of the number is EXP. The number is M multiplied with 2 to the power EXP. We are encoding this EXP in some way and we get E. Now the value of E can range from 1 up to 254; it is an 8 bit number you know for unsigned number the range is 0 up to 255, but here we are saying 1 up to 254, which means we are leaving out 0 and 255. So, the all 0 and all 1 patterns are left out; they are not allowed. So, how we are encoding E, you have your actual exponent EXP; we are adding a bias to it. You see EXP can be either negative or positive, but after we add this bias E becomes always positive. So, how much is the bias? For 8-bit exponents for single precision you take the bias as 127. So, because you see I mean say again if you look at the value of the actual exponent it can range from -128 to +127. Now this one combination you leave out let us say it is from - 127 to +127. So, if you add a bias 127 to it, -127 + 127 become 0, and 127 + 127 becomes 254. So, this is positive. The bias is 127 for single precision, and it is 2 11-1 - 1 which is 1023 for double precision, because in double precision we have 11 bits for the exponent. 633 (Refer Slide Time: 19:31) Now, for the mantissa as had said earlier we encode the mantissa in such a way that it always starts with a leading 1 and whenever you encode mantissa in this way we say that the number is normalized. If the mantissa starts with 0 we say that it is not normalized; and this x x x denote the bits that are actually stored because the first bit is always 1; we do not store it explicitly. We are getting this extra bit for free, we are storing only the remaining 23 bits. When the value of x is all 0’s the value of M is minimum 1.0 0 0 0 0 the value is 1. When x is all 1’s; that means 1.11111. As we saw earlier point 1 1 1 1 1 is the value which is very close to 1. The value of this number becomes very close to 2; 2 minus very small value epsilon. You recall we mentioned earlier that the mantissa M will be having a value between 1 and 2, and this is why it is so. 634 (Refer Slide Time: 20:54) Let us look at some encoding examples. We consider a number F let us say 15335. If you convert this into binary it is like this, and if you expressed it in fractional form, if you put a decimal point here the radix point here. So, if you count the digits into 2 to the power 13 there are 13 binary bits here. So, if you express the number like this you see the mantissas already normalized. It starts with 1; so you are not storing this 1 you are storing the remaining bits. So, the mantissa will be stored as this. How many bits will be there for the mantissa? 23 bits and exponent is 13. So, bias for single precision is 127. So, the actual value of E will be 13 + 127 = 140. This number will be stored as sign positive, followed by exponent, followed by mantissa. If you divide this number into 4 4 bits and convert into hexadecimal you will see you will start is 0 1 0 0 which is 4, then 0 1 1 0 which is 6 ,again 0 1 1 0 which is 6, 1 1 1 1 which is F, and so on. 635 (Refer Slide Time: 22:24) Let us take another example where the number is negative -3.75, which in binary you can write like this. If you express in normalized form push the decimal point here it becomes like this. So, we will be storing the mantissa as only this 1 1 1 this one we do not stored and the remaining zeros and exponent EXP is 1, you add the bias it becomes 128 which is this. So, the number will be represented as sign is one, negative exponent and mantissa which in hexadecimal is C070. (Refer Slide Time: 23:19) 636 Now, some special values are supported. When we saw earlier we mentioned that the values of E all 0’s and all 1’s are not allowed in actual numbers, but what happens when the value of E is all 0’s or all 1’s. So, these are the so-called special values. When E is all 0’s then if M is also all 0’s this combination represents this special number 0. So, you see the number 0 is represented as an all 0 string. It can be easily checked by hardware. And for E all 0’s if we have a mantissa which is not all 0’s this represents some numbers which are very close to 0, because see E = 0 means it is already in biased format means 2 to the power minus 127. After adding 127 you are getting 0. So, the magnitude of the number is really very small. Here we are representing numbers which are very close to 0 when E is all 0 these are sometimes called de-normalized numbers because if M is all 0, you do not have a 1 at all you cannot make the first bit 1. So, for these de-normalized numbers your exponent has to be all 0’s meaning that this is a special kind of a number where the mandatory requirement of the first bit of the mantissas 1 is not to be assumed in this case, and 0 is represent with all 0 string. When E is all 1’s then the all 0’s combination of the mantissa represents the value infinity; this is our representation in the IEEE format, and any value which is not equal to 0 this represents an invalid number which means not a number sometimes it is called NaN. Why it is required? Because you see there is certain cases where you do not know the value of a number like you have defined some variables, but not initialized them. 637 (Refer Slide Time: 26:07) In this scale this summary of number encodings you can see that in between when you have the exponent and mantissa values all 0, you have the number 0. So, there are two representations of 0, +0 and -0. And for very small numbers you can have de-normalized numbers; and when you have normalized numbers it will be beyond that and you can go up to plus infinity, plus infinity is a special representation; minus infinity is also special representation. Beyond that you have invalid numbers which are NaN. (Refer Slide Time: 26:54) 638 Now, there is another feature of the IEEE format, which is also very much useful this called rounding. Suppose when we add two numbers in single precision. Normally we add the mantissa value after aligning the exponents. What do you mean by exponent alignment? Suppose my one number is 1. 1 0 0 into 2 to the power 2. (Refer Slide Time: 27:18) Let us say other number is 1. 0 1 1 into 2 to the power 1. So, when we want to add we cannot straight add because the exponents and not same. What we do this second number we make it 2 to the power 2, and for that we shift the decimal point to the left by one position. So, it becomes 0.1011 then we can add 1.100 with this, and we get the result. This is what is mentioned here. We add the mantissa value after shifting one of them. We shall be coming to this again. After adding, the first 23 bits of this sum is taken and the remaining bits are discarded; this is called the residue. Now IEEE format supports four different rounding modes, one is truncation --- beyond 23 bits you just discard the remaining bits this is truncation, rounding to plus infinity is the second mode what it says is that you look at r if you see where there r is greater than or equal to 0.5. If r is greater than or equal to 0.5 just like ceiling function, you increase mantissa by 1. Similarly if it is less than 0.5 we will rounding to minus infinity you move it to the next lower integer value, and second one is a rounding. 639 So, depending on the value of the r here you are either moving into the next higher or the next lower, and rounding means if it is 0.5 or higher you move it up, otherwise you move it down. (Refer Slide Time: 29:14) This is rounding to the nearest. To implement this two temporary bits are used in the representation, one is called round bit r which is the most significant bit of the remaining residue R, and sticky bit s which is actually the logical OR of the remaining bits of R other than the MSB. (Refer Slide Time: 30:24) 640 Here are some exercises for you to work out. We have seen some representations of floating point numbers and in particular the IEEE format that is almost universally used nowadays in almost all computer systems. We have seen how numbers are represented, how some special numbers are represented that are very useful during computations and also how we can do rounding of the numbers. With this we come to the end of lecture number 38. In the next lecture we shall be starting discussion on how we can carry out arithmetic using floating point numbers. In this lecture you have looked at just the representation, later on we would be seeing how we can carry out addition, subtraction, multiplication and division on floating point numbers. Thank you. 641 Computer Architecture and Organization Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 39 Floating - Point Arithmetic In this lecture we shall be starting or discussion on Floating - Point Arithmetic. We have seen earlier how a floating point number can be represented. Now we shall see how with this representation, we can carry out addition, subtraction, multiplication, division and for doing that what kind of hardware is required. (Refer Slide Time: 00:44) We start with floating point addition and subtraction. We consider them together because the process is very similar, and the same hardware can be used for performing both addition and subtraction. Let us assume that we have two numbers; for the time being we are not showing this sign, sign is also there. The first numbers is M1 x 2 to the power E1, and the other number is M2 x 2 the power E2, and we assume that the exponent value of the first number is larger than that of the second, E1 > E2. For addition or subtraction the basic steps will be as follows. Here I have assumed that the second number has the smaller exponent. Select the number of the smaller exponent, here it is E2 and shift its mantissa right by E1 - E2 positions, 642 such that this E2 becomes equal to E1. That means, we are aligning the mantissa such that the exponents are becoming equal. Suppose this was 2 to the power 10 and this was 2 to the power 5; you shift M2 by 5 positions such that this also becomes 2 to the power 10. So, both the exponents will become equal. So, after this we will be adding or subtracting M1 and M2 straight away, because M2 is already shifted right. We have already aligned the mantissa. After shifting we simply carry out addition or subtraction depending on which operation we are trying to carry out. Also we can see the result of addition and subtraction, and you can calculate the sign of the result and after addition or subtraction we may need to normalize the result because the result may not start with a 1. For normalization we have to shift it left, so that the first bit of the mantissa always becomes 1, so that we can represent the mantissa in a suitable way excluding that implied 1. (Refer Slide Time: 03:14) Let us take an example. Suppose we are adding two numbers F1 and F2, values are 270.75 and 2.375. 270.75 if you express in binary it becomes this and in a normalized binary floating point representation, you can write it as this. So, it becomes this multiplied the 2 to the power 8. Similarly F2 can be represented in binary as this. So, if you shift this point by one position it becomes this. So, it becomes 2 the power 1. 643 Now for the two numbers, one of them is 2 to the power 8, other is 2 the power 1; second one is smaller. So, what I do we shift the mantissa of the number with the smaller exponent; that means, F2 by 8 - 1 = 7 positions. So, it becomes like this. Then we add the two manissas. So, you take the first 24 bits of the sum, this will be your actual result, and these remaining bits will be the residue. So, your actual result will be this where it this number is already normalized it starts with 1. (Refer Slide Time: 05:32) Similarly, let us take a number 224 and 270.75. We can represent the numbers as follows. Similarly we will have to shift the mantissa of F2 by one position. Then you subtract the mantissas. You need a normalization step here and the result will be this. This adjustment is required at the end. This is the process of addition or subtraction of floating point numbers; what are the steps that are involved. From this you can directly generate the circuit. 644 (Refer Slide Time: 07:16) You look at this schematic. These are the two numbers, this is the first number, this is second number; sign exponent fraction and sign exponent fraction. This is a small 8 bit ALU; this is actually subtracting the two exponents and finding out which one is smaller; also it is calculating the exponent difference because later on we will have to shift the mantissa by that amount. So, there is a control circuit here. This information is going to the control depending on which exponent is smaller. There is another ALU where this smaller of the exponents will be coming to the first input; there is a multiplexer here. Either this fraction or this fraction will be selected; control will be generating the signal. That will be shifted right by so many positions (exponent difference). After shifting you do the actual addition or subtraction; these two multiplexers are selected by the control unit. After you do the addition or subtraction again similarly you will have to look at the process of normalization. Like in the previous example you have seen that the result was not normalized; you had to shift it left by 3 positions. You do the same thing, you may have to shift it and you see by how many positions you have to shift. Accordingly you make the adjustment and for IEEE format you need an additional step for rounding. So, if you have enabled rounding, after everything is done the mantissa will be rounded depending on those two bits r and s and you get the final result. 645 The concept of floating point addition subtraction is very simple because already you have seen the steps involved. You have to compare the exponents, you have to align the mantissa by shifting one of them, then you have to do addition or subtraction, then we have to carry out and normalization of the result, and finally, if required, rounding. (Refer Slide Time: 09:50) Next let us come to multiplication; floating point multiplication and division are relatively easier in concept, because it involves less number of steps. Suppose I am trying to multiply two numbers like this M1 x 2 to the power E1, M2 x 2 to the power E2. 646 (Refer Slide Time: 10:17) Suppose I have a number 1.2 x 10 to the power 5, and there is another number 1.2 x 10 to the power 6. When you multiply the two numbers, there is no question of doing any adjustment of mantissa; you simply multiply the two mantissa. So, 1.2 x 1.2 will become 1.44. And you simply add the two exponents; it will be 11, and we get the result. But in floating point number one thing you have to remember that is E1 and E2 are both biased; that means, for single precision number already we have done +127. So, for multiplication if you are adding these two numbers whatever you get, do not forget to subtract a 127 because you have counted 127 twice. So, you have to subtract it once. What I have mentioned is written here. You add the two exponents and subtract the bias that is 127; multiply M1 and M2 and also determine the sign of the result, and only at the end you need to normalize result if it is required. So, it is much simpler then addition. 647 (Refer Slide Time: 12:00) So, let us take an example. Suppose we want to multiply F1 = 270.75 and F2 = -2.375; both are normalized. The steps of multiplication are shown. (Refer Slide Time: 13:08) The multiplier hardware will be conceptually simpler. You have the first number here, and the second number here. There is an adder which will be adding E1 and E2; you will have to add the two exponents and these are 8 bit numbers, temporarily you generate an 9-bit some because you will have to subtract the bias 127. 648 After subtracting the bias you get the final 8-bit result which is the final exponent. On the other side you have a 24 x 24 multiplier, because is mantissas are all 23 bit numbers, but there is an implied 1 bit which is hidden. So, the multiplier will also have to be fed with that extra 1 bit, and then the two numbers multiplied. So, we get 48 bit product; you look at the product whether its starts with 1 or not; if not you will have to do a normalization, shift it and also adjust the exponent accordingly. After that the final exponent will go into the result. The final mantissa 23 bits will go to here, and the sign will be the exclusive or of s1 and s2. So, if one of the numbers is positive and other is negative then only the result will be negative; otherwise the result will be positive. So, this is the multiplication hardware; floating point division is quite similar. (Refer Slide Time: 15:02) For division let us say we have two numbers M1 x 2 to the power E1, and M 2 x 2 to the power E2. For division, instead of adding the exponents we are subtracting the exponents, but you see in both the exponents we have added a bias. So, when you subtract them, the bias also gets canceled out. So, you will have to additionally add that bias after the subtraction, then you divide M1 and M2, and finally, normalize. 649 (Refer Slide Time: 15:40) So, let us take an example; suppose I want to divide this number by this number. The steps of division are all shown. (Refer Slide Time: 16:41) The hardware will be very similar to multiplication. The only difference is that here instead of addition you do a subtraction; you subtract the two exponents, you get a 9 bit result, and you add the bias to get the final value of the exponent. 650 Similarly, you have a 24-bit divider. These 23 bit mantissas are coming and the implied 1 bit are here. The divider will be generating the bits of the result. Again you may have to normalize as the previous example showed, and after normalization you take the corrected value of exponent. The first 23 bits of the result will go to mantissa, and just like multiplication for division also you take the XOR of the s1 and s2 that will give you this sign of the result. (Refer Slide Time: 17:42) Now, for MIPS32 instructions set architecture there are also floating point arithmetic, which we have not talked about earlier. Very briefly let us look at what are the features that are available in MIPS. The first thing is that in the MIPS architecture we have some floating point registers called F0 up to F31; just like for integer registers we had R0 to R31. 651 (Refer Slide Time: 18:25) These registers are all capable of storing a single precision number because they are 32- bit numbers. Now additionally you can also operate on double precision numbers which are 64 bits in size, but how you do it? You will be actually using register pairs (F0, F1), (F2, F3), etc. (Refer Slide Time: 18:49) If you are using F0 and F1, together this will be a 64-bit register, if you look at F2 and F3 together this will also be a 64-bit register. But when you use it in a program you just refer to as F0 double precision, F2 double precision, etc. 652 So, when you are into double precision F0 means (F0, F1), F2 means (F2, F3). It is implied that even and odd register pairs will be taken. In addition to this there are some special purpose registers also, but here we are not going into details of them because these are not really required. (Refer Slide Time: 19:43) (Refer Slide Time: 19:54) And the instructions that relate to the floating point registers are as follows. There are Load and Store instructions; we can load a single precision floating point number from 653 memory or we can also do a load double. If you have a load double let us say to F0, it will be loaded into F0 and F1. Similarly, there is a store word, store double word data movement instruction. You can move data from an integer register to a floating point register or vice versa, or into some control registers. Also the arithmetic instructions add, subtract, multiply, divide. (Refer Slide Time: 20:34) There are some other instructions also. You can calculate the absolute value of a number, you can compare two numbers, you can negate; that means, given x you find out -x, square root of a number. Multiply add, multiply subtract means you multiply two numbers and add a third number a x b + c. This multiply accumulate kind of instructions or multiply subtract are quite useful in various DSP applications, where this kind of computations appear quite frequently. 654 (Refer Slide Time: 21:27) There are some instructions to round the numbers. Here you can specify what kind of rounding you are doing: truncation, ceiling, floor, or just round. I mentioned these 4 types are there, and also you can sometimes convert from single to double precision or vice versa; those instructions are also there. (Refer Slide Time: 21:51) I just take a simple example of adding a scalar to a vector. Like in C let us say I have a loop like this. The MIPS32 assembly language code is also shown. 655 This is a very simple example that shows how we can use the floating point registers and instructions. It is quite simple, just an extension of the integer instructions. So, with this we come to the end of this lecture. If you recall in this lecture we had looked at how we can carry out floating point operations, in particular the addition, subtraction, multiplication and division, and also we had discussed the corresponding hardware requirements. Thank you. 656 Computer Architecture and Organization Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 40 Basic Pipelining Concepts In this lecture will shall be talking about some Basic Pipelining Concepts. Now you may wonder why we are discussing pipeline in the context of floating point arithmetic and floating point; I mean hardware implementation of arithmetic functions. I would like to tell you here is that, pipelining is a very commonly used and widely used technique to enhance the performance of a system without significantly investing. In hardware if I want to make something faster I can always replicate some functional units. So, instead of one adder I can use 5 adders; my addition time will speed up 5 times. But here the philosophy is different; we are not replicating hardware, we are using a different kind of a philosophy by which we can promote something called overlapped execution whereby the performance improvement can be quite close to what we can achieve by actually replicating the hardware. We are getting good return of investment without making any significant enhancement in the hardware. This is the basic concept of pipelining. The reason we are discussing pipelining here is that, we shall see how we can use pipelining to improve the performance of arithmetic circuits that you have already seen. Later on we shall also see how we can enhance the performance of the processor, that is called instruction level pipelining; how instructions in the MIPS32 data path can be executed faster by implementing a pipeline there. But before that you will have to understand what is a pipeline, how it benefits the designer, and what kind of speed up you are expected to get. So, here in this lecture we shall be talking about the basic pipelining concepts. 657 (Refer Slide Time: 02:57) As I had said pipelining is nothing but a mechanism for overlapped execution of several different computations. Several different computation means they are possibly carrying out some computation on several input data sets. The basic principle is that we try to divide our overall computation into a set of k sub-computations, called stages. As I had said if we do this there is a very nominal increase in the cost of implementation. Again I shall be illustrating it with a real life example. But we can achieve very significant speedup that can be very close to k. This is really fantastic; we are not really investing in additional hardware, just we are dividing the hardware into smaller functional sub-blocks, and by doing that we get an architecture where we can equivalently get a speedup of up to the number of sub-functional units that you have divided our overall computation into. This kind of pipelining can be used in several different areas of computer design. For instruction execution we shall see later that several instructions can be executed in some sequence, because you have already seen earlier that when instruction is executed in a processor basically there is a fetch and execute cycle. During fetch we fetch an instruction, then we decode it; depending on the type of instruction we can do the appropriate actions or operations to complete the execution. Now the idea is that when an instruction is being decoded and it is being executed, why not fetch the next instruction 658 and try to decode it at the same time; this is what I mean by overlapped execution. This is what is there for instruction pipelining. Similarly, for arithmetic computation we shall be seeing how we can speed up arithmetic computations by implementing pipeline. Similarly for memory access also we shall see later again that we can use the concepts of pipelining to speedup accesses from consecutive locations in memory. (Refer Slide Time: 06:16) Let us take a real life example say of washing. Suppose we have a requirement where we need to wash clothes, dry them and iron them. Suppose I have a machine, which can do washing, which can do drying, which can do ironing. So, there are 3 stages of the machine. I feed my clothes to the machine. The clothes will be moving through the 3 stages and then I can feed the next cloth or next set of clothes whatever. Let us assume that this whole thing takes a time T. So assuming that I can feed one cloth at a time, if there are N clothes I need to do washing, drying and ironing, the total time will be N x T. Now as an alternative what we are saying that we are not buying 3 such machines, but rather we are breaking the machine into 3 smaller parts. There will be one machine which can do only washing, there will be one machine which can do only drying, and one machine which can do only ironing. So, roughly the total cost remains approximately equal to the total cost of the original machine. Assuming that they take 659 equal time, let us assume that earlier it was taking a time of T, now each of these will be taking a time of T/3. We shall see very shortly that if I do this then for washing, drying and ironing N clothes, I need a time (2 + N). T/3. So, if N is large you can ignore 2; it is approximately N.T/3, which means I have got a speed up of approximately 3 (equal to the number of pieces I have broken my original machine into). This is the essential idea behind pipelining; I can get a significant speedup. (Refer Slide Time: 09:01) Let us see, washing - drying - ironing; after washing is completed I will be giving it to the dryer, after drying is completed I shall giving it to ironing. To start with, during the first time slot let us say the time slots are all T/3, T/3 each, the first cloth reaches the washer. It finishes washing at the end of this time slot. Then this cloth-1will be moving to D. What will happen in the next time slot? This cloth-1 will be moving to D, but now the washer will be idle. So, I can feed this second cloth to the washer. Now, the washer and dryer are working on two different clothes in an overlapped way. So, at the end of this time slot, dryer has finished with cloth-1 and washer has finished with cloth-2. So, what will happen next? Cloth-1 will come here to the ironer, cloth-2 will come here, and cloth-3 will come to washer. 660 Now you see you see all these 3 machines are busy. Now at the end of this time what will happen? Cloth-1 is done and will be taken out; at the next time cloth-2 will come here, cloth-3 here, and next cloth will come here. You see after this initial two periods of time that is required for this pipeline to be filled up, after that in every time slot I am getting one output; that means, one cloth is finishing and I can take them out. That is why I talked about the total time as (2 + N). T/3. (Refer Slide Time: 11:36) We can extend the concept discussed so far to actual processor designs. Suppose for some computation we want to achieve a speedup of k. The two alternatives are with us. We can use k copies of the hardware, this will obviously give us a speedup of k, but the cost will also go up k times because we are replicating the hardware. Alternative 2 is pipelining. We are splitting the computation into k number of stages. This will involve very nominal increase in cost, but potentially it can give you a speedup of k. But one thing is required which we ignored; you see if you think of the washing example – washing, drying and ironing, see after one machine finishes and a cloth comes out and goes to the input of the other machine, I need some kind of a tray or a buffer in between. The cloth will be temporarily stored in that buffer before it can be fed to the next machine. Because it may so happen then when the first machine finishes with washing, the dryer is still working on the previous one, and it is still not free. 661 So, only when it is free then only the cloth can be given to the dryer. Thus, I need some kind of a buffering mechanism in between the stages. This is what we talk about here – the need for buffering. For the washing example is it said we need a tray between the machines to keep the cloth temporarily before the next machine accepts it. In exactly the same way when we want to implement pipeline in hardware, we need something similar to a buffer. It is nothing but a latch or a register between successive stages, because one stage finish some calculation and prior to giving it to the next stage, maybe the next stage is still not finished. So, that value is temporarily kept in the latch, so that the next stage whenever it is free can take it from the latch. This is the idea behind hardware pipelining. (Refer Slide Time: 14:41) I am showing a schematic diagram of a k stage hardware pipeline. This is called a synchronous pipeline because there is a clock which is synchronizing the operation of all these stages. So, as you can see there are k number of stages. Each stage involves some computation, S1, S2 to Sk, and also there is a latch. The latch will temporarily store the result of the previous stage before the next stage is ready to accept it. The clock will be generating active signals periodically and clock period will be large enough, so that all these stages can finish their computation and also it should take care of the delay of the latch. This we shall come a little later how the clock frequency or the clock period can be calculated. The stages S1, S2, … are typically combinational circuits. So, whenever the 662 next active edge of the clock comes, the next data is fed to S1, the output of S1 goes to S2, S2 goes to S3, and so on. There is a shift that goes on whenever the clock signal appears. In synchronism with the clock the pipeline shifts result from one stage to the other in a lock step fashion. (Refer Slide Time: 16:34) You can classify or categorize pipelined processors based on various parameters like degree of overlap, depth of the pipelining, structure of the pipeline, and how we schedule the operations. (Refer Slide Time: 16:57) 663 Let us very briefly look at these. When you talk of the degrees of overlap, one extreme can be serial that is a strictly non-pipeline implementation. The next operation can start only after the previous operation finishes. So, here I shown it schematically like this, this is an operation let us say which takes 1 2 3 4 5 steps. So, only after the first operation is completed only then the second operation can start. It is strictly sequential, with no overlap. There can be partial overlap as the second diagram shows. Partial overlap naturally results in a speedup because earlier it was taking so much time, but now it is taking 2 time units less. In the extreme case you can have something called pipeline as I said. There is almost complete overlap; when the first operation he is finished with the first step it moves to the second step, and the second operation can start with its first step. This is called fine-grained overlapped execution. Here naturally the time required is much less. So, depending on the degree of overlap you can classify how deep or how efficient your pipeline implementation is. Ideally you should have something like this. (Refer Slide Time: 19:12) Then comes the depth of the pipeline. We have seen earlier that if there are k number of stags, then we can have a potential speedup of up to k. Then you can argue why do not have a very large value of k so that we can get a very large speedup. But there are some limitations to increase the value of k; we cannot break up a computation into smaller 664 pieces beyond a limit. Beyond a limit it does not make sense, maybe the delay of the latch will be more will become more than the delay of the computation. So, the depth of the pipeline is an issue. You can either have a shallow pipeline with a few number of stages, or you can have a deep pipeline with large number of stages. If you have 9 stages then potentially you can have a speedup of 9. Depth of pipeline is a design issue. But it is also very important to evaluate that you can increase the depth all right, but you must also see in what way you can allow the computations to proceed so that you can have overlapped execution without any conflict. We shall be seeing later that conflicts in a pipeline are very important and there are so many techniques to handle them. For a shallow pipeline because we are making the number of stages smaller, individual stages are more complex. But if I have a very large number of stages, each stages will become simpler. (Refer Slide Time: 21:18) Next comes structure of the pipeline. We can have a linear pipeline that is most conventional. Linear pipeline means I have divided the computation

Document Details

Tags

Related

Full Transcript