Week 3 - Floating Point

Full Transcript


COM1006 Devices and Networks 3: Floating Point

This week:
● Fixed point vs floating point arithmetic
● Compare range, precision, and accuracy
● IEEE 754 floating point format
● Briefly review floating point arithmetic
● Discuss the implications of overflow and underflow in binary floating point

Handling Fractions
Consider these two calculations:

      7632135         7632.135
    + 1794821       + 1794.821
    ---------       ----------
      9426956         9426.956

They are identical except we have written a dot in the right-hand one…

Fixed Point Arithmetic
Simply writing the "dot" in a constant place is a perfectly valid way of representing fractional numbers. We could, for example, just agree that our 8-bit numbers now use the 4 most significant bits for the integer part, and the 4 least significant bits for the fractional part.

Fixed Point Arithmetic
As a slight aside: what would be the values of the columns to the right of the point in our binary place-value system?

    Decimal:  10^3  10^2  10^1  10^0  .  10^-1  10^-2  10^-3
                 9     4     2     6  .      9      5      6

    Binary:    2^3   2^2   2^1   2^0  .  2^-1   2^-2   2^-3   2^-4
                                         (1/2)  (1/4)  (1/8)  (1/16)
                 1     0     1     0  .     1      1      0      1    = 10 13/16 = 10.8125

Fixed Point Arithmetic
So, some fixed point arithmetic: 3.625 + 6.5

      0011.1010    (3.625)
    + 0110.1000    (6.5)
    -----------
      1010.0010    = 10 1/8 = 10.125

Fixed Point Arithmetic
Fixed point arithmetic works, and is used in some settings:
● Financial applications, where we will always be working to the nearest £0.01 (i.e. 1p)
● Some graphics applications where speed is more important than flexibility (e.g. some models of the Sony PlayStation used fixed point)
● Some embedded systems where no floating point hardware is available.

Fixed Point Arithmetic
Fixed point arithmetic has some limitations though:
● You are seriously limiting your absolute range of values: we can now only count up to 15.9375 with 8 bits.
● You are limiting your precision: we can do 15.0625 or 15.125, but not anything in between.
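To make the 4.4 format above concrete, here is a minimal Java sketch. It is not from the lecture, and the names FixedPoint44, encode and decode are just illustrative. Each value is stored as an 8-bit integer scaled by 2^4, so ordinary integer addition does the arithmetic:

```java
// A minimal sketch (not from the slides) of the 4.4 fixed point format:
// 4 integer bits, 4 fraction bits, stored in 8 bits.
public class FixedPoint44 {
    // Scale by 2^4 so the low 4 bits hold the fraction.
    static int encode(double value) {
        return (int) Math.round(value * 16) & 0xFF;
    }

    static double decode(int bits) {
        return bits / 16.0;
    }

    public static void main(String[] args) {
        int a = encode(3.625);     // 0011.1010
        int b = encode(6.5);       // 0110.1000
        int sum = (a + b) & 0xFF;  // plain integer addition, masked to 8 bits

        System.out.println(Integer.toBinaryString(sum)); // 10100010
        System.out.println(decode(sum));                 // 10.125
    }
}
```

The final mask to 8 bits is exactly where the range limit above bites: encoding anything past 15.9375 silently wraps around.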
Fixed Point Arithmetic
Our general purpose computers may have to handle huge numbers:
Mass of the sun: 1.98892 × 10^30 kg
Or really tiny numbers:
Mass of an electron: 9.10938188 × 10^-31 kg

Floating Point Numbers
Notice how scientific notation represents the numbers with several separate components:
Mass of the sun: 1.98892 × 10^30 kg -- a mantissa (1.98892), a base (10), and an exponent (30).
This is much easier than writing the place-value form:
1988920000000000000000000000000.00

Floating Point Numbers
We can also use the same general format to represent the really tiny number:
Mass of an electron: 9.10938 × 10^-31 kg -- again a mantissa, a base, and an exponent.

Binary form of Floating Point
Do we need to store the base for a binary number? (No -- it is always 2, so it can stay implicit.)

Binary form of Floating Point
We must decide:
● the total number of bits used by the number
● the number of bits allocated to the mantissa and exponent
● the representation of the mantissa (two's complement etc.)
● the representation of the exponent (biased etc. -- see later)

Range and Precision
Our decisions will affect two things:
● Range -- more bits allocated to the exponent give us more range (do we need numbers up to 2^255?)
● Precision -- more bits allocated to the mantissa give us more precision (3.142 vs 3.141592)

Range, Precision, and Accuracy
3.241592 is more precise than 3.141, since it has more digits. However, as an approximation of π it is less accurate (notice that the second digit is wrong?).
Telling me that "It's 2,456.927837 degrees C in here!" is very precise. It's (hopefully!) very inaccurate.

IEEE Floating Point
Single precision IEEE floating point is 32 bits (float in Java, as opposed to double, which is 64 bits).

Normalisation
Why don't we bother storing the "1." part of the number?
What's the difference between these numbers?

    0.0126    × 10^5
    12.6000   × 10^2
    126.0000  × 10^1
    1260.0000 × 10^0

If you could only compare the digits in each column, how complicated a program would you need to answer that?

Normalisation
OK, so let's agree to only ever have one digit to the left of the decimal point:
    1.264 × 10^2
    9.109 × 10^-31

Normalisation
If we do this in binary, we never have a zero to the left of the point:
    1.011 × 2^23
    1.101 × 2^-11
So, what do we always have?… A 1, every time. This moving of the point up/down is called normalisation.

Range of Valid Mantissas
This does, technically, give us a problem: a normalised mantissa always lies between 1.0 and just under 2.0, so there is no way to write zero. But… meh… (we will deal with that later).

Representing the exponent
We need positive and negative exponents. We could use sign and magnitude. Or we could use two's complement. But why not come up with yet another format?… Biased form.

Representing the exponent
An m-bit unsigned binary integer has values from 0 to 2^m - 1. With biased form we shift that range to -2^(m-1)+1 … +2^(m-1). So, for 4 bits (0-15) we start counting from -7 and go up to +8.

    Binary   Unsigned   Biased
    0000        0         -7
    0001        1         -6
    0010        2         -5
    0011        3         -4
    0100        4         -3
    0101        5         -2
    0110        6         -1
    0111        7          0
    1000        8          1
    1001        9          2
    1010       10          3
    1011       11          4
    1100       12          5
    1101       13          6
    1110       14          7
    1111       15          8

Isn't that all a bit unnecessary?
It makes comparisons easier: 4.67 × 10^-4 needs to be smaller than 1.01 × 10^3. With biased form, a larger bit pattern always means a larger exponent, so exponents can be compared like ordinary unsigned integers.

The Full IEEE 32-bit floating point representation
Sign: 1 bit
Exponent: 8 bits (biased, so starting at 2^-127 and going to 2^128)
Mantissa: 23 bits (just the fractional part, skipping the leading "1." -- which gives an extra bit of precision!)

Converting Decimal to IEEE Floating Point
1. Convert the number to binary (integer part and fractional part)
2. Normalise
3. Encode all components:
   a. sign bit S (0 = positive, 1 = negative)
   b. exponent: add the bias B and convert to binary
   c. fractional part of the mantissa (dropping the implicit "1.")
Let's try -2345.125… (Here's one Dirk did earlier…)
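Working -2345.125 through those steps: 2345.125 is 100100101001.001 in binary, which normalises to 1.00100101001001 × 2^11; so S = 1, the exponent is 11 + 127 = 138 = 10001010, and the stored mantissa is 00100101001001 padded with zeros to 23 bits. As a quick check -- this snippet is mine, not from the slides -- Java's Float.floatToIntBits returns the raw IEEE 754 bit pattern, so we can pull the three fields out:

```java
// Check the hand conversion of -2345.125 against Java's own IEEE 754
// encoding. Float.floatToIntBits returns the raw 32-bit pattern.
public class Ieee754Check {
    public static void main(String[] args) {
        int bits = Float.floatToIntBits(-2345.125f);

        int sign     = (bits >>> 31) & 0x1;   // 1 sign bit
        int exponent = (bits >>> 23) & 0xFF;  // 8 exponent bits, biased by 127
        int mantissa = bits & 0x7FFFFF;       // 23 stored mantissa bits

        System.out.println(sign);                              // 1
        System.out.println(Integer.toBinaryString(exponent));  // 10001010 (138 = 11 + 127)
        System.out.println(Integer.toBinaryString(mantissa));  // 100101001001000000000
        // (toBinaryString drops the two leading zeros of 00100101001001000000000)
    }
}
```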
But what if I want to write zero?
The IEEE standard "steals" the highest and lowest values of the exponent (00000000 and 11111111) to represent special values.

But what if I want to write zero?
-127 (i.e. 00000000) is used for zero. This is nice, because it means all-zeros is zero… Also, remember that decreasing exponents are getting closer to zero -- e.g. 1.10101 × 2^-126.

But what if I want to write zero?
+128 (i.e. 11111111) is used for positive and negative infinity, and also for "Not-a-Number" (NaN), which can result from operations like dividing zero by zero. There are some other special codes in the standard, but they aren't common.

Floating Point Arithmetic

      1.110100 × 2^5
    + 1.010001 × 2^3

We can't just add the columns because they aren't properly aligned (in a place-value sense)!

Floating Point Arithmetic
We need a more complicated process:
1. Identify the number with the smaller exponent;
2. Make the smaller exponent equal to the larger exponent by dividing the mantissa of the smaller number by the same factor by which its exponent must be increased;
3. Add (or subtract) the mantissas;
4. If necessary, normalise the result (post-normalisation);
5. Truncate or round the mantissa.

Floating Point Arithmetic
Some implications of floating point arithmetic:
● The normalisation (shifting left and right) can lose precision.
● Not all numbers can be represented in finite precision:
  ○ 0.75 in decimal is 0.11 in binary
  ○ But 0.1 in decimal is 0.000110011001… in binary -- a recurring fraction, just like 1/3 = 0.333333… in decimal
● If X is very much larger than Y, then X + Y might equal X.

Overflow and Underflow
If the exponent goes above 11111110 then we get exponent overflow. This is an error condition. Handily, errors are signalled by setting the exponent to 11111111…

Overflow and Underflow
If the exponent goes below 00000001 then we get exponent underflow. This is handled by gradual underflow (or subnormal numbers), where the exponent field is 00000000 but the mantissa is not zero. This then represents a number without the leading "1.". So we can have, say, 0.000001101010… × 2^-126.
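Both conditions are easy to provoke in Java. A short sketch, not from the slides: multiplying past Float.MAX_VALUE overflows the exponent and gives infinity, while dividing below Float.MIN_NORMAL (1.0 × 2^-126) gives subnormal numbers until even those run out:

```java
// Provoke exponent overflow and gradual underflow with Java floats.
public class OverflowUnderflow {
    public static void main(String[] args) {
        // Overflow: past the largest finite float the result becomes infinity.
        System.out.println(Float.MAX_VALUE * 2);   // Infinity

        // Gradual underflow: below 1.0 x 2^-126 we get subnormal numbers,
        // whose exponent field is all zeros and which have no implicit "1.".
        float subnormal = Float.MIN_NORMAL / 4;    // 0.01 x 2^-126
        System.out.println(subnormal);             // about 2.94e-39
        System.out.println(Integer.toHexString(Float.floatToIntBits(subnormal))); // 200000

        // Eventually the subnormals run out too, and we reach zero.
        System.out.println(Float.MIN_VALUE / 2);   // 0.0
    }
}
```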
Summary
● We can represent fractional values using fixed point or floating point binary representations. The latter is usually preferred. The current standard is IEEE 754:
  ○ the mantissa is represented in normalised sign-and-magnitude form
  ○ the exponent is represented in biased form.
● We learned how to convert between decimal numbers and IEEE 754.

Summary
● Two floating point numbers are added by increasing the smaller exponent to match the larger, adding the mantissas, then normalising and truncating/rounding the result.
● Underflow, overflow and rounding/truncation errors can affect the accuracy of floating point calculations.