Floating-Point Numbers II Lecture Notes PDF

8. Floating-point Numbers II

Floating-point Number Systems, IEEE Standard, Limits of Floating-point Arithmetic, Floating-point Guidelines, Harmonic Numbers

Floating-point Number Systems

A floating-point number system is defined by four integers:
  $\beta \geq 2$, the base,
  $p \geq 1$, the precision (number of places),
  $e_{\min}$, the smallest possible exponent,
  $e_{\max}$, the largest possible exponent.

Notation: $F(\beta, p, e_{\min}, e_{\max})$

$F(\beta, p, e_{\min}, e_{\max})$ contains the numbers

$$\pm \sum_{i=0}^{p-1} d_i \beta^{-i} \cdot \beta^{e}, \qquad d_i \in \{0, \ldots, \beta - 1\}, \quad e \in \{e_{\min}, \ldots, e_{\max}\},$$

represented in base $\beta$ as

$$\pm\, d_0 . d_1 \ldots d_{p-1} \times \beta^{e}$$

Representations of the decimal number 0.1 (with $\beta = 10$):

$$1.0 \cdot 10^{-1}, \quad 0.1 \cdot 10^{0}, \quad 0.01 \cdot 10^{1}, \quad \ldots$$

Different representations arise from the choice of exponent.

Normalized Representation

Normalized number: $\pm\, d_0 . d_1 \ldots d_{p-1} \times \beta^{e}$ with $d_0 \neq 0$

Remark 1: The normalized representation is unique and therefore preferred.

Remark 2: The number 0, as well as all numbers smaller in absolute value than $\beta^{e_{\min}}$, have no normalized representation (we will come back to this later).

Set of Normalized Numbers: $F^*(\beta, p, e_{\min}, e_{\max})$

Normalized Representation

Example $F^*(2, 3, -2, 2)$ (only positive numbers):

  d0.d1d2 (binary)   e = −2    e = −1   e = 0   e = 1   e = 2
  1.00               0.25      0.5      1       2       4
  1.01               0.3125    0.625    1.25    2.5     5
  1.10               0.375     0.75     1.5     3       6
  1.11               0.4375    0.875    1.75    3.5     7

[Figure: number line from 0 to 8 showing these values. The smallest is $1.00 \cdot 2^{-2} = 1/4$, the largest is $1.11 \cdot 2^{2} = 7$.]

Calculations with Floating-point Numbers

Example ($\beta = 2$, $p = 4$):

$$1.111 \cdot 2^{-2} + 1.011 \cdot 2^{-1} = 1.001 \cdot 2^{0}$$

  1. adjust exponents by denormalizing one number ($1.111 \cdot 2^{-2} = 0.1111 \cdot 2^{-1}$)
  2. binary addition of the significands ($0.1111 + 1.011 = 10.0101$)
  3. renormalize ($10.0101 \cdot 2^{-1} = 1.00101 \cdot 2^{0}$)
  4. round to $p$ significant places, if necessary ($1.00101 \cdot 2^{0}$ becomes $1.001 \cdot 2^{0}$)

The IEEE Standard 754

The standard defines floating-point number systems and their rounding behavior and is used nearly everywhere.

  Single precision (float) numbers: $F^*(2, 24, -126, 127)$ (32 bit) plus 0, ∞, ...
  Double precision (double) numbers: $F^*(2, 53, -1022, 1023)$ (64 bit) plus 0, ∞, ...
All arithmetic operations round the exact result to the nearest representable number.

The IEEE Standard 754: Why $F^*(2, 24, -126, 127)$?

  1 sign bit
  23 bits for the significand (the leading bit is 1 and is not stored)
  8 bits for the exponent (256 possible values: 254 possible exponents, 2 special values: 0, ∞, ...)

⇒ 32 bits in total.

The IEEE Standard 754: Why $F^*(2, 53, -1022, 1023)$?

  1 sign bit
  52 bits for the significand (the leading bit is 1 and is not stored)
  11 bits for the exponent (2046 possible exponents, 2 special values: 0, ∞, ...)

⇒ 64 bits in total.

Example: 32-bit Representation of a Floating-point Number

  Bit 31: sign (±)
  Bits 30 to 23: exponent ($2^{-126}, \ldots, 2^{127}$, plus the special values 0, ∞, ...)
  Bits 22 to 0: mantissa (1.00000000000000000000000 to 1.11111111111111111111111)

Binary and Decimal Systems

Internally the computer computes with $\beta = 2$ (binary system).
Literals and inputs have $\beta = 10$ (decimal system).
Inputs have to be converted!

Conversion ($0 < x < 2$)

Computation of the binary representation:

$$x = \sum_{i=0}^{\infty} b_i 2^{-i} = b_0 . b_1 b_2 b_3 \ldots = b_0 + 0 . b_1 b_2 b_3 \ldots$$

$$\Longrightarrow\; (x - b_0) = 0 . b_1 b_2 b_3 b_4 \ldots$$

$$2 \cdot (x - b_0) = b_1 . b_2 b_3 b_4 \ldots$$

    int current_bit;
    while (x != 0) {
      if (x >= 1) current_bit = 1;
      else current_bit = 0;
      std::cout
