Floating-Point Numbers II Lecture Notes PDF
Uploaded by FervidDune
ETH Zurich
Summary
This handout covers floating-point number systems, including IEEE standards, limits, guidelines, and harmonic numbers. It provides examples and explanations relevant to computer science, particularly in numerical analysis and computer arithmetic.
Full Transcript
8. Floating-point Numbers II

Floating-point Number Systems, IEEE Standard, Limits of Floating-point Arithmetic, Floating-point Guidelines, Harmonic Numbers

Floating-point Number Systems

A floating-point number system is defined by four integers:
$\beta \ge 2$, the base,
$p \ge 1$, the precision (number of places),
$e_{\min}$, the smallest possible exponent,
$e_{\max}$, the largest possible exponent.

Notation: $F(\beta, p, e_{\min}, e_{\max})$

$F(\beta, p, e_{\min}, e_{\max})$ contains the numbers

$$\pm \sum_{i=0}^{p-1} d_i \beta^{-i} \cdot \beta^{e}, \qquad d_i \in \{0, \dots, \beta - 1\}, \quad e \in \{e_{\min}, \dots, e_{\max}\},$$

represented in base $\beta$ as

$$\pm\, d_0.d_1 \dots d_{p-1} \times \beta^{e}.$$

Representations of the decimal number 0.1 (with $\beta = 10$): $1.0 \cdot 10^{-1}$, $0.1 \cdot 10^{0}$, $0.01 \cdot 10^{1}$, ...
The representations differ only in the choice of exponent.

Normalized Representation

Normalized number: $\pm\, d_0.d_1 \dots d_{p-1} \times \beta^{e}$ with $d_0 \ne 0$

Remark 1: The normalized representation is unique and therefore preferred.
Remark 2: The number 0, as well as all numbers smaller than $\beta^{e_{\min}}$, have no normalized representation (we will come back to this later).

Set of normalized numbers: $F^*(\beta, p, e_{\min}, e_{\max})$

Example: $F^*(2, 3, -2, 2)$ (only positive numbers)

d0.d1d2 (binary)   e = -2    e = -1   e = 0   e = 1   e = 2
1.00               0.25      0.5      1       2       4
1.01               0.3125    0.625    1.25    2.5     5
1.10               0.375     0.75     1.5     3       6
1.11               0.4375    0.875    1.75    3.5     7

[Figure: the numbers above on the number line from 0 to 8. Smallest: $1.00 \cdot 2^{-2} = \frac{1}{4}$, largest: $1.11 \cdot 2^{2} = 7$.]

Calculations with Floating-point Numbers

Example ($\beta = 2$, $p = 4$): $1.111 \cdot 2^{-2} + 1.011 \cdot 2^{-1} = 1.001 \cdot 2^{0}$

1. Adjust exponents by denormalizing one number: $1.111 \cdot 2^{-2} = 0.1111 \cdot 2^{-1}$.
2. Binary addition of the significands: $0.1111 + 1.011 = 10.0101$, so the sum is $10.0101 \cdot 2^{-1}$.
3. Renormalize: $1.00101 \cdot 2^{0}$.
4. Round to $p$ significant places, if necessary: $1.001 \cdot 2^{0}$.

The IEEE Standard 754

The IEEE Standard 754 defines floating-point number systems and their rounding behavior and is used nearly everywhere.

Single precision (float) numbers: $F^*(2, 24, -126, 127)$ (32 bit) plus 0, ∞, ...
Double precision (double) numbers: $F^*(2, 53, -1022, 1023)$ (64 bit) plus 0, ∞, ...
All arithmetic operations round the exact result to the nearest representable number.

Why $F^*(2, 24, -126, 127)$?

1 sign bit
23 bits for the significand (the leading bit is 1 and is not stored)
8 bits for the exponent (256 possible values: 254 possible exponents, 2 values reserved for the special values 0, ∞, ...)
⇒ 32 bits in total.

Why $F^*(2, 53, -1022, 1023)$?

1 sign bit
52 bits for the significand (the leading bit is 1 and is not stored)
11 bits for the exponent (2046 possible exponents, 2 values reserved for the special values 0, ∞, ...)
⇒ 64 bits in total.

Example: 32-bit Representation of a Floating-point Number

[Figure: layout of bits 31 down to 0.]
Bit 31: sign (±)
Bits 30 to 23: exponent ($2^{-126}, \dots, 2^{127}$, plus the special values 0, ∞, ...)
Bits 22 to 0: mantissa (significand), from 1.00000000000000000000000 to 1.11111111111111111111111

Binary and Decimal Systems

Internally the computer computes with $\beta = 2$ (binary system).
Literals and inputs have $\beta = 10$ (decimal system).
Inputs have to be converted!

Conversion ($0 < x < 2$)

Computation of the binary representation:

$$x = \sum_{i=0}^{\infty} b_i 2^{-i} = b_0.b_1 b_2 b_3 \dots = b_0 + 0.b_1 b_2 b_3 \dots$$

$$\Longrightarrow \quad (x - b_0) = 0.b_1 b_2 b_3 b_4 \dots, \qquad 2 \cdot (x - b_0) = b_1.b_2 b_3 b_4 \dots$$

int current_bit;
while (x != 0) {
    if (x >= 1) current_bit = 1;
    else current_bit = 0;
    std::cout