For most applications in science and engineering integer numbers are not sufficient; we need to work with real numbers. Real numbers like $\pi$ have an infinite number of decimal digits; there is no hope of storing them exactly. On a computer, the floating point (FP) convention is used to represent (approximations of) the real numbers. The design of computer systems requires in-depth knowledge of FP: modern processors have special FP instructions, compilers must generate such FP instructions, and the operating system must handle the exception conditions generated by these FP instructions.
We will now illustrate the floating point representation in base $\beta = 10$. Any decimal number $x$ can be uniquely written as
$$x = \sigma \cdot m \cdot 10^e, \qquad
\begin{array}{ll}
\sigma = \pm 1 & \text{(sign)}, \\[2pt]
\frac{1}{10} \le m < 1 & \text{(mantissa)}, \\[2pt]
e \in \mathbb{Z} & \text{(exponent)}.
\end{array}$$
For example,
$$107.625 = +1 \cdot 0.107625 \cdot 10^{3}.$$
If we did not impose the condition $\frac{1}{10} \le m < 1$ we could have represented
the number in various different ways, for example
$$107.625 = +1 \cdot 0.0107625 \cdot 10^{4} = +1 \cdot 1.07625 \cdot 10^{2}.$$
Suppose our storage space is limited to 6 decimal digits per FP number. We allocate 1 decimal digit for the sign, 3 decimal digits for the mantissa and 2 decimal digits for the exponent. If the mantissa is longer we will chop it to the most significant 3 digits (another possibility is rounding, which we will talk about shortly).
A floating point number is represented as $\widetilde{x} = \sigma \cdot m \cdot \beta^e$ with a limited number of digits for the mantissa and the exponent. The parameters of the FP system are $\beta$ (the base), $t$ (the number of digits in the mantissa) and the number of digits allotted to the exponent.
Most real numbers cannot be exactly represented as floating point numbers. For example, numbers with an infinite decimal representation, like $\pi = 3.141592\ldots$, will need to be ``approximated'' by a finite-length FP number. In our FP system, $\pi$ will be represented (by chopping) as
$$\widetilde{\pi} = +1 \cdot 0.314 \cdot 10^{1}.$$
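The chopping representation is easy to simulate. The following Python sketch is our own illustration (the function name \texttt{fl} and the use of the standard \texttt{decimal} module are not part of the FP system itself); it chops a number to a 3-digit mantissa:

\begin{verbatim}
from decimal import Decimal

def fl(x, t=3):
    """Chop x to a t-digit mantissa 0.d1...dt * 10**e."""
    if x == 0:
        return Decimal(0)
    d = Decimal(str(abs(x)))
    e = d.adjusted() + 1             # exponent such that 0.1 <= m < 1
    digits = int(d.scaleb(t - e))    # d1...dt as an integer; extra digits chopped
    m = Decimal(digits).scaleb(-t)   # mantissa 0.d1...dt
    sign = -1 if x < 0 else 1
    return sign * m.scaleb(e)

print(fl(3.141592653589793))         # 3.14  =  +1 * 0.314 * 10**1
print(fl(107.625))                   # 107   =  +1 * 0.107 * 10**3
\end{verbatim}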
In general, the FP representation $\widetilde{x} = fl(x)$ is just an approximation
of the real number $x$. The relative error is the difference between
the two numbers, divided by the real number:
$$\delta = \frac{\widetilde{x} - x}{x}.$$
Another measure of the approximation error is the number of
units in the last place, or ulps. For $\widetilde{x} = \pm 0.d_1 d_2 d_3 \cdot 10^e$, one unit in the last place of the mantissa is $0.001 \cdot 10^e = 10^{e-3}$, and the error in ulps
is computed as
$$err = \frac{|\widetilde{x} - x|}{10^{e-3}}
      = \left| 0.d_1 d_2 d_3 - \frac{x}{10^e} \right| \cdot 10^{3}.$$
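A quick numerical check of both error measures for the $\pi$ example (plain Python; the variable names are our own):

\begin{verbatim}
import math

x  = math.pi                  # the real number (to double precision)
xt = 0.314 * 10.0**1          # its chopped 3-digit representation
e, t = 1, 3

rel_err  = (xt - x) / x                # relative error, ~ -5.07e-04
err_ulps = abs(xt - x) / 10.0**(e - t) # error in ulps,  ~  0.159

print(rel_err, err_ulps)
\end{verbatim}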
The difference between relative errors corresponding to 0.5 ulps
is called the wobble factor. If $\epsilon = 0.5$ ulps
and $\widetilde{x} = 0.d_1 d_2 d_3 \cdot 10^e$,
then $|\widetilde{x} - x| = 0.5 \cdot 10^{e-3}$, and
since $0.1 \cdot 10^e \le |x| < 1 \cdot 10^e$
we have that
$$\frac{1}{2} \cdot 10^{-3} < \frac{|\widetilde{x} - x|}{|x|} \le \frac{1}{2} \cdot 10^{-2},$$
i.e. the relative errors corresponding to an error of 0.5 ulps wobble by a factor of $\beta = 10$.
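The two bounds can be checked directly (our own two-line illustration, taking $e = 0$ so that half an ulp is $0.5 \cdot 10^{-3}$):

\begin{verbatim}
half_ulp = 0.5e-3              # 0.5 * 10**(e - t) with e = 0, t = 3
for m in (0.100, 0.999):       # smallest / largest normalized mantissa
    print(m, half_ulp / m)     # 0.005 and ~0.0005: a factor-of-10 wobble
\end{verbatim}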
If the error is $n$ ulps, the last $\log_{10} n$ digits in the number are contaminated by error. Similarly, if the relative error is $\delta$, the last $\log_{10}\left(\delta \cdot 10^{3}\right)$ digits are in error. For example, an error of 100 ulps contaminates the last two digits of the mantissa.
With normalized mantissas, the three digits always read $0.d_1 d_2 d_3$ (with $d_1 \ne 0$), i.e. the decimal point has a fixed position inside the mantissa. For the original number, however, the decimal point can be floated to any position in the bit-string we like by changing the exponent. This is the origin of the term floating point.
With 3 decimal digits, our mantissas range between $0.100$ and $0.999$. For exponents, two digits will provide the range $-49 \le e \le 50$ (the stored digits $00$ through $99$, shifted by an exponent bias of 49, as discussed below).
Consider the number $x = 0.123 \cdot 10^{-60}$. When we represent it
in our floating point system,
we lose all the significant information: the required exponent lies below the admissible range, and the best available representation is
$$\widetilde{x} = 0.$$
What is the maximum number allowed by our toy floating point system?
If $m = 0.999$ and $e = 50$, we obtain
$$\widetilde{x}_{\max} = 0.999 \cdot 10^{50}.$$
If $m = 0.000$ and $e$ is arbitrary, we obtain a representation of ZERO. Depending on $\sigma$, it can be $+0$ or $-0$. Both numbers are valid, and we will consider them equal.
What is the minimum positive number that can be represented in our toy floating point system? The smallest mantissa value that satisfies the normalization requirement is $m = 0.100$; together with $e = -49$ this gives the number $0.100 \cdot 10^{-49} = 10^{-50}$. If we drop the normalization requirement, we can represent smaller numbers also. For example, $m = 0.010$ and $e = -49$ give $10^{-51}$, while $m = 0.001$ and $e = -49$ give $10^{-52}$.
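These extreme values are plain arithmetic and can be tabulated directly (a small illustration in Python):

\begin{verbatim}
x_max         = 0.999 * 10.0**50    # largest number:               9.99e+49
x_min_normal  = 0.100 * 10.0**-49   # smallest normalized positive: 1e-50
x_min_subnorm = 0.001 * 10.0**-49   # smallest subnormal positive:  1e-52

print(x_max, x_min_normal, x_min_subnorm)
\end{verbatim}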
The FP numbers with the smallest exponent ($e = -49$, i.e. a stored exponent field of ZERO) and the first digit in the mantissa also equal to ZERO are called subnormal numbers.
Allowing subnormal numbers improves the resolution of the FP system near 0. Non-normalized mantissas will be permitted only when $e = -49$ (the smallest exponent), to represent ZERO or subnormal numbers, or when $e = 50$ (the largest exponent), to represent special numbers.
Example (D. Goldberg, p. 185, adapted): Suppose we work with our toy FP system
and do not allow for
subnormal numbers. Consider the fragment of code
\begin{verbatim}
if (a != b) then x = 1/(a-b)
\end{verbatim}
designed to ``guard'' against division by 0.
Let (for instance) $a = 0.102 \cdot 10^{-49}$
and $b = 0.101 \cdot 10^{-49}$.
Clearly $a \ne b$ but,
since we do not use subnormal numbers,
$$fl(a - b) = fl(0.001 \cdot 10^{-49}) = 0,$$
because the exact difference $10^{-52}$ is smaller than the smallest normalized number $10^{-50}$ and is flushed to zero.
In spite of all the trouble we are dividing by 0!
If we allow subnormal numbers, $fl(a - b) = 0.001 \cdot 10^{-49} = 10^{-52}$
and the code behaves correctly.
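IEEE double precision numbers do support subnormals, so the guard works there; it is flush-to-zero modes (offered by some processors and fast-math compiler options) that reintroduce the bug. A small Python check of the well-behaved case:

\begin{verbatim}
import sys

tiny = sys.float_info.min       # smallest normalized double, 2**-1022
a = 1.5 * tiny
b = 1.0 * tiny

if a != b:                      # the guard passes...
    x = 1.0 / (a - b)           # ...and the division succeeds, because
    print(a - b, x)             # a - b = 2**-1023 is subnormal, not zero
\end{verbatim}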
Note that for the exponent bias we have chosen 49 and not 50, i.e. the stored digits $00 \ldots 99$ represent the exponents $-49 \le e \le 50$.
The reason for this is self-consistency: the inverse of the smallest normal number,
$$\frac{1}{0.100 \cdot 10^{-49}} = 10^{50},$$
does not overflow (it sits at the very top of the representable range), whereas with a bias of 50 the inverse of the smallest normal number would be $10^{51}$, far beyond the largest representable number.
Similar to the decimal case, any binary number $x$ can be represented as
$$x = \sigma \cdot m \cdot 2^e, \qquad
\begin{array}{ll}
\sigma = \pm 1 & \text{(sign)}, \\[2pt]
\frac{1}{2} \le m < 1 & \text{(mantissa)}, \\[2pt]
e \in \mathbb{Z} & \text{(exponent)}.
\end{array}$$
For example,
$$5.75_{10} = 101.11_2 = +1 \cdot 0.10111_2 \cdot 2^{3}.$$
When we use normalized mantissas, the first digit is always nonzero.
With binary floating point representation, a nonzero digit is (of course) 1,
hence the first digit in the normalized binary mantissa is always 1.
For our binary example ($0.\mathbf{1}0111_2 \cdot 2^3$), suppose only 4 bits of storage are available for the mantissa. The leftmost bit (equal to 1, of course, shown in bold) is redundant. If we do not store it any longer, we obtain the hidden bit representation: the stored bits $0111$ now stand for the mantissa $0.\mathbf{1}0111$.
We can now pack more information in the same space: the rightmost bit of the stored mantissa now holds the fifth bit of the number ($b_5 = \mathbf{1}$). This bit was simply omitted in the standard form ($0.1011$). Question: why do we prefer the hidden bit representation?
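This is precisely the convention used by IEEE 754 hardware (which normalizes the mantissa as $1.f$ rather than $0.1f$, but hides the leading 1 in the same way). A short Python check, our own illustration, unpacks the bits of the double $5.75 = 1.0111_2 \cdot 2^2$:

\begin{verbatim}
import struct

bits = struct.unpack(">Q", struct.pack(">d", 5.75))[0]

sign     = bits >> 63                     # 1 sign bit
exponent = ((bits >> 52) & 0x7FF) - 1023  # 11 exponent bits, bias removed
fraction = bits & ((1 << 52) - 1)         # 52 stored mantissa bits

print(sign, exponent, format(fraction, "052b")[:8])
# prints: 0 2 01110000  -- the fraction field starts 0111...;
# the leading 1 of the mantissa 1.0111 is not stored
\end{verbatim}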