To define the precision of the FP system, let us go back to our
toy FP representation (2 decimal digits for the exponent and 3 for
the mantissa).
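We want to add two numbers, e.g.
\[ 1 = 1.00 \times 10^{0} \quad \mbox{and} \quad 0.01 = 1.00 \times 10^{-2} . \]
In order to perform the addition, we
bring the smaller number to the same exponent as the larger number
by shifting its mantissa to the right. For our example,
\[ 1.00 \times 10^{-2} = 0.01 \times 10^{0} . \]
Next, we add the mantissas and normalize the result if necessary.
In our case
\[ 1.00 \times 10^{0} + 0.01 \times 10^{0} = 1.01 \times 10^{0} . \]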
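Suppose now we want to add
\[ 1 = 1.00 \times 10^{0} \quad \mbox{and} \quad 0.001 = 1.00 \times 10^{-3} . \]
To bring them to the same exponent, we need to shift the mantissa right by
3 positions and, due to our limited space (3 digits), we lose
all the significant information: $1.00 \times 10^{-3}$ becomes $0.00 \times 10^{0}$. Thus
\[ 1.00 \times 10^{0} + 0.00 \times 10^{0} = 1.00 \times 10^{0} . \]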
We can see now that this is a limitation of the FP system
due to the storage of only a finite number of digits.
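The alignment and addition steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a toy number is stored as a hypothetical pair (mantissa, exponent), where the 3-digit integer mantissa m1m2m3 stands for m1.m2m3, digits shifted out are truncated rather than rounded, and the operands 1, 0.01 and 0.001 are chosen for illustration.

```python
# Toy decimal FP: a pair (mantissa, exponent) stands for m1.m2m3 x 10^exponent,
# with the mantissa kept as a 3-digit integer (assumed layout, for illustration).
DIGITS = 3

def toy_add(a, b):
    """Add two positive toy numbers; digits shifted out are truncated."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                       # make a the operand with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb //= 10 ** (ea - eb)            # shift the smaller operand right
    m, e = ma + mb, ea                # add the aligned mantissas
    if m >= 10 ** DIGITS:             # normalize if the sum needs a 4th digit
        m //= 10
        e += 1
    return m, e

one = (100, 0)           # 1.00 x 10^0
hundredth = (100, -2)    # 1.00 x 10^-2
thousandth = (100, -3)   # 1.00 x 10^-3

print(toy_add(one, hundredth))    # (101, 0), i.e. 1.01 x 10^0
print(toy_add(one, thousandth))   # (100, 0), i.e. 1.00 x 10^0
```

In the second call the smaller operand is shifted right by 3 positions, its mantissa becomes 0, and the sum is indistinguishable from 1.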
The precision of the floating point system (the
``machine precision'') is the smallest number $\epsilon$ for which
\[ 1 + \epsilon > 1 \]
when the sum is evaluated in floating point arithmetic.
For our toy FP system, it is clear from the previous discussion that
$\epsilon = 0.01$.
If the relative error in a computation is $p\epsilon$, then the
number of corrupted decimal digits is $\log_{10} p$.
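As a rough check of these two statements, one can mimic the toy system with Python's `decimal` module. The sketch below assumes that a context with 3 significant digits and truncation (`ROUND_DOWN`) is an adequate stand-in for the 3-digit mantissa, and it searches only over powers of ten. The helper `corrupted_decimal_digits` is a hypothetical name introduced here.

```python
from decimal import Decimal, Context, ROUND_DOWN
import math

# 3 significant digits, truncation: an assumed model of the toy mantissa.
toy = Context(prec=3, rounding=ROUND_DOWN)

eps = Decimal(1)
# Keep dividing by 10 while 1 + eps/10 can still be told apart from 1.
while toy.add(Decimal(1), toy.divide(eps, Decimal(10))) > Decimal(1):
    eps = toy.divide(eps, Decimal(10))

print(eps)  # 0.01

def corrupted_decimal_digits(rel_error, eps):
    """If rel_error = p * eps, about log10(p) decimal digits are corrupted."""
    return math.log10(rel_error / eps)

# A relative error of 0.1 is 10 * eps, so about one digit is corrupted.
print(corrupted_decimal_digits(0.1, float(eps)))
```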
In (binary) IEEE arithmetic, the first single precision number
larger than 1 is $1 + 2^{-23}$, while the first double precision
number is $1 + 2^{-52}$. For extended precision there is no
hidden bit, so the first such number is $1 + 2^{-63}$.
You should be able to justify this yourselves.
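One way to check the single and double precision values numerically is to take the bit pattern of 1, add one unit in the last place, and read the result back. The sketch below uses Python's standard `struct` module and assumes that Python's `float` is an IEEE double, which holds on common platforms. Extended precision is not directly accessible from standard Python, so it is left out. The function names are hypothetical.

```python
import struct
import sys

def next_after_one_single():
    """First IEEE single precision number larger than 1, via its bit pattern."""
    (bits,) = struct.unpack("<I", struct.pack("<f", 1.0))
    (value,) = struct.unpack("<f", struct.pack("<I", bits + 1))
    return value

def next_after_one_double():
    """First IEEE double precision number larger than 1, via its bit pattern."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", 1.0))
    (value,) = struct.unpack("<d", struct.pack("<Q", bits + 1))
    return value

print(next_after_one_single() - 1.0 == 2.0 ** -23)   # True
print(next_after_one_double() - 1.0 == 2.0 ** -52)   # True
print(sys.float_info.epsilon == 2.0 ** -52)          # True
```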
If the relative error in a computation is $p\epsilon$, then the
number of corrupted binary digits is $\log_2 p$.
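To see this rule at work in double precision, consider adding a small number to 1 and subtracting 1 again. As in the toy example, the small operand is shifted right to align with 1 and loses its low-order bits. The sketch below is an illustration only; the value $10^{-8}$ and the helper name `corrupted_bits` are choices made here, and Python's `float` is assumed to be an IEEE double.

```python
import math
import sys

eps = sys.float_info.epsilon    # 2**-52 for IEEE double precision

def corrupted_bits(rel_error, eps=eps):
    """If rel_error = p * eps, about log2(p) binary digits are corrupted."""
    return math.log2(rel_error / eps)

x = 1e-8
computed = (1.0 + x) - 1.0      # x is aligned with 1.0 and loses low-order bits
rel_error = abs(computed - x) / x

print(rel_error)                  # relative error of the recovered x
print(corrupted_bits(rel_error))  # estimated number of corrupted binary digits
```

Roughly half of the 52 fraction bits of x are lost here, which matches the fact that x is about $2^{-27}$ times smaller than 1.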
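Table: Precision of different IEEE representations

IEEE Format    | Machine precision ($\epsilon$)              | No. Decimal Digits
Single Prec.   | $2^{-23} \approx 1.192 \times 10^{-7}$     | 7
Double Prec.   | $2^{-52} \approx 2.220 \times 10^{-16}$    | 16
Extended Prec. | $2^{-63} \approx 1.084 \times 10^{-19}$    | 19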
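The last column of the table can be reproduced, approximately, as $-\log_{10} \epsilon$ rounded to the nearest integer. The following sketch assumes this is how the digit counts were obtained; the dictionary and function names are hypothetical.

```python
import math

# Number of mantissa bits below the leading 1 in each format.
FRACTION_BITS = {"Single": 23, "Double": 52, "Extended": 63}

def machine_eps(bits):
    """Machine precision 2^(-bits)."""
    return 2.0 ** -bits

def decimal_digits(bits):
    """Approximate number of decimal digits carried: -log10(eps)."""
    return -math.log10(machine_eps(bits))

for name, bits in FRACTION_BITS.items():
    print(f"{name:9s} eps = {machine_eps(bits):.3e}  "
          f"digits ~ {round(decimal_digits(bits))}")
```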
Remark: We can now answer the following question.
Signed integers are represented in two's complement.
Signed mantissas are represented using the sign-magnitude convention.
For signed exponents the standard uses a biased representation.
Why not represent the exponents in two's complement,
as we do for the signed integers?
When we compare two floating point numbers
(both positive, for now)
the exponents are looked at first; only if they are equal
do we proceed with the mantissas. The biased exponent is a much more convenient
representation for the purpose of comparison: it is stored as a nonnegative
integer, so a larger stored value always means a larger exponent.
We compare two signed integers
in greater than/less than/equal to expressions; such expressions
appear infrequently enough in a program
that we can live with the two's complement formulation, which has other benefits.
On the other hand, any time we perform a floating point addition/subtraction
we need to compare the exponents and align the operands.
Exponent comparisons are therefore quite frequent,
and being able to do them efficiently is very important.
This is the argument for preferring the biased exponent representation.
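The convenience of the biased representation can be illustrated on IEEE doubles, where the exponent field has 11 bits and the bias is 1023. The sketch below (helper names are hypothetical, and Python's `float` is assumed to be an IEEE double) extracts the stored exponent and checks that, for positive numbers, ordering the raw bit patterns as unsigned integers gives the same result as ordering the numbers themselves.

```python
import struct

def double_bits(x):
    """Raw 64-bit pattern of an IEEE double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def biased_exponent(x):
    """The 11-bit stored exponent field; true exponent = field - 1023."""
    return (double_bits(x) >> 52) & 0x7FF

values = [2.0, 0.001, 1024.0, 0.75, 1.5, 1.0]
for v in values:
    e = biased_exponent(v)
    print(v, e, e - 1023)   # value, stored exponent, true exponent

# For positive numbers, comparing bit patterns as unsigned integers
# orders them exactly like comparing the numbers themselves.
print(sorted(values, key=double_bits) == sorted(values))   # True
```

Negative true exponents (as for 0.75 and 0.001) are stored as field values below 1023, so no sign handling is needed when exponents are compared.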
Adrian Sandu
2001-08-26