
## Precision

To define the precision of the FP system, let us go back to our toy FP representation (2 decimal digits for the exponent and 3 for the mantissa).

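We want to add two numbers, e.g.

$$1 = 1.00 \times 10^{0} \quad \text{and} \quad 0.01 = 1.00 \times 10^{-2}.$$

In order to perform the addition, we bring the smaller number to the same exponent as the larger number by shifting its mantissa to the right. For our example,

$$1.00 \times 10^{-2} = 0.01 \times 10^{0}.$$

Next, we add the mantissas and normalize the result if necessary. In our case

$$1.00 \times 10^{0} + 0.01 \times 10^{0} = 1.01 \times 10^{0}.$$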

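Suppose now we want to add

$$1 = 1.00 \times 10^{0} \quad \text{and} \quad 0.001 = 1.00 \times 10^{-3}.$$

To bring them to the same exponent, we need to shift the mantissa of the smaller number right by 3 positions and, because of our limited space (3 digits), we lose all the significant information. Thus

$$1.00 \times 10^{0} + 0.00[1] \times 10^{0} = 1.00 \times 10^{0}.$$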

We can see now that this is a limitation of the FP system due to the storage of only a finite number of digits.
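The align, add, and normalize steps can be tried out with a minimal Python sketch of the toy system. The function name `toy_add` and the choice to truncate (rather than round) the digits shifted out are assumptions made for illustration; the sketch also only handles normalization after a carry, not after cancellation.

```python
from decimal import Decimal, ROUND_DOWN

MANTISSA_DIGITS = 3  # toy format: mantissa d.dd, value = d.dd x 10^e


def toy_add(m_a, e_a, m_b, e_b):
    """Add two toy numbers given as (mantissa, exponent) pairs.

    Assumption: digits shifted past the third mantissa digit are
    truncated, not rounded.
    """
    # Make (m_a, e_a) the operand with the larger exponent.
    if e_a < e_b:
        m_a, e_a, m_b, e_b = m_b, e_b, m_a, e_a
    quantum = Decimal(1).scaleb(-(MANTISSA_DIGITS - 1))  # 0.01
    # Shift the smaller operand right; digits that fall off are lost.
    m_b = m_b.scaleb(e_b - e_a).quantize(quantum, rounding=ROUND_DOWN)
    m, e = m_a + m_b, e_a
    # Normalize if the sum carried into a second integer digit.
    if abs(m) >= 10:
        m = m.scaleb(-1).quantize(quantum, rounding=ROUND_DOWN)
        e += 1
    return m, e


# An exponent gap of 2 still leaves one digit of the small operand.
print(toy_add(Decimal("1.00"), 0, Decimal("1.00"), -2))
# An exponent gap of 3 shifts the small operand out completely.
print(toy_add(Decimal("1.00"), 0, Decimal("1.00"), -3))
```

With an exponent gap of 3 the smaller operand contributes nothing, so the sum is returned unchanged.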

The precision of the floating point system (the "machine precision") is the smallest number $\epsilon$ for which $1 + \epsilon > 1$ when the sum is evaluated in floating point arithmetic.

For our toy FP system, it is clear from the previous discussion that the machine precision is $\epsilon = 0.01$.

If the relative error in a computation is $p\epsilon$, where $\epsilon$ is the machine precision, then the number of corrupted decimal digits is about $\log_{10} p$.

In (binary) IEEE arithmetic, the first single precision number larger than 1 is $1 + 2^{-23}$, while the first double precision number larger than 1 is $1 + 2^{-52}$. For extended precision there is no hidden bit, so the first such number is $1 + 2^{-63}$. You should be able to justify this yourselves.
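The single and double precision gaps above 1 can be checked numerically. The following is a small Python sketch; it assumes that Python floats are IEEE doubles (true on mainstream platforms), and the helper names are illustrative. Extended precision is not covered, since standard Python does not expose it.

```python
import struct
import sys


def next_after_one_single():
    """Return the smallest single precision number larger than 1."""
    # Reinterpret 1.0 (as a 32-bit float) as an unsigned integer, add one
    # unit in the last place of the mantissa, and convert back.
    bits = struct.unpack("<I", struct.pack("<f", 1.0))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def double_epsilon_by_halving():
    """Find the gap between 1 and the next double by repeated halving."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


single_gap = next_after_one_single() - 1.0
double_gap = double_epsilon_by_halving()

print(single_gap == 2.0 ** -23)
print(double_gap == 2.0 ** -52)
print(double_gap == sys.float_info.epsilon)
```

The halving loop stops as soon as adding half of the current `eps` no longer changes 1, which is the defining property of the machine precision used in the text.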

If the relative error in a computation is $p\epsilon$, where $\epsilon$ is the machine precision, then the number of corrupted binary digits is about $\log_2 p$.
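As a rough illustration of this rule of thumb, the sketch below writes a relative error as a multiple $p$ of the machine precision and takes logarithms. The function name `corrupted_digits` and the sample error `1e-12` are hypothetical; the default machine precision is the double precision value $2^{-52}$.

```python
import math


def corrupted_digits(rel_error, eps=2.0 ** -52):
    """Estimate corrupted digits for a relative error of p * eps.

    Returns (decimal_digits, binary_digits) = (log10(p), log2(p)).
    """
    p = rel_error / eps
    return math.log10(p), math.log2(p)


# Hypothetical example: a double precision result with relative error 1e-12.
decimal_digits, binary_digits = corrupted_digits(1e-12)
print(decimal_digits, binary_digits)
```

For this sample input the estimate comes out between 3 and 4 decimal digits, or roughly 12 binary digits, out of the 16 decimal (52 binary) digits that double precision carries.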

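Table: Precision of different IEEE representations

| IEEE Format | Machine precision ($\epsilon$) | No. Decimal Digits |
|---|---|---|
| Single Prec. | $2^{-23} \approx 1.192 \times 10^{-7}$ | 7 |
| Double Prec. | $2^{-52} \approx 2.220 \times 10^{-16}$ | 16 |
| Extended Prec. | $2^{-63} \approx 1.084 \times 10^{-19}$ | 19 |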

Remark: We can now answer the following question. Signed integers are represented in two's complement, signed mantissas use the sign-magnitude convention, and for signed exponents the standard uses a biased representation. Why not represent the exponents in two's complement, as we do for signed integers?

When we compare two floating point numbers (both positive, for now), the exponents are looked at first; only if they are equal do we proceed to the mantissas. The biased exponent is a much more convenient representation for this purpose: biased exponents are nonnegative, so a larger stored value always means a larger exponent and the fields can be compared as plain unsigned integers.

We compare two signed integers in greater than/less than/equal to expressions; such expressions appear infrequently enough in a program that we can live with the two's complement formulation, which has other benefits. On the other hand, any time we perform a floating point addition or subtraction we need to compare the exponents and align the operands. Exponent comparisons are therefore quite frequent, and being able to do them efficiently is very important. This is the argument for preferring the biased exponent representation.
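The convenience of the biased exponent can be seen by inspecting the bits of a few positive doubles. This Python sketch assumes IEEE double precision (an 11-bit exponent field with bias 1023); the helper names and the 8-bit two's complement comparison at the end are illustrative only.

```python
import struct


def double_bits(x):
    """Return the 64 bits of a double as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def biased_exponent(x):
    """Extract the 11-bit biased exponent field of a double."""
    return (double_bits(x) >> 52) & 0x7FF


values = [0.001, 0.5, 1.0, 1.5, 2.0, 1024.0]

# Stored exponent fields grow with the true exponent: 0.5, 1.0, 2.0
# have true exponents -1, 0, 1 and stored fields 1022, 1023, 1024.
print([biased_exponent(v) for v in (0.5, 1.0, 2.0)])

# For positive doubles, ordering by raw bit pattern (an unsigned integer
# comparison) agrees with ordering by value.
same_order = sorted(values) == sorted(values, key=double_bits)
print(same_order)

# By contrast, a hypothetical 8-bit two's complement exponent field would
# store -1 as 0xFF, which compares larger than +1 (0x01) as unsigned bits.
twos_complement_misorders = (-1 & 0xFF) > (1 & 0xFF)
print(twos_complement_misorders)
```

Because the exponent field sits above the mantissa in the bit layout, the same unsigned comparison handles the "exponents first, then mantissas" rule in one step.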
