For most applications in science and engineering integer numbers are not sufficient; we need to work with real numbers. Real numbers like $\pi$ have an infinite number of decimal digits; there is no hope of storing them exactly. On a computer, the floating point convention is used to represent (approximations of) the real numbers. The design of computer systems requires in-depth knowledge of FP: modern processors have special FP instructions, compilers must generate such FP instructions, and the operating system must handle the exception conditions generated by these FP instructions.

We will now illustrate the floating point representation in base 10.
Any decimal number *x* can be *uniquely* written as
$$ x = \sigma \cdot m \cdot 10^{e}, $$
where $\sigma = \pm 1$ is the *sign*, $1 \le m < 10$ is the *mantissa*, and $e$ is an integer *exponent*.

For example,
$$ 107.625 = +1 \cdot 1.07625 \cdot 10^{2}. $$
If we did not impose the condition $1 \le m < 10$ we could have represented
the number in various different ways, for example
$$ 107.625 = +1 \cdot 0.107625 \cdot 10^{3} = +1 \cdot 10.7625 \cdot 10^{1}. $$

When the condition $1 \le m < 10$ is satisfied, we say that the mantissa is *normalized*. Normalization guarantees that

- the FP representation is unique, since there is exactly one digit before the decimal point, and
- none of the available digits is wasted on storing leading zeros, since the first digit in the mantissa is nonzero.

Suppose our storage space is limited to 6 decimal digits per FP number. We allocate 1 decimal digit for the sign, 3 decimal digits for the mantissa and 2 decimal digits for the exponent. If the mantissa is longer we will chop it to the most significant 3 digits (another possibility is rounding, which we will talk about shortly).
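The chopping of the mantissa to its 3 most significant digits can be sketched in Python; `to_toy_fp` is a hypothetical helper written for illustration, not part of any standard library:

```python
from decimal import Decimal

def to_toy_fp(x, p=3):
    """Represent x as (sign, p-digit chopped mantissa, exponent)
    in a toy decimal floating point system (illustrative helper)."""
    sign = 1 if x >= 0 else -1
    s = abs(Decimal(repr(x)))
    e = s.adjusted()                # exponent of the leading nonzero digit
    m = s.scaleb(-e)                # normalized mantissa, 1 <= m < 10
    digits = str(m).replace(".", "")[:p].ljust(p, "0")
    mantissa = Decimal(digits[0] + "." + digits[1:])  # chop, don't round
    return sign, mantissa, e

# 107.625 with a 3-digit mantissa is chopped to 1.07 x 10^2
print(to_toy_fp(107.625))   # (1, Decimal('1.07'), 2)
```

Note that chopping simply discards the extra digits; rounding to nearest, discussed later, would give $1.08 \times 10^{2}$ here.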

*A floating point number is represented as $\pm d_0.d_1 d_2 \ldots d_{p-1} \times \beta^{e}$,
with a limited number of
digits for the mantissa and the exponent.*
The parameters of the FP system are
$\beta$ (the base), $p$ (the number of digits in the mantissa),
and the number of digits allotted to the exponent.

Most real numbers cannot be exactly represented as floating point numbers. For example, numbers with an infinite representation, like $\pi = 3.141592\ldots$, will need to be ``approximated'' by a finite-length FP number. In our FP system, $\pi$ will be represented as
$$ fl(\pi) = +3.14 \times 10^{0}. $$

In general, the FP representation $fl(x)$ is just an approximation
of the real number $x$. The *relative error* is the difference between
the two numbers, divided by the real number:
$$ \delta = \frac{fl(x) - x}{x}. $$
For example, if $x = \pi$, and $fl(x) = 3.14$ is its representation in our FP system, then the relative error is
$$ \delta = \frac{3.14 - \pi}{\pi} \approx -5.07 \times 10^{-4}. $$

Another measure for the approximation error is the number of
*units in the last place*, or `ulps`. The error in `ulps`
is computed as
$$ err = \left| m - \frac{x}{\beta^{e}} \right| \cdot \beta^{p-1}, $$
where $e$ is the exponent of $fl(x) = \pm m \cdot \beta^{e}$ and $p$ is the number of digits in the mantissa. For our example
$$ err = \left| 3.14 - \frac{\pi}{10^{0}} \right| \cdot 10^{2} \approx 0.159 \;\; \mathtt{ulps}. $$
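Both error measures for the $\pi$ example can be checked with a few lines of Python (the variable names are ours, chosen to mirror the formulas above):

```python
import math

beta, p = 10, 3          # toy system: base 10, 3-digit mantissa
x = math.pi              # the exact value
fl_x = 3.14              # its chopped 3-digit representation
e = 0                    # exponent of fl(x): 3.14 = 3.14 x 10^0

rel_err = abs(fl_x - x) / x                       # relative error
ulps = abs(fl_x - x / beta**e) * beta**(p - 1)    # error in ulps

print(rel_err)   # approximately 5.07e-4
print(ulps)      # approximately 0.159
```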

The relative error corresponding to an error of 0.5 `ulps`
can vary by a factor of $\beta$; this factor is called the *wobble factor*. If $err = 0.5$ `ulps`
and $fl(x) = \pm m \cdot \beta^{e}$, then
$$ |fl(x) - x| = \frac{1}{2}\,\beta^{e-p+1}, $$
and since
$\beta^{e} \le |x| < \beta^{e+1}$
we have that
$$ \frac{1}{2}\,\beta^{-p} < \delta = \frac{|fl(x) - x|}{|x|} \le \frac{1}{2}\,\beta^{-p+1}. $$
If the error is $n$ `ulps`, the last $\log_\beta n$ digits in the number are contaminated by error. Similarly, if the relative error is $\delta$, the last $\log_\beta\left(\delta \cdot \beta^{p}\right)$ digits are in error.

With normalized mantissas, the three digits always read $d_0.d_1 d_2$, i.e. the decimal point has a fixed position inside the mantissa.

We see now the origin of the term *floating point*:
the decimal point can be floated to any position in the bit-string
we like by changing the exponent.

With 3 decimal digits, our mantissas range between $1.00$ and $9.99$. For exponents, two digits will provide the range $0 \le e \le 99$.

Consider the number $0.000123 = +1.23 \times 10^{-4}$. When we represent it
in our floating point system, with nonnegative exponents only,
we lose all the significant information:
$$ fl(0.000123) = +0.00 \times 10^{0}. $$
In order to overcome this problem, we need to allow for negative exponents also. We will use a *biased exponent* convention: the value stored in the exponent field is the actual exponent plus a *bias* of 49, so stored exponents $0$ through $99$ correspond to actual exponents $-49$ through $50$. The number $0.000123$
is then represented, with the biased exponent convention, as
$$ +\,|\,1.23\,|\,45 \qquad (\text{biased exponent } {-4} + 49 = 45). $$

What is the maximum number allowed by our toy floating point system?
If $m = 9.99$ and $e = 50$, we obtain
$$ x_{\max} = 9.99 \times 10^{50}. $$
If $m = 0.00$ and $e = -49$ we obtain a representation of ZERO. Depending on $\sigma$, it can be $+0$ or $-0$. Both numbers are valid, and we will consider them equal.

What is the minimum positive number that can be represented in our toy floating point system? The smallest mantissa value that satisfies the normalization requirement is $m = 1.00$; together with $e = -49$ this gives the number $x_{\min} = 1.00 \times 10^{-49}$. If we drop the normalization requirement, we can represent smaller numbers also. For example, $m = 0.10$ and $e = -49$ give $1.0 \times 10^{-50}$, while $m = 0.01$ and $e = -49$ give $1.0 \times 10^{-51}$.
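The same gradual loss of precision below the smallest normal number exists in IEEE 754 doubles, which behave like a binary version of our toy system; a quick check in Python (the variable names are ours):

```python
import sys

# IEEE double analogues of the toy system's smallest numbers
smallest_normal = sys.float_info.min    # 2.2250738585072014e-308
smallest_subnormal = 5e-324             # 2**-1074, the smallest positive double

# dropping normalization buys extra range below the smallest normal number...
print(smallest_subnormal < smallest_normal)   # True
# ...but below the smallest subnormal there is only zero
print(smallest_subnormal / 2)                 # 0.0
```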

*The FP numbers with biased exponent equal to ZERO and the first
digit in the mantissa also equal to ZERO are called subnormal numbers*.

Allowing subnormal numbers improves the resolution of the FP system near 0. Non-normalized mantissas will be permitted only when the biased exponent is $0$, to represent ZERO or subnormal numbers, or when the biased exponent is $99$, to represent special numbers.

Example (D. Goldberg, p. 185, adapted): Suppose we work with our toy FP system
and do not allow for
subnormal numbers. Consider the fragment of code

    if x != y then z = 1/(x - y)

designed to "guard" against division by 0.
Let $x = 1.02 \times 10^{-49}$
and
$y = 1.01 \times 10^{-49}$.
Clearly $x \ne y$ but,
(since we do not use subnormal numbers)
$fl(x - y) = 0$.
In spite of all the trouble we are dividing by 0!
If we allow subnormal numbers, $fl(x - y) = 0.01 \times 10^{-49}$
and the code behaves correctly.
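The failure mode can be simulated in Python by flushing small results to zero by hand; `flush_to_zero` is a toy model written for this sketch, not a real library function:

```python
def flush_to_zero(v, smallest_normal=1e-49):
    """Simulate a system WITHOUT subnormals: any nonzero result smaller
    in magnitude than the smallest normal number is flushed to zero."""
    return 0.0 if 0 < abs(v) < smallest_normal else v

x, y = 1.02e-49, 1.01e-49          # both representable, and x != y
diff = flush_to_zero(x - y)        # the tiny difference flushes to 0

print(x != y)                      # True: the guard passes
print(diff)                        # 0.0: yet the division is about to fail
if x != y:
    try:
        z = 1 / diff
    except ZeroDivisionError:
        print("division by zero despite the guard")
```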

Note that for the exponent bias we have chosen 49 and not 50.
The reason for this is self-consistency: the inverse of the smallest normal number
does not overflow,
$$ \frac{1}{x_{\min}} = \frac{1}{1.00 \times 10^{-49}} = 10^{49} < x_{\max} = 9.99 \times 10^{50} $$
(with a bias of 50 we would have had $1/x_{\min} = 1/10^{-50} = 10^{50} > x_{\max} = 9.99 \times 10^{49}$).
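The self-consistency argument can be verified with exact rational arithmetic; the comparison below, using Python's `fractions`, restates the two cases (the variable names are ours):

```python
from fractions import Fraction

# bias 49: actual exponents range over -49 .. 50
x_min = Fraction(10) ** -49                     # 1.00 x 10^-49
x_max = Fraction(999, 100) * Fraction(10) ** 50 # 9.99 x 10^50
print(1 / x_min <= x_max)    # True: 1/x_min does not overflow

# bias 50 would shift the range to -50 .. 49
y_min = Fraction(10) ** -50                     # 1.00 x 10^-50
y_max = Fraction(999, 100) * Fraction(10) ** 49 # 9.99 x 10^49
print(1 / y_min <= y_max)    # False: 10^50 overflows the maximum
```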

Similar to the decimal case, any binary number *x* can be represented as
$$ x = \sigma \cdot m \cdot 2^{e}, $$
where $\sigma = \pm 1$ is the *sign*, $1 \le m < 2$ is the *mantissa*, and $e$ is an integer *exponent*.

For example,
$$ (13.5)_{10} = (1101.1)_{2} = +1.1011_{2} \times 2^{3}. $$

When we use normalized mantissas, the first digit is always nonzero.
With binary floating point representation, a nonzero digit is (of course) 1,
hence the first digit in the normalized binary mantissa is always 1.
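IEEE 754 doubles exploit exactly this fact: the leading 1 of a normalized binary mantissa is not stored. A sketch that unpacks a double's bit fields (the helper `double_bits` is ours, written for illustration):

```python
import struct

def double_bits(x):
    """Unpack an IEEE 754 double into sign, biased exponent, and the
    52 stored fraction bits (the leading 1 of the mantissa is hidden)."""
    (b,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = b >> 63
    biased_exp = (b >> 52) & 0x7FF
    fraction = b & ((1 << 52) - 1)
    return sign, biased_exp, fraction

# 1.0 = +1.000...0 x 2^0: the leading 1 is NOT among the stored bits
print(double_bits(1.0))    # (0, 1023, 0) -- the double's bias is 1023
```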

As a consequence, it is not necessary to store it; we can store the mantissa starting with the second digit, and store an extra, least significant bit in the space we saved. This is called the *hidden bit* representation.

For a binary example such as $+1.1011_2 \times 2^{3}$, the leftmost bit of the mantissa (equal to 1, of course) is redundant. If we do not store it any longer, we obtain the hidden bit representation, which frees one position in the mantissa field.

We can now pack more information in the same space: the rightmost bit of the mantissa can hold one extra bit of the number, a bit that was simply chopped off in the standard form. Question: Why do we prefer