
Floating-Point Numbers

For most applications in science and engineering, integers are not sufficient; we need to work with real numbers. Most real numbers have an infinite number of decimal digits, so there is no hope of storing them exactly. On a computer, the floating-point (FP) convention is used to represent (approximations of) the real numbers. The design of computer systems requires in-depth knowledge of FP: modern processors have special FP instructions, compilers must generate these instructions, and the operating system must handle the exception conditions they raise.

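We will now illustrate the floating-point representation in base 10. Any nonzero decimal number x can be uniquely written as

$x = \sigma \cdot m \cdot 10^{e}$

where
$\sigma$ = +1 or -1 is the sign,
$m$, with $1 \le m < 10$, is the mantissa,
$e$ is an integer exponent.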

For example

If we did not impose the condition $1 \le m < 10$ on the mantissa $m$, we could have represented the number in various different ways, for example

When the condition $1 \le m < 10$ is satisfied, we say that the mantissa $m$ is normalized. Normalization guarantees that
  1. the FP representation is unique,
  2. since $m < 10$, there is exactly one digit before the decimal point, and
  3. since $m \ge 1$, the first digit in the mantissa is nonzero. Thus, none of the available digits is wasted by storing leading zeros.
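
The normalization condition can be sketched in a few lines of Python. This is only an illustration: the function name `normalize`, the use of the standard `decimal` module, and the sample values are assumptions made here, not part of the notes.

```python
from decimal import Decimal

def normalize(x):
    """Split a nonzero decimal number into (sign, mantissa, exponent)
    such that x = sign * mantissa * 10**exponent and 1 <= mantissa < 10."""
    d = Decimal(str(x))
    if d == 0:
        raise ValueError("zero has no normalized representation")
    sign = -1 if d < 0 else 1
    d = abs(d)
    exponent = d.adjusted()          # power of ten of the leading digit
    mantissa = d.scaleb(-exponent)   # move the decimal point after the leading digit
    return sign, mantissa, exponent

print(normalize("107.625"))    # (1, Decimal('1.07625'), 2)
print(normalize("-0.00123"))   # (-1, Decimal('1.23'), -3)
```

Because the leading digit is forced to sit just before the decimal point, each nonzero input yields exactly one (sign, mantissa, exponent) triple.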

Suppose our storage space is limited to 6 decimal digits per FP number. We allocate 1 decimal digit for the sign, 3 decimal digits for the mantissa and 2 decimal digits for the exponent. If the mantissa is longer we will chop it to the most significant 3 digits (another possibility is rounding, which we will talk about shortly).

Our example number can be then represented as

A floating-point number is represented as a (sign, mantissa, exponent) triple with a limited number of digits for the mantissa and the exponent. The parameters of the FP system are $\beta = 10$ (the base), $d_m = 3$ (the number of digits in the mantissa) and $d_e = 2$ (the number of digits for the exponent).
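
A minimal sketch of how a number could be chopped into the fields of such a system, assuming 3 mantissa digits. The helper name `fields`, the string layout of the result, and the sample values are hypothetical choices made for this illustration; the exponent is returned as a plain signed integer and its 2-digit storage limit is ignored.

```python
from decimal import Decimal, ROUND_DOWN

def fields(x, mantissa_digits=3):
    """Chop a nonzero x to `mantissa_digits` significant digits and return
    (sign, mantissa digit string, exponent). Chopping rounds toward zero."""
    d = Decimal(str(x))
    if d == 0:
        raise ValueError("zero is handled separately")
    e = d.adjusted()                                       # exponent of the leading digit
    quantum = Decimal(1).scaleb(e - mantissa_digits + 1)   # weight of the last kept digit
    chopped = d.quantize(quantum, rounding=ROUND_DOWN)
    sign = "-" if chopped < 0 else "+"
    digits = "".join(str(k) for k in abs(chopped).as_tuple().digits)
    return sign, digits, e

print(fields("107.625"))     # ('+', '107', 2)
print(fields("-0.056789"))   # ('-', '567', -2)
```

The digit string "107" is read as the mantissa 1.07, so the first result stands for +1.07 x 10^2.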

Most real numbers cannot be exactly represented as floating point numbers. For example, numbers with an infinite representation, like , will need to be ``approximated'' by a finite-length FP number. In our FP system, will be represented as

Note that a number with a finite representation in decimal need not have a finite representation in binary; for example, the decimal fraction 0.1 has an infinite binary representation.
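
The difference between finite decimal and finite binary representations can be checked directly. The sketch below generates binary digits by repeated doubling with exact rational arithmetic; the function name `binary_fraction_digits` is a hypothetical helper, and the decimal fraction 0.1 is used as the test value.

```python
from fractions import Fraction

def binary_fraction_digits(x, n):
    """Return the first n binary digits after the point of a fraction 0 <= x < 1."""
    x = Fraction(x)
    digits = ""
    for _ in range(n):
        x *= 2
        bit = int(x)      # the integer part is the next binary digit
        digits += str(bit)
        x -= bit
    return digits

print(binary_fraction_digits("0.1", 12))   # 000110011001  (the block 0011 repeats forever)
print(binary_fraction_digits("0.5", 4))    # 1000          (finite: 0.1 in binary)
print(Fraction(0.1) == Fraction(1, 10))    # False: the stored double is not exactly 1/10
```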

In general, the FP representation $fl(x)$ is just an approximation of the real number $x$. The relative error is the difference between the two numbers, divided by the real number:

$\delta = \frac{fl(x) - x}{x}$
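
As a small sketch, the relative error can be evaluated exactly with rational arithmetic. The sign convention (FP value minus real value), the function name `relative_error`, and the sample pair 107.625 and 107 are assumptions chosen for illustration.

```python
from fractions import Fraction

def relative_error(x, fl_x):
    """Relative error (fl(x) - x) / x, computed without rounding."""
    x, fl_x = Fraction(x), Fraction(fl_x)
    return (fl_x - x) / x

delta = relative_error("107.625", "107")   # 107 is 107.625 chopped to 3 digits
print(delta)          # -5/861
print(float(delta))   # roughly -5.8e-3
```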

For example, if , and is its representation in our FP system, then the relative error is

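Another measure for the approximation error is the number of units in the last place, or ulps. The error in ulps is computed as

$\mathrm{err} = |x - fl(x)| \times \beta^{d_m - 1 - e}$

where $\beta$ is the base, $e$ is the exponent of the FP representation $fl(x)$ of $x$, and $d_m$ is the number of digits in the mantissa. For our example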

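The relative error corresponding to 0.5 ulps can vary by a factor of $\beta$ (the base); this factor is called the wobble factor. If $|x - fl(x)|$ = 0.5 ulps and $fl(x) = m.mm \ldots m \times \beta^{e}$ has $d_m$ mantissa digits, then $|x - fl(x)| = \frac{\beta}{2} \beta^{-d_m} \times \beta^{e}$, and since $\beta^{e} \le x < \beta^{e+1}$ we have that

$\frac{1}{2} \beta^{-d_m} \le \frac{|x - fl(x)|}{x} \le \frac{\beta}{2} \beta^{-d_m}$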

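If the error is $n$ ulps, the last $\log_\beta n$ digits in the number are contaminated by error. Similarly, if the relative error is $n \epsilon$, where $\epsilon = \frac{\beta}{2} \beta^{-d_m}$ is the largest relative error corresponding to 0.5 ulps, roughly the last $\log_\beta n$ digits are in error.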
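
The two error measures can be compared numerically. The sketch below assumes the toy system with base 10 and 3 mantissa digits; the names `BETA`, `D_M`, `error_in_ulps` and `relative_error`, and all sample values, are illustrative assumptions. It evaluates the error in ulps for one value and then the relative error of a 0.5 ulp error near both ends of a decade, which differ by almost a factor of 10.

```python
from fractions import Fraction

BETA = 10   # base of the toy system
D_M = 3     # number of mantissa digits

def error_in_ulps(x, fl_x, e):
    """|x - fl(x)| * BETA**(D_M - 1 - e), where e is the exponent of fl(x)."""
    return abs(Fraction(x) - Fraction(fl_x)) * Fraction(BETA) ** (D_M - 1 - e)

def relative_error(x, fl_x):
    """Magnitude of the relative error |x - fl(x)| / |x|."""
    return abs(Fraction(x) - Fraction(fl_x)) / abs(Fraction(x))

print(error_in_ulps("107.625", "107", 2))    # 5/8, i.e. 0.625 ulps

# The same 0.5 ulp error near the bottom and near the top of a decade:
low_end = relative_error("1.005", "1.00")    # 1/201, close to (BETA/2) * BETA**-D_M
high_end = relative_error("9.995", "9.99")   # 1/1999, close to (1/2) * BETA**-D_M
print(float(low_end / high_end))             # about 9.95, nearly a factor of BETA
```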

With normalized mantissas, the three digits $m_1 m_2 m_3$ always read $m_1.m_2 m_3$, i.e. the decimal point has a fixed position inside the mantissa. For the original number, however, the decimal point can be floated to any position in the digit string by changing the exponent. This is the origin of the term floating point.

With 3 decimal digits, our mantissas range between $1.00$ and $9.99$. For exponents, two digits will provide the range $00 \le e \le 99$.

Consider the number . When we represent it in our floating point system, we lose all the significant information:

In order to overcome this problem, we need to allow for negative exponents also. We will use a biased representation: if the digits $e_1 e_2$ are stored in the exponent field, the actual exponent is $e_1 e_2 - 49$ (49 is called the exponent bias). This implies that, instead of going from $0$ to $99$, our exponents will actually range from $-49$ to $50$. The number

is then represented, with the biased exponent convention, as

What is the maximum number allowed by our toy floating point system? If the mantissa is $m = 9.99$ and the exponent is $e = 50$, we obtain $x = 9.99 \times 10^{50}$.

If the mantissa digits are $000$ and the exponent field is $00$, we obtain a representation of ZERO. Depending on the sign $\sigma$, it can be $+0$ or $-0$. Both numbers are valid, and we will consider them equal.

What is the minimum positive number that can be represented in our toy floating point system? The smallest mantissa value that satisfies the normalization requirement is $m = 1.00$; together with the exponent field $00$ (actual exponent $-49$) this gives the number $10^{-49}$. If we drop the normalization requirement, we can represent smaller numbers also. For example, $m = 0.10$ and exponent field $00$ give $10^{-50}$, while $m = 0.01$ and exponent field $00$ give $10^{-51}$.

The FP numbers with exponent equal to ZERO and the first digit in the mantissa also equal to ZERO are called subnormal numbers.

Allowing subnormal numbers improves the resolution of the FP system near 0. Non-normalized mantissas will be permitted only when the exponent field is $00$, to represent ZERO or subnormal numbers, or when the exponent field holds a reserved value, to represent special numbers.
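
The extreme values discussed above can be reproduced with a small decoder. This is a sketch under stated assumptions: the function `decode` and the constant `BIAS` are hypothetical names, the three stored mantissa digits "abc" are read as a.bc, the actual exponent is the stored field minus 49, and special numbers are not modeled.

```python
from fractions import Fraction

BIAS = 49   # exponent bias of the toy system

def decode(sign, mantissa_digits, exponent_field):
    """Exact value of a toy FP number given its three stored fields."""
    m = Fraction(int(mantissa_digits), 100)             # "abc" -> a.bc
    return sign * m * Fraction(10) ** (exponent_field - BIAS)

largest = decode(+1, "999", 99)          # 9.99 x 10^50
smallest_normal = decode(+1, "100", 0)   # 1.00 x 10^-49
subnormal_a = decode(+1, "010", 0)       # 0.10 x 10^-49 = 10^-50
subnormal_b = decode(+1, "001", 0)       # 0.01 x 10^-49 = 10^-51

print(float(largest), float(smallest_normal))
print(float(subnormal_a), float(subnormal_b))
print(decode(+1, "000", 0) == decode(-1, "000", 0))   # True: +0 and -0 compare equal
```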

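Example (D. Goldberg, p. 185, adapted): Suppose we work with our toy FP system and do not allow for subnormal numbers. Consider the fragment of code

    if (x != y) then z = 1.0/(x - y)

designed to "guard" against division by 0. Let $x$ and $y$ be two distinct normalized FP numbers whose difference is smaller than the smallest normal number $10^{-49}$. Clearly $x \ne y$ but (since we do not use subnormal numbers) the computed difference $x \ominus y$ is 0. In spite of all the trouble we are dividing by 0! If we allow subnormal numbers, $x \ominus y \ne 0$ and the code behaves correctly.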
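
The failure of the guard can be simulated. The sketch below is a simplified model and not the notes' own code: `toy_subtract`, `MIN_NORMAL` and `SUBNORMAL_STEP` are hypothetical names, the operands 1.02 x 10^-49 and 1.01 x 10^-49 are assumed sample values, and rounding of results in the normal range is ignored.

```python
from fractions import Fraction

MIN_NORMAL = Fraction(1, 10**49)       # 1.00 x 10^-49, smallest normal number
SUBNORMAL_STEP = Fraction(1, 10**51)   # 0.01 x 10^-49, smallest subnormal number

def toy_subtract(x, y, allow_subnormals):
    """Model of x - y in the toy system for results near the underflow threshold."""
    exact = x - y
    if exact == 0 or abs(exact) >= MIN_NORMAL:
        return exact                        # normal-range rounding is not modeled
    if allow_subnormals:
        steps = int(exact / SUBNORMAL_STEP)  # chop to a multiple of the smallest subnormal
        return steps * SUBNORMAL_STEP
    return Fraction(0)                      # without subnormals the result is flushed to zero

x = Fraction(102, 10**51)   # 1.02 x 10^-49 (assumed sample value)
y = Fraction(101, 10**51)   # 1.01 x 10^-49 (assumed sample value)

print(x != y)                              # True: the guard lets the division proceed
print(toy_subtract(x, y, False) == 0)      # True: the divisor is nevertheless zero
print(toy_subtract(x, y, True) == 0)       # False: the subnormal result is nonzero
```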

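Note that for the exponent bias we have chosen 49 and not 50. The reason for this is self-consistency: the inverse of the smallest normal number does not overflow

$x_{\min} = 10^{-49}, \quad 1/x_{\min} = 10^{49} < 9.99 \times 10^{50} = x_{\max}$

(with a bias of 50 we would have had $1/x_{\min} = 1/10^{-50} = 10^{50} > 9.99 \times 10^{49} = x_{\max}$, an overflow).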
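
The self-consistency argument can be verified for both candidate biases. In this sketch the function names `extremes` and `inverse_fits` are hypothetical, and the exponent field is assumed to run from 00 to 99 with a normalized mantissa between 1.00 and 9.99.

```python
from fractions import Fraction

def extremes(bias):
    """Smallest and largest normal numbers of the toy system for a given bias."""
    x_min = Fraction(10) ** (0 - bias)                         # 1.00 x 10^(00 - bias)
    x_max = Fraction(999, 100) * Fraction(10) ** (99 - bias)   # 9.99 x 10^(99 - bias)
    return x_min, x_max

def inverse_fits(bias):
    """True if 1/x_min does not exceed x_max, i.e. does not overflow."""
    x_min, x_max = extremes(bias)
    return 1 / x_min <= x_max

print(inverse_fits(49))   # True:  10^49 <= 9.99 x 10^50
print(inverse_fits(50))   # False: 10^50 >  9.99 x 10^49
```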

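Similar to the decimal case, any nonzero binary number x can be represented as

$x = \sigma \cdot m \cdot 2^{e}$

where
$\sigma$ = +1 or -1 is the sign,
$m$, with $1 \le m < 2$, is the mantissa,
$e$ is an integer exponent.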

For example,


With 6 binary digits available for the mantissa and 4 binary digits available for the exponent, the floating point representation is


When we use normalized mantissas, the first digit is always nonzero. With binary floating point representation, a nonzero digit is (of course) 1, hence the first digit in the normalized binary mantissa is always 1.

As a consequence, it is not necessary to store it; we can store the mantissa starting with the second digit, and store an extra, least significant bit, in the space we saved. This is called the hidden bit technique.
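
The effect of the hidden bit can be sketched for a mantissa field of 6 bits. The helper `leading_bits`, the constant `STORED`, and the sample value 107.625 are assumptions made for this illustration; digits are chopped, not rounded.

```python
from fractions import Fraction

def leading_bits(x, n):
    """Return (bits, e): the first n significant binary digits of x > 0
    and the exponent e such that x = 1.b2 b3 ... * 2**e."""
    x = Fraction(x)
    if x <= 0:
        raise ValueError("x must be positive")
    e = 0
    while x >= 2:          # scale down into [1, 2)
        x /= 2
        e += 1
    while x < 1:           # scale up into [1, 2)
        x *= 2
        e -= 1
    bits = ""
    for _ in range(n):
        bit = int(x)       # integer part: 0 or 1
        bits += str(bit)
        x = (x - bit) * 2  # move on to the next binary digit
    return bits, e

STORED = 6   # bits available in the mantissa field
explicit, e = leading_bits("107.625", STORED)         # leading 1 stored explicitly
hidden = leading_bits("107.625", STORED + 1)[0][1:]   # leading 1 dropped, one more bit kept
print(explicit, hidden, e)   # 110101 101011 6
```

With the leading 1 implied, the same 6 bits of storage carry 7 significant bits of the number.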

For our binary example the leftmost bit (equal to 1, of course, shown in bold) is redundant. If we do not store it any longer, we obtain the hidden bit representation:


We can now pack more information in the same space: the rightmost bit of the mantissa now holds the seventh bit of the number (equal to 1, shown in bold). This bit was simply omitted in the standard form. Question: Why do we prefer

Adrian Sandu 2001-08-26