Floating Point Numbers
Why are we working on this?
Once we are not restricted to integer values, there are not only
infinitely many numbers, but infinitely many within a given range.
Because a set number of bits, such as 8 or 32 or 64, can't represent
infinitely many values, we have to do a bit more representational work
to handle non-integer values. Furthermore, our techniques in this
section give rise to some interesting new tradeoffs to think about.
Skills in this section:
- Represent non-integers in scientific notation
- Convert numbers to/from IEEE normalized form
- Translate IEEE normalized form to/from bit patterns
Concepts:
- Data representation
- Limitations
Introduction
We have seen how decimal fractions
can be converted to binary. For instance, we can
write 6.25_{10} as:
4 + 2 + 1/4 = 2^{2} + 2^{1} + 2^{-2}
            = 1*2^{2} + 1*2^{1} + 0*2^{0} + 0*2^{-1} + 1*2^{-2}
            = 110.01_{2}
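The conversion above can be sketched in code. This is a minimal illustration, not part of the original text; the function name `to_binary_fraction` is chosen here for clarity.

```python
def to_binary_fraction(x, max_bits=10):
    """Convert a non-negative decimal number to a binary digit string.

    Integer part: Python's built-in bin().
    Fractional part: repeatedly double and peel off the integer bit,
    stopping after max_bits digits or when the fraction reaches zero.
    """
    int_part = int(x)
    frac = x - int_part
    int_bits = bin(int_part)[2:]
    frac_bits = ""
    for _ in range(max_bits):
        if frac == 0:
            break
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit
    return int_bits + ("." + frac_bits if frac_bits else "")

print(to_binary_fraction(6.25))  # 110.01
```

Running it on 6.25 reproduces the derivation: the integer part 6 becomes 110, and the fractional part 0.25 becomes .01.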
Teaching a computer how to do arithmetic using such binary fractions would be difficult. One problem is that the binary point is not fixed; it needs to float. If you multiply a two place fraction by another two place fraction, for instance, the result has four fractional places, not two.
Computer scientists realized that it would be easier to do floating point arithmetic if the numbers were written in scientific notation. You may recall this notation from your science classes, where very large numbers and numbers that are very close to zero are written using powers of ten. For example,
1,234,000,000 = 1.234*10^{9} ,
0.0000567 = 5.67*10^{-5} .
Computers use binary notation and powers of two, of course. The resulting format is called floating point notation.
A floating point number has three parts: its sign, a fractional part, and an exponent:
+/- fractional_part * 2^{exponent}
For instance, the decimal floating point number 5.16 * 2^{13} has a positive sign, a fractional part of 5.16, and an exponent of 13. It is equivalent to
5.16 * 2^{13} = 5.16 * 8192 = 42,270.72
in ordinary signed decimal form.
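This arithmetic can be checked with Python's standard `math.ldexp`, which computes exactly fractional_part * 2^{exponent} (a quick sketch, not part of the original text):

```python
import math

# fractional_part * 2**exponent, here 5.16 * 2**13
value = math.ldexp(5.16, 13)
print(value)  # 42270.72
```

`ldexp` scales by a power of two, which is exact in binary floating point, so the only rounding involved is in representing 5.16 itself.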
The Basic Ideas
Open this pdf to read about the basic idea of the process for converting to floating point representations.
Standardizations
Open this pdf to read about standardized formats for floating point numbers.
Magnitude and Precision
Magnitude refers to the raw size of a number: how large or small it can be. Precision refers to the number of digits of accuracy. The two measure different things: magnitude determines how many digits a number can have, while precision determines how many of those digits are accurate.
Often we don’t care much about the precision in very large numbers. For instance, the 2007 United States population was 302 million people. We don’t really mean exactly 302,000,000 , of course. The number is accurate to only three digits. Indeed, it would not be possible to be much more accurate, since the exact population is constantly changing.
When a computer stores a number in integer format, every digit is accurate. The largest integer that can be stored in 32 bits using two’s complement notation is 2^{31} - 1, which is 2,147,483,647. If an integer such as 15,431 is stored, we can be sure that each digit is correct. However, an integer such as 3,500,630,119 cannot be stored in 32 bits at all; it is too large.
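These limits are easy to verify; a quick sketch in Python:

```python
# Range of a 32-bit two's complement integer
INT32_MAX = 2**31 - 1

print(INT32_MAX)                   # 2147483647
print(15_431 <= INT32_MAX)         # True: fits, every digit exact
print(3_500_630_119 <= INT32_MAX)  # False: too large for 32 bits
```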
Floating point numbers generally do not have this precision property. The magnitude is determined by the exponent, while the precision is determined by the fractional part. Not all digits of a displayed floating point number may be accurate.
In IEEE format, the precision is 24 bits, including the hidden bit. This translates to about seven decimal digits of accuracy. The magnitude, however, is much larger. The biggest exponent in excess 127 notation that can be stored in eight bits is 127, so the largest number that can be represented has all 1’s in the fractional part and 127_{10} in the exponent: 1.11…1 * 2^{127}. This number is approximately 3.4 * 10^{38}.
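The largest single-precision value can be inspected from Python using the standard `struct` module, by building its bit pattern directly (a sketch, not part of the original text):

```python
import struct

# Bit pattern of the largest finite single-precision number:
# sign 0, exponent field 254 (127 after removing the excess of 127),
# fraction bits all 1's
bits = 0x7F7FFFFF
largest = struct.unpack("<f", struct.pack("<I", bits))[0]
print(largest)  # about 3.4 * 10**38
```

Packing the integer bit pattern and unpacking it as a 4-byte float reinterprets the same 32 bits under the IEEE single-precision layout.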
A number that is larger than 3.4 * 10^{38} cannot be stored in a computer using standard IEEE format. This is a huge number, but it is possible to exceed it. For example, a chess position offers on average about 35 legal moves, and the number of choices grows exponentially to produce more than 10^{50} possible board positions, a number too large even for standard IEEE floating point to hold.
It is important to realize that even when the magnitude allows a number to be stored, limited precision may mean that not all of its digits are preserved. For instance, suppose the budget for a large corporation is $632,785,417.25. This number can be stored in standard floating point, because its magnitude is much less than 3.4 * 10^{38}, but only about seven digits of accuracy will be preserved. The stored value will be approximately $632,785,400.00. The last several digits will probably be lost.
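The budget example can be reproduced by rounding the number through single precision with the standard `struct` module (a sketch; the helper name `to_float32` is chosen here, not part of the original text):

```python
import struct

def to_float32(x):
    """Round a Python float (binary64) through IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

budget = 632_785_417.25
stored = to_float32(budget)
print(stored)  # 632785408.0 -- the trailing digits are lost
```

At this magnitude, consecutive single-precision values are 64 apart (2^{29-23}), so the stored value is the budget rounded to the nearest multiple of 64.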
Other floating point formats are available that increase the precision (but probably not the magnitude). Regardless, you should always remember that a computer generated number is only accurate to a maximum number of digits (seven for standard IEEE format). Any digits beyond that maximum will not be reliable.
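As an illustration of the difference, the same value can be compared in single and double precision (a sketch using the standard `struct` module, not part of the original text):

```python
import struct

x = 0.1  # Python floats are IEEE double precision (binary64)
single = struct.unpack("<f", struct.pack("<f", x))[0]

print(f"{x:.17f}")       # 0.10000000000000001  (double: ~16 digits)
print(f"{single:.17f}")  # 0.10000000149011612  (single: ~7 digits)
```

Neither format stores 0.1 exactly, but the double-precision error appears many digits further out.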
Exercises
- For each, write the decimal number in normalized form; that is, in the form 1.xx…x * 2^{exponent} .
- 562
- 961
- 1055
- 2050
- -69
- -120
- 28.125
- -106.25
- For each, find the excess 127 form of the base 10 number.
- 7
- 38
- -8
- -19
- For each, write the decimal number in IEEE floating point format.
- 42.5
- 105.375
- -26 1/4
- -145.625
- 11/16
- -15/32
- For each, can the quantity be stored as a 32 bit integer? As a standard IEEE floating point number? In each case, if it can be stored, will the stored number be accurate? Explain your answers.
- The number of seconds in an hour.
- The number of seconds in a day.
- The number of seconds in a week.
- The number of seconds in the month of March.
- The number of seconds in a non-leap year.
- The number of seconds in a century.
- The number 3.141592653 (the first ten digits of the number π (pi)).
- The number 2.718281828 (the first ten digits of the number e).
- The distance from the Sun to the Earth, expressed in miles.
Credits and licensing
This article is by Robert P. Webber and Scott McElfresh, licensed under a Creative Commons BY-SA 3.0 license.
Version 2016-Mar-14 10:00