Floating Point Numbers
Floating point numbers are used to represent numbers with a fractional part, for example, 3.256, 2.1, 0.0036, and so forth. They are used in most engineering and technical calculations. The most common floating point standard is IEEE 754, according to which floating point numbers are represented with 32 bits (single precision) or 64 bits (double precision).
In this section we look at the format of 32-bit floating point numbers only and see how mathematical operations can be performed with such numbers.
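According to the IEEE standard, a 32-bit floating point number consists of three fields, listed here from the most significant bit down: a 1-bit sign (bit 31), an 8-bit exponent (bits 30 to 23), and a 23-bit mantissa (bits 22 to 0).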
The most significant bit indicates the sign of the number, where 0 indicates the number is positive, and 1 indicates it is negative.
The 8-bit exponent gives the power of 2 by which the mantissa is scaled. To make the calculations easy, the sign of the exponent is not stored; instead, the exponent is stored in excess-127 (biased) form. Thus, to find the real exponent we have to subtract 127 from the stored exponent. For example, if the exponent is “10000000,” the real value of the exponent is 128 – 127 = 1.
The mantissa is 23 bits wide, and its bits represent increasing negative powers of 2, starting with 2^-1 for the leftmost bit. For example, if we assume that the mantissa is “11100000000000000000000,” the value of this mantissa is calculated as 2^-1 + 2^-2 + 2^-3 = 7/8.
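The three fields described above can be pulled out of a real number with a few shifts and masks. The following is a minimal sketch in Python, assuming the standard `struct` module is used to obtain the 32-bit pattern; the function names `split_fields` and `mantissa_value` are illustrative and not part of any standard.

```python
import struct

def split_fields(value: float):
    """Return (sign, exponent_field, mantissa_field) of value as a 32-bit float."""
    # Pack as big-endian single precision, then reread the same 4 bytes
    # as an unsigned 32-bit integer so the bits can be masked.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30..23, still in biased form
    mantissa = bits & 0x7FFFFF       # bits 22..0
    return sign, exponent, mantissa

def mantissa_value(mantissa: int) -> float:
    """Value of the 23 stored mantissa bits as a sum of negative powers of 2."""
    return mantissa / 2**23

sign, exponent, mantissa = split_fields(-3.75)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# 1 10000000 11100000000000000000000
print(exponent - 127, mantissa_value(mantissa))
# 1 0.875
```

The value -3.75 was chosen because it reproduces the two examples in the text: a stored exponent of 10000000 (real exponent 1) and a mantissa of 7/8.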
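The decimal equivalent of a floating point number can be calculated using the formula:
number = (–1)^s × 2^(e – 127) × 1.f
where s is the sign bit, e is the stored 8-bit exponent, and f is the 23-bit mantissa. The leading 1 in 1.f is implied and is not stored in the number.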
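As a minimal sketch of this calculation, the function below evaluates (-1)^s × 2^(e - 127) × 1.f for a normalized number, where s, e, and f are the sign, stored exponent, and mantissa fields taken as integers. The names `decode` and `decode_with_struct` are illustrative; the second function only cross-checks the result against Python's own single-precision decoding. Exponent fields 0 and 255 are special cases in the standard and are deliberately not handled here.

```python
import struct

def decode(s: int, e: int, f: int) -> float:
    """Value of a normalized 32-bit float from its sign, exponent, and mantissa fields."""
    if not 1 <= e <= 254:
        raise ValueError("exponent fields 0 and 255 are special cases, not handled here")
    # 1 + f / 2**23 is the mantissa with its implied leading 1.
    return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2**23)

def decode_with_struct(s: int, e: int, f: int) -> float:
    """Reference result: assemble the 32-bit word and let struct interpret it."""
    bits = (s << 31) | (e << 23) | f
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(decode(0, 128, 0x700000))   # 3.75
print(decode(1, 127, 0))          # -1.0
```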
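The smallest positive normalized number in 32-bit floating point format is:
0 00000001 00000000000000000000000
This number is 2^-126, or approximately 1.175 × 10^-38.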
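The smallest positive normalized value can be checked directly. The sketch below assumes that "smallest" means the pattern with a sign bit of 0, a stored exponent of 00000001, and an all-zero mantissa (denormalized numbers, which are smaller still, are left out). It uses Python's `struct` module to interpret that pattern.

```python
import struct

# Sign 0, stored exponent 00000001, mantissa all zeros.
bits = 0x00800000
(smallest,) = struct.unpack(">f", struct.pack(">I", bits))

print(smallest == 2.0 ** -126)   # True
print(f"{smallest:.4e}")         # 1.1755e-38
```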
Converting a Floating Point Number into Decimal
To convert a given floating point number into decimal, we have to find the mantissa and the exponent of the number and then convert into decimal as just shown.
Some examples are given here.
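The conversion steps can also be sketched in code. The function below is an illustrative helper (its name and the spaced bit-string input format are assumptions, not taken from the text) that reads the sign, removes the bias of 127 from the exponent, sums the mantissa bits as negative powers of 2, adds the implied leading 1, and combines the three parts. It handles normalized numbers only.

```python
def float_bits_to_decimal(pattern: str) -> float:
    """Convert a 32-bit pattern such as '0 10000001 0100...0' into its decimal value."""
    bits = pattern.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected 32 binary digits")
    sign = int(bits[0])
    exponent = int(bits[1:9], 2) - 127           # remove the bias
    # Mantissa bit k (counting from 1 at the left) contributes 2**-k.
    fraction = sum(2.0 ** -k for k, b in enumerate(bits[9:], start=1) if b == "1")
    mantissa = 1 + fraction                      # implied leading 1
    return (-1) ** sign * mantissa * 2.0 ** exponent

print(float_bits_to_decimal("0 10000001 01000000000000000000000"))  # 5.0
print(float_bits_to_decimal("1 01111110 10000000000000000000000"))  # -0.75
```

In the first call the stored exponent is 129, so the real exponent is 2, and the mantissa is 1 + 2^-2 = 1.25, giving 1.25 × 4 = 5. In the second, the sign bit is 1, the real exponent is 126 – 127 = -1, and the mantissa is 1.5, giving -0.75.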