2.3 The IEEE 754 Standard

In this section, a small part of the IEEE 754 standard is introduced. We focus on the number formats used in this dissertation, see Table (2.2). Notice that this is the newest revision of the standard, named IEEE 754-2008. The standard also defines decimal formats, but they are irrelevant for us; there are also optional formats (for example single extended) and one that is currently not supported by any hardware (the 256-bit format). Every format in Table (2.2) except the double extended type uses the hidden leading bit convention. The 32-bit single and 64-bit double formats are supported by most hardware. We use the half format later in this dissertation for demonstration purposes.

The 80-bit format is mostly used by Intel's FPU register stack. The FPU has eight 80-bit wide registers, and each register can store one floating-point value. Of course, this hardware can also load and store the 32- and 64-bit formats; this is controlled by a special control register. Software developers typically use this feature to perform temporary calculations inside the FPU with higher precision.

Special values

The IEEE 754-2008 standard specifies some special values. These are the zero, infinity, and not-a-number symbols. Now let n = (−1)^s × m × β^e denote the floating-point number that we would like to store. The exponents of finite numbers are biased: the real exponent e of n is incremented by a fixed value (the bias) in such a way that the stored exponent is a non-negative integer. Let E be the stored, biased version of e, and M the integer formed by the bits of the significand. If E consists of only 1 bits, n is an infinity or a not-a-number symbol. So there are the following cases:

• E > 0 and E has at least one 0 bit: n is a normal number.

• E = 0: In this case the number is either a zero or a subnormal number. If M = 0, then |n| = 0. Notice that the s bit can be 1 or 0, so we can also represent the −0 symbol. If M > 0, then n is a subnormal number, and e = E + e_min.

• E contains only 1 bits: This is not a finite number; the symbol depends on M:

– M = 0: Then n is −∞ or +∞, depending on s.

– M > 0: Then n is a not-a-number (NaN) symbol. This symbol can represent uninitialized numbers, or the result of an invalid operation (for example √−1).

For example, a half-precision subnormal number with significand 0.1640625 (hidden bit = 0) represents the value 0.1640625 × 2^−14 = 1.0013580322265625 × 10^−5.

The CPU rounds every number that is not a p-length floating-point number. For example, let us consider the following bit pattern, divided into sections:

1.101000 1000 0 …1…

Sections A (the first 6 fraction bits) and B (the next 4 bits) are 10 bits long in total; then there is a 0, and a third section that contains at least one 1. Now imagine what happens if we round this number to 10 digits. We have to look at the 11th bit, which is 0, so the number is rounded down; the result is:

1.101000 1000

In the next step, round this number to 6 digits: the next, 7th bit is 1, and it is followed only by zeros. This means the number is exactly halfway between the two possible results, so the round-to-nearest-even strategy is applied: we have to round to the number that has 0 at the 6th bit, so the result is the following:

1.101000

Now let us go back to the original number and round it to 6 digits. The next, 7th digit is 1, and there is at least one additional 1 after this bit, so we round up; the result is:

1.101001

We have got two different results. Although this is an example with short bit patterns, the same can happen in real-life software. The traditional 32-bit Intel CPUs used the FPU to perform floating-point calculations. As we mentioned earlier, this FPU has 80-bit wide registers, which can store 32-, 64- and 80-bit floating-point numbers. With default settings on the 32-bit architecture, the compiled code uses these registers: the program loads the 64-bit double variables into the FPU and converts them to the 80-bit format.

The calculations are performed with this more precise format. In this code (see lines 9 and 14), additions and multiplications are executed. The CPU rounds the result of each elementary operation and stores the rounded 80-bit value in the FPU's registers.

When line 9 is executed, or when the for loop finishes, the CPU stores the result back into the variable c; but c is a 64-bit variable, so a new rounding is necessary. Of course, both compiled programs are 64-bit binaries, but in the first case the compiler option -mfpmath=387 forces the compiler to use the FPU.

Let us investigate the code of Figure (2.5). This C++ code is compiled in two ways: in both cases, we build an x86 64-bit Linux executable binary, so both executables run in 64-bit mode. The first version is compiled with the -mfpmath=387 option, the second is not. As can be seen, the outputs of the two versions differ.

Without that option, the compiler generates code that does not use the FPU, because the 64-bit Intel architectures have the newer SIMD (SSE) registers, which are more flexible than the old-fashioned FPU. These registers can store only 32- or 64-bit floating-point numbers, so if the code uses them, the temporary results of the calculation are always rounded to 64 bits, and there is no additional rounding at the end.

 1  #include <iostream>
 2  #include <cstdlib>
 3
 4  int main() {
 5      srand(0);
 6      std::cout.precision(23);
 7      double a = 9223372036854775808.0;
 8      double b = 1024.25;
 9      double c = a + b;
10      std::cout << c << std::endl;
11      for (int i = 0; i < 10000; i++) {
12          a = rand() % 10000000000;
13          b = rand() % 10000000000;
14          c += a * b;
15      }
16      std::cout << c << std::endl;
17      return 0;
18  }

The output, with compile option -mfpmath=387:

9223372036854775808
11360200579421323657216

The output, with default options (SSE, 64-bit):

9223372036854777856
11360200579421325754368

Figure 2.5: A short demo code in C++ to demonstrate double rounding. Compiled with gcc version 9.1.0.