

Low-level optimizations

Algorithm 5.4 Vector addition using simplified relative tolerance
Input: a, b, λ
Output: a

1: for i := 1 to n
2:   c := a_i + λb_i
3:   if |a_i|ε_r ≥ |c| then
4:     c := 0
5:   end
6:   a_i := c
7: end
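The pseudo-code above can be sketched directly in C++ (a minimal scalar sketch; the function name `stable_add_rel` and the parameter names are ours):

```cpp
#include <cmath>
#include <cstddef>

// Algorithm 5.4: a := a + lambda*b with the simplified relative tolerance.
// If |a_i| * eps_r >= |a_i + lambda*b_i|, the sum is treated as a
// cancellation artifact and replaced by zero.
void stable_add_rel(double* a, const double* b, std::size_t n,
                    double lambda, double eps_r) {
    for (std::size_t i = 0; i < n; ++i) {
        double c = a[i] + lambda * b[i];
        if (std::fabs(a[i]) * eps_r >= std::fabs(c)) {
            c = 0.0;  // result is dominated by rounding error; flush it
        }
        a[i] = c;
    }
}
```

The branch in the loop body is exactly the conditional jump that the SIMD versions below are designed to avoid.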

More conditional branching breaks the pipeline mechanism, so these implementations are slower than the naive one.

The simplified and Orchard-Hays's methods give the same output if |a_i| = |λb_i| or max{|a_i|, |λb_i|} = |a_i|. They differ only if max{|a_i|, |λb_i|} = |λb_i| and a_i(λb_i) < 0.
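To make the difference concrete, take ε = 0.02, λ = 1, a = −0.495, b = 0.505, so c = 0.01. Then |a|ε = 0.0099 < |c| ≤ |b|ε = 0.0101: the simplified method keeps c, while Orchard-Hays's method flushes it to zero. A small sketch (the function names are ours, and λ = 1 is fixed for brevity):

```cpp
#include <algorithm>
#include <cmath>

// Simplified method: zero the sum when |a| * eps dominates it.
double add_simplified(double a, double b, double eps) {
    double c = a + b;
    return (std::fabs(a) * eps >= std::fabs(c)) ? 0.0 : c;
}

// Orchard-Hays's method: compare against the larger operand instead.
double add_orchard_hays(double a, double b, double eps) {
    double c = a + b;
    double m = std::max(std::fabs(a), std::fabs(b));
    return (m * eps >= std::fabs(c)) ? 0.0 : c;
}
```

With the inputs above, `add_simplified` returns a value near 0.01 while `add_orchard_hays` returns exactly zero.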

Figure (5.2) shows four examples. Subfigures (5.2a) and (5.2b) represent two examples of the functioning of Orchard-Hays's method. In these cases, the magnitude of a is greater than the magnitude of b, so |a|ε is chosen by both methods in the comparison.

However, in subfigures (5.2c) and (5.2d) |b| is greater than |a|, so we can observe the difference between the two methods: there is a small range between |a|ε and |b|ε where the simplified method does not move the result to zero, unlike Orchard-Hays's method.

Namely, the output of the two methods is different if

|a|ε < |a + b| ≤ |b|ε.

Note that the two results can differ not only when a and b are close to zero.

Figure 5.3: The red area is the small range where the two methods give different results. Orchard-Hays's method moves the result to zero in the blue and red ranges, while the simplified method moves it to zero only in the blue area.

(a) Orchard-Hays's method if b < 0 and a > 0.

(b) Orchard-Hays's method if b > 0 and a > 0.

(c) The simplified method if a < 0, b > 0, and |a| < b.

(d) The simplified method if a < 0, b > 0, and |a| < b.

Figure 5.2: Comparison of Orchard-Hays's and the simplified addition algorithms. For the sake of simplicity, λ = 1 and ε = 0.02.

5.2.1 SIMD vector addition

As conditional jumping breaks the pipeline mechanism, it slows down the execution of the program. It is useful to implement the algorithms in a way that is free from conditional jumping. Intel’s SIMD architecture contains several instructions that help us design such an implementation. We will use the following atomic instructions:

• Move: Moves the content of a register to another register.

• Multiply: Multiplies the number pairs of two registers.

• Add: Adds the number pairs of two registers.

• And: Performs a bitwise AND between two registers.

• Compare: Compares the number pairs of two registers. If they are identical, the destination register will contain a bit pattern filled with 1's (the result is NaN), otherwise 0.

• Max: Chooses the larger of two numbers stored in two registers. It is used for the implementation of Orchard-Hays’s addition method.

The detailed description of these instructions can be found in Intel's instruction set documentation [37]. The key point of the conditional-jump-free implementations (called accelerated stable addition in this paper) is the compare instruction. It compares the numbers and stores the results in a register. If the registers contain two double pairs, the comparator places two bit patterns into the destination area: a pattern is filled with 1's if the result of the comparison is true, and with 0's otherwise, as shown in Figure (5.4). These bit patterns can be used for masking.

Figure 5.4: The compare instruction of the SSE2 instruction set
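The masking behavior of the compare instruction can be reproduced in C++ with SSE2 intrinsics; a minimal sketch (the function and variable names are ours):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Compare two double pairs; each lane where x < y becomes an all-ones
// bit pattern (which reads back as NaN), other lanes become all zeros.
// A bitwise AND with the mask then selects values lane-by-lane.
void cmplt_mask_demo(const double x[2], const double y[2], double out[2]) {
    __m128d vx = _mm_loadu_pd(x);
    __m128d vy = _mm_loadu_pd(y);
    __m128d mask = _mm_cmplt_pd(vx, vy);  // all-1s where x < y, else all-0s
    __m128d kept = _mm_and_pd(mask, vx);  // keep x where x < y, else +0.0
    _mm_storeu_pd(out, kept);
}
```

For x = {1.0, 3.0} and y = {2.0, 2.0} the first lane keeps 1.0 and the second lane is cleared to zero, without any conditional jump.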

In this section, we refer to the XMM registers only, for easier discussion; note, however, that the AVX and AVX-512 implementations use the YMM and ZMM registers, respectively. Moreover, we first describe the algorithm as if we implemented it in assembly language. However, as we will see in Section (5.4), this is not the cleverest way: it is much better to let a modern C++ compiler generate the assembly-level code, and that code will likely use the registers differently.

Figures (5.5) and (5.6) show the flowchart of the SSE2 versions of our stable add operations with relative and absolute tolerances. The algorithms add two number pairs loaded to registers XMM0 and XMM1. The final result goes into XMM2.

The implementations have two major phases: initialization and processing. We prepare certain registers to store the constant values of λ (XMM7), ε_r (XMM4), ε_a (XMM6), and the absolute value mask (XMM5). In the processing part, we perform the stable add operations for the successive number pairs without modifying registers XMM4-XMM7.

Figures (5.5) and (5.6) show only one iteration in the processing phase. One iteration of the absolute tolerance stable adder performs these 6 steps:

1. Multiply XMM1 and XMM7, store the result in XMM1; XMM1 will store λb_i.

2. Add XMM1 to XMM0, so XMM0 stores c = a_i + λb_i.

3. Move XMM0 to XMM2. We have to store the original value of c in order to use its absolute value in later steps.

4. Bitwise AND between XMM2 and XMM5, store the result in XMM2. Therefore XMM2 stores |c|.

5. Now we have to compare |c| and ε_a. If |c| ≥ ε_a, then the CPU sets the bits of the corresponding floating-point number in XMM2, otherwise it clears them.

6. Bitwise AND between XMM2 and XMM0. After this step, if |c| < ε_a then XMM2 stores zero, because of the cleared bit mask in XMM2; otherwise XMM2 stores c.
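One iteration of the absolute tolerance adder maps naturally onto SSE2 intrinsics. The sketch below is ours (function and variable names included); it processes one pair of doubles per call and implements the stated behavior that the sum is flushed to zero where |c| < ε_a. A real implementation would hoist the constant registers out of the loop, as the text describes:

```cpp
#include <emmintrin.h>

// Absolute-tolerance stable add for one double pair:
// c = a + lambda*b, flushed to zero where |c| < eps_a.
void stable_add_abs_sse2(double a[2], const double b[2],
                         double lambda, double eps_a) {
    const __m128d vlambda = _mm_set1_pd(lambda);   // plays the role of "XMM7"
    const __m128d veps    = _mm_set1_pd(eps_a);    // "XMM6"
    // Absolute value mask ("XMM5"): all bits set except the sign bit.
    const __m128d absmask =
        _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));

    __m128d vb   = _mm_mul_pd(_mm_loadu_pd(b), vlambda);  // lambda*b_i
    __m128d vc   = _mm_add_pd(_mm_loadu_pd(a), vb);       // c = a_i + lambda*b_i
    __m128d vabs = _mm_and_pd(vc, absmask);               // |c|
    __m128d keep = _mm_cmpge_pd(vabs, veps);              // all-1s where |c| >= eps_a
    _mm_storeu_pd(a, _mm_and_pd(keep, vc));               // c, or +0.0
}
```

No branch is taken anywhere: lanes that fail the tolerance test are zeroed purely by the AND with the cleared mask.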

The stable add operation that uses relative tolerance performs the following 9 steps in one iteration:

1. Multiply XMM1 and XMM7, store the result in XMM1, XMM1 will storeλbi.

2. Move XMM0 to XMM2. We have to store the original values of a_i and λb_i in order to use their absolute values in the later steps.

3. Add XMM1 to XMM2, so XMM2 stores c = a_i + λb_i.

4. Move XMM2 to XMM3, because we will use the absolute value of c in the next steps, but we will also need the original value of c.

5. Bitwise AND between XMM3 and XMM5, store the result in XMM3. Therefore XMM3 stores |c|.

6. Bitwise AND between XMM0 and XMM5; XMM0 stores |a_i|.

7. Multiply XMM0 and XMM4, and store the result in XMM0, so XMM0 stores |a_i|ε_r.

8. Now we have to compare |a_i|ε_r and |c|. If |a_i|ε_r < |c|, then the CPU sets the bits of the corresponding floating-point number in XMM0, otherwise it clears them.

9. Bitwise AND between XMM2 and XMM0. After this step, if |a_i|ε_r ≥ |c| then XMM2 stores zero, because of the cleared bit mask in XMM0.
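The nine steps above can likewise be sketched with SSE2 intrinsics (our naming; one double pair per call, constants kept inline for readability rather than hoisted):

```cpp
#include <emmintrin.h>

// Relative-tolerance stable add for one double pair:
// c = a + lambda*b, flushed to zero where |a| * eps_r >= |c|.
void stable_add_rel_sse2(double a[2], const double b[2],
                         double lambda, double eps_r) {
    const __m128d vlambda = _mm_set1_pd(lambda);   // "XMM7"
    const __m128d veps    = _mm_set1_pd(eps_r);    // "XMM4"
    const __m128d absmask =                        // "XMM5": clears sign bits
        _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));

    __m128d va    = _mm_loadu_pd(a);                          // a_i      ("XMM0")
    __m128d vb    = _mm_mul_pd(_mm_loadu_pd(b), vlambda);     // lambda*b_i
    __m128d vc    = _mm_add_pd(va, vb);                       // c        ("XMM2")
    __m128d vcabs = _mm_and_pd(vc, absmask);                  // |c|      ("XMM3")
    __m128d vaeps = _mm_mul_pd(_mm_and_pd(va, absmask), veps); // |a_i|*eps_r
    __m128d keep  = _mm_cmplt_pd(vaeps, vcabs);  // all-1s where |a_i|*eps_r < |c|
    _mm_storeu_pd(a, _mm_and_pd(keep, vc));      // c where kept, else +0.0
}
```

Again, the tolerance test is resolved entirely by the compare-generated bit mask, with no conditional jump in the instruction stream.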

Each of the operations above corresponds to exactly one SSE2 or AVX instruction, so the reader can easily reproduce our results. These implementations use several additional operations on top of the single addition and multiplication, so they have some overhead compared to the naive implementation. They use a few additional bit-masking steps because Intel's SIMD instruction sets (except AVX-512) have no absolute value operation. However, we can obtain the absolute value of a floating-point number by clearing its sign bit.

Therefore, we have to apply the bit-masking technique to obtain the absolute values, as in steps 5-7 of the relative tolerance adder and step 4 of the absolute tolerance adder.
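The sign-bit trick is easy to demonstrate in scalar code as well; a sketch (the function name `bitmask_fabs` is ours):

```cpp
#include <cstdint>
#include <cstring>

// Absolute value of a double by clearing the sign bit, as the SIMD
// versions do with a bitwise AND against the absolute value mask.
double bitmask_fabs(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);    // reinterpret the double's bits
    bits &= ~(std::uint64_t{1} << 63);      // clear bit 63, the sign bit
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}
```

The `memcpy` round-trip is the standard, strict-aliasing-safe way to reinterpret the bits in C++; the compiler optimizes it away.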

However, SSE2 performs every instruction on two number pairs in parallel, so this overhead is not significant. Moreover, AVX can execute the instructions on 4 number pairs (and AVX-512 doubles this amount), so the overhead is even lower. In order to improve the speed of the algorithms, our implementations utilize the two memory ports mentioned in Section (5.1): while one number pair is being processed, the next pair is loaded into other, unused registers, so the delay of memory access is decreased. We also use this technique in our dot-product implementations.

We modified the relative tolerance adder procedure described above to implement Orchard-Hays's method. Two additional steps are inserted after step 6:

1. Bitwise AND between XMM1 and XMM5; XMM1 stores |λb_i|.

2. Use the MAX operation between XMM0 and XMM1; XMM0 stores max{|a_i|, |λb_i|}.
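In intrinsic form the two extra steps amount to one additional AND and one `_mm_max_pd` before the comparison. A sketch of the resulting Orchard-Hays adder, under the same naming assumptions as our earlier sketches:

```cpp
#include <emmintrin.h>

// Orchard-Hays stable add for one double pair:
// c = a + lambda*b, flushed to zero where max{|a|, |lambda*b|} * eps >= |c|.
void stable_add_oh_sse2(double a[2], const double b[2],
                        double lambda, double eps) {
    const __m128d vlambda = _mm_set1_pd(lambda);
    const __m128d veps    = _mm_set1_pd(eps);
    const __m128d absmask =
        _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));

    __m128d va    = _mm_loadu_pd(a);
    __m128d vb    = _mm_mul_pd(_mm_loadu_pd(b), vlambda);  // lambda*b_i
    __m128d vc    = _mm_add_pd(va, vb);                    // c
    __m128d vcabs = _mm_and_pd(vc, absmask);               // |c|
    __m128d vmax  = _mm_max_pd(_mm_and_pd(va, absmask),    // the two extra steps:
                               _mm_and_pd(vb, absmask));   // max{|a_i|, |lambda*b_i|}
    __m128d keep  = _mm_cmplt_pd(_mm_mul_pd(vmax, veps), vcabs);
    _mm_storeu_pd(a, _mm_and_pd(keep, vc));
}
```

With a = −0.495, b = 0.505, λ = 1, ε = 0.02 (the example from before), this version zeroes the first lane where the simplified adder would keep it.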

5.3 Vector dot-product

The dot-product between two n-dimensional vectors a and b is defined as:

a^T b = Σ_{i=1}^{n} a_i b_i.

Algorithm (5.5) shows the pseudo-code of its naive implementation. The problem is that the add operation in line 3 can cause a cancellation error if the operands have different signs.

Algorithm 5.5 Naive dot-product