
[Figure 5.7 diagram: a product prod (sign bit S, exponent E, significand M) is added to either the negative sum or the positive sum, the target address being selected by adding the sign bit to the base address.]

union Number {
    double num;
    unsigned long long int bits;
} number;

double negpos[2] = {0.0, 0.0};
[...]
const double prod = a * b;
number.num = prod;
*(negpos + (number.bits >> 63)) += prod;

Figure 5.7: Handling positive and negative sums with pointer arithmetic without branching, where S is the sign bit, M is the significand and E is the exponent.
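To make the role of this snippet clearer, the following self-contained sketch shows how it might sit inside a complete dot-product loop; the function name and the final combination of the two partial sums are illustrative assumptions, not the exact implementation of Figure 5.8.

#include <cstddef>

// Illustrative sketch: accumulate negative and positive products in separate
// sums, using the sign bit of each product to select the accumulator
// instead of a branch.
double dotProductSigned(const double *a, const double *b, std::size_t count) {
    union Number {
        double num;
        unsigned long long int bits;
    } number;

    double negpos[2] = {0.0, 0.0};  // negpos[0]: positive sum, negpos[1]: negative sum
    for (std::size_t i = 0; i < count; i++) {
        const double prod = a[i] * b[i];
        number.num = prod;
        // Bit 63 is the sign bit: 0 for non-negative products, 1 for negative ones.
        *(negpos + (number.bits >> 63)) += prod;
    }
    return negpos[0] + negpos[1];   // combine the partial sums at the end
}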

5.4 Implementation details

The obvious way to implement these algorithms is to use an assembler. The elementary SIMD operations introduced in this chapter correspond to specific machine-code instructions. However, this method has numerous disadvantages:

• Performance: Although writing software in assembly was a good way to optimize for speed and size until the 1990s, it is no longer state-of-the-art programming. Modern compilers produce much better machine code than most programmers can write in assembly. Today's CPUs are much more complex than those of previous decades: they use caches, branch predictors, instruction pipelines, and so on, and certain instructions can either help or hinder these features. A programmer needs very strong and specific knowledge to write better code than a compiler can produce.

Figure 5.8: Flow chart of the stable dot product implementation. Arrow numbers show the order of the operations.

• Portability: There are major differences between Intel's 32-bit and 64-bit architectures. The operations can take different arguments, and moreover, there are new instructions in 64-bit mode. For example, x86 CPUs do not support operations on 64-bit integers in 32-bit mode (this feature is emulated by the compiler or compiler libraries). If we would like to port 32-bit assembly code to 64 bit, we have to rewrite it with different instructions. The 64-bit world is even more complicated: Unix/Unix-like and Windows systems use different calling conventions. Under Windows, the first four integer arguments are passed in the registers RCX, RDX, R8, and R9, whereas the System V AMD64 ABI, used by Unix and Unix-like systems such as Linux, Solaris, FreeBSD, and macOS, passes the first six integer arguments in RDI, RSI, RDX, RCX, R8, and R9. It is clear that if we would like to support both Windows and Linux, we have to implement the 64-bit code twice.

• Maintainability: Assembly code is very hard to read, implement, and modify. Higher-level languages like C++ support clean code conventions [63], so it is much easier to write code that is easy to read and can serve as its own documentation. Moreover, it is easier to write flawless code in C++ than in assembly.

The compilers can utilize the SIMD instructions on their own, but sometimes we have to give them a bit of help, especially when our algorithm is complex. Intel's solution to this problem is intrinsic instructions. These special compiler instructions fit into the higher-level source code and suggest to the compiler how it should generate the machine code. Figure (5.9) shows a simple sample code. The special data type __m128d stores 2 double type variables and indicates that dataA1, dataB1, and other similar variables should be placed into SIMD registers. The _mm_load_pd intrinsic loads data from memory into these variables, in this case into the registers. Likewise, _mm_add_pd and _mm_mul_pd add and multiply numbers between these registers. There are other instructions as well; we can use bit operations, comparisons, and more.
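As a minimal, self-contained illustration of the intrinsics mentioned above (independent of Figure 5.9; the function and variable names are only examples), the following fragment computes c[0..1] = a[0..1] + b[0..1] * lambda in a single SIMD register:

#include <emmintrin.h>  // SSE2 intrinsics

// Minimal sketch: one vector addition with a scaled operand.
void addScaledPair(const double *a, const double *b, double *c, double lambda) {
    __m128d mul   = _mm_set_pd1(lambda);                        // broadcast lambda into both lanes
    __m128d dataA = _mm_load_pd(a);                             // load a[0], a[1] (16-byte aligned)
    __m128d dataB = _mm_load_pd(b);                             // load b[0], b[1]
    __m128d dataC = _mm_add_pd(dataA, _mm_mul_pd(dataB, mul));  // c = a + b * lambda
    _mm_store_pd(c, dataC);                                     // store c[0], c[1]
}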

There is one more implementation problem: our SIMD implementations have two versions, one that uses the cache to store the result and one that does not. The only difference between the two versions is a single instruction: the non-cache versions use movnt instructions instead of the simple data moving operations. According to the clean code principles, writing these functions twice (with one small difference) should clearly be avoided. The C++ language gives certain tools to solve this problem. Somehow, we have to pass the storing method to our function without loss of speed, so branching is inappropriate. The traditional way is to use a function pointer or a function object. However, our measurements show a performance loss with these methods; the problem is that the compiler has to generate a single machine-code function that somehow takes the current storing method into account. The template system of C++ can help us: if the storing method is a template argument, the compiler generates a different version of our function for each storing method, and each function can be simpler (and faster). In our solution, we used locally defined structs with inline static functions as template parameters. For example, in Figure (5.10) the denseToDenseAddSSE2_cache function creates the CACHE_WRITER struct with the write function inside. This function implements the simple store method using the cache. The struct is passed as a template argument to the denseToDenseAddSSE2_temp template function, which calls the write method in code lines 15, 21 and 30 of Figure (5.9). Additional sample codes can be found in Appendix E.
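The mechanism itself can be shown in a reduced, SIMD-free form; the names below (copyScaled_temp, copyScaled_cache, CACHE_WRITER) are only illustrative, not part of the actual implementation. Because the writer is a template parameter with a static inline function, the compiler instantiates a separate function for each writer and can inline the store, which a function pointer would prevent.

#include <cstddef>

// Generic skeleton: the WRITER policy decides how a value is stored.
template <class WRITER>
void copyScaled_temp(const double *src, double *dst, std::size_t count, double lambda) {
    for (std::size_t i = 0; i < count; i++) {
        WRITER::write(dst + i, src[i] * lambda);  // resolved at compile time and inlined
    }
}

void copyScaled_cache(const double *src, double *dst, std::size_t count, double lambda) {
    // Locally defined policy struct with an inline static write function.
    struct CACHE_WRITER {
        inline static void write(double *address, double value) { *address = value; }
    };
    copyScaled_temp<CACHE_WRITER>(src, dst, count, lambda);
}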

For performance reasons, in Figures (5.9) and (5.10) the array pointers among the function arguments have the __restrict__ modifier. This modifier tells the compiler that the arrays do not overlap each other, so the compiler can generate slightly faster machine code.
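The promise can be illustrated with a scalar sketch (the function name is arbitrary): without __restrict__ the compiler must assume that c may alias a or b and therefore reload data after every store, whereas with __restrict__ it may keep values in registers and vectorize more freely.

#include <cstddef>

// Illustrative only: the caller promises that a, b and c point to disjoint arrays.
// If they overlapped, the optimized code could produce different results,
// so violating the promise is undefined behaviour.
void addScaled(const double * __restrict__ a,
               const double * __restrict__ b,
               double * __restrict__ c,
               std::size_t count, double lambda) {
    for (std::size_t i = 0; i < count; i++) {
        c[i] = a[i] + b[i] * lambda;
    }
}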

The SSE2, AVX, and AVX-512 implementations have their own special hardware requirements. If the current CPU supports only SSE2 instructions, the software will not use the AVX functions. However, these C++ intrinsics need some GCC compiler options: -msse2, -mavx, and -mavx512f -mavx512dq. But if we apply these options to every function,

 1 template <class WRITER>
 2 void denseToDenseAddSSE2_temp(const double * __restrict__ a,
 3                               const double * __restrict__ b,
 4                               double * c,
 5                               size_t count, double lambda) {
 6     const size_t rem1 = count % 4;
 7     size_t i;
 8     __m128d mul = _mm_set_pd1(lambda);
 9     for (i = 0; i < count - rem1; i += 4) {
10         // c = a + b * lambda
11         __m128d dataA1 = _mm_load_pd(&a[i]);
12         __m128d dataB1 = _mm_load_pd(&b[i]);
13         __m128d dataC1 = _mm_add_pd(dataA1,
14                                     _mm_mul_pd(dataB1, mul));
15         WRITER::write(c + i, dataC1);
16
17         __m128d dataA2 = _mm_load_pd(&a[i + 2]);
18         __m128d dataB2 = _mm_load_pd(&b[i + 2]);
19         __m128d dataC2 = _mm_add_pd(dataA2,
20                                     _mm_mul_pd(dataB2, mul));
21         WRITER::write(c + i + 2, dataC2);
22
23     }
24     const size_t secondCount = rem1 / 2;
25     if (secondCount) {
26         __m128d dataA = _mm_load_pd(&a[i]);
27         __m128d dataB = _mm_load_pd(&b[i]);
28         __m128d dataC = _mm_add_pd(dataA,
29                                    _mm_mul_pd(dataB, mul));
30         WRITER::write(c + i, dataC);
31         i += 2;
32     }
33     for (; i < count; i++) {
34         c[i] = a[i] + b[i] * lambda;
35     }
36 }

Figure 5.9: The source code of the naive SSE2 vector-vector addition template function.

1 extern "C" void denseToDenseAddSSE2_cache ( 2 const double * __restrict__ a ,

3 const double * __restrict__ b , 4 double * c ,

5 size_t count , double lambda ) { 6 struct CACHE_WRITER {

7 inline static void write (double * address ,

8 __m128d & value ) {

9 _mm_store_pd ( address , value );

10 }

11 };

12 denseToDenseAddSSE2_temp < CACHE_WRITER >(a , b , c ,

13 count , lambda );

14 } 15

16 extern "C" void denseToDenseAddSSE2_nocache ( 17 const double * __restrict__ a ,

18 const double * __restrict__ b , 19 double * c ,

20 size_t count , double lambda ) { 21 struct NOCACHE_WRITER {

22 inline static void write (double * address ,

23 __m128d & value ) {

24 _mm_stream_pd ( address , value );

25 }

26 };

27 denseToDenseAddSSE2_temp < NOCACHE_WRITER >(a , b , c ,

28 count , lambda );

29 }

Figure 5.10: The source code of the naive SSE2 vector-vector addition functions, in caching and non-caching versions.

the compiler may use AVX-512 instructions in our SSE2 and AVX implementations as well, because it considers them faster. In order to avoid this, we have to put the SSE2, AVX, and AVX-512 codes into three different libraries, and we have to build these libraries with different compiler options.
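One possible way to combine the three separately compiled libraries is a small runtime dispatcher. The sketch below is only an assumption about how this could look: the AVX and AVX-512 function names are hypothetical, the GCC options are the ones listed above, and __builtin_cpu_supports is a GCC built-in for runtime CPU feature detection.

#include <cstddef>

// The three implementations live in separate translation units (libraries),
// each compiled with its own option set, e.g.
//   g++ -c -msse2               dense_sse2.cpp
//   g++ -c -mavx                dense_avx.cpp
//   g++ -c -mavx512f -mavx512dq dense_avx512.cpp
extern "C" void denseToDenseAddSSE2_cache(const double *a, const double *b,
                                          double *c, std::size_t count, double lambda);
// Hypothetical names for the AVX and AVX-512 counterparts:
extern "C" void denseToDenseAddAVX_cache(const double *a, const double *b,
                                         double *c, std::size_t count, double lambda);
extern "C" void denseToDenseAddAVX512_cache(const double *a, const double *b,
                                            double *c, std::size_t count, double lambda);

// Select the best implementation supported by the current CPU at runtime.
void denseToDenseAdd_cache(const double *a, const double *b,
                           double *c, std::size_t count, double lambda) {
    if (__builtin_cpu_supports("avx512f")) {
        denseToDenseAddAVX512_cache(a, b, c, count, lambda);
    } else if (__builtin_cpu_supports("avx")) {
        denseToDenseAddAVX_cache(a, b, c, count, lambda);
    } else {
        denseToDenseAddSSE2_cache(a, b, c, count, lambda);
    }
}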