Atomic functions - Built-in functions

clGetDeviceInfo and the types and descriptions of the properties

Chapter 5. The OpenCL C language

6. Built-in functions

6.8. Atomic functions

Generally, the increasing of a variable consists of three steps: in the first step, the current value of the variable is loaded into a register of the processor. Then, the value is increased and finally, the result is stored in the memory. A problem may arise if parallel codes try to increase the value of the same variable. Consider the case when processor A loads the actual value (x) of the variable into a register and increases it to x+1. In the same time, processor B also loads the value (x) of the variable into a register and increases its value to x+1. Then, processor B stores the increased value in the memory and finally, processor A also uploads the increased value to the address of the variable. Easy to see that the value of the variable is x+1 instead of x+2, although the value of the variable is increased two times. In the field of parallel programming an operation is called atomic operation, if the result of an operation appears in the memory instantly, no caching is performed and concurrent codes have no chance to access the concerned regions of memory.

OpenCL C defines atomic operations for the manipulation of signed and unsigned integers and single-precision floating-point values in the __global and __local memories.

Specification:

type atomic_operation(__qualifier type* p, type val);

Generic parameters: type - int or unsigned int; float can be used only if the operation is xchg.

operation - add: addition, sub: subtraction, min: determining the minimum, max: determining the maximum, and: bitwise "and", or: bitwise "or", xor: bitwise "exclusive or", xchg: exchange.

qualifier - __global or __local.

Parameters: p - An address in the global or local memory.

val - A value of type int or unsigned int. Return value: The result of the operation operation applied to the

value at the address p and val.

The function applies the operation operation for the value of type type at the address p with the argument val. The result is stored at the address p and also returned by the function. When the operation is xchg, the function reads the value at address p, writes the value val to the address p and returns the value previously read from p.

Specification:

type atomic_operation(__qualifier type* p);

Generic parameters: type - int or unsigned int.

operation - inc: incrementation, dec: decrementation.

qualifier - __global or __local.

Parameters: p - An address of the global or local memory.

Return value: The value at the address p increased or decreased by 1.

The only argument of the second variant of atomic functions is an address, and the value at that address is increased or decreased and returned by the function.

Specification:

type atomic_cmpxchg(__qualifier type* p, type cmp,

type val);

Generic parameters: type - int or unsigned int. qualifier - __global or __local.

Parameters: p - An address in the local or global memory.

comp - A value from type type. val - A value from type type.

Return value: The value at the address p at the time of the function call.

The third variant of atomic operations performs conditional assignment: compares the values *p and cmp, and if the values are equal, val is written to the address p. Otherwise, the value at p is not modified: (*p == cmp) ? val : old. The function returns the value at the address p at the time of the function call.

The use of the functions is demonstrated by a kernel that computes the number of positive elements in an array.

The first argument of the kernel is the array, the second argument is the pointer of a variable of type unsigned integer, used to store the results. We can assume that the variable is initialized by 0, and the size of the one-dimensional index range equals to the size of the array.

Example 5.17. count.k

__kernel void count(__global int* input, __global uint* number) {

uint idx= get_global_idx(0);

if ( input[idx] > 0 ) atomic_inc(number);

}

Since the workitems running parallel can modify the value of number at the same time, it is unavoidable to use atomic operations, particularly, the function atomic_inc.

If the array contains only the values 0 and 1, the code can be simplyfied by leaving the expensive conditional statement.

Example 5.18. count.k

{

uint idx= get_global_idx(0);

atomic_add(number, input[idx]);

}

6.9.

^printf

The function printf works similarly as in ANSI C99, with some minor differences. For example, the symbol

%s can be used to print string literals only. The return value of the function is a logical value instead of the number of characters written to the output. The value is 0 in the case of unsuccessful execution and -1 otherwise. The specification leaves several issues for the implementation. The most important issue defined by the implementation is the channel where the output of the function is written. Particularly, depending on the OpenCL implementation, the output of the function can appear on the standard output and standard error channel, as well. The detailed description of the function can be found in the OpenCL 1.2 specification. The following simple kernel writes the indices of workitems to the output.

Example 5.19. printf.k

__kernel void localMemory(__global ulong* input, __local ulong* linput) {

printf("global id: %d, %d, %d\t"\

"local id: %d, %d, %d\t"\

"wgroup id: %d, %d, %d\n",

get_global_id(0), get_global_id(1), get_global_id(2), get_local_id(0), get_local_id(1), get_local_id(2), get_group_id(0), get_group_id(1), get_group_id(2));

}

The kernel can be used to check whether the global and local index ranges contain the same indices we have described in the discussion of the execution model.

7. Summary

In this chapter we have overviewed the elements of the OpenCL C language. We have introduced vector types, presented the rules of conversion, the qualifiers and the most important built-in functions. At first glance, it may seem easy to write simple codes in OpenCL C. However, in the case studies of the following chapters the reader will see that the use of different constructions (atomic or vector types; built-in functions or own implementation;

constant or global memory) in the kernel code can lead to significantly different performances. In fact, knowing the language OpenCL C is not enough to write professional and efficient OpenCL programs. In order to reach the highest performance in the available hardware environment, the reader has to obtain experience in the use OpenCL C. The case studies of the next chapters provide a good starting point to learn how efficient OpenCL C codes look like and what makes one implementation more efficient than others.

8. Excercises

1. Review and comprehend the code of the kernels helloWorld and sqrtKernel given in the previous chapters!

2. (★★) Implement a host program to drive the kernel localMaximum.k! 3. (★★) Implement a host program to drive the kernel betaKernel.k! 4. (★★) Implement a host program to drive the kernel commonBits.k!

5. (★★) Implement a kernel and a host program to demonstrate the work of _rte, _rtz, _rtp and _rtn rounding modes!

6. (★★★) Create a header file called cltoansi.h, and define macros enabling the kernels including the header to be compilable with ANSI C compilers! Suppose that the kernels use only the functions determining the indices of workitems!

7. (★★) Implement a kernel to evaluate the probability distribution function of the χ² distribution. The kernel has two arguments: an array containing the values where the function needs to be evaluated and the positive integer k parameter of the distribution! In the implementation use as many built-in functions of OpenCL C as possible! The probability distribution function of the χ² distribution is given below.

(5.2)

Implement a host program to drive the kernel!

8. (★★★) Implement kernels and a host program to copy the contents of a buffer object into another one. First, perform the operation using atomic values, one by one. Then perform the operation using vector data types of lengths 4, 8 and 16! Compare the runtimes and explain the differences of the results!

Chapter 6. Case study - Linear

In document György Kovács OpenCL (Pldal 130-133)