
5. Runtime layer

5.7. Execution of kernels

5.7.3. Execution of native kernels

The third way to execute kernels provides functionality similar to that of clEnqueueTask, but for ordinary host functions instead of OpenCL kernels.

Particularly, the function clEnqueueNativeKernel75 (available since OpenCL 1.0) enables the execution of native C/C++ functions through command queue objects. Obviously, the function is not supported by GPU devices, but it can be used with CPU devices.
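For reference, the prototype of the function, as given on the Khronos reference page cited in the footnote:

cl_int clEnqueueNativeKernel(cl_command_queue command_queue,
                             void (CL_CALLBACK* user_func)(void*),
                             void* args,
                             size_t cb_args,
                             cl_uint num_mem_objects,
                             const cl_mem* mem_list,
                             const void** args_mem_loc,
                             cl_uint num_events_in_wait_list,
                             const cl_event* event_wait_list,
                             cl_event* event);

The meaning of the arguments is the following: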

user_func - A function of the host program with the proper signature: it takes a single argument of type void* and returns void.

args - A pointer to the argument to be passed to user_func.

cb_args - The size of the argument at the address args, in bytes.

75 http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNativeKernel.html

num_mem_objects - The size of the array mem_list.

mem_list - An array of buffer objects.

args_mem_loc - An array containing the addresses of the pointers through which the function user_func accesses the contents of the buffer objects.

num_events_in_wait_list - The size of the array event_wait_list.

event_wait_list - An array of event objects belonging to commands required to be completed before the execution of the native kernel.

event - The event object belonging to the enqueued command. If a NULL pointer is specified, no event is returned.

Return value: Error code in the case of unsuccessful execution, CL_SUCCESS otherwise.

The arguments specifying the command queue and the event objects are self-evident; however, the rest of the arguments may be confusing. Obviously, the function user_func is the native kernel executed on a processing unit. However, the programmer cannot specify index ranges or the number of times the kernel should be run in parallel. Questions may arise: how does the OpenCL environment execute the function on a multi-core CPU in parallel? Is it able to run it in parallel at all? The answer is that the function clEnqueueNativeKernel executes only one instance of the native kernel; thus, to utilize the hardware resources, one has to call the function as many times as the number of threads required. The strategy of parallel execution is highly similar to Pthreads-based parallelization:

• The problem intended to be solved in parallel is decomposed into subproblems by the programmer, and the parameters of the subproblems are stored in simple record data structures. The number of records is the same as the number of subproblems.

• The function clEnqueueNativeKernel is called as many times as the number of subproblems defined.

• In each call of the function, the native kernel and a record specifying one subproblem are passed as arguments.

• The commands executing the native kernel on the subproblems are put in the command queue and executed by the device.

Thus, the argument args is the address of a record specifying a subproblem. Since the record can contain an arbitrary number of fields, the fact that the function user_func has only one argument of type void* does not pose any restriction. The pointer is cast in the body of the function user_func to a pointer of the proper record type, and the fields specifying the subproblem become available in the function. The subproblem is solved in the rest of the body of the function user_func.
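As an illustration of this mechanism, the following sketch shows a possible record type and the corresponding native kernel; the field names idx and ptr are assumptions chosen to match the description of the sample code below, not necessarily the names used in the original source:

#include <math.h>

typedef struct parameters
{
  int idx;     /* index of the array element to be processed (assumed name) */
  float* ptr;  /* pointer to the contents of the buffer object (assumed name) */
} parameters;

void sqrtKernel(void* args)
{
  parameters* p= (parameters*)args;      /* cast to the proper record type */
  p->ptr[p->idx]= sqrt(p->ptr[p->idx]);  /* solve the subproblem */
}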

There are only three arguments left to clarify. All of them are related to the buffer objects used by the kernel. In practice, most problems operate on data stored in buffer objects. Previously, we have described the variants of buffer objects in detail. Passing these objects to native kernels (basically conventional C functions) is not so simple: the efficient implementation (with reduced data transfer) depends on the types of the buffers. To make the implementation easier, OpenCL hides the differences between the buffer objects, and the function clEnqueueNativeKernel performs the data transfers and the queries of pointers to the contents of the buffers in optimized ways.

The sizes of the arrays mem_list and args_mem_loc are both num_mem_objects. The former contains buffer objects, and the latter contains the addresses of the pointers used in the function user_func: the native kernel is written to access the contents of the buffer object mem_list[i] through the pointer stored at the address args_mem_loc[i]. The function clEnqueueNativeKernel sets the values of the pointers pointed to by the elements of the array args_mem_loc properly, making the native kernel ready to run.
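Putting the pieces together, a minimal sketch of the enqueue loop may look as follows; the names params, locs and memobjInput, like the field names of the record, are assumptions consistent with the sketch above. For each subproblem, the address of the ptr field of the record is passed in args_mem_loc, so clEnqueueNativeKernel can fill the field with a pointer to the contents of the buffer object before user_func is invoked:

parameters params[ARRAY_SIZE];
const void* locs[1];
cl_int err;
int i;

for ( i= 0; i < ARRAY_SIZE; ++i)
{
  params[i].idx= i;           /* the subproblem: process element i */
  locs[0]= &(params[i].ptr);  /* location of the pointer to be set by OpenCL */
  err= clEnqueueNativeKernel(queue, &sqrtKernel, &(params[i]), sizeof(parameters),
                             1, &memobjInput, locs, 0, NULL, NULL);
  ERROR(err, "clEnqueueNativeKernel")
}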

The use of the function is demonstrated by a sample code similar to the previous ones, computing the square roots of integers in parallel.

Example 4.35. nativeKernel.c

#include <stdio.h>

/* ... further includes, the ERROR macro, the record type 'parameters',
   the native kernel sqrtKernel and the first part of the main function
   (declarations, platform query) ... */

  for ( i= 0; i < numPlatforms; ++i)
  {
    properties[i*2]= CL_CONTEXT_PLATFORM;
    properties[i*2 + 1]= (cl_context_properties)(platforms[i]);
  }
  properties[i*2]= 0;

  context= clCreateContextFromType(properties, CL_DEVICE_TYPE_CPU, NULL, NULL, &err);
  ERROR(err, "clCreateContextFromType")

  err= clGetContextInfo(context, CL_CONTEXT_DEVICES, MAX_DEVICES*sizeof(cl_device_id), devices, &size);
  ERROR(err, "clGetContextInfo")

  err= clGetContextInfo(context, CL_CONTEXT_NUM_DEVICES, sizeof(cl_uint), &numDevices, &size);
  ERROR(err, "clGetContextInfo")

  queue= clCreateCommandQueue(context, devices[0], CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
  ERROR(err, "clCreateCommandQueue")

  memobjInput= clCreateBuffer(context, CL_MEM_USE_HOST_PTR, ARRAY_SIZE * sizeof(float), (void*)&input, &err);
  ERROR(err, "clCreateBuffer")

  /* ... one call of clEnqueueNativeKernel for each subproblem,
     as in the sketch above ... */
  err= clEnqueueNativeKernel(queue, &sqrtKernel, /* ... */);

  err= clEnqueueReadBuffer(queue, memobjInput, 1, 0, sizeof(float)*ARRAY_SIZE, input, 0, NULL, &event);
  ERROR(err, "clEnqueueReadBuffer")

The output of the program:

0.000000 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 3.000000 3.162278 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000 4.123106 4.242640 4.358900

The native kernel is implemented in the function sqrtKernel, being part of the host program. The subproblems are specified in records of type parameters. The record has a field for an index and another one for the pointer of the memory region containing the integers. In the function sqrtKernel, the argument of type void* is converted to type parameters*, and everything becomes available to compute the square root of the array element specified by the index stored in the record.

For efficient scheduling, the command queue is created with the property CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, enabling the execution of the independent native kernels in arbitrary order. Note that the program operates without this feature as well; however, the overall time of execution may increase when out-of-order execution is disabled. The reason is that whenever an instance of the native kernel is stuck on a computing unit, the kernels enqueued later cannot be executed on other computing units, since the order in which they were put in the command queue has to be kept.
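Note that with an out-of-order queue the commands are not guaranteed to complete in the order they were enqueued, so an explicit synchronization point is needed before the results are read back. A minimal sketch of one possible approach (event wait lists would serve the same purpose):

/* blocks until all commands enqueued so far have completed */
err= clFinish(queue);
ERROR(err, "clFinish")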

6. Summary

In this chapter we have overviewed, and demonstrated by sample codes, most of the functions of the OpenCL specification available in the host program. The naming conventions and the consistency of the arguments can highly aid the understanding and memorizing of these functions. Learning the functions presented in this chapter enables the reader to discover the OpenCL devices; create and initialize the data structures of parallel execution; and compile, link and execute OpenCL C codes. Except for a small number of functions, most of them have been present in the specification since version 1.0; thus, they can be used with almost all devices from all vendors supporting OpenCL.

7. Exercises

1. (★) Interpret the host code of the sample program presented in the section My first OpenCL program! Determine what kind of synchronization approaches are applied, and where blocking and non-blocking function calls are performed!

2. (★★) Implement an application capable of querying all the properties of OpenCL devices! Interpret the output of the application for the OpenCL devices available in your computer!

3. (★★★) Implement a function realizing a general, interactive, command line interface for the selection of the available platforms and devices used for parallel execution! The skeleton of the program can be found in this chapter; however, it should support the selection of multiple devices!

4. (★★★) Implement a program capable of measuring the memory bandwidth between the global memory of the available OpenCL devices and the host memory, in both directions! Determine the runtimes of the operations by event objects! Compare the results when the buffer object is allocated in the host memory, in the global memory and in pinned memory regions!

5. (★★) Implement functions, similar to the ones specified in stopper.h and stopper.c, measuring time based on the properties of event objects!

6. (★★) Implement functions providing similar functionalities to clEnqueueMarkerWithWaitList and clEnqueueBarrierWithWaitList using the functions clWaitForEvents, clFinish and clFlush!

7. (★★★) Create a compiler application for OpenCL C: the command line arguments of oclcc are considered to be the source files, and the binary codes are written into the file specified after the option -o. All the command line options beginning with '-', except '-o', are to be passed to the function clBuildProgram!

8. (★★★) Modify the sample code demonstrating the use of the function clEnqueueNativeKernel in the following way: the native kernels should process a contiguous part of the array of size blockSize instead of single elements of the array! Compare the runtimes for large arrays when different blockSize values are used and interpret the results!
