
5. Runtime layer

5.3. Buffer objects and memory handling

5.3.1. Creation and filling of buffer objects

Buffer objects are created by the function clCreateBuffer (OpenCL 1.0) [26]. The function returns the buffer object, or, more precisely, the identifier of the object.

Specification:

cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size,
                      void* host_ptr,
                      cl_int* errcode_ret);

Parameters: context - Identifier of a context object.

flags - A bitfield describing the properties of the buffer object to be allocated. The possible values are summarized in table 4.9.

size - Size of the buffer object to be created in bytes.

host_ptr - A pointer from the address space of the host program pointing to the data to be copied into the buffer.

errcode_ret - The error code is written to this address.

Return value: In case of successful execution, the identifier of a valid buffer object; otherwise the error code is set at the address errcode_ret.

Table 4.9. The constants defining the possible properties of buffer objects and a short description of the properties

CL_MEM_READ_WRITE - The memory object can be read and written by kernels (default).

CL_MEM_WRITE_ONLY - Kernels can only write the memory object.

CL_MEM_READ_ONLY - Kernels can only read the memory object.

CL_MEM_USE_HOST_PTR - The memory object uses the region referred to by host_ptr; only the required parts of this region are transferred to the cache of the OpenCL device during the execution of kernels. When this option is used, the pointer host_ptr cannot be NULL.

CL_MEM_ALLOC_HOST_PTR - Similar to CL_MEM_USE_HOST_PTR, but a new host-accessible memory region is allocated.

CL_MEM_COPY_HOST_PTR - The contents of the memory region pointed to by host_ptr are copied into the buffer object.

CL_MEM_HOST_WRITE_ONLY - The host program can only write the buffer object.

CL_MEM_HOST_READ_ONLY - The host program can only read the buffer object.

CL_MEM_HOST_NO_ACCESS - The host program has no access to the contents of the buffer object.

[23] Note that the OpenCL 1.2 standard enables the use of the function printf in kernel codes; thus, in some applications it may be enough to write the results to the standard output, without any memory operations.

[24] The term memory object also covers the 2D and 3D images that are discussed in detail in a later chapter.

[25] One can create buffer objects that cannot be read by the host program or cannot be written by workitems; however, these constraints are only logical.

[26] http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateBuffer.html

OpenCL platforms have several different physical memories that can be accessed asymmetrically by the CPU and the processors of the OpenCL device. For ease of discussion, in the rest of the section we focus on GPUs as OpenCL devices:

• Discrete graphics cards have dedicated physical memory. These cards can access their memory directly with large bandwidth. In contrast, the CPU has direct access to only some parts of the GPU memory, and its read and write operations are much slower.

• CPUs have fast, direct, large-bandwidth access to the host memory, but GPUs have direct access to only some special regions of it.

• Recently, some special architectures [27] have appeared in which the CPU and the GPU can access the host memory symmetrically, but these approaches are not yet part of the mainstream.

Buffer objects can be created in the memory of the OpenCL device as well as in that of the host machine. Due to the generality and portability of OpenCL, memory objects can be allocated, handled and accessed in several different ways. Depending on the underlying hardware environment, each approach has its pros and cons. In the rest of the section we discuss the cases of GPU and CPU OpenCL devices.

Whichever kind of buffer object is created, the flags controlling its access by the host program and the kernels work in the same way as described in table 4.9. If no access flags are specified, the buffer object is readable and writable by the host program and the kernels alike. When the memory model of OpenCL was discussed, we mentioned constant memory, which cannot be changed during the parallel execution of a kernel for an index range. Note that a buffer object that cannot be written [28] by kernels is not part of the constant memory, even though it may seem to be. The use of constant memory is discussed in detail in the next chapter.
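For instance, a buffer that kernels may only read and that the host program may only write could be requested as follows; this is a minimal sketch with an illustrative helper name, assuming a valid context object:

#include <CL/cl.h>

/* A buffer that kernels may only read and the host may only write.
   As noted above, such a buffer is still not part of the constant
   memory. */
cl_mem createInputBuffer(cl_context context, size_t bytes, cl_int* err)
{
    return clCreateBuffer(context,
                          CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY,
                          bytes,   /* size of the buffer in bytes */
                          NULL,    /* no initial contents */
                          err);
}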

When the flags CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR are not used, the buffer object is allocated on the OpenCL device with the specified size and access properties. If the OpenCL device is a graphics card, the buffer object is allocated in the dedicated memory of that card. Obviously, if the OpenCL device is a CPU, the memory is allocated in the host memory.
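As a minimal sketch of this case (assuming a previously created context object; the helper name createDeviceBuffer and the element count n are illustrative), a device-side buffer of n floats can be requested as follows:

#include <stdio.h>
#include <CL/cl.h>

/* Allocates a buffer of n floats on the OpenCL device. */
cl_mem createDeviceBuffer(cl_context context, size_t n)
{
    cl_int err;
    /* Neither CL_MEM_USE_HOST_PTR nor CL_MEM_ALLOC_HOST_PTR is set,
       so the buffer is allocated on the device (for a graphics card,
       in its dedicated memory). */
    cl_mem buffer = clCreateBuffer(context,
                                   CL_MEM_READ_WRITE,  /* default access */
                                   n * sizeof(float),  /* size in bytes */
                                   NULL,               /* no host_ptr */
                                   &err);
    if (err != CL_SUCCESS)
        fprintf(stderr, "clCreateBuffer failed with error code %d\n", err);
    return buffer;
}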

When the flag CL_MEM_USE_HOST_PTR is used, the argument host_ptr of the function clCreateBuffer cannot be NULL. In this case the buffer object uses the memory region specified by host_ptr: the buffer object is a kind of synonym of this region of the host memory, independently of the type of the OpenCL device. If the OpenCL device is a CPU, the workitems can access the data stored in that region without any data transfer. In contrast, if the OpenCL device is a GPU, whenever the workitems access the contents of the buffer object, a data transfer through the PCI-express bus is required. In some cases this approach can be highly efficient. Consider a large set of data and a parallel algorithm that uses only a small part of the dataset, where that part is determined at runtime. In this case there is no need to copy the large dataset into the memory of the OpenCL device; one gets faster execution if only the required parts of the dataset are transferred. Obviously, the access of such buffers can be sped up by caching mechanisms on the OpenCL device. Due to these possible caching mechanisms, whenever the host program wants to modify the memory region addressed by host_ptr, synchronization operations have to be applied to ensure that the region stored in the host memory and the cache of the device are consistent.

[27] Like AMD Fusion or its descendant, the AMD APU.

[28] That is, a buffer created using the flag CL_MEM_READ_ONLY.
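As an illustration, the following sketch wraps an existing host array into a buffer object using CL_MEM_USE_HOST_PTR; the helper name wrapHostArray is hypothetical, and a valid context object is assumed:

#include <CL/cl.h>

/* Wraps an existing host array into a buffer object. The buffer acts
   as a synonym of the host region: on a CPU device, workitems access
   it without copying; on a GPU device, the required parts travel
   through the PCI-express bus on demand. If the host later modifies
   'data', synchronization is needed to keep device caches consistent
   (see the discussion above). */
cl_mem wrapHostArray(cl_context context, float* data, size_t n, cl_int* err)
{
    return clCreateBuffer(context,
                          CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          n * sizeof(float),
                          data,   /* host_ptr must not be NULL here */
                          err);
}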

Before the flag CL_MEM_ALLOC_HOST_PTR is discussed, we make a short detour to describe pinned memory regions. Nowadays, all modern computers and operating systems are based on the concept of virtual memory: the physical memory and its limitations are hidden from the software. The physical memory is divided into contiguous regions, called pages, and the mechanism of paging can write pages to the hard drive and load pages back into the memory. When the contents of a page are written to the hard drive, the region belonging to the page is considered free; other pages can be loaded there from the hard drive. Due to paging, applications can use more memory than is physically available in the computer. However, paging is a quite expensive operation. There are many data structures maintained by the operating system and used so often that it is not worth paging them out from the host memory to the hard drive. These data structures are stored in pages having the pinned property, that is, pages that are never written to the hard drive by the paging mechanism. The DMA [29] controller of discrete graphics cards can directly access pinned memory regions, that is, it can read or write those regions without wasting CPU time. The communication between the main memory and the GPU always takes place through pinned memory regions. If the region one wants to upload to the GPU is not stored in pinned pages, the OpenCL implementation first copies it to pinned pages, and these are then accessed by the DMA controller of the GPU. Considering the paging mechanism and the advantageous properties of pinned regions, it is easy to see why buffer objects allocated in pinned regions provide optimal communication between the GPU and the main memory.

When the CL_MEM_ALLOC_HOST_PTR flag is used to create a buffer object, the buffer can be accessed by the CPU and GPU directly. The realization of this shared buffer depends on the hardware and the implementation of OpenCL, but the memory region is generally allocated in pinned pages:

• OpenCL 1.2 does not specify [30] whether pinned or non-pinned memory regions are to be allocated;

• according to the description of the AMD OpenCL implementation [31] [1], pinned regions are allocated on the Windows 7 and Windows Vista operating systems, and on Linux when GPUs with the AMD Southern Islands architecture are used; on Linux with GPUs of other architectures (Evergreen, Northern Islands), the memory is allocated on the device;

• the NVidia OpenCL Best Practices Guide [32] does not specify whether pinned or non-pinned memory is allocated, but assures the reader that the NVidia driver chooses the most efficient approach.

Whichever way it is implemented by the OpenCL library, the programmer can be confident that the OpenCL implementation chooses the fastest way to transfer data between the host machine and the OpenCL device. Obviously, this does not come for free: the allocation of pinned memory regions is an expensive operation.
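A host-accessible buffer of this kind might be requested as in the following sketch (the helper name createPinnedBuffer is illustrative; whether the region is actually pinned depends on the hardware and the OpenCL implementation, as discussed above):

#include <CL/cl.h>

/* Requests a buffer that is directly accessible by both the host and
   the device; implementations typically back it with pinned memory,
   making host-device transfers as fast as possible. */
cl_mem createPinnedBuffer(cl_context context, size_t bytes, cl_int* err)
{
    return clCreateBuffer(context,
                          CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          bytes,
                          NULL,   /* the region is allocated by OpenCL */
                          err);
}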

When the CL_MEM_COPY_HOST_PTR flag is used, the buffer is initialized with the contents of the memory region addressed by host_ptr. Obviously, the flag CL_MEM_COPY_HOST_PTR cannot be used together with the flag CL_MEM_USE_HOST_PTR, but it can be combined with the flag CL_MEM_ALLOC_HOST_PTR. In the latter case the memory region can be directly accessed by the CPU and the GPU, and it is initialized with the contents of the region addressed by host_ptr.
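For instance, a host-accessible buffer initialized from an existing host array in one call might be created as follows; a minimal sketch with an illustrative helper name:

#include <CL/cl.h>

/* Allocates a (typically pinned) host-accessible buffer and
   initializes it from an existing host array in one call.
   CL_MEM_COPY_HOST_PTR may be combined with CL_MEM_ALLOC_HOST_PTR,
   but not with CL_MEM_USE_HOST_PTR. */
cl_mem createInitializedBuffer(cl_context context, const float* src,
                               size_t n, cl_int* err)
{
    return clCreateBuffer(context,
                          CL_MEM_READ_WRITE |
                          CL_MEM_ALLOC_HOST_PTR |
                          CL_MEM_COPY_HOST_PTR,
                          n * sizeof(float),
                          (void*)src,  /* contents are copied */
                          err);
}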

Although the arguments of the function clCreateBuffer seem to be self-evident, there are fine details that can greatly improve the performance of the software if the memory handling is fine-tuned to the requirements of the problem by minimizing the amount of data transferred between the host machine and the OpenCL device. We return to the efficient handling of memory at the discussion of the functions clEnqueueMapBuffer and clEnqueueUnmapMemObject.

The contents of buffer objects can be initialized when they are created. Furthermore, buffer objects can be filled with patterns at any point of the host program using the function clEnqueueFillBuffer (OpenCL 1.2) [33]. This function has the special property that it ignores the access flags given at the creation of the buffer object. The pattern can be a single value, like zero, or a sequence of values that is repeated to fill the memory region of the buffer (like an alternating pattern containing two values). The function is the first example of the clEnqueue* functions we referred to at the introduction of event objects.

[29] Direct Memory Access - a mechanism enabling the peripherals to access the main memory without the use of the CPU.

[30] http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateBuffer.html

[31] http://www.siliconwolves.net/frames/why_buy/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

[32] http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf

[33] http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueFillBuffer.html

Specification:

cl_int clEnqueueFillBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           const void* pattern,
                           size_t pattern_size,
                           size_t offset,
                           size_t size,
                           cl_uint num_events_in_wait_list,
                           const cl_event* event_wait_list,
                           cl_event* event);

Parameters: command_queue - Command queue object.

buffer - The buffer object to fill.

pattern - The pattern to fill the buffer object with.

pattern_size - The size of the pattern in bytes.

offset - The offset, in bytes, of the region to fill in the buffer object.

size - The size of the region to fill, in bytes.

num_events_in_wait_list - Size of the array event_wait_list.

event_wait_list - Array of event objects of size num_events_in_wait_list.

event - The event object belonging to the command of filling a buffer is written to this address.

Return value: Error code in case of unsuccessful execution, CL_SUCCESS otherwise.

The arguments of the function are self-evident. The command of filling the buffer object is put into the command queue passed as the first argument. The pattern used to fill the buffer is given in the array pattern. In several cases one wants to fill only a part of the buffer instead of the whole buffer. Then the offset parameter can be used to specify the index of the first byte of the filling operation, and size gives the number of bytes filled sequentially. The next three arguments work as we have discussed before: the execution of the filling command begins when all the events in the array event_wait_list are in the completed state. The last argument is the address of an event object. The event belonging to the command is written to that address and can be used to monitor the state of the filling and to make further commands depend on that state. Generally, it is allowed to pass a NULL pointer to clEnqueue* functions as the pointer of the output event. In this case the functions work properly, but there is no way to monitor the state of the command or to make other commands depend on it.
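As a minimal sketch (assuming a valid command queue and buffer object; the helper name zeroFill is illustrative), the first n floats of a buffer can be zeroed and the completion of the command awaited as follows:

#include <stdio.h>
#include <CL/cl.h>

/* Fills the first n floats of 'buffer' with zeros and waits for the
   command to complete. */
cl_int zeroFill(cl_command_queue queue, cl_mem buffer, size_t n)
{
    float zero = 0.0f;   /* the pattern: a single value */
    cl_event event;
    /* offset and size must be multiples of pattern_size */
    cl_int err = clEnqueueFillBuffer(queue, buffer,
                                     &zero, sizeof(zero), /* pattern */
                                     0,                   /* offset */
                                     n * sizeof(float),   /* size */
                                     0, NULL,  /* no events to wait on */
                                     &event);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clEnqueueFillBuffer failed with error code %d\n", err);
        return err;
    }
    /* Block until the filling command reaches the completed state. */
    err = clWaitForEvents(1, &event);
    clReleaseEvent(event);
    return err;
}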
