
Chapter 3. The OpenCL model

1. The OpenCL specification

1.3. Memory model

In the classical scheme of the von Neumann architecture there is only one computing unit (processor) and one memory. As described in the introduction of the book, today's electronic devices usually have at least two processor cores. Projecting this property onto the von Neumann model, there are two or even more computing units. The following question arises: how can these processors access the memory? Do they have separate, dedicated physical memories, or do they access the same physical memory, able to read or write even the same bytes of that memory? In this section we discuss these questions in detail. First, the classical memory models of parallel and distributed computing are described. Then, in the light of these existing models, the memory model of the OpenCL device is introduced.

1.3.1. Classical memory models

Based on the type of memory access, computer architectures containing several processors can be classified into three classes:

distributed memory model - each processor has its own dedicated physical memory, allowing only process4-based parallelization. The main feature of process-based parallelization is that the processes running in parallel can communicate only through the interprocess communication mechanisms provided by the operating system (like TCP/IP networks). For example, the distributed memory model is realized by a cluster of desktop computers connected by an Ethernet network. Process-based parallelization can be applied using libraries implementing the Message Passing Interface (MPI) specification (like OpenMPI5, MPICH6) or the highly similar Parallel Virtual Machines7 (PVM) technology8. See figure 3.1(b) for the schematic diagram of the model.

shared memory model - the processors access and use the very same physical memory, enabling thread9-based parallelization. The main characteristic of thread-based parallelization is that the threads running on different processors are started by the same process, thus the threads can communicate with each other through the memory regions of the parent process they share. For example, the shared memory model is realized by multicore desktop computers and multicore smartphones. On these devices thread-based parallelization can be implemented using POSIX threads or the Open MultiProcessing10 (OpenMP) technology. See figure 3.1(a) for the schematic diagram of the model.

hybrid memory model - the processors are organized into disjoint groups. The processors inside a group share physical memory, enabling the use of thread-based parallelization. The communication between the groups takes place through the services of the operating system, enabling process-based parallelization and implementing the distributed memory model. For example, a cluster of computers having multicore processors can be considered to realize a hybrid memory model. Parallel solutions for hybrid memory models can be implemented by combining the technologies dedicated to shared and distributed memory models.

For example, one can use the OpenMPI technology to distribute the computing tasks among the computers, and the individual computers can solve the subproblems in parallel on their processors using OpenMP for parallelization (a minimal C sketch of the thread-based approach follows this list). See figure 3.1(c) for the schematic diagram of the model.
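To make the notion of thread-based (shared memory) parallelization more concrete, the following minimal C sketch uses OpenMP to sum the elements of an array residing in the memory shared by the threads of a single process. The file and variable names are only illustrative, not taken from the book.

/* A minimal sketch of thread-based (shared memory) parallelization with
 * OpenMP: the threads started by the same process sum the elements of an
 * array residing in the memory they share. Compile, e.g., with
 * gcc -fopenmp example.c (the file name is only an example). */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; ++i)
        a[i] = 1.0;

    /* Each thread processes a part of the shared array; the partial sums
     * are combined by the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += a[i];

    printf("sum = %f, maximum number of threads = %d\n",
           sum, omp_get_max_threads());
    return 0;
}

A process-based (distributed memory) solution of the same problem would instead distribute disjoint parts of the array to separate processes (for example with MPI) and combine the partial sums by message passing.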

Just like in the case of the Flynn classes of parallel solutions, many implementations cannot be strictly classified into the shared, distributed or hybrid classes. In many cases the memory models are only abstractions at the level of implementation. For example, one can implement process-based, distributed memory-like parallelization on a single desktop computer having multicore processors. From the implementation point of view, the distributed memory model is applied; from the physical point of view, the same shared hardware is used.

Figure 3.1. Schematic diagrams of various memory models: (a) shared memory model; (b) distributed memory model; (c) hybrid memory model.

4The process is an instance of a computer program being executed. The process realizes a sequence of instructions. In the commonly used time-sharing operating systems each process can be in three different states: running (one of the processors is executing it); ready to run (the process is temporarily stopped by the operating system, but it can return to running at any moment); blocked (the process is waiting for some independent event to occur). At the level of the operating system, the management of processes means the maintenance of process descriptors enabling the switching of process states. Particularly, the operating system maintains the instruction counter, the stack pointer, the descriptors of the allocated memory regions and other resources (like opened files), the state of the process (the contents of the CPU registers), etc.

5http://www.open-mpi.org/

6http://www.mpich.org/

7http://www.csm.ornl.gov/pvm/

8Although PVM was a popular toolkit for distributed computing, nowadays it seems to have been replaced by MPI.

9Threads can be defined as lightweight processes. Threads are the smallest units that can be scheduled by the operating system. The most important properties of threads are the following: just like processes, threads can be in running, ready to run and blocked states; threads are started by already running processes and threads are parts of their parent processes; threads share the memory and resource descriptors with their parent processes, but have distinct instruction counters and states (the contents of the CPU registers).

10http://openmp.org/wp/

There are many more classifications and groupings of parallel programming models that we do not discuss in this book due to lack of space. However, the understanding of the main memory models should be enough to discuss the memory model of the abstract OpenCL device in detail and to understand how the OpenCL memory model differs from the conventional models.

1.3.2. The memory model of the abstract OpenCL device

The memory model of the abstract OpenCL device differs from the shared, distributed and hybrid models in many respects. Perhaps it is best considered a specialized kind of hybrid model. The main difference is that the OpenCL device has four distinct memories with different access properties. See figure 3.2 for the schematic diagram of the model.

Figure 3.2. Schematic model of the OpenCL platform and execution: the schematic model of the OpenCL platform and abstract device, and the model of execution.

The four memory regions and their properties:

global memory - the host program can read and write this memory, and the data stored in the global memory is also readable and writeable by the workitems;

constant memory - usually implemented as a special region of the global memory that can be read and written by the host program, but that the workitems of a parallel execution can only read. Accordingly, the allocation and initialization of constant memory regions has to be performed by the host program or by a previous parallel execution of kernels;

local memory - the shared memory of the workitems belonging to a workgroup. The shared variables used for the communication of workitems belonging to the same workgroup are usually placed in the local memory. The host program can allocate regions in the local memory but cannot read or write them. The regions allocated in the local memory have to be initialized by the workitems;

private memory - the dedicated, private memory of a workitem. Neither the host program nor other workitems can read or write it. A minimal kernel sketch illustrating the address space qualifiers corresponding to these memory regions is given after this list.
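In OpenCL C, the four memory regions correspond to the address space qualifiers __global, __constant, __local and __private (the latter being the default for variables declared inside a kernel). The following hypothetical kernel (its name and arguments are only illustrative) shows how they typically appear:

/* A hypothetical kernel illustrating the four memory regions through the
 * OpenCL C address space qualifiers; the kernel and argument names are
 * only examples. */
__kernel void scale_and_copy(__global float* data,      /* global memory   */
                             __constant float* coeffs,  /* constant memory */
                             __local float* tmp,        /* local memory    */
                             int n)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    /* Variables declared inside the kernel reside in the private memory
     * of the workitem (the __private qualifier is implicit). */
    float value = 0.0f;

    if (gid < n)
        value = data[gid] * coeffs[0];

    /* The local memory region is shared by the workitems of the workgroup
     * and has to be initialized by the workitems themselves. */
    tmp[lid] = value;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < n)
        data[gid] = tmp[lid];
}

The region pointed to by the __local argument is allocated by the host program (for example through clSetKernelArg with a NULL argument value and a size), but, as described above, the host cannot read or write its contents.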

The above description of the various memories is given from the execution point of view, using the terms host program, workitems and workgroups instead of hardware elements like the host CPU, processing units or computing units. We could have used the names of the various processors as well; however, note that the processing and computing units are only abstractions, and it depends on the OpenCL device whether they are distinct hardware elements at the physical level. In the description of the memories we tried to emphasize the abstractions by using workitems and workgroups. The aim is to strengthen the reader's understanding that the limited access is not necessarily the consequence of hardware features. For example, the architecture of GPU devices is very similar to the abstract OpenCL device. However, in the case of OpenCL-supporting CPUs, the host processor, the computing units and the processing units are all the available CPUs, and the local memories and the global memory are parts of the host memory. Therefore, the statement that a processing unit (the CPU) cannot access any local memory (which is part of the host memory) can be misleading. From the execution point of view the same statement becomes clearer: a workitem can access only the data available in the local memory of its own workgroup.

It is worth discussing the caches that can be present at several levels of the OpenCL device, although none of them is obligatory. In figure 3.2 we have assumed that a cache is present for the global and constant memories, which is very common in modern GPU devices; in older GPUs, however, no cache is available for the global and constant memories. Besides the cache for the global and constant memories, further caches may be present at the level of workgroups (computing units) or workitems (processing units).

The host program can use the global and constant memory of the OpenCL device in two ways. On the one hand, the host program can put blocking or non-blocking commands into the command queue to allocate, read or write the global and constant memory. On the other hand, the host program can map a region of the global or constant memory into its own host memory space. After the mapping, it can read or write the mapped region in its own host memory; when the manipulation of the region is finished, the mapping is released and the modifications appear in the memory of the OpenCL device.
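The following minimal sketch illustrates the two approaches using the OpenCL host API. The function fill_device_buffer is hypothetical; the command queue, the buffer object and the host data are assumed to have been created earlier by the host program.

/* A minimal sketch: filling a buffer residing in the global memory of the
 * device, first with a blocking write command, then by mapping the buffer
 * into the host memory space. The function name is only an example. */
#include <CL/cl.h>

void fill_device_buffer(cl_command_queue queue, cl_mem buffer,
                        const float *host_data, size_t n)
{
    /* 1) A blocking write command placed into the command queue. */
    clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, n * sizeof(float),
                         host_data, 0, NULL, NULL);

    /* 2) Mapping the buffer into the host memory space: the region is
     *    manipulated through the returned pointer, and the modifications
     *    appear on the device once the mapping is released. */
    float *mapped = (float*)clEnqueueMapBuffer(queue, buffer, CL_TRUE,
                                               CL_MAP_WRITE, 0,
                                               n * sizeof(float),
                                               0, NULL, NULL, NULL);
    for (size_t i = 0; i < n; ++i)
        mapped[i] = host_data[i];
    clEnqueueUnmapMemObject(queue, buffer, mapped, 0, NULL, NULL);
}

In a real host program only one of the two approaches would be used for a given transfer; error checking is omitted for brevity.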

An important issue is memory consistency: when the workitems are executed and the different processing units read and write the same region of the global or local memory, do they see the same image of that region? Are there dedicated caches for the workitems (like in the case of thread-based parallelization on multicore computers)? Do the workitems read and write their caches instead of the global or local memory? Can the caches of two workitems differ at the same time? The OpenCL specification answers these questions with the following consistency guarantees:

• The local memory is consistent for the workitems of a workgroup at workgroup level synchronization points, which can be explicitly placed in the code. Thus, regardless of the presence or absence of caching mechanisms, at such a point of the parallel execution the workitems see the very same image of the local memory.

• Similarly, the global memory is consistent for the workitems belonging to the same workgroup at global synchronization points. However, the memory image seen by workitems belonging to different workgroups may differ, i.e. it may be inconsistent.

Obviously, the consistency of the constant and private memories is not an issue, since the constant memory cannot be written by the workitems, and the private memory is dedicated to a single workitem and cannot be read or written by other workitems.
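As a sketch of such a synchronization point (the kernel name and arguments are only illustrative), the barrier built-in function of OpenCL C marks the point at which the local memory becomes consistent for the workitems of a workgroup:

/* A hypothetical kernel fragment: workitem 0 of each workgroup writes a
 * value into the local memory, and after the workgroup level
 * synchronization point every workitem of the workgroup sees that value. */
__kernel void broadcast_first(__global const float *in,
                              __global float *out,
                              __local float *shared)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    if (lid == 0)
        shared[0] = in[get_group_id(0)];

    /* Workgroup level synchronization point: the local memory is
     * consistent for the workitems of the workgroup at this point. */
    barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = shared[0];
}

The CLK_GLOBAL_MEM_FENCE flag can be used analogously to make the global memory consistent for the workitems of the same workgroup; there is no barrier that synchronizes workitems of different workgroups within a single kernel execution.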
