Mapping to Hardware with OP2

7.1 Constructing multi-level parallelism

The basic assumption that OP2 makes is that the order in which the elements of any given loop are iterated over does not affect the result, within machine precision. This makes it possible to organise parallelism in any way, as long as data dependencies and race conditions are handled. To enable execution on today's hardware, which supports multiple levels of parallelism, we first create a highly parametrisable execution plan that is still independent of the hardware being used. Three levels of parallelism are created and supported (a conceptual sketch of how they compose follows the list below):

1. Distributed memory parallelism based on the Communicating Sequential Processes (CSP) abstraction.

2. Coarse-grained shared memory parallelism for levels where communication and synchronisation are expensive.

3. Fine-grained shared memory parallelism for levels where communication and synchronisation are cheap.
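As a rough illustration of how these levels compose at run time, the following self-contained sketch (hypothetical names and counts; it is not OP2 source) shows the loop structure a parallel loop follows on one MPI rank: the rank owns one distributed-memory partition (level 1), the partition is split into blocks for coarse-grained parallelism (level 2), and the elements within each block provide the fine-grained parallelism (level 3).

```cpp
#include <cstdio>

// Conceptual sketch only (hypothetical names, not OP2 source): how the three
// levels compose for one parallel loop. Level 1 is the MPI rank itself, which
// owns one partition of the mesh; level 2 splits that partition into blocks;
// level 3 iterates the elements inside a block.
static void user_kernel(int element) { std::printf("element %d\n", element); }

int main() {
  const int num_blocks = 4;       // level 2: blocks of this rank's partition
  const int elems_per_block = 8;  // level 3: elements inside each block

  // Level 2: blocks map to OpenMP threads on CPUs or thread blocks on GPUs.
  #pragma omp parallel for
  for (int b = 0; b < num_blocks; ++b) {
    // Level 3: elements map to SIMD lanes (CPU) or threads within a block (GPU).
    for (int e = 0; e < elems_per_block; ++e)
      user_kernel(b * elems_per_block + e);
  }
  return 0;
}
```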

The first level, distributed memory parallelism, is built entirely into the backend. It is based on MPI and uses standard graph partitioning techniques (similar to ideas developed previously in OPlus [72, 80]), in which the domain is partitioned among the compute nodes of a cluster and import/export halos are constructed for message passing. OP2 uses either of two well-established parallel mesh partitioning libraries, ParMETIS [89] and PT-Scotch [90], to obtain high-quality partitions.
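To make the import/export halo idea concrete, the sketch below (hypothetical data layout; the actual construction in OP2 is more general) computes, for one rank, which non-owned cells referenced through an edge-to-cell map must be imported from neighbouring partitions. It anticipates the owner-compute rule described in Section 7.1.1: a rank performs an edge computation if it owns either of the cells the edge updates.

```cpp
#include <array>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical sketch of import-halo construction (not the OP2 implementation):
// a rank must import every cell that one of the edge computations it performs
// references but does not own. The export halo on the owning ranks is the
// mirror image of this list.
std::set<int> build_import_halo(int my_rank,
                                const std::vector<int> &cell_owner,  // cell -> owning rank
                                const std::vector<std::array<int, 2>> &edge_to_cell) {
  std::set<int> import_halo;
  for (std::size_t e = 0; e < edge_to_cell.size(); ++e) {
    const int c0 = edge_to_cell[e][0], c1 = edge_to_cell[e][1];
    // This rank performs the edge computation if it owns either cell it updates.
    if (cell_owner[c0] != my_rank && cell_owner[c1] != my_rank) continue;
    if (cell_owner[c0] != my_rank) import_halo.insert(c0);
    if (cell_owner[c1] != my_rank) import_halo.insert(c1);
  }
  return import_halo;
}
```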

The two-level shared memory design is motivated by several key factors. Firstly, a single node may offer different kinds of parallelism depending on the target hardware: on multi-core CPUs, shared memory multi-threading is available, with the possibility of each thread using vectorisation to exploit the capabilities of SSE/AVX vector units; on GPUs, multiple thread blocks are available, each with multiple threads. Secondly, memory bandwidth is a major limitation on both existing and emerging processors: in the case of CPUs this is the bandwidth between main memory and the CPU cores, while on GPUs it is the bandwidth between the main graphics (global) memory and the GPU cores. The design therefore aims to reduce data movement between memory and cores. Based on ideas from FFTW [91], OP2 constructs for each parallel loop an execution “plan” (op_plan), which breaks up the distributed-memory partition into blocks (or mini-partitions) for coarse-grained shared memory parallelism, and then works out possible data races within each of these blocks to enable fine-grained shared memory parallelism.
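The sketch below gives a much-simplified picture of the kind of information such a plan might hold for one loop; the field names are hypothetical and do not reproduce the actual op_plan layout. The colouring fields correspond to the race-avoidance scheme described in Section 7.1.1.

```cpp
#include <vector>

// Much-simplified, illustrative picture of what a per-loop execution plan can
// record (hypothetical field names, not the real op_plan layout).
struct LoopPlan {
  int nblocks;                    // number of blocks (mini-partitions)
  std::vector<int> block_offset;  // first element of each block
  std::vector<int> block_size;    // number of elements in each block
  std::vector<int> block_colour;  // blocks of the same colour can run concurrently
  int nblock_colours;             // number of block colours
  std::vector<int> elem_colour;   // within a block, elements of the same colour
  int nelem_colours;              //   can be processed concurrently
};
```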

Figure 7.1. Handling data dependencies in the multi-level parallelism setting of OP2 (labels in the figure: MPI boundary, owner-compute halo exchanges, Block 1, Block 2)

7.1.1 Data Dependencies

One key design issue in parallelising unstructured mesh computations is managing the data dependencies encountered when using indirectly referenced arrays [16]. For example, in a mesh with cells and edges, where a loop over edges updates cells, a potential problem arises when multiple edges update the same cell.
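A minimal sketch of the problem, assuming a hypothetical edge kernel that increments a cell-based quantity through an edge-to-cell map: if two edges that share a cell are executed concurrently, their updates collide.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical edge loop with an indirect increment: res is defined on cells
// but updated through the edge->cell map. If two edges sharing a cell are
// executed concurrently, the "+=" below becomes a read-modify-write race.
void edge_loop(const std::vector<std::array<int, 2>> &edge_to_cell,
               const std::vector<double> &flux,  // one value per edge
               std::vector<double> &res) {       // one value per cell
  for (std::size_t e = 0; e < edge_to_cell.size(); ++e) {
    res[edge_to_cell[e][0]] += flux[e];  // potential conflict
    res[edge_to_cell[e][1]] -= flux[e];  // potential conflict
  }
}
```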

At the higher distributed-memory level, we follow the OPlus approach [80, 72] in using an “owner compute” model in which the partition which “owns” a cell is responsible for performing the edge computations which will update it. If the computations for a particular edge will update cells in different partitions, then each of those partitions will need to carry out the edge computation. This redundant computation is the cost of this approach.
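The owner-compute rule can be expressed as a simple test, sketched below with hypothetical names: a partition executes an edge if it owns either of the cells that edge updates, so an edge whose cells lie in different partitions is executed redundantly by both.

```cpp
#include <array>
#include <vector>

// Hypothetical owner-compute test: a partition executes an edge if it owns
// either of the cells the edge updates. An edge whose two cells lie in
// different partitions therefore passes this test on both partitions and is
// computed redundantly by each of them.
bool partition_executes_edge(int my_rank, const std::array<int, 2> &cells,
                             const std::vector<int> &cell_owner) {
  return cell_owner[cells[0]] == my_rank || cell_owner[cells[1]] == my_rank;
}
```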

However, we assume that the distributed-memory partitions are large enough that the proportion of redundant computation becomes very small. It is possible that one partition needs to access data which belongs to another partition; in that case a copy of the required data is provided by the owning partition. This follows the standard “halo” exchange mechanism used in distributed memory message passing implementations. These data dependencies classify set elements as either “interior” elements, which do not depend on data owned by other processes, or “boundary” elements, which do. OP2 implements latency hiding: it overlaps the computation over interior set elements with the communication of boundary data.
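The sketch below outlines this overlap pattern in plain MPI (hypothetical helper names; not OP2's implementation): the halo exchange is posted with non-blocking sends and receives, the interior elements are computed while the messages are in flight, and the boundary elements are computed only once the halo has arrived.

```cpp
#include <mpi.h>
#include <vector>

// Illustrative latency-hiding pattern (hypothetical helpers, not OP2 code):
// post the halo exchange, compute the interior elements while messages are in
// flight, then compute the boundary elements once the halo has arrived.
void exchange_and_compute(std::vector<double> &send_buf,
                          std::vector<double> &recv_buf,
                          int neighbour, MPI_Comm comm,
                          void (*compute)(bool interior_only)) {
  MPI_Request reqs[2];
  MPI_Irecv(recv_buf.data(), (int)recv_buf.size(), MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
  MPI_Isend(send_buf.data(), (int)send_buf.size(), MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);

  compute(true);   // interior elements: no dependence on halo data
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  compute(false);  // boundary elements: halo data is now valid
}
```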

Within a shared-memory setting the blocks can be very small, so the proportion of redundant computation would be unacceptably large if we used the owner-compute approach. Instead we use the approach described in [16], adopting the “colouring” idea used in vector computing [50], illustrated in Figure 7.1. The domain is broken up into blocks of elements, and the blocks are coloured so that no two blocks of the same colour update the same cell. This allows parallel execution of blocks, with synchronisation between different colours. Key to the success of this approach is the fact that the number of blocks is very large, so even if 10-20 colours are required there are still enough blocks of each colour to mitigate the cost of synchronisation and to ensure good load balancing.
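A minimal sketch of this coarse-grained coloured execution, with a hypothetical data layout: blocks of one colour are processed in parallel, and the end of each parallel region serves as the synchronisation point between colours.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch (hypothetical data layout) of coarse-grained coloured
// execution: blocks of the same colour share no cells, so they can run in
// parallel, and the end of each parallel region acts as the synchronisation
// point between colours.
void run_blocks_by_colour(const std::vector<std::vector<int>> &blocks_of_colour,
                          void (*process_block)(int block)) {
  for (std::size_t col = 0; col < blocks_of_colour.size(); ++col) {
    const std::vector<int> &blocks = blocks_of_colour[col];
    #pragma omp parallel for  // safe: no two blocks of this colour update the same cell
    for (int i = 0; i < (int)blocks.size(); ++i)
      process_block(blocks[i]);
    // implicit barrier here before the next colour starts
  }
}
```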

At the level of fine-grained shared memory parallelism it is important to have data reuse during indirect accesses, but there may be conflicts when trying to update the same cell. Here we again use the colouring approach, assigning colours to individual edges so that no two edges of the same colour update the same cell, as illustrated in Figure 7.1.

When incrementing data on cells, it is possible to first compute the increments for each edge, and then loop over the different edge colours applying the increments by colour, with synchronisation between each colour. This results in a slight loss of parallelism during the incrementing process, but permits data reuse.
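A sketch of this two-stage scheme within one block, using hypothetical names and a stand-in edge kernel: the increments are first computed for every edge without conflicts, and then applied to cells one edge colour at a time, with a synchronisation point between colours.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical two-stage increment within one block: stage 1 computes per-edge
// increments with no conflicts; stage 2 applies them to cells one edge colour
// at a time, so concurrently applied edges never update the same cell.
void edge_increments_by_colour(const std::vector<std::array<int, 2>> &edge_to_cell,
                               const std::vector<int> &edge_colour, int ncolours,
                               const std::vector<double> &edge_state,  // kernel input per edge
                               std::vector<double> &res) {             // output per cell
  // Stage 1: every edge computes its increment independently -- no conflicts.
  std::vector<double> inc(edge_to_cell.size());
  for (std::size_t e = 0; e < edge_to_cell.size(); ++e)
    inc[e] = 0.5 * edge_state[e];  // stand-in for the real edge kernel

  // Stage 2: apply increments one colour at a time; edges of the same colour
  // never share a cell, so they may be applied concurrently.
  for (int col = 0; col < ncolours; ++col) {
    for (std::size_t e = 0; e < edge_to_cell.size(); ++e)
      if (edge_colour[e] == col) {
        res[edge_to_cell[e][0]] += inc[e];
        res[edge_to_cell[e][1]] -= inc[e];
      }
    // barrier between colours in a threaded or GPU setting
  }
}
```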

Finally, in order to address the issue of serialisation at the fine level, I have introduced two additional colouring strategies that mitigate serialisation at the cost of increased irregularity. These strategies also enable the use of higher-level automatic parallelisation approaches, such as compiler auto-vectorisation, because execution can be formulated as a simple loop nest whose innermost loop runs over elements of the same colour and therefore requires no synchronisation constructs. I present two new colouring approaches: the first permutes the execution of elements within blocks by colour (referred to as “block permute”), and the second creates a single level of colouring for all elements and provides a permutation to execute them by colour (referred to as “full permute”).
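A sketch of the “full permute” idea, with hypothetical names: a single permutation orders all elements by colour, so execution becomes a simple loop nest whose innermost loop runs over same-coloured elements and is a candidate for compiler auto-vectorisation. The “block permute” variant applies the same reordering within each block instead of across the whole set.

```cpp
#include <vector>

// Hypothetical "full permute" execution: permutation lists all elements sorted
// by colour, and colour_start[c] .. colour_start[c+1] delimits colour c.
// Elements of one colour never update the same cell, so the inner loop carries
// no dependence and needs no synchronisation constructs.
void run_full_permute(const std::vector<int> &colour_start,  // size: ncolours + 1
                      const std::vector<int> &permutation,   // elements sorted by colour
                      void (*element_kernel)(int element)) {
  const int ncolours = (int)colour_start.size() - 1;
  for (int col = 0; col < ncolours; ++col) {
    // Innermost loop over same-coloured elements; a vectorisation hint such as
    // "#pragma omp simd" could be placed here.
    for (int i = colour_start[col]; i < colour_start[col + 1]; ++i)
      element_kernel(permutation[i]);
  }
}
```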