
The FEM on GPUs

4.5 General sparse matrix-vector multiplication on GPUs

4.5.3 Distributed memory spMV

In most large-scale applications it is necessary to employ distributed memory parallelism because the problem may not fit in the memory of a single machine. The prevalent programming model and environment is the Message Passing Interface (MPI), which is used to explicitly communicate between remote processes by means of messages. Depending on the hardware setup and the topology, this communication may have significant latency and usually much lower bandwidth compared to intra-node communication. Therefore, attention has to be paid to how computations and communications are coordinated so that they do not get in each other's way.

The distributed memory parallelisation of the sparse matrix-vector product is conceptually simple and well known [68]; following the decomposition of the graph described by the sparse matrix, I have a number of subgraphs that are connected by edges. The multiplication of a row of the sparse matrix with the multiplicand vector can be interpreted as a weighted sum of values on the vertices that the current row's vertex is connected to via edges. This introduces a data dependency where edges cross from one subgraph to another: before the multiplication can be carried out on a boundary vertex, the data on the vertices it is connected to must be up to date. In the MPI model, the values on these boundary vertices are gathered and sent in a message to the other subgraphs, where they are duplicated as halo vertices so that their values can be updated. Furthermore, each subgraph, consisting of the owned vertices, can be partitioned into two classes of vertices: boundary vertices, which have a data dependency on other subgraphs, and interior vertices, which do not. Thus it is possible to perform the communications and the computations on the interior at the same time, thereby trying to hide the latency of communications with computations [69].
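To make the overlap concrete, the following is a minimal sketch of one distributed spMV step, assuming a CSR-stored local matrix whose rows are ordered interior-first and a precomputed communication pattern; the Partition structure, dist_spmv and the buffer names are illustrative assumptions, not the actual implementation. Halo receives and boundary sends are posted first, the interior rows are processed while the messages are in flight, and the boundary rows are finished once the halo has arrived.

// Sketch of one distributed CSR spMV step with communication/computation
// overlap. Names and layout (interior-first row ordering, halo values stored
// after the owned entries of x) are illustrative assumptions.
#include <mpi.h>
#include <cstddef>
#include <vector>

struct Partition {
  // Local CSR matrix; rows 0..n_interior-1 touch only owned data,
  // rows n_interior..n_owned-1 are boundary rows that also read halo values.
  std::vector<int> row_ptr, col_idx;
  std::vector<double> val;
  int n_owned, n_interior;
  // Communication pattern with neighbouring subgraphs.
  std::vector<int> neighbours;               // MPI ranks of neighbours
  std::vector<std::vector<int>> send_lists;  // owned indices sent to each neighbour
  std::vector<int> recv_counts;              // halo entries received from each neighbour
};

// y = A*x on the owned rows; x holds the owned values followed by the halo values.
void dist_spmv(const Partition& p, std::vector<double>& x,
               std::vector<double>& y, MPI_Comm comm) {
  const int nn = static_cast<int>(p.neighbours.size());
  std::vector<MPI_Request> reqs(2 * nn);
  std::vector<std::vector<double>> send_bufs(nn);

  // 1. Post receives into the halo section of x and send the boundary values.
  int recv_off = p.n_owned;
  for (int i = 0; i < nn; ++i) {
    MPI_Irecv(x.data() + recv_off, p.recv_counts[i], MPI_DOUBLE,
              p.neighbours[i], 0, comm, &reqs[i]);
    recv_off += p.recv_counts[i];
    send_bufs[i].resize(p.send_lists[i].size());
    for (std::size_t j = 0; j < p.send_lists[i].size(); ++j)
      send_bufs[i][j] = x[p.send_lists[i][j]];
    MPI_Isend(send_bufs[i].data(), static_cast<int>(send_bufs[i].size()),
              MPI_DOUBLE, p.neighbours[i], 0, comm, &reqs[nn + i]);
  }

  auto row_product = [&](int r) {
    double sum = 0.0;
    for (int k = p.row_ptr[r]; k < p.row_ptr[r + 1]; ++k)
      sum += p.val[k] * x[p.col_idx[k]];
    y[r] = sum;
  };

  // 2. Interior rows are computed while the messages are in flight.
  for (int r = 0; r < p.n_interior; ++r) row_product(r);

  // 3. Wait for the halo to arrive, then finish the boundary rows.
  MPI_Waitall(2 * nn, reqs.data(), MPI_STATUSES_IGNORE);
  for (int r = p.n_interior; r < p.n_owned; ++r) row_product(r);
}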

When investigating the performance of distributed memory algorithms, strong scaling is often of interest: how much faster can the same problem be solved, given more resources? This pushes the limits in terms of the latency of communications, because in such a situation the size of the subgraphs decreases but the relative number of boundary vertices increases, and computations over the interior become increasingly unable to hide the latency of communications. This situation also surfaces when considering multigrid-type techniques to solve a system of linear equations [70]: on subsequent multigrid levels the graph has fewer and fewer vertices, and if these are further partitioned into the same number of subgraphs on each level, then this can be considered a strong scaling problem. Since GPUs are especially sensitive to latency due to memory copy operations, kernel launches and underutilisation of the hardware, addressing these issues is paramount to achieving high performance. In extreme cases of over-partitioning, the distributed computation may even be slower than carrying out the same computations on a single device.

To address the negative impact of latency, following the idea of tiling [21], I have proposed a communication-avoiding algorithm that is capable of reducing the number of messages at the expense of redundant computations. By carrying out a dependency analysis, it is possible to map out several halo rings: the first consists of the vertices that boundary vertices are connected to, the second consists of vertices that are connected to the first ring and belong to neither the first ring nor the boundary, and so on. If only one ring of halo is updated, then only one sparse matrix-vector multiplication can be carried out (over the owned vertices) that still yields globally consistent results. By updating two rings of halo, it is possible to carry out two multiplications, the first over the owned vertices and the first ring and the second over the owned vertices only, and still achieve globally consistent results while communicating only once. Clearly, this can be extended to an arbitrary number of rings, as the sketch below illustrates.
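The following sketch shows the scheme for k consecutive products, under the assumption that the local CSR matrix is extended with rows for the first k-1 halo rings and that a single round of messages, represented by the hypothetical exchange_rings callback, updates k rings of halo values; each subsequent product then covers a shrinking set of rows, trading redundant computation for fewer messages. The structure and names are illustrative, not the actual implementation.

// Sketch of the k-ring communication-avoiding scheme for k consecutive spMVs.
#include <functional>
#include <vector>

struct ExtendedPartition {
  // CSR rows for the owned vertices plus the first k-1 rings of halo vertices;
  // column indices may additionally refer to vertices in the k-th ring.
  std::vector<int> row_ptr, col_idx;
  std::vector<double> val;
  // ring_end[r] = number of rows covering the owned vertices and rings 1..r,
  // with ring_end[0] equal to the number of owned vertices.
  std::vector<int> ring_end;
};

// Computes y = A^k * x on the owned vertices with a single round of messages,
// redundantly recomputing rows in the halo rings. 'exchange_rings' stands in
// for the (assumed) halo exchange that updates k rings of halo values in x.
void k_step_spmv(const ExtendedPartition& p, std::vector<double>& x,
                 std::vector<double>& y, int k,
                 const std::function<void(std::vector<double>&, int)>& exchange_rings) {
  exchange_rings(x, k);  // one communication step instead of k
  for (int step = 0; step < k; ++step) {
    // After 'step' products only rings 1..k-1-step still hold valid inputs,
    // so the current product covers the owned rows plus those rings.
    const int last_row = p.ring_end[k - 1 - step];
    for (int r = 0; r < last_row; ++r) {
      double sum = 0.0;
      for (int j = p.row_ptr[r]; j < p.row_ptr[r + 1]; ++j)
        sum += p.val[j] * x[p.col_idx[j]];
      y[r] = sum;
    }
    if (step + 1 < k) x.swap(y);  // the result feeds the next product
  }
  // On return, y holds the result on the owned rows; x is clobbered.
}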

Figure 4.21. Performance on different size sparse matrices on a single GPU and on multiple GPUs. (a) Performance scaling on a single GPU: execution time (s) of 80 Jacobi iterations against the number of non-zeros (10² to 10⁷). (b) Performance scaling over multiple GPUs: speedup of the 1 ring distributed version on 2, 4, 6 and 8 GPUs over 1 GPU, against the number of non-zeros.

Figure 4.22. Improvements by applying the communication-avoiding algorithm. (a) Performance scaling over multiple GPUs: speedup of the 4 ring distributed version on 2, 4, 6 and 8 GPUs over 1 GPU, against the number of non-zeros. (b) Improvement over the 1 ring approach: speedup of the 4 ring version over the 1 ring version with 2-8 GPUs, against the number of non-zeros.

Having implemented this algorithm for GPUs, I carry out a set of benchmarks on a cluster of 8 NVIDIA K20X GPUs. A set of 6 sparse matrices, retrieved from ANSYS Fluent CFD simulations, was fed into AmgX and dumped at each multigrid level, yielding a total of 72 matrices of increasingly smaller size, which were then partitioned across the GPUs. Strong scaling is investigated when computing 80 iterations of a Jacobi smoother, consisting of an spMV operation and a vector-vector operation. Figure 4.21a shows performance on a single GPU as a function of the number of non-zeros in the sparse matrix; it is clear that the GPU does not perform well below 10⁴ non-zeros due to underutilisation. Figure 4.21b shows the speedup over the single-GPU performance when employing distributed memory parallelism using 2-8 GPUs; the cost of communications is obvious: below 10⁵ non-zeros it is actually slower to use multiple GPUs.
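For reference, a damped Jacobi sweep of the kind benchmarked here amounts to an spMV followed by a vector-vector update, x ← x + ωD⁻¹(b − Ax). The sketch below is a plain CSR version with an assumed damping factor, not the AmgX implementation; in the distributed setting the spMV loop is replaced by the halo-exchanging version above.

// Minimal sketch of one damped Jacobi sweep: an spMV followed by a
// vector-vector update. The damping factor omega is an assumed value.
#include <vector>

void jacobi_sweep(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
                  const std::vector<double>& val, const std::vector<double>& inv_diag,
                  const std::vector<double>& b, std::vector<double>& x,
                  std::vector<double>& Ax, double omega = 0.8) {
  const int n = static_cast<int>(b.size());
  // spMV: Ax = A * x
  for (int r = 0; r < n; ++r) {
    double sum = 0.0;
    for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
      sum += val[k] * x[col_idx[k]];
    Ax[r] = sum;
  }
  // vector-vector update: x += omega * D^{-1} * (b - Ax)
  for (int r = 0; r < n; ++r)
    x[r] += omega * inv_diag[r] * (b[r] - Ax[r]);
}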

By employing the communication-avoiding algorithm, it is possible to lower the threshold below which multi-GPU computations become excessively slow; Figure 4.22a shows that by using 4 rings of halo, performance scaling is maintained down to 3×10⁴ non-zeros, a more than three-fold improvement over the naive 1 ring approach. Finally, Figure 4.22b shows the speedup of the 4 ring approach over the 1 ring approach: at high numbers of non-zeros the latency is sufficiently hidden by a large interior, but as the matrices get smaller, the advantage of the communication-avoiding approach becomes clear. Thus, by employing this technique, it is possible to deliver strong scalability to general spMV-type operations and scalability to multigrid solvers.

4.6 Summary and associated theses

In this chapter I provided a brief introduction to GPUs and discussed some of the architectural specifics that influence implementation and help in understanding performance.

After describing the experimental setup and the reference CPU implementation, I investigated the performance of the three formulations of FE integration on the GPU, discussing the effect of trade-offs and showing that the redundant compute approach delivers sustained high performance even at high order elements [3]; this completes Thesis I.1: By applying transformations to the FE integration that trade off computations for communications and local storage, I have designed and implemented new mappings to the GPU, and shown that the redundant compute approach delivers high performance, comparable to classical formulations for first order elements; furthermore, it scales better to higher order elements without loss in computational throughput.

I have presented an in-depth study of different data structures, their implementation and performance on the GPU, in both the assembly and the solution phase, and discussed how data races are resolved and what impact that has on parallel efficiency. I have demonstrated that near-optimal performance in absolute terms is achieved on the GPU, outperforming a fully utilised dual-socket server CPU by up to 10 times. I have shown that the LMA delivers superior performance in most cases, especially in single precision where the atomics are available [3]. This completes Thesis I.2: By applying transformations to the FE integration that trade off computations for communications and local storage, I have designed and implemented new mappings to the GPU, and shown that the redundant compute approach delivers high performance, comparable to classical formulations for first order elements; furthermore, it scales better to higher order elements without loss in computational throughput.

Finally, I have connected my research with the field of sparse linear algebra by investigating the parametrisation and mapping of the sparse matrix-vector product on GPUs; I have studied the effect of changing the balance of multi-level parallelism and presented a novel constant-time heuristic for determining near-optimal parameters for execution. Furthermore, I have introduced a run-time parameter tuning algorithm, and I have shown that these outperform the state of the art by up to 2.3 times. My research on communication-avoiding algorithms for the sparse matrix-vector product also resulted in up to 2 times improvement over classical approaches [3, 8, 12]. Thus, as Thesis I.3 states: I have parametrised the mapping of sparse matrix-vector products (spMV) for GPUs, and designed a new heuristic and a machine learning algorithm in order to improve locality, concurrency and load balancing. Furthermore, I have introduced a communication-avoiding algorithm for the distributed execution of the spMV on a cluster of GPUs. My results improve upon the state of the art, as demonstrated on a wide range of sparse matrices from mathematics, computational physics and chemistry.

Chapter 5

The OP2 Abstraction for