Programming languages still used in scientific computing today, such as C and Fortran, were designed decades ago with a tight connection to the execution models of the hardware available at the time. The execution model, an abstraction of the way the hardware works, followed a flat von Neumann machine: a sequence of instructions for the Arithmetic Logic Unit (ALU) operating on a flat addressable memory space. Code written using these programming models was trivially translated to the hardware's execution model and then into hardware instructions. This ensured that the factor most important to performance, the sequence of instructions, was exposed to the programmer, while other factors could either be handled by the compiler, such as register management, or ignored and left to the hardware, such as memory movement. Over time, hardware and execution models have changed, but mainstream programming models have remained the same, resulting in a disparity between the user's way of thinking about programs and what the real hardware is capable of and suited to do. While compilers do their best to hide these changes, decades of compiler research have shown that this is extremely difficult.

The obvious driving force behind the changes to execution models is the ever increasing need for parallelism; due to the lack of single-core performance scaling, the departmental, smaller-scale HPC systems of a few years from now will consist of the same number of processing elements as the world's largest supercomputers do today. Table 1.1 shows examples of the amount of parallelism required to fully utilise some of today's Top500 [19] systems; heroic efforts are required to scale applications to such systems and still produce scientifically relevant output. While improvements in hardware are expected to address some of these challenges by the time a similar amount of parallelism is required on smaller-scale systems, many will have to be addressed by software; the most important of these is parallelism. At the lowest level, in hardware, marginal improvements to instruction scheduling are still being made using techniques such as out-of-order execution, but above that level, in software, parallelism has to be explicit. Despite decades of research, no universal parallelising compiler has ever been demonstrated; it is therefore the responsibility of the programmer to expose parallelism, in extreme cases up to hundreds of millions of ways.
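To make the contrast concrete, the sketch below shows the simplest form of explicit parallelism: even for a trivially data-parallel loop, it is the programmer, not the compiler, who asserts that the iterations are independent. The example uses OpenMP on a hypothetical vector-scaling kernel; compile with an OpenMP-enabled compiler (e.g. gcc -fopenmp).

```c
#include <stdio.h>

/* Hypothetical kernel: scale a vector by a constant.  Even this
 * trivially data-parallel loop is not parallelised automatically; the
 * pragma is the programmer's explicit assertion that the iterations
 * are independent of one another. */
static void scale(double *y, const double *x, double alpha, int n) {
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
    y[i] = alpha * x[i];
}

int main(void) {
  double x[4] = {1, 2, 3, 4}, y[4];
  scale(y, x, 2.0, 4);
  printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
  return 0;
}
```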

Handling locality and coping with the orders-of-magnitude cost difference between computing and moving data across the different levels of the memory hierarchy is the second fundamental programming challenge. This one is perhaps more at odds with conventional programming methodologies, because data movement has been almost completely hidden from the programmer; but just as there is no universal parallelising compiler, it is also very difficult to automatically improve the locality of computational algorithms. Most of the die area on modern CPUs is dedicated to mitigating the cost of data movement when there is locality, but this cannot be sustained: the ratio of memory capacity to computational capacity is constantly decreasing [20, 1]; on-chip memory per thread is still on the order of megabytes for CPUs, but only kilobytes on the Xeon Phi and a few hundred bytes on GPUs. Once again, relying on low-level hardware features (such as caches) is not going to be sufficient to guarantee performance scaling; locality will have to be exposed to the programmer. A simple example from a flow simulation is the computation of fluxes across faces for the entire mesh, streaming data in and out with little data reuse, followed by a step where fluxes are applied to flow variables, once again streaming data in and out with little reuse; this is how most simulations are organised. If instead the computational steps were concatenated for a small number of grid points at a time, data reuse and locality would increase. However, this programming style is highly unconventional and introduces non-trivial data dependencies that need to be handled via some form of tiling [21, 22], resulting in potentially very complex code.
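A minimal sketch of this trade-off is given below, using a 1D regular-grid stand-in for the face-flux example; all names and the physics are illustrative. The first version follows the conventional two-pass structure, the second fuses the two steps so that each flux is consumed as soon as it is produced; on an unstructured mesh or with wider stencils, the same fusion would require the tiling machinery cited above.

```c
/* 1D stand-in for the face-flux example: q holds n+1 point values,
 * flux holds n face values; names and physics are illustrative. */

/* Conventional structure: two separate sweeps over the whole grid,
 * each streaming arrays through the memory hierarchy with no reuse. */
void step_two_pass(double *q, double *flux, int n, double dt) {
  for (int i = 0; i < n; ++i)             /* pass 1: compute fluxes */
    flux[i] = 0.5 * (q[i + 1] - q[i]);
  for (int i = 1; i < n; ++i)             /* pass 2: apply fluxes   */
    q[i] += dt * (flux[i - 1] - flux[i]);
}

/* Fused structure: each flux is consumed as soon as it is produced,
 * so q is touched once and the flux values live in registers.  The
 * neighbour dependency carried here by `prev` is what, in general,
 * requires tiling to handle. */
void step_fused(double *q, int n, double dt) {
  double prev = 0.5 * (q[1] - q[0]);      /* flux at face 0 */
  for (int i = 1; i < n; ++i) {
    double cur = 0.5 * (q[i + 1] - q[i]); /* flux at face i */
    q[i] += dt * (prev - cur);
    prev = cur;
  }
}
```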

Finally, huge parallel processing capabilities and deep memory hierarchies inherently result in load balancing issues, the third fundamental obstacle to programmability according to [23]. As parallelism increases, synchronisation, especially via global barriers, becomes more and more expensive, and statically determined workloads for processing units or threads are vulnerable to drifting out of sync with each other due to slight imbalances in resource allocation or scheduling. Considerable research therefore has to be invested into algorithms that avoid global barriers, and at the same time the performance of local synchronisation will have to be improved. Furthermore, to tolerate latency due to synchronisation and instruction issue stalls, more parallelism can be exploited; this is what GPU architectures already do: by keeping several times more threads around than processing cores, it is possible to hide stalls in some threads by executing others. This can be combined with dynamic load balancing strategies to ensure the number of idle threads is kept to a minimum. Load balancing strategies may, however, conflict with the effort to minimise data movement; especially on architectures with deep memory hierarchies, they may lead to cache invalidation and thrashing. For example, operating systems will move threads between cores to improve resource utilisation, but for memory-intensive applications this often results in performance loss because locality is lost; thread pinning is therefore advantageous in many cases. The idea of moving computations to data is receiving increasing attention, but as with the other programmability issues, it is currently not exposed by most programming abstractions and languages, and is contrary to the usual way of thinking about algorithms.
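As a small illustration, the sketch below contrasts static and dynamic work distribution in OpenMP; the per-cell cost model (a data-dependent number of sub-steps) is hypothetical. Under a static partition, threads that receive expensive cells finish late while the rest sit idle; requesting a dynamic schedule hands out chunks on demand, trading some scheduling overhead, and potentially locality, for better balance.

```c
/* Hypothetical per-cell relaxation where the cost of an iteration
 * varies with the data (substeps[i]).  With the default static
 * schedule the threads drift out of sync; `schedule(dynamic, 64)`
 * instead assigns 64-iteration chunks to threads as they become
 * free.  Compile with e.g. `gcc -fopenmp`. */
void relax_cells(double *cell, const int *substeps, int n) {
  #pragma omp parallel for schedule(dynamic, 64)
  for (int i = 0; i < n; ++i)
    for (int s = 0; s < substeps[i]; ++s)  /* irregular work per cell */
      cell[i] = 0.5 * (cell[i] + cell[i] * cell[i]);
}
```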

Traditionally, the most popular programming abstractions and languages focus on the sequence of computations to be executed in order. However, due to the growing disparity between the cost of computation and the cost of data movement, computations are no longer a scarce resource; studies show the benefits of redundant computation as a means of reducing data movement [24]. We are therefore experiencing a paradigm shift in programming: the factors that affect performance, and thus the factors that should be exposed to the programmer, are changing; concurrency, locality and load balancing are becoming more important, computations less so.
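The simplest form of this trade-off is recomputing a value instead of storing and reloading it, as in the illustrative sketch below (all names are hypothetical): variant A precomputes a per-point coefficient and streams it back in during a second sweep, while variant B recomputes it on the fly, removing the intermediate array entirely at the cost of extra floating-point work. On bandwidth-bound hardware, the redundant computation is often the cheaper option.

```c
#include <math.h>

/* Variant A: precompute a per-point coefficient, then reload it in a
 * second sweep; the array c is written and read back in full. */
void sweep_stored(double *out, const double *x, double *c, int n) {
  for (int i = 0; i < n; ++i)
    c[i] = sqrt(1.0 + x[i] * x[i]);
  for (int i = 0; i < n; ++i)
    out[i] = x[i] / c[i];
}

/* Variant B: recompute the coefficient where it is needed; the sqrt
 * is evaluated at the point of use, but the memory traffic for c
 * disappears entirely. */
void sweep_recomputed(double *out, const double *x, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = x[i] / sqrt(1.0 + x[i] * x[i]);
}
```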

While a growing number of programming languages and extensions aim to address these issues, it is at the same time increasingly difficult to write scientific code that delivers high performance and is portable to current and future architectures, because in-depth knowledge of architectures is often required and hardware-specific optimisations have to be applied. There is therefore a push to raise the level of abstraction: describing what the program has to do instead of describing how exactly to do it, leaving the details to the implementation of the language. Ideally, such a language would deliver generality, productivity and performance, but of course, despite decades of research, no such language exists. Recently, research into Domain Specific Languages (DSLs) [16, 25, 26, 27] applied to different fields of scientific computing has shown that by sacrificing generality, performance and productivity can be achieved. A DSL defines an abstraction for a specific application domain and provides an API that can be used to describe computational problems at a higher level. Based on domain-specific knowledge it can, for example, reorganise computations to improve locality, break the problem up into smaller parts to improve load balancing, and map execution to different hardware, applying architecture-specific optimisations. The key challenge is to define an abstraction specific enough that these optimisations can be applied, yet sufficiently general to support a broad set of applications.
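As an illustration of the "what, not how" separation, the sketch below implements a toy interface loosely modelled on unstructured-mesh DSLs such as OP2; the par_loop function and all names are hypothetical, not a real API. The user supplies only the per-edge kernel and the data it accesses; the loop over the mesh belongs to the library, which is therefore free to reorder, tile, colour or offload it without any change to user code.

```c
#include <stdio.h>

/* The user's "what": physics for one edge, with no loop, no mesh
 * indexing and no assumptions about execution order. */
typedef void (*edge_kernel)(const double *qa, const double *qb,
                            double *flux);

static void flux_kernel(const double *qa, const double *qb,
                        double *flux) {
  *flux = 0.5 * (*qb - *qa);
}

/* The library's "how": here a plain serial loop over the edge set,
 * resolving the edge-to-cell mapping; a real implementation could
 * parallelise, tile or offload behind this same interface. */
static void par_loop(edge_kernel k, int nedges, const int *edge2cell,
                     const double *q, double *flux) {
  for (int e = 0; e < nedges; ++e)
    k(&q[edge2cell[2 * e]], &q[edge2cell[2 * e + 1]], &flux[e]);
}

int main(void) {
  double q[4] = {1.0, 2.0, 4.0, 8.0};
  int edge2cell[6] = {0, 1, 1, 2, 2, 3};  /* 3 edges on a line of cells */
  double flux[3];
  par_loop(flux_kernel, 3, edge2cell, q, flux);
  for (int e = 0; e < 3; ++e)
    printf("flux[%d] = %g\n", e, flux[e]);
  return 0;
}
```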