
2.2 Graphics Processing Units

2.2.1 NVidia Kepler architecture

The Kepler architecture was released in 2012, and it can be regarded as the 4th major revision of the GPU architecture since the introduction of the unified shader architecture. Compared to the previous generation (called Fermi), the manufacturer primarily focused on minimizing power consumption, and in the high-performance GK110 chip the double-precision performance was also increased. The main parameters of the NVIDIA Tesla products built upon the Kepler or the Fermi architectures are summarized in Table 2.2. The new architecture was reported to be 3 times more power efficient than the Fermi architecture [19] and introduced several new features, such as Dynamic Parallelism or NVidia GPUDirect. Dynamic Parallelism enables the programmer to write smarter kernels, which can dispatch new kernels without host intervention; a short sketch of such a kernel is given below. NVidia GPUDirect is a new communication facility, in which the GPU memory can be directly accessed via the PCI Express interface, eliminating CPU bandwidth and latency bottlenecks. The GPU can be directly connected to a network interface controller (NIC) to exchange data with other GPUs via Remote Direct Memory Access (RDMA). The GPU can also be connected to other 3rd party devices, e.g. storage devices.
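As a minimal sketch of Dynamic Parallelism, consider a parent kernel that launches a child kernel entirely on the device (the kernel names and the work performed are illustrative assumptions, not taken from a specific application); compiling such code requires compute capability 3.5 and relocatable device code (nvcc -arch=sm_35 -rdc=true -lcudadevrt):

    __global__ void child_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;  /* illustrative per-element work */
    }

    __global__ void parent_kernel(float *data, int n)
    {
        /* One thread decides at run time how much further work is
           needed and dispatches the child kernel without returning
           control to the host. */
        if (blockIdx.x == 0 && threadIdx.x == 0)
            child_kernel<<<(n + 255) / 256, 256>>>(data, n);
    }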

2.2.1.1 The general structure

A schematic block diagram of the Kepler GK110 chip is displayed in Figure 2.3. The chip is associated with CUDA Compute Capability 3.5, which is the revision number of the underlying architecture and determines the available CUDA features. The architecture contains 15 Streaming Multiprocessors (SMX) and six 64-bit memory controllers.

Each SMX contains 192 single-precision CUDA cores, 64 double-precision units, 32 special function units, 65,536 32-bit registers, 64 KB shared memory, a 48 KB read-only data cache, and 4 warp schedulers. The SMX supports the IEEE 754-2008 standard for single- and double-precision floating-point operations (e.g. fused multiply-add) and can execute 192 single-precision or 64 double-precision operations per cycle. The special function units can be used to approximate transcendental functions such as trigonometric functions.

Table 2.2: NVIDIA Tesla product line and the GTX 570 GPU, which was also tested in the dissertation.

Parameter                                    GTX 570  C2050  C2070  M2090     K10    K20   K20X
Number of CUDA cores                             448    448    448    512  2x1536   2496   2688
Core clock frequency (MHz)                      1464   1150   1150   1300     745    706    732
Onboard memory size (GB)                         1.3      3      6      6       8      5      6
Onboard memory bandwidth (GB/s)                  152    148    150    177     320    208    250
Peak double-precision performance (GFlops)       175    515    515    665     190   1170   1310
Peak single-precision performance (GFlops)      1405   1030   1030   1331    4580   3520   3950
Compute capability                               2.0    2.0    2.0    2.0     3.0    3.5    3.5
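As a worked check of these per-cycle figures against Table 2.2 (counting a fused multiply-add as two floating-point operations), the K20, in which 13 of the 15 SMX units are enabled (2496/192 = 13), offers 2496 x 0.706 GHz x 2 ≈ 3520 single-precision GFlops, while its 13 x 64 = 832 double-precision units give 832 x 0.706 GHz x 2 ≈ 1170 GFlops, matching the peak values in the table.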

2.2.1.2 CUDA programming

The CUDA SDK, which debuted in 2006, is a general computing framework and a programming model that enables developers to program the CUDA capable devices of NVidia. Programs running on CUDA capable devices are called kernels. Kernels are programmed in CUDA C, which is standard C with some extensions. Kernels can be dispatched from various supported high-level programming languages such as C/C++ or Fortran, and there are also CUDA libraries, which collect kernels written for specific applications (e.g. cuBLAS for Basic Linear Algebra Subroutines); a sketch of the library route is shown below.
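As a minimal sketch of the library route (assuming d_x and d_y are device arrays of length n that have already been allocated and filled), the SAXPY operation y = a*x + y can be performed by a cuBLAS kernel without writing any device code by hand:

    #include <cublas_v2.h>

    /* y = a*x + y computed on the GPU by a library-provided kernel. */
    void saxpy_on_gpu(int n, float a, const float *d_x, float *d_y)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);
        cublasDestroy(handle);
    }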

Figure 2.3: A schematic block diagram of the Kepler GK110 chip. Image source: NVidia Kepler Whitepaper [19]. The chip contains 15 Streaming Multiprocessors (SMX). Cache memories, single-precision CUDA cores, and double-precision units are indicated by blue, green, and orange, respectively.

When a kernel is dispatched, several threads are started to execute the same kernel code on different input data. This mechanism is called Single Instruction Multiple Threads (SIMT), and one of the key differences from Single Instruction Multiple Data (SIMD) is that threads can access the input data via an arbitrary access pattern. Threads are organized into a thread hierarchy, which is an important concept in CUDA programming. The programmer can determine the number and the topology (1D, 2D, or 3D) of threads to form a thread block, and several thread blocks can be defined to form a grid. The total number of threads shall match the size of the problem that the threads have to solve; a sketch of a typical launch configuration is given below. During execution the thread blocks are distributed among the available streaming multiprocessors.
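As a sketch of such a launch configuration (the kernel name and array arguments are illustrative assumptions), each thread derives a unique global index from its position in the hierarchy, and the host rounds the grid size up so that the total number of threads covers the whole problem:

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        /* Unique global index built from block and thread coordinates. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)  /* guard: the last block may contain surplus threads */
            c[i] = a[i] + b[i];
    }

    /* Host side: a 1D grid of 256-thread blocks covering all n elements. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);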

The second important concept in CUDA programming is the memory hierarchy.

At the thread level, each thread can use private (local) registers allocated in the register file of the SMX, which is the fastest memory. At the thread block level, the threads of a block can access a shared memory allocated in the shared memory of the SMX. Finally, at the grid level, all threads can access the on-board GDDR memory, which is the largest but slowest memory on the GPU card.
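A minimal sketch touching all three levels (assuming one-dimensional blocks of exactly 256 threads; the kernel name is an illustrative assumption) is a block-wise sum reduction, in which each thread holds its value in a register, the partial sums travel through shared memory, and the per-block results are written back to global memory:

    __global__ void block_sum(const float *in, float *out, int n)
    {
        __shared__ float buf[256];          /* block level: shared memory */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;   /* thread level: private register */
        buf[threadIdx.x] = v;
        __syncthreads();
        /* Tree reduction within the block through shared memory. */
        for (int s = blockDim.x / 2; s > 0; s /= 2) {
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = buf[0];       /* grid level: global (GDDR) memory */
    }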

When thread blocks are assigned to an SMX, the threads of the assigned blocks can be executed concurrently; even the execution of threads of different blocks can be overlapped. The SMX handles the threads in groups of 32, called warps. The SMX distributes the warps between its four warp schedulers, and each scheduler schedules the execution of the assigned warps to hide various latencies. Each scheduler can issue two independent instructions for one of its warps per clock cycle, that is, an SMX can issue eight instructions per clock cycle if the required instruction-level parallelism and functional units are available. As a warp contains 32 threads, one instruction corresponds to 32 operations. To give an example, 6 single-precision instructions can be executed on the 192 CUDA cores and 2 double-precision instructions can be executed on the 64 double-precision computing units. As there is no branch prediction, all threads of a warp shall agree on their execution path for maximal performance; otherwise the two sides of a branch are executed serially within the warp, as illustrated below.
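The sketch below (an illustrative kernel, not taken from the dissertation) contrasts a divergent branch with a warp-uniform one; in the first case odd and even lanes of the same warp take different paths and are serialized, while in the second all 32 threads of a warp agree, so no serialization occurs:

    __global__ void branching(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        /* Divergent: lanes of the same warp disagree, so the hardware
           executes the two paths one after the other. */
        if (i % 2 == 0)
            x[i] += 1.0f;
        else
            x[i] -= 1.0f;

        /* Warp-uniform: the condition is constant within each group of
           32 consecutive threads, so each warp takes a single path. */
        if ((i / 32) % 2 == 0)
            x[i] *= 2.0f;
        else
            x[i] *= 0.5f;
    }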

2.2.1.3 NVidia K20

The NVIDIA Tesla K20 graphics processing unit (GPU), which is utilized in the dissertation, is an extension board containing a single GK110 chip. The board is connected to the host system via an x16 PCI Express Generation 2 interface, which provides 8 GB/s communication bandwidth. The board contains 5 GB GDDR5 memory, which is accessed with a peak bandwidth of 208 GB/s (see Table 2.2). The estimated power consumption of the full board during operation is approximately 225 W.
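These board parameters can be verified at run time with the CUDA runtime API; the short sketch below queries them for device 0 and, on a K20, should report 13 multiprocessors, roughly 5 GB of global memory, and compute capability 3.5:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  /* properties of the first GPU */
        printf("Device:             %s\n", prop.name);
        printf("Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("Global memory:      %.1f GB\n", prop.totalGlobalMem / 1e9);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        return 0;
    }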

Chapter 3

Solving Partial Differential Equations