Massively Parallel Programming with GPGPUs


Written by: Sima Dezső, Szénási Sándor, Tóth Ákos

PARALLEL COMPUTING MODULE

PROACTIVE IT MODULE DEVELOPMENT

Reviewed by: faculty working group


COPYRIGHT:

2011-2016, Dr. Sima Dezső, Szénási Sándor, Tóth Ákos, Óbudai Egyetem (Óbuda University), Neumann János Informatikai Kar (John von Neumann Faculty of Informatics)

REVIEWED BY: faculty working group

Creative Commons NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0)

The work may be freely copied, distributed, published and performed for non-commercial purposes with attribution to the author, but it may not be modified.

SUPPORT:

Prepared within the framework of grant no. TÁMOP-4.1.2-08/2/A/KMR-2009-0053, titled "Proactive IT module development (PRIM1): IT Service Management module and Multithreaded processors and their programming module"

PRODUCED BY: Typotex Kiadó

MANAGING DIRECTOR: Votisky Zsuzsa

ISBN 978-963-279-560-7


KEYWORDS:

GPGPU, graphics card, architectures, CUDA, OpenCL, data parallelism, programming, optimization

SUMMARY:

One of the significant results of the development of processor architectures in recent years is the appearance of general-purpose graphics cards (GPGPUs) and of the software programming environments needed to use them. In this course, students first become acquainted with the general structure of GPGPUs and with the most important representative architectures. They then gain practical knowledge of solving problems through the data-parallel programming model and of speeding up compute-intensive tasks. The course covers the most widespread GPGPU programming environments of today (Nvidia CUDA and OpenCL); students learn their structure and use (device management, memory operations, writing kernels, executing kernels) and then extend their programming skills by solving practical exercises (basic matrix operations, minimum-maximum selection, etc.). Beyond presenting the programming environments, emphasis is placed on the role that the special capabilities of these novel devices play in optimization (use of shared memory, atomic operations, etc.).


Benefits of the portability of the pseudo assembly code

Code refactoring costs are a kind of software maintenance cost that arises when the user switches from a given GPGPU generation to a subsequent one (e.g. from GT200-based devices to GF100- or GF110-based devices) or to a new software environment (e.g. from CUDA 1.x SDK to CUDA 2.x, or from CUDA 3.x SDK to CUDA 4.x SDK).

2. Basics of the SIMT execution (6)


Remark

• For Java there is also an inherent pseudo ISA definition, called the Java bytecode.

• Applications written in Java will first be compiled to the platform independent Java bytecode.

• The Java bytecode will then either be interpreted by the Java Runtime Environment (JRE) installed on the end user’s computer or compiled at run time by the Just-In-Time (JIT) compiler on the end user’s computer.

The virtual machine concept underlying both Nvidia’s and AMD’s GPGPUs is similar to the virtual machine concept underlying Java.


Specification of GPGPU computing at the virtual machine level

At the virtual machine level GPGPU computing is specified by

• the SIMT computational model and

• the related pseudo ISA of the GPGPU.


The SIMT computational model

It covers the following three abstractions:

• the model of computational resources,

• the memory model, and

• the model of SIMT execution.

Figure 2.2: Key abstractions of the SIMT computational model


1. The model of computational resources

It specifies the computational resources available at the virtual machine level (the pseudo ISA level).

• Basic elements of the computational resources are SIMT cores.

• SIMT cores are specific SIMD cores, i.e. SIMD cores enhanced for efficient multithreading.

Efficient multithreading means zero-cycle penalty context switches, to be discussed later.

First, let’s discuss the basic structure of the underlying SIMD cores.

SIMD cores execute the same instruction stream on a number of ALUs (e.g. on 32 ALUs), i.e. all ALUs typically perform the same operations in parallel.

ALUs operate in a pipelined fashion, to be discussed later.

Figure 2.3: Basic structure of the underlying SIMD cores (a Fetch/Decode unit driving a row of ALUs)
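The lockstep behaviour of a SIMD core described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function `simd_execute`, the two-instruction toy program and the list-based lanes are hypothetical and do not correspond to any vendor ISA; the lane count of 32 is the example value used in the text.

```python
# Minimal sketch of a SIMD core: one fetched/decoded instruction stream
# is applied in lockstep to every ALU lane. All names are hypothetical.

NUM_ALUS = 32  # example lane count taken from the text


def simd_execute(program, lanes):
    """Apply each instruction of `program` to all lanes before moving on."""
    for op, operand in program:            # one fetch/decode per instruction
        for lane in range(len(lanes)):     # the same operation on every ALU
            if op == "mul":
                lanes[lane] *= operand
            elif op == "add":
                lanes[lane] += operand
            else:
                raise ValueError(f"unknown op: {op}")
    return lanes


# Each lane holds its own data element; the instruction stream is shared.
program = [("mul", 2.0), ("add", 1.0)]
data = [float(i) for i in range(NUM_ALUS)]
result = simd_execute(program, data)
```

In real hardware the inner loop does not exist: all lanes operate in the same cycle. The loop here only models that every lane sees the same instruction.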


SIMD ALUs operate according to the load/store principle, like RISC processors, i.e.

• they load operands from the memory,

• perform operations in the “register space”, i.e.

  • they take operands from the register file,

  • perform the prescribed operations and

  • store operation results again into the register file, and

• store (write back) final results into the memory.

The load/store principle of operation takes for granted the availability of a register file (RF) for each ALU.

Figure 2.4: Principle of operation of a SIMD ALU (blocks: Memory, Load/Store, RF, ALU)
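The load/store steps listed above can be traced with a toy model of a single ALU. This is a minimal sketch: the dictionary-based `memory`, the eight-entry `rf` and the helper names `load`, `store`, `mul` and `add` are illustrative assumptions, not part of any real instruction set.

```python
# Minimal sketch of the load/store principle for a single ALU.
# Memory layout, register count and helper names are hypothetical.

memory = {"a": 3.0, "b": 4.0, "c": 5.0, "out": 0.0}
rf = [0.0] * 8  # register file (RF) of this ALU


def load(reg, addr):
    """Load an operand from memory into a register."""
    rf[reg] = memory[addr]


def store(reg, addr):
    """Write a final result back from a register to memory."""
    memory[addr] = rf[reg]


def mul(dst, src1, src2):
    """Arithmetic touches only the register file, never memory."""
    rf[dst] = rf[src1] * rf[src2]


def add(dst, src1, src2):
    rf[dst] = rf[src1] + rf[src2]


# 1. load operands, 2. compute in the register space, 3. write back
load(0, "a")
load(1, "b")
load(2, "c")
mul(3, 0, 1)      # r3 = a * b
add(3, 3, 2)      # r3 = a * b + c (intermediate result stays in the RF)
store(3, "out")
```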


As a consequence of the chosen principle of execution, each ALU is allocated a register file (RF), that is, a number of working registers.

Figure 2.5: Main functional blocks of a SIMD core (Fetch/Decode unit, ALUs, and one RF per ALU)


Remark

The register sets (RF) allocated to each ALU are actually parts of a large enough register file.

Figure 2.6: Allocation of distinct parts of a large register file to the private register sets of the ALUs
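The remark above can be pictured as index arithmetic over one flat array. The sketch below assumes, purely for illustration, 8 ALUs with 16 registers each; the sizes and the helper names `reg_index`, `write_reg` and `read_reg` are hypothetical.

```python
# Minimal sketch: one large register file partitioned into private
# register sets, one per ALU. Sizes and names are hypothetical.

NUM_ALUS = 8
REGS_PER_ALU = 16
register_file = [0.0] * (NUM_ALUS * REGS_PER_ALU)


def reg_index(alu, reg):
    """Map (ALU, private register number) to a slot of the large file."""
    if not (0 <= alu < NUM_ALUS and 0 <= reg < REGS_PER_ALU):
        raise IndexError("ALU or register number out of range")
    return alu * REGS_PER_ALU + reg


def write_reg(alu, reg, value):
    register_file[reg_index(alu, reg)] = value


def read_reg(alu, reg):
    return register_file[reg_index(alu, reg)]


# The same private register number on two ALUs maps to distinct slots.
write_reg(0, 3, 1.5)
write_reg(1, 3, 2.5)
```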


Basic operations of the underlying SIMD ALUs

• They basically execute FP32 Multiply-Add instructions of the form a × b + c,

• and are pipelined, i.e.

  • capable of starting a new operation every new clock cycle (more precisely, every new shader clock cycle), and

  • need a few clock cycles, e.g. 2 or 4 shader cycles, to present the results of the FP32 Multiply-Add operations to the RF.

Without further enhancements the peak performance of the ALUs is 2 FP32 operations/cycle, since each Multiply-Add counts as two operations (one multiplication and one addition).
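The effect of pipelining and the peak-rate figure can be checked with simple arithmetic. The sketch below assumes an idealized pipeline with no stalls; the helper names and the example values (100 operations, a 4-cycle latency, 32 ALUs, a 1 GHz shader clock) are hypothetical and serve only as an illustration.

```python
# Minimal sketch of pipelined Multiply-Add timing and peak throughput.
# Assumes an ideal pipeline without stalls; example values are hypothetical.


def pipelined_cycles(num_ops, latency):
    """Cycles until the last result reaches the RF when one new
    operation is started every cycle and each takes `latency` cycles."""
    if num_ops == 0:
        return 0
    return latency + num_ops - 1


def peak_fp32_ops_per_second(num_alus, shader_clock_hz):
    """Each ALU completes one Multiply-Add per cycle, i.e. 2 FP32 ops."""
    return 2 * num_alus * shader_clock_hz


cycles = pipelined_cycles(100, 4)            # 100 Multiply-Adds, 4-cycle latency
peak = peak_fp32_ops_per_second(32, 1.0e9)   # 32 ALUs at a 1 GHz shader clock
```

Under these assumptions the latency is paid only once: for long instruction sequences the throughput approaches one Multiply-Add per cycle per ALU.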


Beyond the basic operations the SIMD cores provide a set of further computational capabilities, such as

• FX32 operations,

• FP64 operations,

• FX/FP conversions,

• single precision trigonometric functions (to calculate reflections, shading etc.).

Note

Computational capabilities specified at the pseudo ISA level (intermediate level) are

• typically implemented in hardware.

Nevertheless, it is also possible to implement some compute capabilities

• by firmware (i.e. microcoded),

• or even by emulation during the second phase of compilation.


Enhancing SIMD cores to SIMT cores

SIMT cores are enhanced SIMD cores that provide effective support for multithreading.

Aim of multithreading in GPGPUs

Speeding up computations by eliminating thread stalls due to long latency operations.

Achieved by suspending stalled threads from execution and allocating free computational resources to runnable threads.

This makes it possible to lay less emphasis on the implementation of sophisticated cache systems and to use the freed silicon area (otherwise used for implementing caches) for performing computations.
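How suspending stalled threads hides long latencies can be shown with a toy scheduler. This is a sketch under simplifying assumptions that are not from the source: every thread alternates between one compute cycle and one memory access of fixed latency, the core issues at most one instruction per cycle, and switching threads costs nothing. The function name `utilization` and the parameter values are hypothetical.

```python
# Minimal sketch of latency hiding by multithreading.
# Assumptions (not from the source): each thread issues one compute
# instruction, then stalls for `mem_latency` cycles; the core issues at
# most one instruction per cycle; thread switches cost zero cycles.


def utilization(num_threads, mem_latency, total_cycles):
    """Fraction of cycles in which some runnable thread could issue."""
    ready_at = [0] * num_threads      # cycle at which each thread is runnable
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:  # pick the first runnable thread
                busy += 1
                ready_at[t] = cycle + 1 + mem_latency  # stalls after issuing
                break                 # one issue per cycle
    return busy / total_cycles


one_thread = utilization(1, 9, 1000)     # the core idles during every stall
ten_threads = utilization(10, 9, 1000)   # enough threads to cover the latency
```

In this model a single thread keeps the core busy one cycle in ten, while ten threads keep it busy in every cycle, without any cache.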


Effective implementation of multithreading

requires that thread switches, called context switches, do not cause cycle penalties.

Achieved by

• providing and maintaining separate contexts for each thread, and

• implementing a zero-cycle context switch mechanism.


SIMT cores = SIMD cores with per thread register files (designated as CTX in the figure)

Figure 2.7: SIMT cores are specific SIMD cores providing separate thread contexts for each thread (a Fetch/Decode unit, ALUs, and per-thread contexts (CTX) held in the register file (RF); a context switch selects the actual context)
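The idea behind a zero-cycle context switch can be sketched in software: if every thread's registers stay resident in the register file, switching threads amounts to changing which context is selected, with nothing saved or restored. The class `SimtLane` and its methods are hypothetical names used only for this illustration.

```python
# Minimal sketch of per-thread contexts (CTX) for a single ALU lane.
# Class and method names are hypothetical.


class SimtLane:
    def __init__(self, num_threads, regs_per_thread):
        # One resident register set (context) per thread.
        self.contexts = [[0.0] * regs_per_thread for _ in range(num_threads)]
        self.active = 0  # index of the actual context

    def switch_to(self, thread_id):
        """Context switch: only the selector changes, no register is copied."""
        if not 0 <= thread_id < len(self.contexts):
            raise IndexError("no such thread context")
        self.active = thread_id

    def write(self, reg, value):
        self.contexts[self.active][reg] = value

    def read(self, reg):
        return self.contexts[self.active][reg]


lane = SimtLane(num_threads=4, regs_per_thread=8)
lane.write(0, 1.0)        # thread 0 writes its register 0
lane.switch_to(1)
lane.write(0, 2.0)        # thread 1 writes its own register 0
lane.switch_to(0)
value_thread0 = lane.read(0)
lane.switch_to(1)
value_thread1 = lane.read(0)
```

Each thread finds its register contents unchanged after being switched out and back in, which is what allows the switch to proceed without a cycle penalty.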