
5.1 Nvidia's Fermi family of cores

5.1.1 Introduction to Nvidia's Fermi family of cores

5.1.2 Nvidia's Parallel Thread eXecution (PTX) Virtual Machine concept

5.1.3 Key innovations of Fermi's PTX 2.0

5.1.4 Nvidia's high level data parallel programming model

5.1.5 Major innovations and enhancements of Fermi's microarchitecture

5.1.6 Microarchitecture of Fermi GF100

5.1.7 Comparing key features of the microarchitectures of Fermi GF100 and the predecessor GT200

5.1.8 Microarchitecture of Fermi GF104

5.1.9 Microarchitecture of Fermi GF110

5.1.10 Evolution of key features of the microarchitecture of Nvidia's GPGPU lines

5.1.1 Introduction to Nvidia’s Fermi family of cores

5.1.1 Introduction to Nvidia’s Fermi family of cores (1)

Announced: Sept. 30, 2009 at Nvidia's GPU Technology Conference; available: 1Q 2010 [83]

Sub-families of Fermi

Fermi includes three sub-families with the following representative cores and features:

GPGPU    Available since    Max. no. of cores    Max. no. of ALUs    No. of transistors (millions)    Compute capability    Aimed at
GF100    3/2010             16¹                  512¹                3200                             2.0                   Gen. purpose
GF104    7/2010             8                    384                 1950                             2.1                   Graphics
GF110    11/2010            16                   512                 3000                             2.0                   Gen. purpose

¹ In the associated flagship card (GTX 480), however, one of the SMs has been disabled due to overheating problems, so it actually has only 15 SIMD cores (called Streaming Multiprocessors (SMs) by Nvidia) and 480 FP32 EUs [69].

5.1.1 Introduction to Nvidia’s Fermi family of cores (2)

Terminology of these slides vs. Nvidia's and AMD/ATI's terminology:

• CBA: Core Block Array (in Nvidia's G80/G92/GT200); CA: Core Array (else)
  - Nvidia: SPA (Streaming Processor Array), built of (7-10) TPCs in G80/G92/GT200 or (8-16) SMs in the Fermi line
  - AMD/ATI: (4-24) x SIMD Cores; SIMD Array; Data Parallel Processor (DPP) Array; Compute Device; Stream Processor

• CB: Core Block (in Nvidia's G80/G92/GT200)
  - Nvidia: TPC (Texture Processor Cluster), built of (2-3) Streaming Multiprocessors

• SIMD Core
  - Nvidia: Streaming Multiprocessor (SM); G80-GT200: scalar issue to a single pipeline, GF100/110: scalar issue to dual pipelines, GF104: 2-way superscalar issue to dual pipelines
  - AMD/ATI: SIMD Engine (pre-OpenCL term); Data Parallel Processor (DPP); Compute Unit, with 16 x (VLIW4/VLIW5) ALUs

• ALU: Arithmetic Logic Unit (ALU)
  - Nvidia: Streaming Processor; CUDA Core
  - AMD/ATI: VLIW4/VLIW5 ALU; Stream core (in OpenCL SDKs); Compute Unit Pipeline (6900 ISA); SIMD pipeline (pre-OpenCL term); Thread processor (pre-OpenCL term); Shader processor (pre-OpenCL term)

• EU: Execution Unit (EU), e.g. FP32 units etc.
  - Nvidia: FP Units; FX Units
  - AMD/ATI: Stream cores (in ISA publications); Processing elements; Stream Processing Units; ALUs (in ISA publications)

5.1.1 Introduction to Nvidia’s Fermi family of cores (3)

5.1.2 Nvidia's Parallel Thread eXecution (PTX) Virtual Machine concept

5.1.2 Nvidia's PTX Virtual Machine Concept (1)

The PTX Virtual Machine concept consists of two related components:

• a parallel computational model and

• the ISA of the PTX virtual machine (Instruction Set Architecture), which is a pseudo ISA, since programs compiled to the PTX ISA are not directly executable but need a further compilation to the ISA of the target GPGPU.

The parallel computational model underlies the PTX virtual machine.

The parallel computational model of PTX

It is based on three key abstractions:

a) The model of computational resources

b) The memory model

c) The data parallel execution model, covering
   c1) the mapping of execution objects to the execution resources (the parallel machine model),
   c2) the data sharing concept and
   c3) the synchronization concept.

These models are only outlined here; a detailed description can be found in the related documentation [147].

The parallel computational model of PTX underlies both the PTX ISA and the CUDA language.

5.1.2 Nvidia's PTX Virtual Machine Concept (2)

Remark

The outlined abstractions remained basically unchanged through the life span of PTX (from version 1.0 (6/2007) to version 2.3 (3/2011)).

5.1.2 Nvidia's PTX Virtual Machine Concept (3)

a) The model of computational resources [147]

• A set of SIMD cores with on-chip shared memory

• A set of ALUs within the SIMD cores

5.1.2 Nvidia's PTX Virtual Machine Concept (4)

b) The memory model [147]

[Figure: the memory spaces of the PTX virtual machine, including the per-thread register space; main features of the memory spaces]

5.1.2 Nvidia's PTX Virtual Machine Concept (5)

c) The data parallel execution model-1 [147] (SIMT model)

The execution model is based on

• a set of SIMT capable SIMD processors, designated in our slides as SIMD cores (called Multiprocessors in the Figure and the subsequent description), and

• a set of ALUs (whose capabilities are declared in the associated ISA), designated as Processors in the Figure and the subsequent description.

5.1.2 Nvidia's PTX Virtual Machine Concept (6)

A concise overview of the execution model is given in Nvidia's PTX ISA description, which is worth citing.

5.1.2 Nvidia's PTX Virtual Machine Concept (7)

c1) The data parallel execution model-2 [147]

5.1.2 Nvidia's PTX Virtual Machine Concept (8)

c1) The data parallel execution model-3 [147]

[Figure: the memory spaces available to a thread, including the per-thread register space; main features of the memory spaces]

c2) The data sharing concept-1 [147]

The data parallel model allows threads within a CTA to share data by means of a Shared Memory, declared in the platform model, that is allocated to each SIMD core.

[Figure: a set of SIMD cores with on-chip shared memory; a set of ALUs within the SIMD cores]

5.1.2 Nvidia's PTX Virtual Machine Concept (9)

c2) The data sharing concept-2 [147]

c3) The synchronization concept [147]

• Sequential consistency is provided by barrier synchronization (implemented by the bar.sync instruction).

• Threads wait at the barrier until all threads in the CTA have arrived.

• In this way all memory writes issued prior to the barrier are guaranteed to have stored their data before reads issued after the barrier access the referenced data, providing memory consistency (see the sketch below).
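A minimal sketch of how this barrier is used in practice (standard CUDA C syntax; the kernel name and array size are illustrative only; __syncthreads() is compiled to the PTX bar.sync instruction):

    // Each thread writes its element to the on-chip shared memory of the SIMD core,
    // then the whole CTA synchronizes before any thread reads a neighbour's value.
    __global__ void shift_left(float *out, const float *in, int n)
    {
        __shared__ float buf[256];      // assumes a block size of at most 256 threads
        int i = threadIdx.x;
        if (i < n)
            buf[i] = in[i];
        __syncthreads();                // barrier: emitted as "bar.sync 0" in PTX
        if (i < n - 1)
            out[i] = buf[i + 1];        // safe: the write of thread i+1 is already visible
    }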

5.1.2 Nvidia's PTX Virtual Machine Concept (10)

The ISA of the PTX virtual machine

It is the definition of a pseudo ISA for GPGPUs that

• is close to the "metal" (i.e. to the actual ISA of GPGPUs) and

• serves as the hardware independent target code for compilers, e.g. for CUDA or OpenCL.

[Figure: compilation to PTX pseudo ISA instructions, then translation to the executable CUBIN file at load time]

5.1.2 Nvidia's PTX Virtual Machine Concept (11)

Two-phase compilation

The PTX virtual machine concept gives rise to a two-phase compilation process.

1) First, the application, e.g. a CUDA C or OpenCL program, is compiled to a pseudo code, also called PTX ISA code or PTX code, by the appropriate compiler (the CUDA C compiler or the OpenCL compiler); the PTX code is stored in text format.

The PTX code is a pseudo code since it is not directly executable and needs to be translated to the actual ISA of a given GPGPU to become executable.

[Figure: the application (CUDA C/OpenCL file) is compiled in the first phase to the PTX ISA format (pseudo ISA instructions, stored in text format) and in the second phase (during loading) JIT-compiled by the CUDA driver to executable object code, called the CUBIN file]

5.1.2 Nvidia's PTX Virtual Machine Concept (12)

2) In order to become executable, the PTX code needs to be compiled to the actual ISA code of a particular GPGPU, resulting in the so-called CUBIN file (runnable on the GPGPU).

This compilation is performed by the CUDA driver while loading the program (Just-In-Time).

5.1.2 Nvidia’s PTX Virtual Machine Concept (13)

[Figure repeated: first phase: compilation to the PTX ISA format (pseudo ISA instructions, stored in text format); second phase (during loading): JIT compilation to executable object code, called the CUBIN file]

Benefit of the Virtual machine concept

• The compiled pseudo ISA code (PTX code) remains in principle independent of the actual hardware implementation of a target GPGPU, i.e. it is portable over subsequent GPGPU families.

Porting a PTX file to a GPGPU of a lower compute capability level, however, may need emulation of features not implemented in hardware, which slows down execution.

Forward portability of GPGPU code (CUBIN code), by contrast, is provided only within major compute capability versions.

• Forward portability of PTX code is highly advantageous in the recent rapid evolution phase of GPGPU technology as it results in less costs for code refactoring.

Code refactoring costs are a kind of software maintenance costs that arise when the user switches from a given generation to a subsequent GPGPU generation (like from GT200 based devices to GF100 or GF110-based devices) or to a new software environment (like from CUDA 1.x SDK to CUDA 2.x or CUDA 3.x SDK).

5.1.2 Nvidia’s PTX Virtual Machine Concept (14)

Remarks [149]

5.1.2 Nvidia's PTX Virtual Machine Concept (15)

1) Designation of the compute capability versions

Nvidia manages the evolution of their devices and programming environment by maintaining compute capability versions of both

• their intermediate virtual PTX architectures (PTX ISA) and

• their real architectures (GPGPU ISA).

Designation of the versions:

• Subsequent versions of the intermediate PTX ISA are designated as PTX ISA 1.x or 2.x.

• Subsequent versions of GPGPU ISAs are designated as sm_1x/sm_2x or simply by 1.x/2.x.

• The first digit (1 or 2) denotes the major version number; the second or any subsequent digit denotes the minor version.

• Major versions 1.x or 1x relate to pre-Fermi solutions, whereas 2.x or 2x relate to Fermi based solutions.

Remarks (cont.) [149]

Main facts concerning the compute capability versions are summarized below.

5.1.2 Nvidia's PTX Virtual Machine Concept (16)

• Correspondence of the PTX ISA and GPGPU ISA compute capability versions

Until now there has been a one-to-one correspondence between the PTX ISA versions and the GPGPU ISA versions, i.e. PTX ISA versions and GPGPU ISA versions with the same major and minor version number have the same compute capability.

However, there is no guarantee that this one-to-one correspondence will remain valid in the future.
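As a practical aside, an application can query the compute capability of the installed GPGPU at run time; a minimal sketch using the standard CUDA runtime API (device 0 is assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);      // query device 0
        // prop.major / prop.minor hold the GPGPU compute capability,
        // e.g. 2.0 for GF100/GF110 or 2.1 for GF104 based cards
        std::printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        return 0;
    }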

a) Functional features provided by the compute capability versions of Nvidia's GPGPUs and virtual PTX ISAs [81]

5.1.2 Nvidia’s PTX Virtual Machine Concept (17)

b) Device parameters bound to the compute capability versions of Nvidia’s GPGPUs and virtual PTX ISAs [81]

5.1.2 Nvidia’s PTX Virtual Machine Concept (18)

c) Supported compute capability versions of Nvidia's GPGPU cards [81]

Capability vers. (sm_xy): GPGPU cores; GPGPU devices

• 10: cores: G80; devices: GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M

• 11: cores: G86, G84, G98, G96, G96b, G94, G94b, G92, G92b; devices: GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50

• 12: cores: GT218, GT216, GT215; devices: GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M

• 13: cores: GT200, GT200b; devices: GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800

• 20: cores: GF100, GF110; devices: GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex7000, GTX570, GTX580

• 21: cores: GF108, GF106, GF104, GF114; devices: GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti, GTX 560Ti

5.1.2 Nvidia’s PTX Virtual Machine Concept (19)

d) Compute capability versions of PTX ISAs generated by subsequent releases of CUDA SDKs and supported GPGPUs (designated as Targets in the Table) [147]

[Table: PTX ISA 1.x/sm_1x pre-Fermi implementations; PTX ISA 1.x/sm_1x Fermi implementations]

5.1.2 Nvidia’s PTX Virtual Machine Concept (20)

e) Forward portability of PTX code [52]

Applications compiled for pre-Fermi GPGPUs that include PTX versions of their kernels should work as-is on Fermi GPGPUs as well.

f) Compatibility rules of object files (CUBIN files) compiled to a particular GPGPU compute capability version [52]

The basic rule is forward compatibility within the main versions (versions sm_1x and sm_2x), but not across main versions.

This is interpreted as follows:

Object files (CUBIN files) compiled to a particular GPGPU compute capability version are supported on all devices having the same or a higher version number within the same main version.

E.g. object files compiled to compute capability 1.0 are supported on all 1.x devices but are not supported on compute capability 2.0 (Fermi) devices.

For more details see [52].

5.1.2 Nvidia’s PTX Virtual Machine Concept (21)

Remarks (cont.)

2) Contrasting the virtual machine concept with traditional computer technology

Whereas the PTX virtual machine concept is based on a forward compatible but not directly executable compiler's target code (pseudo code), in traditional computer technology the compiled code, such as x86 object code, is immediately executable by the processor.

Earlier CISC processors, like Intel's x86 processors up to the Pentium, executed x86 code directly in hardware.

Subsequent CISCs, beginning with 2nd generation superscalars (like the Pentium Pro) and including current x86 processors, like Intel's Nehalem (2008) or AMD's Bulldozer (2011), map x86 CISC instructions during decoding first to internally defined RISC instructions.

In these processors a ROM-based µcode engine (i.e. firmware) supports the decoding of complex x86 instructions (instructions which need more than 4 RISC instructions).

The RISC core of the processor then executes the requested RISC operations directly.

Figure 5.1.1: Hardware/firmware mapping of x86 instructions to directly executable RISC instructions in Intel's Nehalem [51]

5.1.2 Nvidia’s PTX Virtual Machine Concept (22)

Remarks (cont.)

3) Nvidia's CUDA compiler (nvcc) has been designated as the CUDA C compiler beginning with CUDA version 3.0, to stress the support of C.

5.1.2 Nvidia’s PTX Virtual Machine Concept (23)

[Figure: two-phase compilation of an application (CUDA C/OpenCL file) by the CUDA C compiler: first to PTX code (pseudo ISA instructions), then during loading (JIT compilation) to executable object code, called the CUBIN file. Forward compatibility of the PTX code is provided within the same major compute capability version no. (1.x or 2.x); there is no forward compatibility for the CUBIN file, which is bound to the given compute capability.]

Remarks (cont.)

4) nvcc can be used to generate both architecture specific files (CUBIN files) and forward compatible PTX versions of the kernels [52], as illustrated below.
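A hedged sketch of typical nvcc invocations for the two kinds of output (the file names are made up for the example; the exact flag spellings are those documented in the nvcc manual):

    # Forward compatible PTX version of the kernels (text format)
    nvcc -ptx -arch=compute_20 kernels.cu -o kernels.ptx

    # Architecture specific CUBIN file, bound to compute capability 2.0 devices
    nvcc -cubin -arch=compute_20 -code=sm_20 kernels.cu -o kernels.cubin

    # Fat binary embedding both a PTX version (JIT-compiled at load time)
    # and a ready-made sm_20 CUBIN
    nvcc -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_20 kernels.cu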

5.1.2 Nvidia’s PTX Virtual Machine Concept (24)

Remarks (cont.)

5.1.2 Nvidia’s PTX Virtual Machine Concept (25)

The virtual machine concept underlying both Nvidia's and AMD's GPGPUs is similar to the virtual machine concept underlying Java.

• For Java there is also an inherent computational model and a pseudo ISA, called the Java bytecode.

• Applications written in Java are first compiled to the platform independent Java bytecode.

• The Java bytecode is then either interpreted by the Java Runtime Environment (JRE) installed on the end user's computer or compiled at run time by its Just-In-Time (JIT) compiler.

5.1.3 Key innovations of Fermi’s PTX 2.0

5.1.3 Key innovations of Fermi’s PTX 2.0 (1)

• Fermi's underlying pseudo ISA is the 2nd generation PTX 2.x (Parallel Thread eXecution) ISA, introduced along with the Fermi line.

• PTX 2.x is a major redesign of the PTX 1.x ISA towards a more RISC-like load/store architecture, rather than an x86-like memory based architecture.

With PTX 2.0 Nvidia states that they have created a longevity ISA for GPUs, like the x86 ISA for CPUs.

Based on the key innovations and declared goals of Fermi's ISA (PTX 2.0), and considering the significant innovations and enhancements made in the microarchitecture, it can be expected that Nvidia's GPGPUs have entered a phase of relative consolidation.

Overview of PTX 2.0

a) Unified address space for all variables and pointers with a single set of load/store instructions

b) 64-bit addressing capability

c) New instructions to support the OpenCL and DirectCompute APIs

d) Full support of predication

e) Full IEEE 754-2008 support for 32-bit and 64-bit FP precision

These new features greatly improve GPU programmability, accuracy and performance.

Key innovations of PTX 2.0

5.1.3 Key innovations of Fermi’s PTX 2.0 (2)

• In PTX 1.0 there are three separate address spaces (thread-private local, block-shared and global), with specific load/store instructions for each of the three address spaces.

• Programs could load or store values in a particular target address space at addresses that became known at compile time.

It was difficult to fully implement C and C++ pointers since a pointer's target address space could only be determined dynamically at run time.

a) Unified address space for all variables and pointers with a single set of load/store instructions-1 [58]

5.1.3 Key innovations of Fermi’s PTX 2.0 (3)

a) Unified address space for all variables and pointers with a single set of load/store instructions-2 [58]

• PTX 2.0 unifies all three address spaces into a single continuous address space that can be accessed by a single set of load/store instructions.

• PTX 2.0 allows the use of unified pointers to pass objects in any memory space, and Fermi's hardware automatically maps pointer references to the correct memory space.

Thus the concept of the unified address space enables Fermi to support C++ programs.
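A small hedged CUDA C sketch of what this enables on compute capability 2.x devices (the function and variable names are illustrative): the same device function can be handed a pointer into global or shared memory, and the hardware resolves the generic address at run time.

    __device__ float sum3(const float *p)        // p may point to global OR shared memory
    {
        return p[0] + p[1] + p[2];               // generic load instructions, no per-space variants
    }

    __global__ void demo(const float *gdata, float *out)
    {
        __shared__ float sdata[3];
        if (threadIdx.x < 3)
            sdata[threadIdx.x] = 2.0f * gdata[threadIdx.x];
        __syncthreads();
        if (threadIdx.x == 0)
            out[0] = sum3(gdata) + sum3(sdata);  // same callee for both memory spaces
    }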

5.1.3 Key innovations of Fermi’s PTX 2.0 (4)

b) 64-bit addressing capability

• Nvidia's previous generation GPGPUs (G80, G92, GT200) provide 32-bit addressing for load/store instructions.

• PTX 2.0 extends the addressing capability to 64-bit for future growth; however, current Fermi implementations use only 40-bit addresses, allowing an address space of 1 Terabyte to be accessed.

5.1.3 Key innovations of Fermi’s PTX 2.0 (5)

c) New instructions to support the OpenCL and DirectCompute APIs

• PTX 2.0 is optimized for the OpenCL and DirectCompute programming environments.

• It provides a set of new instructions allowing hardware support for these APIs.

5.1.3 Key innovations of Fermi’s PTX 2.0 (6)

d) Full support of predication [56]

• PTX 2.0 supports predication for all instructions.

• Predicated instructions will be executed or skipped depending on the actual values of condition codes.

• Predication allows each thread to perform different operations while execution continues at full speed.

• Predication is a more efficient solution for streaming applications than using conventional conditional branches and branch prediction (see the sketch below).
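A minimal CUDA C sketch (the kernel is illustrative) of code that the compiler can map to predicated PTX instructions rather than to a conditional branch:

    __global__ void clamp_to_zero(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && x[i] < 0.0f)
            x[i] = 0.0f;      // a short, divergent body like this is typically compiled
                              // to a predicated store (e.g. "@%p1 st.global.f32 ...")
                              // instead of a branch around the store
    }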

5.1.3 Key innovations of Fermi’s PTX 2.0 (7)

e) Full IEEE 754-2008 support for 32-bit and 64-bit FP precision

• Fermi's FP32 instruction semantics and implementation now support

  - calculations with subnormal numbers (numbers that lie between zero and the smallest normalized number) and
  - all four rounding modes (nearest, zero, positive infinity, negative infinity).

• Fermi provides fused multiply-add (FMA) instructions for both single and double precision FP calculations, retaining full precision in the intermediate stage, instead of truncating between the multiplication and the addition as done in previous generation GPGPUs for multiply-add (MAD) instructions (see the sketch below).
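A hedged CUDA C illustration of the difference, using the standard device FMA intrinsics (the variable values are arbitrary; on Fermi the plain expression a*b + c may also be contracted to FMA by the compiler):

    __global__ void fma_demo(float *rf, double *rd)
    {
        float  a = 1.0000001f, b = 3.14159f, c = -3.0f;
        double x = 1.0000001,  y = 3.14159,  z = -3.0;

        rf[0] = __fmaf_rn(a, b, c);   // single precision FMA: a*b+c with a single rounding,
                                      // the full-precision product is kept internally
        rd[0] = __fma_rn(x, y, z);    // double precision FMA
        rf[1] = a * b + c;            // on pre-Fermi GPGPUs this maps to MAD, which
                                      // truncates the intermediate product
    }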

5.1.3 Key innovations of Fermi’s PTX 2.0 (8)

Supporting program development for the Fermi line of GPGPUs [58]

• Nvidia provides a development environment, called Nexus, designed specifically to support parallel CUDA C, OpenCL and DirectCompute applications.

• Nexus brings parallel-aware hardware source code debugging and performance analysis directly into Microsoft Visual Studio.

• Nexus allows Visual Studio developers to write and debug GPU source code using exactly the same tools and interfaces that are used when writing and debugging CPU code.

• Furthermore, Nexus extends Visual Studio functionality by offering tools to manage massive parallelism.

5.1.3 Key innovations of Fermi’s PTX 2.0 (9)

5.1.4 Nvidia's high level data parallel programming model

CUDA (Compute Unified Device Architecture) [43]

• It is Nvidia's hardware and software architecture for issuing and managing data parallel computations on a GPGPU without the need to map them to a graphics API.

• It became available starting with the CUDA release 0.8 (in 2/2007) and the GeForce 8800 cards.

• CUDA is designed to support various languages and Application Programming Interfaces (APIs).

Figure 5.1.2: Supported languages and APIs (beginning with CUDA version 3.0)

5.1.4 Nvidia’s high level data parallel programming model (1)

[Figure: the CUDA software stack, showing e.g. CUDA C (to be compiled with nvcc), libraries (e.g. CUBLAS) and the CUDA Driver API (used to manage the platform and to load and launch kernels)]

Writing CUDA programs [43]

CUDA programs can be written at two levels: at the level of CUDA C or at the level of the CUDA Driver API.

At the level of CUDA C:

• CUDA C exposes the CUDA programming model as a minimal set of C language extensions.

• These extensions allow kernels to be defined along with the dimensions of the associated grids and thread blocks.

• The CUDA C program must be compiled with nvcc.

At the level of the CUDA Driver API:

• The CUDA Driver API is a lower level C API that allows kernels to be loaded and launched as modules of binary or assembly CUDA code and the platform to be managed.

• Binary and assembly codes are usually obtained by compiling kernels written in C (a minimal sketch of this path follows below).
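A hedged sketch of the Driver API path (error checking is omitted; the module and kernel names "vecadd.ptx"/"VecAdd" are made up for the example; the cuLaunchKernel call shown is the CUDA 4.0-style launch entry point):

    #include <cuda.h>

    int main()
    {
        CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuModuleLoad(&mod, "vecadd.ptx");          // the driver JIT-compiles the PTX here
        cuModuleGetFunction(&fn, mod, "VecAdd");

        const int n = 256;
        CUdeviceptr dA, dB, dC;
        cuMemAlloc(&dA, n * sizeof(float));
        cuMemAlloc(&dB, n * sizeof(float));
        cuMemAlloc(&dC, n * sizeof(float));

        void *args[] = { &dA, &dB, &dC };
        cuLaunchKernel(fn, 1, 1, 1, n, 1, 1, 0, 0, args, 0);   // 1 block of n threads
        cuCtxSynchronize();
        cuCtxDestroy(ctx);
        return 0;
    }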

5.1.4 Nvidia’s high level data parallel programming model (2)

The high-level CUDA programming model

• It supports data parallelism.

• It is the task of the operating system's multitasking mechanism to manage access to the GPGPU by several CUDA and graphics applications running concurrently.

• Beyond that, advanced Nvidia GPGPUs (beginning with the Fermi family) are able to run multiple kernels concurrently.

5.1.4 Nvidia’s high level data parallel programming model (3)

Main components of the programming model of CUDA [43]

• The data parallel programming model is based on the following abstractions:

a) The platform model

b) The memory model of the platform

c) The execution model, including
   c1) the kernel concept as a means to utilize data parallelism,
   c2) the allocation of threads and thread blocks to ALUs and SIMD cores,
   c3) the data sharing concept and
   c4) the synchronization concept.

• These abstractions will be outlined briefly below.

• A more detailed description of the programming model of the OpenCL standard is given in Section 3.

5.1.4 Nvidia’s high level data parallel programming model (4)

a) The platform model [146]

[Figure: the platform model, showing the SIMD cores and their ALUs]

5.1.4 Nvidia’s high level data parallel programming model (5)

b) The memory model of the platform [43]

A thread has access to the device's DRAM and on-chip memory through a set of memory spaces.

The Local Memory is an extension of the per-thread register space in the device memory.
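A small hedged CUDA C sketch of how these memory spaces are reached from a kernel (the variable names are illustrative; the exact placement of the per-thread array is decided by the compiler):

    __constant__ float coeff[4];            // constant memory in device DRAM (cached)
    __device__   float global_scale;        // global memory in device DRAM

    __global__ void spaces_demo(float *out)
    {
        __shared__ float tile[128];          // on-chip shared memory, one copy per thread block
        float r = coeff[0] * global_scale;   // r is normally held in a per-thread register
        float spill[64];                     // large or dynamically indexed per-thread arrays
                                             // may be placed in Local Memory, the per-thread
                                             // extension of the register space in device memory
        spill[threadIdx.x % 64] = r;
        tile[threadIdx.x % 128] = spill[threadIdx.x % 64];
        __syncthreads();
        out[threadIdx.x] = tile[threadIdx.x % 128];
    }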

5.1.4 Nvidia’s high level data parallel programming model (6)

Remark

Compute capability dependent memory sizes of Nvidia’s GPGPUs

5.1.4 Nvidia’s high level data parallel programming model (7)

c) The execution model [43]

Serial code executes on the host while parallel code executes on the device.

5.1.4 Nvidia’s high level data parallel programming model (8)

Overview

c1) The kernel concept [43]

• CUDA C allows the programmer to define kernels as C functions that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

• A kernel is defined by using the __global__ declaration specifier.

• Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

The subsequent sample code illustrates a kernel that adds two vectors A and B of size N and stores the result into vector C, as well as its invocation.
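A minimal sketch of such a kernel and its invocation, following the standard vector addition example of the CUDA C programming guide (host-side data initialization and copies are omitted):

    #include <cuda_runtime.h>

    #define N 256

    // Kernel definition: each of the N threads adds one pair of elements
    __global__ void VecAdd(float *A, float *B, float *C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        float *A, *B, *C;
        cudaMalloc((void**)&A, N * sizeof(float));
        cudaMalloc((void**)&B, N * sizeof(float));
        cudaMalloc((void**)&C, N * sizeof(float));
        // ... initialize A and B (e.g. copy them from the host) ...

        // Kernel invocation with N threads in a single thread block
        VecAdd<<<1, N>>>(A, B, C);

        cudaDeviceSynchronize();
        return 0;
    }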
