
3. Major innovations of Kepler’s microarchitecture

3.5. Major enhancements of Kepler’s microarchitecture

a) Lowering the operating frequency of the execution units (the shader clock frequency) to the core frequency

b) Simplifying hardware dependency checks by introducing compiler hints

c) Introducing quad warp schedulers per SMX

d) Quadrupling the number of registers that are accessible per thread

e) Introduction of a 48 KB Read-Only data cache for general use

f) Doubled L2 cache size and bandwidth vs. Fermi

3.5.1. a) Lowering the operating frequency of the execution units (shader clock frequency) to the core frequency-1

Basically, there are two design options to double the performance potential:

• either to run the execution units at twice the original clock frequency, or

• to double the number of available execution units and run them at the original clock speed.

The first option needs about the same silicon area as before but has higher power consumption, whereas the second option requires roughly twice the silicon area but needs less power than the first option.

Remark

• A higher clock frequency (f_c) requires a higher supply voltage (V).

• As the dynamic dissipation of a processor is D = const × f_c × V², a higher f_c implies higher power consumption.

• On the other hand, a larger silicon area results in higher fabrication cost, so the real trade-off is lower power consumption for higher fabrication cost, or vice versa, as the worked comparison below illustrates.
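A short worked comparison makes this trade-off concrete; the supply voltages used here (1.15 V and 1.30 V) are purely illustrative assumptions, not figures from the referenced designs:

D = const × f_c × V²

Option 1 (double f_c): assuming the higher clock requires raising the supply voltage from 1.15 V to 1.30 V,
D₁ = const × (2 × f_c) × (1.30 V)² ≈ 2.6 × D

Option 2 (double the number of execution units): the capacitance-related constant roughly doubles while f_c and V stay unchanged,
D₂ = (2 × const) × f_c × (1.15 V)² = 2 × D

Both options thus roughly double the throughput potential, but under these assumptions the higher-clock option dissipates about 30 % more power, while the doubled-unit option pays for its lower dissipation with roughly twice the silicon area.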

3.5.2. a) Lowering the operating frequency of the execution units (shader clock frequency) to the core frequency-2

Designers of the G80 through the Fermi families of GPUs opted for at least doubling the clock frequency of the execution units (termed the shader frequency) relative to the core clock in order to raise performance, as shown below.

Figure 5.17 - Core clock frequency vs. shader frequency in Nvidia’s major GPU families

The benefit of the chosen option is that it needs less silicon area and thus reduces fabrication cost, but the drawback is higher power consumption, compared to the second option.

3.5.3. a) Lowering the operating frequency of the execution units (shader clock frequency) to the core frequency-3

By contrast, beginning with Kepler, Nvidia opted for reducing power consumption by optimizing the performance/Watt figure instead of raw performance.

Accordingly, the designers switched to the other basic option, i.e. they greatly increased the number of execution units and lowered the shader frequency to the core clock frequency, as the next tables show.

Figure 5.18 - Available execution resources in a Kepler SMX vs. a Fermi SM
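As a side note, a minimal sketch (assuming device 0 is the card of interest) of how the quantities discussed in this section can be read out at run time through the standard CUDA runtime call cudaGetDeviceProperties(): the number of SMX units, the clock frequency the execution units now share with the core, and the L2 cache size that is discussed later in this section:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0 (assumed to be the CUDA-capable card of interest)
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }
    printf("SM/SMX units : %d\n", prop.multiProcessorCount);  // number of multiprocessors
    printf("Core clock   : %d kHz\n", prop.clockRate);        // on Kepler the execution units run at this clock
    printf("L2 cache     : %d bytes\n", prop.l2CacheSize);    // e.g. 1.5 MB on GK110, 512 KB on GK104
    return 0;
}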

3.5.4. a) Lowering the operating frequency of the execution units (shader clock frequency) to the core frequency-4

Figure 5.19 - Core clock frequency vs. shader frequency in Nvidia’s major GPU families

3.5.5. Comparing the silicon area needed and power consumption of the design approaches used to implement Fermi’s and Kepler’s execution units [76]

Figure 5.20

Remark

As far as AMD’s GPU implementations are concerned, AMD stuck to clocking the EUs at the base clock frequency, as indicated in the next Table.

Figure 5.21 - Main features of Nvidia’s and AMD’s GPU cards [77]

3.5.5.1. Resulting power efficiency of the Kepler design vs. the Fermi design

The overall design approach taken to optimize for performance/Watt resulted in considerably higher power efficiency compared to the Fermi approach, as indicated below [76].

Figure 5.22 - Comparing the power efficiency of the Fermi and Kepler designs [76]

3.5.6. b) Simplifying hardware dependency checks by introducing compiler hints

The Fermi design includes a complex hardware scheduler to perform dependency checking for warp scheduling, as shown below.

Figure 5.23 - Block diagram of Fermi’s hardware dependency checking [76]

In their Kepler line, Nvidia greatly simplified hardware dependency checking by letting the compiler provide scheduling information that reveals when instructions will be ready to issue.

This information can be calculated based on the fixed latencies of the execution pipelines.

This results in a more straightforward scheduler design, as shown in the next Figure.

Figure 5.24 - Block diagram of Kepler’s hardware dependency checking [76]
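As a hedged illustration of what the fixed latencies of the execution pipelines mean at the source level (the kernel below is a made-up example, not taken from [76]): the readiness of the arithmetic results can be predicted by the compiler from the known pipeline latencies, whereas the global load still has a variable latency that must be handled dynamically:

__global__ void latency_demo(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];            // global load: variable latency, still checked in hardware
    float a = v * 2.0f + 1.0f;  // FMA: fixed pipeline latency, known to the compiler
    float b = a * a + 0.5f;     // depends on a; the compiler can encode when it may issue
    out[i] = b;                 //   without a full hardware scoreboard
}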

3.5.7. c) Introducing quad warp schedulers per SMX

As stated before, Nvidia provided a multiple of the execution resources of the Fermi line in their Kepler architecture, as indicated below.

3.5.8. Available execution resources in a Kepler SMX vs. a Fermi SM

Figure 5.25

In order to utilize the vastly increased execution resources (see the Tables above), Nvidia introduced quad warp schedulers per SMX unit, as shown next.

Figure 5.26

The four warp schedulers of an SMX unit each select a particular warp, with two independent instructions per warp, for dispatching each cycle, as shown below [70].

Figure 5.27
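A hedged, made-up sketch (not from [70]) of what this means for kernel code: the two assignments inside the loop form independent dependency chains, so the two dispatch units of a warp scheduler may issue one instruction from each chain for the selected warp in the same cycle:

__global__ void dual_issue_demo(const float* a, const float* b, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = a[i];
    float y = b[i];
    for (int k = 0; k < 16; ++k) {
        x = x * 1.0001f + 0.5f;   // chain 1
        y = y * 0.9999f + 0.5f;   // chain 2, independent of chain 1
    }
    out[i] = x + y;               // the two chains only meet at the end
}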

3.5.9. d) Quadrupling the number of registers that are accessible per thread

Each thread in the GKxxx Kepler cores can access 255 registers, four times as many as in Fermi cores, as the Table below indicates.

Figure 5.28 - Key device features bound to the compute capability versions [32]

3.5.10. Consequence of quadrupling the number of registers that can be accessed per thread in the cores of Kepler

Codes that exhibit high register pressure may benefit from having an extended number of registers [70].
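As a hedged sketch of how this can be exploited (the file and kernel names below are only illustrative): per-thread register usage is reported when compiling with nvcc --ptxas-options=-v, and the __launch_bounds__ qualifier lets the programmer trade occupancy for registers on an sm_35 (GK110) target:

// Compile e.g.: nvcc -arch=sm_35 --ptxas-options=-v regs.cu
// ptxas then reports the registers used per thread (up to 255 on GK110).

__global__ void __launch_bounds__(128, 4)      // <=128 threads/block, >=4 resident blocks/SMX desired
register_heavy(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc[32];                             // small fixed-size array, likely promoted to registers
    for (int k = 0; k < 32; ++k)
        acc[k] = in[i] * (float)(k + 1);
    float sum = 0.0f;
    for (int k = 0; k < 32; ++k)
        sum += acc[k] * acc[k];
    out[i] = sum;
}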

3.5.11. e) Introduction of a 48 KB Read-Only data cache for general use

The Fermi generation of GPGPUs already included a 48 KB read-only data cache.

Although this cache was accessible only through the Texture unit, experienced programmers often made use of it.

Kepler made the read-only cache available for general use, as shown below.

Figure 5.29 - Kepler’s cache architecture [70]
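A minimal sketch (kernel and parameter names are made up) of how a compute capability 3.5 kernel can route its loads through this cache, either by qualifying the input pointer as const __restrict__ or by using the __ldg() intrinsic:

// Available from compute capability 3.5 (GK110) on: loads through const
// __restrict__ pointers or the __ldg() intrinsic may be served by the
// 48 KB read-only data cache.
__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]) * factor;   // read-only-cache load, ordinary store
}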

3.5.12. f) Doubled L2 cache size and bandwidth vs. Fermi

Nvidia doubled the size of the L2 cache in the GK110 to 1.5 MB, up from the 768 KB of the related previous Fermi cores (GF100/GF110) [70].

Note

The L2 cache size of the GK104, however, remained only 512 KB, the same as that of the previous Fermi GF104/GF114 cores.
