
Architecture of Parallel Systems


Academic year: 2022


Written by: Sima Dezső

ARCHITECTURE OF PARALLEL SYSTEMS

PARALLEL COMPUTING MODULE

PROACTIVE IT MODULE DEVELOPMENT

Reviewed by: faculty working group

COPYRIGHT:

2011-2016, Dr. Sima Dezső, Óbuda University, John von Neumann Faculty of Informatics

REVIEWED BY: faculty working group

Creative Commons NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0)

This work may be freely copied, distributed, published and performed for non-commercial purposes, provided the author is credited, but it may not be modified.

SUPPORT:

Prepared within the framework of grant TÁMOP-4.1.2-08/2/A/KMR-2009-0053, “Proactive IT module development (PRIM1): IT Service Management module and Multithreaded Processors and Their Programming module”

PREPARED BY: Typotex Kiadó. RESPONSIBLE MANAGER: Votisky Zsuzsa

ISBN 978-963-279-561-4

KEYWORDS:

multicore processors, manycore processors, homogeneous multicore processors, heterogeneous multicore processors, master-slave type heterogeneous multicore processors, add-on type heterogeneous multicore processors, Intel architectures based on Core 2/Penryn/Nehalem/Nehalem-EX/Westmere/Westmere-EX/Sandy Bridge, consumer oriented and enterprise oriented platforms, Intel’s vPro platform, general purpose GPUs (GPGPUs), data parallel accelerators (DPAs), integrated CPU/GPU architectures

SUMMARY:

The course gives students an overview of the rapid development of processor architectures in recent years. They learn why the emergence of multicore processors was inevitable and become familiar with the main classes of multicore/manycore processors, namely homogeneous and heterogeneous multicore processors, along with their subclasses and representative implementations.

The main families of Intel’s multicore processors and their main features are presented, namely the architectures based on Core 2, Penryn, Nehalem, Nehalem-EX, Westmere, Westmere-EX and Sandy Bridge. Students become familiar with multicore desktop platforms, with emphasis on consumer oriented and enterprise oriented (vPro) platforms and their characteristics. A large number of concrete implementations are shown to aid understanding. The lectures then discuss general purpose GPUs (GPGPUs) and data parallel accelerators (DPAs), which are spreading ever more widely in compute-intensive applications. Finally, the architectures of representative Nvidia and AMD/ATI GPGPU families are presented, along with the integrated CPU/GPU architectures that appeared in the most recent phase of processor evolution and their representative implementations.

Table of Contents

• Multicore-Manycore Processors

• Evolution of Intel’s Basic Microarchitectures

• Intel’s Desktop Platforms

• GPGPUs/DPAs Overview

• GPGPUs/DPAs 5.1

• GPGPUs/DPAs 5.2

• Integrated CPUs/GPUs

• References to all four sections of GPGPUs/DPAs

© Sima Dezső, ÓE NIK, www.tankonyvtar.hu

Dezső Sima

Multicore-Manycore Processors

Contents

1. The inevitable era of multicores

2. Homogeneous multicores

2.1 Conventional multicores

2.2 Many-core processors

3. Heterogeneous multicores

3.1 Master-slave type heterogeneous multicores

3.2 Add-on type heterogeneous multicores

4. Outlook

5. References


1. The inevitable era of multicores

1. The inevitable era of multicores (1)

Integer performance grows

Figure 1.1: Integer performance growth of Intel’s x86 processors

[Plot: SPECint92 (logarithmic scale, 0.2 to 10000) vs. year (1979 to 2005), with data points from the 8088/5 through the 80286, 386, 486, Pentium, Pentium Pro, PII and PIII lines up to the P4/3200 and the Northwood B and Prescott (1M/2M) cores. Annotated trend: ~100x per 10 years, followed by levelling off.]

1. The inevitable era of multicores (2)

Performance (Pa):

Pa = fC x IPC (clock frequency x Instructions Per Cycle)

Pa = clock frequency x efficiency, where efficiency = Pa/fC
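The relation Pa = fC x IPC can be tried out with a few lines of Python. This is only a minimal sketch: the function names and the 3 GHz / 1.5 IPC figures are hypothetical and are not taken from the slides.

```python
def performance(clock_hz: float, ipc: float) -> float:
    """Performance in instructions per second: Pa = fC x IPC."""
    return clock_hz * ipc


def efficiency(perf: float, clock_hz: float) -> float:
    """Efficiency (IPC) recovered from performance: Pa / fC."""
    return perf / clock_hz


# Hypothetical processor: 3 GHz clock, 1.5 instructions per cycle on average.
pa = performance(3.0e9, 1.5)
print(f"Pa  = {pa:.2e} instructions/s")
print(f"IPC = {efficiency(pa, 3.0e9):.2f}")
```

Doubling either factor doubles Pa, which is why the slides treat clock frequency and efficiency as the two separate levers of performance.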

1. The inevitable era of multicores (3)

Figure 1.2: Efficiency of Intel processors

[Plot: SPECint_base2000/fc (logarithmic scale, 0.01 to 1) vs. year (1978 to 2002), with data points for the 286, 386DX, 486DX, Pentium, Pentium Pro, Pentium II and Pentium III. Annotated trend: ~10x per 10 years, levelling off with the 2. generation superscalars.]
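The two trend annotations (performance ~100x per 10 years, efficiency ~10x per 10 years) can be combined with Pa = fC x IPC to estimate how much of the growth came from clock frequency. The sketch below assumes smooth exponential growth over the decade, which is a simplification and not a claim made in the slides; the function name is hypothetical.

```python
def annual_factor(total_factor: float, years: float) -> float:
    """Per-year growth factor that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


perf_growth = annual_factor(100.0, 10.0)  # performance: ~100x / 10 years
eff_growth = annual_factor(10.0, 10.0)    # efficiency (IPC): ~10x / 10 years

# Since Pa = fC x IPC, the clock frequency must account for the remainder.
clock_growth = perf_growth / eff_growth

print(f"performance: x{perf_growth:.3f} per year")
print(f"efficiency:  x{eff_growth:.3f} per year")
print(f"clock:       x{clock_growth:.3f} per year")
```

Under this assumption the clock frequency also grew by about 10x per decade, so once efficiency levels off only the clock term is left to carry further growth.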

1. The inevitable era of multicores (4)

Main sources of processor efficiency (IPC)

• Processor width: pipeline (1), 1. Gen. superscalar (2), 2. Gen. superscalar (4)

• Core enhancements: branch prediction, speculative loads, ...

• Cache enhancements: L2/L3 enhancements (size, associativity ...)

Figure 1.3: Main sources of processor efficiency


Figure 1.4: Extent of parallelism available in general purpose applications for 2. generation superscalars [37]

1. The inevitable era of multicores (5)


1. The inevitable era of multicores (6)

Main sources of processor efficiency (IPC)

• Processor width: pipeline (1), 1. Gen. superscalar (2), 2. Gen. superscalar (4)

• Core enhancements: branch prediction, speculative loads, ...

• Cache enhancements: L2/L3 enhancements (size, associativity ...)

Figure 1.5: Main sources of processor efficiency

1. The inevitable era of multicores (7)

Beginning with 2. generation superscalars

• the era of extensively increasing processor efficiency came to an end,

• processor efficiency levelled off.

Pa = fC x IPC (clock frequency x Instructions Per Cycle)

Consequently, a further performance increase can basically be achieved only by raising fC.

1. The inevitable era of multicores (8)

Figure 1.6: Evolution of Intel’s process technology [38]

Shrinking: ~ 0.7x / 2 years

1. The inevitable era of multicores (9)

Figure 1.7: The actual rise of IC complexity in DRAMs and microprocessors [39]

1. The inevitable era of multicores (10)

Moore’s law: transistor counts double ~ every two years.

What is the best use of the ever increasing number of transistors?

Main sources of processor efficiency (IPC)

• Processor width: pipeline (1), 1. Gen. superscalar (2), 2. Gen. superscalar (4)

• Core enhancements: branch prediction, speculative loads, ...

• Cache enhancements: L2/L3 enhancements (size, associativity ...)

1. The inevitable era of multicores (11)

IC fab technology

Moore’s law: ~ doubling transistor counts / 2 years (linear shrink ~ 0.7x / 2 years)

Possible use of surplus transistors

• Processor width: pipeline (1), 1. Gen. superscalar (2), 2. Gen. superscalar (4)

• Core enhancements: branch prediction, speculative loads, ...

• Cache enhancements: L2/L3 enhancements (size, associativity ...)

Figure 1.8: Possible use of surplus transistors
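The two statements of Moore’s law on this slide are consistent with each other: a linear shrink of ~0.7x halves the area of a transistor, so about twice as many fit on the same die. The following sketch checks that arithmetic; the function names are hypothetical, and the idealised scaling ignores everything except geometry.

```python
def density_gain(linear_shrink: float) -> float:
    """Transistors per unit area gained when every linear dimension
    is scaled by linear_shrink (area scales with its square)."""
    return 1.0 / (linear_shrink ** 2)


def transistor_growth(years: float, doubling_period_years: float = 2.0) -> float:
    """Growth factor of the transistor count after the given number of years,
    assuming one doubling per doubling_period_years."""
    return 2.0 ** (years / doubling_period_years)


print(f"density gain per 0.7x shrink: x{density_gain(0.7):.2f}")
print(f"transistor count after 10 years: x{transistor_growth(10):.0f}")
```

A 0.7x shrink gives roughly a 2.04x density gain, and five doublings in ten years give a 32x larger transistor budget, which is the surplus the slide asks how to spend.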

1. The inevitable era of multicores (12)

Increasing number of transistors, but diminishing return in performance

• Use available surplus transistors for multiple cores

• The inevitable era of multicores

1. The inevitable era of multicores (13)

Figure 1.9: Rapid spreading of Intel’s multicore processors [40]

2. Homogeneous multicores

2.1 Conventional multicores

2.2 Manycore processors

2. Homogeneous multicores (1)

Figure 2.1: Major classes of multicore processors

Multicore processors

• Homogeneous multicores
  - Conventional multicores (2 ≤ n ≤ 8 cores): mobiles, desktops, servers; general purpose computing
  - Manycore processors (with >8 cores): prototypes/experimental systems

• Heterogeneous multicores
  - Master/slave type multicores (MPC): MM/3D/HPC, production stage
  - Add-on type multicores (CPU + GPU): HPC

2.1 Conventional multicores

2.1.1 Example: Intel’s MP servers

2.1 Conventional multicores (1)

Figure 2.1: Major classes of multicore processors

2.1.1 Example: Intel’s MP servers

2.1.1.1 Introduction

2.1.1.2 The Pentium 4 based Truland MP platform

2.1.1.3 The Core 2 based Caneland MP platform

2.1.1.4 The Nehalem-EX based Boxboro-EX MP platform

2.1.1.5 Evolution of MP platforms

2.1.1.1 Introduction

2.1.1.1 Introduction (1)

Servers

• Uni-Processors (UP)

• Dual Processors (DP)

• Multi Processors (MP) (typically 4 processors)

• Servers with more than 8 processors

2.1.1.1 Introduction (2)

| Basic Arch. | Core/technology | Date | MP server processor | Cores and caches |
|---|---|---|---|---|
| Pentium 4 (Prescott) | Pentium 4 90 nm | 11/2005 | Paxville MP | 2x1 C, 2 MB L2/C |
| | Pentium 4 65 nm | 8/2006 | 7100 (Tulsa) | 2x1 C, 1 MB L2/C, 16 MB L3 |
| Core 2 | Core2 65 nm | 9/2007 | 7200 (Tigerton DC) | 1x2 C, 4 MB L2/C |
| | | | 7300 (Tigerton QC) | 2x2 C, 4 MB L2/C |
| | Penryn 45 nm | 9/2008 | 7400 (Dunnington) | 1x6 C, 3 MB L2/2C, 16 MB L3 |
| Nehalem | Nehalem-EP 45 nm | | | |
| | Westmere-EP 32 nm | | | |
| | Nehalem-EX 45 nm | 3/2010 | 7500 (Beckton) | 1x8 C, ¼ MB L2/C, 24 MB L3 |
| | Westmere-EX 32 nm | 4/2011 | E7-48xx (Westmere-EX) | 1x10 C, ¼ MB L2/C, 30 MB L3 |
| Sandy Bridge | Sandy Bridge 32 nm | /2011 | | |
| | Ivy Bridge 22 nm | 11/2012 | | |

2.1.1.1 Introduction (3)

MP server platforms

• Pentium 4 based MP server platform: Truland (2005) (90 nm/65 nm Pentium 4 Prescott MP based)

• Core 2 based MP server platform: Caneland (2007)

• Nehalem-EX based MP server platform: Boxboro-EX (2010)

• Sandy Bridge based MP server platform: to be announced yet

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (1)

Overview

Remark

To present a more complete view of the evolution of multicore MP server platforms, we also include the single core (SC) 90 nm Pentium 4 Prescott based Xeon MP (Potomac) processor, which was the first 64-bit MP server processor and gave rise to the Truland platform.

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (2)

MP platforms: Truland (3/2005), Truland (updated) (11/2005)

MP cores:

• Xeon MP (Potomac) 1C, 3/2005, Pentium 4-based/90 nm: 90 nm/675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604

• Xeon 7000 (Paxville MP) 2x1C, 11/2005, Pentium 4-based/90 nm: 90 nm/2x169 mtrs, 2x1 (2) MB L2, L3: -, 800/667 MT/s, mPGA 604

• Xeon 7100 (Tulsa) 2C, 8/2006, Pentium 4-based/65 nm: 65 nm/1328 mtrs, 2x1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604

MCH:

• E8500 (Twin Castle), 3/2005: 2xFSB 667 MT/s, HI 1.5, 4 x XMB (2 channels/XMB, 4 DIMMs/channel), DDR-266/333, DDR2-400, 32 GB

• E8501 (Twin Castle?), 11/2005: 2xFSB 800 MT/s, HI 1.5, 4 x XMB (2 channels/XMB, 4 DIMMs/channel), DDR-266/333, DDR2-400, 32 GB

ICH: ICH5, 4/2003

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (3)

Figure 2.2: The Potomac processor as Intel’s first 64-bit Xeon MP processor based on the µPGA604

[Timeline chart, 2000 to 2005, of Intel’s Pentium 4 based processor lines (Xeon MP line, Xeon DP line, Desktop line, Celeron line), from Willamette and Foster through Northwood, Prestonia, Gallatin, Prescott, Nocona and Irwindale to Potomac. For each core the chart lists the introduction date, process technology, transistor count, FSB speed, clock frequencies, on-die cache sizes and socket. Cores supporting hyperthreading, cores with EM64T implemented but not enabled, and cores supporting EM64T are marked; Jayhawk and Tejas are shown as cancelled (5/04).]

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (4)

Basic system architecture of the 90 nm Pentium 4 Prescott MP based Truland MP server platform

[Block diagram: Pentium 4 Prescott MP based Truland MP server platform (for up to 2 cores). Four Pentium 4 Xeon MP processors (1C/2x1C), i.e. Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C, are connected via FSBs to the 8500¹/8501 MCH. The MCH connects to four XMBs, each serving DDR-266/333 or DDR2-400 memory, and via HI 1.5 to the ICH5.]

XMB: eXternal Memory Bridge. Provides a serial link with 5.33 GB/s inbound BW and 2.67 GB/s outbound BW (simultaneously).

HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate

¹ The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)
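The peak transfer rate of a parallel interface such as HI 1.5 follows from its width, clock and number of transfers per clock. The sketch below reproduces the 266 MB/s figure quoted for HI 1.5 elsewhere in these slides; the function name is hypothetical, and reading the nominal "66 MHz" as 66.67 MHz is an assumption made here, not something the slides state.

```python
def parallel_bus_peak_mb_s(width_bits: int, clock_mhz: float,
                           transfers_per_clock: int) -> float:
    """Peak transfer rate in MB/s: bytes per transfer x clock x transfers per clock."""
    return (width_bits / 8) * clock_mhz * transfers_per_clock


# HI 1.5 per the slide: 8 bit wide, QDR (4 transfers per clock).
nominal = parallel_bus_peak_mb_s(8, 66, 4)     # with the nominal 66 MHz
assumed = parallel_bus_peak_mb_s(8, 66.67, 4)  # assuming a 66.67 MHz clock

print(f"nominal 66 MHz:    {nominal:.0f} MB/s")
print(f"assumed 66.67 MHz: {assumed:.1f} MB/s")
```

Either way the result is in the range of 264 to 267 MB/s, i.e. the ~266 MB/s peak rate, not 66 MB/s.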

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (5)

Expanding the Truland platform to 3 generations of Pentium 4 based Xeon MP servers: Xeon MP (1C), Xeon 7000 (2x1C), Xeon 7100 (2C)

Figure 2.3: Expanding the Truland platform [1]

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (6)

Example 1: Block diagram of an 8500 chipset based Truland MP server board [2]

Figure 2.4: Block diagram of an 8500 chipset based Truland MP server board [2]

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (7)

Example 2: Block diagram of the E8501 based Truland MP server platform [3]

Figure 2.5: Intel’s 8501 chipset based Truland MP server platform (4/2006) [3]

Annotations to the figure:

• Processors: Xeon DC MP 7000 (4/2005) or later DC/QC MP 7000 processors

• North Bridge

• IMI: Independent Memory Interface, a serial link; 5.33 GB/s inbound BW and 2.67 GB/s outbound BW simultaneously

• XMB: eXternal Memory Bridge; intelligent MC, dual mem. channels, DDR 266/333/400, 4 DIMMs/channel

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (8)

Example 3: E8501 based MP server board implementing the Truland platform

Figure 2.6: Intel E8501 chipset based MP server board (Supermicro X6QT8) for the Xeon 7000/7100 DC MP processor families [4]

Labels in the figure: Xeon DC 7000/7100 processors, E8501 NB, ICH5R SB, 2 x XMB (twice), DDR2 DIMMs (64 GB)

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (9)

Figure 2.7: Bandwidth bottlenecks in Intel’s 8501 based Truland MP server platform [5]

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (10)

Remark

Previous (first generation) MP servers made use of a symmetric topology including only a single FSB that connects all 4 single core processors to the MCH (north bridge), as shown below.

Typical system architecture of a first generation Xeon MP based MP server platform

[Block diagram: four single core (SC) Xeon MP processors share a single FSB to the preceding NBs, which attach the memory (e.g. DDR-200/266) and connect to the preceding ICH via e.g. HI 1.5 (266 MB/s).]

Figure 2.8: Previous Pentium 4 MP based MP server platform (for single core processors)

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (11)

Example: Block diagram of an MP server board that is based on Pentium 4 (Willamette MP) single core 32-bit Xeon MP processors (called Foster)

• The chipset (CMIC/CSB5) is ServerWorks’ Grand Champion HE Classic chipset.

• The memory is placed on an extra card.

• There are 4 memory controllers, each supporting 4 DIMMs (DDR-266/200).

Figure 2.9: Block diagram of an MP server board [6]

2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (12)

Evolution from the first generation MP servers supporting SC processors to the 90 nm Pentium 4 Prescott MP based Truland MP server platform (supporting up to 2 cores)

[Side-by-side block diagrams. Previous Pentium 4 MP based MP server (for single core processors): four SC Xeon MP processors on a single FSB, preceding NBs with e.g. DDR-200/266 memory, HI 1.5 (266 MB/s) to the preceding ICH. 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C): four Pentium 4 Xeon MP 1C/2x1C processors (Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C) on FSBs to the 8500¹/8501 MCH, four XMBs with DDR-266/333 or DDR2-400 memory, HI 1.5 to the ICH5.]

2.1.1.3 The Core 2 based Caneland MP server platform (1)

2.1.1.3 The Core 2 based Caneland MP server platform

2.1.1.3 The Core 2 based Caneland MP server platform (2)

MP platforms: Caneland (9/2007)

MP cores:

• Xeon 7200 (Tigerton DC) 1x2C, 9/2007, Core2-based/65 nm: 65 nm/2x291 mtrs, 2x4 MB L2, L3: -, 1066 MT/s, mPGA 604

• Xeon 7300 (Tigerton QC) 2x2C, 9/2007, Core2-based/65 nm: 65 nm/2x291 mtrs, 2x(4/3/2) MB L2, L3: -, 1066 MT/s, mPGA 604

• Xeon 7400 (Dunnington 6C), 9/2008, Penryn 45 nm: 45 nm/1900 mtrs, 9/6 MB L2, 16/12/8 MB L3, 1066 MT/s, mPGA 604

MCH: E7300 (Clarksboro), 9/2007: 4xFSB 1066 MT/s, ESI, 4 x FBDIMM (DDR2-533/667, 8 DIMMs/channel), 512 GB

ICH: 631xESB/632xESB, 5/2006

2.1.1.3 The Core 2 based Caneland MP server platform (3)

Basic system architecture of the Core 2 based Caneland MP server platform

[Side-by-side block diagrams. 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C): four Pentium 4 Xeon MP 1C/2x1C processors (Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C) on FSBs to the 8500¹/8501 MCH, four XMBs with DDR-266/333 or DDR2-400 memory, HI 1.5 to the ICH5. Core 2 based Caneland MP server platform (for up to 6 C): four Core2 (2C/4C) / Penryn (6C) processors (Xeon 7200 (Tigerton DC) 1x2C / Xeon 7300 (Tigerton QC) 2x2C / Xeon 7400 (Dunnington 6C)) on FSBs to the 7300 MCH, FBDIMM channels (DDR2-533/667, up to 8 DIMMs), ESI to the 631xESB/632xESB.]

ESI: Enterprise System Interface. 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction).

HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate

¹ The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)
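The ESI figure is a simple lane count times per-lane rate, and it can be set against the 266 MB/s peak rate that these slides quote for HI 1.5 to see how much the south bridge link improved from Truland to Caneland. A minimal sketch with hypothetical names; only the lane count, the per-lane rate and the HI 1.5 rate come from the slides.

```python
def serial_link_gb_s(lanes: int, gb_s_per_lane: float) -> float:
    """Aggregate transfer rate per direction of a multi-lane serial link."""
    return lanes * gb_s_per_lane


esi_gb_s = serial_link_gb_s(4, 0.25)  # ESI: 4 PCIe lanes, 0.25 GB/s per lane
hi15_gb_s = 0.266                     # HI 1.5 peak rate, 266 MB/s

print(f"ESI:    {esi_gb_s:.2f} GB/s per direction")
print(f"HI 1.5: {hi15_gb_s:.3f} GB/s peak")
print(f"ratio:  x{esi_gb_s / hi15_gb_s:.1f}")
```

Note that ESI provides its rate in each direction at the same time, so the ratio understates the gain for traffic flowing both ways.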

2.1.1.3 The Core 2 based Caneland MP server platform (4)

Example 1: Intel’s Nehalem-EP based Tylersburg-EP DP server platform with a single IOH

Figure 2.10: Intel’s 7300 chipset based Caneland platform for the Xeon 7200/7300 DC/QC processors (9/2007) [7]

Labels in the figure:

• Xeon 7200 (Tigerton DC, Core 2), 2C; Xeon 7300 (Tigerton QC, Core 2), QC

• FB-DIMM: 4 channels, 8 DIMMs/channel, up to 512 GB

2.1.1.3 The Core 2 based Caneland MP server platform (5)

Example 3: Caneland MP serverboard

Figure 2.11: Caneland MP Supermicro serverboard, with the 7300 (Clarksboro) chipset

Labels in the figure: Xeon 7200 DC / 7300 QC (Tigerton), 7300 NB, SBE2 SB, FB-DIMM DDR2 (192 GB), ATI ES1000 Graphics with 32MB video memory

2.1.1.3 The Core 2 based Caneland MP server platform (6)

Figure 2.12: Performance comparison of the Caneland platform with a quad core Xeon (7300 family) vs the Bensley platform with a dual core Xeon 7140M [8]

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (1)

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (2)

MP platforms: Boxboro-EX (3/2010)

MP cores:

• Xeon 7500 (Nehalem-EX) (Beckton) 8C, 3/2010, Nehalem-EX-based 45 nm: 45 nm/2300 mtrs/513 mm2, ¼ MB L2/C, 24 MB L3, 4 QPI links, 4 SMI links, 2 mem. channels/link, 2 DIMMs/mem. channel, DDR3 1067 MT/s, 1 TB (64x16 GB), LGA1567

• Xeon E7-4800 (Westmere-EX) 10C, 4/2011, Westmere-EX 32 nm: 32 nm/2600 mtrs/584 mm2, ¼ MB L2/C, 30 MB L3, 4 QPI links, 4 SMI links, 2 mem. channels/link, 2 DIMMs/mem. channel, DDR3 1333 MT/s, 1 TB (64x16 GB), LGA1567

IOH: 7500 (Boxboro), 3/2010: 2 QPI links, 32xPCIe 2. Gen. (0.5 GB/s/lane/direction), ESI (1 GB/s/direction)

ICH: ICH10, 6/2008
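The "1 TB (64x16 GB)" entry can be cross-checked against the per-processor memory topology listed in the same overview (4 SMI links, 2 memory channels per link, 2 DIMMs per channel). The sketch below assumes a 4-socket MP configuration, as shown in the platform block diagrams; the function and parameter names are hypothetical.

```python
def platform_memory(sockets: int, smi_links_per_socket: int,
                    channels_per_link: int, dimms_per_channel: int,
                    dimm_gb: int) -> tuple[int, int]:
    """Return (number of DIMM slots, maximum memory in GB) of the platform."""
    dimms = sockets * smi_links_per_socket * channels_per_link * dimms_per_channel
    return dimms, dimms * dimm_gb


# Assumed 4-socket Boxboro-EX system populated with 16 GB DIMMs.
dimms, total_gb = platform_memory(4, 4, 2, 2, 16)
print(f"{dimms} DIMMs, {total_gb} GB ({total_gb // 1024} TB)")
```

Each processor thus drives 16 DIMMs, and four of them give the 64 DIMMs and 1 TB stated above.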

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (3)

The 8 core Nehalem-EX (Xeon 7500/Beckton) MP server processor

Figure 2.13: The 8 core Nehalem-EX (Xeon 7500/Beckton) Xeon 7500 MP server processor [9]

Label in the figure: 2 cores

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (4)

The 10 core Westmere-EX (Xeon E7-4800) MP server processor [10]

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (5)

Block diagram of the Westmere-EX (E7-8800/4800/2800) processors [11]

• E7-8800: for 8 P systems

• E7-4800: for MP systems

• E7-2800: for DP systems

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (6)

Main platform features introduced in the 7500 Boxboro IOH (1)

Along with the Nehalem-EX based Boxboro platform, Intel continued its move to increase system security and manageability by introducing platform features otherwise provided by its continuously enhanced vPro technology, available for enterprise oriented desktops since 2006 and for DP servers since 2007.

The platform features introduced in the 7500 IOH are basically the same as those described for the Tylersburg-EP DP platform, which is based on the 5500 IOH, a close relative of the 7500 IOH of the Boxboro-EX platform.

They include:

a) Intel Management Engine (ME)

b) Intel Virtualization Technology for Directed I/O (VT-d2). VT-d2 is an upgraded version of VT-d.

c) Intel Trusted Execution Technology (TXT)

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (7)

Basic system architecture of the Nehalem-EX based Boxboro-EX MP server platform (assuming 1 IOH)

[Block diagram: Nehalem-EX based Boxboro-EX MP server platform (for up to 10 C). Four Nehalem-EX 8C / Westmere-EX 10C processors (Xeon 7500 (Nehalem-EX) (Beckton) 8C / Xeon E7-4800 (Westmere-EX) 10C) are connected by QPI links to each other and to the 7500 IOH. DDR3-1067 memory is attached through SMBs over SMI channels (2x4 SMI channels on each side of the diagram). The IOH is shown with the ME and connects via ESI to the ICH10.]

SMI: Serial link between the processor and the SMB

SMB: Scalable Memory Buffer (parallel/serial conversion)

ME: Management Engine

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (8)

Wide range of scalability of the 7500/6500 IOH based Boxboro-EX platform [12]

2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (9)

Example: Block diagram of a 7500 chipset based Boxboro-EX MP serverboard [13]

2.1.1.5 Evolution of MP server platforms

2.1.1.5 Evolution of MP server platforms (1)

2.1.1.5 Evolution of MP server platforms (2)

Evolution from the first generation MP servers supporting SC processors to the 90 nm Pentium 4 Prescott MP based Truland MP server platform (supporting up to 2 cores)

[Side-by-side block diagrams. Previous Pentium 4 MP based MP server platform (for single core processors): four SC Xeon MP processors on a single FSB, preceding NBs with e.g. DDR-200/266 memory, HI 1.5 (266 MB/s) to the preceding ICH. 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C): four Pentium 4 Xeon MP 1C/2x1C processors (Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C) on FSBs to the 8500¹/8501 MCH, four XMBs with DDR-266/333 or DDR2-400 memory, HI 1.5 to the ICH5.]

2.1.1.5 Evolution of MP server platforms (3)

Evolution from the 90 nm Pentium 4 Prescott MP based Truland MP platform (up to 2 cores) to the Core 2 based Caneland MP platform (up to 6 cores)

[Side-by-side block diagrams. 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C): four Pentium 4 Xeon MP 1C/2x1C processors (Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C) on FSBs to the 8500¹/8501 MCH, four XMBs with DDR-266/333 or DDR2-400 memory, HI 1.5 to the ICH5. Core 2 based Caneland MP server platform (for up to 6 C): four Core2 (2C/4C) / Penryn (6C) processors (Xeon 7200 (Tigerton DC) 1x2C / Xeon 7300 (Tigerton QC) 2x2C / Xeon 7400 (Dunnington 6C)) on FSBs to the 7300 MCH, FBDIMM channels (DDR2-533/667, up to 8 DIMMs), ESI to the 631xESB/632xESB.]

ESI: Enterprise System Interface. 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction).

HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate

¹ The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)

2.1.1.5 Evolution of MP server platforms (4)

Evolution to the Nehalem-EX based Boxboro-EX MP platform (that supports up to 10 cores). (In the basic system architecture we show the single IOH alternative.)

[Block diagram: four Nehalem-EX 8C / Westmere-EX 10C processors (Xeon 7500 (Nehalem-EX) (Beckton) 8C / Xeon E7-4800 (Westmere-EX) 10C) are connected by QPI links to each other and to the 7500 IOH. DDR3-1067 memory is attached through SMBs over SMI channels (2x4 SMI channels on each side of the diagram). The IOH is shown with the ME and connects via ESI to the ICH10.]

SMI: Serial link between the processor and the SMBs

SMB: Scalable Memory Buffer (parallel/serial converter)

ME: Management Engine

2.2 Many-core processors

2.2.1 Intel’s Larrabee

2.2.2 Intel’s Tile processor

2.2.3 Intel’s SCC

2.2 Manycore processors (1)

[Figure: Major classes of multicore processors, repeated from Figure 2.1]

2.2.1 Intel’s Larrabee

2.2.1 Intel’s Larrabee (1)

Part of Intel’s Tera-Scale Initiative.

• Objectives: High end graphics processing, HPC. Not a single product but a base architecture for a number of different products.

• Brief history:
  - Project started ~ 2005
  - First unofficial public presentation: 03/2006 (withdrawn)
  - First official public presentation: 08/2008 (SIGGRAPH)
  - Due in ~ 2009

• Performance (targeted): 2 TFlops

2.2.1 Intel’s Larrabee (2)

Basic architecture

Figure 2.14: Block diagram of a GPU-oriented Larrabee (2006, outdated) [41]

Update: SIMD processing width: SIMD-64 rather than SIMD-16

2.2.1 Intel’s Larrabee (3)

Figure 2.15: Board layout of a GPU-oriented Larrabee (2006, outdated) [42]

2.2.1 Intel’s Larrabee (4)

Figure 2.16: Four socket MP server design with 24-core Larrabees connected by the CSI bus [41]

2.2.2 Intel’s Tile processor

2.2.2 Intel’s Tile processor (1)

• First incarnation of Intel’s Tera-Scale Initiative

• Objective: Tera-Scale experimental chip (more than 100 projects underway)

• Brief history: announced at IDF 9/2006, due in 2009/2010

2.2.2 Intel’s Tile processor (2)

Figure 2.17: A forerunner: The Raw processor (MIT 2002) (16 tiles, each tile has a compute element, router, instruction and data memory) [43]

2.2.2 Intel’s Tile processor (3)

Figure 2.18: Die photo and chip details of the Tile processor [14]

Bisection bandwidth: if the network is segmented into two equal parts, this is the bandwidth between the two parts.
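The definition of bisection bandwidth can be made concrete for a 2D mesh: count the links severed by the cheapest cut that splits the mesh into two equal halves and multiply by the bandwidth of one link. The sketch below considers only straight cuts between rows or columns; the 8 x 10 arrangement of the 80 tiles and the per-link bandwidth of 2.0 (in arbitrary units) are assumptions for illustration, not values from the slides.

```python
def mesh_bisection_links(rows: int, cols: int) -> int:
    """Links severed by the cheapest straight cut that splits a
    rows x cols 2D mesh into two equal halves."""
    candidates = []
    if cols % 2 == 0:
        candidates.append(rows)  # vertical cut: one severed link per row
    if rows % 2 == 0:
        candidates.append(cols)  # horizontal cut: one severed link per column
    if not candidates:
        raise ValueError("no straight cut yields two equal halves")
    return min(candidates)


def bisection_bandwidth(rows: int, cols: int, link_bw: float) -> float:
    """Bisection bandwidth in the same units as link_bw."""
    return mesh_bisection_links(rows, cols) * link_bw


# Assumed 8 x 10 mesh of 80 tiles with a hypothetical link bandwidth of 2.0.
print(mesh_bisection_links(8, 10))      # severed links
print(bisection_bandwidth(8, 10, 2.0))  # bandwidth across the cut
```

The number of severed links grows only with the side length of the mesh while the tile count grows with its area, which is why bisection bandwidth is a common yardstick for on-chip networks.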

2.2.2 Intel’s Tile processor (4)

Figure 2.19: Main blocks of a tile [14]

2.2.2 Intel’s Tile processor (5)

Figure 2.20: Block diagram of a tile [14]

Notes to the figure: clocks run with the same frequency but unknown phases; FP Multiply-Accumulate (AxB+C)

2.2.2 Intel’s Tile processor (6)

Figure 2.21: On board implementation of the 80-core Tile Processor [15]

2.2.2 Intel’s Tile processor (7)

Figure 2.22: Performance and dissipation figures of the Tile-processor [15]

2.2.2 Intel’s Tile processor (8)

Performance at 4 GHz:

Peak SP FP: up to 1.28 TFlops (2 FPMA x 2 instr./cycle x 80 x 4 GHz = 1.28 TFlops)
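The peak figure is a plain product of the four factors given in the slide. A minimal sketch that reproduces it; the function and parameter names are hypothetical.

```python
def peak_flops(fpma_units_per_tile: int, ops_per_unit_per_cycle: int,
               tiles: int, clock_hz: float) -> float:
    """Peak floating point operations per second of a tiled processor."""
    return fpma_units_per_tile * ops_per_unit_per_cycle * tiles * clock_hz


# Values from the slide: 2 FPMA units, 2 operations per cycle each,
# 80 tiles, 4 GHz clock.
peak = peak_flops(2, 2, 80, 4.0e9)
print(f"{peak / 1e12:.2f} TFlops")
```

Because the product is linear in the clock frequency, the same function gives the peak at any other operating point of the chip.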

2.2.2 Intel’s Tile processor (9)

Figure 2.23: Programmer’s perspective of the Tile processor [14]

2.2.2 Intel’s Tile processor (10)

Figure 2.24: The full instruction set of the Tile processor [14]

2.2.2 Intel’s Tile processor (11)

Figure 2.25: Instruction word (VLIW) and latencies of the Tile processor [14]

2.2.2 Intel’s Tile processor (12)

Figure 2.26: Performance of the Tile processor – the workloads [14]

2.2.2 Intel’s Tile processor (13)

Figure 2.27: Instruction word and latencies of the Tile processor [14]

2.2.2 Intel’s Tile processor (14)

Figure 2.28: The significance of the Tile processor [14]

2.2.2 Intel’s Tile processor (15)

Figure 2.29: Lessons learned from the Tile processor (1) [14]

2.2.2 Intel’s Tile processor (16)

Figure 2.30: Lessons learned from the Tile processor (2) [14]

2.2.3 Intel’s SCC

2.2.3 Intel’s SCC (1)

Intel’s SCC (Single-chip Cloud Computer)

• 12/2009: Announced

• 9/2010: Many-core Application Research Project (MARC) initiative started on the SCC platform

• Designed in Braunschweig and Bangalore

• 48 core, 2D-mesh system topology, message passing

2.2.3 Intel’s SCC (2)

Figure 2.31: The SCC chip [14]

2.2.3 Intel’s SCC (3)

Figure 2.32: Hardware view of SCC [14]

2.2.3 Intel’s SCC (4)

Figure 2.33: Dual core tile of SCC [14]

2.2.3 Intel’s SCC (5)

Figure 2.34: SCC system overview [14]

Note to the figure: JTAG (Joint Test Action Group): Standard Test Access Port

2.2.3 Intel’s SCC (6)

Figure 2.35: Removing hardware cache coherency [16]

2.2.3 Intel’s SCC (7)

Figure 2.36: Improving energy efficiency [16]

2.2.3 Intel’s SCC (8)

Figure 2.37: A programmer’s view of SCC [14]

2.2.3 Intel’s SCC (9)

Figure 2.38: Operation of SCC [14]

Note to the figure: MPB (Message Passing Buffer)

3. Heterogeneous multicores

3.1 Master-slave type multicores

3.2 Add-on type multicores
