Written by: Dezső Sima
ARCHITECTURE OF PARALLEL SYSTEMS
PARALLEL COMPUTING MODULE
PROACTIVE IT MODULE DEVELOPMENT
Reviewed by: the teaching staff working group
COPYRIGHT:
2011-2016, Dr. Sima Dezső, Óbudai Egyetem, Neumann János Informatikai Kar (Óbuda University, John von Neumann Faculty of Informatics)
Creative Commons NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0)
This work may be freely copied, distributed, published and performed for non-commercial purposes, provided the author is credited, but it may not be modified.
SUPPORT:
Prepared within the framework of grant no. TÁMOP-4.1.2-08/2/A/KMR-2009-0053, "Proactive IT module development (PRIM1): IT Service Management module and Multithreaded processors and their programming module"
PREPARED: under the care of Typotex Kiadó. RESPONSIBLE MANAGER: Votisky Zsuzsa
ISBN 978-963-279-561-4
KEYWORDS:
multicore processors, manycore processors, homogeneous multicore processors, heterogeneous multicore processors, master-slave type heterogeneous multicore processors, add-on type heterogeneous multicore processors, Core 2/Penryn/Nehalem/Nehalem-EX/Westmere/Westmere-EX/Sandy Bridge based Intel architectures, private consumer and enterprise oriented platforms, Intel's vPro platform, general purpose GPUs (GPGPUs), data parallel accelerators (DPAs), integrated CPU/GPU architectures
SUMMARY:
In this course students get an overview of the rapid evolution of processor architectures in recent years. They learn why the emergence of multicore processors was inevitable, and they become acquainted with the main classes of multicore/manycore processors, namely homogeneous and heterogeneous multicore processors, their subclasses and representative implementations.
The main families of Intel's multicore processors and their key features are presented, namely the Core 2, Penryn, Nehalem, Nehalem-EX, Westmere, Westmere-EX and Sandy Bridge based architectures. Students become acquainted with multicore desktop platforms, in particular with the private consumer and the enterprise oriented (vPro) platforms and their characteristics. A large number of concrete implementations is presented to support understanding of the material. The lectures then discuss general purpose GPUs (GPGPUs) and data parallel accelerators (DPAs), which are spreading ever more widely in compute-intensive applications. Finally, the architectures of representative Nvidia and AMD/ATI GPGPU families are presented, along with the integrated CPU/GPU architectures that appeared in the most recent phase of processor evolution and their representative implementations.
Table of contents
• Multicore-Manycore Processors
• Evolution of Intel’s Basic Microarchitectures
• Intel’s Desktop Platforms
• GPGPUs/DPAs Overview
• GPGPUs/DPAs 5.1
• GPGPUs/DPAs 5.2
• Integrated CPUs/GPUs
• References to all four sections of GPGPUs/DPAs
www.tankonyvtar.hu
© Sima Dezső, ÓE NIK 4
Dezső Sima
Multicore-Manycore
Processors
Contents
1. The inevitable era of multicores
2. Homogeneous multicores
2.1 Conventional multicores
2.2 Many-core processors
3. Heterogeneous multicores
3.1 Master-slave type heterogeneous multicores
3.2 Add-on type heterogeneous multicores
4. Outlook
5. References
1. The inevitable era of multicores
Figure 1.1: Integer performance growth of Intel’s x86 processors
[Plot: SPECint92 (logarithmic scale, 0.2 to 10000) versus year (1979 to 2005). Data points run from the 8088/5, 8088/8, 80286, 386, 486 and Pentium lines through the Pentium Pro, PII, PIII and P4 up to the P4 Northwood B and Prescott (1M, 2M). Trend: ~100x / 10 years, levelling off at the end of the Pentium 4 line.]
1. The inevitable era of multicores (1)
Integer performance grows
1. The inevitable era of multicores (2)
Performance ($P_a$) = clock frequency x efficiency ($P_a / f_c$):
$P_a = f_c \times \mathrm{IPC}$
where $f_c$ is the clock frequency and IPC (Instructions Per Cycle) is the efficiency.
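The relation between performance, clock frequency and IPC can be tried out with a few lines of Python. This is only a sketch: the function name `performance` and all numeric values are hypothetical and chosen for illustration, they are not measurements from the slides.

```python
def performance(f_c_mhz: float, ipc: float) -> float:
    """Return P_a = f_c x IPC in millions of instructions per second."""
    return f_c_mhz * ipc


# Hypothetical design points (illustrative values only, not measured data).
base = performance(f_c_mhz=1000.0, ipc=1.0)
faster_clock = performance(f_c_mhz=2000.0, ipc=1.0)  # double the clock frequency
wider_core = performance(f_c_mhz=1000.0, ipc=2.0)    # double the efficiency (IPC)

print(base, faster_clock, wider_core)
```

Doubling either factor doubles the performance, which is why the slides treat clock frequency and efficiency as the two separate levers.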
Figure 1.2: Efficiency of Intel processors
[Plot: SPECint_base2000/f_c (logarithmic scale, 0.01 to 1) versus year (1978 to 2002) for the 286, 386DX, 486DX, Pentium, Pentium Pro, Pentium II and Pentium III. Trend: ~10x / 10 years, levelling off with the 2. generation superscalars.]
1. The inevitable era of multicores (3)
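The two trend lines quoted in Figures 1.1 and 1.2 (~100x / 10 years for performance, ~10x / 10 years for efficiency) can be converted into yearly growth factors. The sketch below assumes steady exponential growth over the decade; the helper name `annual_factor` is hypothetical. Because performance is the product of clock frequency and IPC, the clock frequency growth is obtained as the quotient of the two factors.

```python
def annual_factor(total_factor: float, years: float) -> float:
    """Yearly growth factor that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


perf_per_year = annual_factor(100.0, 10.0)     # ~100x / 10 years (Figure 1.1)
eff_per_year = annual_factor(10.0, 10.0)       # ~10x / 10 years (Figure 1.2)
clock_per_year = perf_per_year / eff_per_year  # remaining share: clock frequency

print(f"performance: +{(perf_per_year - 1) * 100:.1f}% per year")
print(f"efficiency:  +{(eff_per_year - 1) * 100:.1f}% per year")
print(f"clock:       +{(clock_per_year - 1) * 100:.1f}% per year")
```

Under this assumption, performance grew by roughly 58% per year, while efficiency and clock frequency each contributed roughly 26% per year.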
1. The inevitable era of multicores (4)
Main sources of processor efficiency (IPC)
• Processor width: pipeline (1), 1. gen. superscalar (2), 2. gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity ...)
Figure 1.3: Main sources of processor efficiency
Figure 1.4: Extent of parallelism available in general purpose applications for 2. generation superscalars [37]
1. The inevitable era of multicores (5)
1. The inevitable era of multicores (6)
Main sources of processor efficiency (IPC)
• Processor width: pipeline (1), 1. gen. superscalar (2), 2. gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity ...)
Figure 1.5: Main sources of processor efficiency
Beginning with 2. generation superscalars
• the era of extensively increasing processor efficiency came to an end
• processor efficiency levelled off.
$P_a = f_c \times \mathrm{IPC}$ (clock frequency x Instructions Per Cycle)
With IPC levelling off, a performance increase can basically be achieved only by raising $f_c$.
1. The inevitable era of multicores (7)
Figure 1.6: Evolution of Intel’s process technology [38]
1. The inevitable era of multicores (8)
Shrinking: linear feature size ~ 0.7x / 2 years
Figure 1.7: The actual rise of IC complexity in DRAMs and microprocessors [39]
1. The inevitable era of multicores (9)
Moore’s law: doubling transistor counts ~ every two years
What is the best use of the ever increasing number of transistors?
Main sources of processor efficiency (IPC)
• Processor width: pipeline (1), 1. gen. superscalar (2), 2. gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity ...)
1. The inevitable era of multicores (10)
IC fab technology
Moore’s law: ~ doubling transistor counts / 2 years (linear shrink ~ 0.7x / 2 years)
Possible use of surplus transistors
• Processor width: pipeline (1), 1. gen. superscalar (2), 2. gen. superscalar (4)
• Core enhancements: branch prediction, speculative loads, ...
• Cache enhancements: L2/L3 enhancements (size, associativity ...)
Figure 1.8: Possible use of surplus transistors
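The link between the ~0.7x linear shrink and the doubling of transistor counts every two years can be checked with a short calculation. The sketch below assumes that transistor density scales with the inverse of the squared linear feature size and that the die area stays constant; the function name `transistor_budget` is hypothetical.

```python
LINEAR_SHRINK = 0.7  # linear feature size factor per ~2 years (from the slides)

# Assumption: area per transistor scales with the square of the linear size.
area_factor = LINEAR_SHRINK ** 2  # ~0.49, i.e. about half the area
density_gain = 1.0 / area_factor  # ~2.04, i.e. about twice the transistors


def transistor_budget(initial: float, years: float, doubling_period: float = 2.0) -> float:
    """Transistor count after the given years when it doubles every doubling_period."""
    return initial * 2.0 ** (years / doubling_period)


print(f"area factor per generation: {area_factor:.2f}")
print(f"density gain per generation: {density_gain:.2f}x")
print(f"relative budget after 10 years: {transistor_budget(1.0, 10.0):.0f}x")
```

Five doublings in ten years give a 32-fold transistor budget, which is the surplus the slides ask how to use.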
1. The inevitable era of multicores (11)
Increasing number of transistors, but diminishing return in performance
→ Use available surplus transistors for multiple cores
→ The inevitable era of multicores
1. The inevitable era of multicores (12)
Figure 1.9: Rapid spreading of Intel’s multicore processors [40]
1. The inevitable era of multicores (13)
2. Homogeneous multicores
2.1 Conventional multicores
2.2 Many-core processors
2. Homogeneous multicores (1)
Figure 2.1: Major classes of multicore processors
Multicore processors
• Homogeneous multicores
  - Conventional multicores (2 ≤ n ≤ 8 cores): mobiles, desktops, servers; general purpose computing
  - Manycore processors (with >8 cores): prototypes/experimental systems
• Heterogeneous multicores
  - Master/slave type multicores: MPC; MM/3D/HPC, production stage
  - Add-on type multicores: CPU + GPU; HPC
2.1 Conventional multicores
2.1.1 Example: Intel’s MP servers
2.1 Conventional multicores (1)
Figure 2.1: Major classes of multicore processors (repeated from Section 2)
2.1.1 Example: Intel’s MP servers
2.1.1.1 Introduction
2.1.1.2 The Pentium 4 based Truland MP platform
2.1.1.3 The Core 2 based Caneland MP platform
2.1.1.4 The Nehalem-EX based Boxboro-EX MP platform
2.1.1.5 Evolution of MP platforms
2.1.1.1 Introduction
2.1.1.1 Introduction (1)
Servers
• Uni-Processors (UP)
• Dual Processors (DP)
• Multi Processors (MP) (typically 4 processors)
• Servers with more than 8 processors
Basic arch. | Core/technology | Intro | MP server processor | Cores and caches
Pentium 4 (Prescott) | Pentium 4, 90 nm | 11/2005 | Paxville MP | 2x1 C, 2 MB L2/C
Pentium 4 (Prescott) | Pentium 4, 65 nm | 8/2006 | 7100 (Tulsa) | 2x1 C, 1 MB L2/C, 16 MB L3
Core 2 | Core2, 65 nm | 9/2007 | 7200 (Tigerton DC) | 1x2 C, 4 MB L2/C
Core 2 | Core2, 65 nm | 9/2007 | 7300 (Tigerton QC) | 2x2 C, 4 MB L2/C
Core 2 | Penryn, 45 nm | 9/2008 | 7400 (Dunnington) | 1x6 C, 3 MB L2/2C, 16 MB L3
Nehalem | Nehalem-EP, 45 nm | | |
Nehalem | Westmere-EP, 32 nm | | |
Nehalem | Nehalem-EX, 45 nm | 3/2010 | 7500 (Beckton) | 1x8 C, ¼ MB L2/C, 24 MB L3
Nehalem | Westmere-EX, 32 nm | 4/2011 | E7-48xx (Westmere-EX) | 1x10 C, ¼ MB L2/C, 30 MB L3
Sandy Bridge | Sandy Bridge, 32 nm | /2011 | |
Sandy Bridge | Ivy Bridge, 22 nm | 11/2012 | |
2.1.1.1 Introduction (2)
MP server platforms
• Pentium 4 based MP server platform: Truland (2005) (90 nm/65 nm Pentium 4 Prescott MP based)
• Core 2 based MP server platform: Caneland (2007)
• Nehalem-EX based MP server platform: Boxboro-EX (2010)
• Sandy Bridge based MP server platform: to be announced yet
2.1.1.1 Introduction (3)
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (1)
Overview
Remark
To present a more complete view of the evolution of multicore MP server platforms, we also include the single core (SC) 90 nm Pentium 4 Prescott based Xeon MP (Potomac) processor, which was the first 64-bit MP server processor and gave rise to the Truland platform.
MP platforms: Truland (3/2005), Truland (updated) (11/2005)
MP cores
• Xeon MP (Potomac) 1C, 3/2005: 90 nm/675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604 (Pentium 4-based/90 nm)
• Xeon 7000 (Paxville MP) 2x1C, 11/2005: 90 nm/2x169 mtrs, 2x1 (2) MB L2, L3: -, 800/667 MT/s, mPGA 604 (Pentium 4-based/90 nm)
• Xeon 7100 (Tulsa) 2C, 8/2006: 65 nm/1328 mtrs, 2x1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604 (Pentium 4-based/65 nm)
MCH
• E8500 (Twin Castle), 3/2005: 2xFSB 667 MT/s, HI 1.5, 4 x XMB (2 channels/XMB, 4 DIMMs/channel), DDR-266/333, DDR2-400, 32 GB
• E8501 (Twin Castle?), 11/2005: 2xFSB 800 MT/s, HI 1.5, 4 x XMB (2 channels/XMB, 4 DIMMs/channel), DDR-266/333, DDR2-400, 32 GB
ICH: ICH5, 4/2003
Legend: Pentium 4 based, Core 2 based, Penryn based
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (2)
Figure 2.2: The Potomac processor as Intel’s first 64-bit Xeon MP processor based on the µPGA604
[Roadmap chart of Intel’s Pentium 4 family, 2000 to 2005, with four rows: Xeon MP line, Xeon DP line, Desktop line and Celeron line (Value PCs). For each core (among others Willamette, Northwood, Prescott, Foster, Prestonia, Nocona, Irwindale, Gallatin, Potomac, Celeron-D) the chart gives the introduction date, feature size and transistor count, FSB speed, clock frequencies, on-die cache sizes and socket. Markings distinguish cores supporting hyperthreading, cores with EM64T implemented but not enabled, and cores supporting EM64T. Jayhawk and Tejas are marked as cancelled (5/04).]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (3)
Basic system architecture of the 90 nm Pentium 4 Prescott MP based Truland MP server platform
[Block diagram: four Pentium 4 Xeon MP processors (1C/2x1C), i.e. Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C, attached via FSBs to the 8500¹/8501 MCH. The MCH serves four XMBs with DDR-266/333 or DDR2-400 memory and connects via HI 1.5 to the ICH5.]
Pentium 4 Prescott MP based Truland MP server platform (for up to 2 cores)
XMB: eXternal Memory Bridge. Provides a serial link with 5.33 GB/s inbound and 2.67 GB/s outbound bandwidth (simultaneously)
HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate
¹ The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)
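The peak transfer rate of HI 1.5 follows directly from its quoted parameters (8 bit wide, 66 MHz clock, QDR), and the XMB figures can be summed over the four XMBs of the platform. The sketch below is only an illustration: the helper name is hypothetical, and the 66.67 MHz value is an assumption about the exact nominal clock behind the rounded "66 MHz".

```python
def parallel_link_mb_per_s(width_bits: int, clock_mhz: float, transfers_per_clock: int) -> float:
    """Peak rate of a parallel link in MB/s: bytes per transfer x transfers per second."""
    return width_bits / 8 * clock_mhz * transfers_per_clock


# HI 1.5: 8 bit wide, 66 MHz clock, quad data rate (4 transfers per clock).
hi15_rounded_clock = parallel_link_mb_per_s(8, 66.0, 4)  # with the rounded 66 MHz
hi15_nominal_clock = parallel_link_mb_per_s(8, 66.67, 4)  # assumed 66.67 MHz clock

# XMB: 5.33 GB/s inbound and 2.67 GB/s outbound per XMB, four XMBs per MCH.
XMB_COUNT = 4
xmb_inbound_total = XMB_COUNT * 5.33
xmb_outbound_total = XMB_COUNT * 2.67

print(f"HI 1.5: {hi15_rounded_clock:.0f} to {hi15_nominal_clock:.0f} MB/s")
print(f"4 x XMB: {xmb_inbound_total:.2f} GB/s inbound, {xmb_outbound_total:.2f} GB/s outbound")
```

The result of about 264 to 267 MB/s agrees with the 266 MB/s shown for HI 1.5 in the platform diagrams.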
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (4)
Expanding the Truland platform to 3 generations of Pentium 4 based Xeon MP servers: Xeon MP (1C), Xeon 7000 (2x1C), Xeon 7100 (2C)
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (5)
Figure 2.3: Expanding the Truland platform [1]
Example 1: Block diagram of an 8500 chipset based Truland MP server board [2]
Figure 2.4: Block diagram of an 8500 chipset based Truland MP server board [2]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (6)
Example 2: Block diagram of the E8501 based Truland MP server platform [3]
Figure 2.5: Intel’s 8501 chipset based Truland MP server platform (4/2006) [3]
[Figure annotations: Xeon DC MP 7000 (4/2005) or later DC/QC MP 7000 processors; E8501 (North Bridge); IMI (Independent Memory Interface): serial link, 5.33 GB/s inbound and 2.67 GB/s outbound bandwidth simultaneously; XMB (eXternal Memory Bridge): intelligent memory controller, dual memory channels, DDR 266/333/400, 4 DIMMs/channel]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (7)
Example 3: E8501 based MP server board implementing the Truland platform
Figure 2.6: Intel E8501 chipset based MP server board (Supermicro X6QT8) for the Xeon 7000/7100 DC MP processor families [4]
[Figure annotations: Xeon DC 7000/7100 processors; E8501 NB; ICH5R SB; 2 x XMB and 2 x XMB; DDR2 DIMMs, 64 GB]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (8)
Figure 2.7: Bandwidth bottlenecks in Intel’s 8501 based Truland MP server platform [5]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (9)
Remark
Previous (first generation) MP servers made use of a symmetric topology including only a single FSB that connects all 4 single core processors to the MCH (north bridge), as shown below.
Typical system architecture of a first generation Xeon MP based MP server platform
[Block diagram: four single core (SC) Xeon MP¹ processors on a single shared FSB, attached to the preceding NBs, which serve e.g. DDR-200/266 memory and connect via e.g. HI 1.5 (266 MB/s) to the preceding ICH]
Figure 2.8: Previous Pentium 4 MP based MP server platform (for single core processors)
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (10)
Example: Block diagram of an MP server board that is based on Pentium 4 (Willamette MP) single core 32-bit Xeon MP processors (called Foster)
The chipset (CMIC/CSB5) is ServerWorks’ Grand Champion HE Classic chipset.
The memory is placed on an extra card.
There are 4 memory controllers, each supporting 4 DIMMs (DDR-266/200).
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (11)
Figure 2.9: Block diagram of an MP server board [6]
Evolution from the first generation MP servers supporting SC processors to the 90 nm Pentium 4 Prescott MP based Truland MP server platform (supporting up to 2 cores)
[Block diagrams side by side. Left: previous Pentium 4 MP based MP server (for single core processors), with four SC Xeon MP¹ processors on a single FSB, the preceding NBs with e.g. DDR-200/266 memory, and the preceding ICH attached via e.g. HI 1.5 (266 MB/s). Right: 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C), with four Pentium 4 Xeon MP (1C/2x1C) processors, i.e. Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C, the 8500¹/8501 MCH, four XMBs with DDR-266/333 or DDR2-400 memory, and the ICH5 attached via HI 1.5.]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (12)
2.1.1.3 The Core 2 based Caneland MP server platform (1)
2.1.1.3 The Core 2 based Caneland MP server platform
MP platforms: Caneland (9/2007)
MP cores
• Xeon 7200 (Tigerton DC) 1x2C, 9/2007: 65 nm/2x291 mtrs, 2x4 MB L2, L3: -, 1066 MT/s, mPGA 604 (Core2-based/65 nm)
• Xeon 7300 (Tigerton QC) 2x2C, 9/2007: 65 nm/2x291 mtrs, 2x(4/3/2) MB L2, L3: -, 1066 MT/s, mPGA 604 (Core2-based/65 nm)
• Xeon 7400 (Dunnington 6C), 9/2008: 45 nm/1900 mtrs, 9/6 MB L2, 16/12/8 MB L3, 1066 MT/s, mPGA 604 (Penryn 45 nm)
MCH: E7300 (Clarksboro), 9/2007: 4xFSB 1066 MT/s, ESI, 4 x FBDIMM (DDR2-533/667, 8 DIMMs/channel), 512 GB
ICH: 631xESB/632xESB, 5/2006
Legend: Pentium 4 based, Core 2 based, Penryn based
2.1.1.3 The Core 2 based Caneland MP server platform (2)
Basic system architecture of the Core 2 based Caneland MP server platform
[Block diagrams side by side.
Left: 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C), with four Pentium 4 Xeon MP (1C/2x1C) processors, i.e. Xeon MP (Potomac) 1C / Xeon 7000 (Paxville MP) 2x1C / Xeon 7100 (Tulsa) 2C, attached via FSBs to the 8500¹/8501 MCH, four XMBs with DDR-266/333 or DDR2-400 memory, and the ICH5 attached via HI 1.5.
Right: Core 2 based Caneland MP server platform (for up to 6 C), with four Core2 (2C/4C) / Penryn (6C) processors, i.e. Xeon 7200 (Tigerton DC) 1x2C / Xeon 7300 (Tigerton QC) 2x2C / Xeon 7400 (Dunnington 6C), attached via FSBs to the 7300 MCH, FBDIMM memory (DDR2-533/667, up to 8 DIMMs), and the 631xESB/632xESB attached via ESI.]
ESI: Enterprise System Interface. 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction)
HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate
¹ The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)
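The ESI transfer rate quoted for the Caneland platform can be reproduced from its lane count, and compared with the HI 1.5 link of the Truland platform. This is a minimal sketch; the helper name `serial_link_gb_per_s` is hypothetical, and 266 MB/s is taken as the HI 1.5 rate shown in the platform diagrams.

```python
def serial_link_gb_per_s(lanes: int, gb_per_s_per_lane: float) -> float:
    """Peak rate per direction of a multi-lane serial link in GB/s."""
    return lanes * gb_per_s_per_lane


esi_gb_per_s = serial_link_gb_per_s(lanes=4, gb_per_s_per_lane=0.25)  # ESI, per direction
hi15_gb_per_s = 0.266                                                 # HI 1.5 peak rate

speedup = esi_gb_per_s / hi15_gb_per_s

print(f"ESI: {esi_gb_per_s:.2f} GB/s per direction")
print(f"ESI vs. HI 1.5: {speedup:.1f}x")
```

Under these figures the south bridge link of Caneland offers almost four times the peak rate of its Truland counterpart, in each direction.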
2.1.1.3 The Core 2 based Caneland MP server platform (3)
Figure 2.10: Intel’s 7300 chipset based Caneland platform for the Xeon 7200/7300 DC/QC processors (9/2007) [7]
[Figure annotations: Xeon 7200 (Tigerton DC, Core 2), 2C / Xeon 7300 (Tigerton QC, Core 2), QC; FB-DIMM: 4 channels, 8 DIMMs/channel, up to 512 GB]
Example 1: Intel’s Nehalem-EP based Tylersburg-EP DP server platform with a single IOH
2.1.1.3 The Core 2 based Caneland MP server platform (4)
Example 3: Caneland MP serverboard
Figure 2.11: Caneland MP Supermicro serverboard, with the 7300 (Clarksboro) chipset
[Figure annotations: Xeon 7200 DC / 7300 QC (Tigerton); 7300 NB; SBE2 SB; FB-DIMM DDR2, 192 GB; ATI ES1000 Graphics with 32 MB video memory]
2.1.1.3 The Core 2 based Caneland MP server platform (5)
Figure 2.12: Performance comparison of the Caneland platform with a quad core Xeon (7300 family) vs the Bensley platform with a dual core Xeon 7140M [8]
2.1.1.3 The Core 2 based Caneland MP server platform (6)
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (1)
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform
MP platforms: Boxboro-EX (3/2010)
MP cores
• Xeon 7500 (Nehalem-EX) (Beckton) 8C, 3/2010: 45 nm/2300 mtrs/513 mm2, ¼ MB L2/C, 24 MB L3, 4 QPI links, 4 SMI links, 2 mem. channels/link, 2 DIMMs/mem. channel, DDR3 1067 MT/s, 1 TB (64x16 GB), LGA1567 (Nehalem-EX based, 45 nm)
• Xeon E7-4800 (Westmere-EX) 10C, 4/2011: 32 nm/2600 mtrs/584 mm2, ¼ MB L2/C, 30 MB L3, 4 QPI links, 4 SMI links, 2 mem. channels/link, 2 DIMMs/mem. channel, DDR3 1333 MT/s, 1 TB (64x16 GB), LGA1567 (Westmere-EX based, 32 nm)
IOH: 7500 (Boxboro), 3/2010: 2 QPI links, 32 x PCIe 2. Gen. (0.5 GB/s/lane/direction), ESI (1 GB/s/direction)
ICH: ICH10, 6/2008
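The "1 TB (64x16 GB)" capacity quoted for the Boxboro-EX platform can be derived from the memory topology listed above. The sketch below assumes a four-socket MP configuration (the slides describe MP servers as typically having 4 processors); the class name `MemoryTopology` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemoryTopology:
    sockets: int            # processors in the system (assumed: 4 for an MP server)
    links_per_socket: int   # SMI links per processor
    channels_per_link: int  # memory channels per SMI link
    dimms_per_channel: int  # DIMMs per memory channel
    dimm_gb: int            # capacity of one DIMM in GB

    def dimm_slots(self) -> int:
        """Total number of DIMMs the platform can hold."""
        return (self.sockets * self.links_per_socket
                * self.channels_per_link * self.dimms_per_channel)

    def capacity_gb(self) -> int:
        """Maximum memory capacity in GB."""
        return self.dimm_slots() * self.dimm_gb


boxboro_ex = MemoryTopology(sockets=4, links_per_socket=4,
                            channels_per_link=2, dimms_per_channel=2, dimm_gb=16)

print(boxboro_ex.dimm_slots(), "DIMMs,", boxboro_ex.capacity_gb(), "GB")
```

With 16 DIMMs per processor and four processors the platform holds 64 DIMMs, which gives 1024 GB, i.e. the 1 TB stated in the overview.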
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (2)
The 8 core Nehalem-EX (Xeon 7500/Beckton) MP server processor
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (3)
Figure 2.13: The 8 core Nehalem-EX (Xeon 7500/Beckton) MP server processor [9]
[Figure annotation: 2 cores]
The 10 core Westmere-EX (Xeon E7-4800) MP server processor [10]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (4)
Block diagram of the Westmere-EX (E7-8800/4800/2800) processors [11]
• E7-8800: for 8 P systems
• E7-4800: for MP systems
• E7-2800: for DP systems
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (5)
Main platform features introduced in the 7500 Boxboro IOH (1)
Along with its Nehalem-EX based Boxboro platform, Intel continued its move to increase system security and manageability by introducing platform features that had otherwise been provided by its continuously enhanced vPro technology, for enterprise oriented desktops since 2006 and for DP servers since 2007.
The platform features introduced in the 7500 IOH are basically the same as those described for the Tylersburg-EP DP platform, which is based on the 5500 IOH, a chipset akin to the 7500 IOH of the Boxboro-EX platform.
They include:
a) Intel Management Engine (ME)
b) Intel Virtualization Technology for Directed I/O (VT-d2). VT-d2 is an upgraded version of VT-d.
c) Intel Trusted Execution Technology (TXT).
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (6)
Basic system architecture of the Nehalem-EX based Boxboro-EX MP server platform (assuming 1 IOH)
[Block diagram: four Nehalem-EX 8C / Westmere-EX 10C processors, i.e. Xeon 7500 (Nehalem-EX) (Beckton) 8C / Xeon E7-4800 (Westmere-EX) 10C, connected to each other and to the 7500 IOH by QPI links. The processors reach their DDR3-1067 memory through SMBs over SMI channels (2x4 SMI channels on each side of the diagram). The diagram also shows an ME block, and the ICH10 attached via ESI.]
SMI: Serial link between the processor and the SMB
SMB: Scalable Memory Buffer (parallel/serial conversion)
ME: Management Engine
Nehalem-EX based Boxboro-EX MP server platform (for up to 10 C)
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (7)
Wide range of scalability of the 7500/6500 IOH based Boxboro-EX platform [12]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (8)
Example: Block diagram of a 7500 chipset based Boxboro-EX MP serverboard [13]
[Figure annotation: ESI]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (9)
2.1.1.5 Evolution of MP server platforms (1)
2.1.1.5 Evolution of MP server platforms
Evolution from the first generation MP servers supporting SC processors to the 90 nm Pentium 4 Prescott MP based Truland MP server platform (supporting up to 2 cores)
[Block diagrams side by side, repeated from Section 2.1.1.2. Left: previous Pentium 4 MP based MP server platform (for single core processors), with four SC Xeon MP¹ processors on a single FSB, the preceding NBs with e.g. DDR-200/266 memory, and the preceding ICH attached via e.g. HI 1.5 (266 MB/s). Right: 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C), with four Pentium 4 Xeon MP (1C/2x1C) processors, the 8500¹/8501 MCH, four XMBs with DDR-266/333 or DDR2-400 memory, and the ICH5 attached via HI 1.5.]
2.1.1.5 Evolution of MP server platforms (2)
Evolution from the 90 nm Pentium 4 Prescott MP based Truland MP platform (up to 2 cores) to the Core 2 based Caneland MP platform (up to 6 cores)
[Block diagrams side by side, repeated from Section 2.1.1.3. Left: 90 nm Pentium 4 Prescott MP based Truland MP server platform (for up to 2 C), with four Pentium 4 Xeon MP (1C/2x1C) processors, the 8500¹/8501 MCH, four XMBs with DDR-266/333 or DDR2-400 memory, and the ICH5 attached via HI 1.5. Right: Core 2 based Caneland MP server platform (for up to 6 C), with four Core2 (2C/4C) / Penryn (6C) processors, i.e. Xeon 7200 (Tigerton DC) 1x2C / Xeon 7300 (Tigerton QC) 2x2C / Xeon 7400 (Dunnington 6C), the 7300 MCH, FBDIMM memory (DDR2-533/667, up to 8 DIMMs), and the 631xESB/632xESB attached via ESI.]
ESI: Enterprise System Interface. 4 PCIe lanes, 0.25 GB/s per lane (like the DMI interface, providing 1 GB/s transfer rate in each direction)
HI 1.5 (Hub Interface 1.5): 8 bit wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate
¹ The E8500 MCH supports an FSB of 667 MT/s and consequently only the SC Xeon MP (Potomac)
2.1.1.5 Evolution of MP server platforms (3)
Evolution to the Nehalem-EX based Boxboro-EX MP platform (that supports up to 10 cores). (In the basic system architecture we show the single IOH alternative.)
[Block diagram, repeated from Section 2.1.1.4: four Nehalem-EX 8C / Westmere-EX 10C processors, i.e. Xeon 7500 (Nehalem-EX) (Beckton) 8C / Xeon E7-4800 (Westmere-EX) 10C, connected to each other and to the 7500 IOH by QPI links. The processors reach their DDR3-1067 memory through SMBs over SMI channels (2x4 SMI channels on each side of the diagram). The diagram also shows an ME block, and the ICH10 attached via ESI.]
SMI: Serial link between the processor and the SMBs
SMB: Scalable Memory Buffer (parallel/serial converter)
ME: Management Engine
2.1.1.5 Evolution of MP server platforms (4)
2.2 Many-core processors
2.2.1 Intel’s Larrabee
2.2.2 Intel’s Tile processor
2.2.3 Intel’s SCC
2.2 Manycore processors (1)
Figure 2.1: Major classes of multicore processors (repeated from Section 2)
2.2.1 Intel’s Larrabee
2.2.1 Larrabee
Part of Intel’s Tera-Scale Initiative.
• Objectives:
High end graphics processing, HPC
Not a single product but a base architecture for a number of different products.
• Brief history:
Project started ~ 2005
First unofficial public presentation: 03/2006 (withdrawn)
First official public presentation: 08/2008 (SIGGRAPH)
Due in ~ 2009
• Performance (targeted):
2 TFlops
2.2.1 Intel’s Larrabee (1)
Basic architecture
Figure 2.14: Block diagram of a GPU-oriented Larrabee (2006, outdated) [41]
Update: SIMD processing width: SIMD-64 rather than SIMD-16
2.2.1 Intel’s Larrabee (2)
Figure 2.15: Board layout of a GPU-oriented Larrabee (2006, outdated) [42]
2.2.1 Intel’s Larrabee (3)
Figure 2.16: Four socket MP server design with 24-core Larrabees connected by the CSI bus [41]
2.2.1 Intel’s Larrabee (4)
2.2.2 Intel’s Tile processor
2.2.2 Tile processor
• First incarnation of Intel’s Tera-Scale Initiative
• Objective: Tera-Scale experimental chip (more than 100 projects underway)
• Brief history:
Announced at IDF 9/2006
Due in 2009/2010
2.2.2 Intel’s Tile processor (1)
Figure 2.17: A forerunner: The Raw processor (MIT 2002)
(16 tiles, each tile has a compute element, router, instruction and data memory) [43]
2.2.2 Intel’s Tile processor (2)
Bisection bandwidth:
If the network is segmented into two equal parts, this is the bandwidth between the two parts
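For a 2D mesh, the bisection bandwidth defined above can be estimated by counting the links that a straight cut through the middle of the mesh crosses. The sketch below is illustrative only: the 8 x 10 arrangement of the 80 tiles and the 2 GB/s link bandwidth are assumptions, not figures taken from the slides, and the function names are hypothetical.

```python
def mesh_bisection_links(rows: int, cols: int) -> int:
    """Links crossed by the cheapest straight cut that halves a rows x cols 2D mesh."""
    candidates = []
    if cols % 2 == 0:
        candidates.append(rows)  # cut between the two middle columns: one link per row
    if rows % 2 == 0:
        candidates.append(cols)  # cut between the two middle rows: one link per column
    if not candidates:
        raise ValueError("a straight equal cut needs at least one even dimension")
    return min(candidates)


def bisection_bandwidth(rows: int, cols: int, link_gb_per_s: float) -> float:
    """Bisection bandwidth in GB/s, counting each cut link once (one direction)."""
    return mesh_bisection_links(rows, cols) * link_gb_per_s


# Assumed example: 80 tiles arranged as 8 rows x 10 columns, 2 GB/s per link.
links = mesh_bisection_links(8, 10)
bandwidth = bisection_bandwidth(8, 10, link_gb_per_s=2.0)
print(links, "links,", bandwidth, "GB/s")
```

The bisection grows only with the shorter side of the mesh, whereas the number of tiles grows with the product of both sides, which is one reason the metric matters for manycore designs.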
Figure 2.18: Die photo and chip details of the Tile processor [14]
2.2.2 Intel’s Tile processor (3)
Figure 2.19: Main blocks of a tile [14]
2.2.2 Intel’s Tile processor (4)
(Clocks run with the same frequency but unknown phases)
FP Multiply-Accumulate (AxB+C)
Figure 2.20: Block diagram of a tile [14]
2.2.2 Intel’s Tile processor (5)
Figure 2.21: On board implementation of the 80-core Tile Processor [15]
2.2.2 Intel’s Tile processor (6)
Figure 2.22: Performance and dissipation figures of the Tile-processor [15]
2.2.2 Intel’s Tile processor (7)
Performance at 4 GHz:
Peak SP FP: up to 1.28 TFlops (2 FPMA x 2 instr./cycle x 80 x 4 GHz = 1.28 TFlops)
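The peak figure is simply the product of the four factors given on the slide. The following sketch repeats that arithmetic; the function and parameter names are hypothetical.

```python
def peak_gflops(fpma_units: int, ops_per_cycle: int, cores: int, clock_ghz: float) -> float:
    """Peak single precision rate in GFlops as the product of the four factors."""
    return fpma_units * ops_per_cycle * cores * clock_ghz


# Factors from the slide: 2 FPMA x 2 instr./cycle x 80 cores x 4 GHz.
peak = peak_gflops(fpma_units=2, ops_per_cycle=2, cores=80, clock_ghz=4.0)
print(f"{peak:.0f} GFlops = {peak / 1000:.2f} TFlops")
```

Since the clock frequency enters linearly, the same function gives the peak rate at any other operating frequency.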
2.2.2 Intel’s Tile processor (8)
Figure 2.23: Programmer’s perspective of the Tile processor [14]
2.2.2 Intel’s Tile processor (9)
Figure 2.24: The full instruction set of the Tile processor [14]
2.2.2 Intel’s Tile processor (10)
VLIW
Figure 2.25: Instruction word and latencies of the Tile processor [14]
2.2.2 Intel’s Tile processor (11)
Figure 2.26: Performance of the Tile processor – the workloads [14]
2.2.2 Intel’s Tile processor (12)
Figure 2.27: Instruction word and latencies of the Tile processor [14]
2.2.2 Intel’s Tile processor (13)
Figure 2.28: The significance of the Tile processor [14]
2.2.2 Intel’s Tile processor (14)
Figure 2.29: Lessons learned from the Tile processor (1) [14]
2.2.2 Intel’s Tile processor (15)
Figure 2.30: Lessons learned from the Tile processor (2) [14]
2.2.2 Intel’s Tile processor (16)
2.2.3 Intel’s SCC
2.2.3 Intel’s SCC (Single-chip Cloud Computer)
• 12/2009: Announced
• 9/2010: Many-core Application Research Project (MARC) initiative started on the SCC platform
• Designed in Braunschweig and Bangalore
• 48 core, 2D-mesh system topology, message passing
2.2.3 Intel’s SCC (1)
Figure 2.31: The SCC chip [14]
2.2.3 Intel’s SCC (2)
Figure 2.32: Hardware view of SCC [14]
2.2.3 Intel’s SCC (3)
Figure 2.33: Dual core tile of SCC [14]
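With 48 cores and dual core tiles, the SCC mesh connects 24 tiles. The sketch below estimates how many mesh hops a message travels between two cores. It rests on assumptions that are not stated on the slides: a 6 x 4 tile arrangement, a simple row-major core numbering, and dimension-ordered (XY) routing, for which the hop count equals the Manhattan distance. All names are hypothetical.

```python
CORES = 48
CORES_PER_TILE = 2
TILES = CORES // CORES_PER_TILE  # 24 tiles

# Assumed layout: 6 columns x 4 rows, tiles numbered row by row.
MESH_COLS, MESH_ROWS = 6, 4


def tile_of(core: int) -> int:
    """Tile index of a core, assuming consecutive cores share a tile."""
    return core // CORES_PER_TILE


def coords(tile: int) -> tuple[int, int]:
    """(x, y) mesh position of a tile under the assumed row-major numbering."""
    return tile % MESH_COLS, tile // MESH_COLS


def hops(src_core: int, dst_core: int) -> int:
    """Mesh hops between two cores under XY routing (Manhattan distance)."""
    x0, y0 = coords(tile_of(src_core))
    x1, y1 = coords(tile_of(dst_core))
    return abs(x0 - x1) + abs(y0 - y1)


print(TILES, hops(0, 1), hops(0, 47))
```

Cores on the same tile need no mesh hop at all, while the two opposite corners of the assumed mesh are 8 hops apart.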
2.2.3 Intel’s SCC (4)
[Figure annotation: JTAG (Joint Test Action Group): Standard Test Access Port]
Figure 2.34: SCC system overview [14]
2.2.3 Intel’s SCC (5)
Figure 2.35: Removing hardware cache coherency [16]
2.2.3 Intel’s SCC (6)
Figure 2.36: Improving energy efficiency [16]
2.2.3 Intel’s SCC (7)
Figure 2.37: A programmer’s view of SCC [14]
2.2.3 Intel’s SCC (8)
[Figure annotation: MPB (Message Passing Buffer)]
Figure 2.38: Operation of SCC [14]
2.2.3 Intel’s SCC (9)
3. Heterogeneous multicores
3.1 Master-slave type multicores
3.2 Add-on type multicores