
Key features of the Core 2 microarchitecture



2.2 Wide execution (1)

2.2 Wide execution

• 4-wide core

• Enhanced execution resources

• Micro fusion

• Macro fusion

2.2 Wide execution (2)

4-wide core

4-wide front end and retire unit: the key benefit of the Core family.

By contrast, both Intel's previous Pentium 4 family and AMD's K8 have 3-wide cores.


2.2 Wide execution (3)

Figure 2.2: Block diagram of Intel’s Core microarchitecture [4]

2.2 Wide execution (4)


Figure 2.3: Block diagram of Intel’s Pentium 4 microarchitecture [5]

Retire width: 3 instr./cycle


2.2 Wide execution (5)

Figure 2.4: Block diagram of AMD's K8 microarchitecture [4]

2.2 Wide execution (6)

Enhanced execution resources

The Core has three complex SSE units.

By contrast:

• The Pentium 4 provides a single complex SSE unit and a second simple SSE unit performing only SSE move and store operations.

• The K8 has two SSE units.


2.2 Wide execution (7)

Figure 2.5: Issue ports and execution units of the Core [4]

2.2 Wide execution (8)

Ports 0 and 1 can each issue up to two micro-instructions per cycle, allowing up to 6 micro-instructions to be issued per cycle altogether.

Figure 2.6: Issue ports and execution units of the Pentium 4 [9]


2.2 Wide execution (9)

Remark

Both the Core's and the Pentium 4's schedulers can issue up to 6 operations per cycle, but

• the Pentium 4's schedulers have only 4 ports, two of which feed double-pumped simple ALUs,

• by contrast, the Core has a unified scheduler with 6 ports, allowing more flexibility in issuing instructions.

2.2 Wide execution (10)

Table 2.1: Key features of x86 processors [4]


2.2 Wide execution (11)

Remark

IBM's POWER4 and the subsequent processors of this line introduced 5-wide cores with 8-wide out-of-order issue.

These processors bundle five consecutive instructions into a group, dispatch groups in order, execute instructions out of order, and retire groups in order (one group per cycle), as the toy sketch below illustrates.
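As a rough illustration of this group formation and group-wise retirement, a minimal C sketch follows (an assumption-laden toy model: fixed groups of up to five slots, no cracking of instructions into several internal operations, no stalls; the real POWER4 group-formation rules are considerably more involved):

#include <stdio.h>

#define GROUP_SLOTS 5   /* POWER4 forms dispatch groups of up to 5 slots */

/* Toy model: bundle a linear stream of instructions into groups of
   GROUP_SLOTS, dispatch the groups in program order, and retire one
   complete group per cycle, also in program order.  Out-of-order
   execution inside a group is not modelled here. */
int main(void)
{
    const int n_instr = 12;                 /* hypothetical instruction count */
    int n_groups = (n_instr + GROUP_SLOTS - 1) / GROUP_SLOTS;

    for (int g = 0; g < n_groups; g++) {
        int first = g * GROUP_SLOTS;
        int last  = first + GROUP_SLOTS - 1;
        if (last >= n_instr)
            last = n_instr - 1;
        /* One group retires per cycle, so group g retires in cycle g
           at the earliest (assuming no stalls). */
        printf("group %d: instructions %d..%d, earliest retire cycle %d\n",
               g, first, last, g);
    }
    return 0;
}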

2.2 Wide execution (12)

Micro-op fusion [10]

• Combines micro-ops derived from the same macro-operation (x86 instruction) into a single micro-op (see the C sketch below).

• Micro-op fusion can reduce the total number of micro-ops to be processed by more than 10%, resulting in higher processor performance.

• Originally introduced in the Pentium M (first core (Banias), 2003).
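Micro-op fusion happens in the decoders and is not visible at the source level, but the x86 instruction patterns that commonly benefit from it fall out of ordinary C code. A minimal sketch (the fused cases named in the comments, load-and-operate and store instructions, are the commonly documented ones; the exact fusion rules are implementation specific):

#include <stddef.h>

/* Sum an array.  With optimization, the loop body typically compiles to a
 * load-and-add instruction such as  add eax, [rdi + rcx*4].  Such an
 * instruction decodes into a load micro-op and an add micro-op; with
 * micro-op fusion these travel through the front end as a single fused
 * micro-op and are only split at the execution ports. */
int sum(const int *a, size_t n)
{
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* A plain store such as  mov [rdi], esi  decodes into a store-address and
 * a store-data micro-op; micro-op fusion keeps these two together as one
 * fused micro-op in the front end as well. */
void store(int *p, int x)
{
    *p = x;
}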


2.2 Wide execution (13)

Remark

IBM's POWER4 and its subsequent processors provide a 5-wide front end.

2.2 Wide execution (14)

Macro-op fusion [10]

• New feature introduced in the Core.

• Combines common x86 instruction pairs (such as a compare followed by a conditional jump) into a single micro-op during decoding (see the C sketch below).

Two x86 instructions can thus be executed as a single micro-op, which increases performance.
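Compare-and-branch pairs of this kind arise naturally from ordinary C code. A minimal sketch (the CMP/TEST plus conditional-jump pairing noted in the comments is the commonly documented fusible case; the exact set of fusible pairs is implementation specific):

#include <stddef.h>

/* Count the elements of an array that are below a threshold.  Both the
 * element test (if) and the loop-exit test compile to an x86 compare
 * (cmp) immediately followed by a conditional jump (jl, jb, jge, ...).
 * With macro-op fusion the decoder turns such a cmp + jcc pair into a
 * single compare-and-branch micro-op, so two x86 instructions occupy
 * one slot in the 4-wide pipeline. */
size_t count_below(const int *a, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {    /* cmp i, n + conditional jump: fusible pair */
        if (a[i] < threshold)           /* cmp a[i], threshold + conditional jump: fusible pair */
            count++;
    }
    return count;
}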

Example


2.2 Wide execution (15)

Figure 2.7: Macro-op fusion example (1) [11]

2.2 Wide execution (16)

Figure 2.8: Macro-op fusion example (2) [11]


2.2 Wide execution (17)

Figure 2.9: Macro-op fusion example (3) [11]

2.2 Wide execution (18)

Table 2.2: Comparing Intel’s and AMD’s fusion techniques [4]


2.2 Wide execution (19)

Performance leadership changes between Intel and AMD

• In 2003 AMD introduced their K8-based processors, implementing

 • the 64-bit x86 ISA and

 • the direct connect architecture concept, which includes

  • integrated memory controllers and

  • high-speed point-to-point serial buses (the HyperTransport bus) used to connect processors to processors and processors to south bridges.

• AMD's K8-based processors became the performance leader, first of all on the DP and MP server market, where the 64-bit direct connect architecture has clear benefits vs. Intel's 32-bit Pentium 4-based processors, which use shared FSBs to connect processors to north bridges.

2.2 Wide execution (20)

Example 1: DP web-server performance comparison (2003)

Figure 2.10: DP web server performance comparison: AMD Opteron 248 vs. Intel Xeon 2.8 [6]


2.2 Wide execution (21)

Example 2: Summary assessment of extensive benchmark tests contrasting dual Opterons vs. dual Xeons (2003) [7]

"In the extensive benchmark tests under Linux Enterprise Server 8 (32 bit as well as 64 bit), the AMD Opteron made a good impression. Especially in the server disciplines, the benchmarks (MySQL, Whetstone, ARC 2D, NPB, etc.) show quite clearly that the Dual Opteron puts the Dual Xeon in its place."

2.2 Wide execution (22)

• This situation changed completely in 2006, when Intel introduced their Core 2 microarchitecture and regained performance leadership vs. AMD.

• The Core 2 has

 a 4-wide front end and retire unit, compared to the 3-wide K8 or Pentium 4, and

 three complex FP/SSE units, compared to the two units available in the K8 or to the Pentium 4's single complex unit and second simple unit performing only FP-move and FP-store operations.

• These and further enhancements of the Core microarchitecture, detailed subsequently, resulted in record-breaking performance figures.

2.2 Wide execution (23)

Example: DP web-server performance comparison (2006)

Figure 2.11: DP web server performance comparison: AMD Opteron 275/280 vs. Intel Xeon 5160 [8]

(Test platforms: MSI K2-102A2M boards with Opteron 275 and Opteron 280; JSP: Java Server Pages performance; AMP: Apache/MySQL/PHP)

Remark

Both web-server benchmark results were published by the same source (AnandTech).

2.3 Smart L2 cache (1)

2.3 Smart L2 cache

Shared L2 cache instead of private L2 caches associated with the individual cores.

Figure 2.12: Core's shared L2 cache (Core 2-based DCs) vs. the Pentium 4's private L2 caches (Pentium 4-based DCs)


2.3 Smart L2 cache (2)

Benefits of shared caches

• Dynamic cache allocation to the individual cores

• Efficient data sharing (no replicated data) and 2× bandwidth to the L1 caches (see the sketch below).
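As an illustration of the data-sharing benefit, a minimal pthread sketch follows (assuming one worker thread per core; the cache behaviour described in the comments is the intended effect of a shared L2, not something the code itself measures):

#include <pthread.h>
#include <stdio.h>

#define TABLE_SIZE 4096

/* A read-only table used by both threads.  With a shared L2 cache the
 * table occupies a single set of L2 lines that serves both cores; with
 * private L2 caches each core would end up holding its own replicated
 * copy, wasting capacity. */
static int table[TABLE_SIZE];

static void *worker(void *arg)
{
    (void)arg;
    long sum = 0;
    for (int i = 0; i < TABLE_SIZE; i++)
        sum += table[i];                /* both cores read the same lines */
    return (void *)sum;
}

int main(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = i;

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);

    void *r1, *r2;
    pthread_join(t1, &r1);
    pthread_join(t2, &r2);
    printf("sums: %ld %ld\n", (long)r1, (long)r2);
    return 0;
}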

2.3 Smart L2 cache (3)

Figure 2.13: Dynamic L2 cache allocation according to cache demand [11]


2.3 Smart L2 cache (4)

Figure 2.14: Data sharing in shared and private (independent) L2 cache implementations [11]

2.3 Smart L2 cache (5)

Drawbacks of shared caches

Shared caches combine the access patterns of the individual cores, which reduces the efficiency of hardware prefetching vs. private caches.

Choice between shared and private caches

Design decision, depending on whether the benefits or the drawbacks dominate as far as performance is concerned.

Trend

• Core 2 prefers a shared L2 cache, Nehalem prefers private L2 caches.

• POWER5 prefers a shared L2 cache, POWER6 prefers private L2 caches.


2.3 Smart L2 cache (6)

Table 2.3: Cache parameters of Intel’s and AMD’s processors [4]

2.4 Smart memory accesses (1)

2.4 Smart memory accesses

• Memory disambiguation

• Enhanced hardware prefetchers

Figure 2.15: Units involved in implementing memory disambiguation or hardware prefetching [12]

(L1 I-Cache not shown)
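As an illustration of why memory disambiguation matters, a minimal C sketch follows (the speculative reordering described in the comments is performed by the hardware; the C code itself is ordinary sequential code and the function name is only illustrative):

/* A load that follows a store to a possibly overlapping address.  Without
 * memory disambiguation the load of b[j] must wait until the address of
 * a[i] is known, because the two accesses might alias.  The Core's memory
 * disambiguation predictor lets the load execute speculatively ahead of
 * the store and re-executes it only if the addresses turn out to clash. */
int store_then_load(int *a, const int *b, int i, int j, int value)
{
    a[i] = value;       /* store: target address computed at run time */
    return b[j] + 1;    /* load: may or may not alias the store above */
}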
