
Improving data locality for mesh computations

Academic year: 2022

Antal Hiba

Pázmány Péter Catholic University

Budapest

TÁMOP-4.2.2/B-10/1/2012-0014

(2)

Memory Bandwidth Limits

(3)

Memory Bandwidth Limits

Nearly two orders of magnitude separate the theoretical compute power from the zero-cache case.

FLOPS* (zero-cache case) < Real FLOPS < Theoretical Maximum

GDDR5: 256-bit-wide bus. What happens if we read 32-bit data from random addresses in main memory?

Far less than 1% of main-memory data can be stored on-chip
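The question about random 32-bit reads can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes that each random read occupies a full bus transfer and that no neighboring data is reused, and the burst length of 8 is an assumed parameter, not a figure from the slides. The function name `bus_utilization` is hypothetical.

```python
def bus_utilization(useful_bits: int, transfer_bits: int) -> float:
    """Fraction of a memory transfer that carries data we actually asked for."""
    return useful_bits / transfer_bits

BUS_WIDTH_BITS = 256   # from the slide
WORD_BITS = 32         # from the slide
BURST_LENGTH = 8       # assumption for illustration only

# One 32-bit word per 256-bit bus transfer.
per_transfer = bus_utilization(WORD_BITS, BUS_WIDTH_BITS)
# One 32-bit word per full burst of 8 transfers.
per_burst = bus_utilization(WORD_BITS, BUS_WIDTH_BITS * BURST_LENGTH)

print(f"useful share of one bus transfer: {per_transfer:.1%}")
print(f"useful share of one assumed burst: {per_burst:.2%}")
```

Under these assumptions at most one eighth of the bus width carries useful data, and the share shrinks further once whole bursts are counted.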

(4)

Solutions

Better memory interfaces: same FLOPS, better utilization.

Decreased operating frequency of processing elements (PEs): lower FLOPS (frequency), but better utilization and FLOP/Watt (green computing).

Increased on-chip memory: better data reuse, lower FLOPS (number of PEs) with better utilization.

Preoptimized input and algorithms: same FLOPS, better utilization.

(5)

Interesting Approach – Cellular Nonlinear Networks (CNN)

PEs get input directly (light sensor) – input BW problem solved

No shared cache, only local memory at each PE

Only local interconnections between PEs

• Input is an image, so space and time also carry information: the address (space) of data X in memory (the input image) determines which data Y (neighbors of X) will be involved in an operation defined on X.

(6)

Preoptimized input and algorithms

Locality-Based Placement of Input Data in Main Memory

Algorithms with maximal data reuse, and minimal I/O

Architecture dependent algorithms

Architectures for algorithm classes (FPGA)

(7)

Mesh Computations

Scramjet Windtunnel

Simulations of physical systems: sound, heat, elasticity, electrodynamics or fluid flow dynamics

Irregular memory access + data reload = Low processor utilization

(8)

Node Reordering

Neighboring nodes are moved close to each other in main memory

Graph + Node Indexing Function => G-BW

G-BW is the maximal distance between nonzero elements and the diagonal of the adjacency matrix
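The definition above can be sketched in a few lines. A nonzero at row i, column j of the adjacency matrix lies |i - j| away from the diagonal, so G-BW is the largest index difference across any edge. The function name `graph_bandwidth` and the six-node path graph are hypothetical, chosen only to show how the node indexing function changes the result.

```python
def graph_bandwidth(edges, index):
    """G-BW: largest |index[u] - index[v]| over all edges (u, v)."""
    return max((abs(index[u] - index[v]) for u, v in edges), default=0)

# Hypothetical path graph a-b-c-d-e-f.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]

# Two node indexing functions for the same graph.
scattered = {"a": 0, "b": 5, "c": 1, "d": 4, "e": 2, "f": 3}
ordered = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5}

bw_scattered = graph_bandwidth(edges, scattered)
bw_ordered = graph_bandwidth(edges, ordered)
print(bw_scattered, bw_ordered)
```

With the scattered indexing the edge a-b spans five positions, while the ordered indexing keeps every pair of neighbors adjacent in memory.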

FPGA: advantages of user-defined on-chip memory

(9)

FPGA Solution

To handle the memory bandwidth limitations of mesh computing, we suggested an FPGA design which:

Reads node data from off-chip DRAMs in sequential bursts (stream)

Reads and writes back all node data only once.

The whole DRAM bandwidth is used to feed the processor units. The index of mesh nodes in off-chip memory determines a G-BW. The required size of on-chip memory is (2*G-BW+1)*NodeSize for the given architecture.
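The on-chip memory requirement of (2*G-BW+1)*NodeSize can be illustrated with a small software model of the streaming idea. This is a sketch under stated assumptions, not the FPGA design itself: nodes arrive in index order, a node is processed once the node G-BW positions ahead of it has arrived, and the names `on_chip_size` and `stream_once` are hypothetical.

```python
from collections import deque

def on_chip_size(g_bw, node_size):
    """Required on-chip memory, (2*G-BW+1)*NodeSize, as stated on the slide."""
    return (2 * g_bw + 1) * node_size

def stream_once(neighbors, g_bw):
    """Stream nodes in index order once; return the peak number of buffered nodes.

    neighbors[i] lists the indices adjacent to node i. Raises ValueError if a
    neighbor is not in the window when its node is processed.
    """
    n = len(neighbors)
    window = deque(maxlen=2 * g_bw + 1)  # oldest node is dropped automatically
    peak = 0
    next_node = 0
    for arriving in range(n):
        window.append(arriving)
        peak = max(peak, len(window))
        last = arriving == n - 1
        # Process every node whose forward neighbors have all arrived.
        while next_node < n and (next_node + g_bw <= arriving or last):
            if any(j < window[0] or j > window[-1] for j in neighbors[next_node]):
                raise ValueError("neighbor outside the on-chip window")
            next_node += 1
    return peak

# Hypothetical path graph with 8 nodes, so G-BW = 1.
path = [[j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)]
peak = stream_once(path, g_bw=1)
print(peak, on_chip_size(1, node_size=1))
```

In this model the buffer never holds more than 2*G-BW+1 nodes, and every node is read exactly once.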

(10)

Preoptimizing Input

One FPGA

Node Reordering

BW-Limited partitioning

Many FPGAs

BW-Limited partitioning

ADM-XRC-6T1

(11)

Bandwidth-Limited Partitioning

Create parts that each have a node ordering with G-BW under a given bound (the ordering must be given).

Edge-cut minimization is relaxed, but it is still bounded (the bound depends on the hardware architecture).

COMM: number of outgoing edges / number of internal edges

COMM < 0.1 in our test environment (Alpha-Data ADM-XRC-6T1 cards)

Size Balance
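The COMM ratio defined above is straightforward to compute for a candidate part. The following is a minimal sketch; the function name `comm_ratio` and the 24-node path graph are hypothetical and serve only to show the calculation.

```python
def comm_ratio(edges, part):
    """COMM = outgoing edges / internal edges for one part (a set of nodes)."""
    internal = sum(1 for u, v in edges if u in part and v in part)
    outgoing = sum(1 for u, v in edges if (u in part) != (v in part))
    if internal == 0:
        raise ValueError("part has no internal edges")
    return outgoing / internal

# Hypothetical path graph 0-1-...-23, split into two halves.
edges = [(i, i + 1) for i in range(23)]
first_half = set(range(12))

ratio = comm_ratio(edges, first_half)
print(f"COMM = {ratio:.3f}")
```

Here the part has 11 internal edges and one outgoing edge, so its COMM ratio stays under the 0.1 bound quoted for the test environment.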

(12)

Example:

G-BW = 11

minimal on-chip memory size = 23

Our FPGA design has less on-chip memory

Traditional partitioners (METIS)?

(13)

Example:

Which kind of separator leads to lower G-BW?

What is longitudinal in case of an unstructured mesh? (xyz coordinates?)

(14)

Finding Structure in Unstructured Meshes: Depth Level Structure

The covering node set is often known

A breadth-first search (BFS) from the covering set defines the Depth Level Structure
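A multi-source BFS is one way to build such a level structure. The sketch below assumes the covering set is the boundary of a small structured grid, which is an assumption made only for this example since the slides do not define the covering set; `depth_level_structure` and `grid_adjacency` are hypothetical helper names.

```python
from collections import deque

def depth_level_structure(adj, covering_set):
    """Return levels[k] = set of nodes at BFS distance k from the covering set."""
    depth = {s: 0 for s in covering_set}
    queue = deque(covering_set)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    if not depth:
        return []
    levels = [set() for _ in range(max(depth.values()) + 1)]
    for node, d in depth.items():
        levels[d].add(node)
    return levels

def grid_adjacency(width, height):
    """4-connected grid used as a stand-in for a mesh (illustration only)."""
    adj = {(x, y): [] for x in range(width) for y in range(height)}
    for (x, y) in adj:
        for dx, dy in ((1, 0), (0, 1)):
            n = (x + dx, y + dy)
            if n in adj:
                adj[(x, y)].append(n)
                adj[n].append((x, y))
    return adj

adj = grid_adjacency(5, 5)
boundary = [p for p in adj if p[0] in (0, 4) or p[1] in (0, 4)]
levels = depth_level_structure(adj, boundary)
print([len(level) for level in levels])
```

On the 5x5 grid the levels form concentric rings, with the single center node as the deepest level.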

(15)

DLS-Based Bisection

The deepest levels must be cut in a longitudinal direction to get parts with a good G-BW property

BFS waves are used to create separating node sets (surfaces)

(16)

Operators on node sets

• Level Structure (in: in-set, out: LS)

• Pseudo Diameter (in: in-set, out: (u,v))
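The Pseudo Diameter operator can be sketched with the usual repeated-BFS heuristic: start anywhere, jump to a farthest node, and repeat until the distance stops growing. This is a common textbook heuristic and not necessarily the variant used by the author; the `start` parameter and the function names are hypothetical.

```python
from collections import deque

def bfs_depths(adj, start, allowed):
    """BFS distances from start, restricted to the nodes in `allowed`."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth

def pseudo_diameter(adj, in_set, start=None):
    """Return a node pair (u, v) that is far apart within in_set."""
    in_set = set(in_set)
    u = next(iter(in_set)) if start is None else start
    depth = bfs_depths(adj, u, in_set)
    v = max(depth, key=depth.get)
    while True:
        depth_v = bfs_depths(adj, v, in_set)
        w = max(depth_v, key=depth_v.get)
        if depth_v[w] <= depth[v]:
            return u, v            # distance stopped growing
        u, v, depth = v, w, depth_v

# Hypothetical path graph 0-1-2-3-4, starting from the middle node.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 4] for i in range(5)}
u, v = pseudo_diameter(adj, adj.keys(), start=2)
print(u, v)
```

Each round strictly increases the distance between the pair, so the loop terminates; on the path graph it ends at the two endpoints.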

(17)

Operators on node sets

• Fill (in: start node, border set, out: out set)

• Dilation, erosion, etc. can also be used
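One possible reading of these operators, by analogy with image morphology, is sketched below. The interpretation is an assumption: Fill is taken to be a flood fill that does not cross the border set, dilation adds all neighbors of a set, and erosion removes nodes that touch the outside. The function names are hypothetical.

```python
from collections import deque

def fill(adj, start, border):
    """Nodes reachable from start without entering the border set."""
    if start in border:
        return set()
    out = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in border and v not in out:
                out.add(v)
                queue.append(v)
    return out

def dilate(adj, nodes):
    """Grow a node set by one layer of neighbors."""
    return set(nodes) | {v for u in nodes for v in adj[u]}

def erode(adj, nodes):
    """Keep only nodes whose neighbors all lie inside the set."""
    nodes = set(nodes)
    return {u for u in nodes if all(v in nodes for v in adj[u])}

# Hypothetical path graph 0-1-...-6 with node 3 acting as a separator.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
left = fill(adj, 0, border={3})
print(sorted(left), sorted(dilate(adj, left)), sorted(erode(adj, left)))
```

Filling from node 0 stops at the separator, dilation then absorbs the separator node, and erosion peels off the node next to it.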

(18)

Results of DLS-Based Bisection

(19)

Conclusions

• Novel mesh partitioning problem presented:

Create parts with better locality (G-BW)

• First proof-of-concept algorithm created:

35% G-BW reduction with bisection (50% is the maximum)

20% better results than METIS (METIS has a different goal!)

The constraint on the COMM ratio can be satisfied

• Novel partitioning approach: get separators with operators defined on node sets

• Further investigations are needed...
