Hardware acceleration of 3D TLM Method with FPGA
TÁMOP-4.2.2/B-10/1/2012-0014
PhD seminar, Budapest, 9 November 2012, László Füredi
(Supervisor: Péter Szolgay)
Overview
Motivations, FPGAs
Calculation of 3D TLM method
Discretization of equations
Implementation on FPGA
Performance comparison
Conclusions
Future plans
Motivations
High-performance computing, multi-processor environment, insufficient memory bandwidth
Frequency-dependent parallel transmission lines
High latency
Problem: the design of frequency-dependent transmission lines is not solved
Solving the transmission line equations requires computation-intensive calculation
Parallel transmission lines
Parallel transmission lines (many processors)
FPGA (Field Programmable Gate Arrays)
Configurable Logic Block (CLB)
Look-up table (LUT)
Register
Logic circuit
Adder
Multiplier
Memory
Microprocessor
Input/Output Block (IOB)
Programmable interconnect
DSP block
DSP48E slice
25 × 18 two's-complement multiplier
Optional adder, subtractor, and accumulator
Optional bitwise logic functions and pipelining
Dedicated cascade connections
Integrated adder for complex-multiply or multiply-add operations
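The multiply-add behaviour listed above can be sketched as a small behavioural model. The function name `dsp48e_mac` is hypothetical, and the 48-bit accumulator width is an assumption that the slide does not state; only the 25 × 18 operand widths come from the list.

```python
def dsp48e_mac(a: int, b: int, acc: int = 0) -> int:
    """Behavioural model of acc + a * b with signed 25-bit and 18-bit operands."""
    if not -(1 << 24) <= a < (1 << 24):
        raise ValueError("a does not fit in 25-bit two's complement")
    if not -(1 << 17) <= b < (1 << 17):
        raise ValueError("b does not fit in 18-bit two's complement")
    total = acc + a * b
    # Wrap the result to 48-bit two's complement (assumed accumulator width).
    total &= (1 << 48) - 1
    if total >= 1 << 47:
        total -= 1 << 48
    return total


result = dsp48e_mac(3, -4, acc=10)
```

Wider operands, such as double-precision mantissas, would need several such slices cascaded, which is one reason the floating-point operators later in the talk consume more than one DSP block each.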
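Electromagnetic field calculation (Maxwell's equations)
\[
\begin{aligned}
\varepsilon_0 \varepsilon_x \frac{\partial e_x}{\partial t} + \sigma_{ex}\, e_x &= \frac{\partial h_z}{\partial y} - \frac{\partial h_y}{\partial z} \\
\varepsilon_0 \varepsilon_y \frac{\partial e_y}{\partial t} + \sigma_{ey}\, e_y &= \frac{\partial h_x}{\partial z} - \frac{\partial h_z}{\partial x} \\
\varepsilon_0 \varepsilon_z \frac{\partial e_z}{\partial t} + \sigma_{ez}\, e_z &= \frac{\partial h_y}{\partial x} - \frac{\partial h_x}{\partial y} \\
\mu_0 \mu_x \frac{\partial h_x}{\partial t} + \sigma_{mx}\, h_x &= \frac{\partial e_y}{\partial z} - \frac{\partial e_z}{\partial y} \\
\mu_0 \mu_y \frac{\partial h_y}{\partial t} + \sigma_{my}\, h_y &= \frac{\partial e_z}{\partial x} - \frac{\partial e_x}{\partial z} \\
\mu_0 \mu_z \frac{\partial h_z}{\partial t} + \sigma_{mz}\, h_z &= \frac{\partial e_x}{\partial y} - \frac{\partial e_y}{\partial x}
\end{aligned}
\]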
TLM Method
3D space-volume divided into nodes
Each node is a 12-port transmission-line junction
Scattering at the nodes models coupling between E and H fields
Transient E and H fields are calculated from combinations of voltages and currents on the transmission lines
Spectrum found by FFT
\[
E_y = -\frac{1}{2}\,\frac{V_3^{i} + V_4^{i} + V_7^{i} + V_8^{i}}{\Delta y}
\]
Calculation of 3D TLM method
fifteen linear equations
eight coefficients
seven input voltages
one current source
SCN (symmetrical condensed node)
HSCN (hybrid symmetrical condensed node)
GSCN (general symmetrical condensed node)
HSCN is ideal for parallel computing
Numbering organized in sequence of pairs
\[
V^{i}_{k+1,\,n} = V^{r}_{k,\,n \pm 1}\,\Big|_{\text{Adjacent Node}}
\]
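A minimal sketch of this connection step along a single axis is given below. It assumes that the two ports of a pair face the two neighbouring nodes (index 0 towards node i-1, index 1 towards node i+1) and that nothing returns from outside the mesh; the function name `connect_1d` and both conventions are illustrative, not taken from the slides.

```python
import numpy as np


def connect_1d(v_ref: np.ndarray) -> np.ndarray:
    """Turn reflected pulses into the next incident pulses along one axis.

    v_ref[i, 0] is the pulse leaving node i towards node i-1,
    v_ref[i, 1] is the pulse leaving node i towards node i+1.
    """
    v_inc = np.zeros_like(v_ref)
    # A pulse sent to the right by node i-1 arrives on the left port of node i.
    v_inc[1:, 0] = v_ref[:-1, 1]
    # A pulse sent to the left by node i+1 arrives on the right port of node i.
    v_inc[:-1, 1] = v_ref[1:, 0]
    return v_inc


v_ref = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
v_inc = connect_1d(v_ref)
```

With the ports numbered in pairs, the partner of a port is always its neighbour in the numbering, so the exchange reduces to a fixed shift between adjacent memory locations.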
Calculation of 3D TLM method II.
The scattering matrix (GSCN)
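Yl                x     x     x     x     y     y     y     y     z     z     z     z
Yt                z     z     y     y     z     z     x     x     y     y     x     x
Ys                y     y     z     z     x     x     z     z     x     x     y     y     x     y     z     x     y     z
Rt  Gs  Port     1y    2y    3z    4z    5x    6x    7z    8z    9x   10x   11y   12y    13    14    15    16    17    18
y   x   9x        0     0   dzx  -dzx   bzx   bzx     0     0   azx   czx     0     0     g     0     0     k     0     0
y   x   10x       0     0  -dzx   dzx   bzx   bzx     0     0   czx   azx     0     0     g     0     0     k     0     0
    x   13x       0     0     0     0     b     b     0     0     b     b     0     0     h     0     0     k     0     0
z   x   5x      dyx  -dyx     0     0   ayx   cyx     0     0   byx   byx     0     0     g     0     0     k     0     0
z   x   6x     -dyx   dyx     0     0   cyx   ayx     0     0   byx   byx     0     0     g     0     0     k     0     0
z   y   1y      axy   cxy     0     0   dxy  -dxy     0     0     0     0   bxy   bxy     0     g     0     0     k     0
z   y   2y      cxy   axy     0     0  -dxy   dxy     0     0     0     0   bxy   bxy     0     g     0     0     k     0
    y   14y       b     b     0     0     0     0     0     0     0     0     b     b     0     h     0     0     k     0
x   y   11y     bzy   bzy     0     0     0     0   dzy  -dzy     0     0   azy   czy     0     g     0     0     k     0
x   y   12y     bzy   bzy     0     0     0     0  -dzy   dzy     0     0   czy   azy     0     g     0     0     k     0
y   z   3z        0     0   axz   cxz     0     0   bxz   bxz   dxz  -dxz     0     0     0     0     g     0     0     k
y   z   4z        0     0   cxz   axz     0     0   bxz   bxz  -dxz   dxz     0     0     0     0     g     0     0     k
    z   15z       0     0     b     b     0     0     b     b     0     0     0     0     0     0     h     0     0     k
x   z   7z        0     0   byz   byz     0     0   ayz   cyz     0     0   dyz  -dyz     0     0     g     0     0     k
x   z   8z        0     0   byz   byz     0     0   cyz   ayz     0     0  -dyz   dyz     0     0     g     0     0     k
Columns 13-15: capacitors. Columns 16-18: sources.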
Discretization of equations
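\[
\begin{aligned}
a_{pp} &= -\frac{Y_0 + G_s - 2\,(Y_l - Y_t)}{2\,\bigl[\,Y_0 + G_s + 2\,(Y_l + Y_t)\,\bigr]} \\
b_{pq} &= \frac{2\,Y_t}{Y_0 + G_s + 2\,(Y_l + Y_t)} \\
g_{pq} &= \frac{2\,Y_0}{Y_0 + G_s + 2\,(Y_l + Y_t)} \\
h_{pq} &= \frac{Y_0 - G_s - 2\,(Y_l + Y_t)}{Y_0 + G_s + 2\,(Y_l + Y_t)} = g_{ij} - 1 \\
k_{pq} &= \frac{1}{Y_0 + G_s + 2\,(Y_l + Y_t)} \\
c_{pp} &= -\frac{Y_0 + G_s - 2\,(Y_l - Y_t)}{2\,\bigl[\,Y_0 + G_s + 2\,(Y_l + Y_t)\,\bigr]} - \frac{R_t Y_t}{2\,(R_t Y_t + 4)} \\
d_{pq} &= \frac{2}{R_t Y_t + 4}
\end{aligned}
\]
\[
V_{16} = j_x Z_0\,\Delta z\,\Delta y, \qquad
V_{17} = j_y Z_0\,\Delta x\,\Delta z, \qquad
V_{18} = j_z Z_0\,\Delta x\,\Delta y
\]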
[Equations defining G_s, R_t, Y_pq, Z_pq and Y_Op]
R_t: magnetic losses
G_s: electric losses
C_pq: capacitance of the lines
L_pq: inductance of the lines
C_Op: capacitance of the open line stubs
Implementation on FPGA
Implementation of the c_pp equation
\[
c_{pp} = -\frac{Y_0 + G_s - 2\,(Y_l - Y_t)}{2\,\bigl[\,Y_0 + G_s + 2\,(Y_l + Y_t)\,\bigr]} - \frac{R_t Y_t}{2\,(R_t Y_t + 4)}
\]
Parts of the equations are reusable
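The reuse can be illustrated with a short sketch in which the two denominators are inverted once and shared by all coefficients. The formulas are one reading of the slide equations, taken here as assumptions: a shunt denominator Y_0 + G_s + 2(Y_l + Y_t), a series denominator R_t Y_t + 4, and h = g - 1. The function name `scattering_coefficients` and its argument names are hypothetical.

```python
def scattering_coefficients(y0, gs, yl, yt, rt):
    """Compute b, c, d, g, h, k with only two divisions (assumed formulas)."""
    # Shared shunt term: inverted once, reused by b, c, g, h and k.
    inv_shunt = 1.0 / (y0 + gs + 2.0 * (yl + yt))
    # Shared series term: inverted once, reused by c and d.
    inv_series = 1.0 / (rt * yt + 4.0)

    k = inv_shunt
    b = 2.0 * yt * inv_shunt
    g = 2.0 * y0 * inv_shunt
    h = g - 1.0  # reuses g instead of building a second fraction
    d = 2.0 * inv_series
    # Halving is a cheap exponent adjustment in floating-point hardware.
    c = (-0.5 * (y0 + gs - 2.0 * (yl - yt)) * inv_shunt
         - 0.5 * rt * yt * inv_series)
    return {"b": b, "c": c, "d": d, "g": g, "h": h, "k": k}


# Lossless cell without stubs and with unit link-line admittances.
free_space = scattering_coefficients(y0=0.0, gs=0.0, yl=1.0, yt=1.0, rt=0.0)
```

In the lossless, stub-free case this reading gives b = d = 1/2 and c = 0, the familiar free-space values of the symmetrical condensed node, which is a useful sanity check on the assumed signs.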
Calculations of the equations I.
Calculations of the equations II.
The scattering matrix (HSCN)
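Ymα      y   z   x   z   y   x   y   z   z   x   x   y   y   x   x
Ymβ      z   y   z   x   x   y   x   x   y   y   z   z   z   z   y
Ysγ      x   x   y   y   z   z   z   y   x   z   y   x   x   y   z
Port     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1        a   b   d   0   0   0   0   0   b   0  -d   c   g   0   0
2        b   a   0   0   0   d   0   0   c  -d   0   b   g   0   0
3        d   0   a   b   0   0   0   b   0   0   c  -d   0   g   0
4        0   0   b   a   d   0  -d   c   0   0   b   0   0   g   0
5        0   0   0   d   a   b   c  -d   0   b   0   0   0   0   g
6        0   d   0   0   b   a   b   0  -d   c   0   0   0   0   g
7        0   0   0  -d   c   b   a   d   0   b   0   0   0   0   g
8        0   0   b   c  -d   0   d   a   0   0   b   0   0   g   0
9        b   c   0   0   0  -d   0   0   a   d   0   b   g   0   0
10       0  -d   0   0   b   c   b   0   d   a   0   0   0   0   g
11      -d   0   c   b   0   0   0   b   0   0   a   d   0   g   0
12       c   b  -d   0   0   0   0   0   b   0   d   a   g   0   0
13       b   b   0   0   0   0   0   0   b   0   0   b   h   0   0
14       0   0   b   b   0   0   0   b   0   0   b   0   0   h   0
15       0   0   0   0   b   b   b   0   0   b   0   0   0   0   h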
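As a sanity check on the link-line part of this matrix (ports 1 to 12), the sketch below builds it numerically and verifies that it is symmetric and its own inverse in the lossless, stub-free case. The column positions of the zero entries are an assumption (the usual symmetrical-condensed-node layout, which reproduces the order of the non-zero entries in each row), and the values a = c = 0, b = d = 1/2 are the textbook free-space choice rather than figures from the slides. `PATTERN` and `scn_matrix` are illustrative names.

```python
import numpy as np

# Rows 1..12 of the link-line block; "0" marks an empty cell.
PATTERN = """
a  b  d  0  0  0  0  0  b  0 -d  c
b  a  0  0  0  d  0  0  c -d  0  b
d  0  a  b  0  0  0  b  0  0  c -d
0  0  b  a  d  0 -d  c  0  0  b  0
0  0  0  d  a  b  c -d  0  b  0  0
0  d  0  0  b  a  b  0 -d  c  0  0
0  0  0 -d  c  b  a  d  0  b  0  0
0  0  b  c -d  0  d  a  0  0  b  0
b  c  0  0  0 -d  0  0  a  d  0  b
0 -d  0  0  b  c  b  0  d  a  0  0
-d 0  c  b  0  0  0  b  0  0  a  d
c  b -d  0  0  0  0  0  b  0  d  a
"""


def scn_matrix(a=0.0, b=0.5, c=0.0, d=0.5):
    """Fill the symbolic pattern with numeric coefficient values."""
    values = {"a": a, "b": b, "c": c, "d": d, "-d": -d, "0": 0.0}
    rows = [line.split() for line in PATTERN.strip().splitlines()]
    return np.array([[values[token] for token in row] for row in rows])


S = scn_matrix()
is_symmetric = np.allclose(S, S.T)
conserves_energy = np.allclose(S @ S, np.eye(12))
```

Each row holds only six non-zero link entries, so a hardware scattering unit can hard-wire the pattern instead of performing a full dense matrix-vector product.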
Required resources
Resources per operator (D: double precision, S: single precision)
                  Multiplier   Adder  Divider  Subtractor  Sign change  Division by 2
LUT (D)                  279     813      279         813           15             24
FF (D)                   413     719      413         719            0             64
DSP (D)                   11       3       11           3            0              0
LUT (S)                  103     281      103         281           15             24
FF (S)                   111     302      111         302            0             64
DSP (S)                    3       2        3           2            0              0

Number of operators per stage
                  Multiplier   Adder  Divider  Subtractor  Sign change  Division by 2
Input                      3       0        0           0            0              0
S matrix                  12      75       21           9           21             48
Matrix multip.            48      81        0           0            0              0
Output                    21      24        0           9            3              6
Sum                       84     180       21          18           24             54

Calculated with Virtex-5 SX240T and Virtex-6 SX475T
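The Sum row and a device-level total can be reproduced from the figures above. The sketch below adds up the operator counts per stage and multiplies them by the single-precision DSP cost per operator; the variable names are illustrative, and treating the total as a plain count-times-cost sum is an assumption about how the totals were obtained.

```python
OPERATORS = ("Multiplier", "Adder", "Divider", "Subtractor",
             "Sign change", "Division by 2")

# Operator counts per pipeline stage, in the column order of OPERATORS.
STAGES = {
    "Input":          (3, 0, 0, 0, 0, 0),
    "S matrix":       (12, 75, 21, 9, 21, 48),
    "Matrix multip.": (48, 81, 0, 0, 0, 0),
    "Output":         (21, 24, 0, 9, 3, 6),
}

# DSP slices per operator in single precision, same column order.
DSP_SINGLE = (3, 2, 3, 2, 0, 0)

# Column-wise sum over all stages.
totals = tuple(sum(column) for column in zip(*STAGES.values()))

# Total DSP slices for one processing element in single precision.
dsp_total_single = sum(n * cost for n, cost in zip(totals, DSP_SINGLE))
```

The same pattern applies to LUTs and flip-flops once the per-operator costs for the target device family are substituted.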
Required resources II.
Required resources (double precision)
Sum (D)                   Multiplier   Adder  Divider  Subtractor  Sign change  Division by 2     SUM  Virtex-5  Virtex-6  Kintex-7  Virtex-7
LUT                            30912  195660     7728       19566          360           1296  255522    149760    297600    254200    712000
FF                             35532  170280     8883       17028            0           3456  235179    149760    595200    508400   1424000
DSP                              840     540      210          54            0              0    1644      1056      2016      1540      3360
BRAM                               0       0        0           0            0              0      49  11664000  38304000  28620000  67680000
Number of cached cells             -       -        -           -            -              -       -    238041    781714    584082   1381224

Required resources (single precision)
Sum (S)                   Multiplier   Adder  Divider  Subtractor  Sign change  Division by 2     SUM  Virtex-5  Virtex-6  Kintex-7  Virtex-7
LUT                             8316   44100     2079        4410          360           1296   60561    149760    297600    254200    712000
FF                              9576   60660     2394        6066            0           3456   82152    149760    595200    508400   1424000
DSP                              252     360       63          36            0              0     711      1056      2016      1540      3360
BRAM                               0       0        0           0            0              0      25  11664000  38304000  28620000  67680000
Number of cached cells             -       -        -           -            -              -       -    466560   1532160   1144800   2707200
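The figures above allow two quick checks: whether one processing element fits on a device, and how many cells the on-chip memory can cache. The sketch assumes that the SUM entry of the BRAM row is the storage needed per cell, in the same unit as the device capacities, and that the number of cached cells is their rounded ratio; the dictionary and function names are illustrative.

```python
DEVICES = {
    "Virtex-5": {"LUT": 149_760, "FF": 149_760, "DSP": 1_056, "BRAM": 11_664_000},
    "Virtex-6": {"LUT": 297_600, "FF": 595_200, "DSP": 2_016, "BRAM": 38_304_000},
    "Kintex-7": {"LUT": 254_200, "FF": 508_400, "DSP": 1_540, "BRAM": 28_620_000},
    "Virtex-7": {"LUT": 712_000, "FF": 1_424_000, "DSP": 3_360, "BRAM": 67_680_000},
}

# SUM column of the two tables; BRAM is taken as storage per cell (assumption).
REQUIRED = {
    "double": {"LUT": 255_522, "FF": 235_179, "DSP": 1_644, "BRAM": 49},
    "single": {"LUT": 60_561, "FF": 82_152, "DSP": 711, "BRAM": 25},
}


def fits(precision: str, device: str) -> bool:
    """True if the logic of one processing element fits on the device."""
    need, have = REQUIRED[precision], DEVICES[device]
    return all(need[r] <= have[r] for r in ("LUT", "FF", "DSP"))


def cached_cells(precision: str, device: str) -> int:
    """Cells that fit in on-chip memory, rounded to the nearest integer."""
    return round(DEVICES[device]["BRAM"] / REQUIRED[precision]["BRAM"])


report = {
    (precision, device): (fits(precision, device), cached_cells(precision, device))
    for precision in REQUIRED
    for device in DEVICES
}
```

Under these assumptions the single-precision design fits every listed device, while the double-precision design exceeds the LUT or DSP budget of the two smaller ones.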
Performance comparison
Implementations
                                Intel Q9300           Intel Xeon X5550      Nvidia GTX480     XC6VSX475T
Implementation type             Software (Intel IPP)  Software (Intel IPP)  Software (CUDA)   FPGA
Technology (nm)                 45                    45                    65                40
Clock frequency (MHz)           2500                  2666 (3060)           1400              300
Number of processing elements   4 cores               4 cores               480 CUDA cores    1 PE
Power consumption (W)           95                    95                    450               ~50
Million cell iterations/s       10.5                  18.5                  180               472
Speedup                         1                     1.76                  17.11             44.95
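The speedup row is the throughput of each platform divided by that of the Intel Q9300. The sketch below recomputes it from the table and also derives throughput per watt, a figure the slides do not list; the FPGA power is only given as approximately 50 W, so its efficiency value is an estimate. All names are illustrative.

```python
# (million cell iterations per second, power in watts) per platform.
RESULTS = {
    "Intel Q9300":      (10.5, 95.0),
    "Intel Xeon X5550": (18.5, 95.0),
    "Nvidia GTX480":    (180.0, 450.0),
    "XC6VSX475T":       (472.0, 50.0),  # power is approximate (~50 W)
}

baseline_rate = RESULTS["Intel Q9300"][0]

# Speedup relative to the Intel Q9300 software implementation.
speedup = {name: rate / baseline_rate for name, (rate, _) in RESULTS.items()}

# Million cell iterations per second per watt.
per_watt = {name: rate / power for name, (rate, power) in RESULTS.items()}

for name in RESULTS:
    print(f"{name:18s} speedup {speedup[name]:6.2f}   "
          f"{per_watt[name]:5.2f} Mcell/s per W")
```

Normalizing by power makes the comparison less dependent on the very different power budgets of the four platforms.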
Conclusion
The solution was optimized for the special requirements of FPGA architectures, targeting "one cycle" computing.
Optimized to minimize memory transfers