
In this chapter the performance characteristics of the various Falcon architectures are examined and compared to those of the CASTLE emulated digital architecture and of the software simulation. The performance of the Falcon architecture is computed from post place and route timing simulations using the Xilinx Timing Analyzer tool, while the performance of the CASTLE architecture is described in [12].

The computing performance of the software simulation was measured on current high-performance desktop microprocessors. In the test an AMD Athlon 64 3200+, an AMD Athlon XP 3200+ and an Intel Pentium IV 3.0C processor were used; their clock frequencies are 2.0GHz, 2.3GHz and 3.0GHz respectively. The AMD Athlon 64 3200+ processor has a 1024kB on-die Level 2 cache, while the other processors have 512kB.

The CNN simulation kernel was written in C++ using the functions of the Intel Signal Processing Library, which contains optimized routines for various signal and vector processing tasks [17]. During the performance test the computing time required to calculate 10 forward Euler iterations was measured; the measurement was repeated 50 times and the best runtime was selected. The priority of the simulator was set to time critical and the number of elapsed clock cycles was measured using the RDTSC instruction as described in [18]. The average performance of the microprocessor, in CNN cell iteration/s, is computed from the clock cycle length and the number of elapsed clock cycles. The results of the performance test using different image and template sizes are shown in Figure 3.23.
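The measurement loop can be summarized with the following sketch, which assumes an x86 C++ compiler providing the __rdtsc() intrinsic; the kernel function cnn_forward_euler_step() is only a placeholder for the actual simulator routine built on the Intel Signal Processing Library.

```cpp
#include <x86intrin.h>   // __rdtsc()
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the actual simulator kernel.
void cnn_forward_euler_step(double* state, const double* feed, double* deriv,
                            int rows, int cols);

// Returns the best-case performance in CNN cell iteration/s, assuming the
// nominal core clock frequency clock_hz is known.
double measure_performance(double* state, const double* feed, double* deriv,
                           int rows, int cols, double clock_hz) {
    const int iterations = 10;  // 10 forward Euler iterations per measurement
    const int repeats    = 50;  // repeat 50 times, keep the best runtime
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();

    for (int r = 0; r < repeats; ++r) {
        std::uint64_t start = __rdtsc();
        for (int i = 0; i < iterations; ++i)
            cnn_forward_euler_step(state, feed, deriv, rows, cols);
        best = std::min(best, __rdtsc() - start);
    }
    double seconds = best / clock_hz;  // elapsed clock cycles -> seconds
    return (static_cast<double>(rows) * cols * iterations) / seconds;
}
```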

The results show that the higher clock frequency and memory bandwidth of the Intel Pentium IV provide higher performance than the AMD Athlon 64 and Athlon XP 3200+ processors. In most cases the performance of the Pentium IV is 30% higher, which is due to its higher clock frequency. The performance of each microprocessor depends on the size of the input image: if the array is smaller than 150×150 or larger than 256×256 cells the performance is constant, and it slowly decreases between these two array sizes. To find the reason for this performance decrease the size of the required memory should be computed first. In our case the forward Euler method is used to compute the dynamics of the cell array, so three values must be stored for each cell: one for the pre-computed feed-forward and bias value, one for the state of the cell and one for the derivative. During the computation 64 bit double precision floating point numbers are used, thus the number of cells in the array should be multiplied by 24 to get the required memory size in bytes. In Figure 3.24 the performance of the microprocessors is plotted as a function of the memory required during the computation.
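As a quick check of these bounds: a 150×150 cell array requires 150 · 150 · 24 B ≈ 527 kB, which just exceeds a 512 kB Level 2 cache, while a 256×256 array requires 256 · 256 · 24 B = 1536 kB; these two values bracket the region in which the measured performance decreases.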

On the re-scaled plot the performance of the Pentium IV processor is constant if the memory size is smaller than 512kB, which is equal to the size of the processor's Level 2 cache.

Figure 3.23: Performance of the software simulation (million cell iteration/s) as a function of cell array size for different template sizes (3×3, 5×5, 7×7) on the Athlon 64, Athlon XP and Pentium IV processors

Figure 3.24: Performance of the software simulation (million cell iteration/s) as a function of the required memory (kByte) for different template sizes (3×3, 5×5, 7×7) on the Athlon 64, Athlon XP and Pentium IV processors

If the memory size is increased, performance decreases monotonically until the required memory exceeds 1536kB, where it becomes constant again. The performance of the Athlon XP changes similarly, except that its 128kB Level 1 cache increases the performance for very small arrays. Thus the performance of the software simulation largely depends on the size of the Level 2 cache of the processor, and the performance decreases once the data set outgrows it. In the comparisons with the Falcon architecture a 512×512 cell array is assumed, therefore the smallest performance value was taken into account.

The huge number of possible configurations of the Falcon architecture makes it impossible to run post place and route timing simulations on each configuration, because several place and route steps with different timing constraints are required to determine the largest operating frequency of the architecture. To make the performance evaluation of the Falcon architecture easier, the critical timing paths of the design must be identified. The memory, mixer and template memory units are built from memories, shift registers and 2:1 multiplexers. These elements have a fixed delay, thus changing the computational precision does not affect the performance of these units.

Only the speed of the arithmetic unit is affected by the computational precision. There are two possible critical paths in this unit: one is the delay of the multipliers and the other is the delay of the adder in the accumulator. Therefore only the speed of the multipliers and adders has to be determined in order to compute the performance of the Falcon architecture.

Each slice in the Virtex series FPGAs contains dedicated carry logic for a 2 bit adder. These adders can be connected via dedicated carry lines to form a ripple carry adder, thus the delay of the adder depends on the width of its inputs. The delay of the adder for different bit widths on the Virtex series FPGAs is shown in Figure 3.25.
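This width dependence can be captured by a simple linear model, sketched below; the delay constants are illustrative placeholders and not measured values, the measured curves being those plotted in Figure 3.25.

```cpp
// Rough linear model of the ripple carry adder delay on Virtex slices:
// each slice implements 2 bits of the adder, so an n bit adder chains
// ceil(n/2) slices through the dedicated carry lines.
// Both delay constants are assumed placeholders, not measured values.
double adder_delay_ns(int width_bits,
                      double lut_delay_ns   = 1.0,    // assumed LUT + routing delay
                      double carry_delay_ns = 0.05) { // assumed per-slice carry delay
    int slices = (width_bits + 1) / 2;                // 2 bit adder per slice
    return lut_delay_ns + (slices - 1) * carry_delay_ns;
}
```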

Other possible elements on the critical path are the multipliers in the arithmetic unit. In the Falcon architecture multipliers generated by the Xilinx Core Generator are used. These are tree based multipliers where pipeline stages can be inserted between the tree levels to reduce the clock cycle time. The performance of these multipliers on the Virtex-II Pro series FPGAs is shown in Figure 3.26. Because these multipliers use the dedicated carry logic of the Virtex series FPGAs, the clock cycle time is linearly proportional to the width of the inputs. (Further details about the performance of the multipliers on the older Virtex family FPGAs are shown in Figure B.1.)

The Xilinx Core Generator also makes it easy to utilize the dedicated 18 bit by 18 bit signed multipliers on the Virtex-II and Virtex-II Pro devices. Using these multipliers a smaller area is required for the arithmetic unit, while its performance is higher and its power consumption is lower than with the tree based multipliers. The performance of a multiplier using the dedicated resources of the Virtex-II Pro devices is shown in Figure 3.27.

Figure 3.25: Delay (ns) of an adder as a function of input width (bit) on the Xilinx Virtex, VirtexE, Virtex-II and Virtex-II Pro FPGAs

Figure 3.26: Delay of the multiplier with different input precision on the Virtex-II Pro FPGA (delay in ns as a function of input width in bit, series: 2-64 bit)

Figure 3.27: Delay of the dedicated multiplier with different input precision on the Virtex-II Pro FPGA (delay in ns as a function of input width in bit, series: 2-64 bit)

If both inputs are narrower than 18 bit, the delay of the multiplier is constant. In the higher precision cases, where several 18 bit by 18 bit multipliers are used, the delay is increased by the adders between the dedicated multipliers. On the Virtex-II Pro devices, if the precision is smaller than 10 bit, the delay of the dedicated multiplier is higher than that of the tree based multiplier. However, if the inputs are wider than 10 bit, the delay of the multiplier built from the dedicated multipliers is 10-25% smaller than the delay of the tree based multiplier.
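For example, ignoring sign handling, splitting 36 bit operands as a = a_h·2^18 + a_l and b = b_h·2^18 + b_l gives a·b = a_h·b_h·2^36 + (a_h·b_l + a_l·b_h)·2^18 + a_l·b_l, so four dedicated 18×18 multipliers are required and their partial products must be combined by additional adders; it is this extra adder stage that increases the delay.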

After identifying the possible critical paths in the arithmetic unit the minimal clock cycle time and the performance of the Falcon architecture can be computed.
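A minimal sketch of this computation is shown below; the delay values and the assumption that the pipelined arithmetic unit completes one cell iteration per clock cycle are illustrative only, the actual figures being those of the post place and route timing analysis.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative sketch: the clock period of the Falcon processor is limited by
// the slower of the two critical paths identified above. The delay values
// below are assumed placeholders, not post place and route results.
int main() {
    double adder_delay_ns      = 2.5;  // accumulator adder (see Figure 3.25)
    double multiplier_delay_ns = 4.0;  // pipelined multiplier stage (see Figures 3.26-3.27)

    double clock_period_ns = std::max(adder_delay_ns, multiplier_delay_ns);
    double f_clk_hz        = 1e9 / clock_period_ns;

    // Assuming the fully pipelined datapath produces one cell iteration per
    // clock cycle, the per-core performance equals the clock frequency.
    std::printf("%.1f million cell iteration/s\n", f_clk_hz / 1e6);
    return 0;
}
```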

The performance of the Falcon processor for 18 bit template precision is shown in Figure 3.28. (Further details about the performance of the Falcon processor are shown in Figure B.2.)

The performance of the Falcon architecture decreases as the template size is increased, in accordance with the increased number of computations, except in the case of the Distributed Arithmetic unit, whose performance is independent of the template size. The performance of the Falcon architecture is compared to the performance of the Pentium IV 3.0C processor and the speedup values are shown in Figure 3.29.

(Further details about the speedup of the Falcon processor are shown in Figure B.3.) Both types of arithmetic units are 4-10 times faster than a high performance desktop microprocessor, while the Distributed Arithmetic unit provides 15-60 times higher performance.

Figure 3.28: Performance of the different arithmetic unit implementations on the Virtex-II Pro FPGA using 18 bit template precision (DM: Dedicated multipliers, DA: Distributed Arithmetic); performance in million cell iteration/s as a function of state width (bit)

Figure 3.29: Speedup of the different Falcon processor implementations compared to a Pentium IV 3.0GHz processor using 18 bit template precision (DM: Dedicated multipliers, DA: Distributed Arithmetic); speedup as a function of state width (bit)

Figure 3.30: Number of realizable Falcon processor cores on the XC2VP125 FPGA using 18 bit template precision (DM: Dedicated multipliers, DA: Distributed Arithmetic); processor count as a function of state width (bit)

However, the Falcon processors can easily be connected into an array without any communication penalty, thus the performance increases linearly with the number of processors. Therefore the number of implementable Falcon processors was computed for the largest available FPGA, the XC2VP125, and the results are shown in Figure 3.30. (Further details about the number of implementable Falcon processors on the XC2VP125 FPGA are shown in Figure B.4.)

Due to the increased resource requirements, the number of implementable Falcon processor cores decreases quickly as the precision is increased. While several hundred processors can be implemented in the low precision cases, the processor count drops to 10 or even fewer when very high precision is used. The cumulated performance of the various Falcon processor array configurations compared to a high performance desktop microprocessor is shown in Figure 3.31. (Further details about the speedup of the Falcon processor array implemented on the XC2VP125 FPGA are shown in Figure B.5.)
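The way the array-level numbers follow from the per-core figures can be illustrated with the sketch below; the XC2VP125 slice count, the per-core cost and the per-core speedup are assumed example values, not the actual synthesis results.

```cpp
#include <cstdio>

int main() {
    const int    fpga_slices     = 55616; // assumed slice count of the XC2VP125
    const int    slices_per_core = 2500;  // assumed per-core cost at a given precision
    const double core_speedup    = 8.0;   // assumed per-core speedup over the Pentium IV

    // Cores fit side by side; with no communication penalty between neighbors
    // the aggregate speedup scales linearly with the core count.
    int cores = fpga_slices / slices_per_core;
    std::printf("%d cores, about %.0fx aggregate speedup\n", cores, cores * core_speedup);
    return 0;
}
```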

Comparison of the results shows that the performance of the DA arithmetic unit is higher than that of the arithmetic unit implemented using dedicated multipliers (DM arithmetic) if the state precision is smaller than 16 bit. In this case the emulation of a nearest neighbor CNN array can be up to 2000 times faster than the software simulation running on a 3.0GHz Pentium 4 processor.

Figure 3.31: Speedup of an array of Falcon processors implemented on the XC2VP125 FPGA compared to a Pentium IV 3.0GHz processor (DM: Dedicated multipliers, DA: Distributed Arithmetic); speedup as a function of state width (bit)

If the state precision is larger than 16 bit, the DM arithmetic unit provides higher performance than the DA arithmetic unit on the same FPGA. Here the speed of the emulation is traded off for the accurate computation of the CNN state equation, and in this case at most a 1000 times speedup can be achieved. A similar trend can be observed in the larger neighborhood cases: in the low precision cases the DA arithmetic unit provides higher performance, while in the high precision cases the DM arithmetic unit is faster.

3.6 Implementation of a real emulated digital CNN