3.1.3 The Arithmetic unit

The memory structure described above makes continuous processing of the cells at very high rates possible, so the performance of the Falcon architecture depends on the speed of the arithmetic unit. The arithmetic unit of the CASTLE architecture multiplies and sums one line of template and state values in one clock cycle.

Implementing a similar arithmetic unit on an FPGA is not efficient, because the combinatorial delays of the multipliers and adders add up, resulting in very low performance. The performance can be increased by using pipelined multipliers and adding pipeline registers between the adders. By adding registers between the adders, the computation of the first intermediate result is delayed by two clock cycles. Since this result is required in the computation of the second line of the template, the increased performance of the pipelined architecture cannot be utilized. One possible solution is to hide the latency of the adders by implementing three virtual arithmetic units. In this case three different cells are computed in three subsequent clock cycles, as described in [11]. In our case, however, the main memory and the mixer unit would have to be redesigned if this solution were used. A simple modification of the arithmetic unit, shown in Figure 3.7, makes pipelining possible without affecting other parts of the architecture. In the Falcon architecture the summing of the partial products is separated from the accumulation of the intermediate results by using an adder tree to sum the three products and a separate loadable accumulator for the intermediate results. This structure makes it possible to use arbitrarily long pipelines in the multipliers and in the addition of the products.

[Figure 3.7 block diagram. Blocks: three multipliers (Mult), Reg1, Reg2, Reg3, Reg4, ACC, two shift registers. Inputs: S₁, T₁, S₂, T₂, S₃, T₃, g_ij, x_ij.]

Figure 3.7: The structure of the modified arithmetic unit
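The separation of the adder tree from the loadable accumulator can be illustrated with a small cycle-level model. This is only a sketch under stated assumptions, not the actual hardware description: the function name `simulate_arithmetic_unit`, the default multiplier latency of 4 cycles, and the integer test data are hypothetical, and the constant is carried alongside each line sum in a single delay queue instead of a separate shift register.

```python
from collections import deque


def simulate_arithmetic_unit(cells, template, mult_latency=4):
    """Cycle-level sketch of the modified arithmetic unit.

    cells:    list of (rows, g); rows holds three lines of three state
              values, g is the constant of the cell.
    template: three lines of three template values.
    mult_latency is an assumed value, not taken from the text.
    """
    # Multiplier pipeline plus two adder-tree stages in front of Reg3.
    depth = mult_latency + 2
    pipe = deque([None] * depth)
    reg3, acc, results = None, 0, []

    # One template line enters the unit per clock cycle.
    stream = [(rows[k], template[k], k, g)
              for rows, g in cells for k in range(3)]
    stream += [None] * (depth + 1)  # idle cycles that flush the pipeline

    for item in stream:
        if item is None:
            pipe.append(None)
        else:
            states, coeffs, line, g = item
            line_sum = sum(s * t for s, t in zip(states, coeffs))
            # g travels with the line sum; this stands in for the
            # shift register that delays the constant.
            pipe.append((line_sum, line, g))
        incoming = pipe.popleft()  # value that will be clocked into Reg3

        next_acc = acc
        if reg3 is not None:
            value, line, _ = reg3
            if line < 2:
                next_acc = acc + value       # accumulate lines 1 and 2
            else:
                results.append(acc + value)  # final result goes to Reg4
        if incoming is not None and incoming[1] == 0:
            next_acc = incoming[2]           # load ACC with the constant
        acc, reg3 = next_acc, incoming
    return results
```

Because ACC is reloaded in the same cycle in which Reg4 captures the final sum, cells can be fed back to back, and changing `mult_latency` changes only the delay before the first result, not the values.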

The configurable multipliers in the CASTLE architecture are 12 bits wide. In the 6 bit and 1 bit precision modes, faster operation is achieved by simply disabling those parts of the multipliers that are not required in the computation. This means that in 6 bit mode half of the arithmetic unit is disabled, while the 1 bit mode uses a separate "arithmetic" unit. The Falcon architecture is implemented on programmable devices, so the arithmetic unit can be designed more efficiently by using only the required amount of resources. Area requirements of the arithmetic unit with different state and template accuracy are shown in Figure 3.8.

We used the pipelined multiplier IP core from the Xilinx CoreGenerator in the arithmetic unit. These multipliers are optimized for Virtex FPGAs and are also pre-placed to make placement and routing easier. The multiplier employs a tree structure to sum the partial products, and the pipeline registers are placed between the tree levels. This means that the latency of the multiplier depends on the width of its narrower input.
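The dependence of the latency on the narrower input can be made concrete with a rough estimate. The sketch below assumes one partial product row per bit of the narrower operand, a binary adder tree, and one pipeline register per tree level; these are illustrative assumptions, the function name `tree_levels` is hypothetical, and the real core may add input or output registers, so the latencies actually used are the ones reported for the core itself.

```python
def tree_levels(width_a, width_b):
    """Estimate the number of adder-tree levels of a tree multiplier.

    Assumes one partial product per bit of the narrower operand and a
    binary tree that halves the number of rows at every level.
    """
    rows = min(width_a, width_b)
    # (rows - 1).bit_length() equals ceil(log2(rows)) for rows >= 1.
    return (rows - 1).bit_length()


# Widening only the wider operand leaves the estimate unchanged.
for state_width in (12, 24, 64):
    print(state_width, tree_levels(state_width, 12))
```

Under these assumptions the estimate grows only logarithmically with the narrower width and does not change when the wider operand grows.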

The size of the arithmetic unit is mainly determined by the area requirements of the three multipliers. If the template precision is held constant and the state precision is increased, the area required to implement the arithmetic unit increases linearly.

The Xilinx CoreGenerator also makes it possible to utilize the on-chip dedicated multiplier resources of the Virtex-II FPGAs. If the precision is larger than 18 bits, several dedicated multipliers and additional adders are required to compute and sum the partial products of the multiplication. In this case the Xilinx CoreGenerator offers area and speed optimized versions of the multiplier. Area requirements of the arithmetic unit with speed and area optimized dedicated multipliers are shown in Figure 3.8. Using the dedicated multipliers in the Virtex-II and Virtex-II Pro FPGAs, the area requirements of the arithmetic unit can be significantly decreased. In most cases the area of this arithmetic unit is 40-50% smaller. When the template and state precision are both 18 bits, even a 70% area reduction can be achieved.

[Figure 3.8 plots. Three panels: "3x3, Without dedicated multipliers", "3x3, Speed optimized multipliers", "3x3, Area optimized multipliers". Horizontal axis: State width (bit), 0 to 64. Vertical axis: Area (Slice). Legend entries: 2, 4, 6, 8, 10, 12, 14, 16, 18, 24, 32, 48, 64.]

Figure 3.8: Area requirements of the arithmetic unit in the case of different multiplier implementations
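The way a wide multiplication is built from several narrower multipliers and additional adders can be sketched as follows. This is an illustration only, not the mapping chosen by the CoreGenerator: operands are treated as unsigned, the sub-word width of 17 bits is an assumption (an unsigned sub-word must leave room for the sign bit of an 18 bit signed block), and the function name `wide_multiply` is hypothetical.

```python
def wide_multiply(a, b, width_a, width_b, block_bits=17):
    """Multiply two unsigned integers using only narrow block products.

    Returns (product, number_of_block_multiplications).
    block_bits=17 is an assumed sub-word width for illustration.
    """
    mask = (1 << block_bits) - 1
    limbs_a = [(a >> (block_bits * i)) & mask
               for i in range(-(-width_a // block_bits))]
    limbs_b = [(b >> (block_bits * j)) & mask
               for j in range(-(-width_b // block_bits))]

    total, blocks = 0, 0
    for i, la in enumerate(limbs_a):
        for j, lb in enumerate(limbs_b):
            # Each narrow product is shifted to its weight and summed;
            # in hardware this summation is done by the extra adders.
            total += (la * lb) << (block_bits * (i + j))
            blocks += 1
    return total, blocks


product, blocks = wide_multiply(0xABCDEF, 0x123456, 24, 24)
print(product == 0xABCDEF * 0x123456, blocks)
```

In this model a 24 bit by 24 bit product already needs four block multiplications, which is consistent with the observation that the saving from dedicated multipliers is largest when both operands fit into a single block.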

The arithmetic unit requires three clock cycles to compute the new cell value. In the first clock cycle the first line of the template and the corresponding state values are presented on the inputs, along with the corresponding constant and the state value of the currently processed cell. In the following two clock cycles the remaining two lines of the template and the state values are loaded into the multipliers. The processing of the next cell can be started in the next clock cycle. After the latency of the multipliers, two additional clock cycles are required to sum the products. When the first result is loaded into the Reg3 register, the constant value of the current cell is loaded into the ACC register. In the following two clock cycles the partial results of the first and second template lines are accumulated in the ACC register. The final result is stored in the Reg4 register, because in this clock cycle ACC is loaded with the constant value of the next cell. In the next stage the computed derivative of the state value is added to the old state, and the new state value is limited by using a sigmoid function. The length of the two shift registers, which hold the constant and old state values, varies and depends on the latency of the multipliers. The latency of the multipliers computing with different bit widths is summarized in Table 3.1.
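The final stage described above, adding the derivative to the old state and limiting the result, can be sketched in fixed-point arithmetic. The sketch assumes that the "sigmoid function" is the piecewise-linear saturation to the range [-1, 1] commonly used in CNN models, and that any time-step scaling is already contained in the derivative; the function name `update_state` and the choice of 11 fractional bits are hypothetical.

```python
def update_state(old_state, derivative, frac_bits=11):
    """Add the derivative to the old state and saturate to [-1, 1].

    Values are fixed-point integers in which 1.0 is represented as
    1 << frac_bits. frac_bits=11 is an assumed format for illustration.
    """
    one = 1 << frac_bits
    new_state = old_state + derivative
    # Piecewise-linear limiter: clamp to the range [-1.0, +1.0].
    return max(-one, min(one, new_state))


one = 1 << 11
print(update_state(one // 2, 3 * one // 4) == one)  # 0.5 + 0.75 saturates
```

A clamp of this kind needs only two comparisons, which is one reason why a piecewise-linear limiter is a natural fit for a hardware implementation.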

In document Emulált digitális CNN-UM architektúra megvalósítása újrakonfigurálható áramkörökön és alkalmazásai (pages 47-50)