Area optimization by using distributed arithmetic

The previously described configurable Falcon architectures make it possible to use an optimal amount of resources in our computations. However, the architecture is mainly optimized for space variant templates. For example, in the 3×3 case, if the number of templates is set to 1, the template memory still requires the same area as in the 4 template case, because the template memory uses the on chip distributed 16×1 bit RAM resources. In this case only 3 memory locations, one for each of the 3 lines of the template, are used out of the available 16. There are many applications where space variant templates are not required. Can the Falcon architecture be modified to be more area efficient for these applications?

Computation of the template operation $\sum A \cdot y(t)$ or $\sum B \cdot u(t)$ is equivalent to a two dimensional convolution operation or a two dimensional Finite Impulse Response (FIR) filter. FIR filters are the most common blocks used in DSP applications and several FPGA optimized implementations are available. One of the most popular FIR filter implementations on FPGAs is the distributed arithmetic FIR filter. In this section a new arithmetic unit is introduced which uses the distributed arithmetic technique to reduce the area requirements of the Falcon architecture.

Distributed arithmetic is a bit level rearrangement of a multiply accumulate to hide the multiplication [15]. It is a powerful technique for reducing the size of a parallel hardware multiply accumulate that is well suited to FPGA designs. Distributed arithmetic is widely used in FIR filter implementations on FPGAs [16].

The conventional FIR filter computes the following convolution sum:

$$y(k) = \sum_{n=0}^{N-1} a(n)\,x(k-n) \qquad (3.10)$$

where y(k) is the response of the filter at time k, a(n) are the filter coefficients, x(k−n) is the input sample of the filter and N is the number of filter coefficients.
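As a reference point for the later distributed arithmetic variants, the convolution sum (3.10) can be sketched directly. This is a minimal illustration, not the thesis implementation: the function name `fir_direct`, the coefficient and sample values, and the choice to treat samples before the start of the input as zero are all assumptions.

```python
def fir_direct(a, x, k):
    """Evaluate y(k) = sum_{n=0}^{N-1} a(n) * x(k - n) for one output index k.

    Samples with a negative index are treated as zero. This initial condition
    is an assumption of the sketch, not something stated in the text.
    """
    return sum(a[n] * (x[k - n] if k - n >= 0 else 0) for n in range(len(a)))


a = [1, 2, 3]       # hypothetical 3 tap coefficient set a(0), a(1), a(2)
x = [4, 5, 6, 7]    # hypothetical input samples x(0) .. x(3)
y = [fir_direct(a, x, k) for k in range(len(x))]
print(y)
```

Each output needs N multiplications, which is the cost that distributed arithmetic hides.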

The input sample can be written in the following fractional format:

$$x = -x_{B-1} \cdot 2^{B-p-1} + \sum_{b=0}^{B-2} x_b \cdot 2^{b-p} \qquad (3.11)$$

where $x_b$ is a binary variable, $B$ is the width of $x$ and $p$ is the position of the radix point. If (3.11) is substituted into (3.10) the following equation can be derived:

$$y(k) = -2^{B-p-1} \sum_{n=0}^{N-1} a(n)\,x_{B-1}(k-n) + \sum_{b=0}^{B-2} 2^{b-p} \sum_{n=0}^{N-1} a(n)\,x_b(k-n)$$

where $x_b(k-n)$ denotes bit $b$ of the input sample $x(k-n)$.

Figure 3.19: Structure of the serial distributed arithmetic FIR filter (block labels: parallel to serial converter (PSC) on the input x(n), N−1 shift registers of B bits each)

Equation (3.10) can be computed serially by the architecture depicted in Figure 3.19, which contains only look up tables, shift registers and one scaling adder. The coefficients of the FIR filter are stored in the 2^N word LUT in a pre-computed form as shown in Table 3.2.

Table 3.2: Contents of the LUT in the case of a 3 tap FIR filter (columns: Location, Value)
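The pre-computed LUT and the bit-serial evaluation can be illustrated with a short sketch. It is written under several assumptions: the helper names `build_da_lut` and `da_fir_serial`, the coefficient values, and the address bit ordering (address bit n selects a(n)) are hypothetical, and the samples are handled as B bit two's complement integers, so the radix point p would only scale the final result by 2^-p.

```python
def build_da_lut(coeffs):
    """Return the 2**N word table for N coefficients.

    Entry `addr` holds the sum of the coefficients a(n) whose address bit n
    is 1. The bit ordering is an assumption of this sketch.
    """
    n_taps = len(coeffs)
    return [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
            for addr in range(1 << n_taps)]


def da_fir_serial(lut, samples, width):
    """Bit-serial distributed arithmetic evaluation of one output value.

    samples[n] is x(k-n) as a signed integer that fits in `width` bits.
    Each loop iteration models one clock cycle: bit b of every sample forms
    the LUT address, and the LUT output is added with weight 2**b. The sign
    bit (b == width - 1) is subtracted, as in the two's complement formula.
    """
    acc = 0
    for b in range(width):
        addr = 0
        for n, s in enumerate(samples):
            addr |= ((s >> b) & 1) << n
        term = lut[addr] * (1 << b)
        acc += -term if b == width - 1 else term
    return acc


coeffs = [3, -5, 7]            # hypothetical a(0), a(1), a(2)
lut = build_da_lut(coeffs)
for addr, value in enumerate(lut):
    print(f"{addr:03b}  {value}")

samples = [-100, 37, 2047]     # x(k), x(k-1), x(k-2) as 12 bit signed integers
y = da_fir_serial(lut, samples, width=12)
```

No multiplier appears in the loop: the only operations are a table look up, a shift and an add or subtract, which is why the structure maps well to LUT based FPGAs.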

If the input samples are represented with B bits of precision, B clock cycles are required to complete the calculation. The main advantage of this solution is that the number of clock cycles required to compute a new filtered value is independent of the number of taps. However, as the number of taps increases, a very large LUT is required to store the partially computed coefficients, because the LUT has 2^N words. The size of the LUT can be reduced if not all possible combinations of the coefficients are computed in advance. In this case the coefficients are grouped according to the optimal LUT size for the given architecture.

According to (3.12) all possible combinations of the coefficients in each group are computed in advance and stored in a separate LUT. The outputs of the LUTs are connected to an adder tree, which performs the remaining additions.
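The grouping described above can be sketched as follows. The group size of 4 is an assumption chosen to match a 16 word LUT, and the helper names, coefficients and samples are hypothetical. The inner sum over the group LUT outputs stands in for the adder tree.

```python
def build_group_luts(coeffs, group_size):
    """Split the coefficients into groups and build one small LUT per group."""
    groups = [coeffs[i:i + group_size] for i in range(0, len(coeffs), group_size)]
    return [[sum(c for n, c in enumerate(g) if (addr >> n) & 1)
             for addr in range(1 << len(g))]
            for g in groups]


def da_fir_grouped(luts, samples, width, group_size):
    """Bit-serial evaluation with one LUT per coefficient group.

    For every bit position the outputs of all group LUTs are summed (the
    adder tree), then the sum is accumulated with weight 2**b. The sign bit
    is subtracted. Samples are signed integers that fit in `width` bits.
    """
    acc = 0
    for b in range(width):
        partial = 0
        for gi, lut in enumerate(luts):
            chunk = samples[gi * group_size:(gi + 1) * group_size]
            addr = 0
            for n, s in enumerate(chunk):
                addr |= ((s >> b) & 1) << n
            partial += lut[addr]
        term = partial * (1 << b)
        acc += -term if b == width - 1 else term
    return acc


coeffs = [1, -2, 3, -4, 5, -6, 7, -8, 9]              # hypothetical 9 taps (3x3)
samples = [10, -20, 30, -40, 50, -60, 70, -80, 90]    # hypothetical 12 bit values
luts = build_group_luts(coeffs, group_size=4)
total_words = sum(len(lut) for lut in luts)
y = da_fir_grouped(luts, samples, width=12, group_size=4)
print(total_words, "LUT words instead of", 1 << len(coeffs))
```

With these assumptions the nine coefficients need 16 + 16 + 2 = 34 LUT words rather than 2^9 = 512, at the cost of the extra additions between the group outputs.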

The performance of the FIR filter can be increased by using two or more partial product LUTs and a scaling adder tree to sum the partial products. To achieve maximum performance, a fully parallel distributed arithmetic FIR filter can be built, which can compute the new result in a single clock cycle. A trade-off can be made between speed and area by using a fully serial, fully parallel or mixed type distributed arithmetic FIR filter.
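The speed and area trade-off can be made concrete with a small model in which several bit positions are processed in the same clock cycle. This is only a behavioral sketch: the parameter `bits_per_cycle` and the function names are hypothetical, and each bit handled within one cycle is assumed to need its own copy of the LUT in hardware.

```python
def build_da_lut(coeffs):
    """2**N word table: entry `addr` sums the a(n) whose address bit n is 1."""
    return [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
            for addr in range(1 << len(coeffs))]


def da_fir_mixed(lut, samples, width, bits_per_cycle):
    """Evaluate one output, processing `bits_per_cycle` bit positions per cycle.

    bits_per_cycle == 1 models the fully serial filter, bits_per_cycle == width
    the fully parallel one. Returns the result and the number of clock cycles.
    """
    if width % bits_per_cycle != 0:
        raise ValueError("bits_per_cycle must divide the sample width")
    acc = 0
    cycles = 0
    for base in range(0, width, bits_per_cycle):
        # All bit positions in this inner loop are handled in one clock cycle,
        # each by its own LUT copy feeding the scaling adder tree.
        for b in range(base, base + bits_per_cycle):
            addr = 0
            for n, s in enumerate(samples):
                addr |= ((s >> b) & 1) << n
            term = lut[addr] * (1 << b)
            acc += -term if b == width - 1 else term
        cycles += 1
    return acc, cycles


coeffs = [3, -5, 7]            # hypothetical coefficients
samples = [-100, 37, 2047]     # hypothetical 12 bit signed samples
lut = build_da_lut(coeffs)
results = {d: da_fir_mixed(lut, samples, 12, d) for d in (1, 2, 3, 4, 6, 12)}
```

Every setting produces the same output value; only the number of cycles and the number of LUT copies change.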

If (3.4a) is compared to (3.10) we can see that the summation part of (3.4a) is a two dimensional form of (3.10). To use distributed arithmetic in the Falcon processor the FIR filter architecture shown in Figure 3.19 should be modified according to (3.4a). After substituting (3.11) into (3.4a) the following equation can be derived:

y(k, l) =

The two dimensional distributed arithmetic FIR filter is shown in Figure 3.20. The inputs of the filter are connected to the memory unit described in the previous section, while its output is summed with the constant and the old state value to get the new value of the cell. The advantage of this arithmetic unit is that no separate mixer is required to store the window of state values around the currently processed cell.

However, it does not allow the inter-processor communication to be carried out invisibly, because only the parallel to serial shift register can be loaded directly with the state values; thus 2n dummy cycles are required to load the neighboring cell values.

Figure 3.20: Structure of the two dimensional serial distributed arithmetic FIR filter (block labels: PSC on the inputs S₁, S₂ and S₃, shift registers, LUTs, adder, and an add/subtract unit with register and 2^-1 scaling)

The main advantage of this approach is its easy scalability compared to the previously described arithmetic units. In the case of the conventional arithmetic unit the scalability is limited to three cases, in which (2n+1)×(2n+1), 2n+1 or just one multiplier is used. In contrast, the cycle length of the distributed arithmetic unit is determined by the width of the state value: for example, in the 12 bit case the cycle length can be 1, 2, 3, 4, 6 or 12 clock cycles/cell, according to the parallelism used in the arithmetic unit. Area requirements of the distributed arithmetic unit for a 5×5 sized template and 1 clock cycle/cell computing cycle length are shown in Figure 3.21. (Additional data about the area requirements of the distributed arithmetic unit for various template sizes and computing cycle lengths are shown in Figure A.2.)
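The cycle lengths listed above for the 12 bit case are the divisors of the state width, which a short sketch can enumerate. The function name `da_cycle_lengths` is hypothetical, and the sketch assumes that the same number of bits is processed in every clock cycle.

```python
def da_cycle_lengths(state_width):
    """Possible computing cycle lengths (clock cycles/cell) of the DA unit.

    Assumes the same number of bits is processed in every clock cycle, so the
    cycle length must divide the state width exactly.
    """
    return [c for c in range(1, state_width + 1) if state_width % c == 0]


lengths = da_cycle_lengths(12)
print(lengths)
```

For a 12 bit state this gives six design points, compared with the three available to the conventional arithmetic unit.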

The area requirements of the distributed arithmetic units are usually smaller than those of the conventional approach if the computing cycle lengths are equal. However, in most cases the computing cycle lengths of the two architecture types are different; in these cases the area-time (AT) product should be computed to compare the efficiency of the architectures. The clock cycle length of both arithmetic units is determined by the physical constraints of the given FPGA, so there is no significant difference between the operating frequencies. Thus the computing cycle length in clock cycles is used in the computation of the AT product. The ratio of the AT products of the distributed arithmetic (DA) unit and the conventional arithmetic unit in the case of a 5×5 sized template and 1 clock cycle/cell computing cycle length is shown in Figure 3.22.

(Additional data about the ratio of AT products of the Distributed Arithmetic unit and the conventional arithmetic unit in case of different template sizes and computing cycle lengths are shown in Figure A.3.)

The results show that in most cases the DA arithmetic unit has 30-40% smaller AT product than the conventional approach. The DA arithmetic unit is very efficient if the template size is large and fully parallel distributed arithmetic is used. In these cases more than 50% smaller AT product can be achieved.
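The AT comparison can be written down as a small calculation. The area and cycle figures below are hypothetical placeholders chosen only to show the arithmetic; they are not values read from Figure 3.21, Figure 3.22 or the appendix, and the function names are assumptions.

```python
def at_product(area, cycles_per_cell):
    """Area-time product, with time measured in clock cycles per cell."""
    return area * cycles_per_cell


def at_ratio(da_area, da_cycles, conv_area, conv_cycles):
    """AT product of the DA unit divided by that of the conventional unit.

    A value below 1 means that the DA unit is the more efficient one.
    """
    return at_product(da_area, da_cycles) / at_product(conv_area, conv_cycles)


# Hypothetical design points with different computing cycle lengths:
# a DA unit of 600 area units at 3 clock cycles/cell and a conventional
# unit of 500 area units at 5 clock cycles/cell.
ratio = at_ratio(da_area=600, da_cycles=3, conv_area=500, conv_cycles=5)
saving_percent = (1.0 - ratio) * 100.0
print(f"AT ratio {ratio:.2f}, saving {saving_percent:.0f}%")
```

In this made-up case the DA unit occupies more area but still has the smaller AT product, because it needs fewer clock cycles per cell.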

Figure 3.21: Area requirements of the DA arithmetic unit in case of 5×5 sized templates and 1 clock cycle/cell computing cycle length

Figure 3.22: Ratio of the AT product of the DA arithmetic unit and the conventional arithmetic unit in case of 5×5 sized templates and 1 clock cycle/cell computing cycle length
