Arbitrary sized templates on the Falcon architecture

In the previous section a configurable emulated digital CNN-UM architecture was introduced. That processor uses 3×3 templates; however, several applications require larger ones. For example, with 5×5 templates the halftoning and inverse halftoning algorithms give better image quality than their nearest-neighborhood versions. The robustness and accuracy of some texture segmentation algorithms can also be increased by using large templates. In addition, implementing some finite difference operators on a CNN requires large templates; the biharmonic operator, for example, requires a 5×5 template. A large feed-forward template can usually be decomposed into a series of nearest-neighborhood templates, but this increases the runtime of the algorithm, and some templates cannot be decomposed.

Currently no analog VLSI CNN-UM supports large-neighborhood templates. A special version of the CASTLE emulated digital CNN-UM processor array that can compute with 5×5 templates was described in [12], but this chip has not been implemented yet. The higher computing requirements of large-neighborhood templates make software simulation very slow. In this section a new emulated digital CNN-UM processor based on the Falcon architecture is described, which can be configured to use templates of arbitrary size.

3.2.1 The Memory unit

If the neighborhood of the template is denoted by n, the number of template elements and state values is (2n + 1)², so the I/O requirements of the arithmetic unit increase quadratically as n increases. Optimizing the data flow of the processor is therefore more important than for the CASTLE or the basic Falcon processors. The I/O requirements of the processor core can be reduced significantly if the memory structure of the basic Falcon architecture is generalized. In the basic case 3×3 templates are used, and 3 lines must be stored on the chip to reduce the I/O requirements to 1 load and 1 store operation per cell. For a neighborhood of n this rule generalizes: 2n + 1 lines of the cell array should be stored on the chip.
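The line-storage rule can be illustrated with a minimal software sketch. It is not the hardware design: the function name `stream_windows` and the list-of-lists image are hypothetical, the top edge is padded by repeating the first line as in the zero-flux handling described for the memory unit, and repeating the last line at the bottom edge is an assumption made here for symmetry.

```python
from collections import deque


def stream_windows(rows, n):
    """Yield (2n+1)-line windows while loading every input line only once."""
    buf = deque(maxlen=2 * n + 1)  # models the on-chip line store
    last = None
    for row in rows:  # one load per line, hence one load per cell
        if last is None:
            # Top edge: repeat the first line n times (zero-flux style padding).
            buf.extend([row] * n)
        buf.append(row)
        last = row
        if len(buf) == 2 * n + 1:
            yield list(buf)
    if last is None:
        return  # empty input, nothing to flush
    # Bottom edge (assumption): flush by repeating the last line n times.
    for _ in range(n):
        buf.append(last)
        if len(buf) == 2 * n + 1:
            yield list(buf)


image = [[10], [20], [30], [40]]  # four one-cell lines, purely illustrative
for window in stream_windows(image, n=1):
    print(window)
```

Each yielded window holds the 2n + 1 lines needed to update one line of cells, and no line is read from the input more than once.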

Additionally, the constant and template select values must be synchronized with the state of the cell, so n + 1 lines of these values should be stored. In the generalized case (3.6) is modified, and the size of the memory unit required for one processor using templates of arbitrary size can be computed by the following expression:

w · ((2n + 1) · sw + (n + 1) · (cw + tsw))    (3.7)

where w is the width of the cell array, n is the neighborhood value, and sw, cw, and tsw are the widths of the state, constant, and template select values, respectively. The size of the memory unit grows linearly with the neighborhood value. The structure of the generalized memory unit is shown in Figure 3.11. The first line of the cell array is loaded into the upper n memory lines to implement the zero-flux boundary condition; the first line can be fed back for similar reasons, but this is used at the end of the computation. The generalized memory unit requires a large area, but this is worthwhile because the I/O requirements of the processor are reduced to one input and one output operation per cell.

Figure 3.11: Structure of the generalized memory unit (blocks: Shift Register; signals: StateIn, StateOut(1) to StateOut(2n+1))
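Expression (3.7) can be evaluated directly to see the linear growth. The sketch below is only an illustration: the function name is hypothetical, and the example widths (a 512-cell-wide array, 18-bit state and constant values, 4-bit template select) are assumed values, not figures taken from the text.

```python
def memory_unit_bits(w, n, sw, cw, tsw):
    """Size of the memory unit in bits according to expression (3.7)."""
    return w * ((2 * n + 1) * sw + (n + 1) * (cw + tsw))


# Assumed example parameters, chosen only for illustration.
W, SW, CW, TSW = 512, 18, 18, 4

for n in (1, 2, 3):
    size = 2 * n + 1
    print(f"{size}x{size} template: {memory_unit_bits(W, n, SW, CW, TSW)} bits")
```

With these assumed widths each step in n adds the same w · (2 · sw + cw + tsw) bits, which is the linear growth stated above.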

3.2.2 The Mixer unit

To compute a new cell value, (2n + 1)² template and state values must be multiplied and the partial results summed. An arithmetic unit with (2n + 1)² multipliers would require a huge area, while using just one multiplier results in very low performance. A reasonable compromise between implementation area and processing speed is obtained if the template operation is executed row-wise using 2n + 1 multipliers. This solution requires a generalized version of the mixer unit described in the previous section. The structure of this generalized mixer for a 5×5 template is shown in Figure 3.12. The area requirements of the generalized mixer and template memory units for various state and template precisions and different neighborhood values are shown in Figure 3.13.

Figure 3.12: Structure of the generalized mixer unit in case of 5×5 sized template (registers: Mix1 to Mix5, LeftRegs, RightRegs; inputs: StateIn(1) to StateIn(5), LeftIn, LeftNew, RightIn; outputs: S1 to S5)
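The three design points compared in the text can be tabulated with a short sketch. It assumes that each multiplier completes one product per clock cycle and ignores pipeline latency and the adder tree; the function name and dictionary keys are hypothetical.

```python
def multiplier_tradeoff(n):
    """Return {design: (multipliers, cycles per cell)} for neighborhood n."""
    side = 2 * n + 1
    taps = side * side  # products needed for one new cell value
    return {
        "fully parallel": (taps, 1),
        "row-wise": (side, side),
        "single multiplier": (1, taps),
    }


for n in (1, 2, 3):
    for design, (mults, cycles) in multiplier_tradeoff(n).items():
        print(f"n={n} {design}: {mults} multipliers, {cycles} cycles per cell")
```

In every case the product of multipliers and cycles equals (2n + 1)²; the row-wise option balances the two factors.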

As in the previous case, the area required to implement the mixer unit increases linearly as the state or template precision is increased.

State values from the memory unit are connected to the StateIn inputs, and the S outputs are connected to the corresponding inputs of the arithmetic unit. The RightIn input of the processor is connected to the output of the Mixn register of the right neighbor via its RightOut output. Similarly, the LeftIn input is connected to the Mixn register of the left neighbor via its LeftOut output, while the LeftNew input is connected to the StateIn input of the left neighbor. The generalized mixer unit can be divided into three groups. The central part consists of the Mix1-Mixn registers, which hold the neighborhood of the current cell. The LeftRegs and RightRegs shift registers on the left and right are required for inter-processor communication, as in the 3×3 case. The length and number of the shift registers change with the neighborhood value n. The length of each shift register is equal to the size of the template. The number of Mix registers is 2n + 1, and the rightmost one is a parallel-in serial-out shift register. When the first or the last cell is computed, n columns of cell values are required from the neighboring processors. Therefore LeftRegs and RightRegs contain n shift registers, and the outputs of these shift registers are connected to the leftmost and rightmost n output registers via a multiplexer. Additionally, n − 1 auxiliary shift registers are required to hold the right neighborhood of the edge cell during inter-processor communication.

When inter-processor communication is not required, the operation of the generalized mixer unit is very similar to the 3×3 case. The new column of the template window is loaded into the Mixn shift register at every n-th clock cycle, and its contents are shifted to the next shift register. The data flow of the generalized mixer in the 5×5 case during communication with its left and right neighbors is shown in Figure 3.14. The figure uses the same notation as Figure 3.5.

Figure 3.13: Area requirements of the generalized mixer and template memory units in Slices (x-axis: State precision (bit), 0 to 64; y-axis: Area (Slice), 0 to 1000; series: Mixer 3x3, 5x5, 7x7 and Template Mem. 3x3, 5x5, 7x7)

The necessity of the auxiliary registers can be seen in Figure 3.14. Assume that the processing of the 13th cell value has just finished (Step 11 in Figure 3.14) and the appropriate cell values are already present in RightRegs. In the following 5 cycles these data are used in the rightmost column of the template window. These values must be stored because they are required during the computation of the next cell value (Step 21 in Figure 3.14). However, they cannot be kept in the same shift register, because that register is needed for the inter-processor communication that is carried out in the same cycles. In the general case, cell values from the first column of the right neighbor are required in the computation of the last n cells at the end of the row; therefore n − 1 auxiliary shift registers are required.

Note that a different data flow, which does not require auxiliary registers, could be used in RightRegs, but it requires additional multiplexers on the inputs. Implementing these multiplexers on an FPGA requires at least as much area as the auxiliary registers. Moreover, the modified data flow is more complicated, which makes the design of the control unit rather complex.

Figure 3.14: Data flow of the 5×5 sized mixer

3.2.3 The Arithmetic unit

As stated earlier, a good compromise between area requirements and computing performance can be achieved if the template operation is computed row-wise in the arithmetic unit. This generalized arithmetic unit is shown in Figure 3.15. It contains a row of 2n + 1 multipliers, an adder tree to sum the partial results, and a loadable accumulator to store intermediate results. Additionally, two shift registers are required to synchronize the constant and the old state values with the delay of the multipliers and the adder tree. The length of the shift registers can be computed from the neighborhood value and the delay of the multipliers, which depends on the bit width of the state and template values.

The operation of the generalized arithmetic unit is very similar to the 3×3 case. One row of the current template and the corresponding state values are loaded into the multipliers in every clock cycle, so the new state is computed in 2n + 1 clock cycles. The utilization of the arithmetic unit is 100%.
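The row-wise schedule can be sketched in software as follows. This is a behavioral illustration with integer values, not the hardware design: the function name is hypothetical, pipeline latency is ignored, and adding the constant once after the last row is an assumption about how it is combined with the accumulated sum.

```python
def rowwise_template_op(window, template, constant=0):
    """Sum of template[i][j] * window[i][j], computed one row per cycle."""
    size = len(template)  # 2n + 1
    assert len(window) == size
    acc = 0     # accumulator holding the intermediate result
    cycles = 0
    for state_row, template_row in zip(window, template):
        # 2n + 1 multipliers work in parallel on one row.
        products = [s * t for s, t in zip(state_row, template_row)]
        # The adder tree sums the products; the accumulator adds the row sum.
        acc += sum(products)
        cycles += 1
    # Assumption: the constant term is added once, after the last row.
    return acc + constant, cycles


window = [[r * 5 + c for c in range(5)] for r in range(5)]  # values 0..24
template = [[1] * 5 for _ in range(5)]                      # all-ones 5x5 template
print(rowwise_template_op(window, template))
```

For a 5×5 template the loop runs 2n + 1 = 5 times, matching the cycle count given above, and every multiplier is used in every cycle.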

The area requirements of the generalized arithmetic unit for 5×5 templates using dedicated multipliers are shown in Figure 3.16 (detailed area diagrams of the arithmetic unit for different template sizes and different types of multipliers can be found in Figure A.1). Because of the increased number of multipliers, the generalized arithmetic units require a much larger area than the standard arithmetic unit working with 3×3 templates. On Virtex-II and Virtex-II Pro FPGAs this large area requirement can be reduced efficiently by utilizing the dedicated multipliers. As in the 3×3 case, the area of the arithmetic unit can be reduced by 40-60%.

Figure 3.15: Structure of the generalized arithmetic unit (blocks: Mult, Adder tree, ACC, Reg4, Shift reg; inputs: S₁, T₁, S₂, T₂, up to S_2n+1, T_2n+1, g_ij, x_ij)

Figure 3.16: Area requirements of the arithmetic unit in case of 5×5 templates using speed optimized dedicated multipliers (x-axis: State width (bit), 0 to 64; y-axis: Area (Slice), 0 to 7000; series labeled 2, 4, 6, 8, 10, 12, 14, 16, 18, 24, 32, 48, 64)

Because of the increased number of cycles needed to compute a new cell state, the control unit of the processor must be modified. The modular structure of the control unit makes this modification rather simple, however: only the number of states of the Ctrl state machine has to be increased to 2n + 1.
