
Humans do precise manipulation tasks largely through tactile perception of objects.

When a fingertip touches an object, a contact stress profile is induced at its surface.

The resulting stress profile has three components: the normal stress σn and the shear stresses τx and τy in the x and y directions. Sensing these three stress components provides humans with a rich source of information about their physical environment.

Measurement and processing of the stress components is also of great importance in precise robotic manipulation. These biologically inspired systems use a small array of sensing elements to improve sensing capabilities. Analog CNN-UM cells can be used to process the measured stress components in real time [22].

Simulating the static and dynamic properties of the sensors at design time requires high computing power. Several studies have demonstrated the effectiveness of the CNN-UM solution of different PDEs [3] [4]. In most cases, however, the results cannot be used in real-life implementations because of the limitations of the analog CNN-UM chips, such as low precision or the lack of support for 5×5 sized templates. These drawbacks can be overcome by using an emulated digital CNN-UM. In this section the transient behavior of a simple tactile sensor is modeled using the multi-layer Falcon emulated digital CNN-UM architecture. The standard multi-layer architecture is specialized to solve the state equation of the tactile sensor, which reduces the area requirements and improves the performance of the architecture.

Tactile sensors are usually composed of a central shuttle plate, which is suspended by four bridges over a pit. The suspension of the whole structure allows the bridges to deform as normal and shear stress is applied to the central plate. Each bridge contains an embedded piezoresistor. The resistance of the bridges changes with the deformation, and the resulting voltage changes on the bridges can be measured. In our case the bridges are located at the center of each edge, as shown in Figure 4.18.

The transient response of the central plate due to an applied normal pressure can be described by the following partial differential equation [23]:

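$$h\rho\,\frac{\partial^2 w}{\partial t^2} = p - D\left(\frac{\partial^4 w}{\partial x^4} + 2\,\frac{\partial^4 w}{\partial x^2\,\partial y^2} + \frac{\partial^4 w}{\partial y^4}\right) \tag{4.22}$$

where w is the displacement of the plate, p is the applied pressure, h is the thickness of the plate and ρ is the density of the plate. The flexural rigidity D can be computed by the following expression: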

$$D = \frac{E h^3}{12\,(1-\nu^2)} \tag{4.23}$$

where E is Young's modulus and ν is Poisson's ratio.

The plate measures 100 µm × 100 µm and is 2.85 µm thick. The width of the suspension bridges is 12.5 µm. For simplicity the suspension bridges themselves are not modeled, but our solution can be extended to handle them. The tactile sensor is made from silicon, so the material constants are the following: E = 47 GPa, ν = 0.278 and ρ = 2330 kg/m³.

Figure 4.18: Structure of the tactile sensor
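As a quick numerical check, the flexural rigidity of (4.23) and the hρ factor of (4.22) can be evaluated from the constants quoted above. This is only an illustrative sketch; the function and variable names are not part of the original work.

```python
def flexural_rigidity(E, h, nu):
    """Flexural rigidity D = E*h^3 / (12*(1 - nu^2)) of a thin plate, in N*m."""
    return E * h**3 / (12.0 * (1.0 - nu**2))

# Values quoted in the text, converted to SI units.
E = 47e9        # Young's modulus, Pa
h = 2.85e-6     # plate thickness, m
nu = 0.278      # Poisson's ratio
rho = 2330.0    # density, kg/m^3

D = flexural_rigidity(E, h, nu)
areal_mass = h * rho   # the h*rho factor on the left-hand side of (4.22), kg/m^2

print(f"D = {D:.3e} N*m, h*rho = {areal_mass:.3e} kg/m^2")
```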

To solve (4.22) on a CNN-UM the plate must be spatially discretized, with each finite element assigned to one CNN cell. Equation (4.22) is second order in time, so two coupled CNN layers are required: the displacement of the plate is computed by the first layer and the velocity by the second. The approximation of the spatial derivatives requires the following 5×5 sized template, which is the conventional discretized form of the biharmonic operator:

$$D\nabla^4 w = D\left(\frac{\partial^4 w}{\partial x^4} + 2\,\frac{\partial^4 w}{\partial x^2\,\partial y^2} + \frac{\partial^4 w}{\partial y^4}\right) \approx \frac{D}{\Delta x^4}\begin{bmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 2 & -8 & 2 & 0 \\ 1 & -8 & 20 & -8 & 1 \\ 0 & 2 & -8 & 2 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix} \tag{4.24}$$

where Δx is the distance between the grid points. At the free edges of the plate zero-flux boundary conditions are used, while fixed boundary conditions are used at the suspensions. Because of the two different boundary conditions, space-variant templates are required. Equation (4.22) cannot be solved on the current analog VLSI chips because 5×5 sized and space-variant templates are not supported in these architectures. The Falcon configurable emulated digital CNN-UM architecture overcomes these limitations: it can be configured to support two CNN layers and 5×5 sized space-variant templates.
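The template of (4.24) can be tried out in a few lines of NumPy. The sketch below applies it only to interior cells that have a full 5×5 neighbourhood; the space-variant boundary templates are deliberately left out, and the function name `biharmonic_interior` is purely illustrative.

```python
import numpy as np

# 5x5 template from (4.24), without the D / dx^4 prefactor.
BIHARMONIC = np.array([
    [0,  0,  1,  0, 0],
    [0,  2, -8,  2, 0],
    [1, -8, 20, -8, 1],
    [0,  2, -8,  2, 0],
    [0,  0,  1,  0, 0],
], dtype=float)

def biharmonic_interior(w, dx):
    """Apply the template to every cell that has a full 5x5 neighbourhood.

    Returns an array of shape (rows - 4, cols - 4). Cells closer than two
    grid points to the edge are skipped because they would need the
    space-variant boundary templates.
    """
    rows, cols = w.shape
    out = np.zeros((rows - 4, cols - 4))
    for di in range(5):
        for dj in range(5):
            coeff = BIHARMONIC[di, dj]
            if coeff != 0.0:
                # Shifted view of w aligned with the interior cells.
                out += coeff * w[di:di + rows - 4, dj:dj + cols - 4]
    return out / dx**4

# Polynomials whose biharmonic is known analytically:
# for x^4 it is 24, for x^2 * y^2 it is 2 * 4 = 8.
x, y = np.meshgrid(np.arange(9.0), np.arange(9.0), indexing="ij")
print(biharmonic_interior(x**4, 1.0)[0, 0])
print(biharmonic_interior(x**2 * y**2, 1.0)[0, 0])
```

The template coefficients sum to zero, so a constant displacement field produces no restoring force, as expected for a differential operator.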

On the multi-layer Falcon architecture a fully connected multi-layer CNN structure can be emulated. In our case most of the elements of the inter-layer templates A12 and A21 are zero or one, while all elements of the self-feedback templates A11 and A22 are zero. Using the original multi-layer Falcon arithmetic unit, the template operation is computed in 5 clock cycles in row-wise order and 5 multipliers are required for each template, which amounts to 20 multipliers altogether in our case. The number of required multipliers can be halved if the computation with the A11 and A22 templates is removed from the arithmetic unit. Additionally, the symmetry of the template operator A21 (4.24) makes it possible to optimize the arithmetic unit of the Falcon architecture further. No multiplier is required for multiplication by 0 and 1, and multiplication by 2 and −8 can be done by shifts in a radix-2 number system. Multiplication by 20 can also be computed without any multiplier by multiplying the state value by 16 and by 4 and summing the partial results. After this optimization just one clock cycle and only one multiplier are required during the computation. The derivative computation part of the optimized arithmetic unit is shown in Figure 4.19.

Figure 4.19: Structure of the optimized arithmetic unit
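The multiplier-free evaluation of the template described above can be sketched with plain integers standing in for fixed-point state values. The grouping of the additions below is an assumption made for illustration and does not claim to reproduce the exact adder tree of the hardware; the single remaining multiplication by D/Δx⁴ is left out.

```python
def stencil_shift_add(w, i, j):
    """Template sum of (4.24) at cell (i, j) using only adds and shifts.

    `w` is a 2-D list of Python ints standing in for fixed-point values.
    """
    centre = w[i][j]
    near = w[i - 1][j] + w[i + 1][j] + w[i][j - 1] + w[i][j + 1]    # weight -8
    diag = (w[i - 1][j - 1] + w[i - 1][j + 1]
            + w[i + 1][j - 1] + w[i + 1][j + 1])                    # weight 2
    far = w[i - 2][j] + w[i + 2][j] + w[i][j - 2] + w[i][j + 2]     # weight 1
    twenty = (centre << 4) + (centre << 2)      # 20*w = 16*w + 4*w
    return twenty - (near << 3) + (diag << 1) + far

TEMPLATE = [
    [0,  0,  1,  0, 0],
    [0,  2, -8,  2, 0],
    [1, -8, 20, -8, 1],
    [0,  2, -8,  2, 0],
    [0,  0,  1,  0, 0],
]

def stencil_multiply(w, i, j):
    """Reference version using ordinary multiplications."""
    return sum(TEMPLATE[a][b] * w[i - 2 + a][j - 2 + b]
               for a in range(5) for b in range(5))

# Small test grid containing positive and negative values.
grid = [[(3 * r * r - 7 * c + r * c) % 11 - 5 for c in range(5)]
        for r in range(5)]
print(stencil_shift_add(grid, 2, 2) == stencil_multiply(grid, 2, 2))
```

Grouping the neighbours by weight first means that each shift is applied once to a sum rather than once per neighbour, which is what the symmetry of the template makes possible.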

To achieve better numerical stability, the leapfrog method (4.17) is used instead of the forward Euler method to compute the new cell value. The implementation of this method requires additional memory elements and doubles the required memory bandwidth of the processor, but these modifications are worthwhile because a much larger timestep can be used. Because only one column of processors is used, no inter-processor communication is required in the horizontal direction, hence the number of lines stored on the chip can be reduced by one. In this case the state values entering the processor can be connected to the mixer unit directly. The structure of the optimized memory unit is shown in Figure 4.20. To eliminate the multiplication in (4.17), the timestep value must be an integer power of two; in this case the multiplication can also be done by shifts.
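The effect of a power-of-two timestep can be illustrated on a scalar stand-in for the plate. Equation (4.17) is not reproduced in this section, so the sketch assumes the standard two-step leapfrog form y(n+1) = y(n−1) + 2Δt·f(y(n)); the harmonic oscillator, the function name and the forward Euler start-up step are likewise assumptions made only for this example.

```python
import math

def leapfrog_oscillator(omega, k, steps):
    """Integrate w'' = -omega^2 * w with leapfrog and timestep dt = 2**-k.

    Two state variables play the role of the two CNN layers:
    w (displacement) and v (velocity). Returns (w, v) after `steps` steps.
    """
    dt = 2.0 ** -k
    w_prev, v_prev = 1.0, 0.0                  # values at step n-1
    # Leapfrog needs two time levels, so the first step uses forward Euler.
    w_curr = w_prev + dt * v_prev
    v_curr = v_prev - dt * omega**2 * w_prev
    for _ in range(steps - 1):
        # y(n+1) = y(n-1) + 2*dt*f(y(n)). Since 2*dt = 2**(1-k), the
        # product is a pure exponent change (a shift in fixed point),
        # expressed here with ldexp instead of a multiplication.
        w_next = w_prev + math.ldexp(v_curr, 1 - k)
        v_next = v_prev + math.ldexp(-omega**2 * w_curr, 1 - k)
        w_prev, v_prev = w_curr, v_curr
        w_curr, v_curr = w_next, v_next
    return w_curr, v_curr

w, v = leapfrog_oscillator(omega=1.0, k=10, steps=1024)   # integrates to t = 1
print(w, math.cos(1.0))
print(v, -math.sin(1.0))
```

The loop keeps both the current and the previous value of each state variable, which mirrors the additional memory elements and the doubled memory bandwidth mentioned above.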

Figure 4.20: Structure of the optimized memory unit (stored state values: W1(n), W1(n-1), W2(n), W2(n-1))

Rapid prototyping techniques and high-level hardware description languages such as Handel-C from Celoxica [24] make it possible to develop the optimized arithmetic unit much faster than with the conventional VHDL-based RTL-level approach. The resources required to implement one processor that can compute (4.22) on a 512×512 sized grid with different precisions are summarized in Figure 4.21.

The proposed architecture was implemented on our RC200 prototyping board from Celoxica Ltd. [24]. The Virtex-II 1000 (XC2V1000) FPGA on this card can host three Falcon processor cores using 36 bit precision, which makes it possible to compute three iterations in one clock cycle. The performance of the system is limited by the speed of the on-board memory, resulting in a maximum clock frequency of 90 MHz. The theoretical performance of the three processor cores is 270 million cell update/s. Unfortunately the board has a 72 bit wide data bus, so 4 clock cycles are required to read a new cell value and to store the results; this reduces the achievable performance to 67.5 million cell update/s. The size of the memory is also a limiting factor because the state values must fit into the 4 Mbyte memory of the board. Thus the grid size is restricted to 512×512 elements, though the width of the grid can be increased to 1024 or even 2048 elements.

Figure 4.21: Resource requirements of the optimized arithmetic unit

Figure 4.22: Speedup of the arithmetic unit compared to the Pentium IV 3.0C microprocessor (x axis: precision in bits, 10 to 36; y axis: speedup, logarithmic scale from 1 to 10000; series: XC2VP125, XC2V1000, RC200)

By using the new Virtex-II Pro devices with larger and faster memory, the architecture can reach a 230 MHz clock rate and compute a new cell value in each clock cycle. Additionally, the large amount of on-chip memory and the number of multipliers on the largest XC2VP125 FPGA make it possible to implement 34 processor cores using 36 bit precision, which results in a computing performance of 7,820 million cell update/s. Alternatively, the large number of arithmetic units makes it possible to implement higher-order and more accurate numerical methods.
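The throughput figures quoted for the two platforms follow from a simple product of core count and clock rate, divided by the number of clock cycles needed per cell update. The helper below is only a back-of-the-envelope sketch with an illustrative name.

```python
def cell_updates_per_second(cores, clock_hz, cycles_per_update=1):
    """Peak cell updates per second for `cores` processors in parallel."""
    return cores * clock_hz / cycles_per_update

# RC200 board: three cores at 90 MHz.
rc200_peak = cell_updates_per_second(3, 90e6)
# Same board when the 72 bit bus needs 4 clock cycles per cell update.
rc200_bus_limited = cell_updates_per_second(3, 90e6, cycles_per_update=4)
# XC2VP125: 34 cores at 230 MHz, one cell update per clock cycle.
xc2vp125 = cell_updates_per_second(34, 230e6)

for name, value in [("RC200 peak", rc200_peak),
                    ("RC200 bus limited", rc200_bus_limited),
                    ("XC2VP125", xc2vp125)]:
    print(f"{name}: {value / 1e6:.1f} million cell update/s")
```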

The achievable performance and the speedup compared to conventional microprocessors are summarized in Figure 4.22.

The results show that even the limited implementation of the modified Falcon processor on our RC200 prototyping board can outperform a high-performance desktop PC. If adequate memory bandwidth (a 288 bit wide memory bus) is provided, the emulated digital solution is nearly 100 times faster, while with the largest FPGA from Xilinx a speedup of more than 1000 times can be achieved.

A simple test case was used to determine the accuracy of the fixed-point solution.

The input function was a step function, which was applied to the center of the plate.

Figure 4.23: Displacement of the plate after 3µs

The first 3.814 µs (2²¹ steps using a timestep of 2⁻³⁹ s) of the transient response was computed using 64 bit floating-point numbers. The displacement of the plate after 3 µs, where the amplitude of the oscillation is the largest, is shown in Figure 4.23.
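The simulated interval quoted above is consistent with 2²¹ steps of a 2⁻³⁹ s timestep, which can be confirmed with a short check. Reading the step count and the timestep as powers of two is an interpretation of the quoted figures, and the variable names are illustrative.

```python
steps = 2 ** 21          # number of timesteps
dt = 2.0 ** -39          # timestep in seconds (an integer power of two)

t_end = steps * dt       # equals 2**-18 s exactly in binary floating point
print(f"simulated time: {t_end * 1e6:.4f} us")
```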

The 64 bit floating-point result was compared to the results of the fixed-point computations by using different state precisions. The maximum difference between the two solutions is shown in Figure 4.24.

The largest amplitude of the plate is 2.3×10⁻¹¹ during the simulated time interval, thus at least 18 bit precision must be used to obtain results accurate to about 10%. As the precision is increased, the fixed-point computation becomes more and more accurate, but the error cannot be reduced below 5×10⁻¹⁶. This behavior is very similar to the results obtained in the previous sections, where the state equation of the mechanical vibrating system and the heat equation were solved using 32 bit floating-point numbers. The explanation lies in rounding errors, because our arithmetic unit does not perform any rounding during the computation of the derivatives. If 36 bit precision is used, the internal precision is 60 bits: the coefficient D/Δx⁴ is 18 bits wide and 6 additional bits are required during the computation of the spatial difference operator. In the case of standard IEEE 64 bit floating-point numbers the size of the mantissa is 52 bits, so it is very likely that some bits are lost when the result of the spatial difference operator is multiplied by D/Δx⁴. In most cases, by using 60 bit precision inside the arithmetic unit, the results will be more accurate than the floating-point results. The spatial distribution of the error is shown in Figure 4.25. The maximum of the error is located at the center of the plate, where the amplitude is the highest, while the error is much smaller at the corners.

Figure 4.24: Error of the displacement (a) and speed (b) of the plate (x axis: precision in bits, 10 to 36; y axis: error, logarithmic scale; series: 64 and 128)
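The 60 bit internal width can be reproduced from the template itself: the sum of the absolute template weights of (4.24) bounds how much the template sum can grow relative to a single state value. The sketch below is an illustrative calculation, not a description of the actual hardware datapath, and its names are hypothetical.

```python
import sys

TEMPLATE = [
    [0,  0,  1,  0, 0],
    [0,  2, -8,  2, 0],
    [1, -8, 20, -8, 1],
    [0,  2, -8,  2, 0],
    [0,  0,  1,  0, 0],
]

def growth_bits(template):
    """Extra bits needed so that the template sum can never overflow."""
    worst_case_gain = sum(abs(c) for row in template for c in row)
    # ceil(log2(gain)) computed with integer arithmetic only.
    return (worst_case_gain - 1).bit_length()

state_bits = 36          # state precision used on the RC200 board
coefficient_bits = 18    # width of the D / dx^4 coefficient
internal_bits = state_bits + growth_bits(TEMPLATE) + coefficient_bits

print("growth bits:", growth_bits(TEMPLATE))
print("internal width:", internal_bits)
# IEEE double: 52 stored mantissa bits, 53 with the implicit leading bit.
print("double significand bits:", sys.float_info.mant_dig)
```

The absolute weights add up to 64, which accounts for the 6 additional bits, and the resulting width exceeds the significand of a 64 bit floating-point number.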

Figure 4.25: Distribution of the error after 3µs using 36 bit precision

The state equation of a micro-electromechanical tactile sensor with realistic input parameters was solved using a CNN architecture. The Falcon emulated digital CNN-UM processor was optimized to solve the governing equations of the model. Owing to this optimization the area requirements of the arithmetic unit are significantly reduced while its computing power is increased. The proposed architecture was implemented on a mid-sized FPGA with one million equivalent system gates on our RC200 prototyping board. The performance of this solution is limited by the speed of the memories and the width of the memory bus, but even this restricted solution is 10 times faster than a Pentium IV 3 GHz processor. If a larger FPGA and a wider memory bus are used, a 1000-fold performance increase can be achieved. The accuracy of the solution is very promising: even if low precision is used in the arithmetic unit, the results are very close to the 64 bit floating-point computations. If the precision is increased beyond 32 bits, the accuracy of the fixed-point solution cannot be improved further because the rounding errors of the fixed-point and floating-point computations are then in the same range. With 32 bit precision the error is about 0.002%, which is acceptable in most engineering applications.