Barotropic ocean model - Emulált digitális CNN-UM architektúra megvalósítása újrakonfigurálható

Simulation of compressible and incompressible fluids is one of the most exciting areas of the solution of PDEs because these equations appear in many important applica-tions in aerodynamics, meteorology and oceanography. In this section a CNN-UM simulation of ocean currents will be presented. The governing equations of the ocean model are derived from the Navier-Stokes equations of incompressible fluids. CNN-UM solution of the Navier-Stokes equations was described in [3]. But the non-linearity of the state equations does not make it possible to utilize the huge computing power of the current analog CNN-UM chips. By using an emulated digital CNN-UM the known limitations of the analog chips such as low precision, small array size and noise sensitivity can be solved. To improve the performance of our solution the cell model of the architecture is modified to handle the non-linearity of the model.

Building a universal ocean model that can accurately describe the state of the ocean on all spatial and temporal scales is very difficult [25]. Thus ocean modeling efforts can be diversified into different classes, some concerned only with the turbulent surface boundary layers, some with continental shelves and some with the circulation in the whole ocean basin. Fine resolution models can be used to provide real-time weather forecasts for several weeks. These forecasts are very important to the fishing industry, ship routing and search and rescue operations. The more coarse resolution models are very efficient in long term global climate simulations such as simulating El Ni˜no effects of the Pacific Ocean.

In general, ocean models describe the response of the variable density ocean to atmospheric momentum and heat forcing. In the simplest barotropic ocean model a region of the ocean’s water column is vertically integrated to obtain one value for the vertically different horizontal currents. The more accurate models use several horizontal layers to describe the motion in the deeper regions of the ocean. Though these models are more accurate investigation of the barotropic ocean model is not worthless because it is relatively simple, easy to implement and it provides a good basis for the more sophisticated 3-D layered models.

The governing equations of the barotropic ocean model on a rotating Earth can be derived from the Navier-Stokes equations of incompressible fluids. Using Cartesian coordinates these equations have the following form [25] [26]:

dux Coriolis P ressure Lateral viscosity Advection

duy

dη

dt =−∂ux

∂x − ∂uy

∂y (4.25c)

where η is the height above mean sea level, ux and uy are volume transports in the x and y directions respectively. In the Coriolis term Ω is the angular rotation of the Earth and Θ is the latitude. The pressure term contains H(x, y), which is the depth of the ocean and the gravitational acceleration g. The wind and bottom stress components in both xand y directions are represented by τwx, τwy, τbx andτby

respectively. The lateral viscosity is denoted by A.

The bottom stress components can be linearly approximated by multiplying u_x and uy by a constant value σ’ the recommended value of this parameter is in the range 1−5×10⁻⁷. The wind stress components can be computed from the wind speed above the sea surface by the following approximation [27]:

τw = ρa

ρw

CdU₁₀² (4.26)

Where ρa is the density of the air,ρw is the density of the sea water, U10 is the speed of the wind at 10 meters above the surface andCdis the drag coefficient. One possible approximation of the drag coefficient is the following:

1000Cd= 0.29 + 3.1 U10

+ 7.7

U₁₀² (3≤U10 ≤6m/s) (4.27a) 1000Cd= 0.5 + 0.071·U10 (6≤U10 ≤26m/s) (4.27b) The horizontal friction parameter A can be computed from the mesh-box Reynolds number Rc:

Rc = ∆xU

A (4.28)

where ∆x is the mesh size and U is the magnitude of the velocity in the mesh-box.

By approximating U with √

gH and setting Rc = 4 which is generally considered in nonlinear flow simulations the lateral friction can be computed by the following equation:

A= ∆x√ gH Rc

(4.29) The circulation in the ocean is generally the result of the wind stress at the ocean’s surface and the source sink mass flows at the basin boundaries. In our investigation steady wind was used to force our model. In this case the ocean will generally arrive at a steady circulation after an initial transient behavior.

Solution of equations (4.25a)-(4.25c) on a CNN-UM architecture requires finite difference approximation on a uniform square grid. The spatial derivatives can be

approximated by the following well known finite difference schemes and CNN

By using these templates the pressure and lateral viscosity terms can be easily computed on CNN-UM architecture. However the computation of the advection terms requires the following non-linear CNN template which can not be implemented on the present analog CNN-UM architectures:

ux,ij

Most ocean models arrange the time dependent variables u_x,u_y and η on a stag-gered grid called C-grid. In this case the pressurepand heightH variables are located at the center of the mesh boxes, and mass transports ux and uy are at the center of the box boundaries facing the x and y directions respectively. In this case the state equation of the ocean model can be solved by a one layer CNN but the required tem-plate size is 5×5 and space variant templates should be used. Another approach is to use 3 layers for the 3 time dependent variables. In this case the CNN-UM solution can be described by the following set of equations:

du_x,ij

dηij

dt =− Adxux− Adyuy (4.31c)

At the edges of the model the normal and tangential components of the flow is zero: the former means that there is no flow through the boundary while the later means that there is no slip at the solid boundary. To achieve these conditions fixed boundary conditions are used in the case of ux and uy. In the case of the elevation η zero-flux boundary conditions are used because the water can move freely at the boundary.

Using equation (4.31a)-(4.31c) and templates (4.30a)-(4.30d) an analogic algo-rithm can be constructed to solve the state equation of the barotropic ocean model.

However the non-linear advection terms do not allow us to implement our algorithm on the present analog CNN-UM chips. The non-linear behavior can be modeled by us-ing software simulation but this solution does not differ from the traditional approach and does not provide any performance advantage.

The Falcon emulated digital CNN-UM architecture can be modified to handle the non-linear templates required in the advection terms of the ocean model. The required blocks are an additional multiplier to compute the non-linearity and a memory unit to store the requiredux,ijoruy,ijvalues. Of course this modification requires the redesign of the whole control unit of the processor. However this modified Falcon architecture can be used to approximate the behaviour of the ocean model, its performance would not be significant because six templates should be run to compute the next step of the ocean model. The performance can be greatly improved by designing a specialized arithmetic unit which can compute these templates fully parallel. Instead of building a general CNN-UM architecture which can handle the required non-linear templates an array of specialized cells is designed which can solve the state equation of the discretized ocean model directly.

To emulate the behavior of the specialized cells the continuous state equations (4.31a)-(4.31c) must be discretized in time. In the solution the leapfrog method is used but in this case we have an upper limit on the ∆t timestep. The maximal value of the timestep can be computed by using the Courant-Friedrichs-Levy (CFL) stability condition.

∆t <∆x/cw (4.32)

where ∆t is the timestep, ∆x is the distance between the grid points and cw is the speed of the surface gravity waves typically cw =√

gH.

Computation of the derivatives of ux and uy is the most complicated part of the arithmetic unit. The proposed structure to compute the derivative of u_x is shown in Figure 4.26 similar circuit is required to compute the derivative of uy according to (4.31b).

This complex arithmetic unit can compute the derivatives and update the cell’s state value in one clock cycle but it requires pipelining to achieve high clock speeds.

f_ij gH_ij u_y,ij

recH_ij

-A_iju_x,i-1,ju_x,i+1,j u_x,i,j-1u_x,i,j+1u_x,ij

+ +

+ + +

u_y,ij ηi−1,j ηi+1,ju_x,ij u_x,i-1,ju_x,i+1,ju_x,iju_x,i,j-1u_x,i,j+1

, wx ij

Figure 4.26: Structure of the arithmetic unit

The values of ∆x and ∆t are restricted to be integer powers of two. In this case multiplication and division by these values can be done by shifts. This simple trick makes it possible to eliminate several multipliers from the arithmetic unit and greatly reduces the area requirements. The multiplication of g with Hij in the pressure term and the reciprocal ofHij in the advection term are constant during the computation thus these values are computed in advance. This solution requires additional mem-ory both on- and off-chip but the required computing precision and the area of the arithmetic unit can be significantly reduced.

The width of the values is computed by using simple rules and heuristics. For example the deepest point of the Pacific Ocean is about 11,000 meters deep, so 14 bits are required to represent this depth with one meter accuracy. This one meter resolution is far more accurate than the available data of the ocean bottom so the last or the last two bits can be safely removed. Similar considerations can be used to determine the required bit width and the displacement of the radix point for the remaining constant values such asf, A, g×H, 1/H, τ_wx and τ_wy. Using fixed-point numbers with various width and displacement values makes it harder to design the arithmetic unit but it is worthwhile because the area can be reduced and clock speed is increased. Area requirement of the arithmetic unit is shown in Figure 4.27.

The proposed architecture is implemented on our RC200 prototyping board from Celoxica Ltd. The XC2V1000 FPGA on this card can host one arithmetic unit which

(a)

Figure 4.27: (a) Area requirements of the optimized arithmetic unit (b) Number of implementable optimized arithmetic units on the XC2VP125 FPGA

makes it possible to compute a new cell value in one clock cycle. Unfortunately the board has 72 bit wide data bus, so 6 clock cycles are required to read a new cell value and to store the results if 36 bit precision is used. The performance can be increased by reducing the array size and implementing six memory units which use the arithmetic unit alternately. In this case 6 clock cycles are required to compute 6 new cell values and the utilization of the arithmetic unit is 100%. The performance of the system is limited by the speed of the on board memory therefore the maximum clock frequency is 90MHz. In this case the performance of the chip is 90 million cell update/s. The size of the memory is also a limiting factor because the state and constant values must fit into the 4Mbyte memory of the board. The size of the cell array is restricted to 256×256 cells by the limited amount of on-board memory however the XC2V1000 FPGA can be used to work with 1024 or even 2048 cell wide arrays.

By using the new Virtex-II Pro devices with larger and faster memory the per-formance of the architecture can reach 200MHz clock rate and can compute a new cell value in each clock cycle. Additionally the huge amount of on-chip memory and multiplier on the largest XC2VP125 FPGA make it possible to implement 13 opti-mized arithmetic units. These arithmetic units work in parallel and the cumulative performance is 2600 million cell update/s. On the other hand, the large number of arithmetic units make it possible to implement more sophisticated and more accu-rate ocean models. The performance of the different implementations compared to an AMD Athlon 64 3200+ processor running on 2GHz clock frequency is shown in Figure 4.28.

The performance of the emulated digital solution of the model is very encouraging even using the mid-sized XC2V1000 FPGA on our prototyping board the computation

1 10 100 1000 10000

10 12 14 16 18 20 22 24 26 28 30 32 34 36

Precision (bit)

Speedup

XC2VP125 XC2V1000 RC200

Figure 4.28: Speedup of the architecture compared to an AMD Athlon 64 3200+

microprocessor

is accelerated by 60 times. Using the currently available largest FPGA the XC2VP125 the speedup is 1700-fold when 36 bit precision is used.

To evaluate the accuracy of the fixed-point solution a simple model is used. The size of the modeled ocean is 2097km, the boundaries are closed, the grid size is 256×256 and the grid resolution is 8192m. Six different bottom topographies were used during the computation in the first two cases the ocean bottom was flat and 4000m and 1000m deep. In the next two cases a reversed parabola shape was used which formed an underwater ridge where the deepest point is 4000m and 1500m while the top of the ridge is 2000m and 1000m deep. In the last two cases the bottom topography was a seamount which is described by the following equation:

H(x, y) =−h₀

1−Ae⁻^x2+y

2 L2

(4.33) where h0 is the maximum depth, A is the height and Lis the slope of the seamount.

In both cases h0 was 4500m whileA and L was 0.9 and 0.2 in the fifth case and 0.75 and 0.3 in the sixth case. The bottom topographies used in the 3^rd and 6^th case are shown in Figure 4.29.

The model is forced by a steady wind blowing from west and its value is constant along the x direction. The wind speed in the y direction is described by a reversed parabolic curve where the speed is zero at the edges and 8m/s in the center. The

Figure 4.29: Bottom topography of the underwater ridge and the seamount used in the 3^rd and 6^th case

results of the 36 bit fixed-point computations after 72 hours of simulation time using 32s timestep are shown in Figure 4.30.

In the first two cases where the bottom is flat two nearly symmetrical circulations are formed. If the ocean is shallower the stream is stronger and the elevation is larger however its shape is very similar to the deeper case. In the next two cases the flow pattern is changed by the underwater ridge. There are still two circulations but its center is moved before and back to the top of the ridge. According to this change the flow direction is also changed in the center. In the last two cases where the flow around a seamount is modeled the flow pattern has spiral like shape with two arms and a wave front is formed before the top of the seamount. The 36 bit fixed-point computations are very close to the 64 bit floating-point but our question is what the optimal fixed-point precision is. To answer this question the model is simulated by using different precisions for all cases. The result of the fixed-point computations is compared to the 64 bit floating-point results and the absolute maximum difference between the two solutions is shown in Figure 4.31.

If the precision is smaller than 12 bit the state of the ocean model does not change.

By slightly increasing the accuracy the state of the model will change but the error of the flow is in the same range as the flow value itself. To achieve 10% accuracy, which can be accepteble in ocean simulations because the uncertainty of the measured initial conditions is large, at least 22 bit precision is required however in this case the error of the elevation is still too high compared to its maximum value. If the precision is further increased, the accuracy of the solution increases quickly but seemingly the error cannot be decreased if the precision is larger than 30 bit. This behavior is very

0 200 400 600 800 1000 1200 1400 1600 1800 2000

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0

(a) Results in the 1^stcase. Bottom topography: flat 4000m deep

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0

(b) Results in the 2^ndcase. Bottom topography: flat 1000m deep

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0

Figure 4.30: Results of the ocean simulation after 72 hours simulation time

0 200 400 600 800 1000 1200 1400 1600 1800 2000

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0

(d) Results in the 4^thcase. Bottom topography: ridge

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0

(e) Results in the 5^th case. Bottom topography: seamount

0 200 400 600 800 1000 1200 1400 1600 1800 2000 0

(f) Results in the 6^thcase. Bottom topography: seamount

Figure 4.30: Results of the ocean simulation after 72 hours simulation time (continued)

1.0E-05 1.0E-04 1.0E-03 1.0E-02 1.0E-01 1.0E+00

10 12 14 16 18 20 22 24 26 28 30 32 34 36

Precision (bit)

Error

Case1 Case2 Case3 Case4 Case5 Case6

1.0E-05 1.0E-04 1.0E-03 1.0E-02 1.0E-01 1.0E+00

10 12 14 16 18 20 22 24 26 28 30 32 34 36

Precision (bit)

Error

Case1 Case2 Case3 Case4 Case5 Case6

Elevation

1.0E-07 1.0E-06 1.0E-05 1.0E-04 1.0E-03 1.0E-02

10 12 14 16 18 20 22 24 26 28 30 32 34 36

Precision (bit)

Error

Case1 Case2 Case3 Case4 Case5 Case6

Figure 4.31: Error of the solution of the ocean model

similar to the results obtained in the previous sections and the error of the fixed-point solution in these cases is smaller than the error of the 64 bit floating-fixed-point solution. Though 30 bits of precision seems to be enough for the computation of the flow values the elevation of the ocean model should be computed more accurately to achieve similar accuracy than the 64 bit floating-point solution.

The state equation of a barotropic ocean model with realistic input parameters was solved by using CNN architecture. The Falcon emulated digital CNN-UM processor was optimized to solve the governing equations of the model. Due to the optimiza-tion the area requirements of the arithmetic unit are significantly reduced while its computing power is increased. The proposed architecture was implemented on a mid-sized FPGA with one million equivalent system gates on our RC200 prototyping board. The performance of this solution is limited by the speed of the memories and the width of the memory bus. But even this restricted solution is 60 times faster than an AMD Athlon 64 3200+ processor. If larger FPGA and wider memory bus are used, 1700-fold performance increase can be achieved even if 36 bit precision is used.

Recent advances in microprocessor and FPGA technology

The first version of this theses is submitted in 2004 but submission of the final ver-sion is delayed until 2007. By following Moore’s Law transistor geometry is shrinked down from 130nm to 90nm in mainstream applications in the last three years while most advanced devices are manufactured by using 65nm technology and 45nm will be introduced soon. Due to the increased number of transistors some new architec-tures have been developed. In this section advances in microprocessor and FPGA architectures will be introduced. Especially the impact of the new devices on CNN simulation/emulation will be examined.

5.1 Multiprocessors

Today both AMD and Intel manufacturing high performance microprocessors by using 65nm technology and Intel’s new 45nm processors will be ready for mass production by the end of 2007. The increased number of transistors makes it possible to integrate two processor cores on the same die. While in 2004 multiprocessor systems was only available in the server market today mainstream desktop microprocessors are usually dual core architectures and in the high performance segment even quad core processors are available. Current quad core processors contains two dual core chips which are integrated into the same package. Main parameters of the currently available desktop microprocessors are shown in Table 5.1.

The architecture of the Athlon64 processors remains almost unchanged however the clock frequency of the architecture can be increased by 50% due to the advanced

In document Emulált digitális CNN-UM architektúra megvalósítása újrakonfigurálható áramkörökön és alkalmazásai (Pldal 109-124)