
Partitioning and Placement

4.4 Empirically validating the advantage of locally controlled arithmetic units

4.4.1 The proposed greedy algorithm

Traditional graph-partitioning techniques can minimize the number of cut arcs and distribute the I/O connections of the clusters evenly; however, the resulting operating frequency is far from the theoretical limits suggested by FPGA data sheets.

According to my measurements, when a locally controlled AU is generated from a well-partitioned (see Problem 1) data-flow graph, the synthesis tool estimates a high operating frequency; however, in the place-and-route phase the circuit cannot be placed efficiently, resulting in a significantly lower operating frequency.

The Xilinx place-and-route process gives the designer the ability to constrain the physical position of parts of the circuit. In my experience, constraining the placement of the partition classes and the synchronizing FIFOs indeed improves the operating frequency. Unfortunately, with naive partitioning techniques, the manual placement of the resulting classes and FIFOs is very challenging, and the operating frequency is sensitive to the local connectivity. Figure 4.4 shows the manual placement constraints of a circuit partitioned with a naive partitioning technique [3].

Despite the fact that the partitioning limited the number of I/Os of each class, the poor placement resulted in a low operating frequency. In the figure, the critical net that directly limits the operating frequency is colored red (also indicated in red in Figure 4.6). The question is how to place all connected components close to one another to avoid long interconnections. In the presented example, the critical net is related to an input variable that is used in three different operations of the mathematical expression. The three operations belong to three different partition classes; therefore, three extra synchronization FIFOs are required, and they should be placed close to one another because they use the same input. (The phenomenon also appears at lower layers; see the green vertices in Figure 4.6.) A smart partitioning algorithm shall avoid multiway cutting of hyperarcs and thus make efficient placement possible.

The idea of the new algorithm, inspired by the problem above, is to draw the graph in the plane before the partitioning starts. Given a representation of the graph that minimizes the distance between connected vertices, a simple greedy

Figure 4.4: Placement constraints (blue) and the placed instances (grey) of a circuit generated from a partition created by a naive partitioning strategy. Connectivity of partition classes is indicated in orange, while a timing-critical net is indicated in red (also shown in Figure 4.6). The figure was generated with the standard Xilinx place-and-route editor. It demonstrates the challenges a developer has to solve during the manual placement of circuit elements. In a high-performance implementation, connected elements shall be placed close to one another.

algorithm can provide a partitioning without long interconnections. Furthermore, the placement becomes straightforward, and the placement constraints can be adjusted manually. The proposed strategy significantly differs from low-level partitioning and placement techniques, as we partition and place the circuit at the level of IP components.

The proposed algorithm is a two-step procedure preceded by a simple preprocessing. During preprocessing, the vertical coordinates of the vertices are fixed via a simple method called layering. Next, in the first step, horizontal coordinates are calculated to minimize the distance between connected components. Finally, in the second step, a greedy method is used to partition the graph based on the spatial information of the vertices. The steps of the algorithm (also summarized in Algorithm 1) are described in the following paragraphs.

Algorithm 1 Outline of the proposed greedy algorithm.

1: Add delay vertices to make the graph bipartite, and associate every vertex with a level according to its distance from the global inputs (layering).

2: Place vertices randomly into the corresponding layer.

3: Create an initial horizontal placement of vertices by the barycentre heuristic.

4: Find the horizontal position of the vertices and a local minimum of the objective function via a simple swap-based iterative algorithm.

5: Partition the graph according to the spatial positions of the vertices using a greedy algorithm.

4.4.1.1 Preprocessing and layering

The mathematical expression to be implemented is described in a text file, in which inputs, outputs and internal variables are also defined. The input file is parsed, and a data-flow graph representation of the mathematical expression is created. Every mathematical operator is represented by a vertex and has an associated delay, which will be the pipeline latency of the corresponding IP core in the implemented circuit.
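The data-flow graph representation described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the class and field names (`Vertex`, `DataFlowGraph`, `delay`) and the example latencies are assumptions.

```python
# Illustrative sketch of the data-flow graph: every operator becomes a vertex
# with an associated pipeline delay, and arcs connect producers to consumers.
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str       # operator or variable name
    op: str         # e.g. "add", "mul", "input"
    delay: int = 0  # pipeline latency of the corresponding IP core

@dataclass
class DataFlowGraph:
    vertices: dict = field(default_factory=dict)  # name -> Vertex
    arcs: list = field(default_factory=list)      # (source name, target name)

    def add_vertex(self, name, op, delay=0):
        self.vertices[name] = Vertex(name, op, delay)

    def add_arc(self, src, dst):
        self.arcs.append((src, dst))

# Toy expression: out = (a + b) * c, with assumed IP-core latencies
g = DataFlowGraph()
g.add_vertex("a", "input"); g.add_vertex("b", "input"); g.add_vertex("c", "input")
g.add_vertex("t0", "add", delay=11)
g.add_vertex("out", "mul", delay=8)
g.add_arc("a", "t0"); g.add_arc("b", "t0")
g.add_arc("t0", "out"); g.add_arc("c", "out")
```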

In the next step, a layering [31] is performed, in which the data-flow graph is converted to a special bipartite graph. In this bipartite graph, every vertex is assigned to a layer, and each arc points directly to the next layer.

Figure 4.5: A simple data-flow graph and its layered version.

Definition 3 A layering L = {l1, l2, ...} is a partition of the vertex set V of a directed graph G(V, E) such that

∀e ∈ E : S(e) ⊂ li and T(e) ⊂ li+1.

A layering can be generated via a breadth-first search in linear time [31] by splitting the arcs that span more than one layer with extra delay vertices. An example layering is shown in Figure 4.5. The layering of the graph is an artificial restriction on the placement and the partitioning of the graph: vertices can only be moved horizontally during the placement, and the representation of the clusters also depends on the structure of the layers. Evidently, this restriction can exclude the optimal solution from the search space; however, it significantly decreases the representation costs. Unfortunately, the complexity of the original problem requires restrictions of this kind, otherwise the problem cannot be handled.
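The layering step above can be sketched as follows, assuming the graph is a DAG given as an arc list. The function name `layer` and the `delayN` naming of the inserted vertices are illustrative, not taken from the thesis.

```python
# Layering sketch: each vertex gets a level equal to its longest distance from
# the global inputs, and any arc spanning more than one layer is split with
# extra delay vertices, so every resulting arc points to the next layer.
from collections import defaultdict

def layer(arcs):
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for s, t in arcs:
        succ[s].append(t)
        indeg[t] += 1
        nodes.update((s, t))
    # longest-path layering via a topological sweep from the inputs
    level = {v: 0 for v in nodes if indeg[v] == 0}
    frontier = list(level)
    while frontier:
        v = frontier.pop()
        for w in succ[v]:
            level[w] = max(level.get(w, 0), level[v] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                frontier.append(w)
    # split arcs spanning more than one layer with delay vertices
    out, fresh = [], 0
    for s, t in arcs:
        cur = s
        for l in range(level[s] + 1, level[t]):
            d = f"delay{fresh}"
            fresh += 1
            level[d] = l
            out.append((cur, d))
            cur = d
        out.append((cur, t))
    return level, out

# "a"->"y" spans two layers, so one delay vertex is inserted on it
lv, arcs2 = layer([("a", "x"), ("a", "y"), ("x", "y")])
```

After the split, every arc connects adjacent layers, which is exactly the bipartite property required by Definition 3.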

Fortunately, layering has several other benefits besides the simplification. First of all, if the vertices had the same pipeline lengths, the horizontal cutting would guarantee that the incoming arcs of a cluster have the same pipeline level. Although the pipeline lengths differ in practice, horizontal cutting combined with cut-arc minimization produces acceptable results and does not increase the overall pipeline length drastically (see Table 4.3). The second benefit of the layering is that a simple and sufficient criterion can be formulated to check the existence of deadlocks during the simulated annealing (see Lemma 2).

Finally, in the physical implementation, extra delay vertices are implemented as shift registers (extra vertices inside one cluster are joined), which hold the data for the proper number of clock cycles. From the aspect of performance, this is advantageous because shorter interconnections help to meet the timing requirements.

4.4.1.2 Swap-based horizontal placement

Vertices get horizontal coordinates randomly; then the number of edge crossings is minimized to create an initial solution. The minimal edge-crossing objective does not guarantee a good placement, but it was found to be a good initial solution for my vertex-swapping iterative algorithm.

The barycentre heuristic [57] is a fast and simple algorithm to minimize edge crossings in layered directed graphs; however, various other min-crossing algorithms can be applied for this purpose [58]. The minimization of edge crossings is NP-complete, even if there are only two layers [59]. The method operates layer by layer: in every iteration, one layer of the graph is fixed and the vertices of the next layer are arranged.

The horizontal coordinate (xA) of each vertex (A) is chosen to be the barycentre of its neighborhood from the fixed layer:

xA = (1 / |NA|) Σ_{v ∈ NA} xv,

where NA denotes the set of vertices connected to vertex A from the fixed layer, and xv denotes the horizontal coordinate of a vertex v.
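One sweep of the barycentre heuristic can be sketched as below. This is a simplified illustration under assumed data structures (a coordinate dictionary `x` and a fixed-layer neighbour map); the re-integerization step at the end is one common way to keep coordinates integral, not necessarily the exact method used in the thesis.

```python
# Barycentre sweep sketch: with the previous layer fixed, each vertex of the
# current layer is moved to the average (barycentre) of the horizontal
# coordinates of its fixed-layer neighbours.
def barycentre_sweep(current_layer, x, neighbours):
    """current_layer: vertices to arrange; x: vertex -> horizontal coordinate;
    neighbours: vertex -> list of its neighbours in the fixed layer."""
    for v in current_layer:
        ns = neighbours.get(v, [])
        if ns:
            x[v] = sum(x[u] for u in ns) / len(ns)
    # reassign integer positions by sorting on the fractional barycentres
    order = sorted(current_layer, key=lambda v: x[v])
    for i, v in enumerate(order):
        x[v] = i
    return x

# Fixed layer: a, b, c at positions 0, 1, 2; current layer: p, q.
x = {"a": 0, "b": 1, "c": 2, "p": 0, "q": 1}
x = barycentre_sweep(["p", "q"], x, {"p": ["b", "c"], "q": ["a"]})
# p's barycentre (1.5) exceeds q's (0), so their order is swapped
```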

For horizontal placement, an adapted KL algorithm is used to minimize the distance between the connected vertices. The objective function is defined as the sum of the distances between the connected vertices. The distance between two vertices is determined according to their horizontal coordinates:

distance(A, B) := (xA − xB)^2 if A and B are connected,
                  0 otherwise,                              (4.13)

where xA and xB are the horizontal coordinates of vertices A and B, respectively. Both horizontal and vertical coordinates are integer numbers. The physical size of the floating-point units is not considered in this representation and is set to one. Vertical coordinates can be neglected, as the graph is layered.

In the adapted version of the KL algorithm, the objective function has been replaced by Equation 4.13. During the iterations, instead of swapping vertices between partition classes, the positions of neighboring vertices are swapped. Similarly to the original algorithm, it can escape some of the local minima, but not all of them.

As the method is very sensitive to the initial placement of the vertices, the previously described barycentre heuristic is used to initialize the placement.
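The swap-based refinement can be sketched as follows. This is a minimal, assumed implementation of the idea (greedy first-improvement swaps of horizontally adjacent vertices under the objective of Equation 4.13); the real KL-style variant also accepts temporarily worsening moves, which is omitted here for brevity.

```python
# Swap-based placement refinement sketch: repeatedly swap the positions of
# horizontally adjacent vertices in a layer whenever the swap decreases the
# total squared-distance objective of Equation 4.13.
def total_cost(x, arcs):
    return sum((x[a] - x[b]) ** 2 for a, b in arcs)

def swap_refine(layers, x, arcs):
    improved = True
    while improved:
        improved = False
        for layer in layers:
            row = sorted(layer, key=lambda v: x[v])
            for i in range(len(row) - 1):
                u, v = row[i], row[i + 1]
                before = total_cost(x, arcs)
                x[u], x[v] = x[v], x[u]       # trial swap of positions
                if total_cost(x, arcs) < before:
                    improved = True           # keep the improving swap
                else:
                    x[u], x[v] = x[v], x[u]   # undo the swap

# Two layers with crossed connections: swapping a and b aligns the arcs.
layers = [["a", "b"], ["p", "q"]]
arcs = [("a", "q"), ("b", "p")]
x = {"a": 0, "b": 1, "p": 0, "q": 1}
swap_refine(layers, x, arcs)
```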

The second step of the procedure is a greedy clustering method, which creates rectangular clusters based on the spatial information of the vertices. The height of the rectangular domains can be chosen arbitrarily; however, in the demonstrated CFD example it was set to two. The clustering starts from the top left corner, and the largest possible rectangular cluster is created that still meets the I/O constraint (defined in Problem 1). Next, the algorithm moves right, and the rectangle-based clustering continues on the unclustered vertices. If there are no more unclustered vertices in the selected layers, the algorithm moves down and continues with the lower layers.

In spite of the greedy nature of the clustering method, the decisions are based on comprehensive information, as the vertices have already been positioned such that short interconnections enforce locally coupled clusters during the greedy clustering. The resulting partitioning is shown in Figure 4.6.
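The greedy rectangular clustering can be sketched as below. The grid layout, the `io_count` boundary metric, and the `max_io` limit are illustrative assumptions standing in for the I/O constraint of Problem 1.

```python
# Greedy rectangular clustering sketch: scan the placed grid of vertices from
# the top left, growing each rectangle to the right while the cluster still
# meets the I/O limit; then move right, and finally down to the next band.
def io_count(cluster, arcs):
    """Number of arcs crossing the cluster boundary (its I/O demand)."""
    c = set(cluster)
    return sum((a in c) != (b in c) for a, b in arcs)

def greedy_clusters(grid, arcs, max_io, height=2):
    """grid[row][col] holds a vertex name or None; rows are layers."""
    clusters = []
    for top in range(0, len(grid), height):
        band = grid[top:top + height]          # 'height' layers at a time
        width = max(len(r) for r in band)
        cur = []
        for col in range(width):
            column = [r[col] for r in band if col < len(r) and r[col]]
            if cur and io_count(cur + column, arcs) > max_io:
                clusters.append(cur)           # close the rectangle, move right
                cur = []
            cur += column
        if cur:
            clusters.append(cur)
    return clusters

# Two layers of four vertices, all feeding a common sink "z" below:
grid = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
arcs = [("a", "e"), ("b", "f"), ("c", "g"), ("d", "h"),
        ("e", "z"), ("f", "z"), ("g", "z"), ("h", "z")]
clusters = greedy_clusters(grid, arcs, max_io=2)
```

Because the vertices have already been placed with short interconnections, neighbouring columns tend to be strongly connected, so these axis-aligned rectangles capture locally coupled clusters.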