

Chapter 6 - Parallel programs

6.1. Elementary parallel algorithms

In the following, we examine a few fundamental problems and the algorithms that solve them, with particular attention to the possibilities of parallelization.

The first problem is searching.

Problem (Searching):

In: n, A[n], x (x is called the key)

Out: an index i with A[i] = x if such an index exists; 0 otherwise.

This problem has a number of different solutions. The simplest one is to read through the whole sequence of numbers until we find the key. In the worst case we need as many comparison operations as there are members in the set.

Algorithm 1.

1. i ← n

2. while (i ≥ 1) and (x ≠ A[i]) do
3.   i ← i – 1
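As a concrete illustration, here is a minimal C sketch of Algorithm 1; the function name linear_search, the 0-based indexing and the test data are our own choices, not part of the original pseudocode.

#include <stdio.h>

/* Linear search: scan the array until the key x is found.
   Returns a 0-based index of x, or -1 if x is not present
   (the pseudocode above is 1-based and returns 0 for "not found"). */
int linear_search(const int a[], int n, int x) {
    for (int i = n - 1; i >= 0; --i)      /* scan from the end, as Algorithm 1 does */
        if (a[i] == x)
            return i;
    return -1;
}

int main(void) {
    int a[] = {7, 3, 9, 4, 1};
    printf("%d\n", linear_search(a, 5, 9));   /* prints 2 */
    return 0;
}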

Statement:

Let t(n) be the time complexity of Algorithm 1. Then t(n) ∈ O(n).

To give meaning to the statement, we have to state exactly what we mean by time complexity in this case. There are several definitions, and in fact the statement would be true for any of them that is expressive enough. The most general one, in which we count every executed command, makes the calculation more difficult in most cases. This is why it is practical to single out the most important commands: those which use the most expensive resources, or those from which the number of all other executed commands can be estimated. In the case of the present algorithm (and of those related to this problem) the time complexity will be understood as the number of comparisons. However, the complexity of the comparison itself remains hidden.

The proof of the statement is easy and requires only basic programming experience, so we leave it to the reader.

A more interesting solution to the problem arises when we suppose that the incoming data is stored in an ordered structure.

In this case, theoretically it is sufficient to carry out about log(n) comparisons. The following example is the so-called binary search algorithm.

In: n, A[n], x, where A[j] ≤ A[j + 1] for all 1 ≤ j < n.

Out: an index i with A[i] = x if such an index exists; 0 otherwise.

Algorithm 2.

1. i ← 1
2. j ← n
3. while i < j do
4.   k ← ⌊(i + j)/2⌋
5.   if A[k] = x then
6.     return k
7.   else if A[k] < x then
8.     i ← k + 1
9.   else
10.    j ← k – 1
11.  endif; endif
12. endwhile
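A minimal C sketch of the binary search above, assuming 0-based indexing; the variable names mirror the i, j, k of the pseudocode, and the not-found case is reported as -1 instead of 0.

#include <stdio.h>

/* Binary search in a sorted array; at most about 2*log2(n) comparisons. */
int binary_search(const int a[], int n, int x) {
    int i = 0, j = n - 1;
    while (i < j) {
        int k = (i + j) / 2;              /* middle index */
        if (a[k] == x)
            return k;
        else if (a[k] < x)
            i = k + 1;                    /* the key can only be in the upper half */
        else
            j = k - 1;                    /* the key can only be in the lower half */
    }
    return (i == j && a[i] == x) ? i : -1;
}

int main(void) {
    int a[] = {1, 2, 3, 4, 5, 6, 7, 9};
    printf("%d\n", binary_search(a, 8, 7));   /* prints 6 */
    return 0;
}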

Similarly to the previous algorithm, we define the time complexity as the number of comparisons.

Statement: Let t(n) be the time complexity of Algorithm 2. Then t(n) ∈ O(log(n)).

Proof:

There are comparisons only in lines 5 and 7 of the algorithm, thus we make at most two comparisons during each execution of the loop body in lines 3-12.

The loop body is executed only while i < j. Since i and j are integers, this means that 1 ≤ j – i. (*) Assume that during the l-th execution of the loop body i = i_l and j = j_l for all 0 < l; then we find that

j_{l+1} – i_{l+1} ≤ (j_l – i_l)/2.

Since j_1 – i_1 = n – 1, after the l-th iteration

j_{l+1} – i_{l+1} ≤ (n – 1)/2^l

holds. By (*)

1 ≤ (n – 1)/2^l

holds. Rearranging the inequality and taking the logarithm of both sides we get the following relation:

2^l ≤ n – 1 < n, hence l < log(n).

This means exactly that the loop body can be executed at most log(n) times. Altogether we make no more than 2·log(n) comparisons, so the time complexity is O(log(n)). √
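For reference, the chain of estimates used in the proof, written out in formula form (i_l and j_l are the values of i and j during the l-th execution of the loop body):

\[
  j_{l+1} - i_{l+1} \le \frac{j_l - i_l}{2}
  \quad\text{and}\quad
  j_1 - i_1 = n - 1
  \quad\Longrightarrow\quad
  j_{l+1} - i_{l+1} \le \frac{n-1}{2^{l}},
\]
\[
  1 \le j_{l+1} - i_{l+1}
  \quad\Longrightarrow\quad
  2^{l} \le n - 1 < n
  \quad\Longrightarrow\quad
  l < \log_2 n .
\]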

During the examination of the algorithm we face the issue that it executed a sequence of operations without reading the whole input. On the other hand, we have to notice that the whole input is immediately available to the algorithm. These two observations seem problematic at first sight if we try to model the algorithm with the traditional Turing machine: the difficulty is that the speed of information transmission can become arbitrarily large. (Transmission of information: the algorithm receives the input and delivers the result somewhere.) However, this is not of great importance, since we basically do not solve problems with Turing machines but with machines of a particular architecture. The solution: we do not provide the input (and output) in the traditional way; it is simply available (e.g., it is the result of a previous computation residing in memory). In most cases this models the problem to be solved quite well, but we have to notice that the abilities of an algorithm in practice (~ a program) are limited by the bounds on the resources. (Basically a computer is a finite automaton.) Accepting these kinds of limiting factors, it makes no sense to examine time complexity questions from a purely theoretical point of view, since we could give a finite number of answers to a finite number of inputs in just one step. Obviously, we can only get practically relevant results if we introduce further limiting factors into the examination.

Returning to the original problem, we can also give a parallel algorithm to solve it. This compares the key with all elements of the array at the same time, and gives a positive answer if any of the parallel comparisons succeeds. As we have seen in the chapter on Super Turing machines, theoretically line 8 can be executed in the time of one comparison. This implies the following:

Statement: Let T(n) be the total time complexity and t(n) be the absolute time complexity of Algorithm 3.

Then T(n) ∈ O(n) and t(n) ∈ O(1).

If we define the time complexity by the number of comparisons (as before), then the proof of the statement is easy. It also refines the interpretation: in general, the command in line 9 - depending on the model (or architecture) - cannot necessarily be executed in constant time.

There is the question, though, whether Algorithm 3 is an algorithm in the traditional sense, since basically we need a different architecture for each task. The problem can only be solved in the theoretically determined time if a suitable number of processors with the necessary architecture is available. This practical consideration is the reason for the introduction of the limited complexity concepts, as the size of the host system generally cannot be changed. FPGA architectures, which are more and more common in computations, offer new possibilities: with them we can change not only the software but also the hardware to fit the problem.
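As an illustration of the idea (not a reproduction of Algorithm 3 itself), the following C sketch uses an OpenMP loop to compare the key with every element "at the same time"; how many comparisons really happen in parallel depends on the number of available cores. The function name and test data are our own choices.

#include <stdio.h>

/* Parallel search: every element is compared with the key x "simultaneously".
   With p processors the n comparisons take about n/p rounds; an architecture
   with n comparators could do it in a single step. Compile with -fopenmp to
   run the loop in parallel; without it the pragmas are simply ignored. */
int parallel_search(const int a[], int n, int x) {
    int found = -1;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        if (a[i] == x) {
            #pragma omp critical
            found = i;                    /* any matching index is acceptable */
        }
    }
    return found;
}

int main(void) {
    int a[] = {7, 3, 9, 4, 1, 9};
    printf("%d\n", parallel_search(a, 6, 9));   /* prints 2 or 5 */
    return 0;
}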

The second basic problem is sorting. We define it in the following way:

Problem (Sorting): Let A be a set with an ordering relation ≤ and let T[1 … n] be an array whose entries are from the set A. The task is to rearrange the entries of T so that T[j] ≤ T[j + 1] holds for all 1 ≤ j < n. Sorting is one of the few problems where we can prove an exact lower bound for the time complexity.

Theorem: Let R be a general sorting algorithm with time complexity t(n). Then n ⋅ log(n) ∈ O(t(n)).

This means that a general sorting algorithm cannot run faster than c ⋅ n ⋅ log(n) for some constant c > 0.
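The theorem is quoted here without proof; a standard way to see the bound (a sketch only, not taken from this text) is the counting argument: a general sorting algorithm must distinguish all n! possible input orders, and t(n) yes/no comparisons can distinguish at most 2^t(n) of them, hence

\[
  2^{t(n)} \ge n!
  \;\Longrightarrow\;
  t(n) \ge \log_2(n!) \ge \frac{n}{2}\log_2\frac{n}{2}
  \;\Longrightarrow\;
  n \log n \in O(t(n)).
\]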

Several known algorithms attain this theoretical bound, so, up to a constant factor, it is a tight lower bound on the time complexity.

One of these algorithms is the algorithm of merge sort.

6.1.1. Merge sort

This method is based on the principle of divide and conquer.

ÖFR(n, A)

1. if n > 1 then
2.   ÖFR(n/2, A[1 … n/2]); ÖFR(n/2, A[n/2 + 1 … n])
3.   A ← ÖF(n/2, A[1 … n/2], A[n/2 + 1 … n])

The algorithm of merging:

ÖF(n, A, B) // A and B are sorted

1. i ← 1
   j ← 1
   k ← 1
2. while (i ≤ n) or (j ≤ n) do
3.   if (j > n) then
4.     C[k] ← A[i]
       i ← i + 1
5.   else if (i > n) then
6.     C[k] ← B[j]
       j ← j + 1
7.   else if (A[i] < B[j]) then
8.     C[k] ← A[i]
       i ← i + 1
9.   else
10.    C[k] ← B[j]
       j ← j + 1
11.  endif; endif; endif
12.  k ← k + 1
13. endwhile

We assume that the two input arrays have the same number of elements, but the algorithm works similarly in the case of different cardinalities.

Example: Let’s sort the array A = [5,2,4,6,1,3,2,6].
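A compact, sequential C sketch of the ÖFR/ÖF pair, using 0-based indexing and a temporary buffer (the function names and the buffer handling are our own choices); it sorts the array of the example above.

#include <stdio.h>

/* Merge the sorted halves a[lo..mid-1] and a[mid..hi-1] into tmp, then copy back. */
static void merge(int a[], int tmp[], int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid || j < hi) {
        if (j >= hi)           tmp[k++] = a[i++];   /* right half exhausted */
        else if (i >= mid)     tmp[k++] = a[j++];   /* left half exhausted  */
        else if (a[i] < a[j])  tmp[k++] = a[i++];
        else                   tmp[k++] = a[j++];
    }
    for (k = lo; k < hi; ++k) a[k] = tmp[k];
}

/* Recursive merge sort on a[lo..hi-1]. */
static void merge_sort(int a[], int tmp[], int lo, int hi) {
    if (hi - lo < 2) return;                        /* one element is already sorted */
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);
    merge_sort(a, tmp, mid, hi);
    merge(a, tmp, lo, mid, hi);
}

int main(void) {
    int a[] = {5, 2, 4, 6, 1, 3, 2, 6};             /* the example array */
    int tmp[8];
    merge_sort(a, tmp, 0, 8);
    for (int i = 0; i < 8; ++i) printf("%d ", a[i]); /* 1 2 2 3 4 5 6 6 */
    printf("\n");
    return 0;
}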

By analyzing the merging algorithm we can see that it is a typically sequential one, and parallelization cannot make it much faster. (A minimal acceleration is possible if, using pipelining, we start comparing the next elements while the data movement is still being executed.)

Real growth can only be achieved by the reconsideration of merging.

6.1.2. Batcher’s even-odd sort

This procedure makes the bottleneck of the algorithm, the merging, more efficient.

A big advantage is that its steps can be parallelized.

Let A[1 … n] and B[1 … n] be two sorted arrays with the same number of entries. For simplicity, assume that n = 2^k for some positive integer k.

The algorithm is the following: from each original array we create two new arrays, the sequences of the even and the odd indexed elements. Merging these arrays provides sorted arrays with very good properties.

BÖF(n, A, B) // A and B are sorted; A1, B1 denote the odd, A2, B2 the even indexed entries

1. if n > 1 then
2.   C1 ← BÖF(n/2, A1, B2)
3.   C2 ← BÖF(n/2, A2, B1)
4.   pfor i ← 1 … n do
5.     C[2i – 1] ← min{C1[i], C2[i]}
       C[2i] ← max{C1[i], C2[i]}
6.   endpfor
7. else
8.   C[1] ← min{A[1], B[1]}
     C[2] ← max{A[1], B[1]}
9. endif

The result of the merging is in C. Although it does not affect the time complexity measured in the number of comparisons, it is worth considering how to reduce the data movement operations. The corresponding computer program can achieve a significant speedup if we apply an in-place sort. Of course, in architectures where data movement is simple wiring (e.g., FPGA), this is not an issue.
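A sequential C sketch of the BÖF scheme above (0-based arrays; the fixed-size buffers and the assumption that n is a power of two are our own simplifications). The combining loop at the end is exactly the step that the pfor of line 4 would execute in parallel.

#include <stdio.h>

#define MAXN 64   /* maximal supported array length for the fixed buffers */

/* Batcher-style merge following the BOF scheme: a[] and b[] are sorted arrays
   of length n (a power of two); the merged result of length 2n goes into c[].
   a1/b1 hold the odd indexed, a2/b2 the even indexed entries (1-based sense). */
static void bof(int n, const int a[], const int b[], int c[]) {
    if (n == 1) {                                 /* base case: lines 7-8 */
        c[0] = a[0] < b[0] ? a[0] : b[0];
        c[1] = a[0] < b[0] ? b[0] : a[0];
        return;
    }
    int a1[MAXN], a2[MAXN], b1[MAXN], b2[MAXN], c1[MAXN], c2[MAXN];
    for (int i = 0; i < n / 2; ++i) {
        a1[i] = a[2 * i];  a2[i] = a[2 * i + 1];
        b1[i] = b[2 * i];  b2[i] = b[2 * i + 1];
    }
    bof(n / 2, a1, b2, c1);                       /* C1 = merge of A1 and B2 */
    bof(n / 2, a2, b1, c2);                       /* C2 = merge of A2 and B1 */
    for (int i = 0; i < n; ++i) {                 /* the pfor of line 4 */
        c[2 * i]     = c1[i] < c2[i] ? c1[i] : c2[i];
        c[2 * i + 1] = c1[i] < c2[i] ? c2[i] : c1[i];
    }
}

int main(void) {
    int a[] = {1, 5, 6, 8}, b[] = {2, 3, 4, 7}, c[8];
    bof(4, a, b, c);
    for (int i = 0; i < 8; ++i) printf("%d ", c[i]);  /* 1 2 3 4 5 6 7 8 */
    printf("\n");
    return 0;
}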

Theorem:

Batcher’s merging is correct, i.e., it creates a sorted array from the entries of the two original arrays.

Proof. For simplicity, we assume that all elements of the arrays are different. The statement is true in the more general case as well, but the proof would be slightly more complicated. It is clear that the algorithm only rearranges elements, thus a value can neither appear nor disappear. Thus the correctness of the merging depends on the observation that the instructions executed in step 4 move the elements of the array to the correct place.

The entries C[2i – 1] and C[2i] are in the correct place relative to each other; we only have to prove that there cannot be a bigger entry before them, nor a smaller one after them.

Before the pair C[2i – 1] and C[2i] there can only be elements that were entries of the arrays C1 and C2 with indices j < i. It is clear that C1[j] < C1[i] and C2[j] < C2[i] for all 0 < j < i, so we only have to show that C2[j] < C1[i] and C1[j] < C2[i] for all 0 < j < i.

Since C1 was created by merging the sorted arrays A1 and B2, some k of the entries of C1[1, …, i – 1] are from A1 and i – k – 1 from B2. More precisely, C1[1, …, i – 1] is the merge of A1[1, …, k] and B2[1, …, i – k – 1].

Assume that C1[i] was originally in A. Since A1 contains the odd and A2 the even indexed entries of A, A2 contains exactly k entries which are less than C1[i] and consequently n/2 – k which are greater. Furthermore, since B2 consists of the even indexed entries of B, B1 has i – k – 1 entries which are less and n/2 – (i – k) entries which are greater than C1[i]; however, the relation of the (i – k)-th entry of B1 to C1[i] is not known. This yields that in C2 there exist k + (i – k – 1) = i – 1 entries which are less and n – i entries which are greater than C1[i]. Only for C2[i] can it not be determined in advance how it is related to C1[i]; the algorithm performs exactly this comparison in step 5.

By similar arguments, we get the same result if C1[i] was originally not in A but in B.

6.1.3. Full comparison matrix sort

For the sorting we will create an n × n processor matrix.

The processor at each intersection compares the two elements corresponding to its row and column index. If in this comparison the entry corresponding to the row index is bigger, it sends 1 to the adder collecting data from its column, otherwise it sends 0. At the bottom of the column the sum of these values appears. If we assume that the members of the array to be sorted are pairwise different, then this sum determines the position where the member corresponding to the column has to be placed. The algorithm has time complexity O(log(n)), while it uses O(n^2) processors.
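A sequential C sketch of this idea (our own illustration): the two nested loops below enumerate the n^2 comparisons that the processor matrix would perform simultaneously, and the column sum of "bigger" results determines where each element has to be placed, assuming pairwise different entries.

#include <stdio.h>

/* Full comparison matrix ("enumeration") sort.  Conceptually, processor (i, j)
   compares a[i] with a[j] and sends 1 to the adder of column j if a[i] > a[j];
   the column sum then gives the position of a[j], counted from the end.
   Assumes pairwise different entries; the result is written into b[]. */
void matrix_sort(const int a[], int b[], int n) {
    for (int j = 0; j < n; ++j) {
        int bigger = 0;                       /* column sum of column j */
        for (int i = 0; i < n; ++i)
            if (a[i] > a[j])
                ++bigger;
        b[n - 1 - bigger] = a[j];             /* position n - bigger in 1-based terms */
    }
}

int main(void) {
    int a[] = {5, 2, 8, 1, 9, 3}, b[6];
    matrix_sort(a, b, 6);
    for (int i = 0; i < 6; ++i) printf("%d ", b[i]);  /* 1 2 3 5 8 9 */
    printf("\n");
    return 0;
}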
