We developed PGlobal to combine the advantages of the Global algorithm with those of multithreaded execution. The existing algorithmic components had to be modified for parallel execution: in contrast to the sequential Global, PGlobal does not have a fixed order of execution, and the components must be prepared for this.

We also have to organize the data flow between the computation units and distribute the tasks effectively among the different threads.

While the improved Global algorithm of GlobalJ takes advantage of sequential execution as much as possible, PGlobal runs a less efficient algorithm but uses much more computation capacity. This is necessary because the efficient sequential operation of the GlobalJ algorithm comes from its strict order of execution. Fortunately, the increased computational performance compensates well for the loss of algorithmic efficiency.

PGlobal combines the advantages of functional decomposition, data decomposition, and the pipeline architecture. It implements a priority queue that distributes the tasks among the workers; functional and data decomposition are realized because multiple logical units and multiple instances of the same logical unit run simultaneously. From the data point of view, the algorithm is sequential and works like a pipeline, as each data packet goes through a fixed sequence of operations.

Figure 4.4 illustrates how the workers choose tasks for themselves. Workers always choose the data packet that has already gone through the most operations. In this example, the data has to be processed by the stages A, B, and C of the pipeline, in that order. Stage B may only work on the two data packets that have already gone through stage A, while stage C may only work on data packets that have already been processed by stage B. The workers operate simultaneously and, under ideal circumstances, process all four data packets three times faster than a single-threaded algorithm would.

Fig. 4.4 The combination of functional decomposition, data decomposition, and the pipeline architecture
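To make this scheduling rule concrete, the following minimal Java sketch shows one possible realization of such a priority-based task queue, where workers always take the packet that has completed the most pipeline stages. The names (Packet, PipelineScheduler, Stage) and the stage counter are illustrative assumptions, not the actual PGlobal classes.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// A data packet annotated with the number of pipeline stages it has completed.
class Packet {
    final double[] payload;
    int completedStage;                       // 0 = freshly generated
    Packet(double[] payload) { this.payload = payload; }
}

class PipelineScheduler {
    // Packets that have completed more stages are handed out first.
    private final PriorityBlockingQueue<Packet> queue = new PriorityBlockingQueue<>(
            11, Comparator.comparingInt((Packet p) -> p.completedStage).reversed());

    void submit(Packet p) { queue.put(p); }

    // Worker loop: take the most advanced packet, apply its next stage, requeue it.
    void workerLoop(Stage[] stages) throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            Packet p = queue.take();
            stages[p.completedStage].process(p);            // run stage A, B, or C
            p.completedStage++;
            if (p.completedStage < stages.length) {
                queue.put(p);                               // hand back for the next stage
            }
        }
    }

    interface Stage { void process(Packet p); }
}
```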

We implemented PGlobal based on the principles presented earlier in this chapter. As already discussed, the algorithm can be separated into several different tasks; the main operations are sample generation, clustering, and the local searches. We study the available options for parallel computation in Global from the perspective of the data that the different logical units use and produce, considering their relations and sequential dependencies as well.

It is easy to see that sample generation is an independent functionality, since it depends only on one static resource, the objective function. Therefore no interference with any other logical component of the algorithm is possible. From the data point of view, sample generation precedes clustering in the pipeline.
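This independence is easy to illustrate: a sampler worker only reads the bounds and evaluates the objective function before passing the sample downstream, so any number of instances can run side by side. The sketch below is a hypothetical illustration under these assumptions, not the PGlobal sampler itself.

```java
import java.util.Random;
import java.util.concurrent.BlockingQueue;
import java.util.function.ToDoubleFunction;

// Hypothetical sampler worker: its only shared resource is the (read-only)
// objective function, so several instances can run without interference.
class SamplerWorker implements Runnable {
    private final ToDoubleFunction<double[]> objective;
    private final double[] lower, upper;                  // search box bounds
    private final BlockingQueue<double[]> toClusterizer;  // next pipeline stage
    private final Random rng = new Random();

    SamplerWorker(ToDoubleFunction<double[]> objective, double[] lower, double[] upper,
                  BlockingQueue<double[]> toClusterizer) {
        this.objective = objective;
        this.lower = lower;
        this.upper = upper;
        this.toClusterizer = toClusterizer;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                double[] x = new double[lower.length];
                for (int i = 0; i < x.length; i++) {
                    x[i] = lower[i] + rng.nextDouble() * (upper[i] - lower[i]);
                }
                double f = objective.applyAsDouble(x);    // the only shared resource
                double[] sample = new double[x.length + 1];
                System.arraycopy(x, 0, sample, 0, x.length);
                sample[x.length] = f;                     // store the function value with the point
                toClusterizer.put(sample);                // pass the evaluated sample downstream
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```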

The clustering and the local search module cooperate closely in the improved algorithm. They frequently work together on the unclustered samples and on the output of the resulting local searches. Scheduling the alternating repetition of these two operation types is easy in the case of sequential execution: the immediate feedback from local searches at the appropriate moments provides the most complete information for clustering at all times. The improved Global uses this strategy (Figure 4.5).

Fig. 4.5 The control flow of the improved algorithm used in GlobalJ
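The alternation described above can be summarized by the following minimal sketch of one iteration of the sequential control flow. The interfaces and method names (Clusterizer, LocalSearch, nextUnclustered, and so on) are illustrative assumptions rather than the actual GlobalJ API.

```java
import java.util.List;

// Sketch of the sequential clustering / local search alternation of the improved Global.
class ImprovedGlobalIteration {
    void iterate(Clusterizer clusterizer, LocalSearch localSearch, List<double[]> newSamples) {
        clusterizer.cluster(newSamples);
        // Alternate clustering and local search: every found optimum is fed back
        // immediately, so each subsequent clustering step sees the newest information.
        double[] start;
        while ((start = clusterizer.nextUnclustered()) != null) {
            double[] optimum = localSearch.run(start);
            clusterizer.addLocalOptimumAndRecluster(optimum);
        }
    }

    interface Clusterizer {
        void cluster(List<double[]> samples);
        double[] nextUnclustered();                      // null once every sample is clustered
        void addLocalOptimumAndRecluster(double[] opt);  // may absorb further waiting samples
    }

    interface LocalSearch { double[] run(double[] from); }
}
```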

If we have to handle large sample sets, clustering requires considerable resources, which can only be provided by additional threads; therefore the parallelization of the clustering module is inevitable.

The local searches during clustering could be implemented in PGlobal by using a single thread for the searches while the others wait. However, this part of the algorithm cannot remain sequential in the multithreaded case. Clustering becomes a quite resource-heavy operation when many sample points have to be handled, which we must address by running the clustering operation on several threads. With a single thread, a local search can block the execution even in the case of simple objective functions, or make it run for an unacceptably long period in the case of complex objective functions. We cannot keep the improved clustering strategy and run this part of the algorithm efficiently on multiple threads at the same time; we have to choose. Fortunately, with a slight modification of the improved algorithm, we are able to process a significant portion of the data in parallel while sacrificing only a little algorithmic efficiency.

We modified the algorithm to eliminate the circular dependency between the clusterizer and the local search method. We need to keep all data flow paths, since the whole algorithm has to be executed on all the data, but after analyzing the problem, we can realize that not all paths have to exist continuously or simultaneously. We do not change the algorithm's behavior if the data path going from the clusterizer to the local search method is implemented as a temporary one. The data can wait at an execution point without data loss; therefore we can create an almost equivalent version of the improved algorithm of GlobalJ. The resulting algorithm can easily be made multithreaded using the separated modules. As long as there are samples to cluster, the program works as expected, and it integrates the found local optima in the meantime. This mode of operation stops as soon as we run out of samples. This is the trigger point that activates the previously inactive, temporary data flow. To make this blocking operation as short-lived as possible, we only uphold the blocking connection while the unclustered samples are transferred from the clusterizer to the local search module. We reestablish the multithreaded operation right after this data transfer ends.

This modification results in an algorithm that moves samples in blocks into the local search module and clusters local optima immediately. This algorithm combines the improvements introduced in GlobalJ with multithreaded execution (Figure 4.6).

Fig. 4.6 Data paths between the algorithm components of PGlobal
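The following sketch illustrates this modified, block-wise data flow from the point of view of a single worker. The names (ClusterizerWorker, SharedClusterizer, drainUnclustered) and the polling timeout are hypothetical choices for the sake of the example, not the PGlobal source code.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// The worker keeps clustering while there is input; only when it runs out of
// samples does it open the temporary path and move the unclustered samples,
// as one block, to the local search module.
class ClusterizerWorker implements Runnable {
    private final BlockingQueue<double[]> samplesIn;      // from the sample generator
    private final BlockingQueue<double[]> toLocalSearch;  // the temporary, block-wise path
    private final SharedClusterizer clusterizer;          // thread-safe clusterizer state

    ClusterizerWorker(BlockingQueue<double[]> samplesIn,
                      BlockingQueue<double[]> toLocalSearch,
                      SharedClusterizer clusterizer) {
        this.samplesIn = samplesIn;
        this.toLocalSearch = toLocalSearch;
        this.clusterizer = clusterizer;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                double[] sample = samplesIn.poll(100, TimeUnit.MILLISECONDS);
                if (sample != null) {
                    clusterizer.insert(sample);           // normal multithreaded operation
                } else {
                    // Trigger point: no more samples to cluster. The blocking
                    // connection is held only while the unclustered samples are
                    // transferred as one block; then parallel clustering resumes.
                    for (double[] s : clusterizer.drainUnclustered()) {
                        toLocalSearch.put(s);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    interface SharedClusterizer {
        void insert(double[] sample);
        List<double[]> drainUnclustered();                // atomically removes and returns them
    }
}
```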
