Improved Greedy Algorithm to Look for Median Strings


Academic year: 2022


Ferenc Kruzslicz

The distance of a string from a set of strings is defined as the sum of its distances to each string of the given set. A string that is closest to the set is called the median of the set. Finding a median string is an NP-hard problem in general, so it is useful to develop fast algorithms that give a good approximation of the median string. These methods depend significantly on the type of distance used to measure the dissimilarity between strings. The algorithm presented here is based on the edit distance of strings and constructs the approximate median letter by letter.
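Stated in symbols, purely as a restatement of the definitions above: the names $d$ (the chosen string distance), $D$ (distance to a set) and $\Sigma$ (the alphabet) are notational assumptions introduced here, while $H$ follows the name used for the test set later in the text.

```latex
D(s, H) = \sum_{h \in H} d(s, h),
\qquad
\operatorname{median}(H) \in \arg\min_{s \in \Sigma^{*}} D(s, H)
```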

Introduction. If the optical character recognition (OCR) problem is treated as a "black box" process in which images are mapped to strings, then a kind of off-line approach can be used. In this way the efficiency of OCR processes can be increased in a manner that is independent of both the OCR software and the language. Suppose we have a set of strings obtained by running OCR on the same input bitmap. When the same OCR software is used to produce this set, with different paper orientation, changed resolution, or simply repeated OCR runs, we can eliminate the effects of pollution (fingerprints on the glass, etc.). When different OCR software is used, the efficiency of the programs can be compared.

Finding a median string, that is, a string whose sum of distances from a given input set of strings is minimal, is known to be an NP-hard problem. It is therefore of interest to find fast algorithms that give good approximations. One of the latest algorithms is called the greedy algorithm, because it builds up the approximate median string letter by letter, always choosing the best possible continuation. In this paper an improvement of this algorithm is described.
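The letter-by-letter construction can be illustrated with a minimal sketch. This is not the author's improved algorithm, nor necessarily the exact greedy algorithm referred to above: the selection rule (sum over all input strings of the smallest edit distance between the extended candidate and any prefix of that string), the stopping rule (try up to the length of the longest input and keep the best prefix seen), and the names `extend_row` and `greedy_median` are all assumptions made for illustration.

```python
def extend_row(prev_row, symbol, target):
    """One dynamic-programming step with unit costs.

    prev_row[j] is the edit distance between the current candidate and
    target[:j]; the result is the same row for candidate + symbol.
    """
    curr = [prev_row[0] + 1]
    for j, ch in enumerate(target, start=1):
        curr.append(min(
            prev_row[j] + 1,                  # symbol is unmatched in the candidate
            curr[j - 1] + 1,                  # ch is unmatched in the target
            prev_row[j - 1] + (symbol != ch)  # match or substitution
        ))
    return curr


def greedy_median(strings):
    """Build an approximate median one letter at a time (illustrative rule)."""
    alphabet = sorted(set("".join(strings)))
    # One DP row per input string: O(k * n) numbers in total.
    rows = [list(range(len(s) + 1)) for s in strings]
    median = ""
    best, best_cost = "", sum(len(s) for s in strings)
    for _ in range(max(map(len, strings), default=0)):
        choice = None
        for symbol in alphabet:
            new_rows = [extend_row(r, symbol, s) for r, s in zip(rows, strings)]
            # Assumed greedy score: best achievable match against any prefix.
            score = sum(min(r) for r in new_rows)
            if choice is None or score < choice[0]:
                choice = (score, symbol, new_rows)
        if choice is None:  # empty alphabet, nothing to append
            break
        _, symbol, rows = choice
        median += symbol
        cost = sum(r[-1] for r in rows)  # true sum of distances so far
        if cost < best_cost:
            best, best_cost = median, cost
    return best, best_cost


print(greedy_median(["abc", "abc", "abd"]))
```

In this sketch each of at most k steps tries every alphabet symbol against every input string, and each trial updates one row of length at most k + 1, so the loop structure performs on the order of k · |alphabet| · n · k elementary operations while storing one row per string.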

The real advantage of the improved algorithm appears when the probability of edit operations in the garbling process is increased. In other words, the improved algorithm works better if the strings in the input set are far from each other. In the example, the string recognition was garbled with delete, insert and substitute string-edit operations. Only the letters r, e, c, g, t, i, o, n, s, p, a were used for substitution and insertion, and both the choice of operation and its position were uniformly distributed.
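A garbling process of this kind might be simulated as in the sketch below. The exact procedure is not given in the text, so the model here is an assumption: every position is edited independently with probability `p`, the operation is drawn uniformly from delete, insert and substitute, and inserted or substituted letters are drawn uniformly from the eleven letters listed above. The names `ALPHABET` and `garble` are hypothetical.

```python
import random

# Letters named in the text for substitution and insertion.
ALPHABET = "recgtionspa"


def garble(word, p, rng):
    """Return a garbled copy of word.

    Assumed model: each position is hit with probability p, and the
    operation applied there is chosen uniformly at random.
    """
    out = []
    for ch in word:
        if rng.random() < p:
            op = rng.choice(("delete", "insert", "substitute"))
            if op == "delete":
                continue                          # drop the letter
            if op == "insert":
                out.append(rng.choice(ALPHABET))  # new letter before ch
                out.append(ch)
            else:
                out.append(rng.choice(ALPHABET))  # replace ch
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)  # fixed seed only to make runs repeatable
sample = [garble("recognition", 0.4, rng) for _ in range(10)]
print(sample)
```

Raising `p` produces input sets whose members are farther from each other, which is the regime the paragraph above describes as favourable to the improved algorithm.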

For example, let us consider the following test set H = { ggroeonitin, rpcsogngapaoponc, secsgttin, gecciicn, eectgcgnitiopn, repsogniporpassn, raatnini, rnrcpnto, nirnscogtipntgo, nrectogansinageine }, where the cost of every edit operation is 1.

The greedy algorithm gives the result recrognitin, with a sum of distances of 75, and the improved algorithm finds the string rectognitin, with a sum of distances of 74, while the median of H is recognition, for which the sum of distances is 73.
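The sums of distances quoted above can be recomputed with a short script. The sketch below assumes the standard Levenshtein distance with unit costs, as stated for H; the function names `edit_distance` and `sum_of_distances` are hypothetical, and the strings are copied from the text as printed. No expected output is shown, since the figures 75, 74 and 73 are the source's and depend on the strings being transcribed exactly.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb)  # match or substitute
            ))
        prev = curr
    return prev[-1]


def sum_of_distances(candidate, strings):
    """Distance of a candidate from a set: the sum of its edit distances."""
    return sum(edit_distance(candidate, s) for s in strings)


H = [
    "ggroeonitin", "rpcsogngapaoponc", "secsgttin", "gecciicn",
    "eectgcgnitiopn", "repsogniporpassn", "raatnini", "rnrcpnto",
    "nirnscogtipntgo", "nrectogansinageine",
]

for candidate in ("recrognitin", "rectognitin", "recognition"):
    print(candidate, sum_of_distances(candidate, H))
```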

Conclusions. The improved approximate median algorithm is a simple refinement of the greedy algorithm. It has the same time complexity, $O(k^2 n |\Sigma|)$, as the previous one (where $|\Sigma|$ is the size of the alphabet, $n$ is the number of input strings, and $k$ is the length of the longest one). The space complexity is also slightly reduced, because the new algorithm runs in only $O(kn)$ space. When the garbled strings are closer to each other, the improvement is less significant. The improved greedy algorithm is therefore more suitable for finding approximate medians of highly dissimilar strings.
