
Information complexity


Fix an alphabet Σ. Let Σ0 = Σ \ {∗}. It will be convenient to identify Σ0 with the set {0, 1, . . . , m − 1}. Consider a 2-tape, universal Turing machine T over Σ. We say that the word (program) q over Σ0 prints the word x if, writing q on the second tape (the program tape) of T and leaving the first tape empty, the machine stops in finitely many steps with the word x on its first tape (the data tape).

Let us note right away that every word is printable on T. Indeed, there is a one-tape (perhaps large, but rather trivial) Turing machine Sx that, when started with the empty tape, writes the word x onto it and halts. This Turing machine can be simulated by a program qx that, in this way, prints x.

By the information complexity (also called Kolmogorov complexity) of a word x ∈ Σ0∗ we mean the length of the shortest word (program) that makes T print the word x. We denote the complexity of the word x by KT(x).

We can also consider the program printing x as a “code” of the word x, where the Turing machine T performs the decoding. This kind of code will be called a Kolmogorov code. For the time being, we make no assumptions about how much time this decoding (or encoding, i.e., finding the appropriate program) can take.

We would like the complexity to be a characteristic property of the word x, depending on the machine T as little as possible. Unfortunately, it is easy to make a Turing machine that is obviously “clumsy”: for example, one that uses only every second letter of each program and “skips” the intermediate letters. Such a machine can still be universal, but every word will be twice as complex on it as on a machine without this strange behavior.

We show that if we impose some rather simple conditions on the machine T, then it will no longer be essential which universal Turing machine is used for the definition of information complexity. Roughly speaking, it is enough to assume that every input of a computation performable on T can also be submitted as part of the program. To make this more exact, we assume that there is a word (say, DATA) for which the following holds:

(a) Every one-tape Turing machine can be simulated by a program that does not contain the word DATA as a subword;

(b) If T is started so that its program tape contains a word of the form xDATAy, where the word x does not contain the subword DATA, then the machine halts if and only if it halts when started with y written on the data tape and x on the program tape, and in that case with the same output on the data tape.

It is easy to see that every universal Turing machine can be modified to satisfy the assumptions (a) and (b). In what follows, we will always assume that our universal Turing machine has these properties.
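
As an illustration of condition (b), the following minimal sketch (ours; the word on the tape and the splitting routine are only illustrative, not part of the machine's definition) splits a program-tape word at the first occurrence of DATA, treating the prefix as the program and the suffix as the initial data:

program SplitDemo;
{ Illustrates condition (b): split the program-tape word at the FIRST
  occurrence of DATA; the prefix acts as the program, the suffix as the
  initial content of the data tape.  Purely illustrative. }
var
  tape, prog, data: string;
  p: integer;
begin
  tape := 'p0DATA10110';                     { sample program-tape word }
  p := Pos('DATA', tape);                    { first occurrence, as in (b) }
  if p > 0 then
  begin
    prog := Copy(tape, 1, p - 1);
    data := Copy(tape, p + 4, Length(tape)); { |DATA| = 4 }
    writeln('program part: ', prog);
    writeln('data part   : ', data);
  end
  else
    writeln('no DATA marker: the whole word is the program');
end.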

Lemma 6.1.1. There is a constant cT (depending only on T) such that KT(x) ≤ |x| + cT.

Proof. T is universal, therefore the (trivial) one-tape Turing machine that does nothing (stops immediately) can be simulated on it by a program p0 (not containing the word DATA). But then, for every word x over Σ0, the program p0DATAx will print the word x and stop. Thus the constant cT = |p0| + 4 satisfies the conditions (here 4 = |DATA|).
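
The length accounting in this proof can be checked mechanically; in the sketch below (ours) the one-letter p0 is only a stand-in for the real do-nothing simulator:

program UpperBound;
{ Lemma 6.1.1: |p0 DATA x| = |p0| + 4 + |x|, i.e., K_T(x) <= |x| + c_T
  with c_T = |p0| + 4.  The string p0 is a stand-in. }
var
  p0, x, prog: string;
begin
  p0 := 'q';
  x := '010011';
  prog := p0 + 'DATA' + x;
  writeln(prog, ' has length ', Length(prog), ' = |x| + ', Length(p0) + 4);
end.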

Remark. We had to be a little careful, since we did not want to restrict which symbols can occur in the word x. In the BASIC programming language, for example, the instruction PRINT "x" is not suitable for printing words x that contain the symbol ". We are interested in how concisely the word x can be coded in the given alphabet, and we therefore do not allow the extension of the alphabet.
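
To make the Remark concrete, here is a sketch (ours) of a generator that, given x, emits the text of a Pascal program printing x. Pascal escapes an apostrophe inside a string literal by doubling it, so the emitted program can be nearly 2|x| symbols long: good enough to show that every word is printable (the machines Sx above), but not for the |x| + cT bound of Lemma 6.1.1, which is exactly why the DATA mechanism is used instead.

program MakePrinter;
{ Emit a Pascal program that prints the word x.  Apostrophes are
  escaped by doubling, so the output may be almost twice as long as x. }
var
  x, lit: string;
  i: integer;
begin
  x := 'it''s a word';                   { sample input word }
  lit := '';
  for i := 1 to Length(x) do
    if x[i] = '''' then
      lit := lit + ''''''                { double each apostrophe }
    else
      lit := lit + x[i];
  writeln('begin write(''', lit, ''') end.');
end.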

We prove now the basic theorem showing that the complexity (under the above conditions) does not depend too much on the given machine.

Theorem 6.1.2 (Invariance Theorem). Let T and S be universal Turing machines satisfying conditions (a) and (b). Then there is a constant cTS such that for every word x we have |KT(x) − KS(x)| ≤ cTS.

Proof. We can simulate the two-tape Turing machine S by a one-tape Turing machine S0 in such a way that if on S a program q prints a word x, then writing q on the single tape of S0, it also stops in finitely many steps with x printed on its tape. Further, we can simulate the work of the Turing machine S0 on T by a program pS0 that does not contain the subword DATA.

Now let x be an arbitrary word from Σ0∗ and let qx be a shortest program printing x on S. Consider the program pS0DATAqx on T: this obviously prints x and has length only |qx| + |pS0| + 4. This shows KT(x) ≤ KS(x) + |pS0| + 4; the inequality in the other direction is obtained similarly.

On the basis of this theorem, we will not restrict generality if we consider T fixed and do not indicate the index T. So, K(x) is determined up to an additive constant.

Unfortunately, the following theorem shows that in general the optimal code cannot be found algorithmically.

Theorem 6.1.3. The function K(x) is not recursive.

Proof. The essence of the proof is a classical logical paradox, the so-called typewriter paradox. (This can be formulated simply as follows: let n be the smallest number that cannot be defined with fewer than 100 symbols. We have just defined n with fewer than 100 symbols!)

Assume, by way of contradiction, that K(x) is computable. Let c be a natural number to be chosen appropriately. Arrange the elements of Σ0∗ in increasing order, and let x(k) denote the k-th word according to this ordering.

Let x0 be the first word with K(x0) ≥ c (such a word exists, since only finitely many words have complexity less than c). Assuming that our language can be programmed in the programming language Pascal, let us consider the following simple program.

var k: integer;

function x(k: integer): integer;
  ...

function Kolm(k: integer): integer;
  ...

begin
  k := 0;
  while Kolm(k) < c do k := k + 1;
  write(x(k));
end.

(The dotted parts stand for subroutines computing the given functions. The first is easy and could be explicitly included. The second is hypothetical, based on the assumption that K(x) is computable.)

This program obviously prints x0. When determining its length, we must take into account the subroutines for the computation of the functions x(k) and Kolm(k) = K(x(k)); but their length is a constant (independent of c). The only part whose length grows with c is the constant c itself, written out as a numeral of about log c digits. Thus the total number of symbols is only log c + O(1). If we take c large enough, this program consists of fewer than c symbols and prints x0, which is a contradiction.

As a simple application of the theorem, we get a new proof for the undecidability of the halting problem. To this end, let's ask the following question:

Why is it not possible to compute K(x) as follows? Take all words y in increasing order and check whether T prints x when started with y on its program tape. Return the first y for which this happens; its length is K(x).

We know that something must be wrong here, since K(x) is not computable. The only trouble with this algorithm is that T may never halt on some y. If the halting problem were decidable, we could “weed out” in advance the programs on which T would work forever, and not even try these.

Thus we could compute K(x); since K(x) is not computable, the halting problem is not decidable.
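
The argument can be written out as a sketch (entirely hypothetical: Halts is exactly the oracle that cannot exist, and the placeholder bodies below serve only to make the sketch compile):

program KViaHaltingOracle;
{ IF the halting problem were decidable, K(x) would be computable.
  Halts, OutputOf and NextWord are placeholders: a genuine Halts
  cannot be implemented -- that is the content of the argument. }

function Halts(const y: string): boolean;
begin
  Halts := true;                    { placeholder; no true oracle exists }
end;

function OutputOf(const y: string): string;
begin
  OutputOf := y;                    { placeholder: pretend T copies its program }
end;

function NextWord(const y: string): string;
begin
  NextWord := y + '0';              { placeholder successor; the real order
                                      is by length, then lexicographic }
end;

function K(const x: string): integer;
var
  y: string;
begin
  y := '';
  { enumerate candidate programs in increasing order; with a genuine
    oracle we would skip the non-halting ones and never get stuck }
  while not (Halts(y) and (OutputOf(y) = x)) do
    y := NextWord(y);
  K := Length(y);                   { the first hit is a shortest program }
end;

begin
  writeln(K('0'));                  { with the placeholders this prints 1 }
end.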

Exercise 6.1.1. Show that we cannot compute the function K(x) even approximately, in the following sense: if f is a recursive function, then there is no algorithm that for every word x computes a natural number γ(x) such that

K(x) ≤ γ(x) ≤ f(K(x)).

Exercise 6.1.2. Show that there is no algorithm that for every given number n constructs a 0-1 sequence x of length n with K(x) > 2 log n.

Exercise 6.1.3. If f : Σ0∗ → Z+ is a recursive function such that f(x) ≤ K(x) for all strings x, then f is bounded.

In contrast to Theorem 6.1.3, we show that the complexity K(x) can be very well approximated on the average.

For this, we must first make it precise what we mean by “on the average”.

Assume that the input words come from some probability distribution; in other words, every word x ∈ Σ0∗ has a probability p(x). Thus

p(x) ≥ 0,   ∑_{x∈Σ0∗} p(x) = 1.

We assume that p(x) is computable, i.e., each p(x) is a rational number whose numerator and denominator are computable from x. A simple example of a computable probability distribution is p(xk) = 2^{−k}, where xk is the k-th word in increasing order; another is p(x) = (m + 1)^{−|x|−1}, where m is the alphabet size.
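
For the second distribution, for instance, the numerator is always 1 and the denominator is the integer (m + 1)^{|x|+1}, computable from x; a minimal sketch (ours, with m = 2 assumed for concreteness):

program ComputableDistribution;
{ p(x) = (m+1)^(-|x|-1): numerator 1, denominator (m+1)^(|x|+1). }
const
  m = 2;
var
  x: string;
  denom: int64;
  i: integer;
begin
  x := '0110';
  denom := 1;
  for i := 1 to Length(x) + 1 do
    denom := denom * (m + 1);            { (m+1)^(|x|+1) }
  writeln('p(', x, ') = 1/', denom);     { here: 1/243 }
end.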

Remark. There is a more general notion of a computable probability distribution that does not restrict probabilities to rational numbers; for example, {e^{−1}, 1 − e^{−1}} could also be considered a computable probability distribution. Without going into details, we remark that our theorems would also hold for this more general class.

Theorem 6.1.4. For every computable probability distribution there is an algorithm computing a Kolmogorov code f(x) for every word x such that the expectation of |f(x)| − K(x) is finite.

Proof. For simplicity of presentation, assume that p(x) > 0 for every word x.

Let x1, x2, . . . be an ordering of the words in Σ0∗ for which p(x1) ≥ p(x2) ≥ · · ·, where words with equal probability are, say, in increasing order (since each word has positive probability, for every x there are only a finite number of words with probability at least p(x), and hence this is indeed a single sequence).

Proposition 6.1.5. (a) Given a word x, the index i for which x = xi is computable.

(b) Given a natural number i, the word xi is computable.

Proof. (a) Let y1, y2, . . . be all words arranged in increasing order. Given a word x, it is easy to find the index j for which x = yj. Next, find the first k ≥ j for which

p(y1) + · · · + p(yk) > 1 − p(yj).    (6.1.1)

Since the left-hand side converges to 1 while the right-hand side is less than 1, this will occur sooner or later.

Clearly, each of the remaining words yk+1, yk+2, . . . has probability less than p(yj), since by (6.1.1) their total probability is less than p(yj). Hence, to determine the index of x = yj, it suffices to order the finite set {y1, . . . , yk} according to decreasing p, and find the index of yj among them.

(b) Given an index i, we can compute the indices of y1, y2, . . . using (a) and wait until i shows up.
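
Part (a) can be turned into a sketch for the sample distribution p(y) = (m + 1)^{−|y|−1} with m = 2 (ours; Double arithmetic stands in for the exact rational arithmetic the proof requires, and for this particular p the probability order happens to coincide with the increasing order, which makes the output easy to check):

program IndexOfWord;
{ Sketch of Proposition 6.1.5(a) for the sample distribution
  p(y) = (m+1)^(-|y|-1) with m = 2.  Double arithmetic stands in for
  the exact rational arithmetic that the proof actually requires. }
const
  m = 2;

function Prob(const y: string): double;      { p(y) depends only on |y| }
var
  i: integer;
  d: double;
begin
  d := 1.0;
  for i := 0 to Length(y) do
    d := d / (m + 1);
  Prob := d;
end;

function NextWord(y: string): string;        { successor in increasing order }
var
  i: integer;
begin
  i := Length(y);
  while (i > 0) and (y[i] = chr(ord('0') + m - 1)) do
  begin
    y[i] := '0';                             { roll a maximal digit over }
    i := i - 1;
  end;
  if i = 0 then
    NextWord := '0' + y                      { every digit rolled over: grow }
  else
  begin
    y[i] := chr(ord(y[i]) + 1);
    NextWord := y;
  end;
end;

function IndexOf(const x: string): integer;
var
  y: string;
  sum, px: double;
  j, k, l, idx: integer;
begin
  px := Prob(x);
  { find j with x = y_j in increasing order, summing p on the way }
  y := ''; j := 1; sum := Prob(y);
  while y <> x do
  begin
    y := NextWord(y); j := j + 1; sum := sum + Prob(y);
  end;
  { find the first k >= j with p(y_1) + ... + p(y_k) > 1 - p(y_j), cf. (6.1.1) }
  k := j;
  while sum <= 1.0 - px do
  begin
    y := NextWord(y); k := k + 1; sum := sum + Prob(y);
  end;
  { rank x among y_1..y_k by decreasing p; ties keep the increasing order.
    Same-length words give bitwise-equal doubles here, so = is safe. }
  idx := 0; y := '';
  for l := 1 to k do
  begin
    if l > 1 then y := NextWord(y);
    if (Prob(y) > px) or ((Prob(y) = px) and (l <= j)) then idx := idx + 1;
  end;
  IndexOf := idx;
end;

begin
  writeln(IndexOf('01'));    { prints 5: preceded by '', 0, 1, 00 }
end.

For example, IndexOf('01') returns 5: in decreasing-probability order, '01' is preceded by the empty word, 0, 1 and 00.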

Returning to the proof of the theorem, the program of the algorithm in the above proposition, together with the number i, provides a Kolmogorov code f(xi) for the word xi. We show that this code satisfies the requirements of the theorem. Obviously, |f(x)| ≥ K(x).

Furthermore, the expected value of |f(x)| − K(x) is

∑_{i=1}^∞ p(xi)(|f(xi)| − K(xi)).

We want to show that this sum is finite. Since its terms are non-negative, it suffices to show that its partial sums remain bounded, i.e., that

∑_{i=1}^N p(xi)(|f(xi)| − K(xi)) < C

for some C independent of N. We can express this sum as

∑_{i=1}^N p(xi)(|f(xi)| − log_m i) + ∑_{i=1}^N p(xi)(log_m i − K(xi)).    (6.1.2)

We claim that both sums are bounded. The difference |f(xi)| − log_m i is, up to rounding, just the length of the program computing xi without the length of the parameter i (which is written in base m, taking about log_m i symbols), and hence it is at most an absolute constant C. Thus the first sum in (6.1.2) is at most C.
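
In more detail (our accounting, assuming the code has the form pDATAw, where p is the fixed program of the proposition and w is the index i written in base m):

|f(xi)| = |p| + 4 + (number of base-m digits of i) ≤ |p| + 4 + (log_m i + 1),

so |f(xi)| − log_m i ≤ |p| + 5, an absolute constant C.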

To estimate the second sum in (6.1.2), we use the following simple but useful principle. Let a1 ≥ a2 ≥ · · · ≥ aN be a decreasing sequence of non-negative numbers, and let b1, . . . , bN be arbitrary non-negative numbers; if b'1 ≤ b'2 ≤ · · · ≤ b'N denotes the increasing reordering of the bi, then

∑_{i=1}^N ai bi ≥ ∑_{i=1}^N ai b'i,

since pairing the larger ai with the smaller bi can only decrease the sum. Apply this with ai = p(xi) and bi = K(xi), and let k1 ≤ k2 ≤ · · · ≤ kN be the increasing reordering of the complexities K(x1), . . . , K(xN) (we can't compute this ordering, but we don't have to compute it). Fewer than m^{t+1} programs have length at most t, so fewer than m^{t+1} words have complexity at most t, and hence ki > log_m i − 1. Then by the above principle,

∑_{i=1}^N p(xi)(log_m i − K(xi)) ≤ ∑_{i=1}^N p(xi)(log_m i − ki) ≤ ∑_{i=1}^N p(xi) ≤ 1,

so the second sum in (6.1.2) is bounded as well, which proves the theorem.

The Kolmogorov code, strictly taken, uses an extra symbol besides the alphabet Σ0: the machine recognizes the end of the program while reading the program tape by encountering the symbol “∗”. We can modify the concept in such a way that this is not possible: the head reading the program should not run beyond the program. We will call a word self-delimiting if, when it is written on the program tape of our two-tape universal Turing machine, the head of the program tape never moves beyond the word during the computation.
