
Kolmogorov complexity, entropy and coding


Let p = (p_1, p_2, . . .) be a discrete probability distribution, i.e., a non-negative (finite or infinite) sequence with Σ_i p_i = 1. Its entropy is the quantity

H(p) = Σ_i −p_i log p_i

(the term p_i log p_i is considered to be 0 if p_i = 0). Notice that in this sum, all terms are nonnegative, so H(p) ≥ 0; equality holds if and only if the value of some p_i is 1 and the value of the rest is 0. It is easy to see that for fixed alphabet size m, the probability distribution with maximum entropy is (1/m, . . . , 1/m), and its entropy is log m.

Entropy is a basic notion of information theory; we do not treat it in detail in these notes, we only point out its connection with Kolmogorov complexity. We have already met entropy for the case m = 2 in Lemma 6.3.3.

This lemma is easy to generalize to arbitrary alphabets as follows.

Lemma 6.4.1. Let x ∈ Σ_0^* with |x| = n and let p_h denote the relative frequency of the letter h in the word x. Let p = (p_h : h ∈ Σ_0). Then

K(x) ≤ n·H(p)/log m + O(m log n / log m).

Proof. Let us give a different proof using Theorem 6.2.4. Consider the probability distribution over the strings of length n in which each symbol h is chosen with probability p_h. The probabilities p_h are fractions with denominator n, hence their description needs at most O(m log n) bits, which is O(m log n / log m) symbols of our alphabet. The distribution over the strings is therefore an enumerable probability distribution P whose program has length O(m log n).

According to Theorem 6.2.4, we have

K(x) ≤ H(x) ≤ −log_m P(x) + O(m log n).

But −log_m P(x) is exactly n·H(p)/log m.
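As an illustration (not part of the original argument), the leading term of the bound is easy to compute. The following Python sketch, with function names of our own choosing, computes the empirical entropy H(p) of a string and the quantity n·H(p)/log m, ignoring the additive O(m log n / log m) term.

import math
from collections import Counter

def empirical_entropy(x):
    """H(p) in bits, where p is the vector of relative letter frequencies of x."""
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in Counter(x).values())

def main_term(x, m):
    """The leading term n * H(p) / log m of the bound in Lemma 6.4.1."""
    return len(x) * empirical_entropy(x) / math.log2(m)

x = "0001" * 250                 # n = 1000, one quarter of the letters are 1
print(empirical_entropy(x))      # about 0.811 bits per letter
print(main_term(x, 2))           # about 811 binary symbols for the main term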

Remark. We mention another interesting connection between entropy and complexity: the entropy of a computable probability distribution over all strings is close to the average complexity. This reformulation of Corollary 6.2.5 can be stated as

H(p) − Σ_x p(x) H(x) = O(1),

for any computable probability distribution p over the set Σ_0^*.

Let L ⊆ Σ_0^* be a recursive language and suppose that we want to find a short program, “code”, only for the words in L. For each word x in L, we are thus looking for a program f(x) ∈ {0,1}^* printing it. We call the function f : L → {0,1}^* a Kolmogorov code of L. The conciseness of the code is the function

η(n) = max{ |f(x)| : x ∈ L, |x| ≤ n }.

We can easily get a lower bound on the conciseness of any Kolmogorov code of any language. Let L_n denote the set of words of L of length at most n. Then obviously,

η(n) ≥ log |L_n|.

We call this estimate the information-theoretical lower bound.

This lower bound is sharp (to within an additive constant). We can code every word x in L simply by telling its serial number in the increasing ordering. If the word x of length n is the t-th element then this requires log t ≤ log |L_n| bits, plus a constant number of additional bits (the program for taking the elements of Σ_0^* in lexicographic order, checking their membership in L and printing the t-th one).

We arrive at more interesting questions if we stipulate that the code from the word and, conversely, the word from the code should be polynomially computable. In other words: we are looking for a language L and two polynomially computable functions

f : L → Σ_0^*,   g : Σ_0^* → Σ_0^*

with g ∘ f = id_L for which, for every x in L, the length |f(x)| of the code is “short” compared to |x|. Such a pair of functions is called a polynomial time code. (Instead of the polynomial time bound we could, of course, consider other complexity restrictions.)

We present some examples where a polynomial time code approaches the information-theoretical lower bound.

Example 6.4.1. In the proof of Lemma 6.3.3, for the coding of the 0-1 sequences of length n with exactly m 1’s, we used the simple coding in which the code of a sequence is the number giving its place in the lexicographic ordering. We will show that this coding is polynomial.

Let us view each 0-1 sequence as the obvious code of a subset of the n-element set {n−1, n−2, . . . , 0}. Each such set can be written as {a_1, . . . , a_m} with a_1 > a_2 > · · · > a_m ≥ 0, and then the place number of the corresponding sequence in the lexicographic ordering is

t = \binom{a_1}{m} + \binom{a_2}{m−1} + · · · + \binom{a_m}{1}.    (6.4.1)

So, given a_1, . . . , a_m, the value of t is easily computable in time polynomial in n. Conversely, if t < \binom{n}{m} is given then t is easy to write in the above form: first we find, using binary search, the greatest natural number a_1 with \binom{a_1}{m} ≤ t, then we continue in the same way with t − \binom{a_1}{m} in place of t and m−1 in place of m. The maximality of a_1 implies a_1 > a_2. It follows similarly that a_2 > a_3 > · · · > a_m ≥ 0 and that there is no “remainder” after m steps, i.e., that (6.4.1) holds. It can therefore be determined in polynomial time which subset is lexicographically the t-th.
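The ranking and unranking just described can be written out directly. The following Python sketch (illustrative, not from the notes) implements formula (6.4.1) and its inversion; for simplicity it replaces the binary search for the greatest a by a linear scan, which is still polynomial in n, and it assumes 0 ≤ t < \binom{n}{m}.

from math import comb

def rank(subset, m):
    """Formula (6.4.1): the place number t of the m-element subset
    {a_1 > a_2 > ... > a_m} of {0, ..., n-1}."""
    a = sorted(subset, reverse=True)
    return sum(comb(a[k], m - k) for k in range(m))

def unrank(t, m, n):
    """Inverse of rank: recover the subset from 0 <= t < comb(n, m)."""
    subset, hi = [], n - 1
    for j in range(m, 0, -1):            # j = m, m-1, ..., 1
        a = hi
        while comb(a, j) > t:            # greatest a with comb(a, j) <= t
            a -= 1
        subset.append(a)
        t -= comb(a, j)
        hi = a - 1
    return subset

s = [7, 4, 2]                            # a 3-element subset of {0, ..., 9}
t = rank(s, 3)                           # t = 43
print(t, unrank(t, 3, 10))               # 43 [7, 4, 2]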

Example 6.4.2. Consider trees, given by their adjacency matrices (but any other “reasonable” representation would also do). In such representations, the vertices of the tree have a given order, which we can also express by saying that the vertices of the tree are labeled by the numbers from 0 to n−1. We consider two trees equal if whenever the nodes i, j are connected in the first one they are also connected in the second one and vice versa (so, if we renumber the nodes of the tree then we may arrive at a different tree). Such trees are called labeled trees. Let us first see what the information-theoretical lower bound gives us, i.e., how many trees there are. The following classical result, called Cayley’s Theorem, applies here:

Theorem 6.4.2 (Cayley’s Theorem). The number of n-node labeled trees is n^{n−2}.

Consequently, by the information-theoretical lower bound, for any encoding of trees some n-node tree needs a code with length at least ⌈log(n^{n−2})⌉ = ⌈(n−2) log n⌉. But can this lower bound be achieved by a polynomial time computable code?

(a) Coding trees by their adjacency matrices takes n^2 bits. (It is easy to see that \binom{n}{2} bits are enough.)

(b) We fare better if we specify each tree by enumerating its edges. Then we must give a “name” to each vertex; since there are n vertices we can give to each one a 0-1 sequence of length ⌈log n⌉ as its name. We specify each edge by its two endnodes. In this way, the enumeration of the edges takes about 2(n−1) log n bits.

(c) We can save a factor of 2 in (b) if we distinguish a root in the tree, say the node 0, and specify the tree by the sequence (α(1), . . . , α(n−1)) in which α(i) is the first interior node on the path from node i to the root (the “father” of i). This is (n−1)⌈log n⌉ bits, which is already nearly optimal.

(d) There is, however, a procedure, the so-called Prüfer code, that sets up a bijection between the n-node labeled trees and the sequences of length n−2 of the numbers 0, . . . , n−1. (Thereby it also proves Cayley’s Theorem.) Each such sequence can be considered the expression of a natural number in the base n number system; in this way, we assign a “serial number” between 0 and n^{n−2}−1 to the n-node labeled trees. Expressing these serial numbers in the base two number system, we get a coding in which the code of each tree has length at most ⌈(n−2) log n⌉.

The Prüfer code can be considered as a refinement of procedure (c). The idea is that we order the edges [i, α(i)] not by the value of i but a little differently. Let us define the permutation (i_1, . . . , i_n) as follows: let i_1 be the smallest endnode (leaf) of the tree; if i_1, . . . , i_k are already defined then let i_{k+1} be the smallest endnode of the graph remaining after deleting the nodes i_1, . . . , i_k. (We do not consider the root 0 an endnode.) Let i_n = 0. With the i_k’s thus defined, consider the sequence (α(i_1), . . . , α(i_{n−1})). The last element of this is 0 (the “father” of the node i_{n−1} can only be i_n), so it is not interesting. We call the remaining sequence (α(i_1), . . . , α(i_{n−2})) the Prüfer code of the tree.
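For concreteness, here is a Python sketch of this encoding (the function name and the adjacency-list representation are our choices, not from the notes): the fathers α(i) are computed by rooting the tree at 0, and then the smallest remaining endnode is deleted n−2 times, recording its father each time.

import heapq

def prufer_code(adj):
    """Prüfer code of a labeled tree on the nodes 0, ..., n-1, rooted at 0.
    adj[v] is the list of neighbors of v."""
    n = len(adj)
    father = [None] * n                  # alpha(v): first node on the path to 0
    order, seen = [0], {0}
    for u in order:                      # breadth-first rooting at 0
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                father[v] = u
                order.append(v)
    children = [0] * n                   # children not yet deleted
    for v in range(1, n):
        children[father[v]] += 1
    leaves = [v for v in range(1, n) if children[v] == 0]
    heapq.heapify(leaves)                # the root 0 never counts as an endnode
    code = []
    for _ in range(n - 2):               # delete the smallest endnode, n-2 times
        v = heapq.heappop(leaves)
        code.append(father[v])
        children[father[v]] -= 1
        if father[v] != 0 and children[father[v]] == 0:
            heapq.heappush(leaves, father[v])
    return code

# The tree with edges 0-1, 1-2, 1-3, 0-4 has Prüfer code [1, 1, 0].
print(prufer_code([[1, 4], [0, 2, 3], [1], [1], [0]]))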

Claim 6.4.3. The Prüfer code of a tree determines the tree.

For this, it is enough to see that the Prüfer code determines the sequence i_1, . . . , i_n; then we know all the edges of the tree (the pairs [i, α(i)]).

The node i_1 is the smallest endnode of the tree; hence to determine i_1, it is enough to figure out the endnodes from the Prüfer code. But this is obvious: the endnodes are exactly those that are not the “fathers” of other nodes, i.e., the ones that do not occur among the numbers α(i_1), . . . , α(i_{n−2}), 0. The node i_1 is therefore uniquely determined.

Assume that we already know that the Prüfer code uniquely determines i_1, . . . , i_{k−1}. It follows similarly to the above that i_k is the smallest number occurring neither among i_1, . . . , i_{k−1} nor among α(i_k), . . . , α(i_{n−2}), 0. So i_k is also uniquely determined.

Claim 6.4.4. Every sequence (b_1, . . . , b_{n−2}), where 0 ≤ b_i ≤ n−1, occurs as the Prüfer code of some tree.

Using the idea of the proof above, let b_{n−1} = 0 and let us define the permutation i_1, . . . , i_n by the recursion that i_k is the smallest number occurring neither among i_1, . . . , i_{k−1} nor among b_k, . . . , b_{n−1} (1 ≤ k ≤ n−1); and let i_n = 0. Connect i_k with b_k for all 1 ≤ k ≤ n−1 and let γ(i_k) = b_k. In this way, we obtain a graph G with n−1 edges on the nodes 0, . . . , n−1. This graph is connected, since for every i the node γ(i) comes later in the sequence i_1, . . . , i_n than i, and therefore the sequence i, γ(i), γ(γ(i)), . . . is a path connecting i to the node 0. But then G is a connected graph with n−1 edges, therefore it is a tree. That the sequence (b_1, . . . , b_{n−2}) is the Prüfer code of G is obvious from the construction.
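The reconstruction in the proofs of Claims 6.4.3 and 6.4.4 translates directly into code. The following Python sketch (illustrative; the repeated search for the minimum could be sped up with a heap) recovers the edges of the tree from the sequence (b_1, . . . , b_{n−2}).

def tree_from_prufer(code):
    """Rebuild the labeled tree from its Prüfer code (b_1, ..., b_{n-2}),
    following the proofs of Claims 6.4.3 and 6.4.4.  Returns the edge list."""
    n = len(code) + 2
    b = list(code) + [0]                 # set b_{n-1} = 0
    remaining = [0] * n                  # occurrences of each value among b_k, ..., b_{n-1}
    for x in b:
        remaining[x] += 1
    used = [False] * n                   # numbers already chosen as some i_j
    edges = []
    for k in range(n - 1):
        # i_k: smallest number occurring neither among i_1, ..., i_{k-1}
        # nor among b_k, ..., b_{n-1}
        i_k = min(v for v in range(n) if not used[v] and remaining[v] == 0)
        used[i_k] = True
        edges.append((i_k, b[k]))        # connect i_k with b_k
        remaining[b[k]] -= 1
    return edges

print(tree_from_prufer([1, 1, 0]))       # [(2, 1), (3, 1), (1, 0), (4, 0)]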

Remark. An exact correspondence like the Prüfer code has other advantages besides optimal Kolmogorov coding. Suppose that our task is to write a program for a randomized Turing machine that outputs a random labeled tree of size n in such a way that all trees occur with the same probability.

The Prüfer code gives an efficient algorithm for this: we just have to generate a random sequence b_1, . . . , b_{n−2}, which is easy, and then decode from it the tree by the above algorithm.
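A minimal sketch of this sampler, assuming the function tree_from_prufer from the previous sketch:

import random

def random_labeled_tree(n):
    """Uniformly random labeled tree on {0, ..., n-1}: a uniformly random
    Prüfer code, decoded by tree_from_prufer above."""
    code = [random.randrange(n) for _ in range(n - 2)]
    return tree_from_prufer(code)

Since the Prüfer code is a bijection, a uniformly random code gives a uniformly random labeled tree.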

Example 6.4.3. Consider now unlabeled trees. These can be defined as the equivalence classes of labeled trees, where two labeled trees are considered equivalent if they are isomorphic, i.e., by a suitable relabeling, they become the same labeled tree. We assume that we represent each equivalence class by one of its elements, i.e., by a labeled tree (it does not matter now by which one). Since each labeled tree can be labeled in at most n! ways (its labelings are not necessarily all different as labeled trees!), the number of unlabeled trees is at least n^{n−2}/n! > 2^{n−2} (if n ≥ 25). The information-theoretical lower bound is therefore at least n−2. (According to a difficult result of George Pólya, the number of n-node unlabeled trees is asymptotically c_1 c_2^n n^{3/2}, where c_1 and c_2 are constants defined in a certain complicated way.)

On the other hand, we can use the following coding procedure. Consider an n-node tree F. Walk through F by the “depth-first search” rule: let x_0 be the node labeled 0 and define the nodes x_1, x_2, . . . as follows: if x_i has a neighbor that does not occur yet in the sequence then let x_{i+1} be the smallest one among these. If it does not have such a neighbor and x_i ≠ x_0 then let x_{i+1} be the neighbor of x_i on the path leading from x_i to x_0. Finally, if x_i = x_0 and every neighbor of x_0 has already occurred in the sequence then we stop.

It is easy to see that for the sequence thus defined, every edge occurs among the pairs [x_i, x_{i+1}]; moreover, it occurs once in each direction. It follows that the length of the sequence is exactly 2n−1. Now let ε_i = 1 if x_{i+1} is farther from the root than x_i, and ε_i = 0 otherwise. It is easy to understand that the sequence ε_0 ε_1 · · · ε_{2n−3} determines the tree uniquely; passing through the sequence, we can draw the graph and construct the sequence x_1, . . . , x_i of nodes step by step. In step i+1, if ε_i = 1 then we take a new node (this will be x_{i+1}) and connect it with x_i; if ε_i = 0 then let x_{i+1} be the neighbor of x_i in the “direction” of x_0.
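The walk and the sequence ε_0 · · · ε_{2n−3} can be produced, for instance, as follows (a Python sketch with our own naming; the recursion mirrors the rule “go to the smallest unvisited neighbor, otherwise step back toward the root”):

def tree_code(adj):
    """The 0-1 code eps_0 ... eps_{2n-3} of a tree: 1 for a step away from the
    root 0 (to a new node), 0 for a step back toward the root.
    adj[v] is the list of neighbors of v."""
    bits = []
    def walk(u, parent):
        for v in sorted(adj[u]):         # always go to the smallest new neighbor
            if v != parent:
                bits.append(1)           # x_{i+1} is farther from the root
                walk(v, u)
                bits.append(0)           # step back toward x_0
    walk(0, None)
    return bits

# The path 0-1-2 gives the code 1 1 0 0.
print(tree_code([[1], [0, 2], [1]]))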

Remarks. 1. With this coding, the code assigned to a tree depends on the labeling, but it does not determine the labeling (it only determines the unlabeled tree uniquely).

2. The coding is not bijective: not every 0-1 sequence will be the code of an unlabeled tree. We can notice that

(a) there are as many 1’s as 0’s in each code;

(b) in every starting segment of every code, there are at least as many 1’s as 0’s.

(The difference between the number of 1’s and the number of 0’s among the first i numbers gives the distance of the node x_i from the node 0.) It is easy to see that for each 0-1 sequence having the properties (a)–(b), there is a labeled tree whose code it is. It is not certain, however, that this tree, as an unlabeled tree, is given with just this labeling (this depends on which unlabeled trees are represented by which of their labelings). Therefore, the code does not even use all the words with properties (a)–(b).

3. The number of 0-1 sequences having properties (a)–(b) is, according to a well-known combinatorial theorem, \frac{1}{n}\binom{2n−2}{n−1} (the so-called Catalan number).

We can formulate a tree notion to which the sequences with properties (a)–(b) correspond exactly: these are the rooted planar trees, which are drawn without intersection into the plane in such a way that their distinguished vertex – their root – is on the left edge of the page. This drawing defines an ordering “from top to bottom” among the “sons” (neighbors farther from the root); the drawing is characterized by these orderings. The coding described above can also be carried out for rooted planar trees, and it creates a bijection between them and the sequences with properties (a)–(b).

Exercise 6.4.1. (a) Let x be a 0-1 sequence that does not contain 3 consecutive 0’s. Show that K(x) < 0.99|x| + O(1).

(b) Find the best constant in place of 0.99. [Hint: you have to find approximately the number of such sequences. Let A(n) and B(n) be the number of such sequences ending with 0 and 1, respectively. Find recurrence relations for A and B.]

(c) Give a polynomial time coding-decoding procedure for such sequences that compresses each of them by at least 1 percent.

Exercise 6.4.2. (a) Prove that for any two strings x, y ∈ Σ_0^*,

K(xy) ≤ 2K(x) + K(y) + c,

where c depends only on the universal Turing machine in the definition of information complexity.

(b) Prove that the stronger and more natural looking inequality

K(xy) ≤ K(x) + K(y) + c

is false.

Exercise 6.4.3. Suppose that the universal Turing machine used in the definition of K(x) uses programs written in a two-letter alphabet and outputs strings in an s-letter alphabet.

(a) Prove that K(x) ≤ |x| log s + O(1).

(b) Prove that, moreover, there are polynomial time functions f, g mapping strings x of length n to binary strings of length n log s + O(1) and vice versa, with g(f(x)) = x.

Exercise 6.4.4.

(a) Give an upper bound on the Kolmogorov complexity of Boolean functions of n variables.

(b) Give a lower bound on the complexity of the most complex Boolean function of n variables.

(c) Use the above result to find a number L(n) such that there is a Boolean function with n variables which needs a Boolean circuit of size at least L(n) to compute it.

Exercise 6.4.5. Call an infinite 0-1 sequence x (informatically) strongly random if n − H(x_n) is bounded from above. Prove that every informatically strongly random sequence is also weakly random.

Exercise 6.4.6. Prove that almost all infinite 0-1 sequences are strongly random.

Pseudorandom numbers

We have seen that various important algorithms use random numbers (or, equivalently, independent random bits). But how do we get such bits?

One possible source is from outside the computer. We could obtain “real” random sequences, say, from radioactive decay. In most cases, however, this would not work: our computers are very fast and we have no physical device giving the equivalent of unbiased coin-tosses at this rate.

Thus we have to resort to generating our random bits by the computer.

However, a long sequence generated by a short program is never random, according to the notion of randomness introduced in Chapter 6 using information complexity. Thus we are forced to use algorithms that generate random-looking sequences; but, as von Neumann (one of the first mathematicians to propose the use of these) put it, everybody using them is inevitably “in the state of sin”. In this chapter, we will understand the kind of protection we can get against the graver consequences of this sin.

There are other reasons besides practical ones to study pseudorandom number generators. We often want to repeat some computation for various reasons, including error checking. In this case, if our source of random numbers was really random, then the only way to use the same random numbers again is to store them, using a lot of space. With pseudorandom numbers, this is not the case: we only have to store the “seed”, which is much shorter.

Another, and more important, reason is that there are applications where what we want is only that the sequence should “look random” to somebody who does not know how it was generated. The collection of these applications, called cryptography, is treated in Chapter 12.

The way a pseudorandom bit generator works is that it turns a short random string called the “seed” into a longer pseudorandom string. We require that it works in polynomial time. The resulting string has to “look” random, and the important fact is that this can be defined exactly. Roughly speaking, there should be no polynomial time algorithm that distinguishes it from a truly random sequence. Another feature, often easier to verify, is that no algorithm can predict any of its bits from the previous bits. We prove the equivalence of these two conditions.

But how do we design such a generator? Various ad hoc methods that produce random-looking sequences (like taking the bits in the binary representation of a root of a given equation) turn out to produce strings that do not pass the strict criteria we impose. A general method to obtain such sequences is based on one-way functions: functions that are easy to evaluate but difficult to invert. While the existence of such functions is not proved (it would imply that P is different from NP), there are several candidates that are secure at least against current techniques.

7.1 Classical methods

There are several classical methods that generate a “random-looking” sequence of bits. None of these meets the strict standards to be formulated in the next section; but due to their simplicity and efficiency, they (especially linear congruential generators, Example 7.1.2 below) can be used well in practice. There is a large amount of practical information about the best choice of the parameters; we don’t go into this here, but refer to Volume 2 of Knuth’s book.

Example 7.1.1. Shift registers are defined as follows. Let f : {0,1}^n → {0,1} be a function that is easy to compute. Starting with a seed of n bits a_0, a_1, . . . , a_{n−1}, we compute the bits a_n, a_{n+1}, a_{n+2}, . . . recursively, by

a_k = f(a_{k−1}, a_{k−2}, . . . , a_{k−n}).

The name shift register comes from the fact that we only need to store n+1 bits: after storing f(a_0, . . . , a_{n−1}) in a_n, we don’t need a_0 any more, and we can shift a_1 to a_0, a_2 to a_1, etc. The most important special case is when f is a linear function over the 2-element field, and we’ll restrict ourselves to this case.

Looking at particular instances, the bits generated by a linear shift register look random, at least for a while. Of course, the sequence a_0, a_1, . . . will eventually have some n-tuple repeated, and then it will be periodic from then on; but this need not happen sooner than a_{2^n}, and indeed one can select the (linear) function f so that the period of the sequence is as large as 2^n.
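For illustration, a linear shift register is only a few lines of code. The sketch below (with our own naming, not from the notes) uses the indexing convention of the system of equations displayed in the next paragraph, i.e., a_k = b_0 a_{k−n} + b_1 a_{k−n+1} + · · · + b_{n−1} a_{k−1} over the 2-element field.

def shift_register(seed, b, count):
    """Linear shift register over the 2-element field:
    a_k = b_0 a_{k-n} + b_1 a_{k-n+1} + ... + b_{n-1} a_{k-1}  (mod 2).
    Yields a_0, a_1, ..., a_{count-1}."""
    state = list(seed)                   # the window (a_{k-n}, ..., a_{k-1})
    for _ in range(count):
        yield state[0]
        new = sum(x * y for x, y in zip(b, state)) % 2
        state = state[1:] + [new]        # shift: drop the oldest bit

# Seed 1000, recurrence a_k = a_{k-4} + a_{k-3}: the first bits are 1,0,0,0,1,0,0,1,...
print(list(shift_register([1, 0, 0, 0], [1, 1, 0, 0], 8)))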

The problem is that the sequence has more hidden structure than just periodicity. Indeed, let

f(x_0, . . . , x_{n−1}) = b_0 x_0 + b_1 x_1 + · · · + b_{n−1} x_{n−1}

(where b_i ∈ {0,1}). Assume that we do not know the coefficients b_0, . . . , b_{n−1}, but observe the first n bits a_n, . . . , a_{2n−1} of the output sequence. Then we have the following system of linear equations to determine the b_i:

b_0 a_0 + b_1 a_1 + · · · + b_{n−1} a_{n−1} = a_n
b_0 a_1 + b_1 a_2 + · · · + b_{n−1} a_n = a_{n+1}
    ...
b_0 a_{n−1} + b_1 a_n + · · · + b_{n−1} a_{2n−2} = a_{2n−1}

Here are n equations to determine the n unknowns (the equations are over the 2-element field). Once we have the b_i, we can predict all the remaining elements of the sequence a_{2n}, a_{2n+1}, . . .
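This attack is easy to carry out. The following Python sketch (illustrative) solves the system by Gaussian elimination over the 2-element field and recovers the coefficients b_i whenever the system is non-degenerate; together with the generator sketched earlier, it predicts the rest of the sequence.

def recover_coefficients(a, n):
    """Solve the system above for b_0, ..., b_{n-1} by Gaussian elimination
    over the 2-element field, given the bits a_0, ..., a_{2n-1}.
    Returns None if the equations are dependent."""
    rows = [a[k:k + n] + [a[k + n]] for k in range(n)]   # augmented matrix
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col]), None)
        if pivot is None:
            return None                  # the system does not determine the b_i
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col and rows[r][col]:
                rows[r] = [x ^ y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

# Observe 2n bits of the shift register from the earlier sketch and recover b.
a = list(shift_register([1, 0, 0, 0], [1, 1, 0, 0], 8))
print(recover_coefficients(a, 4))        # [1, 1, 0, 0]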

It may happen, of course, that this system is not uniquely solvable, because the equations are dependent. For example, we might start with the seed 00 . . . 0, in which case the equations are meaningless. But it can be shown that for a random choice of the seed, the equations determine the coefficients.

