
Kolmogorov complexity, entropy and coding


Let p = (p_1, p_2, . . .) be a discrete probability distribution, i.e., a non-negative (finite or infinite) sequence with Σ_i p_i = 1. Its entropy is the quantity

H(p) = Σ_i −p_i log p_i

(the term p_i log p_i is considered to be 0 if p_i = 0). Notice that in this sum, all terms are nonnegative, so H(p) ≥ 0; equality holds if and only if the value of some p_i is 1 and the value of the rest is 0. It is easy to see that for fixed alphabet size m, the probability distribution with maximum entropy is (1/m, . . . , 1/m), and its entropy is log m.

Entropy is a basic notion of information theory; we do not treat it in detail in these notes, we only point out its connection with Kolmogorov complexity. We have already met entropy for the case m = 2 in Lemma 6.3.3.

This lemma is easy to generalize to arbitrary alphabets as follows.

Lemma 6.4.1. Let x ∈ Σ_0^* with |x| = n and let p_h denote the relative frequency of the letter h in the word x. Let p = (p_h : h ∈ Σ_0). Then

K(x) ≤ n·H(p)/log m + O(m log n / log m).

Proof. Let us give a different proof using Theorem 6.2.4. Consider the probability distribution over the strings of length n in which each symbol h is chosen with probability p_h. The probabilities p_h are fractions with denominator n, hence their description needs at most O(m log n) bits, which is O(m log n / log m) symbols of our alphabet. The distribution over the strings is therefore an enumerable probability distribution P whose program has length O(m log n).

According to Theorem 6.2.4, we have

K(x) ≤ H(x) ≤ −log_m P(x) + O(m log n).

But −log_m P(x) is exactly n·H(p)/log m.
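As an illustration (not part of the original argument), the leading term of the bound is easy to compute. The following Python sketch, with function names of our own choosing, computes the empirical entropy H(p) of a string and the quantity n·H(p)/log m, ignoring the additive O(m log n / log m) term.

import math
from collections import Counter

def empirical_entropy(x):
    """H(p) in bits, where p is the vector of relative letter frequencies of x."""
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in Counter(x).values())

def main_term(x, m):
    """The leading term n * H(p) / log m of the bound in Lemma 6.4.1."""
    return len(x) * empirical_entropy(x) / math.log2(m)

x = "0001" * 250                 # n = 1000, one quarter of the letters are 1
print(empirical_entropy(x))      # about 0.811 bits per letter
print(main_term(x, 2))           # about 811 binary symbols for the main term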

Remark. We mention another interesting connection between entropy and complexity: the entropy of a computable probability distribution over all strings is close to the average complexity. This reformulation of Corollary 6.2.5 can be stated as

H(p) − Σ_x p(x) H(x) = O(1),

for any computable probability distribution p over the set Σ_0^*.

Let L ⊆ Σ_0^* be a recursive language and suppose that we want to find a short program, “code”, only for the words in L. For each word x in L, we are thus looking for a program f(x) ∈ {0,1}^* printing it. We call the function f : L → {0,1}^* a Kolmogorov code of L. The conciseness of the code is the function

η(n) = max{ |f(x)| : x ∈ L, |x| ≤ n }.

We can easily get a lower bound on the conciseness of any Kolmogorov code of any language. Let L_n denote the set of words of L of length at most n. Then obviously,

η(n) ≥ log |L_n|.

We call this estimate the information-theoretical lower bound.

This lower bound is sharp (to within an additive constant). We can code every word x in L simply by telling its serial number in the increasing ordering. If the word x of length n is the t-th element then this requires log t ≤ log |L_n| bits, plus a constant number of additional bits (the program for taking the elements of Σ_0^* in lexicographic order, checking their membership in L and printing the t-th one).

We arrive at more interesting questions if we stipulate that the code from the word and, conversely, the word from the code should be polynomially computable. In other words: we are looking for a language L and two polynomially computable functions

f : L → Σ_0^*,   g : Σ_0^* → Σ_0^*

with g ∘ f = id_L for which, for every x in L, the length |f(x)| of the code is “short” compared to |x|. Such a pair of functions is called a polynomial time code. (Instead of the polynomial time bound we could, of course, consider other complexity restrictions.)

We present some examples where a polynomial time code approaches the information-theoretical lower bound.

Example 6.4.1. In the proof of Lemma 6.3.3, for the coding of the 0-1 sequences of length n with exactly m 1’s, we used the simple coding in which the code of a sequence is the number giving its place in the lexicographic ordering. We will show that this coding is polynomial.

Let us view each 0-1 sequence as the obvious code of a subset of the n-element set {n−1, n−2, . . . , 0}. Each such set can be written as {a_1, . . . , a_m} with a_1 > a_2 > · · · > a_m ≥ 0, and then the place number of the corresponding sequence in the lexicographic ordering is

t = \binom{a_1}{m} + \binom{a_2}{m−1} + · · · + \binom{a_m}{1}.    (6.4.1)

So, given a_1, . . . , a_m, the value of t is easily computable in time polynomial in n. Conversely, if t < \binom{n}{m} is given then t is easy to write in the above form: first we find, using binary search, the greatest natural number a_1 with \binom{a_1}{m} ≤ t, then we continue in the same way with t − \binom{a_1}{m} in place of t and m−1 in place of m. The maximality of a_1 implies a_1 > a_2. It follows similarly that a_2 > a_3 > · · · > a_m ≥ 0 and that there is no “remainder” after m steps, i.e., that (6.4.1) holds. It can therefore be determined in polynomial time which subset is lexicographically the t-th.
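The ranking and unranking just described can be written out directly. The following Python sketch (illustrative, not from the notes) implements formula (6.4.1) and its inversion; for simplicity it replaces the binary search for the greatest a by a linear scan, which is still polynomial in n, and it assumes 0 ≤ t < \binom{n}{m}.

from math import comb

def rank(subset, m):
    """Formula (6.4.1): the place number t of the m-element subset
    {a_1 > a_2 > ... > a_m} of {0, ..., n-1}."""
    a = sorted(subset, reverse=True)
    return sum(comb(a[k], m - k) for k in range(m))

def unrank(t, m, n):
    """Inverse of rank: recover the subset from 0 <= t < comb(n, m)."""
    subset, hi = [], n - 1
    for j in range(m, 0, -1):            # j = m, m-1, ..., 1
        a = hi
        while comb(a, j) > t:            # greatest a with comb(a, j) <= t
            a -= 1
        subset.append(a)
        t -= comb(a, j)
        hi = a - 1
    return subset

s = [7, 4, 2]                            # a 3-element subset of {0, ..., 9}
t = rank(s, 3)                           # t = 43
print(t, unrank(t, 3, 10))               # 43 [7, 4, 2]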

Example 6.4.2. Consider trees, given by their adjacency matrices (but any other “reasonable” representation would also do). In such representations, the vertices of the tree have a given order, which we can also express by saying that the vertices of the tree are labeled by the numbers from 0 to n−1. We consider two trees equal if whenever the nodes i, j are connected in the first one they are also connected in the second one and vice versa (so, if we renumber the nodes of the tree then we may arrive at a different tree). Such trees are called labeled trees. Let us first see what the information-theoretical lower bound gives us, i.e., how many trees there are. The following classical result, called Cayley’s Theorem, applies here:

Theorem 6.4.2 (Cayley’s Theorem). The number of n-node labeled trees is n^{n−2}.

Consequently, by the information-theoretical lower bound, for any encoding of trees some n-node tree needs a code with length at least ⌈log(n^{n−2})⌉ = ⌈(n−2) log n⌉. But can this lower bound be achieved by a polynomial time computable code?

(a) Coding trees by their adjacency matrices takes n^2 bits. (It is easy to see that \binom{n}{2} bits are enough.)

(b) We fare better if we specify each tree by enumerating its edges. Then we must give a “name” to each vertex; since there are n vertices we can give to each one a 0-1 sequence of length ⌈log n⌉ as its name. We specify each edge by its two endnodes. In this way, the enumeration of the edges takes about 2(n−1) log n bits.

(c) We can save a factor of 2 in (b) if we distinguish a root in the tree, say the node 0, and specify the tree by the sequence (α(1), . . . , α(n−1)) in which α(i) is the first interior node on the path from node i to the root (the “father” of i). This is (n−1)⌈log n⌉ bits, which is already nearly optimal.

(d) There is, however, a procedure, the so-called Prüfer code, that sets up a bijection between the n-node labeled trees and the sequences of length n−2 of the numbers 0, . . . , n−1. (Thereby it also proves Cayley’s Theorem.) Each such sequence can be considered the expression of a natural number in the base n number system; in this way, we assign a “serial number” between 0 and n^{n−2}−1 to the n-node labeled trees. Expressing these serial numbers in the base two number system, we get a coding in which the code of each tree has length at most ⌈(n−2) log n⌉.

The Prüfer code can be considered as a refinement of procedure (c). The idea is that we order the edges [i, α(i)] not by the value of i but a little differently. Let us define the permutation (i_1, . . . , i_n) as follows: let i_1 be the smallest endnode (leaf) of the tree; if i_1, . . . , i_k are already defined then let i_{k+1} be the smallest endnode of the graph remaining after deleting the nodes i_1, . . . , i_k. (We do not consider the root 0 an endnode.) Let i_n = 0. With the i_k’s thus defined, consider the sequence (α(i_1), . . . , α(i_{n−1})). The last element of this is 0 (the “father” of the node i_{n−1} can only be i_n), so it is not interesting. We call the remaining sequence (α(i_1), . . . , α(i_{n−2})) the Prüfer code of the tree.
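For concreteness, here is a Python sketch of this encoding (the function name and the adjacency-list representation are our choices, not from the notes): the fathers α(i) are computed by rooting the tree at 0, and then the smallest remaining endnode is deleted n−2 times, recording its father each time.

import heapq

def prufer_code(adj):
    """Prüfer code of a labeled tree on the nodes 0, ..., n-1, rooted at 0.
    adj[v] is the list of neighbors of v."""
    n = len(adj)
    father = [None] * n                  # alpha(v): first node on the path to 0
    order, seen = [0], {0}
    for u in order:                      # breadth-first rooting at 0
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                father[v] = u
                order.append(v)
    children = [0] * n                   # children not yet deleted
    for v in range(1, n):
        children[father[v]] += 1
    leaves = [v for v in range(1, n) if children[v] == 0]
    heapq.heapify(leaves)                # the root 0 never counts as an endnode
    code = []
    for _ in range(n - 2):               # delete the smallest endnode, n-2 times
        v = heapq.heappop(leaves)
        code.append(father[v])
        children[father[v]] -= 1
        if father[v] != 0 and children[father[v]] == 0:
            heapq.heappush(leaves, father[v])
    return code

# The tree with edges 0-1, 1-2, 1-3, 0-4 has Prüfer code [1, 1, 0].
print(prufer_code([[1, 4], [0, 2, 3], [1], [1], [0]]))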

Claim 6.4.3. The Prüfer code of a tree determines the tree.

For this, it is enough to see that the Prüfer code determines the sequence i_1, . . . , i_n; then we know all the edges of the tree (the pairs [i, α(i)]).

The node i_1 is the smallest endnode of the tree; hence to determine i_1, it is enough to figure out the endnodes from the Prüfer code. But this is obvious: the endnodes are exactly those that are not the “fathers” of other nodes, i.e., the ones that do not occur among the numbers α(i_1), . . . , α(i_{n−2}), 0. The node i_1 is therefore uniquely determined.

Assume that we already know that the Prüfer code uniquely determines i_1, . . . , i_{k−1}. It follows similarly to the above that i_k is the smallest number occurring neither among i_1, . . . , i_{k−1} nor among α(i_k), . . . , α(i_{n−2}), 0. So i_k is also uniquely determined.

Claim 6.4.4. Every sequence (b_1, . . . , b_{n−2}), where 0 ≤ b_i ≤ n−1, occurs as the Prüfer code of some tree.

Using the idea of the proof above, let b_{n−1} = 0 and let us define the permutation i_1, . . . , i_n by the recursion that i_k is the smallest number occurring neither among i_1, . . . , i_{k−1} nor among b_k, . . . , b_{n−1} (1 ≤ k ≤ n−1); and let i_n = 0. Connect i_k with b_k for all 1 ≤ k ≤ n−1 and let γ(i_k) = b_k. In this way, we obtain a graph G with n−1 edges on the nodes 0, . . . , n−1. This graph is connected, since for every i the node γ(i) comes later in the sequence i_1, . . . , i_n than i, and therefore the sequence i, γ(i), γ(γ(i)), . . . is a path connecting i to the node 0. But then G is a connected graph with n−1 edges, therefore it is a tree. That the sequence (b_1, . . . , b_{n−2}) is the Prüfer code of G is obvious from the construction.
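The reconstruction in the proofs of Claims 6.4.3 and 6.4.4 translates directly into code. The following Python sketch (illustrative; the repeated search for the minimum could be sped up with a heap) recovers the edges of the tree from the sequence (b_1, . . . , b_{n−2}).

def tree_from_prufer(code):
    """Rebuild the labeled tree from its Prüfer code (b_1, ..., b_{n-2}),
    following the proofs of Claims 6.4.3 and 6.4.4.  Returns the edge list."""
    n = len(code) + 2
    b = list(code) + [0]                 # set b_{n-1} = 0
    remaining = [0] * n                  # occurrences of each value among b_k, ..., b_{n-1}
    for x in b:
        remaining[x] += 1
    used = [False] * n                   # numbers already chosen as some i_j
    edges = []
    for k in range(n - 1):
        # i_k: smallest number occurring neither among i_1, ..., i_{k-1}
        # nor among b_k, ..., b_{n-1}
        i_k = min(v for v in range(n) if not used[v] and remaining[v] == 0)
        used[i_k] = True
        edges.append((i_k, b[k]))        # connect i_k with b_k
        remaining[b[k]] -= 1
    return edges

print(tree_from_prufer([1, 1, 0]))       # [(2, 1), (3, 1), (1, 0), (4, 0)]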

Remark. An exact correspondence like the Prüfer code has other advantages besides optimal Kolmogorov coding. Suppose that our task is to write a program for a randomized Turing machine that outputs a random labeled tree of size n in such a way that all trees occur with the same probability.

The Prüfer code gives an efficient algorithm for this: we just have to generate a random sequence b_1, . . . , b_{n−2}, which is easy, and then decode from it the tree by the above algorithm.
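A minimal sketch of this sampler, assuming the function tree_from_prufer from the previous sketch:

import random

def random_labeled_tree(n):
    """Uniformly random labeled tree on {0, ..., n-1}: a uniformly random
    Prüfer code, decoded by tree_from_prufer above."""
    code = [random.randrange(n) for _ in range(n - 2)]
    return tree_from_prufer(code)

Since the Prüfer code is a bijection, a uniformly random code gives a uniformly random labeled tree.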

Example 6.4.3. Consider now unlabeled trees. These can be defined as the equivalence classes of labeled trees, where two labeled trees are considered equivalent if they are isomorphic, i.e., by a suitable relabeling, they become the same labeled tree. We assume that we represent each equivalence class by one of its elements, i.e., by a labeled tree (it does not matter now by which one). Since each labeled tree can be labeled in at most n! ways (its labelings are not necessarily all different as labeled trees!), the number of unlabeled trees is at least n^{n−2}/n! > 2^{n−2} (if n ≥ 25). The information-theoretical lower bound is therefore at least n−2. (According to a difficult result of George Pólya, the number of n-node unlabeled trees is asymptotically c_1 c_2^n n^{3/2}, where c_1 and c_2 are constants defined in a certain complicated way.)

On the other hand, we can use the following coding procedure. Consider an n-node tree F. Walk through F by the “depth-first search” rule: let x_0 be the node labeled 0 and define the nodes x_1, x_2, . . . as follows: if x_i has a neighbor that does not occur yet in the sequence then let x_{i+1} be the smallest one among these. If it does not have such a neighbor and x_i ≠ x_0 then let x_{i+1} be the neighbor of x_i on the path leading from x_i to x_0. Finally, if x_i = x_0 and every neighbor of x_0 has already occurred in the sequence then we stop.

It is easy to see that for the sequence thus defined, every edge occurs among the pairs [x_i, x_{i+1}]; moreover, it occurs once in each direction. It follows that the length of the sequence is exactly 2n−1. Now let ε_i = 1 if x_{i+1} is farther from the root than x_i, and ε_i = 0 otherwise. It is easy to understand that the sequence ε_0 ε_1 · · · ε_{2n−3} determines the tree uniquely; passing through the sequence, we can draw the graph and construct the sequence x_1, . . . , x_i of nodes step by step. In step i+1, if ε_i = 1 then we take a new node (this will be x_{i+1}) and connect it with x_i; if ε_i = 0 then let x_{i+1} be the neighbor of x_i in the “direction” of x_0.
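The walk and the sequence ε_0 · · · ε_{2n−3} can be produced, for instance, as follows (a Python sketch with our own naming; the recursion mirrors the rule “go to the smallest unvisited neighbor, otherwise step back toward the root”):

def tree_code(adj):
    """The 0-1 code eps_0 ... eps_{2n-3} of a tree: 1 for a step away from the
    root 0 (to a new node), 0 for a step back toward the root.
    adj[v] is the list of neighbors of v."""
    bits = []
    def walk(u, parent):
        for v in sorted(adj[u]):         # always go to the smallest new neighbor
            if v != parent:
                bits.append(1)           # x_{i+1} is farther from the root
                walk(v, u)
                bits.append(0)           # step back toward x_0
    walk(0, None)
    return bits

# The path 0-1-2 gives the code 1 1 0 0.
print(tree_code([[1], [0, 2], [1]]))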

Remarks. 1. With this coding, the code assigned to a tree depends on the labeling, but it does not determine the labeling (it only determines the unlabeled tree uniquely).

2. The coding is not bijective: not every 0-1 sequence will be the code of an unlabeled tree. We can notice that

(a) there are as many 1’s as 0’s in each code;

(b) in every starting segment of every code, there are at least as many 1’s as 0’s.

(The difference between the number of 1’s and the number of 0’s among the first i numbers gives the distance of the node x_i from the node 0.) It is easy to see that for each 0-1 sequence having the properties (a)–(b), there is a labeled tree whose code it is. It is not certain, however, that this tree, as an unlabeled tree, is given with just this labeling (this depends on which unlabeled trees are represented by which of their labelings). Therefore, the code does not even use all the words with properties (a)–(b).

3. The number of 0-1 sequences having properties (a)–(b) is, according to a well-known combinatorial theorem, \frac{1}{n}\binom{2n−2}{n−1} (the so-called Catalan number).

We can formulate a tree notion to which the sequences with properties (a)–(b) correspond exactly: these are the rooted planar trees, which are drawn without intersection into the plane in such a way that their distinguished vertex – their root – is on the left edge of the page. This drawing defines an ordering “from top to bottom” among the “sons” (neighbors farther from the root); the drawing is characterized by these orderings. The coding described above can also be carried out for rooted planar trees, and it creates a bijection between them and the sequences with properties (a)–(b).

Exercise 6.4.1. (a) Let x be a 0-1 sequence that does not contain 3 consecutive 0’s. Show that K(x) < 0.99|x| + O(1).

(b) Find the best constant in place of 0.99. [Hint: you have to find approximately the number of such sequences. Let A(n) and B(n) be the number of such sequences ending with 0 and 1, respectively. Find recurrence relations for A and B.]

(c) Give a polynomial time coding-decoding procedure for such sequences that compresses each of them by at least 1 percent.

Exercise 6.4.2. (a) Prove that for any two strings x, y ∈ Σ_0^*,

K(xy) ≤ 2K(x) + K(y) + c,

where c depends only on the universal Turing machine in the definition of information complexity.

(b) Prove that the stronger and more natural looking inequality

K(xy) ≤ K(x) + K(y) + c

is false.

Exercise 6.4.3. Suppose that the universal Turing machine used in the definition of K(x) uses programs written in a two-letter alphabet and outputs strings in an s-letter alphabet.

(a) Prove that K(x) ≤ |x| log s + O(1).

(b) Prove that, moreover, there are polynomial time functions f, g mapping strings x of length n to binary strings of length n log s + O(1) and vice versa, with g(f(x)) = x.

Exercise 6.4.4.

(a) Give an upper bound on the Kolmogorov complexity of Boolean functions of n variables.

(b) Give a lower bound on the complexity of the most complex Boolean function of n variables.

(c) Use the above result to find a number L(n) such that there is a Boolean function with n variables which needs a Boolean circuit of size at least L(n) to compute it.

Exercise 6.4.5. Call an infinite 0-1 sequence x (informatically) strongly random if n − H(x_n) is bounded from above. Prove that every informatically strongly random sequence is also weakly random.

Exercise 6.4.6. Prove that almost all infinite 0-1 sequences are strongly random.

Pseudorandom numbers

We have seen that various important algorithms use random numbers (or, equivalently, independent random bits). But how do we get such bits?

One possible source is from outside the computer. We could obtain “real” random sequences, say, from radioactive decay. In most cases, however, this would not work: our computers are very fast and we have no physical device giving the equivalent of unbiased coin-tosses at this rate.

Thus we have to resort to generating our random bits by the computer.

However, a long sequence generated by a short program is never random, according to the notion of randomness introduced in Chapter 6 using information complexity. Thus we are forced to use algorithms that generate random-looking sequences; but, as von Neumann (one of the first mathematicians to propose the use of these) put it, everybody using them is inevitably “in the state of sin”. In this chapter, we will understand the kind of protection we can get against the graver consequences of this sin.

There are other reasons besides practical ones to study pseudorandom number generators. We often want to repeat some computation for various reasons, including error checking. In this case, if our source of random numbers was really random, then the only way to use the same random numbers again is to store them, using a lot of space. With pseudorandom numbers, this is not the case: we only have to store the “seed”, which is much shorter.

Another, and more important, reason is that there are applications where what we want is only that the sequence should “look random” to somebody who does not know how it was generated. The collection of these applications, called cryptography, is treated in Chapter 12.

The way a pseudorandom bit generator works is that it turns a short random string called the “seed” into a longer pseudorandom string. We require that it works in polynomial time. The resulting string has to “look” random, and the important fact is that this can be defined exactly. Roughly speaking, there should be no polynomial time algorithm that distinguishes it from a truly random sequence. Another feature, often easier to verify, is that no algorithm can predict any of its bits from the previous bits. We prove the equivalence of these two conditions.

But how do we design such a generator? Various ad hoc methods that produce random-looking sequences (like taking the bits in the binary representation of a root of a given equation) turn out to produce strings that do not pass the strict criteria we impose. A general method to obtain such sequences is based on one-way functions: functions that are easy to evaluate but difficult to invert. While the existence of such functions is not proved (it would imply that P is different from NP), there are several candidates that are secure at least against current techniques.

7.1 Classical methods

There are several classical methods that generate a “random-looking” sequence of bits. None of these meets the strict standards to be formulated in the next section; but due to their simplicity and efficiency, they (especially linear congruential generators, Example 7.1.2 below) can be used well in practice. There is a large amount of practical information about the best choice of the parameters; we don’t go into this here, but refer to Volume 2 of Knuth’s book.

Example 7.1.1. Shift registers are defined as follows. Let f : {0,1}^n → {0,1} be a function that is easy to compute. Starting with a seed of n bits a_0, a_1, . . . , a_{n−1}, we compute the bits a_n, a_{n+1}, a_{n+2}, . . . recursively, by

a_k = f(a_{k−1}, a_{k−2}, . . . , a_{k−n}).

The name shift register comes from the fact that we only need to store n+1 bits: after storing f(a_0, . . . , a_{n−1}) in a_n, we don’t need a_0 any more, and we can shift a_1 to a_0, a_2 to a_1, etc. The most important special case is when f is a linear function over the 2-element field, and we’ll restrict ourselves to this case.

Looking at particular instances, the bits generated by a linear shift register look random, at least for a while. Of course, the sequence a_0, a_1, . . . will eventually have some n-tuple repeated, and then it will be periodic from then on; but this need not happen sooner than a_{2^n}, and indeed one can select the (linear) function f so that the period of the sequence is as large as 2^n.
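For illustration, a linear shift register is only a few lines of code. The sketch below (with our own naming, not from the notes) uses the indexing convention of the system of equations displayed in the next paragraph, i.e., a_k = b_0 a_{k−n} + b_1 a_{k−n+1} + · · · + b_{n−1} a_{k−1} over the 2-element field.

def shift_register(seed, b, count):
    """Linear shift register over the 2-element field:
    a_k = b_0 a_{k-n} + b_1 a_{k-n+1} + ... + b_{n-1} a_{k-1}  (mod 2).
    Yields a_0, a_1, ..., a_{count-1}."""
    state = list(seed)                   # the window (a_{k-n}, ..., a_{k-1})
    for _ in range(count):
        yield state[0]
        new = sum(x * y for x, y in zip(b, state)) % 2
        state = state[1:] + [new]        # shift: drop the oldest bit

# Seed 1000, recurrence a_k = a_{k-4} + a_{k-3}: the first bits are 1,0,0,0,1,0,0,1,...
print(list(shift_register([1, 0, 0, 0], [1, 1, 0, 0], 8)))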

The problem is that the sequence has more hidden structure than just periodicity. Indeed, let

f(x_0, . . . , x_{n−1}) = b_0 x_0 + b_1 x_1 + · · · + b_{n−1} x_{n−1}

(where b_i ∈ {0,1}). Assume that we do not know the coefficients b_0, . . . , b_{n−1}, but observe the first n bits a_n, . . . , a_{2n−1} of the output sequence. Then we have the following system of linear equations to determine the b_i:

b_0 a_0 + b_1 a_1 + · · · + b_{n−1} a_{n−1} = a_n
b_0 a_1 + b_1 a_2 + · · · + b_{n−1} a_n = a_{n+1}
    ...
b_0 a_{n−1} + b_1 a_n + · · · + b_{n−1} a_{2n−2} = a_{2n−1}

Here are n equations to determine the n unknowns (the equations are over the 2-element field). Once we have the b_i, we can predict all the remaining elements of the sequence a_{2n}, a_{2n+1}, . . .
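This attack is easy to carry out. The following Python sketch (illustrative) solves the system by Gaussian elimination over the 2-element field and recovers the coefficients b_i whenever the system is non-degenerate; together with the generator sketched earlier, it predicts the rest of the sequence.

def recover_coefficients(a, n):
    """Solve the system above for b_0, ..., b_{n-1} by Gaussian elimination
    over the 2-element field, given the bits a_0, ..., a_{2n-1}.
    Returns None if the equations are dependent."""
    rows = [a[k:k + n] + [a[k + n]] for k in range(n)]   # augmented matrix
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col]), None)
        if pivot is None:
            return None                  # the system does not determine the b_i
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col and rows[r][col]:
                rows[r] = [x ^ y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

# Observe 2n bits of the shift register from the earlier sketch and recover b.
a = list(shift_register([1, 0, 0, 0], [1, 1, 0, 0], 8))
print(recover_coefficients(a, 4))        # [1, 1, 0, 0]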

It may happen, of course, that this system is not uniquely solvable, because the equations are dependent. For example, we might start with the seed 00 . . . 0, in which case the equations are meaningless. But it can be shown that for a random choice of the seed, the equations determine the coefficients.

