
$\sum_{n=1}^{\infty} P(A_n)$ is convergent. But then the Borel–Cantelli Lemma from probability theory implies that with probability 1 only finitely many of the events $A_n$ occur, which means that $K(x_n)/n \to 1$.

Remark. If the members of the sequence x are generated by an algorithm, then $x_n$ can be computed from the program of the algorithm (constant length) and from the number $n$ (which can be given in $\log n$ bits). Therefore, for such a sequence, $K(x_n)$ grows very slowly.

6.4 Kolmogorov complexity, entropy and coding

Let $p = (p_1, p_2, \dots)$ be a discrete probability distribution, i.e., a non-negative (finite or infinite) sequence with $\sum_i p_i = 1$. Its entropy is the quantity
$$H(p) = \sum_i -p_i \log p_i$$
(the term $p_i \log p_i$ is considered to be 0 if $p_i = 0$). Notice that in this sum all terms are nonnegative, so $H(p) \ge 0$; equality holds if and only if some $p_i$ equals 1 and the rest are 0. It is easy to see that for a fixed alphabet size $m$, the probability distribution with maximum entropy is $(1/m, \dots, 1/m)$, and its entropy is $\log m$.
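As a quick numerical illustration of this definition (a minimal sketch added here, not part of the original text), the following Python function computes $H(p)$ in bits; for the uniform distribution on $m$ symbols it returns $\log m$, the maximum mentioned above.

```python
from math import log2

def entropy(p):
    """Entropy H(p) = sum_i -p_i log p_i (in bits); the term 0*log 0 is taken as 0."""
    return sum(-x * log2(x) for x in p if x > 0)

# The uniform distribution on m = 8 symbols has entropy log 8 = 3 bits,
# the maximum over all distributions on 8 symbols.
print(entropy([1/8] * 8))          # 3.0
print(entropy([0.5, 0.25, 0.25]))  # 1.5
```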

Entropy is a basic notion of information theory; we do not treat it in detail in these notes, we only point out its connection with Kolmogorov complexity. We met entropy for the case $m = 2$ in Lemma 6.3.3. This lemma is easy to generalize to arbitrary alphabets as follows.

Lemma 6.4.1. Let $x \in \Sigma_0^*$ with $|x| = n$ and let $p_h$ denote the relative frequency of the letter $h$ in the word $x$. Let $p = (p_h : h \in \Sigma_0)$. Then
$$K(x) \le \frac{H(p)}{\log m}\, n + O\Bigl(\frac{m \log n}{\log m}\Bigr).$$

Proof. Let us give a proof, different from that of Lemma 6.3.3, using Theorem 6.2.4. Consider the probability distribution over the strings of length $n$ in which each symbol $h$ is chosen independently with probability $p_h$. The probabilities $p_h$ are fractions with denominator $n$, hence their description needs at most $O(m \log n)$ bits, which is $O(m \log n / \log m)$ symbols of our alphabet. The distribution over the strings is therefore an enumerable probability distribution $P$ whose program has length $O(m \log n / \log m)$. According to Theorem 6.2.4, we have
$$K(x) \le -\log_m P(x) + O\Bigl(\frac{m \log n}{\log m}\Bigr).$$
But $-\log_m P(x)$ is exactly $n H(p)/\log m$.
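The last identity is easy to verify numerically. The sketch below (an illustration added here, not part of the original proof) picks a concrete word over a 4-letter alphabet and checks that $-\log_m P(x)$, with $P(x) = \prod_h p_h^{\,n p_h}$, coincides with $n H(p)/\log m$.

```python
from collections import Counter
from math import log, log2

x = "abacabadabacabaa"            # a concrete word over the alphabet {a, b, c, d}
n, m = len(x), 4

counts = Counter(x)
p = {h: c / n for h, c in counts.items()}     # relative frequencies p_h

H = sum(-q * log2(q) for q in p.values())     # entropy H(p), in bits
P_x = 1.0
for h in x:                                   # P(x) = product over positions of p_{x_i}
    P_x *= p[h]

lhs = -log(P_x, m)                            # -log_m P(x)
rhs = n * H / log2(m)                         # n H(p) / log m
print(lhs, rhs)                               # the two numbers agree
```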

Remark. We mention another interesting connection between entropy and complexity: the entropy of a computable probability distribution over all strings is close to the average complexity. This reformulation of Corollary 6.2.5 can be stated as
$$\Bigl|\, H(p) - \sum_x p(x) H(x) \,\Bigr| = O(1)$$
for any computable probability distribution $p$ over the set $\Sigma_0^*$.

Let $L \subseteq \Sigma_0^*$ be a recursive language and suppose that we want to find a short program, a "code", only for the words in $L$. For each word $x$ in $L$, we are thus looking for a program $f(x) \in \{0,1\}^*$ printing it. We call the function $f : L \to \{0,1\}^*$ a Kolmogorov code of $L$. The conciseness of the code is the function

$$\eta(n) = \max\{\, |f(x)| : x \in L,\ |x| \le n \,\}.$$

We can easily get a lower bound on the conciseness of any Kolmogorov code of any language. Let $L_n$ denote the set of words of $L$ of length at most $n$. Then obviously,
$$\eta(n) \ge \log |L_n|.$$

We call this estimate the information theoretical lower bound.

This lower bound is sharp (to within an additive constant). We can code every word $x$ in $L$ simply by telling its serial number in the increasing ordering of $L$. If the word $x$ of length $n$ is the $t$-th element, then this requires $\log t \le \log |L_n|$ bits, plus a constant number of additional bits (the program for generating the elements of $\Sigma_0^*$ in lexicographic order, checking their membership in $L$, and printing the $t$-th one).
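A minimal sketch of this serial-number code (an added illustration, not from the text; the membership test `in_L` is a placeholder for the decision procedure of the recursive language $L$). It meets the information-theoretical bound, but enumerating $\Sigma_0^*$ makes it exponentially slow, which is why the polynomial time codes below are interesting.

```python
from itertools import count, product

SIGMA0 = "01"                      # a concrete alphabet, assumed binary here

def words():
    """All words over SIGMA0 in order of increasing length, lexicographically within a length."""
    for length in count(0):
        for w in product(SIGMA0, repeat=length):
            yield "".join(w)

def encode(x, in_L):
    """Serial number t of x among the words of L, written in binary (about log|L_n| bits)."""
    t = 0
    for w in words():
        if in_L(w):
            t += 1
            if w == x:
                return bin(t)[2:]

def decode(code, in_L):
    """Inverse of encode: the t-th word of L in the same ordering."""
    t = int(code, 2)
    for w in words():
        if in_L(w):
            t -= 1
            if t == 0:
                return w

in_L = lambda w: w.count("1") % 2 == 0     # an example language: words with an even number of 1's
print(encode("0110", in_L))                # '1100'
print(decode("1100", in_L))                # '0110'
```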

We arrive at more interesting questions if we stipulate that the code should be computable from the word, and conversely the word from the code, in polynomial time. In other words, we are looking for a language $L_0$ and two polynomially computable functions
$$f : L \to L_0, \qquad g : L_0 \to L$$
with $g \circ f = \mathrm{id}_L$, for which, for every $x$ in $L$, the length $|f(x)|$ of the code is "short" compared to $|x|$.

Such a pair of functions is called a polynomial time code. (Instead of the polynomial time bound we could, of course, consider other complexity restrictions.)

We present some examples where a polynomial time code approaches the information-theoretical bound.

Example 6.4.1. In the proof of Lemma 6.3.3, for the coding of the 0-1 sequences of length $n$ with exactly $m$ 1's, we used the simple coding in which the code of a sequence is the number giving its place in the lexicographic ordering. We will show that this coding is polynomial time computable.

Let us view each 0-1 sequence as the obvious code of a subset of the $n$-element set $\{n-1, n-2, \dots, 0\}$. Each such set can be written as $\{a_1, \dots, a_m\}$ with $a_1 > a_2 > \cdots > a_m$. Then the set $\{b_1, \dots, b_m\}$ precedes the set $\{a_1, \dots, a_m\}$ lexicographically if and only if there is an $i$ such that $b_i < a_i$ while $a_j = b_j$ holds for all $j < i$. Let $\{a_1, \dots, a_m\}$ be the lexicographically $t$-th set. Then the number of subsets $\{b_1, \dots, b_m\}$ with the above property for a given $i$ is exactly $\binom{a_i}{m-i+1}$. Summing this for all $i$ we find that
$$t = 1 + \binom{a_1}{m} + \binom{a_2}{m-1} + \cdots + \binom{a_m}{1}. \tag{6.3}$$
So, given $a_1, \dots, a_m$, the value of $t$ is easily computable in time polynomial in $n$.

Conversely, if $t < \binom{n}{m}$ is given, then $t$ is easy to write in the above form: first we find, using binary search, the greatest natural number $a_1$ with $\binom{a_1}{m} \le t-1$, then the greatest number $a_2$ with $\binom{a_2}{m-1} \le t - 1 - \binom{a_1}{m}$, etc. We do this for $m$ steps. The numbers obtained this way satisfy $a_1 > a_2 > \cdots$; indeed, according to the definition of $a_1$ we have $\binom{a_1+1}{m} = \binom{a_1}{m} + \binom{a_1}{m-1} > t-1$ and therefore $\binom{a_1}{m-1} > t - 1 - \binom{a_1}{m}$, implying $a_1 > a_2$. It follows similarly that $a_2 > a_3 > \cdots > a_m \ge 0$ and that there is no "remainder" after $m$ steps, i.e., that (6.3) holds. It can therefore be determined in polynomial time which subset is lexicographically the $t$-th.
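The following Python sketch (an added illustration; the function names are ours) implements both directions of this coding, with a simple linear search in place of the binary search mentioned above.

```python
from math import comb

def subset_rank(a, m):
    """Formula (6.3): the 1-based lexicographic rank t of the m-subset
    given as a decreasing list a = [a_1, a_2, ..., a_m]."""
    return 1 + sum(comb(a_i, m - i) for i, a_i in enumerate(a))

def subset_unrank(t, n, m):
    """Inverse direction: the decreasing list a_1 > ... > a_m of the
    lexicographically t-th m-subset of {0, ..., n-1}."""
    a, r, hi = [], t - 1, n - 1
    for k in range(m, 0, -1):              # k = m, m-1, ..., 1
        a_i = hi
        while comb(a_i, k) > r:            # greatest a_i with C(a_i, k) <= remainder
            a_i -= 1
        a.append(a_i)
        r -= comb(a_i, k)
        hi = a_i - 1
    return a

# Example: the 2-subsets of {3, 2, 1, 0} in lexicographic order of their 0-1 codes.
print(subset_rank([3, 1], 2))     # 5
print(subset_unrank(5, 4, 2))     # [3, 1]
```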

Example 6.4.2. Consider trees, given by their adjacency matrices (but any other "reasonable" representation would also do). In such representations, the vertices of the tree have a given order, which we can also express by saying that the vertices of the tree are labeled by the numbers from 0 to $n-1$. We consider two trees equal if whenever the nodes $i, j$ are connected in the first one they are also connected in the second one and vice versa (so, if we renumber the nodes of the tree then we may arrive at a different tree). Such trees are called labeled trees. Let us first see what the information-theoretical lower bound gives us, i.e., how many trees there are. The following classical result, called Cayley's Theorem, applies here:

Theorem 6.4.2 (Cayley's Theorem). The number of $n$-node labeled trees is $n^{n-2}$.

Consequently, by the information-theoretical lower bound, for any encoding of trees some $n$-node tree needs a code of length at least $\lceil \log(n^{n-2}) \rceil = \lceil (n-2) \log n \rceil$. But can this lower bound be achieved by a polynomial time computable code?

(a) Coding trees by their adjacency matrices takes $n^2$ bits. (It is easy to see that $\binom{n}{2}$ bits are enough.)

(b) We fare better if we specify each tree by enumerating its edges. Then we must give a "name" to each vertex; since there are $n$ vertices we can give each one a 0-1 sequence of length $\lceil \log n \rceil$ as its name. We specify each edge by its two endnodes. In this way, the enumeration of the edges takes about $2(n-1)\log n$ bits.

(c) We can save a factor of 2 in (b) if we distinguish a root in the tree, say the node 0, and specify the tree by the sequence $(\alpha(1), \dots, \alpha(n-1))$ in which $\alpha(i)$ is the first interior node on the path from node $i$ to the root (the "father" of $i$). This is $(n-1)\lceil \log n \rceil$ bits, which is already nearly optimal.

(d) There is, however, a procedure, the so-called Prüfer code, that sets up a bijection between the $n$-node labeled trees and the sequences of length $n-2$ of the numbers $0, \dots, n-1$. (Thereby it also proves Cayley's theorem.) Each such sequence can be considered the expression of a natural number in the base $n$ number system; in this way, we assign a "serial number" between 0 and $n^{n-2} - 1$ to the $n$-node labeled trees. Expressing these serial numbers in the base two number system, we get a coding in which the code of each tree has length at most $\lceil (n-2)\log n \rceil$.

The Prüfer code can be considered as a refinement of procedure (c). The idea is that we order the edges $[i, \alpha(i)]$ not by the value of $i$ but a little differently. Let us define the permutation $(i_1, \dots, i_n)$ as follows: let $i_1$ be the smallest endnode (leaf) of the tree; if $i_1, \dots, i_k$ are already defined, then let $i_{k+1}$ be the smallest endnode of the graph remaining after deleting the nodes $i_1, \dots, i_k$. (We do not consider the root 0 an endnode.) Let $i_n = 0$. With the $i_k$'s thus defined, consider the sequence $(\alpha(i_1), \dots, \alpha(i_{n-1}))$. The last element of this is 0 (the "father" of the node $i_{n-1}$ can only be $i_n$), so it is not interesting. We call the remaining sequence $(\alpha(i_1), \dots, \alpha(i_{n-2}))$ the Prüfer code of the tree.
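A minimal Python sketch of this encoding (added as an illustration; it assumes the tree is given by the father array $\alpha$ of representation (c), indexed by the nodes $1, \dots, n-1$, with the entry for the root ignored):

```python
def prufer_encode(alpha):
    """Prüfer code (alpha(i_1), ..., alpha(i_{n-2})) of a labeled tree on nodes 0..n-1.

    alpha[i] is the father of node i on the path towards the root 0; alpha[0] is ignored.
    """
    n = len(alpha)
    children = [0] * n                       # how many not-yet-deleted children each node has
    for i in range(1, n):
        children[alpha[i]] += 1
    deleted = [False] * n
    code = []
    for _ in range(n - 2):
        # i_k: the smallest endnode of the remaining tree (the root 0 never counts)
        leaf = min(v for v in range(1, n) if not deleted[v] and children[v] == 0)
        code.append(alpha[leaf])
        deleted[leaf] = True
        children[alpha[leaf]] -= 1
    return code

# The tree with edges 1-0, 2-1, 3-1 has father array [_, 0, 1, 1] and Prüfer code (1, 1).
print(prufer_encode([0, 0, 1, 1]))   # [1, 1]
```

This simple version runs in $O(n^2)$ time; a priority queue brings it down to $O(n \log n)$, in any case comfortably polynomial.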

Claim 6.4.3. The Prüfer code of a tree determines the tree.

For this, it is enough to see that the Prüfer code determines the sequence $i_1, \dots, i_n$; then we know all the edges of the tree (the pairs $[i, \alpha(i)]$).

The node $i_1$ is the smallest endnode of the tree; hence to determine $i_1$, it is enough to figure out the endnodes from the Prüfer code. But this is obvious: the endnodes are exactly those nodes that are not the "fathers" of other nodes, i.e., the ones that do not occur among the numbers $\alpha(i_1), \dots, \alpha(i_{n-2}), 0$. The node $i_1$ is therefore uniquely determined.

Assume that we already know that the Prüfer code uniquely determines $i_1, \dots, i_{k-1}$. It follows similarly to the above that $i_k$ is the smallest number occurring neither among $i_1, \dots, i_{k-1}$ nor among $\alpha(i_k), \dots, \alpha(i_{n-2}), 0$. So $i_k$ is also uniquely determined.

Claim 6.4.4. Every sequence $(b_1, \dots, b_{n-2})$, where $0 \le b_i \le n-1$, occurs as the Prüfer code of some tree.

Using the idea of the proof above, let $b_{n-1} = 0$ and let us define the permutation $i_1, \dots, i_n$ by the recursion that $i_k$ is the smallest number occurring neither among $i_1, \dots, i_{k-1}$ nor among $b_k, \dots, b_{n-1}$ (for $1 \le k \le n-1$); and let $i_n = 0$. Connect $i_k$ with $b_k$ for all $1 \le k \le n-1$ and let $\gamma(i_k) = b_k$. In this way, we obtain a graph $G$ with $n-1$ edges on the nodes $0, \dots, n-1$. This graph is connected, since for every node $i \ne 0$, $\gamma(i)$ comes later in the sequence $i_1, \dots, i_n$ than $i$, and therefore the sequence $i, \gamma(i), \gamma(\gamma(i)), \dots$ is a path connecting $i$ to the node 0. But then $G$ is a connected graph with $n-1$ edges, therefore it is a tree. That the sequence $(b_1, \dots, b_{n-2})$ is the Prüfer code of $G$ is obvious from the construction.
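The reconstruction described in the last two proofs can be written out directly; the sketch below (again only an illustration, function name ours) returns the edge list $[i_k, b_k]$ of the tree with a given Prüfer code.

```python
def prufer_decode(code):
    """Edge list of the labeled tree on nodes 0..n-1 whose Prüfer code is `code` (length n-2)."""
    n = len(code) + 2
    b = list(code) + [0]                         # b_1, ..., b_{n-1} with b_{n-1} = 0
    used = set()                                 # the i_k chosen so far
    edges = []
    for k in range(n - 1):
        # i_k: smallest number occurring neither among i_1..i_{k-1} nor among b_k..b_{n-1}
        forbidden = used | set(b[k:])
        i_k = min(v for v in range(n) if v not in forbidden)
        used.add(i_k)
        edges.append((i_k, b[k]))
    return edges

print(prufer_decode([1, 1]))    # [(2, 1), (3, 1), (1, 0)] -- the tree encoded above
```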

Remark. An exact correspondence like the Prüfer code has other advantages besides optimal Kolmogorov coding. Suppose that our task is to write a program for a randomized Turing machine that outputs a random labeled tree of size $n$ in such a way that all trees occur with the same probability. The Prüfer code gives an efficient algorithm for this. We just have to generate a random sequence $b_1, \dots, b_{n-2}$, which is easy, and then decode the tree from it by the above algorithm.
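With the decoding sketch above, this uniform generation is only a couple of lines (again an illustration; true randomness of the $b_i$ is of course assumed to come from some outside source):

```python
import random

def random_labeled_tree(n):
    """A uniformly random labeled tree on the nodes 0, ..., n-1."""
    code = [random.randrange(n) for _ in range(n - 2)]   # a uniform random Prüfer code
    return prufer_decode(code)

print(random_labeled_tree(6))
```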

Example 6.4.3. Consider now the unlabeled trees. These can be defined as the equivalence classes of labeled trees where two labeled trees are considered equivalent if they are isomorphic, i.e., by a suitable relabeling, they become the same labeled tree.

We assume that we represent each equivalence class by one of its elements, i.e., by a labeled tree (it does not matter now by which one). Since each unlabeled tree has at most $n!$ labelings (these are not necessarily all different as labeled trees!), the number of unlabeled trees is at least $n^{n-2}/n! > 2^{n-2}$ (if $n \ge 25$).

The information-theoretical lower bound is therefore at least $n-2$. (According to a difficult result of George Pólya, the number of $n$-node unlabeled trees is asymptotically $c_1 c_2^n n^{-3/2}$, where $c_1$ and $c_2$ are constants defined in a certain complicated way.)

On the other hand, we can use the following coding procedure. Consider an $n$-node tree $F$. Walk through $F$ by the "depth-first search" rule: let $x_0$ be the node labeled 0 and define the nodes $x_1, x_2, \dots$ as follows: if $x_i$ has a neighbor that does not yet occur in the sequence, then let $x_{i+1}$ be the smallest one among these. If it has no such neighbor and $x_i \ne x_0$, then let $x_{i+1}$ be the neighbor of $x_i$ on the path leading from $x_i$ to $x_0$. Finally, if $x_i = x_0$ and every neighbor of $x_0$ has already occurred in the sequence, then we stop.

It is easy to see that for the sequence thus defined, every edge occurs among the pairs $[x_i, x_{i+1}]$; moreover, it occurs once in each direction. It follows that the length of the sequence is exactly $2n-1$. Now let $\varepsilon_i = 1$ if $x_{i+1}$ is farther from the root than $x_i$, and $\varepsilon_i = 0$ otherwise. It is easy to see that the sequence $\varepsilon_0 \varepsilon_1 \cdots \varepsilon_{2n-3}$ determines the tree uniquely; passing through the sequence, we can draw the graph and construct the sequence $x_1, x_2, \dots$ of nodes step by step. In step $i+1$, if $\varepsilon_i = 1$ then we take a new node (this will be $x_{i+1}$) and connect it with $x_i$; if $\varepsilon_i = 0$ then let $x_{i+1}$ be the neighbor of $x_i$ in the "direction" of $x_0$.
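A small Python sketch of this walk and of the reconstruction (added for illustration; the tree is assumed to be given as a neighbor-set dictionary). As Remark 1 below notes, decoding recovers the shape of the tree, with the nodes renamed in the order the walk discovers them.

```python
def dfs_code(adj):
    """The 0-1 sequence eps_0 ... eps_{2n-3} of the depth-first walk starting at node 0.

    adj[v] is the set of neighbors of node v; the walk always prefers the
    smallest neighbor that has not occurred yet.
    """
    father = {0: None}
    code, x = [], 0
    while True:
        new = [v for v in sorted(adj[x]) if v not in father]
        if new:                     # step away from the root to a new node
            father[new[0]] = x
            code.append(1)
            x = new[0]
        elif x != 0:                # step back towards the root
            code.append(0)
            x = father[x]
        else:                       # back at the root with nothing new: done
            return code

def dfs_decode(code):
    """Edge list of a tree with the given code; nodes are named in discovery order."""
    edges, stack, nxt = [], [0], 1
    for eps in code:
        if eps == 1:                # a new node, connected to the current one
            edges.append((stack[-1], nxt))
            stack.append(nxt)
            nxt += 1
        else:                       # move back one step towards the root
            stack.pop()
    return edges

adj = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}   # the same example tree as in the Prüfer sketch
print(dfs_code(adj))                           # [1, 1, 0, 1, 0, 0]
print(dfs_decode([1, 1, 0, 1, 0, 0]))          # [(0, 1), (1, 2), (1, 3)]
```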

Remarks. 1. With this coding, the code assigned to a tree depends on the labeling, but it does not determine the labeled tree uniquely (it only determines the unlabeled tree uniquely).

2. The coding is not bijective: not every 0-1 sequence will be the code of an unlabeled tree. We can notice that

(a) there are as many 1's as 0's in each code;

(b) in every starting segment of every code, there are at least as many 1's as 0's (the difference between the number of 1's and the number of 0's among the first $i$ numbers gives the distance of the node $x_i$ from the node 0).

It is easy to see that for each 0-1 sequence having properties (a) and (b), there is a labeled tree whose code it is. It is not certain, however, that this tree, as an unlabeled tree, is represented by exactly this labeling (this depends on which unlabeled trees are represented by which of their labelings). Therefore, the code does not even use all the words with properties (a) and (b).

3. The number of 0-1 sequences having properties (a) and (b) is, according to a well-known combinatorial theorem, $\frac{1}{n}\binom{2n-2}{n-1}$ (the so-called Catalan number). We can formulate a tree notion to which the sequences with properties (a) and (b) correspond exactly: these are the rooted planar trees, which are drawn without crossings in the plane in such a way that their distinguished vertex, their root, is on the left edge of the page. This drawing defines an ordering "from top to bottom" among the "sons" (the neighbors farther from the root); the drawing is characterized by these orderings. The coding described above can also be applied to rooted planar trees, and it creates a bijection between them and the sequences with properties (a) and (b).

Exercise 6.4.1. (a) Let $x$ be a 0-1 sequence that does not contain 3 consecutive 0's. Show that $K(x) < 0.99|x| + O(1)$.

(b) Find the best constant in place of 0.99. [Hint: you have to find approximately the number of such sequences. Let $A(n)$ and $B(n)$ be the number of such sequences ending with 0 and 1, respectively. Find recurrence relations for $A$ and $B$.]

(c) Give a polynomial time coding-decoding procedure for such sequences that compresses each of them by at least 1 percent.

Exercise 6.4.2. (a) Prove that for any two strings $x, y \in \Sigma_0^*$,
$$K(xy) \le 2K(x) + K(y) + c,$$
where $c$ depends only on the universal Turing machine in the definition of information complexity.

(b) Prove that the stronger and more natural looking inequality
$$K(xy) \le K(x) + K(y) + c$$
is false.

Exercise 6.4.3. Suppose that the universal Turing machine used in the definition of K(x) uses programs written in a two-letter alphabet and outputs strings in an s-letter alphabet.

(a) Prove that $K(x) \le |x| \log s + O(1)$.

(b) Prove that, moreover, there are polynomial time computable functions $f, g$ mapping strings $x$ of length $n$ to binary strings of length $n \log s + O(1)$ and vice versa, with $g(f(x)) = x$.

Exercise 6.4.4.

(a) Give an upper bound on the Kolmogorov complexity of Boolean functions of n variables.

(b) Give a lower bound on the complexity of the most complex Boolean function of n variables.

(c) Use the above result to find a number $L(n)$ such that there is a Boolean function with $n$ variables that needs a Boolean circuit of size at least $L(n)$ to compute it.

Exercise 6.4.5. Call an infinite 0-1 sequence $x$ (informatically) strongly random if $n - H(x_n)$ is bounded from above. Prove that every informatically strongly random sequence is also weakly random.

Exercise 6.4.6. Prove that almost all infinite 0-1 sequences are strongly random.

Chapter 7

Pseudorandom numbers

We have seen that various important algorithms use random numbers (or, equivalently, independent random bits). But how do we get such bits?

One possible source is from outside the computer. We could obtain "real" random sequences, say, from radioactive decay. In most cases, however, this would not work: our computers are very fast and we have no physical device giving the equivalent of unbiased coin-tosses at this rate.

Thus we have to resort to generating our random bits by the computer. However, a long sequence generated by a short program is never random, according to the notion of randomness introduced in Chapter 6 using information complexity. We are therefore forced to use algorithms that generate random-looking sequences; but, as von Neumann (one of the first mathematicians to propose the use of these) put it, everybody using them is inevitably "in a state of sin". In this chapter, we will understand the kind of protection we can get against the graver consequences of this sin.

There are other reasons besides practical ones to study pseudorandom number generators. We often want to repeat some computation, for various reasons including error checking. In this case, if our source of random numbers was really random, then the only way to use the same random numbers again is to store them, using a lot of space. With pseudorandom numbers, this is not the case: we only have to store the "seed", which is much shorter. Another, and more important, reason is that there are applications where all we want is that the sequence should "look random" to somebody who does not know how it was generated. The collection of these applications, called cryptography, is treated in Chapter 12.

The way a pseudorandom bit generator works is that it turns a short random string, called the "seed", into a longer pseudorandom string. We require that it work in polynomial time. The resulting string has to "look" random, and the important fact is that this can be defined exactly. Roughly speaking, there should be no polynomial time algorithm that distinguishes it from a truly random sequence. Another property, often easier to verify, is that no algorithm can predict any of its bits from the previous bits. We will prove the equivalence of these two conditions.
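As a toy illustration of this seed-stretching interface only (a sketch added here, not a construction from these notes), the generator below turns an $n$-bit seed into $2n$ bits with a linear congruential step; ad hoc generators of this kind are exactly the ones that fail the strict criteria discussed below.

```python
def toy_generator(seed_bits):
    """Stretch a seed of n bits into 2n 'random-looking' bits.

    A linear congruential recurrence is used only to illustrate the
    seed -> longer string interface; it is NOT a pseudorandom generator
    in the strong sense defined in this chapter.
    """
    n = len(seed_bits)
    state = int("".join(map(str, seed_bits)), 2)
    modulus = 2 ** n
    out = []
    for _ in range(2 * n):
        state = (6364136223846793005 * state + 1442695040888963407) % modulus
        out.append(state >> (n - 1))        # output the top bit of the state
    return out

print(toy_generator([1, 0, 1, 1, 0, 1, 0, 0]))   # 16 bits from an 8-bit seed
```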

But how do we design such a generator? Various ad hoc methods that produce random-looking sequences (like taking the bits in the binary representation of a root of a given equation) turn out to produce strings that do not pass the strict criteria we impose. A general method to obtain such sequences is based on one-way functions: functions that are easy to evaluate but difficult to invert. While the existence of such functions is not proved (it would imply that P is different from NP), there are several candidates that are secure at least against current techniques.
