
3 Compactness of Representations

This section compares the compactness of the two representation models, F and T. Let $\mathcal{C} = \{c \mid c : \{0,1\}^n \mapsto \{0,1\}\}$ denote the set of all $n$-dimensional Boolean functions (hence $|\mathcal{C}| = 2^{2^n}$), and let $c$ denote a Boolean function in $\mathcal{C}$. $R(c)$ denotes the set of all representations for $c$ in a representation model $R$ (possibly $R(c) = \emptyset$). Let us call $\gamma_R(c)$ the compactness of $c$ by $R$, defined as the smallest complexity of a representation for $c$ by $R$, i.e.,

$$\gamma_R(c) = \min\{\gamma(r) \mid r \in R(c)\}.$$
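For concreteness, the size bound $|\mathcal{C}| = 2^{2^n}$ can be checked for small $n$ by identifying each Boolean function with its $2^n$-bit truth table; the short Python sketch below is our own illustration, not part of the original text.

```python
from itertools import product

def count_boolean_functions(n: int) -> int:
    """Count all Boolean functions c : {0,1}^n -> {0,1} by enumerating truth tables."""
    inputs = list(product([0, 1], repeat=n))             # the 2^n input vectors
    truth_tables = product([0, 1], repeat=len(inputs))   # one output bit per input vector
    return sum(1 for _ in truth_tables)

for n in range(1, 4):
    assert count_boolean_functions(n) == 2 ** (2 ** n)   # |C| = 2^(2^n)
```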

First, we show that F and T have the ability to represent any Boolean function in $\mathcal{C}$. Let $\mathcal{C}_F$ (resp., $\mathcal{C}_T$) denote the set of Boolean functions in $\mathcal{C}$ for which F (resp., T) has some representation (i.e., $\mathcal{C}_F = \{c \in \mathcal{C} \mid F(c) \neq \emptyset\}$ and $\mathcal{C}_T = \{c \in \mathcal{C} \mid T(c) \neq \emptyset\}$). [...] corresponding vector $x$. Then we see that $c$ is represented by this $t$. □

Note that $\gamma(f) = \Omega(2^n)$ and $\gamma(t) = \Omega(2^n)$ for the representations $f$ and $t$ in the proof of this theorem. From our assumption on the oracle $y$, $f$ and $t$ may not be the classifier representations of our interest; we are interested in more compact representations. The following theorem states that, for any Boolean function $c \in \mathcal{C}$, we can construct a feature $f \in F(c)$ from any decision tree $t \in T(c)$ such that its complexity $\gamma(f)$ is bounded by $O(\gamma(t))$.


Theorem 3 For any Boolean function $c$ and any decision tree $t \in T(c)$, there exists a feature $f \in F(c)$ such that $\gamma(f) = O(\gamma(t))$.

PROOF: Let $t_v$ denote the subtree of $t = ((V_t, E_t), \ell)$ which has node $v \in V_t$ as its root node, and let $c_v$ denote the Boolean function represented by $t_v$ (i.e., $t_v \in T(c_v)$). For any node $v \in V_t$, we show that there exists a feature $f_v \in F(c_v)$ such that $\gamma(f_v) \le d\,\gamma(t_v)$ for some constant $d > 0$.

We prove this by induction on the height $h$ of node $v$. Assume $h = h(t)$. Then $v$ is a leaf of $t$, and the subtree $t_v$ consists of the single node $v$, which represents a constant function $c_v$ that outputs $\ell(v)$ for any input vector $x \in \{0,1\}^n$. We choose a feature $f_v \in F(c_v)$ that has one fan-in (e.g., $f_v = f_{\{a}$ [...]). (Features for constant functions are omitted in Fig. 2.)
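To make the simulation behind Theorem 3 concrete, the sketch below is our own Python illustration (assuming $\gamma$ counts the nodes of a tree and the composed operations of a feature, which is our reading, not a quotation of the paper's definitions): it converts a binary decision tree into a nested if-then-else feature, so that each node of the tree contributes a constant number of operations and the resulting feature has size $O(\gamma(t))$.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: int                      # constant output 0 or 1

@dataclass
class Node:
    attr: int                       # index of the attribute tested at this node
    left: "Tree"                    # subtree followed when the attribute is 0
    right: "Tree"                   # subtree followed when the attribute is 1

Tree = Union[Leaf, Node]

def tree_to_feature(t: Tree):
    """Return (feature as a function on x, number of composed operations)."""
    if isinstance(t, Leaf):
        return (lambda x, v=t.label: v), 1
    f0, s0 = tree_to_feature(t.left)
    f1, s1 = tree_to_feature(t.right)
    # if-then-else on attribute t.attr: a constant-size composition per inner node
    f = lambda x, j=t.attr, f0=f0, f1=f1: f1(x) if x[j] == 1 else f0(x)
    return f, s0 + s1 + 1

# Example: a tree computing x[0] XOR x[1] has 3 inner nodes and 4 leaves;
# the derived feature has size proportional to the tree size.
xor_tree = Node(0, Node(1, Leaf(0), Leaf(1)), Node(1, Leaf(1), Leaf(0)))
f, size = tree_to_feature(xor_tree)
assert all(f((a, b)) == (a ^ b) for a in (0, 1) for b in (0, 1))
print("composed size:", size)   # 7 = number of tree nodes
```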

Corollary 4 Given a training set $X$, for any $\varepsilon$-classifier $c$ for some $\varepsilon \in [0,1]$ and any decision tree $t \in T(c)$, there exist an $\varepsilon$-classifier $c'$ and a feature $f \in F(c')$ such that $\gamma(f) = O(\gamma(t))$.

In what follows, we consider the converse of the above relation: is there a compact representation by T for every classifier $c$ that admits a compact representation by F?

Theorem 5 Given the training set $X = \{0,1\}^n$, there exists an oracle $y$ such that $\gamma_F(c) = O(n)$ and $\gamma_T(c) = \Omega(2^n)$ hold for any classifier $c$ with $e(c) = 0$.

PROOF: Let oracle $y$ be the parity function such that

$$y(x) = \begin{cases} 1 & \text{if } \sum_{j=1}^{n} x_j \text{ is odd},\\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

Let $c$ denote a classifier such that $e(c) = 0$. First, we show that there exists a feature $f \in F(c)$ whose complexity $\gamma(f)$ is bounded by $O(n)$. For each attribute $a_j \in A$, $j = 1, 2, \ldots, n-1$, we construct a feature $f_j$ as follows: [...]
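As an independent illustration (not the construction from the paper, and with helper names of our own choosing), the Python sketch below chains two-input XOR compositions so that the $j$-th feature computes the parity of the first $j+1$ attributes; this realizes the parity oracle with a feature of complexity $O(n)$.

```python
from itertools import product

def parity_oracle(x):
    """Oracle y of (1): 1 iff the number of 1-bits in x is odd."""
    return sum(x) % 2

def chained_parity_features(n):
    """Build features f_1, ..., f_{n-1}, where f_j is the XOR of the first j+1
    attributes, obtained by composing f_{j-1} with attribute a_{j+1}.
    Each composition has fan-in 2, so the final feature has complexity O(n)."""
    features = []
    prev = lambda x: x[0]                       # start from attribute a_1
    for j in range(1, n):
        prev = (lambda g, j: lambda x: g(x) ^ x[j])(prev, j)
        features.append(prev)
    return features

n = 4
f = chained_parity_features(n)[-1]              # f_{n-1} computes the full parity
assert all(f(x) == parity_oracle(x) for x in product([0, 1], repeat=n))
```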

Next, we show that the complexity of any decision tree $t \in T(c)$ for a classifier $c$ consistent with $y$ under $X = \{0,1\}^n$ is $\Omega(2^n)$. To this end, we show the following two lemmas.


Lemma 6 For any proper subset $A' \subset A$ and any $x^1 \in X_1$, there exists an example $x^0 \in X_0$ such that $x^1|_{A'} = x^0|_{A'}$ (i.e., $a_j(x^1) = a_j(x^0)$ holds for any $a_j \in A'$).

PROOF: It is sufficient to show that such an $x^0$ exists for any $A' = A \setminus \{a_j\}$, $j = 1, 2, \ldots, n$, and any $x^1 \in X_1$. Since $X = (X_1, X_0)$ contains all the $n$-dimensional vectors, we find exactly one example $x^0 \in X_0$ satisfying $x^1|_{A'} = x^0|_{A'}$, by taking $x^0 = (x^1_1, x^1_2, \ldots, x^1_{j-1}, \overline{x^1_j}, x^1_{j+1}, \ldots, x^1_n)$. □

Lemma 7 For any $A' = A \setminus \{a_j\}$, $j = 1, 2, \ldots, n$, and any $x^1 \in X_1$, there exists no $x \in X_1$ ($x \neq x^1$) such that $x^1|_{A'} = x|_{A'}$ holds.

PROOF: Assume that there exist two distinct examples $x^1, x \in X_1$ such that $x^1|_{A'} = x|_{A'}$ for some $A' = A \setminus \{a_j\}$. Since $x^1 \neq x$, $a_j(x^1) \neq a_j(x)$ holds; it follows that either $x^1 \in X_0$ or $x \in X_0$ holds since $y$ is a parity function, which contradicts the assumption. □
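Both lemmas are easy to confirm exhaustively for small $n$; the check below is our own Python illustration, flipping one bit at a time to exhibit the matching negative example of Lemma 6 and verifying that no two distinct positive examples agree on $A' = A \setminus \{a_j\}$ as in Lemma 7.

```python
from itertools import product

n = 4
X = list(product([0, 1], repeat=n))
parity = lambda x: sum(x) % 2
X1 = [x for x in X if parity(x) == 1]           # positive examples
X0 = [x for x in X if parity(x) == 0]           # negative examples

def project(x, dropped):
    """Restriction of x to A' = A minus the attribute with index `dropped`."""
    return tuple(v for i, v in enumerate(x) if i != dropped)

for j in range(n):
    for x1 in X1:
        # Lemma 6: flipping bit j yields an x0 in X0 with the same projection.
        x0 = tuple(v ^ (1 if i == j else 0) for i, v in enumerate(x1))
        assert x0 in X0 and project(x1, j) == project(x0, j)
        # Lemma 7: no other positive example shares this projection.
        assert all(project(x, j) != project(x1, j) for x in X1 if x != x1)
print("Lemmas 6 and 7 verified for n =", n)
```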

From Lemma 6, for each $x^1 \in X_1$, there exists an example $x^0 \in X_0$ which visits the same nodes as $x^1$ up to depth $n-1$. Then, since the decision tree $t$ is consistent, $x^1$ and $x^0$ are separated at an inner node $u \in V_t^{\mathrm{inner},n-1}$ at depth $n-1$. (Note that $h(t) \le n$.) From Lemma 7, there is no $x \in X_1$ ($x \neq x^1$) that visits $u$, and thus there is no example $x' \in X_0$ ($x' \neq x^0$) that visits $u$. Since there are $|X_1| = 2^n/2 = 2^{n-1}$ such examples $x^1$, $|V_t^{\mathrm{inner},n-1}| = 2^{n-1}$ holds. Then the number of leaves $|V_t^{\mathrm{leaf}}|$ equals $2 \cdot 2^{n-1} = 2^n$. Hence $\gamma(t) = \Omega(2^n)$. □

Theorem 8 Given a training set $X \subseteq \{0,1\}^n$ of $2^k$ examples (i.e., $|X| = 2^k$), for any integer $k \in [1, n]$, there exists an oracle $y$ such that $\gamma_F(c) = O(k)$ and $\gamma_T(c) = \Omega(2^k)$ hold for any classifier $c$ with $e(c) = 0$.

PROOF: We associate each $k$-dimensional vector $z \in \{0,1\}^k$ with an $n$-dimensional vector $x_z$ whose first $k$ attribute values $a_1(x_z), a_2(x_z), \ldots, a_k(x_z)$ are the same as in $z$, and whose remaining $n-k$ attribute values are set by duplicating these $k$ values, i.e., $a_{sk+j}(x_z) = a_j(x_z)$ for $j = 1, 2, \ldots, k$ and integers $s \ge 1$ with $sk + j \le n$.
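A minimal sketch of this cyclic duplication (our own, with 0-based indexing and a name of our choosing):

```python
def embed(z, n):
    """Map a k-dimensional vector z to the n-dimensional vector x_z in which
    attribute a_{sk+j} repeats a_j, i.e., x_z[i] = z[i mod k] (0-based indexing)."""
    k = len(z)
    return tuple(z[i % k] for i in range(n))

# Example: z = (1, 0, 1) and n = 8 give x_z = (1, 0, 1, 1, 0, 1, 1, 0).
assert embed((1, 0, 1), 8) == (1, 0, 1, 1, 0, 1, 1, 0)
```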

Let oracle $y$ be the parity function of the first $k$ attributes $a_1, a_2, \ldots, a_k$, as given in (1). Clearly, we can construct a feature $f \in F(c)$ from the first $k$ attributes such that $e(c) = 0$ and $\gamma(f) = O(k)$, as in the proof of Theorem 5.

On the other hand, we see that there is no decision tree $t \in T(c)$ such that $e(c) = 0$ and $\gamma(t) = o(2^k)$, since otherwise we would have obtained a decision tree $t$ with $\gamma(t) = o(2^k)$ in Theorem 5 (with $n$ replaced by $k$). Thus $\gamma_T(c) = \Omega(2^k)$. □

Theorem 9 Given the training set $X = \{0,1\}^n$, for every $\varepsilon \in [0, 1/2)$, there exists an oracle $y \in \mathcal{C}$ such that $\gamma_F(c) = O(n)$ and $\gamma_T(c) = \Omega((1-2\varepsilon)2^n)$ hold for any classifier $c$ with $e(c) \le \varepsilon$.

PROOF: Let oracle $y$ be the parity function as given in (1). From the proof of Theorem 5, there is a feature $f \in F(c)$ such that $e(c) = 0 \le \varepsilon$ and $\gamma(f) = O(n)$, and thus $\gamma_F(c) = O(n)$.

Now we consider a decision tree $t = ((V_t, E_t), \ell) \in T(c)$. It suffices to show that $t$ contains at least $|X|(1-2\varepsilon) = (1-2\varepsilon)2^n$ leaves, from which $\gamma(t) = \Omega((1-2\varepsilon)2^n)$ follows. By the definition of decision trees, the error rate of $c$ is given by

$$e(c) = \frac{1}{|X|}\Bigl(\sum_{w \in V_t^{\mathrm{leaf}}:\, \ell(w)=1} |X_{w,0}| + \sum_{w \in V_t^{\mathrm{leaf}}:\, \ell(w)=0} |X_{w,1}|\Bigr) \;\ge\; \frac{1}{|X|}\Bigl(\sum_{w \in V_t^{\mathrm{leaf}} \setminus V_t^{\mathrm{leaf},n}:\, \ell(w)=1} |X_{w,0}| + \sum_{w \in V_t^{\mathrm{leaf}} \setminus V_t^{\mathrm{leaf},n}:\, \ell(w)=0} |X_{w,1}|\Bigr). \qquad (3)$$

Claim 10 For each $w \in V_t^{\mathrm{leaf}} \setminus V_t^{\mathrm{leaf},n}$, it holds that $|X_{w,0}| = |X_{w,1}|$.

PROOF: Let $w \in V_t^{\mathrm{leaf},h}$ be a leaf whose depth is $h\,(<n)$, and let $A_w$ denote the set of all attributes used on the path between $w$ and the root. Suppose that we repeat branching at $w$ and its resulting children recursively, by the attributes $A' = A \setminus A_w$, until each of the resulting leaves has depth $n$. Since the parent $u$ of each of these leaves has depth $n-1$, we see that $|X_{u,1}| = |X_{u,0}| = 1$ holds from the proof of Theorem 5. This means that $|X_{w,1}| = |X_{w,0}|$ holds. □
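The balance asserted by Claim 10 reflects a basic property of parity: once fewer than $n$ attribute values are fixed, exactly half of the completions have odd parity. A quick exhaustive check for small $n$ (our own, not from the paper):

```python
from itertools import product, combinations

n = 4
for h in range(n):                              # depth of the leaf, h < n
    for fixed_attrs in combinations(range(n), h):
        for fixed_vals in product([0, 1], repeat=h):
            # examples reaching a leaf whose path fixes these attributes/values
            Xw = [x for x in product([0, 1], repeat=n)
                  if all(x[i] == v for i, v in zip(fixed_attrs, fixed_vals))]
            odd = sum(1 for x in Xw if sum(x) % 2 == 1)
            assert odd == len(Xw) - odd          # |X_{w,1}| == |X_{w,0}|
print("Claim 10 verified for n =", n)
```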

By this claim, we see that (3) can be written as follows.

$$e(c) \;\ge\; \frac{1}{|X|} \sum_{w \in V_t^{\mathrm{leaf}} \setminus V_t^{\mathrm{leaf},n}} \frac{|X_w|}{2} \;=\; \frac{1}{2|X|}\bigl(|X| - |V_t^{\mathrm{leaf},n}|\bigr) \;=\; \frac{1}{2^{n+1}}\bigl(2^n - |V_t^{\mathrm{leaf},n}|\bigr).$$

Since $c$ satisfies $e(c) \le \varepsilon$, we have $\varepsilon \ge \frac{1}{2^{n+1}}(2^n - |V_t^{\mathrm{leaf},n}|)$. Hence $|V_t^{\mathrm{leaf},n}| \ge 2^n - \varepsilon 2^{n+1} = (1-2\varepsilon)2^n$ holds, as required. □
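For a concrete numerical instance of this bound (our own check, not part of the original argument): with $n = 10$ and $\varepsilon = 1/4$,

$$|V_t^{\mathrm{leaf},n}| \;\ge\; 2^{10} - \tfrac{1}{4}\cdot 2^{11} \;=\; 1024 - 512 \;=\; 512 \;=\; \bigl(1 - 2\cdot\tfrac{1}{4}\bigr)\,2^{10}.$$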


Theorem 11 Given a training set $X \subseteq \{0,1\}^n$ of $2^k$ examples (i.e., $|X| = 2^k$), for any integer $k \in [1, n]$ and every $\varepsilon \in [0, 1/2)$, there exists an oracle $y$ such that $\gamma_F(c) = O(k)$ and $\gamma_T(c) = \Omega((1-2\varepsilon)2^k)$ hold for any classifier $c$ with $e(c) \le \varepsilon$.

PROOF: Let $X$ be the set of examples given in Theorem 8, and let oracle $y$ be the parity function of the first $k$ attributes $a_1, a_2, \ldots, a_k$ as given in (1). Then we can clearly construct a feature $f \in F(c)$ from the first $k$ attributes such that $e(c) = 0 \le \varepsilon$ and $\gamma(f) = O(k)$, as in the proof of Theorem 5.

Since $a_j(x) = a_{k+j}(x) = a_{2k+j}(x) = \cdots$ ($j = 1, 2, \ldots, k$) for all examples $x \in X$, any decision tree can be constructed by using only the first $k$ attributes. Then it is sufficient to consider constructing an $\varepsilon$-classifier from the training set $X' = \{0,1\}^k$ of $2^k$ examples over the first $k$ attributes of $A$. From Theorem 9, the required complexity is $\Omega((1-2\varepsilon)2^k)$. □

4 Conclusion

In this paper, we defined the compactness of a representation model and showed that the model F of iteratively composed features is not inferior to the model T of decision trees in terms of this compactness. However, this does not immediately imply that the representation model F always yields a learning model that produces compact representations; there remains the important task of devising a construction algorithm that actually finds a compact representation by F for a given set of examples. The current algorithm employed in the learning model $(F, C_F)$ [4] constructs a representation by repeatedly attaching new features without trying to reduce the complexity of the representations being constructed. It is our future work to design a learning model based on F that takes the complexity of representations into account, and to conduct computational experiments on the error rate achieved by the resulting model.

References

[1] D. ANGLUIN, Queries and Concept Learning, Machine Learning (1988) 2, pp. 319-342.

[2] A. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, M. K. WARMUTH, Occam's Razor, Information Processing Letters (1987) 24, pp. 377-380.

[3] L. BREIMAN, J. H. FRIEDMAN, R. A. OLSHEN, C. J. STONE, Classification and Regression Trees, Wadsworth International Group (1984).

[4] K. HARAGUCHI, T. IBARAKI, E. BOROS, Classifiers Based on Iterative Compositions of Features, Proceedings of the 1st International Conference on Knowledge Engineering and Decision Support (2004), pp. 143-150.

[5] M. GAROFALAKIS, D. HYUN, R. RASTOGI, K. SHIM, Building Decision Trees with Constraints, Data Mining and Knowledge Discovery (2003) 7, pp. 187-214.

[6] J. R. QUINLAN, C4.5: Programs for Machine Learning, Morgan Kaufmann (1993).

[7] B. D. RIPLEY, Pattern Recognition and Neural Networks, Cambridge University Press (1996).

[8] J. C. SHAFER, R. AGRAWAL, M. MEHTA, SPRINT: A Scalable Parallel Classifier for Data Mining, Proceedings of the 22nd International Conference on Very Large Data Bases (1996), pp. 544-555.

[9] L. G. VALIANT, A Theory of the Learnable, Communications of the ACM (1984) 27, pp. 1134-1142.