This section compares the compactness of two representation models, F and T. Let 𝒞 = {c | c : {0,1}^n → {0,1}} denote the set of all n-dimensional Boolean functions (hence |𝒞| = 2^{2^n}), and let c denote a Boolean function in 𝒞. R(c) denotes the set of all representations of c in a representation model R (possibly R(c) = ∅). We call γ*_R(c) the compactness of c by R, defined as the smallest complexity of a representation of c by R, i.e.,
    γ*_R(c) = min{ γ(r) | r ∈ R(c) }.
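As a small illustrative sketch (not from the paper), the set 𝒞 of all n-dimensional Boolean functions can be enumerated by truth tables, confirming |𝒞| = 2^{2^n}; the function name `all_boolean_functions` is ours:

```python
from itertools import product

def all_boolean_functions(n):
    """Enumerate all n-dimensional Boolean functions, each encoded
    by its truth table (a tuple of 2^n output bits)."""
    return [tt for tt in product((0, 1), repeat=2 ** n)]

# For n = 2 there are 2^(2^2) = 16 Boolean functions.
funcs = all_boolean_functions(2)
print(len(funcs))  # → 16
```

A representation model R then assigns to each such function a (possibly empty) set of representations, and γ*_R(c) picks the cheapest one.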
First, we show that F and T can represent any Boolean function in 𝒞. Let 𝒞_F (resp., 𝒞_T) denote the set of Boolean functions in 𝒞 for which F (resp., T) has some representation (i.e., 𝒞_F = {c ∈ 𝒞 | F(c) ≠ ∅} and 𝒞_T = {c ∈ 𝒞 | T(c) ≠ ∅}). … corresponding vector x. Then we see that c is represented by this t*. □
Note that γ(f*) = Ω(2^n) and γ(t*) = Ω(2^n) for the representations f* and t* in the proof of this theorem. From our assumption on the oracle y, f* and t* may not be the classifier representations of our interest; we are interested in more compact representations. The following theorem states that, for any Boolean function c ∈ 𝒞, we can construct a feature f ∈ F(c) from any decision tree t ∈ T(c) such that its complexity γ(f) is bounded by O(γ(t)).
96 Kazuya Haraguchi, Hiroshi Nagamochi, Toshihide Ibaraki
Theorem 3 For any Boolean function c ∈ 𝒞 and any decision tree t ∈ T(c), there exists a feature f ∈ F(c) such that γ(f) = O(γ(t)).
Proof: Let t_v denote the subtree of t = ((V_t, E_t), ℓ) that has node v ∈ V_t as its root, and let c_v denote the Boolean function represented by t_v (i.e., t_v ∈ T(c_v)). For any node v ∈ V_t, we show that there exists a feature f_v ∈ F(c_v) such that γ(f_v) ≤ d·γ(t_v) for some constant d > 0.
We prove this by induction on the height h of node v. First assume h = h(t). Then v is a leaf of t, and the subtree t_v consists of the single node v, which represents the constant function c_v that outputs ℓ(v) for every input vector x ∈ {0,1}^n. We choose a feature f_v ∈ F(c_v) that has one fan-in (e.g., f_v = f_{a…). (Features for constant functions are omitted in Fig. 2.)
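The induction behind Theorem 3 can be sketched in code. The encoding below is a hypothetical stand-in for the paper's feature model, not its actual definition: each inner node branching on attribute x_j contributes the composition f_v = (x_j ∧ f_1) ∨ (¬x_j ∧ f_0), so the size of the resulting feature grows linearly in the tree size, as the theorem requires.

```python
# Hypothetical tree encoding (not the paper's): a tree is either a leaf
# label 0/1, or a triple (j, t0, t1) branching on attribute x[j].
def tree_to_feature(t):
    """Return (eval_fn, size): a Boolean function equivalent to tree t,
    built by the rule f_v = (x_j AND f1) OR (NOT x_j AND f0), together
    with a gate count that is linear in the tree size."""
    if t in (0, 1):
        return (lambda x, b=t: b), 1          # constant feature, size 1
    j, t0, t1 = t
    f0, s0 = tree_to_feature(t0)
    f1, s1 = tree_to_feature(t1)
    f = lambda x: (x[j] and f1(x)) or ((not x[j]) and f0(x))
    return (lambda x: int(f(x))), s0 + s1 + 3  # 3 extra gates per node

# XOR of two attributes as a depth-2 tree:
t = (0, (1, 0, 1), (1, 1, 0))
f, size = tree_to_feature(t)
print([f((a, b)) for a in (0, 1) for b in (0, 1)])  # → [0, 1, 1, 0]
```

Each tree node adds a constant number of gates, which mirrors the constant d in the proof.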
Corollary 4 Given a training set X, for any ε-classifier c for some ε ∈ [0,1] and any decision tree t ∈ T(c), there exist an ε-classifier c′ and a feature f ∈ F(c′) such that γ(f) = O(γ(t)).
In what follows, we consider the converse of the above relation: does T admit a compact representation for every classifier c that admits a compact representation by F?
Theorem 5 Given a training set X = {0,1}^n, there exists an oracle y ∈ 𝒞 such that γ*_F(c) = O(n) and γ*_T(c) = Ω(2^n) hold for any classifier c ∈ 𝒞 with e(c) = 0.
Proof: Let oracle y be a parity function such that

    y(x) = 1 if Σ_{j=1}^{n} x_j is odd, and y(x) = 0 otherwise.  (1)
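The parity oracle of (1) transcribes directly into code (the function name is ours):

```python
def parity_oracle(x):
    """Oracle y of (1): y(x) = 1 iff the number of 1s in x is odd."""
    return sum(x) % 2

print(parity_oracle((1, 0, 1)))  # → 0 (two 1s, even)
print(parity_oracle((1, 1, 1)))  # → 1 (three 1s, odd)
```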
Let c ∈ 𝒞 denote a classifier with e(c) = 0. First, we show that there exists a feature f ∈ F(c) whose complexity γ(f) is bounded by O(n). For each attribute a_j ∈ A, j = 1, 2, …, n−1, we construct a feature f_j as follows:
Next, we show that the complexity of any decision tree t ∈ T(c) for a classifier c consistent with y under X = {0,1}^n is Ω(2^n). For this, we show the following two lemmas.
Lemma 6 For any A′ ⊂ A and any x^1 ∈ X^1, there exists an example x^0 ∈ X^0 such that x^1|_{A′} = x^0|_{A′} (i.e., a_j(x^1) = a_j(x^0) holds for every a_j ∈ A′).
Proof: It suffices to show that such an x^0 exists for any A′ = A \ {a_j}, j = 1, 2, …, n, and any x^1 ∈ X^1. Since X = (X^1, X^0) contains all n-dimensional vectors, we find exactly one example x^0 ∈ X^0 satisfying x^1|_{A′} = x^0|_{A′} by taking x^0 = (x^1_1, x^1_2, …, x^1_{j−1}, x̄^1_j, x^1_{j+1}, …, x^1_n). □
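The flip-one-bit construction in this proof can be checked exhaustively for a small n (a sketch with our own names; `parity` plays the role of the oracle y):

```python
from itertools import product

def parity(x):
    return sum(x) % 2

n = 4
X1 = [x for x in product((0, 1), repeat=n) if parity(x) == 1]
# Flipping any single coordinate j of x1 yields a vector that agrees
# with x1 on A' = A \ {a_j} but has the opposite parity, so it lies in X0.
for x1 in X1:
    for j in range(n):
        x0 = x1[:j] + (1 - x1[j],) + x1[j + 1:]
        assert parity(x0) == 0
print("verified for n =", n)
```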
Lemma 7 For any A′ = A \ {a_j}, j = 1, 2, …, n, and any x^1 ∈ X^1, there exists no x ∈ X^1 (x ≠ x^1) such that x^1|_{A′} = x|_{A′} holds.
Proof: Assume that there exist two distinct examples x^1, x ∈ X^1 such that x^1|_{A′} = x|_{A′} for some A′ = A \ {a_j}. Since x^1 ≠ x, we have a_j(x^1) ≠ a_j(x); the parities of x^1 and x then differ, and since y is a parity function, either x^1 ∈ X^0 or x ∈ X^0 holds, which contradicts the assumption. □
From Lemma 6, for each x^1 ∈ X^1, there exists an example x^0 ∈ X^0 that visits the same nodes as x^1 at every depth less than n−1. Then, since decision tree t is consistent, x^1 and x^0 are separated at an inner node u ∈ V_t^{inner,n−1} at depth n−1. (Note that h(t) ≤ n.) From Lemma 7, no x ∈ X^1 (x ≠ x^1) visits u, so a distinct node u exists for each x^1. Since there are |X^1| = 2^n/2 = 2^{n−1} such examples, |V_t^{inner,n−1}| = 2^{n−1} holds. Then the number of leaves |V_t^{leaf}| equals 2 · 2^{n−1} = 2^n. Hence γ(t) = Ω(2^n). □
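The 2^n leaf count can be seen concretely by building the canonical tree that branches on every attribute in order, which is consistent with the parity oracle; by the lemmas above no consistent tree can do better. The encoding is our own sketch:

```python
def build_parity_tree(n, depth=0, acc=0):
    """Build a decision tree consistent with the n-bit parity oracle by
    branching on attributes in order; return the tree and its leaf count.
    acc accumulates the number of 1s seen along the current path."""
    if depth == n:
        return acc % 2, 1                  # leaf labelled with the parity
    t0, c0 = build_parity_tree(n, depth + 1, acc)
    t1, c1 = build_parity_tree(n, depth + 1, acc + 1)
    return (depth, t0, t1), c0 + c1

for n in range(1, 6):
    _, leaves = build_parity_tree(n)
    print(n, leaves)  # leaf count is 2^n, matching the Ω(2^n) bound
```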
Theorem 8 Given a training set X ⊆ {0,1}^n of 2^k examples (i.e., |X| = 2^k), for any integer k ∈ [1, n], there exists an oracle y ∈ 𝒞 such that γ*_F(c) = O(k) and γ*_T(c) = Ω(2^k) hold for any classifier c ∈ 𝒞 with e(c) = 0.
Proof: We associate each k-dimensional vector z ∈ {0,1}^k with an n-dimensional vector x_z whose first k attribute values a_1(x_z), a_2(x_z), …, a_k(x_z) are the same as in z, and whose remaining n−k attribute values are set by duplicating these k values, i.e., a_{sk+j}(x_z) = a_j(x_z) for j = 1, 2, …, k and every integer s ≥ 1 with sk + j ≤ n.
Let oracle y be the parity function of the first k attributes a_1, a_2, …, a_k, as given in (1). Clearly, we can construct a feature f ∈ F(c) from the first k attributes such that e(c) = 0 and γ(f) = O(k), as in the proof of Theorem 5.
On the other hand, there is no decision tree t ∈ T(c) such that e(c) = 0 and γ(t) = o(2^k), since otherwise we would obtain a decision tree of complexity o(2^k) for the k-dimensional parity instance, contradicting Theorem 5 with n replaced by k. Thus γ*_T(c) = Ω(2^k). □
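The attribute-duplication embedding used in this proof can be sketched as follows (0-indexed; the name `embed` is ours). Position i of x_z copies position i mod k of z, which is exactly the rule a_{sk+j}(x_z) = a_j(x_z):

```python
def embed(z, n):
    """Map a k-dimensional vector z to an n-dimensional vector x_z by
    cyclically duplicating the k attribute values."""
    k = len(z)
    return tuple(z[i % k] for i in range(n))

print(embed((1, 0, 1), 7))  # → (1, 0, 1, 1, 0, 1, 1)
```

Under this embedding the parity of the first k attributes of x_z equals the parity of z, so the k-dimensional hard instance of Theorem 5 is preserved.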
Theorem 9 Given a training set X = {0,1}^n, for every ε ∈ [0, 1/2), there exists an oracle y ∈ 𝒞 such that γ*_F(c) = O(n) and γ*_T(c) = Ω((1−2ε)2^n) hold for any classifier c ∈ 𝒞 with e(c) ≤ ε.
Proof: Let oracle y be the parity function given in (1). From the proof of Theorem 5, there is a feature f ∈ F(c) such that e(c) = 0 ≤ ε and γ(f) = O(n), and thus γ*_F(c) = O(n).
Now we consider a decision tree t = ((V_t, E_t), ℓ) ∈ T(c). It suffices to show that t contains at least |X|(1−2ε) = (1−2ε)2^n leaves, from which γ(t) = Ω((1−2ε)2^n) follows. By the definition of decision trees, the error rate of c is given by
    e(c) = (1/|X|) ( Σ_{w ∈ V_t^{leaf} : ℓ(w)=1} |X_{w,0}| + Σ_{w ∈ V_t^{leaf} : ℓ(w)=0} |X_{w,1}| )
         ≥ (1/|X|) ( Σ_{w ∈ V_t^{leaf} \ V_t^{leaf,n} : ℓ(w)=1} |X_{w,0}| + Σ_{w ∈ V_t^{leaf} \ V_t^{leaf,n} : ℓ(w)=0} |X_{w,1}| ).  (3)
Claim 10 For each w ∈ V_t^{leaf} \ V_t^{leaf,n}, it holds that |X_{w,0}| = |X_{w,1}|.
Proof: Let w ∈ V_t^{leaf,h} be a leaf whose depth is h (< n), and let A_w denote the set of attributes used on the path between w and the root. Suppose that we branch at w, and recursively at the resulting children, on the attributes in A′ = A \ A_w until every resulting leaf has depth n. Since the parent u of each of these leaves has depth n−1, we see that |X_{u,1}| = |X_{u,0}| = 1 holds from the proof of Theorem 5. This means that |X_{w,1}| = |X_{w,0}| holds. □
By this claim, we see that (3) can be written as follows.
    e(c) ≥ (1/|X|) Σ_{w ∈ V_t^{leaf} \ V_t^{leaf,n}} |X_w|/2
         = (1/(2|X|)) (|X| − |V_t^{leaf,n}|)
         = (1/2^{n+1}) (2^n − |V_t^{leaf,n}|).
Since c satisfies e(c) ≤ ε, we have ε ≥ (1/2^{n+1})(2^n − |V_t^{leaf,n}|). Hence |V_t^{leaf,n}| ≥ 2^n − ε·2^{n+1} = (1−2ε)2^n holds, as required. □
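As a sanity check of the bound at |V_t^{leaf,n}| = 0 (a sketch, not from the paper): a classifier that ignores the last attribute, like a tree with no depth-n leaves, errs on exactly half of X = {0,1}^n, matching e(c) ≥ (2^n − 0)/2^{n+1} = 1/2:

```python
from itertools import product

def parity(x):
    return sum(x) % 2

n = 4
X = list(product((0, 1), repeat=n))
# A classifier that never reads the last attribute errs exactly on the
# vectors whose last bit is 1, i.e., on half of X.
c = lambda x: parity(x[:-1])
errors = sum(c(x) != parity(x) for x in X)
print(errors, len(X))  # errors = 2^{n-1}, i.e., e(c) = 1/2
```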
Theorem 11 Given a training set X ⊆ {0,1}^n of 2^k examples (i.e., |X| = 2^k), for any integer k ∈ [1, n] and every ε ∈ [0, 1/2), there exists an oracle y ∈ 𝒞 such that γ*_F(c) = O(k) and γ*_T(c) = Ω((1−2ε)2^k) hold for any classifier c ∈ 𝒞 with e(c) ≤ ε.
Proof: Let X be the set of examples given in Theorem 8, and let oracle y be the parity function of the first k attributes a_1, a_2, …, a_k, as given in (1). Then we can clearly construct a feature f ∈ F(c) from the first k attributes such that e(c) = 0 ≤ ε and γ(f) = O(k), as in the proof of Theorem 5.
Since a_j(x) = a_{k+j}(x) = a_{2k+j}(x) = … (j = 1, 2, …, k) for all examples x ∈ X, any decision tree can be constructed using only the first k attributes. Then it is sufficient to consider constructing an ε-classifier from the training set X′ = {0,1}^k of 2^k examples over the first k attributes of A. From Theorem 9, the required complexity is Ω((1−2ε)2^k). □
4 Conclusion
In this paper, we defined the compactness of a representation model and showed that the model of iteratively composed features F is not inferior to decision trees T in terms of this compactness. However, this does not immediately imply that the representation model F always provides a learning model that produces compact representations; there remains the important task of discovering a construction algorithm that actually finds a compact representation by F for a given set of examples. The current algorithm employed in the learning model (F, C_F) [4] constructs a representation by repeatedly attaching new features without trying to reduce the complexity of the representations being constructed. It is our future work to design a learning model by F that takes the complexity of representations into account, and to conduct computational experiments on the error rate achieved by the new model.
References
[1] D. ANGLUIN, Queries and Concept Learning, Machine Learning (1988) 2, pp. 319-342.
[2] A. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, M. K. WARMUTH, Occam's Razor, Information Processing Letters (1987) 24, pp. 377-380.
[3] L. BREIMAN, J. H. FRIEDMAN, R. A. OLSHEN, C. J. STONE, Classification and Regression Trees, Wadsworth International Group (1984).
[4] K. HARAGUCHI, T. IBARAKI, E. BOROS, Classifiers Based on Iterative Compositions of Features, Proceedings of the 1st International Conference on Knowledge Engineering and Decision Support (2004), pp. 143-150.
[5] M. GAROFALAKIS, D. HYUN, R. RASTOGI, K. SHIM, Building Decision Trees with Constraints, Data Mining and Knowledge Discovery (2003) 7, pp. 187-214.
[6] J. R. QUINLAN, C4.5: Programs for Machine Learning, Morgan Kaufmann (1993).
[7] B. D. RIPLEY, Pattern Recognition and Neural Networks, Cambridge University Press (1996).
[8] J. C. SHAFER, R. AGRAWAL, M. MEHTA, SPRINT: A Scalable Parallel Classifier for Data Mining, Proceedings of the 22nd International Conference on Very Large Data Bases (1996), pp. 544-555.
[9] L. G. VALIANT, A Theory of the Learnable, Communications of the ACM (1984) 27, pp. 1134-1142.