This section compares the compactness of two representation models, F and T. Let 𝒞 = {c | c : {0,1}^n → {0,1}} denote the set of all n-dimensional Boolean functions (hence |𝒞| = 2^{2^n}), and let c denote a Boolean function in 𝒞. R(c) denotes the set of all representations of c in a representation model R (possibly R(c) = ∅). We call γ*_R(c) the compactness of c by R, defined as the smallest complexity of a representation of c by R, i.e.,
    γ*_R(c) = min{ γ(r) | r ∈ R(c) }.
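As a small illustrative sketch (not from the paper), the set 𝒞 of all n-dimensional Boolean functions can be enumerated by truth tables, confirming |𝒞| = 2^{2^n}; the function name `all_boolean_functions` is ours:

```python
from itertools import product

def all_boolean_functions(n):
    """Enumerate all n-dimensional Boolean functions, each encoded
    by its truth table (a tuple of 2^n output bits)."""
    return [tt for tt in product((0, 1), repeat=2 ** n)]

# For n = 2 there are 2^(2^2) = 16 Boolean functions.
funcs = all_boolean_functions(2)
print(len(funcs))  # → 16
```

A representation model R then assigns to each such function a (possibly empty) set of representations, and γ*_R(c) picks the cheapest one.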
First, we show that F and T can represent any Boolean function in 𝒞. Let 𝒞_F (resp., 𝒞_T) denote the set of Boolean functions in 𝒞 for which F (resp., T) has some representation (i.e., 𝒞_F = {c ∈ 𝒞 | F(c) ≠ ∅} and 𝒞_T = {c ∈ 𝒞 | T(c) ≠ ∅}). … corresponding vector x. Then we see that c is represented by this t*. □
Note that γ(f*) = Ω(2^n) and γ(t*) = Ω(2^n) for the representations f* and t* in the proof of this theorem. From our assumption on the oracle y, f* and t* may not be the classifier representations of our interest; we are interested in more compact representations. The following theorem states that, for any Boolean function c ∈ 𝒞, we can construct a feature f ∈ F(c) from any decision tree t ∈ T(c) such that its complexity γ(f) is bounded by O(γ(t)).
96 Kazuya Haraguchi, Hiroshi Nagamochi, Toshihide Ibaraki
Theorem 3 For any Boolean function c ∈ 𝒞 and any decision tree t ∈ T(c), there exists a feature f ∈ F(c) such that γ(f) = O(γ(t)).
Proof: Let t_v denote the subtree of t = ((V_t, E_t), ℓ) that has node v ∈ V_t as its root, and let c_v denote the Boolean function represented by t_v (i.e., t_v ∈ T(c_v)). For any node v ∈ V_t, we show that there exists a feature f_v ∈ F(c_v) such that γ(f_v) ≤ d·γ(t_v) for some constant d > 0.
We prove this by induction on the height h of node v. First assume h = h(t). Then v is a leaf of t, and the subtree t_v consists of the single node v, which represents the constant function c_v that outputs ℓ(v) for every input vector x ∈ {0,1}^n. We choose a feature f_v ∈ F(c_v) that has one fan-in (e.g., f_v = f_{a…). (Features for constant functions are omitted in Fig. 2.)
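The induction behind Theorem 3 can be sketched in code. The encoding below is a hypothetical stand-in for the paper's feature model, not its actual definition: each inner node branching on attribute x_j contributes the composition f_v = (x_j ∧ f_1) ∨ (¬x_j ∧ f_0), so the size of the resulting feature grows linearly in the tree size, as the theorem requires.

```python
# Hypothetical tree encoding (not the paper's): a tree is either a leaf
# label 0/1, or a triple (j, t0, t1) branching on attribute x[j].
def tree_to_feature(t):
    """Return (eval_fn, size): a Boolean function equivalent to tree t,
    built by the rule f_v = (x_j AND f1) OR (NOT x_j AND f0), together
    with a gate count that is linear in the tree size."""
    if t in (0, 1):
        return (lambda x, b=t: b), 1          # constant feature, size 1
    j, t0, t1 = t
    f0, s0 = tree_to_feature(t0)
    f1, s1 = tree_to_feature(t1)
    f = lambda x: (x[j] and f1(x)) or ((not x[j]) and f0(x))
    return (lambda x: int(f(x))), s0 + s1 + 3  # 3 extra gates per node

# XOR of two attributes as a depth-2 tree:
t = (0, (1, 0, 1), (1, 1, 0))
f, size = tree_to_feature(t)
print([f((a, b)) for a in (0, 1) for b in (0, 1)])  # → [0, 1, 1, 0]
```

Each tree node adds a constant number of gates, which mirrors the constant d in the proof.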
Corollary 4 Given a training set X, for any ε-classifier c for some ε ∈ [0,1] and any decision tree t ∈ T(c), there exist an ε-classifier c′ and a feature f ∈ F(c′) such that γ(f) = O(γ(t)).
In what follows, we consider the converse of the above relation: does T admit a compact representation for every classifier c that admits a compact representation by F?
Theorem 5 Given a training set X = {0,1}^n, there exists an oracle y ∈ 𝒞 such that γ*_F(c) = O(n) and γ*_T(c) = Ω(2^n) hold for any classifier c ∈ 𝒞 with e(c) = 0.
Proof: Let oracle y be a parity function such that

    y(x) = 1 if Σ_{j=1}^{n} x_j is odd, and y(x) = 0 otherwise.  (1)
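The parity oracle of (1) transcribes directly into code (the function name is ours):

```python
def parity_oracle(x):
    """Oracle y of (1): y(x) = 1 iff the number of 1s in x is odd."""
    return sum(x) % 2

print(parity_oracle((1, 0, 1)))  # → 0 (two 1s, even)
print(parity_oracle((1, 1, 1)))  # → 1 (three 1s, odd)
```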
Let c ∈ 𝒞 denote a classifier with e(c) = 0. First, we show that there exists a feature f ∈ F(c) whose complexity γ(f) is bounded by O(n). For each attribute a_j ∈ A, j = 1, 2, …, n−1, we construct a feature f_j as follows:
Next, we show that the complexity of any decision tree t ∈ T(c) for a classifier c consistent with y under X = {0,1}^n is Ω(2^n). For this, we show the following two lemmas.
Lemma 6 For any A′ ⊂ A and any x^1 ∈ X^1, there exists an example x^0 ∈ X^0 such that x^1|_{A′} = x^0|_{A′} (i.e., a_j(x^1) = a_j(x^0) holds for every a_j ∈ A′).
Proof: It suffices to show that such an x^0 exists for any A′ = A \ {a_j}, j = 1, 2, …, n, and any x^1 ∈ X^1. Since X = (X^1, X^0) contains all n-dimensional vectors, we find exactly one example x^0 ∈ X^0 satisfying x^1|_{A′} = x^0|_{A′} by taking x^0 = (x^1_1, x^1_2, …, x^1_{j−1}, x̄^1_j, x^1_{j+1}, …, x^1_n). □
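The flip-one-bit construction in this proof can be checked exhaustively for a small n (a sketch with our own names; `parity` plays the role of the oracle y):

```python
from itertools import product

def parity(x):
    return sum(x) % 2

n = 4
X1 = [x for x in product((0, 1), repeat=n) if parity(x) == 1]
# Flipping any single coordinate j of x1 yields a vector that agrees
# with x1 on A' = A \ {a_j} but has the opposite parity, so it lies in X0.
for x1 in X1:
    for j in range(n):
        x0 = x1[:j] + (1 - x1[j],) + x1[j + 1:]
        assert parity(x0) == 0
print("verified for n =", n)
```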
Lemma 7 For any A′ = A \ {a_j}, j = 1, 2, …, n, and any x^1 ∈ X^1, there exists no x ∈ X^1 (x ≠ x^1) such that x^1|_{A′} = x|_{A′} holds.
Proof: Assume that there exist two distinct examples x^1, x ∈ X^1 such that x^1|_{A′} = x|_{A′} for some A′ = A \ {a_j}. Since x^1 ≠ x, we have a_j(x^1) ≠ a_j(x); the parities of x^1 and x then differ, and since y is a parity function, either x^1 ∈ X^0 or x ∈ X^0 holds, which contradicts the assumption. □
From Lemma 6, for each x^1 ∈ X^1, there exists an example x^0 ∈ X^0 that visits the same nodes as x^1 at every depth less than n−1. Then, since decision tree t is consistent, x^1 and x^0 are separated at an inner node u ∈ V_t^{inner,n−1} at depth n−1. (Note that h(t) ≤ n.) From Lemma 7, no x ∈ X^1 (x ≠ x^1) visits u, so a distinct node u exists for each x^1. Since there are |X^1| = 2^n/2 = 2^{n−1} such examples, |V_t^{inner,n−1}| = 2^{n−1} holds. Then the number of leaves |V_t^{leaf}| equals 2 · 2^{n−1} = 2^n. Hence γ(t) = Ω(2^n). □
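The 2^n leaf count can be seen concretely by building the canonical tree that branches on every attribute in order, which is consistent with the parity oracle; by the lemmas above no consistent tree can do better. The encoding is our own sketch:

```python
def build_parity_tree(n, depth=0, acc=0):
    """Build a decision tree consistent with the n-bit parity oracle by
    branching on attributes in order; return the tree and its leaf count.
    acc accumulates the number of 1s seen along the current path."""
    if depth == n:
        return acc % 2, 1                  # leaf labelled with the parity
    t0, c0 = build_parity_tree(n, depth + 1, acc)
    t1, c1 = build_parity_tree(n, depth + 1, acc + 1)
    return (depth, t0, t1), c0 + c1

for n in range(1, 6):
    _, leaves = build_parity_tree(n)
    print(n, leaves)  # leaf count is 2^n, matching the Ω(2^n) bound
```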
Theorem 8 Given a training set X ⊆ {0,1}^n of 2^k examples (i.e., |X| = 2^k), for any integer k ∈ [1, n], there exists an oracle y ∈ 𝒞 such that γ*_F(c) = O(k) and γ*_T(c) = Ω(2^k) hold for any classifier c ∈ 𝒞 with e(c) = 0.
Proof: We associate each k-dimensional vector z ∈ {0,1}^k with an n-dimensional vector x_z whose first k attribute values a_1(x_z), a_2(x_z), …, a_k(x_z) are the same as in z, and whose remaining n−k attribute values are set by duplicating these k values, i.e., a_{sk+j}(x_z) = a_j(x_z) for j = 1, 2, …, k and every integer s ≥ 1 with sk + j ≤ n.
Let oracle y be the parity function of the first k attributes a_1, a_2, …, a_k, as given in (1). Clearly, we can construct a feature f ∈ F(c) from the first k attributes such that e(c) = 0 and γ(f) = O(k), as in the proof of Theorem 5.
On the other hand, there is no decision tree t ∈ T(c) such that e(c) = 0 and γ(t) = o(2^k), since otherwise we would obtain a decision tree of complexity o(2^k) for the k-dimensional parity instance, contradicting Theorem 5 with n replaced by k. Thus γ*_T(c) = Ω(2^k). □
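The attribute-duplication embedding used in this proof can be sketched as follows (0-indexed; the name `embed` is ours). Position i of x_z copies position i mod k of z, which is exactly the rule a_{sk+j}(x_z) = a_j(x_z):

```python
def embed(z, n):
    """Map a k-dimensional vector z to an n-dimensional vector x_z by
    cyclically duplicating the k attribute values."""
    k = len(z)
    return tuple(z[i % k] for i in range(n))

print(embed((1, 0, 1), 7))  # → (1, 0, 1, 1, 0, 1, 1)
```

Under this embedding the parity of the first k attributes of x_z equals the parity of z, so the k-dimensional hard instance of Theorem 5 is preserved.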
Theorem 9 Given a training set X = {0,1}^n, for every ε ∈ [0, 1/2), there exists an oracle y ∈ 𝒞 such that γ*_F(c) = O(n) and γ*_T(c) = Ω((1−2ε)2^n) hold for any classifier c ∈ 𝒞 with e(c) ≤ ε.
Proof: Let oracle y be the parity function given in (1). From the proof of Theorem 5, there is a feature f ∈ F(c) such that e(c) = 0 ≤ ε and γ(f) = O(n), and thus γ*_F(c) = O(n).
Now we consider a decision tree t = ((V_t, E_t), ℓ) ∈ T(c). It suffices to show that t contains at least |X|(1−2ε) = (1−2ε)2^n leaves, from which γ(t) = Ω((1−2ε)2^n) follows. By the definition of decision trees, the error rate of c is given by
    e(c) = (1/|X|) ( Σ_{w ∈ V_t^{leaf} : ℓ(w)=1} |X_{w,0}| + Σ_{w ∈ V_t^{leaf} : ℓ(w)=0} |X_{w,1}| )
         ≥ (1/|X|) ( Σ_{w ∈ V_t^{leaf} \ V_t^{leaf,n} : ℓ(w)=1} |X_{w,0}| + Σ_{w ∈ V_t^{leaf} \ V_t^{leaf,n} : ℓ(w)=0} |X_{w,1}| ).  (3)
Claim 10 For each w ∈ V_t^{leaf} \ V_t^{leaf,n}, it holds that |X_{w,0}| = |X_{w,1}|.
Proof: Let w ∈ V_t^{leaf,h} be a leaf whose depth is h (< n), and let A_w denote the set of attributes used on the path between w and the root. Suppose that we branch at w, and recursively at the resulting children, on the attributes in A′ = A \ A_w until every resulting leaf has depth n. Since the parent u of each of these leaves has depth n−1, we see that |X_{u,1}| = |X_{u,0}| = 1 holds from the proof of Theorem 5. This means that |X_{w,1}| = |X_{w,0}| holds. □
By this claim, we see that (3) can be written as follows.
    e(c) ≥ (1/|X|) Σ_{w ∈ V_t^{leaf} \ V_t^{leaf,n}} |X_w|/2
         = (1/(2|X|)) (|X| − |V_t^{leaf,n}|)
         = (1/2^{n+1}) (2^n − |V_t^{leaf,n}|).
Since c satisfies e(c) ≤ ε, we have ε ≥ (1/2^{n+1})(2^n − |V_t^{leaf,n}|). Hence |V_t^{leaf,n}| ≥ 2^n − ε·2^{n+1} = (1−2ε)2^n holds, as required. □
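As a sanity check of the bound at |V_t^{leaf,n}| = 0 (a sketch, not from the paper): a classifier that ignores the last attribute, like a tree with no depth-n leaves, errs on exactly half of X = {0,1}^n, matching e(c) ≥ (2^n − 0)/2^{n+1} = 1/2:

```python
from itertools import product

def parity(x):
    return sum(x) % 2

n = 4
X = list(product((0, 1), repeat=n))
# A classifier that never reads the last attribute errs exactly on the
# vectors whose last bit is 1, i.e., on half of X.
c = lambda x: parity(x[:-1])
errors = sum(c(x) != parity(x) for x in X)
print(errors, len(X))  # errors = 2^{n-1}, i.e., e(c) = 1/2
```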
Theorem 11 Given a training set X ⊆ {0,1}^n of 2^k examples (i.e., |X| = 2^k), for any integer k ∈ [1, n] and every ε ∈ [0, 1/2), there exists an oracle y ∈ 𝒞 such that γ*_F(c) = O(k) and γ*_T(c) = Ω((1−2ε)2^k) hold for any classifier c ∈ 𝒞 with e(c) ≤ ε.
Proof: Let X be the set of examples given in Theorem 8, and let oracle y be the parity function of the first k attributes a_1, a_2, …, a_k, as given in (1). Then we can clearly construct a feature f ∈ F(c) from the first k attributes such that e(c) = 0 ≤ ε and γ(f) = O(k), as in the proof of Theorem 5.
Since a_j(x) = a_{k+j}(x) = a_{2k+j}(x) = … (j = 1, 2, …, k) for all examples x ∈ X, any decision tree can be constructed using only the first k attributes. Then it is sufficient to consider constructing an ε-classifier from the training set X′ = {0,1}^k of 2^k examples over the first k attributes of A. From Theorem 9, the required complexity is Ω((1−2ε)2^k). □
4 Conclusion
In this paper, we defined the compactness of a representation model and showed that the model of iteratively composed features F is not inferior to decision trees T in terms of this compactness. However, this does not immediately imply that the representation model F always provides a learning model that produces compact representations; there remains the important task of discovering a construction algorithm that actually finds a compact representation by F for a given set of examples. The current algorithm employed in the learning model (F, C_F) [4] constructs a representation by repeatedly attaching new features without trying to reduce the complexity of the representations being constructed. It is our future work to design a learning model by F that takes the complexity of representations into account, and to conduct computational experiments on the error rate achieved by the new model.
References
[1] D. ANGLUIN, Queries and Concept Learning, Machine Learning (1988) 2, pp. 319-342.
[2] A. BLUMER, A. EHRENFEUCHT, D. HAUSSLER, M. K. WARMUTH, Occam's Razor, Information Processing Letters (1987) 24, pp. 377-380.
[3] L. BREIMAN, J. H. FRIEDMAN, R. A. OLSHEN, C. J. STONE, Classification and Regression Trees, Wadsworth International Group (1984).
[4] K. HARAGUCHI, T. IBARAKI, E. BOROS, Classifiers Based on Iterative Compositions of Features, Proceedings of the 1st International Conference on Knowledge Engineering and Decision Support (2004), pp. 143-150.
[5] M. GAROFALAKIS, D. HYUN, R. RASTOGI, K. SHIM, Building Decision Trees with Constraints, Data Mining and Knowledge Discovery (2003) 7, pp. 187-214.
[6] J. R. QUINLAN, C4.5: Programs for Machine Learning, Morgan Kaufmann (1993).
[7] B. D. RIPLEY, Pattern Recognition and Neural Networks, Cambridge University Press (1996).
[8] J. C. SHAFER, R. AGRAWAL, M. MEHTA, SPRINT: A Scalable Parallel Classifier for Data Mining, Proceedings of the 22nd International Conference on Very Large Data Bases (1996), pp. 544-555.
[9] L. G. VALIANT, A Theory of the Learnable, Communications of the ACM (1984) 27, pp. 1134-1142.