
A Few Logs Suffice to Build Almost All Trees (I)


FEW LOGS SUFFICE TO BUILD ALMOST ALL TREES 171

Step 3. If for all $w$, DCTC applied to $Q_w$ returns Insufficient or Inconsistent, then Return Fail.

We now show that this method accurately reconstructs the tree $T$ if $\mathcal{A} \cap \mathcal{B} \neq \emptyset$ [i.e., if hypothesis (14) holds].

Theorem 7. Let $T$ be a fixed binary tree. The Dyadic Closure Method returns $T$ if hypothesis (14) holds, and runs in $O(n^5 \log n)$ time on any input.

Proof. If $w \in \mathcal{A} \cap \mathcal{B}$, then DCTC applied to $Q_w$ returns the correct tree $T$ by Theorem 6. Hypothesis (14) implies that $\mathcal{A} \cap \mathcal{B} \neq \emptyset$, hence the Dyadic Closure Method returns a tree if it examines any width in that intersection; hence, we need only prove that DCM either examines a width in that intersection, or else reconstructs the correct tree for some other width. This follows directly from Theorem 6.

The running time analysis is easy. Since we do a binary search, the DCTC algorithm is called at most $O(\log n)$ times. The dyadic closure phase of the DCTC algorithm costs $O(n^5)$ time, by Lemma 5, and reconstructing the tree $T$ from $cl(Q)$ uses at most $O(n^5)$ time using standard techniques. ∎
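The binary search behind the $O(\log n)$ call count can be sketched as follows. This is only an illustration under stated assumptions: Steps 1 and 2 of the method are not reproduced here, so the search direction (a larger width after Insufficient, a smaller width after Inconsistent), the function name `dcm_binary_search`, and the `dctc` callback are all hypothetical stand-ins rather than the authors' procedure.

```python
INSUFFICIENT = "Insufficient"
INCONSISTENT = "Inconsistent"
FAIL = "Fail"

def dcm_binary_search(widths, dctc):
    """Binary search over sorted candidate widths.

    widths: candidate widths in ascending order (assumed input).
    dctc:   hypothetical callback; dctc(w) returns a tree, INSUFFICIENT,
            or INCONSISTENT for the quartet set built from width w.
    Returns (result, number_of_dctc_calls).
    """
    lo, hi = 0, len(widths) - 1
    calls = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        result = dctc(widths[mid])
        calls += 1
        if result == INSUFFICIENT:
            lo = mid + 1   # assumption: too few quartets, try a larger width
        elif result == INCONSISTENT:
            hi = mid - 1   # assumption: a wrong quartet slipped in, try a smaller width
        else:
            return result, calls
    return FAIL, calls

# Toy oracle: widths 40..45 play the role of the "good" intersection.
def toy_dctc(w):
    if w < 40:
        return INSUFFICIENT
    if w > 45:
        return INCONSISTENT
    return "T"

result, calls = dcm_binary_search(list(range(100)), toy_dctc)
```

With 100 candidate widths the loop makes at most 7 calls, which is the logarithmic factor in the running time of Theorem 7.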

Note that we have only guaranteed performance for DCM when $\mathcal{A} \cap \mathcal{B} \neq \emptyset$; indeed, when $\mathcal{A} \cap \mathcal{B} = \emptyset$, we have no guarantee that DCM will return the correct tree. In the following section, we discuss the ramifications of this requirement for accuracy, and show that the sequence length needed to guarantee that $\mathcal{A} \cap \mathcal{B} \neq \emptyset$ with high probability is actually not very large.

6. PERFORMANCE OF DYADIC CLOSURE METHOD FOR TREE RECONSTRUCTION UNDER THE NEYMAN 2-STATE MODEL

In this section we analyze the performance of a distance-based application of DCM to reconstruct trees under the Neyman 2-state model under two standard distributions.

6.1. Analysis of the Dyadic Closure Method

Our analysis of the Dyadic Closure Method has two parts. In the first part, we establish the probability that the estimation (using the Four-Point Method) of the split induced by a given quartet is correct. In the second part, we establish the probability that the greedy method we use contains all short quartets but no incorrectly analyzed quartet.

Our analysis of the performance of the DCM method depends heavily on the following two lemmas:

Lemma 4 (Azuma–Hoeffding inequality, see [3]). Suppose $X = (X_1, X_2, \ldots, X_k)$ are independent random variables taking values in any set $S$, and $L: S^k \to \mathbb{R}$ is any function that satisfies the condition: $|L(u) - L(v)| \le t$ whenever $u$ and $v$ differ at just

ERDŐS ET AL. 172

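one coordinate. Then,

$$P\big[L(X) - E[L(X)] \ge \lambda\big] \le \exp\left(-\frac{\lambda^2}{2 t^2 k}\right),$$

$$P\big[L(X) - E[L(X)] \le -\lambda\big] \le \exp\left(-\frac{\lambda^2}{2 t^2 k}\right). \qquad ∎$$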
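To get a feel for the size of this bound, the following minimal sketch evaluates the right-hand side $\exp(-\lambda^2 / (2 t^2 k))$ for given $\lambda$, $t$, and $k$. The function name `azuma_hoeffding_bound` and the sample values are illustrative assumptions, not part of the paper.

```python
import math

def azuma_hoeffding_bound(lam, t, k):
    """Right-hand side of Lemma 4: exp(-lam^2 / (2 * t^2 * k)).

    lam: deviation threshold (lambda > 0)
    t:   bound on the change of L when one coordinate changes
    k:   number of independent coordinates
    """
    return math.exp(-lam ** 2 / (2 * t ** 2 * k))

# Illustrative values: k = 100 coordinates, each able to move L by at most 1/k.
k = 100
bound = azuma_hoeffding_bound(lam=0.1, t=1 / k, k=k)
```

When $t = 1/k$ the exponent simplifies to $-\lambda^2 k / 2$, so for a fixed deviation $\lambda$ the bound decays exponentially in the number of coordinates.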

We define the standard $L_\infty$ metric on distance matrices, $L_\infty(d, d') = \max_{ij} |d_{ij} - d'_{ij}|$. The following discussion relies upon definitions and notations from Section 2.

Lemma 5. Let $T$ be an edge weighted binary tree with four leaves $i, j, k, l$, let $D$ be the additive distance matrix on these four leaves defined by $T$, and let $x$ be the weight on the single internal edge in $T$. Let $d$ be an arbitrary distance matrix on the four leaves. Then the Four-Point Method infers the split induced by $T$ from $d$ if $L_\infty(d, D) < x/2$.

Proof. Suppose that $L_\infty(d, D) < x/2$, and assume that $T$ has the valid split $ij|kl$. Note that the four-point method will return a single quartet split, $ij|kl$, if and only if $d_{ij} + d_{kl} < \min\{d_{ik} + d_{jl},\, d_{il} + d_{jk}\}$. Note that since $ij|kl$ is a valid quartet split in $T$, $D_{ij} + D_{kl} + 2x = D_{ik} + D_{jl} = D_{il} + D_{jk}$. Since $L_\infty(d, D) < x/2$, it follows that

$$d_{ij} + d_{kl} < D_{ij} + D_{kl} + x,$$

$$d_{ik} + d_{jl} > D_{ik} + D_{jl} - x, \text{ and}$$

$$d_{il} + d_{jk} > D_{il} + D_{jk} - x,$$

with the consequence that $d_{ij} + d_{kl}$ is the unique smallest of the three pairwise sums. ∎
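Lemma 5 can be checked on a small example. The sketch below is a minimal implementation of the four-point comparison of the three pairwise sums, applied to an additive matrix that has been perturbed by less than $x/2$ in every entry. The helper names (`pair`, `l_inf`, `four_point_method`, `quartet_distances`), the dictionary representation of distance matrices, and the sample edge weights are assumptions made for illustration.

```python
from itertools import combinations

def pair(a, b):
    """Unordered leaf pair used as a dictionary key."""
    return frozenset((a, b))

def l_inf(d, d_prime):
    """L-infinity distance between two distance matrices with the same keys."""
    return max(abs(d[p] - d_prime[p]) for p in d)

def four_point_method(d, leaves):
    """Return every split whose pairwise sum is smallest (one entry if unique)."""
    i, j, k, l = leaves
    sums = {
        f"{i}{j}|{k}{l}": d[pair(i, j)] + d[pair(k, l)],
        f"{i}{k}|{j}{l}": d[pair(i, k)] + d[pair(j, l)],
        f"{i}{l}|{j}{k}": d[pair(i, l)] + d[pair(j, k)],
    }
    smallest = min(sums.values())
    return [split for split, total in sums.items() if total == smallest]

def quartet_distances(pendant, x):
    """Additive matrix D for the quartet tree ij|kl with internal edge weight x."""
    D = {}
    for a, b in combinations("ijkl", 2):
        same_side = {a, b} in ({"i", "j"}, {"k", "l"})
        D[pair(a, b)] = pendant[a] + pendant[b] + (0.0 if same_side else x)
    return D

# Internal edge x = 0.4, so Lemma 5 tolerates perturbations below 0.2.
x = 0.4
D = quartet_distances({"i": 1.0, "j": 1.0, "k": 1.0, "l": 1.0}, x)
# Worst-case direction: inflate the two "short" distances, deflate the others.
d = {p: (v + 0.19 if v == 2.0 else v - 0.19) for p, v in D.items()}
inferred = four_point_method(d, "ijkl")
```

Here every entry of `d` is within 0.19 of `D`, which is below $x/2 = 0.2$, and the sum for $ij|kl$ remains the unique minimum, as the lemma predicts.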

Recall that DCM applied to the Neyman 2-state model computes quartet splits using the four-point method (FPM).

Theorem 8. Assume that $z$ is a lower bound for the transition probability of any edge of a tree $T$ in the Neyman 2-state model, and that $y \ge \max E^{ij}$ is an upper bound on the compound changing probability over all $ij$ paths in a quartet $q$ of $T$. The probability that FPM fails to return the correct quartet split on $q$ from $k$ sites is at most

$$18 \exp\left(-\frac{\left(1 - \sqrt{1 - 2z}\right)^2 (1 - 2y)^2\, k}{8}\right). \tag{15}$$
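As a rough numerical illustration of bound (15), the sketch below evaluates it for given $z$, $y$, $k$ and inverts it to find the smallest $k$ that pushes the bound below a target failure probability. The function names and the sample parameter values are hypothetical choices for illustration only.

```python
import math

def fpm_failure_bound(z, y, k):
    """Bound (15): 18 * exp(-(1 - sqrt(1 - 2z))^2 * (1 - 2y)^2 * k / 8)."""
    rate = (1 - math.sqrt(1 - 2 * z)) ** 2 * (1 - 2 * y) ** 2 / 8
    return 18 * math.exp(-rate * k)

def sites_for_failure_at_most(z, y, delta):
    """Smallest integer k for which bound (15) is at most delta (0 < delta < 18)."""
    rate = (1 - math.sqrt(1 - 2 * z)) ** 2 * (1 - 2 * y) ** 2 / 8
    return math.ceil(math.log(18 / delta) / rate)

# Illustrative parameters: edge probabilities at least 0.1, path probabilities at most 0.3.
k_needed = sites_for_failure_at_most(z=0.1, y=0.3, delta=0.01)
```

The required $k$ grows only logarithmically in $1/\delta$, but it grows quickly as $z$ approaches 0 or $y$ approaches 1/2, since both factors in the exponent then shrink.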

Proof. First observe from formula (1) that $z$ is also a lower bound for the compound changing probability for the path connecting any two vertices of $T$. We know that FPM returns the appropriate subtree given the additive distances $D_{ij}$; furthermore, if $|d_{ij} - D_{ij}| \le -\frac{1}{4}\log(1 - 2z)$ for all $i, j$, then FPM also returns the appropriate subtree on all $ijkl$, by Lemma 5. Consequently, [...]

For convenience, we drop the subscripts when we analyze the events in (17) and just write $D$ and $d$; we write $p$ for the corresponding transition probability $E^{ij}$ and $\hat{p}$ for the relative frequency $h^{ij}$. By simple algebra, [...]

Now we consider the probability that the Four-Point Method fails, i.e., the event estimated in (17). If $p \ge \hat{p}$, then formula (19) applies, so that $P[\text{FPM errs}]$ is algebraically equivalent to

$$p - \hat{p} \ge \frac{1}{2}\left((1 - 2z)^{-1/2} - 1\right)(1 - 2p). \tag{20}$$

This can then be analyzed using Lemma 4. The other case is where $\hat{p} > p$. In this case, formula (18) applies, and $P[\text{FPM errs}]$ is algebraically equivalent to

$$\frac{\hat{p} - p}{1 - 2\hat{p}} \ge \frac{1}{2}\left((1 - 2z)^{-1/2} - 1\right). \tag{21}$$

Select an arbitrary positive number $\varepsilon$. Then $\hat{p} - p \ge (1 - 2p)\varepsilon$ with probability [...]


[...] each contribute the same exponential expression to the error, and (16) or (17) multiplies it by 6, due to the six pairs in the summation. ∎
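The algebraic equivalences (20) and (21) can be sanity-checked numerically. The sketch below assumes the usual Neyman 2-state distance correction, $D = -\frac{1}{2}\log(1 - 2p)$ and $d = -\frac{1}{2}\log(1 - 2\hat{p})$, which is consistent with the threshold $-\frac{1}{4}\log(1 - 2z)$ used in the proof but is not restated in this section; the function names are hypothetical.

```python
import math

def corrected_distance(p):
    """Assumed Neyman 2-state distance: -1/2 * log(1 - 2p), for 0 <= p < 1/2."""
    return -0.5 * math.log(1 - 2 * p)

def errs_by_distance(p, p_hat, z):
    """Event |d - D| >= -1/4 * log(1 - 2z), stated on the distance scale."""
    gap = abs(corrected_distance(p_hat) - corrected_distance(p))
    return gap >= -0.25 * math.log(1 - 2 * z)

def errs_by_frequency(p, p_hat, z):
    """The same event rewritten on the frequency scale via (20) and (21)."""
    c = 0.5 * ((1 - 2 * z) ** -0.5 - 1)
    if p >= p_hat:
        return p - p_hat >= c * (1 - 2 * p)      # inequality (20)
    return (p_hat - p) / (1 - 2 * p_hat) >= c    # inequality (21)

# Compare both formulations on a grid of true and estimated probabilities.
z = 0.1
mismatches = [
    (p, i / 100)
    for p in (0.1, 0.2, 0.3, 0.4)
    for i in range(1, 50)
    if errs_by_distance(p, i / 100, z) != errs_by_frequency(p, i / 100, z)
]
```

On this grid the two formulations agree at every point, which is what the "algebraically equivalent" claims in the proof assert.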

This allows us to state our main result. First, recall the definition of depth from Section 2.

Theorem 9. Suppose $k$ sites evolve under the Neyman 2-state model on a binary tree $T$, so that for all edges $e$, $p(e) \in [f, g]$, where we allow $f, g$ to be functions of $n$. Then the dyadic closure method reconstructs $T$ with probability $1 - o(1)$, if

$$[...]\; c \cdot \log n \;[...]$$

Proof. It suffices to show that hypothesis (14) holds. For $k$ evolving sites, i.e., [...]

To establish (31), first note that $h^{ij}$ satisfies the hypothesis of the Azuma–Hoeffding inequality (Lemma 4 with $X_i$ the sequence of states for site $i$ and $t = 1/k$).
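The bounded-difference condition with $t = 1/k$ can be checked directly on a toy example. The sketch below assumes that the relative frequency $h$ for a pair of leaves is the fraction of the $k$ sites at which their states differ, and it flips the state of one leaf at one site at a time; the function names and sample sequences are hypothetical.

```python
def relative_frequency(seq_i, seq_j):
    """Fraction of sites at which the two leaf sequences differ."""
    k = len(seq_i)
    return sum(a != b for a, b in zip(seq_i, seq_j)) / k

def max_single_site_change(seq_i, seq_j):
    """Largest change in h when the state of leaf i is flipped at a single site."""
    base = relative_frequency(seq_i, seq_j)
    worst = 0.0
    for site in range(len(seq_i)):
        flipped = list(seq_i)
        flipped[site] = 1 - flipped[site]  # binary states 0/1
        worst = max(worst, abs(relative_frequency(flipped, seq_j) - base))
    return worst

# Two binary leaf sequences over k = 8 sites (they differ at 2 sites).
seq_i = [0, 1, 1, 0, 1, 0, 0, 1]
seq_j = [0, 1, 0, 0, 1, 1, 0, 1]
h = relative_frequency(seq_i, seq_j)
t = max_single_site_change(seq_i, seq_j)
```

Changing one site moves the count of differing sites by at most one, so $h$ moves by at most $1/k$; substituting $t = 1/k$ into Lemma 4 gives a tail bound of $\exp(-\lambda^2 k / 2)$.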


Suppose $E^{ij} \ge 0.5 - \tau$. Then, [...]. These two bounds establish (31).

[...] we condition on the occurrence of event $C$. This holds for all $i, j \in q$, so by [...] induced subtree in $T$ has mutation probability at least $f(n)$ on its central edge, and [...] whenever $q \in X$. Also, the occurrence of event $C$ implies that

$$Z_\tau \subseteq X, \tag{34}$$

[...] where the second inequality follows from (34), as this shows that when $C$ occurs, $\bigcap_{q \in Z_\tau} B_q \supseteq \bigcap_{q \in X} B_q$. Invoking the Bonferroni inequality, we deduce that [...]


Formula (25) follows by an easy calculation. ∎
