
3.5 Tree criteria

3.5.5 The Maximum Likelihood Criterion

The Maximum Likelihood (ML) criterion is similar in spirit to the Maximum Parsimony Criterion (MPC), except that in parsimony the cost of a change is not a function of the edge length.

In the case of MPC we simply count the changes which occurred on an edge.

The conditions are similar to those described above. First, let us assume that we have a set of aligned sequences $D = \{s_1, \dots, s_n\}$ over Σ of length l. Moreover, we also know their phylogeny, which is represented by a rooted phylogenetic tree T whose edge lengths are known. In addition, we have an evolutionary model, such as the Jukes-Cantor model, which allows us to determine the probabilities $P_{xy}(t)$, where $x, y \in \Sigma$.

This means that we will then know the probability of the different types of changes on an edge of length t.
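As a concrete illustration of such a model, the Jukes-Cantor transition probabilities have a simple closed form. The sketch below assumes the DNA alphabet Σ = {A, C, G, T} and that t is measured in expected substitutions per site; the name `jc_prob` is hypothetical and chosen only for this example.

```python
import math

SIGMA = "ACGT"  # assumed alphabet (DNA)

def jc_prob(x: str, y: str, t: float) -> float:
    """P_xy(t) under Jukes-Cantor; t in expected substitutions per site (assumption)."""
    decay = math.exp(-4.0 * t / 3.0)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

# Each row of the transition matrix sums to one for any edge length.
row_sum = sum(jc_prob("A", y, 0.3) for y in SIGMA)
```

For t = 0 no change is possible, and for very long edges every letter becomes equally likely, regardless of the starting letter.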

Before we define the likelihood score of a phylogenetic tree, we need to make two basic assumptions. These are that

1. the sequences evolve independently, position by position;
2. the evolutionary changes on the edges are independent.

In general these assumptions do not hold in real life, but they greatly simplify the computation of the likelihood of a tree, as we shall shortly see.

Roughly speaking, the likelihood of a tree is P(D|T), that is, the probability of observing the set D given that evolution followed our hypothetical tree T. Let us now see how we can compute this value for a given set of sequences D and a rooted phylogenetic tree T. Using the first assumption, we can rewrite the likelihood as a product over the positions:

$$L = P(D \mid T) = \prod_{i=1}^{l} P(s_1^i, \dots, s_n^i \mid T) \tag{3.21}$$

Here $s_j^i$ denotes the letter at position i of sequence $s_j$. This means that we can handle the likelihood of each position independently, so we only need to compute the likelihood for a constrained tree whose leaves represent single letters. Next, we can assign a letter from Σ to an interior node of T. Let us examine the case when an interior node g has been labelled by a letter $x \in \Sigma$, denoted by $g^x$. The second assumption allows us to handle the lineages separately. Then we can compute the conditional likelihood of an interior node g, if we know the conditional likelihoods of its immediate descendants $g_1$ and $g_2$, via the following formula:

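$$L_g(x) = \left( \sum_{y \in \Sigma} P_{g^x g_1^y}(t)\, L_{g_1}(y) \right) \left( \sum_{z \in \Sigma} P_{g^x g_2^z}(t)\, L_{g_2}(z) \right) \tag{3.22}$$

As mentioned above, the probabilities $P_{xy}(t)$ are known; in each factor, t is the length of the edge leading to the corresponding descendant. Next, we need the likelihoods of the leaves, which are simply defined by a Kronecker delta function. For a leaf l, it is defined as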

$$L_l(x) = \begin{cases} 1, & \text{if leaf } l \text{ represents } x \\ 0, & \text{otherwise} \end{cases} \tag{3.23}$$
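The recursion in Equations (3.22) and (3.23) can be sketched for a single alignment column as follows. This is a minimal illustration, not the implementation used in this work: it assumes the DNA alphabet and the Jukes-Cantor model, and the tuple layout `(left, t_left, right, t_right)` for interior nodes is a hypothetical representation in which each child edge carries its own length.

```python
import math

SIGMA = "ACGT"  # assumed alphabet (DNA)

def jc_prob(x, y, t):
    """Jukes-Cantor P_xy(t), used here only as an example model."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def cond_lik(node):
    """Return {x: L_g(x)} for one alignment column.

    A leaf is a single letter; an interior node is a hypothetical
    tuple (left, t_left, right, t_right).
    """
    if isinstance(node, str):
        # Eq. (3.23): Kronecker delta at the observed letter.
        return {x: 1.0 if x == node else 0.0 for x in SIGMA}
    left, t1, right, t2 = node
    l1, l2 = cond_lik(left), cond_lik(right)
    result = {}
    for x in SIGMA:
        # Eq. (3.22): the two child lineages contribute independent factors.
        from_left = sum(jc_prob(x, y, t1) * l1[y] for y in SIGMA)
        from_right = sum(jc_prob(x, z, t2) * l2[z] for z in SIGMA)
        result[x] = from_left * from_right
    return result

# A cherry with leaves A and C, both edges of length 0.1.
cherry = ("A", 0.1, "C", 0.1)
L = cond_lik(cherry)
```

For a cherry the sums collapse to a single term each, so `L["A"]` is just the probability of staying at A on one edge times the probability of changing from A to C on the other.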


In the end, the total likelihood of the tree T is the sum of the conditional likelihoods at the root, weighted by the background probabilities $\pi_x$ of the letters:

$$L_T = \sum_{x \in \Sigma} \pi_x L_r(x), \tag{3.24}$$

where r is the root of T. This can be computed using a dynamic programming approach [26]. However, we want the maximum likelihood for a given tree topology, so we need to determine the edge lengths for which the likelihood score of the phylogenetic tree T is maximal. This can be done, for example, by applying Newton's method. We should mention here that the likelihood function is not always concave in the edge lengths, so such a local method may converge to a local rather than a global maximum.
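To make the edge-length optimization concrete, consider the simplest possible case: a tree with only two leaves under the Jukes-Cantor model, where the likelihood depends on a single path length t and the maximizer has a known closed form to check against. The sketch below applies Newton's method to the log-likelihood; it is an illustration under these assumptions only, and the function name and the positivity safeguard are hypothetical choices, not the procedure used for full trees in this work.

```python
import math

def newton_jc_distance(n_sites, n_diff, t0=0.1, tol=1e-12, max_iter=50):
    """Maximize the two-sequence Jukes-Cantor log-likelihood in t by Newton's method."""
    n_same = n_sites - n_diff
    t = t0
    for _ in range(max_iter):
        u = math.exp(-4.0 * t / 3.0)
        # log L(t) = n_same * log(1 + 3u) + n_diff * log(1 - u) + const
        f1 = 3.0 * n_same / (1.0 + 3.0 * u) - n_diff / (1.0 - u)
        f2 = -9.0 * n_same / (1.0 + 3.0 * u) ** 2 - n_diff / (1.0 - u) ** 2
        du, d2u = -4.0 * u / 3.0, 16.0 * u / 9.0
        grad = f1 * du                    # first derivative with respect to t
        hess = f2 * du * du + f1 * d2u    # second derivative with respect to t
        step = grad / hess
        t = max(t - step, 1e-9)           # safeguard: keep the edge length positive
        if abs(step) < tol:
            break
    return t

# 10 differing positions out of 100 aligned sites.
t_hat = newton_jc_distance(n_sites=100, n_diff=10)
closed_form = -0.75 * math.log(1.0 - 4.0 * 10 / (3.0 * 100))
```

Starting close to the optimum, the iteration agrees with the closed-form Jukes-Cantor distance after a few steps; from a poor starting point Newton's method carries no such guarantee.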

The likelihood criterion itself is then straightforward: we prefer those trees which have a higher likelihood value. There are many extensions for computing the likelihood score. One of the most notable was introduced by Churchill and Felsenstein [45], who relaxed the first assumption by modelling the dependencies between neighboring positions with a Hidden Markov Model.

We will also use this procedure for calculating the likelihood values in our experiments.

Chapter 4

A Tree Building Method Based On The Least-Squares Criteria

4.1 Introduction

The reliable reconstruction of a tree topology from a set of homologous sequence data is one of the most important goals in systems biology. A major family of phylogenetic tree-building methods is the distance-based, or distance matrix, methods. The general idea behind them is to calculate a measure of the distance between each pair of taxa, and then find a tree that predicts the observed set of distances as closely as possible.

There are quite a few heuristic distance-based algorithms with a fixed criterion available for estimating phylogeny, and their strengths and weaknesses are well known in the field. Distance-based methods such as the Unweighted Pair-Group Method using Arithmetic averages (UPGMA) [10] and Neighbor-Joining (NJ) [9] work similarly: they iteratively form clusters, always choosing the best possibility based on a given criterion.

We can call these methods greedy in a certain sense, because they always work on the current best candidate subtrees. The NJ method produces additive trees, while UPGMA assumes that the evolutionary process can be represented by an ultrametric tree. These restrictions may then interfere with the correct estimation of the evolutionary process.
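The greedy selection step can be illustrated with the standard Neighbor-Joining Q-criterion, which picks the pair of clusters to join next. The sketch below shows only this one selection step on a small illustrative distance matrix; the function name `nj_pick_pair` and the matrix values are assumptions made for the example.

```python
from itertools import combinations

def nj_pick_pair(dist):
    """Return (Q value, i, j) minimizing the Neighbor-Joining Q-criterion.

    Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    return min(
        ((n - 2) * dist[i][j] - row_sums[i] - row_sums[j], i, j)
        for i, j in combinations(range(n), 2)
    )

# Small symmetric example matrix over five taxa a..e (illustrative values).
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
q, i, j = nj_pick_pair(D)  # taxa a and b (indices 0 and 1) are joined first
```

Note that the pair with the smallest raw distance here is (d, e), yet the criterion selects (a, b): the choice is locally optimal with respect to Q, which is exactly what makes the procedure greedy.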

Atteson [46] showed, however, that the NJ method is able to return the true phylogeny when the observed distances are sufficiently close to the true evolutionary distances.

The chief aim of this chapter is to develop a good distance-based method that closely approximates the true tree for any available evolutionary distance (not just for ultrametric or additive ones). To achieve this we apply a special form of the Least-Squares Criterion (LSC) to phylogenetic trees [43]. The LSC guarantees a minimal deviation between the evolutionary distances and the leaf distances in the phylogenetic tree. Fortunately, the LSC weighting for a phylogenetic tree can be computed in $O(n^2)$ time. The original LSC was introduced by Fitch and Margoliash, and nowadays several forms of it are in use in the literature, such as the Weighted LSC [47] and the Unweighted and Generalized LSC [48]. We applied the constrained version of the LSC (CLSC) here to evaluate phylogenetic trees, because the weights of the edges have to be non-negative.


The solution of the problem retains its simplicity because the Constrained LSC can easily be handled by the Levenberg-Marquardt method [49].
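The constrained least-squares fit for a fixed topology can be illustrated on a four-leaf tree. The sketch below deliberately swaps in plain projected gradient descent in place of the Levenberg-Marquardt method mentioned above, purely to keep the example dependency-free. The quartet topology ((A,B),(C,D)), the edge ordering, the step size and all function names are assumptions made for this illustration.

```python
# Quartet topology ((A,B),(C,D)); edges: a, b, c, d (pendant) and m (internal).
# Each row marks which edges lie on the path between one pair of leaves.
PATHS = [
    [1, 1, 0, 0, 0],  # A-B
    [1, 0, 1, 0, 1],  # A-C
    [1, 0, 0, 1, 1],  # A-D
    [0, 1, 1, 0, 1],  # B-C
    [0, 1, 0, 1, 1],  # B-D
    [0, 0, 1, 1, 0],  # C-D
]

def tree_distances(w):
    """Leaf-to-leaf distances induced by the edge weights w."""
    return [sum(p * x for p, x in zip(row, w)) for row in PATHS]

def ls_error(w, observed):
    """Sum of squared deviations between observed and tree distances."""
    return sum((o - t) ** 2 for o, t in zip(observed, tree_distances(w)))

def fit_nonneg(observed, step=0.05, iters=2000):
    """Projected gradient descent: take a gradient step, then clip weights at zero."""
    w = [0.0] * 5
    for _ in range(iters):
        resid = [t - o for t, o in zip(tree_distances(w), observed)]
        grad = [2.0 * sum(PATHS[r][e] * resid[r] for r in range(6)) for e in range(5)]
        w = [max(0.0, x - step * g) for x, g in zip(w, grad)]
    return w

true_w = [1.0, 2.0, 3.0, 4.0, 0.5]
observed = tree_distances(true_w)  # additive distances, so an exact fit exists
w_hat = fit_nonneg(observed)
```

Because the input distances here are additive on the chosen topology, the fit recovers the generating edge weights and the least-squares error vanishes; for non-additive input the clipping step is what keeps the weights non-negative.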

Since finding the least-squares tree (whether constrained or not) is an NP-complete problem [50], a polynomial-time algorithm for it is unlikely to exist. Many meta-heuristics have been applied in phylogenetic tree building, such as the Genetic Algorithm [51], the Tree Fusing approach [52], the Branch and Bound approach [53], the Maximum Likelihood approach [26] and the Fitch and Margoliash approach, the latter being an extension of UPGMA.

We now propose a novel heuristic based on the so-called Multi-Stack (MS) construction [54]. The MS heuristic organizes the candidate subtrees having the same number of leaves into a priority queue according to their distance estimation error, and generates new candidate trees by joining the existing trees via a novel tree joining strategy. It may happen, however, that many trees within a priority queue have non-disjoint sets of leaves, so that they cannot be joined directly. The Closest-Neighborhood Tree Joining (CNTJ) strategy introduced here always provides a tree topology based on all of the subtrees, swapping their common taxa with their closest neighbor.

Our method was tested on artificial as well as on real-life datasets like Primates, Myoglobins and Hydrogenases.