On Contract-and-Refine Transformations Between Phylogenetic Trees
Vijaya Ramachandran Tandy Warnow
  е ж ж й и е ж з е е  

Ganeshkumar Ganapathy 
 

з

Abstract The inference of evolutionary trees using approaches which attempt to solve the maximum parsimony (MP) and maximum likelihood (ML) optimization problems is a standard part of much of biological data analysis. However, both problems are hard to solve: MP provably NP-hard, and ML even harder in practice. Consequently, hill-climbing heuristics are used to analyze datasets for phylogeny reconstruction. Two primary topological transformations have been used in the most popular heuristics: TBR (treebisection-and-reconnection) and ECR (edge-contractionsand-refinements). While most of the popular heuristics exclusively use TBR moves to explore tree space, some recent methods have used ECR in conjunction with TBR and found significant improvements in the speed and accuracy with which they can analyze datasets. In this paper we analyze ECR moves in detail, and provide results on the diameter of the tree space, the neighborhood intersection with TBR, structural analysis of the ECR operation, and an efficient method for sampling uniformly from the 2-ECR neighborhood of a tree. Our results should lead to a better understanding of the impact of ECR moves on the performance of heuristic searches. 1 Introduction Most, if not all, of the favored approaches in biology for inferring phylogenetic (i.e., evolutionary) trees are based upon attempts to solve certain NP-hard optimization problems; of these, perhaps Maximum Parsimony [6] is the most popular. Maximum Likelihood [5] is also favored, but considerably harder in practice to solve than Maximum Parsimony (though not established to be NP-hard). Approximation algorithms for Maximum Parsimony exist, but the approximation ratios are not good enough for use in molecular systematics where errors as small as 1% are unacceptable. Consequently, heuristics, largely based upon hill-climbing (also called local-search), are used to search for optimal trees. Two topological transformations on trees, TBR (for
Address: Dept. of Computer Sciences, The Univ. of Texas at Austin, Austin, TX 78712. Email: gsgk,vlr,tandy @cs.utexas.edu  Supported in part by NSF grant ITR-0121680  Supported in part by NSF grant CCR-9988160 з Supported in part by NSF grant ITR-0331453 and a fellowship from the David and Lucile Packard Foundation
б

Tree-Bisection-and-Reconnection) and p-ECR, short for p-Edge Contract and Refine [7], are the basis for the most popular heuristics in use for phylogenetic analysis under Maximum Parsimony. Of these two, the TBR transformation has been traditionally more popular, and is better understood in terms of the properties of the "landscape" of trees it induces [21, 16, 14, 1]. In a p-ECR move p of the edges in the given tree are contracted and the resulting tree is refined to give back a new tree. Sankoff et al. [20] define a version of the ECR move where the contracted edges are restricted to form a subtree (henceforth, we will call this move the p-sECR or p-subtree ECR move). In [20] an experimental comparison of local searches based on p-sECR moves for different values of p is presented, and evaluated with regard to the quality of local optima generated. Subsequently the p-ECR move has appeared implicitly rather then explicitly in the local-search heuristic sectorial-search [8]. In sectorial-search, a tree is transformed through contractions of edges and subsequent refinements, but the edges to be contracted are chosen using some specific heuristic, and so the number of edges contracted can vary during the search. The p-ECR move as used in this paper was defined recently in [7], where the neighborhoods of trees induced by the 2-ECR move and by the TBR move were compared and were shown to have a small intersection. In this paper, we present several results about the properties of the general p-ECR operation and the search space induced by it. In particular, we present asymptotically tight bounds for the diameter of treespace under p-ECR and p-sECR moves as a function of p, showing that the diameter of the search space in both cases is in  n log n (where n is the number of leaves p log p in the trees). This result could be potentially useful in selecting a suitable range of values of p for performing local searches based on p-ECR operations.
д д

a comparison of the neighborhoods of a tree induced by TBR and p-ECR moves, showing that their intersection is of size O min n2 p n2 p . The neighborhoods themselves are much larger: there could be  n3 trees in the TBR neighborhood of a tree, while the p-ECR neighborhood contains  n p 2 p trees. These results may help explain why the combination of the two moves imж

Copyright й 2004 by the Association for Computing Machinery, Inc. and the Society for industrial and Applied Mathematics. All Rights reserved. Printed in The United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688

г

в

900


proves upon the use of just one, as reported in [8]. This 2.2 Tree Transformations We now look at the two prinwork generalizes the result in [7] for 2-ECR. We present cipal tree transformation operations and the metrics induced related results comparing p-sECR and TBR neighbor- by the operations on the set of trees. hoods. Contractions and Refinements. A contraction collapses an O n pre-processing-time, O 1 update-time algo- an edge in the tree and identifies its two end points, while rithm for sampling a tree uniformly at random from the a refinement expands an unresolved node into two nodes set of 2-ECR neighbors of a phylogenetic tree. This po- connected by an edge (see Figure 1). The p-ECR tree tentially has applications in Markov Chain Monte Carlo rearrangement operation on a binary phylogeny is defined methods for inferring evolutionary histories through to be p edge-contractions, which are then followed by Bayesian analysis [13, 11, 12]. refinements that give back a binary phylogeny. The trees T 1 and T 5 in Figure 1 are separated by one 2-ECR operation. a structural analysis of the p-ECR operation, motivated by its application in our algorithm for uniformly sam- a d d a d a e1 e2 contract e2 pling from the 2-ECR neighborhood of a tree. We dee2 contract e1 b fine the properties of irreducibility and commutativity of e b c c e b e c p-ECR operations, and observe a surprising connection T3 T2 T1 between irreducible p-ECR operations and elementary bipartite graphs. We exploit this connection to develop a b c d a a an O n p2 algorithm to reduce a p-ECR operation refinement refinement into an equivalent sequence of irreducible ECR operab d c e e b tions. e c
ж е ж

Tree Bisection and Reconnection (TBR). In a TBR move, an edge is removed from T , creating subtrees t and T t , and then a new edge is added between the midpoints of any two edges in t and T t , creating a new tree. Throughout the operation any internal node of degree two is suppressed. The 2.1 Bipartitions A notion crucial to the study of phylo- TBR operation is illustrated in Figure 2. genies is that of a bipartition: removing an edge e from Nearest Neighbor Interchange (NNI). The NNI move a leaf-labeled tree T induces a bipartition e on its set of swaps one rooted subtree on one side of an internal edge e leaves. We denote by C T the set e : e E T , which with another on the other side; note that this is equivalent to represents the set of bipartitions induced by T . The set C T contracting the edge e, and then resolving the resultant tree is known as the character encoding of the tree T . Buneman into a new binary tree. The NNI operation is thus the same proved [2] that two phylogenies are isomorphic if and only as a 1-ECR operation, and is also a special case of the TBR if they have the same character encoding. operation. Every sequence of p NNI moves on a tree is a

901

ж

в

в

й в

ж

и

ж

е

ж

2 Basics A phylogeny is an unrooted tree (rooted, if the evolutionary origin is known) whose leaves are labeled and represent extant species, and all of whose internal nodes have degree at least three. A binary phylogeny is one where all internal nodes are of degree three. Edges that are not incident on leaves are called internal edges. Non-binary phylogenies are referred to as being unresolved at the nodes of degree greater than three. Any isomorphism between phylogenies must preserve the leaf labels.

O B S E RVAT I O N 1 . Let T and T be two unrooted binary leaflabeled trees on n leaves, and let p be any integer between 1 and n 3. Then RF T T 2 p if and only if T and T are one p-ECR move apart.

е

ж

е

жи в

ж

е

ж

в

е

и

 в ев

е

ж

е

ж з

ж

е

е

в

г дв

ж

и

е

The rest of the paper is organized as follows: In Section 2 we introduce some basic concepts necessary for the remaining sections. In Section 3 we present upper and lower bounds on the diameter of the search space induced by the p-ECR and p-sECR operations. In Section 4 we compare the neighborhoods of a tree induced by the two ECR operations and the TBR operation. In Section 5 we present our algorithm for sampling uniformly from the set of 2-ECR neighbors of a tree, and in Section 6 we carry out structural analyses of the p-ECR operation vis-a-vis the properties of irreducibility and commutativity.

ж

е

й

ж

е

б

з

ж

е

 
ж

е

е

д д

d

T3

T4

T5

Figure 1: Two edges are contracted in T1 to produce T3, which is then refined to produce T5; T3 and T5 are thus separated by one 2-ECR operation. The Robinson-Foulds metric. The Robinson-Foulds distance between two unrooted leaf-labeled (not necessarily binary) trees T and T , denoted RF T T is defined to be the length of a shortest sequences of contractions and refinements that transforms T to T [19]. It was also shown in [19] that RF T T CT CT CT CT Based on the above definitions we can deduce the following simple fact.


a a e g

e

g

TBR
b c i d f h b d c i f h

3.1 Upper Bounds We prove the upper bound for psECR, which then applies to p-ECR as well.
ж е ж

t

t

1 2 7 1 23 4 5 6 3 4 5 6

C

B

Figure 3: B is a complete rooted binary tree on seven leaves. Note that in general the leaves need not be in sorted order in the tree. C is the sorted caterpillar tree for the same set of leaves. Converting an arbitrary complete binary tree to a sorted caterpillar tree. The procedure is recursive, and is illustrated in Figures 4 and 5. Let B be the complete binary tree on n leaves. Let B1 , B2 , B3 , . . . B p be the subtrees of B at a depth of log2 p. Since B is a complete binary tree, so are subtrees B1 through B p . Recursively convert each of the p subtrees to a sorted caterpillar tree, producing the bc-tree (binary-cum-caterpillar tree) B . The subtrees of B at a depth log2 p are now sorted caterpillar trees C1 through C p (see Figure 4). The following process is illustrated in Figure 5: consider the first p leaves in the sorted order. These p leaves can be "pulled" up to the root by contracting only 2 p 2 edges (this is because the caterpillar trees C1 through C p are sorted). The contraction of these 2 p 2 edges make the root of the tree unresolved, with each of the p leaves now a descendant of the root. To complete the 2 p 2 -sECR operation, we have to refine the root, and when we refine the root the p leaves can be transferred to "above" (see Figure 5) the root in sorted order. The next 2 p 2 -sECR move will transfer the next p leaves in the sorted order to above the root. In this manner we can obtain a C from B in n p 2 p 2 -sECR moves. This gives us the following recursive

902

ж

ж

в

ж

е

ж

ж

е

ж

ж

ж

е

е

3 Bounds on the Diameter of G p ECR and G p sECR In this section we derive asymptotically tight bounds for the diameter of the tree-space induced by the p-ECR and p-sECR operations as a function of p. It was shown in 14 that  GN N I  n log n , and it was shown in [1] that  GT BR  n . As mentioned earlier, the NNI operation is just the 1-ECR operation (and the 1-sECR operation). Hence the diameter of the 1-ECR (and 1-sECR) operation is in  n log n . The diameter of the n 3 -ECR (and n 3 -sECR) operation is, of course, one. Obtaining the diameter as a function of p might give us a way to pick the right range of values of p to use in a search, based on the diameter.
ж

в

ж

е

в

ж

ж

е

 
p-ECR move on that tree; however there are p-ECR moves that cannot be performed by a sequence of p NNI moves (see, e.g., Figure 1). Neighborhoods, distances, and diameters. We define the neighborhood of an unrooted binary leaf-labeled tree T under a tree-rearrangement move to be the set of all trees that can be obtained from T by one move. For each of the different tree rearrangement operations (TBR, NNI, and pECR), we define the edit distance between two trees on the same set of leaves as the minimum number of moves needed to move from one tree to the other. All these distances are known to be finite (follows from [18] ), but tend to be hard to compute [1, 10, 3]. We denote the edit distance under the p-ECR move by  p ECR T T , and the others are similarly defined. Given a specific move (such as TBR, p-ECR, etc.), we can define the diameter of tree space to be the maximum edit distance between any two trees. For convenience, we will phrase the diameter of the search space as the diameter of a graph, in which the trees on a given set of leaves constitute the vertices, and an edge exists between two trees if they are related to each other by one move. Thus, the graph defined by the p-ECR move is G p ECR UE, where U is the set of unrooted leaf-labeled binary trees on n leaves, and u v E if and only if u and v are separated by one p-ECR move. We denote the diameter of G p ECR by  G p ECR .
ж и е

ж

ж

е

ж

е

 
Figure 2: Tree T is one TBR move away from T .

й д

ж

и

е

  ед

 
г

й

з

T

и

жжж зиз

и

и

T''

Proof. Let C be the sorted "caterpillar" tree for the set of leaf labels 1 2 n (Figure 3). We will show that for any unrooted binary leaf-labeled tree T ,  2 p 2 ECR T C n n p log p n O p . We will first show that the number of 2 p 2 -ECR steps needed to convert a complete binary tree on n leaves to a caterpillar C is at most n log p n. We then show that any p tree can be transformed into a complete binary tree in O n p steps, each step being a 2 p 2-ECR move. The above two results would then imply that any tree can be converted to C in at most n log p n O n steps. A complete binary tree and p p the sorted caterpillar tree for the set of leaves labeled from 1 through 7 are shown in Figure 3.

  ед

2
 

sECR

б

THEOREM 3.1.  G 2p
г
е

O

n log n p log p

.

ж

ж

е

 
г

 
е

 
ж

ж

б

в

ж

е

ж

и

е

б

 
е

ж

е

 
ж

е

в

ж

б

е

вв

7

б

ж

и

е

ж

ж

ж

 
е

е


equation for the number of moves required to convert B to C. Let S n denote the number of 2 p 2 -sECR steps required to convert an n-leaf complete binary tree to the corresponding sorted caterpillar tree. Then,
ж

each of which contain at least two leaves, which implies that S can be refined into two subtrees, each containing at least q leaves. Further, since S could not be refined into a q-spoke, the subtree rooted at v was not of size between q and 2q, so one of the two subtrees formed is a new one, resulting n n Sn pS in an increase in potential of at least one. Hence T can be p p transformed into a q-caterpillar in O n p-sECR moves. p n Solving the recurrence yields us S n By reversing the above strategy the q-caterpillar can p log p n. be transformed into any binary tree, including a complete Converting any tree to a complete binary tree with pbinary tree, in the same number of moves. Hence any binary sECR moves. We now show that any tree can be transformed tree can be transformed into a complete binary tree in O n p into a complete binary tree in O n steps, each step being a p steps, each step being a p-sECR move. p-sECR move. This will then give us the desired result. Note that these transformations concern only the tree topology and T not the leaf labels. In a caterpillar we define the terminal leaves to be the two pairs of leaves at each end of the caterpillar; the remaining leaves are the internal leaves. The path connecting the two pairs of terminal leaves is the spine of the caterpillar. We define a q-caterpillar as a caterpillar in which each internal leaf is replaced by a q-spoke, which is a caterpillar with q 2 Figure 6: Tree T is a 4-caterpillar. The leaf-labels are not internal leaves and one pair of end leaves (see Figure 6). The shown since they are not relevant in the transformation of a last spoke in the q-caterpillar (one that is adjacent to the par- tree to a q-caterpillar. ent of one of the two terminal pairs of leaves) is a q -spoke for some q q. Let T be an unrooted binary tree. Any binary tree 3.2 Lower Bounds We prove the lower bound for p-ECR, contains at least two pairs of leaves that are siblings. We fix which then applies to p-sECR as well. Proof.(Sketch.) Let T be an unrooted binary tree on n labeled leaves, and let T be a tree such that  p ECR T T 1. It can be established that N N I T T 2 p log2 p O p , for p 1. n log2 n o n log n From the results in [14, 1],  GN N I . 4 This, together with the fact that any p-ECR move can be emulated by at most 2 p log2 p O p NNI moves, gives us the desired result. Thus we obtain the following theorem: T H E O R E M 3 . 3 .  G p ECR and  G p sECR are both in  n llog n . p og p 4 Comparison of ECR and TBR Neighborhoods Recall that the neighborhood of a tree T under a tree rearrangement operation is the set of all trees that can be obtained by performing one such operation on T . In this section we first establish bounds on the size of the p-ECR and p-sECR neighborhoods of a tree on n leaves, and then show that the sizes of the intersections of each of the above two neighborhoods and the TBR neighborhood of a tree are small. We will denote the neighborhood of a tree T under, say, the TBR operation, as T BR T . It is known that T BR T O n3 [1]. In the following, we let r!! be the product of all odd numbers up to r.
ж е ж ж

903

д

гв

д

ж

е

и

г

 
б

е

 
е

ж

е

 
д

е й

г

е

ж

йв

 
з

ж

ж

е

и

е

е

е

 
е ж

ECR
ж

и

д

г

 
ж

в

 
ж

е

е

е

ж

и

е

two such pairs of sibling leaves in T and we call the unique path connecting their parents x1 and x2 the spine of T . We fix one of the parents, say x1 , and relative to x1 we define the potential  T 2 n1 n2, where n1 is the number of successive q-spokes in T starting with the vertex adjacent to x1 on the spine, and n2 is the number of successive subtrees with at least q leaves rooted at vertices on the spine following the initial n1 q-spokes. Consider the transformation of an arbitrary binary tree T into a q-caterpillar using p-sECR moves, where q p2. The initial potential of T is non-negative and final potential of the transformed tree is 2 n 4 q . We now describe a method that transforms T into a q-caterpillar using p-sECR moves that increase the potential of the transformed tree in each step by at least one. Thus this method transforms T into a q-caterpillar in O n moves. p Our method will apply a p-sECR move by contracting edges starting with the edges in the subtree rooted at the vertex v on the spine, which is adjacent to the last vertex that is a root of one of the n1 q-spokes already constructed (if n1 0 then this is the vertex adjacent to x1 on the spine). The resulting contracted subtree S on p internal edges will have p 3 external edges incident on it, of which two are on the spine and p 1 are within subtrees rooted at one or more vertices on the spine. If S can be refined to form a new successive q-spoke, then the potential increases by at least one. Otherwise, at least q external edges end in subtrees,

THEOREM 3.2.  G p

n log2 n o n log n 8 p log2 p O p

, for all p

1.

ж

д

е

ж

е

гв

ж

б

в

г

ж

е

гв

й

ж

ж

 
ж

е

ж

е

ж

е

е

б

 
й

ж

 
е

ж

 
г

е

ж

е

ж

е

йв

 
г


B

three more recursive calls one recursive call
C1
3
2 3 7 5 12 6 15 11 1 10 13 4 14 8 16 9

B'

C1
2 12 6 15 11 1 10 13 4 14 8 16 9 3 7 5 15 12 2

C2
6 11 13

C3
1 4 10 16 14

C4
8 9

7

5

B'

refine contract
C1
2 3 7 5 15 12 11
7 5 1 6 11 2 3 4 13 8 9 16 14 11 15 12 6 7 5 13 10

C2
6 13

C3
1 4 10 16

C4
8 9 14
15 12

1

1

15

2

refine

3 4 5

contract

13 16 14 9 8 7 6 5 4 3 2

1 12 11 10

6 7 8

2 3 4 5 6 7 8

refine

11

13

10 16 14

9 15 12

contract and refine
16 1 15 14 13 12 11 10 9 8 7 6 5 4 3 2

16 15 12 11 13 10 9

14

15

12

C

Figure 5: Converting a bc-tree B to a sorted caterpillar tree C with 6-ECR moves. At every contraction the bold edges are contracted. L E M M A 4 . 1 . Let T be an unrooted binary leaf-labeled tree p n on n leaves. Then, k 1 n k 3 2k  p ECR T p 2p 1 !!. Proof. For any tree T in  p ECR T , RF T T 246 2 p . We will show that the number of trees T in  p ECR T such that RF T T 2k is at least n k 3 2k , and that will give us the result that we desire. Let k be such that 1 k n 3 . For every way of choosing k edges in T , there are at least 2k different k-ECR moves that can be performed on T : for each chosen edge, contract the edge and refine the resulting unresolved node in one of two ways that results in the alteration of the bipartition corresponding to the edge. Thus, there are at least n3 k 2k. This completes k 2 trees T such that RF T T our proof of the lower bound. For the upper bound, observe that for each of the n ways of selecting p edges to contract, p there are at most 2 p 1 !! neighbors ( 2 p 1 !! is the number of unrooted leaf-labeled binary trees with p internal edges). This completes our proof.

For p-sECR, as noted in [20], the neighborhood size is in O n 2 p 1 !! . Further, we observe here that the p-sECR neighborhood size is in  n 2 p 1 !! , since there are  n p p disjoint sets of p contiguous edges in any binary tree. T H E O R E M 4 . 1 . Let T be an unrooted binary leaf-labeled tree on n leaves. Then, for any p,  p ECR T T BR T min 2n 3 p 1 2 p 3 4 2n 3 2 p 1 . T BR T , and let T be in S. Proof. Let S  p ECR T Then, C T CT p, since T  p ECR T . Moreover, since T T BR T , the edges in T corresponding to bipartitions in C T C T all must lie on a path, and the bipartitions corresponding to all edges on the path except three (the first edge, the last edge and the edge that is broken for the TBR move) must be in C T C T . Hence, each such T can be specified by three edges that lie on a path of length at most p 3 . Now, the number of paths of length at most p 3 is at most 2n 3 2 p 3. This is because the number of paths of length exactly p 3 is at most 2n 3 2 p 2: fix one of the terminal edges of the path, and there are at most 2 p 3 paths with a given terminal edge since the tree is binary.
ж

904

й

е

е

ж

з

е

в

ж

ж

е

ж

г д

ж

ж

й

ж

е

 
е

ж

 
в

ж

ж

бв

 
е

е

е

ж

ж

е

ж и

 
ж

ж

е

з

г е

в
1 2 3 4 10 9 16 14 8

Figure 4: Converting a complete binary tree B to a bc-tree B

contract
1 2 3 4

11

5

6

7

8 13 10 16 14

9

 
е

е

ж

ж

в

ж

и

е

ж

е

е

з

й

е

ж

е

ж

ж

в

ж

 
ж

е

е

ж

 
е

б зв

ж

г

ж

е

ж

ж

ж

 
е

ж

е

е

е

ж

е

е

з

 
з

в

е

в

бв

 
д

ж

в

е

 
и

в

б

б

е

ж

й

е

 
ж

ж

е

е

е

ж

 
гв

ж

ж

 
гв

е

е

ж

и

й

в

й

и

е

в

е

в

й

ж

 
б

 
в

б

е

в

й

ж

и

е

жжж зиз

и

 
и

в

и

 
ж

б

з


2. Generate q at random from a uniform distribution on 0 1.
2 3. If q 2n n 6 6 s , generate a tree uniformly at random from N N I T .

905

ж

е

ж

ж

ж

е

е

г

е

ее

и

д е

ж

ж

г

е

ее

е

в

ж

е

L E M M A 4 . 2 . Let T be an unrooted binary leaf-labeled tree on n leaves. Then, for any p,  p sECR T T BR T min 2n 3 p 1 2 p 2 4 2n 3 2 p 1 .

Sampling from N N I T : Step (3) is easy and can be performed in O 1 time, given the set of internal edges of T . The following result relates the above two results by We choose an internal edge e uniformly at random, and pick comparing  p sECR T BR with  p ECR T BR . each of the two trees that can be obtained by contracting and L E M M A 4 . 3 . For any p 1, there exists an unrooted binary refining e with probability 1 2. It can be verified that in this leaf-labeled tree T such that  p sECR T T BR T is a manner we do sample uniformly at random from N N I T . Sampling from S: This is a complicated by the fact that proper subset of  p ECR T T BR T . sampling two edges e1 and e2 one after the other without Further, for any tree T and for any p 1,  p ECR T replacement, and then sampling uniformly at random from T BR T  p 1 sECR T T BR T . the set of neighbors obtained by performing a 2-ECR move Proof. In Figure 7 the tree T is in  p ECR T T BR T , involving edges e1 and e2 does not induce a uniform distrias evinced by the bold edges in T that correspond to bi- bution on S. This is due to the following reason: when e1 and C T . However, T  p sECR T partitions in C T e2 are adjacent, they contribute 10 neighbors to S, while they T BR T since the p bold edges do not form a subtree. This contribute only 4 neighbors to S when they are not adjacent. proves the first part of the lemma. Hence, we adopt the following strategy: we arbitrarily For the second part, let T be a tree in order the internal edges in T , and let ind ex e denote the  p ECR T T BR T . Then, as observed in Theorem position of the edge e in such an order. 4.1, the edges corresponding to bipartitions in C T CT We let Y be the set of neighbors that can be obtained lie on a path and all except at most one edge (the edge that from T through a sequence of two 1-ECR moves, is broken for the TBR move) in the path are in C T CT the first one involving edge e1 and the next involving (i.e., the corresponding bipartitions are in C T CT . ind ex e2 . Every pair e2 , and such that ind ex e1 There are at most p bipartitions in C T C T and of internal edges (whether adjacent or non-adjacent) hence the length of the path is at most p 1. The tree contributes four trees to Y . We let Y y. T can be obtained by contracting all edges on the path and subsequently introducing the edges corresponding to Let X S Y , and let X x. The set of neighbors X bipartitions in C T C T through refinements, while contains the following two classes of trees: retaining the edge that gets broken for the TBR move. This н Trees that cannot be obtained by a sequence of two corresponds to a p 1 -sECR operation on T , and hence 1-ECR moves. There are two such trees for every T  p 1 sECR T T BR T . This completes the proof pair of adjacent internal edges, and none for any of the second part of the lemma. pair of non-adjacent edges. 5 Uniform Sampling from the Set of 2-ECR Neighbors н Trees that are obtained by two 1-ECR moves inThe use of MCMC (Markov Chain Monte Carlo) algorithms volving two adjacent internal edges, e1 first and e2 next, such that ind ex e1 in phylogeny reconstruction is of increasing interest in ind ex e2 . Every pair
ж

з

ж

е

 
г

и

4. If q S.

2n 6 2n 6 s , generate a tree uniformly at random from

 
ж

е

й

The following is the analogous result for the p-sECR operation. The proof is similar to that of Theorem 4.1.

г

1. Compute s
в
и

S.

ж

е

ж

ж

ж

е

е

ж

е

 
ж

г

ее

ж

з

 
е

б

But in this manner each path will be counted at least twice, and hence there are at most 2n 3 2 p 2 paths of length exactly p 3 . Summing over all p we get that the number of paths of length at most p 3 is at most 2n 3 2 p 3. Also, each path of length at most p 3 corresponds to at most p 1 trees that are in S, since there are p 1 ways of choosing the edge that is broken for the TBR move. Hence we have that S 2n 3 2 p 3 p 1 . Moreover, the total number of paths in T is 2n 3 2 . For every tree in S, there is a path in T , and each path 4 2n contributes at most 4 p 1 trees to S. Hence S 32p 1. Thus, we have that  p ECR T T BR T min 2n 3 p 1 2 p 3 4 2n 3 2 p 1 .
ж

the research and user community [12, 11, 13]. In this section, therefore, we address the problem of selecting a tree uniformly at random from the set of 2-ECR neighbors of a tree. Our algorithm takes O 1 time, after a one-time pre-processing step that costs O n time. We partition the set of 2-ECR neighbors of T into two subsets: N N I T and S 2 ECR T N N I T . The size of the former set is 2n 6, and the size of the latter set depends on the structure of T . The outline of our algorithm is as follows:

д

д

ж

ж

е

ж

г д

г

в

в

ж

й

й

д

д

ж

в

ж

ж

ж

е

е

е

е

ж

 
е

е

ж

е

е

е

ж

ж

ж

в

е

е

е

з

ж ж и

ж з

е

е

ж

й

ж

е

ж

 
ее

ж

 
е

е

г

е

ж

ж

ж

г

ж

ж

е

ж

ж

ж

 
г

ж

й

й

е

г

ж

ж

вб гв

 
ж

ж

е

е

е

 
е

г

е

е

в

е

 
е

 
з

ж

ж

е

е

з

 
е

е

 
е

ж

ж

ж

 
ж

е

ж

е

ж

ж

ж

ж

ж

ж

в

е

г

г

е

е

е

ж

ж

е

ж

ж

 
е

в

и

и

и

е

ж

е

 
е

й

е

з

з

е

ж в

ее

ж

г

е

ж

 
ж

г

ж

ж

е

 
ж

  ед

е

е

е

ж

ж

 
з

е

 
е

е

 
г

ж

ж

г

  ед

ж

е

ж

ж

б

ж

ж

е

з

 
ж

ж

е

г

 
е

е

е

е

е

з

з

е

бв

 
в

ж


a e1

T
e2 e3

c a

T'

k

...
b 1 2 k

...
k+1

p+1 p+2

d

b

...
k -1 p+2

.. .
k+1 k+2

c

d

Figure 7: Trees T and T' are TBR neighbors; The TBR move deletes edge e2 in T and introduces an edge bifurcating edges e1 and e3. Tree T' is a p-ECR neighbor, but not a p-sECR neighbor of T. The p-ECR move contracts the bold edges in T and introduces the bold edges in T . of adjacent internal edges contributes four such trees to X . Note that two 1-ECR operations performed in the reverse order on two non-adjacent edges do not generate any new trees, since the order does not matter when the edges are nonadjacent.

Selecting a 2-sECR neighbor uniformly at random. This is made simple by the fact that there are only O n neighbors and each pair of adjacent edges contributes the same number of neighbors (fourteen) to the set of neighbors. Thus, we Note that 2 ECR T 2n 6 x y. We are now in can first select a pair of adjacent edges uniformly at random a position to describe our algorithm. and then select one neighbor uniformly at random from the Algorithm to sample uniformly from S. set of fourteen neighbors contributed by the above pair of edges. Each of the above selections can be done in O 1 1. Calculate x and y. This can be done in O n time since x time, after an O n preprocessing that computes the list of depends only on the number of pairs of adjacent internal pairs of adjacent edges in the tree. Further, this information edges in T , and y depends only on n. can be updated in O 1 time after generating a new tree. At first sight, our algorithm seems to be a series of 2. Generate q at random from a uniform distribution on case analyses. However, the analyses reveal some interest0 1. ing properties of the structure of 2-ECR and 2-sECR moves: 3. if q x x y , then sample a pair of adjacent internal edges there are some 2-ECR and 2-sECR moves that are not ree1 and e2 , and then sample a tree uniformly at random ducible to two successive NNI moves. Among the reducible from the set of neighbors contributed to X by a 2-ECR moves, some (but not all) 2-ECR moves involve two succesmove involving e1 and e2 (this involves sampling one sive NNI moves that are commutable (i.e., those that can be tree from a set of six trees). reordered), while no 2-sECR move involves successive NNI moves that are commutable. We believe that these concepts 4. if q x x y , then sample two internal edges e1 and e2 (and generalizations of them) will be essential in designing one after the other without replacement from the set of an algorithm that samples efficiently from the set of both pinternal edges. Then sample a tree uniformly at random ECR and p-sECR neighbors of a tree for p 2. In the next from the set of neighbors contributed to Y by a 2-ECR section we study reducibility and commutability of p-ECR move involving e1 and e2 (this involves sampling one moves, and show that these concepts generalize to all values tree from a set of four trees). of p through a surprising connection to elementary bipartite graphs. Every 2-ECR neighbor is generated with a probability of 2n 61 x y by our algorithm. The running time is O n , the time taken to to calculate the number of pairs of adjacent 6 Structural Analyses of the p-ECR Operation internal edges in T . However, note that once a new a tree is In this section we describe a method to construct, for any generated, this number can be calculated for the new tree in two given trees, a sequence of elementary or irreducible ECR O 1 time, since a 2-ECR move makes only local changes to operations that transforms one tree to another. We will use the convention that an ECR operation is a p-ECR operation the tree structure. Hence, we have the following: for some (unspecified) p. We first introduce some terminology and notation. Let T H E O R E M 5 . 1 . We can generate a tree uniformly at random from the set of 2-ECR neighbors of an unrooted leaf-labeled T be an unrooted leaf-labeled tree. Let X and Y be two ECR
ж е ж е

906

ж

е

и

ж

е

ж

е

ж

е

ж

е

ж

е

 
в

binary tree on n leaves in O 1 time, after an O n preprocessing step.

ж

г

е

ж

е

 
е

з

з

зз

и

в

й

и

 
б

ж

е


operations on T . We will say X equals Y if performing X on nodes of degree two (through out the rest of the paper we T results in the same tree as the one obtained by performing use n to denote the number of leaves). This gives us the Y on T . For two ECR operations X and Y , we will let Y X be following: the following sequence of two ECR operations: X on T , followed by Y on the tree that results from performing X on T . C O RO L L A RY 6 . 1 . The maximum cardinality of any set of compatible bipartitions of a set of n elements is 2n 3. D E FI N I T I O N 1 . Reducible p-ECR operation Let T be an unrooted leaf-labeled tree. Let X be a pWe now define a graph, which we call the incompatibilECR operation on T . X is said to be reducible if there exists ity graph, defined by two leaf-labeled trees. a p1 -ECR operation X1 and a p2 -ECR operation X2 such that D E FI N I T I O N 3 . Incompatibility Graph X X2 X1 and p p1 p2. Let T and T be two unrooted leaf-labeled trees. The The concepts of reducibility and irreducibility of ECR incompatibility graph G between T and T is defined as follows: G is a bipartite graph with vertex set U V , where operations are illustrated in Figure 8. U C T C T ,V C T C T 1 , and edge set u v : b d a a u U v V u and v incom pat ibl e .
one irreducible 2-ECR move c T2 e

b

c T1

e

d

a

d a 1-ECR move c T1 T4 e b

d

a 1-ECR move d T3

b

An elementary bipartite graph is one where every edge is in some perfect matching [15]. Suppose T and T are two trees such that  p ECR T T 1; then, we show that the pECR move that separates them is irreducible if and only if the incompatibility graph induced by the two trees is elementary. We start with the following lemma. L E M M A 6 . 2 . Let G be the incompatibility graph between two unrooted binary leaf-labeled trees. Then G has a perfect matching. Proof. (Sketch) By applying Corollary 6.1 it can be shown that the incompatibility graph G satisfies the following two properties: (1) For every subset S of V ,  S S , and (2) For every subset R of U ,  R R . Our result will then follow from Hall's matching theorem [9].

b

c

e

c

e

one reducible 2-ECR move

In this section we address the Irreducible Decomposition problem, which we define as follows: given two biT H E O R E M 6 . 1 . Let X be a p-ECR move that transforms an nary trees T and T such that RF T T 2 p, decompose unrooted leaf-labeled binary tree T to T . Let G UVE the p-ECR operation X that separates T and T such that be the incompatibility graph between T and T . Then, X is X Xk Xk 1 X1, each Xi is an irreducible pi -ECR opirreducible if and only if G is elementary. eration, and k 1 pi p. i

Using the above characterization, we now show that we L E M M A 6 . 1 . ( F RO M B U N E M A N [ 2 ] ) A set of bipartitions can check efficiently if a p-ECR move is irreducible. This is compatible iff any two bipartitions in the set are are also means that we can compute, for any given p-ECR move, pairwise compatible. Furthermore, two bipartitions A A1 : its irreducible decomposition. A2 and B B1 : B2 are compatible iff at least one of the four sets A1 B1 , A1 B2 , A2 B1 and A2 B2 is empty. 1 Note that the definition here is almost the same as the definition of
г ед в

Observe that there can not be more than 2n 3 edges in a phylogenetic tree with n leaves, since there are no internal

the incompatibility graph appearing in [17], where U and V were C T and C T respectively. Our definition has the effect of removing isolated vertices from the incompatibility graph.
г в

907

д

D E FI N I T I O N 2 . Bipartition (or edge) compatibility: A set of bipartitions B is said to be compatible if and only if B C T for some tree T .
ж е

ее

г

е

ж

е

е

6.1 Irreducibility and Elementary Bipartite Graphs We begin with a definition:

Proof. (Sketch) It can be shown that X is irreducible if an only if there is no proper subset S of V such that  S S. This is equivalent to showing that G is elementary, since we have already established that G always contains a perfect matching.

ж

д

и

и

е

ее

г

в

е

е

ж

е

в

е

ее

е

е

ж

е

е

Figure 8: Trees T1 and T2 are one irreducible 2-ECR move away, and trees T1 and T3 are one reducible 2-ECR move away.

ж

и

е

з

в

ж

б

в

ж

й

е

жв

гв

ж

ж

е

и

е

г

в

 
ж

в

е

и

ж

б

ж

е

и

г

б

 
г

в

ж

гв

ж

г

и

е

г

 
г

в

г

 
жж ззж

г

 
г

б

 
г

г

г


T H E O R E M 6 . 3 . Let T be an unrooted leaf-labeled tree. Let X be a p-ECR operation on T that would result in a tree T . Then, there exist k mutually commutable ECR operations X1 , X2 , . . . Xk such that X Xk Xk 1 X1 if and only if the edges corresponding to bipartitions in C T CT constitute a forest with k trees.
ж

6.3 Existence of Irreducible p-ECR Moves One can establish that an irreducible p-ECR move can be constructed for every set of p connected edges in any tree through explicit construction:

Proof. We show that X is separable if and only if the edges T H E O R E M 6 . 4 . Let T be any unrooted binary tree with corresponding to the bipartitions in C T C T do not at least p internal edges. Then there is a tree T with

908

д

This completes our proof.

в

в

ж

е

гв

вг

ж

жв

ж

е

жв

гв

е

We now present a necessary and sufficient condition for separability of X that can be verified without computing the incompatibility graph.

/ / QZ 2. Y 0, since Y Q 0. Now, since Q / / 0, we have Y Z P 0 and Q Z PY / / 0. This means that Y 0 and QZ QZ / Y 0 as well. Thus, we have that Y : Y is QZ incompatible with v2 .

г

жв

е

вг

б

вг

ж

вг

б

е

г

г

ж

в

жв

е

вг

ж

вг

б

г

е

г

г

L E M M A 6 . 3 . Let X be an ECR move executable on an unrooted leaf-labeled binary tree. The incompatibility graph induced by X is not connected if and only if X is separable.

вг в

гв

вб

  бв

г

вг в

вг в

г

г

вг

в

в

вб б

г

The following is a necessary and sufficient condition for separability of a p-ECR operation the proof of which we omit:

/ 1. Y Q 0, since as we already saw, Q P Y . Also, / 0. Moreover, Q Y (since Y Q. Hence, Y Q / / 0. Similarly, Q Y 0. Q P ), and hence Q Y Thus, we have that Y : Y is incompatible with v1 .

ж

е

ж

вг

ж

б

е

е

 в

г

ж

ж

г

жв

е

е

ж

е

в

г

ж

в

вб

D E FI N I T I O N 4 . A p-ECR operation X is separable if and only if there are two ECR moves X1 and X2 such that X X2 X1 X1 X2 . The ECR moves X1 and X2 are then said to be commutable.

жв

б

вб г

6.2 Commutable p-ECR Moves

в

ж

и

вб

в

ж

в

и

ж

 
жв

е

жв в г вб вг в г б

в

б

вб

ж

г д

ж

и

жв

в

жв

е

е

б

г

б

г

ж

ж

в

г

ж

и

ж

е

жв

 
е

жв

жв

жв

б

е

ж

е

г

б

ж

г

е

и

в

и

бв

г

и

е

и

Proof. Let G U V E be the incompatibility graph corresponding to the p-ECR move X . The graph G can be constructed in O n p2 time as follows: The sets U and V can be computed in O n time, while calculating the RF distance between T and T [4]. Once U and V have been determined, E can be calculated in O n p2 time, since for each bipartition in U , we can identify all bipartitions in V incompatible with it in O p time. Once we have G, we use the method in [15], Section 4.3, to decompose G into maximal vertex-disjoint components such that the subgraph of G induced by each component is elementary, as follows: we compute a perfect matching in G (which is guaranteed to exist by Lemma 6.2) and then find the strongly connected components in an associated directed graph. This can be accomplished in O p2 time. Let the vertices of the strongly connected components be ordered as S1 T1 Sk Tk according to a topologically sorted order in the above directed graph. Then if we let component i correspond to the ith ECR operation Xi , it is assured that operation Xi can be performed once operations X1 through Xi 1 have been performed and the outcome of the sequence of operations X1 through Xk is X .

в

е

ж з

е

г

ж

в

е

ж

ж

е

THEOREM 6 . 2 . Let X be a p-ECR move that can be performed on an unrooted binary leaf-labeled tree on n leaves. Then, in O n p2 time, we can determine if X is reducible, and we can compute an irreducible decomposition of X .
ж и и е ж

form a connected subtree. The desired result would then follow. We omit proving that X is separable if the edges represented by C T C T do not form a connected subtree and focus on proving the other direction. Let U C T CT and V C T C T . We prove the following: let u1 u2 be two bipartitions in U , corresponding to two edges e1 and e2 adjacent in T . Let v1 and v2 be two bipartitions in V such that, u1 v1 E , u2 v2 E , u1 v2 E , and u2 v1 E . We show that u1 , u2 , v1 and v2 can be reached from one another in G. This would imply that if the edges in U form a single subtree in T , then G is connected. We now prove the above claim. Let u1 be the bipartition P : P and let u2 be P Y : P Y , for some Y P (this entails no loss of generality since u1 and u2 are compatible). Similarly, let v1 and v2 be Q : Q and Q Z : Q Z respectively, for some Z Q . Since u1 and v2 are incompatible, / 0 (it can be verified that the other we have P Q Z three pairwise intersection cannot be empty) from Lemma / 0. 6.1. Similarly, we have Q P Y Now, since the tree T is binary, and the edges e1 and e2 are adjacent, we have that Y : Y (where Y is the complement of Y ) is a bipartition in T , and the corresponding edge is adjacent to both e1 and e2 . We show that Y : Y is incompatible with both v1 and v2 , thus showing that u1 u2 v1 v2 are reachable from each other. / / Now, Q P Y 0 and Q P 0 implies that Q P Y . Now, we show that Y Q P . Suppose, to the contrary, that Y Q P . Then, Y Q. This, combined with the fact that Q Z P, means that Q Z P Y. / 0, and hence However, this contradicts Q Z PY Y Q P . Thus, we have Q P Y . We now show that Y : Y is incompatible with both v1 and v2 , thus completing our proof.

г

в

д

г

е

ж д

ж

в

ж

е

е

ж

ж

 
ж

жж изж

ж

и

е

 
е

жжж изз

е

 
и

ж

 
и

ж

ж

в

г

е

 
е

е

ж

 
е

г

 
е

ж

г

е

 
в


RF T T 2 p such that the incompatibility graph between T and T is elementary. Proof. (Sketch) The proof relies on constructing two trees T and T such that the incompatibility graph between them contains a Hamiltonian cycle, and hence is elementary. We sketch the construction briefly below. We will call a pair of leaves that are siblings as forming a cherry. Let T have k cherries, x1 y1 xk yk . We create T thus: T is identical to T as far as tree topology (neglecting leaf-labels) is concerned. Hence, T also contains k cherries. In T we let x1 yk x2 y1 x3 y2 xk yk 1 be the cherries. It can be shown that the incompatibility graph between T and T contains a Hamiltonian cycle. Acknowledgment. We thank an anonymous referee for pointing out reference [20] to us. References
[1] B. Allen and M. Steel. Subtree Transfer Operations and Their Induced metrics on Evolutionary Trees. Annals of Combinatorics, 5:1н15, 2001. [2] P. Buneman. The Recovery of Trees from Measures of Dissimilarity. Mathematics in the Archaelogical and Historical Sciences, pages 387н395, 1971. [3] B. Dasgupta, X. He, T. Jiang, M. Li, J. Tromp, and L. Zhang. On the Distances Between Phylogenetic Trees. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 427н436. ACM-SIAM, 1997. [4] W. H. E. Day. Optimal Algorithms for Comparing Trees with Labeled Leaves. Journal of Classification, 2:7н28, 1985. [5] J. Felsenstein. Evolutionary Trees for DNA Sequences: A Maximum Likelihood Approach. Journal of Molecular Evolution, 17:368н376, 1981. [6] W. Fitch. Toward Defining Course of Evolution: Minimum Change for a Specified Tree Topology. Systematic Zoology, 20:406н416, 1971. [7] G. Ganapathy, V. Ramachandran, and T. Warnow. Better Hill-Climbing Seaches for Parsimony. In Proceedings of the Third International Workshop on Algorithms in Bioinformatics (WABI), pages 245н258, 2003. [8] P. A. Goloboff. Analyzing Large Datasets in Reasonable Times: Solutions for Composite Optima. Cladistics, 15:415н 428, 1999. [9] P. Hall. On Representatives of Subsets. Journal of the London Mathematical Society, 10:26н30, 1935. [10] J. Hein, T. Jiang, L. Wang, and K. Zhang. On the Complexity of Comparing Evolutionary trees. Discrete Applied Mathematics, 71:153н169, 1996. [11] J. P. Huelsenbeck and F. Ronquist. MRBAYES: Bayesian Inference of Phylogenetic Trees. Bioinformatics, 17(8):754н 755, 2001. [12] J. P. Huelsenbeck, F. Ronquist, R. Nielsen, and J. P. Bollback. Bayesian Inference of Phylogeny and its Impact on Evolutionary Biology. Science, 294:2310н2314, 2001.
ж

[13] B. Larget and D.L.Simon. Markov Chain Monte Carlo Algorithms for the Bayesian Analysis of Phylogenetic Trees. Molecular Biology and Evolution, 16:277н283, 1999. [14] M. Li, J. Tromp, and L. Zhang. On the Nearest Neighbour Interchange Distance Between Evolutionary Trees. Journal of Theoretical Biology, 182:463н467, 1996. [15] L. Lovasz and M. D. Plummer. Matching Theory. NorthHolland Publishing Company, 1986. [16] D. R. Maddison. The Discovery and Importance of Multiple Islands of Most Parsimonious Trees. Systematic Zoology, 43(3):315н328, 1991. [17] C. A. Phillips and T. Warnow. The Asymmetric Median Tree: A New Model for Building Consensus Trees. Discrete Applied Mathematics, 71:311н335, 1996. [18] D. F. Robinson. Comparison of Labeled Trees with Valency Three. Journal of Combinatorial Theory, 11:105н119, 1971. [19] D. F. Robinson and L. R. Foulds. Comparison of Phylogenetic Trees. Mathematical Biosciences, 53:131н147, 1981. [20] D. Sankoff, Y. Abel, and J. Hein. A Tree, A Window, A Hill, Generalization of Nearest Neighbor Interchange in Phylogenetic Optimization. Journal of Classification, 11:209н232, 1994. [21] D. Swofford, G. J. Olson, P. J. Waddell, and D. M. Hillis. Molecular Systematics, chapter entitled "Phylogenetic Inference", pages 407н425. Sinauer Associates, Sunderland, Massachusetts, second edition, 1996.

д

 
и

ж

е

и

и

жжж зиз

е

и

и

жжж изи

ж

в

и

и

ж

е

и

и

ж

е

и

е

и

ж

и

в

е

в

в

гв
ж

в

в

в

и

е

909