September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology Providence, RI 02912 U.S.A. E-mail: {clkahn,braphael}@cs.brown.edu Segmental duplications are abundant in the human genome, but their evolutionary history is not well-understo od. The mystery surrounding them is due in part to their complex organization; many segmental duplications are mosaic patterns of smaller rep eated segments, or duplicons. A two-step mo del of duplication has b een proposed to explain these mosaic patterns. In this model, duplicons are copied and aggregated into primary duplication blocks that subsequently seed secondary duplications. Here, we formalize the problem of computing a duplication scenario that is consistent with the two-step mo del. We first describ e a dynamic programming algorithm to compute the duplication distance b etween two strings. We then use this distance as the cost function in an integer linear program to obtain the most parsimonious duplication scenario. We apply our metho d to derive putative ancestral relationships between segmental duplications in the human genome. 1. Intro duction Mammalian genomes consist of many repetitive sequences. Many of these are insertions of (retro)transposons and are present in many copies, but longer segmental duplications, or low-copy repeats, are also common. Current estimates suggest that approximately 5% of the human genome consists of segmental duplications > 1 kb in length with 90% sequence identity between copies1 . Moreover, the pattern of segmental duplications in primate genomes is unusual: many of these duplications are interchromosomal1 and are recent alterations in the human genome, occurring after the separation of the Old World monkey and great ape lineages. It is estimated that Work Work supp orted by a National Science Foundation Graduate Research Fellowship. supp orted by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera segmental duplications account for a significant fraction of the differences between humans and primate genomes, and it has been shown that some human segmental duplications contain novel gene families and genes under strong positive selection. In addition, segmental duplications are implicated in disease-causing rearrangements2 and copy-number polymorphisms in human populations1 . Thus, reconstructing the evolutionary history of segmental duplications is essential for understanding primate evolution and for ancestral genome reconstruction3 . Despite the prominence of segmental duplications, both their evolutionary history and the mechanisms by which large segments of the genome are duplicated and transposed to other genomic loci are poorly understood. Studies of the human genome sequence revealed the surprising fact that many segmental duplications are complex mosaics of smaller duplicated pieces4 . A two-step mo del of segmental duplication (reviewed in1 ) has been proposed to explain these mosaic patterns, but the convoluted nature of overlapping, interleaved duplicated material in the genome make segmental duplications refractory to traditional sequence analysis. Recently, Jiang El Ab.5 examined the ancestral relationships between human segmental duplications, and identified "clades" of segmental duplications that share an abundance of repeated subsequences. However, their approach ignored the order and orientation of these repeated subsequences within the segmental duplications, and thus did not explicitly explain the mosaic organization of segmental duplications. While algorithms have been developed for various aspects of duplication analysis ­ such as the evolution of gene clusters6,7,8 , the analysis of whole-genome duplications13,14 , and the computation of reversal distance9,10,11 and reconciliation of gene trees and species trees12 in the presence of orthologous and paralogous genes ­ none of these methods are directly applicable to the analysis of the organization of segmental duplications. The problems of determining the evolutionary history of human segmental duplications and of validating the two-step model have not yet received a rigorous algorithmic treatment. In this paper, we consider the problem of constructing a duplication scenario that is consistent with the two-step model of duplication, and that minimizes the total number of duplication operations. We formulate this problem as an integer linear program that is equivalent to the facility location problem, a classic problem in operations research. This formulation requires a measure of the minimum number of duplications to build a target string from a source string. We generalize the duplication distance algorithm introduced in 15 to compute the distance when both the source and September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera target strings contain repeated characters, a requirement for applications to real genomic data. We apply our methods to the duplication blocks derived in 5 and discover a two-step duplication scenario in which 13 seed duplication blocks are first constructed and then duplicated to create 416 secondary duplication blocks. 2. Metho ds Informally, given an ancestral genome that is devoid of duplications and a present-day genome rife with duplicated material, the problem is to compute a sequence of duplication events by which the ancestral genome is transformed into the present-day genome. In particular, we are interested in computing the simplest or most-parsimonious sequence of duplication events that accomplishes this transformation. In the next section, we describe a model of evolution that is consistent with the two-step model of duplication. Then in Section 2.2, we formulate the problem of computing the most-parsimonious duplication scenario as an integer linear program. Finally, in Section 2.3 we describe an algorithm to compute the duplication distance between a pair of signed strings, a distance used to define the cost function in the integer linear program. This algorithm generalizes the algorithm presented in15 , and an implementation is available upon request. 2.1. Problem Formulation As described above, the segments of the present-day genome that contain duplicated material, hereafter duplication blocks, contain complex mosaic patterns of smaller segments, hereafter duplicons, that appear in multiplicity across the genome. We model both the ancestral and present-day genomes as signed strings of duplicons. We assume the present-day genome, which has incurred segmental duplications, is a superstring of the ancestral genome, and the duplication blocks are substrings of the present-day genome. See Figure 1. In the two-step model of duplication, duplicons are copied from their ancestral loci and aggregated into larger, contiguous segments or seed duplication blocks during the first duplication phase. In the second phase, substrings of both the seed blocks and the ancestral genome are copied and then reinserted into the genome at disparate locations, creating secondary duplication blocks. We say that a seed duplication block seeds a secondary duplication block in the second phase if substrings of the seed block are used in the construction of the secondary block. See Figure 2a. Note that a seed block may seed multiple secondary duplication blocks but some seed September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera 4 ancestral genome present-day genome duplication blocks Figure 1. The present-day genome is a superstring of the ancestral genome. The duplicated material comprises duplication blocks which are maximal contiguous substrings of the present-day genome that were not part of the ancestral genome. blocks may not seed any secondary blocks. We make four simplifying assumptions about the two-step model of duplication: (1) The ancestral genome contains exactly one copy of every duplicon. (2) No other type of rearrangement operations ­ such as inversions of deletions ­ occur. (3) The seed blocks are a subset of the duplication blocks observed in the present-day genome. (4) Each secondary duplication block is seeded by exactly one seed duplication block. Under these assumptions, we can summarize a two-step duplication scenario in a two-step duplication tree. Definition 2.1. Given ancestral and present-day genomes, a two-step duplication tree is a tree of height three where the root is the ancestral genome and the descendants are the duplication blocks. Nodes at depth one (i.e. the children of the root) are the seed blocks created in the first phase of duplication, while nodes at depth two (i.e. children of seed blocks) are the secondary duplication blocks constructed from substrings of one seed block and of the ancestral genome. See Figure 2b. For a given pair of ancestral and present-day genomes, a most-parsimonious two-step duplication tree is that which defines a partition of the duplication blocks into seed duplication blocks and secondary duplication blocks and defines the ancestral relationships between seed and secondary blocks such September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera ancestral genome primary duplications seed duplication blocks secondary duplications { { (a) (b) secondary duplication blocks Figure 2. (a) The two-step model of duplication. Solid arrows indicate duplicons copied during the first phase of duplication. Dashed arrows indicate duplicons copied during the second phase of duplication. (b) The corresp onding two-step duplication tree. that the total number of duplication events needed to construct first the seed blocks and then the secondary blocks is minimum. We define a duplication event as one duplicate operation : Definition 2.2. A duplicate op eration, s,t,p (X ), copies a substring xs . . . xt of a source string X and pastes it into another string Z at position p. Specifically, if X = x1 . . . xm and Z = z1 . . . zn , then Z s,t,p (X ) = z1 . . . zp-1 xs . . . xt zp . . . zn . In15 , we introduced a distance measure from a source string to a target string, called duplication distance. Definition 2.3. The duplication distance from a source string X to a target string Y , denoted DX (Y ), is the number of duplicate operations in the simplest (i.e. shortest) sequence of duplicate operations that transforms an initially empty string into the target Y by repeatedly inserting substrings of X . The total duplication distance for a two-step duplication tree is the sum of the number of duplicate operations needed to build all the duplication blocks. We express the number of duplicate operations needed to build a seed block Bi from the ancestral genome G as DG (Bi ). Secondary duplication blocks are built from substrings of both its parent seed block and the ancestral genome. Thus, we express the number of duplicate operations needed to build a secondary block Bj from its parent seed block Bi and G as DBi G (Bj ), where Bi G denotes the concatenationa of the strings Bi a We insert a "dummy character" b etween Bi and G in the concatenate to avoid copying September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera and G . We now state our problem. Problem: Computing a Most-Parsimonious Two-Step Duplication Tree. Input: The ancestral genome G and a set duplication blocks B1 , . . . , BN from the present-day genome. Output: The two-step duplication tree (Definition 2.1) with minimum total duplication distance. 2.2. Computing the Two-Step Duplication Tree In this section, we show how to formulate the problem of constructing a most-parsimonious two-step duplication tree as an integer linear program (ILP). A two-step duplication tree for a given ancestral genome and a set of duplication blocks is defined by a labeling of each of the N duplication blocks as either seed blocks or as secondary blocks. In addition to this labeling, we must also define for each secondary block which seed duplication block seeded it, i.e. which seed block is its parent in the tree. A most-parsimonious two-step duplication tree is a solution of the following integer linear program. iN i N jN v min (1) (ui × DG (Bi )) + ij × DBj G (Bi ) U,V =1 =1 =1 such that j vij = 1 for all i (2) (3) (4) N vij - uj 0 for all i, j ui {0, 1} and vij {0, 1} The binary variables U = [u1 , . . . , uN ] and binary matrix V = [vij ]i,j =1 describe the topology of the duplication tree. The binary variable ui indicates whether a duplication block Bi is labeled as a seed block and thus defines an edge in the tree from the root G to Bi . The binary variable vij indicates that secondary duplication block Bi is seeded by seed block Bj and thus block Bi is a child of Bj in the tree. We note that this program is equivalent to a special case of the facility location problem, a classic NP-hard combinatorial optimization problem. substrings across the b oundary. September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera The input to the facility location problem is a set of customers and a set of potential facility sites. For each site, there is a cost associated with opening a facility, and for each site-customer pair, there is a cost associated with supplying that customer from a facility at that site. The ob jective is to minimize the total cost of opening facilities and supplying customers such that every customer is supplied by exactly one open facility. In the context of the two-step duplication tree, each duplication block is both a customer to be supplied and the site of a potential facility. Opening a facility at site Bi corresponds to classifying Bi as a seed duplication block. Supplying customer Bj from facility Bi corresponds to classifying Bj as a secondary block that is constructed from substrings of seed block Bi and the ancestral genome G . 2.3. Duplication Distance Algorithm In this section, we present a dynamic programming algorithm for computing duplication distance from signed source string X to signed target string Y . This algorithm generalizes the O(|Y |4 ) algorithm introduced in15 for the special case where the source string X is non-ambiguous meaning that it contains at most one instance of each character. Here, we present an algorithm whose complexity is O(|Y |3 |X |) for the general case where both X and Y are ambiguous. We assume that all the characters that appear in the target string also appear at least once in the source string. Intuitively, if signed strings X and Y are "close" in duplication distance, they share common subsequences and thus there is duplication scenario that builds Y by repeated insertions of substrings of X (see Definitions 2.2 and 2.3). We define some notation. Given a signed string Y = y1 y2 . . . yn , let Ys,t = ys ys+1 . . . yt be the substring ranging between indices s and t, inclusive, for 1 s t n. Given strings X and Y , let IX,Y (i) = {j |xj = yi }. Finally, let d(Ys,t ) = DX (Ys,t ) be the number of duplicate operations required to build the substring Ys,t . First, note that we can compute the duplication distance for disjoint substrings of the target Y independently, allowing us to recursively divide the target string into subproblems. Suppose we wish to compute the duplication distance for a substring Ys,t of Y . There are two possible cases to consider: (i) ys was duplicated by itself as a substring of X of length one, or (ii) ys was copied into Y as part of a larger substring of X . In the first case, we proceed recursively on Ys+1,t . In the second case, we note that because X may be ambiguous, we must consider the possibility that the substring September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera of X that was copied to produce the character at index s in Y might have started at any of the indices in IX,Y (s). If a substring of X beginning at index i and having length greater than one is duplicated and inserted into Y , the character xi+1 must appear in Y at some index j IYs+1,t ,X (i + 1). Note, that although xi was copied into Y at index s, it need not be the case that xi+1 appear at index s + 1 in Y because subsequent duplicate operations may have inserted substrings into Y in between the characters xi and xi+1 . This observation leads to the following theorem. Theorem 2.1. Given a string Ys,t and indices s t, d(Ys,t ) satisfies the fol lowing recurrence. 1 if s = t (base case), d(Ys,t ) = (5) miniIX,Y (s) di (Ys,t ) otherwise, where di (Ys,t ) = min {as,t , bs,t }, as,t = 1 + d(Ys+1,t ), bs,t = j IYs+1,t ,X (i+1) min (d(Ys+1,j -1 ) + di+1 (Yj,t )) . Intuitively, di (Ys,t ) represents the number of duplicate operations needed to build Ys,t given that the character ys resulted from the copying of a substring of X beginning at index i. The value of di (Ys,t ) can be expressed as the minimum of two values corresponding to the two possible cases described above. And the value d(Ys,t ) can then be computed by considering all possible indices i such that ys is the same character as xi . The running time of the recurrence is bounded by the time to compute di (Ys,t ) for all possible values of i, s and t. For each substring Ys,t , we compute di (Ys,t ) for each index i IX,Y (s). This is bounded by O(|X |) for all values of i, s and t. Because there are O(|Y |2 ) substrings Ys,t , we compute di (Ys,t ) a total of O(|Y |2 |X |) times. Finally, since the time to compute di (Ys,t ) for fixed i, s, and t is bounded by |IYs+1,t ,X (s + 1)| O(|Y |), the total running time is O(|Y |3 |X |). Note that it is straightforward to extend our algorithm to allow duplicate reversal operations on signed strings, i.e. duplicate operations in which the copied substring of X is inverted before being inserted into Y . We can accomplish by by defining d(Ys,t ) as the minimum of d(Ys,t ) and d(-Ys,t ) for all values of s and t, where -Ys,t denotes the reversal of Ys,t . September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera 3. Results We implemented our two-step duplication tree method to analyze the ancestry of segmental duplications in the human genome. We used data from5 who identified 4,692 ancestral duplicons that appeared in complex mosaic arrangements within 437 duplication blocks in the human reference genome (hg17, May 2004). We enumerated the duplicons according to their ancestral location in the genome and aligned each duplication block to all duplicons using Nucmer16 thereby representing each duplication block as a signed string in the alphabet of integers between -4692 and +4692. This process yielded a total of 429 duplication blocks, as 8 blocks could not be aligned to duplicons. We defined the ancestral genome G to be the string of duplicons in the order of their ancestral loci, i.e. +1 + 2 · · · + 4692. We inserted dummy characters between successive duplicons whose ancestral loci were greater than 10kb apart. For each ordered pair of duplication blocks Bi , Bj , we computed the distance DBi G (Bj ) using the algorithm presented in Section 2.3. We also computed, for every duplication block Bi , the distance DG (Bi ) from G to that block. Using these distances, we solved the ILP defined in (1) using CPLEX. Given the solution to the ILP, we labeled all blocks Bi such that the binary variables ui = 1 as seed blocks. We labeled the remaining blocks as secondary blocks. For any pair of blocks Bi , Bj , if the binary variable vij = 1, we designated secondary block Bi to be a child of seed block Bj . The corresponding two-step duplication tree is shown in Figure 3. This tree has 13 seed blocks, with one block (chr11:49122753-50286686) having 301 children. This duplication block is long, consisting of 2922 ancestral duplicons, but is not the longest in physical distance. Interestingly, this block is located near the centromere of chromosome 11, consistent with the hypothesis that duplicon seeding occurs in pericentromeric regions1 . However, the large number of children of this block is likely an artifact that results from our restriction that secondary blocks are descended from exactly one seed block. A more parsimonious duplication scenario might have been that this block was itself formed by duplication from several other seed blocks. 4. Discussion and Future Directions We formulated the problem of computing the most-parsimonious duplication scenario consistent with the two-step model, and showed how to solve this problem as an integer linear program that is equivalent to the facil- September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera Figure 3. Most-parsimonious two-step duplication tree. The large node is the ro ot and represents the ancestral genome. The thirteen children of the ro ot are seed blo cks that were constructed during the first step of duplication and are represented as mediumsized circles. Children of seed blo cks were constructed during the second step and are represented as small squares. Starting with the seed blo ck with the most children and pro ceeding clo ck-wise around the ro ot, the seed blocks are: chr11:4912275350286686, chr9:87954745-87985307, chr22:19790229-20121932, chr9:38462158-38620390, chr7:127698302-127893857, chr9:42280547-42512326, chr17:31799963-31889184, chr10:30678896-30730597, chr17:18219669-18692134, chr9:94148626-94280597, chr7:74410377-74802698, chr16:88650921-88822254, and chr17:34433993-34453309. ity location problem. To our knowledge, this work is the first attempt to construct the ancestry of segmental duplications using parsimony methods. Our two-step duplication tree provides a set of putative seed blocks that are useful for further studies of the history of segmental duplications in the human genome. Jiang et al.5 used a different approach to analyze the ancestral relationships of segmental duplications. They represented each duplication block as a binary vector indicating the presence or absence of each duplicon and performed hierarchical clustering on these binary vectors. This analysis ignores the order and orientation of duplicons within a duplication block. We compared our 13 seed groups of duplication blocks (consisting of a September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera seed block and its children) to the 24 "clades" of closely-related duplication blocks reported in Jiang et al. We found that some of our groups were enriched for blocks from a single clade. For example, two of the four members of seed group chr17:34433993-34453309 are shared with the 21 members of Clade M1 defined in5 , (p = 0.017 by hypergeometric test). However, in many instances the clusters produced by the two methods differed. To investigate further, we searched for instances where duplicon content and duplication distance diverged. For example, duplication block B1 = chr7:74410377-74802698 is a seed block and is the parent of secondary block B2 = chr16:5067675-5156097 in our analysis (Figure 3). In fact, the duplication distance DB1 G (B2 ) = 113 is significantly lower than the average duplication distance to B2 over all duplication blocks. However, the duplicon content of these two duplication blocks is sufficiently different so that the analysis in5 placed them into different clades: chr7-2 and M1, respectively. This occurred because although these two duplication blocks did not share a large percentage of their duplicons, surprisingly they did share a few long, conserved subsequences of duplicons that appeared in the same order in both duplication blocks. The presence of conserved subsequences of duplicons suggests that deriving evolutionary relationships from duplicon content alone might be misleading in some cases. Conversely, we also found examples of duplication blocks that, while very similar in duplicon content, were "far" in terms of duplication distance indicating that the duplicons were not organized into the same order in both blocks. Duplication blocks B3 = chr11:3365455-3580141 and B4 = chr16:5229432-5285786 were both placed into clade M1 by5 , but were children of seed blocks chr11:4912275350286686 and chr7:127698302-127893857, respectively, in our analysis. Our method relied on several simplifying assumptions. These included: (1) duplicate operations were the only genome rearrangement events that occurred in constructing segmental duplications; (2) seed blocks that were constructed in the first phase of duplication still exist in an unaltered state in the present-day genome; (3) each secondary duplication block was descended from exactly one seed block. A more comprehensive analysis of segmental duplications will require the removal of some or all of these restrictions. In addition, we assumed that the work of identifying and delimiting the ancestral duplicons was already completed (e.g. by5 ). Incorporating duplication models into the annotation of duplication blocks themselves is an interesting avenue for future study. Finally, a similar two-step model involving extrachromosomal intermediates17 has been proposed18 to explain the extensive duplications of genomic segments in cancer genomes that September 23, 2008 0:22 Pro ceedings Trim Size: 9in x 6in Kahncamera also display a complex architecture19 . It would be interesting to extend our methods to infer the sequence of duplications that occur in a cancer genome. References 1. J. Bailey and E. Eichler, Nat. Rev. Genet. 7, 552 (2006). 2. P. Stankiewicz and J. Lupski, Trends Genet. 18, 74 (2002). 3. M. Blanchette, E. Green, W. Miller and D. Haussler, Genome Res. 14, 2412 (2004). 4. J. Bailey, Z. Gu, R. Clark, K. Reinert, R. Samonte, S. Schwartz, M. Adams, E. Myers, P. Li and E. Eichler, Science 297, 1003 (2002). 5. Z. Jiang, H. Tang, M. Ventura, M. F. Cardone, T. Marques-Bonet, X. She, P. A. Pevzner and E. E. Eichler, Nature Genetics 39, 1361 (2007). 6. Y. Zhang, G. Song, T. Vinar, E. D. Green, A. C. Siepel and W. Miller, Reconstructing the evolutionary history of complex human gene clusters, in RECOMB , 2008. 7. O. Elemento, O. Gascuel and M.-P. Lefranc, Mol Biol Evol 19, 278 (2002). 8. M. La joie, D. Bertrand, N. El-Mabrouk and O. Gascuel, J. Comp. Bio. 14, 462 (2007). 9. N. El-Mabrouk, J. Comput. Syst. Sci. 65, 442 (2002). 10. X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi and T. Jiang, IEEE/ACM Trans. Comput. Biol. Bioinformatics 2, 302 (2005). 11. M. Marron, K. Swenson and B. Moret, Genomic distances under deletions and insertions (2003). 12. L. Arvestad, A.-C. Berglund, J. Lagergren and B. Sennblad, Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution, in RECOMB '04: Proceedings of the eighth annual international conference on Research in computational molecular biology , (ACM, New York, NY, USA, 2004). 13. N. El-Mabrouk and D. Sankoff, SIAM J. Comput. 32, 754 (2003). 14. M. A. Alekseyev and P. A. Pevzner, SIAM J. Comput. 36, 1748 (2007). 15. C. Kahn and B. Raphael, Bioinformatics 24, i133 (2008). 16. S. Kurtz, A. Phillippy, A. Delcher, M. Smoot, M. Shumway, C. Antonescu and S. Salzberg, Genome Biol. 5 (2004). 17. B. E. Windle and G. M. Wahl, Mutat Res 276, 199 (1992). 18. B. Raphael and P. Pevzner, Bioinformatics 20 Suppl 1, i265 (2004). 19. B. Raphael, S. Volik, P. Yu, C. Wu, G. Huang, E. Linardopoulou, B. Trask, F. Waldman, J. Costello, K. Pienta, G. Mills, K. Ba jsarowicz, Y. Kobayashi, S. Shivaranjani, P. Paris, Q. Tao, S. Aerni, R. Brown, A. Bashir, J. Gray, J. Cheng, P. de Jong, M. Nefedov, T. Ried, H. Padilla-Nash and C. Collins, Genome Biol. 9, p. R59 (2008).