Fast and Cheap Genome Wide Haplotype Construction via Optical Mapping T.S. Anantharaman, V. Mysore, and B. Mishra Pacific Symposium on Biocomputing 10:385-396(2005) September 24, 2004 0:51 Proceedings Trim Size: 9in x 6in amm12a-procs9x6 FAST AND CHEAP GENOME WIDE HAPLOTYPE CONSTRUCTION VIA OPTICAL MAPPING T.S. ANANTHARAMAN, V. MYSORE, AND B. MISHRA Wisconsin Biotech Center, Univ. Wisc., Madison WI, U.S.A Courant Institute of Mathematical Sciences, NYU, New York, NY, U.S.A. E-mail:tsa@biostat.wisc.edu; {vm40,mishra}@nyu.edu We describe an efficient algorithm to construct genome wide haplotype restriction maps of an individual by aligning single molecule DNA fragments collected with Optical Mapping technology. Using this algorithm and small amount of genomic material, we can construct the parental haplotypes for each diploid chromosome for any individual. Since such haplotype maps reveal the polymorphisms due to single nucleotide differences (SNPs) and small insertions and deletions (RFLPs), they are useful in association studies, studies involving genomic instabilities in cancer, and genetics, and yet incur relatively low cost and provide high throughput. If the underlying problem is formulated as a combinatorial optimization problem, it can be shown to be NP-complete (a special case of K -population problem). But by effectively exploiting the structure of the underlying error processes and using a novel analog of the Baum-Welch algorithm for HMM models, we devise a probabilistic algorithm with a time complexity that is linear in the number of markers for an -approximate solution. The algorithms were tested by constructing the first genome wide haplotype restriction map of the microb e T. pseudoana , as well as constructing a haplotype restriction map of a 120 Mb region of Human chromosome 4. The frequency of false positives and false negatives was estimated using simulated data. The empirical results were found very promising. 1. Intro duction Diploid organisms, such as humans, carry two mostly similar copies of each chromosome, referred to as haplotypes. Variations in a large population of haplotypes at specific loci are called polymorphisms. The co-associations The work rep orted in this pap er was supp orted by grants from NSF's Qubic program, NSF's ITR program, Defense Advanced Research Pro jects Agency (DARPA), Howard Hughes Medical Institute (HHMI) biomedical support research grant, the US Department of Energy (DOE), the US air force (AFRL), National Institutes of Health (NIH) and New York State Office of Science, Technology & Academic Research (NYSTAR). September 24, 2004 0:51 Proceedings Trim Size: 9in x 6in amm12a-procs9x6 of these variations across the loci indices are of intense interest in disease research. The main limitation of most SNP based approaches is that each SNP is assayed separately without the related phasing information. Instead, the phase is inferred statistically from a large population of SNP data and employ certain simplifying assumptions such as: parsimony in the total number of different haplotypes in the population, the Hardy-Weinberg equilibrium, perfect phylogeny to combinatorially constrain the possible haplotypes. See the full paper 4 for a detailed survey of the literature. For a genotyping method to be able to correctly determine the phasing between neighboring polymorphic markers in every individual haplotype map, it must ultimately be able to test single DNA fragments containing 2 or more heterozygous polymorphic markers in a single test. It is possible, of course, to assemble individual haplotype maps by sequencing the individual's entire genome using a modified sequence assembly algorithm 7,9 but the cost of doing this is prohibitivea . Here, we propose a direct and more cost-effective approach using the fairly well developed single molecule technology of Optical Mapping. Each individual haplotype map of restriction sites will only detect a small fraction of all polymorphisms in the human genome, but using a commonly accepted linkage disequilibrium assumption (see4 ), approximately 8 individual haplotype restriction maps will contain more than the 300,000 SNPs required to infer all other known polymorphisms in the individual genome. Even with 50 fold data redundancy required, all date required for 8 individual haplotype restriction maps can be collected for under $1000. 2. Problem Formulation Our problem can be formulated mathematically as follows: We assume that all individual single molecule DNA fragments are derived from a diploid genome (ignoring the case of sex chromosomes) with two copies of homologous chromosomes. Each DNA fragment is further mapped by cleavage with a restriction enzyme of choice and imaged by an imaging algorithm to produce an ordered sequence of "restriction fragment lengths" or equivalently, "restriction sites." The variations in these restriction fragment lengths are primarily due to RFLPs as well as SNPs at the restriction sites. Additionally, there are further variations introduced by the experimental a This cost has been estimated to be over $10 million per individual. September 24, 2004 0:51 Proceedings Trim Size: 9in x 6in amm12a-procs9x6 process and could be assumed due to: sizing errors, partial digestion, short missing restriction fragments, false cuts, ambiguities in the orientation, optical chimerisms, etc. Thus, the genomes may be represented as two haplotype restriction maps, H1 and H2 , for the same individual which differ only slightly from a genotype restriction map H by a small number of short insertions, deletions and SNPs that coincide with restriction sites. All such maps, H , H1 and H2 , are assumed to be representable as a sequence of restriction sites (e.g. H2,i , with indices 0 i (N + 1), where H2,0 and H2,N +1 represent the chromosome ends), but are unknown. However, short DNA fragments of around 500 Kb derived from such maps, and further corrupted by experimental noise processes can be readily generated at high throughput and very low cost using a technology like Optical Mapping (see the full paper 4 , and additional references therein). These short DNA fragments will be written as Dk , with indices 1 k M , where M is the number of data fragments and each data fragment is in turn represented as a sequence of restriction sites (e.g. Dk,j , 0 j mk + 1 ) and can be aligned globally to create an estimate of genotype map H using algorithms described previously 2 . The algorithmic problem, we wish to study, is to further separate H into two maps H1 and H2 in such a manner that each data fragment Dk is aligned well to one haplotype or other and that H1 and H2 differ from H only by modifications consistent with SNPs or RFLPs polymorphisms. Thus, ultimately, this problem corresponds to a problem of refining a multiple map alignment into two families, starting with one global alignment. A combinatorial generalization, where the number of such families is arbitrarily large (k > 1) and the cost of each alignment is arbitrarily unconstrained, has been shown to lead to computationally infeasible problems. See 8 for the proof of NP-completeness as well as a probabilistic analysis to show conditions under which the problem can be solved efficiently with a probability close to one. The key to an effective solution of these problems relies on careful experiment design (e.g., choice of coverage, restriction enzyme, experimental conditions, etc.) to ensure conditions under which a polynomial time probabilistic algorithm will work with high probability in conjunction with a Bayesian error model that encodes the error processes properly. To construct individual haplotype maps from Optical Mapping data we use a mixture hypothesis of pairs of maps H1 and H2 for each chromosome, corresponding to the correct restriction map of the two parental chromosomes. We first assemble the data into a regular map of the entire genome September 24, 2004 0:51 Proceedings Trim Size: 9in x 6in amm12a-procs9x6 and use this assembly to separate the data into distinct chromosome sets: all maps from the same chromosome belonging to a pair will be included in the same set. We then use a probabilistic model of the errors in the data to derive conditional probability density expressions f (Dk |H1 ) and f (Dk |H2 ), and apply Bayes rule to maximize a score for the best alignment with respect to proposed H1 and H2 , Equation 1.: f (H1 , H2 |D1 , . . . , DM ) f (H1 , H2 )f (D1 , . . . , DM |H1 , H2 ) (1) The first term on the right side is the prior probability of H1 and H2 and we just use a low prior probability for each polymorphism (difference in H1 vs. H2 ). For the conditional probability term, we can assume each map is a statistically independent sample from the genome and that the mapping errors are drawn from i.i.d. distributions and hence write: f (D1 , . . . , DM |H1 , H2 ) = kM [f (Dk |H1 ) + f (Dk |H2 )] 2 =1 (2) The conditional terms of the form f (Dk |Hi ) above can be written as a summation over all possible (mutually exclusive) alignments between the particular Dk and Hi , and for each alignment the probability density is based on an enumeration of the map errors in the alignment and multiplying together the probability associated with each error under some suitable error model. The exact form of the error models suitable for Optical Mapping is described in the next section, but for almost any error models used the sum of the probability for all alignments can be computed effectively using dynamic programming. Other methods for assembling Optical Mapping data for relatively short clones into ordered restriction maps exist, as detailed in the full paper 4 , but they only focus on genotyping, and with widely varying degrees of success. See the full paper 4 for a detailed survey of the literature. We believe that this paper describes the first published algorithm for assembling single molecule data into haplotype maps. 3. Algorithm Following theorems form the basis for computing various conditional probabilities for a hypothesis. Theorem 3.1. Consider an arbitrary alignment between the data D and the hypothesis H , J th restriction site of D matching the I th restriction site of H . We wil l denote this aligned pair by J I . September 24, 2004 0:51 Proceedings Trim Size: 9in x 6in amm12a-procs9x6 1 H 2 3 45 I P N D 1 2 J Qm Figure 1. To define the notation required we consider a single arbitrary alignment between a particular data D and hypothesis H . Recall that N is the number of restriction sites in H and m the number of restriction sites in D . Any arbitrary alignment between D and H can be described as a list of pairs of restriction sites from H and D that describes which restriction site from H is aligned with which restriction site from D . As an example, Here the alignment consists of 4 aligned pairs (4, 2), (5, 2), (I , J ) and (P , Q). Notice that not all restriction sites in H or D need be aligned. For example between aligned pairs (I , J ) and (P , Q) there is one misaligned site on H and D each, corresponding to a missing site (false-negative) and extra-site (false-positive) in D . In this alignment a true small fragment between sites 4 and 5 in H are missing from D , which is shown by aligning both sites 4 and 5 in H with the same site 2 in D . Note that if two or more consecutive fragments in H are all missing in D , this would be described by aligning all sites for the missing fragments in H with the same site in D (rather than showing only the outermost of this set of consecutive sites in H aligned with D , for example). The expression for the conditional probability density of any alignment, such as the one here, can be written as the product of a number of probability terms corresponding to the regions of alignment between each pair of aligned sites, plus one probability term for each unaligned region at the two ends of the alignment. Let the probability density of the unaligned portion on the left and right end of such an alignment be denoted by fur (I , J ) on the right end if J I is the rightmost aligned pair, and ful (I , J ) on the left end if J I is the leftmost aligned pair. In addition, the fol lowing probability density functions fm and fa denote the fol lowing: fm (I , P ) = P r[H [I ..P ] is missing in the observed data D]. fa (I , J, P, Q) = P r[H [I ..P ] is an aligned region but not a missing fragment with respect to the observed data region D[J..Q]]. We assume that I < P and J < Q. Then the fol lowing holds: f (D|H ) = NX X m +1 I =1 J =0 ful (I , J )f (D[J..m + 1]|H [I ..N ] J I ). September 24, 2004 0:51 Proceedings Trim Size: 9in x 6in amm12a-procs9x6 f (D[J..m + 1]|H [I ..N ] J I ) = fur (I , J ) + fm (I , I + 1)f (D[J..m + 1]|H [I + 1..N ] J (I + 1)) + N X m +1 X fa (I , J, P, Q)f ([Q..m + 1]|H [P..N ] Q P ) P =I +1 Q=J +1 In particular, if the intermediate values are kept in a DP table Asuf [I , J ] Asuf [I , J ] = f (D[J..m + 1]|H [I ..N ] J I ) then it is easily seen that f (D|H ) can be computed exactly in O(m2 N 2 ) time and O(mN ) space, assuming that fm and fa are O(1) time functions andI ful and fur are O(N ) time functions. n a later section we will see how to reduce the complexity to linear time ~ when we only require an -approximate value f ~ f (D|H ) - < f (D|H ) < f (D|H ) - , for the probability density function arising in the context of optical mapping as follows: H fm (I , I + 1) = P I +1 -HI fa (I , J, P, Q) = Q-J -1 Pd (1 - Pd )P -I -1 (1 - P )HP -HI G(HP -HI ),2 (HP -HI ) (DQ - DJ ), where Pd = the digest rate, = the false-positive site rate, 2 h = the Gaussian sizing error variance for a fragment of size h, P = the probability of missing a fragment of unit size, and Re = the breakage rate of DNA (the inverse of the expected fragment size). For a random variable x following a Gaussian distribution N(µ, 2 ), the probability density value at d is Gµ,2 (d) = exp[-(d - µ)2 /2 2 ]/( 2 ). The exact form of the functions for ful and fur for Optical Mapping are complicated, but do not affect the complexity of the algorithm; thus a detailed discussion is omitted here, but can be seen in the full paper 4 . The key assumption required is that ful and fur permit O(1) -approximation. As it has been shown elsewhere 1 , a good approximate location of the best alignment between D and H can be determined in O(1) expected time, if the conditional probability density has been previously evaluated for a similar H or alternatively, through a geometric hashing algorithms. Only September 24, 2004 0:51 Proceedings Trim Size: 9in x 6in amm12a-procs9x6 a O(1)-width band of the DP table needs to be evaluated to compute an ~ approximation f (D|H ). In particular, the band width of the DP table used in practice is usually about = 8; more generally for Optical mapping is bounded by (1 - Pd )-1 = , or =1+ ln() . ln(1 - Pd ) With this approach we achieve a reduced time complexity of O(min(m, N )) (more explicitly, O(min(m3 , N ))). Now we show how we may recompute conditional probabilities for a modification to hypothesis: How can one re-evaluate the conditional probability distribution function, f (D|H = p(H )) when the new hypothesis, H , has been obtained by locally changing H in just one place (corresponding to a polymorphism). There are three cases to consider. We study one of the three cases here in detail and refer the reader to 4 for the remaining cases. The omitted cases are similar but tedious. We may obtain H by (1) Deleting one of the existing restriction sites in H , as the site may contain a heterozygous SNP; (2) Adding a new restriction site at a specified location in H , symmetrical to the previous case; (3) Increasing or decreasing a restriction fragment length in H , an RFLP; Consequently, we may also need to compute the first and second derivative of f (D|H ) relative to the change in any fragment size in H . Theorem 3.2. Consider an arbitrary alignment between the data D and the hypothesis H , J th restriction site of D matching the I th restriction site of H . Using the notations of the previous discussion, we write: Asuf [I , J ] = f (D[J..m + 1]|H [I ..N ] J I ), and Apref [I , J ] = f (D[0..J ]|H [1..I ] J I ). Then Asuf [I , J ] = fur (I , J ) + fm (I , I + 1)Asuf [I + 1, J ] N + P =I +1 Qm+1 fa (I , J, P, Q)Asuf [P, Q], =J +1 September 24, 2004 0:51 Proceedings Trim Size: 9in x 6in amm12a-procs9x6 and similarly, Apref [I , J ] = ful (I , J ) + Apref [I - 1, J ]fm (I - 1, I ) + I J P-1 Q-1 =1 Apref [P, Q]fa (I , J, P, Q), =0 If H \ {HK } is obtained from H by deleting the site HK , then f (D|H \ {HK }) = P r [Alignments with rightmost aligned I < K ] + P r [Alignments with leftmost aligned J > K ] + P r [Alignments with a fragment spanning H [K - 1..K + 1]] = K - 1 m +1 XX I =1 J =0 N X m +1 X Apref [I , J ]fur (-Hk ) (I , J ) + ful (-Hk ) (I , J )Asuf [I , J ] I =K +1 J =0 +IK