Pacific Symposium on Biocomputing 13:515-526(2008)

CMARRT: A TOOL FOR THE ANALYSIS OF CHIP-CHIP DATA FROM TILING ARRAYS BY INCORPORATING THE CORRELATION STRUCTURE

PEI FEN KUAN1, HYONHO CHUN1, SÜNDÜZ KELEŞ1,2

1 Department of Statistics, 2 Department of Biostatistics and Medical Informatics, 1300 University Avenue, University of Wisconsin, Madison, WI 53706. E-mail: keles@stat.wisc.edu

Whole genome tiling arrays at a user specified resolution are becoming a versatile tool in genomics. Chromatin immunoprecipitation on microarrays (ChIP-chip) is a powerful application of these arrays. Although there is an increasing number of methods for analyzing ChIP-chip data, perhaps the simplest and most commonly used one, due to its computational efficiency, is testing with a moving average statistic. Current moving average methods assume exchangeability of the measurements within an array. They are not tailored to deal with the issues due to array designs such as overlapping probes that result in correlated measurements. We investigate the correlation structure of data from such arrays and propose an extension of the moving average testing via a robust and rapid method called CMARRT. We illustrate the pitfalls of ignoring the correlation structure in simulations and a case study. Our approach is implemented as an R package called CMARRT and can be used with any tiling array platform.

Keywords: ChIP-chip, moving average, autocorrelation, false discovery rate.

1. Background

Whole genome tiling arrays utilize array-based hybridization to scan the entire genome of an organism at a user specified resolution. Among their applications are ChIP-chip experiments for studying protein-DNA interactions. These experiments produce massive amounts of data and require rapid and robust analysis methods.
Some of the commonly used methods are ChIPOTle,1 Mpeak,2 TileMap,3 HMMTiling,4 MAT5 and TileHGMM.6 Although these algorithms have been shown to be useful, they do not address the issues due to array designs. The most obvious issue is the correlation of the measurements from probes mapping to consecutive genomic locations.15 This correlation structure arises from both the overlapping probe design and the fragmentation of the DNA sample to be hybridized on the array. There are several hidden Markov model (HMM) approaches to address the dependence among probes, but the current implementations are limited to first order Markov dependence.4 Generalizations to higher orders increase the computational complexity immensely.

We investigate the correlation structure of data from complex tiling array designs and propose an extension of the moving average approaches1,7 that carefully addresses the correlation structure. Our approach is based on estimating the variance of the moving average statistic by a detailed examination of the correlation structure and is applicable with any array platform. We illustrate the pitfalls of ignoring the correlation structure and provide several simulations and a case study illustrating the power of our approach CMARRT (Correlation, Moving Average, Robust and Rapid method on Tiling array).

2. Methods

Let $Y_1, \ldots, Y_N$ denote measurements on the $N$ probes of a tiling path. At the $i$-th probe, $Y_i$ could be an average log base 2 ratio of the two channels or a (regularized) paired t-statistic for arrays with two channels (e.g., Nimblegen), and a (regularized) two sample t-statistic for single channel arrays (Affymetrix). This wide range of definitions of $Y_i$ makes our approach suitable for experiments with both single and multiple replicates per probe.
A common test statistic for analyzing ChIP-chip data is a moving average of the $Y_i$'s over a fixed number of probes or a fixed genomic distance.1,3,7 The parameter $w_i$ defines a window of $2w_i + 1$ probes, i.e., $w_i$ probes to the right and to the left of the $i$-th probe. In the case of a moving average across a fixed number of probes for tiling arrays with constant probe length and resolution, $w_i$ is calculated from $L \times (2w_i + 1) - 2w_i \times O = FL$, where $L$ is the probe length, $O$ is the overlap between two probes and $FL$ is the average fragment size. Our framework also covers tiling arrays with non-constant resolution. In this case, $w_i$ differs across genomic intervals and corresponds to the number of probes within a fixed genomic distance. For simplicity in presentation, we will use a window of a fixed number of probes. We assume that the data have been properly normalized, potentially taking into account the sequence features,8 and that $E[Y] = \mu$ and $\mathrm{var}(Y) = \sigma^2$. Consider the following moving average statistic

$$T_i = \frac{1}{2w_i + 1} \sum_{j=i-w_i}^{i+w_i} Y_j. \qquad (1)$$

Then, a standard variance calculation leads to

$$\mathrm{var}(T_i) = \frac{1}{(2w_i + 1)^2} \left[ (2w_i + 1)\,\sigma^2 + \sum_{j=i-w_i}^{i+w_i} \sum_{k \neq j} \mathrm{cov}(Y_j, Y_k) \right]. \qquad (2)$$

The standardized moving average statistic is given by

$$S_i = \frac{T_i}{\sqrt{\mathrm{var}(T_i)}}. \qquad (3)$$

Standard practice with moving average statistics relies on (1) estimating $\sigma^2$ based on the observations that represent the lower half of the unbound distribution; (2) ignoring the covariance term in equation (2); and (3) obtaining a null distribution under the hypothesis of no binding at probe $i$. In particular, ChIPOTle considers a permutation scheme where the probes are shuffled and the empirical distribution of the test statistic over several shufflings is used as an estimate of the null distribution.
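
To make equations (1)-(3) concrete, the following is a minimal Python sketch, not the CMARRT implementation. It assumes a constant half-window `w`, a stationary background so that cov(Y_j, Y_{j+k}) depends only on the lag k, and a null mean of zero after normalization. The function names and the choice to return NaN for edge probes without a full window are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def half_window(probe_len, overlap, fragment_len):
    """Solve L*(2w+1) - 2w*O = FL for w (requires probe_len > overlap); round to an integer."""
    w = (fragment_len - probe_len) / (2.0 * (probe_len - overlap))
    return max(int(round(w)), 0)

def moving_average(y, w):
    """T_i of equation (1): mean of y over the 2w+1 probes centred at probe i.
    Probes closer than w to either end get NaN (an illustrative choice)."""
    y = np.asarray(y, dtype=float)
    n = 2 * w + 1
    t = np.full(y.shape, np.nan)
    if y.size >= n:
        t[w:y.size - w] = np.convolve(y, np.ones(n) / n, mode="valid")
    return t

def var_moving_average(sigma2, rho, w):
    """var(T_i) of equation (2), assuming cov(Y_j, Y_{j+k}) = rho[k] * sigma2.
    rho[0] is 1; lags beyond len(rho) - 1 are treated as uncorrelated (assumption)."""
    n = 2 * w + 1
    total = n * sigma2
    for k in range(1, n):
        r = rho[k] if k < len(rho) else 0.0
        # a window of n probes contains (n - k) pairs at lag k, each counted twice
        total += 2.0 * (n - k) * r * sigma2
    return total / n ** 2

def standardized_ma(y, w, sigma2, rho):
    """S_i of equation (3)."""
    return moving_average(y, w) / np.sqrt(var_moving_average(sigma2, rho, w))
```

Passing `rho=[1.0]` drops the covariance term and gives the independence variance sigma2 / (2w + 1). With positive autocorrelations the variance in equation (2) is larger, so the standardized statistic is smaller than under independence.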
As an alternative to permutation, a Gaussian approximation is utilized, assuming that the $Y_i$'s are independent and identically distributed normal random variables under the null distribution. As discussed by the authors of ChIPOTle, both approaches assume the exchangeability of the probes under the null hypothesis. Exchangeability implies that the correlation within any subset of the probes is the same. However, empirical autocorrelation plots from tiling arrays often exhibit evidence against this (Fig. 1). In particular, in the case of overlapping designs, a correlation structure is expected by design. When the spacing among the probes is large, the correlation diminishes as expected (the right panel of Fig. 1), and this was the case for the dataset on which ChIPOTle was developed.

We illustrate the problem with ignoring the correlation structure on a ChIP-chip dataset from an E. coli RNA Polymerase II experiment utilizing a Nimblegen isothermal array (Landick Lab, Department of Bacteriology, UW-Madison). The probe lengths vary between 45 and 71 bp, tiled at a 22 bp resolution. Approximately half of the probes are of length 45 bp. We compute the standardized moving average statistic $S_i$ (allowing $\mathrm{cov}(Y_j, Y_k) \neq 0$) and $S_i^*$ (assuming independence of the $Y_i$'s). A method of estimating $\mathrm{cov}(Y_j, Y_k)$ is described in the next section. The p-values for each $S_i$ and $S_i^*$ are obtained from the standard Gaussian distribution under the null hypothesis. We expect the quantiles of $S_i$ and $S_i^*$ for unbound probes to fall along a 45° reference line against the quantiles of the standard Gaussian distribution, whereas the quantiles for bound probes should deviate from this reference line. As evident in Fig. 2, if the correlation structure is ignored, the distribution of the $S_i^*$'s for unbound probes deviates from the standard Gaussian distribution.
Since the data are obtained from an RNA Polymerase II experiment, we expect a larger number of points, corresponding to promoters, to deviate from the reference line. An additional diagnostic tool is the histogram of the p-values. If the underlying null distributions of the standardized statistics are correctly specified, the p-values should be a mixture of a uniform distribution between 0 and 1 and a non-uniform distribution concentrated near 0. The histograms of the p-values (Fig. 2) again illustrate that the distribution assumed for $T_i$ under independence is misspecified.

2.1. Estimating the correlation structure

Although it is desirable to develop a structured statistical model that captures the correlations, developing such a model is both theoretically and computationally challenging due to the complex, heterogeneous data generated by tiling array experiments. We propose a fast empirical method that estimates the correlation structure based on the sample autocorrelation function. The covariance $\mathrm{cov}(Y_j, Y_{j+k})$ can be estimated from the sample autocorrelation $\hat{\rho}(k)$ and the sample variance $\hat{\sigma}^2$,10

$$\hat{\rho}(k) = \frac{\sum_{t=1}^{T-k} (Y_t - \bar{Y})(Y_{t+k} - \bar{Y})}{\sum_{t=1}^{T} (Y_t - \bar{Y})^2}, \qquad \widehat{\mathrm{cov}}(Y_j, Y_{j+k}) = \hat{\rho}(k)\,\hat{\sigma}^2. \qquad (4)$$

The following strategy is used in CMARRT for estimating the correlation structure. The top $M$% of outlying probes, which roughly correspond to bound probes, are excluded from the estimation of $\hat{\rho}(k)$. For the remaining probes, the sample autocorrelation at lag $k$, $\hat{\rho}_j(k)$, is computed for each segment $j$ consisting of at least $N$ consecutive probes. Genomic regions flanking a large gap or repeat masked regions are treated as two separate segments. For any lag $k$, we let $\hat{\rho}(k)$ be the average of $\hat{\rho}_j(k)$ over $j$. Here, $N$ can be considered a tuning parameter, and our initial experiments with ENCODE datasets suggest that $N = 500$ works well in practice based on the diagnostic plots discussed in Section 2.
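
The estimation strategy above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the CMARRT package. "Outlying" is taken to mean the largest `top_pct` percent of values pooled over all segments. An excluded probe is handled by dropping every lag product that involves it, a detail the text does not specify. The input is assumed to be already split at large gaps and repeat masked regions. The names `masked_acf` and `estimate_rho` are hypothetical.

```python
import numpy as np

def masked_acf(y, keep, max_lag):
    """Sample autocorrelation of equation (4) for lags 0..max_lag on one segment.
    Only retained probes (keep == True) enter the mean, numerator and denominator."""
    y = np.asarray(y, dtype=float)
    keep = np.asarray(keep, dtype=bool)
    # centre on the mean of retained probes; excluded probes contribute zero
    d = np.where(keep, y - y[keep].mean(), 0.0)
    denom = np.sum(d * d)
    n = y.size
    return np.array([np.sum(d[:n - k] * d[k:]) / denom for k in range(max_lag + 1)])

def estimate_rho(segments, max_lag, top_pct=5.0, min_len=500):
    """Average the per-segment autocorrelations over segments with >= min_len probes.

    segments : list of 1-D arrays, one per gap-free stretch of consecutive probes
    top_pct  : percentage M of the largest values treated as putative bound probes
    min_len  : minimum segment length N
    """
    segments = [np.asarray(s, dtype=float) for s in segments]
    cutoff = np.percentile(np.concatenate(segments), 100.0 - top_pct)
    acfs = [masked_acf(s, s <= cutoff, max_lag)
            for s in segments if s.size >= min_len]
    if not acfs:
        raise ValueError("no segment has at least min_len probes")
    return np.mean(acfs, axis=0)
```

The returned vector starts at lag 0, so it can be used directly as the lag-indexed autocorrelation when evaluating the covariance term of equation (2), with the covariance at lag k given by `rho[k]` times the estimated variance.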
The exclusion percentage $M$ is an anti-conservative preliminary estimate of the percentage of bound probes, which can be obtained under the assumption of independence among probes (usually 1-5%, depending on the type of ChIP-chip experiment).

3. Simulation studies

In this section, we compare CMARRT, the conventional normal approximation approach under the independence assumption (Indep) and the HMM option in TileMap in terms of sensitivity and specificity. We use scenarios in which the true bound regions are known, and control the FDR at various levels used in practice.

Simulation I: Autoregressive model. We consider the following model

$$Y_i = N_i + R_i, \qquad N_i = \sum_{k=1}^{p} \phi_{i-k} N_{i-k} + \epsilon_i, \qquad (5)$$

where $N_i$ is the autoregressive background component and $R_i$ is the real signal. We generate 100,000 $N_i$ from AR($p$) to represent the background component under the assumption of $\mathrm{cor}(N_i, N_{i+k}) = \rho^{0.4(k-1)+1}$ and randomly choose 500 peak start sites. We let the size of a peak be 10 probes, so that 5% of the probes belong to bound regions. To design scenarios similar to what we have observed in practice, we also allow for 3 outliers within a bound region. The data are simulated for various values of $p$ (the AR order), $\rho$ (governing $\mathrm{cor}(N_i, N_{i+k})$) and $\mathrm{var}(N_i)$ for the background component, and of the strength $c$ of the real signal.

Simulation II: Hidden Markov model. In this scenario, the data are simulated from hidden Markov models (HMMs)12 with an explicit state duration distribution to introduce direct dependencies at the level of the probe observations. Let the duration HMM densities be $p_{S_i}(d_i) \sim \mathrm{Geometric}(p_{S_i})$. The transition probabilities ($a_{ij}$) and the parameters $p_{S_i}$ in the duration HMM densities are chosen such that 5% of the probes belong to bound regions.
We consider the joint observation density $f_{N_i}(Y_1, Y_2, \ldots, Y_{d_1}) \sim \mathrm{MVN}(0, \Sigma_N)$ for the unbound regions and $f_{B_i}(Y_1, Y_2, \ldots, Y_{d_1}) \sim \mathrm{MVN}(\mu, \Sigma_B)$, $\mu > 0$, for the bound regions, where MVN denotes the multivariate normal distribution. The parameters $\mu$, $\Sigma_N$ and $\Sigma_B$ are chosen such that the generated data resemble observed ChIP-chip data exhibiting correlations at the observation level.

Each simulation scenario is repeated 50 times. A probe is declared bound if its adjusted p-value11 is smaller than a pre-specified FDR level when analyzing with CMARRT and Indep. For TileMap, we use the direct posterior probability approach13 to control the FDR.

3.1. Results of simulations I and II

In Fig. 3, we summarize the sensitivity at the peak level and the specificity at various FDR thresholds from Simulation I for CMARRT, Indep, and TileMap. CMARRT is able to identify most of the bound regions at an FDR of 0.05 and above, while TileMap tends to be more conservative in declaring bound regions, as shown in the sensitivity plots. Although Indep has the highest sensitivity, it also has a high proportion of false positives. The specificity of Indep is significantly lower than that of CMARRT, even when the correlation among the probes is low.

Similar results are obtained in Simulation II under the duration HMM (Fig. 4). The left panels show the sensitivity and specificity for the case of smaller peaks with an average peak size of 10 probes, while the right panels are for the case of larger peaks of size 20 probes on average. These results illustrate the superior performance of CMARRT in terms of both sensitivity and specificity even when the data are generated from a complex model. The heuristic way of estimating the correlation structure in CMARRT is able to reduce the false positives significantly (higher specificity), but not at the expense of increasing the false negatives (lower sensitivity).
On the other hand, ignoring the correlation structure results in a higher proportion of false positives. Additionally, the HMM option in TileMap is more conservative than the moving average approach when the FDR is controlled at the same level.

4. Case study: ZNF217 ChIP-chip data

We provide an illustration of CMARRT with a ZNF217 ChIP-chip dataset tiling the ENCODE regions (available from Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/)14 with accession number GSE6624). The ENCODE regions were tiled at a density of one 50-mer every 38 bp, leading to 380,000 50-mer probes on the array. We analyze two replicates of this dataset separately and compare the analyses of these single replicates. In Krig et al.,14 the bound regions were identified with the Tamalpais Peaks program,9 which requires a bound region to have at least 6 consecutive probes in the top 2% of the log base 2 ratios. This criterion tends to be too stringent: it fails to identify bound regions that contain a few outlier probes with log base 2 ratios below the top 2% threshold, and may therefore result in a higher level of false negatives. In the top right panel of Fig. 5, we show one potential peak missed by the Tamalpais Peaks program. In such cases, the sliding window approach is more powerful for finding peaks. Moreover, this method also assumes that the observations are independent. As evident in the left panel of Fig. 1, observations from nearby probes in this tiling array are correlated. As shown in Fig. 5, the histograms of p-values for the unbound probes under the independence assumption deviate from the expected distribution in both replicates. A similar problem is present in the normal quantile-quantile plots (online supp. mat.) when the correlation structure is ignored.

As in Krig et al.,14 we require the number of consecutive probes in each bound region to be at least 6. A set of peaks is obtained for each replicate at a given FDR control.
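
A minimal Python sketch of this peak-calling step is given below. It assumes that p-values for the standardized statistic are already available along one tiling path and that the FDR adjustment is the Benjamini-Hochberg procedure of reference 11. It is not the authors' code, and the function names `bh_adjust` and `call_peaks` are hypothetical.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj

def call_peaks(pvals, fdr=0.05, min_probes=6):
    """Return (start, stop) probe index pairs, stop exclusive, for every run of
    at least min_probes consecutive probes whose adjusted p-value is below fdr."""
    bound = bh_adjust(pvals) < fdr
    padded = np.concatenate(([0], bound.astype(int), [0]))
    edges = np.flatnonzero(np.diff(padded))  # alternating run starts and stops
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])
            if e - s >= min_probes]
```

For example, `call_peaks(p, fdr=0.05, min_probes=6)` returns probe index intervals. Mapping them to genomic coordinates, and splitting runs at large gaps, would need the probe annotation and is left out of this sketch.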
We assess the extent of overlap between the sets of peaks in these two replicates. The results are summarized in Table 1. All the methods identified more peaks in replicate 1 than in replicate 2. Therefore, using the peaks from replicate 1 as reference, the common peaks are defined as the percentage of overlapping peaks in replicate 2. For all FDR thresholds (except 0.01), CMARRT has the highest value of common peaks, followed by Indep and TileMap, which illustrates the consistency of the peaks identified by CMARRT.

As an independent validation, we determine the location of bound regions relative to the transcription start site (TSS) of the nearest gene using GENCODE genes from the UCSC Genome Browser as in Krig et al.14 (Table 1). For a given FDR control, the percentage of peaks located within ±2 kb, ±10 kb and ±100 kb of the TSS is the highest in CMARRT, followed by Indep and TileMap. As expected, these numbers decrease as we increase the FDR threshold for all three methods. These results illustrate the power of CMARRT in detecting biologically more plausible bound regions of ZNF217.

5. Discussion

We have investigated and illustrated the pitfalls of ignoring the correlation structure due to tiling array design in ChIP-chip data analysis. We proposed an extension of the moving average approaches in CMARRT to address this issue. CMARRT is a robust and fast algorithm that can be used with any tiling platform and any number of replicates. Both the simulation results and the case study illustrate that CMARRT is able to reduce false positives significantly but not at the expense of increasing false negatives, thereby giving a more confident set of peaks. We have recently become aware of the work of Bourgon,15 who carefully studies the correlation structure in ChIP-chip arrays and proposes a fixed order autoregressive moving average model (ARMA(1, 1)), and we are in the process of comparing CMARRT with this approach.
CMARRT is developed using the Gaussian approximation approach, and the diagnostic plots illustrated can be utilized to detect whether a given dataset violates this assumption. One possible relaxation of this assumption is a constrained permutation approach that aims to conserve the correlation structure among the probes under the null distribution. An efficient implementation of such an approach is a challenging direction for future research.

Acknowledgements

We thank Professor Robert Landick for providing the E. coli ChIP-chip data for our analysis. Supplementary materials are available at http://www.stat.wisc.edu/keles/CMARRT.sm.pdf. This research has been supported in part by a PhARMA Foundation Research Starter Grant (P.K. and S.K.) and NIH grants 1-R01-HG03747-01 (S.K.) and 4-R37GM038660-20 (H.C.).

References

1. M.J. Buck, A.B. Nobel and J.D. Lieb (2005), ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data, Genome Biol. 6(11).
2. T.H. Kim, L.O. Barrera, M. Zheng, C. Qu, M.A. Singer, T.A. Richmond, Y. Wu, R.D. Green and B. Ren (2005), A high-resolution map of active promoters in the human genome, Nature 436:876-880.
3. H. Ji and W.H. Wong (2005), TileMap: create chromosomal map of tiling array hybridizations, Bioinformatics 21(18):3629-3636.
4. W. Li, C.A. Meyer and X.S. Liu (2005), A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences, Bioinformatics 21 Suppl 1:i274-i282.
5. W.E. Johnson, W. Li, C.A. Meyer, R. Gottardo, J.S. Carroll, M. Brown and X.S. Liu (2006), MAT: Model-based Analysis of Tiling-arrays for ChIP-chip, Proc Natl Acad Sci USA 103:12457-12462.
6. S. Keles (2006), Mixture modeling for genome-wide localization of transcription factors, Biometrics 63(1):10-21.
7. S. Keles, M.J. van der Laan, S. Dudoit and S.E. Cawley (2006), Multiple Testing Methods for ChIP-Chip High Density Oligonucleotide Array Data, J. of Comp. Bio. 13(3):579-613.
8. T.E. Royce, J.S. Rozowsky and M.B. Gerstein (2007), Assessing the need for sequence-based normalization in tiling microarray experiments, Bioinformatics.
9. M. Bieda, X. Xu, M.A. Singer, R. Green and P.J. Farnham (2007), Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome, Genome.
10. G.E.P. Box and G.M. Jenkins (1976), Time series analysis: forecasting and control, Holden-Day.
11. Y. Benjamini and Y. Hochberg (1995), Controlling the false discovery rate: a practical and powerful approach to multiple testing, JRSS-B 57:289-300.
12. L.R. Rabiner (1989), A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE 77(2):257-286.
13. M.A. Newton, A. Noueiry, D. Sarkar and P. Ahlquist (2004), Detecting differential gene expression with a semiparametric hierarchical mixture method, Biostatistics 5:155-176.
14. S.R. Krig, V.X. Jin, M.C. Bieda, H. O'Geen, P. Yaswen, R. Green and P.J. Farnham (2007), Identification of genes directly regulated by the oncogene ZNF217 using ChIP-chip assays, J. Biol. Chem. 282(13):9703-9712.
15. R.W. Bourgon (2006), Chromatin immunoprecipitation and high-density tiling microarrays: a generative model, methods for analysis and methodology assessment in the absence of a "gold standard", Ph.D. Thesis, UC Berkeley.

Table 1. Distance of ZNF217-binding sites relative to TSS.
FDR    Measure                               CMARRT              Indep               TileMap
0.01   Common peaks                          0.803 (791/935)     0.819 (1423/1736)   0.718 (799/1113)
0.01   Proportion of peaks within ±2 kb      0.334               0.278               0.136
0.01   Proportion of peaks within ±10 kb     0.619               0.565               0.442
0.01   Proportion of peaks within ±100 kb    0.911               0.903               0.824
0.05   Common peaks                          0.806 (1023/1269)   0.790 (1796/2272)   0.714 (978/1370)
0.05   Proportion of peaks within ±2 kb      0.321               0.267               0.134
0.05   Proportion of peaks within ±10 kb     0.589               0.565               0.431
0.05   Proportion of peaks within ±100 kb    0.903               0.900               0.826
0.10   Common peaks                          0.805 (1209/1491)   0.779 (2096/2689)   0.703 (1071/1524)
0.10   Proportion of peaks within ±2 kb      0.300               0.265               0.135
0.10   Proportion of peaks within ±10 kb     0.579               0.561               0.428
0.10   Proportion of peaks within ±100 kb    0.904               0.894               0.821
0.15   Common peaks                          0.794 (1333/1678)   0.763 (2301/3051)   0.701 (1171/1671)
0.15   Proportion of peaks within ±2 kb      0.284               0.259               0.136
0.15   Proportion of peaks within ±10 kb     0.564               0.552               0.434
0.15   Proportion of peaks within ±100 kb    0.899               0.890               0.827

[Figure 1: three autocorrelation panels; x-axis: lag (0 to 30), y-axis: ACF (0 to 1).]

Fig. 1. Example autocorrelation plots from ChIP-chip data. The left, middle and right panels are from the data in Krig et al.,14 Landick Lab and Kim et al.2 respectively. The autocorrelation plots for Krig et al.14 and Landick Lab clearly show the presence of correlations among probes. The autocorrelation plot for Kim et al.2 shows that the correlation structure diminishes with increasing spacing between probes. The data from Krig et al.14 and Landick Lab are from tiling arrays with overlapping probes, whereas the design in Kim et al.2 has substantial spacing between probes (i.e., probe length = 50 bp and resolution = 100 bp).
[Figure 2: 2 x 2 grid of panels. Top row: normal quantile-quantile plots (x-axis: theoretical quantiles, y-axis: sample quantiles). Bottom row: histograms of p-values (x-axis: p-value, y-axis: density). Left column: under correlation structure. Right column: under independence.]

Fig. 2. Normal quantile-quantile plots (qqplot) and histograms of p-values. The left panels show the qqplot of $S_i$ and the distribution of p-values under correlation structure. The top right panel shows that if the correlation structure is ignored, the distribution of the $S_i$'s for unbound probes deviates from the standard Gaussian distribution. The bottom right panel shows that if the correlation structure is ignored, the distribution of p-values for unbound probes deviates from the uniform distribution for larger p-values.

[Figure 3: sensitivity (top figure) and specificity (bottom figure) against the FDR threshold (x-axis, 0 to 0.3) for Indep, TileMap and CMARRT. Each figure has nine panels: AR(3), AR(6) and AR(9), each with rho = 0.3, 0.5 and 0.7.]

Fig. 3. Sensitivity at peak level (top figure) and specificity (bottom figure) at various FDR control (x-axis). The background $N$ is generated from various autoregressive models with sd($N_i$) = 0.3, $Y_i = N_i + 1.5$, $p = \{3, 6, 9\}$ and $\rho = \{0.3, 0.5, 0.7\}$. Vertical lines are error bars. CMARRT is able to identify most of the bound regions at FDR of 0.05 and above. TileMap tends to be more conservative in declaring bound regions. Although Indep gives the highest sensitivity, it also has the highest proportion of false positives. The specificity for CMARRT is significantly higher than the Indep approach.
[Figure 4: 2 x 2 grid of panels showing sensitivity (top row) and specificity (bottom row) against the FDR threshold (x-axis, 0 to 0.3) for Indep, TileMap and CMARRT. Left column: peak size 10. Right column: peak size 20.]

Fig. 4. Sensitivity and specificity at various FDR control (x-axis). The left panels are the results under the duration HMM simulation with an average peak size of 10 probes. The right panels correspond to an average peak size of 20 probes. TileMap tends to be more conservative and has the lowest sensitivity and highest specificity. CMARRT is able to achieve a balance between sensitivity and specificity at each FDR threshold. Indep tends to identify many false positives.

[Figure 5: top row: histograms of p-values under correlation structure for replicate 1 and replicate 2 (x-axis: p-value, y-axis: density), and a panel titled "Example of peak missed by Tamalpais Peaks" (x-axis: probe, 4100 to 4400; y-axis: log2 ratio). Bottom row: histograms of p-values under independence for replicate 1 and replicate 2.]

Fig. 5. Histograms of p-values for replicates 1 and 2 and an example of a peak missed by the Tamalpais Peaks program. The distribution of p-values for probes in unbound regions deviates from the uniform distribution when the correlation structure is not taken into account (bottom panels). The dotted line in the top right panel is the 98th percentile of the log base 2 ratios. Tamalpais Peaks requires a peak to have at least 6 probes in a row in the top 2%.