Normalization Regarding Non-Random Missing Values in High-Throughput Mass Spectrometry Data Pei Wang, Hua Tang, Heidi Zhang, Jeffrey Whiteaker, Amanda G. Paulovich, and Martin Mcintosh Pacific Symposium on Biocomputing 11:315-326(2006) September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang NORMALIZATION REGARDING NON-RANDOM MISSING VALUES IN HIGH-THROUGHPUT MASS SPECTROMETRY DATA PEI WANG§ , HUA TANG, HEIDI ZHANG, JEFFREY WHITEAKER, AMANDA G PAULOVICH, MARTIN MCINTOSH Fred Hutchinson Cancer Research Center, Seattle, WA, 98109 § E-mail: pwang@fhcrc.org We propose a two-step normalization pro cedure for high-throughput mass spectrometry (MS) data, which is a necessary step in biomarker clustering or classification. First, a global normalization step is used to remove sources of systematic variation between MS profiles due to, for instance, varying amounts of sample degradation over time. A probability mo del is then used to investigate the intensity-dependent missing events and provides p ossible substitutions for the missing values. We illustrate the p erformance of the method with a LC-MS data set of synthetic protein mixtures. 1. Intro duction High-throughput mass spectrometry (MS) technology offers a powerful means of analyzing biological samples. The ability of MS to identify and precisely quantify thousands of proteins from complex samples is expected to broadly affect biology and medicine3 . However, MS systems are subject to considerable noise and variability that is not fully characterized or accounted for. Thus, it is important and necessary to properly conduct data-preprocessing steps such as signal filtering, peak detection, alignment in time (and mass charge ratio), and amplitude normalization before reliable conclusions can be made from the data1 . In this paper, we focus on the normalization step, and propose a probability model for intensity-dependent missing events in MS-based data sets. In MS experiments, the instrument may have trouble detecting the weak signals of low-abundance peptides. Even if the instrument detects the signal, the peak intensities may be too low to be distinguished from background noise during data processing. Therefore, the lower the ion abundance, the 1 September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 2 more likely the peptide will be "missing" in the MS output data. Ignoring such non-random missing pattern may introduce significant bias into subsequent analyses. In this paper, we propose a novel probability model to describe the missing behavior, which accounts for this type of intensitydependent missing events. The rest of the paper is organized as follows: Section 2 provides a brief description of a data example illustrating the problem. Section 3 introduces a global normalization step, which adjusts systematic trends. The missing model, which represents our ma jor contribution, is described in Section 4. Section 5 applies the proposed methods to an example data set and Section 6 is the conclusion. 2. Exp eriment and Data In this section, we describe an experiment, in which replicates of two protein mixtures were analyzed on three consecutive days. We find that the samples processed in later days experienced higher levels of protein degradation due to, for instance, longer storage time as well as more freeze-thaw cycles2 . Such variations are often unavoidable in real disease studies involving human samples. 2.1. Sample preparation Two mixtures of proteins were assembled as part of an exploratory study to understand the performance of our MS instrument. One mixture (denoted as A) consisted of four proteins: bovine albumin, bovine transferrin, bovine alpha lactalbumin and bovine catalase. The other mixture (denoted as B) consisted of the same four proteins plus bovine beta lactoglobulin (proteins were selected based on their length and abilities to produce tryptic peptides). All five recombinant proteins were purified with reversed-phase high performance liquid chromatography (VisionWorkstation Applied Biosystems, Framingham, MA, USA). The collected protein fractions were dried in SpeedVac (Thermo Savant, San Jose, CA, USA). The purified proteins were denatured individually with 60% MeOH, reduced with 10 mM DTT at 60o C for 1 hr, and alkylated with 50 mM iodoacetamide in the dark at room temperature for 30 min. The polypeptides were trypsinized for 6 hr at 37o C with a protein/enzyme of 50/1. September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 3 2.2. LC-MS system The LC-MS system comprised an 1100 Series Nanoflow LC system (Agilent Technologies, Palo Alto, CA, USA), a binary capillary pump, a C18 Symmetry NanoEase trapping column (Waters Corporation, Milford, MA, USA), a C18 PepMap nano LC column(LC Packings, Sunnyvale, CA, USA), and an LCT Premier time-of-flight mass spectrometer (Waters Corporation). The flow rates are 20 uL/min in the trapping column, and 400 nL/min in the LC column. The solvents were A(0.1% formic acid in water) and B (0.1% formic acid in acetonitrile). Linear gradient elution was applied from 0 to 40% B in 30 min. Mass spectra were acquired every 1.0 s with a 0.1 s interscan delay time. The instrument was mass-calibrated with a sodium formate solution prior to analysis. 2.3. Data and problem The raw data is first processed using a program developed in our group,msInspect.a , which includes modules for detecting and aligning peptide features. The output peptide array reports the intensities of all peptide features in each sample (an LC-MS experiment). Denote the intensity of k the ith feature in the k th sample as yi . If the ith feature is detected in the k k th sample, then yi is set to 0. The total number of non-zero intensity peptides in each sample is summarized in Table 1. Clearly, more features were detected in experiments Table 1. Numb er of p eptide features in each sample. Mixture A consists of four proteins, while mixture B consists of five proteins. Day 1 Day 2 Day 3 Sample Index Feature Number Sample Index Feature Number Sample Index Feature Number A1 660 B5 609 A11 237 B2 648 B6 339 B12 302 B3 789 A7 386 B 13 406 A4 49 5 B8 49 2 A14 17 8 A9 3 84 A10 413 Note : In the sample indexes, A=4 protein mixtures, B=five protein mixtures, and the number indicators the experimental order. performed on day 1 than those performed on day 3. If we further compare Sample A1 (with 660 features) and Sample A14 a Available at http://proteomics.fhcrc.org/CPL/home.html September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 4 (with 178 features), as illustrated in Figure 1, we see there is an overall decrease in intensity in sample A14 compared to sample A1. Figure 1. Compare Sample A1 and Sample A14. The two plots compare the intensities of all features in the two samples. The x co ordinate is the mass charge ratio (mz), and the y coordinate is the intensity value. Each feature is represented by a vertical line at its mz p osition, with the length of the line equal to its intensity. Given this kind of variation, it is crucial to normalize intensities before different samples can be properly compared. 3. Global Normalization By globally normalizing signal intensities across multiple samples, we aim to identify and remove systematic variation arising because of differential amounts of sample loaded into the LC-MS system, protein degradation over time, or variation in the sensitivity of the instrument detector. It is natural to assume that the sample intensities are all related by a constant factor5 . A common choice for this re-scaling coefficient is the sample mean or median. This choice is based on the assumption that the number of features whose measurements change is few compared to the total number of features. So the distribution of the measurements of all the features should be roughly the same across different experimental runs4 . However, in MS experiments, because of the limitation of detector sensitivity and the unavoidable instrument noise, ions below a certain intensity level may hardly be detected, which leads to non-random missing of peptide features in the result. Thus, it is not appropriate to use overall mean September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 5 Figure 2. The effect of non-random missing. Supp ose the overall p eptide abundance of sample 1 are twice as great as the overall p eptide abundance of sample 2. The histograms shows the true intensity distribution of all p eptides in sample 1 and sample 2 resp ectively. The minimal detection level of the instrument is represented by the vertical line (features on the left side of the line can not b e observed in the exp eriment). Using the mean or median intensity of the observed features in each sample leads to biased estimate of the scaling co efficients. or median for re-scaling. This is illustrated in Figure 2. In order to avoid the possible bias due to non-random missing events, we propose to use the top L ordered statistics of feature intensities in each sample, where L is a parameter chosen by users. For the simple case of two samples, denote the intensity measurements of one sample as X = (x1 , x2 , ..., xn ) and of the other sample as Y = (y1 , y2 , ..., ym ), whose order statistic can be represented as x(1) > x(2) > ... > x(n) and y(1) > y(2) > ... > y(m) respectively. Then, for a chosen number L(L < min(n, m)), the scaling coefficient of X versus Y can be L L estimated as = i=1 x(i) / j =1 y(j ) or more robustly, = median(x(1) , ..., x(L) )/median(y(1) , ..., y(L) ). (1) For the case of K (K > 2) samples, denote the intensity measurements of the k th sample as X k = (xk , xk , ..., xk k ). For a given number L(L < n 2 1 min({nk }K 1 )), define the population median as k= 1k µ0 = median(xk1) , xk2) , ..., xkL) ). ( ( ( K (2) Then the scaling coefficient for the k th sample is 1 k = median(xk1) , xk2) , ..., xkL) ) ( ( ( µ0 September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 6 4. Mo del of Missing Events We can make inferences on the missing events of one sample based on the information from other samples. The idea is illustrated in Figure 3. Suppose Sample 1 and Sample 2 are identical mixtures, but due to experimental factors, the overall peptide abundance of Sample 1 is smaller than the overall peptide abundance of Sample 2. Peptide 1 cannot be observed in Sample 1 because its intensity falls below the minimum detectable level. However, based on the intensities of those peptides observed in both samples (e.g. Peptide 2), the scale difference of the overall abundances between Sample 1 and Sample 2 can be estimated. Therefore, the "missing" intensity of Peptide 1 in Sample 1 can be reasonably approximated with the intensity measured in Sample 2 divided by a scale coefficient. Figure 3. Missing mo del. The heights of the vertical bars indicate the true intensities of different peptides in the two samples. The dashed vertical lines represent Peptide 1, while the solid vertical lines represent Peptide 2. The horizontal dashed line indicates the minimal detection level of the instrument. More general, we use a probability model to describe such missing events, which is described in below. 4.1. Probability Model In one sample, we introduce a latent variable zi for the ith peptide, which indicates whether this peptide exists in the sample or not: 1 , if ith peptide exists in the sample; zi = (3) 0, if ith peptide does not exist in the sample. Given zi = 1 (the ith peptide exists in the sample), the abundance of this peptide xi can be deemed as a random variable: = 0, if zi = 0, (4) xi fi , if zi = 1, September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 7 where fi is the density function of some probability distribution. It is reasonable to assume that zi and xi |zi = 1 are independent with each other. Suppose the minimum detectable level of the instrument is d. Then, given the value (xi , zi , d), the observed abundance yi of this peptide satisfies 0, if zi = 0; (5) yi |(xi , zi , d) = 0, if zi = 1 and xi < d; xi , if zi = 1 and xi d. We say that a missing event happens to the ith peptide if the ith peptide exists in the sample but no signal has been detected (denoted as Mi = {zi = 1, yi = 0}). We are interested in the probability of missing event when no signal is observed, i.e. P (Mi |yi = 0), which can be calculated as follows: Pd ((zi = 1, yi = 0)|yi = 0) = Pd (zi =1,yi =0) Pd (yi =0) Pd (yi =0|zi =1)P (zi =1) Pd (yi =0|zi =1)P (zi =1)+Pd (yi =0|zi =0)P (zi =0) Pd (xi d|zi = 1) > 0, we have P (zi = 1) = Therefore, given the detectable level parameter d, the distribution function fi , and the observed abundance yi , we can estimate the probability P (Mi |yi = 0) with Equation (6) and (7). Moreover, a natural choice for imputing the intensity of a missing peak is E (xi |yi = 0), which can be calculate as E (xi |yi = 0) = E (xi |yi = 0, zi = 1)P (zi = 1|yi = 0) +E (xi |yi = 0, zi = 0)P (zi = 0|yi = 0) = E (xi |xi < d, zi = 1)P (zi = 1|yi = 0) + 0 = E (xi |xi < d, zi = 1)P (Mi |yi = 0). (8) Pd (yi > 0) Pd (zi = 1, xi > d) = . Pd (xi > d|zi = 1) Pd (xi > d|zi = 1) (7) September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 8 Note E (xi |xi < d, zi = 1) only depends on the detector level parameter d and the distribution function fi . 4.2. Model Fitting 4.2.1. Detectable level d A reasonable estimate of the parameter, d, is the background noise level in each MS profiles, since those peaks with height below this value can not be confidently distinguished from noise signals. For a set of profiles from the same instrument, we assume that the same detectable level. Hence, we estimate d using all raw profiles. After the global normalization described ^ d in section (3), the detectable level of the k th profile becomes dk = k , where k is the normalization scale coefficient in Equation (2). 4.2.2. Abundance distribution fx A. When Biological Replicates Available xk k For K replicates of the same biology samples, we assume that { i |zi = k 2 1}K 1 are independently identically distributed as N (µi , i ) for some pak= rameter µi and i , where xk is the true abundance of the ith peptide in the i k th profile. k k Since xk |zi = 1 and zi are independent from each other, it is easy to see i k k k k that yi |(yi > 0) and xi |(xk > d, zi = 1) are equal in distribution. Thus, i k yi k |(yi k > 0) fik (t) = = 2 where µi ,i is the density function of N (µi ,,i ). For the simple case where i dk - µi . d k d|zi = 1) with I k < µi It follows P (xk /k dt,xk >d,zi =1) i i P (xk >d,zi =1) i µi ,i (t) dk . P (xi >d|zi =1) , for t > (9) we can approximate P (xk > i Together with Eq.(7), we have Thus, the mean intensity of the ith peptide can be estimated as the average of the observed signals: kk k y i / . (11) µi = k k I (yi > 0) P (zi = 1) = k k k I (yi > 0) . I (µi > dk ) ^ fk dk < µi . µi ,i , when i (10) (12) September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 9 Therefore k P (Mik |yi = 0) = And then, with Eq.(8), Eq.(11) and Eq.(12). E (xk |y k = 0) = ii ^ ^ µi P (zi = 1), if µi < dk , ^ 0, if µi > dk . ^ ^ P (zi = 1), if µi < dk , ^ 0, if µi > dk . ^ (13) (14) B. When Biological Replicates Not Available Because the biological samples are limited, a large number of MS replicates are not always available for each sample. In such cases, a natural solution is to use the nearest K "neighbor samples" as pseudo replicates to fit the missing model. Here nearest K "neighbors" refers to the K closest profiles to the target profile under certain distance metrics (i.e. L2 norm). However, if the missing rate is relatively high, the distance measured with the raw data could be misleading. Thus, we propose the following iteration procedure to try to recover the true "neighborhood" structure: (1) Begin with K=N, where N is the total number of samples. Denote the original peptide array data matrix as P ep0 . (2) (a) Based on P epN -K , calculate the distance between each two samples. (b) For each sample, estimate the missing features by using its nearest K neighbors. Denote the new peptide arrays as P epN -K +1 . (c) K=K-1. (3) Repeat step 2 until K = K0 , where K0 is a pre-selected number. If we aim to separate the samples into two clusters, a possible choice for K0 is N/2. 5. Result 5.1. Global normalization The scale coefficients of global normalization are estimated with the top 80 order statistics of each sample according to Eq.(2). Fig.4 shows the relationships between the top 80 order statistics of Samples 11 - 14 (the four samples on the third day) and the top 80 order statistics of Sample 1. Table3 shows the scale coefficients for the four pairs of samples in Fig.4. Compared to the estimators derived with the order statistics, the estimators September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 10 Figure 4. The top 80 order statistics of Sample 11 - 14 v.s. Sample 1. The y and x co ordinates represent the log intensities of the 80 most abundant features in the corresp onding samples. The go o d linear relationship with slop e= 1 justifies the assumption that the sample intensities are related by a constant factor (model x = y is equivalent to model log (x) = log (y ) + b, where b and are parameters). derived with overall medians dramatically overestimate the scale change between these sample pairs. This demonstrates the necessariness of using the top order statistics to conduct the global normalization when nonrandom missing is a concern in the study. Table 2. Scale Coefficients v.s. Sample 1. 11 0.43 1.19 12 0.69 1.10 13 0. 95 0.99 14 0.32 1.09 Sample Index (based on the order statistics) (based on overall median) 5.2. Study of Missing Events We consider 12 of the 14 samples whose non-zero features are at least 10% of the total. 5.2.1. Supervised analysis Treating all 4-protein samples as replicas and all 5-protein samples as replicas, using Equation (13) we can estimate the total number of possibly missiP k ing features I ( (Mik |yi = 0) > 0) for each sample. The result is shown in Table 3. Again, we can see that the missing trend is more severe in some samples than in others. Ignoring such trend may bring unexpected bias into downstream analysis. September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 11 Table 3. Sample Missing A1 8 B2 0 B3 0 Numb er of p eptide features in each sample. A4 8 B5 0 B6 187 A7 118 B8 63 A9 116 A10 69 B12 116 B13 17 5.2.2. Unsupervised analysis The goal here is to use the MS profiles to recover the 4-protein and 5protein group labels for each sample. First, based on the data after global normalization, we perform hierarchical clustering analysis using the R b function hclust with complete linkage. The dendrogram is illustrated in the top plot of Figure 5. The two main sub-clusters are separated according to when the MS experiments were conducted (the first four samples were processed on day 1 while the others on the day 2 and 3). Next, we use the iterative procedure described in section 4.2.2 to substitute the possible missing measurements with their expected values, and perform the hierarchical clustering on the resulting data. The new dendrogram is illustrated in the bottom plot of Figure 5, in which the 4-protein samples and the 5-protein samples are correctly clustered into two groups. This suggests that properly modelling the missing events would prevent the analysis from being driven by experimental variation rather than biological variation. Figure 5. Tree Structures of Unsup ervised Hierarchical Clustering. Each leaf in the tree represents one sample. A = 4 protein mixture; B = 5 protein mixture. bR is a free statistics software, which can be downloaded at: http://www.r-project.org/ September 23, 2005 12:14 Pro ceedings Trim Size: 9in x 6in wang 12 6. Conclusion In this paper, we have shown that ignoring the intensity-dependent missing events in MS experiments may result in severe biases in the data analysis. To address this problem, we developed a probability model for the missing events and implemented a few normalization schemes to remove the negative effects. The missing rate estimates can also be used as a quality control of the data. In the probability model, given that one peptide exists in the sample, a normal density is used to approximate the distribution of the intensity of this peptide. This approximation is supported by the synthetic data example: the Kolmogorov-Smirnov distance between N (0, 1) and the observed distribution of intensities (centered to mean = 0 and scaled to sd = 1) is 0.0392, which corresponds to a p-value of 0.1742. When we estimate the missing values with nearest-neighbor scheme, the iteration number need to be carefully controlled to avoid problem of over-fitting. Acknowledgments We would like to thank three referees and Andrea E Detter for the comments that improved this manuscript. This work was funded by National Cancer Institute contract 23XS144A. HT was partially supported by NIHCA86368. HZ, JW and AP were partially supported by philanthropy from Listwin Foundation/Canary Fund, Paul G. Allen Family Foundation and Keck Foundation. References 1. J. Listgarten and A. Emili, Molecular and Cel lular Proteomics 4.4, 2005. 2. B.L. Mitchell, Y. Yasui, C.I.Li, A.L.Fitzpatrick and P.D.lampe, Cancer Informatics 1(1) 25-31, 2005. 3. M. Man and R.Aebersold, Nature 422, 2003. 4. J. Quackenbush, Nat. Genet. 32, 2002. 5. A. Sauve and T. Speed, Proceedings Gensips, 2005. 6. M. Wagner, D. Naik and A. Pothem, Proteomics 3, 1692-1698, 2003. 7. K.A. Baggerly, J.S. Morris, J. Wang, D. Gold, L.C.Xiao and K.R. Coombes, Proteomics 3, 1667-1672, 2003. 8. M. Anderle, S. Roy, H. Lin, C. Becker and K. John, Bioinformatics 20, 35753582, 2004. 9. R. Tibshirani, T. Hastie, and et.al. Bioinformatics 20, 3034-3044, 2004. 10. W. Wang, H. Zhou,and et.al. Anal. Chem. 75, 4818-4826, 2003.