Threshold Gradient Descent Method for Censored Data Regression with Applications in Pharmacogenomics J. Gui and H. Li Pacific Symposium on Biocomputing 10:272-283(2005) September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui THRESHOLD GRADIENT DESCENT METHOD FOR CENSORED DATA REGRESSION WITH APPLICATIONS IN PHARMACOGENOMICS J. GUI AND H. LI Department of Statistics and Rowe Program in Genetics University of California, Davis, CA 95616, USA E-mail: jgui@ucdavis.edu, hli@ucdavis.edu An important area of research in pharmacogenomics is to relate high-dimensional genetic or genomic data to various clinical phenotypes of patients. Due to large variability in time to certain clinical event among patients, studying possibly censored survival phenotypes can be more informative than treating the phenotypes as categorical variables. In this paper, we develop a threshold gradient descent (TGD) method for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for the risk of a future patient. The computational difficulty associated with the estimation in the high-dimensional and low-sample size settings can be efficiently solved by the gradient descent iterations. Results from application to real data set on predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed method can be used for identifying imp ortant genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. The TGD based Cox regression gives better predictive performance than the L2 penalized regression and can select more relevant genes than the L1 penalized regression. 1. Intro duction With the sequencing of the human genome near completion and with the development of high-throughput technologies, we are now able to obtain information about an individual's entire genome or the entire genomic profiles of a tumor. Very high-dimensional genetic and genomic data are being generated in pharmaceutical industries and in biomedical and clinical research. Examples of high-throughput data include whole genome wide SNP data, microarray-based gene expression data and proteomic data. Traditional environmental risk factors such as diet, age and lifestyle can influ This work is supp orted by NIH grantES09911 September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui ence a person's response to treatments, it is believed that understanding an individual's genetic makeup or individual tumor's genomic profiles would provide the key for explaining such variation and for creating personalized drugs with greater efficacy and safety. For example, with the DNA microarray technology, one can simultaneously measure expression profiles for thousands of genes in cancer tissues, which offers the possibility of a powerful, genome-wide approach to the genetic basis of different types of tumors. Recent studies 1,2 have demonstrated great succuss in predicting cancer class using the gene expression data. Different classes of cancer may correspond to different clinical outcomes of a given treatment. In addition, studies also demonstrated that additional predictive power can be obtained by incorporating genomic information in addition to the traditional predictive factors such as tumor grades, sizes and stages 2 . In pharmacogenomics, an important area of research is to relate the high-dimensional genetic, genomic or proteomic data to various phenotypes such as continuous drug response levels, response to treatment which can be categorical or censored clinical outcomes such as time to cancer recurrence or death after treatment. Due to large variability in time to cancer recurrence among cancer patients, studying possibly censored survival phenotypes can be more informative than treating the phenotypes as binary or categorical variables. Since the follow up time is limited, some patients' exact survival time can't be measured. For those patients, we only have their right censored survival time. The emphasis of this paper is to develop methods for predicting patient's time to clinical event using high-dimensional genetic or genomic data. The most popular method in regression analysis for censored survival data is the Cox regression model 3 . However, due to the very high dimensional space of the predictors, e.g., the genes with expression levels measured by microarray experiments, the standard maximum Cox partial likelihood method cannot be applied directly to obtain parameter estimates. There are mainly two solutions to this problem. One approach is based on dimension deduction such as singular value decomposition (SVD) or the partial Cox egression (PCR) 4 . The other approach is to use the penalized partial likelihood. This includes L2 and L1 penalization. Li and Luan 5 was the first to investigate the L2 penalized estimation of the Cox model in the high-dimensional low-sample size settings and applied their method to relate the gene expression profile to survival data. As pointed out by Li and Luan 5 , one limitation of L2 penalization is that it uses all the genes in the prediction and does not provide a way of selecting relevant genes. September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui An alternative is to use the L1 penalized estimation, which was proposed by Tibshirani 6 and was called the least absolute shrinkage and selection operator (Lasso). Using newly developed least angle regression (LARS) by Efron et al. 7 , Gui and Li 8 proposed an efficient way to estimate L1 penalized Cox regression model, which was called the LARS-Lasso procedure. One limitation of the LARS-Lasso procedure is that the number of genes selected cannot be greater than the sample size. In addition, if there is a group of variables among which the pair-wise correlations are very high, the LARS-Lasso procedure tends to select only one variable from the group. This could be a potential limitation when the goal is to select all important genetic or genomic features which are related to the risk of a clinical event. Friedman and Popescu 9 have recently proposed a stepwise optimization method called threshold gradient descent (TGD) and have demonstrated it application in linear regression and classification problems. They showed that with different threshold value, TGD can approximate the estimates of partial least square, ridge regression, Lasso and LARS. The TGD based methods provide a data-driven approach for selecting the penalty function. Friedman and Popescu 9 further demonstrated that with small threshold, the TGD method approximates the PLS or ridge estimates and has better predictive power for data simulated with small variability in true coefficients. On the other hand, the TGD with large threshold value can approximate the Lasso or LARS estimates, which provide better predictive performance when there is high variability of the coefficients. In this paper, we extend the TGD method to censored survival data and to the Cox regression model. We demonstrate the method by analyzing a real data set of diffuse large B-cell lymphoma (DLBCL) survival times and gene expression levels 2 . Finally, we give a brief discussion of the methods and conclusions. 2. Statistical Mo dels and Metho ds 2.1. Cox proportional hazards model Suppose that we have a sample size of n from which to estimate the relationship between the survival time and the genetic/genomic profiles such as the gene expression levels X1 , · · · , Xp of p genes. Due to censoring, for i = 1, · · · , n, the ith datum in the sample is denoted by (ti , i , xi1 , xi2 , · · · , xip ), where i is the censoring indicator and ti is the suri vival time if i = 1 or censoring time if i = 0, and xi = {xi1 , xi2 , · · · , xip } s the vector of the genetic/genomic profiles of p genes for the ith sample. Our aim is to build the following Cox regression model for the hazard of September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui cancer recurrence or death at time t (t) = 0 (t) exp(1 X1 + 2 X2 + · · · + p Xp ) = 0 (t) exp( X ), (1) where 0 (t) is an unspecified baseline hazard function, = {1 , · · · , p } is the vector of the regression coefficients, and X = {X1 , · · · , Xp } is the vector of genetic/genomic profiles with the corresponding sample values of X xi = {xi1 , · · · , xip } for the ith sample. We define f (X ) = to be the linear risk score function. Based on the available sample data, the Cox's partial likelihood 3 can be written as x k exp( k ) j , P L( ) = x Rk exp( j ) D where D is the set of indices of the events (e.g., deaths) and Rk denotes the set of indices of the individuals at risk at time tk - 0. Our goal is to find the coefficient vector which minimizes the negative log partial likelihood function. However, in the settings when p >> n, there is no unique solution to this optimization problem. In addition, one expects the high variability of the estimates based on different random samples drawn from the population distribution. A common remedy is to regularize this optimization problem by addition a penalty P ( ) to the negative partial likelihood, i.e., ^ () = argmin {- log P L( ) + P ( )}. (2) Here the negative log partial likelihood is treated as a loss function l( ) which we want to minimize. The most popular penalties include the L2 2 | penalty where P ( ) = j |. j and the L1 p enalty where P ( ) = 2.2. The threshold gradient descent algorithm As observed in Friedman and Popescu 9 , for a given penalty function P ( ), ^ the procedure represented by (2) produces a family of estimates, (), each is indexed by a particular value of the tuning parameter . This family lies on a one-dimensional path of finite length in the p dimensional space of all joint parameter values . Model selection procedure such as cross-validation (see next section) can be use for selecting a point (i.e., a parameter) on that path. Different penalty P ( ) therefore corresponds to different path. To account for different possible true values of , Friedman and Popescu 9 September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui further suggested a gradient directed path finding algorithm for estimating . Specifically, let l( ) = - log P L( ), = X , and define k dk j µi = - l/ i = i - exp(i ) , Rk exp(j ) Ci where Ci = {k : i Rk } denotes the risk sets containing individual i and dk is the number of events at time tk . Then µ = (µ1 , · · · , µn ) is the negative gradient of the loss function with respective to {1 , · · · , n }. The negative gradient with respective to is therefore g = - l/ = X µ. Starting from ^ = 0, the gradient directed paths can be updated as ^ ^ ( + ) = ( ) + h( ), where > 0 is an infinitesimal increment and h( ) is the direction in ^ the parameter space tangent to the path evaluated at ( ). This tangent vector at each step represents a descent direction. In order to direct the path towards parameter points with diverse values, Friedman and Popescu 9 suggested to define h( ) as h( ) = {hj ( )}p = {fj ( ).gj ( )}p , 1 1 where fj ( ) = I [|gj ( )| · max1kp |gk ( )|], where I [.] is an indicator function, and 0 1 is a threshold parameter that regulates the diversity of the values of fj ( ); larger values of lead to ^ more diversity 9 . g ( ) is the negative gradient evaluated at = ( ). For any threshold value 0 1 , the threshold gradient decent path finding algorithm for the Cox model involves the following five steps, (1) (2) (3) (4) (5) Set (0) = 0, = 0. Calculate , µ, g ( ) = - l/ for the current . Calculate fj ( ) = I [|gj ( )| · max1kp |gk ( )|] Update ( + ) = ( ) + · g ( ) · f ( ), = + . Repeat 2-4. Cross validation (see next section) is then employed to determine a point on the path ( ) and to terminate the iterations. 2.3. Model selection through the cross validated partial likelihood For a given gradient threshold , to determine the value of the tuning parameter in the final model, one can choose which minimizes the September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui cross-validated partial likelihood (CVPL), which is defined as C V P L( ) = - n 1i ^ ^ [l(f (-i) ( ) - l(-i) (f (-i) ( )], n =1 ^ where f (-i) ( ) is the estimate of the score function based on the threshold gradient descent with tuning parameter from the data without the ith sub ject. The terms l(f ) and l(-i) (f ) are the log partial likelihoods with all the sub jects and without the ith sub ject, respectively. The optimal value of is chosen to maximize the sum of the contributions of each sub ject to the log partial likelihood. When the threshold is unknown, we can perform a two-dimensional parameter search using CVPL for and simultaneously. 3. Application to prediction of survival time of patients with DLBCL To demonstrate the utility of the TGD based Cox regression in relating genomic data to censored survival phenotypes, we re-analyzed a recently published data set of DLBCL by Rosenwald et al. 2 . This data set includes a total of 240 patients with DLBCL, including 138 patient deaths during the followups with median death time of 2.8 years. Rosenwald et al. divided the 240 patients into a training set of 160 patients and a validation set or test set of 80 patients and built a multivariate Cox model. The variables in the Cox model included the average gene expression levels of smaller sets of genes in four different gene expression signatures together with the gene expression level of BMP6. It should be noted that in order to select the gene expression signatures, they performed a hierarchical clustering analysis for genes across all the samples (including both test and training samples). In order to compare our results with those in Rosenwald et al., we used the same training and test data sets in our analysis. In this data set, the gene expression measurements of 7,399 genes (not unique since many genes were spotted multiple times on the arrays) are available for analysis. Table 1. Number of features selected for different threshold value 0.0 7399 0.2 1171 0.4 464 0.6 140 0.8 43 1.0 4 Threshold value ( ) Number of non-zero coefficients 3.1. Effects of the threshold value Table 1 shows for several threshold values, the number of coefficients estimated to be non zero for the training set of 160 patients. For each September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui threshold value , we obtain the optimal predictive point along the path using a 10-fold CVPL. As expected, larger values of give rise to fewer non-zero coefficient values, with only 4 out of 7399 features having any influence on survival for = 1.0. Figure 1 shows the coefficient values obtained for the DLBCL training data using threshold gradient descent with threshold values = 0, 0.4, 0.6 and 1, sorted by the gene order in the original data set of Rosenwald et al. All solutions produce relatively large absolute coefficient values in similar regions, with larger values of selecting fewer non-zero values within each one. In addition, we clearly observed that smaller resulted in smaller absolute coefficients of all the genes, and larger resulted in very different estimates of the coefficients among all the genes. 3.2. Predictive performance In order to assess how well the model predicts the outcome, we employ the idea of time dependent receiver-operator characteristics (ROC) curve for censored data and area under the curve (AUC) as our criteria. These methods were recently developed by Heagerty et al. 10 in the context of the medical diagnosis and were proposed as criteria for censored data regression with microarray gene expression data 5,4 . Note that larger AUC at time t based on a score function f (X ) indicates better predictability of time to event at time t as measured by sensitivity and specificity evaluated at time t. Figure 2 (a) shows the areas under the ROC curves for different threshold values based on 10-fold cross-validation for the training data set. This plot suggested that the model with = 0.4, 0.6 and 0.8 gave the best predictive results as measured by the areas under the curves. Since the model with = 0.8 includes fewer genes in the model, one should choose this model for future prediction. Figure 2 (b) shows the areas under the ROC curves based on the predicted scores for the patients in the testing data set. This plot indicates that = 0.8 and = 1.0 gave almost the same predictive performance, which implies that the DLBCL data set encourages variability among the coefficients of different genes. The AUCs are between 0.66 and 0.68 in the first 10 years of followups, indicating a reasonable predictive performance. In contrast, smaller values of , which correspond approximately to L2 penalized estimation, gave much lower values of AUCs, indicating worse predictive performance. To further examine whether clinically relevant groups can be identified by the model, we divided the patients in the test data into two groups based September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui Threshold = 0.0 Threshold = 0.4 0.002 coefficients coefficients 0 2000 4000 Gene index 6000 -0.002 -0.006 -0.04 -0.02 0 0.00 0.02 2000 4000 Gene index 6000 Threshold = 0.6 Threshold = 1.0 coefficients 0.00 coefficients 0 2000 4000 Gene index 6000 -0.04 -0.08 -0.10 0 -0.06 -0.02 2000 4000 Gene index 6000 Figure 1. Coefficient values obtained for the DLBCL data using threshold gradient descent with threshold values = 0, 0.4, 0.6 and 1, sorted by the gene order in the original data set of Rosenwald et al. All solutions produce relatively large absolute coefficient values in similar regions, with larger values of selecting fewer non-zero values within each one. on their estimated risk scores in the Cox model (1) using the mean score as a cutoff value. Figure 3 shows the Kaplan-Meier curves for the two groups of patients, showing very significant difference (p-value=0.0004) in overall survival between the high risk group (36 patients) and low risk group (44 patients). Similar analysis was done using L2 penalization, less significant difference was observed (p-value=0.003). 3.3. Genes identified As shown previously, cross-validation analysis for the training data set suggested that threshold value = 0.8 should result in better predictive perfor- X September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui Table 2. Gene ID AA286871 AI361763 AA213564 AA443696 AA714637 AA714637 AA837360 AA760674 AI087048 W72411 H92332 AA411017 AA159668 AA411017 AA729055 AA032179 AA714513 AA729003 R97095 AA480985 AA805575 AA278822 AA485725 AA487453 AA598653 LC-29222 AA495985 H98765 AA579913 R62612 X14420 W87899 AI370252 AA147638 AA147638 N29376 AA833786 W63749 AA495936 H44867 AA292532 AA804793 AA243583 Genes that were identified to be related to the risk of death when = 0.8. Group Description tumor necrosis factor receptor superfamily, member 17 UDP-Gal:betaGlcNAc beta 1,4- galactosyltransferase, polypeptide 2 ribosomal protein S21 ribosomal protein S21 ribosomal protein S12 ribosomal protein S12 proapoptotic caspase adaptor protein COX15 homolog, cytochrome c oxidase assembly protein (yeast) interferon regulatory factor 4 tumor protein p63 ma jor histo compatibility complex, class II, DQ alpha 1 ma jor histo compatibility complex, class II, DQ alpha 1 ma jor histo compatibility complex, class II, DQ alpha 1 ma jor histo compatibility complex, class II, DQ alpha 1 ma jor histo compatibility complex, class II, DR alpha ma jor histo compatibility complex, class II, DR beta 5 ma jor histo compatibility complex, class II, DR beta 5 T-cell leukemia/lymphoma 1A T-cell leukemia/lymphoma 1A Weakly similar to germinal center expressed transcript Weakly similar to A47224 thyroxine-binding globulin precursor Fc receptor-like protein 1 immunoglobulin kappa constant GRO2 oncogene osteoblast specific factor 2 (fasciclin I-like) small inducible cytokine subfamily A (Cys-Cys), member 18 cytochrome P450, subfamily XXVIIA, polypeptide 1 leukocyte immunoglobulin-like receptor, subfamily B, member 1 fibronectin 1 collagen, type II I, alpha 1 aryl hydrocarbon receptor T cell receptor beta locus T cell receptor beta locus T cell receptor beta locus myeloid cell nuclear differentiation antigen hemoglobin, alpha 2 B-cell CLL/lymphoma 2 microsomal glutathione S-transferase mal, T-cell differentiation protein regulator of G-protein signalling 16 ESTs KIAA0084 protein PS PS PS PS MHC MHC MHC MHC MHC MHC MHC GCB GCB PS LNS LNS LNS LNS LNS LNS LNS mance. For = 0.8, 43 non-unique genes have non-zero coefficients. Table 2 lists the gene IDs and their descriptions for these 43 genes. Note some genes appear more than once due to replicates on arrays. These genes are September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui 0.68 (a) thres=0 thres=0.2 thres=0.4 thres=0.6 thres=0.8 thres=1 Area under the curve 0.66 0.66 0.68 Area under the curve thres=0 thres=0.2 thres=0.4 thres=0.6 thres=0.8 thres=1 0.64 0.62 0.60 0.58 0.56 2 4 6 8 time 10 12 14 0.56 0.58 0.60 0.62 0.64 5 10 time 15 20 Figure 2. ROC curves based on (a) the 10-fold cross validated scores based on the training data set and (b) the estimated scores for the 80 patients in the testing data set for different gradient threshold values . 1.0 low-risk patients high-risk patients 0.8 p= 0.0003498 Survival Probability 0.0 0 0.2 0.4 0.6 5 10 Survival in Years 15 20 Figure 3. The Kaplan-Meier curves for the high and low risk groups defined by the estimated scores for the 80 patients in the test data set. related to the risk of death among the DLBCL patients. It is interesting to note that many of these 43 genes belong to the four signature groups defined by Rowsenald et al. using clustering analysis of genes. The four groups are MHC class II (MHC), Proliferation signature (PF), Lymph node September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui signature (LNS) and Germinal center B-cell signature (GCB). The genes in these groups were shown to be related to the risk of death for the DLBCL patients. The estimated coefficients for the genes in MHC, GCB and LNS signature groups were all negative except for AA495985, indicating that high expression levels of these genes reduce the risk of death among the patients with DLBCL. All the genes in the PF group have positive coefficients. This agrees with what Rosenwald et al. (2002) has found. In addition, since many genes were spotted several times on the arrays, it is interesting that these genes were all selected as predictors. Other genes which do not belong to the four signature groups include T-cell receptor beta locus, T-cell leukemia/lymphoma 1A and T-cell differentiation protein. As a comparison, when = 1.0, only four genes (H92332, LC 29222, AA480985, H98765) were selected by the model. These four genes belong to three of the four signature groups and have large absolute coefficients. This is in direct contrast with the genes selected by the TGD Cox regression with = 0.8, where genes which are highly correlated were also selected. 4. Discussion and Conclusions In pharmacogenomics, one important problem is to predict patient's time to cancer relapse or time to death due to cancer after treatment using genomic profiles of the cancerous cells prior to the treatment. Powerful statistical methods for such prediction allow for high-dimensional genomic data such as microarray gene expression data to be used most efficiently. In this paper, we have extended the threshold gradient descent based method 9 for censored survival data in order to identify important predictive genes for survival using high throughput genetic or genomic data. Since the risk of cancer recurrence or death due to cancer may result from the interplay between many genes, methods which can utilize data of many genes, as in the case of our proposed method, are expected to show better performance in predicting risk. We have demonstrated the applicability of our methods by analyzing time to death of the diffuse large B-cell lymphoma patients and obtained satisfactory results, as evaluated by both applying the model to the test data set and the time dependent ROC curves. Our simulations also indicated better predictive performance of the proposed method than other penalization methods (results not shown due to page limitation). As we observed in our analysis of the DLBCL data set, if there is a group of variables or genes among which the pairwise correlations are very high, the procedure with = 1 tends to select only one variable from the group September 9, 2004 17:28 Proceedings Trim Size: 9in x 6in gui and does not care which one is selected. Such procedure, which is very closely related to Lasso, can give a good predictive performance. However, for genes sharing the same biological pathways, the correlations among them can be high. If the goal is to select important and relevant genes, one may want to include all these highly correlated genes, if one of them is selected. We observed that the TGD procedure with = 0.8 precisely achieved this goal. In addition, we may expect more robust prediction since the gene expression levels of highly correlated genes are used in the model. Therefore, in practice, one should use cross validation to select the optimal threshold value . Some possible extensions of the proposed methods are worth mentioning. One is to study the treatment effect adjusting for genetic or genomic profiles. This can be done by including a treatment indicator variable in the Cox regression model. However, regularization by threshold gradient is only applied to the variables related the high-dimensional genomic profiles. A profile-type penalization can then be developed. Another interesting problem is to identify certain genetic profiles which may interact with the treatment in determining the risk of certain clinical event. This can be done by including all the treatment by gene interactions in the model and using the threshold gradient descent method to select the relevant genes and their interactions. Both approaches deserve further investigation. In summary, the proposed threshold gradient descent method for the Cox model can be very useful in pharmacogenomics in building a parsimonious predictive model of risk based on the genetic or genomic profiles. The procedure is robust and numerically stable and can also be applied to select important genes that are related to patients' survival outcome. References 1. Golub, T.R., Slonim, D.K., Tamayo, P., et al. 1999 Science 286,531-537. 2. Rosenwald, A., Wright, G., Chan, W., et al. 2000 The New England Journal of Medicine 346,1937-1947. 3. Cox, D.R. 1972. Journal of the Royal Statistical Society. Series B 34,187-220. 4. Li, H. and Gui, J. 2004. Bioinformatics 20, i208-215. 5. Li, H. and Luan, Y. 2003. Pacific Symposium on Biocomputing 8,65-76. 6. Tibshirani, R. 1995. Journal of Royal Statistical Society B 58,267-288. 7. Efron B., Johnston I., Hastie T. and Tibshirani R. 2004. Annals of Statistics. 8. Gui J. and Li H. 2004. Technical report, UC Davis. 9. Friedman, J.H. and Popescu B.E. 2004. Technical Report, Statistics Department, Stanford University. 10. Heagerty, P.J., Lumley, T. and Pepe, M. 2000. Biometrics 56,337-344.