Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierarchical Approach

José Miguel Hernández-Lobato
Escuela Politécnica Superior
Universidad Autónoma de Madrid
Madrid, Spain
josemiguel.hernandez@uam.es

Tjeerd Dijkstra
Leiden Malaria Research Group
Leiden University Medical Center
Leiden, The Netherlands
t.dijkstra@lumc.nl

Tom Heskes
Institute for Computing and Information Sciences
Radboud University Nijmegen
Nijmegen, The Netherlands
t.heskes@science.ru.nl

Abstract

We introduce a hierarchical Bayesian model for the discovery of putative regulators from gene expression data only. The hierarchy incorporates the knowledge that only a few regulators regulate most of the genes. This is implemented through a spike-and-slab prior with mixing weights drawn from a hierarchical Bernoulli model. For efficient inference we implemented expectation propagation. Running the model on a malaria parasite data set, we found four genes with significant homology to transcription factors in an amoeba, one RNA regulator and three genes of unknown function (out of the top ten genes considered).

1 Introduction

Bioinformatics provides a rich source of applications for machine learning techniques. In particular, the elucidation of regulatory networks underlying gene expression has led to a cornucopia of approaches [1]. Here we focus on one aspect of network elucidation: the identification of the regulators of the causative agent of severe malaria, Plasmodium falciparum. Plasmodium has a complex life cycle, with separate stages in humans (liver and blood) and mosquitoes (midgut, blood and saliva). After a bite by an infected mosquito, the parasites enter the blood stream and travel to the liver, where they undergo many rounds of cell division. Subsequently, numbering in the millions, they re-enter the bloodstream and each parasite enters a red blood cell. Several properties of the parasite necessitate a tailored algorithm for regulator identification:

· In most species gene regulation takes place at the first stage of gene expression, when a DNA template is transcribed into mRNA. This so-called transcriptional control is mediated by specific transcription factors. Few specific transcription factors have been identified in Plasmodium based on sequence homology with other species [2, 3]. This could be because Plasmodium possesses a unique set of transcription factors, or because regulation acts through other mechanisms, e.g. at the level of chromatin or post-transcriptional regulation.

· Essentially none of the genes are inducible [4]. Compared with yeast, gene expression in Plasmodium is hardly changed by perturbations, e.g. adding chemicals or changing temperature. The interpretation is that the parasite is so narrowly adapted to its environment inside a red blood cell that it either lives (following a stereotyped gene expression program) or dies (when a drug is added; in this case no gene expression is left to assay). This finding means that techniques relying on perturbations of gene expression cannot be used.

These properties point to a model based only on gene expression time series. The model should not rely on sequence information: sequence homology-based searches have come up empty-handed and it is currently unclear at which stage regulation takes place. However, the model should be flexible enough to integrate sequence information in the future. This points to a Bayesian model as the favored approach.
2 The model

We start with a semi-realistic model of transcription based on Michaelis-Menten kinetics [1] and subsequently simplify it to obtain a linear model. Denoting the concentration of a certain mRNA transcript at time $t$ by $z(t)$, we write

$$\frac{dz(t)}{dt} = \frac{V_1\, a_1(t)^{M_1}}{K_1 + a_1(t)^{M_1}} \cdots \frac{V_N\, a_N(t)^{M_N}}{K_N + a_N(t)^{M_N}}\, p(t) \;-\; \frac{z(t)}{\tau_z}, \qquad (1)$$

with $a_j(t)$ the concentration of the $j$-th activator (positive regulator), $p(t)$ the concentration of RNA polymerase, and $V_j$, $K_j$, $M_j$ and $\tau_z$ reaction constants. $N$ denotes the number of potential activators. The activator is thought to bind to DNA motifs upstream of the transcription start site and to bind RNA polymerase, which reads the DNA template to produce an mRNA transcript. $M_j$ can be thought of as the multiplicity of the motif; $\tau_z$ captures the characteristic life time of the transcript. While reasonably realistic, this equation harbors too many unknowns for reliable inference: $3N + 1$ with $N \approx 1000$. We proceed with several simplifications:

· $a_j(t) \ll K_j$: activator concentration is low;
· $p(t) = p_0$ is constant;
· $\frac{dz(t)}{dt} \approx \frac{z(t+\Delta) - z(t)}{\Delta}$, with $\Delta$ the sampling period;
· $\Delta \approx \tau_z$: sampling period roughly equal to transcript life time.

Counting time in units of $\Delta$ and taking logarithms on both sides, Equation (1) then simplifies to

$$\log z(t+1) = C + M_1 \log a_1(t) + \cdots + M_N \log a_N(t),$$

with $C = \log(\Delta\, V_1 \cdots V_N\, p_0 / (K_1 \cdots K_N))$. This is a linear model for gene expression given the expressions of a set of activators. Similarly, one can also include repressors [1]. Given the expression of all the genes of an organism, an important challenge is to discover which of them are regulators.

2.1 A Bayesian model for sparse linear regression

Let $y$ be a vector with the log expression of the target gene and $X = (x_1, \ldots, x_N)$ a matrix whose columns contain the log expression of the candidate regulators. Assuming that the measurements are corrupted with additive Gaussian noise, we then have $y \sim \mathcal{N}(X\beta, \sigma^2 I)$, where $\beta = (\beta_1, \ldots, \beta_N)^T$ is a vector of regression coefficients and $\sigma^2$ is the variance of the noise. Such a linear model for gene expression is commonly used (see e.g. [5, 6]). Both $y$ and $x_1, \ldots, x_N$ are mean-centered vectors with $T$ measurements. We specify an inverse gamma (IG) prior for $\sigma^2$ so that $P(\sigma^2) = \mathrm{IG}(\sigma^2; \nu/2, \nu\lambda/2)$, where $\lambda$ is a prior estimate of $\sigma^2$ and $\nu$ is the sample size associated with that estimate. We assume that a priori all components $\beta_i$ are independent and take a so-called spike-and-slab prior [7] for each of them. That is, we introduce binary latent variables $\gamma_i$, with $\gamma_i = 1$ if $x_i$ takes part in the regression of $y$ and $\gamma_i = 0$ otherwise. Given $\gamma$, the prior on $\beta$ then reads

$$P(\beta|\gamma) = \prod_{i=1}^N P(\beta_i|\gamma_i) = \prod_{i=1}^N \mathcal{N}(\beta_i; 0, v_1)^{\gamma_i}\, \mathcal{N}(\beta_i; 0, v_0)^{1-\gamma_i},$$

where $\mathcal{N}(x; \mu, \sigma^2)$ denotes a Gaussian density with mean $\mu$ and variance $\sigma^2$ evaluated at $x$. In order to enforce sparsity, the variance $v_1$ of the slab should be larger than the variance $v_0$ of the spike. Instead of picking the hyperparameters $v_1$ and $v_0$ directly, it is convenient to pick a threshold $\delta$ of practical significance so that $P(\gamma_i = 1)$ gets more weight when $|\beta_i| > \delta$ and $P(\gamma_i = 0)$ gets more weight when $|\beta_i| < \delta$ [7]. In this way, given $\delta$ and one of $v_1$ or $v_0$, we pick the other one such that

$$\delta^2 = \frac{\log(v_1/v_0)}{v_0^{-1} - v_1^{-1}}. \qquad (2)$$

Figure 1: The hierarchical model for gene regulation (plate diagram).

Finally, we assign independent Bernoulli priors to the components of the latent vector $\gamma$:

$$P(\gamma) = \prod_{i=1}^N \mathrm{Bern}(\gamma_i; w) = \prod_{i=1}^N w^{\gamma_i} (1-w)^{1-\gamma_i},$$

so that each of the $x_1, \ldots, x_N$ can independently take part in the regression with probability $w$.
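To make the choice implied by Equation (2) concrete, the following minimal sketch (our own illustration, not part of the paper's implementation) solves (2) numerically for $v_0$ given $v_1$ and the threshold $\delta$; the function name and the use of SciPy's root finder are arbitrary choices.

```python
import numpy as np
from scipy.optimize import brentq

def spike_variance(v1, delta):
    """Solve Eq. (2) for the spike variance v0, given the slab variance v1
    and the practical-significance threshold delta (requires delta**2 < v1)."""
    # At the solution the spike and slab densities intersect at |beta_i| = delta,
    # i.e. delta^2 = log(v1 / v0) / (1/v0 - 1/v1).
    f = lambda v0: delta**2 - np.log(v1 / v0) / (1.0 / v0 - 1.0 / v1)
    # v0 must lie strictly between 0 and v1; bracket the root accordingly.
    return brentq(f, 1e-12, v1 * (1.0 - 1e-9))

# Example with the values used later in Section 4 for the gene regulation model:
# slab variance v1 = 1 and threshold delta = 0.2.
print(spike_variance(v1=1.0, delta=0.2))  # roughly 8e-3, much smaller than v1
```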
We can identify the candidate genes whose expression is most likely to be correlated with the target gene by means of the posterior distribution of $\gamma$:

$$P(\gamma|y, X) = \int P(\gamma, \beta, \sigma^2 | y, X)\, d\beta\, d\sigma^2 \;\propto\; \int P(\beta, \gamma, \sigma^2, y | X)\, d\beta\, d\sigma^2,$$

where

$$P(\beta, \gamma, \sigma^2, y|X) = \mathcal{N}(y; X\beta, \sigma^2 I)\, P(\beta|\gamma)\, P(\gamma)\, P(\sigma^2)
= \left[\prod_{t=1}^T \mathcal{N}\Big(y_t;\, \textstyle\sum_{i=1}^N x_{i,t}\,\beta_i,\, \sigma^2\Big)\right]
\left[\prod_{i=1}^N \mathcal{N}(\beta_i; 0, v_1)^{\gamma_i}\, \mathcal{N}(\beta_i; 0, v_0)^{1-\gamma_i}\right]
\left[\prod_{i=1}^N \mathrm{Bern}(\gamma_i; w)\right]
\mathrm{IG}(\sigma^2; \nu/2, \nu\lambda/2). \qquad (3)$$

Unfortunately, this posterior cannot be computed exactly if the number $N$ of candidate genes is larger than 25. An approximation based on Markov chain Monte Carlo (MCMC) is proposed in [8].

2.2 A hierarchical model for gene regulation

In the section above we made use of the prior information that a target gene is typically regulated by a small number of regulators. We have not yet made use of the prior information that a regulator typically regulates more than one gene. We incorporate this information through a hierarchical extension of our previous model. We introduce a vector $\rho$ of binary latent variables, where $\rho_i = 1$ if gene $i$ is a regulator and $\rho_i = 0$ otherwise. The following joint distribution captures this idea:

$$P(\Omega, \Gamma, \rho, \sigma^2 | X) =
\left[\prod_{j=1}^N \prod_{t=1}^{T-1} \mathcal{N}\Big(x_{j,t+1};\, \textstyle\sum_{i=1, i\neq j}^N x_{i,t}\,\omega_{j,i},\, \sigma_j^2\Big)\right]
\left[\prod_{j=1}^N \prod_{i=1, i\neq j}^N \mathcal{N}(\omega_{j,i}; 0, v_1)^{\gamma_{j,i}}\, \mathcal{N}(\omega_{j,i}; 0, v_0)^{1-\gamma_{j,i}}\right]
\left[\prod_{j=1}^N \prod_{i=1, i\neq j}^N \mathrm{Bern}(\gamma_{j,i}; w_1)^{\rho_i}\, \mathrm{Bern}(\gamma_{j,i}; w_0)^{1-\rho_i}\right]
\left[\prod_{i=1}^N \mathrm{Bern}(\rho_i; w)\right]
\left[\prod_{j=1}^N \mathrm{IG}(\sigma_j^2; \nu_j/2, \nu_j\lambda_j/2)\right]. \qquad (4)$$

In this hierarchical model, $\Gamma$ is a matrix of binary latent variables where $\gamma_{j,i} = 1$ if gene $i$ takes part in the regression of gene $j$ and $\gamma_{j,i} = 0$ otherwise. The relationship between regulators and regulatees suggests that $P(\gamma_{j,i} = 1 | \rho_i = 1)$ should be larger than $P(\gamma_{j,i} = 1 | \rho_i = 0)$, and thus $w_1 > w_0$. The matrix $\Omega$ contains regression coefficients, where $\omega_{j,i}$ is the regression coefficient between the expression of gene $i$ and the delayed expression of gene $j$. The hyperparameter $w$ represents the prior probability of any gene being a regulator, and the elements $\sigma_j^2$ of the vector $\sigma^2$ contain the variance of the noise in each of the $N$ regressions. The hyperparameters $\nu_j$ and $\lambda_j$ have the same meaning as in the model for sparse linear regression. The corresponding plate model is illustrated in Figure 1. We can identify the genes most likely to be regulators by means of the posterior distribution $P(\rho|X)$. Compared with the sparse linear regression model, we have expanded the number of latent variables from $O(N)$ to $O(N^2)$. In order to keep inference feasible we turn to an approximate inference technique.

3 Expectation propagation

The expectation propagation (EP) algorithm [9] allows us to perform approximate Bayesian inference. In all Bayesian problems, the joint distribution of the model parameters $\theta$ and a data set $D = \{(x_i, y_i) : i = 1, \ldots, n\}$ with i.i.d. elements can be expressed as a product of terms

$$P(\theta, D) = \left[\prod_{i=1}^n P(y_i|x_i, \theta)\right] P(\theta) = \prod_{i=1}^{n+1} t_i(\theta), \qquad (5)$$

where $t_{n+1}(\theta) = P(\theta)$ is the prior distribution for $\theta$ and $t_i(\theta) = P(y_i|x_i, \theta)$ for $i = 1, \ldots, n$. Expectation propagation proceeds to approximate (5) with a product of simpler terms

$$P(\theta, D) \approx \prod_{i=1}^{n+1} \tilde{t}_i(\theta) = Q(\theta), \qquad (6)$$

where all the term approximations $\tilde{t}_i$ are restricted to belong to the same family $\mathcal{F}$ of exponential distributions, but they do not have to integrate to 1. Note that $Q$ will also be in $\mathcal{F}$ because $\mathcal{F}$ is closed under multiplication. Each term approximation $\tilde{t}_i$ is chosen so that

$$Q(\theta) = \tilde{t}_i(\theta) \prod_{j \neq i} \tilde{t}_j(\theta) = \tilde{t}_i(\theta)\, Q^{\setminus i}(\theta)$$

is as close as possible to

$$t_i(\theta) \prod_{j \neq i} \tilde{t}_j(\theta) = t_i(\theta)\, Q^{\setminus i}(\theta)$$

in terms of the direct Kullback-Leibler (K-L) divergence.
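For exponential families, minimizing this direct K-L divergence amounts to matching expected sufficient statistics. The toy sketch below (our own example, with a made-up tilted distribution) shows the operation for a Gaussian family: an unnormalized one-dimensional density is projected onto a Gaussian by matching its mean and variance.

```python
import numpy as np

def project_to_gaussian(grid, unnorm_density):
    """Project an unnormalized 1-D density, tabulated on a uniform grid, onto
    the Gaussian family by matching sufficient statistics (mean and variance).
    For exponential families this moment matching is exactly the operation
    that minimizes the direct K-L divergence KL(p || q) over members q."""
    dx = grid[1] - grid[0]
    z = unnorm_density.sum() * dx                      # normalization constant
    p = unnorm_density / z
    mean = (grid * p).sum() * dx
    var = ((grid - mean) ** 2 * p).sum() * dx
    return mean, var

# Toy illustration: a "tilted" distribution t_i(theta) * Q^{\i}(theta), with a
# Gaussian cavity Q^{\i} and a non-Gaussian term t_i (a two-component mixture,
# loosely mimicking a spike-and-slab factor), projected onto one Gaussian.
grid = np.linspace(-6.0, 6.0, 4001)
cavity = np.exp(-0.5 * (grid - 1.0) ** 2 / 2.0)
t_i = 0.5 * np.exp(-0.5 * grid ** 2 / 1.0) + 0.5 * np.exp(-0.5 * grid ** 2 / 0.01)
print(project_to_gaussian(grid, t_i * cavity))          # matched mean and variance
```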
The pseudocode of the EP algorithm is:

1. Initialize the term approximations $\tilde{t}_i$ and $Q$ to be uniform.
2. Repeat until all $\tilde{t}_i$ converge:
   (a) Choose a $\tilde{t}_i$ to refine and remove it from $Q$ to get $Q^{\setminus i}$ (e.g. by dividing $Q$ by $\tilde{t}_i$).
   (b) Update the term $\tilde{t}_i$ so that it minimizes the K-L divergence between $t_i\, Q^{\setminus i}$ and $\tilde{t}_i\, Q^{\setminus i}$.
   (c) Re-compute $Q$ so that $Q = \tilde{t}_i\, Q^{\setminus i}$.

The optimization problem in step (b) is solved by matching sufficient statistics between a distribution $Q'$ within the family $\mathcal{F}$ and $t_i\, Q^{\setminus i}$; the new $\tilde{t}_i$ is then equal to $Q'/Q^{\setminus i}$. Because $Q$ belongs to the exponential family, it is generally trivial to calculate its normalization constant. Once $Q$ is normalized it can approximate $P(\theta|D)$. Finally, EP is not guaranteed to converge, although convergence can be improved by means of damped updates or double-loop algorithms [10].

3.1 EP for sparse linear regression

The application of EP to the models of Section 2 involves some nontrivial technicalities; in addition, we describe several techniques to speed up the EP algorithm. We approximate $P(\beta, \gamma, \sigma^2, y|X)$ for sparse linear regression by means of a factorized exponential distribution:

$$P(\beta, \gamma, \sigma^2, y|X) \approx \left[\prod_{i=1}^N \mathrm{Bern}(\gamma_i; q_i)\, \mathcal{N}(\beta_i; \mu_i, s_i)\right] \mathrm{IG}(\sigma^2; a, b) \propto Q(\beta, \gamma, \sigma^2), \qquad (7)$$

where $\{q_i, \mu_i, s_i : i = 1, \ldots, N\}$, $a$ and $b$ are free parameters. Note that in the approximation $Q(\beta, \gamma, \sigma^2)$ all the components of the vectors $\beta$ and $\gamma$ and the variable $\sigma^2$ are considered to be independent; this allows the approximation of $P(\gamma|y, X)$ by $\prod_{i=1}^N \mathrm{Bern}(\gamma_i; q_i)$. We tune the parameters of $Q(\beta, \gamma, \sigma^2)$ by means of EP over the unnormalized density $P(\beta, \gamma, \sigma^2, y|X)$. This density appears in (3) as a product of $T + N$ terms (not counting the priors), which correspond to the terms $t_i$ in (5). This way, we have $T + N$ term approximations with the same form as (7), which correspond to the term approximations $\tilde{t}_i$ in (6). The complexity is $O(TN)$ per iteration, because updating any of the first $T$ term approximations requires $N$ operations. However, some of the EP update operations require computing integrals that do not have a closed-form expression. To avoid this, we employ the following simplifications when we update the first $T$ term approximations:

1. When updating the parameters $\{\mu_i, s_i : i = 1, \ldots, N\}$ of the Gaussians in the term approximations, we approximate a Student's t-distribution by means of a Gaussian distribution with the same mean and variance as the t-distribution. This approximation becomes more and more accurate as the degrees of freedom of the t-distribution increase.

2. When updating the parameters $\{a, b\}$ of the IG in the term approximations, instead of propagating the sufficient statistics of an IG distribution we propagate the expectations of $1/\sigma^2$ and $1/\sigma^4$. To achieve this, we have to perform two approximations like the one stated above. Note that in this case we are not minimizing the direct K-L divergence. However, at convergence, we expect the resulting IG in (7) to be sufficiently accurate.

In order to improve convergence, we re-update the last $N$ term approximations each time one of the first $T$ term approximations is updated. The computational complexity does not get worse than $O(TN)$ and the resulting algorithm is faster. By comparison, the MCMC method [8] takes $O(N^2)$ steps to generate a single sample from $P(\gamma|y, X)$. Problems of much smaller size than the one considered here typically require on the order of 10000 samples to obtain reasonably accurate estimates [7].
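To make the loop above concrete, here is a minimal, self-contained EP example of our own (not the authors' implementation, and with a different likelihood than in the paper): a one-dimensional toy model with a Gaussian prior and Student-t likelihood terms, where the moment matching of step (b) is carried out by grid quadrature. The structure of the updates mirrors steps (a)-(c) of the pseudocode.

```python
import numpy as np
from scipy.stats import t as student_t

def ep_1d(y, nu=4.0, prior_var=10.0, n_sweeps=20):
    """Minimal 1-D EP loop mirroring steps (a)-(c) of the pseudocode.
    Toy model (ours, not the paper's): theta ~ N(0, prior_var), and each
    observation y_i is Student-t distributed around theta.  Every non-Gaussian
    likelihood term gets a Gaussian site approximation, stored in natural
    parameters (site precision r_i and precision-times-mean m_i)."""
    n = len(y)
    r = np.zeros(n)                    # site precisions (flat/uniform start)
    m = np.zeros(n)                    # site precision * mean
    R, M = 1.0 / prior_var, 0.0        # Q starts equal to the Gaussian prior
    grid = np.linspace(-25.0, 25.0, 5001)
    dx = grid[1] - grid[0]
    for _ in range(n_sweeps):
        for i in range(n):
            # (a) remove site i from Q to obtain the cavity Q^{\i}
            R_cav, M_cav = R - r[i], M - m[i]
            if R_cav <= 0:
                continue               # skip the update if the cavity is improper
            mu_cav, var_cav = M_cav / R_cav, 1.0 / R_cav
            # (b) match the moments of the tilted distribution t_i * Q^{\i}
            tilted = student_t.pdf(y[i] - grid, df=nu) * \
                     np.exp(-0.5 * (grid - mu_cav) ** 2 / var_cav)
            z = tilted.sum() * dx
            mu_new = (grid * tilted).sum() * dx / z
            var_new = ((grid - mu_new) ** 2 * tilted).sum() * dx / z
            # (c) new site = matched Gaussian divided by the cavity; fold it into Q
            r[i] = 1.0 / var_new - R_cav
            m[i] = mu_new / var_new - M_cav
            R, M = R_cav + r[i], M_cav + m[i]
    return M / R, 1.0 / R              # mean and variance of the Gaussian Q

mean, var = ep_1d(np.array([0.9, 1.3, 1.1, 8.0]))   # 8.0 acts as an outlier
print(mean, var)   # posterior mean stays near the bulk of the data, not dragged to 8
```

In the models of this paper the same cavity-and-moment-matching pattern is applied to the factorized approximation (7), with the simplifications listed above replacing the quadrature.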
3.2 EP for gene regulation

We approximate $P(\Omega, \Gamma, \rho, \sigma^2|X)$ for the hierarchical gene regulation model by means of the factorized exponential distribution

$$Q(\Omega, \Gamma, \rho, \sigma^2) = \left[\prod_{j=1}^N \prod_{i=1, i\neq j}^N \mathrm{Bern}(\gamma_{j,i}; q_{j,i})\, \mathcal{N}(\omega_{j,i}; \mu_{j,i}, s_{j,i})\right] \left[\prod_{i=1}^N \mathrm{Bern}(\rho_i; t_i)\right] \left[\prod_{j=1}^N \mathrm{IG}(\sigma_j^2; a_j, b_j)\right],$$

where $\{a_j, b_j, t_i, q_{j,i}, \mu_{j,i}, s_{j,i} : i = 1, \ldots, N;\, j = 1, \ldots, N;\, i \neq j\}$ are free parameters. The posterior probability $P(\rho|X)$, which indicates which genes are most likely to be regulators, can then be approximated by means of $\prod_{i=1}^N \mathrm{Bern}(\rho_i; t_i)$. Again, we fix the parameters in $Q(\Omega, \Gamma, \rho, \sigma^2)$ by means of EP over the joint density $P(\Omega, \Gamma, \rho, \sigma^2|X)$. It is straightforward to adapt the EP algorithm used in the sparse linear regression model to this new case: the terms to be approximated are the same as before, except for the new $N(N-1)$ terms for the prior on $\Gamma$. As in the previous section, and in order to improve convergence, we re-update all the $N(N-1)$ term approximations corresponding to the prior on $\Gamma$ each time $N$ of the $N(T-1)$ term approximations corresponding to the regressions are updated. In order to reduce memory requirements, we collapse all the $N(N-1)$ terms for the prior on $\Omega$ into a single term, which we can do because they are independent, so that we only store in memory one term approximation instead of $N(N-1)$. We also group the $N(N-1)$ terms for the prior on $\Gamma$ into $N$ independent terms and the $N(T-1)$ terms for the regressions into $T-1$ independent terms. Assuming a constant number of iterations (in our experiments, we need at most 20 iterations for EP to converge), the computational complexity and the memory requirements of the resulting algorithm are $O(TN^2)$. This indicates that it is feasible to analyze data sets that contain thousands of genes. An MCMC algorithm would require $O(N^3)$ steps to generate a single sample.

Figure 2: Left: Plot of the vectors $x_2, \ldots, x_{50}$ in grey and the vector $x_1$ in black. The vector $x_1$ contains the expression of a regulator which would determine the expressions in $x_2, \ldots, x_{50}$. Right: Expressions of gene PF11_0321 (black) and the 100 genes most likely to be regulated by it (light and dark grey). Two clusters of positively and negatively regulated genes can be appreciated.

4 Experiments with artificial data

We carried out a series of experiments with artificial data in order to validate the two EP algorithms described above. In the experiments for sparse linear regression we fixed the hyperparameters in (3) so that $\nu = 3$, $\lambda$ is the sample variance of the target vector $y$, $v_1 = 1$, $\delta = N^{-1}$, $v_0$ is chosen according to (2) and $w = N^{-1}$. In the experiment for gene regulation we fixed the hyperparameters in (4) so that $w = (N-1)^{-1}$, $\nu_j = 3$ and $\lambda_j$ is the sample variance of the vector $x_j$, $w_1 = 10^{-1}(N-1)^{-1}$, $w_0 = 10^{-2}(N-1)^{-1}$, $v_1 = 1$, $\delta = 0.2$ and $v_0$ is chosen according to (2). Although the posterior probabilities are quite sensitive to some of these choices, the orderings of these probabilities, e.g. to determine the most likely regulators, are very robust to large changes.

4.1 Sparse linear regression

In the first experiment we set $T = 50$ and generated candidate vectors $x_1, \ldots, x_{6000} \sim \mathcal{N}(0, 3^2 I)$ and a target vector $y = x_1 - x_2 + 0.5\,x_3 - 0.5\,x_4 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
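For concreteness, the first artificial data set just described can be generated as follows (a sketch under the stated setup; the random seed and variable names are our own):

```python
import numpy as np

# Sketch of the data-generating process of the first experiment in Section 4.1:
# T = 50 time points, N = 6000 Gaussian candidate vectors, and a target that
# depends on only the first four candidates.  Seed and names are arbitrary.
rng = np.random.default_rng(0)
T, N = 50, 6000
X = rng.normal(loc=0.0, scale=3.0, size=(T, N))    # x_1, ..., x_N ~ N(0, 3^2 I)
noise = rng.normal(loc=0.0, scale=1.0, size=T)     # epsilon ~ N(0, I)
y = X[:, 0] - X[:, 1] + 0.5 * X[:, 2] - 0.5 * X[:, 3] + noise

# Mean-center, as assumed by the sparse regression model of Section 2.1.
X -= X.mean(axis=0)
y -= y.mean()
print(X.shape, y.shape)                            # (50, 6000) (50,)
```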
The EP algorithm assigned values close to 1 to $q_1$ and $q_2$, the parameters $q_3$ and $q_4$ obtained the values $5.2 \cdot 10^{-3}$ and $0.5$ respectively, and $q_5, \ldots, q_{6000}$ were all smaller than $3 \cdot 10^{-4}$. We repeated the experiment several times (each time using new data) and obtained similar results on each run.

In the second experiment we set $T = 50$ and generated a target vector $y \sim \mathcal{N}(0, 3^2 I)$ and candidate vectors $x_1, \ldots, x_{500}$ so that $x_i = y + \epsilon_i$ for $i = 2, \ldots, 500$, where $\epsilon_i \sim \mathcal{N}(0, I)$. The candidate vector $x_1$ was generated as $x_1 = y + 0.5\,\epsilon_1$, where $\epsilon_1 \sim \mathcal{N}(0, I)$. In this way, the noise in $x_1$ is half as large as the noise in the other candidate vectors. Note that in this situation all the candidate vectors are highly correlated with each other and with the target vector. This is precisely what happens in gene expression data sets, where many genes show similar expression patterns. Finally, we ran the EP algorithm 100 times (each time using new data) and it always assigned to all of $q_1, \ldots, q_{500}$ more or less the same value, around $6 \cdot 10^{-4}$. However, $q_1$ obtained the highest value on 54 of the runs and was among the three $q_i$ with the highest value on 87 of the runs. We repeated these experiments setting $N = 100$, using both the MCMC method of [8] and the EP algorithm. The results of the two techniques were statistically indistinguishable (the approximations obtained through EP fell within the variation of the MCMC method), with EP being much faster.

4.2 Gene regulation

In this experiment we set $T = 50$ and generated a vector $z$ with $T + 1$ measurements from a sinusoidal wave. We then generated 49 vectors $x_2, \ldots, x_{50}$ with $x_{i,t} = z_t + \epsilon_{i,t}$ for $i = 2, \ldots, 50$ and $t = 1, \ldots, T$, where $\epsilon_{i,t} \sim \mathcal{N}(0, \sigma^2)$ and $\sigma$ is one fourth of the sample standard deviation of $z$. We also generated a vector $x_1$ so that $x_{1,t} = z_{t+1} + \epsilon_t$ for $t = 1, \ldots, T$, with $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. In this way, $x_1$ contains the expression of a regulator which would determine the expressions in $x_2, \ldots, x_{50}$. A single realization of $x_1, \ldots, x_{50}$ is displayed on the left of Figure 2. Finally, we ran the EP algorithm for gene regulation over 100 different realizations of $x_1, \ldots, x_{50}$. The algorithm assigned $t_1$ the highest value on 33 of the runs, and $x_1$ was ranked among the five vectors most likely to be regulators on 74 of the runs. This indicates that the EP algorithm detects very small differences in correlations and should be able to find new regulators when real microarray data are used.

5 Experiments with real microarray data

We have tested our algorithm on four gene expression data sets. The first is a yeast cell-cycle data set from [11] which is commonly used as a benchmark for regulator discovery. Data sets two through four are from the intra-erythrocytic developmental cycle of three different Plasmodium strains [12]. Missing values were estimated by means of nearest neighbor imputation [13] and the hyperparameters were fixed at the same values as in Section 4.

The yeast cdc15 data set contains 23 measurements of 6178 genes. We singled out 751 genes which met a minimum criterion for cell cycle regulation [11]. The top ten genes with the highest $t_i$ and their annotation from the Saccharomyces Genome Database are listed below: the top two genes are specific transcription factors and IOC2 is associated with transcription regulation. As 4% of the yeast genome is associated with transcription, the probability of this occurring by chance is 0.0062.
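The chance probability quoted above can be reproduced with a simple binomial tail computation (our own back-of-the-envelope check, treating each of the ten top-ranked genes as independently transcription-associated with probability 0.04):

```python
from math import comb

# P(at least 3 of 10 genes are transcription-associated) when each gene is
# transcription-associated with probability 0.04, independently.
p, n = 0.04, 10
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))
print(round(prob, 4))   # prints 0.0062, matching the value quoted in the text
```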
rank  standard name  common name  annotation
1     YLR098c        CHA4         DNA binding transcriptional activator
2     YOR315w        SFG1         putative transcription factor for growth of superficial pseudohyphae
3     YJL073w        JEM1         DNAJ-like chaperone
4     YOR023c        AHC1         subunit of the ADA histone acetyltransferase complex
5     YOR105w        -            dubious open reading frame
6     YLR095w        IOC2         transcription elongation protein
7     YOR321w        PMT3         O-mannosyl transferase
8     YLR231c        BNA5         kynureninase
9     YOR248w        -            dubious open reading frame
10    YOR247w        SRL1         mannoprotein

The three data sets for the malaria parasite [12] contain 53 measurements (3D7), 50 measurements (Dd2) and 48 measurements (HB3). We focus on the 3D7 strain as this is the sequenced reference strain. We singled out 751 genes whose expression pattern showed the highest variation, as quantified by the interquartile range of the expression measurements. The top ten genes with the highest values of $t_i$, along with their annotation from PlasmoDB, are listed below. Recalling the motivation for our approach, the paucity of known transcription factors, we cannot expect to find many annotated regulators in PlasmoDB. Thus, we list selected BLASTP hits instead of the absent annotation. These hits were the highest scoring ones outside of the genus Plasmodium. We find four genes with a large identity to transcription factors in Dictyostelium (a recently sequenced social amoeba) and one annotated helicase, which typically functions in post-transcriptional regulation. Interestingly, three genes have no known function and could be regulators.

rank  standard name  annotation or selected BLASTP hits
1     PFC0950c       25% identity to GATA binding TF in Dictyostelium
2     PF11_0321      25% identity to putative WRKY TF in Dictyostelium
3     PFI1210w       no BLASTP matches outside Plasmodium genus
4     MAL6P1.233     no BLASTP matches outside Plasmodium genus
5     PFD0175c       32% identity to GATA binding TF in Dictyostelium
6     MAL7P1.34      35% identity to GATA binding TF in Dictyostelium
7     MAL6P1.182     N-acetylglucosaminyl-phosphatidylinositol de-N-acetylase
8     PF13_0140      dihydrofolate synthase/folylpolyglutamate synthase
9     PF13_0138      no BLASTP matches outside Plasmodium genus
10    MAL13P1.14     DEAD box helicase

Results for the HB3 strain were similar in that five putative regulators were found. Somewhat disappointingly, we found only one putative regulator (a helicase) among the top ten genes for Dd2.

6 Conclusion and discussion

Our approach enters a field full of methods enforcing sparsity ([14, 5, 6, 15] and references therein). Our main contributions are: a hierarchical model to discover regulators, a tractable algorithm for fast approximate inference in models with many interacting variables, and the application to malaria. Arguably most related is the hierarchical model in [14]. The covariates in that model are a dozen external variables coding experimental conditions, instead of the hundreds of expression levels of other genes as in our model. Furthermore, the prior in [14] enforces sparsity on the "columns" of the weight matrix to implement the idea that some target genes are not influenced by any of the experimental conditions. Our prior, on the other hand, enforces sparsity on the "rows" in order to find regulators. Future work could include more involved priors, e.g. enforcing sparsity on both "rows" and "columns", incorporating information from DNA sequence data, as well as more complicated models with latent factors and covariates. The approximate inference techniques described in this paper make it feasible to evaluate such extensions in a fraction of the time required by MCMC methods.
References

[1] T.S. Gardner and J.J. Faith. Reverse-engineering transcription control networks. Physics of Life Reviews, 2:65-88, 2005.
[2] R. Coulson, N. Hall, and C. Ouzounis. Comparative genomics of transcriptional control in the human malaria parasite Plasmodium falciparum. Genome Research, 14:1548-1554, 2004.
[3] S. Balaji, M.M. Babu, L.M. Iyer, and L. Aravind. Discovery of the principal specific transcription factors of Apicomplexa and their implication for the evolution of the AP2-integrase DNA binding domains. Nucleic Acids Research, 33(13):3994-4006, 2005.
[4] A.M. Gunasekera, A. Myrick, K. Le Roch, E. Winzeler, and D.F. Wirth. Plasmodium falciparum: genome wide perturbations in transcript profiles among mixed stage cultures after chloroquine treatment. Experimental Parasitology, 117(1):87-92, 2007.
[5] C. Sabatti and G.M. James. Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics, 22(6):739-746, 2006.
[6] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.
[7] E.I. George and R.E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, 7:339-374, 1997.
[8] E.I. George and R.E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881-889, 1993.
[9] T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT, 2001.
[10] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In Proceedings of UAI 2002, pages 216-223, 2002.
[11] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273-3297, 1998.
[12] M. Llinás, Z. Bozdech, E.D. Wong, A.T. Adai, and J.L. DeRisi. Comparative whole genome transcriptome analysis of three Plasmodium falciparum strains. Nucleic Acids Research, 34(4):1166-1173, 2006.
[13] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R.B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520-525, 2001.
[14] J. Lucas, C. Carvalho, Q. Wang, A. Bild, J. Nevins, and M. West. Sparse statistical modelling in gene expression genomics. In K.A. Do, P. Müller, and M. Vannucci, editors, Bayesian Inference for Gene Expression and Proteomics. Springer, 2006.
[15] M.Y. Park, T. Hastie, and R. Tibshirani. Averaged gene expressions for regression. Biostatistics, 8, 2007.