Feature Selection via Blo ck-Regularized Regression Seyoung Kim School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Eric Xing School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract Identifying co-varying causal elements in very high dimensional feature space with internal structures, e.g., a space with as many as millions of linearly ordered features, as one typically encounters in problems such as whole genome association (WGA) mapping, remains an open problem in statistical learning. We propose a block-regularized regression model for sparse variable selection in a high-dimensional space where the covariates are linearly ordered, and are possibly subject to local statistical linkages (e.g., block structures) due to spacial or temporal proximity of the features. Our goal is to identify a small subset of relevant covariates that are not merely from random positions in the ordering, but grouped as contiguous blocks from large number of ordered covariates. Following a typical linear regression framework between the features and the response, our proposed model employs a sparsity-enforcing Laplacian prior for the regression coefficients, augmented by a 1st-order Markovian process along the feature sequence that "activates" the regression coefficients in a coupled fashion. We describe a sampling-based learning algorithm and demonstrate the performance of our method on simulated and biological data for marker identification under WGA. Figure 1: An illustration of linkage disequilibrium in chromosomes in an association study. pression, and physiological measurements in order to discover genetic markers that affect the phenotype of the individual possessing a particular variation of the marker. Variations in a single nucleotide called single nucleotide polymorphisms (SNPs) provide a useful set of genetic markers since they are relatively common across the genome. The whole-genome association study has become feasible because of the relatively low cost involved in typing a large number of SNPs. The challenge in this type of study is to identify a small subset of SNPs associated with the phenotype among the full set of SNPs that can be as large as 8 million (the International HapMap Consortium, 2005). A simple single-marker test has been widely used for detecting an association (Stranger et al. 2005, Cheung et al. 2005). Using this approach, one examines the correlation between the given phenotype and frequencies of each polymorphic allele of one SNP marker at a time to compute p-value of the SNP, and finds the SNPs with low p-values to be significant. The single-marker test assumes that SNPs are independent of each other, ignoring an important correlation structure due to the linkage disequilibrium present in the sequence of SNPs. In reality, the states of SNPs that are adjacent in the genome can be tightly coupled (i.e., in linkage disequilibrium). This is because when an individual inherites a chromosomal material from each of the parents, a recombination event can break the parental chromosomes into non-random inheritable segment, causing SNPs within the segment to be inherited with high probability, and preventing random combinations of all possible SNP states within the segment. Since the recombination sites are non-uniformly distributed across the genome, recombination events in chromosomes in a population over generations lead to a block structure in SNPs on the 1 INTRODUCTION Recent advances in high-throughput genotyping technology have allowed researchers to generate a high volume of genotype data at a relatively low cost. An association study involves examining genotype data of individuals in a population and phenotype data for the same individuals such as disease status, gene ex- chromosome. As illustrated in Figure 1, each chromosome is a mosaic of ancestor chromosomes, where segments of SNPs of the same color have been inherited from the same ancestor chromosome. The true association SNPs called causal SNPs are indicated as circles in Figure 1. Since a chromosome segment carrying causal alleles can be inherited as a block, we can take advantage of this block structure to increase the power of the study for detecting association by considering a block of linked SNPs jointly rather than a single SNP at a time. A multi-marker approach takes into account this linkage disequilibrium pattern by testing a short segment of SNP markers called a haplotype for an association (Zailen et al. 2007, Zhang et al. 2002). In this case, a haplotype instead of a single SNP acts as a proxy for untyped causal SNPs. However, they test a haplotype of a fixed length for an association, scanning the genome using a sliding window. Most of these approaches do not explicitly make use of the block structure with possibly varying block lengths in the sequence of SNPs. In this paper, we propose a model for association mapping that explicitly incorporates the linkage disequilibrium pattern. We focus on continuous valued phenotypes, and base our method on a linear regression model, where the SNPs are predictors and the phenotypes are response variables. The SNPs with large regression coefficients are found to be significant. The number of SNPs involved in a typical association study is very large, and we are interested in extracting a small number of causal SNPs that are grouped into blocks due to linkage disequilibrium. Thus, we can view this problem as 1) identifying relevant covariates when the covariates lie in a high-dimensional space and 2) learning a block structure in those relevant covariates at the same time. To enforce sparsity in the regression model, we use the Laplacian prior on the regression coefficients, similar to the L1 penalty in the lasso (Tibshirani 1996). However, the lasso does not provide any mechanism to incorporate the correlation structure in covariates into the model to address the second problem. In this paper, we propose to encode the information of the correlation pattern in the SNPs as a Markov chain, and use this Markov chain as a prior in the regression model. The block boundaries for chromosome regions with a high level of association are determined probabilistically through transition probabilities in the Markov chain. A regression-based association has been used previously (Servin and Stephens 2007), but they did not address the problem of taking into account the block structure in the genome in a high dimensional space to improve the power of the study. There is a large body of literature on variable selection methods such as the lasso (Tibshirani 1996) and Bayesian variable selection algorithms (George and McCulloch 1993, Ishwaran and Rao 2005, Yuan and Lin, 2005). Most of these works did not consider situations in which the covariates are structured in a certain manner. Nott and Green (2004) used the correlation information in the covariates to improve the convergence of sampling algorithm, but the model itself did not assume any structure in the covariates. In the fused lasso (Tibshirani et al. 2005), covariates were assumed to be ordered, and adjacent regression coefficients tended to be fused to take the same value, encouraging sparsity in the difference between adjacent coefficients. However, it used only the ordering information, and did not take into account the additional information on the structure in covariates such as the linkage disequilibrium pattern in the case of association study. In the group lasso (Yuan and Lin 2005), the group structure in covariates was assumed to be known, whereas in our proposed model we determine the block structure of relevant covariates during learning given prior knowledge on the correlation structure. The rest of the paper is organized as follows. In Section 2, we describe the proposed model for association mapping and the learning algorithm. In Section 3, we apply the model to simulated data and mouse data, and compare the performance of the proposed model with that of existing methods. In Section 4, we conclude with a brief discussion of future work. 2 GENOME-WIDE ASSOCIATION VIA BLOCK-REGULARIZED REGRESSION THE MODEL 2. 1 The Asso ciation Mo del Let us assume that a set of J SNP markers have been typed for N individuals. The genotype of each SNP in an individual consists of two alleles corresponding to each of a pair of chromosomes in the case of diploid organisms such as humans and mice. For our regression analysis, we construct an N × J design matrix X, where each element xij takes values from {0, 1, 2} according to the number of minor alleles in the genotype of the j th SNP of the ith individual, assuming a strictly additive genetic effect of the minor allele. Note that in our analysis the J SNP markers are assumed to be ordered in terms of their positions on chromosome. In addition, a measurement of phenotype yi is available for each individual i, and we let y be an N × 1 vector of such measurements for the N individuals. In particular, the covariates in X lie in a high dimensional space with only a small sub- set of the J SNPs influencing the output y. In this setting, we assume a standard linear regression model as follows: y = X + , N (0, 2 ), where is a vector of J regression coefficients {1 , . . . , J }, and the noise is modeled as having a normal distribution with mean 0 and variance 2 . Taking a Bayesian approach, we set the prior for 2 to Inv-gamma(0 /2, (0 s2 )/2). 0 We use a prior on that enforces sparsity in the coefficients. We model each regression coefficient j as coming from a mixture of two components, one component representing irrelevant covariates (non-causal SNPs) and the other for relevant covariates (causal SNPs). We introduce a random variable cj that takes values from {0, 1} to indicate the mixture component label for j . If cj = 0, the j th SNP is non-causal, and j is set to 0. If cj = 1, then we model j as coming from a Laplacian distribution. The complete probability distribution for j given cj is given as I (j = 0) - i if cj = 0 j |cj (1) |j | 1 f cj = 1 , 2(22 ) exp 22 where is the parameter that controls the amount of sparsity. We use Inv-gamma(, ) as a prior distribution on , and sample from its posterior during learning (Park and Casella 2008). tightly linked SNPs. In a region with a high recombination rate, the previously linked SNPs are likely to be decoupled, resulting in a weak correlation, whereas a segment of tightly linked SNPs is preserved in the absence of recombination during inheritance. Becuase of the linkage disequilibrium structure, considering a block of highly correlated SNPs instead of a single SNP for an association with the phenotype can potentially increase the power of the association study for detecting causal SNPs. In this section, we propose to use a Markov chain prior on the indicator variable c to take into account the block structure due to the linkage disequilibrium in SNP markers by incorporating the recombination rate information in the prior. There is a large body of literature on estimating recombination rates given genotype data of unrelated individuals from a population (Fearnhead and Donnelly 2001, Li and Stephens 2003, Sohn and Xing, 2007). Any of these methods can be used to estimate recombination rates as part of a preprocessing step prior to the association analysis through the proposed method. Given the estimated recombination rates, we assume that the J covariates are ordered in their positions to have a chain structure and that there is an implicit block structure in the chain where the block boundaries are defined stochastically in terms of the distance and recombination rate between each pair of covariates. In this setting, a group of SNPs within a block can be assigned together to be either causal or noncausal. We model the sequence of indicator variables c = {c1 , . . . , cJ } as a Markov chain as follows: P (c) = P (c1 ) jJ P (cj |cj -1 ). The Laplacian prior in Equation (1) is the same as the L1 penalty used in the lasso (Tibshirani 1996), and has been shown to be useful in a more general problem of learning a sparse model in high-dimensional space (Wainwright et al. 2006). In a Bayesian setting, models similar to the one described above have been used in the literature of Bayesian variable selection with various different distributions in Equation (1) (George and McCulloch 1993, Ishwaran and Rao, 2005, Yuan and Lin, 2005). In almost all of these works, cj 's were modeled as coming from a Bernoulli distribution with parameter p, assuming that cj 's are independent of each other. This independence assumption is not appropriate in the case of genetic data since nearby SNP markers are known to be highly correlated due to the linkage disequilibrium. In the next section, we propose to use a Markov chain prior for cj 's that takes advantage of this dependency. Markov Chain Prior for Blo ck Structure Because of the linkage disequilibrium, individual chromosomes tend to have a block structure in the sequence of genetic markers, and such blocks of genetic markers are shared across individuals in a population. Recombination rate summarizes the degree of correlation between =2 For P (cj |cj -1 ), we use a Poisson process model, a model commonly used for recombination process (Li and Stephens 2003), P (cj |cj -1 ) = exp(-dj j ) (cj , cj -1 ) (2) +(1 - exp(-dj j )) cj-1 ,cj , , 1 - 0 0 where is a transition matrix 1 - 1 1 dj is the distance between two adjacent SNPs at positions (j - 1) and j on chromosome, and j is the recombination rate for the same interval. The first term on the right-hand side of Equation (2) corresponds to the probability of no recombination events between the (j - 1)th and the j th SNPs. On the other hand, the second term models a transition in the presence of a recombination event between the two SNPs. At recombination, the model can either transition to the same state, or to a different state. If the distance dj between the (j - 1)th and j th SNPs is small or the recombination rate j is low, the two SNPs are tightly linked, and it is likely that both SNPs will receive the same assignment for cj -1 and cj . Thus, the cj 's for causal SNPs are set to 1 in a coupled manner, activating the corresponding covariates to take non-zero regression coefficients. We place a prior Beta(a00 , b00 ) on 0 , and Beta(a10 , b10 ) on 1 . Our model differs from the fused lasso (Tibshirani et al. 2005) in that it encourages adjacent correlated covariates to take on the same assignment of whether they are relevant or not, while allowing each covariate to have its own regression coefficient. In the fused lasso, the adjacent regression coefficients themselves are encouraged to have the same values. In addition, our method directly makes use of the additional information in covariates such as the distance and recombination rate between two SNPs, whereas the fused lasso only uses the ordering information in covariates to fuse coefficients. 2. 2 PARAMETER ESTIMATION as p(y| -j , cj = 0, X, 2 ) N 1 - = ex p 2 2 (yi - xi )2 2 2 i . When cj = 1, we compute the integral as below. p(y| -j , cj = 1, X, 2 ) k N -i · 1 xik k )2 (yi - ex p = 2 2 2 2 - d - 1 |j | j ex p 2) 2(2 2 2 d -i | | (zi - xij j )2 + j j =K ex p 2 2 - 0 -i d (zi - xij j )2 - j =K ex p j 2 2 - , d -i (zi - xij j )2 + j j + ex p 2 2 0 k and K where zi = yi - /j xik k , 2 N 2 (4) = Because of the non-differentiability of the function used in Equation (1) for j when cj = 0, it is not possible to learn parameters of the model using an EM style algorithm commonly used for hidden Markov models. Instead, we use the Gibbs sampling to learn the parameters = {, c, 2 , } of the model. In this section, we derive conditional posterior distributions of the parameters for Gibbs sampling. For each of the J covariates, we sample j and cj from their joint posterior distribution p(j , cj |-j , c-j , y, X, 2 ) = p(j |-j , c, y, X, 2 ) (1/2 ) /(2(2 )). 2 P (cj | -j , c-j , y, X, 2 ). (3) We first sample cj from the marginal distribution, the second term on the right-hand side of Equation (3), after integrating out j . Conditional on the sampled cj , we sample j from its conditional posterior, the first term on the right-hand side of Equation (3). In order to sample cj , we re-write the second term on the right-hand side of Equation (3) as P (cj = k | -j , c-j , y, X, 2 ) Let A(-) denote the first integral and A(+) the second integral in Equation (4). Then, using a straightforward algebra, it can be shown that A(-) and A(+) are given as ·i / - i zi xij + 0.5/)2 ( 2 i2 zi - (2 2 ) A(-) = exp xij 2 0 2 i2 N(-) dj xij - - ·i / i zi xij - 0.5/)2 ( 2 i2 zi - A(+) = exp (2 2 ) xij 2 2 i2 N(+) dj , xij 0 where N N (-) = N =N j| i i ( +) j| p(y| -j , cj = k , X, 2 )P (cj = k |cj -1 )P (cj +1 |cj = k ) Sampling from the above equation requires to compute the marginal likelihood p(y| -j , cj , X, 2 ) after integrating out j when cj = 0 and cj = 1. When cj = 0, the j th covariate is irrelevant, and we set j = 0. Thus, the marginal likelihood is simply given Once we sample cj as described above, we sample j from p(j |-j , c, y, X, 2 ) in Equation (3). When cj = 0, we set j to 0. If cj = 1, we re-write the conditional probability distribution as p(j | -j , c, y, X, 2 ) = p(y| , c, X, 2 )p(j ) p . (5) (y| , c, X, 2 )p(j )dj zi xij - i2 xij zi xij + i2 xij 1 2 , 2 ( 2 1 2 i i x2j )-1 i x2j )-1 i . , ( We find that the denominator of Equation (5) is the same as what we computed in Equation (4). In fact, sampling from Equation (5) is equivalent to sampling from a mixture distribution of two components given as A (-) mj Bernoulli A(-) + A(+) N if mj = 0 (-), j <0 (6) j N ( +) , j > 0 if mj = 1. Using Equation (6), we augment j with mj , and sample (j , mj ) by first drawing the mixture component label mj from the Bernoulli distribution and then drawing j conditional on the mj . The conditional posterior for 2 is given as an inverse gamma distribution ( 2 | , c, y, X Inv-gamma N + 2J + 0 )/2, /. i 1j |j | + 0 s2 2 (yi - Xi )2 + 0 Next, we sample the parameters 0 and 1 of the transition matrix . The conditional posterior for 0 is p(0 |c) P (c|0 )p(0 ) k (e-dk k + (1 - e-dk k )0 ) = S00 3 EXPERIMENTS We demonstrate our proposed model on simulated data and mouse data, and compare the performance with those from the model with independent Bernoulli prior for cj 's, ridge regression, and the lasso. We use ridge regression instead of ordinary least squares regression to prevent the singularity in matrix inversion, since often J > N in our experiments. The regularization parameter in the ridge regression is set to a small value 0.1. The regularization parameter of the lasso is selected using a cross-validation. In all of our experiments, for the block-regularized regression and the model with Bernoulli prior, we run the sampling algorithm for 5000 iterations after 2000 burn-in iterations. Samples are taken every 10 iterations. In the block-regularized regression, the priors for 0 and 1 are set to Beta(10,2), weakly encouraging the cj 's to stay in the same state as the cj -1 's. Similarly, in the model with Bernoulli prior, the prior for the parameter p of the Benoulli distribution is set to Beta(10,2). 3. 1 SIMULATION · k where Sml = {k |ck-1 = m, ck = l} and nml is the number of transitions from cj -1 = m to cj = l in c. We approximate n00 as n00 = j (1 - e-dj j )0 I (cj -1 = 0, cj = 0), e-dj j + (1 - e-dj j )0 S01 a · 0 00 -1 (1 ((1 - e-dk k )(1 - 0 )) - 0 )b00 -1 n00 +a00 -1 (1 0 - 0 )n01 +b00 -1 ( 7) where 0 is the value from the previous sampling iteration, assuming that the number of events (cj -1 = 0, cj = 0) due to e-dj j and (1 - e-dj j )0 are proportional to their probabilities. In Equation (7), the conditional posterior for 0 is Beta(n00 + a00 , n01 + b00 ) for 0 . Similarly, we obtain the conditional posterior for 1 as Beta(n11 + a10 , n10 + b10 ). Finally, we sample of the Laplacian prior from its conditional posterior j , Using the data simulated from the true parameters J SJ |j | 2 in Figure 2(a), we fit our proposed model, the model + p(| , , , ) = Inv-gamma + , 2 2 with independent Bernoulli prior, ridge regression and where J is the number of covariates with cj = 1 in the the lasso, and plot the estimated in Figures 2(b)current sampling iteration, and SJ is the set of such (e). For the block-regularized regression and the model covariates. with independent Bernoulli prior, we show the sam- We generate 360 haplotypes of a 40kb region with mutation rate 0.8/kb and recombination rate 0.1/kb using the software ms (Hudson 2002), and retain only those SNPs whose minor allele frequency is greater than 0.01. In the haplotypes generated under this setting, there are 108 SNPs in the region and 23 blocks in which no recombination events occurred. Then, we randomly pair two haplotypes to obtain genotypes of 180 individuals. We run the widely used software Phase 2.1.1 (Li and Stephens 2003) on these data to estimate the recombination rate between each pair of adjacent SNPs. We select 10 SNPs as relevant variables that are grouped into three blocks of 3, 2, and 5 SNPs respectively such that the SNPs within a group are located within a block of no recombination given by ms. Based on these selected causal SNPs, we generate a phenotype for each individual with j = 2.5 for causal SNPs and j = 0 for non-causal SNPs. The random noise generated from N (0, 1) is added to the simulated phenotype. The true parameters used in this simulation are shown in Figure 2(a). The dots show values for j 's, and the line indicates the locations of the causal SNPs, where 1's represent causal SNPs, and 0's non-causal SNPs. P(c) 2 1 0.5 0 0 0.5 1 1.5 2 2.5 3 3.5 x 10 4 (a) 0 -2 0 (a) 0.5 1 1.5 2 2.5 3 3.5 x 10 4 Position Position 2 P(c) 1 0.5 0 0 0.5 1 1.5 2 2.5 3 3.5 x 10 4 (b) 0 -2 0 (b) 0.5 1 1.5 2 2.5 3 3.5 x 10 4 Position Position 2 (c) 0 -2 0 0.5 1 1.5 2 2.5 3 3.5 x 10 4 Position 2 Figure 3: Estimated P (cj )'s for (a) block-regularized regression and (b) the model with independent Bernoulli prior, corresponding to the results in Figure 2(b) and (c) respectively. The ×'s indicate the locations of true relevant variables. Table 1: Summary Statistics of Simulated Data Numb er of SNPs Min Max Mean Average numb er of SNPs p er block (d) 0 -2 0 0.5 1 1.5 2 2.5 3 3.5 x 10 4 Position 2 (e) 0 -2 0 0.05/kb 0.1/kb 0.5/kb 1.0/kb 86 103 99 110 260 226 214 197 150.8 147.2 151.0 152.2 5.59 5.53 1.18 0.57 0.5 1 1.5 2 2.5 3 3.5 x 10 4 Position Figure 2: Simulation results with =0.1/kb and j = 2.5 for relevant variables. True parameters are shown in (a), and the estimated parameters are shown for (b) the block-regularized regression, (c) the model with independent Bernoulli prior, (d) ridge regression, and (e) the lasso. Dots indicate j 's at each position, and the lines in (a), (b), and (c) represent cj 's. ple corresponding to the lowest train error, and plot the estimated c for the same sample as a line. The block-regularized regression in Figure 2(b) discovers the block structure in covariates with four groups of relevant variables, three of which roughly correspond to the three groups in the true parameters. Throughout our simulation experiments, we found that the groups of causal SNPs indicated in cj 's estimated by the block-regularized regression tend to extend to a slightly larger interval than in the true parameters, and that a subset of such SNPs with cj = 1 has a large value for j . Thus, we can view cj 's as suggesting regions of relevant variables, and j 's as deciding how relevant the variables with cj = 1 are. As we see in Figures 2(c)-(e), the block structure is not obvious in the results from the other three methods. Figures 3(a) and (b) show the estimated P (cj )'s for the block-regularized regression and the model with independent Bernoulli prior respectively, using the same data as in Figure 2. Each P (cj ) is estimated as the proportion of the number of times that the cj is set to 1 in samples for the cj . The locations of the true causal SNPs are marked with ×'s. The block-regularized regression encourages a block structure among relevant and irrelevant covariates, leading to a smoother variation in P (cj )'s between adjacent covariates compared to the model with independent Bernoulli prior. In order to quantify the performance of the blockregularized regression and various other regression methods in the presence of different level of correlation among SNPs, we repeat the above procedure of simulating data for four different recombination rates =0.05/kb, 0.1/kb, 0.5/kb, and 1/kb, using j = 2.0 for causal SNPs and j = 0 for non-causal SNPs. A lower recombination rate results in a more tightly linked SNPs and a stronger block structure in covariates. For each recombination rate, we generate 50 datasets of 180 individuals, and report the results averaged over these 50 datasets for the given recombination rate. Because of the random nature in the data generation process of the simulation software ms, the number of SNPs vary in each dataset even if the same parameters are used in simulation. The minimum, maximum, and average number of covariates in the 50 datasets for each recombination rate are shown in Table 1 as well as the average number of SNPs within a block given by ms during simulation. As the recombination rate gets higher, the correlations between adjacent SNPs become weaker, leading to only less than 1 SNP per block in the case of = 1.0/kb. Given these datasets generated as described above, we estimate with the block-regularized regression, the 1 Precision Precision 0.5 Precision Block Bernoulli Ridge Lasso 1 1 0.5 0.5 =1.0 =2.0 =3.0 =4.0 =5.0 0 0 0.5 1 0 0 0.5 1 0 0 0.5 1 Recall Recall (a) 1 1 (b) Recall Precision 0.5 Precision 0.5 Figure 5: Precision-recall graphs for block-regularized regression, using simulated data with varying j 's for relevant variables. P value 10 5 0 0 0 0 0.5 1 0 0 0.5 1 (a) Recall Recall (c) (d) 5 10 15 x 10 7 Position 20 Figure 4: Precision-recall graphs for simulated data. (a) = 0.05, (b) =0.1, (c) =0.5, and (d) =1.0/kb. model with independent Bernoulli prior, ridge regression, and the lasso, and plot precision-recall graphs in Figure 4(a)-(d) for recombination rates =0.05/kb, 0.1/kb, 0.5/kb, and 1.0/kb respectively. To obtain each precision-recall curve in Figure 4, we estimate j 's given a dataset and a regression method, and rank the SNPs according to the absolute values of j 's. The SNP with the largest value of |j | is considered as the most relevant. We compare the rankings of SNPs given by each regression method to the list of true causal SNPs, and compute the precisions and recalls shown in Figure 4. As can be seen in Figures 4(a) and (b), the block-regularized regression clearly outperforms other methods, since the correlations among SNPs are relatively high, and the block-regularized regression takes advantage of this correlation structure. As the recombination rate increases in Figures 4(c) and (d), the advantage of having a block model decreases. When =1/kb, the average number of SNPs per block is less than 1 as shown in Table 1, and the block-regularized regression, the model with independent Bernoulli prior and the lasso perform similarly. We can use P (cj )'s instead of |j |'s to rank the SNPs. When we plotted the precision-recall graphs according to the P (cj )'s, we obtained similar results to the ones in Figure 4. In Figure 5, we fit the block-regularized regression model to the dataset simulated with different values of j for relevant variables and the noise distributed as N (0, 1), and plot the precision-recall graphs. The recombination rate is set to 0.5/kb, and each precisionrecall curve is the result averaged over 50 datasets. We see that as the signal to noise ratio increases, the performance increases. (b) 0 -20 0 20 0 -20 0 20 0 -20 0 5 10 15 x 10 7 Position (c) 5 10 15 x 10 7 Position (d) 5 10 15 x 10 7 Position Figure 6: Results for the mouse haplotype data (chromosome 4) and the measurements of drinking preference. (a) -log(p value), (b) block-regularized regression, (c) the model with independent Bernoulli prior, (d) the lasso. 3. 2 MOUSE DATA We apply the block-regularized regression to the inbred laboratory mouse haplotype map publicly available from BROAD institue website1 . We use the measurements for the sodium intake of 25% NaCl concentration for female mice (Tordoff et al. 2007) as phenotype. The data for 33 strains are available for both the genotype and phenotype dataset. Thus, the number of individuals in this experiment is 33. We consider the 8217 SNPs in chromosome 4 of length 154Mb as covariates. We are interested in scanning the chromosome and discovering the SNPs with high genetic effects on the phenotype. 1 http://www.broad.mit.edu/mouse/hapmap The most commonly used method for discovering SNPs highly associated with the phenotype is to perform a statistical test for the phenotype and one SNP at a time and report the SNPs with high p-values as significant. The result for the mouse data using one such test, the Wald test, is shown in Figure 6(a). The y -axis shows -log(p-value) for each SNP. In Figure 6(b)-(d), we show the estimated using the blockregularized regression, the model with independent Bernoulli prior, and the lasso respectively. For these three regression models, we divide the whole sequence into segments of 200 SNPs, and fit the model to one segment at a time. We see that the SNPs with high values of -log(p-value) in Figure 6(a) roughly correspond to the SNPs with high j values in Figure 6(b). The International HapMap Consortium (2005). A haplotyp e map of the human genome. Nature 437:1399-1320. H. Ishwaran and J.S. Rao (2005). Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics 33(2):730-773. R.R. Hudson (2002). Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics 18:337-338. N. Li and M. Stephens (2003). Modelling linkage disequilibrium, and identifying recombination hotsp ots using snp data. Genetics 165:2213-2233. D.J. Nott and P.J. Green (2004). Bayesian variable selection and the Swendsen-Wang algorithm. Journal of Computational and Graphical Statistics 13:141-157. T. Park and G. Casella (2008). The Bayesian Lasso. (unpublished manuscript). B. Servin, and M. Stephens (2007). Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics 3(7):1296-1308. K. Sohn and E.P. Xing (2007). Sp ectrum: joint Bayesian inference of p opulation structure and recombination event. The Fifteenth International Conference on Intel ligence Systems for Molecular Biology. B. Stranger, M. Forrest, A. Clark, M. Minichiello, S. Deutsch, R. Lyle, S. Hunt, B. Kahl, S. Antonarakis, S. Tavare, P. Deloukas, and E. Dermitzakis (2005). Genome-wide associations of gene expression variation in humans. PLoS Genetics 1(6):695-704. R. Tibshirani (1996). Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society, Series B 58(1):267-288. R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight (2005). Sparsity and smoothness via the fused lasso. Journal of Royal Statistical Society, Series B 67(1):91-108. M.G. Tordoff, A.A. Bachmanov, and D.R. Reed (2007). Forty mouse strain survey of water and sodium intake. Physiology & Behavior 91(5):620-31. M. Wainwright, P. Ravikumar, and J. Lafferty (2006). High-dimensional graphical model selection using L1 regularized logistic regression. Advances in Neural Information Processing Systems, 19. M. Yuan and Y. Lin (2005). Efficient empirical Bayes variable selection and estimation in linear models. Journal of the American Statistical Association 100(472):1215-1225. M. Yuan and Y. Lin (2006). Model selection and estimation in regression with group ed variables. Journal of Royal Statistical Society, Series B 68(1):49-67. N. Zaitlen, H. Kang, E. Eskin, and E. Halp erin (2007). Leveraging the hapmap correlation structure in association studies. The American Journal of Human Genetics 80(4):683-91. K. Zhang, P. Calabrese, M. Nordb org, and F. Sun (2002). Haplotyp e block structure and its applications to association studies: p ower and study design. The American Journal of Human Genetics 71:1386-1394. 4 CONCLUSIONS In this paper, we considered the problem of finding a subset of covariates in a high-dimensional space that affect the output variable when there is a block structure in the covariates. In the context of association mapping, we proposed a regression-based model with a Markov chain prior that encodes the information in the correlation structure such as distance and recombination rate between adjacent SNP markers. We demonstrated on the simulated and mouse data that our proposed algorithm can be used to identify groups of SNP markers as a relevant block of causal SNPs. The idea of representing the correlation structure as a Markov chain in a variable selection method to learn grouped relevant variables can be generalized to use a graphical model as a prior in a variable selection problem to represent an arbitrary correlation structure in variables in a high-dimensional space. Another interesting extension of the model is to model a structure in output variables as well when measurements of multiple output variables are available. Acknowledgements This research was supported by NSF Grants CCF0523757 and DBI-0546594. References V. Cheung, R. Spielman, K. Ewens, T. Web er, M. Morley, and J. Burdick (2005). Mapping determinants of human gene expression by regional and genome-wide association. Nature 437:1365-1369. P. Fearnhead and P. Donnelly (2001). Estimating recombination rates from p opulation genetic data. Genetics 159:1299-1318. E.I. George and R.E. McCulloch (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88:881-889.