Blockwise Coordinate Descent Procedures for the Multi-task Lasso, with Applications to Neural Semantic Basis Discovery

Han Liu (hanliu@cs.cmu.edu), Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Mark Palatucci (mpalatuc@cs.cmu.edu), School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 USA
Jian Zhang (jianzhan@stat.purdue.edu), Department of Statistics, Purdue University, 250 N. University St., West Lafayette, IN 47907 USA

Abstract

We develop a cyclical blockwise coordinate descent algorithm for the multi-task Lasso that efficiently solves problems with thousands of features and tasks. The main result shows that a closed-form Winsorization operator can be obtained for sup-norm penalized least squares regression. This allows the algorithm to find solutions to very large-scale problems far more efficiently than existing methods, and it complements the pioneering work of Friedman et al. (2007) for the single-task Lasso. As a case study, we use the multi-task Lasso as a variable selector to discover a semantic basis for predicting human neural activation. The learned solution outperforms the standard basis for this task on the majority of test participants, while requiring far fewer assumptions about cognitive neuroscience. We demonstrate how this learned basis can yield insights into how the brain represents the meanings of words.

1. Introduction

The cyclical coordinate descent algorithm was proposed for $\ell_1$-regularized least squares regression (the Lasso) almost ten years ago (Fu, 1998), but only very recently has its power been fully appreciated (Friedman et al., 2007; Wu & Lange, 2008). In particular, Friedman et al. (2007) show that the coordinate descent method, if implemented appropriately, can evaluate the entire regularization path remarkably faster than almost all existing state-of-the-art methods. The main reasons for this surprising performance can be summarized as follows: (i) during each iteration, the coordinate-wise update can be written as a closed-form soft-thresholding operator, so no inner loop is needed; (ii) if the underlying coefficient vector is very sparse, the soft-thresholding operator can detect the zeros by a simple check, so only a small number of updates are needed; and (iii) many computational tricks, such as the covariance update and warm starts, can easily be incorporated into the coordinate descent procedure (Friedman et al., 2008).
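To make point (i) concrete, the following minimal numpy sketch shows the coordinate-wise soft-thresholding update for the single-task Lasso. It is our illustration, not the implementation of Friedman et al. (2007), and it assumes the design columns are standardized to unit norm so that each update reduces to a single closed-form step.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclical coordinate descent for the single-task Lasso.

    Assumes the columns of X are centered and scaled to unit l2-norm,
    so each coordinate update is a single soft-thresholding step.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                        # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            r = r + X[:, j] * beta[j]       # partial residual with feature j removed
            z = X[:, j] @ r                 # univariate least squares coefficient
            beta[j] = soft_threshold(z, lam)
            r = r - X[:, j] * beta[j]       # restore residual with the new beta_j
    return beta
```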
In this paper, we consider the computational aspects of the multi-task Lasso (Zhang, 2006), which generalizes the Lasso to the multi-task setting by replacing the $\ell_1$-norm regularization with a sum of sup-norm regularizers. We design a scalable cyclical blockwise coordinate descent algorithm that can evaluate the entire regularization path efficiently. In particular, we show that the sub-problem within each iteration can be solved very efficiently by a Winsorization operator,¹ i.e. a proportion of the extreme values of a given vector are truncated while the others remain the same. This extends the result of Friedman et al. (2007) to the multi-task setting. A similar result also appeared in (Fornasier & Rauhut, 2008) under the more general linear inverse problem framework.

The main contribution of this work is that we formulate a non-trivial learning task from the cognitive neuroscience community as a multi-task Lasso problem and solve it using the resulting blockwise coordinate descent algorithm. Compared with the current state-of-the-art results (Mitchell et al., 2008), our solution outperforms the standard hand-crafted features in the majority of test participants while using far fewer assumptions. We also discuss how our methods can be used to refine current theory in cognitive neuroscience.

¹ After Charles P. Winsor, whom John Tukey credited with converting him from topology to statistics (Mallows, 1990).

2. The Multi-task Lasso

In this section, we introduce the multi-task Lasso and some related work. We start with some notation. Consider a K-task linear regression: for k = 1, ..., K,

$$Y^{(k)} = \sum_{j=1}^{p} \beta_j^{(k)} X_j^{(k)} + \epsilon^{(k)},$$

where $Y^{(k)}, X_j^{(k)}, \epsilon^{(k)} \in \mathbb{R}^{n_k}$. The superscript k indexes the tasks, p is the number of features, and $n_k$ is the number of instances within the k-th task. We assume the data are standardized so that the constant terms can be dropped, i.e. $X_j^{(k)}$ and $Y^{(k)}$ have mean 0 and $X_j^{(k)}$ has unit $\ell_2$-norm. Let

$$\beta_j \equiv (\beta_j^{(1)}, \ldots, \beta_j^{(K)})^T \qquad (1)$$

be the vector of coefficients for the j-th feature across the different tasks. The multi-task Lasso estimate is the solution of the optimization problem

$$\min_{\beta}\;\; \frac{1}{2}\sum_{k=1}^{K}\Big\| Y^{(k)} - \sum_{j=1}^{p}\beta_j^{(k)} X_j^{(k)}\Big\|_2^2 \;+\; \lambda \sum_{j=1}^{p} \|\beta_j\|_\infty, \qquad (2)$$

where $\|\beta_j\|_\infty \equiv \max_k |\beta_j^{(k)}|$ is the sup-norm in Euclidean space. It has the effect of "grouping" the elements of $\beta_j$ so that they can achieve zeros simultaneously.

If all tasks share a common design matrix, the multi-task regression reduces to a multivariate-response regression. In this case, Turlach et al. (2005) propose the same sum of sup-norm regularizers and name the resulting estimate in (2) the simultaneous Lasso. Clearly, any solver for the multi-task Lasso also solves the simultaneous Lasso. Existing methods for solving (2) from the machine learning and statistics communities include the double coordinate descent method of (Zhang, 2006), the interior-point method of (Turlach et al., 2005), and the geometric solution path method of (Zhao et al., 2009). These methods, however, have difficulty scaling to thousands of features and tasks.

One alternative worth noting is the multi-task feature selection work of Argyriou, Evgeniou, and Pontil (2008). Although both methods can be used to learn features over many tasks, their work uses a different penalty term in the optimization problem. Our work, by contrast, focuses on the multi-task Lasso, which uses the sum of sup-norm penalties.

Remark 1. Equation (2) treats all tasks equally, which can be sensitive to abnormal or outlier tasks. To address this, we can build an adaptive version of the algorithm: after obtaining an initial estimate by treating all tasks equally, we calculate the residual sum of squares of the fit for each task; a second step then weights the tasks differently according to their initial fit. This provides extra performance gain and robustness.

3. Blockwise Coordinate Descent

For a fixed regularization parameter $\lambda$, the blockwise coordinate descent algorithm for the multi-task Lasso problem in Equation (2) is given in Figure 1, where $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors. Recall that $\beta_j$ in (1) is the coefficient vector of the j-th feature across all K tasks; we call $\beta_j$ a block. The algorithm simultaneously updates the coefficients within one block while holding all the others fixed, and then cycles through this process. If the current estimates are $\hat\beta_\ell$, $\ell = 1, \ldots, p$, then $\hat\beta_j$ is updated by solving the sub-problem

$$\hat\beta_j = \arg\min_{\beta_j}\;\; \frac{1}{2}\sum_{k=1}^{K}\big\| R_j^{(k)} - \beta_j^{(k)} X_j^{(k)}\big\|_2^2 + \lambda \|\beta_j\|_\infty, \qquad (3)$$

where $R_j^{(k)} \equiv Y^{(k)} - \sum_{\ell \neq j}\hat\beta_\ell^{(k)} X_\ell^{(k)}$ denotes the partial residual vector.
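As a concrete reading of Equations (2) and (3), the following sketch (ours, not code from the paper) evaluates the multi-task Lasso objective and forms the partial residuals $R_j^{(k)}$. The data structures are assumptions for illustration: X_list[k] is the design of task k and B is a p x K coefficient matrix whose j-th row is the block $\beta_j$.

```python
import numpy as np

def mtl_objective(X_list, Y_list, B, lam):
    """Value of the multi-task Lasso objective in Equation (2).

    X_list[k]: n_k x p design of task k; Y_list[k]: its length-n_k response;
    B: p x K coefficient matrix whose j-th row is the block beta_j.
    """
    K = len(X_list)
    loss = sum(0.5 * np.sum((Y_list[k] - X_list[k] @ B[:, k]) ** 2) for k in range(K))
    penalty = lam * np.sum(np.max(np.abs(B), axis=1))  # sum over blocks of sup-norms
    return loss + penalty

def partial_residuals(X_list, Y_list, B, j):
    """Partial residuals R_j^(k) = Y^(k) - sum_{l != j} B[l, k] X_l^(k), for all tasks k."""
    return [Y_list[k] - X_list[k] @ B[:, k] + X_list[k][:, j] * B[j, k]
            for k in range(len(X_list))]
```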
If the regularization parameter $\lambda = 0$, the problem in (3) decouples and the least squares solution is

$$\tilde\beta_j^{(k)} = \langle R_j^{(k)}, X_j^{(k)}\rangle, \quad \text{for all } j, k. \qquad (4)$$

Since $\langle R_j^{(k)}, X_j^{(k)}\rangle = \langle Y^{(k)}, X_j^{(k)}\rangle - \sum_{i\neq j}\hat\beta_i^{(k)}\langle X_i^{(k)}, X_j^{(k)}\rangle$, we can pre-calculate the quantities $c_j^{(k)} = \langle Y^{(k)}, X_j^{(k)}\rangle$ and $d_{ij}^{(k)} = \langle X_i^{(k)}, X_j^{(k)}\rangle$. This is the same covariance update idea as in (Friedman et al., 2008) and corresponds to the first double loop of the algorithm in Figure 1. If we have a decreasing sequence of regularization parameters $\lambda$, the initial values of $\hat\beta_j^{(k)}$ for each fixed $\lambda$ come from the solution calculated at the previous $\lambda$. This is the same warm start trick as in (Friedman et al., 2008) and can significantly speed up the algorithm when evaluating the entire solution path. Since the quantities $c_j^{(k)}$ and $d_{ij}^{(k)}$ do not depend on $\lambda$, they only need to be calculated once and can serve the whole pathwise evaluation.

For $\lambda > 0$, (3) becomes a sup-norm penalized least squares regression. If we solve it with Newton's method or a coordinate descent procedure as in (Zhang, 2006), an inner loop is needed, which does not scale when the number of tasks K is very large. Fortunately, Theorem 2 below shows that the solution to (3) is equivalent to a closed-form Winsorization operation applied to the previously calculated unpenalized least squares estimates $\tilde\beta_j^{(k)}$. This corresponds to the main loop of the algorithm in Figure 1.

Blockwise coordinate descent algorithm

Input: data $(X_1^{(k)}, \ldots, X_p^{(k)}, Y^{(k)})$, $k = 1, \ldots, K$, and the regularization parameter $\lambda$;

Iterate over $k \in \{1, \ldots, K\}$ and $j \in \{1, \ldots, p\}$:
  (1) $c_j^{(k)} \leftarrow \langle Y^{(k)}, X_j^{(k)}\rangle$;
  (2) $\hat\beta_j^{(k)} \leftarrow$ initial values (either 0, or the $\hat\beta_j^{(k)}$ from the previous $\lambda$ if doing a warm start);
  (3) For each $i \in \{1, \ldots, p\}$: $d_{ij}^{(k)} \leftarrow \langle X_i^{(k)}, X_j^{(k)}\rangle$;
End;

Iterate until convergence (Main Loop): For each $j \in \{1, \ldots, p\}$:
  (1) $\forall k \in \{1, \ldots, K\}$: $\tilde\beta_j^{(k)} \leftarrow c_j^{(k)} - \sum_{i\neq j}\hat\beta_i^{(k)} d_{ij}^{(k)}$;
  (2) If $\sum_{k=1}^K |\tilde\beta_j^{(k)}| \le \lambda$ Then $\hat\beta_j \leftarrow 0$
      Else
      (a) Sort the indices so that $|\tilde\beta_j^{(k_1)}| \ge |\tilde\beta_j^{(k_2)}| \ge \ldots \ge |\tilde\beta_j^{(k_K)}|$;
      (b) $m^* \leftarrow \arg\max_{1\le m\le K} \big(\sum_{i=1}^m |\tilde\beta_j^{(k_i)}| - \lambda\big)/m$;
      (c) For each $i \in \{1, \ldots, K\}$:
          If $i > m^*$ Then $\hat\beta_j^{(k_i)} \leftarrow \tilde\beta_j^{(k_i)}$
          Else $\hat\beta_j^{(k_i)} \leftarrow \dfrac{\mathrm{sign}(\tilde\beta_j^{(k_i)})}{m^*}\Big[\sum_{\ell=1}^{m^*}|\tilde\beta_j^{(k_\ell)}| - \lambda\Big]$;
End;

Output: $\hat\beta_j^{(k)}$ for $j = 1, \ldots, p$ and $k = 1, \ldots, K$.

Figure 1. The algorithm for the multi-task Lasso.

Theorem 2. Let $\tilde\beta_j^{(k)}$ be as defined in (4) and order the indices so that $|\tilde\beta_j^{(k_1)}| \ge |\tilde\beta_j^{(k_2)}| \ge \ldots \ge |\tilde\beta_j^{(k_K)}|$. Then the solution to (3) is

$$\hat\beta_j^{(k_i)} = \mathrm{sign}(\tilde\beta_j^{(k_i)})\,\frac{1}{m^*}\Big[\sum_{\ell=1}^{m^*}|\tilde\beta_j^{(k_\ell)}| - \lambda\Big]_+ \cdot 1\{i \le m^*\} \;+\; \tilde\beta_j^{(k_i)} \cdot 1\{i > m^*\},$$

where $m^* = \arg\max_m \frac{1}{m}\big(\sum_{\ell=1}^{m}|\tilde\beta_j^{(k_\ell)}| - \lambda\big)$, $1\{\cdot\}$ is the indicator function, and $[\cdot]_+$ denotes the positive part.

Proof: The proof proceeds by discussing three cases separately: (i) all the elements of $\hat\beta_j$ are zero; (ii) one unique element achieves the sup-norm; (iii) at least two elements achieve the sup-norm. These cases correspond to Propositions 5, 6, and 8 respectively.

Since the objective function in (3) is convex, its solution can be characterized by the Karush-Kuhn-Tucker conditions

$$\big(R_j^{(k)} - \hat\beta_j^{(k)} X_j^{(k)}\big)^T X_j^{(k)} = \lambda \alpha_k, \quad \forall k \in \{1, \ldots, K\}, \qquad (5)$$

where $(\alpha_1, \ldots, \alpha_K)^T \in \partial\|\hat\beta_j\|_\infty$. Here $\partial\|\cdot\|_\infty$ denotes the subdifferential of the convex functional $\|\cdot\|_\infty$ evaluated at $\hat\beta_j$; it lies in a K-dimensional Euclidean space. The following result from (Rockafellar & Wets, 1998) characterizes the subdifferential of the sup-norm.

Lemma 3. The subdifferential of $\|\cdot\|_\infty$ in $\mathbb{R}^K$ is

$$\partial\|x\|_\infty = \begin{cases} \{\alpha : \|\alpha\|_1 \le 1\} & x = 0,\\ \mathrm{conv}\{\mathrm{sign}(x_i)\,e_i : |x_i| = \|x\|_\infty\} & \text{otherwise}, \end{cases} \qquad (6)$$

where $\mathrm{conv}(A)$ denotes the convex hull of A and $e_i$ is the i-th canonical unit vector in $\mathbb{R}^K$.

Proposition 4. Let $\hat\beta_j$ be the solution to (3) and $\tilde\beta_j$ be as in (4). If $\hat\beta_j^{(k)} \neq 0$, then $\mathrm{sign}(\hat\beta_j^{(k)}) = \mathrm{sign}(\tilde\beta_j^{(k)})$.

Proof of Proposition 4: Since $\hat\beta_j^{(k)} \neq 0$, the result follows trivially from the convexity and continuity of the objective function in (3).

First, we consider the case $\sum_{k=1}^K |\tilde\beta_j^{(k)}| \le \lambda$ and show that 0 must be a solution.

Proposition 5. $\hat\beta_j = 0$ if and only if $\sum_{k=1}^{K}|\tilde\beta_j^{(k)}| \le \lambda$.

Proof of Proposition 5: From (5), we know that $\hat\beta_j = 0$ if and only if there exist $\alpha_1, \ldots, \alpha_K$ such that $\sum_{k=1}^K |\alpha_k| \le 1$ and

$$\lambda\alpha_k = \big(R_j^{(k)}\big)^T X_j^{(k)} = \tilde\beta_j^{(k)}. \qquad (7)$$

If $\sum_{k=1}^K |\tilde\beta_j^{(k)}| \le \lambda$, choosing $\alpha_k$ as in (7) guarantees that $\sum_{k=1}^K|\alpha_k| \le 1$, and therefore $\hat\beta_j = 0$. Conversely, if $\hat\beta_j = 0$, then from (7) we have $\lambda\alpha_k = \tilde\beta_j^{(k)}$ for $k = 1, \ldots, K$ with $\sum_{k=1}^K|\alpha_k| \le 1$, which implies $\sum_{k=1}^K|\tilde\beta_j^{(k)}| \le \lambda$.

Next, we consider the case $\sum_{k=1}^K|\tilde\beta_j^{(k)}| > \lambda$ and $|\tilde\beta_j^{(k_1)}| - \lambda > |\tilde\beta_j^{(k_2)}|$. Here we show that $\hat\beta_j^{(k)} = \tilde\beta_j^{(k)}$ for $k \neq k_1$, while $\hat\beta_j^{(k_1)} = \mathrm{sign}(\tilde\beta_j^{(k_1)})\big(|\tilde\beta_j^{(k_1)}| - \lambda\big)$.

Proposition 6. $|\hat\beta_j^{(k_1)}| > |\hat\beta_j^{(k)}|$ for all $k \neq k_1$ if and only if $|\tilde\beta_j^{(k_1)}| - \lambda > |\tilde\beta_j^{(k_2)}|$.

Proof of Proposition 6: If $|\hat\beta_j^{(k_1)}| > |\hat\beta_j^{(k)}|$ for $k \neq k_1$, then by (6), $\partial\|\hat\beta_j\|_\infty = \{\mathrm{sign}(\hat\beta_j^{(k_1)})\,e_{k_1}\}$, where $e_{k_1}$ is the $k_1$-th canonical vector. Therefore, from (5),

$$\big(R_j^{(k)} - \hat\beta_j^{(k)} X_j^{(k)}\big)^T X_j^{(k)} = \lambda\,\mathrm{sign}(\hat\beta_j^{(k_1)})\,1\{k = k_1\}.$$

Hence $\hat\beta_j^{(k)} = \tilde\beta_j^{(k)}$ for $k \neq k_1$ and $\hat\beta_j^{(k_1)} = \tilde\beta_j^{(k_1)} - \lambda\,\mathrm{sign}(\hat\beta_j^{(k_1)})$. From Proposition 4, $\mathrm{sign}(\hat\beta_j^{(k_1)}) = \mathrm{sign}(\tilde\beta_j^{(k_1)})$. Further, if $\tilde\beta_j^{(k_1)} > 0$ then $|\tilde\beta_j^{(k_1)}| > \lambda$, so $|\tilde\beta_j^{(k_1)} - \lambda\,\mathrm{sign}(\tilde\beta_j^{(k_1)})| = |\tilde\beta_j^{(k_1)}| - \lambda$; combined with $|\hat\beta_j^{(k_1)}| > |\hat\beta_j^{(k)}| = |\tilde\beta_j^{(k)}|$ for $k \neq k_1$, this gives $|\tilde\beta_j^{(k_1)}| - \lambda > |\tilde\beta_j^{(k_2)}|$. The same argument applies when $\tilde\beta_j^{(k_1)} < 0$.

On the other hand, assume $|\tilde\beta_j^{(k_1)}| - \lambda > |\tilde\beta_j^{(k_2)}|$ but

$$|\hat\beta_j^{(k_1)}| = \ldots = |\hat\beta_j^{(k_n)}| = \|\hat\beta_j\|_\infty \qquad (8)$$

for some $n > 1$. Then, by (6) and (5), there must exist $a_1, a_2 \in [0,1]$ with $a_1 + a_2 \le 1$ such that, for $h = 1, 2$,

$$\big(R_j^{(k_h)} - \hat\beta_j^{(k_h)} X_j^{(k_h)}\big)^T X_j^{(k_h)} = \lambda a_h\,\mathrm{sign}(\hat\beta_j^{(k_h)}).$$

Combining this with (8), we get $|\tilde\beta_j^{(k_1)} - \lambda a_1\,\mathrm{sign}(\hat\beta_j^{(k_1)})| = |\tilde\beta_j^{(k_2)} - \lambda a_2\,\mathrm{sign}(\hat\beta_j^{(k_2)})|$. By Proposition 4 and $|\tilde\beta_j^{(k_1)}| > \lambda a_1$, we have $|\tilde\beta_j^{(k_1)} - \lambda a_1\,\mathrm{sign}(\hat\beta_j^{(k_1)})| = |\tilde\beta_j^{(k_1)}| - \lambda a_1$. If $|\tilde\beta_j^{(k_2)}| > \lambda a_2$, we get $|\tilde\beta_j^{(k_1)}| - \lambda a_1 + \lambda a_2 = |\tilde\beta_j^{(k_2)}|$. Since $a_1 + a_2 \le 1$, this contradicts the assumption $|\tilde\beta_j^{(k_1)}| - \lambda > |\tilde\beta_j^{(k_2)}|$. A similar argument handles the case $|\tilde\beta_j^{(k_2)}| \le \lambda a_2$.

Lastly, we consider the case $\sum_{k=1}^K|\tilde\beta_j^{(k)}| > \lambda$ and $|\tilde\beta_j^{(k_1)}| - \lambda \le |\tilde\beta_j^{(k_2)}|$, in which at least two elements achieve the sup-norm. We start with an auxiliary proposition.

Proposition 7. For $m > 1$, if precisely the $m$ entries $|\hat\beta_j^{(k_1)}|, \ldots, |\hat\beta_j^{(k_m)}|$ achieve $\|\hat\beta_j\|_\infty > 0$, then

$$\hat\beta_j^{(k_i)} = \frac{\mathrm{sign}(\tilde\beta_j^{(k_i)})}{m}\Big(\sum_{\ell=1}^{m}|\tilde\beta_j^{(k_\ell)}| - \lambda\Big), \quad i = 1, \ldots, m.$$

Proof of Proposition 7: Since exactly $m$ entries $|\hat\beta_j^{(k_1)}|, \ldots, |\hat\beta_j^{(k_m)}|$ achieve $\|\hat\beta_j\|_\infty > 0$, by (6) there exist $m$ nonnegative numbers $a_1, \ldots, a_m$ with $\sum_{\ell=1}^m a_\ell = 1$ such that, for each $\ell \in \{1, \ldots, m\}$,

$$\big(R_j^{(k_\ell)} - \hat\beta_j^{(k_\ell)} X_j^{(k_\ell)}\big)^T X_j^{(k_\ell)} = \lambda a_\ell\,\mathrm{sign}(\hat\beta_j^{(k_\ell)}),$$

which can be re-written as

$$\tilde\beta_j^{(k_\ell)} = \lambda a_\ell\,\mathrm{sign}(\hat\beta_j^{(k_\ell)}) + \hat\beta_j^{(k_\ell)}, \quad \ell \in \{1, \ldots, m\}. \qquad (9)$$

Using the fact that $|\hat\beta_j^{(k_1)}| = \ldots = |\hat\beta_j^{(k_m)}|$ and summing the absolute values of both sides of the $m$ equations in (9), and noting that $|\lambda a_\ell\,\mathrm{sign}(\hat\beta_j^{(k_\ell)}) + \hat\beta_j^{(k_\ell)}| = \lambda a_\ell + |\hat\beta_j^{(k_\ell)}|$ with $\sum_{\ell=1}^m a_\ell = 1$, we obtain

$$|\hat\beta_j^{(k_i)}| = \frac{1}{m}\Big(\sum_{\ell=1}^m |\tilde\beta_j^{(k_\ell)}| - \lambda\Big), \quad i \in \{1, \ldots, m\}. \qquad (10)$$

Finally, by Proposition 4 we know that $\mathrm{sign}(\hat\beta_j^{(k_i)}) = \mathrm{sign}(\tilde\beta_j^{(k_i)})$ for $i = 1, \ldots, m$, and therefore

$$\hat\beta_j^{(k_i)} = \frac{\mathrm{sign}(\tilde\beta_j^{(k_i)})}{m}\Big(\sum_{\ell=1}^m |\tilde\beta_j^{(k_\ell)}| - \lambda\Big), \quad i = 1, \ldots, m.$$

To finish the proof of Theorem 2, we still need the exact conditions under which there are precisely $m > 1$ elements achieving the sup-norm. This is given in the following Proposition 8; a similar result was also given in (Fornasier & Rauhut, 2008).

Proposition 8. For $m > 1$, precisely the $m$ entries $|\hat\beta_j^{(k_1)}|, \ldots, |\hat\beta_j^{(k_m)}|$ achieve $\|\hat\beta_j\|_\infty > 0$ if and only if

$$|\tilde\beta_j^{(k_m)}| \ge \frac{1}{m-1}\Big(\sum_{\ell=1}^{m-1}|\tilde\beta_j^{(k_\ell)}| - \lambda\Big) \quad\text{and}\quad |\tilde\beta_j^{(k_{m+1})}| < \frac{1}{m}\Big(\sum_{\ell=1}^{m}|\tilde\beta_j^{(k_\ell)}| - \lambda\Big).$$

Proof of Proposition 8: Assume exactly $m > 1$ entries $|\hat\beta_j^{(k_1)}|, \ldots, |\hat\beta_j^{(k_m)}|$ achieve $\|\hat\beta_j\|_\infty > 0$. From Proposition 7, we know that for $i = 1, \ldots, m$,

$$\hat\beta_j^{(k_i)} = \frac{\mathrm{sign}(\tilde\beta_j^{(k_i)})}{m}\Big(\sum_{\ell=1}^m|\tilde\beta_j^{(k_\ell)}| - \lambda\Big). \qquad (11)$$

By (9) and Proposition 4, we have

$$\lambda a_\ell = \big(\tilde\beta_j^{(k_\ell)} - \hat\beta_j^{(k_\ell)}\big)\,\mathrm{sign}(\tilde\beta_j^{(k_\ell)}) = |\tilde\beta_j^{(k_\ell)}| - |\hat\beta_j^{(k_\ell)}|, \quad \ell \in \{1, \ldots, m\}.$$

Plugging (11) into $a_m$ and using $a_m \ge 0$, we get

$$|\tilde\beta_j^{(k_m)}| \ge \frac{1}{m-1}\Big(\sum_{\ell=1}^{m-1}|\tilde\beta_j^{(k_\ell)}| - \lambda\Big). \qquad (12)$$

Further, since $|\hat\beta_j^{(k_m)}| > |\hat\beta_j^{(k_{m+1})}| = |\tilde\beta_j^{(k_{m+1})}|$, the necessity of $|\tilde\beta_j^{(k_{m+1})}| < \frac{1}{m}\big(\sum_{\ell=1}^{m}|\tilde\beta_j^{(k_\ell)}| - \lambda\big)$ follows from $|\hat\beta_j^{(k_m)}| = \frac{1}{m}\big(\sum_{\ell=1}^{m}|\tilde\beta_j^{(k_\ell)}| - \lambda\big)$.

For the sufficiency, assume that precisely $n \neq m$ entries $|\hat\beta_j^{(k_1)}|, \ldots, |\hat\beta_j^{(k_n)}|$ achieve $\|\hat\beta_j\|_\infty > 0$; following exactly the same argument as in the necessity part yields a contradiction.

To prove Theorem 2, we know from Proposition 8 that there are precisely $m^*$ entries of $\hat\beta_j$ achieving $\|\hat\beta_j\|_\infty > 0$ if and only if $m^* = \arg\max_m \frac{1}{m}\big(\sum_{\ell=1}^{m}|\tilde\beta_j^{(k_\ell)}| - \lambda\big)$. The result follows by combining Propositions 5 and 6.
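Putting Theorem 2 and Figure 1 together, a minimal numpy sketch of the Winsorization operator and the main loop might look as follows. This is our illustration under stated assumptions, not the authors' implementation: it assumes every design matrix has unit-norm columns (so that the unpenalized update (4) is an inner product) and omits the covariance-update and warm-start tricks.

```python
import numpy as np

def winsorize_block(b_tilde, lam):
    """Closed-form solution of the sup-norm penalized sub-problem (3) (Theorem 2).

    b_tilde holds the unpenalized per-task estimates of one block; the top m*
    entries in magnitude are clipped to a common level, the rest are kept.
    """
    abs_b = np.abs(b_tilde)
    if abs_b.sum() <= lam:
        return np.zeros_like(b_tilde)                  # whole block is zero
    order = np.argsort(-abs_b)                         # indices by decreasing magnitude
    csum = np.cumsum(abs_b[order]) - lam
    m = np.arange(1, len(b_tilde) + 1)
    m_star = int(np.argmax(csum / m)) + 1              # number of entries tied at the max
    level = max(csum[m_star - 1] / m_star, 0.0)        # common truncation level, [.]_+
    out = b_tilde.copy()
    top = order[:m_star]
    out[top] = np.sign(b_tilde[top]) * level
    return out

def mtl_blockwise_cd(X_list, Y_list, lam, n_sweeps=50):
    """Blockwise coordinate descent in the spirit of Figure 1 (no warm start,
    no covariance update). Assumes each X_list[k] has unit-norm columns, so the
    unpenalized block update is the inner product in Equation (4)."""
    K = len(X_list)
    p = X_list[0].shape[1]
    B = np.zeros((p, K))
    resid = [Y_list[k] - X_list[k] @ B[:, k] for k in range(K)]
    for _ in range(n_sweeps):
        for j in range(p):
            b_tilde = np.empty(K)
            for k in range(K):
                resid[k] = resid[k] + X_list[k][:, j] * B[j, k]   # remove feature j
                b_tilde[k] = X_list[k][:, j] @ resid[k]           # Equation (4)
            B[j, :] = winsorize_block(b_tilde, lam)
            for k in range(K):
                resid[k] = resid[k] - X_list[k][:, j] * B[j, k]   # add it back
    return B
```

Note that the check `abs_b.sum() <= lam` is exactly the zero test of Proposition 5, which is what makes sparse problems cheap: most blocks are skipped without sorting.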
Remark 9. We also conducted experiments to quantitatively compare the performance of the blockwise coordinate descent algorithm with a log-barrier interior-point method in a setting similar to that of (Friedman et al., 2007). The blockwise coordinate descent algorithm is significantly faster; due to lack of space, we do not report the simulation details here.

The complexity analysis of the algorithm is straightforward. Within the main loop, the most expensive step is sorting the K elements, which can be done in O(K log K) time. This makes the algorithm scalable to a very large number of tasks. Thanks to the Winsorization operator, a block does not need to be updated at all whenever $\sum_{k=1}^K|\tilde\beta_j^{(k)}| \le \lambda$, which happens frequently if the problem is sparse; this makes the algorithm scalable to a very large number of features. Furthermore, since many quantities can be pre-calculated, the algorithm can also be applied to large-sample datasets. The numerical convergence of the algorithm is summarized in the following theorem.

Theorem 10. The solution sequence generated by the blockwise coordinate descent algorithm in Figure 1 is bounded, and every cluster point is a solution of the multi-task Lasso defined in (2).

Proof: The optimization objective function in (2) is continuous on a compact level set, and is convex (but not strictly convex) and nondifferentiable. Furthermore, the nondifferentiable part $\sum_{j=1}^p\|\beta_j\|_\infty$ is separable, i.e. it can be decomposed into a sum of individual functions, one for each block of variables. By Theorem 4.1 in (Tseng, 2001) we obtain the claimed result.

4. Neural Semantic Basis Discovery

In this section, we present a case study of the multi-task Lasso by applying it to a problem in cognitive neuroscience. Specifically, we consider the task of predicting a person's neural activity in response to an arbitrary English word, as described in (Mitchell et al., 2008). Their approach is to predict the neural image that would be recorded using functional magnetic resonance imaging (fMRI) when a person thinks about a given word. In their work, the semantic meaning of a word is encoded by co-occurrence statistics with other words in a very large text corpus. They use a small number of training words to learn a linear model that maps these co-occurrence statistics to images of neural activity recorded while a person thinks about those words. Their model can then predict images for new words that were not included in the training set, and they show that the predicted images are similar to the observed images for those words.

In their initial model, each word is encoded by a vector of co-occurrences with 25 sensory-motor verbs (e.g. eat, ride, wear). For example, words related to foods such as "apples" and "oranges" have frequent co-occurrences with the word "eat" but few co-occurrences with the word "wear". Conversely, words related to clothes such as "shirt" or "dress" co-occur frequently with the word "wear" but not with the word "eat". Thus "eat" and "wear" are example basis words used to encode the relationships of a broad set of other words. These 25 sensory-motor verbs (shown in Table 1) were hand-crafted based on domain knowledge from the cognitive neuroscience literature and are considered a semantic basis of latent word meaning.

Table 1. The semantic basis used in Mitchell et al. (2008).
See, Hear, Listen, Taste, Smell, Eat, Touch, Rub, Approach, Manipulate, Run, Push, Fill, Move, Ride, Say, Fear, Open, Lift, Near, Enter, Drive, Wear, Break, Clean
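To make the co-occurrence encoding concrete, the following toy sketch builds normalized feature vectors for two stimulus words from hypothetical counts with three of the basis verbs. The counts are invented purely for illustration; the real vectors come from corpus statistics.

```python
import numpy as np

# Hypothetical co-occurrence counts (illustration only) of two stimulus words
# with three of the 25 sensory-motor basis verbs.
basis_verbs = ["eat", "wear", "ride"]
cooccurrence_counts = {
    "apple": np.array([310.0, 2.0, 1.0]),   # foods co-occur mostly with "eat"
    "shirt": np.array([4.0, 270.0, 3.0]),   # clothing co-occurs mostly with "wear"
}

# Each stimulus word becomes a feature vector of co-occurrences with the basis
# words, normalized to unit length so that words of different corpus frequency
# are comparable.
features = {w: v / np.linalg.norm(v) for w, v in cooccurrence_counts.items()}
```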
A natural question is: what is the optimal basis of words to represent semantic meaning across many concepts? Rather than relying on models that require manual selection of a set of words, our research tries to build models that perform variable selection to automatically learn a semantic basis of word meaning. In this way, we not only want to predict neural activity well, but also to gain insight into how the brain represents the meaning of different concepts. The hope is that learning directly from data could lead to new theories in cognitive neuroscience.

For our study, we utilize the two datasets described in (Mitchell et al., 2008). The first dataset was collected using fMRI. Nine participants were presented with 60 different words and were asked to think about each word for several seconds while their neural activity was recorded. The 60 words comprise nouns from 12 categories with 5 exemplars per category. For example, a body-part category includes Arm, Eye, Foot, Hand, Leg; a tools category includes Chisel, Hammer, Pliers, Saw, Screwdriver; and a furniture category includes Bed, Chair, Dresser, Desk, Table.

The second dataset is a symmetric matrix of text co-occurrences between the 50,000 most frequent words in English. These co-occurrences are derived from the Google Trillion Word Corpus.² The goal is to use these co-occurrences to construct a feature vector that represents word meaning. The hope is that two words that cause similar neural activities also have a high inner product between their co-occurrence vectors. One simple representation of each word is the raw 50,000-dimensional vector of co-occurrences (normalized to unit row norm). Typically a much smaller representation is desired, such as the hand-crafted 25-word basis described above.

For each participant, there are altogether n = 60 fMRI images,³ each corresponding to one stimulus word. A typical fMRI image contains over K = 20,000 neural responses. Each neural response is the activity in a voxel (volume element) of the brain; a voxel represents a 1-3 mm³ volume and is the basic spatial unit of measurement in fMRI.

² http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
³ Each image is actually the average of 6 different recordings of each word.

We now show how the problem of learning a semantic basis can be formulated as a multi-task Lasso as in (2). The goal of learning a semantic basis is to find a common set of predictor variables that predict the neural response well across multiple voxels, where each predictor variable is the text co-occurrence with a particular word from the Google Trillion Word Corpus. For each participant, we therefore have roughly K = 20,000 tasks, and all tasks share the common design columns $\{X_j\}_{j=1}^{p} \subset \mathbb{R}^{n}$, representing the co-occurrences of the n = 60 training words with p = 50,000 other English words in the Google corpus. The response vector $Y^{(k)}$ for each task contains the neural activations of a single voxel across the n = 60 fMRI images. This is therefore a multi-task learning problem with a very large number of tasks (K = 20,000) and a very large number of features (p = 50,000). While the algorithm we develop can solve a problem of this scale in only a few minutes, our primary results use a smaller dataset with p = 5,000 and K = 500, so that our experiments are directly comparable with other published results. We also provide additional results where K = 4,500.

5. Experimental Results

To evaluate our methods and compare them to existing results, we use exactly the same experimental protocols described in (Mitchell et al., 2008). As a small additional step, we use the multi-task Lasso to perform variable selection. Note that the multi-task Lasso is used in this context only to learn which variables should be input into the final model. Specifically, the leave-two-out cross-validation procedure, summarized in Figure 2, is as follows:

(a) Create a 60 x 5,000 design matrix of semantic features using co-occurrences of the 5,000 most frequent words in English (minus 100 stop words). A stop word is a high-frequency common word like "the".

(b) Select 2 of the 60 words for testing and use the other 58 words for training. Using (2), learn the coefficients by setting each $X_j$ to be the 58 x 1 vector of co-occurrences for each of the 5,000 basis words and each $Y^{(k)}$ to be the 58 x 1 column vector of responses for each of the top K = 500 voxels selected using the stability criterion score described in (Mitchell et al., 2008). In the language of the multi-task Lasso, this problem has scale n = 58, p = 5,000, K = 500. The regularization parameter can be set to choose the desired number of non-zero coefficient blocks; we set it to yield basis sets with 25, 140, and 350 elements so that the model is easier to interpret and to compare with existing results.

(c) Create a new matrix of semantic features of size 58 x d, where d is the number of selected feature blocks in $\hat\beta$ (in our case d is 25, 140, or 350). Each nonzero block corresponds to a word selected from the original set of 5,000; this word now becomes a semantic feature in the new basis.

(d) Train a linear model using ridge regularization to predict each of the 500 voxels from the semantic feature basis. The solution can be obtained using the standard normal equations for ridge regression.

(e) For each of the two test examples, predict the neural response of the 500 selected voxels. Compute the cosine similarity of each prediction with each of the held-out images. Based on the combined similarity scores, choose which prediction goes with which held-out image, and test whether the joint labeling is correct. This yields an output of 0 or 1 (a sketch of steps (d) and (e) is shown after Figure 2). For more details, see (Mitchell et al., 2008).

(f) Repeat steps (b)-(e) for all $\binom{60}{2}$ = 1,770 possible pairs of words. Count the number of incorrect labelings in step (e) to determine the accuracy of the basis set.

Figure 2. The leave-two-out cross-validation protocol.
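The following sketch illustrates steps (d) and (e) of the protocol: ridge regression via the normal equations, followed by cosine-similarity matching of the two held-out images. The function names and the ridge parameter alpha are ours, and the scoring follows (Mitchell et al., 2008) only in outline.

```python
import numpy as np

def ridge_weights(S, Y, alpha):
    """Step (d): ridge regression of voxel responses Y (n x K) on semantic
    features S (n x d) via the normal equations; returns a d x K weight matrix."""
    d = S.shape[1]
    return np.linalg.solve(S.T @ S + alpha * np.eye(d), S.T @ Y)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def score_test_pair(W, s1, s2, img1, img2):
    """Step (e): predict the two held-out images from their semantic feature
    vectors s1, s2 and keep the joint labeling with the larger combined cosine
    similarity. Returns 1 if the correct pairing wins, 0 otherwise."""
    p1, p2 = s1 @ W, s2 @ W                       # predicted activations of the selected voxels
    correct = cosine(p1, img1) + cosine(p2, img2)
    swapped = cosine(p1, img2) + cosine(p2, img1)
    return int(correct > swapped)
```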
We repeat this experiment for each of the nine participants in the fMRI study and use the same method as Mitchell et al. (2008) to ensure consistency while testing various semantic features. The regularization parameter for the ridge regression in step (d) is chosen automatically using the cross-validation error score described in (Hastie et al., 2001, page 216).

We performed several experiments using the above protocol to compare the performance of the semantic feature sets learned using the multi-task Lasso with the hand-crafted features described earlier. Using these results, we now pose and answer several questions.

Q1: Can we automatically learn a semantic basis?

As in Figure 2, we use the multi-task Lasso to learn a semantic basis from the 5,000 most frequent words in English and adjust the regularization parameter to obtain different basis sizes. We denote by MTL25, MTL140, and MTL350 the models that have 25, 140, and 350 words respectively in their basis sets each time they are trained. Note that we train the models on each iteration of the cross-validation and keep the regularization parameter the same between iterations. As a result, there can be a slight variation in the actual number of features selected on each iteration.

Table 2 shows the results of the four models; the result for the 25 features hand-crafted by cognitive neuroscientists is also reported. Chance accuracy for this prediction task is 0.5, and statistically significant accuracy at the 5% level is 0.61 (Mitchell et al., 2008).

Table 2. Accuracies for 9 fMRI participants (basis learned per subject).

            Subj 1    Subj 2    Subj 3    Subj 4    Subj 5    Subj 6    Subj 7    Subj 8    Subj 9
Handcraft  0.749718  0.705085  0.726554  0.715254  0.792655  0.771751  0.684746  0.729379  0.787006
MTL25      0.863842  0.713559  0.718079  0.608475  0.787006  0.649153  0.714124  0.730508  0.679661
MTL140     0.863842  0.741243  0.727119  0.545763  0.831638  0.654237  0.733333  0.751977  0.717514
MTL350     0.881921  0.757627  0.754802  0.567232  0.840678  0.692090  0.762147  0.784746  0.738983

We see that in 8 of 9 subjects, all three multi-task Lasso models learn a semantic basis that leads to statistically significant accuracies. This suggests that we can indeed learn a semantic basis directly from data. The MTL140 and MTL350 models achieve higher accuracies than the hand-crafted features in 6 of 9 participants. It is exciting that the model can often meet or exceed the performance of the hand-crafted features while using far fewer assumptions about neuroscience. It is also useful that our learned basis differs from the hand-crafted features, which suggests potential benefits from a model-averaging approach. The MTL25 model achieves accuracies higher than the hand-crafted features in 4 of 9 participants. We also see that the two larger basis sets outperform the MTL25 set, suggesting that more than 25 features are necessary to capture the variance of the data.

An interesting observation comes from participant 4: all three learned models performed worse than the hand-crafted model on this participant. The very abnormal behavior of the learned models on this participant, compared with the other participants, suggests that this particular participant might be an outlier or a very noisy observation (as is common in fMRI studies), whereas the hand-crafted features do not suffer in this case. More investigation is needed to study this.

Q2: Is there any advantage to learning the basis across multiple subjects simultaneously?

An interesting question is whether we could ameliorate this problem by learning the basis simultaneously across all subjects at once. Table 3 shows the results of learning the basis by combining the fMRI data of all participants (thereby yielding a 58 x 4,500 target matrix); this corresponds to a multi-task Lasso with K = 4,500. On average, we found a slight advantage for the MTL140 and MTL350 models, and a slight disadvantage for the MTL25 model. Although encouraging, this is hardly conclusive evidence, and we feel this is an interesting direction for future work. In particular, it would be worth studying the impact of noisy data by removing an outlier such as participant 4.

Table 3. Accuracies for 9 fMRI participants (basis learned with combined fMRI data).

         Subj 1    Subj 2    Subj 3    Subj 4    Subj 5    Subj 6    Subj 7    Subj 8    Subj 9
MTL25   0.815254  0.718079  0.679096  0.588701  0.732203  0.661017  0.737288  0.755932  0.690960
MTL140  0.849718  0.736158  0.733333  0.553107  0.810734  0.678531  0.761017  0.753107  0.732768
MTL350  0.880791  0.761017  0.758757  0.575141  0.845198  0.691525  0.762712  0.783051  0.738983

Q3: What is the learned semantic basis?

Table 4 shows one example of 25 basis words learned using the MTL25 model. It is easy to see relationships between many of the words in the basis set and the 60 stimulus words described before.
As a result, there can be a slight variation in the actual number of features selected on each iteration. We see in Table 2 the results of the 4 models. The result for the 25 features hand-crafted by cognitive neuroscientists is also reported. Chance accuracy for this prediction task is 0.5 and statistically significant accuracy at the 5% level is 0.61 (Mitchell et al., 2008). We see that in 8 of 9 subjects, all three multi-task Lasso models learn a semantic basis that leads to statistically significant accuracies. This suggests that we can indeed learn a semantic basis directly from data. The MTL140 and MTL350 models achieve higher accuracies than the hand-crafted features in 6 of 9 participants. It is exciting that the model can often meet or exceed the performance of the hand-crafted features using far fewer assumptions about neuroscience. It is also useful that our learned basis is different than the handcrafted features, which suggests potential benefits from a model averaging approach. For the MTL25 model, we report accuracies higher than the handcrafted features in 4 of 9 participants. We also see that two larger basis sets outperform the MTL25 set, suggesting that more than 25 features are necessary to capture the variance of the data. An interesting observation comes from participant 4. On this participant all three learned models performed worse than the hand-crafted model. The very abnormal behavior of the learned models on this participant versus the other participants suggests that this particular participant might be an outlier or a very noisy observation (as is common in fMRI studies). However, the hand-crafted feature does not suffer from this case. More investigation is suggested to study this. Q2: Is there any advantage to learning the basis across multiple subjects simultaneously? An interesting question is whether we could ameliorate this problem by learning the basis simultaneously across all subjects at once. Table 3 shows the results of learning the basis by combining the fMRI data for all participants (thereby yielding a 58 x 4500) target matrix. This corresponds to a multi-task Lasso where K = 4500. On average, we found a slight advantage on the MTL140 and MTL350 models, and slight disadvantage for the MT25 model. Although encouraging, this is hardly conclusive evidence, and we feel this is an interesting direction for future work. In particular, it would be worth studying the impact of noisy data by removing an outlier such as participant 4. Q3: What is the learned semantic basis? Table 4 shows one example of 25 basis words learned using the MTL25. It is easy to see relationships between many of the words in the basis set and the 60 stimulus words described before. For example, the model learned Tools as a basis word, which is interest- Blockwise Coordinate Descent Procedures for the Multi-task Lasso ing because 5 different instances of tools were shown (e.g. Screwdriver, Hammer, etc.). The basis word Bedroom clearly refers to words in the furniture category (Bed, Dresser, etc.) and Arms is related to body parts (Leg, Hand, etc.). Table 4. An example of 25 learned semantic basis words. Tools Model Mad Arms Right Car Station Rentals Walk White Dog Bedroom Fishing Cleaning Front Wine Breakfast Cake Cheese Contents Pottery Cup Tip Gay Result References Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73, 243­272. Fornasier, M., & Rauhut, H. (2008). 
For a given basis word, we can train a simple linear model to predict the neural activations across all 20,000 voxels in the brain from that single basis word. Note that this is a multiple-output regression, and each learned coefficient corresponds to a physical location in the brain. By plotting the coefficients, we can discover how different basis words activate different regions of the brain. Figure 3 shows a 3D map of these coefficients for the basis word Tools. We see strong activation (red) in the superior temporal sulcus, which is believed to be associated with the perception of biological motion, and strong activation in the postcentral gyrus, which is believed to be associated with premotor planning.

Figure 3. Regression weights at each voxel for the basis word Tools.

6. Conclusion

We have presented a blockwise coordinate descent algorithm that fits the entire regularization path of the multi-task Lasso in a highly efficient way. The algorithm uses a closed-form Winsorization operator, which makes it easy to implement and far more efficient than previous methods. We believe the multi-task Lasso is useful for a large class of sparse multi-task regression problems, and we have demonstrated its power on a neural semantic basis discovery problem.

References

Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73, 243-272.

Fornasier, M., & Rauhut, H. (2008). Recovery algorithms for vector valued data with joint sparsity constraints. SIAM Journal on Numerical Analysis, 46, 577-613.

Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302-332.

Friedman, J. H., Hastie, T., & Tibshirani, R. (2008). Regularization paths for generalized linear models via coordinate descent. Technical report, Stanford University.

Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7, 397-416.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

Mallows, C. L. (Ed.). (1990). The Collected Works of John W. Tukey. Volume VI: More Mathematical, 1938-1984. Wadsworth & Brooks/Cole.

Mitchell, T. M., et al. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320, 1191-1195.

Rockafellar, R. T., & Wets, R. J.-B. (1998). Variational Analysis. Springer-Verlag.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109, 475-494.

Turlach, B., Venables, W. N., & Wright, S. J. (2005). Simultaneous variable selection. Technometrics, 27, 349-363.

Wu, T. T., & Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2, 224-244.

Zhang, J. (2006). A Probabilistic Framework for Multi-task Learning (Technical Report CMU-LTI-06-006). Ph.D. thesis, Carnegie Mellon University.

Zhao, P., Rocha, G., & Yu, B. (2009). The grouped and hierarchical model selection through composite absolute penalties. The Annals of Statistics (to appear).