An Efficient Projection for l1, Regularization Ariadna Quattoni ariadna@csail.mit.edu Computer Science and Artificial Intelligence Laboratory, MIT, 32 Vassar St., Cambridge MA 02139 USA UC Berkeley EECS and ICSI, 1947 Center St., Berkeley CA 94704 USA Xavier Carreras carreras@csail.mit.edu Michael Collins mcollins@csail.mit.edu Computer Science and Artificial Intelligence Laboratory, MIT, 32 Vassar St., Cambridge MA 02139 USA Trevor Darrell UC Berkeley EECS and ICSI, 1947 Center St., Berkeley CA 94704 USA trevor@eecs.berkeley.edu Abstract In recent years the l1, norm has been proposed for joint regularization. In essence, this type of regularization aims at extending the l1 framework for learning sparse models to a setting where the goal is to learn a set of jointly sparse models. In this paper we derive a simple and effective projected gradient method for optimization of l1, regularized problems. The main challenge in developing such a method resides on being able to compute efficient projections to the l1, ball. We present an algorithm that works in O(n log n) time and O(n) memory where n is the number of parameters. We test our algorithm in a multi-task image annotation problem. Our results show that l1, leads to better performance than both l2 and l1 regularization and that it is is effective in discovering jointly sparse solutions. discovering significant features is of value and where computing features is expensive. In addition, it has been shown that in some cases l1 regularization can lead to sample complexity bounds that are logarithmic in the number of input dimensions, making it suitable for learning in high dimensional spaces (Ng, 2004). In recent years the l1, norm has been proposed for joint regularization (Turlach et al., 2005; Tropp, 2006; Quattoni et al., 2008; Schmidt et al., 2008). The l1, norm is a matrix norm that penalizes the sum of maximum absolute values of each row. This regularizer encourages row sparsity: i.e., it encourages entire rows of the matrix to have zero elements. As one example application, consider a multi-task setting with m problems where the l1, regularizer is applied to a parameter matrix W = [w1 , . . . , wm ], where wi is a parameter vector for the i-th task. In this case the l1, regularizer is used to promote feature sharing across tasks and discover solutions where only a few features are non-zero in any of the m tasks (i.e. jointly sparse solutions) (Quattoni et al., 2008). Other applications of l1, regularization are simultaneous sparse signal approximation (Tropp, 2006; Turlach et al., 2005) and structure learning (Schmidt et al., 2008). In this paper we present an efficient projected subgradient method for optimization of l1, regularized convex objectives, which we formulate as a constrained convex optimization problem. Projected gradient methods iterate between performing unconstrained gradient-based updates followed by projections to the feasible set, which in our case is an l1, ball. These methods have been shown to scale well and have been proposed for solving constrained optimization problems involving large numbers of variables. For 1. Introduction Learning algorithms based on l1 regularized loss functions have had a relatively long history in machine learning, covering a wide range of applications such as sparse sensing (Donoho, 2004), l1-logistic regression (Ng, 2004), and structure learning of Markov networks (Lee et al., 2007). A well known property of l1 regularized models is their ability to recover sparse solutions. Because of this they are suitable for applications where Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s). An Efficient Projection for l1, Regularization example, (Shalev-Shwartz et al., 2007) developed a projected gradient method for l2 regularization and (Duchi et al., 2008) proposed an analogous algorithm for l1 . However, the problem of finding an efficient projected method for l1, constraints remains open. The main challenge in developing a projected gradient algorithm for l1, constraints resides on being able to efficiently compute Euclidean projections onto the l1, ball. We show that this can be done in O(n log n) time and O(n) memory, where n is the number of parameters of our model. We apply our algorithm to a multi-task image annotation problem where the goal is to predict keywords for a given image. We show that l1, regularization performs significantly better than both independent l2 and independent l1 regularization. Furthermore, we show that l1, is able to find jointly sparse solutions (i.e. parameter matrices with few non-zero rows). 3. A projected gradient method for l1, regularization In this section we start by describing a convex constrained optimization formulation of l1, regularization, followed by a concrete application to multi-task learning. We finish by introducing the projected gradient approach that we will use to solve the problem. 3.1. Constrained Convex Optimization Formulation Assume we have a dataset D = (z1 , z2 , . . . , zn ) with points z belonging to some set Z, and a d × m parameter matrix W = [w1 , . . . , wm ] where wk Rd for k = {1, . . . , m} is the k-th column of W . For example in a multi-task setting wk would be the parameters for the k-th task. We also have a convex loss function L(z, W ) that measures the loss incurred by W on a training sample z. Let us now define the l1, norm: d 2. Previous Work Turlach et al. (2005) developed an interior point algorithm for optimizing a twice differentiable objective regularized with an l1, norm. One of the limitations of this approach is that it requires the exact computation of the Hessian of the objective function. This might be computationally expensive for some applications both in terms of memory and time. An alternative approach was proposed by Schmidt et al. (2008), who combined a gradient-descent method with independent l projections. For the special case of a linear objective the regularization problem can be expressed as a linear program (Quattoni et al., 2008). While this is feasible for small problems it does not scale to problems with large number of variables. Our l1, projection algorithm is related to the l1 projection of Duchi et al. (2008) in that theirs is a special case of our algorithm for m = 1. The derivation of the general case for l1, regularization is significantly more involved as it requires reducing a set of l regularization problems tied together through a common l1 norm to a problem that can be solved efficiently. Similarly to the l1, norm, the l1,2 has also been proposed for joint sparse approximation. This norm penalizes the sum of the l2 norms of each row of W (Yuan & Lin, 2006; Meier et al., 2006; Simil¨ & Tikka, 2007; a Park & Hastie, 2006; Obozinski et al., 2006; Argyriou et al., 2007; Schmidt et al., 2009). ||W ||1, = j=1 max |Wj,k | k (1) When used as a regularization norm l1, induces solutions where only a few rows will contain non-zero values. In fact Tropp (2006) showed that under certain conditions the l1, regularization norm is a convex relaxation of a pseudo-norm that counts the number of non-zero rows of W . One possibility for defining the l1, regularization problem is to set it as a soft constraint: n min W i=1 L(zi , W ) + ||W ||1, (2) Here is a parameter that captures the trade-off between error and sparsity. Another natural way of formulating the regularization problem is to set it as a constrained convex optimization: n min W i=1 L(zi , W ) s.t. ||W ||1, C (3) In this case C is a bound on ||W ||1, and serves a role analogous to that of in the previous formulation. In this paper we concentrate on this latter formulation. 3.2. An Application: Multi-task Learning To give a concrete application let us describe the multitask joint regularization setting. The goal here is to train m jointly-sparse linear classifiers, one for each task. By jointly sparse we mean that we wish only a few features to be non-zero in any of the m problems. An Efficient Projection for l1, Regularization We can formulate this problem as an l1, regularized objective. Following the notation from the previous section, Z is the set of tuples: (xi , yi , li ) for i = 1 . . . n where each xi Rd is a feature vector, li {1, . . . , m} is a label specifying to which of the m tasks the example corresponds to, and yi {+1, -1} is the label for the example. Assume that we wish to learn m linear classifiers of the form fk (x) = wk · x, and let W = [w1 , w2 , . . . , wm ] be a d × m matrix where Wj,k corresponds to the j-th parameter of the k-th problem. So in the ideal case we would like a solution matrix W with a few non-zero rows (i.e. a few active features). In order to assess the classification performance multiple classification losses could be used, in this paper we used the hinge loss: LH (z, W ) = max(0, 1 - fk (x)y) (4) gin is less than one: t = k y i xi (6) i : li =k, fk (xi )yi <1 In the next section we show how to compute the projection onto the l1, ball efficiently. 4. Efficient Projection onto the l1, Ball We start this section by using the Lagrangian of the projection to characterize the optimal solution. This will allow us to map the projection to a simpler problem for which we can develop an efficient algorithm, that we present in the second part of the section. 4.1. Characterization of the solution We now describe the projection of a matrix A to the l1, ball. For now, we assume that all entries in A are non-negative; later we will show that this assumption imposes no limitation. The projection can be formulated as finding a matrix B that solves the following convex optimization problem: P1, : min B,µ In this formulation, the hinge loss encourages correct classification of the points and the l1, norm is similar to l1 regularization, but it encourages joint sparsity across the different tasks. 3.3. A Projected Gradient Method Our algorithm for optimizing equation (3) is based on the projected subgradient method for minimizing a convex function F(W ) subject to convex constraints of the form W , where is a convex set (Bertsekas, 1999). In our case F(W ) is some convex loss function, W is a parameter matrix and is the set of all matrices with ||W ||1, C. A projected subgradient algorithm works by generating a sequence of solutions W t via W t+1 = P (W t - t ). Here t is a subgradient of F at W t and P (W ) is is the Euclidean projection of W onto , given by: W 1 2 i,j (Bi,j - Ai,j )2 (7) (8) (9) (10) (11) s.t. i i, j Bi,j µi µi = C i, j Bi,j 0 i µi 0 min ||W - W ||2 = min W j,k (Wj,k - Wj,k ) 2 (5) In the above problem, the objective (7) corresponds to the Euclidean distance between A and B, whereas the constraints specify that B is in the boundary of the l1, ball of radius C.1 To do so, there are variables µ that stand for the the maximum coefficients of B for each row i, as imposed by constraints (8), and that sum to the radius of the ball, as imposed by constraint (9). Constraints (10) and (11) stand for non-negativity of the new coefficients and maximum values. We now present the Lagrangian of problem P1, and three lemmas that will be used to derive an efficient algorithm. The Lagrangian is: L(B, µ, , , , ) = + i,j Finally, is the learning rate that controls the amount by which the solution changes at each iteration. Standard results in optimization literature (Bertsekas, 1999) show that when = 0t and F(W ) is a convex Lipschitz function the projected algorithm will converge to an -accurate solution in O(1/2 ) iterations. For the hinge loss case described in the previous section, computing the subgradient of the objective of equation (3) is straightforward. The subgradient for the parameters of each task can be computed independently of the other tasks, and for the k-th task it is given by summing examples of the task whose mar- 1 2 i,j (Bi,j - Ai,j )2 µi - C i µi i i,j (Bi,j - µi ) + - i,j Bi,j - i i,j 1 It is simple to show that the optimal B is on the boundary of the ball, providing that ||A||1, >= C. An Efficient Projection for l1, Regularization Lemma 1 At the optimal solution of P1, there exists a constant 0 such that for every i: either (a) µi > 0 and j (Ai,j - Bi,j ) = ; or (b) µi = 0 and j Ai,j . Proof: Differentiating L with respect to Bi,j gives the L optimality condition Bi,j = Bi,j - Ai,j + i,j - i,j = 0. Differentiating L with respect to µi gives the optimality P L condition µi = - j i,j - i = 0. We now assume µi > 0 to prove (a). The complementary slackness conditions imply that whenever µi > 0 then P If we assume that i = 0 and therefore = j i,j . Bi,j > 0, by complementary slackness then i,j = 0, and therefore i,j = Ai,j - Bi,j . If Bi,j = 0 then Bi,j - µi = 0 and so i,j = 0 due to complementary slackness; we then observe that Ai,j = -i,j , but since Ai,j 0 and i,j 0 it must be that Ai,jP 0; so we can express also = i,j = Ai,j - Bi,j . Thus, = j (Ai,j - Bi,j ) which proves (a). When µi = 0, Bi,j = 0 because of (8) and (10). Then, using L the optimality conditions Bi,j we get that i,j = Ai,j + P L i,j . Plugging this into µi we get = j (Ai,j +i,j )+i . By definition i,j 0 and i 0, which proves (b). l1, ball can be reduced to the following problem, which finds the optimal maximums µ: M1, : find µ , s.t. i (15) (16) µi = C j:Ai,j µi (Ai,j - µi ) = , i s.t. µi > 0 (17) Ai,j , i s.t. µi = 0 (18) (19) j i µi 0 ; 0 With µ we can create a matrix B using Lemma 2. Intuitively, the new formulation reduces finding the projection to the l1, ball to finding a new vector of maximum absolute values that will be used to truncate the original matrix. The constraints express that the cumulative mass removed from a row is kept constant across all rows, except for those rows whose coefficients become zero. A final lemma establishes that there is a unique solution to M1, . Therefore, the original projection P1, reduces to finding the solution of M1, . Lemma 3 For any matrix A and a constant C such that C < ||A||1, , there is a unique solution µ , to the problem M1, . Proof: For any 0 there is a unique µ that P satisfies (17), (18) and (19). To see this, consider j Ai,j . In this case we must have µi = 0, by equation (18). For P < j Ai,j we have µi = fi () where fi is the inverse of the function X gi (µ) = (Ai,j - µ) j:Ai,j µ Lemma 1 means that when projecting A, for every row whose sum is greater than , the sum of the new values in the row will be reduced by a constant . The rows whose sum is less than will become zero. The next lemma reveals how to obtain the coefficients of B given the optimal maximums µ. Lemma 2 Let µ be the optimal maximums of problem P1, . The optimal matrix B of P1, satisfies that: Ai,j µi Ai,j µi = = Bi,j = µi Bi,j = Ai,j Bi,j = 0 (12) (13) (14) µi = 0 = Proof: If µi = 0 the lemma follows directly from (8) and (10). The rest of the proof assumes that µi > 0. To prove (12), assume that Ai,j µi but Bi,j = µi . We consider two cases. When Bi,j > 0, by (8), if Bi,j = µi then Bi,j < µi , which means i,j = 0 due to complementary slackness. This together with i,j = 0 imply that Bi,j = Ai,j , and therefore Ai,j < µi , which contradicts the assumption. When Bi,j = 0 then i,j = 0 and Ai,j = 0 (see proof of Lemma 1), which contradicts the assumption. To prove (13), assume that Ai,j µi but Bi,j = Ai,j . We again consider two cases. If Bi,j > 0, i,j = 0; given that Bi,j = Ai,j , then i,j > 0, and so Bi,j = µi due to complementary slackness. But since i,j > 0, Ai,j > Bi,j = µi , which contradicts the assumption. If Bi,j = 0 then Ai,j = 0 (see proof of Lemma 1), which contradicts the assumption. gi (µ) is a strictly decreasing function in the interval P [0, maxj Ai,j ] with gi (0) = j Ai,j and gi (maxj Ai,j ) = 0. Therefore it is clear that fi () is also well defined on the P interval [0, j Ai,j ]. Next, define N() = X i hi () P where hi () = 0 if > j Ai,j and hi () = fi () otherwise. P N() is strictly increasing in the interval [0, maxi j Ai,j ]. Hence there is a unique that satisfies N( ) = C; and there is a unique µ such that µ = hi ( ) for each i. i With these results, the problem of projecting into the So far we have assumed that the input matrix A is non-negative. For the general case, it is easy to prove that the optimal projection never changes the sign of a coefficient (Duchi et al., 2008). Thus, given the coefficient matrix W used by our learning algorithm, we can run the l1, projection on A, where A is a matrix made of the absolute values of W , and then recover the original signs after truncating each coefficient. An Efficient Projection for l1, Regularization 4.2. An efficient projection algorithm In this section we describe an efficient algorithm to solve problem M1, . Given a d × m matrix A and a ball of radius C, the goal is to find a constant and a vector µ of maximums for the new projected matrix, such that C = i µi . As we have shown in the proof of Lemma 3, µ and can be recovered using functions N() and hi (). Each function hi () is piecewise linear with m + 1 intervals. Furthermore N(), the sum of functions hi (), is also piecewise linear with dm + 1 intervals. Appendix A describes the intervals and slopes of hi . Our algorithm builds these functions piece by piece, until it finds a constant that satisfies the conditions of problem M1, ; it then recovers µ. The cost of the algorithm is dominated by sorting and merging the xcoordinates of the hi functions, which form the intervals of N. Therefore the complexity is O(dm log dm) time and O(dm) in memory, where dm is the number of parameters in A. As a final note, the algorithm only needs to consider non-zero parameters of A. Thus, in this complexity cost, dm can be interpreted as the number of non-zero parameters. This property is particularly attractive for learning methods that maintain sparse coefficients. Test Error 60 50 40 error 30 20 10 L2 L1 L1INF 0 10 20 40 80 160 320 # training examples per task Feature Selection Performance 100 80 60 40 20 0 10 Precision L1INF Recall L1 Precision L1 Recall L1INF 20 40 80 160 320 640 # training examples per task 640 Figure 1. Synthetic experiments, for 60 problems and 200 features (10% relevant). Top: test error. Bottom: feature selection performance. 5. Synthetic Experiments For the synthetic experiments we considered a multitask setting where we compared the l1, projection with both independent l2 projections for each task and independent l1 projections. In all cases we used a projected subgradient method, thus the only difference is in the projection step. For all the experiments we used the sum of average hinge losses per task as our objective function. For these experiments as well as the experiments in the following section the learning rate was set to 0 / t, where 0 was chosen to minimize the objective on the training data (we tried values 0.1, 1, 10 and 100). All models were run for 200 iterations. To create data for these experiments we first generated parameters W = [w1 , w2 , . . . , wm ] for all tasks, where each entry was sampled from a normal distribution with 0 mean and unit variance. To generate jointly sparse vectors we randomly selected 10% of the features to be the global set of relevant features V . Then for each task we randomly selected a subset v V of relevant features for the task. The size of v was sampled uniformly at random from {|V |/2, . . . , |V |}. All parameters outside v were zeroed. Each of the dimensions of the training points xk for i each task was also generated from a normal distribution with 0 mean and unit variance. All vectors were then normalized to have unit norm. The correspondk ing labels yi were set to sign(wk · xi ). The test data k was generated in the same fashion. The number of dimensions for these experiments was set to 200 and the number of problems to 60. We evaluated three types of projections: l1, , independent l2 and independent l1 . For each projection the ball constraints C were set to be the true norm of the corresponding parameters.2 That is for the l1, norm we set C = ||W ||1, . For the independent l1 and l2 norms we set Ck = ||wk ||1 and Ck = ||wk ||2 , resp., where Ck is the regularization parameter for task k. We trained models with different number of training examples ranging from 10 to 640 examples per task and evaluated the classification error of the resulting classifier on the test data. Figure 1 shows the results of these experiments. As we would expect, given the amount of feature sharing between tasks, the l1, projection results in better generalization than both independent l1 and independent l2 projections. Since we know the relevant feature set V , we can evaluate how well the l1 and l1, projections recovered these features. For each model we take the coefficient matrix W learned and select all the features corresponding to non-zero coefficients for 2 We also conducted experiments where C was chosen using a validation set and obtained very similar results. An Efficient Projection for l1, Regularization at least one task. The bottom plot shows precision and recall of feature selection for each model, as we increase the number of training examples per task. As we can see both the l1 model and the l1, can easily recognize that a feature is in the relevant set: the recall for both models is high even with very few training examples. The main difference between the two models is in the precision at recovering relevant features: the l1, model returns significantly sparser solutions, and thus has higher precision. a labeled test set (xi , yi ) for i = 1 . . . n, the AUC measure for a function f can be expressed (e.g., see Agarwal et al. (2005)) as 1 n+ I[f (xi ) > f (xj )] n- =-1 (20) i:yi =+1 j:yj 6. Image Annotation Experiments In these experiments we use our algorithm in a multitask learning application. We compare the performance of independent l2 , independent l1 , and joint l1, regularization. In all cases we used the sum of hinge losses as our objective function. To train the l1 regularized models we used the projected method of (Duchi et al., 2008) (which is a special case of our projection when m = 1). To train the l2 regularized models we used the standard SVM-Light software 3 . For these experiments we collected a dataset from the Reuters news-website. Images on the Reuters website have associated captions. We selected the 40 most frequent content words as our target prediction tasks (a content word is defined as not being in a list of stop words). That is, each task involved the binary prediction of whether a content word was an appropriate annotation for an image. Examples of words include: awards, president, actress, actor, match, team, people. We partitioned the data into three sets: 10,589 images for training, 5,000 images as validation data, and 5,000 images for testing. For each of the 40 most frequent content words we created multiple training sets of different sizes, n = {40, 80, 160, 320, 640}: each training set contained n positive examples and 2n negative examples. All examples for each task were randomly sampled from the pool of supervised training data. Our image representation is a multi-resolution vocabulary histogram (Nister & Stewenius, 2006; Grauman & Darrell, 2008); each pyramid level is concatenated to form our feature space. As a preprocessing step we performed SVD to obtain a new basis of the image space where features are uncorrelated. 6.1. Evaluation and Significance Testing To compare the performance of different classifiers we use the AUC criterion, which is commonly used in evaluation on retrieval tasks. For a single task, assuming 3 where n+ is the number of positive examples in the test set, n- is the number of negative examples, and I[] is the indicator function which is 1 if is true, 0 otherwise. The AUC measure can be interpreted (Agarwal et al., 2005) as an estimate of the following expectation, which is the probability that the function f correctly ranks a randomly drawn positive example over a randomly drawn negative item: AU C(f ) = EX + D+ ,X - D- I[f (X + ) > f (X - )] Here D+ is the distribution over positively labeled examples, and D- is the distribution over negative examples. This interpretation allows to develop a simple significance test, based on the sign test, to compare the performance of two classifiers f and g (more specifically, to develop a significance test for the hypothesis that AU C(f ) > AU C(g)). Assuming that n+ < n- (which is the case in all of our experiments), and given a test set, we create pairs of examples (x+ , x- ) for i i i = 1 . . . n+ . Here each x+ is a positive test example, i and x- is an arbitrary negative test example; each i positive and negative example is a distinct item from the test set. Given the n+ pairs, we can calculate the following counts: s+ s - = i I[(f (x+ ) > f (x- )) (g(x+ ) < g(x- ))] i i i i I[(f (x+ ) < f (x- )) (g(x+ ) > g(x- ))] i i i i = i These counts are used to calculate significance under the sign test. In our experiments, the set-up is slightly more complicated, in that we have a multi-task setting where we are simultaneously measuring the performance on several tasks rather than a single task. The test set in our case consists of examples (xi , yi , li ) for i = 1 . . . n where li specifies the task for the i-th example. We replace the AUC measure in Eq. 20 with the following measure: 1 n+ I[fl (xi ) > fl (xj )] n- l =-1 l i:li =l,yi =+1 j:li =l,yj http://svmlight.joachims.org/ where n+ is the total number of positive examples in the test set, n- is the number of negative examples for l An Efficient Projection for l1, Regularization Table 1. Significance tests for the Image Annotation task, comparing l1, with l2 and l1 . 0.73 0.72 Average AUC 0.71 0.7 0.69 0.68 0.67 0.66 0.65 0.64 4800 9600 19200 38400 # training samples L1INF L1 L2 63408 # samples 4,800 9,600 19,200 38,400 63,408 1 0.8 objective 0.6 0.4 0.2 0 0 C=0.1 C=1 C=10 C=100 50 l2 p-value 0.0042 0.0002 0.00001 0.35 0.001 l1 p-value 0.00001 0.009 0.00001 0.36 0.46 Figure 3. Model Comparison in the Image Annotation task: l1, , l1 and l2 . 0.67 0.665 Average AUC 0.66 0.655 0.65 0.645 0.64 L1 L1INF 0.3 0.4 100 # iterations 150 200 Figure 2. Convergence of l1, models for different values of C and best 0 in the Image Annotation task using 63,408 samples. 0.5 0.6 0.7 0.8 % non-zero features 0.9 1 the l-th task, and fl is the classifier for the l-th task. It is straightforward to develop a variant of the sign test for this setting; for brevity the details are omitted. 6.2. Results We trained models for C = {0.1, 1, 10, 100}. Figure 2 shows the convergence of the l1, models for the largest training set. To validate the constant C for each model we assume that we have 10 words for which we have validation data. We chose the C that maximized the AUC. Figure 3 shows results on the Reuters dataset for l2 , l1 and l1, regularization as a function of the total number training samples. Table 1 shows the corresponding significance tests for the difference between l1, and the other two models. As we can see the l1, regularization performs significantly better than both l1 and l2 for training sizes of less than 20, 000 samples, and for larger sizes all models seem to perform similarly. Figure 4 shows the accuracy of the learned model as a function of the total number of non-zero features (i.e. the number of features that were used by any of the tasks), where we obtained solutions of different sparsity by controlling the C parameter. Notice that the l1, model is able to find a solution of 66% average AUC using only 30% of the features while the l1 model needs to use 95% of the features to achieve a performance of 65%. Figure 4. AUC measure with respect joint sparsity in the Image Annotation task, for l1 and l1, . 7. Conclusion In this paper we have presented a simple and effective projected gradient method for training joint models with l1, constraints. We have shown that projections onto the l1, ball can be done efficiently by presenting an algorithm that can compute them in O(n log n) time and O(n) memory. This matches the computational cost of the most efficient algorithm (Duchi et al., 2008) for computing l1 projections. We have applied our algorithm to a multi-task image annotation problem. Our results show that l1, leads to better performance than both l2 and l1 regularization. Furthermore l1, is effective in discovering jointly sparse solutions. One advantage of our approach is that it can be easily extended to work on an online convex optimization setting (Zinkevich, 2003). Future work should explore this possibility. References Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., & Roth, D. (2005). Generalization bounds for the area under the roc curve. Journal of Machine Learning Research, 6, 393­425. An Efficient Projection for l1, Regularization Argyriou, A., Evgeniou, T., & Pontil, M. (2007). Multi-task feature learning. Advances in Neural Information Processing Systems 19 (pp. 41­48). Bertsekas, D. (1999). Nonlinear programming. Athena Scientific. Donoho, D. (2004). For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. (Technical Report). Statistics Dept., Stanford University. Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the l1-ball for learning in high dimensions. Proc. of Intl. Conf. on Machine Learning (pp. 272­279). Grauman, K., & Darrell, T. (2008). The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8, 725­760. Lee, S. I., Ganapathi, V., & Koller, D. (2007). Effcient structure learning of markov networks using l1-regularization. Advances in Neural Information Processing Systems 19 (pp. 817­824). Meier, L., van de Geer, S., & Buhlmann, P. (2006). The group lasso for logistic regression (Technical Report). ETH Seminar fur Statistik. Ng, A. Y. (2004). Feature selection, l1 vs. l2 regularization, and rotational invariance. Proc. of Intl. Conf. on Machine Learning. Nister, D., & Stewenius, H. (2006). Scalable recognition with a vocabulary tree. Proc. of Conf. on Computer Vision and Pattern Recognition. Obozinski, G., Taskar, B., & Jordan, M. (2006). Multitask feature selection (Technical Report). Statistics Dept., University of California, Berkeley. Park, M. Y., & Hastie, T. (2006). Regularization path algorithms for detecting gene interactions (Technical Report). Stanford University. Quattoni, A., Collins, M., & Darrell, T. (2008). Transfer learning for image classification with sparse prototype representations. Proc. of Conf. on Computer Vision and Pattern Recognition. Schmidt, M., Murphy, K., Fung, G., & Rosale, R. (2008). Structure learning in random fields for heart motion abnormality detection. Proc. of Conf. on Computer Vision and Pattern Recognition. Schmidt, M., van den Berg, E., Friedlander, M., & Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-newton algorithm. Proc. of Conf. on Artificial Intelligence and Statistics (pp. 456­463). Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Proc. of Intl. Conf. on Machine Learning (pp. 807­814). Simil¨, T., & Tikka, J. (2007). Input selection and a shrinkage in multiresponse linear regression. Computational Statistics and Data Analysis, 52, 406­ 422. Tropp, J. (2006). Algorithms for simultaneous sparse approximation, part ii: convex relaxation. Signal Processing (pp. 589­602). Turlach, B., Venables, W., & Wright, S. (2005). Simultaneous variable selection. Technometrics, 47, 349­363. Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Royal Statistical Society Series B, 68, 49­67. Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. Proc. of Intl. Conf. on Machine Learning (pp. 928­936). Acknowledgements We would like to thank John Duchi and Yoram Singer for useful discussions about this work. A. Quattoni and M. Collins were supported by NSF grant 0347631. A. Appendix: Computation of hi In this section we describe the intervals and slope of piecewise linear functions hi . Let si be a vector of the coefficients of row i in A sorted in decreasing order, with an added 0 at position m + 1, si,1 si,2 . . . si,m si,m+1 = 0. Then, let us define points k ri,k = j:Ai,j si,k (Ai,j - si,k ) = j=1 (si,j - si,k ) = k-1 j=1 si,j - (k - 1)si,k , for 1 k m + 1. Each point ri,k corresponds to the reduction in magnitude for row i that is obtained if we set the new maximum to si,k . Clearly hi (ri,k ) = si,k . Furthermore it is easy to see that hi is piecewise linear with intervals [ri,k , ri,k+1 ] for 1 k m and slope: k j=1 si,j - ksi,k+1 - si,k+1 - si,k k-1 j=1 si,j + (k - 1)si,k =- 1 k After point ri,m+1 the function is constant and its value is zero. Note that this comes from equation (18) m that establishes that µi = 0 for > ri,m+1 = j=1 Ai,j .