A scalable trust-region algorithm with application to mixed-norm regression

Dongmin Kim (DMKIM@CS.UTEXAS.EDU), Suvrit Sra (SUVRIT@TUEBINGEN.MPG.DE), Inderjit Dhillon (INDERJIT@CS.UTEXAS.EDU)
Dept. of Computer Science, University of Texas, Austin, TX 78712, USA
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract
We present a new algorithm for minimizing a convex loss-function subject to regularization. Our framework applies to numerous problems in machine learning and statistics; notably, for sparsity-promoting regularizers such as the ℓ1 or ℓ1,∞ norms, it enables efficient computation of sparse solutions. Our approach is based on the trust-region framework for nonsmooth objectives, which allows us to build on known results to provide convergence analysis. We avoid the computational overheads associated with the conventional Hessian approximation used by trust-region methods by instead using a simple separable quadratic approximation. This approximation also enables the use of proximity operators for tackling nonsmooth regularizers. We illustrate the versatility of the resulting algorithm by specializing it to three mixed-norm regression problems: group lasso [36], group logistic regression [21], and multi-task lasso [19]. We experiment with both synthetic and real-world large-scale data; our method is seen to be competitive, robust, and scalable.

1. Introduction
We present a new algorithm for solving optimization problems of the form

  min_w Φ(w) = L(w) + R(w),   (1.1)

where L is a continuously differentiable, convex loss-function, and R is a convex and continuous, usually non-differentiable regularizer. This generic setup enjoys great importance as it encompasses several basic problems in machine learning and statistics, e.g., Lasso [29], Group Lasso [36], ℓ1-logistic regression [16], and multi-task feature selection [25].

Since Φ is convex, a variety of methods can be used to solve (1.1), ranging from the subgradient method [5] to sophisticated interior-point methods [24]; we summarize several approaches in §2. Standard methods, however, do not always scale to problem sizes encountered in machine learning, where datasets with several million points are not uncommon. The lack of scalability is even more pronounced when R is non-differentiable, so that many otherwise scalable methods can become unbearably slow.

To overcome some of the challenges posed by (1.1), we design and analyze a new trust-region (TR) algorithm, which we call TRIP. For concrete instances of (1.1), TRIP has a worst-case runtime bounded by O(1/ε), where ε denotes the desired solution accuracy (Theorem 3.5). In contrast to conventional trust-region setups, TRIP attains greater scalability by using only first-order information at the current iteration to build a scalar approximation to the Hessian. This strategy enables the use of proximity operators (§3.2), which allow tackling nonsmooth regularizers efficiently. As a result, TRIP is highly competitive with state-of-the-art methods.

We illustrate TRIP's empirical effectiveness by applying it to three important regression problems regularized by mixed-norms (e.g., ℓ1,2, ℓ1,∞): group lasso (GLASSO) [3, 36]; group lasso for logistic regression (GL-LR) [21]; and multi-task lasso (MTL) [19]. These three problems are not only widely applicable but also representative of modern challenging problems that are solvable via our method; see §4 for details.

In summary, our algorithm TRIP is (i) scalable, as it uses a separable quadratic approximation; (ii) flexible, as it depends only on proximity operators for solving subproblems; and (iii) robust, because despite using several trust-region parameters, it requires minimal tuning. Empirical results (see §4) on both synthetic and real-world data corroborate these claims.

2. Related Work & Background
The simplest approach to solving (1.1) is perhaps the subgradient method [5]. Here one iterates w^{t+1} = w^t − η_t g^t, where η_t is a step-size and g^t is a subgradient in ∂Φ(w^t). This method is, however, overly simple: its iterates are hardly ever at the points of non-differentiability [28, 11], which are usually the actual points of interest as they tend to reveal problem structure such as sparsity [37, 2]. As noted in [28], a similar fate is met by stochastic gradient descent.

A more attractive choice is to tackle the non-differentiability of R via proximity operators (see [6], §3.2). These operators are central to forward-backward splitting (FBS) methods [6] (FBS includes methods such as iterative shrinkage-thresholding as special cases), as well as to methods based on surrogate optimization [13, 10] and separable approximation [34]. If even the loss is non-differentiable, then the recent FBS method FOBOS [11] may be used. A potential drawback of FBS methods is that often the step-sizes must be carefully tuned. In contrast, our method TRIP usually succeeds without user intervention.

In recent developments for mixed-norm problems, block-coordinate descent procedures have been popular: e.g., for separable regularizers [30]; for MTL [19], GLASSO [36], and GL-LR [22]. However, for large-scale data, coordinate-wise approaches are often non-competitive [32, 28].

Beyond FBS there are two more noteworthy methods: SpaRSA [34] and PQN [27]. SpaRSA is a state-of-the-art framework for solving (1.1); our method shares its separable Hessian approximation, but differs significantly with respect to the convergence analysis. PQN is a limited-memory quasi-Newton method with excellent empirical performance, but it tackles R by passing to the constrained form R(w) ≤ τ for a budget τ. We compare TRIP with FBS, SpaRSA, and PQN, and observe highly competitive empirical behavior.
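As an illustration of the forward-backward splitting idea discussed above, here is a minimal sketch of one FBS (proximal-gradient) iteration for ℓ1-regularized least squares; the function names, step-size choice, and data shapes are illustrative assumptions, not a method taken from this paper.

```python
import numpy as np

def soft_threshold(v, tau):
    # Elementwise soft-thresholding: the proximity operator of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fbs_step(w, X, y, lam, step):
    # One forward-backward splitting iteration for
    #   min_w 0.5 * ||X w - y||_2^2 + lam * ||w||_1.
    grad = X.T @ (X @ w - y)                         # forward (gradient) step on the smooth loss
    return soft_threshold(w - step * grad, step * lam)  # backward (proximal) step on the regularizer

# Tiny usage example with synthetic data (shapes are illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = rng.standard_normal(50)
w = np.zeros(20)
step = 1.0 / np.linalg.norm(X, 2) ** 2               # safe step-size: 1 / Lipschitz constant of the gradient
for _ in range(100):
    w = fbs_step(w, X, y, lam=0.1, step=step)
```

As noted above, the step-size here must be tuned (or bounded by the Lipschitz constant); avoiding such tuning is one of the motivations behind TRIP.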
2.1. Nonsmooth Trust-Region Method
Broadly viewed, our method is a variant of the basic trust-region (TR) method, which has been successfully applied to some machine learning problems [18, 35]. Like the basic TR method, the TR method for nonsmooth Φ also consists of three main steps: (i) approximate the objective function within a trust-region¹ via a model function; (ii) minimize the resulting model; and (iii) update the current iterate and the trust-region. Here are some details on these steps.

¹ The trust-region captures a neighborhood of the current iterate where minimizing a model function more or less guarantees minimization of the original objective.

Let the current iterate be w^k. The model function m is designed to satisfy m(w^k, p^k, s) ≈ Φ(w^k + s), where p^k parametrizes the model and Δ_k defines the trust-region. Since Φ is non-differentiable, a conventional choice such as a second-order Taylor approximation is not directly applicable (see [8, Chapter 11] for more technical details).

Once a proper model has been built, we minimize it to obtain a "step" (minimizer) s^k, and then test whether this step is acceptable by computing the ratio

  ρ_k = [Φ(w^k) − Φ(w^k + s^k)] / [m(w^k, p^k, 0) − m(w^k, p^k, s^k)],   (2.1)

which roughly measures the accuracy of m. When the decrease in the model m matches that in the objective Φ, then ρ_k ≈ 1 and the trust-region can be increased; if ρ_k is too small, the trust-region must be shrunk. More formally, let 0 < η_1 ≤ η_2 < 1, 0 < γ_1 ≤ γ_2 < 1 < γ_3, and Δ_U > 0 be TR parameters [8]. If ρ_k ≥ η_1, then we update w^{k+1} ← w^k + s^k, while the TR is updated via some

  Δ_{k+1} ∈ [γ_1 Δ_k, γ_2 Δ_k]   if ρ_k < η_1,
  Δ_{k+1} ∈ [γ_2 Δ_k, Δ_k]        if ρ_k ∈ [η_1, η_2),   (2.2)
  Δ_{k+1} ∈ [γ_3 Δ_k, Δ_U]        if ρ_k ≥ η_2.

3. The Trust-Region Proximal method: TRIP
Now we are ready to present our approach: the Trust-RegIon Proximal method, or TRIP for short. A crucial feature of TRIP that differentiates it from other TR methods is the particular model function that we use. Normally, the model function captures second-order information of the overall objective Φ. We do not approximate Φ directly, but instead approximate only L, and that too using essentially first-order information. More precisely, we replace the Hessian of L by a separable quadratic approximation, which reduces minimization of the model function to applying a proximity operator. Though our simplification sacrifices the superior theoretical convergence rates of second-order methods, it does so in exchange for a cheaper per-iteration cost. Such tradeoffs are common in machine learning applications, and are often crucial to tackling large-scale data.

3.1. The model function
For differentiable Φ, a typical second-order model is

  m(w^k, H^k, s) = Φ(w^k) + ⟨s, ∇Φ(w^k)⟩ + ½⟨s, H^k s⟩,  for ‖s‖ ≤ Δ_k,

where H^k is a positive-definite matrix that captures curvature information (e.g., for twice-differentiable Φ, one could have H^k = ∇²Φ(w^k)). Since Φ = L + R where only L is differentiable, a second-order model could be

  m(w^k, H^k, s) = L(w^k) + ⟨s, ∇L(w^k)⟩ + ½⟨s, H^k s⟩ + R(w^k + s),   (3.1)

where H^k now captures curvature information for L(w^k). However, the matrix H^k makes minimizing (3.1) nontrivial, so we wish to replace it with a simpler choice. An immediate choice is inspired by quasi-Newton methods that seek a matrix H^k satisfying the secant equation

  H^k u^k = v^k,  or equivalently,  u^k = (H^k)^{-1} v^k,   (3.2)

where the difference vectors u^k and v^k are defined as u^k = w^k − w^{k−1} and v^k = ∇L(w^k) − ∇L(w^{k−1}).

The secant equation (3.2) can be solved in various ways. One extreme, but surprisingly effective choice is to use the spectral formulae proposed by [4], which yield scalar solutions to (3.2) by computing

  α_k = argmin_α ‖α u^k − v^k‖_2^2,  or  α_k = argmin_α ‖u^k − α^{-1} v^k‖_2^2,

which result in the closed-form solutions

  α_k = ⟨u^k, v^k⟩ / ‖u^k‖_2^2  and  α_k = ‖v^k‖_2^2 / ⟨u^k, v^k⟩.   (3.3)

The scalars α_k yield diagonal Hessian approximations α_k I. This is one of the key features of SpaRSA [34] and we too adopt it in our method. However, we note that this α_k might be unbounded for our problem (1.1); hence we impose a pre-specified lower bound α_L to keep α_k I positive-definite, and an upper bound α_U to prevent degeneracy. Finally, we plug either choice of α_k into (3.1) to obtain our separable quadratic model,

  m(w^k, α_k, s) = L(w^k) + ⟨s, ∇L(w^k)⟩ + (α_k/2)‖s‖_2^2 + R(w^k + s).   (3.4)

Minimization of (3.4) can be written as

  argmin_s m(w^k, α_k, s) = argmin_s ½‖s + (1/α_k)∇L(w^k)‖_2^2 + (1/α_k) R(w^k + s).   (3.5)

Problem (3.5) is recognized to be a proximity operator (for R) [7, 6], and since it plays a crucial role in our method, let us look at proximity operators in more detail.
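Before turning to proximity operators, note that the spectral choice (3.3) together with the safeguarding to [α_L, α_U] is cheap to compute. The following is a minimal sketch; the fallback used when the curvature estimate ⟨u^k, v^k⟩ is non-positive is an illustrative assumption, not something specified in this paper.

```python
import numpy as np

def spectral_alpha(w, w_prev, grad, grad_prev, alpha_lo=1e-8, alpha_hi=1e8):
    # Barzilai-Borwein / spectral scalar solving the secant equation in a least-squares sense:
    #   alpha = <u, v> / ||u||^2   with   u = w - w_prev,  v = grad - grad_prev,
    # then clipped to [alpha_lo, alpha_hi] so that alpha * I stays positive definite and bounded.
    u = w - w_prev
    v = grad - grad_prev
    uv = float(u @ v)
    if uv <= 0.0:
        # Curvature estimate unusable; fall back to the upper safeguard (an assumption).
        return alpha_hi
    alpha = uv / float(u @ u)
    return float(np.clip(alpha, alpha_lo, alpha_hi))
```

The alternative closed form α_k = ‖v^k‖_2^2 / ⟨u^k, v^k⟩ from (3.3) can be substituted in the same way.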
3.2. Proximity operators
Let R : X ⊆ R^d → (−∞, ∞] be a lower semicontinuous, proper convex function. For a point v ∈ X, the proximity operator for R applied to v is defined as

  Prox_R(v) = argmin_{s ∈ X} ½‖s − v‖_2^2 + R(s).   (3.6)

Several examples of proximity operators are also discussed in [11], and some additional details on proximity operators may be found in [7]; one important property that we use is Moreau's decomposition:

  v = Prox_R(v) + Prox_{R*}(v),   (3.7)

where R* is the Fenchel conjugate of R. For R equal to the ℓ1, ℓ2, or ℓ∞ norm scaled by λ, after simple algebra (3.7) yields

  Prox_{λ‖·‖_1}(v) = sgn(v) (|v| − λ)_+,   (3.8)
  Prox_{λ‖·‖_2}(v) = (1 − λ/‖v‖_2)_+ v,   (3.9)
  Prox_{λ‖·‖_∞}(v) = v − P_{λ‖·‖_1}(v),   (3.10)

where (x)_+ = max(x, 0) and P_{λ‖·‖_1} denotes Euclidean projection onto the ℓ1-ball of radius λ. The first two can be computed in O(n) time; (3.10) too can be computed in O(n) time by using the algorithm of [12] for the projection P_{λ‖·‖_1}(v).
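The scalar formulas (3.8)-(3.10) translate directly into code. Below is a minimal sketch; the sort-based ℓ1-ball projection shown here is a simple O(n log n) variant rather than the linear-time algorithm of [12], and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def prox_l1(v, lam):
    # (3.8): soft-thresholding, the proximity operator of lam * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2(v, lam):
    # (3.9): block soft-thresholding, the proximity operator of lam * ||.||_2.
    nrm = np.linalg.norm(v)
    return np.maximum(1.0 - lam / nrm, 0.0) * v if nrm > 0 else v

def project_l1_ball(v, radius):
    # Euclidean projection of v onto {x : ||x||_1 <= radius}, simple sort-based method.
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, lam):
    # (3.10): via Moreau's decomposition, the prox of lam * ||.||_inf is v minus
    # the projection of v onto the l1-ball of radius lam.
    return v - project_l1_ball(v, lam)
```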
3.3. The Algorithm
We now have all the basic ingredients needed to present TRIP; Algorithm 1 provides the pseudo-code.

TRIP operates in two phases: null and monotonic. In the null phase, we solve (3.5) using appropriate proximity operators. But the null phase relies on the separable model (3.4), which is based on spectral formulae that exhibit a characteristic non-monotonic behavior [4]. This non-monotonicity, while empirically very effective, makes it extremely difficult to analyze and prove convergence. The difficulty essentially stems from potential violation of the following descent condition.

Definition 3.1 (Descent). Let m(w^k, p^k, s) be the model function, and ḡ the minimum-norm generalized gradient² at w^k. The descent condition requires s^k to satisfy

  m(w^k, p^k, 0) − m(w^k, p^k, s^k) > ν ‖ḡ‖ min{Δ_L, Δ_k},   (3.11)

where ν ∈ (0, 1), and Δ_L, Δ_k denote a pre-specified lower bound and the current trust-region size, respectively.

² Defined as ḡ = argmin_{g ∈ ∂Φ(w)} ‖g‖, where ∂Φ(w) is the set of generalized gradients at w, i.e., vectors g such that for all d ∈ R^n, ⟨g, d⟩ ≤ lim sup_{t→0+, y→w} [Φ(y + td) − Φ(y)] / t.

The monotonic phase of the algorithm ensures (3.11), so that the adverse null iterations can be "corrected." To this end, we limit the number of allowable consecutive non-monotonic iterations. At any moment, the algorithm maintains a reference iteration w^r, at which the last descent occurred. From w^r onwards the algorithm may move non-monotonically for at most r̄ iterations. If there is descent, the reference iteration is reset, while if for r̄ iterations the method fails to descend, it enters the monotonic phase that guarantees (3.11). Consequently, we can guarantee convergence of the overall algorithm, as shown below.

Algorithm 1: Trust-RegIon Proximal method (TRIP)
  input: TR parameters (2.2), descent parameters (3.11), model parameters α_L, α_U (3.4), and ν ∈ (0, 1)
  Initialize k ← 0, r ← 0, Δ_0, and w^k, w^{k−1} ∈ R^n such that Φ(w^k) < Φ(w^{k−1})
  repeat
    s^k ← −w^k + Prox_{R/α_k}(w^k − (α_k)^{−1} ∇L(w^k))
    if k − r < r̄ then   {potential non-monotonic descent}
      w^{k+1} ← w^k + s^k;  compute α_{k+1}
      if Φ(w^k + s^k) < Φ(w^r) then
        r ← k   {update reference iterate}
      end if
      k ← k + 1;  continue
    end if
    {enforce monotonicity}
    repeat
      s^k ← −P_{‖·‖ ≤ Δ_k}((α_k)^{−1} ḡ);  reduce Δ_k
    until (3.11) is satisfied
    Compute ρ_k and update the TR using (2.2);  update α_{k+1};  r ← k;  k ← k + 1
  until stopping criteria are met

3.4. Convergence Analysis of TRIP
Before we state the main convergence theorem for TRIP, we need to prove a technical but crucial proposition.

Proposition 3.2. The following properties hold:
(1) The set of model parameters is closed and bounded.
(2) The function Φ(w) from (1.1) is locally Lipschitz continuous and regular on R^n.
(3) The model m(w, α, s) from (3.4) is locally Lipschitz continuous and regular with respect to s for all (w, α) ∈ R^n × [α_L, α_U], and continuous in (w, α) for all s ∈ R^n.
(4) The values of the objective function and of the model coincide when s = 0, i.e., m(w, α, 0) = Φ(w) for all (w, α) ∈ R^n × [α_L, α_U].
(5) The generalized directional derivatives of the model and of the objective function coincide in every nonzero direction d when s = 0, for all (w, α) ∈ R^n × [α_L, α_U].

Proof: 3.2(1). Immediate because the set of model parameters is [α_L, α_U].

Proof: 3.2(2). Since by definition both L and R are convex and continuous, the objective Φ = L + R is locally Lipschitz continuous [26], and also regular [1].

Proof: 3.2(3). Recall that the model m(w, α, s) is

  m(w, α, s) = L(w) + ⟨s, ∇L(w)⟩ + (α/2)‖s‖_2^2 + R(w + s).

Each of the terms on the right-hand side of m is convex and continuous w.r.t. s. Thus, m(w, α, s) is convex and continuous, and thereby also locally Lipschitz continuous and regular [26, 1] in s. Furthermore, given s it is easy to check that all the terms are continuous in (w, α) (recall that L is continuously differentiable), whereby m itself is continuous in (w, α) for all s ∈ R^n.

To prove 3.2(4) and 3.2(5) we use the following lemma.

Lemma 3.3 ([8, Lemma 11.2.1]). If 3.2(2) and 3.2(3) hold, then the limit

  lim_{‖s‖→0} [Φ(w + s) − m(w, α, s)] / ‖s‖ = 0   (3.12)

is equivalent to the combination of 3.2(4) and 3.2(5).

We now prove 3.2(4) and 3.2(5) by showing that (3.12) holds.

Proof: 3.2(4,5). By definition of m(w, α, s) we have

  lim_{‖s‖→0} [Φ(w + s) − m(w, α, s)] / ‖s‖
    = lim_{‖s‖→0} [L(w + s) + R(w + s) − L(w) − ⟨s, ∇L(w)⟩ − (α/2)‖s‖_2^2 − R(w + s)] / ‖s‖
    = lim_{‖s‖→0} [L(w + s) − L(w)] / ‖s‖ − lim_{‖s‖→0} ⟨s, ∇L(w)⟩ / ‖s‖ − lim_{‖s‖→0} (α/2)‖s‖_2^2 / ‖s‖.   (3.13)

Let {s_k} be a sequence converging to zero; let h_k = ‖s_k‖ and d_k = s_k / h_k. Then lim_k h_k = 0 while ‖d_k‖ = 1 for all k. Using this notation we rewrite (3.13) as

  lim_k [L(w + h_k d_k) − L(w)] / (h_k ‖d_k‖) − lim_k ⟨h_k d_k, ∇L(w)⟩ / (h_k ‖d_k‖) − lim_k (α h_k^2 ‖d_k‖^2) / (2 h_k ‖d_k‖).

Since L is continuously differentiable we obtain

  lim_k ⟨d_k, ∇L(w)⟩ / ‖d_k‖ − lim_k ⟨d_k, ∇L(w)⟩ / ‖d_k‖ − lim_k h_k · C = 0,

where C = α ‖d_k‖^2 / (2‖d_k‖). Thus (3.12) holds.

With Proposition 3.2 in hand, the global convergence of TRIP can be established provided that the sequence {w^r} of reference iterations has a limit point. This can be ensured using some additional assumptions on Φ, for instance, when Φ is bounded below and has a bounded level set. In this paper, we assume the existence of a limit point and show that it is first-order critical.

Theorem 3.4 (Convergence). Suppose w* is a limit point of the sequence {w^r} generated by Algorithm 1. Then w* is a first-order critical point of Φ.

Proof.
Since Proposition 3.2 holds, the TR framework of Theorem 11.2.5 in [8] yields the proof.

Let us now look at TRIP's convergence rate.

Theorem 3.5 (Convergence rate). The sequence {Φ(w^r)} generated by Algorithm 1 converges at the rate O(1/r).

Proof sketch. Suppose w* is the solution to (1.1). For brevity, denote Φ(w^r) by Φ_r, Φ(w*) by Φ*, and m(w^r, α_r, s^r) by m_r. Since Φ_{r+1} < Φ_r, there exists a constant δ such that ‖w^r − w*‖ ≤ δ for all w^r. From (2.1), the computation of α, and from the trust-region update (2.2), we have

  (Φ_r − Φ_{r+1}) / (Φ_r − m_r) ≥ η_1,  that is,  Φ_{r+1} ≤ (1 − η_1) Φ_r + η_1 m_r.

Following Nesterov [23] we see that if Φ_r − Φ* ≥ η_1 α_U δ^2, then Φ_r − Φ_{r+1} ≥ η_1 α_U δ^2 / 2; otherwise

  Φ_{r+1} − Φ* ≤ (Φ_r − Φ*) − [η_1 / (2 α_U δ^2)] (Φ_r − Φ*)^2.   (3.14)

Using the shorthand μ_r = 1/(Φ_r − Φ*), (3.14) yields

  1/μ_{r+1} ≤ 1/μ_r − η_1 / (2 α_U δ^2 μ_r^2),

which further implies that μ_{r+1} ≥ μ_r + η_1/(2 α_U δ^2). Repeating the argument for all r we immediately obtain

  μ_r ≥ μ_{r−1} + η_1/(2 α_U δ^2) ≥ ··· ≥ μ_0 + r η_1/(2 α_U δ^2) > r η_1/(2 α_U δ^2),

which yields the desired convergence rate Φ_r − Φ* < 2 α_U δ^2 / (η_1 r) = O(1/r).

3.5. Implementation
At each iteration of TRIP, we need to apply proximity operators for mixed-norms (null phase) or compute minimum-norm generalized gradients (monotonic phase). Proximity operators can be applied efficiently as the task breaks down into n separate operators for GLASSO and GL-LR (d for MTL), a point also noticed by [34, 11]. The monotonic phase requires minimizing the generalized gradients ḡ_{ℓ1}, ḡ_{ℓ2}, or ḡ_{ℓ∞}, a task that is easy but has a somewhat technical derivation; we defer it to [15, §2.1] for brevity.

4. Experiments
In this section we apply TRIP to the following three mixed-norm³ regression problems.

1. Group lasso (GLASSO) [36]. Let X_j ∈ R^{m×d_j}, w_j ∈ R^{d_j} (1 ≤ j ≤ n), y ∈ R^m, and λ > 0. Then the GLASSO problem (ℓ1,2-regularized) is:

  min_{w_1,...,w_n} ½ ‖y − Σ_{j=1}^n X_j w_j‖_2^2 + λ Σ_{j=1}^n ‖w_j‖_2.   (4.1)

Formulation (4.1) is the most frequently studied variant of GLASSO; see for instance [36, 2].

2. Group lasso for logistic regression (GL-LR) [21]. Same as GLASSO except that instead of a quadratic loss, the logistic loss

  L(w) = Σ_{i=1}^m log(1 + e^{θ_i}) − y^T θ   (4.2)

is used, where θ = b·1 + Σ_{j=1}^n X_j w_j for a bias term b.

3. Multi-task lasso (MTL) [19]. Let X_j ∈ R^{m_j×d}, and let W ∈ R^{d×n} be the parameter matrix. Each column w_j of W corresponds to a task, but the mixed-norm is applied over the rows w^i of W. The optimization problem is:

  min_{w_1,...,w_n} Σ_{j=1}^n ½ ‖y_j − X_j w_j‖_2^2 + λ Σ_{i=1}^d ‖w^i‖_∞.   (4.3)

A more general setup is described in [25].

³ Let w ∈ R^d be partitioned into subvectors w_1, ..., w_n, where w_j ∈ R^{d_j} for 1 ≤ j ≤ n. The ℓ_{p,q} mixed-norm (aka group-norm) for w is defined as ‖w‖_{p,q} = ‖[‖w_1‖_q; ‖w_2‖_q; ...; ‖w_n‖_q]‖_p. The most commonly used mixed-norms are ℓ_{1,2} and ℓ_{1,∞}.

These three problems and their variations have attracted a lot of attention. Notable applications include: feature selection [36]; groupwise, multi-task feature selection [25, 31, 38]; cognitive neuroscience [19]; bioinformatics [20]; and computer vision [33].

We conduct experiments on both synthetic and real-world datasets and compare four algorithms:⁴ TRIP, PQN,⁵ SpaRSA,⁶ and FBS.⁷ Though we focus on the three applications listed above, we remark in passing that TRIP also easily applies to other problems such as lasso, ℓ1-penalized logistic regression [16], the Elastic net [39], and the sparse group lasso [14].

⁴ We remark again that block-coordinate descent (BCD) methods do not scale well. For example, the BCD method of [19] requires pre-processing steps that compute X_j^T X_j; these products are prohibitively expensive to compute, and demand huge storage: e.g., for dataset S3 (Table 1) more than 1800GB would be needed!
⁵ Obtained from: http://people.cs.ubc.ca/~schmidtm/Software/PQN.html
⁶ Obtained from: http://www.lx.it.pt/~mtf/SpaRSA/
⁷ Carefully implemented ourselves; no reference implementation was available.
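To connect the formulations above with §3.2 and §3.5, the ℓ1,2 regularizer in (4.1) decomposes into one independent ℓ2 proximity operator per group. The following is a minimal sketch of that group-wise prox; the group layout (a list of index arrays) and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def prox_group_l2(v, lam):
    # (3.9): block soft-thresholding, the proximity operator of lam * ||.||_2 for one group.
    nrm = np.linalg.norm(v)
    return np.maximum(1.0 - lam / nrm, 0.0) * v if nrm > 0 else v

def prox_l12(w, groups, lam):
    # Proximity operator of the l_{1,2} penalty lam * sum_j ||w_j||_2 from (4.1):
    # apply the group prox to each block of coordinates independently.
    out = w.copy()
    for idx in groups:            # 'groups' is a list of integer index arrays, one per group
        out[idx] = prox_group_l2(w[idx], lam)
    return out
```

The ℓ1,∞ penalty in (4.3) is handled in the same separable fashion, applying the ℓ∞ proximity operator (3.10) to each row of the parameter matrix W.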
4.1. Synthetic Data
Table 1 summarizes the synthetic datasets used. Due to space limitations, we include only a few of our results (more may be found in [15]).

Table 1. Dimensions of the various synthetic datasets. The values correspond to matrices in n groups, X_1, ..., X_n ∈ R^{m×d}. The largest dataset S3 contains over 500 million nonzero elements.
  Name   m         d        n     # nonzeros
  S1     10,000    5,000    500   2.49 · 10^8
  S2     50,000    20,000   100   3.99 · 10^8
  S3     100,000   50,000   100   5.01 · 10^8

We first show results for MTL. Here, the observations are generated following the model y_j = X_j w_j + ε_j, where ε_j is Gaussian noise. Figure 1 plots objective values versus runtimes for the four algorithms. We adjusted the parameters of PQN, SpaRSA, and FBS to make them perform competitively; for FBS we searched over a grid of step-sizes and report the best results obtained. We see that TRIP converges the fastest, followed by PQN, SpaRSA, and FBS. Notice how SpaRSA exhibits highly non-monotonic behavior.

Figure 1. MTL experiments: objective function vs. runtime for datasets S1, S2, and S3. FBS eventually converges (not shown) after a long time. On S2 PQN is very competitive. SpaRSA is seen to be non-monotonic. Overall TRIP outperforms the other methods.

Next, we show synthetic results for GLASSO. Here the observation was generated as y = Σ_{j=1}^n X_j w_j + ε. The data matrices used were the same as for MTL. We again adjusted the parameters of the other approaches to make them run competitively. Figure 2 presents objective function versus runtime plots. Here, as opposed to MTL, we observe that PQN and TRIP perform similarly on S1 and S2, while for the larger dataset TRIP converges much faster.

Figure 2. GLASSO experiments: objective function vs. runtime for datasets S1, S2, and S3. FBS eventually converges (not shown) after a long time; TRIP and PQN behave similarly, except on S3 where TRIP outperforms PQN.
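The synthetic generation model described above (y_j = X_j w_j + ε_j with Gaussian noise and group-sparse weights) is easy to reproduce. The following is a minimal sketch with illustrative dimensions and sparsity levels; it is not the generator actually used for the datasets in Table 1.

```python
import numpy as np

def make_mtl_data(m=1000, d=200, n_tasks=10, n_active=20, noise=0.1, seed=0):
    # Synthetic multi-task data: each task j gets its own design X_j and observations
    # y_j = X_j w_j + eps_j, with the weight matrix W row-sparse so that only
    # 'n_active' features are shared across all tasks.
    rng = np.random.default_rng(seed)
    active = rng.choice(d, size=n_active, replace=False)
    W = np.zeros((d, n_tasks))
    W[active, :] = rng.standard_normal((n_active, n_tasks))
    Xs, ys = [], []
    for j in range(n_tasks):
        X = rng.standard_normal((m, d))
        y = X @ W[:, j] + noise * rng.standard_normal(m)
        Xs.append(X)
        ys.append(y)
    return Xs, ys, W
```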
4.2. Real-world Datasets
The main goal of this section is to compare the computational efficiency of the different methods on some real-world datasets, once again with application to MTL, GLASSO, and GL-LR. We run all these regressions for performing explanatory feature selection, a problem for which fast optimization of the objective function is always beneficial. We begin with a brief demonstrative example, which is followed by the main results.

The first experiment solves MTL for the wine quality dataset [9]. This dataset contains two sub-datasets that measure physicochemical properties of red wines and white wines. The response variable is the quality of each wine, and takes values from 0 to 10. Table 2 shows the weights of the features selected by linear regression and by MTL. By applying linear regression separately, one obtains (Density, Alcohol) as the most influential features for red wine, and (Density, Residual sugar) for white wine. MTL, however, provides a different explanation: Alcohol is the co-dominating feature determining the quality, regardless of the wine type, while Volatile acidity contrasts the different types. Notice that it can be difficult to identify such contrasting features via standard lasso, and it is a unique ability of MTL to be able to do so.

Table 2. Feature selection via multi-task lasso (red wine / white wine).
  Feature                Linear Regression    MTL (λ = 1e+2)       MTL (λ = 3e+3)
  Fixed acidity          0.0253 / 0.0624      0.0386 / -0.0201     0 / 0
  Volatile acidity      -0.1104 / -0.2120     0.0193 / -0.2043     0.0768 / -0.0768
  Citric acid           -0.0120 / 0.0030      0.0164 / 0.0013      0 / 0
  Residual sugar         0.0037 / 0.4666      0.0389 / 0.2183      0 / 0
  Chlorides             -0.0325 / -0.0061     0.0045 / -0.0045     0 / 0
  Free sulfur dioxide    0.0154 / 0.0717      0.0265 / 0.0521      0 / 0
  Total sulfur dioxide  -0.0335 / -0.0137     0 / 0                0 / 0
  Density                0.6634 / -0.5075     0.1579 / -0.1579     0 / 0
  pH                    -0.1993 / 0.1170      0.0484 / 0.0484      0 / 0
  Sulphates              0.1070 / 0.0814      0.0833 / 0.0604      0 / 0
  Alcohol                0.5467 / 0.2688      0.5604 / 0.4289      0.3731 / 0.3731

In our next experiment we run MTL on the 20 Newsgroups dataset, using five tasks; each task has data of size 2907 × 53975.

Figure 3. Running time plots. The 20 Newsgroups dataset was pre-processed to fit into the MTL framework. Five newsgroup topics (computer, recreation, science, politics, religion) were reduced to five different lasso tasks of size 2907 × 53975. The "forsale" group is omitted since it does not have any subgroups.
As already seen on synthetic data, TRIP minimizes the objective function effectively. Here too, from Figure 3 we see that TRIP generates solutions comparable with the other methods in a fraction of their elapsed time.

The key difference between MTL and GLASSO is how the grouping happens. In MTL, each feature is grouped across multiple tasks, while in GLASSO, features in a single problem are grouped. To compare the methods for GLASSO, we chose the MNIST handwritten digits dataset [17] and formed a GLASSO instance. Specifically, we formulate the problem as selecting a limited number of row-pixel locations and computing the weights for pixels in the selected rows. This setup yields 28 data matrices of size 60,000 × 28 with a total of 8,994,516 non-zero entries and an observation vector of size 60,000 × 1. Figure 4 shows the running time of the various methods on this dataset; we observe that TRIP significantly outperforms the other methods.

Figure 4. Running time plots on the MNIST handwritten digits dataset. Each image in the dataset is partitioned with respect to its rows, i.e., each 28 × 28 digit image is considered as a collection of 28 row-pixels. We remark that we had to turn on the 'safeguard' option of SpaRSA to make it converge on this dataset; this generally results in the prolonged first iteration as shown in the figure. Considering that FBS requires extensive parameter tuning, we would still regard SpaRSA as highly competitive, as it rapidly catches up with the other methods in a few "normal" iterations.

We also produced a GL-LR problem on the MNIST dataset. We, however, simplified the problem to learning a binary classifier for the digit "1". The result from this GL-LR problem is shown in Figure 5. Here we could not show SpaRSA as the publicly available implementation supports only quadratic losses. TRIP is seen to outperform PQN and FBS, both of which are competitive with each other.

Figure 5. Running time plots for solving a GL-LR problem on the MNIST handwritten digits dataset.

5. Conclusions and Future Work
We presented a new method based on a nonsmooth trust-region framework for solving regularized convex problems. We applied our algorithm to three mixed-norm regularized regression problems, for which it exhibited highly competitive performance on both synthetic and real-world datasets, while enjoying solid theoretical convergence properties. We emphasize that the applicability of our method is not limited to the problems presented in this paper. Future challenges include further algorithmic developments such as employing better, potentially more sophisticated yet scalable models, and developing efficient proximity operators for more complex regularizers, e.g., tree-regularized mixed-norms and overlapping mixed-norms.

References
[1] Aubin, J.-P. L'analyse non linéaire et ses motivations économiques. Masson, 1984.
[2] Bach, F. R. Consistency of the Group Lasso and Multiple Kernel Learning. J. Mach. Learn. Res., 9:1179–1225, 2008.
[3] Bakin, S. Adaptive regression and model selection in data mining problems. PhD thesis, Australian National University, Canberra, 1999.
[4] Barzilai, J. and Borwein, J. M. Two-Point Step Size Gradient Methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
[5] Bertsekas, D. P. Nonlinear Programming. Athena Scientific, second edition, 1999.
[6] Combettes, P. L. and Pesquet, J.-C. Proximal Splitting Methods in Signal Processing. Dec. 2009.
[7] Combettes, P. L. and Wajs, V. R. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2005.
[8] Conn, A. R., Gould, N. I. M., and Toint, P. L. Trust-Region Methods. SIAM, 2000.
[9] Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.
[10] Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
[11] Duchi, J. and Singer, Y. Online and Batch Learning using Forward-Backward Splitting. J. Mach. Learning Res. (JMLR), Sep. 2009.
[12] Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions.
In ICML, 2008.
[13] Figueiredo, M. A. T. and Nowak, R. D. An EM algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12(8):906–916, 2003.
[14] Friedman, J., Hastie, T., and Tibshirani, R. A note on the group lasso and a sparse group lasso. arXiv:1001.0736v1 [math.ST], Jan. 2010.
[15] Kim, D., Sra, S., and Dhillon, I. S. A scalable trust-region algorithm with application to mixed-norm regression: Supplementary Material. In ICML, 2010.
[16] Koh, K., Kim, S.-J., and Boyd, S. An Interior-Point Method for Large-Scale ℓ1-Regularized Logistic Regression. JMLR, 8, 2007.
[17] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
[18] Lin, C.-J., Weng, R. C., and Keerthi, S. S. Trust region Newton method for logistic regression. The Journal of Machine Learning Research, 9:627–650, 2008.
[19] Liu, H., Palatucci, M., and Zhang, J. Blockwise Coordinate Descent Procedures for the Multi-task Lasso, with Applications to Neural Semantic Basis Discovery. In Int. Conf. Machine Learning, Jun. 2009.
[20] Ma, S., Song, X., and Huang, J. Supervised group lasso with applications to microarray data analysis. BMC Bioinformatics, 2007.
[21] Meier, L., van de Geer, S., and Bühlmann, P. The group lasso for logistic regression. J. Royal Stat. Soc.: Ser. B, 70(1):53–71, 2008.
[22] Meinshausen, N. and Yu, B. Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat., 37(1):246–270, 2009.
[23] Nesterov, Y. Gradient methods for minimizing composite objective function. Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, Tech. Rep. 76, 2007.
[24] Nocedal, J. and Wright, S. Numerical Optimization. Springer, 1999.
[25] Obozinski, G., Taskar, B., and Jordan, M. Multi-task feature selection. Technical report, UC Berkeley, Jun. 2006.
[26] Roberts, A. W. and Varberg, D. E. Another proof that convex functions are locally Lipschitz. The American Mathematical Monthly, 81(9):1014–1016, 1974.
[27] Schmidt, M., van den Berg, E., Friedlander, M., and Murphy, K. Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm. In AISTATS, 2009.
[28] Shalev-Shwartz, S. and Tewari, A. Stochastic methods for ℓ1-regularized loss minimization. In ICML, 2009.
[29] Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
[30] Tseng, P. and Yun, S. A block-coordinate descent method for linearly constrained nonsmooth separable optimization. J. Optim. Theory and Appl., 140, 2009.
[31] Turlach, B. A., Venables, W. N., and Wright, S. J. Simultaneous Variable Selection. Technometrics, 27:349–363, 2005.
[32] van den Berg, E., Schmidt, M., Friedlander, M. P., and Murphy, K. Group sparsity via linear-time projection. Technical Report TR-2008-09, Univ. British Columbia, Jun. 2008.
[33] Wang, X. G., Zhang, C., and Zhang, Z. Y. Boosted multi-task learning for face verification with applications to web image and video search. In CVPR, pp. 142–149, 2009.
[34] Wright, S. J., Nowak, R. D., and Figueiredo, M. A. T. Sparse reconstruction by separable approximation. IEEE Trans. Sig. Proc., 57(7):2479–2493, 2009.
[35] Yuan, G.-X., Chang, K.-W., Hsieh, C.-J., and Lin, C.-J. A comparison of optimization methods for large-scale ℓ1-regularized linear classification. Technical report, Department of Computer Science, National Taiwan University, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf, 2009.
[36] Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 68(1):49–67, 2006.
[37] Zhao, P. and Yu, B. On Model Selection Consistency of Lasso. JMLR, 7:2541–2563, 2006.
[38] Zhao, P., Rocha, G., and Yu, B. The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat., 37(6A):3468–3497, 2009.
[39] Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.