A Convex Formulation for Learning Shared Structures from Multiple Tasks

Jianhui Chen (jianhui.chen@asu.edu), Lei Tang (l.tang@asu.edu), Jun Liu (j.liu@asu.edu), Jieping Ye (jieping.ye@asu.edu)
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA

Abstract

Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously. In this paper, we consider the problem of learning shared structures from multiple related tasks. We present an improved formulation (iASO) for multi-task learning based on the non-convex alternating structure optimization (ASO) algorithm, in which all tasks are related by a shared feature representation. We convert iASO, a non-convex formulation, into a relaxed convex one, which is, however, not scalable to large data sets due to its complex constraints. We propose an alternating optimization (cASO) algorithm which solves the convex relaxation efficiently, and further show that cASO converges to a global optimum. In addition, we present a theoretical condition under which cASO finds a globally optimal solution to iASO. Experiments on several benchmark data sets confirm our theoretical analysis.

Learning multiple tasks simultaneously exploits their intrinsic relatedness, based on which informative domain knowledge can be shared across the tasks, thus facilitating individual task learning. Sharing such knowledge across the tasks is particularly desirable when there are a number of related tasks but only limited training data is available for each one.
1. Introduction

The problem of multi-task learning (Caruana, 1997) has recently received broad attention in areas such as machine learning, data mining, computer vision, and bioinformatics (Heisele et al., 2001; Ando & Zhang, 2005; Ando, 2007; Xue et al., 2007; Yu et al., 2007; Argyriou et al., 2008). Multi-task learning aims to improve generalization performance by learning from multiple related tasks; this can be achieved by learning the tasks simultaneously while exploiting their intrinsic relatedness.

(Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).)

Multi-task learning has been investigated by many researchers from different perspectives, such as sharing hidden units of neural networks among similar tasks (Caruana, 1997; Baxter, 2000), modeling task relatedness using a common prior distribution in hierarchical Bayesian models (Bakker & Heskes, 2003; Schwaighofer et al., 2004; Yu et al., 2005; Zhang et al., 2005), learning the parameters of a Gaussian process covariance from multiple tasks (Lawrence & Platt, 2004), extending kernel methods and regularization networks to multi-task learning (Evgeniou et al., 2005), and multi-task learning with clustered tasks (Jacob et al., 2008). Recently, there has been growing interest in studying multi-task learning in the context of feature learning and selection (Ando & Zhang, 2005; Obozinski et al., 2006; Amit et al., 2007; Argyriou et al., 2008). Specifically, Ando and Zhang (2005) propose the alternating structure optimization (ASO) algorithm to learn the predictive structure from multiple tasks. In ASO, a separate linear classifier is trained for each of the tasks, and dimension reduction is applied on the predictor space, finding low-dimensional structures with the highest predictive power. This framework has been applied successfully in several applications (Ando, 2007; Quattoni et al., 2007). However, it is non-convex, and the alternating optimization procedure is not guaranteed to find a global optimum (Ando & Zhang, 2005).

In this paper, we consider the problem of learning a shared structure from multiple related tasks following the approach in (Ando & Zhang, 2005). We present an improved ASO formulation (called iASO) using a novel regularizer. The improved formulation is non-convex; we show that it can be converted into a (relaxed) convex formulation, whose globally optimal solution approximates the one to iASO. However, this convex formulation is not scalable to large data sets due to its positive semidefinite constraints. We propose a convex alternating structure optimization (called cASO) algorithm to solve the convex relaxation efficiently, in which the optimization variables are updated iteratively. The proposed cASO algorithm is similar in spirit to the block coordinate descent method (Bertsekas, 1999) in unconstrained optimization, and it can be shown to converge to a global optimum of the convex relaxation. In addition, we present a theoretical condition under which cASO finds a globally optimal solution to iASO. We have performed experiments on benchmark data sets; the reported experimental results are consistent with our theoretical analysis and also demonstrate the effectiveness of the proposed multi-task learning algorithm.

Notations. Denote N_n = {1, ..., n}. Let S^d_+ (S^d_++) be the set of d-by-d symmetric positive semidefinite (positive definite) matrices. Denote A ⪯ B if and only if B − A is positive semidefinite. Let tr(X) be the trace and X^{-1} the inverse of a matrix X.

2. Multi-Task Learning Framework

Assume that we are given m supervised (binary-class) learning tasks. The ℓ-th task is associated with a predictor f_ℓ and training data {(x_{ℓ1}, y_{ℓ1}), ..., (x_{ℓ n_ℓ}, y_{ℓ n_ℓ})} ⊂ R^d × {−1, 1} (ℓ ∈ N_m). We focus on linear predictors f_ℓ(x) = u_ℓ^T x, where u_ℓ is the weight vector for the ℓ-th task.

Ando and Zhang (2005) propose the alternating structure optimization (ASO) algorithm for learning predictive functional structures from multiple related tasks, that is, learning all m predictors {f_ℓ}_{ℓ=1}^m simultaneously by exploiting a shared feature space in the simple linear form of a low-dimensional feature map Θ across the m tasks. Formally, the prediction function f_ℓ can be expressed as

    f_ℓ(x) = u_ℓ^T x = w_ℓ^T x + v_ℓ^T Θ x,    (1)

where the structure parameter Θ takes the form of an h × d matrix with orthonormal rows, i.e., Θ Θ^T = I, and u_ℓ, w_ℓ, and v_ℓ are the weight vectors for the full feature space, the high-dimensional one, and the shared low-dimensional one, respectively. Mathematically, ASO can be formulated as the following optimization problem:

    min_{{u_ℓ, v_ℓ}, Θ Θ^T = I}  Σ_{ℓ=1}^m ( (1/n_ℓ) Σ_{i=1}^{n_ℓ} L(u_ℓ^T x_{ℓi}, y_{ℓi}) + α ||w_ℓ||^2 ),    (2)

where L is the loss function, α ||w_ℓ||^2 is the regularization term (with w_ℓ = u_ℓ − Θ^T v_ℓ) controlling the task relatedness among the m tasks, and α is a pre-specified parameter.

The optimization problem in Eq. (2) is non-convex due to its orthonormal constraint and the regularization term in u_ℓ, v_ℓ, and Θ (the loss function L is assumed to be convex). We present an improved ASO formulation (called iASO) given by

    (F0)  min_{{u_ℓ, v_ℓ}, Θ Θ^T = I}  Σ_{ℓ=1}^m ( (1/n_ℓ) Σ_{i=1}^{n_ℓ} L(u_ℓ^T x_{ℓi}, y_{ℓi}) + g_ℓ(u_ℓ, v_ℓ, Θ) ),    (3)

where g_ℓ(u_ℓ, v_ℓ, Θ) is the regularization function defined as

    g_ℓ(u_ℓ, v_ℓ, Θ) = α ||u_ℓ − Θ^T v_ℓ||^2 + β ||u_ℓ||^2.    (4)

The regularization function in Eq. (4) controls the task relatedness (via the first component) as well as the complexity of the predictor functions (via the second component), as commonly used in traditional regularized risk minimization formulations for supervised learning. Note that α and β are pre-specified coefficients indicating the importance of the corresponding regularization components. For simplicity we use the same α and β for all tasks; however, the discussion below can be easily extended to the case where α and β differ across tasks.

The iASO formulation (F0 in Eq. (3)) subsumes several multi-task learning algorithms as special cases: it reduces to the ASO algorithm in Eq. (2) by setting β = 0 in Eq. (4), and it reduces to m independent support vector machines (SVM) by setting α = 0. It is worth noting that F0 is non-convex. In the next section, we convert F0 into a (relaxed) convex formulation, which admits a globally optimal solution.

3. A Convex Multi-Task Learning Formulation

In this section, we consider a convex relaxation of the non-convex problem F0 (iASO) in Eq. (3).

The optimal {v_ℓ}_{ℓ=1}^m to F0 can be expressed as a function of Θ and {u_ℓ}_{ℓ=1}^m. Let U = [u_1, ..., u_m] ∈ R^{d×m} and V = [v_1, ..., v_m] ∈ R^{h×m}. It can be verified that g_ℓ(u_ℓ, v_ℓ, Θ) in Eq. (4) is minimized when v_ℓ = Θ u_ℓ (ℓ ∈ N_m), and hence V = Θ U. Therefore we can denote

    G0(U, Θ) = min_V Σ_{ℓ=1}^m g_ℓ(u_ℓ, v_ℓ, Θ) = α tr( U^T ( (1 + η) I − Θ^T Θ ) U ),    (5)

where η = β/α > 0. Moreover, it can be verified that the following equality holds:

    (1 + η) I − Θ^T Θ = η (1 + η) ( η I + Θ^T Θ )^{-1}.    (6)

We can then reformulate G0(U, Θ) in Eq. (5) into the equivalent form

    G1(U, Θ) = α η (1 + η) tr( U^T ( η I + Θ^T Θ )^{-1} U ).    (7)

Since the loss term in Eq. (3) is independent of {v_ℓ}_{ℓ=1}^m, F0 can be equivalently transformed into the following optimization problem F1 with optimization variables Θ and U:

    (F1)  min_{U, Θ Θ^T = I}  Σ_{ℓ=1}^m (1/n_ℓ) Σ_{i=1}^{n_ℓ} L(u_ℓ^T x_{ℓi}, y_{ℓi}) + G1(U, Θ),    (8)

where G1(U, Θ) is defined in Eq. (7).

3.1. Convex Relaxation

The orthonormality constraint in Eq. (8) is non-convex, and so is the optimization problem F1. We propose to convert F1 into a convex formulation by relaxing its feasible domain into a convex set. Let the set M_e be defined as

    M_e = { M_e | M_e = Θ^T Θ, Θ Θ^T = I, Θ ∈ R^{h×d} }.    (9)

It has been shown in (Overton & Womersley, 1993) that the convex hull (Boyd & Vandenberghe, 2004) of M_e can be precisely expressed as the convex set M_c given by

    M_c = { M_c | tr(M_c) = h, M_c ⪯ I, M_c ∈ S^d_+ },    (10)

and each element of M_e is referred to as an extreme point of M_c. Since M_c consists of all convex combinations of the elements of M_e, M_c is the smallest convex set that contains M_e, and hence M_e ⊂ M_c.

To convert the non-convex problem F1 into a convex formulation, we replace Θ^T Θ with M in Eq. (8) and relax the feasible domain to a convex set based on the relationship between M_e and M_c presented above; this results in the convex formulation F2 defined as

    (F2)  min_{U, M}  Σ_{ℓ=1}^m (1/n_ℓ) Σ_{i=1}^{n_ℓ} L(u_ℓ^T x_{ℓi}, y_{ℓi}) + G2(U, M)
          subject to  tr(M) = h, M ⪯ I, M ∈ S^d_+,    (11)

where G2(U, M) is defined as

    G2(U, M) = α η (1 + η) tr( U^T ( η I + M )^{-1} U ).    (12)

Note that F2 is a convex relaxation of F1, since the optimal M to F2 is not guaranteed to occur at the extreme points of M_c. The optimal Θ to F1 can be approximated using the first h eigenvectors (corresponding to the largest h eigenvalues) of the optimal M computed from F2.

The convexity of F2 can be readily verified. We add variables {t_ℓ}_{ℓ=1}^m and enforce u_ℓ^T ( η I + M )^{-1} u_ℓ ≤ t_ℓ (ℓ ∈ N_m); it follows from the Schur complement lemma (Golub & Loan, 1996) that we can rewrite F2 as

    (F3)  min_{U, M, {t_ℓ}}  Σ_{ℓ=1}^m (1/n_ℓ) Σ_{i=1}^{n_ℓ} L(u_ℓ^T x_{ℓi}, y_{ℓi}) + α η (1 + η) Σ_{ℓ=1}^m t_ℓ
          subject to  [ η I + M, u_ℓ ; u_ℓ^T, t_ℓ ] ⪰ 0,  ℓ ∈ N_m,
                      tr(M) = h, M ⪯ I, M ∈ S^d_+.    (13)

Given that the loss function L is convex, the optimization problem F3 is convex. However, it is not scalable to high-dimensional data (with a large d) due to its positive semidefinite constraints. If L is the SVM hinge loss, F3 is a semidefinite program (SDP) (Boyd & Vandenberghe, 2004). Many off-the-shelf solvers such as SeDuMi (http://sedumi.ie.lehigh.edu/) can be used for solving SDPs, but they can only handle several hundred optimization variables.

4. Convex Alternating Structure Optimization

The optimization problem F3 in Eq. (13) is convex and thus admits a globally optimal solution; however, this formulation does not scale well in practice. In this section, we propose a convex alternating structure optimization (called cASO) algorithm to efficiently solve the optimization problem F2 in Eq. (11). cASO is similar to the block coordinate descent method (Bertsekas, 1999): one of the two optimization variables (U and M) is fixed while the other is optimized, and the roles then alternate. The pseudo-code of the cASO algorithm is presented in Algorithm 1.
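As a quick sanity check on the reformulation above, the matrix identity in Eq. (6), and hence the equality of the two regularizer forms G0 and G1 in Eqs. (5) and (7), can be verified numerically for a random Θ with orthonormal rows. The sketch below is illustrative only (not part of the original paper); it assumes NumPy, with d, h, α, and η chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, alpha, eta = 6, 2, 2.0, 0.5

# Random Theta with orthonormal rows, built via a reduced QR factorization.
A = rng.standard_normal((d, h))
Q, _ = np.linalg.qr(A)        # d x h with orthonormal columns
Theta = Q.T                   # h x d, so Theta @ Theta.T == I_h
P = Theta.T @ Theta           # Theta^T Theta is an orthogonal projection

# Identity of Eq. (6): (1+eta)I - Theta^T Theta == eta(1+eta)(eta I + Theta^T Theta)^{-1}
lhs = (1 + eta) * np.eye(d) - P
rhs = eta * (1 + eta) * np.linalg.inv(eta * np.eye(d) + P)
assert np.allclose(lhs, rhs)

# Consequently G0(U, Theta) == G1(U, Theta) for any U (Eqs. (5) and (7)).
U = rng.standard_normal((d, 3))
G0 = alpha * np.trace(U.T @ lhs @ U)
G1 = alpha * eta * (1 + eta) * np.trace(
    U.T @ np.linalg.inv(eta * np.eye(d) + P) @ U)
assert np.allclose(G0, G1)
```

Both assertions hold for any Θ with orthonormal rows, because Θ^T Θ is then an orthogonal projection whose eigenvalues are 0 or 1, on which both sides of Eq. (6) agree.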
Algorithm 1: cASO for multi-task learning
  Input: {(x_{ℓi}, y_{ℓi})}, i ∈ N_{n_ℓ}, ℓ ∈ N_m, h ∈ N.
  Output: U, V, and Θ.
  Parameters: α and β.
  Initialize M subject to the constraints in Eq. (11).
  repeat
    Update U via Eq. (14).
    Compute the SVD of U = P_1 D P_2^T.
    Update M via Eq. (17) and Theorem 4.1.
  until the convergence criterion is satisfied.
  Construct Θ using the top h eigenvectors of M.
  Construct V as V = Θ U.
  Return U, V, and Θ.

4.1. Computation of U for a Given M

For a given M, the optimal U can be computed by solving the following problem:

    min_U  Σ_{ℓ=1}^m ( (1/n_ℓ) Σ_{i=1}^{n_ℓ} L(u_ℓ^T x_{ℓi}, y_{ℓi}) + ĝ_ℓ(u_ℓ) ),    (14)

where ĝ_ℓ(u_ℓ) = α η (1 + η) u_ℓ^T ( η I + M )^{-1} u_ℓ. Given any convex loss function L, the objective function in Eq. (14) is strictly convex, and hence this optimization problem admits a unique minimizer. Note that if L is the SVM hinge loss, the problem in Eq. (14) decouples into m independent quadratic programs (QP), so the optimal {u_ℓ}_{ℓ=1}^m can be computed separately.

4.2. Computation of M for a Given U

For a given U, the optimal M can be computed by solving the following problem:

    min_M  tr( U^T ( η I + M )^{-1} U )
    subject to  tr(M) = h, M ⪯ I, M ∈ S^d_+.    (15)

This problem can be recast as an SDP, which is computationally expensive to solve. We propose an efficient approach to find its optimal solution, in which we only solve a simple eigenvalue optimization problem.

4.2.1. Efficient Computation of M

Given any U ∈ R^{d×m} in Eq. (15), let U = P_1 D P_2^T be its SVD (Golub & Loan, 1996), where P_1 ∈ R^{d×d} and P_2 ∈ R^{m×m} are orthogonal and rank(U) = q. Note that in general q ≤ m ≪ d. It follows that

    D = diag(σ_1, ..., σ_m) ∈ R^{d×m},  σ_1 ≥ ··· ≥ σ_q > 0 = σ_{q+1} = ··· = σ_m.    (16)

We consider the following convex optimization problem (Boyd & Vandenberghe, 2004):

    min_{{γ_i}}  Σ_{i=1}^q σ_i^2 / (η + γ_i)
    subject to  Σ_{i=1}^q γ_i = h,  0 ≤ γ_i ≤ 1,  i ∈ N_q,    (17)

where {σ_i} are the singular values of U defined in Eq. (16). The optimal {γ_i*}_{i=1}^q to Eq. (17) must satisfy γ_1* ≥ γ_2* ≥ ··· ≥ γ_q*, and hence

    1/(η + γ_1*) ≤ 1/(η + γ_2*) ≤ ··· ≤ 1/(η + γ_q*).    (18)

This can be verified by contradiction: assuming γ_i* < γ_{i+1}* given σ_i > σ_{i+1}, we can construct another feasible solution by switching γ_i* and γ_{i+1}* and obtain a smaller objective value in Eq. (17). Note that the problem in Eq. (17) can be solved via many existing algorithms such as the projected gradient descent method (Boyd & Vandenberghe, 2004).

We show that the optimal solution to Eq. (15) can be obtained by solving Eq. (17). We first present the following lemma, which will be useful for proving the theorem that follows.

Lemma 4.1. For any matrix Z ∈ S^d_+, let Z = R Λ_Z R^T be its eigendecomposition, where R ∈ R^{d×d} is orthogonal, Λ_Z = diag(μ_1, ..., μ_d), and μ_1 ≥ ··· ≥ μ_d ≥ 0. Let {Z_{ii}}_{i=1}^d be the diagonal entries of Z, and let π = {π_1, ..., π_p} ⊆ N_d be any integer subset with p distinct elements. Then Σ_{i=1}^p Z_{π_i π_i} ≤ Σ_{j=1}^p μ_j.

Proof. Denote the i-th row vector of R by r_i = [r_{i1}, ..., r_{id}]. For any integer subset π = {π_1, ..., π_p}, we have 0 ≤ Σ_{k=1}^p r_{π_k j}^2 ≤ 1 for each j ∈ N_d, and Σ_{j=1}^d r_{π_k j}^2 = 1 for each k ∈ N_p. The i-th diagonal entry of Z can be expressed as Z_{ii} = Σ_{j=1}^d μ_j r_{ij}^2. It follows that

    Σ_{i=1}^p Z_{π_i π_i} = Σ_{j=1}^d μ_j ( r_{π_1 j}^2 + ··· + r_{π_p j}^2 ) = Σ_{j=1}^d μ_j Σ_{k=1}^p r_{π_k j}^2 ≤ Σ_{j=1}^p μ_j,

where the last inequality holds because the weights c_j = Σ_{k=1}^p r_{π_k j}^2 satisfy 0 ≤ c_j ≤ 1 and Σ_{j=1}^d c_j = p, so the weighted sum is maximized by placing unit weight on the p largest eigenvalues; the maximum is attained when each set {r_{π_1 j}^2, ..., r_{π_p j}^2} (j ∈ N_d) has only one non-zero element of value one, or when p = d. This completes the proof.

We summarize the main result of this subsection in the following theorem.

Theorem 4.1. Let {γ_i*}_{i=1}^q be optimal to Eq. (17), and denote Λ* = diag(γ_1*, ..., γ_q*, 0, ..., 0) ∈ R^{d×d}. Let P_1 ∈ R^{d×d} be the orthogonal matrix consisting of the left singular vectors of U. Then M* = P_1 Λ* P_1^T is an optimal solution to Eq. (15). Moreover, the problem in Eq. (17) attains the same optimal objective value as the one in Eq. (15).

Proof. For any feasible M in Eq. (15), let M = Q Λ Q^T be its eigendecomposition, where Q ∈ R^{d×d} is orthogonal, Λ = diag(λ_1, ..., λ_d), and λ_1 ≥ ··· ≥ λ_d ≥ 0. The problem in Eq. (15) can be rewritten as

    min_{Q, Λ}  tr( (η I + Λ)^{-1} Q^T P_1 D D^T P_1^T Q )
    subject to  Q Q^T = Q^T Q = I,  Λ = diag(λ_1, ..., λ_d),  Σ_{i=1}^d λ_i = h,  1 ≥ λ_1 ≥ ··· ≥ λ_d ≥ 0,    (19)

where D = diag(σ_1, ..., σ_q, 0, ..., 0) is defined in Eq. (16). The reformulated problem in Eq. (19) is equivalent to the one in Eq. (15) and has two separate optimization variables Q and Λ.

We show that the optimization variable Q can be factored out of Eq. (19). Let B = Q^T P_1 D D^T P_1^T Q and denote its diagonal entries by {B_{ii}}_{i=1}^d. It follows from Eq. (16) that B is a positive semidefinite matrix with non-zero eigenvalues {σ_i^2}_{i=1}^q. Given any feasible Λ in Eq. (19), we have

    min_{Q^T Q = Q Q^T = I}  tr( (η I + Λ)^{-1} Q^T P_1 D D^T P_1^T Q )  =  min_{B ∈ S^d_+ : B ~ D D^T}  Σ_{i=1}^d B_{ii} / (η + λ_i),    (20)

where B ~ D D^T indicates that the eigenvalues of B are given by the diagonal elements of D D^T, and the equality means that the two problems attain the same optimal objective value. Following the non-decreasing order of 1/(η + λ_i) (i ∈ N_d) and the bound Σ_{i=1}^p B_{π_i π_i} ≤ Σ_{j=1}^p σ_j^2 for any integer subset {π_i}_{i=1}^p (Lemma 4.1), we can verify that the optimal objective value of Eq. (20) is

    Σ_{i=1}^q σ_i^2 / (η + λ_i) + Σ_{i=q+1}^d 0 / (η + λ_i)  =  Σ_{i=1}^q σ_i^2 / (η + λ_i),    (21)

where this optimum is attained when Q^T P_1 = I (Golub & Loan, 1996) and B = D D^T. It follows from Eq. (21) that the optimal {λ_i*}_{i=1}^d to Eq. (19) satisfy λ_{q+1}* = ··· = λ_d* = 0.

In summary, the optimal objective value of Eq. (19), or equivalently Eq. (15), can be obtained by minimizing Eq. (21) subject to the constraints on {λ_i}, which is exactly the problem in Eq. (17). Since Eq. (20) is minimized when Q = P_1, we conclude that M* = P_1 Λ* P_1^T is optimal to Eq. (15). This completes the proof.

Note that the optimization problem in Eq. (15) is not strictly convex and may have multiple global minimizers with the same objective value; the formulation in Eq. (17) finds one of them.

4.3. Convergence Analysis

The alternating optimization procedure employed in Algorithm 1 (cASO) is widely used for solving many optimization problems efficiently. However, such a procedure does not in general guarantee global convergence. We summarize the global convergence property of the cASO algorithm in the following theorem.

Theorem 4.2. Algorithm 1 converges to a global minimizer of the optimization problem F2 in Eq. (11).

Proof. The proof follows arguments similar to those in (Argyriou et al., 2007; Argyriou et al., 2008).

5. Computation of an Optimal Solution to iASO

Recall that F2 in Eq. (11) is a convex relaxation of iASO in Eq. (3). In this section, we present a theoretical condition under which a globally optimal solution to iASO can be obtained via cASO. We first present the following lemma, which is the key building block of the analysis in this section.

Lemma 5.1. Let {σ_i}_{i=1}^m be defined in Eq. (16) and {γ_i*}_{i=1}^q be optimal to Eq. (17). For any h ∈ N_q, if σ_h / σ_{h+1} ≥ 1 + 1/η, then γ_1* = ··· = γ_h* = 1 and γ_{h+1}* = ··· = γ_q* = 0.

Proof. By contradiction. Assume that γ_1* = ··· = γ_h* = 1 and γ_{h+1}* = ··· = γ_q* = 0 do not hold. Then γ_h* ≠ 1, and hence 0 < γ_{h+1}* ≤ γ_h* < 1, since Σ_{i=1}^q γ_i* = h and γ_i* is non-increasing in i. We show that there exists a feasible solution {γ̂_i}_{i=1}^q such that Σ_{i=1}^q σ_i^2/(η + γ_i*) > Σ_{i=1}^q σ_i^2/(η + γ̂_i), thus reaching a contradiction.

Let γ_a* be the element of {γ_i*}_{i=1}^q with the smallest index a ∈ N_h satisfying γ_a* ≠ 1, and let γ_b* be the element with the largest index b ∈ N_q satisfying γ_b* ≠ 0; it can be verified that a ≤ h and h + 1 ≤ b. For any 0 < δ < min(1 − γ_a*, γ_b*), we construct a feasible solution {γ̂_i}_{i=1}^q to Eq. (17) as γ̂_a = γ_a* + δ, γ̂_b = γ_b* − δ, and γ̂_i = γ_i* for i ≠ a, b, so that 1 ≥ γ̂_1 ≥ ··· ≥ γ̂_q ≥ 0. Moreover, we have

    ( σ_a^2/(η + γ_a*) + σ_b^2/(η + γ_b*) ) − ( σ_a^2/(η + γ_a* + δ) + σ_b^2/(η + γ_b* − δ) )
    = δ ( σ_a^2 / ( (η + γ_a*)(η + γ_a* + δ) ) − σ_b^2 / ( (η + γ_b*)(η + γ_b* − δ) ) )
    ≥ δ σ_{h+1}^2 ( (1 + 1/η)^2 / ( (η + γ_a*)(η + γ_a* + δ) ) − 1 / ( (η + γ_b*)(η + γ_b* − δ) ) )
    > δ σ_{h+1}^2 ( (1 + 1/η)^2 / ( (η + 1)(η + 1) ) − 1/η^2 ) = 0,

where the first inequality follows from σ_a ≥ σ_h ≥ (1 + 1/η) σ_{h+1} and σ_{h+1} ≥ σ_b, and the second (strict) inequality follows from 1 > γ_a*, γ_b* > 0, and 1 ≥ γ_a* + δ, γ_b* − δ ≥ 0. Therefore Σ_{i=1}^q σ_i^2/(η + γ_i*) > Σ_{i=1}^q σ_i^2/(η + γ̂_i), contradicting the optimality of {γ_i*}. This completes the proof.

We conclude this section with the following theorem.

Theorem 5.1. Let the problems F1 and F2 be defined in Eqs. (8) and (11), respectively, and let (U*, M*) be the optimal solution to F2. Let P_1 ∈ R^{d×d} be the orthogonal matrix consisting of the left singular vectors of U*, and let {σ_i}_{i=1}^q be the corresponding non-zero singular values of U* in non-increasing order. Let Θ* be the transpose of the matrix formed by the first h column vectors of P_1, corresponding to the largest h singular values. If σ_h / σ_{h+1} ≥ 1 + 1/η, then (U*, Θ*) is an optimal solution to F1.

Proof. Since (U*, M*) is optimal to F2, it follows from Theorem 4.1 that M* can be expressed as M* = P_1 Λ* P_1^T, where Λ* = diag(γ_1*, ..., γ_d*) can be computed via Eq. (17). Given σ_h / σ_{h+1} ≥ 1 + 1/η, we can verify that γ_i* = 1 if i ∈ N_h and γ_i* = 0 otherwise (Lemma 5.1); therefore M* = Θ*^T Θ*, where Θ*^T ∈ R^{d×h} consists of the first h column vectors of P_1. Moreover, for a given U in F1 and F2, respectively, we have

    min_{Θ Θ^T = I} G1(U, Θ)  ≥  min_{M ∈ M_c} G2(U, M),    (22)

where G1(U, Θ) and G2(U, M) are defined in Eqs. (7) and (12), and M_e and M_c are defined in Eqs. (9) and (10), respectively. Equality in Eq. (22) is attained when the optimal M on the right-hand side is an extreme point of M_c, that is, when it belongs to M_e. For the given U*, if σ_h / σ_{h+1} ≥ 1 + 1/η is satisfied, then Θ* minimizes G1(U*, Θ) and equality in Eq. (22) is attained. Hence (U*, Θ*) is an optimal solution to F1. This completes the proof.

6. Experiments

In this section, we evaluate the proposed cASO algorithm on the tasks of multi-topic web page categorization using the Yahoo web pages data sets (Ueda & Saito, 2002). The Yahoo data sets consist of 11 top-level categories (each category corresponds to one data set), and each category is further divided into a set of second-level categories (each sub-category corresponds to a topic in one data set). We preprocess the data sets by removing topics with fewer than 100 web pages. The TF-IDF scheme is used to represent the web pages, and the obtained feature vectors are normalized to unit length. In the experiments we use the SVM hinge loss; quadratic programs (QP) are solved via MOSEK (http://www.mosek.com/), and SVM is solved via the LIBSVM package (Chang & Lin, 2001).

Performance Comparison. We compare the proposed cASO algorithm with SVM (independent SVMs for multi-task learning), ASO (alternating structure optimization) (Ando & Zhang, 2005), and cMTFL (convex multi-task feature learning) (Argyriou et al., 2008) on the web page categorization tasks. We use Macro F1 and Micro F1 (Lewis et al., 2004) as the performance measures. The parameters of the competing algorithms (the penalty parameter C of SVM and the regularization parameters of ASO, cASO, and cMTFL) are determined via 3-fold cross-validation. In ASO, cASO, and cMTFL, the stopping criterion is that the relative change of the objective value is smaller than 10^-4. We randomly sample 1500 data points from each data set as training data, and the remaining data points are used as test data. The experimental results (averaged over 5 random repetitions) as well as the standard deviations are presented in Tables 1 and 2. We observe that cASO is competitive with the other algorithms on all 11 data sets. Moreover, cASO outperforms ASO (in this supervised setting) on 9 data sets in terms of both Macro F1 and Micro F1. This superiority may be due to the flexibility of the cASO formulation (via the regularization term) and its guaranteed globally optimal solution. The low performance of SVM probably stems from its not utilizing the relationship among the multiple learning tasks.

Efficiency Comparison. We compare the efficiency of cASO with the other algorithms in terms of computation time (in seconds) on the Arts data set. We increase the training sample size from 500 to 2500 and record the computation time. From Figure 1, we observe that SVM is the fastest algorithm, while ASO is the slowest; note that in practice, early stopping can be applied in ASO as in (Ando & Zhang, 2005). cASO is faster than cMTFL when training with a fixed regularization parameter; cASO and cMTFL have comparable efficiency when training with parameter tuning, and their computation time is close to that of SVM.

Table 1. Performance comparison of competing algorithms on six Yahoo data sets. The statistics of the data sets are given with each data set name, where n, d, and m denote the sample size, dimensionality, and number of topics (tasks), respectively. In ASO and cASO, the shared feature dimensionality h is set as (m − 1)/5 × 5.
Arts (n = 7441, d = 17973, m = 19)
  Macro F1: SVM 33.93±1.07, ASO 37.93±1.57, cASO 37.35±0.60, cMTFL 37.06±0.75
  Micro F1: SVM 43.99±1.23, ASO 43.96±0.03, cASO 47.69±0.47, cMTFL 46.31±0.32
Business (n = 9968, d = 16621, m = 17)
  Macro F1: SVM 44.43±0.56, ASO 44.64±0.40, cASO 45.79±0.69, cMTFL 40.90±1.66
  Micro F1: SVM 77.51±0.51, ASO 78.08±0.25, cASO 77.44±0.94, cMTFL 69.00±1.01
Computers (n = 12317, d = 25259, m = 23)
  Macro F1: SVM 30.09±1.10, ASO 28.33±0.67, cASO 33.35±0.84, cMTFL 32.50±0.90
  Micro F1: SVM 55.36±0.63, ASO 54.43±0.40, cASO 54.54±1.07, cMTFL 49.38±4.22
Education (n = 11817, d = 20782, m = 14)
  Macro F1: SVM 39.00±2.42, ASO 36.93±1.98, cASO 41.28±0.90, cMTFL 40.17±0.55
  Micro F1: SVM 48.03±1.56, ASO 46.97±0.37, cASO 49.50±0.57, cMTFL 48.56±0.40
Entertainment (n = 12691, d = 27435, m = 14)
  Macro F1: SVM 46.88±0.47, ASO 47.46±0.37, cASO 49.66±0.97, cMTFL 50.94±1.06
  Micro F1: SVM 55.69±2.45, ASO 57.71±0.33, cASO 57.90±1.38, cMTFL 58.25±0.76
Health (n = 9209, d = 18430, m = 14)
  Macro F1: SVM 56.14±2.58, ASO 57.63±0.74, cASO 61.16±1.70, cMTFL 58.66±2.22
  Micro F1: SVM 61.40±4.76, ASO 65.90±0.39, cASO 68.19±1.01, cMTFL 66.83±1.72

Table 2. Performance comparison of competing algorithms on five Yahoo data sets. Explanation can be found in Table 1.

Recreation (n = 12797, d = 25095, m = 18)
  Macro F1: SVM 43.01±1.44, ASO 43.63±1.29, cASO 47.12±0.73, cMTFL 46.13±0.58
  Micro F1: SVM 49.15±2.32, ASO 50.68±0.18, cASO 53.34±0.90, cMTFL 52.52±0.92
Reference (n = 7929, d = 26397, m = 15)
  Macro F1: SVM 39.37±1.15, ASO 37.46±0.27, cASO 42.11±0.60, cMTFL 43.25±0.81
  Micro F1: SVM 55.11±3.16, ASO 57.72±0.51, cASO 59.39±0.39, cMTFL 58.49±0.51
Science (n = 6345, d = 24002, m = 22)
  Macro F1: SVM 41.80±1.45, ASO 39.26±0.82, cASO 45.46±0.50, cMTFL 42.52±0.59
  Micro F1: SVM 49.27±4.64, ASO 49.05±0.57, cASO 53.32±0.45, cMTFL 50.60±0.76
Social (n = 11914, d = 32492, m = 21)
  Macro F1: SVM 35.87±0.79, ASO 35.29±0.67, cASO 39.30±1.28, cMTFL 38.94±1.88
  Micro F1: SVM 63.05±2.45, ASO 62.77±3.59, cASO 66.04±0.62, cMTFL 65.60±0.63
Society (n = 14507, d = 29189, m = 21)
  Macro F1: SVM 30.68±0.94, ASO 29.42±0.30, cASO 34.84±1.05, cMTFL 33.79±1.43
  Micro F1: SVM 40.07±3.42, ASO 46.13±2.33, cASO 49.27±0.55, cMTFL 46.46±0.87
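For concreteness, Algorithm 1 can also be written out in code. The sketch below is illustrative and is not the implementation used in the experiments above: it substitutes a squared loss for the SVM hinge loss so that the U-update of Eq. (14) has a closed form, solves Eq. (17) via the clipping-plus-bisection rule implied by its KKT conditions, and updates M as in Theorem 4.1. All data, sizes, and helper names are hypothetical; it assumes NumPy:

```python
import numpy as np

def solve_gamma(sigma, eta, h):
    """Solve Eq. (17): min sum_i sigma_i^2/(eta+gamma_i)
    s.t. sum_i gamma_i = h, 0 <= gamma_i <= 1.
    KKT gives gamma_i = clip(sigma_i/sqrt(nu) - eta, 0, 1);
    bisect on the multiplier nu until the sum constraint holds."""
    def gamma(nu):
        return np.clip(sigma / np.sqrt(nu) - eta, 0.0, 1.0)
    lo, hi = 1e-12, (sigma.max() / eta) ** 2 + 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma(mid).sum() > h:
            lo = mid          # gammas too large -> increase nu
        else:
            hi = mid
    return gamma(0.5 * (lo + hi))

def caso(X, Y, h, alpha=1.0, beta=1.0, iters=20):
    """Illustrative cASO (Algorithm 1) with a squared loss per task."""
    n, d = X.shape
    eta = beta / alpha
    c = alpha * eta * (1.0 + eta)      # coefficient of G2 in Eq. (12)
    M = (h / d) * np.eye(d)            # feasible init: tr(M)=h, 0 <= M <= I
    obj = []
    for _ in range(iters):
        # U-update, Eq. (14): per-task ridge with penalty c*(eta*I + M)^{-1}
        Om = c * np.linalg.inv(eta * np.eye(d) + M)
        U = np.linalg.solve(X.T @ X / n + Om, X.T @ Y / n)
        # M-update: SVD of U, then Eq. (17) + Theorem 4.1
        P1, s, _ = np.linalg.svd(U, full_matrices=True)
        gam = solve_gamma(s, eta, h)
        lam = np.zeros(d)
        lam[:len(gam)] = gam
        M = (P1 * lam) @ P1.T          # M = P1 diag(lam) P1^T
        loss = np.sum((X @ U - Y) ** 2) / n
        reg = c * np.trace(U.T @ np.linalg.inv(eta * np.eye(d) + M) @ U)
        obj.append(loss + reg)
    return U, M, obj

rng = np.random.default_rng(0)
n, d, m, h = 100, 20, 5, 2
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal((d, m)) + 0.1 * rng.standard_normal((n, m))
U, M, obj = caso(X, Y, h)
# Both block updates are exact minimizations, so the objective of F2
# should be (numerically) non-increasing, in line with Theorem 4.2.
assert all(obj[i + 1] <= obj[i] + 1e-8 for i in range(len(obj) - 1))
```

Since each block update is an exact minimization of the convex objective of F2, the recorded objective decreases monotonically, matching the convergence behavior stated in Theorem 4.2.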
Figure 1. Comparison of computation time (in seconds) on the Arts data set for SVM, ASO, cASO, and cMTFL, as the training sample size grows from 500 to 2500: (a) training with parameter tuning; (b) training with a fixed penalty/regularization parameter (the best one obtained from tuning). [Figure panels omitted.]

Sensitivity Study. We study the effect of η on the performance of cASO on the Arts data set. Recall that η = β/α; we fix α = 1 and vary β from 10^-4 to 10^5, and record the obtained Macro/Micro F1. The experimental results are presented in Figure 2. We observe that a small β (equivalently, a small η) leads to lower F1, while β ≥ 1 (equivalently, η ≥ 1) leads to the highest F1. In the experiments we also observe that cASO requires more computation time to converge with a small β, while less computation time is required for a large β.

Figure 2. Sensitivity study of η on Macro/Micro F1 (Arts data set; β varied on a log scale). [Figure panels omitted.]

Relationship Study on F0 and F2. We study the relationship between the problems F0 and F2, as well as the condition presented in Theorem 5.1. We vary the parameter η (by fixing α = 1 and varying β accordingly) from 10^3 to 10^-3. F2 is solved via Algorithm 1, and the optimal objective value OBJ_F2 and the value of σ_h/σ_{h+1} are recorded; similarly, F0 is solved via alternating optimization, and its optimal objective value OBJ_F0 is recorded.

Table 3. Comparison of the optimal objective values of F0 and F2 with different choices of β (α = 1, hence η = β).

    β              1000    100     10      1       0.1     0.01    0.001
    1 + 1/η        1.001   1.01    1.1     2       11      101     1001
    σ_h/σ_{h+1}    1.23    1.25    1.34    1.75    3.07    13.79   89.49
    OBJ_F0         52.78   52.65   51.37   40.73   22.15   5.95    0.69
    OBJ_F2         52.78   52.65   51.37   40.71   20.73   4.11    0.41
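The pattern in Table 3 can be reproduced in isolation on synthetic singular values: when σ_h/σ_{h+1} ≥ 1 + 1/η, the solution of Eq. (17) lands on the extreme point γ = (1, ..., 1, 0, ..., 0), so the relaxation is tight (Lemma 5.1); otherwise some γ_i are fractional. The sketch below is illustrative (numbers chosen for demonstration, not from the paper's experiments) and assumes NumPy; the bisection solves the KKT system of Eq. (17):

```python
import numpy as np

def solve_gamma(sigma, eta, h):
    """Solve Eq. (17) via its KKT conditions:
    gamma_i = clip(sigma_i/sqrt(nu) - eta, 0, 1), with the multiplier nu
    chosen by bisection so that sum(gamma) = h."""
    def gamma(nu):
        return np.clip(sigma / np.sqrt(nu) - eta, 0.0, 1.0)
    lo, hi = 1e-12, (sigma.max() / eta) ** 2 + 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if gamma(mid).sum() > h else (lo, mid)
    return gamma(0.5 * (lo + hi))

eta, h = 1.0, 2
# Condition holds: sigma_h / sigma_{h+1} = 2.5/1.0 >= 1 + 1/eta = 2.
g1 = solve_gamma(np.array([3.0, 2.5, 1.0, 0.8]), eta, h)
assert np.allclose(g1, [1.0, 1.0, 0.0, 0.0], atol=1e-6)  # extreme point of M_c
# Condition fails: sigma_h / sigma_{h+1} = 2.5/2.0 = 1.25 < 2.
g2 = solve_gamma(np.array([3.0, 2.5, 2.0, 0.8]), eta, h)
assert 0.0 < g2[2] < 1.0                                  # fractional entry
```

In the first case the optimal M of F2 equals Θ^T Θ for some orthonormal Θ, so OBJ_F0 = OBJ_F2, mirroring the β ∈ {1000, 100, 10} columns of Table 3; the fractional γ in the second case signals a relaxation gap.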
From the experimental results in Table 3, we observe that when β ∈ {1000, 100, 10}, the condition σ_h/σ_{h+1} ≥ 1 + 1/η is satisfied and hence OBJ_F0 = OBJ_F2; otherwise, we observe OBJ_F0 > OBJ_F2. This empirical result is consistent with our theoretical analysis in Theorem 5.1.

7. Conclusion and Future Work

We present a multi-task learning formulation called iASO for learning a shared feature representation from multiple related tasks. Since iASO is non-convex, we convert it into a relaxed convex formulation, and then develop the cASO algorithm to solve the convex relaxation efficiently. Our convergence analysis shows that the cASO algorithm converges to a globally optimal solution of the convex relaxation. We also present a theoretical condition under which cASO finds a globally optimal solution to iASO. We have conducted experiments on benchmark data sets; the experimental results are consistent with our theoretical analysis. We also observe that cASO with a non-zero β tends to improve, or at least maintain, generalization performance compared with ASO, while significantly reducing the computational cost. We are currently investigating how the solutions of cASO depend on the parameters involved in the formulation, as well as their estimation. We plan to compare the presented iASO formulation with multi-task learning formulations based on trace-norm regularization. We also plan to apply the cASO algorithm to applications such as the automatic processing of biomedical texts for tagging gene mentions (Ando, 2007).

Acknowledgments

This work was supported by NSF IIS-0612069, IIS-0812551, CCF-0811790, NIH R01-HG002516, and NGA HM1582-08-1-0016.

References

Amit, Y., Fink, M., Srebro, N., & Ullman, S. (2007). Uncovering shared structures in multiclass classification. Int. Conf. on Mach. Learn. (pp. 17-24).

Ando, R. K. (2007). BioCreative II gene mention tagging system at IBM Watson. Proc. of the 2nd BioCreative Challenge Evaluation Workshop.

Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6, 1817-1853.

Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Mach. Learn., 73, 243-272.

Argyriou, A., Micchelli, C. A., Pontil, M., & Ying, Y. (2007). A spectral regularization framework for multi-task structure learning. Adv. in Neural Info. Proc. Sys.

Bakker, B., & Heskes, T. (2003). Task clustering and gating for Bayesian multitask learning. J. Mach. Learn. Res., 4, 83-99.

Baxter, J. (2000). A model of inductive bias learning. J. Artif. Intell. Res., 12, 149-198.

Bertsekas, D. P. (1999). Nonlinear programming. Athena Scientific.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Caruana, R. (1997). Multitask learning. Mach. Learn., 28, 41-75.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Evgeniou, T., Micchelli, C. A., & Pontil, M. (2005). Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6, 615-637.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Johns Hopkins University Press.

Heisele, B., Serre, T., Pontil, M., Vetter, T., & Poggio, T. (2001). Categorization by learning and combining object parts. Adv. in Neural Info. Proc. Sys. (pp. 1239-1245).

Jacob, L., Bach, F., & Vert, J.-P. (2008). Clustered multi-task learning: a convex formulation. Adv. in Neural Info. Proc. Sys. (pp. 745-752).

Lawrence, N. D., & Platt, J. C. (2004). Learning to learn with the informative vector machine. Int. Conf. on Mach. Learn.

Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res., 5, 361-397.

Obozinski, G., Taskar, B., & Jordan, M. I. (2006). Multi-task feature selection. Technical report, Dept. of Statistics, UC Berkeley.

Overton, M. L., & Womersley, R. S. (1993). Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Program., 62, 321-357.

Quattoni, A., Collins, M., & Darrell, T. (2007). Learning visual representations using images with captions. IEEE Conf. on Comp. Vision and Patt. Recog.

Schwaighofer, A., Tresp, V., & Yu, K. (2004). Learning Gaussian process kernels via hierarchical Bayes. Adv. in Neural Info. Proc. Sys.

Ueda, N., & Saito, K. (2002). Parametric mixture models for multi-labeled text. Adv. in Neural Info. Proc. Sys. (pp. 721-728).

Xue, Y., Liao, X., Carin, L., & Krishnapuram, B. (2007). Multi-task learning for classification with Dirichlet process priors. J. Mach. Learn. Res., 8, 35-63.

Yu, K., Tresp, V., & Schwaighofer, A. (2005). Learning Gaussian processes from multiple tasks. Int. Conf. on Mach. Learn. (pp. 1012-1019).

Yu, S., Tresp, V., & Yu, K. (2007). Robust multi-task learning with t-processes. Int. Conf. on Mach. Learn. (pp. 1103-1110).

Zhang, J., Ghahramani, Z., & Yang, Y. (2005). Learning multiple related tasks using latent independent component analysis. Adv. in Neural Info. Proc. Sys.