A Spectral Regularization Framework for Multi-Task Structure Learning

Andreas Argyriou
Department of Computer Science, University College London
Gower Street, London WC1E 6BT, UK
a.argyriou@cs.ucl.ac.uk

Massimiliano Pontil
Department of Computer Science, University College London
Gower Street, London WC1E 6BT, UK
m.pontil@cs.ucl.ac.uk

Charles A. Micchelli
Department of Mathematics and Statistics, SUNY Albany
1400 Washington Avenue, Albany, NY 12222, USA

Yiming Ying
Department of Engineering Mathematics, University of Bristol
University Walk, Bristol BS8 1TR, UK
enxyy@bristol.ac.uk

Abstract

Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with $L_p$ matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on state of the art statistical performance.

1 Introduction

Recently, there has been renewed interest in the problem of multi-task learning, see [2, 4, 5, 14, 16, 19] and references therein. This problem is important in a variety of applications, ranging from conjoint analysis [12], to object detection in computer vision [18], to multiple microarray data set integration in computational biology [8], to mention just a few. A key objective in many multi-task learning algorithms is to implement mechanisms for learning the possible structure underlying the tasks. Finding this common structure is important because it allows pooling information across the tasks, a property which is particularly appealing when there are many tasks but only few data per task. Moreover, knowledge of the common structure may facilitate learning new tasks (transfer learning), see [6] and references therein.

In this paper, we extend the formulation of [4], where the structure shared by the tasks is described by a positive definite matrix. In Section 2, we propose a framework in which the task parameters and the structure matrix are jointly computed by minimizing a regularization function. This function has the following appealing property. When the structure matrix is fixed, the function decomposes across the tasks, which can hence be learned independently with standard methods such as SVMs. When the task parameters are fixed, the optimal structure matrix is a spectral function of the covariance of the tasks and can often be explicitly computed. As we shall see, spectral functions are of particular interest in this context because they lead to an efficient alternating minimization algorithm.

The contribution of this paper is threefold. First, in Section 3 we provide a necessary and sufficient condition for convexity of the optimization problem. Second, in Section 4 we characterize the spectral functions which relate to Schatten $L_p$ regularization and present the alternating minimization algorithm.
Third, in Section 5 we discuss the connection between our framework and the convex optimization method for learning the kernel [11, 15], which leads to a much simpler proof of the convexity in the kernel than the one given in [15]. Finally, in Section 6 we present experiments on two real data sets. The experiments indicate that the alternating algorithm runs significantly faster than gradient descent and that our method improves on state of the art statistical performance on these data sets. They also highlight that our approach can be used for transfer learning.

2 Modelling Tasks' Structure

In this section, we introduce our multi-task learning framework. We denote by $S^d$ the set of $d \times d$ symmetric matrices, by $S^d_+$ ($S^d_{++}$) the subset of positive semidefinite (definite) ones and by $O^d$ the set of $d \times d$ orthogonal matrices. For every positive integer $n$, we define $\mathbb{N}_n = \{1, \dots, n\}$.

We let $T$ be the number of tasks which we want to learn simultaneously. We assume for simplicity that each task $t \in \mathbb{N}_T$ is well described by a linear function defined, for every $x \in \mathbb{R}^d$, as $w_t^\top x$, where $w_t$ is a fixed vector of coefficients. For each task $t \in \mathbb{N}_T$, there are $m$ data examples $\{(x_{tj}, y_{tj}) : j \in \mathbb{N}_m\} \subseteq \mathbb{R}^d \times \mathbb{R}$ available. In practice, the number of examples per task may vary, but we have kept it constant for simplicity of notation. Our goal is to learn the vectors $w_1, \dots, w_T$, as well as the common structure underlying the tasks, from the data examples.

In this paper we follow the formulation in [4], where the tasks' structure is summarized by a positive definite matrix $D$ which is linked to the covariance matrix between the tasks, $W W^\top$. Here, $W$ denotes the $d \times T$ matrix whose $t$-th column is given by the vector $w_t$ (we have assumed for simplicity that the mean task is zero). Specifically, we learn $W$ and $D$ by minimizing the function

$$\mathrm{Reg}(W, D) := \mathrm{Err}(W) + \gamma\, \mathrm{Penalty}(W, D), \qquad (2.1)$$

where $\gamma$ is a positive parameter which balances the importance between the error and the penalty. The former may be any convex function, bounded from below, evaluated at the values $w_t^\top x_{tj}$, $t \in \mathbb{N}_T$, $j \in \mathbb{N}_m$. Typically, it will be the average error on the tasks, namely, $\mathrm{Err}(W) = \sum_{t \in \mathbb{N}_T} L_t(w_t)$, where $L_t(w_t) = \sum_{j \in \mathbb{N}_m} \ell(y_{tj}, w_t^\top x_{tj})$ and $\ell : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ is a prescribed loss function (e.g. quadratic, SVM, logistic etc.). We shall assume that the loss is convex in its second argument, which ensures that the function $\mathrm{Err}$ is also convex. The latter term favors the tasks sharing some common structure and is given by

$$\mathrm{Penalty}(W, D) = \mathrm{tr}(F(D)\, W W^\top) = \sum_{t=1}^{T} w_t^\top F(D)\, w_t, \qquad (2.2)$$

where $F : S^d_{++} \to S^d_{++}$ is a prescribed spectral matrix function. This is to say that $F$ is induced by applying a function $f : (0, \infty) \to (0, \infty)$ to the eigenvalues of its argument. That is, for every $D \in S^d_{++}$ we write $D = U \Lambda U^\top$, where $U \in O^d$, $\Lambda = \mathrm{Diag}(\lambda_1, \dots, \lambda_d)$, and define

$$F(D) = U F(\Lambda)\, U^\top, \qquad F(\Lambda) = \mathrm{Diag}(f(\lambda_1), \dots, f(\lambda_d)). \qquad (2.3)$$

In the rest of the paper, we will always use $F$ to denote a spectral matrix function and $f$ to denote the associated real function, as above.

Minimization of the function $\mathrm{Reg}$ allows us to learn the tasks and at the same time a good representation for them, which is summarized by the eigenvectors and eigenvalues of the matrix $D$. Different choices of the function $f$ reflect different properties which we would like the tasks to share. In the special case that $f$ is a constant, the tasks are totally independent and the regularizer (2.2) is a sum of $T$ independent $L_2$ regularizers. In the case $f(\lambda) = \lambda^{-1}$, which is considered in [4], the regularizer favors a sparse representation in the sense that the tasks share a small common set of features. More generally, functions of the form $f(\lambda) = \lambda^{-a}$, $a \geq 0$, allow for combining shared features and task-specific features to some degree tuned by the exponent $a$. Moreover, the regularizer (2.2) ensures that the optimal representation (optimal $D$) is a function of the tasks' covariance $W W^\top$.
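To make the construction (2.3) concrete, here is a minimal sketch (our own illustration, not from the paper; it assumes NumPy, and the function names `spectral_apply` and `penalty` are ours) that applies a scalar function $f$ to the eigenvalues of a positive definite $D$ and evaluates the penalty (2.2):

```python
import numpy as np

def spectral_apply(f, D):
    """Apply the scalar function f to the eigenvalues of the symmetric
    positive definite matrix D, as in equation (2.3): U diag(f(lam)) U^T."""
    lam, U = np.linalg.eigh(D)
    return (U * f(lam)) @ U.T

def penalty(W, D, f):
    """Penalty (2.2): tr(F(D) W W^T) = sum_t w_t^T F(D) w_t."""
    return np.trace(spectral_apply(f, D) @ W @ W.T)

# Example with f(lam) = 1/lam, the choice studied in [4].
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))      # d = 5 features, T = 3 tasks
A = rng.standard_normal((5, 5))
D = A @ A.T + np.eye(5)              # positive definite
D = D / np.trace(D)                  # feasible: tr(D) <= 1
print(penalty(W, D, lambda lam: 1.0 / lam))
```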
Thus, we propose to solve the minimization problem

$$\inf \left\{ \mathrm{Reg}(W, D) : W \in \mathbb{R}^{d \times T},\; D \in S^d_{++},\; \mathrm{tr}\, D \leq 1 \right\} \qquad (2.4)$$

for functions $f$ belonging to an appropriate class. As we shall see in Section 4, the upper bound on the trace of $D$ in (2.4) prevents the infimum from being zero, which would lead to overfitting. Moreover, even though the infimum above is not attained in general, the problem in $W$ resulting after partial minimization over $D$ admits a minimizer.

Since the first term in (2.1) is independent of $D$, we can first optimize the second term with respect to $D$. That is, we can compute the infimum

$$\Omega_f(W) := \inf \left\{ \mathrm{tr}(F(D)\, W W^\top) : D \in S^d_{++},\; \mathrm{tr}\, D \leq 1 \right\}. \qquad (2.5)$$

In this way we could end up with an optimization problem in $W$ only. However, in general this would be a complex matrix optimization problem. It may require sophisticated optimization tools such as semidefinite programming, which may not scale well with the size of $W$. Fortunately, as we shall show, problem (2.4) can be efficiently solved by alternately minimizing over $D$ and $W$. In particular, in Section 4 we shall show that $\Omega_f$ is a function of the singular values of $W$ only. Hence, the only matrix operation required by alternate minimization is singular value decomposition and the rest are merely vector problems. Finally, we note that the ideas above may be extended naturally to a reproducing kernel Hilbert space setting [3].

3 Joint Convexity via Matrix Concave Functions

In this section, we address the issue of convexity of the regularization function (2.1). Our main result characterizes the class of spectral functions $F$ for which the term $w^\top F(D)\, w$ is jointly convex in $(w, D)$, which in turn implies that (2.4) is a convex optimization problem. To state our result, we require the matrix analytic concept of concavity, see, for example, [7]. We say that the real-valued function $g : (0, \infty) \to \mathbb{R}$ is matrix concave of order $d$ if

$$\mu\, G(A) + (1 - \mu)\, G(B) \preceq G(\mu A + (1 - \mu) B) \qquad \forall\, A, B \in S^d_{++} \text{ and } \mu \in [0, 1],$$

where $G$ is defined as in (2.3). The notation $\preceq$ denotes the Loewner partial order on $S^d$: $C \preceq D$ if and only if $D - C$ is positive semidefinite. If $g$ is matrix concave of order $d$ for every $d \in \mathbb{N}$, we simply say that $g$ is matrix concave. We also say that $g$ is matrix convex (of order $d$) if $-g$ is matrix concave (of order $d$). Clearly, matrix concavity implies matrix concavity of smaller orders (and hence standard concavity).
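Matrix concavity of a given order can be probed numerically: sample positive definite matrices and check the Loewner inequality through the smallest eigenvalue of the difference. A minimal sketch (our illustration, assuming NumPy), using the square root, which is known to be matrix concave:

```python
import numpy as np

def G(g, A):
    """Lift the scalar function g to symmetric matrices, as in (2.3)."""
    lam, U = np.linalg.eigh(A)
    return (U * g(lam)) @ U.T

def loewner_gap(g, A, B, mu):
    """Smallest eigenvalue of G(mu A + (1-mu) B) - mu G(A) - (1-mu) G(B).
    Nonnegativity for all A, B, mu is matrix concavity of this order."""
    gap = G(g, mu * A + (1 - mu) * B) - mu * G(g, A) - (1 - mu) * G(g, B)
    return np.linalg.eigvalsh(gap).min()

rng = np.random.default_rng(0)
d = 4
for _ in range(1000):
    A = rng.standard_normal((d, d)); A = A @ A.T + 1e-3 * np.eye(d)
    B = rng.standard_normal((d, d)); B = B @ B.T + 1e-3 * np.eye(d)
    # g(x) = sqrt(x) is matrix concave, so the gap is never negative
    assert loewner_gap(np.sqrt, A, B, rng.uniform()) >= -1e-10
```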
Theorem 3.1. Let $F : S^d_{++} \to S^d_{++}$ be a spectral function. Then the function $\varphi : \mathbb{R}^d \times S^d_{++} \to [0, \infty)$ defined as $\varphi(w, D) = w^\top F(D)\, w$ is jointly convex if and only if $\frac{1}{f}$ is matrix concave of order $d$.

Proof. By definition, $\varphi$ is convex if and only if, for any $w_1, w_2 \in \mathbb{R}^d$, $D_1, D_2 \in S^d_{++}$ and $\mu \in (0, 1)$, it holds that

$$\varphi(\mu w_1 + (1 - \mu) w_2,\; \mu D_1 + (1 - \mu) D_2) \leq \mu\, \varphi(w_1, D_1) + (1 - \mu)\, \varphi(w_2, D_2).$$

Let $C := F(\mu D_1 + (1 - \mu) D_2)$, $A := F(D_1)/\mu$, $B := F(D_2)/(1 - \mu)$, $w := \mu w_1 + (1 - \mu) w_2$ and $z := \mu w_1$. Using this notation, the above inequality can be rewritten as

$$w^\top C\, w \leq z^\top A\, z + (w - z)^\top B\, (w - z) \qquad \forall\, w, z \in \mathbb{R}^d. \qquad (3.1)$$

The right hand side in (3.1) is minimized for $z = (A + B)^{-1} B\, w$ and hence (3.1) is equivalent to

$$w^\top C\, w \leq w^\top \left( B (A + B)^{-1} A (A + B)^{-1} B + \left( I - (A + B)^{-1} B \right)^\top B \left( I - (A + B)^{-1} B \right) \right) w \qquad \forall\, w \in \mathbb{R}^d,$$

or to

$$C \preceq B (A + B)^{-1} A (A + B)^{-1} B + B - 2 B (A + B)^{-1} B + B (A + B)^{-1} B (A + B)^{-1} B = B - B (A + B)^{-1} B = (A^{-1} + B^{-1})^{-1},$$

where the last equality follows from the matrix inversion lemma [10, Sec. 0.7]. The above inequality is identical to $A^{-1} + B^{-1} \preceq C^{-1}$ (see e.g. [10, Sec. 7.7]) or, using the initial notation,

$$\mu\, F^{-1}(D_1) + (1 - \mu)\, F^{-1}(D_2) \preceq F^{-1}(\mu D_1 + (1 - \mu) D_2).$$

By definition, this inequality holds for any $D_1, D_2 \in S^d_{++}$, $\mu \in (0, 1)$ if and only if $\frac{1}{f}$ is matrix concave of order $d$.

Examples of matrix concave functions on $(0, \infty)$ are $\log(x + 1)$ and the function $x^s$ for $s \in [0, 1]$; see [7] for other examples and theoretical results. We conclude with the remark that, whenever $\frac{1}{f}$ is matrix concave of order $d$, the function $\Omega_f$ in (2.5) is convex, because it is the partial infimum of a jointly convex function [9, Sec. IV.2.4].

4 Regularization with Schatten $L_p$ Prenorms

4.1 Partial Minimization of the Penalty Term

In this section, we focus on the family of negative power functions $f$ and obtain that the function $\Omega_f$ in (2.5) relates to the Schatten $L_p$ prenorms. We start by showing that problem (2.5) reduces to a minimization problem in $\mathbb{R}^d$, by application of a useful matrix inequality. In the following, we let $B$ take the place of $W W^\top$ for brevity.

Lemma 4.1. Let $F : S^d \to S^d$ be a spectral function, $B \in S^d$ and $\beta_i$, $i \in \mathbb{N}_d$, the eigenvalues of $B$. Then

$$\inf \left\{ \mathrm{tr}(F(D)\, B) : D \in S^d_{++},\; \mathrm{tr}\, D \leq 1 \right\} = \inf \left\{ \sum_{i \in \mathbb{N}_d} f(\lambda_i)\, \beta_i : \lambda_i > 0,\; i \in \mathbb{N}_d,\; \sum_{i \in \mathbb{N}_d} \lambda_i \leq 1 \right\}.$$

Moreover, for the infimum on the left to be attained, $F(D)$ has to share a set of eigenvectors with $B$ so that the corresponding eigenvalues are in the reverse order as the $\beta_i$.

Proof. We use an inequality of von Neumann [13, Sec. H.1.h] to obtain, for all $X, Y \in S^d$, that

$$\sum_{i \in \mathbb{N}_d} \sigma_i\, \mu_i \leq \mathrm{tr}(X Y),$$

where $\sigma_i$ and $\mu_i$ are the eigenvalues of $X$ and $Y$ in nonincreasing and nondecreasing order, respectively. The equality is attained whenever $X = U \mathrm{Diag}(\sigma)\, U^\top$, $Y = U \mathrm{Diag}(\mu)\, U^\top$ for some $U \in O^d$. Applying this inequality for $X = F(D)$, $Y = B$ and denoting $f(\lambda_i) = \sigma_i$, $i \in \mathbb{N}_d$, the result follows.

Using this lemma, we can now derive the solution of problem (2.5) in the case that $f$ is a negative power function.

Proposition 4.2. Let $B \in S^d_+$ and $s \in (0, 1]$. Then we have that

$$\left( \mathrm{tr}\, B^s \right)^{\frac{1}{s}} = \inf \left\{ \mathrm{tr}\left( D^{\frac{s-1}{s}} B \right) : D \in S^d_{++},\; \mathrm{tr}\, D \leq 1 \right\}.$$

Moreover, if $B \in S^d_{++}$ the infimum is attained and the minimizer is given by $D = \frac{B^s}{\mathrm{tr}\, B^s}$.

Proof. By Lemma 4.1, it suffices to show the analogous statement for vectors, namely that

$$\left( \sum_{i \in \mathbb{N}_d} \beta_i^s \right)^{\frac{1}{s}} = \inf \left\{ \sum_{i \in \mathbb{N}_d} \lambda_i^{\frac{s-1}{s}} \beta_i : \lambda_i > 0,\; i \in \mathbb{N}_d,\; \sum_{i \in \mathbb{N}_d} \lambda_i \leq 1 \right\},$$

where $\beta_i \geq 0$, $i \in \mathbb{N}_d$. To this end, we apply Hölder's inequality with $p = \frac{1}{s}$ and $q = \frac{1}{1-s}$:

$$\sum_{i \in \mathbb{N}_d} \beta_i^s = \sum_{i \in \mathbb{N}_d} \left( \lambda_i^{\frac{s-1}{s}} \beta_i \right)^s \lambda_i^{1-s} \leq \left( \sum_{i \in \mathbb{N}_d} \lambda_i^{\frac{s-1}{s}} \beta_i \right)^s \left( \sum_{i \in \mathbb{N}_d} \lambda_i \right)^{1-s} \leq \left( \sum_{i \in \mathbb{N}_d} \lambda_i^{\frac{s-1}{s}} \beta_i \right)^s.$$

When $\beta_i > 0$, $i \in \mathbb{N}_d$, the equality is attained for $\lambda_i = \frac{\beta_i^s}{\sum_{j \in \mathbb{N}_d} \beta_j^s}$, $i \in \mathbb{N}_d$. To show that the inequality is sharp in all other cases, we replace $\beta_i$ by $\beta_{i,\varepsilon} := \beta_i + \varepsilon$, $i \in \mathbb{N}_d$, $\varepsilon > 0$, define $\lambda_{i,\varepsilon} = \beta_{i,\varepsilon}^s / \left( \sum_{j \in \mathbb{N}_d} \beta_{j,\varepsilon}^s \right)$ and take the limits as $\varepsilon \to 0$.

The above result implies that the regularization problem (2.4) is conceptually equivalent to regularization with a Schatten $L_p$ prenorm of $W$, when the coupling function $f$ takes the form $f(x) = x^{1 - \frac{2}{p}}$ with $p \in (0, 2]$, $p = 2s$. The Schatten $L_p$ prenorm is the $L_p$ prenorm of the singular values of a matrix. In particular, trace norm regularization (see [1, 17]) corresponds to the case $p = 1$. We also note that generalization error bounds for Schatten $L_p$ norm regularization can be derived along the lines of [14].
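Proposition 4.2 is easy to check numerically; the following sketch (ours, not from the paper, assuming NumPy) verifies that the closed-form minimizer $D = B^s / \mathrm{tr}\, B^s$ attains the value $(\mathrm{tr}\, B^s)^{1/s}$:

```python
import numpy as np

def mat_pow(B, r):
    """Fractional power of a symmetric positive definite matrix."""
    lam, U = np.linalg.eigh(B)
    return (U * lam**r) @ U.T

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
B = X @ X.T + np.eye(4)              # B positive definite
s = 0.5                              # s = p/2; here the trace norm case p = 1

lhs = np.trace(mat_pow(B, s)) ** (1.0 / s)
D = mat_pow(B, s) / np.trace(mat_pow(B, s))      # minimizer from Prop. 4.2
rhs = np.trace(mat_pow(D, (s - 1.0) / s) @ B)    # objective value at D
assert np.isclose(lhs, rhs)
```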
4.2 Learning Algorithm

Lemma 4.1 demonstrates that optimization problems such as (2.4) with spectral regularizers of the form (2.2) are computationally appealing, since they decompose to vector problems in $d$ variables along with a singular value decomposition of the matrix $W$. In particular, for the Schatten $L_p$ prenorm with $p \in (0, 2]$, the proof of Proposition 4.2 suggests a way to solve problem (2.4). We modify the penalty term (2.2) as

$$\mathrm{Penalty}_\varepsilon(W, D) = \mathrm{tr}\left( F(D)\, (W W^\top + \varepsilon I) \right), \qquad (4.1)$$

where $\varepsilon > 0$, and let $\mathrm{Reg}_\varepsilon(W, D) = \mathrm{Err}(W) + \gamma\, \mathrm{Penalty}_\varepsilon(W, D)$ be the corresponding regularization function. By Proposition 4.2, for a fixed $W \in \mathbb{R}^{d \times T}$ there is a unique minimizer of $\mathrm{Penalty}_\varepsilon$ (under the constraints in (2.5)), given by the formula

$$D_\varepsilon(W) = \frac{(W W^\top + \varepsilon I)^{\frac{p}{2}}}{\mathrm{tr}\,(W W^\top + \varepsilon I)^{\frac{p}{2}}}. \qquad (4.2)$$

Moreover, there exists a minimizer of problem (2.4), which is unique if $p \in (1, 2]$.

Therefore, we can solve problem (2.4) using an alternating minimization algorithm, which is an extension of the one presented in [4] for the special case $F(D) = D^{-1}$. Each iteration of the algorithm consists of two steps. In the first step, we keep $D$ fixed and minimize over $W$. This consists in solving the problem

$$\min \left\{ \sum_{t \in \mathbb{N}_T} L_t(w_t) + \gamma \sum_{t \in \mathbb{N}_T} w_t^\top F(D)\, w_t : W \in \mathbb{R}^{d \times T} \right\}.$$

This minimization can be carried out independently for each task, since the regularizer decouples when $D$ is fixed. Specifically, introducing new variables for $(F(D))^{\frac{1}{2}} w_t$ yields a standard $L_2$ regularization problem for each task with the same kernel $K(x, z) = x^\top (F(D))^{-1} z$, $x, z \in \mathbb{R}^d$. In other words, we simply learn the parameters $w_t$ (the columns of matrix $W$) independently by a regularization method, for example by an SVM or ridge regression method, for which there are well developed tool boxes. In the second step, we keep matrix $W$ fixed and minimize over $D$ using equation (4.2).

Space limitations prevent us from providing a convergence proof of the algorithm. We only note that, following the proof detailed in [3] for the case $p = 1$, one can show that the sequence produced by the algorithm converges to the unique minimizer of $\mathrm{Reg}_\varepsilon$ if $p \in [1, 2]$, or to a local minimizer if $p \in (0, 1)$. Moreover, by [3, Thm. 3], as $\varepsilon$ goes to zero the algorithm converges to a solution of problem (2.4), if $p \in [1, 2]$. In theory, an algorithm without $\varepsilon$-perturbation does not converge to a minimizer, since the columns of $W$ and $D$ always remain in the initial column space. In practice, however, we have observed that even such an algorithm converges to an optimal solution, because of round-off effects.
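As an illustration of the two steps, here is a minimal sketch of the alternating scheme for the square loss, so that step one is a closed-form ridge regression per task. It assumes NumPy; all names are ours rather than from the paper's Matlab implementation:

```python
import numpy as np

def mat_pow(B, r):
    lam, U = np.linalg.eigh(B)
    return (U * lam**r) @ U.T

def alternating_minimization(X, Y, gamma, p, eps=1e-6, iters=30):
    """X: (T, m, d) array of inputs, Y: (T, m) array of outputs.
    Square loss, f(x) = x^(1 - 2/p) as in Section 4; returns W (d x T), D."""
    T, m, d = X.shape
    D = np.eye(d) / d                    # feasible start: tr(D) = 1
    W = np.zeros((d, T))
    for _ in range(iters):
        F = mat_pow(D, 1.0 - 2.0 / p)    # F(D), equation (2.3)
        for t in range(T):               # step 1: ridge regression per task
            W[:, t] = np.linalg.solve(X[t].T @ X[t] + gamma * F,
                                      X[t].T @ Y[t])
        C = W @ W.T + eps * np.eye(d)    # step 2: closed-form update (4.2)
        D = mat_pow(C, p / 2.0) / np.trace(mat_pow(C, p / 2.0))
    return W, D
```

For $p = 1$ this specializes to the trace norm algorithm of [4], since then $F(D) = D^{-1}$.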
5 Relation to Learning the Kernel

In this section, we discuss the connection between the multi-task framework (2.1)-(2.4) and the framework for learning the kernel, see [11, 15] and references therein. To this end, we define the kernel $K_{f(D)}(x, z) = x^\top (F(D))^{-1} z$, $x, z \in \mathbb{R}^d$, the set of kernels $\mathcal{K}_f = \{ K_{f(D)} : D \in S^d_{++},\; \mathrm{tr}\, D \leq 1 \}$ and, for every kernel $K$, the task kernel matrix $K_t = (K(x_{ti}, x_{tj}) : i, j \in \mathbb{N}_m)$, $t \in \mathbb{N}_T$. It is easy to prove, using Weyl's monotonicity theorem [10, Sec. 4.3] and [7, Thm. V.2.5], that the set $\mathcal{K}_f$ is convex if and only if $\frac{1}{f}$ is matrix concave. By the well-known representer theorem (see e.g. [11]), problem (2.4) is equivalent to minimizing the function

$$\sum_{t \in \mathbb{N}_T} \left( \sum_{i \in \mathbb{N}_m} \ell\left( y_{ti}, (K_t c_t)_i \right) + \gamma\, c_t^\top K_t c_t \right) \qquad (5.1)$$

over $c_t \in \mathbb{R}^m$ (for $t \in \mathbb{N}_T$) and $K \in \mathcal{K}_f$. It is apparent that the function (5.1) is not jointly convex in the $c_t$ and $K$. However, minimizing each term over the vector $c_t$ gives a convex function of $K$.

Proposition 5.1. Let $\mathcal{K}$ be the set of all reproducing kernels on $\mathbb{R}^d$. If $\ell(y, \cdot)$ is convex for any $y \in \mathbb{R}$, then the function $E_t : \mathcal{K} \to [0, \infty)$ defined for every $K \in \mathcal{K}$ as

$$E_t(K) = \min \left\{ \sum_{i \in \mathbb{N}_m} \ell\left( y_{ti}, (K_t c)_i \right) + \gamma\, c^\top K_t c : c \in \mathbb{R}^m \right\}$$

is convex.

Proof. Without loss of generality, we can assume as in [15] that the $K_t$ are invertible for all $t \in \mathbb{N}_T$. For every $a \in \mathbb{R}^m$ and $K \in \mathcal{K}$, we define the function $G_t(a, K) = \sum_{i \in \mathbb{N}_m} \ell(y_{ti}, a_i) + \gamma\, a^\top K_t^{-1} a$, which is jointly convex by Theorem 3.1. Clearly, $E_t(K) = \min\{ G_t(a, K) : a \in \mathbb{R}^m \}$. Recalling that the partial minimum of a jointly convex function is convex [9, Sec. IV.2.4], we obtain the convexity of $E_t$.

The fact that the function $E_t$ is convex has already been proved in [15], using minimax theorems and Fenchel duality. Here, we were able to simplify the proof of this result by appealing to the joint convexity property stated in Theorem 3.1.

6 Experiments

In this section, we first report a comparison of the computational cost between the alternating minimization algorithm and the gradient descent algorithm. We then study how performance varies for different $L_p$ regularizers, compare our approach with other multi-task learning methods and report experiments on transfer learning.

We used two data sets in our experiments. The first one is the computer survey data from [12]. It was taken from a survey of 180 persons who rated the likelihood of purchasing one of 20 different personal computers. Here the persons correspond to tasks and the computer models to examples. The input represents 13 different computer characteristics (price, CPU, RAM etc.), while the output is an integer rating on the scale 0-10. Following [12], we used the first 8 examples per task as the training data and the last 4 examples per task as the test data. We measured the root mean square error of the predicted from the actual ratings for the test data, averaged across people.

The second data set is the school data set from the Inner London Education Authority (see http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml). It consists of examination scores of 15362 students from 139 secondary schools in London. Thus, there are 139 tasks, corresponding to predicting student performance in each school. The input consists of the year of the examination, 4 school-specific and 3 student-specific attributes. Following [5], we replaced categorical attributes with binary ones, to obtain 27 attributes in total. We generated the training and test sets by 10 random splits of the data, so that 75% of the examples from each school (task) belong to the training set and 25% to the test set. Here, in order to compare our results with those in [5], we used the measure of percentage explained variance, which is defined as one minus the mean squared test error over the variance of the test data and indicates the percentage of variance explained by the prediction model. Finally, we note that in both data sets we used the square loss, tuned the regularization parameter with 5-fold cross-validation and added an additional input component accounting for the bias term.

In the first experiment, we study the computational cost of the alternating minimization algorithm against the gradient descent algorithm, both implemented in Matlab, for the Schatten $L_{1.5}$ norm. The left plot in Figure 1 shows the value of the objective function (2.1) versus the number of iterations, on the computer survey data. The curves for different learning rates $\eta$ are shown; for rates greater than 0.05 gradient descent diverges. The alternating algorithm curve for $\varepsilon = 10^{-16}$ is also shown. We further note that for both data sets our algorithm typically needed fewer than 30 iterations to converge. The right plot depicts the CPU time (in seconds) needed to reach a value of the objective function which is less than $10^{-5}$ away from the minimum, versus the number of tasks. It is clear that our algorithm is at least an order of magnitude faster than gradient descent with the optimal learning rate and scales better with the number of tasks.

[Figure 1: Comparison between the alternating algorithm and the gradient descent algorithm. Left: objective function Reg versus iterations, for gradient descent with learning rates $\eta$ = 0.05, 0.03, 0.01 and for the alternating algorithm. Right: CPU time in seconds versus number of tasks, for gradient descent with $\eta$ = 0.05 and for the alternating algorithm.]

We note that the computational cost of our method is mainly due to the $T$ ridge regressions in the supervised step (learning $W$) and the singular value decomposition in the unsupervised step (learning $D$). A singular value decomposition is also needed in gradient descent, for computing the gradient of the Schatten $L_p$ norm. We have observed that the cost per iteration is smaller for gradient descent, but the number of iterations is at least an order of magnitude larger, leading to the large difference in time cost.
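For reference, the quantity that makes each gradient step expensive is the gradient of the Schatten $L_p$ prenorm $\|W\|_p^p = \sum_i \sigma_i^p$, which requires a full singular value decomposition. A sketch (ours, assuming NumPy and a $W$ with nonzero singular values):

```python
import numpy as np

def schatten_p_grad(W, p):
    """Gradient of ||W||_p^p = sum_i sigma_i^p with respect to W,
    valid when all singular values are nonzero: p U diag(sigma^(p-1)) V^T."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return p * (U * sigma ** (p - 1.0)) @ Vt
```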
In the second experiment, we study the statistical performance of our method as the spectral function changes. Specifically, we choose functions giving rise to Schatten $L_p$ prenorms, as discussed in Section 4. The results, shown in Figure 2, indicate that the trace norm is the best norm on these data sets. However, on the computer survey data a value of $p$ less than one gives the best result overall. From this we speculate that our method can even approximate well the solutions of certain non-convex problems. In contrast, on the school data the trace norm gives almost the best result.

[Figure 2: Performance versus $p$ for the computer survey data (left, RMSE) and the school data (right, explained variance).]

Next, in Table 1, we compare our algorithm with the hierarchical Bayes (HB) method described in [5, 12]. This method also learns a matrix $D$ using Bayesian inference. Our method improves on the HB method on the computer survey data and is competitive on the school data (even though our regularizer is simpler than HB and the data splits of [5] are not available).

Table 1: Comparison of different methods on the computer survey data (left) and the school data (right).

  Computer survey data:
    Method                      RMSE
    p = 2                       3.88
    p = 1                       1.93
    p = 0.7                     1.86
    Hierarchical Bayes [12]     1.90

  School data:
    Method                      Explained variance
    p = 2                       23.5 ± 2.0%
    p = 1                       26.7 ± 2.0%
    Hierarchical Bayes [5]      29.5 ± 0.4%

Finally, we present preliminary results on transfer learning. On the computer survey data, we trained our method with $p = 1$ on 150 randomly selected tasks and then used the learned structure matrix $D$ for training 30 ridge regressions on the remaining tasks. We obtained an RMSE of 1.98 on these 30 "new" tasks, which is not much worse than an RMSE of 1.88 on the 150 tasks. In comparison, when using the raw data ($D = I_d / d$) on the 30 tasks we obtained an RMSE of 3.83. A similar experiment was performed on the school data, first training on a random subset of 110 schools and then transferring $D$ to the remaining 29 schools. We obtained an explained variance of 19.2% on the new tasks. This was worse than the explained variance of 24.8% on the 110 tasks, but still better than the explained variance of 13.9% with the raw representation.
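In code, this transfer protocol simply freezes the learned $D$ and solves one ridge regression with penalty matrix $F(D)$ per new task; a minimal sketch under the same assumptions (square loss, NumPy) as our earlier snippets:

```python
import numpy as np

def transfer_ridge(X_new, Y_new, D, gamma, p):
    """Ridge regression for one new task with the structure matrix D fixed.
    X_new: (m, d) inputs, Y_new: (m,) outputs; square loss assumed."""
    lam, U = np.linalg.eigh(D)
    F = (U * lam ** (1.0 - 2.0 / p)) @ U.T    # F(D), equation (2.3)
    return np.linalg.solve(X_new.T @ X_new + gamma * F, X_new.T @ Y_new)
```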
7 Conclusion

We have presented a spectral regularization framework for learning the structure shared by many supervised tasks. This structure is summarized by a positive definite matrix which is a spectral function of the tasks' covariance matrix. The framework is appealing both theoretically and practically. Theoretically, it brings to bear the rich class of spectral functions, which is well studied in matrix analysis. Practically, we have argued, via the concrete example of negative power spectral functions, that the tasks' parameters and the structure matrix can be efficiently computed using an alternating minimization algorithm, improving upon state of the art statistical performance on two real data sets. A natural question is to which extent the framework can be generalized to allow for more complex task sharing mechanisms, in which the structure parameters depend on higher order statistical properties of the tasks.

Acknowledgements

This work was supported by EPSRC Grant EP/D052807/1, NSF Grant DMS 0712827 and by the IST Programme of the European Commission, PASCAL Network of Excellence IST-2002-506778.

References

[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.
[2] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817-1853, 2005.
[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 2007. In press.
[4] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41-48, 2007.
[5] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi-task learning. Journal of Machine Learning Research, 4:83-99, 2003.
[6] J. Baxter. A model for inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.
[7] R. Bhatia. Matrix Analysis. Graduate Texts in Mathematics. Springer, 1997.
[8] R. Chari, W. W. Lockwood, B. P. Coe, et al. SIGMA: a system for integrative genomic microarray analysis of cancer genomes. BMC Genomics, 7:324, 2006.
[9] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms. Springer, 1996.
[10] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[11] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
[12] P. J. Lenk, W. S. DeSarbo, P. E. Green, and M. R. Young. Hierarchical Bayes conjoint analysis: recovery of partworth heterogeneity from reduced experimental designs. Marketing Science, 15(2):173-191, 1996.
[13] A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization and its Applications. Academic Press, 1979.
[14] A. Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 7:117-139, 2006.
[15] C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099-1125, 2005.
[16] R. Raina, A. Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[17] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pages 1329-1336, 2005.
[18] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 2:762-769, 2004.
[19] J. Zhang, Z. Ghahramani, and Y. Yang. Learning multiple related tasks using latent independent component analysis. In Advances in Neural Information Processing Systems 18, pages 1585-1592, 2006.