Multi-Task Learning via Conic Programming

Tsuyoshi Kato, Hisashi Kashima, Masashi Sugiyama, and Kiyoshi Asai

Affiliations: Graduate School of Frontier Sciences, The University of Tokyo; Institute for Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST); Tokyo Research Laboratory, IBM Research; Department of Computer Science, Tokyo Institute of Technology; AIST Computational Biology Research Center.
kato-tsuyoshi@cb.k.u-tokyo.ac.jp, kashipong@yahoo.co.jp, sugi@cs.titech.ac.jp, asai@cbrc.jp

Abstract

When we have several related tasks, solving them simultaneously has been shown to be more effective than solving them individually. This approach is called multi-task learning (MTL) and has been studied extensively. Existing approaches to MTL often treat all tasks as uniformly related to each other, and the relatedness of the tasks is controlled globally. For these reasons, the existing methods can lead to undesired solutions when some tasks are not highly related to each other, and some pairs of related tasks can have significantly different solutions. In this paper, we propose a novel MTL algorithm that overcomes these problems. Our method makes use of a task network, which describes the relation structure among tasks. This allows us to deal with intricate relation structures in a systematic way. Furthermore, we control the relatedness of the tasks locally, so all pairs of related tasks are guaranteed to have similar solutions. We apply the above idea to support vector machines (SVMs) and show that the optimization problem can be cast as a second-order cone program, which is convex and can be solved efficiently. The usefulness of our approach is demonstrated through simulations with protein super-family classification and ordinal regression problems.

1 Introduction

In many practical situations, a classification task can often be divided into related sub-tasks.
Since the related sub-tasks tend to share common factors, solving them together is expected to be more advantageous than solving them independently. This approach is called multi-task learning (MTL, a.k.a. inductive transfer or learning to learn) and has been theoretically and experimentally shown to be useful [4, 5, 8]. Typically, the `relatedness' among tasks is implemented by constraining the solutions of related tasks to be similar (e.g., [5]). However, the MTL methods developed so far have several limitations. First, it is often assumed that all sub-tasks are related to each other [5]. However, this may not always be true in practice: some tasks may be related, but others may not be. The second problem is that the related tasks are often required to be close in the sense that the sum of the distances between solutions over all pairs of related tasks is upper-bounded [8] (which is often referred to as the global constraint [10]). This implies that the solutions of related tasks are not necessarily all close; some can be quite different. In this paper, we propose a new MTL method which overcomes the above limitations. We settle the first issue by making use of a task network that describes the relation structure among tasks. This enables us to deal with intricate relation structures in a systematic way. We solve the second problem by directly upper-bounding each distance between the solutions of related task pairs (which we call local constraints). We apply this idea in the framework of support vector machines (SVMs) and show that linear SVMs can be trained via a second-order cone program (SOCP) [3] in the primal. An SOCP is a convex problem, and the global solution can be computed efficiently.
We further show that the kernelized version of the proposed method can be formulated as a matrix-fractional program (MFP) [3] in the dual, which can again be cast as an SOCP; thus the optimization problem of the kernelized variant is still convex, and the global solution can be computed efficiently. Through experiments with artificial and real-world protein super-family classification data sets, we show that the proposed MTL method compares favorably with existing MTL methods. We further test the performance of the proposed approach in ordinal regression scenarios [9], where the goal is to predict ordinal class labels such as users' preferences (`like'/`neutral'/`dislike') or students' grades (from `A' to `F'). Ordinal regression problems can be formulated as a set of one-versus-one classification problems, e.g., `like' vs. `neutral' and `neutral' vs. `dislike'. In ordinal regression, the relatedness among tasks is highly structured. That is, the solutions (decision boundaries) of adjacent problems are expected to be similar, but others may not be related; e.g., `A' vs. `B' and `B' vs. `C' would be related, but `A' vs. `B' and `E' vs. `F' may not be. Our experiments demonstrate that the proposed method is also useful in ordinal regression scenarios and tends to outperform existing approaches [9, 8].

2 Problem Setting

In this section, we formulate the MTL problem. Let us consider M binary classification tasks, which all share the common input-output space X × {±1}. For the time being, we assume X = R^d for simplicity; later in Section 4, we extend it to reproducing kernel Hilbert spaces. Let {(x_t, y_t)}_{t=1}^ℓ be the training set, where x_t ∈ X and y_t ∈ {±1} for t = 1, ..., ℓ. Each data sample (x_t, y_t) has its target task; we denote the set of sample indices of the i-th task by I_i. We assume that each sample belongs to only a single task, i.e., the index sets are exclusive: sum_{i=1}^M |I_i| = ℓ and I_i ∩ I_j = ∅ for i ≠ j.
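To make the notation concrete, the exclusive index sets I_i can be represented by an M × ℓ 0/1 indicator matrix (the same representation reappears as the matrix Z in Section 4). The following is a minimal sketch under the assumption that the task assignment is given as a per-sample array of task indices; the helper name is hypothetical:

```python
import numpy as np

def task_indicator(task_of, M):
    """Build Z with Z[i, t] = 1 iff sample t belongs to task i.
    Exclusiveness of the index sets I_i means every column has exactly one 1."""
    Z = np.zeros((M, len(task_of)), dtype=int)
    for t, i in enumerate(task_of):
        Z[i, t] = 1
    return Z
```

For example, `task_indicator([0, 0, 1, 2], 3)` assigns the first two samples to task 0; column sums of one certify that the index sets partition the samples.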
The goal is to learn the score function of each classification task: f_i(x; w_i, b_i) = w_i^T x + b_i, for i = 1, ..., M, where w_i ∈ R^d and b_i ∈ R are the model parameters of the i-th task. We assume that a task network is available. The task network describes the relationships among tasks, where each node represents a task and two nodes are connected by an edge if they are related to each other(*). We denote the edge set by E = {(i_k, j_k)}_{k=1}^K.

3 Local MTL with Task Network: Linear Version

In this section, we propose a new MTL method.

3.1 Basic Idea

When the relation among tasks is not available, we may just solve M penalized fitting problems individually:

  min_{w_i, b_i}  (1/2)||w_i||^2 + C sum_{t ∈ I_i} Hinge(f_i(x_t; w_i, b_i), y_t),  for i = 1, ..., M,    (1)

where C ∈ R_+ is a regularization constant and Hinge(·, ·) is the hinge loss function: Hinge(f, y) = max(1 − f y, 0). This individual approach tends to perform poorly if the number of training samples in each task is limited; the performance is expected to improve if more training samples are available. Here, we can exploit the information of the task network. A naive idea would be to use the training samples of neighboring tasks in the task network for solving the target fitting problem. However, this does not fully make use of the network structure, since many other tasks are indirectly connected via paths on the network. To cope with this problem, we take another approach here, which is based on the expectation that the solutions of related tasks are close to each other.

(*) More generally, the tasks can be related in an inhomogeneous way, i.e., the strength of the relationship can depend on the task pair. This general setting can be similarly formulated by a weighted network, where edges are weighted according to the strength of the connections. All the discussions in this paper can be easily extended to weighted networks, but for simplicity we focus on unweighted networks.
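As an illustrative sketch (not the authors' code), the per-task objective (1) can be written down directly; `X` and `y` below hold the samples of a single task:

```python
import numpy as np

def hinge(f, y):
    """Hinge loss: Hinge(f, y) = max(1 - f*y, 0)."""
    return max(1.0 - f * y, 0.0)

def individual_objective(w, b, X, y, C):
    """Objective (1) for one task: (1/2)||w||^2 + C * sum_t Hinge(f(x_t), y_t),
    with the linear score f(x; w, b) = w'x + b."""
    scores = X @ w + b
    return 0.5 * float(w @ w) + C * np.maximum(1.0 - y * scores, 0.0).sum()
```

Minimizing this for each task separately is the baseline that the task-network constraints below improve upon.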
More specifically, we impose the following constraints on the optimization problem (1):

  (1/2)||w_{i_k} − w_{j_k}||^2 ≤ ρ,  for k = 1, ..., K,    (2)

i.e., we upper-bound each difference between the solutions of related task pairs by a positive scalar ρ ∈ R_+. We refer to these constraints as local constraints, following [10]. Note that we do not impose a constraint on the bias parameter b_i, since the bias could be significantly different even among related tasks. The constraints (2) allow us to implicitly increase the number of training samples over the task network in a systematic way through the solutions of related tasks. Following the convention [8], we blend Eqs. (1) and (2) as

  min  sum_{i=1}^M [ (1/2)||w_i||^2 + C sum_{t ∈ I_i} Hinge(f_i(x_t; w_i, b_i), y_t) ] + C̃ ρ,    (3)

where C̃ is a positive trade-off parameter. Then our optimization problem is summarized as follows:

Problem 1.
  min   sum_{i=1}^M (1/2)||w_i||^2 + C 1^T ξ + C̃ ρ,
  wrt.  w = [w_1^T, ..., w_M^T]^T ∈ R^{Md}, b ∈ R^M, ξ = [ξ_1, ..., ξ_ℓ]^T ∈ R_+^ℓ, and ρ ∈ R_+,
  subj. to  y_t (w_i^T x_t + b_i) ≥ 1 − ξ_t,  ∀t ∈ I_i, ∀i,
            (1/2)||w_{i_k} − w_{j_k}||^2 ≤ ρ,  ∀k.    (4)

3.2 Primal MTL Learning by SOCP

The second-order cone program (SOCP) is a class of convex programs minimizing a linear function over an intersection of second-order cones [3](**):

Problem 2.
  min  f^T z  wrt. z ∈ R^n
  subj. to  ||A_i z + b_i|| ≤ c_i^T z + d_i,  for i = 1, ..., N,    (5)

where f ∈ R^n, A_i ∈ R^{(n_i−1)×n}, b_i ∈ R^{n_i−1}, c_i ∈ R^n, and d_i ∈ R. Linear programs, quadratic programs, and quadratically-constrained quadratic programs are all special cases of SOCPs. SOCPs are a sub-class of semidefinite programs (SDPs) [3], but SOCPs can be solved more efficiently than SDPs. Successful optimization algorithms for both SDPs and SOCPs are interior-point algorithms. SDP solvers (e.g., [2]) consume O(n^2 sum_i n_i^2) time for solving Problem 2, while SOCP-specialized solvers that directly solve Problem 2 take only O(n^2 sum_i n_i) computation [7]. We can show that Problem 1 can be cast as an SOCP using hyperbolic constraints [3].
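The hyperbolic-constraint device just mentioned is standard [3]: for y, z ≥ 0, the quadratic constraint x^T x ≤ yz is equivalent to the second-order cone constraint ||[2x; y − z]|| ≤ y + z. For instance, each local constraint (1/2)||w_{i_k} − w_{j_k}||^2 ≤ ρ is of this form with x = w_{i_k} − w_{j_k}, y = 2ρ, z = 1. A small numerical check of the equivalence (illustrative only):

```python
import numpy as np

def quadratic_form(x, y, z):
    """Hyperbolic constraint: x'x <= y*z, with y, z >= 0."""
    return float(x @ x) <= y * z

def socp_form(x, y, z):
    """Equivalent second-order cone constraint: ||[2x; y - z]|| <= y + z."""
    v = np.append(2.0 * x, y - z)
    return float(np.linalg.norm(v)) <= y + z
```

Squaring both sides of the cone constraint gives 4 x^T x + (y − z)^2 ≤ (y + z)^2, i.e., x^T x ≤ yz, so the two feasible sets coincide.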
Theorem 1. Problem 1 can be reduced to an SOCP, and it can be solved with O((Md + ℓ)^2 (Kd + ℓ)) computation.

4 Local MTL with Task Network: Kernelization

The previous section showed that the linear version of the proposed MTL method can be cast as an SOCP. In this section, we show how the kernel trick can be employed to obtain a non-linear variant.

(**) More generally, an SOCP can include linear equality constraints, but they can be eliminated, for example, by some projection method.

4.1 Dual Formulation

Let K_fea be the ℓ × ℓ positive semidefinite kernel matrix of feature vectors, whose (s, t)-th element is the inner product of x_s and x_t: [K_fea]_{s,t} = <x_s, x_t>. We also introduce a kernel among tasks. Using a new K-dimensional non-negative parameter vector η ∈ R_+^K, we define the kernel matrix of tasks by

  K_net(η) = ( I_M / M + U )^{−1},

where U = sum_{k=1}^K η_k U_k, U_k = E^{(i_k, i_k)} + E^{(j_k, j_k)} − E^{(i_k, j_k)} − E^{(j_k, i_k)}, and E^{(i,j)} ∈ R^{M×M} is the sparse matrix whose (i, j)-th element is one and all the others are zero. Note that this is the graph Laplacian kernel [11], where the k-th edge is weighted according to η_k. Let Z ∈ {0, 1}^{M×ℓ} be the indicator of tasks and samples such that Z_{i,t} = 1 if t ∈ I_i and Z_{i,t} = 0 otherwise. Then the information about the tasks is expressed by the ℓ × ℓ kernel matrix Z^T K_net(η) Z. We integrate the two kernel matrices K_fea and Z^T K_net(η) Z by

  K_int(η) = K_fea ∘ ( Z^T K_net(η) Z ),    (6)

where ∘ denotes the Hadamard product (a.k.a. element-wise product). This parameterized matrix K_int(η) is guaranteed to be positive semidefinite [6]. Based on the above notations, the dual of Problem 1 can be expressed using the parameterized integrated kernel matrix K_int(η) as follows:

Problem 3.
  min  (1/2) α^T diag(y) K_int(η) diag(y) α − 1^T α,
  wrt. η ∈ R_+^K and α ∈ R_+^ℓ,
  subj. to  α ≤ C 1,  Z diag(y) α = 0_M,  1^T η ≤ C̃.    (7)

We note that the solutions α and η tend to be sparse due to the ℓ1 norm.
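The task kernel K_net(η) and the integrated kernel (6) can be made concrete with a small NumPy sketch (illustrative, not the authors' implementation; edges and tasks are 0-indexed here):

```python
import numpy as np

def task_kernel(M, edges, eta):
    """K_net(eta) = (I_M / M + U)^(-1), where U = sum_k eta_k * U_k is the
    eta-weighted graph Laplacian of the task network."""
    U = np.zeros((M, M))
    for (i, j), e in zip(edges, eta):
        U[i, i] += e
        U[j, j] += e
        U[i, j] -= e
        U[j, i] -= e
    return np.linalg.inv(np.eye(M) / M + U)

def integrated_kernel(K_fea, Z, K_net):
    """K_int = K_fea ∘ (Z' K_net Z): Hadamard product of the feature kernel
    and the task kernel spread over samples by the indicator Z (M x ell)."""
    return K_fea * (Z.T @ K_net @ Z)
```

For η = 0 the Laplacian vanishes and K_net = M·I_M, so the tasks decouple; increasing η_k couples the pair of tasks joined by edge k.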
Changing the definition of K_fea from the linear kernel to an arbitrary kernel, we can extend the proposed linear MTL method to non-linear domains. Furthermore, we can also deal with non-vectorial structured data by employing a suitable kernel such as the string kernel or the Fisher kernel. In the test stage, a new sample x in the j-th task is classified by

  f_j(x) = sum_{i=1}^M sum_{t=1}^ℓ α_t y_t k_fea(x_t, x) k_net(i, j) Z_{i,t} + b_j,    (8)

where k_fea(·, ·) and k_net(·, ·) are the kernel functions over features and tasks, respectively.

4.2 Dual MTL Learning by SOCP

Here, we show that the above dual problem can also be reduced to an SOCP. To this end, we first introduce the matrix-fractional program (MFP) [7]:

Problem 4.
  min  (F z + g)^T P(z)^{−1} (F z + g)  wrt. z ∈ R_+^p
  subj. to  P(z) = P_0 + sum_{i=1}^p z_i P_i ∈ S_{++}^n,

where P_i ∈ S_+^n, F ∈ R^{n×p}, and g ∈ R^n. Here S_+^n and S_{++}^n denote the positive semidefinite cone and the strictly positive definite cone of n × n matrices, respectively.

Let us re-define d as the rank of the feature kernel matrix K_fea. We introduce a matrix V_fea ∈ R^{ℓ×d} which decomposes the feature kernel matrix as K_fea = V_fea V_fea^T. Define the ℓ-dimensional vector f_h ∈ R^ℓ of the h-th feature by V_fea = [f_1, ..., f_d] ∈ R^{ℓ×d}, and the matrices F_h = Z diag(f_h ∘ y), for h = 1, ..., d. Using these variables, the objective function in Problem 3 can be rewritten as

  J_D = (1/2) sum_{h=1}^d (F_h α)^T ( I_M / M + U )^{−1} (F_h α) − 1^T α.    (9)

This implies that Problem 3 can be transformed into the combination of a linear program and d MFPs. Let us further introduce the vector v_k ∈ R^M for each edge: v_k = e_{i_k} − e_{j_k}, where e_{i_k} is the unit vector whose i_k-th element is one. Let V_lap = [v_1, ..., v_K] ∈ R^{M×K}. Then we can re-express the graph Laplacian matrix of tasks as U = V_lap diag(η) V_lap^T. Given the fact that an MFP can be reduced to an SOCP [7], we can reduce Problem 3 to the following SOCP:

Problem 5.
  min   −1^T α + (1/2) sum_{h=1}^d ( s_{0,h} + s_{1,h} + · · · + s_{K,h} ),    (10)
  wrt.  α ∈ R_+^ℓ, η ∈ R_+^K, s_{0,h} ∈ R, s_{k,h} ∈ R, u_{0,h} ∈ R^M, and u_h = [u_{1,h}, ..., u_{K,h}]^T ∈ R^K,
  subj. to  α ≤ C 1,  1^T η ≤ C̃,    (11)
            Z diag(y) α = 0_M,    (12)
            M^{−1/2} u_{0,h} + V_lap u_h = F_h α,  ∀h,    (13)
            || [2 u_{k,h}, s_{k,h} − η_k]^T || ≤ s_{k,h} + η_k,  ∀k, ∀h,    (14)
            || [2 u_{0,h}^T, s_{0,h} − 1]^T || ≤ s_{0,h} + 1,  ∀h.    (15)

Consequently, we obtain the following result:

Theorem 2. The dual MTL problem (Problem 3) can be reduced to the SOCP in Problem 5, and it can be solved with O((Kd + ℓ)^2 ((M + K)d + ℓ)) computation.

5 Discussion

In this section, we discuss the properties of the proposed MTL method and its relation to existing methods.

MTL with Common Bias  A possible variant of the proposed MTL method would be to share a common bias parameter among all tasks (i.e., b_1 = b_2 = · · · = b_M). This idea is expected to be useful particularly when the number of samples in each task is very small. We can also apply the common-bias idea in the kernelized version simply by replacing the constraint Z diag(y) α = 0_M in Problem 3 by y^T α = 0.

Global vs. Local Constraints  Micchelli and Pontil [8] proposed a related MTL method which upper-bounds the sum of the differences of the K related task pairs, i.e., sum_{k=1}^K (1/2)||w_{i_k} − w_{j_k}||^2 ≤ ρ. We call this the global constraint. The global constraint can have a similar effect to our local constraints (2), i.e., related task pairs tend to have close solutions. However, the global constraint can allow some of the distances to be large, since only the sum is upper-bounded. This actually causes a significant performance degradation in practice, as we demonstrate experimentally in Section 6. We note that the idea of local constraints is also used in the kernel learning problem [10].

Relation to Standard SVMs  By construction, the proposed MTL method includes the standard SVM learning algorithm as a special case. Indeed, when the number of tasks is one, Problem 3 reduces to the standard SVM optimization problem.
Thus, the proposed method may be regarded as a natural extension of SVMs.

Ordinal Regression  As mentioned in Section 1, MTL approaches are useful in ordinal regression problems. Ordinal regression is the task of learning multiple quantiles, which can be formulated as a set of one-versus-one classification problems. A naive approach to ordinal regression is to individually train M SVMs with score functions f_i(x) = <w_i, x> + b_i, i = 1, ..., M. Shashua and Levin [9] proposed an ordinal regression method called support vector ordinal regression (SVOR), where the weight vector is shared by all SVMs (i.e., w_1 = w_2 = · · · = w_M) and only the bias parameters are learned individually. The proposed MTL method can be naturally employed in ordinal regression by constraining the weight vectors as (1/2)||w_i − w_{i+1}||^2 ≤ ρ, i = 1, ..., M − 1, i.e., the task network only has edges between consecutive tasks. This method actually includes the above two ordinal regression approaches as special cases: C̃ = 0 (i.e., ignoring the task network) yields the independent training of SVMs, and C̃ = ∞ (i.e., the weight vectors of all SVMs agree) reduces to SVOR. Thus, in the context of ordinal regression, the proposed method smoothly bridges the two extremes and allows us to control our belief in the task constraints.

6 Experiments

In this section, we show the usefulness of the proposed method through experiments.

6.1 Toy Multiple Classification Tasks

First, we illustrate how the proposed method behaves using a 2-dimensional toy data set, which includes 200 tasks (see Figure 1(a)). Each task possesses a circular classification boundary with a different center and a fixed radius of 0.5. The center of the i-th task is located at (−1 + 0.02(i − 1), 0) for 1 ≤ i ≤ 100 and at (0, −1 + 0.02(i − 101)) for 101 ≤ i ≤ 200. For each task, only two positive and two negative samples are generated following the uniform distribution. We construct a task network where consecutive tasks are connected in a circular manner, i.e., (1, 2), (2, 3), ..., (99, 100), and (100, 1) for the first 100 tasks and (101, 102), (102, 103), ..., (199, 200), and (200, 101) for the last 100 tasks; we further add the edge (50, 150), which connects the cluster of the first 100 nodes with that of the last 100 nodes.

[Figure 1: Toy multiple classification tasks. Each subfigure shows decision boundaries over the input domain [−1, 1]^2 for the 10th, 30th, 50th, 70th, and 90th tasks (top row) and the 110th, 130th, 150th, 170th, and 190th tasks (bottom row). (a) True classification boundaries. (b) IL-SVMs. (c) MTL-SVM(global/full). (d) MTL-SVM(local/network).]
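The toy task network described above, with each ring of 100 consecutive tasks closed on itself (so the second ring closes with the edge (200, 101)) and the bridge (50, 150), can be sketched as follows (hypothetical helper, not the authors' code):

```python
def toy_task_network():
    """Two rings of 100 consecutive tasks each, bridged by the edge (50, 150)."""
    edges = [(i, i % 100 + 1) for i in range(1, 101)]               # (1,2) ... (99,100), (100,1)
    edges += [(100 + i, 100 + i % 100 + 1) for i in range(1, 101)]  # (101,102) ... (200,101)
    edges.append((50, 150))                                         # bridge between the two rings
    return edges
```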
We compare the following methods: a naive method where 200 SVMs are trained individually (individually learned SVMs, `IL-SVM'); the MTL-SVM algorithm where the global constraint and the fully connected task network are used [5] (`MTL-SVM(global/full)'); and the proposed method, which uses the local constraints and the properly defined task network (`MTL-SVM(local/network)'). The results are exhibited in Figure 1, showing that IL-SVM cannot capture the circular shape due to the small sample size in each task. MTL-SVM(global/full) can successfully capture closed-loop boundaries by making use of the information from other tasks. However, the result is still not reliable, since non-consecutive unrelated tasks heavily damage the solutions. On the other hand, MTL-SVM(local/network) nicely captures the circular boundaries, and the results are highly reliable. Thus, given an appropriate task network, the proposed MTL-SVM(local/network) can effectively exploit the information of the related tasks.

Table 1: The accuracy of each method in the protein super-family classification task.

  Dataset                   d-f            d-s            d-o            f-s            f-o            s-o
  IL-SVM                    0.908 (0.023)  0.638 (0.067)  0.725 (0.032)  0.891 (0.036)  0.792 (0.046)  0.663 (0.034)
  One-SVM                   0.941 (0.015)  0.722 (0.030)  0.747 (0.017)  0.886 (0.021)  0.819 (0.029)  0.695 (0.034)
  MTL-SVM (global/full)     0.945 (0.013)  0.698 (0.036)  0.748 (0.021)  0.918 (0.020)  0.834 (0.021)  0.692 (0.050)
  MTL-SVM (global/network)  0.933 (0.017)  0.695 (0.032)  0.749 (0.023)  0.911 (0.022)  0.828 (0.015)  0.663 (0.068)
  MTL-SVM (local/network)   0.952 (0.015)  0.747 (0.020)  0.764 (0.028)  0.918 (0.025)  0.838 (0.018)  0.703 (0.036)

6.2 Protein Super-Family Classification

Next, we test the performance of the proposed method with real-world protein super-family classification problems. The input data are amino acid sequences from the SCOP database [1] (not to be confused with SOCP). We extract feature vectors by counting 2-mers; since there are 20 kinds of amino acids, the number of features is 20^2 = 400.
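The 2-mer counting can be sketched as follows (an illustrative helper; the particular amino-acid ordering is an assumption, as any fixed ordering of the 400 pairs works):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Map each of the 20*20 = 400 possible 2-mers to a feature index.
INDEX = {a + b: k for k, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}

def two_mer_features(sequence):
    """Count overlapping 2-mers of an amino-acid sequence -> 400-dim count vector."""
    counts = [0] * len(INDEX)
    for s, t in zip(sequence, sequence[1:]):
        counts[INDEX[s + t]] += 1
    return counts
```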
We use RBF kernels, where the kernel width σ_rbf^2 is set to the average of the squared distances to the fifth nearest neighbors. Each data set consists of two folds, and each fold is divided into several super-families. We consider the problem of classifying sequences into super-families: a positive class is chosen from one fold, a negative class is chosen from the other fold, and we perform multi-task learning over all possible combinations. We use four folds: DNA/RNA binding, Flavodoxin, OB-fold, and SH3. From these folds, we generate six data sets: d-f, d-s, d-o, f-s, f-o, and s-o, where the fold names are abbreviated to d, f, o, and s, respectively. For example, three super-families are in DNA/RNA binding and two are in SH3, so the number of combinations is 3 × 2 = 6, and the data set d-s has six binary classification tasks. The task networks are constructed as follows: if the positive super-family or the negative super-family is common to two tasks, the two tasks are regarded as a related task pair and connected by an edge. We compare the proposed MTL-SVM(local/network) with IL-SVM, `One-SVM', MTL-SVM(global/full), and MTL-SVM(global/network). One-SVM regards the multiple tasks as one big task and learns it at once by a standard SVM. We set C = 1 for all the approaches; the value of the trade-off parameter C̃ for the three MTL-SVM approaches is determined by cross-validation over the training set. We randomly pick ten training sequences from each super-family and use them for training; the classification accuracy is computed on the remaining test sequences. We repeat this procedure 10 times and report the average accuracy. The results are described in Table 1, showing that the proposed MTL-SVM(local/network) compares favorably with the other methods. In this simulation, the task network is constructed rather heuristically.
Even so, the proposed MTL-SVM(local/network) significantly outperforms MTL-SVM(global/full), which does not use the network structure. This implies that the proposed method still works well even when the task network contains small errors. It is interesting to note that MTL-SVM(global/network) actually does not work well in this simulation, implying that the task relatedness is not properly controlled by the global constraint. Thus, the use of local constraints appears to be effective in MTL scenarios.

6.3 Ordinal Regression

As discussed in Section 5, MTL methods are useful in ordinal regression. Here we create five ordinal regression data sets, described in Table 2; all the data sets are originally regression problems whose output values are divided into five quantiles. Therefore, the overall task can be divided into four classification tasks, each of which estimates a quantile. We compare MTL-SVM(local/network) with IL-SVM, SVOR [9] (see Section 5), MTL-SVM(global/full), and MTL-SVM(global/network). We set C = 1 for all the approaches; the value of the trade-off parameter C̃ for the three MTL-SVM approaches is determined by cross-validation over the training set. We use RBF kernels, where the kernel width σ_rbf^2 is set to the average of the squared distances to the fifth nearest neighbors. We randomly picked 200 samples for training; the remaining samples are used for evaluating the classification accuracy.

Table 2: The accuracy of each method in ordinal regression tasks.
  Data set    IL-SVM         SVOR           MTL-SVM        MTL-SVM           MTL-SVM
                                            (global/full)  (global/network)  (local/network)
  pumadyn     0.643 (0.007)  0.661 (0.006)  0.629 (0.025)  0.645 (0.018)     0.661 (0.007)
  stock       0.894 (0.012)  0.878 (0.011)  0.872 (0.010)  0.888 (0.010)     0.902 (0.007)
  bank-8fh    0.781 (0.003)  0.777 (0.006)  0.772 (0.006)  0.773 (0.006)     0.779 (0.002)
  bank-8fm    0.854 (0.004)  0.845 (0.010)  0.832 (0.013)  0.847 (0.009)     0.854 (0.009)
  calihouse   0.648 (0.003)  0.642 (0.008)  0.640 (0.005)  0.646 (0.007)     0.650 (0.004)

The performance averaged over five runs is described in Table 2, showing that the proposed MTL-SVM(local/network) is also promising in ordinal regression scenarios.

7 Conclusions

In this paper, we proposed a new multi-task learning method which overcomes the limitations of existing approaches by making use of a task network and local constraints. We demonstrated through simulations that the proposed method is useful in multi-task learning scenarios; moreover, it also works well in ordinal regression scenarios. Standard SVMs have a variety of extensions and have been combined with various techniques, e.g., one-class SVMs, SV regression, and the ν-trick. We expect that such extensions and techniques can similarly be applied to the proposed method. Other possible future work includes the elucidation of the entire regularization path and the application to learning from multiple networks; developing algorithms for learning probabilistic models with a task network is also a promising direction to be explored.

Acknowledgments

This work was partially supported by a Grant-in-Aid for Young Scientists (B), number 18700287, from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

References

[1] A. Andreeva, D. Howorth, S. E. Brenner, T. J. P. Hubbard, C. Chothia, and A. G. Murzin. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Research, 32:D226-D229, 2004.

[2] B. Borchers. CSDP, a C library for semidefinite programming.
Optimization Methods and Software, 11(1):613-623, 1999.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] R. Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.

[5] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proc. of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2004.

[6] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, July 1999.

[7] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra and its Applications, 284:193-228, 1998.

[8] C. A. Micchelli and M. Pontil. Kernels for multi-task learning. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 921-928, Cambridge, MA, 2005. MIT Press.

[9] A. Shashua and A. Levin. Ranking with large margin principle: two approaches. In Advances in Neural Information Processing Systems 15, pages 937-944, Cambridge, MA, 2003. MIT Press.

[10] K. Tsuda and W. S. Noble. Learning kernels from biological networks by maximizing entropy. Bioinformatics, 20(Suppl. 1):i326-i333, 2004.

[11] X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge, MA, 2004. MIT Press.