Clustered Multi-Task Learning: a Convex Formulation Laurent Jacob Mines ParisTech ­ CBIO INSERM U900, Institut Curie 35, rue Saint Honore, 77300 Fontainebleau, France ´ laurent.jacob@mines-paristech.fr Francis Bach INRIA ­ Willow Project Ecole Normale Superieure, ´ 45, rue d'Ulm, 75230 Paris, France francis.bach@mines.org Jean-Philippe Vert Mines ParisTech ­ CBIO INSERM U900, Institut Curie 35, rue Saint Honore, 77300 Fontainebleau, France ´ jean-philippe.vert@mines-paristech.fr Abstract In multi-task learning several related tasks are considered simultaneously, with the hope that by an appropriate sharing of information across tasks, each task may benefit from the others. In the context of learning linear functions for supervised classification or regression, this can be achieved by including a priori information about the weight vectors associated with the tasks, and how they are expected to be related to each other. In this paper, we assume that tasks are clustered into groups, which are unknown beforehand, and that tasks within a group have similar weight vectors. We design a new spectral norm that encodes this a priori assumption, without the prior knowledge of the partition of tasks into groups, resulting in a new convex optimization formulation for multi-task learning. We show in simulations on synthetic examples and on the I E D B MHC-I binding dataset, that our approach outperforms well-known convex methods for multi-task learning, as well as related non-convex methods dedicated to the same problem. 1 Introduction Regularization has emerged as a dominant theme in machine learning and statistics, providing an intuitive and principled tool for learning from high-dimensional data. In particular, regularization by squared Euclidean norms or squared Hilbert norms has been thoroughly studied in various settings, leading to efficient practical algorithms based on linear algebra, and to very good theoretical understanding (see, e.g., [1, 2]). In recent years, regularization by non Hilbert norms, such as p norms with p = 2, has also generated considerable interest for the inference of linear functions in supervised classification or regression. Indeed, such norms can sometimes both make the problem statistically and numerically better-behaved, and impose various prior knowledge on the problem. For example, the 1 -norm (the sum of absolute values) imposes some of the components to be equal to zero and is widely used to estimate sparse functions [3], while various combinations of p norms can be defined to impose various sparsity patterns. While most recent work has focused on studying the properties of simple well-known norms, we take the opposite approach in this paper. That is, assuming a given prior knowledge, how can we design a norm that will enforce it? More precisely, we consider the problem of multi-task learning, which has recently emerged as a very promising research direction for various applications [4]. In multi-task learning several related inference tasks are considered simultaneously, with the hope that by an appropriate sharing 1 of information across tasks, each one may benefit from the others. When linear functions are estimated, each task is associated with a weight vector, and a common strategy to design multi-task learning algorithm is to translate some prior hypothesis about how the tasks are related to each other into constraints on the different weight vectors. For example, such constraints are typically that the weight vectors of the different tasks belong (a) to a Euclidean ball centered at the origin [5], which implies no sharing of information between tasks apart from the size of the different vectors, i.e., the amount of regularization, (b) to a ball of unknown center [5], which enforces a similarity between the different weight vectors, or (c) to an unknown low-dimensional subspace [6, 7]. In this paper, we consider a different prior hypothesis that we believe could be more relevant in some applications: the hypothesis that the different tasks are in fact clustered into different groups, and that the weight vectors of tasks within a group are similar to each other. A key difference with [5], where a similar hypothesis is studied, is that we don't assume that the groups are known a priori, and in a sense our goal is both to identify the clusters and to use them for multi-task learning. An important situation that motivates this hypothesis is the case where most of the tasks are indeed related to each other, but a few "outlier" tasks are very different, in which case it may be better to impose similarity or low-dimensional constraints only to a subset of the tasks (thus forming a cluster) rather than to all tasks. Another situation of interest is when one can expect a natural organization of the tasks into clusters, such as when one wants to model the preferences of customers and believes that there are a few general types of customers with similar preferences within each type, although one does not know beforehand which customers belong to which types. Besides an improved performance if the hypothesis turns out to be correct, we also expect this approach to be able to identify the cluster structure among the tasks as a by-product of the inference step, e.g., to identify outliers or groups of customers, which can be of interest for further understanding of the structure of the problem. In order to translate this hypothesis into a working algorithm, we follow the general strategy mentioned above which is to design a norm or a penalty over the set of weights which can be used as regularization in classical inference algorithms. We construct such a penalty by first assuming that the partition of the tasks into clusters is known, similarly to [5]. We then attempt to optimize the objective function of the inference algorithm over the set of partitions, a strategy that has proved useful in other contexts such as multiple kernel learning [8]. This optimization problem over the set of partitions being computationally challenging, we propose a convex relaxation of the problem which results in an efficient algorithm. 2 Multi-task learning with clustered tasks We consider m related inference tasks that attempt to learn linear functions over X = Rd from a training set of input/output pairs (xi , yi )i=1,...,n , where xi X and yi Y . In the case of binary classification we usually take Y = {-1, +1}, while in the case of regression we take Y = R. Each training example (xi , yi ) is associated to a particular task t [1, m], and we denote by I (t) [1, n] the set of indices of training examples associated to the task t. Our goal is to infer m linear functions ft (x) = wt x, for t = 1, . . . , m, associated to the different tasks. We denote by W = (w1 . . . wm ) the d × m matrix whose columns are the successive vectors we want to estimate. We fix a loss function l : R × Y R that quantifies by l(f (x), y ) the cost of predicting f (x) for the input x when the correct output is y . Typical loss functions include the square error in regression l(u, y ) = 1 (u - y )2 or the hinge loss in binary classification l(u, y ) = max(0, 1 - uy ) 2 with y {-1, 1}. The empirical risk of a set of linear classifiers given in the matrix W is then defined as the average loss over the training set: m i 1 (W ) = n t=1 (1 ) I (t) l (wt xi , yi ) . In the sequel, we will often use the m×1 vector 1 composed of ones, the m×m projection matrices U = 11 /m whose entries are all equal to 1/m, as well as the projection matrix = I - U . In order to learn simultaneously the m tasks, we follow the now well-established approach which looks for a set of weight vectors W that minimizes the empirical risk regularized by a penalty functional, i.e., we consider the problem: minW Rd×m (W ) + (W ) , (2 ) where (W ) can be designed from prior knowledge to constrain some sharing of information between tasks. For example, [5] suggests to penalize both the norms of the wi 's and their variance, 2 i.e., to consider a function of the form: n where w = ( i=1 wi ) /m is the mean weight vector. This penalty enforces a clustering of the wi s ¯ towards their mean when increases. Alternatively, [7] propose to penalize the trace norm of W : mi trace (W ) = i=n(d,m) i (W ) , (4 ) 1 where 1 (W ), . . . , min(d,m) (W ) are the successive singular values of W . This enforces a low-rank solution in W , i.e., constrains the different wi 's to live in a low-dimensional subspace. Here we would like to define a penalty function (W ) that encodes as prior knowledge that tasks are clustered into r < m groups. To do so, let us first assume that we know beforehand the clusters, i.e., we have a partition of the set of tasks into r groups. In that case we can follow an approach proposed by [5] which for clarity we rephrase with our notations and slightly generalize now. For a given cluster c [1, r], let us denote J (c) [1, m] the set of tasks in c, mc = |J (c)| the number of tasks in the cluster c, and E the m × r binary matrix which describes the cluster assignment for ihe m tasks, i.e., Eij = 1 if task i is in cluster j , 0 otherwise. Let us further denotm by wc = t e ¯ ¯ ( J (c) wi )/mc the average weight vector for the tasks in c, and recall that w = ( i=1 wi ) /m denotes the average weight vector over all tasks. Finally it will be convenient to introduce the matrix M = E (E E )-1 E . M can also be written I - L, where L is the normalized Laplacian of the graph G whose nodes are the tasks connected by an edge if and only if they are in the same cluster. Then we can define three semi-norms of interest on W that quantify different orthogonal aspects: · A global penalty, which measures on average how large the weight vectors are: mean (W ) = n w 2 = trW U W . ¯ · A measure of between-cluster variance, which quantifies how close to each other the different clusters are: r between (W ) = c=1 mc wc - w 2 = trW (M - U )W . ¯ ¯ · A measure of within-cluster variance, which quantifies the compactness of the clusters: = i r trW (I - M )W . within (W ) = c=1 wi - wc 2 ¯ J (c ) We note that both between (W ) and within (W ) depend on the particular choice of clusters E , or equivalently of M . We now propose to consider the following general penalty function: (W ) = M mean (W ) + B between (W ) + W within (W ) , (5 ) variance (W ) = w 2 + ¯ m m i=1 wi - w 2 , ¯ (3 ) where M , B and W are non-negative parameters that can balance the importance of the components of the penalty. Plugging this quadratic penalty into (2) leads to the general problem: minW Rd×m (W ) + trW (M )-1 W , wh e r e Here we use the notation (M ) to insist on the fact that this quadratic penalty depends on the cluster structure through the matrix M . Observing that the matrices U , M - U and I - M are orthogonal projections onto orthogonal supplementary subspaces, we easily get from (7): - (M ) = M1 U + -1 (M - U ) + -1 (I - M ) = -1 I + (-1 - -1 )U + (-1 - -1 )M . (8) B W W M B B W (6 ) (7 ) (M )-1 = M U + B (M - U ) + W (I - M ) . By choosing particular values for M , B and W we can recover several situations, In particular: · For W = B = M = , we simply recover the Frobenius norm of W , which does not put any constraint on the relationship between the different tasks: m (W ) = trW W = i=1 wi 2 . 3 · For W = B > M , we recover the penalty of [5] without clusters: m ¯ (W ) = trW (M U + B (I - U )) W = M n w 2 + B i=1 wi - w 2 . ¯ In that case, a global similarity between tasks is enforced, in addition to the general constraint on their mean. The structure in clusters plays no role since the sum of the betweenand within-cluster variance is independent of the particular choice of clusters. · For W > B = M we recover the penalty of [5] with clusters: . cr m i W ¯ (W ) = trW (M M + W (I - M )) W = M ¯ c wc 2 + M J (c) wi - wc 2 =1 where Mr denotes the set of matrices M = E (E E )-1 E defined by a clustering of the m tasks into r clusters and (M ) is defined in (8). Denoting by Sr = {(M ) : M Mr } the corresponding set of positive semidefinite matrices, we can equivalently rewrite the problem as: (1 0 ) minW Rd×m ,Sr (W ) + trW -1 W . m The objective function in (10) is jointly convex in W Rd×m and S+ , the set of m×m positive semidefinite matrices, however the (finite) set Sr is not convex, making this problem intractable. We are now going to propose a convex relaxation of (10) by optimizing over a convex set of positive semidefinite matrices that contains Sr . In order to enforce a cluster hypothesis on the tasks, we therefore see that a natural choice is to take W > B > M in (5). This would have the effect of penalizing more the within-cluster variance than the between-cluster variance, hence promoting compact clusters. Of course, a major limitation at this point is that we assumed the cluster structure known a priori (through the matrix E , or equivalently M ). In many cases of interest, we would like instead to learn the cluster structure itself from the data. We propose to learn the cluster structure in our framework by optimizing our objective function (6) both in W and M , i.e., to consider the problem: (9 ) minW Rd×m ,M Mr (W ) + trW (M )-1 W , 3 Convex relaxation In order to formulate a convex relaxation of (10), we observe that in the penalty term (5) the cluster structure only contributes to the second and third terms between (W ) and within (W ), and that these penalties only depend on the centered version of W . In terms of matrices, only the last two terms of (M )-1 in (7) depend on M , i.e., on the clustering, and these terms can be re-written as: B (M - U ) + W (I - M ) = (B M + W (I - M )). (1 1 ) Indeed, it is easy to check that M - U = M = M , and that I - M = I - U - (M - U ) = - M = (I - M ). Intuitively, multiplying by on the right (resp. on the left) centers the rows (resp. the columns) of a matrix, and both M - U and I - M are row- and column-centered. To simplify notations, let us introduce M = M . Plugging (11) in (7) and (9), we get the penalty t + (1 2 ) trW (M )-1 W = M rW W U (W )(B M + W (I - M ))(W ) , in which, again, only the second part needs to be optimized with respect to the clustering M . Denoting -1 (M ) = B M + W (I - M ), one can express c (M ), using the fact that M is a projection: c M c (M ) = -1 - -1 (1 3 ) + -1 I . W B W c is characterized by M = M , that is discrete by construction, hence the non-convexity of Sr . We have the natural constraints M 0 (i.e., M -U ), 0 M I (i.e., 0 M ) and trM = r (i.e., trM = r - 1). A possible convex relaxation of the discrete set of matrices M is therefore {M : 0 M I, trM = r - 1}. This gives an equivalent convex set Sc for c , namely: , m Sc = c I, trc = (1 4 ) c S+ : I - r with = W1 , = -1 and = (m - r + 1)-1 + (r - 1)-1 . Incorporating the first pa, t of the B W B t penalty (12) into the empirical risk term by defining c (W ) = (W ) + M rW W U we are now ready to state our relaxation of (10): (1 5 ) minW Rd×m ,c Sc c (W ) + trW -1 (W ) . c 4 3.1 Reinterpretation in terms of norms - We denote W 2 = minc Sc trW c 1 W T the cluster norm (CN). For any convex set Sc , we obc tain a norm on W (that we apply here to its centered version). By putting some different constraints on the set Sc , we obtain different norms on W , and in fact all previous multi-task formulations may be cast in this way, i.e., by choosing a specific set of positive matrices Sc (e.g., trace constraint for the trace norm, and simply a singleton for the Frobenius norm). Thus, designing norms for multitask learning is equivalent to designing a set of positive matrices. In this paper, we have investigated a specific set adapted for clustered-tasks, but other sets could be designed in other situations. Note that we have selected a simple spectral convex set Sc in order to make the optimization simpler in Section 3.3, but we could also add some additional constraints that encode the point-wise positivity of the matrix M . Finally, when r = 1 (one cluster) and r = m (one cluster per task), we get back the formulation of [5]. 3.2 Reinterpretation as a convex relaxation of K-means In this section we show that the semi-norm W 2 that we have designed earlier, can be interpreted c as a convex relaxation of K-means on the tasks [9]. Indeed, given W Rd×m , K-means aims to decompose it in the form W = µE where µ Rd×r are cluster centers and E represents a partition. Given E , µ is found by minimizing minµ W - E µ 2 . Thus, a natural strategy F outlined by [9], is to alternate between optimizing µ, the partition E and the weight vectors W . We now show that our convex norm is obtained when minimizing in closed form with respect to µ and relaxing. By translation invariance, this is equivalent to minimizing minµ W - E µ 2 . If we add a F penalization on µ of the form trE E µµ , then a short calculation shows that the minimum with respect to µ (i.e., after optimization of the cluster centers) is equal to trW W (E (E E )-1 E / + I )-1 = trW W (M / + I )-1 . By comparing with Eq. (13), we see that our formulation is indeed a convex relaxation of K-means. 3.3 Primal optimization Let us now show in more details how (15) can be solved efficiently. Whereas a dual formulation could be easily derived following [8], a direct approach is to rewrite (15) as ( 16) minW Rd×m c (W ) + minc Sc trW -1 (W ) c which, if c is differentiable, can be directly optimized by gradient-based methods on W since - W 2 = minc Sc trW c 1 (W ) is a quadratic semi-norm of W . This regularization c -1 term trW c (W ) can be computed efficiently using a semi-closed form. Indeed, since c as defined in (14) is a spectral set (i.e., it does depend only on eigenvalues of covariance matrices), we obtain a function of the singular values of W (or equivalently the eigenvalues of W W ): minc Sc trW -1 (W ) = minRm , i , 1= , V Om trW V diag()-1 V (W ) , c where Om is the set of orthogonal matrices in Rm×m . The optimal V is the matrix of the eigenvectors of W W , and we obtain the value of the objective function at the optimum: m 2 minS trW -1 (W ) = minRm , i , 1= i=1 i , i where and are the vectors containing the singular values of W and respectively. Now, we simply need to be able to compute this function of the singular values. The only coupling in this formulation comes from the trace constraint. The Lagrangian corresponding to this constraint is: m m 2 (1 7 ) L(, ) = i=1 i + ( i=1 i - ) . i For 0, this is a decreasing function of i , so the minimum on i [, ] is reached for i = . The dual function is then a linear non-decreasing function of (since /m from the definition of , , in (14)), which reaches it maximum value (on 0) at = 0. Let us therefore now consider the dual for 0. (17) is then a convex function of i . Canceling its derivative with respect to i gives that the minimum in R is reached for i = i / . Now this may not be 5 in the constraint set (, )so if i < then the minimum in i [, ] of (17) is reached , for i = , and if i > it is reached for i = . Otherwise, it is reached for i = i / . Reporting this in (17), the dual problem is therefore +i - 2 2 i i i i 2i + + + . (1 8 ) max 0 , respectively. Denoting A 2 = c F (A, (A)), A F = A F + F A cannot be computed because of the non-differentiable constraints on for F . We followed an alternative direction, using only the A F part. 4 Experiments 4.1 Artificial data We generated synthetic data consisting of two clusters of two tasks. The tasks are vectors of Rd , d = 30. For each cluster, a center wc was generated in Rd-2 , so that the two clusters be orthogonal. More ¯ 2 2 precisely, each wc had (d - 2)/2 random features randomly drawn from N (0, r ), r = 900, and ¯ (d - 2)/2 zero features. Then, each tasks t was computed as wt + wc (t), where c(t) was the cluster ¯ of t. wt had the same zero feature as its cluster center, and the other features were drawn from 2 2 2 N (0, c ), c = 16. The last two features were non-zero for all the tasks and drawn from N (0, c ). 2 For each task, 2000 points were generated and a normal noise of variance n = 150 was added. In a first experiment, we compared our cluster norm . 2 with the single-task learning given by the c Frobenius norm, and with the trace norm, that corresponds to the assumption that the tasks live in a low-dimension space. The multi-task kernel approach being a special case of CN, its performance will always be between the performance of the single task and the performance of CN. In a second setting, we compare CN to alternative methods that differ in the way they learn : · The True metric approach, that simply plugs the actual clustering in E and optimizes W using this fixed metric. This necessitates to know the true clustering a priori, and can be thought of like a golden standard. · The k-means approach, that alternates between optimizing the tasks in W given the metric and re-learning by clustering the tasks wi [9]. The clustering is done by a k-means run 3 times. This is a non convex approach, and different initialization of k-means may result in different local minima. We also tried one run of CN followed by a run of True metric using the learned reprojected in Sr by rounding, i.e., by performing k-means on the eigenvectors of the learned (Reprojected approach), and a run of k-means starting from the relaxed solution (CNinit approach). 6 Only the first method requires to know the true clustering a priori, all the other methods can be run without any knowledge of the clustering structure of the tasks. Each method was run with different numbers of training points. The training points were equally separated between the two clusters and for each cluster, 5/6th of the points were used for the first task and 1/6th for the second, in order to simulate a natural setting were some tasks have fewer data. We used the 2000 points of each task to build 3 training folds, and the remaining points were used for testing. We used the mean RMSE across the tasks as a criterion, and a quadratic loss for (W ). The results of the first experiment are shown on Figure 1 (left). As expected, both multi-task approaches perform better than the approach that learns each task independently. CN penalization on the other hand always gives better testing error than the trace norm penalization, with a stronger advantage when very few training points are available. When more training points become available, all the methods give more and more similar performances. In particular, with large samples, it is not useful anymore to use a multi-task approach. 35 Frob Trace CN 32 30 28 26 RMSE RMSE 25 24 22 20 15 18 16 10 3 3.5 4 4.5 5 5.5 Number of training points (log) 6 6.5 14 3 3.5 4 4.5 5 5.5 Number of training points (log) 6 6.5 CN KM True Repr 30 20 Figure 1: RMSE versus number of training points for the tested methods. Figure 2: Recovered with CN (upper line) and k-means (lower line) for 28, 50 and 100 points. Figure 1 (right) shows the results of the second experiment. Using the true metric always gives the best results. For 28 training points, no method recovers the correct clustering structure, as displayed on Figure 2, although CN performs slightly better than the k-means approach since the metric it learns is more diffuse. For 50 training points, CN performs much better than the k-means approach, which completely fails to recover the clustering structure as illustrated by the learned for 28 and 50 training points on Figure 2. In the latter setting, CN partially recovers the clusters. When more training points become available, the k-means approach perfectly recovers the clustering structure and outperforms the relaxed approach. The reprojected approach, on the other hand, performs always as well as the best of the two other methods. The CNinit approach results are not displayed since the are the same as for the reprojected method. 4.2 MHC-I binding data We also applied our method to the I E D B MHC-I peptide binding benchmark proposed in [10]. This database contains binding affinities of various peptides, i.e., short amino-acid sequences, with different MHC-I molecules. This binding process is central in the immune system, and predicting it is crucial, for example to design vaccines. The affinities are thresholded to give a prediction problem. Each MHC-I molecule is considered as a task, and the goal is to predict whether a peptide binds a molecule. We used an orthogonal coding of the amino acids to represent the peptides and balanced 7 Table 1: Prediction error for the 10 molecules with less than 200 training peptides in I E D B. Method Test error Pooling 26.53% ± 2.0 Frobenius norm 11.62% ± 1.4 Multi-task kernel 10.10% ± 1.4 Trace norm 9.20% ± 1.3 Cluster norm 8.71% ± 1.5 the data by keeping only one negative example for each positive point, resulting in 15236 points involving 35 different molecules. We chose a logistic loss for (W ). Multi-task learning approaches have already proved useful for this problem, see for example [11, 12]. Besides, it is well known in the vaccine design community that some molecules can be grouped into empirically defined supertypes known to have similar binding behaviors. [12] showed in particular that the multi-task approaches were very useful for molecules with few known binders. Following this observation, we consider the mean error on the 10 molecules with less than 200 known ligands, and report the results in Table 1. We did not select the parameters by internal cross validation, but chose them among a small set of values in order to avoid overfitting. More accurate results could arise from such a cross validation, in particular concerning the number of clusters (here we limited the choice to 2 or 10 clusters). The pooling approach simply considers one global prediction problem by pooling together the data available for all molecules. The results illustrate that it is better to consider individual models than one unique pooled model.On the other hand, all the multitask approaches improve the accuracy, the cluster norm giving the best performance. The learned , however, did not recover the known supertypes, although it may contain some relevant information on the binding behavior of the molecules. 5 Conclusion We have presented a convex approach to clustered multi-task learning, based on the design of a dedicated norm. Promising results were presented on synthetic examples and on the I E D B dataset. We are currently investigating more refined convex relaxations and the natural extension to nonlinear multi-task learning as well as the inclusion of specific features on the tasks, which has shown to improve performance in other settings [6]. References [1] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990. [2] F. Girosi, M. Jones, and T. Poggio. Regularization Theory and Neural Networks Architectures. Neural Comput., 7(2):219­269, 1995. [3] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Stat. Soc. B., 58:267­288, 1996. [4] B. Bakker and T. Heskes. Task clustering and gating for bayesian multitask learning. J. Mach. Learn. Res., 4:83­99, 2003. [5] T. Evgeniou, C. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6:615­637, 2005. [6] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes. Technical Report cs/0611124, arXiv, 2006. [7] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In B. Scholkopf, J. Platt, and ¨ T. Hoffman, editors, Adv. NIPS 19, pages 41­48, Cambridge, MA, 2007. MIT Press. [8] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the Kernel Matrix with Semidefinite Programming. J. Mach. Learn. Res., 5:27­72, 2004. [9] M. Deodhar and J. Ghosh. A framework for simultaneous co-clustering and learning from complex data. In KDD '07, pages 250­259, New York, NY, USA, 2007. ACM. [10] B. Peters, H.-H Bui, S. Frankild, M. Nielson, C. Lundegaard, E. Kostem, D. Basch, K. Lamberth, M. Harndahl, W. Fleri, S. S Wilson, J. Sidney, O. Lund, S. Buus, and A. Sette. A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol, 2(6):e65, 2006. [11] D. Heckerman, D. Kadie, and J. Listgarten. Leveraging information across HLA alleles/supertypes improves epitope prediction. J. Comput. Biol., 14(6):736­746, 2007. [12] L. Jacob and J.-P. Vert. Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics, 24(3):358­366, Feb 2008. 8