Large Scale Max-Margin Multi-Label Classification with Priors

Bharath Hariharan (cs1060156@cse.iitd.ernet.in), Department of Computer Science and Engineering, Indian Institute of Technology, Delhi, India 110 016
Lihi Zelnik-Manor (lihi@ee.technion.ac.il), Department of Electrical Engineering, The Technion - Israel Institute of Technology, Haifa, Israel 32000
S. V. N. Vishwanathan (vishy@stat.purdue.edu), Department of Statistics, Purdue University, West Lafayette, IN 47907-2066
Manik Varma (manik@microsoft.com), Microsoft Research India, Bangalore, India 560 080

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We propose a max-margin formulation for the multi-label classification problem where the goal is to tag a data point with a set of pre-specified labels. Given a set of L labels, a data point can be tagged with any of the 2^L possible subsets. The main challenge therefore lies in optimising over this exponentially large label space subject to label correlations. Existing solutions take either of two approaches. The first assumes, a priori, that there are no label correlations and independently trains a classifier for each label (as is done in the 1-vs-All heuristic). This reduces the problem complexity from exponential to linear, and such methods can scale to large problems. The second approach explicitly models correlations by pairwise label interactions. However, the complexity remains exponential unless one assumes that label correlations are sparse. Furthermore, the learnt correlations reflect the training set biases. We take a middle approach that assumes labels are correlated but does not incorporate pairwise label terms in the prediction function. We show that the complexity can still be reduced from exponential to linear while modelling dense pairwise label correlations. By incorporating correlation priors we can overcome training set biases and improve prediction accuracy. We provide a principled interpretation of the 1-vs-All method and show that it arises as a special case of our formulation. We also develop efficient optimisation algorithms that can be orders of magnitude faster than the state-of-the-art.

1. Introduction

The objective in multi-label classification is to predict a set of relevant binary labels for a given input. A key aspect is dealing with the exponentially large power set of labels subject to label correlations. This is in contrast to multi-class classification, where one has to predict just the single, most probable label. Existing methods for multi-label classification take one of two approaches. In the first, labels are a priori assumed not to be correlated so that a predictor can be trained for each label independently (as is done in the popular 1-vs-All heuristic). This reduces training and prediction complexity from exponential in the number of labels to linear. In the second, label correlations are explicitly taken into account by incorporating pairwise, or potentially even higher order, label interactions. However, inference is intractable unless one assumes that labels are sparsely correlated. Exact inference might also not be possible in the presence of loops, and most work has focused on hierarchical tree structured labels. Furthermore, label correlation weights are estimated from the training set alone and prior knowledge is rarely incorporated.
In this paper, we focus on the case where the labels are densely correlated and where the label correlations found during training might be very different from those found during testing. This is a common setting in image and video search, where one frequently has to recognise categories for which training data is unavailable. For instance, to recognise objects without requiring any labelled examples of the test categories, (Lampert et al., 2009) proposed an attribute based multi-label system which needs only prior knowledge about how attributes co-occur in the test categories. Existing multi-label methods are unsuitable in such scenarios.

We propose a max-margin multi-label (M3L) formulation where correlated, rather than independent, predictors are learnt but where pairwise label terms are not incorporated in the prediction function. We show that, for a problem with L labels and N training data points, the M3L formulation can be reduced from having N 2^L constraints to N L constraints. The formulation generalises existing approaches which assume independence, such as 1-vs-All, and provides a principled way of interpreting them. Furthermore, by sacrificing some modelling power as compared to explicit methods, the formulation can handle dense, loopy label correlations and incorporate prior knowledge efficiently. We develop specialised algorithms for optimising the M3L formulation and demonstrate that they can be orders of magnitude faster than existing cutting plane methods (Tsochantaridis et al., 2005). In particular, for kernelised M3L, while a straightforward SMO based implementation would have taken time quadratic in the number of labels, our algorithm can train in linear time. By efficient kernel caching, we can sometimes even be an order of magnitude faster than 1-vs-All. Thus our code, available from (M3L), should also be very useful for learning independent 1-vs-All classifiers. For linear M3L, we demonstrate that we can train on the RCV1 dataset with 781,265 points and 103 labels in effectively six minutes and fully converge in eighteen. In terms of performance, we show that incorporating prior correlation information using the M3L formulation can substantially boost prediction accuracy over independent methods.

2. Related Work

Most multi-label approaches proposed in the literature try to reduce the problem to a more "canonical" one such as regression, multi-class classification, binary classification or ranking. We review such independent methods as well as those that take label correlations into account.

In regression methods (Ji et al., 2008; Hsu et al., 2009; Tsoumakas & Katakis, 2007), the label space is mapped onto a vector space (which might sometimes be a shared subspace of the feature space) where regression techniques can be applied straightforwardly. The primary advantage of such methods is that they can be extremely efficient if the mapped label space has significantly lower dimensionality than the original label space (Hsu et al., 2009). The disadvantage of such approaches is that the choice of an appropriate mapping might be unclear. As a result, minimising regression loss functions, such as the square loss, in this space might be very efficient but might not be strongly correlated with minimising the desired multi-label loss. Furthermore, classification involves inverting the map, which might not be straightforward, can result in multiple solutions and might involve heuristics.

A multi-label problem with L labels can be viewed as a classification problem with 2^L classes (McCallum, 1999; Boutell et al., 2004) and standard multi-class techniques can be brought to bear. Such an approach was shown to give the best empirical results in the survey by (Tsoumakas & Katakis, 2007). Apart from computational costs, one of the main drawbacks of this approach is that most classes will have no positive training data and these label combinations cannot be recognised at test time. Furthermore, the multi-class 0/1 loss is a poor approximation to the desired multi-label loss. For instance, the 0/1 loss would charge the same penalty for getting all but one of the labels right as it would for getting none of the labels right.

Binary classification can be leveraged by replicating the feature vector of each data point L times. For copy number l, an extra dimension is added to the feature vector with value l, and the training label is +1 if label l is present in the label set of the original point and -1 otherwise. Due to the data replication, applying a binary classifier naively would be computationally costly and would require that complex decision boundaries be learnt. However, (Schapire & Singer, 2000) show that the problem can be solved efficiently using Boosting. A somewhat related technique is 1-vs-All (Rifkin & Klautau, 2004), which independently learns a binary classifier for each label. As we show in Section 3, our formulation generalises 1-vs-All to handle label correlations.

A ranking based solution was proposed in (Elisseeff & Weston, 2001). The objective was to ensure that, for every data point, all the relevant labels were ranked higher than any of the irrelevant ones. Determining the number of labels to predict for a novel point can be problematic. Some approaches address the issue by training an independent regressor while others introduce a dummy label. Posing the problem as ranking also induces a quadratic number of constraints per example, which leads to a harder optimisation. This is ameliorated in (Crammer & Singer, 2003), who reduced the space complexity to linear and the time complexity to sub-quadratic.

Most of the approaches mentioned above do not explicitly model label correlations ((McCallum, 1999) has a generative model which can, in principle, handle correlations, but greedy heuristics are used to search over the exponential label space). In terms of discriminative methods, most work has focused on hierarchical tree, or forest, structured labels. Methods such as (Cai & Hofmann, 2007; Cesa-Bianchi et al., 2006) optimise a hierarchical loss over the tree structure but do not incorporate pairwise, or higher order, label interaction terms. For instance, (Cai & Hofmann, 2007) classify at only the leaf nodes by leveraging the ranking method of (Elisseeff & Weston, 2001). The M3N formulation of (Taskar et al., 2003) was the first to suggest max-margin learning of label interactions. Learning is exact and efficient for trees and approximate, in general, for loopy graph structures. However, learning is intractable for densely correlated labels. While the M3N formulation dealt with the Hamming loss, a more suitable hierarchical loss was introduced and efficiently optimised in (Rousu et al., 2006). Finally, (Tsochantaridis et al., 2005) propose an iterative, cutting plane algorithm for learning in general structured output spaces.
The algorithm adds the worst violating constraint to the active set in each iteration and is proved to take a maximum number of iterations independent of the size of the output space. While this algorithm can be used to learn pairwise label interactions, it too cannot handle a fully connected graph, since the worst violating constraint cannot, in general, be found in polynomial time. It can, however, be used to learn our proposed M3L formulation, but is then an order of magnitude slower than the specialised optimisation algorithms we develop.

3. M3L: The Max-Margin Multi-Label Classification Primal Formulation

The objective in multi-label classification is to learn a function f which can be used to assign a set of labels to a point x. We assume that N training data points have been provided of the form $(x_i, y_i) \in \mathbb{R}^D \times \{\pm 1\}^L$, with $y_{il}$ being +1 if label l has been assigned to point i and -1 otherwise. A principled way of formulating the problem would be to take the loss function that one truly cares about and minimise it over the training set subject to regularisation or prior knowledge. Of course, since direct minimisation of most discrete loss functions is hard, we might end up minimising an upper bound on the loss, such as the hinge. The learning problem can then be formulated as the following primal

$$P_1 = \min_{f, \xi} \; \tfrac{1}{2}\|f\|^2 + C \sum_{i=1}^{N} \xi_i \quad (1)$$
$$\text{s.t.} \quad f(x_i, y_i) \geq f(x_i, y) + \Delta(y_i, y) - \xi_i \quad \forall i, \; \forall y \in \{\pm 1\}^L \setminus \{y_i\} \quad (2)$$
$$\xi_i \geq 0 \quad \forall i \quad (3)$$

with a new point x being assigned labels according to $\hat{y} = \operatorname{argmax}_y f(x, y)$. The drawback of such a formulation is that there are $N 2^L$ constraints, which makes direct optimisation very slow. Furthermore, classification of a novel point requires $2^L$ function evaluations (one for each possible value of y), which can be prohibitive at run time. In this section, we demonstrate that, under general assumptions of linearity, $(P_1)$ can be reformulated as the minimisation of L densely correlated sub-problems, each having only N constraints. At the same time, the prediction cost is reduced to a single function evaluation (with complexity linear in L).

We start by making the standard assumption that $f(x, y) = w^t (\phi(x) \otimes \psi(y))$, where $\phi$ and $\psi$ are the feature and label space mappings respectively, $\otimes$ is the Kronecker product and $w^t$ denotes the transpose of w. To incorporate prior knowledge and correlate classifiers efficiently, we assume that labels have at most linear, possibly dense, correlation so that it is sufficient to choose $\psi(y) = Py$, where P is an invertible matrix encoding all our prior knowledge about the labels. To reduce the number of constraints from exponential to linear, we make another standard assumption and restrict ourselves to modelling loss functions that decompose over the individual labels (Taskar et al., 2003). Hence, we require that $\Delta(y_i, y) = \sum_{l=1}^{L} \Delta_l(y_i, y_l)$, where $y_l \in \{\pm 1\}$ corresponds to label l in the set of labels represented by y. For instance, the popular Hamming loss, amongst others, satisfies this condition. The Hamming loss $\Delta(y_i, y)$ between a ground truth label vector $y_i$ and a prediction y is given by $\Delta(y_i, y) = y_i^t (y_i - y)$, which counts twice the total number of individual labels mispredicted in y. Note that the Hamming loss can be decomposed over the labels as $\Delta(y_i, y) = \sum_l (1 - y_l y_{il})$. Of course, for $\Delta$ to represent a sensible loss we also require that $\Delta(y_i, y) \geq \Delta(y_i, y_i) = 0$.
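As a quick sanity check of this decomposition (a minimal sketch of our own, not part of the original paper), the following snippet verifies that the vector form of the Hamming loss, $y_i^t(y_i - y)$, agrees with its per-label decomposition $\sum_l (1 - y_l y_{il})$ for labels in {-1,+1}.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    # Delta(y_i, y) = y_i^t (y_i - y): twice the number of mispredicted labels
    return float(y_true @ (y_true - y_pred))

def decomposed_loss(y_true, y_pred):
    # Per-label decomposition: Delta = sum_l (1 - y_l * y_il)
    return float(np.sum(1.0 - y_pred * y_true))

y_i = np.array([+1, -1, +1, +1])   # ground-truth label vector in {-1,+1}^L
y   = np.array([+1, +1, -1, +1])   # a candidate prediction
assert hamming_loss(y_i, y) == decomposed_loss(y_i, y) == 4.0  # two labels wrong -> loss 4
```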
Under these assumptions, $(P_1)$ can be expressed as

$$P_1 \equiv \min_{w} \; \tfrac{1}{2} w^t w + C \sum_{i=1}^{N} \max_{y \in \{\pm 1\}^L} \left[ \Delta(y_i, y) + w^t \left( \phi(x_i) \otimes P(y - y_i) \right) \right] \quad (4)$$

where the constraints have been moved into the objective and $\xi_i \geq 0$ eliminated by including $y = y_i$ in the maximisation. To simplify notation, we express the vector w as a D x L matrix W so that

$$P_1 \equiv \min_{W} \; \tfrac{1}{2} \operatorname{Trace}(W^t W) + C \sum_{i} \max_{y} \left[ \Delta(y_i, y) + (y - y_i)^t P^t W^t \phi(x_i) \right] \quad (5)$$

Substituting $Z = WP$, $R = P^t P \succ 0$ and using the identity $\operatorname{Trace}(ABC) = \operatorname{Trace}(CAB)$ results in

$$P_1 \equiv \min_{Z} \; \tfrac{1}{2} \sum_{l=1}^{L} \sum_{k=1}^{L} R^{-1}_{lk} z_l^t z_k + C \sum_{i} \max_{y} \sum_{l=1}^{L} \left[ \Delta_l(y_i, y_l) + (y_l - y_{il}) z_l^t \phi(x_i) \right] \quad (6)$$

where $z_l$ is the lth column of Z. Note that the terms inside the maximisation break up independently over the L components of y. It is therefore possible to interchange the maximisation and summation to get

$$P_1 \equiv \min_{Z} \; \tfrac{1}{2} \sum_{l=1}^{L} \sum_{k=1}^{L} R^{-1}_{lk} z_l^t z_k + C \sum_{i} \sum_{l=1}^{L} \max_{y_l \in \{\pm 1\}} \left[ \Delta_l(y_i, y_l) + (y_l - y_{il}) z_l^t \phi(x_i) \right] \quad (7)$$

This leads to an equivalent primal formulation $(P_2)$, expressed as the summation of L correlated problems, each having only N constraints, which is significantly easier to optimise:

$$P_2 = \min_{Z, \xi} \; \sum_{l=1}^{L} S_l \quad (8)$$
$$S_l = \tfrac{1}{2} z_l^t \sum_{k=1}^{L} R^{-1}_{lk} z_k + C \sum_{i=1}^{N} \xi_{il} \quad (9)$$
$$\text{s.t.} \quad 2 y_{il} z_l^t \phi(x_i) \geq \Delta_l(y_i, -y_{il}) - \xi_{il} \quad \forall i \quad (10)$$
$$\xi_{il} \geq \Delta_l(y_i, y_{il}) \quad \forall i \quad (11)$$

Furthermore, a novel point x can be assigned the set of labels for which the entries of $\operatorname{sign}(Z^t \phi(x))$ are +1. This corresponds to a single function evaluation as compared to the $2^L$ evaluations required by the original formulation. Note that the L classifiers in Z are not independent but are correlated by R, a positive definite matrix encoding our prior knowledge about label correlations. Depending on the application, R can be dense and even have negative entries. Note also that the number of constraints would have remained exponential had we made f quadratic in y by explicitly including pairwise terms as in (Taskar et al., 2003).

Also, note that we deliberately chose not to include bias terms b in f even though the reduction from $(P_1)$ to $(P_2)$ would still have gone through and the resulting kernelised optimisation would have been more or less the same (see Section 6.1). However, we would then have had to regularise b and correlate it using R. Otherwise, b would have been a free parameter capable of undoing the effects of R on Z. Therefore, rather than explicitly having b and regularising it, we implicitly simulate b by adding an extra dimension to the feature vector. This has the same effect while keeping the optimisation simple.

Equivalence with 1-vs-All. If label correlation information is not included, i.e. R = I, then $(P_2)$ decouples into L completely independent sub-problems, each of which can be tackled in isolation. In particular, for the Hamming loss we get

$$S_l = \min_{z_l, \xi_l} \; \tfrac{1}{2} z_l^t z_l + 2C \sum_{i=1}^{N} \xi_{il} \quad (12)$$
$$\text{s.t.} \quad y_{il} z_l^t \phi(x_i) \geq 1 - \xi_{il} \quad \forall i \quad (13)$$
$$\xi_{il} \geq 0 \quad \forall i \quad (14)$$

Thus, $S_l$ reduces to an independent binary classification sub-problem where the positive class contains all training points tagged with label l and the negative class contains all other points. This is exactly the strategy used in the popular 1-vs-All heuristic, and we can therefore now make explicit the assumptions underlying this technique. The only difference is that one should charge a misclassification penalty of 2C to be consistent with the original primal formulation.
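To make the prediction rule concrete, here is a minimal Python sketch (our own illustration with made-up names and values, not the authors' code) of labelling a novel point with sign(Z^t phi(x)): a single evaluation whose cost is linear in L, in contrast to the 2^L evaluations needed by the original formulation.

```python
import numpy as np

def predict_labels(Z, phi_x):
    """Return the predicted label vector in {-1,+1}^L for one point."""
    scores = Z.T @ phi_x              # one score per label, O(DL) work
    return np.where(scores >= 0, 1, -1)

# Example with random stand-ins for a learnt model and a test point.
D, L = 5, 3
rng = np.random.default_rng(0)
Z = rng.standard_normal((D, L))       # correlated classifiers learnt by solving (P2)
phi_x = rng.standard_normal(D)        # feature map of a novel point
print(predict_labels(Z, phi_x))       # e.g. [ 1 -1  1]
```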
4. The M3L Dual Formulation

The dual of $(P_2)$ has similar properties in that it can be viewed as the maximisation of L related problems which decouple into independent binary SVM classification problems when R = I. The dual is easily derived if we rewrite $(P_2)$ in vector notation. Defining $Y_l = \operatorname{diag}([y_{1l}, \ldots, y_{Nl}])$, $K = \phi^t(X)\phi(X)$ and $\Delta_l^{\pm} = [\Delta_l(y_1, \pm y_{1l}), \ldots, \Delta_l(y_N, \pm y_{Nl})]^t$, we get the following Lagrangian

$$\mathcal{L} = \sum_{l=1}^{L} \left( \tfrac{1}{2} \sum_{k=1}^{L} R^{-1}_{lk} z_l^t z_k + C \mathbf{1}^t \xi_l - \beta_l^t (\xi_l - \Delta_l^{+}) - \alpha_l^t \left( 2 Y_l \phi^t(X) z_l - \Delta_l^{-} + \xi_l \right) \right) \quad (15)$$

with the optimality conditions being

$$\nabla_{z_l} \mathcal{L} = 0 \;\Rightarrow\; \sum_{k=1}^{L} R^{-1}_{lk} z_k = 2 \phi(X) Y_l \alpha_l \quad (16)$$
$$\nabla_{\xi_l} \mathcal{L} = 0 \;\Rightarrow\; C\mathbf{1} - \beta_l - \alpha_l = 0 \quad (17)$$

Substituting these leads to the following dual

$$D_2 = \max_{0 \leq \alpha_l \leq C\mathbf{1}} \; \sum_{l=1}^{L} \alpha_l^t (\Delta_l^{-} - \Delta_l^{+}) - 2 \sum_{l=1}^{L} \sum_{k=1}^{L} R_{lk} \, \alpha_l^t Y_l K Y_k \alpha_k \quad (18)$$

5. Optimisation

The M3L dual is similar to the standard SVM dual. Existing optimisation techniques can therefore be brought to bear. However, the dense structure of R couples all NL dual variables, and simply porting existing solutions leads to very inefficient code. We show that, with book keeping, we can easily go from an $O(L^2)$ algorithm to an O(L) algorithm. Furthermore, by re-utilising the kernel cache, our algorithms can be very efficient even for non-linear problems. We treat the kernelised and linear M3L cases separately.

5.1. Kernelised M3L

We optimise the M3L dual using coordinate ascent with second order variable selection. Two dual variables are chosen for optimisation at every iteration. The first variable is chosen to be the one with the maximum projected gradient magnitude. The second is chosen so as to maximise second order dual progress. We maintain gradients $\nabla D_2$ (where $D_2$ is now overloaded to mean the dual objective) in order to pick variables efficiently. After optimising the dual variable $\alpha_{il}$, all gradients need to be updated as

$$\nabla_{jk} D_2^{\mathrm{new}} = \nabla_{jk} D_2^{\mathrm{old}} - 4 y_{il} y_{jk} R_{kl} K_{ij} (\alpha_{il}^{\mathrm{new}} - \alpha_{il}^{\mathrm{old}})$$

For dense R this implies an inefficient $O(L^2)$ algorithm, as NL gradients need to be updated in each iteration. To mitigate this problem, we optimise along a single label for L consecutive iterations, maintaining gradients for only that label. Now, switching to the label with the maximal dual progress might seem difficult, as gradients for the other labels have not been maintained. However, by simple book keeping, all the other gradients can be updated in O(NL) time as

$$\nabla_{jk} D_2^{\mathrm{new}} = \nabla_{jk} D_2^{\mathrm{old}} - 4 y_{jk} R_{kl} u_{jl} \quad (19)$$

where $u_{jl} = \sum_{i=1}^{N} y_{il} K_{ij} (\alpha_{il}^{\mathrm{new}} - \alpha_{il}^{\mathrm{old}})$ and l indexes the label along which we have been optimising. This leads to an efficient O(L) algorithm with dramatically reduced runtime, even though slightly more iterations were needed for convergence (see (M3L) for a proof). We also employed an effective kernel cache.
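The following Python sketch is our own illustration of the book-keeping idea behind equation (19), under the assumption of the Hamming loss (so that Delta^- - Delta^+ = 2 for every point); all names, the toy data and the single-variable update are ours, not the authors' implementation. It maintains only the gradients of the label being swept and then refreshes all other labels' gradients in one pass via the accumulator u.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, C = 6, 3, 1.0
X = rng.standard_normal((N, 4))
Y = np.where(rng.random((N, L)) > 0.5, 1, -1)        # training labels in {-1,+1}
K = X @ X.T                                           # linear kernel, for simplicity
R = np.eye(L) + 0.2 * (np.ones((L, L)) - np.eye(L))   # a dense label-correlation prior

alpha = np.zeros((N, L))
delta = 2.0 * np.ones((N, L))                         # Delta^- - Delta^+ under the Hamming loss

def full_gradient(alpha):
    """grad_il = delta_il - 4 * y_il * sum_k R_lk * (K @ (Y_k * alpha_k))_i."""
    M = K @ (Y * alpha)                               # N x L, column k is K Y_k alpha_k
    return delta - 4.0 * Y * (M @ R.T)

grad = full_gradient(alpha)

# Sweep label l once, keeping only grad[:, l] exact and accumulating u;
# the other labels' gradients are refreshed at the end, as in eq. (19).
l = 0
u = np.zeros(N)
for i in range(N):
    step = grad[i, l] / (4.0 * R[l, l] * K[i, i])     # exact maximiser of the 1-D quadratic
    new_ail = np.clip(alpha[i, l] + step, 0.0, C)
    d = new_ail - alpha[i, l]
    alpha[i, l] = new_ail
    grad[:, l] -= 4.0 * Y[i, l] * Y[:, l] * R[l, l] * K[i] * d   # keep label l's gradient exact
    u += Y[i, l] * K[i] * d                                      # u_j = sum_i y_il K_ij (a_new - a_old)

for k in range(L):                                    # one O(NL) refresh for the other labels
    if k != l:
        grad[:, k] -= 4.0 * Y[:, k] * R[k, l] * u

assert np.allclose(grad, full_gradient(alpha))        # book-keeping matches full recomputation
```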
5.2. Linear M3L

We build on top of the dual coordinate ascent algorithm of (Hsieh et al., 2008). While a set of active points is still maintained, we no longer maintain gradients or a cache. During each pass over the active set, dual variables are randomly picked and optimised analytically. Points at bound whose gradient magnitudes lie outside the range of the currently maintained extremal gradients are discarded. Extremal gradients are re-estimated at the end of each pass and, if they are too close to each other, the active set is expanded to include all training points. A straightforward implementation with globally maintained extremal gradients again leads to slow training. Essentially, if the classifier for a particular label has not yet converged, then it can force a large active set even though most points would not be considered by the other classifiers. We therefore implemented separate active sets for each label but coupled the maintained extremal gradients via R. This was empirically found to decrease training time.

6. Experiments

In this section we first compare the performance of our optimisation algorithms and then evaluate how prediction accuracy can be improved by incorporating prior knowledge about label correlations.

6.1. Optimisation Experiments

The cutting plane algorithm in SVMStruct (Tsochantaridis et al., 2005) is an excellent general purpose algorithm that can be used to optimise the original M3L formulation (P1). In each iteration, the approximately worst violating constraint is added to the active set, and the algorithm is proved to take a maximum number of iterations independent of the size of the output space. The algorithm has a user defined parameter, epsilon, for the amount of error that can be tolerated in finding the worst violating constraint.

We compared the SVMStruct algorithm to our M3L implementation on an Intel Xeon 2.67 GHz machine with 8GB RAM. Even on medium scale problems with linear kernels, our M3L implementation was nearly a hundred times faster than SVMStruct. For example, on the Media Mill dataset (Snoek et al., 2006) with 101 labels and ten, fifteen and twenty thousand training points, our M3L code took 19, 37 and 55 seconds while SVMStruct took 1995, 2998 and 7198 seconds respectively. On other datasets SVMStruct ran out of RAM or failed to converge in a reasonable amount of time (even after tuning epsilon). This demonstrates that explicitly reducing the number of constraints from exponential to linear and implementing a specialised solver can lead to a dramatic reduction in training time.

As the next best thing, we benchmark our performance against the 1-vs-All method, even though it cannot learn correlated classifiers. In the linear case, we compare to 1-vs-All trained by running LibLinear (Fan et al., 2008) and LibSVM (Chang & Lin, 2001) independently over each label. For non-linear kernels we compare to 1-vs-All trained using LibSVM. In each case, we set R = I, so that M3L reaches exactly the same solution as LibSVM and LibLinear. Also, we avoided repeated disk I/O by reading the data into RAM and using LibLinear and LibSVM's APIs.

Table 1. Timing results for our linear M3L (LM3L) and kernelised M3L (KM3L) optimisation algorithms on datasets with N training points, D features and L labels. See text for details.

(a) Animals with Attributes: D=252, L=85.
                      Linear Kernel (s)                        RBF Kernel (s)
    N       1-vs-All    LM3L    1-vs-All    KM3L        1-vs-All    KM3L
            (LibLinear)         (LibSVM)                (LibSVM)
    2,000         3        7        234        15            250       20
   10,000        48       51       5438       245           6208      501
   15,000        68       74      11990       500          13875      922
   24,292       102      104      29328      1087          34770     3016

(b) RCV1: D=47,236 (sparse), L=103.
                      Linear Kernel (s)                        RBF Kernel (s)
    N       1-vs-All    LM3L    1-vs-All    KM3L        1-vs-All    KM3L
            (LibLinear)         (LibSVM)                (LibSVM)
    2,000         7        4         54         6            139       11
   10,000        23       27        743       110           1589      177
   15,000        33       43       1407       230           2893      369
   23,149        45       57       2839       513           5600      817

(c) Siam: D=30,438 (sparse), L=22.
                      Linear Kernel (s)                        RBF Kernel (s)
    N       1-vs-All    LM3L    1-vs-All    KM3L        1-vs-All    KM3L
            (LibLinear)         (LibSVM)                (LibSVM)
    2,000         1        1         27         5             43        7
   10,000         2        2        527       126            775      185
   15,000         3        3       1118       288           1610      422
   21,519         5        5       2191       598           3095      878

(d) Media Mill: D=120, L=101.
                      Linear Kernel (s)                        RBF Kernel (s)
    N       1-vs-All    LM3L    1-vs-All    KM3L        1-vs-All    KM3L
            (LibLinear)         (LibSVM)                (LibSVM)
    2,000         2        2         11         2             15        6
   10,000        18       19        456        57            505      123
   15,000        35       37       1014       124           1107      275
   25,000        62       75       2662       337           2902      761
   30,993        84       97       4168       527           4484     1162
Table 1 lists the variation in training time with the number of training examples on the Animals with Attributes (Lampert et al., 2009), Media Mill (Snoek et al., 2006), Siam (SIA) and RCV1 (Lewis et al., 2004) datasets. The training times of linear M3L (LM3L) and LibLinear are comparable, with LibLinear being slightly faster. The training times of kernelised M3L (KM3L) are significantly lower than those of LibSVM, with KM3L sometimes being as much as 30 times faster. This is because KM3L can efficiently leverage the kernel cache across all labels, while LibSVM has to build the cache from scratch for each label. This is not an issue in linear M3L and LibLinear as no kernel cache is maintained. Thus, even though M3L generalises 1-vs-All, its training time can be comparable, and sometimes even significantly lower.

Finally, to demonstrate that our code scales to large problems, we train linear M3L on RCV1 with 781,265 points and 103 labels. Table 2 charts dual progress and train and test error with time. As can be seen, the model is nearly fully trained in under six minutes and converges in eighteen.

Table 2. Linear M3L training on RCV1.
    Time (s)      Dual       Train Error (%)    Test Error (%)
        60      1197842           0.86               0.98
       183      1473565           0.74               0.84
       300      1492664           0.72               0.83
       338      1494012           0.72               0.82
       345      1494050           0.72               0.82
       353      1494057           0.72               0.82
      1080      1494057           0.72               0.82

6.2. Incorporating Prior Knowledge

An interesting scenario, which has only recently been introduced in computer vision, is that of recognising object categories that have never been seen during training but about which prior information might be available (Lampert et al., 2009). If the training and test categories share a common set of object attributes, and the attributes for each test category are known a priori, then (Lampert et al., 2009) show how a multi-label system can be used to predict significantly better than chance. We investigate whether basic attribute prediction can be improved if the distribution of test categories is also known a priori.

The Animals with Attributes dataset (Lampert et al., 2009) has 40 training animal categories and 10 disjoint test animal categories which share a common set of 85 attributes. The attributes are densely correlated and form a fully connected graph. Each image in the database contains a dominant animal and is labelled with its 85 attributes. There are 24,292 training images and 6,180 test images. We use the 252 dimensional PHOG features provided by the authors. Training times are reported in Table 1(a).

We start by visualising the influence of R. We randomly sample 200 points from the training set and discard all but two of the attributes, "black" and "weak". These two attributes were selected as they are very weakly correlated on our training set, with a correlation coefficient of 0.2, but have a strong negative correlation of -0.76 on the test animals (Leopards, Giant Pandas, Humpback Whales, etc.). Figure 1 plots the Hamming loss on the test set as we set R = [1 r; r 1], plug it into the M3L formulation, and vary r from -1 to +1.

[Figure 1. Test Hamming loss (%) versus classifier correlation r.]

Learning independent classifiers for the two attributes (r = 0) can lead to a Hamming loss of 25% because of the mismatch between training and test sets. This can be made even worse by choosing an incorrect prior that forces the two classifiers to be correlated (r = +1). However, if our priors are generally correct, then negatively correlating the classifiers lowers prediction error.
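As a small illustration of how such a prior enters the formulation (our own sketch; the experiment above is the authors'), the snippet below builds R = [1 r; r 1] for a sweep of r and checks that it remains positive definite, since R^-1 appears in the regulariser of (P2) and values of r at exactly +/-1 would make R singular.

```python
import numpy as np

def correlation_prior(r):
    """Build the 2x2 prior for the two-attribute experiment and verify R > 0."""
    R = np.array([[1.0, r], [r, 1.0]])
    if np.min(np.linalg.eigvalsh(R)) <= 0:
        raise ValueError(f"r = {r} does not give a positive definite prior")
    return R

for r in [-0.9, -0.5, 0.0, 0.5, 0.9]:
    R = correlation_prior(r)
    # R^-1 is what couples the two attribute classifiers in the regulariser of (P2)
    print(r, np.round(np.linalg.inv(R), 2))
```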
We now evaluate performance quantitatively on the same training set but with all 85 labels. For the M3L formulation we set $R = \sum_{c=1}^{10} p(c) \, y_c y_c^t$, where $y_c$ is the known attribute vector for test category c and p(c) is the probability of occurrence of class c during testing (which we require as additional prior information). Under this setup, learning independent classifiers using 1-vs-All yields a Hamming loss of 29.38%. The Hamming loss for M3L, with this specific choice of R, is 26.35%. This decrease in error is very significant given that 1-vs-All, trained on all 24,292 training points, only manages to reduce the error to 28.64%. Thus M3L, with extra knowledge in the form of just the test category distribution, can dramatically reduce test error. The results also compare favourably to other independent methods such as BoostTexter (Schapire & Singer, 2000) (30.28%), power set multi-class classification (32.70%), 5 nearest neighbours (31.79%), regression (Hsu et al., 2009) (29.38%) and ranking (Crammer & Singer, 2003) (34.84%).
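For concreteness, the following sketch (ours, with hypothetical inputs) builds the prior R = sum_c p(c) y_c y_c^t described above from the known attribute vectors of the test categories and their assumed occurrence probabilities.

```python
import numpy as np

# Hypothetical inputs: attribute vectors (in {-1,+1}^L) of the unseen test
# categories and their assumed occurrence probabilities p(c) at test time.
test_attributes = np.array([
    [+1, -1, +1, +1],    # e.g. category 1
    [-1, -1, +1, -1],    # e.g. category 2
    [+1, +1, -1, -1],    # e.g. category 3
], dtype=float)
p = np.array([0.5, 0.3, 0.2])            # must sum to one

# R = sum_c p(c) y_c y_c^t, as used for the Animals with Attributes experiment.
R = sum(pc * np.outer(yc, yc) for pc, yc in zip(p, test_attributes))

# A small ridge term may be needed if R is rank deficient, since the M3L
# regulariser uses R^-1 (this is our assumption, not stated in the paper).
R += 1e-3 * np.eye(R.shape[0])
print(np.round(R, 3))
```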
Benchmark Datasets. Our interest is in multi-label problems where the training set statistics do not reflect the test set statistics. Unfortunately, most benchmark datasets do not have this property. We therefore take the Siam, Media Mill and RCV1 datasets and create train and test splits where the labels are correlated differently. The R matrix, encoding label correlation information, is estimated from a disjoint third set whose points are not used for training. Table 3 compares the performance of the various methods. In these scenarios, it would appear that M3L can consistently leverage prior knowledge to outperform independent methods.

Table 3. Test Hamming loss (%) on benchmark datasets.
    Method          Siam    Media Mill    RCV1
    M3L             8.41       3.78       3.45
    1-vs-All       11.15       4.69       4.25
    BoostTexter    12.91       4.91       4.12
    Power Set      14.01       6.27       3.71
    Regression     11.19       4.69       4.26
    Ranking         9.41       9.06       5.67
    5-NN           12.51       4.74       4.47

7. Conclusions

We developed the M3L formulation for max-margin multi-label classification with prior knowledge about densely correlated labels. We showed that the number of constraints could be reduced from exponential to linear and, in the process, generalised 1-vs-All classification. We also developed efficient optimisation algorithms that were orders of magnitude faster than the standard cutting plane method. Our kernelised algorithm was significantly faster than even 1-vs-All and hence our code, available from (M3L), can also be used for efficient independent learning. Finally, we demonstrated that incorporating prior knowledge using M3L could improve prediction accuracy over independent methods and that M3L trained on 200 points could outperform 1-vs-All trained on nearly 25,000.

Acknowledgements

We would like to thank Alekh Agarwal, Brendan Frey and Sunita Sarawagi for helpful discussions.

References

M3L code and convergence proof. http://research.microsoft.com/~manik/code/M3L/download.html
The SIAM Text Mining Competition 2007. http://www.cs.utk.edu/tmw07/
Boutell, M., Luo, J., Shen, X., and Brown, C. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
Cai, L. and Hofmann, T. Exploiting known taxonomies in learning overlapping concepts. In IJCAI, pp. 714-719, 2007.
Cesa-Bianchi, N., Gentile, C., and Zaniboni, L. Incremental algorithms for hierarchical classification. JMLR, 7:31-54, 2006.
Chang, C.-C. and Lin, C.-J. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Crammer, K. and Singer, Y. A family of additive online algorithms for category ranking. JMLR, 3:1025-1058, 2003.
Elisseeff, A. and Weston, J. A kernel method for multi-labelled classification. In NIPS, pp. 681-687, 2001.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. LIBLINEAR: A library for large linear classification. JMLR, 9:1871-1874, 2008.
Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S. S., and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
Hsu, D., Kakade, S., Langford, J., and Zhang, T. Multi-label prediction via compressed sensing. In NIPS, 2009.
Ji, S., Sun, L., Jin, R., and Ye, J. Multi-label multiple kernel learning. In NIPS, pp. 777-784, 2008.
Lampert, C. H., Nickisch, H., and Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
Lewis, D., Yang, Y., Rose, T., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361-397, 2004.
McCallum, A. Multi-label text classification with a mixture model trained by EM. In AAAI 99 Workshop on Text Learning, 1999.
Rifkin, R. and Klautau, A. In defense of one-vs-all classification. JMLR, 5:101-141, 2004.
Rousu, J., Saunders, C., Szedmak, S., and Shawe-Taylor, J. Kernel-based learning of hierarchical multilabel classification models. JMLR, 7:1601-1626, 2006.
Schapire, R. E. and Singer, Y. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
Snoek, C., Worring, M., van Gemert, J., Geusebroek, J.-M., and Smeulders, A. The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM Multimedia, pp. 421-430, 2006.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In NIPS, 2003.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, 2005.
Tsoumakas, G. and Katakis, I. Multi-label classification: An overview. Int. Journal of Data Warehousing and Mining, 3(3):1-13, 2007.