Graded Multilabel Classification: The Ordinal Case

Weiwei Cheng(1)  cheng@informatik.uni-marburg.de
Krzysztof Dembczyński(1,2)  dembczynski@informatik.uni-marburg.de
Eyke Hüllermeier(1)  eyke@informatik.uni-marburg.de

(1) Mathematics and Computer Science, Marburg University, Hans-Meerwein-Str., 35032 Marburg, Germany
(2) Institute of Computing Science, Poznań University of Technology, Piotrowo 2, 60-965 Poznań, Poland

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We propose a generalization of multilabel classification that we refer to as graded multilabel classification. The key idea is that, instead of requesting a yes-no answer to the question of class membership or, say, relevance of a class label for an instance, we allow for a graded membership of an instance, measured on an ordinal scale of membership degrees. This extension is motivated by practical applications in which a graded or partial class membership is natural. Apart from introducing the basic setting, we propose two general strategies for reducing graded multilabel problems to conventional (multilabel) classification problems. Moreover, we address the question of how to extend performance metrics commonly used in multilabel classification to the graded setting, and present first experimental results.
1. Introduction

Problems of multilabel classification (MLC), in which an instance may belong to several classes simultaneously or, say, in which more than one label can be attached to a single instance, are ubiquitous in everyday life: At IMDb, a movie can be categorized as action, crime, and thriller, a CNN news report can be tagged as people and political at the same time, etc. Correspondingly, MLC has received increasing attention in machine learning in recent years.

In this paper, we propose a generalization of MLC that we shall refer to as graded multilabel classification (GMLC). The key idea is that, instead of requesting a yes-no answer to the question of class membership or, say, relevance of a class label for an instance, we allow an instance to belong to a class to a certain degree. In other words, we allow for graded class membership in the sense of fuzzy set theory (Zadeh, 1965). In fact, there are many applications for which this extension seems to make perfect sense. In the case of movie genres, for example, it is not always easy to say whether or not a movie belongs to the category action, and there are definitely examples which can be considered as "almost action" or "somewhat action". Another obvious example comes from one of the benchmark data sets in MLC, namely the emotions data (Trohidis et al., 2008). Here, the problem is to label a song according to the Tellegen-Watson-Clark model of mood: amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, and angry-aggressive.

It is important to emphasize that the relevance of a label is indeed gradual in the sense of fuzzy logic and not uncertain in the sense of probability theory. The latter would mean that, e.g., a song is either relaxing or it is not; one is only uncertain about which of these two exclusive alternatives is correct. As opposed to this, gradualness is caused by the vagueness of categories like "relaxing song" and "action movie", and means that one does not have to fully agree on one of the alternatives. Instead, one can say that a song is somewhere in-between (and can be certain about this).

As will be explained in more detail later on, our idea is to replace simple "yes" or "no" labels by membership degrees taken from a finite ordered scale such as

$$M = \{ \text{not at all, somewhat, almost, fully} \} \, . \qquad (1)$$

Admittedly, graded multilabel data sets of that kind are not yet widely available. We believe, however, that this is a kind of hen and egg problem: As long as there are no methods for learning from graded multilabel data, new data sets will be created in the common way, possibly forcing people to give a "yes" or "no" answer even when they are hesitating.

The rest of this paper is organized as follows: The problem of multilabel classification is introduced in a more formal way in Section 2. In Section 3, we propose our graded generalization of MLC and, moreover, outline two different strategies for reducing GMLC problems to conventional (multilabel) classification problems. In Section 4, we address the question of how to extend MLC evaluation metrics from the conventional to the graded setting. Finally, Section 5 presents some first experimental results.

2. Multilabel Classification

Let $X$ denote an instance space and let $L = \{\lambda_1, \lambda_2, \ldots, \lambda_m\}$ be a finite set of class labels. Moreover, suppose that each instance $x \in X$ can be associated with a subset of labels $L_x \in 2^L$; this subset is often called the set of relevant labels, while the complement $L \setminus L_x$ is considered as irrelevant for $x$. Given training data in the form of a finite set $T$ of tuples $(x, L_x) \in X \times 2^L$, typically assumed to be drawn independently from an (unknown) probability distribution on $X \times 2^L$, the goal in multilabel classification is to learn a classifier $H : X \to 2^L$ that generalizes well beyond these observations in the sense of minimizing the expected prediction loss with respect to a specific loss function. Examples of commonly used loss functions include the subset zero-one loss, which is 0 if $H(x) = L_x$ and 1 otherwise, and the Hamming loss, which computes the percentage of labels whose relevance is incorrectly predicted:

$$E_H(H(x), L_x) = \frac{1}{|L|} \, \big| H(x) \,\Delta\, L_x \big| \, , \qquad (2)$$

where $\Delta$ denotes the symmetric difference between sets.

An MLC problem can be reduced to a conventional classification problem in a straightforward way, namely by considering each label subset $L \in 2^L$ as a distinct (meta-)class. This approach is referred to as label powerset (LP) in the literature. An obvious drawback of this approach is the potentially large number of classes that one has to deal with in the newly generated problem, namely $2^{|L|}$ (or $2^{|L|} - 1$ if the empty set is excluded as a prediction). This is the reason why LP typically works well if the original label set $L$ is small but quickly deteriorates for larger label sets (Tsoumakas & Vlahavas, 2007).

Another way of reducing multilabel to conventional classification is offered by the binary relevance (BR) approach. Here, a separate binary classifier $H_i$ is trained for each label $\lambda_i \in L$, reducing the supervision to information about the presence or absence of this label while ignoring the other ones. For a query instance $x$, this classifier is supposed to predict whether $\lambda_i$ is relevant for $x$ ($H_i(x) = 1$) or not ($H_i(x) = 0$). A multilabel prediction for $x$ is then given by $H(x) = \{\lambda_i \in L \mid H_i(x) = 1\}$. Since binary relevance learning treats every label independently of all other labels, an obvious disadvantage of this approach is its ignorance of correlations and interdependencies between labels.

Many approaches to MLC learn a multilabel classifier $H$ in an indirect way via a scoring function $f : X \times L \to \mathbb{R}$ that assigns a real number to each instance/label combination. The idea is that a score $f(x, \lambda)$ is in direct correspondence with the probability that $\lambda$ is relevant for $x$. Given a scoring function of this type, multilabel prediction can be realized via thresholding:

$$H(x) = \{ \lambda \in L \mid f(x, \lambda) \geq t \} \, ,$$

where $t \in \mathbb{R}$ is a threshold. As a byproduct, a scoring function offers the possibility to produce a ranking (weak order) $\succeq_x$ of the class labels, simply by sorting them according to their score:

$$\lambda_i \succeq_x \lambda_j \iff f(x, \lambda_i) \geq f(x, \lambda_j) \, . \qquad (3)$$
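To make the "scoring plus thresholding" route concrete, the following minimal sketch implements binary relevance with a shared threshold. The choice of logistic regression as base learner, the threshold $t = 0.5$, and all identifiers are our own illustrative assumptions, not prescriptions of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_binary_relevance(X, Y):
    """One binary classifier per label; Y is an n x m 0/1 relevance matrix."""
    return [LogisticRegression().fit(X, Y[:, i]) for i in range(Y.shape[1])]

def predict_multilabel(classifiers, x, t=0.5):
    """Scores f(x, lambda_i), the thresholded label set H(x), and the
    ranking of eq. (3), obtained by sorting labels by score."""
    scores = np.array([clf.predict_proba(x.reshape(1, -1))[0, 1]
                       for clf in classifiers])
    H_x = {i for i, s in enumerate(scores) if s >= t}   # thresholding
    ranking = list(np.argsort(-scores))                 # best label first
    return scores, H_x, ranking
```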
Sometimes, this ranking is even more desirable as a prediction, and indeed, there are several evaluation metrics that compare a true label subset with a predicted ranking instead of a predicted label subset. An example is the rank loss, which computes the average fraction of label pairs that are not correctly ordered:

$$E_R(f, L_x) = \frac{\sum_{(\lambda, \lambda') \in L_x \times \bar{L}_x} S\big(f(x, \lambda), f(x, \lambda')\big)}{|L_x| \cdot |\bar{L}_x|} \, ,$$

where $\bar{L}_x = L \setminus L_x$ is the set of irrelevant labels and $S(u, v) = 1$ if $u < v$, $= 1/2$ if $u = v$, and $= 0$ if $u > v$.

The idea to solve both problems simultaneously, ranking and MLC, has recently been addressed in (Fürnkranz et al., 2008): A calibrated ranking is a ranking with a "zero point" separating a positive (relevant) part from a negative (irrelevant) one.

3. Graded Multilabel Classification

Generalizing the above setting of multilabel classification, we now assume that each instance $x \in X$ can belong to each class $\lambda \in L$ to a certain degree. In other words, the set $L_x$ of relevant labels is now a fuzzy subset of $L$. This fuzzy set is characterized by a membership function, namely an $L \to M$ mapping, where $M$ is the set of graded membership degrees. For notational simplicity, we shall not distinguish between the fuzzy set $L_x$ and its membership function, and denote by $L_x(\lambda)$ the degree of membership of the label $\lambda \in L$ in the fuzzy set $L_x$.

[Figure 1. Vertical reduction, viz. prediction of membership degree (ordinate) for each label (abscissa).]
[Figure 2. Horizontal reduction, viz. prediction of a subset of labels (indicated by black circles) on each level.]

In fuzzy set theory, the set of membership degrees is supposed to form a complete lattice and is normally taken as the unit interval (i.e., $M = [0, 1]$ endowed with the standard order). Here, however, we prefer an ordinal scale of membership degrees, that is, a finite ordered set of membership degrees such as (1). More generally, we assume that $M = \{m_0, m_1, \ldots, m_k\}$, where $m_0 < m_1 < \ldots < m_k$ (and $m_0 = 0$ and $m_k = 1$ have the special meaning of zero and full membership). In the context of multilabel classification, an ordinal membership scale is arguably more convenient from a practical point of view, especially with regard to data acquisition. In fact, people often prefer to give ratings on an ordinal scale like (1) instead of choosing precise numbers on a cardinal scale. The goal, now, is to learn a mapping $H : X \to F(L)$, where $F(L)$ is the class of fuzzy subsets of $L$ (with membership degrees in $M$).
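As a small illustration of the setting (not taken from the paper), a graded label set $L_x$ can be encoded by mapping each label to the index of its membership degree on $M$; the concrete degrees below are invented for illustration.

```python
# Ordinal membership scale (1): index i encodes degree m_i, with m_0 = 0.
M = ["not at all", "somewhat", "almost", "fully"]   # m_0 < m_1 < m_2 < m_3

# A hypothetical graded label set L_x for the emotions labels,
# encoded as degree indices into M.
L_x = {"amazed-surprised": 0, "happy-pleased": 2, "relaxing-calm": 3,
       "quiet-still": 1, "sad-lonely": 0, "angry-aggressive": 0}
```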
Following the general idea of reduction (Balcan et al., 2008), we seek to make GMLC problems amenable to conventional multilabel methods via suitable transformations. There are two more or less obvious possibilities to reduce graded multilabel classification to conventional (multilabel) classification. In agreement with the distinction between the "vertical" description of a fuzzy subset $F$ of a set $U$ (through the membership function, i.e., by specifying the degree of membership $F(u)$ for each element $u \in U$) and the "horizontal" description (via level cuts $[F]_\alpha = \{u \in U \mid F(u) \geq \alpha\}$), we distinguish between a vertical and a horizontal reduction.

3.1. Vertical Reduction

Recall the binary relevance approach to conventional MLC: For each label $\lambda_i \in L$, a separate binary classifier $H_i$ is trained to predict whether this label is relevant ($H_i(x) = 1$) or not ($H_i(x) = 0$) for a query instance $x \in X$. Generalizing this approach to GMLC, the idea is to induce a classifier

$$H_i : X \to M \qquad (4)$$

for each label $\lambda_i$. For each query instance $x \in X$, this classifier is supposed to predict the degree of membership of $\lambda_i$ in the fuzzy set of labels $L_x$. Instead of a binary classification problem, as in MLC, each classifier $H_i$ is now solving a multi-class problem. Since the target space $M$ has an ordinal structure, these problems are ordinal classification problems. In other words, the vertical reduction of a GMLC problem eventually leads to solving a set of $m$ (non-independent) ordinal classification problems; see Fig. 1 for an illustration.

Just like simple binary problems, ordinal classification problems are often solved indirectly via "scoring plus thresholding": First, a scoring function $f(\cdot)$ is learned, and $k$ thresholds $t_1, \ldots, t_k$ are determined; then, for an instance $x$, the $i$-th class is predicted if $f(x, \lambda_i)$ lies between $t_{i-1}$ and $t_i$. Of course, if the classifiers (4) are learned in this way, i.e., by inducing a scoring function $f(\cdot, \lambda_i)$ for each label $\lambda_i$, then these scoring functions can also be used to predict a ranking (3).

3.2. Horizontal Reduction

From fuzzy set theory, it is well known that a fuzzy set $F$ can be represented "horizontally" in terms of its level cuts. This representation suggests another decomposition of a GMLC problem: For each level $\alpha \in \{m_1, m_2, \ldots, m_k\}$, learn the mapping

$$H^{(\alpha)} : X \to 2^L \, , \quad x \mapsto [L_x]_\alpha \, . \qquad (5)$$

Obviously, each of these problems is a standard MLC problem, since the level cuts $[L_x]_\alpha$ are standard subsets of the label set $L$. Thus, the horizontal reduction comes down to solving $k$ standard MLC problems; see Fig. 2 for an illustration.

It is worth mentioning that this decomposition comes with a special challenge. In fact, since level cuts are nested in the sense that $[F]_\beta \subseteq [F]_\alpha$ for $\alpha < \beta$, the $k$ MLC problems are not independent of each other. Instead, the predictions should be monotone in the sense that

$$\big(\lambda \in H^{(m_j)}(x)\big) \Rightarrow \big(\lambda \in H^{(m_{j-1})}(x)\big) \qquad (6)$$

for all $j \in \{2, \ldots, k\}$. Thus, whenever a label $\lambda_i$ is predicted to be in the $m_j$-cut of the fuzzy label set $L_x$ associated with $x$, it must also be in all lower level cuts. Satisfying this requirement is a non-trivial problem. In particular, (6) will normally not be guaranteed when solving the $k$ problems independently of each other.
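The construction of the $k$ training problems is mechanical once graded labels are encoded as degree indices; the following sketch (our own, under that encoding assumption) derives the $k$ binary target matrices of the horizontal reduction.

```python
import numpy as np

def level_cut_targets(Y_graded, k):
    """Horizontal reduction: from an n x |L| matrix of degree indices in
    {0, ..., k}, build k binary target matrices, one standard MLC problem
    per level cut (lambda is in [L_x]_{m_i} iff L_x(lambda) >= m_i)."""
    return [(Y_graded >= i).astype(int) for i in range(1, k + 1)]

# By construction, the true cuts are nested: every 1 in the m_i matrix is
# also a 1 in the m_{i-1} matrix. Independent *predictions* on the k
# problems, however, need not respect this monotonicity (condition (6)).
```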
Once an ensemble of $k$ multilabel classifiers $H^{(m_1)}, \ldots, H^{(m_k)}$ has been trained, predictions can be obtained as follows:

$$H(x)(\lambda) = \max \{\, m_i \in M \mid \lambda \in H^{(m_i)}(x) \,\} \qquad (7)$$

(with $H(x)(\lambda) = m_0$ if $\lambda$ is contained in none of the predicted cuts). Thus, the degree of membership of a label $\lambda \in L$ in the predicted fuzzy set of labels associated with $x$ is given by the maximum degree $m_i \in M$ for which $\lambda$ is still in the predicted $m_i$-cut of this set.

The prediction of a ranking (3) is arguably less obvious in the case of the horizontal decomposition. Suppose that $f^{(m_1)}, \ldots, f^{(m_k)}$ are scoring functions trained on the $k$ level cuts, using a conventional MLC method. As a counterpart to the monotonicity condition (6), we should require

$$f^{(m_1)}(x, \lambda) \geq f^{(m_2)}(x, \lambda) \geq \ldots \geq f^{(m_k)}(x, \lambda) \qquad (8)$$

for all $x \in X$ and $\lambda \in L$. In fact, interpreting $f^{(m_i)}(x, \lambda)$ as a measure of how likely $\lambda$ is a relevant label on level $m_i$, this condition follows naturally from $[L_x]_{m_1} \supseteq [L_x]_{m_2} \supseteq \ldots \supseteq [L_x]_{m_k}$. On each level $m_i$, the function $f^{(m_i)}(x, \cdot)$ induces a ranking $\succeq_x^{(m_i)}$ via (3); however, the identity $\succeq_x^{(m_i)} = \succeq_x^{(m_j)}$ is of course not guaranteed, that is, $\succeq_x^{(m_i)}$ may differ from $\succeq_x^{(m_j)}$ for $1 \leq i \neq j \leq k$.

To obtain a global ranking $\succeq_x$, the level-wise rankings $\succeq_x^{(m_i)}$ need to be aggregated into a single one. To this end, we propose to score a label $\lambda$ by

$$f(x, \lambda) = \sum_{i=1}^{k} f^{(m_i)}(x, \lambda) \, . \qquad (9)$$

This aggregation is especially reasonable if the scores $f^{(m_i)}(x, \lambda)$ can be interpreted as probabilities of relevance $P(\lambda \in [L_x]_{m_i})$. Then, $f(x, \lambda)$ simply corresponds to the expected level of $\lambda$, since

$$\sum_{i=1}^{k} f^{(m_i)}(x, \lambda) = \sum_{i=1}^{k} P\big(\lambda \in [L_x]_{m_i}\big) = \sum_{i=1}^{k} P\big(L_x(\lambda) \geq m_i\big) = \sum_{i=1}^{k} i \cdot P\big(L_x(\lambda) = m_i\big) \, .$$

Note, however, that we simply equated the levels $m_i$ with the numbers $i$ in this derivation, i.e., the ordinal scale $M$ was implicitly embedded in a numerical scale by the mapping $m_i \mapsto i$ (on $M$ itself, an averaging operation of this kind is not even defined). Despite being critical from a theoretical point of view, this embedding is often used in ordinal classification, for example when computing the absolute error $AE(m_i, m_j) = |i - j|$ as a loss function (Lin & Li, 2007). Interestingly, the absolute error is minimized (in expectation) by the median and, moreover, this estimation is invariant toward rescaling (Berger, 1985). Thus, it does actually not depend on the concrete embedding chosen. Seen from this point of view, the median appears to be a theoretically more solid score than the mean value (9). However, it produces many ties, which is disadvantageous from a ranking point of view. This problem is avoided by (9), which can be seen as an approximation of the median that breaks ties in a reasonable way.

3.3. Combination of Both Reductions

As mentioned above, the binary relevance approach is a standard (meta-)technique for solving MLC problems. Consequently, it can also be applied to each problem (5) produced by the horizontal reduction. Since BR can again be seen as a "vertical" decomposition of a regular MLC problem, one thus obtains a combination of horizontal and vertical decomposition: first horizontal, then vertical. Likewise, the two types of reduction can be combined the other way around, first vertical and then horizontal. This is done by solving the ordinal classification problems produced by the vertical reduction by means of a "horizontal" decomposition, namely a meta-technique that has been proposed by Frank & Hall (2001): Given an ordered set of class labels $M = \{m_0, m_1, \ldots, m_k\}$, the idea is to train $k$ binary classifiers, where the $i$-th classifier considers the instances with label $m_0, \ldots, m_{i-1}$ as positive and those with label $m_i, \ldots, m_k$ as negative.

Interestingly, both combinations eventually coincide in the sense of ending up with the same binary classification problems. Roughly speaking, a single binary problem is solved for each label/level combination $(\lambda_i, m_j) \in L \times M$ (each circle in the picture in Fig. 2), namely the problem to decide whether $L_x(\lambda_i) \leq m_j$ or $L_x(\lambda_i) > m_j$. Any difference between the two approaches is then due to different ways of aggregating the predictions of the binary classifiers. In principle, however, such differences can only occur in the case of inconsistencies, i.e., if the monotonicity condition (6) is violated.
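The coincidence of the two combinations can be made concrete in code. The sketch below (our illustration under the same integer encoding as before, not the authors' implementation) materializes one binary target per label/level pair and reconstructs a degree by counting positive levels, which agrees with rule (7) whenever the monotonicity condition (6) holds.

```python
import numpy as np

def per_pair_targets(Y_graded, k):
    """One binary problem per (label, level) pair (lambda_i, m_j): the
    target is 1 iff L_x(lambda_i) >= m_j, i.e., lambda_i lies in the
    m_j-cut. This is the common refinement of both reduction orders."""
    n, m = Y_graded.shape
    return {(i, j): (Y_graded[:, i] >= j).astype(int)
            for i in range(m) for j in range(1, k + 1)}

def degree_from_pairs(pred, m, k):
    """Aggregate 0/1 predictions (dict mapping (label, level) pairs to a
    decision for one instance) back into degree indices. Counting positive
    levels coincides with rule (7) under monotonicity (6)."""
    return np.array([sum(pred[(i, j)] for j in range(1, k + 1))
                     for i in range(m)])
```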
3.4. Generalizing IBLR-ML

Our discussion so far has been restricted to meta-techniques for reducing GMLC to MLC problems, without looking at concrete methods. Nevertheless, there are several methods that can be generalized immediately from the binary to the gradual case. As an example, we mention the IBLR-ML method that will also be used in our experiments later on. This method, which was recently proposed in (Cheng & Hüllermeier, 2009), combines instance-based learning with logistic regression and again trains one classifier $H_i$ for each label. For the $i$-th label $\lambda_i$, this classifier is derived from the logistic regression equation

$$\log \frac{\pi_0^{(i)}}{1 - \pi_0^{(i)}} = \beta_0^{(i)} + \sum_{j=1}^{m} \beta_j^{(i)} \cdot \omega_{+j}(x_0) \, , \qquad (10)$$

where $\pi_0^{(i)}$ denotes the (posterior) probability that $\lambda_i$ is relevant for $x_0$, and

$$\omega_{+j}(x_0) = \sum_{x \in N(x_0)} \kappa(x_0, x) \cdot y_j(x) \qquad (11)$$

is a summary of the presence of the $j$-th label $\lambda_j$ in the neighborhood of $x_0$; here, $\kappa$ is a kernel function, such as the (data-dependent) "KNN kernel" $\kappa(x_0, x_i) = 1$ if $x_i \in N_k(x_0)$ and $= 0$ otherwise, where $N_k(x_0)$ is the set of $k$ nearest neighbors of $x_0$. Moreover, $y_j(x) = +1$ if $\lambda_j$ is present (relevant) for the neighbor $x$, and $y_j(x) = -1$ in case it is absent (non-relevant). Obviously, this approach is able to capture interdependencies between class labels: The estimated coefficient $\beta_j^{(i)}$ indicates to what extent the relevance of label $\lambda_i$ is influenced by the relevance of $\lambda_j$. A value $\beta_j^{(i)} > 0$ means that the presence of $\lambda_j$ makes the relevance of $\lambda_i$ more likely, i.e., there is a positive correlation. Correspondingly, a negative coefficient would indicate a negative correlation. Given a query instance $x_0$, a multilabel prediction is made on the basis of the predicted posterior probabilities of relevance: $H(x_0) = \{\lambda_i \in L \mid \pi_0^{(i)} > 1/2\}$.

This approach can be generalized to the GMLC setting using both the horizontal and the vertical reduction. The vertical reduction leads to solving an ordinal instead of a binary logistic regression problem for each label, while the horizontal reduction comes down to solving the following $k$ multilabel problems ($r = 1, \ldots, k$):

$$\log \frac{\pi_0^{(i,r)}}{1 - \pi_0^{(i,r)}} = \beta_0^{(i,r)} + \sum_{j=1}^{m} \beta_j^{(i,r)} \cdot \omega_{+j}^{(r)}(x_0) \, . \qquad (12)$$

Recall, however, that these problems are not independent of each other. Solving them simultaneously so as to guarantee the monotonicity constraint (6) is an interesting but non-trivial task. In the experiments in Section 5, we therefore derived independent predictions and simply combined them by (7).
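A rough sketch of the ingredients of (10) and (11) may help: with the KNN kernel, the features $\omega_{+j}(x_0)$ are neighborhood sums of $\pm 1$ label indicators, on top of which one logistic regression per label is fitted. The use of scikit-learn, the neighborhood size, and all identifiers are our assumptions; this is not the authors' code, and a GMLC version would repeat this construction per level cut as in (12).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def iblr_features(X, Y_pm, n_neighbors):
    """Eq. (11) with the KNN kernel: omega_{+j}(x0) sums the +1/-1 label
    indicators y_j over the nearest neighbors of x0 (here computed on the
    training points themselves, excluding each point from its own list)."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is the point itself
    return Y_pm[idx[:, 1:]].sum(axis=1)    # shape (n, m): one feature per label

def train_iblr_ml(X, Y01, n_neighbors=10):
    """Eq. (10): one logistic regression per label lambda_i, taking the
    neighborhood summaries of *all* labels as inputs, so that the
    coefficients beta_j^(i) can pick up label correlations."""
    omega = iblr_features(X, 2 * Y01 - 1, n_neighbors)
    return [LogisticRegression().fit(omega, Y01[:, i])
            for i in range(Y01.shape[1])]
```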
4. Loss Functions

As mentioned before, a number of different loss functions have already been proposed within the setting of MLC. In principle, all these functions can be generalized so as to make them applicable to the setting of GMLC. In this section, we propose extensions of some important and frequently used measures. Moreover, we address the question of how to handle these extensions in the context of the horizontal and vertical reduction technique, respectively.

4.1. Representation of Generalized Losses

To generalize the Hamming loss (2), it is necessary to replace the symmetric difference operator defined on sets, $\Delta$, by the symmetric difference between two fuzzy sets. This can be done, for example, by averaging over the symmetric differences of the corresponding level cuts, which in our case leads to

$$E_H(H(x), L_x) = \frac{1}{k|L|} \sum_{i=1}^{k} \big| [H(x)]_{m_i} \,\Delta\, [L_x]_{m_i} \big| \, . \qquad (13)$$

Note that this "horizontal" computation can be replaced by an equivalent "vertical" one, namely

$$E_H(H(x), L_x) = \frac{1}{k|L|} \sum_{i=1}^{|L|} AE\big(H(x)(\lambda_i), L_x(\lambda_i)\big) \, , \qquad (14)$$

where $AE(\cdot)$ is the absolute error of a predicted membership degree which, as mentioned above, is defined by $AE(m_i, m_j) = |i - j|$. In other words, minimizing the symmetric difference level-wise is equivalent to minimizing the absolute error label-wise.

It is worth mentioning that the existence of an equivalent horizontal and vertical representation of a loss function, like in the case of (13) and (14), is not self-evident. For example, replacing in (14) the absolute error on the ordinal scale $M$ by the simple 0/1 loss leads to

$$E_{0/1}(H(x), L_x) = \frac{1}{|L|} \sum_{i=1}^{|L|} \llbracket H(x)(\lambda_i) \neq L_x(\lambda_i) \rrbracket \, ,$$

where $\llbracket \cdot \rrbracket$ is 1 if its argument holds and 0 otherwise. The rank loss can be generalized by means of the C-index, which is well known as a measure of concordance in statistics (Gönen & Heller, 2005), and which is essentially equivalent to the pairwise ranking error introduced in (Herbrich et al., 2000):

$$E_R(f, L_x) =$$
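Returning to (13) and (14): the equivalence of the two representations is easy to check numerically. The following sketch (ours, again assuming degree indices in $\{0, \ldots, k\}$) computes the generalized Hamming loss both ways.

```python
import numpy as np

def hamming_horizontal(y_pred, y_true, k):
    """Eq. (13): average symmetric difference of the k level cuts."""
    total = sum(np.sum((y_pred >= i) != (y_true >= i))
                for i in range(1, k + 1))
    return total / (k * y_true.size)

def hamming_vertical(y_pred, y_true, k):
    """Eq. (14): label-wise absolute error on the degree indices."""
    return np.abs(y_pred - y_true).sum() / (k * y_true.size)

# Both routes yield the same value, illustrating the equivalence:
y_true = np.array([0, 2, 3, 1])   # degrees on M = {m_0, ..., m_3}, so k = 3
y_pred = np.array([1, 2, 0, 1])
assert np.isclose(hamming_horizontal(y_pred, y_true, 3),
                  hamming_vertical(y_pred, y_true, 3))
```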