Label Ranking Methods based on the Plackett-Luce Model

Weiwei Cheng¹  cheng@informatik.uni-marburg.de
Krzysztof Dembczyński¹,²  dembczynski@informatik.uni-marburg.de
Eyke Hüllermeier¹  eyke@informatik.uni-marburg.de
¹ Mathematics and Computer Science, Marburg University, Hans-Meerwein-Str., 35032 Marburg, Germany
² Institute of Computing Science, Poznań University of Technology, Piotrowo 2, 60-965 Poznań, Poland

Abstract

This paper introduces two new methods for label ranking based on a probabilistic model of ranking data, called the Plackett-Luce model. The idea of the first method is to use the PL model to fit locally constant probability models in the context of instance-based learning. As opposed to this, the second method estimates a global model in which the PL parameters are represented as functions of the instance. Comparing our methods with previous approaches to label ranking, we find that they offer a number of advantages. Experimentally, we moreover show that they are highly competitive to state-of-the-art methods in terms of predictive accuracy, especially in the case of training data with incomplete ranking information.

1. Introduction

The problem of label ranking can be considered as a generalization of conventional classification, insofar as a complete ranking of all labels is requested as a prediction instead of only a single class label. Thus, as we shall explain in more detail later on, the label ranking problem consists of learning a mapping from instances to rankings over a finite set of predefined labels. Several methods for label ranking have already been proposed in the literature; we refer to (Vembu & Gärtner, 2010) for a comprehensive survey. Existing methods for label ranking are mostly reduction techniques transforming the original learning problem into one or several binary classification problems. So-called constraint classification, for example, turns the original problem into a single binary classification problem in an expanded space of higher dimension, and constructs a label ranking model from the classifier learned in that space (Har-Peled et al., 2003). Another approach, ranking by pairwise comparison, reduces the original problem to several small instead of a single large binary classification problem. More specifically, one binary model is learned for each pair of labels, and the predictions of these models are combined into a ranking of all labels (Hüllermeier et al., 2008).

Reduction techniques of this kind have shown promising performance in first experimental studies. Moreover, the reduction of the label ranking problem to the simpler problem of binary classification is appealing for several reasons. Notably, it makes the label ranking problem amenable to the large repertoire of (binary) classification methods and existing algorithms in this field. On the other hand, reduction techniques also come with some disadvantages. In particular, theoretical assumptions on the sought "ranking-valued" mapping, which may serve as a proper learning bias, may not be easily translated into corresponding assumptions for the classification problems. Likewise, it is often not clear (and mostly even wrong) that minimizing the classification error, or a related loss function, on the binary problems is equivalent to maximizing the (expected) performance of the label ranking model in terms of the desired loss function on rankings (Hüllermeier & Fürnkranz, 2010).

An alternative approach, which avoids these problems to some extent, was recently put forward in (Cheng et al., 2009). Here, the idea is to develop label ranking methods on the basis of statistical models for ranking data, that is, parameterized (conditional) probability distributions on the class of all rankings. Given assumptions of that kind, the learning problem can be posed as a problem of maximum likelihood estimation (or, alternatively, as a problem of Bayesian inference) and thus be solved in a theoretically sound way. In (Cheng et al., 2009), the authors proposed to use the Mallows model and developed an instance-based (nearest neighbor) learning algorithm to estimate this model in a local way. In this paper, we propose the Plackett-Luce model as an alternative, especially since this model is more apt to learning from possibly incomplete label rankings. Moreover, apart from the estimation of locally constant models suitable for instance-based learning, we also develop a method for estimating generalized linear models.

The paper is organized as follows. Section 2 recalls the problem of label ranking in a more formal setting. In Section 3, we introduce a probability model that will be used for estimating predictive models for rankings. Sections 4 and 5 are devoted, respectively, to the instance-based and generalized linear method for label ranking. An experimental evaluation is presented in Section 6, and Section 7 concludes the paper.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

2. Label Ranking

In the setting of label ranking, each instance x from an instance space X is associated with a total order of all class labels, that is, a total, transitive, and asymmetric relation ≻x on Y, where yi ≻x yj indicates that yi precedes yj in the order. Since a ranking can be considered as a special type of preference relation, we shall also say that yi ≻x yj indicates that yi is preferred to yj given the instance x. Note that, in contrast to the classification scenario, there is no such thing as a "true class label" of an instance under this interpretation. However, depending on the type of application, other interpretations of a label ranking are possible. For example, within the setting of conventional classification, a ranking can be interpreted as a special type of qualitative probability on Y (Wellman, 1994). The order relation yi ≻x yj then indicates that the conditional probability of yi given x is higher than the probability of yj given x, without specifying any concrete numerical values. Given that good numerical estimates are hard to obtain, and sometimes not even needed for decision making, a qualitative representation of this kind is an interesting alternative to numerical distributions.

Formally, a total order ≻x can be identified with a permutation πx of the set {1, ..., M}. We define πx such that πx(i) is the index j of the class label yj put on the i-th position in the order (and hence πx⁻¹(j) = i is the position of the j-th label). This permutation thus encodes the (ground truth) order relation

  y_{πx(1)} ≻x y_{πx(2)} ≻x ... ≻x y_{πx(M)} .

The class of permutations of {1, ..., M} (the symmetric group of order M) is denoted by Ω. By abuse of terminology, though justified in light of the above one-to-one correspondence, we refer to elements of Ω as both permutations and rankings.

In analogy with the classification setting, we do not assume the existence of a deterministic X → Ω mapping. Instead, every instance is associated with a probability distribution over Ω. This means that, for each x ∈ X, there exists a probability distribution P(· | x) such that, for every π ∈ Ω, P(π | x) is the probability that πx = π. (Note that, if rankings are interpreted as qualitative probabilities, then P(· | x) is a probability over probability distributions, i.e., a second-order probability.)

The goal in label ranking is to learn a "label ranker" in the form of an X → Ω mapping. As training data, a label ranker uses a set of instances xn (n = 1, ..., N), together with information about the associated rankings πxn. Ideally, complete rankings are given as training information. From a practical point of view, however, it is important to allow for incomplete information in the form of a ranking

  y_{π(1)} ≻x y_{π(2)} ≻x ... ≻x y_{π(k)} ,   (1)

where k < M and {π(1), ..., π(k)} ⊂ {1, ..., M}. For example, for an instance x, it might be known that y2 ≻x y1 ≻x y5, while no preference information is given about the labels y3 or y4. By definition, we let π⁻¹(yi) = π⁻¹(i) = 0 if yi is not present in the ranking π; thus, the presence of a class yi is equivalent to π⁻¹(i) > 0.

To evaluate the predictive performance of a label ranker, a suitable loss function on Ω is needed. In the statistical literature, several distance measures for rankings have been proposed. For example, a commonly used measure is Kendall's tau coefficient, defined as

  ( C(π, σ) − D(π, σ) ) / ( M(M − 1)/2 ) ,   (2)

with C(π, σ) the number of concordant label pairs (i.e., pairs (i, j) ∈ {1, ..., M}² such that (π(i) − π(j)) · (σ(i) − σ(j)) > 0) and D(π, σ) the number of discordant pairs ((i, j) with (π(i) − π(j)) · (σ(i) − σ(j)) < 0). Actually, (2) is not a loss function but a correlation measure with values in [−1, +1] (it assumes the value 1 if σ = π and the value −1 if σ is the reversal of π).

3. Ranking Models

So far, no assumptions about the conditional probability measure P(· | x) on Ω were made, despite its existence. In statistics, different types of probability distributions on rankings have been proposed (Marden, 1995). A prominent example is the Mallows model (Mallows, 1957), a distance-based probability model belonging to the family of exponential distributions.
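Kendall's tau coefficient (2) is easy to compute directly from two rankings. The following sketch is our own illustration (not code from the paper); rankings are given as lists of label indices ordered from most to least preferred:

```python
from itertools import combinations

def kendall_tau(pi, sigma):
    """Kendall's tau in [-1, +1] between two complete rankings of the same labels.
    A ranking is a list of label indices from best to worst, e.g. [2, 1, 5]."""
    assert sorted(pi) == sorted(sigma), "rankings must cover the same labels"
    M = len(pi)
    # position of each label in the ranking (smaller = more preferred)
    pos_pi = {label: pos for pos, label in enumerate(pi)}
    pos_sigma = {label: pos for pos, label in enumerate(sigma)}
    concordant = discordant = 0
    for a, b in combinations(pi, 2):
        prod = (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    return (concordant - discordant) / (M * (M - 1) / 2)
```

As stated above, identical rankings yield 1 and reversed rankings yield −1.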
The standard Mallows model is determined by two parameters:

  P(π | θ, π0) = exp(−θ D(π, π0)) / φ(θ) .   (3)

The ranking π0 ∈ Ω is the location parameter (mode, center ranking) and θ ≥ 0 is a spread parameter. Moreover, D(·) is a distance measure on rankings, and the constant φ = φ(θ) is a normalization factor that depends on the spread (but, provided the right-invariance of D(·), not on π0). Obviously, the Mallows model assigns the maximum probability to the center ranking π0. The larger the distance D(π, π0), the smaller the probability of π becomes. The spread parameter θ determines how quickly the probability decreases, i.e., how peaked the distribution is around π0. For θ = 0, the uniform distribution is obtained, while for θ → ∞, the distribution converges to the one-point distribution that assigns probability 1 to π0 and 0 to all other rankings.

The Mallows model was used in (Cheng et al., 2009) in the context of an instance-based approach to label ranking. Notwithstanding some appealing properties, the Mallows model is arguably not ideal for handling incomplete training data, i.e., observations in the form of incomplete rankings (1). Roughly speaking, this is because the probability of such a ranking cannot be expressed in closed form. Instead, it has to be derived through marginalization:

  P(π | θ, π0) = Σ_{σ ∈ E(π)} P(σ | θ, π0) ,

where E(π) denotes the set of linear extensions of π (a permutation σ ∈ Ω is a linear extension of π if it ranks all labels occurring in π in the same order). Consequently, inference (maximum likelihood estimation of π0 and θ) is difficult for incomplete observations and becomes computationally complex. The normalization factor φ(θ) in (3) causes additional problems.

Another model that seems to be more appropriate from this point of view is the Plackett-Luce (PL) model, which is specified by a parameter vector v = (v1, v2, ..., vM) ∈ R^M_+ :

  P(π | v) = ∏_{i=1}^{M} v_{π(i)} / ( v_{π(i)} + v_{π(i+1)} + ... + v_{π(M)} ) .   (4)

This model is a generalization of the well-known Bradley-Terry model, a model for the pairwise comparison of alternatives, which specifies the probability that "a wins against b" in terms of

  P(a ≻ b) = v_a / (v_a + v_b) .

Obviously, the larger v_a in comparison to v_b, the higher the probability that a is chosen. Likewise, the larger the parameter v_i in (4) in comparison to the parameters v_j, j ≠ i, the higher the probability that the label yi appears on a top rank. An intuitively appealing explanation of the PL model can be given in terms of a vase model: If v_i corresponds to the relative frequency of the i-th label in a vase filled with labeled balls, then P(π | v) is the probability to produce the ranking π by randomly drawing balls from the vase in a sequential way and putting the label drawn in the k-th trial on position k (unless the label was already chosen before, in which case the trial is annulled).

For the PL model, one easily verifies that the probability of an incomplete ranking (1) is given by

  P(π | v) = ∏_{i=1}^{k} v_{π(i)} / ( v_{π(i)} + v_{π(i+1)} + ... + v_{π(k)} ) ,

i.e., by an expression of exactly the same form, except that the number of factors is k (the number of labels observed) instead of M. In a different though related context, the use of the PL model for machine learning was recently motivated in (Guiver & Snelson, 2009).

4. Instance-Based Label Ranking

In this section, we propose an instance-based approach to label ranking, i.e., a local prediction method based on the nearest neighbor estimation principle. Consider a query instance x ∈ X and let x1, ..., xK denote the nearest neighbors of x (according to an underlying distance measure on X) in the training set, where K ∈ N is a fixed integer. Each neighbor xi (i = 1, ..., K) is associated with a possibly incomplete ranking σi of the labels y ∈ Y. We denote by Mi ∈ {2, ..., M} the number of labels ranked by σi. Moreover, recall that σi(m) denotes the index of the label ranked on position m.

In analogy to the conventional settings of classification and regression, in which the nearest neighbor estimation principle has been applied for a long time, we assume that the probability distribution P(· | x) on Ω is (at least approximately) locally constant around the query x. By furthermore assuming that the rankings σi have been produced independently of each other by the PL model (4), the probability to observe the rankings σ = {σ1, ..., σK} in the neighborhood, given the parameters v = (v1, ..., vM), becomes

  P(σ | v) = ∏_{i=1}^{K} ∏_{m=1}^{Mi} v_{σi(m)} / ( Σ_{j=m}^{Mi} v_{σi(j)} ) .   (5)

The maximum likelihood estimation (MLE) of v is then given by those parameters that maximize this probability or, equivalently, the log-likelihood function

  L(v) = Σ_{i=1}^{K} Σ_{m=1}^{Mi} [ log v_{σi(m)} − log Σ_{j=m}^{Mi} v_{σi(j)} ] .

4.1. Maximum Likelihood Estimation

Finding the MLE parameters of the PL model is a problem that has already been considered in the statistical literature. We resort to an algorithm called MM, which is short for Minorization and Maximization; it seems to perform especially well for this problem (Hunter, 2004). It is an iterative algorithm whose idea is to maximize, in each iteration, a function that minorizes the original log-likelihood, namely

  Q_k(v) = Σ_{i=1}^{K} Σ_{m=1}^{Mi} [ log v_{σi(m)} − ( Σ_{j=m}^{Mi} v_{σi(j)} ) / ( Σ_{j=m}^{Mi} v^(k)_{σi(j)} ) ] .

Here, v^(k) = (v^(k)_1, ..., v^(k)_M) is the estimation of the PL parameters in the k-th iteration. Considering these values as fixed, the problem to maximize Q_k(·) as a function of v can be solved analytically. The corresponding solution, i.e., the parameter vector v* for which Q_k(·) is maximal, is then used as a new solution: v^(k+1) = v*. This procedure provably converges to an MLE estimation of the PL parameters.

4.2. Prediction

Given the MLE v̂, a prediction of the ranking associated with x can be derived from the distribution P(· | v̂) on Ω. In particular, a MAP estimate, i.e., a ranking with the highest posterior probability, is given by

  π̂ = argmax_{π ∈ Ω} P(π | v̂) .   (6)

A ranking of this kind can easily be produced by sorting the labels yi in decreasing order according to the respective parameters v̂i, i.e., such that

  v̂_{π̂(i)} ≥ v̂_{π̂(j)}   (7)

for all 1 ≤ i < j ≤ M. More generally, given a loss function ℓ(·) to be minimized, the best prediction is

  π̂ = argmin_{π ∈ Ω} Σ_{σ ∈ Ω} ℓ(π, σ) · P(σ | v̂) .   (8)

In general, an interesting question concerns the complexity of the minimization problem (8). An explicit computation of the expected loss for each ranking is feasible only for small label sets Y, since the cardinality of Ω, which is given by |Ω| = |Y|! = M!, grows very fast. However, depending on the loss function ℓ(·) and the probability distribution P(· | v̂), an explicit enumeration of this type can often be avoided. The PL model appears to be especially appealing from this point of view. In fact, due to the special structure of the probability distribution (4), a ranking of the form (7) is not only the most intuitive prediction, but also provably optimal for virtually all common loss functions on rankings. Without going into technical details here, we only mention that, in particular, it is a risk minimizer for the 0/1 loss function (defined by ℓ(π̂, π) = 0 if π̂ = π and ℓ(π̂, π) = 1 if π̂ ≠ π) and, likewise, a maximizer of the expected rank correlation in terms of (2).

In contrast to other methods (including most reduction techniques) that simply produce a prediction in terms of a ranking, a probabilistic approach to label ranking allows one to complement predictions by diverse types of statistical information, for example regarding the reliability of a prediction. Besides, the distribution P(· | v̂) supports various types of generalized predictions, such as credible sets of rankings covering the true one with a high probability.

5. Generalized Linear Models

The learning method proposed in the previous section is local (and lazy) in the sense that an individual PL model, i.e., an individual parameter vector v = (v1, ..., vM), is estimated for each query instance x ∈ X. In this section, we consider the estimation of a global model as an alternative. To this end, we model the parameters vm, quantifying the propensity for the m-th label ym, i.e., the tendency to put this label on a high rank, as a function of the attributes describing an instance. More precisely, to guarantee the non-negativity of the parameters, we model their logarithm as a linear function:

  vm = exp( Σ_{d=1}^{D} α_d^(m) · x_d ) ,   (9)

where we assume an instance to be represented in terms of a feature vector x = (x1, ..., xD) ∈ X = R^D. The model parameters to be estimated are now the α_d^(m) (1 ≤ m ≤ M, 1 ≤ d ≤ D). Given a training data set T = { (x^(n), π^(n)) }_{n=1}^{N} with x^(n) = (x_1^(n), ..., x_D^(n)), the log-likelihood function is given by

  L = Σ_{n=1}^{N} Σ_{m=1}^{Mn} [ log v(π^(n)(m), n) − log Σ_{j=m}^{Mn} v(π^(n)(j), n) ] ,

where Mn is the number of labels in the ranking π^(n), and

  v(m, n) = exp( Σ_{d=1}^{D} α_d^(m) · x_d^(n) ) .

The first derivatives of L(·) are given by

  ∂L/∂α_k^(a) = Σ_{n=1}^{N} δ(a, n, 1) · x_k^(n) − Σ_{n=1}^{N} Σ_{m=1}^{Mn} δ(a, n, m) · v(a, n) · x_k^(n) / Σ_{j=m}^{Mn} v(π^(n)(j), n) ,

where δ(a, n, m) = 1 if (π^(n))⁻¹(a) ≥ m, and δ(a, n, m) = 0 otherwise. Moreover, the second derivatives (for a ≠ b and k ≠ ℓ) are as follows:

  ∂²L/∂(α_k^(a))² = − Σ_{n=1}^{N} Σ_{m=1}^{Mn} δ(a, n, m) · v(a, n) · (x_k^(n))² · [ Σ_{j=m}^{Mn} v(π^(n)(j), n) − v(a, n) ] / ( Σ_{j=m}^{Mn} v(π^(n)(j), n) )² ,

  ∂²L/∂α_k^(a) ∂α_ℓ^(a) = − Σ_{n=1}^{N} Σ_{m=1}^{Mn} δ(a, n, m) · v(a, n) · x_k^(n) · x_ℓ^(n) · [ Σ_{j=m}^{Mn} v(π^(n)(j), n) − v(a, n) ] / ( Σ_{j=m}^{Mn} v(π^(n)(j), n) )² ,

  ∂²L/∂α_k^(a) ∂α_ℓ^(b) = Σ_{n=1}^{N} Σ_{m=1}^{Mn} δ(a, n, m) · δ(b, n, m) · v(a, n) · v(b, n) · x_k^(n) · x_ℓ^(n) / ( Σ_{j=m}^{Mn} v(π^(n)(j), n) )² .

Note that ∂²L/∂(α_k^(a))² ≤ 0 for all 1 ≤ a ≤ M and 1 ≤ k ≤ D. Based on these derivatives, the maximization of the log-likelihood can be accomplished by means of gradient-based optimization methods. In our implementation, we use a standard stochastic gradient descent algorithm (Bottou, 2004) that, in terms of efficiency, compared quite favorably with other gradient-based methods.

6. Experimental Evaluation

In this section, we present an empirical evaluation of our instance-based (IB-PL) and generalized linear (Lin-PL) approaches to label ranking using the PL model. For comparison, we include two other methods, namely the aforementioned instance-based approach using the Mallows model (IB-Mal) and so-called log-linear models for label ranking (Lin-LL) as a representative of the class of linear models. To guarantee a fair comparison, we used the Euclidean distance (after normalizing the attributes) as a distance measure on the instance space for both IB-PL and IB-Mal (and disabled distance weighting in IB-Mal). The neighborhood size K ∈ {5, 10, 15, 20} was selected through cross-validation on the training set.

Log-linear models for label ranking have been proposed in (Dekel et al., 2004). In this approach, utility functions fi(·) are expressed as linear combinations of so-called base ranking functions (which map instance/label pairs to real numbers). As a special case, this includes functions of the form fi(x) = Σ_{d=1}^{D} α_d x_d, which should be specified so as to minimize the number of ranking errors. A ranking error for an instance x occurs if fi(x) < fj(x) even though the i-th label should precede the j-th label, so the total number of errors on x is

  Σ_{1 ≤ i < j ≤ M} [ 1 if f_{π(i)}(x) < f_{π(j)}(x), 0 if f_{π(i)}(x) ≥ f_{π(j)}(x) ] .

Since minimizing this error (or, more precisely, the sum of this error over all training instances) directly is intractable, the authors propose to minimize a smooth, convex upper bound:

  log ( 1 + Σ_{1 ≤ i < j ≤ M} exp( f_{π(j)}(x) − f_{π(i)}(x) ) ) .

Algorithmically, this optimization problem is approached by means of a boosting-based algorithm that works in an iterative way. In (Hüllermeier et al., 2008), it was shown that this approach is quite comparable, in terms of predictive accuracy, to other state-of-the-art methods for label ranking.

6.1. Data

Since benchmark data for the label ranking problem is still not available, we resorted to multi-class and regression data sets from the UCI repository and the Statlog collection and turned them into label ranking data in two different ways. (A) For classification data, we followed the procedure proposed in (Hüllermeier et al., 2008): A naive Bayes classifier is first trained on the complete data set. Then, for each example, all the labels present in the data set are ordered with respect to the predicted class probabilities (in the case of ties, labels with lower index are ranked first). This way of generating label ranking data is in line with the interpretation of rankings in terms of qualitative probabilities, as discussed in Section 2.
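Before turning to the second data generation scheme, the PL computations at the heart of Sections 3 and 4 are easy to put in executable form. The sketch below is our own illustration (0-based label indices; not the authors' code): it evaluates the PL probability (4), which for an incomplete ranking simply has fewer factors, and runs the MM iteration of Hunter (2004), whose per-iteration maximizer of Q_k has the closed form "appearance counts divided by accumulated inverse stage sums":

```python
import numpy as np

def pl_prob(ranking, v):
    """PL probability (4) of a possibly incomplete ranking.
    `ranking` lists label indices from best to worst; `v` holds all PL parameters."""
    p = 1.0
    for m in range(len(ranking)):
        p *= v[ranking[m]] / sum(v[j] for j in ranking[m:])
    return p

def mm_estimate(rankings, M, n_iter=100):
    """MM algorithm for the PL maximum likelihood estimate from a set of
    (possibly incomplete) rankings over M labels."""
    v = np.ones(M)
    for _ in range(n_iter):
        num = np.zeros(M)  # how often each label is the one "chosen" at a stage
        den = np.zeros(M)  # sum of 1/S_{i,m} over stages where the label is still available
        for r in rankings:
            for m in range(len(r)):
                s = sum(v[j] for j in r[m:])
                num[r[m]] += 1.0
                for j in r[m:]:
                    den[j] += 1.0 / s
        seen = den > 0       # labels never observed keep their current value
        v[seen] = num[seen] / den[seen]
        v /= v.sum()         # PL is scale-invariant, so fix the scale
    return v
```

Since the PL model is invariant to rescaling of v, the estimate is normalized to sum to 1 after each iteration; this changes nothing about the induced distribution.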
(B) For regression data, a certain number of (numerical) attributes is removed from the set of predictors, and each one is considered as a label. To obtain a ranking, the attributes are standardized and then ordered by size. Given that the original attributes are correlated, the remaining predictive features will contain information about the ranking thus produced. Yet, as will be confirmed by the experimental results, this second type of data generation leads to more difficult learning problems. A summary of the data sets and their properties is given in Table 1. (The data sets, along with a description, are available at http://www.uni-marburg.de/fb12/kebi/research/.)

Table 1. Data sets and their properties (the type refers to the way in which the data has been generated).

data set     type  # inst.  # attr.  # labels
authorship    A       841      70        4
bodyfat       B       252       7        7
calhousing    B     20640       4        4
cpu-small     B      8192       6        5
elevators     B     16599       9        9
fried         B     40769       9        5
glass         A       214       9        6
housing       B       506       6        6
iris          A       150       4        3
pendigits     A     10992      16       10
segment       A      2310      18        7
stock         B       950       5        5
vehicle       A       846      18        4
vowel         A       528      10       11
wine          A       178      13        3
wisconsin     B       194      16       16

6.2. Experiments and Results

Results were derived in terms of Kendall's tau coefficient from five repetitions of a ten-fold cross-validation. To model incomplete observations, we modified the training data as follows: A biased coin was flipped for every label in a ranking to decide whether to keep or delete that label; the probability for a deletion is specified by a parameter p ∈ [0, 1]. Hence, p × 100% of the labels will be missing on average. The summary of the results is shown in Table 2 (performance of the label ranking methods in terms of Kendall's tau, with the rank in brackets).

To analyze these results, we followed the two-step procedure recommended in (Demsar, 2006), consisting of a Friedman test of the null hypothesis that all learners have equal performance and, in case this hypothesis is rejected, a Nemenyi test to compare learners in a pairwise way. Both tests are based on the average ranks (for each problem, the methods are ranked in decreasing order of performance, and the ranks thus obtained are averaged over the problems) as shown in the bottom line in Table 2.
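The biased-coin deletion protocol used to model incomplete observations can be sketched as follows (a minimal illustration of our own; the function name is ours):

```python
import random

def delete_labels(ranking, p, rng=random):
    """Keep each label of the ranking independently with probability 1 - p.
    The relative order of the surviving labels is preserved, so the result
    is an incomplete ranking in the sense of Eq. (1)."""
    return [label for label in ranking if rng.random() >= p]
```

With p = 0 the ranking is returned unchanged, and with p = 1 all label information is removed; on average, p × 100% of the labels are missing.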
Table 2. Performance of the label ranking methods in terms of Kendall's tau (in brackets the rank).

              complete ranking                      30% missing labels                    60% missing labels
            IB-PL    IB-Mal   Lin-PL   Lin-LL     IB-PL    IB-Mal   Lin-PL   Lin-LL     IB-PL    IB-Mal   Lin-PL   Lin-LL
authorship  .936(1)  .936(2)  .930(3)  .657(4)    .927(1)  .913(2)  .899(3)  .656(4)    .886(1)  .849(2)  .846(3)  .650(4)
bodyfat     .230(3)  .229(4)  .272(1)  .266(2)    .204(3)  .198(4)  .266(1)  .251(2)    .151(4)  .160(3)  .222(2)  .241(1)
calhousing  .326(2)  .344(1)  .220(4)  .223(3)    .303(2)  .310(1)  .229(3)  .223(4)    .259(2)  .263(1)  .229(3)  .221(4)
cpu-small   .495(2)  .496(1)  .426(3)  .419(4)    .477(1)  .473(2)  .418(4)  .419(3)    .437(1)  .428(2)  .412(4)  .418(3)
elevators   .721(2)  .727(1)  .712(3)  .701(4)    .702(2)  .683(4)  .706(1)  .699(3)    .633(3)  .596(4)  .704(1)  .696(2)
fried       .894(4)  .900(3)  .996(1)  .989(2)    .861(3)  .850(4)  .993(1)  .989(2)    .797(3)  .777(4)  .990(1)  .987(2)
glass       .841(2)  .842(1)  .825(3)  .818(4)    .809(3)  .776(4)  .825(1)  .817(2)    .675(3)  .611(4)  .807(2)  .808(1)
housing     .711(2)  .736(1)  .659(3)  .626(4)    .654(3)  .669(1)  .658(2)  .625(4)    .492(4)  .543(3)  .636(1)  .614(2)
iris        .960(1)  .925(2)  .832(3)  .818(4)    .926(1)  .867(2)  .823(3)  .804(4)    .868(1)  .799(2)  .778(3)  .768(4)
pendigits   .939(2)  .941(1)  .909(3)  .814(4)    .918(1)  .902(3)  .909(2)  .802(4)    .794(2)  .781(4)  .907(1)  .787(3)
segment     .950(1)  .802(4)  .902(2)  .810(3)    .874(2)  .735(4)  .895(1)  .806(3)    .674(3)  .612(4)  .888(1)  .801(2)
stock       .922(2)  .925(1)  .710(3)  .696(4)    .877(1)  .855(2)  .701(3)  .691(4)    .740(1)  .724(2)  .687(4)  .689(3)
vehicle     .859(1)  .855(2)  .838(3)  .770(4)    .838(1)  .822(2)  .817(3)  .769(4)    .765(2)  .736(4)  .804(1)  .764(3)
vowel       .851(2)  .882(1)  .586(4)  .601(3)    .785(2)  .810(1)  .581(4)  .598(3)    .588(3)  .638(1)  .575(4)  .591(2)
wine        .947(2)  .944(3)  .954(1)  .942(4)    .926(4)  .930(3)  .931(2)  .941(1)    .907(2)  .893(4)  .915(1)  .894(3)
wisconsin   .479(4)  .501(3)  .635(1)  .542(2)    .453(4)  .464(3)  .615(1)  .533(2)    .381(4)  .399(3)  .585(1)  .518(2)
Avg. Rank    2.06     1.94     2.56     3.44       2.13     2.63     2.19     3.06       2.44     2.94     2.06     2.56

At a significance level of 5%, IB-PL and IB-Mal are better than Lin-LL in the case of complete rankings, whereas the Friedman test does not discover significant differences in the case of 30% and 60% missing labels.

Table 3. Pairwise comparisons of the methods in terms of win statistics: wins of the column method against the row method in the complete ranking scenario, the 30% and the 60% missing label scenario.

           IB-PL     IB-Mal    Lin-PL     Lin-LL
IB-PL        --      10/5/5    4/8/9      3/5/7
IB-Mal     6/11/11     --      5/8/9      4/7/9
Lin-PL     12/8/7    11/8/7      --       2/4/5
Lin-LL     13/11/9   12/9/7   14/13/11      --

Despite being statistically non-significant most of the time, the results are still quite informative and show some important trends (which are likely to become significant when increasing the number of data sets). This becomes especially obvious from the pairwise comparisons of the methods, summarized in Table 3. From these comparisons, the following conclusions can be drawn:

· Regarding the two instance-based learners, IB-PL performs a bit worse in the complete ranking scenario, but is better in the case of missing label information. This is in perfect agreement with our conjecture that the PL model is better suited for learning from incomplete ranking data.

· Comparing the two generalized linear approaches, our method based on the PL model seems to be consistently better than Lin-LL (winning 14, 13 and 11 of the 16 data sets in the three scenarios, respectively).

· Comparing the instance-based with the linear methods, it can be seen that the former perform a bit better in the complete ranking scenario, but their performance drops more quickly in the presence of missing label information.

This last observation is plausible, too, and coherent with the complementary nature of global and local methods. Like in the case of conventional classification, instance-based methods are advantageous for problems requiring complex decision boundaries, for which the strong bias of linear methods prevents them from achieving a good separation. On the other hand, if the linearity assumption is (at least approximately) valid, better models can be learned with fewer data. Correspondingly, instance-based learners are more sensitive toward the amount of training data. Some evidence in favor of this hypothesis is indeed provided by the learning curves depicting the performance as a function of the fraction of missing label information. While the learning curves of the linear methods are often rather flat, showing a kind of saturation effect, they are steeper for the instance-based approaches. This suggests that additional label information is still beneficial for these methods even when the linear methods, due to a lack of flexibility, are no longer able to exploit and adapt to extra data. Typical examples for the glass and the housing data are shown in Fig. 1.

Figure 1. Ranking performance (in terms of Kendall's tau) as a function of the missing label rate (curves for IB-PL and Lin-PL on the glass and housing data, missing label rates from 0% to 70%).

As a nice feature of our approach, we already mentioned the possibility to complement a prediction by a measure of reliability. Perhaps the simplest measure of this kind is the probability of the prediction itself, namely p* = P(π̂ | v̂). To test whether p* is indeed a good indicator of the uncertainty of a prediction, we used it to compute a kind of accuracy-rejection curve: Using IB-PL in a leave-one-out cross-validation, we computed the accuracy of the prediction (in terms of Kendall's tau) and its reliability (in terms of p*) for each instance x. Subsequent to sorting the instances in decreasing order of reliability, we plot the function t → f(t), where f(t) is the mean accuracy of the top t percent of the instances. Given that p* is indeed a good indicator of reliability, this curve should be decreasing, because the higher t, the more instances with a low reliability are taken into consideration. This expectation is indeed confirmed for our data sets. Fig. 2 shows two exemplary curves for the glass and the housing data.

Figure 2. Accuracy-rejection curves computed on the basis of P(π̂ | v̂) (curves for the glass and housing data).

7. Conclusions and Future Work

Using the Plackett-Luce model as a model of the underlying data generating process, we proposed new methods for the problem of label ranking. The idea of our first approach, an instance-based learning algorithm, is to fit a locally constant model in the neighborhood of a query instance. The same idea was already proposed earlier, using the Mallows model instead of the PL model. As we have seen, the performance of the approaches is quite comparable. However, the PL model seems preferable in the case of incomplete training data, not only computationally but also regarding performance.

The idea of parameterizing the coefficients of the PL model and expressing them as functions of the input attributes has led to the second approach, namely fitting a global model in the form of log-linear (utility) functions. Empirically, we have shown that it compares favorably with other methods for label ranking, which are closely related in the sense of fitting the same type of model. Perhaps even more importantly, however, we consider our approach as more solid from a theoretical point of view. In fact, while existing methods are fitting models based on criteria that are to some extent ad hoc, our probabilistic model provides the basis for a theoretically sound prediction procedure in the form of maximum likelihood estimation. Apart from making model assumptions more explicit, it also has further advantages. For example, it allows for complementing predictions by diverse types of statistical information, for example regarding the reliability of an estimation.

For future work, we plan to combine the two methods presented in this paper, the local and the global one, into a local linear learning method. Similar to methods like local linear regression, the idea is to estimate a (generalized) linear model in a local way, i.e., in the neighborhood of a query instance. From the point of view of our instance-based approach, this means replacing the assumption of a locally constant model by the relaxed assumption of an approximately linear model. Thus, we hope to combine the advantages of both approaches.

Acknowledgments

The authors gratefully acknowledge financial support by the German Research Foundation (DFG).

References

Bottou, L. Stochastic learning. In Bousquet, O. and von Luxburg, U. (eds.), Advanced Lectures on Machine Learning, pp. 146-168. Springer-Verlag, Berlin, 2004.

Cheng, W., Hühn, J., and Hüllermeier, E. Decision tree and instance-based learning for label ranking. In Proc. ICML-2009, Montreal, Canada, 2009.

Dekel, O., Manning, C.D., and Singer, Y. Log-linear models for label ranking. In Thrun, S., Saul, L.K., and Schölkopf, B. (eds.), Advances in Neural Information Processing Systems (NIPS-2003). MIT Press, 2004.

Demsar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.

Guiver, J. and Snelson, E. Bayesian inference for Plackett-Luce ranking models. In Proc. ICML-2009, Montreal, Canada, 2009.

Har-Peled, S., Roth, D., and Zimak, D. Constraint classification for multiclass classification and ranking. In Becker, S., Thrun, S., and Obermayer, K. (eds.), Advances in Neural Information Processing Systems 15, pp. 785-792. MIT Press, 2003.

Hüllermeier, E. and Fürnkranz, J. On loss functions in label ranking and risk minimization by pairwise learning. Journal of Computer and System Sciences, 76(1):49-62, 2010.

Hüllermeier, E., Fürnkranz, J., Cheng, W., and Brinker, K. Label ranking by learning pairwise preferences. Artificial Intelligence, 172:1897-1917, 2008.

Hunter, D.R. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384-406, 2004.

Mallows, C. Non-null ranking models. Biometrika, 44(1):114-130, 1957.

Marden, J. Analyzing and Modeling Rank Data. CRC Press, 1995.

Vembu, S. and Gärtner, T. Label ranking: a survey. In Fürnkranz, J. and Hüllermeier, E. (eds.), Preference Learning. Springer-Verlag, 2010.

Wellman, M.P. Some varieties of qualitative probability. In Proc. IPMU-94, pp. 437-442, Paris, 1994.
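As a closing illustration (our own sketch, not part of the paper): the accuracy-rejection curve used in the reliability analysis of Section 6.2 can be computed from per-instance accuracies and reliability scores p* (assumed to come from a leave-one-out run) as follows:

```python
def accuracy_rejection_curve(accuracies, reliabilities, steps=10):
    """f(t) = mean accuracy of the top-t fraction of instances, where
    instances are sorted by decreasing reliability p* = P(pi_hat | v_hat)."""
    order = sorted(range(len(accuracies)), key=lambda i: -reliabilities[i])
    curve = []
    for s in range(1, steps + 1):
        top = order[: max(1, round(s / steps * len(order)))]
        curve.append(sum(accuracies[i] for i in top) / len(top))
    return curve
```

If p* is a good indicator of reliability, the returned curve is (roughly) decreasing, as observed for the glass and housing data in Fig. 2.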