Unsupervised Rank Aggregation with Distance-Based Models

Alexandre Klementiev (klementi@uiuc.edu), Dan Roth (danr@uiuc.edu), Kevin Small (ksmall@uiuc.edu)
University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, IL 61801 USA

Abstract

The need to meaningfully combine sets of rankings often comes up when one deals with ranked data. Although a number of heuristic and supervised learning approaches to rank aggregation exist, they require domain knowledge or supervised ranked data, both of which are expensive to acquire. In order to address these limitations, we propose a mathematical and algorithmic framework for learning to aggregate (partial) rankings without supervision. We instantiate the framework for the cases of combining permutations and combining top-k lists, and propose a novel metric for the latter. Experiments in both scenarios demonstrate the effectiveness of the proposed formalism.

1. Introduction

Consider the scenario where each member of a panel of judges independently generates a (partial) ranking over a set of items while attempting to reproduce a true underlying ranking according to their level of expertise. This setting motivates a fundamental machine learning and information retrieval (IR) problem: the need to meaningfully aggregate preference rankings into a joint ranking. The IR community refers to this as data fusion, where a joint ranking is derived from the outputs of multiple retrieval systems.
For example, in meta-search the aim is to aggregate Web search query results from several engines into a more accurate ranking. In many natural language processing applications, such as machine translation, there has been increased interest in combining the results of multiple systems built on different principles in an effort to improve performance (Rosti et al., 2007).

One impediment to solving rank aggregation tasks is the high cost associated with acquiring full or partial preference information, making supervised approaches of limited utility. For data fusion, efforts to overcome this difficulty include applying domain-specific heuristics (Shaw & Fox, 1994) or collecting such preference information indirectly, e.g. using clickthrough data (Joachims, 2002). In order to address this limitation, we propose a general unsupervised learning framework for (partial) rank aggregation.

Analyzing ranked data is an extensively studied problem in the statistics, information retrieval, and machine learning literature. Mallows (1957) introduced a distance-based model for fully ranked data and investigated its use with Kendall's and Spearman's metrics. The model was later generalized to other distance functions and to partially ranked data (Critchlow, 1985). Lebanon and Lafferty (2002) proposed a multi-parameter extension in which multiple modal rankings (e.g. expert opinions) are available, and used their formalism for supervised ensemble learning; they also analyzed their model for partially ranked data (Lebanon & Lafferty, 2003). The first key contribution of our work is the derivation of an EM-based algorithm for learning the parameters of the extended Mallows model without supervision. We instantiate the model with appropriate distance functions for two important scenarios: combining permutations and combining top-k lists.

In the context of defining distances between rankings, various metrics have been proposed and analyzed (Critchlow, 1985; Estivill-Castro et al., 1993). Distances over top-k lists, i.e. rankings over the k most preferable objects, receive particular attention in the IR community (Fagin et al., 2003). Fligner and Verducci (1986) show that a class of distance functions between full rankings, including Kendall's and Cayley's metrics, decomposes into a sum of independent components, allowing for efficient parameter estimation of the standard Mallows model. A second key contribution of our work is the derivation of a novel decomposable distance function for top-k lists. We show it to be a generalization of the Kendall metric and demonstrate that it can be decomposed, enabling us to estimate the parameters of the extended Mallows model efficiently.

Among recent work, Busse et al. (2007) propose a method for clustering heterogeneous rank data based on the standard Mallows model. More directly related, many heuristics as well as a number of supervised learning approaches (Liu et al., 2007) exist for rank aggregation, although few learn to combine rankings without any supervision. Klementiev et al. (2007) frame unsupervised rank aggregation as an optimization problem specifically for top-k lists which relies on user-tuned parameters, a form of implicit supervision, whereas we describe a general unsupervised framework that can be instantiated for top-k lists in addition to other settings.

The remainder of the paper is organized as follows: Section 2 formalizes distance-based ranking models and introduces the relevant notation. Section 3 derives our EM-based algorithm for learning the model parameters and specifies the requirements for efficient learning and inference. Section 4 instantiates the framework for two common scenarios: permutations (full rankings) and top-k lists. Section 5 experimentally demonstrates the model's effectiveness in both cases. Finally, Section 6 concludes the work and gives ideas for future directions.

2. Distance-Based Ranking Models

2.1. Notation and Definitions

Let $\{x_1, \dots, x_n\}$ be a set of objects to be ranked, i.e. assigned rank-positions $1, \dots, n$, by a judge. We denote the resulting permutation $\pi = (\pi(1), \dots, \pi(n))$, where $\pi(i)$ is the rank assigned to object $x_i$. Correspondingly, we use $\pi^{-1}(j)$ to denote the index of the object assigned to rank $j$. Let $S_n$ be the set of all $n!$ permutations over $n$ items, and let $d: S_n \times S_n \to \mathbb{R}$ be a distance function between two permutations. We will require $d(\cdot,\cdot)$ to be a right-invariant metric (Diaconis & Graham, 1977): in addition to the usual properties of a metric, we also require that the value of $d(\cdot,\cdot)$ does not depend on how the set of objects is indexed. In other words, $d(\pi\tau, \sigma\tau) = d(\pi, \sigma)$ for all $\pi, \sigma, \tau \in S_n$, where $\pi\tau$ is defined by $\pi\tau(i) = \pi(\tau(i))$. In particular, note that $d(\pi, \sigma) = d(\pi\pi^{-1}, \sigma\pi^{-1}) = d(e, \sigma\pi^{-1})$, where $e = (1, \dots, n)$ is the identity permutation. That is, the value of $d$ does not change if we re-index the objects so that one of the permutations becomes $e$ and the other becomes $\gamma = \sigma\pi^{-1}$. Borrowing notation from Fligner and Verducci (1986), we abbreviate $d(e, \gamma)$ as $D(\gamma)$. In a later section, when rankings are treated as random variables, we may treat $D(\gamma) = D$ as a random variable as well: whether it refers to the distance function or the random variable will be clear from the context.
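As a concrete illustration of right invariance and the single-argument reduction $d(\pi, \sigma) = D(\sigma\pi^{-1})$, the following minimal Python sketch uses Kendall's tau distance (introduced formally in Section 4.1) on a toy example; the 0-based rank convention and all helper names are ours, not the paper's.

```python
import itertools

def kendall_tau(pi, sigma):
    """Kendall's tau: number of object pairs ordered differently by pi and sigma.
    Both arguments map object index -> rank (0-based here for convenience)."""
    n = len(pi)
    return sum(1 for i, j in itertools.combinations(range(n), 2)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

def compose(p, q):
    """(p . q)(i) = p(q(i))."""
    return [p[q[i]] for i in range(len(p))]

def inverse(p):
    inv = [0] * len(p)
    for obj, rank in enumerate(p):
        inv[rank] = obj
    return inv

pi    = [2, 0, 3, 1]          # object x_i gets rank pi[i]
sigma = [1, 0, 3, 2]
e     = list(range(4))        # identity permutation

# Right invariance: relabeling objects by any tau leaves the distance unchanged.
for tau in itertools.permutations(range(4)):
    assert kendall_tau(compose(pi, tau), compose(sigma, tau)) == kendall_tau(pi, sigma)

# Reduction to a single argument: d(pi, sigma) = d(e, sigma . pi^{-1}) = D(gamma).
gamma = compose(sigma, inverse(pi))
assert kendall_tau(pi, sigma) == kendall_tau(e, gamma)
```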
2.2. Mallows Models

While a large body of work on ranking models exists in the statistics literature, of particular interest to us are the distance-based conditional models first introduced by Mallows (1957). Let us briefly review the formalism and elucidate some of its properties relevant to our work. The model generates a judge's ranking $\pi$ according to:

$$p(\pi \mid \theta, \sigma) = \frac{1}{Z(\theta, \sigma)} \exp\big(\theta\, d(\pi, \sigma)\big) \qquad (1)$$

where $Z(\theta, \sigma) = \sum_{\pi \in S_n} \exp(\theta\, d(\pi, \sigma))$ is a normalizing constant. The parameters of the model are $\theta \in \mathbb{R}$, $\theta \le 0$, and $\sigma \in S_n$, referred to as the dispersion and the location parameters, respectively. The distribution's single mode is the modal ranking $\sigma$; the probability of a ranking decreases exponentially with its distance from $\sigma$. When $\theta = 0$ the distribution is uniform, and it becomes more concentrated at $\sigma$ as $\theta$ decreases. One property of (1) is that the normalizing constant $Z(\theta, \sigma)$ does not depend on $\sigma$, due to the right invariance of the distance function:

$$Z(\theta, \sigma) = Z(\theta) \qquad (2)$$

Let us denote the moment generating function of $D$ under (1) as $M_{D,\theta}(t)$, and as $M_{D,0}(t)$ under the uniform distribution ($\theta = 0$). Since (1) is an exponential family,

$$M_{D,\theta}(t) = \frac{M_{D,0}(t + \theta)}{M_{D,0}(\theta)}.$$

Therefore,

$$E_\theta(D) = \frac{1}{M_{D,0}(\theta)} \left. \frac{d\, M_{D,0}(t+\theta)}{dt} \right|_{t=0} = \left. \frac{d \ln M_{D,0}(t)}{dt} \right|_{t=\theta} \qquad (3)$$

Fligner and Verducci (1986) note that if a distance function can be expressed as $D(\gamma) = \sum_{i=1}^{m} V_i(\gamma)$, where the $V_i(\gamma)$ are independent (with $\gamma$ uniformly distributed) with m.g.f. $M_i(t)$, then $M_{D,0}(t) = \prod_{i=1}^{m} M_i(t)$. Consequently, (3) gives:

$$E_\theta(D) = \sum_{i=1}^{m} \left. \frac{d \ln M_i(t)}{dt} \right|_{t=\theta} \qquad (4)$$

We will call such distance functions decomposable and will use (4) in Section 4 in order to estimate $\theta$ efficiently.
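The following brute-force sketch checks properties (2) and (3) numerically on a toy instance of the Mallows model with Kendall's tau distance. It is purely illustrative (exhaustive enumeration over $S_n$ is only feasible for tiny $n$), and the parameter values are arbitrary.

```python
import itertools, math

def kendall_tau(pi, sigma):
    return sum(1 for i, j in itertools.combinations(range(len(pi)), 2)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

n, theta = 4, -0.7
S_n = [list(p) for p in itertools.permutations(range(n))]

# (2): the normalizer is the same for every location parameter sigma.
Z = {tuple(s): sum(math.exp(theta * kendall_tau(pi, s)) for pi in S_n) for s in S_n}
assert max(Z.values()) - min(Z.values()) < 1e-9

# (3): the derivative of the log-m.g.f. of D under the uniform distribution,
# taken numerically at t = theta, matches the direct expectation E_theta(D).
def log_mgf_0(t):
    return math.log(sum(math.exp(t * kendall_tau(pi, S_n[0])) for pi in S_n) / len(S_n))

eps = 1e-5
deriv = (log_mgf_0(theta + eps) - log_mgf_0(theta - eps)) / (2 * eps)
direct = (sum(kendall_tau(pi, S_n[0]) * math.exp(theta * kendall_tau(pi, S_n[0])) for pi in S_n)
          / sum(math.exp(theta * kendall_tau(pi, S_n[0])) for pi in S_n))
assert abs(deriv - direct) < 1e-4
```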
2.3. Extended Mallows Models

Lebanon and Lafferty (2002) propose a natural generalization of the Mallows model to the following conditional model:

$$p(\pi \mid \theta, \sigma) = \frac{1}{Z(\sigma, \theta)}\, p(\pi) \exp\!\left(\sum_{i=1}^{K} \theta_i\, d(\pi, \sigma_i)\right) \qquad (5)$$

where $\sigma = (\sigma_1, \dots, \sigma_K) \in S_n^K$, $\theta = (\theta_1, \dots, \theta_K) \in \mathbb{R}^K$, $\theta \le 0$, $p(\pi)$ is a prior, and the normalizing constant is $Z(\sigma, \theta) = \sum_{\pi \in S_n} p(\pi) \exp\big(\sum_{i=1}^{K} \theta_i\, d(\pi, \sigma_i)\big)$.

The rankings $\sigma_i$ may be thought of as the votes of $K$ individual judges, e.g. the rankings returned by multiple search engines for a particular query in the meta-search setting. The free parameters $\theta_i$ represent the degree of expertise of the individual judges: the closer the value of $\theta_i$ is to zero, the less the vote of the $i$-th judge affects the assignment of probability. Under the right-invariance assumption on $d$, we can use property (2) to derive the following generative story underlying the extended Mallows model:

$$p(\pi, \sigma \mid \theta) = p(\pi) \prod_{i=1}^{K} p(\sigma_i \mid \theta_i, \pi) \qquad (6)$$

That is, $\pi$ is first drawn from the prior $p(\pi)$; $\sigma$ is then made up by drawing $\sigma_1, \dots, \sigma_K$ independently from $K$ Mallows models $p(\sigma_i \mid \theta_i, \pi)$ with the same location parameter $\pi$. It is straightforward to generalize both the Mallows model (Critchlow, 1985) and the extended Mallows model to partial rankings by constructing appropriate distance functions. We will assume this more general setting in the following sections.
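A small sketch of the generative story (6) under a uniform prior: the true ranking is drawn first, then each judge's vote is drawn from a Mallows model centered at it. Exact sampling by enumerating $S_n$ is used here only for clarity (it works for tiny $n$ only), and the dispersion values are arbitrary.

```python
import itertools, math, random

def kendall_tau(pi, sigma):
    return sum(1 for i, j in itertools.combinations(range(len(pi)), 2)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

def sample_mallows(center, theta, S_n, rng):
    """Draw one ranking from the Mallows model (1) by exhaustive enumeration (small n only)."""
    weights = [math.exp(theta * kendall_tau(pi, center)) for pi in S_n]
    return rng.choices(S_n, weights=weights, k=1)[0]

rng = random.Random(0)
n, K = 5, 3
thetas = [-1.0, -0.3, 0.0]                 # per-judge dispersions; 0.0 is a purely random judge
S_n = [list(p) for p in itertools.permutations(range(n))]

# Generative story (6): draw the true ranking pi from a uniform prior,
# then K votes independently from Mallows models located at pi.
pi_true = rng.choice(S_n)
votes = [sample_mallows(pi_true, th, S_n, rng) for th in thetas]
for th, v in zip(thetas, votes):
    print(th, v, "distance to truth:", kendall_tau(v, pi_true))
```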
3. Learning and Inference

In this section, we derive the general formulation of the Expectation-Maximization algorithm for parameter estimation of the extended Mallows models (5), and suggest a class of distance functions for which learning can be done efficiently. We then describe an inference procedure for the model.

3.1. EM Background and Notation

Let us start with a brief overview of Expectation-Maximization (Dempster et al., 1977), mostly to introduce notation. EM is a general method for finding maximum likelihood estimates of the parameters of models which depend on unobserved variables. The EM procedure iterates between:

E step: estimate the expected value of the complete data log-likelihood with respect to the unknown data $Y$, given the observed data $X$ and the current parameter estimates $\theta'$:

$$T(\theta, \theta') = E\big[\log p(X, Y \mid \theta) \,\big|\, X, \theta'\big]$$

M step: choose the parameters that maximize the expectation computed in the E step:

$$\theta' \leftarrow \arg\max_{\theta}\; T(\theta, \theta')$$

In our setting, the $K$ experts generate votes $\sigma$ corresponding to the unobserved true ranking $\pi$. We see multiple such instances, so the observed data are the ranking vectors $X = \{\sigma^{(j)}\}_{j=1}^{Q}$ with the corresponding true (unobserved) rankings $Y = \{\pi^{(j)}\}_{j=1}^{Q}$. In the meta-search example, $\sigma_i^{(j)}$ is the ranking of the $i$-th (of the total of $K$) search engine for the $j$-th (of the total of $Q$) query. The (unknown) true ranking corresponding to the $j$-th query is denoted $\pi^{(j)}$.

3.2. EM Derivation

We now use the generative story (6) to derive the following propositions (proofs are omitted due to space constraints):

Proposition 1. The expected value of the complete data log-likelihood under (5) is:

$$T(\theta, \theta') = \sum_{(\pi^{(1)}, \dots, \pi^{(Q)}) \in S_n^Q} L \cdot U \qquad (7)$$

where the complete data log-likelihood $L$ is:

$$L = \sum_{j=1}^{Q} \log p(\pi^{(j)}) \;-\; Q \sum_{i=1}^{K} \log Z(\theta_i) \;+\; \sum_{j=1}^{Q} \sum_{i=1}^{K} \theta_i\, d(\pi^{(j)}, \sigma_i^{(j)})$$

and the marginal distribution of the unobserved data $U$ is:

$$U = \prod_{j=1}^{Q} p\big(\pi^{(j)} \,\big|\, \theta', \sigma^{(j)}\big)$$

Proposition 2. $T(\theta, \theta')$ is maximized by $\theta = (\theta_1, \dots, \theta_K)$ such that:

$$E_{\theta_i}(D) = \frac{1}{Q} \sum_{q=1}^{Q} \sum_{(\pi^{(1)}, \dots, \pi^{(Q)}) \in S_n^Q} d(\pi^{(q)}, \sigma_i^{(q)})\, U \qquad (8)$$

That is, on each iteration of EM, we need to evaluate the right-hand side (RHS) of (8) and solve the LHS for $\theta_i$, for each of the $K$ components.
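The EM procedure implied by (7) and (8) can be organized as in the following hedged Python skeleton. The callables `estimate_rhs` and `expected_distance_given_theta` are placeholders for the estimators discussed in Section 3.3 and the closed forms of Section 4; the bisection assumes only that the expected distance is monotone in $\theta$, and all names are ours.

```python
def solve_theta(expected_distance, target, lo=-20.0, hi=-1e-6, iters=60):
    """Bisection for theta in [lo, hi] with E_theta(D) = target, assuming monotonicity."""
    increasing = expected_distance(hi) > expected_distance(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (expected_distance(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def em_rank_aggregation(votes, expected_distance_given_theta, estimate_rhs, num_iters=20):
    """Skeleton of the EM procedure implied by (7)-(8).

    votes[j][i]                          -- ranking sigma_i^(j) of judge i for instance j
    expected_distance_given_theta(theta) -- the LHS of (8), e.g. via (9) or (10)
    estimate_rhs(votes, thetas, i)       -- approximate (e.g. Monte Carlo) estimate of the
                                            RHS of (8) for judge i under current parameters
    """
    K = len(votes[0])
    thetas = [0.0] * K                   # start from the uniform model (an arbitrary choice)
    for _ in range(num_iters):
        rhs = [estimate_rhs(votes, thetas, i) for i in range(K)]              # E step (approx.)
        thetas = [solve_theta(expected_distance_given_theta, r) for r in rhs]  # M step
    return thetas
```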
3.3. Model Learning and Inference

At first glance, both evaluating the RHS of (8) and solving the LHS for $\theta_i$ seem quite expensive (the summations range over more than $n!$ terms). While true in general, we can make the learning tractable for a certain type of distance function. In particular, if a distance function can be decomposed into a sum of independent components under the uniform distribution of $\gamma$ (see Section 2.2), property (4) may enable us to estimate the LHS efficiently. In Section 4, we show two examples of such distance functions (for permutations and for top-k lists).

In order to estimate the RHS, we use the Metropolis algorithm (Hastings, 1970) to sample from (5). The chain proceeds as follows: denoting the most recent value sampled as $\pi_t$, two indices $i, j \in \{1, \dots, n\}$ are chosen at random and the objects $\pi_t^{-1}(i)$ and $\pi_t^{-1}(j)$ are transposed, forming $\pi^*$. If $a = p(\pi^* \mid \theta, \sigma) / p(\pi_t \mid \theta, \sigma) \ge 1$, the chain moves to $\pi^*$. If $a < 1$, the chain moves to $\pi^*$ with probability $a$; otherwise, it stays at $\pi_t$. Diaconis and Saloff-Coste (1998) show quick convergence for the Mallows model with Cayley's distance. While no convergence results are known for the extended Mallows model with an arbitrary distance, we found experimentally that the Markov chain converges rapidly with the two distance functions used in this work (10n steps in the experiments of Section 5). As the chain proceeds, we update the distance value with the incremental change due to a single transposition, instead of recomputing it from scratch, resulting in substantial savings in computation.

Alternatively, we also found (Section 5.1) that a combination of the rankings $\sigma_i$ weighted by $\exp(-\theta_i)$ provides a reasonable and quick estimate for evaluating the RHS. Whichever RHS estimate (sampling or the suggested alternative) is used during training is also used for model inference.
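A minimal sketch of the sampler just described, assuming a uniform prior and Kendall's tau distance; for brevity it recomputes the distance of each proposal from scratch rather than applying the incremental update used in the paper, and all names are ours.

```python
import itertools, math, random

def kendall_tau(pi, sigma):
    return sum(1 for a, b in itertools.combinations(range(len(pi)), 2)
               if (pi[a] - pi[b]) * (sigma[a] - sigma[b]) < 0)

def metropolis_sample(votes, thetas, n, steps, rng):
    """Metropolis chain over S_n targeting the extended Mallows model (5) with a uniform prior.
    The proposal transposes the objects currently holding two randomly chosen ranks."""
    def log_weight(pi):
        return sum(th * kendall_tau(pi, s) for th, s in zip(thetas, votes))

    pi = list(rng.sample(range(n), n))       # random starting permutation (object -> rank)
    lw = log_weight(pi)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)       # two ranks chosen at random
        a, b = pi.index(i), pi.index(j)      # objects currently holding ranks i and j
        cand = list(pi)
        cand[a], cand[b] = cand[b], cand[a]  # transpose them
        lw_cand = log_weight(cand)           # full recomputation (the paper updates incrementally)
        if rng.random() < math.exp(min(0.0, lw_cand - lw)):
            pi, lw = cand, lw_cand
    return pi
```

The RHS of (8) for judge $i$ can then be estimated by averaging the distance between sampled permutations and the corresponding vote $\sigma_i^{(q)}$ over queries.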
4. Model Application

Overcoming the remaining hurdle (the LHS estimation) in learning the model efficiently depends on the definition of the distance function. We now consider two particular types of (partial) rankings: permutations and top-k lists. The latter is the case where each judge specifies a ranking over the $k$ most preferable objects out of $n$. For instance, a top-10 list may be associated with the 10 items on the first page of results returned by a web search engine. For both permutations and top-k lists, we show distance functions which satisfy the decomposability property (Section 2.2), which, in turn, allows us to estimate the LHS of (8) efficiently.

4.1. Combining Permutations

Kendall's tau distance (Kendall, 1938) between permutations $\pi$ and $\sigma$ is a right-invariant metric defined as the minimum number of pairwise adjacent transpositions needed to turn one permutation into the other. Assuming that one of the permutations, say $\sigma$, is the identity permutation $e$ (we can always turn one of the permutations into $e$ by re-indexing the objects without changing the value of the distance, see Section 2.1), it can be written as:

$$D_K(\gamma) = \sum_{i=1}^{n-1} V_i(\gamma)$$

where $V_i(\gamma) = \sum_{j>i} I\big(\gamma^{-1}(i) - \gamma^{-1}(j)\big)$, and $I(x) = 1$ if $x > 0$ and $0$ otherwise. The $V_i$ are independent and uniform over the integers $[0, n-i]$ (Feller, 1968), with m.g.f. $M_i(t) = \frac{1}{n-i+1} \sum_{k=0}^{n-i} e^{tk}$. Following Fligner and Verducci (1986), equation (4) gives:

$$E_\theta(D_K) = \frac{n\, e^{\theta}}{1 - e^{\theta}} - \sum_{j=1}^{n} \frac{j\, e^{j\theta}}{1 - e^{j\theta}} \qquad (9)$$

$E_\theta(D_K)$ is monotone in $\theta$ (it decreases toward 0 as $\theta \to -\infty$), so a line search for $\theta$ converges quickly.
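The two ingredients needed for the permutation case are sketched below: the decomposed Kendall distance $D_K$ and a line search that inverts (9) to recover $\theta$ from an estimated expected distance. This is our own illustrative code, not the authors' implementation.

```python
import math

def D_K(gamma):
    """Kendall's tau distance to the identity via the decomposition D_K = sum_i V_i.
    gamma maps object index -> rank (0-based)."""
    n = len(gamma)
    inv = [0] * n                        # inv[rank] = object index
    for obj, rank in enumerate(gamma):
        inv[rank] = obj
    return sum(1 for i in range(n) for j in range(i + 1, n) if inv[i] > inv[j])

def expected_D_K(theta, n):
    """Expected Kendall distance under the Mallows model, equation (9); theta < 0."""
    total = n * math.exp(theta) / -math.expm1(theta)
    total -= sum(j * math.exp(j * theta) / -math.expm1(j * theta) for j in range(1, n + 1))
    return total

def solve_theta_kendall(target, n, lo=-30.0, hi=-1e-6, iters=80):
    """Bisection for theta with E_theta(D_K) = target (the LHS of (8))."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_D_K(mid, n) < target:   # the expectation grows toward n(n-1)/4 as theta -> 0
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Given an RHS estimate for judge $i$ (e.g. from the sampler above), `solve_theta_kendall(rhs, n)` returns the updated $\theta_i$ for the M step.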
4.2. Combining Top-k Lists

We now propose an extension of Kendall's tau distance to top-k lists, i.e. the case where $\pi$ and $\sigma$ indicate preferences over different (possibly overlapping) subsets of $k \le n$ objects. Let us denote by $F_\pi$ and $F_\sigma$ the elements in $\pi$ and $\sigma$ respectively, noting that $|F_\pi| = |F_\sigma| = k$. We define $Z = F_\pi \cap F_\sigma$ with $|Z| = z$, $P = F_\pi \setminus F_\sigma$, and $S = F_\sigma \setminus F_\pi$ (note that $|P| = |S| = k - z = r$). We treat $\pi$ and $\sigma$ as rankings, which for us means that the smallest index indicates the top, i.e. contains the most preferred object. For notational convenience, let us define the augmented ranking $\tilde{\pi}$ as $\pi$ augmented with the elements of $S$, all assigned the same index $(k+1)$, one past the bottom of the ranking, as shown in Figure 1 ($\tilde{\sigma}$ is defined similarly). We will slightly abuse notation and denote by $\tilde{\pi}^{-1}(k+1)$ the set of elements in position $(k+1)$. Kendall's tau distance $D_K$ is naturally extended from permutations to augmented rankings.

Definition 1. The distance $\tilde{D}_K(\tilde{\pi}, \tilde{\sigma})$ between augmented rankings $\tilde{\pi}$ and $\tilde{\sigma}$ is the minimum number of adjacent transpositions needed to turn $\tilde{\pi}$ into $\tilde{\sigma}$.

[Figure 1. An example of an augmented ranking $\tilde{\gamma}$ (left) and the identity augmented ranking (right, in natural order). Grey boxes are objects in $\sigma$ but not in $\pi$. $\tilde{D}_K(\tilde{\gamma})$ is the minimum number of adjacent transpositions needed to turn $\tilde{\gamma}$ into the identity: namely, bring all grey boxes into position $k+1$ and put the remaining $k$ objects in their natural order.]

It can be shown that $\tilde{D}_K(\tilde{\pi}, \tilde{\sigma})$ is a right-invariant metric, so we again simplify notation and write $\tilde{D}_K(\tilde{\gamma})$. This distance can be decomposed as:

$$\tilde{D}_K(\tilde{\gamma}) = \sum_{\substack{i=1 \\ \tilde{\gamma}^{-1}(i) \in Z}}^{k} \tilde{V}_i(\tilde{\gamma}) \;+\; \sum_{\substack{i=1 \\ \tilde{\gamma}^{-1}(i) \notin Z}}^{k} \tilde{U}_i(\tilde{\gamma}) \;+\; \frac{r(r+1)}{2}$$

where

$$\tilde{V}_i(\tilde{\gamma}) = \sum_{\substack{j=i \\ \tilde{\gamma}^{-1}(j) \in Z}}^{k} I\big(\tilde{\gamma}^{-1}(i) - \tilde{\gamma}^{-1}(j)\big) + \sum_{j \in \tilde{\gamma}^{-1}(k+1)} I\big(\tilde{\gamma}^{-1}(i) - j\big), \qquad \tilde{U}_i(\tilde{\gamma}) = \sum_{\substack{j=i \\ \tilde{\gamma}^{-1}(j) \in Z}}^{k} 1$$

In this decomposition, the second term is the minimum number of adjacent transpositions necessary to bring the $r$ elements not in $Z$ (grey boxes in Figure 1) to the bottom of the ranking. The third term is the minimum number of adjacent transpositions needed to switch them with the elements in $\tilde{\gamma}^{-1}(k+1)$, which then appear in the correct order in the bottom $r$ positions. Finally, the first term is the number of adjacent transpositions necessary to put the $k$ elements now in the list into their natural order.

It can be shown that the random variable summands comprising $\tilde{D}_K(\tilde{\gamma})$ are independent when $\tilde{\gamma}$ is uniformly distributed. Furthermore, $\tilde{V}_i$ and $\tilde{U}_j$ are uniform over the integers $[0, k-i]$ and $[0, z]$, with moment generating functions $\frac{1}{k-i+1}\sum_{j=0}^{k-i} e^{tj}$ and $\frac{1}{z+1}\sum_{j=0}^{z} e^{tj}$, respectively. Assuming $z > 0$ and $r > 0$, equation (4) gives:

$$E_\theta(\tilde{D}_K) = \frac{k\, e^{\theta}}{1 - e^{\theta}} - \sum_{j=r+1}^{k} \frac{j\, e^{j\theta}}{1 - e^{j\theta}} + \frac{r(r+1)}{2} - \frac{r(z+1)\, e^{(z+1)\theta}}{1 - e^{(z+1)\theta}} \qquad (10)$$

If $r = 0$ (i.e. the augmented rankings are over the same objects), both the distance and the expected value reduce to the Kendall distance results. Also, if $z = 0$ (i.e. the augmented rankings have no objects in common), $\tilde{D}_K = E(\tilde{D}_K) = k(k+1)/2$, which is the smallest number of adjacent transpositions needed to move the $r = k$ objects in $\tilde{\gamma}^{-1}(k+1)$ into the top $k$ positions.

$E_\theta(\tilde{D}_K)$ is likewise monotone in $\theta$, so we can again use a line search to find the value of $\theta$. Notice that the expected value depends on $z$, the number of common elements between the two rankings. We compute the average value of $z$ as we estimate the RHS of (8) and use it to solve the LHS for $\theta$.
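The following sketch computes $\tilde{D}_K$ between two top-k lists by accumulating the three terms of the decomposition in pair-counting form. This is our reading of the definition above (the paper defines the distance via adjacent transpositions over augmented rankings), and the function name is ours.

```python
from itertools import combinations

def topk_distance(pi, sigma):
    """Distance D~_K between two top-k lists (sequences of object ids, most preferred first),
    accumulated term by term from the decomposition in Section 4.2."""
    k = len(pi)
    assert len(sigma) == k
    rank_pi = {x: i for i, x in enumerate(pi)}
    rank_sigma = {x: i for i, x in enumerate(sigma)}
    Z = set(pi) & set(sigma)              # shared objects
    P = set(pi) - Z                       # objects in pi only
    S = set(sigma) - Z                    # objects in sigma only
    r = k - len(Z)

    # first term: inversions among shared objects ...
    d = sum(1 for a, b in combinations(Z, 2)
            if (rank_pi[a] - rank_pi[b]) * (rank_sigma[a] - rank_sigma[b]) < 0)
    # ... plus shared objects that an element of S (ranked above them in sigma) must move past
    d += sum(1 for s in S for a in Z if rank_sigma[s] < rank_sigma[a])
    # second term: elements of P moved down past shared objects ranked below them in pi
    d += sum(1 for p in P for a in Z if rank_pi[p] < rank_pi[a])
    # third term: swap the r elements of P past the r elements of S sitting at position k+1
    d += r * (r + 1) // 2
    return d

# Sanity checks: identical lists give 0; disjoint lists give k(k+1)/2.
assert topk_distance([1, 2, 3], [1, 2, 3]) == 0
assert topk_distance([1, 2, 3], [4, 5, 6]) == 6
```

With this distance, the expected value (10) plays the same role as (9) in the line search for $\theta$.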
5. Experimental Evaluation

We demonstrate the effectiveness of our approach for the permutation and top-k list scenarios considered in Section 4.

5.1. Permutations

We first consider the scenario of aggregating permutations. For this set of experiments, the votes of $K = 10$ individual experts were produced by sampling standard Mallows models (1) with the same location parameter $\sigma = e$ (an identity permutation over $n = 30$ objects) and dispersion parameters $\theta_1 = \theta_2 = -1.0$, $\theta_3 = \dots = \theta_9 = -0.05$, and $\theta_{10} = 0$ (the latter generating all permutations uniformly at random). The models were sampled 10 times, resulting in $Q = 10$ lists of permutations (one for each "query"), which constituted the training data.

In addition to the sampling procedure described in Section 3.3 for estimating the RHS of (8), we also tried the following weighted Borda count approximation. For each "query" $q$, we took the $K$ votes and mixed them into a single permutation $\hat{\pi}_q$ as follows: a score for each of the $n$ objects is computed as a weighted combination of the ranks assigned to that object by the individual judges, and the aggregate permutation $\hat{\pi}_q$ is obtained by sorting the objects according to their resulting scores. The weights are computed from the current values of the model parameters as $\exp(-\theta_i)$. The rationale is that the smaller the absolute value of $\theta_i$, the lower the relative quality of the ranker, and the less it should contribute to the aggregate vote. Finally, the RHS for the $i$-th component is computed as the distance from its vote to $\hat{\pi}_q$, averaged over all $Q$ queries. We also tried using the true permutation in place of $\hat{\pi}_q$ to see how well the learning procedure can do.

At the end of each EM iteration, we sampled the current model (5) and computed the Kendall's tau distance between the generated permutation and the true ranking. Figure 2 shows the model performance when sampling and the proposed approximation are used to estimate the RHS.

[Figure 2. Permutations: learning performance of the model (average $D_K$ to the true permutation vs. EM iteration, averaged over 5 runs) when the RHS is estimated using sampling (Sampling), the proposed weighted Borda count approximation (Weighted), or the true permutation (True). Although convergence is much faster with the approximation, the model trained with the sampling method achieves better performance.]

Although convergence is much faster with the approximation, the model trained with the sampling method achieves better performance, approaching the case when the true permutation is known.

5.2. Top-k Lists

In order to estimate the model's performance in the top-k list combination scenario, we performed data fusion experiments using data from the ad-hoc retrieval shared task of the TREC-3 conference (Harman, 1994). Our goal here is to examine the behavior of our approach as we introduce poor judges into the constituent ranker pool. In this shared task, 40 participants submitted top-1000 rankings over a large document collection for each of 50 queries. For our experiments, we used the top-100 ($k = 100$) rankings from $K = 38$ of the participants (two of the participants generated shorter rankings for some of the queries and were not used) for all $Q = 50$ queries. We replaced a specified number $K_r \in [0, K]$ of the participants with random rankers (drawing permutations of $k$ documents uniformly at random from the set of documents returned by all participants for a given query). We then used our algorithm to combine top-k lists from the $K_r$ random rankers and $(K - K_r)$ participants chosen at random.

We measure performance using the precision in the top {10, 30} documents as computed by trec_eval (available at http://trec.nist.gov/) from the TREC conference series. As a baseline, we use CombMNZ_rank suggested in (Klementiev et al., 2007), a variant of the commonly used CombMNZ (Shaw & Fox, 1994). Given a query $q$, for each document $x$ in the collection it computes the score $N_x \times \sum_{i=1}^{K} (k - r_i(x, q))$, where $r_i(x, q)$ is the rank of document $x$ in the ranking returned by participant $i$ for query $q$, and $N_x$ is the number of participants which place $x$ in their top-k rankings. The aggregate ranking is obtained by sorting the documents according to their scores. Intuitively, the more component rankers rank a document highly, the higher it appears in the aggregate ranking.

Figure 3 shows that our algorithm learns to discount the random components without supervision, substantially improving over the baseline as $K_r \to K$.

[Figure 3. Top-k lists: precision (top 10 and top 30 documents) of the aggregate ranker and of CombMNZ_rank as a function of the number of random component rankers $K_r$. Our algorithm learns to discount the random components without supervision, substantially improving over CombMNZ_rank.]

We also compared our results with the ULARA algorithm (Klementiev et al., 2007). These results are not included since we found ULARA to be too sensitive to user-defined parameters (an implicit form of supervision), with results varying between competitive with our model and comparable to CombMNZ_rank.

5.3. Model Dispersion Parameters

In order to demonstrate the relationship between the learned dispersion parameters of the model, $\theta$, and the relative performance of the constituent rankers, we also conducted a meta-search experiment. First, we generated $Q = 50$ queries which result in an unambiguous most relevant document and submitted them to $K = 4$ commercial search engines. For each engine, we kept the 100 highest ranked documents (10 pages of 10 documents each) after removing duplicates and unifying URL formatting differences between engines. We measure performance with the Mean Reciprocal Page Rank (MRPR), which we define as the mean reciprocal rank of the page number on which the correct document appears.

Table 1 shows the MRPR of the four search engines and their corresponding model parameters.

Table 1. MRPR of the four search engines and their corresponding model parameters; the results suggest a correlation between the magnitude of the dispersion parameters and the relative system performance.

              S1        S2        S3        S4
  $\theta_i$  -0.065     0.0      -0.066    -0.049
  MRPR         0.86      0.43      0.82      0.78

As expected, the results suggest a correlation between the magnitude of the dispersion parameters and the relative system performance, implying that their values may also be used for unsupervised search engine evaluation. Finally, our model achieves MRPR = 0.92, beating all of the constituent rankers.

6. Conclusions and Future Work

We propose a formal mathematical and algorithmic framework for aggregating (partial) rankings without supervision. We derive an EM-based algorithm for the extended Mallows model and show that it can be made efficient for right-invariant decomposable distance functions. We instantiate the framework and experimentally demonstrate its effectiveness for the important cases of combining permutations and combining top-k lists. In the latter case, we introduce the notion of an augmented ranking and a novel decomposable distance function for efficient learning.

A natural extension of the current work is to instantiate our framework for other types of partial rankings, as well as for cases where the ranking data is not all of the same type. The latter is of practical significance since the available preference information is often expressed differently by different judges (e.g. top-k rankings of different lengths). Another direction for future work is to extend the rank aggregation model to accommodate position dependence. In IR, more importance is generally given to results appearing higher in the rankings; within our framework, one may be able to design a distance function reflecting this requirement. Additionally, the quality of the votes produced by individual components may depend on the rank: in the top-k scenario, for example, some rankers may be better at choosing the few most relevant objects, while others may tend to include more relevant objects among the $k$ selected but not rank them well relative to one another. This case may be modeled by adding a dependency on rank to the dispersion parameters of the model.

In addition, this framework appears promising for a number of applications. Besides the NLP problems mentioned before, such as learning to combine output from multiple machine translation systems, one interesting setting may be domain adaptation. Here, the task is to adapt a hypothesis trained with ample labeled data from one input distribution to a second distribution for which minimal training data is available. When the hypothesis is a trained aggregate ranker, we expect the relative expertise of its components to change, and we can use our approach to reweigh them accordingly.

Acknowledgments

We would like to thank Ming-Wei Chang, Sariel Har-Peled, Vivek Srikumar, and the anonymous reviewers for their valuable suggestions. This work is supported by NSF grant ITR IIS-0428472, DARPA funding under the Bootstrap Learning Program, and by MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC.

References

Busse, L. M., Orbanz, P., & Buhmann, J. M. (2007). Cluster analysis of heterogeneous rank data. Proc. of the International Conference on Machine Learning (ICML).

Critchlow, D. E. (1985). Metric methods for analyzing partially ranked data, vol. 34 of Lecture Notes in Statistics. Springer-Verlag.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1-38.

Diaconis, P., & Graham, R. L. (1977). Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society, 39, 262-268.

Diaconis, P., & Saloff-Coste, L. (1998). What do we know about the Metropolis algorithm? Journal of Computer and System Sciences, 57, 20-36.

Estivill-Castro, V., Mannila, H., & Wood, D. (1993). Right invariant metrics and measures of presortedness. Discrete Applied Mathematics, 42, 1-16.

Fagin, R., Kumar, R., & Sivakumar, D. (2003). Comparing top k lists. SIAM Journal on Discrete Mathematics, 17, 134-160.

Feller, W. (1968). An introduction to probability theory and its applications, vol. 1. John Wiley and Sons, Inc.

Fligner, M. A., & Verducci, J. S. (1986). Distance based ranking models. Journal of the Royal Statistical Society, 48, 359-369.

Harman, D. (1994). Overview of the third Text REtrieval Conference (TREC-3).

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.

Joachims, T. (2002). Unbiased evaluation of retrieval quality using clickthrough data. SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval.

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30, 81-93.

Klementiev, A., Roth, D., & Small, K. (2007). An unsupervised learning algorithm for rank aggregation. Proc. of the European Conference on Machine Learning (ECML) (pp. 616-623).

Lebanon, G., & Lafferty, J. (2002). Cranking: Combining rankings using conditional probability models on permutations. Proc. of the International Conference on Machine Learning (ICML).

Lebanon, G., & Lafferty, J. (2003). Conditional models on the ranking poset. Advances in Neural Information Processing Systems (NIPS) (pp. 431-438).

Liu, Y.-T., Liu, T.-Y., Qin, T., Ma, Z.-M., & Li, H. (2007). Supervised rank aggregation. Proc. of the International World Wide Web Conference (WWW).

Mallows, C. L. (1957). Non-null ranking models. Biometrika, 44, 114-130.

Rosti, A.-V. I., Ayan, N. F., Xiang, B., Matsoukas, S., Schwartz, R., & Dorr, B. J. (2007). Combining outputs from multiple machine translation systems. Proc. of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) (pp. 228-235).

Shaw, J. A., & Fox, E. A. (1994). Combination of multiple searches. Text REtrieval Conference (TREC) (pp. 243-252).