Estimation and Clustering with Infinite Rankings

Marina Meilă and Le Bao
Department of Statistics, University of Washington
Seattle, WA 98195-4322
{mmp,lebao}@stat.washington.edu

Abstract

This paper presents a natural extension of stagewise ranking to the case of infinitely many items. We introduce the infinite generalized Mallows model (IGM), describe its properties and give procedures to estimate it from data. For the estimation of multimodal distributions we introduce the Exponential Blurring Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of the new model and demonstrate that infinite models can be simple, elegant and practical.

1 Introduction

The stagewise ranking model of [6], also known as the generalized Mallows (GM) model, has been recognized as particularly appropriate for modeling the human process of ranking. This model assigns a permutation over n items a probability that decays exponentially with its distance to a central permutation σ. Here we study this class of models in the limit n → ∞, with the assumption that, out of the infinitely many items ordered, one only observes those occupying the first t ranks.

Ordering an infinite number of items is common in retrieval tasks: search engines, programs that match a face, a fingerprint, or a biological sequence against a database, all output the first t items in an ordering over virtually infinitely many objects. We shall call this output a top-t ordering. Unlike machines, people can only reliably rank a small number of items, and the GM model has been successfully used to model human ranking decisions. We can view the difference between the standard GM model and the infinite GM model introduced here as the difference between an election where each voter returns an ordering of a small number of preselected candidates (nominees) and a "grassroots" election process, where everyone can nominate and order their own favourites from a virtually unlimited population. For instance, the difference between "Order the following issues by how much you care about them" and "List in order the issues that you care most about" illustrates the difference between the standard and the infinite GM models. By these examples, we argue that the infinite GM corresponds to realistic scenarios.

After defining the infinite GM model, we show that it has sufficient statistics and give algorithms for estimating its parameters from data in the Maximum Likelihood (ML) framework. Note that our model has an infinite number of parameters, of which only a finite number are constrained by the data in any finite sample. The existence of sufficient statistics also enables us to construct and characterize the conjugate prior for this class of models. We then consider the clustering of top-t ranking data and introduce an adapted version of the well known Gaussian Blurring Mean-Shift (GBMS) algorithm [2], which we call Exponential Blurring Mean-Shift (EBMS).

2 Finite and infinite permutations and their top-t orderings

We consider permutations σ over the set of natural numbers ℕ = {1, 2, ..., i, ...}. Following standard notation, σ(i) denotes the rank of item i and σ^{-1}(j) the item at rank j in σ. The permutation matrix corresponding to σ has σ_{ij} = 1 iff σ(i) = j. For any two permutations σ, π over ℕ, the matrix product corresponds to the composition of the two permutations. In our case σ, π will be infinite matrices with exactly one 1 in every row and column. A top-t ordering π^{-1} is the prefix (π^{-1}(1), ..., π^{-1}(t)) of some infinite ordering.
The notation π^{-1} indicates that we observe items, not ranks. For the rest of this paper, the term ordering will denote the inverse of a permutation, i.e. the list of items associated with ranks 1, 2, .... In general, a Greek letter σ or π denotes a permutation, also called a ranking, while the symbols σ^{-1}, π^{-1} denote the corresponding orderings. Further, our notation distinguishes when possible between observed orderings, denoted by π^{-1}, which by virtue of being observed are always truncated, and the "central permutations", ideal infinite objects denoted by σ. What we try to estimate is a top-t ordering of σ, and this is denoted by σ^{-1}.

The matrix π of a top-t ordering has t columns, each with an infinite number of zeros and a single 1 in row π^{-1}(j), j = 1 : t. Rigorously this matrix should be denoted π^{-T}, since it is the matrix of π^{-1}, but we opt for π to simplify notation. For a permutation σ and a top-t ordering π^{-1}, the matrix π^T σ corresponds to the composition σ(π^{-1}), listing the ranks in σ of the items in π^{-1}. We shall use its transpose σ^T π, which is the ∞ × t matrix formed with the rows π^{-1}(1), ..., π^{-1}(t) of σ as columns.

For any matrix σ^T π, its code (s_j, j = 1 : t) is defined as follows: 1 + s_j is the rank of π^{-1}(j) in σ restricted to ℕ \ {π^{-1}(1), ..., π^{-1}(j−1)}. Thus, s_1 is the number of 0's preceding the 1 in the first column of σ^T π; after we delete the row containing this 1, s_2 is the number of 0's preceding the 1 in the second column; after deleting the row containing this 1, s_3 is the number of 0's preceding the 1 in the third column, and so on. Hence s_j ∈ {0, 1, 2, ...} and

    s_j(σ^T π) = σ(π^{-1}(j)) − 1 − Σ_{j'<j} 1[ σ(π^{-1}(j')) < σ(π^{-1}(j)) ]  ≥  0.

We now introduce the distance

    d_θ(π^{-1}, σ) = Σ_{j=1}^{t} θ_j s_j(σ^T π)      (1)

with θ = (θ_{1:t}) a vector of strictly positive parameters; d_θ is the extension to infinite orderings of the d_θ of [6]. In general, this "distance" between a top-t ordering and an infinite ordering is not a metric. When the θ_j are all equal to θ, d_θ(π^{-1}, σ) = θ d_K(π^{-1}, σ), with d_K known as the Kendall distance; d_K counts the number of adjacent transpositions needed to make σ compatible with π^{-1} and is a metric.
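As an illustration (ours, not part of the paper's implementation), the code (s_j) and the distance (1) can be computed directly from these definitions; the sketch below assumes the central permutation is given as a dictionary mapping each observed item to its rank under σ.

    import numpy as np

    def codes(ordering, sigma_rank):
        """Code (s_1, ..., s_t) of a top-t ordering w.r.t. a central permutation.

        ordering   : list of items, most preferred first (pi^{-1})
        sigma_rank : dict item -> rank under sigma (1-based), assumed to
                     cover every item appearing in `ordering`
        """
        s = []
        for j, item in enumerate(ordering):
            rank = sigma_rank[item]
            # s_j = sigma(pi^{-1}(j)) - 1 - #{j' < j : sigma(pi^{-1}(j')) < sigma(pi^{-1}(j))}
            earlier = sum(1 for prev in ordering[:j] if sigma_rank[prev] < rank)
            s.append(rank - 1 - earlier)
        return s

    def d_theta(ordering, sigma_rank, theta):
        """Distance of equation (1); a scalar theta gives theta times the Kendall distance."""
        s = np.array(codes(ordering, sigma_rank), dtype=float)
        th = np.broadcast_to(np.asarray(theta, dtype=float), s.shape)
        return float(th @ s)

For example, with sigma_rank = {'a': 1, 'b': 2, 'c': 3, 'd': 4}, the call codes(['b', 'a', 'd'], sigma_rank) returns [1, 0, 1].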
3 The Infinite Generalized Mallows model

Now we are ready to introduce the infinite generalized Mallows (IGM) model. We start with the observation that, since any top-t ordering can be represented uniquely by a sequence of t natural numbers (its code), defining a distribution over the former is equivalent to defining a distribution over the latter, which is a more intuitive task. In keeping with the GM paradigm, we shall choose t-sequences for which each s_j is sampled independently from

    P(s_j) = (1/ψ(θ_j)) e^{−θ_j s_j},   s_j = 0, 1, 2, ...      (2)

The normalization constant is ψ(θ_j) = Σ_{k=0}^{∞} e^{−θ_j k} = 1/(1 − e^{−θ_j}). The IGM model can now be defined as

    P_{θ,σ}(π^{-1}) = exp( − Σ_{j=1}^{t} [ θ_j s_j(σ^T π) + ln ψ(θ_j) ] )      (3)

The distribution P_{θ,σ} has a t-dimensional real parameter θ and an infinite-dimensional discrete parameter σ. Any top-t ordering π^{-1} stands for the set of infinite sequences starting with s_{1:t}, so P_{θ,σ}(π^{-1}) can be viewed as the marginal of s_{1:t} in the infinite product space defined by the distribution P_{θ,σ}(s) = exp(− Σ_{j=1}^{∞} [θ_j s_j + ln ψ(θ_j)]), s ∈ ℕ × ℕ × ⋯. In contrast with the finite GM, the parameters θ_j must be strictly positive in order for the probability distribution to exist.

The most probable π^{-1} for any given t is the top-t ordering corresponding to s_1 = ... = s_t = 0; this is the top-t prefix of σ^{-1}. The ordering σ is called the central permutation of P_{θ,σ}. The parameters θ control the spread around the mode σ: larger θ_j correspond to more concentrated distributions. These facts are direct extensions of statements about the GM model from [6], and therefore the detailed proofs are omitted.
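Sampling from the IGM is a direct transcription of (2) and (3): draw each s_j from its geometric distribution, then read off the (s_j + 1)-th not-yet-used item of σ^{-1}. The sketch below is our own illustration; the finite list standing in for the infinite central ordering is an assumption of the sketch (it only needs to be long enough for the ranks actually drawn).

    import numpy as np

    def sample_igm(sigma_inv, theta, t, rng=None):
        """Draw one top-t ordering from the IGM model of equation (3).

        sigma_inv : central ordering sigma^{-1} as a (long) list of items,
                    a finite proxy for the infinite list
        theta     : sequence of at least t positive dispersion parameters
        t         : length of the ordering to generate
        """
        rng = np.random.default_rng() if rng is None else rng
        available = list(sigma_inv)        # items still unranked, in sigma-order
        out = []
        for j in range(t):
            # s_j on {0,1,2,...} with P(s) proportional to exp(-theta_j * s)
            p = 1.0 - np.exp(-theta[j])
            s_j = rng.geometric(p) - 1
            out.append(available.pop(s_j)) # take the (s_j+1)-th remaining item
        return out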
4 Estimating the model from data

We are given a set D of N top-t orderings. Each π^{-1} ∈ D can have a different length t_π; all π^{-1} are sampled independently from a P_{θ,σ} with unknown parameters. We propose to estimate θ, σ from this data in the ML paradigm. We start by rewriting the log-likelihood of the model in a way that uncovers a set of sufficient statistics; we then show how to estimate the model based on the sufficient statistics.

4.1 Sufficient statistics

The following result lets us understand the structure of the log-likelihood and is thus key to the discrete optimization over σ. For any square (infinite) matrix A, denote by L(A) the sum of the elements below the diagonal of A, i.e. in the lower triangle of A. Let L_σ(A) = L(σ^T A σ), and let 1 denote the infinite vector of all 1's. For any π^{-1} ∈ D let t_π be its length, and denote t_max = max_{π ∈ D} t_π and T = Σ_{π ∈ D} t_π.

Theorem 1

    ln P_{θ,σ}(D) = − Σ_j [ θ_j L_σ(R_j) + N_j ln ψ(θ_j) ]      (4)

where R_j = q_j 1^T − Q_j; N_j is the number of π^{-1} ∈ D that have length t_π ≥ j (in other words, that contain rank j); q_j = [q_{i,j}]_{i∈ℕ}, with q_{i,j} the number of times item i is observed at rank j in the data D; and Q_j = [Q_{ii',j}]_{i,i'∈ℕ} is a matrix whose element Q_{ii',j} counts how many times π(i) = j and π(i') < j.

Proof By the definition of the code, s_j(σ^T π) counts the items i' that are not among the first j − 1 items of π^{-1} but are ranked by σ before the item π^{-1}(j). Summing over the data, every ordered pair of items (i, i') with σ(i') < σ(i) contributes q_{i,j} − Q_{ii',j} = (R_j)_{ii'} such counts, so that

    Σ_{π ∈ D, t_π ≥ j} s_j(σ^T π) = Σ_{i,i' : σ(i') < σ(i)} (R_j)_{ii'} = L_σ(R_j).

Summing the negative log-probabilities (3) over the data and over j = 1 : t_max, we can write the log-likelihood as in (4).

Corollary 2 If θ_1 = θ_2 = ... = θ, then the log-likelihood of the data D can be written as

    ln P_{θ,σ}(D) = − θ L_σ(R) − T ln ψ(θ)      (5)

with R = q 1^T − Q, q = Σ_j q_j and Q = Σ_j Q_j. (Note that q_i = Σ_j q_{i,j} and Q_{ii'} = Σ_j Q_{ii',j} represent, respectively, the number of times item i is observed in the data and the number of times item i' precedes item i in the data.)

Theorem 1 and its Corollary 2 show that the infinite model P_{θ,σ} has sufficient statistics. As θ, σ are infinite, the sufficient statistics R_j, N_j (or R) are infinite too. However, for any finite data set these matrices and vectors contain only a finite number of non-zero entries. Another consequence is that the data only constrain a finite number of parameters of the model. The log-likelihood (4) depends only on the parameters θ_{1:t_max}; maximizing the likelihood determines θ_{1:t_max}, leaving the other θ_j undetermined. Similarly, if n is the number of distinct items observed in the data, then of σ we can estimate at most its restriction to the observed items, i.e. the restriction of σ to the set ∪_{π ∈ D} π^{-1}. In other words, the data determine a top-n (partial) ordering corresponding to σ_ML, leaving the rest undetermined. (That not even the restriction to the observed items is always completely determined can be seen from the following example: if the data consist of the two top-t orderings (a, b, c) and (a, b, d), then (a, b, c, d) and (a, b, d, c) are both ML estimates for σ^{-1}; hence it would be more accurate to say that the ML estimate of σ^{-1} is the partial ordering (a, b, {c, d}).) Note also that, unlike in the finite case, the sufficient statistics do not necessarily compress the data: they can take up more space than storing the permutations, and their size grows with N.
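The sufficient statistics of Theorem 1 are simple counting quantities, and for a finite data set only their restriction to the observed items is needed. The following sketch (ours, for illustration) computes these restrictions as dense arrays; items are assumed to be hashable, sortable labels.

    import numpy as np

    def sufficient_stats(data, items=None):
        """R_j and N_j of Theorem 1, restricted to the observed items.

        data  : list of top-t orderings (lists of items); lengths may differ
        items : optional list fixing the item indexing; defaults to all
                items observed in `data`
        Returns (R, N, items): R[j] is the n x n matrix R_{j+1}, and N[j]
        is the number of orderings that contain rank j+1.
        """
        if items is None:
            items = sorted({i for pi in data for i in pi})
        idx = {it: k for k, it in enumerate(items)}
        n, tmax = len(items), max(len(pi) for pi in data)
        q = np.zeros((tmax, n))        # q[j, i]   : item i observed at rank j+1
        Q = np.zeros((tmax, n, n))     # Q[j, i, i']: i at rank j+1, i' ranked before it
        N = np.zeros(tmax, dtype=int)
        for pi in data:
            for j, item in enumerate(pi):
                N[j] += 1
                q[j, idx[item]] += 1
                for prev in pi[:j]:
                    Q[j, idx[item], idx[prev]] += 1
        R = q[:, :, None] - Q          # R_j = q_j 1^T - Q_j
        return R, N, items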
4.2 ML estimation: the case of a single θ

We now turn to the practical estimation of θ and σ, starting with the case of equal θ_j, i.e. θ_1 = θ_2 = ... = θ. In this case, equation (5) shows that the estimation of θ and σ decouples. For any fixed σ, the likelihood (5) is maximized by

    θ = ln( 1 + T / L_σ(R) ).      (6)

In contrast to this simple formula, for the finite GM the likelihood has no analytic solution for θ [6]. The estimated value of θ increases when L_σ(R) decreases. In other words, if the lower triangle of σ^T R σ, which counts the "out of order" pairs, has very low counts, we conclude that the distribution is very concentrated and hence has a high θ.

Estimating σ_ML amounts to minimizing L_σ(R) w.r.t. σ, independently of the value of θ. The optimal σ according to Corollary 2 is the (partial) permutation that minimizes the lower triangular part of σ^T R σ. (If the optimum is a partial permutation σ̂^{-1}, then any permutation compatible with σ̂^{-1} is a minimizer of L_σ(R).) To find it we exploit an idea first introduced in [11], namely to search for σ^{-1} = (i_1, i_2, i_3, ...) in a stepwise fashion, starting from the top item i_1 and continuing down. Assuming for a moment that σ^{-1} = (i_1, i_2, i_3, ...) is known, the cost to be minimized, L_σ(R), can be decomposed columnwise as

    L_σ(R) = Σ_{l ≠ i_1} R_{l,i_1} + Σ_{l ≠ i_1,i_2} R_{l,i_2} + Σ_{l ≠ i_1,i_2,i_3} R_{l,i_3} + ...

where the number of non-trivial terms is one less than the dimension of R. It is on this decomposition that the search algorithm is based. If σ^{-1} is not known, a search algorithm can try every i_1 in turn, saving the partial sums, then for a chosen i_1 value try all i_2's that could follow it, and so on. This type of search is represented by a search tree whose nodes are candidate prefixes for σ^{-1}. For our problem, the search tree has n! terminal nodes, one for each possible ordering of the observed items, and finding the lowest-cost path through the tree is equivalent to minimizing L_σ(R).

Branch-and-bound (BB) algorithms [12] explore the tree nodes in a way that guarantees that the optimum is found, even though the algorithm may not visit all the nodes in the tree. The number of nodes explored in the search for σ^{-1} depends on the sufficient statistics matrix R. In the worst case, the number of nodes searched can be a significant fraction of n!, and as such intractable for all but small n. However, if the data distribution is concentrated around a mode, then the search becomes tractable. We call the BB algorithm for estimating σ BBoundR. The algorithm's main steps, characteristic of a BB algorithm, are given in Figure 1.

    Algorithm BBoundR
    Input: matrix R
    The algorithm maintains a priority queue Q storing search tree nodes.
    For each node ν we store: the path to the node (the prefix r_{1:j}),
    the cost C of this path, a lower bound A on the cost-to-go, and the
    sum T = C + A. Nodes are prioritized by T.
    While Q is not empty:
      1. Extract ν = min(Q)
      2. If j(ν) = n − 1, output ν and stop.
      3. Else, for k = 1 : n − j:
         (a) create a child ν' of ν
         (b) evaluate C(ν'), A(ν'), T(ν')
         (c) enqueue ν' in Q

    Figure 1: Algorithm BBoundR.

In addition to the exact BBoundR algorithm, various heuristic search techniques can be used. Two of them that showed good performance for the standard GM model are the greedy (depth-first) search and the heuristic of [7]; in the latter, one obtains σ by sorting the column sums r_l = Σ_k R_{kl} in increasing order. (Described here is a simplified version of the SortRowsR heuristic; the algorithm as proposed by [7] also performs a limited search around the obtained permutation.)

In conclusion, to estimate the parameters from data in the single parameter case, one first computes the sufficient statistics, then a prefix of σ^{-1} is estimated by BBoundR or by heuristic methods, and finally, with the obtained ordering of the observed items, one computes the estimate of θ.
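The whole single-θ procedure fits in a few lines on top of the sufficient statistics. The sketch below (our illustration) uses the sorting heuristic of [7] in place of the exact BBoundR search and then applies the closed form (6); it reuses sufficient_stats from the previous sketch.

    import numpy as np

    def L_sigma(R, ordering, idx):
        """L_sigma(R): sum of R[i, i'] over ordered pairs in which item i'
        precedes item i under `ordering` (lower triangle of sigma^T R sigma)."""
        total = 0.0
        for b in range(len(ordering)):
            for a in range(b):
                total += R[idx[ordering[b]], idx[ordering[a]]]
        return total

    def estimate_single_theta(data):
        """Heuristic sigma^{-1} (SortRowsR-style), then theta from equation (6)."""
        Rj, N, items = sufficient_stats(data)
        idx = {it: k for k, it in enumerate(items)}
        R = Rj.sum(axis=0)                    # pooled R = q 1^T - Q (Corollary 2)
        r = R.sum(axis=0)                     # column sums r_l = sum_k R[k, l]
        sigma_inv = [it for _, it in sorted(zip(r, items))]
        L = L_sigma(R, sigma_inv, idx)
        T = int(N.sum())                      # total number of observed ranks
        theta = np.log(1.0 + T / L) if L > 0 else np.inf   # equation (6)
        return sigma_inv, theta

A BBoundR implementation would replace the sorting step with a best-first search over prefixes, using the columnwise decomposition above for the path cost C and an admissible lower bound for A.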
4.3 ML estimation: the case of general θ

Maximizing the likelihood of the data D is equivalent, by Theorem 1, to minimizing

    J(θ, σ) = Σ_j [ θ_j L_σ(R_j) + N_j ln ψ(θ_j) ].      (7)

This estimation equation does not decouple w.r.t. θ and σ. Minimization is nevertheless possible, due to two observations. First, for any fixed set of θ_j values, minimization w.r.t. σ is possible by the algorithms described in the previous section. Second, for fixed σ, the optimal θ_j parameters can be found analytically by θ_j = ln(1 + N_j / L_σ(R_j)). The two observations immediately suggest an alternating minimization approach to minimizing J. The algorithm is given in Figure 2. As both steps increase the likelihood, the algorithm stops after a finite number of steps at a local optimum.

    Algorithm EstimateSigmaTheta
    Input: sufficient statistics R_j, N_j, j = 1 : t_max;
           initial parameter values θ_{1:t_max} > 0
    Iterate until convergence:
      (a) calculate R_θ = Σ_j θ_j R_j
      (b) find the partial ordering σ^{-1} = argmin_σ L_σ(R_θ) by BBoundR
      (c) estimate θ_j = ln[1 + N_j / L_σ(R_j)] for j = 1 : t_max
    Output: σ^{-1}, θ_{1:t_max}

    Figure 2: Algorithm EstimateSigmaTheta.
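A compact version of Figure 2, again using the sorting heuristic as a stand-in for BBoundR in step (b) and reusing sufficient_stats and L_sigma from the earlier sketches (our illustration, not the reference implementation):

    import numpy as np

    def estimate_sigma_theta(data, theta0=0.1, max_iter=50):
        """Alternating minimization of J(theta, sigma) as in Figure 2."""
        Rj, N, items = sufficient_stats(data)
        idx = {it: k for k, it in enumerate(items)}
        tmax = len(N)
        theta = np.full(tmax, float(theta0))
        sigma_inv = list(items)
        for _ in range(max_iter):
            # (a) weighted pooled statistic R_theta = sum_j theta_j R_j
            R = np.tensordot(theta, Rj, axes=1)
            # (b) ordering minimizing L_sigma(R_theta), heuristically
            r = R.sum(axis=0)
            sigma_inv = [it for _, it in sorted(zip(r, items))]
            # (c) closed-form update theta_j = ln(1 + N_j / L_sigma(R_j))
            new_theta = np.array([
                np.log(1.0 + N[j] / max(L_sigma(Rj[j], sigma_inv, idx), 1e-12))
                for j in range(tmax)])
            if np.allclose(new_theta, theta):
                break
            theta = new_theta
        return sigma_inv, theta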
5 Clustering

Having defined a distance and a method for estimating ML parameters gives one access to a large number of existing clustering paradigms originally defined for Euclidean data. For instance, the extension of the K-means and EM algorithms to infinite orderings is immediate, and so are extensions of other distance-based clustering methods. In addition, we introduce a nonparametric clustering method, the Exponential Blurring Mean-Shift (EBMS). Nonparametric clustering is motivated by the fact that in many real applications the number of clusters is unknown and outliers exist.

We adapt for ranked data the well known Gaussian blurring mean-shift algorithm [2]. We choose the exponential kernel K(π, π') = e^{−θ̂ d_K(π, π')}, where the scale θ̂ is set from the data in step 4 below. As in blurring mean-shift, each data point is repeatedly replaced by a kernel-weighted mean of the data set, here a central ordering estimated from the kernel-weighted sufficient statistics, until the points coalesce into a small number of distinct orderings that are returned as cluster centers.

    Algorithm EBMS
    Input: top-t orderings D = {π_i^{-1}}_{i=1:N}, all of the same length t
    1. Count the distinct permutations to obtain the reduced set D̃ and
       counts ñ_i for each ordering π_i^{-1} ∈ D̃.
    2. For each π_i^{-1} ∈ D̃ compute q_i, Q_i, R_i, the sufficient
       statistics of a single data point.
    3. For π_i^{-1}, π_j^{-1} ∈ D̃ calculate the Kendall distance
       d_ij = d_K(π_i^{-1}, π_j^{-1}).
    4. Set the scale θ̂ by solving

         t e^{−θ̂}/(1 − e^{−θ̂}) − Σ_{j=1}^{t} j e^{−j θ̂}/(1 − e^{−j θ̂})
             = (2 / (N(N − 1))) Σ_{i<j} ñ_i ñ_j d_ij,

       which matches an expected Kendall distance under the model to the
       average pairwise distance in the data.
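The blurring update itself is only summarized above; the sketch below is one possible realization under our own assumptions, not the paper's exact procedure: every point is replaced by the (heuristically estimated) consensus of all points weighted by the exponential kernel, and clusters are read off as groups of identical blurred points. The caller supplies kendall(a, b), a Kendall-type distance between two top-t orderings, and theta_hat, the scale from step 4; the de-duplication of step 1 is omitted for brevity, and sufficient_stats is the helper sketched in Section 4.1.

    import numpy as np

    def ebms(data, kendall, theta_hat, max_iter=30):
        """Exponential Blurring Mean-Shift (sketch). Returns a cluster label per point."""
        pts = [list(pi) for pi in data]
        t = len(pts[0])
        for _ in range(max_iter):
            items = sorted({i for pi in pts for i in pi})
            # pooled single-point sufficient statistics on a shared item index
            Rs = np.stack([sufficient_stats([pi], items)[0].sum(axis=0) for pi in pts])
            new_pts = []
            for pi in pts:
                # exponential kernel weights w_j = exp(-theta_hat * d_K(pi, pi_j))
                w = np.array([np.exp(-theta_hat * kendall(pi, pj)) for pj in pts])
                R = np.tensordot(w, Rs, axes=1)       # weighted "neighborhood" statistic
                r = R.sum(axis=0)                     # heuristic consensus ordering
                ordering = [it for _, it in sorted(zip(r, items))]
                new_pts.append(ordering[:t])          # blur the point onto the consensus
            if new_pts == pts:                        # points stopped moving
                break
            pts = new_pts
        # clusters = groups of identical blurred points
        labels = {p: k for k, p in enumerate(dict.fromkeys(map(tuple, pts)))}
        return [labels[tuple(p)] for p in pts]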
6 The conjugate prior

The existence of sufficient statistics also allows us to construct a conjugate prior for the pair (θ, σ).

Proposition 3 Let ν ≥ 0, let r_1 ≥ 0 be a (possibly infinite) vector, and let R^0_j, j = 2 : t, denote a set of possibly infinite matrices satisfying R^0_{ii',j} ≥ 0 for all i, i', j and 1^T R^0_j 1 = (j − 1) ν ≥ 0 for all j > 1. Denote Γ = {ν, r_1, R^0_{2:t}} and R^0_1 = r_1 1^T. Then the distribution

    P_Γ(θ, σ) ∝ exp( − Σ_{j=1}^{t} [ θ_j L_σ(R^0_j) + ν ln ψ(θ_j) ] )      (8)

is a conjugate prior for the model P_{θ,σ}(π^{-1}) in (3).

Proof Given observed orderings (π^{-1})_{1:N}, the posterior distribution of (θ, σ) is updated by

    P(θ, σ | Γ, (π^{-1})_{1:N}) ∝ exp( − Σ_{j=1}^{t} [ (L_σ(R^0_j) + L_σ(R_j)) θ_j + (N + ν) ln ψ(θ_j) ] )
                                = exp( − (N + ν) Σ_{j=1}^{t} [ θ_j L_σ( (R^0_j + R_j)/(N + ν) ) + ln ψ(θ_j) ] ).

If the hyperparameters ν, r_1, R^0_{2:t} satisfy the conditions of the proposition, then the new hyperparameters Γ' = { ν + N, (r_1 + q_1)/(ν + N), (R^0_j + R_j)/(ν + N), j = 2 : t } satisfy the same conditions.

The conjugate prior is defined in (8) only up to a normalization constant, which is not always computable in closed form. Another aspect of conjugacy is that one prefers the conjugate hyperparameters to represent expectations of the sufficient statistics under some P_{θ,σ}; the conditions in Proposition 3 are necessary, but not sufficient, to ensure this.

Proposition 4 Let P_Γ(θ, σ) be defined as in (8) and let S_j = L_σ(R^0_j + R_j). Then

    P(θ_j | σ, S_j) = Beta_{S_j, ν+N+1}(e^{−θ_j})      (9)

    P(σ | ·) ∝ Π_{j=1}^{t} Beta(S_j(σ), 1 + ν + N)      (10)

where Beta_{α,β} denotes the Beta distribution and Beta(x, y) denotes Euler's Beta function.

Proof sketch Replacing ψ(θ_j) with its value yields

    P(θ_j | σ) ∝ e^{−S_j θ_j} (1 − e^{−θ_j})^{N+ν}      (11)

from which the desired results follow.

We have thus shown that closed-form integration over the continuous parameters θ_j is possible. This result is entirely new: no analogous result, and no closed-form integration, is possible for the GM model with n finite. The exact summation over the discrete parameter σ poses much harder challenges, described in the full paper [10], and is still an open problem.
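Proposition 4 makes posterior simulation of the continuous parameters immediate; a minimal sketch (ours), assuming S_j > 0:

    import numpy as np

    def sample_theta_j(S_j, nu, N, rng=None):
        """Draw theta_j from the conditional (9): with x = exp(-theta_j),
        x ~ Beta(S_j, nu + N + 1), so theta_j = -ln(x)."""
        rng = np.random.default_rng() if rng is None else rng
        x = rng.beta(S_j, nu + N + 1)
        return -float(np.log(x))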
7 Experiments

7.1 Estimation experiments, single θ

In these experiments we generated data from an infinite GM model with constant θ_j = ln 2 and ln 4 and estimated the central permutation σ and the parameter θ. To illustrate the influence of t, t was constant over each data set. The results are summarized in Table 1.

Table 1: Mean and standard deviation of θ_ML for two values of the true θ, for different values of t and sample sizes N. Each estimation was replicated 25 or more times.

                          N = 200         N = 500         N = 2000
                          mean    std     mean    std     mean    std
    θ = 0.69   t = 2      0.68    0.04    0.68    0.03    0.68    0.024
               t = 4      0.67    0.03    0.69    0.02    0.69    0.01
               t = 8      0.68    0.02    0.69    0.01    0.69    0.007
    θ = 1.38   t = 2      1.34    0.13    1.37    0.09    1.37    0.04
               t = 4      1.40    0.06    1.38    0.05    1.38    0.03
               t = 8      1.37    0.03    1.38    0.03    1.38    0.01

Note the apparent lack of convergence of the σ_ML estimates. This is due to the fact that, as either N or t increases, the number of items ranked in σ^{-1}_ML increases. The least frequent items have very little support from the data and are the ones misranked. We have confirmed this by computing the distance between the true σ^{-1} and our estimate restricted to the first t ranks. This was always 0, with the exception of N = 200, θ = 0.69, t = 2, when it averaged 0.04 (2 cases in 50 runs). Even so, most ordering errors were no larger than 1.

We also note that the sufficient statistic R is an unbiased estimate of the expected R̃. Hence, for any fixed length t of σ^{-1}_ML, the σ^{-1} estimated from R should converge to the true σ^{-1} (see also [7]). The θ_ML based on the true σ^{-1} is also unbiased and asymptotically normal. Another peculiarity is the asymmetry of the error in θ_ML. Recall that by equation (6) θ is a decreasing function of L_σ(R). If the true σ^{-1} is not optimal for the given R, due to sample variance, then θ_ML will tend to overestimate θ; hence θ_ML is a biased estimate of θ. If, however, due to imperfect optimization, the estimated (σ^{-1})_ML is not optimal and has higher cost than σ^{-1}, then θ_ML will err towards underestimation.

7.2 Estimation experiments, general θ

We then generated data from an infinite GM model with θ_1 = ln 2 or ln 4 and θ_j = 2^{−(j−1)/2} θ_1 for j > 1. As before, t was fixed in each experiment at the values 2, 4, 8. As the estimation algorithm has local optima, we initialized the parameters multiple times. However, we observed that in all the experiments on artificial data the iterations converged to the same parameter values for all initializations.

Figure 4 shows the estimated values of θ_j for sample sizes N = 200 and 2000, for the case θ_1 = 0.69 (the more dispersed distribution) and t = 8.

[Figure 4: Example of θ_{1:t} parameter estimation; t = 8, sample sizes N = 200, 2000. Boxplots over 50 random samples, with j on the horizontal axis. The continuous line crossing the boxplots marks the true values of the parameters θ_{1:8} (exponential decay starting from θ_1 = 0.69).]

The results of the other experiments are similar, and they are presented in the full paper [10]. Qualitatively, the results are similar to those for a single θ, with the main difference stemming from the fact that, with decreasing θ_j values, the sampling distribution of the data is much more spread out, especially w.r.t. the lower ranks. Therefore, the bias in θ_j is more pronounced for larger j (and for smaller N and smaller t). For the same reason, the number of observed items n is much larger than before (hundreds vs. less than 20), and consequently the ordering errors w.r.t. all the observed items are also much larger. But if one considers only the top-t part of (σ^{-1})_ML, then the results are as good as those for a single θ, i.e. 2 errors in 50 runs for t = 2, θ = 0.69, N = 200 and zero errors in all other trials. We have experimented with t up to 22, estimating 22 θ_j parameters, with similar results.

The next experiment was conducted with the data collected by Cohen, Schapire and Singer for their paper [3]. The data consist of a list of 157 universities, the queries, and a set of 21 search engines, the "experts". Each search engine outputs a list of up to t_max = 30 URLs when queried with the name of a university. The data set also provides a "target" output for each query, which is the university's home page. Hence, we have 147 estimation problems (10 universities had no data), with sample size N ≤ 21 (as some experts return empty lists) and with variable length data ranging from t = 1 to t = 30. Figures 5a and 5b give a summary view of the number of samples for each rank, N_j, j = 1 : t, and respectively the total number of items n per query.

These values suggest that estimating a fully parameterized model with distinct θ_{1:30} may lead to overfitting, and therefore we estimate several parameterizations, all having the form θ^r = (θ_1, θ_2, ..., θ_{r−1}, θ_r, θ_r, ..., θ_r). In other words, ranks 1 : r − 1 have distinct parameters, while the remaining ranks share one parameter θ_r. We call θ_{1:r−1} the free parameters and θ_r the tied parameter. For r = 1 we have the single parameter model, and for r = t_max = 30 we have the fully parameterized model. Estimating a model with r parameters is done by a simple modification of the EstimateSigmaTheta algorithm, which is left to the reader. The estimation algorithm was started from the fixed value θ_j = 0.1 for all runs. The number of iterations to convergence was typically in the range 10-30.

In Figure 5 we give a synopsis of the values of the θ parameters under the different models.

[Figure 5: Universities data: mean and standard deviation of the number of samples per rank, over all queries (a); histogram of the number of distinct items observed (b); boxplots of the θ estimates over all queries for models with 3 and 30 parameters (c, d). The vertical axis of (c, d) is logarithmic, base 10, i.e. 0 corresponds to θ_j = 1 and −2 to θ_j = 0.01. For clarity, the distribution of the tied parameter is replicated for j = r : t_max. The black horizontal line marks the mean value of θ in the single parameter model.]

The single parameter model yields values in the range [0.007, 0.104], with the 10%, 50% and 90% quantiles being respectively 0.009, 0.018 and 0.032. The parameters are on average decreasing in all models, with the free parameters higher than the tied parameter for the remaining ranks. Notice also that for the models with fewer parameters the values of the free parameters tend to be higher than the corresponding values in models with more parameters; compare for instance the value of θ_1 in the two-parameter model with θ_1 in the 30-parameter model.

For each query and each model size, we computed the rank of the true university home page, i.e. the target, under the estimated central permutation σ_ML. Assuming the search engines are reasonably good, this rank is an indirect indicator of the goodness of a model. In addition, for each query we selected one model by BIC and calculated the target ranks for these models. The BIC selects predominantly the single parameter model (124 out of 147 cases). Table 2 gives the results.

Table 2: Mean and median rank of the target web page under each model size, and under the BIC-selected model. The rank is t_max + 1 = 31 if the target is not in the search engine's top-t ranking. The statistics are computed once over all 147 universities and once over a subset of 74 universities where the target is always ranked in the first 30; the subset is labeled "good".

    Model size          1     2     3     4     5     6     7     8     9     10    30    BIC
    Mean rank (good)    5.3   5.7   4.2   4.2   4.1   4.1   4.4   4.5   5.0   5.2   5.1   5.37
    Median rank (good)  3     3     1.5   2     2     2     2     2     2     3     3     2.5
    Mean rank (all)     16.5  16.1  15.4  15.5  15.5  15.6  15.8  15.7  15.9  16.0  16.0  16.2
    Median rank (all)   13    15    11    11    12    9     10    10    11    11    11    12

7.3 Clustering experiments

We sampled orderings from 3 clusters, each an infinite GM model with a single spread parameter θ, equal to 1.5, 1.0 and 0.7 respectively. The cluster centers are random permutations of infinitely many objects. A data set contains 150 samples from each cluster, plus 50 outliers. All top-t orderings had the same length t; we experimented with t = 4, 6, 8.

We ran the Exponential Blurring Mean-Shift, K-means, and EM model-based clustering algorithms 10 times on samples from this distribution. For EBMS, the scale parameter was estimated based on the average of pairwise distances. For the K-means and model-based algorithms, we experimented with different numbers of clusters and report the best classification error with respect to the true clustering. This puts these two algorithms at an advantage w.r.t. EBMS, but as Table 3 shows, even so the nonparametric algorithm achieves the best performance. Complete results are in the full paper [10].

Table 3: Classification errors: mean (standard deviation) over 10 random samples.

    t    EBMS              K-means           EM
    4    0.0030 (0.0001)   0.1014 (0.0038)   0.1008 (0.0025)
    6    0.0014 (0.0001)   0.0986 (0.0010)   0.1000 (0.0000)
    8    0.0002 (0.0001)   0.0972 (0.0010)   0.1000 (0.0000)

Note that the error rate in Table 3 is computed including the outliers, i.e. we compared a true clustering with 53 clusters (3 clusters and 50 singletons) to the clustering obtained when the algorithm converged. For EM and K-means the number of clusters associated with the lowest classification errors was between 3 and 5. From the table we see that K-means and the model-based approach identified the three primary clusters correctly. K-means did not have the ability to identify the outliers, so it simply assigned each outlier to one of the primary clusters. The model-based approach assigned outliers to primary clusters too, but it also expressed more uncertainty about the outliers (the probabilities of outliers belonging to their assigned cluster were relatively smaller than for data from the primary clusters). The running time per data set of EBMS was under a minute, and the number of iterations to convergence followed the pattern typical of mean-shift algorithms and was never larger than 10.
8 Related work and discussion

This work acknowledges its roots in the work of Fligner and Verducci on stagewise ordering models [6] and in the recent paper [11]. The latter shows for the first time that GM models have sufficient statistics, and describes an exact but non-polynomial algorithm to find the central permutation. While similarities exist between the algorithm of [11] and the BBoundR algorithm presented here, we stress that our representation (based on the codes s_j) is different from the representation (denoted V_j) in [11]; for any given permutation, s_j(π) = V_j(π^{-1}). While this difference is trivial for complete permutations, it is not so in the case of missing data. In particular, the distribution of V_j for top-t orderings does not seem to have sufficient statistics for j > 2, even in the case of finite permutations. The s_j representation has another advantage that V_j has not: for any finite data set, a parameter s_j is either completely determined or completely undetermined by the data, whereas in the reciprocal V_j representation all V_j are weakly constrained by the data. While both our BBoundR and the algorithm of [11] perform branch-and-bound search on a matrix of sufficient statistics, the sufficient statistics in the infinite case are derived by an entirely different method, and cannot be obtained by naively replacing the sufficient statistics of the finite case.

The paper [1] uses the s_j representation and outlines its advantages; it is also the first paper to do (EM) clustering of partial orderings, without however recognizing the existence of sufficient statistics. Another interesting application of the GM model to multimodal data is [9] (there the σ's play the role of the data), while a greedy algorithm for consensus ordering with partially observed data is introduced in [3]; [4] is an early work on (Hausdorff) distances for partial orderings. In [8] the authors also introduce an EM algorithm for clustering ranking data, for the purpose of analyzing Irish voting patterns. However, the base model used by [8] is not the Mallows model but a model known as Plackett-Luce. The estimation of this model from data is much more difficult and, as [8] show, can only be done approximately. All the above works deal with permutations on finite sets. In fairness to [3] we remark that this work, although it only considers heuristic methods for optimization and introduces a cost function which only later, in [11], is shown to be closely related to the log-likelihood, is motivated by the same problem as ours, i.e. dealing with a very large set of items, of which only some are ranked by the "voters".

The paper [13] studies the space of infinite permutations which differ from the identity in a finite number of positions. In the vocabulary of the present paper, these would be the infinite permutations at finite distance d_K from σ. In a single parameter infinite GM, these infinite permutations are the only ones which have non-zero probability. While from a probabilistic perspective the two views are equivalent, from a practical perspective they are not. We prefer to consider in our sample space all possible orderings, including those with vanishing probability; it is the latter that are more representative of real experiments. For instance, in the university web site ranking experiment, our model assumed that there is a "true" central permutation from which the observations were generated as random perturbations. This is already an idealization. But we also have the liberty to assume that the observations are very long orderings which are close to the central permutation only in their highest ranks, and which can diverge arbitrarily far from it in the lower ranks. We consider this a more faithful scenario than assuming, in addition, that the observations must be identical to the central permutation (and hence to each other!) on all but a finite number of ranks.

We have introduced the first, to our knowledge, stagewise ranking model for infinitely many items. The new probabilistic model has several attractive properties: it handles naturally truncated top-t orderings, it has sufficient statistics and, more importantly, it also has an exact estimation algorithm (albeit intractable in the worst case). As is known from the study of stochastic models of permutations over finite domains, exact estimation and interpretable parameters are very rare qualities in this field. Sampling, distance computations and clustering can be performed in this class of models in a natural way and are all tractable. We have paid particular attention to nonparametric clustering by mean-shift blurring, showing by experiments that the algorithm is practical and effective.

An issue not solved for GM models, finite or infinite, is sampling a pair (θ, σ) from the conjugate distribution. If this were feasible, one could perform clustering with the DP mixture model, a model-based clustering paradigm widely recognized for its advantages. It is our intention to work in this direction.

Acknowledgments

Thanks to Jon Wellner for bringing up the question of infinite permutations.

References

[1] L. M. Busse, P. Orbanz, and J. Buhmann. Cluster analysis of heterogeneous rank data. In Proceedings of the International Conference on Machine Learning (ICML), 2007.

[2] M. A. Carreira-Perpiñán. Gaussian mean shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767-776, 2007.

[3] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243-270, 1999.
[4] D. E. Critchlow. Metric Methods for Analyzing Partially Ranked Data. Number 34 in Lecture Notes in Statistics. Springer-Verlag, Berlin Heidelberg New York Tokyo, 1985.

[5] M. H. DeGroot. Probability and Statistics. Addison-Wesley, Reading, MA, 1975.

[6] M. A. Fligner and J. S. Verducci. Distance based ranking models. Journal of the Royal Statistical Society B, 48:359-369, 1986.

[7] M. A. Fligner and J. S. Verducci. Multistage ranking models. Journal of the American Statistical Association, 88, 1988.

[8] I. Gormley and T. Murphy. Exploring heterogeneity in Irish voting data: A mixture modelling approach. Technical Report 05/09, Department of Statistics, Trinity College Dublin, 2005.

[9] G. Lebanon and J. Lafferty. Conditional models on the ranking poset. In Advances in Neural Information Processing Systems 15, Cambridge, MA, 2003. MIT Press.

[10] M. Meilă and L. Bao. Estimation and clustering with infinite rankings. Technical Report 529, Department of Statistics, University of Washington, 2008.

[11] M. Meilă, K. Phadnis, A. Patterson, and J. Bilmes. Consensus ranking under the exponential model. In R. Parr and L. van der Gaag, editors, Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.

[12] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. Wiley, 1999.

[13] E. Thoma. Mathematische Zeitschrift, 1964.