Inferring rankings under constrained sensing

Srikanth Jagabathula, Devavrat Shah
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139. {jskanth, devavrat}@mit.edu

Abstract

Motivated by applications such as elections, web-page ranking, and revenue maximization, we consider the question of inferring popular rankings from constrained data. More specifically, we consider the problem of inferring a probability distribution over the group of permutations using its first order marginals. We first prove that it is not possible to recover more than $O(n)$ permutations over $n$ elements with the given information. We then provide a simple and novel algorithm that can recover up to $O(n)$ permutations under a natural stochastic model; in this sense, the algorithm is optimal. In certain applications, the interest is in recovering only the most popular (or mode) ranking. As a second result, we provide an algorithm based on the Fourier transform over the symmetric group to recover the mode under a natural majority condition; the algorithm turns out to be a maximum weight matching on an appropriately defined weighted bipartite graph. The questions considered are also thematically related to Fourier transforms over the symmetric group and the currently popular topic of compressed sensing.

1 Introduction

We consider the question of determining a real-valued function on the space of permutations of $n$ elements from very limited observations. Such a question naturally arises in many applications, including efficient web-page rank aggregation, choosing the winner of a sports season, setting odds in gambling for revenue maximization, and estimating the popularity of candidates before an election (see, for example, [1], [2], [3]). In what follows, we give a motivating example for this question.

A motivating example. Consider a pre-election scenario in a democratic country with $n$ potential candidates. Each person (or voter) has a certain ranking of these candidates in mind (consciously or sub-consciously). For example, let $n = 3$ and the candidates be A, B and C. Each person believes in one of the $3! = 6$ possible orderings of these candidates. For example, let 50% of people believe in A > B > C, 30% of people believe in B > A > C, and 20% of people believe in C > A > B. We wish to infer these preferences of the population by means of a limited set of questions. Specifically, suppose we can interview a representative collection (i.e., a reasonably large random collection) of people for this purpose. However, in the interview we may not be able to ask them for their complete ranking of all candidates. This may be because a person cannot articulate it clearly. Or, in situations (e.g., gambling) where financial significance is attached to the complete ranking, an individual may not be willing to provide that information. In such a situation, we have to settle for restricted questions of the following type: what will be the rank of candidate A in your opinion? Or, whom would you rank second? Given answers to such restricted questions, we would like to infer what fraction of the population prefers which ordering of the candidates. Clearly, such restricted information cannot lead to any useful inference of the prevalent orderings of candidates in the population if there are too many of them (for large $n$).
Now, in a real-world scenario, it is likely that people decide rankings of candidates based on a few issues, such as war, abortion, the economy, and gay marriage. That is, an individual decides the ranking of the candidates based on the candidates' positions on these issues. Therefore, irrespective of the number of candidates, the number of distinct rankings that prevail in the population is likely to be very small. In this paper, we are interested in inferring such few prevalent rankings of candidates, and their popularity, based on the restricted (or partial) information explained above.

Thematically, this question is similar to the pursuit of compressed sensing. However, as we explain in Section 2, standard compressed sensing does not apply in this setting. We also discuss a natural relation between the available information and the Fourier coefficients of the Fourier transform based on group representations (see Proposition 1). It turns out that the problem we consider is equivalent to recovering a function over the symmetric group from its first order Fourier coefficients. Thus, our problem is thematically related to the recovery of functions over non-commutative groups from a limited set of Fourier coefficients. As we show in Section 2, a naive recovery that sets the unknown Fourier coefficients to zero yields a very poor result; hence, our approach is potentially applicable to obtaining a better recovery.

In many applications, one is specifically interested in finding the most popular ranking (or mode) rather than all the prevalent rankings. For this, we consider an approximation based on the Fourier transform as a surrogate for finding the mode. We establish that, under a natural majority condition, our algorithm finds the correct mode (see Theorem 2). Interestingly, our algorithm for estimating the mode corresponds to finding a maximum weight matching in a weighted bipartite graph on $n$ nodes.

Organization. We describe the setup, the problem statement, and the relation to compressed sensing and Fourier transform based approaches in Section 2. In Section 3, we provide precise statements of the main results. In the remaining sections, we prove these results and describe the relevant algorithms.

2 Background and preliminaries

Setup. Let $S_n = \{\sigma_1, \ldots, \sigma_N\}$ denote the set of all $N = n!$ permutations (orderings) of $n$ elements; $S_n$ is also known as the symmetric group of degree $n$. Let $f : S_n \to [0, 1]$ denote a mapping from the symmetric group to the interval $[0, 1]$. We assume that the function $f$ is normalized, i.e., $\|f\|_1 = 1$, where $\|\cdot\|_1$ denotes the $\ell_1$ norm. Let $p_k$ denote the value $f(\sigma_k)$, for $1 \le k \le N$. Without loss of generality, we assume that the permutations are labeled such that $p_k \ge p_m$ for $k < m$. We write $f(\cdot)$ to denote the function and $f$ to denote the vector $(f(\sigma_k))_{N \times 1}$. The set of permutations for which $f(\cdot)$ is non-zero will be called the support of $f(\cdot)$; the cardinality of the support will be called the sparsity of $f$ and is denoted by $K$, i.e., $K = \|f\|_0$. Each permutation $\sigma$ is represented by its corresponding permutation matrix $P^\sigma$, i.e., $P^\sigma_{ij} = \mathbf{1}_{\{\sigma(j) = i\}}$, where $\mathbf{1}_E$ is the indicator variable of the event $E$. For brevity, we write $P^\sigma$ to mean both the $n \times n$ matrix and the $n^2 \times 1$ vector, and we use the terms permutation and permutation matrix interchangeably. We think of permutations as complete matchings in a bipartite graph.
Specifically, we consider an $n \times n$ bipartite graph, and each permutation corresponds to a complete matching in the graph. The edges of a permutation will refer to the edges of the corresponding bipartite matching. For $1 \le i, j \le n$, let

$$q_{ij} := \sum_{\sigma \in S_n : \sigma(j) = i} f(\sigma). \qquad (1)$$

Let $Q$ denote both the matrix $(q_{ij})_{n \times n}$ and the vector $(q_{ij})_{n^2 \times 1}$. It is easy to see that $Q$ can be equivalently written as $\sum_{\sigma \in S_n} f(\sigma) P^\sigma$. From the definition, it also follows that $Q$ is a doubly stochastic matrix. The matrix $Q$ corresponds to the first order information about the function $f(\cdot)$. In the election example, $q_{ij}$ is the fraction of voters that rank candidate $j$ in the $i$th position.

Problem statement and result. The basic objective is to determine the values of the function $f(\cdot)$ precisely, using only the values of the matrix $Q$. We first prove, using information theoretic techniques, that recovery is asymptotically reliable (the average probability of error goes to zero as $n \to \infty$) only if $K = O(n)$. We then provide a novel algorithm that recovers the prevalent rankings and their popularity exactly under minimal (essentially necessary) conditions; under a natural stochastic model, this algorithm recovers up to $O(n)$ permutations. In this sense, our algorithm is optimal.

It is often the case that full knowledge of the functional values at all permutations is not required. Specifically, in scenarios such as ranked elections, the interest is in finding the most likely permutation, i.e., $\arg\max_{\sigma} f(\sigma)$. Theorem 2 proves that the max-weight matching yields the most likely permutation under a natural majority assumption.

2.1 Relation to Fourier Transform

The question we consider is thematically related to harmonic analysis of functions over non-commutative groups. As we shall show shortly, the matrix $Q$ is related to the first two Fourier coefficients of the Fourier transform of the distribution over the permutation group. Thus, the problem we are considering can be restated as that of reconstructing a distribution over the permutation group from its first two Fourier coefficients. Reconstructing distributions over the permutation group from a limited number of Fourier coefficients has several applications. Specifically, there has been recent work on multi-object tracking (see [4] and [3]) that approaches the daunting task of maintaining a distribution over the permutation group by approximating it using the first few Fourier coefficients. This requires reconstructing the function from a limited number of Fourier coefficients, where our solution can potentially be applied.

We now discuss the Fourier transform of a function on the permutation group, which provides another possible approach for recovering $f$. Interestingly, the first order Fourier transform of $f$ can be constructed from the information $Q = (q_{ij})$. As we shall see, this approach fails to recover a sparse $f$, as it has a tendency to "spread" the mass over all $n!$ elements given $Q$. However, as established in Theorem 2, it does lead to recovery of the mode, or most likely assignment, of $f$ under a natural majority condition. Next, we give some details of the Fourier transform based approach (an interested reader is referred to [5] for missing details), how $Q$ can be used to obtain an approximation of $f$, and why it does not recover $f$ exactly. The details relevant to recovery of the mode of $f$ accompany Theorem 2.
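Before turning to the Fourier view, it helps to see the first order information concretely. The following minimal sketch builds $Q = \sum_\sigma f(\sigma) P^\sigma$ for the election example from the introduction; the helper names (perm_matrix, first_order_marginals) are our own, not the paper's.

```python
import numpy as np

def perm_matrix(sigma, n):
    """Permutation matrix with P[i, j] = 1 iff sigma(j) = i (0-indexed)."""
    P = np.zeros((n, n))
    for j in range(n):
        P[sigma[j], j] = 1.0
    return P

def first_order_marginals(f, n):
    """Q = sum_sigma f(sigma) * P_sigma, for a dict mapping permutation -> weight."""
    Q = np.zeros((n, n))
    for sigma, p in f.items():
        Q += p * perm_matrix(sigma, n)
    return Q

# Election example: candidates A, B, C encoded as 0, 1, 2;
# sigma[j] is the rank position of candidate j.
f = {(0, 1, 2): 0.5,   # 50% believe A > B > C
     (1, 0, 2): 0.3,   # 30% believe B > A > C
     (1, 2, 0): 0.2}   # 20% believe C > A > B
Q = first_order_marginals(f, 3)
print(Q)  # q[i][j] = fraction of voters ranking candidate j in position i
assert np.allclose(Q.sum(axis=0), 1) and np.allclose(Q.sum(axis=1), 1)  # doubly stochastic
```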
Fourier Transform: Definition. We can obtain a solution to the set of linear equations in (8) below using Fourier transforms at representations of the symmetric group. For a function $h : G \to \mathbb{R}$ on a group $G$, its Fourier transform at a representation $\rho$ of $G$ is defined as

$$\hat{h}_\rho = \sum_{\sigma \in G} h(\sigma)\, \rho(\sigma).$$

The collection of Fourier transforms of $h(\cdot)$ at a complete set of inequivalent irreducible representations of $G$ completely determines the function. This follows from the expression for the inverse Fourier transform:

$$h(\sigma) = \frac{1}{|G|} \sum_k d_k \operatorname{Tr}\!\left( \hat{h}_{\rho_k}^T \rho_k(\sigma) \right), \qquad (2)$$

where $|G|$ denotes the cardinality of $G$, $d_k$ denotes the degree of the representation $\rho_k$, and $k$ indexes the complete set of inequivalent irreducible representations of $G$. The trivial representation of a group is the 1-dimensional representation $\rho_0(\sigma) = 1$ for all $\sigma \in G$; therefore, the Fourier transform of $h(\cdot)$ at $\rho_0$ is $\sum_\sigma h(\sigma)$.

Fourier Transform: Approximation. The above naturally suggests an approximation based on a limited number of Fourier coefficients with respect to a certain subset of irreducible representations. We will show that the information matrix $Q$ indeed corresponds to the Fourier coefficient with respect to the first order representation of the symmetric group $S_n$; therefore, it yields a natural approximation. It is known [5] that the first order permutation representation of $S_n$, denoted by $\nu_1$, has degree $n$ and maps every permutation $\sigma$ to its corresponding permutation matrix; in other words, $\nu_1(\sigma) = P^\sigma$. Thus, $\hat{f}_{\nu_1} = \sum_{\sigma \in S_n} f(\sigma) \nu_1(\sigma) = Q$.

Reconstruction of $f$ requires Fourier transforms at irreducible representations. Even though $\nu_1$ is not an irreducible representation, it is known [5] that every representation of a group is equivalent to a direct sum of irreducible representations. In particular, $\nu_1$ can be decomposed as $\nu_1 = \rho_0 \oplus \rho_1$, where $\rho_0$ is the aforementioned trivial representation of degree 1 and $\rho_1$ is an irreducible representation of degree $n - 1$. It is worth pointing out to a familiar reader that what we call $\rho_1$ is more appropriately denoted by $\rho_{(n-1,1)}$ in the literature; we stick to $\rho_1$ for brevity. Thus, $Q$ is related to the Fourier transforms at the irreducible representations $\rho_0$ and $\rho_1$. We now have the following proposition.

Proposition 1. Consider a function $f : S_n \to \mathbb{R}$. Suppose that $\|f\|_1 = 1$ and we are given the corresponding $Q$. Then the natural Fourier approximation, obtained from the Fourier coefficients at the relevant irreducible representations, is the function $\tilde{f} : S_n \to \mathbb{R}$ defined as

$$\tilde{f}(\sigma) = \frac{(n-1)\langle Q, P^\sigma \rangle - (n-2)}{N} \qquad (3)$$

for $\sigma \in S_n$, with $N = n!$, $\sum_{\sigma \in S_n} \tilde{f}(\sigma) = \|f\|_1 = 1$ and $\sum_{\sigma \in S_n} \tilde{f}(\sigma) P^\sigma = Q$.

Proof. We have

$$Q = \sum_{\sigma \in S_n} f(\sigma)\, \nu_1(\sigma) = \sum_{\sigma \in S_n} f(\sigma)\, (\rho_0 \oplus \rho_1)(\sigma) = \hat{f}_{\rho_0} \oplus \hat{f}_{\rho_1}. \qquad (4)$$

Therefore,

$$\langle Q, P^\sigma \rangle = \operatorname{Tr}\!\left( Q^T P^\sigma \right) = \operatorname{Tr}\!\left( (\hat{f}_{\rho_0} \oplus \hat{f}_{\rho_1})^T (\rho_0(\sigma) \oplus \rho_1(\sigma)) \right). \qquad (5)$$

Since the trace is independent of the basis, choosing an appropriate basis we can write

$$\langle Q, P^\sigma \rangle = \operatorname{Tr}\!\left( \hat{f}_{\rho_0}^T \rho_0(\sigma) \right) + \operatorname{Tr}\!\left( \hat{f}_{\rho_1}^T \rho_1(\sigma) \right) = 1 + \operatorname{Tr}\!\left( \hat{f}_{\rho_1}^T \rho_1(\sigma) \right), \qquad (6)$$

where the last equality holds because $\rho_0(\sigma) = 1$ for all $\sigma \in S_n$ and $\|f\|_1 = 1$. The approximation $\tilde{f}$ is obtained by truncating the inverse Fourier transform expression (2) to the first two terms:

$$\tilde{f}(\sigma) = \frac{1}{N}\left( \hat{f}_{\rho_0}^T \rho_0(\sigma) + (n-1)\operatorname{Tr}\!\left( \hat{f}_{\rho_1}^T \rho_1(\sigma) \right) \right). \qquad (7)$$

Using the facts that $\rho_0(\sigma) = 1$ for all $\sigma \in S_n$ and $\hat{f}_{\rho_0} = 1$, and plugging (6) into (7), gives the result of the proposition.

Summary. Thus, the Fourier transform technique yields a solution to the problem. Unfortunately, the solution is not sparse: the "mass" is distributed over all the permutations, yielding values of $O(1/N)$ for all permutations. In summary, a naive approach to the reconstruction of a sparse distribution gives unsatisfactory results, and a different approach is required.
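The spreading effect is easy to observe numerically. The sketch below (reusing the illustrative perm_matrix helper from before) implements equation (3) directly: even for a 1-sparse $f$ with all mass on the identity (so $Q$ is the identity matrix), the approximation $\tilde{f}$ assigns the mode only a fraction of the mass and spreads the rest, some of it negatively, over all $N$ permutations.

```python
import itertools
import math
import numpy as np

def perm_matrix(sigma, n):
    P = np.zeros((n, n))
    for j in range(n):
        P[sigma[j], j] = 1.0
    return P

def fourier_approximation(Q):
    """f_tilde(sigma) = ((n-1) <Q, P_sigma> - (n-2)) / n!  -- equation (3)."""
    n = Q.shape[0]
    N = math.factorial(n)
    return {sigma: ((n - 1) * np.sum(Q * perm_matrix(sigma, n)) - (n - 2)) / N
            for sigma in itertools.permutations(range(n))}

n = 4
Q = np.eye(n)                        # f is the indicator of the identity permutation
f_tilde = fourier_approximation(Q)
print(sum(f_tilde.values()))         # 1.0: f_tilde sums to one, like f
print(f_tilde[tuple(range(n))])      # ~0.417: the true mode carries far less than 1 ...
print(min(f_tilde.values()))         # ~-0.083: ... and mass spreads, even negatively
```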
2.2 Relation to Compressed Sensing

Here we discuss the relation of the question stated above to the recently popular topic of compressed sensing. Indeed, the two share the common ultimate goal of recovering a sparse function (or vector) from a few samples. However, as we shall show, the setup of our work is quite different. This is primarily because in the standard compressed sensing setup the samples are chosen as "random projections," while here the samples are highly constrained and provide the information matrix $Q$. Next, we provide the details.

Our problem can be formulated as a solution to a set of linear equations by defining $A$ as the $n^2 \times N$ matrix whose columns are the vectors $P^{\sigma_k}$, $1 \le k \le N$. Then $f$ is a solution to the following set of linear equations:

$$Ax = Q. \qquad (8)$$

Candes and Tao [6] provide an approach to solving this problem. They require the vector $f$ to be sparse, i.e., $\|f\|_0 = \delta N$ for some small $\delta > 0$. As discussed earlier, this is a reasonable assumption in our case because: (a) the total number of permutations $N$ can be very large even for a reasonably sized $n$, and (b) most functions $f(\cdot)$ that arise in practice are determined by a small (compared to $N$) number of parameters. Under a restriction on the isometry constants of the matrix $A$, Candes and Tao prove that the solution $f$ is the unique minimizer of the linear program

$$\min \|x\|_1 \quad \text{s.t.} \quad Ax = Q. \qquad (9)$$

Unfortunately, the approach of Candes and Tao cannot be directly applied to our problem, because the isometry constants of the matrix $A$ do not satisfy the required conditions. We now take a closer look at the isometry constants of $A$. Gaussian random matrices form an important class of matrices with good isometry constants. Unfortunately, our matrix $A$ is neither random nor does our problem admit a straightforward random reformulation. To see why the matrix $A$ has bad isometry constants, consider a simple example. For any $n \ge 4$, consider the following 4 permutations: $\sigma_1 = \mathrm{id}$, $\sigma_2 = (1\,2)$, $\sigma_3 = (3\,4)$ and $\sigma_4 = (1\,2)(3\,4)$, where $\mathrm{id}$ refers to the identity permutation and the permutations are written in cycle notation. It is easy to see that

$$P^{\sigma_1} + P^{\sigma_4} = P^{\sigma_2} + P^{\sigma_3}. \qquad (10)$$

For any integer $1 \le S \le N$, the $S$-restricted isometry constant $\delta_S$ of $A$ is defined as the smallest quantity such that

$$(1 - \delta_S)\|c\|_2^2 \le \|A_T c\|_2^2 \le (1 + \delta_S)\|c\|_2^2 \qquad (11)$$

for all subsets $T \subseteq \{1, 2, \ldots, N\}$ of cardinality at most $S$ and all real vectors $c$. Here, $A_T c$ denotes $\sum_{k \in T} c_k P^{\sigma_k}$. From this definition and (10), it follows that $\delta_S \ge 1$ for all $S \ge 4$. Theorem 1.4 of [6] requires $\delta_S < 1$ for perfect reconstruction of $f$ when $\|f\|_0 \le S$. Therefore, the compressed sensing approach of Candes and Tao does not guarantee unique reconstruction of $f$ when $\|f\|_0 \ge 4$.

3 Main results

Exact recovery. The main result of this paper concerns the exact recovery of $f$ from the given constrained information matrix $Q = (q_{ij})$, under the hypothesis that $f$ is sparse, i.e., has small $\|f\|_0$. We provide an algorithm that recovers $f$ exactly if the underlying support and probabilities satisfy the following two properties.

Property 1 (P1). Suppose the function $f(\cdot)$ is $K$-sparse, with function values $p_1, p_2, \ldots, p_K$. Then distinct subsets of the values have distinct sums; that is, for all $J, J' \subseteq \{1, 2, \ldots, K\}$,

$$J \ne J' \implies \sum_{j \in J} p_j \ne \sum_{j \in J'} p_j.$$

Property 2 (P2). Let $\{\sigma_1, \sigma_2, \ldots, \sigma_K\}$ be the support of $f(\cdot)$. For each $1 \le i \le K$, there exists $1 \le \ell_i \le n$ such that $\sigma_i(\ell_i) \ne \sigma_j(\ell_i)$ for all $j \ne i$. In other words, each permutation has at least one edge that is different from all the others.
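Both properties are mechanical to verify for a given support and set of values. The following sketch (brute force over all $2^K$ subset sums, with helper names of our choosing) checks P1 and P2 for a randomly generated instance, anticipating the random model discussed below.

```python
import itertools
import random

def has_P1(p, tol=1e-9):
    """P1: all subset sums of the values p are distinct (brute-force check)."""
    sums = sorted(sum(s) for r in range(len(p) + 1)
                  for s in itertools.combinations(p, r))
    return all(b - a > tol for a, b in zip(sums, sums[1:]))

def has_P2(perms):
    """P2: every permutation has at least one edge unique to it."""
    n, K = len(perms[0]), len(perms)
    return all(any(all(perms[i][l] != perms[j][l] for j in range(K) if j != i)
                   for l in range(n))
               for i in range(K))

random.seed(0)
n, K = 10, 6                                       # K = 0.6 n, the regime of Lemma 2 below
perms = [tuple(random.sample(range(n), n)) for _ in range(K)]
vals = [random.uniform(0.1, 1.0) for _ in range(K)]
vals = [v / sum(vals) for v in vals]               # normalize so ||f||_1 = 1
print(has_P1(vals), has_P2(perms))                 # typically (True, True) in this regime
```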
When properties P1 and P2 are satisfied, the equation $Q = Af$ has a unique sparse solution, and $f$ can indeed be recovered; we will provide an algorithm for this recovery. The following is the formal statement of this result; it is proved later.

Theorem 1. Consider a function $f : S_n \to [0, 1]$ such that $\|f\|_0 = L$, $\|f\|_1 = 1$, and the functional values and the support possess properties P1 and P2. Then the matrix $Q$ is sufficient to reconstruct $f(\cdot)$ precisely.

Random model, sparsity and Theorem 1. Theorem 1 asserts that when properties P1 and P2 are satisfied, exact recovery is possible. However, it is not immediately clear why these properties are reasonable. We now provide some motivation, and prove that the algorithm is indeed optimal in terms of the maximum sparsity it can recover. Let us go back to the counterexample mentioned before: for any $n \ge 4$, consider the 4 permutations $\sigma_1 = \mathrm{id}$, $\sigma_2 = (1\,2)$, $\sigma_3 = (3\,4)$ and $\sigma_4 = (1\,2)(3\,4)$, for which $P^{\sigma_1} + P^{\sigma_4} = P^{\sigma_2} + P^{\sigma_3}$. Now, consider 4 values $p_1, p_2, p_3$ and $p_4$. Without loss of generality, suppose that $p_1 \le p_4$ and $p_2 \le p_3$. Using the identity $P^{\sigma_1} + P^{\sigma_4} = P^{\sigma_2} + P^{\sigma_3}$, we can write

$$
\begin{aligned}
Q &= p_1 P^{\sigma_1} + p_2 P^{\sigma_2} + p_3 P^{\sigma_3} + p_4 P^{\sigma_4} \\
  &= (p_1 + p_2) P^{\sigma_1} + (p_3 - p_2) P^{\sigma_3} + (p_2 + p_4) P^{\sigma_4} \\
  &= (p_1 + p_2) P^{\sigma_2} + (p_1 + p_3) P^{\sigma_3} + (p_4 - p_1) P^{\sigma_4}.
\end{aligned}
$$

Thus, under the above setup, there is no unique solution to $Q = Af$. In addition, from the last two equalities we conclude that even the sparsest solution is not unique. Hence, there is no hope of recovering $f$ given only $Q$ in this setup.

The question we now ask is whether the above counterexample is contrived and specially constructed, or whether such situations are prevalent. For that, we consider a random model that puts a uniform measure on all permutations. The hope is that, under this model, situations like the counterexample occur with vanishing probability. We now describe the random model and then state important results on the sparsity of $f$ that can be recovered from $Q$.

Random Model. Under the random model, the function $f$ with sparsity $K$ is constructed as follows: choose $K$ permutations uniformly at random, and let them have non-trivial real functional values chosen uniformly at random from a bounded interval and then normalized.

We call an algorithm producing an estimate $\hat{f}$ of $f$ asymptotically reliable if $\Pr\big(\hat{f} \ne f\big) \le \varepsilon(n)$, where $\varepsilon(n) \to 0$ as $n \to \infty$. We now have the following two important results.

Lemma 1. Consider a function $f : S_n \to \mathbb{R}$ with sparsity $K$. Given the matrix $Q = Af$, and no additional information, recovery can be asymptotically reliable only if $K \le 4n$.

First, note that a trivial bound of $(n-1)^2$ can be readily obtained as follows: since $Q$ is doubly stochastic, it can be written as a convex combination of permutation matrices [7], which form a space of dimension $(n-1)^2$. Lemma 1 says that this bound is loose. It can be proved using standard arguments in information theory by considering $A$ as a channel with input $f$ and output $Q$.

Lemma 2. Consider a function $f : S_n \to \mathbb{R}$ with sparsity $K$ constructed according to the random model described above. Then the support and functional values of $f$ possess properties P1 and P2 with probability $1 - o(1)$ as long as $K \le 0.6n$.

It follows from Lemma 2 and Theorem 1 that $f$ can be recovered exactly from $Q$ if the sparsity $K = O(n)$. Coupled with Lemma 1, we conclude that our algorithm is optimal in the sense that it achieves the sparsity bound of $O(n)$.
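The non-uniqueness argument above is easy to confirm numerically. The sketch below (with the same illustrative perm_matrix helper as before, and arbitrary values satisfying $p_1 \le p_4$, $p_2 \le p_3$ of our choosing) verifies identity (10) and shows that two different 3-sparse decompositions reproduce the same $Q$; the linear dependence is also the witness for $\delta_S \ge 1$ in (11).

```python
import numpy as np

def perm_matrix(sigma, n):
    P = np.zeros((n, n))
    for j in range(n):
        P[sigma[j], j] = 1.0
    return P

n = 4
P1 = perm_matrix((0, 1, 2, 3), n)   # sigma_1 = id
P2 = perm_matrix((1, 0, 2, 3), n)   # sigma_2 = (1 2)
P3 = perm_matrix((0, 1, 3, 2), n)   # sigma_3 = (3 4)
P4 = perm_matrix((1, 0, 3, 2), n)   # sigma_4 = (1 2)(3 4)
assert np.allclose(P1 + P4, P2 + P3)              # equation (10)

p1, p2, p3, p4 = 0.1, 0.2, 0.3, 0.4               # p1 <= p4 and p2 <= p3
Q = p1 * P1 + p2 * P2 + p3 * P3 + p4 * P4
# Two distinct 3-sparse decompositions yield exactly the same Q:
Q_a = (p1 + p2) * P1 + (p3 - p2) * P3 + (p2 + p4) * P4
Q_b = (p1 + p2) * P2 + (p1 + p3) * P3 + (p4 - p1) * P4
assert np.allclose(Q, Q_a) and np.allclose(Q, Q_b)
print("no unique (or even unique sparsest) solution to Q = Af")
```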
Recovery of Mode. As mentioned before, we are often interested in obtaining only limited information about $f(\cdot)$. One such scenario is when we would like to find just the most likely permutation. For this purpose, we use the Fourier approximation $\tilde{f}$ (cf. Proposition 1) in place of $f$: that is, the mode of $f$ is estimated as the mode of $\tilde{f}$. The following result states the correctness of this approximation under a majority condition.

Theorem 2. Consider a function $f : S_n \to [0, 1]$ such that $\|f\|_0 = L$ and $\|f\|_1 = 1$. Suppose the majority condition holds, that is, $\max_{\sigma \in S_n} f(\sigma) > 1/2$. Then

$$\arg\max_{\sigma \in S_n} \tilde{f}(\sigma) = \arg\max_{\sigma \in S_n} f(\sigma) = \arg\max_{\sigma \in S_n} \langle P^\sigma, Q \rangle.$$

The mode of $\tilde{f}$, or the maximizer of $\langle P^\sigma, Q \rangle$, is essentially the maximum weight matching in a weighted bipartite graph: consider the complete bipartite graph $G = ((V_1, V_2), E)$ with $V_1 = V_2 = \{1, \ldots, n\}$ and $E = V_1 \times V_2$, with edge $(i, j) \in E$ having weight $q_{ij}$. Then the weight of a matching (equivalently, a permutation $\sigma$) is exactly $\langle P^\sigma, Q \rangle$. The problem of finding a maximum weight matching is classical. It can be solved in $O(n^3)$ time using the algorithm of Edmonds and Karp [8], or by max-product belief propagation due to Bayati, Shah and Sharma [9]. Thus, the approximation can be evaluated efficiently.

4 Theorem 1: Proof and Algorithm

Here we present a constructive proof of Theorem 1. Specifically, we describe an algorithm that determines the function values from $Q$; its output is the original $f$ as long as properties P1 and P2 are satisfied. Let $p_1, p_2, \ldots, p_L$ denote the non-zero functional values, and let $\sigma_1, \sigma_2, \ldots, \sigma_L$ denote the corresponding permutations, i.e., $f(\sigma_k) = p_k$. Without loss of generality, assume that the permutations are labeled such that $p_i \le p_j$ for $i < j$. Let $q_1, q_2, \ldots, q_M$, where $M = n^2$, denote the entries of the matrix $Q$ arranged in ascending order; given this sorted version, we have $q_i \le q_j$ for $i < j$. Let $e_i$ denote the edge $(u, v)$ such that $q_i = q_{e_i} = q_{uv}$, where we recall that

$$q_{uv} = \sum_{k : \sigma_k(u) = v} f(\sigma_k) = \sum_{k : \sigma_k(u) = v} p_k.$$

Let $A_k$ denote the set of edges corresponding to permutation $\sigma_k$, $1 \le k \le L$; that is, $A_k = \{(u, \sigma_k(u)) : 1 \le u \le n\}$. The algorithm stated below determines $L$ and $(A_k, p_k)_{1 \le k \le L}$ using only the information $Q$; it works when properties P1 and P2 are satisfied.

Algorithm:

  initialization: $p_0 = 0$, $k(0) = 0$ and $A_k = \emptyset$ for $1 \le k \le M$.
  for $i = 1$ to $M$
      if $q_i = \sum_{j \in J} p_j$ for some $J \subseteq \{0, 1, \ldots, k(i-1)\}$
          $k(i) = k(i-1)$
          $A_j = A_j \cup \{e_i\}$ for all $j \in J$
      else
          $k(i) = k(i-1) + 1$
          $p_{k(i)} = q_i$
          $A_{k(i)} = A_{k(i)} \cup \{e_i\}$
      end if
  end for
  Output $L = k(M)$ and $(p_k, A_k)$, $1 \le k \le L$.

By property P2, for each $1 \le k \le L$ there is at least one $q_i$ equal to $p_k$. Property P1 ensures that whenever $q_i$ is a new value, i.e., $q_i = p_{k(i)}$ with $k(i) = k(i-1)+1$, the condition in the "if" statement of the pseudocode is not satisfied. Therefore, the algorithm correctly assigns values to each of the $p_k$'s. Note that when the condition in the "if" statement is true, the edge $e_i$ is present in all the permutations $\sigma_j$ with $j \in J$. Property P1 ensures that such a $J$, if it exists, is unique; therefore, when the condition is satisfied, the only permutations that contain edge $e_i$ are $\sigma_j$, $j \in J$. When the condition in the "if" statement fails, it again follows from properties P1 and P2 that edge $e_i$ is contained only in permutation $\sigma_{k(i)}$. From this discussion we conclude that, at the end of the iterations, each $A_k$ contains complete information about its corresponding permutation; the algorithm thus completely determines the function $f(\cdot)$. Finally, note that the algorithm does not require knowledge of $\|f\|_0$.
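The pseudocode above translates directly into a short program. The sketch below is our transcription, not the paper's implementation: it uses exact rational arithmetic so that the subset-sum equality test is meaningful, and it brute-forces subsets, so it is practical only for small $L$. Note that the introduction's example weights (0.5, 0.3, 0.2) violate P1, since 0.3 + 0.2 = 0.5; the demo therefore uses (0.55, 0.25, 0.2) instead.

```python
from fractions import Fraction
from itertools import combinations

def sparse_recovery(Q):
    """Recover (p_k, A_k) from first order marginals Q, assuming P1 and P2.

    Q is an n x n list of Fractions; edges are (row, column) pairs.
    """
    n = len(Q)
    # Sort the M = n^2 entries of Q in ascending order, keeping their edges.
    entries = sorted((Q[u][v], (u, v)) for u in range(n) for v in range(n))
    p = [Fraction(0)]        # p_0 = 0
    A = [set()]              # A_0 collects the zero entries (discarded at the end)
    for q, e in entries:
        # Try to express q as a subset sum of already-recovered values.
        J = next((J for r in range(1, len(p) + 1)
                  for J in combinations(range(len(p)), r)
                  if sum(p[j] for j in J) == q), None)
        if J is not None:
            for j in J:
                A[j].add(e)          # e belongs to every sigma_j, j in J
        else:
            p.append(q)              # a new value p_{k(i)}
            A.append({e})
    return list(zip(p[1:], A[1:]))   # drop the dummy index 0

# Variant of the election example, with weights satisfying P1:
# A > B > C (0.55), B > A > C (0.25), C > A > B (0.2).
Q = [[Fraction('0.55'), Fraction('0.25'), Fraction('0.2')],
     [Fraction('0.45'), Fraction('0.55'), Fraction(0)],
     [Fraction(0),      Fraction('0.2'),  Fraction('0.8')]]
for pk, Ak in sparse_recovery(Q):
    print(pk, sorted(Ak))   # each A_k is a full matching: one recovered ranking
```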
5 Theorem 2: Proof and Algorithm

Here, our interest is in finding the mode of $f$. The algorithm we propose uses the mode of $\tilde{f}$ as an estimate of the mode of $f$. We wish to establish that when $\max_{\sigma \in S_n} f(\sigma) > 1/2$, we have $\sigma^* = \tilde{\sigma}$, where

$$\sigma^* = \arg\max_{\sigma \in S_n} f(\sigma), \qquad \tilde{\sigma} = \arg\max_{\sigma \in S_n} \tilde{f}(\sigma).$$

Since we have assumed that $f(\sigma^*) > 1/2$ and $\|f\|_1 = 1$, we must have $\sum_{\sigma \in S} f(\sigma) < 1/2$ for any $S \subseteq S_n$ with $\sigma^* \notin S$. Therefore, each column of the matrix $Q$ contains exactly one entry that is greater than $1/2$, and the corresponding edge must be part of $\sigma^*$. Thus, keeping only those edges $(i, j)$ with $Q_{ij} > 1/2$, we recover the matching $\sigma^*$. Moreover, since each column of $Q$ sums to 1, the entry exceeding $1/2$ is the strict maximum of its column; hence $\sigma^*$ selects the maximum entry in every column and indeed has the maximum weight among all matchings. The result now follows.
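The mode estimate of Theorem 2 reduces to a standard assignment problem. A minimal sketch, assuming SciPy's linear_sum_assignment as the max-weight matching solver (any $O(n^3)$ matching routine would do equally well):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mode_from_marginals(Q):
    """Estimate the mode of f as the max-weight matching on weights q_ij.

    Rows of Q are rank positions, columns are candidates; the returned
    sigma satisfies sigma[j] = i, i.e., candidate j is placed in position i.
    """
    rows, cols = linear_sum_assignment(Q, maximize=True)  # O(n^3) Hungarian-type solver
    sigma = np.empty(Q.shape[0], dtype=int)
    sigma[cols] = rows
    return tuple(sigma)

# Majority example: f puts weight 0.6 on the identity and 0.4 on (1 2).
n = 4
P_id = np.eye(n)
P_12 = np.eye(n)[[1, 0, 2, 3]]   # permutation matrix of the transposition (1 2)
Q = 0.6 * P_id + 0.4 * P_12      # entries of sigma* exceed 1/2, as in the proof
print(mode_from_marginals(Q))    # (0, 1, 2, 3): the identity, as Theorem 2 predicts
```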
6 Conclusion

In summary, we considered the problem of inferring popular rankings from highly constrained information. Since ranking data naturally arises in several diverse practical situations, an answer to this question has wide ranging implications. Specifically, we considered the problem of inferring a sparse normalized function on the symmetric group using only the first order information about the function. In the election example, this first order information corresponds to the fraction of people who rank candidate $j$ in the $i$th position. We provide a novel algorithm that precisely recovers the permutations and their associated popularity under minimal, and essentially necessary, conditions. We justify the necessity of our assumptions and consider a natural random model to quantify the sparsity that can be supported. We also provide an algorithm, based on the Fourier transform approximation, to determine the most popular ranking (the mode of the function). The algorithm is essentially a max-weight matching with the $q_{ij}$ values as weights; under a natural majority assumption, it finds the correct mode.

The question considered is thematically related to harmonic analysis of functions over the symmetric group, as well as to the currently popular topic of compressed sensing. The problem we consider can be restated as the reconstruction of a function from its first order Fourier representation, which has several applications, particularly to the multi-object tracking problem. On the other hand, the parallels to the standard compressed sensing setup are limited, because the available information is highly constrained; thus, existing compressed sensing approaches cannot be applied to the problem.

Next Steps. We concentrated on recovering the distribution from its first order marginals. A possible next step would be to consider recovery under different forms of partial information. More specifically, practical applications motivate considering recovery of the distribution from pair-wise information: the probability of candidate $i$ being ranked above candidate $j$. Another natural practical consideration is the presence of noise in the available information. Understanding recovery of distributions under these considerations is a natural next step.

References

[1] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation revisited. In Proceedings of WWW10, 2001.
[2] Y. Chen, L. Fortnow, E. Nikolova, and D. M. Pennock. Betting on permutations. In EC '07: Proceedings of the 8th ACM Conference on Electronic Commerce, pages 326-335, New York, NY, USA, 2007. ACM.
[3] J. Huang, C. Guestrin, and L. Guibas. Efficient inference for distributions on permutations. In Advances in Neural Information Processing Systems (NIPS), 2007.
[4] R. Kondor, A. Howard, and T. Jebara. Multi-object tracking with representations of the symmetric group. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
[5] P. Diaconis. Group Representations in Probability and Statistics. IMS Lecture Notes-Monograph Series, vol. 11, 1988.
[6] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, Dec. 2005.
[7] G. Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman Rev. Ser. A, 5:147-151, 1946.
[8] J. Edmonds and R. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19:248-264, 1972.
[9] M. Bayati, D. Shah, and M. Sharma. Max-product for maximum weight matching: convergence, correctness and LP duality. IEEE Transactions on Information Theory, March 2008.