Efficient Inference for Distributions on Permutations

Jonathan Huang, Carnegie Mellon University; Carlos Guestrin, Carnegie Mellon University; Leonidas Guibas, Stanford University

Abstract

Permutations are ubiquitous in many real-world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are $n!$ possibilities, and typical compact representations, such as graphical models, cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the "low-frequency" terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low-order Fourier-based approximations, however, can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program, defined directly in the Fourier domain, for projecting the approximation onto the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking setting.

1 Introduction

Permutations arise naturally in a variety of real situations such as card games, data association problems, ranking analysis, etc. As an example, consider a sensor network that tracks the positions of $n$ people, but can only gather identity information when they walk near certain sensors. Such mixed-modality sensor networks are an attractive alternative to exclusively using sensors which can measure identity, because they are potentially cheaper, easier to deploy, and less intrusive; see [1] for a real deployment. A typical tracking system maintains tracks of $n$ people and the identity of the person corresponding to each track. What makes the problem difficult is that identities can be confused when tracks cross in what we call mixing events. Maintaining accurate track-to-identity assignments in the face of these ambiguities, based on identity measurements, is known as the Identity Management Problem [2], and is known to be NP-hard.

Permutations pose a challenge for probabilistic inference because distributions on the group of permutations on $n$ elements require storing at least $n! - 1$ numbers, which quickly becomes infeasible as $n$ increases. Furthermore, typical compact representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. Diaconis [3] proposes maintaining a small subset of the Fourier coefficients of the actual distribution, allowing for a principled tradeoff between accuracy and complexity. Schumitsch et al. [4] use similar ideas to maintain a particular subset of the Fourier coefficients of the log probability distribution. Kondor et al. [5] allow for general sets of coefficients, but assume a restrictive form of the observation model in order to exploit an efficient FFT factorization.

The main contributions of this paper are:

· A new, simple and general algorithm, Kronecker Conditioning, which performs all probabilistic inference operations completely in the Fourier domain. Our approach is general in the sense that it can address any transition model or likelihood function that can be represented in the Fourier domain, such as those used in previous work, and can represent the probability distribution with any desired set of Fourier coefficients.
· We show that approximate conditioning can sometimes yield Fourier coefficients which do not correspond to any valid distribution, and present a method for projecting the result back onto the marginal polytope.

· We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking setting.

2 Filtering over permutations

In identity management, a permutation $\sigma$ represents a joint assignment of identities to internal tracks, with $\sigma(i)$ being the track belonging to the $i$th identity. When people walk too closely together, their identities can be confused, leading to uncertainty over $\sigma$. To model this uncertainty, we use a Hidden Markov Model on permutations, which is a joint distribution over $\sigma^{(1)}, \dots, \sigma^{(T)}, z^{(1)}, \dots, z^{(T)}$ which factors as:
$$P(\sigma^{(1)}, \dots, \sigma^{(T)}, z^{(1)}, \dots, z^{(T)}) = P(\sigma^{(1)})\, P(z^{(1)} \mid \sigma^{(1)}) \prod_{t=2}^{T} P(z^{(t)} \mid \sigma^{(t)}) \cdot P(\sigma^{(t)} \mid \sigma^{(t-1)}),$$
where the $\sigma^{(t)}$ are latent permutations and the $z^{(t)}$ denote observed variables. The conditional probability distribution $P(\sigma^{(t)} \mid \sigma^{(t-1)})$ is called the transition model, and might reflect, for example, that the identities belonging to two tracks were swapped with some probability. The distribution $P(z^{(t)} \mid \sigma^{(t)})$ is called the observation model, which might capture a distribution over the color of clothing for each individual.

We focus on filtering, in which one queries the HMM for the posterior at some timestep, conditioned on all past observations. Given the distribution $P(\sigma^{(t)} \mid z^{(1)}, \dots, z^{(t)})$, we recursively compute $P(\sigma^{(t+1)} \mid z^{(1)}, \dots, z^{(t+1)})$ in two steps: a prediction/rollup step and a conditioning step. The first updates the distribution by multiplying by the transition model and marginalizing out the previous timestep:
$$P(\sigma^{(t+1)} \mid z^{(1)}, \dots, z^{(t)}) = \sum_{\sigma^{(t)}} P(\sigma^{(t+1)} \mid \sigma^{(t)})\, P(\sigma^{(t)} \mid z^{(1)}, \dots, z^{(t)}).$$
The second conditions the distribution on an observation $z^{(t+1)}$ using Bayes rule:
$$P(\sigma^{(t+1)} \mid z^{(1)}, \dots, z^{(t+1)}) \propto P(z^{(t+1)} \mid \sigma^{(t+1)})\, P(\sigma^{(t+1)} \mid z^{(1)}, \dots, z^{(t)}).$$
Since there are $n!$ permutations, a single update requires $O((n!)^2)$ flops and is consequently intractable for all but very small $n$. The approach that we advocate is to maintain a compact approximation to the true distribution based on the Fourier transform. As we discuss later, the Fourier-based approximation is equivalent to maintaining a set of low-order marginals, rather than the full joint, which we regard as being analogous to an Assumed Density Filter [6]. A brute-force version of this recursion for tiny $n$ is sketched below.
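To make the recursion concrete, here is a minimal brute-force sketch (our own illustration, not the paper's method) that enumerates all $n!$ permutations explicitly, which is exactly the cost our Fourier approach avoids. The helper names `transition_prob` and `likelihood` are illustrative placeholders:

```python
import itertools
import numpy as np

n = 4
perms = list(itertools.permutations(range(n)))   # all n! joint assignments

def predict(belief, transition_prob):
    """Prediction/rollup: multiply by the transition model, sum out sigma^(t)."""
    new_belief = np.zeros(len(perms))
    for i, prev in enumerate(perms):
        for j, cur in enumerate(perms):
            new_belief[j] += transition_prob(cur, prev) * belief[i]
    return new_belief

def condition(belief, likelihood, z):
    """Conditioning: pointwise product with the likelihood, then normalize."""
    posterior = np.array([likelihood(z, p) * belief[i] for i, p in enumerate(perms)])
    return posterior / posterior.sum()

# Example usage with an identity transition and an uninformative likelihood:
belief = np.full(len(perms), 1.0 / len(perms))          # uniform prior
T = lambda cur, prev: 1.0 if cur == prev else 0.0       # "sticky" transition model
belief = condition(predict(belief, T), lambda z, p: 1.0, z=None)
```

Note that `predict` loops over all pairs of permutations, making the $O((n!)^2)$ cost of a single exact update visible.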
3 Fourier projections of functions on the Symmetric Group

Over the last 50 years, the Fourier Transform has been ubiquitously applied to everything digital, particularly with the invention of the Fast Fourier Transform. On the real line, the Fourier Transform is a well-studied method for decomposing a function into a sum of sine and cosine terms over a spectrum of frequencies. Perhaps less familiar is its group-theoretic generalization, which we review in this section with an eye towards approximating functions on the group of permutations, the Symmetric Group. For permutations on $n$ objects, the Symmetric Group will be abbreviated by $S_n$. The formal definition of the Fourier Transform relies on the theory of group representations, which we briefly discuss first. Our goal in this section is to motivate the idea that the Fourier transform of a distribution $P$ is related to certain marginals of $P$. For references on this subject, see [3].

Definition 1. A representation of a group $G$ is a map $\rho$ from $G$ to a set of invertible $d_\rho \times d_\rho$ matrix operators which preserves algebraic structure in the sense that for all $\sigma_1, \sigma_2 \in G$, $\rho(\sigma_1 \sigma_2) = \rho(\sigma_1) \cdot \rho(\sigma_2)$. The matrices which lie in the image of this map are called the representation matrices, and we will refer to $d_\rho$ as the degree of the representation.

Representations play the role of basis functions, similar to that of sinusoids, in Fourier theory. The simplest basis functions are constant functions, and our first example of a representation is the trivial representation $\rho_0 : G \to \mathbb{R}$, which maps every element of $G$ to 1. As a more pertinent example, we define the first-order permutation representation of $S_n$ to be the degree-$n$ representation $\tau_1$ which maps a permutation $\sigma$ to its corresponding permutation matrix, given by $[\tau_1(\sigma)]_{ij} = 1\{\sigma(j) = i\}$. For example, the permutation in $S_3$ which swaps the second and third elements maps to:
$$\tau_1(1 \mapsto 1,\ 2 \mapsto 3,\ 3 \mapsto 2) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}.$$
The $\tau_1$ representation can be thought of as a collection of $n^2$ functions at once, one for each matrix entry $[\tau_1(\sigma)]_{ij}$. There are other possible permutation representations. For example, the second-order unordered permutation representation, $\tau_2$, is defined by the action of a permutation on unordered pairs of objects, $[\tau_2(\sigma)]_{\{i,j\},\{k,\ell\}} = 1\{\sigma(\{k,\ell\}) = \{i,j\}\}$, and is a degree-$\frac{n(n-1)}{2}$ representation. And the list goes on to include many more complicated representations.

It is useful to think of two representations as being the same if the representation matrices are equal up to some consistent change of basis. This idea is formalized by declaring two representations $\rho$ and $\tau$ to be equivalent if there exists an invertible matrix $C$ such that $C^{-1} \cdot \rho(\sigma) \cdot C = \tau(\sigma)$ for all $\sigma \in G$. We write this as $\rho \equiv \tau$. Most representations can be seen as having been built up from smaller representations. We say that a representation $\rho$ is reducible if there exist smaller representations $\rho_1, \rho_2$ such that $\rho \equiv \rho_1 \oplus \rho_2$, where $\oplus$ is defined to be the direct sum representation:
$$\rho_1 \oplus \rho_2(g) \equiv \begin{pmatrix} \rho_1(g) & 0 \\ 0 & \rho_2(g) \end{pmatrix}. \quad (1)$$
In general, there are infinitely many inequivalent representations. However, for any finite group, there is always a finite collection of atomic representations which can be used to build up any other representation using direct sums. These representations are referred to as the irreducibles of a group, and they are simply the collection of representations which are not reducible. We will refer to the set of irreducibles by $\mathcal{R}$. It can be shown that any representation of a finite group $G$ is equivalent to a direct sum of irreducibles [3], and hence, for any representation $\tau$, there exists a matrix $C$ for which $C^{-1} \cdot \tau \cdot C = \bigoplus_{\rho_i \in \mathcal{R}} \bigoplus_\ell \rho_i$, where the inner $\bigoplus$ refers to some finite number of copies of the irreducible $\rho_i$. Describing the irreducibles of $S_n$ up to equivalence is a subject unto itself; we will simply say that there is a natural way to order the irreducibles of $S_n$ that corresponds to 'simplicity' in the same way that low-frequency sinusoids are simpler than high-frequency ones. We will refer to the irreducibles in this order as $\rho_0, \rho_1, \dots$. For example, the first two irreducibles form the first-order permutation representation ($\tau_1 \equiv \rho_0 \oplus \rho_1$), and the second-order permutation representation can be formed from the first three irreducibles. Irreducible representation matrices are not always orthogonal, but they can always be chosen to be so (up to equivalence). For notational convenience, the irreducible representations in this paper will always be assumed to be orthogonal.
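The defining homomorphism property is easy to verify numerically. Below is a small sketch (our own illustration, not from the paper) that builds $\tau_1$ and checks $\tau_1(\sigma_1 \sigma_2) = \tau_1(\sigma_1)\,\tau_1(\sigma_2)$ over all of $S_3$; the composition convention $(\sigma_1 \sigma_2)(j) = \sigma_1(\sigma_2(j))$ is an assumption chosen to match the matrix convention above:

```python
import itertools
import numpy as np

def tau1(sigma):
    """First-order permutation representation: [tau1(sigma)]_ij = 1{sigma(j) = i}."""
    n = len(sigma)
    M = np.zeros((n, n))
    for j in range(n):
        M[sigma[j], j] = 1.0
    return M

def compose(s1, s2):
    """Composition convention (assumed): (s1 s2)(j) = s1(s2(j))."""
    return tuple(s1[s2[j]] for j in range(len(s1)))

# Check the homomorphism property over all of S_3.
for s1 in itertools.permutations(range(3)):
    for s2 in itertools.permutations(range(3)):
        assert np.allclose(tau1(compose(s1, s2)), tau1(s1) @ tau1(s2))
```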
3.1 The Fourier transform

On the real line, the Fourier Transform corresponds to computing inner products of a function with sines and cosines at varying frequencies. The analogous definition for finite groups replaces the sinusoids by group representations.

Definition 2. Let $f : G \to \mathbb{R}$ be any function on a group $G$ and let $\rho$ be any representation of $G$. The Fourier Transform of $f$ at the representation $\rho$ is defined to be: $\hat{f}_\rho = \sum_\sigma f(\sigma)\, \rho(\sigma)$.

There are two important points which distinguish this Fourier Transform from the familiar version on the real line: it is matrix-valued, and instead of real numbers, the inputs to $\hat{f}$ are representations of $G$. The collection of Fourier Transforms of $f$ at all irreducibles forms the Fourier Transform of $f$. As in the familiar case, there is an inverse transform, given by:
$$f(\sigma) = \frac{1}{|G|} \sum_k d_{\rho_k} \operatorname{Tr}\left[\hat{f}_{\rho_k}^T \cdot \rho_k(\sigma)\right], \quad (2)$$
where $k$ indexes over the collection of irreducibles of $G$.

We provide two examples for intuition. For functions on the real line, the Fourier Transform at zero gives the DC component of a signal. This is also true for functions on a group: if $f : G \to \mathbb{R}$ is any function, then the Fourier Transform of $f$ at the trivial representation is the constant $\hat{f}_{\rho_0} = \sum_\sigma f(\sigma)$. Thus, for any probability distribution $P$, we have $\hat{P}_{\rho_0} = 1$. If $P$ were the uniform distribution, then $\hat{P}_{\rho_k} = 0$ at all irreducibles except at the trivial representation.

The Fourier Transform at $\tau_1$ also has a simple interpretation:
$$\left[\hat{f}_{\tau_1}\right]_{ij} = \sum_{\sigma \in S_n} f(\sigma)\, [\tau_1(\sigma)]_{ij} = \sum_{\sigma \in S_n} f(\sigma)\, 1\{\sigma(j) = i\} = \sum_{\sigma : \sigma(j) = i} f(\sigma).$$
Thus, if $P$ is a distribution, then $\hat{P}_{\tau_1}$ is a matrix of marginal probabilities, where the $ij$-th element is the marginal probability that a random permutation drawn from $P$ maps element $j$ to $i$. Similarly, the Fourier transform of $P$ at the second-order permutation representation is a matrix of marginal probabilities of the form $P(\sigma(\{k,\ell\}) = \{i,j\})$. In Section 5, we will discuss function approximation by bandlimiting the Fourier coefficients, but this example should illustrate the fact that maintaining Fourier coefficients at low-order irreducibles is the same as maintaining low-order marginal probabilities, while higher-order irreducibles correspond to more complicated marginals.
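For example, the Fourier coefficient matrix at $\tau_1$ can be computed directly from Definition 2 and is exactly the matrix of first-order marginals. A minimal sketch (our own illustration, reusing `tau1()` from the previous snippet):

```python
import itertools
import numpy as np

def fourier_tau1(P):
    """Fourier transform at tau1: sum_sigma P(sigma) * tau1(sigma).
    P is a dict mapping permutation tuples to probabilities."""
    return sum(prob * tau1(sigma) for sigma, prob in P.items())

# Example: under the uniform distribution on S_3, every first-order
# marginal P(sigma(j) = i) equals 1/3.
perms = list(itertools.permutations(range(3)))
P_uniform = {p: 1.0 / len(perms) for p in perms}
print(fourier_tau1(P_uniform))   # 3x3 matrix with every entry 1/3
```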
4 Inference in the Fourier domain

Bandlimiting allows for compactly storing a distribution over permutations, but the idea is rather moot if it becomes necessary to transform back to the primal domain each time an inference operation is called. Naively, the Fourier Transform on $S_n$ scales as $O((n!)^2)$, and even the fastest Fast Fourier Transforms for functions on $S_n$ are no faster than $O(n! \log(n!))$ (see [7], for example). To resolve this issue, we present a formulation of inference which operates solely in the Fourier domain, allowing us to avoid a costly transform. We begin by discussing exact inference in the Fourier domain, which is no more tractable than the original problem because there are $n!$ Fourier coefficients, but it will allow us to discuss the bandlimiting approximation in the next section. There are two operations to consider: prediction/rollup and conditioning. The assumption for the rest of this section is that the Fourier Transforms of the transition and observation models are known. We discuss methods for obtaining the models in Section 7.

4.1 Fourier prediction/rollup

We will consider one particular type of transition model: that of a random walk over a group. This model assumes that $\sigma^{(t+1)}$ is generated from $\sigma^{(t)}$ by drawing a random permutation $\pi^{(t)}$ from some distribution $Q^{(t)}$ and setting $\sigma^{(t+1)} = \pi^{(t)} \sigma^{(t)}$. In our identity management example, $\pi^{(t)}$ represents a random identity permutation that might occur among tracks when they get close to each other (a mixing event), but the random walk model appears in other applications such as modeling card shuffles [3]. The Fourier-domain prediction/rollup step is easily formulated using the convolution theorem (see also [3]):

Proposition 3. Let $Q$ and $P$ be probability distributions on $S_n$. Define the convolution of $Q$ and $P$ to be the function $[Q * P](\sigma_1) = \sum_{\sigma_2} Q(\sigma_1 \sigma_2^{-1})\, P(\sigma_2)$. Then for any representation $\rho$, $\widehat{Q * P}_\rho = \hat{Q}_\rho \cdot \hat{P}_\rho$, where the operation on the right side is matrix multiplication.

The prediction/rollup step for the random walk transition model can be written as a convolution:
$$P(\sigma^{(t+1)}) = \sum_{\{(\pi^{(t)},\, \sigma^{(t)})\, :\, \sigma^{(t+1)} = \pi^{(t)} \cdot \sigma^{(t)}\}} Q^{(t)}(\pi^{(t)}) \cdot P(\sigma^{(t)}) = \sum_{\sigma^{(t)}} Q^{(t)}\!\left(\sigma^{(t+1)} \cdot (\sigma^{(t)})^{-1}\right) P(\sigma^{(t)}) = \left[Q^{(t)} * P\right](\sigma^{(t+1)}).$$
Then, assuming that $\hat{Q}$ and $\hat{P}$ are given, the prediction/rollup update rule is simply: $\hat{P}^{(t+1)}_\rho \leftarrow \hat{Q}^{(t)}_\rho \cdot \hat{P}^{(t)}_\rho$. Note that the update requires only knowledge of $\hat{P}$ and does not require $P$. Furthermore, the update is pointwise in the Fourier domain, in the sense that the coefficients at the representation $\rho$ affect $\hat{P}^{(t+1)}$ only at $\rho$.
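Proposition 3 can be checked numerically at $\tau_1$ in a few lines (an illustration of ours, reusing `tau1()`, `compose()`, and `fourier_tau1()` from the earlier snippets):

```python
import itertools
import numpy as np

def convolve(Q, P):
    """[Q * P](s) accumulates Q(s1) P(s2) into the bucket s = s1 s2,
    which is equivalent to sum_{s2} Q(s s2^{-1}) P(s2)."""
    out = {s: 0.0 for s in Q}
    for s1 in Q:
        for s2 in P:
            out[compose(s1, s2)] += Q[s1] * P[s2]
    return out

perms = list(itertools.permutations(range(3)))
rng = np.random.default_rng(0)

def random_dist():
    w = rng.random(len(perms))
    return dict(zip(perms, w / w.sum()))

Q, P = random_dist(), random_dist()
# Fourier transform of the convolution == product of the Fourier transforms.
assert np.allclose(fourier_tau1(convolve(Q, P)),
                   fourier_tau1(Q) @ fourier_tau1(P))
```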
4.2 Fourier conditioning

An application of Bayes rule to find a posterior distribution $P(\sigma \mid z)$ after observing some evidence $z$ requires a pointwise product of the likelihood $L(z \mid \sigma)$ and the prior $P(\sigma)$, followed by a normalization step. We showed earlier that the normalization constant $\sum_\sigma L(z \mid \sigma) \cdot P(\sigma)$ is given by the Fourier transform of $L^{(t)} P^{(t)}$ at the trivial representation, and therefore the normalization step of conditioning can be implemented by simply dividing each Fourier coefficient by the scalar $\left[\widehat{L^{(t)} P^{(t)}}\right]_{\rho_0}$.

The pointwise product of two functions $f$ and $g$, however, is trickier to formulate in the Fourier domain. For functions on the real line, the pointwise product of functions can be implemented by convolving the Fourier coefficients $\hat{f}$ and $\hat{g}$, and so a natural question is: can we apply a similar operation for functions over other groups? Our answer is that there is an analogous (but more complicated) notion of convolution in the Fourier domain of a general finite group. We present a convolution-based conditioning algorithm which we call Kronecker Conditioning, which, in contrast to the pointwise nature of the Fourier-domain prediction/rollup step, and much like convolution, smears the information at an irreducible $\rho_k$ to other irreducibles.

Fourier transforming the pointwise product. Our approach to Fourier transforming the pointwise product in terms of $\hat{f}$ and $\hat{g}$ is to manipulate the function $f(\sigma) g(\sigma)$ so that it can be seen as the result of an inverse Fourier Transform. Hence, the goal will be to find matrices $A_k$ (as a function of $\hat{f}$, $\hat{g}$) such that for any $\sigma \in G$,
$$f(\sigma) \cdot g(\sigma) = \frac{1}{|G|} \sum_k d_{\rho_k} \operatorname{Tr}\left[A_k^T \cdot \rho_k(\sigma)\right], \quad (3)$$
where $A_k = \left[\widehat{fg}\right]_{\rho_k}$. For any $\sigma \in G$ we can write the pointwise product in terms of $\hat{f}$ and $\hat{g}$ using the inverse Fourier Transform (Equation 2):
$$f(\sigma) \cdot g(\sigma) = \left[\frac{1}{|G|} \sum_i d_{\rho_i} \operatorname{Tr}\left(\hat{f}_{\rho_i}^T \cdot \rho_i(\sigma)\right)\right] \cdot \left[\frac{1}{|G|} \sum_j d_{\rho_j} \operatorname{Tr}\left(\hat{g}_{\rho_j}^T \cdot \rho_j(\sigma)\right)\right] = \left(\frac{1}{|G|}\right)^2 \sum_{i,j} d_{\rho_i} d_{\rho_j} \operatorname{Tr}\left(\hat{f}_{\rho_i}^T \cdot \rho_i(\sigma)\right) \cdot \operatorname{Tr}\left(\hat{g}_{\rho_j}^T \cdot \rho_j(\sigma)\right). \quad (4)$$
Now we want to manipulate this product of traces in the last line to be just one trace (as in Equation 3), by appealing to some properties of the matrix Kronecker product. The connection to the pointwise product (first observed in [10]) lies in the property that for any matrices $U, V$, $\operatorname{Tr}(U \otimes V) = (\operatorname{Tr} U) \cdot (\operatorname{Tr} V)$. Applying this to Equation 4, we have:
$$\operatorname{Tr}\left(\hat{f}_{\rho_i}^T \cdot \rho_i(\sigma)\right) \cdot \operatorname{Tr}\left(\hat{g}_{\rho_j}^T \cdot \rho_j(\sigma)\right) = \operatorname{Tr}\left(\left(\hat{f}_{\rho_i}^T \cdot \rho_i(\sigma)\right) \otimes \left(\hat{g}_{\rho_j}^T \cdot \rho_j(\sigma)\right)\right) = \operatorname{Tr}\left(\left(\hat{f}_{\rho_i} \otimes \hat{g}_{\rho_j}\right)^T \cdot \left(\rho_i(\sigma) \otimes \rho_j(\sigma)\right)\right), \quad (5)$$
where the last equality follows by standard matrix properties. The term on the right, $\rho_i(\sigma) \otimes \rho_j(\sigma)$, itself happens to be a representation, called the Kronecker product representation. In general, the Kronecker product representation is reducible, and so it can be decomposed into a direct sum of irreducibles. This means that if $\rho_i$ and $\rho_j$ are any two irreducibles of $G$, there exists a similarity transform $C_{ij}$ such that for any $\sigma \in G$,
$$C_{ij}^{-1} \cdot \left[\rho_i \otimes \rho_j\right](\sigma) \cdot C_{ij} = \bigoplus_k \bigoplus_{\ell=1}^{z_{ijk}} \rho_k(\sigma).$$
The $\oplus$ symbols here refer to a matrix direct sum as in Equation 1, $k$ indexes over all irreducible representations of $S_n$, while $\ell$ indexes over the number of copies of $\rho_k$ which appear in the decomposition. We index blocks on the right side of this equation by pairs of indices $(k, \ell)$. The number of copies of each $\rho_k$ is denoted by the integer $z_{ijk}$, the collection of which, taken over all triples $(i, j, k)$, are commonly referred to as the Clebsch-Gordan series. Note that we allow the $z_{ijk}$ to be zero, in which case $\rho_k$ does not contribute to the direct sum. The matrices $C_{ij}$ are known as the Clebsch-Gordan coefficients. The Kronecker product decomposition problem is that of finding the irreducible components of the Kronecker product representation, and thus of finding the Clebsch-Gordan series/coefficients for each pair of representations $(\rho_i, \rho_j)$. Decomposing the Kronecker product inside Equation 5 using the Clebsch-Gordan series/coefficients yields the desired Fourier Transform, which we summarize here:

Proposition 4. Let $\hat{f}, \hat{g}$ be the Fourier Transforms of functions $f$ and $g$ respectively, and for each ordered pair of irreducibles $(\rho_i, \rho_j)$, define the matrix $A_{ij} \equiv C_{ij}^{-1} \cdot \left(\hat{f}_{\rho_i} \otimes \hat{g}_{\rho_j}\right) \cdot C_{ij}$. Then the Fourier transform of the pointwise product $fg$ is:
$$\left[\widehat{fg}\right]_{\rho_k} = \frac{1}{d_{\rho_k} |G|} \sum_{i,j} d_{\rho_i} d_{\rho_j} \sum_{\ell=1}^{z_{ijk}} A_{ij}^{k\ell}, \quad (6)$$
where $A_{ij}^{k\ell}$ is the block of $A_{ij}$ corresponding to the $(k, \ell)$ block in $\bigoplus_k \bigoplus_\ell \rho_k$.

See the Appendix for a full proof of Proposition 4. The Clebsch-Gordan series, $z_{ijk}$, plays an important role in Equation 6, which says that the $(\rho_i, \rho_j)$ cross-term contributes to the pointwise product at $\rho_k$ only when $z_{ijk} > 0$. For example,
$$\rho_1 \otimes \rho_1 \equiv \rho_0 \oplus \rho_1 \oplus \rho_2 \oplus \rho_3, \quad (7)$$
so $z_{1,1,k} = 1$ for $k \leq 3$ and is zero otherwise. Unfortunately, there are no analytical formulas for finding the Clebsch-Gordan series or coefficients, and in practice, these computations can take a long time. We emphasize, however, that as fundamental quantities, like the digits of $\pi$, they need only be computed once and stored in a table for future reference. Due to space limitations, we will not provide complete details on computing these numbers. We refer the reader to Murnaghan [8], who provides general formulas for computing Clebsch-Gordan series for pairs of low-order irreducibles, and to Appendix 1 for details about computing Clebsch-Gordan coefficients. We also plan to make a set of precomputed coefficients available on the web.
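Both facts used above, the trace identity and the claim that $\rho_i(\sigma) \otimes \rho_j(\sigma)$ is again a representation, are easy to sanity-check numerically (our own illustration, reusing `tau1()` and `compose()` from earlier):

```python
import itertools
import numpy as np

# Trace identity: Tr(U kron V) = Tr(U) * Tr(V).
rng = np.random.default_rng(1)
U, V = rng.random((3, 3)), rng.random((4, 4))
assert np.isclose(np.trace(np.kron(U, V)), np.trace(U) * np.trace(V))

# The Kronecker product of a representation with itself is again a
# representation (checked here for tau1 over all of S_3):
for s1 in itertools.permutations(range(3)):
    for s2 in itertools.permutations(range(3)):
        lhs = np.kron(tau1(compose(s1, s2)), tau1(compose(s1, s2)))
        rhs = np.kron(tau1(s1), tau1(s1)) @ np.kron(tau1(s2), tau1(s2))
        assert np.allclose(lhs, rhs)
```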
5 Approximate inference by bandlimiting

We approximate the probability distribution $P(\sigma)$ by fixing a bandlimit $B$ and maintaining the Fourier transform of $P$ only at the irreducibles $\rho_0, \dots, \rho_B$. We refer to this set of irreducibles as $\Lambda_B$. As on the real line, smooth functions are generally well approximated by only a few Fourier coefficients, while "wigglier" functions require more. For example, when $B = 3$, $\Lambda_B$ is the set $\{\rho_0, \rho_1, \rho_2, \rho_3\}$, which corresponds to maintaining marginal probabilities of the form $P(\sigma((i,j)) = (k,\ell))$. During inference, we follow the procedure outlined in the previous section but ignore the higher-order terms which are not maintained. Pseudocode for bandlimited prediction/rollup and Kronecker conditioning is given in Figures 1 and 2.

Since the prediction/rollup step is pointwise in the Fourier domain, the update is exact for the maintained irreducibles, because higher-order irreducibles cannot affect those below the bandlimit. As in [5], we find that the error from bandlimiting creeps in through the conditioning step. For example, Equation 7 shows that if $B = 1$ (so that we maintain first-order marginals), then the pointwise product spreads information to second-order marginals. Conversely, pairs of higher-order irreducibles may propagate information to lower-order irreducibles. If a distribution is diffuse, then most of the energy is stored in low-order Fourier coefficients anyway, and so this is not a big problem. However, when the distribution is sharply concentrated at a small subset of permutations, the Fourier projection is unable to faithfully approximate the distribution, in many circumstances resulting in a bandlimited Fourier Transform with negative marginal probabilities! To combat this problem, we present a method for enforcing nonnegativity.

Projecting to the marginal polytope. The marginal polytope $\mathcal{M}$ is the set of Fourier coefficients which are doubly stochastic at the permutation representations (i.e., rows and columns sum to one and all entries are nonnegative). $\mathcal{M}$ can be described by a set of linear equalities which constrain a matrix of marginals to correspond to a legal Fourier transform, and a set of linear inequalities which constrain the marginals to be nonnegative. In the first-order case, $\mathcal{M}$ is exactly the set of all doubly stochastic matrices. After each conditioning step, we apply a 'correction' to the approximate posterior $\hat{P}^{(t)}$ by finding the bandlimited function in $\mathcal{M}$ which is closest to $\hat{P}^{(t)}$ in an $L_2$ sense. To perform the projection, we employ the Plancherel Theorem [3], which relates the $L_2$ distance between functions on $S_n$ to a distance metric in the Fourier domain:

Proposition 5.
$$\sum_\sigma \left(f(\sigma) - g(\sigma)\right)^2 = \frac{1}{|G|} \sum_k d_{\rho_k} \operatorname{Tr}\left[\left(\hat{f}_{\rho_k} - \hat{g}_{\rho_k}\right)^T \cdot \left(\hat{f}_{\rho_k} - \hat{g}_{\rho_k}\right)\right]. \quad (8)$$

We formulate the optimization as a quadratic program whose objective is to minimize the right side of Equation 8, where the sum is taken only over the set of maintained irreducibles, $\Lambda_B$, subject to the set of constraints which describe the marginal polytope. We remark that even though the projection will produce a Fourier transform corresponding to nonnegative marginals, there might not necessarily exist a joint probability distribution on $S_n$ consistent with those marginals. In the case of first-order marginals, however, the existence of a consistent joint distribution is guaranteed by the Birkhoff-von Neumann theorem [9], which states that a matrix is doubly stochastic if and only if it can be written as a convex combination of permutation matrices.
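In the first-order case, the projection amounts to finding the closest doubly stochastic matrix in Frobenius norm (which coincides with the Plancherel objective at $\tau_1$). Here is a minimal sketch, using SciPy's generic SLSQP solver as a stand-in for the purpose-built quadratic program described above:

```python
import numpy as np
from scipy.optimize import minimize

def project_to_birkhoff(M0):
    """Closest doubly stochastic matrix to M0 in Frobenius norm."""
    n = M0.shape[0]
    constraints = [
        {"type": "eq", "fun": lambda x: x.reshape(n, n).sum(axis=0) - 1},  # column sums
        {"type": "eq", "fun": lambda x: x.reshape(n, n).sum(axis=1) - 1},  # row sums
    ]
    res = minimize(lambda x: np.sum((x - M0.ravel()) ** 2),
                   x0=np.full(n * n, 1.0 / n),
                   bounds=[(0.0, None)] * (n * n),    # nonnegativity
                   constraints=constraints)
    return res.x.reshape(n, n)

# A "marginal" matrix with a negative entry, as can arise after
# bandlimited Kronecker conditioning:
M0 = np.array([[0.7, 0.4, -0.1],
               [0.2, 0.3,  0.5],
               [0.1, 0.3,  0.6]])
print(project_to_birkhoff(M0))   # rows/columns sum to 1, all entries >= 0
```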
Figure 1: Pseudocode for the Fourier Prediction/Rollup Algorithm.

    PredictionRollup:
      for each ρ_k ∈ Λ_B:  P̂(t+1)_{ρ_k} ← Q̂(t)_{ρ_k} · P̂(t)_{ρ_k}

Figure 2: Pseudocode for the Kronecker Conditioning Algorithm.

    KroneckerConditioning:
      for each ρ_k ∈ Λ_B:  [L̂(t)P(t)]_{ρ_k} ← 0                        // initialize posterior
      for each ρ_i ∈ Λ_B:                                               // pointwise product
        for each ρ_j ∈ Λ_B:
          z ← CGseries(ρ_i, ρ_j)
          C_ij ← CGcoefficients(ρ_i, ρ_j)
          A_ij ← C_ij^T · (L̂(t)_{ρ_i} ⊗ P̂(t)_{ρ_j}) · C_ij
          for each ρ_k ∈ Λ_B such that z_ijk ≠ 0:
            for ℓ = 1 to z_ijk:
              [L̂(t)P(t)]_{ρ_k} ← [L̂(t)P(t)]_{ρ_k} + (d_{ρ_i} d_{ρ_j} / (d_{ρ_k} n!)) · A_ij^{kℓ}
              // A_ij^{kℓ} is the (k, ℓ) block of A_ij
      Z ← [L̂(t)P(t)]_{ρ_0}                                              // normalization constant
      for each ρ_k ∈ Λ_B:  [L̂(t)P(t)]_{ρ_k} ← (1/Z) · [L̂(t)P(t)]_{ρ_k}  // normalization
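For concreteness, here is a Python transcription of Figure 2 (a sketch of ours; `cg_series` and `cg_coefficients` are hypothetical lookup functions into the precomputed Clebsch-Gordan tables discussed in Section 4.2):

```python
from math import factorial
import numpy as np

def kronecker_condition(L_hat, P_hat, d, n, cg_series, cg_coefficients):
    """L_hat, P_hat: dicts mapping irreducible index k to coefficient matrices.
    d[k] is the degree of rho_k; cg_series(i, j) yields (k, z_ijk) pairs in
    block order; cg_coefficients(i, j) returns the orthogonal matrix C_ij."""
    post = {k: np.zeros((d[k], d[k])) for k in P_hat}        # initialize posterior
    for i in L_hat:                                          # pointwise product
        for j in P_hat:
            C = cg_coefficients(i, j)                        # orthogonal: C^{-1} = C^T
            A = C.T @ np.kron(L_hat[i], P_hat[j]) @ C
            offset = 0
            for k, z_ijk in cg_series(i, j):
                for _ in range(z_ijk):
                    block = A[offset:offset + d[k], offset:offset + d[k]]
                    if k in post:                            # keep only maintained terms
                        post[k] += (d[i] * d[j] / (d[k] * factorial(n))) * block
                    offset += d[k]
    Z = post[0][0, 0]                                        # coefficient at rho_0
    return {k: M / Z for k, M in post.items()}               # normalization
```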
6 Related Work

The Identity Management problem was first introduced in [2], which maintains a doubly stochastic first-order belief matrix to reason over data associations. Schumitsch et al. [4] exploit a similar idea, but formulate the problem in log-space. Kondor et al. [5] were the first to show that the data association problem could be approximately handled via the Fourier Transform. For conditioning, they exploit a modified FFT factorization which works on certain simplified observation models. Our approach generalizes the type of observations that can be handled in [5] and is equivalent in the simplified model that they present. We require $O(D^3 n^2)$ time in their setting. Their FFT method saves a factor of $D$ due to the fact that certain representation matrices can be shown to be sparse. Though we do not prove it, we observe that the Clebsch-Gordan coefficients $C_{ij}$ are typically similarly sparse, which yields an equivalent running time in practice. In addition, Kondor et al. do not address the issue of projecting onto the marginal polytope, which, as we show in our experimental results, is fundamental in practice. Willsky [10] was the first to formulate a nonabelian version of the FFT algorithm (for metacyclic groups), as well as to note the connection between pointwise products and Kronecker product decompositions for general finite groups. In this paper, we address approximate inference, which is necessary given the $n!$ complexity of inference on the Symmetric Group.

7 Experimental results

For small $n$, we compared our algorithm to exact inference on synthetic datasets in which tracks are drawn at random to be observed or swapped. For validation, we measure the distance to the true distribution using the $L_2$ norm, normalized by $n!$. As expected, the Fourier approximation is better when there are either more mixing events (i.e., the fraction of conditioning events is smaller), or when more Fourier coefficients are maintained, as shown in Fig. 3(a). We also see in Fig. 3(b) that the Plancherel projection step is fundamental, especially when mixing events are rare, reducing the error by factors of about 3. Comparing running times, it is clear that our algorithm scales gracefully compared to the exact solution (Fig. 3(c)).

We also evaluated our algorithm on data taken from a real network of 8 cameras (Fig. 3(d)). In the data, there are $n = 11$ people walking around a room in fairly close proximity. To handle the fact that people can freely leave and enter the room, we maintain a list of the tracks which are external to the room. Each time a new track leaves the room, it is added to the list and a mixing event is called to allow for the $\binom{m}{2}$ pairwise swaps amongst the $m$ external tracks. The number of mixing events is approximately the same as the number of observations. For each observation, the network returns a color histogram of the blob associated with one track. The task after conditioning on each observation is to predict identities for all tracks which are inside the room, and the evaluation metric is the fraction of accurate predictions.

We compared against a baseline approach of predicting the identity of a track based on the most recently observed histogram at that track. This approach is expected to be accurate when there are many observations and discriminative appearance models, neither of which our problem afforded.

[Figure 3: Evaluation on synthetic ((a)-(c)) and real camera network ((d),(e)) data. Panels: (a) error without projection; (b) error with projection; (c) n versus running time; (d) sample camera image; (e) prediction accuracy for camera data.]

As Figure 3(e) shows, both the baseline and the first-order model without projection fared poorly, while the projection step dramatically boosted the prediction accuracy for this problem. To illustrate the difficulty of predicting based on appearance alone, the rightmost bar reflects the performance of an omniscient tracker who knows the result of each mixing event and is therefore left only with the task of distinguishing between appearances.

8 Conclusions

We presented a formulation of hidden Markov model inference in the Fourier domain. In particular, we developed the Kronecker Conditioning algorithm, which performs a convolution-like operation on Fourier coefficients to find the Fourier transform of the posterior distribution. We argued that bandlimited conditioning can result in Fourier coefficients which correspond to no distribution, but that the problem can be remedied by projecting onto the marginal polytope. Our evaluation on data from a camera network shows that our methods perform well when compared to the optimal solution in small problems, and to an omniscient tracker in large problems. Furthermore, we demonstrated that our projection step is fundamental to obtaining these high-quality results. We conclude by remarking that the mathematical framework developed in this paper is quite general. In fact, both the prediction/rollup and conditioning formulations hold over any finite group, providing a principled method for approximate inference for problems with underlying group structure.

Acknowledgments

This work is supported in part by the Office of Naval Research under MURI N000140710747, the National Science Foundation under grants DGE-0333420 and EEEC-540865, and by the Pennsylvania Infrastructure Technology Alliance (PITA). C. Guestrin was also supported in part by an Alfred P. Sloan Fellowship. We also thank Kyle Heath for helping with the camera data.

References

[1] Y. Ivanov, A. Sorokin, C. Wren, and I. Kaur. Tracking people in mixed modality systems. Technical Report TR2007-11, MERL, 2007.
[2] J. Shin, L. Guibas, and F. Zhao. A distributed algorithm for managing multi-target identities in wireless ad-hoc sensor networks. In IPSN, 2003.
[3] P. Diaconis. Group Representations in Probability and Statistics. Institute of Mathematical Statistics, 1988.
[4] B. Schumitsch, S. Thrun, G. Bradski, and K. Olukotun. The information-form data association filter. In NIPS, 2006.
[5] R. Kondor, A. Howard, and T. Jebara. Multi-object tracking with representations of the symmetric group. In AISTATS, 2007.
[6] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In UAI, 1998.
[7] R. Kondor. Snob: a C++ library for fast Fourier transforms on the symmetric group, 2006. Available at http://www.cs.columbia.edu/~risi/Snob/.
[8] F. D. Murnaghan. The analysis of the Kronecker product of irreducible representations of the symmetric group. American Journal of Mathematics, 60(3):761–784, 1938.
[9] J. van Lint and R. M. Wilson. A Course in Combinatorics. Cambridge University Press, 2001.
[10] A. Willsky. On the algebraic structure of certain partially observable finite-state Markov processes. Information and Control, 38:179–212, 1978.