New Outer Bounds on the Marginal Polytope

David Sontag, MIT, dsontag@csail.mit.edu
Tommi Jaakkola, MIT, tommi@csail.mit.edu

Abstract

We give a new class of outer bounds on the marginal polytope, and propose a cutting-plane algorithm for efficiently optimizing over these constraints. When combined with a concave upper bound on the entropy, this gives a new variational inference algorithm for probabilistic inference in discrete Markov Random Fields (MRFs). Valid constraints on the marginal polytope are derived through a series of projections onto the cut polytope. As a result, we obtain tighter upper bounds on the log-partition function. We also show empirically that our approximations of the marginals are significantly more accurate. Moreover, we demonstrate the advantage of the new constraints for finding the MAP assignment in protein structure prediction.

1 Introduction

Graphical models such as Markov Random Fields (MRFs) have been successfully applied to a wide variety of fields, from computer vision to computational biology. From the point of view of inference, we are generally interested in two tasks: finding the marginal probabilities of specific subsets of the variables, and finding the Maximum a Posteriori (MAP) assignment. Both tasks generally require approximate methods. We focus on a particular class of variational approximation methods that cast the inference problem as a non-linear optimization problem over the marginal polytope, the set of valid marginal probabilities. The selection of appropriate marginals from the marginal polytope is guided by the (non-linear) entropy function. Both the marginal polytope and the entropy are difficult to characterize in general, reflecting the hardness of exact inference calculations. Most message-passing algorithms for evaluating marginals, including belief propagation and tree-reweighted sum-product (TRW), operate instead within the local consistency polytope, characterized by pairwise consistent marginals. For general graphs, this is an outer bound on the marginal polytope. Various approximations have also been suggested for the entropy function. For example, in the TRW algorithm [8], the entropy is decomposed into a weighted combination of entropies of tree-structured distributions.

Our goal here is to provide tighter outer bounds on the marginal polytope. We show how this can be achieved efficiently using a cutting-plane algorithm that iterates between solving a relaxed problem and adding additional constraints. Cutting-plane algorithms are a well-known technique for solving integer linear programs and are often used within combinatorial optimization. The key to such approaches is an efficient separation algorithm which, given an infeasible solution, can quickly find a violated constraint, generally from a very large class of valid constraints on the set of integral solutions.

The motivation for our approach comes from the cutting-plane literature for the maximum cut problem. Barahona et al. [3] showed that the MAP problem in pairwise binary MRFs is equivalent to a linear optimization over the cut polytope, which is the convex hull of all valid graph cuts. Tighter relaxations were obtained by using a separation algorithm together with the cutting-plane methodology. We extend this work by deriving a new class of outer bounds on the marginal polytope for non-binary and non-pairwise MRFs. The key realization is that valid constraints can be constructed by a series of projections onto the cut polytope.
More broadly, we seek to highlight emerging connections between polyhedral combinatorics and probabilistic inference.

2 Background

Markov Random Fields. Let $x \in \chi^n$ denote a random vector on $n$ variables, where, for simplicity, each variable $x_i$ takes on values in $\chi_i = \{0, 1, \ldots, k-1\}$. The MRF is specified by a set of $d$ real-valued potentials or sufficient statistics $\phi(x) = \{\phi_k(x)\}$ and a parameter vector $\theta \in \mathbb{R}^d$:
$$p(x; \theta) = \exp\{\langle \theta, \phi(x) \rangle - A(\theta)\}, \qquad A(\theta) = \log \sum_{x \in \chi^n} \exp\{\langle \theta, \phi(x) \rangle\},$$
where $\langle \theta, \phi(x) \rangle$ denotes the dot product of the parameters and the sufficient statistics. In pairwise MRFs, the potentials are restricted to be over at most pairs of variables, corresponding to the edges of the graph. We assume that the potentials are indicator functions, e.g., $\phi_{i;s}(x) = \delta(x_i = s)$, and make use of the following notation: $\mu_{i;s} = E_\theta[\phi_{i;s}(x)] = p(x_i = s; \theta)$ and $\mu_{ij;st} = E_\theta[\phi_{ij;st}(x)] = p(x_i = s, x_j = t; \theta)$.

Variational inference. The inference task is to evaluate the mean vector $\mu = E_\theta[\phi(x)]$. The log-partition function $A(\theta)$, a convex function of $\theta$, plays a critical part in these calculations. In particular, we can write the log-partition function in terms of its Fenchel-Legendre conjugate [9]:
$$A(\theta) = \sup_{\mu \in \mathcal{M}} \{\langle \theta, \mu \rangle - B(\mu)\}, \qquad (1)$$
where $B(\mu) = -H(\mu)$ is the negative entropy of the distribution parameterized by $\mu$ and is also convex. $\mathcal{M}$ is the set of realizable mean vectors $\mu$, known as the marginal polytope. More precisely,
$$\mathcal{M} := \{\mu \in \mathbb{R}^d \mid \exists\, p(x) \text{ s.t. } \mu = E_p[\phi(x)]\}.$$
The value $\mu \in \mathcal{M}$ that maximizes (1) is precisely the desired mean vector corresponding to $\theta$. Both $\mathcal{M}$ and the entropy $H(\mu)$ are difficult to characterize in general and have to be approximated. We call the resulting approximate mean vectors pseudomarginals. Mean field algorithms optimize over an inner bound on the marginal polytope by restricting the marginal vectors to those coming from simpler, e.g., fully factored, distributions; this inner bound is non-convex, but the entropy can be evaluated exactly because the distributions are simple. Alternatively, we can relax the optimization to an outer bound on the marginal polytope and also bound the entropy function.

Most message-passing algorithms for evaluating marginal probabilities obtain locally consistent beliefs, so that the pseudomarginals over the edges agree with the singleton pseudomarginals at the nodes. The solution is therefore sought within the local consistency polytope
$$\text{LOCAL}(G) = \Big\{ \mu \geq 0 \;\Big|\; \textstyle\sum_{s} \mu_{i;s} = 1, \;\; \sum_{t} \mu_{ij;st} = \mu_{i;s} \Big\}. \qquad (2)$$
Clearly $\mathcal{M} \subseteq \text{LOCAL}(G)$, since true marginals are also locally consistent. For trees, $\mathcal{M} = \text{LOCAL}(G)$, and for general graphs both polytopes have the same integral vertices [9, 6]. Belief propagation [13] can be seen as optimizing pseudomarginals over LOCAL(G) with a (non-convex) Bethe approximation to the entropy. The tree-reweighted sum-product algorithm (TRW) [8], on the other hand, uses a concave upper bound on the entropy, expressed as a convex combination of entropies corresponding to spanning trees of the original graph. The log-determinant relaxation [10] is instead based on a semi-definite outer bound on the marginal polytope combined with a Gaussian approximation to the entropy function. Since the moment matrix $M_1(\mu) = E_\theta[(1\; x)^T (1\; x)]$ has to be positive semi-definite, the outer bound is obtained simply by requiring that the pseudomarginals lie in $\text{SDEF}_1(K_n) = \{\mu \in \mathbb{R}_+^d \mid M_1(\mu) \succeq 0\}$.
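Several of the relaxations above, as well as the cutting-plane algorithm developed below, operate over LOCAL(G). As a concrete illustration (a minimal sketch with our own variable ordering and function name, not code from the paper), the constraints in (2) can be assembled as an explicit linear system:

```python
# Sketch only: LOCAL(G) from Eq. (2) as an equality system A_eq mu = b_eq, mu >= 0.
# The pseudomarginal vector stacks mu_{i;s} for all nodes, then mu_{ij;st} per edge.
import numpy as np

def local_consistency_constraints(n, edges, k):
    node_off = {i: i * k for i in range(n)}                         # index of mu_{i;0}
    edge_off = {e: n * k + m * k * k for m, e in enumerate(edges)}  # index of mu_{ij;00}
    d = n * k + len(edges) * k * k
    rows, rhs = [], []
    for i in range(n):                       # normalization: sum_s mu_{i;s} = 1
        a = np.zeros(d)
        a[node_off[i]:node_off[i] + k] = 1.0
        rows.append(a); rhs.append(1.0)
    for (i, j) in edges:                     # marginalization constraints
        for s in range(k):                   # sum_t mu_{ij;st} = mu_{i;s}
            a = np.zeros(d)
            a[edge_off[(i, j)] + s * k : edge_off[(i, j)] + (s + 1) * k] = 1.0
            a[node_off[i] + s] = -1.0
            rows.append(a); rhs.append(0.0)
        for t in range(k):                   # sum_s mu_{ij;st} = mu_{j;t}
            a = np.zeros(d)
            a[edge_off[(i, j)] + t : edge_off[(i, j)] + k * k : k] = 1.0
            a[node_off[j] + t] = -1.0
            rows.append(a); rhs.append(0.0)
    return np.vstack(rows), np.array(rhs)
```

Cutting planes found later by the separation algorithm are simply appended to such a system as additional inequality rows.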
Maximum a posteriori. The marginal polytope also plays a critical part in finding the MAP assignment. The problem is to find an assignment $x \in \chi^n$ which maximizes $p(x; \theta)$, or equivalently:
$$\max_{x \in \chi^n} \log p(x; \theta) = \max_{x \in \chi^n} \langle \theta, \phi(x) \rangle - A(\theta) = \sup_{\mu \in \mathcal{M}} \langle \theta, \mu \rangle - A(\theta), \qquad (3)$$
where the log-partition function $A(\theta)$ remains a constant and can be ignored. The last equality holds because the optimal value of the linear program is obtained at a vertex (integral solution). When the MAP assignment $x^*$ is unique, the maximizing $\mu^* = \phi(x^*)$.

Cycle inequalities. The marginal polytope can be defined by the intersection of a large number of linear inequalities. We focus on inequalities beyond those specifying LOCAL(G), in particular the cycle inequalities [4, 2, 6]. Assume the variables are binary. Given an assignment $x \in \{0,1\}^n$, edge $(i,j) \in E$ is cut if $x_i \neq x_j$. The cycle inequalities arise from the observation that a cycle must have an even (possibly zero) number of cut edges. Suppose we start at node $i$, where $x_i = 0$. As we traverse the cycle, the assignment changes each time we cross a cut edge. Since we must return to $x_i = 0$, the assignment can only change an even number of times. For a cycle $C$ and any $F \subseteq C$ such that $|F|$ is odd, this constraint can be written as
$$\sum_{(i,j) \in C \setminus F} \delta(x_i \neq x_j) + \sum_{(i,j) \in F} \delta(x_i = x_j) \geq 1.$$
Since this constraint is valid for all assignments $x \in \{0,1\}^n$, it also holds in expectation. Thus
$$\sum_{(i,j) \in C \setminus F} (\mu_{ij;10} + \mu_{ij;01}) + \sum_{(i,j) \in F} (\mu_{ij;00} + \mu_{ij;11}) \geq 1 \qquad (4)$$
is valid for any $\mu \in \mathcal{M}_{\{0,1\}}$, the marginal polytope of a binary pairwise MRF. For a chordless circuit $C$, the cycle inequalities are facets of $\mathcal{M}_{\{0,1\}}$ [4]. They suffice to characterize $\mathcal{M}_{\{0,1\}}$ for a graph $G$ if and only if $G$ has no $K_4$-minor.

Although there are exponentially many cycles and cycle inequalities for a graph, Barahona and Mahjoub [4, 6] give a simple algorithm to separate the whole class of cycle inequalities. To see whether any cycle inequality is violated, construct the undirected graph $G' = (V', E')$ where $V'$ contains nodes $i_1$ and $i_2$ for each $i \in V$, and, for each $(i,j) \in E$, the edges in $E'$ are: $(i_1, j_1)$ and $(i_2, j_2)$ with weight $\mu_{ij;10} + \mu_{ij;01}$, and $(i_1, j_2)$ and $(i_2, j_1)$ with weight $\mu_{ij;00} + \mu_{ij;11}$. Then, for each node $i \in V$, we find the shortest path in $G'$ from $i_1$ to $i_2$. The shortest of all these paths will not use both copies of any node $j$ (otherwise the path from $j_1$ to $j_2$ would be shorter), and so defines a cycle in $G$ and gives the minimum value of $\sum_{(i,j) \in C \setminus F} (\mu_{ij;10} + \mu_{ij;01}) + \sum_{(i,j) \in F} (\mu_{ij;00} + \mu_{ij;11})$. If this value is less than 1, we have found a violated cycle inequality; otherwise, $\mu$ satisfies all cycle inequalities. Using Dijkstra's shortest-paths algorithm with a Fibonacci heap, the separation problem can be solved in time $O(n^2 \log n + n|E|)$.

Algorithm 1 Cutting-plane algorithm for probabilistic inference
1: $R \leftarrow \text{LOCAL}(G)$
2: repeat
3:   $\mu^* \leftarrow \arg\max_{\mu \in R} \{\langle \theta, \mu \rangle - B(\mu)\}$
4:   Choose projection graph $G_\pi$, e.g. single, $k$, or full
5:   $C \leftarrow \text{SeparateCycleInequalities}(G_\pi, \pi(\mu^*))$
6:   $R \leftarrow R \cap C$
7: until $C = \emptyset$

3 Cutting-plane algorithm

Our main result is the proposed Algorithm 1, given above. The algorithm alternates between solving for an upper bound on the log-partition function (see Eq. 1) and tightening the outer bound on the marginal polytope by incorporating valid constraints that are violated by the current pseudomarginals. The projection graph (line 4) is not needed for binary pairwise MRFs and will be described in the next section. Any class of valid constraints for the marginal polytope with an efficient separation algorithm can be used in line 5; other examples besides the cycle inequalities include the odd-wheel and bicycle odd-wheel inequalities [6].
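The separation subroutine invoked in line 5 is the shortest-path construction described above. The following is a minimal Python sketch of it for a binary pairwise MRF; the data layout and naming are our own, it uses a binary heap rather than a Fibonacci heap, and it returns only the minimum left-hand side of (4) rather than recovering the violated cycle itself.

```python
# Sketch of the Barahona-Mahjoub separation routine on the doubled graph G'.
import heapq
from collections import defaultdict

def separate_cycle_inequalities(nodes, mu_edge):
    """mu_edge[(i, j)] = (mu00, mu01, mu10, mu11) for each edge (i, j).
    Returns min over cycles C and odd F of the left-hand side of Eq. (4);
    a value below 1 certifies a violated cycle inequality."""
    adj = defaultdict(list)        # doubled graph G' with node copies (i, 0), (i, 1)
    for (i, j), (m00, m01, m10, m11) in mu_edge.items():
        w_cut, w_uncut = m10 + m01, m00 + m11
        for a in (0, 1):
            adj[(i, a)].append(((j, a), w_cut))        # same-copy edge: C \ F term
            adj[(j, a)].append(((i, a), w_cut))
            adj[(i, a)].append(((j, 1 - a), w_uncut))  # copy-switching edge: F term
            adj[(j, a)].append(((i, 1 - a), w_uncut))

    def shortest_path(src, dst):
        dist = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            if d > dist.get(u, float("inf")):
                continue
            for v, w in adj[u]:
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        return float("inf")

    # a path from i_1 to i_2 switches copies an odd number of times, so |F| is odd
    return min(shortest_path((i, 0), (i, 1)) for i in nodes)
```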
We start the algorithm with the loose outer bound on the marginal polytope given by the local consistency constraints. Tighter initial constraints, e.g., $M_1(\mu) \succeq 0$, are possible as well. Any entropy approximation $B(\mu)$ can be used, so long as we can efficiently solve the optimization problem in line 3. The log-determinant and TRW entropy approximations have two appealing features. First, as upper bounds on the entropy, they permit the algorithm to be used for obtaining tighter upper bounds on the log-partition function. Second, the objective functions to be maximized are concave and can be optimized efficiently using conditional gradient or other methods. When the algorithm terminates, we can use the last $\mu^*$ vector as an approximation to the single node and pairwise marginals; the results given in Section 5 use this method. The algorithm for MAP is the same, excluding the entropy function in line 3; the optimization is then simply a linear program. Since all integral vectors in the relaxation $R$ are extreme points of the marginal polytope, any integral $\mu^*$ gives the MAP assignment.

Figure 1: Illustration of the projection for one edge $(i,j) \in E$ where $\chi_i = \{0,1,2\}$ and $\chi_j = \{0,1,2,3\}$. The projection graph $G_\pi$, shown on the right, has 3 partitions for $i$ and 7 for $j$.

4 New outer bounds

In this section we give a new class of valid inequalities for the marginal polytope of non-binary and non-pairwise MRFs, and show how to efficiently separate this exponentially large set of inequalities. The key theoretical idea is to project the marginal polytope onto different binary marginal polytopes. Aggregation and projection are well-known techniques in polyhedral combinatorics for obtaining valid inequalities [6, 5]. Given a linear projection $\pi(x) = Ax$, any valid inequality $c^T \pi(x) \leq b$ for $\pi(x)$ also gives the valid inequality $c^T A x \leq b$ for $x$. Prior work has used aggregation of the nodes of the graph. Instead, we obtain new inequalities by aggregating the states of each variable.

For each variable $i$, let $\pi_i^q$ be a partition of its values into two non-empty sets, i.e., the map $\pi_i^q : \chi_i \to \{0,1\}$ is surjective. Let $\Pi_i = \{\pi_i^1, \pi_i^2, \ldots\}$ be a collection of partitions of node $i$; we permit each variable to have a different number of partitions. Define the projection graph $G_\pi = (V_\pi, E_\pi)$ so that there is a node for each partition in each collection $\Pi_i$, and such nodes are fully connected across adjacent variables:
$$V_\pi = \bigcup_i \Pi_i, \qquad E_\pi = \{(\pi_i^q, \pi_j^r) \mid (i,j) \in E,\; q \leq |\Pi_i|,\; r \leq |\Pi_j|\}. \qquad (5)$$

Definition 1. The linear map $\pi$ takes $\mu \in \mathcal{M}$ and, for each node $v = \pi_i^q \in V_\pi$, assigns
$$\mu^\pi_v = \sum_{s \in \chi_i \,\text{s.t.}\, \pi_i^q(s) = 1} \mu_{i;s},$$
and, for each edge $e = (\pi_i^q, \pi_j^r) \in E_\pi$, assigns
$$\mu^\pi_e = \sum_{s_i \in \chi_i,\, s_j \in \chi_j \,\text{s.t.}\, \pi_i^q(s_i) = 1,\, \pi_j^r(s_j) = 1} \mu_{ij;s_i s_j}.$$

We call the graph consisting of all possible node partitions and all possible edges the full projection graph (see Figure 1). To construct valid inequalities for each projection we need to characterize the image space. Let $\mathcal{M}_{\{0,1\}}(G_\pi)$ denote the binary marginal polytope of the projection graph.

Theorem 1. The projection maps the marginal polytope into $\mathcal{M}_{\{0,1\}}(G_\pi)$, i.e., $\pi : \mathcal{M} \to \mathcal{M}_{\{0,1\}}(G_\pi)$.

Proof. Since $\pi$ is a linear map, it suffices to show that, for every extreme point $\mu \in \mathcal{M}$, $\pi(\mu) \in \mathcal{M}_{\{0,1\}}(G_\pi)$. The extreme points of $\mathcal{M}$ correspond one-to-one with assignments $x \in \chi^n$. Given an extreme point $\mu \in \mathcal{M}$ and a variable $v = \pi_i^q \in V_\pi$, define $x^\pi(\mu)_v = \sum_{s \in \chi_i \,\text{s.t.}\, \pi_i^q(s)=1} \mu_{i;s}$. Since $\mu$ is an extreme point, $\mu_{i;s} = 1$ for exactly one value $s$, which implies that $x^\pi(\mu) \in \{0,1\}^{|V_\pi|}$. Then $\pi(\mu) = E[\phi(x^\pi(\mu))]$, showing that $\pi(\mu) \in \mathcal{M}_{\{0,1\}}(G_\pi)$.

This result allows valid inequalities for $\mathcal{M}_{\{0,1\}}(G_\pi)$ to carry over to $\mathcal{M}$.
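As a concrete illustration of Definition 1, the following sketch (our own data layout and naming, not the authors' code) computes the projected node and edge pseudomarginals for a given collection of partitions, each represented by the set of states it maps to 1.

```python
# Sketch of the linear map pi from Definition 1. A partition pi_i^q is given as the
# set S of states of variable i with pi_i^q(s) = 1.
def project_pseudomarginals(mu_node, mu_edge, partitions):
    """mu_node[i][s] = mu_{i;s}; mu_edge[(i, j)][(s, t)] = mu_{ij;st};
    partitions[i] = list of state subsets, one per partition of variable i."""
    proj_node, proj_edge = {}, {}
    for i, parts in partitions.items():
        for q, S in enumerate(parts):
            proj_node[(i, q)] = sum(mu_node[i][s] for s in S)            # mu^pi_v
    for (i, j), table in mu_edge.items():
        for q, S_i in enumerate(partitions[i]):
            for r, S_j in enumerate(partitions[j]):
                proj_edge[((i, q), (j, r))] = sum(                       # mu^pi_e
                    table[(s, t)] for s in S_i for t in S_j)
    return proj_node, proj_edge
```

For the single projection graph each partitions[i] contains one entry; for the full projection graph it enumerates all non-trivial partitions of each variable's states.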
In general the projection will not be surjective. Suppose every variable has $k$ states. The single projection graph, in which $|\Pi_i| = 1$ for all $i$, has one node per variable; for it the projection is surjective [1]. The full projection graph has $O(2^k)$ nodes per variable. A cutting-plane algorithm may begin by projecting onto a small graph, then expand to larger graphs only after satisfying all inequalities given by the smaller one. The $k$-projection graph $G_k = (V_k, E_k)$ has $k$ nodes per variable, one for each state:
$$V_k = \{v_{i;s} \mid i \in V,\; s \in \chi_i\}, \qquad E_k = \{(v_{i;s}, v_{j;t}) \mid (i,j) \in E,\; s \in \chi_i,\; t \in \chi_j\}. \qquad (6)$$

Definition 2. The linear map $\pi_k$ takes $\mu \in \mathcal{M}$ and, for each node $v_{i;s} \in V_k$, assigns $\mu^\pi_v = \mu_{i;s}$, and, for each edge $(v_{i;s}, v_{j;t}) \in E_k$, assigns $\mu^\pi_e = \mu_{ij;st}$.

We could also have defined this projection by giving the corresponding partitions and using Definition 1; thus, the result of Theorem 1 applies.

These projections yield a new class of cycle inequalities for the marginal polytope. Consider a single projection graph $G_\pi$, a cycle $C$ in $G_\pi$, and any $F \subseteq C$ with $|F|$ odd. Let $\pi_i$ be the partition for node $i$. We obtain the following valid inequality for $\mu \in \mathcal{M}$ by applying the projection and then the cycle inequality:
$$\sum_{(i,j) \in C \setminus F} \mu^\pi(x_i \neq x_j) + \sum_{(i,j) \in F} \mu^\pi(x_i = x_j) \geq 1, \qquad (7)$$
where
$$\mu^\pi(x_i \neq x_j) = \sum_{s_i \in \chi_i,\, s_j \in \chi_j \,\text{s.t.}\, \pi_i(s_i) \neq \pi_j(s_j)} \mu_{ij;s_i s_j}, \qquad (8)$$
$$\mu^\pi(x_i = x_j) = \sum_{s_i \in \chi_i,\, s_j \in \chi_j \,\text{s.t.}\, \pi_i(s_i) = \pi_j(s_j)} \mu_{ij;s_i s_j}. \qquad (9)$$
It is revealing to contrast (7) with $\sum_{(i,j) \in C \setminus F} \delta(x_i \neq x_j) + \sum_{(i,j) \in F} \delta(x_i = x_j) \geq 1$. For non-binary $x \in \chi^n$, the latter holds only for $|F| = 1$; we can only obtain the more general inequality by fixing a partition of each node's states.

Theorem 2. For every single projection graph $G_\pi$ and every cycle inequality arising from a chordless circuit $C$ on $G_\pi$, there exists $\mu \in \text{LOCAL}(G) \setminus \mathcal{M}$ that violates that inequality.

Proof. For each variable $i \in V$, choose $s_i, t_i$ such that $\pi_i(s_i) = 1$ and $\pi_i(t_i) = 0$. Assign $\mu_{i;q} = 0$ for $q \in \chi_i \setminus \{s_i, t_i\}$. Similarly, for every $(i,j) \in E$, assign $\mu_{ij;qr} = 0$ whenever $q \in \chi_i \setminus \{s_i, t_i\}$ or $r \in \chi_j \setminus \{s_j, t_j\}$. The polytope resulting from the projection of $\mathcal{M}$ onto the remaining variables is isomorphic to $\mathcal{M}_{\{0,1\}}$ for the graph $G$. Barahona and Mahjoub [4] showed that the cycle inequality on the chordless circuit $C$ is facet-defining for $\mathcal{M}_{\{0,1\}}$. Since $C$ is over at least 3 variables from $G$, it cannot be a facet of LOCAL(G). Let $\text{LOCAL}_{\{0,1\}}$ be the projection of LOCAL(G) onto the remaining variables. Thus, there exists $\mu \in \text{LOCAL}_{\{0,1\}} \setminus \mathcal{M}_{\{0,1\}}$ violating the inequality, and we can assign $\mu$ accordingly.

Note that the theorem implies that the projected cycle inequalities are strictly tighter than LOCAL(G), but it does not characterize how much is gained. If all $n$ variables have $k$ values, then there are $O((2^k)^n)$ different single projection graphs. Instead of attempting to separate each graph individually, it suffices to consider just the full projection graph. Thus, even though the projection is not surjective, the full projection graph allows us to efficiently obtain a relaxation at least as tight as the one given by any combination of single projection graphs.

Theorem 3. The separation problem for all cycle inequalities (7) over all single projection graphs, when we allow some additional valid inequalities for $\mathcal{M}$, can be solved in time $O(\text{poly}(n, 2^k))$.

Proof. For every cycle inequality in a single projection graph, there is an equivalent cycle inequality in the full projection graph. Thus, by separating all cycle inequalities in the full projection graph, which has $O(n 2^k)$ nodes, we can only get a tighter relaxation. We showed earlier that the separation problem of cycle inequalities for the binary marginal polytope can be solved in time polynomial in the size of the graph.
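To make the separation of Theorem 3 concrete, the following sketch (again our own construction, not the authors' code) enumerates all non-trivial partitions of each variable, $2^{k-1}-1$ per variable as in Figure 1, and computes the sums (8) and (9) for every edge of the full projection graph; these pairs are exactly the edge weights required by the shortest-path separation routine of Section 2.

```python
# Sketch: edge weights of the full projection graph for separating inequalities (7).
from itertools import combinations

def nontrivial_partitions(states):
    """All subsets S with 0 < |S| < |states|, keeping one of each complementary pair."""
    states = list(states)
    parts = []
    for r in range(1, len(states)):
        for S in combinations(states, r):
            S = frozenset(S)
            if frozenset(states) - S not in parts:
                parts.append(S)
    return parts

def full_projection_weights(mu_edge, states):
    """mu_edge[(i, j)][(s, t)] = mu_{ij;st}; states[i] = list of states of variable i.
    Returns weights[((i, q), (j, r))] = (mu_pi_neq, mu_pi_eq), i.e. Eqs. (8) and (9)."""
    parts = {i: nontrivial_partitions(sts) for i, sts in states.items()}
    weights = {}
    for (i, j), table in mu_edge.items():
        for q, S_i in enumerate(parts[i]):
            for r, S_j in enumerate(parts[j]):
                neq = sum(m for (s, t), m in table.items() if (s in S_i) != (t in S_j))
                eq = sum(m for (s, t), m in table.items() if (s in S_i) == (t in S_j))
                weights[((i, q), (j, r))] = (neq, eq)
    return weights
```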
Non-pairwise Markov random fields. These results can be applied to non-pairwise MRFs by first projecting the marginal vector onto the marginal polytope of a pairwise MRF. For example, suppose we have an MRF with a potential on the variables $i, j, k$, so that the marginal vector has entries $\mu_{ijk;stw}$ for $s \in \chi_i$, $t \in \chi_j$, $w \in \chi_k$. After projection, we have the pairwise variables $\mu_{ij;st}$, $\mu_{jk;tw}$, and $\mu_{ik;sw}$. More generally, suppose we include additional variables corresponding to the joint probability of a cluster of variables. We need to add constraints enforcing that all variables in common between two clusters $C_o$ and $C_p$ have the same marginals. Let $V_c = C_o \cap C_p$ be the variables in common between the clusters, and let $V_o = C_o \setminus V_c$ and $V_p = C_p \setminus V_c$ be the other variables. Define $\chi_c$ to be all possible assignments to the variables in the set $V_c$. Then, for each $x \in \chi_c$, include the constraint
$$\sum_{y \in \chi_o} \mu_{C_o;xy} = \sum_{z \in \chi_p} \mu_{C_p;xz}.$$
For pairwise clusters these are simply the usual local consistency constraints. We can now apply the projections of the previous section, considering various partitions of each cluster variable, to obtain a tighter relaxation of the marginal polytope.

Figure 2: Accuracy of pseudomarginals on a 10-node complete graph (100 trials). The two panels show the average $\ell_1$ error of the TRW and log-determinant variants (plain, + PSD, + Cycle, + Marg) as a function of the coupling strength, with coupling drawn from U[-x, x] and field from U[-1, 1].

5 Experiments

Computing marginals. We experimented with Algorithm 1 using both the log-determinant [10] and the TRW [8] entropy approximations. These trials are on Ising models, which are pairwise MRFs with $x_i \in \{-1, 1\}$ and potentials $\phi_i(x) = x_i$ for $i \in V$ and $\phi_{ij}(x) = x_i x_j$ for $(i,j) \in E$. Although TRW can efficiently optimize over the spanning tree polytope, for these experiments we simply use a weighted distribution over spanning trees, where each tree's weight is the sum of the absolute values of its edge weights $\theta_{ij}$. The edge appearance probabilities for this distribution can be efficiently computed using the Matrix Tree Theorem [11]. We optimize the TRW objective with conditional gradient, using linear programming after each gradient step to project onto $R$ (a schematic sketch of this loop is given below). We used the glpkmex and YALMIP optimization packages within Matlab, and wrote the separation algorithm for the cycle inequalities in Java.

In Figure 2 we show results for 10-node complete graphs with $\theta_i \sim U[-1, 1]$ and $\theta_{ij} \sim U[-x, x]$, where $x$ is the coupling strength shown in the figure. For each data point we averaged the results over 100 trials. The y-axis shows the average $\ell_1$ error of the single node marginals. These MRFs are highly coupled, and loopy belief propagation (not shown) with a .5 decay rate seldom converges. The TRW and log-determinant algorithms, optimizing over the local consistency polytope, give pseudomarginals only slightly better than loopy BP. Even adding the positive semi-definite constraint $M_1(\mu) \succeq 0$, for which TRW must be optimized using conditional gradient with semidefinite programming for the projection step, does not improve the accuracy by much. However, both entropy approximations give significantly better pseudomarginals when used by our algorithm together with the cycle inequalities (see "TRW + Cycle" and "Logdet + Cycle" in the figure).
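The conditional-gradient loop referred to above can be summarized as follows. This is a schematic sketch of the standard Frank-Wolfe method (our own illustration, not the authors' Matlab code): grad_objective supplies the gradient of $\langle\theta,\mu\rangle - B(\mu)$ for the chosen entropy approximation, the relaxation $R$ is given by linear constraints, and each step solves the linear subproblem with an off-the-shelf LP solver.

```python
# Schematic conditional-gradient (Frank-Wolfe) loop for line 3 of Algorithm 1 over
# R = {mu >= 0 | A_eq mu = b_eq, A_ub mu <= b_ub}; mu0 must be feasible.
import numpy as np
from scipy.optimize import linprog

def conditional_gradient(grad_objective, mu0, A_eq, b_eq, A_ub=None, b_ub=None,
                         iters=200):
    mu = np.array(mu0, dtype=float)
    for t in range(iters):
        g = grad_objective(mu)
        # Linear maximization oracle over R, solved as an LP (linprog minimizes).
        res = linprog(-g, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, None), method="highs")
        s = res.x
        gamma = 2.0 / (t + 2.0)        # standard diminishing step size
        mu = mu + gamma * (s - mu)     # convex combination stays inside R
    return mu
```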
For small MRFs, we can exactly represent the marginal polytope as the convex hull of its $2^n$ vertices. We found that the cycle inequalities give nearly as good accuracy as the exact marginal polytope (see "TRW + Marg" and "Logdet + Marg"). Our work sheds some light on the relative value of the entropy approximation compared to the relaxation of the marginal polytope. When the MRF is weakly coupled, both entropy approximations do reasonably well using the local consistency polytope. This is not surprising: the limit of weak coupling is a fully disconnected graph, for which both the entropy approximation and the marginal polytope relaxation are exact. With the local consistency polytope, both entropy approximations get steadily worse as the coupling increases. In contrast, using the exact marginal polytope, we see a peak in the error at coupling strength 2, followed by a steady improvement in accuracy as the coupling grows. This occurs because the limit of strong coupling is the MAP problem, for which using the exact marginal polytope gives exact results. The interesting region is near the peak, where the entropy term is neither exact nor outweighed by the coupling. Our algorithm seems to "solve" the part of the problem caused by the local consistency relaxation: TRW's error goes from .33 to .15, and log-determinant's error from .17 to .076. The fact that neither entropy approximation can achieve error below .07, even with the exact marginal polytope, motivates further research on improving this part of the approximation.

Figure 3: Convergence with the TRW entropy, $\theta_i \sim U[-1,1]$ and $\theta_{ij} \sim U[-4,4]$. The panels show the average $\ell_1$ error and the average prediction error as a function of the number of cutting-plane iterations, for 10x10 grids and for 20-node complete graphs.

Figure 4: MAP for protein side-chain prediction with the Rosetta energy function; number of cutting-plane iterations versus the number of amino acids (variables).

Next, we looked at the number of iterations (of the loop in Algorithm 1) the algorithm takes before all cycle inequalities are satisfied. In each iteration we add to $R$ at most $n$ violated cycle inequalities, coming from the $n$ shortest paths (in practice many fewer are added, since not all cycles in $G'$ are simple cycles in $G$). In Figure 3 we show boxplots of the $\ell_1$ error for 10x10 grid MRFs over 40 trials, where $\theta_i \sim U[-1,1]$ and $\theta_{ij} \sim U[-4,4]$, using the TRW entropy approximation. We also show whether the pseudomarginals are on the correct side of .5, which is what matters if we do prediction based on the results of approximate inference. The middle line gives the median, the boxes show the upper and lower quartiles, and the whiskers show the extent of the data. Iteration 1 corresponds to TRW with only the local consistency constraints. All of the cycle inequalities were satisfied within 10 iterations. After only 5 iterations, the median $\ell_1$ error in the single node marginals dropped from over .35 to under .2. We observed the same convergence behavior on a 30x30 grid, although we could not assess the accuracy there due to the difficulty of computing exact marginals. From these results, we predict that our algorithm will be both fast and accurate on larger structured models.
While these results are promising, real-world MRFs may have different structure, so we next looked at the other extreme, giving analogous results for 20-node complete MRFs over 10 trials (Figure 3). Here, the algorithm took many more iterations before all cycle inequalities were satisfied. While the improvement in the average $\ell_1$ error is roughly monotonic as the number of iterations increases, the change in the prediction error is certainly not. Despite this, the eventual improvement in prediction is striking, with the median prediction error going from .5 (as bad as a coin flip) to .1.

Protein side-chain prediction. We next applied our algorithm to the problem of predicting protein side-chain configurations. Given the 3-dimensional structure of a protein's backbone, the task is to predict the relative angle of each amino acid's side-chain. The angles are discretized into at most 45 states. Yanover et al. [12] showed that minimization of the Rosetta energy function corresponds to finding the MAP assignment of a non-binary pairwise MRF. They also showed that the tree-reweighted max-product algorithm can be used to solve the LP relaxation given by LOCAL(G), and that this succeeds in finding the MAP assignment for 339 of the 369 proteins in their data set. However, the optimal solution to the LP relaxation for the remaining 30 proteins, arguably the most difficult ones, is fractional.

Using the $k$-projection graph and projected cycle inequalities, we succeeded in finding the MAP assignment for all of these proteins except the protein `1rl6'. We show in Figure 4 the number of cutting-plane iterations needed for each of the 30 proteins. In each iteration, we solve the LP relaxation and, if the solution is not integral, run the separation algorithm to find violated inequalities. For the protein `1rl6', after 12 cutting-plane iterations the solution was not integral, and we could not find any violated cycle inequalities using the $k$-projection graph. We then tried using the full projection graph, and found the MAP after just one additional iteration. Figure 4 also shows one of the cycle inequalities (7) in the full projection graph that was found to be violated; the cut edges indicate the 3 edges in $F$. The violating $\mu^*$ had $\mu_{36;s} = .1667$ for $s \in \{0,1,2,3,4,5\}$, $\mu_{38;6} = .3333$, $\mu_{38;4} = .6667$, $\mu_{43;s} = .1667$ for $s \in \{1,2,4,5\}$, $\mu_{43;3} = .3333$, and zero for all other states of these variables. This example shows that the relaxation given by the full projection graph is strictly tighter than that of the $k$-projection graph.

The commercial linear programming solver CPLEX 10.0 solves each LP relaxation in under 75 seconds. Using simple heuristics, the separation algorithm runs in seconds, and we find each protein's MAP assignment in under 11.3 minutes. Kingsford et al. [7] found, and we also observed, that CPLEX's branch-and-cut algorithm for solving integer linear programs also works well for these problems. One interesting future direction would be to combine the two approaches, using our new outer bounds within the branch-and-cut scheme. Our results show that the new outer bounds are powerful, allowing us to find the MAP solution for all of these MRFs, and suggesting that using them will also lead to significantly more accurate marginals for non-binary MRFs.

6 Conclusion

The facial structure of the cut polytope, equivalently the binary marginal polytope, has been well studied over the last twenty years.
The cycle inequalities are just one of many large classes of valid inequalities for the cut polytope for which efficient separation algorithms are known. Our theoretical results can be used to derive outer bounds for the marginal polytope from any of the valid inequalities on the cut polytope. Our approach is particularly valuable because it takes advantage of the sparsity of the graph, and only uses additional constraints when they are guaranteed to affect the solution. The results in this paper lead to several interesting open problems. Which of the inequalities obtained through projection are facet-defining for the marginal polytope? Can we develop new message-passing algorithms which incorporate cycle (and other) inequalities, to use as an efficient inner loop in the cutting-plane algorithm?

References

[1] Anonymous. Supplementary material for "New bounds on the marginal polytope". 2007.
[2] F. Barahona. On cuts and matchings in planar graphs. Mathematical Programming, 60:53-68, 1993.
[3] F. Barahona, M. Grötschel, M. Jünger, and G. Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493-513, 1988.
[4] F. Barahona and A. R. Mahjoub. On the cut polytope. Mathematical Programming, 36:157-173, 1986.
[5] R. Borndörfer and R. Weismantel. Discrete relaxations of combinatorial programs. Discrete Applied Mathematics, 112(1-3):11-26, 2001.
[6] M. M. Deza and M. Laurent. Geometry of Cuts and Metrics, volume 15 of Algorithms and Combinatorics. Springer, 1997.
[7] C. L. Kingsford, B. Chazelle, and M. Singh. Solving and analyzing side-chain positioning problems using linear and integer programming. Bioinformatics, 21(7):1028-1039, 2005.
[8] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313-2335, July 2005.
[9] M. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference. Technical Report 649, UC Berkeley, Dept. of Statistics, 2003.
[10] M. Wainwright and M. I. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099-2109, June 2006.
[11] D. B. West. Introduction to Graph Theory. Prentice Hall, 2001.
[12] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation: an empirical study. JMLR Special Issue on Machine Learning and Large Scale Optimization, 7:1887-1907, September 2006.
[13] J. Yedidia, W. Freeman, and Y. Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Technical Report 16, Mitsubishi Electric Research Lab, 2001.