Clusters and Coarse Partitions in LP Relaxations

David Sontag, Amir Globerson, Tommi Jaakkola
MIT, CSAIL

Abstract

We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency of the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message-passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly.

1 Introduction

A common inference task in graphical models is finding the most likely setting of the values of the variables (the MAP assignment). Many important practical problems can be formulated as MAP problems. For example, in the so-called protein design problem [10], we try to select amino acid residues and the orientation of their side-chains along a fixed (pre-selected) backbone structure so as to minimize the overall energy. The complexity of this task depends on the structure of dependencies between the variables (the graph structure) and is NP-hard in general. The dependencies in protein design tasks are quite dense, especially for globular proteins, rendering standard exact inference algorithms useless in such cases.

A great deal of effort has been spent recently on developing approximate algorithms for the MAP problem. One promising approach is based on linear programming relaxations, solved via message-passing algorithms akin to belief propagation [3, 5]. In this approach, the MAP problem is first cast as an integer program and subsequently relaxed to a linear program by removing the integrality constraints. Whenever the relaxed solution is integral, it is guaranteed to be the optimal solution. However, this happens only if the relaxation is sufficiently "tight".

Relaxations can be made increasingly tight by introducing LP variables that correspond to clusters of variables in the original model. For example, using clusters corresponding to the cliques of the graph structure would guarantee the exact MAP solution. The complexity of the relaxation, however, grows exponentially with the number of variables in the cluster. This is particularly burdensome when the variables can take on a large number of possible values or states. Consider, for example, a problem where each variable has 100 states (cf. protein design). Using clusters of k variables means adding 100^k LP variables, keeping k at most three or four in practice. Recent empirical evidence nevertheless shows that even with relatively small clusters, the exact MAP assignment can be found in problems such as stereo vision and protein design. However, in the protein design case, for example, even clusters of size three come with a significant cost.

In this paper we focus on introducing tighter relaxations at a coarser level of granularity. The key observation is that it may not be necessary to represent all the possible joint states of a cluster of variables.
Instead, we partition the cluster states at a coarser level and enforce consistency only across such partitions. The approach removes the size of the cluster as a consideration and instead focuses on resolving currently ambiguous settings of the variables. Following the approach of [3], we formulate a dual LP for the partition-based LP relaxations and derive a message-passing algorithm for optimizing the dual LP by coordinate descent. Unlike standard message-passing algorithms, the algorithm we derive involves passing messages between coarse and fine representations of the same set of variables.

1.1 MAP and its LP relaxation

We consider discrete pairwise Markov random fields on a graph G = (V, E), defined as the following exponential distribution (we do not use potentials on single nodes, \theta_i(x_i), since these can be folded into \theta_{ij}(x_i, x_j); our algorithm can also be derived with explicit \theta_i(x_i), and we omit the details for brevity):

  p(x; \theta) = \frac{1}{Z} \exp\Big( \sum_{ij \in E} \theta_{ij}(x_i, x_j) \Big)    (1)

Here \theta is a parameter vector specifying how pairs of variables in E interact. The MAP problem we consider is to find the most likely assignment of variables under p(x; \theta) (we assume that the evidence has already been incorporated into the model). This is equivalent to finding the assignment x^M that maximizes

  f(x; \theta) = \sum_{ij \in E} \theta_{ij}(x_i, x_j)    (2)

The resulting discrete optimization problem may also be cast as a linear program. Define \mu to be a vector of marginal probabilities associated with the interacting pairs of variables (edges), {\mu_{ij}(x_i, x_j)}_{ij \in E}, as well as {\mu_i(x_i)}_{i \in V} for the nodes. \mu cannot be chosen arbitrarily: the set of \mu's that arise from some joint distribution is known as the marginal polytope M(G) [8]. The MAP problem is then equivalent to the following linear program:

  \max_x f(x; \theta) = \max_{\mu \in M(G)} \mu \cdot \theta    (3)

where \mu \cdot \theta = \sum_{ij \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j) \mu_{ij}(x_i, x_j). The extreme points of the marginal polytope are integral, so there always exists a maximizing \mu that is integral and corresponds to x^M. Although the number of variables in this LP is only O(|E| + |V|), the difficulty comes from the exponential number of linear inequalities typically required to describe the marginal polytope M(G).

LP relaxations replace the difficult global constraint that the marginals in \mu arise from some common joint distribution by ensuring only that the marginals are locally consistent with one another. The most common such relaxation, pairwise consistency, enforces that the edge marginals are consistent with the node marginals, {\mu | \sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i)}. The integral extreme points of this local marginal polytope correspond to assignments. If a solution is obtained at one such extreme point, it is provably the MAP assignment. However, the local marginal polytope also contains fractional extreme points and, as a relaxation, will not in general be tight enough.

We are therefore interested in tightening the relaxation. There are many known ways to do so, including cycle inequalities [6] and semi-definite constraints [9]. However, perhaps the most straightforward approach is to add clusters to the model (cf. generalized belief propagation [11]) such that consistency is enforced at the level of clusters. More precisely, for a cluster c \subseteq {1, ..., n} indexing a subset of the variables, the edge distributions \mu_{ij}(x_i, x_j) within the cluster must arise as marginals from some joint distribution \mu_c(x_c) over the variables in the cluster. Define the tighter relaxation M_C(G) as the set of \mu \geq 0 satisfying

  \sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i)
  \mu_c(x_i, x_j) = \mu_{ij}(x_i, x_j)   \forall (i, j) \in c    (4)
  \sum_{x_c} \mu_c(x_c) = 1

where \mu_c(x_i, x_j) denotes the marginal of \mu_c(x_c) over the edge (i, j), i.e. \mu_c(x_i, x_j) = \sum_{x_c \setminus \{x_i, x_j\}} \mu_c(x_c). The relaxation is clearly tighter with more clusters. However, each cluster comes with a computational cost that grows exponentially with the number of variables in the cluster. We seek to offset this exponential cost by introducing sparse clusters.
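To make the pairwise relaxation concrete, the following is a minimal sketch of ours (the paper provides no code) that builds the local-consistency LP of this section for a toy triangle MRF and solves it with an off-the-shelf solver. The names theta, edges, and k are our own, and scipy's linprog stands in for whatever LP solver one prefers.

```python
# Hedged sketch: the pairwise LP relaxation of Eqs. 3-4 for a toy 3-node MRF.
import numpy as np
from scipy.optimize import linprog

k = 2                                   # states per variable
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]        # a triangle
rng = np.random.default_rng(0)
theta = {e: rng.normal(size=(k, k)) for e in edges}

# Index the LP variables: node marginals first, then edge marginals.
node_off = {i: i * k for i in nodes}
edge_off = {e: len(nodes) * k + n * k * k for n, e in enumerate(edges)}
num_vars = len(nodes) * k + len(edges) * k * k

c = np.zeros(num_vars)                  # linprog minimizes, so negate theta
for e in edges:
    c[edge_off[e]:edge_off[e] + k * k] = -theta[e].ravel()

A_eq, b_eq = [], []
for i in nodes:                         # normalization: sum_x mu_i(x) = 1
    row = np.zeros(num_vars); row[node_off[i]:node_off[i] + k] = 1
    A_eq.append(row); b_eq.append(1.0)
for (i, j) in edges:                    # local consistency constraints
    for xi in range(k):                 # sum_xj mu_ij(xi, xj) = mu_i(xi)
        row = np.zeros(num_vars)
        for xj in range(k):
            row[edge_off[(i, j)] + xi * k + xj] = 1
        row[node_off[i] + xi] = -1
        A_eq.append(row); b_eq.append(0.0)
    for xj in range(k):                 # sum_xi mu_ij(xi, xj) = mu_j(xj)
        row = np.zeros(num_vars)
        for xi in range(k):
            row[edge_off[(i, j)] + xi * k + xj] = 1
        row[node_off[j] + xj] = -1
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
print("upper bound on MAP value:", -res.fun)
```

If the optimizer lands on an integral vertex, reading off argmax_x mu_i(x) recovers the MAP assignment; a fractional solution signals that the relaxation needs tightening, which is what the rest of the paper addresses.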
[Figure 1: A graphical illustration of the consistency constraint between the original (fine granularity) edge (x_i, x_k) and the coarsened triplet (z_i, z_j, z_k). The two should agree on the marginal of (z_i, z_k). For example, the shaded area in all three figures represents the same probability mass.]

2 Clusters and partitions

We begin with an illustrative example. Suppose we have a graphical model that is a triangle, with each variable taking k states. We can recover the exact marginal polytope in this case by forcing the pairwise marginals \mu_{ij}(x_i, x_j) to be consistent with some \mu_{123}(x_1, x_2, x_3). However, when k is large, introducing the corresponding k^3 variables to our LP may be too costly, and perhaps unnecessary if weaker consistency would already lead to an integral solution. To this end, we will use a coarse-grained version of \mu_{123} where the joint states are partitioned into larger collections, and consistency is enforced over the partitions.

The simplest partitioning scheme builds on coarse-grained versions of each variable X_i. Let Z_i denote a disjoint collection of sets covering the possible values of X_i. For example, if variable X_i has five states, Z_i might be defined as {{1, 2}, {3, 5}, {4}}. We will use z_i to index the sets in Z_i, so that, for example, z_i may take the value {1, 2}.

We can now consider distributions over a cluster c where each variable X_i is represented by its coarser version (the sets Z_i). The "partitioned" clusters are used to enforce the following coarse-grained consistency constraints: for every z_i \in Z_i and z_k \in Z_k,

  \sum_{x_i \in z_i, x_k \in z_k} \mu_{ik}(x_i, x_k) = \sum_{z_c \setminus \{z_i, z_k\}} \mu_c(z_c)    (5)

This is illustrated graphically in Fig. 1 for a cluster on three variables (the summation on the right-hand side above is over z_j). In the case when Z_i individuates each state, i.e., {{1}, {2}, {3}, {4}}, we recover the usual cluster consistency constraints.

How do we obtain the partitions? Suppose we solve the linear program using only the pairwise relaxation. The resulting marginals \mu_i(x_i) may involve non-binary values for some states (the fractional states). We can generate the partitions based on this set of fractional values S_i \subseteq \Omega_i, where \Omega_i is the domain of X_i. The idea is to individuate all the fractional states and lump together the remaining states that we ended up not using in the pairwise relaxation. More formally,

  Z_i = \{\Omega_i \setminus S_i\} \cup \bigcup_{x_i \in S_i} \{x_i\}

The resulting coarser version of the variable X_i, and the corresponding cluster consistency constraint, is likely to help us resolve some of the ambiguity left by the pairwise relaxation.

There are many possible partitions of each variable. Thus, we could introduce multiple clusters for any particular set of variables, each cluster pertaining to a different partitioning scheme, and each leading to different constraints of the form Eq. 5. The primal partition-based LP relaxation that we consider is \max_\mu \mu \cdot \theta with the feasible space given by the constraints in Eq. 5 and \mu_c \geq 0. We also enforce the usual pairwise consistency by node and edge clusters using the full partitioning for each variable.
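As an illustration, the partitioning rule above can be written in a few lines. This is a hypothetical helper of ours (the paper provides no code); the numerical tolerance tol is our own assumption for deciding which states count as fractional.

```python
# Hedged sketch of Z_i = {Omega_i \ S_i} U {{x_i} : x_i in S_i}: individuate
# every fractional state and lump the rest into one block.
def partition_states(mu_i, tol=1e-6):
    """mu_i: list of node marginals for variable X_i (summing to 1)."""
    fractional = [x for x, p in enumerate(mu_i) if tol < p < 1.0 - tol]
    rest = [x for x in range(len(mu_i)) if x not in fractional]
    blocks = [[x] for x in fractional]   # one block per fractional state
    if rest:
        blocks.append(rest)              # a single block for the unused states
    return blocks

# Example: five states, beliefs fractional on states 0 and 2.
print(partition_states([0.5, 0.0, 0.5, 0.0, 0.0]))   # [[0], [2], [1, 3, 4]]
```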
3 Dual linear program

In this section we give the dual of the partition-based LP relaxation, and from it obtain a message-passing algorithm that efficiently optimizes this relaxation. Our approach extends earlier work by Globerson and Jaakkola [3], who gave the generalized max-product linear programming (MPLP) algorithm to solve the usual (complete) cluster LP relaxation in the dual. The dual formulation that led to the original MPLP algorithm was derived by replicating each primal LP variable for every consistency constraint in which it participates, leading to a reparameterization in the dual. Particularly important was the notion of intersection sets, the sets of variables in which two clusters overlap.

Our dual formulation generalizes the notion of intersection sets, especially regarding the overlap of two partitioned clusters. Consider the example in Fig. 1. The triplet cluster on the right overlaps with the edge cluster on the left through the partitions Z_i and Z_k (middle). We denote such an intersection set by Z_i Z_j to clarify that it depends both on the variables (i and j) and on how they are partitioned in the corresponding cluster (Z_i and Z_j). Let S be the set of all intersection sets, and let S(c) \subseteq S be the intersection sets used by cluster c (an edge ij also appears as a cluster). The intersection sets will act as the interface between the sparse cluster constraints (coarse) and the edge marginals (fine). An edge may have multiple intersection sets corresponding to different partitionings of its variables.

An edge (like any cluster) is reparameterized across all the intersection sets it participates in. So, for example, \lambda_{ij \to Z_iZ_j}(x_i, x_j) represents the fraction of the edge parameters pushed to the intersection set Z_i Z_j of a sparse cluster involving partitionings Z_i and Z_j. By applying this new notion of intersection sets to the primal formulation of [3], it is straightforward to show that the dual partition-based linear program is

  \min_\lambda \sum_i \max_{x_i} \sum_{k \in N(i)} \lambda_{ik \to i}(x_i) + \sum_c \sum_{Z_iZ_j \in S(c)} \max_{z_i \in Z_i, z_j \in Z_j} \lambda_{c \to Z_iZ_j}(z_i, z_j)    (6)

subject to the reparameterization constraints

  \lambda_{ij \to i}(x_i, x_j) + \lambda_{ij \to j}(x_i, x_j) + \sum_{Z_iZ_j \in S(ij)} \lambda_{ij \to Z_iZ_j}(x_i, x_j) = \theta_{ij}(x_i, x_j)   \forall ij \in E, x_i, x_j    (7)

  \sum_{Z_iZ_j \in S(c)} \lambda_{c \to Z_iZ_j}(z_c) = 0   \forall c \neq ij, z_c    (8)

and where the terms in the dual objective are given by \lambda_{ik \to i}(x_i) = \max_{x_k} \lambda_{ik \to i}(x_i, x_k) and

  \lambda_{c \to Z_iZ_j}(z_i, z_j) = \max_{z_c \setminus \{z_i, z_j\}} \lambda_{c \to Z_iZ_j}(z_c)    (9)

  \lambda_{ij \to Z_iZ_j}(z_i, z_j) = \max_{x_i \in z_i, x_j \in z_j} \lambda_{ij \to Z_iZ_j}(x_i, x_j)   for c = ij    (10)

As we show in the next section, the \lambda variables correspond to the messages sent in the message-passing algorithm that we use to optimize the dual. Thus \lambda_{ik \to i}(x_i) should be read as the message sent from edge ik to node i. Similarly, \lambda_{ij \to Z_iZ_j}(z_i, z_j) is the message sent from the (fine) edge (X_i, X_j) to the coarsened edge (Z_i, Z_j), and \lambda_{c \to Z_iZ_j}(z_i, z_j) is the message from the coarsened cluster to one of its intersection edges.

By duality, the dual objective evaluated at any dual feasible point upper bounds the primal LP optimum, which in turn upper bounds the value of the MAP assignment. We have one dual variable for each cluster and each of its intersection sets. The state space of an intersection set depends only on the number of partitions of its variables, not on their original state space. Thus, as in the primal formulation, we get the desired coarsening. Notice that Eq. 10 is where the interface between the coarse-grained partitions and the fine-grained variables occurs.
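The coarse-fine interface of Eq. 10 amounts to a block-wise maximization over fine states. The following sketch is our own illustration under that reading; the function and variable names are hypothetical.

```python
# Hedged sketch of Eq. 10: a message on the fine edge (x_i, x_j) is coarsened
# to partition blocks (z_i, z_j) by maximizing over the fine states inside
# each block.
import numpy as np

def coarsen_edge_message(lam_fine, Zi, Zj):
    """lam_fine: (|X_i| x |X_j|) array; Zi, Zj: lists of state blocks."""
    lam_coarse = np.empty((len(Zi), len(Zj)))
    for a, zi in enumerate(Zi):
        for b, zj in enumerate(Zj):
            lam_coarse[a, b] = lam_fine[np.ix_(zi, zj)].max()
    return lam_coarse

Zi = [[0], [2], [1, 3, 4]]               # partitions as in the earlier sketch
Zj = [[1], [0, 2]]
lam = np.random.default_rng(1).normal(size=(5, 3))
print(coarsen_edge_message(lam, Zi, Zj))  # a 3 x 2 coarse message
```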
3.1 Message-passing algorithm

In this section we give the message updates corresponding to performing block coordinate descent in the dual LP. All messages outgoing from a cluster to its intersection sets must be sent simultaneously to maintain dual feasibility. The derivation of the updates closely follows that of [3]. The key idea is that it is possible to fix the values of the messages (i.e., the values of \lambda) going out of all clusters except one, and to find a closed-form solution for the non-fixed messages. The equations below specify how each of the messages is updated (see Appendix A for the derivation). Throughout, let \lambda^{-j}_i(x_i) = \sum_{k \in N(i) \setminus j} \lambda_{ik \to i}(x_i), let \gamma = \frac{1}{|S(ij)|+2}, and let \bar{\lambda}^{-ij}_{Z_iZ_j}(z_i, z_j) = \sum_{c \neq ij : Z_iZ_j \in S(c)} \lambda_{c \to Z_iZ_j}(z_i, z_j) denote the messages incoming to an intersection set, excluding the edge's own contribution.

1. The edge-to-intersection-set messages \lambda_{ij \to Z_iZ_j}(z_i, z_j) are set to the following value (for every z_i, z_j):

  \lambda_{ij \to Z_iZ_j}(z_i, z_j) = (\gamma - 1)\, \bar{\lambda}^{-ij}_{Z_iZ_j}(z_i, z_j) + \gamma \max_{x_i \in z_i, x_j \in z_j} \Big[ \sum_{Z'_iZ'_j \in S(ij) : Z'_iZ'_j \neq Z_iZ_j} \bar{\lambda}^{-ij}_{Z'_iZ'_j}(z'_i(x_i), z'_j(x_j)) + \lambda^{-i}_j(x_j) + \lambda^{-j}_i(x_i) + \theta_{ij}(x_i, x_j) \Big]

2. Define z_i(x_i) to be the mapping from each assignment x_i to the partition z_i \in Z_i in which this state is located (i.e., x_i \in z_i). The edge-to-node messages \lambda_{ij \to i}(x_i) are set to the following values (for every x_i):

  \lambda_{ij \to i}(x_i) = (\gamma - 1)\, \lambda^{-j}_i(x_i) + \gamma \max_{x_j} \Big[ \sum_{Z_iZ_j \in S(ij)} \bar{\lambda}^{-ij}_{Z_iZ_j}(z_i(x_i), z_j(x_j)) + \lambda^{-i}_j(x_j) + \theta_{ij}(x_i, x_j) \Big]

3. The cluster-to-edge messages \lambda_{c \to Z_iZ_j}(z_i, z_j) are set to the following values (for every z_i, z_j):

  \lambda_{c \to Z_iZ_j}(z_i, z_j) = -\Big(1 - \frac{1}{|S(c)|}\Big) \bar{\lambda}^{-c}_{Z_iZ_j}(z_i, z_j) + \frac{1}{|S(c)|} \max_{z_c \setminus \{z_i, z_j\}} \sum_{e \in c \setminus ij} \bar{\lambda}^{-c}_{Z_e}(z_e)

where \bar{\lambda}^{-c}_{Z_e}(z_e) = \lambda_{e \to Z_e}(z_e) + \sum_{c' \neq c : Z_e \in S(c')} \lambda_{c' \to Z_e}(z_e), and the notation Z_e for edge e = (k, j) refers to the partitions Z_k and Z_j used by cluster c. Also, |S(c)| denotes the number of intersection sets that cluster c has (for example, if c is a cluster over four nodes that intersects with five edges, |S(c)| will be five).

The clusters operate completely in the coarse parameter space, taking advantage of the sparsity of the constraints. The edge clusters act as an interface between the coarse parameter space and the fine parameter space, passing information between the two layers. Thus, we never need to operate in the larger state space.
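Between block updates, the dual objective of Eq. 6 can be evaluated to track the monotonically decreasing upper bound on the MAP value. The sketch below is our own bookkeeping illustration, with hypothetical data structures: a list of incoming edge-to-node messages per node, and a list of cluster-to-intersection-set messages per cluster.

```python
# Hedged sketch: evaluating the dual bound of Eq. 6 from current messages.
import numpy as np

def dual_bound(node_msgs, cluster_msgs):
    """node_msgs[i]: list of arrays over x_i, one per incident edge;
       cluster_msgs[c]: list of arrays over (z_i, z_j), one per intersection set."""
    b = 0.0
    for msgs in node_msgs.values():
        b += np.sum(msgs, axis=0).max()      # max_{x_i} sum_k lam_{ik->i}(x_i)
    for msgs in cluster_msgs.values():
        b += sum(m.max() for m in msgs)      # max_{z_i,z_j} lam_{c->ZiZj}(z_i,z_j)
    return b

# Toy usage with arbitrary message values.
node_msgs = {0: [np.array([0.1, -0.2]), np.array([0.3, 0.0])]}
cluster_msgs = {"c1": [np.zeros((2, 2))]}
print(dual_bound(node_msgs, cluster_msgs))   # 0.4
```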
4 Generalization: hierarchies of clusters

The techniques we have developed thus far allow us to construct sparse clusters according to partitionings of the states of the variables involved in the cluster. In this section we sketch how our methods could be used to build a hierarchy of cluster consistency constraints while keeping the state space of the clusters small. This will enable us to construct global constraints on the model without an exponential blow-up in state spaces.

Suppose we have a pairwise MRF on a 3 x 3 grid graph, with binary variables. Let the variables x_1, ..., x_9 be placed left to right, top to bottom. We wish to add clusters to tighten the pairwise LP relaxation. We first construct clusters q_1, q_2, q_3, q_4 for each square of the grid (again, left to right, top to bottom). Each cluster q_i has three states, summarizing the assignments to the four nodes in its square: q_i = 0 corresponds to the left two nodes being 0 and the right two nodes being 1; q_i = 1 corresponds to the left two nodes being 1 and the right two nodes being 0; q_i = 2 covers the remaining 14 assignments to the four nodes in its square. Note that the state space of q_1 is significantly smaller than the 2^4 assignments to the nodes beneath it.

Next we enforce consistency between q_1 and the edges beneath it: (x_1, x_2), (x_2, x_5), (x_4, x_5), and (x_1, x_4). To do this, we construct a triplet cluster x_i x_j q_1 for q_1 and each edge (x_i, x_j). This cluster has states corresponding to the allowed configurations of these two variables, excluding, e.g., q_1 = 1, (x_i, x_j) = (0, 0). We then enforce marginalization of this cluster with the edge cluster x_i x_j and with q_1.

We now build a new square by adding "edge" clusters (q_1, q_2), (q_2, q_4), (q_3, q_4), and (q_1, q_3). We use each edge marginal to enforce that two adjacent squares are consistent on the edge that they share in common. We do so using a construction similar to the one in the previous paragraph: we introduce a new cluster for the cross product of the state spaces of the edge cluster (x_4, x_5) and the new edge cluster q_1 q_2, excluding states that cannot correspond to a valid assignment (such as (q_1, q_2) = (0, 0), (x_4, x_5) = (1, 0), or (q_1, q_2) = (0, 1) for any (x_4, x_5)), and then enforce marginalization of this cluster to the two edge clusters (which live at different levels of the hierarchy).

Now we can repeat the process on the square involving the four q_i nodes (and edges), constructing a top-level cluster q with three states (in precisely the same way as before) which summarizes the square beneath it. Through this construction, the state q has global control over the assignments to all 9 nodes: q = 0 implies that the x_i in the left and right grid columns are 0 and those in the middle grid column are 1; likewise, q = 1 implies that the x_i in the left and right grid columns are 1 and those in the middle grid column are 0; q = 2 corresponds to all other assignments.
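The three-state square summary can be spelled out directly. The sketch below is ours; it encodes the q_i mapping described above and verifies that the three coarse states cover the 2^4 fine assignments as 1 + 1 + 14.

```python
# Hedged sketch of the 3-state square summary of Sec. 4: map the 2^4
# assignments of a 2x2 block (top-left, top-right, bottom-left, bottom-right)
# to q in {0, 1, 2}. The function name is ours.
def square_state(x_tl, x_tr, x_bl, x_br):
    if (x_tl, x_bl) == (0, 0) and (x_tr, x_br) == (1, 1):
        return 0   # left column 0, right column 1
    if (x_tl, x_bl) == (1, 1) and (x_tr, x_br) == (0, 0):
        return 1   # left column 1, right column 0
    return 2       # the remaining 14 assignments

from itertools import product
counts = [0, 0, 0]
for bits in product([0, 1], repeat=4):
    counts[square_state(*bits)] += 1
print(counts)      # [1, 1, 14]
```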
5 Related work

Our approach contains two key elements: the tightening of the marginal polytope by adding constraints, and the use of a coarse representation of the clusters. The marginal polytope for binary variables, or the cut polytope, has been extensively studied in combinatorial optimization [1]. This has resulted in outer bounds on the polytope based on a variety of tools, from cycle inequalities to semi-definite constraints. These inequalities are often not easily extended to non-binary variables, though recent extensions do exist (e.g., [6, 4]). The class of consistency constraints that we use here (see Eq. 4) is particularly convenient for non-binary variables. However, these constraints require an exponential expansion in the number of LP variables, which our approach seeks to remedy.

Another class of related approaches are the "coarse-to-fine" applications of belief propagation [2]. In this approach, one solves low-resolution versions of an MRF and uses the resulting beliefs to initialize finer-resolution versions. Although it shares the element of coarsening with our approach, the goal of coarse-to-fine approaches is very different from ours. Specifically, the low-resolution MRFs only serve to speed up convergence of the full-resolution MRF via better initialization. Thus, one typically should not expect them to perform better than the finest-granularity MRF. In contrast, our approach is designed to strictly improve the performance of the original MRF by introducing additional (coarse) clusters. A key technical difference is that in our formulation the settings of coarse and fine variables are refined iteratively, whereas in [2] once a coarse MRF has been solved, it is not revisited.

6 Experiments

We report results on the protein design problem, originally described in [10]. The protein design problem is the inverse of the protein folding problem: given a desired backbone structure for the protein, the goal is to construct a sequence of amino acids that results in a low-energy, and thus stable, configuration. We can use an approximate energy function to guide us towards finding a set of amino acids and rotamer configurations with minimal energy. In [10] the design problem was posed as finding a MAP configuration in a pairwise MRF. The models used there (which are also available online) have between 2 and 158 states per variable, and contain up to 180 variables per model. The models are also quite dense, so exact calculation is not feasible.

[Figure 2: Comparison of our method (denoted "Coarsened") and the method in [7] (denoted "Full") for the protein 1a62. Left: dual objective as a function of time for both methods. Right: the cost of one iteration over the entire graph.]

Recently it was shown [7] that all but one of the problems described in [10] can be solved exactly by using an LP relaxation with clusters on three variables. However, since each individual variable has roughly 100 possible values, processing a triplet requires on the order of 10^6 operations, making the optimization rather costly.

In applying our approach to this problem we have two performance goals: one is obviously to reduce computation time, and the other is to match the performance of the algorithm in [7], which uses the complete (un-coarsened) triplet representation (namely, to find the exact MAP solution for the proteins). Note that in general it is not at all clear that the latter can be achieved, since the bounds on the polytope that we use are weaker than the full-triplet bounds used in [7]. In what follows we describe two sets of experiments showing that we can attain both goals. The experiments differ in the strategy for adding triplets, and illustrate two performance regimes. In both experimental setups we first run the standard edge-based message-passing algorithm.

In the first experiment, we add all triplets that correspond to variables whose singleton beliefs are tied at the maximum after running the edge-based algorithm. Since tied beliefs correspond to fractional LP solutions, it is natural to consider these in tighter relaxations. The triplets are defined over partitioned variables, as explained in Sec. 2. The partitionings themselves are guided by the ties in the singleton beliefs. Specifically, for each variable X_i we find the states whose singleton beliefs are tied at the maximum. Denote the number of states maximizing the belief by r. We then partition the states into r subsets, each containing a different maximizing state; the other (non-maximizing) states are split randomly between the r subsets (see the sketch below). This partitioning yields a reduced state space (corresponding to Z_i in our notation). The triplets are then constructed over the variables Z_i, and the message-passing algorithm of Sec. 3.1 is applied to the resulting structure. After convergence of the algorithm, we recalculate the singleton beliefs. These may result in a different partition scheme, and hence new variables Z_i. We add new triplets corresponding to the new variables and re-run. We repeat until the dual LP bound is sufficiently close to the value of the integral assignment obtained from the messages (these values do not generally coincide, but for the problems we considered they do).
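Here is a sketch of the tie-guided partitioning used in this first experiment. It is our own illustration; the paper does not specify a tolerance for detecting ties or how the random split is seeded, so tol and the seed below are assumptions.

```python
# Hedged sketch: one block per state tied at the maximal belief, with the
# remaining states split randomly among the r blocks.
import numpy as np

def tie_partition(mu_i, tol=1e-6, seed=0):
    mu_i = np.asarray(mu_i)
    maxed = np.flatnonzero(mu_i >= mu_i.max() - tol)   # the r tied states
    rest = np.setdiff1d(np.arange(len(mu_i)), maxed)
    blocks = [[int(x)] for x in maxed]
    rng = np.random.default_rng(seed)
    for x in rest:                                     # random assignment
        blocks[rng.integers(len(blocks))].append(int(x))
    return blocks

print(tie_partition([0.4, 0.4, 0.1, 0.05, 0.05]))
# two blocks; the non-maximizing states 2, 3, 4 land in random blocks
```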
We applied the above scheme to the ten smallest proteins in the dataset used in [7] (for the larger proteins we used a different strategy, described next). We were able to solve all ten exactly, as in [7]. The mean running time was six minutes. The gain in computational efficiency from using coarsened triplets was considerable: the state space of a coarsened triplet was on average 3000 times smaller than that of the original triplet, resulting in a factor-3000 speed gain over a scheme that uses the complete (un-coarsened) triplets (this timing comparison does not apply to [7], since their algorithm did not use all the triplets). This large factor comes about because very few states are tied per variable, which benefits our method, where the number of partitions equals the number of tied states. Thus, while running on full triplets was completely impractical, the coarsened message-passing algorithm is very practical and achieves the exact MAP assignments.

One shortcoming of the above scheme is that it uses the same partition of a variable X_i for all the triplets X_i participates in. Since a given partition might not provide the tightest bound, it may be advantageous to add partitions more often. In the second set of experiments, we added only 100 triplets at a time and ran 100 iterations of the message-passing algorithm. The triplets to be added were chosen using the bound criterion described in [7], and the state partitionings were based on the beliefs of the previous iteration.

We applied this second scheme to the 15 largest proteins in the dataset (we do not run on the protein 1fpo, which was not solved in [7]), and compared it to the algorithm of [7] (the code was provided to us by the authors). Of these, we found the exact MAP in 53% of the cases (according to the criterion used in [7]), and in the remaining cases we were within 10^-2 of the known optimal value. For the cases that were solved exactly, the mean running time was 2.4 hours. To compare the running times of our algorithm and that of [7], we checked how long each took to get within 10^-2 of the optimum. This revealed that our method is faster by an average factor of 4.2 (we made sure that the differences were not due to different processing power or CPU loads). The reason this factor is smaller than the 3000 of the previous setup is that for the larger proteins the number of tied states is typically much higher than for the small ones. Furthermore, the coarsened method makes less improvement in the objective for a given set of triplets than [7], which uses tighter constraints on the polytope. Results for one of the proteins are shown in Fig. 2. The objective is seen to decrease faster for our method. Also, as expected, the cost per iteration is significantly lower for our method, where processing each triplet is cheaper.
7 Conclusion

We presented an algorithm that enforces higher-order consistency constraints on LP relaxations at a reduced computational cost. Although the resulting constraints are not as tight as the full cluster consistency constraints, they work well in practice and allow us to find the exact MAP assignment in many of the cases we studied, while doing so considerably faster than methods that use the usual cluster consistency constraints.

Our technique further explores the trade-offs of representing complex constraints on the marginal polytope while keeping the optimization tractable. In applying the method, we chose to cluster variables' states based on the ties in the single-node beliefs after solving with a looser constraint on the polytope. However, it should be possible to optimize these partitions using more refined tools, for example by assessing the improvement in the bound that would result from using different partitions.

The constraints that we chose to represent are consistency constraints between marginals, i.e., equality constraints. However, one may also consider inequalities on marginals that follow from general probability laws such as inclusion-exclusion. Finally, perhaps the most interesting extension of our approach is to construct hierarchies of coarsened variables that interact with each other to represent constraints on increasingly larger clusters while keeping the optimization tractable.

References

[1] M. M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer-Verlag, 1997.
[2] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. Int. J. Comput. Vision, 70(1):41-54, 2006.
[3] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In Advances in Neural Information Processing Systems 21. MIT Press, 2008.
[4] P. Kohli, A. Shekhovtsov, C. Rother, V. Kolmogorov, and P. Torr. On partial optimality in multi-label MRFs. In ICML, 2008 (to appear).
[5] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1568-1583, 2006.
[6] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems 21. MIT Press, 2008.
[7] D. Sontag, T. Meltzer, A. Globerson, Y. Weiss, and T. Jaakkola. Tightening LP relaxations for MAP using message-passing. In UAI, 2008.
[8] M. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference. Technical report, UC Berkeley, Dept. of Statistics, 2003.
[9] M. Wainwright and M. I. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099-2109, June 2006.
[10] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation - an empirical study. JMLR, 7:1887-1907, 2006.
[11] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282-2312, 2005.

A Derivation of updates

To take a block coordinate descent step with respect to the edge messages \lambda_{ij \to Z_iZ_j} and \lambda_{ij \to i}, consider the following terms in the objective:

  \max_{x_i} \sum_{k \in N(i)} \lambda_{ik \to i}(x_i) + \max_{x_j} \sum_{k \in N(j)} \lambda_{jk \to j}(x_j) + \sum_{Z_iZ_j \in S(ij)} \max_{z_i \in Z_i, z_j \in Z_j} \sum_{c : Z_iZ_j \in S(c)} \lambda_{c \to Z_iZ_j}(z_i, z_j)    (11)

Rewriting the above in terms of the fixed quantities \lambda^{-j}_i(x_i) = \sum_{k \in N(i) \setminus j} \lambda_{ik \to i}(x_i) (analogously \lambda^{-i}_j(x_j)), and

  \bar{\lambda}^{-ij}_{Z_iZ_j}(z_i, z_j) = \sum_{c \neq ij : Z_iZ_j \in S(c)} \lambda_{c \to Z_iZ_j}(z_i, z_j),

the relevant objective is

  \max_{x_i} \big[ \lambda^{-j}_i(x_i) + \max_{x_j} \lambda_{ij \to i}(x_i, x_j) \big] + \max_{x_j} \big[ \lambda^{-i}_j(x_j) + \max_{x_i} \lambda_{ij \to j}(x_i, x_j) \big] + \sum_{Z_iZ_j \in S(ij)} \max_{z_i \in Z_i, z_j \in Z_j} \big[ \bar{\lambda}^{-ij}_{Z_iZ_j}(z_i, z_j) + \max_{x_i \in z_i, x_j \in z_j} \lambda_{ij \to Z_iZ_j}(x_i, x_j) \big]    (12)

Let z_i(x_i) be the mapping from each assignment x_i to the partition z_i \in Z_i.
The above is equivalent to

  \max_{x_i, x_j} \big[ \lambda^{-j}_i(x_i) + \lambda_{ij \to i}(x_i, x_j) \big] + \max_{x_i, x_j} \big[ \lambda^{-i}_j(x_j) + \lambda_{ij \to j}(x_i, x_j) \big] + \sum_{Z_iZ_j \in S(ij)} \max_{x_i, x_j} \big[ \bar{\lambda}^{-ij}_{Z_iZ_j}(z_i(x_i), z_j(x_j)) + \lambda_{ij \to Z_iZ_j}(x_i, x_j) \big]    (13)

which is lower bounded by

  \max_{x_i, x_j} \Big[ \lambda^{-j}_i(x_i) + \lambda_{ij \to i}(x_i, x_j) + \lambda^{-i}_j(x_j) + \lambda_{ij \to j}(x_i, x_j) + \sum_{Z_iZ_j \in S(ij)} \big( \bar{\lambda}^{-ij}_{Z_iZ_j}(z_i(x_i), z_j(x_j)) + \lambda_{ij \to Z_iZ_j}(x_i, x_j) \big) \Big]    (14)

which, by the reparameterization constraint of Eq. 7, is equal to

  \max_{x_i, x_j} \Big[ \lambda^{-j}_i(x_i) + \lambda^{-i}_j(x_j) + \theta_{ij}(x_i, x_j) + \sum_{Z_iZ_j \in S(ij)} \bar{\lambda}^{-ij}_{Z_iZ_j}(z_i(x_i), z_j(x_j)) \Big]    (15)

We can achieve this minimum by setting each of the terms within the maximizations of Eq. 13 equal to \frac{1}{|S(ij)|+2} of the quantity inside the maximization of Eq. 15. This results in the updates given in Section 3.1.