Cooled and Relaxed Survey Propagation for MRFs

Hai Leong Chieu (Singapore-MIT Alliance; Department of Computer Science, National University of Singapore; haileong@nus.edu.sg)
Wee Sun Lee (Department of Computer Science, National University of Singapore; leews@comp.nus.edu.sg)
Yee-Whye Teh (Gatsby Computational Neuroscience Unit, University College London; ywteh@gatsby.ucl.ac.uk)

Abstract

We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms, including max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all of these approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem.

1 Introduction

Energy minimization is the problem of finding a maximum a posteriori (MAP) configuration in a Markov random field (MRF). A MAP configuration is an assignment of values to variables that maximizes the probability (or minimizes the energy) in the MRF. Energy minimization has many applications; for example, in computer vision it is used for stereo matching [11]. As energy minimization in general MRFs is computationally intractable, approximate inference algorithms based on belief propagation are often necessary in practice. Such algorithms fall into two classes: max-product and its variants address the problem by trying to find a MAP configuration directly, while sum-product and its variants return approximate marginal distributions which can be used to estimate a MAP configuration. It has been shown that the max-product algorithm converges to neighborhood optima [18], while the sum-product algorithm converges to local minima of the Bethe approximation [20]. Convergence of these algorithms is important for good approximations. Recent work (e.g.
[16, 8]) on sufficient conditions for convergence of sum-product algorithms suggests that they converge better on MRFs containing potentials with small strengths. In this paper, we propose a novel algorithm, called Relaxed Survey Propagation (RSP), based on performing the sum-product algorithm on a relaxed MRF. In the relaxed MRF, there is a parameter vector y that can be optimized for convergence. By optimizing y to reduce the strengths of potentials, we show empirically that RSP converges on MRFs where other algorithms fail to converge. The relaxed MRF is built in two steps: first (i) converting the energy minimization problem into its weighted MAX-SAT equivalent [17], and then (ii) constructing a relaxed version of the survey propagation MRF proposed in [14]. We prove that the relaxed MRF has approximately the same joint distribution (and hence the same MAP and marginals) as the original MRF, independent (to a large extent) of the parameter vector y. Empirically, we show that RSP, when run at low temperatures ("cooled"), performs well for the application of energy minimization. For max-product algorithms, we compare against the max-product algorithm and its sequential tree-reweighted variant, which is guaranteed to converge [11]. For sum-product algorithms, we compare against residual belief propagation [6] as a state-of-the-art asynchronous belief propagation algorithm, as well as tree-structured expectation propagation [15], which has been shown to be a special case of generalized belief propagation [19]. We show that RSP outperforms all approaches for Ising models with mixed couplings, as well as in a supervised clustering application for web person disambiguation.

[Figure 1 here: (a) a factor graph G = (V, F) over variables x1, x2 with factors a, b, c; (b) its weighted MAX-SAT equivalent W = (B, C) over boolean variables δ_(1,1), δ_(1,2), δ_(2,1), δ_(2,2), with interaction clauses and the positivity clauses β(1), β(2); the legend marks whether a variable appears as a positive or negative literal in each clause.]

Figure 1: The variables x1, x2 in (a) are binary, resulting in 4 variables in (b).
The clauses α1 to α4 in (b) are entries in the factor a in (a); ω1 and ω2 (resp. ω3 and ω4) are from b (resp. c). β(1) and β(2) are the positivity clauses. The relaxed MRF for RSP has a similar form to the graph in (b).

2 Preliminaries

A MRF, G = (V, F), is defined by a set of variables V and a set of factors F = {ψ_a}, where each ψ_a is a non-negative function depending on X_a ⊆ V. We assume for simplicity that variables in V have the same cardinality q, taking values in Q = {1, .., q}. For X_i ∈ V and X_a ⊆ V, we denote by x_i the event X_i = x_i, and by x_a the event X_a = x_a. To simplify notation, we will sometimes write i ∈ V for X_i ∈ V, or a ∈ F for ψ_a ∈ F. The joint distribution over configurations is defined by P(x) = (1/Z) ∏_{a∈F} ψ_a(x_a), where Z is the normalization factor. When each ψ_a is a positive function, the joint distribution can be written as P(x) = (1/Z) exp(−E(x)/T), where E(x) = −∑_{a∈F} log ψ_a(x_a) is the energy function, and the temperature T is set to 1.

A factor graph [13] is a graphical representation of a MRF, in the form of a bipartite graph with two types of nodes, the variables and the factors. Each factor ψ_a is connected to the variables in X_a, and each variable X_i is connected to the set of factors, N(i), that depend on it. See Figure 1(a) for a simple example.

Weighted MAX-SAT conversion [17]: Before describing RSP, we describe the weighted MAX-SAT (WMS) conversion of the energy minimization problem for a MRF. The WMS problem is a generalization of the satisfiability problem (SAT). In SAT, a set of boolean variables is constrained by a boolean function in conjunctive normal form, which can be treated as a set of clauses. Each clause is a set of literals (a variable or its negation), and is satisfied if one of its literals evaluates to 1. The SAT problem consists of finding a configuration that satisfies all the clauses.
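The relation between the product form P(x) = (1/Z) ∏ ψ_a(x_a) and the energy form exp(−E(x)) can be checked directly on a toy example. The sketch below (factor tables and variable names are illustrative, not from the paper) computes E(x) = −∑ log ψ_a(x_a) and verifies that P(x) = (1/Z) exp(−E(x)) is a normalized distribution:

```python
import math
from itertools import product

# Toy MRF on two binary variables with one pairwise and two single-variable
# factors, normalized to take values in (0, 1] (all tables illustrative).
factors = {
    ("x1", "x2"): {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.9},
    ("x1",): {(0,): 0.6, (1,): 1.0},
    ("x2",): {(0,): 1.0, (1,): 0.5},
}

def energy(x):
    """E(x) = -sum_a log psi_a(x_a), for x a dict of variable values."""
    return -sum(math.log(table[tuple(x[v] for v in scope)])
                for scope, table in factors.items())

configs = [dict(zip(("x1", "x2"), vals)) for vals in product((0, 1), repeat=2)]
Z = sum(math.exp(-energy(x)) for x in configs)   # normalization at T = 1

def prob(x):
    """P(x) = (1/Z) exp(-E(x)/T) with T = 1, equal to (1/Z) prod_a psi_a(x_a)."""
    return math.exp(-energy(x)) / Z
```

Here exp(−E(x)) recovers the product of factor values exactly, so both forms define the same distribution.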
In WMS, each clause has a weight, and the WMS problem consists of finding a configuration with maximum total weight of satisfied clauses (called the weight of the configuration). We describe the conversion [17] of a MRF G = (V, F) into a WMS problem W = (B, C), where B is the set of boolean variables and C the set of weighted clauses. Without loss of generality, we normalize factors in F to take values in (0, 1]. For each X_i ∈ V, introduce the variables δ_(i,x_i) ∈ B as predicates for the events X_i = x_i. For convenience, we index variables in B either by k or by (i, x_i), denote factors in F with Roman letters (e.g. a, b, c) and clauses in C with Greek letters (e.g. α, β, ω). For a clause ω ∈ C, we denote by C(ω) the set of variables in ω. There are two types of clauses in C: interaction and positivity clauses.

Definition 1. Interaction clauses: for each entry ψ_a(x_a) of ψ_a ∈ F, introduce the clause ω = ∨_{x_i∈x_a} ¬δ_(i,x_i) with weight w_ω = −log ψ_a(x_a). We write ω ∼ a to show that the clause comes from the factor ψ_a ∈ F, and we denote by a = src(ω) the factor ψ_a ∈ F for which ω ∼ a. The violation of an interaction clause ω corresponding to ψ_a(x_a) entails that δ_(i,x_i) = 1 for all x_i ∈ x_a. This corresponds to the event that X_i = x_i for all X_i ∈ X_a.

Definition 2. Positivity clauses: for each X_i ∈ V, introduce the clause β(i) = ∨_{x_i∈Q} δ_(i,x_i) with weight w_β(i) = 2 ∑_{ω∈C_i} w_ω, where C_i is the set of interaction clauses containing any variable in {δ_(i,x_i)}_{x_i∈Q}. For X_i ∈ V, we denote by β(i) the corresponding positivity clause in C, and for a positivity clause β ∈ C, we denote by src(β) the corresponding variable in V.

Positivity clauses have large weights to ensure that for each X_i ∈ V, at least one predicate in {δ_(i,x_i)}_{x_i∈Q} equals 1. To map back to a configuration in the original MRF, exactly one variable in each set {δ_(i,x_i)}_{x_i∈Q} can take the value 1. We call such configurations valid configurations:

Definition 3. A configuration is valid if, for all X_i ∈ V, exactly one of the indicators {δ_(i,x_i)}_{x_i∈Q} equals 1.
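The conversion of Definitions 1 and 2 can be sketched for a single pairwise factor as follows; the clause data structure (sign-tagged literals over indicator indices) is a hypothetical representation chosen for illustration:

```python
import math
from itertools import product

# Toy MRF: a single pairwise factor psi over (X1, X2), q = 2, values in (0, 1]
# (table illustrative).
psi = {(0, 0): 0.9, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.9}
Q = (0, 1)

# Interaction clauses (Definition 1): one clause per entry psi(x_a), a
# disjunction of NEGATIVE literals over the indicators delta_(i, x_i),
# with weight w_omega = -log psi(x_a).
interaction = []
for xa in product(Q, repeat=2):
    literals = [("neg", (i, xa[i])) for i in range(2)]
    interaction.append((literals, -math.log(psi[xa])))

# Positivity clauses (Definition 2): for each variable i, the disjunction of
# POSITIVE literals over {delta_(i, x)}_{x in Q}, weighted by twice the total
# weight of the interaction clauses touching variable i (here: all of them).
positivity = []
for i in range(2):
    touching = sum(w for literals, w in interaction
                   if any(var[0] == i for _, var in literals))
    positivity.append(([("pos", (i, x)) for x in Q], 2 * touching))
```

Since every factor value is at most 1, every interaction-clause weight is non-negative, and each positivity clause outweighs all interaction clauses it can conflict with.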
There are two types of invalid configurations: MTO configurations, where more than one variable in some set {δ_(i,x_i)}_{x_i∈Q} equals 1, and AZ configurations, where all variables in some set equal zero. For a valid configuration τ, let x(τ) be the corresponding configuration in V. For valid configurations, and for each a ∈ F, exactly one interaction clause in {ω}_{ω∼a} is violated: when the ω corresponding to ψ_a(x_a) is violated, we have X_a = x_a in x(τ). Valid configurations have locally maximal weights [17]: MTO configurations have low weights since variables appear only as negative literals in interaction clauses, and AZ configurations have low weights because they violate the positivity clauses. See Figure 1 for an example of the WMS equivalent of a simple factor graph.

3 Relaxed Survey Propagation

In this section, we transform the WMS problem W = (B, C) into another MRF, Gs = (Vs, Fs), based on the construction of the MRF for survey propagation [14]. We show (in Section 3.1) that under this framework, we are able to remove MTO configurations, while AZ configurations have negligible probability. In survey propagation, in addition to the values {0, 1}, variables can take a third value, * ("joker" state), signifying that the variable is free to take either 0 or 1 without violating any clause. In this section, we assume that variables s_k take values in {0, 1, *}.

Definition 4. [14] A variable k is constrained by the clause ω ∈ C if it is the unique satisfying variable for ω (all other variables violate ω). Define CON_{k,ω}(C(ω)) = 𝟙(k is constrained by ω), where 𝟙(P) equals 1 if the predicate P is true, and 0 otherwise.

We introduce the parameters {y_a}_{a∈F} and {y_i}_{i∈V} by modifying the definition of VAL_ω in [14]:

Definition 5. An assignment is invalid for clause ω if and only if all variables are unsatisfying except for exactly one, for which s_k = *. (In this case, k cannot take *, as it is constrained.)
Define

VAL_ω(C(ω)) = exp(w_ω) if C(ω) satisfies ω; exp(−y_src(ω)) if C(ω) violates ω; 0 if C(ω) is invalid for ω.   (1)

The term exp(−y_src(ω)) is the penalty for violating clauses, with src(ω) ∈ V ∪ F defined in Definitions 1 and 2. For interaction clauses, we index y_a by a ∈ F because among valid configurations, exactly one clause in the group {ω}_{ω∼a} is violated, so exp(−y_a) becomes a constant factor. Positivity clauses are always satisfied, and their penalty factor will not appear for valid configurations.

Definition 6. [14] Define the parent set P_k of a variable k to be the set of clauses for which k is the unique satisfying variable (i.e. the set of clauses constraining k).

We now construct the MRF Gs = (Vs, Fs), where the variables σ_k ∈ Vs are of the form σ_k = (s_k, P_k), with s_k variables in the WMS problem W = (B, C) (see Figure 1). We define single-variable compatibilities ψ_k(σ_k) and clause compatibilities ψ_ω(σ) as in [14]:

ψ_k(σ_k = (s_k, P_k)) = ω_0 if P_k = ∅ and s_k ≠ *; ω_* if P_k = ∅ and s_k = *; 1 for any other valid (s_k, P_k); where ω_0 + ω_* = 1.   (2)

ψ_ω(σ = {(s_k, P_k)}_{k∈C(ω)}) = VAL_ω(C(ω)) × ∏_{k∈C(ω)} 𝟙((ω ∈ P_k) = CON_{k,ω}(C(ω))),   (3)

where CON is defined in Definition 4. The single-variable compatibilities ψ_k(σ_k) are defined so that when k is unconstrained (i.e. P_k = ∅), ψ_k(σ_k) takes the value ω_* or ω_0 depending on whether s_k equals *. The clause compatibilities introduce the clause weights and penalties into the joint distribution. The factor graph has the following underlying distribution:

P({(s_k, P_k)}_k) ∝ ω_0^{n_0(σ)} ω_*^{n_*(σ)} ∏_{ω∈SAT(σ)} exp(w_ω) ∏_{ω∈UNSAT(σ)} exp(−y_src(ω)),   (4)

where n_0(σ) is the number of unconstrained variables in σ not taking *, and n_*(σ) the number of variables taking *. Comparing RSP with SP-ρ in [14], we see that

Theorem 1. In the limit where all y_a, y_i → ∞, RSP is equivalent to SP-ρ [14], with ρ = ω_*.

Taking y to infinity corresponds to disallowing violated constraints, and SP-ρ was formulated for satisfiable SAT problems, where violated constraints are forbidden.
In this case, all clauses must be n0 n satisfied and the term C exp(w ) in Equation 4 is a constant, and P ( ) 0 . 3.1 Main result In the following, we assume the following settings: (1) = 1 and 0 = 0 ; (2) for positivity clauses (i), let yi = 0 ; and (3) in the original MRF G = (V , F ), single-variable factors are defined on all variables (we can always define uniform factors). Under these settings, we will prove the main result that the joint distribution on the relaxed MRF is approximately equal to that on the original MRF, and that RSP estimates marginals on the original MRF. First, we prove the following lemma: Lemma 1. The joint probability over valid configurations on Gs is proportional to the joint probability of the corresponding configurations on the original MRF, G = (V , F ). Proof. For valid configurations, all positivity clauses are satisfied, and for each a F , all valid configurations have one violated constraint in the set of interaction clauses {} . Hence the a a penalty term for violated constraints F exp(ya ) is a constant factor. Let W = C w be the sum of all weights. For a valid configuration , a P ( ) exp( w ) = exp(W - w ) a (x( )) SAT( ) UNSAT( ) F Lemma 2. All configurations containing * have zero probability in the MRF Gs , and there is a one-to-one mapping between configurations = {k , Pk }kVs and configurations = {k }kB Proof. Single-variable factors on G translate into single-literal clauses in the WMS formulation, which in turn becomes single-variable factors in Gs . For a variable k = (k , Pk ) with a singlevariable factor, , we have VAL (k = ) = 0. This implies (k = (, Pk )) = 0. Lemma 3. MTO configurations have n0 = 0 and since 0 = 0, they have zero probability. Proof. In MTO configurations, (i, xi , xi ), i,xi = i,xi = 1. 
The positivity clause β(i) is hence non-constraining for these variables, and since all other clauses connected to them are interaction clauses containing them as negative literals, both variables are unconstrained. Hence n_0(σ) > 0, and from Equation 4, for ω_0 = 0, they have zero probability.

The above lemmas lead to the following theorem:

Theorem 2. Assuming that exp(w_β(i)) ≫ 1 for all X_i ∈ V, the joint distribution over the relaxed MRF Gs = (Vs, Fs) is approximately equal to the joint distribution over the original MRF, G = (V, F). Moreover, RSP estimates the marginals on the original MRF: at the fixed points, the belief at each node, B(δ_(i,x_i) = 1), is an estimate of P(X_i = x_i), and ∑_{x_i∈Q} B(δ_(i,x_i) = 1) ≈ 1.

We can understand the above theorem as follows: if we assume that the probability of AZ invalid configurations is negligible (equivalent to assuming that the probability of violating positivity clauses is negligible, i.e. exp(w_β(i)) ≫ exp(−y_src(β(i))) = 1), then we have only valid configurations left. MTO invalid configurations are ruled out by Lemma 3. Since the positivity clauses have large weights, exp(w_β(i)) ≫ 1 is usually satisfied. Hence RSP, as the sum-product algorithm on the relaxed MRF, returns estimates of the marginals P(X_i = x_i) as B(δ_(i,x_i) = 1).

3.2 Choosing y

Valid configurations have a joint probability with the factor ∏_{a∈F} exp(−y_a), while AZ configurations do not. However, Theorem 2 states that, if exp(w_β(i)) ≫ 1, AZ configurations have negligible probability. Empirically, we observe that for a large range of values of {y_a}_{a∈F}, RSP returns marginals satisfying ∑_{x_i} B(δ_(i,x_i) = 1) ≈ 1, indicating that AZ configurations do indeed have negligible probability. We can hence select the values of {y_a}_{a∈F} for better convergence properties. We describe heuristics based on the sufficient conditions for convergence of sum-product algorithms in [16].
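The heuristic can be sketched as follows: since y_a is shared among the clauses ω ∼ a, it is chosen to minimize the summed clause strengths ∑_{ω∼a} tanh(¼|w_ω + y_a|), with the median of {−w_ω}_{ω∼a} as a good starting approximation. A minimal sketch (weights illustrative; a plain grid search stands in for the fminsearch call used in the paper):

```python
import math
import statistics

# Interaction-clause weights w_omega = -log psi_a(x_a) for one factor a
# (values illustrative, not from the paper).
w = [0.11, 0.36, 1.20, 2.30]

def strength_sum(y):
    """Summed strengths over omega ~ a: sum of tanh(|w_omega + y| / 4)."""
    return sum(math.tanh(abs(wo + y) / 4.0) for wo in w)

# The paper bins the search range and runs Matlab's fminsearch; a grid search
# over the range of {-w_omega}, plus the median approximation, suffices here.
median_y = statistics.median(-wo for wo in w)
lo, hi = -max(w), -min(w)
candidates = [lo + k * (hi - lo) / 100 for k in range(101)] + [median_y]
y_a = min(candidates, key=strength_sum)
```

By construction the grid includes the median candidate, so the chosen y_a is never worse than the median heuristic under this objective.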
To simplify notation, we write the conditions for a MRF with pairwise factors ψ_a:

max_{X_j∈V, b∈N(j)} ∑_{a∈N(j)\b} N(ψ_a) < 1,   (5)

where N(ψ_a) = sup_{x_i≠x_i'} sup_{x_j≠x_j'} tanh( (1/4) |log [ψ_a(x_i, x_j) ψ_a(x_i', x_j') / (ψ_a(x_i, x_j') ψ_a(x_i', x_j))]| ).

Mooij and Kappen [16] have also derived another condition based on the spectral radius of a matrix having the N(ψ_a) as entries. These conditions lead us to believe that the sum-product algorithm converges better on MRFs with small N(ψ_a) (the "strengths" of the potentials in [8]). To calculate N(ψ_ω) for an interaction clause ω, we characterize these factors as follows:

ψ_ω((s_k, P_k), (s_l, P_l)) = exp(−y_src(ω)) if clause ω is violated; exp(w_ω) otherwise.   (6)

As y_a is shared among all ω ∼ a, we choose y_a to minimize ∑_{ω∼a} N(ψ_ω) = ∑_{ω∼a} tanh( (1/4) |w_ω + y_a| ). A good approximation for y_a is the median of {−w_ω}_{ω∼a}. For our experiments, we divide the search range for y_a into 10 bins, and use fminsearch in Matlab to find a local minimum.

3.3 Update equations and efficiency

While each message in RSP has large cardinality, we show in the supplementary material that, under the settings of Section 3.1, the update equations can be simplified so that each factor passes a single number to each variable. The interaction clause ω sends a number η_{ω→(i,x)} to each (i, x) ∈ C(ω), and the positivity clause β(i) sends a number µ_{β(i)→(i,x)} to (i, x) for each x ∈ Q. The update equations are as follows (proofs in the supplementary material):

µ_{β(i)→(i,x)} = ∏_{x'≠x} ∏_{ω∈N(i,x')\β(i)} η_{ω→(i,x')} + exp(−w_β(i)),   (7)

η_{ω→(i,x)} = [ µ_{β(j)→(j,x')} + exp(−y_src(ω) − w_ω) ∏_{ω'∈N(j,x')\{β(j),ω}} η_{ω'→(j,x')} ] / [ µ_{β(j)→(j,x')} + ∏_{ω'∈N(j,x')\{β(j),ω}} η_{ω'→(j,x')} ],   (8)

B(δ_(i,x) = 0) ∝ µ_{β(i)→(i,x)};  B(δ_(i,x) = 1) ∝ ∏_{ω∈N(i,x)\β(i)} η_{ω→(i,x)};  B(δ_(i,x) = *) = 0.   (9)

We found empirically that the schedule of message updates affects convergence to a large extent. A good schedule is to update all the η-messages first (updating together the groups of η-messages belonging to each factor a ∈ F), and then updating the µ-messages together.
This seems to work better than the schedule defined by residual belief propagation [6] on the relaxed MRF. In terms of efficiency, for a MRF with N pairwise factors, the sum-product algorithm passes 2Nq real numbers in its factor-to-variable messages, while RSP passes 2Nq + q. Empirically, we observe that RSP on the relaxed MRF runs as fast as the simple sum-product algorithm on the original MRF, with an overhead for determining the values of y.

4 Experimental Results

While Ising models with attractive couplings are exactly solvable by graph-cut algorithms, general Ising models with mixed couplings on complete graphs are NP-hard [4], and graph-cut algorithms are not applicable to graphs with mixed couplings [12]. In this section, we perform three sets of experiments to show that RSP outperforms other approaches: the first set compares RSP and residual belief propagation on a simple graph, the second set compares the performance of various methods on randomly generated graphs with mixed couplings, and the third set applies RSP to the web person disambiguation task.

A simple example: we use a 4-node complete graph of binary variables, with the two sets of factors defined in Figure 2(a), for λ = +1 and λ = −1. The case λ = −1 was used in [8] to illustrate how the strengths of potentials affect the convergence of the sum-product algorithm. We also show the case of λ = +1 (an attractive network) as a case where the sum-product algorithm converges well. Both sets of graphs (λ = +1 or −1) have uniform marginals, and 2 MAP configurations (modes).

[Figure 2 here: (a) the 4-node (binary) complete graph, with pairwise factors ψ_{i,j}(x_i, x_j) = exp(θ_{i,j}/4) if x_i = x_j, and exp(−θ_{i,j}/4) if x_i ≠ x_j; (b) results for λ = +1; (c) results for λ = −1.]

Figure 2: In (a), we define the factors under the two settings λ = ±1. (b) and (c) show the L2 distance between the returned marginals and the nearest mode of the graph. Circles on the lines mean failure to converge, where we take the marginals at the last iteration.

In Figure
2(b) and 2(c), we show experimental results for λ = +1 and −1. In each case, we vary θ from 0 to 12, and for each θ, run residual belief propagation (RBP) damped at 0.5 and RSP (undamped) on the corresponding graph. Both methods are randomly initialized. We plot the L2 distance between the returned marginals and the nearest mode marginals (marginals with probability one on the modes). The correct marginals are uniform, where the L2 distance is √0.5 ≈ 0.7. For small θ, both methods converge to the correct marginals. As θ is increased, for λ = +1 in Figure 2(b), both approaches converge to marginals with probability 1 on one of the modes. For λ = −1, however, RSP again converges to marginals indicating a mode, while RBP faces convergence problems for θ ≥ 8. Increasing θ corresponds to increasing N(ψ_{i,j}), and the sum-product algorithm fails to converge for large θ when λ = −1. When the algorithms converge for large θ, they converge not to the correct marginals, but to a MAP configuration. Increasing θ has the same effect as decreasing the temperature of a network: the behavior of the sum-product algorithm approaches that of the max-product algorithm, i.e. the max-product algorithm is the sum-product algorithm in the zero-temperature limit.

Ising models with mixed couplings: we conduct experiments on complete graphs of size 20 with different percentages of attractive couplings, using the Ising model with the energy function H(s) = −∑_{i,j} θ_{i,j} s_i s_j − ∑_i θ_i s_i, where s_i ∈ {−1, 1}. We draw θ_i from U[0, 0.1]. To control the percentage of attractive couplings, we draw θ_{i,j} from U[0, λ], and randomly assign negative signs to the θ_{i,j} with probability (1 − ρ), where ρ is the percentage of attractive couplings required. We vary λ from 1 to 3.
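The random mixed-coupling ensemble above can be generated as follows; this sketch uses a smaller size (n = 8 rather than the paper's 20) so that the brute-force optimum used as the comparison baseline stays cheap:

```python
import itertools
import random

random.seed(0)
n, lam, rho = 8, 2.0, 0.5   # graph size, coupling scale, fraction attractive

# theta_i ~ U[0, 0.1]; |theta_ij| ~ U[0, lam], sign made negative (repulsive)
# with probability (1 - rho).
theta_i = [random.uniform(0.0, 0.1) for _ in range(n)]
theta_ij = {(i, j): random.uniform(0.0, lam) * (1 if random.random() < rho else -1)
            for i, j in itertools.combinations(range(n), 2)}

def H(s):
    """Ising energy H(s) = -sum_{i<j} theta_ij s_i s_j - sum_i theta_i s_i."""
    pair = sum(t * s[i] * s[j] for (i, j), t in theta_ij.items())
    field = sum(t * si for t, si in zip(theta_i, s))
    return -pair - field

# Brute-force optimum, as used for the energy differences reported in Figure 3.
best = min(itertools.product((-1, 1), repeat=n), key=H)
```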
In Figure 3, we plot the difference between the optimal energy (obtained with a brute-force search) and the energy returned by each of the following approaches: RSP, max-product belief propagation (MBP), the convergent tree-reweighted max-product belief propagation (TRW-S) [11], residual sum-product belief propagation (RBP) [6], and tree-structured expectation propagation (TEP) [15]. Each point on the graph is the average over 30 randomly generated networks. In Table 1, we compare RSP against these methods. When an algorithm does not converge, we take its result at the last iteration. We damp RBP and TEP with a 0.5 damping factor. For RSP, MBP, TRW-S and RBP, we randomly initialize the messages and take the best result after 5 restarts. For TEP, we use five different trees consisting of a maximal spanning tree and four random stars [19]. For RSP, RBP and TEP, which are variants of the sum-product algorithm, we lower the temperature by a factor of 2 each time the method converges, and stop when the method fails to converge or when the results do not improve over the last temperature.

We observe that MBP outperforms TRW-S consistently: this agrees with [11], where MBP outperforms TRW-S for graphs with mixed couplings. While the performance of TRW-S remains constant from 25% to 75%, the sum-product based algorithms (RBP and TEP) improve as the percentage of attractive potentials is increased. In all three cases, RSP is one of the best performing methods, beaten only by TEP at 2 points on the 50% graph. TEP, being of the class of generalized belief propagation [19], runs significantly slower than RSP.

Supervised clustering: Finley and Joachims [7] formulated SVM^cluster, which learns an item-pair similarity measure, Sim(i, j), to minimize a correlation clustering objective on a training set.
In training SVM^cluster, they have to minimize E(x) = −∑_{i,j} Sim(i, j) δ(x_i, x_j), where x_i ∈ {1, .., U} is the cluster-id of item i, and U an upper bound on the number of clusters. They tried a greedy and a linear programming approach, and concluded that the two approaches are comparable. Due to time constraints, we did not implement SVM^cluster: instead we test our inference algorithms on the pairwise classification clustering (PCC) baseline in [7]. The PCC baseline trains svmlight [9] on training item-pairs, and runs the classifier through all test pairs. For each test pair (i, j), we apply softmax to the classifier outputs to obtain the probability p_{i,j} that the pair is in the same cluster.

[Figure 3 here, panels (a) 75%, (b) 50%, (c) 25%.]

Figure 3: Experiments on the complete-graph Ising model with mixed couplings (legend in (a)), with different percentages of attractive couplings. The y-axis shows, in log scale, the average energy difference between the configuration found by the algorithm and the optimal solution.

Table 1: Number of trials (out of 30) where RSP does better/worse than various methods. In particular, the last row (opt) shows the number of times that RSP does worse than the optimal solution.

                75% attractive                   50% attractive                   25% attractive
  λ =     1    1.5     2   2.5     3       1    1.5     2   2.5     3       1    1.5     2   2.5     3
mbp     2/0   2/0   0/0   1/0   1/0     7/6  11/5  14/0  10/2   9/6    20/2  13/3  16/0  13/3  15/2
trw-s  26/0  24/0  22/0  25/0  25/0    28/0  29/0  29/0  27/0  28/1    29/0  27/0  30/0  28/1  27/0
rbp     1/0   0/0   0/0   2/0   0/0    22/0  14/2  12/0   9/1  13/5    22/0  16/6  15/2  21/0  17/0
tep     2/0   2/0   0/0   2/0   0/0    14/3   9/3  11/2   6/2   6/5    23/1  15/4  10/2  16/2  15/2
opt     0/0   0/0   0/0   0/1   0/0     0/7   0/8   0/2   0/2   0/7     0/6  0/10   0/4   0/4   0/2

Defining Sim(i, j) = log(p_{i,j}/(1 − p_{i,j})), we minimize E(x) to cluster the test set. We found that the various inference algorithms perform poorly on the MRF for large U, even when they converge (probably due to a large number of minima in the approximation).
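Given the pairwise probabilities p_{i,j}, the correlation clustering objective can be sketched as follows (probabilities hypothetical; brute force stands in for the inference algorithms on this tiny instance):

```python
import math
import itertools

# Hypothetical same-cluster probabilities p_ij from the pairwise classifier
# (softmax outputs); items 0-2 look alike, item 3 looks separate.
p = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.7,
     (0, 3): 0.1, (1, 3): 0.2, (2, 3): 0.15}

def sim(i, j):
    """Sim(i, j) = log(p_ij / (1 - p_ij)), the same-cluster log-odds."""
    pij = p[(min(i, j), max(i, j))]
    return math.log(pij / (1.0 - pij))

def E(x):
    """E(x) = -sum over item pairs of Sim(i, j) * [x_i == x_j]."""
    return -sum(sim(i, j)
                for i, j in itertools.combinations(range(len(x)), 2)
                if x[i] == x[j])

# Brute-force minimization over cluster ids with U = 2 (tiny instances only).
best = min(itertools.product(range(2), repeat=4), key=E)
```

Pairs with p_{i,j} > 0.5 have positive Sim and pull items together; pairs with p_{i,j} < 0.5 push them apart, so the minimizer here separates item 3 from the rest.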
We are able to obtain lower-energy configurations with the recursive 2-way partitioning procedure in [5] used for graph cuts. (Graph cuts do not apply here, as weights can be negative.) This procedure involves recursively running, e.g., RSP on the MRF for E(x) with U = 2, and applying the Kernighan-Lin algorithm [10] for local refinements among the current partitions. Each time RSP returns a configuration that partitions the data, we run RSP on each of the two partitions. We terminate the recursion when RSP assigns the same value to all variables, placing all remaining items in one cluster.

We use the web person disambiguation task defined in SemEval-2007 [1] as the test application. Training data consists of 49 sets of web pages (we use the 29 sets with more than 50 documents), where each set (or domain) contains the results from a search query on a person name. The test data contains another 30 domains. Each domain is manually annotated into clusters, with each cluster containing pages referring to a single individual. We use a simple feature filtering approach to select features that are useful across many domains in the training data. Candidate features include (i) words occurring in only one document of the document-pair, (ii) words co-occurring in both documents, (iii) named entity matches between the documents, and (iv) topic correlation features.

For comparison, we replace RSP with MBP and TRW-S as inference algorithms (we did not run RBP and TEP, as they are very slow on these problems because they often fail to converge). We also implemented the greedy algorithm (Greedy) in [7]. We tried the linear programming approach, but free off-the-shelf solvers seem unable to scale to these problems. Results comparing RSP with Greedy, MBP and TRW-S are shown in Table 2. The F-measure attained by RSP for this SemEval task [1] is equal to that of the systems ranked second and third out of 16 participants (official results yet unpublished).
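The recursive partitioning loop described above can be sketched as follows; the Kernighan-Lin refinement step is omitted, and a trivial placeholder stands in for the U = 2 inference call (any 2-way solver such as RSP would slot in):

```python
# Sketch of the recursive 2-way partitioning loop. `two_way` returns one label
# in {0, 1} per item; recursion stops when all items get the same label.
def recursive_partition(items, two_way):
    labels = two_way(items)
    groups = [[it for it, l in zip(items, labels) if l == v] for v in (0, 1)]
    if not groups[0] or not groups[1]:       # same value assigned to all:
        return [items]                       # keep remaining items as one cluster
    return (recursive_partition(groups[0], two_way)
            + recursive_partition(groups[1], two_way))

def split_on_mean(items, gap=5):
    """Placeholder 2-way solver (illustrative only): split around the mean,
    or refuse to split when the spread is small."""
    if max(items) - min(items) < gap:
        return [0] * len(items)
    m = sum(items) / len(items)
    return [int(x > m) for x in items]

clusters = recursive_partition([1, 2, 3, 10, 11, 12], split_on_mean)
```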
We found that although TRW-S is guaranteed to converge, it performs poorly. RSP converges far better than MBP, but due to the Kernighan-Lin corrections that we run at each iteration, results can sometimes be corrected to a large extent by the local refinements.

Table 2: Results for the web person disambiguation task. (*: TRW-S is guaranteed to converge.)

Method                                                    RSP      MBP      TRW-S    Greedy
Test domains where RSP attains lower/higher energy E(x)   0/0      9/6      16/7     22/5
Percentage of convergence over all runs                   91%      74%      100% *   -
F-measure of purity and inverse purity [1]                75.08%   74.97%   74.61%   74.78%

5 Related work and conclusion

In this paper, we formulated RSP, generalizing the formulation of SP-ρ in [14]. SP-ρ is the sum-product interpretation of the survey propagation (SP) algorithm [3]. SP has been shown to work well for hard instances of 3-SAT, near the phase transition where local search algorithms fail. However, its application has been limited to constraint satisfaction problems [3]. In RSP, we took inspiration from the SP-y algorithm [2] in adding a penalty term for violated clauses. SP-y works on MAX-SAT problems, and SP can be considered as SP-y with y taken to ∞, hence disallowing violated constraints. This is analogous to the relation between RSP and SP-ρ [14] (see Theorem 1). RSP is however different from SP-y, since we address weighted MAX-SAT problems. Even if all weights are equal, RSP is still different from SP-y, which, so far, does not have a sum-product formulation on an alternative MRF. We show that while RSP is the sum-product algorithm on a relaxed MRF, it can be used for solving the energy minimization problem. By tuning the strengths of the factors (based on the convergence criteria in [16]) while keeping the underlying distribution approximately correct, RSP converges well even at low temperatures. This enables it to return low-energy configurations on MRFs where other methods fail.
As far as we know, this is the first application of convergence criteria to aid the convergence of belief propagation algorithms, and this mechanism can be used to exploit future work on sufficient conditions for the convergence of belief propagation algorithms.

Acknowledgments

We would like to thank Yee Fan Tan for his help on the web person disambiguation task, and Tomas Lozano-Perez and Leslie Pack Kaelbling for valuable comments on the paper. The research is partially supported by ARF grant R-252-000-240-112.

References

[1] "Web person disambiguation task at SemEval," 2007. [Online]. Available: http://nlp.uned.es/weps/taskdescription-2.html
[2] D. Battaglia, M. Kolar, and R. Zecchina, "Minimizing energy below the glass thresholds," Physical Review E, vol. 70, 2004.
[3] A. Braunstein, M. Mezard, and R. Zecchina, "Survey propagation: An algorithm for satisfiability," Random Struct. Algorithms, vol. 27, no. 2, 2005.
[4] B. A. Cipra, "The Ising model is NP-complete," SIAM News, vol. 33, no. 6, 2000.
[5] C. Ding, "Spectral clustering," ICML '04 tutorial, 2004.
[6] G. Elidan, I. McGraw, and D. Koller, "Residual belief propagation: Informed scheduling for asynchronous message passing," in UAI, 2006.
[7] T. Finley and T. Joachims, "Supervised clustering with support vector machines," in ICML, 2005.
[8] T. Heskes, "On the uniqueness of loopy belief propagation fixed points," Neural Computation, vol. 16, 2004.
[9] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Norwell, MA, USA: Kluwer Academic Publishers, 2002.
[10] B. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell System Technical Journal, 1970.
[11] V. Kolmogorov, "Convergent tree-reweighted message passing for energy minimization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, 2006.
[12] V. Kolmogorov and R. Zabih, "What energy functions can be minimized via graph cuts?"
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, 2004.
[13] F. Kschischang, B. Frey, and H. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Transactions on Information Theory, vol. 47, no. 2, 2001.
[14] E. Maneva, E. Mossel, and M. Wainwright, "A new look at survey propagation and its generalizations," 2004. [Online]. Available: http://arxiv.org/abs/cs.CC/0409012
[15] T. Minka and Y. Qi, "Tree-structured approximations by expectation propagation," in NIPS, 2004.
[16] J. M. Mooij and H. J. Kappen, "Sufficient conditions for convergence of loopy belief propagation," in UAI, 2005.
[17] J. D. Park, "Using weighted MAX-SAT engines to solve MPE," in AAAI, 2002.
[18] Y. Weiss and W. T. Freeman, "On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs," IEEE Transactions on Information Theory, vol. 47, no. 2, 2001.
[19] M. Welling, T. Minka, and Y. W. Teh, "Structured region graphs: Morphing EP into GBP," in UAI, 2005.
[20] J. S. Yedidia, W. T. Freeman, and Y. Weiss, "Constructing free-energy approximations and generalized belief propagation algorithms," IEEE Transactions on Information Theory, vol. 51, no. 7, 2005.