AC C E P T E D T O A P P E A R AT N E U R A L I N F O R M AT I O N P RO C E S S I N G S Y S T E M S 2 0 0 7 P R E L I M I NA RY V E R S I O N - S U B J E C T TO R E V I S I O N Loop Series and Bethe Variational Bounds in Attractive Graphical Models Erik B. Sudderth and Martin J. Wainwright Electrical Engineering & Computer Science University of California, Berkeley sudderth@eecs.berkeley.edu, wainwrig@eecs.berkeley.edu Alan S. Willsky Electrical Engineering & Computer Science Massachusetts Institute of Technology willsky@mit.edu Abstract Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) provides often accurate approximations, but not bounds. We prove that for a class of attractive binary models, the so­called Bethe approximation associated with any fixed point of loopy BP always lower bounds the true likelihood. Empirically, this bound is much tighter than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points. 1 Introduction Graphical models are widely used in many areas, including statistical machine learning, computer vision, bioinformatics, and communications. Such applications typically require computationally efficient methods for (approximately) solving various problems, including computing marginal distributions and likelihood functions. The variational framework provides a suite of candidate methods, including mean field approximations [3, 9], the sum­product or belief propagation (BP) algorithm [11, 14], Kikuchi and cluster variational methods [23], and related convex relaxations [21]. The likelihood or partition function of an undirected graphical model is of fundamental interest in many contexts, including parameter estimation, error bounds in hypothesis testing, and combinatorial enumeration. In rough terms, particular variational methods can be understood as solving optimization problems whose optima approximate the log partition function. For mean field methods, this optimal value is desirably guaranteed to lower bound the true likelihood [9]. For other methods, including the Bethe variational problem underlying loopy BP [23], optima may either over­estimate or under­estimate the truth. Although "convexified" relaxations of the Bethe problem yield upper bounds [21], to date the best known lower bounds on the partition function are based on mean field theory. Recent work has studied loop series expansions [2, 4] of the partition function, which generate better approximations but not, in general, bounds. Several existing theoretical results show that loopy BP, and the corresponding Bethe approximation, have desirable properties for graphical models with long cycles [15] or sufficiently weak dependencies [6, 7, 12, 19]. However, these results do not explain the excellent empirical performance of BP in many graphs with short cycles, like the nearest­neighbor grids arising in spatial statistics and low­level vision [3, 17, 22]. Such models often encode "smoothness" priors, and thus have attractive interactions which encourage connected variables to share common values. The first main contribution of this paper is to demonstrate a family of attractive models for which the Bethe variational method always yields lower bounds on the true likelihood. Although we focus on models with 1 binary variables (but arbitrary order of interactions), we suspect that some ideas are more generally applicable. For such models, these lower bounds are easily computed from any fixed point of loopy BP, and empirically improve substantially on naive mean field bounds. Our second main contribution lies in the route used to establish the Bethe lower bounds. In particular, Sec. 3 uses the reparameterization characterization of BP fixed points [20] to provide a simple derivation for the loop series expansion of Chertkov and Chernyak [2]. The Bethe approximation is the first term in this representation of the true partition function. Sec. 4 then identifies attractive models for which all terms in this expansion are positive, thus establishing the Bethe lower bound. We conclude with empirical results demonstrating the accuracy of this bound, and discuss implications for future analysis and applications of loopy BP. 2 Undirected Graphical Models Given an undirected graph G = (V , E ), with edges (s, t) E connecting n vertices s V , a graphical model associates each node with a random variable Xs taking values xs X . For pairwise Markov random fields (MRFs) as in Fig. 1, the joint distribution of x := {xs | s V } is specified via a normalized product of local compatibility functions: ( 1s p(x) = s (xs ) st (xs , xt ) (1) Z V s,t)E x s ( The partition function Z := s (xs ) X n s,t) st (xs , xt ), whose value depends on the compatibilities , is defined so that p(x) is properly normalized. We also consider distributions defined by hypergraphs G = (V , C ), where each hyperedge c C connects some subset of the vertices (c V ). Letting xc := {xs | s c}, the corresponding joint distribution equals c 1s p(x) = s (xs ) c (xc ) (2) Z V C x s c where as before Z = s (xs ) c (xc ). Such higher­order random fields are conveX n niently described by the bipartite factor graphs [11] of Fig. 2. In statistical physics, the partition function Z arises in the study of how physical systems respond to changes in external stimuli or temperature [23]. Alternatively, when compatibility functions are parameterized by exponential families [20], log Z is the family's cumulant generating function, and thus intrinsically related to the model's marginal statistics. For directed Bayesian networks (which can be factored as in eq. (2)), Z is the marginal likelihood of observed data, and plays a central role in learning and model selection [9]. However, for general graphs coupling discrete random variables, the cost of exactly evaluating Z grows exponentially with n [8]. Computationally tractable families of bounds on the true partition function are thus of great practical interest. 2.1 Attractive Discrete Random Fields In this paper, we focus on binary random vectors x {0, 1}n . We say that a pairwise MRF, with compatibility functions st : {0, 1}2 R+, has attractive interactions if st (0, 0) st (1, 1) st (0, 1) st (1, 0) (3) for each edge (s, t) E . Intuitively, this condition requires all potentials to place greater weight on configurations where neighboring variables take the same value. Our later analysis is based on pairwise marginal distributions st (xs , xt ), which we parameterize as follows: 1 s := Est [Xs ] - s - t + st t - st (4) st (xs , xt ) = s - st st st := Est [Xs Xt ] We let Est [·] denote expectation with respect to st (xs , xt ), so that st is the probability that Xs = Xt = 1. This normalized matrix is attractive, satisfying eq. (3), if and only if st s t . For binary variables, the pairwise MRF of eq. (1) provides one representation of a general, inhomogeneous Ising model. In the statistical physics literature, Ising models are typically expressed by coupling random spins zs {-1, +1} with symmetric potentials log st (zs , zt ) = st zs zt . The attractiveness condition of eq. (3) then becomes st 0, and the resulting model has ferromagnetic interactions. Furthermore, pairwise MRFs satisfy the regularity condition of [10], and thus allow tractable MAP estimation via graph cuts [5, 18], if and only if they are attractive. Even for attractive models, however, calculation of the partition function in non­planar graphs is NP­complete [8]. 2 To define families of higher­order attractive potentials, we first consider a probability distribution c (xc ) on k = |c| binary variables. Generalizing eq. (4), we parameterize such distributions by the following collection of 2k - 1 mean parameters: s =ac (5) a := Ec Xs a For example, stu (xs , xt , xu ) would be parameterized by {s , t , u , st , su , tu , stu }. For any subset a c, we then define the following central moment statistic: s a := Ec (Xs - s ) =ac (6) a Note that s = 0, while st = Cov (Xs , Xt ) = st - s t . The third­order central moment then equals the cumulant stu = stu - st u - su t - tu s + 2s t u . Given these definitions, we say that a probability distribution c (xc ) is attractive if the central moments associated with all subsets a c of binary variables are non­negative (a 0). Similarly, a compatibility function c (xc ) is attractive if the probability distribution attained by normalizing its values has non­negative central moments. For example, the following potential is easily shown to satisfy this condition for all degrees k = |c|, and any scalar c > 0: x1 = x2 = · · · = xk c log c (x1 , . . . , xk ) = (7) -c otherwise 2.2 Belief Propagation and the Bethe Variational Principle Many applications of graphical models require estimates of the posterior marginal distributions of individual variables s (xs ) or factors c (xc ). Loopy belief propagation (BP) approximates these marginals via a series of messages passed among nodes of the graphical model [14, 23]. Let (s) denote the set of factors which depend on Xs , or equivalently the neighbors of node s in the corresponding factor graph. The BP algorithm then iterates the following message updates: t d x c (xc ) mtc (xt ) (8) ¯ msc (xs ) s (xs ) ¯ mds (xs ) mcs (xs ) (s)\c c\s c\ s The left­hand expression updates the message msc (xs ) passed from variable node s to factor c. New ¯ outgoing messages mcs (xs ) from factor c to each s c are then determined by marginalizing the incoming messages from other nodes. At any iteration, appropriately normalized products of these messages define estimates of the desired marginals: c t s (xs ) s (xs ) mcs (xs ) c (xc ) c (xc ) mtc (xt ) ¯ (9) (s) c In tree­structured graphs, BP defines a dynamic programming recursion which converges to the exact marginals after finitely many iterations [11, 14]. In graphs with cycles, however, convergence is not guaranteed, and pseudo­marginals computed via eq. (9) are (often good) approximations. A wide range of inference algorithms can be derived via variational approximations [9] to the true partition function. Loopy BP is implicitly associated with the following Bethe approximation: x c x s c (xc ) log c (xc ) s (xs ) log s (xs ) + log Z ( ) = V s C c - Fixed points of loopy BP correspond to stxtionary points of this Bethe approximation [23], subject a to the local marginalization constraints c (xc ) = s (xs ). c\s s V x s (xs ) log s (xs ) - s c C x c (xc ) log c (x ) tc c c t (xt ) (10) 3 Reparameterization and Loop Series Expansions As discussed in Sec. 2.2, any BP fixed point is in one­to­one correspondence with a set {s , c } of pseudo­marginals associated with each of the graph's nodes s V and factors c C . These pseudo­marginals then lead to an alternative parameterization [20] of the factor graph of eq. (2): c (x ) 1s tc c (11) s (xs ) p(x) = Z ( ) c t (xt ) V C 3 For pairwise MRFs, the reparameterized compatibility functions equal st (xs , xt )/s (xs )t (xt ). The BP algorithm effectively searches for reparameterizations which are tree­consistent, so that c (xc ) is the exact marginal distribution of Xc for any tree (or forest) embedded in the original graph [20]. In later sections, we take expectations with respect to c (xc ) of functions f (xc ) defined over individual factors. Although these pseudo­marginals will in general not equal the true marginals pc (xc ), BP fixed points ensure local consistency so that Ec [f (Xc )] is well­defined. Using eq. (10), it is easily shown that the Bethe approximation Z ( ) = 1 for any joint distribution defined by reparameterized potentials as in eq. (11). For simplicity, the remainder of this paper focuses on reparameterized models of this form, and analyzes properties of the corresponding exact partition function Z ( ). The resulting expansions and bounds are then related to the original MRF's partition function via the positive constant of eq. (10). Recently, Chertkov and Chernyak proposed a finite loop series expansion [2] of the partition function, whose first term coincides with the Bethe approximation. They provide two derivations: one applies a trigonometric identity to Fourier representations of binary variables, while the second employs a saddle point approximation obtained via an auxiliary field of complex variables. The gauge transformations underlying these derivations are a type of reparameterization, but their form is complicated by auxiliary variables and extraneous degrees of freedom. In this section, we show that the fixed point characterization of eq. (11) leads to a more direct, and arguably simpler, derivation. 3.1 Pairwise Loop Series Expansions We begin by developing a loop series expansion for pairwise MRFs. Given an undirected graph G = (V , E ), and some subset F E of the graph's edges, let ds (F ) denote the degree (number of neighbors) of node s in the subgraph induced by F . As illustrated in Fig. 1, any subset F for which all nodes s V have degree ds (F ) = 1 defines a generalized loop [2]. The partition function for any binary, pairwise MRF can then be expanded via an associated set of loop corrections. Proposition 1. Consider a pairwise MRF defined on an undirected G = (V , E ), with reparameterized potentials as in eq. (11). The associated partition function then equals ( s ( Z ( ) = 1 + F Es Xs - s )ds (F ) st (12) F := =F E V s,t)F st - s t Covst (Xs , Xt ) = s (1 - s )t (1 - t ) Vars (Xs ) Vart (Xt ) where only generalized loops F lead to non­zero terms in the sum of eq. (12), and ( = ( Es Xs - s )d s (1 - s ) 1 - s )d-1 + (-1)d (s )d-1 are central moments of the binary variables at individual nodes. st := (13) ( 14) Proof. To establish the expansion of eq. (12), we exploit the following polynomial representation of reparameterized pairwise compatibility functions: st (xs , xt ) = 1 + st (xs - s )(xt - t ) (15) s (xs )t (xt ) As verified in the Appendix, this expression is satisfied for any (xs , xt ) {0, 1}2 if st is defined as in eq. (13). For attractive models satisfying eq. (3), st 0 for allsedges. Using E [·] to denote ~ s (xs ), we then have expectation with respect to the fully factorized distribution (x) = ~ x s ( st (xs , xt ) Z ( ) = s (xs ) s (xs )t (xt ) { 0, 1} n V s,t)E ( ( ( = st (Xs , Xt ) 1 + st (Xs - s )(Xt - t ) 16) = E E ~ ~ s (Xs )t (Xt ) s,t)E s,t)E Expanding this polynomial via the expectation operator's linearity, we recover one term for each non­empty subset F E of the graph's edges: ( ( Z ( ) = 1 + E st (Xs - s )(Xt - t ) 17) ~ =F E s,t)F The expression in eq. (12) then follows from the independence structure of (x), and standard ~ formulas for the moments of Bernoulli random variables. To evaluate these terms, note that if 4 Figure 1: A pairwise MRF coupling ten binary variables (left), and the nine generalized loops in its loop series expansion (right). For attractive potentials, two of the generalized loops may have negative signs (second & third from right), while the core graph of Thm. 1 contains eight variables (far right). ds (F ) = 1, it follows that Es [Xs - s ] = 0. There is thus one loop correction for each generalized loop F , in which all connected nodes have degree at least two. Figure 1 illustrates the set of generalized loops associated with a particular pairwise MRF. These loops effectively define corrections to the Bethe estimate Z ( ) 1 of the partition function for reparameterized models. Tree­structured graphs do not contain any non­trivial generalized loops, and the Bethe variational approximation is thus exact. The loop expansion formulas of [2] can be precisely recovered by transforming binary variables to a spin representation, and refactoring terms from the denominator of edge weights st to adjacent vertices. Explicit computation of these loop corrections is in general intractable; for example, fully connected graphs with n 5 nodes have more than 2n generalized loops. In some cases, accounting for a small set of significant loop corrections may lead to improved approximations to Z ( ) [4], or more accurate belief estimates for LDPC codes [1]. We instead use the series expansion of Prop. 1 to establish analytic properties of BP fixed points. 3.2 Factor Graph Loop Series Expansions We now extend the loop series expansion to higher­order MRFs defined on hypergraphs G = (V , C ). Let E = {(s, c) | c C, s c} denote the set of edges in the factor graph representation of this MRF. As illustrated in Fig. 2, we define a generalized loop to be a subset F E of edges such that all connected factor and variable nodes have degree at least two. Proposition 2. Consider any factor graph G = (V , C ) with reparameterized potentials as in eq. (11), and associated edges E . The partition function then equals ( c s ac (F ) (18) Es Xs - s )ds (F ) Z ( ) = 1 + F F := =F E V C a a (Xs - s ) a := t (19) = a t (1 - t ) a Vart (Xt ) where ac (F ) := {s c | (s, c) F } denotes the subset of variables linked to factor node c by the edges in F . Only generalized loops F lead to non­zero terms in the sum of eq. (18). Ec Proof. As before, we employ a polynomial representation of the reparameterized factors in eq. (11): a s (x ) tc c =1+ a (xs - s ) (20) c t (xt ) a c,|a|2 ts For factor graphs with attractive reparameterized potentials, the constant a 0 for all a c. Note that this representation, which is derived in the Appendix, reduces to that of eq. (15) when c = {s, t}. Single­variable subsets are excluded in eq. (20) because s = Es [Xs - s ] = 0. Applying eq. (20) as in our earlier derivation for pairwise MRFs (see eq. (16)), we may express the partition function of the reparameterized factor graph as follows: c = ( c s c (Xc ) t 1+ E a (Xs - s ) 21) Z ( ) = E ~ ~ c t (Xt ) a C C =ac Note that a = 0 for any subset where |a| = 1. There is then a one­to­one correspondence between variable node subsets a c, and subsets {(s, c) | s a} of the factor graph's edges E . Expanding this expression by F E , it follows that each factor c C contributes a term corresponding to the chosen subset ac (F ) of its edges: c ( s Z ( ) = 1 + E ac (F ) (Xs - s ) 22) ~ =F E C ac (F ) 5 Figure 2: A factor graph (left) with three binary variables (circles) and four factor nodes (squares), and the thirteen generalized loops in its loop series expansion (right, along with the full graph). Note that = 1. Equation (18) then follows from the independence properties of (x). For a term ~ in this loop series to be non­zero, there must be no degree one variables, since Es [Xs - s ] = 0. In addition, the definition of a implies that there can be no degree one factor nodes. 4 Lower Bounds in Attractive Binary Models The Bethe approximation underlying loopy BP differs from mean field methods [9], which lower bound the true log partition function Z ( ), in two key ways. First, while the Bethe entropy (second line of eq. (10)) is exact for tree­structured graphs, it approximates (rather than bounds) the true entropy in graphs with cycles. Second, the marginalization condition imposed by loopy BP relaxes (rather than strengthens) the global constraints characterizing valid distributions [21]. Nevertheless, we now show that for a large family of attractive graphical models, the Bethe approximation Z ( ) of eq. (10) lower bounds Z ( ). In contrast with mean field methods, these bounds hold only at appropriate BP fixed points, not for arbitrarily chosen pseudo­marginals c (xc ). 4.1 Partition Function Bounds for Pairwise Graphical Models Consider a pairwise MRF defined on G = (V , E ), as in eq. (1). Let VH V denote the set of nodes which either belong to some cycle in G, or lie on a path (sequence of edges) connecting two cycles. We then define the core graph H = (VH , EH ) as the node­induced subgraph obtained by discarding edges from nodes outside VH , so that EH = {(s, t) E | s, t VH }. The unique core graph H underlying any graph G can be efficiently constructed by iteratively pruning degree one nodes, or leaves, until all remaining nodes have two or more neighbors. The following theorem identifies conditions under which all terms in the loop series expansion must be non­negative. Theorem 1. Let H = (VH , EH ) be the core graph for a pairwise binary MRF, with attractive potentials satisfying eq. (3). Consider any BP fixed point for which all nodes s VH with three or 1 more neighbors in H have marginals s 2 (or equivalently, s 1 ). The corresponding Bethe 2 variational approximation Z ( ) then lower bounds the true partition function Z ( ). Proof. It is sufficient to show that Z ( ) 1 for any reparameterized pairwise MRF, as in eq. (11). From eq. (9), note that loopy BP estimates the pseudo­marginal st (xs , xt ) via the product of st (xs , xt ) with message functions of single variables. For this reason, attractive pairwise compatibilities always lead to BP fixed points with attractive pseudo­marginals satisfying st s t . Consider the pairwise loop series expansion of eq. (12). As showns by eq. (13), attractive odels m ( Es Xs - s )ds (F ) lead to edge weights st 0. It is thus sufficient to show that 0 for each generalized loop F E . Suppose first that the graph has a single cycle, and thus exactly one non­zero generalized(loop F . Because all connected nodes in this cycle have degree two, the bound follows because Es Xs - s )2 0. More generally, we clearly have Z ( ) 1 in graphs where every generalized loop F associates an even number of neighbors ds (F ) with each node. Focu( ing on gene lized loops containing nodes with odd degree d 3, eq. (14) implies that s ra Es Xs - s )d 0 for marginals satisfying 1 - s s . For BP fixed points in which s 1 2 for all nodes, we thus have Z ( ) 1. In particular, the symmetric fixed point s = 1 leads to uni2 formly positive generalized loop corrections. More generally, the marginals of nodes s for which ds (F ) 2 for every generalized loop F do not influence the expansion's positivity. Theorem 1 discards these nodes by examining the topology of the core graph H (see Fig. 1 for an example). For fixed points where s 1 for all nodes, we rewrite the polynomial in the loop expansion of 2 eq. (15) as (1 + st (s - xs )(t - xt )), and employ an analogous line of reasoning. 6 In addition to establishing Thm. 1, our arguments show that the true partition function monotonically increases as additional edges, with attractive reparameterized potentials as in eq. (11), are added to a graph with fixed pseudo­marginals s 1 . For such models, the accumulation of particular 2 loop corrections, as explored by [4], produces a sequence of increasingly tight bounds on Z ( ). In addition, we note that the conditions required by Thm. 1 are similar to those underlying classical correlation inequalities [16] from the statistical physics literature. Indeed, the Griffiths­Kelly­ 1 Sherman (GKS) inequality leads to an alternative proof in cases where s = 2 for all nodes. 1 For attractive Ising models in which some nodes have marginals s > 2 and others t < 1 , the loop 2 series expansion may contain negative terms. For small graphs like that in Fig. 1, it is possible to use upper bounds on the edge weights st , which follow from st min(s , t ), to cancel negative loop corrections with larger positive terms. As confirmed by the empirical results in Sec. 4.3, the lower bound Z ( ) Z ( ) thus continues to hold for many (perhaps all) attractive Ising models with less homogeneous marginal biases. 4.2 Partition Function Bounds for Factor Graphs Given a factor graph G = (V , C ) relating binary variables, define a core graph H = (VH , CH ) by excluding variable and factor nodes which are not members of any generalized loops. As in Sec. 2.2, let (s) denote the set of factor nodes neighboring variable node s in the core graph H . Theorem 2. Let H = (VH , CH ) be the core graph for a binary factor graph, and consider an attractive BP fixed point for which one of the following conditions holds: ( i) s ( ii) s 1 2 1 2 for all nodes s VH with |(s)| 3, and a 0 for all a c, c CH . for all nodes s VH with |(s)| 3, and (-1)|a| a 0 for all a c, c CH . The Bethe approximation Z ( ) then lower bounds the true partition function Z ( ). 1 For the case where s 2 , the proof of this theorem is a straightforward generalization of the 1 arguments in Sec. 4.1. When s 2 , we replace all (xs - s ) terms by (s - xs ) in the expansion of eq. (20), and again recover uniformly positive loop corrections. For any given BP fixed point, the conditions of Thm. 2 are easy to verify. For factor graphs, it is more challenging to determine which compatibility functions c (xc ) necessarily lead to attractive fixed points. For symmetric potentials as in eq. (7), however, one can show that the conditions on a , a c are necessarily satisfied whenever all variable nodes s VH have the same bias. 4.3 Empirical Comparison of Mean Field and Bethe Lower Bounds In this section, we compare the accuracy of the Bethe variational bounds established by Thm. 1 to those produced by a naive, fully factored mean field approximation [3, 9]. Using the spin representation zs {-1, +1}, we examine Ising models with attractive pairwise potentials log st (zs , zt ) = st zs zt of varying strengths st 0. We first examine a 2D torus, with poten¯ tials of uniform strength st = and no local observations. For such MRFs, the exact partition function may be computed via Onsager's classical eigenvector method [13]. As shown in Fig. 3(a), ¯ ¯ for moderate the Bethe bound Z ( ) is substantially tighter than mean field. For large , only ¯|E |). In this two states (all spins "up" or "down") have significant probability, so that Z 2 exp( regime, loopy BP exhibits "symmetry breaking" [6], and converges to one of these states at random ¯ ¯ with corresponding bound Z ( ) exp(|E |). As verified in Fig. 3(a), as the difference log Z - log Z ( ) log 2 0.69 thus remains bounded. We also consider a set of random 10 × 10 nearest,­neighbor grids, with inhomogeneous pairwise 0 ¯ potentials 0ampled according to |st | N , 2 and observation potentials log s (zs ) = s zs , s . 2 ¯, we sample 100 random MRFs, and plot the average dif|s | N , 0.1 For each candidate ference log Z ( ) - log Z between the true partition function and the BP (or mean field) fixed point reached from a random initialization. Fig. 3(b) first considers MRFs where s > 0 for all nodes, so that the conditions of Thm. 1 are satisfied for all BP fixed points. For these models, the Bethe bound is extremely accurate. In Fig. 3(c), we also consider MRFs where the observation potentials s are of mixed signs. Although this sometimes leads to BP fixed points with negative associated loop corrections, the Bethe variational approximation nevertheless always lower bounds the true partition function in these examples. We hypothesize that this bound in fact holds for all attractive, binary pairwise MRFs, regardless of the observation potentials. 7 10 Difference from True Log Partition 0 -10 -20 -30 -40 -50 -60 -70 0 0.2 Belief Propagation Mean Field 0.4 0.6 Edge Strength 0.8 1 Difference from True Log Partition 0.5 0 -0.5 -1 -1.5 -2 -2.5 -3 0 0.2 Belief Propagation Mean Field 0.4 0.6 Edge Strength 0.8 1 2 Difference from True Log Partition 0 -2 -4 -6 Belief Propagation Mean Field -8 0 0.2 0.4 0.6 Edge Strength 0.8 1 (a) (b) (c) Figure 3: Bethe (dark blue, top) and naive mean field (light green, bottom) lower bounds on log Z ( ) for three families of attractive, pairwise Ising models. (a) 30 × 30 torus with no local observations and homogeneous potentials. (b) 10 × 10 grid with random, inhomogeneous potentials and all pseudo­marginals s > 1 , satisfy2 ing the conditions of Thm. 1. (c) 10 × 10 grid with random, inhomogeneous potentials and pseudo­marginals of mixed biases. Empirically, the Bethe lower bound also holds for these models. 5 Discussion We have provided an alternative, direct derivation of the partition function's loop series expansion, based on the reparameterization characterization of BP fixed points. We use this expansion to prove that the Bethe approximation lower bounds the true partition function in a family of binary attractive models. These results have potential implications for the suitability of loopy BP in approximate parameter estimation [3], as well as its convergence dynamics. We are currently exploring generalizations of our results to other families of attractive, or "nearly" attractive, graphical models. Appendix: Polynomial Representations of Compatibility Functions Consider a binary pairwise marginal distribution st (xs , xt ) parameterized by three mean parameters {s , t , st } as in eq. (4). Evaluating each (xs , xt ) {0, 1}2 , the pairwise potential equals 1 M =1 t s -s -t +st t -st s + (1-sts-(1-tt ) 1 - 1t -s t st (xs , xt ) ) ( -s ) (1-s )(1-t ) (1-s )t = s s -st st 1 - st --s t 1 + st-t t s (xs )t (xt ) s (1 t ) s (1-t ) s t s ultiplying and dividing so that the right­hand terms share a common denominator, we then immediately recover the polynomial representation of eq. (15). To establish the general result in eq. (20), we note that this expression equivalently implies that the following relationship holds for any real­valued function f (xc ) on k = |c| bits: t 1a f x s x t (xt ) + c (xc )f (xc ) = a (xs - s ) (xc ) c c c c,|a|2 a Ec [f (Xc )] = E ~ f (Xc ) + a a f (Xc ) c,|a|2 s (Xs - s ) a ( 23) t Here, E [·] denotes expectation with respect to (xc ) = ~ ~ c t (xt ). Because both sides of eq. (23) depend on k binary variables, it is sufficient to show equality for the 2k linearly independent funct tions fb (xc ) := b (xt - t ), b c. For the constant function f (xc ) = 1, eq. (23) becomes 1a = s a 1 = E + (Xs - s ) 1. (24) ~ c,|a|2 a = E [fa (Xc )] = 0 for any a = . For non­ The last equality follows since E ~ ~ a (Xs - s ) constant functions fb (xc ), verifying eq. (23) is thus equivalent to showing that f ( s a (Xs - s ) 25) Ec [fb (Xc )] = a E b (Xc ) ~ s c,|a|2 a If a = b, the product function fb (xc ) a (xs - s ) contains at least one degree­one term (xt - t ), and thus has mean zero with respect to (xc ). Both sides of eq. (25) then vanish if |b| = 1. For ~ 8 s higher­order subsets, the right­hand side of this expression has a single non­zero term: f = s Ec [fb (Xc )] = b E b (Xc ) (Xs - s ) ~ b Applying the definition of the central moment b from eq. (6), we have established that both sides of eq. (23) are satisfied by 2k linearly independent constraints, thus proving eqs. (15) and (20). t b E ~ Var (Xt ) b s (Xs - s )2 b = b (26) Acknowledgments The authors thank Yair Weiss for suggesting connections to loop series expansions, and helpful conversations. Funding provided by Army Research Office Grant W911NF-05-1-0207, National Science Foundation Grant DMS-0528488, and NSF Career Grant CCF-0545862. References [1] M. Chertkov and V. Y. Chernyak. Loop calculus helps to improve belief propagation and linear programming decodings of low density parity check codes. In Allerton Conf., 2006. [2] M. Chertkov and V. Y. Chernyak. Loop series for discrete statistical models on graphs. J. Stat. Mech., 2006:P06009, June 2006. [3] B. J. Frey and N. Jojic. A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Trans. PAMI, 27(9):1392­1416, Sept. 2005. ´ [4] V. Gomez, J. M. Mooij, and H. J. Kappen. Truncating the loop series expansion for BP. JMLR, 8:1987­ 2016, 2007. [5] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. J. R. Stat. Soc. B, 51(2):271­279, 1989. [6] T. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Comp., 16:2379­2413, 2004. [7] A. T. Ihler, J. W. Fisher, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. JMLR, 6:905­936, 2005. [8] S. Istrail. Statistical mechanics, three­dimensionality and NP­completeness: I. Universality of intractability of the partition function of the Ising model across non­planar lattices. In ACM Symp. Theory Comp., pages 87­96. ACM Press, 2000. [9] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183­233, 1999. [10] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Trans. PAMI, 26(2):147­159, Feb. 2004. [11] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum­product algorithm. IEEE Trans. IT, 47(2):498­519, Feb. 2001. [12] J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of loopy belief propagation. In UAI 21, pages 396­403. AUAI Press, 2005. [13] L. Onsager. Crystal statistics I: A two­dimensional model with an order­disorder transition. Physical Review, 65:117­149, 1944. [14] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufman, San Mateo, 1988. [15] T. J. Richardson and R. L. Urbanke. The capacity of low-density parity-check codes under messagepassing decoding. IEEE Trans. IT, 47(2):599­618, Feb. 2001. [16] S. B. Shlosman. Correlation inequalities and their applications. J. Math. Sci., 15(2):79­101, Jan. 1981. [17] M. F. Tappen and W. T. Freeman. Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In ICCV, volume 2, pages 900­907, 2003. [18] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In ICML. ACM, 2004. [19] S. C. Tatikonda and M. I. Jordan. Loopy belief propagation and Gibbs measures. In UAI 18, pages 493­500. Morgan Kaufmann, 2002. [20] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree­based reparameterization framework for analysis of sum­product and related algorithms. IEEE Trans. IT, 49(5):1120­1146, May 2003. [21] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. IT, 51(7):2313­2335, July 2005. [22] Y. Weiss. Comparing the mean field method and belief propagation for approximate inference in MRFs. In D. Saad and M. Opper, editors, Advanced Mean Field Methods. MIT Press, 2001. [23] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Trans. IT, 51(7):2282­2312, July 2005. 9