Bounds on the Bethe Free Energy for Gaussian Networks

Botond Cseke Faculty of Science Radboud University Nijmegen Toernooiveld 1, 6525 ED Nijmegen, The Netherlands

Tom Heskes Faculty of Science Radboud University Nijmegen Toernooiveld 1, 6525 ED Nijmegen, The Netherlands

Abstract
We address the problem of computing approximate marginals in Gaussian probabilistic models by using mean field and fractional Bethe approximations. As an extension of Welling and Teh (2001), we define the Gaussian fractional Bethe free energy in terms of the moment parameters of the approximate marginals and derive an upper and lower bound for it. We give necessary conditions for the Gaussian fractional Bethe free energies to be bounded from below. It turns out that the bounding condition is the same as the pairwise normalizability condition derived by Malioutov et al. (2006) as a sufficient condition for the convergence of the message passing algorithm. By giving a counterexample, we disprove the conjecture in Welling and Teh (2001): even when the Bethe free energy is not bounded from below, it can possess a local minimum to which the minimization algorithms can converge.

their behavior for the discrete and Gaussian cases. For example, while the error function of the Bethe approximation--also called Bethe free energy--in discrete models is bounded from below, in Gaussian models this is not always the case (see Welling and Teh, 2001). The study of the Bethe free energy of Gaussian models is also motivated by their importance for the study of conditional Gaussian models. Conditional Gaussian or hybrid graphical models, such as switching Kalman filters (e.g., Zoeter and Heskes, 2005), combine both discrete and Gaussian variables. Approximate inference in these models can be carried out by expectation propagation (e.g., Minka, 2004, 2005). Expectation propagation can be viewed as a generalization of the Bethe approximation where marginalization constraints are replaced by expectation constraints (e.g., Heskes et al., 2005). Therefore, studying the properties of the Bethe free energy can reveal some of the convergence properties of expectation propagation. In order to understand the properties of the Bethe free energy of hybrid models a good understanding of the two special cases of discrete and Gaussian models is needed. While the properties of the Bethe free energy of discrete models have been studied extensively in the last decade and are well understood (e.g., Yedidia et al., 2000; Heskes, 2003; Wainwright et al., 2003), the properties of the Gaussian Bethe free energy have been studied much less. The message passing algorithm is a well established method for finding the stationary points of the Bethe free energy (e.g., Pearl, 1988; Yedidia et al., 2000; Heskes, 2003). It works by locally updating the approximate marginals and has been successfully applied in both discrete (e.g., Murphy et al., 1999; Wainwright et al., 2003) and Gaussian models (e.g., Weiss and Freeman, 2001; Rusmevichientong and Roy, 2001; Malioutov et al., 2006). The problem of finding sufficient conditions for the convergence of message passing in Gaussian networks has been successfully addressed

1

Introduction

Calculating marginal probabilities of a set of variables given some observations is one of the ma jor tasks of probabilistic inference. In the case of Gaussian models, the computation of the marginal probabilities has a computational complexity that scales cubically with the number of variables, while for models with discrete variables, it often leads to intractable computations. Computations can be made faster or tractable by using approximate inference methods like mean field approximation (e.g., Jaakkola, 2000) and Bethe approximation (e.g., Yedidia et al., 2000). These methods were developed mainly for discrete probabilistic graphical models, but they are applicable in Gaussian models as well. However, there are important differences in


by many authors. Using the computation tree approach, Weiss and Freeman (2001) proved that message passing converges whenever the information-- inverse covariance--matrix of the probability distribution is diagonally dominant1 . With the help of an analogy between message passing and walk­sum analysis, Malioutov et al. (2006) derived the stronger condition of pairwise normalizability2 . A different approach was taken by Welling and Teh (2001), who directly minimized--with regard to the parameters of approximate marginals--the Bethe free energy, conjecturing that Gaussian message passing converges if and only if the free energy is bounded from below. Their experiments showed that message passing and direct minimization either converge to the same solution or both fail to converge. Following Welling and Teh (2001), instead of analyzing the message passing updates, we turn our attention to the properties of the Bethe free energy expressed as a function of the moment parameters of approximate marginals. We derive a lower bound for the Gaussian fractional Bethe free energies and give necessary conditions for them to be bounded from below.

makes marginals easy to identify. The most common quantity to measure the difference between two probability distributions is the Kullback-Leibler divergence D [q ||p]. It is often used (e.g., Jaakkola, 2000) to characterize the quality of the approximation and formulate the computation of approximate marginals as the optimization problem q d q (x) q (x) = argmin (x) log x. (1) p(x) q F Here, F is the set of distributions with the above mentioned form. Since it is not symmetric, the KullbackLeibler divergence is not a distance, but D [q ||p]  0 for any proper q and p, D [q ||p] = 0 if and only if p = q and it is convex in both q and p. A family F of densities possessing a form that makes marginals easy to identifykis the family of distributions that factorize as q (x) = qk (xk ). In other words, in

2

Background

for any I  R =  with I  R = {1, . . . , N }--here, N is the number of variables in the model. In sparse models, the complexity of computations can be reduced to scale with the number of non-zero elements of JR,R , but computing marginals for several groups of variables can still be costly. Trading correctness for speed, one can opt for computing approximate marginals. A popular method to approximate marginals is approximating p with a distribution q having a form that
1 P The matrix A is diagonally dominant if |Aii | > j =i |Aij | for all i. 2 Following Malioutov et al. (2006), we call a Gaussian distribution pairwise normalizable if it can b e factorized into a prodQ t of normalizable "pair" factors, that is uc p(x1 , . . . , xn ) = ij ij (xi , xj ) such that all ij -s are normalizable.

The probability distribution of a Gaussian undirected probabilistic graphical model--also known as Gaussian Markov random field--is usually defined in terh s of canonic-- parameters, namely m al p(x)  exp T x - 1 xT Jx with J symmetric and 2 positive definite. Such canonical parameterizations often result from Bayesian computations, for example from a model with a Gaussian likelihood and Gaussian prior. The calculation of marginals requires matrix invers(ions, that is they can be computed as p(xI )  , 1 1 1 exp hI - JI ,R J-,R hR )T xI - 2 xT JI ,R J-,R JR,I xI I R R

problem (1) we approximate p with a distribution that has independent variables. An approximation q of this type is called mean field approximation (e.g., Jaakkola, 2000). Writing out in detail the right hand side of (1) one gets l k FM F ({qk }) = - og p(x) qk (xk )dx + k q
k (xk ) log qk (xk )dxk .

2 Using notation qk (xk ) = N (xk |mi , i ), m = T T (m1 , . . . , mN ) and  = (1 , . . . , N ) , this simplifies to -h 1k 1 2 T Jkk k FM F (m, ) = - m - mT Jm - 2 2 k log(k ) + CM F , (2)

D

can easily check that FM F is convex in its variables m m and  and its d inimum is obtained for m = J-1 h and  = 1/ iag (J). Since [J-1 ]kk = 1/(Jkk - J -1 J\k,k ), one can easily see that the JT,\k \k,\k k mean field approximation underestimates variances. Note that the mean field approximation computes a solution in which the means are exact, but the variances are computed as if there were no interactions between the variables, namely as if the matrix J were diagonal, thus giving poor estimates of the variances. In order to improve the estimates for variances, one has to choose approximating distributions q that are

wheke CM F mis an irrelevant constant. r qk ||p

Although

ight not be convex in (q1 , . . . , qN ), one


able to capture dependencies between the variables in p. It can be verified that any distribution in which the dependencies form a tree graph can be written in the form p(x) = n p(xi , xj ) k p(xk ), p(xi )p(xj )

(i,j )

Since FM F is convex and bounded and the Bethe free energy could be unbounded, it seems plausible to analyze the fractional Bethe free energy q n F ({qij , qk }) = - ij (xi,j ) log ij (xi,j )dxi,j
(i,j )

+ +

where i and j run through all the connections or edges n(i, j ) of the tree and k runs through {1, . . . , N }. Although in most cases the undirected graph generated by the connections in J is not a tree, based on the "tree intuition" one can construct q from one and two variable marginals as q (x)  n qij (xi , xj ) k qk (xk ) qi (xi )qj (xj ) (3)

n

(i,j )

k

1 ij q

q

ij (xi,j ) log

qi (xi )qj (xj )

q

ij (xi,j )

d xi,j

k (xk ) log qk (xk )dxk

(5)

(i,j )

and constrain the functions qij and qk to be marginally consistent and normalize to 1, that is q qij (xi , xj )dxj = qi (xi ) for any i and j and k (xk )dxk = 1 for any k . An approximation of the form (3) together with the constraints on qij ­s and qk ­s is called a Bethe approximation. Denoting the family of such functions by FB , by choosing qij (xi , xj ) = qi (xi )qj (xj ) one can easily check that FM F  FB , thus FB is non-empty. Assuming that the approximate marginals are correct and q normalizes to 1 and then substituting (3) into (1), we get an approximation of the Kullback­Leibler divergence in (1) called the Bethe free energy. Since the interactions between the variabln s in p are pairwise, we can face i,j (xi , xj ), and express the torize p as p(x) 
(i,j )

introduced by Wiegerinck and Heskes (2003). Here,  denotes the set of variables {ij }. They showed that the fractional Bethe free energy "interpolates" between the mean field and the Bethe approximation. That is for ij = 1 we get the Bethe free energy, while in the case when all ij ­s tend to 0 the mutual information between variables xi and xj is highly penalized, therefore, (5) enforces solutions close to the mean field solution. They also showed that the fractional message passing algorithm derived from (5) can be interpreted as Pearl's message passing algorithm with the difference that instead of computing local marginals--like in Pearl's algorithm--one computes local ij ­marginals3. The local ij ­marginals correspond to "true" local marginals when ij = 1 and to local mean field approximations when ij = 0. Power expectation propagation by Minka (2004) is an approximate inference method that uses local approximations with ­divergences. It turns out that in case of Gaussian models power expectation propagation-- with a fully factorized approximating distribution-- boils down to the same message passing algorithm as the one derived from (5) and the appropriate constraints. Starting from the idea of creating a convex upper bound of the Bethe free energy when p and q are exponential distributions, Wainwright et al. (2003) derived a form of (5) where the ij ­s are chosen such that it is convex in ({qij }, {qk }). Thus, they derived a form of the fractional Bethe free energy that has a unique global minimum.

Bethe free energy as FB ({qij ,qk }) = - + + n

n

(i,j )

q

ij (xi,j ) log ij (xi,j )dxi,j

(i,j )

q

ij (xi,j ) log

qi (xi )qj (xj )

q

ij (xi,j )

d xi,j (4)

k

q

k (xk ) log qk (xk )dxk .

Yedidia, Freeman, and Weiss (2000) showed that the fixed point iteration for finding the constrained minima of the function in (4), boils down to the message passing algorithm of Pearl (1988). The algorithm is derived from the Karush­Kuhn­Tucker conditions of the constrained minimization. As it was mentioned in the introduction, in case of Gaussian models this algorithm does not always converge, and the reason for this appears to be that the approximate marginals may get indefinite or negative definite covariance matrices. Welling and Teh (2001) pointed out that this can be due to the unboundedness of the Bethe free energy.

3

Main results

In this section we analyze the parametric form of (5). Setting all ij values equal, we show that the fractional Gaussian Bethe free energy is a non-increasing function of . By letting  tend to infinity, we obtain a lower bound for the free energies. It turns out
Here, we define the ­marginals of a distribution p as ­ ť argmin{qk } D p Q qk , where D is the ­divergence.
k 3


that the condition for the lower bound to be bounded from below is the same as the pairwise normalizability condition. Conforming to Malioutov et al. (2006) and without loss of generality, we work with the "normalized" information matrix, that is we use J = I + R where diag(R) = 0. We define |R| as the matrix formed by the absolute values of R's elements. Following Welling and Teh (2001), we use qij (xi,j ) = N (xi,j |mij , ij ) and 2 qk (xk ) = N (xk |mk , k ), where m = (m1 , . . . , mN )T , , 2 T 2 mi,j = (mi , mj ) and ij = i , ij ; ij , j and thus we embed the marginalization and normalization constraints into the parameterization. The ma2 trix formed by diagonal elements k and off-diagonal elements ij is denoted by  and the vector of standard deviations by  = (1 , . . . , N )T . Substituting qij and qk into (5) one gets -h 1 1 T F (m, ) = - m - mT Jm - Tr(JT ) 2 2 - 1 2 ij 1 1n log - 22 2 ij i j (i,j ) k log(k ) + C (6) where C is an irrelevant constant. Note that the variables m and  are independent, hence the minimizations of F (m, ) with regard to m and  can be carried out independently. Prop erty 1. F (m, ) is convex and bounded in (m, {ij }i=j ) and at any stationary point we have m


equal, we get, F1 (m, )  F2 (m, ) for any 1  2 . This leads to the following property. Prop erty 2. With ij = , F is a non-increasing function of .
 Using Property 1 and substituting ij into F we define the constrained function 1k 2 1 c k F (m,  ) = mJ-1 m - hT m + 2 2 -1n 1 1 1 2 2 /2 - -1 + (2ij Rij )2 i j 2 ij (i,j ) - 21 1/2 2 + (2ij Rij )2 i j 2 -1 1n 1 log 2 2 2 2 ij (2ij Rij ) i j (i,j ) k c log(k ) + C (8) c where C is an irrelevant constant. From Property 2, it follows that when choosing ij = , the function in (8) is a non-increasing function of . It then makes sense to take    and verify whether we can get a lower bound for (8).

Lemma For any  > 0, 0  1  1 and 2  1 the fol lowing inequalities hold. FM F (m,  )   FB (m, {ij }, )  ... 
 c F1 (m,  )  FB (m, {ij },  ) c F2 (m,  ) . . . 1 FM F (m,  ) -  T |R| 2

= =

J

-1

h 1
22 + (2ij Rij )2 i j 2ij |Rij |

(7) 1/2 -1 .

Moreover, they are tight, that is
0  lim F (m, {ij ()},  ) = FM F (m,  )

ij 

-sign(Rij )

Proof: By definition J is positive definite, therefore, the quadratic term in m is convex and bounded. The variables m and  are independent and the minimum with regard to m is achieved at m = J-1 h. One can check that the second order derivative of F (m, ) with regard to ij is non-negative and the first order derivative has only one solution when -i j  ij  i j (Welling and Teh, 2001). Since the variables ij are independent, one can conclude that F (m, ) is convex in {ij }. From the independence of m and , it follows that F is convex in Sm, {ij }). ( ince ij is constrained to be a covariance matrix, we 2 22 have ij  i j , thus the first logarithmic term in (6) is negative. As a consequence, by setting all ij ­s

and


1  lim F (m, {ij ()},  ) = FM F (m,  ) -  T |R|. 2

Proof: Since the Bethe free energy is the specific case of the fractional Bethe free energy for  = 1, the in equalities on FB (m, {ij ()},  ) follow from Property 2. Now, we show that the upper and lower bounds are tight. The function (1 + x2 )1/2 - 1 behaves as 1 x2 in 2 the neighborhood of 0, therefore,
0  lim ij () = 0

and log
0

lim

1  2 () i - j2 2
i j

 ij 2 () 1 = - 2 2 lim = 0, i j 0 


showing that FM F (m,  ) is a tight upper bound. As  tends to infinity, we have lim 1
22 + (2Rij )2 i j 2


1/2

-1

= |Rij |i j

and 1 lim log   yielding 1  lim F (m, {ij ()},  ) = FM F (m,  ) -  T |R|. 2 1
2 + (2Rij )2 i j 2 22 (2Rij )2 i j

1/2

-1

=

0,


L et max (|R|) be the largest eigenvalue of |R|. Analyzing the boundedness of the lower bound, we arrive at the following theorem. Theorem For the fractional Bethe free energy in (6) corresponding to a connected Gaussian network, the fol lowing statements hold (1) if max (|R|) < 1, then F is bounded from below for al l  > 0, (2) if max (|R|) > 1, then F is unbounded from below for al l  > 0, (3) if max (|R|) = 1i tn en F is bounded from below h 1 1 if and only if 2 ij  N .
(i,j )

Statement (2): Since we assumed that the Gaussian network is connected and undirected, it follows that |R| is irreducible (e.g., Horn and Johnson, 2005). According to the Frobenius-Perron theory of non-negative matrices (e.g., Horn and Johnson, 2005), the non-negative and irreducible matrix |R| has a simple maximal eigenvalue max (|R|) and all elements of the eigenvector umax corresponding to it are positive. Let us take the fractional Bethe free energy and analyze its behavior when  = tumax and t  . Since for large val1/2 1  ues of t we have + (2ij Rij )2 (ui ax uj ax )2 t4 m m 2ij |Rij |ui ax uj ax t2 , the sum of the second and third m m term in (8) boils down to (1 - max (|R|))t2 and this term dominates over the logarithmic ones as t  . As a result, the limit is independent of the choice of ij and it tends to - whenever max (|R|) > 1. Statement (3): If max (|R|) = 1, then the only direction in which the quadratic term will not dominate is  = tumax . Therefore, we have to analyze the behavior of the logarithmic terms in (8) when1 t  . For lalrge t-s these terms behave in 1 as 2 og(t). For this reason, the ij - N
(i,j ) c boundedness of F --and thus of F --depends on the I condition in statement (3).

t was shown by Malioutov et al. (2006) that the condition max (|R|) < 1 is an equivalent condition of pairwise normalizability. Therefore, pairwise normalizability is not only a sufficient condition for the message passing algorithm to converge, but it is also a necessary condition for the fractional Gaussian Bethe free energies to be bounded. Example In the case of models with K­regular adjacency matrix (non-zero entries of R) and equal interaction weights Rij = r, the maximal eigenvalue of |R| is max (|R|) = K r and the eigenvector corresponding to this eigenvalue is 1. (We define 1 as the vector that has all its elements equal to 1.) Verifying the stationary point conditions, it turns out that for some choice of r and  there exists a local minimum which is symmetrical, that is it lies in the direction 1. One can show that when the model is not pairwise normalizable (K r > 1), the critical r below which the fractional Bethe free ene y possesses this rg (K - ) and for local minimum is rc (K, ) = 1/2 any valid r the critical  below which the fractional Bethe free energies possesses this lo. al minimum is c 1 1 1 2 - 1/(K r) c (K, r) = 2 K - These results

Proof: Since in F there is no interaction between the parameters m and  and the term depending on m is bounded from below due to the positive definiteness of J, we can simply neglect this term when analyzing the boundedness of F . Let us write out in detail the lower bound of the fractional Bethe free energies in the form 1 1 FM F (m,  ) -  T |R| = mT J-1 m - hT m 2 2 k 1 +  T (I - |R|)  - log(k ) + C, (9) 2 Statement (1): The condition max (|R|) < 1 implies that I - |R| is positive definite. Now, log(x)  x - 1, 1 thus 1  T (I - |R|) - 1T log( )  2  T (I - |R|) - 2 T 1  + N . The latter is bounded from below and so it follows that (9) is bounded from below as well. According to the Lemma, boundedness of (9) implies that all fractional Bethe free energies are bounded from below.

are illustrated in Figure 1. (Note that for 2­regular graphs, all valid models are pairwise normalizable and F possess a unique global minimum.) or K­regular graphs convexity of the fractional Bethe


10

6

Bethe free energies

 =  (lower bound) 10
2

Bethe free energies

10

4

=0 (mean field)   [0.01,100] =1 (Bethe) =c

10

6

r=0 (mean field) r  (0,rvalid] 10
4

r=1/K rc=1/2*sqrt(K-1) r=rvalid

10

2

10

0

10

0

10

-2

10

-2

10

-1

10 

0

10

1

10

2

10

-2

10

-2

10

-1

10 

0

10

1

10

2

Figure 1: Visualizing critical parameters for a symmetric K-regular Gaussian model with Rij = r. Plots in c the left panel correspond to the constrained fractional Bethe free energies F in the direction 1 for an 8 node 4­regular Gaussian model with r=0.27 (K r > 1) and varying . Plots in the right panel correspond to the c constrained Bethe free energies F1 in the direction 1 for an 8 node 4­regular Gaussian model with varying r. Here, rvalid is the supremum of r­s for which the model is valid--J is positive definite. free energy in terms of {qij , qk } requires   K , a much stronger condition than   c (K, r). Thus, if we choose  sufficiently large such that the Bethe free energy is guaranteed to have a unique global minimum, this minimum is unbounded. by Welling and Teh (2001) for standard message passing, namely fractional message passing and direct minimization either both converge or both fail to converge. Our experiments in combination with the Theorem show that when max (|R|) > 1 standard message passing at best converges to a local minimum of the Bethe free energy. If standard message passing fails to converge, one can decrease  and search for a stationary point--preferably a local minimum--of the corresponding fractional free energy. It can be seen from the results in the right panels of Figure 1 and 2, that when the model is no longer pairwise normalizable, the local minimum and not the unbounded global minimum is the natural continuation of the (bounded) global minimum for pairwise normalizable models. This explains why the quality of the approximation at the local minimum for models that are not pairwise normalizable is still comparable to that at the global minimum for models that are pairwise normalizable. Note that when the model is pairwise normalizable both the upper and lower bounds are convex in t. This may be obscured by the use of logarithmic a x es .

4

Experiments

We implemented both direct minimization and fractional message passing and analyzed their behavior for different values of max (|R|). For reasons of simplicity we set all ij equal. The results are summarized in Figure 2. Note that there is a good correspondence between the behavior of the fractional Bethe free energies in the direction of the eigenvalue corresponding to max (|R|) and the convergence of the Newton method. The Newton method was started from varying initial points. We experienced that when max (|R|) > 1 and setting the initial value to tumax , the algorithm did not converge for high values of t. This can be explained by the plots in Figure 2: for high values of t the initial point is not in the convergence region of the local minimum. For the fractional message passing algorithm we used two types of initialization: (1) when max (|R|) < 1 we set ij such that they are all normalizable and set initial messages to 1, (2) when max (|R|)  1, we used a symmetric partitioning of p into ij -s --symmetric partitioning of the diagonal elements of J--and set messages to identical values such that in the first step of the algorithm all approximate marginals become normalizable. We experienced a behavior similar to that described

5

Conclusions and future research

As we have seen, FM F and FM F - 1  T |R| provide 2 tight upper and lower bounds for the Gaussian fractional Bethe free energies. It turns out that pairwise normalizability (see Malioutov et al., 2006) is not only a sufficient condition for the message passing algorithm to converge, but it is also a necessary condition for the


10

6

7 6 5 4 3 2 1 0 -1 10
-2

10

4

Error in variances after convergence

Function value after convergence

Mean field   [0.01,100] Bethe (=1) Lower bound

8

Newton method Message passing 10
1

10

2

10

0

10

0

10

-2

10

-1

10

0

10 t

1

10

2

10

3

10 

0

10

2

10

-1

10

-2

10 

0

10

2

10

6

7 6 5 4 3 2 1 0 -1 10
-2

10

4

Error in variances after convergence

Function value after convergence

Mean field   [0.01,100] Bethe (=1) Lower bound

8

Newton method Message passing 10
1

10

2

10

0

10

0

10

-2

10

-1

10

0

10 t

1

10

2

10

3

10 

0

10

2

10

-1

10

-2

10 

0

10

2

Figure 2: Left panels show the constrained fractional Bethe free energies of an 8 node Gaussian network in the direction of the eigenvector corresponding to max (|R|) for max (|R|) = 0.9 (top) and max (|R|) = 1.1 (bottom). The thick lines are the functions FM F (dashed), FB (dashed dotted) and the lower bound FM F - 1  T |R| 2 c (continuous). The thin lines are the constrained -fractional free energies F for   [10-2 , 102 ]. Center panels show the final function values after the convergence of the Newton method. Right panels show the || ˇ ||2 error in approximation for the single node standard deviations  . Missing values indicate non-convergence. Gaussian fractional Bethe free energies to be bounded from below. If the model is pairwise normalizable, then the lower bound is bounded from below, therefore, both direct minimization and fractional message passing are converging for any choice of  > 0. Experiments show that regardless of the initialization, they converge to the same minimum. This suggests that in the pairwise normalizable case, fractional Bethe free energies possess a unique global minimum. If the model is not pairwise normalizable, then none of the fractional Bethe free energies is bounded from below. However, experiments show that there is always a range of  values for which both direct minimization and fractional message passing converge. By decreasing  towards zero, one gets closer to the mean field energy and a local minimum to which the minimization algorithms can converge may appear. As mentioned in Section 2, ij ­s correspond to using local ij divergences when applying power expectation propagation with a fully factorized approximating distribution. Several papers on continuous models (e.g., Minka, 2004; Qi et al., 2005) report that when expectation propagation does not converge, applying power expectation propagation with  < 1 helps to achieve convergence. In the case of the problem addressed in this paper this behavior can be explained by the observation that small ­s make local minima more likely to occur and thus prevents the covariance matrices from becoming indefinite or even non positive definite. Wainwright et al. (2003) propose to convexify the Bethe free energy by choosing ij ­s sufficiently large such that the fractional Bethe free energy has a unique global minimum. This strategy appears to fail for Gaussian models. Convexification makes the possibly useful local minima dissapear, leaving just the unbounded global minimum. In the case of the more general hybrid models the use of the convexification is still unclear. Note that the example in the previous section disproves the conjecture in Welling and Teh (2001): even when the Bethe free energy is not bounded from below, it can possess a local minimum to which the minimization algorithms converge. Our future goals are to find sufficient conditions for the Bethe free energy to possess local minima and to derive algorithms that are guaranteed to find those.


Acknowledgements We would like to thank Jason K. Johnson for suggesting us to try a case study of models with K­regular adjacency matrix and equal interaction weights, and sharing his ideas about the the walk­sum analysis of these models. The authors would like to acknowledge the support from the Vici grant 639.023.604 (second author) and Marie Curie EST program MEST-CT2004-514510 (first author).

pages 269­276. Society for Artificial Intelligence and Statistics, 2005. (Available electronically at http://www.gatsby.ucl.ac.uk/aistats/). P. Rusmevichientong and B. Van Roy. An analysis of belief propagation on the turbo decoding graph with Gaussian densities. IEEE Transactions on Information Theory, 47:745­765, 2001. M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation via pseudo-moment matching. In C. Bishop and B. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intel ligence and Statistics. Society for Artificial Intelligence and Statistics, 2003. Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 13(10):2173­ 2200, 2001. M. Welling and Y. W. Teh. Belief optimization for binary networks: a stable alternative to loopy belief propagation. In J. S. Breese and D. Koller, editors, Proceedings of the 17th Conference in Uncertainty in Artificial Intel ligence, pages 554­561. Morgan Kaufmann Publishers, 2001. W. Wiegerinck and T. Heskes. Fractional belief propagation. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 438­445, Cambridge, MA, 2003. The MIT Press. J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems 12, pages 689­695, Cambridge, MA, 2000. The MIT Press. O. Zoeter and T. Heskes. Change point problems in linear dynamical systems. Journal of Machine Learning Research, 6:1999­2026, 2005.

References
T. Heskes. Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 359­366, Cambridge, MA, 2003. The MIT P r es s . T. Heskes, M. Opper, W. Wiegerinck, O. Winther, and O. Zoeter. Approximate inference techniques with expectation constraints. Journal of Statistical Mechanics: Theory and Experiment, 2005:P11015, 2005. URL http://www.iop.org/EJ/jstat/. R. A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, UK, 2005. T. Jaakkola. Tutorial on variational approximation methods. In M. Opper and D. Saad, editors, Advanced mean field methods: theory and practice, pages 129­160, Cambridge, MA, 2000. The MIT P r es s . D. Malioutov, J. Johnson, and A. Willsky. Walk-sums and belief propagation in Gaussian graphical models. Journal of Machine Learning Research, 7:2031­ 2064, October 2006. T. Minka. Power EP. Technical report, Microsoft Research Ltd., Cambridge, UK, MSR-TR-2004-149, October 2004. T. Minka. Divergence measures and message passing. Technical report, Microsoft Research Ltd., Cambridge, UK, MSR-TR-2005-173, December 2005. K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intel ligence, pages 467­ 475, San Francisco, CA, 1999. Morgan Kaufmann Publishers. J. Pearl. Probabilistic Reasoning in Intel ligent Systems: Networks of Plausible Inference. Morgan Kaufman Publishers, San Mateo, CA, 1988. Y. Qi, M. Szummer, and T. Minka. Bayesian conditional random fields. In Robert G. Cowell and Zoubin Ghahramani, editors, AISTATS05,