Convergent Message-Passing Algorithms for Inference over General Graphs with Convex Free Energies Tamir Hazan Amnon Shashua School of Eng. & Computer Science The Hebrew University of Jerusalem Israel Abstract Inference problems in graphical models can be represented as a constrained optimization of a free energy function. It is known that when the Bethe free energy is used, the fixedpoints of the belief propagation (BP) algorithm correspond to the local minima of the free energy. However BP fails to converge in many cases of interest. Moreover, the Bethe free energy is non-convex for graphical models with cycles thus introducing great difficulty in deriving efficient algorithms for finding local minima of the free energy for general graphs. In this paper we introduce two efficient BP-like algorithms, one sequential and the other parallel, that are guaranteed to converge to the global minimum, for any graph, over the class of energies known as "convex free energies". In addition, we propose an efficient heuristic for setting the parameters of the convex free energy based on the structure of the graph. tor graph until convergence. The belief propagation (BP) algorithm (and its extensions) is popular, and has received the most attention, due to its simplicity and computational efficiency. The BP algorithm is exact, i.e., the resulting marginal distributions are the correct ones, when the factor graph is free of cycles. An intriguing feature of BP, which most likely is the source for its great popularity, is that it often gives surprisingly good approximate results for graphical models with cycles. However, in this context there are no convergence guarantees (except under some special cases (Mooij & Kappen, 2005)) and the algorithm often fails to converge. It is known that the fixed-points of the BP algorithm correspond to local minima of a constrained energy function called the Bethe free energy (Yedidia et al., 2005). The free energy arises from the expansion of the KL-divergence between the input distribution and its product form. The Bethe function replaces the entropy term in the free energy by an approximation which is exact for factor graphs without cycles. In such a case, the Bethe free energy is convex over the set of constraints (representing validity of marginals). When the factor graph has cycles the Bethe energy is nonconvex and although it is possible to derive convergent algorithms to a local minima of the Bethe function (Yuille, 2002) the computational cost is large and thus has not gained popularity. To overcome the difficulty with the non-convexity of the Bethe approximation, several authors have introduced a class of approximations known as convex free energies which are convex over the set of constraints for any factor graph. An important member of this class is the tree-reweighted (TRW) free energy which consist of a linear combination of free energies defined on spanning trees of the factor graph. It is notable that for this specific member of convex free energies a convergent message-passing algorithm has been recently introduced (Globerson & Jaakkola, 2007b). The algorithm is sequential (unlike BP which has both se- 1 Intro duction Probabilistic graphical models present a convenient and popular tool for reasoning about complex distributions. The graphical model reflects the way the complex distribution factors into a product of distributions over a small number of variables (cliques), where the graph represents the incidence between cliques and the variables contained in them -- such a graph is known as a factor graph. The probabilistic inference is represented by a way of calculating marginal distributions (or the most likely assignment of variables) efficiently using the structure of the graph. One of the most popular class of methods for inference over (factor) graphs are message-passing algorithms which pass messages along the edges of the fac- quential and parallel forms) and applies to graphs with pairwise cliques only. However, a convergent message passing algorithm for the general class of convex free energies is still lacking. The existing algorithms either employ damping heuristics to ensure convergence in practice (Wainwright et al., 2005) or focus on a sub-class of free energies where the entropy term is a positive combination of joint entropies (Heskes, 2006). In this paper, we derive convergent message-passing algorithms, one sequential and the other parallel, for the general class of convex free energies. The derivation applies to general factor graphs (cliques of any size) and have a similar architecture to the BP algorithm. The algorithms are based on a general framework for ihandling optimization problems of the type f (b) + hi (b) where f (b) is strictly convex and hi (b) are convex but not necessarily strict nor differentiable. We show that problems of this class have a simple block-update (sequential and parallel) messagepassing solution. We then map the constrained convex free energy problem into this framework. Independently, we propose also a heuristic for setting up the parameters of the convex free energy from the structure of the factor graph. The key idea is to strive for a convex free energy which is as close as possible to Bethe's free energy under a set of constraints governing the class of convex free energies. The underlying motivation is borne by the empirical observation from BP practitioners that when BP does converge, the results are often surprisingly good (Murphy et al., 1999). Since our scheme would always converge and the free energy approximation is close to Bethe's, we would have in some sense a "convergent BP" for general graphs. N (i) stands for all factor nodes that are neighbors of variable node i and N () stands for all variable nodes that are neighbors of factor node . Finally, we limit our treatment to factor graphs where any pair of factors intersect in at most a single variable node, i.e., N () N ( ) is either empty or equal to some i. There is no technical limitation to allow for higher-order intersections but that comes at the price of reducing the clarity of our presentation. In Section 7 we will explain the necessary additions for handling general factor graphs. The typical task we try to perform ix to compute s the marginal distributions P (xi ) = \xi P (x) and x P (x ) = \x P (x). Basically, the computation requires the summation over the states of all the variable nodes not in x (including the case of the singleton xi ). This computation is generally hard because it can require summing up exponentially large number of terms -- thus one seeks efficient ways or approximate solutions for the marginals. Let b(x) stand for the approximate distribution where bi (xi ) approximates P (xi ) and b (xx approximates ) P (x ) with the constraints that b (x ) = \xi bi (xi ) for all N (i). The free energy arises from minimizing the KL-divergence between the approximate distribution b(x) and the un-normalized product form: D(b || (x )) = x E (x )b (x ) - H (b), 2 Terminology and Problem Setup We consider a joint distribution P (x) on a set of discrete variables x = x1 , ..., xn in a finite domain. In a graphical model we suppose that P (x) factors into a product of non-negative functions (clique potentials): P (x) = 1 (x ), Z where E = - ln and H (b) is the entropy of b(x). In other words, the free energy consists of a sum of a linear term over b which is exponential in the size of the cliques and and an entropy term (which is exponential in n). The approximate methods for computing the marginals are based on choosing an approximation to the entropy term H (b). The Bethe free energy approximates H (b) by i H (b ) + (1 - di )H (bi ) where di is the degree of x the variable nodexi, H (b ) = - b (x ) ln b (x ) and H (bi ) = - b(xi ) ln b(xi ). As a result, the comi putational complexity of the Bethe free energy is exponential only in the size of the cliques. The Bethe free energy is exact (equal to free energy) when the factor graph has no cycles and in that case the energy is strictly convex over the set of constraints mentioned above. When the factor graph has cycles the Bethe free energy is non-convex. Another notable property is that the fixed-points of the BP algorithm correspond to local minima of the Bethe free energy minimization over the constraints on b (Yedidia et al., 2005). The Bethe approximation of the entropy H (b) can be where is an index labeling m functions 1 , ..., m and the function (x ) has arguments x that are some subset of {x1 , ..., xn } and Z is a normalization constant. The factorization structure above is conveniently represented by a factor graph (Kschischang et al., 2001) which is a bipartite graph with variable nodes one for each variable xi and a factor node for each function . An edge connects a variable node i with factor node if and only if xi x , i.e., xi is an argument of . We adopt the terminology where written in a more general form as: i c H (b ) + ¯ ci H (bi ), ¯ (1) b1 i,1 µ1, h1 hi where ci = 1 - ¯ ¯ N (i) c . Thus when the co efficients c = 1 for all factor nodes we obtain the Bethe ¯ approximation. A convex free energy is based on a result of (Heskes, 2004) who derived sufficient conditions for an entropy approximation to be convex over the set of constraints. In the setting we have described, those conditions have the following form (Weiss et al., 2007): Definition: An approximate entropy term of the forx eqn. 1 is strictly convex over the set of constraints m b (x ) = bi (xi ) for all N (i) if there exists \xi i ci , ci 0 and c > 0 such that c = c + N () ci ¯ and ci = ci - ¯ N (i) ci . The approximate entropy becomes: i i ci (H (b ) - H (bi )) + c H (b ) + ci H (bi ) ,N (i) b µ n, i,m hn bm Figure 1: Message-passing architecture of Algorithms 1,2 however, have a high computational cost and their architecture (the update flow of parameters) is far from similar to a message passing architecture and to BP in particular. In this section we will take a detour and develop a message passing framework (sequential and parallel versions) to a particular sub-class of convex problems. Later in Section 4 we will map the convex free energy minimization of eqn. 2 into this framework. Consider the class of problems min f (b) + b Taken together, the convex free energy constrained optimization problem is: x E (x )b (x ) - min c H (b ) (2) b ,bi - i ci H (bi ) + i ,N (i) ci (H (bi ) - H (b )) sub ject to x b (x ) = bi (xi ) N (i) \xi in =1 hi (b), x b (x ) = 1 b , b i 0 We denote the criterion function (convex free energy) by Fcon (b , bi ) and note that it is strictly convex over the set of constraints provided that ci , ci 0 and c > 0. For now we assume that the parameters ci , ci , c are given as input and set out to derive a message-passing algorithm (two versions, one sequential and the other parallel) which is guaranteed to converge to the global minimum for any factor graph. Later in Section 5 we will introduce an algorithm for determining the convex free energy parameters from the factor graph. where b Rm , f (b) is a strictly convex real valued function and hi (b) are convex (not necessarily strict nor differentiable), and proper (i.e., can take the value for some values of b). This class of problems includes in particular the classical convex program: minb f (b) over the constraints b C1 · · · Cn where Ci are convex sets1 . For this class of problems we derive two messagepassing algorithms, one sequential and the other parallel. Both algorithms are based on a block update regime using convex duality. The sequential algorithm is described below: Algorithm 1 (Sequential Message-Passing) Let i and µi , i = 1, ..., n be vectors in Rm . Set i = 0. 1. For t = 1, 2, ... 2. For i = 1, ...n: j (a) µi =i j 1 In this case hi (b) = Ci (b) is the indicator function Ci (b) = 0 if b Ci and Ci (b) = if b Ci . 3 A General Framework for Sequential and Parallel Message Passing Algorithms The constrained minimization of eqn. 2 can be handled within the body of convex programming tools. Those, (b) b argmin h i (b) + f (b) + b bdomain(hi ) (c) i -µi - f (b ). Output b . µ i . 4 Convergent Message-Passing Algorithms for Convex Free Energies The vectors i and µi are messages passed along edges of a bipartite graph with n (function) nodes corresponding to the n functions hi (b) and m (variable) nodes corresponding to the dimension of b. Function node i sends the m coordinates of vector i to the m variable nodes. Variable node j sends the j 'th coordinate of vectors µ1 , ..., µn to the n functions nodes (see Fig. 1). The algorithm is "sequential" in the sense that it is crucial to move sequentially over the index i = 1, ..., n, thus the network proceeds in a node-afternode update policy. For those familiar with successive pro jection schemes, in the particular case when hi (b) = Ci (b) (the indicator function of convex set Ci ), the update step for b is a "Bregman" pro jection (Bregman et al., 1999) of the vector µi onto the convex set Ci . In that case, following some algebraic manipulations (such as eliminating µi among other manipulations) the scheme reduces to the well known Dykstra (Dykstra, 1983) (also goes under different names such as Hildreth, Bregman, Csiszar, Han, Tseng) successive pro jection algorithm which has its origins in the work of Von-Neumann (von Neumann, 1950). We introduce next a paral lel message-passing algorithm: Algorithm 2 (Parallel Message-Passing) Let i and µi , i = 1, ..., n be vectors in Rm . Set µi = 0. 1. For t = 1, 2, ... 2. For i = 1, ...n in paral lel 1 b argmin f (b) + hi (b) + b bdomain(hi ) n 1 f (b ) i -µi - n 3. For i = 1, ...n in paral lel k 1j µi -i + j n =1 We are ready to derive a convergent algorithm for the constrained convex free energy minimization problem (eqn. 2) using the two algorithms above. It is impori tant to note that in the framework of f (b) + hi (b) the function f (b) is strictly convex in the entire domain whereas the convex free energy Fcon (b , bi ) is strictly convex over the set of constraints. We need therefore both to map Fcon (b , bi ) onto the framework i of f (b) + hi (b) and handle the convexity over the domain of definition issue. x Let b (xi ) stand for b (x ) (a notation also \xi used by (Heskes, 2006)). The marginal constraints dictate that b (xi ) = b (xi ) for all , N (i). We substitute b (xi ) for bi (xii ) in eqn. 2 and move terms around to fit the f (b) + hi (b) framework. The result is summarized below: Let b = (b1 , ..., bm ), that is we dropped bi (xi ) from the process. Let f (b) be defined as follows: x ( f (b) = E (x )b (x ) - c H (b ) 3) = f (b ) µ i Let hi (b), i = 1, ..., n, corresponding to the n variable nodes of the factor graph, be defined below: ( , N (i) : b (xi ) = b (xi ) hi (b) = otherwise N (i) hi (b ) 4) where hi (b ) is defined as follows: x hi (b ) = (ci /|N (i)|-ci ) b (xi ) ln b (xi )-ci H (b ) i (5) Taken together, the constrained convex free-energy minimization (eqn. 2) becomes: min b=(b1 ,...,bm ) f (b)+ in =1 hi (b) s.t. b 0, x b (x ) = 1, Output b . The description of the messages and the network architecture are the same as in the sequential algorithm but here all the function nodes update and send their messages in parallel to the variable nodes. Once all the variable nodes have received their messages they compute their update and send their message in parallel to the function nodes. The derivation of both algorithms is presented in the Appendix. where now f (b) is strictly convex over its entire domain of definition and hi are proper convex functions. Note that f (b) and hi (b) each deco pose into a m sum of simpler functions, i.e., f (b) = f (b ) and hi (b) = N (i) hi (b ). The decomp osition translates into the messages being sparse. For example, the update step of b in both message-passing algorithms becomes: b,N (i) = argmin (f (b ) + hi (b ) + b µi ) i bdom(hi ) N (i) 1. Set nj (x ) = 1 for all i = 1, ..., n, N (i) and x . 2. For t = 1, 2, ... 3. For i = 1, ...n: m i (xi ) = b (xi ) z \xi 1/ci ^ j N ( )\i (z ) c ^ /c ^ nj (z ) N (i) mii i (xi ), ni (x ) = (x ) j N ( )\i c - ci ^ i b nj (x ) (xi ) c m i (xi ) Figure 2: Sequential message-passing algorithm for convex free energy. The constants ci and ci are defined as ^ ^ ci = ci + ^ c and ci = c + ci . ^ N (i) where bi,N (i) are the entries b in b corresponding to factor nodes neighboring to variable node i. Likewise µi are the portions in µi corresponding to factor nodes neighboring to variable node i. The domain of hi are the constraints : x b (x ) = 1, N (i) from c and ci one can solve for ci , ci 0 and c > 0 ¯ ¯ by means of linear satisfaction from the equations: i c = c + ¯ ci , ci = ci - ¯ ci . (6) N ( N (i) b (xi ) = b (xi ), , N (i) We omit the remainder of the derivation as it is long, tedious and mechanical and arrive to the final description of the two message-passing algorithms presented in Fig. 2 and Fig. 3. Like BP, the algorithms send messages between variable nodes and factor nodes where ni (x ) represents the message from variable node i to factor node and m i (x ) is the message from factor node to variable node i. In this section we propose a heuristic for setting the parameters based on the following idea2 . Given the equations (eqn. 6) connecting the parameters to c ¯ and ci , the space of admissible solutions must satisfy ¯ the following equations: j ci + (c + cj ) = 1, i = 1, ..., n N (i) N ()\i ci , ci 0, c > 0. Among all possible admissible solutions we choose the one in which c is as uniform as possible, i.e., we ap¯ ply Laplace's principle of insufficient reasoning. The criterion function, therefore, minimizes: i min (c + ci - 1)2 , (7) ci ,ci ,c admissible N () 5 Fitting a Convex Free Energy to a Graphical Mo del The convex free energy contains three sets of parameters ci , ci 0 and c > 0 one parameter per each node and edge in the factor graph. Our discussion so far was general in the sense that we presented algorithms for handling the family of convex free energies without regard as to how those parameters are determined. The only setting of parameters proposed to date is the tree-reweighted (TRW) free energy where c can be set analytically. This has a simple form when ¯ the cliques are of size 2, i.e., representing pairs of variables. In that case, c is the number of spanning trees ¯ containing divided by the total number of spanning trees (a computation which can be done nalytically). a Once c is determined then ci = 1 - ¯ ¯ ¯ N (i) c and which is a least-squares criteria for uniformity of c . Alternatively, we also used the maximum en¯ tropy approach where the criterion function minimizes c ln c . In both cases we used standard solvers to ¯ ¯ recover ci , ci , c , i.e., we did not attempt to devise specifically tailored solvers for those problems. The desire towards uniformity, besides being used extensively in probabilistic settings, is motivated by the success of the Bethe free energy where c = 1. The ¯ Bethe free energy is non-convex for factor graphs with 2 A similar idea was independently derived by Nir Friedman and his collaborators -- personal communication. cycles, thus is not a member of the convex free energies, but empirical evidence suggest that when BP converges the marginals are surprisingly good. For Bethe free energy c = 1 over all factor nodes -- hence ¯ our proposal to strive for uniformity over the space of admissible solutions. In some sense we are attempting to "convexify" the Bethe free energy, although this is not being done directly. 6 Exp eriments We applied our (parallel and sequential) message passing algorithm using the heuristic (eqn. 7) for setting the parameters of the convex free energy from the input graph to an Ising model on a two dimensional 8 × 8 grid. The distribution has the form P p(x) e ijE ij xi xj +i xi , where ij , i are parameters, xi {±1}, and E are edges of the 2D grid. Following (Globerson & Jaakkola, 2007a), the parameters i were drawn uniformly from U [-df , df ] where df {0.05, 1}. The parameters ij were drawn from U [-do , do ] or U [0, do ] to obtain mixed or attractive interaction potential respectively. The interaction levels were do {0.2, 0.4, ..., 4}. In addition to BP3 , the following algorithms were used to estimate the marginals of the distribution: · CCCP algorithm (Yuille, 2002) for obtaining a local minima of the constrained Bethe free energy. Our message-passing algorithm runs on a "covexified" version of the Bethe free-energy and achieves its global minimum. The CCCP, on the other hand, runs on the (non-convex) Bethe energy and finds a local minima. · Sequential MP where the convex free energy parameters determined using the convex-L2 free energy described in Eqn.7. · The same as above but the parameters were set using maximum entropy (instead of L2 ), we call convex-H . · The TRW method of (Wainwright et al., 2005) with uniform distributions over the trees. Since the TRW free energy belongs to the class of convex free energies we ran our sequential MP algorithm on the parameters ci , ci , c determined by TRW. For each setting of the parameters and each algorithm we calculated the mean L1 error in the marginals i (alg) 1 |p (xi = 1) - p(true) (xi = 1)|. The accuracy n 3 We used the inference package by Talya Meltzer available at http://www.cs.huji.ac.il/talyam/. Figure 4: Comparison of error in marginals estimation on an Ising model on a two dimensional 8 × 8 grid. The models presented include BP (when converged), TRW, CCCP, convex-L2 and convex-H . Mean is shown for 10 random trials. 1. For t = 1, 2, ... 2. For i = 1, ...n in parallel m i (xi ) = z \xi j (z ) N ( ) nj (z )1/n 1/ci ^ ni (z ) 3. For i = 1, ...n in parallel (a) For every xi : b (xi ) for every N (i) and every x \ xi : ni (x ) = n (x ) j i 1/n N ( ) nj (x )) (z ) N (i) mii i (xi ) c ^ /c ^ j N ( ) nj (x )1/n c /nci ^ b (xi ) c /n ( (x ) ni (x ) m i (xi ) Figure 3: P rallel message-passing algorithm for convex free energy. The constants ci and ci are defined as a ^ ^ ci = ci + ^ c and ci = c + ci . ^ N (i) results are shown in Fig. 4. The displays are arranged into four cases: Field=0.05, 1 and Mixed versus Attractive interactions. In three out of the four cases, the performance of the three convex free energies models are roughly the same. In the case [Field=1, Attractive], TRW and convex-L2 produce roughly the same marginal approximations and convex-H is worse. BP does not converge in the Mixed cases for high Interaction value of do ; CCCP produces comparable results in the two cases when Field=0.05, and produces the best results for [Field=1, Mixed]. To conclude so far, the two convex free energy settings produce comparable results to TRW on all cases and are sometimes better and sometimes worse than BP (when converges). Among the two setting of the convex free energy, convex-L2 consistently produces better approximations than convex-H . It is interesting that CCCP often produces good approximations, however this comes at a costly run-time tradeoff. Fig. 5 compares the running time of our (sequential) MP algorithm with a general convex solver performing conditional gradient descent on the primal energy function (Bertsekas et al., 2003) which uses linear programming to find feasible search directions, and to the CCCP algorithm. We ran all three algorithms on n × n grids where n = 2, 3, ..., 10. The stopping criteria for all algorithms was the same and based on a primal energy difference of 10-5 . For a 10 × 10 grid, for instance, the general convex solver was slower by a factor of 20 (e.g., 306 seconds compared to 15.2) and the CCCP was slower by a factor of 115 compared to our MP algorithm (running 1767 seconds). For a 2 × 2 grid, on the other hand, our MP algorithm took 0.15 Figure 5: Run-time (in seconds) comparisons of our message-passing algorithm against a conditional gradient descent solver (running on convex-L2 free energy) and against CCCP for the (non-convex) Bethe energy. Al l three algorithms were applied to n × n grids with n = 2, 3..., 10. Mean is shown for 10 random trials. seconds compared to 0.59 for CCCP and 1.41 seconds for the general convex solver. Our next experiment is conducted on random graphs to analyze the differences between BP, CCCP, TRW, convex-H and convex-L2 . To generate a random graph we used the probability space of G(n, p) over graphs with n vertices - where each edge is present with probability p and absent with probability 1 - p, independently among edges. Note that in G(n, 1 ) all the 2 graphs have the same probability, i.e., G(n, 1 ) is the 2 probability space consisting of all n-vertex graphs under the uniform distribution. Figure 6: Comparison of error in marginals estimation for random graphs with edge density p = 0.3, 0.5, 0.7 (left to right) and for local field value of 0.05. For high field value (df = 1, not displayed here) convex-L2 , convex-H and TRW produce similar results. Mean is shown for 10 random trials. In Fig. 6 we observe that the typical graph behavior is mainly dependent on the edge-density of the random graph as well as on the interaction levels. In order to compute the exact marginals we chose 10 vertices. A random graph with edge density of p = 0.3 is "almost" a tree and we see that the TRW is slightly inferior to convex-L2 and better than convex-H. For intermediate edge-density p = 0.5 the TRW and convex-H are comparable and both are inferior to convex-L2 . When edge-density is high, i.e., p = 0.7 the graph is far from a tree and the convex-L2 as well as convex-H are better than TRW. Note that in the Mixed case all the convex algorithms produce comparable results. In those cases BP usually does not converge. Nevertheless, in other cases (except p = 0.3, Attractive) BP produces good marginal approximation (consistent with empirical observations found in the literature) and that convex-L2 is not far behind. It is interesting to note that, unlike the 8 × 8 grid, the CCCP is in most situations inferior to the convex free energy models. treatment by deriving both sequential and parallel convergent message-passing algorithms which have similar form to BP. The algorithms are based on a general message-passing architecture idesigned for a class of problems of the type f (b) + hi (b) with f (b) being strictly convex and hi being convex, continuous and proper. We have shown the basic steps of fitting the constrained convex free energy problem into this framework. We limited the discussions to factor graphs where the neighborhoods of every pair of factor nodes have at most a single intersection. This limitation can easily be removed by replacing the term ci (H (b ) - H (bi )) with the term c, (H (b ) - H (b )) for every pair of factor nodes. This replacement propagates mechanically into subsequent steps of the derivation -- as would be found in a more detailed follow-up of this paper. As for the second issue, we have proposed a heuristic principle where among all admissible parameters we choose the one most closest to the Bethe free energy (using Laplace principle of insufficient reasoning). Empirical results show that for certain graphs, like a grid, we obtain very close marginal results to those obtained by the TRW free energy. For random graphs we obtain a very different free energy from TRW and superior accuracy of marginal estimation. The results suggest that our heuristic for setting up the convex free energy satisfies what we were after, i.e., to get approximations similar to BP but in guaranteed (globally) convergent framework. Future work is required for obtaining a firmer theoretical understanding about the applicability of our heuristic and relation to TRW 7 Summary and Discussion The convex free energies provide a way for obtaining approximate inference over general graphs. There are two main issues in this regard: the first is how to obtain a guaranteed globally convergent message-passing algorithm for the general class of convex free energies, and secondly, how to tune the energy parameters ci , ci , c to a specific graph? As for the first issue, we have provided a complete free energy in particular. References Bertsekas, D., Nedi´, A., & Ozdaglar, A. (2003). Convex c analysis and optimization. Athena Scientific Belmont, Mass. Bregman, L., Censor, Y., & Reich, S. (1999). Dykstras algorithm as the nonlinear extension of Bregmans optimization method. Journal of Convex Analysis, 6, 319­ 333. Dykstra, R. (1983). An Algorithm for Restricted Least Squares Regression. Journal of the American Statistical Association, 78, 837­842. Globerson, A., & Jaakkola, T. (2007a). Approximate inference using conditional entropy decompositions. Globerson, A., & Jaakkola, T. (2007b). Convergent Propagation Algorithms via Oriented Trees. Uncertainty in Artificial Intel ligence (UAI 2007). Heskes, T. (2004). On the Uniqueness of Loopy Belief Propagation Fixed Points. Neural Computation, 16, 2379­ 2413. Heskes, T. (2006). Convexity Arguments for Efficient Minimization of the Bethe and Kikuchi Free Energies. Journal of Artificial Intel ligence Research, 26, 153­190. Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47, 498­519. Mooij, J.M., & Kappen, H.J. (2005). Sufficient conditions for convergence of loopy belief propagation. Uncertainty in Artificial Intel ligence (UAI 2005). Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. Proceedings of Uncertainty in AI, 467­475. Rockafellar, R. (1970). Convex Analysis. Princeton University Press. Tseng, P. (1993). Dual coordinate ascent methods for nonstrictly convex minimization. Mathematical Programming, 59, 231­247. von Neumann, J. (1950). Functional Operators, Vol. II: The Geometry of Orthogonal Spaces. Annals of Math. Studies, 22. Wainwright, M., Jaakkola, T., & Willsky, A. (2005). A new class of upper bounds on the log partition function. Information Theory, IEEE Transactions on, 51, 2313­ 2335. Weiss, Y., Yanover, C., & Meltzer, T. (2007). MAP Estimation, Linear Programming and Belief Propagation with Convex Free Energies. Uncertainty in Artificial Intel ligence (UAI 2007). Yedidia, J., Freeman, W., & Weiss, Y. (2005). Constructing free-energy approximations and generalized belief propagation algorithms. Information Theory, IEEE Transactions on, 51, 2282­2312. Yuille, A. (2002). CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies: Convergent Alternatives to Belief Propagation. Neural Computation, 14, 1691­1722. A Sequential and Parallel Bilo ck Up dates for minb f (b) + hi (b) Recall the f (b) is a strictly convex real-valued function and the functions hi are convex, proper and continuous. We quote below two basic theorems from convex duality (cf. (Bertsekas et al., 2003)) which we will use as building blocks for our algorithms. Theorem 1 Basic Fenchel Duality I Let g (b) be a convex and differentiable function and let h(b) be a proper conbex and continuous function, v b and let h () = maxb - h(b) e its conjugate dual function. Consider the primal and dual programs: Primal: min g (b) + h(b) b m g Dual: max in (b) + b b t - h () hen there is no duality gap and the optimal primaldual pair b , satisfies g(b ) = - . The next theorem is a generalized version of the one above: Theorem 2 Basic Fenchel Duality I I Let f (b) be a strictly convex and differentiable function and let hi (b) be proper convex and continuous b b functions, and let h () = maxb - hi (b) e i their conjugate dual functions. Consider the primal and dual programs: Primal: min f (b) + b ( in =1 hi (b) ! X n i=1 Dual: 1 ,...,n max min f (b) + b b i - n X i=1 ) h (i ) i then there is no duality gap, and the opnimal primalt dual pair b , satisfies f (b ) = - i=1 . i i A.1 The Sequential Blo ck Up date Algorithm Since f (b) fis strictly convex, ithen its conjugate n dual minb (b) + b s differentiable (see i=1 i (Rockafellar, 1970)). In this case a block dual ascent optimization scheme converges to the global maxima (Tseng, 1993). Our algorithm alternates over 1 , ..., n by optimizing i while fixing j for j = i. j Let µi = =i i and define the following dual algorithmic building block: ( m f - h (i ) 8) max in (b) + b µi + b i i i b To recover i one can use Theorem 1: Set g (b) f (b) + b µi and h(b) hi (b), and solve the primal program: F f b = argmin (b) + b µi + hi (b) bdomain(hi ) rom the Lagrange optimality condition of Theorem 1 we recover i = -µi - i f (b ) n ote that whenever i=1 µi = 0 the dual function attains the value q () = -. Since we seek to maximize the dual fn nction we need to optimize µi in its u domain, i.e. i=1 µi = 0. n1 The function i=1 n f (bi ) is strictly convex therefore its conjugate dual is differentiable (see (Rockafellar, 1970)). In this case a block dual ascent optimization scheme converges to the global maxima (Tseng, 1993). Our algorithm alternates through optimizing i (in parallel) while fixing µi followed by optimizing µi (by a closed form solution) while fixing i . We formulate our dual algorithmic building block with respect to 1 i using Theorem 1: Set g (b) n f (b) + b µi and h(b) hi (b), and solve the primal program: F 1 f (b) + b µi + hi (b) b = argmin n bdomain(hi ) rom Lagrange optimality condition in Theorem 1 we recover i 1 = -µi - f (b ) (9) i n We turn to find the closed-form solution for optimizing µ1 , ..., µn while fixing n 1 , ..., n using Theorem 1: 1 Set g (b1 , ..., bn ) n i=1 (f (bi ) + bi i ) and set h(b1 , ..., bn ) to be the indicator function that attains the value zero if b1 = · · · = bn and infinity otherwise. The conjugate function of h(b1 , ..., bn ) is the indicator function h (µ1 , ..., µn ) whose value is zero if n i=1 µi = 0 and otherwise. The primal program: n c 1 i f (bi ) + bi i argmin n b1 ,...,bn domain(h) =1 an be further simplified by taking into account the domain of h(b1 , ..., bn ), i.e. restricting all the bi to equal some vector b Rn : S f i argmin (b) + b n i bRn =1 ince f (b) is real-valued function and the optimization is unconstnained the optimal vector b satisfies r f (b ) = - i=1 i . Theorem 1 asserts the Lagrange multipliers = µ , ..., µ equals the gradient of g (b , ..., b ), or equivn 1 n 1 1 p alently µ = - n f (b ) - i . In the n receding parai i graph we argued that f (b ) = - i=1 i so we derive the update rule for µ i µ = -i + i N Taken together, one obtains Algorithm 2 described in Section 3. n 1i i n =1 Taken together, one obtains Algorithm 1 described in Section 3. A.2 The Parallel Blo ck Up date Algorithm We begin by stating and proving the following theorem: Theorem 3 Let f (b) be a strictly convex and differentiable function and let hi (b) be proper convex and continuous fb nctions, and let h () = u i b maxb - hi (b) e their conjugate dual functions. The fol lowing is a primal/dual pair with no duality-gap: (P ) min f (b) + b (D ) max 1 Pn , ..., n i=1 µi = 0 in hi (b) =1 (n ,, ,, X 1 f ( b) + b min n b i=1 ( « i + µi ) - hi (i ) «) Furthermore, the optimal primal-dual pair b , sati n isfies f (b ) = - i=1 . i Pro of:n we introduce an s quivalent primal funce 1 tion i=1 n f (bi ) + hi (yi ) ub ject to the constraints b = bi and bi = yi for every i. The Lagrangian L(b, yi , i , µi ) and the Lagrange dual function q (i , µi ) = minb,bi ,y L() take the form: i L(·) = n X,,1 i=1 n « f (bi ) + hi (yi ) + µi (bi - b) + i (bi - yi ) (10) q () = in =1 m in b 1 n f (b) + b ( - i + µi ) h (i ) i