Message-passing for Graph-structured Linear Programs: Proximal Projections, Convergence and Rounding Schemes

Pradeep Ravikumar pradeepr@stat.berkeley.edu
Alekh Agarwal alekh@eecs.berkeley.edu
Martin J. Wainwright wainwrig@stat.berkeley.edu
Department of Statistics and Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

Abstract

A large body of past work has focused on the first-order tree-based LP relaxation for the MAP problem in Markov random fields. This paper develops a family of super-linearly convergent LP solvers based on proximal minimization schemes using Bregman divergences that exploit the underlying graphical structure, and so scale well to large problems. All of our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. The inner-loop updates are distributed and respect the graph structure, and thus can be cast as message-passing algorithms. We establish various convergence guarantees for our algorithms, illustrate their performance, and also present rounding schemes with provable optimality guarantees.

1. Introduction

A key computational challenge associated with discrete Markov random fields (MRFs) is the problem of maximum a posteriori (MAP) estimation: computing the most probable configuration(s). For general graphs, this MAP problem includes a large number of classical NP-complete problems, including MAX-CUT, independent set, and satisfiability, among various others. This intractability motivates the development and analysis of methods for obtaining approximate solutions.

The ordinary max-product algorithm is a form of non-serial dynamic programming, exact for trees, and also widely used as a heuristic for obtaining approximate solutions to the MAP problem, but it suffers from convergence failures, and despite some local optimality results (Freeman & Weiss, 2001), it has no general correctness guarantees. For certain MRFs arising in computer vision, Boykov et al. (2001) studied graph-cut based search algorithms that compute a local maximum over two classes of moves. A related class of methods are those based on various types of convex relaxations, in which the discrete MAP problem is relaxed to some type of convex optimization problem over continuous variables. Examples include linear programming (LP) relaxations (Wainwright et al., 2005; Chekuri et al., 2005), as well as quadratic, semidefinite and other conic programming relaxations (Ravikumar & Lafferty, 2006; Kumar et al., 2006; Wainwright & Jordan, 2003).

Among convex relaxations, the LP relaxation is the least computationally expensive and the best understood. The primary focus of this paper is a well-known tree-based LP relaxation (Chekuri et al., 2005; Wainwright et al., 2005) of the MAP estimation problem for pairwise Markov random fields, based on optimizing over a set of locally consistent pseudomarginals on the edges and vertices of the graph. In principle, this LP relaxation can be solved by any standard solver, including simplex or interior-point methods (Bertsimas & Tsitsiklis, 1997). However, such generic methods fail to exploit the graph-structured nature of the LP, and hence do not scale favorably to large-scale problems. Wainwright et al. (2005) established a connection between this tree-based LP relaxation and the class of tree-reweighted max-product (TRW-MP) algorithms, showing that suitable TRW-MP fixed points specify optimal solutions to the LP relaxation. Subsequent work has extended this basic connection in various interesting ways.
For instance, Kolmogorov (2005) developed a serial form of TRW-MP with certain convergence properties, but, as with the ordinary TRW-MP updates, no guarantees of LP optimality. Weiss et al. (2007) established connections between convex forms of sum-product and the exactness of reweighted max-product algorithms. Globerson and Jaakkola (2007) developed a convergent dual-ascent algorithm, but its fixed points are guaranteed to be LP-optimal only for binary problems, as is also the case for the TRW-MP algorithm (Kolmogorov & Wainwright, 2005), and its rate of convergence is not analyzed. Other authors (Komodakis et al., 2007; Feldman et al., 2002) have proposed subgradient methods, but such methods typically have sub-linear convergence rates.

The goal of this paper is to develop and analyze various classes of message-passing algorithms that always solve the LP, and are provably convergent with at least a geometric rate. The methods that we develop are flexible, in that new constraints can be incorporated in a relatively seamless manner, with new messages introduced to enforce them. All of the algorithms in this paper are based on the notion of proximal minimization: instead of directly solving the original linear program itself, we solve a sequence of so-called proximal problems, with the property that the sequence of associated solutions is guaranteed to converge to the LP solution. We describe different classes of algorithms, based on different choices of the proximal function: quadratic, entropic, and reweighted Bethe entropies. For all choices, we show how the intermediate proximal problems can be solved by message-passing updates on the graph, guaranteed to converge but with a distributed nature that scales favorably. An additional desirable feature, given the wide variety of lifting methods for further constraining LP relaxations (Wainwright & Jordan, 2003), is that additional constraints are easily incorporated within the framework.

2. Background

We begin by introducing some background on Markov random fields, and the LP relaxations that are the focus of this paper. Given a discrete space $\mathcal{X} = \{0, 1, 2, \ldots, m-1\}$, let $X = (X_1, \ldots, X_p) \in \mathcal{X}^p$ denote a $p$-dimensional discrete random vector. We assume that its distribution $\mathbb{P}$ is a Markov random field, meaning that it factors according to the structure of an undirected graph $G = (V, E)$, with each variable $X_s$ associated with one node $s \in V$, in the following way. Letting $\theta_s : \mathcal{X} \rightarrow \mathbb{R}$ and $\theta_{st} : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ be singleton and edgewise potential functions respectively, we assume that the distribution takes the form
$$\mathbb{P}(x; \theta) \;\propto\; \exp\Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}.$$

The problem of maximum a posteriori (MAP) estimation is to compute a configuration with maximum probability, i.e., an element
$$x^\ast \;\in\; \arg\max_{x \in \mathcal{X}^p} \Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big\}. \qquad (1)$$
This problem is an integer program, since it involves optimizing over the discrete space $\mathcal{X}^p$. The functions $\theta_s(\cdot)$ and $\theta_{st}(\cdot)$ can always be represented in the form
$$\theta_s(x_s) = \sum_{j \in \mathcal{X}} \theta_{s;j} \, \mathbb{I}[x_s = j], \qquad (2a)$$
$$\theta_{st}(x_s, x_t) = \sum_{j,k \in \mathcal{X}} \theta_{st;jk} \, \mathbb{I}[x_s = j; \, x_t = k], \qquad (2b)$$
where the $m$-vectors $\{\theta_{s;j}, \, j \in \mathcal{X}\}$ and $m \times m$ matrices $\{\theta_{st;jk}, \, (j,k) \in \mathcal{X} \times \mathcal{X}\}$ parameterize the problem.

The basic linear programming (LP) relaxation of this problem is based on a set of pseudomarginals $\mu_s$ and $\mu_{st}$, associated with the nodes and edges of the graph. These pseudomarginals are constrained to be non-negative, as well as to normalize and be locally consistent in the following sense:
$$\sum_{x_s} \mu_s(x_s) = 1, \quad \text{for all } s \in V, \qquad (3a)$$
$$\sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s), \quad \text{for all } (s,t) \in E. \qquad (3b)$$
The polytope defined in this way is denoted LOCAL($G$), or $\mathbb{L}(G)$ for short. The LP relaxation is based on maximizing the linear function
$$\sum_{s \in V} \sum_{x_s} \theta_s(x_s) \, \mu_s(x_s) \;+\; \sum_{(s,t) \in E} \sum_{x_s, x_t} \theta_{st}(x_s, x_t) \, \mu_{st}(x_s, x_t),$$
subject to the constraint $\mu \in \mathbb{L}(G)$. In the sequel, we write this LP more compactly in the form $\max_{\mu \in \mathbb{L}(G)} \theta^T \mu$. By construction, this relaxation is guaranteed to be exact for any problem on a tree-structured graph (Wainwright et al., 2005), so that it can be viewed as a tree-based relaxation. The main goal of this paper is to develop efficient and distributed algorithms for solving this LP relaxation, as well as strengthenings based on additional constraints. For instance, one natural strengthening is by "lifting": view the pairwise MRF as a particular case of a more general MRF with higher-order cliques, define higher-order pseudomarginals on these cliques, and use them to impose higher-order consistency constraints. This particular progression of tighter relaxations underlies the Bethe to Kikuchi (sum-product to generalized sum-product) hierarchy.
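To make the relaxation concrete, the following minimal sketch (our addition, not from the paper) assembles the LOCAL($G$) linear program for a toy chain MRF and hands it to a generic solver via scipy.optimize.linprog. The graph, random potentials, and variable layout are illustrative assumptions; we also impose the marginalization constraint (3b) in both directions of each edge, as membership in LOCAL($G$) requires. A generic solver like this is exactly the kind of black-box approach that does not scale, which motivates the structured algorithms developed next.

```python
import numpy as np
from scipy.optimize import linprog

# Toy pairwise MRF: a 3-node chain with m labels (hypothetical example).
m = 3
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
rng = np.random.default_rng(0)
theta_s = {s: rng.uniform(-1, 1, size=m) for s in nodes}
theta_st = {e: rng.uniform(-1, 1, size=(m, m)) for e in edges}

# Variable layout: node pseudomarginals first, then edge pseudomarginals
# (each edge block is the m x m table raveled with rows indexed by x_s).
n_off = {s: i * m for i, s in enumerate(nodes)}
e_off = {e: len(nodes) * m + i * m * m for i, e in enumerate(edges)}
dim = len(nodes) * m + len(edges) * m * m

# LP cost: maximize theta^T mu  <=>  minimize -theta^T mu.
c = np.zeros(dim)
for s in nodes:
    c[n_off[s]:n_off[s] + m] = -theta_s[s]
for e in edges:
    c[e_off[e]:e_off[e] + m * m] = -theta_st[e].ravel()

A_eq, b_eq = [], []
for s in nodes:  # normalization (3a): sum_{x_s} mu_s(x_s) = 1
    row = np.zeros(dim)
    row[n_off[s]:n_off[s] + m] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for (s, t) in edges:  # marginalization (3b), both directions
    for xs in range(m):
        row = np.zeros(dim)
        row[e_off[(s, t)] + xs * m:e_off[(s, t)] + (xs + 1) * m] = 1.0
        row[n_off[s] + xs] = -1.0
        A_eq.append(row); b_eq.append(0.0)
    for xt in range(m):
        row = np.zeros(dim)
        row[e_off[(s, t)] + xt:e_off[(s, t)] + m * m:m] = 1.0
        row[n_off[t] + xt] = -1.0
        A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.vstack(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
print("LP optimum of theta^T mu:", -res.fun)
```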
3. Proximal Minimization Schemes

We begin by defining the notion of a proximal minimization scheme, and the Bregman divergences that we use to define our proximal sequences. Instead of referring to the maximization problem $\max_{\mu \in \mathbb{L}(G)} \theta^T\mu$, it is convenient to consider the equivalent minimization problem $\min_{\mu \in \mathbb{L}(G)} -\theta^T\mu$.

3.1. Proximal Minimization

The class of methods that we develop are based on the notion of proximal minimization (Bertsekas & Tsitsiklis, 1997). Instead of attempting to solve the LP directly, we solve a sequence of problems of the form
$$\mu^{n+1} = \arg\min_{\mu \in \mathbb{L}(G)} \Big\{ -\theta^T\mu + \frac{1}{\omega^n} D_f(\mu \,\|\, \mu^n) \Big\}, \qquad (4)$$
where for each $n = 0, 1, 2, \ldots$, $\mu^n$ denotes the current iterate, $\{\omega^n\}$ denotes a sequence of positive weights, and $D_f$ is a certain type of generalized distance, known as the proximal function. The purpose of introducing the proximal function is to convert the original LP (a convex optimization problem, but non-differentiable in the dual space) into a strictly convex optimization problem that can be solved relatively easily.

This scheme appears similar to an annealing scheme, in that it involves a choice of weights $\{\omega^n\}$. However, although the weights $\{\omega^n\}$ can be adjusted for faster convergence, they can also be set to a constant, unlike annealing procedures, which would typically require that $1/\omega^n \rightarrow 0$. The reason is that $D_f(\mu \,\|\, \mu^n)$, as a generalized distance, itself converges to zero as the method approaches the optimum, thus providing an "adaptive" annealing. For appropriate choices of weights and proximal functions, these proximal minimization schemes converge to the LP optimum with at least geometric and possibly superlinear rates (Bertsekas & Tsitsiklis, 1997; Iusem & Teboulle, 1995).

In this paper, we focus exclusively on proximal functions that are Bregman divergences (Censor & Zenios, 1997), a class that includes various well-known divergences (e.g., the quadratic norm, the Kullback-Leibler divergence). More specifically, we say that a function $f$ is a Bregman function if it is continuously differentiable, strictly convex, and has bounded level sets. It then induces a Bregman divergence
$$D_f(\mu \,\|\, \nu) := f(\mu) - f(\nu) - \langle \nabla f(\nu), \, \mu - \nu \rangle. \qquad (5)$$
This function satisfies $D_f(\mu \,\|\, \nu) \geq 0$ with equality if and only if $\mu = \nu$, but it need not be symmetric or satisfy the triangle inequality, so it is known as a generalized distance. We study the sequence $\{\mu^n\}$ of proximal iterates (4) for the following choices of Bregman divergences.
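Before turning to those choices, here is a minimal sketch (our addition, with function names of our own choosing) that evaluates the Bregman divergence of equation (5) for the two Bregman functions used below, recovering half the squared Euclidean distance and the unnormalized KL divergence respectively.

```python
import numpy as np

def bregman(f, grad_f, mu, nu):
    """Bregman divergence D_f(mu || nu) = f(mu) - f(nu) - <grad f(nu), mu - nu>."""
    return f(mu) - f(nu) - np.dot(grad_f(nu), mu - nu)

# Quadratic Bregman function: recovers 0.5 * ||mu - nu||^2.
f_quad = lambda x: 0.5 * np.dot(x, x)
g_quad = lambda x: x

# Negative entropy: recovers the unnormalized KL divergence
# sum_i [ mu_i * log(mu_i / nu_i) - mu_i + nu_i ].
f_ent = lambda x: np.sum(x * np.log(x))
g_ent = lambda x: np.log(x) + 1.0

mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.3, 0.3, 0.4])
print(bregman(f_quad, g_quad, mu, nu))  # equals 0.5 * ||mu - nu||^2
print(bregman(f_ent, g_ent, mu, nu))    # equals KL(mu || nu) here, both normalized
```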
Quadratic Distances: This choice is the simplest, corresponding to the quadratic norm across nodes and edges,
$$Q(\mu \,\|\, \nu) := \sum_{s \in V} \|\mu_s - \nu_s\|^2 + \sum_{(s,t) \in E} \|\mu_{st} - \nu_{st}\|^2, \qquad (6)$$
where we have used the shorthand $\|\mu_s - \nu_s\|^2 = \sum_{x_s} |\mu_s(x_s) - \nu_s(x_s)|^2$, and similarly for the edges. The Bregman function this corresponds to is the quadratic function
$$f(\mu) = \frac{1}{2} \sum_{s, \, x_s} \mu_s(x_s)^2 + \frac{1}{2} \sum_{(s,t), \, x_s, x_t} \mu_{st}(x_s, x_t)^2. \qquad (7)$$

Weighted Entropic Distances: Here we consider a (possibly weighted) sum of Kullback-Leibler (KL) divergences across the nodes and edges:
$$D(\mu \,\|\, \nu) = \sum_{s \in V} \alpha_s \, D(\mu_s \,\|\, \nu_s) + \sum_{(s,t) \in E} \alpha_{st} \, D(\mu_{st} \,\|\, \nu_{st}), \qquad (8)$$
where $D(p \,\|\, q) := \sum_x \big[ p(x) \log \frac{p(x)}{q(x)} - p(x) + q(x) \big]$ is the KL divergence, and $\{\alpha_s, \alpha_{st}\}$ are positive node and edge weights, respectively. An advantage of the KL distance, in contrast to the quadratic norm, is that it automatically acts to enforce non-negativity constraints on the pseudomarginals. The Bregman function this corresponds to is the negative entropy function
$$f(\mu) = -\sum_{s \in V} \alpha_s H_s(\mu_s) - \sum_{(s,t) \in E} \alpha_{st} H_{st}(\mu_{st}), \qquad (9)$$
where $H_s$ and $H_{st}$ are singleton and edge-based entropies, respectively.

An extension is to define a Bregman function based on a convex combination of tree-structured entropy functions (Wainwright & Jordan, 2003); using expressions such as the reweighted Bethe entropy, which is equivalent to a convex combination of tree entropies within the local polytope, we can derive an iterative procedure involving tree-reweighted message-passing to solve the outer proximal steps. We defer further details to a full-length version.

3.2. Proximal Sequences via Bregman Projection

The key in designing an efficient proximal minimization scheme is ensuring that the proximal sequence $\{\mu^n\}$ can be computed efficiently. In this section, we first describe how each proximal minimization can be reformulated as a particular Bregman projection. We then describe how this Bregman projection can itself be computed iteratively, in terms of a sequence of cyclic Bregman projections based on a decomposition of the constraint set LOCAL($G$). In the sequel, we then show how these cyclic Bregman projections reduce to very simple message-passing updates.

Given a Bregman divergence $D_f$, the Bregman projection of a vector $\nu$ onto a convex set $C$ is given by
$$\hat{\mu} := \arg\min_{\mu \in C} D_f(\mu \,\|\, \nu).$$
To simplify notation, for any Bregman function $f$, let us define the operator
$$J_f(\mu, \theta) := (\nabla f)^{-1}\big(\nabla f(\mu) + \theta\big),$$
and for any Bregman divergence $D_f$ and any convex set $C$, define the projection operator
$$\Pi_f(\nu; \, C) := \arg\min_{\mu \in C} D_f(\mu \,\|\, \nu).$$
With this notation, we can write the proximal update in a compact manner as a type of projection:
$$\mu^{n+1} = \Pi_f\big( J_f(\mu^n, \omega^n\theta); \, \mathbb{L}(G) \big). \qquad (10)$$
To see this, note that by taking derivatives and using standard conditions for optima over convex sets (Bertsekas & Tsitsiklis, 1997), the defining optimality condition for the projection $\hat{\mu}$ is
$$\langle \nabla f(\hat{\mu}) - \nabla f(\nu), \, \mu - \hat{\mu} \rangle \geq 0 \qquad (11)$$
for all $\mu \in C$. Now consider the proximal minimization problem to be solved at step $n$, namely the strictly convex problem
$$\min_{\mu \in \mathbb{L}(G)} \Big\{ -\theta^T\mu + \frac{1}{\omega^n} D_f(\mu \,\|\, \mu^n) \Big\}. \qquad (12)$$
By taking derivatives and using the same convex optimality conditions, we see that the optimum $\mu^{n+1}$ is defined by the conditions
$$\langle \nabla f(\mu^{n+1}) - \nabla f(\mu^n) - \omega^n\theta, \, \mu - \mu^{n+1} \rangle \geq 0$$
for all $\mu \in \mathbb{L}(G)$. Note that these optimality conditions are of the same form as the Bregman projection conditions (11), with the vector $\nabla f(\mu^n) + \omega^n\theta$ taking the role of $\nabla f(\nu)$; in other words, with $(\nabla f)^{-1}(\nabla f(\mu^n) + \omega^n\theta) = J_f(\mu^n, \omega^n\theta)$ being substituted for $\nu$, which justifies the compact form (10). Consequently, efficient algorithms for computing the Bregman projection (11) can be leveraged to compute the proximal update (12). In particular, our algorithms leverage the fact that Bregman projections can be computed efficiently in a cyclic manner, that is, by decomposing the constraint set $C = \cap_i C_i$ into an intersection of simpler constraint sets, and then performing a sequence of projections onto these simple constraint sets (Censor & Zenios, 1997).
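For intuition, the operator $J_f$ appearing in the compact update (10) has a simple closed form for both Bregman functions above; a minimal sketch (our addition), assuming uniform weights $\alpha_s = \alpha_{st} = 1$ in the entropic case:

```python
import numpy as np

def J_quadratic(mu, omega, theta):
    # Quadratic f: grad f(y) = y, so (grad f)^{-1}(y) = y and
    # J_f(mu, omega * theta) = mu + omega * theta.
    return mu + omega * theta

def J_entropic(mu, omega, theta):
    # Negative entropy f: grad f(y) = log y + 1, so (grad f)^{-1}(y) = exp(y - 1)
    # and J_f(mu, omega * theta) = mu * exp(omega * theta).
    return mu * np.exp(omega * theta)
```

These two maps are precisely the initializations used by Algorithms 2 and 3 below.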
Accordingly, consider a decomposition of the constraint set as an intersection, say $\mathbb{L}(G) = \bigcap_{k=1}^{T} \mathbb{L}_k(G)$. By the method of cyclic Bregman projections (Censor & Zenios, 1997), we can compute $\mu^{n+1}$ in an iterative manner, by performing a sequence of projections onto the simpler constraint sets: initializing $\mu^{n,0} = \mu^n$, and updating from $\mu^{n,\tau}$ to $\mu^{n,\tau+1}$ by projecting $\mu^{n,\tau}$ onto the constraint set $\mathbb{L}_{i(\tau)}(G)$, where $i(\tau) = \tau \bmod T$, for instance. This procedure is summarized in Algorithm 1.

Algorithm 1 Basic proximal-Bregman LP solver
Given a Bregman distance $D_f$, a weight sequence $\{\omega^n\}$ and problem parameters $\theta$:
• Initialize $\mu^{(0)}_s(x_s) = \frac{1}{m}$, $\mu^{(0)}_{st}(x_s, x_t) = \frac{1}{m^2}$.
• Outer loop: For iterations $n = 0, 1, 2, \ldots$, update $\mu^{n+1} = \Pi_f\big(J_f(\mu^n, \omega^n\theta); \mathbb{L}(G)\big)$.
  – Solve each outer update via the Inner loop:
    (a) Initialize $\mu^{n,0} = J_f(\mu^n, \omega^n\theta)$.
    (b) For $\tau = 0, 1, 2, \ldots$, set $i(\tau) = \tau \bmod T$.
    (c) Set $\mu^{n,\tau+1} = \Pi_f\big(\mu^{n,\tau}; \, \mathbb{L}_{i(\tau)}(G)\big)$.

As shown in the following sections, by using a decomposition of $\mathbb{L}(G)$ over the edges of the graph, the inner-loop steps correspond to local message-passing updates, slightly different in nature depending on the choice of Bregman distance. Iterating the inner and outer loops yields a provably convergent message-passing algorithm for the LP. Convergence follows from the convergence properties of proximal minimization (Bertsekas & Tsitsiklis, 1997), combined with convergence guarantees for cyclic Bregman projections (Censor & Zenios, 1997). In the following sections, we derive the message-passing updates corresponding to the various Bregman functions of interest, and we give rates of convergence for the cyclic projection updates in the inner loop; a schematic rendering of the double-loop structure is sketched below.
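The following skeleton (our addition, not the authors' implementation) renders Algorithm 1 in code. Here `J_f` is a map such as the ones sketched in Section 3.2 above, `project_constraint` stands in for the single-constraint projections derived in Sections 3.3 and 3.4, and the fixed iteration counts are crude placeholders for real convergence tests.

```python
import numpy as np

def proximal_bregman_lp(theta_s, theta_st, J_f, project_constraint, constraints,
                        omega=1.0, n_outer=50, n_inner=200):
    """Schematic double-loop solver (Algorithm 1): outer proximal updates,
    inner cyclic Bregman projections onto the sets listed in `constraints`."""
    m = len(next(iter(theta_s.values())))
    # Uniform initialization of node and edge pseudomarginals.
    mu_s = {s: np.full(m, 1.0 / m) for s in theta_s}
    mu_st = {e: np.full((m, m), 1.0 / m ** 2) for e in theta_st}
    for n in range(n_outer):
        # Inner-loop initialization: mu^{n,0} = J_f(mu^n, omega * theta).
        mu_s = {s: J_f(mu_s[s], omega, theta_s[s]) for s in mu_s}
        mu_st = {e: J_f(mu_st[e], omega, theta_st[e]) for e in mu_st}
        for tau in range(n_inner):
            # Cycle through the simple constraint sets L_{i(tau)}(G).
            project_constraint(constraints[tau % len(constraints)], mu_s, mu_st)
    return mu_s, mu_st
```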
3.3. Quadratic Projections

Consider the proximal sequence with the quadratic distance $Q$ from equation (6); the Bregman function inducing this distance is the quadratic function $f(y) = \frac{1}{2}\|y\|^2$, whose gradient is given by $\nabla f(y) = y$.

The map $\nu = J_f(\mu, \omega\theta)$: In this case, it can be derived from
$$\nabla f(\nu) = \nabla f(\mu) + \omega\theta, \qquad (13)$$
$$\text{so that} \quad \nu = \mu + \omega\theta, \qquad (14)$$
whence we get the initialization (18)-(19) in Algorithm 2.

The projections $\mu^{n,\tau+1} = \Pi_Q\big(\mu^{n,\tau}; \mathbb{L}_i(G)\big)$ onto the individual constraint sets $\mathbb{L}_i(G)$: the associated local update takes the form
$$\mu^{n,\tau+1} = \arg\min_{\nu \in \mathbb{L}_i(G)} \big\{ f(\nu) - \langle \nabla f(\mu^{n,\tau}), \, \nu \rangle \big\}. \qquad (15)$$
Consider the edge marginalization constraint $\sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s)$ for edge $(s,t)$. Denoting the dual (Lagrange) parameter corresponding to this constraint by $\lambda_{st}(x_s)$, the KKT conditions for (15) are given by
$$\nabla f\big(\mu^{n,\tau+1}_{st}(x_s, x_t)\big) = \nabla f\big(\mu^{n,\tau}_{st}(x_s, x_t)\big) + \lambda_{st}(x_s), \qquad \nabla f\big(\mu^{n,\tau+1}_{s}(x_s)\big) = \nabla f\big(\mu^{n,\tau}_{s}(x_s)\big) - \lambda_{st}(x_s), \qquad (16)$$
so that
$$\mu^{n,\tau+1}_{st}(x_s, x_t) = \mu^{n,\tau}_{st}(x_s, x_t) + \lambda_{st}(x_s), \qquad \mu^{n,\tau+1}_{s}(x_s) = \mu^{n,\tau}_{s}(x_s) - \lambda_{st}(x_s),$$
while the constraint itself gives
$$\sum_{x_t} \mu^{n,\tau+1}_{st}(x_s, x_t) = \mu^{n,\tau+1}_{s}(x_s). \qquad (17)$$
Solving for $\lambda_{st}(x_s)$ yields equation (20). The update for the node normalization constraint follows similarly, so that overall we obtain the message-passing Algorithm 2 for the inner loop.

Algorithm 2 Quadratic Messages for $\mu^{n+1}$
Initialization:
$$\mu^{(n,0)}_{st}(x_s, x_t) = \mu^{(n)}_{st}(x_s, x_t) + \omega^n \theta_{st}(x_s, x_t) \qquad (18)$$
$$\mu^{(n,0)}_{s}(x_s) = \mu^{(n)}_{s}(x_s) + \omega^n \theta_s(x_s) \qquad (19)$$
repeat
  for each edge $(s,t) \in E$ do
  $$\mu^{(n,\tau+1)}_{st}(x_s, x_t) = \mu^{(n,\tau)}_{st}(x_s, x_t) + \frac{1}{m+1}\Big( \mu^{(n,\tau)}_s(x_s) - \sum_{x_t} \mu^{(n,\tau)}_{st}(x_s, x_t) \Big) \qquad (20)$$
  $$\mu^{(n,\tau+1)}_{s}(x_s) = \mu^{(n,\tau)}_{s}(x_s) + \frac{1}{m+1}\Big( -\mu^{(n,\tau)}_s(x_s) + \sum_{x_t} \mu^{(n,\tau)}_{st}(x_s, x_t) \Big) \qquad (21)$$
  end for
  for each node $s \in V$ do
  $$\mu^{(n,\tau+1)}_{s}(x_s) = \mu^{(n,\tau)}_{s}(x_s) + \frac{1}{m}\Big( 1 - \sum_{x_s} \mu^{(n,\tau)}_{s}(x_s) \Big)$$
  $$\mu^{(n,\tau+1)}_{s}(x_s) = \max\big(0, \, \mu^{(n,\tau+1)}_{s}(x_s)\big)$$
  end for
until convergence
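A compact sketch of these quadratic messages (our addition; the data structures and sweep count are illustrative, and we project both directions of each edge's marginalization constraint):

```python
import numpy as np

def quadratic_inner_loop(mu_s, mu_st, n_sweeps=100):
    """Cyclic quadratic projections (Algorithm 2). mu_s: {node: (m,) array};
    mu_st: {(s, t): (m, m) array with rows indexed by x_s}."""
    m = len(next(iter(mu_s.values())))
    for _ in range(n_sweeps):
        for (s, t), edge in mu_st.items():
            # Marginalization projection for the x_s direction (eqs. 20-21).
            lam = (mu_s[s] - edge.sum(axis=1)) / (m + 1.0)
            edge += lam[:, None]
            mu_s[s] -= lam
            # Same projection applied to the other direction of the edge.
            lam = (mu_s[t] - edge.sum(axis=0)) / (m + 1.0)
            edge += lam[None, :]
            mu_s[t] -= lam
        for s in mu_s:
            # Normalization projection, then clipping for non-negativity.
            mu_s[s] += (1.0 - mu_s[s].sum()) / m
            np.maximum(mu_s[s], 0.0, out=mu_s[s])
    return mu_s, mu_st
```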
3.4. Entropic Projections

Consider the proximal sequence with the Kullback-Leibler distance $D(\mu \,\|\, \nu)$ defined in equation (8); the Bregman function inducing this distance is a sum of negative entropy functions $f(\mu) = \sum \mu \log \mu$, and its gradient is given by $\nabla f(\mu) = \log(\mu) + 1$. The derivation of the updates mirrors that of the previous section; deferring the details to a full-length version, we obtain the message-passing Algorithm 3 for the inner loop. There are also interesting similarities between the corresponding dual updates (which are updates to the dual parameters) and sum-product updates; we defer the details to a full-length version of this paper due to lack of space.

Algorithm 3 Entropic Messages for $\mu^{n+1}$
Initialization:
$$\mu^{(n,0)}_{st}(x_s, x_t) = \mu^{(n)}_{st}(x_s, x_t) \exp\big(\omega^n \theta_{st}(x_s, x_t)\big)$$
$$\mu^{(n,0)}_{s}(x_s) = \mu^{(n)}_{s}(x_s) \exp\big(\omega^n \theta_{s}(x_s)\big)$$
repeat
  for each edge $(s,t) \in E$ do
  $$\mu^{(n,\tau+1)}_{st}(x_s, x_t) = \mu^{(n,\tau)}_{st}(x_s, x_t) \sqrt{\frac{\mu^{(n,\tau)}_s(x_s)}{\sum_{x_t} \mu^{(n,\tau)}_{st}(x_s, x_t)}}$$
  $$\mu^{(n,\tau+1)}_{s}(x_s) = \sqrt{\mu^{(n,\tau)}_s(x_s) \sum_{x_t} \mu^{(n,\tau)}_{st}(x_s, x_t)}$$
  end for
  for each node $s \in V$ do
  $$\mu^{(n,\tau+1)}_{s}(x_s) = \frac{\mu^{(n,\tau)}_s(x_s)}{\sum_{x_s} \mu^{(n,\tau)}_s(x_s)}$$
  end for
until convergence

3.5. Reweighted Entropy Projections

The message-passing updates here are "reweighted" versions of those derived in the previous section for the proximal iterates induced by the unweighted entropy.

Initialization of proximal steps:
$$\mu^{(n,0)}_{st}(x_s, x_t) = \mu^{(n)}_{st}(x_s, x_t) \exp\big(\omega^n \theta_{st}(x_s, x_t)/\alpha_{st}\big)$$
$$\mu^{(n,0)}_{s}(x_s) = \mu^{(n)}_{s}(x_s) \exp\big(\omega^n \theta_{s}(x_s)/\alpha_s\big).$$
Projections: The node normalization update remains the same as in the previous section, while the marginalization update changes to
$$\mu^{(n,\tau+1)}_{st}(x_s, x_t) = \mu^{(n,\tau)}_{st}(x_s, x_t) \left( \frac{\mu^{(n,\tau)}_s(x_s)}{\sum_{x_t} \mu^{(n,\tau)}_{st}(x_s, x_t)} \right)^{\frac{\alpha_s}{\alpha_s + \alpha_{st}}}$$
$$\mu^{(n,\tau+1)}_{s}(x_s) = \big(\mu^{(n,\tau)}_s(x_s)\big)^{\frac{\alpha_s}{\alpha_s + \alpha_{st}}} \Big( \sum_{x_t} \mu^{(n,\tau)}_{st}(x_s, x_t) \Big)^{\frac{\alpha_{st}}{\alpha_s + \alpha_{st}}}.$$

4. Rounding with Optimality Certificates

A key practical issue in applying LP relaxation is how to round the fractional solution; a standard approach is to round the node marginals to the nearest integer solution. However, in general, such rounding procedures need not always output the optimal integer configuration. An attractive feature of our proximal Bregman procedures is the existence of rounding schemes which, assuming that the LP relaxation is tight, can produce the LP integral optimum and certify that it is correct, even before the pseudomarginals converge to the LP solution. Here we describe two rounding schemes, and state the optimality certificate associated with each.

Node-based Rounding: This method applies to any of our proximal schemes. Given the vector $\mu^n$ of pseudomarginals at iteration $n$, define an integer configuration $x^n$ by choosing, for each vertex $s \in V$, a value $x^n_s \in \arg\max_{x_s} \mu^n_s(x_s)$. Say that such a rounding is edgewise-consistent if for all edges $(s,t) \in E$, we have $(x^n_s, x^n_t) \in \arg\max_{(x_s, x_t)} \mu^n_{st}(x_s, x_t)$.

Tree-based Rounding: We describe this method in application to the unweighted entropic proximal updates. Let $T_1, \ldots, T_M$ be a set of spanning trees that cover the graph (meaning that each edge appears in at least one tree); for each edge $(s,t)$, define the edge weight $\rho_{st} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{I}[(s,t) \in T_i]$. Then for each tree $i = 1, \ldots, M$:
(a) Define the tree-structured energy function $E_i(x) := \sum_{s \in V} \log \mu^n_s(x_s) + \sum_{(s,t) \in E(T_i)} \frac{1}{\rho_{st}} \log \mu^n_{st}(x_s, x_t)$.
(b) Run the ordinary max-product algorithm on the energy $E_i(x)$ to find a MAP-optimal configuration $x^n(T_i)$.
Say that such a rounding is tree-consistent if the tree MAP solutions $\{x^n(T_i), \, i = 1, \ldots, M\}$ are all equal.

The following result characterizes the optimality guarantees associated with these rounding schemes:

Theorem 1 (Rounding with optimality certificates). At any iteration $n = 1, 2, \ldots$, any edgewise-consistent configuration obtained from node-rounding, or any tree-consistent configuration obtained from tree-rounding, is guaranteed to be MAP-optimal for the original problem.

The proof is based on a certain energy-invariance property of the proximal updates; in particular, at any iteration $n$, the pseudomarginals $\mu^n$ have an associated function $F(x; \mu^n)$ that encodes the energy $E(x; \theta) = \sum_s \theta_s(x_s) + \sum_{(s,t)} \theta_{st}(x_s, x_t)$ of the graphical model. For instance, for the entropic proximal schemes, at any iteration $n$, the function $F(x; \mu^n) := \prod_{s \in V} \mu^n_s(x_s) \prod_{(s,t) \in E} \mu^n_{st}(x_s, x_t)$ is proportional to the exponential of $E(x; \theta)$. (See the technical report (Ravikumar et al., 2008) for full details.)

Both rounding schemes require relatively little computation. The node-rounding scheme is purely local, and so trivial to implement. With reference to the tree-rounding scheme, many graphs can be covered with a small number $M$ of trees (e.g., $M = 2$ for grid graphs), in which case tree-rounding requires running the ordinary max-product algorithm only twice: certainly more expensive than node-rounding, but doable. In practice, we find that tree-rounding tends to find LP optima more quickly than node-rounding.
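Node-based rounding and its edgewise-consistency certificate are straightforward to implement; a minimal sketch (our addition, with a small tolerance standing in for exact arg-max ties):

```python
import numpy as np

def node_round(mu_s, mu_st, tol=1e-12):
    """Node-based rounding plus the edgewise-consistency certificate of
    Theorem 1. Returns (configuration, is_certified)."""
    x = {s: int(np.argmax(mu_s[s])) for s in mu_s}
    consistent = all(
        mu_st[(s, t)][x[s], x[t]] >= mu_st[(s, t)].max() - tol
        for (s, t) in mu_st
    )
    return x, consistent
```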
5. Convergence Rates

The convergence of our message-passing updates follows from two sets of results: (a) convergence of proximal algorithms (Bertsekas & Tsitsiklis, 1997); and (b) convergence of cyclic Bregman projections (Censor & Zenios, 1997). Our outer loop is a proximal algorithm, which has been well studied in the optimization literature. A sequence $\{\mu^{(t)}\}$ is said to have superlinear convergence to the optimum $\hat{\mu}$ if
$$\lim_{t \to \infty} \frac{\|\mu^{(t+1)} - \hat{\mu}\|}{\|\mu^{(t)} - \hat{\mu}\|} = 0.$$
Note that such convergence is faster than a multiplicative contraction ($\lim_{t \to \infty} \|\mu^{(t+1)} - \hat{\mu}\| / \|\mu^{(t)} - \hat{\mu}\| < 1$). Bertsekas and Tsitsiklis (1997) show that a proximal algorithm with a quadratic proximity has superlinear convergence under mild conditions, whereas Iusem and Teboulle (1995) show the same for the entropic proximity. Under the assumption that the inner loops are solved exactly, these convergence results then show that our outer iterates converge superlinearly.

Our inner-loop message updates use cyclic Bregman projections; Censor and Zenios (1997) show that, with dual feasibility correction, projections onto general convex sets are convergent. For Euclidean (quadratic) projections onto linear constraints (half-spaces), Deutsch and Hundal (2006) establish a geometric rate of convergence, dependent on the angles between the half-spaces. The intuition is that the more orthogonal the half-spaces are, the faster the convergence; for instance, a single iteration suffices for completely orthogonal constraints. Our inner updates thus converge geometrically to an $\epsilon$-suboptimal solution for any outer proximal step.

As noted above, the proximal convergence results assume that the inner loop is solved exactly, while the Bregman projection results yield geometric convergence to only an $\epsilon$-suboptimal solution. While for $\epsilon$ small enough (e.g., $10^{-6}$, as in our experiments) this issue might not be practically relevant, there has also been some recent work, e.g., Solodov and Svaiter (2001), showing that under mild conditions, superlinear rates still hold for $\epsilon$-suboptimal proximal iterates.

6. Experiments

We performed experiments on 4-nearest-neighbor grid graphs with sizes varying from p = 100 to p = 900 vertices, in all cases using models with 5 labels. The edge potentials were set to Potts functions, $\theta_{st}(x_s, x_t) = \beta_{st} \mathbb{I}[x_s = x_t]$, which penalize disagreement of labels by $\beta_{st}$. The Potts weights on edges $\beta_{st}$ were chosen randomly as Uniform(-1, +1), while the node potentials $\theta_s(x_s)$ were set as Uniform(-SNR, SNR), where the parameter SNR $\geq 0$ controls the ratio of node to edge strengths, and thus corresponds roughly to a signal-to-noise ratio.
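A sketch of this experimental setup (our addition; the grid side, seed, and data structures are arbitrary choices) that generates the Potts potentials in the tabular form of equations (2a)-(2b):

```python
import numpy as np

def make_potts_grid(side=10, m=5, snr=1.0, seed=0):
    """Potts model on a side x side 4-nearest-neighbor grid: random edge
    weights beta_st ~ Uniform(-1, 1), node potentials ~ Uniform(-snr, snr)."""
    rng = np.random.default_rng(seed)
    nodes = [(i, j) for i in range(side) for j in range(side)]
    edges = [((i, j), (i + di, j + dj))
             for (i, j) in nodes for (di, dj) in [(0, 1), (1, 0)]
             if i + di < side and j + dj < side]
    theta_s = {s: rng.uniform(-snr, snr, size=m) for s in nodes}
    # beta_st * I[x_s = x_t] as an m x m table is beta_st times the identity.
    theta_st = {e: rng.uniform(-1.0, 1.0) * np.eye(m) for e in edges}
    return nodes, edges, theta_s, theta_st
```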
[Figure 1 here: panel "Distance versus iteration"; y-axis "Log distance to fixed point"; curves for p = 100, 400, 900.]
Figure 1. Plot of the distance $\log_{10} \|\mu^n - \hat{\mu}\|_2$ between the current iterate $\mu^n$ and the LP optimum $\hat{\mu}$ versus iteration number, for Potts models on grids with $p \in \{100, 400, 900\}$ vertices and SNR = 1. Note the superlinear rate of convergence.

Figure 1 shows plots of the logarithmic distance between the current iterate $\mu^n$ and the LP optimum $\hat{\mu}$ for grids of different sizes. In all cases, note how the curves have an inverted quadratic shape, corresponding to superlinear convergence. Figure 2 shows the fraction of edges for which the node-based rounding is edgewise-inconsistent, for grids of different sizes. Note how this fraction converges to zero in a small number of iterations. Figure 3 shows the ratio of the energy of the rounded solution to the energy of the MAP optimum, i.e., the suboptimality factor. Note again the small number of iterations required for convergence.

[Figure 2 here: panel "Convergence of the rounded solution"; y-axis "Edge Disagreement %"; curves for p = 400, 625, 900.]
Figure 2. Plots of the fraction of edges for which the node-based rounding is edgewise-inconsistent, for grids of different sizes. Recall that when this fraction is zero, we recover the MAP optimum.

[Figure 3 here: panel "Convergence of the rounded solution"; y-axis "Suboptimality Ratio"; curves for p = 400, 625, 900.]
Figure 3. Plots of the ratio of the energy of the rounded solution to the energy of the MAP optimum. Note the small number of iterations required for convergence.

7. Discussion

In this paper, we have developed distributed algorithms, based on the notion of proximal sequences, for solving graph-structured linear programming (LP) relaxations. Our methods respect the graph structure, and so can be scaled to large problems, and they exhibit a superlinear rate of convergence. We also developed rounding schemes that can be used to generate integral solutions along with a certificate of optimality. These optimality certificates allow the algorithm to be terminated in a finite number of iterations. The structure of our algorithms naturally lends itself to incorporating additional constraints, both linear and other types of conic constraints. It would be interesting to develop an adaptive version of our algorithm, which selectively incorporates new constraints as necessary, and then uses the same proximal schemes to minimize the new conic program.

Acknowledgements

Work supported by NSF grants CCF-0545862 and DMS-0528488. We thank the anonymous reviewers for helpful comments.

References

Bertsekas, D. P., & Tsitsiklis, J. N. (1997). Parallel and Distributed Computation: Numerical Methods. Boston, MA: Athena Scientific.
Bertsimas, D., & Tsitsiklis, J. (1997). Introduction to Linear Optimization. Belmont, MA: Athena Scientific.
Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell., 23, 1222-1239.
Censor, Y., & Zenios, S. A. (1997). Parallel Optimization: Theory, Algorithms and Applications. Oxford University Press.
Chekuri, C., Khanna, S., Naor, J., & Zosin, L. (2005). A linear programming formulation and approximation algorithms for the metric labeling problem. SIAM Journal on Discrete Mathematics, 18, 608-625.
Deutsch, F., & Hundal, H. (2006). The rate of convergence for the cyclic projections algorithm I: Angles between convex sets. Journal of Approximation Theory, 142, 36-55.
Feldman, J., Karger, D. R., & Wainwright, M. J. (2002). Linear programming-based decoding of turbo-like codes and its relation to iterative approaches. Proc. 40th Annual Allerton Conf. on Communication, Control, and Computing.
Freeman, W. T., & Weiss, Y. (2001). On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs. IEEE Trans. Info. Theory, 47, 736-744.
Globerson, A., & Jaakkola, T. (2007). Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. Neural Information Processing Systems. Vancouver, Canada.
Iusem, A. N., & Teboulle, M. (1995). Convergence rate analysis of nonquadratic proximal methods for convex and linear programming. Mathematics of Operations Research, 20(3), 657-677.
Kolmogorov, V. (2005). Convergent tree-reweighted message-passing for energy minimization. International Workshop on Artificial Intelligence and Statistics.
Kolmogorov, V., & Wainwright, M. J. (2005). On optimality properties of tree-reweighted message-passing. UAI.
Komodakis, N., Paragios, N., & Tziritas, G. (2007). MRF optimization via dual decomposition: Message-passing revisited. ICCV. Rio de Janeiro, Brazil.
Kumar, P., Torr, P., & Zisserman, A. (2006). Solving Markov random fields using second order cone programming. IEEE CVPR.
Ravikumar, P., Agarwal, A., & Wainwright, M. J. (2008). Message-passing for graph-structured linear programs: Proximal projections, convergence and rounding schemes (Technical Report). UC Berkeley.
Ravikumar, P., & Lafferty, J. (2006). Quadratic programming relaxations for metric labeling and Markov random field MAP estimation. ICML '06 (pp. 737-744).
Solodov, M., & Svaiter, B. (2001). A unified framework for some inexact proximal point algorithms. Numerical Functional Analysis and Optimization, 22, 1013-1035.
Wainwright, M., Jaakkola, T., & Willsky, A. (2005). MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory, 51, 3697-3717.
Wainwright, M. J., & Jordan, M. I. (2003). Graphical models, exponential families, and variational inference (Technical Report 649). UC Berkeley, Department of Statistics.
Weiss, Y., Yanover, C., & Meltzer, T. (2007). MAP estimation, linear programming and belief propagation with convex free energies. UAI.