Accelerated dual decomposition for MAP inference Vladimir Jojic vjojic@cs.stanford.edu Stephen Gould sgould@stanford.edu Daphne Koller koller@cs.stanford.edu Computer Science Department, 353 Serra Mall, Stanford University, Stanford, CA 94305, USA Abstract Approximate MAP inference in graphical models is an important and challenging problem for many domains including computer vision, computational biology and natural language understanding. Current state-of-theart approaches employ convex relaxations of these problems as surrogate objectives, but only provide weak running time guarantees. In this paper, we develop an approximate inference algorithm that is both efficient and has strong theoretical guarantees. Specifically, our algorithm is guaranteed to converge to an -accurate solution of the convex relaxation in O 1 time. We demonstrate our approach on synthetic and real-world problems and show that it outperforms current stateof-the-art techniques. methods is significantly improved with better quality inference (Finley & Joachims, 2008). Current state-of-the-art methods for approximate inference employ convex programming relaxations to find an approximate solution via the convex dual (Kolmogorov, 2006; Globerson & Jaakkola, 2007; Wainwright et al., 2005; Komodakis & Paragios, 2009). The solution to the original primal problem is found by a decoding process on the dual variables. Typically, the better the solution to the dual, the better the quality of the solution to the original problem. While many state-of-the-art approaches aim to optimize the dual formulation, they are either slow to converge or can get stuck (Globerson & Jaakkola, 2007; Kolmogorov, 2006). For example, coordinate descent procedures are prone to this failure mode. In contrast to previous methods, the method of Komodakis & Paragios (2009) provided guarantees on quality of convergence. However, since this is a projected gradient approach the guaranteed rate of convergence is slow, namely O( 1 ) 2 time complexity for an -accurate solution. In this paper, we develop an algorithm based on the smoothing approach of Nesterov (2005) with significantly faster time complexity of O( 1 ). Like the method of Komodakis & Paragios (2009), our method is based on the framework of Lagrangian relaxation that leverages the structure of the problem. Lagrangian relaxation divides the original problem into a set of coordinated slave problems that have tractable structure. The coordination of the slaves is handled by a master. The master achieves this by collecting messages from each of the slaves, which describe their solutions. Each slave is then updated in a manner so as to bring them into agreement. An important feature of the framework is that the tractability of the slaves is unaffected by the message passing. Note that, since the slave problems are decoupled, the algorithm can be distributed and the computational effort involved in finding their solution is fixed across all iterations. Unlike projected subgradient approaches, we construct 1. Introduction Markov random fields (MRFs) are a useful framework for modeling many important problems in computer vision, computational biology and natural language understanding. One important task for these problems is to find the assignment to each variable that jointly minimizes the energy (maximizes the probability) defined by the model. However, finding the optimal solution for a general MRF is an NP-hard problem. Hence, research has focused on approximate methods. These methods, however, still pose computational challenges, and algorithms with fast convergence rates to near-optimal solutions are needed. This is especially important when inference is used as the inner loop of another algorithm, such as the popular max-margin learning algorithms (Tsochantaridis et al., 2005; Taskar et al., 2005), which iteratively add constraints to the training objective. Learning in these Appearing in Proceedings of the 27 th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s). Accelerated dual decomposition for MAP inference smooth approximations to the slave problems that remain tractable and enable globally faster optimization algorithms to be applied. Importantly, our approximations are uniformly bounded, resulting in guaranteed convergence to an -accurate solution of the original problem. Surprisingly, our algorithm, which is aimed at solving the MAP problem, gives rise to sum-productlike updates at a particular temperature. Such connections between max-product and sum-product algorithms have been explored in the context of zero-limit temperatures (Meltzer et al., 2009). To our knowledge, this is the first constructive approach to establishing such relationships at non-zero temperatures. While the smoothed objective can be solved using any standard gradient-based approach (conjugate gradients, L-BFGS, etc.), we derived a method from a family of optimal methods (Nesterov, 1983), which provide superior convergence guarantees. Specifically, our method converges to an -approximate solution in O( 1 ) iterations. In addition to the strong theoretical guarantees, we provide experimental results on both synthetic and real-world problems demonstrating the superiority of our approach. grangian relaxations and utilized a gradient descent algorithm for their optimization. However, this work does not provide guarantees on either run time or on the level of error introduced by their smoothing approximation. In contrast to the above three approaches, our method provides strong convergence guarantees. Ravikumar et al. (2008) introduced a family of double loop algorithms. The outer loop iterates over a schedule of smoothing coefficients and the inner loop solves an equivalent of a sequential TRW problem for a given smoothing coefficient. Strong guarantees exist; however these do not apply when the inner loop cannot be solved exactly. Further, solution of the inner loop requires sequential updates across different tree subproblems, eliminating one of the main sources of parallelism inherent in the dual decomposition based methods. In contrast, we provide strong guarantees on total running time, and each slave is solved exactly. Additionally, in our algorithm, slave problems can be solved in isolation, thus leveraging large scale distribution on top of an efficient algorithm. 3. Background 3.1. Markov Random Fields Let z = (z1 , . . . , zN ) be a vector of discrete random variables, each random variable zi taking its value from a discrete set Zi . A Boltzmann distribution induced by energy function E(z) over variables z at temperature µ is defined by p(z) = 1 Z 1 exp - µ E(z) 2. Related Work Our work stems from Lagrangian relaxation of the MAP estimation problem. This approach, which has been explored by Komodakis et al. (2007), leverages the tractability of the slave problems. The approach derives a dual-decomposition algorithm that only requires MAP inference in the slave. Although the algorithm is provably convergent, that is, it does not get stuck in local minima, the approach suffers from poor run time guarantees. In particular, it can take up to O 1 iterations to converge. Our algorithm, on the 2 other hand, converges in O( 1 ) iterations. The tree-reweighted (TRW) family of algorithms (Wainwright et al., 2005; Kolmogorov, 2006) are based on dual-decomposition across spanning trees. However, unlike (Komodakis & Paragios, 2009), these algorithms do not provide convergence guarantees. The innovative approach of Globerson & Jaakkola (2007) derives a message-passing algorithm which iteratively optimizes the dual of the relaxed integer program. This is accomplished by block coordinate descent moves; thus the algorithm, due to its coordinate nature, can get stuck. Furthermore, the method has no convergence guarantees, but seems to perform well in practice. Johnson et al. (2007) investigated smoothing of La- where 1 Z = is referred to as the partiz exp - µ E(z) tion function and serves to normalize the distribution. Markov random fields (MRFs) encode such a distribution and their energy function is formed as a sum of difC ferent potentials E(z) = - c=1 c (zc ) at a temperature µ = 1. Here, we assume that each potential c is defined over a subset of the variables c {1, . . . , N }. -1 We introduce i = {c : i c } to represent the set of potentials with variable zi in their scope. In what follows, we will assume that inference is tractable for each potential; however, we make no assumption about the structure of the potential. Tractability can be achieved through exploitation of the structure of the potential (e.g., trees or low tree-width subgraphs) or the small size of its configuration space, i.e., |Zc | = ic |Zi |. A typical task in MRF inference is to find the most likely joint assignment z under the distribution p(z), explicitly z = argmaxz p(z) = argmaxz c c (zc ). This problem can be recast as the integer program maximize z C c=1 c (zc ) i = 1, . . . , N. subject to zi Zi (IP) Accelerated dual decomposition for MAP inference The tractability of this integer program is made hard by the coupling of potentials sharing the same variables. As we will see, a framework known as Lagrangian relaxation will enable us to take advantage of our ability to efficiently solve tractable subproblems (c ) and combine these sub-solutions into a solution of the main problem. Naturally, the relaxation of the integer program leads to an approximation, and Lagrangian relaxation bounds can be shown to be at least as tight as the linear programming relaxations (Wolsey, 1998). 3.2. Lagrangian relaxation In order to facilitate derivation of the algorithm and elucidate the character of messages, we introduce an equivalent representation of the integer program defined above. For each value a Zi we introduce a binary integer variable xi,a , indicating whether zi assumes the value a. We will use xi to denote a binary integer vector (xi,1 , . . . xi,ni ), and define set Xi = {xi : Hence there exists aZi xi,a = 1}. a one-to-one mapping of configurations of x and z, bip(x) = z and |Xi | = |Zi |. We will use shorthand c (x) for c (bip(x)) and let Xc = bip-1 (Zc ). A binary integer program equivalent to (IP) is maximize x c over x yields the dual minimize c c maxxc c c (xc c ) + c , xc c c i,a = 0 i, a Zi . subject to (LR) -1 ci By virtue of being a dual, this problem is convex in . It is composed of a sum of disjoint maximization problems and is nonsmooth. We identify slave problems c c sc (c ) = max c (xc c ) + c , xc c . c xc X (SLV) Each slave problem is an integer program. The exact and MAP inference remain tractable in each slave as c (xc ) = c (xc ) + c , xc retains the size and structure of c (xc ). The master problem is minimize m() = -1 ci c c i,a c sc (c ) (MSTR) subject to = 0, i, a Zi . c (xc ) i = {1, . . . , N }. subject to xi Xi (BIP) We draw attention to the constraint that ensures that the multipliers associated with variable zi and its state a, sum to zero across all slave problems that have the variable zi in its scope. Equally importantly, the slave problem sc does not involve multipliers for variables that are not in the scope of c . Thus we can conclude that the size of the slave's state space has not grown by the process of Lagrangian relaxation, and the cost of exact inference in the slave remains the same regardless of the messages. Traditionally, Lagrangian relaxations are solved with a projected subgradient method, due to the nonsmoothness of the objective (Bertsekas, 1999; Held & Karp, 1970; Komodakis & Paragios, 2009). Such methods require specification of the step size schedule and are guaranteed to require at most O( 1 ) iterations to 2 achieve a cost of f t f + , where f is the optimum. We obtain the Lagrangian relaxation of (BIP) via the process of dual decomposition (Bertsekas, 1999). First, we introduce copies of the original variables specific to each potential, with the corresponding potential indicated by the variables' superscript. These copies, slave variables, are constrained to agree with the original, master variables. The resulting equivalent integer program is maximize x subject to c c c (xc ) c xi Xi xc = xi,a i,a 4. Accelerated optimization c, i c c, i c , a Zi In this section, we will construct a smooth approximation of the master problem's objective (MSTR). This construction will be accomplished by smoothing of the slave problems' objectives in Section 4.1. This section also will cover the quality of the approximation as well as exhibit the character of the gradients of the smoothed problem. In Section 4.2 we derive a general method for optimizing smooth objectives with constraints akin to the ones in the master problem. Section 4.3 specializes the algorithm from Section 4.2 to the smoothed master problem. The resulting method optimizes an approximate cost, but with a faster algorithm than the one used for the nonsmooth case. The approximation introduced by smoothing of the objective ends up being more than compensated for by the use of an optimal algorithm. It is important to note that the equality constraints between slave and master variables separate across the domain of the variables. We note that the domain of the x variables is heavily constrained, in that for any i only one of the indicators xi,a can be one. We introduce a partial Lagrangian J(, x, x1 , . . . xC ) = c c The multiplier vector c is of length equal to the total number of indicator variables xc with i c , that is i,a ic |Zi |. Maximization of the partial Lagrangian c c (xc c ) + c , xc c - xc . Accelerated dual decomposition for MAP inference 4.1. Smoothing the Lagrangian relaxation Recently, a broad framework for smoothing structured convex objectives has been introduced (Nesterov, 2005). In this approach, an adjoint problem, in essence a dual problem, is constructed for a single nonsmooth component. The adjoint problem is then modified by the addition of a strongly concave term, thus introducing smoothness in the original problem. We will apply this general methodology to the problem (SLV) to smooth each of the slave problems. A slave problem (SLV) can be rewritten as qc k+1 k 1- k k k+1 k+1 1-k k k f( k k+1 k ) /( k k k k k+1 k k L) ] k+1 Figure 1. Schematic view of Algorithm 1 max q(xc ) c (xc c ) + c , xc c xc , For proof see Appendix A. With smooth slaves and their gradients in hand we can state the full smoothed master problem as minimize where c is the simplex over all configurations of xc . We denote the dimension of c as |c |. Consequently, q is a distribution over all configurations xc and |c | = |Xc |. This reformulation has not changed the optimum of the problem. The optimal distributions q still put the probability mass only on configurations that maximize the original slave objective. We now change the problem by adding a strongly concave contribution dependent on q dc (q) = ln(|c |) + xc mµ () = c c c c sµ ( ) subject to c = 0. (SMSTR) Proposition 4.2 For any feasible mµ () m() mµ () + µ For proof see Appendix A. c ln |c |. q(xc ) ln q(xc ) This term is bound by 0 and ln(|c |). It is strongly concave with concavity parameter 1 (Nesterov, 2005). This term can be seen as a proximal regularization c that pulls the distributions q towards q0 1, a uniform distribution across all configurations xc . Addition of this proximal regularization term, weighted by µ, yields a smoothed slave problem sc ( c ) = max µ qc In addition to stating that we have obtained a uniform smooth approximation to the original nonsmooth objective, we point out that the choice of temperature µ and the sizes of slave-MRF configuration spaces dictate how close the smooth objective is to the original objective. For the sake of completeness, we now state the partial derivatives of the master problem that will be used in the optimization algorithm: sc ( c ) mµ ( c ) µ = = pc (xi,a = 1). µ c c i,a i,a 4.2. An optimal method derivation The chief benefit of working with a smooth objective is that its optimization can be performed with a variety of projected gradient based methods. Remarkably, there exist first-order methods that have guarantees of a better convergence rate than the naive projected gradients. Algorithms achieving the provably optimal 1 error rate of O( k2 ) in k iterations on problems with a known Lipschitz constant are called optimal methods (Nesterov, 1983). Here we construct an optimal method for problems akin to the smoothed master (SMSTR) problem constructed in Section 4.1. Given a differentiable and convex f , the problem of interest is: minimize f () c subject to (P) c = 0. q(xc ) c (xc c ) + c , xc c xc - µ ln(|c |) + xc q(xc ) ln q(xc ) , which after maximization over q yields c (xc ) + c , xc c 1 sc ( c ) = µ ln exp µ |c | x µ c . The smoothing of the slave integer problem produced an objective that is closely related to the log partition function of an MRF. Proposition 4.1 The gradient of a smooth slave problem sc ( c ) is given by µ sc ( c ) µ = pc (xc = 1) µ i,a c i,a where pc (xc ) exp µ 1 µ c (xc ) + c , xc c . Accelerated dual decomposition for MAP inference Algorithm 1 An optimal method for smooth function optimization 1: initialize k = 0, k = 0, k = 0, 0 = 1 2: for k 0 do 3: k = (1 - k )k + k k 4: k+1 = argminPc c =0 {lf (; k ) + k LD(, k )} 5: k+1 = (1 - k )k + k k+1 6: k+1 = 7: end for 2 4 2 k +4k -k 2 Algorithm 2 An optimal method for smoothed dual decomposition 1: initialize k = 0, k = 0, k = 0, 0 = 1 2: for k 0 do 3: k = (1 - k )k + k k 4: foreach c compute pc (xi,a ) for i c , a Zi ; µ 5: k+1 = project(p, k , k ) 6: k+1 = (1 - k )k + k k+1 k k k 7: k+1 = 2 8: end for 9: 10: function project(p, , ) 11: for i = 1..N, a Zi do 1 c c 12: i,a = -1 c -1 k Li,a - pµ (xi,a ) 4 +4 2 - 2 We note that there exist optimal algorithms that retain all gradients from previous iterations (Nesterov, 2005). In large-scale MAP inference problems this is impractical. Hence we opt for a method that has limited memory requirements and achieves -solution in O L iterations. One such algorithm is Algo- rithm 1 (schematically shown in Figure 1). In order to fully specify the algorithm, we define lf and proximal function D: lf (; ) D(, ) := f () + f (), - 2 := (1/2) - 2 . -1 13: for c i do 1 c c 14: i,a = i,a - k L (pc (xi,a ) + i,a ) µ 15: end for 16: end for 17: return |i | i Proposition 4.3 For a differentiable and convex f (), where RN , Algorithm 1 produces an L solution of the problem (P) in O iterations, where L is the Lipschitz constant of function f . of Algorithm 1 to the problem of solving the smooth master problem is given in Algorithm 2. In Section 4.1 we computed the gradients of the smooth master problem. It turned out that in order to compute gradients we need to perform marginalization of the slave-MRF at temperature µ given by 1 (c (xc ) + c , xc ) µ and that the gradient of the master problem is given by sc ( c ) mµ ( c ) µ = = pc (xi,a = 1). µ c c i,a i,a By design, the slave problems are tractable and these marginal distributions can be computed exactly in an amount of time that is independent of µ and remains constant across iterations. Importantly, the computation of the marginals of each slave (line 3. in Algorithm 2) can be carried out in parallel. Given desired precision and a chosen decomposition into slave problems, the temperature and Lipschitz constant (see Appendix A) are given by µ = 2 P |c | and pc (xc ) exp µ L= 2 P c c This proposition follows from Corollary 1 in (Tseng, 2008). We note that for the update in line 4 of Algorithm 1 we need to compute argminPc c =0 argminPc c =0 {lf (; k ) + k LD(, k )} = {f (k ) + f (k ), - k + 2 k (L/2) - k 2 } Writing the Lagrangian of the minimization problem: f (k )+ f (k ), - k +k (L/2) - k 2 T 2 + c c, and relevant KKT conditions: f (k ) + k L( - k ) + = 0 c = 0, c 1 we obtain = C c (k Lk - f (k )) and = k - f (k )+ . k L We note that the memory requirement of this algorithm is equal to 3×size(). 4.3. Accelerated dual decomposition The smooth function mµ can be optimized by the method introduced in Section 4.2. A specialization ln |c | . Note that any hard constraints that limit the space of possible configurations of a slave will result in a smaller simplex c and consequently a smaller Lipschitz constant. 5. Convergence guarantees The algorithm constructed in the previous section is applied to the smooth problem (SMSTR). However, the ultimate assessment of this method's performance Accelerated dual decomposition for MAP inference 650 600 550 500 450 potential variance 4 Nonsmooth Smooth,non-optimal Smooth,optimal Nonsmooth Dual 1300 1200 1100 1000 900 potential variance 16 0.25 Normalized Objective Dual Decomposition Normalized Objective 0.1 0.08 Nonsmooth Dual 0.2 GEMPLP 0.15 0.1 0.05 0 0 0.06 0.04 0.02 0 0 0 50 Time 100 150 0 50 Time 100 150 1900 1800 Nonsmooth Dual potential variance 36 2600 2400 2200 2000 1800 potential variance 64 0.05 0.1 0.15 0.2 0.25 0.02 0.04 0.06 0.08 0.1 Smoothing Method Smoothing Method Nonsmooth Dual 1700 1600 1500 1400 0 50 100 150 Time 0 50 Time 100 150 Figure 3. Comparison of our smooth-optimal method against (left) the dual decomposition method of Komodakis & Paragios (2009), and (right) LP relaxation algorithm of Globerson & Jaakkola (2007) on protein design dataset (Yanover et al., 2006). Figure 2. Synthetic Experimental Results. The MRF in all four examples is a grid of 30 × 30 random variables, each ranging over 7 states and involved in pairwise nearest neighbor potentials. These potentials are drawn from a zero mean Gaussian with variances ranging over {1, 4, 36, 64}. The three methods are: "Nonsmooth", projected subgradients applied to the nonsmooth objective; "Smooth, non-optimal", projected gradients applied to the smooth objective; "Smooth, optimal," the optimal method applied to the smooth objective. All solutions are evaluated in the non-smooth objective, and all solutions maintain dual feasibility in each iteration. tasks, we assume that the slaves are the faces of the grid (4-tuples of nodes and pairwise potentials on these nodes form a cycle). The potentials shared between multiple slaves are weighted by 0.5. We compare three different methods. The first method optimizes the nonsmooth objective m using projected subgradients. The second method is gradient descent on the smoothed objective (mµ ). The third method is the one introduced in this paper, an optimal method on the smoothed objective (mµ ). The methods are compared in terms of the original nonsmooth objective m, the Lagrangian relaxation. The temperature µ is set so that the smooth objective is within = 1 of the nonsmooth objective in all synthetic tests. The last two methods do not optimize this objective directly. However, by Proposition 4.2, the smooth objective is always within of the nonsmooth objective, thus limiting error due to optimization of the wrong objective. The projected gradients were used to solve a nonsmooth dual decomposition algorithm with a square 1 summable step size ( k for k th iteration). Results are shown in Figure 2. The plotted time is the real running time of the algorithm rather than iterations of the methods. A single iteration of algorithms that optimize the smooth objective are the same; however, the projected subgradient method does not require computation of marginals. The difference in an iteration's running time is a result of the difference of the computational cost of performing logsumexp and max on a vector of values, which favors the projected subgradients method on the nonsmooth objective, but only as a multiplicative constant. 6.2. Protein Design We conducted experiments on the protein design dataset made available by Yanover et al. (2006). The dataset consists of 97 problems with the goal of finding the most stable set of amino-acids and rotamer configurations that give rise to a given 3D configura- is how well it solves nonsmooth problem (MSTR). The following proposition answers this question. Proposition 5.1 For a given precision and a slave decomposition specified by the potentials c , setting 1 µ = 2 P ln |c | , L = µ and iterating Algorithm 2 for c O 2L iterations produces solution for which ^ m(^) m( ) + , where is optimum of the nonsmooth problem (MSTR). The proof of this proposition is given in Appendix A. The projected subgradients on the original nonsmooth objective achieve this error in O 1 iterations. 2 To summarize, we use a fast algorithm on an approximate cost to reach -solutions in the original cost faster than we could have done had we just optimized the original cost. 6. Experiments 6.1. Synthetic experiments The synthetic experiments were performed on an MRF with pairwise nearest neighbor interactions forming a grid, and each random variable ranging over 7 states. The pairwise potentials were sampled from a Gaussian distribution with variance 2 {1, 4, 36, 64}. In all Accelerated dual decomposition for MAP inference 1kw4 1.5 1.4 1.3 1.2 1.1 objective 1 0.9 0.8 0.7 0.6 0.5 in general, guarantee increase in the primal. dual (smooth-optimal) primal (smooth-optimal) dual (dual decomposition) primal (dual decomposition) dual (LP relaxation) primal (LP relaxation) 7. Discussion In this paper we constructed a smooth approximation to the objective obtained from Lagrangian relaxation of the MAP inference problem. This enabled us to trade off quality of approximation in return for the ability to apply efficient gradient based methods. In particular, we used the "optimal method" of Nesterov (1983) which yielded a significantly faster algorithm compared to naive projected gradients. In addition, our construction revealed a connection between solving a MAP problem and sum-product-like algorithms at a particular low temperature. Ours is the first constructive illustration of such connections. The smoothing of the objective also revealed the connection between the quality of the Lipschitz constant of this approximate objective and the form of the proximal term used in smoothing. Employing entropic proximal terms, as in our approach, is one of many possibilities. For example, the convex free energies studied by Meltzer et al. (2009) may offer alternative proximal terms that yield better, or even adaptive, Lipschitz constants. Our experiments indicate that the initial iterations of our algorithms can be significantly accelerated by using an adaptive Lipschitz schedule. For example, such a schedule may be achieved through annealing of the smoothing coefficient. Regardless of the speedups obtained in initial iterations, the strength of our method stems from its strong guarantees of fast global convergence towards the optimal solution. 2 4 6 8 10 time (min) 12 14 16 18 Figure 4. Typical optimization curves showing both the (original) dual objective and the decoded integer solution. Compared are our smooth-optimal method, the dual decomposition method of Komodakis & Paragios (2009) and the LP relaxation algorithm of Globerson & Jaakkola (2007). tion. The problems are defined in terms of an energy function with unary and pairwise terms. We compare our method to the dual decomposition method of Komodakis & Paragios (2009) and the LP-based coordinate-descent method of Globerson & Jaakkola (2007). For our method and dual decomposition, we absorb the unary potentials into the pairwise terms by distributing their contribution evenly across terms involving the same variable.1 We set = 1 for our method and run for a maximum of 5000 iterations on all methods. Results for all 97 problems are shown in Figure 3. The plots show the final value of the (nonsmoothed) dual objective for our algorithm against the competing approaches. In the plots, a point above the diagonal indicates better performance. Our approach significantly outperforms the other methods. We also show a plot of objective versus execution time for the three methods (see Figure 4). The plot shows both the non-smooth dual objective (solid lines) and the best integer primal solution at hand (dashed lines). The primal solution is determined by decoding the messages at each iteration (Globerson & Jaakkola, 2007). Notice that, initially, our objective decreases more slowly than the other approaches, but ultimately attains a better solution. Also observe that the best integer primal solution is only achieved after the dual objective is sufficiently close to its optimal. However, we note that monotonic decrease in the dual does not, That is, for unary potential i (xi ) we add to each pairwise term ij (xi , xj ) where ni is the number of pairwise terms involving variable xi . 1 1 (xi ) ni i References Bertsekas, D.P. Nonlinear Programming. Athena Scientific, 1999. Finley, T. and Joachims, T. Training structural SVMs when exact inference is intractable. In ICML, 2008. Globerson, A. and Jaakkola, T. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, 2007. Held, M. and Karp, R. M. The traveling-salesman problem and minimum spanning trees. Operations Research, 18:1138­1162, 1970. Johnson, J. K., Malioutov, D., and Willsky, A. S. Lagrangian relaxation for map estimation in graphical models. In 45th Annual Allerton Conference on Communication, Control and Computing, 2007. Accelerated dual decomposition for MAP inference Kolmogorov, V. Convergent tree-reweighted message passing for energy minimization. IEEE PAMI, 2006. Komodakis, N. and Paragios, N. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In CVPR, 2009. Komodakis, N., Paragios, N., and Tziritas, G. MRF optimization via dual decomposition: Messagepassing revisited. In ICCV, 2007. Meltzer, T., Globerson, A., and Weiss, Y. Convergent message passing algorithms - a unifying view. In UAI, 2009. Nesterov, Yu. A method of solving a convex programming problem with convergence rate o(1/k 2 ). Soviet. Math. Dokl., 27:372­376, 1983. Nesterov, Yu. Smooth minimization of non-smooth functions. Math. Program., 103(1):127­152, 2005. Ravikumar, P. D., Agarwal, A., and Wainwright, M. J. Message-passing for graph-structured linear programs: proximal projections, convergence and rounding schemes. In ICML, pp. 800­807, 2008. Taskar, B., Chatalbashev, V., Koller, D., and Guestrin, C. Learning structured prediction models: A large margin approach. In ICML, 2005. Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. SIAM Journal on Optimization Optim., 2008. Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453­1484, 2005. Wainwright, M. J., Jaakkola, T., and Willsky, A. S. MAP estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Info. Theory, 2005. Wolsey, L. A. Integer Programming. Interscience, 1998. Wiley- c The partial derivatives with respect to i,a of this smooth objective are sc ( c ) µ c i,a P o n 1 exp µ (c (xc )+ c ,xc c ) xc i,a o . n P 1 c c xc exp µ (c (xc )+ ,xc ) xc = We note that these quantities are exactly marginals of a MRF with potentials c (xc c ) + c , xc c at temperature µ, n o explicitly pc (xc ) exp µ we observe that sc ( c ) µ c i,a 1 µ (c (xc ) + c , xc c ) . Hence = pc (xc = 1). µ i,a Proof of Proposition 4.2. P xc Since dc (q) = ln(|c |) - q(xc ) ln q(xc ) is bounded by 0 dc (q) ln |c |, X q(xc ) (c (xc c ) + c , xc c ) - µdc (q) xc then sc ( c ) = max µ qc sc (c ) - max dc (q) = sc (c ) - µ ln |c | qc X c c sµ ( ) = max q(xc ) (c (xc c ) + c , xc c ) - µdc (q) qc xc max X xc qc q(xc ) (c (xc c ) + c , xc c ) = sc ( c ) It follows that sµ ( c ) s( c ) Pµ ( c ) + µ ln |c | and s hence mµ () m() mµ () + µ c ln |c |. Proof of Proposition 5.1. Assuming that is the optimum of the nonsmooth problem (MSTR), s is the optimum of the smooth ,, problem (SMSTR) and is the so^ q « 2L iterations. The following lution obtained after O inequality chain gives the desired bound on error m(^) mµ (^) + 2 mµ ( s ) + mµ ( ) + m( ) + (Prop. 4.2) (by algorithm rate) (by optimality of s ) (Prop. 4.2) Computing Lipschitz. In order to compute the Lipschitz constant for the master problem we rewrite the master problem as E XD c q , c + AT c - µdc (q) mµ () = max · · · max c q1 qC c = D E max · · · max q, + q, AT - 1, µd(q) , q1 qC Yanover, C., Meltzer, T., and Weiss, Y. Linear programming relaxations and belief propagation--an empirical study. JMLR, 2006. where q = [q 1 ; q 2 ; · · · q C ] and A = [A1 ; A2 ; · · · AC ] and d(q) = [d1 (q 1 ); d2 (q 2 ); · · · dC (q C )]. Further, each Ac is a biP Q nary matrix with ic |Xi | rows and ic |Xi | columns and (Ac )(i,a),x = xi,a . The Lipschitz constant is then 1 L = µ A 1,2 (Nesterov, 2005) with ¸ A 1,2 = maxq, { T A, q : 1 = 1, q 2 = 1} , T ¸, ¯ ¯ , = max maxq , A, q ¯2 : q 2 = 1 : 1 = 1 ¯ = max maxx | T (A·,x )| : 1 = 1 = maxi,j (|A|i,j ) = 1 Assuming both · 1 A. Proofs of Propositions Proof of Proposition 4.1. sc ( c ) = max µ P qc The slave problem is q(xc ) (c (xc c ) + c , xc c ) - x " c " P µ ln(|c |) + x q(xc ) ln q(xc ) . c and · 2 are L1 norms A 1,2 = 1.