Efficiently Solving Convex Relaxations for MAP Estimation M. Pawan Kumar Department of Engineering Science, University of Oxford P.H.S. Torr Department of Computing, Oxford Brookes University pawan@robots.ox.ac.uk philiptorr@brookes.ac.uk Abstract The problem of obtaining the maximum a posteriori (map) estimate of a discrete random field is of fundamental importance in many areas of Computer Science. In this work, we build on the tree reweighted message passing (trw) framework of (Kolmogorov, 2006; Wainwright et al., 2005). trw iteratively optimizes the Lagrangian dual of a linear programming relaxation for map estimation. We show how the dual formulation of trw can be extended to include cycle inequalities (Barahona & Mahjoub, 1986) and some recently proposed second order cone (soc) constraints (Kumar et al., 2007). We propose efficient iterative algorithms for solving the resulting duals. Similar to the method described in (Kolmogorov, 2006), these algorithms are guaranteed to converge. We test our approach on a large set of synthetic data, as well as real data. Our experiments show that the additional constraints (i.e. cycle inequalities and soc constraints) provide better results in cases where the trw framework fails (namely map estimation for non-submodular energy functions). It is therefore not surprising that a number of approximate map estimation approaches exist in the literature. One such class of approaches which provides a good approximation, both in theory and in practice, is based on convex relaxations (e.g. see (Kumar et al., 2007) for an overview). In this work, we focus on the issue of solving these relaxations efficiently with the goal of handling a large number of random variables, e.g. variables corresponding to pixels in an image. A discrete random field is defined over random variables v = {v0 , · · · , vn-1 }, each of which can take a label from the set l = {l0 , · · · , lh-1 }. Throughout this paper, we will assume a conditional random field (crf) while noting that all our results are applicable to the Markov random field framework. A crf describes a neighbourhood relationship E between the variables such that (a, b) E if, and only if, va and vb are neighbours. A labelling of the crf is specified by a function f : {0, · · · , n - 1} - {0, · · · , h - 1} (i.e. variable va takes label lf (a) ). Given data D, the energy of the labelling is given by Q(f ; D, ) = v 1 a;f (a) + a v ( 2 ab;f (a)f (b) , a,b)E (1) 1. Introduction The problem of obtaining the maximum a posteriori (map) estimate of a discrete random field plays a central role in various applications, e.g. stereo reconstruction (Szeliski et al., 2006) and protein side-chain prediction (Sontag & Jaakkola, 2007). Furthermore, it is closely related to many important combinatorial optimization problems such as maxcut (Goemans & Williamson, 1995) and 0-extension (Karzanov, 1998). App earing in Proceedings of the 25 th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s). 1 2 where a;f (a) and ab;f (a)f (b) are the data-dependent unary and pairwise potentials respectively, and denotes the parameter of the crf. The problem of map estimation is to obtain the labelling f with the minimum energy (or equivalently the maximum posterior probability), i.e. f = arg minf Q(f ; D, ). Related Work: We build upon the linear programming (lp) relaxation of (Wainwright et al., 2005), which we call lp-s (since it was first proposed by (Schlesinger, 1976) for the special case of hard constraint pairwise potentials). Although the lp-s relaxation can be solved in polynomial time using Interior Point algorithms, the state of the art softwares can only handle up to a few hundred variables due to their large memory requirements. To overcome this prob- Efficiently Solving Convex Relaxations for MAP Estimation lem, two iterative algorithms were proposed by (Wainwright et al., 2005) for solving the dual of the lp-s relaxation. Similar to min-sum belief propagation (bp), these algorithms are not guaranteed to converge. The work of (Kolmogorov, 2006) addressed this problem by proposing a convergent sequential tree-reweighted message passing (trw-s) algorithm for solving the dual. Despite its strong theoretical foundation, it was observed that trw-s yields labellings with very high energies when the energy function contains nonsubmodular terms (Kolmogorov, 2006). This is not surprising since the lp-s relaxation provides an inaccurate approximation in such cases (e.g. see (Kumar et al., 2007)). In this work, we address this deficiency of trw-s by appending the lp-s relaxation with some useful constraints. Our Results: We show how the dual formulation of the lp-s relaxation can be extended to include linear cycle inequalities (Barahona & Mahjoub, 1986) (section 3). Furthermore, we incorporate the recently proposed second order cone (soc) constraints of (Kumar et al., 2007) within this framework (section 4). Note that although the importance of cycle inequalities and soc constraints is well-recognized, their use has been limited to a small number of random variables due to the lack of efficient algorithms (Sontag & Jaakkola, 2007). Our results on including these constraints within the trw formulation allow us to develop efficient convergent algorithms for solving the resulting duals. We successfully apply these algorithms to several synthetic and real problems containing a large number of variables which could not be handled by previous approaches (section 5). Our experiments indicate that incorporating these constraints provides a much better approximation for the map estimation problem within reasonable computational times compared to several state of the art algorithms. Additional experimental results and proofs are provided in (Kumar & Torr, 2008). where the term D is dropped from the lhs to make the notation less cluttered. Reparameterization: A parameter is called a reparameterization of the parameter (denoted by ) if, and only if, Q(f ; D, ) = Q(f ; D, ), f . (5) Over-complete Representations: A labelling f can be represented using an over-complete set of boolean variables y defined as 1 if f (a) = i, ya;i = , yab;ij = ya;i yb;j . (6) 0 otherwise. We also define variables (x, X) such that xa;i = 2ya;i - 1, Xab;ij = 4yab;ij - 2ya;i - 2yb;j + 1. (7) We will sometimes specify the additional constraints (i.e. cycle inequalities and soc constraints) using variables (x, X), since they will allow us to write these constraints concisely. The lp-s Relaxation: The lp-s relaxation of the map estimation problem is given by y = arg minyLOC AL(v,E ) y , ya;i [0l, 1], yab;ij [0, 1], (8) LOC AL(v, E ) = l i l ya;i = 1, yab;ij = ya;i . j l The term LOC AL(v, E ) stands for local consistency polytope (Wainwright et al., 2005) and denotes the feasibility region of the lp-s relaxation (specified by the above constraints for all va v, (a, b) E , li , lj l). Dual of the lp-s Relaxation: Let T denote a set of tree-structured crfs defined over subsets of the given random variables. For a crf T T , we denote its random variables by vT , its neighbourhood relationship by ET and its parameter as T . The parameter T conT sists of unary potentials a;1 and pairwise potentials i T2 ab;ij . Let = {(T ), T T } be a set of non-negative real numbers which sum to one. Using the above notation, the dual of the lp-s relaxation can be written as follows (Kolmogorov, 2006; Wainwright et al., 2005): T (T )q (T ). (9) T max T (T ) T The trw-s Algorithm: Table 1 describes the trw-s algorithm (Kolmogorov, 2006) which attempts to solve the dual of the lp-s relaxation. In other words, it solves for the set of parameters T , T T , which maximize the dual (9). There are two main steps: (i) 2. Preliminaries We begin by introducing some notation which would allow us to describe our results concisely. Optimal Energy and Min-Marginals: The energy of the optimal labelling and the min-marginals of random variables and neighbouring random variables is given by the following equations respectively: q ( ) qa;i ( ) = min Q(f : D), f (2) (3) (4) = min Q(f ; D, ), f ,f (a)=i f ,f (a)=i,f (b)=j qab;ij ( ) = min Q(f ; D, ), Efficiently Solving Convex Relaxations for MAP Estimation reparameterization, which involves running one pass of bp on the tree structured crfs T; and (ii) averaging operation. trw-s is guaranteed not to decrease the value of the dual (9) at each iteration. Further, it can be shown that it converges to a solution which satisfies the weak tree agreement (wta) (Kolmogorov, 2006). Initialization E 1. For every v , find all trees T T which contains . T (T ) T . 2. Initialize T such that 1 Typically, we set (T ) = |T | for all T T . | T 1 Then we can initialize a;1 = a;i |TT | | for all T Tva . i v T2 2 Similarly, ab;ij = ab;ij |T|T | | for all T T(a,b) . (a,b) Iterative Steps E 3. Pick an element v . T 4. For all T T , reparameterize T to such that T1 (i) a;i = qa;i ( T ), if = va v, T1 T1 T2 (ii) a;i + b;j + ab;ij = qab;ij ( T ), if = (a, b) E . This step involves running one iteration of bp for T . 5. Averaging op eration: (i) If = va v, T T1 (T ) a ; i . (a) Compute a;i = 1 Ta a a ( Xak am ;ik im - ak ,am )EF ( Xak am ;ik im 2 - c, ak ,am )EC -EF (10) where lik , lim l . The variables Xak am ;ik im are defined in equation (7). It can be shown that adding cycle inequalities to lp-s, i.e. problem (8), provides a better relaxation than lp-s alone. Their importance is reflected in their wide use in recent literature such as (Sontag & Jaakkola, 2007; Zwick, 1999). In general, a set of NC cycle inequalities defined on a cycle C = (vC , EC ) (using different labels lik for variables vak vC ) can be written as AC y bC . In other words, for every cycle we can define up to hc cycle inequalities (where h = |l|), i.e. NC {0, 1, · · · , hc }. Let C be a set of cycles in the given crf. Theorem 1 (given below) provides us with the dual of the lp relaxation obtained by appending problem (8) with cycle inequalities (defined over cycles in the set C ). We refer to the resulting relaxation as lp-c (where c denotes cycles). Theorem 1: The following problem is the dual of problem (8) appended with a set of cycle inequalities AC y bC , for all C C (hereby referred to as the lp-c relaxation): C T (C )(bC ) uC , (T )q (T ) + max C T T (C )(AC ) uC , (T ) + s.t. uC 0, k {1, 2, · · · , NC }, C C . k (11) Here = { (C ), C C } is some (fixed) set of nonnegative real numbers which sum to one, and uC = {uC , k = 1, · · · , NC } are some non-negative slack varik ables. Similar to the dual (9), the above problem cannot be solved using standard software for a large number of variables v. In order to overcome this deficiency we propose a convergent algorithm (similar to trw-s) to approximately solve problem (11). We call our approach the trw-s(lp-c) algorithm. In order to describe trw-s(lp-c), we need the following definitions. We say that a tree structured random field T = (vT , ET ) T belongs to a cycle C = (vC , EC ) C (denoted by T C ) if, and only if, there exists an edge (a, b) ET such that (a, b) EC . In other words, T C if they share a common pair of neighbouring random variables (a, b) E . We also define the following problem: T T C C max C (T )q ( ) + (C )(b ) u , 1 (b) Set a;i + b;j + ab;ij = ab;ij , for all T T(a,b) . 6. Rep eat steps 3, 4 and 5 till convergence. T T2 Table 1. The trw-s algorithm. Recall that a;1 and ab;ij i are the unary and pairwise p otentials for the parameter T1 T2 T . Similarly, a;i and ab;ij are the unary and pairwise p otentials defined by the parameter . The terms a = T T (T ) and ab = (T ) are the vari,va vT ,(a,b)ET able and edge app earance terms for va v and (a, b) E resp ectively. In step 3, the value of the dual (9) remains unchanged. Step 4, i.e. the averaging op eration, ensures that the value of the dual does not decrease. trw-s converges to a solution which satisfies the wta condition. (b) Set a;i = a;i , for all T Tva . (ii) If = (a, b) T , (a) Compute ab;ij = T T1 T1 T2 1 (T )(a;i + b;j + ab;ij ). T ab T1 T1 (a,b) T1 T2 3. Adding Linear Constraints We now show how the results of (Kolmogorov, 2006; Wainwright et al., 2005) can be extended to include an arbitrary number of linear cycle inequalities (Barahona & Mahjoub, 1986; Kumar et al., 2007). This requires us to incorporate cycle inequalities into the dual (11). We begin by briefly describing cycle inequalities. Consider a cycle of length c in the graphical model of the given crf, which is specified over a set of random variables vC = {vb , b = a1 , a2 , · · · , ac } such that EC = {(a1 , a2 ), (a2 , a3 ), · · · , (an , a1 )} E . Further, let EF EC such that |EF | (i.e. the cardinality of EF ) is odd. Using these sets of edges, a cycle inequality can be specified as 1 Note that using the variable y would result in a less compact representation of cycle inequalities. Efficiently Solving Convex Relaxations for MAP Estimation s.t. T C (T )T + (C )(AC ) uC = C , uC 0, k {1, 2, · · · , NC }, (12) k for some parameter C . The variables of the above C T T problem are rE stricted to uk , a;1 and ab2ij where e i ; (a, b) ET C for some T C . In other words, problem (12) has fewer variables and constraints than dual (11) and can be solved easily using standard Interior Point algorithms for small cycles C . As will be seen, even using cycles of size 3 or 4 results in much better approximations of the map estimation problem for non-submodular energy functions. Table 2 describes the convergent trw-s(lp-c) algorithm for approximately solving the dual (11). The algorithm consists of two main steps : (i) solving problem (12) for a cycle; and (ii) running steps 4 and 5 of the trw-s algorithm. Note that our approach is different from other generalizations of trw, e.g. (Wiegernick, 2005) which computes marginals. Specifically, we do not cluster random variables but include additional constraints to reduce the feasibility region of the relaxation. Our experiments in section 5 show that, unlike (Wiegernick, 2005), we always outperform bp. The properties of the trw-s(lp-c) algorithm are summarized below. Initialization 1. Choose a set of tree structured random fields T . Choose a set of cycles C . For example, if the 4-neighb ourhood is employed, C can b e the set of all cycles of size 4. T (T ) T . 2. Initialize T such that C Initialize uk = 0 for all C and k. Iterative Steps C 3. Pick an element v . Find all cycles C C which contains . 4. For a cycle C C , compute T (T )T + (C )(AC ) uC C = C using the values of T and uC obtained in the previous iteration. Solve problem (12) using an Interior Point method. Up date the values of T and uC . 5. For all trees T T which contain , run steps 4 and 5 of the trw-s algorithm. 6. Rep eat steps 3 and 4 for all cycles C C . 7. Rep eat steps 3 to 5 for all elements till convergence. Table 2. The trw-s(lp-c) algorithm. vector C of cycle C remains unchanged. Hence, after step 4 of the trw-s(lp-c) algorithm, the reparameterization constraint is satisfied. It was also shown that step 5 (i.e. running trw-s) provides a reparameterization of (see Lemma 3.3 of (Kolmogorov, 2006) for details). This proves Property 1. Prop erty 2: At each step of the algorithm, the value of the dual (11) never decreases. Clearly, step 4 of the trw-s(lp-c) algorithm does not decrease the value of the dual (11) (since the ob jective function of problem (12) is part of the ob jective function of dual (11)). The work of (Kolmogorov, 2006) showed that step 5 (i.e. trw-s) also does not decrease this value. Note that the lp-c relaxation is guaranteed to be bounded since it dominates the lp-s relaxation (Kumar et al., 2007), which itself is bounded (Kolmogorov, 2006). Therefore, by the Bolzano-Weierstrass theorem (Fitzpatrick, 2006), it follows that trw-s(lp-c) will converge. Prop erty 3: Like trw-s, the necessary condition for convergence of trw-s(lp-c) is that the parameter vectors T of the trees T T satisfy wta. This follows from the fact that trw-s increases the value of the dual in a finite number of steps as long as the set of parameters T , T T , do not satisfy wta (see (Kolmogorov, 2006) for details). Prop erty 4: Unlike trw-s, wta is not the sufficient condition for convergence. One of the main drawbacks of the trw-s algorithm is that it converges as soon as the wta condition is satisfied. Experiments in (Kolmogorov, 2006) indicate that this results in high energy solutions for the map estimation problem when the energy function is non-submodular. Using a counterexample, it can be shown that wta is not the sufficient condition for the convergence of trw-s(lp-c) (Kumar & Torr, 2008). Obtaining the Lab elling: Similar to the trw-s algorithm, trw-s(lp-c) solves the dual (11) and not the primal problem. In other words, it does not directly provide a labelling of the random variables. In order to obtain a labelling, we use the same scheme as the one suggested in (Kolmogorov, 2006) for the trw-s algorithm. Briefly, we assign labels to the variables v = {v0 , v1 , · · · , vn-1 } in increasing order (i.e. we label variable T 0 , followed by variable v1 and so on). Let v (T )T . At each stage, a variable va is asT = signed the label lf (a) such that b T T ab2i,f (b)) , (14) f (a) = arg min a;1 + i ; i,li l