Learning Efficiently with Approximate Inference via Dual Losses

Ofer Meshi (meshi@cs.huji.ac.il), David Sontag (dsontag@csail.mit.edu), Tommi Jaakkola (tommi@csail.mit.edu), Amir Globerson (gamir@cs.huji.ac.il)

Abstract

Many structured prediction tasks involve complex models where inference is computationally intractable, but where it can be well approximated using a linear programming relaxation. Previous approaches for learning for structured prediction (e.g., cutting-plane, subgradient methods, perceptron) repeatedly make predictions for some of the data points. These approaches are computationally demanding because each prediction involves solving a linear program to optimality. We present a scalable algorithm for learning for structured prediction. The main idea is to instead solve the dual of the structured prediction loss. We formulate the learning task as a convex minimization over both the weights and the dual variables corresponding to each data point. As a result, we can begin to optimize the weights even before completely solving any of the individual prediction problems. We show how the dual variables can be efficiently optimized using coordinate descent. Our algorithm is competitive with state-of-the-art methods such as stochastic subgradient and cutting-plane.

1. Introduction

In many prediction problems we are interested in predicting multiple labels y_1, ..., y_d from an input x, rather than a single label as in multiclass prediction. This setting is referred to as structured prediction, and has found many applications in various domains, from natural language processing to computational biology (Bakir et al., 2007). A naive approach to the problem is to predict each label y_i individually, ignoring possible correlations between the labels. A better approach would be to explicitly model the interactions between the labels, which then results in the labels being jointly predicted. Structured prediction models do this by using classifiers of the form ŷ = arg max_y w · f(x, y), where f(x, y) is a given function and w are weights to be learned from data.

Much of the early work on structured prediction (Lafferty et al., 2001; Taskar et al., 2004) focused on the case where prediction (i.e., maximization over y) could be done using efficient combinatorial algorithms such as dynamic programming or maximum-weight matching. However, this restricted the types of interactions that these models were capable of capturing to tractable structures such as tree graphs.

Recent work on graphical models has shown that even when the maximization over y is not known a priori to be tractable, linear programming (LP) relaxations often succeed at finding the true maximum, even giving certificates of optimality (Sontag et al., 2008). This strongly motivates learning structured prediction models which use LP relaxations for prediction, and indeed several recent works show that this yields empirically effective results (Finley and Joachims, 2008; Martins et al., 2009).

Learning with large scale data necessitates efficient algorithms for finding the optimal weight vector w. Although several such algorithms have been proposed for structured prediction, these have primarily focused on settings where the maximization over y is performed using combinatorial optimization. Some examples are structured perceptron (Collins, 2002), stochastic subgradient (Ratliff et al., 2007), extra-gradient (Taskar et al., 2006), and cutting-plane algorithms (Joachims et al., 2009). All of these approaches require making a prediction at every iteration. When LP relaxations are used, this corresponds to repeatedly solving an LP to optimality, significantly reducing the scalability of the overall learning algorithm.
These earlier approaches have two potential sources of inefficiency. First, it is likely not necessary to solve the LPs to optimality to obtain an approximately correct update for the weights, particularly in the early iterations of the algorithms. Second, these approaches typically re-solve the LPs from scratch at each iteration. However, particularly in later iterations when there are only small changes to the weight vector, we would like to be able to "warm start" using the previous iteration's solution to the LP for a data point.

In this paper we introduce a novel method for learning structured prediction models using LP relaxations. Whereas previous learning approaches involved repeatedly solving the computationally intensive LP problem per data point, our new formulation replaces the standard LP with its dual. This turns the entire problem into a minimization over the weights w and auxiliary dual variables δ. The latter can be updated via a simple closed-form message passing scheme that decreases the overall objective at each iteration. We combine these with stochastic subgradient updates on w, and thus our scheme has an online flavor similar to Shalev-Shwartz et al. (2007). We show empirically that avoiding the LP solution indeed results in much improved convergence time when compared to previous methods. This effect becomes more pronounced the larger the label space is, and the method is thus expected to enhance performance on many large scale structured prediction problems.

2. Problem Formulation

We begin by reviewing the maximum margin Markov network formulation (M3N) (Taskar et al., 2004) and its LP relaxation. We consider a labelled dataset {x^{(m)}, y^{(m)}}_{m=1}^n containing n samples. We seek a function h(x; w) that will predict y from x. It is assumed to be of the form

    h(x; w) = \arg\max_y \; w \cdot f(x, y)    (1)

where f(x, y), the feature vector, is given by a fixed, known, vector-valued function of both x and y. In what follows we assume that y is multivariate and has d variables denoted by y_1, ..., y_d. Furthermore, f is assumed to decompose into pairwise and singleton factors on y, so that

    w \cdot f(x, y) = \sum_{ij \in E} f_{ij}(y_i, y_j, x) \cdot w_{ij} + \sum_i f_i(y_i, x) \cdot w_i    (2)

where E is a set of edges in a graph G, and the vectors w_{ij}, w_i are the elements of the weight vector corresponding to each one of the factors (some weights may be shared across edges). We use factors of size one and two for notational convenience only; our approach generalizes to larger factors.

In M3N the weight vector is found by minimizing the following regularized hinge loss:

    \min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_m \ell_h(x^{(m)}, y^{(m)}; w)    (3)

where

    \ell_h(x^{(m)}, y^{(m)}; w) = \max_y \left[ w \cdot \Delta f^{(m)}(y) + e^{(m)}(y) \right]    (4)

Here e^{(m)}(y) is the discrepancy between the true labelling y^{(m)} and y, and is assumed to decompose as e^{(m)}(y) = \sum_i e_i^{(m)}(y_i). (The loss could also be defined along the edges, but we omit this for notational convenience.) We also define \Delta f^{(m)}(y) = f(x^{(m)}, y) - f(x^{(m)}, y^{(m)}).

The problems in Eq. 1 and Eq. 4 involve finding an assignment y that maximizes a function of the form \sum_{ij} \theta_{ij}(y_i, y_j) + \sum_i \theta_i(y_i). This problem, commonly referred to as the MAP problem in the graphical models literature, is intractable (NP hard) for general graphs G, and tractable only in isolated cases such as tree structured graphs. However, linear programming (LP) relaxations are often effective as approximations, and have been incorporated into M3N by several authors, which we review further in Section 4. In MAP-LP relaxations one replaces the maximization over y of \sum_{ij} \theta_{ij}(y_i, y_j) + \sum_i \theta_i(y_i) with the linear program \max_{\mu \in M_L(G)} \mu \cdot \theta, where we use the notation \mu \cdot \theta = \sum_{ij \in E}\sum_{y_i, y_j} \mu_{ij}(y_i, y_j)\,\theta_{ij}(y_i, y_j) + \sum_i \sum_{y_i} \mu_i(y_i)\,\theta_i(y_i), and where M_L(G) enforces consistency between the pairwise distributions \mu_{ij}(y_i, y_j) and the singleton distributions \mu_i(y_i):

    M_L(G) = \left\{ \mu \ge 0 \;:\; \sum_{y_j} \mu_{ij}(y_i, y_j) = \mu_i(y_i), \;\; \sum_{y_i} \mu_{ij}(y_i, y_j) = \mu_j(y_j), \;\; \sum_{y_i} \mu_i(y_i) = 1 \right\}    (5)
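To make the relaxation concrete, here is a small, self-contained sketch that builds the LP of Eq. 5 for a pairwise model with binary variables and solves max_{µ in M_L(G)} µ·θ with a generic LP solver. The graph, the potentials in the toy example, and the use of scipy.optimize.linprog are our own illustrative choices, not code or notation from the paper.

```python
# Illustrative sketch (not the authors' code): solve max_{mu in M_L(G)} mu . theta
# for a small pairwise model with binary variables, using a generic LP solver.
import numpy as np
from scipy.optimize import linprog

def relaxed_map(theta_i, theta_ij, edges, k=2):
    """theta_i: dict node -> length-k array; theta_ij: dict edge -> (k, k) array."""
    nodes = sorted(theta_i)
    node_off = {i: idx * k for idx, i in enumerate(nodes)}          # offsets of mu_i blocks
    n_node = k * len(nodes)
    edge_off = {e: n_node + idx * k * k for idx, e in enumerate(edges)}
    n_var = n_node + k * k * len(edges)

    c = np.zeros(n_var)                                             # linprog minimizes, so negate theta
    for i in nodes:
        c[node_off[i]:node_off[i] + k] = -theta_i[i]
    for e in edges:
        c[edge_off[e]:edge_off[e] + k * k] = -theta_ij[e].ravel()   # row-major: index y_i * k + y_j

    A_eq, b_eq = [], []
    for i in nodes:                                                 # sum_{y_i} mu_i(y_i) = 1
        row = np.zeros(n_var)
        row[node_off[i]:node_off[i] + k] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for (i, j) in edges:                                            # local consistency of Eq. 5
        for yj in range(k):                                         # sum_{y_i} mu_ij(y_i, y_j) = mu_j(y_j)
            row = np.zeros(n_var)
            row[edge_off[(i, j)] + yj: edge_off[(i, j)] + k * k: k] = 1.0
            row[node_off[j] + yj] = -1.0
            A_eq.append(row); b_eq.append(0.0)
        for yi in range(k):                                         # sum_{y_j} mu_ij(y_i, y_j) = mu_i(y_i)
            row = np.zeros(n_var)
            row[edge_off[(i, j)] + yi * k: edge_off[(i, j)] + (yi + 1) * k] = 1.0
            row[node_off[i] + yi] = -1.0
            A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n_var, method="highs")
    return -res.fun, res.x                                          # relaxed MAP value and marginals

# Toy example: two binary variables connected by a single edge.
value, mu = relaxed_map({0: np.array([0.0, 1.0]), 1: np.array([0.5, 0.0])},
                        {(0, 1): np.array([[0.2, -1.0], [-1.0, 0.4]])},
                        edges=[(0, 1)])
```

For tree-structured graphs the solution of this LP is integral and coincides with exact MAP; on graphs with cycles it may be fractional, which is why the relaxed loss defined next is only an upper bound on the true loss.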
To introduce the LP relaxation into M3N, we use the notation θ(f, e, w) to denote the vector of parameters whose pairwise elements are θ_{ij}(y_i, y_j) = f_{ij}(y_i, y_j) · w_{ij} and whose singleton elements are θ_i(y_i) = f_i(y_i) · w_i + e_i(y_i). We shall also use θ^{(m)}(w) = θ(Δf^{(m)}, e^{(m)}, w). Finally, we will denote the singleton and pairwise elements of θ^{(m)}(w) by θ^{(m)}(y_i; w) and θ^{(m)}(y_i, y_j; w) respectively. We then have the following approximation of the loss ℓ_h from Eq. 4:

    \hat\ell_h(x^{(m)}, y^{(m)}; w) = \max_{\mu \in M_L(G)} \mu \cdot \theta^{(m)}(w)    (6)

It is straightforward to show that the relaxed loss ℓ̂_h provides an upper bound on the true loss ℓ_h (Finley and Joachims, 2008). The final classifier is obtained by solving the maximization \arg\max_{\mu \in M_L(G)} \mu \cdot \theta(f, 0, w) and returning ŷ_i = \arg\max_{y_i} \mu_i(y_i). To summarize the above, we are interested in solving the optimization problem:

    \min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_m \max_{\mu \in M_L(G)} \mu \cdot \theta^{(m)}(w)    (7)

3. Optimization via Dual Losses

In this section we present our algorithm for solving the optimization problem in Eq. 7. We begin by using convex duality to replace the internal maximization in Eq. 7 with a minimization of a piecewise-linear objective. Numerous duals have been suggested for the MAP-LP relaxation problem (e.g., Globerson and Jaakkola, 2008; Komodakis et al., 2007; Werner, 2007). We use the formulation discussed in Werner (2007). The dual of \max_{\mu \in M_L(G)} \mu \cdot \theta is:

    \min_\delta \; \sum_i \max_{y_i} \left[ \theta_i(y_i) + \sum_{k \in N(i)} \delta_{ki}(y_i) \right] + \sum_{ij \in E} \max_{y_i, y_j} \left[ \theta_{ij}(y_i, y_j) - \delta_{ij}(y_j) - \delta_{ji}(y_i) \right]    (8)

Denote the objective of the above dual by g(δ; θ). Then Eq. 7 equals:

    \min_{w, \delta^{(1)}, \ldots, \delta^{(n)}} \; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_m g(\delta^{(m)}; \theta^{(m)}(w))    (9)

where we now minimize over the dual variables δ^{(m)} in addition to the weights w. Because the dual always upper bounds the primal, the function g(δ^{(m)}; θ^{(m)}(w)) is a convex upper bound on the relaxed loss ℓ̂_h(x^{(m)}, y^{(m)}; w) for every value of δ^{(m)}. This bound can be tightened by minimizing it over δ^{(m)}. By LP duality, the minimal value over δ^{(m)} gives us exactly ℓ̂_h(x^{(m)}, y^{(m)}; w).

The key advantage of Eq. 9 is that it removes the difficult inner maximization from Eq. 7. Moreover, Eq. 9 is jointly convex in its variables (namely w and the δ^{(m)}'s), and is furthermore unconstrained. Thus, we can employ a variety of minimization algorithms without the need for a "black-box" solver of the maximization problem. In the next three sections we describe our algorithm for optimizing Eq. 9. Our approach has two components: one is to decrease the objective via message passing updates on δ, corresponding to coordinate descent on δ; the other is stochastic subgradient updates on w, which process each example separately and are thus ideally suited for large data sets.
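For concreteness, the short sketch below evaluates the dual objective g(δ; θ) of Eq. 8 for a single sample. The data layout (dictionaries keyed by nodes and by directed edges) is an illustrative assumption of ours, not notation from the paper.

```python
# Illustrative sketch: evaluate the dual objective g(delta; theta) of Eq. 8 for one
# sample.  theta_i[i] is a length-k array, theta_ij[(i, j)] is a (k, k) array indexed
# [y_i, y_j], and delta[(i, j)] is the length-k message from edge ij into node j.
import numpy as np

def dual_objective(theta_i, theta_ij, delta, neighbors):
    g = 0.0
    for i, th in theta_i.items():
        # node term: max_{y_i} [ theta_i(y_i) + sum_{k in N(i)} delta_{ki}(y_i) ]
        g += np.max(th + sum(delta[(k, i)] for k in neighbors[i]))
    for (i, j), th in theta_ij.items():
        # edge term: max_{y_i, y_j} [ theta_ij(y_i, y_j) - delta_{ij}(y_j) - delta_{ji}(y_i) ]
        g += np.max(th - delta[(i, j)][None, :] - delta[(j, i)][:, None])
    return g
```

Any setting of delta yields an upper bound on the relaxed loss; the coordinate descent updates of Section 3.1 only ever decrease this value.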
3.1. Dual minimization via coordinate descent

Notice that, unlike the w variables, the δ^{(m)} variables depend only on the mth sample in the dual formulation Eq. 9. Block coordinate descent on g(δ^{(m)}; θ^{(m)}(w)) can be performed in closed form, as has been noted by several authors (e.g., Globerson and Jaakkola, 2008; Werner, 2007), and as we review next. Suppose that at iteration t we have a given value of the δ^{(m)} variables, denoted by δ^{(m,t)}. Now assume we fix all δ^{(m)} variables except δ_{ij}^{(m)}, δ_{ji}^{(m)} and seek the optimal value of δ_{ij}, δ_{ji}. The closed-form solution is given by:

    \delta_{ij}^{(m,t+1)}(y_j) = -\frac{1}{2}\left[ \theta^{(m)}(y_j; w) + \sum_{k \in N(j)\setminus i} \delta_{kj}^{(m,t)}(y_j) \right] + \frac{1}{2}\max_{y_i}\left[ \theta^{(m)}(y_i, y_j; w) - \delta_{ji}^{(m,t)}(y_i) \right]    (10)

and analogously for δ_{ji}^{(m,t+1)}(y_i). This update is commonly referred to as max-sum diffusion, or MSD (see Werner, 2007, and references within).

We use a more efficient block coordinate descent step in which we simultaneously update all of the dual variables δ_{ij}(y_j) going into variable y_j (see Globerson and Jaakkola, 2008, for a similar update). It is equivalent to iterating the MSD updates for the corresponding edges until convergence, and is given by (we drop the m superscript for brevity):

    \delta_{ij}(y_j) = \gamma_{ij}(y_j) - \frac{1}{1 + d_j}\,\theta^{(m)}(y_j; w) - \frac{1}{1 + d_j}\,\bar\gamma_j(y_j)    (11)

where d_j is the degree of node j in the graph and:

    \gamma_{ij}(y_j) = \max_{y_i}\left[ \theta^{(m)}(y_i, y_j; w) - \delta_{ji}(y_i) \right]    (12)

and \bar\gamma_j(y_j) = \sum_{k \in N(j)} \gamma_{kj}(y_j). The messages δ_{ij}(y_j) need to be updated simultaneously for all neighbors i of j. The derivation is very similar to that in Globerson and Jaakkola (2008) and is not repeated here. We note that even more efficient updates can be performed, for example by simultaneously updating all δ's that correspond to a tree (Sontag and Jaakkola, 2009).

Because the objective g is not strictly convex, the MSD updates may get trapped at a suboptimal point (Kolmogorov, 2006). This problem does not occur for binary variables, however, as shown in, e.g., Globerson and Jaakkola (2008). One way to avoid it is to replace the max function in Eq. 8 with a soft-max, namely:

    \max_{y_i} f(y_i) \;\le\; \frac{1}{K}\log\sum_{y_i} e^{K f(y_i)}    (13)

The upper bound becomes tight as K → ∞. Note that g with the soft-max is thus again a convex upper bound on the original loss. Thus we can use g with a sufficiently high K, and the technical issue of non-optimal fixed points is alleviated. It also turns out (Johnson et al., 2007) that the MSD updates with the soft-max are exactly as in Eq. 10 and Eq. 11, only with the soft-max replacing the max function. The convergence rate of such updates has recently been analyzed in the context of LDPC codes (Burshtein, 2009). In practice we have found that using the original max does not sacrifice optimality, so we do not use the soft-max.
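The block update of Eq. 11-12 can be written in a few lines. The sketch below reuses the data layout of the previous sketch and is again only an illustration under those assumptions, not the authors' implementation.

```python
# Illustrative sketch: simultaneous update of all messages delta_{ij}(y_j)
# entering node j, following Eq. 11-12 (the block coordinate descent step).
import numpy as np

def update_node(j, theta_i, theta_ij, delta, neighbors):
    gammas = {}
    for i in neighbors[j]:
        # gamma_{ij}(y_j) = max_{y_i} [ theta_{ij}(y_i, y_j) - delta_{ji}(y_i) ]   (Eq. 12)
        th = theta_ij[(i, j)] if (i, j) in theta_ij else theta_ij[(j, i)].T
        gammas[i] = np.max(th - delta[(j, i)][:, None], axis=0)
    gamma_bar = sum(gammas.values())                     # \bar{gamma}_j(y_j)
    d_j = len(neighbors[j])
    for i in neighbors[j]:
        # delta_{ij}(y_j) = gamma_{ij}(y_j) - (theta_j(y_j) + gamma_bar(y_j)) / (1 + d_j)   (Eq. 11)
        delta[(i, j)] = gammas[i] - (theta_i[j] + gamma_bar) / (1.0 + d_j)
```

One coordinate-descent sweep simply applies update_node to every node of the graph; with the soft-max of Eq. 13 one would replace np.max by (1/K) times a log-sum-exp of K times the argument.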
3.2. Subgradient optimization over w

The previous section showed how to update the δ^{(m)} variables such that the objective is decreased at every iteration. We now turn to the update steps on the weight vector w. One method that has proven very useful for losses as in Eq. 9 is stochastic subgradient descent (SSD) (Shalev-Shwartz et al., 2007; Ratliff et al., 2007). In SSD, the vector w is changed in the direction of the subgradient of g(δ^{(m)}; θ^{(m)}(w)) for each sample m. This strategy is especially effective for large datasets, since the exact subgradient involves a summation over all the sample points.

In the Pegasos method (Shalev-Shwartz et al., 2007; Shalev-Shwartz and Srebro, 2009), the stochastic subgradient step is followed by a projection onto a ball of radius C. This is justified by the fact that the optimal w is known to lie inside this ball, and it results in improved rates of convergence. We follow the same procedure here, since w in our case satisfies the same condition (assuming that the label loss is upper bounded by one, which it is in our case since we use the normalized Hamming loss).

The key difference between Pegasos and the method we propose is that we introduce additional variables δ^{(m)} and minimize with respect to those as well. In the original Pegasos algorithm, one views the objective as a function of w alone, and therefore has to calculate exact subgradients w.r.t. w, which requires solving an LP problem at every sample point. As we show in the experiments, this can have a major effect on runtime.

3.3. The DLPW algorithm and convergence

To summarize the above two sections, we propose to solve the structured prediction problem in Eq. 7 by casting it as a joint minimization over δ^{(m)} and w (Eq. 9) and performing coordinate descent updates on δ^{(m)} together with stochastic subgradient updates on w. The overall algorithm, which we call DLPW for Dual Loss Primal Weights, is described in Algorithm 1. When processing the mth sample point, the algorithm first updates its δ^{(m)} variables by improving the objective using the coordinate descent updates of Eq. 11. Each δ_{ij}^{(m)}(y_j) should be updated at least once, but in practice it is preferable to perform R passes over the graph, where R is a small number (we use R = 10 in our experiments).

Our scheme combines two minimization approaches: stochastic subgradient and coordinate descent. Each is known to converge to the global optimum if used alone (under appropriate conditions on the objective). Although it is not obvious that using them together retains the same guarantees, we show in Appendix A that the combined method does in fact converge to the global optimum. Since the MSD updates for non-binary y_i may get trapped at a suboptimal δ^{(m)}, we show convergence for either binary y_i or a soft-max with any K (which in the limit is equivalent to the max function).

Algorithm 1 The DLPW algorithm
    Initialize: choose w_1 s.t. \|w_1\| \le C
    for t = 1 to T do
        Pick a sample point m
        Perform R coordinate descent iterations on all variables δ^{(m,t)} via the updates in Eq. 11. Denote the new values by δ^{(m,t+1)}.
        Set w_{t+1/2} = w_t - \frac{1}{t}\nabla_w\, g(\delta^{(m,t+1)}; \theta^{(m)}(w_t))
        Set w_{t+1} = \min\left\{1, \frac{C}{\|w_{t+1/2}\|}\right\} w_{t+1/2}
    end for
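A compact sketch of the outer loop of Algorithm 1 is given below. The four callbacks (building θ^{(m)}(w), one coordinate-descent sweep of Eq. 11, the subgradient of g in w, and the initial dual variables) are problem-specific and are assumptions of this sketch rather than code from the paper.

```python
# Illustrative sketch of Algorithm 1 (DLPW): dual coordinate descent per sample,
# followed by a stochastic subgradient step on w and a projection.
import numpy as np

def dlpw(samples, dim, C, T, make_theta, coord_descent_pass, grad_g, init_delta,
         R=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)                                          # satisfies ||w_1|| <= C
    deltas = [init_delta(s) for s in samples]                  # dual variables, one set per sample
    for t in range(1, T + 1):
        m = int(rng.integers(len(samples)))                    # pick a sample point
        theta = make_theta(w, samples[m])                      # theta^{(m)}(w) of Section 2
        for _ in range(R):                                     # R sweeps of the Eq. 11 updates
            coord_descent_pass(theta, deltas[m])
        w = w - (1.0 / t) * grad_g(w, samples[m], deltas[m])   # subgradient step, step size 1/t
        nrm = np.linalg.norm(w)
        if nrm > C:                                            # projection onto the ball of radius C
            w *= C / nrm
    return w
```

Note that the dual variables are never re-solved from scratch: deltas[m] is kept across passes over the data, which provides the warm start discussed in the introduction.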
4. Previous Approaches

Most algorithmic effort in structured prediction has focused on the case where maximization over the label space y is tractable. Key examples are when the graph G corresponds to a tree (Taskar et al., 2004; Collins, 2002), or where labels correspond to a combinatorial object such as graphs or matchings (Taskar et al., 2006). Below we review these approaches and highlight their applicability to LP approximations.

In Taskar et al. (2004) the authors noted that although the primal hinge loss involves maximizing over an exponentially large set, its dual has a simpler structure. Specifically, for singly connected graphs G the dual involves only a polynomial number of constraints. They suggested a dual algorithm similar to the SMO algorithm of Platt (1998), where the dual variables are updated by switching probability mass between two labels y_1, y_2. We note that this dual is different from ours, since it also involves dualizing over w. This method can also be applied to the LP relaxation case, as noted in Taskar et al. (2004, Section 4). Another dual approach is to use exponentiated gradient steps (Collins et al., 2008). However, these are tailored for the case when inference is tractable and do not seem easily transferable to LP approximations. It is also possible to adapt the perceptron update to the LP case (Kulesza and Pereira, 2008), but this again requires solving the LP at every iteration, and is only exact in the separable case.

Primal methods (such as the one we propose here) operate by updating w directly, and seem to have been used more frequently for structured prediction with LP approximations. One natural approach is to use stochastic (or incremental) subgradient descent on the objective in Eq. 3 (e.g., Shalev-Shwartz and Srebro, 2009; Ratliff et al., 2007). The main drawback of this approach is that calculating the subgradient requires solving the LP approximation for every sample point. This can be quite costly, and is precisely what we avoid doing in the current paper. Another popular primal approach is based on cutting planes (Joachims et al., 2009; Finley and Joachims, 2008). Here one incrementally adds constraints that correspond to vertices of the LP constraints. It can be shown that a polynomial number of constraints is needed to achieve a given optimization accuracy, but in practice this number may be large. The method we present here avoids this growth in problem size.

The work closest to ours is Taskar et al. (2005), where the dual is taken of only the LP over µ and w is kept intact. However, this is done only for problems where the LP is exact (has integral vertices), and the authors suggest solving the problem via a standard QP solver, as opposed to the efficient coordinate descent message passing procedure we employ here.

5. Experiments

To evaluate our proposed algorithm, we compare its performance on multi-label classification tasks to some of the alternative approaches discussed in Section 4. We will show that DLPW often outperforms the other algorithms, and that it scales well with problem size.

In this multi-label classification setting, each label y_i is a binary random variable indicating whether the i'th label is 'on', and these form a fully connected graph over all label variables. Our model is equivalent to the one used by Finley and Joachims (2008), except that we use an overcomplete representation for the feature functions f (f_i(y_i, x) is defined for all values y_i, and similarly for f_{ij}(y_i, y_j, x)). The inputs x are vectors in R^s. The feature f_i(y_i, x) is a |y_i|·s dimensional vector, i.e., |y_i| concatenated vectors of dimension s; its value is x in the block corresponding to the label value y_i and zero elsewhere. The edge feature functions f_{ij}(y_i, y_j, x) are indicator vectors, so the length of w_{ij} is |y_i||y_j| (4 in the binary case). We use a normalized Hamming loss with e_i^{(m)}(y) = 1{y_i ≠ y_i^{(m)}}/d.
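To illustrate the overcomplete representation just described, the following sketch constructs the singleton and pairwise feature vectors and the per-label Hamming error terms for the binary multi-label model; the array layout is our own choice and is meant only as an example.

```python
# Illustrative sketch of the multi-label features of Section 5.
import numpy as np

def singleton_feature(x, y_i, n_states=2):
    """f_i(y_i, x): |y_i| concatenated blocks of dimension s; x in block y_i, zeros elsewhere."""
    s = x.shape[0]
    f = np.zeros(n_states * s)
    f[y_i * s:(y_i + 1) * s] = x
    return f

def pairwise_feature(y_i, y_j, n_states=2):
    """f_ij(y_i, y_j, x): indicator vector of length |y_i||y_j| (4 in the binary case)."""
    f = np.zeros(n_states * n_states)
    f[y_i * n_states + y_j] = 1.0
    return f

def hamming_terms(y_true, n_states=2):
    """e_i^{(m)}(y_i) = 1{y_i != y_i^{(m)}} / d, for every node i and value y_i."""
    d = len(y_true)
    return np.array([[float(v != y_true[i]) / d for v in range(n_states)] for i in range(d)])
```

With these in hand, the potentials θ^{(m)}(w) of Section 2 are obtained by taking inner products of the difference features Δf^{(m)} with the corresponding weight blocks and adding the error terms; they are the inputs to the LP and dual sketches above.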
We focus on three real-world datasets taken from the LIBSVM collection (Chang and Lin, 2001): Yeast [14 labels, 1500 training samples, 103 features in x] (Elisseeff and Weston, 2001), Scene [6 labels, 1211 samples, 294 features] (Boutell et al., 2004), and Reuters (subset 1; 3000 samples, 47236 features) (Lewis et al., 2004). We use a reduction of the Reuters dataset to the 30 most frequent labels.

For each dataset, we train a classifier using DLPW and two other algorithms. The first is a cutting-plane algorithm (Finley and Joachims, 2008) and the second is the Pegasos algorithm, which uses an LP solver to obtain the approximate MAP at each iteration. (Implementation details: we implemented all algorithms in C++. The cutting-plane code is taken from SVM-struct (http://svmlight.joachims.org/svm_struct.html) and adapted for the LP case. We use the GLPK library (http://www.gnu.org/software/glpk/glpk.html) to solve the relaxed LP in both the cutting-plane and Pegasos algorithms. We ran the experiments on a Dual-Core AMD 2.6 GHz Linux machine.) The results are shown in Fig. 1(a-c).

As we mentioned in Section 3.3, we limited the number of iterations (MSD updates to all nodes) in DLPW to R = 10. We tried a couple of other values for R in this regime and found that they gave similar overall performance. Generally, decreasing R results in faster iterations, but each iteration yields a smaller improvement in the objective. On the other hand, if we set no limit on R and allow the MSD algorithm to converge, we get performance similar to that of Pegasos. This makes sense, as the LP would then be solved completely at each iteration by both algorithms.

[Figure 1. Comparing quality of solution as a function of runtime for various datasets. The x-axis in each subfigure represents the runtime in seconds, while the y-axis represents the objective of Eq. 7. Each subfigure shows a line for each of the tested algorithms. Some of the traces were truncated to show more detail in the interesting range.]

Figure 1 shows the objective of Eq. 7 as a function of runtime for each algorithm. For the Yeast dataset we can first see that both subgradient algorithms converge to the optimum much faster than the cutting-plane algorithm. Furthermore, we see that DLPW is significantly more efficient than Pegasos, which solves the primal LP at each iteration. Specifically, on this dataset DLPW runs 6 times faster than Pegasos and 43 times faster than cutting-plane. In the Scene dataset we see that again the subgradient methods converge much faster than cutting-plane, but here there is only a small advantage for DLPW over Pegasos. This is presumably because the graph is rather small (6 nodes vs. 14 nodes in Yeast), so the LP solver becomes quite efficient. Finally, for the Reuters dataset we observe once more the improved efficiency of the subgradient algorithms. Here DLPW takes less than half the time of Pegasos to converge. We note that when reducing the Reuters dataset to only the 10 most frequent labels rather than 30, DLPW converges only slightly faster than Pegasos (not shown), which demonstrates the improved scalability of our method as the graph grows in size.

We note that for the multi-label binary models, one could also use cut-based methods, which are typically much faster than general LP solvers (e.g., see Finley and Joachims, 2008). However, our method generalizes to the non-binary setting, where it is not clear how to employ cut-based methods for solving the LPs.
Indeed, Figure 1(d) shows results for such a case. For this we use synthetic data similar to the multi-label setting, but with f_i(y_i, x) holding x_i in position y_i rather than the whole vector x (x and y are assumed to be of the same length). We run all algorithms on a fully connected graph with d = 20, where each y_i has 4 states, and the training set consists of 100 samples (these were generated by randomly sampling w and x and obtaining y via iterative conditional modes). We can see that in this setting the cutting-plane algorithm outperforms Pegasos, but DLPW is significantly faster than both.

6. Discussion

We have presented an algorithm for efficient, scalable optimization of structured prediction problems that employ approximate inference. Our algorithm dualizes the LP approximation and thus avoids the need to solve it completely at each iteration. The dual can be viewed as an upper bound on the hinge loss (hence the term dual loss) which can be tightened via the auxiliary variables δ^{(m)}. An interesting future direction would be to further explore this notion of tunable surrogate losses and study its effects on generalization.

Our empirical results show that the DLPW algorithm improves on methods that employ LP solvers as a black box, and that the improvement grows with the number of labels. A natural question might be why we could not have simply used the MSD updates within these solvers to replace the black-box LP. There are two problems with this approach. First, we would still be required to iterate the MSD to convergence (since the LP needs to be solved exactly in these methods). Second, there would be an extra step of obtaining a primal solution from the δ^{(m)} variables, which requires extra work for non-binary labels.

In many structured prediction problems, part of the instance may have special structure for which more efficient inference algorithms are known. In these cases we can make global moves to optimize the δ^{(m)} variables, e.g., tree block coordinate descent (Sontag and Jaakkola, 2009). Our techniques can also be applied to structured prediction problems other than graphical models, such as parsing, by varying the dual decomposition of the optimization problem used for prediction.

The convergence results presented here are asymptotic. It would be desirable to also derive rate-of-convergence results. It is possible to show that if the δ^{(m)} are updated at each iteration until they minimize the dual loss, then a rate of O(1/ε) is obtained. A similar result can be given for minimization up to a certain accuracy. It thus seems likely that we can obtain rate results by employing convergence rates for the δ^{(m)} updates. Recent work (Burshtein, 2009) has analyzed convergence for a related variant of MSD, and can probably be used in this context.

Finally, our results are easily extended to functions f(x, y) that involve larger cliques on y. This can be done via generalizations of MSD-type updates to this case (e.g., the GMPLP algorithm in Globerson and Jaakkola, 2008). Furthermore, such cliques can be introduced to improve the accuracy of the LP approximation (Sontag et al., 2008). We thus expect DLPW to allow improved prediction accuracy for a variety of large scale structured prediction problems.

Acknowledgements

This work was supported by BSF grant 2008303. D.S. was supported by a Google PhD Fellowship.
A. Convergence Proof

To simplify the derivation we assume we have only two delta variables per sample point, and denote those by δ_1^{(m)}, δ_2^{(m)}. All arguments generalize to δ^{(m)} with more variables. Our objective thus has the form:

    h(w, \delta) = \sum_m h_m(w, \delta_1^{(m)}, \delta_2^{(m)})    (14)

where h includes the L2 regularization on w. We assume that h is strictly convex w.r.t. its variables. Strict convexity w.r.t. the δ variables is obtained if we use the soft-max loss (see Eq. 13) for any finite value of K; h is strictly convex w.r.t. w because of the L2 regularization. The proof also applies (with minor modifications) to the case where the y_i are binary, since in this case the max-sum diffusion updates do not have non-optimal fixed points and strict convexity w.r.t. δ is then not needed in the proof.

We wish to show that our algorithm converges to the global minimum of h(w, δ). The updates of our algorithm are as follows. At iteration t, choose the next sample point m (for simplicity we assume that the sample points are picked in the same order at every iteration; we also implicitly assume that m depends on t, but drop this from the notation) and:

- Choose δ_1^{(m,t+1)} to minimize h_m(w_t, δ_1^{(m)}, δ_2^{(m,t)}) over δ_1^{(m)}
- Choose δ_2^{(m,t+1)} to minimize h_m(w_t, δ_1^{(m,t+1)}, δ_2^{(m)}) over δ_2^{(m)}
- Set w_{t+1} = w_t - α_t ∇_w h_m(w_t, δ_1^{(m,t+1)}, δ_2^{(m,t+1)})

Note that the update on w may be viewed as a subgradient step on the function h_m(w, δ_1^{(m,t+1)}, δ_2^{(m,t+1)}) when the latter is understood as a function of w. Using the same derivations as in Nedic and Bertsekas (2001, Lemma 2.1 therein), we arrive at the following inequality, which holds at iteration t for every w:

    \|w_{t+1} - w\|^2 \le \|w_t - w\|^2 + \alpha_t^2 D^2 - 2\alpha_t\left[ h(w_t, \delta^{(m,t+1)}) - h(w, \delta^{(m,t+1)}) \right]    (15)

where D is an upper bound on the norm of the subgradient (which is indeed bounded by C + R̂, where R̂ is the maximum norm of a feature vector f(x, y), as in Shalev-Shwartz et al., 2007). Specifically, the above holds for w = w(δ^{(m,t+1)}) = arg min_w h_m(w, δ^{(m,t+1)}) (this w is unique due to the strict convexity of h(w, δ) in w as a result of the L2 regularization). For this w the difference in h objectives above is always non-negative, so that by Eq. 15 every iteration brings w_t closer to its optimal value for the current δ^{(m,t)}, provided that the stepsize α_t is sufficiently small.

We now wish to take t → ∞ and analyze the limit point w̄, δ̄. It can be shown via standard methods that such a point exists. Furthermore, using similar arguments to Correa and Lemaréchal (1993, Proposition 1.2), we can conclude that h(w(δ̄), δ̄) = h(w̄, δ̄), and thus w̄ = arg min_w h(w, δ̄).

We now wish to prove that δ̄ is optimal for w̄. This is done as in standard coordinate descent analyses (e.g., Bertsekas, 1995, page 274). The update of δ_1^{(m,t+1)} requires h_m(w_t, δ_1^{(m,t+1)}, δ_2^{(m,t)}) ≤ h_m(w_t, δ_1^{(m)}, δ_2^{(m,t)}) for all values of δ_1^{(m)}. Taking t → ∞ we have h_m(w̄, δ̄_1^{(m)}, δ̄_2^{(m)}) ≤ h_m(w̄, δ_1^{(m)}, δ̄_2^{(m)}). This implies that the limit value δ̄_1^{(m)} is optimal with respect to all the other coordinates (δ̄ and w̄). We can show this local optimality for all coordinates of δ, and by the properties of strictly convex functions we obtain that δ̄ = arg min_δ h(w̄, δ).

Taken together, the above arguments show that w̄, δ̄ are the minimizers of h(w, δ) when one of the two variables is fixed to either w̄ or δ̄. Strict convexity of h then implies that (w̄, δ̄) is the global optimum of h(w, δ).
References

G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan. Predicting Structured Data. The MIT Press, 2007.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995.

M. Boutell, J. Luo, X. Shen, and C. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.

D. Burshtein. Iterative approximate linear programming decoding of LDPC codes with linear complexity. IEEE Trans. on Information Theory, 55(11):4835–4859, 2009.

C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

M. Collins. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In EMNLP, pages 1–8, 2002.

M. Collins, A. Globerson, T. Koo, X. Carreras, and P. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR, 9:1775–1822, 2008.

R. Correa and C. Lemaréchal. Convergence of some algorithms for convex minimization. Math. Program., 62(2):261–275, 1993.

A. Elisseeff and J. Weston. Kernel methods for multi-labelled classification and categorical regression problems. In NIPS 14, pages 681–687, 2001.

T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML 25, pages 304–311, 2008.

A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS 20, pages 553–560, 2008.

T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

J. K. Johnson, D. Malioutov, and A. S. Willsky. Lagrangian relaxation for MAP estimation in graphical models. In 45th Annual Allerton Conference on Communication, Control and Computing, September 2007.

V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on PAMI, 28(10):1568–1583, 2006.

N. Komodakis, N. Paragios, and G. Tziritas. MRF optimization via dual decomposition: Message-passing revisited. In ICCV, 2007.

A. Kulesza and F. Pereira. Structured learning with approximate inference. In NIPS 20, pages 785–792, 2008.

J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML 18, pages 282–289, 2001.

D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: a new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.

A. F. T. Martins, N. A. Smith, and E. P. Xing. Polyhedral outer approximations with application to natural language parsing. In ICML 26, pages 713–720, 2009.

A. Nedic and D. P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM J. on Optimization, 12(1):109–138, 2001.

J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

N. Ratliff, J. A. D. Bagnell, and M. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML 24, pages 807–814, New York, NY, 2007. ACM Press.

S. Shalev-Shwartz and N. Srebro. Theory and practice of support vector machines optimization. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. 2009.

D. Sontag and T. Jaakkola. Tree block coordinate descent for MAP in graphical models. In AISTATS 12, 2009.

D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In UAI 24, pages 503–510, 2008.

B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: a large margin approach. In ICML 22, pages 896–903, 2005.

B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. In NIPS 16, pages 25–32, 2004.

B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction, dual extragradient and Bregman projections. JMLR, pages 1627–1653, 2006.

T. Werner. A linear programming approach to max-sum problem: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:1165–1179, 2007.