Laplace Maximum Margin Markov Networks

Jun Zhu  jun-zhu@mails.tsinghua.edu.cn
Bo Zhang  dcszb@mail.tsinghua.edu.cn
Department of Computer Science and Technology, Tsinghua University, Beijing 100084 China

Eric P. Xing  epxing@cs.cmu.edu
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA

Abstract

We propose Laplace max-margin Markov networks (LapM3N), and a general class of Bayesian M3Ns (BM3N) of which the LapM3N is a special case with a sparse structural bias, for robust structured prediction. BM3N generalizes extant structured prediction rules based on point estimators to a Bayes-predictor using a learnt distribution of rules. We present a novel Structured Maximum Entropy Discrimination (SMED) formalism for combining Bayesian and max-margin learning of Markov networks for structured prediction; our approach subsumes the conventional M3N as a special case. An efficient learning algorithm based on variational inference and standard convex-optimization solvers for M3N, and a generalization bound, are offered. Our method outperforms competing ones on both synthetic and real OCR data.

1. Introduction

In recent years, log-linear models based on composite features that explicitly exploit the structural dependencies among elements in high-dimensional inputs (e.g., DNA strings, text sequences, image lattices) and structured interpretational outputs (e.g., gene segmentation, natural language parsing, scene description) have gained substantial popularity in learning structured predictions from complex data. Major instances of such models include conditional random fields (CRFs) (Lafferty et al., 2001), Markov networks (MNs) (Taskar et al., 2003), and other specialized graphical models (Altun et al., 2003). Adding to the flexibility and expressive power of such models, different learning paradigms have been explored, such as maximum likelihood estimation (Lafferty et al., 2001) and max-margin learning (Altun et al., 2003; Taskar et al., 2003; Tsochantaridis et al., 2004).

For domains with a complex feature space, it is often desirable to pursue a "sparse" representation of the model that leaves out irrelevant features. Learning such a sparse model is key to reducing the risk of overfitting and achieving good generalizability. In likelihood-based estimation, sparse model fitting has been extensively studied. A commonly used strategy is to add an $L_1$-penalty to the likelihood function, which can also be viewed as MAP estimation under a Laplace prior. Recent work along this line includes (Lee et al., 2006; Wainwright et al., 2006; Andrew & Gao, 2007).

This progress notwithstanding, little progress has been made so far on learning sparse MNs, or log-linear models in general, based on the max-margin principle, which is arguably a more desirable paradigm for training highly discriminative structured prediction models in a number of application contexts. While sparsity has been pursued in maximum margin learning of certain "unstructured" discriminative models (i.e., with a univariate output) such as SVMs, by using $L_1$-regularization (Bennett & Mangasarian, 1992) or by adding a cardinality constraint (Chan et al., 2007), generalizing these techniques to the structured output space turns out to be extremely non-trivial. For example, although it appears possible to formulate sparse max-margin learning as a convex optimization problem as for SVMs, both the primal and dual problems are hard to solve, since there is no obvious way to exploit the conditional independence structure within a regularized MN to efficiently deal with the typically exponential number of margin constraints. Another empirical insight, as we will show in this paper, is that $L_1$-regularized estimation is not so robust: discarding features that are not completely irrelevant can hurt generalization ability.

In this paper, we propose a new formalism called Structured Maximum Entropy Discrimination (SMED), which offers a general framework for combining Bayesian learning and max-margin learning of log-linear models for structured prediction. SMED is a generalization of the maximum entropy discrimination methods (Jaakkola et al., 1999), originally developed for classification, to the broader problem of structured learning. It facilitates posterior inference of a full distribution of feature coefficients (i.e., weights), rather than a point estimate as in the standard max-margin Markov network (M3N) (Taskar et al., 2003), under a user-specified prior distribution of the coefficients and generalized maximum margin constraints. One can use the learned posterior distribution of coefficients to form a Bayesian max-margin Markov network (BM3N) that is equivalent to a weighted sum of differentially parameterized M3Ns, or one can obtain a MAP BM3N. We show that, by using a Laplace prior for the feature coefficients, the resulting BM3N is effectively a "sparse" max-margin Markov network, which we refer to as a Laplace M3N (LapM3N). But unlike $L_1$-regularized maximum likelihood estimation, where sparsity is due to a hard threshold introduced by the Laplace prior (Kaban, 2007), the effect of the Laplace prior in LapM3N is a biased posterior weighting of the parameters: smaller parameters are shrunk more, and thus robust estimation is achieved when the data have irrelevant features. The Bayesian formalism also makes LapM3N less sensitive to regularization constants. Interestingly, a trivial assumption on the prior distribution of the coefficients, i.e., a standard (zero-mean, identity-covariance) normal, reduces BM3N to the standard M3N, as shown in Theorem 3.

The paper is structured as follows. The next section reviews the basic structured prediction formalism and sets the stage for our model. Sec. 3 presents the SMED formalism and basic results on BM3N. Sec. 4 presents LapM3N and a novel learning algorithm. Sec. 5 presents a generalization bound of BM3N. Sec. 6 shows empirical results. Sec. 7 concludes this paper.
2. Preliminaries

Consider a structured prediction problem such as natural language parsing, image understanding, or DNA decoding. The objective is to learn a predictive function $h: \mathcal{X} \to \mathcal{Y}$ from a structured input $\mathbf{x} \in \mathcal{X}$ (e.g., a sentence or an image) to a structured output $\mathbf{y} \in \mathcal{Y}$ (e.g., a sentence parse or a scene annotation), where $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$ with $\mathcal{Y}_i = \{y^1, \ldots, y^{m_i}\}$ represents a combinatorial space of structured interpretations of multi-facet objects. For example, $\mathcal{Y}$ could correspond to the space of all possible instantiations of the part-of-speech (POS) tags in the parse tree of a sentence, or the space of all possible ways of labeling entities over some segmentation of an image. The prediction $\mathbf{y} = (y_1, \ldots, y_l)$ is structured because each individual label $y_i \in \mathcal{Y}_i$ within $\mathbf{y}$ must be determined in the context of the other labels $y_{j \neq i}$, rather than independently as in a standard classification problem.

Let $F: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ represent a discriminant function over input-output pairs, from which one can define the predictive function $h$. A common choice of $F$ is a linear model, based on a set of feature functions $f_k: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and their weights $w_k$, i.e., $F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})$. Given $F$, the prediction function $h$ is typically defined in terms of an optimization problem that maximizes $F$ over the response variable $\mathbf{y}$ given an input $\mathbf{x}$:

$$h_0(\mathbf{x}; \mathbf{w}) = \arg\max_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} F(\mathbf{x}, \mathbf{y}; \mathbf{w}). \quad (1)$$

Depending on the specific choice of the objective function $C(\mathbf{w})$ for estimating the parameter $\mathbf{w}$ (e.g., likelihood or margin), incarnations of the general structured prediction formalism described above can be seen in models such as CRFs (Lafferty et al., 2001), where $C(\mathbf{w})$ is the conditional likelihood of the true structured label, and the M3N (Taskar et al., 2003), where $C(\mathbf{w})$ is the margin between the true label and any other label. Recent advances in structured prediction have introduced regularizations of $C(\mathbf{w})$ in the CRF context, so that a sparse $\mathbf{w}$ can be learned (Andrew & Gao, 2007). To the best of our knowledge, existing max-margin structured prediction methods utilize a single discriminant function $F(\cdot\,; \mathbf{w})$ defined by the "optimum" estimate of $\mathbf{w}$, similar to the practice in Frequentist statistics. In this paper, we propose a Bayesian version of the predictive rule in Eq. (1), so that the prediction function $h$ can be obtained from a posterior mean over multiple (indeed infinitely many) $F(\cdot\,; \mathbf{w})$; and we also propose a new formalism and objective $C(\mathbf{w})$ that lead to a Bayesian M3N, which subsumes the standard M3N as a special case and can achieve a posterior shrinkage effect on $\mathbf{w}$ that resembles $L_1$-regularization. To our knowledge, although sparse graphical model learning based on various likelihood-based principles has recently received substantial attention (Lee et al., 2006; Wainwright et al., 2006), learning sparse networks based on the maximum margin principle has not yet been successfully explored. Our proposed method represents an initial foray in this important direction.

Before dwelling on the exposition of the proposed approach, we end this section with a brief recapitulation of the basic M3N, which motivates this work and provides a useful baseline that grounds the proposed approach. Under a max-margin framework, given training data $\mathcal{D} = \{\langle \mathbf{x}^i, \mathbf{y}^i \rangle\}_{i=1}^N$, we obtain a point estimate of the weight vector $\mathbf{w}$ by solving the following max-margin problem P0 (Taskar et al., 2003):

$$\text{P0 (M3N)}: \quad \min_{\mathbf{w}, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i$$
$$\text{s.t.} \quad \forall i, \forall \mathbf{y} \neq \mathbf{y}^i: \ \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \geq \Delta\ell_i(\mathbf{y}) - \xi_i, \ \xi_i \geq 0,$$

where $\Delta\mathbf{f}_i(\mathbf{y}) = \mathbf{f}(\mathbf{x}^i, \mathbf{y}^i) - \mathbf{f}(\mathbf{x}^i, \mathbf{y})$ and $\mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y})$ is the "margin" between the true label $\mathbf{y}^i$ and a prediction $\mathbf{y}$, $\Delta\ell_i(\mathbf{y})$ is a loss function with respect to $\mathbf{y}^i$, and $\xi_i$ is a slack variable that absorbs errors in the training data. Various loss functions have been proposed in the literature (Tsochantaridis et al., 2004). In this paper, we adopt the Hamming loss used in (Taskar et al., 2003): $\Delta\ell_i(\mathbf{y}) = \sum_{j=1}^{|\mathbf{x}^i|} \mathbb{I}(y_j \neq y_j^i)$, where $\mathbb{I}(\cdot)$ is an indicator function that equals one if its argument is true and zero otherwise.
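To make Eq. (1) and the Hamming loss concrete, here is a minimal Python sketch for a linear-chain model whose score decomposes into per-position and pairwise terms. The function names, array layout, and example data are our own illustration, not part of the original formulation.

```python
import numpy as np

def viterbi_decode(node_scores, trans):
    """h0(x; w) = argmax_y w'f(x, y) (Eq. (1)) for a chain model.
    node_scores: (L, S) array; node_scores[j, s] = score of label s at position j.
    trans:       (S, S) array; trans[s, t]       = score of transition s -> t.
    Dynamic programming exploits the chain structure, avoiding S^L enumeration."""
    L, S = node_scores.shape
    dp = node_scores[0].copy()            # best score of any prefix ending in each label
    back = np.zeros((L, S), dtype=int)    # backpointers
    for j in range(1, L):
        cand = dp[:, None] + trans        # cand[s, t]: extend best prefix at s with t
        back[j] = cand.argmax(axis=0)
        dp = cand.max(axis=0) + node_scores[j]
    y = [int(dp.argmax())]
    for j in range(L - 1, 0, -1):         # follow backpointers
        y.append(int(back[j, y[-1]]))
    return y[::-1]

def hamming_loss(y_true, y):
    """Delta ell_i(y) = sum_j I(y_j != y_j^i), the loss used in P0."""
    return sum(int(a != b) for a, b in zip(y_true, y))

# Usage on random scores: 5 positions, 3 labels.
rng = np.random.default_rng(0)
y_hat = viterbi_decode(rng.normal(size=(5, 3)), rng.normal(size=(3, 3)))
print(y_hat, hamming_loss([0, 1, 2, 1, 0], y_hat))
```

With pairwise features, the margin constraints in P0 decompose in the same way, which is precisely what the cutting-plane and message-passing solvers discussed next exploit.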
The optimization problem P0 is intractable in general because the feasible space for $\mathbf{w}$,
$$\mathcal{F}_0 = \{\mathbf{w}: \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \geq \Delta\ell_i(\mathbf{y}) - \xi_i; \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i\},$$
is defined by $O(N|\mathcal{Y}|)$ constraints, and $|\mathcal{Y}|$ itself is exponential in the size of the input $\mathbf{x}$. Exploiting sparse dependencies among the individual labels $y_i$ in $\mathbf{y}$, as reflected in the specific design of the feature functions (e.g., based on pairwise labeling potentials), together with the convex duality of the objective, efficient algorithms based on cutting planes (Tsochantaridis et al., 2004) or message passing (Taskar et al., 2003) have been proposed to obtain approximate optimum solutions. As described shortly, these algorithms can be directly employed as subroutines in solving our proposed model.

3. Bayesian Maximum Margin Markov Networks

In this paper, we take a Bayesian approach and learn a distribution $p(\mathbf{w})$, rather than a point estimate of $\mathbf{w}$, in a max-margin manner. For prediction, we average over all possible models, that is:

$$h_1(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} \int p(\mathbf{w}) F(\mathbf{x}, \mathbf{y}; \mathbf{w})\, d\mathbf{w}. \quad (2)$$

Now the open question is how to devise an appropriate objective function over $p(\mathbf{w})$, in a similar spirit to the $L_2$-norm cost over $\mathbf{w}$ in P0, that leads to an optimum estimate of $p(\mathbf{w})$. Below, we present a structured maximum entropy discrimination (SMED) framework that facilitates the estimation of a Bayesian M3N defined by $p(\mathbf{w})$. As we show in the sequel, our Bayesian max-margin learning formalism offers several advantages, such as a PAC-Bayes generalization guarantee and estimation robustness.

3.1. SMED and the Bayesian M3N

Given a training set $\mathcal{D}$, analogous to the feasible space $\mathcal{F}_0$ for the weight vector $\mathbf{w}$ in an M3N (i.e., problem P0), the feasible subspace $\mathcal{F}_1$ of weight distributions $p(\mathbf{w})$ is defined by a set of expected margin constraints:

$$\mathcal{F}_1 = \{p(\mathbf{w}): \langle \Delta F_i(\mathbf{y}; \mathbf{w}) - \Delta\ell_i(\mathbf{y}) \rangle_p \geq -\xi_i, \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i\},$$

where $\Delta F_i(\mathbf{y}; \mathbf{w}) = F(\mathbf{x}^i, \mathbf{y}^i; \mathbf{w}) - F(\mathbf{x}^i, \mathbf{y}; \mathbf{w})$ and $\langle \cdot \rangle_p$ denotes the expectation with respect to $p$.

To choose the best distribution $p(\mathbf{w})$ from $\mathcal{F}_1$, the maximum entropy principle suggests considering the distribution that minimizes its relative entropy with respect to some chosen prior $p_0$, as measured by the Kullback-Leibler divergence, $KL(p\|p_0) = \langle \log(p/p_0) \rangle_p$. To accommodate the discriminative prediction problem we are concerned with, instead of minimizing the usual KL we optimize a generalized entropy (Dudík et al., 2007; Lebanon & Lafferty, 2001), or regularized KL-divergence, $KL(p(\mathbf{w})\|p_0(\mathbf{w})) + U(\xi)$, where $U(\xi)$ is a closed proper convex function over the slack variables $\xi$. This leads to the following Structured Maximum Entropy Discrimination model:

Definition 1 (The Structured Maximum Entropy Discrimination Model) Given training data $\mathcal{D} = \{\langle \mathbf{x}^i, \mathbf{y}^i \rangle\}_{i=1}^N$, a discriminant function $F(\mathbf{x}, \mathbf{y}; \mathbf{w})$, a loss function $\Delta\ell_i(\mathbf{y})$, and the ensuing feasible subspace $\mathcal{F}_1$ (defined above) for the parameter distribution $p(\mathbf{w})$, the SMED model that leads to a prediction function of the form of Eq. (2) is defined by the following generalized relative entropy minimization with respect to a parameter prior $p_0(\mathbf{w})$:

$$\text{P1}: \quad \min_{p(\mathbf{w}), \xi} \ KL(p(\mathbf{w})\|p_0(\mathbf{w})) + U(\xi) \quad \text{s.t.} \quad p(\mathbf{w}) \in \mathcal{F}_1, \ \xi_i \geq 0 \ \forall i.$$

P1 as defined above is a variational optimization problem over $p(\mathbf{w})$ in a subspace of valid parameter distributions. Since both the KL term and the function $U$ in P1 are convex, and the constraints in $\mathcal{F}_1$ are linear in $p(\mathbf{w})$, P1 is a convex program, which can be solved by applying the calculus of variations to the Lagrangian to obtain a variational extremum, followed by a dual transformation of P1.
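As a quick sanity check on P1 (a worked special case of our own, anticipating Theorem 3 below), suppose $F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})$ and restrict attention to $p(\mathbf{w}) = \mathcal{N}(\mu, I)$ with $p_0 = \mathcal{N}(0, I)$. Then

$$\langle \Delta F_i(\mathbf{y}; \mathbf{w}) \rangle_p = \mu^\top \Delta\mathbf{f}_i(\mathbf{y}), \qquad KL(p\|p_0) = \tfrac{1}{2}\|\mu\|^2,$$

so with $U(\xi) = C\sum_i \xi_i$, the expected margin constraints and objective of P1 collapse to exactly the form of P0, with $\mu$ playing the role of $\mathbf{w}$. Theorem 3 below shows that this reduction in fact holds without restricting the family of $p(\mathbf{w})$.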
Due to space limits, a detailed derivation is given in an extended version of this paper; below we state the main result as a theorem.

Theorem 2 (Solution to SMED) The variational optimization problem P1 underlying the SMED model gives rise to the following optimum distribution of Markov network parameters $\mathbf{w}$:

$$p(\mathbf{w}) = \frac{1}{Z(\alpha)}\, p_0(\mathbf{w}) \exp\Big\{ \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y}) [\Delta F_i(\mathbf{y}; \mathbf{w}) - \Delta\ell_i(\mathbf{y})] \Big\}, \quad (3)$$

where the Lagrange multipliers $\alpha_i(\mathbf{y})$ (corresponding to the constraints in $\mathcal{F}_1$) are obtained by solving the dual problem of P1:

$$\text{D1}: \quad \max_\alpha \ -\log Z(\alpha) - U^*(\alpha) \quad \text{s.t.} \quad \alpha_i(\mathbf{y}) \geq 0, \ \forall i, \forall \mathbf{y},$$

where $U^*(\cdot)$ represents the conjugate of the slack function $U(\cdot)$, i.e., $U^*(\alpha) = \sup_\xi \big( \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \xi_i - U(\xi) \big)$.

For a closed proper convex function $\phi(\mu)$, its conjugate is defined as $\phi^*(\nu) = \sup_\mu [\nu^\top \mu - \phi(\mu)]$. In problem D1, by convex duality, the log normalizer $\log Z(\alpha)$ can be shown to be the conjugate of the KL-divergence. If the slack function is $U(\xi) = C\sum_i \xi_i$, it is easy to show that $U^*(\alpha) = \mathbb{I}_\infty\big(\sum_{\mathbf{y}} \alpha_i(\mathbf{y}) \leq C, \ \forall i\big)$, where $\mathbb{I}_\infty(\cdot)$ is a function that equals zero when its argument holds true and infinity otherwise. Here, the strict inequality corresponds to the trivial solution $\xi = 0$, that is, the training data are perfectly separable. Ignoring this inequality does not affect the solution, since the special case $\xi = 0$ is still included. Thus, the Lagrange multipliers $\alpha_i(\mathbf{y})$ in the dual problem D1 comply with the set of constraints $\sum_{\mathbf{y}} \alpha_i(\mathbf{y}) = C, \ \forall i$.
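The conjugate claimed above for $U(\xi) = C\sum_i \xi_i$ follows from a one-line computation (our own verification step, taking the supremum over the feasible slacks $\xi \geq 0$ as in P1):

$$U^*(\alpha) = \sup_{\xi \geq 0} \sum_i \xi_i \Big( \sum_{\mathbf{y}} \alpha_i(\mathbf{y}) - C \Big) = \begin{cases} 0, & \text{if } \sum_{\mathbf{y}} \alpha_i(\mathbf{y}) \leq C \ \forall i, \\ \infty, & \text{otherwise}, \end{cases}$$

since each coordinate $\xi_i$ can be sent to $+\infty$ whenever its coefficient is positive, and is best set to $0$ otherwise.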
Another example is $U(\xi) = KL(p(\xi)\|p_0(\xi))$, obtained by introducing uncertainty on the slack variables (Jaakkola et al., 1999). Some other $U$ functions and their dual functions are studied in (Lebanon & Lafferty, 2001; Dudík et al., 2007).

The optimum parameter distribution $p(\mathbf{w})$ defined by Eq. (3), along with the predictive function $h_1(\mathbf{x})$ given by Eq. (2), jointly form what we would like to call a Bayesian M3N (BM3N). The close connection between BM3N and M3N is suggested by the striking isomorphisms of the optimization problem P1, the feasible space $\mathcal{F}_1$, and the predictive function $h_1$ underlying a BM3N to their counterparts P0, $\mathcal{F}_0$, and $h_0$, respectively, underlying an M3N. Indeed, by making a special choice of the parameter prior in Eq. (3), based on the above discussion of conjugate functions in D1, we arrive at a reduction of D1 to an M3N optimization problem. The following theorem makes this explicit.

Theorem 3 (Reduction of BM3N to M3N) Assuming $F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})$, $U(\xi) = C\sum_i \xi_i$, and $p_0(\mathbf{w}) = \mathcal{N}(\mathbf{w}|0, I)$, where $I$ denotes the identity matrix, the Lagrange multipliers $\alpha_i(\mathbf{y})$ are obtained by solving the following dual problem:

$$\max_\alpha \ \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\ell_i(\mathbf{y}) - \frac{1}{2} \Big\| \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\mathbf{f}_i(\mathbf{y}) \Big\|^2$$
$$\text{s.t.} \quad \sum_{\mathbf{y}} \alpha_i(\mathbf{y}) = C; \quad \alpha_i(\mathbf{y}) \geq 0, \ \forall i, \forall \mathbf{y}$$

(as noted above, the constraints $\sum_{\mathbf{y}} \alpha_i(\mathbf{y}) = C$ are due to the conjugate of $U(\xi) = C\sum_i \xi_i$), which, when applied to $h_1$, leads to a predictive function that is identical to $h_0(\mathbf{x}; \mathbf{w})$ given by Eq. (1).

Proof: (sketch) Replacing $p_0(\mathbf{w})$ in Eq. (3) with $\mathcal{N}(\mathbf{w}|0, I)$, we obtain the following closed-form expression for $Z(\alpha)$ in $p(\mathbf{w})$:

$$Z(\alpha) = \kappa \int \exp\Big\{ -\frac{\mathbf{w}^\top\mathbf{w}}{2} + \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\big[\mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) - \Delta\ell_i(\mathbf{y})\big] \Big\}\, d\mathbf{w}$$
$$= \exp\Big\{ -\sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\ell_i(\mathbf{y}) + \frac{1}{2} \Big\| \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\mathbf{f}_i(\mathbf{y}) \Big\|^2 \Big\},$$

where $\kappa = (2\pi)^{-K/2}$ is the normalization constant of the prior. Substituting this into D1 yields the stated dual problem. $\square$

Theorem 3 shows that, in the supervised learning setting, the M3N is subsumed by the SMED model and can be viewed as a special case of a Bayesian M3N when the slack function is linear and the parameter prior is a standard normal. As described later, this connection renders many existing techniques for solving the M3N directly applicable to solving the BM3N. Note that although the distribution $p(\mathbf{w})$ in Eq. (3) has the same form as that of Bayesian CRFs (Qi et al., 2005), the underlying principles are fundamentally different.

The recent trend of pursuing "sparse" graphical models has led to the emergence of regularized versions of CRFs (Andrew & Gao, 2007) and Markov networks (Lee et al., 2006; Wainwright et al., 2006). Interestingly, while such extensions have been successfully implemented by several authors in maximum likelihood learning of various sparse graphical models, they have not yet been explored in the context of maximum margin learning. This gap is not merely due to negligence; indeed, learning a sparse M3N can be significantly harder, as we discuss below.

As Theorem 3 reveals, an M3N corresponds to a BM3N with a standard normal prior on the weight vector $\mathbf{w}$. To encourage a sparse model under a zero-mean normal prior, the weights of irrelevant features should peak around zero with very small variances. However, the isotropy of the variances across all dimensions in the standard normal prior makes the M3N unable to adjust the variances in different dimensions to fit sparse data. One way to learn a sparse model is to adopt the strategy of the $L_1$-SVM and use an $L_1$-norm instead of the $L_2$-norm (a detailed description of this formulation and the duality derivation is available in the extended version of this paper). However, in both the primal and the dual of an $L_1$-regularized M3N, there is no obvious way to exploit the sparse dependencies among the variables of the MN in order to efficiently deal with the typically exponential number of constraints, which makes direct optimization or an LP formulation expensive. In this paper, we instead adopt the SMED framework, which directly leads to a Bayesian M3N, and employ a Laplace prior for $\mathbf{w}$ to learn a Laplace M3N. When fitted to training data, the parameter posterior $p(\mathbf{w})$ under a Laplace M3N has a shrinkage effect on small weights, which is similar to the $L_1$-regularizer in an M3N. Although exact learning of a Laplace M3N is still very hard, we show that it can be efficiently approximated by a variational inference procedure based on existing methods.

4. Laplace M3N

The Laplace prior on $\mathbf{w}$ is
$$p_0(\mathbf{w}) = \prod_{k=1}^K \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\,|w_k|} = \Big(\frac{\sqrt{\lambda}}{2}\Big)^K e^{-\sqrt{\lambda}\,\|\mathbf{w}\|_1}.$$
The Laplace density is heavy-tailed and peaked at zero; it thus encodes the prior belief that the distribution of $\mathbf{w}$ is strongly peaked around zero. Another nice property is that the Laplace density is log-concave (its negative log is convex), which can be exploited to get convex estimation problems like the LASSO (Tibshirani, 1996).

4.1. Variational Learning with a Laplace Prior

Although in principle we have a closed-form solution for $p(\mathbf{w})$ in Theorem 2, the parameters $\alpha_i(\mathbf{y})$ are hard to estimate when using the Laplace prior. As we shall see in Section 4.2, exact integration leads to a dual function that is difficult to maximize. Thus, we present a variational approximate learning approach.

Our approach is based on the hierarchical interpretation (Figueiredo, 2003) of the Laplace prior: each $w_k$ has a zero-mean Gaussian distribution $p(w_k|\tau_k) = \mathcal{N}(w_k|0, \tau_k)$, and the variance $\tau_k$ has an exponential hyper-prior density
$$p(\tau_k|\lambda) = \frac{\lambda}{2} \exp\Big\{ -\frac{\lambda}{2}\tau_k \Big\}, \quad \text{for } \tau_k \geq 0.$$
Let $p(\mathbf{w}|\tau) = \prod_{k=1}^K p(w_k|\tau_k)$ and $p(\tau|\lambda) = \prod_{k=1}^K p(\tau_k|\lambda)$; then $p_0(\mathbf{w}) = \int p(\mathbf{w}|\tau)\, p(\tau|\lambda)\, d\tau$.
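Before proceeding, a quick Monte Carlo check of this Gaussian scale-mixture identity (our own illustration; note that `scale=2/lam` encodes an exponential with rate $\lambda/2$):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0

# Hierarchical sampling: tau ~ Exp(rate = lam/2), then w | tau ~ N(0, tau).
tau = rng.exponential(scale=2.0 / lam, size=500_000)
w = rng.normal(0.0, np.sqrt(tau))

# The marginal should be Laplace with density (sqrt(lam)/2) exp(-sqrt(lam)|w|),
# whose mean absolute value is 1/sqrt(lam) and whose variance is 2/lam.
print(np.mean(np.abs(w)), 1 / np.sqrt(lam))   # ~0.5 vs 0.5
print(np.var(w), 2 / lam)                     # ~0.5 vs 0.5
```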
Using this hierarchical representation and applying Jensen's inequality, we get the following upper bound:
$$KL(p\|p_0) = -H(p) - \Big\langle \log \int p(\mathbf{w}|\tau)\, p(\tau|\lambda)\, d\tau \Big\rangle_p$$
$$\leq -H(p) - \Big\langle \int q(\tau) \log \frac{p(\mathbf{w}|\tau)\, p(\tau|\lambda)}{q(\tau)}\, d\tau \Big\rangle_p \triangleq \mathcal{L}(p(\mathbf{w}), q(\tau)),$$
where $q(\tau)$ is a variational distribution used to approximate $p(\tau|\lambda)$. Substituting this upper bound for the KL in P1, we now solve the following problem:
$$\min_{p(\mathbf{w}) \in \mathcal{F}_1;\; q(\tau);\; \xi} \ \mathcal{L}(p(\mathbf{w}), q(\tau)) + U(\xi). \quad (4)$$

This problem can be solved with an iterative minimization algorithm alternating between $p(\mathbf{w})$ and $q(\tau)$, as outlined in Algorithm 1 and detailed below.

Algorithm 1 Variational Bayesian Learning
  Input: data $\mathcal{D} = \{\langle \mathbf{x}^i, \mathbf{y}^i \rangle\}_{i=1}^N$, constants $C$ and $\lambda$, iteration number $T$
  Output: posterior mean $\langle \mathbf{w} \rangle_p^T$
  Initialize $\langle \mathbf{w} \rangle_p^1 \leftarrow 0$, $\Sigma^1 \leftarrow I$
  for $t = 1$ to $T - 1$ do
    Step 1: solve (5) or (6) for $\langle \mathbf{w} \rangle_p^{t+1} = \Sigma^t \eta$; update $\langle \mathbf{w}\mathbf{w}^\top \rangle_p^{t+1} \leftarrow \Sigma^t + \langle \mathbf{w} \rangle_p^{t+1} \big(\langle \mathbf{w} \rangle_p^{t+1}\big)^\top$.
    Step 2: use (7) to update $\Sigma^{t+1} \leftarrow \mathrm{diag}\big(\sqrt{\langle w_k^2 \rangle_p^{t+1}/\lambda}\big)$.
  end for

Step 1: Keeping $q(\tau)$ fixed, we optimize (4) with respect to $p(\mathbf{w})$. Following the same procedure as in solving P1, we get the posterior distribution $p(\mathbf{w})$ as follows:
$$p(\mathbf{w}) \propto \exp\Big\{ \int q(\tau) \log p(\mathbf{w}|\tau)\, d\tau - b \Big\} \cdot \exp\{\mathbf{w}^\top \eta - L\}$$
$$\propto \exp\Big\{ -\frac{1}{2}\mathbf{w}^\top \langle A^{-1} \rangle_q\, \mathbf{w} + \mathbf{w}^\top \eta - L \Big\} = \mathcal{N}(\mathbf{w}|\mu_\mathbf{w}, \Sigma_\mathbf{w}),$$
where $\eta = \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\mathbf{f}_i(\mathbf{y})$, $L = \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\ell_i(\mathbf{y})$, $A = \mathrm{diag}(\tau_k)$, and $b = KL(q(\tau)\|p(\tau|\lambda))$ is a constant. The posterior mean and covariance are $\langle \mathbf{w} \rangle_p = \mu_\mathbf{w} = \Sigma_\mathbf{w}\, \eta$ and $\Sigma_\mathbf{w} = (\langle A^{-1} \rangle_q)^{-1} = \langle \mathbf{w}\mathbf{w}^\top \rangle_p - \langle \mathbf{w} \rangle_p \langle \mathbf{w} \rangle_p^\top$, respectively. The dual parameters are estimated by solving the following dual problem:
$$\max_\alpha \ \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y})\, \Delta\ell_i(\mathbf{y}) - \frac{1}{2}\, \eta^\top \Sigma_\mathbf{w}\, \eta \quad (5)$$
$$\text{s.t.} \quad \sum_{\mathbf{y}} \alpha_i(\mathbf{y}) = C; \quad \alpha_i(\mathbf{y}) \geq 0, \ \forall i, \forall \mathbf{y}.$$
This dual problem can be directly solved using existing algorithms developed for the M3N, such as (Taskar et al., 2003; Bartlett et al., 2004). Alternatively, we can solve the following primal problem:
$$\min_{\mathbf{w}, \xi} \ \frac{1}{2}\, \mathbf{w}^\top \Sigma_\mathbf{w}^{-1}\, \mathbf{w} + C \sum_{i=1}^N \xi_i \quad (6)$$
$$\text{s.t.} \quad \mathbf{w}^\top \Delta\mathbf{f}_i(\mathbf{y}) \geq \Delta\ell_i(\mathbf{y}) - \xi_i; \ \xi_i \geq 0, \ \forall i, \forall \mathbf{y} \neq \mathbf{y}^i.$$
It is easy to show that the solution of problem (6) is the posterior mean of $\mathbf{w}$ under $p(\mathbf{w})$. The primal problem can be solved with subgradient (Ratliff et al., 2007) or extragradient (Taskar et al., 2006) methods.

Step 2: Keeping $p(\mathbf{w})$ fixed, we optimize (4) with respect to $q(\tau)$. Taking the derivative of $\mathcal{L}$ with respect to $q(\tau)$ and setting it to zero, we get $q(\tau) = \prod_{k=1}^K q(\tau_k)$. Each $q(\tau_k)$ is computed as follows:
$$q(\tau_k) \propto p(\tau_k|\lambda) \exp\big\{ \langle \log p(w_k|\tau_k) \rangle_p \big\} \propto \mathcal{N}\Big(\sqrt{\langle w_k^2 \rangle_p}\,\Big|\,0, \tau_k\Big) \exp\Big\{ -\frac{\lambda}{2}\tau_k \Big\}.$$
The normalization factor is $\int \mathcal{N}\big(\sqrt{\langle w_k^2 \rangle_p}\,\big|\,0, \tau_k\big)\, \frac{\lambda}{2} \exp\{-\frac{\lambda}{2}\tau_k\}\, d\tau_k = \frac{\sqrt{\lambda}}{2} \exp\big\{-\sqrt{\lambda \langle w_k^2 \rangle_p}\big\}$. The expectations $\langle \tau_k^{-1} \rangle_q$ required in calculating $\langle A^{-1} \rangle_q$ are:
$$\langle \tau_k^{-1} \rangle_q = \int \frac{1}{\tau_k}\, q(\tau_k)\, d\tau_k = \sqrt{\frac{\lambda}{\langle w_k^2 \rangle_p}}. \quad (7)$$

We iterate between the above two steps until convergence, and then use the posterior distribution $p(\mathbf{w})$ to make predictions. For irrelevant features, the variances should converge toward zero, leading to a sparse estimate. The intuition behind this iterative minimization algorithm is as follows: first, we use a Gaussian distribution to approximate the Laplace distribution, which yields a QP problem analogous to that of the M3N; then, the second step updates the covariance matrix in the QP problem using the exponential hyper-prior on the variances.
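The following is a self-contained toy instantiation of Algorithm 1 (a sketch under stated simplifications, not the paper's actual implementation): the inner problem (6) is solved by plain subgradient descent, as the paper notes is possible, and the "structure" is degenerate (a single position, so the loss-augmented argmax can be enumerated). The data, dimensions, step size, and names like `solve_primal` and `feat` are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, S = 40, 20, 3                       # samples, input features, labels
X = rng.normal(size=(N, D))
X[:, 10:] *= 0.01                         # make the last 10 features ~irrelevant
labels = rng.integers(0, S, size=N)
K = S * D                                 # dimension of the joint feature f(x, y)

def feat(x, y):                           # f(x, y): input features in class-y block
    f = np.zeros(K)
    f[y * D:(y + 1) * D] = x
    return f

def solve_primal(w, var, C, iters=300, lr=0.02):
    """Approximate problem (6): min (1/2) w' Sigma^{-1} w + C * sum_i hinge_i,
    by subgradient descent, with Sigma diagonal (entries in `var`)."""
    for _ in range(iters):
        g = w / var                       # gradient of the quadratic term
        for i in range(N):
            # loss-augmented argmax over the (enumerable) label space
            viol = [float(y != labels[i]) -
                    w @ (feat(X[i], labels[i]) - feat(X[i], y)) for y in range(S)]
            ystar = int(np.argmax(viol))
            if viol[ystar] > 0:           # violated margin -> hinge subgradient
                g -= C * (feat(X[i], labels[i]) - feat(X[i], ystar))
        w = w - lr * g
    return w

lam, C, T = 4.0, 1.0, 3                   # T = 3 outer iterations, as in Sec. 6
mu, var = np.zeros(K), np.ones(K)         # posterior mean; diag of Sigma^1 = I
for t in range(T):
    mu = solve_primal(mu, var, C)         # Step 1: posterior mean <w>
    second_moment = mu ** 2 + var         # diag of <w w'> = Sigma + mu mu'
    var = np.sqrt(second_moment / lam)    # Step 2: Eq. (7) covariance update

# The irrelevant block typically ends up with smaller weights (stronger shrinkage).
print("mean |w|, relevant:  ", np.abs(mu.reshape(S, D)[:, :10]).mean())
print("mean |w|, irrelevant:", np.abs(mu.reshape(S, D)[:, 10:]).mean())
```

For real sequence data, `solve_primal` would be replaced by a structured QP solver with a Viterbi-style loss-augmented argmax; only the Step 2 covariance update is specific to LapM3N.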
4.2. Insights

To see how the Laplace prior affects the posterior distribution, we carry out the following calculation. Substituting the hierarchical representation of the Laplace prior into $p(\mathbf{w})$ in Theorem 2, we get:

$$Z(\alpha) = \int\!\!\int p(\mathbf{w}|\tau)\, p(\tau|\lambda)\, d\tau \cdot \exp\{\mathbf{w}^\top \eta - L\}\, d\mathbf{w}$$
$$= \int p(\tau|\lambda) \int p(\mathbf{w}|\tau) \exp\{\mathbf{w}^\top \eta - L\}\, d\mathbf{w}\, d\tau = \exp\{-L\} \prod_{k=1}^K \frac{\lambda}{\lambda - \eta_k^2}, \quad (8)$$

where $\eta_k = \sum_{i,\mathbf{y}} \alpha_i(\mathbf{y}) \big(f_k(\mathbf{x}^i, \mathbf{y}^i) - f_k(\mathbf{x}^i, \mathbf{y})\big)$, and an additional constraint is $\eta_k^2 < \lambda, \ \forall k$; otherwise the integral is infinite. Using the result (8), we get:

$$\frac{\partial \log Z}{\partial \alpha_i(\mathbf{y})} = \mu^\top \Delta\mathbf{f}_i(\mathbf{y}) - \Delta\ell_i(\mathbf{y}), \quad (9)$$

where $\mu$ is a column vector with $\mu_k = \frac{2\eta_k}{\lambda - \eta_k^2}$, $1 \leq k \leq K$. An alternative is to use the definition of $Z$, namely $Z = \int p_0(\mathbf{w}) \cdot \exp\{\mathbf{w}^\top \eta - L\}\, d\mathbf{w}$, which gives:

$$\frac{\partial \log Z}{\partial \alpha_i(\mathbf{y})} = \langle \mathbf{w} \rangle_p^\top \Delta\mathbf{f}_i(\mathbf{y}) - \Delta\ell_i(\mathbf{y}). \quad (10)$$

Comparing Eqs. (9) and (10), we get $\langle \mathbf{w} \rangle_p = \mu$, that is, $\langle w_k \rangle_p = \frac{2\eta_k}{\lambda - \eta_k^2}$, $1 \leq k \leq K$. A similar calculation shows that in the M3N (standard normal prior), $\langle \mathbf{w} \rangle_p = \eta$.

Figure 1. Posterior mean with different priors, plotted against the M3N estimate (i.e., the standard normal prior).

Figure 1 shows the posterior mean (for any dimension $k$) when the prior is standard normal, Laplace with $\lambda = 4$, and Laplace with $\lambda = 6$. We can see that with a Laplace prior the parameters are shrunk around zero; the larger the value of $\lambda$, the greater the shrinkage effect. For a fixed $\lambda$, the shape of the posterior mean is smoothly nonlinear, but no component is explicitly discarded, that is, no weight is set exactly to zero. This differs from $L_1$-regularized maximum likelihood estimation (Kaban, 2007), where an interval exists around the origin and parameters falling into this interval are set to zero.

Note that if we use the exact integration as in Eq. (8), the dual problem D1 will maximize $L - \sum_{k=1}^K \log \frac{\lambda}{\lambda - \eta_k^2}$. Since $\eta_k$ appears inside a logarithm, the optimization problem is very hard to solve. Thus, we turn to the variational approximation method of Section 4.1.
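The shrinkage behavior shown in Figure 1 can be reproduced directly from the posterior-mean formulas just derived (a small sketch; the grid of $\eta$ values is our own choice, kept inside the feasible region $\eta_k^2 < \lambda$):

```python
import numpy as np

def posterior_mean_laplace(eta, lam):
    """<w_k> = 2*eta_k / (lam - eta_k^2) under the Laplace prior (Eqs. (9)-(10));
    only valid on the feasible region eta_k^2 < lam."""
    return 2 * eta / (lam - eta ** 2)

eta = np.linspace(-1.8, 1.8, 7)            # any single dimension k of eta
print("standard normal:", eta)             # <w_k> = eta_k, i.e., no shrinkage
for lam in (4.0, 6.0):
    print(f"Laplace lam={lam}:", np.round(posterior_mean_laplace(eta, lam), 3))
```

Near the origin the Laplace-prior mean behaves like $2\eta_k/\lambda$, so small parameters are shrunk by roughly a factor of $\lambda/2$, while no component is ever set exactly to zero, matching the discussion above.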
5. Generalization Bound

The PAC-Bayes bound (Langford et al., 2001) provides a theoretical motivation to learn an averaging model as in P1, which minimizes the KL-divergence and simultaneously satisfies the discriminative classification constraints. To apply it to our structured learning setting, we assume that the discriminant functions are bounded, that is, $F \in \mathcal{H}: \mathcal{X} \times \mathcal{Y} \to [-c, c]$ for all $\mathbf{w}$, where $c$ is a positive constant. Recall that our averaging model is $h(\mathbf{x}, \mathbf{y}) = \langle F(\mathbf{x}, \mathbf{y}; \mathbf{w}) \rangle_{p(\mathbf{w})}$. We define the margin of an example $(\mathbf{x}, \mathbf{y})$ for such a function $h$ as $M(h, \mathbf{x}, \mathbf{y}) = h(\mathbf{x}, \mathbf{y}) - \max_{\mathbf{y}' \neq \mathbf{y}} h(\mathbf{x}, \mathbf{y}')$. Clearly, the model $h$ makes a wrong prediction on $(\mathbf{x}, \mathbf{y})$ only if $M(h, \mathbf{x}, \mathbf{y}) \leq 0$. Let $Q$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, and let $\mathcal{D}$ be a sample of $N$ examples randomly drawn from $Q$. We have the following PAC-Bayes theorem.

Theorem 4 (PAC-Bayes Bound of BM3N) Let $p_0$ be any continuous probability distribution over $\mathcal{H}$ and let $\delta \in (0, 1)$. If $F \in \mathcal{H}: \mathcal{X} \times \mathcal{Y} \to [-c, c]$ for all $\mathbf{w}$, then with probability at least $1 - \delta$ over random samples $\mathcal{D}$ from $Q$, for every distribution $p$ over $\mathcal{H}$ and for all margin thresholds $\gamma > 0$:

$$\Pr_Q\big(M(h, \mathbf{x}, \mathbf{y}) \leq 0\big) \leq \Pr_{\mathcal{D}}\big(M(h, \mathbf{x}, \mathbf{y}) \leq \gamma\big) + O\Bigg( \sqrt{ \frac{\gamma^{-2}\, KL(p\|p_0)\, \ln(N|\mathcal{Y}|) + \ln N + \ln \delta^{-1}}{N} } \Bigg).$$

Here, $\Pr_Q(\cdot)$ stands for $\langle \cdot \rangle_Q$ and $\Pr_{\mathcal{D}}(\cdot)$ stands for the empirical average on $\mathcal{D}$. The proof follows the same structure as the original PAC-Bayes bound proof, with consideration of the margins. Due to space limits, details of the proof are given in the extended paper.

6. Experiments

In this section, we present empirical results for LapM3N on both synthetic and real data sets. We compare LapM3N with M3N, CRFs, $L_1$-regularized CRFs ($L_1$-CRFs), and $L_2$-regularized CRFs ($L_2$-CRFs). We use the quasi-Newton method of (Andrew & Gao, 2007) to learn $L_1$-CRFs.

6.1. Synthetic Data Sets

6.1.1. I.I.D. Features

The first experiment is conducted on synthetic sequence data with 100 i.i.d. features. We generate three types of data sets, with 10, 30, and 50 relevant features respectively. For each setting, we randomly generate 10 linear-chain CRFs with 8 binary labeling states. The feature functions include a real-valued state-feature function over a one-dimensional input feature and a class label, and 4 ($2 \times 2$) binary transition-feature functions capturing pairwise label dependencies. For each model we generate a data set of 1000 samples. For each sample, we first independently draw the 100 features from a standard normal distribution, and then apply a Gibbs sampler with 5000 iterations to assign a label sequence.

For each data set, we randomly draw a part as training data and use the rest for testing. The numbers of training samples are 30, 50, 80, 100, and 150. The QP problem is solved with the exponentiated gradient method (Bartlett et al., 2004). In all the following experiments, the regularization constant of $L_1$-CRFs and $L_2$-CRFs is chosen from $\{0.01, 0.1, 1, 4, 9, 16\}$ by 5-fold cross-validation during training. For LapM3N, we use the same method to choose $\lambda$ from 20 roughly evenly spaced values between 1 and 268. For each setting, the average over the 10 data sets is the final performance.

Figure 2. Evaluation results on data sets with i.i.d. features.

The results are shown in Figure 2. All the results of LapM3N are achieved with 3 iterations of variational learning. Under the different settings, LapM3N consistently outperforms M3N and performs comparably with $L_1$-CRFs; note, however, that the synthetic data come from simulated CRFs. Both $L_1$-CRFs and $L_2$-CRFs outperform the un-regularized CRFs. One interesting result is that M3N and $L_2$-CRFs perform comparably. This is reasonable because, as derived by Lebanon and Lafferty (2001) and noted by Globerson et al. (2007), the $L_2$-regularized MLE of CRFs has a convex dual similar to that of the M3N; the only difference is the loss they optimize: CRFs optimize the log-loss while M3N optimizes the hinge-loss. As the number of training samples increases, all the algorithms consistently achieve higher performance. The advantage of LapM3N is more obvious when there are fewer relevant features.

6.1.2. Correlated Features

In reality, most data sets contain redundancy, and features are usually correlated. We therefore evaluate our models on synthetic data sets with correlated features. We follow a procedure similar to that for the i.i.d. data sets to first generate one linear-chain CRF model. Then we use the CRF model to generate 10 data sets, in which each sample has 30 relevant features. The 30 relevant features are partitioned into 10 groups. For the features in each group, we first draw a real value from a standard normal distribution and then "spoil" it with random Gaussian noise (zero mean, standard deviation 0.05) to obtain 3 correlated features, as sketched below.
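A minimal sketch of this correlated-feature construction (the array names are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_groups, per_group = 1000, 10, 3

base = rng.normal(size=(n_samples, n_groups, 1))                    # one draw per group
noise = rng.normal(scale=0.05, size=(n_samples, n_groups, per_group))
relevant = (base + noise).reshape(n_samples, n_groups * per_group)  # 30 correlated features

# Features within a group share the base draw, so they are highly correlated:
print(np.corrcoef(relevant[:, 0], relevant[:, 1])[0, 1])            # ~0.997
```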
Here and in all the remaining experiments, we use the subgradient method (Ratliff et al., 2007) to solve the QP problem in both M3N and LapM3N, with the learning rate and complexity constant suggested by the authors, that is, $\eta_t = \frac{1}{2\beta\sqrt{t}}$ and $C = 200\beta$, where $\beta$ is a parameter we introduce to adjust $\eta_t$ and $C$. We do K-fold cross-validation on each data set and take the average over the 10 data sets as the final result. As in (Taskar et al., 2003), in each run we choose one part for training and test on the remaining K-1 parts. We vary K over 20, 10, 7, 5, and 4; in other words, we use 50, 100, about 150, 200, and 250 samples during training. We use the same grid search to choose $\lambda$ and $\beta$, from $\{9, 16, 25, 36, 49, 64\}$ and $\{1, 10, 20, 30, 40, 50, 60\}$ respectively.

Figure 3. Results on data sets with 30 relevant features.

The results are shown in Figure 3. We reach the same conclusions as in the previous experiments.

6.2. Real-World OCR Data Set

The OCR data set is partitioned into 10 subsets for 10-fold CV (Taskar et al., 2003; Ratliff et al., 2007). We randomly select N samples from each fold for our experiments, varying N over 100, 150, 200, and 250, and denote the resulting data sets by OCR100, OCR150, OCR200, and OCR250 respectively. With $\beta = 4$ on OCR100 and OCR150, $\beta = 2$ on OCR200 and OCR250, and $\lambda = 36$, the results are shown in Figure 4.

Figure 4. Evaluation results on the OCR data set with different numbers of selected samples.

Overall, as the number of training samples increases, all algorithms achieve lower error rates and smaller variances. Generally, LapM3N consistently outperforms all the other models. M3N outperforms the standard, non-regularized CRFs and the $L_1$-CRFs. Again, $L_2$-CRFs perform comparably to M3N. This is a bit surprising but still reasonable, given that they differ only in their loss functions (Globerson et al., 2007). By examining the prediction accuracy, we observe obvious over-fitting in CRFs and $L_1$-CRFs. In contrast, $L_2$-CRFs are very robust. This is because, unlike in the synthetic data sets, features in real-world data are usually not completely irrelevant. In this case, setting small weights to zero, as $L_1$-CRFs do, hurts generalization ability and also leads to instability with respect to the regularization constants, as shown below. Instead, $L_2$-CRFs do not set small weights to zero but shrink them towards zero, as does LapM3N. The non-regularized MLE can also easily lead to over-fitting.

6.3. Sensitivity to Regularization Constants

Figure 5. Error rates of different models on OCR100 with different regularization constants. From left to right, the regularization constants are 0.0001, 0.001, 0.01, 0.1, 1, 4, 9, 16, and 25 for $L_1$-CRFs and $L_2$-CRFs; for M3N and LapM3N they are 1, 4, 9, 16, 25, 36, 49, 64, and 81.

Figure 5 shows the error rates of the different models on OCR100. From the results, we can see that $L_1$-CRFs are very sensitive to the regularization constants, whereas $L_2$-CRFs, M3N, and LapM3N are much less sensitive. Among all the models, LapM3N is the most stable. The stability of LapM3N is due to its posterior weighting, as opposed to the hard-thresholding of small weights to zero in $L_1$-CRFs.

7. Conclusions

We proposed a Structured Maximum Entropy Discrimination formalism for Bayesian max-margin learning in structured prediction. This formalism gives rise to a general class of Bayesian M3Ns and subsumes the standard M3N as a special case, where the predictive model is assumed to be linear and the parameter prior is a standard normal. We showed that the adoption of a Laplace prior for the parameters leads to a Laplace M3N that enjoys the properties expected of a sparsified Bayesian M3N. Unlike $L_1$-regularized MLE, which sets small weights to zero to achieve sparsity, LapM3N weights the parameters a posteriori: features with smaller weights are shrunk more. This posterior weighting effect makes LapM3N more stable with respect to the magnitudes of the regularization coefficients and more generalizable.

Acknowledgements

We thank Ivor Tsang for inspiring discussions. This work was conceived and completed while J.Z. was a visiting researcher at CMU under a State Scholarship from China, with support from NSF DBI-0546594 and DBI-0640543 awarded to Eric Xing. J.Z. and B.Z. are also supported by Chinese NSF Grant 60321002 and the National Key Foundation R&D Projects 2004CB318108 and 2007CB311003.

References

Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. ICML.

Andrew, G., & Gao, J. (2007). Scalable training of L1-regularized log-linear models. ICML.

Bartlett, P., Collins, M., Taskar, B., & McAllester, D. (2004). Exponentiated gradient algorithms for large-margin structured classification. NIPS.

Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 23-34.

Chan, A. B., Vasconcelos, N., & Lanckriet, G. R. G. (2007). Direct convex relaxations of sparse SVM. ICML.

Dudík, M., Phillips, S. J., & Schapire, R. E. (2007). Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 1217-1260.

Figueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Trans. on PAMI, 25, 1150-1159.

Globerson, A., Koo, T. Y., Carreras, X., & Collins, M. (2007). Exponentiated gradient algorithms for log-linear structured prediction. ICML.

Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination. NIPS.

Kaban, A. (2007). On Bayesian classification with Laplace priors. Pattern Recognition Letters, 28, 1271-1282.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML.

Langford, J., Seeger, M., & Megiddo, N. (2001). An improved predictive accuracy bound for averaging classifiers. ICML.

Lebanon, G., & Lafferty, J. (2001). Boosting and maximum likelihood for exponential models. NIPS.

Lee, S.-I., Ganapathi, V., & Koller, D. (2006). Efficient structure learning of Markov networks using L1-regularization. NIPS.

Qi, Y. A., Szummer, M., & Minka, T. P. (2005). Bayesian conditional random fields. AISTATS.

Ratliff, N. D., Bagnell, J. A., & Zinkevich, M. A. (2007). (Online) subgradient methods for structured prediction. AISTATS.

Taskar, B., Guestrin, C., & Koller, D. (2003). Max-margin Markov networks. NIPS.

Taskar, B., Lacoste-Julien, S., & Jordan, M. I. (2006). Structured prediction via the extragradient method. NIPS.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 267-288.

Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. ICML.

Wainwright, M. J., Ravikumar, P., & Lafferty, J. (2006). High-dimensional graphical model selection using L1-regularized logistic regression. NIPS.