On Primal and Dual Sparsity of Markov Networks Jun Zhu jun-zhu@mails.tsinghua.edu.cn Eric P. Xing epxing@cs.cmu.edu Dept. of Comp. Sci & Tech, TNList Lab, Tsinghua University, Beijing 100084 China School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA Abstract Sparsity is a desirable property in high dimensional learning. The 1 -norm regularization can lead to primal sparsity, while max-margin methods achieve dual sparsity. Combining these two methods, an 1 -norm max-margin Markov network ( 1 -M3 N) can achieve both types of sparsity. This paper analyzes its connections to the Laplace maxmargin Markov network (LapM3 N), which inherits the dual sparsity of max-margin models but is pseudo-primal sparse, and to a novel adaptive M3 N (AdapM3 N). We show that the 1 -M3 N is an extreme case of the LapM3 N, and the 1 -M3 N is equivalent to an AdapM3 N. Based on this equivalence we develop a robust EM-style algorithm for learning an 1 -M3 N. We demonstrate the advantages of the simultaneously (pseudo-) primal and dual sparse models over the ones which enjoy either primal or dual sparsity on both synthetic and real data sets. For domains with complex feature spaces, it is often desirable to pursue a sparse representation of the model that leaves out irrelevant features. We say a model enjoys primal sparsity, if only very few features in the original model have non-zero weights. The term "primal" stems from a convention in the optimization literature, which generally refers to (constrained) problems pertaining to the original model. Primal sparsity is important for selecting significant features and for reducing the risk of over-fitting. The bet-on-sparsity principle (Friedman et al., 2004) suggests that one should prefer the model that does well in sparse problems. In likelihood-based estimation, sparse model fitting has been extensively studied. A common strategy is to add an 1 -penalty of feature weights to the log-likelihood function. Due to the singularity of 1 -norm at the origin (Tibshirani, 1996), 1 -norm regularization can lead to primal sparse estimates. Recent work on structure learning of graphical models (Lee et al., 2006; Wainwright et al., 2006) falls into this paradigm. Another type of sparsity, as enjoyed by large margin models, like the unstructured SVM and the structured M3 N, is the dual sparsity, which refers to a phenomenon that only a few Lagrangian multipliers in the dual form of the original model turn out to be nonzero. When a model is dual sparse, its decision boundary depends only on a few number of support vectors, which in principle leads to a robust decision boundary. Moreover, the dual sparsity provides a theoretical motivation of the cutting-plane algorithms (Tsochantaridis et al., 2004) and the bundle methods (Smola et al., 2007), which generally explore the fact that in max-margin models only very few (e.g., polynomial) number of constraints are sufficient to achieve a good enough solution. Unfortunately, although both primal and dual sparsity can benefit structured prediction models, they usually do not co-exist. For example, the powerful M3 N is not primal sparse, because it employs an 2 -norm penalty that cannot automatically select significant features. One natural way to bring these two types of spar- 1. Introduction Learning structured prediction models, which explicitly explore the structural dependencies among input features (e.g., text sequences, DNA strings) and structural interpretational outputs (e.g., parsing trees, gene annotations), has gained substantial popularity in data mining, machine intelligence, and scientific discovery. Based on different learning paradigms, major instances of such models include the conditional random fields (CRFs) (Lafferty et al., 2001) based on maximum conditional likelihood estimation, and maxmargin Markov networks (M3 Ns) (Taskar et al., 2003) or structural SVMs (Altun et al., 2003; Tsochantaridis et al., 2004) based on max-margin learning. Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s). On Primal and Dual Sparsity of Markov Networks sity together is to build an 1 -norm regularized large margin model. In the structured learning setting, 1 norm regularized max-margin Markov networks ( 1 M3 N) can be formulated, following the same spirit of the unstructured 1-norm SVM (Zhu et al., 2004). Another approach that attempts to achieve both primal and dual sparsity is the recently proposed Laplace max-margin Markov networks (LapM3 N) (Zhu et al., 2008b), which inherit the dual sparseness of maxmargin models. However, since the posterior shrinkage effect as shown in (Zhu et al., 2008b) is smooth, LapM3 N is pseudo-primal sparse (i.e., only very few input features have large weights) and does not explicitly select features by setting the weights of irrelevant features to zeros. This paper presents the 1 -M N and a novel adaptive M3 N, and analyzes their close connections to the LapM3 N from both theoretical and algorithmic perspectives. In the theoretical aspect, we show that 1 M3 N is an extreme case of LapM3 N when the regularization constant of the entropic regularizer goes to infinity and LapM3 N is a smooth relaxation of 1 -M3 N. We also show that 1 -M3 N is equivalent to an adaptive M3 N. In the algorithmic aspect, based on the equivalence between 1 -M3 N and an adaptive M3 N, we develop a novel EM-style algorithm to learn an 1 -M3 N. The robust algorithm has the same structure as the variational algorithm of LapM3 N (Zhu et al., 2008b) and helps uncover the difference between 1 -M3 N and LapM3 N. Finally, we present empirical studies comparing 1 -M3 N and LapM3 N with competing models, which enjoy either primal or dual sparsity but not both, on both synthetic and real data sets. The rest of the paper is structured as follows. Sec. 2 presents problem formulations, including the 1 -M3 N and an adaptive M3 N. Sec. 3 presents the theoretical connections, while Sec. 4 discusses the algorithmic connection. Sec. 5 presents our empirical studies, and Sec. 6 concludes this paper. 3 the risk: R(h) = E(x,y)P [ (h(x), y)], where is a non-negative loss function, e.g., the hamming loss: l (^ , y) = j=1 I(^j = yj ), where I(·) is an indicay y tor function that equals one if the argument holds and zero otherwise. (^ , y) measures the loss of the prey ^ diction y when the true prediction is y. Since the true distribution P is unknown, empirical risk is used as an approximation of the risk. Given a set of training data D = {(xi , yi )}N , drawn i.i.d from P , the emi=1 N 1 pirical risk is: Remp (h) = N i=1 (h(xi ), yi ). To avoid over-fitting, one method is to minimize the regularized empirical risk: (h) + Remp (h), where (h) is a regularizer, and is a regularization parameter. 2.1. 2 -norm Max-Margin Markov networks Let F ( . ; w) : X × Y R be a parametric discriminant function over the input-output pairs. The max-margin Markov networks define a predictive rule as an optimization problem: h0 (x; w) = arg max F (x, y; w). yY (1) A common choice of F is a linear model, where F is defined by a set of K feature functions fk : X ×Y R and their weights wk : F (x, y; w) = w f (x, y). Consider the general case where errors are allowed in the training data. To learn such a prediction rule h0 , the "margin re-scaling" 2 -norm M3 N (Taskar et al., 2003) minimizes a regularized structured hinge loss: min w w 2 2 + Rhinge (w), (2) 1 where Rhinge (w) i maxyY [ i (y) - N w fi (y)], of which i (y) = (y, yi ) and fi (y) = f (xi , yi ) - f (xi , y). w fi (y) is the "margin" favored by the true label yi over a prediction y. Since maxyY [ i (y) - w fi (y)] (y, yi ) for w fi (y) 0, Rhinge (w) is an upper bound of the empirical risk of the prediction rule (1). The problem (2) can be equivalently formulated as a constrained optimization problem: P0 (M3 N) : s.t. i, y = yi : min w, 2. Problem Formulation Structured output classification aims to learn a predictive function h : X Y, where X is the subspace of inputs and Y = Y1 × · · · × Yl is the space of outputs, which are multivariate and structured. For part-ofspeech (POS) tagging, Yi consists of all the POS tags, and each input x is a word sequence and each label y = (y1 , · · · , yl ) is a sequence of POS tags. We assume a finite number of feasible outputs for any input. In supervised learning, where input-output pairs (x, y) are drawn i.i.d. from a distribution P (X, Y), the goal is to find an h from a hypothesis space that minimizes 1 w 2 N 2 2 +C i=1 i w fi (y) i (y) - i ; i 0, where i is a slack variable absorbing errors in training data and C is a positive constant. P0 is a convex program and satisfies the Slater's condition. Due to the KKT condition, 2 -norm M3 N enjoys the dual sparsity, i.e., only a few lagrange multipliers are nonzero, which correspond to the active constraints whose equality holds, analogous to the support vectors in SVM. However, due to the differentiability of 2 -norm, the 2 -norm M3 N is not primal sparse. On Primal and Dual Sparsity of Markov Networks Exploring sparse dependencies among individual labels in y, efficient optimization algorithms based on cutting-plane (Tsochantaridis et al., 2004), messagepassing (Taskar et al., 2003), or gradient descent (Bartlett et al., 2004; Ratliff et al., 2007) have been proposed to (approximately) solve P0. 2.2. Laplace Max-Margin Markov Networks Unlike the M N, which performs point-estimate to predict based on a single rule F ( . ; w), the Laplace max-margin Markov networks (LapM3 N) (Zhu et al., 2008b) approach the structured prediction problem by performing Bayesian-style learning under the general structured maximum entropy discrimination formalism (Zhu et al., 2008b), which facilitates a Bayes-style prediction by averaging F ( . ; w) over a posterior distribution of rules p(w): h1 (x) = arg max yY Due to the KKT condition, the above solution enjoys the dual sparsity as in M3 N. Thus, MaxEnDNet enjoys a similar generalization property as the M3 N and SVM due to the small "effective size" of the margin constraints. In fact, as shown in (Zhu et al., 2008b), 3 2 -norm M N is a Gaussian MaxEnDNet where the prior is standard normal and U () = C i i . The Laplace max-margin Markov network (LapM3 N) is a Laplace MaxEnDNet by using a heavy tailed Laplace prior, which encodes the prior belief that the distribution of w is strongly peaked around zero. Since the KL-divergence is differentiable, the resulting posterior shrinkage effect in LapM3 N, as shown in (Zhu et al., 2008b), is smooth. Thus, in the input feature space, LapM3 N is pseudo-primal sparse, i.e., only a few elements in w have large values. This pseudo-primal sparsity makes LapM3 N enjoy nice robust properties and in many cases as we shall see LapM3 N can perform as well as a primal sparse M3 N, as presented below. The robustness of KL-regularization is also demonstrated in sparse coding (Bradley & Bagnell, 2008). Below, we introduce two novel formulations of sparse M3 Ns and then analyze their connections. 2.3. 1 -norm 3 p(w)F (x, y; w) dw , (3) where p(w) is estimated by learning a maximum entropy discrimination Markov network (MaxEnDNet, or MEDN) (Zhu & Xing, 2009). In the same spirit of the structured hinge loss of M3 N, the empirical risk of the averaging model (3) is upper bounded by the expected structured hinge loss: Rhinge (p(w)) = 1 N max i yY Max-Margin Markov Networks p(w)[ i (y) - Fi (y; w)] dw . MaxEnDNet minimizes a regularized structured hinge loss as in (2) and uses the KL-divergence with a prior to regularize p(w), i.e., (p(w)) = KL(p(w) p0 (w)). Similar to (2), the MaxEnDNet can be equivalently formulated as a constrained optimization problem: P1 (MEDN) : min KL(p(w)||p0 (w)) + U () p(w), To introduce the primal sparsity in max-margin Markov networks in a more direct way, we propose to use the 1 -norm of the model parameters in the regularized hinge loss minimization framework (2). Therefore, the 1 -M3 N is formulated as follows, P2 ( 1 -M3 N) : s.t. i, y = yi : min w, 1 w 2 N 1 +C i=1 i w fi (y) i (y) - i ; i 0, 1 -norm. where . 1 is the Another equivalent formu- s.t. i, y : p(w)[Fi (y; w) - i (y)] dw -i ; i 0, lation1 , which is useful in the subsequent analysis and where U () is a closed proper convex function over slack variables , e.g., U () = C i i . U is also known as a potential term in the maximum entropy principle. The problem P1 is a convex program, and satisfies the Slater's condition. As showin in (Zhu et al., 2008b), the optimum solution of P1 is: p(w) = 1 p0 (w) exp{ i (y)[Fi (y; w) - i (y)]}, Z() i,y algorithm development, is as follows: P2 : s.t. min Rhinge (w) w w 1 where Z() is a normalization factor and the lagrange multipliers i (y) (corresponding to constraints in P1) can be obtained by solving the following dual problem: D1 : max - log Z() - U () s.t. i (y) 0, i, y, where U (·) is the conjugate of the slack function U (·), i.e., U () = sup i,y i (y)i - U () . Unlike 2 -norm, the 1 -norm is not differentiable at the origin. This singularity property ensures that the 3 1 -M N is able to remove noise features by estimating their weights to be exactly zero. When the feature space is high dimensional and has many noise features, the 2 -norm M3 N will suffer a poor generalization ability caused by these noise features. Thus, the 1 -norm M3 N, or the closely related pseudo-sparse LapM3 N, would be a better choice in this scenario. Moreover, the primal sparse 1 -M3 N is of great interest itself like the 1-norm SVM because it can automatically select significant features in max-margin Markov networks. 1 See (Taskar et al., 2006) for transformation techniques. On Primal and Dual Sparsity of Markov Networks To learn an 1 -M3 N, various methods can be applied, such as the sub-gradient method (Ratliff et al., 2007) with a projection to an 1 -ball (Duchi et al., 2008) based on the P2 formulation, and the cutting-plane method (Tsochantaridis et al., 2004) with an LP solver to solve the generated LP sub-problems based the formulation P2. Our empirical studies show that both of these algorithms are sensitive to their regularization constants. We will develop a novel EM-style algorithm, which is robust and helps uncover the connection and difference between 1 -M3 N and LapM3 N. 2.4. Adaptive Max-Margin Markov Networks Our EM-style algorithm for the 1 -M3 N is developed based on an equivalence between the 1 -M3 N and an adaptive M3 N, which is defined as follows: N Corollary 1 Assuming F (x, y; w) = w f (x, y), U () = C i i , and p0 (w) = N (0, I), the mean µ of the posterior distribution p(w) under the MaxEnDNet is achieved by solving the following problem: min µ, 1 µ µ+C 2 N i i=1 s.t. i, y = yi : µ fi (y) i (y) - i ; i 0. Since in linear models the averaging prediction rule h1 is determined by the posterior mean, the above corollary shows a reduction of MaxEnDNet to M3 N. This result is complementary to the reduction theorem in (Zhu et al., 2008b), which considers dual problems. 3.1. LapM3 N v.s. 3 1 -M N P3 (AdapM N) : s.t. i, y = yi : k : 3 w,, min w -1 w+C i=1 i , By using the Laplace prior in MaxEnDNet, we can get the following theorem for LapM3 N. Theorem 2 Assuming F (x, y; w) = f (x, y), w K U () = C i i , and p0 (w) = 2 e- w , the mean µ of the posterior distribution p(w) under the MaxEnDNet is obtained by solving the following primal problem: min µ, w fi (y) i (y) - i ; i 0 1 K K k = k=1 1 ; k 0. where = diag( ). The rationale behind P3 is that: by adaptively penalizing different components, the coefficients of irrelevant features can be shrunk to zero, i.e., the corresponding go to zero. The same idea have been explored in Automatic Relevance Determination (Qi et al., 2004) and sparse Bayesian learning (Tipping, 2001). The mathematical intuition is from a two-layer interpretation of the Laplace prior (Figueiredo, 2003), namely, a univariate Laplace distribution p(w) = 2 e- |w| is equivalent to a Gaussian-exponential model, where w is a zero-mean normal p(w| ) = N (w|0, ) and the variance has an exponential hyper-prior: p( |) = 2 exp{- 2 }, for 0. Therefore, the quadratic term -1 of w w in P3 is from the first layer Gaussian dis1 1 tribution, and the constraint K k k = is from the second-layer hyper-prior, because the mean of the exK 1 1 ponential hyper-prior is E[ ] = and K k=1 k is an empirical estimate of E[ ]. Thus, the first-order K 1 1 constraint K k=1 k = can be seen as a relaxation of the exponential hyper-prior. K ( k=1 µ2 + k 1 1 - log µ2 + 1 + 1 k )+C 2 N i i=1 s.t. µ fi (y) i (y) - i ; i 0, i, y = yi . 2 µk +1+1 K 1 1 Since the term ( µ2 + - log ) k=1 k 2 corresponds to the KL-divergence between p(w) and p0 (w) under a Laplace MaxEnDNet, we will refer to it as a KL-norm2 and denote it by µ KL in the sequel. This KL-norm is different from the 2 -norm, but is closely related to the 1 -norm, which encourages a sparse estimator due to its singularity at the origin. Specifically, we have the following corollary: Corollary 3 The LapM3 N yields the same estimate as the 1 -M3 N when . To prove Corollary 3, we note that as goes to infinity, the logarithm terms in µ KL disappear because of the fact that log x 0 when x . Thus, the KL-norm x µ KL approaches µ 1 , i.e., the 1 -norm, as . This means that the LapM3 N will be (nearly) the same as the 1 -M3 N if the regularization constant is large enough. In (Zhu et al., 2008b), a posterior shrinkage effect is shown based on the exact computation of the 2 This is not exactly a norm because the positive scalability does not hold. However, by using the inequality ex 3. Theoretical Connections In this section, we show the theoretical connections of three variants of sparse max-margin Markov networks. We show that LapM3 N is a smooth relaxation of 1 M3 N; 1 -M3 N is an extreme case of LapM3 N; and 1 M3 N is equivalent to an adaptive M3 N. Due to space limitation, the proofs are deferred to a longer version. We begin with the special case of Gaussian MaxEnDNet, for which we have the following corollary: 1 + x, we can show: k, ( µ2 + k 1 - 1 log µ2 +1+1 k ) 2 is monotonically increasing with respect to µ2 and µ KL k K/ , where the equality holds only when µ = 0. Thus, µ KL penalizes large weights. For convenient comparison with the popular 2 and 1 norms, we call it a KL-norm. On Primal and Dual Sparsity of Markov Networks 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 -1 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 -1 Lambda(4) Lambda(100) Lambda(1000) -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 Basically, our proof follows a similar technique as in (Grandvalet, 1998), where an equivalence between adaptive regression and the 1 -regularized least square regression (LASSO) (Tibshirani, 1996) is proved. 4. EM-Style Learning of 1 -M 3 N -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 Figure 1. (Left) 2 -norm (solid line) and 1 -norm (dashed line); (Right) KL-norm with different Laplace priors. normalization factor. Theorem 2 and Corollary 3 offer another perspective of how the pseudo-primal sparse LapM3 N relates to the primal sparse 1 -M3 N. A more explicit illustration of the entropic regularization under a LapM3 N, comparing to the 1 and 2 regularization over an M3 N, can be seen in Figure 1, where the feasible regions due to the three different norms used in the regularizer are plotted in a two dimensional 2 2 space. Specifically, it shows (1) 2 -norm: w1 + w2 1; (2) 1 -norm: |w1 | + |w2 | 1; and (2) KL-norm3 : 2 2 2 w1 + 1/ w2 + 1/ - (1/ ) log( w1 + 1/2 + + 2 1/2) - (1/ ) log( w1 + 1/2 + 1/2) b, where b is a parameter to make the boundary pass the (0, 1) point for easy comparison with the 2 and It is easy show that b equals to to 1 curves. 1/ + 1 + 1/ - (1/ ) log( + 1/2 + 1/2). It can be seen that the 1 -norm boundary has sharp turning points when it passes the axes, whereas the 2 and KL-norm boundaries turn smoothly at those points. This is the intuitive explanation of why the 1 -norm directly gives sparse estimators, whereas the 2 -norm and KL-norm due to a Laplace prior do not. But as shown in Figure 1, when the gets larger and larger, the KL-norm boundary moves closer and closer to the 1 -norm boundary. When , 2 2 2 w1 + 1/ + w2 + 1/ - (1/ ) log( w1 + 1/2 + 2 1/2) - (1/ ) log( w1 + 1/2 + 1/2) |w1 | + |w2 | and b 1, which yields exactly the 1 -norm in the two dimensional space. Thus, under the linear model assumption of the discriminant functions F ( · ; w), the MaxEnDNet with a Laplace prior (i.e., the LapM3 N) can be seen as a smooth relaxation of the 1 -M3 N. 3.2. 3 1 -M N Based on the theorem 4, we develop a novel algorithm to approximately solve the 1 -M3 N. As we shall see, this algorithm provides another perspective on the connection and difference between the 1 -M3 N and LapM3 N. The algorithm iteratively solves the following two steps until a local optimum is arrived: Step 1: keep fixed, optimize P3 over (w, ). This is an 2 -norm M3 N problem: N min w -1 w + C w, i=1 i s.t. i, y = yi : w fi (y) i (y) - i ; i 0, Step 2: keep (w, ) fixed, optimize P3 over . The problem reduces to: min w -1 w, s.t. k : 1 K K k = k=1 1 ; k 0. By forming a Lagrangian and doing some algebra, it is easy to show that the solution is: k : k = K|wk | k |wk | (4) Note that when wk = 0, k = 0 and the corresponding feature will be discarded in the final estimate. 4.1. Connection to the Variational LapM3 N As shown in (Zhu et al., 2008b), exact calculation leads to a normalization factor of LapM3 N that couples all the dual variables. Thus, the problem of LapM3 N is hard to be directly optimized. In (Zhu et al., 2008b), an efficient variational algorithm was developed to learn the LapM3 N, as recaped below. Let p(w| ) = k=1 p(wk |k ), p( |) = k=1 p(k |) and d d1 · · · dK , then the multivariate independent Laplace prior is p0 (w) = p(w| )p( |) d by the two-layer interpretation of a Laplace distribution. By applying the Jensen's inequality, an upper bound of the KL-divergence in LapM3 N is achieved, L(p(w), q( )) = -H(p) - q( ) log p(w| )p( |) d q( ) p, K K is an Adaptive M3 N For the 1 -M3 N and adaptive M3 N, we have the following equivalence theorem: Theorem 4 The AdapM N yields the same estimate as the 1 -M3 N. The curves are drawn with a symbolic computational package to solve an equation of the form: 2x - log x = a, where x is the variable to be solved and a is a constant. 3 3 where q( ) is a variational distribution to approximate p( |). Substituting this upper bound for the KL in LapM3 N, the variational method alternatively optimizes over (p(w), ) and q( ) in two steps: On Primal and Dual Sparsity of Markov Networks Step 1: solve the following problem to get the posterior mean µ of w: N min µ, i drawn from the conditional distribution p(y|x) defined by the CRF based on a Gibbs sampler. Due to space limitation, we only report the results on the data sets with correlated input features. Conclusions in the i.i.d case are the same. Specifically, we set d = 100 and 30 input features are relevant to the output. The 30 relevant features are partitioned into 10 groups. For the features in each group, we first draw a real-value from a standard normal distribution and then corrupt the feature with a random Gaussian noise (zero mean and standard variance 0.05) to get 3 correlated features. Then, we generate 10 linearchain CRFs with 8 binary states (i.e., L = 8 and Yl = {0, 1}). The feature functions include: 200 real valued state-feature functions, of which each is over a one-dimensional input feature and a class label; and 4 (2 × 2) transition feature functions capturing pairwise label dependencies. Each CRF is used to generate a data set that contains 1000 instances. We do K-fold cross-validation on each data set and take the average over the 10 data sets as the final results. In each run we choose one part to do training and test on the rest K -1 parts. K is changed from 20, 10, 7, 5, to 4. In other words, we use 50, 100, about 150, 200, and 250 samples during the training. Figure 3(a) shows the performance. We can see that the primal sparse models (i.e., 1 -M3 N and 1 -CRFs) outperform the M3 N, which is only dual sparse, when the underlying model is primal sparse. As we have shown, the pseudo-sparse LapM3 N is a smooth relaxation of the 1 -M3 N. If we choose a large regularization constant, LapM3 N will shrink the weights of irrelevant features to be extremely small. Thus, the LapM3 N performs similarly to the primal-sparse models. Figure 3(b) shows the average weights of different models doing 10-fold CV on the first data set and the weights of the CRF model (first plot) that generates this data set. For LapM3 N and M3 N, all the weights are non-zero, although the weights of LapM3 N are generally much smaller than those of M3 N because of a shrinkage effect (Zhu et al., 2008b). For 1 -M3 N and 1 -CRFs, the estimates are sparse. Both of them can discard all the noise features when choosing an appropriate regularization constant. As shown in (Zhu et al., 2008b), 1 -CRFs are very sensitive to the regularization constant. As we shall see the 1 -M3 N with the EM-style algorithm is very robust. Note that all the models have quite different average weights from the model that generates the data. This is because we use a stochastic procedure (i.e., Gibbs sampler) to assign labels to the generated data samples. In fact, if we use the model that generates the data to pre- 1 µ -1 µ + C 2 i i=1 s.t. i, y = y : µ fi (y) i (y) - i ; i 0, -1 where = diag( k onal matrix. -1 q ) = ww p - µµ q is a diag- -1 Step 2: update the expectations k as follows, (old) q k : 1 k new q = 2 wk = p -1 µ2 + 1/ k k . (5) It is obvious that the difference between the EM-style learning of the 1 -M3 N and the variational learning of the LapM3 N is the second step in updating the adaptive parameters, i.e., in 1 -M3 N and -1 -1 in q LapM3 N. The different update rules reflet the essential difference between the 1 -M3 N and the LapM3 N. In Eq. (4), if wk = 0, then k = 0. That means the corresponding feature will be discarded in the final estimate. However, in the LapM3 N, the update rule -1 (5) ensures that k -1 are always positive. Thereq 3 fore, LapM N does not explicitly discard features even though the variances can be very small. This observation (approximately) explains why the 1 -M3 N is primal sparse, while LapM3 N is pseudo-primal sparse. Figure 2 summarizes the relationships among the three variants of sparse M3 N. Basically, (1) the 1 Figure 2. Relationships. M3 N is equiva3 lent to the adaptive M N; (2) 1 -M3 N is an extreme case of LapM3 N when ; (3) LapM3 N is a smooth relaxation of 1 -M3 N; and (4) LapM3 N and adaptive M3 N share a similar EM-style algorithm. 5. Experiments This section presents some empirical studies of the 1 M3 N and LapM3 N, compared with competing methods, including the primal sparse 1 -norm regularized CRFs and the dual sparse 2 -norm M3 N. 5.1. Evaluation on Synthetic Data We follow the method as described in (Zhu et al., 2008b) to do the experiments. We generate sequence data sets, i.e., each input x is a sequence (x1 , · · · , xL ), and each component xl is a d-dimensional vector of input features. The synthetic data are generated from pre-specified CRF models with either i.i.d. instantiations of the input features or correlated instantiations of the input features, from which samples of the structured output y, i.e., a sequence (y1 , · · · , yL ), can be On Primal and Dual Sparsity of Markov Networks 0.3 L1-CRFs M3N L1-M3N LapM3N 1 0.5 0 0 20 40 60 80 100 120 140 160 180 200 Table 1. The number of non-zero average weights by different algorithms doing 10-fold CV on the first data set. Proj-Subgradient 0 20 40 60 80 100 120 140 160 180 200 0.28 W LapM3N 0.1 0 -0.1 0.2 Irrelevant 0.26 Error Rate M3N Total C Cutting Plane Irrelevant 0.5 138 198 0.5 0 13 0.036 140 200 0.7 136 196 0.7 0 21 0.069 140 198 0.87 140 200 0.87 2 26 0.14 128 182 1 140 200 1 16 44 0.21 108 158 1.41 140 200 1.41 103 145 0.35 48 90 1.73 140 200 1.73 122 169 0.5 18 54 2 140 200 2 134 184 0.64 2 34 3 140 200 3 139 186 0.78 0 32 4 140 200 4 140 189 0.93 0 28 0.24 0 -0.2 0 20 40 60 80 100 120 140 160 180 200 Total 14000 0.22 L1-M3N 0.2 0 -0.2 0.2 0 -0.2 0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200 EM Alg. Irrelevant 0.18 L1-CRF 0.2 Total 0 100 200 300 Size of training data State Feature Functions (a) (b) Figure 3. (a) Error rates and (b) average weights of different models. Proj-Subgradient 0.4 0.4 0.35 0.3 0.25 0.2 Cutting Plane 0.4 0.35 0.3 0.25 0.2 EM Alg. 0.35 0.3 0.25 0.2 0 2 4 0 2 4 0 0.5 1 15 600 65 Training Time of (Total) all the state-feature functions and (Irrelevant) the state-feature functions based on irrelevant input features. In EM, we set = 0 if it is less than 10-4 . We can see the EM algorithm has similar numbers of non-zero weights as the cutting-plane method. However, the projected sub-gradient method keeps many features, whose weights are small but not exactly zero, and truncating the feature weights with the same threshold as in EM doesn't change the sparse pattern much. Maybe tuning the learning rate could make this tail of very small features disappear. 5.2. Web Data Extraction Web data extraction is a task to identify interested information from web pages. Each sample is a data record or an entire web page which is represented as a set of HTML elements. One striking characteristic of web data extraction is that various types of structural dependencies between HTML elements exist, e.g. the HTML tag tree is itself hierarchical. In (Zhu et al., 2008a), hierarchical CRFs are shown to achieve better performance than flat models like linear-chain CRFs (Lafferty et al., 2001). One method to construct a hierarchical model is to first use a parser to construct a so called vision tree. Then, based on the vision tree, a hierarchical model can be constructed accordingly to extract the interested attributes. See (Zhu et al., 2008a) for an example of the vision tree and the corresponding hierarchical model. We use the data set that is built with web pages generated by 37 different templates (Zhu et al., 2008a) and extract the Name, Image, Price, and Description for each product. For each template, there are 5 pages for training and 10 for testing. Here, we assume that data records are given, and compare different hierarchical models on extracting attributes in the given records. There are 1585 and 3391 data records in the training and testing pages, respectively. We use the two comprehensive evaluation measures, i.e. average F1 and block instance accuracy (Zhu et al., 2008a). Average F1 is the average value of the F1 scores of the four attributes, and block instance accuracy is the percent of data records whose Name, Image, and Price are all correctly identified. On this data set, the cutting-plane method is too slow, and both the sub-gradient and EM Error Rate 400 10 200 60 55 5 0 2 4 0 0 2 4 50 0 0.5 1 square root of lambda square root of C lambda / 14000 Figure 4. Error rates and training time of different algorithms for 1 -M3 N on the first generated data set. dict on its generated data, the error rate is about 0.5. Thus, the learned models, which get higher accuracy, are different from the model that generates the data. Figure 4 shows the error rate and training time of three algorithms for 1 -M3 N: projected sub-gradient (Ratliff et al., 2007) (based on P2 ), cutting plane (Tsochantaridis et al., 2004) (based on P2), and our EM algorithm (based on P3, where C is kept fixed), doing 10-fold CV on the first data set. Since the regularization constants for different algorithms are generally incomparable, we select a set of values around the best one we have tried for each method (exact values shown in Table 1). We can see that both the sub-gradient and cutting plane methods are sensitive to their regularization constants. For the sub-gradient method, a projection to 1 -ball (Duchi et al., 2008) is performed in each iteration, and the time depends largely on the 1 -ball's radius. For larger balls, the projection is easier. So, the training time decreases as increases. For the cutting plane method, the time is mainly dependent on the LP solver (e.g., MOSEK as we use) and increases very fast as C gets larger. For the EMalgorithm, both the error rate and training time are stable as changes. We use 15 EM-iterations in these experiments and each iteration takes about 3.7 cpuseconds, less than the time of the sub-gradient method. Table 1 shows the number of non-zero average weights On Primal and Dual Sparsity of Markov Networks Block Instance Accuracy 0.95 0.94 0.9 0.88 0.86 0.84 0.82 0.8 0.78 0.76 0 10 20 30 L1-CRFs M3N L1-M3N LapM3N 40 50 0.93 0.92 0.91 0.9 0.89 0.88 0 10 20 30 L1-CRFs M3N L1-M3N LapM3N 40 50 Training Ratio Training Ratio Figure 5. Average F1 and block instance accuracy on web data extraction with different number of training data. algorithms are efficient and have similar performance. We randomly select m (5, 10, 15, 20, 30, 40, or, 50) percent of the training records for training and test on all the testing records. For each m, 10 independent experiments are conducted and the average performance is summarized in Figure 5. We can see that: LapM3 N performs comparably with 1 -M3 N, which enjoys both dual and primal sparsity, and outperforms the other models which enjoy either dual sparsity (i.e., M3 N) or primal sparsity (i.e., 1 -CRFs), especially when the number of training data is small. The better performance of 1 -M3 N compared to 1 -CRFs demonstrates the promise of primal-sparse max-margin models. 6. Conclusion We have presented the 1 -norm max-margin Markov network ( 1 -M3 N) and a novel adaptive M3 N (AdapM3 N), which enjoy both primal and dual sparsity, and analyzed their close connections to the Laplace M3 N (LapM3 N), which is pseudo-primal sparse due to a smooth shrinkage effect. We show that 1 -M3 N is an extreme case of LapM3 N, and 1 M3 N is equivalent to an AdapM3 N. We also develop a robust EM-style algorithm to learn an 1 -M3 N. The algorithm helps uncover the difference between 1 -M3 N and LapM3 N. On both synthetic and real web data, we show the promise of simultaneously (pseudo-) primal and dual sparse models over the competing ones which enjoy either dual or primal sparsity. Acknowledgements This work was done while J.Z. was a visiting researcher at CMU under a support from NSF DBI-0546594 and DBI-0640543 awarded to E.X.; J.Z. is also supported by Chinese NSF Grant 60621062 and 60605003; National Key Foundation R&D Projects 2003CB317007, 2004CB318108 and 2007CB311003; and Basic Research Foundation of Tsinghua National TNList Lab. References Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov support vector machines. International Conference on Machine Learning, 3­10. Bartlett, P, Collins, M, Taskar, B & McAllester, D (2004). Exponentiated gradient algorithms for large margin structured classification. NIPS, 113­120. Bradley, D., & Bagnell, A. (2008). Differentiable sparse coding. Neur. Info. Proc. Sys., 113­120. Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projection onto the 1 -ball for learning in high dimensions. ICML, 272­279. Figueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Trans. on Pattern Analysis and Machince Intelligence, 25, 1150­1159. Friedman, J., Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). Discussion of boosting papers. Annals of Statistics, 102­107. Grandvalet, Y. (1998). Least absolute shrinkage is equivalent to quadratic penalization. Inter. Conf. on Aartifical Neural Networks, 201­206. Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning, 282­289. Lee, S.-I., Ganapathi, V., & Koller, D. (2006). Efficient structure learning of Markov networks using 1 -regularization. Neur. Info. Proc. Sys., 817­824. Qi, Y., Minka, T., Picard, R., & Ghahramani, Z. (2004). Predictive automatic relevance determination by expectation propagation. ICML, 671­678. Ratliff, N. D., Bagnell, J. A., & Zinkevich, M. A. (2007). (Online) Subgradient methods for structured prediction. AI Statistics, . Smola, A., Vishwanathan, S., & Le, Q. (2007). Buddle methods for machine learning. NIPS, 1377­1384. Taskar, B., Guestrin, C., & Koller, D. (2003). Maxmargin Markov networks. Neur. Info. Proc. Sys.. Taskar, B., Lacoste-Julien, S., & Jordan, M. I. (2006). Structured prediction via the extragradient method. Neur. Info. Proc. Sys., 1345­1352. Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Soceity, Series B, 267­288. Tipping, M. E. (2001). Sparse bayesian learning and the relevance vector machine. JMLR, 211­244. Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Inter. Conf. on Mach. Learn., 823­830. Wainwright, M., Ravikumar, P., & Lafferty, J. (2006). High-dimensional graphical model selection using 1 -regularized logistic regression. NIPS, 1465­1472. Zhu, J., Nie, Z., Zhang, B., & Wen, J.-R. (2008a). Dynamic hierarchical Markov random fields for integrated web data extraction. JMLR, 1583­1614. Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2004). 1-norm support vector machines. Advances in Neural Information Processing Systems, 49­56. Zhu, J., & Xing, E. (2009). Maximum entropy discrimination Markov networks. ArXiv 0901.2730. Zhu, J., Xing, E., & Zhang, B. (2008b). Laplace maximum margin Markov networks. ICML, 1977­1984. Average F1