Polyhedral Outer Approximations with Application to Natural Language Parsing

André F. T. Martins, Noah A. Smith, Eric P. Xing
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal
afm@cs.cmu.edu, nasmith@cs.cmu.edu, epxing@cs.cmu.edu

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Recent approaches to learning structured predictors often require approximate inference for tractability; yet its effects on the learned model are unclear. Meanwhile, most learning algorithms act as if computational cost were constant within the model class. This paper sheds some light on the first issue by establishing risk bounds for max-margin learning with LP-relaxed inference, and addresses the second issue by proposing a new paradigm that attempts to penalize "time-consuming" hypotheses. Our analysis relies on a geometric characterization of the outer polyhedra associated with the LP relaxation. We then apply these techniques to the problem of dependency parsing, for which a concise LP formulation is provided that handles non-local output features. A significant improvement is shown over arc-factored models.

1. Introduction

Structured classification tackles problems characterized by strong interdependence among the output variables, often with a sequential, graphical, or combinatorial structure. Problems of this kind arise in natural language processing (NLP), computer vision, robotics, and computational biology. Considerable progress has been made lately toward a unified view of these problems (Lafferty et al., 2001; Collins, 2002; Taskar et al., 2004a; Tsochantaridis et al., 2004). A typical approach is to capture the problem structure via a Markov network. Unfortunately, exact inference and learning are only tractable for a small class of network topologies. While discriminative learning is able to handle features that depend globally on the input variables, the analogous property on the output side is more difficult to obtain. For tractability, assumptions about the locality of output variable dependence are often made at the expense of expressive power. Exploring non-local output dependencies is not just an empiricist attempt to produce more accurate models: indeed, there are instances of structured problems in which the output variables are known to interact globally, e.g., by definition of the output space. An example is natural language parsing, where the output variables must "agree" to ensure that they jointly encode a valid parse tree.

While dynamic programming sometimes offers a solution to this problem (albeit under locality assumptions), the same does not happen in many combinatorial problems of interest, like those involving matchings, permutations, or spanning trees. When exact inference is intractable, one has to resort to approximate algorithms; typically, the same algorithms are called as subroutines to train the model. This has been done, e.g., by Taskar et al. (2004b) and Daumé and Marcu (2005); recently, Kulesza and Pereira (2007) and Finley and Joachims (2008) provided some learning approximation guarantees and empirical analyses. However, a theoretical study of the actual impact of these approximations on the learning procedure, when compared with the exact formulation, is still missing. This paper aims to fill this gap for the case of outer approximations when learning large-margin classifiers.

Often, the problem of inference can be represented as an integer linear program (ILP); this is common in NLP applications like semantic role labeling (Roth & Yih, 2005), summarization (Clarke & Lapata, 2008), coreference resolution (Denis & Baldridge, 2007), and dependency parsing (Riedel & Clarke, 2006; Martins
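Inference over combinatorial output spaces of this kind is often written as a 0-1 integer linear program (cf. the ILP formulations discussed below). As a toy, self-contained illustration of that reduction, the sketch below maximizes a linear score over binary vectors subject to linear constraints by brute-force enumeration; the objective, constraint rows, and enumeration "solver" are hypothetical stand-ins, not this paper's formulation:

```python
from itertools import product

def ilp_brute_force(c, A, b):
    """Maximize c.z over z in {0,1}^n subject to A z <= b, by enumeration.

    A stand-in for a real ILP solver (e.g., CPLEX); viable only for tiny n.
    """
    n = len(c)
    best_z, best_val = None, float("-inf")
    for z in product([0, 1], repeat=n):
        # Keep z only if every constraint row of A z <= b is satisfied.
        if all(sum(a_i * z_i for a_i, z_i in zip(row, z)) <= b_k
               for row, b_k in zip(A, b)):
            val = sum(c_i * z_i for c_i, z_i in zip(c, z))
            if val > best_val:
                best_z, best_val = z, val
    return best_z, best_val

# "At most one of z1, z2" and "z3 requires z1", encoded as rows of A z <= b.
c = [2.0, 1.0, 1.5]
A = [[1, 1, 0],    # z1 + z2 <= 1
     [-1, 0, 1]]   # z3 - z1 <= 0
b = [1, 0]
print(ilp_brute_force(c, A, b))  # -> ((1, 0, 1), 3.5)
```

Structural constraints such as "the selected arcs form a tree" are encoded the same way, as linear inequalities over the indicator variables.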
et al., 2009). While solving an ILP is NP-hard in general, fast solvers are available today that make this a practical solution in some cases. We study approximate techniques based on LP relaxations applied to the problem of dependency parsing; our formulation effectively handles global features and constraints. The techniques discussed here are also relevant to other applications; in particular, they are able to exploit expert knowledge in the form of soft or hard first-order constraints (Richardson & Domingos, 2006; Chang et al., 2008). The contributions of this paper are:

• We provide risk bounds for approximate learners, in a large-margin framework, that arise from an LP relaxation of the inference problem. We characterize these bounds in terms of geometric and algorithmic properties. In particular, we provide sufficient conditions for algorithmic separability.

• We propose a new learning paradigm for approximate inference that balances accuracy and runtime, where an additional loss term is included in the learner's objective function to penalize models with long expected runtime. We formulate and analyze a new algorithm based on this paradigm.

• We empirically evaluate the performance of such approximate learners on dependency parsing tasks. Our parsers are able to handle sibling and grandparent interactions and word valency, and to softly favor nearly projective parses.

The paper is organized as follows. Sec. 2 introduces the framework of ILP formulations for large-margin structured classification; Sec. 3 provides a theoretical analysis of outer approximations through LP relaxations; Sec. 4 presents the task; experiments are discussed in Sec. 5. We conclude in Sec. 6.

2. Structured Classification and LP

Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input and output sets, respectively, and assume a supervised setting, where we are given labeled data $\mathcal{L} \triangleq \{(x_1, y_1), \ldots, (x_m, y_m)\} \subseteq \mathcal{X} \times \mathcal{Y}$, drawn according to a fixed, unknown distribution $P(X, Y)$, and aim to learn a function $h: \mathcal{X} \to \mathcal{Y}$ with small expected loss $\mathbb{E}\,\ell(h(X); Y)$ on unseen data; here, $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ denotes the loss function. We are interested in the case where $\mathcal{Y}$ is exponentially large but finite. We assume that there is a bijection between $\mathcal{Y}$ and the set of vertices of a polytope $Z \triangleq \{z \in \mathbb{R}^n \mid Az \le b\}$, where $A$ is a $p$-by-$n$ matrix and $b$ is a vector in $\mathbb{R}^p$, for some integers $p$ and $n$. In that case, to carry out structured classification (the inference problem), one may transform the problem of optimizing over $\mathcal{Y}$ into that of optimizing a linear function over $Z$ (guaranteed to attain the optimum at a vertex of $Z$) and then invert, $y^\star = \mu^{-1}(z^\star)$. This framework was first studied by Taskar et al. (2004b) in the context of associative Markov networks.

Denote by $V(Z)$ the set of vertices of $Z$. A map $\mu: \mathcal{Y} \to V(Z)$ arises naturally whenever the elements of $\mathcal{Y}$ can be represented as collections of "parts" in a finite set $R$ (i.e., each $y \in \mathcal{Y}$ satisfies $y \subseteq R$).[1] Indeed, let us consider linear classifiers of the form $h_w(x) = \arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$ (here, $f(x, y)$ is a vector of features and $w$ is the corresponding weight vector). If the features decompose over the parts through

$$f(x, y) = \sum_{r \in y} f_r(x) = \sum_{r \in R} z_r f_r(x), \quad (1)$$

where $z_r \triangleq \mathbb{I}(r \in y)$, then, by defining the indicator vector $z \triangleq (z_r)_{r \in R}$ and the matrix $F \triangleq (f_r(x))_{r \in R}$:

$$h_w(x) = \arg\max_{y \in \mathcal{Y}} w^\top f(x, y) = \mu^{-1}\Big(\arg\max_{z \in Z} w^\top F z\Big), \quad (2)$$

where $Z$ is the convex hull of the set of indicator vectors that correspond to elements of $\mathcal{Y}$. Therefore, in this situation, inference can be cast as an LP; this automatically enables learning using any algorithm that only needs to perform inference steps, like the structured perceptron (Collins, 2002). Furthermore, whenever the loss function can be expressed as a linear function of $z$--which turns out to be the case in many situations of interest, e.g., for Hamming-like losses,

$$\ell(y'; y) \triangleq \sum_{r \in R} \big(\mathbb{I}(r \in y')\,\mathbb{I}(r \notin y) + \mathbb{I}(r \notin y')\,\mathbb{I}(r \in y)\big) = \sum_{r \in R} z'_r (1 - z_r) + (1 - z'_r) z_r = p^\top z' + q, \quad (3)$$

where $p \triangleq \mathbf{1} - 2z$, $q \triangleq \mathbf{1}^\top z$, and we made the variable changes $z \triangleq \mu(y)$ and $z' \triangleq \mu(y')$--then this setting also allows learning the model parameters $w$ using a max-margin criterion. To see how, define, for each data point $(x_t, y_t) \in \mathcal{L}$, the non-negative quantity

$$r_t(w) = \max_{y'_t \in \mathcal{Y}} w^\top f(x_t, y'_t) - w^\top f(x_t, y_t) + \ell(y'_t; y_t) = \max_{z'_t \in Z} (F_t^\top w + p_t)^\top z'_t - (F_t^\top w)^\top z_t + q_t, \quad (4)$$

where $z_t \triangleq \mu(y_t)$; the problem of finding the $\arg\max$ in (4) is often referred to as the loss-augmented inference (LAI) problem; observe that it can also be cast as an LP. The complete learning problem is

$$\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{t=1}^m r_t(w), \quad (5)$$

where $\lambda \ge 0$ is a regularization parameter. This is a convex problem for which several algorithms have been proposed (Taskar et al., 2004b; Taskar et al., 2006; Tsochantaridis et al., 2004; Ratliff et al., 2006).

[1] The set of parts $R$ can be, e.g., the set of all possible clique assignments in a Markov network.

In what follows, we denote $U \triangleq [0, 1]$ and $B \triangleq \{0, 1\} = U \cap \mathbb{Z}$. Although in some special cases there exists a concise polyhedral representation of $Z$ (in terms of the matrix $A$ and vector $b$), that does not happen in general. Often, what we have is a concise representation of an outer polytope $\bar{Z} \supseteq Z$ such that

$$\min_{z \in Z} c^\top z = \min_{z \in \bar{Z},\, z \in B^n} c^\top z \ge \min_{z \in \bar{Z}} c^\top z \quad (6)$$

holds for any $c$ with a fairly tight bound. Notice that, if we assume that $\bar{Z} \subseteq U^n$, then there are no integer points in the relative interior of $\bar{Z}$; consequently, (6) implies that any vertex of $Z$ is also a vertex of $\bar{Z}$. The two polytopes are represented schematically in Fig. 1.

Figure 1. Left: schematic representation of the outer polytope $\bar{Z}$ associated with the relaxation (6). Right: the carved polytope $\tilde{Z}$ implicit in the problem (13).

Let $H \triangleq \{h_w \mid w \in W\}$ be our hypothesis class, where $W \subseteq \mathbb{R}^d$ is a convex set, and $h_w$ maximizes a linear discriminant function over $Z$, as in (2). We consider approximate inference algorithms $A$ that accept as input a parameter $w \in W$ and a data point $x \in \mathcal{X}$, and output a value $A(x; w)$ in $\bar{Z}$. Following Kulesza and Pereira (2007), we say that the dataset $\mathcal{L}$ is separable (w.r.t. $H$) if there is $w \in W$ such that $\sum_{t=1}^m \ell(h_w(x_t); z_t) = 0$; we say that $\mathcal{L}$ is algorithmically separable (w.r.t. $A$) if there is $w \in W$ such that $\sum_{t=1}^m \ell(A(x_t; w); z_t) = 0$. In both cases, we say that "$\mathcal{L}$ is separated (resp. algorithmically separated) by $w$" when $w$ is a "witness" in the above definitions. In our setting, where we consider algorithms induced by the outer approximation $\bar{Z} \supseteq Z$, algorithmic separability w.r.t. $A$ is equivalent to separability w.r.t. $\bar{H} \triangleq \{\bar{h}_w \mid w \in W\}$, where $\bar{h}_w$ takes the form $\bar{h}_w(x) = \arg\max_{z \in \bar{Z}} w^\top F z$. Observe also that, in this setting, algorithmic separability implies separability; this was pointed out by Kulesza and Pereira (2007) and Finley and Joachims (2008). Essentially, their theoretical analyses for structured learning with outer-approximate (or "overgenerating") inference algorithms are extensions of known results for the exact formulations, obtained by replacing $Z$ with $\bar{Z}$ in the formulas. While these theoretical guarantees are helpful (contrasting with undergenerating predictors, for which no guarantees exist), they do not provide sufficient conditions for algorithmic separability, nor do they bound the risk of an outer approximation compared with the exact formulation.

3. Learning with LP Relaxed Inference

We now study the impact of relaxations like (6) on the learning problem.

3.1. Approximation Bounds

To cope with our approximate learning setting, we identify $\mathcal{Y} \cong V(Z)$ through the map $\mu$; with some abuse of notation, we write $\ell(z'; z)$ for the loss function (3) instead of $\ell(\mu^{-1}(z'); \mu^{-1}(z))$. Also, we extend its domain by defining $\ell(\bar{z}; z) \triangleq p^\top \bar{z} + q$ for any $\bar{z} \in \bar{Z}$. Observe that, whenever $z \in V(Z)$, $\ell(\bar{z}; z) = \|\bar{z} - z\|_1$. As a consequence, $\ell$ has the triangle inequality property, i.e., $\ell(\bar{z}; z) \le \ell(\bar{z}; z') + \ell(z'; z)$, for any $\bar{z} \in \bar{Z}$ and $z, z' \in V(Z)$. This fact will be exploited below.

Lemma 1. Assume that the feature function satisfies $\|f_r(x)\|_\infty \le 1$,[2] and let $N_f \triangleq \max_{x \in \mathcal{X}, r \in R} \|f_r(x)\|_1$. Then, for any $w \in W$, $\bar{z} \in \bar{Z}$ and $z \in V(Z)$:

$$|w^\top F(\bar{z} - z)| \le \|w\|_2 \sqrt{N_f}\, \ell(\bar{z}; z). \quad (7)$$

Proof: By Hölder's inequality, $|w^\top F(\bar{z} - z)| \le \|F^\top w\|_\infty \cdot \|\bar{z} - z\|_1 = \max_{r \in R} |w^\top f_r(x)| \cdot \ell(\bar{z}; z) \le \|w\|_2 \sqrt{N_f}\, \ell(\bar{z}; z)$.

[2] This is guaranteed if the features are binary-valued, in which case $N_f$ is the maximum possible number of active features at a single part.

Note that Lemma 1 implies

$$\hat{\ell}(\bar{z}; z, w) \triangleq w^\top F(\bar{z} - z) + \ell(\bar{z}; z) \le (1 + \|w\|_2 \sqrt{N_f})\, \ell(\bar{z}; z). \quad (8)$$

We can now provide a sufficient condition for algorithmic separability. Recall that $r_t(w) = \max_{z'_t \in Z} \hat{\ell}(z'_t; z_t, w)$; defining analogously $\bar{r}_t(w) \triangleq \max_{\bar{z}_t \in \bar{Z}} \hat{\ell}(\bar{z}_t; z_t, w)$, we have:

Proposition 2. Let $L$ be such that, for any $\bar{z} \in \bar{Z} \setminus Z$, there exists $z \in V(Z)$ such that $\ell(\bar{z}; z) \le L$. Then:

$$r_t(w) \le \bar{r}_t(w) \le r_t(w) + (1 + \|w\|_2 \sqrt{N_f})\, L. \quad (9)$$

Furthermore, let $L' \le L$ be such that, for any $\bar{z} \in \bar{Z} \setminus Z$, there exist $z, z' \in V(Z)$ such that $\ell(\bar{z}; z) \le \ell(\bar{z}; z') \le L'$. If $\mathcal{L}$ is separated by $w^\star$ with a large margin (i.e., if $\sum_{t=1}^m r_t(w^\star) = 0$) and $\|w^\star\|_2 < 1/(L' \sqrt{N_f})$, then $\mathcal{L}$ is algorithmically separated by $w^\star$.

Proof: The fact that $r_t(w) \le \bar{r}_t(w)$ is trivial, since $Z \subseteq \bar{Z}$. As for the upper bound, we choose $z'_t \in V(Z)$ such that $\ell(\bar{z}_t, z'_t) \le L$ and apply the triangle inequality:

$$\bar{r}_t(w) \le \max_{\bar{z}_t \in \bar{Z}} w^\top F_t(\bar{z}_t - z'_t) + w^\top F_t(z'_t - z_t) + \ell(\bar{z}_t; z'_t) + \ell(z'_t, z_t) = \max_{\bar{z}_t \in \bar{Z}} \hat{\ell}(\bar{z}_t; z'_t, w) + \hat{\ell}(z'_t; z_t, w) \le r_t(w) + (1 + \|w\|_2 \sqrt{N_f})\, L. \quad (10)$$

As for the second part, we have for all $t$ and $\bar{z}_t \in \bar{Z}$:

$$w^{\star\top} F_t(z_t - \bar{z}_t) = w^{\star\top} F_t(z_t - z'_t) + w^{\star\top} F_t(z'_t - \bar{z}_t) \ge w^{\star\top} F_t(z_t - z'_t) - \|w^\star\|_2 \sqrt{N_f}\, \ell(\bar{z}_t; z'_t) \ge 1 - \|w^\star\|_2 \sqrt{N_f}\, L' > 0, \quad (11)$$

where we chose $z'_t \ne z_t$ such that $\ell(\bar{z}_t; z'_t) \le L'$, and used Lemma 1, together with the fact that any two distinct points in $V(Z)$ have at least unit loss (since they belong to $B^n$).

Corollary 3. Under the conditions stated in the second part of Prop. 2, the perceptron algorithm running approximate inference will make a finite number of mistakes.

Proof: (Sketch) Use Prop. 2 and the mistake bound for the structured perceptron (Collins, 2002).

The bound (9) relies on a geometric characterization of the approximating polytope $\bar{Z}$ (through the parameters $L$ and $L'$). However, in some cases one has algorithmic approximation guarantees instead (see, e.g., Vazirani, 2001). The following establishes a bound similar to the one in Prop. 2 that relies on an algorithmic characterization (we define $A(x, w) \triangleq \bar{h}_w(x)$):

Proposition 4. If $A$ is outer $\epsilon$-approximate[3] for the class of problems $\min_{z \in Z} c^\top z$ with $c \ge 0$, the vertices of $Z$ have constant $\ell_1$-norm (say $K$), and any $\bar{z} \in \bar{Z}$ satisfies $\mathbf{1}^\top \bar{z} \le K$, then:

$$r_t(w) \le \bar{r}_t(w) \le r_t(w) + 2\epsilon K (1 + \|w\|_2 \sqrt{N_f}). \quad (12)$$

[3] Given a class $\mathcal{F}$ of nonnegative functions and a minimization problem $\min_{x \in X} f(x) \triangleq f^\star$, an algorithm is said to be outer $\epsilon$-approximate if it retrieves a lower bound of the optimum, $\tilde{f} \le f^\star$, such that $(f^\star - \tilde{f})/f^\star \le \epsilon$.

Proof: Since $A$ is outer $\epsilon$-approximate, it underestimates $\min_{z \in Z} c^\top z$ by at most $\epsilon\, c^\top z^\star$, provided $c \ge 0$. In the general case of LAI, however, $c = -F_t^\top w - p_t \not\ge 0$; but since the vertices of $Z$ have constant norm, the problem of optimizing over $Z$ is unchanged by adding any constant to the cost vector. Thus, defining $c' \triangleq c + \|c\|_\infty \mathbf{1} \ge 0$:

$$\bar{r}_t(w) = -\min_{\bar{z}_t \in \bar{Z}} (c' - \|c\|_\infty \mathbf{1})^\top \bar{z}_t - w^\top F_t z_t + q_t \le -(1 - \epsilon) \min_{z'_t \in Z} c'^\top z'_t + K \|c\|_\infty - w^\top F_t z_t + q_t = r_t(w) + \epsilon \min_{z'_t \in Z} c'^\top z'_t \le r_t(w) + 2\epsilon K (1 + \|w\|_2 \sqrt{N_f}),$$

again due to Hölder's inequality.

Props. 2-4 will be used in Sec. 3.3 to establish empirical risk and generalization bounds.

3.2. Balancing Accuracy and Runtime

We now propose a new learning strategy that balances accuracy and algorithmic cost. We argue that, when the computational cost of performing inference at test time is something we worry about, this cost should be taken into account in the learning problem. Let $c: H \times \mathcal{X} \to \mathbb{R}$ be a cost function that, given a hypothesis $h \in H$ and a data point $x \in \mathcal{X}$, expresses the cost of computing $h(x)$. Most existing learning algorithms aim to minimize the expected loss on unseen data, $\mathbb{E}\,\ell(h(X), Y)$; yet it may happen that the model that minimizes this quantity (call it $h^\star$) has an impractically high average computational cost. On the other hand, there may well exist another hypothesis $h \in H$ performing similarly to $h^\star$ that yields much faster runtimes. Hence, our target should be to minimize $\mathbb{E}\big(\ell(h(X), Y) + \eta\, c(h, X)\big)$, where $\eta \ge 0$ is a trade-off parameter. Traditional algorithmic complexity theory, which looks at the worst possible problem configurations, is not useful here, as we are interested in average complexities (Levin, 1986) under the unknown distribution $P(X, Y)$.

As an example, consider the ILP formulation (6), where the cost vector $c \triangleq -F^\top w$ is affected both by the model parameters $w$ and by the input data, represented in the matrix $F$ (which we can see as a random variable). Although solving an ILP is an NP-complete problem, for some "nice" distributions $P(c)$ (peaked over values of $c$ that hit integer vertices of the constraint polyhedron) the average computational cost is low. Therefore, it is desirable to obtain model parameters $w$ that, besides yielding small expected loss, yield nice cost vectors $c \sim P(F^\top w)$ with high probability. When the constraint polytope does not contain integer points in its relative interior (which is our case, cf.
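The linearized Hamming loss of eq. (3), $\ell(z'; z) = p^\top z' + q$ with $p = \mathbf{1} - 2z$ and $q = \mathbf{1}^\top z$, which also defines the relaxation gap used above for fractional $z'$, is easy to check numerically; a minimal sketch (the indicator vectors below are made up):

```python
def hamming(z_prime, z):
    """Direct Hamming distance between two 0/1 part-indicator vectors."""
    return sum(a != b for a, b in zip(z_prime, z))

def linearized_loss(z_prime, z):
    """Eq. (3): l(z'; z) = p.z' + q, with p = 1 - 2z and q = 1.z.
    Linear in z', so it extends to fractional points of the outer polytope."""
    p = [1 - 2 * z_r for z_r in z]
    q = sum(z)
    return sum(p_r * zp_r for p_r, zp_r in zip(p, z_prime)) + q

z   = [1, 0, 1, 0]   # gold indicator vector mu(y)
z_p = [1, 1, 0, 0]   # candidate mu(y')
print(hamming(z_p, z), linearized_loss(z_p, z))  # -> 2 2
```

On integer vertices the two definitions coincide; unlike the direct count, the linear form also evaluates fractional points $\bar{z} \in \bar{Z}$, where it equals $\|\bar{z} - z\|_1$.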
(6)), the runtime of many off-the-shelf ILP solvers (like those based on branch-and-bound or Gomory cuts) decreases as the solution of the relaxed problem gets closer to the exact solution; therefore, it may be reasonable to approximate the expected computational cost $\mathbb{E}\, c(h_w, X)$ by the expected relaxation gap $\mathbb{E}\, \ell(\bar{h}_w(X), h_w(X))$. Led by this thought, we add an "empirical relaxation gap" term to our learning objective, of the form $\frac{1}{m}\sum_{t=1}^m (\bar{r}_t(w) - r_t(w)) \ge 0$. Rearranging terms, the overall learning problem becomes:

$$\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1-\eta}{m}\sum_{t=1}^m r_t(w) + \frac{\eta}{m}\sum_{t=1}^m \bar{r}_t(w). \quad (13)$$

By defining $\tilde{Z} \triangleq (1 - \eta) Z + \eta \bar{Z}$ and due to the linearity of the loss function, observe that each LAI problem in (13) can be written as $(1-\eta)\, r_t(w) + \eta\, \bar{r}_t(w) = \max_{\tilde{z}_t \in \tilde{Z}} w^\top F_t(\tilde{z}_t - z_t) + \ell(\tilde{z}_t, z_t)$. As a consequence, the formulation (13) is isomorphic to the one in (5), with the sole difference that the polytope $Z$ is replaced by $\tilde{Z}$; unfortunately, $\tilde{Z}$ is, in general, as hard to represent as $Z$. Notice that $Z \subseteq \tilde{Z} \subseteq \bar{Z}$ for any $\eta \in [0, 1]$; geometrically, $\tilde{Z}$ is obtained from $\bar{Z}$ by intersecting the latter with cutting half-spaces that "carve out" the fractional vertices (see Fig. 1). Therefore, optimizing over $\tilde{Z}$ has the effect of providing approximate solutions that lie near the integer vertices.

Rather than optimizing over the carved polytope, we propose a simple stochastic online strategy (see Algorithm 1) to tackle (13); we also allow the trade-off parameter to vary over time (i.e., we replace $\eta$ by $(\eta_t)_t$). This algorithm is similar to the subgradient algorithm of Ratliff et al. (2006), except that, at each step, it randomly decides whether to perform exact or approximate LAI; we analyze this algorithm in Sec. 3.3.[4]

Algorithm 1 Modified Online Subgradient
  Input: $\mathcal{L}$, $(\eta_t)_t$, learning rate sequence $(\alpha_t)_t$
  Initialize $w_1 \leftarrow 0$
  for $t = 1$ to $m = |\mathcal{L}|$ do
    Pick $\sigma_t \sim \mathrm{Bernoulli}(\eta_t)$
    if $\sigma_t = 1$ then
      Solve relaxed LAI, $\hat{z}_t \leftarrow \arg\max_{\bar{z}_t \in \bar{Z}} \hat{\ell}(\bar{z}_t; z_t, w_t)$
    else
      Solve exact LAI, $\hat{z}_t \leftarrow \arg\max_{z'_t \in Z} \hat{\ell}(z'_t; z_t, w_t)$
    end if
    Compute the subgradient $g_t \leftarrow \lambda w_t + F_t(\hat{z}_t - z_t)$
    Project and update $w_{t+1} \leftarrow \mathrm{Proj}_W(w_t - \alpha_t g_t)$
  end for
  Return the averaged model $\hat{w} \triangleq \frac{1}{m}\sum_{t=1}^m w_t$.

[4] Note that Algorithm 1 with fixed $\eta$ and fixed sample size indeed optimizes (13). The law of large numbers implies that after enough iterations, the fraction of time that each datum is used to solve exact LAI (resp. relaxed LAI) is arbitrarily close, in probability, to $1 - \eta$ (resp. $\eta$); hence, one just needs to adapt the convergence proofs of the subgradient algorithm.

3.3. Generalization Bounds

Kulesza and Pereira (2007) observed that a PAC-Bayes generalization bound for empirical risk minimization can be adapted to the case of LP-relaxed inference (by invoking the fact that relaxed inference essentially augments the output set). Here, we go farther and provide a generalization bound for approximate learning w.r.t. the empirical risk minimizer in the exact setting.

Proposition 5. Let $\hat{w}$ be the solution returned by Algorithm 1 with learning rate chosen as $\alpha_t = 1/(\lambda t)$. Assume that $\hat{\ell}(\cdot\,; z_t, w_t)$ is upper bounded by $\hat{\ell}_{\max}$ and that the subgradient norm is bounded by $G$.[5] Then, the following bound holds with probability at least $1 - \delta$:

$$\mathbb{E}\,\ell(h_{\hat{w}}(X); Y) \le \min_{w \in W} \frac{1}{m}\sum_{t=1}^m r_t(w) + M(w, m) + \sqrt{\frac{8\hat{\ell}_{\max}^2}{m}\ln\frac{2}{\delta}}, \quad (14)$$

where $M(w, m) \triangleq \frac{\lambda}{2}\|w\|^2 + \frac{G^2(1 + \log m)}{2\lambda m} + (1 + \|w\|_2\sqrt{N_f})\, L \cdot \frac{1}{m}\sum_{t=1}^m \eta_t$.

[5] The function $\hat{\ell}$ can be made bounded if we define $W$ to be a convex body, e.g., by constraining $\|w\|_2 \le \gamma$, which ensures, through Lemma 1, that $\hat{\ell}(\cdot\,; z_t, w_t) \le (1 + \gamma N_f^{1/2})\, K$, where $K \ge \|z\|_1$. As for $G$, assuming feature vectors whose norm is bounded by $R/2$, we may take $G = R + \lambda\gamma$.

Proof: (Sketch.) We adapt a result from Cesa-Bianchi et al. (2004) for convex and bounded loss functions to get, for any $0 < \delta' \le 1$,

$$P\bigg(\mathbb{E}\, r(\hat{w}) \ge \frac{1}{m}\sum_{t=1}^m r_t(w_t) + \sqrt{\frac{2\hat{\ell}_{\max}^2}{m}\ln\frac{1}{\delta'}}\bigg) \le \delta'. \quad (15)$$

Since $\ell(h_{\hat{w}}(X); Y) \le r(\hat{w})$, it suffices to bound the RHS. Noting that $r_t(w_t) \le \bar{r}_t(w_t)$, and adapting a regret bound from Ratliff et al. (2006) for the subgradient algorithm, we get, for any $w \in W$:

$$\frac{1}{m}\sum_{t=1}^m r_t(w_t) \le \frac{1}{m}\sum_{t=1}^m r_t(w) + \frac{\lambda}{2}\|w\|^2 + \frac{G^2(1+\log m)}{2\lambda m} + \frac{1}{m}\sum_{t=1}^m \sigma_t\,(\bar{r}_t(w) - r_t(w)). \quad (16)$$

Now, applying Hoeffding's inequality to the random sequence $(\sigma_t)_{t=1,\ldots,m}$, we get, for any $0 < \delta'' \le 1$, $P\big(\sum_{t=1}^m \sigma_t \ge \sum_{t=1}^m \eta_t + \sqrt{(m/2)\log(1/\delta'')}\big) \le \delta''$. Substituting in (16), invoking Prop. 2, and plugging the result into (15) (noting that $\hat{\ell}_{\max}$ upper bounds the approximation gap), we get the desired result (by taking $\delta' = \delta'' = \delta/2$).

Corollary 6. If we set $\lambda = \Theta\big(\sqrt{(1 + \log m)/m}\big)$ and $\eta_t = \Theta(1/\sqrt{t})$, then $\mathbb{E}\, r(\hat{w}) - \frac{1}{m}\sum_{t=1}^m r_t(w^\star) \xrightarrow{\text{a.s.}} 0$.

Proof: The choice of $\lambda$ was proposed by Ratliff et al. (2006). As for $\eta_t$, we have $\sum_{t=1}^m 1/\sqrt{t} = O(\sqrt{m}) = o(m)$; therefore $\lim_{m \to \infty} M(w, m) = 0$.

4. Dependency Parsing

We next show how approximate learning using LP-relaxed inference can be used in an important NLP task involving non-local interactions among output variables: dependency parsing. We merely sketch the problem; see Martins et al. (2009) for a full discussion of dependency parsing and its ILP representations.

Dependency trees are a lightweight syntactic representation that attempts to capture functional relationships between words. Given a sentence $x = \langle w_0, \ldots, w_n \rangle$ (where $w_i$ denotes the word at the $i$-th position, and $w_0 = \$$ is a wall symbol), consider the digraph $D = (V, A)$, with vertices $V = \{0, \ldots, n\}$ (the $i$-th vertex corresponding to the $i$-th word) and arcs $A = V^2$. A (legal) dependency parse tree of $x$ is any 0-arborescence[6] of $D$; we denote the set of legal dependency parse trees of $x$ by $\mathcal{Y}(x)$.[7] If $a = (i, j) \in y$, we refer to $i$ as the parent of $j$ and $j$ as the child of $i$. A parser is a function $h: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y} = \bigcup_{x \in \mathcal{X}} \mathcal{Y}(x)$. The fact that $\mathcal{Y}(x)$ is exponentially large makes this a structured classification problem. We want to learn an $h$ with small expected loss--here, the Hamming loss function $\ell(y'; y) \triangleq |\{(i, j) \in y' : (i, j) \notin y\}|$, which, if $R = A$, is proportional to (3).

[6] An $r$-arborescence of $D$ is a subset $B$ of $A$ such that $(V, B)$ is a (directed) tree rooted at $r$.
[7] In this paper, we consider unlabeled dependency parsing, where only the backbone structure is predicted.

An arc-factored model for dependency parsing is one in which the feature vector decomposes as a sum over arcs (i.e., $R = A$ in (1)); such models can capture each word's preferences for particular properties of its children or parents. McDonald et al. (2005) showed that arc-factored inference among trees is an instance of the minimum arborescence problem, which enables efficient algorithms for exact inference (Chu & Liu, 1965; Edmonds, 1967). Riedel and Clarke (2006) added hard linguistic constraints to an arc-factored model, representing the inference problem as an ILP with exponentially many constraints. They used a cutting-plane algorithm for inference, in which constraints are only invoked when violated; further, they trained with an arc-factored model since the cutting-plane algorithm was slow. In Martins et al. (2009), we formulate dependency parsing as an ILP with a polynomial number of constraints, by adapting a single-commodity directed flow model due to Magnanti and Wolsey (1994). Our representation allows constraints to be made soft, so that their strengths are learned as features of the model. This permits us to include non-arc-factored features, described next.

Siblings and grandparents. It was shown by McDonald et al. (2006) and Smith and Eisner (2008) that modeling interactions among words that share a parent, or among a word's children and its parent, can be beneficial. To incorporate these features, we employ a linearization trick (Boros & Hammer, 2002). This can be done with $O(n^3)$ variables and constraints.

Valency. Words in a language have preferences not only for which words will be their children, but also for how many children they will have (valency or arity). Our model includes valency indicator features. An extra $O(n^2)$ variables and constraints are necessary.

Projectivity. If $y \in \mathcal{Y}(x)$, we say that an arc $a = (i, j) \in y$ is projective if for any vertex $k$ in the span of $a$ (i.e., satisfying $\min(i, j) < k < \max(i, j)$), there exists a path in $y$ from $i$ to $k$. A dependency tree is called projective if it only contains projective arcs.[8] Although non-projectivity is arguably necessary for correctly capturing dependency structure in some languages, parse trees tend to be nearly projective. We encode this preference in a learned, language-specific way, as a feature. Indicator variables for projective arcs can be added with an extra $O(n^3)$ variables and constraints.

[8] When the arc-factored assumption is weakened and non-projectivity is permitted, exact inference becomes NP-hard (McDonald & Satta, 2007), cf. parsing with non-projectivity disallowed (Eisner, 1996).

Table 1. Results for dependency parsing. We have reproduced the system of McDonald et al. (2006) for the sake of comparison (MLP'06). For each language and model setting, we report the unlabeled attachment scores (UAS, %) using exact and approximate inference at test time. For the arc-factored model, we report the results obtained by learning with exact and approximate LAI. Bold indicates significantly best results (statistical significance is measured by Dan Bikel's randomized parsing evaluation comparator, http://www.cis.upenn.edu/~dbikel/software.html).

                        Arc-Factored Model                          Full Model
             MLP'06     Learning: Exact (η=0)  Approx. (η=1)        Approx. (η=1)
                        Inference: Exact  Approx.  Exact  Approx.   Exact  Approx.
  Danish     90.60                 89.86  89.68    89.80  89.78     91.18  91.04
  Dutch      84.11                 83.15  83.17    83.55  83.61     85.57  85.41
  Portuguese 91.40                 90.66  90.66    90.66  90.70     91.42  91.44
  Slovene    83.67                 84.05  83.87    83.93  83.95     85.61  85.41

5. Experiments

We demonstrate the effectiveness of our polyhedral approximations for dependency parsing, with experiments on four languages from the CoNLL-X shared task (Buchholz & Marsi, 2006): Danish, Dutch, Portuguese and Slovene.
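Algorithm 1 above can be sketched on a toy problem in which both exact and relaxed LAI are solved by enumerating vertex lists standing in for $V(Z)$ and $V(\bar{Z})$. Everything below is a hypothetical illustration, not the parsing model: the 2-dimensional part space, the identity feature matrices, the invented fractional vertices, and the schedules $\eta_t = 1/\sqrt{t}$, $\alpha_t = 1/(\lambda t)$; the projection onto $W$ is omitted (i.e., $W = \mathbb{R}^d$):

```python
import random

def lai(vertices, w, F, z):
    """Loss-augmented inference by enumeration over a vertex list:
    argmax over z' of w.F(z' - z) + ||z' - z||_1 (linearized Hamming loss)."""
    d, n = len(F), len(z)
    wF = [sum(w[i] * F[i][r] for i in range(d)) for r in range(n)]
    def score(zp):
        return (sum(wF[r] * (zp[r] - z[r]) for r in range(n))
                + sum(abs(zp[r] - z[r]) for r in range(n)))
    return max(vertices, key=score)

def alg1(data, V_exact, V_outer, lam=0.1, seed=0):
    """Sketch of Algorithm 1 (modified online subgradient):
    per step, flip a Bernoulli(eta_t) coin to choose relaxed vs. exact LAI,
    take a subgradient step, and return the averaged iterate."""
    rng = random.Random(seed)
    d = len(data[0][0])                 # number of features (rows of F_t)
    w = [0.0] * d
    avg = [0.0] * d
    for t, (F_t, z_t) in enumerate(data, start=1):
        eta_t = 1.0 / t ** 0.5
        V = V_outer if rng.random() < eta_t else V_exact   # relaxed or exact LAI
        z_hat = lai(V, w, F_t, z_t)
        n = len(z_t)
        g = [lam * w[i] + sum(F_t[i][r] * (z_hat[r] - z_t[r]) for r in range(n))
             for i in range(d)]         # subgradient g_t = lam*w + F(z_hat - z)
        alpha_t = 1.0 / (lam * t)
        w = [w[i] - alpha_t * g[i] for i in range(d)]
        avg = [avg[i] + w[i] / len(data) for i in range(d)]
    return avg

# Toy data: identity "feature matrix", gold part-indicator vectors.
I2 = [[1.0, 0.0], [0.0, 1.0]]
data = [(I2, (1, 0)), (I2, (1, 0)), (I2, (0, 1))] * 5
V = [(0, 0), (1, 0), (0, 1)]              # integer vertices (exact Z)
V_bar = V + [(1.0, 0.5), (0.5, 1.0)]      # plus made-up fractional vertices
w_hat = alg1(data, V, V_bar)
print(w_hat)
```

In the parser itself, each enumeration is of course replaced by an exact ILP solve or its LP relaxation, and the updates by the passive-aggressive variant used in Sec. 5.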
We used the same arc-factored features as McDonald et al. (2005) and optional non-arc-factored features as described in Sec. 4. All our experiments were conducted on a PC with an Intel dual-core processor at 2.66 GHz and 2 GB RAM. We used CPLEX to solve the ILPs. For scalability, we first prune the base graph by running a simple algorithm that ranks the k-best candidate parents for each word in the sentence, setting k = 10; this reduces the number of variables and constraints in the arc-factored model to $O(nk)$, and in the full model to $O(n^2 k)$.[9] The ranker is a local model trained using a max-margin criterion; it is arc-factored and not subject to any structural constraints, so it is fast. Pruning was employed in both training and testing. To learn the actual parser, we implemented Alg. 1 with passive-aggressive updates (Crammer et al., 2006).[10] At test time, we experimented with exact and approximate inference. The approximate decoder was implemented as follows to obtain a true parse: first, solve the LP relaxation; then, if the solution $z^\star$ is fractional, project its arc components $z^\star_A \triangleq (z^\star_a)_{a \in A}$ onto the feasible set $\mathcal{Y}(x)$. The Euclidean projection can be computed by finding a maximal arborescence in the digraph whose weights are defined by $z^\star_A$ (proof omitted); as seen in Sec. 4, the Chu-Liu-Edmonds algorithm can do this in polynomial time.

[9] The oracle constrained to pick parents from these lists achieves > 98% in every case.
[10] Without regularization, this can be seen as a variant of the online subgradient algorithm of Ratliff et al. (2006).

Tab. 1 summarizes the results. As expected, adding non-arc-factored features makes the models more accurate. We also observe that approximate training did not hurt the arc-factored model, compared with exact training. Moreover, approximate decoding at test time did not considerably affect accuracy for any of the models. For 3 out of 4 languages, the full model yields substantially better results than the approximate non-arc-factored parser of McDonald et al. (2006).

To see whether the parameter $\eta$ is effectively penalizing computational cost, according to the paradigm sketched in Sec. 3.2, we ran an additional experiment on the Slovene dataset (Tab. 2). We observe that, as $\eta$ approaches 0 (i.e., as training becomes close to exact), the learned model tends to assign more fractional solutions in the LP-relaxed problem (a subroutine of both the approximate and exact decoders), which results in a dramatic increase in runtime for the exact decoder. In contrast, when trained with $\eta = 1$, the model learns to avoid fractional solutions, and ILP solvers converge faster to the optimum (on average). Yet the approximate decoder is still significantly faster.

Table 2. Runtimes for Slovene, as a function of $\eta$. Learning the full model was intractable as $\eta \to 0$; the valency and non-projectivity features were excluded in this analysis. The reported values (collected at test time) are: the percentage of words for which a fractional parent (FP) was assigned in the LP-relaxed problem, and average runtimes per sentence, using exact and approximate inference.

  η               0.00    0.25    0.50    0.75    1.00
  FP (%)         16.53   10.42    8.29    7.76    7.12
  Exact (sec.)   19.33    4.16    2.20    1.62    1.11
  Approx. (sec.)  0.17    0.13    0.13    0.13    0.12

6. Conclusions

We studied the impact of LP-relaxed inference in max-margin learning. Based on a geometric characterization, we established conditions that guarantee algorithmic separability and derived risk bounds w.r.t. the exact formulations. As a by-product, we put forth a new learning paradigm that takes computational cost into consideration. We demonstrated the effectiveness of these techniques on a structured prediction problem.
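The projection step of the approximate decoder in Sec. 5 (finding a maximal arborescence in the digraph whose weights are the relaxed arc values $z^\star_A$) can be sketched by brute force; a real parser would use the Chu-Liu-Edmonds algorithm, and the fractional arc weights below are invented for illustration:

```python
from itertools import product

def max_arborescence(score, n):
    """Brute-force maximum 0-rooted arborescence over words {1..n}:
    every word picks one parent in {0..n}; keep the highest-scoring
    assignment whose arcs form a tree reachable from the root 0.
    O((n+1)^n) -- tiny n only; use Chu-Liu-Edmonds in practice."""
    best, best_val = None, float("-inf")
    for parents in product(range(n + 1), repeat=n):  # parents[j-1] = head of word j
        children = {i: [] for i in range(n + 1)}
        for child, par in enumerate(parents, start=1):
            children[par].append(child)
        # A valid arborescence iff every word is reachable from the root.
        seen, stack = {0}, [0]
        while stack:
            for v in children[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if len(seen) == n + 1:
            val = sum(score[par][child]
                      for child, par in enumerate(parents, start=1))
            if val > best_val:
                best, best_val = dict(enumerate(parents, start=1)), val
    return best, best_val

# Invented fractional arc values z*_A from a relaxed LP, sentence of 3 words.
z_star = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.8, (2, 1): 0.3,
          (1, 3): 0.2, (2, 3): 0.7, (0, 3): 0.1}
score = [[0.0] * 4 for _ in range(4)]
for (i, j), v in z_star.items():
    score[i][j] = v
best, val = max_arborescence(score, 3)
print(best)  # -> {1: 0, 2: 1, 3: 2}
```

The returned head map is an integer point of $\mathcal{Y}(x)$, so the decoder always outputs a true parse even when the LP solution is fractional.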
As future work, we will look for polyhedral characterizations that guarantee tighter risk bounds; in particular, we aim to obtain conditions under which the approximation gap decreases with the sample size.

Acknowledgments

The authors thank the reviewers for their comments and Nathan Ratliff for discussions. A.M. was supported by a grant from FCT/ICTI through the CMU-Portugal Program, and also by Priberam Informática. N.S. was supported by NSF IIS-0836431 and an IBM Faculty Award. E.X. was supported by NSF DBI-0546594, DBI-0640543, IIS-0713379, and an Alfred Sloan Foundation Fellowship in Computer Science.

References

Boros, E., & Hammer, P. (2002). Pseudo-Boolean optimization. Discrete Applied Math., 123, 155-225.
Buchholz, S., & Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. Conf. Comp. Nat. Lang. Learn., 189-210.
Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Trans. on Inf. Theory, 50, 2050-2057.
Chang, M., Ratinov, L., & Roth, D. (2008). Constraints as prior knowledge. Int. Conf. Mach. Learn. Workshop on Prior Knowledge for Text and Language Processing.
Chu, Y. J., & Liu, T. H. (1965). On the shortest arborescence of a directed graph. Science Sinica, 14, 1396-1400.
Clarke, J., & Lapata, M. (2008). Global inference for sentence compression: An integer linear programming approach. J. Art. Intel. Res., 31, 399-429.
Collins, M. (2002). Discriminative training methods for HMMs: Theory and experiments with perceptron algorithms. Emp. Meth. Nat. Lang. Proc.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. J. Mach. Learn. Res., 7, 551-585.
Daumé, H., & Marcu, D. (2005). Learning as search optimization: Approximate large margin methods for structured prediction. Int. Conf. Mach. Learn., 169-176.
Denis, P., & Baldridge, J. (2007). Joint determination of anaphoricity and coreference resolution using integer programming. North. Am. Assoc. Comp. Ling.
Edmonds, J. (1967). Optimum branchings. J. Res. National Bureau of Standards, 71B, 233-240.
Eisner, J. (1996). Three new probabilistic models for dependency parsing: An exploration. Int. Conf. Comp. Ling., 340-345.
Finley, T., & Joachims, T. (2008). Training structural SVMs when exact inference is intractable. Int. Conf. Mach. Learn., 304-311.
Kulesza, A., & Pereira, F. (2007). Structured learning with approximate inference. Neur. Inf. Proc. Sys. 20, 785-792.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Int. Conf. Mach. Learn., 282-289.
Levin, L. (1986). Average case complete problems. SIAM Journal on Computing, 15, 285.
Magnanti, T., & Wolsey, L. (1994). Optimal trees (Tech. Rep. 290-94). MIT, Op. Res. Center.
Martins, A. F. T., Smith, N. A., & Xing, E. P. (2009). Concise integer linear programming formulations for dependency parsing. Assoc. Comp. Ling. To appear.
McDonald, R., Lerman, K., & Pereira, F. (2006). Multilingual dependency analysis with a two-stage discriminative parser. Conf. Comp. Nat. Lang. Learn., 216-220.
McDonald, R., & Satta, G. (2007). On the complexity of non-projective data-driven dependency parsing. Int. Conf. Pars. Tech., 121-132.
McDonald, R. T., Pereira, F., Ribarov, K., & Hajic, J. (2005). Non-projective dependency parsing using spanning tree algorithms. Emp. Meth. Nat. Lang. Proc., 523-530.
Ratliff, N., Bagnell, J., & Zinkevich, M. (2006). Subgradient methods for maximum margin structured learning. Int. Conf. Mach. Learn. Workshop on Learning in Structured Output Spaces.
Richardson, M., & Domingos, P. (2006). Markov logic networks. Mach. Learn., 62, 107-136.
Riedel, S., & Clarke, J. (2006). Incremental integer linear programming for non-projective dependency parsing. Emp. Meth. Nat. Lang. Proc., 129-137.
Roth, D., & Yih, W. (2005). Integer linear programming inference for conditional random fields. Int. Conf. Mach. Learn., 736-743.
Smith, D. A., & Eisner, J. (2008). Dependency parsing by belief propagation. Emp. Meth. Nat. Lang. Proc., 145-156.
Taskar, B., Guestrin, C., & Koller, D. (2004a). Max-margin Markov networks. Neur. Inf. Proc. Sys. 16.
Taskar, B., Chatalbashev, V., & Koller, D. (2004b). Learning associative Markov networks. Int. Conf. Mach. Learn.
Taskar, B., Lacoste-Julien, S., & Jordan, M. (2006). Structured prediction via the extragradient method. Neur. Inf. Proc. Sys. 18, 1345-1352.
Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Int. Conf. Mach. Learn.
Vazirani, V. (2001). Approximation algorithms. Springer.