Treebank Grammar Techniques for Non-Projective Dependency Parsing

Marco Kuhlmann (Uppsala University, Uppsala, Sweden) marco.kuhlmann@lingfil.uu.se
Giorgio Satta (University of Padua, Padova, Italy) satta@dei.unipd.it

Abstract

An open problem in dependency parsing is the accurate and efficient treatment of non-projective structures. We propose to attack this problem using chart-parsing algorithms developed for mildly context-sensitive grammar formalisms. In this paper, we provide two key tools for this approach. First, we show how to reduce non-projective dependency parsing to parsing with Linear Context-Free Rewriting Systems (LCFRS), by presenting a technique for extracting LCFRS from dependency treebanks. For efficient parsing, the extracted grammars need to be transformed in order to minimize the number of nonterminal symbols per production. Our second contribution is an algorithm that computes this transformation for a large, empirically relevant class of grammars.

1 Introduction

Dependency parsing is the task of predicting the most probable dependency structure for a given sentence. One of the key choices in dependency parsing concerns the class of candidate structures for this prediction. Many parsers are confined to projective structures, in which the yield of a syntactic head is required to be continuous. A major benefit of this choice is computational efficiency: an exhaustive search over all projective structures can be done in cubic time, and greedy parsing in linear time (Eisner, 1996; Nivre, 2003).
A major drawback of the restriction to projective dependency structures is a potential loss in accuracy. For example, around 23% of the analyses in the Prague Dependency Treebank of Czech (Hajič et al., 2001) are non-projective, and for German and Dutch treebanks, the proportion of non-projective structures is even higher (Havelka, 2007).

The problem of non-projective dependency parsing under the joint requirement of accuracy and efficiency has only recently been addressed in the literature. Some authors propose to solve it by techniques for recovering non-projectivity from the output of a projective parser in a post-processing step (Hall and Novák, 2005; Nivre and Nilsson, 2005); others extend projective parsers by heuristics that allow at least certain non-projective constructions to be parsed (Attardi, 2006; Nivre, 2007). McDonald et al. (2005) formulate dependency parsing as the search for the most probable spanning tree over the full set of all possible dependencies. However, this approach is limited to probability models with strong independence assumptions. Exhaustive non-projective dependency parsing with more powerful models is intractable (McDonald and Satta, 2007), and one has to resort to approximation algorithms (McDonald and Pereira, 2006).

In this paper, we propose to attack non-projective dependency parsing in a principled way, using polynomial chart-parsing algorithms developed for mildly context-sensitive grammar formalisms. This proposal is motivated by the observation that most dependency structures required for the analysis of natural language are very nearly projective, differing only minimally from the best projective approximation (Kuhlmann and Nivre, 2006), and by the close link between such 'mildly non-projective' dependency structures on the one hand, and grammar formalisms with mildly context-sensitive generative capacity on the other (Kuhlmann and Möhl, 2007). Furthermore, as pointed out by McDonald and Satta (2007), chart-parsing algorithms are amenable to augmentation by non-local information such as arity constraints and Markovization, and therefore should allow for more predictive statistical models than those used by current systems for non-projective dependency parsing. Hence, mildly non-projective dependency parsing promises to be both efficient and accurate.

Contributions. In this paper, we contribute two key tools for making the mildly context-sensitive approach to accurate and efficient non-projective dependency parsing work. First, we extend the standard technique for extracting context-free grammars from phrase-structure treebanks (Charniak, 1996) to mildly context-sensitive grammars and dependency treebanks. More specifically, we show how to extract, from a given dependency treebank, a lexicalized Linear Context-Free Rewriting System (LCFRS) whose derivations capture the dependency analyses in the treebank in the same way as the derivations of a context-free treebank grammar capture phrase-structure analyses. Our technique works for arbitrary, even non-projective dependency treebanks, and essentially reduces non-projective dependency parsing to parsing with LCFRS. This problem can be solved using standard chart-parsing techniques.

Our extraction technique yields a grammar whose parsing complexity is polynomial in the length of the sentence, but exponential both in a measure of the non-projectivity of the treebank and in the maximal number of dependents per word, reflected as the rank of the extracted LCFRS. While the number of highly non-projective dependency structures is negligible for practical applications (Kuhlmann and Nivre, 2006), the rank cannot easily be bounded. Therefore, we present an algorithm that transforms the extracted grammar into a normal form that has rank 2, and thus can be parsed more efficiently. This contribution is important even independently of the extraction procedure: while it is known that a rank-2 normal form of LCFRS does not exist in the general case (Rambow and Satta, 1999), our algorithm succeeds for a large and empirically relevant class of grammars.

2 Preliminaries

We start by introducing dependency trees and Linear Context-Free Rewriting Systems (LCFRS). Throughout the paper, for positive integers $i$ and $j$, we write $[i, j]$ for the interval $\{\, k \mid i \le k \le j \,\}$, and use $[n]$ as a shorthand for $[1, n]$.

2.1 Dependency Trees

Dependency parsing is the task of assigning dependency structures to a given sentence $w$. For the purposes of this paper, dependency structures are edge-labelled trees. More formally, let $w$ be a sentence, understood as a sequence of tokens over some given alphabet $T$, and let $L$ be an alphabet of edge labels. A dependency tree for $w$ is a construct $D = (w, E, \lambda)$, where $E$ forms a rooted tree (in the standard graph-theoretic sense) on the set $[|w|]$, and $\lambda$ is a total function that assigns every edge in $E$ a label in $L$. Each node of $D$ represents a (position of a) token in $w$.

Example 1. Figure 2 shows a dependency tree for the sentence "A hearing is scheduled on the issue today", which consists of 8 tokens and the edges $\{(2,1), (2,5), (3,2), (3,4), (4,8), (5,7), (7,6)\}$. The edges are labelled with syntactic functions such as sbj for 'subject'. The root node is marked by a dotted line.

Let $u$ be a node of a dependency tree $D$. A node $u'$ is a descendant of $u$ if there is a (possibly empty) path from $u$ to $u'$. A block of $u$ is a maximal interval of descendants of $u$. The number of blocks of $u$ is called the block-degree of $u$. The block-degree of a dependency tree is the maximum among the block-degrees of its nodes. A dependency tree is projective if its block-degree is 1.

Example 2. The tree shown in Figure 2 is not projective: both node 2 (hearing) and node 4 (scheduled) have block-degree 2. Their blocks are $\{1, 2\}, \{5, 6, 7\}$ and $\{4\}, \{8\}$, respectively.

2.2 LCFRS

Linear Context-Free Rewriting Systems (LCFRS) have been introduced as a generalization of several mildly context-sensitive grammar formalisms. Here we use the standard definition of LCFRS (Vijay-Shanker et al., 1987) and only fix our notation; for a more thorough discussion of this formalism, we refer to the literature.

Let $G$ be an LCFRS. Recall that each nonterminal symbol $A$ of $G$ comes with a positive integer called the fan-out of $A$, and that a production $p$ of $G$ has the form

  $A \to g(A_1, \ldots, A_r)$ ;  $g(\vec{x}_1, \ldots, \vec{x}_r) = \vec{\alpha}$,

where $A, A_1, \ldots, A_r$ are nonterminals with fan-out $f, f_1, \ldots, f_r$, respectively, $g$ is a function symbol, and the equation to the right of the semicolon specifies the semantics of $g$. For each $i \in [r]$, $\vec{x}_i$ is an $f_i$-tuple of variables, and $\vec{\alpha} = \langle \alpha_1, \ldots, \alpha_f \rangle$ is a tuple of strings over the variables on the left-hand side of the equation and the alphabet of terminal symbols, in which each variable appears exactly once. The production $p$ is said to have rank $r$, fan-out $f$, and length $|\alpha_1| + \cdots + |\alpha_f| + (f - 1)$.

3 Grammar Extraction

We now explain how to extract an LCFRS from a dependency treebank, in very much the same way as a context-free grammar can be extracted from a phrase-structure treebank (Charniak, 1996).

3.1 Dependency Treebank Grammars

A simple way to induce a context-free grammar from a phrase-structure treebank is to read off the productions of the grammar from the trees. We will specify a procedure for extracting, from a given dependency treebank, a lexicalized LCFRS $G$ that is adequate in the sense that for every analysis $D$ of a sentence $w$ in the treebank, there is a derivation tree of $G$ that is isomorphic to $D$ (meaning that it becomes equal to $D$ after a suitable renaming and relabelling of nodes) and has $w$ as its derived string. Here, a derivation tree of an LCFRS $G$ is an ordered tree such that each node $u$ is labelled with a production $p$ of $G$, the number of children of $u$ equals the rank $r$ of $p$, and for each $i \in [r]$, the $i$th child of $u$ is labelled with a production that has as its left-hand side the $i$th nonterminal on the right-hand side of $p$.

The basic idea behind our extraction procedure is that, in order to represent the compositional structure of a possibly non-projective dependency tree, one needs to represent the decomposition and relative order not of subtrees, but of blocks of subtrees (Kuhlmann and Möhl, 2007). We introduce some terminology. A component of a node $u$ in a dependency tree is either a block $B$ of some child $u'$ of $u$, or the singleton interval that contains $u$; this interval represents the position in the string that is occupied by the lexical item corresponding to $u$. We say that $u'$ contributes $B$ to $u$, and that $u$ contributes $[u, u]$ to itself. Notice that the number of components that $u'$ contributes to its parent $u$ equals the block-degree of $u'$. Our goal is to construct for $u$ a production of an LCFRS that specifies how each block of $u$ decomposes into components, and how these components are ordered relative to one another. These productions will make an adequate LCFRS, in the sense defined above. A small code sketch of the component computation follows below.
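To make blocks and components concrete, here is a minimal Python sketch (ours, not the authors' code; all names are our own choices). It assumes the tree is given as a parent array over nodes 1..n in surface order, and returns components as (start, end, contributor) triples.

def node_blocks(parent):
    # Blocks of each node u: maximal intervals of descendants of u
    # (every node counts as a descendant of itself).
    n = len(parent) - 1                      # parent[0] is unused padding
    desc = {u: {u} for u in range(1, n + 1)}
    for u in range(1, n + 1):
        v = parent[u]
        while v is not None:                 # propagate u to all ancestors
            desc[v].add(u)
            v = parent[v]
    blocks = {}
    for u, ds in desc.items():
        bs, run = [], []
        for i in sorted(ds):                 # maximal intervals = blocks
            if run and i != run[-1] + 1:
                bs.append((run[0], run[-1], u))
                run = []
            run.append(i)
        bs.append((run[0], run[-1], u))
        blocks[u] = bs
    return blocks

def components(u, parent, blocks):
    """Components of u: the blocks of u's children, plus u's own [u, u]."""
    comps = [(u, u, u)]                      # u contributes its own position
    for c, bs in blocks.items():
        if parent[c] == u:
            comps.extend(bs)                 # each block carries its contributor
    return sorted(comps)

# Tree from Figure 2 (root is node 3, which has parent None).
parent = [None, 2, 3, None, 3, 2, 7, 5, 4]
bs = node_blocks(parent)
print(components(3, parent, bs))
# [(1, 2, 2), (3, 3, 3), (4, 4, 4), (5, 7, 2), (8, 8, 4)]

On this tree, nodes 2 and 4 each have two blocks, reproducing the block-degrees from Example 2.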
3.2 Annotating the Components

The core of our extraction procedure is an efficient algorithm that annotates each node $u$ of a given dependency tree with the list of its components, sorted by their left endpoints. It is helpful to think of this algorithm as consisting of two independent parts: one that annotates each node $u$ with the list of the left endpoints of its components (Annotate-L), and one that annotates the corresponding right endpoints (Annotate-R). The list of components can then be obtained by zipping the two lists of endpoints together in linear time.

Figure 1 shows pseudocode for Annotate-L; the pseudocode for Annotate-R is symmetric (a runnable sketch follows Lemma 1 below). We do a single left-to-right sweep over the nodes of the input tree $D$. In each step, we annotate all nodes $u'$ that have the current node $u$ as the left endpoint of one of their components. Since the sweep is from left to right, this gets us the left endpoints of $u'$ in the desired order. The nodes that we annotate are the nodes $u'$ on the path between $u$ and the least common ancestor (lca) $b$ of $u$ and its predecessor, or on the path from the root node to $u$ in case $u$ is the leftmost node of $D$.

1: Function Annotate-L(D)
2:   for each node u of D, from left to right do
3:     if u is the first node of D then
4:       b := the root node of D
5:     else
6:       b := the lca of u and its predecessor
7:     for each u' on the path π from b to u do
8:       left[u'] := left[u'] · u

Figure 1: Annotation with components

Example 3. For the dependency tree in Figure 2, Annotate-L constructs the following lists left[u] of left endpoints, for u = 1, ..., 8:

  left[1] = 1            left[5] = 5 6
  left[2] = 1 2 5        left[6] = 6
  left[3] = 1 3 4 5 8    left[7] = 6 7
  left[4] = 4 8          left[8] = 8

The following lemma establishes the correctness of the algorithm.

Lemma 1. Let $D$ be a dependency tree, and let $u$ and $u'$ be nodes of $D$. Let $b$ be the least common ancestor of $u$ and its predecessor, or the root node in case $u$ is the leftmost node of $D$. Then $u$ is the left endpoint of a component of $u'$ if and only if $u'$ lies on the path from $b$ to $u$.

Proof. It is clear that $u'$ must be an ancestor of $u$. If $u$ is the leftmost node of $D$, then $u$ is the left endpoint of the leftmost component of all of its ancestors. Now suppose that $u$ is not the leftmost node of $D$, and let $\hat{u}$ be the predecessor of $u$. Distinguish three cases. If $u'$ is not an ancestor of $\hat{u}$, then $\hat{u}$ does not belong to any component of $u'$; therefore, $u$ is the left endpoint of a component of $u'$. If $u'$ is an ancestor of $\hat{u}$ but $u' \ne b$, then $\hat{u}$ and $u$ belong to the same component of $u'$; therefore, $u$ is not the left endpoint of this component. Finally, if $u' = b$, then $u$ and $\hat{u}$ belong to different components of $u'$; therefore, $u$ is the left endpoint of the component it belongs to.
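The following Python sketch mirrors Figure 1 (again ours, not the authors' implementation). For brevity, the lca is computed naively by walking to the root, so this sketch does not achieve the linear-time bound; the marker-based lca implementation discussed in the runtime analysis below is omitted.

def annotate_l(parent):
    # parent[u] is the parent of node u (None for the root); nodes 1..n
    # are in surface order; parent[0] is unused padding.
    n = len(parent) - 1
    left = {u: [] for u in range(1, n + 1)}

    def ancestors(u):                 # path from u up to the root, inclusive
        path = []
        while u is not None:
            path.append(u)
            u = parent[u]
        return path

    for u in range(1, n + 1):
        if u == 1:                    # leftmost node: b is the root
            b = ancestors(u)[-1]
        else:                         # b is the lca of u and its predecessor
            anc_u = ancestors(u)
            anc_prev = set(ancestors(u - 1))
            b = next(v for v in anc_u if v in anc_prev)
        for v in ancestors(u):        # annotate the path from b down to u
            left[v].append(u)
            if v == b:
                break
    return left

# Tree from Figure 2; reproduces the lists of Example 3, e.g.
# left[3] == [1, 3, 4, 5, 8].
parent = [None, 2, 3, None, 3, 2, 7, 5, 4]
print(annotate_l(parent))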
We now turn to an analysis of the runtime of the algorithm. Let $n$ be the number of components of $D$. It is not hard to imagine an algorithm that performs the annotation task in time $O(n \log n)$: such an algorithm could construct the components for a given node $u$ by essentially merging the lists of components of the children of $u$ into a new sorted list. In contrast, our algorithm takes time $O(n)$. The crucial part of the analysis is the assignment in line 6, which computes the least common ancestor of $u$ and its predecessor. Using markers for the path from the root node to $u$, it is straightforward to implement this assignment in time $O(|\pi|)$, where $\pi$ is the path from $b$ to $u$. Now notice that, by our correctness argument, line 8 of the algorithm is executed exactly $n$ times. Therefore, the sum over the lengths of all the paths $\pi$, and hence the amortized time of computing all the least common ancestors in line 6, is $O(n)$. This runtime complexity is optimal for the task we are solving.

3.3 Extraction Procedure

We now describe how to extend the annotation algorithm into a procedure that extracts an LCFRS from a given dependency tree $D$. The basic idea is to transform the list of components of each node $u$ of $D$ into a production $p$. This transformation only renames and relabels nodes, and therefore yields an adequate derivation tree. For the construction of the production, we actually need an extended version of the annotation algorithm, in which each component is annotated with the node that contributed it. This extension is straightforward, and does not affect the linear runtime complexity.

Let $D$ be a dependency tree for a sentence $w$. Consider a single node $u$ of $D$, and assume that $u$ has $r$ children and that the block-degree of $u$ is $f$. We construct for $u$ a production $p$ with rank $r$ and fan-out $f$. For convenience, let us order the children of $u$, say by their leftmost descendants; let us write $u_i$ for the $i$th child of $u$ according to this order, and $f_i$ for the block-degree of $u_i$, $i \in [r]$. The production $p$ has the form

  $L \to g(L_1, \ldots, L_r)$ ;  $g(\vec{x}_1, \ldots, \vec{x}_r) = \vec{\alpha}$,

where $L$ is the label of the incoming edge of $u$ (or the special label root in case $u$ is the root node of $D$), and for each $i \in [r]$: $L_i$ is the label of the incoming edge of $u_i$; $\vec{x}_i$ is an $f_i$-tuple of variables of the form $x_{i,j}$, where $j \in [f_i]$; and $\vec{\alpha}$ is an $f$-tuple of strings that is constructed in a single left-to-right sweep over the list of components computed for $u$, as follows. Let $k \in [f]$ be a pointer to the current segment of $\vec{\alpha}$; initially, $k = 1$. If the current component is not adjacent (as an interval) to the previous component, we increase $k$ by one. If the current component is contributed by the child $u_i$, $i \in [r]$, we add the variable $x_{i,j}$ to $\alpha_k$, where $j$ is the number of times we have seen a component contributed by $u_i$ during the sweep; notice that $j \in [f_i]$. If the current component is the (unique) component contributed by $u$ itself, we add the token corresponding to $u$ to $\alpha_k$. In this way, we obtain a complete specification of how the blocks of $u$ (represented by the segments of the tuple $\vec{\alpha}$) decompose into the components of $u$, and of the relative order of the components. As an example, Figure 2 shows the productions extracted from the tree above; a code sketch of the sweep follows below.
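The sweep can be sketched in a few lines of Python (ours, not the authors' code), reusing parent, bs, and components from the sketch in Section 3.1. Here comps is the sorted list of (start, end, contributor) triples for node u, child_order lists u's children ordered by leftmost descendant, token maps nodes to word forms, and label maps nodes to the labels of their incoming edges (the root is absent from label).

def make_production(u, comps, child_order, token, label):
    idx = {c: i + 1 for i, c in enumerate(child_order)}   # child -> i
    seen = {c: 0 for c in child_order}   # components of child seen so far
    segments = [[]]                      # the strings alpha_1 .. alpha_f
    prev_end = None
    for start, end, contrib in comps:
        if prev_end is not None and start != prev_end + 1:
            segments.append([])          # gap: move on to alpha_{k+1}
        if contrib == u:
            segments[-1].append(token[u])                  # lexical anchor
        else:
            seen[contrib] += 1
            segments[-1].append(f"x{idx[contrib]},{seen[contrib]}")
        prev_end = end
    lhs = label.get(u, "root")
    rhs = [label[c] for c in child_order]
    return lhs, rhs, tuple(" ".join(s) for s in segments)

token = {1: "A", 2: "hearing", 3: "is", 4: "scheduled",
         5: "on", 6: "the", 7: "issue", 8: "today"}
label = {1: "nmod", 2: "sbj", 4: "vc", 5: "pp", 6: "nmod",
         7: "np", 8: "tmp"}
print(make_production(3, components(3, parent, bs), [2, 4], token, label))
# ('root', ['sbj', 'vc'], ('x1,1 is x2,1 x1,2 x2,2',))

Run on node 3 of the running example, the sketch reproduces the production g3 of Figure 2.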
3.4 Parsing the Extracted Grammar

Once we have extracted the grammar for a dependency treebank, we can apply any parsing algorithm for LCFRS to non-projective dependency parsing. The generic chart-parsing algorithm for LCFRS runs in time $O(|P| \cdot |w|^{f(r+1)})$, where $P$ is the set of productions of the input grammar $G$, $w$ is the input string, $r$ is the maximal rank, and $f$ is the maximal fan-out of a production in $G$ (Seki et al., 1991). For a grammar $G$ extracted by our technique, the number $f$ equals the maximal block-degree per node. Hence, without any further modification, we obtain a parsing algorithm that is polynomial in the length of the sentence, but exponential in both the block-degree and the rank. This is clearly unacceptable in practical systems. The relative frequency of analyses with a block-degree greater than 2 is almost negligible (Havelka, 2007); the bigger obstacle to applying the treebank grammar is the rank of the resulting LCFRS. Therefore, in the remainder of the paper, we present an algorithm that can transform the productions of the input grammar $G$ into an equivalent set of productions with rank at most 2, while preserving the fan-out. This transformation, if it succeeds, yields a parsing algorithm that runs in time $O(|P| \cdot r \cdot |w|^{3f})$.

[Figure 2: A dependency tree for the sentence "A hearing is scheduled on the issue today" (tokens 1 A, 2 hearing, 3 is, 4 scheduled, 5 on, 6 the, 7 issue, 8 today; edges and labels as in Example 1), and the LCFRS extracted from it:

  nmod → g1 ;            g1 = ⟨A⟩
  sbj  → g2(nmod, pp) ;  g2(⟨x1,1⟩, ⟨x2,1⟩) = ⟨x1,1 hearing, x2,1⟩
  root → g3(sbj, vc) ;   g3(⟨x1,1, x1,2⟩, ⟨x2,1, x2,2⟩) = ⟨x1,1 is x2,1 x1,2 x2,2⟩
  vc   → g4(tmp) ;       g4(⟨x1,1⟩) = ⟨scheduled, x1,1⟩
  pp   → g5(np) ;        g5(⟨x1,1⟩) = ⟨on x1,1⟩
  nmod → g6 ;            g6 = ⟨the⟩
  np   → g7(nmod) ;      g7(⟨x1,1⟩) = ⟨x1,1 issue⟩
  tmp  → g8 ;            g8 = ⟨today⟩ ]

4 Adjacency

In this section we discuss a method for factorizing an LCFRS into productions of rank 2. Before starting, we get rid of the 'easy' cases. A production $p$ is connected if any two strings $\alpha_i, \alpha_j$ in $p$'s definition share at least one variable referring to the same nonterminal. It is not difficult to see that, when $p$ is not connected, we can always split it into new productions of lower rank: we can split $p$ into its connected components using standard methods for finding the connected components of an undirected graph, which can be implemented in time $O(r \cdot f)$, where $r$ and $f$ are the rank and the fan-out of $p$, respectively. Therefore, throughout this section we assume that LCFRS have connected productions only.

4.1 Adjacency Graphs

Let $p$ be a production with length $n$ and fan-out $f$, associated with function $g$. The set of positions of $p$ is the set $[n]$. Informally, each position represents a variable or a lexical element in one of the components of the definition of $g$, or else a 'gap' between two of these components. (Recall that $n$ also accounts for the $f - 1$ gaps in the body of $g$.)

Example 4. The set of positions of the production for hearing in Figure 2 is $[4]$: 1 for the variable $x_{1,1}$, 2 for hearing, 3 for the gap, and 4 for $x_{2,1}$.

Let $i_1, j_1, i_2, j_2 \in [n]$. An interval $[i_1, j_1]$ is adjacent to an interval $[i_2, j_2]$ if either $j_1 = i_2 - 1$ (left-adjacent) or $i_1 = j_2 + 1$ (right-adjacent). A multi-interval, or m-interval for short, is a set $v$ of pairwise disjoint intervals such that no interval in $v$ is adjacent to any other interval in $v$. The fan-out of $v$, written $f(v)$, is defined as $|v|$.

We use m-intervals to represent the nonterminals and the lexical element heading $p$. The $i$th nonterminal on the right-hand side of $p$ is represented by the m-interval obtained by collecting all the positions of $p$ that represent a variable from the $i$th argument of $g$. The head of $p$ is represented by the m-interval containing the associated position. Note that all these m-intervals are pairwise disjoint.

Example 5. Consider the production for is in Figure 2. The set of positions is $[5]$. The first nonterminal is represented by the m-interval $\{[1,1], [4,4]\}$, the second nonterminal by $\{[3,3], [5,5]\}$, and the lexical head by $\{[2,2]\}$.

For disjoint m-intervals $v_1, v_2$, we say that $v_1$ is adjacent to $v_2$, denoted by $v_1 \to v_2$, if for every interval $I_1 \in v_1$ there is an interval $I_2 \in v_2$ such that $I_1$ is adjacent to $I_2$. Adjacency is not symmetric: if $v_1 = \{[1,1], [4,4]\}$ and $v_2 = \{[2,2]\}$, then $v_2 \to v_1$, but not vice versa.

Let $V$ be some collection of pairwise disjoint m-intervals representing $p$ as above. The adjacency graph associated with $p$ is the graph $G = (V, \to_G)$ whose vertices are the m-intervals in $V$, and whose edges $\to_G$ are defined by restricting the adjacency relation $\to$ to the set $V$.

For m-intervals $v_1, v_2 \in V$, the merger of $v_1$ and $v_2$, denoted by $v_1 \oplus v_2$, is the (uniquely determined) m-interval whose span is the union of the spans of $v_1$ and $v_2$. As an example, if $v_1 = \{[1,1], [3,3]\}$ and $v_2 = \{[2,2]\}$, then $v_1 \oplus v_2 = \{[1,3]\}$. Notice that the way in which we defined m-intervals ensures that a merging operation collapses all adjacent intervals. The proof of the following lemma is straightforward and omitted for space reasons.

Lemma 2. If $v_1 \to v_2$, then $f(v_1 \oplus v_2) \le f(v_2)$.

4.2 The Adjacency Algorithm

Let $G = (V, \to_G)$ be some adjacency graph, and let $v_1 \to_G v_2$. We can derive a new adjacency graph from $G$ by merging $v_1$ and $v_2$. The resulting graph $G'$ has vertices $V' = (V - \{v_1, v_2\}) \cup \{v_1 \oplus v_2\}$ and a set of edges $\to_{G'}$ obtained by restricting the adjacency relation $\to$ to $V'$. We denote the derive relation as $G \Rightarrow_{(v_1, v_2)} G'$. Informally, if $G$ represents some LCFRS production $p$ and $v_1, v_2$ represent nonterminals $A_1, A_2$, then $G'$ represents a production $p'$ obtained from $p$ by replacing $A_1, A_2$ with a fresh nonterminal $A$. A new production $p''$ can also be constructed, expanding $A$ into $A_1, A_2$, so that $p'$ and $p''$ together are equivalent to $p$. Furthermore, $p'$ has a rank smaller than the rank of $p$ and, by Lemma 2, $A$ does not increase the overall fan-out of the grammar.

In order to simplify the notation, we adopt the following convention. Let $G \Rightarrow_{(v_1, v_2)} G'$ and let $v \to_G v_1$, $v \ne v_2$. If $v \to_{G'} v_1 \oplus v_2$, then the edges $(v, v_1)$ and $(v, v_1 \oplus v_2)$ will be identified, and we say that $G'$ inherits $(v, v_1 \oplus v_2)$ from $G$. If $v \not\to_{G'} v_1 \oplus v_2$, then we say that $(v, v_1)$ does not survive the derive step. This convention is used for all edges incident upon $v_1$ or $v_2$.

Our factorization algorithm is reported in Figure 3. We start from an adjacency graph representing some LCFRS production that needs to be factorized. We arbitrarily choose an edge $e$ of the graph and push it into a set $R$, in order to keep a record of the candidate factorization. We then merge the two m-intervals incident to $e$, and recompute the adjacency relation for the new set of vertices. We iterate until the resulting graph has an empty edge set. If the final graph has only one vertex, then we have managed to factorize our production into a set of productions with rank at most two that can be computed from $R$.

 1: Function Factorize(G = (V, →G))
 2:   R := ∅
 3:   while →G ≠ ∅ do
 4:     choose (v1, v2) ∈ →G
 5:     R := R ∪ {(v1, v2)}
 6:     V := (V − {v1, v2}) ∪ {v1 ⊕ v2}
 7:     →G := {(v, v′) | v, v′ ∈ V, v → v′}
 8:   if |V| = 1 then
 9:     output R and accept
10:   else
11:     reject

Figure 3: Factorization algorithm

Example 6. Let $V = \{v_1, v_2, v_3\}$ with $v_1 = \{[4,4]\}$, $v_2 = \{[1,1], [3,3]\}$, and $v_3 = \{[2,2], [5,5]\}$. Then $(v_1, v_2) \in \to_G$. After merging $v_1, v_2$ we have a new graph $G$ with $V = \{v_1 \oplus v_2, v_3\}$ and $(v_1 \oplus v_2, v_3) \in \to_G$. We finally merge $v_1 \oplus v_2$ and $v_3$, resulting in a new graph $G$ with $V = \{v_1 \oplus v_2 \oplus v_3\}$ and $\to_G = \emptyset$. We then accept and stop.
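The following is a minimal Python sketch of Figure 3 (our reading of the algorithm, not the authors' implementation). M-intervals are modelled as frozensets of (start, end) pairs, and the arbitrary choice at line 4 is fixed to the first edge found.

from itertools import permutations

def merge(v1, v2):
    # Merger v1 (+) v2: union of the spans, with adjacent intervals
    # collapsed into maximal intervals.
    points = sorted({p for (s, e) in v1 | v2 for p in range(s, e + 1)})
    out, start, prev = [], points[0], points[0]
    for p in points[1:]:
        if p > prev + 1:
            out.append((start, prev))
            start = p
        prev = p
    out.append((start, prev))
    return frozenset(out)

def adjacent(v1, v2):
    # v1 -> v2: every interval of v1 is adjacent to some interval of v2.
    return all(any(e1 == s2 - 1 or s1 == e2 + 1 for (s2, e2) in v2)
               for (s1, e1) in v1)

def edges(V):
    return [(a, b) for a, b in permutations(V, 2) if adjacent(a, b)]

def factorize(V):
    # Returns the record R of merge steps on accept, or None on reject.
    V, R = set(V), []
    es = edges(V)
    while es:
        v1, v2 = es[0]                       # arbitrary choice (line 4)
        R.append((v1, v2))
        V = (V - {v1, v2}) | {merge(v1, v2)}
        es = edges(V)
    return R if len(V) == 1 else None

# The production for "is" (Example 5) is factorized in two merge steps.
v_sbj, v_vc, v_head = (frozenset({(1, 1), (4, 4)}),
                       frozenset({(3, 3), (5, 5)}),
                       frozenset({(2, 2)}))
print(factorize({v_sbj, v_vc, v_head}) is not None)    # True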
4.3 Mathematical Properties

We have already argued that, if the algorithm accepts, then a binary factorization that does not increase the fan-out of the grammar can be built from $R$. We still need to prove that the algorithm answers consistently on a given input, despite possibly different choices of edges at line 4. We do this through several intermediate results.

A derivation for an adjacency graph $G$ is a sequence of edges $d = \langle e_1, \ldots, e_n \rangle$, $n \ge 1$, such that $G = G_0$ and $G_{i-1} \Rightarrow_{e_i} G_i$ for every $i$ with $1 \le i \le n$. For short, we write $G_0 \Rightarrow_d G_n$. Two derivations for $G$ are competing if one is a permutation of the other.

Lemma 3. If $G \Rightarrow_{d_1} G_1$ and $G \Rightarrow_{d_2} G_2$ with $d_1$ and $d_2$ competing derivations, then $G_1 = G_2$.

Proof. We claim that the statement of the lemma holds for $|d_1| = 2$. To see this, let $G \Rightarrow_{e_1} G'_1 \Rightarrow_{e_2} G_1$ and $G \Rightarrow_{e_2} G'_2 \Rightarrow_{e_1} G_2$ be valid derivations. We observe that $G_1$ and $G_2$ have the same set of vertices. Since the edges of $G_1$ and $G_2$ are defined by restricting the adjacency relation to their sets of vertices, our claim immediately follows. The statement of the lemma then follows from the above claim and from the fact that we can always obtain the sequence $d_2$ starting from $d_1$ by repeatedly switching consecutive edges.

We now consider derivations for the same adjacency graph that are not competing, and show that they always lead to isomorphic adjacency graphs. Two graphs are isomorphic if they become equal after some suitable renaming of the vertices.

Lemma 4. The out-degree of $G$ is bounded by 2.

Proof. Assume $v \to_G v_1$ and $v \to_G v_2$, with $v_1 \ne v_2$, and let $I \in v$. $I$ must be adjacent to some interval $I_1 \in v_1$. Without loss of generality, assume that $I$ is left-adjacent to $I_1$. $I$ must also be adjacent to some interval $I_2 \in v_2$. Since $v_1$ and $v_2$ are disjoint, $I$ must be right-adjacent to $I_2$. This implies that $I$ cannot be adjacent to an interval in any other m-interval $v'$ of $G$.

A vertex $v$ of $G$ such that $v \to_G v_1$ and $v \to_G v_2$ is called a bifurcation.

Example 7. Assume $v = \{[2,2]\}$, $v_1 = \{[3,3], [5,5]\}$, and $v_2 = \{[1,1]\}$, with $v \to_G v_1$ and $v \to_G v_2$. The m-interval $v \oplus v_1 = \{[2,3], [5,5]\}$ is no longer adjacent to $v_2$.

The example above shows that, when choosing one of the two outgoing edges of a bifurcation for merging, the other edge might not survive. Thus, such a choice might lead to distinguishable derivations that are not competing (one derivation has an edge that is not present in the other). As we will see in the proof of Theorem 1, bifurcations are the only cases in which edges might not survive a merging.
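Example 7 can be checked directly with the merge and adjacent helpers from the sketch in Section 4.2 (again purely illustrative):

# Bifurcation from Example 7: after merging v with v1, the edge to v2
# does not survive.
v  = frozenset({(2, 2)})
v1 = frozenset({(3, 3), (5, 5)})
v2 = frozenset({(1, 1)})
assert adjacent(v, v1) and adjacent(v, v2)       # v is a bifurcation
assert merge(v, v1) == frozenset({(2, 3), (5, 5)})
assert not adjacent(merge(v, v1), v2)            # the other edge is lost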
Lemma 5. Let $v$ be a bifurcation of $G$ with outgoing edges $e_1, e_2$, and let $G \Rightarrow_{e_1} G_1$ and $G \Rightarrow_{e_2} G_2$. Then $G_1$ and $G_2$ are isomorphic.

Proof (sketch). Assume $e_1$ has the form $v \to_G v_1$ and $e_2$ has the form $v \to_G v_2$. Let $V_S$ be the set of vertices shared by $G_1$ and $G_2$. We show that the statement holds under the isomorphism mapping $v \oplus v_1$ and $v_2$ in $G_1$ to $v_1$ and $v \oplus v_2$ in $G_2$, respectively.

When restricted to $V_S$, the graphs $G_1$ and $G_2$ are equal. Let us then consider edges from $G_1$ and $G_2$ involving exactly one vertex in $V_S$. We show that, for $v' \in V_S$, $v' \to_{G_1} v \oplus v_1$ if and only if $v' \to_{G_2} v_1$. Consider an arbitrary interval $I' \in v'$. If $v' \to_{G_1} v \oplus v_1$, then $I'$ must be adjacent to some interval $I_1 \in v \oplus v_1$. If $I_1 \in v_1$ we are done. Otherwise, $I_1$ must be the concatenation of two intervals $I_1^v$ and $I_1^{v_1}$ with $I_1^v \in v$ and $I_1^{v_1} \in v_1$. Since $v \to_{G_2} v_2$, $I_1^v$ is also adjacent to some interval in $v_2$. However, $v'$ and $v_2$ are disjoint. Thus $I'$ must be adjacent to $I_1^{v_1} \in v_1$. Conversely, if $v' \to_{G_2} v_1$, then $I'$ must be adjacent to some interval $I_1 \in v_1$. Because $v'$ and $v$ are disjoint, $I'$ must also be adjacent to some interval in $v \oplus v_1$. Using very similar arguments, we can conclude that $G_1$ and $G_2$ are isomorphic when restricted to edges with at most one vertex in $V_S$.

Finally, we need to consider edges from $G_1$ and $G_2$ that are not incident upon vertices in $V_S$. We show that $v \oplus v_1 \to_{G_1} v_2$ only if $v_1 \to_{G_2} v \oplus v_2$; a similar argument can be used to prove the converse. Consider an arbitrary interval $I_1 \in v \oplus v_1$. If $v \oplus v_1 \to_{G_1} v_2$, then $I_1$ must be adjacent to some interval $I_2 \in v_2$. If $I_1 \in v_1$ we are done. Otherwise, $I_1$ must be the concatenation of two adjacent intervals $I_1^v$ and $I_1^{v_1}$ with $I_1^v \in v$ and $I_1^{v_1} \in v_1$. Since $I_1^v$ is also adjacent to some interval $I_2' \in v_2$ (here $I_2'$ might as well be $I_2$), we conclude that $I_1^{v_1} \in v_1$ is adjacent to the concatenation of $I_1^v$ and $I_2'$, which is indeed an interval in $v \oplus v_2$. Note that our case distinction is exhaustive. We thus conclude that $v_1 \to_{G_2} v \oplus v_2$. A symmetrical argument can be used to show that $v_2 \to_{G_1} v \oplus v_1$ if and only if $v \oplus v_2 \to_{G_2} v_1$, which concludes our proof.

Theorem 1. Let $d_1$ and $d_2$ be derivations for $G$, describing two different computations $c_1$ and $c_2$ of the algorithm of Figure 3 on input $G$. Computation $c_1$ is accepting if and only if $c_2$ is accepting.

Proof. First, we prove the claim that if $e$ is not an edge outgoing from a bifurcation vertex, then in the derive relation $G \Rightarrow_e G'$ all of the edges of $G$ but $e$ and its reverse are inherited by $G'$. Let us write $e$ in the form $v_1 \to_G v_2$. Obviously, any edge of $G$ not incident upon $v_1$ or $v_2$ will be inherited by $G'$. If $v \to_G v_2$ for some m-interval $v \ne v_1$, then every interval $I \in v$ is adjacent to some interval in $v_2$. Since $v$ and $v_1$ are disjoint, $I$ will also be adjacent to some interval in $v_1 \oplus v_2$. Thus we have $v \to_{G'} v_1 \oplus v_2$. A similar argument shows that $v \to_G v_1$ implies $v \to_{G'} v_1 \oplus v_2$. If $v_2 \to_G v$ for some $v \ne v_1$, then every interval $I \in v_2$ is adjacent to some interval in $v$. From $v_1 \to_G v_2$ we also have that each interval $I_{12} \in v_1 \oplus v_2$ is either an interval in $v_2$ or else the concatenation of exactly two intervals $I_1 \in v_1$ and $I_2 \in v_2$. (The interval $I_2$ cannot be adjacent to more than one interval in $v_1$, because $v_2 \to_G v$.) In both cases $I_{12}$ is adjacent to some interval in $v$, and hence $v_1 \oplus v_2 \to_{G'} v$. This concludes the proof of our claim.

Let $d_1, d_2$ be as in the statement of the theorem, with $G \Rightarrow_{d_1} G_1$ and $G \Rightarrow_{d_2} G_2$. If $d_1$ and $d_2$ are competing, then the theorem follows from Lemma 3. Otherwise, assume that $d_1$ and $d_2$ are not competing. From our claim above, some bifurcation vertices must appear in these derivations. Let us reorder the edges in $d_1$ in such a way that edges outgoing from a bifurcation vertex are processed last and in some canonical order. The resulting derivation has the form $d\,d'_1$, where $d'_1$ involves the processing of all bifurcation vertices. We can also reorder the edges in $d_2$ to obtain $d\,d'_2$, where $d'_2$ involves the processing of all bifurcation vertices in exactly the same order as in $d'_1$, but with possibly different choices for the outgoing edges. Let $G \Rightarrow_d G_d \Rightarrow_{d'_1} G'_1$ and $G \Rightarrow_d G_d \Rightarrow_{d'_2} G'_2$. The derivations $d\,d'_1$ and $d_1$ are competing; thus, by Lemma 3, we have $G'_1 = G_1$. Similarly, we can conclude that $G'_2 = G_2$. Since the bifurcation vertices in $d'_1$ and in $d'_2$ are processed in the same canonical order, from repeated applications of Lemma 5 we have that $G'_1$ and $G'_2$ are isomorphic. We then conclude that $G_1$ and $G_2$ are isomorphic as well. The statement of the theorem follows immediately.
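Theorem 1 can also be stress-tested empirically on small inputs. The following brute-force helper (ours, reusing edges and merge from the sketch in Section 4.2) explores every possible choice at line 4 and checks that all computations agree on acceptance, here for the production from Example 5:

def all_runs(V):
    # Set of accept/reject outcomes over all possible edge choices.
    es = edges(V)
    if not es:
        return {len(V) == 1}
    outcomes = set()
    for v1, v2 in es:
        outcomes |= all_runs((V - {v1, v2}) | {merge(v1, v2)})
    return outcomes

V = {frozenset({(1, 1), (4, 4)}), frozenset({(3, 3), (5, 5)}),
     frozenset({(2, 2)})}
assert all_runs(V) == {True}    # every computation accepts on this input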
G has r vertices and, following Lemma 4, O.r/ edges. Let v be an m-interval of G with fan-out fv . The incoming and outgoing edges for v can be detected in time O.fv / by inspecting the 2 fv endpoints of v. Thus we can compute G in time O.jpj/. The number of iterations of the while cycle in the algorithm is bounded by r, since at each iteration one vertex of G is removed. Consider now an iteration in which m-intervals v1 and v2 have been chosen for merging, with v1 !G v2 . (These mintervals might be associated with nonterminals in the right-hand side of p, or else might have been obtained as the result of previous merging operations.) Again, we can compute the incoming and outgoing edges of v1 ° v2 in time proportional to the number of endpoints of such an m-interval. By Lemma 2, this number is bounded by O.f /, f the fan-out of the grammar. We thus conclude that a run of the algorithm on G takes time O.r f /. all binarizable cases. This raises the question about the practical relevance of our technique. In order to get at least a preliminary answer to this question, we extracted LCFRS productions from the data used in the 2006 CoNLL shared task on data-driven dependency parsing (Buchholz and Marsi, 2006), and evaluated how large a portion of these productions could be binarized using our algorithm. The results are given in Table 1. Since it is easy to see that our algorithm always succeeds on context-free productions (productions where each nonterminal has fan-out 1), we evaluated our algorithm on the 102 687 productions with a higher fan-out. Out of these, only 24 (0.02%) could not be binarized using our technique. We take this number as an indicator for the usefulness of our result. It is interesting to compare our approach with techniques for well-nested dependency trees (Kuhlmann and Nivre, 2006). Well-nestedness is a property that implies the binarizability of the extracted grammar; however, the classes of wellnested trees and those whose corresponding productions can be binarized using our algorithm are incomparable--in particular, there are well-nested productions that cannot be binarized in our framework. Nevertheless, the coverage of our technique is actually higher than that of an approach that relies on well-nestedness, at least on the CoNLL 2006 data (see again Table 1). We see our results as promising first steps in a thorough exploration of the connections between non-projective and mildly context-sensitive parsing. The obvious next step is the evaluation of our technique in the context of an actual parser. As a final remark, we would like to point out that an alternative technique for efficient non-projective dependency parsing, developed by Gómez Rodríguez et al. independently of this work, is presented elsewhere in this volume. Acknowledgements We would like to thank Ryan McDonald, Joakim Nivre, and the anonymous reviewers for useful comments on drafts of this paper, and Carlos Gómez Rodríguez and David J. Weir for making a preliminary version of their paper available to us. The work of the first author was funded by the Swedish Research Council. The second author was partially supported by MIUR under project PRIN No. 2007TJNZRE_002. 5 Discussion We have shown how to extract mildly contextsensitive grammars from dependency treebanks, and presented an efficient algorithm that attempts to convert these grammars into an efficiently parseable binary form. Due to previous results (Rambow and Satta, 1999), we know that this is not always possible. 
However, our algorithm may fail even in cases where a binarization exists--our notion of adjacency is not strong enough to capture 485 References Giuseppe Attardi. 2006. Experiments with a multilanguage non-projective dependency parser. In Tenth Conference on Computational Natural Language Learning (CoNLL), pages 166­170, New York, USA. Sabine Buchholz and Erwin Marsi. 2006. CoNLLX shared task on multilingual dependency parsing. In Tenth Conference on Computational Natural Language Learning (CoNLL), pages 149­164, New York, USA. Eugene Charniak. 1996. Tree-bank grammars. In 13th National Conference on Artificial Intelligence, pages 1031­1036, Portland, Oregon, USA. Jason Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In 16th International Conference on Computational Linguistics (COLING), pages 340­345, Copenhagen, Denmark. Carlos Gómez-Rodríguez, David J. Weir, and John Carroll. 2009. Parsing mildly non-projective dependency structures. In Twelfth Conference of the European Chapter of the Association for Computational Linguistics (EACL), Athens, Greece. Jan Haji , Barbora Vidova Hladka, Jarmila Panevová, c Eva Haji ová, Petr Sgall, and Petr Pajas. 2001. c Prague Dependency Treebank 1.0. Linguistic Data Consortium, 2001T10. Keith Hall and Václav Novák. 2005. Corrective modelling for non-projective dependency grammar. In Ninth International Workshop on Parsing Technologies (IWPT), pages 42­52, Vancouver, Canada. Jií Havelka. 2007. Beyond projectivity: Multilinr gual evaluation of constraints and measures on nonprojective structures. In 45th Annual Meeting of the Association for Computational Linguistics (ACL), pages 608­615, Prague, Czech Republic. Marco Kuhlmann and Mathias Möhl. 2007. Mildly context-sensitive dependency languages. In 45th Annual Meeting of the Association for Computational Linguistics (ACL), pages 160­167, Prague, Czech Republic. Marco Kuhlmann and Joakim Nivre. 2006. Mildly non-projective dependency structures. In 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Main Conference Poster Sessions, pages 507­514, Sydney, Australia. Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Eleventh Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 81­88, Trento, Italy. Ryan McDonald and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Tenth International Conference on Parsing Technologies (IWPT), pages 121­132, Prague, Czech Republic. Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Haji . 2005. Non-projective dependency parsc ing using spanning tree algorithms. In Human Language Technology Conference (HLT) and Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523­530, Vancouver, Canada. Joakim Nivre and Jens Nilsson. 2005. Pseudoprojective dependency parsing. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 99­106, Ann Arbor, USA. Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Eighth International Workshop on Parsing Technologies (IWPT), pages 149­160, Nancy, France. Joakim Nivre. 2007. Incremental non-projective dependency parsing. In Human Language Technologies: The Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 396­403, Rochester, NY, USA. 
Owen Rambow and Giorgio Satta. 1999. Independent parallelism in finite copying parallel rewriting systems. Theoretical Computer Science, 223(1–2):87–120.

Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. 1991. On multiple context-free grammars. Theoretical Computer Science, 88(2):191–229.

K. Vijay-Shanker, David J. Weir, and Aravind K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical formalisms. In 25th Annual Meeting of the Association for Computational Linguistics (ACL), pages 104–111, Stanford, CA, USA.