Unlexicalised Hidden Variable Models of Split Dependency Grammars

Gabriele Antonio Musillo
Department of Computer Science and Department of Linguistics
University of Geneva
1211 Geneva 4, Switzerland
musillo4@etu.unige.ch

Paola Merlo
Department of Linguistics
University of Geneva
1211 Geneva 4, Switzerland
merlo@lettres.unige.ch

(Part of this work was done when Gabriele Musillo was visiting the MIT Computer Science and Artificial Intelligence Laboratory, funded by a grant from the Swiss NSF (PBGE2117146). Many thanks to Michael Collins and Xavier Carreras for their insightful comments on the work presented here.)

Abstract

This paper investigates transforms of split dependency grammars into unlexicalised context-free grammars annotated with hidden symbols. Our best unlexicalised grammar achieves an accuracy of 88% on the Penn Treebank data set, which represents a 50% reduction in error over previously published results on unlexicalised dependency parsing.

1 Introduction

Recent research in natural language parsing has extensively investigated probabilistic models of phrase-structure parse trees. As well as being the most commonly used probabilistic models of parse trees, probabilistic context-free grammars (PCFGs) are the best understood. As shown in (Klein and Manning, 2003), the ability of PCFG models to disambiguate phrases crucially depends on the expressiveness of the symbolic backbone they use. Treebank-specific heuristics have commonly been used to alleviate the inadequate independence assumptions stipulated by naive PCFGs (Collins, 1999; Charniak, 2000). Such methods stand in sharp contrast to partially supervised techniques that have recently been proposed to induce hidden grammatical representations that are finer-grained than those that can be read off the parsed sentences in treebanks (Henderson, 2003; Matsuzaki et al., 2005; Prescher, 2005; Petrov et al., 2006).

This paper presents extensions of such grammar induction techniques to dependency grammars. Our extensions rely on transformations of dependency grammars into efficiently parsable context-free grammars (CFGs) annotated with hidden symbols. Because dependency grammars are reduced to CFGs, any learning algorithm developed for PCFGs can be applied to them. Specifically, we use the Inside-Outside algorithm defined in (Pereira and Schabes, 1992) to learn transformed dependency grammars annotated with hidden symbols. What distinguishes our work from most previous work on dependency parsing is that our models are not lexicalised. Our models are instead decorated with hidden symbols that are designed to capture both lexical and structural information relevant to accurate dependency parsing without having to rely on any explicit supervision.

2 Transforms of Dependency Grammars

Contrary to phrase-structure grammars, which stipulate the existence of phrasal nodes, dependency grammars assume that syntactic structures are connected acyclic graphs whose vertices represent terminal tokens and whose directed edges represent dependency relations. Such terminal symbols are most commonly assumed to be words. In our unlexicalised models reported below, they are instead assumed to be part-of-speech (PoS) tags. A typical dependency graph is illustrated in Figure 1 below. Various projective dependency grammars exemplify the concept of split bilexical dependency grammar (SBG) defined in (Eisner, 2000).
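For concreteness, the dependency graph of Figure 1 can be encoded simply as a list of tokens paired with the indices of their heads. The sketch below (in Python) is our own illustration, not part of the original formalism; in particular, the names TOKENS and dependents are ours, and the attachment of the PP with the trumpet to the verb follows Figure 1.

# Dependency graph of Figure 1, "Nica hit Miles with the trumpet", encoded as
# (index, token, PoS tag, head index); head index 0 stands for the artificial root.
TOKENS = [
    (1, "Nica",    "NNP", 2),  # left dependent of "hit"
    (2, "hit",     "VBD", 0),  # head of the sentence, dependent of root
    (3, "Miles",   "NNP", 2),  # first right dependent of "hit"
    (4, "with",    "IN",  2),  # second right dependent of "hit"
    (5, "the",     "DT",  6),  # left dependent of "trumpet"
    (6, "trumpet", "NN",  4),  # right dependent of "with"
]

def dependents(head):
    """Indices of the dependents of a head, in surface order."""
    return [i for (i, _, _, h) in TOKENS if h == head]

print(dependents(2))  # -> [1, 3, 4]: Nica, Miles, with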
Footnote 1: An SBG is a tuple ⟨V, W, L, R⟩ such that:

· V is a set of terminal symbols which includes a distinguished element root;

· L is a function that, for any v ∈ W (= V - {root}), returns a finite automaton that recognises the well-formed sequences in W of left dependents of v;

· R is a function that, for each v ∈ V, returns a finite automaton that recognises the well-formed sequences of right dependents in W for v.

[Figure 1 (not reproduced here): A projective dependency graph for the sentence Nica hit Miles with the trumpet, paired with its second-order unlexicalised derivation tree annotated with hidden variables.]

SBGs are closely related to CFGs, as they both define structures that are rooted ordered projective trees. Such a close relationship is clarified in this section.

It follows from the equivalence of finite automata and regular grammars that any SBG can be transformed into an equivalent CFG. Let D = ⟨V, W, L, R⟩ be an SBG and G = ⟨N, W, P, S⟩ a CFG. To transform D into G, we need to define the set P of productions, the set N of non-terminals, and the start symbol S as follows:

· For each v in W, transform the automaton L_v into a right-linear grammar G_{L_v} whose start symbol is L^1_v; by construction, G_{L_v} consists of rules such as L^p_v → u L^q_v or L^p_v → ε, where terminal symbols such as u belong to W and non-terminals such as L^p_v correspond to the states of the L_v automaton; include all ε-productions in P, and, if a rule such as L^p_v → u L^q_v is in G_{L_v}, include the rule L^p_v → 2^l_u L^q_v in P.

· For each v in V, transform the automaton R_v into a left-linear grammar G_{R_v} whose start symbol is R^1_v; by construction, G_{R_v} consists of rules such as R^p_v → R^q_v u or R^p_v → ε, where terminal symbols such as u belong to W and non-terminals such as R^p_v correspond to the states of the R_v automaton; include all ε-productions in P, and, if a rule such as R^p_v → R^q_v u is in G_{R_v}, include the rule R^p_v → R^q_v 2^r_u in P.

· For each symbol 2^l_u occurring in P, include the productions 2^l_u → L^1_u 1^l_u, 1^l_u → 0_u R^1_u, and 0_u → u in P; for each symbol 2^r_u in P, include the productions 2^r_u → 1^r_u R^1_u, 1^r_u → L^1_u 0_u, and 0_u → u in P.

· Set the start symbol S to R^1_root.

Footnote 2: CFGs resulting from such transformations can further be normalised by removing the ε-productions from P.

Parsing CFGs resulting from such transforms runs in O(n^4) time. The head index v decorating non-terminals such as 1^l_v, 1^r_v, 0_v, L^p_v and R^q_v can be computed in O(1) given the left and right indices of the sub-string w_{i,j} they cover. Observe, however, that if 2^l_v or 2^r_v derives w_{i,j}, then v does not functionally depend on either i or j. Because it is possible for the head index v of 2^l_v or 2^r_v to vary from i to j, v has to be tracked by the parser, resulting in an overall O(n^4) time complexity.

Footnote 3: Indeed, if 1^l_v or 0_v derives w_{i,j}, then v = i; if 1^r_v derives w_{i,j}, then v = j; if w_{i,j} is derived from L^p_v, then v = j + 1; and if w_{i,j} is derived from R^q_v, then v = i - 1.
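The construction above can be made concrete with a short sketch. The following Python fragment is our own illustration, not the authors' implementation: each dependent automaton is assumed to be given as a set of transitions and accepting states with 1 as its start state, non-terminals such as L^p_v are encoded as tuples, and, for simplicity, the 2^l_u and 2^r_u expansions are generated for every u in W rather than only for the 2-symbols that actually occur in P.

from collections import namedtuple

# An automaton is a set of transitions (p, u, q), meaning state p reads the
# dependent tag u and moves to state q, plus its accepting states; 1 is the start state.
Automaton = namedtuple("Automaton", "transitions accepting")

def sbg_to_cfg(left, right, W):
    """Build the first-order CFG of Section 2 from an SBG.

    left[v] is the automaton L_v (for v in W); right[v] is R_v (for v in W plus 'root').
    Non-terminals are tuples: ('L', p, v) for L^p_v, ('R', p, v) for R^p_v,
    ('2l', u) for 2^l_u, and so on; 'eps' stands for the empty right-hand side.
    """
    P = set()
    for v, aut in left.items():                          # right-linear grammar for L_v
        for (p, u, q) in aut.transitions:
            P.add((('L', p, v), (('2l', u), ('L', q, v))))   # L^p_v -> 2^l_u L^q_v
        for p in aut.accepting:
            P.add((('L', p, v), ('eps',)))                   # L^p_v -> eps
    for v, aut in right.items():                         # left-linear grammar for R_v
        for (p, u, q) in aut.transitions:
            P.add((('R', p, v), (('R', q, v), ('2r', u))))   # R^p_v -> R^q_v 2^r_u
        for p in aut.accepting:
            P.add((('R', p, v), ('eps',)))                   # R^p_v -> eps
    for u in W:                                          # expansions of 2^l_u and 2^r_u
        P.add((('2l', u), (('L', 1, u), ('1l', u))))         # 2^l_u -> L^1_u 1^l_u
        P.add((('1l', u), (('0', u), ('R', 1, u))))          # 1^l_u -> 0_u R^1_u
        P.add((('2r', u), (('1r', u), ('R', 1, u))))         # 2^r_u -> 1^r_u R^1_u
        P.add((('1r', u), (('L', 1, u), ('0', u))))          # 1^r_u -> L^1_u 0_u
        P.add((('0', u), (u,)))                              # 0_u -> u
    return P, ('R', 1, 'root')                           # start symbol S = R^1_root

Applied to the automata of a toy grammar, sbg_to_cfg produces exactly the production schemata listed above, with R^1_root as the start symbol.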
In the following, we show how to transform our O(n^4) CFGs into O(n^3) grammars by applying transformations, closely related to those in (McAllester, 1999) and (Johnson, 2007), that eliminate the 2^l_v and 2^r_v symbols.

We only detail the elimination of the 2^r_v symbols; the elimination of the 2^l_v symbols can be derived symmetrically. By construction, a 2^r_v symbol is the right successor of a non-terminal R^p_u. Consequently, 2^r_v can only occur in a derivation such as

R^p_u ⇒ R^q_u 2^r_v ⇒ R^q_u 1^r_v R^1_v.

To substitute for the problematic 2^r_v non-terminal in the above derivation, we derive the form R^q_u 1^r_v R^1_v from R^p_u/R^1_v R^1_v, where R^p_u/R^1_v is a new non-terminal whose right-hand side is R^q_u 1^r_v. We thus transform the above derivation into the derivation

R^p_u ⇒ R^p_u/R^1_v R^1_v ⇒ R^q_u 1^r_v R^1_v.

Footnote 4: Symmetrically, the derivation L^p_u ⇒ 2^l_v L^q_u ⇒ L^1_v 1^l_v L^q_u involving the 2^l_v symbol is transformed into L^p_u ⇒ L^1_v L^p_u\L^1_v ⇒ L^1_v 1^l_v L^q_u.

Because u = i - 1 and v = j if R^p_u/R^1_v derives w_{i,j}, and u = j + 1 and v = i if L^p_u\L^1_v derives w_{i,j}, the parsing algorithm does not have to track any head indices and can consequently parse strings in O(n^3) time.

The grammars described above can be further transformed to capture linear second-order dependencies involving three distinct head indices. A second-order dependency structure is illustrated in Figure 1: it involves two adjacent dependents, Miles and with, of a single head, hit. To see how linear second-order dependencies can be captured, consider the following derivation of a sequence of right dependents of a head u:

R^p_u/R^1_v ⇒ R^q_u 1^r_v ⇒ R^q_u/R^1_w R^1_w 1^r_v.

The form R^q_u/R^1_w R^1_w 1^r_v mentions three heads: u is the head that governs both v and w, and w precedes v. To encode the linear relationship between w and v, we redefine the right-hand side of R^p_u/R^1_v as R^q_u/R^1_w ⟨R^1_w, 1^r_v⟩ and include the production ⟨R^1_w, 1^r_v⟩ → R^1_w 1^r_v in the set of productions. The relationship between the dependents w and v of the head u is captured because R^p_u/R^1_v jointly generates R^1_w and 1^r_v.

Footnote 5: Symmetrically, to transform the derivation of a sequence of left dependents of u, we redefine the right-hand side of L^p_u\L^1_v as ⟨1^l_v, L^1_w⟩ L^q_u\L^1_w and include the production ⟨1^l_v, L^1_w⟩ → 1^l_v L^1_w in the set of rules.

Any second-order grammar resulting from transforming the derivations of right and left dependents in the way described above can be parsed in O(n^3) time, because the head indices decorating its symbols can be computed in O(1).
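As an illustration of the first-order fold step, the following sketch, which is ours and reuses the tuple encoding of the previous sketch, rewrites every production containing a 2^r symbol into its folded O(n^3) form; the symmetric elimination of the 2^l symbols and the second-order refinement are analogous and are not shown.

def eliminate_2r(productions):
    """Fold every rule R^p_u -> R^q_u 2^r_v into its O(n^3) form.

    The rule is replaced by
        R^p_u        -> R^p_u/R^1_v  R^1_v
        R^p_u/R^1_v  -> R^q_u  1^r_v
    The new slashed non-terminal is encoded as ('R/', p, u, v).
    """
    out = set()
    for (lhs, rhs) in productions:
        if (lhs[0] == 'R' and len(rhs) == 2
                and rhs[0][0] == 'R' and rhs[1][0] == '2r'):
            (_, p, u), ((_, q, _), (_, v)) = lhs, rhs
            slashed = ('R/', p, u, v)                     # stands for R^p_u / R^1_v
            out.add((lhs, (slashed, ('R', 1, v))))        # R^p_u -> R^p_u/R^1_v R^1_v
            out.add((slashed, (('R', q, u), ('1r', v))))  # R^p_u/R^1_v -> R^q_u 1^r_v
        else:
            out.add((lhs, rhs))
    return out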
In the following section, we show how to enrich both our first-order and second-order grammars with hidden variables.

3 Hidden Variable Models

Because they do not stipulate the existence of phrasal nodes, commonly used unlabelled dependency models are not sufficiently expressive to discriminate between distinct projections of a given head. Both our first-order and second-order grammars conflate distributionally distinct projections if they are projected from the same head. To capture various distinct projections of a head, we annotate each of the symbols that refers to it with a unique hidden variable. We thus constrain the distribution of the possible values of the hidden variables in a linguistically meaningful way. Figure 1 illustrates such constraints: the same hidden variable B decorates each occurrence of the PoS tag VBD of the head hit.

Footnote 6: As observed in (Collins, 1999), an unambiguous verbal head such as prove bearing the VB tag may project a clause with an overt subject as well as a clause without an overt subject, but only the latter is a possible dependent of subject control verbs such as try.

Enforcing such agreement constraints between hidden variables provides a principled way to capture not only phrasal information but also lexical information. Lexical pieces of information conveyed by a minimal projection such as 0_{VBD_B} in Figure 1 will consistently be propagated through the derivation tree and will condition the generation of the right and left dependents of hit. In addition, states such as p and q that decorate non-terminal symbols such as R^p_u or L^q_u can also capture structural information, because they can encode the most recent steps in the derivation history. In the models reported in the next section, these states are assumed to be hidden and a distribution over their possible values is automatically induced.

4 Empirical Work and Discussion

The models reported below were trained, validated, and tested on the commonly used sections from the Penn Treebank. Projective dependency trees, obtained using the rules stated in (Yamada and Matsumoto, 2003), were transformed into first-order and second-order structures. CFGs extracted from such structures were then annotated with hidden variables encoding the constraints described in the previous section and trained until convergence by means of the Inside-Outside algorithm defined in (Pereira and Schabes, 1992) and applied in (Matsuzaki et al., 2005). To efficiently decode our hidden variable models, we pruned the search space as in (Petrov et al., 2006). To evaluate the performance of our models, we report two of the standard measures: the per word and per sentence accuracy (McDonald, 2006).
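For clarity, these two measures can be computed as in the sketch below; the representation of parses as lists of predicted head indices is our own assumption for the illustration, not a detail taken from the paper.

def accuracies(gold, predicted):
    """Per word and per sentence accuracy, both in percent.

    gold and predicted are parallel lists of sentences; each sentence is the
    list of head indices of its words (0 denoting the artificial root).
    """
    words = hits = sentences = exact = 0
    for g, p in zip(gold, predicted):
        correct = sum(1 for a, b in zip(g, p) if a == b)
        words += len(g)
        hits += correct
        sentences += 1
        exact += correct == len(g)   # every head in the sentence is correct
    return 100.0 * hits / words, 100.0 * exact / sentences

# Example: one of two sentences parsed entirely correctly.
print(accuracies([[2, 0, 2], [2, 0]], [[2, 0, 2], [1, 0]]))  # -> (80.0, 50.0)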
Development Data - section 24        per word    per sentence
FOM: q = 1, h = 1                     75.7         9.9
SOM: q = 1, h = 1                     80.5        16.2
FOM: q = 2, h = 2                     81.9        17.4
FOM: q = 2, h = 4                     84.7        22.0
SOM: q = 2, h = 2                     84.3        21.5
SOM: q = 1, h = 4                     87.0        25.8

Test Data - section 23               per word    per sentence
(Eisner and Smith, 2005)              75.6          NA
SOM: q = 1, h = 4                     88.0        30.6
(McDonald, 2006)                      91.5        36.7

Table 1: Accuracy results on the development and test data sets, where q denotes the number of hidden states and h the number of hidden values annotating a PoS tag in our first-order (FOM) and second-order (SOM) models.

Figures reported in the upper section of Table 1 measure the effect on accuracy of the transforms we designed. Our baseline first-order model (q = 1, h = 1) reaches a poor per word accuracy, which suggests that the information conveyed by bare PoS tags is not fine-grained enough to accurately predict dependencies. Results reported in the second line show that modelling adjacency relations between dependents, as second-order models do, is relevant to accuracy. The third line indicates that annotating both the states and the PoS tags of a first-order model with two hidden values is sufficient to reach a performance comparable to the one achieved by a naive second-order model. However, comparing the results obtained by our best first-order models to the accuracy achieved by our best second-order model conclusively shows that first-order models exploit such dependencies to a much lesser extent. Overall, such results provide a first solution to the problem left open in (Johnson, 2007) as to whether second-order transforms are relevant to parsing accuracy or not.

The lower section of Table 1 reports the results achieved by our best model on the test data set and compares them both to those obtained by the only unlexicalised dependency model we know of (Eisner and Smith, 2005) and to those achieved by the state-of-the-art dependency parser in (McDonald, 2006). While clearly not state-of-the-art, the performance achieved by our best model suggests that massive lexicalisation of dependency models might not be necessary to achieve competitive performance. Future work will lie in investigating the issue of lexicalisation in the context of dependency parsing by weakly lexicalising our hidden variable models.

References

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In NAACL'00.

Michael John Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Jason Eisner and Noah A. Smith. 2005. Parsing with soft and hard constraints on dependency length. In IWPT'05.

Jason Eisner. 2000. Bilexical grammars and their cubic-time parsing algorithms. In H. Bunt and A. Nijholt, eds., Advances in Probabilistic and Other Parsing Technologies, pages 29-62. Kluwer Academic Publishers.

James Henderson. 2003. Inducing history representations for broad-coverage statistical parsing. In NAACL-HLT'03.

Mark Johnson. 2007. Transforming projective bilexical dependency grammars into efficiently-parsable CFGs with unfold-fold. In ACL'07.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL'03.

Takuya Matsuzaki, Yusuke Miyao, and Junichi Tsujii. 2005. Probabilistic CFG with latent annotations. In ACL'05.

David McAllester. 1999. A reformulation of Eisner and Satta's cubic time parser for split head automata grammars. http://ttic.uchicago.edu/dmcallester.

Ryan McDonald. 2006. Discriminative Training and Spanning Tree Algorithms for Dependency Parsing. Ph.D. thesis, University of Pennsylvania.

Fernando Pereira and Yves Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In ACL'92.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In ACL'06.

Detlef Prescher. 2005. Head-driven PCFGs with latent-head statistics. In IWPT'05.

H. Yamada and Y. Matsumoto. 2003. Statistical dependency analysis with support vector machines. In IWPT'03.