Graph Transformations in Data-Driven Dependency Parsing

Jens Nilsson, Växjö University (jni@msi.vxu.se)
Joakim Nivre, Växjö University and Uppsala University (nivre@msi.vxu.se)
Johan Hall, Växjö University (jha@msi.vxu.se)

Abstract

Transforming syntactic representations in order to improve parsing accuracy has been exploited successfully in statistical parsing systems using constituency-based representations. In this paper, we show that similar transformations can give substantial improvements also in data-driven dependency parsing. Experiments on the Prague Dependency Treebank show that systematic transformations of coordinate structures and verb groups result in a 10% error reduction for a deterministic data-driven dependency parser. Combining these transformations with previously proposed techniques for recovering non-projective dependencies leads to state-of-the-art accuracy for the given data set.

1 Introduction

It has become increasingly clear that the choice of suitable internal representations can be a very important factor in data-driven approaches to syntactic parsing, and that accuracy can often be improved by internal transformations of a given kind of representation. This is well illustrated by the Collins parser (Collins, 1997; Collins, 1999), scrutinized by Bikel (2004), where several transformations are applied in order to improve the analysis of noun phrases, coordination and punctuation. Other examples can be found in the work of Johnson (1998) and Klein and Manning (2003), which show that well-chosen transformations of syntactic representations can greatly improve the parsing accuracy obtained with probabilistic context-free grammars.

In this paper, we apply essentially the same techniques to data-driven dependency parsing, specifically targeting the analysis of coordination and verb groups, two very common constructions that pose special problems for dependency-based approaches. The basic idea is that we can facilitate learning by transforming the training data for the parser and that we can subsequently recover the original representations by applying an inverse transformation to the parser's output.

The data used in the experiments come from the Prague Dependency Treebank (PDT) (Hajič, 1998; Hajič et al., 2001), the largest available dependency treebank, annotated according to the theory of Functional Generative Description (FGD) (Sgall et al., 1986). The parser used is MaltParser (Nivre and Hall, 2005; Nivre et al., 2006), a freely available system that combines a deterministic parsing strategy with discriminative classifiers for predicting the next parser action.

The paper is structured as follows. Section 2 provides the necessary background, including a definition of dependency graphs, a discussion of different approaches to the analysis of coordination and verb groups in dependency grammar, as well as brief descriptions of PDT, MaltParser and some related work. Section 3 introduces a set of dependency graph transformations, specifically defined to deal with the dependency annotation found in PDT, which are experimentally evaluated in section 4. While the experiments reported in section 4.1 deal with pure treebank transformations, in order to establish an upper bound on what can be achieved in parsing, the experiments presented in section 4.2 examine the effects of different transformations on parsing accuracy.
Finally, in section 4.3, we combine these transformations with previously proposed techniques in order to optimize overall parsing accuracy. We conclude in section 5.

2 Background

2.1 Dependency Graphs

The basic idea in dependency parsing is that the syntactic analysis consists in establishing typed, binary relations, called dependencies, between the words of a sentence. This kind of analysis can be represented by a labeled directed graph, defined as follows:

· Let R = {r1, ..., rm} be a set of dependency types (arc labels).
· A dependency graph for a string of words W = w1 ... wn is a labeled directed graph G = (W, A), where:
  - W is the set of nodes, i.e. word tokens in the input string, ordered by a linear precedence relation <.
  - A is a set of labeled arcs (wi, r, wj), with wi, wj ∈ W and r ∈ R.
· A dependency graph G = (W, A) is well-formed iff it is acyclic and no node has an in-degree greater than 1.

We will use the notation wi →r wj to symbolize that (wi, r, wj) ∈ A, where wi is referred to as the head and wj as the dependent. We say that an arc (wi, r, wk) is projective iff, for every word wj occurring between wi and wk (i.e., wi < wj < wk or wi > wj > wk), there is a path from wi to wj. A graph is projective iff all its arcs are projective. Figure 1 shows a well-formed (projective) dependency graph for a sentence from the Prague Dependency Treebank.
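To make these definitions concrete, here is a minimal Python sketch (our own illustration; the representation and names are not part of MaltParser), with a graph stored as a set of (head, label, dependent) triples over word positions:

    # Sketch of the definitions above: well-formedness and projectivity
    # checks for a dependency graph over words w1..wn.
    from typing import Set, Tuple

    Arc = Tuple[int, str, int]  # (head position, dependency type, dependent position)

    def is_well_formed(arcs: Set[Arc]) -> bool:
        # Well-formed iff acyclic and no node has in-degree greater than 1.
        heads = {}
        for h, _, d in arcs:
            if d in heads:              # in-degree > 1
                return False
            heads[d] = h
        for node in list(heads):        # follow head chains to detect cycles
            seen = set()
            while node in heads:
                if node in seen:
                    return False
                seen.add(node)
                node = heads[node]
        return True

    def is_projective(arcs: Set[Arc]) -> bool:
        # An arc (wi, r, wk) is projective iff every word between wi and wk
        # is reachable from wi; a graph is projective iff all its arcs are.
        # (Assumes a well-formed graph.)
        children = {}
        for h, _, d in arcs:
            children.setdefault(h, set()).add(d)

        def reachable(i, j):
            stack = [i]
            while stack:
                node = stack.pop()
                if node == j:
                    return True
                stack.extend(children.get(node, ()))
            return False

        return all(reachable(h, j)
                   for h, _, d in arcs
                   for j in range(min(h, d) + 1, max(h, d)))

For the graph in figure 1, both predicates hold.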
2.2 Coordination and Verb Groups

Dependency grammar assumes that syntactic structure consists of lexical nodes linked by binary dependencies. Dependency theories are thus best suited for binary syntactic constructions, where one element can clearly be distinguished as the syntactic head. The analysis of coordination is problematic in this respect, since it normally involves at least one conjunction and two conjuncts. The verb group, potentially consisting of a whole chain of verb forms, is another type of construction where the syntactic relation between elements is not clear-cut in dependency terms.

Several solutions have been proposed to the problem of coordination. One alternative is to avoid creating dependency relations between the conjuncts, and instead let the conjuncts have a direct dependency relation to the same head (Tesnière, 1959; Hudson, 1990). Another approach is to make the conjunction the head and let the conjuncts depend on the conjunction. This analysis, which appears well motivated on semantic grounds, is adopted in the FGD framework and will therefore be called Prague style (PS). It is exemplified in figure 1, where the conjunction a (and) is the head of the conjuncts bojovností and tvrdostí. A different solution is to adopt a more hierarchical analysis, where the conjunction depends on the first conjunct, while the second conjunct depends on the conjunction. In cases of multiple coordination, this can be generalized to a chain, where each element except the first depends on the preceding one. This more syntactically oriented approach has been advocated notably by Mel'čuk (1988) and will be called Mel'čuk style (MS). It is illustrated in figure 2, which shows a transformed version of the dependency graph in figure 1, where the elements of the coordination form a chain with the first conjunct (bojovností) as the topmost head. Lombardo and Lesmo (1998) conjecture that MS is more suitable than PS for incremental dependency parsing.

The difference between the more semantically oriented PS and the more syntactically oriented MS is seen also in the analysis of verb groups, where the former treats the main verb as the head, since it is the bearer of valency, while the latter treats the auxiliary verb as the head, since it is the finite element of the clause. Without questioning the theoretical validity of either approach, we can again ask which analysis is best suited to achieve high accuracy in parsing.

2.3 PDT

PDT (Hajič, 1998; Hajič et al., 2001) consists of 1.5M words of newspaper text, annotated in three layers: morphological, analytical and tectogrammatical. In this paper, we are only concerned with the analytical layer, which contains a surface-syntactic dependency analysis, involving a set of 28 dependency types, and not restricted to projective dependency graphs. (About 2% of all dependencies are non-projective and about 25% of all sentences have a non-projective dependency graph (Nivre and Nilsson, 2005).) The annotation follows FGD, which means that it involves a PS analysis of both coordination and verb groups. Whether better parsing accuracy can be obtained by transforming this to MS is one of the hypotheses explored in the experimental study below.

[Figure 1: Dependency graph for a Czech sentence from the Prague Dependency Treebank, "Velkou bojovností a nečekanou tvrdostí se vyznačovalo finále turnaje" ("The final of the tournament was distinguished by great fighting spirit and unexpected hardness"); the conjunction a heads the conjuncts bojovností and tvrdostí (Prague style).]

[Figure 2: Transformed dependency graph for the same sentence, with the coordination forming a chain headed by the first conjunct bojovností (Mel'čuk style).]

2.4 MaltParser

MaltParser (Nivre and Hall, 2005; Nivre et al., 2006) is a data-driven parser-generator, which can induce a dependency parser from a treebank, and which supports several parsing algorithms and learning algorithms. In the experiments below we use the algorithm of Nivre (2003), which constructs a labeled dependency graph in one left-to-right pass over the input. Classifiers that predict the next parser action are constructed through memory-based learning (MBL), using the TiMBL software package (Daelemans and Van den Bosch, 2005), and support vector machines (SVM), using LIBSVM (Chang and Lin, 2005).

2.5 Related Work

Other ways of improving parsing accuracy with respect to coordination include learning patterns of morphological and semantic information for the conjuncts (Park and Cho, 2000). More specifically for PDT, Collins et al. (1999) relabel coordinated phrases after converting dependency structures to phrase structures, and Zeman (2004) uses a kind of pattern matching, based on frequencies of the parts-of-speech of conjuncts and conjunctions. Zeman also mentions experiments to transform the dependency structure for coordination but does not present any results.

Graph transformations in dependency parsing have also been used in order to recover non-projective dependencies together with parsers that are restricted to projective dependency graphs. Thus, Nivre and Nilsson (2005) improve parsing accuracy for MaltParser by projectivizing training data and applying an inverse transformation to the output of the parser, while Hall and Novák (2005) apply post-processing to the output of Charniak's parser (Charniak, 2000). In the final experiments below, we combine these techniques with the transformations investigated in this paper.
3 Dependency Graph Transformations

In this section, we describe algorithms for transforming dependency graphs in PDT from PS to MS and back, starting with coordination and continuing with verb groups.

3.1 Coordination

The PS-to-MS transformation for coordination will be designated τc(δ), where δ is a data set. The transformation begins with the identification of a base conjunction, based on its dependency type (Coord) and/or its part-of-speech (J^). For example, the word a (and) in figure 1 is identified as a base conjunction.

Before the actual transformation, the base conjunction and all its dependents need to be classified into three different categories. First, the base conjunction is categorized as a separator (S). If the coordination consists of more than two conjuncts, it normally has one or more commas separating conjuncts, in addition to the base conjunction. These are identified by looking at their dependency type (mostly AuxX) and are also categorized as S. The coordination in figure 1 contains no commas, so only the word a will belong to S.

The remaining dependents of the base conjunction need to be divided into conjuncts (C) and other dependents (D). To make this distinction, the algorithm again looks at the dependency type. In principle, the dependency type of a conjunct has the suffix Co, although special care has to be taken for coordinated prepositional cases and embedded clauses (Böhmová et al., 2003). The words bojovností and tvrdostí in figure 1, both having the dependency type Obj_Co, belong to the category C. Since there are no other dependents of a, the coordination contains no instances of the category D.

Given this classification of the words involved in a coordination, the transformation τc(δ) is straightforward and basically connects all the arcs in a chain. Let C1, ..., Cn be the elements of C, ordered by linear precedence, and let Si1, ..., Sim be the separators occurring between Ci and Ci+1. Then every Ci becomes the head of Si1, ..., Sim; Sim becomes the head of Ci+1; and C1 becomes the only dependent of the original head of the base conjunction. The dependency types of the conjuncts are truncated by removing the suffix Co. (Preliminary results indicated that this increases parsing accuracy.) Also, each word wd ∈ D becomes a dependent of the conjunct closest to its left, and if such a word does not exist, wd will depend on the leftmost conjunct. After the transformation τc(δ), every coordination forms a left-headed chain, as illustrated in figure 2.

This new representation creates a problem, however. It is no longer possible to distinguish the dependents in D from other dependents of the conjuncts. For example, the word Velkou in figure 2 is not distinguishable from a possible dependent in D, which is an obvious drawback when transforming back to PS. One way of distinguishing D elements is to extend the set of dependency types: the dependency type r of each wd ∈ D can be replaced by a completely new dependency type r+ (e.g., Atr+), theoretically increasing the number of dependency types to 2·|R|.
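For concreteness, the chaining step of τc can be sketched as follows. This is our own simplified Python rendering of the description above, not the actual implementation: the classification into separators (S), conjuncts (C) and other dependents (D) is assumed to be given, and the exact string form of the Co suffix is an assumption.

    # Sketch of the PS-to-MS coordination transformation tau_c.
    # heads[i] / labels[i]: head position and dependency type of word i.
    # conjuncts: positions of C elements in linear order;
    # separators_between[i]: S elements between conjunct i and i+1;
    # others: positions of D elements.

    def coordination_ps_to_ms(heads, labels, base_conj, conjuncts,
                              separators_between, others, extend_labels=False):
        # C1 becomes the only dependent of the original head of the
        # base conjunction.
        first = conjuncts[0]
        heads[first] = heads[base_conj]

        # Chain the coordination: each conjunct heads the separators that
        # follow it, and the last of these heads the next conjunct.
        for i in range(len(conjuncts) - 1):
            seps = separators_between[i]
            for s in seps:
                heads[s] = conjuncts[i]
            heads[conjuncts[i + 1]] = seps[-1] if seps else conjuncts[i]

        # Truncate conjunct labels by removing the Co suffix
        # (written '_Co' here; the exact form is an assumption).
        for c in conjuncts:
            if labels[c].endswith('_Co'):
                labels[c] = labels[c][:-len('_Co')]

        # D elements attach to the closest conjunct to their left,
        # or to the leftmost conjunct if there is none; the tau_c+
        # variants additionally mark them by extending r to r+.
        for d in others:
            left_of_d = [c for c in conjuncts if c < d]
            heads[d] = left_of_d[-1] if left_of_d else first
            if extend_labels:
                labels[d] = labels[d] + '+'

The fallback when no separator occurs between two conjuncts (attaching the next conjunct directly to the previous one) is likewise our assumption; the text only describes the case with separators.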
The inverse transformation, τc⁻¹(δ), again starts by identifying base conjunctions, using the same conditions as before. For each identified base conjunction, it calls a procedure that performs the inverse transformation by traversing the chain of conjuncts and separators "upwards" (right-to-left), collecting conjuncts (C), separators (S) and potential conjunction dependents (Dpot). When this is done, the former head of the leftmost conjunct (C1) becomes the head of the rightmost (base) conjunction. In figure 2, the leftmost conjunct is bojovností, with the head vyznačovalo, and the rightmost (and only) conjunction is a, which will then have vyznačovalo as its new head. All conjuncts in the chain become dependents of the rightmost conjunction, which means that the structure is converted back to the one depicted in figure 1.

As mentioned above, the original structure in figure 1 did not have any coordination dependents, but Velkou ∈ Dpot. The last step of the inverse transformation is therefore to sort out conjunction dependents from conjunct dependents, where the former will attach to the base conjunction. Four versions have been implemented, two of which take into account the fact that the dependency types AuxG, AuxX, AuxY, and Pred are the only dependency types that are more frequent as conjunction dependents (D) than as conjunct dependents in the training data set:

· τc: Do not extend arc labels in τc. Leave all words in Dpot in place in τc⁻¹.
· τc*: Do not extend arc labels in τc. Attach all words with label AuxG, AuxX, AuxY or Pred to the base conjunction in τc⁻¹.
· τc+: Extend arc labels from r to r+ for D elements in τc. Attach all words with label r+ to the base conjunction (and change the label to r) in τc⁻¹.
· τc+*: Extend arc labels from r to r+ for D elements in τc, except for the labels AuxG, AuxX, AuxY and Pred. Attach all words with label r+, AuxG, AuxX, AuxY, or Pred to the base conjunction (and change the label to r if necessary) in τc⁻¹.

3.2 Verb Groups

To transform verb groups from PS to MS, the transformation algorithm, τv(δ), starts by identifying all auxiliary verbs in a sentence. These belong to the set A and are processed from left to right. A word waux belongs to A iff wmain →AuxV waux, where wmain is the main verb. The transformation into MS reverses the relation between the verbs, i.e., waux →AuxV wmain, and the former head of wmain becomes the new head of waux. The main verb can be located on either side of the auxiliary verb and can have other dependents (whereas auxiliary verbs never have dependents), which means that dependency relations to other dependents of wmain may become non-projective through the transformation. To avoid this, all dependents to the left of the rightmost verb will depend on the leftmost verb, whereas the others will depend on the rightmost verb.

Performing the inverse transformation for verb groups, τv⁻¹(δ), is quite simple and essentially the same procedure inverted. Each sentence is traversed from right to left looking for arcs of the type waux →AuxV wmain. For every such arc, the head of waux becomes the new head of wmain, and wmain the new head of waux. Furthermore, since waux does not have dependents in PS, all dependents of waux in MS will become dependents of wmain in PS.
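The verb group transformations can be sketched in the same style. Again this is our own simplification, covering the common case of a single auxiliary per verb group; how the two verbs' dependency types are redistributed is not spelled out above and is our assumption.

    # Sketch of tau_v (PS to MS) and its inverse. heads/labels are dicts
    # over word positions 1..n, as in the coordination sketch.

    def verb_group_ps_to_ms(heads, labels, n):
        auxiliaries = [w for w in range(1, n + 1) if labels.get(w) == 'AuxV']
        for aux in auxiliaries:                     # processed left to right
            main = heads[aux]                       # PS: w_main ->AuxV w_aux
            dependents = [d for d in range(1, n + 1)
                          if heads.get(d) == main and d != aux]
            heads[aux] = heads[main]                # aux takes over main's head ...
            heads[main] = aux                       # ... and becomes main's head
            # Assumption: the verbs swap dependency types, so the arc into
            # the verb group keeps its label and the reversed arc is AuxV.
            labels[aux], labels[main] = labels[main], 'AuxV'
            # Keep the graph projective: dependents left of the rightmost
            # verb go to the leftmost verb, the rest to the rightmost.
            left, right = min(aux, main), max(aux, main)
            for d in dependents:
                heads[d] = left if d < right else right

    def verb_group_ms_to_ps(heads, labels, n):
        # Traverse right to left looking for arcs w_aux ->AuxV w_main:
        # in MS it is the main verb that carries the AuxV label.
        mains = [w for w in range(n, 0, -1) if labels.get(w) == 'AuxV']
        for main in mains:
            aux = heads[main]
            heads[main] = heads[aux]
            heads[aux] = main
            labels[main], labels[aux] = labels[aux], 'AuxV'
            # Auxiliaries have no dependents in PS, so everything that
            # depended on the auxiliary in MS moves to the main verb.
            for d in range(1, n + 1):
                if heads.get(d) == aux and d != main:
                    heads[d] = main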
4 Experiments

All experiments are based on PDT 1.0, which is divided into three data sets: a training set (δt), a development test set (δd), and an evaluation test set (δe). Table 1 shows the size of each data set, as well as the relative frequency of the specific constructions that are in focus here. Only 1.3% of all words in the training data are identified as auxiliary verbs (A), whereas coordination (S and C) is more common in PDT. This implies that the coordination transformations are likely to have a greater impact on overall accuracy than the verb group transformations.

Table 1: PDT data sets; S = sentence, W = word; S = separator, C = conjunct, A = auxiliary verb

    Data   #S      #W      %S    %C    %A
    δt     73088   1256k   3.9   7.7   1.3
    δd     7319    126k    4.0   7.8   1.4
    δe     7507    126k    3.8   7.3   1.4

In the parsing experiments reported in sections 4.1-4.2, we use δt for training, δd for tuning, and δe for the final evaluation. The part-of-speech tagging used (both in training and testing) is the HMM tagging distributed with the treebank, with a tagging accuracy of 94.1% and with the tagset compressed to 61 tags as in Collins et al. (1999).

MaltParser is used with the parsing algorithm of Nivre (2003) together with the feature model used for parsing Czech by Nivre and Nilsson (2005). In section 4.2 we use MBL, again with the same settings as Nivre and Nilsson (2005) (TiMBL parameters: -k 5 -m M -L 3 -w 0 -d ID), and in section 4.3 we use SVM with a polynomial kernel of degree 2 (LIBSVM parameters: -s 0 -t 1 -d 2 -g 0.12 -r 0 -c 1 -e 0.1).

The metrics for evaluation are the attachment score (AS) (labeled and unlabeled), i.e., the proportion of words that are assigned the correct head, and the exact match (EM) score (labeled and unlabeled), i.e., the proportion of sentences that are assigned a completely correct analysis. All tokens, including punctuation, are included in the evaluation scores. Statistical significance is assessed using McNemar's test.

4.1 Experiment 1: Transformations

The algorithms are fairly simple. In addition, there will always be a small proportion of syntactic constructions that do not follow the expected pattern. Hence, the transformation and inverse transformation will inevitably result in some distortion. In order to estimate the expected reduction in parsing accuracy due to this distortion, we first consider a pure treebank transformation experiment, where we compare τ⁻¹(τ(δt)) to δt, for all the different transformations defined in the previous section. The results are shown in table 2.

Table 2: Transformations; T = transformation; AS = unlabeled attachment score of τ⁻¹(τ(δt)) compared to δt

    T       AS
    τc      97.8
    τc*     98.6
    τc+     99.6
    τc+*    99.4
    τv      100.0

We see that, even though coordination is more frequent, verb groups are easier to handle. (The τv result is rounded to 100.0%, but the transformed treebank contains 19 erroneous heads.) The coordination version with the least loss of information (τc+) fails to recover the correct head for 0.4% of all words in δt. The difference between τc+ and τc is expected. However, in the next section this will be contrasted with the increased burden on the parser for τc+, since it is also responsible for selecting the correct dependency type for each arc among as many as 2·|R| types instead of |R|.

4.2 Experiment 2: Parsing

Parsing experiments are carried out in four steps (for a given transformation τ):

1. Transform the training data set into τ(δt).
2. Train a parser p on τ(δt).
3. Parse a test set δ using p, with output p(δ).
4. Transform the parser output into τ⁻¹(p(δ)).

Table 3 presents the results for a selection of transformations using MaltParser with MBL, tested on the evaluation test set δe with the untransformed data as baseline.

Table 3: Parsing accuracy (MBL, δe); T = transformation; AS = attachment score, EM = exact match; U = unlabeled, L = labeled

    T           AS-U    AS-L    EM-U    EM-L
    None        79.08   72.83   28.99   21.15
    τc          80.55   74.06   30.08   21.27
    τc*         80.90   74.41   30.56   21.42
    τc+         80.58   74.07   30.42   21.17
    τc+*        80.87   74.36   30.89   21.38
    τv          79.28   72.97   29.53   21.38
    τv ∘ τc+*   81.01   74.51   31.02   21.57

Rows 2-5 show that transforming coordinate structures to MS improves parsing accuracy compared to the baseline, regardless of which transformation and inverse transformation are used. Moreover, the parser benefits from the verb group transformation, as seen in row 6. The final row shows the best combination of a coordination transformation with the verb group transformation, which amounts to an improvement of roughly two percentage points, or a ten percent overall error reduction, for unlabeled accuracy. All improvements over the baseline are statistically significant (McNemar's test) with respect to attachment score (labeled and unlabeled) and unlabeled exact match, with p < 0.01 except for the unlabeled exact match score of the verb group transformation, where 0.01 < p < 0.05. For labeled exact match, no differences are significant.

The experimental results indicate that MS is more suitable than PS as the target representation for deterministic data-driven dependency parsing. A relevant question is of course why this is the case. A partial explanation may be found in the "short-dependency preference" exhibited by most parsers (Eisner and Smith, 2005), with MaltParser being no exception. The first row of table 4 shows the accuracy of the parser for different arc lengths under the baseline condition (i.e., with no transformations). We see that it performs very well on short arcs, but that accuracy drops quite rapidly as the arcs get longer. This can be related to the mean arc length in δt, which is 2.59 in the untransformed version, 2.40 in τc(δt) and 2.54 in τv(δt). The last three rows of table 4 show the distribution of arcs over different arc lengths in different versions of the data set. Both τc and τv make arcs shorter on average, which may facilitate the task for the parser.

Table 4: Baseline labeled AS per arc length on δe (first row); proportion of arcs per arc length in different versions of δt (last three rows)

    Length      1      2-3    4-6    7-10   11-
    AS (δe)     90.1   83.6   70.5   59.5   45.9
    δt          51.9   29.4   11.2   4.4    3.0
    τc(δt)      54.1   29.1   10.7   3.8    2.4
    τv(δt)      52.9   29.2   10.7   4.2    2.9

Another possible explanation is that learning is facilitated if similar constructions are represented similarly. For instance, it is probable that learning is made more difficult when a unit has different heads depending on whether it is part of a coordination or not.
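The arc-length statistics in table 4 are straightforward to reproduce from a treebank. A small sketch, using the same (head, label, dependent) representation as before (the binning and the helper itself are ours):

    # Sketch: mean arc length and arc-length distribution, as in table 4.
    from collections import Counter

    BINS = [(1, 1), (2, 3), (4, 6), (7, 10), (11, 10**9)]  # 1, 2-3, 4-6, 7-10, 11-

    def arc_length_profile(sentences):
        # sentences: iterable of arc collections, each arc (head, label, dep).
        lengths = [abs(h - d) for arcs in sentences for h, _, d in arcs]
        mean = sum(lengths) / len(lengths)
        counts = Counter()
        for ln in lengths:
            for lo, hi in BINS:
                if lo <= ln <= hi:
                    counts[(lo, hi)] += 1
                    break
        dist = {b: 100.0 * counts[b] / len(lengths) for b in BINS}
        return mean, dist

Run over δt, τc(δt) and τv(δt), such a profile yields the mean arc lengths (2.59, 2.40, 2.54) and the distribution rows of table 4.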
4.3 Experiment 3: Optimization

In this section we combine the best results from the previous section with the graph transformations proposed by Nivre and Nilsson (2005) to recover non-projective dependencies. We write τp for the projectivization of training data and τp⁻¹ for the inverse transformation applied to the parser's output. (More precisely, we use the variant called PATH in Nivre and Nilsson (2005).) In addition, we replace MBL with SVM, a learning algorithm that tends to give higher accuracy in classifier-based parsing, although it is more costly to train (Sagae and Lavie, 2005).

Table 5 shows the results, for both MBL and SVM, of the baseline, of pure pseudo-projective parsing, and of the combination of pseudo-projective parsing with the PS-to-MS transformations. We see that pseudo-projective parsing brings a very consistent increase in accuracy of at least 1.5 percentage points, which is more than that reported by Nivre and Nilsson (2005), and that the addition of the PS-to-MS transformations increases accuracy with about the same margin. We also see that SVM outperforms MBL by about two percentage points across the board, and that the positive effect of the graph transformations is most pronounced for the unlabeled exact match score, where the improvement is more than five percentage points overall for both MBL and SVM.

Table 5: Optimized parsing results (δe); T = transformation; LA = learning algorithm; AS = attachment score, EM = exact match; U = unlabeled, L = labeled

    T                LA    AS-U    AS-L    EM-U    EM-L
    None             MBL   79.08   72.83   28.99   21.15
    τp               MBL   80.79   74.39   31.54   22.53
    τp ∘ τv ∘ τc+*   MBL   82.93   76.31   34.17   23.01
    None             SVM   81.09   75.68   32.24   25.02
    τp               SVM   82.93   77.28   35.99   27.05
    τp ∘ τv ∘ τc+*   SVM   84.55   78.82   37.63   27.69

Table 6 gives a more detailed analysis of the parsing results for SVM, comparing the optimal parser to the baseline, and considering specifically the (unlabeled) precision and recall of the categories involved in coordination (separators S and conjuncts C) and verb groups (auxiliary verbs A and main verbs M). All figures indicate, without exception, that the transformations result in higher precision and recall for all directly involved words. (All differences are significant beyond the 0.01 level.) It is worth noting that the error reduction is actually higher for A and M than for S and C, although the former are less frequent.

Table 6: Detailed results for SVM; T = transformation; P = unlabeled precision, R = unlabeled recall

    T                P:S     R:S     P:C     R:C     P:A     R:A     P:M     R:M
    None             52.63   72.35   55.15   67.03   82.17   82.21   69.95   69.07
    τp ∘ τv ∘ τc+*   63.73   82.10   63.20   75.14   90.89   92.79   80.02   81.40

With respect to unlabeled attachment score, the results of the optimized parser are slightly below the best published results for a single parser. Hall and Novák (2005) report a score of 85.1%, applying a corrective model to the output of Charniak's parser; McDonald and Pereira (2006) achieve a score of 85.2% using a second-order spanning tree algorithm. Using ensemble methods and a pool of different parsers, Zeman and Žabokrtský (2005) attain a top score of 87.0%. For unlabeled exact match, our results are better than any previously reported results, including those of McDonald and Pereira (2006). (For the labeled scores, we are not aware of any comparable results in the literature.)
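The protocol used throughout section 4 is the four-step regimen of section 4.2 with composed transformations: projectivize and convert to MS before training, then undo the transformations in reverse order after parsing. A sketch (train and parse are placeholders for the underlying parser, not MaltParser's actual API; the transformation functions are those sketched earlier plus an assumed projectivize/deprojectivize pair standing in for τp):

    # Sketch of the transform-train-parse-detransform pipeline, composing
    # tau_p with tau_v and tau_c. All parser functions are placeholders.

    def run_experiment(train_data, test_data, transforms, train, parse):
        # transforms: list of (forward, inverse) pairs, applied in order
        # before training and inverted in reverse order after parsing.
        for forward, _ in transforms:
            train_data = [forward(sent) for sent in train_data]
        model = train(train_data)                             # step 2
        output = [parse(model, sent) for sent in test_data]   # step 3
        for _, inverse in reversed(transforms):               # step 4
            output = [inverse(sent) for sent in output]
        return output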
5 Conclusion

The results presented in this paper confirm that choosing the right representation is important in parsing. By systematically transforming the representation of coordinate structures and verb groups in PDT, we achieve a 10% error reduction for a data-driven dependency parser. Adding graph transformations for non-projective dependency parsing gives a total error reduction of about 20% (even more for unlabeled exact match). In this way, we achieve state-of-the-art accuracy with a deterministic, classifier-based dependency parser.

Acknowledgements

The research presented in this paper was partially supported by the Swedish Research Council. We are grateful to Jan Hajič and Daniel Zeman for help with the Czech data and to three anonymous reviewers for helpful comments and suggestions.

References

Daniel M. Bikel. 2004. Intricacies of Collins' parsing model. Computational Linguistics, 30:479-511.

Alena Böhmová, Jan Hajič, Eva Hajičová, and Barbora Hladká. 2003. The Prague Dependency Treebank: A three-level annotation scenario. In Anne Abeillé, editor, Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers.

Chih-Chung Chang and Chih-Jen Lin. 2005. LIBSVM: A library for support vector machines.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 132-139.

Michael Collins, Jan Hajič, Eric Brill, Lance Ramshaw, and Christoph Tillmann. 1999. A statistical parser for Czech. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pages 505-512.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL), pages 16-23.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Walter Daelemans and Antal Van den Bosch. 2005. Memory-Based Language Processing. Cambridge University Press.

Jason Eisner and Noah A. Smith. 2005. Parsing with soft and hard constraints on dependency length. In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT).

Jan Hajič, Barbora Vidová Hladká, Jarmila Panevová, Eva Hajičová, Petr Sgall, and Petr Pajas. 2001. Prague Dependency Treebank 1.0. LDC, 2001T10.

Jan Hajič. 1998. Building a syntactically annotated corpus: The Prague Dependency Treebank. In Issues of Valency and Meaning, pages 12-19. Karolinum, Charles University Press, Prague.

Keith Hall and Václav Novák. 2005. Corrective modeling for non-projective dependency parsing. In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT).

Richard Hudson. 1990. English Word Grammar. Basil Blackwell.

Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24:613-632.

Dan Klein and Christopher Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 423-430.

Vincenzo Lombardo and Leonardo Lesmo. 1998. Unit coordination and gapping in dependency theory. In Proceedings of the Workshop on Processing of Dependency-Based Grammars, pages 11-20.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Igor Mel'čuk. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.

Joakim Nivre and Johan Hall. 2005. MaltParser: A language-independent system for data-driven dependency parsing. In Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories (TLT), pages 137-148.
Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 99-106.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149-160.

Jong C. Park and Hyung Joon Cho. 2000. Informed parsing for coordination with combinatory categorial grammar. In Proceedings of the 18th International Conference on Computational Linguistics (COLING), pages 593-599.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT), pages 125-132.

Petr Sgall, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. Reidel.

Lucien Tesnière. 1959. Éléments de syntaxe structurale. Éditions Klincksieck.

Daniel Zeman and Zdeněk Žabokrtský. 2005. Improving parsing accuracy by combining diverse dependency parsers. In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT).

Daniel Zeman. 2004. Parsing with a Statistical Dependency Model. Ph.D. thesis, Charles University.