Discriminative Classifiers for Deterministic Dependency Parsing

Johan Hall, Växjö University, jha@msi.vxu.se
Joakim Nivre, Växjö University and Uppsala University, nivre@msi.vxu.se
Jens Nilsson, Växjö University, jni@msi.vxu.se

Abstract

Deterministic parsing guided by treebank-induced classifiers has emerged as a simple and efficient alternative to more complex models for data-driven parsing. We present a systematic comparison of memory-based learning (MBL) and support vector machines (SVM) for inducing classifiers for deterministic dependency parsing, using data from Chinese, English and Swedish, together with a variety of different feature models. The comparison shows that SVM gives higher accuracy for richly articulated feature models across all languages, albeit with considerably longer training times. The results also confirm that classifier-based deterministic parsing can achieve parsing accuracy very close to the best results reported for more complex parsing models.

1 Introduction

Mainstream approaches in statistical parsing are based on nondeterministic parsing techniques, usually employing some kind of dynamic programming, in combination with generative probabilistic models that provide an n-best ranking of the set of candidate analyses derived by the parser (Collins, 1997; Collins, 1999; Charniak, 2000). These parsers can be enhanced by using a discriminative model, which reranks the analyses output by the parser (Johnson et al., 1999; Collins and Duffy, 2005; Charniak and Johnson, 2005). Alternatively, discriminative models can be used to search the complete space of possible parses (Taskar et al., 2004; McDonald et al., 2005).

A radically different approach is to perform disambiguation deterministically, using a greedy parsing algorithm that approximates a globally optimal solution by making a sequence of locally optimal choices, guided by a classifier trained on gold standard derivations from a treebank.
This methodology has emerged as an alternative to more complex models, especially in dependency-based parsing. It was first used for unlabeled dependency parsing by Kudo and Matsumoto (2002) (for Japanese) and Yamada and Matsumoto (2003) (for English). It was extended to labeled dependency parsing by Nivre et al. (2004) (for Swedish) and Nivre and Scholz (2004) (for English). More recently, it has been applied with good results to lexicalized phrase structure parsing by Sagae and Lavie (2005).

The machine learning methods used to induce classifiers for deterministic parsing are dominated by two approaches. Support vector machines (SVM), which combine the maximum margin strategy introduced by Vapnik (1995) with the use of kernel functions to map the original feature space to a higher-dimensional space, have been used by Kudo and Matsumoto (2002), Yamada and Matsumoto (2003), and Sagae and Lavie (2005), among others. Memory-based learning (MBL), which is based on the idea that learning is the simple storage of experiences in memory and that solving a new problem is achieved by reusing solutions from similar previously solved problems (Daelemans and Van den Bosch, 2005), has been used primarily by Nivre et al. (2004), Nivre and Scholz (2004), and Sagae and Lavie (2005).

Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 316–323, Sydney, July 2006. © 2006 Association for Computational Linguistics

Comparative studies of learning algorithms are relatively rare. Cheng et al. (2005b) report that SVM outperforms MaxEnt models in Chinese dependency parsing, using the algorithms of Yamada and Matsumoto (2003) and Nivre (2003), while Sagae and Lavie (2005) find that SVM gives better performance than MBL in a constituency-based shift-reduce parser for English. In this paper, we present a detailed comparison of SVM and MBL for dependency parsing using the deterministic algorithm of Nivre (2003).
The comparison is based on data from three different languages – Chinese, English, and Swedish – and on five different feature models of varying complexity, with a separate optimization of learning algorithm parameters for each combination of language and feature model. The central importance of feature selection and parameter optimization in machine learning research has been shown very clearly in recent research (Daelemans and Hoste, 2002; Daelemans et al., 2003).

The rest of the paper is structured as follows. Section 2 presents the parsing framework, including the deterministic parsing algorithm and the history-based feature models. Section 3 discusses the two learning algorithms used in the experiments, and section 4 describes the experimental setup, including data sets, feature models, learning algorithm parameters, and evaluation metrics. Experimental results are presented and discussed in section 5, and conclusions in section 6.

2 Inductive Dependency Parsing

The system we use for the experiments uses no grammar but relies completely on inductive learning from treebank data. The methodology is based on three essential components:

1. Deterministic parsing algorithms for building dependency graphs (Kudo and Matsumoto, 2002; Yamada and Matsumoto, 2003; Nivre, 2003)
2. History-based models for predicting the next parser action (Black et al., 1992; Magerman, 1995; Ratnaparkhi, 1997; Collins, 1999)
3. Discriminative learning to map histories to parser actions (Kudo and Matsumoto, 2002; Yamada and Matsumoto, 2003; Nivre et al., 2004)

In this section we will define dependency graphs, describe the parsing algorithm used in the experiments and finally explain the extraction of features for the history-based models.

2.1 Dependency Graphs

A dependency graph is a labeled directed graph, the nodes of which are indices corresponding to the tokens of a sentence. Formally:

Definition 1 Given a set $R$ of dependency types (arc labels), a dependency graph for a sentence $x = (w_1, \ldots, w_n)$ is a labeled directed graph $G = (V, E, L)$, where:

1. $V = Z_{n+1}$
2. $E \subseteq V \times V$
3. $L : E \rightarrow R$

The set $V$ of nodes (or vertices) is the set $Z_{n+1} = \{0, 1, 2, \ldots, n\}$ ($n \in Z^+$), i.e., the set of non-negative integers up to and including $n$. This means that every token index $i$ of the sentence is a node ($1 \leq i \leq n$) and that there is a special node 0, which does not correspond to any token of the sentence and which will always be a root of the dependency graph (normally the only root). We use $V^+$ to denote the set of nodes corresponding to tokens (i.e., $V^+ = V - \{0\}$), and we use the term token node for members of $V^+$.

The set $E$ of arcs (or edges) is a set of ordered pairs $(i, j)$, where $i$ and $j$ are nodes. Since arcs are used to represent dependency relations, we will say that $i$ is the head and $j$ is the dependent of the arc $(i, j)$. As usual, we will use the notation $i \rightarrow j$ to mean that there is an arc connecting $i$ and $j$ (i.e., $(i, j) \in E$) and we will use the notation $i \rightarrow^* j$ for the reflexive and transitive closure of the arc relation $E$ (i.e., $i \rightarrow^* j$ if and only if $i = j$ or there is a path of arcs connecting $i$ to $j$).

The function $L$ assigns a dependency type (arc label) $r \in R$ to every arc $e \in E$.

Definition 2 A dependency graph $G$ is well-formed if and only if:

1. The node 0 is a root.
2. Every node has in-degree at most 1.
3. $G$ is connected. [1]
4. $G$ is acyclic.
5. $G$ is projective. [2]

Conditions 1–4, which are more or less standard in dependency parsing, together entail that the graph is a rooted tree. The condition of projectivity, by contrast, is somewhat controversial, since the analysis of certain linguistic constructions appears to require non-projective dependency arcs. For the purpose of this paper, however, this assumption is unproblematic, given that all the treebanks used in the experiments are restricted to projective dependency graphs.

Figure 1 shows a well-formed dependency graph for an English sentence, where each word of the sentence is tagged with its part-of-speech and each arc labeled with a dependency type.

[Figure 1: Dependency graph for an English sentence from the WSJ section of the Penn Treebank. Tagged sentence: Economic/JJ news/NN had/VB little/JJ effect/NN on/IN financial/JJ markets/NN ./. Arc labels shown: NMOD, SBJ, OBJ, PMOD, P.]

[1] To be more exact, we require $G$ to be weakly connected, which entails that the corresponding undirected graph is connected, whereas a strongly connected graph has a directed path between any pair of nodes.

[2] An arc $(i, j)$ is projective iff there is a path from $i$ to every node $k$ such that $i < k < j$ or $i > k > j$. A graph $G$ is projective if all its arcs are projective.

2.2 Parsing Algorithm

We begin by defining parser configurations and the abstract data structures needed for the definition of history-based feature models.

Definition 3 Given a set $R = \{r_0, r_1, \ldots, r_m\}$ of dependency types and a sentence $x = (w_1, \ldots, w_n)$, a parser configuration for $x$ is a quadruple $c = (\sigma, \tau, h, d)$, where:

1. $\sigma$ is a stack of token nodes.
2. $\tau$ is a sequence of token nodes.
3. $h : V_x^+ \rightarrow V$ is a function from token nodes to nodes.
4. $d : V_x^+ \rightarrow R$ is a function from token nodes to dependency types.
5. For every token node $i \in V_x^+$, $h(i) = 0$ if and only if $d(i) = r_0$.

The idea is that the sequence $\tau$ represents the remaining input tokens in a left-to-right pass over the input sentence $x$; the stack $\sigma$ contains partially processed nodes that are still candidates for dependency arcs, either as heads or dependents; and the functions $h$ and $d$ represent a (dynamically defined) dependency graph for the input sentence $x$. We refer to the token node on top of the stack as the top token and the first token node of the input sequence as the next token.

When parsing a sentence $x = (w_1, \ldots, w_n)$, the parser is initialized to a configuration $c_0 = (\epsilon, (1, \ldots, n), h_0, d_0)$ with an empty stack, with all the token nodes in the input sequence, and with all token nodes attached to the special root node 0 with a special dependency type $r_0$. The parser terminates in any configuration $c_m = (\sigma, \epsilon, h, d)$ where the input sequence is empty, which happens after one left-to-right pass over the input.

There are four possible parser transitions, two of which are parameterized for a dependency type $r \in R$.

1. LEFT-ARC($r$) makes the top token $i$ a (left) dependent of the next token $j$ with dependency type $r$, i.e., $j \xrightarrow{r} i$, and immediately pops the stack.
2. RIGHT-ARC($r$) makes the next token $j$ a (right) dependent of the top token $i$ with dependency type $r$, i.e., $i \xrightarrow{r} j$, and immediately pushes $j$ onto the stack.
3. REDUCE pops the stack.
4. SHIFT pushes the next token $i$ onto the stack.

The choice between different transitions is nondeterministic in the general case and is resolved by a classifier induced from a treebank, using features extracted from the parser configuration.

2.3 Feature Models

The task of the classifier is to predict the next transition given the current parser configuration, where the configuration is represented by a feature vector $\Phi_{(1,p)} = (\phi_1, \ldots, \phi_p)$. Each feature $\phi_i$ is a function of the current configuration, defined in terms of an address function $a_{\phi_i}$, which identifies a specific token in the current parser configuration, and an attribute function $f_{\phi_i}$, which picks out a specific attribute of the token.

Definition 4 Let $c = (\sigma, \tau, h, d)$ be the current parser configuration.

1. For every $i$ ($i \geq 0$), $\sigma_i$ and $\tau_i$ are address functions identifying the $i$th token of $\sigma$ and $\tau$, respectively (with indexing starting at 0).
2. If $\alpha$ is an address function, then $h(\alpha)$, $l(\alpha)$, and $r(\alpha)$ are address functions, identifying the head ($h$), the leftmost child ($l$), and the rightmost child ($r$), of the token identified by $\alpha$ (according to the function $h$).
3. If $\alpha$ is an address function, then $p(\alpha)$, $w(\alpha)$ and $d(\alpha)$ are feature functions, identifying the part-of-speech ($p$), word form ($w$) and dependency type ($d$) of the token identified by $\alpha$. We call $p$, $w$ and $d$ attribute functions.

A feature model is defined by specifying a vector of feature functions. In section 4.2 we will define the feature models used in the experiments.

3 Learning Algorithms

The learning problem for inductive dependency parsing, defined in the preceding section, is a pure classification problem, where the input instances are parser configurations, represented by feature vectors, and the output classes are parser transitions. In this section, we introduce the two machine learning methods used to solve this problem in the experiments.

3.1 MBL

MBL is a lazy learning method, based on the idea that learning is the simple storage of experiences in memory and that solving a new problem is achieved by reusing solutions from similar previously solved problems (Daelemans and Van den Bosch, 2005). In essence, this is a k nearest neighbor approach to classification, although a variety of sophisticated techniques, including different distance metrics and feature weighting schemes, can be used to improve classification accuracy.

For the experiments reported in this paper we use the TiMBL software package for memory-based learning and classification (Daelemans and Van den Bosch, 2005), which directly handles multi-valued symbolic features. Based on results from previous optimization experiments (Nivre et al., 2004), we use the modified value difference metric (MVDM) to determine distances between instances, and distance-weighted class voting for determining the class of a new instance. The parameters varied during experiments are the number $k$ of nearest neighbors and the frequency threshold $l$ below which MVDM is replaced by the simple Overlap metric.

3.2 SVM

SVM in its simplest form is a binary classifier that tries to separate positive and negative cases in training data by a hyperplane using a linear kernel function. The goal is to find the hyperplane that separates the training data into two classes with the largest margin. By using other kernel functions, such as polynomial or radial basis function (RBF), feature vectors are mapped into a higher dimensional space (Vapnik, 1998; Kudo and Matsumoto, 2001). Multi-class classification with $n$ classes can be handled by the one-versus-all method, with $n$ classifiers that each separate one class from the rest, or the one-versus-one method, with $n(n-1)/2$ classifiers, one for each pair of classes (Vural and Dy, 2004). SVM requires all features to be numerical, which means that symbolic features have to be converted, normally by introducing one binary feature for each value of the symbolic feature.

For the experiments reported in this paper we use the LIBSVM library (Wu et al., 2004; Chang and Lin, 2005) with the polynomial kernel $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$, where $d$, $\gamma$ and $r$ are kernel parameters. Other parameters that are varied in experiments are the penalty parameter $C$, which defines the tradeoff between training error and the magnitude of the margin, and the termination criterion $\epsilon$, which determines the tolerance of training errors.

We adopt the standard method for converting symbolic features to numerical features by binarization, and we use the one-versus-one strategy for multi-class classification. However, to reduce training times, we divide the training data into smaller sets, according to the part-of-speech of the next token in the current parser configuration, and train one set of classifiers for each smaller set.
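The division of training data by the part-of-speech of the next token can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary key `"p(tau0)"`, the function names, and the majority-class learner standing in for SVM training are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def partition_by_next_tag(instances):
    """Group (features, transition) pairs by the part-of-speech of the next token."""
    partitions = defaultdict(list)
    for features, transition in instances:
        partitions[features["p(tau0)"]].append((features, transition))
    return dict(partitions)

def train_majority(subset):
    """Placeholder learner (assumption): always predicts the most frequent transition."""
    label = Counter(t for _, t in subset).most_common(1)[0][0]
    return lambda features: label

def train_per_partition(instances, train=train_majority):
    """Train one classifier per partition; `train` would be an SVM trainer in practice."""
    return {tag: train(subset) for tag, subset in partition_by_next_tag(instances).items()}

def classify(models, features):
    """Route an instance to the classifier for its next-token tag (KeyError if unseen)."""
    return models[features["p(tau0)"]](features)

# Hypothetical training instances: feature dictionaries paired with transitions.
instances = [
    ({"p(sigma0)": "JJ", "p(tau0)": "NN"}, "LEFT-ARC(NMOD)"),
    ({"p(sigma0)": "NN", "p(tau0)": "VB"}, "LEFT-ARC(SBJ)"),
    ({"p(sigma0)": "JJ", "p(tau0)": "NN"}, "LEFT-ARC(NMOD)"),
    ({"p(sigma0)": "VB", "p(tau0)": "NN"}, "RIGHT-ARC(OBJ)"),
]
partitions = partition_by_next_tag(instances)
models = train_per_partition(instances)
prediction = classify(models, {"p(sigma0)": "NN", "p(tau0)": "NN"})
```

Each classifier only ever sees the instances of its own partition, which is what keeps the individual training problems small.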
Similar techniques have previously been used by Yamada and Matsumoto (2003), among others, without significant loss of accuracy. In order to avoid too small training sets, we pool together all parts-of-speech that have a frequency below a certain threshold $t$ (set to 1000 in all the experiments).

4 Experimental Setup

In this section, we describe the experimental setup, including data sets, feature models, parameter optimization, and evaluation metrics. Experimental results are presented in section 5.

4.1 Data Sets

The data set used for Swedish comes from Talbanken (Einarsson, 1976), which contains both written and spoken Swedish. In the experiments, the professional prose section is used, consisting of about 100k words taken from newspapers, textbooks and information brochures. The data has been manually annotated with a combination of constituent structure, dependency structure, and topological fields (Teleman, 1974). This annotation has been converted to dependency graphs and the original fine-grained classification of grammatical functions has been reduced to 17 dependency types. We use a pseudo-randomized data split, dividing the data into 10 sections by allocating sentence $i$ to section $i \bmod 10$. Sections 1–9 are used for 9-fold cross-validation during development and section 0 for final evaluation.

The English data are from the Wall Street Journal section of the Penn Treebank II (Marcus et al., 1994). We use sections 2–21 for training, section 0 for development, and section 23 for the final evaluation. The head percolation table of Yamada and Matsumoto (2003) has been used to convert constituent structures to dependency graphs, and a variation of the scheme employed by Collins (1999) has been used to construct arc labels that can be mapped to a set of 12 dependency types.
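The pseudo-randomized split described for the Swedish data (sentence $i$ goes to section $i \bmod 10$, with sections 1–9 used for cross-validation and section 0 held out) can be sketched as below. The function names are illustrative, and the sketch assumes that sentences are numbered from 0, which the text does not specify.

```python
from collections import defaultdict

def split_into_sections(sentences, n_sections=10):
    """Allocate sentence i to section i mod n_sections (0-based numbering assumed)."""
    sections = defaultdict(list)
    for i, sentence in enumerate(sentences):
        sections[i % n_sections].append(sentence)
    return sections

def cross_validation_folds(sections, dev_ids):
    """Yield (train, held_out) pairs, holding out one development section at a time."""
    for held_out_id in dev_ids:
        train = [s for sid in dev_ids if sid != held_out_id for s in sections[sid]]
        yield train, sections[held_out_id]

# Toy corpus of 25 placeholder sentences.
sentences = [f"sent{i}" for i in range(25)]
sections = split_into_sections(sentences)
final_test = sections[0]                                   # reserved for final evaluation
folds = list(cross_validation_folds(sections, dev_ids=range(1, 10)))  # 9 folds
```

Because consecutive sentences land in different sections, every section samples the whole corpus rather than one contiguous stretch of text.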
The Chinese data are taken from the Penn Chinese Treebank (CTB) version 5.1 (Xue et al., 2005), consisting of about 500k words mostly from Xinhua newswire, Sinorama news magazine and Hong Kong News. CTB is annotated with a combination of constituent structure and grammatical functions in the Penn Treebank style, and has been converted to dependency graphs using essentially the same method as for the English data, although with a different head percolation table and mapping scheme. We use the same kind of pseudo-randomized data split as for Swedish, but we use section 9 as the development test set (training on sections 1–8) and section 0 as the final test set (training on sections 1–9).

A standard HMM part-of-speech tagger with suffix smoothing has been used to tag the test data with an accuracy of 96.5% for English and 95.1% for Swedish. For the Chinese experiments we have used the original (gold standard) tags from the treebank, to facilitate comparison with results previously reported in the literature.

4.2 Feature Models

Table 1 describes the five feature models $\Phi_1$–$\Phi_5$ used in the experiments, with features specified in column 1 using the functional notation defined in section 2.3. Thus, $p(\sigma_0)$ refers to the part-of-speech of the top token, while $d(l(\tau_0))$ picks out the dependency type of the leftmost child of the next token. It is worth noting that models $\Phi_1$–$\Phi_2$ are unlexicalized, since they do not contain any features of the form $w(\alpha)$, while models $\Phi_3$–$\Phi_5$ are all lexicalized to different degrees.

Feature             $\Phi_1$  $\Phi_2$  $\Phi_3$  $\Phi_4$  $\Phi_5$
$p(\sigma_0)$          +         +         +         +         +
$p(\tau_0)$            +         +         +         +         +
$p(\tau_1)$            +         +         +         +         +
$p(\tau_2)$                                          +         +
$p(\tau_3)$                                          +         +
$p(\sigma_1)$                                                  +
$d(\sigma_0)$                    +         +         +         +
$d(l(\sigma_0))$                 +         +         +         +
$d(r(\sigma_0))$                 +         +         +         +
$d(l(\tau_0))$                   +         +         +         +
$w(\sigma_0)$                              +         +         +
$w(\tau_0)$                                +         +         +
$w(\tau_1)$                                                    +
$w(h(\sigma_0))$                                               +

Table 1: Feature models
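To make the functional notation concrete, the sketch below implements address functions ($\sigma_i$, $\tau_i$, $h$, $l$, $r$) and attribute functions ($p$, $w$, $d$) as composable Python closures. Everything beyond the notation itself is an assumption for illustration: the `Configuration` class, the list-based stack and input sequence, the `"NONE"` value for undefined addresses, and the example parser state are hypothetical, and the feature list is an illustrative selection from column 1 of Table 1 rather than one of the five models.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ROOT, R0, NONE = 0, "r0", "NONE"   # root node, default dependency type, undefined value

@dataclass
class Configuration:
    words: List[str]               # words[i] is the form of token i; index 0 is the root
    tags: List[str]                # tags[i] is the part-of-speech of token i
    stack: List[int]               # top token is stack[-1]
    buffer: List[int]              # next token is buffer[0]
    head: Dict[int, int] = field(default_factory=dict)    # unattached tokens default to ROOT
    deprel: Dict[int, str] = field(default_factory=dict)  # unattached tokens default to R0

# Address functions: map a configuration to a token index (or None if undefined).
def sigma(i): return lambda c: c.stack[-1 - i] if i < len(c.stack) else None
def tau(i): return lambda c: c.buffer[i] if i < len(c.buffer) else None

def h(a):
    def addr(c):
        t = a(c)
        return None if t is None or t == ROOT else c.head.get(t, ROOT)
    return addr

def _child(a, pick):
    def addr(c):
        t = a(c)
        if t is None:
            return None
        kids = [j for j in range(1, len(c.words)) if c.head.get(j, ROOT) == t]
        return pick(kids) if kids else None
    return addr

def l(a): return _child(a, min)    # leftmost child
def r(a): return _child(a, max)    # rightmost child

# Attribute functions: map an address to a symbolic feature value.
def p(a): return lambda c: NONE if a(c) is None else c.tags[a(c)]
def w(a): return lambda c: NONE if a(c) is None else c.words[a(c)]
def d(a):
    def feat(c):
        t = a(c)
        return NONE if t is None or t == ROOT else c.deprel.get(t, R0)
    return feat

# Hypothetical state: "news" on the stack with "Economic" already attached to it.
words = ["<root>", "Economic", "news", "had", "little", "effect", "on", "financial", "markets", "."]
tags = ["<root>", "JJ", "NN", "VB", "JJ", "NN", "IN", "JJ", "NN", "."]
config = Configuration(words, tags, stack=[2], buffer=[3, 4, 5, 6, 7, 8, 9],
                       head={1: 2}, deprel={1: "NMOD"})
model = [p(sigma(0)), p(tau(0)), p(tau(1)), d(sigma(0)), d(l(sigma(0))),
         d(r(sigma(0))), d(l(tau(0))), w(sigma(0)), w(tau(0))]
feature_vector = [phi(config) for phi in model]
```

Writing a feature model as a list of composed functions mirrors the way the table specifies it: adding or removing a row corresponds to adding or removing one element of `model`.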
                Swedish                     English                     Chinese
FM        LM    AS_U   AS_L   EM_U   EM_L   AS_U   AS_L   EM_U   EM_L   AS_U   AS_L   EM_U   EM_L
$\Phi_1$  MBL   75.3   68.7   16.0   11.4  *76.5   73.7    9.8    7.7   66.4   63.6   14.3   12.1
$\Phi_1$  SVM   75.4   68.9   16.3   12.1   76.4   73.6    9.8    7.7   66.4   63.6   14.2   12.1
$\Phi_2$  MBL   81.9   74.4   31.4   19.8   81.2   78.2   19.8   14.9   73.0   70.7   22.6   18.8
$\Phi_2$  SVM  *83.1  *76.3  *34.3  *24.0   81.3   78.3   19.4   14.9  *73.2  *71.0   22.1   18.6
$\Phi_3$  MBL   85.9   81.4   37.9   28.9   85.5   83.7   26.5   23.7   77.9   76.3   26.3   23.4
$\Phi_3$  SVM   86.2  *82.6   38.7  *32.5  *86.4  *84.8  *28.5  *25.9  *79.7  *78.3  *30.1  *25.9
$\Phi_4$  MBL   86.1   82.1   37.6   30.1   87.0   85.2   29.8   26.0   79.4   77.7   28.0   24.7
$\Phi_4$  SVM   86.0   82.2   37.9   31.2  *88.4  *86.8  *33.2  *30.3  *81.7  *80.1  *31.0  *27.0
$\Phi_5$  MBL   86.6   82.3   39.9   29.9   88.0   86.2   32.8   28.4   81.1   79.2   30.2   25.9
$\Phi_5$  SVM   86.9  *83.2   40.7  *33.7  *89.4  *87.9  *36.4  *33.1  *84.3  *82.7  *34.5  *30.5

Table 2: Parsing accuracy; FM: feature model; LM: learning method; AS: attachment score, EM: exact match; U: unlabeled, L: labeled

4.3 Optimization

As already noted, optimization of learning algorithm parameters is a prerequisite for meaningful comparison of different algorithms, although an exhaustive search of the parameter space is usually impossible in practice.

For MBL we have used the modified value difference metric (MVDM) and class voting weighted by inverse distance (ID) in all experiments, and performed a grid search for the optimal values of the number $k$ of nearest neighbors and the frequency threshold $l$ for switching from MVDM to the simple Overlap metric (cf. section 3.1). The best values are different for different combinations of data sets and models but are generally found in the range 3–10 for $k$ and in the range 1–8 for $l$.

The polynomial kernel of degree 2 has been used for all the SVM experiments, but the kernel parameters $\gamma$ and $r$ have been optimized together with the penalty parameter $C$ and the termination criterion $\epsilon$. The intervals for the parameters are: $\gamma$: 0.16–0.40; $r$: 0–0.6; $C$: 0.5–1.0; $\epsilon$: 0.1–1.0.
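As a small worked illustration of the degree-2 polynomial kernel applied to binarized symbolic features, consider the sketch below. It is written in plain Python rather than with LIBSVM; the helper names are hypothetical, and the values $\gamma = 0.25$ and $r = 0.5$ are simply picked from inside the reported intervals, not the optimized settings.

```python
def build_vocabulary(instances):
    """Assign one dimension to every (feature position, symbolic value) pair seen."""
    vocabulary = {}
    for instance in instances:
        for position, value in enumerate(instance):
            vocabulary.setdefault((position, value), len(vocabulary))
    return vocabulary

def binarize(instance, vocabulary):
    """Convert a tuple of symbolic values into a 0/1 vector."""
    vector = [0] * len(vocabulary)
    for position, value in enumerate(instance):
        index = vocabulary.get((position, value))
        if index is not None:          # values unseen in training activate no dimension
            vector[index] = 1
    return vector

def polynomial_kernel(x_i, x_j, gamma, r, d=2):
    """K(x_i, x_j) = (gamma * <x_i, x_j> + r)^d."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    return (gamma * dot + r) ** d

# Three toy instances with three symbolic features each (part-of-speech tags).
instances = [("NN", "VB", "JJ"), ("NN", "VB", "NN"), ("JJ", "NN", "VB")]
vocabulary = build_vocabulary(instances)
x0, x1, x2 = (binarize(instance, vocabulary) for instance in instances)

k01 = polynomial_kernel(x0, x1, gamma=0.25, r=0.5)  # two shared values: (0.25*2 + 0.5)^2
k02 = polynomial_kernel(x0, x2, gamma=0.25, r=0.5)  # no shared values:  (0.25*0 + 0.5)^2
```

For binary vectors the inner product is just the number of feature values two instances share, so the degree-2 kernel implicitly scores pairs of matching feature values as well as single matches.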
4.4 Evaluation Metrics

The evaluation metrics used for parsing accuracy are the unlabeled attachment score $AS_U$, which is the proportion of tokens that are assigned the correct head (regardless of dependency type), and the labeled attachment score $AS_L$, which is the proportion of tokens that are assigned the correct head and the correct dependency type. We also consider the unlabeled exact match $EM_U$, which is the proportion of sentences that are assigned a completely correct dependency graph without considering dependency type labels, and the labeled exact match $EM_L$, which also takes dependency type labels into account. Attachment scores are presented as mean scores per token, and punctuation tokens are excluded from all counts. For all experiments we have performed a McNemar test of significance at $\alpha = 0.01$ for differences between the two learning methods. We also compare learning and parsing times, as measured on an AMD 64-bit processor running Linux.

5 Results and Discussion

Table 2 shows the parsing accuracy for the combination of three languages (Swedish, English and Chinese), two learning methods (MBL and SVM) and five feature models ($\Phi_1$–$\Phi_5$), with algorithm parameters optimized as described in section 4.3. For each combination, we measure the attachment score (AS) and the exact match (EM). A significant improvement for one learning method over the other is marked by an asterisk (*).

Independently of language and learning method, the most complex feature model $\Phi_5$ gives the highest accuracy across all metrics. Not surprisingly, the lowest accuracy is obtained with the simplest feature model $\Phi_1$. By and large, more complex feature models give higher accuracy, with one exception for Swedish and the feature models $\Phi_3$ and $\Phi_4$. It is significant in this context that the Swedish data set is the smallest of the three (about 20% of the Chinese data set and about 10% of the English one).

If we compare MBL and SVM, we see that SVM outperforms MBL for the three most complex models $\Phi_3$, $\Phi_4$ and $\Phi_5$, both for English and Chinese. The results for Swedish are less clear, although the labeled accuracy for $\Phi_3$ and $\Phi_5$ is significantly better with SVM. For the $\Phi_1$ model there is no significant improvement using SVM. In fact, the small differences found in the $AS_U$ scores are to the advantage of MBL. By contrast, there is a large gap between MBL and SVM for the model $\Phi_5$ and the languages Chinese and English. For Swedish, the differences are much smaller (except for the $EM_L$ score), which may be due to the smaller size of the Swedish data set in combination with the technique of dividing the training data for SVM (cf. section 3.2).

Another important factor when comparing two learning methods is the efficiency in terms of time. Table 3 reports learning and parsing time for the three languages and the five feature models. The learning time correlates very well with the complexity of the feature model and MBL, being a lazy learning method, is much faster than SVM. For the unlexicalized feature models $\Phi_1$ and $\Phi_2$, the parsing time is also considerably lower for MBL, especially for the large data sets (English and Chinese). But as model complexity grows, especially with the addition of lexical features, SVM gradually gains an advantage over MBL with respect to parsing time. This is especially striking for Swedish, where the training data set is considerably smaller than for the other languages.

                  Swedish             English             Chinese
Model     Method  LT        PT        LT        PT        LT        PT
$\Phi_1$  MBL     1 s       2 s       16 s      26 s      7 s       8 s
$\Phi_1$  SVM     40 s      14 s      1.5 h     14 min    1.5 h     17 min
$\Phi_2$  MBL     3 s       5 s       35 s      32 s      13 s      14 s
$\Phi_2$  SVM     40 s      13 s      1 h       11 min    1.5 h     15 min
$\Phi_3$  MBL     6 s       1 min     1.5 min   9.5 min   46 s      10 min
$\Phi_3$  SVM     1 min     15 s      1 h       9 min     2 h       16 min
$\Phi_4$  MBL     8 s       2 min     1.5 min   9 min     45 s      12 min
$\Phi_4$  SVM     2 min     18 s      2 h       12 min    2.5 h     14 min
$\Phi_5$  MBL     10 s      7 min     3 min     41 min    1.5 min   46 min
$\Phi_5$  SVM     2 min     25 s      1.5 h     10 min    6 h       24 min

Table 3: Time efficiency; LT: learning time, PT: parsing time

Compared to the state of the art in dependency parsing, the unlabeled attachment scores obtained for Swedish with model $\Phi_5$, for both MBL and SVM, are about 1 percentage point higher than the results reported for MBL by Nivre et al. (2004). For the English data, the result for SVM with model $\Phi_5$ is about 3 percentage points below the results obtained with the parser of Charniak (2000) and reported by Yamada and Matsumoto (2003). For Chinese, finally, the accuracy for SVM with model $\Phi_5$ is about one percentage point lower than the best reported results, achieved with a deterministic classifier-based approach using SVM and preprocessing to detect root nodes (Cheng et al., 2005a), although these results are not based on exactly the same dependency conversion and data split as ours.

6 Conclusion

We have performed an empirical comparison of MBL (TiMBL) and SVM (LIBSVM) as learning methods for classifier-based deterministic dependency parsing, using data from three languages and feature models of varying complexity. The evaluation shows that SVM gives higher parsing accuracy and comparable or better parsing efficiency for complex, lexicalized feature models across all languages, whereas MBL is superior with respect to training efficiency, even if training data is divided into smaller sets for SVM. The best accuracy obtained for SVM is close to the state of the art for all languages involved.

Acknowledgements

The work presented in this paper was partially supported by the Swedish Research Council. We are grateful to Hiroyasu Yamada and Yuan Ding for sharing their head percolation tables for English and Chinese, respectively, and to three anonymous reviewers for helpful comments and suggestions.

References

Ezra Black, Frederick Jelinek, John D. Lafferty, David M. Magerman, Robert L. Mercer, and Salim Roukos. 1992. Towards history-based grammars: Using richer models for probabilistic parsing. In Proceedings of the 5th DARPA Speech and Natural Language Workshop, pages 31–37.

Chih-Chung Chang and Chih-Jen Lin. 2005. LIBSVM: A library for support vector machines.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 173–180.

Eugene Charniak. 2000. A Maximum-Entropy-Inspired Parser. In Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 132–139.

Yuchang Cheng, Masayuki Asahara, and Yuji Matsumoto. 2005a. Chinese deterministic dependency analyzer: Examining effects of global features and root node finder. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 17–24.

Yuchang Cheng, Masayuki Asahara, and Yuji Matsumoto. 2005b. Machine learning-based dependency analyzer for Chinese. In Proceedings of the International Conference on Chinese Computing (ICCC).

Michael Collins and Nigel Duffy. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31:25–70.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL), pages 16–23.

Michael Collins. 1999.
Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Walter Daelemans and Veronique Hoste. 2002. Evaluation of machine learning methods for natural language processing tasks. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 755–760.

Walter Daelemans and Antal Van den Bosch. 2005. Memory-Based Language Processing. Cambridge University Press.

Walter Daelemans, Veronique Hoste, Fien De Meulder, and Bart Naudts. 2003. Combined optimization of feature selection and algorithm parameter interaction in machine learning of language. In Proceedings of the 14th European Conference on Machine Learning (ECML), pages 84–95.

Jan Einarsson. 1976. Talbankens skriftspråkskonkordans. Lund University, Department of Scandinavian Languages.

Mark Johnson, Stuart Geman, Steven Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic "unification-based" grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pages 535–541.

Taku Kudo and Yuji Matsumoto. 2001. Chunking with support vector machines. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL).

Taku Kudo and Yuji Matsumoto. 2002. Japanese dependency analysis using cascaded chunking. In Proceedings of the Sixth Workshop on Computational Language Learning (CoNLL), pages 63–69.

David M. Magerman. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 276–283.

Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate-argument structure. In Proceedings of the ARPA Human Language Technology Workshop, pages 114–119.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005.
Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 91–98.

Joakim Nivre and Mario Scholz. 2004. Deterministic dependency parsing of English text. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), pages 64–70.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of the 8th Conference on Computational Natural Language Learning (CoNLL), pages 49–56.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.

Adwait Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–10.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT), pages 125–132.

Ben Taskar, Dan Klein, Michael Collins, Daphne Koller, and Christopher Manning. 2004. Max-margin parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1–8.

Ulf Teleman. 1974. Manual för grammatisk beskrivning av talad och skriven svenska. Studentlitteratur.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.

Vladimir Vapnik. 1998. Statistical Learning Theory. John Wiley and Sons, New York.

Volkan Vural and Jennifer G. Dy. 2004. A hierarchical method for multi-class support vector machines. ACM International Conference Proceeding Series, 69:105–113.

Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng. 2004. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975–1005.

Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005.
The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 195–206.