Multi-Tagging for Lexicalized-Grammar Parsing

James R. Curran, School of IT, University of Sydney, NSW 2006, Australia. james@it.usyd.edu.au
Stephen Clark, Computing Laboratory, Oxford University, Wolfson Building, Parks Road, Oxford, OX1 3QD, UK. sclark@comlab.ox.ac.uk
David Vadas, School of IT, University of Sydney, NSW 2006, Australia. dvadas1@it.usyd.edu.au

Abstract

With performance above 97% accuracy for newspaper text, part of speech (POS) tagging might be considered a solved problem. Previous studies have shown that allowing the parser to resolve POS tag ambiguity does not improve performance. However, for grammar formalisms which use more fine-grained grammatical categories, for example TAG and CCG, tagging accuracy is much lower. In fact, for these formalisms, premature ambiguity resolution makes parsing infeasible. We describe a multi-tagging approach which maintains a suitable level of lexical category ambiguity for accurate and efficient CCG parsing. We extend this multi-tagging approach to the POS level to overcome errors introduced by automatically assigned POS tags. Although POS tagging accuracy seems high, maintaining some POS tag ambiguity in the language processing pipeline results in more accurate CCG supertagging.

1 Introduction

State-of-the-art part of speech (POS) tagging accuracy is now above 97% for newspaper text (Collins, 2002; Toutanova et al., 2003). One possible conclusion from the POS tagging literature is that accuracy is approaching the limit, and any remaining improvement is within the noise of the Penn Treebank training data (Ratnaparkhi, 1996; Toutanova et al., 2003). So why should we continue to work on the POS tagging problem?

Here we give two reasons. First, for lexicalized grammar formalisms such as TAG and CCG, the tagging problem is much harder. Second, any errors in POS tagger output, even at 97% accuracy, can have a significant impact on components further down the language processing pipeline. In previous work we have shown that using automatically assigned, rather than gold-standard, POS tags reduces the accuracy of our CCG parser by almost 2% in dependency F-score (Clark and Curran, 2004b).

CCG supertagging is much harder than POS tagging because the CCG tag set consists of fine-grained lexical categories, resulting in a larger tag set: over 400 CCG lexical categories compared with 45 Penn Treebank POS tags. In fact, using a state-of-the-art tagger as a front end to a CCG parser makes accurate parsing infeasible because of the high supertagging error rate. Our solution is to use multi-tagging, in which a CCG supertagger can potentially assign more than one lexical category to a word. In this paper we significantly improve our earlier approach (Clark and Curran, 2004a) by adapting the forward-backward algorithm to a Maximum Entropy tagger, which is used to calculate a probability distribution over lexical categories for each word. This distribution is used to assign one or more categories to each word (Charniak et al., 1996). We report large increases in accuracy over single-tagging at only a small cost in increased ambiguity.

A further contribution of the paper is to also use multi-tagging for the POS tags, and to maintain some POS ambiguity in the language processing pipeline. In particular, since POS tags are important features for the supertagger, we investigate how supertagging accuracy can be improved by not prematurely committing to a POS tag decision.
Our results demonstrate, first, that a surprising increase in POS tagging accuracy can be achieved with only a tiny increase in ambiguity; and second, that maintaining some POS ambiguity can significantly improve the accuracy of the supertagger. The parser uses the CCG lexical categories to build syntactic structure, and the POS tags are used by the supertagger and parser as part of their statistical models. We show that using a multi-tagger for supertagging results in an effective preprocessor for CCG parsing, and that using a multi-tagger for POS tagging results in more accurate CCG supertagging.

2 Maximum Entropy Tagging

The tagger uses conditional probabilities of the form P(y|x), where y is a tag and x is a local context containing y. The conditional probabilities have the following log-linear form:

    P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)    (1)

where Z(x) is a normalisation constant which ensures a proper probability distribution for each context x.

The feature functions f_i(x, y) are binary-valued, returning either 0 or 1 depending on the tag y and the value of a particular contextual predicate given the context x. Contextual predicates identify elements of the context which might be useful for predicting the tag. For example, the following feature returns 1 if the current word is "the" and the tag is DT; otherwise it returns 0:

    f_i(x, y) = \begin{cases} 1 & \text{if } \mathrm{word}(x) = \textit{the} \text{ and } y = \mathrm{DT} \\ 0 & \text{otherwise} \end{cases}    (2)

word(x) = the is an example of a contextual predicate. The POS tagger uses the same contextual predicates as Ratnaparkhi (1996); the supertagger adds contextual predicates corresponding to POS tags and bigram combinations of POS tags (Curran and Clark, 2003).

Each feature f_i has an associated weight λ_i which is determined during training. The training process aims to maximise the entropy of the model subject to the constraints that the expectation of each feature according to the model matches the empirical expectation from the training data. This can also be thought of in terms of maximum likelihood estimation (MLE) for a log-linear model (Della Pietra et al., 1997). We use the L-BFGS optimisation algorithm (Nocedal and Wright, 1999; Malouf, 2002) to perform the estimation.

MLE has a tendency to overfit the training data. We adopt the standard approach of Chen and Rosenfeld (1999) by introducing a Gaussian prior term into the objective function, which penalises feature weights with large absolute values. A parameter defined in terms of the standard deviation of the Gaussian determines the degree of smoothing.

The conditional probability of a sequence of tags, y_1, ..., y_n, given a sentence, w_1, ..., w_n, is defined as the product of the individual probabilities for each tag:

    P(y_1, \ldots, y_n \mid w_1, \ldots, w_n) = \prod_{i=1}^{n} P(y_i \mid x_i)    (3)

where x_i is the context for word w_i. We use the standard approach of Viterbi decoding to find the highest probability sequence.

2.1 Multi-tagging

Multi-tagging, that is, assigning one or more tags to a word, is used here in two ways: first, to retain ambiguity in the CCG lexical category sequence for the purpose of building parse structure; and second, to retain ambiguity in the POS tag sequence. We retain ambiguity in the lexical category sequence since a single-tagger is not accurate enough to serve as a front-end to a CCG parser, and we retain some POS ambiguity since POS tags are used as features in the statistical models of the supertagger and parser.

Charniak et al. (1996) investigated multi-POS tagging in the context of PCFG parsing. They found that multi-tagging provides only a minor improvement in accuracy, with a significant loss in efficiency; hence they concluded that, given the particular parser and tagger used, a single-tag POS tagger is preferable to a multi-tagger. More recently, Watson (2006) has revisited this question in the context of the RASP parser (Briscoe and Carroll, 2002) and found that, similar to Charniak et al. (1996), multi-tagging at the POS level results in a small increase in parsing accuracy, but at some cost in efficiency.

For lexicalized grammars, such as CCG and TAG, the motivation for using a multi-tagger to assign the elementary structures (supertags) is more compelling. Since the set of supertags is typically much larger than a standard POS tag set, the tagging problem becomes much harder. In fact, when using a state-of-the-art single-tagger, the per-word accuracy for CCG supertagging is so low (around 92%) that wide-coverage, high-accuracy parsing becomes infeasible (Clark, 2002; Clark and Curran, 2004a). Similar results have been found for a highly lexicalized HPSG grammar (Prins and van Noord, 2003), and also for TAG. As far as we are aware, the only approach to successfully integrate a TAG supertagger and parser is the Lightweight Dependency Analyser of Bangalore (2000). Hence, in order to perform effective full parsing with these lexicalized grammars, the tagger front-end must be a multi-tagger (given the current state of the art).

The simplest approach to CCG supertagging is to assign to a word all of the categories the word was seen with in the data. This leaves the parser the task of managing the very large parse space resulting from the high degree of lexical category ambiguity (Hockenmaier and Steedman, 2002; Hockenmaier, 2003). However, one of the original motivations for supertagging was to significantly reduce the syntactic ambiguity before full parsing begins (Bangalore and Joshi, 1999). Clark and Curran (2004a) found that performing CCG supertagging prior to parsing can significantly increase parsing efficiency with no loss in accuracy.

Our multi-tagging approach follows that of Clark and Curran (2004a) and Charniak et al. (1996): assign all categories to a word whose probabilities are within a factor, β, of the probability of the most probable category for that word:

    C_i = \{ c \mid P(C_i = c \mid S) > \beta \, P(C_i = c_{\max} \mid S) \}

where C_i on the left denotes the set of categories assigned to the ith word; C_i inside the probabilities is the random variable corresponding to the category of the ith word; c_max is the category with the highest probability of being the category of the ith word; and S is the sentence. One advantage of this adaptive approach is that, when the probability of the highest scoring category is much greater than the rest, no extra categories will be added.
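This selection rule is simple enough to sketch directly. The following Python fragment is an illustration only: the function name, the example distributions, and the use of a non-strict inequality (so that c_max always survives, including at β = 1) are ours, not part of the released implementation.

```python
def select_categories(probs, beta):
    """Adaptive multi-tagging: keep every category whose probability
    is within a factor of beta of the most probable category.

    probs: dict mapping each candidate category to P(C_i = c | S).
    beta:  threshold in [0, 1]; beta = 1 reduces to single-tagging
           (only the most probable category survives), while beta = 0
           keeps every candidate category.
    """
    p_max = max(probs.values())
    return {c for c, p in probs.items() if p >= beta * p_max}

# A sharply peaked distribution yields a single category; a flatter
# one yields several (all numbers invented for illustration).
peaked = {"N": 0.90, "NP": 0.06, "N/N": 0.04}
flat = {r"(S[dcl]\NP)/NP": 0.45, r"(S[b]\NP)/NP": 0.35, "N": 0.20}
print(select_categories(peaked, beta=0.1))  # {'N'}
print(select_categories(flat, beta=0.1))    # all three categories
```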
Clark and Curran (2004a) propose a simple method for calculating P(C_i = c|S): use the word and POS features in the local context to calculate the probability, and ignore the previously assigned categories (the history). However, it is possible to incorporate the history in the calculation of the tag probabilities. A greedy approach is to use the locally highest probability history as a feature, which avoids any summing over alternative histories. Alternatively, there is a well-known dynamic programming algorithm, the forward-backward algorithm, which efficiently calculates P(C_i = c|S) (Charniak et al., 1996). The multi-tagger uses the following conditional probabilities:

    P(y_i \mid w_{1,n}) = \sum_{y_{1,i-1},\, y_{i+1,n}} P(y_i, y_{1,i-1}, y_{i+1,n} \mid w_{1,n})

where w_{i,j} abbreviates w_i, ..., w_j (and similarly for the tag subsequences). Here y_i is to be thought of as a fixed category, whereas y_j (j ≠ i) varies over the possible categories for word j. In words, the probability of category y_i, given the sentence, is the sum of the probabilities of all sequences containing y_i. This sum is calculated efficiently using the forward-backward algorithm:

    P(C_i = c \mid S) = \alpha_i(c)\, \beta_i(c)    (4)

where α_i(c) is the total probability of all the category sub-sequences that end at position i with category c, and β_i(c) is the total probability of all the category sub-sequences through to the end which start at position i with category c.

The standard description of the forward-backward algorithm, for example in Manning and Schütze (1999), is usually given for an HMM-style tagger. However, it is straightforward to adapt the algorithm to the Maximum Entropy models used here. The forward-backward algorithm we use is similar to that for a Maximum Entropy Markov Model (Lafferty et al., 2001).
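The adapted computation can be sketched as follows. This is a schematic Python illustration rather than the actual tagger code: it assumes the model exposes locally normalised distributions prob(i, prev, t) = P(y_i = t | y_{i-1} = prev, context), a simplification of the real supertagger, which conditions on richer contextual features.

```python
def tag_marginals(prob, tags, n):
    """Forward-backward computation of P(y_i = t | sentence) for a
    locally normalised (MEMM-style) tagging model.

    prob(i, prev, t): assumed model probability P(y_i = t | y_{i-1} =
    prev, context), with prev=None at position 0. tags is the tag
    set; n is the sentence length.
    """
    # alpha[i][t]: total probability of tag sequences ending in t at i
    alpha = [{t: 0.0 for t in tags} for _ in range(n)]
    for t in tags:
        alpha[0][t] = prob(0, None, t)
    for i in range(1, n):
        for t in tags:
            alpha[i][t] = sum(alpha[i - 1][s] * prob(i, s, t) for s in tags)

    # beta[i][t]: total probability of completing the sequence from (i, t)
    beta = [{t: 0.0 for t in tags} for _ in range(n)]
    for t in tags:
        beta[n - 1][t] = 1.0
    for i in range(n - 2, -1, -1):
        for t in tags:
            beta[i][t] = sum(prob(i + 1, t, s) * beta[i + 1][s] for s in tags)

    # per-position marginals: alpha * beta, normalised for safety
    marginals = []
    for i in range(n):
        scores = {t: alpha[i][t] * beta[i][t] for t in tags}
        z = sum(scores.values()) or 1.0
        marginals.append({t: v / z for t, v in scores.items()})
    return marginals

# Toy usage: a uniform bigram model over two categories.
uniform = lambda i, prev, t: 0.5
print(tag_marginals(uniform, ["A", "B"], n=3)[1])  # {'A': 0.5, 'B': 0.5}
```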
POS tags are very informative features for the supertagger, which suggests that using a multi-POS tagger may benefit the supertagger (and ultimately the parser). However, it is unclear whether multi-POS tagging will be useful in this context, since our single-tagger POS tagger is highly accurate: over 97% for WSJ text (Curran and Clark, 2003). In fact, in Clark and Curran (2004b) we report that using automatically assigned, as opposed to gold-standard, POS tags as features results in a 2% loss in parsing accuracy. This suggests that retaining some ambiguity in the POS sequence may be beneficial for supertagging and parsing accuracy. In Section 4 we show this is the case for supertagging.

3 CCG Supertagging and Parsing

Parsing using CCG can be viewed as a two-stage process: first assign lexical categories to the words in the sentence, and then combine the categories together using CCG's combinatory rules.[1] We perform stage one using a supertagger.

[1] See Steedman (2000) for an introduction to CCG, and see Hockenmaier (2003) for an introduction to wide-coverage parsing using CCG.

The set of lexical categories used by the supertagger is obtained from CCGbank (Hockenmaier, 2003), a corpus of CCG normal-form derivations derived semi-automatically from the Penn Treebank. Following our earlier work, we apply a frequency cutoff to the training set, only using those categories which appear at least 10 times in Sections 02-21, which results in a set of 425 categories. We have shown that the resulting set has very high coverage on unseen data (Clark and Curran, 2004a). Figure 1 gives an example sentence with its CCG lexical categories.

  The      NP/N
  WSJ      N
  is       (S[dcl]\NP)/NP
  a        NP/N
  paper    N
  that     (NP\NP)/(S[dcl]/NP)
  I        NP
  enjoy    (S[dcl]\NP)/(S[ng]\NP)
  reading  (S[ng]\NP)/NP

Figure 1: Example sentence with CCG lexical categories.

The parser is described in Clark and Curran (2004b). It takes POS-tagged sentences as input, with each word assigned a set of lexical categories. A packed chart is used to efficiently represent all the possible analyses for a sentence, and the CKY chart parsing algorithm described in Steedman (2000) is used to build the chart. A log-linear model is used to score the alternative analyses.

In Clark and Curran (2004a) we described a novel approach to integrating the supertagger and parser: start with a very restrictive supertagger setting, so that only a small number of lexical categories is assigned to each word, and only assign more categories if the parser cannot find a spanning analysis. This strategy results in an efficient and accurate parser, with speeds of up to 35 sentences per second. Accurate supertagging at low levels of lexical category ambiguity is therefore particularly important when using this strategy.

We found in Clark and Curran (2004b) that a large drop in parsing accuracy occurs if automatically assigned POS tags are used throughout the parsing process, rather than gold-standard POS tags (almost 2% F-score over labelled dependencies). This is due to the drop in accuracy of the supertagger (see Table 3) and also to the fact that the log-linear parsing model uses POS tags as features. The large drop in parsing accuracy demonstrates that improving the performance of POS taggers is still an important research problem. In this paper we aim to reduce the performance drop of the supertagger by maintaining some POS ambiguity through to the supertagging phase. Future work will investigate maintaining some POS ambiguity through to the parsing phase also.

4 Multi-tagging Experiments

We performed several sets of experiments for POS tagging and CCG supertagging to explore the trade-off between ambiguity and tagging accuracy. For both POS tagging and supertagging we varied the average number of tags assigned to each word, to see whether it is possible to significantly increase tagging accuracy with only a small increase in ambiguity. For CCG supertagging, we also compared multi-tagging approaches, with a fixed category ambiguity of 1.4 categories per word.

All of the experiments used Sections 02-21 of CCGbank as training data, Section 00 as development data, and Section 23 as final test data. We evaluate both per-word tag accuracy and sentence accuracy, which is the percentage of sentences for which every word is tagged correctly. For the multi-tagging results we consider a word to be tagged correctly if the correct tag appears in the set of tags assigned to the word.

4.1 Results

Table 1 shows the results for multi-POS tagging for different levels of ambiguity.

TAGS/WORD   β        WORD ACC   SENT ACC
1.00        1        96.7       51.8
1.01        0.8125   97.1       55.4
1.05        0.2969   98.3       70.7
1.10        0.1172   99.0       80.9
1.20        0.0293   99.5       89.3
1.30        0.0111   99.6       91.7
1.40        0.0053   99.7       93.2
4.23        0        99.8       94.8

Table 1: POS tagging accuracy on Section 00 for different levels of ambiguity.

The row corresponding to 1.01 tags per word shows that adding even a tiny amount of ambiguity (1 extra tag in every 100 words) gives a reasonable improvement, whilst adding 1 tag in 20 words, or approximately one extra tag per sentence on the WSJ, gives a significant boost of 1.6% word accuracy and almost 20% sentence accuracy. The bottom row of Table 1 gives an upper bound on accuracy when maximum ambiguity is allowed. This involves setting the β value to 0, so all feasible tags are assigned. Note that the performance gain is only 1.6% in sentence accuracy, compared with the previous row, at the cost of a large increase in ambiguity.

Our first set of CCG supertagging experiments compared the performance of several approaches. In Table 2 we give the accuracies when using gold-standard POS tags, and also POS tags automatically assigned by the POS tagger described above. Since POS tags are important features for the supertagger maximum entropy model, erroneous tags have a significant impact on supertagging accuracy.

            GOLD POS        AUTO POS
METHOD      WORD    SENT    WORD    SENT
single      92.6    36.8    91.5    32.7
noseq       96.2    51.9    95.2    46.1
best hist   97.2    63.8    96.3    57.2
fwdbwd      97.9    72.1    96.9    64.8

Table 2: Supertagging accuracy on Section 00 using different approaches, with multi-tagger ambiguity fixed at 1.4 categories per word.

The single method is the single-tagger supertagger, which at 91.5% per-word accuracy is too inaccurate for use with the CCG parser. The remaining rows in the table give multi-tagger results for a category ambiguity of 1.4 categories per word. The noseq method, which performs significantly better than single, does not take into account the previously assigned categories. The best hist method gains roughly another 1% in accuracy over noseq by taking the greedy approach of using only the two most probable previously assigned categories. Finally, the full forward-backward approach described in Section 2.1 gains roughly another 0.6% by considering all possible category histories. We see the largest jump in accuracy simply from returning multiple categories; the other, more modest, gains come from progressively better models of the category sequence.

The final set of supertagging experiments, in Table 3, demonstrates the trade-off between ambiguity and accuracy.

                      GOLD POS        AUTO POS
TAGS/WORD   β         WORD    SENT    WORD    SENT
1.0         1         92.6    36.8    91.5    32.7
1.2         0.1201    96.8    63.4    95.8    56.5
1.4         0.0337    97.9    72.1    96.9    64.8
1.6         0.0142    98.3    76.4    97.5    69.3
1.8         0.0074    98.4    78.3    97.7    71.0
2.0         0.0048    98.5    79.4    97.9    72.5
2.5         0.0019    98.7    80.6    98.1    74.3
3.0         0.0009    98.7    81.4    98.3    75.6
12.5        0         98.9    82.3    98.8    80.1

Table 3: Supertagging accuracy on Section 00 for different levels of ambiguity.

Note that the ambiguity levels need to be much higher than for POS tagging to produce similar performance, and that the upper bound case (β = 0) has a very high average ambiguity. This is to be expected given the much larger CCG tag set.

5 Tag uncertainty throughout the pipeline

Tables 2 and 3 show that supertagger accuracy when using gold-standard POS tags is typically 1% higher than when using automatically assigned POS tags. Clearly, correct POS tags are important features for the supertagger. Errors made by the supertagger can multiply out when incorrect lexical categories are passed to the parser, so a 1% increase in lexical category error can become much more significant in the parser evaluation. For example, when using the dependency-based evaluation in Clark and Curran (2004b), getting the lexical category wrong for a ditransitive verb automatically leads to three dependencies in the output being incorrect.

We have shown that multi-tagging can significantly increase the accuracy of the POS tagger with only a small increase in ambiguity. What we would like to do is maintain some degree of POS tag ambiguity and pass multiple POS tags through to the supertagging stage (and eventually the parser). There are several ways to encode multiple POS tags as features. The simplest approach is to treat all of the POS tags as binary features, but this does not take into account the uncertainty in each of the alternative tags. What we need is a way of incorporating probability information into the Maximum Entropy supertagger.
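As a concrete illustration of why binary features are inadequate here, consider a word for which the POS tagger is genuinely uncertain; the word, tags, and probabilities below are invented for exposition:

```python
# Hypothetical multi-tagger output for a sentence-initial word such
# as "Rose", which may be a proper noun or a past-tense verb.
pos_dist = {"NNP": 0.55, "VBD": 0.35, "NN": 0.10}

# Binary encoding: every retained tag becomes an indicator feature,
# so NNP (p = 0.55) and NN (p = 0.10) contribute identically to the
# supertagger's score; the tagger's confidence is discarded.
binary_features = {f"pos={tag}": 1.0 for tag in pos_dist}
print(binary_features)
# {'pos=NNP': 1.0, 'pos=VBD': 1.0, 'pos=NN': 1.0}
```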
6 Real-values in ME models

Maximum Entropy (ME) models, in the NLP literature, are typically defined with binary features, although they do allow real-valued features. The only constraint comes from the optimisation algorithm; for example, GIS only allows non-negative values. Real-valued features are commonly used with other machine learning algorithms.

Binary features suffer from certain limitations of the representation, which make them unsuitable for modelling some properties. For example, POS taggers have difficulty determining whether capitalised, sentence-initial words are proper nouns. A useful way to model this property is to determine the ratio of capitalised and non-capitalised instances of a particular word in a large corpus, and use a real-valued feature which encodes this ratio (Vadas and Curran, 2005). The only way to include this feature in a binary representation is to discretize (or bin) the feature values. For this type of feature, choosing appropriate bins is difficult, and it may be hard to find a discretization scheme that performs optimally.

Another problem with discretizing feature values is that it imposes artificial boundaries to define the bins. For the example above, we may choose the bins 0 ≤ x < 1 and 1 ≤ x < 2, which separate the values 0.99 and 1.01 even though they are close in value. At the same time, the model does not distinguish between 0.01 and 0.99, even though they are much further apart. Further, if we have not seen cases for the bin 2 ≤ x < 3, then the discretized model has no evidence to determine the contribution of this feature. But for the real-valued model, evidence supporting 1 ≤ x < 2 and 3 ≤ x < 4 provides evidence for the missing bin. Thus the real-valued model generalises more effectively.

One issue that is not addressed here is the interaction between the Gaussian smoothing parameter and real-valued features. Using the same smoothing parameter for real-valued features with vastly different distributions is unlikely to be optimal. However, for these experiments we have used the same value for the smoothing parameter on all real-valued features; this is the same value we have used for the binary features.

7 Multi-POS Supertagging Experiments

We have experimented with four different approaches to passing multiple POS tags as features through to the supertagger. For the later experiments, this required the existing binary-valued framework to be extended to support real values. The level of POS tag ambiguity was varied between 1.05 and 1.3 POS tags per word on average. These results are shown in Table 4.

METHOD    POS AMB   WORD   SENT
orig      1.00      96.9   64.8
bin       1.05      97.3   67.7
          1.10      97.3   66.3
          1.20      97.0   63.5
          1.30      96.8   62.1
split     1.05      97.4   68.5
          1.10      97.4   67.9
          1.20      97.3   67.0
          1.30      97.2   65.1
prob      1.05      97.5   68.7
          1.10      97.5   69.1
          1.20      97.5   68.7
          1.30      97.5   68.7
invrank   1.05      97.3   68.0
          1.10      97.4   68.0
          1.20      97.3   67.1
          1.30      97.3   67.1
gold      -         97.9   72.1

Table 4: Multi-POS supertagging on Section 00 with different levels of POS ambiguity and using different approaches to POS feature encoding.

The first approach is to treat the multiple POS tags as binary features (bin). This simply involves adding the multiple POS tags for each word in both the training and test data. Every assigned POS tag is treated as a separate feature and considered equally important, regardless of its uncertainty. Here we see a minor increase in performance over the original supertagger at the lower levels of POS ambiguity. However, as the POS ambiguity is increased, the performance of the binary-valued features decreases, eventually falling below that of the original supertagger. This is because at the lowest levels of ambiguity the extra POS tags can be treated as being of similar reliability, whereas at higher levels of ambiguity many unreliable POS tags are added which should not be trusted equally.

The second approach (split) uses real-valued features to model some degree of uncertainty in the POS tags, dividing the POS tag probability mass evenly among the alternatives. This has the effect of giving smaller feature values to tags where many alternative tags have been assigned. This produces similar results to the binary-valued features, again performing best at low levels of ambiguity.

The third approach (invrank) is to use the inverse rank of each POS tag as a real-valued feature. The inverse rank is the reciprocal of the tag's rank ordered by decreasing probability. This method assumes the POS tagger correctly orders the alternative tags, but does not rely on the probability assigned to each tag. Overall, invrank performs worse than split.

The final and best approach is to use the probabilities assigned to each alternative tag as real-valued features:

    f_i(x, y) = \begin{cases} p(\mathrm{POS}(x) = \mathrm{NN}) & \text{if } y = NP \\ 0 & \text{otherwise} \end{cases}    (5)

This model gives the best performance at 1.1 POS tags per-word average ambiguity. Note that, even when using the probabilities as features, only a small amount of additional POS ambiguity is required to significantly improve performance.
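The four encodings can be summarised schematically as follows. This sketch is our reconstruction from the descriptions above, with a hypothetical tag distribution; it is not the code used in the experiments:

```python
def pos_feature_values(pos_dist):
    """Feature values for one word's multiple POS tags under the four
    encodings compared in Table 4 (a schematic reconstruction).

    pos_dist: dict mapping POS tags to tagger probabilities.
    Returns a dict mapping scheme name to {tag: feature value}.
    """
    # order tags by decreasing probability for the rank-based scheme
    tags = sorted(pos_dist, key=pos_dist.get, reverse=True)
    return {
        "bin": {t: 1.0 for t in tags},                # equal indicators
        "split": {t: 1.0 / len(tags) for t in tags},  # mass split evenly
        "invrank": {t: 1.0 / (r + 1) for r, t in enumerate(tags)},
        "prob": dict(pos_dist),                       # tagger probability
    }

# Hypothetical distribution: three POS tags retained for one word.
print(pos_feature_values({"NNP": 0.55, "VBD": 0.35, "NN": 0.10}))
```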
Table 5 shows our best performance figures for the multi-POS supertagger, against the previously described methods using both gold-standard and automatically assigned POS tags.

            AUTO POS        MULTI POS       GOLD POS
TAGS/WORD   WORD    SENT    WORD    SENT    WORD    SENT
1.0         91.5    32.7    91.9    34.3    92.6    36.8
1.2         95.8    56.5    96.3    59.2    96.8    63.4
1.4         96.9    64.8    97.5    67.0    97.9    72.1
1.6         97.5    69.3    97.9    73.3    98.3    76.4
1.8         97.7    71.0    98.2    76.1    98.4    78.3
2.0         97.9    72.5    98.4    77.4    98.5    79.4
2.5         98.1    74.3    98.5    78.7    98.7    80.6
3.0         98.3    75.6    98.6    79.7    98.7    81.4

Table 5: Best multi-POS supertagging accuracy on Section 00, using POS ambiguity of 1.1 and the probability real-valued features.

Table 6 uses the Section 23 test data to demonstrate the improvement in supertagging when moving from single-tagging (single) to simple multi-tagging (noseq); from simple multi-tagging to the full forward-backward algorithm (fwdbwd); and finally when using the probabilities of multiply-assigned POS tags as features (the MULTI POS column). All of these multi-tagging experiments use an ambiguity level of 1.4 categories per word, and the last result uses a POS tag ambiguity of 1.1 tags per word.

METHOD    AUTO POS   MULTI POS   GOLD POS
single    92.0       -           93.3
noseq     95.4       -           96.6
fwdbwd    97.1       97.7        98.2

Table 6: Final supertagging results on Section 23.

8 Conclusion

The NLP community may consider POS tagging to be a solved problem. In this paper, we have suggested two reasons why this is not the case. First, tagging for lexicalized-grammar formalisms, such as CCG and TAG, is far from solved. Second, even modest improvements in POS tagging accuracy can have a large impact on the performance of downstream components in a language processing pipeline.

We have developed a novel approach to maintaining tag ambiguity in language processing pipelines which avoids premature ambiguity resolution. The tag ambiguity is maintained by using the forward-backward algorithm to calculate individual tag probabilities. These probabilities can then be used to select multiple tags, and can also be encoded as real-valued features in subsequent statistical models. With this new approach we have increased POS tagging accuracy significantly with only a tiny ambiguity penalty, and have also significantly improved on previous CCG supertagging results. Finally, using POS tag probabilities as real-valued features in the supertagging model, we demonstrated performance close to that obtained with gold-standard POS tags. This will significantly improve the robustness of the parser on unseen text.

In future work we will investigate maintaining tag ambiguity further down the language processing pipeline and exploiting the uncertainty from previous stages. In particular, we will incorporate real-valued POS tag and lexical category features in the statistical parsing model. Another possibility is to investigate whether similar techniques can improve other tagging tasks, such as Named Entity Recognition. This work can be seen as part of the larger goal of maintaining ambiguity and exploiting uncertainty throughout language processing systems (Roth and Yih, 2004), which is important for coping with the compounding of errors, a significant problem in language processing pipelines.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful feedback. This work has been supported by the Australian Research Council under Discovery Project DP0453131.

References

Srinivas Bangalore and Aravind Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237-265.

Srinivas Bangalore. 2000. A lightweight dependency analyser for partial parsing. Natural Language Engineering, 6(2):113-138.

Ted Briscoe and John Carroll. 2002. Robust accurate statistical annotation of general text. In Proceedings of the 3rd LREC Conference, pages 1499-1504, Las Palmas, Gran Canaria.

Eugene Charniak, Glenn Carroll, John Adcock, Anthony Cassandra, Yoshihiko Gotoh, Jeremy Katz, Michael Littman, and John McCann. 1996. Taggers for parsers. Artificial Intelligence, 85:45-57.

Stanley Chen and Ronald Rosenfeld. 1999. A Gaussian prior for smoothing maximum entropy models. Technical report, Carnegie Mellon University, Pittsburgh, PA.

Stephen Clark and James R. Curran. 2004a. The importance of supertagging for wide-coverage CCG parsing. In Proceedings of COLING-04, pages 282-288, Geneva, Switzerland.

Stephen Clark and James R. Curran. 2004b. Parsing the WSJ using CCG and log-linear models. In Proceedings of the 42nd Meeting of the ACL, pages 104-111, Barcelona, Spain.

Stephen Clark. 2002. A supertagger for Combinatory Categorial Grammar. In Proceedings of the TAG+ Workshop, pages 19-24, Venice, Italy.

Michael Collins. 2002. Discriminative training methods for Hidden Markov Models: Theory and experiments with perceptron algorithms. In Proceedings of the EMNLP Conference, pages 1-8, Philadelphia, PA.

James R. Curran and Stephen Clark. 2003. Investigating GIS and smoothing for maximum entropy taggers. In Proceedings of the 10th Meeting of the EACL, pages 91-98, Budapest, Hungary.

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393.

Julia Hockenmaier and Mark Steedman. 2002. Generative models for statistical parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Meeting of the ACL, pages 335-342, Philadelphia, PA.

Julia Hockenmaier. 2003. Data and Models for Statistical Parsing with Combinatory Categorial Grammar. Ph.D. thesis, University of Edinburgh.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282-289, Williams College, MA.

Robert Malouf. 2002. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Workshop on Natural Language Learning, pages 49-55, Taipei, Taiwan.

Christopher Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

Jorge Nocedal and Stephen J. Wright. 1999. Numerical Optimization. Springer, New York, USA.

Robbert Prins and Gertjan van Noord. 2003. Reinforcing parser preferences through tagging. Traitement Automatique des Langues, 44(3):121-139.

Adwait Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger. In Proceedings of the EMNLP Conference, pages 133-142, Philadelphia, PA.

Dan Roth and Wen-tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Hwee Tou Ng and Ellen Riloff, editors, Proceedings of the Annual Conference on Computational Natural Language Learning (CoNLL), pages 1-8. Association for Computational Linguistics.

Mark Steedman. 2000. The Syntactic Process. The MIT Press, Cambridge, MA.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the HLT/NAACL Conference, pages 252-259, Edmonton, Canada.

David Vadas and James R. Curran. 2005. Tagging unknown words with raw text features. In Proceedings of the Australasian Language Technology Workshop 2005, pages 32-39, Sydney, Australia.

Rebecca Watson. 2006. Part-of-speech tagging models for parsing. In Proceedings of the Computational Linguistics in the UK Conference (CLUK-06), Open University, Milton Keynes, UK.