Reranking and Self-Training for Parser Adaptation

David McClosky, Eugene Charniak, and Mark Johnson
Brown Laboratory for Linguistic Information Processing (BLLIP)
Brown University, Providence, RI 02912
{dmcc|ec|mj}@cs.brown.edu

Abstract

Statistical parsers trained and tested on the Penn Wall Street Journal (WSJ) treebank have shown vast improvements over the last 10 years. Much of this improvement, however, is based upon an ever-increasing number of features to be trained on (typically) the WSJ treebank data. This has led to concern that such parsers may be too finely tuned to this corpus at the expense of portability to other genres. Such worries have merit. The standard "Charniak parser" checks in at a labeled precision-recall f-measure of 89.7% on the Penn WSJ test set, but only 82.9% on the test set from the Brown treebank corpus. This paper should allay these fears. In particular, we show that the reranking parser described in Charniak and Johnson (2005) improves performance of the parser on Brown to 85.2%. Furthermore, use of the self-training techniques described in McClosky et al. (2006) raises this to 87.8% (an error reduction of 28%), again without any use of labeled Brown data. This is remarkable since training the parser and reranker on labeled Brown data achieves only 88.4%.

1 Introduction

Modern statistical parsers require treebanks to train their parameters, but their performance declines when one parses genres more distant from the training data's domain. Furthermore, the treebanks required to train said parsers are expensive and difficult to produce.

Naturally, one of the goals of statistical parsing is to produce a broad-coverage parser which is relatively insensitive to textual domain. But the lack of corpora has led to a situation where much of the current work on parsing is performed on a single domain using training data from that domain -- the Wall Street Journal (WSJ) section of the Penn Treebank (Marcus et al., 1993). Given the aforementioned costs, it is unlikely that many significant treebanks will be created for new genres. Thus, parser adaptation attempts to leverage existing labeled data from one domain and create a parser capable of parsing a different domain.

Unfortunately, the state of the art in parser portability (i.e. using a parser trained on one domain to parse a different domain) is not good. The "Charniak parser" has a labeled precision-recall f-measure of 89.7% on WSJ but a lowly 82.9% on the test set from the Brown corpus treebank. Furthermore, the treebanked Brown data is mostly general non-fiction and much closer to WSJ than, e.g., medical corpora would be. Thus, most work on parser adaptation resorts to using some labeled in-domain data to fortify the larger quantity of out-of-domain data.

In this paper, we present some encouraging results on parser adaptation without any in-domain data. (Though we also present results with in-domain data as a reference point.) In particular we note the effects of two comparatively recent techniques for parser improvement. The first of these, parse-reranking (Collins, 2000; Charniak and Johnson, 2005), starts with a "standard" generative parser, but uses it to generate the n-best parses rather than a single parse. Then a reranking phase uses more detailed features, features which would (mostly) be impossible to incorporate in the initial phase, to reorder the list and pick a possibly different best parse.
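To make the two-stage architecture concrete, the following is a minimal sketch of n-best reranking; `nbest_parse` and `score_with_reranker` are hypothetical stand-ins, not the actual Charniak and Johnson (2005) interfaces.

```python
# Sketch of two-stage parse reranking: a generative first-stage parser proposes
# n candidate parses, a discriminative second stage rescores them with richer
# features and picks the highest-scoring candidate, which may differ from the
# first stage's 1-best parse.

from typing import Callable, List, Tuple

Parse = str  # a bracketed tree, e.g. "(S (NP ...) (VP ...))"


def rerank(sentence: str,
           nbest_parse: Callable[[str, int], List[Tuple[Parse, float]]],
           score_with_reranker: Callable[[str, Parse], float],
           n: int = 50) -> Parse:
    """Return the best parse after reranking an n-best list."""
    candidates = nbest_parse(sentence, n)  # [(parse, first_stage_score), ...]
    rescored = [(score_with_reranker(sentence, parse), parse)
                for parse, _ in candidates]
    return max(rescored)[1]
```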
At first blush one might think that gathering even more fine-grained features from a WSJ treebank would not help adaptation. However, we find that reranking improves the parser's performance from 82.9% to 85.2%.

The second technique is self-training -- parsing unlabeled data and adding it to the training corpus. Recent work (McClosky et al., 2006) has shown that adding many millions of words of machine-parsed and reranked LA Times articles does, in fact, improve performance of the parser on the closely related WSJ data. Here we show that it also helps the farther-afield Brown data. Adding it improves performance yet again, this time from 85.2% to 87.8%, for a net error reduction of 28%. It is interesting to compare this to our results for a completely Brown-trained system (i.e. one in which the first-phase parser is trained on just Brown training data, and the second-phase reranker is trained on Brown 50-best lists). This system performs at an 88.4% level -- only slightly higher than that achieved by our system with only WSJ data.

2 Related Work

Work in parser adaptation is premised on the assumption that one wants a single parser that can handle a wide variety of domains. While this is the goal of the majority of parsing researchers, it is not quite universal. Sekine (1997) observes that for parsing a specific domain, data from that domain is most beneficial, followed by data from the same class, data from a different class, and data from a different domain. He also notes that different domains have very different structures by looking at frequent grammar productions. For these reasons he takes the position that we should, instead, simply create treebanks for a large number of domains. While this is a coherent position, it is far from the majority view.

There are many different approaches to parser adaptation. Steedman et al. (2003) apply co-training to parser adaptation and find that co-training can work across domains. The need to parse biomedical literature inspires (Clegg and Shepherd, 2005; Lease and Charniak, 2005). Clegg and Shepherd (2005) provide an extensive side-by-side performance analysis of several modern statistical parsers when faced with such data. They find that techniques which combine different parsers, such as voting schemes and parse selection, can improve performance on biomedical data. Lease and Charniak (2005) use the Charniak parser for biomedical data and find that the use of out-of-domain trees and in-domain vocabulary information can considerably improve performance.

However, the work which is most directly comparable to ours is that of (Ratnaparkhi, 1999; Hwa, 1999; Gildea, 2001; Bacchiani et al., 2006). All of these papers look at what happens to modern WSJ-trained statistical parsers (Ratnaparkhi's, Collins', Gildea's and Roark's, respectively) as training data varies in size or usefulness (because we are testing on something other than WSJ). We concentrate particularly on the work of (Gildea, 2001; Bacchiani et al., 2006) as they provide results which are directly comparable to those presented in this paper.

Training      Testing   Gildea   Bacchiani
WSJ           WSJ       86.4     87.0
WSJ           Brown     80.6     81.1
WSJ+Brown     Brown     84.0     84.7
Brown         Brown     84.3     85.6

Table 1: Gildea and Bacchiani f-measure results on WSJ and Brown test corpora using different WSJ and Brown training sets. Gildea evaluates on sentences of length ≤ 40, Bacchiani on all sentences.

Looking at Table 1, the first line shows us the standard training and testing on WSJ -- both parsers perform in the 86-87% range. The next line shows what happens when parsing Brown using a WSJ-trained parser. As with the Charniak parser, both parsers take an approximately 6% hit.

It is at this point that our work deviates from these two papers. Lacking alternatives, both (Gildea, 2001) and (Bacchiani et al., 2006) give up on adapting a pure WSJ-trained system, instead looking at the issue of how much of an improvement one gets over a pure Brown system by adding WSJ data (as seen in the last two lines of Table 1). Both systems use a "model-merging" (Bacchiani et al., 2006) approach. The different corpora are, in effect, concatenated together. However, (Bacchiani et al., 2006) achieve a larger gain by weighting the in-domain (Brown) data more heavily than the out-of-domain WSJ data.
One can imagine, for instance, five copies of the Brown data concatenated with just one copy of WSJ data.

3 Corpora

We primarily use three corpora in this paper. Self-training requires labeled and unlabeled data. We assume that these sets of data must be in similar domains (e.g. news articles) though the effectiveness of self-training across domains is currently an open question. Thus, we have labeled (WSJ) and unlabeled (NANC) out-of-domain data and labeled in-domain data (BROWN). Unfortunately, lacking a corresponding corpus to NANC for BROWN, we cannot perform the opposite scenario and adapt BROWN to WSJ.

3.1 Brown

The BROWN corpus (Francis and Kucera, 1979) consists of many different genres of text, intended to approximate a "balanced" corpus. While the full corpus consists of fiction and non-fiction domains, the sections that have been annotated in Treebank II bracketing are primarily those containing fiction. Examples of the sections annotated include science fiction, humor, romance, mystery, adventure, and "popular lore." We use the same divisions as Bacchiani et al. (2006), who base their divisions on Gildea (2001). Each division of the corpus consists of sentences from all available genres. The training division consists of approximately 80% of the data, while held-out development and testing divisions each make up 10% of the data. The treebanked sections contain approximately 25,000 sentences (458,000 words).

3.2 Wall Street Journal

Our out-of-domain data is the Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993), which consists of about 40,000 sentences (one million words) annotated with syntactic information. We use the standard divisions: Sections 2 through 21 are used for training, section 24 for held-out development, and section 23 for final testing.

3.3 North American News Corpus

In addition to labeled news data, we make use of a large quantity of unlabeled news data. The unlabeled data is the North American News Corpus, NANC (Graff, 1995), which is approximately 24 million unlabeled sentences from various news sources. NANC contains no syntactic information and sentence boundaries are induced by a simple discriminative model. We also perform some basic cleanups on NANC to ease parsing. NANC contains news articles from various news sources including the Wall Street Journal, though for this paper, we only use articles from the LA Times portion.

To use the data from NANC, we use self-training (McClosky et al., 2006). First, we take a WSJ-trained reranking parser (i.e. both the parser and reranker are built from WSJ training data) and parse the sentences from NANC with the 50-best (Charniak and Johnson, 2005) parser. Next, the 50-best parses are reordered by the reranker. Finally, the 1-best parses after reranking are combined with the WSJ training set to retrain the first-stage parser.[1] McClosky et al. (2006) find that the self-trained models help considerably when parsing WSJ.

[1] We trained a new reranker from this data as well, but it does not seem to get significantly different performance.
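The self-training procedure just described can be summarized in the sketch below; `train_parser`, `nbest_parse`, and `rerank` are hypothetical stand-ins for the first-stage parser trainer and the Charniak-Johnson reranking steps, not their actual interfaces.

```python
# Sketch of the self-training loop: label NANC with the WSJ reranking parser,
# then retrain the first-stage parser on WSJ trees plus the 1-best reranked
# NANC parses.

def self_train(wsj_trees, nanc_sentences, train_parser, nbest_parse, rerank, n=50):
    """Retrain the first-stage parser on WSJ trees plus reranked NANC parses."""
    first_stage = train_parser(wsj_trees)

    # 50-best lists from the first stage, reordered by the reranker,
    # keeping only the 1-best parse after reranking.
    nanc_trees = []
    for sentence in nanc_sentences:
        candidates = nbest_parse(first_stage, sentence, n)
        nanc_trees.append(rerank(sentence, candidates))

    # Combine the machine-parsed data with the original treebank and retrain
    # the first-stage parser (the reranker itself is left unchanged; cf.
    # footnote [1] above).
    return train_parser(wsj_trees + nanc_trees)
```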
4 Experiments

We use the Charniak and Johnson (2005) reranking parser in our experiments. Unless mentioned otherwise, we use the WSJ-trained reranker (as opposed to a BROWN-trained reranker). To evaluate, we report bracketing f-scores.[2] Parser f-scores reported are for sentences up to 100 words long, while reranking parser f-scores are over all sentences. For simplicity and ease of comparison, most of our evaluations are performed on the development section of BROWN.

[2] The harmonic mean of labeled precision (P) and labeled recall (R), i.e. f = (2 × P × R) / (P + R).

4.1 Adapting self-training

Our first experiment examines the performance of the self-trained parsers. While the parsers are created entirely from labeled WSJ data and unlabeled NANC data, they perform extremely well on BROWN development (Table 2). The trends are the same as in (McClosky et al., 2006): adding NANC data improves parsing performance on BROWN development considerably, improving the f-score from 83.9% to 86.4%. As more NANC data is added, the f-score appears to approach an asymptote. The NANC data appears to help reduce data sparsity and fill in some of the gaps in the WSJ model. Additionally, the reranker provides further benefit and adds an absolute 1-2% to the f-score. The improvements appear to be orthogonal, as our best performance is reached when we use the reranker and add 2,500k self-trained sentences from NANC.

Sentences added    Parser   Reranking Parser
Baseline BROWN     86.4     87.4
Baseline WSJ       83.9     85.8
WSJ+50k            84.8     86.6
WSJ+250k           85.7     87.2
WSJ+500k           86.0     87.3
WSJ+750k           86.1     87.5
WSJ+1,000k         86.2     87.3
WSJ+1,500k         86.2     87.6
WSJ+2,000k         86.1     87.7
WSJ+2,500k         86.4     87.7

Table 2: Effects of adding NANC sentences to WSJ training data on parsing performance. f-scores for the parser with and without the WSJ reranker are shown when evaluating on BROWN development. For this experiment, we use the WSJ-trained reranker.

The results are even more surprising when we compare against a parser[3] trained on the labeled training section of the BROWN corpus, with parameters tuned against its held-out section. Despite seeing no in-domain data, the WSJ-based parser is able to match the BROWN-based parser.

For the remainder of this paper, we will refer to the model trained on WSJ+2,500k sentences of NANC as our "best WSJ+NANC" model. We also note that this "best" parser is different from the "best" parser for parsing WSJ, which was trained on WSJ with a relative weight[4] of 5 and 1,750k sentences from NANC. For parsing BROWN, the difference between these two parsers is not large, though.

[3] In this case, only the parser is trained on BROWN. In section 4.3, we compare against a fully BROWN-trained reranking parser as well.

[4] A relative weight of n is equivalent to using n copies of the corpus, i.e. an event that occurred x times in the corpus would occur x × n times in the weighted corpus. Thus, larger corpora will tend to dominate smaller corpora of the same relative weight in terms of event counts.
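As a rough illustration of the relative-weighting scheme defined in footnote [4] (weighting a corpus by n is equivalent to concatenating n copies of it), the sketch below combines event counts from several corpora; the event-extraction function is a hypothetical stand-in for the parser's actual event model.

```python
# An event seen x times in a corpus with relative weight n contributes x * n
# to the combined counts, so larger or more heavily weighted corpora dominate.

from collections import Counter


def weighted_counts(corpora_with_weights, extract_events):
    """Combine event counts from several corpora, each with a relative weight."""
    counts = Counter()
    for corpus, weight in corpora_with_weights:
        for tree in corpus:
            for event in extract_events(tree):
                counts[event] += weight
    return counts


# e.g. weight WSJ five times as heavily as the self-trained NANC parses:
# counts = weighted_counts([(wsj_trees, 5), (nanc_trees, 1)], extract_events)
```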
Increasing the relative weight of WSJ sentences versus NANC sentences when testing on BROWN development does not appear to have a significant effect. While (McClosky et al., 2006) showed that this technique was effective when testing on WSJ, the true distribution was closer to WSJ, so it made sense to emphasize it there.

4.2 Incorporating In-Domain Data

Up to this point, we have only considered the situation where we have no in-domain data. We now explore different ways of making use of labeled and unlabeled in-domain data.

Bacchiani et al. (2006) apply self-training to parser adaptation to utilize unlabeled in-domain data. The authors find that it helps quite a bit when adapting from BROWN to WSJ. They use a parser trained from the BROWN train set to parse WSJ and add the parsed WSJ sentences to their training set. We perform a similar experiment, using our WSJ-trained reranking parser to parse BROWN train and testing on BROWN development. We achieved a boost from 84.8% to 85.6% when we added the parsed BROWN sentences to our training. Adding in 1,000k sentences from NANC as well, we saw a further increase to 86.3%. However, the technique does not seem as effective in our case. While the self-trained BROWN data helps the parser, it adversely affects the performance of the reranking parser. When self-trained BROWN data is added to WSJ training, the reranking parser's performance drops from 86.6% to 86.1%. We see a similar degradation as NANC data is added to the training set as well. We are not yet able to explain this unusual behavior.

We now turn to the scenario where we have some labeled in-domain data. The most obvious way to incorporate labeled in-domain data is to combine it with the labeled out-of-domain data. We have already seen the results (Gildea, 2001) and (Bacchiani et al., 2006) achieve in Table 1. We explore various combinations of BROWN, WSJ, and NANC corpora. Because we are mainly interested in exploring techniques with self-trained models rather than optimizing performance, we only consider weighting each corpus with a relative weight of one for this paper. The models generated are tuned on section 24 from WSJ. The results are summarized in Table 3.

Parser model            Parser alone   Reranking parser
WSJ alone               83.9           85.8
WSJ+2,500k NANC         86.4           87.7
BROWN alone             86.3           87.4
BROWN+50k NANC          86.8           88.0
BROWN+250k NANC         86.8           88.1
BROWN+500k NANC         86.7           87.8
WSJ+BROWN               86.5           88.1
WSJ+BROWN+50k NANC      86.8           88.1
WSJ+BROWN+250k NANC     86.8           88.1
WSJ+BROWN+500k NANC     86.6           87.7

Table 3: f-scores from various combinations of WSJ, NANC, and BROWN corpora on BROWN development. The reranking parser used the WSJ-trained reranker model. The BROWN parsing model is naturally better than the WSJ model for this task, but combining the two training corpora results in a better model (as in Gildea (2001)). Adding small amounts of NANC further improves the models.

While both WSJ and BROWN models benefit from a small amount of NANC data, adding more than 250k NANC sentences to the BROWN or combined models causes their performance to drop. This is not surprising, though, since adding "too much" NANC overwhelms the more accurate BROWN or WSJ counts. By weighting the counts from each corpus appropriately, this problem can be avoided.

Another way to incorporate labeled data is to tune the parser back-off parameters on it. Bacchiani et al. (2006) report that tuning on held-out BROWN data gives a large improvement over tuning on WSJ data. The improvement is mostly (but not entirely) in precision. We do not see the same improvement (Figure 1), but this is likely due to differences in the parsers. However, we do see a similar improvement for parsing accuracy once NANC data has been added. The reranking parser generally sees an improvement, but it does not appear to be significant.

[Figure 1 (plot omitted): Precision and recall f-scores when testing on BROWN development as a function of the number of NANC sentences added under four test conditions. "BROWN tuned" indicates that BROWN training data was used to tune the parameters (since the normal held-out section was being used for testing). For "WSJ tuned," we tuned the parameters from section 24 of WSJ. Tuning on BROWN helps the parser, but not the reranking parser.]
4.3 Reranker Portability

We have shown that the WSJ-trained reranker is actually quite portable to the BROWN fiction domain. This is surprising given the large number of features (over a million in the case of the WSJ reranker) tuned to adjust for errors made in the 50-best lists by the first-stage parser. It would seem the corrections memorized by the reranker are not as domain-specific as we might expect.

As further evidence, we present the results of applying the WSJ model to the Switchboard corpus -- a domain much less similar to WSJ than BROWN. In Table 4, we see that while the parser's performance is low, self-training and reranking provide orthogonal benefits. The improvements represent a 12% error reduction with no additional in-domain data. Naturally, in-domain data and speech-specific handling (e.g. disfluency modeling) would probably help dramatically as well.

Parser model   Parser f-score   Reranker f-score
WSJ            74.0             75.9
WSJ+NANC       75.6             77.0

Table 4: Parser and reranking parser performance on the SWITCHBOARD development corpus. In this case, WSJ+NANC is a model created from WSJ and 1,750k sentences from NANC.

Finally, to compare against a model fully trained on BROWN data, we created a BROWN reranker. We parsed the BROWN training set with 20-fold cross-validation, selected features that occurred 5 times or more in the training set, and fed the 50-best lists from the parser to a numerical optimizer to estimate feature weights. The resulting reranker model had approximately 700,000 features, which is about half as many as the WSJ-trained reranker. This may be due to the smaller size of the BROWN training set or because the feature schemas for the reranker were developed on WSJ data. As seen in Table 5, the BROWN reranker is not a significant improvement over the WSJ reranker for parsing BROWN data.

Parser model   Parser alone   WSJ-reranker   BROWN-reranker
WSJ            82.9           85.2           85.2
WSJ+NANC       87.1           87.8           87.9
BROWN          86.7           88.2           88.4

Table 5: Performance of various combinations of parser and reranker models when evaluated on BROWN test. The WSJ+NANC parser with the WSJ reranker comes close to the BROWN-trained reranking parser. The BROWN reranker provides only a small improvement over its WSJ counterpart, which is not statistically significant.

5 Analysis

We perform several types of analysis to measure some of the differences and similarities between the BROWN-trained and WSJ-trained reranking parsers. While the two parsers agree on a large number of parse brackets (Section 5.2), there are categorical differences between them (as seen in Section 5.3).

5.1 Oracle Scores

Table 6 shows the f-scores of an "oracle reranker" -- i.e. one which would always choose the parse with the highest f-score in the n-best list. While the WSJ parser has relatively low f-scores, adding NANC data results in a parser whose oracle scores are comparable to those of the parser trained on BROWN training data. Thus, the WSJ+NANC model has better oracle rates than the WSJ model (McClosky et al., 2006) for both the WSJ and BROWN domains.

Model       1-best   10-best   25-best   50-best
WSJ         82.6     88.9      90.7      91.9
WSJ+NANC    86.4     92.1      93.5      94.3
BROWN       86.3     92.0      93.3      94.2

Table 6: Oracle f-scores of top n parses produced by the baseline WSJ parser, a combined WSJ and NANC parser, and a baseline BROWN parser.
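The oracle reranker used for Table 6 can be sketched as follows; `bracket_f_score` and `corpus_f_score` are stand-ins for evalb-style scoring functions, not the actual evaluation code.

```python
# Oracle reranking: for each sentence, pick the n-best candidate whose
# bracketing f-score against the gold tree is highest, then score the corpus.

def oracle_parse(candidates, gold_tree, bracket_f_score):
    """Return the n-best candidate with the highest f-score against gold."""
    return max(candidates, key=lambda parse: bracket_f_score(parse, gold_tree))


def oracle_corpus_f(nbest_lists, gold_trees, bracket_f_score, corpus_f_score):
    """Corpus-level f-score of the oracle parses over an n-best parsed corpus."""
    picks = [oracle_parse(cands, gold, bracket_f_score)
             for cands, gold in zip(nbest_lists, gold_trees)]
    return corpus_f_score(picks, gold_trees)
```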
5.2 Parser Agreement

In this section, we compare the output of the WSJ+NANC-trained and BROWN-trained reranking parsers. We use evalb to calculate how similar the two sets of output are on a bracket level. Table 7 shows various statistics. The two parsers achieved an 88.0% f-score between them. Additionally, the two parsers agreed on all brackets almost half the time. The part-of-speech tagging agreement is fairly high as well. Considering they were created from different corpora, this seems like a high level of agreement.

Bracketing agreement f-score   88.03%
Complete match                 44.92%
Average crossing brackets      0.94
POS tagging agreement          94.85%

Table 7: Agreement between the WSJ+NANC parser with the WSJ reranker and the BROWN parser with the BROWN reranker. Complete match is how often the two reranking parsers returned the exact same parse.

5.3 Statistical Analysis

We conducted randomization tests for the significance of the difference in corpus f-score, based on the randomization version of the paired sample t-test described by Cohen (1995). The null hypothesis is that the two parsers being compared are in fact behaving identically, so permuting or swapping the parse trees produced by the parsers for the same test sentence should not affect the corpus f-scores. By estimating the proportion of permutations that result in an absolute difference in corpus f-scores at least as great as that observed in the actual output, we obtain a distribution-free estimate of significance that is robust against parser and evaluator failures.
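A minimal sketch of this randomization test is given below, assuming per-sentence (matched, gold, test) bracket counts for each parser; it illustrates the procedure rather than reproducing the evaluation code actually used.

```python
# Paired randomization test on corpus f-score. Under the null hypothesis the
# two parsers are interchangeable, so we randomly swap the two parsers'
# per-sentence outputs and count how often the corpus f-score gap is at least
# as large as the observed one.

import random


def corpus_f(counts):
    """Corpus f-score from per-sentence (matched, gold, test) bracket counts."""
    matched = sum(m for m, _, _ in counts)
    gold = sum(g for _, g, _ in counts)
    test = sum(t for _, _, t in counts)
    return 2.0 * matched / (gold + test)


def randomization_test(counts_a, counts_b, samples=10**6, seed=0):
    """Estimate the significance of the corpus f-score difference."""
    rng = random.Random(seed)
    observed = abs(corpus_f(counts_a) - corpus_f(counts_b))
    hits = 0
    for _ in range(samples):  # the paper reports 10^6 samples
        swapped_a, swapped_b = [], []
        for pair in zip(counts_a, counts_b):
            a, b = pair if rng.random() < 0.5 else (pair[1], pair[0])
            swapped_a.append(a)
            swapped_b.append(b)
        if abs(corpus_f(swapped_a) - corpus_f(swapped_b)) >= observed:
            hits += 1
    return hits / samples
```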
The results of this test are shown in Table 8. The table shows that the BROWN reranker is not significantly different from the WSJ reranker.

                 WSJ/WSJ      WSJ+NANC/WSJ    BROWN/WSJ
WSJ+NANC/WSJ     0.025 (0)
BROWN/WSJ        0.030 (0)    0.004 (0.1)
BROWN/BROWN      0.031 (0)    0.006 (0.025)   0.002 (0.27)

Table 8: The difference in corpus f-score between the various reranking parsers, and the significance of the difference in parentheses as estimated by a randomization test with 10^6 samples. "x/y" indicates that the first-stage parser was trained on data set x and the second-stage reranker was trained on data set y.

In order to better understand the difference between the reranking parser trained on Brown and the WSJ+NANC/WSJ reranking parser (a reranking parser with the first stage trained on WSJ+NANC and the second stage trained on WSJ) on Brown data, we constructed a logistic regression model of the difference between the two parsers' f-scores on the development data using the R statistical package.[5]

[5] http://www.r-project.org

Of the 2,078 sentences in the development data, 29 sentences were discarded because evalb failed to evaluate at least one of the parses.[6] A Wilcoxon signed rank test on the remaining 2,049 paired sentence-level f-scores was significant at p = 0.0003. Of these 2,049 sentences, there were 983 parse pairs with the same sentence-level f-score. Of the 1,066 sentences for which the parsers produced parses with different f-scores, there were 580 sentences for which the BROWN/BROWN parser produced a parse with a higher sentence-level f-score and 486 sentences for which the WSJ+NANC/WSJ parser produced a parse with a higher f-score.

[6] This occurs when an apostrophe is analyzed as a possessive marker in the gold tree and a punctuation symbol in the parse tree, or vice versa.

We constructed a generalized linear model with a binomial link with BROWN/BROWN f-score > WSJ+NANC/WSJ f-score as the predicted variable, and sentence length, the number of prepositions (IN), the number of conjunctions (CC), and Brown subcorpus ID as explanatory variables. Model selection (using the "step" procedure) discarded all but the IN and Brown ID explanatory variables. The final estimated model is shown in Table 9.

Feature       Estimate   z-value   Pr(>|z|)
(Intercept)    0.054      0.3      0.77
IN            -0.134     -4.4      8.4e-06   ***
ID=G           0.584      2.5      0.011     *
ID=K           0.697      2.9      0.003     **
ID=L           0.552      2.3      0.021     *
ID=M           0.376      0.9      0.33
ID=N           0.642      2.7      0.0055    **
ID=P           0.624      2.7      0.0069    **
ID=R           0.040      0.1      0.90

Table 9: The logistic model of BROWN/BROWN f-score > WSJ+NANC/WSJ f-score identified by model selection. The feature IN is the number of prepositions in the sentence, while ID identifies the Brown subcorpus that the sentence comes from. Stars indicate significance level.

It shows that the WSJ+NANC/WSJ parser becomes more likely to have a higher f-score than the BROWN/BROWN parser as the number of prepositions in the sentence increases, and that the BROWN/BROWN parser is more likely to have a higher f-score on Brown sections K, N, P, G and L (these are the general fiction, adventure and western fiction, romance and love story, letters and memories, and mystery sections of the Brown corpus, respectively). The three sections of BROWN not in this list are F, M, and R (popular lore, science fiction, and humor).
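For readers who want to reproduce this kind of analysis, the sketch below is a rough Python analogue (using pandas and statsmodels) of the R model just described; the data-frame construction and column names are hypothetical, and R's step-wise selection is replaced by simply fitting the full and the reduced model.

```python
# Binomial GLM predicting whether the BROWN/BROWN parser beats the
# WSJ+NANC/WSJ parser on a given sentence, from sentence-level features.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf


def fit_brown_vs_wsj_model(rows):
    """Fit full and reduced logistic models of which parser wins per sentence.

    `rows` is an iterable of dicts with keys: brown_wins (0/1), length,
    num_in, num_cc, and subcorpus (e.g. "G", "K", ...).
    """
    df = pd.DataFrame(rows)
    # Full model with all candidate predictors; the paper's model selection
    # ("step" in R) kept only the preposition count and the subcorpus ID.
    full = smf.glm("brown_wins ~ length + num_in + num_cc + C(subcorpus)",
                   data=df, family=sm.families.Binomial()).fit()
    reduced = smf.glm("brown_wins ~ num_in + C(subcorpus)",
                      data=df, family=sm.families.Binomial()).fit()
    return full, reduced
```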
"best" W S J-parser-reranker improves performance on the Switchboard corpus, it starts from a much lower base (74.0%), and achieves a much less significant improvement (3% absolute, 11% error reduction). Bridging these larger gaps is still for the future. One intriguing idea is what we call "self-trained bridging-corpora." We have not yet experimented with medical text but we expect that the "best" W S J+N A N C parser will not perform very well. However, suppose one does self-training on a biology textbook instead of the LA Times. One might hope that such a text will split the difference between more "normal" newspaper articles and the specialized medical text. Thus, a selftrained parser based upon such text might do much better than our standard "best." This is, of course, highly speculative. Michael Collins. 2000. Discriminative reranking for natural language parsing. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML 2000), pages 175­182, Stanford, California. W. Nelson Francis and Henry Kucera. 1979. Manual of Information to accompany a Standard Corpus of Present-day Edited American English, for use with Digital Computers. Brown University, Providence, Rhode Island. Daniel Gildea. 2001. Corpus variation and parser performance. In Empirical Methods in Natural Language Processing (EMNLP), pages 167­202. David Graff. 1995. North American News Text Corpus. Linguistic Data Consortium. LDC95T21. Rebecca Hwa. 1999. Supervised grammar induction using training data with limited constituent information. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 72­80, University of Maryland. Matthew Lease and Eugene Charniak. 2005. Parsing biomedical literature. In Second International Joint Conference on Natural Language Processing (IJCNLP'05). Michell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Comp. Linguistics, 19(2):313­330. David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of HLT-NAACL 2006. Adwait Ratnaparkhi. 1999. Learning to parse natural language with maximum entropy models. Machine Learning, 34(1-3):151­175. Satoshi Sekine. 1997. The domain dependence of parsing. In Proc. Applied Natural Language Processing (ANLP), pages 96­102. Mark Steedman, Miles Osborne, Anoop Sarkar, Stephen Clark, Rebecca Hwa, Julia Hockenmaier, Paul Ruhlen, Steven Baker, and Jeremiah Crim. 2003. Bootstrapping statistical parsers from small datasets. In Proc. of European ACL (EACL), pages 331­338. Acknowledgments This work was supported by NSF grants LIS9720368, and IIS0095940, and DARPA GALE contract HR0011-06-20001. We would like to thank the BLLIP team for their comments. References Michiel Bacchiani, Michael Riley, Brian Roark, and Richard Sproat. 2006. MAP adaptation of stochastic grammars. Computer Speech and Language, 20(1):41­68. Eugene Charniak and Mark Johnson. 2005. Coarseto-fine n-best parsing and MaxEnt discriminative reranking. In Proc. of the 2005 Meeting of the Assoc. for Computational Linguistics (ACL), pages 173­180. Andrew B. Clegg and Adrian Shepherd. 2005. Evaluating and integrating treebank parsers on a biomedical corpus. In Proceedings of the ACL Workshop on Software. Paul R. Cohen. 1995. Empirical Methods for Artificial Intelligence. The MIT Press, Cambridge, Massachusetts. 344