The Effect of Corpus Size in Combining Supervised and Unsupervised Training for Disambiguation

Michaela Atterer
Institute for NLP, University of Stuttgart
atterer@ims.uni-stuttgart.de

Hinrich Schütze
Institute for NLP, University of Stuttgart
hinrich@hotmail.com

Abstract

We investigate the effect of corpus size in combining supervised and unsupervised learning for two types of attachment decisions: relative clause attachment and prepositional phrase attachment. The supervised component is Collins' parser, trained on the Wall Street Journal. The unsupervised component gathers lexical statistics from an unannotated corpus of newswire text. We find that the combined system improves the performance of the parser only for small training sets. Surprisingly, the size of the unannotated corpus has little effect due to the noisiness of the lexical statistics acquired by unsupervised learning.

1 Introduction

The best performing systems for many tasks in natural language processing are based on supervised training on annotated corpora such as the Penn Treebank (Marcus et al., 1993) and the prepositional phrase data set first described by Ratnaparkhi et al. (1994). However, the production of training sets is expensive, and they are not available for many domains and languages. This motivates research on combining supervised with unsupervised learning, since unannotated text is in ample supply for most domains in the major languages of the world.

The question arises how much annotated and unannotated data is necessary in such combined learning strategies. We investigate this question for two attachment ambiguity problems: relative clause (RC) attachment and prepositional phrase (PP) attachment. The supervised component is Collins' parser (Collins, 1997), trained on the Wall Street Journal. The unsupervised component gathers lexical statistics from an unannotated corpus of newswire text. The sizes of both types of corpora, annotated and unannotated, are of interest.

We would expect that large annotated corpora (training sets) tend to make the additional information from unannotated corpora redundant. This expectation is confirmed in our experiments. For example, when using the maximum training set available for PP attachment, performance decreases when "unannotated" lexical statistics are added. For unannotated corpora, we would expect the opposite effect: the larger the unannotated corpus, the better the combined system should perform. While there is a general tendency in this direction, the improvements in our experiments quickly reach a plateau as the unlabeled corpus grows, especially for PP attachment. We attribute this result to the noisiness of the statistics collected from unlabeled corpora.

The paper is organized as follows. Sections 2, 3 and 4 describe data sets, methods and experiments. Section 5 evaluates and discusses experimental results. Section 6 compares our approach to prior work. Section 7 states our conclusions.

2 Data Sets

The unlabeled corpus is the Reuters RCV1 corpus, about 80,000,000 words of newswire text (Lewis et al., 2004). Three different subsets, corresponding to roughly 10%, 50% and 100% of the corpus, were created for experiments related to the size of the unannotated corpus. (Two weeks after Aug 5, 1997, were set apart for future experiments.) The labeled corpus is the Penn Wall Street Journal treebank (Marcus et al., 1993).
We created the 5 subsets shown in Table 1 for experiments related to the size of the annotated corpus.

Table 1: Corpora used for the experiments: unlabeled Reuters (R) corpus for attachment statistics, labeled Penn treebank (WSJ) for training the Collins parser.

    unlabeled R   100%    20/08/1996–05/08/1997 (351 days)
                   50%    20/08/1996–17/02/1997 (182 days)
                   10%    20/08/1996–24/09/1996 (36 days)
    labeled WSJ    50%    sections 00–12 (23412 sentences)
                   25%    lines 1–292960 (11637 sentences)
                    5%    lines 1–58284 (2304 sentences)
                    1%    lines 1–11720 (500 sentences)
                 0.05%    lines 1–611 (23 sentences)

The test set, sections 13–24, is larger than in most studies because a single section does not contain a sufficient number of RC attachment ambiguities for a meaningful evaluation. The numbers of ambiguities in the two subsets are as follows (highA/lowA: high and low attachment of which-clauses; verbA/nounA: verb and noun attachment of PPs):

    which-clauses subset        highA    lowA    total
    develop set (sec 00–12)        71     211      282
    test set (sec 13–24)           71     193      264

    PP subset                   verbA   nounA    total
    develop set (sec 00–12)      5927    6560    12487
    test set (sec 13–24)         5930    6273    12203

Both data sets are available for download (Web Appendix, 2006). We did not use the PP data set described by Ratnaparkhi et al. (1994) because we use more context than the limited context available in that set (see below).

3 Methods

Collins parser. Our baseline method for ambiguity resolution is the Collins parser as implemented by Bikel (Collins, 1997; Bikel, 2004). For each ambiguity, we check whether the attachment ambiguity is resolved correctly by the 5 parsers corresponding to the different training sets. If the attachment ambiguity is not recognized (e.g., because parsing failed), the corresponding ambiguity is excluded for that instance of the parser. As a result, the size of the effective test set varies from parser to parser (see Table 4).

Minipar. The unannotated corpus is analyzed using minipar (Lin, 1998), a partial dependency parser. The corpus is parsed and all extracted dependencies are stored for later use. Dependencies in ambiguous PP attachments (those corresponding to [VP NP PP] and [VP [NP PP]] subtrees) are not indexed. An experiment with indexing both alternatives for ambiguous structures yielded poor results. For example, indexing both alternatives creates a large number of spurious verb attachments of "of", which in turn result in incorrect high attachments by our disambiguation algorithm. For relative clauses, no such filtering is necessary: spurious subject-verb dependencies due to RC ambiguities are rare compared to the large number of subject-verb dependencies that can be extracted reliably.

Inverted index. Dependencies extracted by minipar are stored in an inverted index (Witten et al., 1999), implemented in Lucene (Lucene, 2006). For example, "john subj buy", the analysis returned by minipar for "John buys", is stored in the index so that its corpus frequency can later be retrieved.

Lattice-based disambiguation (LBD). Consider the following examples:

(1) . . . or the transaction is performed by written consent.

(2) . . . a majority . . . have approved the transaction by written consent . . .

We represent an attachment ambiguity as a triple <R, i, X>, where X is (the parse of) a phrase with two or more possible attachment nodes in a sentence S, i is one of these attachment nodes, and R is (the relevant part of a parse of) S with X removed. For example, the two attachments in Example 2 are represented as the triples:

    <approved_i1 the transaction_i2, i1, by consent>
    <approved_i1 the transaction_i2, i2, by consent>
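The following is a minimal Python sketch of this indexing scheme, not the authors' Lucene implementation: each extracted dependency is treated as a tiny document indexed under its component terms, and corpus frequencies are obtained by intersecting postings lists. All names here (DependencyIndex, add, count) are our own illustrative choices.

```python
from collections import defaultdict

class DependencyIndex:
    """Toy inverted index over dependency tuples; a stand-in for the
    Lucene index used in the paper."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of dependencies containing it
        self.deps = []                    # id -> dependency tuple

    def add(self, dep):
        """Index one dependency, e.g. ("john", "subj", "buy")."""
        dep_id = len(self.deps)
        self.deps.append(dep)
        for term in dep:
            self.postings[term].add(dep_id)

    def count(self, terms):
        """Number of indexed dependencies containing all query terms;
        used later to estimate the probabilities of <R, i, X>, R and X."""
        if not terms:
            return len(self.deps)
        return len(set.intersection(*(self.postings[t] for t in terms)))

index = DependencyIndex()
index.add(("john", "subj", "buy"))           # minipar analysis of "John buys"
index.add(("approved", "by", "consent"))     # a verb attachment of "by consent"
index.add(("transaction", "by", "consent"))  # a noun attachment of "by consent"

print(index.count(("by", "consent")))        # -> 2
```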
We decide between attachment possibilities based on pointwise mutual information, the well-known measure of how surprising it is to see R and X together given their individual frequencies:

    MI(<R, i, X>) = log2 [ P(<R, i, X>) / (P(R) P(X)) ]   for P(<R, i, X>), P(R), P(X) ≠ 0
    MI(<R, i, X>) = 0                                     otherwise

where the probabilities of the dependency structures <R, i, X>, R and X are estimated on the unlabeled corpus by querying the inverted index. Unfortunately, these structures will often not occur in the corpus. If this is the case, we back off to generalizations of R and X. The generalizations form a lattice, shown in Figure 1 for PP attachment.

[Figure 1: Lattice of pairs of potential attachment site (NP) and attachment phrase (PP), with nodes MN:pMN, MN:pM, MN:pN, N:pMN, MN:p, N:pM, N:pN, N:p and 0:p. M: premodifying adjective or noun (upper or lower NP), N: head noun (upper or lower NP), p: preposition.]

For example, MN:pMN corresponds to "commercial transaction by unanimous consent", N:pM to "transaction by unanimous", etc. For 0:p we compute the MI of the two events "noun attachment" and "occurrence of p".

Points in the lattice in Figure 1 are created by successive elimination of material from the complete context R:X. A child c directly dominated by a parent p is created by removing exactly one contextual element from p, either on the right side (the attachment phrase) or on the left side (the attachment node). For RC attachment, generalizations other than elimination are also introduced, such as the replacement of a proper noun (e.g., Canada) by its category (country) (see below).

The MI of each point in the lattice is computed. We then take the maximum over all MI values of the lattice as a measure of the affinity of attachment phrase and attachment node. The intuition is that we are looking for the strongest evidence available for the attachment. The strongest evidence is often not provided by the most specific context (MN:pMN in the example), since contextual elements like modifiers will in some cases only add noise to the attachment decision.

The actual syntactic disambiguation is performed by computing the affinity (the maximum over the MI values in the lattice) for each possible attachment and selecting the attachment with the highest affinity. (The default attachment is selected if the two values are equal.) The second lattice for PP attachment, the lattice for attachment to the verb, has a structure identical to Figure 1, but the attachment node is SV instead of MN, where S denotes the subject and V the verb. So the supremum of that lattice is SV:pMN and the infimum is 0:p (which in this case corresponds to the MI of verb attachment and occurrence of the preposition).

LBD is motivated by the desire to use as much context as possible for disambiguation. Previous work on attachment disambiguation has generally used less context than this paper does (e.g., modifiers have not been used for PP attachment). No change to LBD is necessary if the lattice of contexts is extended by adding further contextual elements (e.g., the preposition between the two attachment nodes in RC, which we do not consider in this paper).

For RC attachment, a lattice is constructed for each potential attachment node, NP1 and NP2, in the configuration NP1 Prep NP2 RC. Figure 2 shows the maximum possible lattice. If contextual elements are not present in a context (e.g., a modifier), then the lattice will be smaller. The supremum of the lattice corresponds to a query that includes the entire NP (including modifying adjectives and nouns), the verb and its object. Example: exchange rate
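To make the procedure of this section concrete, here is a minimal Python sketch of LBD for PP attachment under simplifying assumptions: corpus frequencies come from a plain dictionary rather than the inverted index, the lattice is generated by enumerating all eliminations of contextual elements (the preposition is never eliminated, as in Figure 1), and the special 0:p point and the RC-specific generalizations are omitted. The function names and the example counts are our own, not the paper's.

```python
import math
from itertools import combinations

def mi(c_joint, c_r, c_x, total):
    """Pointwise MI of attachment node R and attachment phrase X:
    log2( P(<R,i,X>) / (P(R) P(X)) ), or 0 if any probability is 0."""
    if 0 in (c_joint, c_r, c_x):
        return 0.0
    return math.log2((c_joint / total) / ((c_r / total) * (c_x / total)))

def generalizations(context):
    """Lattice points on one side: every non-empty subsequence obtained
    by eliminating contextual elements (simplified; the paper's lattice
    removes exactly one element per parent-child step)."""
    for k in range(len(context), 0, -1):
        for keep in combinations(range(len(context)), k):
            yield tuple(context[i] for i in keep)

def affinity(node, prep, phrase, count):
    """Affinity of an attachment node: the maximum MI over all lattice
    points; the preposition stays on the right side (pMN ... p)."""
    total = count(())  # total number of indexed dependencies
    right_sides = [(prep,) + g for g in generalizations(phrase)] + [(prep,)]
    best = 0.0
    for r in generalizations(node):
        for x in right_sides:
            best = max(best, mi(count(r + x), count(r), count(x), total))
    return best

def disambiguate(noun_node, verb_node, prep, phrase, count, default="noun"):
    """Attach the PP to the node with the higher affinity; ties go to
    the default attachment."""
    a_noun = affinity(noun_node, prep, phrase, count)
    a_verb = affinity(verb_node, prep, phrase, count)
    if a_noun == a_verb:
        return default
    return "noun" if a_noun > a_verb else "verb"

# Hypothetical counts for Example 2, "a majority ... approved the
# transaction by written consent"; the verb-side node is (subject, verb),
# matching the SV:pMN lattice.
counts = {
    (): 1000,
    ("transaction",): 30,
    ("majority", "approved"): 5,
    ("approved",): 40,
    ("by", "consent"): 8,
    ("transaction", "by", "consent"): 1,  # evidence for noun attachment
    ("approved", "by", "consent"): 6,     # evidence for verb attachment
}
count = lambda terms: counts.get(tuple(terms), 0)

print(disambiguate(("transaction",), ("majority", "approved"),
                   "by", ("consent",), count))  # -> "verb"
```

With these toy counts, the strongest evidence for verb attachment comes not from the most specific context ("majority approved ... by consent", unseen) but from the generalization "approved ... by consent", illustrating why the maximum over the lattice, rather than the supremum alone, is used.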