The Effect of Corpus Size in Combining Supervised and Unsupervised Training for Disambiguation

Michaela Atterer
Institute for NLP, University of Stuttgart
atterer@ims.uni-stuttgart.de

Hinrich Schütze
Institute for NLP, University of Stuttgart
hinrich@hotmail.com

Abstract

We investigate the effect of corpus size in combining supervised and unsupervised learning for two types of attachment decisions: relative clause attachment and prepositional phrase attachment. The supervised component is Collins' parser, trained on the Wall Street Journal. The unsupervised component gathers lexical statistics from an unannotated corpus of newswire text. We find that the combined system improves the performance of the parser only for small training sets. Surprisingly, the size of the unannotated corpus has little effect due to the noisiness of the lexical statistics acquired by unsupervised learning.

1 Introduction

The best performing systems for many tasks in natural language processing are based on supervised training on annotated corpora such as the Penn Treebank (Marcus et al., 1993) and the prepositional phrase data set first described by Ratnaparkhi et al. (1994). However, the production of training sets is expensive, and they are not available for many domains and languages. This motivates research on combining supervised with unsupervised learning, since unannotated text is in ample supply for most domains in the major languages of the world.

The question arises how much annotated and unannotated data is necessary in such combined learning strategies. We investigate this question for two attachment ambiguity problems: relative clause (RC) attachment and prepositional phrase (PP) attachment. The supervised component is Collins' parser (Collins, 1997), trained on the Wall Street Journal. The unsupervised component gathers lexical statistics from an unannotated corpus of newswire text. The sizes of both types of corpora, annotated and unannotated, are of interest.

We would expect that large annotated corpora (training sets) tend to make the additional information from unannotated corpora redundant. This expectation is confirmed in our experiments. For example, when using the maximum training set available for PP attachment, performance decreases when "unannotated" lexical statistics are added. For unannotated corpora, we would expect the opposite effect: the larger the unannotated corpus, the better the combined system should perform. While there is a general tendency in this direction, the improvements in our experiments quickly reach a plateau as the unlabeled corpus grows, especially for PP attachment. We attribute this result to the noisiness of the statistics collected from unlabeled corpora.

The paper is organized as follows. Sections 2, 3 and 4 describe data sets, methods and experiments. Section 5 evaluates and discusses experimental results. Section 6 compares our approach to prior work. Section 7 states our conclusions.

2 Data Sets

The unlabeled corpus is the Reuters RCV1 corpus, about 80,000,000 words of newswire text (Lewis et al., 2004). Three different subsets, corresponding to roughly 10%, 50% and 100% of the corpus, were created for experiments related to the size of the unannotated corpus. (Two weeks after Aug 5, 1997, were set apart for future experiments.) The labeled corpus is the Penn Wall Street Journal treebank (Marcus et al., 1993).
We created the 5 subsets shown in Table 1 for experiments related to the size of the annotated corpus.

Table 1: Corpora used for the experiments: unlabeled Reuters (R) corpus for attachment statistics, labeled Penn treebank (WSJ) for training the Collins parser.

    unlabeled R   100%    20/08/1996–05/08/1997 (351 days)
                   50%    20/08/1996–17/02/1997 (182 days)
                   10%    20/08/1996–24/09/1996 (36 days)
    labeled WSJ    50%    sections 00–12 (23412 sentences)
                   25%    lines 1–292960 (11637 sentences)
                    5%    lines 1–58284 (2304 sentences)
                    1%    lines 1–11720 (500 sentences)
                 0.05%    lines 1–611 (23 sentences)

The test set, sections 13–24, is larger than in most studies because a single section does not contain a sufficient number of RC attachment ambiguities for a meaningful evaluation. The numbers of ambiguities in the two subsets are as follows (highA/lowA: high and low attachment of which-clauses; verbA/nounA: verb and noun attachment of PPs):

    which-clauses subset        highA    lowA    total
    develop set (sec 00–12)        71     211      282
    test set (sec 13–24)           71     193      264

    PP subset                   verbA   nounA    total
    develop set (sec 00–12)      5927    6560    12487
    test set (sec 13–24)         5930    6273    12203

Both data sets are available for download (Web Appendix, 2006). We did not use the PP data set described by Ratnaparkhi et al. (1994) because we use more context than the limited context available in that set (see below).

3 Methods

Collins parser. Our baseline method for ambiguity resolution is the Collins parser as implemented by Bikel (Collins, 1997; Bikel, 2004). For each ambiguity, we check whether the attachment ambiguity is resolved correctly by the 5 parsers corresponding to the different training sets. If the attachment ambiguity is not recognized (e.g., because parsing failed), the corresponding ambiguity is excluded for that instance of the parser. As a result, the size of the effective test set varies from parser to parser (see Table 4).

Minipar. The unannotated corpus is analyzed using minipar (Lin, 1998), a partial dependency parser. The corpus is parsed and all extracted dependencies are stored for later use. Dependencies in ambiguous PP attachments (those corresponding to [VP NP PP] and [VP [NP PP]] subtrees) are not indexed. An experiment with indexing both alternatives for ambiguous structures yielded poor results. For example, indexing both alternatives creates a large number of spurious verb attachments of "of", which in turn result in incorrect high attachments by our disambiguation algorithm. For relative clauses, no such filtering is necessary: spurious subject-verb dependencies due to RC ambiguities are rare compared to the large number of subject-verb dependencies that can be extracted reliably.

Inverted index. Dependencies extracted by minipar are stored in an inverted index (Witten et al., 1999), implemented in Lucene (Lucene, 2006). For example, "john subj buy", the analysis returned by minipar for "John buys", is stored in the index so that its corpus frequency can later be retrieved.

Lattice-based disambiguation (LBD). Consider the following examples:

(1) . . . or the transaction is performed by written consent.

(2) . . . a majority . . . have approved the transaction by written consent . . .

We represent an attachment ambiguity as a triple <R, i, X>, where X is (the parse of) a phrase with two or more possible attachment nodes in a sentence S, i is one of these attachment nodes, and R is (the relevant part of a parse of) S with X removed. For example, the two attachments in Example 2 are represented as the triples:

    <approved_i1 the transaction_i2, i1, by consent>
    <approved_i1 the transaction_i2, i2, by consent>
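The following is a minimal Python sketch of this indexing scheme, not the authors' Lucene implementation: each extracted dependency is treated as a tiny document indexed under its component terms, and corpus frequencies are obtained by intersecting postings lists. All names here (DependencyIndex, add, count) are our own illustrative choices.

```python
from collections import defaultdict

class DependencyIndex:
    """Toy inverted index over dependency tuples; a stand-in for the
    Lucene index used in the paper."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of dependencies containing it
        self.deps = []                    # id -> dependency tuple

    def add(self, dep):
        """Index one dependency, e.g. ("john", "subj", "buy")."""
        dep_id = len(self.deps)
        self.deps.append(dep)
        for term in dep:
            self.postings[term].add(dep_id)

    def count(self, terms):
        """Number of indexed dependencies containing all query terms;
        used later to estimate the probabilities of <R, i, X>, R and X."""
        if not terms:
            return len(self.deps)
        return len(set.intersection(*(self.postings[t] for t in terms)))

index = DependencyIndex()
index.add(("john", "subj", "buy"))           # minipar analysis of "John buys"
index.add(("approved", "by", "consent"))     # a verb attachment of "by consent"
index.add(("transaction", "by", "consent"))  # a noun attachment of "by consent"

print(index.count(("by", "consent")))        # -> 2
```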
We decide between attachment possibilities based on pointwise mutual information, the well-known measure of how surprising it is to see R and X together given their individual frequencies:

    MI(<R, i, X>) = log2 [ P(<R, i, X>) / (P(R) P(X)) ]   for P(<R, i, X>), P(R), P(X) ≠ 0
    MI(<R, i, X>) = 0                                     otherwise

where the probabilities of the dependency structures <R, i, X>, R and X are estimated on the unlabeled corpus by querying the inverted index. Unfortunately, these structures will often not occur in the corpus. If this is the case, we back off to generalizations of R and X. The generalizations form a lattice, shown in Figure 1 for PP attachment.

[Figure 1: Lattice of pairs of potential attachment site (NP) and attachment phrase (PP), with nodes MN:pMN, MN:pM, MN:pN, N:pMN, MN:p, N:pM, N:pN, N:p and 0:p. M: premodifying adjective or noun (upper or lower NP), N: head noun (upper or lower NP), p: preposition.]

For example, MN:pMN corresponds to "commercial transaction by unanimous consent", N:pM to "transaction by unanimous", etc. For 0:p we compute the MI of the two events "noun attachment" and "occurrence of p".

Points in the lattice in Figure 1 are created by successive elimination of material from the complete context R:X. A child c directly dominated by a parent p is created by removing exactly one contextual element from p, either on the right side (the attachment phrase) or on the left side (the attachment node). For RC attachment, generalizations other than elimination are also introduced, such as the replacement of a proper noun (e.g., Canada) by its category (country) (see below).

The MI of each point in the lattice is computed. We then take the maximum over all MI values of the lattice as a measure of the affinity of attachment phrase and attachment node. The intuition is that we are looking for the strongest evidence available for the attachment. The strongest evidence is often not provided by the most specific context (MN:pMN in the example), since contextual elements like modifiers will in some cases only add noise to the attachment decision.

The actual syntactic disambiguation is performed by computing the affinity (the maximum over the MI values in the lattice) for each possible attachment and selecting the attachment with the highest affinity. (The default attachment is selected if the two values are equal.) The second lattice for PP attachment, the lattice for attachment to the verb, has a structure identical to Figure 1, but the attachment node is SV instead of MN, where S denotes the subject and V the verb. So the supremum of that lattice is SV:pMN and the infimum is 0:p (which in this case corresponds to the MI of verb attachment and occurrence of the preposition).

LBD is motivated by the desire to use as much context as possible for disambiguation. Previous work on attachment disambiguation has generally used less context than this paper does (e.g., modifiers have not been used for PP attachment). No change to LBD is necessary if the lattice of contexts is extended by adding further contextual elements (e.g., the preposition between the two attachment nodes in RC, which we do not consider in this paper).

For RC attachment, a lattice is constructed for each potential attachment node, NP1 and NP2, in the configuration NP1 Prep NP2 RC. Figure 2 shows the maximum possible lattice. If contextual elements are not present in a context (e.g., a modifier), then the lattice will be smaller. The supremum of the lattice corresponds to a query that includes the entire NP (including modifying adjectives and nouns), the verb and its object. Example: exchange rate
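To make the procedure of this section concrete, here is a minimal Python sketch of LBD for PP attachment under simplifying assumptions: corpus frequencies come from a plain dictionary rather than the inverted index, the lattice is generated by enumerating all eliminations of contextual elements (the preposition is never eliminated, as in Figure 1), and the special 0:p point and the RC-specific generalizations are omitted. The function names and the example counts are our own, not the paper's.

```python
import math
from itertools import combinations

def mi(c_joint, c_r, c_x, total):
    """Pointwise MI of attachment node R and attachment phrase X:
    log2( P(<R,i,X>) / (P(R) P(X)) ), or 0 if any probability is 0."""
    if 0 in (c_joint, c_r, c_x):
        return 0.0
    return math.log2((c_joint / total) / ((c_r / total) * (c_x / total)))

def generalizations(context):
    """Lattice points on one side: every non-empty subsequence obtained
    by eliminating contextual elements (simplified; the paper's lattice
    removes exactly one element per parent-child step)."""
    for k in range(len(context), 0, -1):
        for keep in combinations(range(len(context)), k):
            yield tuple(context[i] for i in keep)

def affinity(node, prep, phrase, count):
    """Affinity of an attachment node: the maximum MI over all lattice
    points; the preposition stays on the right side (pMN ... p)."""
    total = count(())  # total number of indexed dependencies
    right_sides = [(prep,) + g for g in generalizations(phrase)] + [(prep,)]
    best = 0.0
    for r in generalizations(node):
        for x in right_sides:
            best = max(best, mi(count(r + x), count(r), count(x), total))
    return best

def disambiguate(noun_node, verb_node, prep, phrase, count, default="noun"):
    """Attach the PP to the node with the higher affinity; ties go to
    the default attachment."""
    a_noun = affinity(noun_node, prep, phrase, count)
    a_verb = affinity(verb_node, prep, phrase, count)
    if a_noun == a_verb:
        return default
    return "noun" if a_noun > a_verb else "verb"

# Hypothetical counts for Example 2, "a majority ... approved the
# transaction by written consent"; the verb-side node is (subject, verb),
# matching the SV:pMN lattice.
counts = {
    (): 1000,
    ("transaction",): 30,
    ("majority", "approved"): 5,
    ("approved",): 40,
    ("by", "consent"): 8,
    ("transaction", "by", "consent"): 1,  # evidence for noun attachment
    ("approved", "by", "consent"): 6,     # evidence for verb attachment
}
count = lambda terms: counts.get(tuple(terms), 0)

print(disambiguate(("transaction",), ("majority", "approved"),
                   "by", ("consent",), count))  # -> "verb"
```

With these toy counts, the strongest evidence for verb attachment comes not from the most specific context ("majority approved ... by consent", unseen) but from the generalization "approved ... by consent", illustrating why the maximum over the lattice, rather than the supremum alone, is used.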