Automatic Classification of Verbs in Biomedical Texts

Anna Korhonen
University of Cambridge, Computer Laboratory
15 JJ Thomson Avenue, Cambridge CB3 0GD, UK
alk23@cl.cam.ac.uk

Yuval Krymolowski
Dept. of Computer Science, Technion
Haifa 32000, Israel
yuvalkr@cs.technion.ac.il

Nigel Collier
National Institute of Informatics
Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo 101-8430, Japan
collier@nii.ac.jp

Abstract

Lexical classes, when tailored to the application and domain in question, can provide an effective means to deal with a number of natural language processing (NLP) tasks. While manual construction of such classes is difficult, recent research shows that it is possible to automatically induce verb classes from cross-domain corpora with promising accuracy. We report a novel experiment where similar technology is applied to the important, challenging domain of biomedicine. We show that the resulting classification, acquired from a corpus of biomedical journal articles, is highly accurate and strongly domain-specific. It can be used to aid BIO-NLP directly or as useful material for investigating the syntax and semantics of verbs in biomedical texts.

1 Introduction

Lexical classes which capture the close relation between the syntax and semantics of verbs have attracted considerable interest in NLP (Jackendoff, 1990; Levin, 1993; Dorr, 1997; Prescher et al., 2000). Such classes are useful for their ability to capture generalizations about a range of linguistic properties. For example, verbs which share the meaning of 'manner of motion' (such as travel, run, walk) behave similarly also in terms of subcategorization (I traveled/ran/walked, I traveled/ran/walked to London, I traveled/ran/walked five miles). Although the correspondence between the syntax and semantics of words is not perfect and the classes do not provide means for full semantic inferencing, their predictive power is nevertheless considerable.

NLP systems can benefit from lexical classes in many ways. Such classes define the mapping from surface realization of arguments to predicate-argument structure, and are therefore an important component of any system which needs the latter. As the classes can capture higher-level abstractions, they can be used as a means to abstract away from individual words when required. They are also helpful in many operational contexts where lexical information must be acquired from small application-specific corpora. Their predictive power can help compensate for the lack of data fully exemplifying the behavior of relevant words.

Lexical verb classes have been used to support various (multilingual) tasks, such as computational lexicography, language generation, machine translation, word sense disambiguation, semantic role labeling, and subcategorization acquisition (Dorr, 1997; Prescher et al., 2000; Korhonen, 2002). However, large-scale exploitation of the classes in real-world or domain-sensitive tasks has not been possible because the existing classifications, e.g. (Levin, 1993), are far from comprehensive and are unsuitable for specific domains. While manual classification of large numbers of words has proved difficult and time-consuming, recent research shows that it is possible to automatically induce lexical classes from corpus data with promising accuracy (Merlo and Stevenson, 2001; Brew and Schulte im Walde, 2002; Korhonen et al., 2003).
A number of machine learning (ML) methods have been applied to classify words, using features pertaining mainly to syntactic structure (e.g. statistical distributions of subcategorization frames (SCFs), or general patterns of syntactic behaviour such as transitivity and passivisability) which have been extracted from corpora using e.g. part-of-speech tagging or robust statistical parsing techniques.

This research has been encouraging, but it has so far concentrated on general language. Domain-specific lexical classification remains unexplored, although it is arguably important: existing classifications are unsuitable for domain-specific applications, and these often challenging applications might benefit from improved performance by utilizing lexical classes the most.

In this paper, we extend an existing approach to lexical classification (Korhonen et al., 2003) and apply it (without any domain-specific tuning) to the domain of biomedicine. We focus on biomedicine for several reasons: (i) NLP is critically needed to assist the processing, mining and extraction of knowledge from the rapidly growing literature in this area, (ii) the domain lexical resources (e.g. the UMLS metathesaurus and lexicon, http://www.nlm.nih.gov/research/umls) do not provide sufficient information about verbs, and (iii) being linguistically challenging, the domain provides a good test case for examining the potential of automatic classification.

We report an experiment where a classification is induced for 192 relatively frequent verbs from a corpus of 2230 biomedical journal articles. The results, evaluated with domain experts, show that the approach is capable of acquiring classes with accuracy higher than that reported in previous work on general language. We discuss reasons for this and show that the resulting classes differ substantially from those in extant lexical resources. They constitute the first syntactic-semantic verb classification for the biomedical domain and could be readily applied to support BIO-NLP.

We discuss the domain-specific issues related to our task in section 2. The approach to automatic classification is presented in section 3. Details of the experimental evaluation are supplied in section 4. Section 5 provides discussion and section 6 concludes with directions for future work.

2 The Biomedical Domain and Our Task

Recent years have seen a massive growth in the scientific literature in the domain of biomedicine. For example, the MEDLINE database (http://www.ncbi.nlm.nih.gov/PubMed/), which currently contains around 16M references to journal articles, expands by 0.5M new references each year. Because future research in the biomedical sciences depends on making use of all this existing knowledge, there is a strong need for the development of NLP tools which can be used to automatically locate, organize and manage facts related to published experimental results.

In recent years, major progress has been made on information retrieval and on the extraction of specific relations, e.g. between proteins and cell types, from biomedical texts (Hirschman et al., 2002). Other tasks, such as the extraction of factual information, remain a bigger challenge. This is partly due to the challenging nature of biomedical texts. They are complex both in terms of syntax and semantics, containing complex nominals, modal subordination, anaphoric links, etc.
Researchers have recently begun to use deeper NLP techniques (e.g. statistical parsing) in the domain because they are not challenged by the complex structures to the same extent as shallow techniques (e.g. regular expression patterns) are (Lease and Charniak, 2005). However, for optimal performance, deeper techniques require richer domain-specific lexical information than is provided by existing lexicons (e.g. UMLS). This is particularly important for verbs, which are central to the structure and meaning of sentences.

Where the lexical information is absent, lexical classes can compensate for it or aid in obtaining it in the ways described in section 1. Consider e.g. the INDICATE and ACTIVATE verb classes in Figure 1. They capture the fact that their members are similar in terms of syntax and semantics: they have similar SCFs and selectional preferences, and they can be used to make similar statements which describe similar events. Such information can be used to build a richer lexicon capable of supporting key tasks such as parsing, predicate-argument identification, event extraction and the identification of biomedical (e.g. interaction) relations.

[Figure 1: Sample lexical classes. The INDICATE class (suggests, demonstrates, indicates, implies, ...) takes PROTEINS such as p53 as arguments ("p53 suggests/demonstrates/indicates/implies that ..."), while the ACTIVATE class (activates, up-regulates, induces, stimulates, ...) relates PROTEINS such as p53 to GENES such as WAF1 (p21, CIP1).]

While an abundance of work has been conducted on the semantic classification of biomedical terms and nouns, less work has been done on the (manual or automatic) semantic classification of verbs in the biomedical domain (Friedman et al., 2002; Hatzivassiloglou and Weng, 2002; Spasic et al., 2005). No previous work exists in this domain on the type of lexical (i.e. syntactic-semantic) verb classification this paper focuses on.

To get an initial idea about the differences between our target classification and a general language classification, we examined the extent to which individual verbs and their frequencies differ in biomedical and general language texts. We created a corpus of 2230 biomedical journal articles (see section 4.1 for details) and compared the distribution of verbs in this corpus with that in the British National Corpus (BNC) (Leech, 1992). We calculated the Spearman rank correlation between the 1165 verbs which occurred in both corpora. The result was only a weak correlation: 0.37 ± 0.03. When the scope was restricted to the 100 most frequent verbs in the biomedical data, the correlation was 0.12 ± 0.10, which is only 1.2 standard deviations away from zero. The dissimilarity between the distributions is further indicated by the Kullback-Leibler distance of 0.97. Table 1 illustrates some of these big differences by showing the list of the 15 most frequent verbs in the two corpora.

Table 1: The 15 most frequent verbs in the biomedical data and in the BNC
BIO: show, suggest, use, indicate, contain, describe, express, bind, require, observe, find, determine, demonstrate, perform, induce
BNC: do, say, make, go, see, take, get, know, come, give, think, use, find, look, want
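This kind of corpus comparison is easy to reproduce with standard tools. Below is a minimal sketch in Python, assuming two dictionaries mapping verbs to raw corpus frequencies; the variable names bio_counts and bnc_counts are illustrative, and since the paper does not state the direction or logarithm base used for the Kullback-Leibler distance, base-2 logarithms and the D(bio || bnc) direction are our assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def compare_verb_distributions(bio_counts, bnc_counts):
    """Compare verb frequency distributions from two corpora.

    bio_counts, bnc_counts: dicts mapping verb -> raw frequency.
    Only verbs attested in both corpora are compared, as in the paper.
    """
    shared = sorted(set(bio_counts) & set(bnc_counts))
    bio = np.array([bio_counts[v] for v in shared], dtype=float)
    bnc = np.array([bnc_counts[v] for v in shared], dtype=float)

    # Spearman rank correlation over the shared verbs.
    rho, _ = spearmanr(bio, bnc)

    # Normalise the counts to probability distributions.
    p = bio / bio.sum()
    q = bnc / bnc.sum()

    # Kullback-Leibler distance D(p || q); well-defined here because
    # every shared verb has a non-zero count in both corpora.
    kl = float(np.sum(p * np.log2(p / q)))
    return rho, kl
```

Restricting the comparison to the 100 most frequent biomedical verbs, as in the text, would simply mean filtering the shared verb list by rank before computing the two statistics.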
3 Approach

We extended the system of Korhonen et al. (2003) with additional clustering techniques (introduced in sections 3.2.2 and 3.2.4) and used it to obtain the classification for the biomedical domain. The system (i) extracts features from corpus data and (ii) clusters them using five different methods. These steps are described in the following two sections, respectively.

3.1 Feature Extraction

We employ as features distributions of SCFs specific to given verbs. We extract them from corpus data using the comprehensive subcategorization acquisition system of Briscoe and Carroll (1997) (Korhonen, 2002). The system incorporates RASP, a domain-independent robust statistical parser (Briscoe and Carroll, 2002), which tags, lemmatizes and parses data yielding complete though shallow parses, and an SCF classifier which incorporates an extensive inventory of 163 verbal SCFs (see http://www.cl.cam.ac.uk/users/alk23/subcat/subcat.html for further detail). The SCFs abstract over specific lexically-governed particles and prepositions and over specific predicate selectional preferences. In our work, we parameterized two high-frequency SCFs for prepositions (the PP and NP + PP SCFs). No filtering of potentially noisy SCFs was done, in order to provide clustering with as much information as possible.

3.2 Classification

The SCF frequency distributions constitute the input data to automatic classification. We experiment with five clustering methods: the simple hard nearest neighbours method and four probabilistic methods: two variants of Probabilistic Latent Semantic Analysis and two information-theoretic methods (the Information Bottleneck and the Information Distortion).

3.2.1 Nearest Neighbours

The first method collects the nearest neighbours (NN) of each verb. It (i) calculates the Jensen-Shannon divergence (JS) between the SCF distributions of each pair of verbs, (ii) connects each verb with the most similar other verb, and finally (iii) finds all the connected components. The NN method is very simple. It outputs only one clustering configuration and therefore does not allow examining different cluster granularities.
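The NN procedure is simple enough to state in a few lines of code. The sketch below is ours, not the authors' implementation: it assumes each verb's SCF distribution is given as a probability vector over a shared SCF inventory, and it treats the nearest-neighbour links as undirected edges when collecting the connected components.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two SCF distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0           # 0 * log 0 is taken to be 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def nn_clusters(scf_dists):
    """NN clustering: link each verb to its most similar other verb,
    then return the connected components of the resulting graph.

    scf_dists: dict mapping verb -> SCF probability vector, with all
    vectors defined over the same SCF inventory.
    """
    verbs = sorted(scf_dists)
    # (ii) connect each verb with the most similar other verb
    neighbours = {v: set() for v in verbs}
    for v in verbs:
        nearest = min((u for u in verbs if u != v),
                      key=lambda u: js_divergence(scf_dists[v], scf_dists[u]))
        neighbours[v].add(nearest)
        neighbours[nearest].add(v)
    # (iii) find all connected components (depth-first search)
    clusters, seen = [], set()
    for v in verbs:
        if v in seen:
            continue
        stack, component = [v], set()
        while stack:
            w = stack.pop()
            if w not in component:
                component.add(w)
                stack.extend(neighbours[w] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters
```

Because every verb gains exactly one outgoing link, the number of resulting components (and hence clusters) is fixed by the data, which is why the method offers no control over cluster granularity.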
3.2.2 Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis (PLSA; Hofmann, 2001) assumes a generative model for the data, defined by selecting (i) a verb verb_i, (ii) a semantic class class_k from the distribution p(Classes | verb_i), and (iii) an SCF scf_j from the distribution p(SCFs | class_k). PLSA uses Expectation Maximization (EM) to find the distribution p̃(SCFs | Clusters, Verbs) which maximises the likelihood of the observed counts. It does this by minimising the cost function

F = -β · log Likelihood(p̃ | data) + H(p̃).

For β = 1, minimising F is equivalent to the standard EM procedure, while for β < 1 the distribution p̃ tends to be more evenly spread. We use β = 1 (PLSA/EM) and β = 0.75 (PLSA_β=0.75). We currently "harden" the output and assign each verb to the most probable cluster only. (The same approach was used with the information-theoretic methods below. It made sense in this initial work on biomedical classification; in the future we could use soft clustering as a means to investigate polysemy.)

3.2.3 Information Bottleneck

The Information Bottleneck (IB) (Tishby et al., 1999) is an information-theoretic method which controls the balance between (i) the loss of information from representing verbs as clusters, I(Clusters; Verbs), which has to be minimal, and (ii) the relevance of the output clusters for representing the SCF distribution, I(Clusters; SCFs), which has to be maximal. The balance between these two quantities ensures optimal compression of the data through clusters. The trade-off between the two constraints is realized through minimising the cost function

L_IB = I(Clusters; Verbs) - β · I(Clusters; SCFs),

where β is a parameter that balances the constraints. IB takes three inputs: (i) the SCF-verb distributions, (ii) the desired number of clusters K, and (iii) the initial value of β. It then looks for the minimal β that decreases L_IB compared to its value with the initial β, using the given K. IB delivers as output the probabilities p(K|V). It also gives an indication of the most informative numbers of output clusters: the ones for which the relevance information increases more sharply between K-1 and K clusters than between K and K+1.

3.2.4 Information Distortion

The Information Distortion method (ID) (Dimitrov and Miller, 2001) is otherwise similar to IB, but L_ID differs from L_IB by an additional term that adds a bias towards clusters of similar size:

L_ID = -H(Clusters | Verbs) - β · I(Clusters; SCFs) = L_IB - H(Clusters).

ID yields more evenly divided clusters than IB.
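To make the two cost functions concrete, the sketch below evaluates L_IB and L_ID for a candidate soft assignment of verbs to clusters. This is our illustration, not the authors' optimiser: the annealing over β is omitted, the base-2 logarithms and array layout are assumptions, and p(c, s) is derived using the bottleneck assumption that SCFs depend on clusters only through verbs.

```python
import numpy as np

def ib_id_costs(p_v, p_s_given_v, p_c_given_v, beta):
    """Evaluate the IB and ID cost functions for one clustering.

    p_v:         shape (V,)   marginal verb distribution
    p_s_given_v: shape (V, S) SCF distribution of each verb
    p_c_given_v: shape (V, C) soft assignment of verbs to clusters
    """
    eps = 1e-12
    p_vc = p_v[:, None] * p_c_given_v        # joint p(v, c), shape (V, C)
    p_c = p_vc.sum(axis=0)                   # marginal p(c)
    p_s = p_v @ p_s_given_v                  # marginal p(s)
    # p(c, s) = sum_v p(v, c) p(s|v): SCFs reach clusters only via verbs
    p_cs = p_vc.T @ p_s_given_v              # shape (C, S)

    def mutual_information(joint, marg_rows, marg_cols):
        ratio = joint / (np.outer(marg_rows, marg_cols) + eps)
        return float(np.sum(joint * np.log2(ratio + eps)))

    i_cv = mutual_information(p_vc.T, p_c, p_v)   # I(Clusters; Verbs)
    i_cs = mutual_information(p_cs, p_c, p_s)     # I(Clusters; SCFs)
    h_c = -float(np.sum(p_c * np.log2(p_c + eps)))

    l_ib = i_cv - beta * i_cs                # IB cost
    l_id = l_ib - h_c                        # ID cost = -H(C|V) - beta*I(C;S)
    return l_ib, l_id
```

The last line mirrors the identity L_ID = L_IB - H(Clusters) stated above: subtracting the cluster entropy rewards assignments whose clusters are closer to equal size.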
4 Experimental Evaluation

4.1 Data

We downloaded the data for our experiment from the MEDLINE database, from three of the 10 leading journals in biomedicine: 1) Genes & Development (molecular biology, molecular genetics), 2) Journal of Biological Chemistry (biochemistry and molecular biology) and 3) Journal of Cell Biology (cellular structure and function). 2230 full-text articles from the years 2003-2004 were used. The data included 11.5M words and 323,307 sentences in total.

192 medium-to-high frequency verbs (with a minimum of 300 occurrences in the data) were selected for experimentation. (230 verbs were employed initially, but 38 were dropped later so that each coarse-grained class would have a minimum of 2 members in the gold standard.) This test set was big enough to produce a useful classification but small enough to enable thorough evaluation in this first attempt to classify verbs in the biomedical domain.

4.2 Processing the Data

The data was first processed using the feature extraction module. 233 (preposition-specific) SCF types appeared in the resulting lexicon, 36 per verb on average. (This number is high because no filtering of potentially noisy SCFs was done.) The classification module was then applied. NN produced K_nn = 42 clusters. From the other methods we requested K = 2 to 60 clusters. We chose for evaluation the outputs corresponding to the most informative values of K: 20, 33 and 53 for IB, and 17, 33 and 53 for ID.

4.3 Gold Standard

Because no target lexical classification was available for the biomedical domain, human experts (4 domain experts and 2 linguists) were used to create the gold standard. They were asked to examine whether the test verbs that are similar in terms of their syntactic properties (i.e. verbs with similar SCF distributions) are similar also in terms of semantics (i.e. they share a common meaning). Where this was the case, a verb class was identified and named. The domain experts examined the 116 verbs whose analysis required domain knowledge (e.g. activate, solubilize, harvest), while the linguists analysed the remaining 76 general or scientific text verbs (e.g. demonstrate, hypothesize, appear). The linguists used Levin (1993) classes as gold standard classes whenever possible and created novel ones when needed. The domain experts used two purely semantic classifications of biomedical verbs (Friedman et al., 2002; Spasic et al., 2005, see http://www.cbr-masterclass.org) as a starting point where this was possible, i.e. where they included our test verbs and also captured their relevant senses. (Purely semantic classes tend to be finer-grained than lexical classes and are not necessarily syntactic in nature. Only these two classifications were found to be similar enough to our target classification to provide a useful starting point. Section 5 includes a summary of the similarities and differences between our gold standard and these other classifications.)

The experts created a 3-level gold standard which includes both broad and finer-grained classes. Only those classes and memberships were included which all the experts (in the two teams) agreed on. (Experts were allowed to discuss the problematic cases to obtain maximal accuracy; hence no inter-annotator agreement is reported.) The resulting gold standard, including 16, 34 and 50 classes, is illustrated in Table 2 with 1-2 example verbs per class. The table indicates which classes were created by domain experts (BIO) and which by linguists (GEN).

1 Have an effect on activity (BIO/29)
  1.1 Activate / Inactivate
    1.1.1 Change activity: activate, inhibit
    1.1.2 Suppress: suppress, repress
    1.1.3 Stimulate: stimulate
    1.1.4 Inactivate: delay, diminish
  1.2 Affect
    1.2.1 Modulate: stabilize, modulate
    1.2.2 Regulate: control, support
  1.3 Increase / decrease: increase, decrease
  1.4 Modify: modify, catalyze
2 Biochemical events (BIO/12)
  2.1 Express: express, overexpress
  2.2 Modification
    2.2.1 Biochemical modification: dephosphorylate, phosphorylate
    2.2.2 Cleave: cleave
  2.3 Interact: react, interfere
3 Removal (BIO/6)
  3.1 Omit: displace, deplete
  3.2 Subtract: draw, dissect
4 Experimental Procedures (BIO/30)
  4.1 Prepare
    4.1.1 Wash: wash, rinse
    4.1.2 Mix: mix
    4.1.3 Label: stain, immunoblot
    4.1.4 Incubate: preincubate, incubate
    4.1.5 Elute: elute
  4.2 Precipitate: coprecipitate, coimmunoprecipitate
  4.3 Solubilize: solubilize, lyse
  4.4 Dissolve: homogenize, dissolve
  4.5 Place: load, mount
5 Process (BIO/5): linearize, overlap
6 Transfect (BIO/4): inject, microinject
7 Collect (BIO/6)
  7.1 Collect: harvest, select
  7.2 Process: centrifuge, recover
8 Physical Relation Between Molecules (BIO/20)
  8.1 Binding: bind, attach
  8.2 Translocate and Segregate
    8.2.1 Translocate: shift, switch
    8.2.2 Segregate: segregate, export
  8.3 Transmit
    8.3.1 Transport: deliver, transmit
    8.3.2 Link: connect, map
9 Report (GEN/30)
  9.1 Investigate
    9.1.1 Examine: evaluate, analyze
    9.1.2 Establish: test, investigate
    9.1.3 Confirm: verify, determine
  9.2 Suggest
    9.2.1 Presentational: hypothesize, conclude
    9.2.2 Cognitive: consider, believe
  9.3 Indicate: demonstrate, imply
10 Perform (GEN/10)
  10.1 Quantify
    10.1.1 Quantitate: quantify, measure
    10.1.2 Calculate: calculate, record
    10.1.3 Conduct: perform, conduct
  10.2 Score: score, count
11 Release (BIO/4): detach, dissociate
12 Use (GEN/4): utilize, employ
13 Include (GEN/11)
  13.1 Encompass: encompass, span
  13.2 Include: contain, carry
14 Call (GEN/3): name, designate
15 Move (GEN/12)
  15.1 Proceed: progress, proceed
  15.2 Emerge: arise, emerge
16 Appear (GEN/6): appear, occur

Table 2: The gold standard classification with a few example verbs per class
Each class was associated with 1-30 member verbs. (A minimum of 2 member verbs was required at the coarser-grained levels of 16 and 34 classes.) The total number of verbs is indicated in the table (e.g. 10 for the PERFORM class).

4.4 Measures

The clusters were evaluated against the gold standard using measures which are applicable to all the classification methods and which deliver a numerical value that is easy to interpret.

The first measure, the adjusted pairwise precision, evaluates clusters in terms of verb pairs:

APP = (1/K) · Σ_{i=1}^{K} (num. of correct pairs in k_i / num. of pairs in k_i) · (|k_i| - 1) / (|k_i| + 1)

APP is the average proportion of all within-cluster pairs that are correctly co-assigned. Multiplied by a factor that increases with cluster size, it compensates for a bias towards small clusters.

The second measure is modified purity, a global measure which evaluates the mean precision of clusters. Each cluster is associated with its prevalent class. The number of verbs in a cluster k_i that take this class is denoted by n_prevalent(k_i). Verbs that do not take it are considered as errors. Clusters where n_prevalent(k_i) = 1 are disregarded so as not to introduce a bias towards singletons:

mPUR = ( Σ_{k_i : n_prevalent(k_i) >= 2} n_prevalent(k_i) ) / (number of verbs)

The third measure is the weighted class accuracy, the proportion of members of the dominant clusters DOM-CLUST_i within all classes c_i:

ACC = ( Σ_{i=1}^{C} verbs in DOM-CLUST_i ) / (number of verbs)

mPUR can be seen to measure the precision of clusters and ACC the recall. We define an F measure as the harmonic mean of mPUR and ACC:

F = 2 · mPUR · ACC / (mPUR + ACC)

The statistical significance of the results is measured by randomisation tests where verbs are swapped between the clusters and the resulting clusters are evaluated. The swapping is repeated 100 times for each output, and the average av_swaps and the standard deviation σ_swaps are measured. The significance is the scaled difference

signif = (result - av_swaps) / σ_swaps.
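The four measures and the randomisation test can be implemented directly from the definitions above. The sketch below is a hypothetical implementation, not the authors' code: it assumes hard clusters (lists of verbs) and a single gold class label per verb at a given level of the hierarchy, and it approximates the verb-swapping step, which the paper does not spell out, by reshuffling verbs across clusters while preserving cluster sizes.

```python
import numpy as np
from collections import Counter

def evaluate(clusters, gold):
    """APP, mPUR, ACC and F for hard clusters against gold classes.

    clusters: list of lists of verbs
    gold:     dict mapping verb -> gold class label (one level)
    """
    n_verbs = sum(len(k) for k in clusters)

    # Adjusted pairwise precision (APP)
    app = 0.0
    for k in clusters:
        pairs = [(a, b) for i, a in enumerate(k) for b in k[i + 1:]]
        if pairs:   # singletons contribute 0 via the (|k|-1) factor
            correct = sum(gold[a] == gold[b] for a, b in pairs)
            app += (correct / len(pairs)) * (len(k) - 1) / (len(k) + 1)
    app /= len(clusters)

    # Modified purity (mPUR): clusters whose prevalent class has a
    # single member are disregarded
    prevalent = [Counter(gold[v] for v in k).most_common(1)[0][1]
                 for k in clusters]
    mpur = sum(n for n in prevalent if n >= 2) / n_verbs

    # Weighted class accuracy (ACC): for each gold class, count the
    # members that fall inside its dominant cluster
    acc = sum(max(sum(gold[v] == c for v in k) for k in clusters)
              for c in set(gold.values())) / n_verbs

    f = 2 * mpur * acc / (mpur + acc)
    return app, mpur, acc, f

def significance(clusters, gold, rounds=100, seed=0):
    """Randomisation test: scaled difference between the real F score
    and F under random reassignments of verbs to same-sized clusters."""
    rng = np.random.default_rng(seed)
    sizes = [len(k) for k in clusters]
    verbs = [v for k in clusters for v in k]
    real_f = evaluate(clusters, gold)[3]
    scores = []
    for _ in range(rounds):
        perm = list(rng.permutation(verbs))
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(perm[start:start + s])
            start += s
        scores.append(evaluate(shuffled, gold)[3])
    return (real_f - np.mean(scores)) / np.std(scores)
```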
4.5 Results from Quantitative Evaluation

Table 3 shows the performance of the five clustering methods for K = 42 clusters (as produced by the NN method) at the 3 levels of gold standard classification.

             16 Classes            34 Classes            50 Classes
             APP mPUR ACC   F      APP mPUR ACC   F      APP mPUR ACC   F
NN            81   86  39  53       64   74  62  67       54   67  73  69
IB            74   88  47  61       61   76  74  75       55   69  87  76
ID            79   89  37  52       63   78  65  70       53   70  77  73
PLSA/EM       55   72  49  58       43   53  61  57       35   47  66  55
PLSA_β=0.75   65   71  68  70       53   48  76  58       41   34  77  47

Table 3: The performance of the NN, PLSA, IB and ID methods with K_nn = 42 clusters

Although the two PLSA variants (particularly PLSA_β=0.75) produce a fairly accurate coarse-grained classification, they perform worse than all the other methods at the finer-grained levels of the gold standard, particularly according to the global measures. Being based on pairwise similarities, NN shows mostly better performance than IB and ID on the pairwise measure APP, but the global measures are better for IB and ID. The differences are smaller in mPUR (yet significant: 2% between NN and IB and 3% between NN and ID) but more notable in ACC (which is e.g. 8-12% better for IB than for NN). Also the F results suggest that the two information-theoretic methods are better overall than the simple NN method. IB and ID also have the advantage (over NN) that they can be used to produce a hierarchical verb classification.

Table 4 shows the results for IB and ID for the informative values of K. The best comparison points are those where the value of K most closely matches the number of classes at the particular level of the gold standard (K = 17/20 for 16 classes, K = 33 for 34 classes, and K = 53 for 50 classes).

             16 Classes            34 Classes            50 Classes
 K  Method   APP mPUR ACC   F      APP mPUR ACC   F      APP mPUR ACC   F
20  IB        74   77  66  71       60   56  86  67       54   48  93  63
17  ID        67   76  60  67       43   56  81  66       34   46  91  61
33  IB        78   87  52  65       69   75  81  77       61   67  93  77
33  ID        81   88  43  57       65   75  70  72       54   67  82  73
53  IB        71   87  41  55       61   78  66  71       54   72  79  75
53  ID        79   89  33  48       66   79  55  64       53   72  68  69

Table 4: The performance of IB and ID at the 3 levels of class hierarchy for the informative values of K

IB is clearly better than ID at all levels of the gold standard. It yields its best results at the medium level (34 classes) with K = 33: F = 77 and APP = 69 (the results for ID are F = 72 and APP = 65). At the most fine-grained level (50 classes), IB is equally good according to F with K = 33, but APP is 8% lower. Although ID is occasionally better than IB according to APP and mPUR (see e.g. the results for 16 classes with K = 53), this never happens in the case where the correspondence between the number of gold standard classes and the value of K is the closest. In other words, the informative values of K prove really informative for IB. The lower performance of ID seems to be due to its tendency to create evenly sized clusters.

All the methods perform significantly better than our random baseline. The significance of the results with respect to the swaps was at the 2σ level, corresponding to a 97% confidence that the results are above random.

4.6 Qualitative Evaluation

We performed further, qualitative analysis of the clusters produced by the best performing method, IB. Consider the following clusters (the numbers in parentheses refer to the gold standard classes in Table 2):

A: inject, transfect, microinfect, contransfect (6)
B: harvest, select, collect (7.1); centrifuge, process, recover (7.2)
C: wash, rinse (4.1.1); immunoblot (4.1.3); overlap (5)
D: activate (1.1.1)

When looking at coarse-grained outputs, interestingly, K as low as 8 learned the broad distinction between biomedical and general language verbs (the two verb types appeared only rarely in the same clusters) and produced large semantically meaningful groups of classes (e.g. the coarse-grained classes EXPERIMENTAL PROCEDURES, TRANSFECT and COLLECT were mapped together). K = 12 was sufficient to identify several classes with very particular syntax. One of them was TRANSFECT (see A above), whose members were distinguished easily because of their typical SCFs (e.g. inject/transfect/microinfect/contransfect X with/into Y).

On the other hand, even K = 53 could not identify classes with very similar (yet not identical) syntax. These included many semantically similar sub-classes (e.g. the two sub-classes of COLLECT shown in B, whose members take similar NP and PP SCFs). However, a few semantically different verbs also clustered wrongly for this reason, such as the ones exemplified in C. In C, immunoblot (from the LABEL class) is still somewhat related to wash and rinse (the WASH class) because they all belong to the larger EXPERIMENTAL PROCEDURES class, but overlap (from the PROCESS class) shows up in the cluster merely because of syntactic idiosyncrasy.

While parser errors caused by the challenging biomedical texts were visible in some SCFs (e.g. looking at a sample of SCFs, some adjunct instances were listed in the argument slots of the frames), the cases where this resulted in incorrect classification were not numerous. (This is partly because the mistakes of the parser are somewhat consistent (similar for similar verbs) and partly because the SCFs gather data from hundreds of corpus instances, many of which are analysed correctly.) One representative singleton resulting from these errors is exemplified in D. Activate appears in relatively complicated sentence structures, which gives rise to incorrect SCFs. For example, "MECs cultured on 2D planar substrates transiently activate MAP kinase in response to EGF, whereas..."
gets incorrectly analysed as the SCF NP-NP, while "The effect of the constitutively activated ARF6-Q67L mutant was investigated..." receives the incorrect SCF analysis NP-SCOMP. Most parser errors are caused by unknown domain-specific words and phrases.

5 Discussion

Due to differences in the task and experimental setup, direct comparison of our results with previously published ones is impossible. The closest possible comparison point is Korhonen et al. (2003), who reported 50-59% mPUR and 15-19% APP when using IB to assign 110 polysemous (general language) verbs into 34 classes. Our results are substantially better, although we made no effort to restrict our scope to monosemous verbs (most of our test verbs are polysemous according to WordNet (WN) (Miller, 1990), but this is not a fully reliable indication because WN is not specific to this domain) and although we focussed on a linguistically challenging domain.

It seems that our better result is largely due to the higher uniformity of verb senses in the biomedical domain. We could not investigate this effect systematically because no manually sense-annotated data (or a comprehensive list of verb senses) exists for the domain. However, examination of a number of corpus instances suggests that the use of verbs is fairly conventionalized in our data. (The different sub-domains of the biomedical domain may, of course, be even more conventionalized (Friedman et al., 2002).) Where verbs show less sense variation, they show less SCF variation, which aids the discovery of verb classes. Korhonen et al. (2003) observed the opposite with general language data.

We examined, class by class, to what extent our domain-specific gold standard differs from the related general (Levin, 1993) and domain classifications (Spasic et al., 2005; Friedman et al., 2002) (recall that the latter were purely semantic classifications, as no lexical ones were available for biomedicine). 33 of the 50 classes in the gold standard are biomedical. Only 6 of these correspond (fully or mostly) to the semantic classes in the domain classifications. 17 are unrelated to any of the classes in Levin (1993), while 16 bear vague resemblance to them (e.g. our TRANSPORT verbs are also listed under Levin's SEND verbs) but are too different (semantically and syntactically) to be combined. 17 of the 50 classes are general (scientific) classes. 4 of these are absent in Levin (e.g. QUANTITATE). 13 are included in Levin, but 8 of them have a more restricted sense (and fewer members) than the corresponding Levin class. Only the remaining 5 classes are identical (in terms of members and their properties) to Levin classes. These results highlight the importance of building or tuning lexical resources specific to different domains, and demonstrate the usefulness of automatic lexical acquisition for this work.

6 Conclusion

This paper has shown that current domain-independent NLP and ML technology can be used to automatically induce a relatively high accuracy verb classification from a linguistically challenging corpus of biomedical texts. The lexical classification resulting from our work is strongly domain-specific (it differs substantially from previous ones) and it can be readily used to aid BIO-NLP.
It can provide useful material for investigating the syntax and semantics of verbs in biomedical data or for supplementing existing domain lexical resources with additional information (e.g. semantic classifications with additional member verbs). Lexical resources enriched with verb class information can, in turn, better benefit practical tasks such as parsing, predicate-argument identification, event extraction and the identification of biomedical relation patterns, among others.

In the future, we plan to improve the accuracy of automatic classification by seeding it with domain-specific information (e.g. using named entity recognition and anaphoric linking techniques similar to those of Vlachos et al. (2006)). We also plan to conduct a bigger experiment with a larger number of verbs and demonstrate the usefulness of the bigger classification for practical BIO-NLP application tasks. In addition, we plan to apply similar technology to other interesting domains (e.g. tourism, law, astronomy). This will not only enable us to experiment with cross-domain lexical class variation but also help to determine whether automatic acquisition techniques benefit, in general, from domain-specific tuning.

Acknowledgement

We would like to thank Yoko Mizuta, Shoko Kawamato, Sven Demiya, and Parantu Shah for their help in creating the gold standard.

References

C. Brew and S. Schulte im Walde. 2002. Spectral clustering for German verbs. In Conference on Empirical Methods in Natural Language Processing, Philadelphia, USA.

E. J. Briscoe and J. Carroll. 1997. Automatic extraction of subcategorization from corpora. In 5th ACL Conference on Applied Natural Language Processing, pages 356-363, Washington DC.

E. J. Briscoe and J. Carroll. 2002. Robust accurate statistical annotation of general text. In 3rd International Conference on Language Resources and Evaluation, pages 1499-1504, Las Palmas, Gran Canaria.

A. G. Dimitrov and J. P. Miller. 2001. Neural coding and decoding: communication channels and quantization. Network: Computation in Neural Systems, 12(4):441-472.
B. Dorr. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):271-325.

C. Friedman, P. Kra, and A. Rzhetsky. 2002. Two biomedical sublanguages: a description based on the theories of Zellig Harris. Journal of Biomedical Informatics, 35(4):222-235.

V. Hatzivassiloglou and W. Weng. 2002. Learning anchor verbs for biological interaction patterns from published text articles. International Journal of Medical Informatics, 67:19-32.

L. Hirschman, J. C. Park, J. Tsujii, L. Wong, and C. H. Wu. 2002. Accomplishments and challenges in literature data mining for biology. Journal of Bioinformatics, 18(12):1553-1561.

T. Hofmann. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177-196.

R. Jackendoff. 1990. Semantic Structures. MIT Press, Cambridge, Massachusetts.

A. Korhonen, Y. Krymolowski, and Z. Marx. 2003. Clustering polysemic subcategorization frame distributions semantically. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 64-71, Sapporo, Japan.

A. Korhonen. 2002. Subcategorization Acquisition. Ph.D. thesis, University of Cambridge, UK.

M. Lease and E. Charniak. 2005. Parsing biomedical literature. In Second International Joint Conference on Natural Language Processing, pages 58-69.

G. Leech. 1992. 100 million words of English: the British National Corpus. Language Research, 28(1):1-13.

B. Levin. 1993. English Verb Classes and Alternations. Chicago University Press, Chicago.

P. Merlo and S. Stevenson. 2001. Automatic verb classification based on statistical distributions of argument structure. Computational Linguistics, 27(3):373-408.

G. A. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235-312.

D. Prescher, S. Riezler, and M. Rooth. 2000. Using a probabilistic class-based lexicon for lexical ambiguity resolution. In 18th International Conference on Computational Linguistics, pages 649-655, Saarbrücken, Germany.

I. Spasic, S. Ananiadou, and J. Tsujii. 2005. MaSTerClass: A case-based reasoning system for the classification of biomedical terms. Journal of Bioinformatics, 21(11):2748-2758.

N. Tishby, F. C. Pereira, and W. Bialek. 1999. The information bottleneck method. In Proc. of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368-377.

A. Vlachos, C. Gasperin, I. Lewin, and E. J. Briscoe. 2006. Bootstrapping the recognition and anaphoric linking of named entities in Drosophila articles. In Pacific Symposium on Biocomputing.