Pacific Symposium on Biocomputing 13:556-567 (2008)

ASSISTED CURATION: DOES TEXT MINING REALLY HELP?

BEATRICE ALEX, CLAIRE GROVER, BARRY HADDOW, MIJAIL KABADJOV, EWAN KLEIN, MICHAEL MATTHEWS, STUART ROEBUCK, RICHARD TOBIN, AND XINGLONG WANG
School of Informatics, University of Edinburgh, EH8 9LW, UK
E-mail for correspondence: balex@inf.ed.ac.uk

Although text mining shows considerable promise as a tool for supporting the curation of biomedical text, there is little concrete evidence as to its effectiveness. We report on three experiments measuring the extent to which curation can be speeded up with assistance from Natural Language Processing (NLP), together with subjective feedback from curators on the usability of a curation tool that integrates NLP hypotheses for protein-protein interactions (PPIs). In our curation scenario, we found that a maximum speed-up of one third in curation time can be expected if NLP output is perfectly accurate. The preference of one curator for consistent NLP output and for output with high recall needs to be confirmed in a larger study with several curators.

1. Introduction

Curating biomedical literature into relational databases is a laborious task requiring considerable expertise, and it has been proposed that text mining should make the task easier and less time-consuming [1, 2, 3]. However, to date, most research in this area has focused on developing objective performance metrics for comparing different text mining systems (see [4] for a recent example). In this paper, we describe initial feedback from the use of text mining within a commercial curation effort, and report on experiments that evaluate how well our NLP system helps curators in their task.

This paper is organised as follows. We review related work in Section 2. In Section 3, we introduce the concept of assisted curation and describe the different aspects involved in this process. Section 4 provides an overview of the components of our text mining system, the TXM (text mining) NLP pipeline, and describes the annotated corpus used to train and evaluate this system. In Section 5, we describe and discuss the results of three different curation experiments which test the effectiveness of various versions of the NLP pipeline in assisting curation. Discussion and conclusions follow in Section 6.

2. Related Work

Despite the recent surge in the development of information extraction (IE) systems for automatic curation of biomedical data, spurred on by the BioCreAtIvE II competition [5], there is a lack of user studies that extrinsically evaluate the usefulness of IE as a way to assist curation. Donaldson et al. [6] reported an estimated 70% reduction in curation time of yeast-protein interactions when using the PreBIND/Textomy IE system, designed to recognise abstracts containing protein interactions. This estimate is limited to the document selection component of PreBIND and does not include time savings due to automatic extraction and normalization of named entities (NEs) and relations. Karamanis et al. [7] studied the functionality and usefulness of their curation tool, ensuring that integrating NLP output does not impede curators in their work. In three curation experiments with one curator, they found evidence that improving their curation tool and integrating NLP speeds up curation compared to using a tool prototype with which the curator was not experienced at the start of the experiment. Karamanis et al.
[7] mainly focus on tool functionality and presentational issues. They did not analyse which aspects of the NLP output were useful to curators, how it affected their work, or how the NLP pipeline could be tuned to simplify the curator's job. Recently, Hearst et al. [8] reported on a pilot usability study showing positive reactions to figure display and caption search in bioscience journal search interfaces. Regarding non-biomedical applications, Kristjansson et al. [9] describe an interactive IE tool with constraint propagation to reduce human effort in address form filling. They show that highlighting contact details in unstructured text, pre-populating form fields, and interactive error correction by the user reduce the cognitive load on users when entering address details into a database. This reduction is reflected in the expected number of user actions, measured as the number of clicks needed to enter all fields. They also integrated confidence values to inform the user about the reliability of extracted information.

3. Assisted Curation

The curation task that we discuss in this paper requires curators to identify examples of protein-protein interactions (PPIs) in biomedical literature. The initial step involves retrieving a set of papers that match criteria for the curation domain. After a further step of filtering the papers into promising candidates for curation, curators proceed on a paper-by-paper basis. Using an in-house editing and verification tool (henceforth referred to as the 'Editor'), the curators read through an electronic version of the paper and enter retrieved information into a template which is then used to add a record to a relational database.

[Figure 1. Information flow in the curation process: a paper is processed by the NLP engine, which proposes candidate NEs and PPIs; the curator reviews these in the interactive Editor and commits records to the PPI database.]

Curation is a laborious task which requires considerable expertise. The curator spends a significant amount of time reading through a paper and trying to locate material that might contain curatable facts. Can NLP help the curator work more efficiently? Our basic assumption, which is commonly held [1], is that IE techniques are likely to be effective in identifying relevant entities and relations. More specifically, we assume that NLP can propose candidate PPIs; if the curators restrict their attention to these candidates, then the time required to explore the paper can be reduced. Note that we are not proposing that NLP should replace human curators: given the current state of the art, only expert humans can ensure that the captured data is of sufficiently high quality to be entered into databases.

Our curation scenario is illustrated in Figure 1. The source paper is processed by the NLP engine, producing a set of normalised NEs and candidate PPIs. The original paper and the NLP output are fed into the interactive Editor, which then displays a combined view to the curator. The curator decides which information to enter into the Editor, and this is then communicated to a backend database. In one sense, we can see this scenario as one in which the software provides decision support to the human. Although in broad terms the decision is about what facts, if any, to curate, it can be broken down into smaller subtasks. Given a sentence S, (i) do the terms in S name proteins? If so, (ii) which proteins do they name?
And (iii), given two protein mentions, do the proteins stand in an interaction relation? These decision subtasks correspond to three components of the NLP engine: (i) Named Entity Recognition, (ii) Term Identification, and (iii) Relation Extraction. We will examine each of these in turn shortly, but first we want to consider further the kinds of choices that need to be made in examining the usability of NLP for curation.

A crucial observation is that the NLP output is bound to be imperfect. How can the curator make use of an unreliable assistant? First, there are interface design issues: what information is displayed to the curator, in what form, and what kinds of manipulation can the curator carry out? Second, what is the division of labour between the human and the software? For example, there might be some decisions which are relatively cheap for the curator to make, such as deciding what species is associated with a protein mention, and which can then help the software provide a more focused set of candidates for term identification. Third, what are the optimal functional characteristics of the NLP engine, given that complete reliability is not currently attainable? For example, should the NLP favour recall over precision, or vice versa? Although the first and second dimensions are clearly important, in this paper we focus on the third, namely the functional characteristics of our system.

4. TXM Pipeline

The output displayed in the interactive curation Editor is produced by the TXM pipeline, an IE pipeline that is being developed for use in biomedical IE tasks. The version of the pipeline used in the experiments described here focuses on extracting proteins, their interactions, and other entities which are used to enrich the interactions with extra information of biomedical interest. Proteins are also normalised (i.e., mapped to identifiers in an appropriate database) using the term identification (TI) component of the pipeline. This section gives a brief description of the pipeline and of the corpus used to develop and test it, with more implementation details provided by the cited references.

TXM Corpus. In order to use machine learning approaches for named entity recognition (NER) and relation extraction (RE), and for evaluating the pipeline components, an annotated corpus was produced by a team of domain experts. Since the annotations contain information about proteins and their interactions, it is referred to as the enriched protein-protein interaction (EPPI) corpus. The corpus consists of 217 full-text papers selected from PubMed and PubMed Central as containing experimentally proven PPIs. The papers, retrieved in XML or HTML, were converted to an internal XML format. Nine types of entities (Complex, CellLine, DrugCompound, ExperimentalMethod, Fusion, Fragment, Modification, Mutant, and Protein) were annotated, as well as PPI relations and FRAG relations (which link Fragments or Mutants to their parent proteins). Furthermore, proteins were normalised to their RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/index.html) identifiers, and PPIs were enriched with properties and attributes. The properties added to the PPIs are IsProven, IsDirect and IsPositive, and the possible attributes are CellLine, DrugTreatment, ExperimentalMethod or ModificationType.
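To make the shape of these enriched annotations concrete, the following sketch shows how one enriched PPI record might be represented. The class and field names are ours for illustration and do not reproduce the internal TXM/EPPI format; the RefSeq identifiers shown are examples only.

```python
# Illustrative sketch of one enriched PPI record (names and
# identifiers are ours, not the internal TXM/EPPI schema).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProteinMention:
    surface: str              # text span as it appears in the paper
    refseq_ids: List[str]     # candidate RefSeq identifiers (a "bag")

@dataclass
class EnrichedPPI:
    protein_a: ProteinMention
    protein_b: ProteinMention
    # properties annotated on each PPI
    is_proven: bool = True
    is_direct: bool = True
    is_positive: bool = True
    # optional attributes linking the PPI to other annotated entities
    cell_line: Optional[str] = None
    drug_treatment: Optional[str] = None
    experimental_method: Optional[str] = None
    modification_type: Optional[str] = None

ppi = EnrichedPPI(
    ProteinMention("p53", ["NP_000537"]),
    ProteinMention("MDM2", ["NP_002383"]),
    experimental_method="co-immunoprecipitation",
)
```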
More details on properties and attributes can be found in Haddow and Matthews [10]. The inter-annotator agreement (IAA), measured on a sample of doubly and triply annotated papers, amounts to an overall micro-averaged F1-score (micro-averaging gives each example equal weight in the evaluation) of 84.9 for NEs, 88.4 for normalisations, 64.8 for PPI relations, 87.1 for properties and 59.6 for attributes. The EPPI corpus (approximately 2 million tokens) is divided into three sections: TRAIN (66%), DEVTEST (17%), and TEST (17%).

Pre-processing. A set of pre-processing steps in the pipeline was implemented using the LT-XML2 tools [11]. The pre-processing performs sentence boundary detection and tokenization, adds useful linguistic markup such as chunks, part-of-speech tags, lemmas, verb stems, and abbreviation information, and also attaches NCBI taxonomy identifiers to any species-related terms.

Named Entity Recognition. The NER component is based on the C&C tagger, a Maximum Entropy Markov Model (MEMM) tagger developed by Curran and Clark [12], augmented with extra features and gazetteers tailored to the domain, as described fully in Alex et al. [13]. The C&C tagger allows the entity decision threshold to be adjusted through its prior file, which has the effect of varying the precision-recall balance in the output of the component. This prior file was modified to produce the high precision and high recall models used in the assisted curation experiment described in Section 5.3.

Term Identification. The TI component uses a rule-based fuzzy matcher to produce a set of candidate identifiers for each recognized protein. Species are assigned to proteins using a machine-learning-based tagger trained on contextual and species word features [14]. The species information and a set of heuristics are used to choose the most probable identifiers from the set of candidates proposed by the matcher. The evaluation metric for the TI system is bag accuracy: if the system produces multiple identifiers for an entity mention, it is counted as a hit as long as one of the identifiers is correct. The rationale is that, since a TI system that outputs a single identifier is not accurate enough, generating a bag of choices increases the chance of including the correct one. This can assist curators, as the right identifier can be chosen from the bag (see [15] for more details).

Relation Extraction. Intra-sentential PPI and FRAG relations are both extracted using the system described in Nielsen [16], with inter-sentential FRAG relations addressed using a maximum entropy model trained on features derived from the entities, their context, and other entities in the vicinity. Enriching the relations with properties and attributes is implemented using a mixture of machine learning and rule-based methods described in Haddow and Matthews [10].

Component Performance. The performance of the IE components of the pipeline (NER, TI, and RE) is measured using precision, recall, and F1-score (except TI, which uses bag accuracy as described above), by testing each component in isolation and comparing its output to the annotated data. For example, RE is tested using the annotated (gold) entities as its input, rather than the output of NER, so that NER errors do not affect the score for RE.
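As a concrete reference for how the figures in Table 1 are computed, the following sketch derives precision, recall and F1 from aggregated true positive (TP), false positive (FP) and false negative (FN) counts, and shows the bag accuracy criterion used for TI. The function names are ours.

```python
# Sketch of the evaluation measures used in Table 1 (function names ours).
from typing import List, Tuple

def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from aggregated TP/FP/FN counts
    (micro-averaging aggregates the counts before dividing)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def bag_accuracy(bags: List[List[str]], gold: List[str]) -> float:
    """A TI mention counts as a hit if any identifier in its
    candidate bag matches the gold identifier."""
    hits = sum(g in bag for bag, g in zip(bags, gold))
    return hits / len(gold)

# NER row of Table 1: TP=19,925, FP=5,964, FN=7,755
p, r, f = prf(19925, 5964, 7755)
print(f"P={100*p:.2f} R={100*r:.2f} F1={100*f:.2f}")  # P=76.96 R=71.98 F1=74.39
```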
Table 1 shows the performance of each component when tested on DEVTEST, with the machine learning components trained on TRAIN.

Table 1. Performance of pipeline components, tested in isolation on DEVTEST and trained on TRAIN.

Component                       TP      FP      FN      Precision  Recall  F1
NER (micro-average)             19,925  5,964   7,755   76.96      71.98   74.39
RE (PPI)                        1,208   1,173   1,080   50.73      52.80   51.75
RE (FRAG)                       1,699   963     1,466   63.82      53.68   58.31
RE (properties micro-average)   3,041   567     579     84.28      84.01   84.14
RE (attributes micro-average)   483     822     327     37.01      59.63   45.67

Component                       TP      FP      FN      Precision  Recall  Bag Acc.
TI (micro-average)              9,078   91,396  2,843   9.04       76.15   76.15

5. Curation Experiments

We conducted three curation experiments with and without assistance from the output of the NLP pipeline or gold standard annotations (GSA). In all of the experiments, curators were asked to curate several documents according to internal guidelines. Each paper is assigned a curation ID for which curators create several records corresponding to the curatable information in the document. Curators always use an interactive Editor which allows them to see the document on screen and enter the curatable information into record forms. All curators are experienced in using the interactive curation Editor, but not necessarily familiar with assisted curation. After completing the curation for each paper, they were asked to fill in a questionnaire.

5.1. Manual versus Assisted Curation

In the first experiment, 4 curators curated 4 papers in 3 different conditions:

- MANUAL: without assistance
- NLP-assisted: with integrated NLP pipeline output
- GSA-assisted: with integrated gold standard annotations

Each curator processed a paper only once, in one specific condition, without being informed about the type of assistance (GSA or NLP), if any. This experiment aims to answer the following questions. First, does the NLP output currently integrated in the interactive Editor accelerate curation? Second, do human gold standard annotations assist curators in their work, i.e., how helpful would NLP be to a curator if it performed as well as a human annotator?

Table 2. Total number of records curated in each condition and average curation speed per record.

Condition  Records  Average time per record  StDev
MANUAL     121      312s                     327s
GSA        170      205s                     52s
NLP        141      243s                     36s

Table 3. Average questionnaire scores, ranging from (1) strongly agree to (5) strongly disagree.

Statement                                        GSA    NLP
NLP was helpful in curating this document        2.75   3.25
NLP speeded up the curation of this paper        3.75   3.75
NE annotations were useful for curation          2.50   3.00
Normalizations of NEs were useful for curation   2.75   2.75
PPIs were useful for curation                    3.50   3.25

Table 2 shows that, across all four papers, the fewest records (121) were curated in the MANUAL condition, 20 more records (+16.5%) were curated with NLP assistance, and 49 more (+40.5%) with GSA assistance. This indicates that providing NLP output helps curators to spot more information. Ongoing work involves a senior curator assessing each curated record in terms of quality and coverage. This will provide evidence for whether this additional information is also curatable, i.e., how the NLP output affects curation accuracy, and will also give an idea of inter-curator agreement under the different conditions.
As each curator curated in all three conditions but never curated the same paper twice, inter-document and inter-curator variability must be considered. We therefore report curation speed per condition as the average time to curate a record. Manual curation is the most time-consuming, followed by NLP-assisted curation (22% faster) and GSA-assisted curation (34% faster). Assisted curation clearly speeds up the work of a curator, and a maximum reduction of one third in manual curation time can be expected if the NLP pipeline performed with perfect accuracy.

In the questionnaire, curators rated GSA assistance slightly more positively than NLP assistance (see Table 3). However, they were not convinced that either condition speeded up their work, even though the time measurements show otherwise. Considering that they were not familiar with assisted curation prior to the experiment, a certain learning effect should be allowed for. Moreover, they may have had relatively high expectations of the NLP output. In fact, individual feedback in the questionnaire shows that NLP assistance was useful for some papers and some curators, but not others. Further feedback in the questionnaire concerns aspects of visualization (e.g. PDF conversion errors) and interface design (e.g. inadequate display of information linked to NE normalizations) in the interactive Editor. Regarding the NLP output, curators also requested more accurate identification of PPI candidates, e.g. in coordinations like "A and B interact with C and D", and more consistency in the NLP output.

5.2. NLP Consistency

The NLP pipeline extracts information based on context features and may, for example, recognize a string as a protein in one part of the document but as a drug/compound in another, or assign different species to the same protein mentioned multiple times in the document. While this inconsistency may not be erroneous, the curators' feedback is that consistency would be preferred. To test this hypothesis, and to determine whether consistent NLP output helps to speed up curation, we conducted a second experiment. One curator was asked to curate 10 papers containing NLP output made consistent in one of two ways, sketched in the code below. In 5 papers, all NEs recognized by the pipeline were propagated throughout the document (CONSISTENCY1). In the other 5 papers, only the most frequent NE recognized for a particular surface form was propagated, while less frequent ones were removed (CONSISTENCY2). In both conditions, the most frequent protein identifier bag determined by the TI component is propagated for each surface form, and enriched PPIs are extracted as usual. After completing the questionnaire, the curator viewed a second version of the paper in which consistency in the NLP output was not enforced, and filled in a second questionnaire comparing the two versions.

Table 4. Average curation speed per record in each consistency condition.

Condition      Average time per record  StDev
CONSISTENCY1   128s                     43s
CONSISTENCY2   92s                      22s

Table 4 shows that the curator curated 28% faster given the second type of consistency. However, examining the answers to the questionnaire listed in Table 5, it appears that the curator actually considerably preferred the first type of consistency, where all NEs recognized by the NER component are propagated throughout the paper.
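The two consistency strategies can be summarised with the following sketch, under the assumption that NE predictions are (surface form, entity type) pairs; the function names and data representation are ours, not the pipeline's internals. The per-form decisions would then be applied to every occurrence of the form in the document.

```python
# Illustrative sketch of the two consistency strategies (names ours).
from collections import Counter
from typing import Dict, List, Set, Tuple

def consistency1(mentions: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """CONSISTENCY1: propagate every entity type ever assigned
    to a surface form throughout the document."""
    types: Dict[str, Set[str]] = {}
    for surface, etype in mentions:
        types.setdefault(surface, set()).add(etype)
    return types

def consistency2(mentions: List[Tuple[str, str]]) -> Dict[str, str]:
    """CONSISTENCY2: keep only the most frequent entity type
    per surface form; less frequent types are removed."""
    counts: Dict[str, Counter] = {}
    for surface, etype in mentions:
        counts.setdefault(surface, Counter())[etype] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

mentions = [("TNF", "Protein"), ("TNF", "Protein"), ("TNF", "DrugCompound")]
print(consistency1(mentions))  # TNF keeps both Protein and DrugCompound
print(consistency2(mentions))  # {'TNF': 'Protein'}
```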
Table 5. Average questionnaire scores, ranging from (1) strongly agree to (5) strongly disagree. In questionnaire 2, consistent (CONSISTENCY1/2) NLP output (A) is compared to baseline NLP output (B).

Statement                                                  CONSISTENCY1  CONSISTENCY2
Questionnaire 1
NLP output was helpful for curation                        1.6           2.6
NLP output speeded up curation                             1.8           3.0
NEs were useful for curation                               1.4           2.6
Normalizations of NEs were useful for curation             3.2           4.4
PPIs were useful for curation                              3.6           3.6
Questionnaire 2
A was more useful for curation than B would have been      2.6           4.0
A speeded up the curation process more than B would have   3.2           4.0
A appeared more accurate than B                            4.0           4.2
A missed important information compared to B               4.0           1.8
A contained too much information compared to B             4.2           4.6

While this speed-up in curation may be attractive from a commercial perspective, this experiment illustrates how important it is to get feedback from users, who may well reject a technology altogether if they are not happy working with it.

5.3. Optimizing for Precision or Recall

Currently, all pipeline components are optimized for F1-score, resulting in a relative balance between the correctness and the coverage of extracted information, i.e. precision and recall. In previous curation rounds, curators felt they could not completely trust the NLP output, as some of the information displayed was incorrect. The final curation experiment tests whether optimizing the NLP pipeline for F1 is ideal in assisted curation, or whether curators would prefer a system that is more correct but misses some curatable information (high precision), or one that extracts most of the curatable information along with many non-curatable or incorrect facts (high recall).

In this experiment, only the NER component was adapted to increase its precision or recall. This is done by changing the threshold in the C&C prior file so as to modify the tag probabilities assigned by the C&C tagger. (Internal and external features were not optimized for precision or recall, which could be done to increase the effects even more; the TI and RE components were also left unmodified for this experiment.) The intrinsic evaluation scores of the NER component optimized for F1, precision, or recall are listed in Table 6.

Table 6. NER optimized for F1-score versus high precision (P) and high recall (R), along with corresponding counts of true positives (TP), false positives (FP), and false negatives (FN).

Setting   TP      FP      FN      P      R      F1
High F1   20,091  6,085   7,589   76.75  72.58  74.61
High P    11,836  1,511   15,844  88.68  42.76  57.70
High R    21,880  20,653  5,800   51.44  79.05  62.32

In the experiment, one curator processed 10 papers in random order containing NLP output, 5 with high recall NER and 5 with high precision NER. Note that, to simplify the experiment, the curator did not normalise entities in this curation round. After completing the questionnaire, the curator viewed a second version of the paper with NLP output based on F1-optimized NER and filled in a second questionnaire comparing the two versions. The results in Table 7 show that the curator rated all aspects of the high recall NER condition more positively than those of the high precision NER condition. Moreover, the curator tended to prefer NLP output with F1-optimized NER over that containing high precision NER, and NLP output containing high recall NER over that with high F1 NER.
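The principle behind the high precision and high recall settings can be illustrated with a simple sketch: thresholding a tagger's per-mention confidence trades precision against recall. The actual C&C prior-file mechanism is internal to the tagger, so the code below, with names of our choosing, only demonstrates the general idea.

```python
# Generic sketch of a precision-recall trade-off via a confidence
# threshold (the actual C&C prior-file mechanism is tagger-internal).
from typing import List, Tuple

def filter_mentions(scored: List[Tuple[str, float]],
                    threshold: float) -> List[str]:
    """Keep mentions whose confidence reaches the threshold.
    A high threshold favours precision; a low one favours recall."""
    return [mention for mention, conf in scored if conf >= threshold]

scored = [("p53", 0.97), ("bcl-2", 0.62), ("ras", 0.31)]
print(filter_mentions(scored, 0.9))  # high precision: ['p53']
print(filter_mentions(scored, 0.2))  # high recall: ['p53', 'bcl-2', 'ras']
```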
Although the number of curated papers is small, this curator seems to prefer NLP output that captures more curatable information but is overall less accurate. The curator noted that, since her curation style involves skim-reading, the NLP output helped her to spot information that she would otherwise have missed. The results of this experiment could therefore be explained simply by curation style; another curator with a more meticulous reading style may actually prefer the more precise and trustworthy information extracted by the NLP pipeline. Clearly, this last curation experiment needs to be repeated with several curators, curating a larger set of papers, and providing additional timing information per curated record. In general, it would be useful to develop a system that allows curators to filter the information presented on screen dynamically, possibly based on confidence values, as integrated in the tool described by Kristjansson et al. [9].

Table 7. Average questionnaire scores, ranging from (1) strongly agree to (5) strongly disagree. In questionnaire 2, high precision/high recall (HighP/HighR) NER output (A) is compared to F1-optimized NER output (B).

Statement                                                  HighP NER  HighR NER
Questionnaire 1
NLP output was helpful for curation                        3.0        2.2
NLP output speeded up curation                             3.4        2.4
NEs were useful for curation                               3.0        2.0
PPIs were useful for curation                              3.2        2.5
Questionnaire 2
A was more useful for curation than B would have been      4.2        2.6
A speeded up the curation process more than B would have   4.2        3.0
A appeared more accurate than B                            4.4        2.8
A missed important information compared to B               1.4        3.2
A contained too much information compared to B             4.8        3.8

6. Discussion and Conclusions

This paper has focused on optimizing the functional characteristics of an NLP pipeline for assisted curation, given that current text mining techniques for biomedical IE are not completely reliable. Starting with the hypothesis that assisted curation can support the task of a curator, we found that a maximum reduction of one third in curation time can be expected if NLP output is perfectly accurate. This shows that biomedical text mining can assist in curation. Moreover, NLP assistance led to the curation of more records, although the validity of this additional information still needs to be confirmed by a senior curator.

In extrinsic evaluation of the NLP pipeline in curation, we tested several optimizations of the output in order to determine the type of assistance that is preferred by curators. We found that the curator prefers consistency, with all NEs propagated throughout the document, even though this preference is not reflected in the average time measurements for curating a record. When comparing curation with NLP output containing high recall versus high precision NE predictions, the curator clearly preferred the former. While this result illustrates that optimizing an IE system for F1-score does not necessarily result in optimal performance in assisted curation, the experiment must be repeated with several curators in view of different curation styles. Overall, we learnt that measuring curation in terms of curation time is not sufficient to capture the usefulness of NLP output for assisted curation. As recognized by Karamanis et al. [7], it is difficult to capture a curator's performance in a single quantitative metric.
The average time to curate a record is clearly not sufficient on its own to capture all the factors involved in the curation process. It is important to work closely with the users of a curation system in order to identify the helpful and hindering aspects of such technology. In future work, we will conduct further curation experiments to determine the merit of high recall and high precision NLP output for the curation task. We will also invest time in integrating confidence values for extracted information into the interactive Editor.

Acknowledgements

This work was carried out as part of an ITI Life Sciences Scotland (http://www.itilifesciences.com) research programme with Cognia EU (http://www.cognia.com) and the University of Edinburgh. The authors are very grateful to the curators at Cognia EU who participated in the experiments. The in-house curation tool used for this work is the subject of International Patent Application No. PCT/GB2007/001170.

References

1. A. S. Yeh, L. Hirschman, and A. Morgan. Evaluation of text data mining for database curation: Lessons learned from the KDD challenge cup. Bioinformatics, 19(Suppl 1):i331-i339, 2003.
2. D. Rebholz-Schuhmann, H. Kirsch, and F. Couto. Facts from text - is text mining ready to deliver? PLoS Biology, 3(2), 2005.
3. H. Xu, D. Krupke, J. Blake, and C. Friedman. A natural language processing (NLP) tool to assist in the curation of the laboratory mouse tumor biology database. Proceedings of the AMIA 2006 Annual Symposium, page 1150, 2006.
4. L. Hirschman, M. Krallinger, and A. Valencia, editors. Second BioCreative Challenge Evaluation Workshop. Fundación CNIO Carlos III, Madrid, Spain, 2007.
5. M. Krallinger, F. Leitner, and A. Valencia. Assessment of the second BioCreative PPI task: Automatic extraction of protein-protein interactions. In Proceedings of the Second BioCreative Challenge Evaluation Workshop, pages 41-54, Madrid, Spain, 2007.
6. I. Donaldson, J. Martin, B. de Bruijn, C. Wolting, V. Lay, B. Tuekam, S. Zhang, B. Baskin, G. D. Bader, K. Michalickova, T. Pawson, and C. W. V. Hogue. PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics, 4(11), 2003.
7. N. Karamanis, I. Lewin, R. Seal, R. Drysdale, and E. Briscoe. Integrating natural language processing with FlyBase curation. In Proceedings of PSB 2007, pages 245-256, Maui, Hawaii, 2007.
8. M. A. Hearst, A. Divoli, J. Ye, and M. A. Wooldridge. Exploring the efficacy of caption search for bioscience journal search interfaces. In Proceedings of BioNLP 2007, pages 73-80, Prague, Czech Republic, 2007.
9. T. T. Kristjansson, A. Culotta, P. Viola, and A. McCallum. Interactive information extraction with constrained conditional random fields. In Deborah L. McGuinness and George Ferguson, editors, Proceedings of AAAI 2004, pages 412-418, San Jose, US, 2004.
10. B. Haddow and M. Matthews. The extraction of enriched protein-protein interactions from biomedical text. In Proceedings of BioNLP 2007, pages 145-152, Prague, Czech Republic, 2007.
11. C. Grover and R. Tobin. Rule-based chunking and reusability. In Proceedings of LREC 2006, pages 873-878, Genoa, Italy, 2006.
12. J. Curran and S. Clark. Language independent NER using a maximum entropy tagger. In Proceedings of CoNLL-2003, pages 164-167, Edmonton, Canada, 2003.
13. B. Alex, B. Haddow, and C. Grover. Recognising nested named entities in biomedical text.
In Proceedings of BioNLP 2007, pages 65-72, Prague, Czech Republic, 2007.
14. X. Wang. Rule-based protein term identification with help from automatic species tagging. In Proceedings of CICLING 2007, pages 288-298, Mexico City, Mexico, 2007.
15. X. Wang and M. Matthews. Comparing usability of matching techniques for normalising biomedical named entities. In Proceedings of PSB 2008, 2008.
16. L. A. Nielsen. Extracting protein-protein interactions using simple contextual features. In Proceedings of BioNLP 2006, pages 120-121, New York, US, 2006.