Concept-Based Biomedical Text Retrieval Ming Zhong Depar tment of Computer Science York University Toronto, Canada Xiangji Huang School of Information Technology York University Toronto, Canada ming@cs.yorku.ca jhuang@yorku.ca ABSTRACT One challenging problem for biomedical text retrieval is to find accurate synonyms or name variants for biomedical entities. In this paper, we propose a new concept-based approach to tackle this problem. In this approach, a set of concepts instead of keywords will be extracted from a query first. Then these concepts will be used for retrieval purpose. The experiment results show that the proposed approach can boost the retrieval performance and it generates very good results on 2005 TREC Genomics data sets. Table 1: Examples for Break-p oints A space or hyphen in the middle A place b etween the lowercase alphab etical character and the following upp ercase alphab etical character and vice versa A place b efore or after a Greek letter, such as alpha,b eta, gamma, etc A place b etween the digit character and the following alphab etical character and vice versa A place b efore the ending characters `I', `I I', `I I I',`P',`R', `A', `B', `G', `E', `K' of a word if the length of the word is greater than 3 Categories and Subject Descriptors H.3.3 [Information Systems]: Information Search and Retrieval General Terms Experimentation Dredze and Tong [1] proposed a method only for the names which have the form "ABC1". However, their methods are either tedious or incomplete. In addition, how to organize these name variants into a concept is not clear. To address these issues, we propose a concept based approach which includes concept detection, concept expansion and concept representation. 2. Keywords Biomedical Text Retrieval, Concept, Information Retrieval CONCEPT DETECTION & EXPANSION 1. INTRODUCTION Information Retrieval (IR) in the context of biomedical databases has the following three ma jor problems [3]: the frequent use of (possibly non-standardized) acronyms, the presence of homonyms (the same word referring to two or more different entities) and synonyms (two or more words referring to the same entity). How to deal with an abundant number of lexical variants of the same term is a challenging task in biomedical IR. In biomedical domain, using one or two keywords to retrieve the corresponding concept is usually not enough. Thus, concept-oriented retrieval techniques have drawn much attention recently from researchers since traditional keyword-based retrieval fails to meet the requirements for both precision and recall. TREC Genomics Track [7] provides a platform for evaluation of information retrieval systems in the genomics domain. In the 2005 Genomic Track, most of groups agreed that methods need to be developed for expanding the query to include a more complete set of name variants for a biomedical entity. Huang, Zhong and Luo [6] proposed a method to extract name variants from domain-specific database. Ando, Copyright is held by the author/owner. SIGIR'06, August 6­11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008. For the concept detection, a set of concepts will be extracted from the query first. This process can be described as follows: (1) extract the full name of a biomedical entity and its abbreviation by using BioNLP abbreviation extraction function [4]; (2) take out all the stop words; (3) generate a concept 1 . Before we present how to expand the concept, we would like to define the following two terms: "break-point" and "replacement". A break-p oints is a position in a string that can be broken into two parts separated by a space. It can be (1) a hyphen; (2) a position between two letters which have different cases except for the first and second positions of a word; (3) a position between a letter and a digit. For example, the word "185delAG" has 2 break-points. Thus, its variants are "185 delAG", "185del AG" and "185 del AG". More examples for break-points are given in Table 1. A replacement is a substring in a string that can be replaced by a different string and the string after replacing still represents the same meaning as the original one. For example, the number "2" in "COP2" is a replacement which can be replaced by "ii". "alpha" is a replacement that can be replaced by "a" and "beta" is a replacement that can be replaced by "b", and so on. Detailed examples for replacements are given in Table 2. Given the above two definitions, we describe the algorithm for concept expansion as follows: (1) Locate all the 1 The full name and its corresponding abbreviations represent the same concept. 723 Table 2: Examples for Replacements No 1 2 3 4 5 6 7 8 9 10 11 From alpha b eta gamma epsilon kappa I II III receptor type gene To a b g e k 1 2 3 r null genetic No 12 13 14 15 16 17 18 19 20 21 22 From a b g e k 1 2 3 r p mutation To alpha b eta gamma epsilon kappa I II III receptor protein mutant Table 4: Results on 2005 Genomics Track Datasets Run Run1 Run2 Run3 Run4 Run5 Run6 Run7 Run Description without CR CR without break-p oint CR without replacement CR with full functions Top Manual run in 2005 Top Automatic run in 2005 Auto-runs average in 2005 MAP 0.2863 0.2977 0.2951 0.3118 0.3020 0.2888 0.2095 R-Prec 0.3104 0.3139 0.3182 0.3341 0.3212 0.3118 0.2381 Total 3586 3597 3589 3611 3667 3534 n/a Table 3: Name Variants for TGF-b eta1 TGFb eta1 TGF b eta 1 TGF b etaI TGFb 1 TGFbI TGFb eta 1 TGFb etaI TGF b eta I TGF b1 TGFb I TGF b eta1 TGFb eta I TGFb1 TGF b 1 break-points; (2) Create the variants for the first breakpoint. Repeat this process until all the break-points have been processed; (3) For each variant generated from Step 2, substitute the replacement with the corresponding string for creating more variants. For each variant, we search it against the corpus. Only those variants that have non-zero occurrences will be considered as valid ones. For example, TGF-Beta1 is expanded to a set of variants shown in Table 3. All these variants will be considered as the same concept. retrieved. The 1st run describes the performance for the run without using the concept retrieval (CR). The 2nd and 3rd runs are obtained by disabling the "break-point" function and "replacement" function respectively, while the 4th run is obtained by using the concept retrieval with all the functions activated 2 . The 5th and 6th rows are our official submissions to the 2005 Genomics track. From Table 4, we can clearly see that the proposed concept based approach can increase the retrieval performance. In terms of MAP, the relative rate of improvement for the 4th run (0.3118) over the base run (0.2863) is 8.91%. Compared with the best automatic run in TREC 2005 Genomics track, the relative rate of improvement for the 4th run is 7.96% in terms of MAP. 5. CONCLUSION 3. CONCEPT REPRESENTATION For all the variants of a biological entity, we treat them as the same concept. These variants will be connected by the symbol "+" to represent a concept. We implement a new function for concept retrieval on Okapi retrieval system by normalizing the Robertson-Sparck Jones weight (RSJ). More background knowledge about Okapi and RSJ weight can be found in [2]. Assume that there is a concept C = t0 , t1 , . . . , tk , . . . , tn-1 , where tj is a term or phrase referring to an identical concept C , then the RS J weight for C is defined as follows: log N - |st0 st1 . . . stk . . . stn-1 | + 0.5 |st0 st1 . . . stk . . . stn-1 | + 0.5 We have proposed a concept-based approach for the purpose of biomedical text retrieval. Using concepts instead of keywords for the purpose of retrieval works well in the biomedical domain. We find that there may be no performance improvement by just creating variants for the biomedical terms without using concept-based retrieval. Compared with other methods, our approach is independent of any domain database and easy to be implemented. Furthermore, it has been shown to be competitive and effective on the TREC Genomics data sets. 6. ACKNOWLEDGMENTS This study was supported in part by a research grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada. 7. REFERENCES where N is the number of documents in the corpus, stj is the set of documents containing the term or phrase tj , |stj | is the number of documents in this set. In addition, we only tokenize the full name and add each term as a concept into the query with an adjusted query term frequency. 4. EXPERIMENTAL RESULTS We conduct a series of experiments to evaluate the effectiveness of the proposed concept based approach. In the experiments, we preprocess the raw documents by replacing all the hyphens in corpus with spaces. Stemming and pseudo relevance feedback are used in all the experiments. For Okapi BM25, we set k1 , k2 , k3 and b to be 1.4, 0, 8 and 0.55 respectively. Table 5 shows the experimental results on 50 topics provided by TREC 2005 Genomics track. In Table 4, we present the results in three measures (MAP, R-Prec and the total number of relevant documents retrieved). "Total" represents the total number of relevant documents [1] Rie Kub ota Ando, Mark Dredze and Tong Zhang (2005), TREC 2005 Genomics Track Exp eriments at IBM Watson. Proc. of the 14th Text Retrieval Conference, 2005. [2] M. Beaulieu, M. Gatford, Xiang ji Huang, Stephen E. Rob ertson, S. Walker and P. Williams (1996), Okapi at TREC-5. Proceedings of 5th Text REtrieval Conference, pp. 143-166, 1996. [3] Stefan Buttcher, Charles L. A. Clarke, Gordon V. Cormack (2004), Domain-Sp ecific Synonym Expansion and Validation for Biomedical Information Retrieval. Proceedings of the 13th Text Retrieval Conference, 2004. [4] JT Chang, H Schtze and RB Altman (2002). Creating an Online Dictionary of Abbreviations from MEDLINE. Journal of the American Medical Informatics Association. 9(6):612-20, 2002. [5] AM Cohen and WR Hersh (2005), A survey of current work in biomedical text mining, Briefings in Bioinformatics, 6(10):57-71, 2005. [6] Xiangji Huang, Ming Zhong and Luo Si (2005), York University at TREC 2005: Genomics Track. Proc. of the 14th Text Retrieval Conference, 2005. [7] http://ir.ohsu.edu/genomics/ 2 All the runs in our experiments are automatic runs. 724