Large-Scale Testing of Bibliome Informatics Using Pfam Protein Families

Ana G. Maguitman, Andreas Rechtsteiner, Karin Verspoor, Charlie E. Strauss, and Luis M. Rocha
Pacific Symposium on Biocomputing 11:76-87 (2006)

School of Informatics, Indiana University, 1900 East Tenth Street, Bloomington, IN 47408
E-mail: anmaguit@indiana.edu, rocha@indiana.edu
Los Alamos National Laboratory, PO Box 1663, Los Alamos, NM 87545
E-mail: arechtsteiner@gmail.com, verspoor@lanl.gov, cems@lanl.gov
(Address since Oct 2005: Center for Genomics and Bioinformatics, Indiana University, 1001 East 3rd St, Bloomington, IN 47405.)

Literature mining is expected to help not only with automatically sifting through the huge biomedical literature and annotation databases, but also with linking bio-chemical entities to appropriate functional hypotheses. However, there has been very limited success in testing literature mining methods, due to the lack of large, objectively validated test sets or "gold standards". To improve this situation we created a large-scale test of literature mining methods and resources. We report on a specific implementation of this test: how well can the Pfam protein family classification be replicated by independently mining different literature/annotation resources? We test and compare different keyterm sets as well as different algorithms for issuing protein family predictions. We find that protein families can indeed be automatically predicted from the literature. Using words from PubMed abstracts, of 3663 proteins tested, over 75% were correctly assigned to one of 618 Pfam families. For 90% of proteins the correct Pfam family was among the top 5 ranked families. We found that protein family prediction is far superior with keywords extracted from PubMed abstracts than with GO annotations or MeSH keyterms, suggesting that the text itself (in combination with the vector space model) is superior to GO and MeSH as a literature mining resource, at least for detecting protein family membership. Finally, we show that Shannon's entropy can be exploited to improve prediction by facilitating the integration of the different literature sources tested.

1. Introduction

Biology was until recently essentially a hypothesis-driven science, in which experiments were carefully designed to answer one or very few specific questions -- e.g. to test the function of a specific protein in a specific context. In the last decade, fueled by the widespread use of high-throughput technology, we have witnessed the emergence of a more data-driven paradigm for biological research. Since high-throughput experiments are frequently conducted for the sake of discovery rather than hypothesis testing, and due to the sheer number of measured variables they entail, it is very difficult to interpret their results. Moreover, since the goal of many experiments is to uncover bio-chemical and functional information about genes and proteins, there is an obvious need to understand the linkages amongst biological entities in literature and databases which allow us to make inferences.
Literature mining18 is expected to help with those inferences; its objective is to automatically sort through huge collections of literature and suggest the most relevant pieces of information for a specific analysis task, e.g. the annotation of proteins9. Another application is to uncover similarities of genes according to "publication space", or the more tongue-in-cheek term "bibliome"8. Since literature mining hinges on the quality of available sources of literature as well as their linkage to other electronic sources of biological knowledge, it is particularly important to study the quality of the inferences it can provide. Indeed, the bibliome is not just the collection of publications and annotations available; its usefulness ultimately depends on the quality of the linking resources that allow us to associate experimental data with publications and annotations. Interestingly, while literature mining is receiving considerable attention in bioinformatics, it has not hitherto been seriously validated. Towards improving this situation, we present here our large-scale testing and comparison of literature mining algorithms, paired with specific bibliome resources. We present a general method for testing bibliome resources and literature mining algorithms in the context of classification of biological entities. This method formalizes and extends a previous study in which we tested how well the Pfam protein family classification can be inferred from PubMed as indexed by the MeSH keyterm vocabulary16,14. We expand on those results by testing additional bibliome resources, such as GO annotations and text extracted from PubMed abstracts, for the same classification problem. We additionally propose a new method based on Shannon's entropy to integrate results from different bibliome resources, and show that it significantly improves protein family predictions.

2. From Text Mining to the Bibliome: Looking for a "Gold Standard"

There exists extensive cross-linkage amongst biomedical databases which can be exploited for bioinformatics analysis. For instance, gene chip identifiers can be linked to protein entries in SWISSPROT, which in turn can be linked to PubMed documents. Indeed, in the bibliome, documents are linked to or indexed by various semantic (textual) tags which describe their content; these include Medical Subject Headings (MeSH), Gene Ontology (GO) annotations, PubMed abstract text, HUGO nomenclature for human genes, GenBank accession numbers for gene sequences, etc. Therefore, in order to fully capture the potential of the bibliome for the analysis, integration and dissemination of biological knowledge, literature mining needs, in addition to research on text mining and natural language processing, more research on the quality of the links amongst the resources that make up the bibliome. Text mining is particularly applicable to the discovery of relevant information inside text -- e.g. discovering a portion of text in a document appropriate to annotate a given protein9. But given the highly cross-linked nature of the bibliome, in addition to text mining, we need to approach bibliome informatics from an Information Retrieval (IR) perspective. Several research groups have been exploiting the cross-linked nature of the bibliome, particularly with semantic annotations such as MeSH and GO, for instance the systems developed by Masys et al.13 and Jenssen et al.11 for identifying sets of keyterms associated with sets of genes.
Tools that are similar in spirit are PubMatrix2, MedMiner22, MeSHmap21 and others. While these systems are potentially very useful, the quality of their results has not been thoroughly validated. For instance, we have applied Latent Semantic Analysis (LSA) to discover functional themes15,14 from the literature for microarray experiments dealing with the response to human cytomegalovirus infection. Though the functional themes we discovered automatically matched our previously published manual annotation of the same experiments4, and even uncovered novel functional themes15,14, such validation by a few expert biologists is done without access to a "gold standard". By "gold standard" we mean a standardized test data set which allows us to decide unambiguously whether a given inference is correct. Homayouni et al were able to build such a gold standard for evaluating the performance of LSA, but only by focusing on a very small set of genes10. Unfortunately, for data-driven experiments there is no clear expectation of what functional associations are to be found. Therefore, bibliome tools are typically tested by sampling some of their output and presenting it to experts. The problem is that experts typically disagree, and no single expert can cover all the topics involved. Even more systematic approaches such as BioCreative suffer from variability in experts' opinions5,9,3, leading to potentially unreliable answers.

3. Large-scale standard for bibliome informatics: Methods and Data

3.1. A general large-scale bibliome informatics test

The first requirement for our testing methodology is the existence of a biological classification C, accepted as a true standard and defined on a large set P of biological entities p (e.g. proteins or genes), where each entity p is associated with a single class C(p). Given that the bibliome is defined not only by publication and annotation resources, but also by their linkage, we also need a high-quality linking resource L_D between P and the documents of some publication or annotation resource D, where L_D(p) denotes the set of documents of D associated with entity p. Given a C and L_D pair, our large-scale bibliome informatics test (LSBIT) can be applied to any pair ⟨A, K_D⟩ of classification algorithm A and keyterm set K_D extracted from D, where K_D(p) denotes the set of keyterms that index the documents L_D(p). (We use keyterm to refer to both keywords and keyphrases, depending on available resources.) The objective of the LSBIT is then to establish how well a given algorithm A can discover a known classification C of biological entities P, from a publication resource D, using an associated keyterm set K_D and a bibliome linking resource L_D between P and D.
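To make the protocol concrete, a minimal Python sketch of the LSBIT evaluation loop follows. The data structures and the rank_families stand-in are hypothetical illustrations, not part of the study; the loop simply measures how often the known class C(p) appears among the top-n families returned by a given ⟨A, K_D⟩ pair.

```python
# Minimal sketch of the LSBIT evaluation loop (hypothetical data structures).
# true_class: dict mapping protein id -> Pfam family (the standard classification C).
# rank_families: stand-in for a classifier A applied to a keyterm set K_D; given a
#                protein id it returns Pfam families ranked from most to least likely.

def lsbit_success_at(true_class, rank_families, top_n=(1, 2, 5, 10, 50)):
    """Fraction of proteins whose true family appears among the top-n ranked families."""
    hits = {n: 0 for n in top_n}
    for protein, family in true_class.items():
        ranking = rank_families(protein)       # list of Pfam families, best first
        for n in top_n:
            if family in ranking[:n]:
                hits[n] += 1
    total = len(true_class)
    return {n: hits[n] / total for n in top_n}
```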
3.2. Bibliome Resources

3.2.1. Defining C and L_D

We chose the Pfam protein sequence classification20 as C for our tests. Pfam is a manually curated collection of protein families, currently encompassing several thousand families. Pfam is an ideal classification for objective evaluation and comparison of bibliome informatics because it is based on sequence, a physical property of proteins that typically leads to functional similarity. Having settled on Pfam for our classification standard C, our biological entities P are proteins. Therefore, a most appropriate linking resource L_D to test various ⟨A, K_D⟩ is the SWISSPROT (now UNIPROT19) database, a protein sequence database curated by experts. Besides the amino acid sequence of a protein, it also lists different types of annotations, cross-references to other databases (including the Pfam family of a protein), and references to relevant publications for each protein. Therefore the LSBIT, with C = Pfam and L_D = SWISSPROT, can be applied to classify proteins p under various pairs ⟨A, K_D⟩. The expert nature of Pfam and SWISSPROT allows us to use them as a standard for the classification of proteins.

However, before the LSBIT can be performed, some preprocessing of the set of proteins to be tested is necessary. We extracted all the SWISSPROT protein IDs which contained a single Pfam classification. Multiple Pfam family assignments occur for 15% of all SWISSPROT proteins, possibly because some proteins have more than one classified domain. Because we are interested in constructing a large, unambiguous data set for validating bibliome methods, we removed multi-classification proteins. We do not consider those to be erroneous in any way, but they simply do not serve the purposes of our testing standard, which needs to be unambiguous. After pre-processing (details in14,16), we obtained a dataset with |P| = 15,217 proteins from |C| = 1611 Pfam families. Each protein p is associated with a unique Pfam family C(p).

3.2.2. Defining publication/annotation resources D

Since SWISSPROT lists PubMed IDs, a very natural publication resource is PubMed; let us denote it as D_PM. Via SWISSPROT, our linking resource L_D, we retrieve different keyterm sets K_D from PubMed, detailed in the next subsection. Another annotation resource we used was GO, which we denote as D_GO, derived from the GOA/UNIPROT dataset provided by the GOA project, run by the European Bioinformatics Institute (EBI). Because we needed to compare and integrate the tests using D_PM and D_GO, we looked at a reduced set of proteins for which links to both PubMed publications and GO annotations were found, that is P^r = {p : L_DPM(p) ≠ ∅ and L_DGO(p) ≠ ∅}. We also restricted our study to Pfam families with at least 3 proteins. This reduced dataset P^r contains 3663 proteins from 618 distinct Pfam families, where 179 of these families contain only 3 proteins and the largest 3 families contain 17 proteins. Mean and median family size are 5.9 and 5 proteins, respectively; the standard deviation is 3.3.
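The pre-processing just described can be summarized in a short sketch. The record format (fields "pfam", "pubmed", "go") and the function name are assumptions made for illustration; only the filtering logic (single Pfam family, links in both resources, non-IEA GO terms, families with at least 3 proteins) follows the text.

```python
from collections import Counter

# Sketch of the dataset filters described above, under an assumed record format:
# each record carries a protein id, its Pfam families, linked PubMed ids, and GO
# annotations as (term, evidence_code) pairs.

def build_reduced_set(records, min_family_size=3):
    kept = []
    for r in records:
        go_terms = [t for t, ev in r["go"] if ev != "IEA"]   # drop electronic annotations
        single_family = len(r["pfam"]) == 1                   # unambiguous Pfam assignment
        if single_family and r["pubmed"] and go_terms:        # links in both resources
            kept.append((r["id"], r["pfam"][0]))
    sizes = Counter(fam for _, fam in kept)                   # family sizes in the kept set
    return [(pid, fam) for pid, fam in kept if sizes[fam] >= min_family_size]
```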
3.3. Keyterm Sets K_D to Test

We have adapted the IR vector space model1 to represent proteins as vectors in a keyterm space. Four different keyterm sets were used in our analysis. Three of these sets contain keyterms extracted from PubMed (D_PM) publications associated with proteins, while the fourth is based on term annotations in the Gene Ontology (D_GO). The first keyterm set, K^MeSH_DPM, contains MeSH terms. MeSH (Medical Subject Headings) is a hierarchically organized vocabulary produced by the National Library of Medicine to index MEDLINE/PubMed. K^MeSH_DPM contains all MeSH terms occurring in the L_DPM(p) set of PubMed records associated with all proteins p in P^r. For the second keyterm set, K^Words_DPM, we used all words (after stop-word filtering) extracted from the PubMed abstracts associated with all proteins p in P^r. To build the third keyterm set, K^Stems_DPM, we reduced the words in K^Words_DPM to their linguistic stems, using a morphological normalization tool, called BioMorpher, which we have used previously23. Finally, the fourth keyterm set, K^Terms_DGO, contains terms from the L_DGO(p) set of GO annotations associated with all proteins p in P^r. Notice that many of the annotations in GOA are electronically inferred (e.g. they are based on hits from sequence similarity searches or are transferred from database records). To avoid circularity in our argument we used the GOA evidence code to filter out term annotations inferred from electronic annotation (IEA), limiting our selection to those annotations assigned due to experimental evidence or published literature.

For each of the keyterm sets, we computed a protein-keyterm co-occurrence matrix where each positive entry denotes that the respective keyterm occurs in a document or annotation linked to the respective protein. The rows of the matrix define the protein vectors for each protein p in P^r in the respective keyterm space. Table 1 shows the number of non-zero entries for each matrix and the average number of keyterms per protein in each of the four keyterm sets.

Table 1. A comparison of the four keyterm sets.

                                       K^MeSH_DPM   K^Words_DPM   K^Stems_DPM   K^Terms_DGO
total protein-keyterm associations     98707        560639        484072        14583
avg. keyterms per protein              27           153           132           4

3.4. Protein Vectors and Protein Similarity

The entry for a given protein-keyterm pair in the protein-keyterm co-occurrence matrix is a weight representing the relative importance of the keyterm for that protein. This weight is defined by multiplying a local and a global weight for the protein-keyterm pair. The local weight is the term frequency tf_ik, defined as the number of documents or annotations cited for protein p_i in SWISSPROT that are also indexed by keyterm k in the publication resource D being tested. The coefficients of the protein vectors are then scaled by a global weight to capture the relative importance of each keyterm in the space. The global weight we applied is related to the Inverse Document Frequency (IDF) in IR7. We named it inverse protein family frequency (IPFF) and defined it as

    ipff_k = log(N_PF / n_PF_k)

where N_PF is the total number of Pfam families in C and n_PF_k is the number of Pfam families that contain a protein with at least one document/annotation indexed by keyterm k. Finally, the protein-keyterm co-occurrence matrix W is defined by w_ik = tf_ik * ipff_k, where row i denotes protein vector i and column k denotes keyterm dimension k. Figure 1 depicts this process.

Figure 1. The process of building a protein-keyterm matrix using different linkage information sources: MEDLINE/PubMed (left) and GOA_UNIPROT (right).

To measure protein similarity in keyterm space, we used the IR cosine measure1: given protein vectors p_i and p_j in an n-dimensional keyterm space, the cosine similarity between them is their normalized dot product:

    cos(p_i, p_j) = (p_i · p_j) / (||p_i|| ||p_j||)
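A minimal Python sketch of the tf * ipff weighting and the cosine measure follows, assuming a hypothetical input format (per-protein dictionaries of keyterm counts, and a protein-to-family map); it is meant only to make the formulas above concrete, not to reproduce the implementation used in this study.

```python
import math
from collections import defaultdict

# doc_keyterms: dict protein_id -> dict keyterm -> tf (number of linked documents
#               for that protein containing the keyterm); hypothetical input format.
# family_of:   dict protein_id -> Pfam family.

def build_vectors(doc_keyterms, family_of):
    """Weight each protein-keyterm pair by tf * ipff (inverse protein family frequency)."""
    families_with_term = defaultdict(set)
    for p, terms in doc_keyterms.items():
        for k in terms:
            families_with_term[k].add(family_of[p])
    n_pf = len(set(family_of.values()))                        # N_PF: total number of families
    ipff = {k: math.log(n_pf / len(f)) for k, f in families_with_term.items()}
    return {p: {k: tf * ipff[k] for k, tf in terms.items()}
            for p, terms in doc_keyterms.items()}

def cosine(u, v):
    """Cosine similarity between two sparse protein vectors (dicts keyterm -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```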
3.5. Prediction Algorithms

Our first LSBIT experiments, designed to establish how well we can predict the Pfam family of proteins using the bibliome resources described above, tested two classification algorithms closely related to the k-nearest neighbor algorithm6. Given a protein keyterm vector p_i and an angle θ, the first algorithm, A_θ, assigns a score to each Pfam family j based on the number of proteins of that family found in a hypercone defined by the angle θ and centered around p_i, as illustrated in Figure 2(a). Thus, A_θ returns a ranking of Pfam families based on this score:

    A_θ: Pfam_j(p_i, θ) = |{p_k ∈ pfam_j : cos(p_i, p_k) ≥ cos(θ)}|

Figure 2. (a) The A_θ prediction algorithm: the target protein neighborhood is defined by the hypercone with opening angle θ centered around the target protein vector. (b) The A_WV prediction algorithm: proteins vote in proportion to their cosine similarity to the target protein.

The family with the most proteins in the neighborhood is ranked first, and so forth. This algorithm is described in detail in14,16. A problem with the A_θ algorithm is that it depends on an angle θ. If θ is large, unrelated proteins may be included in the neighborhood; if θ is small, the neighborhood may contain very few proteins or may even be empty, in which case no prediction can be made. A second problem is that it is biased towards ranking larger families first. We have adapted A_θ to deal with both of these issues. In the new algorithm, A_WV, every protein in the space issues a "weighted vote" for its Pfam family (not just those inside a neighborhood hypercone):

    A_WV: Pfam_j(p_i) = ( Σ_{p_k ∈ pfam_j} cos(p_i, p_k) ) / √|pfam_j|

The weight of each protein's vote is given by the cosine of the angle between its vector and the vector of the protein being classified. In order to weaken the bias towards larger families, the family score is normalized by dividing by the square root of the family size. Figure 2(b) illustrates this process. A_WV improves on our first algorithm because it does not require a neighborhood angle to be defined in advance and it always issues a prediction for any protein vector in the space. Additionally, as we will see next, it has higher prediction success than A_θ.
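The two algorithms can be sketched as follows in Python, assuming the protein vectors built in the previous sketch and a protein-to-family map; the target protein is excluded from scoring, an assumption in the spirit of a leave-one-out evaluation that the text does not spell out.

```python
import math
from collections import defaultdict

# vectors:   dict protein_id -> sparse vector (dict keyterm -> weight).
# family_of: dict protein_id -> Pfam family.

def cosine(u, v):
    # Repeated from the previous sketch so this block stands alone.
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_hypercone(target, vectors, family_of, cos_theta):
    """A_theta: score each family by the number of its proteins inside the hypercone."""
    scores = defaultdict(int)
    for p, v in vectors.items():
        if p != target and cosine(vectors[target], v) >= cos_theta:
            scores[family_of[p]] += 1
    return sorted(scores, key=scores.get, reverse=True)        # families, best first

def predict_weighted_vote(target, vectors, family_of):
    """A_WV: every protein votes with weight cos(target, protein); family scores
    are normalized by the square root of the family size."""
    votes, sizes = defaultdict(float), defaultdict(int)
    for p, v in vectors.items():
        if p == target:
            continue
        votes[family_of[p]] += cosine(vectors[target], v)
        sizes[family_of[p]] += 1
    scores = {f: s / math.sqrt(sizes[f]) for f, s in votes.items()}
    return sorted(scores, key=scores.get, reverse=True)
```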
4. Results: Testing ⟨A, K_D⟩

The two algorithms A_θ and A_WV were tested using the four keyterm sets K^MeSH_DPM, K^Words_DPM, K^Stems_DPM and K^Terms_DGO. Figure 3 shows the prediction success of our algorithms using K^MeSH_DPM in terms of true positives, i.e. the number of proteins for which the Pfam family was predicted correctly. The first entry on the x-axis (labelled "weighted") corresponds to the weighted-voting algorithm A_WV. The remaining entries on the x-axis (labelled 0.1, 0.2, etc.) indicate the cosine of θ for the A_θ algorithm. The y-axis shows the number of proteins correctly predicted out of the total of 3663 proteins in P^r. The black dashed curve shows the number of proteins for which a prediction was made using A_θ at each angle θ; as the cosine threshold increases, the number of predictions made by A_θ decreases.

Figure 3. Prediction success using algorithms A_WV and A_θ, and keyterm set K^MeSH_DPM. Curves show correct 1st predictions and correct predictions within the top 2, top 5, top 10 and top 50, plus the number of predictions made; the left axis counts proteins and the right axis shows the corresponding percentage.

A_WV outperformed A_θ in all our tests; therefore, for the other three keyterm sets we display only the results for A_WV, summarized in Table 2. Noticeably, the three keyterm sets extracted from PubMed records performed better than the one extracted from GO annotations. This might be due to there being fewer GO than PubMed keyterms per protein (see Table 1). Among the three keyterm sets based on PubMed, the two obtained from abstract words significantly outperform the one containing MeSH terms; the stem-based keyterms provided slightly better results than plain words.

Table 2. Prediction success for A_WV.

                 1st prediction   top 2    top 5    top 10   top 50
K^MeSH_DPM       54.35%           66.72%   77.70%   83.76%   91.54%
K^Words_DPM      75.27%           84.17%   88.83%   91.13%   94.02%
K^Stems_DPM      75.89%           84.22%   89.30%   91.48%   94.40%
K^Terms_DGO      38.08%           45.65%   55.53%   61.86%   75.59%

5. Integrating Predictions from Different Keyterm Sets

We noticed that the sets of correctly predicted proteins using different keyterm sets do not completely overlap. Therefore, using Shannon's measure of entropy12 we can select the lower-uncertainty class predictions from among the different keyterm sets, leading to a more successful algorithm that efficiently integrates information from those distinct sources. Let π_K(p_i, pfam_j, θ) be the probability of selecting pfam_j as the protein family predicted for protein p_i using keyterm set K and a neighborhood bounded by angle θ. We estimate this probability as follows:

    π_K(p_i, pfam_j, θ) = |{p_k ∈ pfam_j : cos(p_i, p_k) ≥ cos(θ)}| / |{p_k : cos(p_i, p_k) ≥ cos(θ)}|

Then, we compute the entropy of a prediction for protein p_i as follows:

    H_K(p_i, θ) = ∞                                                        if |{p_k : cos(p_i, p_k) ≥ cos(θ)}| = 0
    H_K(p_i, θ) = - Σ_j π_K(p_i, pfam_j, θ) log π_K(p_i, pfam_j, θ)        otherwise

Finally, we compute the prediction uncertainty of protein p_i using keyterm set K, U_K(p_i), as the average entropy over a finite set of angle thresholds T:

    U_K(p_i) = ∞                                                           if H_K(p_i, θ) = ∞ for all θ ∈ T
    U_K(p_i) = average of {H_K(p_i, θ) : θ ∈ T, H_K(p_i, θ) ≠ ∞}           otherwise

Using this uncertainty measure, we implemented and tested a novel algorithm that integrates the protein family predictions issued by each keyterm set by selecting the lower-uncertainty predictions. Let 𝒦 be a set of keyterm sets. For K ∈ 𝒦, let Pfam^K_j(p_i) be the score assigned to protein family j when predicting protein p_i using keyterm set K. Then our integration algorithm based on uncertainty, A_U, is implemented as follows:

    A_U: Pfam^U_j(p_i) = Pfam^K*_j(p_i), where K* = argmin_{K ∈ 𝒦} U_K(p_i)

As a baseline for comparison, we implemented a simple prediction algorithm, A_⟨K⟩, that also integrates the predictions issued by the four keyterm sets, by computing the average score Pfam^K_j(p_i) over all K ∈ 𝒦. Table 3 summarizes the results obtained by these algorithms, highlighting the usefulness of an uncertainty-based method for the top predictions. Indeed, in addition to clearly outperforming A_⟨K⟩ on the top-ranked predictions, A_U outperforms the best results of A_WV with a single keyterm set (K^Stems_DPM, see Table 2) for correct first and top 2 predictions.

Table 3. Prediction with combined keyterm sets.

          1st prediction   top 2    top 5    top 10   top 50
A_⟨K⟩     70.84%           80.02%   87.50%   91.35%   95.93%
A_U       77.15%           84.77%   88.86%   90.88%   93.80%
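A sketch of the entropy-based integration follows. The helper names and the way neighborhoods and per-keyterm-set predictions are passed in are illustrative assumptions; only the logic mirrors the definitions of π_K, H_K, U_K and A_U above.

```python
import math

# neighbour_families: Pfam families of the proteins inside one hypercone neighborhood.
# predictions / uncertainties: dicts keyed by keyterm set name (hypothetical format).

def prediction_entropy(neighbour_families):
    """H_K(p_i, theta): Shannon entropy of the family distribution in one neighborhood;
    infinite when the neighborhood is empty."""
    if not neighbour_families:
        return math.inf
    total = len(neighbour_families)
    probs = [neighbour_families.count(f) / total for f in set(neighbour_families)]
    return -sum(p * math.log(p) for p in probs)

def uncertainty(entropies):
    """U_K(p_i): average entropy over the angle thresholds in T, ignoring thresholds
    whose neighborhood was empty; infinite if all neighborhoods were empty."""
    finite = [h for h in entropies if h != math.inf]
    return sum(finite) / len(finite) if finite else math.inf

def integrate(predictions, uncertainties):
    """A_U: keep, for each protein, the family ranking from the keyterm set with the
    lowest prediction uncertainty."""
    best = min(uncertainties, key=uncertainties.get)
    return predictions[best]
```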
6. Discussion and Conclusions

Our experiments show that the Pfam classification of SWISSPROT proteins can be inferred quite well, independently, from each of the publication resources and associated keyterm sets (MeSH, GO, PubMed abstracts) we tested with the LSBIT. The publication space with its associated keyterms largely captures the functional information structure represented by the Pfam classification. Moreover, we have shown that Shannon's measure of entropy can be used to integrate the predictions from various keyterm sets, resulting in an improved Pfam prediction algorithm.

Algorithm A_WV always issues a prediction, which is desirable when we want to maximize the number of true positives. However, for certain tasks it may be desirable to minimize the number of false positives, or to use a certainty factor to express how reliable we judge a prediction to be. We are exploring the use of Shannon's measure of entropy to implement this scheme.

An interesting finding for us was that, for all tested algorithms, protein family prediction is far superior with keywords extracted from PubMed abstracts than with terms extracted from GO annotations. This suggests that although GO is becoming the standard resource for gene and protein annotation, PubMed abstracts, and even MeSH keyterms, are far superior as resources for literature mining. Given our results, it is fair to conclude that PubMed abstracts and MeSH terms contain more of the semantic and functional information needed to classify proteins. In future work, we will investigate what specific information is missing from the GO annotations that causes the lower performance.

Our results also show that the simple vector space model from IR is capable of representing well the semantics entailed in PubMed abstracts for protein family prediction: for 90% of proteins the correct Pfam family was among the top 5 ranked families (see Table 2). In preliminary tests, we have observed that LSA improves the results only when using PubMed abstract words, and not with the other keyterm sets. These results suggest that abstract keyterms have more synonymy and polysemy than MeSH and GO, but the details of that analysis are forthcoming.

In future work we intend to produce working bibliome informatics tools that build upon the knowledge and algorithms of this study. We will also extend this study with additional algorithms and resources. This includes extending our algorithms by exploiting the ontology structure of MeSH and GO with similarity measures, testing additional uncertainty-based methods, and applying methods based on our network analysis methodology23,17.

Acknowledgements

We are grateful to IU's Research and Technical Services (especially Steve Simms and George Turner) for technical support. The AVIDD Linux Clusters used in our analysis are funded in part by NSF Grant CDA-9601632. This work was also supported by the Department of Energy under contract W-7405-ENG-36 to the University of California. We particularly thank Tom Terwilliger at the Los Alamos National Laboratory for the motivation to conduct this study.

References

1. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Pearson Education, 1999.
2. K. G. Becker, D. A. Hosack, G. Dennis, R. A. Lempicki, T. J. Bright, C. Cheadle, and J. Engel. PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics, 4(1):61, Dec 2003.
3. E. B. Camon, D. G. Barrell, E. C. Dimmer, V. Lee, M. Magrane, J. Maslen, D. Binns, and R. Apweiler.
An evaluation of GO annotation retrieval for BioCreative and GOA. BMC Bioinformatics, 6 Suppl 1:S15, 2005.
4. J. Challacombe, A. Rechtsteiner, R. Gottardo, L. M. Rocha, E. P. Brown, T. Shenk, M. Altherr, and T. Brettin. Evaluation of the host transcriptional response to human cytomegalovirus infection. Physiological Genomics, 18(1):51-62, 2004.
5. M. E. Colosimo, A. A. Morgan, A. S. Yeh, J. B. Colombe, and L. Hirschman. Data preparation and interannotator agreement: BioCreative task 1B. BMC Bioinformatics, 6 Suppl 1:S12, 2005.
6. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, NY, 2nd edition, 2000.
7. S. Dumais. Enhancing performance in latent semantic indexing, 1990.
8. W. Hersh, R. T. Bhupatiraju, and S. Corley. Enhancing access to the Bibliome: the TREC Genomics Track. Medinfo, 11(Pt 2):773-777, 2004.
9. L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia. Overview of BioCreative: critical assessment of information extraction for biology. BMC Bioinformatics, 6 Suppl 1:S1, 2005.
10. R. Homayouni, K. Heinrich, L. Wei, and M. W. Berry. Gene clustering by Latent Semantic Indexing of MEDLINE abstracts. Bioinformatics, 21(1):104-115, 2005.
11. T. K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig. A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet., 28(1):21-28, 2001.
12. G. J. Klir and M. J. Wierman. Uncertainty-Based Information: Elements of Generalized Information Theory. Studies in Fuzziness and Soft Computing. Physica-Verlag, 1999.
13. D. R. Masys, J. B. Welsh, J. Lynn Fink, M. Gribskov, I. Klacansky, and J. Corbeil. Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics, 17(4):319-326, 2001.
14. A. Rechtsteiner. Multivariate Analysis of Gene Expression Data and Functional Information: Automated Methods for Functional Genomics. PhD thesis, Portland State University, 2005.
15. A. Rechtsteiner and L. M. Rocha. MeSH key terms for validation and annotation of gene expression clusters. In Currents in Computational Molecular Biology, RECOMB 2004, pages 212-213, 2004.
16. A. Rechtsteiner, L. M. Rocha, and C. E. Strauss. Clustering of protein families in literature keyword space. In Currents in Computational Molecular Biology, RECOMB 2005, Boston, MA, 2005.
17. L. M. Rocha, T. Simas, A. Rechtsteiner, M. DiGiacomo, and R. Luce. MyLibrary@LANL: proximity and semi-metric networks for a collaborative and recommender web service. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2005), pages 565-571, Compiegne, France, Sep 2005.
18. H. Shatkay and R. Feldman. Mining the biomedical literature in the genomic era: an overview. Journal of Computational Biology, 10(6):821-856, 2003.
19. SIB/EBI. UniProt/Swiss-Prot. http://www.ebi.ac.uk/swissprot/, 2004.
20. E. L. Sonnhammer, S. R. Eddy, and R. Durbin. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins, 28(3):405-420, Jul 1997.
21. P. Srinivasan. MeSHmap: a text mining tool for MEDLINE. Proc AMIA Symp, pages 642-646, 2001.
22. L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter, and J. N. Weinstein. MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques, 27(6):1210-1214, Dec 1999.
23. K. Verspoor, J. Cohn, C. Joslyn, S. Mniszewski, A. Rechtsteiner, L. M. Rocha, and T. Simas.
Protein annotation as term categorization in the gene ontology using word proximity networks. BMC Bioinformatics, 6 Suppl 1:S20, 2005.