Pattern Discovery, Validation, and Hypothesis Development from the Annotated Biological Web

Decades of successful bioinformatics research are reflected in public repositories, including PubMed, Entrez Gene, and model organism resources such as The Arabidopsis Information Resource (TAIR). Each repository includes a collection of curated records (publications, genes, proteins, etc.). Curation includes enriching records with annotations and enriching the repositories by interconnecting these records. Significant investments have been made to develop ontologies, e.g., Gene Ontology (GO) and Plant Ontology (PO), for increased interoperability within research communities. These annotated and interlinked repositories create the annotated biological Web.

Scientists still lack the essential tools to mine this wealth of annotated data, in synergy with the scientific literature, to obtain biologically meaningful knowledge. We propose algorithms that will uncover patterns from the hyperlinked annotation graph representing the annotated Web (pattern discovery). We will exploit literature-based methods to explore the biological meaning underlying these patterns (pattern validation). Some patterns may be well established and may have an imprint in the literature. Novel patterns carry the potential to be developed into hypotheses and biological experiments. The pattern discovery and validation processes will be integrated within an experiment design workflow tailored to the bioscientist (hypothesis development).

Our basic building block is a triple of the form (gene, GO, PO). We build upon the knowledge that a curator has annotated the gene with the GO and PO terms. Our innovation is that pattern discovery (and hypothesis development) starts with such triples and moves towards discovering more complex patterns that are composed of multiple triples, possibly across multiple genes. These triple-based patterns capture associations between concepts that span the GO and PO ontologies. Pattern validation will explore the literature for the imprint of one or more triples. An imprint is a ranked collection of sentences that describe the triple. A bioscientist can use the ranked list to gauge whether the biological relationships underlying the triple are well known, i.e., whether the triple is validated. Pattern discovery and literature-based validation will drive hypothesis development and will be key components of an experiment design methodology.
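As a rough illustration of the triple-based view, the Python sketch below groups (gene, GO, PO) triples by their (GO, PO) term pair to surface pairs shared by several genes, and ranks candidate sentences for a triple by how many of its elements they mention. This is not the project's algorithm. The function names, the `min_genes` threshold, the placeholder identifiers, and the naive substring scoring are all assumptions made for illustration; an actual system would presumably exploit the ontology structure, statistical significance, and proper text mining.

```python
from collections import defaultdict

# Hypothetical (gene, GO term, PO term) triples. The identifiers are
# placeholders invented for this sketch, not real TAIR annotations.
triples = [
    ("geneA", "GO:1", "PO:1"),
    ("geneB", "GO:1", "PO:1"),
    ("geneC", "GO:1", "PO:2"),
    ("geneA", "GO:2", "PO:1"),
    ("geneB", "GO:2", "PO:1"),
]


def genes_by_term_pair(triples):
    """Map each (GO, PO) pair to the set of genes annotated with both terms."""
    support = defaultdict(set)
    for gene, go_term, po_term in triples:
        support[(go_term, po_term)].add(gene)
    return dict(support)


def shared_patterns(triples, min_genes=2):
    """Return (GO, PO) pairs backed by at least `min_genes` genes.

    Results are ordered by decreasing number of supporting genes, with ties
    broken by the term pair itself so the output is deterministic.
    """
    support = genes_by_term_pair(triples)
    frequent = [
        (pair, sorted(genes))
        for pair, genes in support.items()
        if len(genes) >= min_genes
    ]
    return sorted(frequent, key=lambda item: (-len(item[1]), item[0]))


def rank_sentences(triple, sentences):
    """Rank sentences by how many elements of the triple they mention.

    Scoring is a naive case-insensitive substring match (an assumption of
    this sketch). Sentences mentioning no element are dropped.
    """
    scored = []
    for sentence in sentences:
        lowered = sentence.lower()
        score = sum(1 for term in triple if term.lower() in lowered)
        if score > 0:
            scored.append((score, sentence))
    # Python's sort is stable, so tied sentences keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return scored


if __name__ == "__main__":
    for pair, genes in shared_patterns(triples):
        print(pair, genes)

    # Term labels are used here in place of ontology identifiers.
    example = ("geneA", "photosynthesis", "leaf")
    corpus = [
        "The leaf is green.",
        "geneA regulates photosynthesis in the leaf.",
        "Unrelated text.",
    ]
    for score, sentence in rank_sentences(example, corpus):
        print(score, sentence)
```

In this toy setting, a (GO, PO) pair supported by several genes plays the role of a simple multi-triple pattern, and the scored sentence list stands in for an imprint: a high-scoring list would suggest the relationship is already discussed in the literature, while an empty list would mark the triple as a candidate for further investigation.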

Project Sponsors

This award is sponsored by the National Science Foundation CISE III Program.



Recent papers

  • Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs. Barna Saha, Allie Hoch, Samir Khuller, Louiqa Raschid and Xiao-Ning Zhang. PDF version. To appear in RECOMB 2010.
  • Link Prediction for Annotation Graphs using Graph Summarization. Andreas Thor, Philip Anderson, Louiqa Raschid, Saket Navlakha, Barna Saha, Samir Khuller, Xiao-Ning Zhang. Under review. Please email for a copy of this paper.
  • A Ranking-Based Approach to Discover Semantic Associations Between Linked Data. Maria Esther Vidal, Louiqa Raschid, Luis Ibanez, Hector Rodriguez, Jean Carlo Rivera and Edna Ruckhaus. PDF version. ESWC Workshop on Inductive Reasoning and Machine Learning for the Semantic Web 2010.
  • Ranking Target Objects of Navigational Queries. Yao Wu, Louiqa Raschid, Maria Esther Vidal, Panayiotis Tsaparas, Woei-Jyh Lee, Padmini Srinivasan and Aditya Sehgal. Proceedings of the Workshop on Information and Data Management, p. 1, vol. 1, (2006). PDF.
  • Using Annotations from Controlled Vocabularies to Find Patterns in LSLinks. Woei-Jyh Lee, Louiqa Raschid, Padmini Srinivasan, Nigam Shah, Daniel Rubin and Natasha Noy. Proceedings of the Conference on Data Integration for the Life Sciences, p. 1, vol. 1, (2007). PDF.
  • Mining Meaningful Associations from Annotations in Life Science Data Resources. Woei-Jyh Lee, Louiqa Raschid and Padmini Srinivasan. Proceedings of the Conference on Data Integration for the Life Sciences, p. 1, vol. 1, (2008). Winner of the Swiss Institute of Bioinformatics Best Paper Award. PDF.
  • Flexible and Efficient Querying and Ranking on Hyperlinked Data Sources. Ramakrishna Varadarajan, Vagelis Hristidis, Louiqa Raschid, Maria Esther Vidal, Hector Rodriguez and Luis Ibanez. Proceedings of the EDBT Conference, p. 1, vol. 1, (2009). PDF.

Interesting Links

  • PattArAn - A searchable database of annotated (GO-gene-PO) triples from TAIR and a tool for annotated graph summarization and visualization.
  • LSLINKS - A searchable database of annotated hyperlinked data between Entrez Gene, PubMed and OMIM and a data mining tool to discover significant pairs of terms between two ontologies.
  • GeneDocs - A PubMed search customized for gene retrieval.