Pattern Discovery, Validation, and Hypothesis Development from the Annotated Biological Web

Decades of successful bioinformatics research are reflected in public repositories, including PubMed, Entrez Gene, and model organism resources such as The Arabidopsis Information Resource (TAIR). Each repository includes a collection of curated records (publications, genes, proteins, etc.). Curation includes enriching records with annotations and enriching the repositories by interconnecting these records. Significant investments have been made to develop ontologies, e.g., Gene Ontology (GO) and Plant Ontology (PO), for increased interoperability within research communities. These annotated and interlinked repositories create the annotated biological Web.

Scientists still lack essential tools to mine this wealth of annotated data, in synergy with the scientific literature, for biologically meaningful knowledge. We propose algorithms for pattern discovery, which uncover patterns in the hyperlinked annotation graph that represents the annotated Web. For pattern validation, we will use literature-based methods to explore the biological meaning underlying these patterns. Some patterns may be well established and may have an imprint in the literature. Novel patterns have the potential to be developed into hypotheses and biological experiments. For hypothesis development, the pattern discovery and validation processes will be integrated within an experiment design workflow tailored to the bioscientist.

Our basic building block is a triple of the form (gene, GO, PO). We build upon the knowledge that a curator has annotated the gene with the GO and PO terms. Our innovation is that pattern discovery (and hypothesis development) starts with such triples and moves towards discovering more complex patterns that are composed of multiple triples, possibly across multiple genes. These triple-based patterns capture associations between concepts that span the GO and PO ontologies. Pattern validation will explore the literature for the imprint of one or more triples. An imprint is a ranked collection of sentences that describe the triple. A bioscientist can use the ranked list to gauge whether the biological relationships underlying the triple are well known, i.e., whether the triple is validated. Pattern discovery and literature-based validation will drive hypothesis development and will be key components of an experiment design methodology.
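
To make the triple and imprint ideas concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's algorithms: the names Triple, shared_term_pairs, and rank_sentences are hypothetical, the "pattern" shown is only the simplest multi-triple case (one GO/PO pair shared by several genes), and the sentence ranking is a naive label match standing in for the literature-based methods described above.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One curated annotation: a gene with a GO term and a PO term."""
    gene: str
    go: str  # Gene Ontology term label
    po: str  # Plant Ontology term label


def shared_term_pairs(triples, min_genes=2):
    """Group triples by (GO, PO) pair and keep the pairs that annotate
    at least `min_genes` distinct genes (an assumed, simplistic pattern)."""
    genes_by_pair = defaultdict(set)
    for t in triples:
        genes_by_pair[(t.go, t.po)].add(t.gene)
    return {
        pair: sorted(genes)
        for pair, genes in genes_by_pair.items()
        if len(genes) >= min_genes
    }


def rank_sentences(triple, sentences):
    """Return sentences ordered by how many of the triple's three labels
    they mention (case-insensitive substring match); drop non-matches."""
    def score(sentence):
        text = sentence.lower()
        return sum(label.lower() in text for label in triple)

    scored = [(score(s), s) for s in sentences]
    # sorted() is stable, so ties keep their original order.
    return [s for sc, s in sorted(scored, key=lambda x: -x[0]) if sc > 0]
```

In this sketch, a call such as shared_term_pairs(triples) returns a mapping from each shared (GO, PO) pair to the genes that carry it, and rank_sentences(triple, sentences) returns a crude stand-in for an imprint. A real system would need ontology-aware matching (synonyms, parent and child terms) rather than substring tests.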

Project Sponsors

This award is sponsored by the National Science Foundation CISE III Program.

Participants

Louiqa Raschid, University of Maryland (co-PI)
Padmini Srinivasan, University of Iowa (co-PI)
Maria Esther Vidal, Universidad Simon Bolivar
Andreas Thor, University of Leipzig and UMIACS
Philip Anderson, Undergraduate on REU Award, University of Maryland

Collaborators

Samir Khuller, University of Maryland
Barna Saha, University of Maryland
Xiao-Ning Zhang, St. Bonaventure University
Caren Chang, Cell Biology and Molecular Genetics
Zhongchi Liu, Cell Biology and Molecular Genetics

Interesting Links

PattArAn - A searchable database of annotated (GO-gene-PO) triples from TAIR and a tool for annotated graph summarization and visualization.
LSLINKS - A searchable database of annotated hyperlinked data between Entrez Gene, PubMed and OMIM and a data mining tool to discover significant pairs of terms between two ontologies.
GeneDocs - A PubMed search customized for gene retrieval.
