Pattern Discovery, Validation, and Hypothesis Development from the
Annotated Biological Web
Decades of successful bioinformatics research are reflected
in public repositories including PubMed,
Entrez Gene and model organism resources such as The Arabidopsis
Information Resource (TAIR). Each repository includes a collection
of curated records (publications, genes, proteins, etc.).
Curation includes enriching records with annotations
and enriching the repositories by interconnecting these records.
Significant investments have been made to develop ontologies, e.g.,
Gene Ontology (GO) and Plant Ontology (PO), to increase
interoperability within research communities.
These annotated and interlinked repositories create the
annotated biological Web.
Scientists still lack the essential tools to mine
this wealth of annotated data, in synergy with the scientific
literature, to obtain nuggets of biologically meaningful knowledge.
We propose algorithms that will uncover patterns from the
hyperlinked annotation graph representing the annotated Web
(pattern discovery).
We will exploit literature-based methods to explore the biological
meaning underlying these patterns (pattern validation).
Some patterns may be well established and may have an
imprint in the literature.
Novel patterns carry the potential to
be developed into hypotheses and biological experiments.
The pattern discovery and validation processes will be integrated
within an experiment design workflow tailored to the bioscientist -
Our basic building block is a triple of the form (gene, GO, PO).
We build upon knowledge that the gene has been
annotated with the GO and PO terms by a curator.
Our innovation is that pattern discovery (and hypothesis development)
starts with such triples and moves towards discovering
more complex patterns that are composed of multiple triples, possibly
across multiple genes. These triple-based patterns
capture associations between concepts that span
the GO and PO ontologies.
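
As a rough illustration of the triple building block, the sketch below groups (gene, GO, PO) triples by their (GO, PO) term pair and keeps the pairs shared by several genes. It is not the project's algorithm: the `Triple` type, the `go_po_patterns` function, the `min_genes` threshold, and all identifiers in the sample data are hypothetical placeholders chosen for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One curated annotation: a gene with a GO term and a PO term."""
    gene: str
    go: str
    po: str


def go_po_patterns(triples, min_genes=2):
    """Group triples by (GO, PO) pair; keep pairs annotated on >= min_genes genes.

    This simple co-occurrence count is an assumption made for illustration,
    standing in for whatever pattern discovery method is actually used.
    """
    genes_by_pair = defaultdict(set)
    for t in triples:
        genes_by_pair[(t.go, t.po)].add(t.gene)
    return {
        pair: sorted(genes)
        for pair, genes in genes_by_pair.items()
        if len(genes) >= min_genes
    }


# Placeholder identifiers, not real TAIR, GO, or PO records.
triples = [
    Triple("geneA", "GO:term1", "PO:term1"),
    Triple("geneB", "GO:term1", "PO:term1"),
    Triple("geneB", "GO:term2", "PO:term1"),
    Triple("geneC", "GO:term2", "PO:term2"),
]

patterns = go_po_patterns(triples)
print(patterns)
```

With the sample data, only the pair (GO:term1, PO:term1) is shared by two genes, so it is the only multi-triple pattern reported.
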
Pattern validation will explore the literature
for the imprint of a triple or a set of triples. An imprint is a ranked collection of sentences
that describe the triple.
A bioscientist can use the ranked list to gauge whether the
biological relationship(s) underlying the triple is well known,
i.e., whether the triple is validated.
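
To make the idea of an imprint concrete, here is a minimal sketch that ranks sentences by how many of a triple's terms they mention. The scoring rule (a count of case-insensitive substring matches), the `imprint` function, and the sample terms and sentences are all assumptions for illustration; the source does not specify how sentences are scored or ranked.

```python
def imprint(triple_terms, sentences, top_k=3):
    """Return up to top_k sentences, ranked by how many triple terms they mention.

    Assumed scoring: one point per term found as a case-insensitive substring.
    Sentences that mention none of the terms are dropped.
    """
    scored = []
    for sentence in sentences:
        text = sentence.lower()
        score = sum(1 for term in triple_terms if term.lower() in text)
        if score > 0:
            scored.append((score, sentence))
    # Stable sort: ties keep their original order in the input.
    scored.sort(key=lambda pair: -pair[0])
    return [sentence for _, sentence in scored[:top_k]]


# Placeholder gene name and term labels, not real annotations.
terms = ("geneA", "process X", "tissue Y")
sentences = [
    "Process X was observed in the mutant.",
    "An unrelated sentence about something else.",
    "geneA regulates process X in tissue Y.",
    "geneA is expressed in tissue Y.",
]

ranked = imprint(terms, sentences)
for sentence in ranked:
    print(sentence)
```

A long, high-scoring list would suggest that the relationship is already described in the literature, while an empty or weak list would flag the triple as a candidate for a novel hypothesis.
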
Pattern discovery and literature-based validation will drive
hypothesis development and will be key components of an experiment
design workflow.
This project is sponsored by the National Science
Foundation CISE III
Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs.
Barna Saha, Allie Hoch, Samir Khuller, Louiqa Raschid and Xiao-Ning Zhang.
To appear in RECOMB 2010.
Link Prediction for Annotation Graphs using Graph Summarization.
Andreas Thor, Philip Anderson, Louiqa Raschid, Saket Navlakha,
Barna Saha, Samir Khuller, Xiao-Ning Zhang. Under review. Please email firstname.lastname@example.org for a copy of this paper.
A Ranking-Based Approach to Discover Semantic Associations Between Linked Data.
Maria Esther Vidal, Louiqa Raschid, Luis Ibanez, Hector Rodriguez,
Jean Carlo Rivera and Edna Ruckhaus.
ESWC Workshop on Inductive Reasoning and Machine Learning for the Semantic Web
Ranking Target Objects of Navigational Queries.
Yao Wu, Louiqa Raschid, Maria Esther Vidal, Panayiotis Tsaparas, Woei-Jyh Lee, Padmini Srinivasan and Aditya Sehgal.
Proceedings of the Workshop on Information and Data Management, p. 1, vol. 1, (2006).
Using Annotations from Controlled Vocabularies to Find Patterns in LSLinks.
Woei-Jyh Lee, Louiqa Raschid, Padmini Srinivasan, Nigam Shah, Daniel Rubin and Natasha Noy.
Proceedings of the Conference on Data Integration for the Life Sciences, p. 1, vol. 1, (2007).
Mining Meaningful Associations from Annotations in Life Science Data Resources.
Woei-Jyh Lee, Louiqa Raschid and Padmini Srinivasan.
Proceedings of the Conference on Data Integration for the Life Sciences, p. 1, vol. 1, (2008).
Winner of the Swiss Institute of Bioinformatics Best Paper Award.
Flexible and Efficient Querying and Ranking on Hyperlinked Data Sources.
Ramakrishna Varadarajan, Vagelis Hristidis, Louiqa Raschid, Maria Esther Vidal, Hector Rodriguez and Luis Ibanez.
Proceedings of the EDBT Conference, p. 1, vol. 1, (2009).
- A searchable database of annotated (GO-gene-PO) triples from TAIR
and a tool for annotated graph summarization and visualization.
- A searchable database of annotated hyperlinked data between
Entrez Gene, PubMed and OMIM and a data mining tool to
discover significant pairs of terms between two ontologies.
search customized for gene retrieval.