CMSC 828U - Exploiting Biological Resources

Student Presentations

Date: November 14, 2006

Speaker: Samuel Angiuoli

Title: Scope, motivation, design, and management of four leading ontologies in the domain of biology and biomedicine: GO, SO, MGED, UMLS

Abstract:

Numerous ontologies have taken hold across biology and medicine, providing data models for describing a complex array of data types. This presentation surveys four popular ontologies that are under active development in different domains of biomedicine and bioinformatics, covering the scope, design, and management of each. The Gene Ontology and Sequence Ontology are used for genome annotation and have been developed through an open collaborative process with extensive community involvement. These ontologies are now managed by the newly formed National Center for Biomedical Ontology, which is providing standards, tools, and a repository for community-built ontologies. Another community-built ontology is the MGED Ontology, developed by the MGED consortium for describing microarray experiments. The MGED Ontology describes experimental design, protocols, and biomaterials and follows the MAGE data model. In the domain of biomedicine, the UMLS integrates a number of ontologies for biomedical and health-related concepts and is built and managed by the NLM. The Metathesaurus within the UMLS provides relationships between terms from a number of established ontologies, including SNOMED, ICD-9-CM, and MeSH. By providing a mapping between source ontologies, the UMLS attempts to integrate existing ontologies, identify synonymous concepts, and provide a common data format while preserving the content and structure of each source ontology.

Reading:

Creating the Gene Ontology Resource: Design and Implementation

The Sequence Ontology: a tool for the unification of genome annotations

The MGED Ontology: a resource for semantics-based description of microarray experiments

The Unified Medical Language System

Optional:

Relations in biomedical ontologies

Bio-ontologies: current trends and future directions

Are the current ontologies in biology good ontologies?

Date: November 14, 2006

Speaker: Elena Zheleva

Title: Mapping and Linking of Ontologies

Abstract:

Ontology mapping and alignment are necessary to provide interoperability between data contributed by independent sources. Lexical analysis provides a tool for discovering pairs of entities, from the same or different ontologies, that refer to the same concept. I will present an overview of the efforts under way at the National Center for Biomedical Ontology related to mapping and aligning ontologies, and an evaluation of three lexical methods for ontology mapping.
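As a rough illustration of what a lexical method can look like, the sketch below pairs terms from two ontologies whose labels become identical after a simple normalization (lowercasing, dropping punctuation, ignoring word order). It is not one of the three methods evaluated in the reading; the function names, term identifiers, and toy labels are all hypothetical.

```python
import re
from collections import defaultdict


def normalize(label):
    """Lowercase the label, keep only alphanumeric words, and sort them."""
    words = re.findall(r"[a-z0-9]+", label.lower())
    return " ".join(sorted(words))


def lexical_matches(terms_a, terms_b):
    """Return sorted (id_a, id_b) pairs whose labels match after normalization.

    terms_a and terms_b map a term identifier to its label.
    """
    # Index the second ontology by normalized label for constant-time lookup.
    index = defaultdict(list)
    for term_id, label in terms_b.items():
        index[normalize(label)].append(term_id)

    pairs = []
    for term_id, label in terms_a.items():
        for other_id in index.get(normalize(label), []):
            pairs.append((term_id, other_id))
    return sorted(pairs)


# Hypothetical toy ontologies, invented for this example.
ONTOLOGY_A = {"A:1": "Cell Membrane", "A:2": "DNA binding", "A:3": "Nucleus"}
ONTOLOGY_B = {"B:7": "membrane, cell", "B:8": "dna-binding", "B:9": "mitochondrion"}

MATCHES = lexical_matches(ONTOLOGY_A, ONTOLOGY_B)
print(MATCHES)
```

Matching on normalized strings is cheap but cannot detect synonyms that share no words, which is one reason lexical methods are usually evaluated against curated mappings.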

Reading:

A Fault Model for Ontology Mapping, Alignment, and Linking Systems

Evaluation of Lexical Methods for Detecting Relationships Between Concepts from Multiple Ontologies

National Center for Biomedical Ontology: Advancing Biomedicine through Structured Organization of Scientific Knowledge

Date: November 21, 2006

Speaker: Asad Sayeed

Title: Machine learning and text mining approaches to data integration in bioinformatic contexts

Abstract:

In my talk, I will present the major points of two papers that apply machine learning and text mining techniques to biological databases with an eye towards data integration. The first paper discusses a system that labels records from protein databases with Gene Ontology codes. This system was submitted to the BioCreative text mining evaluation and used simple text-mining techniques, such as n-gram models. The second paper delves more deeply into the biological application, using support vector machines trained on multiple datasets to predict the relationships between transcription factors and binding sites.
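For readers unfamiliar with n-gram models, the sketch below shows only the feature-extraction step: counting word n-grams across a set of texts. It does not reproduce the system described in the first paper; the function names and the two sample sentences are made up for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the list of word n-grams (as tuples) in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(texts, n=2):
    """Count how often each word n-gram occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text, n))
    return counts


# Hypothetical sentences, invented for this example.
SENTENCES = [
    "protein kinase activity is required",
    "the protein kinase activity increased",
]

COUNTS = ngram_counts(SENTENCES, n=2)
print(COUNTS.most_common(2))
```

Counts like these could serve as features for a classifier that assigns annotation codes to text, though how the paper's system uses them is not specified here.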

Reading:

Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text

Machine learning methods for transcription data integration

Optional:

Protein homology detection by HMM-HMM comparison

Date: November 21, 2006

Speaker: Inbal Yahav

Title: Text mining in life science

Abstract:

Biomedical text plays a fundamental role in knowledge discovery in the life sciences. Although information retrieval or text searching is useful, it is not sufficient to find specific facts and relations. Managing the increasing volume, complexity, and specialization of the knowledge expressed in these texts is therefore very challenging. In my presentation I will mainly discuss the challenges, methods, and architecture of Unstructured Information Management, as offered by an IBM research group.

Reading:

Text-based knowledge discovery: search and mining of life-sciences documents

Text Mining in the Life Sciences

Text analytics for life science using the Unstructured Information Management Architecture

Date: November 28, 2006

Speaker: Michael Schatz

Title: Managing SNP data and the HapMap Project

Abstract:

The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

Reading:

dbSNP ER Diagram

The International HapMap Consortium. A haplotype map of the human genome

Date: November 28, 2006

Speaker: Yao Wu

Title: Ranking target objects for gene queries

Abstract:

Web navigation plays an important role in exploring public interconnected data sources such as life science data. A navigational query in the life science graph produces a result graph, which is a layered directed acyclic graph (DAG). Traversing the result paths in this graph reaches a target object set (TOS). The challenge in ranking the target objects is to provide recommendations that reflect the relative importance of each retrieved object, as well as its relevance to the specific query posed by the scientist. We present a metric, layered graph PageRank (lgPR), to rank target objects based on the link structure of the result graph. The lgPR metric is a modification of PageRank; it avoids random jumps in order to respect the path structure of the result graph. We also outline a metric, layered graph ObjectRank (lgOR), which extends the ObjectRank metric to layered graphs. We then present an initial evaluation of lgPR. We perform experiments on a real-world graph of life sciences objects from NCBI and report on the ranking distribution produced by lgPR. We compare lgPR with PageRank. In order to understand the characteristics of lgPR, an expert compared the top K target objects (publications in the PubMed source) produced by lgPR with those produced by a word-based ranking method that uses text features extracted from an external source (such as Entrez Gene) to rank publications.
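To make the idea of ranking without random jumps concrete, the sketch below propagates score through a layered DAG one layer at a time, splitting each node's score evenly among its outgoing links. This is a simplified illustration under stated assumptions, not the lgPR definition from the reading: the uniform starting scores, the even split, the handling of dead ends, and all names and toy data are assumptions made for this example.

```python
from collections import defaultdict


def layered_rank(layers, edges):
    """Score the nodes of the last layer by pushing score forward layer by layer.

    layers: list of lists of node ids; the first is the start layer and the
            last is the target object set.
    edges:  dict mapping a node to the nodes it links to in the next layer.
    """
    # Assumption: score starts uniformly distributed over the first layer.
    scores = {node: 1.0 / len(layers[0]) for node in layers[0]}

    for layer in layers[:-1]:
        next_scores = defaultdict(float)
        for node in layer:
            targets = edges.get(node, [])
            if not targets:
                continue  # Assumption: a dead-end node simply drops its score.
            share = scores.get(node, 0.0) / len(targets)
            for target in targets:
                next_scores[target] += share
        # No random-jump term: score only moves along result paths.
        scores = next_scores

    return {node: scores.get(node, 0.0) for node in layers[-1]}


# Hypothetical result graph: one gene, two proteins, three publications.
LAYERS = [["g1"], ["p1", "p2"], ["pub1", "pub2", "pub3"]]
EDGES = {"g1": ["p1", "p2"], "p1": ["pub1", "pub2"], "p2": ["pub2"]}

RANKS = layered_rank(LAYERS, EDGES)
print(sorted(RANKS.items(), key=lambda item: -item[1]))
```

In this toy graph pub2 scores highest because two paths reach it, while pub3 receives nothing because no path from the query reaches it; with random jumps, as in standard PageRank, pub3 would still receive some score.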

Reading:

Ranking target objects of navigational queries

Retrieval with gene queries

Optional:

Authority-Based Keyword Queries in Databases using ObjectRank

The PageRank Citation Ranking: Bringing Order to the Web