Student Presentations


Date: November 14, 2006

Speaker: Samuel Angiuoli

Title: Scope, motivation, design, and management of four leading ontologies in the domain of biology and biomedicine: GO, SO, MGED, UMLS

Abstract:

Numerous ontologies have taken hold across the domains of biology and medicine, providing data models for describing a complex array of data types. This presentation surveys four popular ontologies that are under active development in different areas of biomedicine and bioinformatics, covering the scope, design, and management of each. The Gene Ontology and Sequence Ontology are used for genome annotation and have been developed through an open, collaborative process with extensive community involvement. These ontologies are now managed by the newly formed National Center for Biomedical Ontology, which is providing standards, tools, and a repository for community-built ontologies. Another community-built ontology is the MGED ontology, developed by the MGED consortium for describing microarray experiments. The MGED ontology describes experimental design, protocols, and biomaterials, and follows the MAGE data model. In the domain of biomedicine, the UMLS integrates a number of ontologies for biomedical and health-related concepts and is built and managed by the NLM. The Metathesaurus within UMLS provides relationships between terms from a number of established ontologies, including SNOMED, ICD-9-CM, and MeSH. By providing a mapping between source ontologies, the UMLS attempts to integrate existing ontologies, identify synonymous concepts, and provide a common data format, while preserving the content and structure of each source ontology.

Reading:

  • Creating the Gene Ontology Resource: Design and Implementation
  • The Sequence Ontology: a tool for the unification of genome annotations
  • The MGED Ontology: a resource for semantics-based description of microarray experiments
  • The Unified Medical Language System

    Optional:

  • Relations in biomedical ontologies
  • Bio-ontologies: current trends and future directions
  • Are the current ontologies in biology good ontologies?

    Date: November 14, 2006

    Speaker: Elena Zheleva

    Title: Mapping and Linking of Ontologies

    Abstract:

    Ontology mapping and alignment are necessary to provide interoperability between data contributed by independent sources. Lexical analysis provides a tool for discovering pairs of entities, from the same or different ontologies, that refer to the same concept. I will present an overview of the efforts under way at the National Center for Biomedical Ontology related to the mapping and alignment of ontologies, and an evaluation of three lexical methods for ontology mapping.
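As a rough illustration of the lexical idea in the abstract above, the following Python sketch pairs terms from two ontologies whose labels become identical after simple normalization (lowercasing, dropping punctuation, ignoring word order). This is only an assumed, simplified matcher and not one of the three methods evaluated in the reading; the function names and the term identifiers in the usage note are hypothetical.

```python
import re


def normalize(label):
    """Lowercase a label, keep only alphanumeric tokens, and sort them
    so that word order and punctuation do not affect the comparison."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    return " ".join(sorted(tokens))


def lexical_matches(onto_a, onto_b):
    """Return (id_a, id_b) pairs whose labels normalize to the same string.

    onto_a and onto_b are dicts mapping a term identifier to its label.
    """
    # Index the second ontology by normalized label.
    index = {}
    for term_id, label in onto_b.items():
        index.setdefault(normalize(label), []).append(term_id)

    # Look up each label of the first ontology in that index.
    pairs = []
    for term_id, label in onto_a.items():
        for other_id in index.get(normalize(label), []):
            pairs.append((term_id, other_id))
    return pairs
```

For example, with the made-up inputs `{"A:1": "Binding site", "A:2": "Cell membrane"}` and `{"B:7": "binding_site", "B:9": "Membrane, cell", "B:3": "nucleus"}`, the function pairs A:1 with B:7 and A:2 with B:9. Exact matching after normalization misses true synonyms and can pair homonyms, which is one reason such methods need evaluation.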

    Reading:

  • A Fault Model for Ontology Mapping, Alignment, and Linking Systems
  • Evaluation of Lexical Methods for Detecting Relationships Between Concepts from Multiple Ontologies
  • National Center for Biomedical Ontology: Advancing Biomedicine through Structured Organization of Scientific Knowledge

    Date: November 21, 2006

    Speaker: Asad Sayeed

    Title: Machine learning and text mining approaches to data integration in bioinformatic contexts

    Abstract:

    In my talk, I will present the major points of two papers containing applications of machine learning and text mining techniques to biological databases with an eye towards data integration. The first paper consists of a discussion of a system that labels records from protein databases with Gene Ontology codes. This system was submitted for the BioCreative text mining evaluation event and used simple text-mining techniques, such as n-gram models. The second paper delves more deeply into the nitty-gritty of biological applications, using support vector machines trained on multiple datasets in order to predict the relationships between transcription factors and binding sites.

    Reading:

  • Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text
  • Machine learning methods for transcription data integration

    Optional:

  • Protein homology detection by HMM-HMM comparison

    Date: November 21, 2006

    Speaker: Inbal Yahav

    Title: Text mining in life science

    Abstract:

    Biomedical text plays a fundamental role in knowledge discovery in the life sciences. Although information retrieval, or text searching, is useful, it is not sufficient for finding specific facts and relations. Managing the increasing volume, complexity, and specialization of the knowledge expressed in these texts is very challenging. In my presentation I will mainly discuss the challenges, methods, and architecture of Unstructured Information Management, as proposed by an IBM research group.

    Reading:

  • Text-based knowledge discovery: search and mining of life-sciences documents
  • Text Mining in the Life Sciences
  • Text analytics for life science using the Unstructured Information Management Architecture

    Date: November 28, 2006

    Speaker: Michael Schatz

    Title: Managing SNP data and the HapMap Project

    Abstract:

    The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.

    Reading:

  • dbSNP ER Diagram
  • The International HapMap Consortium. A haplotype map of the human genome

    Date: November 28, 2006

    Speaker: Yao Wu

    Title: Ranking target objects for gene queries

    Abstract:

    Web navigation plays an important role in exploring public interconnected data sources such as life science data. A navigational query in the life science graph produces a result graph, which is a layered directed acyclic graph (DAG). Traversing the result paths in this graph reaches a target object set (TOS). The challenge in ranking the target objects is to provide recommendations that reflect the relative importance of each retrieved object, as well as its relevance to the specific query posed by the scientist. We present a metric, layered graph PageRank (lgPR), to rank target objects based on the link structure of the result graph. lgPR is a modification of PageRank that avoids random jumps in order to respect the path structure of the result graph. We also outline a metric, layered graph ObjectRank (lgOR), which extends the ObjectRank metric to layered graphs. We then present an initial evaluation of lgPR. We perform experiments on a real-world graph of life science objects from NCBI, report on the ranking distribution produced by lgPR, and compare lgPR with PageRank. To understand the characteristics of lgPR, an expert compared the Top K target objects (publications in the PubMed source) produced by lgPR with those produced by a word-based ranking method that uses text features extracted from an external source (such as Entrez Gene) to rank publications.
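The idea of a PageRank-style score without random jumps on a layered DAG can be sketched as follows. This is a minimal illustration under stated assumptions, not the lgPR definition from the reading: it assumes the first layer starts with a uniform score, that each node splits its score equally among its outgoing edges, and that edges only point from a layer to a later one. The function name and the node names in the usage note are hypothetical.

```python
from collections import defaultdict


def layered_rank(layers, edges):
    """Propagate scores through a layered DAG, one layer at a time.

    layers: list of lists of node ids, ordered from the first layer to
            the target layer.
    edges:  list of (u, v) pairs, each pointing from a node to a node
            in a later layer.
    Returns a dict mapping each target-layer node to its score.
    """
    out_edges = defaultdict(list)
    for u, v in edges:
        out_edges[u].append(v)

    # Assumption: uniform starting score over the first layer.
    score = defaultdict(float)
    for node in layers[0]:
        score[node] = 1.0 / len(layers[0])

    # No damping and no random jumps: score flows only along result paths.
    for layer in layers[:-1]:
        for u in layer:
            targets = out_edges[u]
            if not targets:
                continue  # a dead end contributes nothing downstream
            share = score[u] / len(targets)
            for v in targets:
                score[v] += share

    return {node: score[node] for node in layers[-1]}
```

With layers `[["g1", "g2"], ["p1", "p2"], ["pub1", "pub2"]]` and edges g1→p1, g1→p2, g2→p2, p1→pub1, p2→pub1, p2→pub2, the target scores are 0.625 for pub1 and 0.375 for pub2. Because nothing is redistributed by random jumps, a target object reachable by more paths, or by paths through less heavily branching nodes, accumulates more score.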

    Reading:

  • Ranking target objects of navigational queries
  • Retrieval with gene queries

    Optional:

  • Authority-Based Keyword Queries in Databases using ObjectRank
  • The PageRank Citation Ranking: Bringing Order to the Web