UNISON: AN INTEGRATED PLATFORM FOR COMPUTATIONAL BIOLOGY DISCOVERY

REECE K. HART, KIRAN MUKHYALA
Genentech, Inc.
1 DNA Way
South San Francisco, CA, 94080, USA
E-mail: rkh@gene.com

Revision 186 (2008-09-16 23:01)

This paper describes the design and applications of Unison, a comprehensive and integrated warehouse of protein sequences, diverse precomputed predictions, and other biological data. Unison provides a practical solution to the burden of preparing data for computational discovery projects, enables holistic feature-based mining queries regarding protein composition and functions, and provides a foundation for the development of new tools. Unison is available for immediate use online via direct database connections and a web interface. In addition, the database schema, command line tools, web interface, and non-proprietary precomputed predictions are released under the Academic Free License and available for download at http://unison-db.org/. This project has resulted in a system that significantly reduces several practical impediments to the initiation of computational biology discovery projects.

1. Introduction

Computational biology projects frequently begin with the time-consuming process of downloading, formatting, characterizing, and integrating data of disparate types and sources [1]. These data preparation tasks require significant effort while contributing little to the distinguishing intellectual value of a project. Furthermore, these efforts are often duplicated for other projects, by other scientists, and even by the same scientist for the same project when new source data are released. The data preparation burden is a barrier to efficient and reproducible computational discovery efforts.

From a computational biologist's point of view, there are several important functional criteria for integrative databases: completeness and currency of the data, the breadth of source data and data types, query speed, technical accessibility, legal accessibility (i.e., licensing restrictions), and the extent of semantic integration. Semantic integration means that data of the same type are modeled similarly and that appropriate relationships are established among data, regardless of source. Importantly, users experience semantically integrated databases as a representation of familiar concepts rather than a collection of proprietary data models (even if freely available). Reliable reasoning with data necessitates consistent, well-defined, and well-understood definitions of the modeled data.

There are many techniques for data integration (for review, see [2][3]). Link integration, such as that provided by web pages, enables users to follow prescribed links between data sources. Although link integration is useful for web browsing, it is insufficient for reliable and semantically precise querying. Similarly, full text indexing greatly facilitates searching, but it is insufficient for the reliable integration of concepts. Database federation semantically integrates data that are stored remotely. The principal advantage of federation is that queries are always based on current data, but the drawbacks are the run-time dependencies on external resources, poor query performance compared to that for locally stored data, and the difficulty of devising interfaces that translate the semantics of the remote data models to those of the integrated schema.

Data warehouses, in which source data are aggregated within one database environment, eliminate external database dependencies and generally provide better query performance than other integration methods. The simplest data warehouses replicate source data locally but provide little or no semantic integration of the data.
Because data are local, performance is improved relative to that of remote databases. More frequently, data warehouses provide a semantically integrated schema by creating database views to local replicas of source data ("view integration" [3]) or by materializing source data within an integrated schema during loading.

Despite the existence of several high-quality and well-known integrative databases (e.g., ATLAS [4], BioMart [5], InterPro [6], RefSeq [7], STRING [8], UniProt [9], and others), data preparation for new computational discovery projects remains burdensome. There are technical, practical, and legal reasons that current databases do not meet the needs of computational biologists [2][3]. A few of these reasons are: limited content/project specificity, inaccurate or out-of-date data, limited access methods, necessity for local deployment, and licensing restrictions. Although no system will meet all needs of all users, there is a significant and unmet need for a standardized integration platform that lessens the data preparation burden of a broad audience.

Unison is a comprehensive data warehouse of a superset of nearly all available protein sequences (currently, 12M from 20 sources), extensive precomputed proteomic predictions (200M of 18 distinct types), and diverse auxiliary data. Unison includes predictions of protein domains and motifs, signal and transmembrane domains, secondary and tertiary structure, disorder, cellular localization, phosphorylation, and genomic alignment and clustering. The motivation for Unison is to lessen the data preparation burden of computational biologists by providing a standardized integration platform of commonly used source databases and computationally expensive proteomic predictions. The integration enables the same resource to be used for traditional per-sequence domain analysis and for complex, holistic data mining queries regarding protein domain composition, structure, and function.
Unison's schema is designed for incremental updating with respect to source sequences, models, and computational methods. The entire update process is fully automated. The complete Unison package (schema, tools, web site, underlying database, and precomputed data) is freely available online for immediate use and for download.

2. Methods

2.1. Schema overview

Unison is principally a data warehouse of sequences, annotations, and precomputed predictions in an integrated schema. Familiar concepts that transcend multiple data sources, such as protein sequences and Hidden Markov Model (HMM) alignments, are modeled as abstract types in the semantically integrated core schema. Specialized data, such as NCBI GeneRIF and SCOP, are incorporated as auxiliary data without remodeling.

One of the tenets of Unison's core schema is to represent only the salient features of an entity or concept rather than to fully represent the character of each source database. As a result, most tables in the Unison schema are concise representations of the essential features of a biological entity or piece of information. This design decision is consistent with the goal of including data only when it is likely to inform queries or analysis. Links to source data enable users to pursue source-specific content.

Unison makes extensive use of table inheritance to model abstract data types and their concrete descendants. Normalization is essential to enabling efficient incremental updates of source data, which is one of Unison's primary goals. Nearly all tables in Unison are in third normal form [10]. In rare cases when normalization leads to unacceptable performance loss, materialized views are provided. In practice, this level of normalization is readily understood by Unison's users and does not impose a barrier to use. Nonetheless, Unison provides a functional layer of views that provide a simplified, pragmatic, and stable query interface for users.

2.2. Essential schema objects

The essential objects modeled within the Unison schema are shown in Figure 1 and described below. The full schema and schema documentation are available online.

Figure 1: Essential tables and their relationships within Unison. This figure represents a small subset of the schema and a majority of the content of Unison.

The origin table stores the provenance of all data, the version of the source database, a Uniform Resource Identifier (URI) to information about the source (e.g., a project's "home page"), a URI of the file or directory from which data were loaded, and a URI template that is used to construct web page links to source records. A flag also indicates whether the data are publicly distributable.

Protein sequences are stored non-redundantly in the pseq table and referred to by the primary key, pseq_id. A database trigger on this table computes the MD5 checksum on the protein sequence during loading, and a uniqueness constraint on the checksum ensures that sequences are distinct. A second trigger prohibits updating or deleting any sequence so that predictions are guaranteed to reflect the stored sequence; sequence changes require inserting a new sequence.

The pannotation table represents the "alias" (accession or identifier from a source database), description, and taxonomic annotation of a sequence. A uniqueness constraint prohibits duplicate aliases within the same origin. The pseq_pannotation table maps and versions annotations in the pannotation table to sequences in the pseq table. Annotation versioning enables Unison to properly track obsolete aliases and changes to the sequence that is associated with a single alias. The current annotations for sequences are available through a view.

The params table represents prediction methods and their invocation.
Each row includes a primary key, params_id, the command line that specifies how a program is invoked, and a flag that indicates whether the results of the method are publicly distributable. The command line provides important traceability of all predictions in Unison.

Some prediction methods depend on additional external input, generically called models and represented using an abstract base table, pmodel. For example, Position Specific Scoring Matrices (PSSMs), HMMs, and Prospect Pro (http://bioinformaticssolutions.com/) protein structure templates are represented by pmpssm, pmhmm, and pmprospect respectively, each of which inherits from the abstract table, pmodel. The pmodel table and its derived tables contain a primary key, pmodel_id. Sets of models are represented by pmodelset.

Figure 2: Visualization of the prediction results volume. The cube dimensions are unique sequences (pseq_id), feature types/models (pmodel_id), and parameters (params_id). An analysis of protein features corresponds to horizontal slices of data. Feature-based mining corresponds to vertical slices of data.

The run table is used to associate parameters and appropriate model sets, referenced by params_id and pmodelset_id respectively. The params, pmodelset, and run tables together permit Unison to distinguish how a method is invoked on a sequence and which models a sequence is run against.

Prediction results in Unison are represented using an abstract base table, pfeature, that represents a localizable feature on a protein sequence, specified by the foreign key pseq_id, using a specific prediction method, specified by the foreign key params_id. By including params_id in this table, Unison explicitly models alternative invocations of a predictive method and this, in turn, enables specialized or exploratory work using multiple prediction parameters with a single method.
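
As a minimal sketch of why pfeature carries params_id, the following Python example uses an in-memory SQLite database as a stand-in for PostgreSQL. Only the table names params and pfeature and the keys params_id and pseq_id come from the text, and the start and stop columns follow the column names used in Unison's views. Every other column, the method names, the command lines, and the data are hypothetical. SQLite has no table inheritance, so pfeature appears here as an ordinary table rather than an abstract base table.

```python
import sqlite3

# Simplified stand-ins for Unison's params and pfeature tables (assumed
# columns; the real schema is richer and uses PostgreSQL table inheritance).
SCHEMA = """
CREATE TABLE params (
    params_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    commandline TEXT NOT NULL,              -- records how the method was invoked
    is_public   INTEGER NOT NULL DEFAULT 1  -- distributability flag
);
CREATE TABLE pfeature (
    pfeature_id INTEGER PRIMARY KEY,
    pseq_id     INTEGER NOT NULL,           -- sequence carrying the feature
    params_id   INTEGER NOT NULL REFERENCES params(params_id),
    start       INTEGER NOT NULL,
    stop        INTEGER NOT NULL,
    CHECK (start <= stop)
);
"""


def build_demo():
    """Create the toy schema and load two invocations of one hypothetical method."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO params (params_id, name, commandline) VALUES (?, ?, ?)",
        [
            (1, "demo default", "demo_predictor"),          # hypothetical tool
            (2, "demo strict", "demo_predictor --strict"),  # hypothetical flag
        ],
    )
    db.executemany(
        "INSERT INTO pfeature (pseq_id, params_id, start, stop) VALUES (?, ?, ?, ?)",
        [(76, 1, 20, 42), (76, 1, 110, 131), (76, 2, 21, 41)],
    )
    return db


def features_by_params(db, pseq_id):
    """Return {params name: [(start, stop), ...]} for one sequence."""
    rows = db.execute(
        """
        SELECT P.name, F.start, F.stop
        FROM pfeature F
        JOIN params P ON P.params_id = F.params_id
        WHERE F.pseq_id = ?
        ORDER BY P.params_id, F.start
        """,
        (pseq_id,),
    ).fetchall()
    grouped = {}
    for name, start, stop in rows:
        grouped.setdefault(name, []).append((start, stop))
    return grouped
```

Calling features_by_params(build_demo(), 76) groups the intervals for sequence 76 by invocation, so results from the two parameter sets remain distinguishable even though they come from the same method.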
The palignment table is a subclass of pfeature that represents an alignment of sequence to a model, as specified by the foreign key pmodel_id. Results for each prediction method are modeled as distinct types by subclassing pfeature or palignment as appropriate and adding columns that are specific to the prediction method. Readers may benefit from visualizing the arrangement of prediction results as a volume of distinct predictions, each of which depends on sequence (pseq_id), method (params_id), and an optional model (pmodel_id), as shown in Figure 2.

Unison's schema includes a run_history table to track which sequence analyses (that is, runs from the run table) were performed on which sequences and the date of the execution. Run histories enable Unison to provide incremental updates with respect to new sequence releases, new versions of predictive methods, and new models/modelsets.

Hidden Markov Models and sequence alignments to them permit a concrete and representative discussion of the schema in practice. HMMs for HMMer [11] are available from many sources, such as PANTHER [12], Pfam [13], Superfamily [14], and in-house efforts. Such sources are stored in the origin table. HMMs from those sources may be loaded into pmhmm with an obligatory reference to the model origin. Details specific to HMMer-built HMMs, such as score cutoffs, are included in the model table to enable advanced queries using these criteria. HMM alignments are represented using pahmm (Protein Alignment HMM), which is a subclass of palignment and therefore contains foreign keys to the query sequence (pseq_id), the parameters (params_id), and the HMM (pmodel_id), in addition to the positions of the alignment on the sequence and the model, the score, the E-value for the individual alignment, and other alignment details. When the alignments are loaded, the run_history table is updated to reflect this. The pahmm_v view provides a denormalized relation that simplifies the querying of HMM data.

2.3. Loading and Updating Unison

Most of Unison's initial build process and all of the update process, including data downloads and submission of prediction jobs to a compute cluster, are automated by a series of makefiles and Perl scripts. The loading process consists of several phases; each phase must complete without error before the next begins.

In phase 1, auxiliary data are loaded. These data include cytogenetic bands from UCSC [15], Gene Ontology [16], Human Genome Organization official gene names [17], NCBI Entrez Gene, GeneRIF, HomoloGene and taxonomy files [7], PDB [18], and SCOP [19]. The PDB schema includes structure summary data, ligands with canonical names and various descriptors, and an explicit sequence-to-residue mapping to facilitate the localization of primary sequence features on structures.

In phase 2, sequences and annotations are loaded from approximately 30 public sequence sources; the list of origins and currency is available at http://unison-db.org/contents.pl. Sequence loading from the commercial GENESEQ patent database [20] is also supported, but these data are not in the public release. Within Genentech, in-house sequences and annotations are also loaded.

In phase 3, three nested sets of sequences (runA, runB, and runC) prioritize sequences by origin, species, size, and other criteria. Because the time required for predictions varies widely, expensive computations such as protein structure predictions are maintained for reliable human sequences of moderate size (runA); methods of intermediate computational cost are maintained for certain eukaryotic sequences from reliable sources (runB); and inexpensive computations are maintained for sequences from a wider selection of sources and species (runC). Ad hoc analyses of any type may be arranged for any sequence.

In phase 4, sequences are submitted to a compute cluster queue for analysis and the resulting data are loaded.
The sequences to be run are determined by subtracting the set of sequences for which data are already available, as determined by the run_history table, from the run set that is appropriate for the analysis type. The 18 methods and command lines are listed at http://unison-db.org/contents.pl. Unison itself does not infer any features by sequence similarity, although many methods use sequence similarity internally.

In phase 5, materialized views that depend on phase 4 data are built.

In phase 6, database summary statistics, maintenance, and cleanup are performed.

The public version of Unison is built by specialized scripts that extract only the publicly distributable sequences, annotations, and predictions.

2.4. Hardware and software requirements

Unison is implemented in the PostgreSQL relational database running in a Linux/GNU environment. Deploying the database server and Unison on other operating systems should be straightforward. Installation requires approximately 200 GB of disk space. Although the extensive use of table inheritance and server-side functions in C and Perl would make porting to other database systems difficult, such a port is under consideration.

The Unison Perl API facilitates access to the Unison database and provides many convenience functions. The API makes extensive use of BioPerl [21] and other modules readily available from the Comprehensive Perl Archive Network.

The Unison web interface runs on Apache in a Linux/GNU environment. All web pages are currently implemented as Perl CGI scripts to enable Kerberos authentication. This configuration causes user authentication and authorization to be delegated to the database and security policies to be enforced there rather than in the middle tier. Neither registration nor authentication is required for the public version.

2.5. Accessing Unison

Direct database access to Unison is available at host unison-db.org, port 5432, database unison, and username PUBLIC via the PostgreSQL native protocol, ODBC, JDBC (for Java), and SDBC (for OpenOffice). ODBC access within R has been tested. These connection methods are not specific to the client or server operating system. The number of concurrent connections and statement query times are limited to promote equitable access.

Web access is available at http://unison-db.org/. Unison web pages use URLs with simple "GET"-method query arguments, thereby facilitating programmatic linking to content. For example, all sequence analysis pages allow specification of the sequence by any recognized alias, protein sequence, MD5 checksum, or Unison's internal pseq_id. MD5 checksums provide an intrinsic key that obviates inter-database coordination of protein sequence accessions.

CREATE OR REPLACE VIEW itim_search_v AS
SELECT IG.pseq_id,
       IG.start AS ig_start, IG.stop AS ig_stop,
       TM.start AS tm_start, TM.stop AS tm_stop,
       ITIM.start AS itim_start, ITIM.stop AS itim_stop
FROM pahmm_current_pfam_v IG
JOIN pftmhmm_tms_v TM ON IG.pseq_id=TM.pseq_id AND IG.stop