ModuleFinder: A Tool for Computational Discovery of Cis Regulatory Modules A.A. Philippakis, F.S. He, and M.L. Bulyk Pacific Symposium on Biocomputing 10:519-530(2005)


MODULEFINDER: A TOOL FOR COMPUTATIONAL DISCOVERY OF CIS REGULATORY MODULES
ANTHONY A. PHILIPPAKIS,1,3,4, FANGXUE SHERRY HE,1,3, MARTHA L. BULYK*,1,2,3,4 1 Divisi on of Genetics, Department of Medicine, 2Department of Pathology, 3 Harvard/MIT Division of Health Sciences & Technology (HST), and 4 Harvard University Biophysics Program Brigham & Women's Hospital and Harvard Medical School Boston, MA 02115 Email: {aphilippakis, mlbulyk}@receptor.med.harvard.edu, sherryhe@mit.edu
Regulation of gene expression occurs largely through the binding of sequence-specific transcription factors (TFs) to genomic binding sites (BSs). We present a rigorous scoring scheme, implemented as a C program termed "ModuleFinder", that evaluates the likelihood that a given genomic region is a cis regulatory module (CRM) for an input set of TFs according to its degree of: (1) homotypic site clustering; (2) heterotypic site clustering; and (3) evolutionary conservation across multiple genomes. Importantly, ModuleFinder obtains all parameters needed to appropriately weight the relative contributions of these sequence features directly from the input sequences and TFBS motifs, and does not need to first be trained. Using two previously described collections of experimentally verified CRMs in mammals and in fly as validation datasets, we show that ModuleFinder is able to identify CRMs with great sensitivity and specificity.

1.

Introduction

Recent technological advances have enabled both the sequencing of a large number of genomes an d the generation of expansive gene expression datasets. Still, little is known about how these gene expression patterns are precisely regulated through the binding of sequence-specific transcription factors (TFs) to their DNA binding sites (BSs). Of particular interest is the organization of TF binding sites (TFBSs) into cis regulatory modules (CRMs) that coordinate the complex spatio-temporal patterns of gene expression, and to use that information to identify the CRMs themselves. Mapping TFs to their target CRMs, however, is significantly complicated in higher eukaryotic genomes by the large proportion of non-protein-coding sequence. Since a typical TFBS can be as short as ~5 base pairs (bp), matches to its motif occur frequently by chance alone, with many of these occurrences presumably not acting to modulate gene expression. Therefore, a central challenge that must be overcome is distinguishing functional TFBSs from spurious motif matches.

 *

These authors contributed equally. Corresponding author.


To date, three indicators have been used to identify functional TFBSs. First, functional BSs for some TFs tend to occur in clusters, with multiple BSs occurring in close proximity (homotypic clustering). Second, searching for clusters containing BSs for 2 or more TFs that are believed to co-regulate can enrich for likely CRMs (heterotypic clustering). Finally, functional TFBSs are frequently conserved across evolutionarily divergent organisms1. Cross-species sequence conservation in particular has enormous potential for filtering sequence space, as man y genomes have recently been sequenced, and many more are slated to be sequenced (http://www.genome.gov/10002154). The discrimin atory power of phylogenetic footprinting for identifying cis regulatory elements is therefore expected to continue to increase through the use of more genomes2-6. In order to appropriately incorporate information on conservation across multiple genomes, however, a measure of TFBS conservation is required that weights each alignment genome according to its evolutionary distance not only from the query genome, but also relative to the other alignment genomes. For example, given a candidate TFBS in the human genome, observing conservation in chicken should be weighted more heavily than conservation in mouse, as mouse is evolutionarily closer to human. Moreover, if the candidate site were also conserved in rat, then this additional conservation should be weighted only slightly, given the evolutionary proximity of mouse and rat. While numerous groups have developed approaches for the prediction of CRMs, none is optimized for practical applications. Specifically, many approaches7-9 have been based on binary scoring schemes, wherein all regions containing a threshold number of occurrences for a given combination of TFBSs are returned. These approaches suffer from the limitation that they do not prioritize among the predictions, an important feature for experimentalists as only a limited number of candidate CRMs can feasibly be validated. Additionally, the threshold value determined in any given biological system is unlikely to be generalizeable from one set of TFs and CRM type to another; thus, the appropriate discriminatory criterion must be re-discovered with each application. Alternatively, among existing continuous scoring schemes, many require large training sets10,11. Such approaches cannot be applied to a system in which there are only a handful of known examples, as is frequently the case in practical applications. Finally, among approaches that employ continuous scoring schemes and do not require training12-15, most do not systematically integrate BS clustering and conservation. We are aware of only one other approach that combines all three indicators16, but it is computationally rather slow and requires the user to specify a single sequence window size for the search. Since CRMs are known to vary greatly in size, a scoring scheme is needed that evaluates clustering and conservation over windows of varying sizes15. We have developed a statistically rigorous scoring scheme that for any given genomic region integrates into a single score th e degree of: (1) homotypic clustering; (2) heterotypic clustering; and (3) evolutionary conservation across


multiple genomes. Similar to programs such as BLAST17, our score is an objective measure of the statistical significance of the observed degree of clustering and conservation that is independent of the genome and TFBSs under consideration. Thus, the scoring scheme obtains all parameters needed to appropriately weight the relative contribution of each input alignment and TFBS motif directly from the sequences and motifs themselves, and so does not need to first be trained. We have implemented this scoring scheme as a C program called "ModuleFinder," that is algorithmically efficient and has an intuitive interface. Using two previously described collections of experimentally verified CRMs (mammalian skeletal muscle18 an d D. melanogaster segmentation genes7), we show that ModuleFinder is able to identify CRMs with ~95% sensitivity and ~95% specificity. 2. Methods

Meth ods that evaluate the overall degree of conservation for a given region have been successful in identifying cis regulatory elements in metazoan genomes2,6; they do not, however, necessarily identify the CRMs through which a given set of TFs exert their regulatory roles (i.e., the TFs' "target" CRMs). Since our ultimate goal is to identify candidate CRMs that are bound by a given set of TFs, we have developed a scoring scheme that specifically considers the conservation of a particular set of TFBSs comprising a given tr an scriptional regulatory model. For this, we developed a novel statistical framework that builds on earlier work. Blanchette et al. stated the substring parsimony problem and presented a rigorous and efficient algorithmic procedure for solving it19; th is model was applied to the identification of candidate DNA motifs. Moses et al. used mixture models to evaluate conservation within a tree, and applied it to the identification of candidate DNA motifs from sets of co-expressed genes20; this was similar to an approach given by Prakash et al.21 Here we present a related approach for identifying candidate CRMs from input TFBS motifs. 2.1. Scoring Scheme We define a word to be a short sequence on the DNA alphabet {A,C,G,T}, and a motif to be a collection of words all of the same length. ModuleFinder takes as input a collection of arbitrarily many motifs {m1...mm}, where each motif mi is composed of arbitrarily many words of length li . It also takes as input a set of sequences G = {g1,...gn} corresponding to genomic regions that are to be searched for instances of these motifs, as well as two sets of genomic sequences, A = {a1,...,an} and B = {b1,...,bn}, extracted from evolutionarily divergent organisms and then aligned to the sequences of G. Here, we primarily illustrate the scoring scheme for the case of two alignment genomes, but include comments on th e extension to fewer or more alignments. For any gj, let gj,k denote the base at the kth position and (gj,k...gj,k+l) denote the subsequence of length l beginning at position k. If there is a match to a given motif mi at position k of sequence gj, we define it to be conserved in A (respectively, B), if it is true


that the subsequence (aj,k...aj,k+l) (respectively, (bj,k...bj,k+l)) is also a word in motif mi . Note that we are not assuming that gj,k...gj,k+l = aj,k...aj,k+l, but merely that they are both words in mi . Our basic approach is to scan each sequence in G with a series of nested windows (i.e., overlapping windows of differing sizes). In each window we count the number of occurrences of each motif and the number of these that are conserved in A and B. We then evaluate the likelihood of observing this number of matches and conserved matches under the appropriate null hypothesis, and return those windows that are statistically significant. Specifically, let X = ( X1,..., Xm) be the vector whose components indicate the number of occurrences for each motif individually in a given window, and let Y = ( Y1,..., Ym) and Z = ( Z1,..., Zm) be the corresponding vectors indicating that Yi and Zi out of Xi occurrences are conserved in A and B, respectively. The window score is obtained by finding the probability of observing (X,Y,Z). This quantity will vary according to the likelihood of conservation in A and/or B, the motif frequency, and the window width. Thus, this probability can be represented by: (1) P , ,w ( X , Y , Z ) where  parameterizes conservation likelihood,  parameterizes motif frequencies, and w is the window width. Observe that: (2) P , , w ( X , Y , Z )  P (Y , Z | X ) P , w ( X ) where the relevant parameters can be split between terms in the Markov decomposition, as P ,w ( X ) is unaffected by conservation likelihood, and

P (Y , Z | X ) is unaffected by motif frequency and window size.
( X i ) of Eq. (2) is the likelihood of observing Xi occurrences under the null hypothesis that the motif matches are distributed at random. This has been proved to be well-approximated by a Poisson distribution, provided the motif occurs infrequently and the words comprising it do not exhibit extensive self-overlap.22 Thus, P ,w ( X i )  e  ( i X / X i ! ) , where
i

For a single motif mi , the term P

,w

i

i

i

i =i *w. Th e value of i will itself be determined by both the words comprising mi , as well as genomic word frequencies. To obtain it, we estimate the frequency of each word in mi by a seventh order Markov approximation based on genomic word frequencies, and then sum these frequencies for all words in the motif.
For multiple motifs, the joint probability is given by assuming independence:
P , w  X 1 ,... X m    P i , w  X i 
i 1 m


Th is is a simplifying assumption to make the computation tractable; the error in this approximation has, however, been proved to be bounded22.


The computation of the second term of Eq. (2), P (Y , Z | X ) , is complicated by two factors. First, the score must reflect not only the evolutionary distances of A an d B to G, but also the distances of A and B to each other. Thus,  must repar ameterize P (Y , Z | X ) so that it becomes smaller as A and B grow more distant from G, and as the correlation between A and B decreases. Second, the quantity P (Y , Z | X ) will depend not only on the phylogeny of A, B and G, but also on the degeneracy of the motifs mi . Since we have defined a given motif match to be conserved in A or B if there is a motif occurrence (but not necessarily an exact word match) at the same position in these aligned sequences, a more degenerate motif has a greater likelihood of being conserved.
1 We account for these difficulties as follows. Define A,B to be the covariance matrix representing the relative proportions of A and B that can be aligned 1 against G; thus, 0,0 gives the proportion of sequence in G for which neither A 1 nor B could be aligned, 11,0 an d 0,1 give the proportion for which either A or B

(but not both) could be aligned, and 11,1 gives the proportion for which both A
i an d B could be aligned. Similarly, for each motif mi , defin e A,,2B to be the covariance matrix representing the relative likelihoods of exact conservation of li positions (i.e., (gj,k...gj,k+l) = (aj,k...aj,k+l)) in A and/or B. Here, we have observed non-independence of exact conservation likelihood between adjacent positions, so we model it as a first order Markov chain. 1 Conservation of a completely degenerate motif is parameterized by A, B , and i conservation of a motif composed of a single word is parameterized by A,,2B . The parameterization of a generic motif is between these extremes; for this, let Pi ,j,k be the matrix giving the frequency of nucleotide j{A,C,G,T} at position k {1,...,li } in motif mi , and let Ei be the average entropy of the motif:

Ei  

1 2li

 P

li

i , j ,k k 1 j{ A,C ,G ,T }

log 2 Pi , j ,k

Hence, Ei =1 for a completely degenerate motif, Ei =0 for a motif composed of a single word, and Ei increases monotonically and smoothly between these extremes as the motif degeneracy increases. Therefore, we take our 1 i parameterization of i for mi to be a weighted average of A, B and A,,2B :
i i i  Ei A,,1B  (1  Ei )A,,2B

We then use i to compute P i (Yi , Z i | X i ) . In a sequence window containing Xi matches to motif mi , let ai be the number that are not conserved in either A or B, let bi an d ci be the number conserved in either A or B (but not both), and let di be the number that are conserved in both A and B. The following equations hold: (3-5) ai  bi  ci  d i  X i


bi  di  Yi

ci  di  Zi

P(Yi ,Zi |Xi ) is therefore given by the following multinomial:
 i X i! Pi (Yi , Z i | X i )     a ! b ! c ! d !  0,0   i i i i


a ib 1, 0 ic 1, 0

id 1,1


(6)

where the summation is performed over all values of ai , bi , ci and di satisfying Eqs. (3)-(5). To achieve computational efficiency, we make use of the following 1-dimensional parameterization, where Xi , Yi and Zi remain fixed as di is varied: (7-9) ai  X i  Yi  Z i  d i
bi  Yi  d i

Th us, the summation of Eq. (6) can be performed by simply taking each value of di in the range 0  di  min(Yi , Zi ). If one desires to only input one genome, it is sufficient to set A=B. The relevant parameters then simplify, and the preceding multinomial distribution collapses to a binomial distribution with parameter

ci  Z i  d i

 i  i

1,1

:

Th is parameterization can also be easily generalized to more than 2 alignment genomes by replacing the matrix i with an appropriate tensor. Th is derived value of P, , w ( X , Y , Z ) alone is insufficient for determining statistical significance, since a measurement of distance into the appropriate tail of the distribution is also required. Therefore, we perform a summation of P, ,w ( X , Y , Z ) extending from the observed value of (X,Y,Z) and including all values of (X,Y,Z) with an increased degree of clustering and conservation (we use log values to simplify the numerical analysis):
 X X m ~ ~~  S , ,w ( X ,Y , Z )  log10      Pi , ,w ( X i ,Yi , Z i )  ~ ~ ~  i  X  X Y Y Z  Z i1  ~ ~ Xi Xi m  m  ~ ~~   log10     Pi , ,w ( X i ,Yi , Zi )    Si , ,w ( X i ,Yi , Zi )  X  X Y Y Z  Z  i 1 i i ~ ~ ~ i 1  i ii ii i 
~ ~

X  P i (Yi | X i )   i  Yi (1   i ) X i Yi   Yi 


(10)

Th erefore, the output score S , , w ( X , Y , Z ) for a given window is the linear sum of scores for the input motifs, S  , , w ( X i , Yi , Z i ) , where each such term has been
i i

automatically weighted so that more degenerate motifs contribute less. Observe also that S  , , w ( X i , Yi , Z i ) =0 if and only if Xi =0, and that S  , , w ( X i , Yi , Zi )
i i
i i

in creases monotonically with increasing values of (Xi ,Yi ,Zi ), as desired. 2.2. Implementation and Availability ModuleFinder has been implemented in C. To minimize runtime, we pre-process each sequence of G with suffix arrays23 for efficient searching; additionally, as


the algorithm proceeds, a look-up table is kept that contains a list of scores for all observed window sizes w and motif matches (X,Y,Z). ModuleFinder can scan ~120 Mb/hr using window sizes of 300-700 bp with an increment size of 50 bp an d one alignment genome on a Pentium 4 computer. The compiled code, along with README files and appropriately formatted genomes and alignments for human, mouse, rat, fly, worm and yeast based on the latest UCSC assemblies24 ar e available for download at our website (http://the_brain.bwh.harvard.edu). Two additional features were included for improved practical applicability. First, it is known that TFs frequently bind to DNA as homo- and hetero-dimers. We have added to ModuleFinder the ability to take pairs of TFBSs as input, along with minimum and maximum spacer lengths between sites. The score of the dimer is computed by evaluating the probability of each component motif as in Eq. (1), then taking the product of these probabilities and summing them over all input spacings. Second, ModuleFinder allows a certain amount of `wiggle room' to compensate for the potential existence of local misalignments. Specifically, given an input value r, a motif match (gj,k....gj,k+l) is considered conserved in A if there is any subsequence of (aj,k-r...aj,k+r+l) that is a word in mi . Although this does increase the likelihood of conservation, the effect is miniscule for small values of r (1 r  5) and has frequently identified potentially conserved sites that would have been missed otherwise. 3. Results

3.1 Validation of ModuleFinder on human skeletal muscle CRMs In order to evaluate ModuleFinder, we used a set of positive control regions previously compiled by Wasserman et al.18 Th is test dataset comprises 27a skeletal muscle CRMs that have been demonstrated to direct transcription in skeletal muscle or a suitable cell-culture model system18. Each region contains a validated BS for at least one of the following 5 TFs: the Myf family (total of 39 TFBSs in the positive control set), Mef2 (26 TFBSs), SRF (20 TFBSs), Tef (12 TFBSs) and Sp1 (13 TFBSs). Of these 27 regions, 23 are located within 5 kb upstream of translational Start, and 2 with in introns. As negative controls, 1000 regions of size 200 bp were randomly selected to positionally match the positive control regions: 852 (=(23/27)*1000) regions were within 5 kb of translational Start for a randomly chosen RefGene24 gene, and the remaining 148 were within introns. This matching of chromosomal locations was performed as ModuleFinder accounts for local word frequencies, which vary throughout the genome; in particular, promoter regions are known to be GC-rich. We ran ModuleFinder on the positive and negative control regions with window sizes of 100-200 bp (increment size = 10 bp), using human sequence alone,
a

The original collection gave 28 genes, but we removed the gene Rb1 as there were no confirmed TFBSs for the listed TFs.


human/mouse/rat (H/M/R) alignments and human/mouse/chicken alignments (H/M/C) obtained from UCSC Genome Browser (hg16, mm3, rn3, galGal2)24. Currently, two alternative strategies for representing TFBSs have been used by various groups in computational searches for CRMs: exact word matches to known BSs9,15, and position weight matrices (PWMs)7,10-13, which allow for extrapolation to additional BSs. To determine which of these representations had greater discriminatory power, we performed our searches both ways, using a PWM threshold value of 1 standard deviation (SD) below the motif average25. We used a "jack-knife" strategy11 for these searches, whereby the BSs for each CRM were excluded from the construction of the PWM used to search that CRM, and similarly the exact word matches from each CRM were excluded in the search of that CRM. In addition, since in vitro binding experiments had been performed for Mef226 and SRF27, we also added those BSs to both searches.
Human Alone Exact Sens. Spec. p-val 88.9% 90.1% 1.19x10-8 PWM 92.6% 89.2% 6.5x10-10 Human/Mouse/Rat Exact 92.6% 88.8% 2.5x10-10 PWM 96.3% 94.4% 1.4x10-10 Human/Mouse/Chicken Exact 92.6% 87.4% 4.5x10-10 PWM 92.6% 94.4% 7.1x10-10

Table 1. ModuleFinder was run on a human skeletal muscle dataset using both exact word matches and PWMs. The searches were done using no alignments, mouse/rat alignments, and mouse/chicken alignments. For each search, sensitivity ("Sens."), specificity ("Spec."), and a ttest on the means ("p-val") were computed, as compared to matched random regions.

Th e results of these evaluations are shown in Table 1. Here, we have reported those values for sensitivity and specificity which maximally discriminate between the positive and negative control sets (i.e., using the threshold score such that the difference between the sensitivity and specificity is minimized). Since there was great variability in score among the positive control regions (see Figure 1; i.e., the top positive control region received a score of -11.23 and the worst positive control region scored only -1.22 (positive controls: mean = -4.69, SD = 2.24; negative controls: mean = -0.42, SD = 0.81)), we also performed a ttest on the positive and negative control region means, in order to measure the effectiveness of ModuleFinder on regions falling far from the threshold score. On this dataset, ModuleFinder achieved a maximum sensitivity of 96.3% and specificity of 94.4% on the H/M/R PWM searches. Moreover, the PWM approach consistently gave better discrimination than exact word matches. Much of this improved discrimination, however, is an artifact of the jack-knife procedure, which has a stronger effect on exact match searches. Here, using the complete set of BSs (i.e., without the jack-knife), exact word matching achieves 100% sensitivity and 95.1% specificity (we removed degenerate flanking sequences for all searches with exact words). In addition, these results indicate that the H/M/R searches reliably outperformed the H/M/C searches. There are two possible explanations for this: 1) the chicken genome is not yet complete,


and the appropriate alignment regions may not have been sequenced yet; 2) the underlying mechanisms of transcriptional regulation are not actually conserved in an organism as distant as chicken. Neither of these hypotheses can be ruled out until the completion of the chicken genome.
S p ecif icity Since ModuleFinder was 0.8 specifically developed to integrate 0.6 homotypic clustering, heterotypic 0.4 clustering, and conservation, we 0.2 S e n s i ti v i t y wanted to determine which of 0 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -1 0 -1 1 these features were most M od u l e F i n d er Score contributory to discriminatory power. In order to assess this, we Figure 1: Sensitivity and specificity of ModuleFinder on skeletal muscle test regions, ran ModuleFinder on the positive versus randomly selected control regions. and negative control regions using no alignments, one alignment (each of mouse, rat and chicken), and two alignments (H/M/R and H/M/C). These searches were repeated with each TF individually, as well as with all 5 TFs together. In Figure 2, we show the negative logarithm of the p-values obtained from t-tests on the positive versus negative control regions for each of these searches. Here the mouse and rat alignments improved discriminatory power, but little was gained by using both genomes, because of their evolutionary proximity. Somewhat surprisingly, using chicken actually reduced discrimination relative to human alone. This was unexpected, as it implies that our negative controls are more likely to be conserved than these 27 regions. However, this effect could be an artifact of the small size of the positive controls and gaps in the chicken genome (only 13/27 positive controls had any alignable chicken sequence).
1

All 5 TFs
1 0 9 8 7 6 5 4 3 2 1 0

Mef2

Myf

Sp1

SRF

Tef

-log(p-value)

mouse rat chick en

- +- - ++ - -+-+- --+-+

- +- - ++ --+-+---+-+

- +- - ++ --+-+---+-+

- +- - ++ - -+-+---+-+

- + - - ++ - -+-+- --+-+

- +- - ++ - -+-+- --+-+

Figure 2: Negative log of p-values obtained from t-test on means between positive and negative controls. ModuleFinder was run with various combinations of mouse, rat, and chicken alignments (indicated by +/-), using all 5 TFs together, and each TF alone.

Finally, at least four other algorithms have used overlapping subsets of this dataset as positive controls11-13,28, achieving sensitivities between 59% and 66%,


and specificities between 95.3% and 97.1% (see Table 2). Thus, ModuleFinder appears to have comparable specificity but greater sensitivity. However, note the following caveats for this comparison. First, because ModuleFinder uses evolutionary conservation as a central component and because few vertebrate genomes have been sequenced, we limited our searches to the subset of the original compilation for which human/rodent alignments were available18. The other algorithms tested on this dataset did not consider conservation, and thus used the original, larger compilation that included CRMs obtained from diverse organisms including chicken, hamster, rabbit, pig and cow11. Frith et al.12,13 tr immed this larger set11 to a subset of 27 regions, but their subset overlapped with ours by only 15 genes. Second, each group used a different set of negative controls. The original paper by Wasserman et al.11 used a set of negative control regions similar to our set; it was composed of 200 bp regions selected from the Eukaryotic Promoter Database. Comet and Cister were each tested on 300 bp regions that were selected to overlap well-characterized transcriptional Starts12,13. Finally, MSCAN28 measured specificity by looking at the "hit rate" in contiguous stretches of the Fugu genome.
Algorithm Logistic Regression11 Cister12 COMET13 MSCAN28 ModuleFinder Sensitivity 60% 59% 59% 66% 96% Specifi city 96% 97.1% 95.3% NA 94%

Table 2. Relative performance of ModuleFinder: Sensitivities and specificities, as reported by groups using overlapping subsets of the skeletal muscle dataset. Logistic regression, Cister, Comet and ModuleFinder specificities refer to 200-300bp portions of the human genome; the MSCAN specificity was ascertained using large stretches of Fugu sequence.

3.2 Other validations of ModuleFinder In addition to the mammalian skeletal muscle set, we have also tested ModuleFinder on a D. melanogaster dataset that comprises 20 transcriptional enhancers from 9 genes known to be co-regulated during anterior-posterior segmentation of fly embryos7. Using the D. melanogaster/D. pseudoobscura alignments and a protocol similar to that described in Section 2.1, ModuleFinder was able to discriminate this collection of CRMs from randomly chosen noncoding regions with 95% sensitivity and 95% specificity (Philippakis et al., manuscript in preparation). In addition to these in silico confirmations, we have also successfully applied ModuleFinder to predict CRMs in three biological systems: (1) development of the fly pericardium (Michaud et al., manuscript in preparation), (2) development of fly muscle founder cells (Philippakis et al., manuscript in preparation), and (3) mammalian myogenesis (Warner et al., manuscript in preparation). For mammalian myogenesis, we applied the same 5


TFs and their BSs as described above; indeed much of the work presented here was done for the explicit purpose of selecting optimized BSs and sequence alignments before attempting to predict novel mammalian CRMs. 4. Discussion and Future Directions

We have presented a statistically rigorous approach for scoring windows of genomic sequence according to their likelihood of containing BSs for a collection of input TFs. The approach systematically integrates homotypic clustering, heterotypic clustering and evolutionary conservation across multiple genomes into a single, objective scoring scheme that does not require training. Additionally, our algorithm, implemented as a C program called "ModuleFinder," is publicly available for download, along with pre-processed genomes and alignments for yeast, worm, fly, mouse, rat, and human, at our lab website (http://the_brain.bwh.harvard.edu). The current version of ModuleFinder considers up to two alignment genomes as input, and we are curr ently expanding it to accept arbitrarily many genomes. We have tested ModuleFinder on a set of human skeletal muscle CRMs using a variety of genome alignments and TFBSs, and have achieved a maximum sensitivity and specificity of 96% and 94%. On this dataset, improved sensitivity and specificity were achieved by using mouse and rat alignments in the searches, whereas chicken alignments actually decreased sensitivity and specificity. Furthermore, PWMs resulted in improved sensitivity and specificity over exact TFBS matches. Preliminary results indicate that ModuleFinder can successfully predict novel CRMs in human myoblasts (Warner et al., manuscript in preparation). In addition, on a D. melanogaster segmentation gene dataset with D. pseudoobscura as the alignment genome, ModuleFinder achieved sensitivity and specificity of 95% and 95%. We have also predicted and experimentally validated several novel CRMs in the developing fly mesoderm (Philippakis et al., manuscript in preparation). We expect that in the future we and others will use ModuleFinder to further refine transcriptional regulatory models for CRMs in particular biological systems and thus discover how the associated TFBSs are organized to confer specific gene expression patterns. 5. Acknowledgments

Th e authors thank Bertrand Huber and Jason Warner for helpful discussion, and Mike Berger, Mark Umbarger, and Roman Yelensky for critical reading of the manuscript. This work was funded in part by a PhRMA Foundation Informatics Research Starter Grant (M.L.B.), a William F. Milton Fund Award (M.L.B.), and NIH/NHGRI R01 HG02966-01 (M.L.B.). A.A.P. was supported in part by a National Defense Science and Engineering Graduate Fellowship from the Department of Defense, and an Athin oula Martinos Fellowship from HST. References


1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28.

Bulyk, M.L. Genome Biol. 5, 201 (2003). Boffelli, D. et al. Science 299, 1391-4 (2003). Cliften, P. et al. Science 301, 71-6 (2003). Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E.S. Nature 423, 241-54 (2003). McGuire, A.M., Hughes, J.D. & Church, G.M. Genome Res. 10, 744-57 (2000). Th omas, J.W. et al. Nature 424, 788-93 (2003). Berman, B.P. et al. Proc. Natl. Acad. Sci. USA 99, 757-62 (2002). Halfon, M.S., Grad, Y., Church, G.M. & Michelson, A.M. Genome Res. 12, 1019-28 (2002). Markstein, M., Markstein, P., Markstein, V. & Levine, M.S. Proc. Natl. Acad. Sci. USA 99, 763-8 (2002). Krivan , W. & Wasserman, W.W. Genome Res. 11, 1559-66 (2001). Wasserman, W.W. & Fickett, J.W. J. Mol. Biol. 278, 167-81 (1998). Frith , M.C., Hansen, U. & Weng, Z. Bioinformatics 17, 878-89 (2001). Frith , M.C., Spouge, J.L., Hansen, U. & Weng, Z. Nucleic Acids Res. 30, 3214-24 (2002). Rajewsky, N., Vergassola, M., Gaul, U. & Siggia, E.D. BMC Bioinformatics 3, 30 (2002). Rebeiz, M., Reeves, N.L. & Posakony, J.W. Proc. Natl. Acad. Sci. USA 99, 9888-93 (2002). Sinha, S., van Nimwegen, E. & Siggia, E.D. Bioinformatics 19 Suppl 1, i292-301 (2003). Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. J. Mol. Biol. 215, 403-10 (1990). Wasserman, W.W., Palumbo, M., Thompson, W., Fickett, J.W. & Lawrence, C.E. Nat. Genet. 26, 225-8 (2000). Blan chette, M., Schwikowski, B. & Tompa, M. J. Comput. Biol. 9, 211-23 (2002). Moses, A.M., Chiang, D.Y. & Eisen, M.B. Pac. Symp. Biocomput., 324-35 (2004). Prakash, A., Blanchette, M., Sinha, S. & Tompa, M. Pac. Symp. Biocomput., 348-59 (2004). Reinert, G., Schbath, S. & Waterman, M.S. J. Comput. Biol. 7, 1-46 (2000). Irvin g, R.W. & Love, L. Technical Report no. TR-2001082 of the Computing Science Department of Glasgow University (2001). http://www.genome.ucsc.edu. Stormo, G.D. Bioinformatics 16, 16-23 (2000). Andr es, V., Cervera, M. & Mahdavi, V. J. Biol. Chem. 270, 23246-9 (1995). http://www.cognia.com. Johan sson, O., Alkema, W., Wasserman, W.W. & Lagergren, J. Bioinformatics 19 Suppl 1, i169-76 (2003).