Pacific Symposium on Biocomputing 13:489-500(2008) September 27, 2007 11:15 WSPC - Pro ceedings Trim Size: 9in x 6in revisions0924 USE OF AN EVOLUTIONARY MODEL TO PROVIDE EVIDENCE FOR A WIDE HETEROGENEITY OF REQUIRED AFFINITIES BETWEEN TRANSCRIPTION FACTORS AND THEIR BINDING SITES IN YEAST RICHARD W. LUSK Department of Molecular and Cel l Biology, University of California, Berkeley Berkeley, California 94720, USA E-mail: lusk@berkeley.edu www.berkeley.edu MICHAEL B. EISEN Genomics Division, Lawrence Berkeley National Laboratory, Department of Molecular and Cel l Biology, University of California, Berkeley Berkeley, California 94720, USA E-mail: mbeisen@lbl.gov www.lbl.gov Keywords : binding sites, evolution, PWM, ChIP-chip, affinity 1. Abstract The identification of transcription factor binding sites commonly relies on the interpretation of scores generated by a position weight matrix. These scores are presumed to reflect on the affinity of the transcription factor for the bound sequence. In almost all applications, a cutoff score is chosen to distinguish between functional and non-functional binding sites. This cutoff is generally based on statistical rather than biological criteria. Furthermore, given the variety of transcription factors, it is unlikely that the use of a common statistical threshold for all transcription factors is appropriate. In order to incorporate biological information into the choice of cutoff score, we developed a simple evolutionary model that assumes that transcription factor binding sites evolve to maintain an affinity greater than some factor-specific threshold. We then compared patterns of substitution in binding sites predicted by this model at different thresholds to patterns Pacific Symposium on Biocomputing 13:489-500(2008) September 27, 2007 11:15 WSPC - Pro ceedings Trim Size: 9in x 6in revisions0924 of substitution observed at sites bound in vivo by transcription factors in S. cerevisiae. Assuming that the cutoff value that gives the best fit between the observed and predicted values will optimally distinguish functional and non-functional sites, we discovered substantial heterogeneity for appropriate cutoff values among factors. While commonly used thresholds seem appropriate for many factors, some factors appear to function at cutoffs satisfied commonly in the genome. This evidence was corroborated by local patterns of rate variation for examples of stringent and lenient p-value cutoffs. Our analysis further highlights the necessity of taking a factor-specific approach to binding site identification. 2. Intro duction A gene's expression is governed largely by the differential recruitment of the basal transcription machinery by bound transcription factors.1,2 In this way, transcription factor binding sites are fundamental components of the regulatory co de, and this co de's decipherment is partially a problem of recognizing their lo cation and affinity.3 These are usually determined using position weight matrices, although a number of more recently developed metho ds are beginning to become adopted.4 We use position weight matrices here due to their ease of use with evolutionary analysis and their established theoretical ties with biochemistry. A position weight matrix generates a score comprising the log odds of a given subsequence being drawn from a binding site distribution of nucleotide frequencies vs. an analogous background distribution.5 The score's p-value is used to determine the location of binding sites: subsequence scores above a predetermined cutoff designate that subsequence to be a binding site, and subsequence scores below the cutoff designate the subsequence to be ignored. The interpretation of regulatory regions is thus dependent on the choice of the p-value cutoff. However, this choice is not straightforward, although it is commonly made to conform to established but biologically arbitrary statistical standards, e.g. p < .001. In addition to assuming that this particular p-value is appropriate, the user here also assumes that a single p-value is appropriate for all transcription factors. Being that score shares an approximately monotonic relationship with affinity,6,7 this implies that the nature of the interaction between different transcription factors and their binding sites is the same. This may not be the case. For example, some transcription factors may require a stronger binding site to compensate for weaker interactions with other transcription machinery, and so a lenient cutoff would be inappropriate. Conversely, the choice of a stringent cutoff Pacific Symposium on Biocomputing 13:489-500(2008) September 27, 2007 11:15 WSPC - Pro ceedings Trim Size: 9in x 6in revisions0924 could eliminate viable sites of factors that commonly rely on cooperative interactions with other proteins to be recruited to the DNA. A single common standard of significance is a compromise that may not be reasonable. Ideally, biological information should inform the choice of a p-value and its consequent ramifications in the determination of function. Several recent approaches have well used expression8 and ChIP-chip9 data towards understanding binding specificity. Here we take advantage of selective pressure as a third source of information. Tracking selective pressure has the advantage of directly interpreting sequence in terms of its value to the organism in its environment; to a degree, function can be inferred by observing the impact of selection. To this end, we propose a simple selective model of binding site evolution. Selection prevents the fixation of low affinity sites that may not affect expression to a satisfactory level and does not maintain unnecessary high affinity sites. We train the model on the ChIP-chip data available in yeast, and we find evidence for a wide heterogeneity in required binding site affinity between factors. Supporting recent work by Tanay,10 many factors appear to require only weak affinity for function, and we find some evidence that these may rely on co operative binding to achieve specificity. 3. Results and Discussion 3.1. Definition and training of the affinity-threshold model In order to use selection as a means to investigate function, a model must be defined to describe how selection acts on functional and non-functional binding site sequence. Our model was created to be the simplest possible for our purposes. We assume that binding sites evolve independently from other sites in their promoter, but that all sites that bind the same factor evolve equivalently. We interpret a binding site's function in a binary manner: our mo del supposes that there exists a satisfactory level of expression and that binding site polymorphisms that are able to drive this expression level or greater have equal fitness, while binding site polymorphisms that cannot are deleterious. By assuming that this deleterious effect is large enough to preclude fixation in S. cerevisiae, our model imposes an effective threshold on permitted affinity: it does not allow a substitution to o ccur if it drops the position weight matrix score beneath a given boundary. Analogous reasoning lets us treat repressors identically. By imposing a threshold on permitted affinity and by relying on the assumption that position weight matrix score shares a monotonic relationship with affinity,6 we impose a threshold weight matrix score. Pacific Symposium on Biocomputing 13:489-500(2008) September 27, 2007 11:15 WSPC - Pro ceedings Trim Size: 9in x 6in revisions0924 Our purpose in training the model is to find where that threshold lies for each factor, which we accomplish using simulation. For any given threshold and matrix, we simulate the relative rates of substitution that would be expected, and then we compare these rates to empirically determined rates to cho ose the most appropriate threshold. The simulation is run as follows: we start with the matrix's consensus sequence, and make one mutation according to the neutral HKY11 model. The sequence's score is evaluated: if it exceeds the threshold, the mutation is considered fixed and the count of substitutions at that position is incremented, and if not, no increment is made and the sequence reverts back to the original sequence. This mutateselect pro cess is repeated. Assuming that the impact of polymorphism is negligible, removing a given fraction of mutations by selection will reduce the substitution rate by that fraction. Thus, the proportion of accepted over total mutations at each position is evaluated to be the rate of mutation relative to the neutral rate. We use sum-of-squares as a distance metric to compare each affinitythreshold rate distribution to the empirical distribution, and we considered the best-fitting affinity threshold to be the affinity threshold that generates the distribution with the smallest distance to the empirical relative rates. 3.2. The affinity-threshold model wel l describes binding site substitution rates The Halpern-Bruno mo del12 has been incorporated into effective tools for motif discovery13 and identification,14 and it has been shown to well describe yeast binding site relative rates of substitution.15 These rates are also generated by our mo del, and so we judged our model's accuracy by comparing its performance to the Halpern-Bruno model's performance (fig. 1). We aligned ChIP-chip bound regions and computed summed position-specific rates of substitution for the aggregate binding sites of the 111 transcription factors that met our conservation requirements. We were able to find a threshold at which the affinity-threshold model better resembled the empirical data than the Halpern-Bruno model did for 42 of the 49 factors with adequate training data (see Methods). The affinity-threshold model well approximates the position-specific substitution rates of most factors. The best-fitting score threshold for a transcription factor's binding sites may correspond to their minimum non-deleterious affinity for that transcription factor. If this minimum is variable and can be found through our evolutionary analysis, then we should be able to detect that variability robustly. To this end, we used a bootstrap to assess the reliability of our Pacific Symposium on Biocomputing 13:489-500(2008) September 27, 2007 11:15 WSPC - Pro ceedings Trim Size: 9in x 6in revisions0924 Fig. 1. Position specific rate variation and model predictions for (a) Fkh2, (b) Fhl1, and (c) Aft2: relative rate (subst/substtot ) vs position in site. The black line marks the empirical rates, the dashed line marks the Halpern-Bruno predicted rates, and grey line marks the best-fitting affinity-threshold. The grey bar contains the set of rates predicted by all affinity thresholds within the factor's 95% confidence interval predictions, resampling the the aligned sites. Although most transcription factors had large confidence intervals, they were dispersed over sufficiently wide intervals such that we could form three distinct sets (table 1). We grouped factors with lower bounds greater than 5.9 into a "stringent threshold" set, factors with upper bounds lower than 5.1 into a "lenient threshold" set, and factors with upper bounds lower than 12 and lower bounds greater than -2 into a "medium threshold" set; transcription factors appear to have variable site affinity requirements. We use these sets in all further analysis. 3.3. The affinity-threshold model predicts extant score distributions for most factors If the affinity-threshold mo del is a reasonable approximation of the evolution of the system, then it should describe other properties of the system beyond the position-specific rate variation of binding sites. One additional prediction of the mo del is the distribution of binding site scores. For each Pacific Symposium on Biocomputing 13:489-500(2008) September 27, 2007 11:15 WSPC - Pro ceedings Trim Size: 9in x 6in revisions0924 Table 1. Affinity threshold confidence intervals and corresponding site prevalence for transcription factors in the stringent (left), medium (middle), and lenient (right) threshold groups CIa Reb1p Bas1p Fkh2p Cbf1p Abf1p Sum1p Tye7p Mcm1p Hap4p 8.3-11.1 5.8-13.6 8.1-15.2 6.2-12.0 11.0-12.9 6.2-14.5 8.6-11.3 8.7-19.5 11.0-14.9 P r ev . b .226-.117 .566-.005 .497-.003 .219-.028 .108-.075 .484-.009 .183-.037 .133-.002 .059-.003 Cin5p Mbp1p Fhl1p Gcn4p Swi6p Ste12p Nrg1p CIa -0.4-8.5 2.7-11.7 4.2-11.3 4.0-10.6 3.8-9.9 1.0-6.5 -1.3-7.0 P r ev . b .997-.294 .793-.059 .702-.048 .682-.080 .854-.166 .997-.705 .968-.388 Sut1p Aft2p Phd1p Ace2p Yap6p Adr1p Hap5p Mot3p CIa -9.9-4.2 -9.8-4.2 -9.8-5.1 -9.9--0.8 -9.9-4.2 -9.5-2.3 -9.4--2.1 -2.9-5.1 P r ev . b .988-.845 .988-.794 .998-.867 .999-.999 .993-.909 .991-.856 .993-.993 .996-.595 Note : a 95% confidence interval, log base two scores b Prevalence: first and second quantities are the fraction of all promoters containing a site meeting the lower and upper bounds of the CI, respectively factor in the groups determined above we sampled the Markov chain and computed the mean binding site score under the affinity-threshold model. We compared this to the average maximum score for that transcription factor in ChIP-chip bound regions (fig. 2). Although it had a downward bias, the affinity-threshold mo del predicted the extant distribution of stringentand medium- threshold transcription factor binding sites. However, it fared worse with the lenient-threshold binding sites, suggesting that the evolution of these sites may not operate within the simplifying bounds of the model, i.e. perhaps their evolution is governed by a more complex fitness landscape instead of our stepwise plateau. Nevertheless, average maximum scores in bound regions for these factors are still found commonly in the genome. 3.4. Stringent- and lenient-threshold binding sites have distinct patterns of local evolution The lenient set of transcription factors allows for binding sites that would be found often by chance in the genome. If this lenient affinity is truly sufficient, these transcription factors may rely on other bound proteins to separate desired from undesired binding sites. In contrast, sites meeting the affinity threshold for stringent-threshold transcription factors should be high-o ccupancy sites without a need for additional information due to their strong predicted affinity. To investigate this hypothesis, we counted the average number of different transcription factors bound at each promoter for each of the factors used in the Harbison et al ChIP-chip experiments. Let "lenient-group sites" Pacific Symposium on Biocomputing 13:489-500(2008) September 27, 2007 11:15 WSPC - Pro ceedings Trim Size: 9in x 6in revisions0924 Fig. 2. Predicted average score at best-fitting affinity threshold vs. average maximum score in ChIP-chip bound regions (log base two scores). Stringent-, medium-, and lenientthreshold transcription factors presented as black, dark grey, and light grey dots, respectively. refer to sites bound by lenient-threshold transcription factors (e.g. Sut1p, table 1), and let "medium-group" and "stringent group" sites be defined similarly. As expected, the stringent and lenient groups were separated, the lenient group promoters having just under three more unique bound factors per promoter for each of three binding significance cutoffs. However, the medium and lenient groups were not well separated. We used the variation in local substitution patterns to determine whether medium and lenient group factors could be distinguished by an enrichment of lo cal binding events. While medium and lenient group sites have similar numbers of different transcription factors bound to promoters that they also bind, lenient group sites will have a higher density of other binding sites immediately surrounding theirs if recruitment by other proteins is necessary for their function. This density should be reflected in the lo cal pattern of evolution, as the sequence will be comparatively restrained. We calculated rates of substitution surrounding the binding sites of Pacific Symposium on Biocomputing 13:489-500(2008) September 27, 2007 11:15 WSPC - Pro ceedings Trim Size: 9in x 6in revisions0924 Table 2. Average number of binding sites per promoter, grouped by best-fit affinity threshold and ChIP-chip binding p-value Gr oup .005 Stringent Medium Lenient 7.78 10.30 10.73 p