Bayesian Joint Prediction of Associated Transcription Factors in Bacillus subtilis Y. Makita, M.J.L. De Hoon, N. Ogasawara, S. Miyano, and K. Nakai Pacific Symposium on Biocomputing 10:507-518(2005)


September 23, 2004

22:28

Proceedings Trim Size: 9in x 6in

makita

BAYESIAN JOINT PREDICTION OF ASSOCIATED TRANSCRIPTION FACTORS IN BACILLUS SUBTILIS

Y. MAKITA1,2, M.J.L. DE HOON1, N. OGASAWARA3, S. MIYANO1, AND K. NAKAI1
1 Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
2 School of Technology, Nagoya University, Furocho, Chikusa-ku, Nagoya, Aichi 464-8603, Japan
3 Graduate School of Biological Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0101, Japan

Sigma factors, often in conjunction with other transcription factors, regulate gene expression in prokaryotes at the transcriptional level. Specific transcription factors tend to co-occur with specific sigma factors. To predict new members of the transcription factor regulon, we applied Bayes' rule to combine the Bayesian probability of sigma factor prediction calculated from microarray data and the sigma factor binding sequence motif, the motif score of the transcription factor associated with the sigma factor, the empirically determined distance from the transcription start site to the cis-regulatory region, and the tendency for specific sigma factors and transcription factors to co-occur. By combining these information sources, we improve the accuracy of predicting regulation by transcription factors, and also confirm the sigma factor prediction. We applied our proposed method to all genes in Bacillus subtilis to find currently unknown gene regulations by transcription factors and sigma factors.

1. Introduction

In recent years, the genomes of more than one hundred bacteria have been sequenced and their coding regions identified. Inferring the regulatory mechanisms of those genes remains a difficult problem. To understand regulatory systems on a genome-wide scale, gene expression data have been accumulated from microarray experiments for several organisms under various experimental conditions. Owing to the complexity of the regulatory network and limits on experimental accuracy, it is difficult to predict reliably which transcription factor (TF) regulates which genes.

One promising approach to predicting regulation is supervised learning. However, it is powerful only if a sufficiently large training set is available, which is often not the case. Even in one of the best-studied bacteria, B. subtilis, only 20% of known TFs have more than 10 known binding sequences.1 To address this problem, we consider combining other data in their biological context. In this paper, we focus on the joint prediction of sigma factors and associated TFs.

Sigma factors, which bind to the RNA polymerase complex, recognize specific DNA motifs that are located -35/-10 or -24/-12 basepairs from the transcription start site. For B. subtilis, 18 sigma factors are known. SigA is the primary sigma factor and regulates most genes, while secondary sigma factors activate specific groups of genes depending on cellular conditions. For example, the sigma factors SigE, SigF, SigG, SigK, and SigH are related to sporulation, while SigB is involved in the stress response, and SigD regulates genes related to flagellar motion and chemotaxis. Similarly, other (non-sigma) TFs are involved in particular cellular processes. As a result, some combinations of sigma factors and TFs are often found to jointly regulate a gene, while other combinations occur rarely. As an extreme example, SigL, which belongs to the sigma54 family of enhancer-dependent sigma factors, can only direct transcription if one of the activating TFs AcoR, BkdR, LevR, RocR, or YplP is present.

Joint prediction of sigma factors and TFs is particularly important for SigA, which regulates about 90% of the B. subtilis genes. Differential regulation of these genes therefore requires additional TFs. Previously, our group predicted which sigma factor regulates each gene in B. subtilis using 174 microarray data as well as experimentally known sigma factor binding motifs.2 TF binding sites are typically located near the transcription start site, which can be found from the predicted sigma factor binding site. For example, in Escherichia coli, it is known that almost all activators have upstream binding sites near the transcription start site, whereas more than two thirds of repressors have at least one downstream binding site.3

Here, we aim to predict gene regulation by TFs by combining predicted sigma factor binding sites with the biological information of joint regulation by associated TFs, as well as the distribution of TF binding sites near the sigma factor binding site. Additionally, we consider TFs with more than one binding site for a specific gene, which can be used to improve the prediction accuracy.4,5,6



2. Method

To construct a suitable score function, we applied Bayesian statistics to combine the Bayesian probability of sigma factor prediction calculated from the microarray data and binding motif,2 the Position Specific Score Matrix (PSSM) of the binding motif of the TF associated with the sigma factor, and the empirically determined distance from the transcription start site to the cis-regulatory region. We used the sigma factor predictions2 to find the transcription start site and to determine which TFs may be expected to co-regulate the gene.

2.1. Sigma factor prediction

Previously, our group predicted gene regulation by sigma factors using the information of sigma factor binding motifs and microarray data.2 We extend this prediction to the full B. subtilis genome and to all sigma factors with known regulated genes, allowing genes to be regulated by more than one sigma factor. From this prediction, we find the Bayesian prior probability $P_{\mathrm{prior}}(\sigma = \sigma^N)$ that a gene is regulated by $\sigma^N$, where $N \in \{A, B, D, E, F, G, H, K, L, W, X\}$.

2.2. Combining sigma factors and transcription factors

Specific TFs tend to occur with specific sigma factors, as shown in Table 1. In addition to the four known genes, one more gene was predicted as an enhancer for SigL-regulated genes by our Pfam search (PF00309).7
Table 1. Sigma factors and associated TFs in B. subtilis.

Family   Sigma factor  Function / Cooperative transcription factors
sigma70  SigA          Function: Housekeeping; early sporulation
                       TFs: AbrB(21) AraR(3) CcpA(40) CcpC(3) ComA(6) ComK(40) CtsR(6) DegU(15)
                            DinR(6) FNR(5) Fur(21) GlnR(4) Hpr(6) PerR(7) PucR(7) PurR(11)
                            RocR(4) Spo0A(10) TnrA(11) Zur(3)
         SigE          Function: Expressed in early mother cell
                       TFs: SpoIIID(4)
         SigH          Function: Expressed in postexponential phase; competence and early sporulation
                       TFs: Spo0A(4)
         SigK          Function: Expressed in late mother cell
                       TFs: GerE(13) SpoIIID(5)
sigma54  SigL          Function: Degradative enzymes
                       TFs: AcoR(1) BkdR(1) LevR(1) RocR(3)

The number in parentheses is the number of genes known to be regulated by each combination of sigma factor and TF. Genes whose sigma factor is unknown experimentally were assigned to the SigA regulon, which contains 90% of the B. subtilis genes.10



From Table 1, we can estimate the probability that a gene is co-regulated by transcription factor $T_i$, given regulation by sigma factor $\sigma^N$:

$$P_{\mathrm{prior}}(T = T_i \mid \sigma = \sigma^N) = \frac{\#\,\text{genes regulated by } T_i \text{ and } \sigma^N}{\#\,\text{genes regulated by } \sigma^N} \tag{1}$$
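As a minimal sketch of Eq. (1), the following Python snippet turns co-regulation counts such as those in Table 1 into conditional prior probabilities. The function name `cooccurrence_prior`, the `pseudo_total` argument, the regulon size of 60, and the count of 42 genes without a known TF are illustrative assumptions and are not taken from this paper. The `pseudo_total` argument implements only the general additive form of the pseudocount discussed in the next paragraph: an equal share of a total is added to each numerator and the total itself to the denominator. How that total is chosen is left open here.

```python
def cooccurrence_prior(joint_counts, regulon_size, pseudo_total=0.0):
    """Estimate P(T = T_i | sigma = sigma^N) from counts, as in Eq. (1).

    joint_counts : dict mapping a TF name to the number of genes regulated
                   by both that TF and the sigma factor. The unknown TF
                   (index 0) is stored under the key None.
    regulon_size : number of genes regulated by the sigma factor.
    pseudo_total : ASSUMED total pseudocount. Each of the k + 1 categories
                   receives pseudo_total / (k + 1); zero gives plain Eq. (1).
    """
    if regulon_size <= 0:
        raise ValueError("regulon_size must be positive")
    categories = len(joint_counts)            # k + 1, including the unknown TF
    share = pseudo_total / categories
    denominator = regulon_size + pseudo_total
    return {tf: (n + share) / denominator for tf, n in joint_counts.items()}


# GerE and SpoIIID counts for SigK are from Table 1.
# The regulon size (60) and the "no known TF" count (42) are HYPOTHETICAL.
sigk_counts = {"GerE": 13, "SpoIIID": 5, None: 42}
plain = cooccurrence_prior(sigk_counts, regulon_size=60)
smoothed = cooccurrence_prior(sigk_counts, regulon_size=60, pseudo_total=3.0)
```

With a non-zero `pseudo_total`, a TF that has never been observed with a given sigma factor still receives a small positive prior, which is the purpose of the smoothing.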

Some combinations of sigma factor and TF may exist that have not yet been found experimentally. To allow for this possibility, we add a pseudocount8 of $\frac{1}{k+1}\,(\#\,\text{genes regulated by } T_i)$, where $k$ is the number of TFs under consideration, to the numerator, and $\#\,\text{genes regulated by } T_i$ to the denominator. Note that $i$ runs from 0 to $k$, where 0 corresponds to a currently unknown transcription factor.

2.3. Motif search

The motif sequences can be described statistically by a position specific score matrix (PSSM) $W_{r,b}$ for each TF.8 This matrix gives the log-odds score of finding nucleotide $b$ at position $r$ in the binding sequence motif of the TF. The log-likelihood ratio for transcription factor $T_i$ binding a subsequence $S_i$ of the sequence $S$ upstream of a gene is then
$$M_i \equiv \ln \frac{P[S_i \mid T_i \text{ binds } S_i]}{P[S_i \mid \text{background}]} = \sum_{r=0}^{R-1} W_{r,\,S_i[r]} \tag{2}$$

where $R$ is the length of the motif. The PSSM was calculated from the known binding motifs of the genes in the regulon of each TF, as listed in the DBTBS database. For the matrix calculation based on $n$ known binding sites, we added $n$ pseudocounts,8 using a non-coding region background probability of 0.3185 for A and T, and 0.1815 for C and G.

2.4. Relative distance from transcription start site to TF binding site

Using the DBTBS data, we estimated the probability density $f_{\mathrm{dist}}(D_i)$ of the distance $D_i$, measured in base pairs, from the transcription start site to the binding site of transcription factor $T_i$, using kernel density estimation with Gaussian kernels.9 Positive regulators tend to bind upstream of the transcription start site, while negative regulators bind at or downstream of it. About half of the TFs we consider are dual-purpose regulators, which regulate some genes positively and others negatively. The binding sites of these dual regulators are spread over a wider range than those of purely positive or purely negative regulators.
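The two sequence-derived quantities of Sections 2.3 and 2.4, the PSSM score $M_i$ of Eq. (2) and the distance density $f_{\mathrm{dist}}$, can be sketched together in Python as follows. This is a minimal illustration under stated assumptions and not the authors' implementation. The binding sites, distances, total pseudocount, and kernel bandwidth are hypothetical values chosen for the example. The function names are also assumptions. Only the background frequencies come from the text.

```python
import math

# Non-coding background frequencies quoted in Section 2.3.
BACKGROUND = {"A": 0.3185, "T": 0.3185, "C": 0.1815, "G": 0.1815}


def build_pssm(sites, pseudo_total):
    """Log-odds matrix W[r][b] from aligned binding sites of equal length.

    pseudo_total is the total pseudocount per column, distributed over the
    four bases in proportion to the background (its value is an ASSUMPTION).
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    pssm = []
    for r in range(length):
        column = {}
        for base, bg in BACKGROUND.items():
            count = sum(1 for s in sites if s[r] == base)
            freq = (count + pseudo_total * bg) / (n + pseudo_total)
            column[base] = math.log(freq / bg)
        pssm.append(column)
    return pssm


def best_hit(pssm, upstream):
    """Return (M_i, offset): the maximum Eq. (2) score over all windows."""
    width = len(pssm)
    hits = [
        (sum(pssm[r][upstream[start + r]] for r in range(width)), start)
        for start in range(len(upstream) - width + 1)
    ]
    return max(hits)


def gaussian_kde(samples, bandwidth):
    """Return f(d) = 1 / (n h) * sum_j phi((d - d_j) / h), phi standard normal."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(d):
        return norm * sum(math.exp(-0.5 * ((d - x) / bandwidth) ** 2) for x in samples)

    return density


toy_sites = ["TGAA", "TGAA", "TGTA", "TGAA"]          # HYPOTHETICAL binding sites
pssm = build_pssm(toy_sites, pseudo_total=2.0)
m_i, offset = best_hit(pssm, "CCCTGAACC")             # best window starts at offset 3

toy_distances = [-60.0, -55.0, -48.0, -52.0, 10.0]    # HYPOTHETICAL distances (bp)
f_dist = gaussian_kde(toy_distances, bandwidth=8.0)   # bandwidth is an ASSUMPTION
```

The bandwidth controls how strongly the density is smoothed. The text does not state which value or selection rule was used, so it is left as an explicit parameter here.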



[Figure 1: probability distribution function (vertical axis, 0.00 to 0.03) against distance in basepairs from the transcription start site (horizontal axis, -200 to 100), with curves for ComA (positive regulator), Fur (negative regulator), and Spo0A (dual regulator).]

Figure 1. Distribution of the position of the TF binding site with respect to the transcription start site.

As Figure 1 shows, the curves for the positive regulator (ComA) and the negative regulator (Fur) each have two peaks. The lower peaks correspond to TFs having two or more binding sites.

2.5. Combining sigma factor and transcription factor prediction

The joint probability that a gene is regulated by transcription factor $T_i$, $i \in \{1, \ldots, k\}$, and sigma factor $\sigma^N$ is denoted by $P(\sigma = \sigma^N, T = T_i)$. Here, $T_0$ corresponds to an unknown TF. To derive the posterior joint probability, we combined the following three elements: the prior joint probability $P_{\mathrm{prior}}(\sigma = \sigma^N, T = T_i)$, the maximum PSSM score $M_i$ calculated for $T_i$ in each promoter sequence, and the distance $D_i$ between the transcription start site and the predicted TF binding site. $M_i$ and $D_i$ are calculated from the sequence region $S$ upstream of the gene. The Bayesian posterior probability that a gene is regulated by sigma factor $\sigma^N$ and transcription factor $T_i$, given the upstream sequence $S$, can be calculated as

$$P(\sigma = \sigma^N, T = T_i \mid S) = \frac{P(S \mid \sigma = \sigma^N, T = T_i)\, P_{\mathrm{prior}}(\sigma = \sigma^N, T = T_i)}{\sum_{U} \sum_{j} P(S \mid \sigma = \sigma^U, T = T_j)\, P_{\mathrm{prior}}(\sigma = \sigma^U, T = T_j)}, \tag{3}$$



where in the denominator $U$ is summed over sigma factors A, B, D, E, F, G, H, K, L, X, and W. The prior probability $P_{\mathrm{prior}}(\sigma = \sigma^N, T = T_i)$ is calculated as $P_{\mathrm{prior}}(\sigma = \sigma^N)\, P_{\mathrm{prior}}(T = T_i \mid \sigma = \sigma^N)$, as described above. $P(S \mid \sigma = \sigma^N, T = T_i)$ is the conditional probability that an upstream sequence $S$ is generated, given that $\sigma^N$ and $T_i$ regulate the gene. The upstream sequence $S$ consists of the binding site $S_i$, described by the PSSM, and the remaining sequence $S \setminus S_i$. We can then decompose $P(S \mid \sigma = \sigma^N, T = T_i)$ into three parts:

$$P(S \mid \sigma = \sigma^N, T = T_i) = P(S_i \mid T = T_i) \cdot P(S \setminus S_i \mid \text{background}) \cdot f_{\mathrm{dist}}(D_i). \tag{4}$$

The third factor is the probability that $S_i$ is generated at a distance $D_i$ from the transcription start site (Section 2.4). Here, the predicted position of the transcription start site depends on the sigma factor $\sigma^N$, as described previously.2 Dividing by the background probability yields

$$\frac{P(S \mid \sigma = \sigma^N, T = T_i)}{P(S \mid \text{background})} = \frac{P(S_i \mid T = T_i)}{P(S_i \mid \text{background})}\, f_{\mathrm{dist}}(D_i) = e^{M_i} f_{\mathrm{dist}}(D_i), \tag{5}$$

where $M_i$ is the maximum value of the PSSM score for transcription factor $T_i$ over the upstream region $S$. For an unknown transcription factor ($T = T_0$), however, this ratio is equal to unity. Note that for uniform $f_{\mathrm{dist}}(D_i)$, this reduces to $e^{M_i}/D_{\max}$, where $D_{\max}$ is the size of the upstream region $S$ that we search. This then corresponds to the Bonferroni correction for multiple comparisons. By combining these equations, we find the following expression for the posterior probability:

$$P[\sigma = \sigma^N, T = T_i \mid S] = \frac{\exp\left[\mathrm{score}(\sigma^N, T_i)\right]}{\sum_{U} \sum_{j=0}^{k} \exp\left[\mathrm{score}(\sigma^U, T_j)\right]}, \tag{6}$$

where we defined the score function

$$\mathrm{score}(\sigma^N, T_i) \equiv \ln P_{\mathrm{prior}}(T = T_i \mid \sigma = \sigma^N) + \ln P_{\mathrm{prior}}(\sigma = \sigma^N) + M_i + \ln f_{\mathrm{dist}}(D_i), \tag{7}$$

while we drop the last two terms if $i = 0$. For genes that have more than two binding sites for the same transcription factor $T_i$, we add terms $(M_i + \ln f_{\mathrm{dist}}(D_i))$ correspondingly.
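Eqs. (6) and (7) amount to summing log-terms into a score and normalising the exponentiated scores over all (sigma factor, TF) pairs. The sketch below illustrates this in Python. The function names and the choice to pass binding sites as a list of pairs are assumptions made for the example. The numerical inputs are three rows taken from Table 2, so the resulting probabilities are normalised over this small subset only and differ from the table's final column.

```python
import math


def score(ln_prior_tf_given_sigma, ln_prior_sigma, sites=()):
    """Eq. (7). `sites` holds one (M_i, ln f_dist(D_i)) pair per binding site.

    An empty `sites` corresponds to the unknown TF (i = 0), for which the
    last two terms are dropped. Extra pairs add extra terms for genes with
    several binding sites of the same TF.
    """
    total = ln_prior_tf_given_sigma + ln_prior_sigma
    for m_i, ln_f_dist in sites:
        total += m_i + ln_f_dist
    return total


def posterior(scores):
    """Eq. (6): normalise exp(score) over all (sigma, TF) pairs.

    The maximum score is subtracted before exponentiating, which leaves the
    ratio unchanged and avoids overflow for large scores.
    """
    top = max(scores.values())
    weights = {pair: math.exp(s - top) for pair, s in scores.items()}
    z = sum(weights.values())
    return {pair: w / z for pair, w in weights.items()}


# Three rows of Table 2 for the gene rocA (ND = unknown TF).
s_sigA_rocR = score(-5.54, -2.04, [(30.5, -19.4)])
s_sigL_rocR = score(-1.22, -0.57, [(30.5, -10.86)])
s_sigL_nd = score(-2.85, -0.57)

post = posterior({
    ("sigA", "RocR"): s_sigA_rocR,
    ("sigL", "RocR"): s_sigL_rocR,
    ("sigL", "ND"): s_sigL_nd,
})
```

Working in log space keeps the very different magnitudes of the prior, motif, and distance terms comparable until the final normalisation.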



2.6. Example calculation

We calculated the Bayesian posterior probability in Eq. (6) that the gene rocA is regulated by each sigma factor and by one of the TFs AcoR, BkdR, LevR, RocR, or an unknown TF. Table 2 shows that the (SigL, RocR) combination is by far the most likely. From biological experiments, rocA is known to be regulated by SigL and RocR, which serves as the transcriptional activator of arginine utilization operons.

3. Validation

3.1. The sigma factor prediction aids in the TF prediction

To verify the validity of combining the TF prediction with the sigma factor prediction, we examined the contribution of each term in Eq. (7). To assess the effect of using the sigma factor prediction for the TF prediction, we compare the two scores $M_i + \ln P(T = T_i \mid \sigma = \sigma^N)$ and $M_i$ (Table 3). The negative dataset consists of genes regulated by sigma factors whose regulons do not contain any genes that are known to be regulated by the TF. The positive dataset consists of the genes known to be regulated by the TF. The specificity is given by TP/(TP + FP) and the sensitivity is given by TP/(TP + FN), where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.

Furthermore, the sigma factor binding site predicted through $P_{\mathrm{prior}}(\sigma = \sigma^N)$ in Eq. (7) allows us to search for the TF motif nearby on the genome, as represented by the term $\ln f_{\mathrm{dist}}(D_i)$ in Eq. (7). We show the effect of including this term in Table 4. As shown in these tables, both the sigma factor information and the transcription start sites greatly improve the specificity and the sensitivity of the TF prediction. The biological knowledge that specific sigma factors and TFs tend to co-occur is particularly informative, as shown in Table 3.

3.2. The TF prediction aids in the sigma factor prediction

We calculate the posterior probability that a gene is regulated by a specific sigma factor by summing Eq. (6) over $T_i$. As shown in Table 5, this posterior probability is more accurate than the prior probability in predicting sigma factors. While the prior probability already gives a very accurate prediction of sigma factor regulation, the accuracy of the posterior probability is even higher. We note that for unknown genes, the sigma factor prediction may be less accurate due to uncertainties in the operon structure.2,11
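The marginalisation described in Section 3.2 can be illustrated with the rocA example of Section 2.6. The Python sketch below takes the Score column of Table 2, applies the normalisation of Eq. (6), and then sums over the TFs to obtain a posterior for each sigma factor. The function and variable names are assumptions made for this illustration. The scores are copied from Table 2.

```python
import math

TFS = ["AcoR", "BkdR", "LevR", "RocR", "ND"]   # ND = unknown TF

# Score column of Table 2 for the gene rocA, one list per sigma factor,
# ordered as in TFS.
SCORES = {
    "sigA": [-7.62, -8.03, -9.30, 3.52, -7.58],
    "sigB": [-11.13, -7.91, -7.74, 10.14, -6.62],
    "sigD": [-12.00, -12.66, -12.12, 10.05, -10.18],
    "sigE": [-6.18, -7.66, -9.64, 1.06, -6.99],
    "sigF": [-13.08, -10.03, -9.87, 8.17, -8.72],
    "sigG": [-10.18, -11.23, -10.74, 12.06, -8.14],
    "sigH": [-13.82, -9.71, -9.92, 4.38, -8.74],
    "sigK": [-13.98, -11.04, -11.05, 8.37, -9.15],
    "sigL": [-6.36, -3.80, -3.89, 17.85, -3.42],
    "sigX": [-13.00, -12.94, -13.49, 6.03, -11.94],
    "sigW": [-13.91, -15.01, -14.55, 8.24, -11.81],
}


def joint_posterior(scores):
    """Eq. (6): posterior probability of every (sigma factor, TF) pair."""
    top = max(max(row) for row in scores.values())
    weights = {
        (sigma, tf): math.exp(value - top)
        for sigma, row in scores.items()
        for tf, value in zip(TFS, row)
    }
    z = sum(weights.values())
    return {pair: w / z for pair, w in weights.items()}


def sigma_posterior(joint):
    """Sum the joint posterior over TFs to get one value per sigma factor."""
    marginal = {}
    for (sigma, _tf), p in joint.items():
        marginal[sigma] = marginal.get(sigma, 0.0) + p
    return marginal


joint = joint_posterior(SCORES)
sigma_post = sigma_posterior(joint)
```

Rounded to three decimals, `joint[("sigL", "RocR")]` and `joint[("sigG", "RocR")]` give 0.996 and 0.003, matching the two non-zero entries in the Probability column of Table 2. The marginal for sigL is dominated by the RocR term.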



Table 2. Probability that rocA is regulated by various combinations of a sigma factor and TF.

Sigma  TF     M_i    ln f_dist(D_i)  ln P_prior(T=T_i|sigma^N)  ln P_prior(sigma^N)   Score   Probability
sigA   AcoR    6.41       -6.45              -5.54                    -2.04            -7.62     0.000
       BkdR    4.53       -4.98              -5.54                    -2.04            -8.03     0.000
       LevR    3.21       -4.93              -5.54                    -2.04            -9.30     0.000
       RocR   30.50      -19.40              -5.54                    -2.04             3.52     0.000
       ND                                    -5.54                    -2.04            -7.58     0.000
sigB   AcoR    6.41      -10.92              -4.69                    -1.93           -11.13     0.000
       BkdR    3.77       -5.06              -4.69                    -1.93            -7.91     0.000
       LevR    4.08       -5.20              -4.69                    -1.93            -7.74     0.000
       RocR   30.50      -13.74              -4.69                    -1.93            10.14     0.000
       ND                                    -4.69                    -1.93            -6.62     0.000
sigD   AcoR    6.41       -8.23              -3.63                    -6.55           -12.00     0.000
       BkdR    4.53       -7.01              -3.63                    -6.55           -12.66     0.000
       LevR    3.48       -5.42              -3.63                    -6.55           -12.12     0.000
       RocR   30.50      -10.27              -3.63                    -6.55            10.05     0.000
       ND                                    -3.63                    -6.55           -10.18     0.000
sigE   AcoR    6.41       -5.61              -4.82                    -2.17            -6.18     0.000
       BkdR    4.53       -5.21              -4.82                    -2.17            -7.66     0.000
       LevR    3.21       -5.87              -4.82                    -2.17            -9.64     0.000
       RocR   30.50      -22.45              -4.82                    -2.17             1.06     0.000
       ND                                    -4.82                    -2.17            -6.99     0.000
sigF   AcoR    6.41      -10.78              -3.57                    -5.14           -13.08     0.000
       BkdR    3.77       -5.08              -3.57                    -5.14           -10.03     0.000
       LevR    4.08       -5.24              -3.57                    -5.14            -9.87     0.000
       RocR   30.50      -13.61              -3.57                    -5.14             8.17     0.000
       ND                                    -3.57                    -5.14            -8.72     0.000
sigG   AcoR    6.41       -8.45              -4.03                    -4.12           -10.18     0.000
       BkdR    4.53       -7.61              -4.03                    -4.12           -11.23     0.000
       LevR    3.48       -6.08              -4.03                    -4.12           -10.74     0.000
       RocR   30.50      -10.30              -4.03                    -4.12            12.06     0.003
       ND                                    -4.03                    -4.12            -8.14     0.000
sigH   AcoR    0.76       -5.84              -4.59                    -4.14           -13.82     0.000
       BkdR    4.53       -5.50              -4.59                    -4.14            -9.71     0.000
       LevR    3.48       -4.66              -4.59                    -4.14            -9.92     0.000
       RocR   30.50      -17.38              -4.59                    -4.14             4.38     0.000
       ND                                    -4.59                    -4.14            -8.74     0.000
sigK   AcoR    0.00       -4.83              -4.13                    -5.02           -13.98     0.000
       BkdR    3.77       -5.66              -4.13                    -5.02           -11.04     0.000
       LevR    4.08       -5.98              -4.13                    -5.02           -11.05     0.000
       RocR   30.50      -12.98              -4.13                    -5.02             8.37     0.000
       ND                                    -4.13                    -5.02            -9.15     0.000
sigL   AcoR    6.41      -10.46              -1.74                    -0.57            -6.36     0.000
       BkdR    4.53       -6.02              -1.74                    -0.57            -3.80     0.000
       LevR    3.48       -5.06              -1.74                    -0.57            -3.89     0.000
       RocR   30.50      -10.86              -1.22                    -0.57            17.85     0.996
       ND                                    -2.85                    -0.57            -3.42     0.000
sigX   AcoR    6.41       -7.47              -2.58                    -9.36           -13.00     0.000
       BkdR    4.53       -5.53              -2.58                    -9.36           -12.94     0.000
       LevR    3.48       -5.03              -2.58                    -9.36           -13.49     0.000
       RocR   30.50      -12.53              -2.58                    -9.36             6.03     0.000
       ND                                    -2.58                    -9.36           -11.94     0.000
sigW   AcoR    6.41       -8.51              -3.99                    -7.82           -13.91     0.000
       BkdR    4.53       -7.74              -3.99                    -7.82           -15.01     0.000
       LevR    2.96       -5.71              -3.99                    -7.82           -14.55     0.000
       RocR   30.50      -10.45              -3.99                    -7.82             8.24     0.000
       ND                                    -3.99                    -7.82           -11.81     0.000

ND: TF unknown case.



Table 3. The effect of sigma factor information on the TF prediction.

                  M_i + ln P(T = T_i | sigma = sigma^N)    M_i
TF       sigma    TP   FP   FN   SP       SN               TP   FP   FN   SP       SN
Spo0A    A,H       8    0    0   100.0%   100.0%            5    3    3    62.5%    62.5%
SpoIIID  E,K       9    0    0   100.0%   100.0%            3    8    6    27.3%    33.3%
GerE     K        13    0    0   100.0%   100.0%            7   13    6    35.0%    53.8%
SigL     L         5    0    0   100.0%   100.0%            5    0    0   100.0%   100.0%
Total             35    0    0   100.0%   100.0%           20   24   15    45.5%    57.1%

TP: true positive, FP: false positive, FN: false negative, SP: specificity, and SN: sensitivity.

Table 4. The effect of transcription start site information on TF prediction.

                  M_i + ln f_dist(D_i)                     M_i
TF       sigma    TP   FP   FN   SP       SN               TP   FP   FN   SP       SN
Spo0A    A,H       6    2    2    75.0%    75.0%            5    3    3    62.5%    62.5%
SpoIIID  E,K       5    6    4    45.5%    55.6%            3    8    6    27.3%    33.3%
GerE     K         7    2    6    77.8%    53.8%            7   13    6    35.0%    53.8%
SigL     L         5    0    0   100.0%   100.0%            5    0    0   100.0%   100.0%
Total             23   10   12    69.7%    65.7%           20   24   15    45.5%    57.1%
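The specificity and sensitivity columns of Tables 3 and 4 follow directly from the counts by the definitions given in Section 3.1. The short Python sketch below recomputes the "Total" row of Table 4. The function name is an assumption. Note that the quantity TP/(TP + FP), called specificity in this paper, is often referred to elsewhere as precision or positive predictive value.

```python
def specificity_sensitivity(tp, fp, fn):
    """Return (SP, SN) with SP = TP / (TP + FP) and SN = TP / (TP + FN),
    following the definitions used in Section 3.1."""
    return tp / (tp + fp), tp / (tp + fn)


# (TP, FP, FN) from the "Total" row of Table 4.
with_distance = specificity_sensitivity(23, 10, 12)   # M_i + ln f_dist(D_i)
motif_only = specificity_sensitivity(20, 24, 15)      # M_i alone

for label, (sp, sn) in [("with distance", with_distance), ("motif only", motif_only)]:
    print(f"{label}: SP = {sp:.1%}, SN = {sn:.1%}")
```

Applied to the other rows, the same function gives the per-TF percentages, for example 6/(6 + 2) = 75.0% for Spo0A with the distance term included.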

4. Results

We applied our proposed method to jointly predict sigma factors and TFs for all genes in B. subtilis in order to find currently unknown gene regulations. Table 6 shows some predicted combinations for which a high posterior probability was found. For many proteins, the function is presently unknown; the sigma/TF prediction can suggest the cellular function of those proteins.

CcpA is a global repressor involved in carbon catabolite repression, which binds to the CRE site (TGWAANCGGNTNWCA).10 Our prediction shows that CcpA acts on some genes related to sugar metabolism (sacP, fruR, yojA) and a dehydrogenase (yrbE), which is consistent with the known function of CcpA.

The sporulation genes spoIIP and spoIID are known to be regulated by SigE. Both genes are required for complete dissolution of the asymmetric
Table 5. The accuracy of the sigma factor prediction.

         prior                                   posterior
sigma    TP   FP   FN   SP       SN              TP   FP   FN   SP       SN
SigE      53    3    2    94.6%    96.4%           53    2    2    96.4%    96.4%
SigH      33    5    5    86.8%    86.8%           35    5    3    87.5%    92.1%
SigK      24    1    1    96.0%    96.0%           25    1    0    96.2%   100.0%
SigL       5    0    0   100.0%   100.0%            5    0    0   100.0%   100.0%
Total    115    9    8    92.7%    93.5%          118    8    5    93.7%    95.9%



Table 6. Newly predicted gene regulations by TFs and sigma factors in B. subtilis.

Sigma  RG      TF       Posterior prob.  Function
SigA   sacP    CcpA     0.997            PTS sucrose-specific enzyme IIBC component
       yqgQ    CcpA     0.980          * unknown
       yrzF    CcpA     0.976            unknown
       yvfH    CcpA     0.972            unknown; similar to L-lactate permease
       yvfK    CcpA     0.967            unknown; similar to maltose/maltodextrin-binding protein
       yngI    CcpA     0.953            unknown; similar to long-chain acyl-CoA synthetase
       yngI    CcpA     0.953            unknown; similar to long-chain acyl-CoA synthetase
       ycsA    CcpA     0.947            unknown; similar to 3-isopropylmalate dehydrogenase
       opuE    CcpA     0.916          * proline transporter
       yrpD    CcpA     0.912            unknown; similar to unknown proteins from B. subtilis
       ywqC    CcpA     0.904            unknown; similar to capsular polysaccharide biosynthesis
       yvfI    CcpA     0.901            unknown; similar to transcriptional regulator (GntR family)
       glcR    ComK     0.985            transcriptional repressor involved in the expression of the phosphotransferase system
       aadK    ComK     0.971            aminoglycoside 6-adenylyltransferase
       yufL    ComK     0.946            unknown; similar to two-component sensor histidine kinase [YufM]
       yuiD    ComK     0.903            unknown; similar to unknown proteins
       glmS    CtsR     0.968            L-glutamine-D-fructose-6-phosphate amidotransferase
       yozM    DinR     0.949            unknown
       ypoP    Fur      0.958            unknown; similar to transcriptional regulator (MarR family)
       yodE    TnrA     0.938            unknown; similar to unknown proteins
SigE   spoIIP  SpoIIID  0.961          † required for dissolution of the septal cell wall
       spoIID  SpoIIID  0.960          † required for complete dissolution of the asymmetric septum
       cwlD    SpoIIID  0.930          † N-acetylmuramoyl-L-alanine amidase (germination)
       ylbJ    SpoIIID  0.910          † unknown; similar to unknown proteins
       ytvA    SpoIIID  0.873            unknown; similar to protein kinase
       yurH    SpoIIID  0.857            unknown; similar to N-carbamyl-L-amino acid amidohydrolase
       greA    SpoIIID  0.849            transcription elongation factor
       yugP    SpoIIID  0.827            unknown; similar to unknown proteins
       yjkB    SpoIIID  0.813            unknown; similar to amino acid ABC transporter
       ytxC    SpoIIID  0.754          † unknown; similar to unknown proteins
       yqfZ    SpoIIID  0.745          † unknown; similar to unknown proteins
       spoVE   SpoIIID  0.687          † required for spore cortex peptidoglycan synthesis
       yugO    SpoIIID  0.671            unknown; similar to potassium channel protein
       yqeW    SpoIIID  0.664            unknown; similar to Na+/Pi cotransporter
SigH   yvyD    Spo0A    0.667          † general stress protein under dual control of sigB and sigH
SigK   nucB    GerE     0.887            sporulation-specific extracellular nuclease
       ytkC    GerE     0.851            unknown; similar to autolytic amidase
       ywjE    GerE     0.820            unknown; similar to cardiolipin synthetase
       ypgA    GerE     0.808            unknown; similar to unknown proteins
SigL   yokK    BkdR     0.416            unknown

† The sigma factor has been determined experimentally. In all cases shown in this table, the experimentally determined sigma factor agrees with the computational prediction. All predicted regulations by TFs shown in this table are currently unknown.



septum cell wall. We found the SpoIIID binding motif at +18 and +3 for spoIIP and at +24 for spoIID. From the location of the binding sites, we infer that these genes might be negatively regulated. For the SigE-dependent asparagine synthetase gene yisO, we found three SpoIIID binding sites in the promoter region.

GerE is a transcriptional regulator required for the expression of late spore coat genes. It is predicted to regulate genes related to the membrane phospholipid cardiolipin (ywjE) and a permease (yecA). Since it is also known that GerE regulates an N-acetylmuramoyl-L-alanine amidase, we expect the prediction for ytkC, which is similar to an autolytic amidase, to be correct.

In E. coli, 17 operons are known to be regulated by SigL.12 In B. subtilis, only six operons are known to be regulated by SigL. Although one might expect currently unknown SigL-regulated genes to exist in B. subtilis, our result suggests that there are few additional SigL-regulated genes in the B. subtilis genome.

5. Discussion

Our results show that the joint prediction of TFs is a powerful way both to confirm the sigma factor prediction and to predict new members of the TF regulon. As the joint prediction of sigma factors and TFs is a supervised learning method, it can make better use of known biological facts than unsupervised methods.

This method can also detect genes regulated by two or more different sigma factors. For example, spoIVCB is initially transcribed under the direction of SigE acting in conjunction with SpoIIID. Later in sporulation, SigK-mediated transcription of spoIVCB is repressed by GerE. With our method, we can calculate separately the probability that spoIVCB is regulated by SigK with GerE and the probability that it is regulated by SigE with SpoIIID. The method can also be applied to other organisms, such as E. coli, cyanobacteria, and yeast, for which some regulatory relations are known.

Acknowledgments

We thank Seiya Imoto for his kind advice on the statistical analysis. This research was supported by a Grant-in-Aid for Scientific Research on Priority Areas and a JSPS Fellowship of the Ministry of Education, Science, Sports and Culture.



References
1. Y. Makita, M. Nakao, N. Ogasawara, and K. Nakai. DBTBS: Database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Res., 32 (Database issue):D75-7, 2004. http://dbtbs.hgc.jp.
2. M.J.L. de Hoon, Y. Makita, S. Imoto, K. Kobayashi, N. Ogasawara, K. Nakai, and S. Miyano. Predicting gene regulation by sigma factors in Bacillus subtilis from genome-wide data. Bioinformatics, 20 Suppl 1:I102-I108, 2004.
3. M.M. Babu and S.A. Teichmann. Functional determinants of transcription factors in Escherichia coli: protein families and binding sites. Trends in Genetics, 19(2):75-79, 2003.
4. M.L. Bulyk, A.M. McGuire, N. Masuda, and G.M. Church. A motif co-occurrence approach for genome-wide prediction of transcription-factor-binding sites in Escherichia coli. Genome Res., 14(2):201-8, 2004.
5. S. Sinha and M. Tompa. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res., 30(24):5549-60, 2002.
6. E. Segal and S. Sharan. A Discriminative Model for Identifying Spatial cis-Regulatory Modules. In Proc. 8th Inter. Conf. on Research in Computational Molecular Biology (RECOMB), 2004.
7. A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy. The Pfam Protein Families Database. Nucleic Acids Res., 32 (Database issue):D138-141, 2004.
8. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998.
9. B.W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
10. A.L. Sonenshein, J.A. Hoch, and R. Losick. Bacillus subtilis and its closest relatives: From genes to cells. ASM Press, Washington, DC, 2001.
11. M.J.L. de Hoon, S. Imoto, K. Kobayashi, N. Ogasawara, and S. Miyano. Predicting the operon structure of Bacillus subtilis using operon length, intergene distance, and gene expression information. PSB 2004:276-87.
12. L. Reitzer and B.L. Schneider. Metabolic context and possible physiological themes of sigma(54)-dependent genes in Escherichia coli. Microbiol Mol Biol Rev., 65(3):422-44, 2001.