Two-Stage Multi-Class Support Vector Machines to Protein Secondary Structure Prediction
M. N. Nguyen and J. C. Rajapakse, Pacific Symposium on Biocomputing 10:346-357 (2005)

TWO-STAGE MULTI-CLASS SUPPORT VECTOR MACHINES TO PROTEIN SECONDARY STRUCTURE PREDICTION

M. N. NGUYEN AND J. C. RAJAPAKSE
BioInformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore 639798
E-mail: asjagath@ntu.edu.sg

Bioinformatics techniques for protein secondary structure (PSS) prediction are mostly single-stage approaches, in the sense that they predict the secondary structures of proteins by taking into account only the contextual information in amino acid sequences. In this paper, we propose a two-stage Multi-class Support Vector Machine (MSVM) approach in which an MSVM predictor is introduced at the output of the first-stage MSVM to capture the sequential relationships among secondary structure elements. By using position specific scoring matrices generated by PSI-BLAST, the two-stage MSVM approach achieves Q3 accuracies of 78.0% and 76.3% on the RS126 dataset of 126 nonhomologous globular proteins and the CB396 dataset of 396 nonhomologous proteins, respectively, which are better than the highest scores published on both datasets to date.

1 Introduction

One of the major goals of bioinformatics is to predict the three-dimensional (3-D) structure of a protein from its amino acid sequence. Unfortunately, the protein structure prediction problem is a combinatorial optimization problem that has so far eluded solution because of the exponential number of potential conformations. One of the current approaches is to first predict the protein secondary structure (PSS), which is a linear representation of the full knowledge of the 3-D structure, and thereafter predict the 3-D structure 1,2. The usual goal of secondary structure prediction is to classify a pattern of residues in an amino acid sequence into a pattern of protein secondary structure elements: α-helix (H), β-strand (E), or coil (C, the remaining type).

Many computational techniques have been proposed in the literature to solve the PSS prediction problem; they broadly fall into three categories: (1) statistical methods, (2) neural network approaches, and (3) nearest neighbor methods. The statistical methods are mostly based on likelihood techniques 3,4,5. Neural networks use residues in a local neighborhood, or window, as inputs to predict the secondary structure at a particular location of an amino acid sequence by finding an appropriate non-linear mapping 6,7,8,9. The nearest neighbor approaches often use k-nearest neighbor techniques 10,11. Consensus approaches that combine different classifiers, in parallel, into a single superior predictor have also been proposed for PSS prediction 12,13.

Support Vector Machines (SVMs) have been applied to PSS prediction earlier 14,15; one drawback of these approaches is that they do not take into account the sequential relationship among the protein secondary structure elements. Additionally, these SVM methods construct a multi-class classifier only by combining several binary classifiers. Most existing secondary structure prediction techniques are single-stage approaches, which are unable to find complex relations (correlations) among structural elements in the sequence. This could be improved by incorporating the interactions or contextual information among the elements of the sequences of secondary structures.
We argue that the present single-stage MSVM approach can be enhanced further by augmenting it with another prediction scheme at its output, and we propose to use an MSVM as the second stage. By using the position specific scoring matrices generated by PSI-BLAST, the two-stage MSVM approach achieves Q3 accuracies of 78.0% and 76.3% on the RS126 and CB396 datasets, respectively, based on a seven-fold cross validation.

2 Two-Stage MSVM Approach

In the two-stage MSVM approach, we use two MSVMs in cascade to predict the secondary structures of residues in amino acid sequences. Let us denote the given amino acid sequence by r = (r_1, r_2, ..., r_n), where r_i ∈ R and R is the set of 20 amino acid residues, and let t = (t_1, t_2, ..., t_n) denote the corresponding secondary structure sequence, where t_i ∈ T and T = {H, E, C}; n is the length of the sequence. The prediction of the PSS sequence t from an amino acid sequence r is the problem of finding the optimal mapping from the space R^n to the space T^n.

Let v_i be the vector representing the 21-dimensional coding of the residue r_i, where 20 units are the values from the raw matrices of PSI-BLAST profiles, scaled to the range [0, 1], and the remaining unit is used for padding, to indicate that the window overlaps an end of the sequence 9. Let the input pattern to the MSVM approach at site i be r_i = (v_{i-h_1^1}, v_{i-h_1^1+1}, ..., v_i, ..., v_{i+h_2^1}), where v_i denotes the center element, h_1^1 and h_2^1 denote the widths of the window on the two sides, and w_1 = h_1^1 + h_2^1 + 1 is the neighborhood size around the element i.
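The windowed profile coding just described can be made concrete with a short sketch. This is a minimal illustration and not the authors' code; it assumes the PSI-BLAST profile values have already been scaled to [0, 1], and the array and function names are hypothetical.

```python
import numpy as np

def first_stage_inputs(pssm, h1=7, h2=7):
    """Build the input pattern r_i for every site i.

    pssm : (n, 20) array of PSI-BLAST profile values, assumed scaled to [0, 1].
    Each residue is coded by 21 units (20 profile values + 1 padding unit), and a
    window of w1 = h1 + h2 + 1 residues is concatenated around site i.
    """
    n = pssm.shape[0]
    coded = np.hstack([pssm, np.zeros((n, 1))])   # 21-dimensional coding per residue
    pad = np.zeros(21)
    pad[20] = 1.0                                  # padding unit marks window overhang
    patterns = []
    for i in range(n):
        window = [coded[j] if 0 <= j < n else pad for j in range(i - h1, i + h2 + 1)]
        patterns.append(np.concatenate(window))    # length 21 * w1 (= 315 for w1 = 15)
    return np.array(patterns)

# Example: a random profile for a 30-residue chain gives 30 patterns of length 315.
X1 = first_stage_inputs(np.random.rand(30, 20))
```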
2.1 First Stage

An MSVM scheme has been proposed by Crammer and Singer 16. For PSS prediction, this method constructs three discriminant functions, all obtained by solving one single optimization problem, which can be formulated as follows:

Minimize
  (1/2) Σ_{k∈T} (w_1^k)^T w_1^k + β_1 Σ_{j=1}^N ξ_j   (1)
subject to the constraints
  w_1^{t_j} φ_1(r_j) - w_1^k φ_1(r_j) ≥ c_j^k - ξ_j,
where t_j is the secondary structural type of the residue corresponding to the training vector r_j, j = 1, 2, ..., N, and c_j^k = 0 if t_j = k and c_j^k = 1 if t_j ≠ k.

We find the minimum of the above formulation by solving the following quadratic programming problem 16:

  max  - (1/2) Σ_{j=1}^N Σ_{i=1}^N K_1(r_j, r_i) Σ_{k∈T} α_j^k α_i^k - Σ_{j=1}^N Σ_{k∈T} c_j^k α_j^k   (2)

such that

  Σ_{k∈T} α_j^k = 0  and  α_j^k ≤ 0 if t_j ≠ k,  α_j^k ≤ β_1 if t_j = k   (3)

where K_1(r_i, r_j) = φ_1(r_i)^T φ_1(r_j) denotes the kernel function and w_1^k = Σ_{j=1}^N α_j^k φ_1(r_j).

Once the parameters α_j^k are obtained from the optimization, the resulting discriminant function f_1^k of a test input vector r_i is given by

  f_1^k(r_i) = Σ_{j=1}^N α_j^k K_1(r_i, r_j) = w_1^k φ_1(r_i)   (4)

Let f_1(r_i) = arg max_{k∈T} f_1^k(r_i). In the single-stage MSVM method, the secondary structural type t_i corresponding to the residue at site i, r_i, is determined by

  t_i = f_1(r_i)   (5)

The function f_1 discriminates the type of PSS based on the features, or interactions, among the residues in the input pattern. With optimal parameters, the MSVM attempts to minimize the generalization error of the prediction. If the training and testing patterns are drawn independently and identically according to a probability distribution P, then the generalization error errP(f_1) is given by errP(f_1) = P{(r, t) : f_1(r) ≠ t; (r, t) ∈ Ω_1 × {H, E, C}}, where Ω_1 denotes the set of input patterns seen by the MSVM during both the training and testing phases, and t denotes the desired output for the input pattern r.

2.2 Second Stage

We extend the single-stage MSVM technique by cascading another MSVM at the output of the present single-stage approach to improve the accuracy of prediction. The motivation is that the secondary structure at a particular position of the sequence depends on the structures of the rest of the sequence; for example, strands span at least three adjacent residues and helices consist of at least four consecutive residues 6. This intrinsic relation cannot be captured by single-stage approaches alone. Therefore, another layer of classifiers that minimizes the generalization error of the output of the single-stage method, by incorporating the sequential relationship among the protein structure elements, improves the prediction accuracy.

Consider a window of size w_2 at a site of the output sequence of the first stage; the input vector at position i is d_i = (d_{i-h_1^2}, d_{i-h_1^2+1}, ..., d_i, ..., d_{i+h_2^2}), where w_2 = 3(h_1^2 + h_2^2 + 1), d_i^k = 1/(1 + e^{-f_1^k(r_i)}), and f_1^k denotes the discriminant function of the first stage. The application of the logistic sigmoid function to the outputs of the first stage has the advantage of constraining the input units of the second stage to the (0, 1) interval, which is similar to the range of the input units of the first stage. This choice makes it easier to determine the parameters for optimal performance.

The MSVM converts the input patterns, usually linearly inseparable, to a higher dimensional space by using the mapping φ_2 with a kernel function K_2(d_i, d_j) = φ_2(d_i)^T φ_2(d_j). As in the first stage, the hidden outputs in the higher dimensional space are linearly combined by the weight vectors w_2^k to obtain the prediction output. Let the training set of exemplars for the second-stage MSVM be Ω_2^train = {d_j : j = 1, ..., N}. The vectors w_2^k are obtained by solving the corresponding convex quadratic programming problem over all the patterns seen in the training phase 16.

Let f_2(d_i) = arg max_{k∈T} f_2^k(d_i). The secondary structural type t_i corresponding to the residue r_i is determined by

  t_i = f_2(d_i)   (6)

If the set of input patterns for the second-stage MSVM in both training and testing phases is denoted by Ω_2, the generalization error of the two-stage MSVM approach, errP(f_2), is given by errP(f_2) = P{(d, t) : f_2(d) ≠ t; (d, t) ∈ Ω_2 × {H, E, C}}. If the input pattern d corresponds to a site i, then

  d_i = ((1 + e^{-f_1^k(r_{i-h_1^2})})^{-1}, (1 + e^{-f_1^k(r_{i-h_1^2+1})})^{-1}, ..., (1 + e^{-f_1^k(r_i)})^{-1}, ..., (1 + e^{-f_1^k(r_{i+h_2^2})})^{-1}),  k ∈ T.

That is, the second stage takes the influences of the PSS values of residues in the neighborhood into account in the prediction. It can easily be shown that there exists a function f_2 such that errP(f_2) = errP(f_1) when h_1^2 = h_2^2 = 0.
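The construction of the second-stage input vectors d_i can be sketched in the same style. This is a minimal illustration under the assumption that the three first-stage discriminant values f_1^k(r_i) are available as an (n, 3) array; the padding value used for positions beyond the sequence ends is an assumption, since the paper does not state it.

```python
import numpy as np

def second_stage_inputs(f1_scores, h1=2, h2=4):
    """Build the second-stage pattern d_i for every site i.

    f1_scores : (n, 3) array of first-stage discriminant values f_1^k(r_i), k in {H, E, C}.
    Each value is squashed by the logistic sigmoid into (0, 1), and a window of
    h1 + h2 + 1 sites is concatenated, giving w2 = 3 * (h1 + h2 + 1) units per site.
    """
    n = f1_scores.shape[0]
    d = 1.0 / (1.0 + np.exp(-f1_scores))            # logistic sigmoid, per class
    pad = np.zeros(3)                                # assumed coding for window overhang
    patterns = []
    for i in range(n):
        window = [d[j] if 0 <= j < n else pad for j in range(i - h1, i + h2 + 1)]
        patterns.append(np.concatenate(window))      # length 3 * (h1 + h2 + 1) (= 21 here)
    return np.array(patterns)
```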
3 Minimal Generalization Error

In this section, we find a function f_2 that minimizes the generalization error errP(f_2) when another MSVM predictor is connected at the output of the existing predictor. The optimal function f_2, providing the smallest errP(f_2), ensures errP(f_2) ≤ errP(f_1). However, finding the global minimum of the generalization error errP(f_2) is not a trivial problem because the form of the probability distribution P is unknown.

We can instead consider the probably approximately correct (pac) bound ε(N, δ) of the generalization error, satisfying

  P{Ω_2^train : ∃ f_2 such that errP(f_2) > ε(N, δ)} < δ.

This is equivalent to asserting that, with probability greater than 1 - δ over the training set Ω_2^train, the generalization error of f_2 is bounded by

  errP(f_2) ≤ ε(N, δ).

In the following proofs, we assume that both the training set Ω_2^train ⊆ Ω_2 and the testing set Ω_2^test ⊆ Ω_2 for the second stage contain N patterns.

For the MSVM technique at the second stage, let w_2^{k/l} be the weight vector w_2^k - w_2^l. Therefore, the secondary structure of a residue r is not l if w_2^{k/l} φ_2(d) > 0, and not k otherwise.

Theorem 3.1. 17 Let F = {f_2^{k/l} : d ↦ w_2^{k/l} φ_2(d); ||w_2^{k/l}|| ≤ 1; d ∈ Ω_2; k, l ∈ T} be restricted to points in a ball of m dimensions of radius R about the origin, that is, φ_2(d) ∈ R^m and ||φ_2(d)|| ≤ R. Then the fat-shattering dimension is bounded by

  fat_F(γ) ≤ (R/γ)^2.

Theorem 3.2. 18 Let G be a decision directed acyclic graph on the 3 classes H, E, and C, with 3 decision nodes, H/E, E/C, and C/H, with margins γ_2^{k/l} and discriminant functions f_2^{k/l} ∈ F at the decision nodes k/l, where γ_2^{k/l} = min_{d∈Ω_2^train} |w_2^{k/l} φ_2(d)| / ||w_2^{k/l}||, k and l ∈ T. Then the following probability is bounded:

  P{Ω_2^train, Ω_2^test : ∃ G such that err_{Ω_2^train}(G) = 0; err_{Ω_2^test}(G) > ε(N, δ)} < δ,

where ε(N, δ) = (1/N) (Σ_{k,l∈T} a^{k/l} log(4eN / a^{k/l}) log(4N) + log(2(2N)^3 / δ)), a^{k/l} = fat_F(γ_2^{k/l} / 8), and err_{Ω_2^train}(G) and err_{Ω_2^test}(G) are the fractions of points misclassified by G on the training set Ω_2^train and a random testing set Ω_2^test, respectively.

Theorem 3.3. Let G be a decision directed acyclic graph with discriminant functions f_2^{k/l} ∈ F at the nodes k/l, k and l ∈ T. Then the generalization error of f_2, where f_2(d) = arg max_{k∈T} f_2^k(d), under the probability distribution P is

  errP(f_2) = errP(G).

Proof. This is easily proved: for an arbitrary example d ∈ Ω_2, f_2(d) equals the secondary structural type of d predicted by the decision directed acyclic graph G, because each node function is the difference f_2^k(d) - f_2^l(d) and so the class with the largest discriminant value is never eliminated at any node.
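The equality in Theorem 3.3 can be checked mechanically: because each node function f_2^{k/l} is the difference of two per-class scores, the class with the largest score survives every elimination. The sketch below only illustrates this argument with hypothetical score values; it is not part of the method itself.

```python
def ddag_predict(scores):
    """Evaluate a 3-class decision DAG whose node k/l uses f^{k/l} = f^k - f^l."""
    remaining = ['H', 'E', 'C']
    while len(remaining) > 1:
        k, l = remaining[0], remaining[-1]
        if scores[k] - scores[l] > 0:
            remaining.pop()        # f^{k/l}(d) > 0: the structure is not l
        else:
            remaining.pop(0)       # otherwise: the structure is not k
    return remaining[0]

def argmax_predict(scores):
    """f_2(d) = arg max_k f_2^k(d)."""
    return max(scores, key=scores.get)

example = {'H': 0.8, 'E': -0.1, 'C': 0.3}
assert ddag_predict(example) == argmax_predict(example) == 'H'
```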
Theorem 3.4. 19 Let errP(G) be the generalization error of G at the output of the first stage. Then

  P{Ω_2^train : ∃ G such that err_{Ω_2^train}(G) = 0 and errP(G) > 2ε(N, δ)}
    ≤ 2 P{Ω_2^train, Ω_2^test : ∃ G such that err_{Ω_2^train}(G) = 0 and err_{Ω_2^test}(G) > ε(N, δ)}.

Theorem 3.5. Suppose we classify a random set of N examples in the training set Ω_2^train using the MSVM method at the second stage with optimal values of the weight vectors w_2^k, k ∈ T. Then the generalization error errP(f_2), with probability greater than 1 - δ, is bounded by

  ε_1(N, δ) = (1/N) (390 R^2 Σ_{k∈T} ||w_2^k||^2 log(4eN) log(4N) + 2 log(2(2N)^3 / δ)).

Proof. Since the margin γ_2^{k/l} is the minimum value of the distances from the instances labeled k or l to the hyperplane w_2^{k/l} φ_2(d) = 0 at the second stage, we have

  γ_2^{k/l} = min_{d∈Ω_2^train} |(w_2^k - w_2^l) φ_2(d)| / ||w_2^k - w_2^l|| = min_{d∈Ω_2^train} |w_2^{k/l} φ_2(d)| / ||w_2^{k/l}|| ≥ 1 / ||w_2^k - w_2^l||.

Therefore, the quantity M = Σ_{k,l∈T} (1/γ_2^{k/l})^2 ≤ Σ_{k,l∈T} ||w_2^k - w_2^l||^2 is bounded in terms of Σ_{k∈T} ||w_2^k||^2; solving the optimization problem at the second stage results in the minimization of the quantity M, which is directly related to the margins of the classifier. Plugging in the binary classifiers induced by w_2^{k/l} yields a stepwise method for calculating the maximum among {f_2^k(d) = w_2^k φ_2(d)} that is identical to the process of finding the secondary structure in the decision directed acyclic graph G. Let us apply the result of Theorem (3.2) to G with the specified margins γ_2^{k/l} at each node to bound the generalization error errP(G).

Since the number of decision nodes is 3 and the largest allowed value of a^{k/l} is N, the number of all possible patterns of a^{k/l}'s over the decision nodes is bounded by N^3. We let δ_i = δ/N^3 so that Σ_{i=1}^{N^3} δ_i = δ. By choosing

  ε(N, δ_i/2) = (1/N) (195 R^2 Σ_{k∈T} ||w_2^k||^2 log(4eN) log(4N) + log(4(2N)^3 / δ_i))
    ≥ (1/N) (Σ_{k,l∈T} a^{k/l} log(4eN / a^{k/l}) log(4N) + log(4(2N)^3 / δ_i)),

which follows from Theorem (3.1), since a^{k/l} = fat_F(γ_2^{k/l}/8) ≤ (8R/γ_2^{k/l})^2 ≤ (8R)^2 ||w_2^k - w_2^l||^2, Theorem (3.2) ensures that the probability of any of these statements failing to hold is less than δ/2. By using the result of Theorem (3.4), the probability P{Ω_2^train : ∃ G; err_{Ω_2^train}(G) = 0; errP(G) > 2ε(N, δ_i/2)} is bounded by δ. From Theorem (3.3), the generalization error errP(f_2), with probability greater than 1 - δ, is therefore bounded by 2ε(N, δ_i/2), which gives the bound ε_1(N, δ) stated in the theorem.

Minimizing the quantity Σ_{k∈T} ||w_2^k||^2, that is, maximizing the value of the margins γ_2^{k/l}, results in the minimization of the generalization error at the output of the single-stage MSVM method. The minimization of Σ_{k∈T} ||w_2^k||^2 is done by solving the convex quadratic programming problem of the MSVM. As shown by the result of Theorem (3.5), two-stage MSVMs are sufficient for PSS prediction because they minimize both the generalization error errP(f_1), based on the interactions among amino acids, and the generalization error errP(f_2) of the output of the single-stage MSVM, by capturing the contextual information of the secondary structure.

4 Experiments and Results

4.1 Dataset 1 (RS126)

The set of 126 nonhomologous globular protein chains used in the experiments of Rost and Sander 6, referred to as the RS126 set, was used to evaluate the accuracy of the classifiers. The RS126 set is available at http://www.compbio.dundee.ac.uk/www-jpred/data/pred_res/126_set.html. The single-stage and two-stage MSVM approaches were implemented with the position specific scoring matrices generated by PSI-BLAST and tested on the dataset, using a seven-fold cross validation to estimate the prediction accuracy.

4.2 Dataset 2 (CB396)

The second dataset, generated by Cuff and Barton 12 at the European Bioinformatics Institute (EBI), consists of 396 nonhomologous protein chains and is referred to as the CB396 set. Cuff and Barton used a rigorous method, based on the computation of a similarity score, to derive their nonredundant dataset. The CB396 set is available at http://www.compbio.dundee.ac.uk/www-jpred/data/. The single-stage and two-stage MSVM approaches were used to predict PSS based on the position specific scoring matrices generated by PSI-BLAST.

4.3 Protein secondary structure definition

The secondary structure states for each structure in the training and testing sets were assigned by DSSP 20, the most widely used secondary structure definition. The eight states, H (α-helix), G (3_10-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and - (rest), were reduced to three classes, α-helix (H), β-strand (E), and coil (C), as follows: H and G to H; E and B to E; all other states to C.

4.4 Prediction accuracy assessment

We have used several measures to evaluate the prediction accuracy. The Q3 accuracy indicates the percentage of correctly predicted residues over the three states of secondary structure 12. The QH, QE, and QC accuracies represent the percentage of correctly predicted residues of each type of secondary structure 12. The segment overlap measure (Sov) assesses accuracy by counting predicted and observed segments and measuring their overlap 21.
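A short sketch of the residue-level measures of Section 4.4 follows (Sov is omitted, since its segment bookkeeping is lengthier); the function name is ours and the example labels are illustrative only.

```python
def q3_scores(observed, predicted):
    """Compute Q3 and the per-state accuracies QH, QE, QC (in percent).

    observed, predicted : equal-length sequences over {'H', 'E', 'C'}.
    """
    assert len(observed) == len(predicted)
    correct = sum(o == p for o, p in zip(observed, predicted))
    q3 = 100.0 * correct / len(observed)
    per_state = {}
    for state in 'HEC':
        idx = [i for i, o in enumerate(observed) if o == state]
        hits = sum(predicted[i] == state for i in idx)
        per_state['Q' + state] = 100.0 * hits / len(idx) if idx else None
    return q3, per_state

# Example: 6 of 8 residues correct gives Q3 = 75.0.
q3, per_state = q3_scores("HHHEECCC", "HHHEECHE")
```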
4.5 Results

For the MSVM classifier at the first stage, a window size of 15 amino acid residues (h_1^1 = h_2^1 = 7) was used as input, which gave the optimal result in the [7, 21] range. At the second stage, a window size of 21 (h_1^2 = 2 and h_2^2 = 4) gave the optimal accuracy in the [9, 24] range. The kernel selected here was the radial basis function K(x, y) = e^{-γ||x-y||^2}, with the parameters γ = 0.05, β_1 = 0.5 for the MSVM at the first stage and γ = 0.01, β_2 = 0.5 for the two-stage MSVMs, determined empirically for optimal performance in the [0.01, 0.5] and [0.1, 2] ranges, respectively. We used the BSVM library 22, which leads to faster convergence for large optimization problems, to implement the multi-class technique.

In Tables 1 and 2, the results of the Zpred, NNSSP, PREDATOR, DSC, and Jpred methods on the RS126 and CB396 datasets were obtained from Cuff and Barton 12. The results of the refined neural network proposed by Riis and Krogh, the SVM method of Hua and Sun, the dual-layer SVM of Guo et al., the BRNN, and the PHD method were obtained from their papers 6,7,8,14,15.

Table 1 shows the performance of the different secondary structure predictors and the two-stage MSVM approach on the RS126 set. The best algorithm was found to be the cascade of two MSVMs with the PSI-BLAST profiles, which achieved a Q3 accuracy of 78.0%. Compared to the two cascaded multi-layer perceptron networks of the PHD method proposed by Rost and Sander 6, a substantial gain of 7.2% in Q3 accuracy was observed. Compared to the SVM method of Hua and Sun 14, the two-stage MSVM method obtained a 6.8% higher Q3 score.

Table 1. Comparison of performances of single-stage and two-stage MSVM approaches in PSS prediction on the RS126 dataset. The notation "-" indicates that the result could not be obtained from the papers.

Method                            Q3     QH     QE     QC     Sov
Zvelebil et al. (Zpred) 23        66.7   -      -      -      -
Rost and Sander (PHD) 6           70.8   72.0   66.0   72.0   -
Salamov et al. (NNSSP) 10         72.7   -      -      -      -
Frishman (PREDATOR) 24            70.3   -      -      -      -
King and Sternberg (DSC) 25       71.1   -      -      -      -
Riis and Krogh 7                  71.3   68.9   57.0   79.2   -
Baldi et al. (BRNN) 8             72.0   -      -      -      -
Cuff and Barton (Jpred) 12        74.8   -      -      -      -
Hua and Sun (SVM) 14              71.2   73.0   58.0   75.0   -
Single-Stage MSVM                 76.2   69.6   63.5   83.1   68.8
Two-Stage MSVMs                   78.0   73.1   65.7   83.8   72.6

Table 2 shows the performance of the two-stage MSVMs on the CB396 dataset based on multiple sequence alignments and PSI-BLAST profiles. The two-stage MSVMs with PSI-BLAST profiles achieved a Q3 accuracy of 76.3%, which is the highest score on the CB396 set to date. Compared to the most recent method of Guo et al. using a dual-layer SVM 15, the two-stage MSVM method obtained a 2.3% higher Q3 score. As shown, the prediction accuracy of the two-stage MSVMs exceeded that of the single-stage MSVM method for PSS prediction.

Table 2. Comparison of performances of single-stage and two-stage MSVM approaches in PSS prediction on the CB396 dataset with PSI-BLAST profiles.

Method                            Q3     QH     QE     QC     Sov
Zvelebil et al. (Zpred) 23        64.8   -      -      -      -
Salamov et al. (NNSSP) 10         71.4   -      -      -      -
Frishman (PREDATOR) 24            68.6   -      -      -      -
King and Sternberg (DSC) 25       68.4   -      -      -      -
Guo et al. (Dual-Layer SVM) 15    74.0   79.3   69.3   72.0   -
Single-Stage MSVM                 74.5   68.5   62.0   82.4   69.5
Two-Stage MSVMs                   76.3   70.6   63.4   83.4   73.2

In order to avoid gross overestimates of accuracy, we performed another test on the CB396 dataset: we selected the best parameters within each cross-validation step by dividing the training data into one part for SVM learning and another for selection of the window sizes and the kernel and regularization parameters. The accuracies obtained with this evaluation approach were not significantly different from those shown in Table 2 (76.5% Q3 and 72.9% Sov). These results confirmed that the selected window sizes and the kernel and regularization parameters in both learning stages were not biased by the chosen test data.
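The additional evaluation described above, in which the window sizes and the kernel and regularization parameters are chosen inside each cross-validation fold, can be sketched as follows. The trainer and scoring callbacks (train_msvm, accuracy) stand in for the BSVM calls and are hypothetical; the grids shown simply span the ranges quoted in the text.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows of X and Y."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def select_parameters(train_X, train_t, train_msvm, accuracy,
                      gammas=(0.01, 0.05, 0.1, 0.5), betas=(0.1, 0.5, 1.0, 2.0)):
    """Pick (gamma, beta) using only a held-out part of the training fold.

    train_X, train_t : numpy arrays with the training patterns and labels.
    train_msvm(X, t, gamma, beta) -> model and accuracy(model, X, t) -> float
    are placeholders for the actual MSVM training and scoring routines.
    """
    n = len(train_X)
    fit_idx, val_idx = np.arange(0, 4 * n // 5), np.arange(4 * n // 5, n)
    best, best_acc = None, -1.0
    for gamma in gammas:
        for beta in betas:
            model = train_msvm(train_X[fit_idx], train_t[fit_idx], gamma, beta)
            acc = accuracy(model, train_X[val_idx], train_t[val_idx])
            if acc > best_acc:
                best, best_acc = (gamma, beta), acc
    return best
```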
5 Discussion and Conclusion

We have introduced a two-stage MSVM approach to PSS prediction. With two-stage approaches, the accuracy of prediction is improved because the secondary structure at a particular position of a sequence depends not only on the amino acid residue at that location but also on the structural formation of the rest of the sequence. A two-stage approach was first introduced in the PHD method, which uses two MLPs in cascade for PSS prediction. MLPs are not optimal for this purpose because they cannot generalize the prediction well to unseen patterns. The outputs of single-stage methods have also been combined in parallel into a single superior predictor to improve upon the individual predictions 12,13. However, these methods depend on the performances of the individual models and do not overcome the limitations of single-stage methods.

As shown, the MSVM is an optimal classifier for the second stage because it minimizes not only the empirical risk on known sequences but also the actual risk on unknown sequences. Additionally, two stages were proven to be sufficient to find an optimal classifier for PSS prediction, as the second-stage MSVM minimizes the generalization error of the output of the single stage by solving the optimization problem at the second stage. Furthermore, we have compared two two-stage SVM techniques for the PSS problem: one based on binary classifications, as used by Guo et al. 15, and our approach, which solves the multi-class problem in one single optimization problem. We found that the two-stage MSVM is more suitable for protein secondary structure prediction because of its capacity for faster convergence on the large and complex training sets of the PSS problem and because it solves the optimization problem in one step.

As proved analytically, two-stage MSVMs have the best generalization ability for PSS prediction by minimizing the generalization error made by the first-stage MSVM. However, this property could not be compared directly with the other techniques, as they use seven-fold cross-validation for evaluation, which does not test true generalization capabilities. Further, our comparisons with the other techniques are not complete due to the inaccessibility of previously used data and programs. Also, the kernels and SVM parameters were determined empirically, as there are no simple methods to find them otherwise. Investigation of the two-stage MSVM parameters could further enhance the accuracies.

References

1. P. Clote and R. Backofen, Computational Molecular Biology, (Wiley and Sons, Ltd., Chichester, 2000).
2. D.W. Mount, Bioinformatics: Sequence and Genome Analysis, (Cold Spring Harbor Laboratory Press, 2001).
3. J. Garnier et al, Journal of Molecular Biology 120, 97 (1978).
4. J.F. Gibrat et al, Journal of Molecular Biology 198, 425 (1987).
5. J. Garnier et al, Methods Enzymol 266, 541 (1996).
6. B. Rost and C. Sander, Journal of Molecular Biology 232, 584 (1993).
7. S.K. Riis and A. Krogh, Journal of Computational Biology 3, 163 (1996).
8. P. Baldi et al, Bioinformatics 15, 937 (1999).
9. D.T. Jones, Journal of Molecular Biology 292, 195 (1999).
10. A.A. Salamov and V.V. Solovyev, Journal of Molecular Biology 247, 11 (1995).
11. A.A. Salamov and V.V. Solovyev, Journal of Molecular Biology 268, 31 (1997).
12. J.A. Cuff and G.J. Barton, Proteins 34, 508 (1999).
13. M. Ouali and R. King, Protein Science 9, 1162 (1999).
14. S. Hua and Z. Sun, Journal of Molecular Biology 308, 397 (2001).
15. J. Guo et al, Proteins 54, 738 (2004).
16. K. Crammer and Y. Singer, Computational Learning Theory, 35 (2000).
17. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, (Cambridge University Press, 2000).
18. J.C. Platt et al, Proc. Advances in Neural Information Processing Systems 12, (MIT Press, Cambridge, MA), 547 (2000).
19. V. Vapnik, Estimation of Dependences Based on Empirical Data, (Springer-Verlag, New York, 1982).
20. W. Kabsch and C. Sander, Biopolymers 22, 2577 (1983).
21. A. Zemla et al, Proteins: Structure, Function, and Genetics 34, 220 (1999).
22. C.W. Hsu and C.J. Lin, IEEE Transactions on Neural Networks 13, 415 (2002).
23. M.J.J.M. Zvelebil et al, Journal of Molecular Biology 195, 957 (1987).
24. D. Frishman and P. Argos, Proteins: Structure, Function, and Genetics 23, 566 (1995).
25. R.D. King and M.J.E. Sternberg, Protein Science 5, 2298 (1996).