Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader COMPARISONS OF PROTEIN FAMILY DYNAMICS A. J. RADER and JOSHUA T. HARRELL Department of Physics, Indiana University Purdue University Indianapolis, 402 N. Blackford St., LD156D Indianapolis, IN 46219, USA E-mail: ajrader@iupui.edu www.physics.iupui.edu/ajrader/ Similarities between different protein structures have led to the identification of protein families based upon some measure of structural similarity. Using these similarities one can classify proteins into structural families and higher-order groupings from which inferred function can b e transferred. When taken for a large number of proteins, these schemes p oint to evolutionary relationships between organisms. We prop ose a novel classification scheme based up on the structurally-inspired dynamics of each protein. This classification scheme has the advantages of b eing quantitative, automatically assigned, and able to also make distinctions within protein families. Results are presented for five protein families illustrating the correct identification of previously un-classified structures and sources of intrafamily distinctions. Keywords : GNM; protein dynamics; conformations; families. 1. Intro duction The comparison of proteins from different organisms relies heavily upon the paradigm that sequence encodes for protein structure which in turn determines protein function. Often protein function is not a easily definable quantity1 making some associations unreliable. More directly, proteins can be grouped into families based upon shared structural characteristics since structural changes are generally more conservative than sequence changes. Two widely used structurally-based classification systems are SCOP (Structural Classification of Proteins)2 and CATH (Class, Architecture, Topology, and Homologous superfamily).3 Both classification systems require some manual intervention and depend upon the additional step of defining the domains within a protein structure. The assignment of such domains is not a unique process and adds another layer of complication to such classifications. Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader We contend that structurally-inspired information, specifically protein dynamics, are important for making the correct functional assignment of proteins.4 Information regarding dynamics is absent in both of the two structural classification systems mentioned above. In this paper we present an automatic assignment criteria for grouping protein families based upon their entire structure rather than the added step of domain identification. Thus although the analysis presented here is similar to the SCOP and CATH classifications it differs by considering the dynamics of complete protein structures. The Gaussian Network Model (GNM)5,6 provies an efficient calculation of protein dynamics by representing the protein structure by an elastic network of residues. This creates a coarse-grained representation of the structure and its dynamics. As a result, comparison of low frequency (global) modes of motion from GNM to proteins with a similar Rossmann-fold displayed a striking similarity.7 A related study applied to the globin family observed a similar trend that similar protein structures exhibited similar dynamics.8 Preliminary analysis at the superfamily level found that regions with high mobility also demonstrated high levels of evolutionary fluctuations.9 In this work we quantify the degree of similarity in dynamics with the aim of exploring how these dynamics play a role in defining protein function. We generalize these comparisons to families of proteins based upon the SCOP classification schemes, allowing a new automatic classification of each protein in terms of their GNM-defined dynamic similarities. 2. Metho ds 2.1. Protein Family Selection The Protein Data Bank (PDB)10 was used in conjunction with the iGNM database11 to select the families of proteins used in this study. The iGNM database is an online resource of pre-computed GNM analysis for all structures deposited in the PDB. The low frequency eigenvectors, termed slow modes, were used in the analysis presented here because these slow modes have previously been associated with global motions and likely (large-scale) functional motions.12 The SCOP (v 1.71) classification was used as the inital basis for familial groupings. Five families were chosen from the SCOP database site such that the proteins in these families represented different functions, architectures and number of residues. A family was considered if it had more than 25 member structures with the same number of residues. Since the number of Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader structurally resolved residues does not always match the protein sequence length, the list of SCOP family member PDB structures was checked against the iGNM database to determine the number of nodes (residues) present in each structure. Only structures with the same number of residues were selected for use in this study. Requiring the proteins to have the same length allowed a direct comparison of them against each other using the dot product of their modes of motion (see below). Once this number of residues was determined, an additional set of structures was obtained by retrieving all structures present in the iGNM with this number of residues. Thus each protein family studied had a set of structures already deemed part of the (SCOP) family and a second set of (non-family) structures each having the same length as those in the family. The analysis was carried out for the five SCOP families listed in Table 1. Table 1. Abbreviation: Name FABP: Fatty acid binding protein-like Glob: Globins CytC: mono domain cytochrome c DHFR: Dihydrofolate reductases PoBP: Phosphate binding protein-like List of protein families tested. Residue count 131 153 108 159 517 Family 30 84 28 27 30 Non-family 42 58 70 46 6 2.2. GNM The GNM5,6 treats the structure as an elastic network model where amino acid residues within a cutoff distance, rc are connected by springs with a uniform force constant. In this model, the C atom positions of each residue serve as the nodes. Denoting Rij as the distance between residues i and j , a Kirchhoff or connectivity matrix, , is constructed such that off-diagonal elements are -1 when Rij rc and 0 when Rij > rc ; while the diagonal elements are the sum of off-diagonal elements. The normal modes characterizing the motion of this network are found by eigenvalue decomposition of the Kirchhoff matrix according to Eq. (1) = UUT (1) where U is a matrix composed of eigenvectors, ui (1 i N ), and is the diagonal matrix of the eigenvalues i . Despite being a purely topological model, GNM and related models have been widely used to characterize Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader functionally relevant motions in terms of a few low frequency (small ) modes.4 For each of the structures in the protein families, the 20 slowest mode (lowest frequency) eigenvectors were downloaded from the iGNM database. The correlation between residue fluctuations (Ri ) due to a specific mode is calculated according to Eq. (2). [Ri · Rj ]k = 3kb T -1 k [uk ]i [uk ]j (2) Here [uk ]i is the ith element of the eigenvector uk , k is the eigenvalue, T is the absolute temperature and kb is the Boltzmann constant. a) 0.08 0.07 b) 0.20 0.16 Lowest mode shape, u12 0.06 0.05 0.04 0.03 0.02 0.12 0.08 0.04 0.01 0 0 20 40 60 80 100 120 140 0 0 20 40 60 80 100 120 140 Residue number Residue number Fig. 1. Relationships between mo de shapes. a) The lowest mode shap e [u1 ]2 is plotted for each family structure with gray dashed lines. The thick black line highlights the average mo de shap e. b) The lowest mo de shap e for each non-family structure is plotted in gray dashed lines and contrasted with the average family mode shap e in black. Figure 1 illustrates the lowest mode shape [u1 ]2 plotted against residue number. By inverting Eq. (2) one can see that this is proportional to the lowest mode self-fluctuation or mobility. In Fig. 1a) the results are plotted for each of the family protein structures as gray dashed lines. One can see that each structure has a slightly different degree of mobility in this plot. Clearly there are some regions of qualitative agreement such as the coincidence of minima, highlighted by the average curve in black. It has previously been demonstrated that these minima serve as hinge sites that correlate with binding and/or catalytic sites.12 Complimenting this previous insight, we observe that the largest variations occur not in these hinge sites but in the mobile regions between such hinges. As shown here, the general mode shape illustrates the ability of GNM to cluster groups of structures by their dynamics. Additionally the variations in the degree of modal mobility Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader points to the ability of GNM to differentiate between similar structures. Figure 1b) shows the same average slow mode mobility in black compared results for each of the non-family structures (dashed lines). Unlike the case of structures from a family, there is no observable trend for mobility among non-family structures. Beyond this qualitative observation, we desire to develop a more quantitative comparison for a large number of structures. 2.3. Quantitative Mode Comparisons Letting x and y represent the th and th eigenvectors of proteins x and y respectively, one can define the dot product between eigenvectors of different proteins according to Eq. (3). xy P = x · y = iN =1 xi y i (3) Using the fact that these are eigenvectors, we ignore elements of Eq. (3) corresponding to the same protein. Thus if k represents the number of eigenvectors being considered we define a (k × k ) matrix for each pair of proteins in the dataset given by Eq. (4). xy xy xy P11 P12 · · · P1k xy xy P21 P22 · · · P xy 2k P(x, y ) = . (4) k . .. . . . .. . . . xy xy xy Pk1 Pk2 · · · Pkk Thus when we compare L proteins, we can define a large (k L × k L) matrix comprised of smaller (k × k ) matrices, which compare the individual k lowest modes of each protein against the k slowest modes of the other proteins in the family set. The amount of correlation data contained here makes it hard to recognize the correlations between proteins. In order to succintly compare the data, two correlation metrics are introduced as functions of the number of eigenvectors being compared, denoted by k . The first correlation, Mk (x, y ), defines the maximum dot product between the slowest k modes for each pair of structures (Eq. (5)) and the second (Eq. (6)) calculates the sum of the maximum values for each column, j , in the smaller (k × k ) matrix. Mk (x, y ) = max (|k P(x, y )|) Sk (x, y ) = jk =1 (5) (6) Mj (x, y ) Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader Two different measures of correlation were introduced because it is not clear how correlations between different modes in different structures should be computed a priori. The Mk correlation measure is concerned with identifying any two highly correlated eigenvectors between the two proteins without regard for the specific order of these low frequency modes. This acccounts for the possibility of mode mixing which occurs when the lowest eigenvector from one protein is highly correlated with an eigenvector from another protein that is not the lowest frequency mode. In contrast, the second correlation measure, Sk , focuses on identifying how well the entire subspace spanned by the low frequency eigenvectors of one protein matches the subspace spanned by the low frequency eigenvectors of another protein. By averaging these measures over the set of family structures we can determine the average amount of correlation a structure has with respect to a protein family. For a family with lf proteins, the family averaged, Mk value for the xth protein, Mk (x) f , is defined by Eq. (7) along with a family M averaged standard deviation, k f . lf 1y Mk (x, y ) Mk (x) f = lf =1 (7) Similarly, the family averaged, Sk value for the xth protein, Sk (x) f , is M defined in Eq. (8) along with an average standard deviation, k f . Sk (x) f = lf 1y Sk (x, y ) lf =1 (8) 3. Results 3.1. Classification of Protein Families by Dynamics In order to determine a suitable number of modes to consider we performed calculations for all values of k 20. Taking the average of the family averaged metrics in Eq. (7) and Eq. (8), defines an overall correlation value for each protein family. Mk f = lf 1x Mk (x) f lf =1 (9) Figure 2 plots these family-averaged values ( Mk f and Sk f ) as a function of the number of modes, k . The Mk averages when k = 1 are relatively low (0.67) in the case of FABP. This is due to the fact that using only one eigenvector from each protein prevents considering the correlation Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader 1.0 0.9 0.8 0.7 0.6 FABP Glob CytC DHFR PoBP Family Average <f>/k 20 a) Family Average <f> b) 1.0 0.9 0.8 0.7 0.6 0 2 4 6 8 10 12 14 16 18 0 2 4 6 8 10 12 14 16 18 20 Number of Modes Fig. 2. The family averaged metrics calculated for different numbers of modes. a) Mk f b) Sk f /k between potentially mixed modes. However, as more modes are included, hhis situation is quickly remedied. By the time k = 5, the value of Mk t as reached an asymptote and so k = 5 was chosen for the results presented here. Due to the high average family correlation expressed by M5 , this measure suggests a means to distinguish family from non-family structures. The trend for Sk f /k in Fig. 2b) is different, generally reflecting the a greater disparity for larger values of k . However, as illustrated by FABP, more than one mode is required to account for potential mode mixing. The overall lower correlation of Sk when compared to Mk makes Sk more appropriate for monitoring distinctions within a family. To keep the analysis consistent with Mk and allow for intrafamily distinctions, we set k = 5 for the Sk results presented here. We plot the Mk (x, y ) correlation values between each pair of protein structures using a scheme that runs from no (zero) correlation in blue to maximal (one) correlation in red. Figure 3 shows this plotted for FABP, Glob and CytC in panels a) through c) respectively with each row and column corresponding to a specific protein structure. The proteins identified by SCOP as family members are plotted first in each case. Similarly, we plot the Sk (x, y ) values with a color scheme ranging from no correlation in blue to maximal correlation in red. Figure 3d) through f ) plots this correlation for the family members of these three families. Results for the other two protein families (DHFR and PoBP) are similar and thus not shown explicitly. As mentioned above, these results are shown for k = 5 although the calculations were repeated for all values of k 20. To understand the significance of these plots, consider CytC (Fig. 3c) as an example. The first 28 rows and columns correspond to the 28 CytC family proteins and are shown in shades of red to signify their high corre- Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader a) b) c) Protein index d) e) f) Protein index Fig. 3. Correlation measure (k = 5) plots for three protein families. a) M5 for FABP b) M5 for Glob c) M5 for CytC. The known family members are listed first followed by other proteins outside the family. In each case there are clear distinctions between family and non-family structures. High correlation corresp onds to red and low correlation is in blue. For the S5 plots, only the subset corresponding to family members is plotted to show the intra-family distinctions. d) S5 for FABP e) S5 for Glob f ) S5 for CytC. lation. By contrast, the last 70, non-family structures (rows and columns) are in shades of greens and yellows indicating low correlation (0.3 to 0.6). This distinction in color clearly shows a difference between the modes of proteins within the SCOP family compared to the modes of proteins not in the family. This trend is observed for each of the five protein families studied. Such a trend suggests Mk correlations of protein dynamics can be used as a classifying technique. The clear distinction between Mk values for proteins in the family versus non-family proteins makes idenfication of potential candidates for inclusion in the family relatively simple. There are some proteins in Fig. 3a) and b) not classified as being part of the SCOP family that share the same color pattern as those within the family. This indicates a high degree of correlation between the eigenvectors from these structures and those in the family implying that they should be considered as part of the family. Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader 3.2. Identification of Dynamical ly Similar Proteins The numerical values behind the similar color patterns are used to identify these candidate structures with respect to the family average Mk (x) f values defined in Eq. (7). Specifically, we consider a structure as a family candidate if its family averaged value is within three standard deviations of the mean family averaged value as defined in Eq. (10). Mk (x) f M Mk f - 3 k f (10) Thus instead of looking at the correlation color patterns illustrated in Fig. 3, we can plot the family averaged, Mk (x) f , values against protein index as M in Fig. 4a). Indicatting the 35 limits by gray dashed lines quickly identifies three potential candidates for the FABP family according to the criteria in Eq. (10). These three candidate structures have protein indices 57, 66 and 68 corresponding to PDBids 1t8v, 1yiv and 2a0a. a) 1.1 1.0 0.9 4 b) 6 5 f f 10 20 30 40 50 60 70 0.8 0.7 0.6 0.5 3 2 1 0.4 0.3 0 0 0 10 20 30 40 50 60 70 80 90 100 Protein index Protein index Fig. 4. The family averaged correlation measures for different proteins in a family illustrate the candidate and outlier criteria. (a) M5 (x) f values for each FABP family and non-family structure. The mean family value, M5 f , is shown by a black dashed line M and the candidate 35 range is indicated by gray dashed lines. (b) S5 (x) f values for each CytC family and non-family structure. The mean family value, S5 f , is shown S by a black dashed line and the outlier 5 range is indicated by gray dashed lines. Table 2 summarizes the mean family averaged values as well as the number of candidates for each family. In the case of FABP, none of these three structures are part of the SCOP family. 1t8v is a fairly recent PDB structure which is annotated as a fatty acid binding protein, but due to its recent deposition it has not been included in the SCOP database. The other two structures, 1yiv and 2a0a, are annotated as a myelin protein and dust mite allergen respectively. Visual inspection of these structures confirms that they contain the dominant -barrel structure of FABP structures and Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader Table 2. Family FABP Glob CytC DHFR PoBP Family averaged correlation metrics and standard deviations M5 f 0.9411 0.9617 0.9804 0.9852 0.9967 M 5 f Candidates 3 10 0 0 0 S5 f 4.1579 4.2958 4.5958 4.3925 4.4045 S 5 f Outliers 2 17 3 0 0 0.0484 0.0377 0.0267 0.0104 0.0083 0.4818 0.5221 0.3373 0.4174 0.0710 suggests a new potential functional mechanism for these structures, namely as a fatty acid binding protein. Analysis of the Glob family indicated ten candiate structures all of which are recent myoglobin structures which are not included in the SCOP database. Correct identification of these structures as part of the Glob family by comparison of their dynamic modes serves to confirm the applicability of this method to distinguish protein families. The other three families: CytC, DHFR and PoBP do not have any candidate proteins. 3.3. Intrafamily Distinctions After demonstrating the ability of this method to distinguish family from non-family structures, the ability to distinguish variations among structures within a family was also investigated. As can be seen in Fig. 3d) through f ), correlations among structures within a family are not uniform but have variations. Regions with lower correlation (greens, yellows and oranges in these panels) correspond to structures that are potential family outliers. Similar to the definition of candidates in Eq. (10), we define such outliers S as being more than one standard deviation ( k f ) away from the mean family averaged Sk value as in Eq. (11). Sk (x) f S Sk f - k f (11) S S Again the actual values of S5 f and 5 f used for each of the families are listed in Table 2. Using this outlier criteria we are able to pick out structures that may be structurally and/or functionally distinct from within a family in an automated fashion. Beginning with CytC as an example case, one can see two green-yelloworange bands in Fig. 3c) representing protein indices 18 and 21 that are less correlated with other CytC structures in general. Figure 4b) plots the family averaged correlation measures, S5 (x) f , for each protein along with the outlier criteria from Table 2. In this plot one can see that all nonS family members fall below the 5 f level shown with a dashed gray line Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader because there were no candidates this family. More importantly, there are three outliers corresponding to protein indices 18,21 and 27. These indices refer to PDBids 1fhb, 1nmi and 1yic. Since these outliers have the same overall family structure, the differences identified by this analysis of the dynamics are due to some other factor(s). Using similar analysis, FABP had two outliers: PDBids 1ael and 1tou (indices 4 and 25 in Fig. 3a) and Glob had 17 outliers identified (protein indices 21­24, 30­35 and 69­75 in Fig. 3b). Examining the structures that were deemed to be outliers as a whole we are able to determine a few reasonable explanations for these differences. The outliers for FABP and CytC were structures that were determined by NMR rather than X-ray crystallography. Although these were not the only NMR structures in the datasets representing these families, it suggests that structures that are not forced to conform to the Ramachandran phi-psi plot may adopt a "looser" structure with measurable differences in dynamics. Further supporting this claim is the fact that one of the CytC outliers, 1nmi, is an averaged NMR structure which would not necessarily reflect the true dyanamics of the CytC family. The 17 outliers in the Glob family correspond to al l of the structures of one type of globin: leghemoglobin from a specific species: yellow lupin. In this case, these structures serve to form a sub-family within Glob that can be seen visually by the greenyellow bands in Fig. 3b).DHFR and PoBP had no outliers according to the criteria of Eq. (11). However examination of the most varied structures support the claims of sub-family organization by speciation and differences due to ligand-binding state (data not shown). 4. Conclusions We have demonstrated an automatic family classification scheme for protein structures based upon their computed dynamics. Comparisions using the low frequency eigenvectors of structures accurately assigns these structures to a unique protein family. Using this precomputed data, Eq. (10) provides a measure for assigning newly determined structures as candidates to a particular protein family. In addition, this method provides a quantitative measure of the differences within protein families. These differences can be investigated in terms of outliers or most dynamically different as indicated in the text. Examination of the outliers indicates that differences within the familes can be attributed to some combination of differences in ligand-binding state, method of structural determination and sequence. These factors are an initial list of Pacific Symposium on Biocomputing 13:426-437(2008) October 1, 2007 9:6 WSPC - Pro ceedings Trim Size: 9in x 6in rader possible explanations, and more research on a larger set of structures needs to be done in order to obtain a more complete understanding of what these variations in dynamics correspond to universally. Finally we introduce a new direction for protein classification schemes that is both automatic and relies upon the dynamics of protein structures. In the version presented here, the comparisons were only made for structures that had the same number of residues. Admittedly, this restricted the number of families this analysis is applicable to. However, it is clear that dynamics can be used to make meaningful distinctions both between and within protein families. In the future we anticipate generalizing this method to allow comparisons between structures having different numbers of residues. It is expected that such an extension will allow for more family comparisons and thus more general results regarding the relationships between protein structures, dynamics and functions to surface. This method can also provide an automatic classification scheme to aid in the identification of functions for so called "hypothetical proteins" produced by structural genomics initiatives. Acknowledgments The authors thank the IUPUI Department of Physics for funding this work. References 1. J. C. Whisstock and A. M. Lesk, Quart. Rev. Biophys. 36, 307, (2003). 2. L. Lo Conte, et.al., Nuc. Acid Res. 28, 257, (2000). 3. C. A. Orengo, et.al., Structure 5, 1093, (1997). 4. I. Bahar and A.J. Rader, Curr. Opin. Str. Biol. 15, 1 (2005). 5. T. Haliloglu, I. Bahar and B. Erman, Phys. Rev. Lett. 79, 3090, (1997). 6. I. Bahar, A. R. Atilgan, and B. Erman, Fold. Des. 2, 173, (1997). 7. O. Keskin, R. L. Jernigan and I. Bahar, Biophys. J. 78, 2093, (2000). 8. S. Maguid, et.al., Biophys. J. 89, 3, (2005). 9. A. Leo-Macias, et.al., Biophys. J. 8, 1291, (2006). 10. H. M. Berman, et.al.,Nuc. Acids Res. 28, 235 (2000). 11. L. W. Yang, et.al., Bioinformatics 20, 2978, (2005). 12. C. Chennubhotla, et.al., Phys. Biol. 2, S173 (2005).