Coarse and Reliable Geometric Alignment for Protein Docking Y. Wang, P.K.Agarwal, P.Brown, H. Edelsbrunner, and J. Rudolph Pacific Symposium on Biocomputing 10:64-75(2005) September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 COARSE AND RELIABLE GEOMETRIC ALIGNMENT FOR PROTEIN DOCKING£ D Y. WANGÝ P. K. AGARWALÞ P. BROWNÜ H. EDELSBRUNNERß and J. RUDOLPH , , , uke University, Durham, North Carolina Abstract. We present an efficient algorithm for generating a small set of coarse alignments between interacting proteins using meaningful features on their surfaces. The proteins are treated as rigid bodies, but the results are more generally useful as the produced configurations can serve as input to local improvement algorithms that allow for protein flexibility. We apply our algorithm to a diverse set of protein complexes from the Protein Data Bank, demonstrating the effectivity of our algorithm, both for bound and for unbound protein docking problems. 1. Introduction Protein-protein docking is the computational approach to predicting interactions between proteins. In this paper, we contribute to this field by describing an algorithm for generating a small set of coarse alignments between protein structures. Motivation. Highly organized transient or static assemblies of proteins control most cellular events. A better understanding of the protein-protein interactions involved in these assemblies would help elucidate how individual proteins form complexes and dynamically function in concert to generate the cell circuitry and its time-dependent responses to external stimuli. Protein structures determined at atomic resolution by X-ray crystallography, nuclear magnetic resonance, and increasingly by computer modeling provide one basis for the study of protein interactions. However, given the relative wealth of structural details for monomeric proteins compared to multimeric protein complexes, there exists a need for computational tools and thus for the field of protein docking. Prior work. Current research on protein-protein docking focuses on either bound £ All authors are supported by NSF under grant CCR-00-86013. JR and HE are also supported by NIH under grant R01 GM61822-01. PA is also supported by NSF under grants EIA-01-31905 and CCR-02-04118 and by the U.S.-Israel Binational Science Foundation. Ý Department of Computer Science. Þ Departments of Computer Science and Mathematics. Ü Department of Computer Science. ß Departments of Computer Science and Mathematics, and Raindrop Geomagic. Departments of Biochemistry and Chemistry. September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 docking (the reassembly of known complexes from their constituents), or unbound docking (the assembly of as yet unknown complexes under the assumption of only small protein conformational changes). Most approaches to unbound docking consist of two stages20 : the rigid docking stage produces a set of potential docking configurations by considering only rigid motions, and the refinement stage locally improves the docking configuration, possibly allowing for a limited amount of flexibility. The two essential components in both stages are: a scoring function that discriminates near-native from incorrect docking configurations and a search algorithm to find (approximately) the best configuration for the scoring function. Approaches to the rigid docking stage rely mainly on geometric complementarity. Some are based on uniform discretizations of the space of rigid motions, which they search exhaustively 3. This approach has been accelerated using the fast Fourier transform (FFT) 16 , which forms the basis of the docking software FTDock13 , 3D-Dock18, GRAMM21 , and ZDock 6 . Others sample a small number of rigid motions non-uniformly from the space by aligning feature points found on the molecular surfaces 14 17 . This idea goes back to Connolly 11 , who proposed to use the minima and maxima of a function related to mean curvature, now known as the Connolly function. An example of this method has been described by Fischer et al.12 , who use geometric hashing to align critical points of a variant of the Connolly function. The refinement stage is usually modeled as an energy minimization problem, with the scoring function focusing on the thermodynamic aspects of the interaction. The difficulty of the problem increases with the dimension of the search space or, equivalently, the degree of freedom, which is large even if we keep the back-bone rigid and consider only side-chain flexibility. Recently, Vajda et al. have proposed a hierarchical, progressive refinement protocol4 5 , which seems to reliably converge to a near-native docking configuration starting with initial configurations up to ½¼ ° root-mean-square-distance away from the native configuration. Little success has been reported on including backbone conformational changes 19. Since each step in the refinement stage is costly, it is essential that the set of potential configurations generated in the rigid docking stage is small and reliably contains configurations not too far from the native configuration. Current solutions to the rigid docking stage fall short on at least one of the two requirements. New work. In this paper, we present an efficient algorithm for the rigid docking stage. We use geometric complementarity to guide the search for a small set of rigid motions so that the two proteins fit loosely into each other. Such a set of potential configurations can be further refined to obtain more accurate docking predictions5 9 . We remark that for the case of unbound docking, it is especially September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 important to start with coarse (not tight) fits between proteins to take advantage of flexibility in the later refinement stage. We describe our algorithm in Section 2. It relies on a novel approach to describe protrusions and cavities on molecular surfaces using a succinct set of point pairs computed from the elevation function 1. We then align such pairs and evaluate the resulting configurations using a simple and rapid scoring function. Compared to similar approaches that align feature points 12 17 , our algorithm inspects orders of magnitude fewer configurations. This is made possible by using slightly more complicated features that contain information useful in assessing their significance. We exploit this extra information twice, first to ignore insignificant features and second to be more discriminant in matching up features from different proteins. In Section 3, we demonstrate the efficacy of our approach by testing a set of 25 bound protein complexes from the Protein Data Bank 2 . We demonstrate that a combination of our algorithm with the local improvement procedure described in9 efficiently finds near-native docking positions for all but two cases without creating false positives. In addition, we test our algorithm on the unbound protein docking benchmark 7. In particular, we demonstrate that the algorithm generates poses sufficiently close to the native configuration such that refinement methods that take into account protein flexibility will succeed in bringing them within an acceptable neighborhood of the correct solution. We conclude and discuss future work in Section 4. 2. Methods We represent each protein by a set or union of finitely many balls in threedimensional Euclidean space, which we denote by Ê ¿ . Specifically, we are given ½¾ Ò and ½¾ Ñ , where is two proteins, ¾ Ê¿ and van der Waals radius Ö ¾ Ê and is the the ball with center ball with center and van der Waals radius × . We fix in Ê¿ and describe an algorithm that finds a small set of candidate transformations for . Each transfor´ µ). We mation is a rigid motion that produces a candidate configuration ( begin by describing the scoring function that assesses the fit between two proteins. Scoring function. A good scoring function favors near-native configurations over Ö × be configurations that are far from the native. Letting and , we define the distance between the balls w here ÓÒØ Ø´ µ ÓÐÐ × ÓÒ´ µ is a constant we refer to as the contact-threshold. ¿ ½ if ½ ¼ ¼ if ¼ ¼ if ¼ The score of ´ µ September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 is based on the total number of contacts, ÓÒØ´ µ È ÓÒØ Ø´ µ, and ÓÐд µ È ÓÐÐ × ÓÒ´ µ We call the the total number of collisions, µ valid if ÓÐд µ , where the constant collisionconfiguration ´ threshold defines the maximum number of collisions we tolerate. The score ignores invalid configurations and equals the number of contacts for valid configurations. This notion of score is similar to the ones in 3 6 but different because our score penalizes collisions twice, first by counting them toward a possibly invalid configuration and second by reducing the contact number. The reason for this difference is that we aim at coarse alignments and thus are more tolerant to ¼ rather than , as in 3 . The second penalty counteracts collisions, using the usual increase in contact number that goes along with an increase in collision number. In other words, it seeks to avoid a bias toward configurations with higher collision number without unfairly discriminating against them. Features. Our algorithm generates rigid motions from feature sets ¨ and ¨ btained by analyzing the shapes of the two proteins. We compute these features from (approximately) smooth surfaces representing the two shapes 8 10 . Letting Å be the surface representing , we briefly review the function Ð Ú Ø ÓÒ Å Ê that underlies our definition of feature. To first approximation, it resembles the elevation on Earth, which is the height difference of a point and the mean sea level at that point. This definition makes sense on Earth, where we have a natural choice of origin (the center of mass) and mean sea level (a level set of the gravitational potential), neither of which exists for general surfaces. In the absence of both concepts, we associate each point Ü ¾ Å with a canonically defined partner Ý ¾ Å , with same normal direction Ò Ý ÒÜ, and define Ð Ú Ø ÓҴܵ as the absolute height difference between Ü and Ý in that common direction. For more details, in particular on how to define the canonical pairing, we refer to 1 . Loosely speaking, Ü is the top of a protrusion or the bottom of a cavity in the direction ÒÜ and the pairing partner Ý is the saddle point that marks where the protrusion or cavity starts. It is also possible that the roles of Ü and Ý are reversed. For the purpose of protein docking, we are interested in points with locally maximal elevation, as they represent locally most significant features. Almost all points Ü on Å have exactly one partner Ý , but most maxima arise at positions where the partner is ambiguous. More specifically, for a generic surface there are four types of maxima describing different types of features, as illustrated in Figure 1. By definition, a feature consists of two points, Ù and its partner Ú , with common surface normal, Ò Ù Ð ÒÚ , and common elevation, Ð Ú Ø ÓÒ´Ùµ Ú Ø ÓÒ´Úµ. Its length is the Euclidean distance between the two points, Ù Ú . Each maximum of the elevation function is defined by ¾ ¾ ¿ o September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 x x x x Figure 1. Left: a one-legged maximum characterized by Ü having a unique partner. Middle left: a two-legged maximum in which Ü has two partners, both with the same normal direction and the same height difference to Ü. Middle right: a three-legged maximum in which Ü has three partners, again sharing the same normal direction and the same height difference to Ü. Right: a four-legged maximum in which Ü has two partners and both partners have the same two partners each. points and gives rise to ¾ features in ¨ . For example, the 3-legged maximum points defining ¾ point (third from the left in Figure 1) consists of pairs each forming a feature in ¨ . The length and elevation of a feature are used to estimate its importance, and both together with the normal direction are used to pair up features from the sets ¨ and ¨ . Coarse alignment. Given two proteins and together with their feature sets ¨ and ¨ , our algorithm computes a set of potential coarse alignments : for every « ¾ ¨ and every ¬ ¾ ¨ do if « ¬ form a plausible alignment then Ð Ò´« ¬ µ; compute the contact and collision numbers for ´ ´ µµ is valid then add to endif if ´ endif endfor; sort by contact number. ¡ ¡ ´ µµ; The rationale behind the algorithm is that good fits between the input proteins have aligned features, such as a protrusion of fitting inside a cavity of , or vice versa. If we pair up all features of with all features of , we surely cover all good fits. On the other hand, the information that comes with each feature can be used to discriminate between pairs and gain efficiency by filtering out alignments we deem not important or implausible. Specifically, we introduce an importance filter that eliminates features from ¨ and ¨ whose lengths or elevations are below threshold. The remaining features form pairs ´« ¬µ which pass the plausibility filter provided « and ¬ are not too different in length and they represent complementary types (a protrusion and a cavity). The constants used in the importance filter are given in the caption of Table 1. Assuming ´« ¬µ passes the importance and the plausibility filters, we com- September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 ´ ute an aligning rigid motion as follows. Writing « p ´Ù Ú Ô Õ Ò¬ µ for the points and normals, we define the bi-normals « and ¬ Ò¬ ¢ Ô Õ . We obtain the rigid motion in three steps: Õ Ô Ò « µ and ¬ Ò« ¢ Ú Ù Ú Ù Ô·Õ ; 1. translate ¬ so that the two midpoints coincide: Ù·Ú ¾ ¾ 2. rotate ¬ about the common midpoint so that Ù Ú Ô Õ are collinear; ¬. 3. rotate ¬ about the common line so that « We note that there is an ambiguity in Step 2, allowing for two different alignments distinguished by having Ú Ù and Õ Ô point in the same or in opposite directions. We are interested in both but simplify the description by pretending that Function Ð Ò returns only one rigid motion, instead of two as it really does. Observe that Step 3 positions the two features to maximize the angle between the two normal vectors. Given , we compute the score and the number of collisions using a hierarchical data structure storing and . Letting Ò and Ñ be the number of balls in the two sets, this takes time O´´Ò · ѵ ÐÓ ´Ò · ѵµ. 3. Results In this section, we present the results of testing our algorithm on twenty-five bound docking problems obtained from the Protein Data Bank 2 and on forty-nine unbound docking problems from the benchmark in 7 . We begin with a detailed study of a well known protein complex. A case study. We use the barnase/barstar complex (pdb-id 1BRS, chains A and D, with 864 and 693 atoms) as a sample system to introduce the capabilities of our algorithm. We generate molecular surfaces of the two chains with the M S M S software (available as part of the VMD software distribution 15 ) and obtain triangulations with 8,959 and 7,248 vertices. In Table 1, we show the total number chain A, # legs 2 3 4 total # of features #s after importance filter 1,044 112 696 205 156 50 chain D, # legs 2 3 4 828 68 510 160 154 49 Table 1. Compare the total number of features obtained from two-, three- and four-legged maxima for chains A and D of 1BRS with the number of features that pass the importance filter, having length at least ¿ ¼ ° and elevation at least ¼ ¾ ° . (There are no one-legged maxima for this data set.) of features generated from the maxima of the elevation function and the number of features that survive the importance filter. The latter form the input to our coarse alignment algorithm. We note that a substantially larger number of features September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 obtained from 3-legged maxima are retained than features obtained from 2- and 4-legged maxima. Given the two sets of input features, our algorithm takes about three minutes on a single processor PIII 1GHz computer to generate a family of 5,021 valid configurations with contact number larger than or equal to 150. Each configuration in corresponds to a transformation for chain D. We use the root-meanbefore local improvement rank #cont #coll RMSD 12 5 1 4 2 59 3 11 15 76 327 342 427 353 391 269 373 339 318 263 24 48 23 49 39 12 29 18 16 29 3.23 2.42 1.59 3.57 1.70 2.84 2.32 3.07 3.00 39.39 after local improvement rank #cont' RMSD 1 2 3 4 5 6 7 8 9 10 359 338 328 314 311 310 307 281 251 213 0.54 0.80 0.72 0.80 0.91 0.78 1.50 1.47 2.09 39.96 Table 2. Top ten configurations after local improvement and their ranks before local improvement. The first nine have small RMSD and may be considered near-native configurations. We use different definitions for the number of contacts before and after the local improvement: #cont is defined as in Section 2, and #cont' is as computed by the local improvement algorithm, which is the number of non-overlapping spheres at distance at most ½ ° . square-distance (RMSD) between the centers of the matching atoms in D and ´Dµ to measure how close the configuration is to the native one. Ranking by score, the top configuration in has an RMSD of ½ ° , and six of the top ten ° . Letting £ be the subset configurations have RMSD smaller than or equal to ¼ ½¼¼ configurations, we refine each one using the local improvement of top Æ heuristic of Choi et al. 9 We then re-rank the configurations in £ based on the new scores, limiting ourselves to configurations with collision number at most five. The results in Table 2 show that our algorithm generates multiple coarse alignments that are useful, in the sense that the local improvement heuristic succeeds in refining them to near-native configurations. More bound protein complexes. We extend our experiments to a collection of twenty-five protein complexes obtained from the Protein Data Bank. Each complex consists of two chains, and we generate a set of features for each. For a typical chain, the number of features that survive the importance filter is on the same order of magnitude as the number of atoms. In Table 3, we show a lowRMSD configuration for each protein complex, as well as its rank in the list of September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 configurations output by our algorithm (using contact-threshold ¾ ¼ ° and ¼). With only one exception (1JAT), we have at least collision-threshold one low-RMSD configuration ranked among the top one hundred. The last column shows the running time for the coarse alignment algorithm, which does not include the time to compute the triangulated surface and the maxima of the elevation function. pdb-id 1A22 1BI8 1BRS 1BUH 1BXI 1CHO 1CSE 1DFJ 1F47 1FC2 1FIN 1FS1 1JAT 1JLT 1MCT 1MEE 1STF 1TEC 1TGS 1TX4 2PTC 3HLA 3SGB 3YGS 4SGB chains A, B A, B A, D A, B B, A E, I E, I I, E B, A D, C A, B B, A A, B A, B A, I A, I E, I E, I Z, I A, B E, I A, B E, I C, P E, I rank 2 12 1 5 3 1 2 78 15 5 11 1 522 8 1 1 1 9 1 2 1 1 1 6 10 #coll 23 43 11 14 34 14 22 11 1 49 44 29 20 23 27 23 43 54 46 4 18 19 38 7 33 RMSD 2.75 2.48 1.52 1.85 2.54 2.71 2.21 3.09 1.49 4.13 3.70 1.62 1.20 3.64 3.49 1.33 1.18 3.07 2.61 3.35 4.55 1.87 3.21 1.07 2.33 time 20 26 3 2 8 3 9 27 1 6 41 5 9 10 3 9 8 7 6 14 6 16 5 6 4 Table 3. For each protein complex, we show data for the highest ranking configuration with RMSD at most ¼ ° . The running time of the coarse alignment algorithm is given in minutes. Next, we apply the local improvement heuristic 9 to the top Æ ½¼¼ con¾¾ to get a figurations of each complex (except 1JAT, for which we need Æ near-native configuration) and re-rank them based on the new scores. Eliminating all configurations with more than 5 collisions, Table 4 shows before and after data for the configuration that is ranked at the top after local improvement. In all but two cases, the top ranked configuration is near-native, and in one of the two exceptional cases, the second ranked configuration is near-native. In the remaining exceptional case (1BI8), we can obtain a near-native configuration by relaxing September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 the threshold of allowed collisions to eight. In summary, for 23 of the 25 test complexes, our coarse alignment algorithm combined with the local improvement heuristic9 predicts a near-native configuration without false positives. before local improvement rank #cont #coll RMSD 2 62 12 5 16 1 23 78 15 5 34 2 522 3 84 1 1 10 2 80 1 1 1 6 10 363 324 327 311 261 375 276 273 238 323 361 402 203 362 280 542 444 334 373 296 346 402 364 315 298 23 10 37 14 21 14 36 11 1 49 54 27 21 14 34 23 43 51 13 25 18 19 38 7 33 2.75 30.00 3.23 1.85 5.59 2.71 2.57 3.09 1.49 4.13 9.94 1.59 1.20 6.17 3.57 1.33 1.18 4.51 2.71 4.34 4.55 1.97 3.21 1.03 2.33 after local improvement rank #cont' RMSD 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 475 234 349 256 289 305 317 220 221 200 413 326 288 310 322 372 314 304 348 355 314 416 257 209 266 1.08 29.88 0.54 0.61 0.63 0.99 0.82 1.28 0.56 1.33 0.61 0.89 0.87 1.77 0.32 0.57 0.79 1.28 0.44 0.36 0.66 0.70 2.24 0.85 2.50 pdb-id 1A22 1BI8 1BRS 1BUH 1BXI 1CHO 1CSE 1DFJ 1F47 1FC2 1FIN 1FS1 1JAT 1JLT 1MCT 1MEE 1STF 1TEC 1TGS 1TX4 2PTC 3HLA 3SGB 3YGS 4SGB Table 4. For each protein complex, we locally improve the Æ top ranked configurations and show the data for the highest re-ranked configuration with small RMSD. After local improvement we admit only configurations with at most five collisions, as usual. The number of contacts before and after the improvement, #cont and #cont', are computed as described in the caption of Table2. It is interesting to compare the data in Tables 3 and 4 and notice that the highest ranked configuration after local improvement is the highest ranked configuration with small RMSD before the local improvement in only slightly more than Ralf the cases. Consider for example 1FIN, which has a configuration at ¿ ¼ ° h MSD with 44 collisions but the one that leads to the best final configuration has ° and ÓÐÐ . RMSD Unbound docking benchmark. We further test our algorithm on the proteinprotein docking benchmark provided in 7 . We omit the seven complexes classified as difficult in7 because they have significantly different conformations in the un- September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 bound vs. bound structures. We also omit complexes 1IAI, 1WQ1 and 2PCC for which we had difficulties to generate surface triangulations of required quality. Of the remaining forty-nine complexes, twenty-five are so-called bound-unbound C-id 1ACB 1AVW 1BRC 1BRS 1CGI 1CHO 1CSE 1DFJ 1FSS 1MAH 1TGS 1UGH 2KAI 2PTC 2SIC 2SNI 1PPE 1STF 1TAB 1UDI 2TEC 4HTC 1AHW 1BVK 1DQJ #hits 20 8 35 7 5 27 7 2 2 4 18 3 26 32 27 10 10 8 3 3 5 2 1 5 7 bound-unbound min* rank 3.70 5.51 4.66 1.60 3.04 2.35 3.15 6.44 7.65 2.78 5.27 7.95 6.55 4.55 4.04 6.34 4.13 1.41 3.78 4.50 1.42 5.94 9.38 1.95 4.59 3,951 4,698 1,629 426 695 92 15,271 1,433 10,721 1,561 543 8,268 2,560 4,983 76 4,894 37 1 48 1,124 6 396 2,781 1,189 710 size 14,426 23,565 12,770 11,607 10,135 11,815 21,068 35,231 25,609 25,402 11,383 14,656 13,478 13,929 20,065 15,830 7,660 15,082 8,296 21,133 21,134 14,032 32,919 24,611 28,694 min 1.75 5.42 4.66 1.60 3.04 2.35 2.74 6.44 5.15 2.78 5.27 7.16 3.41 4.16 4.04 4.58 4.13 1.41 3.78 4.50 1.42 5.94 4.37 1.95 4.59 C-id 1MLC 1WEJ 1BQL 1EO8 1FBI 1JHL 1KXQ 1KXT 1KXV 1MEL 1NCA 1NMB 1QFU 2JEL 2VIR 1AVZ 1L0Y 2MTA 1A0O 1ATN 1GLA 1IGC 1SPB 2BTF #hits 7 3 11 1 8 18 2 12 7 8 7 7 4 19 11 8 2 40 3 8 3 3 2 unbound-unbound min* rank 3.71 6.27 6.98 2.31 6.49 3.47 5.99 4.52 2.48 2.21 1.75 7.18 1.97 3.46 1.08 4.06 2.75 2.91 5.95 1.52 2.48 2.83 5.02 6,949 4,659 10,388 11 11,783 14,185 1,495 153 321 73 621 14,202 12 115 1 4,243 1,136 19,167 3,950 1 25,307 3,260 617 10,132 size 29,747 18,194 23,308 45,512 26,036 32,091 37,218 39,240 46,368 17,741 49,600 42,066 47,693 34,072 40,813 7,895 34,044 36,903 9,113 50,729 33,879 25,303 13,728 33,480 min 3.32 5.86 4.39 2.31 2.30 2.61 5.99 4.52 2.48 2.21 1.75 2.72 1.97 3.46 1.08 3.52 2.75 2.07 4.35 1.52 2.82 2.06 2.83 3.28 Table 5. Twenty-five bound-unbound cases on the left plus twenty-four unbound-unbound cases on £ the right. From left to right: the complex identification, the number of configurations in with RMSD* less than or equal to ½¼ ¼ ° , the smallest RMSD* value of any configuration in £ (min*), the rank of this configuration within , the number of configurations in , and the smallest RMSD* value of any configuration in (min). cases, in which one of the components is rigid. For each complex, we fix one chain as A, which is the rigid chain for each bound-unbound case and the receptor for each unbound-unbound case. We generate , a set of the potential configurations, each corresponding to a rigid motion applied to the other chain, B. For each , we measure the root-mean-square-distance between the matching interface « atoms of B and ´Bµ, and refer to it as RMSD*. Similar to the bound docking case, this value is a good estimate for the distance to the native configuration since the benchmark provides the unbound structures superimposed onto their corresponding crystallized bound structures. For each complex, we let £ be the subset of top Æ ¾ ¼¼¼ configurations in . We show the results of our experiments in Table 5, demonstrating a number of favorable characteristics of our coarse alignment algorithm: 1. Within the relatively small set of 2,000 top-scoring configurations, £ , about ± of the complexes yield a configuration below ¼ ° RMSD and about ± yield a configuration below the ½¼ ¼ ° cut-off needed as input for the hierarchical, progressive refinement protocol in 4 5 . September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 2. For most complexes, our algorithm generates multiple hits, implying that a local refinement is not likely to get trapped in a local minimum and instead find a near-native configuration. 3. Within the set of all generated configurations, , about ± of the complexes yield a configuration below ¼ ° , typically within the top 10,000 sc w ores. All 49 complexes generate at least one configuration below ° ithin the top 25,000 scores. We remark that there are at least two ways to further improve the results: use a different ranking mechanism that moves more low-RMSD configurations into the top ranks, and reduce the size of by clustering similar configurations 12. 4. Discussion We conclude this paper with a brief comparison of our results with prior work on bound and unbound docking. We classify the bound docking methods by how they sample the search space of rigid motions. Methods that sample densely and more or less uniformly predict more accurate rigid docking configurations, but at a high computational cost. To adapt these methods to unbound docking, we may run the algorithms at low resolution or select a small set of promising candidate configurations for further refinement. As of today, neither approach has produced a workable solution to the problem of unbound docking. Methods that sample the space of rigid motions in a biased manner rely on some sort of shape analysis, aimed at detecting locally complementary configurations. All prior work is based on point features marking protrusions and cavities. Alignments are created by matching the points, e.g. all pairs from one set with all pairs from another. The running time is often improved using geometric hashing, as in 12 . Our algorithm belongs to the second class of methods but differs from prior work in the nature of the features, which are point pairs with extra information useful in estimating the scale level and in finding promising matches. Using this information, we generate significantly sparser samples of the search space. Our experiments provide evidence that despite the lower density, we always get candidates that can be refined to near-native configurations. The algorithm is reasonably fast and improvements are still possible. Acknowledgement. The authors would like to thank Vicky Choi for the local improvement software used in our experiments. References 1. P. K. Agarwal, H. Edelsbrunner, J. Harer and Y. Wang. Extreme elevation on a 2manifold. In Proc. 20th Ann. Sympos. Comput. Geom., 357­365, 2004. September 22, 2004 15:0 Proceedings Trim Size: 9in x 6in psb-2005 2. H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. Shinkdyalov and P. E. Bourne. The protein data bank. Nucleic Acid Res., 28:235­242, 2000. 3. S. Bespamyatnikh, V. Choi, H. Edelsbrunner and J. Rudolph. Accurate protein docking by shape complementarity alone. Manuscript, Duke Univ., Durham, NC, 2004. 4. C. J. Camacho, D. W. Gatchell, S. R. Kimura and S. Vajda. Scoring docked conformations generated by rigid-body protein-protein docking. Proteins: Struct. Funct. Genet., 40:525­537, 2000. 5. C. J. Camacho and S. Vajda. Protein docking along smooth association pathways. Proc. Natl. Acad. Sci., 98:10636­10641, 2001. 6. R. Chen, L. Li, and Z. Weng. ZDOCK: an initial-stage protein docking algorithm. Proteins: Struct. Funct. Genet., 52:80­87, 2003. 7. R. Chen, J. Mintseris, J. Janin and Z. Weng. A protein-protein docking benchmark. Proteins: Struct. Funct. Genet., 52:88­91, 2003. 8. H.-L. Cheng, T. K. Dey, H. Edelsbrunner and J. Sullivan. Dynamic skin triangulation. Discrete Comput. Geom., 25:525­568, 2001. 9. V. Choi, P. K. Agarwal, H. Edelsbrunner and J. Rudolph. Local search heuristic for rigid protein docking. In Proc. 4th Intl. Workshop Alg. Bioinform., 2004, to appear. 10. M. L. Connolly. Analytic molecular surface calculation. J. Appl. Crystallogr., 6:548­ 558, 1983. 11. M. L. Connolly. Shape complementarity at the hemo-globin albl subunit interface. Biopolymers, 25:1229­1247, 1986. 12. D. Fischer, S. L. Lin, H. Wolfson and R. Nussinov. A geometry-based suite of molecular docking processes. J. Mol. Biol., 248:459­477, 1995. 13. H. A. Gabb, R. M. Jackson and M. J. Sternberg. Modeling protein docking using shape complementarity, electrostatics and biochemical information. J. Mol. Biol., 272:106­ 120, 1997. 14. B. B. Goldman and W. T. Wipke. QSD: quadratic shape descriptors. 2. Molecular docking using quadratic shape descriptors (QSDock). Proteins, 38:79­94, 2000. 15. W. Humphrey, A. Dalke and K. Schulten. VMD--Visual Molecular Dynamics. J. Mol. Graphics, 15:33­38, 1996. 16. E. Katchalski-Katzir, I. Shariv, M. Eisenstein, A. Friesen, C. Aflalo and I. Vakser. Molecular surface recognition: determination of geometric fit between protein and their ligands by correlation techniques. Proc. Natl. Acad. Sci. (USA), 89:2195­2199, 1992. 17. H. Lenhof. An algorithm for the protein docking problem. In Bioinformatics: From Nucleic Acids and Proteins to Cell Metabolism, eds. D. Schomburg and U. Lessel, Wiley, 1995, 125­139. 18. G. Moont and M. J. E. Sternberg. Modelling protein-protein and protein-dna docking. In Bioinformatics: From Genomes to Drugs, ed. T. Lengauer, Wiley, 2002, 361­404. 19. B. Sandak, R. Nussinov and H. J. Wolfson. A method for biomolecular structural recognition and docking allowing conformational flexibility. J. Comput. Biol., 5:631­ 654, 1998. 20. G. R. Smith and M. J. E. Sternberg. Prediction of protein-protein interactions by docking methods. Curr. Opin. Struct. Bio., 12:29­35, 2002. 21. I. A. Vakser. Protein docking for low-resolution structures. Protein Engin., 8:371­377, 1995.