Almost-Delaunay Simplices: Nearest Neighbor Relations for Imprecise Points Deepak Bandyopadhyay Jack Snoeyink Abstract Delaunay tessellations and Voronoi diagrams capture proximity relationships among sets of points in any dimension. When point coordinates are not known exactly, as in the case of 3D points representing protein atom coordinates, the Delaunay tessellation may not be robust; small perturbations in the coordinates may cause the Delaunay simplices to change. In this paper, we define the almost-Delaunay simplices, derive some of their properties, and give algorithms for computing them, especially for neighbor analysis in three dimensions. We sketch applications in proteins that will be described more fully in a companion paper in biology. http://www.cs.unc.edu/debug/papers/AlmDel. 2 Motivation & Definitions We first need the following definitions for a finite set of point sites S d. A k -simplex is the convex hull of k + 1 affinely independent sites; the simplices in 3 are points, edges, triangles and tetrahedra formed by sites of S . The Voronoi diagram of S is the decomposition of space into regions with the same set of closest neighbor sites [44]. The dual Delaunay tessellation is a decomposition of the same space based on an "empty sphere property:" [14] if a subset of sites, S S , lie on the boundary of a sphere that is otherwise empty of sites, then the convex hull of S is a region of the Delaunay tessellation. It is common to assume, or to simulate, that the sites S are in general position--no d + 1 sites may be co-planar or d + 2 sites co-spherical in d--so that the Delaunay regions are simplices. Proteins are long chains of amino acids that reliably fold into 3-d structures that perform the functions essential for life [25, 24]. The genomics revolution has produced powerful tools to manipulate the sequence of amino acids, but it is the structure of a protein that determines its function. Crystallography and NMR give us information about the structures of many molecules; the Protein Data Bank (PDB) contains the coordinates of atoms for over 20,000 proteins [1]. One of the most important open problems in science, the "protein folding problem," is to derive 3-d structure directly from the amino acid sequence. Many researchers have found Delaunay tessellations and Voronoi diagrams to be useful tools for problems of protein structure analysis, such as Protein geometry definition Given a set of labeled points representing atoms or residues of a protein, what is the protein's volume, surface, and density? What are its cavities and pockets? Richards [36] pioneered the use of Voronoi diagrams to compute protein volumes. This has been an active research area, with more detailed empirical analysis of parameters [42, 43], refinements on the definition of the surface [29, 30], and analysis of differential packing in the core and surface regions [26]. The Delaunay tessellation has been used to define and detect pockets and cavities [6, 27, 28]. Decoy discrimination Given a collection of "decoy" structures for the same protein, (often generated by 1 Introduction The Voronoi diagram and Delaunay tessellation, which are geometric structures defined for sets of points, have found use in many areas of science and engineering [5, 13, 9, 34]. These diagrams are defined by exact geometric criteria. In many applications, however, point coordinates are known only imprecisely, so it is natural to ask whether the uses of Voronoi and Delaunay are stable and robust under changes to the input coordinates. In this paper, we consider the possible structures that could be defined by nearby inputs. Specifically, we focus on the Delaunay tessellation, and consider the almost-Delaunay simplices, which are additional sets of sites that could become Delaunay simplices if all sites are perturbed by a minimum amount 0. In the next section, we list several problems from protein structure analysis that motivate the definition of almostDelaunay simplices. In Section 3 we relate almost-Delaunay simplices to a variant of the minimum-width annulus problem, which allows us to determine several properties of almost-Delaunay simplices and sketch a straightforward, but slow, algorithm to compute them. In Section 4 we consider the properties of our applications and derive efficient algorithms to compute almost-Delaunay edges, triangles and tetrahedra in two and three dimensions. of Computer Science, University of North Carolina at Chapel Hill. {debug,snoeyink}@cs.unc.edu Portions of this research were supported by NSF grant 0076984. Department Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for industrial and Applied Mathematics. All Rights reserved. Printed in The United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688 410 structure prediction programs), determine which is the most native-like. Both Voronoi and Delaunay have been used to score residue interactions in folded proteins and decoys. The Voronoi diagram naturally assigns a region to each atom or representative point, and the contact area between residues has been incorporated into "two-body" potentials [48, 4]. The Delaunay tessellation collects sets of four "neighboring" representative points into tetrahedra. Researchers have analyzed the frequency of occurrence of different amino acid types in tetrahedra to develop empirical four-body potentials [11, 33, 40, 46]. These four-body potentials complement fragmentbased methods [39] and pairwise potentials [32] to capture favorable or unfavorable packing interactions, which are difficult to efficiently incorporate into structure prediction codes. Motif detection What shapes or structure fragments occur frequently in proteins? The Voronoi diagram has been used to partition proteins into structural domains with minimal interaction between them [47]. Wako and Yamato use the Delaunay tessellation of C carbons and find patterns of the backbone sequence among neighboring tetrahedra to identify local motifs for helices and sheets [45]. Small changes in the coordinates chosen to represent atoms or residues can produce different sets of Delaunay simplices. We would like to identify the simplices that are almost Delaunay for a finite set of sites S . D E FI N I T I O N 2 . 1 . ( A L M O S T- D E L AU NAY ) A k -tuple Q S k is in the set of almost Delaunay simplices AD( ) iff, by perturbing each site of S by at most 0, the perturbed Q becomes a Delaunay k -simplex--the perturbed Q has an empty circumscribing sphere. We say that Q is almost Delaunay with threshold iff = lim inf { | Q AD( )}. Every k -tuple for 1 k d + 1 is an almost Delaunay simplex for some finite threshold; k -tuples with threshold 0 are Delaunay. In general, the sets of AD( ) do not form a simplicial complex--AD( ) includes the faces of its simplices, but not necessarily the intersections of simplices. Abellanas et al. [2] explored the robustness of a Delaunay triangulation in the plane, giving a complementary definition of tolerance as the supremum of all > 0 such that AD( ) contains only the Delaunay triangles. They used a minimum-width annulus problem, defined below, to find value of over the whole triangulation and for individual Delaunay edges. We generalize to s for all simplices, and to higher dimensions. Other generalizations of Voronoi and Delaunay diagrams have been studied extensively in computational geometry. Order-k Voronoi diagrams [9, 13, 12] are a generalization that uses the regions with the same set of k -closest neighbors. Gudmundsson et al.[21] have studied the related order-k Delaunay triangulations, which include the tetrahedra whose circumspheres can become empty if k - 4 points are deleted. Although this definition is easier to analyze and implement, it is less natural in our analysis of perturbations. Points that require a large perturbation to remove from a circumsphere can be deleted all too easily. Perturbation is most often studied to put points in general position, so that implementations of algorithms have fewer special cases to handle. For example, the regions of the Delaunay tessellation are simplices only if the points are in general position (e.g. no five points are cospherical). Thus, codes for computing the Delaunay usually enforce general position by a random perturbation [8] or symbolic perturbation [19], but otherwise take the input data as exact. Papers that do consider computation from inexact input usually search for only one output that is consistent with the input [22]. When one consistent output exists, then there may be exponentially many--even in a 2D Delaunay triangulation, two rows of grid points can be perturbed to form 2(n-1)/2 triangulations by selecting which diagonals of the grid squares to draw. This is why we define almost-Delaunay simplices instead of almost-Delaunay tessellations--we know that the number of simplices is less than nd+1 , which helps keep the computation feasible. 3 Almost-Delaunay as an annulus problem The computation of almost-Delaunay thresholds can be related to a minimum-width annulus problem from computational metrology. In the plane, a set of points is tested for roundness by the smallest width between two concentric circles that form an annulus enclosing all the points. This roundness problem has been studied extensively [18, 37, 38]. Solutions are known for two and higher dimensions, and for various special cases [3, 15, 31, 35]. We now review the definition of an annulus in d dimensions, and define some useful notation. D E FI N I T I O N 3 . 1 . (d- A N N U L U S ) A d-annulus is a set {p d : r dist (p, c) R}, defined by its center c, its inner and outer radii, r and R, and its width, w = R - r. We allow the center to be a point at infinity, in which case the inner and outer radii are infinite and the annulus is actually a slab of width w defined by two parallel hyperplanes [41]. Finding the almost-Delaunay threshold for a k -simplex can be formulated as a variant of the problem of finding an annulus of minimum width: T H E O R E M 3 . 1 . ( A L M O S T- D E L AU NAY ) Given finite sets of points Q S d, let A denote the d-annulus of minimum width that contains Q and whose inner hypersphere is empty of points from S . Then width (A) = 2 if and only if Q is almost-Delaunay with threshold . 411 Proof. Suppose that Q can be enclosed by an annulus A with center c, inner radius r, and width 2 so that the inner (hyper)sphere is empty of S . As in Figure 1, every point of A is within of the medial (hyper)sphere, M , having center c and radius r + . Thus, there is a perturbation of Q onto M and S outside of M showing that Figure 1: an AD( ) triangle Q AD( ). If simplex Q AD( ), then we can bound the annulus width as follows. By definition, we can perturb the points S to S so that the perturbed simplex Q lies on an empty (hyper)sphere M . Note that each point of S is within distance of its corresponding point S , so if we construct the annulus A with the same center as M and outer and inner spheres offset by ± , then the inner sphere of A is empty of S because M was empty of S , and outer sphere of A contains Q because M contained Q . Thus, Q is in AD( ) iff there is an annulus of width 2 , and the minimum annulus width equals the threshold of Q. 3 T.1 Configurations that determine As a corollary, we have a variant of the "cone condition" of [20] to further characterize the center c. T H E O R E M 3 . 2 . ( C O N E C O N D I T I O N ) From a finite center c of a minimum-width annulus associated with simplex Q S , no cone separates C N (c) from F NQ (c). From an infinite center c, no plane separates C N (c) from F NQ (c) and any cone (that is, cylinder) that separates them has C N (c) inside. Proof. For a finite point c, if there is a separating cone, then one of the directions along the axis of the cone has angles to sites of F NQ (c) smaller than those of C N (c). Lemma 3.1 says that the directional derivative is negative, so that c cannot be the center of a minimum-width annulus. For an infinite point c, if there is a separating cylinder with C N (c) outside, then moving c to a finite point produces an annulus with smaller width. If there is a separating plane, then moving c on the plane at infinity decreases the width W between the parallel planes. e can now count the points needed to define a minimumwidth annulus. T H E O R E M 3 . 3 . ( N U M B E R O F P O I N T S ) Suppose that the k simplex Q d has a minimum-width annulus A with center c. Then 2 |F NQ (c)| min(d, |Q|), and |C N (c)| + |F NQ (c)| d + 2 for finite c, and d + 1 for c at , with equality holding in general position. Proof. A single closest or furthest point can always be trivially separated from the other points by a cone with vertex at c enclosing just that point, so nF = |F NQ (c)| > 1. Also, nC = |C N (c)| 1, with equality only when center c is at infinity. To bound the sum nF + nC , consider the relatively open Voronoi regions for F NQ (c) and C N (c), which are regions of dimensions d + 1 - nF and d + 1 - nC , equality holding in general position. Lemma 3.1 shows that the minimum width occurs for a point on the boundary of the intersection--if the intersection contains more than a single finite point c, then the minimum must occur at infinity, because moving to a finite boundary will take the point out of C N (c) or F NQ (c)--one of these sets will gain an additional site. The dimension of the intersection is at least d - (d + 1 - nF ) - (d + 1 - nC ) = nF + nC - d - 2. Thus, when c is finite, we have 0 nF + nC - d - 2, or nF + nC d + 2, and when c is infinite, we have 1 nF + nC - d - 2, or nF + nC d + 1. R he threshold for an almost-Delaunay simplex Q S is determined by properties of its associated minimumwidth annulus. These properties, and their derivations, are similar to those for minimum-width annuli enclosing all points [20, 41]. Recall that the Voronoi diagram is the partition of space into maximally connected regions with the same set of closest neighbor sites. The furthest-point Voronoi diagram is the partition of space into maximally connected regions with the same set of furthest neighbor sites [5, 9, 13, 34]. Modifying notation of [20], any candidate center point c has a set C N (c) of closest neighbors in the Voronoi of S , and a set F NQ (c) Q of furthest neighbors in the furthest-point Voronoi of Q. Note that c is in the intersection of closest- and furthest-point Voronoi regions corresponding to sets C N and F NQ . An easy technical lemma establishes a geometric condition on the directions of motion that decrease the difference in distances to two points. L E M M A 3 . 1 . Consider two points p and q , and the difference in their distances to the origin, d(p, 0) - d(q , 0). Let emark: The case with c = and |C N (c)| = 1 occurs in v be a unit vector, and and be its angles to p and q respectively. Suppose that we move the center c of a minimum- our variant on the annulus problem; Smid and Janardan [41] width annulus by v for some infinitesimal . The direc- had observed that it never defines the width of a minimumwidth annulus that encloses all points. tional derivative Finally, we prove a fact asserted in the previous section. = dd (p, v ) - d(q , v ) cos( ) - cos(), L E M M A 3 . 2 . ( S I M P L E X T H R E S H O L D ) The almost-Delaud nay threshold for a k -simplex in d is at least as high as which is negative if angle 0 < . that for each of the m-simplices that constitute it, m < k . 412 Proof. A perturbation that makes the k -simplex Delaunay creates an empty hypersphere containing all of the points of T the simplex after perturbation. he above properties immediately suggest a simplistic algons rithm: For each of the d+1 implices, maintain a minimum threshold seen, which is initially infinite. For all sets of d + 2 points and all 2 j d + 1, consider the annuli defined by j points on the outer hypersphere and d - j + 2 on the inner hypersphere. If the inner hypersphere of an annulus is empty of all other points, update the minimum threshold seen for each k -simplex nrom among the d + 2 points. This algorithm f · +d would take O( d+2 n) = O(nd+3 ) time. Since the points on the inner hypersphere form a Delaunay simplex, we can reduce this to O(nd+2 ) time for the d-simplices, and less for the lower-dimensional simplices. ns If we do want thresholds for all d+1 implices, then there is not much room to improve this algorithm; we'd expect to spend between n4 and n5 time in 3D. In our applications, however, we are interested in a subset of simplices, either those with small thresholds or those with short edges, or both. In the next section, therefore, we will sketch worstcase analyses primarily to help make the algorithm clear, but will concentrate on algorithms that have good practical performance, as demonstrated by experiments and by analysis under assumptions such as even or random distribution of points. Figure 3: The lifting technique, illustrated on a 1D axis. Edge length prune: For analyzing neighbor interac° tions in proteins, only pairs of points within about 10 A of each other are relevant. We set an edge length prune parameter to restrict the maximum length of any edge of the simplices whose AD thresholds we evaluate. Edge pruning also eliminates long Delaunay edges near the convex hull ° that are not really between neighbors. We use a 10 A prune by default. As illustrated in Figure 2, we find the threshold for each k -simplex that satisfies the prune requirement, starting with the 1-simplices (edges), and discard the simplices with threshold higher than the cutoff before proceeding to the next stage. For each remaining simplex, we use a lifting map to find the candidate centers. This makes the time and space dependent on the number of relevant simplices, rather than the total. 4.2 Computation of candidate centers by a lifting map 4 Algorithms for relevant thresholds In this section, we identify "relevant" parameters that are monotone increasing in simplex dimension, so that if the parameter for a simplex is too large, then the parameter for any simplex that has as a face will also be too large. Thus, we begin by computing almost-Delaunay edges, as in Figure 2. Through these and other geometric observations, we make practical improvements, and attempt to explain these improvements by analysis and experiments. We modify Brown's lifting technique [10, 13] to compute the Voronoi diagram of the points projected onto the furthestpoint Voronoi region of k furthest neighbors, but with distances based on the original points. We sketch the technique in d dimensions, and illustrate it in Figure 3 for points projected on a 1D axis. Output almostInput point sites The proof of Theorem 3.3 observes that k furthest neighDelaunay Simplices shor t Delaunay: bors define a region of dimension d - k + 1 in the furthestCompute shor t: Comp. Make s, tets Compute point Voronoi diagram in d, which is a subspace if we Par tition Delaunay for short nonfrom DT & threshold edges tessellation Del. edges AD() edges for s, tets choose as origin the centroid of the furthest neighbors. The long: >prune high: >cutoff high: >cutoff distance from the origin to a point p can be factored into Discarded Discarded Discarded components x = {x1 , . . . , x(d-k+1) } within this subspace Figure 2: Processing AD edges then AD simplices and h which is the Euclidean norm of all orthogonal com4.1 Parameters ponents, in an orthogonal dimension y . We construct tanThreshold cutoff: For analyzing robustness, we are primar- gents to the unit paraboloid y = x1 2 + . . . + xd-k+1 2 at ily interested in small perturbations, relative to interpoint dis- values x corresponding to each point p, then shift each tantance. We set a threshold cutoff (or cutoff ) and consider only gent "down" in the negative y direction by h2 . Now at any the simplices in AD( cutoff ). Equivalently, we discard any point x within the subspace, the distance in the y direction edge with AD threshold > cutoff , and, by Lemma 3.2, between the shifted tangent at point x and the paraboloid at also the simplices that use these edges. Unless otherwise x equals the square of the distance between x and p. ° specified, cutoff = 2.0 A in our protein experiments, where Using the above property, the intersection points of two ° the typical distance between C carbons is 5­6A. shifted tangents along the upper envelope of these tangents 413 give the vertices of the Voronoi diagram, which are the candidate centers. The intersection points can be computed in dual space by finding the lower convex hull of the dual points to these tangent planes. We use the Quickhull[8] implementation which runs in expected O(n log n) time in 2 and 3 dimensions. Lifting avoids higher-dimensional Voronoi diagrams that are hard to compute robustly, and the lifting implementation is easy and efficient for the dimension (3) and problem sizes (n=100­10,000) we are interested in. 4.3 Computing almost-Delaunay edges The pseudo-code for the algorithm is outlined below. · Enumerate all non-Delaunay edges, (i, j ). · For each non-Delaunay edge (i, j ) that is shorter than the edge length prune, do: 1. Compute the equation of P , the furthest-point Voronoi region corresponding to i and j . For example, in 3D this is the bisector plane of ij . 2. Find the intersections of the Voronoi diagram with P using a lifting map. These are candidate centers with i and j as furthest neighbors and the points of a Delaunay simplex as closest neighbors, and include Voronoi vertices at infinity. 3. Evaluate annulus width function at all candidate centers, and record the minimum annulus width . 4. Record the value of /2 as the AD threshold for edge (i, j ), if it is less than cutoff . · Report the edges with threshold values between 0 and cutoff as almost-Delaunay edges. Figure 4: Finding the AD threshold for 1D half-line constraint generated by r. pqr in 2D by adding a closer to the center than p and q . If rx is positive, and hence the slope of the tangent line is positive, this constraint is satisfied by all candidate centers to the right of the equidistance point of p,q and r, which we call a left cutoff. This is illustrated in Figure 4. Similarly if rx is negative, the slope of the tangent is negative and the constraint is satisfied by all candidate centers on the left of the equidistance point, which we call a right cutoff. 4.4.2 Almost-Delaunay triangles/tetrahedra in 3D 4.4 Computing almost-Delaunay k -simplices We can easily modify the algorithm for an AD edge in any dimension to find the minimum threshold ( ) when that edge is part of a k -simplex. The only constraint is that each candidate annulus should contain (not necessarily as closest or furthest neighbors) all the other points of the simplex. The AD threshold for the simplex is then the minimum constrained edge threshold over all its edges. Now we describe efficient ways of computing the AD threshold for a few special cases. 4.4.1 Almost-Delaunay triangles in 2D For AD edge (p, q ) of pqr, we will consider all candidate centers c for which the distance to p or q is greater than the distance to point r. The candidate centers are computed using the 1D lifting map of the points as discussed in Section 4.2. The distance to point r is given by the tangent line to the parabola y = x2 at rx , shifted down by ry 2 , where rx and ry are the on-axis and off-axis components of r. At the point where the shifted tangent line at r intersects the shifted horizontal tangent line at points p and q , all three points are equidistant. Thus there is a 1D half-line constraint on the candidate centers generated by the condition that point r be Given a tetrahedron pq rs, assume that its associated minimumwidth annulus has points p and q as furthest neighbors. From all the candidate annuli for edge pq , we need to consider only those which contain both the points r and s. The annulus centers are computed using a 2D lifting map constructed on the bisector plane of pq as described in Section 4.2. The condition that a point r be closer than p or q generates a 3D half-plane constraint on the annulus center, 2D when projected onto the bisector plane. Thus finding the minimum annulus containing p, q and r is equivalent to finding the minimum annulus for edge pq from among the candidate centers (including centers at infinity) that satisfy the half-plane constraint generated by r, and symmetrically for edges pr and q r. For tetrahedron pq rs, the two half-planes for r and s form a wedge on the bisector plane of pq , which gives us a region in which to search for centers of annuli containing the points r and s. The center of the minimum-width annulus for an edge in 3D always has 2 furthest neighbors, and either 3 closest neighbors for a finite center, or 2 closest neighghbors when the center is at (Theorem 3.3). There is a third possibility for finding the minimum threshold of an AD triangle, that the center lies on a Voronoi edge between two candidate centers, at its intersection with a half-plane constraint. This corresponds to three furthest neighbors (two for the bisector plane and one for the constraint) and two closest neighbors. Also the points on each constraint at ± correspond to slabs with three furthest neighbors and one closest neighbor, the one whose Voronoi region the point is in. For an AD tetrahedron, candidate centers of edge pq and Voronoi edges between them may contribute to the minimum-width annulus if they satisfy both half-plane constraints for points r and s. For efficiency, we separate the edges into two groups. Each edge that intersects only one constraint is completely on one side of the other constraint, so it suffices to test any one of its points against 414 the other constraint, a test whose result is already known from the AD triangles. Edges that intersect both constraints are valid if the intersection point of each constraint satisfies the other constraint, a more expensive but relatively infrequent test. Recall that we computed the AD threshold only for nonDelaunay edges, since for Delaunay edges it is zero by definition. In 3D there are AD triangles and tetrahedra with all edges Delaunay, and we enumerate and handle them separately for efficiency. O(n) for constant p and r. T his lemma is clearly not true for arbitrary data, as we can get O(n2 ) short edges if we take uniformly distributed data in a fixed volume, and arbitrarily increase n and the packing density. In the limit, each newly added point adds a number of short edges proportional to the number of points already existing, and this adds up to O(n2 ) edges. However, the lemma holds for close-packed protein data where two ° C s are a minimum of 3 A apart, and two sidechain centroids 4.5 Acceleration for applications ° are even further, say 5 A. The algorithm as given above for 3D takes O(n3 log n) time Increasing the edge length prune parameter allows to run in the worst case for just the AD edge computation, longer edges to exist on the outer sphere of the annulus, with (n2 ) for computing the Delaunay, O(n2 ) for enumerand in turn allows tetrahedra with larger thresholds, from ating short edges, and then O(n log n) for finding the candiLemma 3.2. As shown in Figure 5 for a typical protein with date centers for each short edge. Note that centers of some 180 residues, the maximum AD edge threshold at any minimum width annuli lie arbitrarily far from the midpoint edge length increases linearly with the edge length, which of the short edge, and thus we have to compute O(n log n) justifies picking a value for cutoff based on the value of edge candidate centers and may not search within a region near length prune. the edge. Addition of a third and fourth point to the annuli and computation of the constrained minimum with one and two half-plane constraints takes O(n) and O(n2 ) pime for nt each candidate center. Thus with n sites and 4 otential AD tetrahedra, testing all of them takes O(n5 log n) time. We can improve the algorithm in 3D to take O(n2 log n) time by exploiting properties of minimum-width annuli and assumptions made about the input data. We concentrate on practical improvements for the number of points found in proteins; further asymptotic improvement may be possible. 4.5.1 Effect of edge length pruning Figure 5: Scatter plot showing correlation of AD edge threshold with edge length for a protein with 185 points. Pruning long edges speeds up the algorithm by considering only AD edges both of whose points lie in a bounded spherical region, a bucket. Pruning needs to be done only once, and then the complexity of the rest of the algorithm is determined by the size of the set of pruned (short) edges. This step rndutces the search space of AD edges considerably e -- from 2 otal edges, we get O(n2 ) short edges in the worst case, but O(n) edges under the assumption that points have nearly uniform packing density. L E M M A 4 . 1 . ( N U M B E R O F S H O RT E D G E S ) The expected number of edges shorter than an edge length prune p in n points closely packed such that the closest distance between 2 points is 2r, is O(n · p3 ) for constant r, or O(n) for constant p and r. Proof. We use a volume packing argument: Represent each residue by a ball of radius r; balls pack within the volume of the protein without overlap. Denote the sphere of radius p around residue A as its prune volume; all other residues Bi whose centers lie within this sphere would form short edges ABi . There can be a maximum of 0.78(p + r)3 /r3 spheres within the prune volume, since spheres in close packing can fill up at most 78% of the volume they occupy. Thus the total number of short edges is O(n · p3 ) for constant r, or 4.5.2 Angle pruning Using the properties from Section 3.1, we get conditions for valid Voronoi edges and valid configurations of points in 2D and 3D, which can be used to prune away candidate centers that cannot give minimum-width annuli. In 2D, if the furthest neighbors are p and q for a center c, exactly one of the two closest neighbors is inside the triangle cpq, or within the cylinder from c at infinity bounded by p and q . In 3D, the cone condition can be checked just like 2D once we fix the axis of the cone as the angle bisector of two furthest or closest neighbors, by rotating all other points into the same plane. In experiments on 3D protein data, we found that between 80 and 85% of candidate centers get eliminated by angle pruning. But we have to test every candidate center, and the pruning computation for a center needs more work than just evaluating the annulus width. Thus, angle pruning is useful only if we can use its results to prune the candidate centers and edges between centers for computation of AD triangles and tetrahedra. This is easy to see for the centers; 415 for the edges we keep them if either endpoint satisfies the angle prune conditions, or is closer to the origin than an empirically determined distance of prune + cutoff 2 . For both centers and edges, we can apply angle pruning in conjunction with threshold cutoff, and results are shown in Figure 6 for a cutoff of 1.0 for a test dataset of 440 proteins, and tabulated later in Section 4.7. shows the expected number of k -faces in the Voronoi diagram of n independent, uniformly distributed sites in d to be O(n). We can apply this result to our computation of the lifting map if we assume the projections of our data points to be uniformly distributed, and thus can bound the number of candidate centers per short edge and the number of edges connecting these centers by O(n). After angle pruning and applying the threshold cutoff, the number of candidate centers and edges is still O(n), though it may be sublinear as we shall see in Section 4.7. Now finding all the AD triangle and tetrahedron thresholds involves checking the threshold of O(n) valid candidate centers and edges against O(1) half-plane constraints, and doing this for O(n) short edges. Thus we get the time complexity as O(n2 ). 4 .7 Practical performance Figure 6: Angle pruning of candidate centers and our edge pruning heuristic greatly reduce the number of centers and edges considered for each AD triangle and tetrahedron. 4.6 Time complexity with assumptions about data Edge-length pruning speeds up the AD edge algorithm by a factor of n to run in O(n2 log n) expected time on typical protein data, as proved in Lemma 4.1. We shall now argue that the AD triangle and tetrahedron algorithms run in expected O(n2 ) time after the edges have been processed, for an algorithm that is O(n2 log n) overall, and scales very well from edges to higher dimensional simplices. L E M M A 4 . 2 . The expected time taken to check all the candidates for AD triangles and tetrahedra, once the AD edges are known, in n points closely packed such that the closest distance between 2 points is 2r, and the edge length prune is p, is O(n2 ) for constant p and r. Proof. Since each point has O(1) short edges to its neighbors (Lemma 4.1), each short edge forms a constant number of triangles and tetrahedra with all short edges. For each triangle there is one half-plane constraint and for each tetrahedron there are two half-plane constraints induced on the candidate centers and edges. Previous work by Dwyer [16, 17] To show that the practical behavior is not worst-case, but fits the theoretical bounds based on assumptions about our data, we estimated the performance of the algorithm in terms of the number of simplices checked at each stage and the output size, fitted using regression as shown in Table 1. Our dataset comprised the 440 protein chains, represented by C carbons, with chain length (n) between 80 and 256, from our training set of 1100 proteins. This training set was selected as mentioned in [23], to span a variety of different folds and families. From these results we may determine the practical time complexity ­ the number of short edges, O(n), times the O(n log n) lifting map computation gives the complexity for AD edges. For AD triangles and tetrahedra, performance measures are more naturally expressed in terms of s (the number of short non-Delaunay edges that pass the cutoff) which is O(n). Angle pruning is instrumental in making the practical case more efficient than the theoretical upper bound; the number of candidates per short edge is almost constant ( 7 1 s 6 ) after angle pruning, which leads to O(s 6 ) time taken. The number of AD triangles and tetrahedra tested and output is O(s), which corresponds to O(1) simplices per short edge. Thus the algorithm has output size and memory requirements linear in the number of points for typical edge length prune and cutoff values. Our MATLAB implementation computes the AD tetrahedra for protein data in 3D with 100­ 1000 points in a few seconds to two minutes on a 2.0GHz computer. 5 Delaunay probability In the previous sections we discussed one new method for estimating nearest neighbors in sets of imprecise points. Here we introduce another notion that is useful in describing random perturbations of points. The almost-Delaunay simplices 416 Quantity Delaunay edges Short D edges Short non-D edges % that pass cutoff (# that pass = s) Candidate centers, pre-angle pruning Edges between ctrs, pre-angle pruning Candidate centers, post-angle pruning Edges between ctrs, post-angle pruning AD triangles tested AD tetrahedra tested AD triangles output AD tetrahedra output Regression Fit ± Residual ° ° cutoff =1.0 A cutoff =0.1 A 7.1n - 92 ± 30 5.7n - 60 ± 60 2.9n - 73 ± 100 98 ± 2% 1.87n 0.78 C (d, r) (5.1) = 12 ± 3% + 14 ± 5 +e 2(d - r) 2(d + r) - 1 erf rf 2 a a . e -4(d+r )2 -4(d-r )2 a a2 - e a2 4d 2.12n0.83 + 25 ± 5 7.1s0.16+ 0.5 ± 1 6.7s0.16+ 2.7 ± 2 8s - 260 ± 300 21s - 1300 ± 1500 5s - 39 ± 100 8.5s - 180 ± 300 8.9s0.16 ± 1.5 8.9s0.16+ 0.5 ± 2 4.4s - 5.9 ± 20 6.2s - 18 ± 60 3.7s + 1.3 ± 20 4.6s + 0.65 ± 40 Given positions of all sites, s1 , s2 , . . . , sn , we could obtain the probability that {s1 , s2 , s3 , s4 } form a Delaunay tetrahedron if we could integrate, over each sphere that these sites can define, the probability of the sphere times the probability that it is empty. This integral cannot be calculated in closed form, but we can perform a Monte Carlo integration by generating sites s1 through s4 from the Gaussian distribution and using (5.1) to compute the probability that their circumsphere is empty. Delaunay log-prob. vs. AD tet. e for 1gab, 53 residues, a=0.1, prune=10 0 10 log10(probability of Delaunay tetra) 10 log10(probability of Delaunay tetra) 0 Delaunay log-prob. vs. AD tet. e for 1gab, 53 residues, a=0.5, prune=10 10 -1 10 -1 Table 1: Performance measures of the AD algorithm over 440 proteins as functions of number of points (n) or number of short nonDelaunay edges that pass cutoff (s) obtained from regression, along with 99th -percentile residuals. Candidate centers and edges are per short edge, while other quantities are per point set. Measurements ° were made for two typical values of cutoff , 1.0 and 0.1 A, and ° edges longer than 10 A were pruned. 10 -2 10 -2 10 -3 10 -3 10 -4 0 0.05 0.1 0.15 0.2 10 -4 AD Threshold 0 0.2 AD Threshold 0.4 0.6 0.8 1 Figure 7: Tetrahedron probability vs. AD average radius a = 0.1 and 0.5. with perturbations of AD( ) capture in a sense the worst case change in the Delaunay triangulation due to some bounded and carefully orchestrated perturbation. The actual perturbations that occur are random, and will sometimes cause a simplex to become Delaunay, and at other times will not. Here we characterize random perturbations; the analysis and algorithm are given for tetrahedra, though the concept holds in arbitrary dimensions. If we assume a particular distribution of error in the coordinates, it is possible to calculate the probability that a particular tetrahedron is Delaunay. Assume that the probability density function (pdf) has radial symmetry so that it can be expressed in spherical coordinates as function of the raa dius, f (r). Such a pdf must satisfy 0 4 x2 f (x)dx = 1; examples include the Gaussian distributi-ns with expected o . radius a given by fa (x) = 8/(a )3 exp (2x/a)2 / A sphere C of radius r at distance d from the origin will contain total density d +r C (d, r) = d-r Computation of the Delaunay probabilities suffers from the same problem as the na¨ve almost-Delaunay computai tion: there are roughly n4 tetrahedra to check, and each one requires significant computation. Fortunately, Figure 5 illustrates that the probability of a tetrahedron is inversely correlated with AD threshold. With the Gaussian distribution fa , the probability falls below 10-6 for AD(2a) tetrahedra. Thus, our efficient enumeration of almost-Delaunay tetrahedra makes it feasible to estimate probabilities for tetrahedra in proteins. Two confirmations of the accuracy of the Delaunay probability calculation: First, in a configuration whose convex hull contains the origin, the sum of the probabilities of all simplices containing the origin numerically converges to 1 as we increased the number of Monte Carlo trials. Second, in a large training set of 1100 proteins selected as detailed in [23], the sum of the Delaunay probabilities of all AD(0.3) tetrahedra converged to within 99­101% of the number of Delaunay tetrahedra at any value of edge length prune. x (d + r - x)(x - d + r) f (x)dx. d 6 Results from analysis of protein structure We studied the robustness of the Delaunay tessellation for analysis of protein packing interactions and finding motifs of secondary structure. We briefly report a few of our results, which are detailed in a companion paper [7]. 1. An -helix has a striking distribution of positive almost-Delaunay thresholds: three sharp peaks at = The probability that a point is outside sphere C is 1 - C (d, r), so if we have a fixed circle, we can compute the probability that it is empty. For the distribution fa , we can even give an expression in terms of the normal error function x erf (x) = 0 exp(-u2 )du: 417 0.3,0.7 and 1.2, which arise from the regular geometric pat- References tern. Individual C histograms also reveal the same peaks, which characterize residues in -helices. [1] Brookhaven protein data bank. http://www.rcsb.org. 2. Proteins represented by side-chain centroids produce [2] M. Abellanas, F. Hurtado, and P. A. Ramos. Structural tolerance and Delaunay triangulation. Information Processing fewer AD tetrahedra for a given cutoff and prune than those Letters, 71(5­6):221­227, 1999. represented by C s. 3. As point sets become more structured (i.e. protein- [3] P. K. Agarwal, B. Aronov, S. Har-Peled, and M. Sharir. Approximation and exact algorithms for minimum-width anlike), the number of AD tetrahedra decreases, particularly at nuli and shells. In Proc. 15th Annu. ACM Sympos. Comput. low threshold values. This indicates the relative stability of Geom., pages 380­389, 1999. the Delaunay for protein data. [4] B. Angelov, J. Sadoc, R. Jullien, A. Soyer, J. Mornon, and 4. Simplicial Neighborhood Analysis of Protein PackJ. Chomilier. Nonatomic solvent-driven Voronoi tessellation ing (SNAPP) scores protein structures using the likelihood of of proteins: an open tool to analyze protein folds. Proteins, neighboring four-tuples of residues from the Delaunay tes49(4):446­456, 2002. sellation of their side-chain centroids. We evaluated the sen- [5] F. Aurenhammer. Voronoi diagrams: A survey of a fundamensitivity of the SNAPP scores to a change in the Delaunay testal geometric data structure. ACM Comput. Surv., 23(3):345­ 405, 1991. sellation using AD tetrahedra weighted by Delaunay probability and unweighted. We found the modified scores equally [6] D. Bakowies and W. F. van Gunsteren. Water in protein cavities: A procedure to identify internal water and exchange patheffective at distinguishing proteins from decoys, which indiways and application to fatty acid-binding protein. Proteins, cates the robustness of Delaunay tessellation. 47(4):534­545, 2002. 5. We characterized the better inter-chain packing of the [7] D. Bandyopadhyay, A. Tropsha, and J. Snoeyink. Almostnative state than decoys (artificial structures) in terms of AD delaunay tetrahedra for analyzing protein structure, 2003. tetrahedra. submitted. http://www.cs.unc.edu/debug/papers/AlmDel. 6. Using the characteristic AD thresholds of the -helix, [8] C. B. Barber, D. P. Dobkin, and H. Huhdanpaa. The Quickwe filtered out the tetrahedra containing residues in helical hull algorithm for convex hulls. ACM Trans. Math. Softw., conformation, and were able to visualize and quantify the 22(4):469­483, 1996. -helical content of a protein to high accuracy. We also [9] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, UK, 1998. identified -sheets and -turns in a similar way. 7. We identified hinge regions in a protein undergoing [10] K. Q. Brown. Geometric transforms for fast geometric algorithms. PhD thesis, Carnegie­Mellon University, Pittsburgh, conformational change by comparing AD tetrahedra across Penn., 1980. many copies of the protein chain, grouping them based [11] C. W. Carter, B. C. LeFebvre, S. Cammer, A. Tropsha, and on number of Delaunay tetrahedra and maximum threshold M. H. Edgell. Four-body potentials reveal protein-specific across all the chains. 7 Future work We have introduced the tools of almost-Delaunay simplices and Delaunay probability to give worst-case and averagecase analysis of the Delaunay tessellation under perturbation of the input sites. We have proved properties of the AD simplices that lead to an efficient 3D implementation, and applied them to analyze protein data. We hope to use these and other tools in the future for a clearer understanding of nearest neighbors and packing interactions in imprecise data, and investigate Delaunay probability to improve results in GIS and meshing applications of Delaunay tessellation. [12] [13] [14] [15] Acknowledgements We thank the biogeometry research group at UNC, Alexander Tropsha and members of his lab for discussions, David O'Brien and Bala Krishnamoorthy for code to run experiments on protein data, and Robert-Paul Berretty for work on the annulus problem. [16] [17] [18] correlations to stability changes caused by hydrophobic core mutations. Journal of Molecular Biology, 311(4):625­638, 2001. B. Chazelle and H. Edelsbrunner. An improved algorithm for constructing kth-order Voronoi diagrams. IEEE Trans. Comput., C-36:1349­1354, 1987. M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer-Verlag, Berlin, Germany, 2nd edition, 2000. ` B. Delaunay. Sur la sphere vide. A la memoire de Georges Voronoi. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskih i Estestvennyh Nauk, 7:793­800, 1934. C. A. Duncan, M. T. Goodrich, and E. A. Ramos. Efficient approximation and optimization algorithms for computational metrology. In Proc. 8th ACM-SIAM Sympos. Discrete Algorithms, pages 121­130, 1997. R. A. Dwyer. Higher-dimensional voronoi diagrams in linear expected time. Discrete Comput. Geom., 6(4):343­367, 1991. R. A. Dwyer. The expected number of k-faces of a voronoi diagram. Computers Math Applications, 26(5):13­19, 1993. H. Ebara, N. Fukuyama, H. Nakano, and Y. Nakanishi. Roundness algorithms using the Voronoi diagrams. In Canadian Conference on Computational Geometry, page 41, 1989. 418 ¨ [19] H. Edelsbrunner and E. P. Mucke. Simulation of simplicity: A technique to cope with degenerate cases in geometric algorithms. ACM Trans. Graph., 9(1):66­104, 1990. [20] J. Garcia-Lopez, P. Ramos, and J. Snoeyink. Fitting a set of points by a circle. Discrete and Computational Geometry, 20:389­402, 1998. [21] J. Gudmundsson, M. Hammar, and M. J. van Kreveld. Higher order Delaunay triangulations. Comput. Geom.: Theory and Appl., 23(1):85­98, 2002. [22] L. J. Guibas, D. Salesin, and J. Stolfi. Epsilon geometry: building robust algorithms from imprecise computations. In Proc. 5th ACM Symp. Comp. Geom., pages 208­217, 1989. [23] B. Krishnamoorthy and A. Tropsha. Development of a fourbody statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics, 19(12), 2003. [24] A. R. Leach. Molecular Modelling: Principles and Applications, second edition. Pearson Education EMA, UK, 2001. [25] A. M. Lesk. Introduction to Protein Architecture - The Structural Biology of Proteins. Oxford University Press, UK, 2000. [26] J. Liang and K. A. Dill. Are proteins well-packed? Biophys J., 81(2):751­766, 2001. [27] J. Liang, H. Edelsbrunner, P. Fu, P. Sudhakar, and S. Subramaniam. Analytical shape computing of macromolecules II: identification and computation of inaccessible cavities inside proteins. Proteins, 33:18­29, 1998. [28] J. Liang, H. Edelsbrunner, and C. Woodward. Anatomy of protein pockets and cavities: Measurement of binding site geometry and implications for ligand design. Protein Science, 7:1884­1897, 1998. [29] Y. C. Liao, D. J. Lee, and B.-H. Chen. Description of multi-particle systems using Voronoi polyhedra. Powder Technology, 119(2­3):81­88, 2001. [30] B. McConkey, V. Sobolev, and M. Edelman. Quantification of protein surfaces, volumes and atom-atom contacts using a constrained Voronoi procedure. Bioinformatics, 18(10):1365­1373, 2002. [31] K. Mehlhorn, T. C. Shermer, and C.-K. Yap. A complete roundness classification procedure. In Proc. 13th Annu. ACM Sympos. Comput. Geom., pages 129­138, 1997. [32] S. Miyazawa and R. L. Jernigan. Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. Journal of Molecular Biology, 256:623­644, 1996. [33] P. J. Munson and R. K. Singh. Statistical significance of hierarchical multi-body potentials based on Delaunay tessellation and their application in sequence-structure alignment. Protein Sci, 6(7):1467­1481, 1997. [34] A. Okabe, B. Boots, and K. Sugihara. Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. John Wiley & Sons, Chichester, UK, 1992. [35] P. A. Ramos. Computing roundness is easy if the set is almost round. In Proc. 15th Annu. ACM Sympos. Comput. Geom., pages 307­315, 1999. [36] F. M. Richards. The interpretation of protein structures: total volume, group volume distributions, and packing density. J. Molecular Biology, 82:1­14, 1974. [37] T. Rivlin. Approximation by circles. Computing, 21:93­104, 1979. [38] U. Roy and X. Zhang. Establishment of a pair of concentric circles with the minimum radial separation for assessing roundness errors. Computer Aided Design, 24(3):161­168, 1992. [39] K. T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Journal of Molecular Biology, 268:209­ 225, 1997. [40] R. Singh, A. Tropsha, and I. Vaisman. Delaunay tessellation of proteins. J. Comput. Biol., 3:213­222, 1996. [41] M. Smid and R. Janardan. On the width and roundness of a set of points in the plane. In Canadian Conference on Computational Geometry, pages 193­198, Aug. 1995. [42] J. Tsai and M. Gerstein. Calculations of protein volumes: sensitivity analysis and parameter database. Bioinformatics, 18:985­995, 2002. [43] J. Tsai, R. Taylor, C. Chothia, and M. Gerstein. The packing density in proteins: Standard radii and volumes. Journal of Molecular Biology, 290(1):253­266, 1999. ` [44] G. M. Voronoi. Nouvelles applications des parametres conti` ´ ` ´ nus a la theorie des formes quadratiques. deuxieme Memoire: ´` Recherches sur les parallelloedres primitifs. J. Reine Angew. Math., 134:198­287, 1908. [45] H. Wako and T. Yamato. Novel method to detect a motif of local structures in different protein conformations. Protein Engineering, 11:981­990, 1998. [46] G. Weberndorfer, I. L. Hofacker, and P. F. Stadler. An efficient potential for protein sequence design. In Computer Science in Biology. Univ. Bielefeld, D. GCB'99 Proceedings, pages 107­ 112, Hannover, Germany, 1999. http://www.tbi.univie.ac.at/ papers/Abstracts/GCB026.ps.gz. [47] L. Wernisch, M. Hunting, and S. Wodak. Identification of structural domains in proteins by a graph heuristic. Proteins, 35(3):338­352, 1999. ¨ [48] R. Zimmer, M. Wohler, and R. Thiele. New scoring schemes for protein fold recognition based on Voronoi contacts. Bioinformatics, 14(3):295­308, 1998. 419