Navigating nets: Simple algorithms for proximity search

[Extended Abstract]

Robert Krauthgamer*        James R. Lee**

* IBM Almaden Research Center, 650 Harry Road, San Jose CA 95120, USA. Part of this work was done while this author was with the International Computer Science Institute and with the Computer Science Division of U.C. Berkeley. Email: robi@almaden.ibm.com
** Computer Science Division, U.C. Berkeley, Berkeley, CA 94720, USA. Part of this work was done while the author was at Microsoft Research (Redmond). Work partially supported by NSF grant CCR-0121555 and an NSF Graduate Research Fellowship. Email: jrl@cs.berkeley.edu

Abstract

We present a simple deterministic data structure for maintaining a set S of points in a general metric space, while supporting proximity search (nearest neighbor and range queries) and updates to S (insertions and deletions). Our data structure consists of a sequence of progressively finer ε-nets of S, with pointers that allow us to navigate easily from one scale to the next. We analyze the worst-case complexity of this data structure in terms of the "abstract dimensionality" of the metric S. Our data structure is extremely efficient for metrics of bounded dimension, and it is essentially optimal in a certain model of distance computation. Finally, as a special case, our approach improves over one recently devised by Karger and Ruhl [KR02].

1 Introduction

Nearest-neighbor search (NNS) is the problem of preprocessing a set S of n points lying in a huge (possibly infinite) metric space (M, d), so that given a query q ∈ M, one can efficiently locate the nearest point to q among the points in S. (The distance function d is assumed to be nonnegative and symmetric, and to satisfy the triangle inequality.) Computing such nearest neighbors efficiently is a classical and fundamental problem with numerous practical applications, including data compression, database queries, machine learning, computational biology, data mining, pattern recognition, and ad-hoc networks. A common feature of many of these applications is that comparing two elements is costly, hence the number of distance computations should be made as small as possible.

The general NNS problem has several flavors. For instance, the host metric space (M, d) may be application specific (for example, it could be the weighted edit distance between strings in a spell-checker or in genomic data), it may be infinite (e.g., Euclidean space), and it may even be so unstructured that it is practically unknown (as in a peer-to-peer network, where it represents the internode latencies). The space and time of the preprocessing stage may be constrained (e.g., to be polynomial or linear in n). The application at hand may require the data structure to be dynamic, in the sense that it should efficiently support the insertion and deletion of points from S. Additionally, it may be desirable for the data structure to be distributed among the data points, in which case other cost measures, like communication and load, become important. Finally, it may be the case that only approximate solutions are required: letting q be the query and a ∈ S the closest point to q in S, a (1+ε)-NNS algorithm is one which, given q, returns some s ∈ S such that d(q, s) ≤ (1+ε) d(q, a). Most previous research has focused on the important special case where M = R^d and distances are computed according to some ℓ_p norm.
While many types of data can be naturally represented in such a form, this is certainly not true for a significant number of applications, and it is therefore desirable to address NNS in general metric spaces. On the one hand, data structures for general metrics might perform a nearest neighbor query in time as poor as Ω(n), which is unacceptable in practice. Such a dependence is inherent even when only approximate solutions are required. A well-known example is where S forms a uniform metric, so that the interpoint distances in S are all equal, providing essentially no information. (See Section 3 for a specification of the model and additional lower bounds.) On the other hand, metrics arising in practice often have a nicer structure which can be exploited algorithmically.

Given this state of affairs, an increasing amount of recent attention has focused on understanding the complexity of NNS in terms of a metric's implicit structure. In Euclidean spaces, a natural and common measure for the metric's complexity is the dimension of the Euclidean host space. Indeed, the efficiency of most algorithms depends on this dimension. In fact, the running time of many algorithms grows exponentially with the dimension, a phenomenon called the curse of dimensionality (notable exceptions are those of [IM98, KOR98]). It is thus only natural to try and define an analogous notion of dimensionality for abstract metric spaces. Our first contribution is conceptual: we put forward (in the context of NNS) a natural notion of abstract dimension, which is based on one of Assouad [Ass83] (from the theory of analysis on metric spaces). This is corroborated by a technical contribution: we provide an efficient and dynamic data structure for NNS in general low-dimensional metrics. Other notions of dimensionality were suggested and studied in the context of NNS in general metric spaces [Cla99, KR02], and our scheme compares favorably with them. Furthermore, our scheme is comparable to the best one known for low-dimensional Euclidean space (with linear preprocessing space) [AMN+98], despite the fact that our approach is general and, in particular, independent of the Euclidean geometry! See Section 1.3 for a more detailed account.

1.1 Abstract dimension. The doubling dimension of a metric space (X, d), denoted in this paper by dim(X), is the minimum value ρ such that every set in X can be covered by 2^ρ sets of half the diameter. (The diameter of a set S ⊆ X is sup{d(x, y) : x, y ∈ S}.) As usual, define the (closed) ball of radius r about x in S ⊆ X to be B_S(x, r) = {y ∈ S : d(x, y) ≤ r}; we may omit the subscript S when it is clear from the context. It is not difficult to see that the minimum value ρ such that every ball in X can be covered by 2^ρ balls of half the radius is within a factor of two of the above definition; this is the value we work with throughout. A metric space is called doubling if its dimension is O(1). For example, a uniform metric on k points has dimension log k. An approximate converse also holds (see Lemma 1.2 below), showing that a small doubling dimension quantifies the lack of a large nearly-uniform metric. In other words, this notion of dimension measures the "volume growth" of X. We note that Clarkson [Cla99] used a similar notion (under a different name) in the context of NNS, but his results do not seem to unleash its full power. (See Section 1.3 for further details.)

The doubling dimension has several natural and appealing properties. Let (X, d) be an arbitrary metric space; here are some of the simpler ones (see, e.g., [Hei01]). A small computational illustration of the definition follows the list.

1. For X = R^k equipped with any norm, dim(X) = Θ(k).
2. If X is a subspace of Y, then dim(X) ≤ dim(Y).
3. dim(X_1 ∪ ··· ∪ X_m) ≤ max_i dim(X_i) + log m.
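To make the ball-covering variant of the definition concrete, here is a minimal Python sketch (our own illustration, not part of the paper's data structure): it greedily covers each ball B(x, r) by balls of radius r/2 centered at points of the set, and returns log2 of the largest cover size encountered. Greedy covering only upper bounds the optimal cover, and the cover centers are restricted to the point set, so this is a heuristic estimate of the doubling dimension, suitable for small point sets.

    import math

    def greedy_half_cover(points, dist, center, r):
        """Cover B(center, r) greedily by balls of radius r/2 whose centers
        are points of the ball itself; return the number of balls used."""
        uncovered = {p for p in points if dist(center, p) <= r}
        count = 0
        while uncovered:
            c = uncovered.pop()  # next half-radius ball center
            uncovered = {p for p in uncovered if dist(c, p) > r / 2}
            count += 1
        return count

    def doubling_dim_estimate(points, dist):
        """log2 of the worst greedy cover size over all balls: an estimate
        of the ball version of the doubling dimension."""
        radii = {dist(x, y) for x in points for y in points if x != y}
        worst = max(greedy_half_cover(points, dist, x, r)
                    for x in points for r in radii)
        return math.log2(worst)

    # Example: an 8x8 grid under the l1 norm; the estimate is a small
    # constant, consistent with property 1 above (dim(R^2) = Theta(2)).
    grid = [(i, j) for i in range(8) for j in range(8)]
    print(doubling_dim_estimate(
        grid, lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])))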
It has been noted in [KR02] that doubling metrics (those of uniformly bounded dimension) occur naturally in practical applications like peer-to-peer networks. In feature recognition problems, the data set S is often contained in the union of m low-dimensional manifolds lying in some very high-dimensional space R^L, and the distance function used is a norm of R^L (usually the ℓ1 or ℓ2 norm). For instance, a manifold may represent the feature vectors corresponding to different viewpoints of a single object; thus m objects give rise to a union of m manifolds. In this case, performing nearest neighbor search using only the structure of the high-dimensional host space would prove quite costly. The situation is made even more complicated by the fact that noise and errors arising in measurement actually yield a union of m sets, each of which is only close to a manifold. Fortunately, the doubling dimension is fairly insensitive to such small perturbations.

In [KR02], a different notion of dimension is defined (implicitly), which we denote by dim_KR(X). This is the least value k such that |B_X(x, 2r)| ≤ 2^k |B_X(x, r)| for every point x ∈ X and every r > 0. The following easy lemma is proved in [GKL03]; for completeness, we repeat the proof in the Appendix.

Lemma 1.1. ([GKL03]) Every finite metric (X, d) satisfies dim(X) ≤ 4 dim_KR(X).

The converse direction (a bound on dim_KR(X) in terms of dim(X)) does not hold. In other words, doubling metrics (those of bounded doubling dimension) form a strictly larger class than the metrics of bounded KR-dimension. The two notions are further compared in Section 1.3.
1.2 Our results. We provide a simple scheme for (1+ε)-NNS in general metrics. The data structure, described in Section 2, is deterministic and is thus guaranteed to answer (1+ε)-NNS queries correctly. Our data structure is dynamic: it supports insertions and deletions to S. It also supports range queries: one is given a query q ∈ M and a range t, and the goal is to return all points s ∈ S such that d(q, s) ≤ t. The complexity of these operations depends on the dimension of S and on the aspect ratio of (S, d), denoted Δ = Δ_S, which is defined as the ratio between the largest and smallest interpoint distances in S. In most applications, Δ = poly(|S|), hence log Δ = O(log n). If S is doubling, the data structure uses O(n) space; it answers (1+ε)-NNS queries in time O(log Δ) + (1/ε)^{O(1)}; and it implements insertions and deletions in time O(log Δ log log Δ). The running time depends exponentially on dim(S) (see Section 2), but in a distance-oracle model, where access to M is restricted to queries to the distance function d, this dependence is necessary, as we show in Section 3.

When the metric formed by the data set S together with the query point q (namely, S ∪ {q}) happens to belong to the more restricted class of metrics with bounded KR-dimension, then by a slight modification of our querying procedure (but not of the data structure) we are able to return exact nearest neighbors and to match or improve the bounds of [KR02]; see Section 1.3.

1.3 Related work. Several works (e.g., [Bri95]) demonstrate that the (concentration of the) histogram of distances between points in a metric space indicates (heuristically) its dimensionality. Chávez et al. [CNBYM01] suggest defining the dimension of a metric as ρ = µ²/(2σ²), where µ and σ² are the mean and variance of the histogram, and show that it affects, in a random instance, the efficiency of a certain nearest neighbor search algorithm. Faloutsos and Kamel [FK97] suggest measuring the fractal dimension of the data set, but this notion only applies to Euclidean spaces.

Clarkson [Cla99] devised two NNS data structures for metric spaces satisfying a certain sphere-packing property, which is equivalent to having O(1) doubling dimension. (See Table 1.) However, his data structures are randomized and not dynamic, the query time is superlogarithmic, and it is assumed that the data set S and the query q are drawn from the same (unknown) probability distribution. Our algorithms improve over [Cla99] in these respects, although they only guarantee (1+ε)-NNS.

More recently, Karger and Ruhl [KR02] suggested the notion of dimension discussed in the previous section. One justification for this notion is that the k-dimensional grid with the ℓ1 metric has KR-dimension Θ(k). Karger and Ruhl [KR02] show an efficient NNS data structure for metric spaces with bounded KR-dimension. (See Table 1.) Lemma 1.1 shows that the doubling dimension of every metric space is (essentially) not larger than its KR-dimension. Therefore, all our results for doubling metrics immediately hold for metrics with bounded KR-dimension. In addition, we can show that when run on the latter family of metrics, our algorithms can be (slightly) modified so as to find exact nearest neighbors, with running time that is similar to [KR02] but with linear (instead of near-linear) space. Our results improve over [KR02] in several respects. Besides extending to a larger family of metrics, they are deterministic and do not require any parameter, while those of [KR02] are randomized (Las Vegas), and their performance depends on a correct setting of the dimension parameter.

We take a moment to discuss the frailty of the notion of KR-dimension. In particular, as was pointed out in [KR02], subsets of metrics of bounded KR-dimension do not always have bounded KR-dimension themselves. For example, there are subsets of the real line which have unbounded KR-dimension (while certainly every subset of R has bounded doubling dimension). The reason for this is quite simple: the algorithms of [KR02] use random sampling in order to find nearest neighbors, and thus impose a certain uniformity on the layout of points in the metric space. Thus it is unclear whether their approach could be extended to the larger class of doubling metrics. To see a simple example of this instability, consider the discrete annulus S = {x ∈ Z : 2r > |x| > r}. It is not difficult to see that dim_KR(S) = O(1) (uniformly, for any value of r > 0). On the other hand, dim_KR(S ∪ {0}) = Ω(log r); thus the addition of a single point to S can cause the KR-dimension to grow arbitrarily.
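This instability is easy to observe numerically. The following sketch (our illustration; brute force, so suitable for small r only) computes the largest ratio |B(x, 2ρ)| / |B(x, ρ)| over all centers x and realized radii ρ, i.e., 2 raised to (an approximation of) dim_KR: the ratio stays a small constant for the annulus S, and grows roughly linearly in r once the origin is added.

    def kr_growth(points, dist):
        """Largest |B(x, 2*rho)| / |B(x, rho)| over all centers x and
        realized radii rho; dim_KR is the log2 of the sup of this quantity."""
        radii = {dist(x, y) for x in points for y in points if x != y}
        worst = 1.0
        for x in points:
            for rho in radii:
                small = sum(1 for y in points if dist(x, y) <= rho)
                big = sum(1 for y in points if dist(x, y) <= 2 * rho)
                worst = max(worst, big / small)
        return worst

    r = 32
    annulus = [x for x in range(-2 * r + 1, 2 * r) if r < abs(x) < 2 * r]
    d = lambda a, b: abs(a - b)
    print(kr_growth(annulus, d))        # stays O(1) as r grows
    print(kr_growth(annulus + [0], d))  # about (2r - 1)/3: B(0, rho) = {0}
                                        # for every rho <= r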
Perhaps most interestingly, our results are comparable to those of [AMN+98] for bounded-dimensional Euclidean metrics (which, of course, also have bounded doubling dimension); see Table 1. It is quite remarkable that similar running times are achievable using only a bound on the volume growth of point sets, and not the geometry of Euclidean space.

Table 1: Comparison of NNS schemes for general metrics.

                Dimension          Space                         NNS or (1+ε)-NNS             Insert/Delete
  [Cla99]       dim(S) = O(1)      O(n log Δ) (a)                O(log⁴ n · log Δ) (a)        -
  [Cla99]       dim(S) = O(1)      n (log n)^{O(log log Δ)} (b)  (log n)^{O(log log Δ)} (b)   -
  This paper    dim(S) = O(1)      O(n)                          O(log Δ) + (1/ε)^{O(1)}      O(log Δ log log Δ)
  [KR02]        dim_KR(S) = O(1)   O(n log n) (c)                O(log n) (c)                 O(log n log log n) (c)
  This paper    dim_KR(S) = O(1)   O(n)                          O(log n)                     O(log n log log n)
  [AMN+98]      O(1) Euclidean     O(n)                          (1/ε)^{O(1)} log n           O(log n)

  (a) Assuming there is a training set for the query.
  (b) Assuming the query comes from the same (unknown) distribution as the data set.
  (c) Randomized (Las Vegas) data structure.

1.4 Techniques. Let (X, d) be a metric space. We say that a subset Y ⊆ X is an ε-net of X if it satisfies: (1) for every x, y ∈ Y with x ≠ y, d(x, y) ≥ ε; and (2) X ⊆ ∪_{y∈Y} B(y, ε). Such nets always exist for any ε > 0. For finite metrics, they can be constructed greedily; for arbitrary metrics, the proof of their existence is an easy application of Zorn's lemma. This classical notion of an ε-net, which is a standard tool in the study of metric spaces (see, e.g., [Hei01]), should not be confused with a more recent notion (under the same name) from computational geometry. The following lemma will be key to our analysis.

Lemma 1.2. Let (S, d) be a metric space, and let Y ⊆ S. If the aspect ratio of the metric induced on Y is at most α, where α ≥ 2, then |Y| ≤ α^{O(dim(S))}.

Proof. Let dmin = inf{d(x, y) : x ≠ y ∈ Y} and dmax = sup{d(x, y) : x, y ∈ Y} be the minimum and maximum interpoint distances in Y, respectively, and assume that dmax/dmin ≤ α < ∞. Notice that Y is contained in a ball of radius 2 dmax ≤ 2α dmin in S (centered at any point of Y). Applying the definition of doubling dimension iteratively several times, we get that this ball, and in particular Y, can be covered by 2^{dim(S) · O(log α)} balls of radius dmin/3. Each of these balls can cover at most one point of Y (by the definition of dmin), and thus |Y| ≤ 2^{dim(S) · O(log α)} ≤ α^{O(dim(S))}. ∎
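The greedy construction for finite metrics takes a few lines of Python. This sketch (our illustration; the helper names are ours) is reused by the search sketch in the next subsection; note the standing assumption below that the minimum interpoint distance is 1, so that Y_0 comes out equal to S.

    def greedy_net(points, dist, eps):
        """Greedy eps-net: keep p whenever it is eps-far from all net points
        chosen so far. Separation (1) holds by the test; covering (2) holds
        because every rejected point saw a net point within distance eps."""
        net = []
        for p in points:
            if all(dist(p, y) >= eps for y in net):
                net.append(p)
        return net

    def scale_nets(points, dist, k):
        """Y_i = a 2**i-net of S for i = 0..k, as in the informal discussion
        below; with minimum interpoint distance 1, Y_0 equals S.
        (Section 2 will instead take Y_i to be a net of Y_{i-1}.)"""
        return [greedy_net(points, dist, 2.0 ** i) for i in range(k + 1)]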
A simplified 3-NNS algorithm. Here is a simplified version of our data structure. Let (S, d) be the metric space which is queried against, and for the sake of this informal discussion, assume that the minimum interpoint distance in S is min{d(x, y) : x ≠ y ∈ S} = 1. In this case, the aspect ratio Δ is just the diameter of S. In what follows, for a subset R ⊆ S and a point x ∈ M, we let d(x, R) = inf_{y∈R} d(x, y).

Let k = ⌈log Δ⌉, and for i = 0, 1, ..., k, let Y_i be a 2^i-net of S. Now, for every point y ∈ Y_i, suppose we have the set of points L_{y,i} = {z ∈ Y_{i−1} : d(y, z) ≤ γ · 2^i}, for some constant γ to be specified later. Notice that the set L_{y,i} has minimum interpoint distance 2^{i−1} (since this is the minimum distance in Y_{i−1}) and diameter at most γ · 2^{i+1}, hence its aspect ratio is bounded. When S is a doubling metric, Lemma 1.2 implies that |L_{y,i}| = O(1), where the constant depends on the choice of γ. Also, Y_k contains only one (arbitrary) point, which we denote by y_top.

Now, given a query point q ∈ M, we first set y = y_top, and then iteratively, for i = k, k−1, ..., we find the point in L_{y,i} ⊆ Y_{i−1} that is closest to q, and set y to be this point (for the next iteration). If, at some stage, we reach a point where d(q, L_{y,i}) > 3 · 2^{i−1}, then we stop and output y. Otherwise, we output the last value y ∈ Y_0.

First, notice that the running time of this algorithm is at most O(log Δ), since finding the closest point to q among a list L_{y,i} takes constant time (the lists are of size O(1)). Thus we need only argue that the output point y is a 3-approximation to the nearest neighbor a ∈ S, i.e., that d(q, y) ≤ 3 d(q, a). To this end, let j be such that d(q, y) ≤ 3 · 2^j but d(q, L_{y,j}) > 3 · 2^{j−1}, i.e., the step in which the distance does not decrease by a factor of 2. First, we argue that d(a, L_{y,j}) ≤ 2^{j−1}; in other words, the closest 2^{j−1}-net point to a is contained in L_{y,j}. Let y' ∈ Y_{j−1} be such that d(a, y') ≤ 2^{j−1}. We need to argue that d(y, y') ≤ γ · 2^j; in this case, we will have y' ∈ L_{y,j}. Since d(q, y) ≤ 3 · 2^j, we have

    d(y, y') ≤ d(y, a) + d(a, y') ≤ 2^{j−1} + d(a, q) + d(q, y) ≤ 2^{j−1} + 2 d(q, y) ≤ 7 · 2^j,

thus choosing γ = 7 suffices. This shows that the descent process "tracks" the closest net point to a. Now it is a simple matter to see that

    3 · 2^{j−1} < d(q, L_{y,j}) ≤ d(q, a) + d(a, L_{y,j}) ≤ d(q, a) + 2^{j−1},

so that d(q, a) > 2^j. Since d(q, y) ≤ 3 · 2^j, it follows that y is a 3-approximate nearest neighbor. Similar arguments show that if we end with y ∈ Y_0, then y is actually the closest point to q. Later, we will see that, after obtaining an O(1)-approximate nearest neighbor, the (1+ε)-NNS problem can be solved in time (1/ε)^{O(1)}.
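Under the assumptions of this discussion (minimum interpoint distance 1, γ = 7 as in the analysis above), the entire scheme is a short descent. Here is a sketch building on greedy_net/scale_nets above, with naive O(n²) preprocessing; it is our illustration of this informal variant, not the Section 2 structure.

    import math

    def build_lists(nets, dist, gamma=7.0):
        """L_{y,i} = points of Y_{i-1} within gamma * 2**i of y (naively)."""
        L = {}
        for i in range(1, len(nets)):
            for y in nets[i]:
                L[(y, i)] = [z for z in nets[i - 1]
                             if dist(y, z) <= gamma * 2 ** i]
        return L

    def three_nns(nets, L, dist, q):
        """Descend the nets; returns a 3-approximate nearest neighbor of q
        (exact when the descent reaches Y_0), as argued above."""
        k = len(nets) - 1
        y = nets[k][0]                     # y_top: Y_k is a single point
        for i in range(k, 0, -1):
            best = min(L[(y, i)], key=lambda z: dist(q, z))
            if dist(q, best) > 3 * 2 ** (i - 1):
                return y                   # d(q, L_{y,i}) too large: stop
            y = best
        return y

    # Example on 200 points of the integer line:
    S = [(x,) for x in range(200)]
    d = lambda a, b: abs(a[0] - b[0])
    k = int(math.log2(199)) + 1            # 2**k exceeds the diameter
    nets = scale_nets(S, d, k)
    L = build_lists(nets, d)
    print(three_nns(nets, L, d, (37.4,)))  # within factor 3 of the true NN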
The preceding simple algorithm illustrates the power of using nets at different scales to navigate through a metric space, roughly halving the distance from q to the closest point at each step. All the operations in our data structure are implemented inside this simple framework. To overcome some technical limitations of the above scheme, the actual data structure presented in Section 2 is more involved. First, the above algorithm has to be extended to (1+ε)-NNS, for any ε > 0. This can be achieved by roughly log(1/ε) more iterations of the same descent procedure, but at each iteration we now have to process more than just one point y. Secondly, in the presence of insertions and deletions, we cannot make the simplifying assumption that the minimum interpoint distance in S is 1, and thus we must find a way to keep track of the "relevant" scales. Finally, for a technical reason, the above data structure does not support efficient deletions and might require space Ω(n log Δ). The main remedy for these latter problems is to choose Y_i to be a 2^i-net of Y_{i−1} rather than of S.

2 Navigating nets

In this section we present a data structure that maintains a set S of points in a metric space, so as to support proximity queries and updates to S. The performance guarantees of this data structure (in terms of time and space) depend on the dimensionality of the data set S, as defined in Section 1. However, the data structure is deterministic, and it is guaranteed to be correct for any metric space, unconditionally. Neither the dimension of S nor ε (for (1+ε)-NNS) needs to be known in advance. We start by describing the data structure in Section 2.1 and analyzing its space requirement in Section 2.2. We then present the procedures for computing proximity queries in Sections 2.3 and 2.4, and those for updating S in Sections 2.5 and 2.6. Finally, we give improved bounds for KR metrics in Section 2.7.

We shall use the notation of Section 1, where (M, d) is the host metric space, S is the set of points to be maintained by the data structure, and n = |S|. Let dmax := sup{d(x, y) : x, y ∈ S} and dmin := inf{d(x, y) : x ≠ y ∈ S} denote the maximum and minimum interpoint distances in S, respectively, so that Δ := dmax/dmin is the aspect ratio of S.

2.1 The data structure. Let Γ = {2^i : i ∈ Z}, and let us call every value r ∈ Γ a scale. To simplify the exposition, we consider infinitely many scales, but it will be apparent that only O(log Δ) of them are "relevant" and that the remaining scales are trivial.

For every r ∈ Γ, let Y_r be an r-net of Y_{r/2}. As the base case, we define Y_r := S for all scales r ≤ dmin. For every r ∈ Γ and every y ∈ Y_r, the data structure stores a list of the points near y among the r/2-net Y_{r/2}. This scale-r navigation list of y is defined by

(2.1)    L_{y,r} := {z ∈ Y_{r/2} : d(z, y) ≤ γ · r},

where γ > 0 is a universal constant. We will see later that γ = 4 suffices for all the operations. While Y_r need not be an r-net of S, the following lemma shows that Y_r is a relaxed variant of an r-net of S.

Lemma 2.1. For every scale r ∈ Γ, we have:
1. Covering: d(z, Y_r) < 2r for every z ∈ S.
2. Packing: d(x, y) ≥ r for every x ≠ y ∈ Y_r.

Proof. The covering property follows by induction. The base case is r < dmin, for which Y_r = S and the desired property is trivial. For the inductive step, assume the property holds for scale r, i.e., for every z ∈ S there exists y ∈ Y_r such that d(z, y) < 2r. Since Y_{2r} is a 2r-net of Y_r, we get that d(z, Y_{2r}) ≤ d(z, y) + d(y, Y_{2r}) < 4r. The packing property follows directly from the fact that Y_r is an r-net of Y_{r/2}. ∎

The next lemma upper bounds the size of any navigation list. Its proof follows by observing that all the points in a list L_{y,r} are points of Y_{r/2} ⊆ S, so the distance between every two of them is at least r/2, while they all lie in a ball of radius γr; applying Lemma 1.2 yields the following.

Lemma 2.2. The size of every navigation list is at most 2^{O(dim(S))}.

Implementation. We now discuss the implementation of this data structure. First, the nets Y_r are not maintained explicitly, but rather follow implicitly from the lists L_{y,r}; i.e., Y_r is the set of all points y ∈ S for which L_{y,r} exists. Second, if S is nonempty, we have by Lemma 2.1 that for every scale r > dmax, the net Y_r consists of the same single point, which we denote y_top. The data structure maintains this point y_top and the cutoff scale rmax := min{r ∈ Γ : |Y_{r'}| = 1 for all scales r' ≥ r}, for the sake of bootstrapping most of the operations. Third, for all scales r ≤ dmin the net Y_r is equal to S, so for scales r ≤ dmin/2, every point x ∈ S has a trivial list L_{x,r} = {x}. These trivial lists can be represented succinctly by storing, for every x ∈ S, a scale r_x below which all the lists of x are trivial. For the sake of analysis, define rmin := min{r_x : x ∈ S}.

We can now upper bound the number of navigation lists that need to be stored for any point x ∈ S.

Lemma 2.3. rmax = Θ(dmax) and rmin = Θ(dmin), so every point has O(log Δ) non-trivial navigation lists.

Proof. Using Lemma 2.1 it is easy to see that rmax = Θ(dmax). It is straightforward from the definition (2.1) that r_x ≥ Ω(dmin) for every x ∈ S. In addition, Y_{rmin/2} = S, so by Lemma 2.1 (packing), dmin ≥ rmin/2. We conclude that a list for scale r needs to be stored only if Ω(dmin) ≤ r_x ≤ r ≤ rmax ≤ O(dmax). The lemma follows. ∎

Combining Lemmas 2.2 and 2.3 yields an upper bound of n · 2^{O(dim(S))} log Δ on the total space required for the data structure. We improve over this in Section 2.2, abolishing the log Δ factor via a more careful analysis.
When analyzing the performance of the data structure, we assume that the navigation lists are stored as follows. For each point x ∈ S, the non-trivial navigation lists of x are stored using, say, a balanced search tree, which requires linear space and implements find, insert and delete in logarithmic time. It then follows from Lemma 2.3 that, say, inserting a new navigation list for a point can be done in time O(log log Δ).

Remark. It is possible to speed up the (1+ε)-NNS procedure by letting each navigation list L_{x,r} contain not only pointers to the points y ∈ Y_{r/2}, but also pointers to their corresponding navigation lists L_{y,r/2}. Since the latter navigation list might be trivial (and not stored explicitly), we shall actually store a pointer to the navigation list L_{y,r'}, where r' is the largest scale for which r' ≤ r/2 and L_{y,r'} is non-trivial. In order to update these pointers in the presence of insertions and deletions, they must be implemented as bidirectional pointers.

2.2 Space requirement. We now show that for doubling metrics S, the total space used by our data structure is O(n).

Theorem 2.1. The size of the data structure is 2^{O(dim(S))} · n words.

Proof. Recall that trivial navigation lists L_{x,r} = {x} are represented implicitly. Hence, the total space used by our implementation of the data structure is linear in the space required to represent all the non-trivial navigation lists. The size of each such list is at most 2^{O(dim(S))} by Lemma 2.2, and thus it suffices to show that the number of non-trivial navigation lists is O(n). For simplicity in what follows, we shall assume that γ is a power of 2.

We bound the number of non-trivial navigation lists by a charging argument against the points of S. Each such list is charged against a point of S as follows. A non-trivial list L_{x,r} must contain at least one point y ≠ x. By definition, x, y ∈ Y_{r/2}, and thus d(x, y) ≥ r/2. Furthermore, d(x, y) ≤ γr, and thus Y_{2γr} cannot contain both x and y. We charge the list L_{x,r} to the point z ∈ {x, y} which is not contained in Y_{2γr}.

It remains to bound the number of navigation lists that are charged to any single point z ∈ S. Consider a non-trivial scale-r list that is charged against z. This list must contain z and also another point z', such that the list is either L_{z,r} or L_{z',r}. Let r̄ = r̄(z) be the largest scale for which z ∈ Y_{r̄}. The charging scheme guarantees that z ∈ Y_{r/2} but z ∉ Y_{2γr}; hence r̄/(2γ) < r ≤ 2r̄, and once z is fixed, r can take only O(1) distinct values (recall that γ is a constant). Furthermore, z and z' belong to the same scale-r list, and thus r/2 ≤ d(z, z') ≤ γr. It follows that once z and r are fixed, there are only 2^{O(dim(S))} possible points z' (Lemma 1.2). As mentioned before, once the triple z, z', r is fixed, there are only two possible navigation lists (L_{z,r} and L_{z',r}). We conclude that the number of lists charged to any single point z is 2^{O(dim(S))}, and the total number of non-trivial lists is indeed 2^{O(dim(S))} · n. ∎
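As a concrete reference point, the stored state might be organized as follows. This is a minimal sketch under our own naming (a plain dict stands in for the balanced search tree, which is what the O(log log Δ) access bound above refers to); the query and update sketches later in this section are written against it.

    class NavigatingNets:
        """Skeleton of the stored state. The nets Y_r are implicit: Y_{2**i}
        is the set of points x whose lists[x] has an entry at scale index i."""
        GAMMA = 4                # the universal constant of definition (2.1)

        def __init__(self, dist):
            self.dist = dist
            self.lists = {}      # lists[x]: scale index i -> L_{x, 2**i};
                                 # only non-trivial lists are stored
                                 # (Lemma 2.3: O(log Delta) of them per point)
            self.r_x = {}        # r_x[x]: scale below which every list of x
                                 # is the trivial {x}
            self.ytop = None     # the unique net point at scales >= rmax
            self.i_max = None    # index of the cutoff scale rmax = 2**i_max
            self.i_min = None    # index of the scale rmin = min over x of r_x

        def navigation_list(self, x, i):
            """L_{x, 2**i}; trivial lists {x} are materialized on demand."""
            return self.lists.get(x, {}).get(i, [x])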
2.3 Approximate nearest neighbor query. We now present a procedure that uses the above data structure to find a (1+ε)-approximate nearest neighbor for a query point. That is, given a point q ∈ M and a value ε > 0, the procedure finds a point p ∈ S with d(q, p) ≤ (1+ε) d(q, S). We stress that ε is arbitrary, both in the sense that the data structure is oblivious to ε, and in the sense that ε need not be small; for instance, setting ε = 1 we can find a 2-approximate nearest neighbor extremely fast. Choosing a smaller ε trades speed for accuracy; the precise result is as follows.

Theorem 2.2. A (1+ε)-approximate nearest neighbor among S can be computed using the data structure in time 2^{O(dim(S))} log Δ + (1/ε)^{O(dim(S))}. This bounds, in particular, the number of distance computations.

To prove this theorem we next present Procedure Approx-NNS. We prove its correctness in Lemma 2.5 and analyze its running time in Lemma 2.6. We shall make use of the following definition.

Definition. A set Z_r ⊆ Y_r is called non-proper if it contains only one point, x, and r_x > r. Otherwise, Z_r is proper.

The algorithm. We now outline Procedure Approx-NNS; the full description is given in Figure 1. The procedure iterates over the nets Y_r, starting with the largest (non-trivial) scale r = rmax (line 1), and iteratively decreasing r by a factor of 2 (lines 2-4). At each iteration, the goal is to construct a subset Z_{r/2} ⊆ Y_{r/2} containing the points of Y_{r/2} that are near q, as formalized in Lemma 2.4. This is done efficiently by exploiting Z_r from the previous iteration and the navigation lists of the corresponding points (line 3). The crux is that Z_r then contains a point of S whose distance from q is at most d(q, S) + 2r. The iterations continue roughly until r ≈ ε · d(q, S) (line 2), and then we simply report the closest point to q among Z_r (line 5).

    Procedure Approx-NNS (Input: q ∈ M and ε > 0).
    1. Set r = rmax and Z_r = {y_top}.
    2. While 2r(1 + 1/ε) > d(q, Z_r) and Z_r is proper, do:
    3.     Set Z_{r/2} = {y ∈ ∪_{z ∈ Z_r} L_{z,r} : d(q, y) ≤ d(q, Z_r) + r}.
    4.     Set r = r/2.
    5. Return the point z ∈ Z_r for which d(q, z) is minimal.

    Figure 1: Approximate nearest neighbor procedure.
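A direct transcription of Figure 1 against the NavigatingNets skeleton above (a sketch: d(q, Z_r) is computed by brute force over the small set Z_r, ε must be positive, and Z_r stays nonempty by Lemma 2.4, assuming a correctly built structure):

    def approx_nns(ds, q, eps):
        """Figure 1: returns p in S with d(q, p) <= (1 + eps) * d(q, S)."""
        d = ds.dist
        i, Z = ds.i_max, [ds.ytop]                   # line 1: r = rmax
        while True:
            r = 2.0 ** i
            d_qZ = min(d(q, z) for z in Z)
            non_proper = len(Z) == 1 and ds.r_x.get(Z[0], 0.0) > r
            if 2 * r * (1 + 1 / eps) <= d_qZ or non_proper:  # line 2: stop?
                break
            Z = [y for y in {y for z in Z for y in ds.navigation_list(z, i)}
                 if d(q, y) <= d_qZ + r]             # line 3: Z_{r/2}
            i -= 1                                   # line 4: r = r/2
        return min(Z, key=lambda z: d(q, z))         # line 5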
The analysis. By induction, Z_r ⊆ Y_r for every set Z_r computed by this procedure, and thus L_{z,r} in line 3 is well-defined.

Lemma 2.4. Let a be a closest point to q among S. Then every set Z_r computed by Procedure Approx-NNS contains a point z_r with d(a, z_r) ≤ 2r.

Proof. Proceed by induction on r. For the base case r = rmax, line 1 sets Z_{rmax} = {y_top} = Y_{rmax}, and thus by Lemma 2.1 we indeed have d(a, Z_{rmax}) ≤ 2 rmax. For the inductive step, consider the construction of Z_{r/2} in line 3, assuming that Z_r satisfies the induction hypothesis, i.e., Z_r contains a point z with d(z, a) ≤ 2r. By Lemma 2.1, Y_{r/2} contains a point y with d(a, y) ≤ r, so d(z, y) ≤ d(z, a) + d(a, y) ≤ 2r + r = 3r, and thus y ∈ L_{z,r} (y appears in the navigation list of z, since 3r ≤ γr). To complete the proof, it suffices to show that y is included in Z_{r/2}. Indeed, observe that when Z_{r/2} is constructed in line 3, d(q, y) ≤ d(q, a) + d(a, y) ≤ d(q, Z_r) + r, so Z_{r/2} will include y. ∎

Lemma 2.5. Procedure Approx-NNS outputs a point whose distance from q is at most (1+ε) d(q, S).

Proof. Let r' be the value of r at the end of the procedure. It suffices to prove that d(q, Z_{r'}) ≤ (1+ε) d(q, S), because then the lemma follows from the choice made in line 5. There are two conditions in line 2 that may cause the while-loop to stop. Consider first the case where 2r'(1 + 1/ε) ≤ d(q, Z_{r'}). By Lemma 2.4 we know that d(q, Z_{r'}) ≤ d(q, S) + 2r'. Combining the last two inequalities, we get 2r' ≤ ε d(q, S). Plugging this back into the second inequality, we get d(q, Z_{r'}) ≤ d(q, S) + 2r' ≤ (1+ε) d(q, S), as desired.

Consider next the case where Z_{r'} is non-proper. We may also assume that d(q, Z_{r'}) > 0, as otherwise we are done. For the sake of analysis, suppose that we were to continue iterating (i.e., ignore the requirement of line 2 that Z_r is proper). Notice that this process cannot go on forever, because d(q, Z_r) ≥ d(q, S) > 0 while 2r(1 + 1/ε) → 0. Furthermore, the point that would be returned by this modified procedure is exactly the one returned by the actual procedure, because the sets Z_r that would be constructed further (i.e., for r < r') would all contain the same single point as Z_{r'}. Finally, it is easy to see that the analysis above for the first stopping case (including Lemma 2.4) applies also to the modified procedure, and we obtain that in both the modified and the actual procedure, d(q, Z_{r'}) ≤ (1+ε) d(q, S), as desired. In fact, if the iteration stops with Z_{r'} being non-proper, then the procedure outputs the (unique) closest point to q among S, because we can apply the same "modified procedure" argument with an arbitrarily small ε' > 0. ∎

Lemma 2.6. Procedure Approx-NNS runs in time 2^{O(dim(S))} log Δ + (1/ε)^{O(dim(S))}.

Proof. We first bound the size of any set Z_r computed by the procedure. The set Z_r of line 1 is trivial, so consider a set Z_{r/2} constructed in line 3. For every y ∈ Z_{r/2}, the conditions in lines 2 and 3 imply that d(q, y) ≤ d(q, Z_r) + r < 2r(1 + 1/ε) + r = r(3 + 2/ε). Since the distance between any two points of Z_{r/2} ⊆ Y_{r/2} is at least r/2, we have by Lemma 1.2 that |Z_{r/2}| ≤ (2 + 1/ε)^{O(dim(S))}.

A cruder time bound of (2 + 1/ε)^{O(dim(S))} log Δ follows by bounding separately the number of iterations that the procedure performs and the running time of a single iteration. For the first bound (the number of iterations), observe that once r becomes smaller than dmin/(6 + 6/ε), the diameter of Z_{r/2} is (by lines 2 and 3) at most 2[d(q, Z_r) + r] < 2[2r(1 + 1/ε) + r] < dmin, and thus Z_{r/2} ⊆ S contains at most one point. If, in addition, r is below the trivial-list scale r_x of the remaining point x (recall r_x = Ω(dmin)), then Z_r must be non-proper, and the second condition in line 2 does not hold. Hence, the scales r iterated over are between rmax = O(dmax) and Ω(dmin/[γ + 1 + 1/ε]), and thus the number of iterations is at most log Δ + log(1 + 1/ε) + O(1).

For the second bound (the running time of a single iteration): an iteration scans |Z_r| lists L_{z,r}, each of length |L_{z,r}| ≤ 2^{O(dim(S))}, and computes their union (which requires the removal of duplicates and can be done by sorting). It follows that the running time of a single iteration is at most O(|Z_r| · |L_{z,r}| · log(|Z_r| · |L_{z,r}|)) ≤ (2 + 1/ε)^{O(dim(S))}. A similar scan is performed in line 5, with running time O(|Z_r|). Altogether, the total running time of the procedure is at most (2 + 1/ε)^{O(dim(S))} log Δ.

Employing a more careful analysis, we now improve this upper bound to the one stated in the lemma. We may assume ε < 1, as otherwise the two bounds coincide. For iterations in which r ≥ d(q, S), we can bound the diameter of Z_r (using line 3 and Lemma 2.4) by 2[d(q, Z_{2r}) + 2r] ≤ 2[d(q, S) + 4r + 2r] ≤ 14r, and using Lemma 1.2 we get that |Z_r| ≤ 2^{O(dim(S))}, regardless of ε. The number of such iterations is bounded, as above, by log Δ + log(1 + 1/ε) + O(1). For iterations in which r < d(q, S), we use the weaker bound |Z_r| ≤ (2 + 1/ε)^{O(dim(S))} from above; however, the number of these iterations is at most log(1/ε) + O(1), because in every iteration the condition in line 2 guarantees that d(q, Z_r) < 4r/ε, i.e., r > ε d(q, Z_r)/4 ≥ ε d(q, S)/4. We conclude that the total running time is at most 2^{O(dim(S))} log Δ + (1/ε)^{O(dim(S))}.

Notice that we analyzed above only the number of operations performed explicitly by Procedure Approx-NNS. In general, locating a particular navigation list for a given point requires time O(log log Δ). However, we can reduce this cost as per the remark in Section 2.1: since all the accesses to navigation lists are via other navigation lists, we can access each list in time O(1) by maintaining, for each scale-r navigation list, direct pointers to the scale-r/2 lists. ∎
2.4 Range queries and exact nearest neighbor. We can use the data structure to implement a range query operation: given a point q ∈ M and a distance t > 0, this operation returns the set of points B(q, t). Our upper bound on the running time of this operation depends linearly on |B(q, O(t))|. In many cases it is plausible that this quantity is not much larger than the output length |B(q, t)|, although the ratio between the two can be arbitrarily large in the worst case, even in doubling metrics. It is then straightforward to combine the range query operation with the 2-NNS algorithm of Section 2.3 to arrive at an exact nearest neighbor search procedure, as sketched after Theorem 2.4 below.

Theorem 2.3. A range query within distance t around a point q can be computed using the data structure in time 2^{O(dim(S))} (log Δ + |B(q, O(t))|).

Proof (Sketch). By rounding, we may assume that t ∈ Γ. The range query procedure iterates over the scales r = rmax, ..., rmin and constructs a set Z_r ⊆ Y_r for each such value of r. The first phase corresponds to the scales r = rmax, ..., 2t and is similar to Section 2.3: the set Z_r ⊆ Y_r contains the points of Y_r near q. In the second phase, which corresponds to the scales r = t, ..., rmin, the set Z_r contains all the points that can be found by scanning the scale-2r navigation lists of all the points in Z_{2r}. The procedure reports all the points found in the second phase whose distance from q is at most t.

Let us show that the procedure indeed reports every point s ∈ B(q, t). By Lemma 2.1, there is a point y ∈ Y_t such that d(s, y) ≤ 2t. For the sake of analysis, let a be a closest point to q among those in S; then d(q, a) ≤ d(q, s) ≤ t. By Lemma 2.4, there is a point z ∈ Z_{2t} such that d(a, z) ≤ 4t. Altogether, we have d(z, y) ≤ d(z, a) + d(a, q) + d(q, s) + d(s, y) ≤ 4t + t + t + 2t ≤ γ · 2t, and hence y ∈ Y_t must appear in the list L_{z,2t}. Now it is easy to see that the second phase of the algorithm, which recursively scans all the navigation lists reachable from Z_{2t}, must find s along the way, by the same argument as in Lemma 2.1.

To analyze the running time, observe that Z_{2t} is contained, by its construction together with Lemma 2.4, in a ball of radius d(q, Z_{4t}) + 2t = O(t) around q. It follows that any point found in the second phase is within distance γ(2t + t + t/2 + ···) ≤ 4γt of Z_{2t}, and hence within distance O(t) of q. We can then use an adaptation of Theorem 2.1 to bound the running time of the second phase by 2^{O(dim(S))} |B(q, O(t))|. In order to bound the running time of the first phase, we modify it so that the iterations are stopped and the empty set is reported whenever d(q, Z_r) > 3r. This implies that the diameter of Z_r is at most 2 d(q, Z_r) + r = O(r), and thus |Z_r| ≤ 2^{O(dim(S))} (by Lemma 1.2). Hence, the first phase runs in time 2^{O(dim(S))} log Δ. Observe that if there is a point s ∈ B(q, t), the first phase does not stop before reaching scale 2t, because by Lemma 2.4, d(q, Z_r) ≤ d(q, s) + 2r ≤ 3r for every r ≥ 2t. ∎

Theorem 2.4. An exact nearest neighbor of a query q can be computed using the data structure in time 2^{O(dim(S))} (log Δ + |B(q, O(d(q, S)))|).

Proof. Given a query q, we first apply Theorem 2.2 to compute a 2-NNS for q, i.e., to find a point s_q ∈ S such that d(q, s_q) ≤ 2 d(q, S). We then apply Theorem 2.3 to find all the points within distance t = d(q, s_q) from q, and report the one or more points among them that are closest to q. It is immediate that this algorithm indeed finds all the closest points to q, and that the running time bound is as claimed. ∎
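The reduction of Theorem 2.4 is just two calls against the earlier sketches. Here, range_query is a hypothetical name standing for the two-phase procedure of Theorem 2.3, which is not spelled out in code in this version:

    def exact_nns(ds, q):
        """Theorem 2.4: all exact nearest neighbors of q, via 2-NNS plus a
        range query."""
        s_q = approx_nns(ds, q, eps=1.0)     # 2-approximate nearest neighbor
        t = ds.dist(q, s_q)
        candidates = range_query(ds, q, t)   # B(q, t); contains every true NN
        best = min(ds.dist(q, s) for s in candidates)
        return [s for s in candidates if ds.dist(q, s) == best]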
2.5 Inserting a point. We now show a procedure that updates the data structure when a new point q ∈ M is inserted into S.

Theorem 2.5. The data structure can be updated with an insertion of a point into S in time 2^{O(dim(S))} log Δ log log Δ. This includes 2^{O(dim(S))} log Δ distance computations.

Proof (Sketch). The main idea is that, regardless of how the nets were constructed, it is possible to update each r-net Y_r by either adding q to it or leaving it unchanged. Indeed, this can be seen by induction on r. The base case is a sufficiently small scale r, for which Y_r contains, by definition, all the points, and thus q must be added to this net. For the inductive step, assume first that the update to Y_{r/2} leaves it unchanged; then it is clear that Y_r need not be changed. Assume next that the net Y_{r/2} is updated by adding q to it; then we update Y_r by adding q to Y_r if and only if d(q, Y_r) ≥ r. This guarantees that Y_r remains an r-net of Y_{r/2} (although there may be other valid ways to update Y_r). Deciding whether d(q, Y_r) ≥ r is done using a set Z_r that contains (similarly to Section 2.3) all the points of Y_r that are near q. Recall that a net Y_r is only maintained through its navigation lists, so adding q to Y_r actually requires us to construct a scale-r navigation list for q and to update the scale-2r lists of nearby points. Again, this task is carried out using the aforementioned sets Z_r. ∎
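In outline, the insertion is one top-down pass. The following coarse sketch (our illustration; the bookkeeping for list creation, list updates, and maintaining y_top/rmax is elided) computes, for each scale, the set Z_r of net points near q, which suffices to decide d(q, Y_r) ≥ r because Z_r captures every point of Y_r within γr of q:

    def insert(ds, q):
        """Coarse sketch of Theorem 2.5: record the scales at which q joins
        the net; the actual list updates are elided."""
        d = ds.dist
        i, Z = ds.i_max, [ds.ytop]
        joins = []
        while i >= ds.i_min and Z:
            r = 2.0 ** i
            if min(d(q, z) for z in Z) >= r:
                joins.append(i)                        # q is added to Y_r
            Z = list({y for z in Z for y in ds.navigation_list(z, i)
                      if d(q, y) <= ds.GAMMA * r})     # Z_{r/2}
            i -= 1
        # ... for each i in joins: store L_{q, 2**i} (extracted from Z at
        # that scale) and append q to the nearby scale-2**(i+1) lists;
        # also refresh ytop / i_max / r_x as needed (omitted).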
2.6 Deleting a point. Our data structure can also support deletions of points. The procedure for deleting a point is generally similar to that for inserting a point (Section 2.5). The main technical difference occurs when a point q is deleted from a net Y_r: in this case, Y_r \ {q} need not be an r-net of Y_{r/2} \ {q}, and it may be necessary to promote (i.e., add) to Y_r one or more points of Y_{r/2}. However, this decision (and the corresponding updates to the various lists) can be carried out using the relatively small sets Z_r that contain all the points of Y_r near q. Details are omitted from this version of the paper.

2.7 Improved bounds for KR metrics. Clearly, by Lemma 1.1, one may replace dim(S) by dim_KR(S) in all the aforementioned complexity bounds. We now show that, using essentially the same data structure, we can achieve improved performance under the additional assumption that dim_KR(S ∪ {q}) is small, where q is the query point. First, in the time complexity of the various operations, we are able to replace Δ by n = |S|. Secondly, we show that an exact nearest neighbor can be found (rather than only a (1+ε)-approximation); in fact, the k closest points can be found in time O(log n + k). These bounds match the guarantees of Karger and Ruhl [KR02], but our data structure is deterministic, dimension oblivious, and requires smaller space. (And, of course, it extends to metrics with bounded doubling dimension.)

Theorem 2.6. The (1+ε)-approximate nearest neighbor algorithm of Theorem 2.2 runs in time 2^{O(dim_KR(S∪{q}))} log n + (1/ε)^{O(dim_KR(S∪{q}))}. Similarly, the insertion and deletion procedures of Theorem 2.5 run in time 2^{O(dim_KR(S∪{q}))} log n log log n.

The proof of Theorem 2.6 follows from Lemma 2.8 below; details are omitted from this version.

Lemma 2.7. Let x, y be points in a metric (X, d), and set r = d(x, y)/3; then |B(x, r)| ≥ 2^{−2 dim_KR(X)} |B(y, r)|.

Proof. Observe that B(y, r) ⊆ B(x, 3r + r), and thus |B(y, r)| ≤ |B(x, 4r)| ≤ 2^{2 dim_KR(X)} |B(x, r)|. ∎

Lemma 2.8. Every point x ∈ S has at most 2^{O(dim_KR(S))} log n non-trivial navigation lists L_{x,r}.

Proof. Suppose that x has non-trivial navigation lists for the scales r_0 < r_1 < ··· < r_t. By definition, each such navigation list L_{x,r_i} contains a point y_i ∈ Y_{r_i/2} that is different from x, with d(x, y_i) ≤ γ r_i; moreover, x ∈ Y_{r_i} ⊆ Y_{r_i/2}, and thus d(x, y_i) ≥ r_i/2. We now claim that if c = c(γ) is a suitable constant, then |B(x, r_{i+c})| ≥ (1 + 2^{−2 dim_KR(X)}) |B(x, r_i)| for every i ≥ 0. Given this claim, the lemma follows easily: by induction, |B(x, r_{ci})| ≥ (1 + 2^{−2 dim_KR(X)})^i (the base case i = 0 is obvious), and since |S| = n, we get that the number of distinct scales is t + 1 ≤ 2^{O(dim_KR(X))} log n.

It remains to prove the claim. Applying Lemma 2.7 to the points x and y_{i+3}, with radius r = d(x, y_{i+3})/3, we get that |B(y_{i+3}, r)| ≥ 2^{−2 dim_KR(X)} |B(x, r)|. By definition, these two balls are disjoint, and both are contained in a ball of radius 4r centered at x. Thus, |B(x, 4r)| ≥ |B(y_{i+3}, r)| + |B(x, r)| ≥ (1 + 2^{−2 dim_KR(X)}) |B(x, r)|. The claimed inequality now follows by plugging in the upper bound 4r = (4/3) d(x, y_{i+3}) ≤ (4/3) γ r_{i+3} ≤ r_{i+c} (assuming 2^{c−3} ≥ (4/3) γ) and the lower bound r = d(x, y_{i+3})/3 ≥ r_{i+3}/6 > r_i. ∎

Theorem 2.7. The k nearest neighbors of a query q can be computed using the data structure in time 2^{O(dim_KR(S∪{q}))} (k + log n). In particular, an exact NNS can be computed in time O(log n) whenever dim_KR(S ∪ {q}) = O(1).

The proof of this theorem is based on applying Theorem 2.3 and bounding the running time using arguments like those in Lemma 2.8. Details are omitted from this version of the paper.
3 Lower bounds

This section shows that the complexity of our data structure is nearly optimal, by lower bounding the number of distance computations that are necessary (information-theoretically) to answer NNS and (1+ε)-NNS queries. The two lower bounds presented below assume that computing the distance between two points is the only costly operation, and that no information about the distances can be deduced by other means, such as hashing the points' identifiers. (This is similar in spirit to lower bounds on the number of comparisons in sorting.) Formally, suppose that the distance between every two points in S is known, and that computing the distance between the query point q and any point of S requires an access to an oracle. We consider an adversarial oracle; i.e., we examine the worst-case complexity of answering a (1+ε)-NNS query (over all possible oracles). For simplicity, we state the lower bound for ε ≤ 2, although the proof immediately extends to larger ε. These results hold even for randomized algorithms (both Las Vegas and Monte Carlo).

Lemma 3.1. There is an input data set S for the distance-oracle model such that, even though S is doubling and dim_KR(S) = O(1), any exact NNS algorithm must access the oracle Ω(n) times (for a worst-case query).

Proof (Sketch). Let (S, d) be the metric on the integer points 1, ..., n of the real line. Clearly, dim(S) = O(1). Let the query point q be at distance n − 1 from a single point i ∈ S and at distance n from all the other points of S. Obviously, any deterministic NNS algorithm must report the point i. It follows that if the value of i is chosen adversarially, then the NNS algorithm must compute n distances, in the worst case, in order to find i. The proof for randomized algorithms is similar, using Yao's minimax principle. ∎

Lemma 3.2. There is an input data set S such that any (1+ε)-NNS algorithm (for a fixed 0 ≤ ε ≤ 1) that works in the distance-oracle model must access the oracle at least 2^{Ω(dim(S))} log n times (for a worst-case query).

We omit the proof from this version. It is based on a data set S which is the shortest-path metric on the leaves of a complete tree of fixed arity, in which the length of the edges at depth i is 1/2^i. It shows that any algorithm essentially has to perform a linear search among the children of the root (the vertex at depth 0), then a linear search among the children of a vertex at depth 1, and so forth.

References

[AMN+98] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891-923, 1998.

[Ass83] P. Assouad. Plongements lipschitziens dans R^n. Bull. Soc. Math. France, 111(4):429-448, 1983.

[Bri95] S. Brin. Near neighbor search in large metric spaces. In 21st International Conference on Very Large Data Bases, pages 574-584, 1995.

[Cla99] K. L. Clarkson. Nearest neighbor queries in metric spaces. Discrete Comput. Geom., 22(1):63-93, 1999.

[CNBYM01] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín. Proximity searching in metric spaces. ACM Computing Surveys, 33(3):273-321, September 2001.

[FK97] C. Faloutsos and I. Kamel. Relaxing the uniformity and independence assumptions using the concept of fractal dimension. J. Comput. System Sci., 55(2):229-240, 1997.

[GKL03] A. Gupta, R. Krauthgamer, and J. R. Lee. Bounded geometries, fractals, and low-distortion embeddings. In 44th Symposium on Foundations of Computer Science, 2003.

[Hei01] J. Heinonen. Lectures on analysis on metric spaces. Universitext. Springer-Verlag, New York, 2001.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In 30th Annual ACM Symposium on Theory of Computing, pages 604-613, May 1998.

[KOR98] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high-dimensional spaces. In 30th Annual ACM Symposium on Theory of Computing, pages 614-623, 1998.

[KR02] D. Karger and M. Ruhl. Finding nearest neighbors in growth-restricted metrics. In 34th Annual ACM Symposium on Theory of Computing, pages 63-66, 2002.

A Appendix

Proof of Lemma 1.1. Let K = 2^{dim_KR(X)} be the KR-constant of X, and fix some ball B(x, 2r). We will show that B(x, 2r) can be covered by K⁴ balls of radius r; it will follow that dim(X) ≤ 4 log K = 4 dim_KR(X).

Let Y be an r-net of B(x, 2r); then

    B(x, 2r) ⊆ ∪_{y∈Y} B(y, r) ⊆ B(x, 4r).

Also, for every y ∈ Y, |B(x, 4r)| ≤ |B(y, 8r)| ≤ K⁴ |B(y, r/2)|. Since B(y, r/2) and B(y', r/2) are disjoint for y ≠ y' ∈ Y, and all of them are contained in B(x, 4r), it follows that |Y| ≤ K⁴. We conclude that the K⁴ balls {B(y, r) : y ∈ Y} cover B(x, 2r). ∎