Approximate Nearest Neighbor under Edit Distance via Product Metrics

Piotr Indyk*

*Laboratory for Computer Science, MIT, Cambridge, MA 02139. This work was supported in part by NSF CAREER grant CCR-0133849.

Abstract

We present a data structure for the approximate nearest neighbor problem under the edit metric (which is defined as the minimum number of insertions, deletions and character substitutions needed to transform one string into another). For any ε > 0 and any set of n strings of length d, the data structure reports an approximate nearest neighbor of a given query string, with an approximation factor that is constant for any fixed ε. The space requirement of the data structure is strongly subexponential in d. To our knowledge, this is the first data structure for this problem with both query time and storage subexponential in d.

1 Introduction

The Nearest Neighbor Search (NN) problem is: given a set P of n points in a metric space M = (X, D), preprocess P so as to efficiently answer queries asking for the point in P closest to a query point q. NN and its approximate versions are among the most extensively studied problems in the fields of Computational Geometry and Algorithms, resulting in the discovery of many efficient algorithms. In particular, for the case when the metric space is a low-dimensional Euclidean space, it is known how to construct data structures for exact [Cla88, Mei93] or approximate [AMN+94, Kle97, IM98, HP01] NN with query time polylogarithmic in n, for any fixed dimension d. Unfortunately, those data structures require space exponential in d. More recently, several data structures using space (quasi-)polynomial in n and d, and query time sublinear in n, have been discovered for approximate NN under the l1 and l2 norms [KOR98, IM98, Ind00] and the l-infinity norm [Ind98].

While many metrics of interest for nearest neighbor search are norms, quite a few of them are not. A prominent example of the latter is the Levenshtein (or edit) metric which, for two strings x and y, is defined as the minimum number of insertions, deletions and substitutions that transform x into y. Edit distance is a natural measure of similarity between sequences of characters, and is widely used, e.g., in computational biology and text processing. Unfortunately, the only nearest neighbor algorithms for the edit metric known so far were the "trivial" ones: a linear scan (which suffers from query time linear in n), or exhaustive storage of the answers to all possible queries (which suffers from memory requirements exponential in the length of the strings).
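For concreteness, the edit distance between two given strings can be evaluated by the classical textbook dynamic program. The following Python routine is included only as an illustration of the metric; it is not part of the data structures developed in this paper.

    def edit_distance(x: str, y: str) -> int:
        """Levenshtein distance: the minimum number of insertions,
        deletions and substitutions transforming x into y."""
        m, n = len(x), len(y)
        prev = list(range(n + 1))        # distances from the empty prefix of x
        for i in range(1, m + 1):
            cur = [i] + [0] * n          # ED(x[:i], "") = i deletions
            for j in range(1, n + 1):
                cur[j] = min(
                    prev[j] + 1,                           # delete x[i-1]
                    cur[j - 1] + 1,                        # insert y[j-1]
                    prev[j - 1] + (x[i - 1] != y[j - 1]),  # substitute (or match)
                )
            prev = cur
        return prev[n]

    assert edit_distance("kitten", "sitting") == 3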
A potential approach for devising an approximate NN data structure for the edit metric is to embed it into a normed space, and then use NN algorithms for normed spaces. In other words, we would like to map every sequence into a vector such that the distance between the vectors approximates the distance between the sequences. This approach has been successfully applied to some non-normed metric spaces. In particular, it was shown that a variant of the edit distance (called the block edit distance) can be embedded into the Hamming space or the l1 norm with distortion roughly logarithmic in the maximum string length d [CPSV00, MS00, CM02]. The block edit distance measures the minimum number of block (not just character) operations needed to transform one string into another. Although it is plausible that the standard edit metric can be embedded into, say, the l1 norm with low distortion, so far researchers have been unable to provide such an embedding. A (somewhat weak) lower bound of 3/2 on the distortion of any such embedding has recently been given in [ADG+03].

In this paper we present the first non-trivial approximate algorithm for NN under the edit metric. For any ε > 0, our algorithm solves the approximate NN problem with an approximation factor that is constant for any fixed ε, using query time and space strongly subexponential in d. This, e.g., allows us to obtain a constant factor approximation using space that is exponential only in d^ε, for arbitrarily small ε > 0. Alternatively, by choosing ε appropriately small as a function of d, we can obtain an algorithm with a larger (super-constant) approximation factor using polynomial space.

The result is obtained by representing the edit metric as a product of several "smaller" metrics, using a method first proposed in [Ind02]. A product of metrics M_1, ..., M_k with distance functions D_1, ..., D_k, combined via a function f, is a metric over the Cartesian product X_1 × ... × X_k with the distance function D defined by D((x_1, ..., x_k), (y_1, ..., y_k)) = f(D_1(x_1, y_1), ..., D_k(x_k, y_k)). The most typical functions f are "max" and "sum"; in general, our result holds for any monotone function f (see Preliminaries).

We prove the following general result about NN for product metrics. Assume that each M_i is equipped with a c-approximate NN data structure with query time Q and using space S. Then we can construct a data structure for 3c-approximate NN under the f-product, with query time roughly k·Q and space roughly k·S plus a lookup table precomputed during preprocessing (see Theorem 3.1). Given this theorem, we show that NN for the edit metric can be reduced to NN over a product of several edit metrics over shorter sequences.
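To make the definition concrete, the following sketch evaluates the f-product distance; it is an illustration of the definition only (not of our data structures), with the component distance functions passed in as Python callables.

    from typing import Callable, Sequence

    Dist = Callable[[object, object], float]

    def product_distance(dists: Sequence[Dist],
                         f: Callable[[Sequence[float]], float],
                         x: Sequence[object],
                         y: Sequence[object]) -> float:
        """Distance in the f-product of metrics M_1, ..., M_k:
        D(x, y) = f(D_1(x_1, y_1), ..., D_k(x_k, y_k))."""
        return f([d(xi, yi) for d, xi, yi in zip(dists, x, y)])

    # Example: the sum-product of two edit metrics over blocks, using the
    # edit_distance routine from the Introduction.
    # product_distance([edit_distance, edit_distance], sum,
    #                  ("abcd", "efgh"), ("abce", "efgh"))  # == 1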
Then, the NN data structures for the individual metrics can be constructed either recursively, or using the exhaustive storage approach.

Although the main focus of this paper is on algorithms for the edit metric, we mention that NN data structures for product metrics are also of significant interest in their own right. Product metrics arise in scenarios where one searches for an object satisfying a set of metric constraints, e.g., "find a person with a similar signature and a similar face photo". We can answer such a query by using an NN data structure for a product of the individual metrics. For this reason, we further study approximate NN algorithms for product metrics. In particular, we obtain the following result for the case when f is the sum function. As before, assume that each metric M_i is equipped with an approximate NN data structure with query time Q and using space roughly S. Then we can construct a data structure that solves the decision version (see Preliminaries) of approximate NN for the sum-product of the M_i's; its approximation factor is super-constant (see Section 5 for the precise statement). Note that, unlike in the previous case, the space bound does not depend exponentially on the number of metrics k. However, the approximation factor makes it impossible to use this data structure in the context of the edit metric.

Related results. The approach of solving NN in non-normed metrics via product metrics was introduced in [Ind02]. In that paper, the author gave an approximate algorithm for the case when the function f computes the maximum of its arguments. The space bound of that data structure lacked the lookup-table term present here; the other parameters were similar. The data structure was then used to provide a subexponential-space approximate NN data structure for the Frechet metric. Unfortunately, the approach of [Ind02] relied significantly on properties of the maximum function, since it essentially generalized the earlier algorithm of [Ind98] for the l-infinity norm. On the other hand, in order to obtain an algorithm for the edit metric, one needs to design an algorithm for the case when f computes the sum of its arguments. Thus, one cannot use the data structure of [Ind02] to obtain the result presented in this paper. We mention that, by using our product metric data structure, we can construct an approximation algorithm for NN under the Frechet metric, with space and query time as in [Ind02], that has only a constant approximation factor.

Our techniques. Our first data structure (for any monotone function f) is quite simple. Its basic idea is to partition the product metric into regions, one region per tuple of possible answers of the individual nearest neighbor data structures, within which an approximate nearest neighbor can be fixed and precomputed during the preprocessing. In addition, we can determine the region that a given query belongs to by invoking only the nearest neighbor oracles for the individual metrics M_i. Our second algorithm can be viewed as an abstract version of the Locality-Sensitive Hashing algorithm of [IM98]. It allows us to reduce the problem of approximate NN in sum-product metrics to approximate NN in max-product metrics, by embedding the former into the latter. Then we can use the algorithm of [Ind02] to solve the problem in the latter space.

2 Preliminaries

Consider metric spaces M_i = (X_i, D_i), for i = 1, ..., k, and let f be a function mapping k-tuples of non-negative reals to non-negative reals. The f-product of M_1, ..., M_k is the pair M = (X, D) obtained by setting X = X_1 × ... × X_k and setting D((x_1, ..., x_k), (y_1, ..., y_k)) = f(D_1(x_1, y_1), ..., D_k(x_k, y_k)).

We call a function f monotone if: (i) f(a_1, ..., a_k) ≤ f(b_1, ..., b_k) whenever a_i ≤ b_i for all i; (ii) f(a + b) ≤ f(a) + f(b); and (iii) f(λ·a) ≤ λ·f(a) for every λ ≥ 1. Note that the functions max and sum, and more generally the l_p norms of the argument vector for any p ≥ 1, are monotone in this sense. Note also that if f is monotone and f(a) = 0 implies a = 0, then the f-product M is a metric.

We define ED(x, y) to be the edit distance between strings x and y, i.e., the minimum number of insertions, deletions and substitutions of symbols needed to transform x into y. In addition, we use ED_d to denote the edit metric restricted to strings of length d.

The c-approximate Nearest Neighbor problem (c-NN), under a metric M = (X, D), is defined as follows: given a set P ⊆ X of n points, build a data structure which, given any query q ∈ X, finds p ∈ P such that D(p, q) ≤ c · min over p' ∈ P of D(p', q). The decision version of the above problem is defined as follows: given P, c and a radius r > 0, build a data structure which, given any query q ∈ X for which there exists p' ∈ P with D(p', q) ≤ r, finds p ∈ P such that D(p, q) ≤ c·r. Note that r is fixed before the data structure is constructed. If the data structures are randomized, we require that the above specification holds, for any fixed query, with constant probability. As shown in [IM98, HP01], the decision and optimization versions of approximate NN are equivalent, up to polylogarithmic factors in the space and query time bounds.

For any metric M = (X, D) and t > 0, we define t·M to be the metric (X, t·D), i.e., M with all distances multiplied by t.

3 Nearest neighbor under f-product metrics

Consider metric spaces M_i = (X_i, D_i), i = 1, ..., k, and assume that all distances in the metrics M_i are integers from {0, 1, ..., R}. In addition, let f be a monotone function, and let M = (X, D) be the f-product of M_1, ..., M_k. Let P ⊆ X, |P| = n. For a point p ∈ P, we write p_i for its i-th coordinate, and define P_i = {p_i : p ∈ P}.

THEOREM 3.1. Assume that each metric M_i is equipped with a c-NN data structure with query time Q and using space S. Then there is a data structure for 3c-NN under M, with query time O(k·Q) plus a single table lookup, using space O(k·S) plus a lookup table with one entry per k-tuple in P_1 × ... × P_k (i.e., at most n^k entries).
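Schematically, the query procedure of Theorem 3.1 looks as follows; this is a sketch in hypothetical Python notation, where `oracles` and `table` stand for the per-coordinate c-NN structures and the precomputed table T of the proof below.

    def product_nn_query(query, oracles, table):
        """Approximate NN query under an f-product metric.

        oracles[i] returns an (identifier of an) approximate nearest
        neighbor of query[i] among the i-th coordinates P_i of the data set.
        table maps each tuple of possible oracle answers to a data point
        precomputed to minimize the distance estimate D' within the
        corresponding region of the product space.
        """
        key = tuple(oracle(q_i) for oracle, q_i in zip(oracles, query))
        return table[key]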
Proof: The data structure is constructed as follows. First, we build a c-NN data structure for each of the sets P_i, under the metric M_i. Then, we prepare a lookup table T, indexed by k-tuples from P_1 × ... × P_k; we describe how to fill T below.

Given a query point q = (q_1, ..., q_k), the data structure proceeds as follows. First, for each i, it queries the c-NN data structure for P_i with the argument q_i. Let p^i ∈ P_i be the point returned by the i-th data structure. Then the data structure returns the point of P stored in T under the index (p^1, ..., p^k). The query time and space bounds follow directly from this description. It remains to show how to fill T so that the algorithm reports an approximate nearest neighbor.

Let q and p^1, ..., p^k be as above, and consider any point p ∈ P. We define D'(p) = f(D_1(p_1, p^1), ..., D_k(p_k, p^k)) to be an approximation of the distance D(p, q). Observe that D'(p) can be computed using exclusively the information available to the lookup table, i.e., the index (p^1, ..., p^k), without full knowledge of q. Observe also that, for each i, D_i(p_i, p^i) ≤ D_i(p_i, q_i) + D_i(q_i, p^i) ≤ (1 + c)·D_i(p_i, q_i), since the i-th data structure guarantees D_i(p^i, q_i) ≤ c·D_i(p'_i, q_i) for every point p'_i ∈ P_i. By the monotonicity of f, this implies D'(p) ≤ (1 + c)·D(p, q). In the other direction, let p* denote the exact nearest neighbor of q. For any p ∈ P,

D(p, q) ≤ f(D_1(p_1, p^1) + D_1(p^1, q_1), ..., D_k(p_k, p^k) + D_k(p^k, q_k)) ≤ D'(p) + f(D_1(p^1, q_1), ..., D_k(p^k, q_k)) ≤ D'(p) + c·D(p*, q),

using the subadditivity and monotonicity of f. Therefore, if we set T[(p^1, ..., p^k)] to a point p which minimizes D'(p) over P, the algorithm reports a point p with D(p, q) ≤ D'(p) + c·D(p*, q) ≤ D'(p*) + c·D(p*, q) ≤ (1 + 2c)·D(p*, q), i.e., a (1 + 2c)-approximate (and thus 3c-approximate) nearest neighbor.

4 Edit distance

THEOREM 4.1. For any ε > 0, there is a data structure solving c-NN under the edit metric ED_d, where c = c(ε) is a constant depending only on ε, using space and query time strongly subexponential in d.

In the following we describe a data structure for NN under the edit distance which performs one level of this construction: it reduces strings of length d to strings of length d/b, for a block parameter b dividing d. The bounds stated in the theorem are obtained by recursive application of the same idea (e.g., with b = d^ε blocks at each level); since each level multiplies the approximation factor by 3 (Theorem 3.1), the overall factor remains constant for a constant number of levels. The generalization to strings whose length is not divisible by b is straightforward.

Proof: For simplicity, we focus on the case where b divides d. Consider two strings x and y, where x ∈ Σ^d. Partition x into b substrings x_1, ..., x_b of length d/b each.

CLAIM 1. The edit distance satisfies the equality

ED(x, y) = min over y = y_1 ∘ y_2 ∘ ... ∘ y_b of the sum over i = 1, ..., b of ED(x_i, y_i),

where the minimum ranges over all decompositions of y into b consecutive (possibly empty) substrings y_1, ..., y_b, and ∘ denotes concatenation.

By Claim 1, the (approximate) nearest neighbor of a query string q in the set P with respect to ED can be computed by finding the (approximate) nearest neighbor of the b-tuple (q_1, ..., q_b), given by the partition of q into b blocks of length d/b, in an auxiliary set A with respect to the sum-product of b edit metrics. We create the auxiliary set A containing all b-tuples (y_1, ..., y_b) such that y_1 ∘ ... ∘ y_b = y for some y ∈ P (informally, we combine the "min" from the nearest neighbor definition with the "min" of Claim 1 into one "min"). Note that |A| ≤ n·(d + 1)^b, since each cut position can be chosen in at most d + 1 ways.

The key property of this transformation is that each query block q_i has length exactly d/b. Therefore, the NN problem under each component metric can be solved exactly, with constant query time, via exhaustive storage of the answers to all |Σ|^{d/b} possible queries. At the same time, the sum function is monotone in the sense of Section 2. Thus, using the results of the previous section, we can build a data structure solving the 3c-NN problem under ED_d, where c is the approximation factor achieved for the blocks (c = 1 with exhaustive storage, or the factor obtained recursively). Since |A| ≤ n·(d + 1)^b, the lookup table of Theorem 3.1 has at most |A|^b entries, and each exhaustive-storage structure occupies |Σ|^{O(d/b)} space; the space bound follows.
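For illustration, the decomposition in Claim 1 can be evaluated by brute force over the b - 1 cut positions; this sketch reuses the edit_distance routine from the Introduction, takes roughly (d + 1)^{b-1} evaluations, and serves only to make the claim concrete.

    from itertools import combinations_with_replacement

    def blocks(x: str, b: int):
        """Partition x into b substrings of length len(x) // b each
        (assuming b divides len(x))."""
        m = len(x) // b
        return [x[i * m:(i + 1) * m] for i in range(b)]

    def block_distance(x: str, y: str, b: int) -> int:
        """min over decompositions y = y_1 ... y_b into consecutive
        (possibly empty) substrings of sum_i ED(x_i, y_i); by Claim 1
        this equals ED(x, y)."""
        xs = blocks(x, b)
        best = None
        # Non-decreasing (b-1)-tuples of cut positions allow empty blocks.
        for cuts in combinations_with_replacement(range(len(y) + 1), b - 1):
            bounds = (0,) + cuts + (len(y),)
            ys = [y[bounds[i]:bounds[i + 1]] for i in range(b)]
            total = sum(edit_distance(xi, yi) for xi, yi in zip(xs, ys))
            if best is None or total < best:
                best = total
        return best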
5 Sum-product

In this section we show how to perform an embedding of a sum-product of metrics M_1, ..., M_k into a max-product of those metrics; the max-product includes several (scaled) copies of each M_i. The embedding is somewhat weak, but it is nevertheless sufficient to reduce the decision version of the problem of computing an approximate nearest neighbor under the sum-product of metrics to the same problem over a max-product of those metrics. The latter problem can be solved using the algorithm of [Ind02].

The embedding is based on the following approach. Assume for simplicity that k is a power of 2. For each scale j = 0, 1, ..., log k and each repetition l = 1, ..., L, we choose a set A_{j,l} ⊆ {1, ..., k}; each A_{j,l} is chosen by picking k/2^j elements uniformly at random from {1, ..., k}, without replacement. In addition, we define A_{0,l} = {1, ..., k} for all l. Our mapping Φ maps a point x ∈ X into a max-product of metrics M_{j,l}, where each M_{j,l} is itself a max-product of the metrics M_i, i ∈ A_{j,l}, multiplied by 2^j (that is, in the notation of the Preliminaries, the max-product of the metrics 2^j · M_i for i ∈ A_{j,l}). The coordinate of Φ_{j,l}(x) corresponding to i ∈ A_{j,l} is defined to be x_i. The mapping Φ is obtained by concatenating the mappings Φ_{j,l} over all j and l.

The key properties of the mapping follow from the lemma below. For two points x, y ∈ X, write a_i = D_i(x_i, y_i), and let D'(Φ(x), Φ(y)) = max over (j, l) of 2^j · max over i ∈ A_{j,l} of a_i denote the distance between Φ(x) and Φ(y) in the max-product.

LEMMA 5.1. For any x, y ∈ X we have:
1. If the sum of the a_i is at most t, then, for any β ≥ 1 and any fixed pair (j, l), the probability that 2^j · max over i ∈ A_{j,l} of a_i exceeds β·t is at most 1/β.
2. If the sum of the a_i exceeds (log k + 1)·t, then there exists a scale j such that, for every l, the probability that 2^j · max over i ∈ A_{j,l} of a_i exceeds t is at least 1 - 1/e.

Proof: We start from part (1). Consider first j ≥ 1. By the Markov inequality, the number of indices i such that a_i > β·t/2^j is at most 2^j/β. Thus, by a union bound over the k/2^j elements of A_{j,l}, the probability that A_{j,l} contains any such index is at most (k/2^j)·(2^j/β)/k = 1/β. The case of j = 0 holds trivially, since the sum of the a_i being at most t implies max_i a_i ≤ t ≤ β·t.

Now we consider part (2). For a given j, let m_j denote the number of indices i such that a_i > t/2^j. We first show that there exists j ∈ {0, ..., log k} such that m_j ≥ 2^j. Indeed, assume otherwise, i.e., that m_j < 2^j for all j. Ordering the values a_i in non-increasing order a_(1) ≥ a_(2) ≥ ... ≥ a_(k), the assumption implies a_(2^j) ≤ t/2^j for every j, and therefore

sum_i a_i ≤ sum over j = 0, ..., log k of 2^j · a_(2^j) ≤ sum over j of 2^j · t/2^j = (log k + 1)·t,

which yields a contradiction. By the above argument, there exists j such that at least 2^j indices i satisfy a_i > t/2^j. Consider the following two cases.

Case 1: 2^j = k. In this case every index i satisfies a_i > t/k, and thus any choice of A_{j,l} satisfies the conditions of the lemma.

Case 2: 2^j < k. Recall that A_{j,l} is a random set that contains k/2^j elements. The probability that all of them avoid the (at least 2^j) indices with a_i > t/2^j is at most (1 - 2^j/k)^{k/2^j} ≤ 1/e. Thus, with probability at least 1 - 1/e, the set A_{j,l} satisfies the conditions of the lemma.

Combining Lemma 5.1 over all scales and repetitions yields:

THEOREM 5.1. The embedding Φ satisfies the following conditions: for any x, y ∈ X we have:
1. If the sum of the a_i is at most t, then D'(Φ(x), Φ(y)) ≤ β·t with probability at least 1 - L·(log k + 1)/β.
2. If the sum of the a_i exceeds (log k + 1)·t, then D'(Φ(x), Φ(y)) > t with probability at least 1 - e^{-L}, since the L repetitions of each scale are independent.

COROLLARY 5.1. Assume that each metric M_i is equipped with a c-NN data structure with query time Q and using space S. Then there is a data structure solving the decision version of c'-NN for the sum-product of M_1, ..., M_k, where c' = O(c''·β·log k), c'' is the approximation factor of the max-product data structure of [Ind02] built from the given component structures, and β = Θ(log n · log k) is chosen as in the proof below. The query time and space are those of the data structure of [Ind02] applied to O(k·log n) scaled copies of the metrics M_i.

Proof: Follows by combining Theorem 5.1 and the algorithm of [Ind02]. Specifically, we set L = Θ(log n) and β = Θ(L·log k). By scaling, we can assume that the threshold in the decision version of approximate NN is equal to 1. Assume that there is a data point p within distance 1 from the query q. By performing the embedding as in Theorem 5.1, we are guaranteed that:

- the distance between Φ(q) and Φ(p) does not increase beyond β, with probability at least, say, 9/10; and
- the distance between Φ(q) and Φ(p'), for any data point p' such that the sum of D_i(p'_i, q_i) exceeds (log k + 1)·c''·β, remains greater than c''·β, with overwhelming probability (by a union bound over all n data points, using the choice L = Θ(log n)).

Thus, with probability slightly smaller than 9/10, a c''-approximate nearest neighbor of Φ(q) in the embedded data set is a correct solution to the decision version of the c'-approximate NN problem for the sum-product. The probability can be amplified to a constant arbitrarily close to 1 by building several data structures and using all of them. It therefore suffices to solve the decision version of approximate NN under the image metric. Since the image metric is a max-product metric, each of whose components is a scaled copy of a metric equipped with a c-approximate NN data structure, we can solve the latter problem using the result of [Ind02] (as mentioned in the Introduction).
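The sampling pattern underlying the embedding, and the induced max-product distance, can be sketched as follows; this is an illustration only, and the scaling and number of repetitions are schematic rather than the exact parameters of Lemma 5.1.

    import random

    def embedding_sets(k: int, L: int, seed: int = 0):
        """For each scale j = 0, ..., log2(k) and repetition l = 1, ..., L,
        pick a set A_{j,l} of k / 2^j coordinates uniformly at random,
        without replacement, paired with the scaling factor 2^j."""
        rng = random.Random(seed)
        sets, j = [], 0
        while k >> j >= 1:
            for _ in range(L):
                sets.append((2 ** j, rng.sample(range(k), k >> j)))
            j += 1
        return sets

    def embedded_distance(sets, a):
        """Max-product distance between the images of x and y, where
        a[i] = D_i(x_i, y_i) are the coordinatewise distances:
        max over (j, l) of 2^j * max_{i in A_{j,l}} a[i]."""
        return max(scale * max(a[i] for i in A) for scale, A in sets)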
6 Conclusions

In this paper we showed how to solve approximate NN in edit metrics by designing a new data structure for product metrics. The approach undertaken in this paper is fairly general, and can probably be used for other metrics which do not seem amenable to constant-distortion embeddings into normed spaces. However, the latter option (a low-distortion embedding, if one exists) would still be more attractive, due to much lower space requirements.

This work raises several interesting open problems concerning the complexity of the nearest neighbor problem under product metrics. One natural question is whether it is possible to improve the approximation factor of 3 from Theorem 3.1, and maybe even solve the exact NN problem. However, there is evidence that the factor of 3 could be tight, even for the case when f is the max function. Indeed, assume we could obtain a c-NN data structure for max-products, for some c < 3, with space bounds substantially better than ours. Then we could obtain a c-approximation algorithm for NN under the l-infinity norm with correspondingly low space, by viewing the k-dimensional l-infinity norm as the max-product of k one-dimensional metrics. By the result of [Ind98], this would in turn imply an unusually efficient data structure for the partial match query over {0, 1}^k. Such a result, although possible, appears to be somewhat unlikely (see [CIP02] for more background on the partial match problem).

References

[ADG+03] A. Andoni, M. Deza, A. Gupta, P. Indyk, and S. Raskhodnikova. Lower bounds for embedding of edit distance into normed spaces. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2003.

[AMN+94] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 573-582, 1994.
[CIP02] M. Charikar, P. Indyk, and R. Panigrahy. New algorithms for subset query, partial match, orthogonal range searching and related problems. International Colloquium on Automata, Languages, and Programming, 2002.

[Cla88] K. Clarkson. A randomized algorithm for closest-point queries. SIAM Journal on Computing, 17:830-847, 1988.

[CM02] G. Cormode and S. Muthukrishnan. The string edit distance matching problem with moves. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2002.

[CPSV00] G. Cormode, M. Paterson, C. Sahinalp, and U. Vishkin. Communication complexity of document exchange. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2000.

[HP01] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. Proceedings of the Symposium on Foundations of Computer Science, 2001.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.

[Ind98] P. Indyk. On approximate nearest neighbors in the l-infinity norm. Journal of Computer and System Sciences, to appear. Preliminary version in Proceedings of the Symposium on Foundations of Computer Science, 1998.

[Ind00] P. Indyk. Dimensionality reduction techniques for proximity problems. Proceedings of the Eleventh ACM-SIAM Symposium on Discrete Algorithms, 2000.

[Ind02] P. Indyk. Approximate nearest neighbor algorithms for Frechet metric via product metrics. Proceedings of the Symposium on Computational Geometry, 2002.

[Kle97] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997.

[KOR98] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proceedings of the Thirtieth ACM Symposium on Theory of Computing, pages 614-623, 1998.

[Mei93] S. Meiser. Point location in arrangements of hyperplanes. Information and Computation, 106:286-303, 1993.

[MS00] S. Muthukrishnan and C. Sahinalp. Approximate sequence nearest neighbors. Proceedings of the Symposium on Theory of Computing, 2000.