Fast Approximate Pattern Matching with Few Indels via Embeddings Mihai Badoiu Piotr Indyk For this section we assume we are given an oracle A that returns the length of the longest substring of , , which can be transformed into using at most substitutions. Note that, for the sake of simplicity, in this section we assume that the oracle gives the exact answer. We show later how to adopt our algorithm to deal with the approximate oracle provided in the next section. The basic idea behind our approach is as follows. Firstly, we will assume that there exists an -match of in , but there is no -match (we show how to remove this assumption in the full version of this paper). In this case we show that there must exist a much more restricted structure that we call blockmatch. Then we show how to search for a blockmatch using a dynamic program that employs a distance oracle, i.e., a procedure that returns the Hamming distance between any pair of substrings of and . Finally, we show how to implement an approximate version of the oracle and show that the approximate version suffices. 2.1 Blockmatch The reason for adding extra characters to and in the following definition is technical and will become clear in the proof of the next lemma. D E FI N I T I O N 1 . Let , and . Let be a substring with the following properties: (1) contains disjoint substrings , which will call blocks and (2) can be transformed into such that (a) we can perform exactly substitutions to each block (we are not allowed to substitute a character for itself or perform 2 substitutions on the same position), (b) we delete the characters of , which are not part of a block, (c) we insert characters between the blocks, (d) we perform at most indels in (b) and (c) and, (e) we can delete any character and insert any characted for free. We call a -blockmatch of in . L E M M A 2 . 1 . Let there exists a Proof: Omitted. . Let be the -match. Then -blockmatch. T H E O R E M 1 . 1 . There is an algorithm that, given a pattern and a text of length such that there is a match of in , it finds a match of in . The algorithm runs in time . Thus, we obtain a near-linear approximation algorithm that can handle a small number of indels but an arbitrary number of substitutions. This result can be viewed as a "noisy" version of the -time exact algorithms that allow at most indels and substitutions. Our techniques. Our main tool is an approximate substring difference data structure due to [IKM00]. After initial preprocessing, the data structure reports an approximate distance between any two substrings of in logarithmic time. MIT LCS, 200 Technology Square, Cambridge, MA 02139, USA, mihai, indyk @theory.lcs.mit.edu Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for industrial and Applied Mathematics. All Rights reserved. Printed in The United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688 651 ¢¤ i (q pDh0g¦ 2 The algorithm d e¡ E¤ q ¤ sss & 9 ¤ B¤ ¤ ¤ ' 8 " ¢ ¢ ' " ' 8" u¡ ¡ V 8£ £ ¡ F &GQC¦ FGED¦ C ¢ e¡ ¤ % v F REQC¦ q o n l hg l ¤ ©ph h S 8#g mkj0i &TEQC¦ f 7 y% v H s @PT ¤ i 2spi r ¡ y% v H s r xw ¤ utg 2sRg U¢ ¤ C 8GEQC¦ q ¤ 9F ¢ ¢ ¢ ¡ ¡ 1 Introduction The problem of finding a non-exact occurence of a pattern in a text over some alphabet is a central problem of combinatorial pattern matching and has a variety applications in many areas of computer science. The classic instance of the problem is when the distance between the pattern and a substring of is measured according to the edit distance. That is, the distance between and is equal to the minimum number of insertions, deletions and substitutions of characters that transform into . Despite decades of research, the best algorithm for this problem has running time if , even if the algorithm is allowed to report approximate answers. This contrasts with the setting when only substitutions are allowed, where nearlinear time approximation algorithms are known. In particular, Karloff [Kar93] showed how to reduce the problem over any to the same problem over , by mapping each symbol from to a sequence of random bits. The binary case can be solved in time using FFT. Several exact subquadratic-time algorithms exist for general as well (references omitted due to lack of space). In this paper we investigate searching under the edit distance in the following setting. Let be a substring of . We call a -match if can be transformed into using at most insertions or deletions and substitutions. The key feature in this definition is that the number of insertions and deletions (indels) is separated from the number of substitutions. Since substitutions are "easier" to handle, this allows us to obtain algorithms with better performance than when arbitrary edit operations are allowed. In particular, we show That data structure uses the dimensionality reduction theorem in due to Johnson and Lindenstrauss. Combining the data structure with dynamic programming yields our result. ¡ ¡ W 78DFa3421Y¨ 9 4¢ 215@cbaR86 £ Y421YXT42150¨§¥ 6 ) 3 ) ¨9 C H `7 3 ) ¨ W 3 ) ¦ ¡ Q F VUTS&QC¦ ¢ 7 H %¦ QREQC¦ F ¨ ¡ ¨3) ¦ BA42150¨§¥ 9 7 3 # 0)¦" &@86 £ 5421$§¥ '% (& F ¤ ¤ ¢ ¤ ¤ ¡ ¡ ! ¢ ¨ ¡ £ FH PIC ¨ ¦ BcQC§¥ F C DGED¦ ¢ f C¤ £ ¤ ¢ ¦ ©¨§¥ C ¢ £ d £ e ¡ ¡ ¢ 2.3 Approximate oracle suffices It remains to show that if we use the approximate oracle instead of the exact one , is equal to the size of the longest then we still solve the LEMMA 2.2. -match problem. For prefix of , such that is a blockmatch this purpose, we show the following. Let be the dynamic of that uses insertions or deletions and has blocks. programming table filled using instead of . The table is filled in the increasing order of and . Proof: We proceed by proving the statement by induction. The base case is given by the initialization of to . For the inductive step, assume that and are filled by the definition of the claim. Suppose there is a blockmatch that matches more characters for . must end with either a deletion or an insertion or a block. Let's suppose it ends with a block. If uses matches up to and then tries to match a block, then our blockmatch will be just as good as . Therefore, the must match less than and then end with a block. In this case, it is clear that our solution is at least as good as . By a similar analysis of the insertion and deletion cases, we conclude that cannot match more characters than . By filling out this table we will compute a solution that is at least a -blockmatch. Our algorithm makes calls to the oracle . The number of calls can be reduced by a factor of ; we show it in the full version of this paper. L E M M A 2 . 3 . For any , we have . The lemma implies that our algorithm will find a blockmatch, for , , such that each block uses at most substitutions. This will show the main theorem of this paper. Proof: The proof proceeds by induction. Assume that the statement is true for all triples and the triple ; we will show it holds for . Consider the inductive formula for . By the inductive assumption we have . By definition, differs in at most positions with the corresponding prefix of of the same length. Therefore, also differs in at most positions with the corresponding prefix of . Since 2.2 Implementing the distance oracle In this subsection for all arguments, it follows that . we show how to implement an approximate version of the A subroutine, which we call A'. The approximate References version proceeds as follows. Let be equal to A . Then A' returns such that [IKM00] P. Indyk, N. Koudas, and S. Muthukrishnan. Identifying where is the Hamming distance between representative trends in massive time series datasets using sketches. Proceedings of the 26th International Conference on Very Large Databases (VLDB), 2000. [Kar93] H. J. Karloff. Fast algorithms for approximately counting mismatches. Information Processing Letters, 48(2):53­60, 1993. and . In order to implement A', we show how to implement a subroutine D I S T , that returns a value between and H S&s g m¢ ¦ h% ¦ gss r 7 H 0y H D&s i r ¡ y H S&s g m¢ ¦ iss gss r pDh©g¦ i # # 4$¦ ## y% @ bv F H &s i r ¡ @r ev 5G&s g m¢ ¦ h8% P"0@ev H D&s i r ¡ @% iss y% F Hgs s r 7 H ¦ ! y% F iss y v H g s g U¢ ¦ F ss F F i (q pDh©g¦ q i (G4h0g¦ F i (q pDh0g¦ £ 5 £ U ¦ ¦ y 2s4H B £ r ¡ s% q y% v bi H £ B £ Hr ¢ s pi s r H B £ H ¢ U¢ ys I 2s4% H ¡ C q r h §¥ ¦ y % vB @b5£ i H £ H ¤ r ¢ 2spi EH £ C£ H C£ ¢ U¤ ¢ r r r £ sB y @% v ¤ r 4i r Dg r £ yy Qph h o n 2j©i h ED¦ C @% v (i r Qyg r B Dy 4i Dg £ Dy ¤ r (i r Qyg E£ £ y y yy yr qy 8GD% v ¤ r r 4i r Dg E£ pPH % v ¤ r (i r Qyg E£ H ¢ UAH @% y yr i y yr ¦ y y ¤ r 4i r Dg r £ yy v ¤ pDhg E£ y ¤ r 4i r Dg E£ i r yy % cy ¤ r @% ti r @% Bg r £ Dy ¤ r @% i r bv g E£ ¦ ¨¦ H y H y v y v y% r ©§ ©§ % Hy ¤ r @% ti r % v g r £ Dy ¤ r @% i r bwg r £ ¦ ¨¦ y ¤ r 4i r Dg r £ yHy y v y% v yy y @% v ¤ r (i r Qyg E£ yr y ¤ r (i r yQg r £ y ¤ pDh0g¦ i y @% v ¤ r 4i r Dg r £ yy i $% v ¤ 4h0g¦ ¤ Dp% 0g¦ i v y ¤ r 4i r Dg E£ y yr H %¦ h7 Scq h y ¤ r % Yi r % v g E£ y ¤ r @% yHy r y pQh o n 2j0i h q v# i r % # r v g# rr £ # DE£ % v ¤ r 4i r Dg E£ y y y r y 8q ED¦ C y y yr ¤ pDhg y ¤ r 4i r yDg E£ (i r (i r yQg r £ y r yy i ¤ ¡ g r £ yi H 4h e¡ H ¢ 2ss ¢ U¢ ¤ r r E£ ¡ ¡ GV©7¦§¥ H % ¦ GQC¦ F y (i Qyg r y ¤g H y H y v r y% v y v r % cy ¤ r @% ti r @% Xg E£ y ¤ r bei r @% wg E£ 7 9h86 ¨ ¦ 9 34210)§¥ $R786 £ Y3421Y¨ W 3p1 )50¨§¥ ` ) ¦ (q @%v ¤ r 4i r Dg rE£ piV%v ¤ r y4i r Dg E£ H ¢ U@@v ¤ r 4i r Dg E£ ¦ ¨¦ y y y Hy yr ¦ Hy% y yr ©§ y ¤ r y4i r Dg E£ yr sss 9 & ¡ @¡ C e¡ ¢ RpR©g¦ C i ¥ ¨ a3421) # r y # r y # r ¤ y # E£ q ¡ g ¡ r 20 % % p 87h 6 254 31) ('n &'4 #$" £ y H 4i p e¡ H ¢ s s ¢ U¢ ¢ l 7 H % g s s ¡ # ¤¡ ¦ ¤ r y r 86 £ C V7 AS¦ p&s £ r E C ss Cs # h(6 A3p1 ©)§¥ 97 ¨ ¦ 7 3) $9h86 ¨ 9 42150¨¦§¥ Rs&tCv i Gs&s g y (i Qyg ¢ ¨ V&& ¡ p% ¢ E¨ 2ss ¢ m¢ sss yr C ¡ ¦ £ SV7 ¡ H S$EQC¦ F %¦ kj0i h q Thus, it suffices to consider the case . Denote 652 ¡ pDh©g¦ i As stated before, instead of searching for a match, we search for a -blockmatch. Given the blockmatch, and setting slightly smaller than needed, such that is an integer, we can then construct a -match for the original problem. The algorithm proceeds by finding the blockmatch of with a prefix of for . For each , we use dynamic programming. Specifically, use a 3-dimensional table . The entry for , , , denotes the size of the longest prefix of , such that is a blockmatch of that uses insertions or deletions and has blocks (each block with substitutions). We initialize to and to all the other entries. We fill the table using the recurrence, that states that is equal to . The A' then can be implemented using calls to D I S T via binary search. It is not difficult to observe that the calls can be arranged in such a way that is always a power of . An implementation of D I S T, for the case when is a set of real numbers, is the (square of the) Euclidean distance and is fixed has been given in [IKM00]. It achieves the following parameters: preprocessing time and query time . For our purpose (i.e., with being the Hamming metric), we first as in [Kar93], obtaining and map to . Then we use the fact that, for binary strings, Hamming distance and the square of the Euclidean distance are the same. We build [IKM00] data structures that support operations D I S T in and for . The final data structure for A' has the following parameters: preprocessing time and query time . C ¨3 A421) 0y H D&s i r ¡ y iss F &REQC¦ 7 og l ©ph h S n l 2j0i h QC¦ g