Interpolation Search for Non-Independent Data

Erik D. Demaine    Thouis Jones    Mihai Pătraşcu
MIT Computer Science and Artificial Intelligence Laboratory, {edemaine,thouis,mip}@mit.edu

Abstract

We define a deterministic measure of "well-behaved data" that enables searching along the lines of interpolation search. Specifically, define Δ to be the ratio of distances between the farthest and nearest pair of adjacent elements. We develop a data structure that stores a dynamic set of n integers subject to insertions, deletions, and predecessor/successor queries in O(lg Δ) time per operation. This result generalizes interpolation search and interpolation search trees smoothly to nonrandom (in particular, non-independent) input data. In this sense, we capture the amount of "pseudorandomness" required for effective interpolation search.

1 Introduction

Interpolation search is a classic method for searching through ordered random data; it attains a running time of O(lg lg n), which is exponentially better than binary search. The original method [5] was analyzed only for uniformly distributed data [7, 4, 2]. Willard [6] later generalized the analysis to arbitrary "regular distributions", without the algorithm having to know the distribution. Mehlhorn and Tsakalidis [3] further generalized interpolation search to handle the dynamic case of insertions and deletions in addition to searches, and to a wider class of "smooth" distributions. Andersson and Mattsson [1] further extended the technique of Mehlhorn and Tsakalidis to a larger class of distributions and obtained better bounds on searches and updates.

All of these results rely on the data being drawn independently from some probability distribution. In this paper, we remove this independence assumption and instead capture the necessary properties purely deterministically via a sort of "pseudorandomness" measure. We start with the static case in Section 2 and generalize to the dynamic case in Section 3.

2 Searching Static Data

We first present a method for searching static data. Suppose we are given n distinct values x1 < x2 < · · · < xn in sorted order. Define the maximum gap ratio of the data as Δ = max_i (x_i − x_{i−1}) / min_i (x_i − x_{i−1}). We make no requirements on the data being independently drawn.

Evenly subdivide the interval [x1, xn] into n bins B1, B2, . . . , Bn, each representing a range of size (xn − x1)/n. Each bin stores a balanced binary search tree for the elements lying within its range, plus the nearest neighbor above and below. Searching for an element y then proceeds by interpolating on y to find the bin Bi that it lies in, i = ⌈n(y − x1)/(xn − x1)⌉, and performing a search in the binary search tree of that bin. The array of bins can be constructed in O(n) time and space, assuming the data are already sorted.

Theorem 2.1. The worst-case search time is O(lg Δ).

Proof. The maximum distance between two adjacent values is at least (xn − x1)/n. Given Δ, the minimum distance between two adjacent values must therefore be at least (xn − x1)/(Δn). Because the bins are of size (xn − x1)/n, no more than Δ values can land in a single bin, and the search time of the binary search tree in a bin occupied by Δ values is O(lg Δ).

We note the following interesting traits of this algorithm:

1. The algorithm is oblivious to the value of Δ.

2. The worst-case search time is also O(lg n) and thus O(lg min{Δ, n}).

3. The algorithm reproduces the O(lg lg n) performance of interpolation search on data drawn independently from the uniform distribution, by the following lemma:

Lemma 2.1. For data drawn independently from the uniform distribution, Δ = O(polylog(n)) with high probability.

Proof. We use standard Chernoff bounds on n balls thrown into m bins. Consider the uniform distribution over [0, u]. If m = O(n/lg n), then every bin contains at least one ball with high probability; thus max_i (x_i − x_{i−1}) = O(u lg n / n) with high probability.
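To make the static structure concrete, here is a minimal Python sketch of the binning scheme just described. It is an illustration under stated simplifications, not the paper's implementation: sorted Python lists stand in for the balanced binary search trees, and an explicit scan to a neighboring bin replaces the stored nearest neighbors above and below each bin; the names StaticGapSearch, predecessor, and _bin_index are ours.

```python
import bisect


class StaticGapSearch:
    """Sketch of the static structure of Section 2.

    The n sorted values are distributed into n equal-width bins over
    [x1, xn]. A sorted Python list stands in for each bin's balanced
    BST, so a query costs O(lg Delta) comparisons once its bin is found.
    """

    def __init__(self, xs):
        # xs: n >= 2 distinct values in sorted order.
        self.xs = list(xs)
        self.n = len(self.xs)
        self.lo, self.hi = self.xs[0], self.xs[-1]
        self.width = (self.hi - self.lo) / self.n  # bin size (xn - x1)/n
        self.bins = [[] for _ in range(self.n)]
        for x in self.xs:  # O(n) construction from sorted input
            self.bins[self._bin_index(x)].append(x)

    def _bin_index(self, y):
        # Interpolate on y; clamp so out-of-range queries stay legal.
        if y <= self.lo:
            return 0
        i = int((y - self.lo) / self.width)
        return min(i, self.n - 1)

    def predecessor(self, y):
        """Largest stored value <= y, or None if no such value exists."""
        i = self._bin_index(y)
        j = bisect.bisect_right(self.bins[i], y)
        if j > 0:
            return self.bins[i][j - 1]
        # No predecessor within the bin: scan lower bins. The paper keeps
        # this step O(1) by storing each bin's nearest neighbor below/above.
        for k in range(i - 1, -1, -1):
            if self.bins[k]:
                return self.bins[k][-1]
        return None
```

For example, StaticGapSearch([2, 3, 5, 8, 13]).predecessor(7) interpolates into the bin covering 7 and returns 5.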
If m = Ω(n lg n), then every adjacent pair of bins contains at most one ball with high probability; thus min_i (x_i − x_{i−1}) = Ω(u/(n lg n)) with high probability. Taking the ratio, Δ = O(lg² n) with high probability.

3 Searching Dynamic Data

The dynamic version of our structure supports insertions, deletions, and searches in O(lg Δmax) time per operation, where Δmax is the largest value of Δ over the lifetime of the structure; the insertion and deletion bounds are amortized. At any point in time, the structure is valid for sets of data with Δ less than some Δ̂ for which the structure was built.

The only modification to the static structure is that it spans the larger range [x1 − Δ̂L, xn + Δ̂L], where L = xn − x1. This superinterval is uniformly subdivided into n bins, each of size (2Δ̂ + 1)L/n. Each bin stores a dynamic balanced binary search tree.

The structure is rebuilt if either n/2 updates (inserts or deletes) occur without a rebuild, or an update causes the Δ of the data to become larger than Δ̂ (i.e., when the structure is no longer valid). In the latter case, we rebuild the structure with Δ̂ = max(Δnew, 2Δ̂old), where Δnew is computed from the data, and Δ̂old was the value of Δ̂ immediately before the rebuild.

We need two main properties: (a) the structure is always valid, i.e., no xi outside the superinterval can be inserted into the structure; and (b) the number of values landing in a single bin is polynomial in Δ̂. The following two theorems establish the validity of the structure during a set of n/2 updates that do not cause Δ to grow.

Theorem 3.1. None of the n/2 operations immediately after a rebuild can add a value outside the range [x1 − Δ̂L, xn + Δ̂L].

Proof. Consider only the initial elements confined to [x1, xn]. The initial minimum separation between these elements is no more than L/(n − 1); after k deletions, the minimum separation is no more than L/(n − k − 1). Given k deletions, there are n/2 − k insertions before a rebuild of the structure. (Because the minimum separation cannot grow from insertions, it is sufficient to consider deletions followed by insertions.) Each such insertion can be only Δ̂L/(n − k − 1) above the previous largest value in the structure, since a larger new gap would force Δ > Δ̂. Therefore, the maximum value that can be inserted in n/2 editing operations is xn + (n/2 − k) · Δ̂L/(n − k − 1), which is bounded above by xn + Δ̂L for k ≤ n/2. By symmetry, we can obtain a similar bound for the minimum value inserted.

Theorem 3.2. The maximum number of elements in a bin after n/2 insertions or deletions is O(Δ̂²).

Proof. Again, consider the initial elements in [x1, xn]. After k < n − 2 deletions, the maximum separation between two of these elements is at least L/(n − k − 1). Therefore, the minimum separation between any two elements is at least L/(Δ̂(n − k − 1)). The bins are of size (2Δ̂ + 1)L/n, so the maximum number of elements that can end up in a single bin is Δ̂(2Δ̂ + 1)(n − k − 1)/n = O(Δ̂²).
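Before establishing the time bounds, here is a hedged Python sketch of the rebuild policy just analyzed. It models only the policy (when a rebuild fires and how Δ̂ grows); a plain sorted list stands in for the bin array, which would be rebuilt as in the static sketch above, recomputing Δ from scratch on each update is for clarity only, and only insertion is shown (deletion would mirror it). All names are our own illustration.

```python
import bisect


class DynamicGapSearch:
    """Sketch of the global-rebuilding policy of Section 3.

    Rebuild after n/2 updates, or as soon as the data's gap ratio Delta
    exceeds the Delta-hat the structure was built for; in the latter case
    the new Delta-hat is max(Delta_new, 2 * Delta-hat_old).
    """

    def __init__(self, xs):
        self._rebuild(sorted(xs), delta_hat=None)

    @staticmethod
    def _gap_ratio(xs):
        # Values are assumed distinct, so every gap is positive.
        gaps = [b - a for a, b in zip(xs, xs[1:])]
        return max(gaps) / min(gaps) if gaps else 1.0

    def _rebuild(self, xs, delta_hat):
        self.xs = xs
        self.n = len(xs)
        self.updates = 0
        delta_new = self._gap_ratio(xs)
        self.delta_hat = delta_new if delta_hat is None else max(delta_new, delta_hat)
        # Here the superinterval [x1 - Dhat*L, xn + Dhat*L] would be cut
        # into n bins of size (2*Dhat + 1)*L/n, as in the static sketch.

    def insert(self, y):
        bisect.insort(self.xs, y)
        self.updates += 1
        if self._gap_ratio(self.xs) > self.delta_hat:
            # Structure no longer valid: rebuild with Delta-hat at least doubled.
            self._rebuild(list(self.xs), delta_hat=2 * self.delta_hat)
        elif self.updates >= self.n // 2:
            # Periodic rebuild keeps Theorems 3.1 and 3.2 applicable.
            self._rebuild(list(self.xs), delta_hat=None)
```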
Finally, we establish the desired time bounds:

Theorem 3.3. The worst-case cost of searches and the amortized cost of insertions and deletions is O(lg Δmax) time per operation.

Proof. The cost for searches follows from Theorem 3.2, as does the immediate cost of insertions and deletions. We use a charging scheme against updates to amortize the cost of rebuilds; the structure can be rebuilt in linear time. If the structure is rebuilt because of a sequence of n/2 updates without a rebuild, then the rebuild can be charged to these updates to give an amortized cost of O(1). If a rebuild occurs because of a change in Δ̂, then we charge the cost of the rebuild to the original insertions of the elements currently in the structure. The cost of inserting one such element was O(lg Δ̂old). Increasing Δ̂ raises Δmax and hence the amortized cost of the original insertion by lg Δ̂ − lg Δ̂old = lg(Δ̂/Δ̂old) = Ω(1), because Δ̂ ≥ 2Δ̂old; it is to this increase per element that we charge the rebuild cost.

If the value of Δmax is known a priori, then we can always build the data structure with Δ̂ = Δmax and avoid rebuilding because of a changing Δ̂. A standard de-amortization of the remaining global rebuilds, which occur once per n/2 updates, yields worst-case time bounds.

References

[1] A. Andersson and C. Mattsson. Dynamic interpolation search in o(log log n) time. In Proceedings of ICALP, pages 15–27, 1993.
[2] G. Gonnet, L. Rogers, and G. George. An algorithmic and complexity analysis of interpolation search. Acta Informatica, 13(1):39–52, 1980.
[3] K. Mehlhorn and A. Tsakalidis. Dynamic interpolation search. Journal of the ACM, 40(3):621–634, July 1993.
[4] Y. Perl, A. Itai, and H. Avni. Interpolation search – a log log N search. Communications of the ACM, 21(7):550–554, 1978.
[5] W. W. Peterson. Addressing for random-access storage. IBM Journal of Research and Development, 1(4):130–146, 1957.
[6] D. Willard. Searching unindexed and nonuniformly generated files in log log N time. SIAM Journal on Computing, 14:1013–1029, 1985.
[7] A. C. Yao and F. F. Yao. The complexity of searching an ordered random table. In Proceedings of FOCS, pages 173–177, 1976.