Adaptive Sampling for Quickselect

Conrado Martínez†    Daniel Panario‡    Alfredo Viola§

September 26, 2003

Abstract

Quickselect with median-of-3 is largely used in practice and its behavior is fairly well understood. However, the following natural adaptive variant, which we call proportion-from-3, had not been previously analyzed: "choose as pivot the smallest of the sample if the rank of the sought element is small, the largest if the rank is large, and the median if the rank is medium". We first analyze proportion-from-2 and then proportion-from-3. We also analyze ν-find, a generalization of proportion-from-3 with interval breakpoints at ν and 1 − ν. We show that there exists an optimal value of ν and we also provide the range of values of ν where ν-find outperforms median-of-3. Our results strongly suggest that a suitable implementation of this variant could be the method of choice in a practical setting. Finally, we also show that proportion-from-s and similar strategies are optimal when s → ∞.

∗ The research of the first author was supported by the Future and Emergent Technologies programme of the EU under contract IST-1999-14186 (ALCOM-FT) and DGES PB98-0926 (AEDRI). The second author was funded by NSERC grant number 238757. The third author was supported in part by Proyecto de Investigación CSIC Fondos 2000-2002 and 2002-2004.
† Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, E-08034 Barcelona, Spain. conrado@lsi.upc.es
‡ School of Mathematics and Statistics, Carleton University, Ottawa, K1S 5B6, Canada. daniel@math.carleton.ca
§ LIPN - CNRS UMR 7030, Université de Paris-Nord, 93430 Villetaneuse, France. viola@lipn.univ-paris13.fr. Permanent address: Instituto de Computación, Universidad de la República, Casilla de Correo 16120, Distrito 6, Montevideo, Uruguay. viola@fing.edu.uy

Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for Industrial and Applied Mathematics. All Rights Reserved. Printed in The United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

1 Introduction

Hoare's quickselect [5] algorithm selects the mth smallest element (equivalently, the element of rank m in ascending order, the mth order statistic) out of an array of n elements by picking an element from the array -- the pivot -- and rearranging the array so that elements smaller than the pivot are to its left and elements larger than the pivot are to its right. If the pivot has been brought to position j = m then it is the sought element; otherwise, if m < j then the procedure is recursively applied to the subarray to the left of the pivot, and if m > j the process continues in the right subarray. A similar principle is used in the celebrated quicksort algorithm [6], also by Hoare; once the pivot is brought into place by the partitioning of the array, the subarrays to its left and right are recursively sorted. For the remainder of this paper we will use α = m/n to denote the relative rank of the sought element.

Quickselect performs well in practice, its average number of comparisons being linear: C^(0)_{n,m} = m_0(α)·n + o(n). In particular,

    m_0(α) = lim_{n→∞, m/n→α} C^(0)_{n,m}/n = 2·(1 + H(α)),    0 ≤ α ≤ 1,

where C^(0)_{n,m} denotes the average number of comparisons made by quickselect to select the mth element out of n, and H(x) = −x·ln x − (1−x)·ln(1−x) is the entropy function (see [4, 8, 9, 10]). Another quantity of interest is the average number of comparisons C̄^(0)_n to locate an element of random rank, i.e., when m is given by a uniformly distributed random variable in {1, …, n}; since C̄^(0)_n = (1/n)·Σ_{m≤n} C^(0)_{n,m}, we have m̄_0 = lim_{n→∞} C̄^(0)_n/n = 3.

Using the median of a small sample of s = 2t + 1 elements as the pivot of each recursive stage yields a substantial improvement over the standard variant, reducing the average number of comparisons and making worst-case behavior less likely. For samples of three elements [1, 3, 7] it is known that C^(1)_{n,m} = m_1(α)·n + o(n), with

    m_1(α) = lim_{n→∞, m/n→α} C^(1)_{n,m}/n = 2 + 3·α·(1 − α),    0 ≤ α ≤ 1,

where, for any t, C^(t)_{n,m} denotes the average number of comparisons made by quickselect with median-of-(2t+1) to select the mth out of n. Also, with C̄^(t)_n denoting the average number of comparisons to locate an element of random rank using quickselect with median-of-(2t+1), we have m̄_1 = lim_{n→∞} C̄^(1)_n/n = 5/2. Little is known for t > 1, other than the average and variance of the number of comparisons to select an element of random rank out of n [11].

However, median-of-(2t+1) sampling is not a natural pivot selection strategy for quickselect. For instance, if we were looking for the 100th element in a collection of 1000 elements it would seem more natural to pick the smallest element of a sample of three rather than the median. In general, it seems better and more logical to pick an element whose rank in the sample is close to α·s if we have to select the element of rank m = α·n. We call this sampling strategy proportion-from-s. A similar idea lies behind Floyd and Rivest's Select algorithm [2]; this algorithm sorts a variable-size sample at each recursive stage to obtain two pivots. Because of the costly selection of pivots, the algorithm is not very efficient in practice. Finely tuned implementations of Hoare's quickselect (or its variants) are preferred and used for system and general-purpose libraries such as the Standard Template Library of C++ (see for instance [12]).
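The partition-and-recurse scheme described at the start of this section is easy to state in code. The sketch below is ours, not the authors' implementation; function names are illustrative, the sampling is done with replacement for brevity, and distinct elements are assumed. The pivot-selection rule is pluggable, which is exactly the degree of freedom the rest of the paper studies; `median_of_3` mimics the median-of-3 variant discussed above.

```python
import random

def select(a, m, choose_pivot):
    """Return the m-th smallest (1 <= m <= len(a)) of a sequence of
    distinct elements: partition around a pivot, then continue in the
    side that contains the sought rank (iterative tail recursion)."""
    a = list(a)
    m -= 1  # switch to zero-based rank
    while True:
        if len(a) == 1:
            return a[0]
        pivot = choose_pivot(a, m + 1)
        smaller = [x for x in a if x < pivot]
        larger = [x for x in a if x > pivot]
        j = len(smaller)  # zero-based position the pivot lands in
        if m == j:
            return pivot
        if m < j:
            a = smaller        # sought element lies left of the pivot
        else:
            a = larger         # ... or right of it
            m -= j + 1

def median_of_3(a, m):
    """Pivot rule: median of a sample of 3 drawn with replacement
    (a simplification); the sought rank m is ignored."""
    return sorted(random.choice(a) for _ in range(3))[1]
```

For example, `select(range(10), 3, median_of_3)` returns 2, the third smallest element.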
There are no previous results about the performance of quickselect with proportion-from-s sampling, nor even has this variant been formally proposed, to the best of the authors' knowledge. We tackle its analysis in this paper, beginning in Section 2 with the recurrence for the average cost of proportion-from-s sampling and a characterization of f(α) = lim_{n→∞, m/n→α} C_{n,m}/n as the fixed point of an integral equation. In Sections 3 and 4 we explicitly solve the equations for the proportion-from-2 and proportion-from-3 strategies. Our analysis shows that proportion-from-3 beats median-of-3, on the average, for all relative ranks α, with the exception of 0.201… < α < 1/3 and 2/3 < α < 0.798…. The fact that there are ranges of α where proportion-from-3 does not perform better on average than median-of-3 is a surprising outcome of our analysis. In Section 5 we consider a variant of proportion-from-3 where we choose the smallest element of the sample if α ≤ ν, the median if ν < α < 1 − ν, and the largest if α ≥ 1 − ν, and explore various parameters as ν varies. Many interesting phenomena appear here, including the existence of an optimal value of ν, in the sense that this particular choice performs better on average than any other value of ν. We also examine the range of values of ν such that this variant is better, on average, than median-of-3. Since the only changes to the basic algorithm involve the selection of the pivot for each recursive stage, all these variants can be easily implemented. In Section 6 we consider proportion-from-s when s > 3 and show that when s → ∞ the main term in the average number of comparisons made is theoretically optimal. Finally, in Section 7 we briefly comment on further research.

2 General results

We begin this section with the derivation of the integral equation satisfied by f(α) = lim_{n→∞, m/n→α} C_{n,m}/n when we use proportion-from-s sampling. Actually, we generalize these results to a broader class of algorithms that we call adaptive sampling strategies.
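The closed forms m_0(α) and m_1(α) from the Introduction serve as the baselines against which every strategy below is compared, and they are easy to evaluate numerically. The helper below is ours (not part of the paper); it is convenient for reproducing the comparison values quoted later, such as m_0(1/2) = 3.386… and m_1(1/2) = 11/4.

```python
from math import log

def H(x):
    """Entropy function H(x) = -x ln x - (1 - x) ln(1 - x)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log(x) - (1 - x) * log(1 - x)

def m0(alpha):
    """Main-term coefficient of standard quickselect: 2 (1 + H(alpha))."""
    return 2.0 * (1.0 + H(alpha))

def m1(alpha):
    """Main-term coefficient of median-of-3 quickselect: 2 + 3 alpha (1 - alpha)."""
    return 2.0 + 3.0 * alpha * (1.0 - alpha)
```

Averaging either function over α ∈ [0, 1] recovers the random-rank constants m̄_0 = 3 and m̄_1 = 5/2.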
We consider that, at each stage, the rank r of the selected pivot is a function of α = m/n, the ratio of the current rank m to the current size n. Let π^(s,r)_{n,j} be the probability that the rth element of a sample of s elements is the jth element of a random permutation of size n. Clearly,

(2.1)    π^(s,r)_{n,j} = (j−1 choose r−1)·(n−j choose s−r) / (n choose s),    1 ≤ r ≤ s ≤ n,  1 ≤ j ≤ n.

The denominator is the number of ways to pick a sample of size s out of n elements; the numerator is the number of ways to choose r − 1 elements smaller than the pivot times the number of ways to choose s − r elements larger than the pivot. Now we are ready to set up a recurrence for the average number of comparisons made to select the mth element out of n: since n − 1 comparisons are needed to partition the array around the pivot¹ and we stop if the pivot is the sought element, we have

(2.2)    C_{n,m} = n + Θ(1) + Σ_{j=m+1..n} π^(s,r)_{n,j}·C_{j−1,m} + Σ_{j=1..m−1} π^(s,r)_{n,j}·C_{n−j,m−j}.

We assume that r(α) is an integer staircase function defined by a finite collection of steps. We say that a sampling strategy defined by such a function is adaptive. The functions that do not satisfy this assumption would behave rather strangely and are most likely useless. Such a function r can be described by the image of each interval in a finite partition of [0, 1], say I_1, I_2, …, I_ℓ, with endpoints a_0 = 0 < a_1 < a_2 < ··· < a_ℓ = 1; we denote by r_k the value of r on I_k. For convenience, we assume I_1 = [0, a_1], I_2 = (a_1, a_2], …, I_{ℓ−1} = [a_{ℓ−2}, a_{ℓ−1}), and I_ℓ = [a_{ℓ−1}, 1]. If ℓ is odd then we take the middle interval I_{(ℓ+1)/2} = (a_{(ℓ−1)/2}, a_{(ℓ+1)/2}); if ℓ is even we take I_{ℓ/2} = (a_{ℓ/2−1}, a_{ℓ/2}] and I_{ℓ/2+1} = (a_{ℓ/2}, a_{ℓ/2+1}). Notice that the intervals to the left of the middle interval, except for the first, are semiopen to the left; symmetrically, intervals to the right are semiopen to the right, except for the last.

¹ The comparisons required to select the pivot are accounted for in the Θ(1) term of (2.2), as s = Θ(1).
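Equation (2.1) is a ratio of binomial coefficients and can be checked directly. The snippet below is our illustration (names are ours): for each fixed r the probabilities over j sum to 1, which is a Vandermonde-type identity, and s = r = 1 recovers the uniform pivot of standard quickselect.

```python
from math import comb

def pivot_rank_prob(n, j, s, r):
    """Probability (2.1) that the r-th smallest element of a random
    s-sample from a random n-permutation is the j-th smallest overall.
    math.comb returns 0 for out-of-range cases, matching the formula."""
    return comb(j - 1, r - 1) * comb(n - j, s - r) / comb(n, s)
```

For example, with s = 3 and r = 2 the distribution over j is symmetric around the middle of {1, …, n}, as one would expect for the sample median.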
A special case is ℓ = 2, where we take I_1 = [0, a_1] and I_2 = (a_1, 1]. The important point here is to cover the full range [0, 1] in a "symmetric" way (there is a slight asymmetry for the middle intervals when ℓ is even). For instance, median-of-(2t+1) is defined by a single interval (ℓ = 1) and r_1 = t + 1: no matter what the value of α is, we always choose the median of the sample. On the other hand, we can now formally define proportion-from-s sampling: it is the sampling strategy defined by ℓ = s intervals, with a_k = k/s and r_k = k. For instance, if s = 3 then r(α) = 1 for α ∈ [0, 1/3], r(α) = 2 for α ∈ (1/3, 2/3), and r(α) = 3 if α ∈ [2/3, 1].

The main result of this section intuitively establishes that C_{n,m} = f(α)·n + o(n), with α = m/n, and it gives a precise characterization of f(α).

Theorem 2.1. Let C_{n,m} be the average number of comparisons to select the mth out of n elements using any adaptive sampling strategy, with m/n → α for 0 ≤ α ≤ 1 as n → ∞. Then we have

    f(α) = lim_{n→∞} C_{n,m}/n,    with f(α) = f_k(α) if α ∈ I_k, 1 ≤ k ≤ ℓ,

and

    f_k(α) = 1 + s!/((r_k−1)!·(s−r_k)!) × [
          ∫_{α/a_k}^{1} f_k(α/x)·x^{r_k}·(1−x)^{s−r_k} dx
        + ∫_{0}^{(α−a_{k−1})/(1−a_{k−1})} f_k((α−x)/(1−x))·x^{r_k−1}·(1−x)^{s+1−r_k} dx
        + Σ_{d=k+1..ℓ} ∫_{I′_d} f_d(α/x)·x^{r_k}·(1−x)^{s−r_k} dx
        + Σ_{d=1..k−1} ∫_{I″_d} f_d((α−x)/(1−x))·x^{r_k−1}·(1−x)^{s+1−r_k} dx ],

with I′_d = (α/a_d, α/a_{d−1}) and I″_d = ((α−a_d)/(1−a_d), (α−a_{d−1})/(1−a_{d−1})).

Proof. [Sketch of proof] The proof of this theorem closely follows the proofs in [3, 4] for the standard and median-of-(2t+1) variants. The technical details of the proof are involved, but the intuition behind it is rather simple: divide both sides of (2.2) by n, take j = x·n in the summations, and substitute C_{j−1,m} by f(α/x)·x·n and C_{n−j,m−j} by f((α−x)/(1−x))·(1−x)·n. Also, when n → ∞ and α ∈ I_k we can replace n·π^(s,r_k)_{n,j} by the asymptotic estimate

    n·π^(s,r_k)_{n,j} ≈ s!/((r_k−1)!·(s−r_k)!) · x^{r_k−1}·(1−x)^{s−r_k}.

Passing to the limit when n → ∞, the summations become integrals, thus yielding the equations stated in the theorem. Since r is an adaptive function, the range of α must be broken into intervals, giving the piecewise definition of f(α).

We say that an adaptive sampling strategy is symmetric if lim_{z→α⁻} r(1 − z) = s + 1 − lim_{z→α⁻} r(z) for all α. This definition properly captures the symmetric nature of the algorithm. Indeed, if we use as a pivot the rth smallest element in the sample of size s when searching for the element with relative rank α, it is reasonable to choose the rth largest (equivalently, the (s+1−r)th smallest) element while searching for the element of relative rank 1 − α. The actual definition using limits from the left is necessary to make it valid also for samples of even size. Notice that for any symmetric sampling strategy we have a_k = 1 − a_{ℓ−k} if k < s/2; furthermore, if the number of intervals is even then a_{ℓ/2} = 1/2.

It is not too difficult to obtain the following lemmas (due to space limitations, we do not include proofs in this extended abstract).

Lemma 2.1. For any symmetric adaptive sampling strategy, we have f(α) = f(1 − α). More precisely, f_k(α) = f_{ℓ+1−k}(1 − α) if α ∈ I_k.

Lemma 2.2. Let r_0 = lim_{α→0⁺} r(α). For any adaptive sampling strategy we have

    lim_{α→0⁺} f(α) = (s + 1)/(s + 1 − r_0).

Now we can easily rederive the known fact [3] that for median-of-(2t+1), lim_{α→0⁺} m_t(α) = ((2t+1) + 1)/((2t+1) + 1 − (t+1)) = 2, for any t. On the other hand, if we use proportion-from-s sampling or a similarly inspired strategy such that r_0 = 1, then lim_{α→0⁺} f(α) = 1 + 1/s, which proves that significant gains can be expected from alternative sampling strategies, even for s = 2.

In order to provide an explicit solution of the integral equations in Theorem 2.1, we transform, after careful and lengthy manipulations, the original problem into one of ordinary differential equations.

Lemma 2.3. For α ∈ I_k, 1 ≤ k ≤ ℓ, and any adaptive sampling strategy,

    d^{s+2}f_k/dα^{s+2} = (−1)^{s+1−r_k} · s!/(r_k−1)! · (1/α^{s+1−r_k}) · d^{r_k+1}f_k/dα^{r_k+1}
                        + s!/(s−r_k)! · (1/(1−α)^{r_k}) · d^{s+2−r_k}f_k/dα^{s+2−r_k}.
Since proportion-from-s is a symmetric strategy, we only have to consider the equations for 1 ≤ k ≤ ⌈s/2⌉, and the order of the ordinary differential equation satisfied by each f_k can be reduced. Let φ_k(x) = d^{k+1}f_k/dx^{k+1}. Then, for all 1 ≤ k ≤ ⌈s/2⌉,

(2.3)    d^{s+1−k}φ_k/dx^{s+1−k} − s!/(s−k)! · 1/(1−x)^k · d^{s+1−2k}φ_k/dx^{s+1−2k} − (−1)^{s+1−k} · s!/(k−1)! · 1/x^{s+1−k} · φ_k(x) = 0.

An important special case of the ODE above is the central interval (k = (s+1)/2) when s is odd. The ODE is then identical to the corresponding ODE for median-of-s.

The problem with the differential equations above, besides the intrinsic difficulty of solving high order linear differential equations, is that initial conditions are hard to establish, other than the limiting value f(0). Recall that f(α) is in general discontinuous and hence, in order to obtain it, we should know f_k(a_{k−1}), f′_k(a_{k−1}), …, f^(s+1)_k(a_{k−1}) for every k, 1 ≤ k ≤ ℓ. In order to overcome this problem, we use a different technique: namely, we substitute the f_k's in the integral equation(s) by the general form of the solution to the corresponding differential equation and fix the values of the involved constants by equalizing both sides. When the sampling strategy is symmetric the problem is somewhat simpler, but the essential obstacles remain.

3 Proportion-from-2

Let us begin with the simplest "proportion-from" strategy: s = 2. Solving (2.3) is not very difficult in this case and can even be said to be routine. Here, we have just to consider one piece, namely φ_1, since by symmetry we know that f_2(x) = f_1(1 − x). Equation (2.3) is then

(3.4)    d²φ_1/dx² − 2/(1−x) · dφ_1/dx − 2/x² · φ_1(x) = 0,

with x = 0 and x = 1 its regular singular points. We remark that φ_1(x) = d²f_1/dx². The corresponding indicial equation is λ·(λ−1) − 2 = 0, whose solutions are λ = 2 and λ = −1. This entails a solution [13] of the form φ_1(x) = y_1(x) + y_2(x), where

    y_1(x) = Σ_{n≥0} A_n·x^{n+2},
    y_2(x) = Σ_{n≥0} B_n·x^{n−1} + A·y_1(x)·ln x,

for some coefficients {A_n}_{n≥0} and {B_n}_{n≥0}. Substituting the proposed form for φ_1 into the differential equation and integrating φ_1 twice, we finally get

    f_1(x) = a·((x − 1)·ln(1 − x) + x³/6 + x²/2 − x) − b·(1 + H(x)) + c·x + d,

for some constants a, b, c and d yet to be determined. The difficult part here is to obtain the values of these constants. The known value f_1(0) = 3/2 gives d − b = 3/2, but the successive derivatives of f_1 at α = 0 are infinite and this information cannot be used to fix the value of the remaining constants. The painful process is to substitute the general expressions for f_1 and f_2 into the integral equation (Theorem 2.1) and equalize. Finally, one gets

    a = −2 + (3/2)·(4 ln 2 − 3)/(3 ln 2 − 2),    b = −2,
    c = −(1/8)·(4 ln 2 − 3)/(3 ln 2 − 2),    d = −1/2.

The maximum of f(α) is at α = 1/2. Indeed, we have f′(0) = +∞ and f′(1/2) > 0. It is easy to check, differentiating one more time, that f′(α) is strictly decreasing, and hence always positive. At α = 1/2, the cost is f_1(1/2) = f_2(1/2) = (1/96)·(576 ln² 2 − 253)/(3 ln 2 − 2) = 3.112…. Compare with m_0(1/2) = 3.386… for standard quickselect and m_1(1/2) = 11/4 = 2.75 for median-of-3 quickselect².

It is not difficult to show that C̄_n = (1/n)·Σ_{m≤n} C_{n,m} = f̄·n + o(n), where f̄ = ∫_0^1 f(x) dx. For proportion-from-2 we have thus

    f̄ = ∫_0^1 f(x) dx = 2·∫_0^{1/2} f_1(x) dx = (3/128)·(320 ln 2 − 213)/(3 ln 2 − 2) = 2.598…,

which tells us that proportion-from-2 makes about 2.6n comparisons on average. Compare with the 3n comparisons of the standard variant and the 2.5n comparisons of the median-of-3 variant (recall that m̄_0 = 3 and m̄_1 = 2.5). It is interesting to notice that f(α) ≤ m_0(α) for all 0 ≤ α ≤ 1. Compared to median-of-3, proportion-from-2 is better when α ≤ 0.140… (symmetrically, when α ≥ 0.859…) and is worse otherwise (see Figure 1). At that percentile, both algorithms make on average 2.362…n comparisons. However, it is a bit unfair to compare median-of-3 and proportion-from-2, since these strategies use a different number of elements in the samples (and standard quickselect uses samples of size s = 1!).

Figure 1: Plot of m_0(α), m_1(α) and f(α) for proportion-from-2.

² We only give four figures in the numerical values in this section and the following. But it is relatively easy to obtain a high degree of accuracy -- indeed, we have computed all the numerical values given in the paper with up to twenty digits of accuracy.

4 Batfind: Proportion-from-3

The steps needed to analyze batfind (a nickname for proportion-from-3) are similar to those of the previous section. In this case we have the following three functions: f_1(x) when x ∈ [0, 1/3], f_2(x) when x ∈ (1/3, 2/3), and f_3(x) when x ∈ [2/3, 1]. By symmetry we have f_3(x) = f_1(1 − x) and f_2(x) = f_2(1 − x). This implies that we need to solve only two differential equations, namely,

(4.5)    d³φ_1/dx³ − 3/(1−x) · d²φ_1/dx² + 6/x³ · φ_1(x) = 0,
         d²φ_2/dx² − 6·(1/x² + 1/(1−x)²) · φ_2(x) = 0,

with φ_1(x) = d²f_1/dx², φ_2(x) = d³f_2/dx³, and x = 0 and x = 1 their regular singular points. Using the same procedure as in Section 3 we get

    f_1(x) = −C_0·(1 + H(x)) + C_1 + C_2·x + C_3·K_1(x) + C_4·K_2(x),

where

    K_1(x) = cos(√2·ln x)·Σ_{n≥0} A_n·x^{n+4} + sin(√2·ln x)·Σ_{n≥0} B_n·x^{n+4},
    K_2(x) = sin(√2·ln x)·Σ_{n≥0} A_n·x^{n+4} − cos(√2·ln x)·Σ_{n≥0} B_n·x^{n+4},

with

    A_n = (n+2)·(n+5) / ((n²+6n+11)·(n²+8n+18)),
    B_n = 2·(2n+7) / ((n²+6n+11)·(n²+8n+18)).

We also obtain

    f_2(x) = −C_5·(1 + H(x)) + C_6·x·(1 − x) + C_7.

The values of the constants can be obtained by the same routine but cumbersome procedure of substitution into the integral equations. The resulting implicit equations provide the values of the C_i's, namely,

    C_0 = −24/11,  C_1 = −28/33,  C_2 = 0.193…,  C_3 = −100.190…,  C_4 = −27.556…,
    C_5 = −1.463…,  C_6 = (7/4)·C_5 + 3 = 0.439…,  C_7 = 0.135….

We observe that, contrary to what happens with median-of-3's m_1(x), f_2(x) is the sum of a second degree polynomial and an entropic term (recall that the linear differential equation is the same in both cases, but the entropic term vanishes for median-of-3). Another important aspect is that, even though the difference is not large, f(α) ≤ m_1(α) for 1/3 < α < 2/3, and the same is true for the intervals α ≤ 0.201… and α ≥ 0.798…. In particular, f(1/2) = f_2(1/2) = 2.722… < m_1(1/2) = 2.75. Finally, integrating f(α) over [0, 1] yields f̄ = 2.421…, which compares favorably to the value m̄_1 = 2.5 that corresponds to median-of-3.

Figure 2: Plot of batfind's f(α) and m_1(α).

We conclude this section by briefly discussing the intuition behind the fact that batfind does worse than median-of-3 for values of α near 1/3 (and 2/3). It also makes more comparisons in these regions than for α = 1/2. In particular, if α ∈ [0.276…, 1/3] or α ∈ [2/3, 0.723…] then batfind makes more comparisons, on the average, to select the element of rank α·n than to select the median. An informal explanation is the following. Assume for the sake of concreteness that s = 3, n = 1000 and m = 332. While there is some chance that the rank of the pivot selected by batfind is close to but larger than 1000/3 ≈ 333.3 -- then we are in good shape, since we have discarded almost two thirds of the input --, it is also likely that the rank of the pivot is slightly less than m, and then we have discarded a bit less than a third of the input. In the next round the rank of the sought element will be relatively small; however, there are still enough chances that we have "bad luck" again, that is, an uneven partition with the pivot on the wrong side, as in the first round.
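This boundary behavior can also be explored empirically. The Monte Carlo sketch below is our illustration, not the authors' experiment; all names are ours, the pivot sample is drawn with replacement, and only the n − 1 comparisons of each partitioning stage are counted, in line with the cost model of (2.2).

```python
import random

def batfind_comparisons(a, m, rng):
    """Select the m-th smallest of a (distinct elements) using
    proportion-from-3 sampling; return (element, comparisons),
    counting n - 1 comparisons per partitioning stage."""
    a = list(a)
    m -= 1  # zero-based rank
    comps = 0
    while len(a) > 1:
        n = len(a)
        alpha = (m + 1) / n
        sample = sorted(rng.choice(a) for _ in range(3))
        if alpha <= 1 / 3:
            pivot = sample[0]      # smallest of the sample
        elif alpha < 2 / 3:
            pivot = sample[1]      # median of the sample
        else:
            pivot = sample[2]      # largest of the sample
        comps += n - 1
        smaller = [x for x in a if x < pivot]
        j = len(smaller)
        if m == j:
            return pivot, comps
        if m < j:
            a = smaller
        else:
            a = [x for x in a if x > pivot]
            m -= j + 1
    return a[0], comps

rng = random.Random(42)
n, trials = 1000, 300
data = list(range(n))

def avg_cost(m):
    """Average comparisons (in units of n) over repeated runs."""
    return sum(batfind_comparisons(data, m, rng)[1] for _ in range(trials)) / (trials * n)

near_third, at_half = avg_cost(332), avg_cost(500)
```

On runs of this size the averages land in the vicinity of the analytic main terms (roughly 2.4n to 2.8n), though the gap between m = 332 and the median is small compared to the sampling noise, so many more trials would be needed to resolve it cleanly.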
On the other hand, if m = 334 then the strategy would pick the median of the sample and thus exhibit more "stable" performance, since it would most likely partition the array into subarrays of similar size and hence avoid the boundary effect just described. Such boundary effects occurring at early stages of the recursion have a big impact on the performance and account for the difficulty of finding elements whose rank is slightly less than n/3.

5 ν-find: A variant of batfind

In the following we consider symmetric sampling strategies that are inspired by proportion-from-s but try to overcome its deficiencies. For these strategies we have in general a_k ≠ k/s but r_k = k for all 1 ≤ k ≤ s. One important point is that these variants of proportion-from-s satisfy the same differential equations. Hence, the general form of the f_k's is exactly the same and the only difference is in the value of the involved constants, because of the different initial conditions. Here, we only explore the case where s = 3 and investigate the properties of f(α) when a_1 = ν and a_2 = 1 − ν for 0 < ν < 1/2; we make the dependence on ν explicit and denote by f_ν the function corresponding to this strategy, which we call ν-find. For ν-find the differential equations are the same as in (4.5). When ν → 0, ν-find behaves as quickselect with median-of-three. When ν = 1/3, we recover batfind. Finally, when ν → 1/2, the median of the sample is always discarded, so ν-find behaves slightly differently from proportion-from-2. In general, f_ν(α) consists of three pieces: f_{1,ν} for α ∈ [0, ν], f_{2,ν} for α ∈ (ν, 1 − ν), and f_{3,ν} for α ∈ [1 − ν, 1]. Of course, since ν-find is symmetric, we have f_{3,ν}(α) = f_{1,ν}(1 − α). As we have already pointed out, the only difference between our analysis of batfind and that of ν-find is that we have to obtain the dependence on ν of the values of the constants C_i in the general form of f_1 and f_2 (see Section 4).

Figure 3: Plot of f_ν(α) for ν ∈ {1/4, 1/5, 1/7}. Median-of-three (m_1(α)) is also depicted for comparison.
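The pivot-selection rule of ν-find is the only place where the algorithm differs from the variants seen so far, so it can be isolated in a few lines. The sketch below is ours (the function name is illustrative); it returns the rank, within a sample of three, of the element used as pivot.

```python
def nu_find_rank(alpha, nu):
    """Rank (1-based, within a sample of 3) of the pivot chosen by
    nu-find: smallest if alpha <= nu, largest if alpha >= 1 - nu,
    median otherwise.  nu = 1/3 reproduces batfind."""
    if alpha <= nu:
        return 1
    if alpha >= 1.0 - nu:
        return 3
    return 2
```

Note the limiting cases: with ν = 0 the rule always returns the median (median-of-3), while with ν = 1/2 the median is never chosen, mirroring the remarks above.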
Notice that the argument for the symmetry of f_{2,ν} applies also here, no matter what the value of ν is. It turns out that C_0 = −24/11 and C_1 = −28/33 are independent of ν. Moreover, C_6(ν) = (7/4)·C_5(ν) + 3. The formulæ for the other C_i's are complicated (four to five lines each) and we do not give them here. Also, due to the space limitations, the rest of this section will describe the most salient features of ν-find, but we do not give the corresponding proofs or calculations here.

For a large range of values of ν, f_ν has three local maxima, located at α = ν, α = 1 − ν and α = 1/2. The local maxima at ν and 1 − ν constitute the endpoints of the characteristic two little "ears" of ν-find (and batfind in particular). As ν decreases, the value of f_{1,ν}(ν) also decreases and α = 1/2 becomes the absolute maximum of f_ν. We denote by ν̃ the largest value of ν such that α = 1/2 is the absolute maximum of f_ν; for ν > ν̃ the absolute maxima of f_ν are located at α = ν and α = 1 − ν. So ν̃ is the solution of f_ν(ν) = f_ν(1/2). Numerical computations yield ν̃ = 0.268….

If we continue decreasing the value of ν then f_ν loses its characteristic "ears", because then we have f_{1,ν}(ν) < f_{2,ν}(ν) (see Figure 3). The transition point, where f_{1,ν}(ν) = f_{2,ν}(ν), i.e., where f_ν is continuous, enjoys another fundamental property, as stated by the following theorem.

Theorem 5.1. Let ν* be a value of ν, 0 < ν < 1/2, for which f_ν is continuous, i.e., f_{1,ν*}(ν*) = f_{2,ν*}(ν*). There exists only one such value of ν, namely ν* = 0.182…, and furthermore, for all 0 ≤ α ≤ 1 and for all 0 < ν < 1/2, f_{ν*}(α) ≤ f_ν(α).

Since ν* optimizes f_ν, it minimizes the maxima; furthermore, ν* < ν̃, so it follows that ν* minimizes f_ν(1/2). We can characterize it by ∂f_ν(1/2)/∂ν |_{ν=ν*} = 0. In particular, f_{ν*}(1/2) = 2.659… (see Figure 4). It is also obvious that ν* must minimize the average value f̄_ν. Therefore, we have ∂f̄_ν/∂ν |_{ν=ν*} = 0. The minimum value is f̄_{ν*} = 2.342….

Another set of interesting values of ν appears when we compare ν-find with median-of-3. In particular, for ν ≤ 0.404…, ν-find performs better, on the average, than median-of-3 when looking for a random rank; that is, f̄_ν ≤ m̄_1 = 5/2 = 2.5. When ν ≤ ν_m = 0.364…, ν-find does fewer comparisons on the average than median-of-3 to locate the median, or in other words, f_ν(1/2) ≤ m_1(1/2) = 11/4 = 2.75 (see Figure 4). Also, for ν ≤ ν̂ = 0.219…, ν-find performs better on the average than median-of-3 when locating any element, in the sense that f_ν(α) ≤ m_1(α) for all α. Because of the properties of f_ν and m_1, ν̂ is characterized as the solution of m_1(ν̂) − f_{1,ν̂}(ν̂) = 0. Notice that ν̂ ≤ ν̃, so that when ν-find beats median-of-3, α = 1/2 is already the most difficult rank to be selected for ν-find; but on the other hand, f_{1,ν̂}(ν̂) > f_{2,ν̂}(ν̂) (see Figure 4). Also, for ν > ν̂, by definition, ν-find does worse than median-of-three for some intervals of α. In particular, if ν ≤ ν_m then ν-find beats median-of-3 in [0, β_ν], (ν, 1 − ν) and [1 − β_ν, 1]; if ν > ν_m then ν-find beats median-of-3 only in [0, β_ν] and [1 − β_ν, 1]. For instance, since ν̂ < 1/3 < ν_m, batfind beats median-of-3 in the ranges [0, β_{1/3}], (1/3, 2/3) and [1 − β_{1/3}, 1], with β_{1/3} = 0.201…, where we denote by β_ν the solution of m_1(β_ν) − f_{1,ν}(β_ν) = 0.

If we consider the average total cost (including comparisons to partition the array and to select pivots, exchanges, function calls, etc.) rather than the average number of comparisons as a measure of performance, then the optimal value changes with respect to the one given in Theorem 5.1. Two aspects should then be taken into account in order to compute the optimal ν: 1) the toll function for the average total cost depends on α -- actually, on r(α) -- which was not the case for the number of comparisons needed to partition the array; and 2) the constants multiplying the linear term in the toll function, which are implementation-dependent.

Figure 4: Plot of f_{1,ν}(ν), f_{2,ν}(ν), f_ν(1/2) and m_1(ν), and several interesting values of ν.
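Because the total-cost-optimal ν is implementation-dependent, one plausible approach is to estimate it empirically for a given implementation. The sketch below is entirely ours and is only a toy model: it counts comparisons (not total cost), draws sample elements with replacement, and simply picks the cheapest candidate ν over random ranks.

```python
import random

def nu_find_cost(n, m, nu, rng):
    """Comparisons used by a toy nu-find (sample of 3) to select the
    m-th smallest of n distinct elements; n - 1 per partition stage."""
    a = list(range(n))
    m -= 1  # zero-based rank
    comps = 0
    while len(a) > 1:
        alpha = (m + 1) / len(a)
        sample = sorted(rng.choice(a) for _ in range(3))
        if alpha <= nu:
            pivot = sample[0]
        elif alpha >= 1.0 - nu:
            pivot = sample[2]
        else:
            pivot = sample[1]
        comps += len(a) - 1
        smaller = [x for x in a if x < pivot]
        j = len(smaller)
        if m == j:
            return comps
        if m < j:
            a = smaller
        else:
            a = [x for x in a if x > pivot]
            m -= j + 1
    return comps

def tune_nu(candidates, n=400, trials=120, seed=3):
    """Pick the candidate nu with the smallest empirical average cost
    over uniformly random ranks."""
    rng = random.Random(seed)
    return min(candidates,
               key=lambda nu: sum(nu_find_cost(n, rng.randrange(1, n + 1), nu, rng)
                                  for _ in range(trials)))
```

With modest n and few trials the winner is noisy; the mild dependence of the cost on ν reported above suggests that a rough estimate obtained this way should already be adequate.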
In any case, once the optimal value for ν has been established, it can be easily plugged into a generic implementation of ν-find. Our analysis, as well as the preliminary experiments that we have conducted, strongly suggest that ν-find could be the method of choice in practical situations, since its performance for medium ranks is similar to that of median-of-three, while it performs significantly better than median-of-three for low and high ranks. In particular, our experiments suggest that the constant factor of the main term, as well as the lower order terms, are never much greater than those for median-of-three and can be noticeably smaller when α is not close to 1/2. On the other hand, the performance of ν-find is not extremely sensitive to the choice of ν, so that a rough approximation to ν* can be used with good results.

6 Beyond s = 3

Although it seems that it would not be too difficult to extend our analysis to s = 4 and s = 5, the process is undoubtedly tedious and the results most likely cumbersome and qualitatively similar. A more interesting open problem is to provide solutions for general s, or to study the behavior of f(α) when s → ∞ (see later in this section).

We can use the integral equations to compute numerical approximations of f(α) for s > 3. Notice that the f_k's are the fixed points of integral operators T_k (the right-hand sides of the equations in Theorem 2.1), so we start with φ_{k,0} := 2 and apply the integral operator to it to obtain φ_{k,1}, for each 1 ≤ k ≤ s. The procedure is repeated several times, so we get φ_{k,i+1} := T_k(φ_{k,i}) in each iteration. Of course, f_k = lim_{i→∞} φ_{k,i}, but four iterations provide a reasonably good approximation³.

³ The Maple code that we have used in the computations of this and the previous sections is available from the first author.

Despite the inherent error of these numerical approximations, there are several features that show up and probably hold for all values of s. To begin with, proportion-from-(s+1) does not uniformly beat proportion-from-s. In fact, proportion-from-s beats in some regions proportion-from-s′ with s′ > s + 1. But for the interval endpoints, the numerical approximations strongly suggest that f_s(k/s) > f_{s+1}(k/(s+1)) and f_s(1/2) > f_{s+1}(1/2) (here the subscripts s refer to the size of the sample). Another property of proportion-from-s is that the maxima are located at 1/s, 2/s, …, 1 − 1/s and 1/2. Furthermore, if s is odd then f(1/s) > f(2/s) > ··· > f((s−1)/(2s)) > f(1/2). If s is even then we have f(1/s) > f(2/s) > ··· > f((s−2)/(2s)) > f(1/2). Thus, in general, when s > 2, the element of rank n/s is the one for which proportion-from-s will incur the largest cost. A look at the plot of f(α) for various s (Figure 5) exhibits the features mentioned above. We conjecture that, similarly to ν-find, a suitable choice of interval endpoints that smoothes f(α) yields the best results for each s.

Figure 5: The value of f(α) for proportion-from-s, 2 ≤ s ≤ 7.

We end this section with the following important result, which establishes the theoretical optimality [2] of proportion-from-s-like strategies when s → ∞.

Theorem 6.1. For any adaptive sampling strategy such that lim_{s→∞} r(α)/s = α,

    f(α) = lim_{s→∞} lim_{n→∞, m/n→α} C_{n,m}/n = 1 + min(α, 1 − α).

The proof of the theorem above (which we will give in a full version of this extended abstract, due to lack of space) amounts to showing that f(α) is the unique fixed point of T_∞, where

    T_∞(f) = 1 + lim_{s→∞} s!/((r(α)−1)!·(s−r(α))!) × [
          ∫_α^1 f(α/x)·x^{r(α)}·(1−x)^{s−r(α)} dx
        + ∫_0^α f((α−x)/(1−x))·x^{r(α)−1}·(1−x)^{s+1−r(α)} dx ].

The theoretically optimal performance in Theorem 6.1 can be achieved using variable-size samples, with s growing as n grows, as long as s = o(n). However, as in Floyd and Rivest's algorithm, using large samples has a big impact on the lower order terms of the performance.

7 Future work

Some interesting open problems include a more precise analysis of the lower order terms in the performance of batfind and ν-find, and the determination of the optimal size s of the samples for proportion-from-s when s = s(n) (we conjecture that the optimal size is s = Θ(√n)). On the other hand, we are considering randomized sampling strategies, where, given the relative rank α of the sought element, for each r, 1 ≤ r ≤ s, there is a probability p_r(α) that the rth smallest element of the sample is chosen as the pivot. These strategies generalize the deterministic strategies studied in this paper.

References

[1] D. Anderson and R. Brown. Combinatorial aspects of C.A.R. Hoare's Find algorithm. Australasian Journal of Combinatorics, 5:109-119, 1992.

[2] R. Floyd and R. Rivest. Expected time bounds for selection. Communications of the ACM, 18(3):165-173, 1975.

[3] R. Grübel. On the median-of-k version of Hoare's selection algorithm. Theoretical Informatics and Applications, 33(2):177-192, 1999.

[4] R. Grübel and U. Rösler. Asymptotic distribution theory for Hoare's selection algorithm. Advances in Applied Probability, 28:252-269, 1996.

[5] C. Hoare. Find (Algorithm 65). Communications of the ACM, 4:321-322, 1961.

[6] C. Hoare. Quicksort. Computer Journal, 5:10-15, 1962.

[7] P. Kirschenhofer, H. Prodinger, and C. Martínez. Analysis of Hoare's Find algorithm with median-of-three partition. Random Structures & Algorithms, 10(1):143-156, 1997.

[8] D. Knuth. Mathematical analysis of algorithms. In Information Processing '71, Proc. of the 1971 IFIP Congress, pages 19-27, Amsterdam, 1972. North-Holland.

[9] D. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, Reading, Mass., 2nd edition, 1998.

[10] H. Mahmoud. Sorting: A Distribution Theory. John Wiley & Sons, New York, 2000.

[11] C. Martínez and S. Roura. Optimal sampling strategies in quicksort and quickselect. SIAM Journal on Computing, 31(3):683-705, 2001.

[12] D. Musser. Introspective sorting and selection algorithms. Software--Practice and Experience, 27(8):983-993, 1997.

[13] E. Rainville, P. Bedient, and R. Bedient. Elementary Differential Equations. Prentice Hall, 8th edition, 1997.