Proximity Mergesort: Optimal In-Place Sorting in the Cache-Oblivious Model

Gianni Franceschini*

Abstract

An algorithm performs its operations in-place if it uses O(1) extra locations of main memory besides those containing the input entries. An algorithm is cache-oblivious if it is not conscious of any parameter of the memory hierarchy (M, the size of the cache memory, and B, the size of the minimal contiguous block of information that can be transferred between the cache and the main memory). Hence, it cannot directly exploit these parameters to reach optimality. In the cache-oblivious model the complexity is measured with two criteria: the work complexity, which is the standard complexity in the RAM model, and the cache complexity, which is the total number of block transfers (cache misses) incurred during the computation. The contribution of this paper is twofold. We present the first sorting algorithm that is optimal in both work and cache complexity in the cache-oblivious model and that operates in-place. Furthermore, we introduce a new approach to the sorting problem in the cache-oblivious model.

*Work partially supported by the Italian MIUR project PRIN "ALINWEB: Algorithmics for Internet and the Web". Dipartimento di Informatica, Università di Pisa, Italy. Email: francesc@di.unipi.it

1 Introduction

One of the main characteristics of "real world" computing systems is the use of multi-level memory hierarchies. Typically, the highest level of the hierarchy contains a great amount of memory with slow access time (and with low cost). On the contrary, for the lowest level a small quantity of a more expensive type of memory with fast access time is used. Usually, the processing unit can address the whole highest level, but it can only access the locations in the lowest one. During a generic computation, the data flow up and down in this hierarchy, following the requests of the running algorithm(s). In order to amortize the cost of the inter-level flow of information, the data transfers between two adjacent levels always involve blocks of contiguous locations. Clearly, that kind of strategy is successful only if the computation maintains some form of data locality.

A typical example of the memory hierarchy of a modern computer is composed of registers, look-ahead buffers, level-one and level-two caches (even level three in high-end computers), main memory and disk. Each level has its own block size, which makes the optimal tuning of algorithms hard. Moreover, the vast majority of these levels have a strategy for the replacement of blocks (used when the level is full) that is unknown to the algorithms or, at least, that cannot be directly controlled.

Complex memory hierarchies have contributed to increasing the distance between practical and theoretical computing. An algorithm that has optimal complexity in the classic RAM model can be outperformed in a real computation by sub-optimal algorithms which exploit a better data locality. The RAM model is unable to capture these aspects of memory access patterns, and so a number of new theoretical models have been proposed in order to cope with these issues of modern computing.

1.1 Multi-level memory hierarchy models. Aggarwal et al. [2] introduced the Hierarchical Memory Model. Like the RAM, this model has an unlimited number of locations. It aims at modeling a multi-level hierarchy with the use of a function f(i) that gives the access cost for the i-th location. The authors mainly study the case in which f(i) = log i. For the sorting problem in particular, they prove an upper bound of O(N log N log log N) and the matching lower bound. Other cost functions are examined, and in [3] the model is extended with block transfers.

The widely studied External Memory Model of Aggarwal and Vitter [4] has a simple two-level memory hierarchy: a limited and fast internal memory of size M and an unlimited slow external memory. The processing unit can address the external memory but can process only the data that reside in internal memory.
An algorithm operating in this model must directly manage the transfers of data between the two levels. Those transfers involve blocks of B elements. The complexity of an algorithm is the number of block transfers it incurs. This is justified if we associate the internal and external memory respectively with the main memory and with the disk of a computer: the cost of disk accesses is dominant in the total running time of any "reasonable" algorithm. The lower bound for the sorting problem in this model is Θ((N/B) log_{M/B}(N/B)) block transfers, and an in-place optimal sorting algorithm can be achieved with an (M/B)-way merging strategy that exploits the whole internal memory.

1.2 Cache-oblivious model. A recent, though widely studied, model of computation is the Cache-Oblivious Model, introduced by Frigo et al. [21, 29]. That model is composed of two main parts: the ideal-cache model and the cache-oblivious algorithms (and data structures). The ideal-cache model consists of two levels of memory: the cache level, which contains M locations and is partitioned into blocks (or cache lines) of B contiguous locations, and the main memory level, which can be arbitrarily large. The processing unit can address the locations of the main memory but it can process only the data residing in cache. Unlike the external memory model, when the block b that contains the data necessary to the computation is not in cache (a cache miss), an optimal off-line replacement strategy is used to, if necessary, evict a resident block in favor of b. Therefore, an algorithm operating in the ideal-cache model cannot directly manage the transfers of blocks between levels. Furthermore, the cache is fully associative, that is, each block coming from main memory can be stored anywhere in the cache.

An algorithm is cache-oblivious if it has no memory-hierarchy-specific parameterization. In particular, if that algorithm operates in the ideal-cache model, it cannot be defined in terms of the parameters B and M. Hence, an algorithm for the cache-oblivious model looks like an algorithm for the RAM model. Moreover, the definition of in-placeness for the RAM model naturally extends to the cache-oblivious one: an algorithm operates in-place if it uses O(1) extra locations of main memory besides the contiguous ones containing the input elements.
Unlike the external memory model, in the cache-oblivious model there are two measures for the complexity of an algorithm: the work complexity W, that is, the standard complexity in the RAM model (evaluating the number of comparisons, arithmetic operations, data moves, and so on), and the cache complexity Q, that is, the total number of cache misses incurred during the computation. (For instance, a sequential scan of N contiguous elements has W(N) = Θ(N) and Q(N) = Θ(N/B + 1) for every B and M, even though the scan knows neither parameter.)

The ideal-cache model can seem unrealistic because of the optimal off-line replacement strategy. However, Frigo et al. [21] have shown that algorithms satisfying a reasonable regularity condition on the number of cache misses in the ideal-cache model have only a small constant-factor overhead for that measure when used in a weaker cache model with the realistic LRU replacement strategy. For the full-associativity assumption, Frigo et al. provide a favorable expected-case argument. Furthermore, since the cache complexity analysis holds for any value of B and M, it holds for any level of a more general multi-level memory hierarchy. Therefore, the cache-oblivious model can be seen as a "successor" of the RAM model, a successor that incorporates many of the new architectural aspects which characterize real-world computing systems. Despite its recent introduction, the cache-oblivious model is widely studied and there are many types of results: sorting [21, 12]; computational geometry [12, 9, 1]; priority queues [6, 13]; graph algorithms [6]; optimal dictionaries, with amortized and worst-case bounds, partially persistent, space consuming and implicit [7, 11, 15, 9, 18, 19]; static tree layout [10, 5]; dynamic set scanning [8]; meaningful lower bounds [14].

1.3 Sorting in the cache-oblivious model. According to Knuth [23, Sect. 5.5], sorting is one of the most fundamental problems in Computer Science, of great theoretical and practical importance. In virtually every field of Computer Science there are problems that include sorting as a primary step toward their solution. Prokop [29] proved that the cache complexity of any sorting algorithm in the cache-oblivious model is Ω((N/B) log_M N). The number of operations is Ω(N log N), and that lower bound comes directly from the comparison model. These lower bounds are matched by two sorting algorithms presented in the seminal paper of Frigo et al. [21, 29]: Funnel-sort and Distribution sort. The Funnel-sort was subsequently simplified by Brodal and Fagerberg [12] with a variant called Lazy Funnel-sort.

The Funnel-sort divides the input array into about N^{1/3} contiguous sub-arrays of about N^{2/3} elements each. Then it recursively sorts these sub-arrays and finally produces the sorted sequence using an N^{1/3}-merger. Essentially, a k-merger is a complete binary tree with k leaves (one for each sorted input sequence) and a buffer for each edge, laid out in memory following the van Emde Boas layout, which also determines the sizes of the buffers. (This particular layout is also a basic tool adopted in all the optimal dynamic dictionaries so far developed for the cache-oblivious model, as in [29, 7, 11, 15, 9, 18, 19].) When invoked, a k-merger outputs the first k^3 elements of the sorted sequence obtained by merging the k sorted input sequences.
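To make the van Emde Boas layout concrete, the following Python sketch (ours, an illustration rather than code from [21, 29]; the function veb_layout and the node naming are our own conventions) computes the memory order of the nodes of a complete binary tree: the tree is cut at half its height, the top subtree is laid out first, and the bottom subtrees follow, each laid out recursively in a contiguous area.

    # Minimal sketch: the van Emde Boas layout of a complete binary tree
    # with 2**height - 1 nodes. A node is identified by the pair
    # (depth, index within its level); the result lists the nodes in
    # memory order.
    def veb_layout(depth, index, height):
        if height == 1:
            return [(depth, index)]
        top_h = height // 2            # height of the top recursive subtree
        bot_h = height - top_h         # height of each bottom subtree
        order = veb_layout(depth, index, top_h)   # top part comes first
        for i in range(2 ** top_h):    # then the bottom subtrees, left to right
            order += veb_layout(depth + top_h,
                                index * (2 ** top_h) + i, bot_h)
        return order

    print(veb_layout(0, 0, 4))         # layout of a 15-node tree

Applying the cut recursively in this way is what guarantees that, for every block size B, some level of the recursion produces subtrees that fit in O(1) blocks.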
The Distribution sort performs the following steps. First, it divides the input array into about N^{1/2} contiguous sub-arrays of the same size and recursively sorts these sub-arrays. Second, it distributes the elements into q ≤ √N buckets B_1, ..., B_q such that |B_i| ≤ 2√N and max{x : x ∈ B_i} ≤ min{x : x ∈ B_{i+1}}, for i = 1, 2, ..., q − 1. Third, it recursively sorts each bucket. Fourth, it copies the elements back to the input array. The way the second step is performed is crucial to obtain the optimal cache complexity. Unlike other kinds of sorting based on distribution, the phase in which the pivots (or sample set) are chosen and the distribution phase are not separated. Frigo et al. use a bucket splitting technique: the buckets have a variable size, upper bounded by 2√N; when a bucket reaches that threshold, its median element is selected (it will be a new pivot) and the bucket is partitioned into two new buckets of size √N. The bucket splitting technique alone is not sufficient to achieve a good cache complexity: the order in which the sub-arrays are processed is critical. For example, if we process the sub-arrays from the leftmost to the rightmost one, distributing the elements of each sub-array along all the existing buckets in one single run, we incur O(N) cache misses instead of the O(N/B) required by cache optimality. As we will see, in our result we do not make use of the bucket splitting technique, but we do use the particular processing order introduced by Frigo et al. for disposing some particular groups of elements.

In the paper of Frigo et al. [21], both sorting algorithms rely on the so-called tall cache assumption, i.e. M = Ω(B^2). Subsequently, Brodal and Fagerberg [14] proved that the tall cache assumption (or its less stringent version M = Ω(B^{1+ε}) seen in [12]) is necessary for cache optimality. So, from now on we also assume that M = Ω(B^2). The only two optimal cache-oblivious sorting algorithms known are the original Funnel-sort (and its variant, the Lazy Funnel-sort) and Distribution sort, if we exclude the algorithm directly derivable from the priority queue in [13] (the other optimal priority queue, of Arge et al. [6], relies on an optimal sorting algorithm). Moreover, a classical property like in-placeness seems to be challenging in the cache-oblivious model because of the apparently antithetical requirements needed to obtain the maximum space saving and the data locality. For these reasons, the techniques originally developed for this problem in the RAM and EM models appear to be difficult to apply in our scenario.

1.4 Our result. The contribution of this paper is twofold. We present the first in-place optimal sorting algorithm for the cache-oblivious model. Furthermore, with our new algorithm we introduce a new approach to the sorting problem for this model of computation. The best in-place sorting algorithm previously known can be directly derived from the optimal implicit dictionaries in [18, 19]. That straightforward algorithm can incur O(N log_B N) cache misses. However, a simple variation of the well-known in-place binary mergesort, based on in-place unstable merging, can lower the cache complexity bound to O((N/B) log N) block transfers.

1.5 Organization of the paper. In section 2 we present a first version of Proximity Mergesort that uses O(N) auxiliary space. At the end of this section we give a sketch of the complexity analysis. In section 3 we show how to reduce the auxiliary space to O(1) cells, pointing out the variations in the complexity analysis. Due to the lack of space, and to make this brief exposition more readable, we do not stress formal details (for example, we avoid the use of ceilings and floors). A complete exposition will be given in the full version of this paper.
2 Proximity Mergesort

2.1 Basic ideas. Let us give the main ideas behind our algorithm. There is a common scheme that seems to be very effective for optimality in the cache-oblivious model: a divide and conquer approach that refines the size of the subproblems in an "exponential" way. For example, the van Emde Boas layout is applied recursively to Θ(√N) subtrees of size Θ(√N) each, where N is the size of the whole tree. In the Distribution sort the recursive calls are performed on sub-arrays of size Θ(√N) each, and the distribution step is performed using a set of Θ(√N) pivots. Again, in the Funnel-sort the recursive calls generate Θ(N^{1/3}) sorted sequences that need to be merged.

In the Funnel-sort these Θ(N^{1/3}) sorted sequences are independent: given any element of a certain sequence, we cannot infer its rank inside another sequence. For this reason, the final merge step in Funnel-sort needs the aid of a very powerful and quite complicated structure like the k-merger. Our approach is based on a different idea, namely, that the sorted sequences involved in the final merge step are not independent. The permutation of elements on which the final merge step is performed has a proximity property guaranteeing that any element in this permutation is not too far from its final destination. With this property, the final merge step we need to perform is greatly simplified with respect to that of Funnel-sort.

We now describe the observation at the heart of the proximity property. Let us consider a sequence of N elements divided into N/X sorted subsequences of X elements each. Each subsequence is further divided into X/Y segments of Y elements each. Let us focus on the new sequence obtained by permuting the segments of Y elements each, so that the segments are sorted according to their smallest element. It is easy to see that no element is more than ((N/X) − 1)(Y − 1) positions beyond its final position in the fully-sorted sequence. A particular case of this observation was originally used by Kronrod [24, 30] in the first known in-place linear-time merging algorithm.
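The observation is easy to check experimentally. The following toy Python sketch (ours, for illustration only; the sizes N, X, Y are arbitrary choices) builds N/X sorted runs of distinct elements, permutes their Y-element segments by smallest element, and verifies the displacement bound:

    import random

    N, X, Y = 1024, 128, 8                  # N/X runs, X/Y segments per run
    a = random.sample(range(10 * N), N)     # distinct elements
    runs = [sorted(a[i:i + X]) for i in range(0, N, X)]
    segments = [run[j:j + Y] for run in runs for j in range(0, X, Y)]
    segments.sort(key=lambda seg: seg[0])   # sort segments by their header
    perm = [x for seg in segments for x in seg]

    rank = {x: r for r, x in enumerate(sorted(a))}
    worst = max(pos - rank[x] for pos, x in enumerate(perm))
    bound = (N // X - 1) * (Y - 1)
    print(worst, "<=", bound)               # the bound always holds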
Starting from this particular permutation, we could obtain the sorted sequence simply by merging each segment, from the leftmost to the rightmost one, with the elements contained in the previous (N/X) − 1 segments. However, we cannot follow precisely that approach, since the final merging process would still be too costly (as we will see, we can pay only O(N/B) block transfers for it). Therefore, we refine this property as described next.

2.2 The algorithm. The input array A is conceptually divided into contiguous portions, the complete segments, each of size S = log N, with the exception of the rightmost one, the short segment. The minimum element of a generic segment s is the header of s; let us denote the set of headers by H. The array is further divided into complete sub-arrays containing √N complete segments each, excluding the rightmost one, called the short sub-array. Let R be the number of complete sub-arrays; clearly R ≤ √N / S.

We recursively sort each complete sub-array. However, instead of sorting the complete segments on their headers, we find a perfectly balanced sample set T of R equidistant headers and redistribute the complete segments using that set. Thus we obtain R new complete sub-arrays A_1, ..., A_R such that, for each i = 1, ..., R − 1, the header of the segment with maximum header in A_i is less than or equal to the header of the segment with minimum header in A_{i+1}.

ProximityMergesort(A, N).

1. Calculate S, R and the starting position of the short sub-array.

2. Recursively sort the complete sub-arrays A_1, ..., A_R.

3. Move the headers of the segments into a contiguous zone of memory and sort them with the binary mergesort.

4. Find a set T of R equally spaced elements in the sorted set of headers H and move the elements of T into a contiguous zone of memory.

5. Recover the original order of the remaining R(√N − 1) headers in H − T and move them back into the sub-arrays.

6. Recover the original order of the headers in T.

7. Move the set of segments corresponding to the headers in T into a contiguous area of memory and sort the headers in T with the binary mergesort. During the sorting process, each time a header is moved, move also the whole corresponding segment.

8. Allocate a new contiguous array Z of memory, divided into R sub-zones Z_1, ..., Z_R of size S√N (the same as a complete sub-array). Copy the segments corresponding to T into the last S − 1 positions of each sub-zone, following the order of T.

9. Distribute the segments corresponding to H − T into these sub-zones according to the order of the set T. At the end of the distribution, move back to A_1, ..., A_R the headers in T and the elements in Z_1, ..., Z_R.

10. Recursively sort the complete sub-arrays A_1, ..., A_R.

11. For each i = 1, ..., R − 1, use a binary merging algorithm to merge the elements in zones A_i and A_{i+1}.

12. Sort the elements in the short sub-array and in the short segment with the binary mergesort. Merge that sorted sequence of fewer than S(√N + 1) elements with the one obtained in step 11.

With the second recursive sorting of the sub-arrays we achieve a permutation that has the same proximity property as that in the observation made above. Namely, no element is more than D = (R − 1)(S − 1) positions beyond its final position in the fully-sorted sequence. Since R ≤ √N / S, we have that D ≤ √N. Moreover, that permutation is composed of R sorted subsequences of S√N elements, instead of N/S subsequences with only S elements each, as in the observation. Therefore, the "left to right" binary merging process in step 11 ends with the sorted sequence of elements (short segment and short sub-array excluded). Note that our values for the size of a segment (S) and for the size of a complete sub-array (S√N) are quite "large" for the problem, and other choices are possible.

Before we proceed any further, we should comment on step 9. If we perform this step processing the sub-arrays from the leftmost to the rightmost one, distributing the segments of each sub-array along all the existing sub-zones in one single run, we incur O(R^2) = O(N / log^2 N) cache misses, and that term can be ω(N/B) when B = ω(log^2 N). Each non-recursive step in the Proximity Mergesort is required to have a cache complexity of O(N/B) in order to obtain cache optimality. As already anticipated in section 1, we use the particular processing order introduced by Frigo et al. [21] with the following procedure.

Distribute(i, j, m)
1: if m = 1 then
2:   CopyElements(i, j)
3: else
4:   Distribute(i, j, m/2)
5:   Distribute(i + m/2, j, m/2)
6:   Distribute(i, j + m/2, m/2)
7:   Distribute(i + m/2, j + m/2, m/2)

Unlike Frigo et al. [21], we do not need the splitting technique to find our sample set: T is composed of R perfectly equidistant elements in the sorted set of headers H. Furthermore, in the base case (m = 1), we do not copy single elements but whole segments.
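For intuition, here is a Python transcription of Distribute (ours; it only records the order in which the (sub-array, sub-zone) pairs are visited, with CopyElements replaced by an append, and it assumes m is a power of two, consistent with our omission of ceilings and floors):

    def distribute(i, j, m, visit):
        if m == 1:
            visit.append((i, j))             # stands in for CopyElements(i, j)
        else:
            distribute(i,          j,          m // 2, visit)
            distribute(i + m // 2, j,          m // 2, visit)
            distribute(i,          j + m // 2, m // 2, visit)
            distribute(i + m // 2, j + m // 2, m // 2, visit)

    order = []
    distribute(1, 1, 4, order)    # R = 4 sub-arrays and sub-zones
    print(order)
    # [(1, 1), (2, 1), (1, 2), (2, 2), (3, 1), (4, 1), (3, 2), (4, 2),
    #  (1, 3), (2, 3), (1, 4), (2, 4), (3, 3), (4, 3), (3, 4), (4, 4)]

The recursion visits the R × R pairs in a self-similar order, rather than row by row, so nearby invocations touch source sub-arrays and destination sub-zones that are close in memory; this is the property behind Lemma 2.1 below.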
In our particular case, the invocation CopyElements(i, j) scans the remaining segments of the complete sub-array i (starting from the leftmost one) and moves them into the sub-zone j, until a segment with a header greater than the sample of j is found. In summary, step 9 of our algorithm is realized by invoking Distribute(1, 1, R).

2.3 Complexity analysis. Let us estimate the cost of each non-recursive step of the Proximity Mergesort. Step 1 is trivial. For step 3, the scanning and the copying of headers can be done in O(N/S) time and O(N/(SB)) = O(N/B) block transfers. The sorting part is done with the normal stable binary mergesort. This algorithm is cache-oblivious and takes O((N/(BS)) log N) = O(N/B) cache misses, if N > cM (as we shall see, N ≤ cM will be the base-case condition for the cache complexity recurrence). Since the set H is sorted, step 4 takes O(R) time and incurs O(N/B) cache misses. In steps 5 and 6 we use again the binary mergesort; we can perform these two steps in time O(N) and with O((N/(BS)) log N) = O(N/B) block transfers, whenever N > cM. Step 7 can be performed with a simple variation of the binary mergesort at a cost of O(N/B) cache misses, as in steps 3, 5 and 6. Steps 8 and 11 have the same asymptotic complexity as a scanning. Since the short segment and the short sub-array contain fewer than S(√N + 1) elements, the binary mergesort invocation in step 12 costs O((S(√N + 1)/B) log N) = O(N/B), if N > cM. The other part of this step has the same asymptotic complexity as that of a scanning.

Step 9 is realized by invoking Distribute(1, 1, R). The following lemma can be inferred from the cache complexity analysis that Frigo et al. and Prokop [21, 29] give for the Distribution sort.

Lemma 2.1. The invocation Distribute(1, 1, m) incurs O(B + m^2/B + d/B) cache misses, where d is the total number of elements distributed.

We distribute whole segments instead of single elements, but this can be seen as a particular case. Therefore, step 9 requires O(B + R^2/B + N/B) = O(N/B) cache misses.

Theorem 2.1. Proximity Mergesort uses O(N log N) operations and incurs O((N/B) log_M N) cache misses to sort N elements.

Proof. The recurrence for the work complexity is

    W(N) = 2R · W(S√N) + O(N),

and the recurrence for the cache complexity is

    Q(N) ≤ O(N/B)                        if N ≤ cM,
    Q(N) ≤ 2R · Q(S√N) + O(N/B)          otherwise.

(Sketch of the solutions: unfolding the recurrences, the total size of the subproblems doubles at each level while their size drops from N to S√N, that is, roughly to the square root; hence the cost at depth k is O(2^k N) operations and O(2^k N/B) cache misses. The work recursion reaches constant size at depth about log log N, where 2^k = Θ(log N), giving W(N) = O(N log N); the cache recursion reaches its base case N ≤ cM at the depth where 2^k = Θ(log N / log M) = Θ(log_M N), giving Q(N) = O((N/B) log_M N).)
3 Optimal in-place sorting

In this section we describe how to reduce the quantity of working space used by Proximity Mergesort to O(1) cells. With this reduction we lose the stability of our algorithm. A sorting algorithm is stable if it preserves the relative order of different occurrences of the same element. The property of stability is important when satellite data are carried around with the elements being sorted. For example, a stable sorting algorithm is a necessary sub-procedure in Radix sort, a linear-time sorting algorithm that operates over sequences of integers and exploits their positional encoding. As we will see (section 4), the stability property seems to collide irreparably with the techniques we use to obtain a space-efficient algorithm.

We have to deal with several circumstances in which the Proximity Mergesort exploits the auxiliary space.

· Auxiliary space is used to store indices, pointers, counters or, in general, integers of O(log N) bits.

· During the whole computation we use auxiliary space to permute efficiently various groups of elements.

· We have to deal with all the auxiliary space implicitly employed by recursion (the recursion stack), both in the main algorithm and in the invocation Distribute(1, 1, R) used to perform step 9.

3.1 Encoding the integers. The classic technique of odd-even encoding, or bit stealing, is a widely used tool for achieving minimum space usage, both in algorithms [25, 27, 22, 16] and in data structures [28, 26, 20, 17, 18, 19]. We can encode a single bit using the relative order of two distinct elements x, y with x < y: if x precedes y they encode a 0, otherwise they encode a 1. An integer of log N bits can be encoded using log N pairs of distinct elements and, if the positions of these pairs are implicitly known, we can use this encoded integer as a normal one. The drawback is that a single manipulation of such an encoded integer uses O(log N) operations and O(log N / B) cache misses (the latter is true only for particular positionings of the pairs, for example if the pairs are stored in a constant number of sequences of contiguous locations).

Let us estimate how many bits Proximity Mergesort needs. For steps 5 and 6, we need log N bits for each header, so that its original position after step 2 can be recorded. For step 9 we need a constant number of integers of log N bits for each sub-zone and each complete sub-array (Distribute needs to know things like the number of remaining segments in a complete sub-array). So, we need a total of a (N/S) log N bits, for a properly chosen constant a. By the way, when we perform the recursive calls in step 10, we do not have to maintain the integers associated with the sub-zones. All the P = a (N/S) log N pairs of distinct elements are collected in the following way.

(i) With an in-place cache-oblivious selection algorithm we select the P-th and the (N − P + 1)-st elements of the input sequence (in sorted order). The optimality of the selection algorithm is not strictly required, because for this task we can employ O(N log N) operations and incur O((N/B) log_M N) cache misses. However, the optimal in-place selection algorithm of Lai and Wood [25] can be slightly modified to achieve cache complexity optimality.

(ii) Then, we partition the input sequence in-place so that we end up with the P smallest elements in the P leftmost positions and the P largest elements in the P rightmost positions. This task can easily be done in-place and cache-obliviously with O(N) operations and O(N/B) cache misses (we can lose the stability).

(iii) Finally, we sort the two subsequences of P elements with the in-place unstable binary mergesort.

This last step raises a problem. Since S = log N, we have that P = O(N), and so the cache complexity of this step is O((N/B) log N). To solve this problem it is sufficient to set S = log^2 N. This new value of the parameter S does not change the asymptotic order of the solution of the recurrence for the cache complexity in Theorem 2.1, and it reduces the cache complexity of step (iii) to O(N/B). We use an in-place unstable mergesort based on a classical linear-time in-place merging algorithm like the one of Kronrod [24, 30]. This algorithm can be easily modified to be cache optimal.

Let E and E' be the zones containing the P smallest elements and the P largest elements, respectively. If the P-th and the (N − P + 1)-st elements selected in step (i) are equal, then at the end of step (iii) the input sequence is sorted and we are done. Otherwise, only the subsequence, say A', of elements between E and E' must be sorted. Moreover, we are sure that the i-th element of E is distinct from the i-th element of E', for each i = 1, ..., P. So we have the needed P pairs of distinct elements. We can encode any quantity t of bits with 2t elements stored in two contiguous portions of t locations, one portion in E and the other in E'. For simplicity, from now on we will write "x contiguous encoding pairs" even if these x pairs are stored in two contiguous areas. When A' is also sorted, it will be easy to recover the sorted order for E and E' with a straightforward variant of scanning.
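For concreteness, the following Python sketch (ours; the names write_int and read_int are our own, and the list E1 plays the role of E') shows the odd-even encoding on two such areas. Pair i is formed by the i-th element of E and the i-th element of E', and every read or write of an integer of b bits scans b contiguous pairs:

    def write_int(E, E1, start, bits, value):
        # Encode 'value' into the pairs (E[start+k], E1[start+k]), k < bits.
        for k in range(bits):
            lo = min(E[start + k], E1[start + k])
            hi = max(E[start + k], E1[start + k])
            if (value >> k) & 1:                 # bit 1: larger element in E
                E[start + k], E1[start + k] = hi, lo
            else:                                # bit 0: smaller element in E
                E[start + k], E1[start + k] = lo, hi

    def read_int(E, E1, start, bits):
        v = 0
        for k in range(bits):
            if E[start + k] > E1[start + k]:
                v |= 1 << k
        return v

    E  = [1, 2, 3, 4, 5, 6, 7, 8]                # the P smallest elements
    E1 = [11, 12, 13, 14, 15, 16, 17, 18]        # the P largest elements
    write_int(E, E1, 0, 8, 166)
    print(read_int(E, E1, 0, 8))                 # prints 166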
Encoding the integers increases the work complexity and the cache complexity of steps 3, 5 and 6 of the main algorithm. We have to employ R√N log N, R(√N − 1) log N and R log N contiguous encoding pairs for steps 3, 5 and 6, respectively. This is done to encode the position of each header in the sequence of elements obtained after step 2. Since a manipulation or a simple access of an encoded integer requires the scanning of O(log N) contiguous locations, the complexity of step 3 increases from O(N) to O(N log N) operations and from O((N/(BS)) log N) to O((N/(BS)) log^2 N) cache misses. Fortunately, with the new choice of S the complexity of this step is still O(N) operations and O(N/B) cache misses. Analogous observations can be made for steps 5 and 6.

The use of encoded integers affects other parts of Proximity Mergesort. For example, Distribute (used to perform step 9) is modified to use the encoded integers that store the state of the complete sub-arrays and sub-zones (i.e., the counters of the number of segments contained). Also the local variables (i.e., i, j, m) need a particular treatment in order to remove all the auxiliary space used by recursion. All these modifications are quite simple and will be discussed in the full paper. With these variations we obtain an algorithm that does not explicitly store any auxiliary integer used in the computation and that has the same complexity as the original Proximity Mergesort.

3.2 Creating "virtual" working areas. In order to eliminate the need of working areas to permute the elements efficiently, we use an ingenious idea whose credit seems to be due to Kronrod [24, 30]. Instead of moving an element to an empty location, thus performing a destructive occupation, we interchange this "active" element (an element is "active" when we are performing part of our computation on it) with an "inactive" one. Clearly, in order to complete our computation, an inactive element should become active sooner or later. We now show how to use this simple idea for our purpose.

After the creation of the encoding areas E and E', we have to sort only the area A' between E and E'. Let N' be the size of A' and let dN be the size of the largest working area used by Proximity Mergesort when invoked to sort a sequence of N elements. We divide the sequence A' into two parts: the active area with N'/(d + 1) active elements, and the inactive area with dN'/(d + 1) inactive elements. We use the inactive area to allocate the various working areas that the Proximity Mergesort requires for sorting the active area.
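A minimal Python sketch of the exchange idea (ours; a toy distribution step under the simplifying assumption that the suffix of the array is the inactive area): moving an active element into the working area is realized as a swap, so the displaced inactive element survives in the vacated cell and no information is destroyed.

    a = [5, 1, 4, 2, 3] + [90, 91, 92, 93, 94]   # active area, then inactive area
    out = 5                                      # next free cell of the working area
    for i in range(5):
        if a[i] % 2 == 0:                        # "distribute" the even elements
            a[i], a[out] = a[out], a[i]          # swap, never overwrite
            out += 1
    print(a)   # [5, 1, 90, 91, 3, 4, 2, 92, 93, 94]: evens collected, 90 and 91 parked

Note that the multiset of elements in the array is preserved at all times; only their positions change.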
We need a method to distinguish the active elements from the inactive ones during the computation. By inspecting the algorithm, it is clear that the working areas are filled and emptied in scanning order, so the task of distinguishing the two types of elements is quite easy. For example, in step 9 each complete sub-array is emptied from left to right and each sub-zone is filled from left to right (see figure 1). So the simple counter of the segments stored in a sub-array or a sub-zone is sufficient to know where the inactive (and the active) elements are.

Figure 1: Sub-arrays and sub-zones.

When the active area is sorted, we divide the dN'/(d + 1) elements of the former inactive area into a new inactive area of d^2 N'/(d + 1)^2 elements and a new active area of dN'/(d + 1)^2 elements. Then we apply the Proximity Mergesort to the new active area (see figure 2). This process is iterated until the areas have a constant number of elements. Let us denote the pair of active/inactive areas of the j-th iteration by (A_j, I_j).

Figure 2: Encoding areas E, E' and the active/inactive area pairs (A_j, I_j), for j = 0, 1, 2.

Let l be the number of iterations of this process and let s(i) be the size of the problem at iteration i, for i = 0, ..., l − 1. The cache complexity of the iterated sorting process is

    Q_iter = Σ_{i : s(i) > B} Q_PM(s(i)) + Σ_{i : s(i) ≤ B} Q_PM(s(i)),

where Q_PM is the cache complexity of the Proximity Mergesort. Since s(i) = d^i N'/(d + 1)^{i+1} and Q_PM(s(i)) = O((s(i)/B) log_M s(i)), the contribution of the first sum to Q_iter is O((N'/B) log_M N'). The second sum contributes O(1) cache misses, because all the elements sorted in the iterations involved in this sum are stored in a constant number of contiguous blocks. Therefore the iterated sorting process of A' takes O((N'/B) log_M N') cache misses. Analogous calculations can be done for the work complexity.
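The geometric decay of the s(i) is easy to visualize; the following Python sketch (ours; the function active_sizes is our own name) computes the sizes of the active areas A_0, A_1, ... for a given N' and d:

    def active_sizes(n_prime, d=1):
        sizes, remaining = [], n_prime
        while remaining > 1:
            active = remaining // (d + 1)   # A_j gets a 1/(d+1) fraction
            sizes.append(active)
            remaining -= active             # I_j becomes the next area to split
        return sizes

    print(active_sizes(1 << 20))  # [524288, 262144, 131072, ...]: halving for d = 1

Summing Q_PM(s(i)) over such a geometric sequence is dominated by the first term, which is what yields the O((N'/B) log_M N') bound claimed above.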
When the iterated sorting process ends, we use an optimal in-place unstable merging algorithm to obtain the final sorted sequence. First, we merge A_l with A_{l−1}, obtaining the array B_1. Then we merge B_1 with A_{l−2}, obtaining B_2, and so on. At the end of this iterated merging process the elements in A' are sorted. The linear-time in-place merging algorithm of Kronrod [24, 30] can be easily modified to be cache optimal. With calculations similar to the ones for the iterated sorting process, we can conclude that the iterated merging process uses O(N) operations and incurs O(N/B) cache misses.

4 Conclusions and open problems

We have described the first in-place optimal sorting algorithm for the cache-oblivious model. Furthermore, the first version of Proximity Mergesort, which uses O(N) auxiliary space, introduces a new approach to the sorting problem in the cache-oblivious model.

The in-place version of our algorithm is not stable. The main difficulty towards stability lies in the technique used to simulate the working areas. When the inactive elements are moved back into their inactive areas, their original order is lost and cannot be recovered if there are equal inactive elements. The simulation of working areas with this technique underlies all the existing in-place merging algorithms. Since all these solutions are quite complicated, and since the in-place sorting problem in the cache-oblivious model is clearly harder than the in-place merging problem, it is an interesting open problem how to achieve stability.

Acknowledgments

The author would like to thank his Ph.D. advisor, Professor Roberto Grossi, for his support, help and suggestions. The author would also like to thank an anonymous referee for his helpful comments.

References

[1] Pankaj K. Agarwal, Lars Arge, Andrew Danner, and Bryan Holland-Minkley. Cache-oblivious data structures for orthogonal range searching. In Proc. 19th ACM Symposium on Computational Geometry, 2003.

[2] A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A model for hierarchical memory. In Alfred Aho, editor, Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 305–314, New York City, NY, May 1987. ACM Press.

[3] A. Aggarwal, A. K. Chandra, and M. Snir. Hierarchical memory with block transfer. In Ashok K. Chandra, editor, Proceedings of the 28th Annual Symposium on Foundations of Computer Science, pages 204–216, Los Angeles, CA, October 1987. IEEE Computer Society Press.

[4] Alok Aggarwal and Jeffrey Scott Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, September 1988.

[5] Stephen Alstrup, Michael A. Bender, Erik D. Demaine, Martin Farach-Colton, J. Ian Munro, Theis Rauhe, and Mikkel Thorup. Efficient tree layout in a multilevel memory hierarchy. Manuscript, November 2002.

[6] Lars Arge, Michael A. Bender, Erik D. Demaine, Bryan Holland-Minkley, and J. Ian Munro. Cache-oblivious priority queue and graph algorithm applications. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC-02), pages 268–276, New York, May 2002. ACM Press.
[7] M. A. Bender, E. D. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. In 41st Annual Symposium on Foundations of Computer Science, pages 399–409, Redondo Beach, California, November 2000. IEEE Computer Society Press.

[8] Michael A. Bender, Richard Cole, Erik D. Demaine, and Martin Farach-Colton. Scanning and traversing: Maintaining data for traversals in a memory hierarchy. Lecture Notes in Computer Science, 2461:139 ff., 2002.

[9] Michael A. Bender, Richard Cole, and Rajeev Raman. Exponential structures for efficient cache-oblivious algorithms. Lecture Notes in Computer Science, 2380:195 ff., 2002.

[10] Michael A. Bender, Erik D. Demaine, and Martin Farach-Colton. Efficient tree layout in a multilevel memory hierarchy. Lecture Notes in Computer Science, 2461:165 ff., 2002.

[11] Michael A. Bender, Ziyang Duan, John Iacono, and Jing Wu. A locality-preserving cache-oblivious dynamic dictionary. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA-02), pages 29–38, January 2002.

[12] Gerth Stølting Brodal and Rolf Fagerberg. Cache oblivious distribution sweeping. Lecture Notes in Computer Science, 2380:426 ff., 2002.

[13] Gerth Stølting Brodal and Rolf Fagerberg. Funnel heap - a cache oblivious priority queue. Lecture Notes in Computer Science, 2518:219 ff., 2002.

[14] Gerth Stølting Brodal and Rolf Fagerberg. On the limits of cache-obliviousness. In Proc. 35th Annual ACM Symposium on Theory of Computing, 2003.

[15] Gerth Stølting Brodal, Rolf Fagerberg, and Riko Jacob. Cache-oblivious search trees via trees of small height. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 39–48, 2002.

[16] Gianni Franceschini and Viliam Geffert. A simple in-place sorting with O(n log n) comparisons and O(n) moves. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2003.

[17] Gianni Franceschini and Roberto Grossi. Implicit dictionaries supporting searches and amortized updates in O(log n log log n). In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03), pages 670–678. SIAM, 2003.

[18] Gianni Franceschini and Roberto Grossi. Optimal cache-oblivious implicit dictionaries. In Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP), 2003.

[19] Gianni Franceschini and Roberto Grossi. Optimal worst-case operations for implicit cache-oblivious search trees. In Proceedings of the 8th International Workshop on Algorithms and Data Structures (WADS), 2003.

[20] Gianni Franceschini, Roberto Grossi, J. Ian Munro, and Linda Pagli. Implicit B-trees: New results for the dictionary problem. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2002.

[21] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science, pages 285–297, New York City, New York, October 1999. IEEE Computer Society Press.

[22] Jyrki Katajainen and Tomi A. Pasanen. In-place sorting with fewer moves. Information Processing Letters, 70(1):31–37, April 1999.
[23] D. E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, 1973 (second edition: 1998).

[24] M. A. Kronrod. Optimal ordering algorithm without operational field. Soviet Math. Dokl., 10:744–746, 1969.

[25] Tony W. Lai and Derick Wood. Implicit selection. In R. Karlsson and A. Lingas, editors, SWAT 88: 1st Scandinavian Workshop on Algorithm Theory, volume 318 of Lecture Notes in Computer Science, pages 14–23, Halmstad, Sweden, July 1988. Springer-Verlag.

[26] J. I. Munro. Searching a two key table under a single key. In Alfred Aho, editor, Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 383–387, New York City, NY, May 1987. ACM Press.

[27] J. Ian Munro and Venkatesh Raman. Sorting with minimum data movement. Journal of Algorithms, 13(3):374–393, September 1992.

[28] J. Ian Munro and Hendra Suwanda. Implicit data structures for fast search and update. Journal of Computer and System Sciences, 21(2):236–250, 1980.

[29] H. Prokop. Cache-oblivious algorithms. Master's thesis, MIT, Cambridge, MA, 1999.

[30] Jeffrey Salowe and William Steiger. Simplified stable merging tasks. Journal of Algorithms, 8(4):557–571, December 1987.