Compression boosting in optimal linear time using the Burrows-Wheeler Transform

Paolo Ferragina*    Giovanni Manzini†

* Partially supported by the Italian MIUR projects "Algorithmics for Internet and the Web (ALINWEB)" and "Enhanced Content Delivery (ECD)". Dipartimento di Informatica, Università di Pisa, Italy. E-mail: ferragina@di.unipi.it.

Abstract

In this paper we provide the first compression booster that turns a zeroth order compressor into a more effective k-th order compressor without any loss in time efficiency. More precisely, let A be an algorithm that compresses a string s within λ|s|H_0(s) + µ bits of storage in O(T(|s|)) time, where H_0(s) is the zeroth order entropy of the string s. Our booster improves A by compressing s within λ|s|H_k(s) + log2 |s| + g_k bits, still using O(T(|s|)) time, where H_k(s) is the k-th order entropy of s. The idea of a "compression booster" has been very recently introduced by Giancarlo and Sciortino in [7]. They combined the Burrows-Wheeler Transform [3] with dynamic programming and achieved our same compression bound, but with running time O(T(|s|)) + Θ(|s|²). We start from the same premises of [7] but, instead of using dynamic programming, we design a linear time optimization algorithm based on novel structural properties of the Burrows-Wheeler Transform.

1 Introduction

After its proposal in 1994, the Burrows-Wheeler compression algorithm [3] has immediately become a standard for efficient lossless compression. Its key tool is the so-called Burrows-Wheeler Transform (BWT, hereafter), which rearranges the symbols of the input string s producing an output bwt(s) that is usually easier to compress. The compression of the string bwt(s) is usually implemented via a cascade of transformation and coding algorithms: Move-To-Front (MTF), followed by Run-Length Encoding (RLE), followed by a statistical compressor (Arithmetic or Huffman coding). Even in
its simplest forms this scheme achieves a compression performance that is competitive with, if not superior to, the best known compression methods (see e.g. [5, 11]). Extensive experimental work has investigated the role and usefulness of each of the above steps, and numerous variants have been proposed in the literature. Recently, theoretical studies have provided new insights into the nature of the BWT, offering some analytical justification for the original algorithmic choices made by Burrows and Wheeler in their seminal paper. Specifically, Manzini [9] proved bounds in terms of the k-th order entropy of the string s, denoted by H_k(s), without making any assumption on the input s. He showed that a variant of the Burrows-Wheeler compression algorithm takes at most (5 + ε)|s|H_k(s) + log2 |s| + g_k bits to compress any string s, where ε ≈ 10⁻² and g_k is a parameter independent of s. This result is particularly relevant in that the compression bound holds for any k ≥ 0, while a similar bound cannot hold for many of the best known compression algorithms, like LZ77, LZ78, and PPMC. Given these intriguing properties, more and more researchers have tried to vivisect the structure of the BWT-based compressor and gain insights into the nature and role of its four basic steps: BWT, MTF (Move-to-Front encoding), RLE (Run-Length Encoding), and zeroth order coding (Huffman or Arithmetic coding). Although MTF and RLE serve their purposes, the general feeling among theoreticians and practitioners is that the MTF coder introduces some degree of inefficiency, and several alternatives have been proposed [1, 2, 4, 6, 8, 12].

† Giovanni Manzini: partially supported by the MIUR project "Piattaforme abilitanti per griglie computazionali ad alte prestazioni orientate a organizzazioni virtuali scalabili (Grid.it)". Dipartimento di Informatica, Università del Piemonte Orientale, Alessandria, Italy, and IIT-CNR, Pisa, Italy. E-mail: manzini@mfn.unipmn.it.
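As a quick illustration of the MTF step mentioned above, here is a minimal Python sketch (ours, not taken from the paper): each symbol is replaced by its rank in a recency list, so the long runs of equal symbols that bwt(s) tends to produce become runs of zeroes, which RLE and a zeroth order coder then compress well.

```python
def mtf_encode(s, alphabet):
    """Move-To-Front: output each symbol's current position in a recency
    list, then move that symbol to the front of the list."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)                # rank of ch in the recency list
        out.append(i)
        table.insert(0, table.pop(i))      # move ch to the front
    return out
```

For instance, mtf_encode("aaabbb", "ab") yields [0, 0, 0, 1, 0, 0]: runs of identical symbols map to runs of zeroes.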
Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for Industrial and Applied Mathematics. All Rights Reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Very recently, Giancarlo and Sciortino [7] showed analytically that this feeling was correct. Specifically, they proved that compressing the string s up to its k-th order entropy is equivalent to finding an optimal partition of the string bwt(s) with respect to a suitably chosen cost function. This idea led to compression algorithms which, in terms of guaranteed performance, outperform the old MTF-based algorithms. For example, in [7] Giancarlo and Sciortino describe a compression algorithm whose output size is bounded by 2.5|s|H_k(s) + log2 |s| + g_k bits, for any string s and for any k ≥ 0. This improves by a factor 2 a similar bound for an MTF-based algorithm given in [9]. Giancarlo and Sciortino's approach can be seen as a compression booster in that, given a compressor A which squeezes any string s within λ|s|H_0(s) + µ bits of storage, it improves A by compressing s within λ|s|H_k(s) + log2 |s| + g_k bits (see footnote 1 in Sect. 2.1). A drawback of this booster is that the optimal partition of bwt(s) is computed using dynamic programming. This induces an overhead of Θ(|s|²) time, which compares unfavorably with the Θ(|s|) running time of the MTF-based algorithms. In this paper we start from the same premises of [7] but, instead of using dynamic programming, we design a linear time optimization algorithm based on novel structural properties of the Burrows-Wheeler Transform. Our contribution is twofold.
From the theoretical side, we show that MTF-less compression is overall superior to the classical MTF-based variants. From the practical side, we open the way to novel BWT-based compressors that deserve a careful experimental study. The bottom line is that we provide compressor designers with the first compression booster that allows one to turn a zeroth order compressor into a more effective k-th order compressor, without any loss in time efficiency. Their fatiguing task should be simpler, from now on!

2 Notation and known results

In this section we review some known results that will be used in the rest of the paper.

2.1 The empirical entropy of a string

Let s be a string over the alphabet Σ = {a_1, ..., a_h}, and let n_i denote the number of occurrences of the symbol a_i inside s. The zeroth order empirical entropy of the string s is defined as

(2.1)   H_0(s) = - Σ_{i=1}^{h} (n_i / |s|) log (n_i / |s|).

The value |s|H_0(s) is the maximum compression we can achieve by using a uniquely decodable code in which a fixed codeword is assigned to each alphabet symbol. We can achieve a greater compression if for each symbol we use a codeword which depends on the k symbols preceding it. For any length-k string w, we denote by w_s the string of symbols following the occurrences of w in s.

Example 1. Let s = mississippi and w = si. The two occurrences of si inside s are followed by the symbols s and p, respectively. Hence w_s = sp.

The k-th order empirical entropy of s is defined as

H_k(s) = (1/|s|) Σ_{w ∈ Σ^k} |w_s| H_0(w_s).

The value |s|H_k(s) represents a lower bound to the output size of any uniquely decodable encoder which uses, for each symbol, a code depending only on the symbol itself and on the k most recently seen symbols. The empirical entropy resembles the entropy defined in the probabilistic setting (for example, when the input comes from a Markov source).
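These definitions are easy to make executable. The following Python sketch (ours, not part of the paper) computes H_0 via (2.1) and H_k by grouping, for every length-k context w, the string w_s of following symbols:

```python
from collections import Counter
from math import log2

def h0(s):
    """Zeroth order empirical entropy H_0(s), eq. (2.1)."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def hk(s, k):
    """k-th order empirical entropy H_k(s) = (1/|s|) * sum |w_s| H_0(w_s),
    where w_s collects the symbols following each occurrence of w in s."""
    if k == 0:
        return h0(s)
    following = {}
    for i in range(len(s) - k):
        following.setdefault(s[i:i + k], []).append(s[i + k])
    return sum(len(ws) * h0(ws) for ws in following.values()) / len(s)
```

For s = mississippi and k = 2, the context si indeed collects w_s = sp, as in Example 1.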
However, the empirical entropy is defined pointwise for any string and can be used to measure the performance of compression algorithms as a function of the string structure, thus without any assumption on the input source. (In the following, all logarithms are taken to the base 2 and we assume 0 log 0 = 0.) The value |s|H_0(s) also represents the output size of an ideal compressor which uses -log(n_i/|s|) bits for coding the symbol a_i.

In [9] it is shown that for highly compressible strings |s|H_k(s) fails to provide a reasonable bound on the performance of compression algorithms. For this reason, [9] introduces the concept of zeroth order modified empirical entropy:

H_0*(s) = 0                        if |s| = 0,
H_0*(s) = (1 + ⌊log |s|⌋) / |s|    if |s| ≠ 0 and H_0(s) = 0,
H_0*(s) = H_0(s)                   otherwise.

Note that for a non-empty string s, |s|H_0*(s) is at least equal to the number of bits needed to write down the length of s in binary. The k-th order modified empirical entropy H_k* is defined in terms of H_0* as the maximum compression we can achieve by looking at no more than k symbols preceding the one to be compressed (see footnote 2). Formally, let S_k be a set of substrings of s having length at most k. We say that the set S_k is a suffix cover of Σ^k, and write S_k ≼ Σ^k, if any string in Σ^k has a unique suffix in S_k.

Example 2. Let Σ = {a, b} and k = 3; two examples of suffix covers for Σ³ are {a, b} and {a, ab, abb, bbb}.

(Footnote 1: An attentive reader may have noticed the term log2 |s| in the compression bounds we attribute to [7, 9], which was not present in their published results. The point is that the output of the Burrows-Wheeler Transform consists of a permutation of s and of an integer in the range [1, |s|] (see [3] and Sect. 2.2). In the papers [7, 9] the compression bounds refer only to the compression of the permutation of s. In the present paper we account also for the space needed to encode the above integer.)
Indeed, any string over Σ of length 3 has a unique suffix in both sets.

(Footnote 2: Note that H_k was defined in terms of H_0 as the maximum compression we can achieve by looking at exactly k symbols preceding the one to be compressed. For H_k* we instead consider H_0* and a context of at most k symbols. This is to ensure that H_{k+1}*(s) ≤ H_k*(s) for any string s. See [9] for details.)

For any suffix cover S_k let

(2.2)   H_{S_k}(s) = (1/|s|) Σ_{w ∈ S_k} |w_s| H_0*(w_s);

the value H_{S_k}(s) represents the compression we can achieve using the strings in S_k as contexts for the prediction of the next symbol. The modified k-th order empirical entropy of s is then defined as the compression that we can achieve using the best possible suffix cover. Formally,

(2.3)   H_k*(s) = min_{S_k ≼ Σ^k} H_{S_k}(s);

see [9] for further details and properties of the entropy H_k*. In the following, we use S_k* to denote the suffix cover for which the minimum of (2.3) is achieved. Therefore we write

(2.4)   H_k*(s) = (1/|s|) Σ_{w ∈ S_k*} |w_s| H_0*(w_s).

In Sect. 2.2 we will also use the function

(2.6)   Ĥ_k(s) = (1/|s|) Σ_{w ∈ P_k*} |s̄_w| H_0*(s̄_w),

which is analogous to the k-th order empirical entropy (2.4) but refers to preceding symbols; the notation s̄_w and the optimal prefix cover P_k* are introduced in Sect. 2.2. The relationship between H_k* and Ĥ_k is shown by the following lemma.

Lemma 2.1. For any string s, Ĥ_k(s) = H_k*(s^R), where s^R denotes the reversal of the string s.

Proof. The set of symbols following w in s coincides with the set of symbols preceding w^R in s^R. More precisely, w_s is the reverse of the string containing the symbols preceding w^R in s^R. Furthermore, if S_k is a suffix cover of Σ^k, then the set of the reversed strings of S_k is a prefix cover of the reversal of Σ^k (which is Σ^k itself). The lemma follows by observing that H_0*(x) = H_0*(x^R) for any string x.

The above lemma tells us that if we bound the compression performance of an algorithm in terms of Ĥ_k(s), then we can achieve the same bound in terms of H_k*(s) by applying the algorithm to s^R.

2.2 The Burrows-Wheeler Transform

Let s denote a text over the constant size alphabet Σ.
In [3] Burrows and Wheeler introduced a new compression algorithm based on a reversible transformation, now called the Burrows-Wheeler Transform (BWT from now on). The BWT consists of three basic steps (see Fig. 1): (1) append to the end of s a special symbol $ smaller than any other symbol in Σ; (2) form a conceptual matrix M whose rows are the cyclic shifts of the string s$, sorted in lexicographic order; (3) construct the transformed text ŝ = bwt(s) by taking the last column of M. Notice that every column of M, hence also the transformed text ŝ, is a permutation of s$. Although it is not obvious, from ŝ we can always recover s; see [3] for details.

The importance of the BWT for data compression comes from the following observation. Let w denote a substring of s. By construction, all rows of the BWT matrix prefixed by w are consecutive. Hence, the symbols preceding every occurrence of w in s are grouped together in a set of consecutive positions of the string ŝ (the last column of M). These symbols thus form a substring of ŝ, which we denote hereafter by ŝ[w]. In the notation introduced below, ŝ[w] is equal to a permutation of s̄_w, namely ŝ[w] = π_w(s̄_w), with π_w a string permutation which depends on w.

A comment is in order at this point. In the rest of the paper we use the Burrows-Wheeler Transform as a key component of our compression booster. As we have just seen, the BWT relates substrings of s with their preceding symbols. Conversely, H_k(s) relates substrings of s with their following symbols. In order to simplify the analysis of our algorithms, we introduce an additional notation which offers this other point of view. Let s̄_w be the string of symbols that precede all occurrences of the substring w in the string s (cf. w_s, which contains the following symbols). Let P_k be a set of strings, having length at most k, that are unique prefixes of all strings in Σ^k. We call P_k a prefix cover of Σ^k (cf.
the definition of suffix cover S_k).

Example 3. Let Σ = {a, b} and k = 3; two examples of prefix covers for Σ³ are {a, b} and {a, ba, bb}. Indeed, any string over Σ of length 3 has a unique prefix in both sets.

Substituting in (2.2) w_s with s̄_w, and S_k with P_k, we define for any prefix cover P_k

(2.5)   H_{P_k}(s) = (1/|s|) Σ_{w ∈ P_k} |s̄_w| H_0*(s̄_w).

In analogy with (2.3), we take the minimum of (2.5) over all prefix covers. Let P_k* denote the prefix cover which minimizes (2.5). Using P_k*, we define the function Ĥ_k given in (2.6).

cyclic shifts:    sorted rows:
mississippi$      $mississippi
ississippi$m      i$mississipp
ssissippi$mi      ippi$mississ
sissippi$mis      issippi$miss
issippi$miss      ississippi$m
ssippi$missi      mississippi$
sippi$missis      pi$mississip
ippi$mississ      ppi$mississi
ppi$mississi      sippi$missis
pi$mississip      sissippi$mis
i$mississipp      ssippi$missi
$mississippi      ssissippi$mi

Figure 1: Example of Burrows-Wheeler transform for the string s = mississippi. The matrix on the right has the rows sorted in lexicographic order. The output of the BWT is the last column of the matrix; in this example the output is ipssm$pissii.

Example 4. Let s = mississippi and w = is. The two occurrences of is in s correspond to the fourth and fifth rows of the BWT matrix. Thus, ŝ[is] consists of the fourth and fifth symbols of ŝ, and we have ŝ[is] = sm. Indeed, s and m are the symbols preceding the two occurrences of is in s.

Note that in the first k columns of the BWT matrix we find, lexicographically ordered, all length-k substrings of s. Hence the string ŝ can be partitioned into the substrings ŝ[w] by letting w vary over Σ^k. We then write

(2.7)   ŝ = ⊙_{w ∈ Σ^k} ŝ[w] = ⊙_{w ∈ Σ^k} π_w(s̄_w),

where ⊙ denotes the concatenation operator among strings (see footnote 3). The same argument holds for any prefix cover, and in particular for the prefix cover P_k* which defines Ĥ_k(s): each row of the BWT matrix is prefixed by a unique string in P_k*, hence

(2.8)   ŝ = ⊙_{w ∈ P_k*} ŝ[w] = ⊙_{w ∈ P_k*} π_w(s̄_w).

Recall that permuting a string does not change its zeroth order entropy, that is, H_0*(s̄_w) = H_0*(π_w(s̄_w)).

Loosely speaking, suppose that we have a compression algorithm A that squeezes any string z within |A(z)| ≤ λ|z|H_0*(z) + µ bits, where λ and µ are constants independent of the input string z. Using A we can compress any string s up to Ĥ_k(s) with the following three-step procedure:

1. compute ŝ = bwt(s);
2. find the optimal prefix cover P_k*, and partition ŝ into the substrings ŝ[w], w ∈ P_k*;
3. compress each substring ŝ[w] using algorithm A.

By (2.6) and (2.8), we see that the above algorithm produces an output of size at most λ|s|Ĥ_k(s) + µ|Σ|^k bits. By Lemma 2.1, applying the above algorithm to s^R yields an output size of at most λ|s|H_k*(s) + µ|Σ|^k bits. It goes without saying that in the above compression procedure we ignored a few important details, such as the presence of the $ symbol and the fact that we need to be able to detect the end of each substring ŝ[w]. We will deal with these details when describing our compression booster in Section 4. At this point, however, it is crucial to observe that the above algorithmic approach, although appealing, has two drawbacks: (1) it needs to compute the optimal prefix cover P_k*, and (2) its compression can be bounded only in terms of a single entropy Ĥ_k, since the parameter k must be chosen in advance.

In their seminal paper, Burrows and Wheeler [3] implicitly overcame these drawbacks by transforming ŝ via the so-called Move-to-Front encoding (MTF from now on) and then compressing the output with an order zero encoder. The analysis in [9] showed that this approach compresses any string up to its k-th order entropy, for any k ≥ 0. Recently, Giancarlo and Sciortino [7] have proposed a new approach for achieving the entropy H_k*(s) using the BWT. They prove that compressing s up to H_k*(s), for any positive k, is equivalent to finding an optimal partition of the transformed string ŝ with respect to a suitably chosen cost function.
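To make the transform concrete, here is a small self-contained Python prototype (our illustration, not the paper's implementation): it realizes the three BWT steps of Sect. 2.2 by explicitly sorting the cyclic shifts, and inverts the transform by the classical column-reconstruction method of [3]. Both run in at least quadratic time; they are meant only to make the definitions executable.

```python
def bwt(s):
    """Burrows-Wheeler Transform, naively: append '$' (assumed smaller than
    every symbol of s), sort all cyclic shifts, return the last column."""
    t = s + "$"
    shifts = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in shifts)

def ibwt(last):
    """Invert the BWT by repeatedly prepending the last column to the
    re-sorted partial rows; the row ending with '$' is s$."""
    rows = [""] * len(last)
    for _ in range(len(last)):
        rows = sorted(last[i] + rows[i] for i in range(len(last)))
    full = next(row for row in rows if row.endswith("$"))
    return full[:-1]
```

For example, bwt("mississippi") returns ipssm$pissii, the string of Figure 1, and ibwt recovers the input.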
Hence, comparing (2.6) with (2.8) shows that the BWT can be seen as a tool for reducing the problem of compressing s up to the k-th order entropy to the problem of compressing distinct portions of ŝ up to their zeroth order entropy. Loosely speaking, it suffices to have a compression algorithm A that squeezes any string z within |A(z)| ≤ λ|z|H_0*(z) + µ bits, where λ and µ are constants. These ideas lead to compression algorithms which, in terms of guaranteed performance, outperform the old MTF-based approach. A drawback of Giancarlo and Sciortino's approach is its high computational cost: they use dynamic programming to determine the optimal partition of ŝ, and this yields an overhead of Θ(|s|²) time. This compares unfavorably with the Θ(|s|) running time of the MTF-based algorithms. In the following sections we show how to take advantage of some structural properties of the BWT matrix in order to find the optimal partition of ŝ in O(|s|) time, avoiding the use of dynamic programming.

(Footnote 3: In addition to ⊙_{w ∈ Σ^k} ŝ[w], the string ŝ also contains the last k symbols of s (which do not belong to any s̄_w) and the special symbol $. We momentarily ignore the presence of these k + 1 symbols in ŝ and deal with them in Sect. 4.)

REMARK. In the analysis of BWT-based compressors, it is customary to provide bounds in terms of both the empirical entropies H_k and H_k* defined in Sect. 2.1 (see [7, 9]). In the following we only consider the modified entropy H_k* (which is the hardest to deal with). The analysis for H_k is similar and will be described in the full paper. At the end of Sect. 4 we state, without proof, the main result concerning the entropy H_k.

3 From prefix covers to leaf covers

Notice that many strings may have the same locus (Definition 1 below) because a suffix tree edge may be labelled by a long substring of s. For example, in Figure 2 the locus of both ss and ssi is the node reachable by the path labelled by ssi.
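The suffix-based view of the BWT matrix used throughout this section can be checked numerically with a short illustrative snippet (ours, not the paper's code): sorting the suffixes of s$ and emitting, for each suffix, the symbol that cyclically precedes it reproduces bwt(s). This is exactly the leaf-to-symbol association between suffix tree leaves and BWT rows discussed in this section.

```python
def bwt_via_suffixes(s):
    """The i-th row of the BWT matrix is prefixed by the i-th suffix (in
    lexicographic order) of s+'$'; its last symbol is the one cyclically
    preceding that suffix."""
    t = s + "$"
    order = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in order)
```

For s = mississippi this again yields ipssm$pissii, matching Figure 1.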
A crucial ingredient for the efficient computation of the optimal partition of ŝ is the relationship between the BWT matrix and the suffix tree data structure [10]. Let T denote the suffix tree of the string s$. T has |s| + 1 leaves, one per suffix of s$, and edges labelled with substrings of s$ (see Figure 2). Any node u of T has implicitly associated a substring of s$, given by the concatenation of the edge labels on the downward path from the root of T to u. In this implicit association, the leaves of T correspond to the suffixes of s$. We assume that the suffix tree edges are ordered lexicographically. As a consequence, if we scan the leaves of T left to right, the associated suffixes are lexicographically ordered. Since each row of the BWT matrix is prefixed by one suffix of s$ (see Section 2.2), there is a natural one-to-one correspondence between the leaves of T and the rows of the BWT matrix. Moreover, since the suffixes are lexicographically ordered both in T and in the BWT matrix, the i-th leaf (counting from the left) of the suffix tree corresponds to the i-th row of the BWT matrix. We therefore associate the i-th leaf of T with the i-th symbol of the string ŝ. We write ℓ_i to denote the i-th leaf of T and ℓ̂_i to denote its associated symbol. From the above discussion it follows that ŝ = ℓ̂_1 ℓ̂_2 ··· ℓ̂_{|s|+1}. See Figure 2 for an example.

Figure 2: Suffix tree for the string s = mississippi$. The symbol associated to each leaf is displayed inside a circle.

Definition 1. Let w be a substring of s. The locus of w is the node [w] of T that has associated the shortest string prefixed by w.

3.1 The notion of leaf cover

We now introduce the notion of leaf cover for T, which is related to the concept of prefix cover defined in Sect. 2.

Definition 2. Given a suffix tree T, we say that a subset L of its nodes is a leaf cover if every leaf of the suffix tree has a unique ancestor in L.

Example 5. The suffix tree for s = mississippi$ is shown in Figure 2. A leaf cover consists of all nodes of depth one. Using the notion of locus, we can describe this leaf cover as L1 = {[$], [i], [m], [p], [s]}. Another leaf cover is L2 = {[$], [i$], [ippi$], [issi], [m], [p], [si], [ssi]}, which is formed by nodes at various depths.

For any suffix tree node u, let ŝ⟨u⟩ denote the substring of ŝ containing the symbols associated to the leaves descending from u. For example, in Figure 2 we have ŝ⟨[i]⟩ = pssm. Note that these symbols are exactly the symbols preceding i in mississippi$. More in general, we have the following result, whose immediate proof follows from the relationship between the suffix tree and the BWT matrix (recall that ŝ[w] is the substring of ŝ corresponding to the rows prefixed by w, see Sect. 2.2).

Lemma 3.1. For any string w, we have ŝ⟨[w]⟩ = ŝ[w].

Any leaf cover induces a partition of the suffix tree leaves, and therefore a partition of the string ŝ. Formally, the partition induced by L = {u_1, ..., u_h} is ŝ⟨u_1⟩, ..., ŝ⟨u_h⟩. That this is actually a partition of ŝ follows from the definition of leaf cover: since each leaf of T has an ancestor in L, the concatenation of these substrings covers the whole ŝ; moreover, these strings are non-overlapping, by the "uniqueness" property in Definition 2.

Example 6. Carrying on with Example 5, the partition of ŝ induced by L1 is {i, pssm, $, pi, ssii}. The partition of ŝ induced by L2 is {i, p, s, sm, $, pi, ss, ii}.

We are now ready to describe the relationship between leaf covers and prefix covers. Let P_k* = {w_1, ..., w_h} denote the optimal prefix cover which defines Ĥ_k(s) (see (2.6)). Let P̄_k denote the set of nodes {[w_1], ..., [w_h]} which are the loci of the strings in P_k*. Since P_k* is a cover of Σ^k, any leaf of T corresponding to a suffix of length greater than k has a unique ancestor in P̄_k. This means that in order to transform P̄_k into a leaf cover for T we need to add at most k suffix tree nodes. We formalize this notion with the following definition.

Definition 3. Let Q_k denote the set of leaves of T corresponding to suffixes of s$ of length at most k which are not prefixed by a string in P_k*, and let L_k* = P̄_k ∪ Q_k. The set L_k* is a leaf cover, and is called the leaf cover associated to the optimal prefix cover P_k*.

A comment is in order at this point. We are turning prefix covers into leaf covers for the suffix tree T. This effort is motivated by Lemma 3.1, which shows that there is no difference among different prefixes that have the same locus, since they induce the same substring of ŝ. Intuitively, this means that we can restrict the space in which the optimal prefix cover has to be searched to the one induced by the substrings spelled out by suffix tree nodes. A second crucial observation comes from Definition 2, which constrains the way a subset of these nodes can be selected to form a leaf cover: not every subset of the suffix tree nodes is admissible. Finally, Definition 3 characterizes the relation that does exist between leaf covers and prefix covers, thus providing us with a formal bridge between these two concepts, which we will exploit next.

3.2 The cost of a leaf cover

Let C denote the function which associates to every string x over Σ ∪ {$} the positive real value

(3.9)   C(x) = λ |x°| H_0*(x°) + µ,

where λ and µ are positive constants, and x° is the string x with the symbol $ removed. The rationale for considering a cost function C of this form is the following. We will use C to measure the compression of substrings of ŝ achieved by a zeroth order encoder (hence the term H_0*). Since ŝ contains an occurrence of the special symbol $, a substring of ŝ may contain the symbol $. However, in our algorithm we store separately the position of $ within ŝ (see Section 4), and for this reason we ignore the symbol $ in the definition of the cost function C. Having defined C, for any leaf cover L we define

(3.10)   C(L) = Σ_{u ∈ L} C(ŝ⟨u⟩).

Notice that the definition of C(L) is additive and, loosely speaking, accounts for the cost of individually compressing the substrings of the partition of ŝ induced by L.

Example 7. The "smallest" leaf cover of T is {root(T)}, and its induced partition consists of the whole string ŝ; hence C({root(T)}) = C(ŝ). The "largest" leaf cover of T consists of all suffix tree leaves {ℓ_1, ..., ℓ_{|s|+1}}, and its induced partition consists of the singletons ℓ̂_1, ..., ℓ̂_{|s|+1}; hence C({ℓ_1, ..., ℓ_{|s|+1}}) = Σ_{i=1}^{|s|+1} C(ℓ̂_i). Note that C(ℓ̂_i) = λ + µ if ℓ̂_i ≠ $, and C(ℓ̂_i) = µ if ℓ̂_i = $ (recall that according to (3.9) the symbol $ is removed when we compute the function C).

The next lemma shows that the cost of the leaf cover L_k* associated to the optimal prefix cover P_k* is within an additive constant of the k-th order entropy of s.

Lemma 3.2. Let C be defined by (3.9), and let L_k* be the leaf cover associated with P_k*, as defined in Definition 3. For any k ≥ 0 there exists a constant g_k such that, for any string s, C(L_k*) ≤ λ |s| Ĥ_k(s) + g_k.

Proof. By Definition 3 we have L_k* = P̄_k ∪ Q_k, hence

C(L_k*) = Σ_{u ∈ P̄_k} C(ŝ⟨u⟩) + Σ_{u ∈ Q_k} C(ŝ⟨u⟩).

To evaluate the second summation, recall that Q_k contains only leaves and |Q_k| ≤ k. Moreover, if u is a leaf then |ŝ⟨u⟩| = 1 and C(ŝ⟨u⟩) ≤ λ + µ. Hence, the second summation is bounded by k(λ + µ). To evaluate the first summation, recall that every node u ∈ P̄_k is the locus of a string w ∈ P_k*. By Lemma 3.1 and (3.9) we have

C(L_k*) ≤ Σ_{u ∈ P̄_k} C(ŝ⟨u⟩) + k(λ + µ)
        = Σ_{w ∈ P_k*} C(ŝ[w]) + k(λ + µ)
        ≤ λ Σ_{w ∈ P_k*} |ŝ[w]| H_0*(ŝ[w]) + µ|Σ|^k + k(λ + µ),

where the last step uses |P_k*| ≤ |Σ|^k. Now recall that ŝ[w] is a permutation of s̄_w, and therefore H_0*(ŝ[w]) = H_0*(s̄_w). Hence, using (2.6),

C(L_k*) ≤ λ Σ_{w ∈ P_k*} |s̄_w| H_0*(s̄_w) + g_k = λ |s| Ĥ_k(s) + g_k,

with g_k = µ|Σ|^k + k(λ + µ).

We now consider the leaf cover L_min which minimizes the value C(L) among all the possible leaf covers L. That is, we are interested in the leaf cover L_min such that C(L_min) ≤ C(L) for any leaf cover L. We say that L_min induces the optimal partition of ŝ with respect to the cost function C. The relevance of L_min resides in the following lemma, whose proof immediately derives from Lemma 3.2 and from the trivial observation that C(L_min) ≤ C(L_k*).

Lemma 3.3. Let L_min be the optimal leaf cover for the cost function C defined by (3.9). For any k ≥ 0 there exists a constant g_k such that, for any string s, C(L_min) ≤ λ |s| Ĥ_k(s) + g_k.

Summing up, we have shown that instead of finding the optimal prefix cover P_k* we can find the optimal leaf cover L_min. The latter has two remarkable properties: the entropy bound of Lemma 3.3 holds simultaneously for any k ≥ 0, and L_min presents strong structural properties that allow its computation in optimal linear time. This is the goal of the next section.

3.3 Computing L_min in linear time

The key observation for computing L_min in linear time is a decomposability property with respect to the subtrees of the suffix tree T. With a little abuse of notation, in the following we denote by L_min(u) the optimal leaf cover of the subtree of T rooted at node u.

Lemma 3.4. An optimal leaf cover for the subtree rooted at u consists of either the single node u, or the union of optimal leaf covers of the subtrees rooted at the children of u in T.

Proof. Let u_1, u_2, ..., u_c be the children of u in T. Note that both node sets {u} and ∪_{i=1}^{c} L_min(u_i) are leaf covers of the subtree rooted at u (see Definition 2). We now show that one of them is an optimal leaf cover for that subtree. Let us assume that L_min(u) ≠ {u}. Then L_min(u) consists of nodes which descend from u. We can partition L_min(u) as ∪_{i=1}^{c} L(u_i), where each L(u_i) is a leaf cover for the subtree rooted at u_i, child of u in T. By the optimality of the L_min(u_i)'s and the additivity of the function C, we have

C(L_min(u)) = Σ_{i=1}^{c} C(L(u_i)) ≥ Σ_{i=1}^{c} C(L_min(u_i)).

Hence, ∪_{i=1}^{c} L_min(u_i) is an optimal leaf cover for the subtree rooted at u, as claimed.

The above lemma ensures that the computation of L_min admits a greedy approach that processes the nodes of the suffix tree T bottom-up. The corresponding algorithm is detailed in Figure 3. Note that during the visit of T we store in L(u) the optimal leaf cover of the subtree rooted at u, and in Z(u) the cost of such an optimal leaf cover. The correctness of the algorithm follows immediately from Lemma 3.4.

(1) Construct the suffix tree T for the string s$.
(2) Visit T in postorder. Let u be the currently visited node, and let u_1, u_2, ..., u_c be its children:
    (2.1) Compute C(ŝ⟨u⟩).
    (2.2) Compute Z(u) = min{ C(ŝ⟨u⟩), Σ_{i=1}^{c} Z(u_i) }.
    (2.3) Set the leaf cover L(u) = {u} if Z(u) = C(ŝ⟨u⟩); otherwise set L(u) = ∪_{i=1}^{c} L(u_i).
(3) Set L_min = L(root(T)).

Figure 3: The pseudocode for the linear-time computation of the optimal leaf cover L_min.

For what concerns the running time of the algorithm of Fig. 3, we observe that the only non-trivial operation during the tree visit is the computation of C(ŝ⟨u⟩) in Step (2.1). This requires the knowledge of the number of occurrences of the symbol a_i in ŝ⟨u⟩, for i = 1, ..., |Σ|. These values can be obtained in O(1) time from the number of occurrences of a_i in ŝ⟨u_1⟩, ..., ŝ⟨u_c⟩, where u_1, ..., u_c are the children of u in T (recall that |Σ| = O(1) and that ŝ⟨u⟩ is the concatenation of ŝ⟨u_1⟩, ..., ŝ⟨u_c⟩). Hence, the visit of T takes constant time per node and O(|s|) time overall. Since the construction of the suffix tree at Step (1) takes O(|s|) time and space [10], we have established the following result.

Lemma 3.5. The algorithm in Figure 3 computes the leaf cover L_min achieving the minimum value for the cost function C defined by (3.9) in O(|s|) time and using O(|s|) space.
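To see the greedy of Figure 3 in action, here is a compact, illustrative Python prototype. It is not the paper's linear-time algorithm: it builds a quadratic-size suffix trie instead of a true suffix tree, and the constants λ = 2.5 and µ = 1.0 of the cost (3.9) are arbitrary example values. The bottom-up rule Z(u) = min{C(ŝ⟨u⟩), Σ_i Z(u_i)}, however, is exactly Step (2.2).

```python
from collections import Counter
from math import floor, log2

LAM, MU = 2.5, 1.0   # illustrative constants of the cost (3.9)

def cost(counts):
    """C(x) = LAM*|x°|*H0*(x°) + MU, from the symbol counts of x;
    '$' is ignored and H0* is the modified zeroth order entropy."""
    counts = {a: c for a, c in counts.items() if a != "$"}
    n = sum(counts.values())
    if n == 0:
        return MU
    h0 = -sum((c / n) * log2(c / n) for c in counts.values())
    h0star = (1 + floor(log2(n))) / n if h0 == 0 else h0
    return LAM * n * h0star + MU

def optimal_partition(s):
    """Greedy bottom-up computation of L_min (Figure 3) on a suffix trie."""
    t = s + "$"
    root = {}
    for i in range(len(t)):          # insert every suffix of t
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = t[i - 1]      # BWT symbol: the one preceding suffix i

    def visit(node):                 # returns (counts, Z(u), partition)
        if "leaf" in node:
            c = Counter(node["leaf"])
            return c, cost(c), [node["leaf"]]
        counts, z_children, parts = Counter(), 0.0, []
        for _, child in sorted(node.items()):   # children in lex order
            c, z, p = visit(child)
            counts += c
            z_children += z
            parts += p
        here = cost(counts)                     # cost of taking u alone
        if here <= z_children:
            return counts, here, ["".join(parts)]
        return counts, z_children, parts

    _, z, parts = visit(root)
    return parts, z
```

Whatever cover the greedy selects, concatenating the returned pieces gives back bwt(s); for s = mississippi the pieces join to ipssm$pissii.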
4 A BWT-based compression booster

We are now ready to describe our compression booster, which turns a zeroth order compressor into a more effective k-th order compressor without any (asymptotic) loss in time efficiency. Our booster uses linear space in addition to the space used by the zeroth order compressor. Our starting point is any compressor A which satisfies the following property.

Property 4.1. Let A be a compression algorithm such that, given an input string x, A first appends an end-of-string symbol # to x and then compresses x# with the following space and time bounds:
1. A compresses x# in at most λ |x| H_0*(x) + µ bits, where λ and µ are constants;
2. the running time of A on input x is O(T(|x|)), where T(·) is a convex function.

Note that the algorithm RHC described in [7, Sect. 6] satisfies the above property with λ = 2.5 and T(|x|) = |x|.

Given any compressor A satisfying Property 4.1, we can feed it to the compression booster described in Figure 4, obtaining a k-th order compressor without any (asymptotic) loss in time efficiency. The properties of our compression booster are formally stated in the following theorem.

Theorem 4.1. Given a compression algorithm A that satisfies Property 4.1, the booster detailed in Figure 4 compresses any string s within λ |s| Ĥ_k(s) + log2 |s| + g_k bits of storage, for any k ≥ 0. The compression takes O(T(|s|)) time and uses O(|s|) space in addition to the space used by algorithm A.

(1) Define the cost function C (see Sect. 3.2) according to the parameters λ and µ which appear in the compression bound of algorithm A.
(2) Compute the optimal leaf cover L_min with respect to the cost function C.
(3) Compute the partition ŝ = ŝ_1 ··· ŝ_m induced by L_min.
(4) For i = 1, ..., m, remove from ŝ_i the occurrence of $ (if any) and compress the resulting string ŝ_i°, if not empty, using the algorithm A.
(5) Compress the empty string using algorithm A (A actually compresses the singleton #).
(6) Write the binary encoding of the position of $ in ŝ using ⌈log2 |s|⌉ + 1 bits.

Figure 4: The pseudocode of our compression booster, which turns a zeroth order compressor A into a k-th order compressor.

Proof. First of all, let us show that the output produced by our booster can be decompressed. Notice that the symbol $ is initially removed from ŝ (i.e., from each ŝ_i). Hence, each string ŝ_i° is over the alphabet Σ. The decoder starts by decompressing the strings ŝ_1°, ..., ŝ_m° one at a time. The end-of-string symbol # is used to distinguish ŝ_i° from ŝ_{i+1}°. Since at Step (4) we only compress non-empty strings ŝ_i°, when the decoder finds an empty string (that is, the singleton #) it knows that all strings ŝ_1°, ..., ŝ_m° have been decoded, and it may compute |s| = Σ_i |ŝ_i°|. The decoder then fetches the next ⌈log2 |s|⌉ + 1 bits, which contain the position of $ within ŝ (written at Step (6)). At this point, the decoder has reconstructed the original ŝ, and it may recover the input string s using the inverse BWT.

As far as the compression performance is concerned, by construction we have |A(x)| ≤ C(x). Since L_min is the optimal leaf cover with respect to the cost function C, using Lemma 3.3 we get

Σ_{i=1}^{m} |A(ŝ_i)| ≤ Σ_{i=1}^{m} C(ŝ_i) = C(L_min) ≤ λ |s| Ĥ_k(s) + g_k.

Since the compression of the empty string needs a further µ bits, and we append ⌈log2 |s|⌉ + 1 bits to encode the position of the symbol $, the overall compression bound follows. By the convexity of T and the fact that Σ_i |ŝ_i| = |s|, we have Σ_i T(|ŝ_i|) ≤ T(|s|) + O(1). By Lemma 3.5, computing L_min takes O(|s|) time. Hence, the overall running time of our booster is O(T(|s|)), as claimed. Finally, the space bound follows directly from Lemma 3.5.

Combining Lemma 2.1 with Theorem 4.1, we get that, by applying our compression booster to the string s^R, we obtain the following result.

Corollary 4.1. Given a compression algorithm A that satisfies Property 4.1, we can compress any string s
within λ|s| Hk(s) + log2 |s| + gk bits for any k ≥ 0. The compression takes O(T(|s|)) time and uses O(|s|) space in addition to the space used by algorithm A.

The analysis that we have carried out for Hk can be repeated almost verbatim for the entropy Hk*. We will provide the details in the full paper. Here we only state the analogue of Corollary 4.1 for the entropy Hk*.

Corollary 4.2. Let A be a compression algorithm which satisfies Property 4.1 with the only modification that A compresses x# in at most λ|x| H0*(x) + η|x| + µ bits, where λ, η and µ are constants. Then, we can compress any string s within λ|s| Hk*(s) + η|s| + log2 |s| + gk bits for any k ≥ 0. The above compression takes O(T(|s|)) time and uses O(|s|) space in addition to the space used by algorithm A.

Acknowledgments

We would like to thank our children Damiano, Davide and Francesca for boosting our efforts on compressing the time for doing research.

References

[1] Z. Arnavut. Generalization of the BWT transformation and inversion ranks. In Proc. IEEE Data Compression Conference, page 447, 2002.
[2] B. Balkenhol, S. Kurtz, and Y. M. Shtarkov. Modification of the Burrows and Wheeler data compression algorithm. In Proc. IEEE Data Compression Conference, 1999.
[3] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[4] S. Deorowicz. Second step algorithms in the Burrows-Wheeler compression algorithm. Software Practice and Experience, 32(2):99–111, 2002.
[5] P. Fenwick. The Burrows-Wheeler transform for block sorting text compression: principles and improvements. The Computer Journal, 39(9):731–740, 1996.
[6] P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proc. 41st IEEE Symposium on Foundations of Computer Science, pages 390–398, 2000.
[7] R. Giancarlo and M. Sciortino. Optimal partitions of strings: A new class of Burrows-Wheeler compression algorithms.
In Combinatorial Pattern Matching Conference (CPM '03), pages 129–143, 2003.
[8] R. Grossi, A. Gupta, and J. Vitter. Indexing equals compression: Experiments on suffix arrays and trees. In Proc. 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '04), 2004. This volume.
[9] G. Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407–430, 2001.
[10] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976.
[11] J. Seward. The bzip2 home page, 1997. http://sources.redhat.com/bzip2.
[12] A. Wirth and A. Moffat. Can we do without ranks in Burrows-Wheeler transform compression? In Proc. IEEE Data Compression Conference, pages 419–428, 2001.
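To close with a concrete illustration, the pipeline of Figure 4 and the decoding protocol from the proof of Theorem 4.1 can be sketched in Python. Everything here is an assumption of the sketch, not the paper's code: `compress_A` is a stand-in for the compressor A of Property 4.1, `cut_points` stands for the partition induced by Lmin (which the algorithm of Figure 3 would compute), the rotation-sorting BWT is quadratic and for exposition only, and the floor in the bit width reflects one plausible reading of Step (6).

```python
import math

def boost_compress(s, cut_points, compress_A):
    """Sketch of the booster of Figure 4.  `compress_A` stands in for the
    zeroth order compressor A; `cut_points` stands in for the partition
    induced by L_min.  Returns the compressed records plus the trailing
    bits that encode the position of $ (Step (6))."""
    # BWT of s$ by sorting the cyclic rotations (quadratic; sketch only).
    t = s + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    s_hat = "".join(r[-1] for r in rotations)
    dollar_pos = s_hat.index("$")

    # Step (3): partition s_hat = s_1 ... s_m at the given cut points.
    cuts = [0] + list(cut_points) + [len(s_hat)]
    pieces = [s_hat[a:b] for a, b in zip(cuts, cuts[1:])]

    records = []
    for piece in pieces:                     # Step (4)
        piece = piece.replace("$", "")       # drop the sentinel, if present
        if piece:
            records.append(compress_A(piece))
    records.append(compress_A(""))           # Step (5): the lone '#'

    # Step (6): position of $ in s_hat, in floor(log2 |s|) + 1 bits.
    width = math.floor(math.log2(len(s))) + 1
    return records, format(dollar_pos, "0{}b".format(width))

def boost_decode_header(decoded_pieces, bits):
    """Mirrors the decoder in the proof of Theorem 4.1: read pieces until
    the empty one, compute |s| as the sum of their lengths, then read
    floor(log2 |s|) + 1 bits to re-insert $ and recover s_hat."""
    pieces = []
    for p in decoded_pieces:
        if p == "":              # the singleton '#': all pieces are in
            break
        pieces.append(p)
    s_hat = "".join(pieces)
    n = len(s_hat)               # = |s|, since $ was removed before coding
    width = math.floor(math.log2(n)) + 1
    pos = int(bits[:width], 2)
    return s_hat[:pos] + "$" + s_hat[pos:]
```

Note that `boost_decode_header` recovers ŝ exactly; a full decoder would finish with the inverse BWT, which we omit here.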