Hybrid Index Maintenance for Growing Text Collections

Stefan Büttcher, Charles L. A. Clarke, Brad Lushman
School of Computer Science, University of Waterloo, Canada
{sbuettch,claclark,bmlushma}@plg.uwaterloo.ca

ABSTRACT

We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new method distinguishes between short and long posting lists: while short lists are maintained using a merge strategy, long lists are kept separate and are updated in-place. This way, costly relocations of long posting lists are avoided. We discuss the shortcomings of previous hybrid methods and give an experimental evaluation of the new technique, showing that its index maintenance performance is superior to that of the earlier methods, especially when the amount of main memory available to the indexing system is small. We also present a complexity analysis which proves that, under a Zipfian term distribution, the asymptotic number of disk accesses performed by the best hybrid maintenance strategy is linear in the size of the text collection, implying the asymptotic optimality of the proposed strategy.

Index update operations can be carried out very efficiently if we do not care about query processing performance; spending more time on index reorganization tasks, on the other hand, usually results in lower response times when processing search queries. Although fully dynamic text collections allow document insertions, deletions, and modifications, in this paper we focus only on document insertions and discuss the problem of maintaining an inverted index for a monotonically growing text collection -- a restriction that is quite common in this area [6] [7] [11] [13].
For a discussion of index maintenance strategies in the presence of document deletions, see Chiueh and Huang [3] or Büttcher and Clarke [1]. Virtually all existing index maintenance strategies for monotonically growing text collections work by accumulating postings for incoming documents in main memory, building an in-memory inverted file, and only combining this in-memory index with the existing on-disk data structures when a certain, pre-defined memory utilization threshold is exceeded. This way, disk accesses can be amortized over a larger number of index update operations, resulting in increased indexing performance. The individual update strategies differ only in how they combine the accumulated in-memory data with the data stored on disk.

Traditionally, index maintenance strategies for text retrieval systems based on inverted files have employed either an in-place [13] or a merge-based [6] update scheme. In an in-place update scheme, whenever the data in memory have to be combined with the existing on-disk index, the in-memory posting lists are appended to the existing ones on disk. In general, this requires relocating existing on-disk lists. These relocations are time-consuming and can be avoided by using an overallocation strategy that leaves some amount of free space after every on-disk posting list. Only when the amount of free space is too small for the postings currently held in memory does the whole posting list have to be moved to a new location, where there is enough space for all postings for the given term. In contrast, merge-based strategies create a new on-disk inverted file by combining an existing one with the in-memory data. Although this requires a great number of disk operations, since the entire on-disk index has to be read, it can usually be done fairly efficiently, because the number of disk seeks involved is very small, thanks to the largely sequential disk access pattern.
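To make the contrast concrete, the core of a merge-based update can be sketched as follows. This is a minimal, hypothetical model -- an index is just a lexicographically sorted list of (term, postings) pairs and the in-memory index is a dict -- not the authors' implementation:

```python
from heapq import merge
from itertools import groupby

def merge_update(on_disk, in_memory):
    """Merge-based update: combine the on-disk index (a lexicographically
    sorted list of (term, postings) pairs) with the in-memory index (a dict
    from term to postings) into a new sorted on-disk index.  Both inputs are
    consumed strictly in sorted order, mimicking sequential disk access."""
    mem_sorted = sorted(in_memory.items())
    new_index = []
    for term, group in groupby(merge(on_disk, mem_sorted), key=lambda e: e[0]):
        postings = []
        for _, plist in group:
            postings.extend(plist)  # on-disk postings precede in-memory ones
        new_index.append((term, sorted(postings)))
    return new_index
```

The old on-disk index is read once, front to back, and the new one is written once, which is why the scheme pays many sequential disk operations but very few seeks.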
Categories and Subject Descriptors

H.2.4 [Systems]: Textual databases; H.3.4 [Systems and Software]: Performance evaluation

General Terms

Experimentation, Performance

Keywords

Information Retrieval, Index Construction, Index Maintenance, Merge, In-Place, Hybrid

1. INTRODUCTION

Index maintenance strategies for text retrieval systems in dynamic search environments, where index update operations are interleaved with search queries, have been studied intensively over the past few years. In general, every maintenance strategy can be described as a trade-off between index maintenance performance and query processing performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008 ...$5.00.

Figure 1: General structure of an inverted file (schematic). Usually, a B-tree or a sorted list is chosen as the in-memory data structure. If the terms and their posting lists are stored in lexicographical order in the on-disk index, only a very small number of term descriptors needs to be kept in memory.

Roughly speaking, merge-based strategies are better at dealing with short posting lists, while in-place methods are better at handling long posting lists. Recently, we have presented a family of hybrid strategies, based on a distinction between short and long posting lists [2].
Hybrid strategies maintain short posting lists following a merge-based approach, while long lists are updated in-place. By combining the two paradigms, they achieve better indexing performance than either approach alone.

In this paper, we present a new family of hybrid index maintenance strategies. Our new approach is based upon the same idea of distinguishing between short and long posting lists and treating them in different ways, but it corrects some shortcomings that we found in our previous technique. The new approach is similar to the pulsing technique proposed by Cutting and Pedersen [4] and allows the indexing system to view a posting list as the concatenation of a long list and a short list. It offers better indexing performance than the old one, especially if main memory is scarce.

In the next section, we give a brief overview of related work in the area of both off-line and on-line index construction, summarizing existing techniques and discussing the shortcomings of our previous hybrid index maintenance method. Section 3 contains a description of the new method. This is followed by an experimental evaluation of old and new index maintenance strategies in section 4. Finally, we give a theoretical analysis of the new hybrid method, including a proof that, under a Zipfian term distribution, the hybridization of Logarithmic Merge provides an amortized per-posting index maintenance disk complexity of O(1).

of all occurrences of a given term (as opposed to mere document numbers). Inverted lists are split up into chunks containing 30,000 postings, and each chunk is compressed using a byte-aligned gap compression method [10], resulting in an average storage requirement of 12 bits per posting. The off-line process described above can be transformed into an on-line method. Search queries are then processed by fetching posting lists from all sub-indices created so far and concatenating them.
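The byte-aligned gap compression mentioned above can be illustrated with a generic varint gap coder. This is a sketch of the general technique only; the actual method of [10] and its 12-bits-per-posting average involve more than shown here:

```python
def encode_gaps(postings):
    """Delta-encode a strictly increasing posting list, then varint-encode
    each gap: 7 payload bits per byte, high bit set on every byte except
    the last byte of a gap (byte-aligned, like the scheme cited as [10])."""
    out = bytearray()
    prev = 0
    for p in postings:
        gap = p - prev
        prev = p
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # continuation byte
            gap >>= 7
        out.append(gap)                      # final byte, high bit clear
    return bytes(out)

def decode_gaps(data):
    """Invert encode_gaps: reassemble each gap from its 7-bit slices,
    then prefix-sum the gaps back into absolute posting positions."""
    postings, cur, acc, shift = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7       # more bytes belong to this gap
        else:
            cur += acc       # gap complete; advance running position
            postings.append(cur)
            acc, shift = 0, 0
    return postings
```

Small gaps (frequent terms) cost a single byte, which is why gap compression pays off most for exactly the long lists the hybrid strategies care about.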
Query processing performance of this strategy is very poor, of course, because all lists are heavily fragmented, implying a great number of disk seeks during query processing. We use this strategy as the baseline for index maintenance performance and refer to it as No Merge (because the inverted files are only merged after the whole collection has been indexed).

An important property of inverted files is that, as long as posting lists are stored in lexicographical order within the inverted file, the search data structure does not need to explicitly contain the position of each term's posting list. For example, if we know the positions of the lists for the terms "impaired" and "impolite", then we can easily find all postings for "implicit" by reading all on-disk data between "impaired" and "impolite". If the unexplored area between two such terms is fairly small (upper limit in our implementation: 128 KB), this can be done very efficiently and does not pose a performance problem. If we allow list relocations, we lose this implicit information.

In-Place Index Maintenance

In-place index update schemes combine the existing on-disk inverted file with the in-memory data by appending all postings from the in-memory index to the corresponding list in the on-disk inverted file (creating a new list if necessary). To make this possible, they use overallocation strategies: whenever a posting list is written to disk, more space is allocated than is actually needed. This space can then be filled with postings from the in-memory index the next time the system runs out of memory. From time to time, of course, lists have to be relocated in order to create more room for new postings at the end of the list. In general, there are two types of in-place update strategies: those that require on-disk posting lists to be contiguous [8] and those that do not [13].
Forcing posting lists to be contiguous minimizes the number of disk seeks necessary to fetch a posting list (and thus maximizes query processing performance). Allowing posting lists to be non-contiguous, on the other hand, helps avoid list relocations. It therefore increases index maintenance performance, but degrades query performance, due to a greater number of disk seeks. The main problem of in-place update with contiguous posting lists is that, due to the frequent relocation of posting lists, no pre-defined ordering on the terms in the inverted file can be guaranteed. Therefore, it is necessary to maintain an explicit vocabulary data structure, containing, for every term, the position of its on-disk posting list.

2. RELATED WORK

In this section, we give a summary of existing results in the area of off-line and on-line index construction.

Inverted Files and Off-Line Index Construction

Inverted files have proved to be the most efficient data structure in high-performance retrieval systems for large text collections [17]. An inverted file realizes a mapping from terms to their posting lists. A term's posting list (also called inverted list) is a list of all occurrences of that term. An inverted file is a collection of posting lists, stored on a storage medium supporting random access. It is equipped with a search data structure (usually a search tree) that can be used to find the posting list associated with a given term. An example is shown in Figure 1.

Off-line index construction methods for static text collections usually follow a collection partitioning approach that splits the whole collection into a set of smaller subcollections, where the size of the individual subcollections is determined by the amount of main memory in the system. After an inverted file has been created for each such subcollection, all these sub-indices are combined in a multiway merge process, yielding the final index representing the entire collection.
[9] [16] [5] In the context of this paper, all inverted lists contain full positional information, i.e., the exact locations of all term occurrences.

inverted files is not constant, but logarithmic in the current size of the on-disk index. The strategy proposed by Büttcher and Clarke makes use of the concept of index generation. An on-disk inverted file is of

· generation 0 if it has been directly created from the in-memory index;
· generation n + 1 if it is the result of a merge operation involving indices of generation n.

Whenever there is more than one on-disk index of the same generation n, all indices of that generation are merged, resulting in a new inverted file of generation n + 1. This process is repeated until there are no more such collisions. This strategy leads to a set of on-disk inverted files of exponentially increasing size. Its total indexing disk complexity is

  $\Theta\!\left(N \cdot \log\frac{N}{M}\right).$

We refer to this strategy as Logarithmic Merge.

Hybrid Index Maintenance

Hybrid index maintenance is motivated by the fact that, for large text collections, the vast majority of all postings is found in very long posting lists containing several million entries (as shown in Figure 2). Copying these very long lists during re-merge operations (as required by Immediate Merge, for example) is very costly and should be avoided. Even though more modern strategies, like Sqrt Merge and Logarithmic Merge, substantially reduce the number of disk operations, the problem in principle remains the same. Therefore, it seems promising to distinguish between long and short posting lists and to update only short lists using a merge strategy, while long lists are updated in-place.
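The generation bookkeeping of Logarithmic Merge behaves like a carry-propagating binary counter. A minimal sketch (the `merge_indexes` helper is hypothetical; string labels stand in for real inverted files):

```python
def merge_indexes(a, b):
    """Hypothetical placeholder for merging two on-disk inverted files;
    here we just concatenate their labels."""
    return a + b

def add_partition(generations, new_index):
    """Logarithmic Merge bookkeeping.  'generations' maps a generation
    number to the single index of that generation (if any).  A freshly
    dumped in-memory index enters at generation 0; whenever two indexes
    of the same generation collide, they are merged into generation n+1,
    exactly like a carry propagating through a binary counter."""
    n, carry = 0, new_index
    while generations.get(n) is not None:
        carry = merge_indexes(generations.pop(n), carry)
        n += 1
    generations[n] = carry
    return generations
```

After 2^k memory dumps, all postings sit in a single index of generation k, and each posting has been rewritten only O(log(N/M)) times, which is where the Θ(N · log(N/M)) bound comes from.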
Earlier this year, we presented a family of hybrid index maintenance strategies based on this distinction between short and long lists [2]. The basic idea is rather simple: as soon as the posting list for a given term exceeds a certain length (we refer to this as the long list threshold, denoted as T), it is declared long and moved from the merge-updated part of the on-disk index to the in-place-maintained part. Every posting list in the in-place part is stored in an individual file. Whenever a frequent term with a long posting list is encountered during a merge operation, its postings are appended to the corresponding file instead of being transferred to the target index of the current merge operation. From the indexing system's point of view, each long list is stored contiguously inside its file. The actual details (contiguity, relocations) are left to the file system implementation (in our experiments, the ext3 file system that is part of the Linux kernel). This technique can be combined with all three merge strategies described above, resulting in three different hybrid index maintenance strategies. Since these hybrid strategies require all long on-disk posting lists to be stored contiguously, we refer to them as HIM_C (Hybrid Immediate Merge), HSM_C (Hybrid Sqrt Merge), and HLM_C (Hybrid Logarithmic Merge) -- where the subscript "C" indicates contiguous posting lists.

Shortcomings of Hybrid Index Maintenance

The hybrid index maintenance strategies described above have several serious flaws. For the sake of brevity, we only discuss two major problems:

Figure 2: Total number of postings in the index vs. number of postings in long lists. After 50% of the GOV2 text collection has been indexed, more than 82% of all postings are found in lists longer than 1,000,000 postings.

This vocabulary data structure, which records for each term the position of its on-disk posting list, can then be used to update posting lists in the order in which they are stored on disk.
Since the vocabulary can become very large (more than 50 million distinct terms for the text collection used in our experiments), it no longer fits into main memory, and vocabulary maintenance -- as opposed to posting list maintenance -- becomes a challenging task (see [8] for details).

Merge-Based Index Maintenance

In merge-based index maintenance strategies, postings are never directly added to an existing on-disk inverted list. Instead, whenever in-memory postings have to be combined with on-disk data, the in-memory index is merged with an existing on-disk inverted file, resulting in a new inverted file that replaces the old one. The simplest merge strategy is called Immediate Merge. At any given point in time, this strategy maintains at most one active on-disk inverted file. When the indexing system runs out of memory, the in-memory data are merged with this inverted file, resulting in a new inverted file that supersedes the old one. Since every such merge operation requires the system to read the entire current on-disk index, the total amount of disk operations necessary to index a text collection containing N tokens is

  $\sum_{i=1}^{N/M} (i \cdot M) \in \Theta\!\left(N \cdot \frac{N}{M}\right),$

where M is the number of postings that can be held in memory. Despite its simplicity and its obviously problematic quadratic disk complexity, it seems to be very difficult to find an in-place update scheme that outperforms the Immediate Merge strategy (cf. Lester et al. [7] [8]).

Recently, it has been studied how allowing the indexing system to maintain more than one inverted file at a time affects the performance of merge-based update schemes. Lester et al. [6] analyzed the case where the search system is allowed to use k independent on-disk inverted files at a time (for constant k). In this situation, an optimal index maintenance strategy has an overall disk complexity of

  $\Theta\!\left(N \cdot \left(\frac{N}{M}\right)^{1/k}\right).$

For k = 2, we refer to this strategy as Sqrt Merge. Lester et al.
[6] and Büttcher and Clarke [1] also looked at the case where the maximum allowable number of on-disk

Figure 3: Index layout for a hybrid maintenance strategy with non-contiguous posting lists. Each term's posting list is the concatenation of a number of list segments found in the in-memory index, the merge-maintained index, and the in-place index. The sub-list found in the in-place part of the index may consist of several non-contiguous segments.

@addfile /u3/gov2/uncompressed/GX169/40.txt
@rank[bm25][docid][count=20][id=479] "".."" by "porche", "suv"
@addfile /u3/gov2/uncompressed/GX135/81.txt
@rank[bm25][docid][count=20][id=3219] "".."" by "volkswagen", "beetle", "convertible"
@addfile /u3/gov2/uncompressed/GX082/90.txt
@rank[bm25][docid][count=20][id=10406] "".."" by "problems", "hmong", "immigrants"
@addfile /u3/gov2/uncompressed/GX073/87.txt
@rank[bm25][docid][count=20][id=48490] "".."" by "lee", "county", "clerk", "courts"

1. Delegating the in-place update to the file system is very convenient, but hides too much information and therefore makes it difficult to analyze the hybrid strategies. In particular, the rules that define under what circumstances a posting list is relocated (to avoid fragmentation) are unclear and may differ from system to system -- even from hard disk to hard disk. Are the long posting lists really contiguous? Most likely not, but we don't know.

2. Assume the long list threshold of HIM_C is chosen as T = 10^6. After, say, 100 physical index updates, the posting list of a term X exceeds this threshold and is moved from the merge-maintained part of the on-disk index to the in-place part. From that point on, it is considered long and is always updated in-place.
Since it took this list 100 re-merge operations to reach the threshold, on average we can expect to see 10^4 occurrences of the term between two physical index updates. For only 10^4 new postings, however, it would certainly be more efficient to store them in the merge-updated part of the index, avoiding the additional disk seek(s) caused by appending them to the end of the file associated with the term X. With every additional merge operation, this effect gets stronger, implying that the performance of the strategies described above is highly sensitive to the amount of main memory available.

In this paper, we remedy these problems. We present a new family of hybrid strategies that is less sensitive to memory limitations and more amenable to a formal complexity analysis, allowing us to get a deeper understanding of the subtleties of index maintenance for dynamic text collections.

Figure 4: Commands #53,437 - #53,445, taken from the whole sequence consisting of 27,204 update commands and 27,204 queries.

The resulting index layout is depicted in Figure 3. The new hybrid strategies start like any of the non-hybrid merge strategies described in section 2 (Immediate Merge, Sqrt Merge, or Logarithmic Merge). During a merge operation, however, whenever a term is encountered for which there are more than T postings participating in the merge operation, all these postings are moved to the in-place part of the index instead of the target inverted file of the merge operation. This is done by simply appending the postings to the single inverted file that represents the in-place part. In contrast to the contiguous hybrid strategies discussed in the previous section, our new strategies introduce additional non-contiguities in the on-disk posting lists (because new postings are simply appended to the existing inverted file), causing additional disk seeks at query time.
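The divert-during-merge step just described can be sketched as follows. These are simplified in-memory structures for illustration; the real system works with compressed on-disk lists, not Python dicts:

```python
from heapq import merge
from itertools import groupby

def hybrid_merge(sources, T, inplace_file):
    """Sketch of the new hybrid merge: merge posting lists from several
    source indexes (dicts from term to sorted postings).  Any term that
    contributes more than T postings to this merge operation is appended
    to the single append-only in-place file (modeled as a list of
    segments) instead of the new merge-maintained index."""
    new_index = []
    entries = merge(*[sorted(s.items()) for s in sources])
    for term, group in groupby(entries, key=lambda e: e[0]):
        postings = [p for _, plist in group for p in plist]
        if len(postings) > T:
            inplace_file.append((term, postings))  # new non-contiguous segment
        else:
            new_index.append((term, postings))
    return new_index
```

Note that the decision is local to one merge operation: a term whose list was diverted earlier can still leave a fresh short segment in the merge-maintained part, which is exactly the two-part list layout of Figure 3.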
By choosing a large enough value for T, however, these query-time disk seeks can be amortized over a longer, sequential read operation. If, for example, we choose T = 10^6, and the hard drive can read 50,000 postings (sequentially) in the time it requires to perform a single random disk seek, then the relative slowdown caused by the additional disk seeks is at most 50,000/1,000,000 = 5%. By choosing different threshold values T, it is possible to control index maintenance and query processing performance. Smaller T values mean better update performance (because more postings are moved to the in-place part of the index). Greater T values mean better query performance (by reducing the number of disk seeks during query processing). In particular, T = ∞ represents a pure merge-based strategy, while T = 0 is equivalent to the No Merge strategy -- both create the same amount of fragmentation in the on-disk posting lists. The exact effect of a given T value depends on hard drive characteristics. By measuring the disk's bandwidth and its seek latency, it is possible to find the right T value for the relative slowdown tolerated -- which, presumably, will depend on the relative update and query load of the system.¹

Since, in the in-place part of the on-disk index, posting list segments are not stored in any particular order, we need an additional data structure that tells us, for every term in the inverted file, the location of all its list segments. This is similar to the problem we described when we discussed the shortcomings of in-place update in section 2. This time,

¹ It needs to be mentioned here that the criterion that defines when postings are transferred to the in-place index is slightly grubby. In our system, all postings are compressed and thus have different sizes, between 1 and 6 bytes, depending on the term's frequency and locality. This should be taken into account.
However, we decided to ignore this detail because it would make the analysis of our method more complicated.

3. A NEW SET OF HYBRID STRATEGIES

Instead of enforcing a strict distinction between short and long posting lists, as done in our previous approach to hybrid index maintenance, the new strategies allow each posting list to consist of two parts -- a long one, updated in-place (realized by a single, append-only inverted file), and a short one, maintained using one of the merge strategies. Only when the length of the short, merge-maintained part of the posting list exceeds a pre-defined threshold value T is it removed from the merge part of the index and added to the in-place part of the index. This method is similar to the pulsing technique described by Cutting and Pedersen [4].

Figure 5: Index maintenance and query processing performance for different strategies with various parameter settings: (a) index maintenance performance (total indexing time in hours) and (b) query processing performance (average time per query in ms), plotted against the long list threshold T (x 10^6) for HIM_NC / Immediate Merge, HSM_NC / Sqrt Merge, HSM_C / Sqrt Merge, and HLM_NC / Logarithmic Merge. T = ∞ represents the underlying non-hybrid strategy. T = 0 is equivalent to the No Merge strategy.

The array was built on top of two 7,200-rpm hard drives. The size of the final index, containing full positional information for all terms appearing in GOV2, was 61 GB.
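The relationship between the threshold T, seek cost, and sequential read speed described in the previous section can be sketched with two small helpers (illustrative names, mirroring the worked 5% example):

```python
import math

def relative_slowdown(T, postings_per_seek):
    """Upper bound on the relative query slowdown caused by one extra
    random seek per in-place list segment: the seek, expressed as an
    equivalent number of sequentially readable postings, is amortized
    over a sequential read of at least T postings."""
    return postings_per_seek / T

def smallest_T(postings_per_seek, tolerated_slowdown):
    """Smallest long-list threshold T whose slowdown bound stays within
    the tolerated relative slowdown for the measured drive."""
    return math.ceil(postings_per_seek / tolerated_slowdown)
```

With a drive that reads 50,000 postings in one seek time, tolerating a 5% slowdown yields exactly the T = 10^6 used in the example; a faster-seeking drive would permit a smaller T and hence better update performance.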
Indexing and Query Processing Performance

In our first series of experiments, we had our retrieval system process the command sequence, building an index for GOV2 and concurrently processing 27,204 search queries. We allowed the system to use 512 MB of main memory for the in-memory index. For the 3 different merge strategies and various threshold values T, we analyzed index maintenance and query processing performance. From the results shown in Figure 5, it is obvious that our new strategies exhibit significantly better indexing performance than the non-hybrid methods. For T = 10^6, HLM_NC builds the index 17% faster than its non-hybrid counterpart (HSM_NC: 49%; HIM_NC: 155%). On the other hand, query performance only drops by 4% (HSM_NC: 7%; HIM_NC: 10%). Compared to No Merge, HLM_NC (T = 10^6) with 512 MB RAM only needs 47% longer to build the final index (6.86 hours instead of 4.66), but exhibits vastly superior query processing performance -- 1.4 seconds per query, instead of 4.8 seconds. For the non-contiguous hybrid strategies, decreasing T consistently improves indexing performance and decreases query performance in all cases. Moreover, there is a strict ordering between the individual strategies. For instance, in order for HSM_NC to outperform Logarithmic Merge's indexing performance on GOV2, it needs T ≤ 2 · 10^5. At that point, however, its query processing performance is much worse than that of Logarithmic Merge (cf. Figure 5).

Space vs. Time Trade-offs

Our second series of experiments serves the purpose of finding out how the amount of RAM available for the in-memory index affects the index maintenance performance of our strategies. We therefore varied the amount of memory available for the in-memory index between 64 MB and 1,024 MB.
The results depicted in Figure 6 show that, while the indexing performance of HLM_NC changes only insignificantly, HLM_C suffers severely from the reduced size of the in-memory index: index construction time climbs from 7.39 hours (1,024 MB) to 9.21 hours (64 MB) -- a 25% increase. For comparison, indexing time with HLM_NC only increases by 14% (Logarithmic Merge: 18%).

Figure 6: Impact of memory size on index maintenance performance. HLM_C is more sensitive to the amount of available main memory than HLM_NC and non-hybrid Logarithmic Merge.

however, the situation is different. Only frequent terms can make it into the in-place index, and their number is very small (only 13,309 terms appear more than 10^5 times in the GOV2 text collection, for instance). We can therefore afford to keep in memory the meta-information that is necessary to efficiently find all list segments for a frequent term. Depending on which merge strategy serves as the starting point for the respective hybrid strategy, we refer to the new strategy as HIM_NC (Immediate Merge), HSM_NC (Sqrt Merge), or HLM_NC (Logarithmic Merge). In each case, the subscript "NC" indicates non-contiguous posting lists.

4. EXPERIMENTAL EVALUATION

Our experiments were conducted using the GOV2 text collection used in the TREC Terabyte track.² GOV2 consists of 25.2 million documents with a total size of 426 GB. In order to be able to measure index maintenance and query processing performance at the same time, we created a mixed update/search sequence consisting of 27,204 update operations and 27,204 search queries (randomly taken from the 50,000 queries used in the efficiency task of the 2005 TREC Terabyte track), simulating an on-line search environment. A short subsequence is shown in Figure 4. Search queries are of the form "find all documents containing at least one of the query terms, rank them by their BM25 score, and return the document IDs of the top 20 documents".
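The BM25 ranking used by these queries can be sketched as follows. This is a generic BM25 formulation with the conventional parameter defaults k1 = 1.2 and b = 0.75; the paper does not state the exact variant or parameter values its system uses:

```python
import math
from collections import Counter

def bm25_top_docs(query_terms, docs, k1=1.2, b=0.75, count=20):
    """Disjunctive BM25 ranking: score every document containing at least
    one query term and return the doc IDs of the top 'count' documents.
    'docs' maps doc_id -> list of tokens (a toy stand-in for an index)."""
    N = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / N
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))                      # document frequencies
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:count]
```

In the real system, of course, the scorer reads posting lists from the inverted file rather than re-tokenizing documents, which is why query latency is dominated by how fragmented those lists are.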
All experiments were run on a single Linux PC based on an AMD Athlon64 3500+ CPU with 2 GB of main memory and 7,200-rpm SATA hard drives. The input documents were read from a RAID-0 array.

² http://www-nlpir.nist.gov/projects/terabyte/

5. COMPLEXITY ANALYSIS

In this section, we generalize the experimental results presented in this paper in order to make them applicable to other text collections. We present a complexity analysis, based on the number of disk operations (measured by the number of postings transferred to/from disk) performed by the respective strategy. More specifically, we determine the number of disk write operations necessary to index a text collection of a given size. Since the number of read operations carried out during merge operations is upper-bounded by the number of write operations (nothing is read twice), this allows us to calculate the total number of disk operations performed (up to a constant factor). We show that the HLM_NC method only needs Θ(N) disk operations to construct an index for a text collection of size N. Asymptotically, this is the optimal indexing performance and is only rivalled by the No Merge strategy used in off-line index construction (which provides uncompetitive query performance if used in an on-line environment).

Our analysis is based on the assumption that the term distribution can be expressed as a generalized Zipf distribution [15], i.e., that the number of times the i-th most frequent word appears in the text collection is inversely proportional to i^α:

  $f_T(i) = \frac{c}{i^{\alpha}}$

Figure 7: Comparing the term distribution of the GOV2 collection with Zipf distributions for various values of α.
Zipf with α = 1.2 seems to provide a lower bound for the total number of postings found in lists that are longer than 1,000,000 elements.

(rounded to the nearest integer), for some constants c and α (Zipf originally proposed α = 1). The same assumption has been made before for similar purposes (see, for example, Cutting and Pedersen [4]). Figure 7 suggests that, for the GOV2 collection, we have a Zipf exponent somewhere in the neighborhood of α = 1.2. While our analysis does not depend on the term distribution being exactly Zipfian (using a Zipf-Mandelbrot distribution [12] would lead to the same result), it does rely on the convergence of the sum

  $\sum_{i=1}^{\infty} \frac{1}{i^{\alpha}}.$

We therefore assume α > 1. The sum's value is given by Riemann's zeta function ζ(α) and can be approximated by

  $\zeta(\alpha) \approx \gamma + \int_{1}^{\infty} x^{-\alpha}\,dx = \gamma + \frac{1}{\alpha - 1},$   (1)

where γ is the Euler-Mascheroni constant [14] (γ = 0.5772...). For realistic Zipf exponents (α < 1.4), the relative approximation error is less than 1%. We will therefore always use the integral representation when we refer to the value of the zeta function. Also, we will denote the term γ + 1/(α − 1) from (1) as ζ_α. In order to achieve

  $N = \sum_{i=1}^{\infty} \frac{c}{i^{\alpha}} \approx c \cdot \left(\gamma + \int_{1}^{\infty} x^{-\alpha}\,dx\right) = c \cdot \left(\gamma + \frac{1}{\alpha - 1}\right)$

We first determine, for a collection of size N, the number of terms whose posting lists are longer than T postings, denoted as L(N, T). From the definition of f_T(i), we know:

  $f_T(i) \geq T \iff \frac{N}{\zeta_\alpha \cdot i^{\alpha}} \geq T \iff i \leq \left(\frac{N}{\zeta_\alpha \cdot T}\right)^{1/\alpha}.$

Hence, we have

  $L(N, T) = \left\lfloor \left(\frac{N}{\zeta_\alpha \cdot T}\right)^{1/\alpha} \right\rfloor \in \Theta\!\left(\frac{N^{1/\alpha}}{T^{1/\alpha}}\right).$   (3)

The total number of postings found in these lists is

  $\sum_{i=1}^{L(N,T)} f_T(i) \approx \frac{N}{\zeta_\alpha} \cdot \left(\gamma + \int_{1}^{L(N,T)} x^{-\alpha}\,dx\right) = N - \frac{N}{\zeta_\alpha} \cdot \frac{(L(N,T))^{1-\alpha}}{\alpha - 1} = N - \frac{N^{1/\alpha} \cdot T^{(1-1/\alpha)}}{(\alpha - 1)\,(\zeta_\alpha)^{1/\alpha}}.$   (4)

It immediately follows that the total number of postings found in short lists, i.e., in lists shorter than T postings, is

  $\hat{S}(N, T) := \frac{N^{1/\alpha} \cdot T^{(1-1/\alpha)}}{(\alpha - 1)\,(\zeta_\alpha)^{1/\alpha}} \in \Theta\!\left(T^{(1-1/\alpha)} \cdot N^{1/\alpha}\right).$
In other words, the relative number of postings found in short lists converges to zero as $N$ approaches infinity. The speed of convergence depends on the Zipf parameter $\alpha$. In the remainder of this section, we refer to the total number of postings found in short lists as $\hat{S}(N, T)$, to the number of postings found in long lists as $\hat{L}(N, T)$, and to the number of long lists as $L(N, T)$.

An Analysis of Hybrid Immediate Merge

Consider the Hybrid Immediate Merge strategy with non-contiguous posting lists (HIM_NC), as defined in Section 3. In order to index a text collection containing $N$ postings, HIM_NC needs to perform $\frac{N}{M}$ merge operations, where $M$ is the number of postings that can be stored in main memory. In every such merge operation, HIM_NC moves all long posting lists (containing at least $T$ postings) it encounters to the in-place-maintained part of the on-disk index. For the merge-updated part of the on-disk index, which is the result of this merge operation, this means that it does not contain any lists that are longer than $T$ postings.

Now, consider the merge-updated part of the on-disk index that results from the $k$-th merge operation, i.e., that is created after $k \cdot M$ tokens have been added to the collection. This inverted file contains two types of lists:

· Genuine short lists, each having at most $T$ postings.

· Short parts of long lists, each having fewer than $T$ postings.

The latter happens if, for instance, the posting list for a term was moved from the merge-maintained to the in-place-maintained part of the index in the $(k-1)$-th re-merge operation, and the number of postings accumulated for that term between the $(k-1)$-th and the $k$-th merge operation is smaller than $T$, so that the hybrid strategy leaves these new postings in the merge-updated part of the index.

We can easily give a lower bound for the number of postings $\hat{P}$ that are stored in the merge part of the index after the $k$-th merge operation (each of the $L(k \cdot M, T)$ long lists can deprive the merge part of at most $T$ postings):

  $\hat{P}(k \cdot M, T) \geq \hat{S}(k \cdot M, T) - T \cdot \left(\frac{k \cdot M}{\zeta \cdot T}\right)^{1/\alpha}$

(using equation 5). We can also give an upper bound:

  $\hat{P}(k \cdot M, T) \leq \hat{S}(k \cdot M, T) + T \cdot L(k \cdot M, T) \in O\left(T^{1-1/\alpha} \cdot (k \cdot M)^{1/\alpha}\right) + O\left(T \cdot \frac{(k \cdot M)^{1/\alpha}}{T^{1/\alpha}}\right)$

(using equations 3 and 5). It follows that

  $\hat{P}(k \cdot M, T) \in \Theta\left(T^{1-1/\alpha} \cdot (k \cdot M)^{1/\alpha}\right)$.

Of course, this is also the number of disk write operations necessary to create the particular index. Since, for a text collection of size $N$, the number of disk operations spent on adding postings to the in-place part of the index is $\hat{L}(N, T) \in \Theta(N)$, the total number of disk operations needed to build an inverted index for such a text collection is:

  $\Theta(N)$  (in-place part)
  $+ \sum_{k=1}^{N/M} \Theta\left(T^{1-1/\alpha} \cdot (k \cdot M)^{1/\alpha}\right) = \Theta\left(T^{1-1/\alpha} \cdot \frac{N^{1+1/\alpha}}{M}\right)$  (merge part).   (6)

For $\alpha = 1.2$ and constant $T$, this gives us an index maintenance disk complexity of $\Theta\left(\frac{N^{1.833}}{M}\right)$, clearly better than the $\Theta\left(\frac{N^2}{M}\right)$ complexity of non-hybrid Immediate Merge.
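The total in equation (6) can be illustrated by summing the model's per-merge write costs directly. A small sketch, assuming (per the analysis above) that the merge part holds about $\hat{S}(k \cdot M, T)$ postings after the $k$-th merge and is rewritten in full on every merge; the parameter values are illustrative.

```python
# Sketch: cumulative merge-write cost of Hybrid Immediate Merge (HIM_NC)
# under the model of this section. After the k-th merge the merge part
# holds about S_hat(k*M, T) postings (equation 5), and each merge rewrites
# the whole merge part. Equation (6) predicts a total cost of roughly
# T**(1 - 1/alpha) * N**(1 + 1/alpha) / M. Parameters are illustrative.

GAMMA = 0.5772156649

def s_hat(n, t, alpha):
    zeta = GAMMA + 1.0 / (alpha - 1.0)
    return n ** (1/alpha) * t ** (1 - 1/alpha) / ((alpha - 1) * zeta ** (1/alpha))

def him_merge_writes(N, M, T, alpha):
    # one merge every M tokens; the k-th merge writes ~ S_hat(k*M, T) postings
    return sum(s_hat(k * M, T, alpha) for k in range(1, N // M + 1))

alpha, T, M = 1.2, 10**6, 10**8
c1 = him_merge_writes(10**10, M, T, alpha)
c2 = him_merge_writes(2 * 10**10, M, T, alpha)
# Doubling N should scale the merge cost by about 2**(1 + 1/alpha) ~ 2**1.833
print(f"cost ratio when N doubles: {c2 / c1:.3f} (predicted {2 ** (1 + 1/alpha):.3f})")
```

The observed growth ratio matches the $N^{1+1/\alpha}$ exponent of equation (6), which is the super-linear term that Hybrid Logarithmic Merge removes.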
An Analysis of Hybrid Logarithmic Merge

Finally, let us consider the Hybrid Logarithmic Merge strategy with non-contiguous posting lists (HLM_NC), as defined in Section 3. Like every index maintenance strategy discussed in this paper, HLM_NC accumulates postings in main memory until there is no more memory available, at which point an on-disk sub-index is created from the in-memory data. After $M$ tokens have been added to the collection, two on-disk inverted files are created:

· a merge-maintained inverted file (with contiguous posting lists) containing $\hat{S}(M, T)$ postings;

· an in-place-maintained inverted file containing $\hat{L}(M, T)$ postings.

We refer to the merge-maintained inverted file, created after $M$ tokens have been added, as an index of generation 0. After the first on-disk inverted file has been created, whenever $2^k \cdot M$ tokens have been added to the collection (for some integer $k$), we have to merge two merge-maintained sub-indices of generation $k - 1$ into a new merge-maintained index of generation $k$. This is the standard procedure of Logarithmic Merge. Here, however, because we follow a hybrid strategy, all long lists encountered during this merge operation are moved to the in-place part of the index instead of the new merge-maintained inverted file of generation $k$. Using the same argument as before (for HIM_NC), we can give a $\Theta$-bound for the number of postings $\hat{P}$ in this new merge-maintained inverted file of generation $k$:

  $\hat{P}(2^k \cdot M, T) \in \Theta\left(T^{1-1/\alpha} \cdot (2^k \cdot M)^{1/\alpha}\right)$.   (7)

Therefore, the total number of disk write operations $D(k)$ performed during merge operations, before and including the creation of this new inverted file of generation $k$, is:

  $D(k) = \hat{P}(2^k \cdot M, T) + 2 \cdot D(k - 1)$   (8)
  $= \sum_{i=0}^{k} 2^{k-i} \cdot \hat{P}(2^i \cdot M, T)$   (9)
  $\in \Theta\left(\sum_{i=0}^{k} 2^{k-i} \cdot T^{1-1/\alpha} \cdot (2^i \cdot M)^{1/\alpha}\right)$   (10)

(using equation 7). Hence, we have:

  $D(k) \in \Theta\left(T^{1-1/\alpha} \cdot 2^k \cdot M^{1/\alpha} \cdot \sum_{i=0}^{k} \left(2^{1/\alpha - 1}\right)^i\right)$   (11)
  $= \Theta\left(T^{1-1/\alpha} \cdot 2^k \cdot M^{1/\alpha}\right)$   (12)
  $\subseteq O\left(T^{1-1/\alpha} \cdot 2^k \cdot M\right)$,   (13)

where (12) follows because $\alpha > 1$ implies $2^{1/\alpha - 1} < 1$, so the geometric sum in (11) is bounded by a constant. For a text collection of size $N = 2^k \cdot M$, the number of disk operations spent on adding postings to the in-place part of the index is $\hat{L}(N, T) \in \Theta(N)$. Thus, the total number of disk operations needed to build an inverted index for a text collection of size $N$ is:

  $\Theta(N)$  (for the in-place part)
  $+\, O\left(T^{1-1/\alpha} \cdot N\right)$  (for the merge part).

Since $T$ is a constant, this means we need $\Theta(N)$ disk operations. HLM_NC's disk complexity is asymptotically optimal.

Validation

In order to convince the suspicious reader of the correctness of the results obtained in this section, we use them to predict how changing the amount of available main memory $M$ and the threshold value $T$ affects the overall index maintenance performance of our system. We assume that the GOV2 collection, consisting of 42 billion tokens, obeys Zipf's law with $\alpha = 1.25$.

HIM_NC, with $T = 2 \cdot 10^6$ and 512 MB of RAM, needs 32.46 hours to index the GOV2 collection. Since the No Merge strategy, using the same amount of main memory, indexes the collection in 4.66 hours, the index maintenance overhead of HIM_NC ($T = 2 \cdot 10^6$) is 27.80 hours. According to equation 6, altering $T$ by a factor $x$ changes the total index maintenance overhead by a factor $x^{1-1/\alpha}$. Thus, the expected total indexing time is:

· $27.80 \cdot 2^{0.2} + 4.66 = 36.59$ hours for $T = 4 \cdot 10^6$,

· $27.80 \cdot (\frac{1}{2})^{0.2} + 4.66 = 28.86$ hours for $T = 1 \cdot 10^6$, and

· $27.80 \cdot (\frac{1}{4})^{0.2} + 4.66 = 25.73$ hours for $T = 0.5 \cdot 10^6$.

The experimentally obtained index construction times are 37.42, 27.61, and 23.43 hours, respectively, close to the predicted numbers.

HLM_NC, with $T = 10^6$ and 512 MB of RAM, needs 6.87 hours to build the index, performing 150 merge operations ($k \approx 7$). The overhead, compared to No Merge, is 2.21 hours. Decreasing the available amount of memory to 256 MB decreases $M$ by a factor of 2 and increases $k$ by 1. According to equation 12, this increases the index maintenance overhead by a factor

  $2^1 \cdot \left(\frac{1}{2}\right)^{1/\alpha} \approx 1.1487$,

from 2.21 hours to 2.54 hours. We therefore predict a total indexing time of $4.66 + 2.54 = 7.20$ hours. The time measured in our experiments is 7.19 hours, very close to our prediction. Similarly, for 1024 MB of main memory, we predict a total index construction time of 6.58 hours.
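The predictions above follow from simple arithmetic on the numbers quoted in the text (4.66 h for No Merge, 27.80 h of HIM_NC overhead at $T = 2 \cdot 10^6$, 2.21 h of HLM_NC overhead at 512 MB), applying the scaling factor $x^{1-1/\alpha}$ from equation (6) and the factor $2 \cdot (\frac{1}{2})^{1/\alpha}$ from equation (12):

```python
# Reproducing the back-of-the-envelope predictions of the Validation
# paragraph. All input numbers (hours, alpha = 1.25, thresholds) are taken
# directly from the text; only the arithmetic is carried out here.

alpha = 1.25
no_merge_hours = 4.66          # baseline indexing time (No Merge)
him_overhead = 27.80           # HIM_NC overhead at T0 = 2e6, 512 MB
T0 = 2e6

# Equation (6): scaling T by x scales the overhead by x**(1 - 1/alpha).
for T in (4e6, 1e6, 0.5e6):
    x = T / T0
    predicted = him_overhead * x ** (1 - 1/alpha) + no_merge_hours
    print(f"T = {T:.0e}: predicted total {predicted:.2f} h")

# Equation (12): halving M scales the HLM_NC overhead by 2 * (1/2)**(1/alpha).
hlm_overhead = 2.21
factor = 2 * 0.5 ** (1 / alpha)
print(f"256 MB prediction: {no_merge_hours + hlm_overhead * factor:.2f} h")
```

Running the computation yields the 36.59, 28.86, 25.73, and 7.20 hour figures reported above.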
Again, this is close to the experimentally obtained time, 6.66 hours.

6. CONCLUSION & FUTURE WORK

We have presented a new family of hybrid index maintenance strategies to be used in on-line index construction for growing text collections. Like previous hybrid strategies, they are based on a distinction between long and short posting lists. Our experimental evaluation has shown that the new strategies outperform previous index maintenance strategies, including existing hybrid strategies, which do not work very well if only a small amount of memory is available for the in-memory index. We have given a theoretical analysis of the disk complexity of our hybrid strategies, proving that the combination of Logarithmic Merge (for short lists) and in-place update with non-contiguous list segments (for long lists) results in an overall index maintenance disk complexity of $O(N)$ for a text collection of size $N$. This is asymptotically optimal.

The main shortcoming of the new strategies is their slightly reduced query processing performance, caused by internal fragmentation in the on-disk posting lists. We think that this problem can be solved by integrating overallocation strategies (and possibly relocation strategies) into the in-place part. In particular, a comparatively simple proportional preallocation strategy for inverted lists in the in-place index will improve query performance, while not affecting the asymptotical index maintenance complexity.

7. REFERENCES

[1] S. Büttcher and C. L. A. Clarke. Indexing Time vs. Query Time Trade-offs in Dynamic Information Retrieval Systems. In Proceedings of the 14th ACM Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany, November 2005.
[2] S. Büttcher and C. L. A. Clarke. A Hybrid Approach to Index Maintenance in Dynamic Text Retrieval Systems. In Proceedings of the 28th European Conference on Information Retrieval (ECIR 2006), London, UK, April 2006.
[3] T. Chiueh and L. Huang.
Efficient Real-Time Index Updates in Text Retrieval Systems. Technical report, SUNY at Stony Brook, NY, USA, August 1998.
[4] D. R. Cutting and J. O. Pedersen. Optimization for Dynamic Inverted Index Maintenance. In Proceedings of the 13th ACM SIGIR Conference on Research and Development in Information Retrieval, pages 405-411, New York, USA, 1990.
[5] S. Heinz and J. Zobel. Efficient Single-Pass Index Construction for Text Databases. Journal of the American Society for Information Science and Technology, 54(8):713-729, 2003.
[6] N. Lester, A. Moffat, and J. Zobel. Fast On-Line Index Construction by Geometric Partitioning. In Proceedings of the 14th ACM Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany, November 2005.
[7] N. Lester, J. Zobel, and H. E. Williams. In-Place versus Re-Build versus Re-Merge: Index Maintenance Strategies for Text Retrieval Systems. In Proceedings of the 27th Conference on Australasian Computer Science, Darlinghurst, Australia, 2004.
[8] N. Lester, J. Zobel, and H. E. Williams. Efficient Online Index Maintenance for Text Retrieval Systems. Information Processing & Management, 42, July 2006.
[9] A. Moffat and T. C. Bell. In-Situ Generation of Compressed Inverted Files. Journal of the American Society for Information Science, 46(7):537-550, 1995.
[10] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of Inverted Indexes for Fast Query Evaluation. In Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[11] K. A. Shoens, A. Tomasic, and H. García-Molina. Synthetic Workload Performance Analysis of Incremental Updates. In Proceedings of the 17th ACM SIGIR Conference on Research and Development in Information Retrieval, 1994.
[12] Z. K. Silagadze. Citations and the Zipf-Mandelbrot's Law. Complex Systems, 11:487, 1997.
[13] A. Tomasic, H. García-Molina, and K. Shoens. Incremental Updates of Inverted Lists for Text Document Retrieval. In Proceedings of the 1994 ACM SIGMOD Conference on Management of Data, pages 289-300, New York, USA, 1994.
[14] E. W. Weisstein. Euler-Mascheroni Constant. From MathWorld, http://mathworld.wolfram.com/EulerMascheroniConstant.html.
[15] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, USA, 1949.
[16] J. Zobel, S. Heinz, and H. E. Williams. In-Memory Hash Tables for Accumulating Text Vocabularies. Information Processing Letters, 80(6):271-277, 2001.
[17] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted Files versus Signature Files for Text Indexing. ACM Transactions on Database Systems, 23(4):453-490, 1998.