Type Less, Find More: Fast Autocompletion Search with a Succinct Index

Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany, bast@mpi-inf.mpg.de
Ingmar Weber, Max-Planck-Institut für Informatik, Saarbrücken, Germany, iweber@mpi-inf.mpg.de

ABSTRACT
We consider the following full-text search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. Known indexing data structures that apply to this problem either incur large processing times for a substantial class of queries, or they use a lot of space. We present a new indexing data structure that uses no more space than a state-of-the-art compressed inverted index, but with 10 times faster query processing times. Even on the large TREC Terabyte collection, which comprises over 25 million documents, we achieve, on a single machine and with the index on disk, average response times of one tenth of a second. We have built a full-fledged, interactive search engine that realizes the proposed autocompletion feature combined with support for proximity search, semi-structured (XML) text, subword and phrase completion, and semantic tags.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing Methods; H.3.3 [Content Analysis and Indexing]: Retrieval Models; H.5.2 [User Interfaces]: Theory and Methods

General Terms
Algorithms, Design, Experimentation, Human Factors, Performance, Theory

Keywords
Autocompletion, Empirical Entropy, Index Data Structure

SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008.

1. INTRODUCTION
Autocompletion is a widely used mechanism to get to a desired piece of information quickly and with as little knowledge and effort as possible. One of its early uses was in the Unix shell, where pressing the tabulator key gives a list of all file names that start with whatever has been typed on the command line after the last space. Nowadays, we find a similar feature in most text editors, and in a large variety of browsing GUIs, for example, in file browsers, in the Microsoft Help suite, or when entering data into a web form. Recently, autocompletion has been integrated into a number of (web and desktop) search engines like Google Suggest or Apple's Spotlight. We discuss more applications in Section 1.2.

In the simpler forms of autocompletion, the list of completions is simply a range from a (typically precomputed) list of words. For the Unix shell, this is the list of all file names in all directories listed in the PATH variable. For the text editors, this is the list of all words entered into the file so far (and maybe also words from related files). In Google Suggest, completions appear to come from a precompiled list of popular queries. For these kinds of applications we can easily achieve fast response times by two binary or B-tree searches in the (pre)sorted list of candidate strings.

More advanced forms of autocompletion take into account the context in which the to-be-completed word has been typed. The problem we propose and discuss in this paper is of this kind. The formal problem definition will be given in Section 2. More informally, imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. All this should preferably happen in less time than it takes to type a single letter. For example, assume a user has typed conference sig. Promising completions might then be sigir, sigmod, etc., but not, for example, signature, assuming that, although signature by itself is a pretty frequent word, the query conference signature leads to only few good hits. See Figure 1 for a screenshot of our search engine responding to that query. For a live demo, see http://search.mpi-inf.mpg.de/wikipedia.

Figure 1: A screenshot of our search engine for the query conference sig searching the English Wikipedia. The list of completions and hits is updated automatically and instantly after each keystroke, hence the absence of any kind of search button. The number in parentheses after each completion is the number of hits that would be obtained if that completion were typed. Query words need not be completed, however, because the search engine does an implicit prefix search: if, for example, the user continued typing conference sig proc, completions and hits for proc, e.g., proceedings, would be selected from the 185 hits for conference sig.

1.1 Our results
We have developed a new indexing data structure, named HYB, which uses no more space than a state-of-the-art compressed inverted index, and which can respond to autocompletion queries as described above within a small fraction of a second, even for collection sizes in the Terabyte range. Our main competitor in this paper is the inverted index, referred to as INV in the following. Other data structures that could be directly applied to our problem either use a lot of space or have other limitations; we discuss these in Section 1.2.

We give a rigorous mathematical analysis of HYB and INV with respect to both space usage and query processing times. Our analysis accurately predicts the real behavior on our test collections. Concerning space usage, we define a notion of empirical entropy [11, 22], which captures the inherent space complexity of an index independent of a particular compression scheme. We prove that the empirical entropy of HYB is essentially equal to that of INV, and we find that the actual space usage of our implementation of the two index structures is indeed almost equal, for each of our three test collections. Concerning processing times, we give a precise quantification of the number of operations needed, from which we derive bounds for the worst-case, best-case, and average-case behavior of INV and HYB. We also take into account the different latencies of sequential and random access to data [1].

We compare INV and HYB on three test collections with different characteristics. One of our collections has been (semi-)publicly searchable over the last year, so that we have autocompletion queries from real users for it. Our largest collection is the TREC Terabyte benchmark with over 25 million documents [7]. On all three collections and on all the queries we considered, HYB outperforms INV by a factor of 15-20 in worst-case query processing time, and by a factor of 3-10 in average-case query processing time. In absolute terms, HYB achieves average query processing times of one tenth of a second or less on all collections, on a single machine and with the index on disk (and not in main memory).

We have built a full-fledged search engine that supports autocompletion queries of the described kind combined with support for proximity/phrase search, XML tags, subword and phrase completion, and category information.
All of these extensions are described in Section 4.

1.2 Related work
The autocompletion feature as described so far is reminiscent of stemming, in the sense that by stemming, too, prefixes instead of full words are considered [23]. But unlike stemming, our autocompletion feature gives the user feedback on which completions of the prefix typed so far would lead to highly ranked documents. The user can then assess the relevance of these completions to his or her search desire, and decide to (i) type more letters for the last query word, e.g., in the query from Figure 1, type i and r so that the query is then conference sigir, or to (ii) start with the next query word, e.g., type a space and then proc, or to (iii) stop searching because, e.g., the user was actually looking for one of the hits shown in Figure 1. There is no way to achieve this by a stemming preprocessing step, because there is no way to foresee the user's intent. This kind of user interaction is well known to improve retrieval effectiveness in a variety of situations [21].

While our autocompletion feature is for the purpose of finding information, autocompletion has also been employed for the purpose of predicting user input, for example, for typing messages with a mobile phone, for users with disabilities concerning typing, or for the composition of standard letters [6, 14, 20, 8, 15]. In [12], contextual information has been used to select promising extensions for a query. Paynter et al. have devised an interface with a zooming-in property on the word level, based on the identification of frequent phrases [18]. We get a related feature by the subword/phrase-completion mechanism described in Section 4.4.

Our autocompletion problem is related to but distinctly different from multi-dimensional range searching problems, where the collection consists of tuples (of some fixed dimension, for example, pairs of word prefixes), and queries ask for all tuples that match a given tuple of ranges [10, 2, 4, 13]. These data structures could be used for our autocompletion problem, provided that we were willing to limit the number of query words. For fast processing times, however, the space consumption of any of these structures is on the order of N^(1+d), where N is the size of an inverted index, and d > 0 grows (fast) with the dimension. For our autocompletion queries, we can achieve fast query processing times and space efficiency at the same time because we have the set of documents matching the part of the query before the last word already computed (namely when this part was being typed). In a sense, our autocompletion problem is therefore a 1 1/2-dimensional range searching problem.

Finally, there is a large variety of alternatives to the inverted index in the literature. We have considered those we are aware of with regard to their applicability to our autocompletion problem, but found them either unsuitable or inferior to the inverted index in that respect. For example, approaches that proceed document by document are bound to be slow due to a poor locality of access; in contrast, both INV and HYB are mostly scanning long lists; see Section 3. Signature files were found to be in no way superior (but significantly more complicated) to the inverted index in all major respects in [24]. Suffix arrays and related data structures address the issue of full substring search, which is not what we want here (but see Section 4.4); a direct application of a data structure like [11] would have the same efficiency problems as INV, whereas multi-dimensional variants like [10] require super-linear space, as explained above.
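The simple presorted-list lookup mentioned in the introduction (two binary searches) is also how the word range W of all completions of a prefix can be obtained. A minimal Python sketch under our own naming, not taken from the paper:

```python
import bisect

def word_range(vocabulary, prefix):
    """Return (lo, hi) such that vocabulary[lo:hi] are exactly the words
    starting with `prefix`. `vocabulary` must be sorted. The "\xff"
    sentinel assumes all word characters compare below U+00FF."""
    lo = bisect.bisect_left(vocabulary, prefix)
    hi = bisect.bisect_right(vocabulary, prefix + "\xff")
    return lo, hi
```

For example, on the sorted vocabulary ["conference", "sigir", "sigmod", "signature"], the prefix sig yields the range (1, 4), i.e., the three completions sigir, sigmod, signature.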
2. FORMAL PROBLEM DEFINITION AND DEFINITION OF EMPIRICAL ENTROPY
The following definition of our autocompletion problem takes neither positional information, nor ranking of the completions or of the documents into account. We will first, in Section 3, analyze our data structures for this basic setting. In Section 4, we then show how to generalize the data structures and their analysis to cope with positional information, ranking, and a number of other useful enhancements. This generalization will be straightforward.

Definition 1. An autocompletion query is a pair (D, W), where W is a range of words (all possible completions of the last word which the user has started typing) and D is a set of documents (the hits for the preceding part of the query). To process the query means to compute the subset W' ⊆ W of words that occur in at least one document from D, as well as the subset D' ⊆ D of documents that contain at least one of these words.

For our example conference sig, D is the set of all documents containing a word starting with conference (computed when the last letter of this word was typed), and W is the range of all words from the collection starting with sig. For queries with only a single word, e.g., confer, D is simply the set of all documents.

To analyze the inherent space complexity of INV and HYB independently of the specialties of a particular compression scheme, we introduce a notion of empirical entropy. Both INV and HYB are essentially a collection of (multi)sets and sequences. The following definition gives a natural notion of entropy for each such building block, and for arbitrary combinations of them (similar definitions have been made in [11, 22]). The reader might first want to skip the following definition and come back to it when it is first used in the analysis that follows.

Definition 2. We define empirical entropy for the following entities, where H(p_1, ..., p_l) = -Σ_{i=1}^l p_i · log2 p_i is the l-ary entropy function.

(a) For a subset of size n' with elements from a universe of size n, the empirical entropy is n · H(n'/n, 1 - n'/n) (include each element of the universe into the subset with probability n'/n), which is

  n' · log2(n/n') + (n - n') · log2(n/(n - n')).

(b) For a multisubset of size n' with elements from a universe of size n, the empirical entropy is (n + n') · H(n'/(n + n'), n/(n + n')) (consider a bit vector of size n + n', and let a bit be 0 with probability n'/(n + n') and 1 otherwise; the prefix sums at the 0-bits give the multisubset), which is

  n' · log2((n + n')/n') + n · log2((n + n')/n).

(c) For a sequence of n elements from a universe of size l, where the ith element occurs n_i times (n_1 + ··· + n_l = n), the empirical entropy is n · H(n_1/n, ..., n_l/n) (for each position, pick element i with probability n_i/n), which is

  n_1 · log2(n/n_1) + ··· + n_l · log2(n/n_l).

(d) For a collection of l entities with empirical entropies H_1, ..., H_l, the empirical entropy is simply H_1 + ··· + H_l.

3. INV, HYB, AND THEIR ANALYSIS
In this section we will describe INV and HYB, and analyze them with respect to their empirical entropy and their processing time for autocompletion queries according to Definition 1. Query processing times will be quantified in terms of all relevant parameters; from this we can easily derive worst-case, best-case, and average-case bounds. Our average-case bounds make simplifying assumptions on the distribution of words in the documents, but nevertheless turn out to predict the actual behavior quite well. Implementation issues and the actual performance of our implementations of INV and HYB will be discussed in Section 5. We briefly comment on index construction times in Section 3.3.
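The processing semantics of Definition 1 can be stated directly over sets. The following Python sketch is a naive reference implementation of those semantics only (our own code, not the INV or HYB algorithms analyzed in this paper):

```python
def process_query(D, W, inverted_lists):
    """Definition 1: given the candidate documents D and the word range W,
    return the completions W' that occur in at least one document of D,
    and the documents D' of D containing at least one such completion.
    `inverted_lists` maps each word to the set of documents containing it."""
    W_prime, D_prime = set(), set()
    for w in W:
        hits = D & inverted_lists.get(w, set())
        if hits:                # w occurs in some document of D
            W_prime.add(w)
            D_prime |= hits     # union of all non-empty intersections
    return W_prime, D_prime
```

On the running example, with D the hits for conference and W the words starting with sig, a completion like signature is dropped exactly when its inverted list is disjoint from D.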
3.1 The inverted index (INV)
The inverted index is the data structure of choice for most search applications: it is relatively easy to implement and extend by other features, it can be compressed well, it is very efficient for short queries, and it has an excellent locality of access [23]. In this paper, by INV we mean the following data structure: for each word, store the list of all (ids of) documents containing that word, sorted in ascending order. We do not consider enhancements such as skip pointers [17], which we would expect to give similar benefits for both INV and HYB, however at the price of an increased space usage.

In the following, we first estimate the inherent space efficiency (empirical entropy) of INV. We then analyze the time complexity of processing autocompletion queries with INV, and point out two inherent problems.

Lemma 1. Consider an instance of INV with n documents and m words, and where the ith word occurs in n_i distinct documents (so that n_1 + ··· + n_m is the total number of word-in-document pairs). Let H_inv be the empirical entropy according to Definition 2. Then

  H_inv ≤ Σ_{i=1}^m n_i · ( 1/ln 2 + log2(n/n_i) ),

and for all collections considered in this paper (where most n_i are much smaller than n) this bound is tight up to 2%.

Proof. According to Definition 2 (a) and (d), we have

  H_inv = Σ_{i=1}^m ( n_i · log2(n/n_i) + (n - n_i) · log2(n/(n - n_i)) ).

To prove the lemma, it suffices to observe that because 1 + x ≤ e^x for any real x,

  (n - n_i) · log2(n/(n - n_i)) = (n - n_i)/ln 2 · ln(1 + n_i/(n - n_i)) ≤ n_i/ln 2.

Lemma 1 tells us that if the documents in each list were picked uniformly at random, then a Golomb encoding of the gaps [23] from one document id to the next (for list i, the expected size of a gap would be n/n_i) would achieve a space usage very close to H_inv bits.
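The exact entropy expression and the bound of Lemma 1 can be compared numerically. A small Python sketch (the list lengths below are made-up toy values, not from the paper's collections):

```python
import math

def h_inv_exact(n, lengths):
    # Definition 2 (a) + (d): sum over all words of the subset entropy
    # of the word's inverted list (n_i documents out of n).
    return sum(ni * math.log2(n / ni) + (n - ni) * math.log2(n / (n - ni))
               for ni in lengths)

def h_inv_bound(n, lengths):
    # Lemma 1: H_inv <= sum_i n_i * (1/ln 2 + log2(n/n_i))
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in lengths)
```

For a single list with n_i = 5 out of n = 1000 documents (n_i much smaller than n), the bound exceeds the exact entropy by well under 2%, matching the tightness claim of the lemma.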
In our implementation, we opted to encode gaps with the Simple-9 encoding from [3], which is easy to implement, yet achieves very fast decompression speeds at the price of only a moderate loss in compression efficacy; details are reported in Section 5.

Lemma 2. With INV, an autocompletion query (D, W) can be processed in the following time, where D_w denotes the inverted list for word w:

  |D| · |W| + Σ_{w∈W} |D_w| + Σ_{w∈W} |D ∩ D_w| · log |W|.

Assuming that the elements of W, D, and the D_w are picked uniformly at random from the set of m words and the set of n documents, respectively, this bound has an expected value of

  |D| · |W| + (|W|/m) · N + (|D|/n) · (|W|/m) · N · log |W|.

Remark. By picking the elements of a set S at random from a set U, we mean that each subset of U of size |S| is equally likely for S. We are not making any randomness assumption on the sizes of W, D, and D_w above.

Proof sketch. The obvious way to use an inverted index to process an autocompletion query (D, W) is to compute, for each w ∈ W, the intersection D ∩ D_w. Then, W' is simply the set of all w for which the intersection was non-empty, and D' is the union of all (non-empty) intersections. The intersections can be computed in time linear in the total input volume Σ_{w∈W} (|D| + |D_w|). (There are asymptotically faster algorithms for the intersection of two lists [5], but in our experiments, we got the best results with the simple linear-time intersect, which we attribute to its compact code and perfect locality of access.) The union can be computed by a |W|-way merge, which requires on the order of log |W| time per element scanned. With the randomness assumptions, the expected size of D_w is N/m, and the expected size of |D ∩ D_w| is |D|/n · N/m.

Lemma 2 highlights two problems of INV. The first is that the term |D| · |W| can become prohibitively large: in the worst case, when D is on the order of n (i.e., the first part of the query is not very discriminative) and W is on the order of m (i.e., only few letters of the last query word have been typed), the bound is on the order of n · m, that is, quadratic in the collection size. The second problem is due to the required merging. While the volume Σ_{w∈W} |D ∩ D_w| will typically be small once the first query word has been completed, it will be large for the first query word, especially when only few letters have been typed. As we will see in Section 5, INV frequently takes seconds for some queries, which is quite undesirable in an interactive setting, and is exactly what motivated us to develop a more efficient index data structure.

3.2 Our new data structure (HYB)
The basic idea behind HYB is simple: precompute inverted lists for unions of words. Assume an autocompletion query (D, W), where the union of all lists for word range W has been precomputed. We would then get D' with a single intersection (of D with the precomputed list). However, from this precomputed list alone we can no longer infer the set W' of completions leading to a hit. Since W can be an arbitrary word range, it is also not clear which unions should be precomputed, especially when we do not want to use more space than an (optimally compressed) inverted index.

The analysis given in this section suggests the following approach: group the words in blocks so that the lengths of the inverted lists in each block sum to (approximately) c · n, for some constant c < 1 (we will later choose c ≈ 0.2). For each block, store the union of the covered inverted lists as a compressed multiset, using an effective gap-encoding scheme just as done for INV (repetitions of the same element in the multiset correspond to a gap of zero). In parallel to each multiset, for each element x store the id of the word that led to the inclusion of (this occurrence of) x in the multiset. This gives a sequence of word ids, the length of which is exactly the size of the multiset. Encode these word ids with code length (approximately) log2((n_1 + ··· + n_l)/n_i) for the ith word, where n_i is the number of documents containing the ith word, and l is the number of words in the respective block.

Here is an example. Let one of the blocks comprise four words A, B, C, and D, with inverted lists

  A: 3, 5, 6, 8, 9, 11, 12, 15
  B: 5, 11
  C: 3, 7, 11, 13
  D: 3, 8

We would then like to store, in compressed form, the multiset (of document ids) and the sequence (of word ids)

  3 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
  A C D A B A C A D A  A  B  C  A  C  A

The optimal encoding of the words A, B, C, D would use code lengths log2(16/8) = 1, log2(16/2) = 3, log2(16/4) = 2, log2(16/2) = 3, respectively, for example A = 0, B = 110, C = 10, D = 111. An optimal encoding of the four gaps 0, 1, 2, 3 that occur in the above multiset of document ids would be 0, 10, 110, 111, respectively. What we actually store are then the two bit vectors (where the | are solely for better readability; the codes in this example are prefix-free)

  111|0|0|110|0|10|10|10|0|10|110|0|0|10|10|110
  0|10|111|0|110|0|10|0|111|0|0|110|10|0|10|0

Note that due to the two different encodings the two lists end up having different lengths in compressed form, and this is also what will happen in reality.

The following analysis will make very clear that (i) one should choose blocks of equal list volume (and not, for example, of equal number of words), (ii) this volume should be a small but substantial fraction of the number of documents (and neither smaller nor larger), and (iii) the lists of document ids should be "gap-encoded" while the lists of word ids should be "entropy-encoded". As for the space usage, we will first derive a very tight estimate of the entropy of HYB, and then show that, somewhat surprisingly, if we only choose the block volume to be a small enough fraction of the number of documents, the entropy of HYB is almost exactly that of INV. We will then show how HYB, when the blocks are chosen of sufficiently large volume, can be used to process autocompletion queries in time linear in the number of documents, for any reasonable word range. Since HYB essentially scans long lists, without the need for any merging, except when the word range is huge, it also has an excellent locality of access.

Lemma 3. Consider an instance of HYB with n documents and m words, where the ith word occurs in n_i documents, and where for each block the sum of the n_i with i from that block is c · n, for some c > 0. Then the empirical entropy H_hyb, defined according to Definition 2, satisfies

  H_hyb ≤ Σ_{i=1}^m n_i · ( (1 + c/2)/ln 2 + log2(n/n_i) ),

and the bound is tight as c → 0.

Proof. Consider a fixed block of HYB, and let n_i denote the number of documents containing the ith word belonging to that block. Throughout this proof, let Σ_i n_i denote the sum over all these n_i (so that the sum over all Σ_i n_i from all blocks gives the Σ_{i=1}^m n_i from the lemma). According to Definition 2 (b), (c), and (d), the empirical entropy of this block is then

  Σ_i n_i · log2((n + Σ_j n_j)/(Σ_j n_j)) + n · log2((n + Σ_j n_j)/n) + Σ_i n_i · log2((Σ_j n_j)/n_i).

Now adding the first and the last term, the arguments of the logarithms partially cancel out (!), and we get

  Σ_i n_i · log2((n + Σ_j n_j)/n_i) + n · log2((n + Σ_j n_j)/n).

Now using that, by assumption, Σ_i n_i = c · n, we obtain

  Σ_i n_i · ( (1 + 1/c) · log2(1 + c) + log2(n/n_i) ).

Since (1 + 1/c) · ln(1 + c) ≤ 1 + c/2 for all c > 0 (not obvious, but true), we can upper bound this (tightly, as c → 0) by

  Σ_i n_i · ( (1 + c/2)/ln 2 + log2(n/n_i) ).

This bounds the empirical entropy of a single block of HYB (the sum goes over all words from that block). Adding this over all blocks gives us the bound claimed in the lemma.

Comparing Lemma 3 with Lemma 1, we see that if we let the blocks of HYB be of volume at most c · n, for some small fraction c, then the empirical entropy of HYB is essentially that of an inverted index. In Section 4.2, we will see that when we take positional information into account, the empirical entropy of HYB actually becomes less than that of INV, for any choice of block volumes.

In our implementation of HYB, we compress the lists of document ids by a Simple-9 encoding of the gaps, just as described for INV above. For the lists of word ids, entropy-optimal compression could be achieved by arithmetic encoding [23], but for efficiency reasons, we compress word ids as follows: assuming that the word frequencies in a block have a Zipf-like distribution, it is not hard to see that a universal encoding with about log2 x bits for number x [17] of the ranks of the words, if sorted in order of descending frequency, is entropy-optimal, too. We again opted for Simple-9 encoding of these ranks, which gives us a reasonable compression and very fast decompression speed, without the need for any large codebook. We take block sizes as n/5, but also take word/prefix boundaries into account such that frequent prefixes like pro, com, the get a block on their own. This is to avoid that a query unnecessarily spans more than one block.

Lemma 4. Using HYB with blocks of volume N, autocompletion queries (D, W) can be processed in the following time, where D_w is the inverted list for word w:

  Σ_{w∈W} |D_w| · (1 + |D|/N) + Σ_{w∈W} |D ∩ D_w| · log( Σ_{w∈W} |D_w| / N ).

For N = Ω(n) and |W| ≤ m · n/N, and assuming that the elements of D, D_w, and W are picked uniformly at random from the set of all n documents or all m words, respectively, the expected processing time is bounded by O(n).

Proof sketch. According to Definition 1, we have to compute, given (D, W), the set W' of words from W contained in documents from D, as well as the set D' of documents containing at least one such word. For each block B, a straightforward intersection of the given D with the list of document-word pairs from B gives us the set W'_B of all words from W from block B, as well as the set D'_B of all documents from D which contain a word from B. From these, D' can be computed by a k-way merge, where k is the number of blocks that contain a word from W, and W' can be computed by a simple linear-time sort into |W| buckets (because W is a range). The number k of blocks is about Σ_{w∈W} |D_w| / N, which is O(1) in expectation, given the randomness assumptions stated in the lemma.

3.3 Index construction time
While getting from a collection of documents (files) to INV is essentially a matter of one big external sort [23], HYB does not require a full inversion of the data. For our experiments, however, we built the compressed indices for both INV and HYB from an intermediate fully inverted text version of the collection, which takes essentially the same time for both.

4. EXTENSIONS
In this section, we describe a number of extensions of the basic autocompletion facility we have described and analyzed so far. The first (ranking) is essential for practical usability, the second (proximity search) greatly widens the spectrum of search tasks for which autocompletion can be useful, and the others (support for XML tags, subword and phrase completion, and semantic tags) give advanced search facilities to the expert searcher.

4.1 Ranking
So far, we have considered the following problem (from Definition 1): while the user is typing a query, compute after each keystroke the list of all completions of the last query word that lead to at least one hit, as well as the list of all hits that would be obtained by any of these completions. In practice, only a selection of items from these lists can and will be presented to the user, and it is of course crucial that the most relevant completions and hits are selected.

A standard approach for this task in ad-hoc retrieval is to have a precomputed score for each word-in-document pair, and when a query is being processed, to aggregate these scores for each candidate document, and return documents with the highest such aggregated scores [23]. Both INV and HYB can be easily adapted to implement any such scoring and aggregation scheme: store by each word-in-document pair its precomputed score, and when intersecting, aggregate the scores. A decision has to be made on how to reconcile scores from different completions within the same document. We suggest the following: when merging the intersections (which gives the set D' according to Definition 1), compute for each document in D' the maximal score achieved for some completion in W' contained in that document, and compute for each completion in W' the maximal score achieved for a hit from D' achieved for this completion.

Asymptotically, the inclusion of ranking does not affect the time bounds derived in Lemmas 2 and 4, and our experiments show that ranking never takes more than half of the total query processing time; see Section 5.4. The increase in space usage depends on the selected scoring scheme, and is the same for INV and HYB. It is for these reasons that we factored out the ranking aspect from our basic Definition 1 and from our space and time complexity analysis in Section 3.

4.2 Proximity/Phrase searches
With a properly chosen scoring function, such as BM25, mere ranking by score aggregation often gives very satisfactory precision/recall behavior [19].
There are many queries, however, where the decisive cue on whether a particular document is relevant or not lies in the fact whether certain of the query words occur close to each other in that document. See [16] for a recent p ositive result on the use of proximity information in ad-hoc retrieval. Our autocompletion feature increases the b enefits of a proximity op erator, b ecause the use of this op erator will strongly narrow down the list of completions displayed to the user, which in turn makes it easier for the user to filter out irrelevant completions. For example, when searching the Wikip edia collection the most relevant completion for the non-proximity query max pl would b e place (b ecause max and place are b oth frequent words), but for the proximity query max..pl it is planck. Here the two dots .. indicate that words should occur within x words of each other, for some user-definable parameter x. It is not hard to extend b oth INV and HYB to supp ort proximity search: in the document lists (INV and HYB) as well as in the word lists (HYB only), we duplicate each entry as many times as it occurs in the corresp onding document, and store the p ositions in a parallel array of the same size. Word and document lists are compressed just as b efore, and the lists of p ositions are gap-encoded by Simple-9, just like the lists of document ids. The intersection routine is adapted to consider a proximity window as an additional parameter. As we will see in Section 5.3, the p osition lists increase the index size by a factor of 4-5, for b oth INV and HYB (without any kind of stopword removal). We can extend our analysis from Section 3 to predict this factor as follows. If we replace ni , the numb er of documents containing the ith word, by Ni , the total numb er of occurrences of the ith words, and n, the numb er of documents, by N , the total numb er of word occurrences, we can show that (details omitted) Hhy b Hinv m X i=1 meaningful subwords and phrases. 
An example involving a subword: for the query normal..vec we might want to see eigenvector as one of the relevant completions. An example involving a phrase: for the query max planck we might want to see the phrase max planck institute as one of the relevant completions. It is not hard to see that the autocompletion according to Definition 1 will automatically provide this feature if only we add the corresponding subwords/phrases to the index.

4.5 Category information

Our autocompletion feature can be combined with a number of other technologies that enhance the semantics of a corpus. To give just one more example here, assume we have tagged all conference names in a collection. Then assume we duplicate all conference names in the index, with an added prefix of, say, conf:, e.g., conf:sigir. By the way our autocompletion works, for the query seattle conf: we would then get a list of all names of conferences that occur in documents that also mention Seattle.

5. EXPERIMENTS

We implemented both INV and HYB in compressed format, as described in Sections 3.1 and 3.2. Each index is stored in a single file with the individual lists concatenated and an array of list offsets at the end. The vocabulary (which is the same for INV as for HYB) is stored in a separate file. All our code is in C++. All our experiments were run on a Dual Opteron machine, with 2 Intel Xeon 3 GHz processors, 8 GB of main memory, and running Linux. We ensured that the index was not cached in main memory.

5.1 Test collections

We compared the performance of INV and HYB on three collections of different characteristics. The first collection is a mailing-list archive plus several encyclopedias on homeopathic medicine (www.homeonet.org). This collection has been searchable via our engine over the past year by an audience of several hundred people. The second collection consists of the complete dumps of the English and German Wikipedia from December 2005 (search.mpi-inf.mpg.de/wikipedia).
The third collection is the large TREC Terabyte collection [7], which served as a stress test for our index structures (and for the authors as well). Details about all three collections are given in Table 1, where the "Raw size" of a collection is the total size of the original, uncompressed files in their original formats (e.g., HTML or PDF). Here Hinv and Hhyb denote the empirical entropy of INV and HYB, respectively, with positional information. That is, with positional information, HYB is always more space-efficient than INV, irrespective of how we divide into blocks. It can be shown that Ni · log2(N/Ni) ≤ 2 · Ni · log2(n/ni), and since on average a word occurs about 2-3 times in a document, this is just 4-5 times ni · log2(n/ni), which was the corresponding term in the entropy bound for INV or HYB without positional information (Lemmas 1 and 3).

5.2 Queries

For the Homeopathy collection, we picked 5,732 maximal queries (that is, queries which are not a true prefix of another query) from a fixed time slice of our query log for that collection. From each of these maximal queries, a sequence of autocompletion queries was generated by "typing" the query from left to right, with a minimal prefix length of 3. In this way, for example, the maximal query acidum phos gives rise to the 6 autocompletion queries aci, acid, acidu, acidum, acidum pho, and acidum phos. For the Wikipedia collection, autocompletion queries were generated in the same manner from a set of 100 randomly generated queries, with a distribution of the number of query words and of the term frequency similar to that of the real queries for the Homeopathy collection. For the Terabyte collection, autocompletion queries were generated in again the same way, but with a minimal prefix length of 4, from the (stemmed) 50 ad-hoc queries of the Robust Track Benchmark [7], e.g., squirrel control protect.
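The generation of autocompletion queries from a maximal query, as just described, can be sketched as follows (a minimal Python sketch; the function name is ours):

```python
# "Type" a maximal query from left to right (Section 5.2): every word
# contributes one autocompletion query per prefix of at least min_prefix_len
# characters, with all earlier words fully typed.

def typing_sequence(maximal_query, min_prefix_len=3):
    words = maximal_query.split()
    queries = []
    for k, word in enumerate(words):
        typed_so_far = " ".join(words[:k])
        for length in range(min_prefix_len, len(word) + 1):
            prefix = word[:length]
            queries.append((typed_so_far + " " + prefix).strip())
    return queries
```

For acidum phos this yields exactly the 6 queries from the text: aci, acid, acidu, acidum, acidum pho, and acidum phos.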
For all three collections, we removed queries containing words that had no completion at all in the respective collection.

4.3 Semistructured (XML) text

Many documents contain semantic information in the form of tag pairs. We briefly sketch how we can make good use of such tags in our autocompletion scenario. Assume that in the archive of a mailing list, the subject of a mail is enclosed in a <subject> ... </subject> tag pair. We can then easily implement an operator = (in a way very similar to our implementation of the proximity operator), such that for a query subject=sig only those completions of sig are displayed which actually occur in the subject line of a mail, and only such documents are displayed as hits.

4.4 Completion to subwords and phrases

Another simple yet often useful extension to the basic autocompletion feature is to consider as potential matches not only the words as they occur in the collection, but also meaningful subwords and phrases. For Homeopathy and Wikipedia, all queries were run as proximity queries (using a full positional index according to Section 4.2), while for Terabyte, they were executed as ordinary document-level queries. For all collections, both completions and hits were ranked as described in Section 4.1 (details of the aggregation function are omitted here). Each autocompletion query was processed according to Definition 1; e.g., for acidum pho, we compute all completions of pho that occur in a document which also contains a word starting with acidum, as well as the set of all such documents. The result for each autocompletion query is remembered in a history, so that we do not need to recompute the set of documents matching the first part of the query. E.g., when processing acidum pho, we can take the set of documents matching acidum from the history; see the explanation following Definition 1. The autocompletion queries with minimal prefix lengths, like aci and acidum pho for Homeopathy, are the most difficult ones.
All other queries can be easily processed by what we call filtering. For example, both the completions and hits for the query acidum phos can be obtained by retrieving the list of matching word-in-document pairs for the previously processed query acidum pho from the history, and by filtering out, in a linear scan over that list, all those pairs where the word does not start with phos. In practice, this is always faster than processing such queries as full autocompletion queries according to Definition 1. Note that this filtering is identical for INV and HYB. We nevertheless include the filtered queries in our experiments, because in reality we will always get a mix of both kinds of queries. Table 3 will provide figures for just the difficult (unfiltered) queries. We remark that the history is useful for caching purposes as well, but in our experiments we used it solely for the purpose of filtering.

Collection          Homeopathy         Wikipedia           Terabyte
Raw size            452 MB             7.4 GB              426 GB
#documents          44,015             2,866,503           25,204,103
#words              263,817            6,700,119           25,263,176
#items              12 [27] million    0.3 [0.8] billion   3.5 billion
Vocabulary          2.9 MB             73 MB               239 MB
Entropy             6.6 [13.1] bits    9.1 [14.0] bits     8.4 bits
INV index size      13 [70] MB         0.5 [2.2] GB        4.6 GB
 - per item         9.3 [21.5] bits    12.8 [23.2] bits    11.0 bits
HYB index size      14 [62] MB         0.5 [2.0] GB        4.9 GB
 - per item         9.4 [19.2] bits    13.0 [20.7] bits    11.6 bits
 - per doc          3.9 [15.4] bits    4.3 [14.8] bits     5.9 bits
 - per word         5.5 [3.8] bits     8.7 [5.9] bits      5.7 bits

Table 1: Properties of our three test collections, and the space consumption of INV versus HYB. The entries in square brackets are for a full positional index, without any word whatsoever removed.
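The filtering step described above amounts to a single linear scan; here is a minimal Python sketch (names are ours), where history_pairs is the list of matching word-in-document pairs stored in the history for the shorter prefix:

```python
# Filtering (Section 5.2): answer a longer prefix query from the history
# entry of the shorter prefix by one linear scan, keeping exactly those
# word-in-document pairs whose word starts with the longer prefix.

def filter_by_prefix(history_pairs, longer_prefix):
    return [(word, doc) for word, doc in history_pairs
            if word.startswith(longer_prefix)]
```

E.g., the pairs stored for acidum pho are filtered with phos to answer acidum phos; the surviving pairs directly yield both the completions and the hits.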
Collection    Method   mean    90%     99%     max
Homeopathy    INV      0.033   0.038   0.384   12.27
              HYB      0.003   0.008   0.026   0.065
Wikipedia     INV      0.171   0.143   2.272   22.66
              HYB      0.055   0.158   0.492   1.085
Terabyte      INV      0.581   0.545   16.83   28.84
              HYB      0.106   0.217   0.865   1.821

Table 2: Average, 90%-ile, 99%-ile and maximum processing times in seconds for INV versus HYB on our three test collections.

5.3 Index space

Table 1 shows that INV and HYB use essentially the same space on all three test collections, and that HYB is slightly more compact than INV for a full positional index. This is exactly what Lemmas 1 and 3, and the derivation in Section 4.2, predicted! The sizes for both INV and HYB exceed those predicted by the empirical entropy by about 50%. This is due to our use of the Simple-9 compression scheme, which trades very fast decompression time for about this increase in space usage [3]. A combination of Golomb and arithmetic encoding would give us a space usage closer to the empirical entropy. However, decompression would then become the computational bottleneck for almost all queries; see Table 3. We remark that, by the way we did our analysis, any new compression scheme with an improved compression ratio/decompression speed profile would immediately yield a corresponding improvement for both INV and HYB.

5.4 Query processing time

Table 2 shows that in terms of query processing time, HYB outperforms INV by a large margin on all collections. With respect to maximum processing time, which is especially critical for an interactive application, the improvement is by a factor of 15-20. With respect to average processing time, which is critical for throughput in a high-load scenario, the improvement is by a factor of 3-10. Table 3 gives interesting insights into where exactly INV loses against HYB. The table shows a breakdown of the running times of those queries for the Terabyte collection which were not answered by filtering as discussed above. (Note that the breakdown of the filtered queries would be identical for both methods.) The table differentiates between 1-word queries like squi, squir, etc. and multi-word queries like squirrel contr or squirrel control prot. For the 1-word queries, no intersections have to be computed for either INV or HYB. According to Lemma 2, the merging of the intersections then dominates for INV, and this indeed shows in the first column of Table 3. For multi-word queries, the result volume Σ_{w∈W} |D ∩ Dw| (Lemmas 2 and 4) goes down, and, according to Lemma 2, the intersection costs dominate for INV, which shows in the third column of Table 3. In contrast, columns two and four demonstrate that HYB achieves a better balance of the costs for reading, uncompressing, and intersecting, and none of these essential operations becomes the bottleneck. HYB avoids merging altogether since, by construction, the potential completions from the given word range W always lie within a single block. The read time of HYB is about 50% larger than that of INV, because HYB always reads a whole block of size Θ(n), even for small word ranges. This also partially explains why HYB spends more time decompressing than INV; the other factor is that decompression of the word ids is more expensive than decompression of the document ids. As we remarked in Section 4.1, the absolute time for ranking is the same for both methods. Ranking takes more time on average for the 1-word queries, because these tend to have larger result sets. The comparison with the time needed for the maintenance of the history, which is nothing but memory allocation and copying, shows that all of HYB's operations are essentially fast list scans.
Query size      1-word                    multi-word
Index type      INV          HYB          INV           HYB
average time    0.340 secs   0.244 secs   2.404 secs    0.207 secs
read            .024 ( 7%)   .032 (13%)   .032 ( 1%)    .045 (22%)
decompress      .011 ( 3%)   .048 (20%)   .023 ( 1%)    .045 (22%)
intersect       --           --           2.27 (94%)    .030 (14%)
merging         .165 (48%)   --           .010 (0.4%)   --
ranking         .081 (24%)   .083 (34%)   .007 (0.3%)   .007 ( 3%)
history         .058 (17%)   .082 (32%)   .062 ( 3%)    .079 (38%)

Table 3: Breakdown of average processing times for INV and HYB, for the difficult (unfiltered) queries on Terabyte.

6. CONCLUSIONS

We have introduced an autocompletion feature for full-text search, and presented a new compact indexing data structure for supporting this feature with very fast response times. We have built a full-fledged search engine around this feature, and we have given arguments why we believe it to be practically useful. Given the interactivity of this engine, the next logical step following this work would be to conduct a user study to verify that belief. We also see potential for a further speed-up of query processing time by applying techniques from top-k query processing [9], in order to display the most relevant hits and completions without first computing and ranking all of them.

7. ACKNOWLEDGEMENTS

Many thanks to our mentor David Grossman for his encouragement and many valuable comments.

8. REFERENCES

[1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, 1988.
[2] S. Alstrup, G. S. Brodal, and T. Rauhe. New data structures for orthogonal range searching. In 41st Symposium on Foundations of Computer Science (FOCS'00), pages 198-207, 2000.
[3] V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8:151-166, 2005.
[4] L. Arge, V. Samoladas, and J. S. Vitter. On two-dimensional indexability and optimal range search indexing. In 18th Symposium on Principles of Database Systems (PODS'99), pages 346-357, 1999.
[5] R. Baeza-Yates.
A fast set intersection algorithm for sorted sequences. Lecture Notes in Computer Science, 3109:400-408, 2004.
[6] S. Bickel, P. Haider, and T. Scheffer. Learning to complete sentences. In 16th European Conference on Machine Learning (ECML'05), pages 497-504, 2005.
[7] C. L. A. Clarke, N. Craswell, and I. Soboroff. The TREC terabyte retrieval track. SIGIR Forum, 39(1):25, 2005.
[8] J. J. Darragh, I. H. Witten, and M. L. James. The reactive keyboard: A predictive typing aid. IEEE Computer, pages 41-49, 1990.
[9] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614-656, 2003.
[10] P. Ferragina, N. Koudas, S. Muthukrishnan, and D. Srivastava. Two-dimensional substring indexing. Journal of Computer and System Science, 66(4):763-774, 2003.
[11] P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
[12] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. In 10th International World Wide Web Conference (WWW10), pages 406-414, 2001.
[13] V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170-231, 1998.
[14] K. Grabski and T. Scheffer. Sentence completion. In 27th Conference on Research and Development in Information Retrieval (SIGIR'04), pages 433-439, 2004.
[15] M. Jakobsson. Autocompletion in full text transaction entry: a method for humanized input. In Conference on Human Factors in Computing Systems (CHI'86), pages 327-323, 1986.
[16] D. Metzler, T. Strohman, H. Turtle, and W. B. Croft. Indri at TREC 2004: Terabyte track. In 13th Text Retrieval Conference (TREC'04), 2004.
[17] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349-379, 1996.
[18] G. W. Paynter, I. H. Witten, S. J. Cunningham, and G. Buchanan. Scalable browsing for large collections: A case study.
In 5th Conference on Digital Libraries (DL'00), pages 215-223, 2000.
[19] S. E. Robertson, S. Walker, M. M. Beaulieu, M. Gatford, and A. Payne. Okapi at TREC-4. In 4th Text Retrieval Conference (TREC'95), pages 73-96, 1995.
[20] T. Stocky, A. Faaborg, and H. Lieberman. A commonsense approach to predictive text entry. In Conference on Human Factors in Computing Systems (CHI'04), pages 1163-1166, 2004.
[21] E. M. Voorhees. Query expansion using lexical-semantic relations. In 17th Conference on Research and Development in Information Retrieval (SIGIR'94), pages 171-180, 1994.
[22] H. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3):193-201, 1999.
[23] I. H. Witten, T. C. Bell, and A. Moffat. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition. Morgan Kaufmann, 1999.
[24] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4):453-490, 1998.