The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables

Bernard Chazelle*   Joe Kilian†   Ronitt Rubinfeld†   Ayellet Tal§

* Princeton University and NEC Laboratories America, chazelle@cs.princeton.edu
† NEC Laboratories America, {joe|ronitt}@nec-labs.com
§ Technion and Princeton University, ayellet@ee.technion.ac.il
This work was supported in part by NSF grant CCR-998817, ARO Grant DAAH04-96-1-0181, and NEC Laboratories America.

"Oh boy, here is another David Nelson"
    Ticket Agent, Los Angeles Airport

Abstract

We introduce the Bloomier filter, a data structure for compactly encoding a function with static support in order to support approximate evaluation queries. Our construction generalizes the classical Bloom filter, an ingenious hashing scheme heavily used in networks and databases, whose main attribute, space efficiency, is achieved at the expense of a tiny false-positive rate. Whereas Bloom filters can handle only set membership queries, our Bloomier filters can deal with arbitrary functions. We give several designs varying in simplicity and optimality, and we provide lower bounds to prove the (near) optimality of our constructions.

1 Introduction

A widely reported news story¹ describes the current predicament facing air passengers with the name of David Nelson, most of whom are being flagged for extra security checks at airports across the United States: "If you think security at airports is tight enough already, imagine your name popping up in airline computers with a red flag as being a possible terrorist. That's what's happening to David Nelsons across the country." The problem is so bad that many David Nelsons have stopped flying altogether. Although the name David Nelson raises a red flag, security officials won't say if there is a terror suspect by that name. "Transportation Security Administration spokesman Nico Melendez said the problem was due to name-matching technology used by airlines." (Source: BBC News)

¹ http://news.bbc.co.uk/2/hi/americas/2995288.stm, http://www.kgun9.com/story.asp?TitleID=3201&...ProgramOption=News

This story illustrates a common problem that arises when one tries to balance false negatives and false positives: if one is unwilling to accept any false negatives whatsoever, one often pays with a high false-positive rate. Ideally, one would like to adjust one's system to fix particularly troublesome false positives while still avoiding the possibility of a false negative (eg, one would like to make life easier for the David Nelsons of the world without making life easier for Osama Bin Laden). We consider these issues for the more prosaic example of Bloom filters, described below.

Historical background   Bloom filters yield an extremely compact data structure that supports membership queries to a set [1]. Their space requirements fall significantly below the information-theoretic lower bounds for error-free data structures. They achieve their efficiency at the cost of a small false-positive rate (items not in the set have a small constant probability of being listed as in the set), but have no false negatives (items in the set are always recognized as being in the set). Bloom filters are widely used in practice when storage is at a premium and an occasional false positive is tolerable. They have many uses in networks [2]: for collaborating in overlay and peer-to-peer networks [5, 8, 17], resource routing [15, 26], packet routing [12, 30], and measurement infrastructures [9, 29]. Bloom filters are used in distributed databases to support iceberg queries, differential file access, and to compute joins and semijoins [7, 11, 14, 18, 20, 24]. Bloom filters are also used for approximating membership checking of password data structures [21], web caching [10, 27], and spell checking [22].
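To make the discussion concrete, here is a minimal sketch (ours, not taken from the paper) of the classical Bloom filter just described. The class name BloomFilter, the seed parameter, and the use of salted SHA-256 digests to simulate the k hash functions are illustrative choices only.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array probed by k hash functions.
    False positives are possible, false negatives are not."""

    def __init__(self, m, k, seed=0):
        self.m, self.k, self.seed = m, k, seed
        self.bits = bytearray(m)          # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k pseudo-random positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{self.seed}:{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

# Example: a tiny watch list.
watch = BloomFilter(m=1024, k=4)
for name in ["alice", "bob"]:
    watch.add(name)
assert "alice" in watch               # never a false negative
print("carol" in watch)               # almost always False
```

The one-sided error of this structure is exactly the false-positive/false-negative asymmetry discussed above.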
Several variants of Bloom filters have been proposed. Attenuated Bloom filters [26] use arrays of Bloom filters to store shortest-path distance information. Spectral Bloom filters [7] extend the data structure to support estimates of frequencies. In counting Bloom filters [10], each entry in the filter is not a single bit but rather a small counter; insertions and deletions to the filter increment or decrement the counters, respectively. When the filter is intended to be passed as a message, compressed Bloom filters [23] may be used instead, where parameters can be adjusted to the desired tradeoff between size and false-positive rate.

We note that a standard technique for eliminating a very small number of troublesome false positives is to just keep an exception list. However, this solution does not scale well, both in lookup time and storage, when the list grows large (say, comparable to the number of actual positives).

This Work   A Bloom filter is a lossy encoding scheme for a set, or equivalently, for the boolean characteristic function of the set. While Bloom filters allow membership queries on a set, we generalize the scheme to a data structure, the Bloomier filter, that can encode arbitrary functions. That is, Bloomier filters allow one to associate values with a subset of the domain elements. The method performs well in any situation where the function is defined only over a small portion of the domain, which is a common occurrence. In our (fanciful) terrorist detection example, suspicious names would map to suspect and popular, non-suspicious names (eg, David Nelson) would map to sounds-suspicious-but-really-ok; meanwhile, all but a tiny fraction of the other names would map to ok. This third category is the only source of error. Bloomier filters generalize Bloom filters to functions while maintaining their economical use of storage. In addition, they allow for dynamic updates to the function, provided the support of the function remains unchanged.

Another application of Bloomier filters is to building a meta-database, ie, a directory for the union of a small collection of databases. The Bloomier filter keeps track of which database contains information about each entry, thereby allowing the user to jump directly to the relevant databases and bypass those with no relation to the specified entry.
Many such meta-databases already exist on the Web: for example, BibFinder, a Computer Science Bibliography Mediator which integrates both general and specific search engines; Debriefing, a meta search engine that uses results from other search engines; a meta-site for zip codes and postal codes of the world; etc. Bloomier filters can be used to maintain local copies of a directory in any situation in which data or code is maintained in multiple locations.

Our Results   Let f be a function from D = {0, ..., N-1} to R = {⊥, 1, ..., 2^r - 1}, such that f(x) = ⊥ for all x outside some fixed (arbitrary) subset S ⊆ D of size n. (We use the symbol ⊥ either to denote 0, in which case the function has support S, or to indicate that f is not defined outside of S.) Bloomier filters allow one to query f at any point of S always correctly and at any point of D \ S almost always correctly; specifically, for a random x ∈ D \ S, the output returns f(x) = ⊥ with probability arbitrarily close to 1. Bloomier filters shine especially when the size of D dwarfs that of S, ie, when N/n is very large. The query time is constant and the space requirement is O(nr); this compares favorably with the naive bound of O(Nr), the bound of O(nr log N) (which is achieved by merely listing the values of all of the elements in the set) and, in the 0/1 case, the O(n log(N/n)) bound achieved by the perfect hashing method of Brodnik and Munro [3]. (Of course, unlike ours, neither of these methods ever errs.) Bloomier filters are further generalized to handle dynamic updates. One can query and update function values in constant time while keeping the space requirement within O(nr), matching the trivial lower bound to within a constant factor. Specifically, for x ∈ S, we can change the value of f(x), though we cannot change S. We also prove various lower bounds to show that our results are essentially optimal. First we show that randomization is essential: over large enough domains, linear space is not enough for deterministic Bloomier filters. We also prove that, even in the randomized case, the ability to perform dynamic updates on a changing support (ie, adding/removing x to/from S) requires a data structure with superlinear space.

Our Techniques   Our first approach to implementing Bloomier filters is to compose an assortment of Bloom filters into a cascading pipeline. This yields a practical solution, which is also theoretically near-optimal. To optimize the data structure, we change tack and pursue, in the spirit of [4, 6, 19, 28], an algebraic approach based on the expander-like properties of random hash functions. As with Bloom filters, we assume that we can use "ideal" hash functions. We analyze our algorithms in this model; heuristically, one can use "practical" hash functions.
2 A Warmup: the Bloom Filter Cascade

We describe a simple, near-optimal design for Bloomier filtering based on a cascading pipeline of Bloom filters. For illustrative purposes, we restrict ourselves to the case R = {⊥, 1, 2}. Let A (resp. B) be the subset of S mapping to 1 (resp. 2). Note that the "obvious" solution, which consists of running the search key x through two Bloom filters, one for A and one for B, does not work: what do we do if both outputs contradict each other? One possible fix is to run the key through a sequence of Bloom filter pairs (F(A_i), F(B_i)), for i = 0, 1, ..., ℓ, and some suitable parameter ℓ. The first pair corresponds to the assignment A_0 = A and B_0 = B. Ideally, no key will pass the test for membership in both A and B, as provided by F(A_0) and F(B_0), but we cannot count on it. So, we need a second pair of Bloom filters, and then a third, a fourth, etc. (The idea of multiple Bloom filters appears in a different context in [7].) Generally, we define A_i to be the set of keys in A_{i-1} that pass the test in F(B_{i-1}); by symmetry, B_i is the set of keys in B_{i-1} that pass the test in F(A_{i-1}). In other words, A_i = A_{i-1} ∩ B̃_{i-1} and B_i = B_{i-1} ∩ Ã_{i-1}, where Ã_i and B̃_i are the sets of false positives for F(A_i) and F(B_i), respectively.

Given an arbitrary key x ∈ D, we run the test with respect to F(A_0) and then F(B_0). If one test fails and the other succeeds, we output 1 or 2 accordingly. If both tests fail, we output ⊥. If both tests succeed, however, we cannot conclude anything. Indeed, we may be faced with two false positives or with a single false positive from either A or B. To resolve these cases, we call the procedure recursively with respect to F(A_1) and F(B_1). Note that A_1 (resp. B_1) now plays the role of A (resp. B), while the new universe is A ∪ B. Thus, recursively computed outputs of the form 'in A_1', 'in B_1', 'not in A_1 ∪ B_1' are to be translated by simply removing the subscript 1.

For notational convenience, assume that |A| = |B| = n. Let n_i be the random variable max{|A_i|, |B_i|}. All filters use the same number of hash functions, which is a large enough constant k. The storage allocated for the filters, however, depends on their ranks in the sequence. We provide each of the Bloom filters F(A_i) and F(B_i) with an array of size 2^{k^i} k n_i. The number ℓ of Bloom filter pairs is the smallest i such that n_i = 0. A key in A_i ends up in A_{i+1} if it produces a false positive for F(B_i). This happens with probability at most

  (k|B_i| / (2^{k^i} k n_i))^k ≤ 2^{-k^{i+1}}.

This implies that a key in A belongs to A_i with probability at most 2^{-(k^{i+1}-k)/(k-1)}; therefore E[n_i] ≤ n · 2^{-(k^{i+1}-k)/(k-1)} and E[ℓ] ≤ 2 log log n / log k. The probability that a search key runs through the i-th filter is less than 2^{-k^i}, so the expected search time is constant. The expected storage used is equal to

  E[ Σ_{i=0}^{ℓ} 2^{k^i} k n_i ] ≤ k n Σ_{i=0}^{ℓ} 2^{-(k^i - k)/(k-1)} = O(k n).

Note that, if N is polynomial in n, we can stop the recursion when n_i is about n/log n and then use perfect hashing [3, 13]. This requires constant time and O(n) bits of extra storage. To summarize, with high probability a random set of hash functions provides a Bloomier filter with the following characteristics: (i) the storage is O(kn) bits; (ii) at most a fraction O(2^{-k}) of D produces false positives; and (iii) the search time is O(log log n) in the worst case and constant when averaged over all of D.
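The cascade just summarized can be sketched in a few lines. The sketch below is ours, not the paper's; it reuses the hypothetical BloomFilter class from the earlier sketch, follows the per-level sizing described above (with a cap so the example stays practical), and keeps the last residual sets exactly, as a stand-in for the perfect-hashing fallback.

```python
# Assumes the BloomFilter class from the earlier sketch is in scope.

def build_cascade(A, B, k=4):
    """Build the pipeline of pairs (F(A_i), F(B_i)) described above."""
    levels = []
    Ai, Bi = set(A), set(B)
    i = 0
    while Ai and Bi:
        ni = max(len(Ai), len(Bi))
        # The analysis allots 2^(k^i) * k * n_i bits per filter; capped here
        # so the illustrative sketch does not allocate huge arrays.
        m = min(2 ** (k ** i), 1 << 16) * k * ni
        FA, FB = BloomFilter(m, k, seed=2 * i), BloomFilter(m, k, seed=2 * i + 1)
        for x in Ai:
            FA.add(x)
        for x in Bi:
            FB.add(x)
        levels.append((FA, FB))
        # A_{i+1}: keys of A_i that are false positives for F(B_i); symmetrically for B.
        Ai, Bi = {x for x in Ai if x in FB}, {x for x in Bi if x in FA}
        i += 1
    return levels, (Ai, Bi)              # last residual sets are kept exactly

def cascade_lookup(x, levels, residuals):
    for FA, FB in levels:
        in_a, in_b = x in FA, x in FB
        if not in_a and not in_b:
            return None                  # bottom: x is (almost surely) outside S
        if in_a and not in_b:
            return 1
        if in_b and not in_a:
            return 2
        # both filters fired: defer to the next, smaller pair
    res_a, res_b = residuals
    return 1 if x in res_a else 2 if x in res_b else None

levels, residuals = build_cascade(A={"ann", "bob"}, B={"carol", "dave"})
print(cascade_lookup("ann", levels, residuals))   # 1
print(cascade_lookup("zoe", levels, residuals))   # usually None
```

A key of A or B can never be misclassified: at its level it always passes its own filter, so it either exits with the correct value or falls through to a smaller pair.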
3 An Optimal Bloomier Filter

Given a domain D = {0, ..., N-1}, a range R = {⊥, 1, ..., |R|-1}, and a subset S = {t_1, ..., t_n} of D, we consider the problem of encoding a function f: D → R such that f(t_i) = v_i for 1 ≤ i ≤ n and f(x) = ⊥ for x ∈ D \ S. Note that the function is entirely specified by the assignment A = {(t_1, v_1), ..., (t_n, v_n)}. For the purpose of constructing our data structure, we assume that the function values in R are encoded as elements of the additive group Q = {0,1}^q, with addition defined bitwise mod 2. As we shall see, the false-positive rate is proportional to 2^{-q}, so q must be chosen sufficiently large. Any x ∈ R is encoded by its q-bit binary expansion encode(x). Conversely, given y ∈ Q, we define decode(y) to be the corresponding number if it is less than |R| and ⊥ otherwise. We use the notation r = ⌈log |R|⌉.

Given an assignment A, we denote by A(t) the value A assigns to t, ie, A(t_i) = v_i. Let π be a total ordering on S. We write a >_π b to mean that a comes after b in π. We define π(i) to be the ith element of S in π; if i > j, then obviously π(i) >_π π(j).

For any triple (D, m, k), we assume the ability to select a random hash function hash: D → {1, ..., m}^k. This allows us to access random locations in a Bloomier filter table of size m.

Definition 3.1. Given hash as above, let hash(t) = (h_1, ..., h_k, M). We say that {h_1, ..., h_k} is the neighborhood of t, denoted N(t).

Bloomier filter tables store the assignment A, and are created by calling the procedure create(A)[m, k, q], where A denotes the assignment and (m, k, q) are the parameters chosen to optimize the implementation. For notational convenience, we will omit mention of these parameters when there is no ambiguity. Our ultimate goal is to create a one-sided error, linear space (measured in bits) data structure supporting constant-time table lookups. Specifically, we need to implement the following operations:

· create(A): Given an assignment A = {(t_1, v_1), ..., (t_n, v_n)}, create(A) sets up a data structure Table. The subdomain {t_1, ..., t_n} specified by A is denoted by S.

· set_value(t, v, Table): For t ∈ D and v ∈ R, set_value(t, v, Table) associates the value v with domain element t in Table. It is required that t be in S.

· lookup(t, Table): For t ∈ S, lookup(t, Table) returns the last value v associated with t. For all but a fraction ε of D \ S, lookup(t, Table) returns ⊥ (ie, certifies that t is not in S). For the remaining elements of D \ S, lookup(t, Table) returns an arbitrary element of R.

A data structure that supports only create and lookup is referred to as an immutable data structure. Note that although re-assignments to elements in S are made by set_value, no changes to S are allowed. Our lower bounds show that, if we allow S to be modified, then linear size (measured in bits) is impossible to achieve regardless of the query time. In other words, Bloomier filters provably rule out fully dynamic operations.

There are three parameters of interest in our constructions: the runtime of each operation, the space requirements, and the false-positive rate ε. The create operation runs in expected O(n log n) time (indeed, O(n) time, depending on the model) and uses O(n(r + log 1/ε)) space. The set_value and lookup operations run in O(1) time.

3.1 An Overview

We first describe the immutable data structure and later show how to use the same principles to construct a mutable version. The table consists of m q-bit elements, where m and q are implementation parameters. We denote by Table[i] ∈ {0,1}^q the ith q-bit value in Table. To look up the value v associated with t, we use a hash function hash to compute k locations (h_1, ..., h_k), where 1 ≤ h_i ≤ m, and a q-bit "masking value" M (used for reducing false positives). We then compute x = M ⊕ Table[h_1] ⊕ ... ⊕ Table[h_k], where ⊕ denotes the bit-wise exclusive-or operation.
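The following sketch (ours, not the paper's) spells out this lookup computation. make_hash is a hypothetical stand-in for the ideal hash function hash(t) = (h_1, ..., h_k, M), and decoding follows the convention above that any q-bit value of at least |R| represents ⊥.

```python
import hashlib

def make_hash(m, k, q, seed=0):
    """Hypothetical stand-in for the ideal hash: t -> ((h_1, ..., h_k), M)."""
    def h(t):
        digest = hashlib.sha256(f"{seed}:{t}".encode()).digest()
        while len(digest) < 8 * (k + 1):                  # stretch the digest if k is large
            digest += hashlib.sha256(digest).digest()
        words = [int.from_bytes(digest[8 * i:8 * i + 8], "big") for i in range(k + 1)]
        return [w % m for w in words[:k]], words[k] % (1 << q)
    return h

def lookup_immutable(t, table, h, r_size):
    """x = M xor Table[h_1] xor ... xor Table[h_k]; decode to R or bottom (None)."""
    positions, mask = h(t)
    x = mask
    for p in positions:
        x ^= table[p]
    return x if x < r_size else None     # decode: q-bit values >= |R| mean "not in S"
```

Filling in Table so that this expression decodes correctly for every t ∈ S is exactly the problem addressed next.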
There are two main issues to address. First, we must set the values of Table[i], for i = 1, ..., m, so that the decode operations yield the correct values for all t ∈ S. We need to show that with high probability a "random" solution works (for appropriate parameter settings), and furthermore we wish to compute the assignment efficiently, which we do by a simple greedy algorithm. Second, we must ensure that, for all but an expected ε fraction of t ∈ D \ S, the computed "image" in Q decodes to ⊥.

We set the table values using the following key technique. Given a suitable choice of m and k, we show that, with high probability, there is an ordering π on S and an order-respecting matching τ, defined as follows:

Definition 3.2. Let S be a set with a neighborhood N(t) defined for each t ∈ S. Let π be a complete ordering on the elements of S. We say that a matching τ respects (S, π, N) if (i) for all t ∈ S, τ(t) ∈ N(t), and (ii) if t_i >_π t_j, then τ(t_i) ∉ N(t_j). When the function hash (and hence N) is understood from the context, we say that τ respects π on S.

Given π and τ, we can, for t = π(1), ..., π(n), set the value v associated with t by setting Table[τ(t)]. By the order-respecting nature of τ, this assignment cannot affect any previously set values. We show the existence of good (π, τ) using the notion of lossless expanders [6, 28]. Our analysis implies that, with high probability (over hash), we can find (π, τ) in nearly linear time using a greedy algorithm.

To limit the number of false positives, we use the random mask M produced by hash(t). Because M is distributed uniformly and independently of any of the values stored in Table, when we look up t ∉ S, the resulting value is uniformly and independently distributed over {0,1}^q. If the size of R is small compared with the size of {0,1}^q, then with high probability this value will not encode a legal value of R, and we will detect that t ∉ S.

We make a mutable structure by using a two-table construction. We use the first table, Table_1, to encode τ(t) for each t ∈ S. We note that since N(t) has only k values, which may be computed from hash(t), τ(t) ∈ N(t) can be compactly represented by a number in {1, ..., k}. Now, it follows from the definitions that if t ≠ t' for t, t' ∈ S, then τ(t) ≠ τ(t'). Thus, we can simply store the value associated with t in Table_2[τ(t)]; the locations will never collide.

3.2 Finding a Good Ordering and Matching

We give a greedy algorithm that, given S and hash, computes a pair (π, τ) such that τ respects π on S. First, we consider how to compactly represent τ. Recall that hash(t) defines the k neighbors h_1, ..., h_k of t. Therefore, given hash, we can represent τ(t) ∈ N(t) by an element of {1, ..., k}. Thus, we define ι(t) such that τ(t) = h_{ι(t)}. With S = {t_1, ..., t_n}, we also use the shorthand ι_i = ι(t_i), from which ι = {ι_1, ..., ι_n}.

Our algorithm is based on the abundance of "easy matches."

Definition 3.3. Let m, k, hash be fixed, defining N(t) for t ∈ D, and let S ⊆ D. We say that a location h ∈ {1, ..., m} is a singleton for S if h ∈ N(t) for exactly one t ∈ S. We define tweak(t, S, hash) to be the smallest value j such that h_j is a singleton for S, where N(t) = (h_1, ..., h_k); tweak(t, S, hash) = ⊥ if no such j exists.

If tweak(t, S, hash) is defined, then it sets the value of ι(t) (and hence τ(t)), and t is easy to match. Note that this choice will not interfere with the neighborhood of any other element of S.
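Definition 3.3 translates directly into a few lines of code. The sketch below is ours and reuses the hypothetical make_hash stand-in from the earlier sketch as the source of neighborhoods; any other neighborhood oracle would do.

```python
from collections import Counter

# Assumes a neighborhood oracle like the make_hash stand-in from the earlier
# sketch: h(t) returns (positions, mask); only the positions are used here.

def singletons(S, h):
    """Locations lying in the neighborhood of exactly one element of S."""
    counts = Counter()
    for t in S:
        positions, _ = h(t)
        counts.update(set(positions))    # N(t) is a set: count each location once per element
    return {loc for loc, c in counts.items() if c == 1}

def tweak(t, S, h):
    """Smallest index j (1-based) such that h_j is a singleton for S
    (Definition 3.3), or None if no such index exists."""
    single = singletons(S, h)
    positions, _ = h(t)
    for j, loc in enumerate(positions, start=1):
        if loc in single:
            return j
    return None
```

The greedy procedure find_match described next (and given in Figure 1) repeatedly strips off all elements with a defined tweak and recurses on the rest.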
Let E denote the subset of S with "easy matches" of that sort, and let H = S \ E. We recursively find (π', τ') on H and extend (π', τ') to (π, τ) as follows. First, we put the elements of E at the end of the ordering, after the elements of H, so that if t ∈ E and t' ∈ H, then t >_π t' (the ordering of the elements within E can be arbitrary). Then we define τ to be the union of the matchings for H and E. It is immediate that τ respects π on S. We give the algorithm, find_match, in Figure 1. Note that it is not at all clear that find_match will succeed. We show that for m and k suitably large, and hash chosen at random, find_match will succeed with high probability.

3.3 Creating a Mutable Bloomier Filter

Given an ordering π on S, and a matching τ that respects π on S (given the neighborhoods defined by hash), we store values associated with any t ∈ S as follows. Given t ∈ S, τ gives a location L ∈ N(t) such that L is not in the neighborhood of any t' that appears before t in π. Furthermore, given hash(t), L has a compact description as an element ι ∈ {1, ..., k}. Finally, no other t' ∈ S (before or after t) has the same value of L.

We can construct an immutable table as follows. For t = π(1), ..., π(n), we compute the neighborhood N(t) = {h_1, ..., h_k} and mask M from hash(t). From τ(t) we obtain L ∈ N(t) with the above properties. Finally, we set Table[L] so that M ⊕ Table[h_1] ⊕ ... ⊕ Table[h_k] encodes the value v associated with t. By the properties of L given above, altering Table[L] cannot affect any of the elements whose associated values have already been put into the table. To retrieve the value associated with t, we simply compute

  x = M ⊕ Table[h_1] ⊕ ... ⊕ Table[h_k]

and see if x is a correct encoding of some value v ∈ R. If it is not, we declare that t ∉ S. Because M is random, so is x if t ∉ S; therefore, it is a valid encoding only with probability |R|/2^q.

In order to make a mutable table, we use the fact that each t ∈ S has a distinct matching value L, with a succinct representation ι ∈ {1, ..., k} (given hash(t)). We use the above technique to make an immutable table that stores, for each t, the value ι that can be used to recover its distinct matching value L. We then store any value associated with t in the Lth location of a second table. We give our final algorithms in Figures 2 and 3.
  find_match(hash, S)[m, k]
  1. E = ∅; ι = ∅
     For t_i ∈ S:
       If tweak(t_i, S, hash) is defined
         ι_i = tweak(t_i, S, hash)
         E = E ∪ {t_i}
     If E = ∅
       Return failure
  2. H = S \ E
     If H = ∅, let (π', τ') be empty
     Else recursively compute (π', τ') = find_match(hash, H)[m, k];
       if the recursive call returns failure, return failure
  3. For t_i ∈ E
       Add t_i to the end of π (ie, make t_i the largest element in π thus far)
     Return (π, ι = {ι_1, ..., ι_n})
     (where ι_i is determined, for t_i ∈ E, in Step 1, and for t_i ∈ H, via τ', in Step 2)

Figure 1: Given hash and S, find_match finds an ordering π on S and a matching τ on S that respects π.

  create(A = {(t_1, v_1), ..., (t_n, v_n)})[m, k, q]
  1. Uniformly choose hash: D → {1, ..., m}^k × {0,1}^q
     S = {t_1, ..., t_n}
     Create Table_1 to be an array of m elements of {0,1}^q
     Create Table_2 to be an array of m elements of R
     (the initial values for both tables are arbitrary)
     Put (hash, m, k, q) into the "header" of Table_1
     (we assume that these values may be recovered from Table_1)
  2. (π, ι) = find_match(hash, S)[m, k]
     If find_match(hash, S)[m, k] = failure
       Goto Step 1
  3. For t = π(1), ..., π(n)
       v = A(t) (ie, the value assigned by A to t)
       (h_1, ..., h_k, M) = hash(t)
       j = ι(t), L = h_j (ie, L = τ(t))
       Table_1[L] = encode(j) ⊕ M ⊕ (⊕_{1 ≤ i ≤ k, i ≠ j} Table_1[h_i])   (create a mutable table)
       Table_2[L] = v
  4. Return Table = (Table_1, Table_2)

Figure 2: Given an assignment A and parameters m, k, q, create creates a mutable data structure corresponding to A.

  lookup(t, Table = (Table_1, Table_2))
  1. Get (hash, m, k, q) from Table_1
     (h_1, ..., h_k, M) = hash(t)
     j = decode(M ⊕ Table_1[h_1] ⊕ ... ⊕ Table_1[h_k])
  2. If j is defined (ie, j ∈ {1, ..., k})
       L = h_j
       Return Table_2[L]
     Else Return ⊥

  set_value(t, v, Table = (Table_1, Table_2))
  1. Get (hash, m, k, q) from Table_1
     (h_1, ..., h_k, M) = hash(t)
     j = decode(M ⊕ Table_1[h_1] ⊕ ... ⊕ Table_1[h_k])
  2. If j is defined
       L = h_j
       Table_2[L] = v
       Return success
     Else Return failure

Figure 3: lookup looks up t in Table = (Table_1, Table_2). set_value sets the value associated with t ∈ S to v in Table = (Table_1, Table_2). Note that for set_value, t must be an element of S, as determined (by A) in the original create invocation.

4 Analysis of the Algorithm

The most technically demanding aspect of our analysis is in showing that for a random hash, and sufficiently large k and m, the find_match routine will with high probability find (π, τ) such that τ respects π on S. Once we have such a (π, τ), the analysis of our algorithms is straightforward.

Lemma 4.1. Assuming that find_match succeeded in create, then for t ∈ S, the value v returned by lookup(t, Table) will be the most recent v assigned to t by create or set_value.

Proof. When the assignment for t is first stored in Table, τ(t) generates a location L ∈ N(t), with a concise representation ι ∈ {1, ..., k}. By the construction, Table_1[L] is set so that

  M ⊕ (⊕_{Z ∈ N(t)} Table_1[Z])

is a valid representation for ι. We claim that the same value of ι (and hence L) is recovered by the lookup and set_value commands on input t. These routines recover ι by the same formula; it remains to verify that none of the operations causes this value to change. We observe that the lookup and set_value commands do not alter Table_1. The only indices of Table_1 subsequently altered by create are of the form τ(t') for elements t' processed after t, ie, t' >_π t. However, by the properties of τ, for such t' we have τ(t') ∉ N(t), so these changes to Table_1 cannot affect the recovered value of ι, and hence L.

Finally, we observe that all of the L are distinct. Suppose that t_1, t_2 ∈ S and t_1 ≠ t_2. Assume without loss of generality that t_1 >_π t_2. Then τ(t_1) ∉ N(t_2), but τ(t_2) ∈ N(t_2), so τ(t_1) ≠ τ(t_2). It follows that Table_2[L] is only altered when create and set_value associate a value to t, as desired.

Lemma 4.2. Suppose that Table is created using an assignment with support S. Then if t ∉ S,

  Pr[lookup(t, Table) = ⊥] ≥ 1 - k/2^q,

where the probability is taken over the coins of create, assuming that hash is a truly random hash function.

Proof. Since t ∉ S, the data structures were generated completely independently of the value of (h_1, ..., h_k, M) = hash(t). In particular, M is uniformly distributed over {0,1}^q, independent of anything else. Hence, the value of M ⊕ (⊕_{Z ∈ N(t)} Table_1[Z]) will be uniformly distributed over {0,1}^q. However, there are only k out of 2^q values encoding legitimate values of {1, ..., k}.

Thus, in the "ideal" setting, where hash is truly random, it suffices to set q = ⌈log(k/ε)⌉ to achieve a false-positive rate of ε. Of course, hash is generated heuristically, and is at best pseudorandom, so this analysis must be taken in context. We note that if hash is a cryptographically strong pseudorandom function, then the behavior of the data structure using hash will be computationally indistinguishable from that of using a truly random hash; however, functions with provable pseudorandomness properties are likely to be too complicated to be used in practice.
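As a small worked example of the parameter choices implied by the analysis (q = ⌈log₂(k/ε)⌉ from the argument above, and m = ckn table locations from Lemma 4.4 below), the following sketch computes the resulting table sizes. It is ours, and the particular constants k = 3 and c = 1.25 are illustrative only.

```python
import math

def bloomier_parameters(n, k=3, c=1.25, eps=0.01):
    """Sizes suggested by the analysis: m = c*k*n table locations and
    q = ceil(log2(k/eps)) bits per Table_1 entry, so that a decoded index
    is legitimate with probability at most k/2^q <= eps for t outside S."""
    m = math.ceil(c * k * n)
    q = math.ceil(math.log2(k / eps))
    return m, q, m * q                   # m, q, and Table_1 size in bits

# Example: one million keys, 1% false-positive target.
m, q, bits = bloomier_parameters(10**6, k=3, c=1.25, eps=0.01)
print(m, q, bits / 8 / 2**20, "MiB for Table_1 (plus m entries of R in Table_2)")
```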
4.1 Analyzing find_match

It remains to show that for any S, with reasonable probability (constant probability suffices for our purposes), find_match will indeed find a pair (π, τ) such that τ respects π on S. Recall that find_match works by finding an "easy" set E, solving the problem recursively for H = S \ E, and then combining the solutions for E and H to get a solution for S. To show that find_match terminates, it suffices to show that for any subset A of S, find_match will find a nonempty "easy" set E_A, implying that find_match always makes progress.

Let G be an n × m bipartite graph defined as follows. On the left side are n vertices L_1, ..., L_n ∈ L, corresponding to elements of S. On the right side are m vertices R_1, ..., R_m ∈ R, corresponding to {1, ..., m}. Recall that hash defines for each t_i ∈ S a set of k values h_1, ..., h_k ∈ {1, ..., m}; G contains an edge between L_i and R_{h_j}, for 1 ≤ j ≤ k. The following property is crucial to our analysis:

Definition 4.1. Let G be as above. We say that G has the singleton property if for all nonempty A ⊆ L, there exists a vertex R_i ∈ R such that R_i is adjacent to exactly one vertex in A.

We claim that if G has the singleton property, then find_match will never get stuck. This is because whenever find_match is called on a subset A of S, some location R_i is adjacent to exactly one element of A; that element has a defined tweak, so the resulting easy set is nonempty.

We next reduce the singleton property to a well-studied (lossless) expansion property. Let N(v) be the set of neighbors of a vertex v ∈ L, and for A ⊆ L, let N(A) be the set of neighbors (in R) of elements of A.

Definition 4.2. Let G be as above. We say that G has the lossless expansion property if for all nonempty A ⊆ L, |N(A)| > k|A|/2.

Lemma 4.3. If G has the lossless expansion property, then it has the singleton property.

Proof. Assume to the contrary that, for some nonempty A ⊆ L, each node in N(A) is adjacent to at least two vertices of A. Then G has at least 2|N(A)| edges incident to A. However, by the lossless expansion property, |N(A)| > |A|k/2, so G has more than |A|k edges incident to A, which is a contradiction, since the vertices of A have exactly |A|k incident edges (counting multiplicities).

Now, choosing hash at random corresponds to choosing G according to the following distribution: each v ∈ L selects (with replacement) k random vertices r_1, ..., r_k ∈ R to be adjacent to. Such random graphs are well studied; Lemma 4.4 follows from a standard counting argument.

Lemma 4.4. Let G be chosen as above, with fixed k and m = ckn. For a large enough constant c and n sufficiently large, G has the lossless expansion property with probability at least 2/3.

Proof. (Sketch) For a fixed nonempty A ⊆ L with |A| = s, the probability that |N(A)| ≤ ks/2 is at most C(m, ks/2)·(ks/2m)^{ks}: one can choose ks/2 locations to contain N(A), and each of the ks neighbor choices made by the vertices of A must then fall among them. Taking a union bound over all choices of A, the probability of a counterexample is at most

  Σ_{s=1}^{n} C(n, s) C(m, ks/2) (ks/2m)^{ks}
    ≤ Σ_{s=1}^{n} (en/s)^s (2ecn/s)^{ks/2} (s/2cn)^{ks}
    = Σ_{s=1}^{n} [ (en/s)(es/2cn)^{k/2} ]^s
    = Σ_{s=1}^{n} [ e^{k/2+1} (s/n)^{k/2-1} (2c)^{-k/2} ]^s,

using m = ckn. For c large enough the bracketed quantity is bounded by a constant less than 1 for every s and tends to 0 with n for each fixed s; splitting the sum at, say, s = √n shows that it is o(1).
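Lemma 4.4 can also be checked empirically. The sketch below is ours (the constants are arbitrary illustrative choices): it samples the random bipartite graph G and runs the same singleton-peeling process that find_match performs; with m = ckn it exhausts S in essentially every trial.

```python
import random
from collections import Counter

def peel(n, c=1.25, k=3, seed=1):
    """Sample the random graph G (m = c*k*n right vertices; each of the n left
    vertices picks k locations with replacement) and repeatedly remove every
    element owning a singleton location, as find_match does.  Returns True if
    all of S is matched this way."""
    rng = random.Random(seed)
    m = int(c * k * n)
    nbhd = {t: {rng.randrange(m) for _ in range(k)} for t in range(n)}
    remaining = set(nbhd)
    while remaining:
        counts = Counter(loc for t in remaining for loc in nbhd[t])
        easy = {t for t in remaining if any(counts[loc] == 1 for loc in nbhd[t])}
        if not easy:
            return False                 # find_match would report failure here
        remaining -= easy
    return True

print(sum(peel(1000, seed=s) for s in range(20)), "successes out of 20 trials")
```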
5 Lower Bounds

We consider the case R = {⊥, 1, 2}. The set S splits into the subsets A and B that map to 1 and 2, respectively. It is natural to wonder whether a single set of hash functions might be sufficient for all pairs A, B. In other words, is deterministic Bloomier filtering possible with only O(n) bits of storage? We provide a negative answer to this question. Our lower bound also holds for nonuniform algorithms; in other words, we may use an arbitrarily large amount of storage besides the Bloomier filter tables, as long as the encoding depends only on n and N.

Theorem 5.1. Deterministic Bloomier filtering requires Ω(n + log log N) bits of storage.

Proof. Let G be a graph with C(N, 2n)·C(2n, n) nodes, each one corresponding to a distinct vector in {-1, 0, 1}^N with exactly n coordinates equal to 1 and n others equal to -1. Two nodes of G are adjacent if there exists at least one coordinate position 1 ≤ i ≤ N such that their corresponding vectors (x_1, ..., x_N) and (y_1, ..., y_N) satisfy x_i y_i = -1. Intuitively, each node corresponds to a choice of A (the 1 coordinates) and B (the -1 coordinates). Two nodes are joined by an edge if the set A of one node intersects the set B of the other one. Since the table T is the only source of information about A, B, no two adjacent nodes should correspond to the same table assignment; therefore, the size m of the table is at least log χ(G), where χ(G) is the chromatic number of G. Theorem 5.1 follows directly from the lemma below.

Lemma 5.1. The chromatic number of G is between Ω(2^n log N) and O(n 4^n ln N).

Proof. Consider a minimum coloring of G, and for each color c let z^c denote the corresponding vector. For any vector w ∈ {-1, 1}^n and any sequence of n indices 1 ≤ i_1 < ... < i_n ≤ N, there exists a color c such that w = (z^c_{i_1}, ..., z^c_{i_n}); to see why, consider a choice of A, B that matches the coordinates of w at the positions i_1, ..., i_n. If we turn all the minus ones into zeroes, the resulting set of vectors z^c is (N, n)-universal (meaning that the restrictions of the vectors z^c to any given choice of n coordinate positions produce all possible patterns). By Kleitman and Spencer [16], the number of such vectors is known to be Ω(2^n log N). For the upper bound, we use the existence of an (N, 2n)-universal set of vectors of size O(n 2^{2n} log N), also established in [16], and turn all zeroes into minus ones. (Alternatively, we can use Razborov's bound on the size of separating sets [25].) Each node is colored by picking a vector from the universal set that matches the ones and minus ones of the vector associated with that node.

Going back to the randomized model of Bloomier filtering, we consider what happens if we attempt to modify the set S itself. Again we give a negative answer, but this time the universe size need be only polynomially larger than n for the scheme to break down. Intuitively, this shows that too much information is lost in a linear bit size encoding of the function f to allow for changes in the set S.

Theorem 5.2. If N ≥ 2n and the number m of storage bits satisfies n ≤ m ≤ n log log(N/cn³) for a large enough constant c, then Bloomier filtering cannot support dynamic updates on the set S.

Proof. Again, we consider S to be the disjoint union of an n-set A and an n-set B mapping to 1 and 2, respectively. Fix the original B, and consider the assignments of the table T corresponding to the various choices of A. With each assignment of T is associated a certain family of n-element subsets A of D \ B. Let F be the largest such family; obviously, |F| ≥ C(N-n, n)·2^{-m}. Given an integer k > 0, let L_k be the set of elements x ∈ D that belong to at least k sets of F. It is easy to see that L_k cannot be too small. Indeed, let F_k denote the subfamily consisting of the sets of F that are subsets of L_k. Obviously, |F \ F_k| ≤ (k-1)N. The assumptions of the theorem imply that N > cn³; thus, the choice of k = ⌈|F|/2N⌉ ensures that k > c and

  |F_k| ≥ |F| - (k-1)N > |F|/2 ≥ C(N-n, n)·2^{-m-1}.

Because C(|L_k|, n) ≥ |F_k|, it follows that

  (5.1)  |L_k| ≥ 2^{-(m+1)/n - O(1)} N.

The expected number of sets in F that a random element from D intersects is (n/N)|F|. Given an n-element set B ⊆ D, let F^B denote the subfamily of F whose sets intersect B. For a random B, E|F^B| ≤ (n²/N)|F|. Let F^B_c = F \ F^B and L^B_k = ∪{S ∩ L_k : S ∈ F^B_c}. Since each x ∈ L_k belongs to at least k sets of F, we have k·|L_k \ L^B_k| ≤ Σ_{S ∈ F^B} |S|; hence

  E|L^B_k| ≥ |L_k| - (n/k)·E|F^B| ≥ |L_k| - (n³/kN)|F| ≥ |L_k| - 3n³.

Once the new choice of B is revealed, the table gets updated. The only information about A is encoded in the previous table assignment.
Thus the algorithm cannot distinguish between any two sets A in F^B_c. To summarize, given a random B, the algorithm must answer 'in A' for any search key in L^B_k; furthermore,

  (5.2)  Prob[ |L^B_k| ≥ |L_k| - 6n³ ] ≥ 1/2.

Next, we partition the family of n-element sets B according to the assignment of T each one corresponds to. This gives us at most 2^m subfamilies {G_i}, with Σ_i |G_i| = C(N, n). If M_i denotes the union ∪{S ∩ L_k : S ∈ G_i}, then given any new choice of B in G_i the algorithm must answer 'in B' for any search key in M_i. By our previous remark, therefore, it is imperative that M_{i(B)} should be disjoint from L^B_k. We show that if m is too small, then for a random choice of B both sets intersect with high probability.

Fix a parameter λ = |L_k|·2^{-cm/n}. Let i(B) denote the index j such that B ∈ G_j, and let μ = |L_k|·n/2N. Given a random B, by Chernoff's bound, the probability that |B ∩ L_k| < μ is o(1). On the other hand, the conditional probability that |M_{i(B)}| ≤ λ, given that |B ∩ L_k| ≥ μ, is at most

  2^m · max_{s ≤ λ} (s/|L_k|)^{μ} ≤ 2^m (λ/|L_k|)^{μ} = o(1).

Therefore, the probability that |M_{i(B)}| > λ is 1 - o(1). Since λ > 6n³, it follows that, with probability at least 1/2 - o(1), M_{i(B)} intersects L^B_k and the algorithm fails.

References

[1] Bloom, B. Space/time trade-offs in hash coding with allowable errors, CACM 13 (1970), 422-426.
[2] Broder, A., Mitzenmacher, M. Network applications of Bloom filters: a survey, Allerton 2002.
[3] Brodnik, A., Munro, J.I. Membership in constant time and almost minimum space, SIAM J. Comput. 28 (1999), 1628-1640.
[4] Buhrman, H., Miltersen, P.B., Radhakrishnan, J., Venkatesh, S. Are bitvectors optimal? Proc. 32nd STOC (2000), 449-458.
[5] Byers, J., Considine, J., Mitzenmacher, M. Informed content delivery over adaptive overlay networks, Proc. ACM SIGCOMM 2002, Vol. 32:4, Computer Communication Review (2002), 47-60.
[6] Capalbo, M., Reingold, O., Vadhan, S., Wigderson, A. Randomness conductors and constant-degree expansion beyond the degree/2 barrier, Proc. 34th STOC (2002), 659-668.
[7] Cohen, S., Matias, Y. Spectral Bloom filters, SIGMOD 2003.
[8] Cuena-Acuna, F.M., Peery, C., Martin, R.P., Nguyen, T.D. PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities, Rutgers Technical Report DCS-TR-487, 2002.
[9] Estan, C., Varghese, G. New directions in traffic measurement and accounting, Proc. ACM SIGCOMM 2002, Vol. 32:4, Computer Communication Review (2002), 323-336.
[10] Fan, L., Cao, P., Almeida, J., Broder, A. Summary cache: a scalable wide-area web cache sharing protocol, IEEE/ACM Transactions on Networking 8 (2000), 281-293.
[11] Fang, M., Shivakumar, N., Garcia-Molina, H., Motwani, R., Ullman, J. Computing iceberg queries efficiently, Proc. 24th Int. Conf. on VLDB (1998), 299-310.
[12] Feng, W.-C., Shin, K.G., Kandlur, D., Saha, D. Stochastic fair blue: a queue management algorithm for enforcing fairness, INFOCOM '01 (2001), 1520-1529.
[13] Fredman, M.L., Komlos, J., Szemeredi, E. Storing a sparse table with O(1) worst case access time, J. ACM 31 (1984), 538-544.
[14] Gremillion, L.L. Designing a Bloom filter for differential file access, Comm. ACM 25 (1982), 600-604.
[15] Hsiao, P. Geographical region summary service for geographical routing, Mobile Computing and Communications Review 5 (2001), 25-39.
[16] Kleitman, D.J., Spencer, J. Families of k-independent sets, Discrete Math 6 (1973), 255-262.
[17] Ledlie, J., Taylor, J., Serban, L., Seltzer, M. Self-organization in peer-to-peer systems, Proc. 10th European SIGOPS Workshop, September 2002.
[18] Li, Z., Ross, K.A. PERF join: an alternative to two-way semijoin and bloomjoin, CIKM '95, Proc. 1995 International Conference on Information and Knowledge Management, 137-144, November 1995.
[19] Luby, M. LT codes, Proc. 43rd Annu. IEEE Symp. Foundat. Comput. Sci., 2002.
[20] Mackert, L., Lohman, G. R* optimizer validation and performance for distributed queries, Proc. Int'l Conf. on VLDB (1986), 149-159.
[21] Manber, U., Wu, S. An algorithm for approximate membership checking with application to password security, Information Processing Letters 50 (1994), 191-197.
[22] McIlroy, M.D. Development of a spelling list, IEEE Trans. on Communications, COM-30 (1982), 91-99.
[23] Mitzenmacher, M. Compressed Bloom filters, IEEE Transactions on Networking 10 (2002).
[24] Mullin, J.K. Optimal semijoins for distributed database systems, IEEE Transactions on Software Engineering 16 (1990).
[25] Razborov, A.A. Applications of matrix methods to the theory of lower bounds in computational complexity, Combinatorica 10 (1990), 81-93.
[26] Rhea, S.C., Kubiatowicz, J. Probabilistic location and routing, Proceedings of INFOCOM 2002.
[27] Rousskov, A., Wessels, D. Cache digests, Computer Networks and ISDN Systems 30(22-23), 2155-2168, 1998.
[28] Sipser, M., Spielman, D.A. Expander codes, IEEE Trans. Inform. Theory 42 (1996), 1710-1722.
[29] Snoeren, A.C., Partridge, C., Sanchez, L.A., Jones, C.E., Tchakountio, F., Kent, S.T., Strayer, W.T. Hash-based IP traceback, Proc. ACM SIGCOMM 2001, Vol. 31:4, Computer Communication Review, 3-14, August 2001.
[30] Whitaker, A., Wetherall, D. Forwarding without loops in Icarus, Proc. 5th OPENARCH (2002), 63-75.