Tabulation Based 4-Universal Hashing with Applications to Second Moment Estimation

Mikkel Thorup    Yin Zhang
AT&T Labs--Research, Shannon Laboratory
180 Park Avenue, Florham Park, NJ 07932, USA.
(mthorup,yzhang)@research.att.com

Abstract

We show that 4-universal hashing can be implemented efficiently using tabulated 4-universal hashing for characters, gaining a factor of 5 in speed over the fastest existing methods. We also consider generalization to k-universal hashing, and as a prime application, we consider the approximation of the second moment of a data stream.

1 Introduction.

This paper is about fast 4-universal hashing, with fast data streaming algorithms being the prime application. We also consider generalization to k-universal hashing for arbitrary k. For any i >= 1, let [i] = {0, 1, ..., i-1}. As defined in [19], a class H of hash functions from [n] into [m] is a k-universal class of hash functions if for any distinct x_0, ..., x_{k-1} in [n] and any possibly identical v_0, ..., v_{k-1} in [m],

(1.1)  Pr_{h in H} {h(x_i) = v_i, i in [k]} = 1/m^k

By a k-universal hash function, we mean a hash function that has been drawn at random from a k-universal class of hash functions. Our main contribution is a fivefold speed-up for 4-universal hashing of keys consisting of one or a few words.

Hashing is typically applied to a set of s <= n keys from [n], and often we consider collisions harmful. For any k >= 2, with k-universal hashing, the expected number of collisions is bounded by s^2/n. However, with 2-universal hashing, the variance in the number of collisions may be as large as Theta(s^4/n^2). On the other hand, as shown in [7], with 4-universal hashing, the variance is no bigger than the expected number of collisions. As described in [7], this means that 4-universal hashing is more reliable in many algorithmic contexts.
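Definition (1.1) can be sanity-checked exhaustively on a toy family. The sketch below is our own illustration, not a construction from the paper: the folklore family h_{a,b}(x) = (a*x + b) mod p over a small prime p is 2-universal, and we verify this by counting, for every pair of distinct inputs and every pair of target values, how many of the p^2 functions realize it.

```c
#include <assert.h>

/* Brute-force check of definition (1.1) for a toy family (our own
   illustration): over p = 5, the functions h_{a,b}(x) = (a*x + b) mod p
   with a, b in [p] form a 2-universal class from [p] into [p].  For every
   pair of distinct inputs and every pair of target values, exactly 1 of
   the p*p = 25 choices of (a, b) works, i.e., probability exactly 1/p^2. */

enum { P = 5 };

/* Count the functions h_{a,b} with h(x0) = v0 and h(x1) = v1. */
int count_hits(int x0, int x1, int v0, int v1) {
    int hits = 0;
    for (int a = 0; a < P; a++)
        for (int b = 0; b < P; b++)
            if ((a * x0 + b) % P == v0 && (a * x1 + b) % P == v1)
                hits++;
    return hits;
}

/* Exhaustively verify (1.1) with k = 2 for the whole family. */
void check_2universal(void) {
    for (int x0 = 0; x0 < P; x0++)
        for (int x1 = 0; x1 < P; x1++) {
            if (x0 == x1) continue;
            for (int v0 = 0; v0 < P; v0++)
                for (int v1 = 0; v1 < P; v1++)
                    assert(count_hits(x0, x1, v0, v1) == 1);
        }
}
```

The count is exactly 1 because, over a prime field, the two conditions form a linear system in (a, b) with a unique solution.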
Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for Industrial and Applied Mathematics. All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036, or the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Our special interest in fast 4-universal hashing is due to its applications in the processing of data streams, an area that has recently gathered intense interest both in the theory community and in many applied communities such as those of databases and the Internet (see [2, 13] and the references therein). A common data stream scenario is as follows. A large stream of items arrives one by one at a rapid pace. When an item arrives, we have very little time to update some small amount of local information that we maintain on the data stream. This information is selected with a certain question or type of questions in mind. The item itself is lost before the next item arrives. When all the items have passed by, we have to answer questions on the data stream.

A classic example is the second moment computation from [1]. Each item has a weight and a key, and we want to compute the second moment, which is the sum over each key of the square of the total weight of items with that key. The generic method used in [1] is that when an item arrives, a hash function is computed of the key, and then the hash value is used to update the local information. At any point in time, the local information provides an unbiased estimator of the second moment of the data stream seen so far. In order to control the variance of the estimator, the hash function used in [1] has to be 4-universal. The generic method from [1] is used in many other streaming algorithms.
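The generic method of [1] can be sketched in a few lines of C. This is a minimal illustration under stated assumptions: the fixed sign[] table and the 4-key universe are our own stand-ins for a sign bit drawn from a 4-universal hash of the key, which is what actually controls the variance.

```c
#include <stdint.h>

/* Minimal sketch of the second-moment estimator of [1]: one counter is
   kept, and each arriving item adds its weight times a random sign
   s(key) in {-1,+1}; the squared counter is an unbiased estimate of the
   second moment.  The sign[] table below is a stand-in for a sign bit
   derived from a 4-universal hash of the key. */

#define KEYS 4
static const int sign[KEYS] = { +1, -1, -1, +1 }; /* stand-in for hash bit */
static long long counter = 0;

void process_item(int key, long long weight) {
    counter += sign[key] * weight;   /* constant work per stream item */
}

long long second_moment_estimate(void) {
    return counter * counter;        /* unbiased over the random signs */
}
```

In [1], several independent such counters are averaged to reduce the variance; a single counter is shown here only to exhibit the role of the hash function.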
Sometimes we just use 2-universal hashing. Other times we have a choice between a simpler algorithm using 4-universal hashing and a more complicated one using 2-universal hashing. And finally, as in the second moment example, we only have algorithms using 4-universal hashing. In many of the streaming algorithms, computing the hash function is a bottleneck. The basic reason to prefer 2-universal hashing over 4-universal hashing is that it is an order of magnitude faster with existing methods. However, here we improve the speed of 4-universal hashing by a factor of 5, making it a more viable option in time critical scenarios.

1.1 Concrete application areas.

The application [11] that originally motivated us in this research was a sniffer that monitors packets passing through a high speed Internet backbone link at OC48 speed (2.48 gigabits per second). At such high link speed, it is common for packets to arrive at a rate of more than a million per second, thus leaving us with less than a microsecond per packet. In the worst case, when all packets are of the minimum Internet packet size of 40 bytes (320 bits), the packet arrival rate can be as high as 2.48 × 10^9/320 = 7.75 × 10^6 per second, leaving us with less than 130 nanoseconds per packet. A critical part of the application was to compute the second moment with the packet key being its single word 32-bit IP address, and the packet weight being its size in bytes. The speed-up achieved in this paper made the application possible.

We note that independent of computer speeds, there is an absolute virtue in processing information with just a few instructions. The basic point is that many streams are already passing through computer software, hence limited by the actual processor's speed. If we can process data in a few instructions, then chances are that we are as fast as everything else, hence that we can keep up with the stream. An example of such software limited streams are IP firewalls that are often implemented in software. Another example is the flow level statistics exported by most IP routers.

1.2 Tabulation based hashing.

On most of today's computers, the fastest way of generating a 2-universal bit string from a key is to divide the key into characters, use an independent tabulated 2-universal function to produce a bit string for each character, and then just return the bit-wise exclusive-or of each of these strings. This method goes back to [4], and an experimental comparison with other methods is found in [16]. More precisely, if H is a 2-universal class of hash functions from characters to bit strings, and we pick q independent random functions h_0, ..., h_{q-1} in H, then the function h mapping a_0 a_1 ··· a_{q-1} to h_0[a_0] ⊕ h_1[a_1] ⊕ ··· ⊕ h_{q-1}[a_{q-1}] is 2-universal. Here ⊕ denotes bit-wise exclusive-or. If H is 3-universal, then so is h. However, the scheme breaks down above 3. Regardless of the properties of H, h is not 4-universal. Above we used '[·]' around the arguments of the h_i to indicate that the h_i are tabulated, so that function values are found by a single look-up in an array.

1.2.1 Our results.

Despite the above obstacles, we show in this paper that it is possible to use tabulation for fast 4-universal hashing. As a simple illustration, consider the case where keys are divided into two characters. Then we will show that

h[ab] = h_0[a] ⊕ h_1[b] ⊕ h_2[a + b]

is a 4-universal hash function if h_0, h_1, and h_2 are independent 4-universal hash functions into strings of the same length. A slight caveat of the above scheme is that the derived character a + b requires one more bit than a and b, hence that h_2 needs to be over a domain of twice the size. It would have been nicer if we could just apply h_2 to a ⊕ b instead of a + b, but then, as we will show, the combined function is not 4-universal.

We can reduce the domain of a + b by performing the addition over an appropriate odd prime field. With character length c = 8, 16, we exploit that 2^c + 1 is prime. Then the domain of the derived character a + b is only one bigger than that of the input characters.

The above scheme could be applied recursively, but then, for q characters, we would end up using q^{log_2 3} hash functions. We show here that we can get down to 2q - 1 hash functions whose output strings need to be ⊕-ed. Apart from hashing q input characters in [2^c], we hash q - 1 derived characters over [2^c + q]. The derived characters are all generated using a total of q + 4 simple operations over integers returned for free as part of the look-ups done over the input characters.

We also present a scheme for general k that gives k-universal hashing using (k-1)(q-1)+1 k-universal hash functions. For k = 4, this is not as good as the previous result, but it does have the advantage that the derived characters have exactly the same length as the input characters.

As a theoretical comment, we note that our scheme completely avoids multiplication. It only uses O(kq) AC^0 operations like addition, shifts, and bit-wise Boolean operations, plus memory look-ups. For contrast, we know that any small space implementation of k-universal hashing needs non-AC^0 operations like multiplication [12]. The space of our tables allows us to circumvent this problem. An alternative solution would have been to use our space to tabulate multiplication of c/2-bit characters, and use this table to implement multiplication over ℓ-bit keys with AC^0 operations. Even with the asymptotically fastest multiplication [15], we would use Theta(q log q) look-ups per multiplication and Theta(kq log q) look-ups for k-universal hashing. This is worse than the O(kq) look-ups with our solution, and it is not even remotely practical.

1.2.2 Random derived characters.

Recently and independently, a somewhat similar scheme has been suggested [9, §5], but where the derived characters are themselves based on random hashing. As we shall see below, our deterministically derived characters work much better.

The scheme in [9] takes an integer parameter p, and generates p independent universal hash functions f_0, ..., f_{p-1} from input keys to c-bit characters. Here, as in [4], universal just means that for any inputs x ≠ y, Pr{f_i(x) = f_i(y)} = 1/2^c. For 32- and 64-bit input words, such a universal hash function can be implemented with a multiplication and a shift [8]. The output is defined as

h(x) = h_0[f_0(x)] ⊕ h_1[f_1(x)] ⊕ ··· ⊕ h_{p-1}[f_{p-1}(x)].

In [9, Proof of Theorem 5], they show for any distinct inputs x_0, ..., x_{k-1} and outputs v_0, ..., v_{k-1} that

Pr{h(x_i) = v_i, i in [k]} = 1/m^k ± (k/(p+1))(k/2^c)^p.

Above, the additive term (k/(p+1))(k/2^c)^p is the error relative to k-universality as defined in (1.1). In order to get remotely k-universal, we set 2^{cp} = m^k and accept an error factor of k^{p+1}/(p+1). For m = 2^ℓ and c = ℓ/q, this means that p = kq, where p is the number of hashed characters and table look-ups. This is actually more than the (k-1)(q-1)+1 deterministically derived characters that we use for our general perfectly k-universal scheme. Our special scheme for 4-universal hashing provides even bigger improvements, using only 2q - 1 as opposed to 4q derived characters and look-ups, and avoiding the error factor k^{4q+1}/(4q+1). Conversely, if we limit the scheme from [9] to use the same computational resources as we do, that is, only p = 2q - 1 randomly derived characters and look-ups, the error relative to the ideal 1/m^4 = 2^{-4ℓ} becomes a factor

(k/(p+1))(k/2^c)^p · 2^{4ℓ} >= (8^{2q}/(2q)) · 2^{2ℓ}.

For example, hashing 64-bit integers to 64-bit integers using 16-bit characters, [9] gets an error factor > 2^149. In all fairness, it should be mentioned that the scheme in [9] is not claimed to be practical. Also, the analysis in [9] is geared towards asymptotic bounds, and it may be possible to tighten it. We note that [9, §5.2] stipulates that they can implement their k-universal hashing with a single multiplication, though over numbers that are q times as long as the input keys.

1.3 Hashing single and double words in C.

The focus of this paper is to develop efficient C code for 4-universal hashing of single and double words of 32 and 64 bits, respectively, producing a corresponding number of output bits. Indeed, we end up gaining a factor 5 in speed over previous methods. We shall later return to the case where input and output are of different sizes.

Concerning other key lengths, if keys have less than 16 bits, we can hash them trivially using a complete table over all such keys. If keys have between 16 and 32 bits, we hash them as 32-bit keys. If keys have between 32 and 64 bits, we hash them as 64-bit keys. Finally, if keys have more than 64 bits, we first apply fast standard universal hashing into 64 bits, and then we apply 4-universal hashing on the reduced keys. In the above reduction to 64-bit keys, the universal hashing means that two keys get the same reduced key with probability 2^{-64}. Hence, if there are n << 2^{32} different keys in the stream, they will likely all have distinct reduced keys. However, as detailed in [18, §A3], if the target is to estimate the second moment, then the error is small with high probability if most of the mass of the stream is distributed on n << 2^{64} keys, which is most certainly the case in any foreseeable future. We note that our techniques generalize perfectly to 96 and 128 bits if that is of any interest.

The point we try to make here is that a more standard asymptotic analysis for keys of non-constant length, e.g., favoring Schönhage-Strassen multiplication of large numbers [15], is much less relevant for the kind of streaming applications we have in mind. Another point in not considering the asymptotic case is that our worst competitor may simply be undefined, depending on a major unresolved problem in number theory.

We note that a weakness of our tabulation based scheme relative to CW-trick is that we require fairly large precomputed tables, whereas CW-trick just requires access to a_0, ..., a_{k-1}. One can easily imagine applications where it is desirable to compute hash values directly from a small space representation of a hash function. However, for our streaming applications, it is not a problem to initialize some tables in a start-up phase. We are going to compare CW-trick with our tabulation based method both based on a high level instruction count, and based on experiments on two different computers.

1.4 The opposition.

When implementing k-universal hashing from words to words in C on a standard computer, our worst competitor is the original function from [19]:

(1.2)  h(x) = (Σ_{i in [k]} a_i x^i) mod p

for some prime p > x with each a_i picked randomly from [p]. If p is an arbitrary prime, this method is fairly slow because the 'mod p' is slow. However, as pointed out in [4], we get a fast implementation if p is a so-called Mersenne prime of the form 2^i - 1. Then (1.2) gives the fastest known 4-universal hashing on a processor with standard arithmetic operations. We shall refer to this as CW-trick. In the hashing of 32-bit integers, we can use p = 2^61 - 1, and for 64-bit integers, we can use p = 2^89 - 1.

As mentioned previously, we do not know if CW-trick is defined for arbitrary key lengths, for it is a major open problem in number theory if arbitrarily large Mersenne primes exist. The largest known so far is 2^{13466917} - 1. As mentioned previously, our scheme avoids multiplication altogether.

For the second moment estimation from [1], it is actually preferable that the 4-universal output is a bit string, for then each bit position is a 4-universal bit independent of the remaining string. In contrast, if the output is in a prime field, the bit positions are not independent, and then we typically have to give up some of the most significant bits in order to get an approximately 4-universal bit string. The output of our 4-universal hashing is a bit string as desired. Also, we can easily generate long 4-universal bit strings. All we have to do is to put long bit strings in the 4-universal character tables, and then ⊕ these bit strings as we look them up in a sequential read.

1.4.1 With p a power of 2.

In [6] it was shown that (1.2) can be used with p an appropriately large power of two, outputting a suffix of the result as a k-universal bit string. For 2-universal hashing, this power-of-two scheme has proved very fast [16]. However, for 4-universal hashing, the scheme needs p = 2^{218} just to hash from 32 bits to 32 bits [6, Theorem 10]. The multiplication of 218-bit integers makes this method much slower than CW-trick for 4-universal hashing. We note that since the method from [6] gives us 4-universal bit strings, we could use it to initialize the character tables needed in our tabulation based method. Assuming that the data stream is much bigger than the tables, the initialization time is not an issue.

1.5 The C-level instruction count.

Besides an experimental comparison, we are going to compare our tabulation based scheme with CW-trick using a coarse-grained analysis of C code. We assume we have a 64-bit processor, and we charge a unit cost for each instruction on one or two 64-bit double words. Of computational instructions we have standard arithmetic operations such as addition and multiplication. We note for both addition and multiplication that overflow beyond the 64 bits is discarded. In particular, this means that multiplication is only used to multiply integers below 2^{32}.
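To make (1.2) concrete, here is a sketch of CW-trick for k = 4 over the Mersenne prime p = 2^61 - 1, the choice named above for 32-bit keys. The paper codes the double-word arithmetic by hand; this sketch instead leans on the 128-bit integers offered by GCC/Clang, so it illustrates the algebra rather than the paper's tuned implementation.

```c
#include <stdint.h>

/* Sketch of CW-trick: a degree-3 polynomial modulo the Mersenne prime
   p = 2^61 - 1 gives 4-universal hashing of keys below p (e.g., 32-bit
   keys).  Uses GCC/Clang's unsigned __int128 instead of the hand-coded
   double-word arithmetic developed in the paper. */

static const uint64_t P61 = (1ULL << 61) - 1;

/* Reduce modulo 2^61 - 1 with shifts and adds only: since 2^61 = 1
   (mod p), (hi*2^61 + lo) mod p = (hi + lo) mod p.  Valid for x < 2^122. */
static uint64_t mod_p61(unsigned __int128 x) {
    uint64_t r = (uint64_t)(x & P61) + (uint64_t)(x >> 61);
    r = (r & P61) + (r >> 61);       /* fold the carry once more */
    return r >= P61 ? r - P61 : r;   /* canonical value in [0, p) */
}

/* h(x) = (a0 + a1*x + a2*x^2 + a3*x^3) mod p, coefficients from [p];
   requires x < p so the 128-bit products stay below 2^122. */
uint64_t cw4(uint64_t x, const uint64_t a[4]) {
    uint64_t h = a[3];
    for (int i = 2; i >= 0; i--)     /* Horner's rule: 3 mults, 3 mods */
        h = mod_p61((unsigned __int128)h * x + a[i]);
    return h;
}
```

The point of the Mersenne prime is visible in mod_p61: the expensive general 'mod p' collapses into a shift, a mask, and an addition.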
With CW-trick we do not need modulus or division, so we do not need to worry about the slowness of these operations. Of other computational instructions, we have regular and bit-wise Boolean operations, and shifts. Finally, we may define a vector of characters over a double word and extract a character at unit cost. We note that among the above operations, multiplication is typically the most expensive. Since multiplication is used by CW-trick and not by us, we are generous to the opponent when only charging one unit for multiplication. We assume that our processor has a small number, say 30, of registers. Here registers are just thought of as memory that is so fast that copying between cells is almost free. Since the registers are controlled by the compiler, it is convenient to ignore them. When running CW-trick, we assume that all variables reside in registers. However, we do charge a unit cost for memory access beyond the registers. This means that our tabulation based methods are charged a unit for each look-up. Obviously, the unit cost is only fair if the tables are small enough to fit in reasonably fast memory. For example, if tables over 16-bit characters were too slow, we could switch to tables over 8-bit characters. As a final cost, we charge a unit for a jump. For a conditional jump, we charge for the evaluation of the condition and for the jump if it is made. All our procedure calls are inline, so we do not need to charge for them. We refer to the above cost as the C-level instruction count. Here "C-level" refers to the fact that a finer analysis would have to take the concrete machine and compiler into account. Our C-level analysis will be complemented with an experimental evaluation on two different computers, and as it turns out, the C-level instruction count does give a fairly accurate prediction of the actual running times.

1.5.1 Modern processors.
For the k-universal hashing of longer keys, it is worth noting that our algorithms are ideally suited for the kind of vector operations supported on 128-bit words by modern processors like the Pentium 4 [10, 17]. In fact, we are really coding such vector operations, making sure that we have enough space between coordinates that we do not get overflow from one coordinate into another. For contrast, these modern processors do not support full-word multiplication, and hence they do not help as much for the traditional method in (1.2).

1.6 Second moment estimation.

The estimation of the second moment is a canonical application of our tabulation based 4-universal hashing. We present a new estimator that, given a 4-universal hash value, yields the same variance while gaining roughly a factor of 2 in space over the best combination of previous methods [1, 5]. In [11], this estimator is used, as part of a larger application, in real time estimation of the second moment over IP addresses of packets coming through a high speed Internet router. Our second moment estimator is described in §4, deferring some details to [18, §A].

2 Tabulation based hashing.

In this section, we show how tabulation can be used for fast 4-universal hashing. First we present a general framework for k-universal hashing along with some simple lemmas. Next, we present a scheme for 4-universal hashing on two input characters that requires 3 table look-ups. We then generalize the scheme to achieve 4-universal hashing on q input characters with 2q - 1 table look-ups. We also present a scheme for general k that gives k-universal hashing on q characters using (k-1)(q-1)+1 table look-ups. We then compare our 4-universal hashing scheme with CW-trick, the fastest known algorithm, both in terms of C-level instruction count and actual running times. The results suggest that our tabulation based scheme consistently wins by at least a factor of 5.

2.1 General framework.
Our general framework for tabulation based k-universal hashing with q characters is as follows.

1. Given a vector of q input characters x = (x_0 x_1 ··· x_{q-1}), x_i in [2^c], we construct a vector of q + r derived characters z = (z_0 z_1 ··· z_{q+r-1}), z_j in [p], p >= max{2^c, q + r}. Some of the derived characters may be input characters.

2. We will have q + r independent tabulated hash functions h_j into [2^ℓ], and the hash value is then

(2.3)  h(x) = h_0[z_0] ⊕ ··· ⊕ h_{q+r-1}[z_{q+r-1}]

The domain of the different derived characters depends on the application. Here we just assume that h_j has an entry for each possible value of z_j.

We will now define the notion of a "derived key matrix" along with some simple lemmas. Consider k' <= k distinct keys x_i = (x_{i,0} x_{i,1} ··· x_{i,q-1}), i in [k'], and let the derived characters of x_i be (z_{i,0} z_{i,1} ··· z_{i,q+r-1}). We then define the derived key matrix as

D = ( z_{0,0}     z_{0,1}     ···  z_{0,q+r-1}
      z_{1,0}     z_{1,1}     ···  z_{1,q+r-1}
      ···
      z_{k'-1,0}  z_{k'-1,1}  ···  z_{k'-1,q+r-1} )

LEMMA 2.1. Suppose for any k' <= k distinct keys x_i, i in [k'], the derived key matrix D contains some element that is unique in its column. Then the combined hash function h defined in (2.3) is k-universal if all the h_j, j in [q + r], are independent k-universal hash functions.

Proof. Consider a set of k' distinct keys along with their derived key matrix D. For any set of k' hash values v_i, i in [k'], we have to show that

Pr{h(x_i) = v_i, i in [k']} = 1/2^{k'ℓ}

By assumption, there is an element z_{i0,j0} that is unique in column j0. Since the h_j are independent k-universal hash functions, each character in each column is hashed independently. Without loss of generality, we can assume that the hash value of z_{i0,j0} is picked last. When all the other characters are hashed, we obtain hash values for each of the other k' - 1 keys. By induction, these are hashed (k'-1)-universally, so

Pr{h(x_i) = v_i, i in [k'] \ {i0}} = 1/2^{(k'-1)ℓ}

However, z_{i0,j0} is hashed independently by h_{j0}, and

Pr{h(x_{i0}) = v_{i0}} = Pr{h_{j0}[z_{i0,j0}] = v_{i0} ⊕ (⊕_{j ≠ j0} h_j[z_{i0,j}])} = 1/2^ℓ

Hence, Pr{h(x_i) = v_i, i in [k']} = 1/2^{k'ℓ}, as desired.

In our constructions for 4-universal hashing, the input characters will all be used as derived characters, and then we can simplify the assumption of Lemma 2.1 to dealing with exactly 4 keys.

LEMMA 2.2. Suppose all input characters are used as derived characters and that for any 4 distinct keys x_i, i in [4], the derived key matrix D contains some element that is unique in its column. Then the combined hash function h defined in (2.3) is 4-universal if all the h_j, j in [q + r], are independent 4-universal hash functions.

Proof. Consider k' < 4 distinct keys. The distinctness implies that some column of input characters in D has at least two different elements, and for k' < 4, one of these elements must be unique in its column. Hence, for k' <= 4, the condition of Lemma 2.1 is satisfied if it is satisfied for k' = k = 4.

2.2 4-universal hashing with two characters.

THEOREM 2.1. If keys are divided into 2 characters, then h(xy) = h_0[x] ⊕ h_1[y] ⊕ h_2[x + y] is a 4-universal hash function if h_0, h_1, and h_2 are independent 4-universal hash functions into [2^ℓ].

Proof. For any 4 distinct keys x_i y_i, i in [4], let z_i = x_i + y_i. By Lemma 2.2, it suffices to show that the derived key matrix

D = ( x_0  y_0  z_0
      x_1  y_1  z_1
      x_2  y_2  z_2
      x_3  y_3  z_3 )

has an element that is unique in its column. Without loss of generality, we may assume that each element appears at least twice in the input columns {x_i} and {y_i}. Since the four keys x_i y_i are distinct and D has only four rows, it is easy to see that each x_i and y_i must appear exactly twice in its column. Without loss of generality, we can assume the four distinct keys are a_0 b_0, a_0 b_1, a_1 b_0, and a_1 b_1, where a_0 < a_1 and b_0 < b_1. This implies a_0 + b_0 < a_0 + b_1 < a_1 + b_1 and a_0 + b_0 < a_1 + b_0 < a_1 + b_1. So both a_0 + b_0 and a_1 + b_1 are unique in column {z_i}.

A slight caveat of the above scheme is that x + y requires one more bit than x and y, hence that h_2 needs to be over a domain that is twice as large. It would have been nicer if we could just apply h_2 to x ⊕ y instead of x + y, but then the combined function satisfies h(00) ⊕ h(01) ⊕ h(10) ⊕ h(11) = 0, and is therefore not 4-universal. The following theorem shows that we can still achieve 4-universal hashing if we replace x + y with addition over an odd prime field containing the domain of the input characters. With concrete character lengths c = 8, 16, we can exploit that 2^c + 1 is prime. Then the domain of the derived character x + y is only one bigger than that of the input characters.

THEOREM 2.2. Theorem 2.1 still holds if we perform the addition over an odd prime field containing the domain of the input characters. That is, if keys are divided into 2 characters in [2^c], then h(xy) = h_0[x] ⊕ h_1[y] ⊕ h_2[(x + y) mod p] is a 4-universal hash function, where p > 2^c is an odd prime and h_0, h_1, and h_2 are independent 4-universal hash functions into [2^ℓ].

Proof. For this proof, we use '+' and '-' to denote addition and subtraction modulo p. As in the proof of Theorem 2.1, we may assume that the 4 distinct keys x_i y_i are a_0 b_0, a_0 b_1, a_1 b_0, and a_1 b_1, respectively, where a_0 ≠ a_1 and b_0 ≠ b_1. With z_i = x_i + y_i, we need to show that some z_i is unique. Clearly, z_0 = a_0 + b_0 ≠ a_0 + b_1 = z_1. Similarly, z_0 ≠ z_2. Hence, if z_0 is not unique, z_0 = z_3, and if z_1 is not unique, z_1 = z_2. But these equalities imply a_0 + b_0 + a_0 + b_1 = a_1 + b_1 + a_1 + b_0, or a_0 + a_0 = a_1 + a_1. Therefore, there exists an element e = a_0 - a_1 ≠ 0 such that e + e = 0. This is impossible in an odd prime field, so we conclude that some z_i is unique.

Note that if we perform the addition in any even field over [2^c], the combined hash function h(xy) = h_0[x] ⊕ h_1[y] ⊕ h_2[x + y] is not 4-universal. This is because in any even field over [2^c], there always exists an element e ≠ 0 such that e + e = 0, where 0 is the zero element of the field. This means h(00) ⊕ h(0e) ⊕ h(e0) ⊕ h(ee) = 0. Therefore, h cannot be 4-universal. Interestingly, however, h(xy) = h_0[x] + h_1[y] + h_2[x ⊕ y] is indeed 4-universal if h_0, h_1, and h_2 are 4-universal hash functions into [p'], where p' is an odd prime and the additions are performed modulo p'. But since in practice it is often much more convenient to deal with bit strings, we will not go into details on this.

2.3 4-universal hashing with q characters.

For 4-universal hashing with more than two input characters, we can recursively apply the two-character scheme, but then, for q characters, we would end up using q^{log_2 3} hash functions. Here we show that we can get down to 2q - 1. Let r = q - 1. Given q input characters x = (x_0 x_1 ··· x_{q-1}), x_i in [2^c], we obtain q + r characters by including the q input characters themselves together with r additional characters y = (y_0 y_1 ··· y_{r-1}) derived using

y = x G

where G is a q × r generator matrix with the property that any square submatrix of G has full rank, and vector element additions and multiplications are performed over an odd prime field of size p, p >= max{2^c, q + r}. We then use the above general hashing framework to combine q + r independent tabulated 4-universal hash functions.

For small q, the generator matrix G can be constructed manually. In particular, if q = 2 (thus r = 1), we can just use G = (1 1)^T, which gives the scheme in Theorem 2.2. For large q, we can use a q × r Cauchy matrix over the prime field of size p (which mandates p >= q + r):

C^{q×r} = [ 1/(i+j+1) ], i in [q], j in [r]

  = ( 1/(0+0+1)      1/(0+1+1)      ···  1/(0+(r-1)+1)
      1/(1+0+1)      1/(1+1+1)      ···  1/(1+(r-1)+1)
      ···
      1/((q-1)+0+1)  1/((q-1)+1+1)  ···  1/((q-1)+(r-1)+1) )

THEOREM 2.3. Let G be a q × r generator matrix with the property that any square submatrix of G has full rank over the prime field of size p, where p >= max{2^c, q + r} is an odd prime. Given any q characters x = (x_i), i in [q], let y = (y_j), j in [r], be the r = q - 1 additional characters derived using y = x G. Then the combined hash function

h(x) = h_0[x_0] ⊕ ··· ⊕ h_{q-1}[x_{q-1}] ⊕ h̃_0[y_0] ⊕ ··· ⊕ h̃_{r-1}[y_{r-1}]

is a 4-universal hash function if the hash functions h_i (i in [q]) and h̃_j (j in [r]) are independent 4-universal hash functions into [2^ℓ].

Proof. Consider 4 distinct q-character keys x_i = (x_{i,0} ··· x_{i,q-1}), i in [4]. Let y_i = (y_{i,0} ··· y_{i,r-1}) = x_i G and consider the derived key matrix

D = ( x_{0,0} ··· x_{0,q-1}  y_{0,0} ··· y_{0,r-1}
      x_{1,0} ··· x_{1,q-1}  y_{1,0} ··· y_{1,r-1}
      x_{2,0} ··· x_{2,q-1}  y_{2,0} ··· y_{2,r-1}
      x_{3,0} ··· x_{3,q-1}  y_{3,0} ··· y_{3,r-1} )
  = [X Y]

where X and Y are the submatrices formed by the vectors x_i and y_i, respectively. Clearly, Y = X G.

By Lemma 2.2, we only need to prove that some element of D is unique in its column. Assume by way of contradiction that each element of D appears at least twice in its column. Then each column D[·, j] must be either of type 0: (a a a a)^T, in which all 4 elements of the column are equal, or one of the three proper types in which each element appears exactly twice: type α: (a a b b)^T, type β: (a b a b)^T, and type γ: (a b b a)^T.

For t = 0, α, β, γ, let X_t be the possibly empty submatrix of X that consists of all columns of type t. Also, let G_t be the submatrix of G consisting of the rows j such that column j of X is of type t. Finally, we define Y_t = X_t G_t. Then Y = Σ_{t=0,α,β,γ} Y_t.

Consider a specific derived column Y[·, j], j in [r]. For t = 0, α, β, γ, Y_t[·, j] is of type 0 or type t, as it is a linear combination of input columns of type t. We say type t ≠ 0 is present in derived column Y[·, j] if Y_t[·, j] is of type t.

We will now prove that at most one type can be present in each derived column Y[·, j], j in [r]. Suppose for a contradiction that we have at least two present types. By symmetry, we may assume that α and β are present. Then Y_α[·, j] = (a_0 a_0 b_0 b_0)^T and Y_β[·, j] = (a_1 b_1 a_1 b_1)^T, so from the proof of Theorem 2.2, we know that Y_α[·, j] + Y_β[·, j] has a unique character. If γ is not present, Y_0[·, j] + Y_γ[·, j] is of type 0, and then we know that Y[·, j] has a unique character, a contradiction. Otherwise, all three types t ≠ 0 are present, and the argument is symmetric in α, β, γ. If we don't have a unique character in Y[·, j], it is of some type, say 0 or γ. This implies that Y_α[·, j] + Y_β[·, j] = Y[·, j] - Y_0[·, j] - Y_γ[·, j] is of type 0 or γ, contradicting that Y_α[·, j] + Y_β[·, j] has a unique character.

Let n_t be the number of columns in X_t. Next, we prove that if n_t > 0 and t ≠ 0, then there are at most n_t - 1 derived columns Y[·, j] in which type t is not present. Assume by contradiction that there are n_t or more such derived columns. We can then find an n_t × n_t submatrix G⁰_t of G_t consisting of columns j such that type t is not present in derived column Y[·, j]. Then any two rows a and b of X_t satisfy a G⁰_t = b G⁰_t. However, X_t has two different rows, so this contradicts the fact that all square submatrices of G have full rank.

Since the input keys are distinct, there have to be at least two proper types t with n_t > 0, and each derived column has at most one present type. From the above results, we then know that G cannot have more than n_0 + Σ_{t=α,β,γ} max{0, n_t - 1} <= q - 2 derived columns. However, G has r = q - 1 columns, so this gives us the desired contradiction.

2.3.1 Relaxed and efficient computation of x G.

With the above scheme, we only need 2q - 1 table look-ups to compute the hash value for q input characters. However, to make the scheme useful in practice, we still need to compute y = x G very efficiently, which with a schoolbook implementation requires O(qr) = O(q^2) multiplications and additions over the field of size p. Below we describe several techniques to get down to O(q) time.

Multiplication through tabulation. Let G_i, i in [q], be the q rows of the generator matrix G from Theorem 2.3. Then

y = x G = (x_0 ··· x_{q-1}) (G_0; G_1; ...; G_{q-1}) = Σ_{i in [q]} x_i G_i

Therefore, we can avoid all the multiplications by storing with each x_i not only h_i[x_i], but also the above vector x_i G_i, denoted g_i(x_i). Then we compute y as the sum Σ_{i in [q]} g_i(x_i) of these tabulated vectors.

Using regular addition. We will now argue that for 4-universality, it suffices to compute Σ_{i in [q]} g_i(x_i) using regular integer addition rather than addition over the field of size p. What was shown in the proof of Theorem 2.3 was that some element of the derived key matrix D is unique in its column. However, all elements were from [p], so the uniqueness cannot be destroyed by adding variable multiples of p to the elements, and this is exactly the effect of using regular integer addition rather than addition modulo p.

Parallel additions. To make the additions efficient, we can exploit bit-level parallelism by packing the g_i(x_i) into bit strings with ⌈log_2 q⌉ bits between adjacent elements. Then we can add the vectors by adding the bit strings as regular integers. By Bertrand's postulate, we can assume p < 2^{c+1}, hence that each element of g_i(x_i) uses c + 1 bits.

Compressing derived characters. The derived characters can further be compressed by subtracting suitable multiples of p, and, as argued above, this cannot destroy the uniqueness of an element in a column, so our hash function remains 4-universal with these compressed derived characters. The transformation from a to the compressed a' can be performed in parallel for a vector of derived characters. With appropriate precomputed constants, the compression is performed in 5 C-level instructions.

2.4 k-universal hashing with q characters.

Here we present a scheme for general k-universal hashing using (k-1)(q-1)+1 k-universal hash functions. For k = 4, this is not as good as the previous results, but it does have the advantage that it allows the derived characters to have the same length as the input characters.

Let r = (k-2)(q-1). Given q input characters x = (x_i), i in [q], we derive q + r = (k-1)(q-1)+1 characters z = (z_j), j in [q+r], using

z = x H

where H is a q × (q+r) generator matrix with the property that any q × q submatrix of H has full rank over a (possibly even) field of size at least max{2^c, q + r}. Matrices with this property are commonly used in coding theory to generate erasure resilient codes [3, 14]. For example, we can choose H = [I_q G], where I_q is the q × q identity matrix and G is a q × r matrix with full rank of all square submatrices. Then we get a scheme just like in §2.3, where G can be the Cauchy matrix C^{q×r}. However, this time we may use an even field such as GF(2^c), and that gives us some substantial savings to be explored later. Another possible choice of H is a q × (q+r) Vandermonde matrix:

V^{q×(q+r)} = [ j^i ], i in [q], j in [q+r]

  = ( 1        1        1        ···  1
      0^1      1^1      2^1      ···  (q+r-1)^1
      ···
      0^{q-1}  1^{q-1}  2^{q-1}  ···  (q+r-1)^{q-1} )

THEOREM 2.4. Let H be a q × (q+r) matrix with the property that any q × q submatrix of H has full rank over a (possibly even) field of size at least max{2^c, q + r}, where c is the length of the input characters. For any given q characters x = (x_i), i in [q], derive the q + r = (k-1)(q-1)+1 characters z = (z_j), j in [q+r], as z = x H. Then the combined hash function h(x) = h_0[z_0] ⊕ ··· ⊕ h_{q+r-1}[z_{q+r-1}] is a k-universal hash function if the h_j (j in [q+r]) are independent tabulated k-universal hash functions into [2^ℓ].

Proof. By Lemma 2.1, we only need to prove for any k' <= k distinct keys x_i, i in [k'], that some element of the derived key matrix D is unique in its column. Suppose for a contradiction that every element appears at least twice in its column. For each column D[·, j], define E_j = {(a, b) | a, b in [k'] s.t. a < b and D[a, j] = D[b, j]}. Then we have |E_j| >= k'/2, and as a result, Σ_{j in [q+r]} |E_j| >= (q+r) k'/2.
Consequently, we use c = c + 1 + log2 q bits per element. For any application we are interested in, 1 + log2 q c, and then c 2c. This means that our vectors are coded in bit-strings that are at most twice as long as the input keys. We have assumed our input keys are contained in a word. Hence, we can perform each vector addition with two word additions. Consequently, we can perform all q - 1 vector additions in O(q ) time. In our main tests, things are even better, for we use 16bit characters of single and double words. For single words of 32 bits, this is the special case of two characters. For double words of 64 bits, we have q = 4 and r = q - 1 = 3. This means that the vectors gi (xi ) are contained in integers of rc = 3(16 + 1 + 2) = 57 bitsi, that is, in double words. Consequently, we can compute [q ] gi (xi ) using 3 regular double word additions. i Compression. With regular addition in [q ] gi (xi ), the derived characters may end up as large as q (p - 1), which means that tables for derived characters need this many entries. If memory becomes a problem, we could perform the mod p operation on the derived characters after we have done all the additions, thus placing the derived characters in [p]. This can be done in O(log q ) total time using bit-level parallelism like in the vector additions. However, for character lengths c = 8, 16, we can do even better exploiting that p = 2c + 1 is a prime. We are then going to place the derived characters in [2c + q ]. Consider a c ector element a < q p. Let a = a (2c - 1) + q - (x v ) (2c - 1). Here denotes a right shift and denotes bitwise A N D. Then it is easy to show that 0 y < 2c + q and a a + q (mod p). Adding q and a variable multiple of p to each element of the derived key matrix does not 1q-1 2 q -1 · · · (q + r - 1)q-1 architecture. We see that Table gains more than a factor 5 over CW-trick, both for 32 and 64 bit keys. 
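Theorems 2.3 and 2.4 both rest on generator matrices all of whose square submatrices have full rank, and for concrete parameters this can be spot-checked mechanically. The sketch below is not from the paper; the helper names and the choice p = 257 (= 2^8 + 1, as in the compression discussion) are ours. It builds entries of the Cauchy matrix C^{q x r} over Z_p and checks selected square submatrices by Gaussian elimination mod p.

```c
#include <stdint.h>

/* Modular exponentiation by squaring; p is a small odd prime. */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    for (b %= p; e; e >>= 1) {
        if (e & 1) r = r * b % p;
        b = b * b % p;
    }
    return r;
}

/* Cauchy matrix entry C[i][j] = 1/(i+j+1) mod p, via Fermat inverse. */
static uint64_t cauchy(int i, int j, uint64_t p) {
    return powmod((uint64_t)(i + j + 1), p - 2, p);
}

/* Rank of an n x n matrix (n <= 8) over Z_p by Gaussian elimination. */
static int rank_mod_p(uint64_t m[8][8], int n, uint64_t p) {
    int rank = 0;
    for (int col = 0; col < n && rank < n; col++) {
        int piv = -1;
        for (int r = rank; r < n; r++)
            if (m[r][col] != 0) { piv = r; break; }
        if (piv < 0) continue;                    /* column already zero */
        for (int j = 0; j < n; j++) {             /* swap pivot row up   */
            uint64_t t = m[rank][j]; m[rank][j] = m[piv][j]; m[piv][j] = t;
        }
        uint64_t inv = powmod(m[rank][col], p - 2, p);
        for (int r = rank + 1; r < n; r++) {      /* eliminate below     */
            uint64_t f = m[r][col] * inv % p;
            for (int j = 0; j < n; j++)
                m[r][j] = (m[r][j] + p - f * m[rank][j] % p) % p;
        }
        rank++;
    }
    return rank;
}

/* 1 iff the submatrix of C^{qxr} on the given rows/columns is nonsingular. */
static int cauchy_submatrix_full_rank(const int *rows, const int *cols,
                                      int n, uint64_t p) {
    uint64_t m[8][8];
    if (n > 8) return 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            m[i][j] = cauchy(rows[i], cols[j], p);
    return rank_mod_p(m, n, p) == n;
}
```

Any square submatrix of a Cauchy matrix is nonsingular as long as all the denominators i + j + 1 stay nonzero mod p, which holds here since q + r < p, so such a check is expected to pass for any choice of distinct rows and columns.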
2.4.1 Efficient computation of z over GF(2^c).

As mentioned above, we can pick H = [I_q G] and thus get a scheme like in §2.3 with z being the concatenation of x and y = x G. With the notation of §2.3.1, we compute y as the sum sum_{i in [q]} g_i(x_i) of the tabulated vectors g_i(x_i). However, this time we may work in the even field GF(2^c). Consequently, the elements of g_i(x_i) are in [2^c], and addition over GF(2^c) is just XOR, supported directly in C without any need for carry bits. Thus, each vector g_i(x_i) is represented as a bit-string of length rc = (k - 2)(q - 1)c, and we just need to XOR these vectors to produce y. The resulting derived characters are all in [2^c]. This scheme is thus in many ways simpler than our specialized scheme for 4-universal hashing over Z_p. However, for k = 4, our general scheme performs worse because it uses 2(q - 1) derived characters, whereas our specialized scheme only uses q - 1 derived characters. So far, we do not understand if this is an inherent advantage of Z_p over GF(2^c), or if there is a smarter way of exploiting GF(2^c).

3 Performance evaluation.

We have implemented our schemes and CW-trick in C. Table 1 compares the different algorithms both in terms of C-level instruction count and actual running times on two machines with different architectures and operating systems. OldTable is simply the standard 2-universal hashing mentioned in §1.2, obtained by hashing each character independently using 16-bit characters. This is currently the fastest known method for 2-universal hashing, hence an interesting benchmark. CWtrick61 and CWtrick89 are CW-trick schemes as described in §1.4 with Mersenne primes 2^61 - 1 and 2^89 - 1, respectively. The actual code for CWtrick61 is found in §5.3 while the code for CWtrick89 is deferred to [18, §B.5]. The code for CWtrick89 gains speed from only producing the 64 least significant bits of the result, assuming that we only need that many hashed bits. Table and CompressTable are instances of our new tabulation based 4-universal hashing schemes from §2.3 with 16-bit input characters. With Table, the derived characters are not compressed and may be as large as q x 2^16 with q = 2, 4 depending on whether the input is 32 or 64 bits. With CompressTable, the derived characters are less than 2^16 + q, so they need much smaller tables. The actual code for 32-bit keys is found in §5.2. The code for 64-bit keys is deferred to [18, §B.3]. The C-level instruction count can be read directly from the code and is thus independent of compiler and machine architecture.

    k  key bits  hash bits  algorithm                    C-level       running time (sec)
                                                         instructions  computer A  computer B
    2  32        64         OldTable32                   5             0.17        0.31
    4  32        61         CWtrick61                    41            1.82        2.95
    4  32        64         Table32                      8             0.30        0.55
    4  32        64         CompressTable32              12            0.34        0.55
    2  64        64         OldTable64                   11            0.35        0.48
    4  64        64         CWtrick89                    184           6.83        12.31
    4  64        64         Table64                      32            1.56        2.22
    4  64        64         CompressTable64              37            1.04        2.53
    -  -         -          Update 2nd moment in stream  11            0.50        0.68

    Table 1: C-level instruction count plus running times for performing 10 million hash
    computations on computer A (400 MHz SGI R12k processor running IRIX64 6.5) and
    B (900 MHz Ultrasparc-III processor running Solaris 5.8).

We see that Table gains more than a factor 5 over CW-trick in instruction count, both for 32 and 64 bit keys. When it comes to actual running time, we see that the C-level instruction count gives a good rough estimate of the relative performances, yet there is a glaring contrast on Computer A between Table and CompressTable on 64 bit keys. In this case, Table uses roughly 4 times as much space as CompressTable, so it is natural to attribute its slowness to the use of slower memory. Similarly, the relative slowness of CW-trick can be attributed to its use of multiplication. All in all, for running times, we see that CompressTable consistently wins by a factor of 5 over CW-trick. Table is even faster in most cases when memory is not a problem. Even when memory starts to become a problem (Table with 64 bit keys on Computer A), it is still more than 4 times faster than CW-trick.

Summing up, we have shown that tabulation can be used for k-universal hashing, and for the important case of 4-universal hashing, we have gained a factor of 5 in speed over the previous fastest methods, making it a much more viable method for time critical streaming applications.

4 Second moment estimation.

Let S = (a_1, w_1), (a_2, w_2), ..., (a_s, w_s) be a data stream, where each key a_i is a member of [u]. Let v_a = sum_{i : a_i = a} w_i be the total weight associated with key a in [u]. Define, for each j >= 0,

    F_j = sum_{a in [u]} v_a^j.

The second moment, F_2, is of particular interest, since it arises in various applications.

4.1 Second moment estimators.

Classic estimator. The classic method for estimating F_2 by Alon et al. [1] uses n counters c_i (i in [n]) and n independent 4-universal hash functions s_i that map [u] into {-1, +1}. When a new data item (a, w) arrives, all n counters are updated using c_i += s_i(a) * w (i in [n]). F_2 is then estimated as X_classic = sum_{i in [n]} c_i^2 / n. Following the analysis in [1], we have E[X_classic] = F_2 and Var[X_classic] = (2/n) sum_{a != b} v_a^2 v_b^2 = 2(F_2^2 - F_4)/n.

Count sketch based estimator. Recently, Charikar et al. [5] described a data structure called count sketch for keeping track of frequent items in a data stream. We can adapt count sketch to make second moment estimation. Using this method, we need n counters c_i (i in [n]) and two independent 4-universal hash functions h : [u] -> [n] and s : [u] -> {-1, +1}. When a new data item (a, w) arrives, a single counter c_{h(a)} is updated using c_{h(a)} += s(a) * w. F_2 is then estimated as X_count sketch = sum_{i in [n]} c_i^2. We can prove that E[X_count sketch] = F_2 and Var[X_count sketch] = 2(F_2^2 - F_4)/n. Therefore, X_count sketch achieves the same variance as X_classic with a substantially lower update cost.

Fast count sketch based estimator. An alternative way of implementing the count sketch scheme is to use 2n counters c_i (i in [2n]) and a 4-universal hash function h : [u] -> [2n]. When a new data item (a, w) arrives, w is directly added to the counter c_{h(a)}: c_{h(a)} += w. In the end, F_2 is estimated using the alternating sum X_fast count sketch = sum_{i in [n]} (c_{2i} - c_{2i+1})^2. X_fast count sketch achieves the same variance as X_count sketch, but is faster because the direct update of a counter based on a single hash value is much simpler. However, such simplicity comes at the cost of doubling the space.

Our new estimator. Here we present a new estimator that achieves the same speed and variance as the fast count sketch based estimator without having to double the space. Instead of using 2n counters, our new method uses m = n + 1 counters c_i (i in [m]) and a 4-universal hash function h : [u] -> [m]. The update algorithm is exactly the same as that of the fast count sketch based estimator: when a new data item (a, w) arrives, w is added to the counter c_{h(a)}: c_{h(a)} += w. But the estimation formula is quite different. We use

    X_new = m/(m-1) * sum_{i in [m]} c_i^2  -  1/(m-1) * (sum_{i in [m]} c_i)^2.
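Among the estimators above, only the new estimator has code in §5. For comparison, here is a minimal sketch (not the paper's code) of the fast count sketch based estimator; the multiply-shift function h below is merely a placeholder for a genuine 4-universal hash function such as Table32.

```c
#include <stdint.h>

#define N 1024                    /* the estimator uses 2*N counters */
static int64_t cnt[2 * N];

/* Placeholder hash into [2N]; a real implementation would use a
 * 4-universal scheme such as Table32 or CWtrick. */
static uint32_t h(uint32_t a) {
    return (uint32_t)(a * 2654435761u >> 21) & (2 * N - 1);
}

/* Fast count sketch update: add the weight to a single counter. */
static void update(uint32_t a, int64_t w) {
    cnt[h(a)] += w;
}

/* Alternating-sum estimate of the second moment F2. */
static int64_t estimate(void) {
    int64_t x = 0;
    for (int i = 0; i < N; i++) {
        int64_t d = cnt[2 * i] - cnt[2 * i + 1];
        x += d * d;
    }
    return x;
}
```

With a single key of total weight v, exactly one counter holds v and the alternating sum returns v^2 exactly; the variance bound 2(F_2^2 - F_4)/n from the text only applies when h is genuinely 4-universal.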
Note that we do not worry about the cost of adding up the counters at the end. Hence, it is not considered a problem to have a more complex sum for this. In [18, §A], we prove

THEOREM 4.1. If h is 2-universal, E[X_new] = F_2. If h is 4-universal, Var[X_new] = 2(F_2^2 - F_4)/(m - 1) = 2(F_2^2 - F_4)/n.

4.2 Performance evaluation.

The code for second moment estimation can be found in §5.4. We used m = 2^15, giving us a relative standard error below sqrt(2/(m - 1)) ~ 2^{-7} < 1%. The last line in Table 1 shows the instruction count and the running times for performing 10 million hash computations and counter updates. This should be compared with the hash computation alone in Table32. We see that the additional overhead is limited. Even in the worst case, when all packets are of the minimum IP packet size of 40 bytes (320 bits), the counter update part can easily keep up with 320 x 10^7 / 0.68 = 4.7 x 10^9 bits per second on the slower computer, which is nearly twice as fast as OC48 speed (2.48 Gbps). With enough buffering, if the average IP packet size is above 85 bytes (680 bits), which is generally the case in today's Internet, we can even keep up with OC192 speed (10 Gbps).

5 Code.

In this section, we present some of the code discussed in this paper, justifying the concrete C-level instruction counts. The remaining code is found in [18, §B].

5.1 Common data types and macros.

typedef unsigned int       INT32;
typedef unsigned long long INT64;

const INT64 LowOnes     = (((INT64)1)<<32)-1;
const INT32 HalfLowOnes = (((INT32)1)<<16)-1;

#define LOW32of64(x)  ((x)&LowOnes)
#define HIGH32of64(x) ((x)>>32)
#define LOW16of32(x)  ((x)&HalfLowOnes)
#define HIGH16of32(x) ((x)>>16)
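These splitting macros can be sanity-checked in isolation; the following round-trip check is ours, not part of the paper's code.

```c
typedef unsigned int       INT32;
typedef unsigned long long INT64;

const INT64 LowOnes     = (((INT64)1)<<32)-1;
const INT32 HalfLowOnes = (((INT32)1)<<16)-1;

#define LOW32of64(x)  ((x)&LowOnes)
#define HIGH32of64(x) ((x)>>32)
#define LOW16of32(x)  ((x)&HalfLowOnes)
#define HIGH16of32(x) ((x)>>16)

/* Split-and-recombine round trips for both word sizes. */
static int check_split(void) {
    INT64 v = 0x0123456789abcdefULL;
    INT32 w = 0x89abcdefU;
    if (LOW32of64(v)  != 0x89abcdefULL) return 0;
    if (HIGH32of64(v) != 0x01234567ULL) return 0;
    if ((HIGH32of64(v) << 32 | LOW32of64(v)) != v) return 0;
    if (LOW16of32(w) != 0xcdefU || HIGH16of32(w) != 0x89abU) return 0;
    return 1;
}
```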
5.2 Tabulation based hashing for 32-bit keys using 16-bit characters.

/* optional compression, cf. §2.3.1 */
inline INT32 compress32(INT32 i) {
  return 2 - HIGH16of32(i) + LOW16of32(i);
} // 4 instructions

/* tabulation based hashing for 32-bit key x using 16-bit
 * characters; T0, T1, T2 are precomputed tables */
inline INT64 Table32(INT32 x, INT64 T0[],
                     INT64 T1[], INT64 T2[]) {
  INT32 x0, x1, x2;
  x0 = LOW16of32(x);
  x1 = HIGH16of32(x);
  x2 = x0 + x1;
  x2 = compress32(x2); // optional compression
  return T0[x0] ^ T1[x1] ^ T2[x2];
} // 8 + 4 = 12 instructions

The code uses 12 instructions (8 without compression), including 3 lookups.

5.3 CW trick for 32-bit keys with prime 2^61 - 1.

const INT64 Prime = (((INT64)1)<<61) - 1;

/* Computes ax+b mod Prime, possibly +2*Prime,
 * exploiting the structure of Prime. */
inline INT64 MultAddPrime(INT32 x, INT64 a, INT64 b) {
  INT64 a0, a1, c0, c1, c;
  a0 = LOW32of64(a)*x;
  a1 = HIGH32of64(a)*x;
  c0 = a0+(a1<<32);
  c1 = (a0>>32)+a1;
  c  = (c0&Prime)+(c1>>29)+b;
  return c;
} // 12 instructions
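The comment on MultAddPrime claims the result is ax + b mod Prime up to an additive multiple of Prime. This can be spot-checked against 128-bit arithmetic; the sketch below is ours and assumes a compiler providing unsigned __int128 (a GCC/Clang extension).

```c
typedef unsigned int       INT32;
typedef unsigned long long INT64;

const INT64 LowOnes = (((INT64)1)<<32)-1;
#define LOW32of64(x)  ((x)&LowOnes)
#define HIGH32of64(x) ((x)>>32)

const INT64 Prime = (((INT64)1)<<61) - 1;

/* As in Section 5.3: ax+b mod Prime, possibly plus a multiple of Prime. */
static INT64 MultAddPrime(INT32 x, INT64 a, INT64 b) {
  INT64 a0, a1, c0, c1, c;
  a0 = LOW32of64(a)*x;
  a1 = HIGH32of64(a)*x;
  c0 = a0+(a1<<32);
  c1 = (a0>>32)+a1;
  c  = (c0&Prime)+(c1>>29)+b;
  return c;
}

/* Reference value via 128-bit arithmetic (GCC/Clang extension). */
static INT64 RefMod(INT32 x, INT64 a, INT64 b) {
  unsigned __int128 v = (unsigned __int128)a * x + b;
  return (INT64)(v % Prime);
}

/* Check: result is congruent to ax+b mod Prime and, for b < Prime,
 * stays below 3*Prime. */
static int CheckMultAddPrime(INT32 x, INT64 a, INT64 b) {
  INT64 h = MultAddPrime(x, a, b);
  return (h % Prime) == RefMod(x, a, b) && h < 3*Prime;
}
```

The congruence holds because 2^64 = 8 (mod 2^61 - 1), so folding the high bits with the shifts by 32, 29, and the mask by Prime re-adds them with the correct weight.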
/* CWtrick for 32-bit key x with prime 2^61-1 */
inline INT64 CWtrick(INT32 x, INT64 A, INT64 B,
                     INT64 C, INT64 D) {
  INT64 h;
  h = MultAddPrime(MultAddPrime(
        MultAddPrime(x,A,B),x,C),x,D);
  h = (h&Prime)+(h>>61);
  if (h>=Prime) h-=Prime;
  return h;
} // 12*3 + 5 = 41 instructions

The code uses 41 instructions, including 6 multiplications.

5.4 Second moment estimation.

#define NumCounters 32768 // (1<<15)
INT64 Counters[NumCounters];

/* precomputed tables whose hash strings
 * only use the 15 least significant bits */
INT64 *T0, *T1, *T2;

inline void StreamUpdate2nd(INT32 ipaddr, INT32 size) {
  Counters[Table32(ipaddr,T0,T1,T2)] += size;
} // 3 instructions plus those in Table32
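StreamUpdate2nd assumes T0, T1, T2 have been precomputed; the paper does not show that step. The following initialization sketch is ours: the sizes follow §2.3 (with q = 2, an uncompressed derived character x0 + x1 lies in [2^17]), and filling the tables with masked rand() values is only a placeholder for genuinely 4-universal values, e.g. generated with CWtrick. Per the comment in §5.4, the stored hash strings keep only the 15 least significant bits.

```c
#include <stdlib.h>

typedef unsigned long long INT64;

#define CHAR_ENTRIES    (1 << 16)  /* one entry per 16-bit input character */
#define DERIVED_ENTRIES (2 << 16)  /* q = 2: derived char x0 + x1 < 2*2^16 */

/* Pseudorandom 15-bit value; a placeholder for a genuinely 4-universal
 * generator such as CWtrick. */
static INT64 Rand15(void) {
    return (INT64)(rand() & 0x7fff);
}

/* Allocate and fill the lookup tables used by Table32; returns 1 on success. */
static int InitTables(INT64 **T0, INT64 **T1, INT64 **T2) {
    *T0 = malloc(CHAR_ENTRIES * sizeof(INT64));
    *T1 = malloc(CHAR_ENTRIES * sizeof(INT64));
    *T2 = malloc(DERIVED_ENTRIES * sizeof(INT64)); /* uncompressed variant */
    if (!*T0 || !*T1 || !*T2) return 0;
    for (int i = 0; i < CHAR_ENTRIES; i++) {
        (*T0)[i] = Rand15();
        (*T1)[i] = Rand15();
    }
    for (int i = 0; i < DERIVED_ENTRIES; i++)
        (*T2)[i] = Rand15();
    return 1;
}
```

With the compressed variant of §5.2, T2 would only need 2^16 + 2 entries instead of 2^17.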
double StreamEstimate2nd() {
  int i;
  INT64 c;
  double sum = 0, sqsum = 0;
  for (i = 0; i < NumCounters; i++) {
    c = Counters[i];
    sum += c;
    sqsum += c*c;
  }
  return sqsum + (sqsum - sum*sum)/(NumCounters - 1);
}

The code for updating the counters adds 3 instructions to those in Table32, including one additional lookup.

Acknowledgment. We would like to thank Martin Dietzfelbinger and Muthu(krishnan) for very useful comments and suggestions.

References

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58(1):137-147, 1999.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 1-16, 2002.
[3] J. Bloemer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based erasure-resilient coding scheme. Technical Report TR-95-048, ICSI, Berkeley, California, Aug. 1995. http://www.icsi.berkeley.edu/~luby/PAPERS/cauchypap.ps.
[4] J. Carter and M. Wegman. Universal classes of hash functions. J. Comp. Syst. Sci., 18:143-154, 1979.
[5] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. 29th ICALP, LNCS 2380, pages 693-703, 2002.
[6] M. Dietzfelbinger. Universal hashing and k-wise independent random variables via integer arithmetic without primes. In Proc. 13th STACS, LNCS 1046, pages 569-580, 1996.
[7] M. Dietzfelbinger, J. Gil, Y. Matias, and N. Pippenger. Polynomial hash functions are reliable. In Proc. 19th ICALP, LNCS 623, pages 235-246, 1992.
[8] M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25:19-51, 1997.
[9] M. Dietzfelbinger and P. Woelfel. Almost random graphs with simple hash functions. In Proc. 35th STOC, pages 629-638, 2003.
[10] Intel. The IA-32 Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference (Order Number 245471), 2001. http://developer.intel.com/design/pentium4/manuals/245471.htm.
[11] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Proc. ACM Internet Measurement Conference (IMC-03), 2003.
[12] Y. Mansour, N. Nisan, and P. Tiwari. The computational complexity of universal hashing. Theor. Comp. Sc., 107:121-133, 1993.
[13] S. Muthukrishnan. Data streams: Algorithms and applications, 2003. Manuscript based on invited talk from 14th SODA. Available from http://www.cs.rutgers.edu/~muthu/stream-1-1.ps.
[14] L. Rizzo. Effective erasure codes for reliable computer communication protocols. ACM Computer Communication Review, 27(2):24-36, Apr. 1997.
[15] A. Schonhage and V. Strassen. Schnelle Multiplikation grosser Zahlen. Computing, 7:281-292, 1971.
[16] M. Thorup. Even strongly universal hashing is pretty fast. In Proc. 11th SODA, pages 496-497, 2000.
[17] M. Thorup. Combinatorial power in multimedia processors, 2003. http://www.research.att.com/~mthorup/PAPERS/multimedia.ps.
[18] M. Thorup and Y. Zhang. Appendix for "Tabulation Based 4-Universal Hashing with Applications to Second Moment Estimation", 2003. http://www.research.att.com/~mthorup/PAPERS/k-univ-app.ps.
[19] M. Wegman and J. Carter. New hash functions and their use in authentication and set equality. J. Comp. Syst. Sci., 22:265-279, 1981.