An improved data stream algorithm for frequency moments

Don Coppersmith*        Ravi Kumar†

* IBM T. J. Watson Research Center, Yorktown Heights, NY 10598. Email: copper@watson.ibm.com
† IBM Almaden Research Center, San Jose, CA 95120. Part of this work was done while the author was visiting the IBM T. J. Watson Research Center. Email: ravi@almaden.ibm.com

Abstract

We present a simple, one-pass, $\tilde{O}(\sqrt{n})$-space data stream algorithm for approximating the third frequency moment. This is the first improvement to the $\tilde{O}(n^{2/3})$-space data stream algorithm of Alon, Matias, and Szegedy [AMS99]. The current known lower bound for this problem is $\Omega(n^{1/3})$ [BJKS02a]. Our algorithm can also be generalized to an $\tilde{O}(n^{1-1/(k-1)})$-space data stream algorithm for approximating the $k$-th frequency moment. Besides improving the $\tilde{O}(n^{1-1/k})$-space upper bound of [AMS99], our algorithm beats the $\Omega(n^{1-1/k})$ sampling lower bound [BKS01] for this problem. Our method suggests a unified perspective of space-efficient data stream algorithms for all frequency moments.

1 Introduction

The data stream model of computation [AMS99, HRR99, FKSV02] abstracts the requirement that data be processed on-the-fly by a machine that has very little workspace memory. With an increasing need to efficiently analyze massive data sets, this model has gained prominence. In the last few years, algorithms for many important problems have been developed in this model; for a detailed account, see the recent surveys [BBD+02, Mut03]. Many of these data stream algorithms have the following flavor: they use space sublinear in the size of the input, they are randomized, and they produce only an approximation to the correct answer.

One of the earliest genres of problems for which space-efficient data stream algorithms were developed [AMS99] is computing various frequency moments of a sequence of elements.

Definition 1.1. (Frequency moments) Given a sequence of elements $x = x_1, \ldots, x_m$ where each $x_i \in [n]$, the $k$-th frequency moment is defined to be
$$F_k(x) = \sum_{a \in [n]} f_a^k(x),$$
where $f_a(x)$ is the number of times $a$ occurs in the sequence $x$. The zeroth frequency moment $F_0(x)$ is defined to be the number of distinct elements in $x$.

Frequency moments are fundamental statistical quantities and have varied applications in databases, data mining, and networking. Note that it is trivial to compute the frequency moments exactly using $O(n \log m)$ space.
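To make this baseline concrete, here is a minimal sketch (ours, not from the paper) of the trivial exact computation in Python; the stream is any iterable of elements from $[n]$:

    from collections import Counter

    def exact_frequency_moment(stream, k):
        # One counter per distinct element: Theta(n log m) bits in the worst case.
        freq = Counter()
        for a in stream:      # single pass over the stream
            freq[a] += 1
        return sum(f ** k for f in freq.values())

    # F_3 of the stream 1,1,2,3,3,3 is 2^3 + 1^3 + 3^3 = 36.
    assert exact_frequency_moment([1, 1, 2, 3, 3, 3], 3) == 36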
Moreover, it is not hard to show a space lower bound of $\Omega(n)$ if we insist on either a deterministic algorithm or an exact algorithm [AMS99].

In a seminal paper, Alon, Matias, and Szegedy [AMS99] showed that if we allow both randomization and approximation, then space-efficient algorithms for frequency moments are possible. For $F_0$, they showed an $O(\epsilon^{-2} \log n)$-space algorithm (see also [FM85]) and an $\Omega(\log n)$ lower bound. They also showed an $O(n^{1-1/k})$-space algorithm for approximating $F_k$, $k \ge 2$, and an $\Omega(n^{1-5/k})$ lower bound for $k > 5$. For $F_2$, they obtained a super-efficient $O(\epsilon^{-2} \log n)$-space algorithm. The algorithms for $F_0$ and $F_2$ were based on hashing the elements and keeping track of the elements that hash to zero, where it suffices for the hash function to have limited independence. The algorithm for $F_k$, $k > 2$, was based on sampling and counting in a single pass.

Subsequent to this work, several improvements were made to both upper and lower bounds for $F_k$. For $F_0$, the current upper bound is $\tilde{O}(\epsilon^{-2} + \log n)$ [BJK+02] (motivated by [BKS02, GT01]) and the one-pass lower bound is $\Omega(\epsilon^{-2} + \log n)$ [IW03]. For $F_k$, $k > 2$, the lower bound was improved to $\Omega(n^{1-2/k})$ [BJKS02a] (and [CKS03] for multi-pass algorithms). However, for $k > 2$, there was a tantalizing gap between the upper and lower bounds; for instance, for $F_3$, the upper bound was $\tilde{O}(n^{2/3})$ while the lower bound was only $\Omega(n^{1/3})$.

Results. We present a simple, one-pass, $\tilde{O}(\sqrt{n})$-space data stream algorithm for approximating $F_3$. Our algorithm can also be generalized to an $\tilde{O}(n^{1-1/(k-1)})$-space data stream algorithm for approximating $F_k$, $k \ge 2$. This constitutes the first improvement to the upper bounds for $F_k$, $k > 2$, of [AMS99]. Our algorithm can also be used to obtain an improved data stream algorithm for estimating $L_p$ differences (see [FKSV02, Ind00, SS02]).

Our algorithm operates by defining a statistical estimator based on hashing each element, where it suffices for the hash function to have limited independence (e.g., [CW79, WC81]). For instance, for $F_3$, the estimator is the cube of a sum of values, one for each element, where each value is either $1 - 1/\sqrt{n}$ or $-1/\sqrt{n}$ depending on whether or not the element hashed to 0; the hash function needs to be only six-wise independent. The main idea is then to show that this estimator can be used to approximate $F_3$, and this is done by relating its first and second moments to $F_3$. Notice that this approach is similar in spirit to the data stream algorithms for $F_2$ and $F_0$. To recap, the algorithm for $F_0$ maintains the smallest element that hashed to 0, and the algorithm for $F_2$ outputs the square of a sum of values, where each value is $\pm 1$ depending on whether or not the element hashed to 0. Thus, besides reducing the gap between the upper and lower bounds for $F_k$, $k > 2$, our method also suggests a unified perspective of space-efficient data stream algorithms for all frequency moments. Note that despite our improvement, the gap between the upper and lower bounds for $F_k$, $k > 2$, still persists, and closing this gap remains an important open problem.

On a philosophical note, the algorithm of [AMS99] for $F_k$, $k > 2$, even though it is in the data stream model, is a one-pass realization of a sampling algorithm. Unfortunately, there is an $\Omega(n^{1-1/k})$ sampling lower bound for $F_k$, $k \ge 2$ [BKS01], and consequently, sampling-based methods suffer from an inherent limitation. On the other hand, our algorithm can be thought of as utilizing the full potential of the data stream model (where we are allowed to process each element instead of just the sampled elements) to break the barrier imposed by the sampling lower bound. (The $F_2$ data stream algorithm of [AMS99] already beats the $\Omega(\sqrt{n})$ lower bound for sampling algorithms.)

Notation. An algorithm is said to $(\epsilon, \delta)$-approximate $F_k$ if for any sequence of elements $x$, it outputs a number $\tilde{F}_k$ such that $(1-\epsilon)F_k(x) \le \tilde{F}_k \le (1+\epsilon)F_k(x)$, with probability at least $1 - \delta$. We use $\tilde{O}(\cdot)$ to suppress factors that are polynomial in $\log n$ and $\log m$, and arbitrary in $k$; for us, $k$ is a constant.
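Since every algorithm below relies on limited-independence hashing, it may help to recall a standard way to realize it (a sketch under our own naming, not code from the paper): a random polynomial of degree $t-1$ over a prime field gives a $t$-wise independent family; the final reduction modulo the range size introduces a small bias that is commonly ignored.

    import random

    def make_t_wise_hash(t, n, lam, p=2 ** 61 - 1):
        # A random degree-(t-1) polynomial over GF(p), p prime and p > n, is
        # t-wise independent; reducing mod lam maps into {0, ..., lam - 1}
        # (nearly uniformly when p >> lam). Storing it takes O(t log n) bits.
        assert p > max(n, lam)
        coeffs = [random.randrange(p) for _ in range(t)]

        def h(a):
            v = 0
            for c in coeffs:  # Horner evaluation of the polynomial at a
                v = (v * a + c) % p
            return v % lam

        return h

    h = make_t_wise_hash(t=6, n=10 ** 6, lam=1000)  # six-wise, as Section 2 needs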
2 A new data stream algorithm for $F_3$

In this section we present an improved data stream algorithm for approximating $F_3$, the third frequency moment; this algorithm will use $\tilde{O}(\sqrt{n})$ space.

The main idea behind the algorithm is to use hash functions with limited independence and define statistical estimators based on the hash values of the elements in the data stream. While this idea in itself is not new in data stream algorithms, the twist in our method is to define an estimator that supports an efficient estimation of $F_3$. The basic estimator we define will have the following properties: its expectation will be a simple function of $F_3$, its variance is not too far away from its expectation, and it can be computed in a single pass using $O(\log n)$ space. (These estimators will turn out to be generalizations of the estimators of [AMS99] for $F_2$.)

More precisely, let $\Lambda = \sqrt{n}$ and let $h : [n] \to \{0, \ldots, \Lambda - 1\}$ be a random hash function that is six-wise independent. As we mentioned earlier, storing such a hash function requires only $O(\log n)$ space. Define $A = (1/\Lambda)(1 - 1/\Lambda)(1 - 2/\Lambda)$. Let

(2.1)  $\epsilon_h(a) = 1 - 1/\Lambda$ if $h(a) = 0$, and $\epsilon_h(a) = -1/\Lambda$ otherwise.

The basic estimator $Q_h(x)$ is defined as
$$Q_h(x) = \Big( \sum_{i=1}^m \epsilon_h(x_i) \Big)^3.$$
Several such basic estimators (in fact, $60\epsilon^{-2}\sqrt{n}$, to be precise) are computed using different hash functions and are averaged to yield an averaged estimator. Finally, this whole process is repeated $O(\log 1/\delta)$ times and the median of all the averaged estimators, divided by $A$, is output as the final answer. Clearly the above steps can be accomplished in a single pass over the data and can be realized in $O(\epsilon^{-2}\sqrt{n}(\log n + \log m)\log(1/\delta))$ space.
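Putting the pieces together, here is a minimal one-pass sketch (ours) of the algorithm just described, reusing the illustrative make_t_wise_hash from above; the constants and the median plumbing are simplified relative to the analysis below:

    import math
    import statistics

    def estimate_F3(stream, n, eps=0.5, delta=0.1):
        lam = max(2, math.isqrt(n))                     # Lambda = sqrt(n)
        A = (1 / lam) * (1 - 1 / lam) * (1 - 2 / lam)
        n_avg = max(1, int(60 * eps ** -2 * math.isqrt(n)))  # estimators per average
        n_med = max(1, math.ceil(math.log(1 / delta)))       # independent averages
        hs = [[make_t_wise_hash(6, n, lam) for _ in range(n_avg)]
              for _ in range(n_med)]
        Z = [[0.0] * n_avg for _ in range(n_med)]
        for a in stream:                                # the single pass
            for r in range(n_med):
                for j in range(n_avg):
                    Z[r][j] += (1 - 1 / lam) if hs[r][j](a) == 0 else (-1 / lam)
        averages = [sum(z ** 3 for z in row) / n_avg for row in Z]
        return statistics.median(averages) / A

The space is dominated by the $O(\epsilon^{-2}\sqrt{n}\log(1/\delta))$ running sums and hash descriptions, matching Theorem 2.1 below.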
Theorem 2.1. For every $\epsilon, \delta > 0$ and given a sequence $x = x_1, \ldots, x_m$ of elements where $x_i \in [n]$, the above algorithm $(\epsilon, \delta)$-approximates $F_3$ and uses $O(\epsilon^{-2}\sqrt{n}(\log n + \log m)\log(1/\delta))$ space.

Proof. First, notice that

(2.2)  $Q_h(x) = \Big( \sum_{a \in [n]} f_a \epsilon_h(a) \Big)^3$.

For the rest of the proof, all the expectations will be over the choice of $h$, which is six-wise independent. Let $Q$ denote $Q_h(x)$, let $F_k$ denote $F_k(x)$, and for any $a \in [n]$, let $f_a$ denote $f_a(x)$.

The definition of $\epsilon(a)$ in (2.1) immediately leads to the following:

(2.3)  $E[\epsilon^j(a)] = \frac{1}{\Lambda}\Big(1 - \frac{1}{\Lambda}\Big)^j + \Big(1 - \frac{1}{\Lambda}\Big)\Big(-\frac{1}{\Lambda}\Big)^j$.

Notice that when $j = 1$, then $E[\epsilon(a)] = 0$.

First, we show that the expectation satisfies

(2.4)  $E[Q] = \frac{1}{\Lambda}\Big(1 - \frac{1}{\Lambda}\Big)\Big(1 - \frac{2}{\Lambda}\Big) F_3 = A \cdot F_3$.

The proof is simple. Using (2.2), we have
$$E[Q] = E\Big[\Big(\sum_{a \in [n]} f_a \epsilon(a)\Big)^3\Big] = \sum_{a \in [n]} f_a^3 E[\epsilon^3(a)] + \sum_{a,b,c \in [n],\ \neg(a=b=c)} f_a f_b f_c E[\epsilon(a)\epsilon(b)\epsilon(c)].$$
The first term is our desired quantity: (2.3) with $j = 3$ gives $E[\epsilon^3(a)] = A$, and the definition of $F_3$ does the rest. The second term becomes zero for the following reason: at least one of $a, b, c$ occurs only once in each term of the summation, so as $h$ is six-wise independent, applying (2.3) with $j = 1$ annihilates this term.

Next, we upper bound the second moment $E[Q^2]$:

(2.5)  $E[Q^2] = E\Big[\Big(\sum_{a \in [n]} f_a \epsilon(a)\Big)^6\Big]$.

The expansion inside the above expectation will involve a summation over six-tuples of the form $\tau = \langle a, b, c, d, e, f \rangle$. An easy but useful observation is that if a tuple $\tau$ has an element occurring uniquely, then the expectation of the term involving $\tau$ is zero: this follows from the six-wise independence of $h$ and (2.3) with $j = 1$. Hence, the only non-zero terms in (2.5) are of the form:
$$E[Q^2] = \sum_a f_a^6 E[\epsilon^6(a)] + \binom{6}{4} \sum_{a \ne b} f_a^4 f_b^2\, E[\epsilon^4(a)] E[\epsilon^2(b)] + \frac{1}{2!}\binom{6}{3} \sum_{a \ne b} f_a^3 f_b^3\, E[\epsilon^3(a)] E[\epsilon^3(b)] + \frac{1}{3!}\binom{6}{2}\binom{4}{2} \sum_{a \ne b,\, b \ne c,\, c \ne a} f_a^2 f_b^2 f_c^2\, E[\epsilon^2(a)] E[\epsilon^2(b)] E[\epsilon^2(c)].$$

We will upper bound each term in the above equation. From (2.3), it can be seen that $E[\epsilon^j(a)] \le 1/\Lambda$ for $j = 2, 3, 4, 6$ whenever $n$ is chosen sufficiently large. Using this, the above equation becomes

(2.6)  $E[Q^2] \le \frac{F_6}{\Lambda} + \frac{15\, F_4 F_2}{\Lambda^2} + \frac{10\, F_3^2}{\Lambda^2} + \frac{15\, F_2^3}{\Lambda^3}$.

Our goal is to express $E[Q^2]$ in terms of $E^2[Q]$, which is roughly $F_3^2/\Lambda^2$. For this, we need the following analytic inequalities.

Proposition 2.1. For any $x$ with $x_i \in [n]$: (i) $F_6(x) \le F_3^2(x)$, (ii) $F_2(x) F_4(x) \le n^{1/3} F_3^2(x)$, and (iii) $F_2^3(x) \le n F_3^2(x)$. Moreover, all three inequalities can be simultaneously tight, up to constants.

Proof. For $i \in [n]$, let $f_i$ denote the frequency of $i$ in $x$. Now, (i) follows trivially by expanding the right-hand side of the inequality. To show (ii) and (iii), we use Hölder's inequality:

(2.7)  $\sum_{i=1}^n |a_i b_i| \le \|a\|_p \|b\|_q$,

where $p$ and $q$ are such that $1/p + 1/q = 1$ and $p, q > 1$. Letting $a_i = f_i^2$, $b_i = 1$, $p = 3/2$, $q = 3$ in (2.7), we get

(2.8)  $F_2(x) = \sum_{i=1}^n |f_i|^2 \le \Big(\sum_{i=1}^n |f_i|^3\Big)^{2/3} \cdot n^{1/3}$.

Cubing both sides of (2.8) yields (iii). Also, recall that for any vector $a$,

(2.9)  $\|a\|_p \le \|a\|_q$, if $p > q$.

In particular, by letting $p = 4$, $q = 3$, $a_i = f_i$ in (2.9) and raising both sides to the fourth power, we obtain

(2.10)  $F_4(x) = \sum_{i=1}^n |f_i|^4 \le \Big(\sum_{i=1}^n |f_i|^3\Big)^{4/3}$.

Multiplying (2.8) and (2.10) yields (ii). To show that the inequalities can be simultaneously tight, consider the case when the frequency vector of $x$ is such that $f_1 = n^{1/3}$, $f_2 = \cdots = f_n = 1$, i.e., a unique element occurs $n^{1/3}$ times and the other elements occur just once. In this case, it is easy to calculate $F_2(x) = F_3(x) = \Theta(n)$, $F_4(x) = \Theta(n^{4/3})$, and $F_6(x) = \Theta(n^2)$, and to verify that the above inequalities are tight, up to constants.
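As a quick sanity check (our illustration, not part of the paper), the tightness example can be verified numerically:

    def moments(freqs, ks=(2, 3, 4, 6)):
        # F_k values of a given frequency vector.
        return {k: sum(f ** k for f in freqs) for k in ks}

    n = 10 ** 6
    freqs = [round(n ** (1 / 3))] + [1] * (n - 1)   # f_1 = n^{1/3}, the rest are 1
    F = moments(freqs)
    assert F[6] <= F[3] ** 2                         # (i)
    assert F[2] * F[4] <= n ** (1 / 3) * F[3] ** 2   # (ii)
    assert F[2] ** 3 <= n * F[3] ** 2                # (iii)
    # Each left-hand side is within a small constant factor of its right-hand side.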
Applying Proposition 2.1 to (2.6) and using (2.4), we get
$$\mathrm{Var}[Q] \le E[Q^2] \le \Big(\frac{1}{\Lambda} + \frac{15\, n^{1/3} + 10}{\Lambda^2} + \frac{15\, n}{\Lambda^3}\Big) F_3^2 \le 60\Lambda \cdot E^2[Q] = 60\sqrt{n} \cdot E^2[Q].$$
(Our choice of $\Lambda$ to be $\sqrt{n}$ was to optimize the above expression.) Since the averaged estimator is computed as an average of $60\epsilon^{-2}\sqrt{n}$-many basic estimators, it is easy to see that its mean is $E[Q]$ and its variance is at most $\epsilon^2 E^2[Q]$. By Chebyshev's inequality, the averaged estimator is therefore a $(1 + O(\epsilon))$-approximation to $E[Q]$, with constant probability; rescaling $\epsilon$ by a constant gives a $(1+\epsilon)$-approximation. The median of $O(\log 1/\delta)$ such averaged estimators will be an approximation to $E[Q]$ with probability at least $1 - \delta$. Thus, dividing the median by $A$ yields a $(1+\epsilon)$-approximation to $F_3$.

From the example in Proposition 2.1, it is easy to see that the above analysis cannot be improved asymptotically, as the variance bound is tight for the example in Proposition 2.1.

The original algorithm of [AMS99] for $F_2$ is a special case of the algorithm for $F_3$ where $\Lambda$ is set to be a constant. In particular, the algorithm of [AMS99] can be interpreted as choosing $\Lambda$ to be 2. In that case, it suffices for $h$ to be four-wise independent, and it is easy to calculate $E[Q] = F_2$ and $E[Q^2] \le 2 E^2[Q]$.

Also, as in [AMS99], the $\log m$ term in the above space bound can be improved to $\log \log m$ by using approximate counting [FM85].

3 Data stream algorithm for $F_4$ and higher moments

In this section we outline the algorithm and the analysis for approximating $F_k$, $k \ge 4$, in the data stream model. For simplicity, we will illustrate the algorithm for $F_4$, which uses $\tilde{O}(n^{2/3})$ space. The algorithm for $F_4$ follows a path similar to that of $F_3$, but the analysis is more cumbersome and involves an additional step of estimating $F_2$ really well. To begin, we need the following fact.

Fact 3.1. (Alon, Matias, and Szegedy [AMS99]) There is a randomized algorithm that, given $\epsilon, \delta > 0$ and a sequence $x = x_1, \ldots, x_m$ of elements where $x_i \in [n]$, $(\epsilon, \delta)$-approximates $F_2$ and uses $O(\epsilon^{-2}(\log n + \log m)\log(1/\delta))$ space.

First, we approximate $F_2(x)$ really well. More precisely, let $\tilde{F}_2(x)$ be an approximation to $F_2(x)$ to within a factor of $(1 \pm n^{-1/3})$. This estimation of $F_2$ is carried out in parallel with the rest of the algorithm. As we mentioned before (see Fact 3.1), the amount of space required to estimate $F_2(x)$ to within this tiny factor is still only $O(n^{2/3} \log n)$.

Let $\Lambda = n^{2/3}$ and let $h : [n] \to \{0, \ldots, \Lambda - 1\}$ be a random hash function that is eight-wise independent. Define
$$A = \frac{1}{\Lambda}\Big(1 - \frac{1}{\Lambda}\Big)\Big(1 - \frac{6}{\Lambda} + \frac{6}{\Lambda^2}\Big), \qquad B = \frac{3}{\Lambda^2}\Big(1 - \frac{1}{\Lambda}\Big)^2.$$
Using (2.1), we define
$$Q_h(x) = \Big(\sum_{i=1}^m \epsilon_h(x_i)\Big)^4, \qquad Q'_h(x) = Q_h(x) - B \tilde{F}_2^2(x).$$
The basic estimator is $Q'_h(x)$, and we average $\epsilon^{-2} n^{2/3}$ such basic estimators to obtain the averaged estimator. The rest of the algorithm is the same as before.

Theorem 3.1. For every $\epsilon, \delta > 0$ and given a sequence $x = x_1, \ldots, x_m$ of elements where $x_i \in [n]$, the above algorithm $(\epsilon, \delta)$-approximates $F_4$ and uses $O(\epsilon^{-2} n^{2/3}(\log n + \log m)\log(1/\delta))$ space.

Proof. [Sketch] Let $Q = Q_h(x)$, $F_k = F_k(x)$, $\tilde{F}_2 = \tilde{F}_2(x)$. For the rest of the proof, all the expectations will be over the choice of $h$. It is easy to calculate

(3.11)  $E[Q] = A F_4 + B F_2^2$.

Define $Q' = Q - B \tilde{F}_2^2$. We need the following inequalities, which can be proved in the spirit of Proposition 2.1:

(3.12)  $F_8 \le F_4^2$, $\quad F_6 F_2 \le \sqrt{n}\, F_4^2$, $\quad F_5 F_3 \le n^{1/4} F_4^2$, $\quad F_4 F_2^2 \le n F_4^2$, $\quad F_3^2 F_2 \le n F_4^2$, $\quad F_2^4 \le n^2 F_4^2$.

Using (3.11) and (3.12),
$$E[Q'] = E[Q] - B \tilde{F}_2^2 = A F_4 + B (F_2^2 - \tilde{F}_2^2),$$
and since $\tilde{F}_2 = (1 \pm n^{-1/3}) F_2$ and $F_2^2 \le n F_4$,
$$|E[Q'] - A F_4| \le 3 B n^{-1/3} F_2^2 \le 3 B n^{2/3} F_4 \le c_0 F_4 / \Lambda$$
for a suitably chosen constant $c_0$; note that the quality of our $F_2$ approximation enables this step.

Now, $\mathrm{Var}[Q'] = \mathrm{Var}[Q] \le E[Q^2]$. Expanding $E[Q^2]$, it can be derived that
$$E[Q^2] \le c_1 \Big( \frac{F_8}{\Lambda} + \frac{F_6 F_2 + F_5 F_3 + F_4^2}{\Lambda^2} + \frac{F_4 F_2^2 + F_3^2 F_2}{\Lambda^3} + \frac{F_2^4}{\Lambda^4} \Big) \le c_2 \Big( \frac{1}{\Lambda} + \frac{\sqrt{n} + n^{1/4} + 1}{\Lambda^2} + \frac{2n}{\Lambda^3} + \frac{n^2}{\Lambda^4} \Big) F_4^2 \le 2 c_3\, n^{2/3}\, E^2[Q'], \quad (3.13)$$
for suitably chosen constants $c_1, c_2, c_3$. In the above derivation, we use (3.12), the eight-wise independence of $h$, (2.3) with $j = 1$, and (3.11). As before, averaging $\epsilon^{-2} n^{2/3}$ such basic estimators and using (3.11), (3.13), and Chebyshev's inequality, it is easy to obtain a $(1+\epsilon)$-approximation to $F_4$. The total space is $O(\epsilon^{-2} n^{2/3}(\log n + \log m)\log(1/\delta))$.

The case of general $F_k$ can be worked out to obtain the following (the exact dependence on $k$ was not optimized).

Theorem 3.2. There is an algorithm that, for every $\epsilon, \delta > 0$ and a given sequence $x = x_1, \ldots, x_m$ of elements where $x_i \in [n]$, $(\epsilon, \delta)$-approximates $F_k$ and uses $O(\epsilon^{-2} 2^k n^{1-1/(k-1)}(\log n + \log m)\log(1/\delta))$ space.
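Returning to the $F_4$ case for concreteness, a minimal sketch (ours) of one basic estimator $Q'$ follows; it reuses make_t_wise_hash from above and, purely for illustration, computes $F_2$ exactly where the paper runs an [AMS99]-style $(1 \pm n^{-1/3})$-approximation in parallel:

    def f4_basic_estimator(stream, n):
        lam = max(2, round(n ** (2 / 3)))               # Lambda = n^{2/3}
        A = (1 / lam) * (1 - 1 / lam) * (1 - 6 / lam + 6 / lam ** 2)
        B = (3 / lam ** 2) * (1 - 1 / lam) ** 2
        h = make_t_wise_hash(8, n, lam)                 # eight-wise independence
        Z, freq = 0.0, {}
        for a in stream:
            Z += (1 - 1 / lam) if h(a) == 0 else (-1 / lam)
            freq[a] = freq.get(a, 0) + 1                # stand-in for the F2 sketch
        F2_tilde = sum(f * f for f in freq.values())    # exact here; approximate in the paper
        return (Z ** 4 - B * F2_tilde ** 2) / A         # Q'/A, an estimate of F_4

Averaging $O(\epsilon^{-2} n^{2/3})$ such estimators and taking medians then proceeds exactly as in Section 2.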
4 Is this algorithm optimal?

As we saw earlier, the space lower bound for approximating $F_3$ is $\Omega(n^{1/3})$, whereas our algorithm uses space $\tilde{O}(n^{1/2})$. This leads to the obvious question of what is the exact space bound for approximating $F_3$ in the data stream model. In general, can $F_k$ be approximated in $\tilde{O}(n^{1-2/k})$ space, or is there an $\Omega(n^{1-1/(k-1)})$ space lower bound? Note that both these quantities "interpolate correctly" for $k = 2$, i.e., they are consistent with the $\tilde{O}(1)$ upper bound for $F_2$.

As we remarked earlier, the worst case for our $F_3$ algorithm is when the frequencies are of the form $\langle n^{1/3}, 1, \ldots, 1 \rangle$. In many aspects, this is reminiscent of the "hard distribution" for the communication complexity version of $F_3$ that was proposed by [AMS99]. Recall that in the communication version of $F_3$, there are $t = n^{1/3}$ players labeled $1, \ldots, t$, and player $i$ has a set $S_i$. The promise is that either $|S_1 \cap \cdots \cap S_t| = 1$ (Yes instance) or $|S_i \cap S_j| = 0$ for all $i \ne j$ (No instance). The goal of the players is to classify a given instance. A communication bound of $\Omega(n^{1/3})$ was shown for this problem in [BJKS02a]. However, as observed in [BJKS02b], a simple protocol that uses $n^{1/3}$ communication exists for this promise problem: player $i$ uniformly samples $n^{1/3}$ of the elements in $S_i$ and sends them to player $i+1$, who checks whether the sample it received has any intersection with its own set $S_{i+1}$; the proof is via a birthday argument. Back in the data stream world, this corresponds to an $\tilde{O}(n^{1/3})$-space algorithm for the promise problem. On the other hand, the variance of our basic estimator is tight for the Yes instance of the promise problem. Assuming that one of $\tilde{O}(n^{1/3})$ or $\tilde{\Theta}(\sqrt{n})$ is the correct answer for $F_3$, this suggests one of the following scenarios: there is a much harder distribution that can be used to show an $\Omega(\sqrt{n})$ lower bound; there is an estimator with a better variance, in particular for the Yes instance of the above promise problem; or there is an inherent loss in going from communication complexity lower bounds to data stream lower bounds.

4.1 An $\tilde{O}(n^{1/3})$-space algorithm for $F_3$?

We show that an $\tilde{O}(1)$-pass, $\tilde{O}(n^{1/3})$-space algorithm for $F_3$ would follow if "black-boxes" with certain properties exist.

First, we interpret the algorithm of [AMS99] for $F_3$ in a sampling setting. Their $F_3$ algorithm can be thought of as a one-pass version of the following two-pass algorithm. In the first pass, the algorithm selects an element $a$ from the stream $x$ uniformly at random. (Note that $a$ is selected proportional to its frequency $f_a(x)$ in $x$.) In the second pass, the algorithm counts the number of times $a$ occurs in $x$. The quantity $Q = m \cdot f_a(x)^2$ is the basic estimator of the algorithm. As before, it is easy to see that $E[Q] = F_3$. Moreover, $\mathrm{Var}[Q] \le E[Q^2] = m F_5(x) \le n^{2/3} F_3^2(x)$. Since each basic estimator can be computed in $O(\log n)$ space and we need $\epsilon^{-2} n^{2/3}$ of them to drive down the variance, this yields an $\tilde{O}(n^{2/3})$-space algorithm for $F_3$. The $F_3$ algorithm in [AMS99] is essentially this algorithm, except that the sampling and counting are implemented in a single pass. (It is easy to see that the variance of the above algorithm is also tight for the case when the frequencies are $\langle n^{1/3}, 1, \ldots, 1 \rangle$.)

Notice that the sampling step can be abstracted as a (trivial) black-box such that each invocation of the black-box outputs an element $a$ with probability proportional to $f_a(x)$. In other words, the black-box implements a random variable $Y$ such that $\Pr[Y = a] \propto f_a(x)$. On the other hand, suppose we had a slightly more "powerful" black-box. Then:

Proposition 4.1. If there is a black-box such that: (i) it uses $\tilde{O}(1)$ space, (ii) it makes $\tilde{O}(1)$ passes over the input, and (iii) each invocation outputs a random variable $Y$ such that $\Pr[Y = a] \propto f_a^2(x)$, then there is a data stream algorithm for approximating $F_3$ that uses $\tilde{O}(n^{1/3})$ space and makes $\tilde{O}(1)$ passes over the input.

Proof. Using the black-box, in one pass, we first sample a random element $a$ proportional to $f_a^2(x)$. In the next pass, we count the number of times $a$ occurs in $x$. The basic estimator is $Q = f_a(x)$. It is easy to see that
$$E[Q] = \sum_a \frac{f_a^2}{F_2} \cdot f_a = \frac{F_3}{F_2}.$$
Also,
$$E[Q^2] = \sum_a \frac{f_a^2}{F_2} \cdot f_a^2 = \frac{F_4}{F_2} \le n^{1/3} \cdot \frac{F_3^2}{F_2^2} = n^{1/3} \cdot E^2[Q],$$
where we used Proposition 2.1(ii). Thus, averaging $O(\epsilon^{-2} n^{1/3})$ such estimators and multiplying by an estimate of $F_2$ (which can be computed in parallel via Fact 3.1), we can approximate $F_3$ using $\tilde{O}(n^{1/3})$ space using the black-box.
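For concreteness, the two-pass sampling interpretation described at the start of this subsection can be sketched as follows (ours; a real one-pass implementation would replace the uniform position choice with reservoir sampling, as [AMS99] effectively does):

    import random

    def ams_two_pass_F3_estimator(stream):
        # Pass 1: a uniformly random position, so element a is drawn with
        # probability f_a / m. Pass 2: count a's occurrences.
        m = len(stream)
        a = stream[random.randrange(m)]
        f_a = sum(1 for x in stream if x == a)
        return m * f_a ** 2            # Q = m * f_a^2, with E[Q] = F_3

    # Averaging Theta(eps^{-2} n^{2/3}) independent copies drives the relative
    # error below eps, matching the O~(n^{2/3}) bound of [AMS99].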
5 Data stream algorithm for $L_p$ differences

Recall that the $L_p$ difference problem is to approximate $\sum_{i=1}^m |a_i - b_i|^p$, where the $a_i$'s and $b_i$'s arrive in the stream in an arbitrary order. Our algorithm can be modified to approximate $L_p$ differences for $p > 2$ using space $\tilde{O}(n^{1-1/(p-1)})$. The ideas are similar to those stated in [FKSV02] and the details are omitted. Note that the lower bound for this problem is essentially $\Omega(n^{1-2/p-o(1)})$ [SS02] (and [BJKS02a] for multi-pass algorithms).
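The paper omits the details, but the linear part of the modification is easy to sketch (our extrapolation, to be read with the caveats in the comments): each arrival updates a signed counter, with contributions from the $b$ side negated. For $p = 3$, the cube of the counter estimates $\sum_i (a_i - b_i)^3$, which equals the $L_3$ difference only when $a_i \ge b_i$ coordinatewise; handling the absolute values in general is exactly where the [FKSV02]-style ideas come in.

    import math

    def l3_difference_basic_estimator(updates, n):
        # `updates` is a stream of (i, side) pairs, side in {'a', 'b'}, meaning
        # a_i (resp. b_i) is incremented. Maintains Z = sum_i (a_i - b_i) eps_h(i);
        # as in (2.4), E[Z^3] = A * sum_i (a_i - b_i)^3, since the expectation
        # computation never uses nonnegativity of the frequencies.
        lam = max(2, math.isqrt(n))
        A = (1 / lam) * (1 - 1 / lam) * (1 - 2 / lam)
        h = make_t_wise_hash(6, n, lam)
        Z = 0.0
        for i, side in updates:
            eps = (1 - 1 / lam) if h(i) == 0 else (-1 / lam)
            Z += eps if side == 'a' else -eps
        return Z ** 3 / A    # average many copies and take medians, as before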
Acknowledgments

The second author is thankful to Ziv Bar-Yossef, T. S. Jayram, and D. Sivakumar for many discussions preceding and during this work.

References

[AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.

[BBD+02] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), pages 1–16, 2002.

[BJK+02] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques, 6th International Workshop (RANDOM), pages 1–10, 2002.

[BJKS02a] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. In Proc. of the 43rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 209–218, 2002.

[BJKS02b] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. Information theory methods in communication complexity. In Proc. of the 17th Annual IEEE Conference on Computational Complexity (CCC), pages 93–102, 2002.

[BKS01] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications. In Proc. of the 33rd Annual ACM Symposium on Theory of Computing (STOC), pages 266–275, 2001.

[BKS02] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 623–632, 2002.

[CKS03] A. Chakrabarti, S. Khot, and X. Sun. Near-optimal lower bounds on the multiparty communication complexity of set-disjointness. In Proc. of the 18th Annual IEEE Conference on Computational Complexity (CCC), 2003.

[CW79] L. Carter and M. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154, 1979.

[FKSV02] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams. SIAM Journal on Computing, 32:131–151, 2002.

[FM85] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31:182–209, 1985.

[GT01] P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. of the 13th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 281–291, 2001.

[HRR99] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. In DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 50, pages 107–118, 1999.

[Ind00] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computations. In Proc. of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 189–197, 2000.

[IW03] P. Indyk and D. Woodruff. Tight lower bounds for the distinct elements problem. In Proc. of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2003.

[Mut03] S. Muthukrishnan. Data streams: algorithms and applications. In Proc. of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), page 413, 2003.

[SS02] M. Saks and X. Sun. Space lower bounds for distance approximation in the data stream model. In Proc. of the 34th Annual ACM Symposium on Theory of Computing (STOC), pages 360–369, 2002.

[WC81] M. Wegman and L. Carter. New hash functions and their use in authentication and set equality. Journal of Computer and System Sciences, 22(3):265–279, 1981.