Optimal Space Lower Bounds for All Frequency Moments

David Woodruff
MIT
dpwood@mit.edu
(Supported by a DoD NDSEG fellowship.)

Abstract

We prove that any one-pass streaming algorithm which (ε, δ)-approximates the kth frequency moment F_k, for any real k ≠ 1 and any ε = Ω(m^{-1/2}), must use Ω(1/ε²) bits of space, where m is the size of the universe. This is optimal in terms of ε, resolves the open questions of Bar-Yossef et al in [3, 4], and extends the Ω(1/ε²) lower bound for F_0 in [11] to much smaller ε by applying novel techniques. Along the way we lower bound the one-way communication complexity of approximating the Hamming distance, and we lower bound the number of bipartite graphs with minimum/maximum degree constraints.

1 Introduction

Computing statistics on massive data sets is increasingly important these days. Advances in communication and storage technology enable large bodies of raw data to be generated daily, and consequently, there is a rising demand to process this data efficiently. Since it is impractical for an algorithm to store even a small fraction of the data stream, its performance is typically measured by the amount of space it uses. In many scenarios, such as internet routing, once a stream element is examined it is lost forever unless explicitly saved by the processing algorithm. This, along with the sheer size of the data, makes multiple passes over the data infeasible. In this paper we restrict our attention to one-pass streaming algorithms and investigate their space complexity.

Let a = a_1, ..., a_q be a stream of q elements drawn from a universe of size m, which we denote by [m] = {1, ..., m}, and let f_i denote the number of occurrences of the ith universe element in a. For any real k, the kth frequency moment F_k is defined by

  F_k = Σ_{i=1}^{m} f_i^k.

Interpreting 0^0 = 0, we see that F_0 is the number of distinct elements in a, F_1 is the stream size q, and F_2 is the repeat rate, also known as Gini's index of homogeneity [10]. Efficient algorithms for computing F_0 are important to the database community, since query optimizers can use them to find the number of unique values of an attribute without performing an expensive sort on the values. Efficient algorithms for F_2 are useful for determining the output size of self-joins in databases and for computing the surprise index of a data sequence [10]. Higher frequency moments are used to determine data skewness, which is important in parallel database applications [8].
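To make the definition concrete, here is a minimal Python sketch (our illustration; the stream and function names are invented) that computes F_k exactly by storing every frequency. This is the trivial O(m log q)-space baseline against which the lower bounds in this paper are measured.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Compute F_k = sum_i f_i^k exactly, interpreting 0^0 = 0
    (universe elements that never occur contribute nothing)."""
    freqs = Counter(stream)              # f_i for every element that occurs
    if k == 0:
        return len(freqs)                # F_0 = number of distinct elements
    return sum(f ** k for f in freqs.values())

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(frequency_moment(stream, 0))       # 7 distinct elements
print(frequency_moment(stream, 1))       # 11 = stream size q
print(frequency_moment(stream, 2))       # 21, the repeat rate
```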
An algorithm A (ε, δ)-approximates F_k if A outputs a number F̃_k such that Pr[|F̃_k − F_k| > εF_k] < δ. (Throughout this paper we take the error probability δ to be a constant, i.e., a value independent of m.) Since there is an Ω(m) space lower bound [1] for any deterministic algorithm computing F_k exactly, or even approximating it within a multiplicative factor of (1 ± ε), considerable effort has been invested into randomized approximation algorithms for the problem. In [1, 3, 7, 9] various algorithms are given to (ε, δ)-approximate F_0, with the best known algorithm (in terms of space complexity) given in [3], achieving space O(ε^{-2} log log m + log m log(1/ε)). Alon et al [1] present the best algorithm for (ε, δ)-approximating F_2, which achieves space O(ε^{-2}(log m + log q)), and the best algorithm for (ε, δ)-approximating F_k, which achieves space O(ε^{-2} m^{1−1/k}(log m + log q)) for any integer constant k ≥ 1.

This paper is concerned with space lower bounds for the problem: we show that for any ε = Ω(m^{-1/2}), any one-pass streaming algorithm which (ε, δ)-approximates F_k, for any real k ≠ 1, must use Ω(1/ε²) bits of space. (F_1 is excluded because it can be computed trivially and exactly in O(log q) space.) Prior to our work, the only known space lower bounds in terms of the approximation error ε were for F_0. For F_0, an Ω(log m) space lower bound was established in [1], an Ω(1/ε) lower bound in [4], and an Ω(1/ε²) lower bound for ε = Ω(m^{-1/(9+c)}), for any c > 0, in [11]. Note that one cannot hope for the Ω(1/ε²) lower bound to hold for ε = o(m^{-1/2}), since there is an O(m)-space algorithm computing F_0 exactly and an O(m log q)-space algorithm computing F_k exactly for any k ∉ {0, 1}.

As in previous papers [1, 4, 5, 6, 11], to show space lower bounds we lower bound the one-way communication complexity of a boolean function f and reduce the computation of f to that of F_k. More precisely, there are two parties, Alice and Bob, holding inputs x and y respectively, who wish to compute f(x, y) with error probability at most δ. Suppose that Alice and Bob can associate x, y with data streams a_x, a_y. Let A be an algorithm which (ε, δ)-approximates F_k. Then Alice can compute A(a_x) and transmit the state S of A to Bob. Bob can feed S into his copy of A and continue the computation on a_y to obtain F̃_k(a_x ∘ a_y), where a_x ∘ a_y denotes the concatenation of the two streams. If F̃_k(a_x ∘ a_y) can determine f(x, y) with probability at least 1 − δ, then the space used by A must be at least the one-way communication complexity of f. The cleverness is in choosing f and bounding its one-way complexity.
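The reduction itself is mechanical enough to phrase as code. The sketch below assumes a hypothetical streaming-algorithm interface (process / state / output; these names are ours, not from the paper): the transmitted message is exactly the algorithm's memory state, so an algorithm using s bits of space yields a one-way protocol communicating s bits.

```python
import pickle

class ExactF0:
    """Toy streaming algorithm whose state is the set of elements seen.
    (Hypothetical interface used only for this illustration.)"""
    def __init__(self):
        self.state = set()
    def process(self, item):
        self.state.add(item)
    def output(self):
        return len(self.state)

def one_way_protocol(alg_factory, stream_x, stream_y, decide):
    """Generic reduction: simulate a one-pass streaming algorithm as a
    one-way protocol whose single message is the algorithm's state."""
    alice = alg_factory()
    for item in stream_x:                 # Alice runs A on her stream a_x
        alice.process(item)
    message = pickle.dumps(alice.state)   # |message| <= space used by A

    bob = alg_factory()
    bob.state = pickle.loads(message)     # Bob resumes from Alice's state
    for item in stream_y:                 # ... and continues on a_y
        bob.process(item)
    return decide(bob.output())           # Bob decides f(x, y) from the output

# Example: "do the concatenated streams contain more than 3 distinct elements?"
print(one_way_protocol(ExactF0, [1, 2], [2, 3, 4], lambda f0: f0 > 3))  # True
```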
Let Δ(·, ·) denote Hamming distance and set t = 1/ε². We consider the following function f suggested in [11]. Alice and Bob are given x, y ∈ {0,1}^t with the promise that either Δ(x, y) ≤ t/2 − εt, in which case f(x, y) = 0, or Δ(x, y) > t/2, in which case f(x, y) = 1. The authors of [11] were not able to lower bound the one-way complexity of f directly, and instead considered a related function g with rational inputs x, y ∈ [0,1]^t. They used a low-distortion embedding to reduce a bound on g's complexity to a bound on F_0's space complexity. This indirect approach led to an additional assumption on ε, namely, that their bound held only for ε = Ω(m^{-1/(9+c)}) for any c > 0. We instead lower bound the one-way complexity of f directly using novel techniques, and hence our Ω(1/ε²) bound holds for all ε = Ω(m^{-1/2}) and all k ≠ 1, which is optimal. To lower bound f's one-way complexity, we use shatter coefficients [6], which generalize the VC-dimension [12, 14]. The tricky part is proving our main theorem, which essentially computes the largest shatter coefficient of f. We use the probabilistic method in an elaborate way, together with a correlation inequality due to Kleitman [2].

Our main theorem establishes some additional results. Consider the problem in which Alice and Bob have inputs x, y respectively and wish to (ε, δ)-approximate Δ(x, y). Such a protocol necessarily computes f(x, y) with error probability at most δ. Hence, we obtain the first (in terms of ε) lower bound on the one-way communication complexity of (ε, δ)-approximating the Hamming distance.

Finally, in the proof of our main theorem it is shown that the number of m by n binary matrices M with majority one in each column and majority one in each row is at least 2^{mn − zm − n} for a constant z < 1. Using the natural association between bipartite graphs on n by m vertices and binary m by n matrices, we obtain a nontrivial lower bound on the number of bipartite graphs on n by m vertices where each left vertex has degree at most (resp. at least) m/2 and each right vertex has degree at most (resp. at least) n/2. Our presentation is much simpler than that in [13], although our result is only a lower bound. (The presentation in [13] was a characterization for general graphs.) As far as we are aware, this is the first nontrivial lower bound for this class of bipartite graphs.

2 Preliminaries

We adopt some of the definitions and notation given in [4, 11]. For x, y ∈ {0,1}^n, let x ⊕ y denote vector addition over GF(2), x̄ complementation, Δ(x, y) Hamming distance, and Z the integers. The characteristic vector of a stream a is the length-m bit vector whose ith bit is 1 iff f_i > 0.

2.1 One-Way Communication Complexity

Let f : X × Y → {0,1} be a boolean function. In this paper we consider two parties, Alice and Bob, receiving x and y respectively, who wish to compute f(x, y). In our protocols Alice computes some function A(x) of x and sends the result to Bob. Bob then attempts to compute f(x, y) from A(x) and y. Note that only one message is sent, and it must be from Alice to Bob.

Definition 2.1. For each randomized protocol Π as described above for computing f, the communication cost of Π is the expected length of the longest message sent from Alice to Bob over all inputs. The δ-error randomized communication complexity of f, R_δ(f), is the communication cost of the optimal protocol computing f with error probability δ (that is, with Pr[Π(x, y) ≠ f(x, y)] ≤ δ).

For deterministic protocols with input distribution µ, define D_{µ,δ}(f), the δ-error µ-distributional communication complexity of f, to be the communication cost of an optimal such protocol. Using the Yao Minimax Principle, R_δ(f) is bounded from below by D_{µ,δ}(f) for any µ [15].

2.2 VC Dimension and Shatter Coefficients

Let F = {f : X → {0,1}} be a family of Boolean functions on a domain X. Each f ∈ F can be viewed as an |X|-bit string f_1 ... f_{|X|}.

Definition 2.2. For a subset S ⊆ X, the shatter coefficient SC(f_S) of S is given by |{f|_S}_{f∈F}|, the number of distinct bit strings obtained by restricting F to S. The l-th shatter coefficient SC(F, l) of F is the largest number of different bit patterns one can obtain by considering all possible f|_S, where S ranges over all subsets of size l. If the shatter coefficient of S is 2^{|S|}, then S is shattered by F. The VC dimension of F, VCD(F), is the size of the largest subset S ⊆ X shattered by F.

The following theorem [6] lower bounds the one-way complexity of f in terms of information theory.

Theorem 2.1. For every function f : X × Y → {0,1}, every l ≤ VCD(f_X), and every δ > 0, there exists a distribution µ on X × Y such that

  D_{µ,δ}(f) ≥ log(SC(f_X, l)) − l · H₂(δ).

2.3 Properties of the Binomial Distribution

We need some properties of the binomial distribution in the proof of our main theorem. The following lemmas follow easily from Stirling's formula. Let n be odd and let X be the sum of n independent unbiased Bernoulli random variables X_1, ..., X_n.

Lemma 2.1. For any constant c > 0, and for sufficiently large n,

  Pr[X > n/2 + c√n] > 1/2 − c.

Lemma 2.2.

  Pr[X_i = 1 | X > n/2] = 1/2 + (1 + o(1))/√(2πn).

2.4 A Theorem of Kleitman

We also need the following theorem due to Kleitman [2]. We say a set family A over a finite set N is monotone increasing if whenever S ∈ A and S ⊆ T ⊆ N, then T ∈ A. If A and B are monotone increasing, then their intersection {S | S ∈ A and S ∈ B} is monotone increasing.

Theorem 2.2. (Kleitman) Let N be a set of size n. Consider the symmetric probability space whose elements are the members of the power set of N, that is, for any A ⊆ N, Pr[A] = 2^{−n}. Let A and B be two monotone increasing families of subsets of N. Then

  Pr[A ∧ B] ≥ Pr[A] · Pr[B].
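Theorem 2.2 is a correlation inequality of Harris-Kleitman type: monotone increasing families are positively correlated under the uniform measure. A small exhaustive check on a four-element ground set (our illustration, not part of the proof):

```python
from itertools import combinations

n = 4
universe = frozenset(range(n))
subsets = [frozenset(c) for r in range(n + 1) for c in combinations(universe, r)]

def upward_closure(generators):
    """The monotone increasing family generated by the given sets."""
    return {s for s in subsets if any(g <= s for g in generators)}

A = upward_closure([frozenset({0, 1}), frozenset({2})])
B = upward_closure([frozenset({1, 3})])

total = len(subsets)                       # uniform measure: Pr[S] = 2^-n
pr_A, pr_B, pr_AB = len(A) / total, len(B) / total, len(A & B) / total
print(pr_AB, ">=", pr_A * pr_B, ":", pr_AB >= pr_A * pr_B)   # True
```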
3 Applications of the Main Theorem

The main theorem intuitively says that there is a set S ⊆ {0,1}^n of n elements such that for many subsets T of S, one can find a word y_T ∈ {0,1}^n that separates T from its complement S − T. By y_T separating T from S − T, we mean that y_T is closer to every element of T than to any element of S − T, where closeness is measured in Hamming distance. For one of our applications we also need to ensure that y_T is not too close to any element of T. We give the formal theorem statement now and defer its proof to section 4.

Theorem 3.1. (Main) There exist constants c, c' > 0 such that for sufficiently large n there is a set S ⊆ {0,1}^n of size n such that for 2^{Ω(n)} subsets T of S, there exists a y = y_T ∈ {0,1}^n such that for all t ∈ T, cn ≤ Δ(y, t) ≤ n/2 − c'√n, and for all t ∈ S − T, Δ(y, t) > n/2.

We say that a set T ⊆ S is good if there is a y_T ∈ {0,1}^n which separates T from its complement. More precisely, T is good if for all t ∈ T, cn ≤ Δ(y_T, t) ≤ n/2 − c'√n, and for all t ∈ S − T, Δ(y_T, t) > n/2.

3.1 One-Way Communication Complexity of Approximating the Hamming Distance

Let ε = Ω(m^{-1/2}) and t = 1/ε², where we assume t is a power of 2 without loss of generality (WLOG); note that εt = √t. Let S be as in the main theorem, applied with n = t, and define Y = {y_T | T ⊆ S and T is good}, using the notation above. We assume ε is small enough so that t is sufficiently large to apply the main theorem with n = t; setting ε to be less than a small constant suffices. Define the promise problem

  L = {(y, s) ∈ Y × S s.t. Δ(y, s) ≤ t/2 − εt or Δ(y, s) > t/2}.

Define f : Y × S → {0,1} by f(y, s) = 1 if Δ(y, s) > t/2 and f(y, s) = 0 if Δ(y, s) ≤ t/2 − εt, and define the function family F = {f_y | y ∈ Y}, where f_y : S → {0,1} is given by f_y(s) = f(y, s).

Consider the (ε, δ)-Hamming Distance Approximation Problem ((ε, δ)-HDAP): Alice and Bob have x, y ∈ {0,1}^m respectively, and wish to output Δ̃(x, y) with Pr[|Δ̃(x, y) − Δ(x, y)| > εΔ(x, y)] < δ. The claim is that, provided t ≤ m, the randomized one-way communication complexity R_δ(f) of deciding L is a lower bound on the one-way communication complexity of the (ε, δ)-HDAP. Indeed, a special case of the (ε, δ)-HDAP is when Alice is given a random element x of Y, padded with m − t zeros, and Bob a random element y of S, padded with m − t zeros. Then with probability at least 1 − δ, if Δ(x, y) ≤ t/2 − εt, then Δ̃(x, y) ≤ (1 + ε)(t/2 − εt) ≤ t/2 − εt/2 − ε²t, and if Δ(x, y) > t/2, then Δ̃(x, y) ≥ (1 − ε)t/2 = t/2 − εt/2. Hence, comparing the output Δ̃(x, y) against the threshold t/2 − εt/2 decides L with probability at least 1 − δ, as the sketch below illustrates.
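Concretely, Bob's decision rule is a single comparison against the midpoint t/2 − εt/2 (a minimal sketch under the promise above; all names are ours):

```python
def decide_L(delta_estimate, t, eps):
    """Decide the promise problem L from a (1 +/- eps)-estimate of Delta(x, y).
    If Delta <= t/2 - eps*t, the estimate is < t/2 - eps*t/2;
    if Delta >  t/2,         the estimate is >= t/2 - eps*t/2."""
    return 1 if delta_estimate >= t / 2 - eps * t / 2 else 0

t, eps = 10000, 0.01
print(decide_L((1 + eps) * (t / 2 - eps * t), t, eps))  # 0: the "close" case
print(decide_L((1 - eps) * (t / 2 + 1), t, eps))        # 1: the "far" case
```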
We now show R_δ(f) = Ω(t), and hence that the one-way complexity of the (ε, δ)-HDAP is Ω(1/ε²).

Theorem 3.2. The t-th shatter coefficient of F is 2^{Ω(t)}.

Proof. The claim is that there are 2^{Ω(t)} distinct bitstrings in the truth table of F. Indeed, for every y ∈ Y there exists a good subset T ⊆ S such that y = y_T. For s ∈ T, f(y, s) = 0, and for s ∈ S − T, f(y, s) = 1. Viewing f_y as a bitstring (see section 2), it follows that f_y ≠ f_{y'} for y ≠ y', since if T' ⊆ S is such that y' = y_{T'}, then T and T' differ in at least one element. Hence there are |Y| = 2^{Ω(t)} distinct bitstrings, so the shatter coefficient is 2^{Ω(t)}.

Corollary 3.1. The randomized one-way communication complexity R_δ(f) is Ω(t) = Ω(1/ε²).

Proof. Follows immediately from theorem 2.1 and the Yao minimax principle.

3.2 Space Complexity of Approximating the Frequency Moments

From the previous section, we know that for ε = Ω(m^{-1/2}) the one-way communication complexity of deciding L with error probability at most δ is Ω(1/ε²). We now give a protocol which decides L with probability at least 1 − δ, with communication cost equal to the space of any (ε', δ) F_k-approximation algorithm for any k ≠ 1. It follows that for any k ≠ 1 and any ε = Ω(m^{-1/2}), any (ε, δ) F_k-approximation algorithm must use Ω(1/ε²) space. In particular, for all smaller ε, any such algorithm must use Ω(m) space. For k = 0 this is optimal, since one can keep a length-m bit vector to compute F_0 exactly. For k ∉ {0, 1} this is optimal up to a factor of log q, since one can keep a length-m vector with ith entry set to f_i.

Let t = 1/ε² as before. Alice and Bob are given random y ∈ Y and s ∈ S, respectively, and wish to determine f(y, s). The protocol is as follows. Alice chooses a stream a_y with characteristic vector y ∘ 0^{m−t}. Let M be an (ε', δ) F_k-approximation algorithm for some constant k ≠ 1. Alice runs M on a_y. When M terminates, she transmits the state S of M to Bob, along with wt(y). Bob chooses a stream a_s with characteristic vector s ∘ 0^{m−t} and feeds both S and a_s into his copy of M. Let F̃_k be the output of M. The claim is that F̃_k, together with wt(y) and wt(s), can be used to determine f(y, s) (and hence decide L) with probability at least 1 − δ.

We first decompose F_k. Universe elements occurring in both y and s have frequency 2, and elements occurring in exactly one of them have frequency 1, so

  F_k(a_y ∘ a_s) = Σ_{i∈[m]} f_i^k = 2^k wt(y ∧ s) + 1^k Δ(y, s)
    = 2^{k−1}(wt(y) + wt(s) − Δ(y, s)) + Δ(y, s)
    = 2^{k−1}(wt(y) + wt(s)) + (1 − 2^{k−1})Δ(y, s),

and hence, for k ≠ 1,

(3.1)  Δ(y, s) = [2^{k−1}(wt(y) + wt(s)) − F_k(a_y ∘ a_s)] / (2^{k−1} − 1).

We want a (1 ± ε') approximation to F_k to result in a (1 ± ε) approximation to Δ(y, s) for some ε' = Θ(ε). Specifically, if k < 1 we want

  (1 − ε)Δ(y, s) ≤ [2^{k−1}(wt(y) + wt(s)) − (1 − ε')F_k(a_y ∘ a_s)] / (2^{k−1} − 1)

and

  [2^{k−1}(wt(y) + wt(s)) − (1 + ε')F_k(a_y ∘ a_s)] / (2^{k−1} − 1) ≤ (1 + ε)Δ(y, s),

whereas for k > 1 we want

  (1 − ε)Δ(y, s) ≤ [2^{k−1}(wt(y) + wt(s)) − (1 + ε')F_k(a_y ∘ a_s)] / (2^{k−1} − 1)

and

  [2^{k−1}(wt(y) + wt(s)) − (1 − ε')F_k(a_y ∘ a_s)] / (2^{k−1} − 1) ≤ (1 + ε)Δ(y, s).

After some algebraic manipulation, we see that these properties hold for k < 1 iff

  ε' ≤ ε (1 − 2^{k−1}) Δ(y, s) / F_k(a_y ∘ a_s),

and for k > 1 iff

  ε' ≤ ε (2^{k−1} − 1) Δ(y, s) / F_k(a_y ∘ a_s).

Now, Δ(y, s) ≤ wt(y) + wt(s) and Δ(y, s) ≤ F_k(a_y ∘ a_s). Also, wt(y) + wt(s) = O(t) and F_k(a_y ∘ a_s) = O(t). Hence, for any k ≠ 1 we will have ε' = Θ(ε) if there exists a positive constant p so that for all pairs of inputs y, s, Δ(y, s) > pt. (A sketch of Bob's computation appears below.)
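Bob's computation is exactly equation (3.1) applied to the approximate moment. A self-contained sketch, with exact quantities standing in for the streaming estimate (all names are ours):

```python
def recover_distance(Fk, wt_y, wt_s, k):
    """Equation (3.1): Delta(y, s) from F_k(a_y . a_s), wt(y), wt(s); k != 1."""
    a = 2.0 ** (k - 1)
    return (a * (wt_y + wt_s) - Fk) / (a - 1.0)

# Exact check on small vectors (illustration only).
y = [1, 1, 0, 1, 0, 1]
s = [1, 0, 1, 1, 0, 0]
k = 3
freqs = [yi + si for yi, si in zip(y, s)]    # 2 on overlaps, 1 on disagreements
Fk = sum(f ** k for f in freqs if f > 0)
delta = sum(yi != si for yi, si in zip(y, s))
print(recover_distance(Fk, sum(y), sum(s), k), "==", delta)   # 3.0 == 3
```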
Setting n = t in the main theorem, we see that this condition is satisfied for p = c. We conclude that Alice and Bob can choose ε' = Θ(ε) such that Bob can use his knowledge of wt(y), wt(s), and an (ε', δ) approximation to F_k (i.e., F̃_k) to compute [2^{k−1}/(2^{k−1} − 1)](wt(y) + wt(s)) − F̃_k(a_y ∘ a_s)/(2^{k−1} − 1), which is a (1 ± ε)-approximation to Δ(y, s). Hence, as in the analysis of the (ε, δ)-HDAP, Bob can decide L with probability at least 1 − δ. One may worry that the log t = O(log m) bits used to transmit wt(y) will dominate the space of the F_k-approximation algorithm for large ε. Fortunately, there is also an Ω(log m) space lower bound [1] for approximating F_k for any k ≠ 1 (in [1] the authors only explicitly state the Ω(log m) lower bound for k ∈ {0, 2}, but their argument in propositions 3.7 and 4.1 is easily seen to hold for any fixed k ≠ 1, even non-integral, for ε sufficiently small but constant), so if indeed log m = Ω(1/ε²), the Ω(1/ε²) lower bound is absorbed in the Ω(log m) lower bound. From the reduction we see that the F_k-approximation algorithm must use Ω(1/ε²) bits of space.

3.3 Lower Bound for Bipartite Graphs with Given Maximum/Minimum Degree

There is a bijective correspondence between m by n binary matrices M and bipartite graphs G on m + n vertices, where M_ij = 1 iff there is an edge from the ith left vertex to the jth right vertex in G. From corollary 4.1 (see the end of section 4) we see that the number of bipartite graphs on m + n vertices where each left vertex has degree at least n/2 and each right vertex has degree at least m/2 is at least 2^{mn − zm − n} for a constant z < 1. Interchanging the roles of 1s and 0s, it follows that the number of bipartite graphs with each left vertex having degree at most n/2 and each right vertex having degree at most m/2 is also at least 2^{mn − zm − n}.

Note that a trivial lower bound on the number of such graphs can be obtained from theorem 2.2. Indeed, if C is the event that each column of M is majority 1 and R the event that each row is majority 1, then C and R represent monotone increasing families of subsets of [mn], so by theorem 2.2, Pr[R ∧ C] ≥ Pr[R] · Pr[C] ≥ 2^{−m} · 2^{−n} = 2^{−m−n}, and hence the number of such M is at least 2^{mn} · 2^{−m−n} = 2^{mn−m−n}. Since z < 1 in our bound, our bound is strictly stronger.

4 Proof of the Main Theorem

We use the probabilistic method to prove our main theorem, repeated here for convenience.

Theorem 4.1. There exist constants c, c' > 0 such that for sufficiently large n there is a set S ⊆ {0,1}^n of size n such that for 2^{Ω(n)} subsets T of S, there exists a y = y_T ∈ {0,1}^n such that for all t ∈ T, cn ≤ Δ(y, t) ≤ n/2 − c'√n, and for all t ∈ S − T, Δ(y, t) > n/2.

Proof. Let c, c' > 0 be constants to be determined. We assume n ≡ 1 mod 4 in what follows, so that n and ⌈n/2⌉ are odd. Choose n elements r_1, ..., r_n uniformly at random from {0,1}^n with replacement, and put S = {r_1, ..., r_n}. Note that S may be a multiset; we correct this later. Set m = n/2 and let T be an arbitrary subset of S of size m. We omit ceilings when not essential.

For notational convenience, put T = {r_1, ..., r_m}. Let y = y_T be the majority codeword of T, that is, y_j = majority(r_{1j}, ..., r_{mj}) for all 1 ≤ j ≤ n. The map f(x) = x ⊕ ȳ preserves Hamming distances and sends y to 1^n, so WLOG assume y = 1^n.

We say that T is good if for all t ∈ T, cn ≤ Δ(y, t) ≤ n/2 − c'√n, and for all t ∈ S − T, Δ(y, t) > n/2. We show that the probability that T is good is greater than 2^{−zn} for a constant z < 1. It follows that the expected number of good subsets of S of size m is (n choose m) · 2^{−zn} = 2^{H₂(1/2)n + o(n) − zn} = 2^{Ω(n)}. Hence there exists an S with 2^{Ω(n)} good subsets.
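The construction is easy to simulate: sample S, take the majority codeword of half of it, and inspect the distance profile. A small experiment (an illustration only; at this scale, the event that every element of T simultaneously lands below n/2 is itself rare, consistent with the 2^{−zn} bound):

```python
import random

def majority_codeword(rows):
    m = len(rows)
    return [1 if sum(col) > m / 2 else 0 for col in zip(*rows)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

random.seed(0)
n = 101                        # n = 1 mod 4, so |T| = (n+1)/2 is odd: no ties
S = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
T, rest = S[:(n + 1) // 2], S[(n + 1) // 2:]
y = majority_codeword(T)

dist_T = [hamming(y, t) for t in T]
dist_rest = [hamming(y, t) for t in rest]
# A majority coordinate agrees with a row of T with prob 1/2 + Theta(1/sqrt(n)),
# pulling distances to T below n/2; distances to S - T behave like Bin(n, 1/2).
print(sum(d < n / 2 for d in dist_T), "of", len(dist_T), "rows of T below n/2")
print(sum(d > n / 2 for d in dist_rest), "of", len(dist_rest), "rows of S-T beyond n/2")
```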
It remains to lower bound the probability that T is good. The probability that T is good is just the product

(4.1)  Pr[∀t ∈ S − T, Δ(y, t) > n/2] · Pr[∀t ∈ T, cn ≤ Δ(y, t) ≤ n/2 − c'√n],

since these events are independent. Since y is independent of S − T, and Δ(1^n, t) > n/2 holds for a uniformly random t with probability exactly 1/2 (n is odd),

(4.2)  Pr[∀t ∈ S − T, Δ(y, t) > n/2] = 2^{m−n}.

We now find Pr[∀t ∈ T, Δ(y, t) ≤ n/2 − c'√n]; we force Δ(y, t) ≥ cn later. Let M be the binary m × n matrix whose ith row is r_i. Let m = m_1 + m_2 for positive integers m_1, m_2 to be determined. Let R_1 be the event that M has at least n/2 + c'√n ones in each of its first m_1 rows, R_2 the event that M has at least n/2 + c'√n ones in each of its remaining m_2 rows, and C the event that M has at least m/2 ones in each column (that is, the event that the majority codeword of T is 1^n). Since Δ(1^n, r_i) = n − wt(r_i),

  Pr[∀t ∈ T, Δ(y, t) ≤ n/2 − c'√n] = Pr[R_1 ∧ R_2 | C].

M can be viewed as the characteristic vector of a subset of [mn] = {0, ..., mn − 1}. Under this correspondence, each of R_1, R_2, and C represents a monotone increasing family of subsets of [mn]. Applying theorem 2.2,

  Pr[R_1 ∧ R_2 ∧ C] ≥ Pr[R_1 ∧ C] · Pr[R_2] = Pr[R_1 | C] · Pr[C] · Pr[R_2],

and hence

(4.3)  Pr[∀t ∈ T, Δ(y, t) ≤ n/2 − c'√n] ≥ Pr[R_1 | C] · Pr[R_2].

Computing Pr[R_2] is easy, since M's entries are independent in this case: there are m_2 independent rows, and each row sum is a sum of n independent unbiased Bernoulli variables. By lemma 2.1,

(4.4)  Pr[R_2] > (1/2 − c')^{m_2}.

To compute Pr[R_1 | C], let Y be the number of ones in M. We compute

  Pr[R_1 | C] = Σ_s Pr[R_1 | Y = s, C] · Pr[Y = s | C].

The following insight simplifies this calculation.

Lemma 4.1.  Pr[Y = mn/2 + n√(m/(2π))(1 + o(1)) | C] = 1 − o(1).

Proof. Let Y_i be the number of ones in column i, for 1 ≤ i ≤ n. From lemma 2.2, E[Y_i | C] = m/2 + √(m/(2π))(1 + o(1)). Hence E[Y | C] = mn/2 + n√(m/(2π))(1 + o(1)). Since the columns are i.i.d., Var[Y | C] = n · Var[Y_i | C] ≤ n²/4. Chebyshev's inequality establishes the lemma: for any ω(n) with n = o(ω(n)) and ω(n) = o(n√m),

  Pr[|Y − E[Y | C]| > ω(n)] ≤ Var[Y | C]/ω(n)² = o(1).

Put s = mn/2 + n√(m/(2π))(1 + o(1)). It follows that

(4.5)  Pr[R_1 | C] ≥ (1 − o(1)) · Pr[R_1 | Y = s, C].

Technically speaking, s represents a set of values, all of which are of the form mn/2 + n√(m/(2π))(1 + o(1)); we abuse notation and say Y = s, when in fact Y assumes a value in this set.
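The conditional expectation driving lemma 4.1 comes from lemma 2.2: a column, conditioned on having majority one, has m/2 + √(m/(2π))(1 + o(1)) ones on average. A quick Monte Carlo sanity check (ours, not part of the proof):

```python
import math, random

random.seed(1)
m, trials = 501, 20000           # odd m, so "majority one" is unambiguous
total = count = 0
for _ in range(trials):
    ones = bin(random.getrandbits(m)).count("1")
    if ones > m / 2:             # condition on the event C for this column
        total += ones
        count += 1

print("empirical E[Y_i | C] :", total / count)
print("m/2 + sqrt(m/(2*pi)) :", m / 2 + math.sqrt(m / (2 * math.pi)))
```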
Define X_ij to be the (i, j)th entry of M conditioned on the events Y = s and C, and define X_i = Σ_{j=1}^n X_ij. Now put c' = r/√(2π) and d = (2 − r)/√(2π) for a constant 0 < r < 1 to be determined, and let E_i be the event

  n/2 + c'√n < X_i < n/2 + d√n,

for 1 ≤ i ≤ m_1. Clearly,

(4.6)  Pr[R_1 | Y = s, C] ≥ Pr[∧_{i=1}^{m_1} E_i] = Π_{i=1}^{m_1} Pr[E_i | ∧_{l=1}^{i−1} E_l].

The idea is to bound E[X_i | ∧_{l=1}^{i−1} E_l] and to show that Var[X_i | ∧_{l=1}^{i−1} E_l] is small, so that we can apply Chebyshev's inequality to each multiplicand on the RHS of (4.6).

We first bound E[X_i | ∧_{l=1}^{i−1} E_l]. Given ∧_{l=1}^{i−1} E_l, we know that Σ_{l=1}^{i−1} X_l is at least (i − 1)(n/2 + c'√n) and at most (i − 1)(n/2 + d√n). To ensure that E[X_i | ∧_{l=1}^{i−1} E_l] does not vary much with i, we restrict m_1 from being too large by setting m_1 = vm for a constant 0 < v < 1 to be determined. Since there are s ones in M in total, and E[X_{j₁} | ∧_{l=1}^{i−1} E_l] = E[X_{j₂} | ∧_{l=1}^{i−1} E_l] for all j₁, j₂ ≥ i,

  [s − (i − 1)(n/2 + d√n)]/(m − i + 1) ≤ E[X_i | ∧_{l=1}^{i−1} E_l] ≤ [s − (i − 1)(n/2 + c'√n)]/(m − i + 1).

Setting i = m_1 + 1 in the above, we obtain bounds independent of i which hold for all 1 ≤ i ≤ m_1:

  n/2 + √(n/(2π)) (√2 − v(2 − r))/(1 − v) − o(√n) ≤ E[X_i | ∧_{l=1}^{i−1} E_l] ≤ n/2 + √(n/(2π)) (√2 − vr)/(1 − v) + o(√n).

Define k_i to be

  min( E[X_i | ∧_{l=1}^{i−1} E_l] − (n/2 + c'√n), (n/2 + d√n) − E[X_i | ∧_{l=1}^{i−1} E_l] ),

and note that k_i measures how far X_i | ∧_{l=1}^{i−1} E_l has to deviate from its expectation for E_i | ∧_{l=1}^{i−1} E_l to fail; we will use k_i in Chebyshev's inequality below. Simplifying k_i using our bounds, after some algebra we obtain k_i = Θ(√n), with a constant factor depending only on r and v (here we use the definitions of c' and d, which were chosen to be symmetric around 1/√(2π)). For sufficiently large n, k_i is positive provided v < 1 − √2/2, which we hereby enforce.

We show that Var[X_i | ∧_{l=1}^{i−1} E_l] is small by showing that the entries in the ith row are negatively correlated.

Lemma 4.2. For any 1 ≤ i ≤ m_1 and any 1 ≤ j < k ≤ n,

  Cov[X_ij, X_ik | ∧_{l=1}^{i−1} E_l] = Pr[X_ik = 1 | ∧_{l=1}^{i−1} E_l] · (Pr[X_ij = 1 | X_ik = 1, ∧_{l=1}^{i−1} E_l] − Pr[X_ij = 1 | ∧_{l=1}^{i−1} E_l]) < 0.

Proof. Conditioned on X_i = t (and on ∧_{l=1}^{i−1} E_l), every placement of t ones in the ith row is equally likely, by symmetry. Hence Pr[X_ij = 1 | X_i = t, ∧ E_l] = (n−1 choose t−1)/(n choose t) = t/n and Pr[X_ij = 1 | X_ik = 1, X_i = t, ∧ E_l] = (n−2 choose t−2)/(n−1 choose t−1) = (t − 1)/(n − 1), interpreting (x choose y) as 0 for y < 0. Expanding

  Pr[X_ij = 1 | ∧ E_l] = Σ_{t=0}^{n} (t/n) · Pr[X_i = t | ∧ E_l]

and

  Pr[X_ij = 1 | X_ik = 1, ∧ E_l] = Σ_{t=0}^{n} ((t − 1)/(n − 1)) · Pr[X_i = t | X_ik = 1, ∧ E_l],

and comparing the two sums using (t − 1)/(n − 1) < t/n for t < n, one verifies that the second is strictly smaller than the first, which gives the lemma.

It follows that for all i,

  Var[X_i | ∧_{l=1}^{i−1} E_l] = Σ_{j=1}^{n} Var[X_ij | ∧ E_l] + Σ_{j≠k} Cov[X_ij, X_ik | ∧ E_l] < Σ_{j=1}^{n} Var[X_ij | ∧ E_l] ≤ n/4.

We now apply Chebyshev's inequality to each row:

  Pr[E_i | ∧_{l=1}^{i−1} E_l] = Pr[n/2 + c'√n < X_i < n/2 + d√n | ∧_{l=1}^{i−1} E_l]
    ≥ 1 − Pr[|X_i − E[X_i | ∧_{l=1}^{i−1} E_l]| ≥ k_i]
    ≥ 1 − Var[X_i | ∧_{l=1}^{i−1} E_l]/k_i²
    ≥ 1 − n/(4k_i²).

To simplify this expression, we choose v = (√2 − 1)/(2√2 − 1) < 1 − √2/2. The above inequality becomes

(4.7)  Pr[E_i | ∧_{l=1}^{i−1} E_l] ≥ 1 − 1/(8(1 − r)² − o(1)).

From equations (4.3), (4.4), (4.5), (4.6), and (4.7), we conclude

(4.8)  Pr[∀t ∈ T, Δ(y, t) ≤ n/2 − c'√n] > (1 − o(1)) · (1/2 − c')^{m_2} · (1 − 1/(8(1 − r)² − o(1)))^{m_1}.

We say that T is almost good if for all t ∈ T, Δ(y, t) ≤ n/2 − c'√n, and for all t ∈ S − T, Δ(y, t) > n/2. These two events are independent, and T is good if and only if T is almost good and, in addition, Δ(y, t) ≥ cn for all t ∈ T. Combining equations (4.2) and (4.8), and using m_1 = vm and m_2 = (1 − v)m,

  Pr[T is almost good] > 2^{m−n} · (1/2 − c')^{(1−v)m} · (1 − 1/(8(1 − r)² − o(1)))^{vm} · (1 − o(1)).

Taking logarithms base 2 and dividing by n (recall m = n/2), we obtain

(4.9)  (1/n) log₂(Pr[T is almost good]) = −1/2 + (1/2)[(1 − v) log₂(1/2 − c') + v log₂(1 − 1/(8(1 − r)² − o(1)))] + (1/n) log₂(1 − o(1)).

Observe that the RHS of equation (4.9) is continuous in r for 0 ≤ r < 1, and for r = 0 it is just

(4.10)  −1 + v/2 + (v/2) log₂(1 − 1/(8 − o(1))) + (1/n) log₂(1 − o(1)).
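The endgame is a calculus exercise: exhibit r > 0 for which the RHS of (4.9) exceeds a constant strictly greater than −1. Dropping the o(1) terms, this is easy to inspect numerically (our sanity check, using the formulas as reconstructed above; an approximation for inspection only):

```python
import math

def rhs_49(r, v):
    """Large-n limit of the RHS of (4.9), with c' = r/sqrt(2*pi)
    and the o(1) terms dropped."""
    cp = r / math.sqrt(2 * math.pi)
    return -0.5 + 0.5 * ((1 - v) * math.log2(0.5 - cp)
                         + v * math.log2(1 - 1 / (8 * (1 - r) ** 2)))

v = (math.sqrt(2) - 1) / (2 * math.sqrt(2) - 1)
for r in (0.0, 0.05, 0.1):
    print(r, rhs_49(r, v))       # all values strictly above -1, so z < 1 works
```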
Let N_1 ∈ Z be such that for all n > N_1, the −1 + v/2 + (v/2) log₂(1 − 1/(8 − o(1))) term in (4.10) is at least l = −1 + v/2 + (v/2) log₂(1 − 1/7); such an N_1 exists since log₂(·) is monotone increasing and 8 − o(1) > 7 for large enough n. Since 1 + log₂(1 − 1/7) = log₂(12/7) > 0, we have l > −1. Let N_2 ∈ Z be such that for all n > N_2, the (1/n) log₂(1 − o(1)) term in (4.10) is larger than −(1 + l)/2. Finally, let n be larger than max(N_1, N_2) and large enough to satisfy all previous steps where n needed to be sufficiently large. Then (4.10) is larger than (l − 1)/2, a constant strictly larger than −1. Since equation (4.9) is continuous in r, there exists a constant r > 0 so that for sufficiently large n, the RHS of equation (4.9) is larger than a constant strictly larger than −1. Hence for sufficiently large n, there exists a constant z < 1 so that

  Pr[T is almost good] > 2^{−zn}.

We next bound Pr[∃t ∈ T such that Δ(y, t) ≤ cn]. Fix t ∈ T. From lemma 2.2, there is a constant u > 0 with

  Pr[Δ(y, t) ≤ cn] ≤ Σ_{i=(1−c)n}^{n} (n choose i) (1/2 + u/√n)^i (1/2 − u/√n)^{n−i}
    ≤ cn (n choose (1−c)n) (1/2 + u/√n)^n ≤ 2^{H₂(1−c)n + O(√n) − n} ≤ 2^{H₂(1−c)n − βn}

for any constant β < 1 and large enough n. By the union bound,

  Pr[∃t ∈ T such that Δ(y_T, t) ≤ cn] ≤ n · 2^{H₂(1−c)n − βn} = 2^{H₂(1−c)n − βn + O(log n)}.

Hence

  Pr[T is good] ≥ Pr[T is almost good] − Pr[∃t ∈ T such that Δ(y_T, t) ≤ cn] ≥ 2^{−zn} − 2^{H₂(1−c)n − βn + O(log n)}.

We choose c and β so that β − H₂(1 − c) > z, which is possible by taking c close to 0 and β close to 1. Hence Pr[T is good] > 2^{−z'n} for any z' > z and large enough n; since z < 1, we can choose z' < 1, as needed.

The only loose end to tie up is that S may be a multiset. But for any i ≠ j, Pr[r_i = r_j] = 2^{−n}, so

  Pr[∃i ≠ j such that r_i = r_j] ≤ (n choose 2) · 2^{−n} = 2^{−n + O(log n)},

and hence for any specific T,

(4.11)  Pr[T is not good or S is a multiset] < 1 − 2^{−z'n} + 2^{−n + O(log n)},

so that for sufficiently large n and for any 1 > z'' > z',

  Pr[T is good | S is not a multiset] ≥ Pr[T is good and S is not a multiset] > 2^{−z''n}.

Thus the expected number of good subsets of S, given that S is not a multiset, is 2^{Ω(n)}, as before. This completes the proof.

Corollary 4.1. The number of m by n binary matrices M with more ones than zeros in each column and more ones than zeros in each row is at least 2^{mn − zm − n} for a constant z < 1.

Proof. Using the notation of the proof of theorem 4.1, the probability that a uniformly random m by n binary matrix M has majority 1 in each row, given that it has majority 1 in each column, is at least Pr[R_1 | C] · Pr[R_2] with r = 0 (and hence c' = 0). Note that the argument holds for any value of m, even though we only needed m = n/2 before. As n → ∞, Pr[R_1 | C] · Pr[R_2] approaches (1/2)^{(1−v)m} (1 − 1/8)^{vm} (see equations (4.4), (4.5), (4.6), (4.7)), which is 2^{−z₀m} for a constant z₀ < 1. The only dependence of Pr[R_1 | C] · Pr[R_2] on n is through o(1) terms (see the RHS of equation (4.8)), each of which can be upper bounded by an o(1) term continuous in n. Hence for sufficiently large n, Pr[R_1 | C] · Pr[R_2] ≥ 2^{−zm} for a constant z with z₀ < z < 1. Thus the probability that M has majority 1 in each row and majority 1 in each column is at least 2^{−zm} · 2^{−n} = 2^{−zm−n}, and since there are 2^{mn} binary matrices in total, the number of such M is at least 2^{mn − zm − n}.
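For very small matrices, the count in corollary 4.1 can be verified by brute force (our check, not from the paper):

```python
from itertools import product

def count_double_majority(m, n):
    """Count m-by-n binary matrices with strictly more ones than zeros
    in every row and every column (brute force; tiny m, n only)."""
    count = 0
    for bits in product((0, 1), repeat=m * n):
        rows = [bits[i * n:(i + 1) * n] for i in range(m)]
        if all(sum(row) > n / 2 for row in rows) and \
           all(sum(col) > m / 2 for col in zip(*rows)):
            count += 1
    return count

m, n = 3, 3
print(count_double_majority(m, n))    # 34: the exact count
print(2 ** (m * n - m - n))           # 8: the Kleitman-style baseline 2^(mn-m-n)
```

The exact count comfortably exceeds the trivial 2^{mn−m−n} baseline, as the strictly stronger 2^{mn − zm − n} bound predicts.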
5 Acknowledgment

The author thanks Piotr Indyk for helpful discussions and for checking the proof of the main theorem.

References

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, p. 20-29, 1996.
[2] N. Alon and J. Spencer. The Probabilistic Method. Wiley Interscience, New York, 1992, p. 86-87.
[3] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. RANDOM 2002, 6th International Workshop on Randomization and Approximation Techniques in Computer Science, p. 1-10, 2002.
[4] Z. Bar-Yossef. The complexity of massive data set computations. Ph.D. Thesis, U.C. Berkeley, 2002.
[5] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. An information statistics approach to data stream and communication complexity. Foundations of Computer Science, p. 209-218, 2002.
[6] Z. Bar-Yossef, T. S. Jayram, R. Kumar, and D. Sivakumar. Information theory methods in communication complexity. 17th IEEE Annual Conference on Computational Complexity, p. 93-102, 2002.
[7] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using Hamming norms. 28th International Conference on Very Large Data Bases (VLDB), 2002.
[8] D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. Proceedings of the 18th International Conference on Very Large Data Bases, p. 27-40, 1992.
[9] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.
[10] I. J. Good. Surprise indexes and p-values. Journal of Statistical Computation and Simulation, 32:90-92, 1989.
[11] P. Indyk and D. Woodruff. Tight lower bounds for the distinct elements problem. To appear: Foundations of Computer Science, 2003. Available: http://web.mit.edu/dpwood/www
[12] I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity. Computational Complexity, 8(1):21-49, 1999.
[13] B. D. McKay, I. M. Wanless, and N. C. Wormald. Asymptotic enumeration of graphs with a given upper bound on the maximum degree. Combinatorics, Probability and Computing, 11:373-392, 2002.
[14] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264-280, 1971.
[15] A. C.-C. Yao. Lower bounds by probabilistic arguments. In Proceedings of the 24th Annual IEEE Symposium on Foundations of Computer Science, p. 420-428, 1983.