Efficient Estimation Algorithms for Neighborhood Variance and Other Moments

Edith Cohen    Haim Kaplan

Abstract

The neighborhood variance problem is as follows. Given a (directed or undirected) graph with values associated with each node, compute a data structure that, for any given node v and r ≥ 0, can quickly produce an estimate of the variance of the values of all nodes that lie within distance r from v. The problem can be generalized to other moment functions and to arbitrary distance-dependent decay. These problems are motivated by applications where the relevance of a measurement observed (or data present) at a certain location decreases with its distance, and thus the aggregate value varies by location. The centralized version of the problem is motivated by applications to query processing on graphical databases. The distributed version of the problem falls in a model we recently introduced for spatially-decaying aggregation and is motivated by sensor or p2p networks. We present novel algorithms for the centralized and distributed versions of the problem. Our algorithms are nearly optimal: the centralized version requires Õ(m) time, and the distributed version requires polylogarithmic communication per node or edge (depending on assumptions).

The spatially-decaying moments problem is to efficiently compute a summary that allows us to retrieve, for each node v, power α in some fixed range, decay function, and point a, the (approximate) weighted average of |x − a|^α over the items. Generally, a decay function can be any non-increasing function. For an r-threshold decay function, this aggregate value is simply the average of |x − a|^α over all values x of items that reside at nodes in the r-neighborhood of v; that is, the aggregate is computed over all items present in the r-neighborhood.
When instead of an arbitrary value a we use the weighted mean (which varies from node to node), we refer to the problem as spatially-decaying central moments; when α = 2 this is the spatially-decaying variance; and for r-threshold decay this is the variance over all values in the r-neighborhood. (See Figure 1 for an example of a network, values, and respective moments.)

1 Introduction

Variance and moments are commonly used and very basic properties of data sets and distributions. We consider these problems in a spatially-decaying setting, where values are present at nodes of a graph (or a network). Data items present at one node are relevant to other nodes, yet the relevance decreases with distance [5]. Thus, each location views the items through a different distribution and is interested in aggregate values accordingly. The weight of an item as viewed from a certain location is determined by some decay function applied to its distance [5]. One example of a decay function is the r-threshold function, which assigns uniform weight to items within distance r and 0 weight otherwise.

AT&T Research Labs, 180 Park Ave., Florham Park, NJ, USA. Email: edith@research.att.com.
School of Computer Science, Tel-Aviv University, Tel Aviv, Israel. Email: haimk@cs.tau.ac.il.

Figure 1: The 2-neighborhood of node a is {a, b, h, d, e, g, c}. The mean according to 2-threshold decay is thus 8/7, the second moment about 0 is 14/7 = 2, and the variance is 238/343 = 34/49. For node c the 2-neighborhood is {c, b, f, g, e, a}; the respective mean is 7/6, the second moment about 0 is 15/6 = 2.5, and the variance is 246/216 = 41/36. (Node values: a = 2, b = 0, c = 1, d = 2, e = 0, f = 3, g = 1, h = 2.)

The problem has a centralized variant, where the graph is given as input, and a distributed variant, where the nodes form a network, and the goal is to use a small amount of communication and obtain, at each node, a compact summary that allows it to answer queries on its own neighborhoods (or arbitrary decay functions).
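As a quick sanity check of the quantities in Figure 1, the following snippet recomputes the 2-threshold mean, second moment about 0, and variance directly from the neighborhoods stated in the caption. (The edge structure of the example graph is not reproduced here, so the 2-neighborhoods are taken as given from the caption; the function names are ours.)

```python
from fractions import Fraction

# Node values from Figure 1.
value = {"a": 2, "b": 0, "c": 1, "d": 2, "e": 0, "f": 3, "g": 1, "h": 2}

def threshold_moments(nodes):
    """Mean, second moment about 0, and variance over an r-neighborhood,
    i.e., the Ball_r decay aggregate with uniform weights."""
    vals = [Fraction(value[v]) for v in nodes]
    n = len(vals)
    mean = sum(vals) / n
    m2 = sum(x * x for x in vals) / n              # second moment about 0
    var = sum((x - mean) ** 2 for x in vals) / n   # central second moment
    return mean, m2, var

# 2-neighborhoods as stated in the caption of Figure 1.
mean_a, m2_a, var_a = threshold_moments(["a", "b", "h", "d", "e", "g", "c"])
mean_c, m2_c, var_c = threshold_moments(["c", "b", "f", "g", "e", "a"])
print(mean_a, m2_a, var_a)  # 8/7 2 34/49
print(mean_c, m2_c, var_c)  # 7/6 5/2 41/36
```

Exact rational arithmetic (`Fraction`) reproduces the caption's values 34/49 and 41/36 without rounding.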
The centralized variant of the problem is motivated by applications in traditional graphical databases, for example, XML documents or analyzing Web structure. The distributed version of the problem is motivated by emerging applications, such as p2p networks and sensor networks, where different data items are present at nodes connected by some low-degree communication network [5]. We present algorithms that yield (1 ± ε)-approximate answers (for any fixed ε > 0) for spatially-decaying moments and variance. The size of the summaries is polylogarithmic per node, and the running time for the centralized version is Õ(m). Our algorithms are novel for both the centralized and distributed versions of the problem.

Related work: From an algorithmic standpoint (though less so from an application standpoint), spatially-decaying aggregation generalizes time-decaying aggregation on massive data streams [6] and, in particular, sliding-window aggregation for massive data streams [7, 8]: time-decaying aggregation on data streams corresponds to spatially-decaying aggregation on directed path graphs, and sliding windows correspond to neighborhoods. Thus, spatially-decaying variance generalizes the sliding-windows variance problem, which was studied by Babcock et al. [1]. The techniques of Babcock et al. do not seem to carry over to the spatial setting: Exponential Histograms [7] do not seem to work in the spatial setting (see [5] for discussion).

Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for Industrial and Applied Mathematics. All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036, and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.
Moreover, their algorithm relies on exact computation of the variance and average in each bin of the histogram, an operation that seems fundamentally hard in the spatial setting. It is also not clear whether the Babcock et al. sliding-window variance algorithm can be extended beyond sliding windows to time-decaying variance under general decay functions [6].

The challenge in spatially-decaying aggregation is that the aggregate value (or summary) is location-dependent. Yet, we do not want to recompute it from the raw distribution for each node, as this would result in quadratic time in the centralized setting and flooding with quadratic communication in the distributed setting. The moments problem imposes additional challenges: even beyond the issues of computational efficiency, it is not clear how to summarize the data into a compact representation that captures sufficient information to answer the queries. Distributed computation makes the problem even more challenging, since it basically requires a very efficient way of both summarizing and communicating the essence of the data such that each node can distill the information relevant to it.

Overview and insights: A basic ingredient in our algorithms is approximate spatially-decaying counts: given binary item values, the goal is to produce a Neighborhood Summary (NH-summary) which allows one to obtain, for each node v and decay function, an approximate decaying count of the values. (For the special case of a threshold decay function, this amounts to estimating, for any given r ≥ 0, the count in the r-neighborhood of v.) Algorithms for centralized computation of NH-summaries were introduced by the first author in [4]. Further new ideas, which allow efficient distributed computation of NH-summaries and handling of general decay functions, are presented by the authors in [5].
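For concreteness, here is a brute-force reference implementation of the decaying count that an NH-summary approximates: it computes W = Σ_i g(dist(u, i)) exactly by running Dijkstra from the query node. This is precisely the per-node recomputation (quadratic overall) that the summaries are designed to avoid; the graph encoding and function names are ours, not the paper's.

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path distances from src; adj maps node -> [(nbr, length), ...]."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def decayed_count(adj, items, u, g):
    """Exact decaying count: sum of g(dist(u, i)) over item locations i."""
    dist = dijkstra(adj, u)
    return sum(g(dist[i]) for i in items if i in dist)

def ball(r):
    """r-threshold decay Ball_r: weight 1 within distance r, 0 beyond."""
    return lambda x: 1.0 if x <= r else 0.0

# Tiny example: an undirected path a-b-c-d with unit lengths, one item per node.
adj = {"a": [("b", 1)], "b": [("a", 1), ("c", 1)],
       "c": [("b", 1), ("d", 1)], "d": [("c", 1)]}
items = ["a", "b", "c", "d"]
print(decayed_count(adj, items, "a", ball(1)))  # 2.0  (a and b)
print(decayed_count(adj, items, "b", ball(2)))  # 4.0  (all four nodes)
```

An NH-summary replaces this exact computation with a polylogarithmic-size structure returning (1 ± ε)-approximate counts for every r (and, by [5], every decay function) at once.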
The crux of our approach is a novel technique that summarizes a set of values into a polylogarithmic-size summary from which we can retrieve an approximation of the moment about any constant. These summaries are obtained by applying a logarithmic number of global predicates to each value. For each predicate, we then compute an NH-summary for the count of values on which the predicate holds. Each NH-summary has polylogarithmic size and can provide, for each decay function, the approximate weight of items that satisfy the predicate. The key insights we need are in the choice of these predicates.

In order to estimate neighborhood moments, we need to somehow preserve and retrieve information about the distances between the values and any given point. If we are only interested in moments about some globally-fixed point a, the problem is easy: each value x is bucketized according to its distance from a, |x − a|, using buckets of exponentially growing width. We then only need to know an approximate weight of items within each bucket, which we can obtain from an NH-summary computed by a spatially-decaying count of the items in each bucket. The catch, however, is that we want the same summary to work for arbitrary choices of a.¹

The next approach we consider is partitioning the range uniformly into a polylogarithmic number of bins and producing an NH-summary for each bin. This partially works, but only for nodes, decay functions, and points a where "most" of the weight lies in a bin that is far from the bin containing a (otherwise, too much information is lost and we cannot guarantee the desired (1 ± ε) approximation). Our approach is based on extending this attempt: we first fold the range of values into a smaller range (the fold width), and then uniformly partition it into a histogram with a fixed number of bins. The folding essentially amounts to discarding some values and then performing a modulus operation by the fold width. The key property is that all distances between values may decrease and become at most the fold width, but distances that are smaller than the fold width remain the same. After partitioning the fold width uniformly into some constant number of bins, we obtain that all distances that are not too big and not too small (that is, are of the order of the fold width) are approximately preserved. We use a logarithmic number of different fold widths that are exponentially decreasing.

¹ It may seem that this approach could work for the variance, where we are interested in a moment about a specific point (the mean). Note, however, that the mean is not global: it depends on the aggregating node and on the choice of decay function. Thus, there are many (possibly linearly many) relevant "means" to consider that each value can be aggregated about. We shall see that our solution for the variance relies on the summary that can retrieve moments about arbitrary points.

When computing our estimate of the moment about some value a, we sum over the different foldings. Each item is accounted for in many foldings, but there is only one folding that preserves its approximate distance from a. The larger-width foldings bucketize it together with a, yielding a 0 contribution to the moment, and the smaller-width foldings account for a contribution that is much smaller than the one corresponding to its true distance.

The computation of central moments, including the variance, uses the same summaries but requires some additional insights. The exact value of the mean is not known to the aggregating node, and simply computing the aggregate about a (1 ± ε) approximation of the mean (which can be obtained using decaying sum computations) is not sufficient for obtaining a (1 ± ε) approximation of the variance.

We organize our presentation as follows. In Section 2 we state the spatial decay model and the spatially-decaying sum problem [5].
Section 3 describes the summaries by defining the folding functions and the predicates that are aggregated as spatially-decaying sums at each node. Section 4 states the algorithm that, for a given decay function, point a, and power α, computes an estimate of a moment from the summary. Section 6 is concerned with the variance computation (and other central moments). The correctness proof of the algorithm of Section 4 is given in Section 5. We conclude in Section 7 with a discussion of extending our approach to higher dimensions and to k-medians.

2 Preliminaries

We start by defining spatially-decaying aggregation [5], and in particular the spatially-decaying sum problem [5]. We then proceed to define our problem of spatially-decaying moments.

We model the network as a (directed or undirected) graph G = (V, E), where V = {v1, . . . , vn} is the set of nodes, and there is an edge between two nodes if and only if the two nodes can communicate. We denote the number of edges by m. Edges can have nonnegative lengths associated with them, which correspond to distances. We denote by dist(vi, vj) the distance between two nodes vi and vj with respect to the shortest-path metric on the edge lengths.

Nodes in the network have data items associated with them. Each item i ∈ I is specified by a pair (fi, i), where fi is its value and i ∈ V is its location. A decay function is a non-increasing function g(x) ≥ 0 defined for x ≥ 0. The decay function determines the "weight" of a remote item as a function of its distance. The decaying weight of the item i as viewed by a node u is w_{u,g}(i) = g(dist(u, i)).²

² Our algorithms can be easily extended to a setting where each item has a "local" weight w⁰_i, and its decaying weight is w⁰_i g(dist(u, i)). For simplicity of presentation we assume uniform local weights.

An important family of decay functions are the threshold functions Ball_r (for r ≥ 0), defined by Ball_r(x) = 1 for x ≤ r and Ball_r(x) = 0 otherwise. The corresponding aggregation is over the r-neighborhood: all data items that lie within distance r have equal weight, and all farther items have 0 weight. Other natural classes of decay functions are exponential decay and polynomial decay (see [5] for details).

An aggregate function is a function defined on a multiset of value-weight pairs. The goal of spatially-decaying aggregation is to produce summaries with respect to a particular aggregate function (or a class of functions). Each node u obtains a localized summary³ which allows it, for any given decay function g() (and any aggregate function in the class we consider), to obtain (1 ± ε)-approximate estimates of the value of the aggregate on the multiset {fi, w_{u,g}(i)}.

³ In the centralized version of the problem one can also consider a single summary for all nodes. The algorithms we consider here produce separate summaries.

We measure performance by the running time needed to produce these summaries and by the size of the resulting summaries. In the distributed setting we consider the amount of communication per node and the storage at each node. In the sequel, "(1 ± ε)-approximate estimates" (or just "estimates") means that by appropriately adjusting constants in our algorithms we can handle any fixed ε > 0. To simplify the discussion, we ignore in several places scaling of ε by a constant factor.

A basic aggregate is the sum (weighted sum of values), where the value at node u for decay function g() is S_g(u) = Σ_i w_{u,g}(i) fi. (For the sum problem we assume fi ≥ 0 for all i.) In the special case where the values fi are binary, we refer to this aggregate as the count. We define W_{u,g} = Σ_i w_{u,g}(i) to be the decaying count of all items as viewed by u. When u or g are clear from context we will omit them from the subscripts of W_{u,g} and w_{u,g}(i).

The spatially-decaying sum problem is to obtain summaries such that each location u ∈ V, for any decay function g(), can retrieve a (1 ± ε)-approximate value of S_g(u). The summaries produced by spatially-decaying sum algorithms are termed Neighborhood summaries (NH-summaries) [4, 5]. NH-summaries are particularly relevant to us here, since we reduce the spatially-decaying moments problem to performing logarithmically-many computations of NH-summaries. As discussed in the Introduction, [4] shows that in Õ(m) time we can obtain for each node a polylogarithmic-size NH-summary that gives (1 ± ε)-approximate answers with very high probability.⁴ We studied distributed algorithms for NH-summaries in [5]. The communication needed per node depends on the setup. Under some assumptions, e.g., if shortest-path trees are pre-computed, the summaries can be obtained using polylogarithmic communication per node.

⁴ The work of [4] considers only Ball_r decay functions, but we show in [5] that summaries that can support Ball_r decay functions for arbitrary r ≥ 0 can support arbitrary decay functions.

For a set of items (with values fi, weights w(i), and W = Σ_i w(i)), a point β, and a power α, the α-moment about β is defined as Σ_i w(i)(fi − β)^α / W. We refer to the non-normalized quantity Σ_i w(i)(fi − β)^α as the α-power sum about β. We also consider absolute moments, defined as Σ_i w(i)|fi − β|^α / W, and the respective absolute power sum Σ_i w(i)|fi − β|^α. Moments about the mean are termed central moments, whereas moments about arbitrary choices of β are termed raw moments.

The spatially-decaying (absolute) power-sums problem is to produce summaries, according to ε > 0 and a range [α_a, α_b] (where α_b ≥ α_a > 0). The summaries should allow each node u to obtain, for each β, g(), and α ∈ [α_a, α_b], a (1 ± ε)-approximation of the power sum

(2.1)  A^β_{α,g}(u) = Σ_i ( w_{u,g}(i) |fi − β|^α ) .

For "pure" moments we use the notation

(2.2)  M^β_{α,g}(u) = Σ_i ( w_{u,g}(i) (fi − β)^α ) .

Note that the α-moment about β is the ratio M^β_{α,g}(u)/W_{u,g}, and that the respective absolute moment is A^β_{α,g}(u)/W_{u,g}. Our algorithms obtain (1 ± ε)-estimates for absolute power sums, and thus for pure power sums with even integral values of α (since for even powers M^β_{α,g}(u) ≡ A^β_{α,g}(u)).⁵

⁵ (1 ± ε)-approximate pure power sums with odd α are as hard as obtaining exact neighborhood counts [5, 7].

Central moments have particular significance: the most important such moment is the variance. The (weighted) variance of a set of values is defined as V = Σ_i w(i)(fi − μ)² / W, where μ = Σ_i w(i)fi / W is the (weighted) mean. The spatially-decaying central moment is M^{μ_g(u)}_{α,g}(u)/W_{u,g}, and the spatially-decaying variance is thus M^{μ_g(u)}_{2,g}(u)/W_{u,g} ≡ A^{μ_g(u)}_{2,g}(u)/W_{u,g}, where μ_g(u) = Σ_i w_{u,g}(i)fi / W_{u,g}.

Moments are the ratio of the respective power sum and W_{u,g}. Since we can efficiently approximate W_{u,g} using an NH-summary, an approximation of the numerator (the power sum) yields an approximation of the respective moment. In particular, approximate central, raw, absolute, or pure moments can be obtained from the respective approximate power sums (and vice versa). In the sequel we will focus on power sums.

3 Foldings and predicates

We develop a technique to compute summaries for the spatially-decaying power sums problem. We assume (this assumption is addressed in Subsection 5.1) that items have integral values in the range 0, . . . , R − 1. Our algorithm defines a logarithmic number of global predicates. All nodes apply each predicate to their local items. For each predicate, the system then produces NH-summaries at all nodes. As a result, each node stores a logarithmic number of NH-summaries (one for each predicate).

We now provide a high-level description of these predicates. We use mappings which we refer to as foldings. Each folding excludes part of [0, R) and maps the remaining (included) values into a range of the form [0, R/2^{λj}) for some j ≥ 0 and a parameter λ ≥ 2. The range of the folding is then partitioned uniformly into B bins, where bin b (b = 0, . . . , B − 1) contains values that the folding maps to [ (b/B) R/2^{λj}, ((b+1)/B) R/2^{λj} ). Each bin in each folding corresponds to a predicate. This predicate is "1" for the i-th item if and only if fi is included in the folding and the image of fi under this mapping falls in the corresponding bin. The resulting NH-summaries allow each node u to obtain, for each folding, each bin, and each decay function g, an approximate decayed count of the items with values that are mapped by the folding to that bin. (For the special case of a Ball_r decay function, we can obtain, for each r, an approximate number of items within the r-neighborhood of u that are mapped by the folding to that bin.)

The value of B is set according to the desired accuracy and communication tradeoffs. Recall that λ is a parameter of our construction which is at least 2. We also define S = 2^λ + 1. We have a folding for each j from 0 to λ^{-1} log₂(R/B). For convenience of presentation we assume that B/2^{λ+2} and λ^{-1} log₂(R/B) are integral. An illustration of a range and different foldings, with the respective included values, is provided in Figure 2; an illustration of the folding mapping is provided in Figure 3.

Figure 3: Included parts of the range [0, . . .
, R) for the folding FOLD_{0,1,0} and the respective mapping of these parts to [0, . . . , R/4) (λ = 2; shown with S = 2 although we assume in the analysis that S = 2^λ + 1).

Figure 2: Included parts of the range [0, . . . , R) for the foldings FOLD_{0,1,0}, FOLD_{0,1,1}, FOLD_{1/2,1,0}, and FOLD_{1/2,1,1} (λ = 2; for simplicity shown with S = 2 although we assume in the analysis that S = 2^λ + 1).

The reason for using two different partitions for each j, s, with c = 0 and c = 1/2, is to obtain the property that every subinterval of [0, R) of length at most R/2^{λj+1} lies within some subinterval of one of the partitions.

We now define precisely the set of foldings that we use. In addition to j ∈ {0, . . . , λ^{-1} log₂(R/B)}, each folding FOLD_{c,j,s} is specified by two more parameters: s ∈ {0, . . . , S − 1} and c ∈ {0, 1/2}. We explain the role of these additional parameters next. The folding mapping can be viewed as follows. The interval [0, R) is partitioned into consecutive subintervals of size R/2^{λj}. The parameter c ∈ {0, 1/2} determines at what point the partition starts: if c = 0 the subinterval boundaries start at 0, and if c = 1/2 the boundaries start at R/2^{λj+1} (a "half"-subinterval shift) and end at R − R/2^{λj+1}.⁶ The domain of the folding includes a subset of these subintervals that are spaced exactly S subintervals apart (s determines which of the S possible subsets of subintervals-spaced-S-apart is included). All included subintervals are then identified (that is, a value fi in a subinterval [a1, a2] is mapped to fi − a1). Hence, we obtain a mapping of the range [0, R) into a range [0, R/2^{λj}). Formally, the domain of the mapping is

FOLD_{c,j,s} = { x | ⌊ (x − cR/2^{λj+1}) / (R/2^{λj}) ⌋ mod S = s } .

For x ∈ FOLD_{c,j,s} we define the image as⁷

F_{c,j,s}(x) = (x − cR/2^{λj+1}) mod R/2^{λj} ,

and the discretization to bins by

BIN_{c,j,s}(x) = ⌊ F_{c,j,s}(x) B 2^{λj} / R ⌋ .

⁶ So for c = 1/2 we do not get exactly a partition of [0, R) but of [R/2^{λj+1}, R − R/2^{λj+1}).

⁷ We use the natural extension of the modulo operation to nonnegative reals.

Formally, we say that an interval [a, b] ⊆ [0, R) is intact in a folding FOLD_{c,j,s} if all points in the interval are included in FOLD_{c,j,s} and the folding preserves distances between points in the interval. Equivalently, [a, b] is intact if [a, b] ⊆ FOLD_{c,j,s} and b − a = F_{c,j,s}(b) − F_{c,j,s}(a). Observe that every interval of the form [a, a + R/2^{λj}), where a mod R/2^{λj+1} = 0, is a maximal intact interval in some folding of the form FOLD_{·,j,·}. We thus have the following property:

Lemma 3.1. Any interval [a, b] ⊆ [0, R) such that b − a ≤ R/2^{λj+1} is contained in some maximal intact subinterval of a folding of the form FOLD_{·,j,·}.

Lemma 3.2. Consider a maximal intact interval [a, a + R/2^{λj}) (a mod R/2^{λj+1} = 0) of some FOLD_{·,j,·}. Consider a folding FOLD_{c,j−1,s} such that [a, a + R/2^{λj}) is intact in that folding, and let [d, d + R/2^{λ(j−1)}) ⊇ [a, a + R/2^{λj}) be a maximal intact subinterval of FOLD_{c,j−1,s}. Then

{ BIN_{c,j−1,s}(x) | x ∈ [a, a + R/2^{λj}) } ∩ { BIN_{c,j−1,s}(x) | x ∈ [d, d + R/2^{λ(j−1)}) \ [a, a + R/2^{λj}) } = ∅

(that is, the set of bins that cover F_{c,j−1,s}([a, a + R/2^{λj})) and the set of bins that cover F_{c,j−1,s}([d, d + R/2^{λ(j−1)}) \ [a, a + R/2^{λj})) are disjoint).

Proof. Since [d, d + R/2^{λ(j−1)}) is a maximal intact interval of some FOLD_{·,j−1,·}, we have that d mod R/2^{λ(j−1)+1} = 0. It follows that (a − d) mod R/2^{λj+1} = 0 and that (a + R/2^{λj} − d) mod R/2^{λj+1} = 0. The bin partition partitions the including interval into intervals of size (R/2^{λ(j−1)})/B. It thus suffices to show that R/2^{λj+1} is divisible by (R/2^{λ(j−1)})/B. This in fact holds, since

(R/2^{λj+1}) / ( (R/2^{λ(j−1)})/B ) = B 2^{λ(j−1)} / 2^{λj+1} = B/2^{λ+1} .

(The latter is clearly integral since we assumed that B/2^{λ+2} is integral.)

The following is immediate from our definitions:

Lemma 3.3. Consider a maximal intact subinterval I of some FOLD_{·,j,·}. Then all points x ∈ [0, R) \ I such that dist(x, I) ≤ (S − 1)R/2^{λj} are not included in the folding.

Each node stores an NH-summary for each of the B bins in each folding FOLD_{c,j,s}, for all c, j, s. Thus, the communication and storage amount to computing 2BS λ^{-1} log₂(R/B) NH-summaries. Consider the viewpoint of some node u. We use the notation

B_{c,j,s}(b, g) = Σ_{ i | fi ∈ FOLD_{c,j,s}, BIN_{c,j,s}(fi) = b } w_{u,g}(i)

for the decaying count of items in the b-th bin of FOLD_{c,j,s}. From the NH-summaries available at our node u we can obtain estimates B̂_{c,j,s}(b, g) of B_{c,j,s}(b, g) (for all g(), foldings, and bins).

4 Computing power sums from summaries

Given g(), β, and α, we show how a node u can use its locally-available estimates B̂_{c,j,s}(b, g) to estimate Σ_i w_{u,g}(i)|fi − β|^α. For each j ∈ {0, . . . , λ^{-1} log₂(R/B)} we define the intervals

I_j = [ max{β − R/2^{λ(j+1)+2}, 0}, min{β + R/2^{λ(j+1)+2}, R} ] .

Then for each j ∈ {0, . . . , λ^{-1} log₂(R/B)} the node selects one folding of the form FOLD_{·,j,·} (denoted FOLD_{c_j,j,s_j}) as follows. For j = 0 it uses the folding FOLD_{0,0,0} (for all β). For j > 0 the node selects a folding FOLD_{c_j,j,s_j} such that I_{j−1} is intact. (The existence of such a folding is guaranteed by Lemma 3.1.) We define Ī_{j−1} = [a_j, a_j + R/2^{λj}) to be the maximal intact interval of FOLD_{c_j,j,s_j} which includes I_{j−1}. For convenience, we define Ī_{−1} ≡ [0, . . . , R) and Ī_{λ^{-1} log₂(R/B)} ≡ ∅.

Algorithm PowerSum(g, β, α):
• M ← 0
• For j = 0, . . . , λ^{-1} log₂(R/B) do as follows:
  • For all b ∈ {0, . . . , B − 1} such that bR/(B 2^{λj}) ∈ [0, R/2^{λj}) \ F_{c_j,j,s_j}(Ī_j) (in words, for the bins of the range of F_{c_j,j,s_j} that cover F_{c_j,j,s_j}(Ī_{j−1} \ Ī_j)) do as follows:
    • M ← M + B̂_{c_j,j,s_j}(b, g) · | bR/(B 2^{λj}) − F_{c_j,j,s_j}(β) |^α

The following lemma summarizes two properties of these intervals that we need to establish the correctness of the algorithm.

Lemma 4.1. 1. Ī_j is contained in Ī_{j−1}, and therefore is intact under FOLD_{c_j,j,s_j}. 2. { BIN_{c_j,j,s_j}(x) | x ∈ Ī_j } ∩ { BIN_{c_j,j,s_j}(x) | x ∈ Ī_{j−1} \ Ī_j } = ∅ (Ī_j is "exactly covered" by the bin partition of FOLD_{c_j,j,s_j}).

Proof. I_j is of size R/2^{λ(j+1)+1}, and Ī_j is of size R/2^{λ(j+1)}; thus |Ī_j| = 2|I_j|. Since β is the midpoint of I_j we have

Ī_j ⊆ [ max{β − 3R/2^{λ(j+1)+2}, 0}, min{β + 3R/2^{λ(j+1)+2}, R} ] ⊆ Ī_{j−1}

(the latter holds since 3 ≤ 2^λ). The second property follows from Lemma 3.2.

5 Correctness of algorithm PowerSum

Consider an iteration of PowerSum, and the respective folding FOLD_{c_j,j,s_j}. First note that items i are classified as either included or excluded according to whether they belong to FOLD_{c_j,j,s_j}. We further classify included items into either internal or external as follows. Items with value fi such that F_{c_j,j,s_j}(fi) ∈ F_{c_j,j,s_j}(Ī_j) are internal for FOLD_{c_j,j,s_j}. Items such that F_{c_j,j,s_j}(fi) ∈ F_{c_j,j,s_j}(Ī_{j−1} \ Ī_j) (i.e., all other included items) are external for FOLD_{c_j,j,s_j}. So any item is either internal, external, or excluded.

For an iteration j and an external item i, we refer to w_g(i)|F_{c_j,j,s_j}(fi) − F_{c_j,j,s_j}(β)|^α as the contribution of item i. These classifications are useful, as only external items "contribute" to M during the j-th iteration. We will use the following property to bound the approximation error.

Lemma 5.1. Values in Ī_{j−1} that are external for FOLD_{c_j,j,s_j} are excluded in FOLD_{c_{j+1},j+1,s_{j+1}}.

Proof. Values in Ī_{j−1} that are external for FOLD_{c_j,j,s_j} are exactly those in Ī_{j−1} \ Ī_j. So we have to show that Ī_{j−1} \ Ī_j is excluded by FOLD_{c_{j+1},j+1,s_{j+1}}. From
Lemma 3.3, all values that are not in Ī_j and are of distance at most (S − 1)R/2^{λ(j+1)} from Ī_j are excluded under FOLD_{c_{j+1},j+1,s_{j+1}}. Since I_j is of size R/2^{λ(j+1)+1} and β is the midpoint of I_j, we have that all values that are of distance at most

(S − 1)R/2^{λ(j+1)} + R/2^{λ(j+1)+2} = (4S − 3)R/2^{λ(j+1)+2}

from β and are not in Ī_j are excluded. Since S ≥ 2^λ + 1, we obtain that 4S − 3 ≥ 4 · 2^λ, and thus

(4S − 3)R/2^{λ(j+1)+2} ≥ R/2^{λj} .

Recall now that the interval Ī_{j−1} is of size R/2^{λj} and contains β; thus it must be the case that all points in Ī_{j−1} \ Ī_j are excluded by FOLD_{c_{j+1},j+1,s_{j+1}}.

Figure 4: A folding selected according to some β (B = 16, λ = 2, and thus B 2^{−(λ+2)} = 1). The figure shows the value β and Ī_{j−1} (the maximal intact interval in FOLD_{c_j,j,s_j}), along with the histogram partition into B = 16 bins. The figure also shows the interval I_j (which is 1 bin wide around β) and Ī_j (the maximal intact interval in FOLD_{c_{j+1},j+1,s_{j+1}} that contains I_j). As it should be, I_j ⊆ Ī_j ⊆ Ī_{j−1}, and Ī_j aligns with the bin partition induced on Ī_{j−1}. The figure also shows the ranges of values that are classified as internal and external.

We conclude the correctness proof with the following two lemmas. Let

T_j = Σ_{ i external in FOLD_{c_j,j,s_j} } w(i) | F_{c_j,j,s_j}(fi) − F_{c_j,j,s_j}(β) |^α

be the non-discretized contribution during iteration j.

Lemma 5.2. The total contribution made to M during iteration j is a (1 + ε)(1 ± 2^{λ+2}/B)^α-approximation of the quantity T_j, where (1 + ε) is our error in the decaying sum estimates.

Proof. Observe that, among the bins of FOLD_{c_j,j,s_j}, only bins of external items, and all bins with external items, contribute to our estimate. Thus, the error stems from two sources: first, we have the weight of each such bin only up to (1 + ε) accuracy, and second, for an item in a bin we sum up not the exact difference |F_{c_j,j,s_j}(fi) − F_{c_j,j,s_j}(β)|^α but this difference with F_{c_j,j,s_j}(fi) rounded to a bin boundary. The first component of the error clearly contributes a factor of 1 + ε to our overall error. We next bound the error introduced by rounding to bins. For j = λ^{-1} log₂(R/B), the range of FOLD_{c_j,j,s_j} has B values and the histogram captures exact values; thus the contribution is precise. Otherwise, since item i is external, fi ∉ I_j, and therefore |F_{c_j,j,s_j}(fi) − F_{c_j,j,s_j}(β)| ≥ R/2^{λ(j+1)+2}; rounding to bin boundaries gives an additive error term of (R/2^{λj})/B in our knowledge of F_{c_j,j,s_j}(fi). Thus, the relative error we get in |F_{c_j,j,s_j}(fi) − F_{c_j,j,s_j}(β)|^α is bounded by (1 ± 2^{λ+2}/B)^α.

Lemma 5.3. Σ_j T_j is a (1 + 2^{(2−λ)α}/(1 − 2^{−λα}))-approximation of Σ_i w(i)|fi − β|^α.

Proof. We now consider the contribution of each item i to Σ_j T_j (that is, the sum, over the iterations j in which i is external in FOLD_{c_j,j,s_j}, of the contribution of i to T_j). For each item i, we define J(i) ∈ {0, . . . , λ^{-1} log₂(R/B)} to be such that fi ∈ Ī_{J(i)−1} \ Ī_{J(i)}. Recall that an item i is external in the j-th iteration if and only if i is included in FOLD_{c_j,j,s_j} and F_{c_j,j,s_j}(fi) ∈ F_{c_j,j,s_j}(Ī_{j−1} \ Ī_j). In particular, an item i is external in the folding FOLD_{c_{J(i)},J(i),s_{J(i)}}. Moreover, since the intervals Ī_j are nested, iteration J(i) is the first iteration in which the item i is external. Note also that the interval between fi and β is intact in iteration J(i) and is not intact in subsequent iterations.

Each item i may contribute in several iterations. The first iteration it contributes in is iteration J(i). Since the interval between fi and β is intact in iteration J(i), the contribution of i in iteration J(i) is exactly w(i)|fi − β|^α. We next argue that its contributions in subsequent iterations are at most some constant fraction of w(i)|fi − β|^α. It follows from the definition of the intervals I_j that the contribution of i at iteration J(i) is at least w(i)(R/2^{λ(J(i)+1)+2})^α. Lemma 5.1 states that external items in an iteration are excluded in the next iteration; thus, an item i does not contribute in iteration J(i) + 1. In every iteration j ≥ J(i) + 2, the contribution of i is at most w(i)(R/2^{λj})^α (since the size of the range of FOLD_{·,j,·} is R/2^{λj}).
Since the upper bound on the contribution of i in each iteration j ≥ J(i) + 2 decreases by a factor of 2^{−α}, we obtain that the sum of the contributions of i in all these iterations together is at most

    w(i)(R/2^{J(i)+2})^α (1 + 2^{−α} + (2^{−α})^2 + · · ·) = w(i)(R/2^{J(i)+2})^α / (1 − 2^{−α}) .

Therefore the relative error introduced by the contributions of i in iterations j ≥ J(i) + 2 is at most

    [(R/2^{J(i)+2})^α / (1 − 2^{−α})] / (R/2^{(J(i)+1)+2/α})^α = 2^{−α+2}/(1 − 2^{−α}) .

Combining the two lemmas, the total approximation factor is at most

    (1 + ε)(1 + 2^{α+2}/B)^α (1 + 2^{−α+2}/(1 − 2^{−α})) .

Assume we are interested in summaries that are good for a certain ε and all α ∈ [a_α, b_α]. The second part of the approximation factor (contributed by Lemma 5.3) is decreasing with α, and is thus at most 1 + 2^{−a_α+2}/(1 − 2^{−a_α}). The first part of the approximation factor (contributed by Lemma 5.2) is increasing with α; by choosing a sufficiently large B we can have (1 + 2^{α+2}/B)^{b_α} ≤ 1 + ε.

A one-sided power sum about θ is a weighted sum over all values that are larger (or smaller, respectively) than θ. That is,

    A^{α,+}_θ = Σ_{i | f_i > θ} w(i)|f_i − θ|^α   and   A^{α,−}_θ = Σ_{i | f_i ≤ θ} w(i)|f_i − θ|^α .

Absolute power sums, and pure power sums with even α, can be expressed as the sum of the two one-sided sums, A^α_θ = A^{α,+}_θ + A^{α,−}_θ, whereas pure power sums with odd α are the difference M^α_θ = A^{α,+}_θ − A^{α,−}_θ.

Lemma 6.1. A slight modification of algorithm PowerSum allows us to obtain an approximate value for each one-sided sum within an additive error term of εA^α_θ.

Proof. The modification amounts to simply considering a subset of the bins: those that cover only the part of F_{c_j,j,s_j}(I^{j−1} \ I^j) that is larger (for f_i > θ) or smaller (for f_i ≤ θ) than F_{c_j,j,s_j}(θ). Observe that this is the same additive error we obtained when approximating the sum A^α_θ = Σ_i w(i)|f_i − θ|^α, only that it does not necessarily translate into a small relative error in the one-sided case.
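The identities above are easy to check directly. The following minimal sketch (plain Python on hypothetical (value, weight) pairs; it is not the distributed PowerSum algorithm, which only approximates these sums) verifies that the pure power sum about θ equals A^{α,+}_θ + A^{α,−}_θ for even α and A^{α,+}_θ − A^{α,−}_θ for odd α.

```python
# Minimal check of the one-sided power-sum identities on hypothetical items.
# Items are (value, weight) pairs; theta and alpha are as in the text.

def one_sided_sums(items, theta, alpha):
    """Return (A_plus, A_minus), the one-sided power sums about theta."""
    a_plus = sum(w * abs(f - theta) ** alpha for f, w in items if f > theta)
    a_minus = sum(w * abs(f - theta) ** alpha for f, w in items if f <= theta)
    return a_plus, a_minus

items = [(2.0, 1.0), (0.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
theta = 1.5
for alpha in (1, 2, 3, 4):
    a_plus, a_minus = one_sided_sums(items, theta, alpha)
    pure = sum(w * (f - theta) ** alpha for f, w in items)         # M^alpha_theta
    absolute = sum(w * abs(f - theta) ** alpha for f, w in items)  # A^alpha_theta
    assert abs((a_plus + a_minus) - absolute) < 1e-9               # A = A+ + A-
    expected = a_plus - a_minus if alpha % 2 else a_plus + a_minus
    assert abs(expected - pure) < 1e-9                             # M = A+ -/+ A-
```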
Corollary 6.1. For each θ and α ∈ [a_α, b_α], M^α_θ can be estimated from the summaries to within an additive term of εA^α_θ.

Proof. M^α_θ is a sum or difference of two one-sided sums. Each one-sided sum can be estimated to within an additive term of (ε/2)A^α_θ. Thus, their sum or difference can be estimated to within an additive term of εA^α_θ.

5.1 When values are unrestricted Our presentation assumed that an upper bound R on the maximum value M is known to all nodes. This assumption can be dropped by using the sum S of all values in the system: we can then use R = S. (The sum S is at most nM, and thus log S ≤ log n + log M.) Alternatively, we can perform log(Mn/S) count computations, where the ith computation counts the number of values that are larger than (S/n)2^i. As a result, we obtain an estimate on M within a factor of 2. Another natural question is whether we can remove the dependence on R altogether (and allow for an exponentially large range of values). A simple construction (that mimics one given for spatially-decaying sums [5]) shows that the dependence is inherent.

6 Central moments We now show how approximate values of M^α_{µ,g} for even integral α ≥ 2 can be retrieved from the summaries. For brevity, since the decay function g is clear from context, we omit it from the subscripts.

An important ingredient we need is obtaining a value θ such that the absolute α-power sum about θ is within some constant factor of the respective absolute central power sum, that is, A^α_θ = O(A^α_µ). It is easy to see that an approximate mean (with relative error) does not necessarily possess this property but, fortunately, an approximate median will do (as proved in the following lemma). It is folklore knowledge that a random value (drawn according to the weights w(i)) has a constant probability of being an approximate median, and that this probability can be increased arbitrarily by selecting the median of a constant number of random samples. An efficient algorithm for obtaining such spatially-decaying random samples is given in [5].
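The folklore claim can be illustrated with a centralized stand-in (the spatially-decaying samples themselves come from [5]; the unit-weight values, the constant c = 1/4, and the trial counts below are all hypothetical choices for the illustration):

```python
# Toy check: a value drawn at random (uniformly here, i.e., unit weights w(i))
# is a c-approximate median with constant probability, and taking the median
# of 5 draws succeeds more often. Seeded so the run is deterministic.
import random

def is_approx_median(m, values, c=0.25):
    """m is a c-approximate median if weight >= c*W lies on each side of it."""
    at_most = sum(1 for v in values if v <= m)
    at_least = sum(1 for v in values if v >= m)
    return at_most >= c * len(values) and at_least >= c * len(values)

random.seed(7)
values = list(range(100))
trials = 1000
single = sum(is_approx_median(random.choice(values), values)
             for _ in range(trials)) / trials
median_of_5 = sum(
    is_approx_median(sorted(random.choice(values) for _ in range(5))[2], values)
    for _ in range(trials)) / trials
assert 0.4 < single < 0.65    # one draw: constant success probability (~0.52)
assert median_of_5 > single   # median-of-samples boosts the probability
```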
The challenge in approximating A^α_µ is that µ is not known to us. It can be approximated within a relative error using the sum and count aggregates, but a relative-error estimate on µ is not sufficient for obtaining a relative-error estimate on A^α_µ. We start with some definitions and lemmas.

Lemma 6.2. Let m be such that Σ_{i | f_i ≤ m} w(i) ≥ cW and Σ_{i | f_i ≥ m} w(i) ≥ cW (that is, the weight of the items with value at most m is at least cW, and the weight of the items with value at least m is at least cW). Then

    A^α_m / A^α_µ ≤ 2^α (1 − c)/c .

Proof. Assume wlog that µ > m. Consider the items and their contributions to the power sums A^α_m and A^α_µ. All the values that are larger than m + 2(µ − m) have the property that their contribution to A^α_m is at most 2^α times their contribution to A^α_µ. We next consider items with value at most m. Since they are closer to m than to µ, their contribution to A^α_m is smaller than their contribution to A^α_µ. The total contribution of these items to A^α_µ is at least cW|µ − m|^α; thus A^α_µ ≥ cW|µ − m|^α. We next consider values in the interval (m, m + 2(µ − m)]. Since the total weight of the items with values at most m is at least cW, the weight of the items with values in the interval (m, m + 2(µ − m)] is at most (1 − c)W. So the contribution of the items in (m, m + 2(µ − m)] to A^α_m is at most 2^α(1 − c)W|µ − m|^α. It follows that A^α_m/A^α_µ ≤ max{2^α, 2^α(1 − c)/c} = 2^α(1 − c)/c (note that we always have c ≤ 1/2).

In particular,

    M^2_µ = M^2_m − (M^1_m)^2/W
    M^4_µ = M^4_m − 4M^3_m M^1_m/W + 6M^2_m (M^1_m)^2/W^2 − 3(M^1_m)^4/W^3 .

We let θ be an approximate median m as in Lemma 6.2, and estimate the central moment through the polynomial in the raw moments about θ = m (Equ. 6.4), by plugging in, for each M^i_m, our estimated quantity M̂^i_m, and for W the (1 ± ε)-approximate Ŵ.
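The two identities above can be checked numerically; the sketch below uses hypothetical weighted items and plugs in exact raw moments about an arbitrary origin m (the algorithm would use the estimates M̂^i_m instead):

```python
# Numerical check of the raw-to-central moment identities on hypothetical
# (value, weight) items: the central 2nd and 4th moments about the weighted
# mean mu are recovered from raw moments about an arbitrary origin m.

def power_sum(items, origin, k):
    """M^k_origin = sum_i w(i) * (f_i - origin)^k over (value, weight) pairs."""
    return sum(w * (f - origin) ** k for f, w in items)

items = [(2.0, 1.0), (0.0, 2.0), (3.0, 1.0), (1.0, 2.0)]
W = sum(w for _, w in items)
mu = sum(w * f for f, w in items) / W   # weighted mean
m = 1.0                                 # stand-in for an approximate median

M1, M2, M3, M4 = (power_sum(items, m, k) for k in (1, 2, 3, 4))

central2 = M2 - M1 ** 2 / W
central4 = M4 - 4 * M3 * M1 / W + 6 * M2 * M1 ** 2 / W ** 2 - 3 * M1 ** 4 / W ** 3

assert abs(central2 - power_sum(items, mu, 2)) < 1e-9   # matches M^2_mu
assert abs(central4 - power_sum(items, mu, 4)) < 1e-9   # matches M^4_mu
```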
Lemma 6.4. The additive error we obtain in our estimate is O(εA^α_µ).

Proof. For each k, let ∆^k_m = M̂^k_m − M^k_m denote the error in our estimate of M^k_m, and consider the difference between an estimated term and the corresponding exact term of the polynomial. The difference can be rewritten as (M^k_m + ∆^k_m)(M^1_m + ∆^1_m)^{α−k} − M^k_m(M^1_m)^{α−k}. If we expand this expression, the term M^k_m(M^1_m)^{α−k} cancels out and we obtain a polynomial P(M^k_m, M^1_m, ∆^k_m, ∆^1_m) that consists of a positive linear combination of products of M^k_m, M^1_m, ∆^k_m, and ∆^1_m. Recall now that |M^i_m| ≤ A^i_m and that |∆^i_m| = |M̂^i_m − M^i_m| ≤ εA^i_m (for all i, from Corollary 6.1). Therefore, if we replace each appearance of M^i_m by A^i_m and each ∆^i_m by εA^i_m, we can only increase the value of the expression.

The claim will follow by applying this inequality with the following parameters. Let W = Σ_i w(i) as defined earlier, and assume wlog that the values |f_i − θ| are ordered by magnitude. Let the step function h_j(x) be defined on the interval [0, W] as follows: h_j(x) = |f_i − θ|^{α_j} for Σ_{k<i} w(k) ≤ x < Σ_{k≤i} w(k).

These constructions extend to more general distance functions of the form

    ||f_i − θ|| = (Σ_{j=1}^d |f_{ij} − θ_j|^{p_j})^{p^{−1}} ,

where p_j > 0 and p > 0 (from a fixed range); in particular, this generalizes L_p norms. We are interested in summaries that would yield approximate values of Σ_i w(i)||f_i − θ||^α.⁸ We sketch the extension of our d = 1 construction to d > 1. The size of the summaries is polylogarithmic in the number of items, but the dependence on the dimension d is exponential. Each of our d-dimensional foldings maps the domain into a smaller d-dimensional range cube. The widths (edge lengths) of these range cubes are exponentially decreasing. For each width we use (2S)^d different foldings. For each possible selection (out of 2^d) of zero or half tile-width shifts, we consider a partition into sub-cubes according to the width. For each partition, we derive S^d foldings, where each folding includes a subset of the sub-cubes that are spaced S sub-cubes apart. Each folding then maps all included sub-cubes into a range cube of the respective width. The range cube of each folding is then partitioned uniformly into B^d sub-cubes (bins), and each bin corresponds to a predicate. We thus use O((2BS)^d log R) predicates.

7.3 k-medians We next consider summaries that, for any set of points θ_1, . . .
, θ_k, obtain an approximate value of F(θ_1, . . . , θ_k) = Σ_i w(i) min_{1≤j≤k} ||f_i − θ_j||. This is relevant for computing the k-median, defined as arg min_{θ_1,...,θ_k} F(θ_1, . . . , θ_k).

⁸ For each such "norm" it is interesting to consider the median arg min_θ Σ_i w(i)||f_i − θ||. Since our summaries can obtain an estimate for any θ, they can be used to estimate this median.

References

[1] In Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002). ACM, 2002.

[2] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proc. of the 2003 ACM Symp. on Principles of Database Systems (PODS 2003). ACM, 2003.

[3] M. Charikar, L. O'Callaghan, and R. Panigrahy. Better streaming algorithms for clustering problems. In Proc. 35th Annual ACM Symposium on Theory of Computing, pages 30–39. ACM, 2003.

[4] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441–453, 1997.

[5] E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: model and algorithms. In preparation, 2003.

[6] E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. In Proc. of the 2003 ACM Symp. on Principles of Database Systems (PODS 2003). ACM, 2003.

[7] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. SIAM J. Comput., 31(6):1794–1813, 2002.

[8] P. B. Gibbons and S. Tirthapura. Distributed streams algorithms for sliding windows. In Proc. of the 14th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 63–72. ACM, 2002.

[9] I. A. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series, and Products. Academic Press, San Diego, CA, 6th edition, 2000.

[10] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, second edition, 1984.