Density estimation in linear time

Satyaki Mahalanabis and Daniel Štefankovič
Department of Computer Science, University of Rochester, Rochester, NY 14627
{smahalan,stefanko}@cs.rochester.edu
(Supported by NSF grant IIS-0546554.)

Abstract

We consider the problem of choosing a density estimate from a set of densities $F$, minimizing the $L_1$-distance to an unknown distribution. Devroye and Lugosi [DL01] analyze two algorithms for the problem: the Scheffé tournament winner and the minimum distance estimate. The Scheffé tournament estimate requires fewer computations than the minimum distance estimate, but has strictly weaker guarantees than the latter.

We focus on the computational aspect of density estimation. We present two algorithms, both with the same guarantee as the minimum distance estimate. The first one, a modification of the minimum distance estimate, uses the same number (quadratic in $|F|$) of computations as the Scheffé tournament. The second one, called the "efficient minimum loss-weight estimate," uses only a linear number of computations, assuming that $F$ is preprocessed.

We then apply our algorithms to bandwidth selection for kernel estimates and bin-width selection for histogram estimates, yielding efficient procedures for these problems. We also give examples showing that the guarantees of the algorithms cannot be improved, and explore randomized algorithms for density estimation.

1 Introduction

We study the following density estimation problem considered in [DL96, DL01, DGL02]. There is an unknown distribution $g$ and we are given $n$ (not necessarily independent) samples which define an empirical distribution $h$. Given a finite class $F$ of densities, our objective is to output $f \in F$ such that the error $\|f - g\|_1$ is minimized. The use of the $L_1$-norm is well justified because it has many useful properties, for example, scale invariance and the fact that approximate identification of a distribution in the $L_1$-norm gives an estimate for the probability of every event.

The following two parameters influence the error of a possible estimate: the distance of $g$ from $F$ and the empirical error. The first parameter is required since we have no control over $F$, and hence we cannot select a density which is better than the "optimal" density in $F$, that is, the one closest to $g$ in the $L_1$-norm. It is not obvious how to define the second parameter, the error of $h$ with respect to $g$. We follow the definition of [DL01], which is inspired by [Yat85] (see Section 1.1 for a precise definition).

Devroye and Lugosi [DL01] analyze two algorithms in this setting: the Scheffé tournament winner and the minimum distance estimate. The minimum distance estimate, defined by Yatracos [Yat85], is a special case of the minimum distance principle, formalized by Wolfowitz in [Wol57]. It is a general density estimation tool which has been applied, for example, by [DL96, DL97] to the bandwidth selection problem for kernels and by [DL04, DL01] to bin-width selection for histograms. The minimum distance estimate also finds application in hypothesis testing [DGL02]. The Scheffé tournament winner algorithm requires fewer computations than the minimum distance estimate, but it has strictly weaker guarantees (in terms of the two parameters mentioned above) than the latter.

Our main contributions are two procedures for selecting an estimate from $F$, both of which have the same guarantees as the minimum distance estimate, but are computationally more efficient.
The first has a quadratic (in $|F|$) cost, matching the cost of the Scheffé tournament winner algorithm. The second one is even faster, using linearly many (in $|F|$) computations (after preprocessing $F$).

We also apply our estimation procedures to the problem of bandwidth selection for kernels and to that of bin-width selection for histograms, following [DL01, DL96, DL97, DL04]. We show that in each of these applications the "efficient minimum loss-weight estimate" is faster than our "modified minimum distance estimate," which in turn is faster than the minimum distance estimate.

Now we outline the rest of the paper. In Section 1.1 we give the required definitions and introduce the notion of a test-function (a variant of a Scheffé set). Then, in Section 1.2, we restate the previous density estimation algorithms (Scheffé tournament winner and the minimum distance estimate) using test-functions. Next, in Section 2, we present our algorithms. In Section 3 we discuss two widely studied nonparametric estimation problems where the computational cost of the efficient minimum loss-weight estimate (including preprocessing) is much smaller than that of both the modified minimum distance and the minimum distance estimates. In Section 4 we explore randomized density estimation algorithms. In the final Section 5, we give examples showing tightness of the theorems stated in the previous sections.

Throughout this paper we focus on the case when $F$ is finite, in order to compare the computational costs of our estimates to previous ones. However, our results generalize in a straightforward way to infinite classes as well, if we ignore computational complexity.

1.1 Definitions and Notations

Throughout the paper $g$ will be the unknown distribution. We will use $h$ to denote the empirical distribution, which, given samples $X_1, X_2, \dots, X_n$, is defined for each set $A$ as
$$h(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[X_i \in A].$$

Let $F$ be a set of densities. We will assume that $F$ is finite. Let $d_1(g, F)$ be the $L_1$-distance of $g$ from $F$, that is, $\min_{f \in F} \|f - g\|_1$.

Given two functions $f_i, f_j$ on $\Omega$ (in this context, densities) we define a test-function $T_{ij} : \Omega \to \{-1, 0, 1\}$ to be the function $T_{ij}(x) = \mathrm{sgn}(f_i(x) - f_j(x))$. Note that $T_{ij} = -T_{ji}$. We also define $\mathcal{T}_F$ to be the set of all test-functions for $F$, that is,
$$\mathcal{T}_F = \{\, T_{ij} \,;\, f_i, f_j \in F \,\}.$$
Let $\cdot$ be the inner product for functions on $\Omega$, defined for any two functions $f, f'$ as $f \cdot f' = \int f f'$. Note that
$$(f_i - f_j) \cdot T_{ij} = \|f_i - f_j\|_1.$$

We say that $f_i$ wins against $f_j$ if
$$(f_i - h) \cdot T_{ij} < (f_j - h) \cdot T_{ji}. \qquad (1)$$
Note that either $f_i$ wins against $f_j$, or $f_j$ wins against $f_i$, or there is a draw (that is, there is equality in (1)). We will say that $f_i$ loses to $f_j$ if
$$(f_i - h) \cdot T_{ij} \ge (f_j - h) \cdot T_{ji},$$
that is, if $f_i$ does not win against $f_j$.

The algorithms choose an estimate $f \in F$ using the empirical distribution $h$. The $L_1$-distance of the estimates from the unknown distribution $g$ will depend on the following measure of distance between the empirical and the unknown distribution:
$$\Delta := \max_{T \in \mathcal{T}_F} (g - h) \cdot T. \qquad (2)$$

We use the inner products of the empirical distribution $h$ with the test-functions to choose an estimate, which is a density from $F$. In this paper we only consider algorithms which make their decisions purely on inner products of the test-functions with $h$ and with members of $F$. It is reasonable to assume that the computation of an inner product takes significant time. Hence we measure the computational cost of an algorithm by the number of inner products used.

Devroye and Lugosi work with the Scheffé sets $A_{ij} = \{x \,;\, f_i(x) > f_j(x)\}$ rather than test-functions; they say that $f_i$ wins against $f_j$ if
$$\left| \int_{A_{ij}} f_i - h(A_{ij}) \right| < \left| \int_{A_{ij}} f_j - h(A_{ij}) \right|. \qquad (3)$$
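To make these definitions concrete, here is a small illustrative sketch (not part of the paper) in Python, for densities on a finite domain, where every inner product is a finite sum; the particular densities and sample size below are made up.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two densities on the finite domain {0, ..., 4}, represented as probability vectors.
    f_i = np.array([0.10, 0.30, 0.20, 0.25, 0.15])
    f_j = np.array([0.25, 0.05, 0.30, 0.20, 0.20])

    # Test-function T_ij(x) = sgn(f_i(x) - f_j(x)), taking values in {-1, 0, 1}.
    T_ij = np.sign(f_i - f_j)

    # For discrete densities the inner product f . T is a finite sum.
    def inner(f, T):
        return float(np.dot(f, T))

    # Sanity check: (f_i - f_j) . T_ij equals the L1-distance ||f_i - f_j||_1.
    assert abs(inner(f_i - f_j, T_ij) - np.abs(f_i - f_j).sum()) < 1e-12

    # Empirical distribution h from n samples drawn from some unknown g (here: f_j).
    n = 1000
    samples = rng.choice(len(f_j), size=n, p=f_j)
    h = np.bincount(samples, minlength=len(f_j)) / n

    # h . T_ij is the kind of quantity the estimation algorithms are allowed to query.
    print("h . T_ij =", inner(h, T_ij))
    print("(f_i - h) . T_ij =", inner(f_i - h, T_ij))
    print("(f_j - h) . T_ji =", inner(f_j - h, -T_ij))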
The advantage of using Scheffé sets is that for a concrete set $F$ of densities one can immediately use the theory of Vapnik-Chervonenkis dimension [VC71] for the family of Scheffé sets of $F$ (this family is called the Yatracos class of $F$) to obtain a bound on the empirical error. If $h, f_i, f_j$ are non-negative and integrate to 1, then condition (1) is equivalent to (3) (to see this, recall that $T_{ij} = -T_{ji}$, and add $(f_i - h) \cdot \mathbf{1} = (h - f_j) \cdot \mathbf{1}$ to (1), where $\mathbf{1}$ is the constant one function on $\Omega$). Thus, in our algorithms the test-functions can be replaced by Scheffé sets and VC-dimension arguments can be applied. We chose to use test-functions for two reasons: first, they allow us to give succinct proofs of our theorems (especially Theorem 8), and second, they immediately extend to the case when the members of $F$ do not correspond to distributions (cf., e.g., Exercise 6.2 in [DL01]).

Remark 1 Note that our value of $\Delta$, defined in terms of $\mathcal{T}_F$, is at most twice the $\Delta$ used in [DL01], which is defined in terms of Scheffé sets.

1.2 Previous Estimates

In this section we restate the two algorithms for density estimation from Chapter 6 of [DL01] using test-functions. The first algorithm requires less computation but has worse guarantees than the second algorithm.

Algorithm 1 - Scheffé tournament winner.
Output $f \in F$ with the most wins (ties broken arbitrarily).

Theorem 2 ([DL01], Theorem 6.2) Let $f_1 \in F$ be the density output by Algorithm 1. Then
$$\|f_1 - g\|_1 \le 9\, d_1(g, F) + 8\Delta.$$
The number of inner products used by Algorithm 1 is $O(|F|^2)$.

Algorithm 2 - Minimum distance estimate.
Output $f \in F$ that minimizes
$$\max \left\{\, |(f - h) \cdot T_{ij}| \,;\, f_i, f_j \in F \,\right\}. \qquad (4)$$

Theorem 3 ([DL01], Theorem 6.3) Let $f_1$ be the density output by Algorithm 2. Then
$$\|f_1 - g\|_1 \le 3\, d_1(g, F) + 2\Delta.$$
The number of inner products used by Algorithm 2 is $O(|F|^3)$.

Let us point out that Theorems 6.2 and 6.3 in [DL01] require that each $f \in F$ correspond to a distribution, that is, $\int f = 1$. Since we use test-functions in the algorithms instead of Scheffé-set based comparisons, the assumption $\int f = 1$ is not actually needed in the proofs of Theorems 6.2 and 6.3 (we skip the proof), and it is not used in the proofs of Theorems 4 and 8.
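For comparison with the two algorithms above, the following Python sketch implements both estimates for discrete densities given as probability vectors; the function names and the toy data are ours, not the paper's.

    import numpy as np

    def wins(fi, fj, h):
        """fi wins against fj if (fi - h).Tij < (fj - h).Tji (strict), cf. (1)."""
        T = np.sign(fi - fj)          # T_ij; note T_ji = -T_ij
        return np.dot(fi - h, T) < np.dot(fj - h, -T)

    def scheffe_tournament_winner(F, h):
        """Algorithm 1: output the density with the most wins (ties broken arbitrarily)."""
        n_wins = [sum(wins(fi, fj, h) for j, fj in enumerate(F) if j != i)
                  for i, fi in enumerate(F)]
        return int(np.argmax(n_wins))

    def minimum_distance_estimate(F, h):
        """Algorithm 2: output f minimizing the max over all pairs of |(f - h).Tij|, cf. (4)."""
        tests = [np.sign(fi - fj) for fi in F for fj in F]   # all test-functions of F
        scores = [max(abs(np.dot(f - h, T)) for T in tests) for f in F]
        return int(np.argmin(scores))

    # Tiny example: three made-up densities on {0,...,4}; h is an empirical distribution.
    rng = np.random.default_rng(1)
    F = [np.array([0.10, 0.30, 0.20, 0.25, 0.15]),
         np.array([0.25, 0.05, 0.30, 0.20, 0.20]),
         np.array([0.20, 0.20, 0.20, 0.20, 0.20])]
    samples = rng.choice(5, size=2000, p=F[1])
    h = np.bincount(samples, minlength=5) / 2000
    print("tournament winner:", scheffe_tournament_winner(F, h))
    print("minimum distance estimate:", minimum_distance_estimate(F, h))

Note that the tournament winner examines one inner product with $h$ per pair (quadratically many), while the minimum distance estimate examines all $|F|^2$ test-functions for each candidate, giving the cubic count of Theorem 3.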
2 Our estimators

2.1 A variant of the minimum distance estimate

The following modified minimum distance estimate uses only $O(|F|^2)$ computations, as compared to the $O(|F|^3)$ computations used by Algorithm 2 (the maximum in equation (5) is over $O(|F|)$ terms, whereas the maximum in equation (4) is over $O(|F|^2)$ terms), but, as we show in Theorem 4, it gives us the same guarantee as the minimum distance estimate.

Algorithm 3 - Modified minimum distance estimate.
Output $f_i \in F$ that minimizes
$$\max \left\{\, |(f_i - h) \cdot T_{ij}| \,;\, f_j \in F \,\right\}. \qquad (5)$$

Theorem 4 Let $f_1 \in F$ be the density output by Algorithm 3. Then
$$\|f_1 - g\|_1 \le 3\, d_1(g, F) + 2\Delta.$$
The number of inner products used by Algorithm 3 is $O(|F|^2)$.

Proof: Let $f_1 \in F$ be the function output by Algorithm 3. Let $f_2 = \mathrm{argmin}_{f \in F} \|f - g\|_1$. By the triangle inequality we have
$$\|f_1 - g\|_1 \le \|f_1 - f_2\|_1 + \|f_2 - g\|_1. \qquad (6)$$
We bound $\|f_1 - f_2\|_1$ as follows:
$$\|f_1 - f_2\|_1 = (f_1 - f_2) \cdot T_{12} \le |(f_1 - h) \cdot T_{12}| + |(f_2 - h) \cdot T_{12}| \le |(f_1 - h) \cdot T_{12}| + \max_{f_j \in F} |(f_2 - h) \cdot T_{2,j}|,$$
where in the last inequality we used the fact that $T_{12} = -T_{21}$. By the criterion for selecting $f_1$ we have $|(f_1 - h) \cdot T_{12}| \le \max_{f_j \in F} |(f_2 - h) \cdot T_{2,j}|$ (since otherwise $f_2$ would be selected). Hence
$$\|f_1 - f_2\|_1 \le 2 \max_{f_j \in F} |(f_2 - h) \cdot T_{2,j}| \le 2 \max_{f_j \in F} |(f_2 - g) \cdot T_{2,j}| + 2 \max_{f_j \in F} |(g - h) \cdot T_{2,j}| \le 2 \|f_2 - g\|_1 + 2 \max_{T \in \mathcal{T}_F} |(g - h) \cdot T| \le 2 \|f_2 - g\|_1 + 2\Delta.$$
Combining the last inequality with (6) we obtain $\|f_1 - g\|_1 \le 3 \|f_2 - g\|_1 + 2\Delta$.

Remark 5 Note that the argument above only requires that $g$ and $h$ be "close" with respect to the test-functions for the "best" function in the class, that is, only the $|(g - h) \cdot T_{2,j}|$ need to be small (where $f_2$ is $\mathrm{argmin}_{f \in F} \|f - g\|_1$).
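Before moving on, here is a corresponding sketch of Algorithm 3, under the same discrete-domain conventions as the earlier snippets (the function name and toy data are again ours):

    import numpy as np

    def modified_minimum_distance(F, h):
        """Algorithm 3 (sketch): output f_i minimizing max over f_j of |(f_i - h).T_ij|, cf. (5).
        Only the O(|F|) test-functions of the candidate f_i are examined, so the total
        number of inner products is O(|F|^2) instead of O(|F|^3)."""
        scores = []
        for i, fi in enumerate(F):
            score = max(abs(np.dot(fi - h, np.sign(fi - fj)))
                        for j, fj in enumerate(F) if j != i)
            scores.append(score)
        return int(np.argmin(scores))

    # Toy usage with made-up densities on {0,...,4}.
    rng = np.random.default_rng(2)
    F = [np.array([0.10, 0.30, 0.20, 0.25, 0.15]),
         np.array([0.25, 0.05, 0.30, 0.20, 0.20]),
         np.array([0.20, 0.20, 0.20, 0.20, 0.20])]
    h = np.bincount(rng.choice(5, size=2000, p=F[0]), minlength=5) / 2000
    print("modified minimum distance estimate:", modified_minimum_distance(F, h))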
One can ask whether the observation in Remark 5 can lead to improved density estimation algorithms for concrete sets of densities. The bounds on $\Delta$ (which is given by (2)) are often based on the VC-dimension of the Yatracos class of $F$. Recall that the Yatracos class $Y$ is the set of $A_{ij} = \{x \,;\, f_i(x) > f_j(x)\}$ for all $f_i, f_j \in F$. Remark 5 implies that instead of the Yatracos class it is enough to consider the set $Y_i = \{A_{ij} \,;\, f_j \in F\}$ for $f_i \in F$. Is it possible that the VC-dimension of each set $Y_i$ is smaller than the VC-dimension of the Yatracos class $Y$? The following (artificial) example shows that this can, indeed, be the case.

Let $\Omega = \{0, \dots, n\}$. For each $(n+1)$-bit binary string $a_0, a_1, \dots, a_n$, consider the distribution
$$P(k) = \frac{\big(1 + (1/2 - a_0)(1/2 - a_k)\big)\, 2^{-\sum_{j=1}^{n} a_j 2^{j}}}{4n}, \qquad k \in \{1, \dots, n\},$$
with $P(0)$ chosen to make $P$ into a distribution. For this family of $2^{n+1}$ distributions the VC-dimension of the Yatracos class is $n$, whereas each $Y_i$ has VC-dimension 1 (since a pair of distributions $f_i, f_j$ has a non-trivial set $A_{ij}$ if and only if their binary strings differ only in the first bit).

2.2 An even more efficient estimator: minimum loss-weight

In this section we present an estimator which, after preprocessing $F$, uses only $O(|F|)$ inner products to obtain a density estimate. The guarantees of the estimate are the same as for Algorithms 2 and 3. The algorithm uses the following quantity to choose the estimate:
$$\text{loss-weight}(f) = \max \left\{\, \|f - f'\|_1 \,;\, f \text{ loses to } f' \in F \,\right\}.$$

Intuitively, a good estimate should have small loss-weight (ideally the loss-weight of the estimate would be $-\infty = \max\{\}$, that is, the estimate would not lose at all). Thus the following algorithm is a natural candidate for a good density estimator (and, indeed, it has a guarantee matching Algorithms 2 and 3), but, unfortunately, we do not know how to implement it using $O(|F|)$ inner products.

Algorithm 4a - Minimum loss-weight estimate.
Output $f \in F$ that minimizes loss-weight($f$).

The next algorithm seems less natural than Algorithm 4a, but its condition can be implemented using only $O(|F|)$ inner products.

Algorithm 4b - Efficient minimum loss-weight estimate.
Output $f \in F$ such that for every $f'$ to which $f$ loses we have
$$\|f - f'\|_1 \le \text{loss-weight}(f'). \qquad (7)$$

Before we delve into the proof of (8), let us see how Algorithm 4b can be made to use $|F| - 1$ inner products. We preprocess $F$ by computing the $L_1$-distances between all pairs of densities in $F$ and store the distances in a list sorted in decreasing order. When the algorithm is presented with the empirical distribution $h$, all it needs to do is perform comparisons between select pairs of densities. The advantage is that we preprocess $F$ only once and, for each new empirical distribution, we only compute the inner products necessary for the comparisons. We will compute the estimate as follows.

Detailed version of Algorithm 4b - using $O(|F|)$ inner products.
input: family of densities $F$; list $L$ of all pairs $\{f_i, f_j\}$ sorted in decreasing order by $\|f_i - f_j\|_1$; oracle for computing inner products $h \cdot T_{ij}$.
output: $f \in F$ such that for every $f'$ to which $f$ loses, $\|f - f'\|_1 \le \text{loss-weight}(f')$.
1  $S \leftarrow F$
2  repeat
3      pick the first edge $\{f_i, f_j\}$ in $L$
4      if $f_i$ loses to $f_j$ then $f' \leftarrow f_i$ else $f' \leftarrow f_j$ fi
5      remove $f'$ from $S$
6      remove the pairs containing $f'$ from $L$
7  until $|S| = 1$
8  output the density in $S$

Note that while Algorithm 4b uses only $O(|F|)$ inner products, its running time is actually $\Theta(|F|^2)$, since it traverses a list of length $\Theta(|F|^2)$. Are we cheating? There are two answers to this question: practical and theoretical. As we will see in the applications, the inner products dominate the computation, justifying our focus on just the inner products (of which there are linearly many). Theoretically, if we are willing to spend exponential time on the preprocessing, we can build the complete decision tree corresponding to Algorithm 4b and obtain a linear-time density selection procedure. We find the following question interesting: is it possible to achieve linear running time using only polynomial-time preprocessing?

Question 6 (Tournament Revelation Problem) We are given a weighted undirected complete graph on $n$ vertices. Assume that the edge-weights are distinct. We preprocess the weighted graph and then play the following game with an adversary until only one vertex remains: we report the edge with the largest weight, and the adversary chooses one of the endpoints of the edge and removes it from the graph (together with all the adjacent edges). Our goal is to make the computational cost during the game linear (in $n$) in the worst case (over the adversary's moves). Is it possible to achieve this goal with polynomial-time preprocessing?
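Returning to the detailed version of Algorithm 4b, the following Python sketch mirrors it for discrete densities; the helper names and the toy inputs are ours, not the paper's.

    import numpy as np
    from itertools import combinations

    def loses(fi, fj, h):
        """fi loses to fj if (fi - h).Tij >= (fj - h).Tji, i.e. fi does not win (cf. (1))."""
        T = np.sign(fi - fj)
        return np.dot(fi - h, T) >= np.dot(fj - h, -T)

    def preprocess(F):
        """Sort all pairs {i, j} by decreasing L1-distance (done once, independent of h)."""
        pairs = [(np.abs(F[i] - F[j]).sum(), i, j)
                 for i, j in combinations(range(len(F)), 2)]
        pairs.sort(reverse=True)
        return pairs

    def efficient_min_loss_weight(F, h, sorted_pairs):
        """Detailed version of Algorithm 4b (sketch): repeatedly take the farthest surviving
        pair and drop the loser; the survivor is the estimate.  Uses |F| - 1 comparisons,
        i.e. O(|F|) inner products with h."""
        alive = set(range(len(F)))
        for _, i, j in sorted_pairs:
            if len(alive) == 1:
                break
            if i not in alive or j not in alive:
                continue
            loser = i if loses(F[i], F[j], h) else j
            alive.remove(loser)
        return alive.pop()

    # Tiny example with made-up densities on {0,...,4}.
    rng = np.random.default_rng(3)
    F = [np.array([0.10, 0.30, 0.20, 0.25, 0.15]),
         np.array([0.25, 0.05, 0.30, 0.20, 0.20]),
         np.array([0.20, 0.20, 0.20, 0.20, 0.20]),
         np.array([0.40, 0.10, 0.10, 0.10, 0.30])]
    h = np.bincount(rng.choice(5, size=2000, p=F[2]), minlength=5) / 2000
    print("efficient minimum loss-weight estimate:",
          efficient_min_loss_weight(F, h, preprocess(F)))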
We now show that the detailed version of Algorithm 4b outputs $f$ satisfying the required condition.

Lemma 7 The estimate $f$ output by the detailed version of Algorithm 4b satisfies (7) for every $f'$ to which $f$ loses.

Proof: We show, using induction, that the following invariant is always satisfied on line 2: for any $f \in S$ and any $f' \in F \setminus S$ we have that if $f$ loses to $f'$ then $\|f - f'\|_1 \le \text{loss-weight}(f')$. Initially, $F \setminus S$ is empty and the invariant is trivially true. For the inductive step, let $f'$ be the density most recently removed from $S$. To prove the induction step we only need to show that for every $f \in S$, if $f$ loses to $f'$ then $\|f - f'\|_1 \le \text{loss-weight}(f')$. Let $W$ be the largest $L_1$-distance between two densities in $S \cup \{f'\}$. Then $\text{loss-weight}(f') \ge W$ (since $f'$ lost to the other endpoint of the first edge in $L$, which has weight $W$), and $\|f - f'\|_1 \le W$ (by the definition of $W$).

Theorem 8 Let $f_1 \in F$ be the density output by Algorithm 4a (or Algorithm 4b). Then
$$\|f_1 - g\|_1 \le 3\, d_1(g, F) + 2\Delta. \qquad (8)$$
Assume that we are given the $L_1$-distances between every pair in $F$. Then the number of inner products used by Algorithm 4b is $O(|F|)$.

Proof of Theorem 8: Let $f_4 = g$. Let $f_2$ be the function $f \in F$ minimizing $\|g - f\|_1$. We can reformulate our goal (8) as follows:
$$(f_1 - f_4) \cdot T_{14} \le 2\Delta + 3 (f_2 - f_4) \cdot T_{24}. \qquad (9)$$
Let $f_3 \in F$ be the function $f \in F$ such that $f_2$ loses against $f$ and $\|f_2 - f\|_1$ is maximal (there must be at least one function to which $f_2$ loses, otherwise the algorithm would pick $f_2$ and we would be done). Note that $f_1, f_2, f_3 \in F$, but $f_4$ does not need to be in $F$.

We know that $f_2$ loses against $f_3$, that is, we have (see (1))
$$2 h \cdot T_{23} \le f_2 \cdot T_{23} + f_3 \cdot T_{23}, \qquad (10)$$
and, since $f_1$ satisfies (7), we also have
$$(f_1 - f_2) \cdot T_{12} \le (f_2 - f_3) \cdot T_{23} \qquad (11)$$
(if $f_1$ does not lose to $f_2$, then $f_2$ loses to $f_1$ and (11) follows directly from the definition of loss-weight($f_2$)). By (2) we have
$$2 (f_4 - h) \cdot T_{23} \le 2\Delta. \qquad (12)$$
Adding (10), (11), and (12) we obtain
$$2 (f_2 - f_4) \cdot T_{23} + (f_2 - f_1) \cdot T_{12} + 2\Delta \ge 0. \qquad (13)$$
Note that for any $i, j, k, \ell$ we have
$$(f_i - f_j) \cdot (T_{ij} - T_{k\ell}) \ge 0, \qquad (14)$$
since if $f_i(x) > f_j(x)$ then $T_{ij}(x) - T_{k\ell}(x) \ge 0$, if $f_i(x) < f_j(x)$ then $T_{ij}(x) - T_{k\ell}(x) \le 0$, and if $f_i(x) = f_j(x)$ then the contribution of that $x$ is zero. By applying (14) four times we obtain
$$(f_2 - f_4) \cdot (3 T_{24} - 2 T_{23} - T_{14}) + (f_1 - f_2) \cdot (T_{12} - T_{14}) \ge 0. \qquad (15)$$
Finally, adding (13) and (15) yields (9).

Remark 9 Note that Remark 5 also applies to Algorithms 4a and 4b, since (12) is the only inequality in which $\Delta$ is used.

Lemma 10 If the condition (7) of Algorithm 4b is relaxed to
$$\|f - f'\|_1 \le C \cdot \text{loss-weight}(f') \qquad (16)$$
for every $f'$ to which $f$ loses, for some $C \ge 1$, then an analogue of Theorem 8 holds with (8) replaced by
$$\|f_1 - g\|_1 \le (1 + 2C)\, d_1(g, F) + 2C\Delta. \qquad (17)$$

Proof: The proof is almost identical to the proof of Theorem 8. Let $f_4 = g$. Let $f_2$ be the function $f \in F$ minimizing $\|g - f\|_1$. We can reformulate our goal (17) as follows:
$$(f_1 - f_4) \cdot T_{14} \le 2C\Delta + (1 + 2C)(f_2 - f_4) \cdot T_{24}. \qquad (18)$$
Let $f_3 \in F$ be the function $f \in F$ such that $f_2$ loses against $f$ and $\|f_2 - f\|_1$ is maximal (there must be at least one function to which $f_2$ loses, otherwise the algorithm would pick $f_2$ and we would be done). Note that $f_1, f_2, f_3 \in F$, but $f_4$ does not need to be in $F$. Equations (10) and (12) from the proof of Theorem 8 are satisfied here as well. Since $f_1$ satisfies (16), we also have
$$(f_1 - f_2) \cdot T_{12} \le C (f_2 - f_3) \cdot T_{23}. \qquad (19)$$
Adding (10) multiplied by $C$, (19), and (12) multiplied by $C$, we obtain
$$2C (f_2 - f_4) \cdot T_{23} + (f_2 - f_1) \cdot T_{12} + 2C\Delta \ge 0. \qquad (20)$$
By applying (14) four times we obtain
$$(f_2 - f_4) \cdot \big((1 + 2C) T_{24} - 2C\, T_{23} - T_{14}\big) + (f_1 - f_2) \cdot (T_{12} - T_{14}) \ge 0. \qquad (21)$$
Finally, adding (20) and (21) yields (18).

Lemma 10 allows us to run Algorithm 4b with the distances between the densities computed only approximately, with relative error $(1 \pm \varepsilon)$, and still obtain an analogue of Theorem 8.

Corollary 11 Assume that we are given approximate $L_1$-distances between every pair in $F$ with relative error $(1 \pm \varepsilon)$. Let $f_1 \in F$ be the density output by Algorithm 4a (or Algorithm 4b), where the algorithm uses the approximate distances (instead of the true distances). Then
$$\|f_1 - g\|_1 \le \frac{3 + \varepsilon}{1 - \varepsilon}\, d_1(g, F) + \frac{2 + 2\varepsilon}{1 - \varepsilon}\, \Delta. \qquad (22)$$
The number of inner products used by Algorithm 4b is $O(|F|)$.

Proof: Let $D(f, f')$ be the approximate $L_1$-distance between $f$ and $f'$ given to the algorithm (for every pair $f, f' \in F$). Let
$$\text{loss-weight}'(f) = \max \left\{\, D(f, f') \,;\, f \text{ loses to } f' \in F \,\right\}.$$
The proof of Lemma 7 yields that the estimate $f$ output by the detailed version of Algorithm 4b (run with the distances $D$) satisfies
$$D(f, f') \le \text{loss-weight}'(f')$$
for every $f'$ to which $f$ loses. Now, using the fact that $D(f, f')$ is a $(1 \pm \varepsilon)$ approximation of $\|f - f'\|_1$, we obtain that the estimate $f$ output by Algorithm 4b satisfies
$$\|f - f'\|_1 \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \text{loss-weight}(f')$$
for every $f'$ to which $f$ loses. The corollary now follows from Lemma 10 with $C = (1 + \varepsilon)/(1 - \varepsilon)$.
3 Applications

We now describe two nonparametric density estimation problems where our estimates can be used to obtain efficient algorithms. The first of these problems is that of selecting the optimal smoothing factor for kernel estimates (Section 3.1), while the second one is that of finding an optimal bin-width for 1-dimensional histograms (Section 3.3).

3.1 Bandwidth selection for kernel estimates

We are going to show that our estimates give fast algorithms for the bandwidth selection problem for uniform kernels on $\mathbb{R}$. Given $n$ i.i.d. samples $x_1, \dots, x_n \in \mathbb{R}$ drawn from an unknown distribution $g$, the kernel estimate for $g$ is the density
$$f_{n,s}(x) = \frac{1}{ns} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{s}\right),$$
where $K$, the kernel, is a function (usually nonnegative) with $\int K = 1$ and $\int |K| < \infty$, and $s > 0$ is called the smoothing factor. For us $K$ will be the uniform distribution on $[-1, 1]$. Given $x_1, \dots, x_n$, the bandwidth selection problem is to select an $s > 0$ such that $\|f_{n,s} - g\|_1$ is close to $\inf_{s > 0} \|f_{n,s} - g\|_1$ [DL01, DL96, DL97].

The data splitting approach to bandwidth selection uses $n - m$ ($n \ge m > 0$) samples $x_1, \dots, x_{n-m}$ to define the kernel estimate $f_{n-m,s}$, and the remaining $m$ samples $x_{n-m+1}, \dots, x_n$ as a test set which defines an empirical measure $h$. Devroye and Lugosi ([DL96]) use the minimum distance estimate to give an algorithm for selecting $s$. Given $n > 0$ samples, they select $s$ from an interval $[a_n, b_n]$ (where, e.g., $a_n = e^{-n}$, $b_n = e^n$). They discretize $[a_n, b_n]$ by defining $s_1 = a_n$, $s_2 = a_n(1 + \delta_n)$, ..., $s_i = a_n(1 + \delta_n)^{i-1}$, ..., $s_N = a_n(1 + \delta_n)^{N-1}$, where $N = \lceil \ln(b_n / a_n) / \ln(1 + \delta_n) \rceil$ and $\delta_n > 0$ is a parameter. They then select $s$ to be the $s_i$ such that $f_{n-m,s_i}$ is the minimum distance estimate for $\{f_{n-m,s_i} \,;\, 1 \le i \le N\}$ and measure $h$. Their main theorem is the following.

Theorem 12 ([DL96]) Let $K$ be nonnegative, Lipschitz, and nonzero only in $[-1, 1]$. Let $a_n, b_n$ be such that $n a_n \to 0$ and $b_n \to \infty$ as $n \to \infty$. Assume that $\delta_n = \frac{c}{\sqrt{n}}$ and that $\ln \frac{b_n}{a_n} \le c' n^{a}$, where $c, c', a > 0$ are constants. If $m \to \infty$, $\frac{m}{n} \to 0$, and $\frac{n^{4/5} \ln n}{m} \to 0$ as $n \to \infty$, then the estimate $f_{n-m,s}$ satisfies
$$\sup_g \limsup_{n \to \infty} \frac{E\big[\|f_{n-m,s} - g\|_1\big]}{\inf_{s' > 0} E\big[\|f_{n,s'} - g\|_1\big]} \le 3. \qquad (23)$$

Observation 13 For $a_n, b_n, \delta_n, a$ as in Theorem 12, $N = \Theta(n^{1/2 + a})$.

We can replace the minimum distance estimate with the minimum loss-weight estimate (Algorithm 4b) in this setting. Simply define $\hat{s}$ to be the $s_i$ ($1 \le i \le N$) such that $f_{n-m,s_i}$ is the efficient minimum loss-weight estimate for $\{f_{n-m,s_i} \,;\, 1 \le i \le N\}$ and measure $h$. This requires the computation of the $L_1$-distances between all $O(N^2)$ pairs of densities. Assume, however, that the kernel $K$ is such that we are able to compute approximate estimates $D_{ij}$, $1 \le i, j \le N$, such that with probability at least $1 - \delta$, for all $i, j$,
$$(1 - \varepsilon) D_{ij} \le \|f_{n-m,s_i} - f_{n-m,s_j}\|_1 \le (1 + \varepsilon) D_{ij}. \qquad (24)$$
We can now define the approximate minimum loss-weight estimate $\tilde{s}$ in the same way we defined $\hat{s}$. In other words, $\tilde{s}$ is the $s_i$ such that Algorithm 4b outputs $f_{n-m,s_i}$ for the class $\{f_{n-m,s_i} \,;\, 1 \le i \le N\}$ and the measure $h$, except that it uses $D_{ij}$ instead of $\|f_{n-m,s_i} - f_{n-m,s_j}\|_1$ for each $i, j$. The following theorem is the analogue of Theorem 12 for both $\hat{s}$ and $\tilde{s}$.

Theorem 14 Let $K, a_n, b_n, \delta_n, a > 0, m$ be as in Theorem 12. Then $\hat{s}$ satisfies
$$\sup_g \limsup_{n \to \infty} \frac{E\big[\|f_{n-m,\hat{s}} - g\|_1\big]}{\inf_{s' > 0} E\big[\|f_{n,s'} - g\|_1\big]} \le 3. \qquad (25)$$
Moreover, if $\varepsilon \to 0$ and $\frac{\delta}{n^{-2/5}} \to 0$ as $n \to \infty$, then $\tilde{s}$ satisfies
$$\sup_g \limsup_{n \to \infty} \frac{E\big[\|f_{n-m,\tilde{s}} - g\|_1\big]}{\inf_{s' > 0} E\big[\|f_{n,s'} - g\|_1\big]} \le 3. \qquad (26)$$

The proof of Theorem 14 is identical to that of Theorem 12, except that the use of Theorem 3 needs to be replaced by Theorem 8 for (25), and by Corollary 11 for (26).

Finally, we state a lemma which shows, using ideas from [Ind06] and [LHC07], that it is indeed possible to efficiently compute approximate estimates $D_{ij}$ satisfying (24) (with confidence $1 - \delta$) when the kernel $K$ is the uniform distribution on $[-1, 1]$.

Lemma 15 Let the kernel $K$ be the uniform distribution on $[-1, 1]$. Let $\varepsilon, \delta \in (0, 1)$. Then there is a randomized algorithm which in time $O((1/\varepsilon)^2 (nN + N^2) \log(nN/\delta))$ computes $D_{ij}$ for $i, j \in [N]$ such that with probability at least $1 - \delta$ we have, for all $i, j \in [N]$,
$$(1 - \varepsilon) D_{ij} \le \|f_{n-m,s_i} - f_{n-m,s_j}\|_1 \le (1 + \varepsilon) D_{ij}.$$

Proof: Follows immediately from Lemma 17 (stated in Section 3.2), since each $f_{n-m,s_i}$ with the uniform kernel is a mixture of $n - m$ uniform densities on intervals.

Let us analyze the time required for computing $\tilde{s}$ for the uniform kernel. Let $T_{ij}$ denote the test-function for $f_{n-m,s_i}, f_{n-m,s_j}$. If we sort $x_1, \dots, x_{n-m}$ (using $O(n \log n)$ time) in the preprocessing step, then computing the inner product $f_{n-m,s_i} \cdot T_{ij}$ for any $i, j$ requires only $O(n)$ time. Computing $T_{ij}$ at any point in $\mathbb{R}$ takes $O(\log n)$ time (using a single binary search). Hence computing the inner product $h \cdot T_{ij}$ can be done in $O(m \log n)$ time. So the preprocessing time $O((1/\varepsilon)^2 (nN + N^2) \log(nN/\delta) + n \log n)$ dominates the running time of the rest of the procedure, which is $O((n + m \log n) N)$. Choosing $\varepsilon = 1/\log n$ and $\delta = 1/\sqrt{n}$ yields a running time of $O((nN + N^2)\,\mathrm{polylog}(n))$. In contrast, the modified minimum distance estimate requires $N^2 (m \log n + n)$ time, while the minimum distance estimate requires $N^3 (m \log n + n)$ time, both of which are much slower, since in Theorem 12 we have $m = \Omega(n^{4/5})$.
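To illustrate the setup of this subsection, here is a small Python sketch (not from the paper) that performs the data split, builds a bandwidth grid $s_i = a_n(1 + \delta_n)^{i-1}$, forms a uniform-kernel estimate, and evaluates one inner product $h \cdot T_{ij}$; the numerical values and function names are illustrative assumptions, not the parameters required by Theorem 12.

    import numpy as np

    rng = np.random.default_rng(4)

    # Made-up data from an unknown distribution g (here a standard normal, for illustration).
    n, m = 2000, 400
    x = rng.normal(size=n)
    train, test = x[: n - m], x[n - m :]       # n-m points for the kernel estimates, m points for h

    # Discretized bandwidth grid s_i = a_n (1 + delta_n)^(i-1) covering [a_n, b_n].
    a_n, b_n, delta_n = 1e-3, 1e3, 0.25        # toy values; the paper takes e.g. a_n = e^{-n}, b_n = e^{n}
    N = int(np.ceil(np.log(b_n / a_n) / np.log1p(delta_n)))
    s = a_n * (1 + delta_n) ** np.arange(N)

    def kernel_estimate(data, bandwidth):
        """Uniform-kernel estimate f_{n-m,s}: average of uniform densities on [x_k - s, x_k + s]."""
        def f(t):
            t = np.atleast_1d(t)
            counts = np.sum(np.abs(t[:, None] - data[None, :]) <= bandwidth, axis=1)
            return counts / (len(data) * 2.0 * bandwidth)
        return f

    # The selection algorithms only ever query inner products with test-functions, e.g. h . T_ij:
    fi, fj = kernel_estimate(train, s[10]), kernel_estimate(train, s[20])
    T_ij_at_test = np.sign(fi(test) - fj(test))    # T_ij evaluated at the m test points
    h_dot_Tij = T_ij_at_test.mean()                # h . T_ij for the empirical measure h
    print("N =", N, "   h . T_ij =", h_dot_Tij)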
3.2 Efficient approximation of $L_1$-distances using projections

Our main tool will be the following result of [LHC07] (for related work see also [Ind06]).

Lemma 16 (Lemma 8 of [LHC07]) Let $v_1, \dots, v_N \in \mathbb{R}^M$. Let $\varepsilon, \delta \in (0, 1)$. Let $d \ge 11(2 \log N - \log \delta)/\varepsilon^2$ be an integer. Let $R$ be a $d \times M$ matrix whose entries are i.i.d. from the Cauchy distribution $C(0, 1)$. Let $w_i = R v_i$ for $i \in [N]$. Let $D_{ij}$ be the geometric mean of the coordinates of $|w_i - w_j|$. With probability at least $1 - \delta$ (over the choice of the entries in $R$) we have, for all pairs $i, j \in [N]$,
$$(1 - \varepsilon) D_{ij} \le \|v_i - v_j\|_1 \le (1 + \varepsilon) D_{ij}. \qquad (27)$$

As an immediate consequence of Lemma 16 we obtain an efficient algorithm for approximating all pairwise $L_1$-distances between $N$ densities, each of which is a mixture of $n$ uniform distributions on intervals.

Lemma 17 Let $n$ and $N$ be positive integers. Let $\varepsilon, \delta \in (0, 1)$. For each $i \in [N]$ let $f_i$ be a mixture of $n$ uniform densities on intervals ($f_i$ is given by a set of $n$ mixture coefficients $\alpha_{i,1}, \dots, \alpha_{i,n}$ and $n$ disjoint intervals $[a_{i,1}, b_{i,1}), \dots, [a_{i,n}, b_{i,n})$). There is a randomized algorithm which in time $O((1/\varepsilon)^2 (nN + N^2) \log(nN/\delta))$ computes $D_{ij}$ (for $i, j \in [N]$) such that with probability at least $1 - \delta$ we have, for all $i, j \in [N]$,
$$(1 - \varepsilon) D_{ij} \le \|f_i - f_j\|_1 \le (1 + \varepsilon) D_{ij}. \qquad (28)$$

Proof: Let $S = s_0 < s_1 < \dots < s_M$ be the sequence obtained by sorting the set $\{a_{i,j} \,;\, i \in [N], j \in [n]\} \cup \{b_{i,j} \,;\, i \in [N], j \in [n]\}$. Note that $M < 2Nn$. Let $v_i \in \mathbb{R}^M$ be the vector whose $j$-th coordinate is the measure of $[s_{j-1}, s_j)$ under $f_i$. We have $\|f_i - f_j\|_1 = \|v_i - v_j\|_1$ for all $i, j \in [N]$. Now we apply Lemma 16 to $v_1, \dots, v_N$. Let $d = \lceil 11(2 \log 2nN - \log \delta)/\varepsilon^2 \rceil$. Let $R$ be a $d \times M$ matrix whose entries are i.i.d. from the Cauchy distribution $C(0, 1)$. We can compute $R$ in time $O(dM)$. Suppose that we have computed $w_i = R v_i$ for $i \in [N]$. Then we can compute $D_{ij}$, the geometric mean of the coordinates of $|w_i - w_j|$, for all $i, j \in [N]$ in time $O(N^2 d)$. Equation (27) and the fact that $\|f_i - f_j\|_1 = \|v_i - v_j\|_1$ imply (28).

It remains to show how to compute the $w_i = R v_i$ efficiently. The $j$-th coordinate of $v_i$ is the measure of $[s_{j-1}, s_j)$ under $f_i$, which is $(s_j - s_{j-1})$ times the density of $f_i$ on the interval $[s_{j-1}, s_j)$ (the density of $f_i$ is constant on this interval). Let $R'$ be obtained from the matrix $R$ by multiplying the $j$-th column by $(s_j - s_{j-1})$ for $j \in [M]$. We can obtain $R'$ from $R$ in time $O(dM)$. Let $R''$ be the matrix with $R''_{kj} = R'_{k1} + R'_{k2} + \dots + R'_{kj}$ (again, we can compute $R''$ from $R'$ in time $O(dM)$), with the convention $R''_{k0} = 0$. Writing $r(x)$ for the index of $x$ in the sorted sequence $S$ (so that $s_{r(x)} = x$), we have
$$(R v_i)_k = \sum_{j=1}^{n} \alpha_{i,j}\, \frac{R''_{k,\, r(b_{i,j})} - R''_{k,\, r(a_{i,j})}}{b_{i,j} - a_{i,j}}. \qquad (29)$$
Using equation (29) we can compute all the $v_i$ (that is, all the $w_i = R v_i$) in time $O(nNd)$.

Remark 18 In a forthcoming paper [MS08] we generalize Lemma 17 to piecewise polynomial densities. For each $i \in [N]$, let the density $f_i$ be specified by $n$ disjoint intervals $[a_{i,1}, b_{i,1}), \dots, [a_{i,n}, b_{i,n})$, and on the interval $[a_{i,j}, b_{i,j})$, for each $j \in [n]$, by coefficients $\beta^{(0)}_{i,j}, \beta^{(1)}_{i,j}, \dots, \beta^{(d')}_{i,j}$ such that for $x \in [a_{i,j}, b_{i,j})$,
$$f_i(x) = \beta^{(0)}_{i,j} + \beta^{(1)}_{i,j} x + \dots + \beta^{(d')}_{i,j} x^{d'}.$$
Theorem 5.1 of [MS08] states that there is a randomized algorithm which takes $O(N(N + n)(d')^3 \log N)$ time (we suppress the dependence on $\varepsilon$ and $\delta$) and outputs $D_{ij}$, $1 \le i < j \le N$, such that with probability at least $1 - \delta$, for each $1 \le i < j \le N$,
$$(1 - \varepsilon) D_{ij} \le \|f_i - f_j\|_1 \le (1 + \varepsilon) D_{ij}.$$
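The following sketch illustrates the Cauchy-projection idea behind Lemmas 16 and 17 on made-up piecewise-constant densities; it uses the plain geometric-mean estimator and omits the small-sample bias correction of [LHC07], so it should be read as an illustration under those assumptions rather than the algorithm of the lemma.

    import numpy as np

    rng = np.random.default_rng(5)

    # Made-up setup: N piecewise-constant densities on a common grid of M cells,
    # represented by the vector v_i of cell masses (as in the proof of Lemma 17).
    N, M = 20, 2000
    V = rng.dirichlet(np.ones(M), size=N)       # each row is a probability vector in R^M

    eps, delta = 0.3, 0.01
    d = int(np.ceil(11 * (2 * np.log(N) - np.log(delta)) / eps**2))

    # Project with a d x M matrix of i.i.d. standard Cauchy entries; each coordinate of
    # R(v_i - v_j) is then Cauchy with scale ||v_i - v_j||_1.
    R = rng.standard_cauchy(size=(d, M))
    W = V @ R.T                                 # w_i = R v_i, computed for all i at once

    def l1_estimate(wi, wj):
        # Geometric mean of |w_i - w_j| concentrates around the Cauchy scale ||v_i - v_j||_1.
        return float(np.exp(np.mean(np.log(np.abs(wi - wj) + 1e-300))))

    i, j = 0, 1
    true = np.abs(V[i] - V[j]).sum()
    approx = l1_estimate(W[i], W[j])
    print("true L1 distance  :", true)
    print("projected estimate:", approx, "  relative error:", abs(approx - true) / true)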
3.3 Bin-width selection for histogram estimates

Here we show how the efficient minimum loss-weight estimate yields a fast algorithm for finding the optimal bin-width of 1-dimensional histograms. The set of densities arising in this problem will be such that for any subset of them it will be trivial to determine the pair whose $L_1$-distance is maximal.

Given a bin-width $s > 0$, define $A_t$, for each integer $t$, to be the interval $[ts, (t+1)s)$. Given $n$ sample points $x_1, \dots, x_n \in \mathbb{R}$ drawn from a distribution $g$, the regular histogram estimate $f_{n,s}$ is defined as the density such that for each $t$ and each $x \in A_t$,
$$f_{n,s}(x) = \frac{|\{x_i \,;\, x_i \in A_t\}|}{ns}. \qquad (30)$$
Devroye and Lugosi [DL01, DG85] consider the problem of finding $L_1$-optimal histogram estimates. As in the case of kernel estimates, they use the first $n - m$ sample points $x_1, \dots, x_{n-m}$ to define the histogram estimate $f_{n-m,s}$, and the remaining points $x_{n-m+1}, \dots, x_n$ to define the empirical distribution $h$. Now, given a set $\Theta$ to choose from, $s$ is defined to be the bin-width such that $f_{n-m,s}$ is the minimum distance estimate for $\{f_{n-m,s'} \,;\, s' \in \Theta\}$ and $h$. If each width in $\Theta$ is $2^k$ for some integer $k$, Devroye and Lugosi [DL01] prove the following about $s$.

Theorem 19 ([DL01], Theorem 10.3 and Lemma 10.5) If $\Theta \subseteq \{2^i \,;\, i \in \mathbb{Z}\}$, then for all $n$ and $m$ with $0 < m \le n/2$,
$$E\big[\|f_{n-m,s} - g\|_1\big] \le \left(\frac{2m}{n-m} + 1\right) 3 \inf_{s' \in \Theta} E\big[\|f_{n,s'} - g\|_1\big] + 8\sqrt{\frac{\log\big(2(m+1)n^2\big)}{m}} + \frac{3}{n}.$$

Once again, as with kernel estimates, we can simply use the efficient minimum loss-weight estimate instead of the minimum distance estimate. Define $\hat{s}$ to be such that $f_{n-m,\hat{s}}$ is the efficient minimum loss-weight estimate (Algorithm 4b) for $\{f_{n-m,s'} \,;\, s' \in \Theta\}$ and $h$. We state below the analogue of Theorem 19 for the efficient minimum loss-weight estimate. The proof is the same, except that one uses Theorem 8 instead of Theorem 3.

Theorem 20 If $\Theta$ is as in Theorem 19, then for all $n$ and $m$ with $0 < m \le n/2$,
$$E\big[\|f_{n-m,\hat{s}} - g\|_1\big] \le \left(\frac{2m}{n-m} + 1\right) 3 \inf_{s' \in \Theta} E\big[\|f_{n,s'} - g\|_1\big] + 8\sqrt{\frac{\log\big(2(m+1)n^2\big)}{m}} + \frac{3}{n}.$$

Let us now consider the computational cost. For each $n$, say we choose $\Theta$ to be $\{2^i \,;\, -N \le i \le N\}$ (where, e.g., $N = n$ is a possible choice), so that we have $2N + 1$ densities to select from. Define $s_i = 2^{-N+i}$ for each $0 \le i \le 2N$.
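As a concrete illustration of the histogram estimate (30) with dyadic bin-widths, here is a short Python sketch (the data, the range of widths, and the function name are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=1000)                 # made-up sample from an unknown g

    def histogram_estimate(data, s):
        """Regular histogram estimate (30): f_{n,s}(t) = |{x_i in A_t}| / (n s)
        on the bin A_t = [t s, (t+1) s) containing t, with bins anchored at 0."""
        bins, counts = np.unique(np.floor(data / s).astype(int), return_counts=True)
        table = dict(zip(bins.tolist(), counts.tolist()))
        def f(t):
            idx = np.floor(np.asarray(t, dtype=float) / s).astype(int)
            return np.array([table.get(int(b), 0) for b in idx]) / (len(data) * s)
        return f

    # Dyadic widths from Theta = {2^i}, over a small toy range of exponents.
    widths = [2.0 ** i for i in range(-4, 3)]
    f_fine, f_coarse = histogram_estimate(x, widths[0]), histogram_estimate(x, widths[-1])

    # Both estimates are constant on cells of the finest width (coarse bins are unions of
    # fine bins, since all widths are dyadic and anchored at 0), so this L1 norm is exact.
    S = widths[-1]
    lo, hi = np.floor(x.min() / S) * S, np.ceil(x.max() / S) * S
    grid = np.arange(lo, hi, widths[0]) + widths[0] / 2
    l1 = float(np.sum(np.abs(f_fine(grid) - f_coarse(grid)) * widths[0]))
    print("||f_{n, 2^-4} - f_{n, 2^2}||_1 =", l1)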
The following lemma shows that we need not actually pre-compute the pairwise $L_1$-distances in the preprocessing step of Algorithm 4b.

Lemma 21 For any $i \le k \le k' \le j$, $\|f_{n,s_{k'}} - f_{n,s_k}\|_1 \le \|f_{n,s_j} - f_{n,s_i}\|_1$.

Proof: We first prove that for any $n$ and $i < j$,
$$\|f_{n,s_j} - f_{n,s_{i+1}}\|_1 \le \|f_{n,s_j} - f_{n,s_i}\|_1, \qquad (31)$$
and
$$\|f_{n,s_{j-1}} - f_{n,s_i}\|_1 \le \|f_{n,s_j} - f_{n,s_i}\|_1. \qquad (32)$$
In order to prove (31), consider any bin $A_t = [t s_{i+1}, (t+1) s_{i+1}) = [2t s_i, 2(t+1) s_i)$. Denote the density of $f_{n,s_j}$ in this bin by $\mu$, and the densities of $f_{n,s_i}$ in $[2t s_i, (2t+1) s_i)$ and $[(2t+1) s_i, (2t+2) s_i)$ by $\mu_1$ and $\mu_2$, respectively. Clearly the density of $f_{n,s_{i+1}}$ in $A_t$ is $\frac{\mu_1 + \mu_2}{2}$. However,
$$\int_{A_t} |f_{n,s_j} - f_{n,s_i}| = s_i \big(|\mu - \mu_1| + |\mu - \mu_2|\big) \ge 2 s_i \left|\mu - \frac{\mu_1 + \mu_2}{2}\right| = \int_{A_t} |f_{n,s_j} - f_{n,s_{i+1}}|.$$
Thus
$$\|f_{n,s_j} - f_{n,s_i}\|_1 = \sum_t \int_{A_t} |f_{n,s_j} - f_{n,s_i}| \ge \sum_t \int_{A_t} |f_{n,s_j} - f_{n,s_{i+1}}| = \|f_{n,s_j} - f_{n,s_{i+1}}\|_1.$$
The proof of (32) is similar. The lemma now follows by induction.

So in each iteration of Algorithm 4b, the pair of densities picked for comparison simply corresponds to the smallest and the largest bin-widths remaining to be considered. In other words, if $s_i$ and $s_j$ are, respectively, the minimum and the maximum widths remaining, then $f_{n-m,s_i}$ is compared against $f_{n-m,s_j}$.

Let $T_{ij}$ denote, as usual, the test-function for $f_{n-m,s_i}$ and $f_{n-m,s_j}$; we now analyze the time needed to compute $f_{n-m,s_i} \cdot T_{ij}$ and $h \cdot T_{ij}$. We first preprocess $x_1, \dots, x_{n-m}$ by sorting them ($O(n \log n)$ time). For any $x$ the value of $T_{ij}(x)$ can be computed in time $O(\log n)$ (using binary search on $x_1, \dots, x_{n-m}$), and hence $h \cdot T_{ij}$ can be computed in $O(m \log n)$ time. We can compute $f_{n-m,s_i} \cdot T_{ij}$ in $O(n)$ time (using one pass over the array $x_1, \dots, x_{n-m}$). Hence the efficient minimum loss-weight estimate requires only $O(N(n + m \log n) + n \log n)$ computations in total. In contrast, the modified minimum distance estimate requires $O(N^2(n + m \log n) + n \log n)$ and the minimum distance estimate requires $O(N^3(n + m \log n) + n \log n)$, making the efficient minimum loss-weight estimate the fastest of the three.

4 Randomized algorithms and mixtures

In this section we explore the following question: can the constant 3 be improved if we allow randomized algorithms? Let $f$ be the output of a randomized algorithm ($f$ is a random variable with values in $F$). We would like to bound the expected error $E[\|f - g\|_1]$, where the expectation is taken only with respect to the coin tosses made by the algorithm (and not with respect to the distribution of the samples).

If instead of randomization we consider algorithms which output mixtures of densities in $F$, we obtain a related problem. Indeed, let $\pi$ be the distribution on $F$ produced by a randomized algorithm, and let $r = \sum_{f \in F} \pi(f) f$ be the corresponding mixture. Then, by the triangle inequality, we have
$$\|r - g\|_1 \le E\big[\|f - g\|_1\big].$$
Hence the model in which the output is allowed to be a mixture of densities in $F$ is "easier" than the model in which the density selection algorithm is randomized.

We consider here only the special case in which $F$ has only two densities $f_1, f_2$, and give a randomized algorithm with a better guarantee than is possible for deterministic algorithms. Later, in Section 5, we give a matching lower bound in the mixture model. To simplify the exposition we will, without loss of generality, assume that $\|f_1 - f_2\|_1 > 0$. Thus for any $h$ we have
$$(f_1 - h) \cdot T_{12} + (h - f_2) \cdot T_{12} = \|f_1 - f_2\|_1 > 0.$$

Algorithm 5 - Randomized estimate.
Let
$$r = \frac{|(f_1 - h) \cdot T_{12}|}{|(f_2 - h) \cdot T_{12}|}.$$
With probability $1/(r+1)$ output $f_1$, otherwise output $f_2$. (By convention, if $|(f_2 - h) \cdot T_{12}| = 0$ then we take $r = \infty$ and output $f_2$ with probability 1.)

Theorem 22 Let $F = \{f_1, f_2\}$. Let $f$ be the density output by Algorithm 5. Then
$$E\big[\|f - g\|_1\big] \le 2\, d_1(g, F) + \Delta,$$
where the expectation is taken only with respect to the coin tosses made by the algorithm.

Proof: Without loss of generality assume that $f_2 = \mathrm{argmin}_{f \in F} \|f - g\|_1$. First we bound the error of $f_1$ and later use it to bound the error of $f$. We have, by the triangle inequality, $\|f_1 - g\|_1 \le \|f_1 - f_2\|_1 + \|f_2 - g\|_1$. We can bound $\|f_1 - f_2\|_1$ as follows:
$$\|f_1 - f_2\|_1 = (f_1 - f_2) \cdot T_{12} \le |(f_1 - h) \cdot T_{12}| + |(f_2 - h) \cdot T_{12}| = (r+1)\,|(f_2 - h) \cdot T_{12}| \le (r+1)\,|(f_2 - g) \cdot T_{12}| + (r+1)\,|(g - h) \cdot T_{12}|.$$
Thus,
$$\|f_1 - g\|_1 \le (r+2)\,\|f_2 - g\|_1 + (r+1)\Delta. \qquad (33)$$
Hence
$$E\big[\|f - g\|_1\big] = \frac{1}{r+1}\,\|f_1 - g\|_1 + \frac{r}{r+1}\,\|f_2 - g\|_1 \le 2\,\|f_2 - g\|_1 + \Delta,$$
where in the last inequality we used (33).
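A minimal Python sketch of Algorithm 5 for two discrete densities (the toy data and the function name are ours):

    import numpy as np

    def randomized_estimate(f1, f2, h, rng):
        """Algorithm 5 (sketch): output f1 with probability 1/(r+1), else f2, where
        r = |(f1 - h).T12| / |(f2 - h).T12|; if the denominator is 0, output f2."""
        T12 = np.sign(f1 - f2)
        num, den = abs(np.dot(f1 - h, T12)), abs(np.dot(f2 - h, T12))
        if den == 0.0:
            return 1                          # convention: r = infinity, output f2
        r = num / den
        return 0 if rng.random() < 1.0 / (r + 1.0) else 1

    rng = np.random.default_rng(7)
    f1 = np.array([0.10, 0.30, 0.20, 0.25, 0.15])
    f2 = np.array([0.25, 0.05, 0.30, 0.20, 0.20])
    h = np.bincount(rng.choice(5, size=2000, p=f2), minlength=5) / 2000
    picks = [randomized_estimate(f1, f2, h, rng) for _ in range(10000)]
    print("fraction of runs outputting f2:", np.mean(picks))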
5 Lower bound examples

In this section we construct an example showing that deterministic density selection algorithms based on test-functions cannot improve on the constant 3, that is, Theorems 2, 3, 4, and 8 are tight. For algorithms that output mixtures (and hence for randomized algorithms) the example yields a lower bound of 2, matching the constant in Theorem 22.

Lemma 23 For every $\varepsilon > 0$ there exist distributions $f_1, f_2$, and $g = h$ such that $\|f_1 - g\|_1 \ge (3 - \varepsilon)\,\|f_2 - g\|_1$, and $f_1 \cdot T_{12} = -f_2 \cdot T_{12}$ and $h \cdot T_{12} = 0$.

Before we prove Lemma 23 let us see how it is applied. Consider the behavior of the algorithm on the empirical distribution $h$ for $F = \{f_1, f_2\}$ and $F' = \{f'_1, f'_2\}$, where $f'_1 = f_2$ and $f'_2 = f_1$. Note that $T'_{12} = T_{21} = -T_{12}$ and hence
$$f'_1 \cdot T'_{12} = -f'_2 \cdot T'_{12} = f_1 \cdot T_{12} = -f_2 \cdot T_{12}.$$
Moreover, we have $h \cdot T'_{12} = h \cdot T_{12} = 0$. Note that all the test-functions have the same values for $F$ and $F'$. Hence a test-function based algorithm either outputs $f_1$ and $f'_1$, or it outputs $f_2$ and $f'_2 = f_1$. In both cases it outputs $f_1$ for one of the inputs, and hence we obtain the following consequence.

Corollary 24 For any $\varepsilon > 0$ and any deterministic test-function based algorithm there exist an input $F$ and $h = g$ such that the output $f_1$ of the algorithm satisfies $\|f_1 - g\|_1 \ge (3 - \varepsilon)\, d_1(g, F)$.

Proof of Lemma 23: Consider the following probability space consisting of 4 atomic events $A_1, A_2, A_3, A_4$:

          f_1          f_2          g = h    T_12
    A_1   0            1/2 + eps    1/2      -1
    A_2   1/4 + eps    1/4 - eps    1/2       1
    A_3   1/2          0            0         1
    A_4   1/4 - eps    1/4          0        -1

Note that we have $f_1 \cdot T_{12} = -f_2 \cdot T_{12} = 1/2 + 2\varepsilon$, $h \cdot T_{12} = 0$, and
$$\|f_1 - g\|_1 = 3/2 - 2\varepsilon, \qquad \|f_2 - g\|_1 = 1/2 + 2\varepsilon.$$
The ratio $\|f_1 - g\|_1 / \|f_2 - g\|_1$ gets arbitrarily close to 3 as $\varepsilon$ goes to zero.

Consider $f_1$ and $f_2$ from the proof of Lemma 23. Let $\bar{f} = \alpha f_1 + (1 - \alpha) f_2$ where $\alpha \ge 1/2$. For $0 < \varepsilon < 1/4$ we have $\|\bar{f} - g\|_1 = 1/2 + \alpha - 2\alpha\varepsilon \ge 1 - 2\varepsilon$. By symmetry, for one of $F = \{f_1, f_2\}$ and $F' = \{f'_1, f'_2\}$ (with $f'_1 = f_2$ and $f'_2 = f_1$), the algorithm outputs $\alpha f_1 + (1 - \alpha) f_2$ with $\alpha \ge 1/2$, and hence we obtain the following.

Corollary 25 For any $\varepsilon > 0$ and any deterministic test-function based algorithm which outputs a mixture there exist an input $F$ and $h = g$ such that the output $\bar{f}$ of the algorithm satisfies $\|\bar{f} - g\|_1 \ge (2 - \varepsilon)\, d_1(g, F)$.

Thus for two distributions the correct constant is 2 for randomized algorithms using test-functions. For larger families of distributions we do not know what the value of the constant is (we only know that it lies in the interval $[2, 3]$).

Question 26 What is the correct constant for deterministic test-function based algorithms which output a mixture? What is the correct constant for randomized test-function based algorithms?

Next we construct an example showing that 9 is the right constant for Algorithm 1.

Lemma 27 For every $\varepsilon > 0$ there exist probability distributions $f_1, f_2, f_3 = f'_3$ and $g$ such that $\|f_1 - g\|_1 \ge (9 - \varepsilon)\,\|f_2 - g\|_1$, yet Algorithm 1, for $F = \{f_1, f_2, f_3, f'_3\}$, even when given the true distribution (that is, $h = g$), outputs $f_1$.

Proof: Consider the following probability space with 6 atomic events $A_1, \dots, A_6$, and $f_1, f_2, f_3$ and $g$ with probabilities given by the following table:

          g = h          f_1            f_2            f_3            T_12   T_13   T_23
    A_1   2/3 - 21 eps   0              2/3 - 30 eps   2/3 - 21 eps   -1     -1     -1
    A_2   1/9 - 2 eps    18 eps         0              9 eps           1      1     -1
    A_3   9 eps          2/3 - 12 eps   0              9 eps           1      1     -1
    A_4   0              2/9 - 13 eps   0              2/9 - 4 eps     1     -1     -1
    A_5   2/9 + 14 eps   9 eps          2/9 + 14 eps   0              -1      1      1
    A_6   0              1/9 - 2 eps    1/9 + 16 eps   1/9 + 7 eps    -1     -1      1

Note that we have
$$f_1 \cdot T_{12} = 7/9 - 14\varepsilon, \quad h \cdot T_{12} = -7/9 + 14\varepsilon, \quad f_2 \cdot T_{12} = -1,$$
$$f_1 \cdot T_{13} = 1/3 + 30\varepsilon, \quad h \cdot T_{13} = -1/3 + 42\varepsilon, \quad f_3 \cdot T_{13} = -1 + 36\varepsilon,$$
$$f_2 \cdot T_{23} = -1/3 + 60\varepsilon, \quad h \cdot T_{23} = -5/9 + 28\varepsilon, \quad f_3 \cdot T_{23} = -7/9 + 14\varepsilon.$$
Hence $f_1$ wins over $f_3$, $f_3$ wins over $f_2$, and $f_2$ wins over $f_1$. Since $f'_3 = f_3$, we have that $f_1$ is the tournament winner. Finally, we have $\|f_1 - g\|_1 = 2 - 72\varepsilon$ and $\|f_2 - g\|_1 = 2/9 + 32\varepsilon$. As $\varepsilon \to 0$ the ratio $\|f_1 - g\|_1 / \|f_2 - g\|_1$ gets arbitrarily close to 9.
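As a quick numerical sanity check of the example in the proof of Lemma 23, the following sketch (not from the paper) evaluates the relevant quantities for a small value of eps:

    import numpy as np

    eps = 0.01
    # Atomic events A_1..A_4 from the proof of Lemma 23 (rows of the table above).
    f1 = np.array([0.0, 0.25 + eps, 0.5, 0.25 - eps])
    f2 = np.array([0.5 + eps, 0.25 - eps, 0.0, 0.25])
    g = h = np.array([0.5, 0.5, 0.0, 0.0])
    T12 = np.sign(f1 - f2)

    print("h . T12 =", np.dot(h, T12))                                  # 0
    print("f1 . T12, f2 . T12 =", np.dot(f1, T12), np.dot(f2, T12))     # +/- (1/2 + 2 eps)
    ratio = np.abs(f1 - g).sum() / np.abs(f2 - g).sum()
    print("||f1 - g||_1 / ||f2 - g||_1 =", ratio)                       # approaches 3 as eps -> 0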
Hence f1 wins over f3 , f3 wins over f2 , and f2 wins over f1 . Since f3 = f3 we have that f1 is the tournament winner. Finally, we have f1 - g 1 = 2 - 72 and f2 - g 1 = 2/9 + 32. As 0 the ratio f1 - g 1/ f2 - g 1 gets arbitrarily close to 9. References [DG85] Luc Devroye and Laszl´ Gy¨rfi. Nonparamet´o o ric density estimation: the L1 view. Wiley series in probability and mathematical statistics. John Wiley & Sons, New York, 1985. [DGL02] Luc Devroye, Laszl´ Gy¨rfi, and G´bor Lu´o o a gosi. A note on robust hypothesis testing. IEEE Transactions on Information Theory, 48(7):2111­2114, 2002. [DL96] Luc Devroye and G´bor Lugosi. A univera sally acceptable smoothing factor for kernel density estimates. Ann. Statist., 24(6):2499­ 2512, 1996. [DL97] Luc Devroye and G´bor Lugosi. Nonasympa totic universal smoothing factors, kernel complexity and Yatracos classes. Ann. Statist., 25(6):2626­2637, 1997. [DL01] Luc Devroye and G´bor Lugosi. Combinatoa rial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001. [DL04] Luc Devroye and G´bor Lugosi. Bin width sea lection in multivariate histograms by the combinatorial method. Test, 13:1­17, 2004. [Ind06] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM, 53(3):307­ 323, 2006. [LHC07] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Nonlinear estimators and tail bounds for dimension reduction. Journal of Machine Learning Research, 8:2497­2532, 2007. [MS08] Satyaki Mahalanabis and Daniel Stefankovi. c Approximating l1 -distances between mixture distributions using random pro jections. arXiv.org, http://arxiv.org/abs/0804.1170, April 2008. [Sch47] Henry Scheff´. A useful convergence theoe rem for probability distributions. Ann. Math. Statistics, 18:434­438, 1947. [VC71] Vladimir N. Vapnik and Alexey J. Cervonenkis. The uniform convergence of frequencies of the appearance of events to their probabilities. Teor. Verojatnost. i Primenen., 16:264­279, 1971. [Wol57] Jacob Wolfowitz. The minimum distance method. The Annals of Mathematical Statistics, 28:75­88, 1957. [Yat85] Yannis G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. Ann. Statist., 13(2):768­ 774, 1985.