Beyond Gaussians: Spectral Methods for Learning Mixtures of Heavy-Tailed Product Distributions

Kamalika Chaudhuri
Information Theory and Applications, UC San Diego
kamalika@soe.ucsd.edu

Satish Rao
Computer Science Division, UC Berkeley
satishr@cs.berkeley.edu

Abstract

We study the problem of learning mixtures of distributions, a natural formalization of clustering. A mixture of distributions is a collection of distributions D = {D_1, ..., D_T} and weights w_1, ..., w_T. A sample from a mixture is drawn by selecting D_i with probability w_i and then selecting a sample from D_i. The goal, in learning a mixture, is to learn the parameters of the distributions comprising the mixture, given only samples from the mixture. In this paper, we focus on learning mixtures of heavy-tailed product distributions, which were studied by [DHKS05]. The challenge in learning such mixtures is that the techniques developed for learning mixture models, such as spectral methods and distance concentration, do not apply. The previous algorithm for this problem was due to [DHKS05]; it achieved performance comparable to the algorithms of [AM05, KSV05, CR08] given a mixture of Gaussians, but took time exponential in the dimension.

We provide an algorithm which has the same performance, but runs in polynomial time. Our main contribution is an embedding which transforms a mixture of heavy-tailed product distributions into a mixture of distributions over the hypercube in a higher dimension, while still maintaining separability. Combining this embedding with standard spectral techniques yields algorithms that can learn mixtures of heavy-tailed distributions with separation comparable to the guarantees of [DHKS05]. Our algorithm runs in time polynomial in the dimension, the number of clusters, and the imbalance in the mixing weights.

1 Introduction

We study the problem of learning mixtures of distributions, a natural formalization of clustering. A mixture of distributions is a collection of T distributions D = {D_1, ..., D_T} over R^n and mixing weights w_1, ..., w_T such that sum_{i=1}^T w_i = 1. A sample from a mixture is drawn by first selecting i with probability w_i, and then choosing a random sample from D_i. The goal, in learning a mixture, is to learn the parameters of the distributions comprising the mixture, and to classify the samples according to source distribution, given only the ability to sample from the mixture.

Learning mixtures of distributions arises frequently in applications of machine learning, and a fair amount of empirical work has been devoted to the problem. On the theoretical side, all work (except for the work of [DHKS05]) has focused on learning mixtures of distributions with one of the following characteristics: either the distributions in question have exponentially decaying tails, for example, mixtures of Gaussians [Das99, DS00, AM05, KSV05, AK01, VW02], or they have severely bounded range, for example, mixtures of binary product distributions [FOS05, CR08]. In the latter case, the bounds deteriorate with the maximum range of values taken by any coordinate of a sample drawn from the mixture. In this paper, we turn our attention to learning mixtures of more general distributions. In particular, we study learning mixtures of heavy-tailed product distributions, a problem introduced by Dasgupta et al. [DHKS05].

If the distributions comprising a mixture are very close together, in the sense that they have a high overlap in probability mass, then, even if we knew the parameters of the distributions comprising the mixture, the samples would be hard to classify. To address this, Dasgupta [Das99] introduced the notion of a separation condition. A separation condition is a promise that the distributions comprising a mixture are sufficiently different according to some measure, and the goal of the algorithm is to correctly learn a mixture which obeys a certain separation condition. Naturally, the less stringent a separation condition is, the harder it is to learn a mixture, and therefore a line of theoretical research has focused on learning mixtures of distributions under less and less restrictive separation conditions.

For mixtures of Gaussians, the common measure of separation used is the minimum distance between the means of any two distributions in the mixture, parameterized by the maximum directional standard deviation of any distribution in the mixture. However, this is not a good measure for the type of distributions considered here, as the directional standard deviation may be infinite; following [DHKS05], we therefore use as a measure of separation the minimum distance between the medians of any two distributions in the mixture, parameterized by the maximum 3/4-radius. Recall that, given 0 < ε < 1, the ε-radius of a one-dimensional distribution D with median m(D) is the minimum number R_ε such that the probability mass of D in the interval [m(D) − R_ε, m(D) + R_ε] is at least ε.

The major challenge in learning mixtures of heavy-tailed distributions is that none of the tools developed in the literature for learning mixtures of Gaussians or binary product distributions work when the mixture consists of more general distributions. The key ingredients of such algorithms are: (1) a singular value decomposition of part [CR08] or all [VW02, KSV05, AM05] of the covariance matrix of the samples, and (2) distance-thresholding based clustering. Singular value decompositions of the covariance matrix do not converge if the distributions have infinite variance. Even for mixtures of distributions with finite variance, distance concentration, which works on the principle that two samples from the same distribution are closer in space than two samples from different distributions, does not work unless the distributions have light tails or a very small range. The previous algorithm for the problem is due to [DHKS05]; it learns mixtures of heavy-tailed distributions with performance comparable to the performance of the algorithms in [AM05, KSV05, CR08] given a mixture of Gaussians, but it involves an exhaustive search over all partitions of Ω(n) samples, where n is the number of dimensions, and hence takes time exponential in the dimension.

In this paper, we show a general procedure for transforming mixtures of heavy-tailed product distributions into mixtures which are more well-behaved, while preserving the separability of the distributions in the mixture. In particular, we provide an efficiently computable embedding from R^n to {0, 1}^{O(n^{3/2})}. Our embedding, when applied to a mixture of heavy-tailed product distributions satisfying conditions comparable to those in [DHKS05], produces a mixture of distributions over {0, 1}^{O(n^{3/2})} with centers that are far apart.
In addition, we show that the resulting mixture has good enough properties that standard algorithms for learning mixtures of binary product distributions, such as the SVD-based algorithms of [AM05, KSV05] and the correlations-based algorithm of [CR08], can be applied to learn it, leading to efficient algorithms for learning mixtures of heavy-tailed product distributions.

More specifically, our results are as follows. Given a mixture of general product distributions, such that each distribution is symmetric about its median and has 3/4-radius upper-bounded by R, our embedding transforms it into a mixture of distributions over {0, 1}^{O(n^{3/2})}, while preserving the distance between the centers in the sense made precise in Theorem 1. We can then apply the SVD-based clustering algorithms of [KSV05, AM05]; in this case, for success with probability 1 − δ, we require that (a) the separation between the medians of distributions D_i and D_j be Ω(R √T (w_i^{-1/2} + w_j^{-1/2}) + R √(T log(nT/δ))) and (b) this separation be spread across Ω(T (w_i^{-1/2} + w_j^{-1/2})^2 + T log(nT/δ)) coordinates. Alternatively, we can apply the correlations-based algorithm of [CR08] to the transformed mixture, to get a logarithmic dependence on the mixing weights. In this case, to learn the mixture with probability 1 − δ, we require (a) that the minimum distance between the medians of any two distributions in the mixture be Ω(R √(T log Λ) + R √(T log(nT/δ))) and (b) that this separation be spread across Ω(T log Λ + T log(nT/δ)) coordinates, where Λ is polynomial in n, T and 1/w_min. We note that conditions comparable to all four of these conditions are required by [DHKS05] for learning mixtures of heavy-tailed distributions; our work improves on their results by providing a polynomial-time algorithm for the problem, as opposed to an exponential-time algorithm. In addition, we do not need the restriction, needed by [DHKS05], that the probability density function be decreasing with distance from the median. We also note that the guarantees of our algorithms are comparable to the guarantees of [AM05, KSV05, CR08] when the input is a mixture of axis-aligned Gaussians.

Our Techniques. An initial approach for converting a mixture of general product distributions into a mixture of distributions with better properties is to remove the outlier points, which lie very far from the other samples. However, for the types of distributions we consider, a sample may be an outlier along each coordinate with constant (1/4) probability, and since there are n coordinates, with high probability every point is an outlier. Another approach could be to try to round the outlier points along each dimension; however, since the different mixture components may have different mixing weights, given samples from the mixture, it is hard to determine which of the samples are outliers along a specific coordinate.

To address these issues, we use techniques from metric embeddings [Ind01]. The main idea behind our embedding is to use many random cutting points to divide the real line into intervals of length Θ(R); points which fall into the even intervals are then mapped to 0 and those which fall into the odd intervals are mapped to 1. Although this process does not preserve distances between all pairs of points, we show that it succeeds in separating the centers of two distributions whose medians are far apart compared to their 3/4-radius R. Our techniques are related to techniques in metric embedding [Ind01]; however, as far as we know, this is the first time they have been applied to learning mixtures of distributions. Combining our embedding with existing standard algorithms for learning mixtures of distributions, we get efficient algorithms for learning mixtures of heavy-tailed distributions.
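As a small numerical illustration of the outlier claim above, the following sketch (assuming standard Cauchy coordinates purely for concreteness; the paper does not fix a distribution) checks that essentially every sample exceeds the 3/4-radius in at least one coordinate, so naive outlier removal would discard everything.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 100, 5000                  # dimensions, number of samples
    X = rng.standard_cauchy((m, n))   # heavy-tailed product distribution, median 0 in every coordinate

    R = np.tan(3 * np.pi / 8)         # 3/4-radius of a standard Cauchy coordinate: Pr[|x| <= R] = 3/4
    beyond = np.abs(X) > R            # each coordinate exceeds its 3/4-radius with probability 1/4
    print("points with at least one coordinate beyond R:",
          np.mean(beyond.any(axis=1)))   # close to 1.0: nearly every point is an "outlier" somewhere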
2 Related Work

Heavy-Tailed Mixtures. The work most closely related to ours is that of Dasgupta, Hopcroft, Kleinberg and Sandler [DHKS05]. Dasgupta et al. [DHKS05] introduced the problem of learning mixtures of heavy-tailed distributions and the notion of using the distance between the medians, parameterized by the half-radius, as a measure of separation between such distributions. Their work deals with the class of all product distributions in which the distribution of each coordinate has the following properties: (a) symmetry around the median, (b) decreasing probability density with distance from the median, and (c) 1/2-radius upper bounded by R. In contrast, we require the distribution of each coordinate to be symmetric about its median and to have 3/4-radius upper bounded by R, and we do not require the second assumption of [DHKS05].

[DHKS05] provide two algorithms for learning such mixtures. First, they provide an algorithm which requires a separation of Ω(R √T) and a spreading condition that the distance between the medians of any two distributions in the mixture be spread over Ω(T/ε) coordinates, in order to classify a 1 − ε fraction of the samples correctly. This algorithm works by performing an exhaustive search over all partitions of Ω(n log(nT)/w_min) samples, and therefore has a running time exponential in Ω(n log(nT)/w_min). In contrast, our algorithms work with similar separation and spreading conditions, and take time only polynomial in n. Second, they provide an algorithm which works with a stronger separation requirement of Ω(R √n) and a spreading condition that the distance between the medians of any two distributions in the mixture be spread over Ω(T/ε) coordinates. Typically, for such problems, the dimension n is much larger than the number of clusters T, and hence the separation needed here is much larger than the separation needed by the previous algorithm and by our algorithms. This algorithm works by performing an exhaustive search over all partitions of Ω(T log n / w_min) samples, and therefore has a running time exponential in Ω(T log n / w_min). Since w_min is at most 1/T, this may be polynomial in n but remains exponential in T. In contrast, the running times of our algorithms are polynomial in n, T, and 1/w_min, and for distributions in which the 3/4-radius is comparable with the half-radius, our algorithms work with separation and spreading constraints comparable to algorithm (1) of [DHKS05]. [DHKS05] also handle a second class of distributions, which have mildly decaying tails. In this case, they provide an algorithm which clusters a 1 − ε fraction of the samples correctly in time exponential in n, so long as the separation between any two distributions is Ω(R T^{5/2} / ε^2).

Other Mixture Models. There has been a long line of theoretical work on learning mixtures of Gaussians. For this problem, the separation condition is usually expressed in terms of n, the number of dimensions, σ, the maximum directional standard deviation of any distribution in the mixture, and T, the number of clusters. In [Das99], Dasgupta provided an algorithm which learns mixtures of spherical Gaussians when the centers of each pair of distributions are separated by Ω(σ √n). In [DS00], Dasgupta and Schulman provided an algorithm which applies to more situations and requires a separation of Ω(σ n^{1/4}). [AK01] showed how to learn mixtures of arbitrary Gaussians with a separation of Ω(σ n^{1/4}) using distance concentration. In addition to the usual separation between the centers, their results apply to other situations, for example, to concentric Gaussians with sufficiently different variance. The first algorithm that removed the dependence on n was due to Vempala and Wang [VW02], who gave a singular value decomposition based algorithm for learning mixtures of spherical Gaussians with a separation of Ω(σ T^{1/4}). Their algorithm applies a singular value decomposition to the matrix of samples to compute a T-dimensional subspace which approximates the subspace containing the centers, and then uses distance concentration to cluster the samples projected onto this low-dimensional space. In further work, [KSV05] and [AM05] showed how to use singular value decomposition based algorithms to learn mixtures of general Gaussians when the separation between the centers of distributions D_i and D_j is Ω(σ √T (w_i^{-1/2} + w_j^{-1/2} + √(log T))). The algorithm of [AM05] was shown to apply to f-convergent and g-concentrated distributions, with bounds that vary with the nature of the distributions. Their algorithm also applies to product distributions on binary vectors. However, their algorithm does not apply to distributions with infinite variance. Even for distributions with finite variance, unless the distribution has rapidly decaying tails, their algorithm yields poor guarantees, proportional to the maximum range of the distribution of each coordinate. More recently, [CR08] showed an algorithm which, under certain conditions, learns mixtures of binary product distributions and axis-aligned Gaussians when the centers are separated by Ω(σ* (√(T log Λ) + √(log(nT/δ)))), where σ* is the maximum directional standard deviation in the space containing the centers, and Λ is polynomial in n, T and 1/w_min. Their algorithm also does not work for distributions with infinite variance and yields poor guarantees for mixtures of heavy-tailed product distributions.

3 A Summary of our Results

We begin with some definitions about distributions over high-dimensional spaces.

Mixture of Distributions. A mixture of distributions is a collection of distributions D = {D_1, ..., D_T} and mixing weights w_1, ..., w_T such that sum_{i=1}^T w_i = 1. A sample from a mixture is drawn by selecting D_i with probability w_i and then choosing a sample from D_i.
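To make the sampling model just defined concrete, here is a minimal sketch of drawing from a mixture; the component distributions (Cauchy with shifted medians) and all parameter values are illustrative choices, not part of the paper.

    import numpy as np

    def sample_mixture(weights, samplers, size, rng):
        """Draw `size` samples from a mixture: pick component i with probability w_i,
        then draw from D_i.  Returns the samples and the (hidden) component labels."""
        labels = rng.choice(len(weights), size=size, p=weights)
        samples = np.stack([samplers[i](rng) for i in labels])
        return samples, labels

    # Toy example: T = 2 heavy-tailed product distributions over R^3 with different medians.
    rng = np.random.default_rng(1)
    medians = [np.zeros(3), np.full(3, 10.0)]
    samplers = [lambda r, m=m: m + r.standard_cauchy(3) for m in medians]
    X, z = sample_mixture([0.3, 0.7], samplers, size=1000, rng=rng)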
Median. We say that a distribution D on R has median m(D) if the probability that a sample drawn from D is less than or equal to m(D) is 1/2. We say that a distribution D on R^n has median m(D) = (m^1, ..., m^n) if the projection of D on the f-th coordinate axis has median m^f, for 1 ≤ f ≤ n. For a distribution D, we write m(D) to denote the median of D.

Center. We say that a distribution D on R^n has center (c^1, ..., c^n) if the projection of D on the f-th coordinate axis has expectation c^f, for 1 ≤ f ≤ n.

ε-Radius. For 0 < ε < 1, the ε-radius of a distribution D on R with median m(D) is the smallest R_ε such that
Pr_{x ~ D}[ m(D) − R_ε ≤ x ≤ m(D) + R_ε ] ≥ ε.

Effective Distance. To better describe our results, we need the concept of effective distance. The effective distance between two points x and y in R^n at scale R, denoted d_R(x, y), is defined as:
d_R(x, y) = ( sum_{f=1}^{n} min(R^2, (x^f − y^f)^2) )^{1/2}.
The effective distance between two points x and y at scale R is thus high only if many coordinates contribute to the distance between the points.

Notation. We use ||x|| to denote the L2 norm of a vector x, n to denote the number of dimensions, and s to denote the number of samples. For a point x and a subspace H, we use P_H(x) to denote the projection of x onto H. We use subscripts i, j to index over distributions in the mixture and subscripts f, g to index over coordinates in R^n. Moreover, we use pairs (f, k), ... to index over coordinates in the transformed space. We use R to denote the maximum 3/4-radius of any coordinate of any distribution in the mixture. For each distribution D_i in the mixture and each coordinate f, we use D_i^f to denote the projection of D_i on the f-th coordinate axis. For any i, we use D~_i to denote the distribution induced by applying our embedding to D_i; similarly, for any i and f, we use D~_i^f to denote the distribution induced by applying our embedding to D_i^f. Moreover, we use μ~_i to denote the center of D~_i and μ~_i^f to denote the center of D~_i^f.

3.1 Our Results

The main contribution of this paper is an embedding from R^n to {0, 1}^{n'}, where n' > n. The embedding has the property that samples from two product distributions on R^n whose medians are far apart map to samples from distributions on {0, 1}^{n'} whose centers are also far apart. In particular, let D = {D_1, ..., D_T} be a mixture of product distributions such that each coordinate f of each distribution D_i in the mixture satisfies the following properties:
1. Symmetry about the median.
2. 3/4-radius upper bounded by R.
In particular, this allows the distribution of each coordinate to have infinite variance. The properties of our embedding can then be summarized by the following theorems.

Theorem 1 Suppose we are given access to samples from a mixture of product distributions D = {D_1, ..., D_T} over R^n such that for every i and f, D_i^f satisfies properties (1) and (2). Moreover, for any i, let μ~_i denote the center of the distribution D~_i obtained by applying our embedding Φ to D_i. If, for some constant c_1,
d_R(m(D_i), m(D_j)) ≥ c_1 R,
then there exists a constant c_2 such that
||μ~_i − μ~_j|| ≥ c_2 n^{1/4} T^{1/2} (log n log T)^{1/2} · d_R(m(D_i), m(D_j)) / R
with probability 1 − 1/n over the randomness in computing Φ. Moreover, for any i, any k, k', and any f ≠ f', coordinates (f, k) and (f', k') of D~_i are independently distributed.

Our embedding can be combined with the SVD-based clustering algorithms of [KSV05, AM05] to provide an efficient algorithm for learning mixtures of heavy-tailed distributions. The resulting clustering algorithm has the following guarantees.

Theorem 2 Suppose we are given access to samples from a mixture of product distributions D = {D_1, ..., D_T} over R^n such that for every i and f, D_i^f satisfies properties (1) and (2). If, for some constant c_3,
d_R(m(D_i), m(D_j)) ≥ c_3 R ( √T (w_i^{-1/2} + w_j^{-1/2}) + √(T log(nT/δ)) ),
then Algorithm HT-SVD clusters the samples correctly with probability 1 − δ over the samples, and with probability 1 − 1/n over the randomness in the algorithm. The algorithm runs in time polynomial in n and T, and the number of samples required by the algorithm is O~(n^{3/2} T / w_min).
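The separation conditions above are stated in terms of the maximum 3/4-radius R and the effective distance d_R. The following sketch spells both out; the empirical radius estimator is only for illustration, since in the paper R is a promised upper bound rather than a quantity estimated from data.

    import numpy as np

    def effective_distance(x, y, R):
        """d_R(x, y) = sqrt( sum_f min(R^2, (x_f - y_f)^2) ): each coordinate can contribute
        at most R, so a large value means many coordinates contribute to the distance."""
        return np.sqrt(np.minimum(R ** 2, (np.asarray(x) - np.asarray(y)) ** 2).sum())

    def empirical_radius(samples_1d, eps=0.75):
        """Empirical eps-radius of a one-dimensional sample: smallest R_eps such that a
        mass of eps lies within R_eps of the (empirical) median."""
        med = np.median(samples_1d)
        return np.quantile(np.abs(samples_1d - med), eps)

    rng = np.random.default_rng(2)
    x = rng.standard_cauchy(100_000)
    print("empirical 3/4-radius:", empirical_radius(x))   # about tan(3*pi/8) ~= 2.414 for a standard Cauchy
    print("d_R example:", effective_distance(np.zeros(4), [0.5, 3.0, 3.0, 3.0], R=1.0))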
Alternatively, we can combine our embedding with the more recent correlations-based clustering algorithm of [CR08]. The result is an efficient algorithm with the following guarantees.

Theorem 3 Suppose we are given s samples from a mixture of product distributions D = {D_1, ..., D_T} over R^n such that for every i and f, D_i^f satisfies properties (1) and (2). If, for some constant c_3,
d_R(m(D_i), m(D_j)) ≥ c_3 R ( √(T log Λ) + √(T log(nT/δ)) ),
where Λ = Θ(T n log^2 n / w_min), then Algorithm HT-CORRELATIONS clusters the samples correctly with probability 1 − δ over the samples, and with at least constant probability over the randomness in the algorithm. The algorithm runs in time polynomial in n and T, and the number of samples required by the algorithm is polynomial in n, T, and 1/w_min.

The condition imposed on the centers of the distributions states that every pair of centers is sufficiently far apart in space, and that the distance between every pair of centers is spread across Ω(T log Λ + T log(nT/δ)) coordinates.

3.2 Discussions

Symmetry. Our embedding still seems to work when the distributions do not have perfect symmetry, but satisfy an approximate symmetry condition. However, we illustrate by an example that we need at least a weak version of the symmetry condition for our embedding to work. Let D_1 and D_2 be the following distributions over R, where M is a very large number. For D_1 the probability density function is:
f_1(x) = 3/(8R),  for −R ≤ x ≤ R
       = 1/(8MR), for MR ≤ x ≤ 2MR
       = 1/(8MR), for −2MR ≤ x ≤ −MR.
The density function for D_2 is:
f_2(x) = 3/(8R),  for −R ≤ x ≤ R
       = 1/(4MR), for −2MR ≤ x ≤ −MR.
We note that although the medians of D_1 and D_2 are R/3 apart, the overlap in their probability mass in any interval of size 2R is very high. Our embedding relies on the fact that two distributions whose medians are far apart, and whose 3/4-radius is bounded by R, have low overlap in probability mass in a region of size Θ(R) around the median; it therefore does not work for distributions like D_1 and D_2.
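A quick numerical check of the symmetry example above (with M = 10^6 standing in for the "very large number"): the two densities place the same 3/4 of their mass uniformly on [−R, R], yet their medians differ by R/3.

    import numpy as np

    R, M = 1.0, 1e6
    rng = np.random.default_rng(3)

    def sample_d1(size):
        u = rng.random(size)
        out = rng.uniform(-R, R, size)                # central block, mass 3/4
        far = rng.uniform(M * R, 2 * M * R, size)     # heavy blocks, mass 1/8 on each side
        out = np.where(u >= 0.875, far, out)
        out = np.where((u >= 0.75) & (u < 0.875), -far, out)
        return out

    def sample_d2(size):
        u = rng.random(size)
        out = rng.uniform(-R, R, size)                # same central block, mass 3/4
        far = rng.uniform(-2 * M * R, -M * R, size)   # all remaining mass 1/4 on the left
        return np.where(u >= 0.75, far, out)

    print("median of D1 ~", np.median(sample_d1(200_000)))   # about 0
    print("median of D2 ~", np.median(sample_d2(200_000)))   # about -R/3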
Spreading Condition. We note that our spreading condition, while similar to the slope requirement of [DHKS05], is weaker; while they require the total contribution to the distance between any two medians from all the coordinates to be large with respect to the contribution from the maximum coordinate, we only require that the contribution come from a few coordinates, regardless of what the maximum contribution from a coordinate is.

4 Embedding Distributions onto the Hamming Cube

In this section, we describe an embedding which maps points in R^n to points on a Hamming cube of higher dimension. The embedding has the following property: if, for any i and j, D_i and D_j are product distributions on R^n with properties (1) and (2) such that their medians are far apart, then the distributions induced on the Hamming cube by applying the embedding to points from D_i and D_j also have centers which are far apart.

The building blocks of our embedding are embeddings {Φ^f}, one for each coordinate f in {1, ..., n}. The final embedding is a concatenation of the maps Φ^f for 1 ≤ f ≤ n. We describe more precisely how to put together the maps Φ^f in Section 4.3; for now, we focus on the individual embeddings Φ^f.

Each embedding Φ^f, in its turn, is a concatenation of two embeddings. The first one ensures that, for any i and j, if D_i^f and D_j^f are two distributions with properties (1) and (2) such that |m(D_i^f) − m(D_j^f)| is smaller than (or in the same range as) R, then the expected distance between the centers of the distributions induced by applying the embedding to points from D_i^f and D_j^f is Ω(|m(D_i^f) − m(D_j^f)| / R). Unfortunately, this embedding does not provide good guarantees when |m(D_i^f) − m(D_j^f)| is large with respect to R. To address this, we use our second embedding, which guarantees that when |m(D_i^f) − m(D_j^f)| is large with respect to R, the centers of the two distributions induced by applying the embedding to points from D_i^f and D_j^f are at least a constant distance apart. By concatenating these two embeddings, we ensure that in either case, the centers of the induced distributions obtained by applying Φ^f to D_i^f and D_j^f are far apart.

4.1 Embedding Distributions with Small Separation

In this section, we describe an embedding with the following property. If, for any i, j, and f, D_i^f and D_j^f have properties (1) and (2) and |m(D_i^f) − m(D_j^f)| < 8R, then the distance between the centers of the distributions induced by applying the embedding to points generated from D_i^f and D_j^f is proportional to |m(D_i^f) − m(D_j^f)| / (8R).

The embedding is as follows. Given a parameter R_1 and r in [0, R_1), we define, for a point x in R,
φ_r(x) = 0, if floor((x − r)/R_1) is even
       = 1, otherwise.
In other words, we divide the real line into intervals of length R_1 and assign label 0 to the even intervals and label 1 to the odd intervals. The value of φ_r(x) is then the label of the interval containing x − r. The properties of this embedding can be summarized as follows.

Theorem 4 For any i, j, and f, if D_i^f and D_j^f have properties (1) and (2), and if r is drawn uniformly at random from [0, R_1) and R_1 > 2R + 3|m(D_i^f) − m(D_j^f)|, then
E[ | Pr_{x ~ D_i^f}[φ_r(x) = 0] − Pr_{x ~ D_j^f}[φ_r(x) = 0] | ] ≥ |m(D_i^f) − m(D_j^f)| / (2 R_1).
Here the expectation is taken over the distribution of r.
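A minimal sketch of the map φ_r, together with a Monte Carlo illustration of the quantity bounded in Theorem 4. The choice of Cauchy coordinates is only for concreteness; the theorem applies to any pair of distributions satisfying properties (1) and (2) with the stated choice of R_1.

    import numpy as np

    def phi_r(x, r, R1):
        """phi_r of Section 4.1: 0 if floor((x - r)/R1) is even, 1 otherwise (vectorised)."""
        return (np.floor((np.asarray(x) - r) / R1) % 2).astype(int)

    rng = np.random.default_rng(0)
    R = np.tan(3 * np.pi / 8)        # 3/4-radius of a standard Cauchy coordinate
    Delta = 2 * R                    # separation between the two medians
    R1 = 2 * R + 3 * Delta + 1.0     # satisfies R1 > 2R + 3*Delta, as Theorem 4 requires

    gaps = []
    for _ in range(200):             # expectation over the random shift r
        r = rng.uniform(0, R1)
        xi = rng.standard_cauchy(5000)            # D_i^f: median 0
        xj = rng.standard_cauchy(5000) + Delta    # D_j^f: median Delta
        gaps.append(abs(np.mean(phi_r(xi, r, R1) == 0) - np.mean(phi_r(xj, r, R1) == 0)))
    print("estimate of E_r |Pr_i - Pr_j|:", np.mean(gaps), "  Theorem 4 bound:", Delta / (2 * R1))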
Notation. For i = 1, ..., T, we write φ_i^f for the probability density function of distribution D_i^f centered at 0, and F_i^f for the cumulative density function of D_i^f centered at 0. For a real number r in [0, R_1), and for i = 1, ..., T, we define
ν_i^f(r) = sum_{ℓ = −∞}^{∞} ( F_i^f(r + (2ℓ+1)R_1) − F_i^f(r + 2ℓ R_1) ).
In other words, ν_i^f(r) is the total probability mass of the distribution D_i^f in the even intervals when the shift is r, which is again the probability that a point drawn from D_i^f is mapped to 0 by the embedding φ_r. In the sequel, we use Δ to denote |m(D_i^f) − m(D_j^f)|. We also assume, without loss of generality, that m(D_j^f) ≥ m(D_i^f) and that m(D_i^f) = 0. Then the left-hand side of the equation in Theorem 4 can be written as follows:
E[ | Pr_{x ~ D_i^f}[φ_r(x) = 0] − Pr_{x ~ D_j^f}[φ_r(x) = 0] | ] = (1/R_1) ∫_{r = −R_1/2}^{R_1/2} | ν_j^f(r + Δ) − ν_i^f(r) | dr.   (1)

The proof of Theorem 4 follows in two steps. First, we show that if D_i^f were a shifted version of D_j^f, a slightly stronger version of Theorem 4 would hold; this is shown in Lemma 5. Next, Lemma 8 shows that even if D_i^f is not a shifted version of D_j^f, the statement of Theorem 4 still holds.

Lemma 5 For any Δ, if R_1 > 3Δ + 2R, then, for any i,
∫_{r = −R_1/2}^{R_1/2} ( ν_i^f(r) − ν_i^f(Δ + r) ) dr ≥ Δ/2.

Note that the difference between the statement of Theorem 4 and Lemma 5 is that the left-hand side of the equation in Theorem 4 has an absolute value, and hence Lemma 5 makes a stronger statement (under stronger assumptions). Before we prove Lemma 5, we need the following lemma.

Lemma 6 Let [a, a'] be any interval of length more than 2Δ. Then, for any i,
Δ · ∫_{r = a+Δ}^{a'−Δ} φ_i^f(r) dr ≤ ∫_{r = a}^{a'−Δ} ( F_i^f(r + Δ) − F_i^f(r) ) dr ≤ Δ · ∫_{r = a}^{a'} φ_i^f(r) dr.

Figure 1: Proof of Lemma 6

Proof: For any r, F_i^f(r + Δ) − F_i^f(r) = ∫_{t = r}^{r+Δ} φ_i^f(t) dt. We divide the interval [a, a'] into infinitesimal intervals of length δ̄. The probability mass of distribution D_i^f in an interval [t, t + δ̄] is δ̄ · φ_i^f(t). Note that in the expression ∫_{r = a}^{a'−Δ} ( F_i^f(r + Δ) − F_i^f(r) ) dr, the probability mass of each interval [t, t + δ̄] where t lies in [a + Δ, a' − Δ] is counted exactly Δ times, and the probability mass of D_i^f in an interval [t, t + δ̄], where t lies in [a, a + Δ) ∪ (a' − Δ, a'], is counted at most Δ times (see Figure 1). Since φ_i^f(t) ≥ 0 for all t, the lemma follows in the limit as δ̄ → 0.

Proof: (Of Lemma 5) The shaded area in Figure 2 shows the value of ν_i^f(r) − ν_i^f(r + Δ) for a distribution D_i^f. We can write:
∫_{r = −R_1/2}^{R_1/2} ( ν_i^f(r) − ν_i^f(r + Δ) ) dr
 = ∫_{r = −R_1/2}^{R_1/2} sum_{ℓ} [ ( F_i^f(r + (2ℓ+1)R_1) − F_i^f(r + 2ℓ R_1) ) − ( F_i^f(r + Δ + (2ℓ+1)R_1) − F_i^f(r + Δ + 2ℓ R_1) ) ] dr
 = ∫_{r = −R_1/2}^{R_1/2} sum_{ℓ} [ ( F_i^f(r + 2ℓ R_1 + Δ) − F_i^f(r + 2ℓ R_1) ) − ( F_i^f(r + (2ℓ+1)R_1 + Δ) − F_i^f(r + (2ℓ+1)R_1) ) ] dr.

Figure 2: Proof of Lemma 5

From Lemma 6, the first term is at least sum_{ℓ} Δ · ∫_{r = −R_1/2 + Δ}^{R_1/2 − Δ} φ_i^f(r + 2ℓ R_1) dr. This is Δ times the total probability mass of D_i^f in the intervals [2ℓ R_1 − R_1/2 + Δ, 2ℓ R_1 + R_1/2 − Δ], over all ℓ. Since R_1 > 2Δ + 2R, this includes the interval [−R, R], and as the median of D_i^f is at 0 and D_i^f has 3/4-radius at most R, the value of the first term is at least 3Δ/4. From Lemma 6, the second term is at most sum_{ℓ} Δ · ∫_{r = −R_1/2}^{R_1/2} φ_i^f(r + (2ℓ+1)R_1) dr. This is Δ times the total probability mass of D_i^f in the intervals [(2ℓ+1)R_1 − R_1/2, (2ℓ+1)R_1 + R_1/2], over all ℓ. Since R_1 > 3Δ + 2R, none of these intervals intersects [−R, R]. The total probability mass in these intervals is therefore at most 1/4, and so the value of the second term is at most Δ/4. The lemma follows.

Next we show that Theorem 4 holds even if distribution D_i^f is not a shifted version of distribution D_j^f. This is shown by a combination of Lemmas 7 and 8, which are both consequences of the symmetry of the distributions D_i^f and D_j^f.

Lemma 7 Suppose that, for any i, j, and f, D_i^f and D_j^f have property (1) and median 0. Then, for any r,
ν_i^f(r) − ν_j^f(r) = ν_j^f(−r) − ν_i^f(−r).

Proof: We define
ν̄_i^f(r) = sum_{ℓ = −∞}^{∞} ( F_i^f(r + 2ℓ R_1) − F_i^f(r + (2ℓ−1)R_1) ).
Thus, ν̄_i^f(r) is the probability mass of D_i^f in the odd intervals, which is again the probability that φ_r maps a random point from D_i^f to 1 when the shift chosen is r. Therefore, ν̄_i^f(r) = 1 − ν_i^f(r). Since D_i^f is symmetric with median 0, for any interval [a, a'] with a' > a > 0, F_i^f(a') − F_i^f(a) = F_i^f(−a) − F_i^f(−a'). Therefore,
ν_i^f(−r) = sum_{ℓ} ( F_i^f(−r + (2ℓ+1)R_1) − F_i^f(−r + 2ℓ R_1) ) = sum_{ℓ} ( F_i^f(r − 2ℓ R_1) − F_i^f(r − (2ℓ+1)R_1) ) = ν̄_i^f(r).
The lemma follows because ν_i^f(r) − ν_j^f(r) = ν̄_j^f(r) − ν̄_i^f(r) = ν_j^f(−r) − ν_i^f(−r).

Lemma 8 For any i and j, if D_i^f and D_j^f have properties (1) and (2), then
∫_{r = −R_1/2}^{R_1/2} | ν_j^f(r + Δ) − ν_i^f(r) | dr ≥ ∫_{r = −R_1/2}^{R_1/2} ( ν_j^f(r) − ν_j^f(r + Δ) ) dr.

Proof: By Lemma 7, for every r in [−R_1/2, R_1/2] there is a unique r' = −r such that ν_j^f(r) − ν_i^f(r) = ν_i^f(r') − ν_j^f(r'). We claim that for every such pair r, r',
( ν_j^f(r) − ν_j^f(r + Δ) ) + ( ν_j^f(r') − ν_j^f(r' + Δ) ) ≤ | ν_j^f(r + Δ) − ν_i^f(r) | + | ν_j^f(r' + Δ) − ν_i^f(r') |.
To see this, note that for a fixed pair (r, r'),
| ν_j^f(r + Δ) − ν_i^f(r) | + | ν_j^f(r' + Δ) − ν_i^f(r') |
 ≥ | ν_j^f(r + Δ) − ν_i^f(r) + ν_j^f(r' + Δ) − ν_i^f(r') |
 = | ( ν_j^f(r + Δ) − ν_j^f(r) ) + ( ν_j^f(r' + Δ) − ν_j^f(r') ) + ( ν_j^f(r) + ν_j^f(r') − ν_i^f(r) − ν_i^f(r') ) |
 = | ( ν_j^f(r + Δ) − ν_j^f(r) ) + ( ν_j^f(r' + Δ) − ν_j^f(r') ) |,
where the last step follows because ν_j^f(r) + ν_j^f(r') − ν_i^f(r) − ν_i^f(r') = 0 by Lemma 7. The lemma follows by summing over all such pairs (r, r').

Proof: (Of Theorem 4) From Equation (1) and Lemma 8,
E[ | Pr_{x ~ D_i^f}[φ_r(x) = 0] − Pr_{x ~ D_j^f}[φ_r(x) = 0] | ] = (1/R_1) ∫_{r = −R_1/2}^{R_1/2} | ν_j^f(r + Δ) − ν_i^f(r) | dr ≥ (1/R_1) ∫_{r = −R_1/2}^{R_1/2} ( ν_j^f(r) − ν_j^f(r + Δ) ) dr ≥ Δ / (2 R_1).
The last step follows from Lemma 5.
4.2 Embedding Distributions with Large Separation

In this section, we describe an embedding with the following property. For any i, j, and f, if D_i^f and D_j^f have properties (1) and (2), and |m(D_i^f) − m(D_j^f)| ≥ 8R, then the expected gap between the centers of the distributions induced by applying the embedding to points from D_i^f and D_j^f is at least a constant.

The embedding is as follows. Given a random ρ = (τ, {b_k}_{k in Z}), where τ is a number in [0, R_2) and {b_k} is an infinite sequence of bits, we define Ψ_ρ : R → {0, 1} as follows:
Ψ_ρ(x) = b_{k(x)},  where k(x) = floor((x − τ)/R_2).   (2)
In other words, if x − τ lies in the interval [k R_2, (k+1) R_2), then Ψ_ρ(x) = b_k. The properties of the embedding can be summarized as follows.

Theorem 9 For any i, j, and f, let D_i^f and D_j^f have properties (1) and (2), and let |m(D_i^f) − m(D_j^f)| ≥ 8R. If R_2 = 8R, and if τ is generated uniformly at random from the interval [0, R_2), and each b_k is generated by an independent toss of a fair coin, then
E[ | Pr_{x ~ D_i^f}[Ψ_ρ(x) = 0] − Pr_{x ~ D_j^f}[Ψ_ρ(x) = 0] | ] ≥ 1/8,
where the expectation is taken over the distribution of ρ.

Proof: We say that an interval [a, a'] of length 8R or less is cut by the embedding Ψ_ρ if there exists some y in [a, a'] such that (y − τ)/(8R) is an integer. If [a, a'] is cut at y, then, with probability 1/2 over the choice of {b_k}, any point x in the interval [a, y] has a different value of Ψ_ρ(x) than any point in (y, a']. If an interval is not cut, then all points in the interval have the same value of Ψ_ρ with probability 1 over the choice of {b_k}.

Since the intervals [m(D_i^f) − R, m(D_i^f) + R] and [m(D_j^f) − R, m(D_j^f) + R] have length 2R each,
Pr[ neither [m(D_i^f) − R, m(D_i^f) + R] nor [m(D_j^f) − R, m(D_j^f) + R] is cut ] ≥ 1 − (2R + 2R)/(8R) = 1/2.
If neither of these intervals is cut, then
Pr[ Ψ_ρ(m(D_i^f) − R) ≠ Ψ_ρ(m(D_j^f) − R) ] = 1/2.
Let us assume that the intervals [m(D_i^f) − R, m(D_i^f) + R] and [m(D_j^f) − R, m(D_j^f) + R] are not cut and that Ψ_ρ(m(D_i^f) − R) ≠ Ψ_ρ(m(D_j^f) − R); from the two equations above, the probability of this event is at least 1/4. Also suppose, without loss of generality, that Ψ_ρ(m(D_i^f) − R) = 0. Then, since R is an upper bound on the 3/4-radius of the distributions D_i^f and D_j^f, the probability mass of D_i^f that maps to 0 is at least 3/4, and the probability mass of D_j^f that maps to 0 is at most 1/4. Therefore, with probability at least 1/4,
| Pr_{x ~ D_i^f}[Ψ_ρ(x) = 0] − Pr_{x ~ D_j^f}[Ψ_ρ(x) = 0] | ≥ 1/2.
The theorem follows.
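A minimal sketch of the map Ψ_ρ. The infinite bit sequence {b_k} is drawn lazily, one fair bit per length-R_2 cell as it is first needed; this lazy evaluation is an implementation convenience, not part of the paper's description.

    import numpy as np

    def make_psi(R2, rng):
        """Draw one random Psi_rho of Section 4.2: a uniform offset tau in [0, R2) together
        with a fair bit b_k for every cell [k*R2 + tau, (k+1)*R2 + tau) of the real line."""
        tau = rng.uniform(0, R2)
        bits = {}                                 # cell index k -> its bit, drawn on first use
        def psi(x):
            k = int(np.floor((x - tau) / R2))
            if k not in bits:
                bits[k] = int(rng.integers(0, 2))
            return bits[k]
        return psi

    rng = np.random.default_rng(1)
    R = 1.0
    psi = make_psi(R2=8 * R, rng=rng)
    print(psi(0.0), psi(0.05), psi(100.0))   # nearby points share a cell (and hence a bit);
                                             # points many cells apart receive independent bits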
4.3 Combining the Embeddings

In this section, we show how to combine the embeddings of Sections 4.1 and 4.2 to provide a map Φ which obeys the guarantees of Theorem 1. Given parameters R_1, R_2, and q, we define Φ^f for a coordinate f as follows:
Φ^f(x) = ( Ψ_{ρ_1}(x^f), ..., Ψ_{ρ_q}(x^f), φ_{r_1}(x^f), ..., φ_{r_q}(x^f) ).   (3)
Here, ρ_1, ..., ρ_q are q independent random values of ρ = (τ, {b_k}_{k in Z}), where τ is drawn uniformly at random from the interval [0, R_2), and the b_k, for all k, are generated by independent tosses of an unbiased coin; r_1, ..., r_q are q independent random values of r, where r is drawn uniformly at random from the interval [0, R_1). Finally, the embedding Φ is defined as:
Φ(x) = ( Φ^1(x), ..., Φ^n(x) ).   (4)
The properties of the embedding Φ are summarized in Theorem 1, which we now prove. We begin with the following lemma, which demonstrates the properties of each Φ^f.

Lemma 10 Let R_1 = 26R, R_2 = 8R, and q = 4 √n · T log n log T, and suppose we are given samples from a mixture of product distributions which satisfy conditions (1) and (2). Then, for all i and j, the embedding Φ = (Φ^1, ..., Φ^n) defined in Equations (3) and (4) satisfies the following conditions, with probability at least 1 − 1/n over the randomness in the embedding, for each coordinate f:
1. If |m(D_i^f) − m(D_j^f)| > 8R, then, for some constant c_5,
||E_{x ~ D_i^f}[Φ^f(x)] − E_{x ~ D_j^f}[Φ^f(x)]|| ≥ c_5 n^{1/4} T^{1/2} (log n log T)^{1/2}.
2. If R/√n ≤ |m(D_i^f) − m(D_j^f)| ≤ 8R, then, for some constant c_6,
||E_{x ~ D_i^f}[Φ^f(x)] − E_{x ~ D_j^f}[Φ^f(x)]|| ≥ c_6 n^{1/4} T^{1/2} (log n log T)^{1/2} · |m(D_i^f) − m(D_j^f)| / R.

Proof: (Of Lemma 10) The first part of the lemma follows from Theorem 9, along with an application of the Chernoff bounds, followed by a union bound over all i, j, f. The second part follows similarly by an application of Theorem 4.

Proof: (Of Theorem 1) We call a coordinate f very low for distributions i and j if |m(D_i^f) − m(D_j^f)| < R/√n, low if R/√n ≤ |m(D_i^f) − m(D_j^f)| < 8R, and high otherwise. Let V_{i,j}, L_{i,j} and H_{i,j} respectively denote the sets of very low, low and high coordinates for distributions D_i and D_j. Then,
||μ~_i − μ~_j||^2 = sum_{f in V_{i,j}} ||μ~_i^f − μ~_j^f||^2 + sum_{f in L_{i,j}} ||μ~_i^f − μ~_j^f||^2 + sum_{f in H_{i,j}} ||μ~_i^f − μ~_j^f||^2.
From Lemma 10, this sum is at least
c_6 n^{1/2} T log n log T · sum_{f in L_{i,j}} |m(D_i^f) − m(D_j^f)|^2 / R^2 + c_5 n^{1/2} T log n log T · |H_{i,j}|,
which, by the definition of effective distance, is at least
c_7 n^{1/2} T log n log T · ( d_R^2(m(D_i), m(D_j)) / R^2 − sum_{f in V_{i,j}} (m(D_i^f) − m(D_j^f))^2 / R^2 ),
where c_7 is some constant. Now the contribution from the very low coordinates to the distance between m(D_i) and m(D_j) is at most √( sum_f R^2/n ) = R. Since d_R(m(D_i), m(D_j)) ≥ 2R, this contribution is at most 1/2 of the total distance. The first part of the theorem therefore follows.

For any sample x from any D_i in the mixture, and any k, k', coordinates (f, k) and (f', k') of Φ(x) are functions of x^f and x^{f'} respectively. As, for f ≠ f', x^f and x^{f'} are independently distributed, the second part of the theorem follows.
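Putting the two per-coordinate maps together as in Equations (3) and (4): the sketch below draws, for every coordinate, q independent copies of Ψ_ρ and q independent copies of φ_r and concatenates their outputs. The +1 inside the logarithms (to keep q positive for very small n and T) and the lazily drawn bit tables are implementation conveniences, not part of the paper.

    import numpy as np

    def make_embedding(n, R, T, rng):
        """A sketch of the combined map Phi of Section 4.3, mapping R^n to {0,1}^(2*n*q)."""
        q = int(np.ceil(4 * np.sqrt(n) * T * np.log(n + 1) * np.log(T + 1)))  # q ~ 4*sqrt(n)*T*log n*log T
        R1, R2 = 26 * R, 8 * R
        taus = rng.uniform(0, R2, size=(n, q))      # offsets for the q Psi maps of each coordinate
        shifts = rng.uniform(0, R1, size=(n, q))    # shifts for the q phi maps of each coordinate
        bit_tables = [[{} for _ in range(q)] for _ in range(n)]   # lazily drawn cell bits per Psi map

        def psi_bit(f, k, cell):
            table = bit_tables[f][k]
            if cell not in table:
                table[cell] = int(rng.integers(0, 2))
            return table[cell]

        def embed(x):
            out = np.empty((n, 2 * q), dtype=np.uint8)
            for f in range(n):
                cells = np.floor((x[f] - taus[f]) / R2).astype(int)
                out[f, :q] = [psi_bit(f, k, c) for k, c in enumerate(cells)]       # Psi block
                out[f, q:] = np.floor((x[f] - shifts[f]) / R1).astype(int) % 2     # phi block
            return out.ravel()

        return embed

    rng = np.random.default_rng(6)
    embed = make_embedding(n=10, R=1.0, T=2, rng=rng)
    print(embed(rng.standard_cauchy(10)).shape)     # (2 * n * q,)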
5 Applications: Learning Mixtures

In this section, we show how the embedding of Theorem 1 can be combined with standard algorithms for learning mixture models to yield algorithms that can learn mixtures of heavy-tailed distributions. First, in Section 5.1, we show how to combine our embedding with the SVD-based algorithms of [KSV05, AM05]; in Section 5.2, we show how to combine our embedding with the more recent algorithm of [CR08].

5.1 Clustering using SVD

In this section, we present Algorithm HT-SVD, a combination of the SVD-based algorithms of [AM05, KSV05] with our embedding of Theorem 1. The input to the algorithm is a set S of samples, and the output is a partitioning of the samples. The algorithm is described in Figure 3.

HT-SVD(S)
1. Let R_1 = 26R, R_2 = 8R, and q = 4 √n · T log n log T. Compute S~ = {Φ(x) | x in S}. Partition S~ into S~_A and S~_B uniformly at random.
2. Construct the (s/2) × 2nq matrix S̄_A (respectively S̄_B) in which the entry at row l and column l' is the l'-th coordinate of the l-th sample point in S~_A (S~_B respectively).
3. Let {v_{1,A}, ..., v_{T,A}} (resp. {v_{1,B}, ..., v_{T,B}}) be the top T right singular vectors of S̄_A (resp. S̄_B). Project each point in S~_B (resp. S~_A) onto the subspace K_A (resp. K_B) spanned by v_{1,A}, ..., v_{T,A} (resp. v_{1,B}, ..., v_{T,B}).
4. Use a distance-based clustering algorithm, as in [AK01], to partition the points in S~_A and S~_B after projection.

Figure 3: Algorithm Using SVDs

The properties of Algorithm HT-SVD are summarized by Theorem 2, which we prove in the rest of this section. The two main steps in the proof are as follows: first, we show that after applying our embedding, the transformed distributions have good properties, such as low directional variance and distance concentration. Next, we show that these properties imply that SVD-based algorithms, such as those of [KSV05, AM05], can learn these mixtures effectively.
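A sketch of the pipeline in Figure 3, assuming an embedding function such as the make_embedding sketch above is available. The final distance-based step is a crude greedy stand-in for the clustering procedure of [AK01], and cluster labels are assigned within each half independently; this illustrates the structure of HT-SVD rather than providing a faithful implementation.

    import numpy as np

    def ht_svd(S, embed, T, rng):
        """Structure of HT-SVD: embed, split the samples in half, project each half onto the
        top-T right-singular subspace of the other half's sample matrix, then cluster by
        distances (the greedy thresholded clustering below merely stands in for [AK01])."""
        X = np.array([embed(x) for x in S], dtype=float)
        idx = rng.permutation(len(X))
        half = len(X) // 2
        parts = (idx[:half], idx[half:])

        def top_T_subspace(M):
            _, _, Vt = np.linalg.svd(M, full_matrices=False)
            return Vt[:T].T                      # columns span the projection subspace

        labels = np.empty(len(X), dtype=int)
        for own, other in ((0, 1), (1, 0)):
            K = top_T_subspace(X[parts[other]])  # subspace computed from the *other* half
            P = X[parts[own]] @ K
            scale = np.median(np.linalg.norm(P - np.median(P, axis=0), axis=1))  # crude threshold
            centers, assignment = [], []
            for p in P:                          # greedy distance-based assignment
                d = [np.linalg.norm(p - c) for c in centers]
                if centers and min(d) < scale:
                    assignment.append(int(np.argmin(d)))
                else:
                    centers.append(p)
                    assignment.append(len(centers) - 1)
            labels[parts[own]] = assignment
        return labels

With the toy mixture and embedding from the earlier sketches, one would call ht_svd(X, embed, T=2, rng=np.random.default_rng(0)).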
The following lemma shows that the maximum directional variance of the transformed distributions in the mixture is not too high; this fact is later used crucially in demonstrating that SVD-based algorithms can effectively cluster the mixture.

Lemma 11 For any i, the maximum directional variance of the transformed distribution D~_i is at most O(n^{1/2} T log n log T).

Proof: Let v be any unit vector in the transformed space. The variance of the transformed distribution D~_i along v can be written as:
E_{x~ ~ D~_i}[ <v, x~ − E[x~]>^2 ]
 = sum_{(f,k)} (v^{f,k})^2 E[ (x~^{f,k} − E[x~^{f,k}])^2 ] + 2 sum_{(f,k) ≠ (f',k')} v^{f,k} v^{f',k'} E[ (x~^{f,k} − E[x~^{f,k}]) (x~^{f',k'} − E[x~^{f',k'}]) ].
As x~^{f,k} is distributed independently of x~^{f',k'} when f ≠ f', in this case
E[ (x~^{f,k} − E[x~^{f,k}]) (x~^{f',k'} − E[x~^{f',k'}]) ] = 0.
The lemma follows as |x~^{f,k} − E[x~^{f,k}]| ≤ 1 for any f and k, and there are at most O(n^{1/2} T log n log T) coordinates corresponding to a single f.

Next we show that the transformed distributions also possess some distance-concentration properties.

Lemma 12 Let H be a d-dimensional subspace of {0, 1}^{4 n^{3/2} T log T log n}. Then, for any i,
Pr_{x ~ D~_i}[ ||P_H(x~ − E[x~])|| ≤ 4 n^{1/4} T^{1/2} (log n log T)^{1/2} √(d log(d/δ)) ] ≥ 1 − δ.

Proof: Let q = 4 n^{1/2} T log n log T. Let v_1, ..., v_d be an orthonormal basis of H. As
||P_H(x~)||^2 = sum_{l=1}^{d} <v_l, x~>^2,
we apply the Method of Bounded Differences to bound the value of each <v_l, x~> = sum_{f,k} v_l^{f,k} x~^{f,k}. Changing a single coordinate f of the original sample point x changes only the coordinates (f, k) of x~, of which there are at most 2q; hence the change in <v_l, x~> when coordinate f is changed is at most γ_f = √( 2q · sum_k (v_l^{f,k})^2 ). Therefore, sum_f γ_f^2 = 2q · sum_{f,k} (v_l^{f,k})^2 = 2q, since v_l is a unit vector. Thus, for any l,
Pr[ | <v_l, x~> − <v_l, E[x~]> | > 2 √( q log(d/δ) ) ] ≤ δ/d.
As ||P_H(x~ − E[x~])||^2 = sum_l <v_l, x~ − E[x~]>^2, the lemma follows by applying a union bound over the vectors v_l.

We are now ready to prove Theorem 2. The main tool in our proof is the following lemma, due to [AM05], which shows that if the separation between the transformed centers is large, then Step 3 of the algorithm finds a subspace in which the transformed centers remain far apart.

Lemma 13 For each i, let c_{i,A} be the empirical center of D~_i computed from the points in S~_A, and let σ~ be the maximum directional standard deviation of any D~_i. Then,
||P_{K_B}(c_{i,A} − c_{j,A})|| ≥ ||c_{i,A} − c_{j,A}|| − σ~ √T (w_i^{-1/2} + w_j^{-1/2}).

Proof: (Of Theorem 2) Let q = 4 n^{1/2} T log n log T. When the distributions in the input mixture obey the separation condition of Theorem 2, then, from Theorem 1, for each i and j the distance between the transformed centers μ~_i and μ~_j is at least
Ω(√q) · ( √T (w_i^{-1/2} + w_j^{-1/2}) + √(T log(Tn/δ)) ).
Since the number of samples is at least Ω(n^{3/2} T / w_min), the distance between the sample means and the actual means of the transformed distributions is at most O(1). Therefore, from Lemma 13,
||P_{K_B}(c_{i,A} − c_{j,A})|| ≥ c_8 √q · √(T log(Tn/δ)),
where c_{i,A} and c_{j,A} are the empirical centers of the transformed distributions and c_8 is some constant. As K_B has dimension at most T, from Lemma 12 and a union bound over all pairs of samples, with probability 1 − δ, all pairs of samples drawn from the same distribution D_i are at distance at most 2√2 · n^{1/4} T^{1/2} (log n log T)^{1/2} √(T log(nT/δ)) from each other in the subspace K_B. On the other hand, for some constant a, a sample drawn from D_i and a sample drawn from D_j are at least a · n^{1/4} T^{1/2} (log n log T)^{1/2} √(T log(nT/δ)) apart in K_B. Algorithm HT-SVD therefore works for a > 2√2.

5.2 Clustering Using Correlations

In this section, we present Algorithm HT-CORRELATIONS, which is a combination of our embedding with the correlations-based clustering algorithm of [CR08]. The algorithm is described in Figure 4.

HT-CORRELATIONS(S)
1. Partition the set of coordinates into F and G uniformly at random.
2. Partition S uniformly at random into S_A and S_B. Let R_1 = 26R, R_2 = 8R, and q = 4 √n · T log n log T. Compute S~_A = {Φ(x) | x in S_A} and S~_B = {Φ(x) | x in S_B}.
3. Construct the nq × nq covariance matrix M_A (respectively M_B), which has a row for each tuple (f, k), f in F, k in [2q], and a column for each tuple (g, k'), g in G, k' in [2q]. The entry at row (f, k) and column (g, k') is the covariance between coordinates (f, k) and (g, k') of the transformed points over all samples in S~_A (S~_B respectively).
4. Let {v_{1,A}, ..., v_{T,A}} and {y_{1,A}, ..., y_{T,A}} (respectively {v_{1,B}, ..., v_{T,B}} and {y_{1,B}, ..., y_{T,B}}) be the top T left and right singular vectors of M_A (resp. M_B). Project each point in S~_B (resp. S~_A) onto the subspace K_A (resp. K_B) spanned by {v_{1,A}, ..., v_{T,A}} ∪ {y_{1,A}, ..., y_{T,A}} (resp. {v_{1,B}, ..., v_{T,B}} ∪ {y_{1,B}, ..., y_{T,B}}).
5. Use a distance-based clustering algorithm, as in [AK01], to partition the points in S~_A and S~_B after projection.

Figure 4: Algorithm Using Correlations
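The distinctive step of Figure 4 is the cross-covariance between embedded coordinates originating from the two random halves F and G of the original coordinates. The sketch below isolates Steps 1, 3 and 4; the split of the samples into S_A and S_B and the final distance-based clustering are as in the HT-SVD sketch and are omitted. The argument coord_of_column (mapping each embedded column back to its original coordinate) is an assumption about how the embedding is stored; with the make_embedding sketch above it is np.repeat(np.arange(n), 2 * q).

    import numpy as np

    def correlation_subspace(X_embedded, coord_of_column, T, rng):
        """Split the original coordinates into F and G at random, form the cross-covariance
        matrix between embedded columns coming from F and those coming from G, and return a
        basis (in the full embedded space) spanned by its top-T left and right singular vectors.
        Project samples with X_embedded @ basis before distance-based clustering."""
        n = coord_of_column.max() + 1
        in_F = rng.random(n) < 0.5                       # random split of the original coordinates
        colsF = np.where(in_F[coord_of_column])[0]
        colsG = np.where(~in_F[coord_of_column])[0]

        Xc = X_embedded - X_embedded.mean(axis=0)
        M = Xc[:, colsF].T @ Xc[:, colsG] / len(Xc)      # cross-covariance between F- and G-columns
        U, _, Vt = np.linalg.svd(M, full_matrices=False)

        basis = np.zeros((X_embedded.shape[1], 2 * T))   # embed the 2T directions back into all columns
        basis[colsF, :T] = U[:, :T]
        basis[colsG, T:] = Vt[:T].T
        return basis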
The input to the algorithm is a set S of s samples, and the output is a partitioning of the samples. The properties of Algorithm HT-CORRELATIONS are described in Theorem 3, and this section is devoted to proving that theorem. The proof proceeds in three steps. First, we deduce from Theorem 1 that if the distributions satisfy the conditions in Theorem 3, then the transformed distributions satisfy the separation and spreading requirements of Theorem 1 in [CR08]. We can then apply that theorem to show that the centers of the transformed distributions are far apart in K_A and K_B, the subspaces computed in Step 4 of Algorithm HT-CORRELATIONS. Finally, we use this fact, along with Lemmas 11 and 12, to show that distance-concentration based clustering works in these output subspaces.

Proof: (Of Theorem 3) Let q = 4 n^{1/2} T log n log T. From Theorem 1 and conditions (1) and (2), for each i and j, the distance between the transformed centers μ~_i and μ~_j is at least
Ω(√q) · ( √(T log Λ) + √(T log(nT/δ)) ).
We note that the proof of Theorem 1 in [CR08] requires only that, for each distribution, the coordinates in F be distributed independently of the coordinates in G. Since the distribution of any transformed coordinate coming from F is independent of the distribution of any transformed coordinate coming from G (although the coordinates within F, or within G, are not necessarily independently distributed), we can apply Theorem 1 of [CR08] to conclude that, for each i and j, there exists some constant a_1 such that
d_{K_B}(μ~_i, μ~_j) = Ω( d(μ~_i, μ~_j) ) ≥ a_1 √( q T log(Tn/δ) ).
As K_B has dimension at most 2T, from Lemma 12 and a union bound, with probability 1 − δ, all pairs of samples drawn from the same distribution D_i are at distance at most 2√2 · √( q T log(Tn/δ) ) from each other in the subspace K_B. On the other hand, a sample drawn from D_i and a sample drawn from D_j are at least (a_1 − 2√2) √( q T log(Tn/δ) ) apart in K_B. Algorithm HT-CORRELATIONS therefore works.

References

[AK01] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the 33rd ACM Symposium on Theory of Computing, pages 247-257, 2001.
[AM05] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory, pages 458-469, 2005.
[CR08] K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In 21st Annual Conference on Learning Theory, 2008.
[Das99] S. Dasgupta. Learning mixtures of Gaussians. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, pages 634-644, 1999.
[DHKS05] A. Dasgupta, J. Hopcroft, J. Kleinberg, and M. Sandler. On learning mixtures of heavy-tailed distributions. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science, pages 491-500, 2005.
[DS00] S. Dasgupta and L. Schulman. A two-round variant of EM for Gaussian mixtures. In Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2000.
[FOS05] J. Feldman, R. O'Donnell, and R. Servedio. Learning mixtures of product distributions over discrete domains. In Proceedings of FOCS, 2005.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. In FOCS, pages 10-33, 2001.
[KSV05] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.
[VW02] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. In Proceedings of the 43rd IEEE Symposium on Foundations of Computer Science, pages 113-123, 2002.