Learning Mixtures of Product Distributions using Correlations and Independence

Kamalika Chaudhuri, Information Theory and Applications, UC San Diego, kamalika@soe.ucsd.edu
Satish Rao, Computer Science Division, UC Berkeley, satishr@cs.berkeley.edu

Abstract

We study the problem of learning mixtures of distributions, a natural formalization of clustering. A mixture of distributions is a collection of distributions D = {D1, . . . , DT} and mixing weights {w1, . . . , wT} such that Σ_i w_i = 1. A sample from a mixture is generated by choosing i with probability w_i and then choosing a sample from distribution Di. The problem of learning the mixture is that of finding the parameters of the distributions comprising D, given only the ability to sample from the mixture. In this paper, we restrict ourselves to learning mixtures of product distributions.

The key to learning the mixtures is to find a few vectors such that points from different distributions are sharply separated upon projection onto these vectors. Previous techniques use the vectors corresponding to the top few directions of highest variance of the mixture. Unfortunately, these directions may be directions of high noise rather than directions along which the distributions are separated. Further, skewed mixing weights amplify the effects of noise, and as a result previous techniques only work when the separation between the input distributions is large relative to the imbalance in the mixing weights.

In this paper, we present an algorithm that successfully learns mixtures of distributions with a separation condition that depends only logarithmically on the skewed mixing weights. In particular, it succeeds when the separation between the centers is Ω(σ√(T log Λ)), where σ is the maximum directional standard deviation of any distribution in the mixture, T is the number of distributions, and Λ is polynomial in T, σ, log n and the imbalance in the mixing weights. For our algorithm to succeed, we require a spreading condition: the distance between the centers must be spread across Ω(T log Λ) coordinates. Additionally, with arbitrarily small separation, i.e., even when the separation is not enough for clustering, we can approximate the subspace containing the centers given enough samples. Previous techniques failed to do so in polynomial time for non-spherical distributions regardless of the number of samples, unless the separation was large with respect to the maximum directional variance and polynomially large with respect to the imbalance of the mixing weights. Our algorithm works for binary product distributions and axis-aligned Gaussians. The spreading condition above is implied by the separation condition for binary product distributions, and is necessary for algorithms that rely on linear correlations. Finally, when a stronger version of our spreading condition holds, our algorithm performs successful clustering when the separation between the centers is only Ω(σ*√(T log Λ)), where σ* is the maximum directional standard deviation in the subspace containing the centers of the distributions.

1 Introduction

Clustering, the problem of grouping together data points in high-dimensional space using a similarity measure, is a fundamental problem of statistics with numerous applications in a wide variety of fields. A natural model for clustering is that of learning mixtures of distributions. A mixture of distributions is a collection of distributions D = {D1, . . . , DT} and mixing weights {w1, . . . , wT} such that Σ_i w_i = 1.
A sample from a mixture is generated by choosing i with probability w_i and then choosing a sample from distribution Di. The problem of learning the mixture is that of finding the parameters of the distributions comprising D, given only the ability to sample from the mixture.

If the distributions D1, . . . , DT are very close to each other, then even if we knew the parameters of the distributions, it would be impossible to classify the points correctly with high confidence. Therefore, Dasgupta [Das99] introduced the notion of a separation condition, which is a promise that each pair of distributions is sufficiently different according to some measure. Given points from a mixture of distributions and a separation condition, the goal is to find the parameters of the mixture D and to cluster all but a small fraction of the points correctly. A commonly used separation measure is the distance between the centers of the distributions, parameterized by the maximum directional variance σ² of any distribution in the mixture.

A common approach to learning the mixture, and therefore to clustering the high-dimensional cloud of points, is to find a few interesting vectors such that points from different distributions are sharply separated upon projection onto these vectors. Various distance-based methods [AK01, Llo82, DLR77] are then applied to cluster in the resulting low-dimensional subspace. The state of the art, in practice, is to use the vectors corresponding to the top few directions of highest variance of the mixture and to hope that they contain most of the separation between the centers. These directions are computed by a Singular Value Decomposition (SVD) of the matrix of samples. This approach has been theoretically analyzed by [VW02] for spherical distributions, and for more general distributions in [KSV05, AM05]. The latter show that the maximum-variance directions are indeed the interesting directions when the separation is Ω(σ/√w_min), where w_min is the smallest mixing weight of any distribution. This is the best possible result for SVD-based approaches: the directions of maximum variance may well not be the directions in which the centers are separated, but instead may be directions of very high noise, as illustrated in Figure 1(b). This problem is exacerbated when the mixing weights w_i are skewed, because a distribution with low mixing weight diminishes the contribution to the variance along a direction that separates the centers.

This bound is suboptimal for two reasons. First, although mixtures with skewed mixing weights arise naturally in practice (see [PSD00] for an example), given enough samples, mixing weights have no bearing on the separability of distributions. Consider two mixtures D and D' of distributions D1 and D2: in D, w1 = w2 = 1/2, while in D', w1 = 1/4 and w2 = 3/4. Given enough computational resources, if we can learn D from 50 samples, we should be able to learn D' from 100 samples. This does not necessarily hold for SVD-based methods. Secondly, regardless of σ, an algorithm that has prior knowledge of the subspace containing the centers of the distributions should be able to learn the mixture when the separation is proportional to σ*, the maximum directional standard deviation of any distribution in the subspace containing the centers. An example in which σ and σ* are significantly different is shown in Figure 1(b).
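As a concrete illustration of this phenomenon (ours, not from the paper; numpy is assumed, and the numbers are arbitrary), the following sketch builds a two-component axis-aligned Gaussian mixture with skewed mixing weights, where the centers differ only in the first coordinate and an independent second coordinate carries high-variance noise. The top singular vector of the centered sample matrix then aligns with the noise coordinate rather than with the direction separating the centers.

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 20000
    w1 = 0.05                      # skewed mixing weights: w1 = 0.05, w2 = 0.95
    labels = rng.random(n_samples) < w1

    # Centers differ only in coordinate 0 (separation 2); coordinate 1 is pure noise.
    mu = np.where(labels[:, None], np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
    stds = np.array([0.5, 10.0])   # the directional std dev along coordinate 1 is large
    samples = mu + rng.normal(size=(n_samples, 2)) * stds

    centered = samples - samples.mean(axis=0)
    _, _, vt = np.linalg.svd(centered / np.sqrt(n_samples), full_matrices=False)
    print("top variance direction:", np.round(vt[0], 3))
    # With these parameters the top direction is (approximately) the noise coordinate,
    # not the center-separating coordinate, so projecting onto it mixes the two clusters.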
In this paper, we study the problem of learning mixtures of product distributions. A product distribution over R^n is one in which each coordinate is distributed independently of the others. In practice, mixtures of product distributions have been used as mathematical models for data, and learning mixtures of product distributions specifically has been studied [FM99, FOS05, FOS06, DHKS05]; see the Related Work section for examples and details. However, even under this seemingly restrictive assumption, providing an efficient algorithm that does better than the bounds of [AM05, KSV05] turns out to be quite challenging. The main challenge is to find a low-dimensional subspace that contains most of the separation between the centers; although the independence assumption can (sometimes) help us identify which coordinates contribute to the distance between some pair of centers, actually finding the low-dimensional space still requires more involved techniques.

In this paper, we present an algorithm for learning mixtures of product distributions which is stable in the presence of skewed mixing weights and, under certain conditions, in the presence of high variance outside the subspace containing the centers. In particular, the dependence of the separation required by our algorithm on skewed mixing weights is only logarithmic. Additionally, with arbitrarily small separation (i.e., even when the separation is not enough for classification), with enough samples, we can approximate the subspace containing the centers. Previous techniques failed to do so for non-spherical distributions regardless of the number of samples, unless the separation was sufficiently large.

Our algorithm works for binary product distributions and axis-aligned Gaussians. We require that the distance between the centers be spread across Ω(T log Λ) coordinates, where Λ depends polynomially on the maximum distance between centers and 1/w_min. For our algorithm to classify the samples correctly, we further need the separation between centers to be Ω(σ√(T log Λ)). In addition, if a stronger version of the spreading condition is satisfied, then our algorithm requires a separation of only Ω(σ*√(T log Λ)) to ensure correct classification of the samples. The stronger spreading condition, discussed in more detail later, ensures that when we split the coordinates randomly into two sets, the maximum directional variance of any distribution in the mixture along the projection of the subspace containing the centers onto the subspace spanned by the coordinate vectors in each set is comparable to σ*².

In summary, compared to [AM05, KSV05], our algorithm is much (exponentially) less susceptible to the imbalance in mixing weights and, when the stronger spreading condition holds, to high-variance noise outside the subspace containing the centers. However, our algorithm requires a spreading condition and coordinate independence, while [AM05, KSV05] are more general. We note that for perfectly spherical distributions the results of [VW02] are better than ours; however, those results do not apply even to distributions with bounded eccentricity. Finally, unlike the results of [Das99, AK01, DS00], which require the separation to grow polynomially with the dimension, our separation grows only logarithmically with the dimension.

Our algorithm is based upon two key insights. The first insight is that if the centers are separated along several coordinates, then many of these coordinates are correlated with each other.
To exploit this observation, we choose half the coordinates randomly and search the space of this half for directions of high variance. We use the remaining half of the coordinates to filter the directions found. If a found direction separates the centers, it is likely to have some correlation with coordinates in the remaining half, and is therefore preserved by the filter. If, on the other hand, the direction found is due to noise, coordinate independence ensures that there is no correlation with the second half of the coordinates, and such directions are filtered away.

The second insight is that the tasks of searching for and filtering the directions can be accomplished simultaneously via a singular value decomposition of the matrix of covariances between the two halves of coordinates. In particular, we show that the top few singular directions of this covariance matrix approximately capture the subspace containing the centers. Moreover, we show that the covariance matrix has low singular value along any noise direction. By combining these ideas, we obtain an algorithm that is almost insensitive to mixing weights, a property essential for applications like population stratification [CHRZ07], that can be implemented using the heavily optimized and thus efficient SVD procedure, and that works with a separation condition closer to the information-theoretic bound.

Related Work

The first provable results for learning mixtures of Gaussians are due to Dasgupta [Das99], who shows how to learn mixtures of spherical Gaussians with a separation of Ω(σ√n) in an n-dimensional space. An EM-based algorithm by Dasgupta and Schulman [DS00] was shown to apply to more situations, and with a separation of Ω(σ n^{1/4}). Arora and Kannan [AK01] show how to learn mixtures of arbitrary Gaussians whose centers are separated by Ω(σ n^{1/4}). Their results apply to many other situations, for example, concentric Gaussians with sufficiently different variance. The first result that removed the dependence on n in the separation requirement was that of Vempala and Wang [VW02], who use SVD to learn mixtures of spherical Gaussians with Ω(σ T^{1/4}) separation. They project to a subspace of dimension T using an SVD and use a distance-based method in the low-dimensional space. If the separation is not enough for classification, [VW02] can also find, given enough samples, a subspace approximating the subspace containing the centers. While the results of [VW02] are independent of the imbalance in mixing weights, they apply only to perfectly spherical Gaussians and cannot be extended to Gaussians with bounded eccentricity. In further work, Kannan, Salmasian, and Vempala [KSV05] and Achlioptas and McSherry [AM05] show how to cluster general Gaussians using SVD. While these results are weaker than ours, they apply to a mixture of general Gaussians, axis-aligned or not. We note that their analysis also applies to binary product distributions, again with polynomial dependence on the imbalance in mixing weights. In contrast, our separation requirement is Ω(σ√(T log Λ)), i.e., it depends only logarithmically on the mixing weights, the dimension, and the maximum variance in noise directions.

There is also ample literature on specifically learning mixtures of product distributions. Freund and Mansour [FM99] show an algorithm which generates distributions that are ε-close to a mixture of two product distributions over {0, 1}^n in time polynomial in n and 1/ε.
Feldman, O'Donnell, and Servedio show how to generate distributions that are ε-close to a mixture of T product distributions [FOS05] and to a mixture of axis-aligned Gaussians [FOS06]. Like [FM99], they have no separation requirements, but their algorithm takes n^{O(T³)} time. Dasgupta et al. [DHKS05] provide an algorithm for learning mixtures of heavy-tailed product distributions which works with a separation of Ω(R√T), where R is the maximum half-radius of any distribution in the mixture. They do not directly address binary product distributions in their paper, but their techniques apply. While their separation requirement does not depend polynomially on 1/w_min, their algorithm runs in time exponential in Ω(1/w_min); they also require a slope condition, which is comparable to our spreading condition. Chaudhuri et al. [CHRZ07] show an iterative algorithm for learning mixtures of two product distributions that implicitly uses the notion of coordinate independence to filter out noise directions. However, the algorithm heavily uses the two-distribution restriction to find the appropriate directions, and does not work when T > 2.

Figure 1: (a) Spherical Gaussians: the direction of maximum variance is the direction separating the centers. (b) Arbitrary Gaussians: the direction of maximum variance is a noise direction.

Figure 2: An example where all covariances are 0: four distributions D1, . . . , D4 with µ_1^g = µ_2^g, µ_3^g = µ_4^g, µ_1^f = µ_4^f, and µ_2^f = µ_3^f.

More broadly, the problem of analyzing mixture-model data has received a great deal of attention in statistics, see for example [MB88, TSM85], and has numerous applications. We mention three applications where data is modelled as a mixture of product distributions. First, the problem of population stratification in population genetics has been posed as learning mixtures of binary product distributions in [SRH07]. In their work, the authors develop an MCMC method for addressing the problem, and their software embodiment is widely used. A second application is in speech recognition [Rey95, PFK02], which models acoustic features at a specific time point as a mixture of axis-aligned Gaussians. A third application is the widely used Latent Dirichlet Allocation model [BNJ03]. Here, documents are modelled as distributions over topics which, in turn, are distributions over words; subsequent choices of topics and words are assumed to be independent (for words, this is referred to as the "bag of words" assumption). [BNJ03] develops variational techniques that provide interesting results for various corpora. Interestingly, the same model was used by Kleinberg and Sandler [KS04] to model user preferences for purchasing goods (users correspond to documents, topics to categories, and words to goods). Their algorithm, which provides provably good performance in this model, also uses SVD-like clustering algorithms as a subroutine. Our clustering method also involves a canonical correlations analysis of the samples, which seems to have connections with multiview learning [KF07] and co-training [AT98].

Discussion

The Spreading Condition. The spreading condition loosely states that the distance between each pair of centers is spread along about Ω(T log Λ) coordinates. We demonstrate by an example that a spread of Ω(T) is a natural limit for all methods that use linear correlations between coordinates, such as our methods and SVD-based methods [VW02, KSV05, AM05]. We present, as an example, two distributions: a mixture D1 of T binary product distributions, and a single binary product distribution D2, which have exactly the same covariance matrix. Our example is based on the Hadamard code, in which a codeword for a k-bit message is 2^k bits long and includes a parity bit for each subset of the bits of the message. The distributions comprising D1 are defined as follows. Each of the T = 2^k centers is the codeword of a k-bit string, appended by a string of length n − T in which each coordinate has value 1/2. Notice that the last n − T bits are noise. Thus, the centers are separated by T/2 coordinates. D2 is the uniform distribution over the n-dimensional hypercube. As there are no linear correlations between any two bits in the Hadamard code, the covariance of D1 along any two distinct coordinates is 0, and each coordinate has the same variance. As this is also the case for D2, any SVD-based or correlation-based algorithm will fail to distinguish between the two mixtures.
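The following sketch (ours, not from the paper; numpy assumed) constructs the Hadamard-code mixture described above for k = 3 and checks that its coordinate-wise covariance matrix coincides with that of the uniform distribution over the hypercube; the appended noise coordinates are omitted since they are i.i.d. Bernoulli(1/2) in both D1 and D2.

    import numpy as np
    from itertools import product

    k = 3
    T = 2 ** k                                             # number of mixture components
    messages = np.array(list(product([0, 1], repeat=k)))           # T x k
    subsets  = np.array(list(product([0, 1], repeat=k)))[1:]       # non-empty subsets of [k]

    # Center i = Hadamard codeword of message i: one parity bit per non-empty subset.
    centers = (messages @ subsets.T) % 2                           # T x (T-1), entries in {0,1}

    # Each component is a deterministic binary product distribution on these coordinates,
    # so the mixture covariance on them equals the between-center covariance
    # (mixing weights are uniform, 1/T each).
    mean = centers.mean(axis=0)
    between_cov = (centers - mean).T @ (centers - mean) / T

    # Uniform distribution over the hypercube: covariance is (1/4) * identity.
    uniform_cov = 0.25 * np.eye(T - 1)
    print(np.allclose(between_cov, uniform_cov))   # True: all pairwise covariances match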
We also note that learning binary product distributions with minimum separation 1/2 and average separation (1 + 2 log T)/2 would allow one to learn parities of log T variables with noise. Finally, we note that when the spreading condition fails, only a few coordinates contain most of the distance between the centers. One could enumerate the set of possible coordinates to deal with this case, which takes time exponential in T log n log Λ. [FOS05], on the other hand, takes time exponential in T³ log n, and works with no separation requirement.

2 A Summary of Our Results

We begin with some preliminary definitions about distributions over n-dimensional spaces. We use f, g, . . . to range over coordinates, and i, j, . . . to range over distributions. For any x ∈ R^n, we write x^f for the f-th coordinate of x. For any subspace H (resp. vector v), we use H̄ (resp. v̄) to denote the orthogonal complement of H (resp. v). For a subspace H and a vector v, we write P_H(v) for the projection of v onto the subspace H. For any vector x, we use ||x|| for the Euclidean norm of x. For any two vectors x and y, we use ⟨x, y⟩ for the dot product of x and y.

Mixtures of Distributions. A mixture of distributions D is a collection of distributions {D1, . . . , DT} over points in R^n and a set of mixing weights w1, . . . , wT such that Σ_i w_i = 1. In the sequel, n is assumed to be much larger than T. In a product distribution over R^n, each coordinate is distributed independently of the others. When working with a mixture of binary product distributions, we assume that the f-th coordinate of a point drawn from distribution Di is 1 with probability µ_i^f and 0 with probability 1 − µ_i^f. When working with a mixture of axis-aligned Gaussian distributions, we assume that the f-th coordinate of a point drawn from distribution Di is distributed as a Gaussian with mean µ_i^f and standard deviation σ_i^f.

Centers. We define the center of distribution Di as the vector µ_i, and the center of mass of the mixture as the vector µ̄, where µ̄^f is the mean of the mixture for coordinate f. We write C for the subspace containing µ1, . . . , µT.

Directional Variance. We define σ² as the maximum variance of any distribution in the mixture along any direction. We define σ*² as the maximum variance of any distribution in the mixture along any direction in the subspace containing the centers of the distributions. We write σ_max² for the maximum variance of the entire mixture in any direction.
σ_max² may be more than σ² due to the contribution from the separation between the centers.

Spread. We say that a unit vector v in R^n has spread S if Σ_f (v^f)² ≥ S · max_f (v^f)².

Distance. Given a subspace K of R^n and two points x, y in R^n, we write d_K(x, y) for the square of the Euclidean distance between x and y projected onto the subspace K.

The Spreading Condition and Effective Distance. The spreading condition requires that the distance between each µ_i and µ_j should not be concentrated along a few coordinates. One way to ensure this is to demand that for all i, j, the vector µ_i − µ_j has high spread. This is comparable to the slope condition used in [DHKS05]. However, we do not need such a strong condition for dealing with mixtures with imbalanced mixing weights. Our spreading condition instead demands that for each pair of centers µ_i, µ_j, the norm of the vector µ_i − µ_j remains high even if we ignore the contribution of the top few (about T log Λ) coordinates. Due to technicalities in our proofs, the number of coordinates we can ignore needs to depend (logarithmically) on this distance. We therefore define the spreading condition as follows.

We define parameters c_ij and a parameter Λ as: Λ > σ_max² T n log n / (w_min · min_{i,j} c_ij²), where c_ij is the maximum value such that there are 49 T log Λ coordinates f with |µ_i^f − µ_j^f| ≥ c_ij. We note that Λ is bounded by a polynomial in T, σ, 1/w_min, 1/c_ij and is logarithmic in n. We define c_min to be the minimum of c_ij over all pairs i, j.

Given a pair of centers i and j, let Δ_ij be the set of coordinates f such that |µ_i^f − µ_j^f| > c_ij, and let the vector ν_ij be defined as: ν_ij^f = µ_i^f − µ_j^f if f ∉ Δ_ij, and ν_ij^f = c_ij otherwise. We define d̄(µ_i, µ_j), the effective distance between µ_i and µ_j, to be the square of the L2 norm of ν_ij. In contrast, the square of the norm of the vector µ_i − µ_j is the actual distance between the centers µ_i and µ_j, and is always greater than or equal to the effective distance between µ_i and µ_j. Moreover, given i and j and a subspace K, we define d̄_K(µ_i, µ_j) as the square of the norm of the vector ν_ij projected onto the subspace K.

Under these definitions, our spreading condition requires that d̄(µ_i, µ_j) ≥ 49 c_ij² T log Λ, and our stronger spreading condition requires that every vector in C have spread 32 T log Λ.
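To make these definitions concrete, the sketch below (ours; numpy assumed) computes the spread of a vector and, for a pair of centers and given values of T and log Λ, the threshold c_ij, the capped vector ν_ij, the effective distance d̄(µ_i, µ_j), and the spreading condition, following the definitions as stated above.

    import numpy as np

    def spread(v):
        """Largest S with sum_f v_f^2 >= S * max_f v_f^2."""
        v = np.asarray(v, dtype=float)
        return np.sum(v ** 2) / np.max(v ** 2)

    def effective_distance(mu_i, mu_j, T, log_Lambda):
        """Return c_ij, nu_ij and the effective distance d-bar(mu_i, mu_j)."""
        diff = np.abs(np.asarray(mu_i, dtype=float) - np.asarray(mu_j, dtype=float))
        m = int(np.ceil(49 * T * log_Lambda))            # number of "large" coordinates
        # c_ij: largest value such that at least m coordinates satisfy |diff| >= c_ij.
        c_ij = np.sort(diff)[::-1][m - 1] if m <= diff.size else 0.0
        # nu_ij: coordinates with |diff| > c_ij are capped at c_ij, the rest are kept.
        nu = np.where(diff > c_ij, c_ij, diff)
        return c_ij, nu, float(np.sum(nu ** 2))

    def spreading_condition(mu_i, mu_j, T, log_Lambda):
        """d-bar(mu_i, mu_j) >= 49 * c_ij^2 * T * log(Lambda)."""
        c_ij, _, d_bar = effective_distance(mu_i, mu_j, T, log_Lambda)
        return d_bar >= 49 * c_ij ** 2 * T * log_Lambda

    # Example: two centers differing by 0.4 on 200 of 1000 coordinates.
    mu_i = np.r_[np.full(200, 0.9), np.full(800, 0.5)]
    mu_j = np.full(1000, 0.5)
    print(spread(mu_i - mu_j), spreading_condition(mu_i, mu_j, T=2, log_Lambda=1.0))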
A Formal Statement of our Results. Our main contribution is Algorithm CORR-CLUSTER, a correlation-based algorithm for learning mixtures of binary product distributions and axis-aligned Gaussians. The input to the algorithm is a set of samples from a mixture of distributions, and the output is a clustering of the samples. The main component of Algorithm CORR-CLUSTER is Algorithm CORR-SUBSPACE, which, given samples from a mixture of distributions, computes an approximation to the subspace containing the centers of the distributions. The motivation for approximating the latter space is as follows. In the T-dimensional subspace containing the centers of the distributions, the distance between each pair of centers µ_i and µ_j is the same as their distance in R^n; however, because of the low dimensionality, the magnitude of the noise is small. Therefore, provided the centers of the distributions are sufficiently separated, projection onto this subspace sharply separates samples from different distributions. SVD-based algorithms [VW02, AM05, KSV05] attempt to approximate this subspace by the top T singular vectors of the matrix of samples; for product distributions, our Algorithm CORR-SUBSPACE can approximate this subspace correctly under considerably less restrictive separation conditions. The properties of Algorithms CORR-SUBSPACE and CORR-CLUSTER are formally summarized in Theorem 1 and Theorem 2 respectively.

Theorem 1 (Spanning centers) Suppose we are given a mixture of distributions D = {D1, . . . , DT} with mixing weights w1, . . . , wT. Then, with at least constant probability, the subspace K of dimension at most 2T output by Algorithm CORR-SUBSPACE has the following properties.
1. If, for all i and j, d̄(µ_i, µ_j) ≥ 49 c_ij² T log Λ, then, for all pairs i, j,
   d̄_K(µ_i, µ_j) ≥ (99/100) · (d̄(µ_i, µ_j) − 49 T c_ij² log Λ).
2. If, in addition, every vector in C has spread 32 T log Λ, then, with at least constant probability, the maximum directional variance in K of any distribution Di in the mixture is at most 11σ*².
The number of samples required by Algorithm CORR-SUBSPACE is polynomial in T, n, Λ, and 1/w_min, and the algorithm runs in time polynomial in n, T, and the number of samples.

The subspace K computed by Algorithm CORR-SUBSPACE approximates the subspace containing the centers of the distributions in the sense that the distance between each pair of centers µ_i and µ_j is high along K. Theorem 1 states that Algorithm CORR-SUBSPACE computes an approximation to the subspace containing the centers of the distributions, provided the spreading condition is satisfied. If the strong spreading condition is satisfied as well, then the maximum variance of each Di along K is also close to σ*². Note that in Theorem 1 there is no absolute lower bound required on the distance between any pair of centers. This means that, so long as the spreading condition is satisfied and there are sufficiently many samples, even if the distance between the centers is not large enough for correct classification, we can compute an approximation to the subspace containing the centers of the distributions. We also note that although we show that Algorithm CORR-SUBSPACE succeeds with constant probability, we can make this probability higher at the expense of a more restrictive spreading condition, or by running the algorithm multiple times.

Theorem 2 (Clustering) Suppose we are given a mixture of distributions D = {D1, . . . , DT} with mixing weights w1, . . . , wT. Then Algorithm CORR-CLUSTER has the following properties.
1. If for all i and j, d̄(µ_i, µ_j) ≥ 49 T c_ij² log Λ, and for all i, j we have
   d̄(µ_i, µ_j) > 59 σ² T (log Λ + log n) (for axis-aligned Gaussians),
   d̄(µ_i, µ_j) > 59 T (log Λ + log n) (for binary product distributions),
   then, with probability 1 − 1/n over the samples and with constant probability over the random choices made by the algorithm, Algorithm CORR-CLUSTER computes a correct clustering of the sample points.
2. For axis-aligned Gaussians, if every vector in C has spread at least 32 T log Λ, and for all i, j,
   d̄(µ_i, µ_j) ≥ 150 σ*² T (log Λ + log n),
   then, with constant probability over the randomness in the algorithm, and with probability 1 − 1/n over the samples, Algorithm CORR-CLUSTER computes a correct clustering of the sample points.
Algorithm CORR-CLUSTER runs in time polynomial in n and the number of samples, and the number of samples required by Algorithm CORR-CLUSTER is polynomial in T, n, Λ, and 1/w_min.

We note that because we are required to do classification here, we do require an absolute lower bound on the distance between each pair of centers in Theorem 2. The second theorem follows from the first and the distance concentration lemmas of [AM05], as described in detail in Chapter 3 of [Cha07]. Those lemmas show that once the points are projected onto the subspace computed in Theorem 1, a distance-based clustering method suffices to correctly cluster the points.

A Note on the Stronger Spreading Condition. The motivation for requiring the stronger spreading condition is as follows. Our algorithm splits the coordinates randomly into two sets F and G. If C_F and C_G denote the restriction of C to the coordinates in F and G respectively, then our algorithm requires that the maximum directional variance of any distribution in the mixture over C_F and C_G be close to σ*². Notice that this does not follow from the fact that the maximum directional variance along C is σ*²: suppose C is spanned by (0.1, 0.1, 1, 1) and (0.1, 0.1, −1, 1), the variances of D1 along the axes are (10, 10, 1, 1), and F is {1, 2}. Then σ*² is about 2.8, while the variance of D1 along C_F is 10. As Lemma 9 shows, the required condition is ensured by the strong spreading condition. However, in general, the maximum directional variance of any Di in the mixture along C_F and C_G may still be close to σ*² even though the strong spreading condition is far from being met. For example, if C is the space spanned by the first T coordinate vectors e1, . . . , eT, then with probability 1 − 2^{1−T} the maximum variance along C_F and C_G is also σ*².
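The stronger spreading condition quantifies over every vector in C, but it can be checked directly from a basis of C: for a unit vector v ∈ C, the largest possible value of (v^f)² is ||P_C e_f||², so the minimum spread over C equals 1 / max_f ||P_C e_f||². The sketch below (our reformulation, not from the paper; numpy assumed, centers assumed linearly independent) uses this to test whether every vector in the span of the centers has spread at least 32 T log Λ.

    import numpy as np

    def min_spread_of_span(centers):
        """Minimum spread of any unit vector in C = span of the given center vectors.

        For unit v in C, spread(v) = 1 / max_f (v^f)^2, and the maximum of (v^f)^2
        over unit v in C equals ||P_C e_f||^2, the squared norm of the projection of
        the f-th coordinate vector onto C.  Hence min spread = 1 / max_f ||P_C e_f||^2.
        """
        M = np.asarray(centers, dtype=float).T          # n x T, columns span C
        Q, _ = np.linalg.qr(M)                          # orthonormal basis of C
        proj_norms_sq = np.sum(Q ** 2, axis=1)          # ||P_C e_f||^2 for each coordinate f
        return 1.0 / np.max(proj_norms_sq)

    def strong_spreading_holds(centers, T, log_Lambda):
        return min_spread_of_span(centers) >= 32 * T * log_Lambda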
3 Algorithm CORR-CLUSTER

Our clustering algorithm follows the same basic framework as the SVD-based algorithms of [VW02, KSV05, AM05]. The input to the algorithm is a set S of samples, and the output is a pair of clusterings of the samples according to source distribution.

CORR-CLUSTER(S)
1. Partition S into SA and SB uniformly at random.
2. Compute KA = CORR-SUBSPACE(SA) and KB = CORR-SUBSPACE(SB).
3. Project each point in SB (resp. SA) onto the subspace KA (resp. KB).
4. Use a distance-based clustering algorithm [AK01] to partition the points in SA and SB after projection.

The first step in the algorithm is to use Algorithm CORR-SUBSPACE to find an O(T)-dimensional subspace K which is an approximation to the subspace containing the centers of the distributions. Next, the samples are projected onto K and a distance-based clustering algorithm is used to find the clusters. We note that in order to preserve independence, the samples we project onto K should be distinct from the ones we use to compute K. A clustering of the complete set of points can then be computed by partitioning the samples into two sets A and B; we use A to compute KA, which is used to cluster B, and vice versa.

We now present our algorithm for computing a basis of the subspace K. With slight abuse of notation, we use K to denote the set of vectors that form the basis of the subspace K. The input to CORR-SUBSPACE is a set S of samples, and the output is a subspace K of dimension at most 2T.

Algorithm CORR-SUBSPACE:

Step 1: Initialize and Split. Initialize the basis K with the empty set of vectors. Randomly partition the coordinates into two sets, F and G, each of size n/2. Order the coordinates so that those in F come first, followed by those in G.

Step 2: Sample. Translate each sample point so that the center of mass of the set of sample points is at the origin.
Let F (respectively G) be the matrix which contains a row for each sample point and a column for each coordinate in F (respectively G). For each matrix, the entry at row x, column f is the value of the f-th coordinate of the sample point x, divided by √|S|.

Step 3: Compute Singular Space. For the matrix F^T G, compute {v1, . . . , vT}, the top T left singular vectors, {y1, . . . , yT}, the top T right singular vectors, and {λ1, . . . , λT}, the top T singular values.

Step 4: Expand Basis. For each i, we abuse notation and use v_i (y_i respectively) to denote the vector obtained by concatenating v_i with the 0 vector in n/2 dimensions (the 0 vector in n/2 dimensions concatenated with y_i, respectively). For each i, if the singular value λ_i is more than a threshold τ = w_min c_min² / O(T log² n · log Λ), we add v_i and y_i to K.

Step 5: Output. Output the set of vectors K.

The main idea behind our algorithm is to use half the coordinates to compute a subspace which approximates the subspace containing the centers, and the remaining half to validate that the subspace computed is indeed a good approximation. We critically use the coordinate-independence property of product distributions to make this validation possible.
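A compact sketch of the two procedures described above is given below (ours; numpy assumed). The threshold tau is left as a parameter, since its exact value is part of the analysis, and the final distance-based clustering step [AK01] is left to an external routine; the sketch only mirrors the structure of Steps 1-5 and of CORR-CLUSTER, not the guarantees.

    import numpy as np

    def corr_subspace(samples, T, tau, rng):
        """Steps 1-5: split coordinates, SVD of the cross-covariance F^T G,
        keep singular directions whose singular value exceeds the threshold tau."""
        S, n = samples.shape
        perm = rng.permutation(n)
        Fidx, Gidx = perm[: n // 2], perm[n // 2 :]          # Step 1: random split
        X = (samples - samples.mean(axis=0)) / np.sqrt(S)    # Step 2: center and scale
        F, G = X[:, Fidx], X[:, Gidx]
        U, sing, Vt = np.linalg.svd(F.T @ G)                 # Step 3: singular space of F^T G
        basis = []
        for i in range(min(T, sing.size)):                   # Step 4: expand basis
            if sing[i] > tau:
                v = np.zeros(n); v[Fidx] = U[:, i]
                y = np.zeros(n); y[Gidx] = Vt[i, :]
                basis.extend([v, y])
        return np.array(basis).reshape(-1, n)                # Step 5: output K (at most 2T vectors)

    def corr_cluster(samples, T, tau, seed=0):
        rng = np.random.default_rng(seed)
        half = samples.shape[0] // 2
        SA, SB = samples[:half], samples[half:]
        KA = corr_subspace(SA, T, tau, rng)                  # subspace computed from one half ...
        KB = corr_subspace(SB, T, tau, rng)
        # ... used to project the other half, which preserves independence.
        projB, projA = SB @ KA.T, SA @ KB.T
        return projA, projB   # feed these low-dimensional points to a distance-based clusterer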
4 Analysis of Algorithm CORR-CLUSTER

This section is devoted to proving Theorems 1 and 2. We use the following notation.

Notation. We write F-space (resp. G-space) for the n/2-dimensional subspace of R^n spanned by the coordinate vectors {e_f | f ∈ F} (resp. {e_g | g ∈ G}). We write C for the subspace spanned by the set of vectors µ_i. We write C_F for the space spanned by the set of vectors P_F(µ_i), and P̄_F(C_F) for the orthogonal complement of C_F in F-space. Moreover, we write C_FG for the subspace of dimension at most 2T spanned by the union of a basis of C_F and a basis of C_G.

Next, we define a key ingredient of the analysis.

Covariance Matrix. Let N be a large number. We define F̂ (resp. Ĝ), the perfect sample matrix with respect to F (resp. G), as the N × n/2 matrix whose rows (w1 + . . . + w_{i−1})N + 1 through (w1 + . . . + w_i)N are equal to the vector P_F(µ_i)/√N (resp. P_G(µ_i)/√N). For a coordinate f, let X_f be a random variable distributed as the f-th coordinate of the mixture D. As the entry in row f and column g of the matrix F̂^T Ĝ is equal to Cov(X_f, X_g), the covariance of X_f and X_g, we call the matrix F̂^T Ĝ the covariance matrix of F and G.

Proof Structure. The overall structure of our proof is as follows. First, we show that the centers of the distributions in the mixture have a high projection on the subspace of highest correlation between the coordinates. To do this, we assume in Section 4.1 that the inputs to the algorithm in Step 2 are the perfect sample matrices F̂ and Ĝ. Of course, we cannot directly feed in the matrices F̂ and Ĝ, as the values of the centers are not known in advance. We therefore show in Section 4.2 that the same conclusions hold even when the matrices F and G in Step 2 of Algorithm CORR-SUBSPACE are obtained by sampling. In Section 4.3 we combine these two results and prove Theorem 1. Finally, using results on distance concentration from [AM05, AK01], we complete the analysis by proving Theorem 2.

4.1 The Perfect Sample Matrix

The goal of this section is to prove Lemmas 3 and 7, which establish a relationship between directions of high correlation of the covariance matrix constructed from the perfect sample matrix and directions which contain a lot of separation between the centers. Lemma 3 shows that a direction which contains a lot of effective distance between some pair of centers is also a direction of high correlation. Lemma 7 shows that a direction v ∈ P̄_F(C_F), which is perpendicular to the space containing the centers, is a direction with 0 correlation. In addition, we show in Lemma 8 another property of the perfect sample matrix: the covariance matrix constructed from the perfect sample matrix has rank at most T. We conclude this section by showing, in Lemma 9, that when every vector in C has high spread, the directional variance of any distribution in the mixture along F-space or G-space is of the order of σ*².

We begin by showing that if a direction v contains a lot of the distance between the centers, then, for most ways of splitting the coordinates, the magnitude of the covariance of the mixture along the projection of v on F-space and the projection of v on G-space is high. In other words, the projections of v along F-space and G-space are directions of high correlation.

Lemma 3 Let v be any vector in C_FG such that for some i and j, d̄_v(µ_i, µ_j) ≥ 49 T c_ij² log Λ. If v_F and v_G are the normalized projections of v to F-space and G-space respectively, then, with probability at least 1 − 1/T over the splitting step, for all such v, v_F^T F̂^T Ĝ v_G ≥ τ, where τ = w_min c_ij² / O(T log² n · log Λ).

A detailed proof, presented in [Cha07], is omitted due to lack of space. However, the main ingredient of the proof is Lemma 4.

Lemma 4 Let v be a fixed vector in C such that for some i and j, d̄_v(µ_i, µ_j) ≥ 49 T c_ij² log Λ. If v_F and v_G are the projections of v to F-space and G-space respectively, then, with probability at least 1 − Λ^{−2T} over the splitting step, v_F^T F̂^T Ĝ v_G ≥ 2τ, where τ = w_min c_ij² / O(T log² n · log Λ).

The proof strategy for Lemma 4 is to show that if d̄_v(µ_i, µ_j) is large, then the matrix F̂_v^T Ĝ_v has high norm, where F̂_v (Ĝ_v respectively) is the N × n/2 matrix obtained by projecting each row of F̂ (respectively Ĝ) on v_F (respectively v_G). We require the following notation. For each coordinate f, we define a T-dimensional vector z_f as

   z_f = [√w_1 · P_v(µ_1^f − µ̄^f), . . . , √w_T · P_v(µ_T^f − µ̄^f)].

Notice that for any two coordinates f, g, we have ⟨z_f, z_g⟩ = Cov(P_v(X_f), P_v(X_g)), computed over the entire mixture. We also observe that

   Σ_f ||z_f||² = Σ_i w_i · d_v(µ_i, µ̄).

The right-hand side of this equality is the weighted sum of the squares of the distances between the centers of the distributions and the center of mass along v. By the triangle inequality, this quantity is at least 49 w_min c_ij² T log Λ. Moreover,

   v_F^T F̂_v^T Ĝ_v v_G = v_F^T F̂^T Ĝ v_G = Σ_i w_i ⟨v_F, P_{v_F}(µ_i − µ̄)⟩ ⟨v_G, P_{v_G}(µ_i − µ̄)⟩,

and for any pair of vectors x in F-space and y in G-space such that ⟨x, v_F⟩ = 0 and ⟨y, v_G⟩ = 0,

   x^T F̂_v^T Ĝ_v y = Σ_i w_i ⟨x, P_{v_F}(µ_i − µ̄)⟩ ⟨y, P_{v_G}(µ_i − µ̄)⟩ = 0.

Therefore, F̂_v^T Ĝ_v has rank at most 1.

We also need a couple of technical lemmas, Lemmas 5 and 6, stated below. The proofs of these lemmas are omitted due to lack of space, but can be found in [Cha07].

Lemma 5 Let A be a set of coordinates with cardinality more than 144 T² log Λ such that ||z_f|| is equal for each f ∈ A and Σ_{f∈A} ||z_f||² = D. Then, (1) Σ_{f,g∈A, f≠g} ⟨z_f, z_g⟩² ≥ D² / (288 T² log Λ), and (2) with probability 1 − Λ^{−2T} over the splitting of coordinates in Step 1, Σ_{f∈F∩A, g∈G∩A} ⟨z_f, z_g⟩² ≥ D² / (1152 T² log Λ).
Lemma 6 Let A be a set of coordinates such that ||z_f|| is equal for each f ∈ A and Σ_{f∈A} ||z_f||² = D. If 48 T log Λ + T < |A| ≤ 144 T² log Λ, then (1) Σ_{f,g∈A, f≠g} ⟨z_f, z_g⟩² ≥ D² / (1152 T⁴ log Λ), and (2) with probability 1 − Λ^{−2T} over the splitting in Step 1, Σ_{f∈F∩A, g∈G∩A} ⟨z_f, z_g⟩² ≥ D² / (4608 T⁴ log Λ).

Proof: (Of Lemma 4) From the definition of effective distance, if the condition d̄_v(µ_i, µ_j) > 49 c_ij² T log Λ holds, then there are at least 49 T log Λ vectors z_f with total squared norm at least 98 w_min c_ij² T log Λ. In the sequel, we scale down each vector z_f with norm greater than c_ij √w_min so that its norm is exactly c_ij √w_min. We divide the vectors into log n groups as follows: group B_k contains the vectors whose norm lies between c_ij √w_min / 2^k and c_ij √w_min / 2^{k−1}. We call a vector small if its norm is less than c_ij √w_min / (2 log n), and otherwise we call the vector big. We observe that there exists a set of vectors B with the following properties: (1) the cardinality of B is more than 49 T log Λ, (2) the total sum of squares of the norms of the vectors in B is greater than 49 T w_min c_ij² log Λ / log n, and (3) the ratio of the norms of any two vectors in B is at most 2 log n.

Case 1: Suppose there exists a group B_k of small vectors the squares of whose norms sum to a value greater than 49 T w_min c_ij² log Λ / log n. By definition, such a group has more than 49 T log Λ vectors, and the ratio of norms is at most 2.

Case 2: Otherwise, there are at least 49 T log Λ big vectors. By definition, the sum of the squares of their norms exceeds 49 T w_min c_ij² log Λ / log n. Due to the scaling, the ratio of norms is at most 2 log n.

We scale down the vectors in B so that each vector has squared norm w_min c_ij² / 2^{2k} in Case 1, and squared norm w_min c_ij² / (4 log² n) in Case 2. Due to (2) and (3), the total squared norm of the scaled vectors is at least 49 T w_min c_ij² log Λ / (4 log² n). Due to (1), we can now apply Lemmas 5 and 6 to these vectors to conclude that, for some constant a_1, with probability 1 − Λ^{−2T},

   Σ_{f∈F, g∈G} ⟨z_f, z_g⟩² ≥ a_1 · T² w_min² c_ij⁴ log² Λ / (T² log⁴ n).

The above sum is the square of the Frobenius norm |F̂_v^T Ĝ_v|_F of the matrix F̂_v^T Ĝ_v. Since F̂_v^T Ĝ_v has rank at most 1, and the maximum singular value of a rank-1 matrix is its Frobenius norm [GL96], plugging in the value of τ completes the proof.

Next we show that a vector x ∈ P̄_F(C_F) is a direction of 0 correlation. A similar statement holds for a vector y ∈ P̄_G(C_G).

Lemma 7 If at Step 2 of Algorithm CORR-SUBSPACE the values of F and G are respectively F̂ and Ĝ, and for some k the top k-th left singular vector is v_k and the corresponding singular value λ_k is more than τ, then for any vector x in P̄_F(C_F), ⟨v_k, x⟩ = 0.

Proof: We first show that for any x in P̄_F(C_F) and any y, x^T F̂^T Ĝ y = 0. Since x^T F̂^T Ĝ y = Σ_{i=1}^T w_i ⟨P_F(µ_i), x⟩ · ⟨P_G(µ_i), y⟩, and since x is in P̄_F(C_F), ⟨P_F(µ_i), x⟩ = 0 for all i, and hence x^T F̂^T Ĝ y = 0 for all x in P̄_F(C_F). We now prove the lemma by induction on k.

Base case (k = 1). Let v_1 = u_1 + x_1, where u_1 ∈ C_F and x_1 ∈ P̄_F(C_F). Let y_1 be the top right singular vector of F̂^T Ĝ, and suppose |x_1| > 0. Then v_1^T F̂^T Ĝ y_1 = u_1^T F̂^T Ĝ y_1, and u_1/|u_1| is a vector of norm 1 such that (1/|u_1|) u_1^T F̂^T Ĝ y_1 > v_1^T F̂^T Ĝ y_1, which contradicts the fact that v_1 is the top left singular vector of F̂^T Ĝ.

Inductive case. Let v_k = u_k + x_k, where u_k ∈ C_F and x_k ∈ P̄_F(C_F). Let y_k be the top k-th right singular vector of F̂^T Ĝ, and suppose |x_k| > 0. We first show that u_k is orthogonal to each of the vectors v_1, . . . , v_{k−1}. Otherwise, suppose there is some j, 1 ≤ j ≤ k − 1, such that ⟨u_k, v_j⟩ ≠ 0. Then ⟨v_k, v_j⟩ = ⟨x_k, v_j⟩ + ⟨u_k, v_j⟩ = ⟨u_k, v_j⟩ ≠ 0 (here ⟨x_k, v_j⟩ = 0 by the induction hypothesis), which contradicts the fact that v_k is a left singular vector of F̂^T Ĝ. Therefore, v_k^T F̂^T Ĝ y_k = u_k^T F̂^T Ĝ y_k, and u_k/|u_k| is a vector of norm 1, orthogonal to v_1, . . . , v_{k−1}, such that (1/|u_k|) u_k^T F̂^T Ĝ y_k > v_k^T F̂^T Ĝ y_k. This contradicts the fact that v_k is the top k-th left singular vector of F̂^T Ĝ. The lemma follows.

Lemma 8 The covariance matrix F̂^T Ĝ has rank at most T.

The proof is omitted due to space constraints.
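The two structural facts just stated are easy to check numerically on a small synthetic mixture. The sketch below (ours; numpy assumed) forms the perfect covariance matrix F̂^T Ĝ = Σ_i w_i P_F(µ_i − µ̄) P_G(µ_i − µ̄)^T directly from the centers, and verifies that its rank is at most T and that any direction orthogonal to C_F is annihilated.

    import numpy as np

    rng = np.random.default_rng(1)
    n, T = 40, 3
    w = np.array([0.2, 0.3, 0.5])                     # mixing weights
    centers = rng.normal(size=(T, n))
    centers -= w @ centers                            # translate: center of mass at the origin

    Fidx, Gidx = np.arange(0, n // 2), np.arange(n // 2, n)
    muF, muG = centers[:, Fidx], centers[:, Gidx]     # P_F(mu_i), P_G(mu_i)

    # "Perfect" covariance matrix:  sum_i w_i P_F(mu_i) P_G(mu_i)^T
    cov = muF.T @ (w[:, None] * muG)
    print(np.linalg.matrix_rank(cov) <= T)            # Lemma 8: rank at most T  -> True

    # Lemma 7: a direction orthogonal to C_F (the span of the P_F(mu_i)) has zero correlation.
    Q, _ = np.linalg.qr(muF.T)                        # orthonormal basis of C_F
    x = rng.normal(size=n // 2)
    x -= Q @ (Q.T @ x)                                # project x onto the complement of C_F
    print(np.allclose(x @ cov, 0.0))                  # x^T F_hat^T G_hat = 0  -> True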
Finally, we show that if the spread of every vector in C is high, then, with high probability over the splitting of coordinates in Step 1 of Algorithm CORR-SUBSPACE, the maximum directional variance of any distribution Di in C_F and C_G is of the order of σ*². This means that there is enough information in both F-space and G-space for correctly clustering the distributions through distance concentration.

Lemma 9 If every vector v ∈ C has spread at least 32 T log Λ, then, with constant probability over the splitting of coordinates in Step 1 of Algorithm CORR-SUBSPACE, the maximum variance along any direction in C_F or C_G is at most 5σ*².

Proof: (Of Lemma 9) Let v and v' be two unit vectors in C, and let v_F (resp. v'_F) and v_G (resp. v'_G) denote the normalized projections of v (resp. v') on F-space and G-space respectively. If ||v_F − v'_F|| ≤ ε, then the directional variance of any Di in the mixture along v_F can be written as

   E[⟨v_F, x − E[x]⟩²] = E[⟨v'_F, x − E[x]⟩²] + E[⟨v_F − v'_F, x − E[x]⟩²] + 2 E[⟨v'_F, x − E[x]⟩ ⟨v_F − v'_F, x − E[x]⟩]
                       ≤ E[⟨v'_F, x − E[x]⟩²] + 3ε σ².

Thus, the directional variance of any distribution in the mixture along v_F is at most the directional variance along v'_F, plus an additional 3εσ². Therefore, to show the lemma, it suffices to show that if v' is any vector on an ε-cover of C, for a suitably small ε, then, with high probability over the splitting of coordinates in Step 1 of Algorithm CORR-SUBSPACE, the directional variances of any Di in the mixture along v'_F and v'_G are at most 4σ*².

We show this in two steps. First, we show that for any v' in an ε-cover of C, 1/4 ≤ Σ_{f∈F} (v'^f)² ≤ 3/4. Then, we show that this condition implies that, for this vector v', the maximum directional variances along v'_F and v'_G are at most 4σ*².

Let v be any fixed unit vector in C. We first show that with probability 1 − Λ^{−2T} over the splitting of coordinates in Step 1 of Algorithm CORR-SUBSPACE, 1/4 ≤ Σ_{f∈F} (v^f)² ≤ 3/4. To show this bound, we apply the Method of Bounded Differences [PD05]. Since we split the coordinates into F and G uniformly at random, E[Σ_{f∈F} (v^f)²] = 1/2. Let γ_f be the change in Σ_{f∈F} (v^f)² when the inclusion or exclusion of coordinate f in the set F changes. Then γ_f = (v^f)² and Σ_f γ_f² = Σ_f (v^f)⁴. Since the spread of the vector v is at least 32 T log Λ, Σ_f (v^f)⁴ ≤ 1/(32 T log Λ), and from the Method of Bounded Differences,

   Pr[ |Σ_{f∈F} (v^f)² − E[Σ_{f∈F} (v^f)²]| > 1/4 ] ≤ 2 e^{−4 T log Λ} ≤ Λ^{−2T}.

By taking a union bound over all v on an ε-cover of C, we deduce that for any such v, 1/4 ≤ Σ_{f∈F} (v^f)² ≤ 3/4.

Since the maximum directional variance of any distribution Di in the mixture along C is at most σ*², we have Σ_f (v^f)² (σ_i^f)² ≤ σ*². Therefore the maximum variance along v_F, and likewise along v_G, can be bounded as

   (1/||v_F||²) Σ_{f∈F} (v^f)² (σ_i^f)² ≤ (1/||v_F||²) Σ_f (v^f)² (σ_i^f)² ≤ 4σ*².

The lemma follows.
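The first step of the proof above, that a random split sends roughly half of the squared mass of any high-spread unit vector to F, is easy to observe empirically. The sketch below (ours; numpy assumed) estimates, over many random splits, how often Σ_{f∈F}(v^f)² leaves the interval [1/4, 3/4] for a generic unit vector, whose spread is large.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4096
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)                          # unit vector; for a generic direction the
    spread = 1.0 / np.max(v ** 2)                   # spread is large (roughly n / (2 ln n))

    trials, bad = 10000, 0
    for _ in range(trials):
        mask = np.zeros(n, dtype=bool)
        mask[rng.permutation(n)[: n // 2]] = True   # random half F of the coordinates
        massF = np.sum(v[mask] ** 2)
        bad += not (0.25 <= massF <= 0.75)

    print(f"spread = {spread:.0f}, fraction of bad splits = {bad / trials:.4f}")
    # With high spread, the deviation of massF from 1/2 is far smaller than 1/4,
    # so the printed fraction is 0, matching the Method of Bounded Differences bound.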
4.2 Working with Real Samples

In this section, we show that, given sufficiently many samples, the properties of the matrix F^T G, where F and G are the matrices generated by sampling in Step 2 of Algorithm CORR-CLUSTER, are very close to the properties of the matrix F̂^T Ĝ. The lemmas are stated below; the proofs are omitted due to space constraints, but can be found in [Cha07]. The proofs use the Method of Bounded Differences (when the input is a mixture of binary product distributions) and the Gaussian Concentration of Measure Inequality (for axis-aligned Gaussians). The central lemma of this section is Lemma 10, which shows that, if there are sufficiently many samples, then for any set of 2m vectors {v1, . . . , vm} and {y1, . . . , ym}, the quantities Σ_k v_k^T F^T G y_k and Σ_k v_k^T F̂^T Ĝ y_k are very close. This lemma is then used to prove Lemmas 11 and 12. Lemma 11 shows that the top few singular vectors of F^T G output by Algorithm CORR-SUBSPACE have very low projection on P̄_F(C_F) or P̄_G(C_G). Lemma 12 shows that the rank of the matrix F^T G is almost T, in the sense that the (T+1)-th singular value of this matrix is very low.

Lemma 10 Let U = {u1, . . . , um} and Y = {y1, . . . , ym} be any two sets of orthonormal vectors, and let F and G be the matrices generated by sampling in Step 2 of the algorithm. If the number of samples |S| is greater than Ω(m³ n log n log(σ_max/ε) / ε²) (for binary product distributions), and Ω(max(a_1, a_2)) (for axis-aligned Gaussians), where a_1 = σ_max² m⁴ n² log n log²(σ_max/ε) / ε⁴ and a_2 = σ_max² m³ n log n log(σ_max/ε) / ε², then, with probability at least 1 − 1/n,

   Σ_k |u_k^T (F^T G − E[F^T G]) y_k| ≤ ε.

Lemma 11 Let F and G be the matrices generated by sampling in Step 2 of the algorithm, and let v1, . . . , vm be the vectors output by the algorithm in Step 4. If the number of samples |S| is greater than Ω(m³ n log n log(Λ/φ) / (φ² τ⁴)) (for binary product distributions), and Ω(max(a_1, a_2)) (for axis-aligned Gaussians), where a_1 = σ_max⁴ m⁴ n² log² n log²(Λ/φ) / (φ² τ⁴) and a_2 = σ_max² m³ n² log n (log Λ + log(1/φ)) / (φ² τ⁴), then, for each k, and any x in P̄_F(C_F), ⟨v_k, x⟩ ≤ φ.

Lemma 12 Let F and G be the matrices generated by sampling in Step 2 of Algorithm CORR-SUBSPACE. If the number of samples |S| is greater than Ω(T³ n log n log Λ / τ²) (for binary product distributions), and Ω(max(σ_max⁴ T⁴ n² log² n log² Λ / τ⁴, σ_max² T³ n log n log Λ / τ²)) (for axis-aligned Gaussians), then λ_{T+1}, the (T+1)-th singular value of the matrix F^T G, is at most τ/8.

4.3 The Combined Analysis

In this section, we combine the lemmas proved in Sections 4.1 and 4.2 to prove Theorem 1. We begin with a lemma which shows that if every vector in C has spread 32 T log Λ, then the maximum directional variance in K, the space output by Algorithm CORR-SUBSPACE, is at most 11σ*².

Lemma 13 Let K be the subspace output by the algorithm, and let v be any vector in K. If every vector in C has spread 32 T log Λ, and the number of samples |S| is greater than Ω(max(σ_max⁴ T⁴ n² log² n log² Λ / τ⁴, 64 σ_max² T³ n log n log Λ / τ⁴)), then, for any i, the maximum variance of Di along v is at most 11σ*².

The proof is omitted due to space constraints, and can be found in [Cha07]. The above lemmas are now combined to prove Theorem 1.

Proof: (Of Theorem 1) Suppose K = K_L ∪ K_R, where K_L = {v1, . . . , vm} are the top m left singular vectors of F^T G and K_R = {y1, . . . , ym} are the corresponding right singular vectors. We abuse notation and use v_k to denote the vector v_k concatenated with a vector consisting of n/2 zeros, and y_k to denote the vector consisting of n/2 zeros concatenated with y_k.
Moreover, we use K, K_L, and K_R interchangeably to denote sets of vectors and the subspaces spanned by those sets of vectors.

We show that, with probability at least 1 − 1/T over the splitting step, there exists no vector v ∈ C_FG such that (1) v is orthogonal to the space spanned by the vectors in K, and (2) there exists some pair of centers i and j such that d̄_v(µ_i, µ_j) > 49 T c_ij² log Λ. For contradiction, suppose there exists such a vector v. Then, if v_F and v_G denote the normalized projections of v onto F-space and G-space respectively, from Lemma 3, v_F^T F̂^T Ĝ v_G ≥ τ with probability at least 1 − 1/T over the splitting step. From Lemma 10, if the number of samples |S| is greater than Ω(m³ n² log n log Λ / τ²) for binary product distributions, and greater than Ω(max(σ_max⁴ m⁴ n² log² n log² Λ / τ⁴, σ_max² m² n log n log Λ / τ²)) for axis-aligned Gaussians, then v_F^T F^T G v_G ≥ τ/2 with at least constant probability. Since v is orthogonal to the space spanned by K, v_F is orthogonal to K_L and v_G is orthogonal to K_R. As λ_{m+1} is the maximum value of x^T F^T G y over all vectors x orthogonal to K_L and y orthogonal to K_R, λ_{m+1} ≥ τ/2, which is a contradiction. Moreover, from Lemma 12, λ_{T+1} < τ/8, and hence m ≤ T.

Let us construct an orthonormal series of vectors v'_1, . . . , v'_m, . . . which are almost in C_F, as follows. Here v_1, . . . , v_m are the vectors output by Algorithm CORR-SUBSPACE, and we write each v_k as v_k = u_k + x_k, where u_k ∈ C_F and x_k ∈ P̄_F(C_F). We define v'_l inductively: let u'_l be a unit vector in C_F which is perpendicular to u_1, . . . , u_{l−1}, and set v'_l = u'_l. By definition, this vector is orthogonal to u_1, . . . , u_{l−1}. In addition, for any k < l, ⟨v'_l, v_k⟩ = ⟨u'_l, u_k⟩ + ⟨u'_l, x_k⟩ = 0, so v'_l is also orthogonal to v_1, . . . , v_{l−1}. Moreover, if φ < 1/(100T), then u_1, . . . , u_m are linearly independent, and we can always find dim(C_F) such vectors. Similarly, we construct a set of vectors y'_1, y'_2, . . .. Let us call the combined set of vectors C'.

We now show that if there are sufficiently many samples, then d_{C̄'}(µ_i, µ_j) ≤ (1/100) d̄(µ_i, µ_j). Note that for any unit vector v̄ in C̄' and any unit vector x in C_FG, ⟨v̄, x⟩ ≤ mφ. Also note that for any u_k and u_l with k ≠ l, |⟨u_k, u_l⟩| ≤ 2φ, and ||u_k||² ≥ 1 − φ². Let v = Σ_k α_k u_k be any unit vector in C_FG. Then,

   1 = ||v||² = Σ_{k,l} α_k α_l ⟨u_k, u_l⟩ ≥ Σ_k α_k² ||u_k||² − O(T² φ).

The projection of v on C' can be written as

   Σ_k ⟨v, v_k⟩² = Σ_k ⟨v, u_k⟩² = Σ_k ( Σ_l α_l ⟨u_k, u_l⟩ )² ≥ Σ_k α_k² ||u_k||⁴ − O(T³ φ) ≥ 1 − O(T² φ) − O(T³ φ),

where the last step follows because for each k, ||u_k||² ≥ 1 − φ². If the number of samples |S| is greater than Ω((100T)² m³ n log n (log Λ + log 100T) / τ⁴) (for binary product distributions), and greater than Ω(max((100T)² σ_max⁴ m⁴ n² log² n log²(100T) / τ⁴, (100T)² σ_max² m³ n log n log(100T) / τ⁴)) (for axis-aligned Gaussians), then φ < 1/(100T), and therefore

   d_{C̄'}(µ_i, µ_j) ≤ (1/100) d̄(µ_i, µ_j).

For any i and j,

   d̄(µ_i, µ_j) = d̄_K(µ_i, µ_j) + d̄_{C'\K}(µ_i, µ_j) + d̄_{C̄'}(µ_i, µ_j).

Since the vectors v'_{m+1}, . . . and y'_{m+1}, . . . all belong to C_FG (as well as to C' \ K), there exists no v ∈ C' \ K satisfying Conditions (1) and (2) of the previous paragraph, and hence d̄_{C_FG \ K}(µ_i, µ_j) ≤ 49 T c_ij² log Λ. That is, the actual distance between µ_i and µ_j in C_FG \ K (as well as in C' \ K) is at most the contribution to d̄(µ_i, µ_j) from the top 49 T log Λ coordinates, and the contribution to d̄(µ_i, µ_j) from K and C̄' is at least the contribution from the rest of the coordinates. Since d_{C̄'}(µ_i, µ_j) ≤ (1/100) d̄(µ_i, µ_j), the distance between µ_i and µ_j in K is at least (99/100) d̄(µ_i, µ_j) − 49 T c_ij² log Λ. The first part of the theorem follows.
The second part of the theorem follows directly from Lemma 13.

References

[AK01] S. Arora and R. Kannan. Learning mixtures of arbitrary gaussians. In Proceedings of the 33rd ACM Symposium on Theory of Computing, pages 247-257, 2001.
[AM05] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory, pages 458-469, 2005.
[AT98] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Conference on Learning Theory, 1998.
[BNJ03] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, (3):993-1022, January 2003.
[Cha07] K. Chaudhuri. Learning Mixtures of Distributions. PhD thesis, University of California, Berkeley, 2007. UCB/EECS-2007-124.
[CHRZ07] K. Chaudhuri, E. Halperin, S. Rao, and S. Zhou. A rigorous analysis of population stratification with limited data. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2007.
[Das99] S. Dasgupta. Learning mixtures of gaussians. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, pages 634-644, 1999.
[DHKS05] A. Dasgupta, J. Hopcroft, J. Kleinberg, and M. Sandler. On learning mixtures of heavy-tailed distributions. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science, pages 491-500, 2005.
[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm (with discussion). Journal of the Royal Statistical Society B, 39, pages 1-38, 1977.
[DS00] S. Dasgupta and L. Schulman. A two-round variant of em for gaussian mixtures. In Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI), 2000.
[FM99] Y. Freund and Y. Mansour. Estimating a mixture of two product distributions. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers, 1999.
[FOS05] J. Feldman, R. O'Donnell, and R. Servedio. Learning mixtures of product distributions over discrete domains. In Proceedings of FOCS, 2005.
[FOS06] J. Feldman, R. O'Donnell, and R. Servedio. Learning mixtures of gaussians with no separation assumptions. In Proceedings of COLT, 2006.
[GL96] G. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
[KF07] S. Kakade and D. Foster. Multi-view regression via canonical correlation analysis. In Proceedings of the Conference on Learning Theory, 2007.
[KS04] J. M. Kleinberg and M. Sandler. Using mixture models for collaborative filtering. In STOC, pages 569-578, 2004.
[KSV05] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.
[Llo82] S. P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 1982.
[MB88] G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, 1988.
[PD05] A. Panconesi and D. Dubhashi. Concentration of measure for the analysis of randomised algorithms. Draft, 2005.
[PFK02] C. Pal, B. Frey, and T. Kristjansson. Noise robust speech recognition using Gaussian basis functions for non-linear likelihood function approximation. In ICASSP '02: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I-405-I-408, 2002.
[PSD00] J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:954-959, June 2000.
[Rey95] D. Reynolds. Speaker identification and verification using gaussian mixture speaker models. Speech Communications, 1995.
[SRH07] S. Sridhar, S. Rao, and E. Halperin. An efficient and accurate graph-based approach to detect population substructure. In RECOMB, 2007.
[TSM85] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.
[VW02] V. Vempala and G. Wang. A spectral algorithm of learning mixtures of distributions. In Proceedings of the 43rd IEEE Symposium on Foundations of Computer Science, pages 113-123, 2002.