Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators

Tao Shi (taoshi@stat.osu.edu), Department of Statistics, Ohio State University
Mikhail Belkin (mbelkin@cse.osu.edu), Department of Computer Science and Engineering, Ohio State University
Bin Yu (binyu@stat.berkeley.edu), Department of Statistics, University of California, Berkeley

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

In this paper we develop a spectral framework for estimating mixture distributions, specifically Gaussian mixture models. In physics, spectroscopy is often used for the identification of substances through their spectrum. Treating a kernel function $K(x, y)$ as "light" and the sampled data as "substance", the spectrum of their interaction (eigenvalues and eigenvectors of the kernel matrix $K$) unveils certain aspects of the underlying parametric distribution $p$, such as the parameters of a Gaussian mixture. Our approach extends the intuitions and analyses underlying the existing spectral techniques, such as spectral clustering and Kernel Principal Components Analysis (KPCA). We construct algorithms to estimate parameters of Gaussian mixture models, including the number of mixture components, their means and covariance matrices, which are important in many practical applications. We provide a theoretical framework and show encouraging experimental results.

1. Introduction

Gaussian mixture models are a powerful tool for various tasks of data analysis, modeling and exploration. The basic problem is to estimate the parameters of a Gaussian mixture distribution $p(x) = \sum_{g=1}^G \pi_g p_g(x)$ from sampled data $x_1, \ldots, x_n \in \mathbb{R}^d$, where the mixture component $p_g = N(\mu_g, \Sigma_g)$ has mean $\mu_g$ and covariance matrix $\Sigma_g$, $g = 1, \ldots, G$.

Gaussian mixture models are used in a broad range of scientific and engineering applications, including computer vision, speech recognition, and many other areas. However, the effectiveness of such modeling hinges on choosing the right parameters for the mixture distribution. The problem of parameter selection for mixture models has a long history, going back to the work of Pearson (1894, [9]), who introduced the Method of Moments and applied it to the study of a population of Naples crabs, deducing the existence of two subspecies within the population.

The most commonly used method for parameter estimation is Maximum Likelihood Estimation (MLE), which chooses the parameters so as to maximize the likelihood of the observed data given the model. In modern practice this is most commonly done through the iterative optimization technique known as the Expectation Maximization (EM) algorithm ([3]), which is typically initialized using k-means clustering. Recently, significant progress on understanding theoretical issues surrounding learning mixture distributions and EM has been made in theoretical computer science, e.g., [2, 4].

Another set of methods for inferring mixture distributions is based on Bayesian inference, which relies on a prior distribution over the parameters of the model. In recent literature ([7]), Dirichlet process mixture models have been used to produce a posterior distribution for the parameters of a mixture model; the inference procedure applies Markov Chain Monte Carlo to draw samples from the posterior distribution.
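The estimation problem above can be made concrete with a minimal numpy sketch that draws a sample from such a Gaussian mixture; the parameter values below are purely illustrative and not part of the development that follows:

    import numpy as np

    def sample_gaussian_mixture(n, weights, means, covs, seed=0):
        """Draw n points from p(x) = sum_g pi_g N(mu_g, Sigma_g)."""
        rng = np.random.default_rng(seed)
        weights = np.asarray(weights, dtype=float)
        means = [np.atleast_1d(m) for m in means]
        covs = [np.atleast_2d(c) for c in covs]
        # Pick a component for every point, then draw from that Gaussian.
        labels = rng.choice(len(weights), size=n, p=weights)
        X = np.empty((n, means[0].shape[0]))
        for g in range(len(weights)):
            idx = labels == g
            X[idx] = rng.multivariate_normal(means[g], covs[g], size=int(idx.sum()))
        return X, labels

    # Example: an unbalanced univariate mixture, 0.9 N(-3, 1^2) + 0.1 N(0, 0.3^2).
    X, z = sample_gaussian_mixture(1000, [0.9, 0.1], [[-3.0], [0.0]],
                                   [[[1.0]], [[0.09]]])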
In this paper we propose a new method for estimating the parameters of a mixture distribution, which is closely related to non-parametric spectral methods such as spectral clustering (e.g., [8]) and Kernel Principal Components Analysis [11]. Those methods, as well as certain methods in manifold learning (e.g., [1]), construct a kernel matrix or a graph Laplacian matrix associated to a data set. The eigenvectors and eigenvalues of that matrix can then be used to study the structure of the data set. For example, in spectral clustering the presence of a small non-zero eigenvalue indicates the presence of clusters, while the corresponding eigenvector shows how the data set should be split. In particular, we note the work [12], where the authors analyze the dependence of the spectrum on the input density distribution in the context of classification and argue that lower eigenfunctions can be truncated without sacrificing classification accuracy.

We will develop the intuitions and analyses underlying these methods and take them a step further by offering a framework that can be applied to analyzing parametric families, in particular mixtures of Gaussian distributions.

We would like to study mixture distributions by building explicit connections between their parameters and the spectral properties of the corresponding kernel matrices. More specifically, we construct a family of probability-dependent operators and build estimators by matching eigenvalues and eigenfunctions of the operator associated to a probability distribution to those of the matrix associated to a data sample. Thus, given a mixture distribution $p(x) = \sum_{g=1}^G \pi_g p_g(x)$, we use a Gaussian kernel $K(x, y) = e^{-\frac{\|x-y\|^2}{2\omega^2}}$ to construct the integral operator

\[ G_p f(y) = \int e^{-\frac{\|x-y\|^2}{2\omega^2}} f(x)\, p(x)\, dx, \]

which will be the principal object of this paper. Our framework will rely on three key observations about the spectral properties of this operator and its connection to the sampled data.

Observation 1. (Single component) For the Gaussian distribution $p = N(\mu, \Sigma)$, we can analytically express the eigenfunctions and eigenvalues of $G_p$ in terms of the mean $\mu$ and the covariance $\Sigma$. This allows us to reverse this dependence and explicitly express $\mu$ and $\Sigma$ in terms of the spectral properties of $G_p$.

Observation 2. (Mixture of components) Let $p$ be a mixture distribution $p(x) = \sum_{g=1}^G \pi_g p_g(x)$. Note that by linearity

\[ G_p f(y) = \sum_{g=1}^G \pi_g \int e^{-\frac{\|x-y\|^2}{2\omega^2}} f(x)\, p_g(x)\, dx = \sum_{g=1}^G \pi_g\, G_{p_g} f(y). \]

It can be seen (Theorem 1) that given enough separation between the mixture components, the top eigenfunctions of the individual components $G_{p_g}$ are approximated by top eigenfunctions of $G_p$. That will allow us to connect eigenfunctions/eigenvalues of the mixture to eigenfunctions/eigenvalues of the individual components. A specific example of this is given in Fig. 2, which will be discussed in detail in Section 4.

Observation 3. (Estimation from data) The eigenfunctions and eigenvalues of $G_p$ can be approximated, given data sampled from $p(x)$, by the eigenvectors and eigenvalues of empirical kernel matrices.

To highlight the effectiveness of our methodology, consider the distribution in Fig. 1, which shows the density of a mixture of two normal distributions, $p = 0.9\, N(-3, 1^2) + 0.1\, N(0, 0.3^2)$, together with a histogram of 1000 sampled points. From Table 1 we see that the spectroscopic estimator has no difficulty providing reasonably accurate estimates of the mixing coefficients $\pi_1, \pi_2$, means $\mu_1, \mu_2$ and variances $\sigma_1, \sigma_2$ of each component, despite the fact that the mixture is unbalanced. We also see that these estimates can be further improved by using the spectroscopic estimate to initialize EM. We note that, while EM is a computationally efficient and algorithmically attractive method, it is a local optimization procedure, and the quality of the achieved maximum and the accuracy of the resulting estimate are sensitive to initialization (see, e.g., [10]). If the initial value happens to be close to the global maximum, fast convergence can be guaranteed. However, finding such "lucky" regions of the parameter space may be nontrivial.
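Observation 3 is easy to explore numerically. The following sketch (illustrative only; the bandwidth $\omega = 0.3$ and the regenerated sample are assumptions, not the paper's actual data) builds the empirical kernel matrix $K_n$ for a sample from the mixture of Fig. 1 and computes its leading eigenvalues and eigenvectors, the objects that the algorithms below operate on:

    import numpy as np

    def kernel_matrix(X, omega):
        """Empirical version of G_p: (K_n)_{ij} = (1/n) exp(-||x_i - x_j||^2 / (2 omega^2))."""
        X = np.atleast_2d(X)
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * omega ** 2)) / X.shape[0]

    rng = np.random.default_rng(0)
    n = 1000
    from_big = rng.random(n) < 0.9          # component indicator for the 0.9 / 0.1 mixture
    x = np.where(from_big, rng.normal(-3.0, 1.0, n), rng.normal(0.0, 0.3, n))

    K = kernel_matrix(x[:, None], omega=0.3)
    vals, vecs = np.linalg.eigh(K)          # K is symmetric, so eigh applies (ascending order)
    vals, vecs = vals[::-1], vecs[:, ::-1]  # reorder to descending

    # The leading eigenvectors localize on the individual mixture components
    # (compare Fig. 2); the algorithms below read the parameters off these objects.
    print(vals[:5])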
To emphasize that point, consider the bottom two rows of Table 1, which show the results of k-means clustering (k = 2) and of EM initialized by k-means. We see that k-means consistently provides a poor starting point, since the energy-minimizing configuration splits the large component and ignores the small one. EM, initialized with k-means, stays at a local maximum and cannot provide an accurate estimate for the mixture. On the other hand, EM initialized with our method converges to the correct solution.

We should note that our method requires sufficient separation between the components to provide accurate results. However, there does not exist a computationally feasible method for estimating parameters of a mixture distribution in several dimensions without a separation assumption.

The rest of the paper is structured as follows: in Section 2, we describe our approach in the simplest setting of a one-dimensional component in $\mathbb{R}$. In Section 3, we analyze a single component in $\mathbb{R}^d$. In Section 4, we deal with the general case of a mixture distribution and state a basic theoretical result for the mixture. In Section 5, we show experimental results on a simulated mixture distribution with three components in $\mathbb{R}^5$ and on the USPS handwritten digit dataset. We conclude in Section 6.

Figure 1. Histogram of 1000 data points sampled from $0.9\,N(-3, 1^2) + 0.1\,N(0, 0.3^2)$ and the mixture density (red line).

                                 pi_1         pi_2         mu_1          mu_2          sigma_1      sigma_2
  True parameters                0.9          0.1          -3             0            1            0.3
  Spectroscopic Estimator        0.86 (0.01)  0.14 (0.01)  -2.98 (0.23)  -0.02 (0.08)  1.12 (0.54)  0.34 (0.10)
  EM [SE initialization]         0.90 (0.01)  0.10 (0.01)  -3.01 (0.04)   0.00 (0.03)  1.00 (0.03)  0.30 (0.02)
  k-means [random samples]       0.68 (0.03)  0.32 (0.03)  -3.42 (0.06)  -1.17 (0.16)  0.74 (0.03)  0.90 (0.03)
  EM [k-means initialization]    0.78 (0.07)  0.22 (0.07)  -3.17 (0.09)  -0.93 (0.56)  0.92 (0.05)  0.95 (0.39)

Table 1. Mixture Gaussian parameters and the corresponding estimators from Spectroscopic Estimation (SE), EM initialized by SE, k-means (random initialization), and EM initialized by k-means. The mean and the standard deviation (in parentheses) of each estimator over 50 runs are shown.
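The comparison of initializations in Table 1 can be reproduced in outline with scikit-learn; this is a sketch under the assumption that sklearn is available, and the spectroscopic values plugged in below are simply the Table 1 estimates, standing in for the output of Algorithm 1:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    n = 1000
    big = rng.random(n) < 0.9
    X = np.where(big, rng.normal(-3.0, 1.0, n), rng.normal(0.0, 0.3, n))[:, None]

    # Spectroscopic estimates used as a starting point (values as reported in Table 1).
    w0 = np.array([0.86, 0.14])
    m0 = np.array([[-2.98], [-0.02]])
    s0 = np.array([1.12, 0.34]) ** 2

    em_se = GaussianMixture(n_components=2, weights_init=w0, means_init=m0,
                            precisions_init=(1.0 / s0).reshape(2, 1, 1)).fit(X)
    em_km = GaussianMixture(n_components=2, init_params="kmeans",
                            random_state=0).fit(X)

    for name, gm in [("EM from SE", em_se), ("EM from k-means", em_km)]:
        print(name, gm.weights_.round(2), gm.means_.ravel().round(2),
              np.sqrt(gm.covariances_).ravel().round(2))

The point of the sketch is only the mechanism: EM is run twice on the same sample, once from a spectroscopic starting point and once from the default k-means-based initialization.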
2. Setting Up the Framework: Single Component in R

We start the discussion by demonstrating the basis of our approach on the problem of estimating the parameters of a single univariate Gaussian distribution $p(x) = N(\mu, \sigma^2)$. We first establish a connection between the eigenfunctions and eigenvalues of the convolution operator

\[ G_p f(y) = \int_{\mathbb{R}} e^{-\frac{(x-y)^2}{2\omega^2}} f(x)\, p(x)\, dx \]

and the parameters $\mu$ and $\sigma$, and then show how these parameters can be estimated from sampled data. We will need the following

Proposition 1 (Refinement of a result in [13]) Let $\beta = 2\sigma^2/\omega^2$ and let $H_i(x)$ be the $i$-th order Hermite polynomial. Then the eigenvalues and eigenfunctions of $G_p$, for $i = 0, 1, \ldots$, are given by

\[ \lambda_i = \sqrt{\frac{2}{1 + \beta + \sqrt{1 + 2\beta}}} \left( \frac{\beta}{1 + \beta + \sqrt{1 + 2\beta}} \right)^i \qquad (1) \]

\[ \phi_i(x) = \frac{(1 + 2\beta)^{1/8}}{\sqrt{2^i\, i!}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \cdot \frac{\sqrt{1 + 2\beta} - 1}{2} \right) H_i\!\left( \left( \tfrac{1}{4} + \tfrac{\beta}{2} \right)^{1/4} \frac{x - \mu}{\sigma} \right) \qquad (2) \]

Since $H_0(x) = 1$, putting $C = (1 + 2\beta)^{1/8}$ we obtain

\[ \phi_0(x) = C \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \cdot \frac{\sqrt{1 + 4\sigma^2/\omega^2} - 1}{2} \right). \qquad (3) \]

We observe that the maximum value of $|\phi_0(x)|$ is attained at the mean of the distribution, hence $\mu = \operatorname{argmax}_x |\phi_0(x)|$. We also observe that $\lambda_0 = \left( \tfrac{1}{2} + \tfrac{\sigma^2}{\omega^2} + \tfrac{1}{2}\sqrt{1 + 4\sigma^2/\omega^2} \right)^{-1/2}$. Taking $r = \lambda_1/\lambda_0$, we derive

\[ \sigma^2 = \frac{\omega^2 r}{(1 - r)^2}. \qquad (4) \]

Thus we have established an explicit connection between the spectral properties of $G_p$ and the parameters of $p(x)$. We now present Algorithm 1 for estimating $\mu$ and $\sigma^2$ from a sample $x_1, \ldots, x_n$ from $p(x)$.

· Step 1. Construct the kernel matrix $K_n$, $(K_n)_{ij} = \frac{1}{n} e^{-\frac{(x_i - x_j)^2}{2\omega^2}}$. $K_n$ serves as the empirical version of the operator $G_p$. Compute the top eigenvector $v_0$ of $K_n$ and the top two eigenvalues $\lambda_0(K_n)$, $\lambda_1(K_n)$.

· Step 2. Construct estimators $\hat\mu$ and $\hat\sigma^2$ for the mean and variance as follows:
\[ \hat\mu = x_k, \quad k = \operatorname{argmax}_i |(v_0)_i|, \qquad \hat\sigma^2 = \frac{\omega^2 \hat r}{(1 - \hat r)^2}, \quad \text{where } \hat r = \frac{\lambda_1(K_n)}{\lambda_0(K_n)}. \]

These estimators are constructed by substituting the top eigenvector of $K_n$ for the top eigenfunction of $G_p$ and the eigenvalues of $K_n$ for the corresponding eigenvalues of $G_p$. It is well known (e.g., [6]) that the eigenvectors and eigenvalues of $K_n$ approximate and converge to the eigenfunctions and eigenvalues of $G_p$ at the rate $1/\sqrt{n}$ as $n \to \infty$, which implies consistency of the estimators. The accuracy of $\hat\mu$ and $\hat\sigma^2$ depends on how well the empirical operator $K_n$ approximates the underlying operator $G_p$.

Table 2 reports the average and the standard deviation of our spectroscopic estimators $(\hat\mu, \hat\sigma)$ and of the standard estimators $(\bar x, s)$ over one hundred repetitions of the simulation; in each run the estimators are computed from 1000 i.i.d. samples of $N(0, 1)$. We see that our spectroscopic estimators are comparable to the standard estimators of the mean and variance of a single Gaussian.

  Actual value   SE (mu_hat, sigma_hat)   Std Est (x_bar, s)
  mu = 0         0.000 (0.014)            0.002 (0.011)
  sigma = 1      1.005 (0.012)            1.001 (0.007)

Table 2. Average (standard deviation) of the spectroscopic estimator SE $(\hat\mu, \hat\sigma)$ and the standard estimator Std Est $(\bar x, s)$ over 100 simulation runs. In each run, estimators are calculated from 1000 i.i.d. samples of $N(0, 1)$.
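A direct numpy transcription of Algorithm 1 might look as follows; this is a minimal sketch in which the bandwidth $\omega$ is assumed to be supplied by the user and the $n \times n$ kernel matrix is formed explicitly, with no attempt at numerical or memory efficiency:

    import numpy as np

    def spectroscopic_estimate_1d(x, omega):
        """Algorithm 1 (sketch): estimate mu and sigma^2 of a univariate Gaussian
        from the top of the spectrum of the empirical kernel matrix K_n."""
        x = np.asarray(x, dtype=float).ravel()
        n = x.shape[0]
        K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * omega ** 2)) / n
        vals, vecs = np.linalg.eigh(K)                # ascending order
        lam0, lam1 = vals[-1], vals[-2]               # top two eigenvalues
        v0 = vecs[:, -1]                              # top eigenvector
        mu_hat = x[np.argmax(np.abs(v0))]             # Step 2: mu_hat = x_k, k = argmax |(v_0)_i|
        r = lam1 / lam0
        sigma2_hat = omega ** 2 * r / (1.0 - r) ** 2  # Eq. (4)
        return mu_hat, sigma2_hat

    # Example: recover the parameters of N(0, 1) from 1000 samples.
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 1000)
    print(spectroscopic_estimate_1d(x, omega=1.0))    # omega = 1.0 is an assumed choice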
3. Setting Up the Framework: Single Component in R^d

In this section we extend our framework to estimating a single multivariate Gaussian $p = N(\mu, \Sigma)$ in $\mathbb{R}^d$. Let $\Sigma = \sum_{i=1}^d \sigma_i^2\, u_i u_i^T$ be the spectral decomposition of the covariance matrix $\Sigma$. As before, we put

\[ G_p f(x) = \int_{\mathbb{R}^d} e^{-\frac{\|x-y\|^2}{2\omega^2}} f(y)\, p(y)\, dy. \]

Since the kernel $e^{-\frac{\|x-y\|^2}{2\omega^2}}$ is invariant under rotations, it follows that the operator $G_p$ can be decomposed as $G_p = \bigotimes_{i=1}^d G_{p_i}$, where $p_i$ is a one-dimensional Gaussian with variance $\sigma_i^2$ and mean $\langle \mu, u_i \rangle$ along the direction of $u_i$. It is easy to see that, given two operators $F$ and $H$, the spectrum of their direct product $F \otimes H$ consists of the pairwise products $\lambda\mu$, where $\lambda$ and $\mu$ are their respective eigenvalues; the corresponding eigenfunction of the product is $e_{[\lambda,\mu]}(x, y) = e_\lambda(x)\, e_\mu(y)$. Applying this result, we see that the eigenvalues and eigenfunctions of $G_p$ can be written as products

\[ \lambda_{[i_1, \ldots, i_d]}(G_p) = \prod_{j=1}^d \lambda_{i_j}(G_{p_j}), \qquad \phi_{[i_1, \ldots, i_d]}(G_p)(x) = \prod_{j=1}^d \phi_{i_j}(G_{p_j})(\langle x, u_j \rangle), \]

where $[i_1, \ldots, i_d]$ is a multi-index over all components. It can be seen that $\phi_{[0,\ldots,0]}$ is (up to a scaling factor) a Gaussian with the same mean $\mu$ as the original distribution $p(x)$. Thus $\mu$ can be estimated as the point where $\phi_{[0,\ldots,0]}$ attains its maximum value, in the same way as for one-dimensional distributions.

Consider now $\phi_I$, where $I = [0, \ldots, 0, 1, 0, \ldots, 0]$ with the 1 in the $i$-th position. Since $H_1(x) = 2x$, Eq. (2) implies that $\phi_I(x)/\phi_{[0,\ldots,0]}(x)$ is a linear function of $x$ with gradient pointing in the direction of $u_i$. That allows us to estimate the principal directions. The resulting Algorithm 2 for estimating $\mu$ and $\Sigma$ is presented below.

· Step 1. Construct the kernel matrix $K_n$, $(K_n)_{st} = \frac{1}{n} e^{-\frac{\|x_s - x_t\|^2}{2\omega^2}}$. $K_n$ serves as the empirical version of the operator $G_p$. Compute the eigenvalues $\lambda(K_n)$ and eigenvectors $v(K_n)$ of $K_n$. Denote the top eigenvector by $v_0$ and the corresponding eigenvalue by $\lambda_0$.

· Step 2. Identify the eigenvectors $v_i$, $v_i \neq v_0$, $i = 1, \ldots, d$, such that the values of $v_i/v_0$ are approximately linear in $x$, that is,
\[ \frac{v_i(x_s)}{v_0(x_s)} \approx a^T x_s + b, \qquad a \in \mathbb{R}^d,\ b \in \mathbb{R}. \]
The corresponding principal direction $u_i$ is estimated by $\hat u_i = \frac{a}{\|a\|}$. Let the corresponding eigenvalue be $\lambda_i$.

· Step 3. Construct estimators $\hat\mu$ and $\hat\Sigma$ for the mean and covariance as follows:
\[ \hat\mu = x_k, \quad k = \operatorname{argmax}_i |(v_0)_i|, \qquad \hat\Sigma = \sum_{i=1}^d \hat\sigma_i^2\, \hat u_i \hat u_i^T, \quad \text{where } \hat\sigma_i^2 = \frac{\omega^2 \hat r_i}{(1 - \hat r_i)^2} \text{ and } \hat r_i = \frac{\lambda_i}{\lambda_0}. \]
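The sketch below implements Algorithm 2 with numpy. The linearity test in Step 2 is realized here as a least-squares fit with a heuristic relative-error threshold, and the ratio is only evaluated where $|v_0|$ is not negligible; both are our own implementation choices rather than part of the algorithm, and the bandwidth in the usage example is an assumed value:

    import numpy as np

    def spectroscopic_estimate_gaussian(X, omega, lin_tol=0.05):
        """Algorithm 2 (sketch): estimate mu and Sigma of a single Gaussian in R^d."""
        X = np.atleast_2d(X)
        n, d = X.shape
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq / (2.0 * omega ** 2)) / n
        vals, vecs = np.linalg.eigh(K)
        vals, vecs = vals[::-1], vecs[:, ::-1]              # descending order
        lam0, v0 = vals[0], vecs[:, 0]
        mu_hat = X[np.argmax(np.abs(v0))]

        keep = np.abs(v0) > np.max(np.abs(v0)) / n          # avoid dividing by near-zero entries
        A = np.hstack([X[keep], np.ones((int(keep.sum()), 1))])  # linear model a^T x + b
        dirs, variances = [], []
        i = 1
        while len(dirs) < d and i < n:
            ratio = vecs[keep, i] / v0[keep]                # v_i(x_s) / v_0(x_s)
            coef, *_ = np.linalg.lstsq(A, ratio, rcond=None)
            rel_err = np.mean((A @ coef - ratio) ** 2) / np.mean(ratio ** 2)
            if rel_err < lin_tol:                           # Step 2: ratio is (nearly) linear in x
                a = coef[:d]
                dirs.append(a / np.linalg.norm(a))
                r = vals[i] / lam0
                variances.append(omega ** 2 * r / (1.0 - r) ** 2)   # per-direction variance, as in Eq. (4)
            i += 1
        U = np.column_stack(dirs)
        Sigma_hat = U @ np.diag(variances) @ U.T            # Step 3
        return mu_hat, Sigma_hat

    rng = np.random.default_rng(0)
    C = np.array([[0.5, -0.25], [-0.25, 0.5]])
    X = rng.multivariate_normal([1.0, 1.0], C, size=1500)
    print(spectroscopic_estimate_gaussian(X, omega=0.7))    # omega chosen for illustration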
4. Spectroscopic Estimation for Mixtures of Gaussians

We now extend our framework to the case of a mixture of several multivariate Gaussian distributions with potentially different covariance matrices and mixture coefficients. To illustrate our approach, we sample 1000 points from two different Gaussian distributions, $N(2, 1^2)$ and $N(-2, 1^2)$, and from their mixture $0.5\,N(2, 1^2) + 0.5\,N(-2, 1^2)$. The histogram of the mixture is shown in the top left panel of Fig. 2, and histograms of each mixture component are shown in the top right panels. Taking the bandwidth $\omega = 0.3$, we construct three kernel matrices, $K^1$, $K^2$ and $K$, for a sample from each of the components and from the mixture distribution respectively. The middle and lower left panels show the top two eigenvectors of $K$, while the middle and lower right panels show the top eigenvector of $K^1$ and of $K^2$ respectively.

The key observation is the similarity between the left and right panels: the top eigenvectors of the mixture are nearly identical to the top eigenvectors of each of the components. Thus knowing the eigenvectors of the mixture allows us to approximate the top eigenvectors (and the corresponding eigenvalues) of each of the components. Having access to these eigenvectors, and using our Algorithms 1 and 2, allows us to estimate the parameters of each of the mixture components.

Figure 2. Eigenvectors of a Gaussian kernel matrix ($\omega = 0.3$) of 1000 data points sampled from the mixture distribution $P = 0.5\,N(2, 1^2) + 0.5\,N(-2, 1^2)$. Top left panel: histogram of the data. Middle left panel: first eigenvector of $K_n$. Bottom left panel: second eigenvector of $K_n$. Top right panel: histograms of data from each component. Middle right panel: first eigenvector of $K_n^1$. Bottom right panel: first eigenvector of $K_n^2$.

This phenomenon is easily understood from the point of view of operator theory: the leading eigenfunctions of the operators defined by each mixture component are approximately eigenfunctions of the operator defined by the mixture distribution. To be explicit, consider the Gaussian convolution operator $G_p$ defined by the mixture distribution $p(x) = \pi_1 p_1 + \pi_2 p_2$, with Gaussian components $p_1 = N(\mu_1, \sigma^2)$ and $p_2 = N(\mu_2, \sigma^2)$, and the Gaussian kernel $K(x, y)$ with bandwidth $\omega$. The corresponding operators are $G_{p_1}$, $G_{p_2}$ and $G_p = \pi_1 G_{p_1} + \pi_2 G_{p_2}$ respectively. Consider an eigenfunction $\phi_1(x)$ of $G_{p_1}$ with eigenvalue $\lambda_1$, i.e., $G_{p_1}\phi_1 = \lambda_1 \phi_1$. We have

\[ G_p \phi_1(y) = \pi_1 \lambda_1 \phi_1(y) + \pi_2 \int K(x, y)\, \phi_1(x)\, p_2(x)\, dx. \]

It can be shown that the eigenfunction $\phi_1(x)$ of $G_{p_1}$ is centered at $\mu_1$ and decays exponentially away from $\mu_1$. Therefore, assuming the separation $\|\mu_1 - \mu_2\|$ is large enough, the second summand $\pi_2 \int K(x, y)\, \phi_1(x)\, p_2(x)\, dx \approx 0$ uniformly in $y$, and hence $G_p \phi_1 \approx \pi_1 \lambda_1 \phi_1$. When the approximation holds, the top eigenfunctions of $G_p$ are approximated by the top eigenfunctions of either $G_{p_1}$ or $G_{p_2}$. The following theorem makes this precise.

Theorem 1 Let $p(x) = \sum_{i=1}^2 \pi_i p^i(x)$ be a $d$-dimensional mixture of two Gaussians, where $\pi_i$ is the mixing weight and $p^i$ the density of $N(\mu^i, \sigma^2 I)$, and put $\beta = 2\sigma^2/\omega^2$. Then the first eigenfunction $\phi^1_0$ (with eigenvalue $\lambda^1_0$) of $G_{p^1}$ is approximately an eigenfunction of $G_p$ in the following sense: for any $\epsilon > 0$, if the separation $\|\mu^1 - \mu^2\|^2$ is large enough (the explicit condition grows as $2\omega^2 \log(2/\epsilon)$ and includes terms depending on $\sigma^2$, $\pi_1$ and $\frac{d}{4}\log(1 + 2\beta)$), then for all $y$
\[ G_p \phi^1_0(y) = \pi_1 \lambda^1_0 \bigl( \phi^1_0(y) + T(y) \bigr), \qquad |T(y)| \leq \epsilon. \]

We do not provide a proof of Theorem 1 for lack of space. A more general version of the theorem, for several Gaussians with different covariance matrices, can be given along the same lines. Together with some perturbation analysis ([5]), it is possible to provide bounds on the resulting eigenvalues and eigenfunctions of the operator.

We now observe that for the operator $G_{p_g}$ the top eigenfunction is the only eigenfunction with no sign change. Therefore, each such eigenfunction of $G_p$ corresponds to exactly one component of the mixture distribution. This immediately suggests a strategy for identifying the components of the mixture: we look for eigenfunctions of $G_p$ that have no sign change. Once these eigenfunctions of $G_p$ are identified, each eigenfunction of $G_p$ can be assigned to a group determined by an eigenfunction with no sign change. As a result, the eigenvalues and eigenfunctions in each group depend only on one component $p_g$ and its mixing weight $\pi_g$. By reversing the relationship between parameters and eigenvalues/eigenfunctions, parameter estimates for each mixture component can be constructed based only on the eigenvalues/eigenvectors in the corresponding group.

4.1. Algorithm for Estimation of a Mixture of Gaussians

Following the discussion above, we now describe the resulting algorithm for estimating a multidimensional mixture of Gaussians $p(x) = \sum_{g=1}^G \pi_g N(\mu_g, \Sigma_g)$ from a sample $x_1, \ldots, x_n \in \mathbb{R}^d$, first giving the following

Definition 1 For vectors $d, e \in \mathbb{R}^n$ and a precision $\epsilon > 0$, we define:
1. the $\epsilon$-support of $d$ is the set of indices $\{i : |d_i| > \epsilon,\ i = 1, \ldots, n\}$;
2. $d$ has no sign changes up to precision $\epsilon$ if $d$ is either positive or negative on its $\epsilon$-support;
3. $e$ is supported on $d$ up to precision $\epsilon$ if $\{i : |e_i| > \epsilon\} \subseteq \{i : |d_i| > \epsilon\}$.
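Definition 1 translates directly into code. The helpers below are an illustrative sketch; they also implement the per-eigenvector choice $\epsilon = \max_j |v_j| / n$ that we use in our implementation of Algorithm 3 below:

    import numpy as np

    def eps_support(v, eps):
        """Indices i with |v_i| > eps (item 1 of Definition 1)."""
        return np.flatnonzero(np.abs(v) > eps)

    def no_sign_change(v, eps=None):
        """True if v is either positive or negative on its eps-support (item 2).
        By default eps = max_j |v_j| / n, the choice used for Algorithm 3."""
        v = np.asarray(v, dtype=float)
        if eps is None:
            eps = np.max(np.abs(v)) / v.shape[0]
        s = v[eps_support(v, eps)]
        return s.size > 0 and bool(np.all(s > 0) or np.all(s < 0))

In Step 2 of Algorithm 3, no_sign_change is applied to the leading eigenvectors of $K_n$; the eigenvectors that pass are taken as the representatives $v_0^1, \ldots, v_0^{\hat G}$ of the mixture components.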
Algorithm 3. Spectroscopic estimation of a Gaussian mixture distribution.

Input: data $x_1, \ldots, x_n \in \mathbb{R}^d$. Parameters: kernel bandwidth $\omega > 0$, threshold $\epsilon > 0$. Output: number of components $\hat G$; estimated mean $\hat\mu_g \in \mathbb{R}^d$, mixing weight $\hat\pi_g$ and covariance matrix $\hat\Sigma_g$ for each component, $g = 1, \ldots, \hat G$. (In our implementation of the algorithm we choose $\epsilon = \max_j |(v_i)_j| / n$ for each eigenvector $v_i$; in the description of the algorithm we use the same $\epsilon$ for simplicity.)

· Step 1. Constructing $K_n$, the empirical approximation to $G_p$: put
\[ (K_n)_{ij} = \frac{1}{n} \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\omega^2} \right), \qquad i, j = 1, \ldots, n. \]
Compute the leading eigenvalues $\lambda_1, \lambda_2, \ldots$ and eigenvectors $v_1, v_2, \ldots$ of $K_n$.

· Step 2. Estimating the number of components $G$: identify all eigenvectors of $K_n$ that have no sign changes up to precision $\epsilon$. Estimate $G$ by the number $\hat G$ of such eigenvectors, and denote these eigenvectors and the corresponding eigenvalues by $v_0^1, v_0^2, \ldots, v_0^{\hat G}$ and $\lambda_0^1, \lambda_0^2, \ldots, \lambda_0^{\hat G}$ respectively.

· Step 3. Estimating the mean $\mu_g$ and the mixing weight $\pi_g$ of each component: for the $g$-th component, $g = 1, \ldots, \hat G$, estimate the mean and the mixing weight as
\[ \hat\mu_g = x_k, \quad k = \operatorname{argmax}_i |(v_0^g)_i|, \qquad \hat\pi_g = \frac{n_g}{\sum_{h=1}^{\hat G} n_h}, \]
where $n_h$ is the cardinality of the $\epsilon$-support of $v_0^h$.

To estimate the covariance matrix $\Sigma_g$ of each component $p_g$, we first find all eigenvectors $v$ such that $v(x_s)/v_0^g(x_s)$ is approximately a linear function of $x_s$ on the $\epsilon$-support of $v_0^g$. Then we apply the estimation method described in Step 3 of Algorithm 2 on the $\epsilon$-support of $v_0^g$.

5. Simulations and Experiments

Simulation: multivariate Gaussian mixture distribution. A simulation on five-dimensional data is carried out to test the proposed algorithm. The first two variables, $X_1$ and $X_2$, follow a mixture of three Gaussian components, $p(X) = \sum_{g=1}^3 \pi_g N(\mu_g, \Sigma_g)$, with mixing weights and group means shown in Table 3 and covariance matrices

\[ \Sigma_1 = \begin{pmatrix} 0.5 & -0.25 \\ -0.25 & 0.5 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 0.5 & 0.25 \\ 0.25 & 0.5 \end{pmatrix}, \qquad \Sigma_3 = \begin{pmatrix} 0.5 & -0.25 \\ -0.25 & 0.5 \end{pmatrix}. \]

The remaining three variables are Gaussian noise $N(0, 0.1 I)$. In each simulation run, 3000 data points are sampled. The histogram of $X_1$, the two-dimensional histogram of $(X_1, X_2)$, and the histogram of $X_2$ for one simulation run are shown in Figure 3. We see that it is impossible to identify the number of components by investigating the one-dimensional histograms.

Figure 3. Left: histogram of the first coordinate $X_1$. Middle: two-dimensional histogram of the first two coordinates $(X_1, X_2)$. Right: histogram of $X_2$.

Algorithm 3 with bandwidth $\omega = 0.1$ was used to estimate the number of components $G$, the mixing weights $\pi_g$ and the means. The simulation was run 50 times, and the algorithm accurately estimated the number of groups in 46 of the 50 runs; twice the number of groups was estimated as 2 and twice as 4. The average and standard deviation of the estimators of the mixing weights and means over those 46 runs are reported in Table 3. We see that the estimates of the mixing weights are close to the true values and the estimated group means are close to the estimates obtained from labeled data. Covariance estimates, which we do not show due to space limitations, also show reasonable accuracy.

  Group   Parameter         True value   Spectroscopy (STD)           x_bar (STD) of labeled group
  1       pi_1              0.4          0.40 (0.03)
          mu_1 = (1, 1)                  1.00 (0.12), 1.00 (0.19)     1.00 (0.02), 1.00 (0.022)
  2       pi_2              0.3          0.30 (0.03)
          mu_2 = (0, -1)                 0.01 (0.20), -0.94 (0.21)    -0.00 (0.03), -1.00 (0.02)
  3       pi_3              0.3          0.30 (0.03)
          mu_3 = (-1, 1)                 -0.96 (0.22), 0.99 (0.22)    -1.00 (0.02), 0.99 (0.03)

Table 3. Estimation of the mixing weight and mean of each component. The Spectroscopy column gives the average (standard deviation) over the 46 runs in which $G$ was estimated correctly; the last column gives the sample mean (standard deviation) computed from the labeled data in each group.
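Steps 1-3 of Algorithm 3 can be prototyped in a few lines of numpy. This is a sketch under our own simplifications: it returns only $\hat G$, the means and the mixing weights, it scans a fixed number of leading eigenvectors (a heuristic cutoff), and it leaves the bandwidth and the covariance estimation (which would proceed as described above on each $\epsilon$-support) to the caller:

    import numpy as np

    def spectroscopy_mixture(X, omega, n_lead=50):
        """Algorithm 3, Steps 1-3 (sketch): number of components, means, mixing weights."""
        X = np.atleast_2d(X)
        n = X.shape[0]
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq / (2.0 * omega ** 2)) / n                 # Step 1
        vecs = np.linalg.eigh(K)[1][:, ::-1]                     # eigenvectors, descending eigenvalue order

        means, supports = [], []
        for i in range(min(n_lead, n)):                          # Step 2: scan leading eigenvectors
            v = vecs[:, i]
            eps = np.max(np.abs(v)) / n                          # per-eigenvector precision
            supp = np.abs(v) > eps
            s = v[supp]
            if np.all(s > 0) or np.all(s < 0):                   # no sign change up to eps
                means.append(X[np.argmax(np.abs(v))])            # Step 3: mu_g = x_k
                supports.append(int(supp.sum()))                 # n_g = size of the eps-support
        weights = np.array(supports, dtype=float)
        weights /= weights.sum()
        return len(means), np.array(means), weights

    # Usage on a sample X, e.g. from the simulation above (bandwidth is an assumed choice):
    # G_hat, mu_hat, pi_hat = spectroscopy_mixture(X, omega=0.1)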
USPS ZIP code data. To apply our method to real-world data, we choose a subset of the USPS handwritten digit dataset, consisting of 16x16 grayscale images. In this experiment, 658 "3"s, 652 "4"s, and 556 "5"s in the training data are pooled together as our sample (size 1866). The spectroscopic estimation algorithm with a Gaussian kernel of bandwidth $\omega = 2$ is applied to the sample. Here we do not use the algorithm to estimate the mean and variance of each component, since we do not expect the distribution of the 256-dimensional data to be like a Gaussian distribution. Instead, we investigate the eigenvectors with no sign change over $\{x : |v(x)| > \epsilon\}$. We expect that (1) the data corresponding to large absolute values of each such eigenvector present one mode (cluster), and (2) those data points are in the same digit group.

In the output of our algorithm, three eigenvectors of $K_n$, namely $v_1$, $v_{16}$ and $v_{49}$, satisfy the condition of no sign change over $\{x : |v(x)| > \epsilon\}$ with $\epsilon = \max(v)/n$. We rank the data in decreasing order of $|v|$ and show the 1st, 41st, 81st, ..., 361st digits in Figure 4. All digits with larger values of $|v_1|$ belong to the group of "4"s, and the other digits ("3" and "5") correspond to smaller values of $|v_1|$. Similarly, larger values of $|v_{16}|$ correspond to "3"s and larger values of $|v_{49}|$ to "5"s.

Figure 4. Images ordered by the three eigenvectors $v_1$, $v_{16}$ and $v_{49}$ identified by Algorithm 3. The images are the digits corresponding to the 1st, 41st, 81st, ..., 361st largest entries of $|v_1|$ (first row), $|v_{16}|$ (second row) and $|v_{49}|$ (third row).

By assigning digits to the component defined by one of the eigenvectors ($v_1$, $v_{16}$, $v_{49}$) we obtain the clustering results shown in the confusion matrix in Table 4. We see that the overall accuracy of the clustering is 93.46%. This clustering method can be thought of as an extension of the framework provided in this paper. While it is closely related to spectral clustering, the procedures for choosing eigenvectors are different.

            "3" (T)   "4" (T)   "5" (T)
  "3" (P)     625         0        45
  "4" (P)      17       640        32
  "5" (P)      16        12       479

Table 4. Confusion matrix of the clustering results for the USPS handwritten digits. Each cell shows the number of data points belonging to both the true group (T), e.g. "3", and the predicted group (P), i.e. cluster, e.g. "3".
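For completeness, one natural way to turn the selected no-sign-change eigenvectors into cluster labels and a confusion matrix like Table 4 is sketched below; the assignment rule, where each point goes to the eigenvector with the largest absolute entry at that point, is our reading of the experiment rather than a rule spelled out in the text:

    import numpy as np

    def assign_by_eigenvectors(no_sign_change_vecs):
        """Assign each point to the component whose no-sign-change eigenvector
        has the largest absolute entry at that point."""
        V = np.abs(np.column_stack(no_sign_change_vecs))   # shape (n, G_hat)
        return np.argmax(V, axis=1)

    def confusion_matrix(true_labels, pred_labels):
        """Rows: predicted cluster, columns: true class (as in Table 4)."""
        t = np.asarray(true_labels)
        p = np.asarray(pred_labels)
        M = np.zeros((p.max() + 1, t.max() + 1), dtype=int)
        for ti, pi in zip(t, p):
            M[pi, ti] += 1
        return M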
6. Conclusion

In this paper we have presented Data Spectroscopy, a new framework for inferring parameters of certain families of probability distributions from data. In particular, we have analyzed the case of a mixture of Gaussian distributions and shown how to detect and estimate its components under the assumption of reasonable component separation. The framework is based on the spectral properties of data-dependent convolution operators and extends intuitions from spectral clustering and Kernel PCA. We have developed algorithms and have shown promising experimental results on simulated and real-world datasets. We think that our approach provides new connections between spectral methods and inference of distributions from data, which may lead to the development of algorithms for using labeled and unlabeled data in problems of machine learning.

7. Acknowledgments

The authors thank Yoonkyung Lee for helpful discussions and suggestions. Mikhail Belkin was partially supported by NSF Early Career Award 0643916. Bin Yu was partially supported by NSF grant DMS-0605165, ARO grant W911NF-05-1-0104, NSFC grant 60628102, a grant from MSRA, and a Guggenheim Fellowship in 2006.

References

[1] M. Belkin and P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, NIPS 2001.
[2] S. Dasgupta, Learning mixtures of Gaussians, FOCS 1999.
[3] A. Dempster, N. Laird, and D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Ser. B, 39 (1977), pp. 1-38.
[4] R. Kannan, H. Salmasian, and S. Vempala, The spectral method for general mixture models, COLT 2005.
[5] T. Kato, Perturbation Theory for Linear Operators, Springer-Verlag, 1966.
[6] V. Koltchinskii and E. Giné, Random matrix approximation of spectra of integral operators, Bernoulli, 6 (2000), pp. 113-167.
[7] S. MacEachern and P. Müller, Estimating mixture of Dirichlet process models, Journal of Computational and Graphical Statistics, 7 (1998), pp. 223-238.
[8] A. Ng, M. Jordan, and Y. Weiss, On spectral clustering: analysis and an algorithm, NIPS 2001.
[9] K. Pearson, Contributions to the mathematical theory of evolution, Phil. Trans. Royal Soc., 185A (1894), pp. 71-110.
[10] R. Redner and H. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Review, 26 (1984), pp. 195-239.
[11] B. Schölkopf, A. J. Smola, and K.-R. Müller, Kernel principal component analysis, in Advances in Kernel Methods: Support Vector Learning, 1999, pp. 327-352.
[12] C. Williams and M. Seeger, The effect of the input density distribution on kernel-based classifiers, ICML 2000.
[13] H. Zhu, C. Williams, R. Rohwer, and M. Morciniec, Gaussian regression and optimal finite dimensional linear models, in Neural Networks and Machine Learning, C. Bishop, ed., 1998.