Estimating Local Optimums in EM Algorithm over Gaussian Mixture Model

Zhenjie Zhang  zhenjie@comp.nus.edu.sg
Bing Tian Dai  daibingt@comp.nus.edu.sg
Anthony K.H. Tung  atung@comp.nus.edu.sg
School of Computing, National University of Singapore, Computing 1, Law Link, 117590, Singapore

Abstract

The EM algorithm is a very popular iteration-based method for estimating the parameters of a Gaussian Mixture Model from a large observation set. However, in most cases the EM algorithm is not guaranteed to converge to the global optimum. Instead, it stops at some local optimum, which can be much worse than the global optimum. Therefore, it is usually necessary to run multiple procedures of the EM algorithm with different initial configurations and return the best solution. To improve the efficiency of this scheme, we propose a new method which estimates an upper bound on the log likelihood of the local optimum, based on the current configuration after the latest EM iteration. This is accomplished by first deriving a region that bounds the possible locations of the local optimum, followed by an upper bound estimation on the maximum likelihood over that region. With this estimate, we can terminate an EM procedure if the estimated local optimum is definitely worse than the best solution seen so far. Extensive experiments show that our method can effectively and efficiently accelerate the conventional multiple restart EM algorithm.

1. Introduction

Gaussian Mixture Model (GMM) (McLachlan & Peel, 2000) is a powerful tool in unsupervised learning for modelling unlabelled data in a multi-dimensional space. However, given an observation data set, estimating the parameters of the underlying Gaussian Mixture Model is not a trivial task, especially when the dimensionality or the number of components is large. Usually, this model estimation problem is transformed into a new problem, which tries to find the parameters maximizing the likelihood of the observations under the Gaussian distributions.

In the past decades, the EM algorithm (Dempster et al., 1977) has become the most widely used method for learning Gaussian Mixture Models (Ma et al., 2001; Jordan & Xu, 1995; McLachlan & Krishnan, 1996). Although the EM algorithm converges in finitely many iterations, there is no guarantee of convergence to the global optimum. Instead, it usually stops at some local optimum, which can be arbitrarily worse than the global optimum. Although there have been extensive studies on how to avoid bad local optimums, it is still necessary to run the EM algorithm with different random initial configurations and return the best local optimum as the final result. This leads to a great waste of computational resources, since most of the calculations make no contribution to the final result.

In this paper, we propose a fast stopping method to overcome the problem of being trapped in bad local optimums. Given any current configuration after an EM iteration, our method estimates an upper bound on the final likelihood of the local optimum the current configuration is leading to. Therefore, if the estimated local optimum is definitely not better than the best local optimum achieved in previous runs, the current procedure can be terminated immediately. To facilitate such local optimum estimation, we first prove that a certain region in the parameter space is guaranteed to cover the unknown local optimum. If a region covers the current configuration and every configuration on the boundary of the region has lower likelihood than the current one, we show that the local optimum is "trapped" in the region; we call such a region a maximal region. In this paper, we adopt a special type of maximal region which can be computed efficiently. Since the best likelihood of any configuration in a maximal region can be estimated in relatively short time, it can be decided immediately whether the current procedure still has the potential to reach a better local optimum. In our experiments, this method is shown to greatly improve the efficiency of the original EM algorithm for GMM, on both synthetic and real data sets.

The rest of the paper is organized as follows. We first introduce the definitions and related works on Gaussian Mixture Model and EM algorithm in Section 2. Section 3 proves the local trapping property of the EM algorithm on GMM, and Section 4 presents our study on the maximal region of a local optimum. We propose our algorithm for estimating the likelihood of a local optimum in Section 5. Section 6 shows the experimental results. Finally, Section 7 concludes this paper.
2. Model and Related Works

In this section, we review the Gaussian Mixture Model, the EM algorithm, and some acceleration methods proposed for a special case of the Gaussian Mixture Model (the k-means algorithm).

2.1. Gaussian Mixture Model

In the GMM model (McLachlan & Peel, 2000), there exist $k$ underlying components $\{C_1, C_2, \ldots, C_k\}$ in a $d$-dimensional data set. Each component follows some Gaussian distribution in the space. The parameters of component $C_j$ are $\theta_j = \{\mu_j, \Sigma_j, \pi_j\}$, in which $\mu_j = (\mu_j[1], \ldots, \mu_j[d])$ is the center of the Gaussian distribution, $\Sigma_j$ is its covariance matrix, and $\pi_j$ is the probability of component $C_j$. Based on these parameters, the probability that a point from component $C_j$ appears at $x = (x[1], \ldots, x[d])$ is

$$\Pr(x \mid \theta_j) = \frac{|\Sigma_j^{-1}|^{1/2}}{(2\pi)^{d/2}} \exp\left( -\frac{(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j)}{2} \right)$$

Thus, given the component parameter set $\Theta = \{\theta_1, \theta_2, \ldots, \theta_k\}$ but without any component information on an observation point $x$, the probability of observing $x$ is estimated by

$$\Pr(x \mid \Theta) = \sum_{j=1}^{k} \Pr(x \mid \theta_j)\, \pi_j$$

The problem of learning GMM is to estimate the parameter set $\Theta$ of the $k$ components that maximizes the likelihood of a set of observations $D = \{x_1, x_2, \ldots, x_n\}$, which is represented by

$$\Pr(D \mid \Theta) = \prod_{i=1}^{n} \Pr(x_i \mid \Theta) \qquad (1)$$

Based on the parameters of the GMM model, the posterior probability of $x_i$ coming from component $C_j$ (or the weight of $x_i$ in component $C_j$), $\phi_{ij}$, can be calculated as follows:

$$\phi_{ij} = \frac{\Pr(x_i \mid \theta_j)\, \pi_j}{\sum_{l=1}^{k} \Pr(x_i \mid \theta_l)\, \pi_l} \qquad (2)$$

To simplify the notation, we use $\Phi$ to denote the set of all $\phi_{ij}$ for any pair of $i, j$, and use $\Phi(\Theta)$ to denote the optimal $\Phi$ with respect to the current configuration $\Theta$. For ease of analysis, the original optimization problem on equation (1) is usually transformed into an equivalent maximization problem on the following quantity, called the log likelihood:

$$L(\Theta, \Phi) = \sum_{i=1}^{n} \sum_{j=1}^{k} \phi_{ij} \left( \ln \frac{\pi_j}{\phi_{ij}} + \frac{\ln |\Sigma_j^{-1}|}{2} - \frac{(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)}{2} \right) \qquad (3)$$

$L$ is a function over both $\Theta$ and $\Phi$, the latter of which is usually optimized according to $\Theta$. Thus, the problem of learning GMM is to find an optimal parameter set $\Theta^*$ that maximizes $L(\Theta^*, \Phi(\Theta^*))$.
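To make equations (1)-(3) concrete, the following sketch evaluates the component densities, the responsibilities $\phi_{ij}$ of equation (2), and the log likelihood of equation (1) for a given configuration $\Theta$. It is an illustrative NumPy sketch with names of our own choosing, not the authors' implementation; a production version would work in the log domain to avoid numerical underflow.

import numpy as np

def gaussian_pdf(X, mu, sigma):
    # Pr(x | theta_j) for every row of X under one Gaussian component.
    d = X.shape[1]
    inv = np.linalg.inv(sigma)
    diff = X - mu                                          # shape (n, d)
    mahal = np.einsum('ni,ij,nj->n', diff, inv, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt(np.linalg.det(inv)) / (2 * np.pi) ** (d / 2)
    return norm * np.exp(-0.5 * mahal)

def responsibilities(X, mus, sigmas, pis):
    # phi_ij of equation (2); also returns the weighted densities Pr(x_i|theta_j) pi_j.
    weighted = np.column_stack([pis[j] * gaussian_pdf(X, mus[j], sigmas[j])
                                for j in range(len(pis))])  # shape (n, k)
    return weighted / weighted.sum(axis=1, keepdims=True), weighted

def log_likelihood(X, mus, sigmas, pis):
    # ln Pr(D | Theta) = sum_i ln sum_j Pr(x_i|theta_j) pi_j, the logarithm of equation (1).
    _, weighted = responsibilities(X, mus, sigmas, pis)
    return np.log(weighted.sum(axis=1)).sum()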
2.2. EM Algorithm

The EM algorithm (Dempster et al., 1977) is a widely used technique for probabilistic parameter estimation. To estimate $\Theta = \{\theta_1, \ldots, \theta_k\}$, it starts with a randomly chosen initial parameter configuration $\Theta^0$. Then it keeps invoking iterations that recompute $\Theta^{t+1}$ based on $\Theta^t$. Every iteration consists of two steps, the E-step and the M-step. In the E-step, the algorithm computes the expected value of $\phi_{ij}$ for each pair of $i$ and $j$ based on $\Theta^t = \{\theta_1^t, \ldots, \theta_k^t\}$ and equation (2). In the M-step, the algorithm finds a new group of parameters $\Theta^{t+1}$ maximizing $L$ based on $\Phi^t = \{\phi_{ij}^t\}$ and $\{x_1, x_2, \ldots, x_n\}$. The details of the update process for $\mu_j$, $\Sigma_j$ and $\pi_j$ are listed below.

$$\mu_j^{t+1}[l] = \frac{\sum_{i=1}^{n} \phi_{ij}^t\, x_i[l]}{\sum_{i=1}^{n} \phi_{ij}^t} \qquad (4)$$

$$\Sigma_j^{t+1} = \frac{\sum_{i=1}^{n} \phi_{ij}^t\, (x_i - \mu_j^{t+1})(x_i - \mu_j^{t+1})^T}{\sum_{i=1}^{n} \phi_{ij}^t} \qquad (5)$$

$$\pi_j^{t+1} = \frac{\sum_{i=1}^{n} \phi_{ij}^t}{n} \qquad (6)$$

The iteration process stops only when $\Theta^N = \Theta^{N-1}$ after $N$ iterations. We note that both the E-step and the M-step always improve the objective function, i.e. $L(\Theta^t, \Phi^t) \geq L(\Theta^t, \Phi^{t-1}) \geq L(\Theta^{t-1}, \Phi^{t-1})$. Based on this property, the EM algorithm is guaranteed to converge to some local optimum. The convergence properties of the EM algorithm over GMM have been extensively studied in (Xu & Jordan, 1996; Ma et al., 2001).
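As an illustration of one iteration as described above, the sketch below runs the E-step via equation (2) and the M-step updates (4)-(6). It assumes the responsibilities helper from the previous sketch and omits the usual numerical safeguards (covariance regularization, log-domain E-step); it is not the authors' code.

import numpy as np

def em_iteration(X, mus, sigmas, pis):
    # One EM iteration: E-step (equation (2)) followed by M-step (equations (4)-(6)).
    n, d = X.shape
    k = len(pis)
    phi, _ = responsibilities(X, mus, sigmas, pis)   # E-step: phi_ij, shape (n, k)
    nj = phi.sum(axis=0)                             # sum_i phi_ij for each component
    new_mus = (phi.T @ X) / nj[:, None]              # equation (4)
    new_sigmas = []
    for j in range(k):
        diff = X - new_mus[j]
        new_sigmas.append((phi[:, j, None] * diff).T @ diff / nj[j])   # equation (5)
    new_pis = nj / n                                 # equation (6)
    return new_mus, np.array(new_sigmas), new_pis, phi

Iterating this until the parameters stop changing, and restarting from many random initial configurations, reproduces the conventional multiple restart scheme that this paper aims to accelerate.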
2.3. K-Means Algorithm and Its Acceleration

The k-means algorithm can be considered as a special case of GMM learning with several constraints. First, the covariance matrix of each component must be the identity matrix. Second, the posterior probability $\phi_{ij}$ can only be 0 or 1. Therefore, in the E-step of the algorithm, each point is assigned to the closest center under Euclidean distance, whereas in the M-step, the geometric center of each cluster replaces the old center.

With this simplification from GMM to k-means, many methods have been proposed to accelerate the multiple restart EM algorithm for k-means. In (Kanungo et al., 2002), for example, Kanungo et al. applied an indexing technique to achieve a much more efficient implementation of the E-step. In (Elkan, 2003), Elkan accelerated both the E-step and the M-step by employing the triangle inequality of Euclidean distance to reduce the time spent on distance computations. In (Zhang et al., 2006), Zhang et al. introduced a lower bound estimation on the k-means local optimums to efficiently cut the procedures not leading to good solutions. However, none of these methods proposed for the k-means algorithm can be directly extended to the general GMM. As far as we know, our paper is the first study on the acceleration of the multiple restart EM algorithm with a robustness guarantee. To improve the readability of the paper, we summarize all notations in Table 1.

Table 1: Table of Notations

  $n$                number of points in the data
  $d$                dimensionality of the data space
  $k$                number of components
  $C_j$              component $j$
  $\theta_j$         parameter set of component $j$
  $\mu_j$            center of component $j$
  $\Sigma_j$         covariance matrix of component $j$
  $\pi_j$            probability of component $j$
  $\Theta$           configuration of all components
  $x_i$              $i$th point in the data
  $\phi_{ij}$        posterior probability $\Pr(\theta_j \mid x_i)$
  $\Phi$             the set of all $\phi_{ij}$
  $\Phi(\Theta)$     the optimal $\Phi$ with respect to $\Theta$
  $S$                solution space for configurations
  $L(\Theta, \Phi)$  objective log likelihood function
  $\epsilon$         a parameter for a maximal region

3. Local Trapping Property

In this section, we prove the local trapping property of the EM algorithm on GMM. To derive the analysis, we first define a solution space $S$ with $(d^2 + d + 1)k$ dimensions, where $d$ is the dimensionality of the original data space. Any configuration $\Theta$, either valid or invalid, can be represented by a point in $S$. Without loss of generality, we use $\Theta$ to denote the configuration as well as the corresponding point in the solution space $S$. The rest of this section is devoted to proving the following theorem.

Theorem 1  Given a closed region $R$ in the solution space $S$ covering the current configuration $\Theta^t$, the EM algorithm converges to a local optimum in $R$ if every configuration $\Theta$ on the boundary of $R$ has $L(\Theta, \Phi(\Theta)) < L(\Theta^t, \Phi^t)$.

Given two configurations $\Theta^t$ and $\Theta^{t+1}$ across one EM iteration, we define a path between $\Theta^t$ and $\Theta^{t+1}$ in $S$ as follows. The path consists of two parts, called $P1$ and $P2$ respectively. $P1$ starts at $\Theta^t$ and ends at $\Theta^\#$, where $\Theta^\# = \{\theta_j^\#\}$. Here $\theta_j^\# = \{\mu_j^t, \Sigma_j^\#, \pi_j^t\}$, and

$$\Sigma_j^\# = \frac{\sum_i \phi_{ij}^t (x_i - \mu_j^t)(x_i - \mu_j^t)^T}{\sum_i \phi_{ij}^t}$$

An intermediate configuration between $\Theta^t$ and $\Theta^\#$ is defined as $\Theta_\lambda$, in which $\mu_j^t$ and $\pi_j^t$ remain the same, while $\Sigma_j$ in $\Theta_\lambda$ is $\left((1-\lambda)(\Sigma_j^t)^{-1} + \lambda(\Sigma_j^\#)^{-1}\right)^{-1}$. When $\lambda$ increases from 0 to 1, we move from $\Theta^t$ to $\Theta^\#$ in the solution space $S$.

The second part of the path starts at $\Theta^\#$ and ends at $\Theta^{t+1}$. Any intermediate configuration is $\Theta_\gamma = \{\theta_j^\gamma\}$, where $\mu_j^\gamma = (1-\gamma)\mu_j^t + \gamma\mu_j^{t+1}$, $\pi_j^\gamma = (\pi_j^t)^{1-\gamma}(\pi_j^{t+1})^\gamma$, and

$$\Sigma_j^\gamma = \frac{\sum_i \phi_{ij}^t (x_i - \mu_j^\gamma)(x_i - \mu_j^\gamma)^T}{\sum_i \phi_{ij}^t}$$

Similarly, a continuous movement from $\Theta^\#$ to $\Theta^{t+1}$ can be made by increasing $\gamma$ from 0 to 1. The following lemmas prove that any intermediate configuration on the path is a better solution than $\Theta^t$.

Lemma 1  Given any intermediate configuration $\Theta_\lambda$ between $\Theta^t$ and $\Theta^\#$, we have $L(\Theta_\lambda, \Phi(\Theta_\lambda)) \geq L(\Theta^t, \Phi^t)$.

Proof: By the optimality property of $\Phi(\Theta_\lambda)$, we have $L(\Theta_\lambda, \Phi(\Theta_\lambda)) \geq L(\Theta_\lambda, \Phi^t)$. Since $\Sigma_j^\# = \sum_i \phi_{ij}^t (x_i - \mu_j^t)(x_i - \mu_j^t)^T / \sum_i \phi_{ij}^t$ is the optimal choice for $\Sigma_j$ when $\phi_{ij}^t$, $\mu_j^t$ and $\pi_j^t$ are fixed, we also have $L(\Theta^\#, \Phi^t) \geq L(\Theta^t, \Phi^t)$. By the definition of $\Theta_\lambda$ and the property of $\Theta^\#$ above (with $\Phi^t$ fixed, $L$ is concave in each $\Sigma_j^{-1}$), the following inequality can be derived:

$$L(\Theta_\lambda, \Phi^t) \geq (1-\lambda)\, L(\Theta^t, \Phi^t) + \lambda\, L(\Theta^\#, \Phi^t) \geq L(\Theta^t, \Phi^t)$$

Therefore, it is straightforward to conclude that $L(\Theta_\lambda, \Phi(\Theta_\lambda)) \geq L(\Theta^t, \Phi^t)$. □

Lemma 2  Given any intermediate configuration $\Theta_\gamma$ between $\Theta^\#$ and $\Theta^{t+1}$, we have $L(\Theta_\gamma, \Phi(\Theta_\gamma)) \geq L(\Theta^t, \Phi^t)$.

Proof: Again, the basic inequality $L(\Theta_\gamma, \Phi(\Theta_\gamma)) \geq L(\Theta_\gamma, \Phi^t)$ holds. Based on this, we can prove the lemma by showing $L(\Theta_\gamma, \Phi^t) \geq L(\Theta^\#, \Phi^t)$, since $L(\Theta^\#, \Phi^t) \geq L(\Theta^t, \Phi^t)$.

With $\Sigma_j^\gamma = \sum_i \phi_{ij}^t (x_i - \mu_j^\gamma)(x_i - \mu_j^\gamma)^T / \sum_i \phi_{ij}^t$, a very interesting result is that

$$\sum_i \phi_{ij}^t (x_i - \mu_j^\gamma)^T (\Sigma_j^\gamma)^{-1} (x_i - \mu_j^\gamma) = \Big(\sum_i \phi_{ij}^t\Big)\, d$$

remains constant for any $\gamma$. Therefore, for any $\Theta_\gamma$, we only need to consider the sum

$$\sum_i \sum_j \phi_{ij}^t \ln\frac{\pi_j^\gamma}{\phi_{ij}^t} - \sum_j \Big(\sum_i \phi_{ij}^t\Big)\frac{\ln|\Sigma_j^\gamma|}{2}$$

By the definition of $\pi_j^\gamma$, since $\pi_j^\gamma = (\pi_j^t)^{1-\gamma}(\pi_j^{t+1})^\gamma$, we have $\ln \pi_j^\gamma = (1-\gamma)\ln\pi_j^t + \gamma\ln\pi_j^{t+1}$. Then

$$\sum_{i=1}^n \phi_{ij}^t \ln\frac{\pi_j^\gamma}{\phi_{ij}^t} = (1-\gamma)\sum_{i=1}^n \phi_{ij}^t \ln\frac{\pi_j^t}{\phi_{ij}^t} + \gamma\sum_{i=1}^n \phi_{ij}^t \ln\frac{\pi_j^{t+1}}{\phi_{ij}^t} \qquad (7)$$

Therefore, $\sum_i \phi_{ij}^t \ln(\pi_j^\gamma/\phi_{ij}^t) \geq \sum_i \phi_{ij}^t \ln(\pi_j^t/\phi_{ij}^t)$, since $\sum_i \phi_{ij}^t \ln(\pi_j^{t+1}/\phi_{ij}^t) \geq \sum_i \phi_{ij}^t \ln(\pi_j^t/\phi_{ij}^t)$.

On the other hand, based on the definition of $\Sigma_j^\gamma$, we can prove that

$$\Sigma_j^\gamma = \frac{\sum_{i=1}^n \phi_{ij}^t (x_i - \mu_j^t)(x_i - \mu_j^t)^T + (\gamma^2 - 2\gamma)\Big(\sum_{i=1}^n \phi_{ij}^t\Big)(\mu_j^{t+1} - \mu_j^t)(\mu_j^{t+1} - \mu_j^t)^T}{\sum_{i=1}^n \phi_{ij}^t}$$

Since $\gamma^2 - 2\gamma \leq 0$ for any $\gamma$ between 0 and 1, $\ln|\Sigma_j^\gamma| \leq \ln|\Sigma_j^\#|$, and thus $-\ln|\Sigma_j^\gamma|/2 \geq -\ln|\Sigma_j^\#|/2$. Combining the results above, we reach the conclusion that $L(\Theta_\gamma, \Phi^t) \geq L(\Theta^\#, \Phi^t)$, which establishes the lemma. □

Proof of Theorem 1: We prove the theorem by contradiction. If $R$ satisfies the boundary condition but the EM algorithm converges to some configuration outside $R$ in $S$, there is at least one pair of consecutive configurations $\{\Theta^s, \Theta^{s+1}\}$ such that $\Theta^s$ is in $R$ but $\Theta^{s+1}$ is not. By setting up the path $\{\Theta_\lambda\} \cup \{\Theta_\gamma\}$ between $\Theta^s$ and $\Theta^{s+1}$ as defined above, we know there is at least one $\Theta_\lambda$ (or $\Theta_\gamma$) lying exactly on the boundary of $R$. By Lemma 1 (respectively Lemma 2), we know $L(\Theta_\lambda, \Phi(\Theta_\lambda)) \geq L(\Theta^s, \Phi^s)$ (respectively $L(\Theta_\gamma, \Phi(\Theta_\gamma)) \geq L(\Theta^s, \Phi^s)$). On the other hand, no configuration on the boundary of $R$ is better than $\Theta^t$ by the definition of $R$, while $L(\Theta^s, \Phi^s) \geq L(\Theta^t, \Phi^t)$ by the monotonicity of the EM iterations. This leads to a contradiction. □
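To make the two-part path used in Lemma 1 and Lemma 2 explicit, the sketch below constructs an intermediate configuration on $P1$ and on $P2$; the helper names and array layout are our own assumptions. One can plug these configurations into the earlier log_likelihood sketch to check numerically that the likelihood never drops below $L(\Theta^t, \Phi^t)$ along the path.

import numpy as np

def p1_config(lam, mus_t, sigmas_t, pis_t, sigmas_sharp):
    # P1: mu_j and pi_j fixed; Sigma_j^{-1} interpolated linearly between
    # (Sigma_j^t)^{-1} and (Sigma_j^#)^{-1}, then inverted back.
    sigmas_lam = [np.linalg.inv((1 - lam) * np.linalg.inv(s_t) + lam * np.linalg.inv(s_sh))
                  for s_t, s_sh in zip(sigmas_t, sigmas_sharp)]
    return mus_t, np.array(sigmas_lam), pis_t

def p2_config(gam, X, phi_t, mus_t, mus_t1, pis_t, pis_t1):
    # P2: mu_j interpolated linearly, pi_j geometrically,
    # Sigma_j^gamma recomputed around mu_j^gamma with the fixed weights phi^t.
    nj = phi_t.sum(axis=0)
    mus_gam = (1 - gam) * mus_t + gam * mus_t1
    pis_gam = pis_t ** (1 - gam) * pis_t1 ** gam
    sigmas_gam = []
    for j in range(len(pis_t)):
        diff = X - mus_gam[j]
        sigmas_gam.append((phi_t[:, j, None] * diff).T @ diff / nj[j])
    return mus_gam, np.array(sigmas_gam), pis_gam

Note that the weights $\pi_j^\gamma$ produced this way need not sum to one; this is exactly why the solution space $S$ is allowed to contain invalid configurations.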
4. Maximal Region

Based on Theorem 1, we define the concept of a maximal region in GMM as follows. Given the current configuration $\Theta^t$, a region $R$ in $S$ is a maximal region for $\Theta^t$ if (1) $R$ covers $\Theta^t$, and (2) every boundary configuration $\Theta$ of $R$ has $L(\Theta, \Phi(\Theta)) < L(\Theta^t, \Phi^t)$. By Theorem 1, the EM algorithm then converges to some local optimum in $R$.

Given the current configuration $\Theta^t$, there is an infinite number of valid maximal regions in the solution space, most of which are hard to verify and manipulate. To facilitate efficient computation, we further propose a special class of maximal regions. Given $\Theta^t$ and a positive real value $\epsilon < 1$, we define a closed region $R(\Theta^t, \epsilon) \subseteq S$ as the set of all configurations $\Theta$, each component $\theta_j = \{\mu_j, \Sigma_j, \pi_j\}$ of which satisfies all of the conditions below: (1) $(1 - \epsilon)\pi_j^t \leq \pi_j \leq (1 + \epsilon)\pi_j^t$; (2) $-\epsilon \leq \mathrm{tr}(\Sigma_j (\Sigma_j^t)^{-1} - I) \leq \epsilon$; and (3) $(\mu_j - \mu_j^t)^T (\Sigma_j^t)^{-1} (\mu_j - \mu_j^t) \leq \epsilon^2$; where $\mathrm{tr}(M)$ denotes the trace of the matrix $M$ and $I$ is the identity matrix of dimension $d$.
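Read literally, the three conditions give a simple componentwise test. The sketch below checks whether a configuration $\Theta$ lies inside $R(\Theta^t, \epsilon)$; it is a direct transcription of the definition with names of our own choosing.

import numpy as np

def in_maximal_region(theta, theta_t, eps):
    # theta and theta_t are (mus, sigmas, pis) triples of NumPy arrays;
    # checks conditions (1)-(3) of R(Theta^t, eps) for every component j.
    mus, sigmas, pis = theta
    mus_t, sigmas_t, pis_t = theta_t
    d = mus.shape[1]
    for j in range(len(pis)):
        inv_t = np.linalg.inv(sigmas_t[j])
        # (1) (1 - eps) pi_j^t <= pi_j <= (1 + eps) pi_j^t
        if not ((1 - eps) * pis_t[j] <= pis[j] <= (1 + eps) * pis_t[j]):
            return False
        # (2) -eps <= tr(Sigma_j (Sigma_j^t)^{-1} - I) <= eps
        if abs(np.trace(sigmas[j] @ inv_t - np.eye(d))) > eps:
            return False
        # (3) (mu_j - mu_j^t)^T (Sigma_j^t)^{-1} (mu_j - mu_j^t) <= eps^2
        diff = mus[j] - mus_t[j]
        if diff @ inv_t @ diff > eps ** 2:
            return False
    return True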
Theorem 2  Any configuration $\Theta$ on the boundary of $R(\Theta^t, \epsilon)$ has $L(\Theta, \Phi^{t-1}) \leq L(\Theta^t, \Phi^{t-1}) - n \min_j \pi_j^t \epsilon^2 / 6$.

Proof: Given any $\Theta \in R(\Theta^t, \epsilon)$, any configuration on the boundary must satisfy one of the following conditions for at least one $j$ ($1 \leq j \leq k$): (1) $(1-\epsilon)\pi_j^t = \pi_j$; (2) $\pi_j = (1+\epsilon)\pi_j^t$; (3) $|\mathrm{tr}(\Sigma_j(\Sigma_j^t)^{-1} - I)| = \epsilon$; or (4) $(\mu_j - \mu_j^t)^T (\Sigma_j^t)^{-1} (\mu_j - \mu_j^t) = \epsilon^2$.

If $\Theta$ satisfies condition (1) for some component $l$, $L(\Theta, \Phi^{t-1})$ is maximized if $\mu_j^t$ and $\Sigma_j^t$ remain unchanged for all $j$, while $\pi_j = \pi_j^t \frac{1-(1-\epsilon)\pi_l^t}{1-\pi_l^t}$ for all $j \neq l$. Therefore, we have the following upper bound:

$$L(\Theta, \Phi^{t-1}) - L(\Theta^t, \Phi^{t-1}) = n\pi_l^t \ln(1-\epsilon) + n(1-\pi_l^t)\ln\frac{1-(1-\epsilon)\pi_l^t}{1-\pi_l^t} \leq n\pi_l^t \ln(1-\epsilon) + n\epsilon\pi_l^t \leq -\frac{n\pi_l^t\epsilon^2}{2}$$

The first inequality uses $\ln(1+x) \leq x$ on the second term, and the last is achieved by applying the Taylor expansion of $\ln(1-\epsilon)$. By iterating $l$ over all $k$ components, we have $L(\Theta, \Phi^{t-1}) \leq L(\Theta^t, \Phi^{t-1}) - \min_j n\pi_j^t\epsilon^2/2$.

If $\Theta$ satisfies condition (2) for some component $l$, $L(\Theta, \Phi^{t-1})$ can be maximized similarly. We have

$$L(\Theta, \Phi^{t-1}) - L(\Theta^t, \Phi^{t-1}) = n\pi_l^t \ln(1+\epsilon) + n(1-\pi_l^t)\ln\frac{1-(1+\epsilon)\pi_l^t}{1-\pi_l^t} \leq n\pi_l^t\ln(1+\epsilon) - n\epsilon\pi_l^t \leq n\pi_l^t\left(\epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3}\right) - n\epsilon\pi_l^t \leq -\frac{n\pi_l^t\epsilon^2}{6}$$

Here the third inequality from the bottom is due to the Taylor expansion of $\ln(1+\epsilon)$, and the last inequality holds because $\epsilon^3 \leq \epsilon^2$ for any $0 \leq \epsilon \leq 1$.

If $\Theta$ satisfies condition (3) for some component $l$, $L$ is maximized if all other parameters remain the same. Thus,

$$L(\Theta, \Phi^{t-1}) - L(\Theta^t, \Phi^{t-1}) = \frac{n\pi_l^t}{2}\left( \ln|\Sigma_l^{-1}\Sigma_l^t| - \mathrm{tr}(\Sigma_l^{-1}\Sigma_l^t - I) \right) \leq -\frac{n\pi_l^t}{4}\,\mathrm{tr}\!\left((\Sigma_l^{-1}\Sigma_l^t - I)^2\right) \leq -\frac{n\pi_l^t\epsilon^2}{4}$$

The equality is derived from the definitions of $\Sigma_l^t$ and $\pi_l^t$, and the first inequality is due to the Taylor expansion of the matrix logarithm.

Finally, if $\Theta$ satisfies condition (4) for some component $l$, $L$ is maximized if $\Sigma_l = \sum_i \phi_{il}(x_i - \mu_l)(x_i - \mu_l)^T / \sum_i \phi_{il}$. In this case, $\sum_i \phi_{il}^{t-1}(x_i - \mu_l)^T \Sigma_l^{-1}(x_i - \mu_l) = \sum_i \phi_{il}^{t-1}(x_i - \mu_l^t)^T (\Sigma_l^t)^{-1}(x_i - \mu_l^t) = n\pi_l^t d$. Thus, the only difference in the log likelihood function $L$ stems from the change in the determinant of the covariance matrix. We have

$$L(\Theta, \Phi^{t-1}) - L(\Theta^t, \Phi^{t-1}) = \sum_{i=1}^n \frac{\phi_{il}^{t-1}}{2}\left( -\ln|\Sigma_l| + \ln|\Sigma_l^t| \right) = \sum_{i=1}^n \frac{\phi_{il}^{t-1}}{2}\left( -\ln\left|\Sigma_l^t + (\mu_l - \mu_l^t)(\mu_l - \mu_l^t)^T\right| + \ln|\Sigma_l^t| \right) \leq \sum_{i=1}^n \frac{\phi_{il}^{t-1}}{2}\left( -\ln\left(|\Sigma_l^t| + \epsilon^2|\Sigma_l^t|\right) + \ln|\Sigma_l^t| \right) = -\sum_{i=1}^n \frac{\phi_{il}^{t-1}}{2}\ln(1+\epsilon^2) \leq -\frac{n\pi_l^t\epsilon^2}{4}$$

The inequality in the middle applies the property of positive definite matrices that $|A + B| \geq |A| + |B|$ (Lutkepohl, 1996), together with condition (4); the last step uses $\ln(1+x) \geq x/2$ for $0 \leq x \leq 1$.

In all of the four cases, the reduction in the likelihood function $L$ is at least $n\min_j\pi_j^t\epsilon^2/6$. This completes the proof of the theorem. □

The last theorem implies that moving $\Theta$ to the boundary reduces the log likelihood by at least $n\min_j\pi_j^t\epsilon^2/6$ if $\Phi$ remains $\Phi^{t-1}$. The remaining question is how much the likelihood can increase when the optimal $\Phi(\Theta)$ is used instead of $\Phi^{t-1}$.

Lemma 3  Given $\Theta \in R(\Theta^t, \epsilon)$, $\Pr(x_i \mid \theta_j)\pi_j$ is no larger than $(1+\epsilon)^{1.5}\, \frac{|(\Sigma_j^t)^{-1}|^{1/2}}{(2\pi)^{d/2}} \exp\{-(1-\epsilon)M(x_i, j)^2/2\}\, \pi_j^t$, where $M(x, j) = \max\left(\sqrt{(x-\mu_j^t)^T(\Sigma_j^t)^{-1}(x-\mu_j^t)} - \epsilon,\; 0\right)$.

Lemma 4  Given $\Theta \in R(\Theta^t, \epsilon)$, $\Pr(x_i \mid \theta_j)\pi_j$ is no smaller than $(1-\epsilon)^{1.5}\, \frac{|(\Sigma_j^t)^{-1}|^{1/2}}{(2\pi)^{d/2}} \exp\{-(1+\epsilon)N(x_i, j)^2/2\}\, \pi_j^t$, where $N(x, j) = \sqrt{(x-\mu_j^t)^T(\Sigma_j^t)^{-1}(x-\mu_j^t)} + \epsilon$.

The proofs of Lemma 3 and Lemma 4 are available in (Zhang et al., 2008).

Lemma 5  Given a region $R(\Theta^t, \epsilon)$ as defined above, an upper bound $U_{ij}$ on $\phi_{ij}(\Theta)$ for any $\Theta \in R(\Theta^t, \epsilon)$ can be calculated in constant time.

Proof: For any configuration $\Theta$ on the boundary of $R(\Theta^t, \epsilon)$, the optimal value of $\phi_{ij}$ can be calculated by equation (2). By Lemma 3 and Lemma 4, we can compute $\max \Pr(x_i \mid \theta_j)\pi_j$ and $\min \Pr(x_i \mid \theta_j)\pi_j$. Therefore,

$$U_{ij} = \frac{\max \Pr(x_i \mid \theta_j)\pi_j}{\sum_l \min \Pr(x_i \mid \theta_l)\pi_l} \qquad L_{ij} = \frac{\min \Pr(x_i \mid \theta_j)\pi_j}{\sum_l \max \Pr(x_i \mid \theta_l)\pi_l}$$

The calculations can be finished in constant time with the two sums pre-computed. □

By Lemma 5, an upper bound on the increase from $L(\Theta, \Phi^{t-1})$ to $L(\Theta, \Phi(\Theta))$ can be calculated by the following equation:

$$L(\Theta, \Phi(\Theta)) - L(\Theta, \Phi^{t-1}) \leq \ln\sum_i\sum_j U_{ij}\max\Pr(x_i \mid \theta_j)\pi_j - \ln\sum_i\sum_j \phi_{ij}^{t-1}\max\Pr(x_i \mid \theta_j)\pi_j$$

The following theorem gives a sufficient condition for $R(\Theta^t, \epsilon)$ to be a maximal region, for some positive value $\epsilon$.

Theorem 3  $R(\Theta^t, \epsilon)$ is a maximal region for $\Theta^t$ if

$$\ln\frac{\sum_i\sum_j U_{ij}\max\Pr(x_i \mid \theta_j)\pi_j}{\sum_i\sum_j \phi_{ij}^{t-1}\max\Pr(x_i \mid \theta_j)\pi_j} - \frac{n\min_j\pi_j^t\epsilon^2}{6} < L(\Theta^t, \Phi^t) - L(\Theta^t, \Phi^{t-1})$$

Proof: By the definition of $L$, we have

$$L(\Theta, \Phi(\Theta)) - L(\Theta, \Phi^{t-1}) \leq \ln\frac{\sum_i\sum_j U_{ij}\Pr(x_i \mid \theta_j)\pi_j}{\sum_i\sum_j \phi_{ij}^{t-1}\Pr(x_i \mid \theta_j)\pi_j}$$

It is not hard to verify that the derivative of $L(\Theta, \Phi(\Theta)) - L(\Theta, \Phi^{t-1})$ with respect to $\Pr(x_i \mid \theta_j)\pi_j$ is always positive. Therefore, the expression above is maximized if we employ the maximum value of $\Pr(x_i \mid \theta_j)\pi_j$. Based on the analysis above, we know that $L(\Theta, \Phi(\Theta)) - L(\Theta, \Phi^{t-1}) \leq \ln\sum_i\sum_j U_{ij}\max\Pr(x_i \mid \theta_j)\pi_j - \ln\sum_i\sum_j \phi_{ij}^{t-1}\max\Pr(x_i \mid \theta_j)\pi_j$. By Theorem 2, $L(\Theta, \Phi^{t-1}) - L(\Theta^t, \Phi^{t-1}) \leq -n\min_j\pi_j^t\epsilon^2/6$ for every boundary configuration $\Theta$. Therefore, by Theorem 1, $L(\Theta, \Phi(\Theta)) < L(\Theta^t, \Phi^t)$ on the boundary if the condition of the theorem is satisfied, and $R(\Theta^t, \epsilon)$ is a maximal region. □

For any local optimum $\Theta^*$ in the maximal region $R(\Theta^t, \epsilon)$, the following theorem upper bounds the likelihood function $L(\Theta^*, \Phi(\Theta^*))$.

Theorem 4  Given a valid maximal region $R(\Theta^t, \epsilon)$, if the EM algorithm converges to a local optimum $\Theta^*$, then $L(\Theta^*, \Phi(\Theta^*)) \leq L(\Theta^t, \Phi^t) + n\min_j\pi_j^t\epsilon^2/6$.

Proof: Since $\sum_{ij}(U_{ij} - \phi_{ij}^t)\max\Pr(\theta_j \mid x_i) < n\min_j\pi_j^t\epsilon^2/6$ by Theorem 3, and $L(\Theta^*, \Phi(\Theta^*)) - L(\Theta^*, \Phi^t) \leq \sum_{ij}(U_{ij} - \phi_{ij}^t)\max\Pr(\theta_j \mid x_i)$, we have $L(\Theta^*, \Phi(\Theta^*)) - L(\Theta^t, \Phi^t) \leq L(\Theta^*, \Phi(\Theta^*)) - L(\Theta^*, \Phi^t) \leq n\min_j\pi_j^t\epsilon^2/6$. □
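The quantities required by the pruning test, namely the bounds of Lemma 3 and Lemma 4 and the ratio $U_{ij}$ of Lemma 5, only need $(\Sigma_j^t)^{-1}$ and one Mahalanobis distance per point and component. The sketch below is our own illustrative rendering of these formulas, not the authors' code.

import numpy as np

def density_bounds(x, mu_t, sigma_t, pi_t, eps):
    # l_ij (Lemma 3) and s_ij (Lemma 4): bounds on Pr(x_i|theta_j) pi_j over R(Theta^t, eps).
    d = len(x)
    inv_t = np.linalg.inv(sigma_t)
    norm = np.sqrt(np.linalg.det(inv_t)) / (2 * np.pi) ** (d / 2)
    dist = np.sqrt((x - mu_t) @ inv_t @ (x - mu_t))   # Mahalanobis distance to mu_j^t
    m = max(dist - eps, 0.0)                          # M(x, j) of Lemma 3
    n_ = dist + eps                                   # N(x, j) of Lemma 4
    l_ij = (1 + eps) ** 1.5 * norm * np.exp(-(1 - eps) * m ** 2 / 2) * pi_t
    s_ij = (1 - eps) ** 1.5 * norm * np.exp(-(1 + eps) * n_ ** 2 / 2) * pi_t
    return l_ij, s_ij

def phi_bounds(l, s):
    # U_ij and L_ij of Lemma 5; l and s are (n, k) arrays of the bounds above.
    U = l / s.sum(axis=1, keepdims=True)
    L = s / l.sum(axis=1, keepdims=True)
    return U, L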
5. Algorithm

Theorem 3 provides an easy way to verify whether $R(\Theta^t, \epsilon)$ is a valid maximal region. On the other hand, Theorem 4 implies that a smaller $\epsilon$ leads to a tighter bound on the likelihood function $L$. However, it is not necessary to get the tightest bound on the local optimum in our algorithm, since the goal of our algorithm is only to estimate whether the current configuration can lead to a better solution. Instead, we set $\epsilon$ as

$$\epsilon = \min\left(1,\; \sqrt{\frac{6\left(L^* - L(\Theta^t, \Phi^t)\right)}{n\min_j\pi_j^t}}\right) \qquad (8)$$

where $L^*$ is the best result we have seen so far. This is the largest of all $\epsilon$ values that are able to prune the current EM procedure by Theorem 4: by that theorem the local optimum cannot exceed $L(\Theta^t, \Phi^t) + n\min_j\pi_j^t\epsilon^2/6$, so any $\epsilon$ with $n\min_j\pi_j^t\epsilon^2/6 \leq L^* - L(\Theta^t, \Phi^t)$ allows pruning. The details of the algorithm are summarized in Algorithm 1.

Algorithm 1 New Iteration(Data set D, current $\Theta^{t-1}$, current $\Phi^{t-1}$, component number $k$, sample number $m$, current best result $L^*$)
 1: Compute new $\Theta^t$ by M-step.
 2: Compute new $\Phi^t$ by E-step.
 3: if $L(\Theta^t, \Phi^t) < L^*$ then
 4:   Let $\epsilon = \min\left(1, \sqrt{6(L^* - L(\Theta^t, \Phi^t))/(n\min_j\pi_j^t)}\right)$
 5:   $S = X = 0$
 6:   for each $x_i$ do
 7:     for each component $j$ do
 8:       Get $l_{ij} = \max\Pr(x_i \mid \theta_j)\pi_j$ by Lemma 3.
 9:       Get $s_{ij} = \min\Pr(x_i \mid \theta_j)\pi_j$ by Lemma 4.
10:       Get $U_{ij}$ by Lemma 5.
11:       $S = S + U_{ij} l_{ij}$
12:       $X = X + \phi_{ij}^{t-1} l_{ij}$
13:     end for
14:   end for
15:   if $\ln S - \ln X - n\min_j\pi_j^t\epsilon^2/6 < L(\Theta^t, \Phi^t) - L(\Theta^t, \Phi^{t-1})$ then
16:     Stop the current procedure of the EM algorithm.
17:   end if
18: else
19:   Return $(\Theta^t, \Phi^t)$
20: end if

In this algorithm, the conventional M-step and E-step are invoked first. If the current configuration is better than the best solution we have seen before, there is no need to test the upper bound of the local optimum. Otherwise, the value of $\epsilon$ is set according to $\min_j\pi_j^t$, $L^*$ and $L(\Theta^t, \Phi^t)$. For each point and each component, $l_{ij}$, $s_{ij}$ and $U_{ij}$ are collected according to Lemma 3, Lemma 4 and Lemma 5 respectively. With the information collected from each point, the condition of Theorem 3 can be tested. If this condition is satisfied, we can assert that the current local optimum can never be better than $L^*$, leading to the termination of the current procedure.
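A possible rendering of Algorithm 1 in code is sketched below. It reuses the hypothetical helpers from the earlier sketches (responsibilities, density_bounds, phi_bounds) and adds an evaluation of equation (3); it is meant to show the control flow of the pruning test rather than a tuned implementation, and all names are our own.

import numpy as np

def L_of(X, mus, sigmas, pis, phi):
    # Objective of equation (3): sum_ij phi_ij (ln(pi_j/phi_ij) + ln|Sigma_j^{-1}|/2 - mahal/2).
    total = 0.0
    for j in range(len(pis)):
        inv = np.linalg.inv(sigmas[j])
        diff = X - mus[j]
        mahal = np.einsum('ni,ij,nj->n', diff, inv, diff)
        p = np.clip(phi[:, j], 1e-300, None)
        total += (phi[:, j] * (np.log(pis[j]) - np.log(p)
                               + 0.5 * np.log(np.linalg.det(inv)) - 0.5 * mahal)).sum()
    return total

def new_iteration(X, mus_prev, sigmas_prev, pis_prev, phi_prev, L_best):
    # One accelerated iteration (Algorithm 1): M-step, E-step, then the pruning test of Theorem 3.
    n, k = phi_prev.shape
    nj = phi_prev.sum(axis=0)
    mus = (phi_prev.T @ X) / nj[:, None]                          # M-step, equations (4)-(6)
    sigmas = np.array([(phi_prev[:, j, None] * (X - mus[j])).T @ (X - mus[j]) / nj[j]
                       for j in range(k)])
    pis = nj / n
    phi, _ = responsibilities(X, mus, sigmas, pis)                # E-step, equation (2)
    L_t = L_of(X, mus, sigmas, pis, phi)                          # L(Theta^t, Phi^t)
    if L_t >= L_best:
        return (mus, sigmas, pis, phi), L_t, False                # line 19: keep running
    eps = min(1.0, np.sqrt(6 * (L_best - L_t) / (n * pis.min()))) # equation (8)
    l = np.empty((n, k)); s = np.empty((n, k))
    for i in range(n):
        for j in range(k):
            l[i, j], s[i, j] = density_bounds(X[i], mus[j], sigmas[j], pis[j], eps)
    U, _ = phi_bounds(l, s)                                       # Lemma 5
    S_val, X_val = (U * l).sum(), (phi_prev * l).sum()
    L_prev = L_of(X, mus, sigmas, pis, phi_prev)                  # L(Theta^t, Phi^{t-1})
    prune = np.log(S_val) - np.log(X_val) - n * pis.min() * eps ** 2 / 6 < L_t - L_prev
    return (mus, sigmas, pis, phi), L_t, prune

A driver loop would call this once per iteration inside each random restart and abandon the restart as soon as the prune flag comes back true.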
6. Experiments

In this section, we report the experimental results comparing our accelerated EM algorithm (AEM) and the conventional EM algorithm (OEM). We note that in our implementation, either AEM or OEM is stopped if it does not converge after 100 iterations. We employ both synthetic and real data sets in our empirical studies. The synthetic data sets are generated in a d-dimensional unit cube with k components, each following some Gaussian distribution; the center, size and covariance matrix of each component are randomly and independently generated. Two real data sets are also tested, Cloud and Spam, both available from the UCI Machine Learning Repository. The Cloud data set consists of 1024 points in 10-dimensional space, while the Spam data set has 4601 points in 58 dimensions. Both real data sets are normalized before being used in our experiments. Two performance measurements are recorded, CPU time and number of iterations; an algorithm is considered better if it spends less CPU time and invokes fewer iterations. All experiments are compiled and run on a Fedora Core 6 Linux machine with a 3.0 GHz processor, 1 GB of memory and GCC 4.1.2. We test the performance of the algorithms with varying dimensionality D, number of components k, and number of points in the data S. The default setting of our experiments is D = 20, k = 20, and S = 100K. The number of EM restarts is fixed at 100 in all tests. More experimental results are available in the technical report (Zhang et al., 2008).

6.1. Results on Synthetic Data

In Figure 1(a) and Figure 1(b), we present the experimental results obtained by varying the dimensionality from 10 to 40. The results show that AEM is much more efficient than OEM. On data sets with low dimensionality, AEM is almost two times faster than OEM, in both CPU time and number of iterations. The advantage remains obvious even in high dimensional space.

Figure 1: Performance vs. varying dimensionality ((a) CPU time; (b) number of iterations).

The results of our experiments on varying component number are summarized in Figure 2(a) and Figure 2(b). From the figures, we can see that the performance advantage of AEM is stable as the component number increases. The CPU time and number of iterations of AEM are only about half of those of OEM.

Figure 2: Performance vs. varying component number ((a) CPU time; (b) number of iterations).

As shown in Figure 3(a) and Figure 3(b), AEM has much better performance than OEM when we increase the data size from 50K to 200K. AEM can detect worse local optimums much earlier when more data are available. The number of iterations invoked by AEM stays almost the same even when the data size is doubled, and the ratio of CPU time is more stable when the data size is larger.

Figure 3: Performance vs. varying data size ((a) CPU time; (b) number of iterations).

6.2. Results on Real Data

On the Spam data set, AEM also shows a great advantage over OEM, both in CPU time (Figure 4(a)) and in the number of iterations (Figure 4(b)). AEM is more efficient than OEM by one order of magnitude, independent of the number of components k.

Figure 4: Performance vs. varying component number on Spam data ((a) CPU time; (b) number of iterations).

However, the experiments on the Cloud data set (Figure 5) show quite different results from the previous ones: AEM has only a very limited advantage. We believe this difference stems from a normalization problem.

Figure 5: Performance vs. varying component number on Cloud data ((a) CPU time; (b) number of iterations).

7. Conclusion

In this paper, we propose a new acceleration method for the multiple restart EM algorithm over Gaussian Mixture Model. We derive an upper bound on the local optimum of the likelihood function in the solution space. This upper bound computation turns out to be both efficient and effective in pruning un-promising procedures of the EM algorithm.

Acknowledgement: The authors would like to acknowledge the valuable comments from the reviewers of ICML 2008.
References

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B, 39, 1-38.

Elkan, C. (2003). Using the triangle inequality to accelerate k-means. ICML (pp. 147-153).

Jordan, M. I., & Xu, L. (1995). Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, 8, 1409-1431.

Kanungo, T., Mount, D., Netanyahu, N., Piatko, C., Silverman, R., & Wu, A. (2002). An efficient k-means clustering algorithm: analysis and implementation.

Lutkepohl, H. (1996). Handbook of matrices. John Wiley & Sons Ltd.

Ma, J., Xu, L., & Jordan, M. I. (2001). Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Computation, 12, 2881-2907.

McLachlan, G., & Krishnan, T. (1996). The EM algorithm and extensions. Wiley-Interscience.

McLachlan, G., & Peel, D. (2000). Finite mixture models. Wiley-Interscience.

Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8, 129-151.

Zhang, Z., Dai, B. T., & Tung, A. K. H. (2006). On the lower bound of local optimums in k-means algorithm. ICDM (pp. 775-786).

Zhang, Z., Dai, B. T., & Tung, A. K. H. (2008). Estimating local optimums in EM algorithm over Gaussian mixture model. http://www.comp.nus.edu.sg/zhangzh2/papers/em.pdf.