NMF and PLSI: Equivalence and A Hybrid Algorithm

Chris Ding, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, chqding@lbl.gov
Tao Li, School of Computer Science, Florida International University, Miami, FL 33199, taoli@cs.fiu.edu
Wei Peng, School of Computer Science, Florida International University, Miami, FL 33199, wpeng002@cs.fiu.edu

ABSTRACT
In this paper, we show that PLSI and NMF optimize the same objective function, although PLSI and NMF are different algorithms, as verified by experiments. In addition, we propose a new hybrid method that runs PLSI and NMF alternately to achieve better solutions.

Categories and Subject Descriptors
I.2 [Artificial Intelligence]: Learning; I.5.3 [Pattern Recognition]: Clustering

General Terms
Algorithms, Experimentation, Theory

Keywords
NMF, PLSI, Equivalence, Chi-square

Copyright is held by the author/owner(s). SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.

1. INTRODUCTION
Non-negative Matrix Factorization (NMF) [1, 5] and Probabilistic Latent Semantic Indexing (PLSI) [4] have recently been applied to document clustering with success. Despite significant research on both NMF and PLSI, few attempts have been made to establish the connections between them while highlighting their differences in the clustering framework. Gaussier and Goutte [3] made the first connection between NMF and PLSI, by showing that the local fixed-point solutions of the iterative procedures of PLSI and NMF are the same. Their proof is, however, incorrect: NMF and PLSI are different algorithms, and they converge to different solutions even when they start from the same initial conditions.

In this paper, we show that PLSI and NMF optimize the same objective function, although PLSI and NMF are different algorithms, as verified by experiments. This provides a theoretical basis for a new hybrid method that runs PLSI and NMF alternately, combining the different advantages of PLSI and NMF and thus achieving better solutions. Extensive experiments on 5 real-life datasets show the relation between NMF and PLSI. Our results indicate that the hybrid method leads to significant improvements over NMF-only or PLSI-only methods.

2. EQUIVALENCE OF NMF AND PLSI
Suppose we have n documents and m words (terms). Let F = (F_ij) be the word-to-document matrix: F_ij = F(w_i, d_j) is the frequency of word w_i in document d_j. In this paper, we re-scale the term frequency by F_ij <- F_ij / T_w, where T_w = sum_ij F_ij is the total number of words. With this stochastic normalization, sum_ij F_ij = 1, and the joint occurrence probability is p(w_i, d_j) = F_ij. The general form of NMF is

  F = C H^T,   (1)

where C = (C_ik) and H = (H_jk) are nonnegative matrices. They are determined by minimizing

  J_NMF = sum_{i=1}^{m} sum_{j=1}^{n} [ F_ij log( F_ij / (C H^T)_ij ) - F_ij + (C H^T)_ij ].   (2)

PLSI maximizes the likelihood

  max J_PLSI,  J_PLSI = sum_{i=1}^{m} sum_{j=1}^{n} F_ij log P(w_i, d_j),   (3)

where P(w_i, d_j) is the factorized (i.e., parameterized or approximated) joint occurrence probability

  P(w_i, d_j) = sum_k p(w_i | z_k) p(z_k) p(d_j | z_k),   (4)

where the probability factors follow the normalization of probabilities

  sum_{i=1}^{m} p(w_i | z_k) = 1,   sum_{j=1}^{n} p(d_j | z_k) = 1,   sum_{k=1}^{K} p(z_k) = 1.   (5)

Proposition 1. The objective function of PLSI is identical to the objective function of NMF, i.e., J_PLSI = -J_NMF + constant, obtained by setting (C H^T)_ij = P(w_i, d_j).

Proposition 2. The column-normalized NMF of Eq. (1) is equivalent to the probability factorization of Eq. (4), i.e., C H^T = C' S H'^T, where C' and H' are the column-normalized versions of C and H, S is diagonal, and C'_ik = P(w_i | z_k), H'_jk = P(d_j | z_k), S_kk = P(z_k). Thus (C H^T)_ij = P(w_i, d_j) in the detailed factorization of Eq. (4).

The proofs of Propositions 1 and 2 can be found in [2]. From Propositions 1 and 2, we have

Theorem 1. PLSI and NMF are equivalent.
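Theorem 1 can be illustrated numerically. The sketch below is our own minimal illustration, not the authors' implementation: it runs the standard multiplicative updates for the divergence form of NMF [5] and the EM updates for PLSI [4] on the same small random matrix, and evaluates both factorizations under the single objective J_NMF of Eq. (2). Both procedures drive the same function down, even though their update rules differ. All function names and the toy problem sizes are our assumptions.

```python
import numpy as np

def j_nmf(F, P):
    """Objective of Eq. (2): sum_ij [F log(F/P) - F + P]."""
    return np.sum(F * np.log(F / P) - F + P)

def nmf_kl_step(F, C, H):
    """One round of Lee-Seung multiplicative updates for Eq. (2)."""
    C = C * ((F / (C @ H.T)) @ H) / H.sum(axis=0)
    H = H * ((F / (C @ H.T)).T @ C) / C.sum(axis=0)
    return C, H

def plsi_em_step(F, pw_z, pz, pd_z):
    """One EM round for P(w,d) = sum_k p(w|z_k) p(z_k) p(d|z_k), Eq. (4)."""
    joint = pw_z[:, None, :] * pd_z[None, :, :] * pz      # shape (m, n, K)
    post = joint / joint.sum(axis=2, keepdims=True)       # posterior p(z|w,d)
    nz = F[:, :, None] * post                             # expected counts
    pw_z = nz.sum(axis=1); pw_z /= pw_z.sum(axis=0)       # M-step, Eq. (5)
    pd_z = nz.sum(axis=0); pd_z /= pd_z.sum(axis=0)
    pz = nz.sum(axis=(0, 1)); pz /= pz.sum()
    return pw_z, pz, pd_z

rng = np.random.default_rng(0)
F = rng.random((8, 12)); F /= F.sum()     # stochastic normalization of Sec. 2
K = 3
C, H = rng.random((8, K)), rng.random((12, K))
pw_z = rng.random((8, K)); pw_z /= pw_z.sum(axis=0)
pd_z = rng.random((12, K)); pd_z /= pd_z.sum(axis=0)
pz = np.ones(K) / K

j_nmf_init = j_nmf(F, C @ H.T)
j_plsi_init = j_nmf(F, (pw_z * pz) @ pd_z.T)
for _ in range(200):
    C, H = nmf_kl_step(F, C, H)
    pw_z, pz, pd_z = plsi_em_step(F, pw_z, pz, pd_z)
j_nmf_final = j_nmf(F, C @ H.T)
j_plsi_final = j_nmf(F, (pw_z * pz) @ pd_z.T)
print(j_nmf_final, j_plsi_final)   # both decrease the same objective
```

Note that the final objective values are generally close but not identical: the two algorithms follow different paths on the same objective surface, which is exactly the point of the comparison in the next section.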
In [3], the authors attempted to prove that NMF and PLSI converge to the same solution (fixed point). Their proof is incorrect: NMF and PLSI are different computational algorithms optimizing the same objective function. In our experiments, starting with the same initial C_0, H_0, NMF and PLSI always converge to different solutions.

NMF and the chi-square statistic. J_NMF of Eq. (2) has a complicated expression. We gain a better understanding by relating it to the chi-square statistic [2]. Here F_ij is the data and (C H^T)_ij is the model fit. Assuming |(C H^T)_ij - F_ij| / F_ij is small, we have

  J_NMF ~= (1/2) chi^2 = (1/2) sum_{i=1}^{m} sum_{j=1}^{n} [ (C H^T)_ij - F_ij ]^2 / F_ij.   (6)

         WebAce   CSTR    WebKB   Reuters   Log
  A      0.083    0.072   0.239   0.070     0.010
  B      0.029    0.025   0.056   0.051     0.010
  C      0.022    0.013   0.052   0.040     0.012

Table 2: Disagreements between NMF and PLSI. All three types of experiments begin with the same smoothed K-means. (A) Smoothed K-means to NMF; smoothed K-means to PLSI. (B) Smoothed K-means to NMF to PLSI. (C) Smoothed K-means to PLSI to NMF.

Strictly speaking, an algorithm that makes only infinitesimal steps cannot jump out of a local minimum (we ignore the situation where the minimum is a saddle point). But PLSI is a finite-step algorithm, so it is possible for it to jump out of a local minimum; NMF is also a finite-step algorithm. Interestingly, our experimental results indicate that we can jump out of local minima consistently. Starting with the converged NMF solution, we run PLSI until convergence; the results are listed in Line B of Table 2, and the solutions improve significantly compared with the NMF-only solutions (Line A). Similarly, starting with the converged PLSI solution, we run NMF until convergence; the solutions again improve significantly compared with the PLSI-only solutions (Line C). Based on this ability of NMF to jump out of local minima of PLSI, and vice versa, we propose a hybrid algorithm that alternately runs NMF and PLSI, with the goal of successively jumping out of local minima and therefore converging to a better minimum.
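The quadratic approximation in Eq. (6) is easy to verify numerically. The sketch below is our own construction, not from the paper: we take a normalized frequency matrix F and perturb it by about 1% relative error to play the role of the model fit C H^T, then compare the divergence objective of Eq. (2) with half the chi-square statistic.

```python
import numpy as np

def j_nmf(F, P):
    # Eq. (2): sum_ij [F log(F/P) - F + P]
    return np.sum(F * np.log(F / P) - F + P)

def chi2_half(F, P):
    # Eq. (6): (1/2) sum_ij (P - F)^2 / F
    return 0.5 * np.sum((P - F) ** 2 / F)

rng = np.random.default_rng(1)
F = rng.random((20, 30)) + 0.5
F /= F.sum()                     # stochastic normalization, sum_ij F_ij = 1
# hypothetical model fit deviating from the data by ~1% relative error
P = F * (1.0 + 0.01 * rng.standard_normal(F.shape))

a, b = j_nmf(F, P), chi2_half(F, P)
print(a, b)
```

For such small relative deviations the two quantities agree to within a few percent; the approximation degrades as |(C H^T)_ij - F_ij| / F_ij grows, since Eq. (6) keeps only the second-order term of the expansion of Eq. (2).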
The hybrid algorithm consists of two steps: (1) K-means and smoothing; (2) iterate until convergence: (2a) run NMF to convergence, and (2b) run PLSI to convergence. We run the hybrid algorithm on all 5 datasets; the results are listed in Table 3. We use clustering accuracy as the performance measure [2]. We observe that: (1) NMF/PLSI always improves upon K-means; (2) the hybrid always improves upon NMF and PLSI, and the improvements are significant.

         Reuters   WebKB   CSTR    WebAce   Log
  A      0.316     0.410   0.617   0.416    0.775
  B      0.454     0.619   0.666   0.520    0.778
  C      0.487     0.510   0.668   0.519    0.779
  D      0.521     0.644   0.878   0.523    0.781

Table 3: Clustering accuracy. (A) K-means. (B) NMF-only. (C) PLSI-only. (D) Hybrid.

3. NMF AND PLSI COMPARISON
Experiment Setup. We compare the clustering performance of each method on 5 real-life datasets. Table 1 summarizes the characteristics of the datasets [6].

  Datasets       CSTR   WebKB   Log    Reuters   WebAce
  # documents    476    4199    1367   2900      2340
  # classes      4      4       9      10        20

Table 1: Dataset descriptions.

Agreements Between NMF and PLSI. For a given clustering solution, as a measure of pairwise matches, we introduce a clustering matrix R = (r_ij), where r_ij = 1 if x_i and x_j are clustered into the same cluster, and r_ij = 0 otherwise. To compare NMF and PLSI solutions, we compute the relative difference

  delta = ||R_NMF - R_PLSI||_F / sqrt( ||R_NMF||_F^2 / 2 + ||R_PLSI||_F^2 / 2 ).

The computed results are listed in Line A of Table 2. These results are the average of 10 different runs, each beginning with a different random initial start for K-means. They show that the difference between NMF and PLSI is quite substantial for WebKB (24%), and ranges between 1% and 8% in the other cases.

The function J_NMF defines a surface in a high-dimensional parameter space. Because this objective function is not convex, there are in general a very large number of local minima. Our experimental results suggest that, starting with the same initial smoothed K-means solution, NMF and PLSI converge to different local minima.
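The disagreement measure delta can be computed directly from two hard cluster-label vectors. A minimal sketch (our own code, assuming labels have already been extracted from the two factorizations; function names are ours):

```python
import numpy as np

def clustering_matrix(labels):
    # R_ij = 1 if items i and j are in the same cluster, else 0
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def disagreement(labels_a, labels_b):
    # delta = ||R_a - R_b||_F / sqrt(||R_a||_F^2 / 2 + ||R_b||_F^2 / 2)
    Ra, Rb = clustering_matrix(labels_a), clustering_matrix(labels_b)
    return np.linalg.norm(Ra - Rb) / np.sqrt(
        np.linalg.norm(Ra) ** 2 / 2 + np.linalg.norm(Rb) ** 2 / 2)

d_same = disagreement([0, 0, 1, 1], [0, 0, 1, 1])   # identical partitions -> 0.0
d_perm = disagreement([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, relabeled -> 0.0
d_diff = disagreement([0, 0, 1, 1], [0, 1, 0, 1])   # every pair split differently -> 1.0
print(d_same, d_perm, d_diff)
```

Because the measure is built from the pairwise matrix R rather than raw labels, it is invariant to permutations of the cluster indices, which is what makes it suitable for comparing solutions produced by different algorithms.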
In many cases, NMF and PLSI converge to nearby local minima; in other cases they converge to not-so-nearby minima.

4. A HYBRID NMF-PLSI ALGORITHM
We have seen that NMF and PLSI optimize the same objective function, but their different detailed algorithms converge to different local minima. An interesting question arises: starting from a local minimum of NMF, could we jump out of that local minimum by running the PLSI algorithm?

Acknowledgment. This work is partially supported by a 2005 IBM Faculty Award, NSF grant IIS-0546280, and Department of Energy grant DE-AC03-76SF00098.

5. REFERENCES
[1] C. Ding, X. He, and H.D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proc. SIAM Data Mining Conf., 2005.
[2] C. Ding, T. Li, and W. Peng. Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence, chi-square statistic, and a hybrid method. In AAAI, 2006.
[3] E. Gaussier and C. Goutte. Relation between PLSI and NMF and implications. In SIGIR, 2005.
[4] T. Hofmann. Probabilistic latent semantic analysis. In UAI, 1999.
[5] D.D. Lee and H.S. Seung. Algorithms for non-negative matrix factorization. In NIPS 13, 2001.
[6] T. Li, S. Ma, and M. Ogihara. Document clustering via adaptive subspace iteration. In SIGIR, 2004.