Nonnegative Matrix Factorization via Rank-One Downdate

Michael Biggs (mbiggs@alumni.uwaterloo.ca), Ali Ghodsi (aghodsib@uwaterloo.ca), Stephen Vavasis (vavasis@uwaterloo.ca)
University of Waterloo, Waterloo, ON N2L 3G1

Abstract

Nonnegative matrix factorization (NMF) was popularized as a tool for data mining by Lee and Seung in 1999. NMF attempts to approximate a matrix with nonnegative entries by a product of two low-rank matrices, also with nonnegative entries. We propose an algorithm called rank-one downdate (R1D) for computing an NMF that is partly motivated by the singular value decomposition. This algorithm computes the dominant singular values and vectors of adaptively determined submatrices of a matrix. On each iteration, R1D extracts a rank-one submatrix from the original matrix according to an objective function. We establish a theoretical result that maximizing this objective function corresponds to correctly classifying articles in a nearly separable corpus. We also provide computational experiments showing the success of this method in identifying features in realistic datasets. The method is also much faster than other NMF routines.

1. Nonnegative Matrix Factorization

Several problems in information retrieval can be posed as low-rank matrix approximation. The seminal paper by Deerwester et al. (1990) on latent semantic indexing (LSI) showed that approximating a term-document matrix describing a corpus of articles via the SVD led to powerful query and classification techniques. A drawback of LSI is that the low-rank factors in general will have both positive and negative entries, and there is no obvious statistical interpretation of the negative entries.

This led Lee and Seung (1999), among others, to propose nonnegative matrix factorization (NMF), that is, approximation of a matrix A ∈ R^{m×n} as a product of two factors W H^T, where W ∈ R^{m×k} and H ∈ R^{n×k} both have nonnegative entries and k ≪ min(m, n). Lee and Seung showed intriguing results with a corpus of images. In a related work, Hofmann (1999) showed the application of NMF to text retrieval. Nonnegative matrix factorization has its roots in work of Gregory and Pullman (1983), Paatero and Tapper (1994) and Cohen and Rothblum (1993).

Since the problem is NP-hard (Vavasis, 2007), it is not surprising that no algorithm is known to solve NMF to optimality. Heuristic algorithms proposed for NMF have generally been based on incrementally improving the objective ||A - W H^T|| in some norm using local moves; a particularly sophisticated example of local search is due to Kim and Park (2007). A drawback of local search is that it is sensitive to initialization, and it is also sometimes difficult to establish convergence.

We propose an NMF method based on greedy rank-one downdating that we call R1D; it is described in Section 2. R1D is partly motivated by Jordan's algorithm for computing the SVD. Unlike local search methods, greedy methods do not require an initial guess. In Section 3, we compare our algorithm to Jordan's SVD algorithm, which is the archetypal greedy downdating procedure. Previous work on greedy downdating algorithms for NMF is the subject of Section 4. In Section 5, we present the main theoretical result of this paper, which states that in a certain model of text due to Papadimitriou et al. (2000), optimizing our objective function means correctly identifying a topic in a text corpus; Section 6 discusses the complexity of this optimization problem. We then turn to computational experiments: in Section 7, we present results for R1D on image datasets, and in Section 8, we present results on text.
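As a concrete point of reference for the local-search approach mentioned above, the following is a minimal NumPy sketch (our illustration, not code from any of the cited papers) of the standard Lee-Seung multiplicative updates for the Frobenius objective ||A - W H^T||_F, written with this paper's convention that H is n × k. The function name, iteration count, and the small epsilon guarding the denominators are arbitrary choices.

```python
import numpy as np

def nmf_multiplicative(A, k, iters=200, eps=1e-9, seed=0):
    """Local-search NMF baseline: multiplicative updates for ||A - W H^T||_F."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((n, k)) + eps
    for _ in range(iters):
        H *= (A.T @ W) / (H @ (W.T @ W) + eps)   # update H with W held fixed
        W *= (A @ H) / (W @ (H.T @ H) + eps)     # update W with H held fixed
    return W, H
```

Each update does not increase the objective, but the outcome depends on the random initialization; this is the sensitivity to initialization mentioned above, and it is what R1D's greedy construction avoids.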
2. Algorithm and Objective Function

Rank-one downdate (R1D) is based on the simple observation that the leading singular vectors of a nonnegative matrix are nonnegative. This is a consequence of the Perron-Frobenius theorem (Golub & Van Loan, 1996). Based on this observation, it is trivial to compute a rank-one NMF. This idea can be extended to approximate a higher-rank NMF: suppose we compute the rank-one NMF and then subtract it from the original matrix. The result is no longer nonnegative, but all negative entries can be forced to zero and the procedure can be repeated.

An improvement on this idea takes only a submatrix of the original matrix and applies the Perron-Frobenius theorem to it. The point is that taking the whole matrix will in some sense average the features, whereas a submatrix can pick out particular features. A second reason to take a submatrix is that a correctly chosen submatrix may be very close to rank one, so the step of forcing the residuals to zero will not introduce significant inaccuracy (since they will already be close to zero).

The outer loop of the R1D algorithm may be described as follows.

Algorithm 1 R1D
input: A ∈ R^{m×n}, k > 0
output: W ∈ R^{m×k}, H ∈ R^{n×k}
1: for µ = 1 to k do
2:   [M, N, u, v, σ] = ApproxRankOneSubmatrix(A)
3:   W(M, µ) = σ u(M)
4:   H(N, µ) = v(N)
5:   A(M, N) = 0
6: end for

Here, M is a subset of {1, ..., m}, N is a subset of {1, ..., n}, u ∈ R^m, v ∈ R^n, σ ∈ R, and u, v are both unit vectors. The function ApproxRankOneSubmatrix selects these five values so that the submatrix of A indexed by rows M and columns N is approximately rank one, and in particular, is approximately equal to σ u(M) v(N)^T. We follow Matlab subscripting conventions, so that A(M, N) denotes this particular submatrix.

This outer loop for R1D may be called "greedy rank-one downdating" since it greedily tries to fill the columns of W and H from left to right by finding good rank-one submatrices of A and subtracting them from A. The classical greedy rank-one downdating algorithm is Jordan's algorithm for the SVD, described in Section 3. Related work on greedy rank-one downdating for NMF is the topic of Section 4.

The subroutine ApproxRankOneSubmatrix, presented later in this section, is a heuristic routine to maximize the following objective function:

    f(M, N, u, σ, v) = ||A(M, N)||_F^2 - γ ||A(M, N) - σ u v^T||_F^2.    (1)

Here, γ is a penalty parameter. The Frobenius norm of an m × n matrix B, denoted ||B||_F, is defined to be sqrt(B(1,1)^2 + B(1,2)^2 + ... + B(m,n)^2). The rationale for (1) is as follows: the first term in (1) expresses the objective that ||A(M, N)|| should be large, while the second term penalizes departure of A(M, N) from being a rank-one matrix. Since the optimal u, σ, v come from the SVD (once M, N are fixed), the above objective function can be rewritten just in terms of M and N as

    f(M, N) = Σ_{i=1}^{p} σ_i(A(M, N))^2 - γ Σ_{i=2}^{p} σ_i(A(M, N))^2
            = σ_1(A(M, N))^2 - (γ - 1)(σ_2(A(M, N))^2 + ... + σ_p(A(M, N))^2),    (2)

where p = min(|M|, |N|). The penalty parameter γ should be greater than 1 so that contributions beyond rank one are penalized rather than rewarded.
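To make (2) concrete, the following NumPy sketch (our illustration; zero-based index lists M and N are assumed) evaluates f(M, N) from the singular values of the submatrix A(M, N).

```python
import numpy as np

def f_objective(A, M, N, gamma=4.0):
    """Eq. (2): f(M, N) = sigma_1^2 - (gamma - 1) * (sigma_2^2 + ... + sigma_p^2)."""
    sub = A[np.ix_(M, N)]                          # A(M, N) in Matlab notation
    s = np.linalg.svd(sub, compute_uv=False)       # singular values, descending
    return s[0] ** 2 - (gamma - 1.0) * np.sum(s[1:] ** 2)
```

With gamma > 1, a submatrix is rewarded for having a large leading singular value and penalized for any energy in the remaining singular values, which is exactly the trade-off described above.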
We conjecture that maximizing (1) is NP-hard (see Section 6), so we instead propose a heuristic routine for optimizing it. The procedure improves M, N, u, σ, and v cyclically. First, observe that if M, N are already known, then the optimal choice of u, σ, v can be found with the SVD. For fixed (v, N), the objective function (1) is separable by rows of the matrix. In particular, the contribution of row i ∈ M is ||A(i, N)||^2 - γ ||A(i, N) - β_i v^T||^2, where β_i = σ u_i. Note that β_i may be undefined if i ∉ M. Nonetheless, given v, the optimal β_i (i.e., the choice that minimizes ||A(i, N) - β_i v^T||) is easy to compute: it is A(i, N) v, the solution to a simple least-squares minimization. Thus, we conclude that putting row i into index set M is favorable for the overall objective function provided that f_i > 0, where

    f_i = ||A(i, N)||^2 - γ ||A(i, N) - A(i, N) v v^T||^2.

The formula for f_i can be simplified as follows:

    f_i = A(i, N) A(i, N)^T - γ (A(i, N) - A(i, N) v v^T)(A(i, N) - A(i, N) v v^T)^T
        = -(γ - 1) A(i, N) A(i, N)^T + γ (A(i, N) v)^2.

If we rescale by γ - 1 (which does not affect the acceptance criterion) and define a new penalty parameter γ̄ := γ/(γ - 1), then we see that row i is accepted provided that

    γ̄ (A(i, N) v)^2 - A(i, N) A(i, N)^T > 0.

A similar analysis applies to the columns and leads to the conclusion that, given values for M and u, column j should be accepted provided that

    γ̄ (u(M)^T A(M, j))^2 - A(M, j)^T A(M, j) > 0.

The next issue is the choice of a starting guess for M, N, u, σ, v. The algorithm should be initialized with a starting guess that has a positive score, or else the rules for discarding rows and columns could conceivably discard all rows or columns. To put this more strongly, in order to improve the score of the converged solution, it seems sensible to select a starting guess with a high score. For this reason, R1D uses a single column of A as its starting guess, and in particular, the column of A with the greatest norm. (A single row may also be chosen.) It then chooses u to be the normalization of this column. This column is exactly rank one, so for the correct values of σ and v the first penalty term of (1) is zero.

We have thus derived the following algorithm for the subroutine ApproxRankOneSubmatrix occurring in statement 2 of R1D.

Algorithm 2 ApproxRankOneSubmatrix
input: A ∈ R^{m×n}, parameter γ̄ > 1
output: M ⊆ {1, ..., m}, N ⊆ {1, ..., n}, u ∈ R^m, v ∈ R^n, σ ∈ R
1: Select j0 ∈ {1, ..., n} to maximize ||A(:, j0)||
2: M = {1, ..., m}
3: N = {j0}
4: σ = ||A(:, j0)||
5: u = A(:, j0)/σ
6: repeat
7:   Let v̄ = A(M, :)^T u(M)
8:   N = {j : γ̄ v̄(j)^2 - ||A(M, j)||^2 > 0}
9:   v(N) = v̄(N)/||v̄(N)||
10:  Let ū = A(:, N) v(N)
11:  M = {i : γ̄ ū(i)^2 - ||A(i, N)||^2 > 0}
12:  σ = ||ū(M)||
13:  u(M) = ū(M)/σ
14: until stagnation in M, N, u, σ, v

The repeat loop is guaranteed to make progress because each iteration increases the value of the objective function. On the other hand, there does not seem to be any easy way to derive a useful a priori upper bound on its number of iterations. In practice, it proceeds quite quickly, usually converging in 10-15 iterations. But to guarantee fast termination, monotonicity can be forced on M and N by requiring M to shrink and N to grow. In other words, statement 8 can be replaced by

    N = N ∪ {j : γ̄ v̄(j)^2 - ||A(M, j)||^2 > 0},

and statement 11 by

    M = M - {i : γ̄ ū(i)^2 - ||A(i, N)||^2 ≤ 0}.

Our experiments indicate that this change does not have a major impact on the performance of R1D.
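The following NumPy sketch is our reading of Algorithms 1 and 2, using dense arrays, the basic (non-monotone) update rules, and a stagnation test simplified to the index sets; it omits the |M| · |N| penalty discussed below, and the iteration caps and zero-guards are our additions. It may help make the row/column acceptance tests concrete.

```python
import numpy as np

def approx_rank_one_submatrix(A, gamma_bar=4.0, max_iter=50):
    """Sketch of Algorithm 2: grow an approximately rank-one submatrix around the largest-norm column."""
    m, n = A.shape
    j0 = int(np.argmax(np.linalg.norm(A, axis=0)))     # statement 1: largest-norm column
    M, N = np.arange(m), np.array([j0])
    sigma = np.linalg.norm(A[:, j0])
    u = A[:, j0] / sigma
    v = np.zeros(n)
    v[j0] = 1.0
    for _ in range(max_iter):
        v_bar = A[M, :].T @ u[M]                                                      # statement 7
        N_new = np.where(gamma_bar * v_bar**2 - np.sum(A[M, :]**2, axis=0) > 0)[0]    # statement 8
        if N_new.size == 0:
            break
        v = np.zeros(n)
        v[N_new] = v_bar[N_new] / np.linalg.norm(v_bar[N_new])                        # statement 9
        u_bar = A[:, N_new] @ v[N_new]                                                # statement 10
        M_new = np.where(gamma_bar * u_bar**2 - np.sum(A[:, N_new]**2, axis=1) > 0)[0]  # statement 11
        if M_new.size == 0:
            break
        sigma = np.linalg.norm(u_bar[M_new])                                          # statement 12
        u = np.zeros(m)
        u[M_new] = u_bar[M_new] / sigma                                               # statement 13
        stagnated = np.array_equal(M_new, M) and np.array_equal(N_new, N)             # statement 14
        M, N = M_new, N_new
        if stagnated:
            break
    return M, N, u, v, sigma

def r1d(A, k, gamma_bar=4.0):
    """Sketch of Algorithm 1: greedy rank-one downdating."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    W, H = np.zeros((m, k)), np.zeros((n, k))
    for mu in range(k):
        if not np.any(A):                              # nothing left to explain
            break
        M, N, u, v, sigma = approx_rank_one_submatrix(A, gamma_bar)
        W[M, mu] = sigma * u[M]                        # statement 3
        H[N, mu] = v[N]                                # statement 4
        A[np.ix_(M, N)] = 0.0                          # statement 5: downdate
    return W, H
```

On sparse data one would keep A in a sparse format and restrict the norm computations to the active rows and columns; the dense version above is only meant to mirror the pseudocode.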
Another possible modification to the algorithm is as follows: we modify the objective function by adding a second penalty term -φ |M| · |N| to (1), where φ > 0 is a parameter. The purpose of this term is to discourage very low-norm rows or columns from being inserted into A(M, N), since they are probably noisy. For data with larger norm, the first term of (1) should dominate this penalty. Notice that this penalty term is also separable, so it is easy to implement: the formula in statement 8 is changed to γ̄ v̄(j)^2 - ||A(M, j)||^2 - φ̄ |M| > 0, while the formula in statement 11 becomes γ̄ ū(i)^2 - ||A(i, N)||^2 - φ̄ |N| > 0, where φ̄ = φ/(γ - 1). A good choice is to set φ̄ so that at the initial starting point the third penalty term is a small fraction θ (say θ = 1/20) of the other terms. This leads to the following definition:

    φ̄ = θ (γ̄ - 1) σ^2 / m,

which may be computed immediately after statement 4.

Greedy rank-one downdating appears to be much faster than other NMF algorithms. Generating each column of W and H requires approximately 20 matrix-vector multiplications; these multiplications are always at least as sparse as the original data. There is no iterative improvement phase. It can also be much faster than the SVD, especially for sparse data.

3. Relationship to the SVD

The classical rank-one greedy downdating algorithm is Jordan's algorithm for computing the singular value decomposition (SVD) (Stewart, 1993). Recall that the SVD takes as input an m × n matrix A and returns three factors U, Σ, V such that U ∈ R^{m×k} has orthonormal columns (i.e., U^T U = I), Σ ∈ R^{k×k} is diagonal with nonnegative diagonal entries, and V ∈ R^{n×k} also has orthonormal columns, such that U Σ V^T is the optimal rank-k approximation to A in either the 2-norm or the Frobenius norm. (Recall that the 2-norm of an m × n matrix B, denoted ||B||_2, is defined to be sqrt(λ_max(B^T B)), where λ_max denotes the maximum eigenvalue.)

Algorithm 3 JordanSVD
input: A ∈ R^{m×n} and k ≤ min(m, n)
output: U, Σ, V as above
1: for µ = 1 to k do
2:   Select a random nonzero ū ∈ R^m
3:   σ = ||ū||
4:   u = ū/σ
5:   repeat {power method}
6:     v̄ = A^T u
7:     v = v̄/||v̄||
8:     ū = A v
9:     σ = ||ū||
10:    u = ū/σ
11:  until stagnation in u, σ, v
12:  A = A - σ u v^T
13:  U(:, µ) = u
14:  V(:, µ) = v
15:  Σ(µ, µ) = σ
16: end for
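A compact NumPy rendering of Algorithm 3 (our illustration; the stagnation test, iteration cap, and zero-guards are arbitrary choices) makes the power-method structure explicit:

```python
import numpy as np

def jordan_svd(A, k, tol=1e-10, max_iter=500, seed=0):
    """Sketch of Algorithm 3: Jordan-style greedy SVD via power iteration and rank-one downdating."""
    rng = np.random.default_rng(seed)
    A = np.array(A, dtype=float)
    m, n = A.shape
    U, S, V = np.zeros((m, k)), np.zeros(k), np.zeros((n, k))
    for mu in range(k):
        u = rng.standard_normal(m)
        u /= np.linalg.norm(u)
        v = np.zeros(n)
        sigma = 0.0
        for _ in range(max_iter):                      # power method (statements 5-11)
            v = A.T @ u
            nv = np.linalg.norm(v)
            if nv == 0.0:                              # A has been fully deflated
                break
            v /= nv
            u = A @ v
            sigma_new = np.linalg.norm(u)
            if sigma_new == 0.0:
                break
            u /= sigma_new
            if abs(sigma_new - sigma) <= tol * sigma_new:
                sigma = sigma_new
                break
            sigma = sigma_new
        A -= sigma * np.outer(u, v)                    # statement 12: downdate
        U[:, mu], S[mu], V[:, mu] = u, sigma, v
    return U, np.diag(S), V
```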
Thus, we see that R1D is quite similar to the SVD. The principal difference is that R1D tries to find a submatrix indexed by M × N at the same time that it tries to identify the optimal u and v. Hence, the formulas for u and v occurring in statements 9 and 13 of subroutine ApproxRankOneSubmatrix, which were presented earlier as solutions to a least-squares problem, may also be regarded as steps in a power method. In particular, this means that if M and N are fixed, then the inner repeat loop of this subroutine will indeed converge to the dominant singular triple of A(M, N).

As mentioned earlier, a shortcoming of the SVD is that its factors contain both positive and negative numbers. It has another, subtler shortcoming when used for clustering: because the SVD always operates on the entire matrix, it can return a singular vector that averages the results from two nearly disjoint topics in a corpus (see Biggs et al. (2008) for an example). R1D avoids this pitfall by seeking a submatrix that is approximately rank one as it applies the power method.

4. Related Work

As mentioned in the introduction, most algorithms proposed in the literature are based on forming an initial W and H and then improving them by local search on an objective function. The objective function usually includes a term of the form ||A - W H^T|| in some norm, and may include other terms.

A few previous works follow an approach similar to ours, namely, greedy subtraction of rank-one matrices. This includes the work of Bergmann et al. (2003), who identify the rank-one matrix to subtract as the fixed point of an iterative process. Asgarian and Greiner (2006) find the dominant singular pair and then truncate it. Gillis (2006) finds a rank-one underestimator and subtracts that. Boutsidis and Gallopoulos (2007) consider the use of a greedy algorithm for initializing other algorithms and make the following interesting observation: the nonnegative part of a rank-one matrix has rank at most 2. The main innovation herein is the idea that the search for the rank-one submatrix should itself be an optimization subproblem. Formulating it this way allows us to compare one candidate submatrix to another. (Gillis also phrases his subproblem as an optimization, although his optimization problem does not explicitly seek submatrices like ours.) A second innovation is our analysis in Section 5 showing that if the subproblem were solved optimally, then R1D would be able to accurately find the topics in the model of ε-separable corpora (Papadimitriou et al., 2000).

5. Behavior of this objective function on a nearly separable corpus

In this section, we establish the main theoretical result of the paper, namely, that the objective function given by (1) is able to correctly identify a topic in a nearly separable corpus. We define our text model as follows. There is a universe of terms numbered 1, ..., m and a set of topics numbered 1, ..., t. Topic k, for k = 1, ..., t, is a probability distribution over the terms. Let P(i, k) denote the probability of term i occurring in topic k. Thus, P is a singly stochastic matrix, i.e., it has nonnegative entries with column sums exactly 1. We assume also that there is a probability distribution over topics; say the probability of topic k is π_k, for k = 1, ..., t. The text model is thus specified by P and π_1, ..., π_t.

We use the Zipf distribution as the model of document length. In particular, there is a number L such that all documents have length less than L, and the probability that a document has length l is

    (1/l) / (1 + 1/2 + ... + 1/(L - 1)).

We have checked that the Zipf model is a good fit for several common datasets.

A document is generated from this text model as follows. First, a topic k is chosen at random according to the probability distribution {π_1, ..., π_t}. Then, a length l is chosen at random from {1, ..., L - 1} according to the Zipf distribution. Finally, the document itself is chosen at random by selecting l terms independently according to the probability distribution P(:, k). A corpus is a set of n documents chosen independently using this text model. Its term-document matrix is the m × n matrix A such that A(i, j) is the frequency of term i in document j.
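The generative process just described is easy to simulate. The sketch below (our illustration; the function name and arguments are ours) draws a term-document matrix from a given topic-term matrix P, topic distribution pi, and length cap L.

```python
import numpy as np

def sample_corpus(P, pi, n_docs, L, seed=0):
    """Sample a term-document matrix: topic ~ pi, length ~ Zipf on {1,...,L-1}, terms i.i.d. ~ P(:, k)."""
    rng = np.random.default_rng(seed)
    m, t = P.shape
    lengths = np.arange(1, L)
    zipf = 1.0 / lengths
    zipf /= zipf.sum()                                   # P(length = l) proportional to 1/l
    A = np.zeros((m, n_docs))
    for j in range(n_docs):
        k = rng.choice(t, p=pi)                          # choose a topic
        l = rng.choice(lengths, p=zipf)                  # choose a document length
        terms = rng.choice(m, size=l, p=P[:, k])         # draw l terms independently from topic k
        np.add.at(A[:, j], terms, 1.0)                   # accumulate term frequencies
    return A
```

The resulting A(i, j) records how many times term i appears in document j, matching the term-document matrix defined above.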
We further assume that the text model is ε-separable, meaning that each topic k is associated with a set of terms T_k ⊆ {1, ..., m}, that T_1, ..., T_t are mutually disjoint, and that P(i, k) ≤ ε for i ∉ T_k; i.e., the probability that a document on topic k will use a term outside of T_k is small. Let P_min = min{P(i, k) : i ∈ T_k, k = 1, ..., t}. Without loss of generality, P_min > 0, since any row i ∈ T_k such that P(i, k) = 0 may be removed from T_k without affecting the validity of the model. The parameter ε must satisfy an inequality mentioned below.

This corpus model is quite similar to that of Papadimitriou et al. (2000). One difference is in the document length model. Our model also relaxes several assumptions of Papadimitriou et al. Our main theorem is that the objective function given by (1) correctly finds the documents associated with a particular topic in a corpus.

Theorem 1. Let (P, (π_1, ..., π_t)) specify a text model, and let δ > 0 be chosen arbitrarily. Assume ε > 0 is chosen smaller than a certain function of (P_min, m, t, δ) (see Biggs et al. (2008) for this function). Suppose that the text model is ε-separable with respect to T_1, ..., T_t, the subsets of terms defining the topics. Let A be the term-document matrix of a corpus of n documents drawn from this model when the document-length parameter is L. Choose γ = 4 in (1). Then with probability tending to 1 as n → ∞ and L → ∞, the optimizing pair (M, N) of (1) satisfies the following. Let D_1, ..., D_t be the partitioning of the columns of A according to topics. There exists a topic k ∈ {1, ..., t} such that A(M, N) and A(T_k, D_k) are nearly coincident in the following sense:

    Σ_{(i,j) ∈ (M×N) Δ (T_k×D_k)} A(i, j)^2  ≤  δ Σ_{(i,j) ∈ M×N} A(i, j)^2.

Here, X Δ Y denotes the set-theoretic symmetric difference (X - Y) ∪ (Y - X).

The proof of this theorem is lengthy and appears in Biggs et al. (2008). It relies on Chernoff-Hoeffding estimates and on perturbation results for singular vectors such as Theorem 8.6.5 of Golub and Van Loan (1996).

6. On the complexity of maximizing f(M, N)

In this section, we observe that the problem of globally maximizing (2) is NP-hard, at least in the case that γ is treated as an input parameter. This observation explains why R1D settles for a heuristic maximization of (2) rather than exact maximization. First, observe that the maximum biclique (MBC) problem is NP-hard, as proved by Peeters (2003). We show that the MBC problem can be transformed into an instance of (2).

Let us recall the definition of the MBC problem. The input is a bipartite graph G. The problem is to find an (m, n)-complete bipartite subgraph K (sometimes called a biclique) of G such that mn is maximized, i.e., the number of edges of K is maximized.

Suppose we are given G, an instance of the maximum biclique problem. Let A be the left-right adjacency matrix of G; that is, if G = (U, V, E), where U ∪ V is the bipartition of the node set, then A has |U| rows and |V| columns, and A(i, j) = 1 if (i, j) ∈ E for i ∈ U and j ∈ V, else A(i, j) = 0. Consider maximizing (2) for this choice of A. We require the following preliminary lemmas, whose proofs are omitted.

Lemma 2. Let A be a matrix that has either of the following as a submatrix:

    U1 = [1 1; 0 1]   or   U2 = [1 0; 0 1].    (3)

Then σ_2(A) > 0.618.

This lemma leads to the following lemma.

Lemma 3. Suppose all entries of A ∈ R^{m×n} are either 0 or 1, and at least one entry is 1. Suppose M, N are the optimal solution for maximizing f(M, N) given by (2). Suppose also that the parameter γ is chosen to be 2.7mn + 1 or larger. Then the optimal choice of M, N must yield a matrix A(M, N) of all 1's, possibly augmented with some rows or columns that are entirely zeros.

Now consider the main claim, namely, that the optimizer (M, N) of the objective function for this A corresponds to the maximum biclique. If A(M, N) includes a row or column entirely of zeros, then this row or column may be dropped without affecting the value of the objective function (2). Hence it follows from the lemma that, without loss of generality, the optimizer (M, N) of (2) indexes a matrix of all 1's. In that case, σ_1(A(M, N)) = sqrt(|M| · |N|) while σ_2(A(M, N)) = ... = σ_p(A(M, N)) = 0 (where p = min(|M|, |N|)), and hence f(M, N) = |M| · |N|.
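Both singular-value facts used in this argument are easy to confirm numerically (the block sizes below are arbitrary choices of ours):

```python
import numpy as np

U1 = np.array([[1.0, 1.0],
               [0.0, 1.0]])
print(np.linalg.svd(U1, compute_uv=False))    # [1.618..., 0.618...]: sigma_2 > 0.618, as in Lemma 2

B = np.ones((5, 7))                           # an all-ones |M| x |N| block
s = np.linalg.svd(B, compute_uv=False)
print(s[0] ** 2, np.sum(s[1:] ** 2))          # 35.0 and ~0, so f(M, N) = |M| * |N| = 35
```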
Thus, the value of the objective function corresponds exactly to the number of edges in the biclique. This completes the proof that biclique is reducible in polynomial time to maximizing (2). We note that Gillis (2006) also uses the result of Peeters for a similar purpose, namely, to show that the subproblem arising in his NMF algorithm is also NP-hard.

The NP-hardness result in this section requires that γ be an input parameter. We conjecture that (2) is NP-hard even when γ is fixed (say γ = 4, as used herein).

7. Image dataset test cases

Figure 1. A binary image dataset is depicted in (a); white indicates zeros. The result of R1D on this dataset is shown in (b), and LSI in (c).

We first demonstrate the performance of R1D on a simple binary image dataset, depicted in Figure 1(a). Each of the ten dataset images is composed of one or two "basis" triangles. The results of R1D (with parameter γ̄ = 4) and LSI on this dataset are shown in Figure 1(b) and (c), respectively, and the interpretation is as follows. The leftmost column illustrates the four leading columns of W, which are the learned features. For each of these, the images on the right are the dataset images with the largest entries in the corresponding column of H; they should be closely associated with the feature on the left. R1D discovered the four triangles as a basis, and to each it associated exactly the dataset images which contain the appropriate triangle. In contrast, the LSI factorization is not as interpretable. We have also compared results against NMFDIV from nmfpack (Hoyer, 2000; Hoyer, 2004). NMFDIV requires k, the number of basis vectors to compute, as an input parameter which globally affects the factors W and H. If k is correctly set to 4, NMFDIV is able to compute the same correct result as R1D. Otherwise, some or all of the basis vectors will appear incorrect, including the first ones. R1D and LSI will each compute the same leading columns regardless of k, and on this dataset they will not compute more than 4 columns; all subsequent columns of W and H will be zero.
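As a toy illustration of this kind of experiment, the r1d sketch shown in Section 2 can be run on a small synthetic binary dataset (the data below are random placeholders of ours, not the actual triangle images):

```python
import numpy as np

rng = np.random.default_rng(1)
basis = (rng.random((2, 64)) < 0.3).astype(float)         # two random binary "features"
mix = np.array([[1, 0], [0, 1], [1, 1]] * 4)[:10]         # each image uses one or both features
A = np.minimum(mix @ basis, 1.0).T                        # 64 pixels x 10 images

W, H = r1d(A, k=2, gamma_bar=4.0)                         # r1d sketch from Section 2
print(W.shape, H.shape)                                   # (64, 2) (10, 2)
print(np.linalg.norm(A - W @ H.T) / np.linalg.norm(A))    # relative residual of the 2-term NMF
```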
Figure 2. Three algorithms applied to the Frey face dataset (black indicates zeros): (a) NMF with divergence criterion, (b) our R1D algorithm for NMF, and (c) LSI.

Figure 2 shows a similar experiment on the Frey face dataset, which consists of 1965 registered face images of size 28×20. Again, the leading columns of W present the "eigenfaces" or "features" discovered in the dataset, and the corresponding column of H selects the dataset images that are classified as carrying the feature most prominently. R1D seems to be the most successful at finding features and classifying images; in each case, the column of W shows a particular highlight that distinguishes some images in the dataset from others. NMFDIV appears to be slightly inferior to R1D, while LSI is noticeably worse. In this experiment, the algorithms computed 30 basis vectors of the NMF. NMFDIV was allowed 500 iterations, which took 727 seconds; in contrast, LSI required 20 seconds and R1D took 47 seconds.

Additionally, R1D is effective at finding a sparse factorization. Table 1 shows the sparsity in the first few columns of W and H. The first column of W and H is fully dense, because the data matrix appears to be approximately rank one; its first singular value is dominant. Apart from this, the other columns of the NMF are sparse, and the sparsity can be controlled by the parameter γ̄ (here we have used γ̄ = 2). In contrast, both NMFDIV and LSI produce a dense factorization with very few values near zero in any column.

Table 1. Sparsity of the NMF computed by R1D (γ̄ = 2) on the Frey face dataset, reported as the fraction of zero values in the first few columns of W and H.

    Column         1      2      3      4      5
    Zeros in W   0.00   0.82   0.69   0.82   0.94
    Zeros in H   0.00   0.69   0.68   0.88   0.73

8. Text dataset test cases

In Tables 2 and 3 we compare LSI and R1D (with parameter γ̄ = 4) on the TDT Pilot Study corpus (TDT Study, 1997). The columns of each table correspond to the leading columns of W, with the leading terms per column displayed. The LSI results show that the topics are not properly separated: terms from different topics recur or are mixed. The columns in the R1D table are clearly identifiable topics, and the terms in each column are all correctly associated with the given topics. NMFDIV (and the other implementations of NMF in nmfpack) were not run on this dataset because they would exhaust all of the computer's memory. As noted earlier, R1D on text datasets is able to work efficiently with sparse matrices throughout its operation. R1D was able to compute 80 basis vectors of the TDT corpus in 171 seconds, whereas LSI required 269 seconds.

Table 2. Topics found by LSI on the TDT Pilot Study corpus (tf-idf normalization).

    Topic 1       Topic 2       Topic 3        Topic 4
    simpson       israel        israel         bosnian
    president     israeli       israeli        serbs
    clinton       bosnian       palestinian    serb
    police        peace         gaza           sarajevo
    house         serbs         arafat         bosnia
    israel        bosnia        plo            nato
    bosnian       serb          jerusalem      simpson
    haiti         sarajevo      peace          bihac
    united        palestinian   palestinians   air
    government    nato          simpson        troops

Table 3. Topics found by R1D on the TDT Pilot Study corpus (tf-idf normalization). Note that all words in a column do in fact refer to the same news event.

    Topic 1       Topic 2       Topic 3      Topic 4
    simpson       masters       korea        deng
    judge         pairings      korean       xiaoping
    ito           augusta       north        rong
    jury          amateur       kim          paramount
    defense       tournament    pyongyang    china
    trial         round         seoul        health
    angeles       golf          sung         chinese
    los           noted         nuclear      kong
    prosecution   players       south        hong
    case          georgia       communist    daughters
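The table captions refer to tf-idf normalization of the term-document matrix; the paper does not spell out the exact variant, so the sketch below assumes the common log-idf weighting with unit-norm columns (our assumption).

```python
import numpy as np

def tfidf(A):
    """Apply a standard tf-idf weighting to a term-document count matrix A (m terms x n docs)."""
    m, n = A.shape
    df = np.count_nonzero(A, axis=1)                 # document frequency of each term
    idf = np.log(n / np.maximum(df, 1))              # log inverse document frequency
    X = A * idf[:, None]
    norms = np.linalg.norm(X, axis=0)                # normalize each document (column)
    return X / np.maximum(norms, 1e-12)
```

The weighted matrix remains nonnegative, so it can be passed directly to R1D or to the SVD used by LSI.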
9. Conclusions

We have proposed an algorithm called R1D for nonnegative matrix factorization. It is based on greedy rank-one downdating according to an objective function, which is heuristically maximized. We have shown that the objective function is well suited for identifying topics in the ε-separable text model. Finally, we have shown that the algorithm performs well in practice.

This work raises several interesting open questions. First, the ε-separable text model seems rather too simple to describe real text, so it would be interesting to see whether the results generalize to more realistic models. A second question that arises is whether a result like Theorem 1 holds for the R1D algorithm itself; in other words, if the heuristic subroutine ApproxRankOneSubmatrix is applied to an ε-separable corpus, does it successfully identify a topic?

Here is an example of a difficulty. Suppose n → ∞ much faster than L. In this case, the document j with the highest norm will be one in which l_j is very close to L and in which one entry A(i, j) is very close to L while the rest are mostly zeros. This is because the maximizer of ||x||_2 subject to the constraint ||x||_1 = C occurs when one entry of x is equal to C and the rest are zero. It is likely that at least one such document will occur, regardless of the matrix P(·, ·), if n is sufficiently large. This document will then act as the seed for expanding M and N, but it may not be similar to any topic. This scenario can perhaps be prevented by a more intelligent selection of a starting vector for ApproxRankOneSubmatrix.

References

Asgarian, N., & Greiner, R. (2006). Using rank-1 biclusters to classify microarray data. Department of Computing Science, University of Alberta, Edmonton, AB, Canada.

Bergmann, S., Ihmels, J., & Barkai, N. (2003). Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67, 031902.

Biggs, M., Ghodsi, A., & Vavasis, S. (2008). Nonnegative matrix factorization via rank-one downdate. Available online at http://www.arxiv.org/abs/0805.0120.

Boutsidis, C., & Gallopoulos, E. (2007). SVD based initialization: A head start for nonnegative matrix factorization. In press.
Cohen, J., & Rothblum, U. (1993). Nonnegative ranks, decompositions and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190, 149–168.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–407.

Gillis, N. (2006). Approximation et sous-approximation de matrices par factorisation positive : algorithmes, complexité et applications. Master's thesis, Université Catholique de Louvain, Louvain-la-Neuve, Belgium. In French.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations, 3rd edition. Baltimore: Johns Hopkins University Press.

Gregory, D. A., & Pullman, N. J. (1983). Semiring rank: Boolean rank and nonnegative matrix rank. J. Combin. Inform. System Sci., 3, 223–233.

Hofmann, T. (1999). Probabilistic latent semantic analysis. UAI '99: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, July 30 - August 1, 1999 (pp. 289–296). Morgan Kaufmann.

Hoyer, P. (2000). nmfpack - Matlab code for NMF. http://www.hiit.fi/node/70.

Hoyer, P. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457–1469.

Kim, H., & Park, H. (2007). Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics (to appear).

Lee, D., & Seung, H. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5, 111–126.

Papadimitriou, C., Raghavan, P., Tamaki, H., & Vempala, S. (2000). Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61, 217–235.

Peeters, R. (2003). The maximum edge biclique problem is NP-complete. Discrete Applied Mathematics, 131, 651–654.

Stewart, G. W. (1993). On the early history of the singular value decomposition. SIAM Review, 35, 551–566.

TDT Study (1997). Topic detection and tracking pilot study. http://projects.ldc.upenn.edu/TDT/.

Vavasis, S. (2007). On the complexity of nonnegative matrix factorization. arXiv:0708.4149.