On Sampling-based Approximate Spectral Decomposition

Sanjiv Kumar (sanjivk@google.com), Google Research, 76 Ninth Avenue, New York, NY 10011
Mehryar Mohri (mohri@cs.nyu.edu), Courant Institute and Google Research, 251 Mercer Street, New York, NY 10012
Ameet Talwalkar (ameet@cs.nyu.edu), Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

This paper addresses the problem of approximate singular value decomposition (SVD) of large dense matrices, which arises naturally in many machine learning applications. We discuss two recently introduced sampling-based spectral decomposition techniques: the Nyström and Column-sampling methods. We present a theoretical comparison between the two methods, provide novel insights regarding their suitability for various applications, and then present experimental results motivated by this theory. Finally, we propose an efficient adaptive sampling technique to select informative columns from the original matrix. This novel technique outperforms standard sampling methods on a variety of datasets.

1. Introduction

Several common methods in machine learning, such as spectral clustering (Ng et al., 2001) and manifold learning (Platt, 2004), require the computation of singular values and singular vectors of symmetric positive semi-definite (SPSD) matrices. Similarly, kernel-based methods can be sped up by using low-rank approximations of SPSD kernel matrices, which can be achieved via spectral decomposition (Williams & Seeger, 2000; Zhang et al., 2008).
The computational complexity of the SVD of an $n \times n$ SPSD matrix is $O(n^3)$, which presents a major challenge in large-scale applications. Large-scale datasets have been used in several recent studies (Platt, 2004; Talwalkar et al., 2008); the size of such datasets can be in the order of millions, and the $O(n^3)$ complexity is infeasible at this scale. Many of the aforementioned applications require only some of the top or bottom singular values and singular vectors. If the input matrix is sparse, one can use efficient iterative methods, e.g., Jacobi or Arnoldi. However, for large dense matrices that arise naturally in many applications, e.g., manifold learning and kernel methods, iterative techniques are also quite expensive. In fact, for many real-world problems, even storing the full dense matrix becomes infeasible; for example, an input set containing 1 million points requires storage of a 4 TB SPSD matrix.

Sampling-based methods provide a powerful alternative for approximate spectral decomposition. They operate on a small part of the original matrix and often eliminate the need to store the full matrix. In the last decade, two sampling-based approximation techniques have been introduced (Frieze et al., 1998; Williams & Seeger, 2000). However, their similarities and relative advantages have not been well studied, and there exist no clear guidelines on which method to use for specific applications. This work introduces a theoretical framework to compare these methods and provides the first exhaustive empirical comparison on a variety of datasets. The analysis and subsequent experiments reveal the counter-intuitive behavior of these methods for different tasks.

Another important component of sampling-based approximations is the sampling strategy used to select informative columns of the original matrix. Among the techniques that sample columns according to some fixed distribution, uniform sampling has been shown to be quite effective in practice (Williams & Seeger, 2000; Fowlkes et al., 2004; Kumar et al., 2009), and a bound on the approximation error has also been derived (Kumar et al., 2009). As an improvement over the fixed-distribution techniques, an adaptive, error-driven sampling technique with better theoretical approximation accuracy was recently introduced (Deshpande et al., 2006). However, this technique requires the full matrix to be available at each step and is thus infeasible for large matrices. In the second part of this work, we propose a simple and efficient algorithm for adaptive sampling that uses only a small submatrix at each step. In comparison to non-adaptive sampling techniques, we show that the proposed adaptive sampling provides more accurate low-rank approximations, particularly for higher ranks.

2. Approximate spectral decomposition

Two different sampling-based methods, the Nyström and Column-sampling methods, have recently been introduced for spectral decomposition of large matrices using subsets of their columns. Let $G$ be an SPSD matrix of size $n \times n$ with spectral decomposition $G = U_G \Sigma_G U_G^\top$, where $\Sigma_G$ contains the singular values of $G$ and $U_G$ the associated singular vectors. Suppose we randomly sample $l \ll n$ columns of $G$ uniformly without replacement (other sampling schemes are possible; see Section 4). Let $C$ be the $n \times l$ matrix of these sampled columns, and $W$ the $l \times l$ matrix consisting of the intersection of these $l$ columns with the corresponding $l$ rows of $G$. Since $G$ is SPSD, $W$ is also SPSD. Without loss of generality, we can rearrange the columns and rows of $G$ such that

$$G = \begin{bmatrix} W & G_{21}^\top \\ G_{21} & G_{22} \end{bmatrix} \qquad \text{and} \qquad C = \begin{bmatrix} W \\ G_{21} \end{bmatrix}. \tag{1}$$

The approximation techniques discussed next use the SVD of $W$ and of $C$ to generate approximations of $U_G$ and $\Sigma_G$.

2.1. Nyström method

The Nyström method was presented in (Williams & Seeger, 2000) to speed up kernel machines and has been used in applications ranging from manifold learning to image segmentation (Platt, 2004; Fowlkes et al., 2004; Talwalkar et al., 2008). The Nyström method uses $W$ and $C$ from (1) to approximate $G$ as

$$G \approx \tilde{G} = C W^{-1} C^\top. \tag{2}$$

If $W$ is not invertible, its pseudo-inverse can be used (Drineas & Mahoney, 2005). If $W = U_w \Sigma_w U_w^\top$, the approximate singular values and singular vectors of $G$ are (Williams & Seeger, 2000):

$$\Sigma_{nys} = \frac{n}{l}\,\Sigma_w \qquad \text{and} \qquad U_{nys} = \sqrt{\frac{l}{n}}\, C\, U_w \Sigma_w^{-1}. \tag{3}$$

If $k \ll l$ singular values and singular vectors are needed, the run time of this algorithm is $O(l^3 + nlk)$: $l^3$ for the SVD of $W$ and $nlk$ for the multiplication with $C$.

2.2. Column-sampling method

The Column-sampling method was introduced to approximate the SVD of any rectangular matrix (Frieze et al., 1998). It approximates the spectral decomposition of $G$ by using the SVD of $C$ directly. If $C = U_c \Sigma_c V_c^\top$ and we assume uniform sampling, the approximate singular values and singular vectors of $G$ are given as:

$$\Sigma_{col} = \sqrt{\frac{n}{l}}\,\Sigma_c \qquad \text{and} \qquad U_{col} = U_c = C\, V_c \Sigma_c^{-1}. \tag{4}$$

The runtime of the Column-sampling method is dominated by the SVD of $C$. Even when only $k$ singular values and singular vectors are required, the algorithm takes $O(nl^2)$ time and is thus more expensive than Nyström. Often, in practice, the SVD of $C^\top C$ ($O(l^3)$) is performed instead of the SVD of $C$, but this is still substantially more expensive than the Nyström method due to the additional $O(nl^2)$ cost of computing $C^\top C$.
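To make the two constructions concrete, the following is a minimal NumPy sketch of both approximations as given in (2)-(4). It is our own illustration rather than the authors' implementation (the paper's experiments were run in Matlab); the function names, the tolerance used to form the pseudo-inverse of $W$, and the use of uniform sampling without replacement are our assumptions.

```python
import numpy as np

def sample_columns(G, l, rng):
    """Uniformly sample l columns of SPSD G without replacement; return C and W."""
    idx = rng.choice(G.shape[0], size=l, replace=False)
    C = G[:, idx]                      # n x l sampled columns
    W = C[idx, :]                      # l x l intersection block
    return C, W

def nystrom(G, l, seed=0):
    """Nystrom approximation (eqs. 2-3): spectrum of W, extrapolated through C."""
    n = G.shape[0]
    C, W = sample_columns(G, l, np.random.default_rng(seed))
    sw, Uw = np.linalg.eigh(W)         # W is SPSD, so eigh gives its SVD
    sw, Uw = sw[::-1], Uw[:, ::-1]     # descending order
    keep = sw > 1e-12                  # pseudo-inverse if W is singular
    sw, Uw = sw[keep], Uw[:, keep]
    s_nys = (n / l) * sw                                 # eq. (3)
    U_nys = np.sqrt(l / n) * (C @ Uw) / sw               # eq. (3); not orthonormal
    G_tilde = C @ Uw @ np.diag(1.0 / sw) @ Uw.T @ C.T    # eq. (2), with W^+
    return s_nys, U_nys, G_tilde

def column_sampling(G, l, seed=0):
    """Column-sampling approximation (eq. 4): SVD of C directly, O(n l^2)."""
    n = G.shape[0]
    C, _ = sample_columns(G, l, np.random.default_rng(seed))
    Uc, sc, _ = np.linalg.svd(C, full_matrices=False)
    return np.sqrt(n / l) * sc, Uc     # U_col = U_c is orthonormal
```

Note that `U_nys` is not orthonormal while `U_col` is; this distinction is central to the comparison that follows.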
3. Nyström vs Column-sampling

Given that two sampling-based techniques exist to approximate the SVD of SPSD matrices, we pose a natural question: which method should one use to approximate singular values, singular vectors and low-rank approximations? We first analyze the form of these approximations and then, in Section 3.3, empirically evaluate their performance on a variety of datasets.

3.1. Singular values and singular vectors

As shown in (3) and (4), the singular values of $G$ are approximated as the scaled singular values of $W$ and $C$, respectively. The scaling terms are quite rudimentary and are primarily meant to compensate for the 'small sample size' effect in both approximations. However, the form of the singular vectors is more interesting. The Column-sampling singular vectors ($U_{col}$) are orthonormal since they are the singular vectors of $C$. In contrast, the Nyström singular vectors ($U_{nys}$) are obtained by extrapolating the singular vectors of $W$ as shown in (3), and they are not orthonormal: it is easy to verify that $U_{nys}^\top U_{nys} \neq I_l$, where $I_l$ is the identity matrix of size $l$. As we show in Section 3.3, this adversely affects the accuracy of singular vector approximation from the Nyström method.

It is possible to orthonormalize the Nyström singular vectors by using QR decomposition. Since $U_{nys} = \sqrt{l/n}\, C U_w \Sigma_w^{-1}$, where $U_w$ is orthonormal and $\Sigma_w$ is diagonal, the QR decomposition simply creates an orthonormal span of $C$ rotated by $U_w$. However, the complexity of the QR decomposition of $U_{nys}$ is the same as that of the SVD of $C$, so the computational cost of orthonormalizing $U_{nys}$ would nullify the computational benefit of Nyström over Column-sampling.

3.2. Low-rank approximation

Several studies have shown that the accuracy of low-rank approximations of kernel matrices is tied to the performance of kernel-based learning algorithms (Williams & Seeger, 2000; Talwalkar et al., 2008; Zhang et al., 2008). Hence, accurate low-rank approximations are of great practical interest in machine learning. Suppose we are interested in approximating $G$ with a matrix of rank $k \ll n$, denoted $G_k$. It is well known that the $G_k$ minimizing the Frobenius norm of the error, $\|G - G_k\|_F$, is given by

$$\hat{G}_k = U_{G,k}\Sigma_{G,k}U_{G,k}^\top = U_{G,k}U_{G,k}^\top G = G\,U_{G,k}U_{G,k}^\top, \tag{5}$$

where $U_{G,k}$ contains the singular vectors of $G$ corresponding to the top $k$ singular values in $\Sigma_{G,k}$. We refer to $U_{G,k}\Sigma_{G,k}U_{G,k}^\top$ as Spectral Reconstruction, since it uses both the singular values and the singular vectors, and to $U_{G,k}U_{G,k}^\top G$ as Matrix Projection, since it uses only singular vectors to compute the projection of $G$ onto the space spanned by the vectors $U_{G,k}$. These two low-rank approximations are equal only if $\Sigma_{G,k}$ and $U_{G,k}$ contain the true singular values and singular vectors of $G$. Since this is not the case for approximate methods such as Nyström and Column-sampling, the two measures generally give different errors, and we analyze each measure separately in the following sections.
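Because the distinction between the two measures in (5) drives the rest of this section, here is a small illustrative check (our own sketch, not from the paper): for the exact rank-$k$ SVD the two forms coincide, whereas substituting approximate singular values and vectors generally breaks the equality.

```python
import numpy as np

def optimal_rank_k(G, k):
    """Optimal rank-k approximation of SPSD G (eq. 5)."""
    s, U = np.linalg.eigh(G)
    s, U = s[::-1], U[:, ::-1]                  # descending spectrum
    Uk, sk = U[:, :k], s[:k]
    spectral = Uk @ np.diag(sk) @ Uk.T          # spectral reconstruction
    projection = Uk @ (Uk.T @ G)                # matrix projection
    assert np.allclose(spectral, projection)    # equal only for the exact SVD
    return spectral
```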
3.2.1. Matrix projection

For Column-sampling, using (4), the low-rank approximation via matrix projection is

$$\tilde{G}^{col}_k = U_{col,k}U_{col,k}^\top G = U_{c,k}U_{c,k}^\top G = C\,(C^\top C)_k^{-1}\, C^\top G, \tag{6}$$

where $U_{x,k}$ denotes the first $k$ vectors of $U_x$ and $(C^\top C)_k = V_{c,k}\Sigma_{c,k}^2 V_{c,k}^\top$. Clearly, if $k = l$, $(C^\top C)_k = C^\top C$. Similarly, using (3), the Nyström matrix projection is

$$\tilde{G}^{nys}_k = U_{nys,k}U_{nys,k}^\top G = C\left(\frac{l}{n}\,W_k^{-2}\right) C^\top G, \tag{7}$$

where $W_k = U_{w,k}\Sigma_{w,k}U_{w,k}^\top$, and if $k = l$, $W_k = W$.

As shown in (6) and (7), the two methods have similar expressions for matrix projection, except that $C^\top C$ is replaced by a scaled $W^2$. Furthermore, the scaling term appears only in the Nyström expression. We now present Theorem 1 and Observations 1 and 2, which provide further insights about these two methods.

Theorem 1. The Column-sampling and Nyström matrix projections are of the form $U_c R U_c^\top G$, where $R \in \mathbb{R}^{l \times l}$ is SPSD. Further, Column-sampling gives the lowest reconstruction error (measured in $\|\cdot\|_F$) among all such approximations if $k = l$.

Proof. From (6), it is easy to see that

$$\tilde{G}^{col}_k = U_{c,k}U_{c,k}^\top G = U_c R^{col} U_c^\top G, \qquad R^{col} = \begin{bmatrix} I_k & 0 \\ 0 & 0 \end{bmatrix}. \tag{8}$$

Similarly, from (7) we can derive

$$\tilde{G}^{nys}_k = U_c R^{nys} U_c^\top G, \qquad R^{nys} = Y\,\Sigma_{w,k}^{-2}\,Y^\top, \quad Y = \sqrt{\tfrac{l}{n}}\;\Sigma_c V_c^\top U_{w,k}. \tag{9}$$

Note that both $R^{col}$ and $R^{nys}$ are SPSD matrices. Furthermore, if $k = l$, $R^{col} = I$. Let $E$ be the (squared) reconstruction error for an approximation of the form $U_c R U_c^\top G$, where $R$ is an arbitrary SPSD matrix. Hence, when $k = l$, the difference in reconstruction error between the generic and the Column-sampling approximations is

$$\begin{aligned}
E - E^{col} &= \|G - U_c R U_c^\top G\|_F^2 - \|G - U_c U_c^\top G\|_F^2 \\
&= \operatorname{Tr}\!\big[G^\top (I - U_c R U_c^\top)^\top (I - U_c R U_c^\top) G\big] - \operatorname{Tr}\!\big[G^\top (I - U_c U_c^\top)^\top (I - U_c U_c^\top) G\big] \\
&= \operatorname{Tr}\!\big[G^\top (U_c R^2 U_c^\top - 2 U_c R U_c^\top + U_c U_c^\top) G\big] \\
&= \operatorname{Tr}\!\big[\big((R - I) U_c^\top G\big)^\top \big((R - I) U_c^\top G\big)\big] \;\ge\; 0. \tag{10}
\end{aligned}$$

We used the fact that $U_c^\top U_c = I$, and that $A^\top A$ is SPSD for any matrix $A$.

Observation 1. For $k = l$, matrix projection for Column-sampling reconstructs $C$ exactly. This can be seen by block-decomposing $G$ as $G = [\,C \;\; \bar{C}\,]$, where $\bar{C} = [\,G_{21} \;\; G_{22}\,]^\top$, and using (6):

$$\tilde{G}^{col}_l = C(C^\top C)^{-1} C^\top G = \big[\,C \;\;\; C (C^\top C)^{-1} C^\top \bar{C}\,\big]. \tag{11}$$

Observation 2. For $k = l$, the span of the orthogonalized Nyström singular vectors equals the span of $U_{col}$, as discussed in Section 3.1. Hence, matrix projection is identical for Column-sampling and Orthonormal Nyström for $k = l$.

From an application point of view, matrix projection approximations tend to be more accurate than the spectral reconstruction approximations discussed in the next section (results omitted due to space constraints). However, these low-rank approximations are not necessarily symmetric and require storage of all entries of $G$. For large-scale problems, this storage requirement may be inefficient or even infeasible.
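As an illustration of (6) and (7), the sketch below computes both rank-$k$ matrix projection errors from the same sampled $C$ and $W$. It is our own example and assumes $W_k$ is well conditioned; a pseudo-inverse guard, as in the earlier sketch, would be needed otherwise.

```python
import numpy as np

def matrix_projection_errors(G, C, W, k):
    """||G - G~_k||_F for the Column-sampling (eq. 6) and Nystrom (eq. 7) projections."""
    n, l = C.shape
    # Column-sampling: orthogonal projection onto the top-k left singular vectors of C
    Uc, _, _ = np.linalg.svd(C, full_matrices=False)
    G_col = Uc[:, :k] @ (Uc[:, :k].T @ G)
    # Nystrom: (l/n) C W_k^{-2} C^T G, cf. eq. (7)
    sw, Uw = np.linalg.eigh(W)
    sw, Uw = sw[::-1], Uw[:, ::-1]
    Wk_inv2 = Uw[:, :k] @ np.diag(1.0 / sw[:k] ** 2) @ Uw[:, :k].T
    G_nys = (l / n) * C @ (Wk_inv2 @ (C.T @ G))
    fro = lambda M: np.linalg.norm(M, "fro")
    return fro(G - G_col), fro(G - G_nys)
```

Both projections require multiplying against all of $G$, which is exactly the storage concern raised above.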
3.2.2. Spectral reconstruction

Using (3), the Nyström spectral reconstruction is

$$\tilde{G}^{nys}_k = U_{nys,k}\Sigma_{nys,k}U_{nys,k}^\top = C\, W_k^{-1} C^\top. \tag{12}$$

When $k = l$, this approximation perfectly reconstructs three blocks of $G$, and $G_{22}$ is approximated by the Schur complement of $W$ in $G$ (Williams & Seeger, 2000):

$$\tilde{G}^{nys}_l = C W^{-1} C^\top = \begin{bmatrix} W & G_{21}^\top \\ G_{21} & G_{21} W^{-1} G_{21}^\top \end{bmatrix}. \tag{13}$$

The Column-sampling spectral reconstruction has a form similar to (12):

$$\tilde{G}^{col}_k = U_{col,k}\Sigma_{col,k}U_{col,k}^\top = C\left(\sqrt{\tfrac{l}{n}}\,(C^\top C)_k^{1/2}\right)^{-1} C^\top. \tag{14}$$

In contrast with matrix projection, the scaling term now appears in the Column-sampling reconstruction. To analyze the two approximations, we consider an alternative characterization based on the fact that, since $G$ is SPSD, there exists an $X \in \mathbb{R}^{m \times n}$ such that $G = X^\top X$. Similar to (Drineas & Mahoney, 2005), we define a zero-one sampling matrix $S \in \mathbb{R}^{n \times l}$ that selects $l$ columns from $G$, i.e., $C = GS$; each column of $S$ has exactly one non-zero entry. Further, $W = S^\top G S = (XS)^\top XS = X'^\top X'$, where $X' \in \mathbb{R}^{m \times l}$ contains $l$ sampled columns of $X$ and $X' = U_{X'}\Sigma_{X'}V_{X'}^\top$ is the SVD of $X'$. We use these definitions to present Theorems 2 and 3.

Theorem 2. Column-sampling and Nyström spectral reconstructions are of the form $X^\top U_{X',k} Z U_{X',k}^\top X$, where $Z \in \mathbb{R}^{k \times k}$ is SPSD. Further, among all approximations of this form, neither the Column-sampling nor the Nyström approximation is optimal (in $\|\cdot\|_F$).

Proof. Let $\alpha = \sqrt{n/l}$. Starting from (14) and expressing $C$ and $W$ in terms of $X$ and $S$, we have

$$\tilde{G}^{col}_k = \alpha\, G S \big((S^\top G^2 S)_k^{1/2}\big)^{-1} S^\top G = X^\top U_{X',k}\, Z^{col}\, U_{X',k}^\top X, \tag{15}$$

where $Z^{col} = \alpha\,\Sigma_{X'} V_{X'}^\top V_{c,k}\Sigma_{c,k}^{-1} V_{c,k}^\top V_{X'} \Sigma_{X'}$. Similarly, from (12) we have

$$\tilde{G}^{nys}_k = G S (S^\top G S)_k^{-1} S^\top G = X^\top X' (X'^\top X')_k^{-1} X'^\top X = X^\top U_{X',k} U_{X',k}^\top X, \tag{16}$$

so clearly $Z^{nys} = I$. Next, we analyze the error $E$ for an arbitrary SPSD $Z$, which yields the approximation $G^Z_k$:

$$E = \|G - G^Z_k\|_F^2 = \big\|X^\top (I - U_{X',k} Z U_{X',k}^\top) X\big\|_F^2. \tag{17}$$

Let $X = U_X \Sigma_X V_X^\top$ and $Y = U_X^\top U_{X',k}$. Expanding (17) using $XX^\top = U_X\Sigma_X^2 U_X^\top$ and $U_X^\top(I - U_{X',k}ZU_{X',k}^\top)U_X = I - YZY^\top$ gives

$$E = \operatorname{Tr}\big(\Sigma_X^4\big) - 2\operatorname{Tr}\big(\Sigma_X^2\, Y Z Y^\top \Sigma_X^2\big) + \operatorname{Tr}\big(Y Z Y^\top \Sigma_X^2\, Y Z Y^\top \Sigma_X^2\big). \tag{18}$$

To find $\hat{Z}$, the $Z$ that minimizes (18), we set $\partial E/\partial Z = -2\,Y^\top \Sigma_X^4 Y + 2\,(Y^\top \Sigma_X^2 Y) Z (Y^\top \Sigma_X^2 Y) = 0$ and solve for $Z$:

$$\hat{Z} = (Y^\top \Sigma_X^2 Y)^{-1} (Y^\top \Sigma_X^4 Y)(Y^\top \Sigma_X^2 Y)^{-1}.$$

Clearly, $\hat{Z}$ differs in general from $Z^{col}$ and $Z^{nys}$, and since $\Sigma_X^2 = \Sigma_G$, $\hat{Z}$ depends on the spectrum of $G$.

While Theorem 2 shows that the optimal approximation of this form is data-dependent and may differ from both the Nyström and the Column-sampling approximations, Theorem 3 reveals that in certain instances the Nyström method is optimal. In contrast, the Column-sampling method enjoys no such guarantee.

Theorem 3. Suppose $r = \operatorname{rank}(G) \le k \le l$ and $\operatorname{rank}(W) = r$. Then the Nyström approximation is exact for spectral reconstruction. In contrast, Column-sampling is exact iff $W = \sqrt{l/n}\,(C^\top C)^{1/2} = \frac{1}{\alpha}(C^\top C)^{1/2}$. Furthermore, when this very specific condition holds, Column-sampling trivially reduces to the Nyström method.

Proof. Since $G = X^\top X$, $\operatorname{rank}(G) = \operatorname{rank}(X) = r$. Similarly, $W = X'^\top X'$ implies $\operatorname{rank}(X') = r$. Thus the columns of $X'$ span the columns of $X$, and $U_{X',r}$ is an orthonormal basis for the column space of $X$, so that $(I - U_{X',r}U_{X',r}^\top)X = 0$. Since $k \ge r$, from (16) we have

$$\|G - \tilde{G}^{nys}_k\|_F = \big\|X^\top (I - U_{X',r} U_{X',r}^\top) X\big\|_F = 0. \tag{19}$$

To prove the second part of the theorem, note that $\operatorname{rank}(C) = r$. Thus $C = U_{c,r}\Sigma_{c,r}V_{c,r}^\top$ and $(C^\top C)_k^{1/2} = (C^\top C)^{1/2} = V_{c,r}\Sigma_{c,r}V_{c,r}^\top$ since $k \ge r$. If $W = \frac{1}{\alpha}(C^\top C)^{1/2}$, then the Column-sampling and Nyström approximations are identical and hence exact. Conversely, to exactly reconstruct $G$, Column-sampling must in particular reconstruct $C$ exactly; using $C^\top = [\,W \;\; G_{21}^\top\,]$ in (14):

$$\begin{aligned}
\tilde{G}^{col}_k = G \;&\Rightarrow\; \alpha\, C \big((C^\top C)^{1/2}\big)^{-1} W = C \\
&\Rightarrow\; \alpha\, U_{c,r} V_{c,r}^\top W = U_{c,r}\Sigma_{c,r}V_{c,r}^\top \\
&\Rightarrow\; \alpha\, V_{c,r} V_{c,r}^\top W = V_{c,r}\Sigma_{c,r}V_{c,r}^\top \qquad (20) \\
&\Rightarrow\; W = \tfrac{1}{\alpha}\, (C^\top C)^{1/2}. \qquad (21)
\end{aligned}$$

In (20) we use $U_{c,r}^\top U_{c,r} = I$, while (21) follows since $V_{c,r}V_{c,r}^\top$ is an orthogonal projection onto the span of the rows of $C$ and the columns of $W$ lie within this span, implying $V_{c,r}V_{c,r}^\top W = W$.
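The two spectral reconstructions (12) and (14) admit a compact sketch; note that, unlike matrix projection, neither touches the unsampled entries of $G$. This is our own illustration, under the same well-conditioned-$W_k$ assumption as before; for (14) we use the algebraic simplification $C\,V_{c,k}\Sigma_{c,k}^{-1}V_{c,k}^\top C^\top = U_{c,k}\Sigma_{c,k}U_{c,k}^\top$, which follows from $C V_{c,k} = U_{c,k}\Sigma_{c,k}$.

```python
import numpy as np

def spectral_reconstructions(C, W, k, n):
    """Rank-k spectral reconstructions built only from C and W.

    Nystrom (eq. 12):         C W_k^{-1} C^T
    Column-sampling (eq. 14): sqrt(n/l) U_{c,k} Sigma_{c,k} U_{c,k}^T
    """
    l = C.shape[1]
    sw, Uw = np.linalg.eigh(W)
    sw, Uw = sw[::-1], Uw[:, ::-1]     # descending order
    G_nys = C @ Uw[:, :k] @ np.diag(1.0 / sw[:k]) @ Uw[:, :k].T @ C.T
    Uc, sc, _ = np.linalg.svd(C, full_matrices=False)
    G_col = np.sqrt(n / l) * Uc[:, :k] @ np.diag(sc[:k]) @ Uc[:, :k].T
    return G_nys, G_col
```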
3.3. Empirical comparison

To test the accuracy of singular values/vectors and low-rank approximations for the different methods, we used several kernel matrices arising in different applications, as described in Table 1. We worked with datasets containing fewer than ten thousand points so that we could compare against the exact SVD. We fixed $k$ to be 100 in all the experiments, which captures more than 90% of the spectral energy for each dataset.

Table 1. Description of the datasets used in our experiments (Sim et al., 2002; LeCun & Cortes, 2009; Talwalkar et al., 2008). 'n' denotes the number of points and 'd' the number of features in input space.

Dataset    Data      n     d     Kernel
PIE-2.7K   faces     2731  2304  linear
PIE-7K     faces     7412  2304  linear
MNIST      digits    4000  784   linear
ESS        proteins  4728  16    RBF

For singular values, we measured the percentage accuracy of the approximate singular values with respect to the exact ones. For a fixed $l$, we performed 10 trials, selecting columns uniformly at random from $G$. Figure 1(a) shows the difference in mean percentage accuracy between the two methods for $l = 600$. The Column-sampling method generates more accurate singular values than the Nyström method for the top 50 singular values, which contain more than 80% of the spectral energy for each of the datasets. A similar trend was observed for other values of $l$.

For singular vectors, the accuracy was measured by the dot product, i.e., the cosine of the principal angles between the exact and approximate singular vectors. Figure 1(b) shows the difference in mean accuracy between the Nyström and Column-sampling methods. The top 100 singular vectors were all better approximated by Column-sampling for all datasets, and this trend was observed for other values of $l$ as well. This result is not surprising, since the singular vectors approximated by the Nyström method are not orthonormal.

Figure 1. Differences in accuracy between Nyström and Column-sampling. Values above zero indicate better performance of Nyström and vice versa. (a) Top 100 singular values with l = 600. (b) Top 100 singular vectors with l = 600. (c) Matrix projection accuracy for k = 100. (d) Spectral reconstruction accuracy for k = 100.

Next, we compared the low-rank approximations generated by the two methods using matrix projection and spectral reconstruction, as described in Sections 3.2.1 and 3.2.2, respectively. We measured the accuracy of reconstruction relative to the optimal rank-$k$ approximation $\hat{G}_k$ as:

$$\text{relative accuracy} = \frac{\|G - \hat{G}_k\|_F}{\|G - \tilde{G}_k^{nys/col}\|_F}. \tag{22}$$

The relative accuracy will approach one for good approximations.
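For matrices small enough to factor exactly, (22) can be computed directly; the evaluation helper below is our own sketch of how the relative accuracies in Figures 1 and 2 are defined.

```python
import numpy as np

def relative_accuracy(G, G_approx, k):
    """Eq. (22): ||G - G^_k||_F / ||G - G~_k||_F; approaches 1 for good G~_k."""
    s, U = np.linalg.eigh(G)                       # exact decomposition: O(n^3),
    s, U = s[::-1], U[:, ::-1]                     # feasible only for small benchmarks
    G_k = U[:, :k] @ np.diag(s[:k]) @ U[:, :k].T   # optimal rank-k approximation
    return np.linalg.norm(G - G_k, "fro") / np.linalg.norm(G - G_approx, "fro")
```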
Results are shown in Figure 1(c) and (d). As motivated by Theorem 1 and consistent with the superior performance of Column-sampling in approximating singular values and vectors, Column-sampling generates better reconstructions via matrix projection. This was observed not only for $l = k$ but also for other values of $l$. In contrast, the Nyström method produces superior results for spectral reconstruction. These results are somewhat surprising given the relatively poor quality of the singular values/vectors for the Nyström method, but they are in agreement with the consequences of Theorem 3. We also note that, for both reconstruction measures, the methods that exactly reconstruct subsets of the original matrix when $k = l$ (see (11) and (13)) generate better approximations. Interestingly, these are also the two methods that do not contain scaling terms (see (6) and (12)).

Further, as stated in Theorem 2, the optimal spectral reconstruction approximation is tied to the spectrum of $G$. Our results suggest that the relative accuracies of Nyström and Column-sampling spectral reconstructions are also tied to this spectrum. When we analyzed spectral reconstruction performance on a sparse kernel matrix with a slowly decaying spectrum, we found that the Nyström and Column-sampling approximations were roughly equivalent ('DEXT' in Figure 1(d)). This result contrasts with the results for the dense kernel matrices with exponentially decaying spectra arising from the other datasets used in the experiments.

One factor that impacts the accuracy of the Nyström method for some tasks is the non-orthonormality of its singular vectors (Section 3.1). When they are orthonormalized, the error in the resulting singular vectors is reduced (not shown), and the corresponding Nyström matrix projection error is reduced considerably, as shown in Figure 2(a). Further, as discussed in Observation 2 and seen in Figure 2(a) when $l = 100$, Orthonormal Nyström is identical to Column-sampling when $k = l$. However, since orthonormalization is computationally costly, it is avoided in practice. Moreover, the accuracy of Orthonormal Nyström spectral reconstruction is actually worse than that of the standard Nyström approximation, as shown in Figure 2(b). This surprising result can be attributed to the fact that orthonormalization of the singular vectors leads to the loss of some of the unique properties described in Section 3.2.2. For instance, Theorem 3 no longer holds and the scaling terms do not cancel out, i.e., $\tilde{G}^{nys}_k \neq C W_k^{-1} C^\top$.

Even though matrix projection tends to produce more accurate approximations, spectral reconstruction is of great practical interest for large-scale problems since, unlike matrix projection, it does not use all entries of $G$ to produce a low-rank approximation. Thus, we further expand upon the results from Figure 1(d). We first tested the accuracy of spectral reconstruction for the two methods for varying values of $k$ and a fixed $l$. We found that the Nyström method outperforms Column-sampling across all tested values of $k$, as shown in Figure 2(c). Next, we addressed another basic issue: how many columns do we need to obtain reasonable reconstruction accuracy? For very large matrices ($n$ of the order of $10^6$), one would wish to select only a small fraction of the columns. Hence, we performed an experiment in which we fixed $k$ and varied the size of the dataset ($n$). For each $n$, we performed a grid search over $l$ to find the minimal $l$ for which the relative accuracy of Nyström spectral reconstruction was at least 75%. Figure 2(d) shows that the required $l$ does not grow linearly with $n$: the $l/n$ ratio actually decreases quickly as $n$ increases, lending support to the use of sampling-based algorithms for large-scale data.

Figure 2. (a) Difference in matrix projection accuracy between Column-sampling and Orthonormal Nyström (k = 100); values above zero indicate better performance of Column-sampling. (b) Difference in spectral reconstruction accuracy between Nyström and Orthonormal Nyström (k = 100); values above zero indicate better performance of Nyström. (c) Difference in spectral reconstruction accuracy between Nyström and Column-sampling for various k and fixed l = 600; values above zero indicate better performance of Nyström. (d) Number of columns needed to achieve 75% relative accuracy for Nyström spectral reconstruction as a function of n.

4. Sampling Techniques

In Section 3, we focused on uniformly sampled columns to create low-rank approximations. Since the approximation techniques operate on only a small subset of $G$, namely $C$, the selection of columns can significantly influence the accuracy of the approximation. In this section we discuss various sampling options that aim to select informative columns from $G$. The most common sampling techniques select columns using a fixed probability distribution, with uniform sampling being the most basic of these non-adaptive approaches. Alternatively, the $i$-th column can be sampled non-uniformly with a weight proportional either to the corresponding diagonal element $G_{ii}$ (diagonal) or to the squared $L_2$ norm of the column (column-norm). A sketch of these fixed distributions appears below.
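The following is a minimal sketch of the three fixed distributions; the scheme names as keyword strings are our own convention. Note that computing the column-norm weights already requires a full pass over $G$, which is one practical attraction of uniform sampling.

```python
import numpy as np

def sampling_distribution(G, scheme="uniform"):
    """Fixed, non-adaptive distributions over the columns of SPSD G."""
    n = G.shape[0]
    if scheme == "uniform":
        p = np.ones(n)
    elif scheme == "diagonal":
        p = np.diag(G).copy()            # p_i proportional to G_ii (>= 0 for SPSD G)
    elif scheme == "column-norm":
        p = np.sum(G * G, axis=0)        # p_i proportional to squared L2 column norm
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return p / p.sum()

# Example: draw l columns without replacement from the column-norm distribution.
# idx = np.random.default_rng(0).choice(n, size=l, replace=False,
#                                       p=sampling_distribution(G, "column-norm"))
```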
The non-adaptive sampling methods have been combined with SVD approximation algorithms to bound the reconstruction error (Drineas & Mahoney, 2005; Kumar et al., 2009). Interestingly, the non-uniform approaches are often outperformed by uniform sampling for dense matrices (Kumar et al., 2009).

In contrast to the non-adaptive sampling methods, an adaptive sampling technique with better theoretical approximation accuracy (adaptive-full) was proposed in (Deshpande et al., 2006). It requires a full pass through $G$ in each iteration. Another interesting technique selects informative columns based on k-means clustering and has been shown to give good empirical accuracy (Zhang et al., 2008). However, both methods are computationally inefficient for large $G$.

4.1. Proposed Adaptive Sampling Method

Instead of sampling all $l$ columns from a fixed distribution, adaptive sampling alternates between selecting a set of columns and updating the distribution over all the columns. Starting with an initial distribution over the columns, $s < l$ columns are chosen to form a set $C'$. The probabilities are then updated as a function of the previously chosen columns, and $s$ new columns are sampled and incorporated into $C'$. This process is repeated until $l$ columns have been selected.

We propose a simple sampling technique (adaptive-partial) that incorporates the advantages of adaptive sampling while avoiding the computational and storage burdens of the technique in (Deshpande et al., 2006). At each iterative step, we measure the reconstruction error for each row of $C'$, and the distribution over the corresponding columns of $G$ is updated in proportion to this error. Unlike (Deshpande et al., 2006), we compute the error for $C'$, which is much smaller than $G$, thus avoiding the $O(n^2)$ computation. As described in (13), if $k'$ is fixed to be the number of columns in $C'$, this leads to $C'_{nys} = C'$, i.e., perfect reconstruction of $C'$. Hence, one must choose a smaller $k'$ to generate non-zero reconstruction errors from which probabilities can be updated (we used $k'$ = (# columns in $C'$)/2 in our experiments). One artifact of using a $k'$ smaller than the rank of $C'$ is that all columns of $G$ then have a non-zero probability of being selected, which could lead to the selection of previously selected columns in the next iteration; however, a sampling-without-replacement strategy alleviates this problem. Working with $C'$ instead of $G$ to iteratively compute errors makes this algorithm significantly more efficient than that of (Deshpande et al., 2006), as each iteration takes $O(nlk + l^3)$ time and requires storage of at most $l$ columns of $G$. The details of the proposed sampling technique are outlined in Figure 3, and a NumPy sketch of the same procedure follows the figure.

Input: n x n SPSD matrix G, number of columns to be chosen (l), initial probability distribution over [1..n] (P0), number of columns selected at each iteration (s)
Output: l indices corresponding to columns of G

Sample-Adaptive(G, n, l, P0, s)
 1  R <- set of s indices sampled according to P0
 2  t <- ceil(l/s) - 1                ▷ number of iterations
 3  for i in [1..t] do
 4      P_i <- Update-Probability-Partial(R)
 5      R_i <- set of s indices sampled according to P_i
 6      R <- R union R_i
 7  return R

Update-Probability-Partial(R)
 1  C' <- columns of G corresponding to indices in R
 2  k' <- Choose-Rank()               ▷ low rank (k) or |R|/2
 3  Sigma_nys, U_nys <- Do-Nystrom(C', k')   ▷ see eq. (3)
 4  C'_nys <- spectral reconstruction using Sigma_nys, U_nys
 5  E <- C' - C'_nys
 6  for j in [1..n] do
 7      if j in R then
 8          P_j <- 0                  ▷ sample without replacement
 9      else P_j <- ||E_j||_2^2
10  P <- P / sum_j P_j                ▷ normalize into a distribution
11  return P

Figure 3. The proposed adaptive sampling technique, which uses only a small subset of the original matrix G to compute the probability distribution over columns. Note that it does not need to store G or run a full pass over it.
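The sketch below mirrors the pseudocode of Figure 3. It is a plausible reading rather than the authors' code: we assume a uniform $P_0$, replace the $\lceil l/s \rceil - 1$ iteration count with an equivalent loop that stops at exactly $l$ indices, and reconstruct $C'$ as the corresponding columns of the rank-$k'$ Nyström approximation, i.e., $C'_{nys} = C' W_{k'}^{-1} W$. It also assumes the residual is non-zero, which the choice $k' < |R|$ is designed to ensure.

```python
import numpy as np

def update_probability_partial(G, R):
    """Update-Probability-Partial from Figure 3, with k' = |R|/2."""
    C_sub = G[:, R]                                  # C': the only part of G touched
    W = C_sub[R, :]                                  # |R| x |R| intersection block
    k = max(1, len(R) // 2)                          # k' = (# columns in C')/2
    sw, Uw = np.linalg.eigh(W)
    sw, Uw = sw[::-1], Uw[:, ::-1]
    keep = sw[:k] > 1e-12                            # pseudo-inverse of W_k'
    Uk, sk = Uw[:, :k][:, keep], sw[:k][keep]
    C_nys = C_sub @ Uk @ np.diag(1.0 / sk) @ Uk.T @ W   # rank-k' reconstruction of C'
    E = C_sub - C_nys
    P = np.sum(E * E, axis=1)                        # squared error of each row of C'
    P[R] = 0.0                                       # sample without replacement
    return P / P.sum()

def sample_adaptive(G, l, s, seed=0):
    """Sample-Adaptive from Figure 3, starting from a uniform P0."""
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    R = list(rng.choice(n, size=s, replace=False))   # initial s indices from P0
    while len(R) < l:                                # about ceil(l/s) - 1 more rounds
        P = update_probability_partial(G, R)
        step = min(s, l - len(R))
        R.extend(rng.choice(n, size=step, replace=False, p=P))
    return np.array(R)
```

Each round costs $O(nlk + l^3)$, dominated by forming $C'_{nys}$, and only the selected columns of $G$ are ever materialized.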
4.2. Sampling Experiments

We used the datasets shown in Table 1 and compared the effect of the different sampling techniques on the relative accuracy of Nyström spectral reconstruction for $k = 100$. The results for PIE-7K are presented in Figure 4(a) for varying values of $l$. The results across datasets (Figure 4(b)) show that our adaptive-partial sampling technique outperforms all non-adaptive methods. They also show that adaptive-partial performs roughly the same as adaptive-full for smaller $l$ and outperforms it for larger $l$, while being much cheaper computationally (Figure 5(a)).

Figure 4. Nyström spectral reconstruction accuracy for various sampling methods with k = 100. (a) Results for PIE-7K using several values of l. (b) Results for all datasets for two values of l; numbers in parentheses are the standard deviations over 10 runs for each l.

l    Dataset    Uniform      Diagonal     Col-Norm     Adapt-Part   Adapt-Full
400  PIE-2.7K   67.2 (±1.1)  62.1 (±0.9)  59.7 (±1.0)  70.4 (±0.9)  72.6 (±1.0)
400  PIE-7K     57.5 (±1.1)  50.8 (±1.9)  56.8 (±1.6)  62.8 (±0.9)  64.3 (±0.7)
400  MNIST      67.4 (±0.7)  67.4 (±0.4)  65.3 (±0.5)  69.3 (±0.7)  69.2 (±0.7)
400  ESS        61.0 (±1.7)  61.5 (±1.5)  57.5 (±1.9)  65.0 (±1.0)  63.9 (±0.9)
800  PIE-2.7K   84.1 (±0.5)  77.8 (±0.6)  73.9 (±1.0)  86.5 (±0.4)  87.7 (±0.4)
800  PIE-7K     73.8 (±1.2)  64.9 (±1.8)  71.8 (±3.0)  78.5 (±0.5)  74.1 (±0.6)
800  MNIST      83.3 (±0.3)  83.0 (±0.3)  80.4 (±0.4)  84.2 (±0.4)  80.7 (±0.5)
800  ESS        78.1 (±1.0)  79.2 (±0.9)  75.4 (±1.2)  80.6 (±1.1)  74.8 (±0.8)

Next, we wished to identify the situations in which adaptive sampling is most effective. It is well known that most matrices arising in real-world applications exhibit a fast-decaying singular value spectrum. For these matrices, sampling-based spectral decomposition methods generally provide accurate estimates of the top few singular values/vectors, but the accuracy deteriorates for subsequent singular values/vectors. To test this behavior, we conducted an experiment with the PIE-7K dataset in which the relative accuracy of Nyström spectral reconstruction was measured by varying $k$ for a fixed $l$. As shown in Figure 5(b), the relative accuracy for all sampling methods decreases as $k$ is increased, verifying the deterioration in the quality of the singular value/vector approximations as $k$ increases. However, by performing error-driven sampling, the proposed adaptive sampling provides better relative accuracy as $k$ increases.

Figure 5. Results on PIE-7K for different sampling techniques. Left: empirical run times (Matlab) of the Nyström method for k = 100. Right: mean Nyström spectral reconstruction accuracy for varying k and fixed l = 600.
5. Conclusions and Future Work

We presented an analysis of two sampling-based techniques for approximating the SVD of large dense SPSD matrices, and provided a theoretical and empirical comparison. Although the Column-sampling method generates more accurate singular values/vectors and low-rank matrix projections, the Nyström method constructs better low-rank spectral approximations, which are of great practical interest as they do not use the full matrix. We also presented a novel adaptive sampling technique that improves performance over standard techniques and is significantly more efficient than the existing adaptive method. An important question left to study is how different properties of SPSD matrices, e.g., sparsity and the singular value spectrum, affect the quality of SVD approximations and the effectiveness of various sampling techniques.

References

Deshpande, A., Rademacher, L., Vempala, S., & Wang, G. (2006). Matrix approximation and projective clustering via volume sampling. Symposium on Discrete Algorithms (pp. 1117-1126).

Drineas, P., & Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153-2175.

Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). Spectral grouping using the Nyström method. Transactions on Pattern Analysis and Machine Intelligence, 26, 214-225.

Frieze, A., Kannan, R., & Vempala, S. (1998). Fast Monte-Carlo algorithms for finding low-rank approximations. Foundations of Computer Science (pp. 370-378).

Kumar, S., Mohri, M., & Talwalkar, A. (2009). Sampling techniques for the Nyström method. Conference on Artificial Intelligence and Statistics (pp. 304-311).

LeCun, Y., & Cortes, C. (2009). The MNIST database of handwritten digits.

Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: analysis and an algorithm. Neural Information Processing Systems (pp. 849-856).

Platt, J. C. (2004). Fast embedding of sparse similarity graphs. Neural Information Processing Systems (pp. 571-578).

Sim, T., Baker, S., & Bsat, M. (2002). The CMU PIE database. Conference on Automatic Face and Gesture Recognition (pp. 53-58).

Talwalkar, A., Kumar, S., & Rowley, H. (2008). Large-scale manifold learning. Conference on Computer Vision and Pattern Recognition (pp. 1-8).

Williams, C. K. I., & Seeger, M. (2000). Using the Nyström method to speed up kernel machines. Neural Information Processing Systems (pp. 682-688).

Zhang, K., Tsang, I., & Kwok, J. (2008). Improved Nyström low-rank approximation and error analysis. International Conference on Machine Learning (pp. 273-297).