Making Large-Scale Nyström Approximation Possible

Mu Li (limu.cn@gmail.com)
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

James T. Kwok (jamesk@cse.ust.hk)
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong

Bao-Liang Lu (bllu@sjtu.edu.cn)
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
MOE-MS Key Lab. for Intel. Comp. and Intel. Sys., Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

The Nyström method is an efficient technique for the eigenvalue decomposition of large kernel matrices. However, in order to ensure an accurate approximation, a sufficiently large number of columns has to be sampled. On very large data sets, the SVD step on the resultant data submatrix soon dominates the computation and becomes prohibitive. In this paper, we propose an accurate and scalable Nyström scheme that first samples a large column subset from the input matrix, but then performs only an approximate SVD on the inner submatrix by using recent randomized low-rank matrix approximation algorithms. Theoretical analysis shows that the proposed algorithm is as accurate as the standard Nyström method that directly performs a large SVD on the inner submatrix, while its time complexity is only as low as that of performing a small SVD. Experiments are performed on a number of large-scale data sets for low-rank approximation and spectral embedding. In particular, spectral embedding of an MNIST data set with 3.3 million examples takes less than an hour on a standard PC with 4G memory.
1. Introduction

Eigenvalue decomposition is of central importance in science and engineering, and has numerous applications in diverse areas such as physics, statistics, signal processing, machine learning and data mining. In machine learning, for example, eigenvalue decomposition is used in kernel principal component analysis and kernel Fisher discriminant analysis for the extraction of nonlinear structures and decision boundaries from the kernel matrix. The eigenvectors of the kernel or affinity matrix are also used in many spectral clustering (von Luxburg, 2007) and manifold learning algorithms (Belkin & Niyogi, 2002; Tenenbaum et al., 2000) for the discovery of the intrinsic clustering structure or low-dimensional manifolds.

However, standard algorithms for computing the eigenvalue decomposition of a dense $n \times n$ matrix take $O(n^3)$ time, which can be prohibitive for large data sets. Alternatively, when only a few leading (or trailing) eigenvalues/eigenvectors are needed, one may perform a partial singular value decomposition (SVD) using the Arnoldi method (Lehoucq et al., 1998). However, empirically, the time reduction is significant only when the matrix is sparse or very few eigenvectors are extracted (Williams & Seeger, 2001).

A more general approach to alleviate this problem is to use low-rank matrix approximations, of which the Nyström method (Drineas & Mahoney, 2005; Fowlkes et al., 2004; Williams & Seeger, 2001) is the most popular. It selects a subset of $m \ll n$ columns from the kernel matrix, and then uses the correlations between the sampled columns and the remaining columns to form a low-rank approximation of the full matrix. Computationally, it only has to decompose the much smaller $m \times m$ matrix (denoted $W$). Obviously, the more columns are sampled, the more accurate the resultant approximation. However, there is a tradeoff between accuracy and efficiency. In particular, on very large data sets, even decomposing the small $W$ matrix can be expensive. For example, when the data set has several million examples, sampling only 1% of the columns will lead to a $W$ that is larger than $10{,}000 \times 10{,}000$.

To avoid this explosion of $m$, Kumar et al. (2010) recently proposed the use of an ensemble of $n_e$ Nyström approximators. Each approximator, or expert, performs a standard Nyström approximation with a manageable column subset. Since the sampling of columns is stochastic, a number of such experts can be run and the resultant approximations linearly combined. Empirically, the resultant approximation is more accurate than that of a single expert as in standard Nyström. Moreover, its computational cost is (roughly) only $n_e$ times the cost of standard Nyström. However, as will be shown in Section 3, it essentially uses a block diagonal matrix to approximate the inverse of a very large $W$. Since the inverse of a block diagonal matrix is another block diagonal matrix, this approximation can be poor unless $W$ is close to block diagonal, which is highly unlikely in typical applications of the Nyström method.

Recently, a new class of randomized algorithms has been proposed for constructing approximate, low-rank matrix decompositions (Halko et al., 2009). It extends the Monte Carlo algorithms in (Drineas et al., 2006), on which the analysis of the Nyström method in (Drineas & Mahoney, 2005) is based. Unlike the standard Nyström method, which simply samples a column subset for approximation, it first constructs a low-dimensional subspace that captures the action of the input matrix; a standard factorization is then performed on the matrix restricted to that subspace. Though randomized, the algorithm yields an accurate approximation with very high probability. On the other hand, it needs at least one pass over the whole input matrix, and is thus more expensive than the Nyström method (and its ensemble variant), which only accesses a column subset. On very large data sets, this performance difference can be significant.

In this paper, we combine the merits of the standard Nyström method and the randomized SVD algorithm. The standard Nyström method is highly efficient but requires a large enough number of columns to be sampled, while the randomized SVD algorithm is highly accurate but less efficient. Motivated by the observation that the ensemble Nyström algorithm essentially uses a block diagonal approximation of $W^+$, we adopt a large column subset and then speed up the inner SVD step by randomized SVD. Both theoretical analysis and experimental results confirm that the error in the randomized SVD step is more than compensated for by the ability to use a large column subset, leading to an efficient and accurate eigenvalue decomposition even for very large input matrices. Moreover, unlike the ensemble Nyström method, which resorts to a learner and needs to attend to the consequent model selection issues, the proposed method is very easy to implement and can be used to obtain approximate eigenvectors.

The rest of this paper is organized as follows. Section 2 gives a short introduction to the standard and ensemble Nyström methods and the randomized SVD algorithm. Section 3 then describes the proposed algorithm. Experimental results are presented in Section 4, and the last section gives some concluding remarks.
Notations. The transpose of a vector/matrix is denoted by the superscript $^T$. Moreover, $\mathrm{Tr}(A)$ denotes the trace of matrix $A = [A_{ij}]$, $A^+$ its pseudo-inverse, and $\mathrm{ran}(A)$ its range. $\|A\|_2 = \max\{\sqrt{\lambda} : \lambda \text{ is an eigenvalue of } A^T A\}$ is its spectral norm, $\|A\|_F = \sqrt{\mathrm{Tr}(A^T A)}$ is its Frobenius norm, and $\sigma_i(A)$ denotes the $i$th largest singular value of $A$.

2. Related Works

2.1. Nyström Method

The Nyström method approximates a symmetric positive semidefinite (psd) matrix $G \in \mathbb{R}^{n \times n}$ by a sample $C$ of $m \ll n$ columns from $G$. Typically, this subset of columns is randomly selected by uniform sampling without replacement (Williams & Seeger, 2001; Kumar et al., 2009). Recently, more sophisticated non-uniform sampling schemes have also been pursued (Drineas & Mahoney, 2005; Zhang et al., 2008). After selecting $C$, the rows and columns of $G$ can be rearranged such that $C$ and $G$ are written as

$$C = \begin{bmatrix} W \\ S \end{bmatrix} \quad \text{and} \quad G = \begin{bmatrix} W & S^T \\ S & B \end{bmatrix}, \qquad (1)$$

where $W \in \mathbb{R}^{m \times m}$, $S \in \mathbb{R}^{(n-m) \times m}$ and $B \in \mathbb{R}^{(n-m) \times (n-m)}$. Assume that the SVD of $W$ is $U \Sigma U^T$, where $U$ is an orthonormal matrix and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_m)$ is the diagonal matrix containing the singular values of $W$ in non-increasing order. For $k \le m$, the rank-$k$ Nyström approximation is

$$\tilde{G}_k = C W_k^+ C^T, \qquad (2)$$

where $W_k^+ = \sum_{i=1}^{k} \sigma_i^{-1} U^{(i)} U^{(i)T}$, and $U^{(i)}$ is the $i$th column of $U$. The time complexity is $O(nmk + m^3)$. Since $m \ll n$, this is much lower than the $O(n^3)$ complexity required by a direct SVD on $G$.
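For illustration, below is a minimal NumPy sketch of the rank-$k$ Nyström approximation in (1)-(2). This is our own transcription rather than the authors' Matlab code; the function name `nystrom_rank_k` and the toy Gaussian kernel are purely illustrative, and the sketch assumes the leading $k$ singular values of $W$ are positive.

```python
import numpy as np

def nystrom_rank_k(G, m, k, rng=None):
    """Rank-k Nystrom approximation of a psd matrix G, following (1)-(2)."""
    rng = np.random.default_rng(rng)
    n = G.shape[0]
    idx = rng.choice(n, size=m, replace=False)  # uniform sampling w/o replacement
    C = G[:, idx]                               # n x m sampled columns
    W = G[np.ix_(idx, idx)]                     # m x m inner submatrix
    U, sigma, _ = np.linalg.svd(W)              # W = U diag(sigma) U^T (W is psd)
    # Rank-k pseudo-inverse: W_k^+ = sum_{i<=k} sigma_i^{-1} U^(i) U^(i)T.
    Wk_pinv = (U[:, :k] / sigma[:k]) @ U[:, :k].T
    return C @ Wk_pinv @ C.T                    # G~_k = C W_k^+ C^T

# Toy usage on a small Gaussian kernel matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-sq / 10.0)
G_tilde = nystrom_rank_k(G, m=100, k=20, rng=0)
print(np.linalg.norm(G - G_tilde, "fro") / np.linalg.norm(G, "fro"))
```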
2.2. Ensemble Nyström Algorithm

Since the Nyström method relies on random sampling of columns, it is stochastic in nature. The ensemble Nyström method (Kumar et al., 2010) employs an ensemble of $n_e \ge 1$ Nyström approximators for improved performance. It first samples $m n_e$ columns from $G$, which can be written as $C = [C_1, \ldots, C_{n_e}] \in \mathbb{R}^{n \times m n_e}$ with each $C_i \in \mathbb{R}^{n \times m}$. The standard Nyström method is then performed on each $C_i$, obtaining a rank-$k$ approximation $\tilde{G}_{i,k}$ ($i = 1, \ldots, n_e$). Finally, these are weighted to form the ensemble approximation

$$\tilde{G}^{ens} = \sum_{i=1}^{n_e} \mu_i \tilde{G}_{i,k}, \qquad (3)$$

where the $\mu_i$'s are the mixture weights. A number of choices have been used in setting these weights, including uniform weights, exponential weights, and weights obtained by ridge regression. Empirically, the best method is ridge regression. This, however, needs to sample an additional $s$ columns from $G$ as the training set, and another $s'$ columns as the hold-out set for model selection. The total time complexity is $O(n_e nmk + n_e m^3 + C_\mu)$, where $C_\mu$ is the cost of computing the mixture weights.

Another disadvantage of the ensemble Nyström method is that, unlike the standard Nyström method, approximate eigenvectors of $G$ cannot be easily obtained. As can be seen from (3), the eigenvectors of the individual $\tilde{G}_{i,k}$'s are in general different and so cannot be easily combined. Hence, the ensemble Nyström method cannot be used with spectral clustering and manifold learning algorithms.

2.3. Randomized Low-Rank Approximation

Recently, a class of simple but highly efficient randomized algorithms has been proposed for constructing approximate, low-rank matrix decompositions (Halko et al., 2009). In general, they can be used on complex-valued rectangular matrices. In the following, we focus on obtaining a rank-$k$ SVD from a symmetric matrix $W \in \mathbb{R}^{m \times m}$ (Algorithm 1).

Algorithm 1 Randomized SVD (Halko et al., 2009).
Input: $m \times m$ symmetric matrix $W$, scalars $k$, $p$, $q$.
Output: $U$, $\Sigma$.
1: $\Omega \leftarrow$ an $m \times (k+p)$ standard Gaussian random matrix.
2: $Z \leftarrow W\Omega$, $Y \leftarrow W^{q-1} Z$.
3: Find an orthonormal matrix $Q$ (e.g., by QR decomposition) such that $Y = QQ^T Y$.
4: Solve $B(Q^T \Omega) = Q^T Z$.
5: Perform SVD on $B$ to obtain $V \Sigma V^T = B$.
6: $U \leftarrow QV$.

In general, there are two computational stages in this class of algorithms. In the first stage (steps 1 to 3), an orthonormal matrix $Q \in \mathbb{R}^{m \times (k+p)}$ is constructed, which serves as an approximate, low-dimensional basis for the range of $W$ (i.e., $W \approx QQ^T W$). Here, $p$ is an over-sampling parameter (typically set to 5 or 10) such that the rank of $Q$ is slightly larger than the desired rank $k$, and $q$ is the number of steps of a power iteration (typically set to 1 or 2) which is used to speed up the decay of the singular values of $W$. In the second stage (steps 4 to 6), the input matrix is restricted to the above subspace, and a standard SVD is computed on the reduced matrix

$$B = Q^T W Q \qquad (4)$$

to obtain $B = V \Sigma V^T$. Finally, the SVD of $W$ can be approximated as $W \approx U \Sigma U^T$, where $U = QV$. Computationally, it takes $O(m^2 k)$ time¹ to compute $Z$ and $Y$, $O(mk^2)$ time for the QR decomposition, $O(mk^2)$ time to obtain $B$, and $O(k^3)$ time for the SVD. Hence, the total time complexity is $O(m^2 k + k^3)$, which is quadratic in $m$. Moreover, it needs at least one pass over the whole input matrix.

¹ Here, we compute $Y$ by multiplying $W$ with a sequence of $m \times (k+p)$ matrices, as $W Z, W(WZ), \ldots, W(W^{q-2} Z)$.
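Algorithm 1 maps directly onto standard linear-algebra primitives. The following is a sketch in NumPy (ours, not the authors' code): step 4 is solved as a least-squares problem on the transposed system, and step 5 uses a symmetric eigendecomposition, which coincides with the SVD for the psd matrices considered in this paper.

```python
import numpy as np

def randsvd(W, k, p=5, q=2, rng=None):
    """Randomized SVD of a symmetric m x m matrix W (Algorithm 1).

    Returns (U, s) with W ~ U diag(s) U^T; U has k+p columns, ordered by
    non-increasing s, so callers can keep the leading k pairs.
    """
    rng = np.random.default_rng(rng)
    m = W.shape[0]
    Omega = rng.standard_normal((m, k + p))  # step 1: Gaussian test matrix
    Z = W @ Omega                            # step 2: Z = W Omega
    Y = Z
    for _ in range(q - 1):                   #         Y = W^{q-1} Z (power iteration)
        Y = W @ Y
    Q, _ = np.linalg.qr(Y)                   # step 3: orthonormal basis of ran(Y)
    # Step 4: solve B (Q^T Omega) = Q^T Z, i.e. (Omega^T Q) B^T = Z^T Q.
    B = np.linalg.lstsq(Omega.T @ Q, Z.T @ Q, rcond=None)[0].T
    B = 0.5 * (B + B.T)                      # symmetrize for numerical stability
    s, V = np.linalg.eigh(B)                 # step 5: B = V diag(s) V^T
    order = np.argsort(s)[::-1]              # non-increasing order
    return Q @ V[:, order], s[order]         # step 6: U = Q V
```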
3. Algorithm

3.1. Combining Nyström and Randomized SVD

Obviously, the more columns are sampled, the more accurate the Nyström approximation becomes. Hence, the ensemble Nyström method samples $m n_e$ columns instead of $m$ columns. In the following, we abuse notation and denote the corresponding $W$ matrix by $W_{(n_e m)} \in \mathbb{R}^{n_e m \times n_e m}$. However, there is a tradeoff between accuracy and efficiency. If the standard Nyström method were used, the SVD of $W_{(n_e m)}$ would take $O(n_e^3 m^3)$ time. The ensemble Nyström method alleviates this problem by replacing this expensive SVD with $n_e$ SVDs on $n_e$ smaller $m \times m$ matrices. Our key observation is that, by using (2), the ensemble Nyström approximation in (3) can be rewritten as

$$\tilde{G}^{ens} = C \,\mathrm{diag}(\mu_1 W_{1,k}^+, \ldots, \mu_{n_e} W_{n_e,k}^+)\, C^T, \qquad (5)$$

where $W_{i,k} \in \mathbb{R}^{m \times m}$ is the $W$ matrix in (1) corresponding to $\tilde{G}_{i,k}$, and $\mathrm{diag}(\mu_1 W_{1,k}^+, \ldots, \mu_{n_e} W_{n_e,k}^+)$ is the block diagonal matrix whose diagonal blocks are $\mu_1 W_{1,k}^+, \ldots, \mu_{n_e} W_{n_e,k}^+$. In other words, the ensemble Nyström algorithm can be equivalently viewed as approximating $W_{(n_e m)}^+$ by this block diagonal matrix. Despite the resultant computational simplicity, the inverse of a block diagonal matrix is another block diagonal matrix. Hence, no matter how sophisticated the estimation of the mixture weights $\mu_i$, this block diagonal approximation is rarely valid unless $W_{(n_e m)}$ is itself close to block diagonal, which is highly unlikely in typical applications of the Nyström method.

Since the ensemble Nyström method attains better performance by sampling more columns, our method will also sample more columns, or, equivalently, use an $m$ larger than is typical for the standard Nyström method. However, instead of using a block diagonal approximation to solve the subsequent large-SVD problem, we use a more accurate procedure. In particular, we adopt the randomized low-rank matrix approximation technique introduced in Section 2.3.

Algorithm 2 The proposed algorithm.
Input: psd matrix $G \in \mathbb{R}^{n \times n}$, number of columns $m$, rank $k$, over-sampling parameter $p$, power parameter $q$.
Output: $\hat{G}$, an approximation of $G$.
1: $C \leftarrow$ $m$ columns of $G$ sampled uniformly at random without replacement.
2: $W \leftarrow$ the $m \times m$ matrix defined in (1).
3: $[\tilde{U}, \Sigma] \leftarrow \mathrm{randsvd}(W, k, p, q)$ using Algorithm 1.
4: $U \leftarrow \sqrt{m/n}\; C \tilde{U} \Sigma^+$.
5: $\hat{G} \leftarrow \frac{n}{m}\, U \Sigma U^T$.

The proposed algorithm is shown in Algorithm 2. Essentially, it combines the high efficiency of the Nyström method, which however requires a large enough column subset for an accurate approximation, with the ability of the randomized algorithm to produce a very accurate, yet still relatively efficient, SVD. Note from step 5 that $\hat{G} = C \tilde{U} \Sigma^+ \tilde{U}^T C^T$. In turn, from Algorithm 1 and (4), $\tilde{U} \Sigma \tilde{U}^T = QBQ^T = Q(Q^T W Q)Q^T$. Hence, instead of relying on the block diagonal matrix approximation in (5), $G$ is now more accurately approximated as

$$\hat{G} = CQ(Q^T W Q)^+ Q^T C^T. \qquad (6)$$

Besides, instead of using the randomized SVD algorithm for the inner SVD, one might want to apply other approximations, such as using the standard Nyström method again. However, the Nyström method is not good at approximating the trailing eigenvalues, which are important in computing the inverse of $W$. Preliminary experiments show that, in order for the resultant approximation of $G$ to be accurate, the inner Nyström method needs to sample close to $m$ columns, which, however, leads to little speedup over a standard SVD. Moreover, recall from Section 2.1 that there are different column sampling strategies. Here, we focus on uniform sampling without replacement (Kumar et al., 2009); extensions to other sampling schemes will be studied in the future.

The time complexity required is $O(nmk + k^3)$. A summary of the time complexities² of the various methods is shown in Table 1. Recall that typically $n \gg m \gg k$. As can be seen, all the methods except randomized SVD scale linearly with $n$. Moreover, the proposed method has a complexity comparable to that of the ensemble Nyström method, as both scale cubically with $k$ but not with $m$.

Table 1. Time complexities for the various methods to obtain a rank-$k$ Nyström approximation of an $n \times n$ matrix. Here, $m$ is the number of columns sampled.

method            | time complexity
Nyström           | $O(nmk + m^3)$
ensemble Nyström  | $O(nmk + n_e k^3 + C_\mu)$
randomized SVD    | $O(n^2 k + k^3)$
proposed method   | $O(nmk + k^3)$

² In order to be consistent with the other methods, the total number of columns sampled by the ensemble Nyström method is now $m$ (not $n_e m$ as in Section 2.2).
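A sketch of Algorithm 2 then only has to glue the two pieces together: column sampling as in Section 2.1, followed by the `randsvd` sketch from Section 2.3 (assumed in scope here; the function name `nystrom_randsvd` is ours, and positive leading singular values are assumed). Steps 4-5 also expose approximate eigenvectors $U$ and eigenvalues $\frac{n}{m}\Sigma$ of $G$, which is what the spectral embedding experiments in Section 4.2 rely on.

```python
import numpy as np

def nystrom_randsvd(G, m, k, p=5, q=2, rng=None):
    """The proposed method (Algorithm 2).

    Returns approximate eigenvectors U (n x k) and eigenvalues lam (k,)
    such that G ~ U diag(lam) U^T, matching (6).
    """
    rng = np.random.default_rng(rng)
    n = G.shape[0]
    idx = rng.choice(n, size=m, replace=False)  # step 1: sample m columns
    C = G[:, idx]
    W = G[np.ix_(idx, idx)]                     # step 2: inner submatrix
    U_t, s = randsvd(W, k, p, q, rng=rng)       # step 3: approximate SVD of W
    U_t, s = U_t[:, :k], s[:k]                  # keep the leading k pairs
    U = np.sqrt(m / n) * (C @ (U_t / s))        # step 4: U = sqrt(m/n) C U~ Sigma^+
    lam = (n / m) * s                           # step 5: G^ = (n/m) U Sigma U^T
    return U, lam
```

Forming $\hat{G} = U\,\mathrm{diag}(\lambda)\,U^T$ explicitly costs $O(n^2 k)$, so in practice one keeps only the factors unless the full approximation is required.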
3.2. Error Analysis

Let the column sampling matrix be $S \in \{0,1\}^{n \times m}$, where $S_{ij} = 1$ if the $i$th column of $G$ is chosen in the $j$th random trial, and $S_{ij} = 0$ otherwise. Then $C = GS$ and $W = S^T G S$. Moreover, since $G$ is psd, we can write it as

$$G = X^T X \qquad (7)$$

for some $X \in \mathbb{R}^{d \times n}$. In the proof, we will also need the column-sampled and rescaled version of $X$:

$$H = \gamma XS, \qquad (8)$$

where $\gamma = \sqrt{n/m}$ is the scaling factor. Then,

$$C = \gamma^{-1} X^T H, \quad W = \gamma^{-2} H^T H. \qquad (9)$$

The error analysis depends on a number of results in (Halko et al., 2009; Kumar et al., 2009; Stewart, 1990). For the readers' convenience, these are listed in the appendix.

3.2.1. Spectral Norm

For the matrix $W$ in step 3, we first bound its expected approximation error $\mathbb{E}\,\|W - QQ^T W\|_2$. Since the input matrix $G$ is psd, $W$ is also psd. The following proposition can then be obtained from a more general result in (Halko et al., 2009).

Proposition 1. Given a psd matrix $W$, the $Q$ obtained in Algorithm 1 satisfies

$$\mathbb{E}\,\|W - QQ^T W\|_2 \le \epsilon^{1/q}\, \sigma_{k+1}(W), \qquad (10)$$

where $\epsilon = 1 + \sqrt{\frac{k}{p-1}} + \frac{e\sqrt{k+p}}{p}\sqrt{m-k}$.

The main theorem is stated below.

Theorem 1. For the $\hat{G}$ obtained in Algorithm 2,

$$\mathbb{E}\,\|G - \hat{G}\|_2 \le \epsilon^{1/q}\, \|G - G_k\|_2 + \left(1 + \epsilon^{1/q}\right)\frac{n}{\sqrt{m}}\, G_{ii}^{\max}, \qquad (11)$$

where $G_k$ is the best rank-$k$ approximation of $G$, $G_{ii}^{\max} = \max_i G_{ii}$, and $\epsilon$ is as in (10).

As in (Halko et al., 2009), the power iteration drives $\epsilon^{1/q}$ towards 1 exponentially fast as $q$ increases, and so the error in (11) decreases with the number of sampled columns $m$. In particular, if we replace $\epsilon^{1/q}$ by 1, then (11) becomes $\|G - G_k\|_2 + \frac{2n}{\sqrt{m}} G_{ii}^{\max}$, which is the same³ as the bound for the standard Nyström method using $m$ columns. In other words, Algorithm 2 is as accurate as performing a large SVD in the standard Nyström method.

³ This bound can be obtained from (Kumar et al., 2010) by combining their (6) and (10).

3.2.2. Frobenius Norm

A similar bound can be obtained for the approximation error in terms of the Frobenius norm. Since there is no analogous theory for the power iteration w.r.t. the Frobenius norm (cf. Remark 10.1 of (Halko et al., 2009)), the analysis here is restricted to $q = 1$, and thus the resultant bound is quite loose. However, as will be seen in Section 4, empirically the approximation with just $q = 2$ is already very good.

Theorem 2. For the $\hat{G}$ obtained in Algorithm 2,

$$\mathbb{E}\,\|G - \hat{G}\|_F \le \sqrt{\frac{2(k+p)}{p-1}}\, \|G - G_k\|_F + \left(1 + \sqrt{\frac{4(k+p)}{m(p-1)}}\right) n\, G_{ii}^{\max}.$$

4. Experiments

In this section, we study the efficiency of the proposed method in solving large, dense eigen-systems. Experiments are performed on low-rank approximation (Section 4.1) and spectral embedding (Section 4.2). All the implementations are in Matlab. Experiments are run on a PC with a 2.4GHz Core2 Duo CPU and 4G memory.

4.1. Low-Rank Approximation

We use a number of data sets from the LIBSVM archive⁴ (Table 2). The linear kernel is used on the RCV1 text data set, while the Gaussian kernel is used on all the others.

Table 2. Data sets used.

data      | #samples  | dim    | task
Satimage  | 4,435     | 36     | low-rank approx
RCV1      | 23,149    | 47,236 | low-rank approx
MNIST     | 60,000    | 784    | low-rank approx
Covtype   | 581,012   | 54     | low-rank approx
MNIST-8M  | 3,276,294 | 784    | embedding

The following methods are compared in the experiments:

1. Standard Nyström method (denoted nys).
2. Ensemble Nyström method (denoted ens): As in (Kumar et al., 2010), an additional $s = 20$ columns are used for training the mixture weights by ridge regression, and another $s' = 20$ columns are used for choosing the regularization parameters. For the covtype data set, $s$ and $s'$ are reduced to 2 so as to speed up computation. Moreover, we set $n_e = m/k$.
3. The proposed method (denoted our): We fix the over-sampling parameter $p$ to 5, and the power parameter $q$ to 2.
4. Randomized SVD (denoted r-svd): As in the proposed method, we use $p = 5$ and $q = 2$.

The first three methods are based on the Nyström method, and $m$ columns are uniformly sampled without replacement. Due to the randomness in the sampling process, we perform 10 repetitions and report the averaged result. In contrast, the randomized SVD algorithm does not perform sampling, and the whole input matrix is always used. Besides, while the best rank-$k$ approximation could be obtained by a direct SVD on the whole input matrix, this is computationally expensive even on medium-sized data sets and so is not compared here.

⁴ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
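Concretely, the two ingredients of this evaluation, the kernel matrix and the error measure reported in Figures 1-4, can be computed as follows (a sketch; the bandwidth `gamma` and both function names are our own illustration, as the paper does not list its kernel parameters):

```python
import numpy as np

def gaussian_kernel(X, gamma=0.1):
    """Dense Gaussian kernel matrix, as used for all data sets except RCV1."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def relative_error(G, G_hat):
    """Relative approximation error ||G - G^||_F / ||G||_F of Section 4.1."""
    return np.linalg.norm(G - G_hat, "fro") / np.linalg.norm(G, "fro")
```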
4.1.1. Different Numbers of Columns

In the first experiment, we fix $k = 600$ and gradually increase the number of sampled columns ($m$). Figure 1 shows the relative approximation error⁵ $\|G - \hat{G}\|_F / \|G\|_F$ and the CPU time. As can be seen, the randomized SVD algorithm is often the most accurate, albeit also the most expensive. Standard Nyström can be as accurate as randomized SVD when $m$ is large enough. However, since Nyström takes $O(m^3)$ time for the SVD step, it also quickly becomes computationally infeasible. As for the ensemble Nyström method, it degenerates to the standard Nyström method when $n_e = 1$ (the left endpoint of the curve). Its approximation error decreases as the ensemble has more experts⁶, which is consistent with the results in (Kumar et al., 2010). However, as discussed in Section 3.1, the ensemble Nyström method approximates the large SVD problem with a crude block diagonal matrix approximation. Hence, its accuracy is much inferior to that of the standard Nyström method (which performs the large SVD directly). On the other hand, the proposed method is almost as accurate as standard Nyström, while its CPU time is comparable to or even smaller than that of the ensemble Nyström method.

[Figure 1. Performance of the various methods on the satimage, rcv1, mnist and covtype data sets. Top: low-rank approximation error vs. m; Bottom: CPU time vs. m. (The randomized SVD algorithm cannot be run on the covtype data set because it is too large.)]

⁵ Results are only reported for the Frobenius norm because the approximation error w.r.t. the spectral norm is computationally difficult to compute, especially on large data sets. Nevertheless, this is still a good indication of the approximation performance, as the spectral norm is upper-bounded by the Frobenius norm (Lütkepohl, 1996).
⁶ Recall that we use $n_e = m/k$, and so the number of experts increases with $m$.

The accuracy of the ensemble Nyström method can be improved at the expense of more computation. Recall that in our setup, all Nyström-based methods have access to $m$ columns. For the ensemble Nyström method, these columns are divided among the $m/k$ experts, each receiving a size-$k$ subset. In general, let $r$ be the number of columns used by each expert. Obviously, the larger the $r$, the better the ensemble approximation. Indeed, in the extreme case where $r = m$, the ensemble Nyström method degenerates to the standard Nyström method. Hence, the accuracy of the ensemble Nyström method can be improved by using fewer experts, with each expert using a larger column subset. However, the time for performing $\frac{m}{r}$ size-$r$ SVDs is $O(\frac{m}{r} r^3) = O(mr^2)$. Figure 2 shows the resultant tradeoff between approximation error and CPU time on the satimage data set. As can be seen, in order for the ensemble Nyström method to have speed comparable to the proposed algorithm, only a small $r$ can be afforded; this justifies our choice of $r = k$.

[Figure 2. Low-rank approximation performance for the ensemble Nyström method, with varying number of columns used by each expert: (a) approximation error; (b) CPU time.]
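For reference, a sketch of the ensemble combination in (3) with uniform weights $\mu_i = 1/n_e$, where each expert draws its own $r$-column subset (a simplification: the actual experiments use ridge-regression weights and joint sampling as described in Section 2.2; `nystrom_rank_k` is the earlier illustrative routine):

```python
import numpy as np

def ensemble_nystrom_uniform(G, r, n_experts, k, rng=None):
    """Uniform-weight ensemble Nystrom: average n_experts rank-k
    approximations, each built from its own r sampled columns, as in (3)."""
    rng = np.random.default_rng(rng)
    G_ens = np.zeros_like(G)
    for _ in range(n_experts):
        G_ens += nystrom_rank_k(G, m=r, k=k, rng=rng)
    return G_ens / n_experts
```

Increasing `r` improves each expert at an $O(mr^2)$ total SVD cost, which is exactly the tradeoff plotted in Figure 2.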
4.1.2. Different Ranks

In the second experiment, we study the approximation performance as the rank $k$ varies. Because of the lack of space, results are only reported on the MNIST data set. As can be seen from Figure 3, when $k$ increases, the approximation error decreases while the CPU time increases across all methods. Hence, there is a tradeoff between accuracy and efficiency. Nevertheless, the relative performance comparison among the various methods is still the same as in Section 4.1.1.

[Figure 3. Performance at different k's (k = 200, 400, 600, 800) on the MNIST data set. Top: low-rank approximation error; Bottom: CPU time.]

4.1.3. Input Matrices of Different Sizes

In this experiment, we examine how the performance scales with the size of the input matrix. The various methods are run on subsets of the covtype data set. Here, $k$ is fixed to 600 and $m$ to $0.03n$. Results are shown in Figure 4. Note that the slopes of the curves in Figure 4(b) determine their scalings with $n$. As can be seen, the standard Nyström method scales cubically with $n$, while all the other methods scale quadratically (because $m$ also scales linearly with $n$ here). Moreover, similar to the results in the previous sections, the ensemble Nyström method and the proposed method are the most scalable, while the proposed method is as accurate as the standard Nyström method that performs a large SVD.

[Figure 4. Low-rank approximation performance at different sizes of the input matrix on the covtype data set: (a) approximation error; (b) CPU time.]
4.2. Spectral Embedding

In this section, we perform spectral embedding using the Laplacian eigenmap (Belkin & Niyogi, 2002). The Gaussian kernel is used to construct the affinity matrix. For easy visualization, the data are projected onto the two singular vectors of the normalized Laplacian with the second and third smallest singular values. Experiments are performed on the MNIST-8M data set⁷, which contains 8.1M samples constructed by elastic deformation of the original MNIST training set. To avoid clutter in the embedding results, we only use the digits 0, 1, 2 and 9, which results in a data set with about 3.3M samples. Because of this sheer size, neither standard SVD nor standard Nyström can be run on the whole set. Moreover, neither can the ensemble Nyström method be used, as it cannot produce approximate eigenvectors (Section 2.2). Hence, the full set can only be run with the proposed method, with $m = 4000$, $k = 400$ and the Gaussian kernel. For comparison, we also run standard SVD on a random subset of 8,000 samples. Results are shown in Figure 5. As can be seen, the two embedding results are very similar. Besides, for the proposed method, this embedding of 3.3M samples is obtained within an hour on our PC.

⁷ http://leon.bottou.org/papers/loosli-canu-bottou-2006

[Figure 5. Embedding results for the digits 0, 1, 2, 9 in the MNIST-8M data set: (a) proposed method; (b) SVD on a data subset.]
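To make the embedding pipeline concrete, a rough sketch is given below, assuming the `nystrom_randsvd` illustration from Section 3.1. For readability it forms the normalized affinity matrix $D^{-1/2} G D^{-1/2}$ densely; at the 3.3M-sample scale one would instead evaluate only the $m$ sampled kernel columns on the fly and approximate the degrees, details this sketch glosses over.

```python
import numpy as np

def laplacian_eigenmap_2d(G_affinity, m, k, rng=None):
    """2-D spectral embedding from a psd affinity matrix (Section 4.2).

    The top eigenvectors of A = D^{-1/2} G D^{-1/2} correspond to the
    smallest eigenvalues of the normalized Laplacian I - A; the 2nd and
    3rd ones give the two embedding coordinates.
    """
    d = G_affinity.sum(axis=1)                      # degrees
    A = G_affinity / np.sqrt(np.outer(d, d))        # D^{-1/2} G D^{-1/2}
    U, lam = nystrom_randsvd(A, m=m, k=k, rng=rng)  # approximate eigenvectors
    return U[:, 1:3]                                # skip the trivial top vector
```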
5. Conclusion

In this paper, we proposed an accurate and scalable Nyström approximation scheme for very large data sets. It first samples a large column subset from the input matrix, and then performs an approximate SVD on the inner submatrix using recent randomized low-rank matrix approximation algorithms. Both theory and experiments demonstrate that the proposed algorithm is as accurate as the standard Nyström method that directly performs a large SVD on the inner submatrix, while its time complexity is only as low as that of the ensemble Nyström method. In particular, spectral embedding of an MNIST data set with 3.3 million examples takes less than an hour on a standard PC with 4G memory.

Acknowledgments

This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 614508), the National Natural Science Foundation of China (Grants No. 60773090 and No. 90820018), the National Basic Research Program of China (Grant No. 2009CB320901), and the National High-Tech Research Program of China (Grant No. 2008AA02Z315).

References

Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS 14, 2002.

Drineas, P. and Mahoney, M. W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153-2175, 2005.

Drineas, P., Kannan, R., and Mahoney, M. W. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158-183, 2006.

Fowlkes, C., Belongie, S., Chung, F., and Malik, J. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214-225, February 2004.

Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical report, 2009.

Kumar, S., Mohri, M., and Talwalkar, A. Sampling techniques for the Nyström method. In AISTATS, 2009.

Kumar, S., Mohri, M., and Talwalkar, A. Ensemble Nyström method. In NIPS 22, 2010.

Lehoucq, R. B., Sorensen, D. C., and Yang, C. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, 1998.

Lütkepohl, H. Handbook of Matrices. John Wiley and Sons, 1996.

Stewart, G. W. Matrix perturbation theory. SIAM Review, 1990.

Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.

von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, December 2007.

Williams, C. K. I. and Seeger, M. Using the Nyström method to speed up kernel machines. In NIPS 13, 2001.

Zhang, K., Tsang, I. W., and Kwok, J. T. Improved Nyström low rank approximation and error analysis. In ICML, 2008.