¨ Improved Nystrom Low-Rank Approximation and Error Analysis Kai Zhang T W I N S E N @ C S E . U S T. H K Ivor W. Tsang I VO R @ C S E . U S T. H K James T. Kwok JA M E S K @ C S E . U S T. H K Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong Abstract Low-rank matrix approximation is an effective tool in alleviating the memory and computational burdens of kernel methods and sampling, as the mainstream of such algorithms, has drawn considerable attention in both theory and practice. This paper presents detailed studies on the ¨ Nystrom sampling scheme and in particular, an ¨ error analysis that directly relates the Nystrom approximation quality with the encoding powers of the landmark points in summarizing the data. The resultant error bound suggests a simple and efficient sampling scheme, the k -means ¨ clustering algorithm, for Nystrom low-rank approximation. We compare it with state-of-the-art approaches that range from greedy schemes to probabilistic sampling. Our algorithm achieves significant performance gains in a number of supervised/unsupervised learning tasks including kernel PCA and least squares SVM. A useful way to alleviate the memory and computational burdens of kernel methods is to utilize the rapid decaying spectra of the kernel matrices (Williams & Seeger, 2000) and perform low-rank approximation in the form of K = GG , where G Rn×m with m n. However, the optimal (eigenvalue) decomposition takes O(n3 ) time and efficient alternatives have to be sought. In the following, we give a brief review on efficient techniques for low-rank decompositions of symmetric, positive (semi-)definite kernel matrices. Greedy approaches have been applied in several fast algorithms for approximating the kernel matrix. In (Smola ¨ & Scholkopf, 2000), the kernel matrix K is approximated by the subspace spanned by a subset of its columns. The basis vectors are chosen incrementally to minimize an upper bound of the approximation error. The algorithm takes O(m2 nl) time using a probabilistic heuristic, where l is the random subset size. In (Ouimet & Bengio, 2005), a greedy sampling scheme is proposed based on how well a sample point can be represented by a (constrained) linear combination of the current subspace basis in the feature space. Their algorithm scales as O(m2 n). Another well-known greedy approach for low-rank approximation of positive semi-definite matrices is the incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan, 2005; Bach & Jordan, 2002). It is a variant of the Cholesky decomposition that skip pivots below a certain threshold, and factorizes the kernel matrix K as K = GG where G Rn×m is a lower triangular matrix. Another class of low-rank approximation algorithms stem ¨ ¨ from the Nystrom method. The Nystrom method was originally designed to solve integral equations (Baker, 1977). ¨ Given a kernel matrix K , the Nystrom method can be deemed as choosing a subset of m columns (hence rows) E Rn×m , and reconstructing the complete kernel matrix by K E W -1 E , where W is the intersection of the selected rows and columns of K . The most popular ¨ sampling scheme for Nystrom method is random sampling, which leads to fast versions of kernel machines (Williams & Seeger, 2001; Lawrence & Herbrich, 2003) and spectral 1. Introduction Kernel methods play a central role in machine learning and have demonstrated huge success in modelling real-world data with highly complex, nonlinear structures. Examples include the support vector machine, kernel Fisher discriminant analysis and kernel principal component analysis. The key element of kernel methods is to map the data into a kernel-induced Hilbert space (·) where dot product between points can be computed equivalently through the kernel evaluation (xi ), (xj ) = K (xi , xj ). Given n sample points, this necessitates the calculation of an n × n symmetric, positive (semi-)definite kernel matrix. The resultant complexities in terms of both space (quadratic) and time (usually cubic) can be quite demanding for large problems, posing a big challenge on practical applications. Appearing in Proceedings of the 25 th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s). ¨ Improved Nystrom Low-Rank Approximation and Error Analysis clustering (Fowlkes et al., 2004). In (Platt, 2005), several variants of multidimensional scaling are all shown to be re¨ lated to the Nystrom approximation. There are also a large body of randomized algorithms for low-rank decomposition of arbitrary matrices (Frieze et al., 1998; Achlioptas & McSherry, 2001; Drineas et al., 2003), where the goal is to design column/row sampling probabilities that achieve provable probabilistic bounds. These algorithms are designed for a more general purpose and will not be the focus of this paper. However, we note that one of these randomized algorithms has been recently revised for efficient low-rank approximation of the symmetric Gram matrix (Drineas & Mahoney, 2005). Therefore we will use it as a representative of randomized algorithms in our empirical evaluations. The basic idea of (Drineas & Mahoney, 2005) is to sample the columns of the kernel matrix based on a pre-computed distribution using the norms of the columns. The reconstruction of the kernel matrix is also normalized by the sampling distribution. In terms of efficiency, greedy approaches usually take O(m2 n) time for sampling, while the random scheme only needs O(n) and is much more efficient. Probabilistic approaches, or randomized algorithms in general, are usually more expensive in that the sampling distributions have to be computed based on the original matrix, which require at least O(n2 ). In terms of memory, note that the matrices ¨ (E and W ) needed in the Nystrom method with random sampling can be simply computed on demand. This greatly reduces the memory requirement for very large-scale problems. In contrast, the intermediate matrices for greedy approaches have to be incrementally updated and stored. ¨ Although the Nystrom method possesses desirable scaling properties and has been applied with success in various machine learning problems, analysis on its key step of choosing the landmark set is relatively limited. In (Drineas & Mahoney, 2005), a probabilistic error bound is provided ¨ on the Nystrom low-rank approximation. However, the error bound only applies to the specially designed sampling scheme, which needs to compute the norms of all the rows/columns of the kernel matrix and is hence quite expensive. In (Zhang & Kwok, 2006), a block quantization scheme is proposed for fast spectral embedding. The kernel eigen-system is approximated by first computing a block-wise constant kernel matrix and then extrapolat¨ ing its eigenvectors through the weighted Nystrom extension. However, the error analysis is only on the block¨ quantization step, and how the Nystrom method affects the approximation quality in general remains unclear. Thus, the motivation of this paper is to provide a more concrete analysis on how the sampling scheme (or the choice of the ¨ landmark points) in general influences the Nystrom lowrank approximation, and to improve the sampling strategy while still preserving its computational efficiency. ¨ Our key finding is that the Nystrom low-rank approximation depends crucially on the quantization error induced by encoding the sample set with the landmark points. This suggests that, instead of applying the greedy or probabilistic sampling, the landmark points can be simply chosen as the k -means cluster centers, which finds a local minimum of the quantization error. To the best of our knowledge, ¨ the k -means has not been applied in the Nystrom low-rank approximation. The complexity of k -means is only linear in the sample size and dimension and, as our analysis expected, it demonstrates very encouraging performance that ¨ is consistently better than all known variants of Nystrom. We also compare it with the greedy approach of incomplete Cholesky decomposition and again obtain positive results. The rest of the paper is organized as follows. In Section 2, ¨ we give a brief introduction of the Nystrom method. In ¨ Section 3, we present an error analysis on how the Nystrom low-rank approximation is affected by the chosen landmark points, and propose the k -means algorithm for the sampling step. In Section 4, we compare our approach with a number of state-of-the-art low-rank decomposition techniques (including both greedy and probabilistic sampling approaches). The last section gives concluding remarks. ¨ 2. Nystrom Method ¨ The Nystrom method is originated from the numerical treatment of integral equations of the form p (y )k (x, y )i (y )dy = i i (x), (1) where p(·) is the probability density function, k is a positive definite kernel function, and 1 2 · · · 0 and 1 , 2 , . . . are the eigenvalues and eigenfunctions of the integral equation, respectively. Given a set of i.i.d. samples {x1 , x2 , . . . , xq } drawn from p(·), the basic idea is to approximate the integral in (1) by the empirical average: q 1j k (x, xj )i (xj ) i i (x). q =1 (2) Choosing x in (2) from {x1 , x2 , . . . , xq } leads to a standard eigenvalue decomposition K (q) U (q) = U (q) (q) , where (q ) Kij = k (xi , xj ) for i, j = 1, 2, . . . , q , U (q) Rq×q has orthonormal columns and (q) Rq×q is a diagonal matrix. The eigenfunctions i 's and eigenvalues i 's in (1) can be approximated by U (q) and (q) , as (Williams & Seeger, 2001): (q ) (q ) (3) i (xj ) q Uj i , i i /q . ¨ This means, the Nystrom method using different subset sizes q 's are all approximations to i and i in the inte- ¨ Improved Nystrom Low-Rank Approximation and Error Analysis Here, E Rn×m with Eij = k (xi , zj ), and Z , Z Rm×m contain the eigenvectors and eigenvalues of W Rm×m where Wij = k (zi , zj ). Using the approximations in (4), K can be reconstructed as m n m -1 -1 K E Z Z Z E Z Z n m n = E W -1 E . (5) ¨ Equation (5) is the basis for Nystrom low-rank approximation of the kernel matrix (Williams & Seeger, 2001; Fowlkes et al., 2004). ¨ gral equation (1). As a result, the Nystrom method using a small q can also be deemed as approximating the ¨ Nystrom method using a large q . Suppose the sample set X = {xi }n 1 , with the corresponding n × n kernel matrix i= ¨ K . Then the Nystrom method that randomly chooses a subset Z = {zi }m 1 of m landmark points will approximate i= the eigen-system of the full kernel matrix K K = K K by (Williams & Seeger, 2001) m n K E Z -1 , K Z . (4) Z n m 3.2. Approximation Error of Sub-Kernel Matrix In this section we apply a "clustered" data model to analyze ¨ the quality of Nystrom low rank approximation. Here, the data clusters can be naturally obtained by assigning each sample to the closest landmark point. As will be seen, this model allows us to derive an explicit error bound for the ¨ Nystrom approximation. Again, suppose that the landmark set is Z = {zi }m 1 , and i= the whole sample set X is partitioned into m disjoint clusters Sk 's. Let c(i) be the function that maps each sample xi X to the closest landmark point zc(i) Z , i.e., c(i) = arg minj =1,2,··· ,m xi - zj . Our goal is to study the approximation error in (5): K F - E W -1 E , (6) E= where · F denotes the matrix Frobenious norm. ¨ 3. Error Analysis of the Nystrom Method ¨ In this section we analyze how the Nystrom approximation error depends on the choice of landmark points. We first provide an important observation (Section 3.1), and then derive the error bound in more general settings based on a "clustered" data model (Section 3.2-3.4). The error bound gives important insights on the design of efficient sampling schemes for accurate low-rank approximation (Section 3.5). 3.1. Observation Proposition 1. Given the data set X = {xi }n 1 , and the i= landmark point set Z = {zj }m 1 . Then the Nystrom recon¨ j= struction of the kernel entry K (xi , xj ) will be exact if there exist two landmark points such that zp = xi , and zq = xj . Proof. Let Kxk ,Z R1×m be the similarity between xk ¨ and the landmark points Z . Then the Nystrom reconstruc tion of the kernel entry will be Kxi ,Z W -1 Kxj ,Z , where m ×m is the kernel matrix defined on the landWR mark set Z . Let W (k) be the k th row of W , then we have Kxi ,Z = W (p) and Kxj ,Z = W (q) since xi = zp , and xj = zq . As a result, the reconstructed entry will be W (p) W -1 (W (q) ) = Wpq = K (zp , zq ) = K (xi , xj ). Proposition 1 indicates that the landmark points should be chosen to overlap sufficiently with the original data. However, it is often impossible to use a small landmark set to represent every sample point accurately. First, we consider the simpler notion of partial approximation error defined as follows. Definition 1. Suppose each cluster has T samples1 . Repeat the following sampling process T times: at each time t, pick one sample from each cluster, and denote the set of samples chosen at time t as XIt . Then X = {XI1 XI1 ... XIT }, and the whole kernel matrix will be correspondingly decomposed into T 2 blocks, each of size m × m. Let KIi ,Ij , and EIi ,Z be the m × m similarity matrices defined on (XIi , XIi ) and (XIi , Z ), respectively, and W Rm×m the kernel matrix defined on Z . The partial approximation error is the difference between KIi ,Ij and its Nystrom ap¨ proximation under the Frobenius norm EIi ,Ij = KIi ,Ij - EIi ,Z W -1 EIj ,Z F . (7) We assume the kernel k satisfies the following property: a , k (k (a, b) - k (c, d))2 CX - c 2+ b - d 2 a, b, c, d (8) k where CX is a constant depending on k and the sample set X . The validity of this assumption on a number of commonly used kernels will be proved in Section 3.4. Proposition 2. For kernel k satisfying property (8), the partial approximation error EIi ,Ij is bounded by m 2 k k mCX (eIi + eIj ) + CX eIi EIi ,Ij m k k CX eIj + mCX eIi eIj W -1 F . (9) + where eIi is the quantization error induced by coding each sample in XIi by the closest landmark point in Z , i.e., x 2 x eIi = . (10) i - zc(i) i XI i If cluster sizes differ, add "virtual samples" to each cluster such that all the clusters have the same size (which is equal to T = maxk |Sk |). The virtual samples added to cluster Sk are chosen as the landmark point zk for that cluster, so they will not induce extra quantization errors but will loosen the bound. 1 ¨ Improved Nystrom Low-Rank Approximation and Error Analysis Proof. We will first define the following matrices AIi ,Ij = KIi ,Ij - W ; BIi ,Z = EIi ,Z - W ; CIj ,Z = EIj ,Z - W, (11) and then show that they have bounded Frobenius norms. Without loss of generality, we specify the indices as follows: KIi ,Ij (p, q ) = k (xIi (p) , xIj (q) ); EIi ,Z (p, q ) = k (xIi (p) , zq ); EIj ,Z (p, q ) = k (xIj (p) , zq ); and W (p, q ) = k (zp , zq ). With property (8), we have AIi ,Ij 2 = F = pm k CX = k mCX k 2mCX Proof. Here we sum up the terms in (9) separately. 2 2 iT iT jT e k k mCX (eIi + eIj ) = mCX Ii + eIj ,j = 1 =1 =1 2 iT k mCX =1 pm ,q = 1 k (xIi (p) , xIj (q) ) - k (zp , zq ) - zp - zp , 2 + 2 x Ii (q ) 2 ,q = 1 m p e x Ii (p) - zq Ij (q ) =1 Ii + eIj x Ii (p) + qm 2 - zq 2 where e = j =1 eIj = xi - zc(i) 2 is the same i X as defined in proposition 3. Similarly, the second term (and the third term) in (9) can be summarized as T m m m iT jT i k k k T CX eIi = CX eIi CX eT ,j = 1 =1 =1 T T T eIi + x jT =1 eIj 2T m k CX T e =1 x The last term in (9) can be summarized as T where eIi is the same as that in (10) since c(I (q )) = q . For matrix BIi ,Z , we have pk 2 BIi ,Z 2 = (xIi (p) , zq ) - k (zp , zq ) F ,q i ,j = 1 i k k mCX eIi eIj W -1 F = mCX W -1 F =1 T eIi 2 k mCX W -1 F T e By combining all these terms, we arrive at Proposition 3. k mCX pm =1 x Ii (p) - zp 2 = k mCX eIi , k 3.4. CX Under Different Kernels k and similarly for matrix CIj ,Z CIj ,Z 2 mCX eIj . F Note that the partial approximation error EIi ,Ij (7) can be re-written as follows using (11). F W + AIi,Ij - (W + BIi,Z )W -1 (W + CIj,Z ) EIi,Ij F = W F = + AIi,Ij - W - CIj,Z - BIi,Z - BIi ,Z W -1 CIj ,Z In this section, we show that many commonly used kernel functions satisfy the propexty in (8). Consider the star , -y tionary kernel k (x, y ) = including the Gaus sian kernel () = exp(-2 ), Laplacian kernel () = exp(-), and inverse distance kernel () = ( + )-1 . By using the mean value theorem and triangular inequality, we have, for any a, b, c, d Rd , (k (a, b) - k (c, d))2 = ( ( a - b /)- ( c - d /)) 2 2 AIi,Ij F + BIi,Z F + CIj,Z F + BIi,Z F CIj,Z F W -1 F Using the bounds on AIi ,Ij , BIi ,Z , CIj ,Z together with the definition in (11), we have Proposition 2 3.3. Approximation Error of Complete Kernel Matrix With the estimated partial approximation error, we can now ¨ obtain a bound on the complete error for Nystrom approximation (6). The basic idea is to sum up the partial errors EIi ,Ij over all i, j = 1, 2, ..., T . Proposition 3. The error of the Nystrom approximation (6) ¨ is bounded by m k k (12) E 4T CX eT + mCX T e W -1 F total quantization error of coding each sample xi X with the closest landmark point zj Z . where T = max |Sk |, and e = k = [ ( )/ ]2 ( a - b - c - d ) . Let v1 = a - c and v2 = b - d. Note that we have c-d a - b + v1 + v2 and similarly a - b c - d + v1 + v2 . So a - b - c - d is always bounded by ( a-b - c-d ) 2 n i=1 x i - zc(i) 2 is the k So CX can be chosen as max[2 ( )/ ]2 which is often 1 1 k bounded (CX is 22 for the Gaussian, 2 for the Laplacian, 1 and 2 4 for the inverse distance). Similarly, for polynod mial kernels of the form k (x, y ) = ( x, y + ) , ( 2 (k (a, b) - k (c, d))2 = a b + )d - (c d + )d ( a-c + b-d ) a . 2 -c 2+ b-d 2 2 = (p ( )(a b - c d)) = (p ( ) ((a - c) b + (b - d) c)) ( a - c) b 2 + (b - d) c 2 [2p ( )]2 a , [2p ( )R]2 -c 2+ b-d 2 2 2 ¨ Improved Nystrom Low-Rank Approximation and Error Analysis where R is the larger one of the two quantities: the maximum pairwise distance between samples, and maximum distance between samples and the origin point; and p(z ) = k z d . So CX can be chosen as max[ ( )R]2 = d2 Rd . 3.5. Sampling Procedure The error bound in Proposition 3 provides important in¨ sights on how to choose the landmark points in the Nystrom method. As can be seen, consistently, that for a number of commonly used kernels, the most important factor that influences the approximation quality is e, the error of quantizing each of the samples in X with the closest landmark in ¨ Z . If this quantization error is zero, the Nystrom low-rank approximation of the kernel matrix will also be exact. This agrees well with the ideal case discussed in Section 3.1. Motivated by this observation and the fact that k -means clustering can find a local minimum of the quantization error (Gersho & Gray, 1992), we propose to use the centers obtained from the k -means as the landmark points. Here, k is the desired number of landmark points in Z . The larger the k , the more accurate the approximation though at the cost of higher computations. Despite its simplicity, the k means procedure can greatly improve the approximation quality compared to other sampling schemes, as will be demonstrated empirically in Section 4. Recent advances in speeding up the k -means algorithms (Elkan, 2003; Kanungo et al., 2001) also make this k -means-based sampling strategy particularly suitable for large-scale problems. position of the m × m matrix S = G G = V V . Proof can be found in (Fowlkes et al., 2004). Therefore low-rank approximation is useful for algorithms that rely on eigenvectors of the kernel matrix, such as kernel PCA ¨ (Scholkopf et al., 1998), Laplacian eigenmap (Belkin & Niyogi, 2002) and normalized cut. ¨ Note that the Nystrom method, when designed originally to solve integral equations, did not provide orthogonal approximations to the kernel eigenfunctions. Thanks to the matrix completion view (5) (Fowlkes et al., 2004; Williams ¨ & Seeger, 2001), the Nystrom method can be utilized for obtaining orthogonal eigenvectors (Proposition 4), though ¨ the time complexity increases from the simple Nystrom extension (4) of O(mn) to O(m2 n). In the experiments we focus on the orthogonalized eigenvector approximation. Table 1. Complexities of basis selection for the different methods. time space Ours O(mn) O(mn) ¨ Nystrom O(n) O(mn) Drineas O(n2 ) O(mn) IC D O(m2 n) O(mn) 4. Experiments This section presents empirical evaluations of the various low-rank approximation schemes. First, we discuss how the low rank approximation fits into different applications. One is to solve linear systems of the form (K + I )x = a, where K is the kernel matrix, 0 is a regularization parameter and I is the n × n identity matrix. Given the low-rank approximation K GG , the following holds (Williams & Seeger, 2001) by the Woodbury formula (K + I )-1 , 1I - G( I + G G)-1 G (13) We compare altogether five low-rank approximation algorithms, including: 1. incomplete Cholesky decomposition ¨ (ICD)2 ; 2. Nystrom method (with random sampling); 3. the method in (Drineas & Mahoney, 2005); 4. our method (for simplicity, the maximum number of k -means iterations is restricted to 10); 5. SVD. Note that SVD (or eigenvalue decomposition in our context) provides the best low-rank approximation in terms of both the Frobenius norm and spectral norm (Golub & Van Loan, 1996). The complexi¨ ties of basis selection (i.e., choosing E and W in Nystrom, or sampling the columns in (Drineas & Mahoney, 2005) and ICD) in the different algorithms are listed in Table 1. Evaluations are performed in the contexts of kernel matrix approximation (Section 4.1), kernel PCA (Section 4.2), and LS-SVM classification (Section 4.3). We use core(TM)dual PC with 2.13GHz CPU and the codes are in matlab. 4.1. Approximating the Kernel Matrix We first examine the performance of the low-rank approximation schemes by measuring their approximation errors (in terms of the Frobenius norm) on the kernel matrix. We choose a number of benchmark data sets from the LIBSVM archive3 , summarized in Table 2. Note that our approximation error bound in Proposition 3 applies to most kernel functions (Section 3.4), and preliminary experimental results with these kernels have shown the superiority of our sampling scheme compared with other low-rank approximation methods. However, due to lack of space, we 2 3 which only needs O(m2 n) time and O(mn) memory. Therefore, it can be used in speeding up the Gaussian processes (Williams & Seeger, 2001) and least-squares SVM (LS-SVM) (Suykens & Vandewalle, 1999). The second application is to reconstruct the eigen-system of a matrix approximated by its low-rank decomposition. Proposition 4. Given the low-rank approximation K GG , where G Rn×m and m n, the top m eigenvectors U of K can be obtained as U GV -1/2 in O(m2 n) time, where V , Rm×m are from the eigenvalue decom- http://www.di.ens.fr/fbach/kernel-ica/index.htm http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/ ¨ Improved Nystrom Low-Rank Approximation and Error Analysis low-rank approximation error () data size dimension data size dimension Table 2. A summary of data sets. ger m an splice adult1a 1000 1000 1605 24 60 123 segment w1a svmgd1a 2310 2477 3089 19 300 4 dna 2000 180 satimage 4435 36 35 probabilistic sampling random sampling k-means centers 30 25 20 will only report results for the Gaussian kernel K (x, y ) = exp(- x - y 2/). Here, is chosen as the average squared distance between data points and the mean of each data set. We gradually increase the subset size m from 1% to 10% of the data size. To reduce statistical variability, results of methods 2, 3, and 4 are based on averages over 20 repetitions. The approximation errors are plotted in Figure 1. As can be seen, our algorithm is only inferior to SVD on most data sets. Moreover, though the method in (Drineas & Mahoney, 2005) involves a more complicated probabilistic sampling scheme, its performance is only comparable or sometimes ¨ even worse than the Nystrom method with simple random sampling. Similar observations have also been reported in the context of SVD (Drineas et al., 2003). ICD seems to be inferior on several data sets. However, for data sets whose kernel spectra decay rapidly to zero4 (such as the segment, svmguide1a and satimage), ICD can also quickly attain performance comparable to others. We also examine empirically the relationship between E and e under different sampling schemes. Figure 2 reports the results on the german data, where m = 100 and each sampling scheme is repeated 100 times. As can be seen, there is a strong, positive correlation between E and e. This is observed on most data and agrees with our error analysis. 4.2. Kernel PCA In kernel PCA, the key step is to obtain eigenvectors of the 1 centered kernel matrix H K H , where H = I - n 11 n×n R . Following Proposition 2 of (Ouimet & Bengio, 2005), with the low-rank decomposition K GG , the centered kernel matrix can be written as (H G)(H G) or ¯ ¯ ¯ (G - G)(G - G) , where G Rn×m and all its rows equal to the mean of rows in G. Hence the top m eigenvectors can be obtained in O(m2 n) time according to Proposition 4. We evaluate the low rank approximation schemes by the embedding onto the top 3 principal directions. We align ~ the approximate embeddings (U ) with the standard KPCA embedding (U ) through a linear transform, and report the 4 Note that the (squared) rank-m approximation error of SVD n 2 is i=m+1 i , where i 's are the singular values of K sorted in descending order (Golub & Van Loan, 1996). Therefore, if SVD's error in Figure 1 drops rapidly, so does the spectrum of K . 15 3000 4000 5000 6000 quantization error (e) 7000 Figure 2. Low-rank approximation error versus quantization error for different sampling schemes. ~ minimum misalignment error: minAR3×3 U - U A F . The parameter setting is the same as in Section 4.1, except that we fix m = 0.05n for all the low-rank decomposition algorithms. Again, results of methods 2, 3, 4 in Table 3 are averaged over 20 repetitions. As we can seen, our algorithm is the best on most data sets, next comes the ¨ standard Nystrom and the method by (Drineas & Mahoney, 2005). The time consumptions of all low-rank approximation schemes are significantly lower than SVD. 4.3. Least Squares SVM Given the kernel matrix K , the training labels y {±1}n×1 , and the regularizn tion parameter C > 0, the LSa SVM classifier f (x) = i=1 i (x, xi ) + b is solved by b = y M -1 1/y M -1 y , and = M -1 (1 - by ), where 1 is a vector of all ones, and M = Y (K + I /C )Y and Y = diag(y ). Note that M -1 = Y (K + I /C )-1 Y can be computed efficiently using (13). We evaluate different low-rank approximation schemes in LS-SVM, using some difficult pairs of the USPS digits5 . We use Gaussian kernel exp(- x - y 2/) and C = 0.5. Table 4 reports the classification performance of the standard LS-SVM, and those with different low-rank approximation schemes, at m = 0.05n and 0.1n. Again, methods 2, 3, 4 are repeated 20 times. For m = 0.05n, our approach is significantly better than methods 1,2,3 with a confidence level that is at least 99.5%. For m = 0.1n, ours is also better with a confidence level that is at least 97.5% on the first 7 pairs. For the last 4 pairs, the differences between our approach and methods 1,2,3 are not statistically significant. Note, however, that the testing errors obtained by the various approximation algorithms on these 4 pairs are all close to those of the exact LS-SVM, i.e., all approximation algorithms have reached their possibly best performance. 5 ftp://ftp.kyb.tuebingen.mpg.de/pub/bs/data/ ¨ Improved Nystrom Low-Rank Approximation and Error Analysis 150 approximation error approximation error approximation error approximation error 100 ICD Nystrom Drineas Ours SVD 80 60 40 20 ICD Nystrom Drineas Ours SVD 150 100 ICD Nystrom Drineas Ours SVD 100 90 80 70 60 50 40 30 ICD Nystrom Drineas Ours SVD 50 50 0.02 0.04 0.06 ratio (m/n) 0.08 0.1 0.02 0.04 0.06 ratio (m/n) 0.08 0.1 0.02 0.04 0.06 ratio (m/n) 0.08 0.1 0.02 0.04 0.06 ratio (m/n) 0.08 0.1 (a) german 250 200 approximation error (b) splice ICD Nystrom Drineas Ours SVD 400 200 approximation error approximation error (c) adult1a ICD Nystrom Drineas Ours SVD 500 approximation error (d) dna ICD Nystrom Drineas Ours SVD 300 200 100 ICD Nystrom Drineas Ours SVD 0.02 0.04 0.06 ratio (m/n) 150 100 50 0 0.02 0.04 0.06 ratio (m/n) 150 100 50 0 400 300 200 100 0.08 0.1 0.08 0.1 0.02 0.04 0.06 ratio (m/n) 0.08 0.1 0 0.02 0.04 0.06 ratio (m/n) 0.08 0.1 (e) segment (f) w1a (g) svmguide1a (h) satimage Figure 1. Approximation errors (in terms of the Frobenius norm) on the kernel matrix by different low-rank approximation schemes. Table 3. Approximation errors and CPU time consumed for the different low-rank approximation schemes in the context of kernel PCA. Due to the lack of space, we do not show the standard deviation of the CPU time. data german splice adult1a dna segment w1a svmguide1a satimage Ours (4.40±0.58)×10-2 (3.44±0.43)×10-1 (4.41±0.49)×10-2 (1.88±0.21)×10-1 (7.87±4.43)×10-4 (1.55±0.78)×10-1 (5.16±2.12)×10-4 (5.20±0.97)×10-4 approximation error ¨ Nystrom Drineas (2.64±0.58)×10-1 (2.71±0.34)×10-1 0 (1.06±0.11)×10 (1.07±0.11)×100 (2.86±0.42)×10-1 (2.84±0.66)×10-1 (1.09±0.08)×100 (1.01±0.14)×100 (8.37±4.08)×10-3 (1.84±0.99)×10-2 (2.81±0.62)×10-1 (6.05±3.39)×10-1 (3.71±2.26)×10-3 (2.78±1.60)×10-2 (6.19±0.28)×10-3 (6.80±1.01)×10-2 ICD 5.11×10-1 1.27×100 6.19×10-1 1.17×100 2.37×10-2 1.11×100 5.07×10-4 2.47×10-2 SVD 27.6 24.2 134.8 197.0 322.8 394.0 650.4 2762.8 CPU time (seconds) ¨ Ours Nystrom Drineas 0.8 0.03 0.3 0.9 0.05 0.6 3.0 0.2 4.0 6.6 0.5 10.6 4.2 0.3 1.8 12.8 1.8 35.3 6.7 0.5 2.4 16.1 1.5 15.9 ICD 0.09 0.1 0.7 1.5 1.0 3.6 2.3 7.5 5. Conclusion ¨ The Nystrom method is a useful technique for low-rank approximation. However, analysis on its key step of choosing the landmark points and especially that in terms of approximation quality is still limited. In this paper, we draw an intuitive but important connection between the ¨ Nystrom approximation quality and the encoding capacities of landmark points. Our analysis suggests the k-means as a natural sampling scheme. Despite its simplicity, the k -means-based sampling gives encouraging performance when empirically compared with state-of-the-art low-rank approximation techniques. One future direction is to utilize label/side information for task-specific decomposition, where one excellent example is (Bach & Jordan, 2005) in the context of incomplete Cholesky decomposition. Region under grant 614907. References Achlioptas, D., & McSherry, F. (2001). Fast computation of low rank matrix approximations. Proceedings of the 23th Annual ACM Symposium on Theory of Computing (pp. 611 ­ 618). Bach, F., & Jordan, M. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1­48. Bach, F., & Jordan, M. (2005). Predictive low-rank decomposition for kernel methods. Proceedings of the 22th International Conference on Machine Learning (pp. 33 ­ 40). Baker, C. (1977). The numerical treatment of integral equations. Oxford: Clarendon Press. Belkin, M., & Niyogi, P. (2002). Laplacian eigenmaps and Acknowledgments This research has been partially supported by the Research Grants Council of the Hong Kong Special Administrative ¨ Improved Nystrom Low-Rank Approximation and Error Analysis Table 4. Testing errors (in %) on different pairs of the USPS digits. data 1-7 2-3 2-5 3-8 5-8 6-8 8-9 2-7 3-5 4-7 5-6 LS-SVM 0.97 2.47 1.67 2.71 2.76 0.59 1.45 1.44 4.91 2.88 1.21 SVD 0.97 2.47 0.83 3.31 2.14 0.59 1.16 0.86 4.91 2.59 0.60 m = 0.05n ¨ Ours Nystrom 0.77±0.09 1.56±0.51 2.91±0.55 4.48±1.51 1.91±0.62 2.82±1.10 3.93±0.84 5.64±1.45 3.38±0.65 4.34±1.52 1.04±0.35 2.69±1.34 1.04±0.50 2.79±2.00 0.90±0.10 1.47±0.64 6.53±1.18 7.39±1.39 2.54±0.40 3.28±0.96 1.57±0.63 2.79±1.25 Drineas 2.69±1.64 4.54±1.16 2.98±0.73 5.52±1.21 4.28±1.38 2.24±0.98 2.52±0.85 2.45±1.25 7.03±1.25 3.30±1.03 2.52±1.33 IC D 4.62 4.12 5.30 6.02 4.29 5.65 2.33 3.76 7.36 8.06 3.93 SVD 0.97 2.47 1.67 2.71 2.14 0.89 1.45 0.86 3.98 2.59 1.21 Ours 0.81±0.11 2.72±0.22 1.29±0.31 2.70±0.23 2.66±0.34 0.61±0.25 1.13±0.23 0.96±0.17 5.25±0.64 2.60±0.26 1.27±0.32 m = 0.1n ¨ Nystrom 0.94±0.18 2.77±0.59 1.57±0.46 3.78± 0.53 2.82±0.53 1.03±0.45 1.22±0.51 0.87±0.09 5.22±0.77 2.41±0.36 1.25±0.41 Drineas 1.54±0.35 2.97±0.59 1.58±0.51 3.97±0.55 3.12±0.58 1.18±0.44 1.79±0.37 1.08±0.39 5.49±0.83 2.43±0.45 1.48±0.29 IC D 1.22 4.94 2.23 4.21 4.60 5.05 2.91 2.61 5.52 5.47 2.12 spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems 14. Drineas, P., Drinea, E., & Huggins, P. (2003). An experimental evaluation of a Monte-Carlo algorithm for singular value decomposition. Proceedings of 8th Panhellenic Conference on Informatics (pp. 279­296). ¨ Drineas, P., & Mahoney, M. W. (2005). On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153­2175. Elkan, E. (2003). Using the triangular inequality to accelerate k -means. Proceedings of the 21th International Conference on Machine Learning (pp. 147 ­ 153). Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243 ­ 264. Fowlkes, C., Belongie, S., Chung, F., & Malik, J. (2004). ¨ Spectral grouping using the Nystrom method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 214­225. Frieze, A., Kannan, R., & Vempala, S. (1998). Fast MonteCarlo algorithms for finding low-rank approximations. Proceedings of the 39th Annual Symposium on Foundations of Computer Science (pp. 370 ­ 378). Gersho, A., & Gray, R. (1992). Vector quantization and signal compression. Boston: Kluwer Academic Press. Golub, G., & Van Loan, C. (1996). Matrix computations. Johns Hopkins University Press. 3rd edition. Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y. (2001). An efficient k -means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 881 ­ 892. Lawrence, N. Seeger, M., & Herbrich, R. (2003). Fast sparse Gaussian process methods: The informative vector machine. Advances in Neural Information Processing Systems. (pp. 625­632). MIT Press. Ouimet, M., & Bengio, Y. (2005). Greedy spectral embedding. Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 253­260). Platt, J. C. (2005). Fastmap, MetricMap, and Landmark ¨ MDS are all Nystrom algorithms. Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 261­268). ¨ ¨ Scholkopf, B., Smola, A., & Muller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299­1319. ¨ Smola, A., & Scholkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proceedings of the 17th International Conference on Machine Learning (pp. 911 ­ 918). Suykens, J., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293­300. Williams, C., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. Proceedings of the 17th International Conference on Machine Learning (pp. 1159­1166). ¨ Williams, C., & Seeger, M. (2001). Using the Nystrom method to speed up kernel machines. Advances in Neural Information Processing Systems 13 (pp. 682 ­ 688). Zhang, K., & Kwok, J. (2006). Block-quantized kernel matrix for fast spectral embedding. Proceedings of the 23rd international conference on Machine learning (pp. 1097 ­ 1104).