Performance analysis for L2 kernel classification

JooSeuk Kim, Department of EECS, University of Michigan, Ann Arbor, MI, USA, stannum@umich.edu
Clayton D. Scott, Department of EECS, University of Michigan, Ann Arbor, MI, USA, clayscot@umich.edu

Abstract

We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution-free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation.

1 Introduction

In the classification problem we are given realizations $(x_1, y_1), \ldots, (x_n, y_n)$ of a jointly distributed pair $(X, Y)$, where $X \in \mathbb{R}^d$ is a pattern and $Y \in \{-1, +1\}$ is a class label. The goal of classification is to build a classifier, i.e., a function taking $X$ as input and outputting a label, such that some measure of performance is optimized. Kernel classifiers [1] are an important family of classifiers that have drawn much recent attention for their ability to represent nonlinear decision boundaries and to scale well with increasing dimension $d$. A kernel classifier (without offset) has the form
$$
g(x) = \mathrm{sign}\Big( \sum_{i=1}^{n} \alpha_i y_i k(x, x_i) \Big),
$$
where the $\alpha_i$ are parameters and $k$ is a kernel function. For example, support vector machines (SVMs) without offset have this form [2], as does the standard kernel density estimate (KDE) plug-in rule.

Recently Kim and Scott [3] introduced an L2 or integrated squared error (ISE) criterion to design the coefficients $\alpha_i$ of a kernel classifier with Gaussian kernel. Their L2 classifier performs comparably to SVMs, and can also be obtained through the solution of a quadratic program. Like the SVM, the classifiers are sparse, meaning most of the coefficients $\alpha_i = 0$, which has advantages for representation and evaluation efficiency. In addition, and unlike the SVM, there are no free parameters to be set by the user except the kernel bandwidth parameter.

In this paper we develop statistical performance guarantees for the L2 kernel classifier introduced in [3]. The linchpin of our analysis is a new concentration inequality bounding the deviation of a cross-validation based ISE estimate from the true ISE. This bound is then applied to prove an oracle inequality and consistency in both ISE and probability of error. In addition, as a special case of our analysis, we are able to deduce performance guarantees for the method of L2 kernel density estimation described in [4, 5].

Both authors were supported in part by NSF Grant CCF-0830490.
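For concreteness, here is a minimal numpy sketch of how a classifier of this form is evaluated, using the Gaussian kernel adopted later in the paper (a sketch of our own; the function names and array conventions are assumptions, not code from [3]):

```python
import numpy as np

def gaussian_kernel(x, Xi, sigma):
    """k_sigma(x, X_i) = (2*pi*sigma^2)^(-d/2) * exp(-||x - X_i||^2 / (2*sigma^2))."""
    d = np.shape(x)[-1]
    sq_dist = np.sum((x - Xi) ** 2, axis=-1)
    return (2 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq_dist / (2 * sigma ** 2))

def kernel_classifier(x, X, y, alpha, sigma):
    """Evaluate g(x) = sign( sum_i alpha_i * y_i * k_sigma(x, X_i) ) for one test pattern x."""
    scores = alpha * y * gaussian_kernel(x, X, sigma)  # one term per training point (X is n-by-d)
    return 1 if scores.sum() >= 0 else -1
```

The kernel methods mentioned above differ only in how the coefficients $\alpha_i$ are chosen; the L2 kernel classifier of [3] chooses them by the ISE criterion developed in Section 2.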
The ISE criterion has a long history in the literature on bandwidth selection for kernel density estimation [6] and more recently in parametric estimation [7]. The use of ISE for optimizing the weights of a KDE via quadratic programming was first described in [4] and later rediscovered in [5]. In [8], an $\ell_1$-penalized ISE criterion was used to aggregate a finite number of pre-determined densities. Linear and convex aggregation of densities, based on an L2 criterion, are studied in [9], where the densities are based on a finite dictionary or an independent sample. In classification, some connections relating SVMs and ISE are made in [10], although no new algorithms are proposed. Finally, the "difference of densities" perspective has been applied to classification in other settings by [11], [12], and [13]. In [11] and [13], a difference of densities is used to find smoothing parameters or kernel bandwidths. In [12], conditional densities are chosen among a parameterized set of densities to maximize the average (bounded) density differences.

Section 2 reviews the L2 kernel classifier and presents a slight modification needed for our analysis. Our results are presented in Section 3. Conclusions are offered in the final section, and proofs are gathered in an appendix.

2 L2 Kernel Classification

We review the previous work of Kim & Scott [3] and introduce an important modification. For convenience, we relabel $Y$ so that it belongs to $\{1, -\gamma\}$ and denote $I_+ = \{i \mid Y_i = +1\}$ and $I_- = \{i \mid Y_i = -\gamma\}$. Let $f_-(x)$ and $f_+(x)$ denote the class-conditional densities of the pattern given the label. From decision theory, the optimal classifier has the form
$$
g^*(x) = \mathrm{sign}\{ f_+(x) - \gamma f_-(x) \}, \qquad (1)
$$
where $\gamma$ incorporates prior class probabilities and class-conditional error costs (in the Bayesian setting) or a desired tradeoff between false positives and false negatives [14]. Denote the "difference of densities" $d_\gamma(x) := f_+(x) - \gamma f_-(x)$.

The class-conditional densities are modelled using the Gaussian kernel as
$$
f_+(x; \alpha) = \sum_{i \in I_+} \alpha_i k_\sigma(x, X_i), \qquad f_-(x; \alpha) = \sum_{i \in I_-} \alpha_i k_\sigma(x, X_i),
$$
with constraints $\alpha = (\alpha_1, \ldots, \alpha_n) \in A$, where
$$
A = \Big\{ \alpha \;\Big|\; \sum_{i \in I_+} \alpha_i = 1, \; \sum_{i \in I_-} \alpha_i = 1, \; \alpha_i \ge 0 \;\; \forall i \Big\}.
$$
The Gaussian kernel is defined as
$$
k_\sigma(x, X_i) = (2\pi\sigma^2)^{-d/2} \exp\Big( -\frac{\|x - X_i\|^2}{2\sigma^2} \Big).
$$
The ISE associated with $\alpha$ is
$$
\mathrm{ISE}(\alpha) = \| d_\gamma(x; \alpha) - d_\gamma(x) \|_{L_2}^2 = \int \big( d_\gamma(x; \alpha) - d_\gamma(x) \big)^2 dx
= \int d_\gamma(x; \alpha)^2 \, dx - 2 \int d_\gamma(x; \alpha) \, d_\gamma(x) \, dx + \int d_\gamma(x)^2 \, dx ,
$$
where $d_\gamma(x;\alpha) := f_+(x;\alpha) - \gamma f_-(x;\alpha)$. Since we do not know the true $d_\gamma(x)$, we need to estimate the second term in the above equation,
$$
H(\alpha) \triangleq \int d_\gamma(x; \alpha) \, d_\gamma(x) \, dx , \qquad (2)
$$
by $\widehat{H}_n(\alpha)$, which will be explained in detail in Section 2.1. Then, the empirical ISE is
$$
\widehat{\mathrm{ISE}}(\alpha) = \int d_\gamma(x; \alpha)^2 \, dx - 2 \widehat{H}_n(\alpha) + \int d_\gamma(x)^2 \, dx . \qquad (3)
$$
Now, $\widehat{\alpha}$ is defined as
$$
\widehat{\alpha} = \arg\min_{\alpha \in A} \widehat{\mathrm{ISE}}(\alpha) \qquad (4)
$$
and the final classifier will be
$$
g(x) = \begin{cases} +1, & d_\gamma(x; \widehat{\alpha}) \ge 0 \\ -1, & d_\gamma(x; \widehat{\alpha}) < 0 . \end{cases}
$$

2.1 Estimation of H(α)

In this section, we propose a method of estimating $H(\alpha)$ in (2). The basic idea is to view $H(\alpha)$ as an expectation and estimate it using a sample average. In [3], the resubstitution estimator for $H(\alpha)$ was used. However, since this estimator is biased, we use a leave-one-out cross-validation (LOOCV) estimator, which is unbiased and facilitates our theoretical analysis. Note that the difference of densities can be expressed as
$$
d_\gamma(x; \alpha) = f_+(x; \alpha) - \gamma f_-(x; \alpha) = \sum_{i=1}^{n} \alpha_i Y_i k_\sigma(x, X_i) .
$$
Then,
$$
H(\alpha) = \int d_\gamma(x; \alpha) \, d_\gamma(x) \, dx
= \int d_\gamma(x; \alpha) f_+(x) \, dx - \gamma \int d_\gamma(x; \alpha) f_-(x) \, dx
= \sum_{i=1}^{n} \alpha_i Y_i \Big( \int k_\sigma(x, X_i) f_+(x) \, dx - \gamma \int k_\sigma(x, X_i) f_-(x) \, dx \Big)
= \sum_{i=1}^{n} \alpha_i Y_i \, h(X_i) ,
$$
where
$$
h(X_i) \triangleq \int k_\sigma(x, X_i) f_+(x) \, dx - \gamma \int k_\sigma(x, X_i) f_-(x) \, dx . \qquad (5)
$$
We estimate each $h(X_i)$ in (5) for $i = 1, \ldots, n$ using leave-one-out cross-validation:
$$
\widehat{h}_i = \begin{cases}
\dfrac{1}{n_+ - 1} \displaystyle\sum_{j \in I_+,\, j \ne i} k_\sigma(X_j, X_i) - \dfrac{\gamma}{n_-} \displaystyle\sum_{j \in I_-} k_\sigma(X_j, X_i), & i \in I_+ \\[3mm]
\dfrac{1}{n_+} \displaystyle\sum_{j \in I_+} k_\sigma(X_j, X_i) - \dfrac{\gamma}{n_- - 1} \displaystyle\sum_{j \in I_-,\, j \ne i} k_\sigma(X_j, X_i), & i \in I_- ,
\end{cases}
$$
where $n_+ = |I_+|$ and $n_- = |I_-|$. Then, the estimate of $H(\alpha)$ is $\widehat{H}_n(\alpha) = \sum_{i=1}^{n} \alpha_i Y_i \widehat{h}_i$.
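To make the estimator concrete, the following numpy sketch computes the leave-one-out quantities $\widehat{h}_i$ and the resulting $\widehat{H}_n(\alpha)$, transcribing the formulas above (a sketch of our own; function names and array conventions are assumptions, not code from [3]):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Pairwise Gaussian kernels K[j, i] = k_sigma(X_j, X_i) for an (n, d) data matrix X."""
    d = X.shape[1]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (2 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (2 * sigma ** 2))

def loocv_h(X, y, gamma, sigma):
    """Leave-one-out estimates h_hat_i of h(X_i); y holds the relabeled Y_i in {+1, -gamma}."""
    K = gaussian_kernel_matrix(X, sigma)
    pos = y > 0                                   # membership in I_+
    n_pos, n_neg = int(pos.sum()), int((~pos).sum())
    h = np.empty(len(y))
    for i in range(len(y)):
        # Leave X_i out of the average over its own class only.
        pos_sum = K[pos, i].sum() - (K[i, i] if pos[i] else 0.0)
        neg_sum = K[~pos, i].sum() - (0.0 if pos[i] else K[i, i])
        if pos[i]:
            h[i] = pos_sum / (n_pos - 1) - gamma * neg_sum / n_neg
        else:
            h[i] = pos_sum / n_pos - gamma * neg_sum / (n_neg - 1)
    return h

def H_n(alpha, y, h):
    """Estimate H_n(alpha) = sum_i alpha_i * Y_i * h_hat_i."""
    return float(np.sum(alpha * y * h))
```

Because $\widehat{H}_n(\alpha)$ is linear in $\alpha$, it enters the optimization of the next subsection only through the fixed coefficients $c_i = Y_i \widehat{h}_i$.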
2.2 Optimization

The optimization problem (4) can be formulated as a quadratic program. The first term in (3) is
$$
\int d_\gamma(x; \alpha)^2 \, dx = \int \Big( \sum_{i=1}^{n} \alpha_i Y_i k_\sigma(x, X_i) \Big)^2 dx
= \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j \int k_\sigma(x, X_i) \, k_\sigma(x, X_j) \, dx
= \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j \, k_{\sqrt{2}\sigma}(X_i, X_j)
$$
by the convolution theorem for Gaussian kernels [15]. As we have seen in Section 2.1, the second term $\widehat{H}_n(\alpha)$ in (3) is linear in $\alpha$ and can be expressed as $\sum_{i=1}^{n} \alpha_i c_i$, where $c_i = Y_i \widehat{h}_i$. Finally, since the third term does not depend on $\alpha$, the optimization problem (4) becomes the following quadratic program (QP):
$$
\widehat{\alpha} = \arg\min_{\alpha \in A} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j \, k_{\sqrt{2}\sigma}(X_i, X_j) - \sum_{i=1}^{n} c_i \alpha_i . \qquad (6)
$$
The QP (6) is similar to the dual QP of the 2-norm SVM with hinge loss [2] and can be solved by a variant of the Sequential Minimal Optimization (SMO) algorithm [3].

3 Statistical performance analysis

In this section, we give a theoretical performance analysis of the proposed method. We assume that $\{X_i\}_{i \in I_+}$ and $\{X_i\}_{i \in I_-}$ are i.i.d. realizations from $f_+$ and $f_-$, respectively. Here, we treat $n_+$ and $n_-$ as deterministic variables such that $n_+ \to \infty$ and $n_- \to \infty$ as $n \to \infty$.

3.1 Concentration inequality for $\widehat{H}_n(\alpha)$

Lemma 1. Conditioned on $X_i$, $\widehat{h}_i$ is an unbiased estimator of $h(X_i)$, i.e.,
$$
\mathbb{E}\big[ \widehat{h}_i \mid X_i \big] = h(X_i) .
$$
Furthermore, for any $\epsilon > 0$,
$$
P\Big\{ \sup_{\alpha \in A} \big| \widehat{H}_n(\alpha) - H(\alpha) \big| > \epsilon \Big\} \le 2n \Big( e^{-c(n_+ - 1)\epsilon^2} + e^{-c(n_- - 1)\epsilon^2} \Big),
$$
where $c = 2(2\pi\sigma^2)^d / (1+\gamma)^4$.

Lemma 1 implies that $\widehat{H}_n(\alpha) \to H(\alpha)$ almost surely, uniformly over $\alpha \in A$, provided that $\sigma$, $n_+$, and $n_-$ evolve as functions of $n$ such that $n_+ \sigma^{2d}/\ln n \to \infty$ and $n_- \sigma^{2d}/\ln n \to \infty$.

3.2 Oracle inequality

Next, we establish an oracle inequality, which relates the performance of our estimator to that of the best possible kernel classifier.

Theorem 1. Let $\epsilon > 0$ and set $\delta = \delta(\epsilon) = 2n \big( e^{-c(n_+ - 1)\epsilon^2} + e^{-c(n_- - 1)\epsilon^2} \big)$, where $c = 2(2\pi\sigma^2)^d/(1+\gamma)^4$. Then, with probability at least $1 - \delta$,
$$
\mathrm{ISE}(\widehat{\alpha}) \le \inf_{\alpha \in A} \mathrm{ISE}(\alpha) + 4\epsilon .
$$

Proof. From Lemma 1, with probability at least $1 - \delta$,
$$
\big| \widehat{\mathrm{ISE}}(\alpha) - \mathrm{ISE}(\alpha) \big| \le 2\epsilon \qquad \forall \alpha \in A ,
$$
by using the fact that $\widehat{\mathrm{ISE}}(\alpha) - \mathrm{ISE}(\alpha) = 2\big( H(\alpha) - \widehat{H}_n(\alpha) \big)$. Then, with probability at least $1 - \delta$, for all $\alpha \in A$ we have
$$
\mathrm{ISE}(\widehat{\alpha}) \le \widehat{\mathrm{ISE}}(\widehat{\alpha}) + 2\epsilon \le \widehat{\mathrm{ISE}}(\alpha) + 2\epsilon \le \mathrm{ISE}(\alpha) + 4\epsilon ,
$$
where the second inequality holds from the definition of $\widehat{\alpha}$. This proves the theorem.

3.3 Consistency

Next, we have a theorem stating that $\mathrm{ISE}(\widehat{\alpha})$ converges to zero in probability.

Theorem 2. Suppose that for $f = f_+$ and $f_-$, the Hessian $H_f(x)$ exists and each entry of $H_f(x)$ is piecewise continuous and square integrable. If $\sigma$, $n_+$, and $n_-$ evolve as functions of $n$ such that $\sigma \to 0$, $n_+ \sigma^{2d}/\ln n \to \infty$, and $n_- \sigma^{2d}/\ln n \to \infty$, then $\mathrm{ISE}(\widehat{\alpha}) \to 0$ in probability as $n \to \infty$.

This result intuitively follows from the oracle inequality, since the standard Parzen window density estimate is consistent and the uniform weights belong to the constraint set $A$. The rigorous proof is presented in Appendix A.2.

In classification, we are ultimately interested in minimizing the probability of error. The next theorem says the L2 kernel classifier is consistent with respect to the probability of error.

Theorem 3. In addition to the assumptions in Theorem 2, assume that $n_-/n_+ \to (1-p)/p$ as $n \to \infty$, where $p$ is the prior probability of the positive class. If $\gamma$ is set to $\gamma = n_-/n_+$, then the L2 kernel classifier is consistent. In other words, given training data $D_n = ((X_1, Y_1), \ldots, (X_n, Y_n))$, the classification error
$$
L_n = P\big\{ g_n(X; \widehat{\alpha}) \ne Y \mid D_n \big\}
$$
converges to the Bayes error $L^*$ in probability as $n$ goes to $\infty$. The proof is given in Appendix A.3.
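Putting the pieces together, the following sketch assembles the whole procedure: relabel the classes, set $\gamma = n_-/n_+$ as in Theorem 3, form the coefficients $c_i$ from the leave-one-out estimates of Section 2.1, solve the QP (6), and return the plug-in classifier of Section 2. The authors solve (6) with an SMO variant; here SciPy's generic SLSQP solver stands in, and the function names are our own, so this is a sketch under those assumptions rather than the reference implementation:

```python
import numpy as np
from scipy.optimize import minimize

def l2_kernel_classifier(X, labels, sigma):
    """Train the L2 kernel classifier sketch; X is (n, d), labels is an array in {+1, -1}."""
    n, d = X.shape
    pos = labels > 0
    n_pos, n_neg = int(pos.sum()), int((~pos).sum())
    gamma = n_neg / n_pos                            # choice of gamma from Theorem 3
    y = np.where(pos, 1.0, -gamma)                   # relabeled Y_i in {+1, -gamma}

    # Gaussian kernel matrices with bandwidths sigma and sqrt(2)*sigma (convolution theorem).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = (2 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (2 * sigma ** 2))
    K2 = (4 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (4 * sigma ** 2))

    # Leave-one-out estimates h_hat_i (Section 2.1) and linear coefficients c_i = Y_i * h_hat_i.
    S_pos, S_neg, diag = K[pos].sum(axis=0), K[~pos].sum(axis=0), np.diag(K)
    h = np.where(pos,
                 (S_pos - diag) / (n_pos - 1) - gamma * S_neg / n_neg,
                 S_pos / n_pos - gamma * (S_neg - diag) / (n_neg - 1))
    c = y * h

    # QP (6): minimize 0.5 * a' Q a - c' a over the constraint set A.
    Q = np.outer(y, y) * K2
    cons = [{"type": "eq", "fun": lambda a: a[pos].sum() - 1.0},
            {"type": "eq", "fun": lambda a: a[~pos].sum() - 1.0}]
    a0 = np.where(pos, 1.0 / n_pos, 1.0 / n_neg)     # uniform weights: a feasible start
    res = minimize(lambda a: 0.5 * a @ Q @ a - c @ a, a0, jac=lambda a: Q @ a - c,
                   bounds=[(0.0, None)] * n, constraints=cons, method="SLSQP")
    alpha = res.x

    def classify(x):
        """g(x) = +1 if d_gamma(x; alpha_hat) >= 0, else -1."""
        k = (2 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(
            -np.sum((x - X) ** 2, axis=-1) / (2 * sigma ** 2))
        return 1 if float(np.sum(alpha * y * k)) >= 0.0 else -1

    return classify
```

Usage is then simply `classify = l2_kernel_classifier(X_train, labels_train, sigma)` followed by `classify(x)`; as noted in the introduction, the bandwidth $\sigma$ is the only free parameter.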
3.4 Application to density estimation

By setting $\gamma = 0$, we recover the L2 kernel density estimate of [4, 5] with a slight modification. In this setting, the goal is to estimate $f_+$. Our concentration inequality, oracle inequality, and L2 consistency result immediately extend to provide the same performance guarantees for this method. In particular, we state the following corollary about L2 consistency of the density estimate.

Corollary 1. Suppose that the Hessian $H_f(x)$ of a density function $f(x)$ exists and each entry of $H_f(x)$ is piecewise continuous and square integrable. Given an i.i.d. sample $X_1, \ldots, X_n$ from $f(x)$, consider the kernel density estimate of $f(x)$ with variable weights
$$
f(x; \widehat{\alpha}) = \sum_{i=1}^{n} \widehat{\alpha}_i k_\sigma(x, X_i),
$$
where the $\widehat{\alpha}_i$'s are defined as
$$
\widehat{\alpha} = \arg\min_{\alpha_i \ge 0,\; \sum_{i=1}^{n} \alpha_i = 1} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, k_{\sqrt{2}\sigma}(X_i, X_j) - 2 \sum_{i=1}^{n} \alpha_i \frac{1}{n-1} \sum_{j \ne i} k_\sigma(X_i, X_j) .
$$
If $\sigma \to 0$ and $n\sigma^{2d}/\ln n \to \infty$ as $n$ goes to $\infty$, then
$$
\int \big( f(x; \widehat{\alpha}) - f(x) \big)^2 dx \to 0
$$
in probability.

4 Conclusion

Through the development of a novel concentration inequality, we have established statistical performance guarantees on a recently introduced L2 kernel classifier. We view the relatively clean analysis of this classifier as an attractive feature relative to other kernel methods. In future work, we hope to invoke the full power of the oracle inequality to obtain adaptive rates of convergence, and consistency for $\sigma$ not necessarily tending to zero.

A Appendix

A.1 Proof of Lemma 1

Note that for any given $i$, conditioned on $X_i$ the variables $(k_\sigma(X_j, X_i))_{j \ne i}$ are independent and take values in $[0, M]$, where $M = (2\pi\sigma^2)^{-d/2}$. For random vectors $Z \sim f_+(x)$ and $W \sim f_-(x)$, $h(X_i)$ in (5) can be expressed as
$$
h(X_i) = \mathbb{E}[ k_\sigma(Z, X_i) \mid X_i ] - \gamma \, \mathbb{E}[ k_\sigma(W, X_i) \mid X_i ] .
$$
Since $X_i \sim f_+(x)$ for $i \in I_+$ and $X_i \sim f_-(x)$ for $i \in I_-$, it can be easily shown that $\mathbb{E}[\widehat{h}_i \mid X_i] = h(X_i)$.

For $i \in I_+$,
$$
P\Big\{ \big| \widehat{h}_i - h(X_i) \big| > \epsilon \;\Big|\; X_i = x \Big\}
\le P\Big\{ \Big| \frac{1}{n_+ - 1} \sum_{j \in I_+,\, j \ne i} k_\sigma(X_j, X_i) - \mathbb{E}[k_\sigma(Z, X_i) \mid X_i] \Big| > \frac{\epsilon}{1+\gamma} \;\Big|\; X_i = x \Big\}
+ P\Big\{ \gamma \Big| \frac{1}{n_-} \sum_{j \in I_-} k_\sigma(X_j, X_i) - \mathbb{E}[k_\sigma(W, X_i) \mid X_i] \Big| > \frac{\gamma\epsilon}{1+\gamma} \;\Big|\; X_i = x \Big\} . \qquad (7)
$$
By Hoeffding's inequality [16], the first term in (7) is
$$
P\Big\{ \Big| \sum_{j \in I_+,\, j \ne i} k_\sigma(X_j, X_i) - (n_+ - 1)\, \mathbb{E}[k_\sigma(Z, X_i) \mid X_i] \Big| > \frac{(n_+ - 1)\epsilon}{1+\gamma} \;\Big|\; X_i = x \Big\}
= P\Big\{ \Big| \sum_{j \in I_+,\, j \ne i} k_\sigma(X_j, X_i) - \mathbb{E}\Big[ \sum_{j \in I_+,\, j \ne i} k_\sigma(X_j, X_i) \;\Big|\; X_i \Big] \Big| > \frac{(n_+ - 1)\epsilon}{1+\gamma} \;\Big|\; X_i = x \Big\}
\le 2 e^{-2(n_+ - 1)\epsilon^2 / ((1+\gamma)^2 M^2)} .
$$
The second term in (7) is
$$
P\Big\{ \Big| \sum_{j \in I_-} k_\sigma(X_j, X_i) - n_- \mathbb{E}[k_\sigma(W, X_i) \mid X_i] \Big| > \frac{n_- \epsilon}{1+\gamma} \;\Big|\; X_i = x \Big\}
= P\Big\{ \Big| \sum_{j \in I_-} k_\sigma(X_j, X_i) - \mathbb{E}\Big[ \sum_{j \in I_-} k_\sigma(X_j, X_i) \;\Big|\; X_i \Big] \Big| > \frac{n_- \epsilon}{1+\gamma} \;\Big|\; X_i = x \Big\}
\le 2 e^{-2 n_- \epsilon^2 / ((1+\gamma)^2 M^2)} \le 2 e^{-2(n_- - 1)\epsilon^2 / ((1+\gamma)^2 M^2)} .
$$
Therefore,
$$
P\Big\{ \big| \widehat{h}_i - h(X_i) \big| > \epsilon \Big\} = \mathbb{E}\Big[ P\Big\{ \big| \widehat{h}_i - h(X_i) \big| > \epsilon \;\Big|\; X_i \Big\} \Big]
\le 2 e^{-2(n_+ - 1)\epsilon^2 / ((1+\gamma)^2 M^2)} + 2 e^{-2(n_- - 1)\epsilon^2 / ((1+\gamma)^2 M^2)} .
$$
In a similar way, it can be shown that for $i \in I_-$,
$$
P\Big\{ \big| \widehat{h}_i - h(X_i) \big| > \epsilon \Big\} \le 2 e^{-2(n_+ - 1)\epsilon^2 / ((1+\gamma)^2 M^2)} + 2 e^{-2(n_- - 1)\epsilon^2 / ((1+\gamma)^2 M^2)} .
$$
Then,
$$
P\Big\{ \sup_{\alpha \in A} \big| \widehat{H}_n(\alpha) - H(\alpha) \big| > \epsilon \Big\}
= P\Big\{ \sup_{\alpha \in A} \Big| \sum_{i=1}^{n} \alpha_i Y_i \big( \widehat{h}_i - h(X_i) \big) \Big| > \epsilon \Big\}
\le P\Big\{ \sup_{\alpha \in A} \sum_{i=1}^{n} \alpha_i |Y_i| \big| \widehat{h}_i - h(X_i) \big| > \epsilon \Big\}
= P\Big\{ \sup_{\alpha \in A} \Big( \sum_{i \in I_+} \alpha_i \big| \widehat{h}_i - h(X_i) \big| + \gamma \sum_{i \in I_-} \alpha_i \big| \widehat{h}_i - h(X_i) \big| \Big) > \epsilon \Big\}
$$
$$
\le P\Big\{ \sup_{\alpha \in A} \sum_{i \in I_+} \alpha_i \big| \widehat{h}_i - h(X_i) \big| > \frac{\epsilon}{1+\gamma} \Big\} + P\Big\{ \sup_{\alpha \in A} \gamma \sum_{i \in I_-} \alpha_i \big| \widehat{h}_i - h(X_i) \big| > \frac{\gamma\epsilon}{1+\gamma} \Big\}
= P\Big\{ \max_{i \in I_+} \big| \widehat{h}_i - h(X_i) \big| > \frac{\epsilon}{1+\gamma} \Big\} + P\Big\{ \max_{i \in I_-} \big| \widehat{h}_i - h(X_i) \big| > \frac{\epsilon}{1+\gamma} \Big\}
\le \sum_{i \in I_+} P\Big\{ \big| \widehat{h}_i - h(X_i) \big| > \frac{\epsilon}{1+\gamma} \Big\} + \sum_{i \in I_-} P\Big\{ \big| \widehat{h}_i - h(X_i) \big| > \frac{\epsilon}{1+\gamma} \Big\}
$$
$$
\le n_+ \Big( 2 e^{-2(n_+ - 1)\epsilon^2 / ((1+\gamma)^4 M^2)} + 2 e^{-2(n_- - 1)\epsilon^2 / ((1+\gamma)^4 M^2)} \Big) + n_- \Big( 2 e^{-2(n_+ - 1)\epsilon^2 / ((1+\gamma)^4 M^2)} + 2 e^{-2(n_- - 1)\epsilon^2 / ((1+\gamma)^4 M^2)} \Big)
= 2n \Big( e^{-2(n_+ - 1)\epsilon^2 / ((1+\gamma)^4 M^2)} + e^{-2(n_- - 1)\epsilon^2 / ((1+\gamma)^4 M^2)} \Big) .
$$

A.2 Proof of Theorem 2

Define $\mathbf{u} = (u_1, \ldots, u_n)$ such that $u_i = 1/n_+$ for $i \in I_+$ and $u_i = 1/n_-$ for $i \in I_-$. By an argument similar to that for the convergence of the MISE of the kernel density estimate [7], it can be easily shown that
$$
\mathrm{MISE}(\mathbf{u}) = \mathbb{E}\big[ \mathrm{ISE}(\mathbf{u}) \big]
= \int \mathrm{Var}\big( d_\gamma(x; \mathbf{u}) \big) \, dx + \int \mathrm{bias}^2\big( d_\gamma(x; \mathbf{u}) \big) \, dx
= \Big( \frac{1}{n_+ \sigma^d} + \frac{\gamma^2}{n_- \sigma^d} \Big) R(k) + \frac{\sigma^4}{4} R\big( \mathrm{tr}\, H_{f_+} - \gamma\, \mathrm{tr}\, H_{f_-} \big) + o\Big( \frac{1}{n_+ \sigma^d} + \frac{1}{n_- \sigma^d} + \sigma^4 \Big),
$$
where $R(f) = \int f^2(x) \, dx$ and $H_f$ represents the Hessian matrix of $f$.
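Since $\mathrm{ISE}(\mathbf{u}) \ge 0$ and $\mathrm{MISE}(\mathbf{u}) = \mathbb{E}[\mathrm{ISE}(\mathbf{u})]$, Markov's inequality converts convergence of the mean into convergence in probability; we record this routine step explicitly:
$$
P\big\{ \mathrm{ISE}(\mathbf{u}) > \epsilon \big\} \;\le\; \frac{\mathbb{E}\big[ \mathrm{ISE}(\mathbf{u}) \big]}{\epsilon} \;=\; \frac{\mathrm{MISE}(\mathbf{u})}{\epsilon} \;\to\; 0 \qquad \text{for every } \epsilon > 0,
$$
provided $\mathrm{MISE}(\mathbf{u}) \to 0$.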
Therefore, since $\sigma \to 0$ while $n_+\sigma^d \to \infty$ and $n_-\sigma^d \to \infty$ as $n \to \infty$ (the latter two follow from $n_+\sigma^{2d}/\ln n \to \infty$, $n_-\sigma^{2d}/\ln n \to \infty$, and $\sigma \to 0$), every term in the expansion above vanishes and $\mathrm{ISE}(\mathbf{u})$ converges to 0 in probability. Then,
$$
P\big\{ \mathrm{ISE}(\widehat{\alpha}) > \epsilon \big\}
= P\Big\{ \mathrm{ISE}(\widehat{\alpha}) > \epsilon, \; \mathrm{ISE}(\mathbf{u}) > \frac{\epsilon}{2} \Big\} + P\Big\{ \mathrm{ISE}(\widehat{\alpha}) > \epsilon, \; \mathrm{ISE}(\mathbf{u}) \le \frac{\epsilon}{2} \Big\}
\le P\Big\{ \mathrm{ISE}(\mathbf{u}) > \frac{\epsilon}{2} \Big\} + P\Big\{ \mathrm{ISE}(\widehat{\alpha}) > \mathrm{ISE}(\mathbf{u}) + \frac{\epsilon}{2} \Big\} .
$$
From the consistency of $\mathrm{ISE}(\mathbf{u})$ and the oracle inequality stated in Theorem 1, $\mathrm{ISE}(\widehat{\alpha})$ converges to 0 in probability.

A.3 Proof of Theorem 3

Let $\gamma^* = (1-p)/p$. From [17], it suffices to show that
$$
\int \big( d_\gamma(x; \widehat{\alpha}) - d_{\gamma^*}(x) \big)^2 dx \to 0
$$
in probability. Note that
$$
\big\| d_\gamma(x; \widehat{\alpha}) - d_{\gamma^*}(x) \big\|_{L_2}
= \big\| d_\gamma(x; \widehat{\alpha}) - d_\gamma(x) + (\gamma^* - \gamma) f_-(x) \big\|_{L_2}
\le \big\| d_\gamma(x; \widehat{\alpha}) - d_\gamma(x) \big\|_{L_2} + |\gamma^* - \gamma| \cdot \big\| f_-(x) \big\|_{L_2}
= \sqrt{\mathrm{ISE}(\widehat{\alpha})} + |\gamma^* - \gamma| \cdot \big\| f_-(x) \big\|_{L_2} .
$$
The first term converges to 0 in probability by Theorem 2, and the second term converges to 0 since $\gamma = n_-/n_+ \to \gamma^*$ as $n \to \infty$. Thus $L_n = P\{ g_n(X; \widehat{\alpha}) \ne Y \mid D_n \} \to L^*$ in probability.

References

[1] B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[2] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[3] J. Kim and C. Scott, "Kernel classification via integrated squared error," IEEE Workshop on Statistical Signal Processing, August 2007.
[4] D. Kim, Least Squares Mixture Decomposition Estimation, unpublished doctoral dissertation, Dept. of Statistics, Virginia Polytechnic Inst. and State Univ., 1995.
[5] M. Girolami and C. He, "Probability density estimation from optimally condensed data samples," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1253-1264, Oct. 2003.
[6] B. A. Turlach, "Bandwidth selection in kernel density estimation: A review," Technical Report 9317, C.O.R.E. and Institut de Statistique, Université Catholique de Louvain, 1993.
[7] D. W. Scott, "Parametric statistical modeling by minimum integrated square error," Technometrics, vol. 43, pp. 274-285, 2001.
[8] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp, "Sparse density estimation with l1 penalties," in Proceedings of the 20th Annual Conference on Learning Theory (COLT 2007), Lecture Notes in Artificial Intelligence, vol. 4539, pp. 530-543, 2007.
[9] Ph. Rigollet and A. B. Tsybakov, "Linear and convex aggregation of density estimators," https://hal.ccsd.cnrs.fr/ccsd-00068216, 2004.
[10] R. Jenssen, D. Erdogmus, J. C. Principe, and T. Eltoft, "Towards a unification of information theoretic learning and kernel methods," in Proc. IEEE Workshop on Machine Learning for Signal Processing (MLSP 2004), São Luís, Brazil, 2004.
[11] P. Hall and M. P. Wand, "On nonparametric discrimination using density differences," Biometrika, vol. 75, no. 3, pp. 541-547, Sept. 1988.
[12] P. Meinicke, T. Twellmann, and H. Ritter, "Discriminative densities from maximum contrast estimation," in Advances in Neural Information Processing Systems 15, Vancouver, Canada, 2002, pp. 985-992.
[13] M. Di Marzio and C. C. Taylor, "Kernel density classification and boosting: an L2 analysis," Statistics and Computing, vol. 15, pp. 113-123, April 2005.
[14] E. Lehmann, Testing Statistical Hypotheses, Wiley, New York, 1986.
[15] M. P. Wand and M. C. Jones, Kernel Smoothing, Chapman & Hall, 1995.
[16] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation, Springer, New York, 2001.
[17] C. T. Wolverton and T. J. Wagner, "Asymptotically optimal discriminant functions for pattern classification," IEEE Trans. Inform. Theory, vol. 15, no. 2, pp. 258-265, Mar. 1969.