A Risk Minimization Principle for a Class of Parzen Estimators

Kristiaan Pelckmans, Johan A.K. Suykens, Bart De Moor
Department of Electrical Engineering (ESAT) - SCD/SISTA, Katholieke Universiteit Leuven
Kasteelpark Arenberg 10, Leuven, Belgium
Kristiaan.Pelckmans@esat.kuleuven.be

Abstract

This paper¹ explores the use of the Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is further illustrated by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time.

¹ Acknowledgements - K. Pelckmans is supported by an FWO PDM. J.A.K. Suykens and B. De Moor are (full) professors at the Katholieke Universiteit Leuven, Belgium. Research supported by Research Council KUL: GOA AMBioRICS, CoE EF/05/006 OPTEC, IOF-SCORES4CHEM, several PhD/postdoc & fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects G.0452.04, G.0499.04, G.0211.05, G.0226.06, G.0321.06, G.0302.07 (ICCoS, ANMMM, MLDM); IWT: PhD Grants, McKnow-E, Eureka-Flite+; Belgian Federal Science Policy Office: IUAP P6/04; EU: ERNSI.

1 Introduction

The quest for efficient machine learning techniques which (a) have favorable generalization capacities, (b) are flexible for adaptation to a specific task, and (c) are cheap to implement is a pervasive theme in the literature, see e.g. [14] and references therein. This paper introduces a novel concept for designing a learning algorithm, namely the Maximal Average Margin (MAM) principle. It closely resembles the classical notion of maximal margin lying at the basis of perceptrons, Support Vector Machines (SVMs) and boosting algorithms, see a.o. [14, 11]. It however optimizes the average margin of the points to the (hypothesis) hyperplane, instead of the worst-case margin as is traditional. The full margin distribution was studied earlier in e.g. [13], and theoretical results were extended and incorporated in a learning algorithm in [5].

The contribution of this paper is twofold. On a methodological level, we relate (i) results in structural risk minimization, (ii) data-dependent (but dimension-independent) Rademacher complexities [8, 1, 14] and a new concept of 'certainty of prediction', (iii) the notion of margin (as central in most state-of-the-art learning machines), and (iv) statistical estimators such as Parzen windows and Nadaraya-Watson kernel estimators. In [10], the principle was already shown to underlie the approach of mincuts for transductive inference over a weighted undirected graph. Further, consider the model class consisting of all models with a bounded average margin (or classes with a fixed Rademacher complexity, as we will indicate later on). The set of such classes is clearly nested, enabling structural risk minimization [8]. On a practical level, we show how the optimality principle can be used for designing a computationally fast approach to (large-scale) classification and ordinal regression tasks, much along the same lines as Parzen classifiers and Nadaraya-Watson estimators.
It becomes clear that this result enables researchers on Parzen windows to benefit directly from recent advances in kernel machines, two fields which have evolved mostly separately. It must be emphasized that the resulting learning rules were already studied in different forms and motivated by asymptotic and geometric arguments, e.g. as the Parzen window classifier [4], the 'simple classifier' of [12], chap. 1, or the probabilistic neural network [15]; in this paper we show how an (empirical) risk based optimality criterion underlies this approach. A number of experiments confirm the use of the resulting cheap learning rules for providing a reasonable (baseline) performance in a small time-window.

The following notational conventions are used throughout the paper. Let the random vector $(X, Y) \in \mathbb{R}^d \times \{-1, 1\}$ obey a (fixed but unknown) joint distribution $P_{XY}$ from a probability space $(\mathbb{R}^d \times \{-1, 1\}, \mathcal{P})$. Let $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$ be sampled i.i.d. according to $P_{XY}$. Let $y \in \mathbb{R}^n$ be defined as $y = (Y_1, \dots, Y_n)^T \in \{-1, 1\}^n$ and $X = (X_1, \dots, X_n)^T \in \mathbb{R}^{n \times d}$.

This paper is organized as follows. The next section illustrates the principle of maximal average margin for classification problems. Section 3 investigates the close relationship with Rademacher complexities, Section 4 develops the maximal average margin principle for ordinal regression, and Section 5 reports experimental results of applying the MAM principle to classification and ordinal regression tasks.

2 Maximal Average Margin for Classifiers

2.1 The Linear Case

Let the class of hypotheses be defined as
$$\mathcal{H} = \left\{ f(\cdot) : \mathbb{R}^d \to \mathbb{R} \;\middle|\; \exists\, w \in \mathbb{R}^d,\ \|w\|_2 = 1,\ \forall x \in \mathbb{R}^d : f(x) = w^T x \right\}. \qquad (1)$$
Consequently, the signed distance of a sample $(X, Y)$ to the hyperplane $w^T x = 0$, or the margin $M(w) \in \mathbb{R}$, can be defined as
$$M(w) = \frac{Y\,(w^T X)}{\|w\|_2}. \qquad (2)$$
SVMs maximize the worst-case margin. We instead focus on the first moment of the margin distribution. Maximizing the expected (average) margin follows from solving
$$M = \max_{w}\, E\left[\frac{Y\,(w^T X)}{\|w\|_2}\right] = \max_{f \in \mathcal{H}}\, E\left[Y f(X)\right]. \qquad (3)$$
Remark that the non-separable case does not require the introduction of slack variables. The empirical counterpart becomes
$$\hat{M} = \max_{w} \frac{1}{n} \sum_{i=1}^{n} \frac{Y_i\,(w^T X_i)}{\|w\|_2}, \qquad (4)$$
which can be written as the constrained convex problem $\min_w -\frac{1}{n}\sum_{i=1}^{n} Y_i (w^T X_i)$ s.t. $\|w\|_2 \leq 1$. The Lagrangian with multiplier $\lambda \geq 0$ becomes $\mathcal{L}(w, \lambda) = -\frac{1}{n}\sum_{i=1}^{n} Y_i (w^T X_i) + \frac{\lambda}{2}(w^T w - 1)$. By switching the minimax problem to a maximin problem (application of Slater's condition), the first order condition for optimality $\frac{\partial \mathcal{L}(w, \lambda)}{\partial w} = 0$ gives
$$w_n = \frac{1}{\lambda n} \sum_{i=1}^{n} Y_i X_i = \frac{1}{\lambda n} X^T y, \qquad (5)$$
where $w_n \in \mathbb{R}^d$ denotes the optimum of (4). The corresponding parameter $\lambda$ can be found by substituting (5) in the constraint $w^T w = 1$, or $\lambda = \frac{1}{n}\left\|\sum_{i=1}^{n} Y_i X_i\right\|_2 = \frac{1}{n}\sqrt{y^T X X^T y}$, since the optimum is obviously attained when $w^T w = 1$. It becomes clear that the above derivation remains valid as $n \to \infty$, resulting in the following theorem.

Theorem 1 (Explicit Actual Optimum for the MAMC) The function $f(x) = w^T x$ in $\mathcal{H}$ maximizing the expected margin satisfies
$$\arg\max_{w}\, E\left[\frac{Y\,(w^T X)}{\|w\|_2}\right] = \frac{1}{\lambda}\, E[XY], \qquad (6)$$
where $\lambda$ is a normalization constant such that $\|w\|_2 = 1$.
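For illustration, the closed form (5) can be evaluated directly without any optimization solver; the following minimal NumPy sketch (function and variable names are ours, not part of the original formulation) computes $w_n$ and $\lambda$ from a given sample.

```python
import numpy as np

def mam_linear(X, y):
    """Closed-form linear MAM classifier, cf. (5): w_n = X^T y / (lambda * n)."""
    n = X.shape[0]
    lam = np.linalg.norm(X.T @ y) / n   # normalization constant such that w^T w = 1
    w = (X.T @ y) / (lam * n)           # unit-norm weight vector
    return w, lam                       # lam also equals the empirical average margin (4)

# usage: w, lam = mam_linear(X, y); predictions for new points x are sign(x @ w)
```

Training thus amounts to a single matrix-vector product, which is what makes the resulting learning rule computationally cheap.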
2.2 Kernel-based Classifier and Parzen Window

It becomes straightforward to recast the resulting classifier as a kernel classifier by mapping the input samples $X$ into a feature space $\varphi : \mathbb{R}^d \to \mathbb{R}^{d_\varphi}$, where $d_\varphi$ is possibly infinite. In particular, we do not have to resort to Lagrange duality in a context of convex optimization (see e.g. [14, 9] for an overview) or to functional analysis in a Reproducing Kernel Hilbert Space. Specifically,
$$w_n^T \varphi(x^*) = \frac{1}{\lambda n} \sum_{i=1}^{n} Y_i\, K(X_i, x^*), \qquad (7)$$
where $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is defined as the inner product such that $\varphi(X)^T \varphi(X') = K(X, X')$ for any $X, X'$. Conversely, any function $K$ corresponds with the inner product of a valid map $\varphi$ if the function $K$ is positive definite. As previously, the term $\lambda$ becomes $\lambda = \frac{1}{n}\sqrt{y^T \Omega y}$, with kernel matrix $\Omega \in \mathbb{R}^{n \times n}$ where $\Omega_{ij} = K(X_i, X_j)$ for all $i, j = 1, \dots, n$.

Now the class of positive definite Mercer kernels can be used, as they induce a proper mapping $\varphi$. A classical choice is the linear kernel ($K(X, X') = X^T X'$), a polynomial kernel of degree $p \in \mathbb{N}_0$ ($K(X, X') = (X^T X' + b)^p$), an RBF kernel ($K(X, X') = \exp(-\|X - X'\|_2^2 / \sigma^2)$), or a dedicated kernel for a specific application (e.g. a string kernel or a Fisher kernel, see e.g. [14] and references therein). Figure 1.a depicts an example of a nonlinear classifier based on the well-known Ripley dataset; the contour lines score the 'certainty of prediction' as explained in the next section.

The expression (7) is similar (proportional) to the classical Parzen window for classification, but differs in the use of a positive definite (Mercer) kernel $K$ instead of a density kernel $\kappa\!\left(\frac{X - \cdot}{h}\right)$ with bandwidth $h > 0$, and in the form of the denominator. The classical motivation of statistical kernel estimators is based on asymptotic theory in low dimensions (i.e. $d = O(1)$), see e.g. [4], chap. 10 and references. The functional form of the optimal rule (7) is also similar to the 'simple classifier' described in [12], chap. 1. Thirdly, this estimator was termed and empirically validated as a probabilistic neural network by [15]. The novel element of the above result is the derivation of a clear (both theoretical and empirical) optimality principle underlying the rule, as opposed to the asymptotic results of [4] and the geometric motivations in [12, 15]. As a direct byproduct, it becomes straightforward to extend the Parzen window classifier with an additional intercept term or other parametric parts, or towards additive (structured) models as in [9].
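The rule (7) can be evaluated directly from kernel evaluations on the training sample. The sketch below (our notation, using a Gaussian RBF kernel with an illustrative bandwidth parameter) shows one possible implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma2=0.5):
    """Gaussian RBF kernel matrix, K(a, b) = exp(-||a - b||^2 / sigma2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def mam_kernel_classifier(X_tr, y_tr, X_te, sigma2=0.5):
    """Evaluate the kernel MAM rule (7): f(x) = (1 / (lambda * n)) sum_i y_i K(X_i, x)."""
    n = len(y_tr)
    Omega = rbf_kernel(X_tr, X_tr, sigma2)          # kernel matrix on the training set
    lam = np.sqrt(y_tr @ Omega @ y_tr) / n          # normalization constant lambda
    scores = rbf_kernel(X_te, X_tr, sigma2) @ y_tr / (lam * n)
    return np.sign(scores), scores                  # predicted labels and latent values w_n^T phi(x)
```

As with the Parzen window classifier, no training phase beyond storing the sample is required, and each prediction costs O(n) kernel evaluations.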
3 Analysis and Rademacher Complexities

The quantity of interest in the analysis of the generalization performance is the probability of making a mistake (the risk $R(w; P_{XY})$), or
$$R(w; P_{XY}) = P_{XY}\left(Y (w^T \varphi(X)) \leq 0\right) = E\left[I\left(Y (w^T \varphi(X)) \leq 0\right)\right], \qquad (8)$$
where $I(z)$ equals one if $z$ is true, and zero otherwise.

3.1 Rademacher Complexity

Let $\{\sigma_i\}_{i=1}^{n}$, taken from the set $\{-1, 1\}^n$, be Bernoulli random variables with $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$. The empirical Rademacher complexity is then defined [8, 1] as
$$\hat{R}_n(\mathcal{H}) = E_\sigma\left[\sup_{f \in \mathcal{H}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \;\middle|\; X_1, \dots, X_n\right], \qquad (9)$$
where the expectation is taken over the choice of the binary vector $\sigma = (\sigma_1, \dots, \sigma_n)^T \in \{-1, 1\}^n$. The empirical Rademacher complexity defines a natural complexity measure to study the maximal average margin classifier, as the definitions of the empirical Rademacher complexity and the maximal average margin closely resemble each other (see also [8]). The following result was given in [1], Lemma 22, but we give an alternative proof by exploiting the structure of the optimal estimate explicitly.

Lemma 1 (Trace bound on the Empirical Rademacher Complexity for $\mathcal{H}$) Let $\Omega \in \mathbb{R}^{n \times n}$ be defined as $\Omega_{ij} = K(X_i, X_j)$ for all $i, j = 1, \dots, n$. Then
$$\hat{R}_n(\mathcal{H}) \leq \frac{2}{n}\sqrt{\mathrm{tr}(\Omega)}. \qquad (10)$$

Proof: The proof goes along the same lines as the classical bound on the empirical Rademacher complexity for kernel machines outlined in [1], Lemma 22. Specifically, once a vector $\sigma \in \{-1, 1\}^n$ is fixed, it is immediately seen that the maximizer of $\sum_{i=1}^{n} \sigma_i f(X_i)$ over $\mathcal{H}$ takes the same form as (7), and $\max_{\|w\|_2 = 1} \sum_{i=1}^{n} \sigma_i (w^T \varphi(X_i)) = \sqrt{\sigma^T \Omega \sigma}$. Application of the expectation operator $E_\sigma$ over the choice of the Rademacher variables then gives
$$\hat{R}_n(\mathcal{H}) = \frac{2}{n} E_\sigma\left[\sqrt{\sigma^T \Omega \sigma}\right] \leq \frac{2}{n}\sqrt{E_\sigma\left[\sigma^T \Omega \sigma\right]} = \frac{2}{n}\sqrt{\sum_{i,j} E[\sigma_i \sigma_j]\, K(X_i, X_j)} = \frac{2}{n}\sqrt{\sum_{i=1}^{n} K(X_i, X_i)} = \frac{2}{n}\sqrt{\mathrm{tr}(\Omega)}, \qquad (11)$$
where the inequality is based on application of Jensen's inequality. This proves the Lemma.

Remark that in the case of a kernel with constant trace (as e.g. for the RBF kernel, where $\mathrm{tr}(\Omega) = n$), it follows from this result that also the (expected) Rademacher complexity satisfies $E[\hat{R}_n(\mathcal{H})] \leq \frac{2}{\sqrt{n}}$. In general, one has $E[\hat{R}_n(\mathcal{H})] \leq 2\sqrt{E[K(X, X)]/n}$, where $E[K(X, X)]$ equals the trace of the integral operator $T_K$ defined on $L_2(P_X)$ as $T_K(f)(\cdot) = \int K(\cdot, X) f(X)\, dP_X(X)$, as in [1]. Application of McDiarmid's inequality to the variable $Z = \sup_{f \in \mathcal{H}}\left(E[Y (w^T \varphi(X))] - \frac{1}{n}\sum_{i=1}^{n} Y_i (w^T \varphi(X_i))\right)$ gives, as in [8, 1]:

Lemma 2 (Deviation Inequality) Let $0 < B < \infty$ be a fixed constant such that $\sup_z \sqrt{K(z, z)} \leq B$, so that $|w^T \varphi(z)| \leq B$, and let $\delta \in \mathbb{R}_0^+$ be fixed. Then with probability exceeding $1 - \delta$, one has for any $w \in \mathbb{R}^d$ that
$$E[Y (w^T \varphi(X))] \geq \frac{1}{n}\sum_{i=1}^{n} Y_i (w^T \varphi(X_i)) - \hat{R}_n(\mathcal{H}) - 3B\sqrt{\frac{\ln(2/\delta)}{2n}}. \qquad (12)$$

Therefore it follows that one maximizes the expected margin by maximizing the empirical average margin, while controlling the empirical Rademacher complexity by the choice of the model class (kernel). In the case of RBF kernels, $B = 1$, resulting in a reasonably tight bound. It is now illustrated how one can obtain a practical upper bound on the 'certainty of prediction' using $f(x) = w_n^T \varphi(x)$.

Theorem 2 (Occurrence of Mistakes) Given an i.i.d. sample $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$, a constant $B \in \mathbb{R}$ such that $\sup_z \sqrt{K(z, z)} \leq B$, and a fixed $\delta \in \mathbb{R}_0^+$. Then, with probability exceeding $1 - \delta$, one has for all $w \in \mathbb{R}^d$ that
$$P\left(Y (w^T \varphi(X)) \leq 0\right) \leq \frac{B - E[Y (w^T \varphi(X))]}{B} \leq 1 - \frac{\sqrt{y^T \Omega y}}{nB} + \frac{\hat{R}_n(\mathcal{H})}{B} + 3\sqrt{\frac{\ln(2/\delta)}{2n}}, \qquad (13)$$
where the last inequality uses Lemma 2 together with the fact that the empirical average margin of the maximal average margin classifier $w_n$ equals $\sqrt{y^T \Omega y}/n$.

Proof: The proof follows directly from application of Markov's inequality to the positive random variable $B - Y (w^T \varphi(X))$, with expectation $B - E[Y (w^T \varphi(X))]$, estimated accurately by the sample average as in the previous lemma. More generally, one obtains with probability exceeding $1 - \delta$ that for any $w \in \mathbb{R}^d$ and for any $\rho$ such that $-B < \rho < B$,
$$P\left(Y (w^T \varphi(X)) \leq -\rho\right) \leq \frac{B}{B + \rho} - \frac{\sqrt{y^T \Omega y}}{n(B + \rho)} + \frac{\hat{R}_n(\mathcal{H})}{B + \rho} + \frac{3B}{B + \rho}\sqrt{\frac{\ln(2/\delta)}{2n}}. \qquad (14)$$

[Figure 1: Example of (a) the MAM classifier and (b) the SVM on the Ripley dataset (axes X1, X2). The contour lines represent the estimate of the certainty of prediction ('scores') as derived in Theorem 2 for the MAM classifier in (a), and as in Corollary 1 for the case of SVMs with $g(z) = \min(1, \max(-1, z))$, where $|z| < 1$ corresponds with the inner part of the margin of the SVM, in (b). While the contours in (a) give an overall score of the predictions, the scores in (b) focus towards the margin of the SVM.]
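To make the bound concrete, the quantities appearing in (10) and (13) can be computed directly from the kernel matrix. The sketch below is our own illustration (the kernel matrix Omega is assumed to be precomputed, e.g. with the rbf_kernel helper above), not part of the original exposition.

```python
import numpy as np

def mam_mistake_bound(Omega, y, delta=0.05):
    """Upper bound (13) on P(Y f(X) <= 0) for the MAM classifier.

    Omega : (n, n) kernel matrix on the training sample, y : (n,) labels in {-1, +1}.
    """
    n = len(y)
    B = np.sqrt(np.max(np.diag(Omega)))            # sample estimate of sup_z sqrt(K(z, z)); exactly 1 for RBF
    emp_margin = np.sqrt(y @ Omega @ y) / n        # empirical average margin of w_n
    rad_bound = 2.0 * np.sqrt(np.trace(Omega)) / n # trace bound (10) on hat{R}_n(H)
    deviation = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return 1.0 - emp_margin / B + rad_bound / B + deviation
```

For the RBF kernel the expression simplifies, since then $B = 1$ and $\mathrm{tr}(\Omega) = n$.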
This results in a practical assessment of the 'certainty' of a prediction as follows. First, note that the random variable $Y (w_n^T \varphi(x^*))$ for a fixed $X = x^*$ can take only two values: either $-|w_n^T \varphi(x^*)|$ or $|w_n^T \varphi(x^*)|$. Therefore $P(Y (w_n^T \varphi(x^*)) \leq 0) = P(Y (w_n^T \varphi(x^*)) = -|w_n^T \varphi(x^*)|) \leq P(Y (w_n^T \varphi(x^*)) \leq -|w_n^T \varphi(x^*)|)$, as $Y$ can only take the two values $-1$ or $1$. Thus the event '$Y \neq \mathrm{sign}(w_n^T \varphi(x^*))$' for a sample $X = x^*$ occurs with probability lower than the right-hand side of (14) with $\rho = |w_n^T \varphi(x^*)|$. When asserting this for a number $n_v \in \mathbb{N}$ of samples $X^* \sim P_X$ with $n_v \to \infty$, mispredictions would occur in at most this fraction of the $n_v$ cases. In this sense, one can use the latent variable $w_n^T \varphi(x^*)$ as an indication of how 'certain' the prediction is. Figure 1.a gives an example of the MAM classifier, together with the level plots indicating the certainty of prediction. Remark however that the described 'certainty of prediction' statement differs from a conditional statement of the risk given as $P(Y (w^T \varphi(X)) < 0 \mid X = x^*)$. The essential difference with the probabilistic estimates based on the density estimates resulting from the Parzen window estimator is that the results become independent of the data dimension, as one avoids estimating the joint distribution.

3.2 Transforming the Margin Distribution

Consider the case where the assumption of a reasonable constant $B$ such that $P(\|\varphi(X)\|_2 < B) = 1$ is unrealistic. Then a transformation of the random variable $Y (w^T \varphi(X))$ can be fruitful, using a monotonically increasing function $g : \mathbb{R} \to \mathbb{R}$ with a constant $B_g \leq B$ such that $|g(z)| \leq B_g$ and $g(0) = 0$. In the choice of a proper transformation, two counteracting effects should be traded off properly. On the one hand, a small choice of $B_g$ improves the bound as e.g. described in Lemma 2. On the other hand, such a transformation makes the expected value $E[g(Y (w^T \varphi(X)))]$ smaller than $E[Y (w^T \varphi(X))]$. Modifying Theorem 2 gives

Corollary 1 (Occurrence of Mistakes, bis) Given i.i.d. samples $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$ and a fixed $\delta \in \mathbb{R}_0^+$. Let $g : \mathbb{R} \to \mathbb{R}$ be a monotonically increasing function with Lipschitz constant $0 < L_g < \infty$, let $B_g \in \mathbb{R}$ be such that $|g(z)| \leq B_g$ for all $z$, and let $g(0) = 0$. Then with probability exceeding $1 - \delta$, one has for any $\rho$ such that $-B_g \leq \rho \leq B_g$ and any $w \in \mathbb{R}^d$ that
$$P\left(Y (w_n^T \varphi(X)) \leq -\rho\right) \leq \frac{B_g}{B_g + \rho} - \frac{\frac{1}{n}\sum_{i=1}^{n} g\left(Y_i (w_n^T \varphi(X_i))\right) - L_g \hat{R}_n(\mathcal{H}) - 3B_g\sqrt{\frac{\log(2/\delta)}{2n}}}{B_g + \rho}. \qquad (15)$$

This result follows straightforwardly from Theorem 2, using the property that $\hat{R}_n(g \circ \mathcal{H}) \leq L_g \hat{R}_n(\mathcal{H})$, see e.g. [1]. When $\rho = 0$, one has $P(Y (w_n^T \varphi(X)) \leq 0) \leq 1 - \frac{E[g(Y (w^T \varphi(X)))]}{B_g}$. Similarly as in the previous section, Corollary 1 can be used to score the certainty of prediction by considering for each $X = x^*$ the values of $g(w^T \varphi(x^*))$ and $g(-w^T \varphi(x^*))$. Figure 1.b gives an example based on the clipping transformation $g(z) = \min(1, \max(-1, z)) \in [-1, 1]$, such that $B_g = 1$. Note that this a-priori choice of the function $g$ does not depend on the (empirical) optimality criterion at hand.

3.3 Soft-margin SVMs and MAM classifiers

Except for the margin-based mechanism, the MAM classifier shares other properties with the soft-margin maximal margin classifier (SVM) as well. Consider the saturation function $g(z) = (1 - z)_+$, where $(\cdot)_+$ is defined as $(z)_+ = z$ if $z \geq 0$ and zero otherwise. Rewriting the MAM formulation (4) with slack variables $\xi_i$ for the terms $1 - Y_i (w^T \varphi(X_i))$, one obtains for a $C > 0$
$$\min_{w, \xi} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^T w = C \ \text{and} \ Y_i (w^T \varphi(X_i)) = 1 - \xi_i, \quad i = 1, \dots, n, \qquad (16)$$
which is equivalent to (4), as in the optimum $Y_i (w^T \varphi(X_i)) = 1 - \xi_i$ for all $i$. To make the relation with the SVM more explicit, consider the following formulation of (16):
$$\min_{w, \xi} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^T w \leq C \ \text{and} \ Y_i (w^T \varphi(X_i)) \geq 1 - \xi_i, \quad i = 1, \dots, n, \qquad (17)$$
which is similar to the support vector machine (see e.g. [14]), given by
$$\min_{w, \xi} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^T w \leq C \ \text{and} \ Y_i (w^T \varphi(X_i)) \geq 1 - \xi_i, \ \xi_i \geq 0, \quad i = 1, \dots, n. \qquad (18)$$
Thus, omission of the slack constraints $\xi_i \geq 0$ (i.e. of the saturation implemented by $g$) in the SVM formulation results in the Parzen window classifier.
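The equivalence claimed above can be checked numerically: solving the SVM-like program (17) without the positivity constraints on the slacks should recover the MAM direction $X^T y$ of (5) in the linear case. A small sketch under the assumption that the cvxpy package is available; the data and variable names are illustrative only.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, C = 50, 3, 1.0
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

# Formulation (17): hinge-type slacks, but without the constraints xi_i >= 0.
w, xi = cp.Variable(d), cp.Variable(n)
problem = cp.Problem(cp.Minimize(cp.sum(xi)),
                     [cp.sum_squares(w) <= C,
                      cp.multiply(y, X @ w) >= 1 - xi])
problem.solve()

# The optimizer should point in the MAM/Parzen direction X^T y of (5).
w_mam = X.T @ y / np.linalg.norm(X.T @ y)
print(np.allclose(w.value / np.linalg.norm(w.value), w_mam, atol=1e-4))
```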
4 Maximal Average Margin for Ordinal Regression

Along the same lines as [6], the maximal average margin principle can be applied to ordinal regression tasks. Let $(X, Y) \in \mathbb{R}^d \times \{1, \dots, m\}$ with distribution $P_{XY}$, and let $(X', Y')$ denote an independent copy of $(X, Y)$. The $w \in \mathbb{R}^d$ maximizing the probability of concordance $P\left(w^T(\varphi(X) - \varphi(X'))\,(Y - Y') > 0\right)$ can be found by solving for the maximal average margin between pairs as follows:
$$M = \max_{w}\, E\left[\frac{\mathrm{sign}(Y - Y')\, w^T\left(\varphi(X) - \varphi(X')\right)}{\|w\|_2}\right]. \qquad (19)$$
Given $n$ i.i.d. samples $\{(X_i, Y_i)\}_{i=1}^{n}$, empirical risk minimization is obtained by solving
$$\min_{w} -\frac{1}{n} \sum_{i,j=1}^{n} \mathrm{sign}(Y_j - Y_i)\, w^T\left(\varphi(X_j) - \varphi(X_i)\right) \quad \text{s.t.} \quad \|w\|_2 \leq 1. \qquad (20)$$
The Lagrangian with multiplier $\lambda \geq 0$ becomes $\mathcal{L}(w, \lambda) = -\frac{1}{n}\sum_{i,j} \mathrm{sign}(Y_j - Y_i)\, w^T(\varphi(X_j) - \varphi(X_i)) + \frac{\lambda}{2}(w^T w - 1)$. Let there be $n'$ couples $(i, j)$ with $Y_i > Y_j$, and let $D_y \in \{-1, 0, 1\}^{n' \times n}$ be such that $D_{y,ki} = 1$ and $D_{y,kj} = -1$ if the $k$-th couple equals $(i, j)$. Then, by switching the minimax problem to a maximin problem, the first order condition for optimality $\frac{\partial \mathcal{L}(w, \lambda)}{\partial w} = 0$ gives the expression
$$w_n = \frac{1}{\lambda n} \sum_{i,j=1}^{n} \mathrm{sign}(Y_j - Y_i)\left(\varphi(X_j) - \varphi(X_i)\right) = \frac{2}{\lambda n}\, \Phi^T D_y^T 1_{n'},$$
where $\Phi = (\varphi(X_1), \dots, \varphi(X_n))^T$. Now the parameter $\lambda$ can be found by substituting this expression in the constraint $w^T w = 1$, as before.
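The pairwise sum above need not be formed explicitly: in the linear case it collapses to a weighted sum of the samples, where the weight of sample $j$ counts the samples with a strictly smaller label minus those with a strictly larger label. The following sketch (our notation, assuming a plain NumPy setting) evaluates this closed form in O(nd) time, in line with the O(n) complexity claimed for the MAM ordinal regression rule.

```python
import numpy as np

def mam_ordinal_direction(X, y):
    """Direction proportional to sum_{i,j} sign(y_j - y_i) (x_j - x_i),
    computed without forming the O(n^2) pairwise sum."""
    levels, counts = np.unique(y, return_counts=True)   # sorted ordinal levels and their counts
    n = len(y)
    smaller = np.cumsum(counts) - counts                 # #samples with a strictly smaller label, per level
    larger = n - np.cumsum(counts)                       # #samples with a strictly larger label, per level
    c = (smaller - larger)[np.searchsorted(levels, y)]   # per-sample weight c_j
    w = 2.0 * (X.T @ c)                                  # pairwise sum collapses to 2 * sum_j c_j x_j
    return w / np.linalg.norm(w)                         # normalize so that ||w||_2 = 1
```

A kernel-based variant would replace $X^T c$ by $\sum_j c_j \varphi(X_j)$, evaluated through the kernel as in (7).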