A Risk Minimization Principle for a Class of Parzen Estimators

Kristiaan Pelckmans, Johan A.K. Suykens, Bart De Moor
Department of Electrical Engineering (ESAT) - SCD/SISTA, Katholieke Universiteit Leuven
Kasteelpark Arenberg 10, Leuven, Belgium
Kristiaan.Pelckmans@esat.kuleuven.be

Abstract

This paper¹ explores the use of the Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is further illustrated by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time.

¹ Acknowledgements - K. Pelckmans is supported by an FWO PDM. J.A.K. Suykens and B. De Moor are (full) professors at the Katholieke Universiteit Leuven, Belgium. Research supported by Research Council KUL: GOA AMBioRICS, CoE EF/05/006 OPTEC, IOF-SCORES4CHEM, several PhD/postdoc & fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects G.0452.04, G.0499.04, G.0211.05, G.0226.06, G.0321.06, G.0302.07 (ICCoS, ANMMM, MLDM); IWT: PhD Grants, McKnow-E, Eureka-Flite+; Belgian Federal Science Policy Office: IUAP P6/04; EU: ERNSI.

1 Introduction

The quest for efficient machine learning techniques which (a) have favorable generalization capacities, (b) are flexible for adaptation to a specific task, and (c) are cheap to implement is a pervasive theme in the literature, see e.g. [14] and references therein. This paper introduces a novel concept for designing a learning algorithm, namely the Maximal Average Margin (MAM) principle. It closely resembles the classical notion of maximal margin lying at the basis of perceptrons, Support Vector Machines (SVMs) and boosting algorithms, see a.o. [14, 11]. It however optimizes the average margin of the points to the (hypothesis) hyperplane, instead of the worst-case margin as is traditional. The full margin distribution was studied earlier in e.g. [13], and theoretical results were extended and incorporated in a learning algorithm in [5].

The contribution of this paper is twofold. On a methodological level, we relate (i) results in structural risk minimization, (ii) data-dependent (but dimension-independent) Rademacher complexities [8, 1, 14] and a new concept of 'certainty of prediction', (iii) the notion of margin (as central in most state-of-the-art learning machines), and (iv) statistical estimators such as Parzen windows and Nadaraya-Watson kernel estimators. In [10], the principle was already shown to underlie the approach of mincuts for transductive inference over a weighted undirected graph. Further, consider the model class consisting of all models with a bounded average margin (or classes with a fixed Rademacher complexity, as we will indicate later on). The set of such classes is clearly nested, enabling structural risk minimization [8]. On a practical level, we show how the optimality principle can be used for designing a computationally fast approach to (large-scale) classification and ordinal regression tasks, much along the same lines as Parzen classifiers and Nadaraya-Watson estimators.
It becomes clear that this result enables researchers on Parzen windows to benefit directly from recent advances in kernel machines, two fields which have evolved mostly separately. It must be emphasized that the resulting learning rules were already studied in different forms and motivated by asymptotic and geometric arguments, e.g. as the Parzen window classifier [4], the 'simple classifier' of [12], chap. 1, or the probabilistic neural network [15]; in this paper we show how an (empirical) risk based optimality criterion underlies this approach. A number of experiments confirm the use of the resulting cheap learning rules for providing a reasonable (baseline) performance in a small time-window.

The following notational conventions are used throughout the paper. Let the random vector $(X, Y) \in \mathbb{R}^d \times \{-1, 1\}$ obey a (fixed but unknown) joint distribution $P_{XY}$ from a probability space $(\mathbb{R}^d \times \{-1, 1\}, \mathcal{P})$. Let $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$ be sampled i.i.d. according to $P_{XY}$. Let $y \in \mathbb{R}^n$ be defined as $y = (Y_1, \dots, Y_n)^T \in \{-1, 1\}^n$ and $X = (X_1, \dots, X_n)^T \in \mathbb{R}^{n \times d}$.

This paper is organized as follows. The next section illustrates the principle of maximal average margin for classification problems. Section 3 investigates the close relationship with Rademacher complexities, Section 4 develops the maximal average margin principle for ordinal regression, and Section 5 reports experimental results of applying the MAM principle to classification and ordinal regression tasks.

2 Maximal Average Margin for Classifiers

2.1 The Linear Case

Let the class of hypotheses be defined as
$$\mathcal{H} = \left\{ f(\cdot) : \mathbb{R}^d \to \mathbb{R} \;\middle|\; \exists\, w \in \mathbb{R}^d,\ \|w\|_2 = 1,\ \forall x \in \mathbb{R}^d : f(x) = w^T x \right\}. \qquad (1)$$
Consequently, the signed distance of a sample $(X, Y)$ to the hyperplane $w^T x = 0$, or the margin $M(w) \in \mathbb{R}$, can be defined as
$$M(w) = \frac{Y\,(w^T X)}{\|w\|_2}. \qquad (2)$$
SVMs maximize the worst-case margin. We instead focus on the first moment of the margin distribution. Maximizing the expected (average) margin follows from solving
$$M = \max_{w}\, E\left[\frac{Y\,(w^T X)}{\|w\|_2}\right] = \max_{f \in \mathcal{H}}\, E\left[Y f(X)\right]. \qquad (3)$$
Remark that the non-separable case does not require the introduction of slack variables. The empirical counterpart becomes
$$\hat{M} = \max_{w} \frac{1}{n} \sum_{i=1}^{n} \frac{Y_i\,(w^T X_i)}{\|w\|_2}, \qquad (4)$$
which can be written as the constrained convex problem $\min_w -\frac{1}{n}\sum_{i=1}^{n} Y_i (w^T X_i)$ s.t. $\|w\|_2 \leq 1$. The Lagrangian with multiplier $\lambda \geq 0$ becomes $\mathcal{L}(w, \lambda) = -\frac{1}{n}\sum_{i=1}^{n} Y_i (w^T X_i) + \frac{\lambda}{2}(w^T w - 1)$. By switching the minimax problem to a maximin problem (application of Slater's condition), the first order condition for optimality $\frac{\partial \mathcal{L}(w, \lambda)}{\partial w} = 0$ gives
$$w_n = \frac{1}{\lambda n} \sum_{i=1}^{n} Y_i X_i = \frac{1}{\lambda n} X^T y, \qquad (5)$$
where $w_n \in \mathbb{R}^d$ denotes the optimum of (4). The corresponding parameter $\lambda$ can be found by substituting (5) in the constraint $w^T w = 1$, or $\lambda = \frac{1}{n}\left\|\sum_{i=1}^{n} Y_i X_i\right\|_2 = \frac{1}{n}\sqrt{y^T X X^T y}$, since the optimum is obviously attained when $w^T w = 1$. It becomes clear that the above derivation remains valid as $n \to \infty$, resulting in the following theorem.

Theorem 1 (Explicit Actual Optimum for the MAMC) The function $f(x) = w^T x$ in $\mathcal{H}$ maximizing the expected margin satisfies
$$\arg\max_{w}\, E\left[\frac{Y\,(w^T X)}{\|w\|_2}\right] = \frac{1}{\lambda}\, E[XY], \qquad (6)$$
where $\lambda$ is a normalization constant such that $\|w\|_2 = 1$.
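For illustration, the closed form (5) can be evaluated directly without any optimization solver; the following minimal NumPy sketch (function and variable names are ours, not part of the original formulation) computes $w_n$ and $\lambda$ from a given sample.

```python
import numpy as np

def mam_linear(X, y):
    """Closed-form linear MAM classifier, cf. (5): w_n = X^T y / (lambda * n)."""
    n = X.shape[0]
    lam = np.linalg.norm(X.T @ y) / n   # normalization constant such that w^T w = 1
    w = (X.T @ y) / (lam * n)           # unit-norm weight vector
    return w, lam                       # lam also equals the empirical average margin (4)

# usage: w, lam = mam_linear(X, y); predictions for new points x are sign(x @ w)
```

Training thus amounts to a single matrix-vector product, which is what makes the resulting learning rule computationally cheap.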
2.2 Kernel-based Classifier and Parzen Window

It becomes straightforward to recast the resulting classifier as a kernel classifier by mapping the input samples $X$ into a feature space $\varphi : \mathbb{R}^d \to \mathbb{R}^{d_\varphi}$, where $d_\varphi$ is possibly infinite. In particular, we do not have to resort to Lagrange duality in a context of convex optimization (see e.g. [14, 9] for an overview) or to functional analysis in a Reproducing Kernel Hilbert Space. Specifically,
$$w_n^T \varphi(x^*) = \frac{1}{\lambda n} \sum_{i=1}^{n} Y_i\, K(X_i, x^*), \qquad (7)$$
where $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is defined as the inner product such that $\varphi(X)^T \varphi(X') = K(X, X')$ for any $X, X'$. Conversely, any function $K$ corresponds with the inner product of a valid map $\varphi$ if the function $K$ is positive definite. As previously, the term $\lambda$ becomes $\lambda = \frac{1}{n}\sqrt{y^T \Omega y}$, with kernel matrix $\Omega \in \mathbb{R}^{n \times n}$ where $\Omega_{ij} = K(X_i, X_j)$ for all $i, j = 1, \dots, n$.

Now the class of positive definite Mercer kernels can be used, as they induce a proper mapping $\varphi$. A classical choice is the linear kernel ($K(X, X') = X^T X'$), a polynomial kernel of degree $p \in \mathbb{N}_0$ ($K(X, X') = (X^T X' + b)^p$), an RBF kernel ($K(X, X') = \exp(-\|X - X'\|_2^2 / \sigma^2)$), or a dedicated kernel for a specific application (e.g. a string kernel or a Fisher kernel, see e.g. [14] and references therein). Figure 1.a depicts an example of a nonlinear classifier based on the well-known Ripley dataset; the contour lines score the 'certainty of prediction' as explained in the next section.

The expression (7) is similar (proportional) to the classical Parzen window for classification, but differs in the use of a positive definite (Mercer) kernel $K$ instead of a density kernel $\kappa\!\left(\frac{X - \cdot}{h}\right)$ with bandwidth $h > 0$, and in the form of the denominator. The classical motivation of statistical kernel estimators is based on asymptotic theory in low dimensions (i.e. $d = O(1)$), see e.g. [4], chap. 10 and references. The functional form of the optimal rule (7) is also similar to the 'simple classifier' described in [12], chap. 1. Thirdly, this estimator was termed and empirically validated as a probabilistic neural network by [15]. The novel element of the above result is the derivation of a clear (both theoretical and empirical) optimality principle underlying the rule, as opposed to the asymptotic results of [4] and the geometric motivations in [12, 15]. As a direct byproduct, it becomes straightforward to extend the Parzen window classifier with an additional intercept term or other parametric parts, or towards additive (structured) models as in [9].
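The rule (7) can be evaluated directly from kernel evaluations on the training sample. The sketch below (our notation, using a Gaussian RBF kernel with an illustrative bandwidth parameter) shows one possible implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma2=0.5):
    """Gaussian RBF kernel matrix, K(a, b) = exp(-||a - b||^2 / sigma2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def mam_kernel_classifier(X_tr, y_tr, X_te, sigma2=0.5):
    """Evaluate the kernel MAM rule (7): f(x) = (1 / (lambda * n)) sum_i y_i K(X_i, x)."""
    n = len(y_tr)
    Omega = rbf_kernel(X_tr, X_tr, sigma2)          # kernel matrix on the training set
    lam = np.sqrt(y_tr @ Omega @ y_tr) / n          # normalization constant lambda
    scores = rbf_kernel(X_te, X_tr, sigma2) @ y_tr / (lam * n)
    return np.sign(scores), scores                  # predicted labels and latent values w_n^T phi(x)
```

As with the Parzen window classifier, no training phase beyond storing the sample is required, and each prediction costs O(n) kernel evaluations.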
3 Analysis and Rademacher Complexities

The quantity of interest in the analysis of the generalization performance is the probability of making a mistake (the risk $R(w; P_{XY})$), or
$$R(w; P_{XY}) = P_{XY}\left(Y (w^T \varphi(X)) \leq 0\right) = E\left[I\left(Y (w^T \varphi(X)) \leq 0\right)\right], \qquad (8)$$
where $I(z)$ equals one if $z$ is true, and zero otherwise.

3.1 Rademacher Complexity

Let $\{\sigma_i\}_{i=1}^{n}$, taken from the set $\{-1, 1\}^n$, be Bernoulli random variables with $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$. The empirical Rademacher complexity is then defined [8, 1] as
$$\hat{R}_n(\mathcal{H}) = E_\sigma\left[\sup_{f \in \mathcal{H}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(X_i) \;\middle|\; X_1, \dots, X_n\right], \qquad (9)$$
where the expectation is taken over the choice of the binary vector $\sigma = (\sigma_1, \dots, \sigma_n)^T \in \{-1, 1\}^n$. The empirical Rademacher complexity defines a natural complexity measure to study the maximal average margin classifier, as the definitions of the empirical Rademacher complexity and the maximal average margin closely resemble each other (see also [8]). The following result was given in [1], Lemma 22, but we give an alternative proof by exploiting the structure of the optimal estimate explicitly.

Lemma 1 (Trace bound on the Empirical Rademacher Complexity for $\mathcal{H}$) Let $\Omega \in \mathbb{R}^{n \times n}$ be defined as $\Omega_{ij} = K(X_i, X_j)$ for all $i, j = 1, \dots, n$. Then
$$\hat{R}_n(\mathcal{H}) \leq \frac{2}{n}\sqrt{\mathrm{tr}(\Omega)}. \qquad (10)$$

Proof: The proof goes along the same lines as the classical bound on the empirical Rademacher complexity for kernel machines outlined in [1], Lemma 22. Specifically, once a vector $\sigma \in \{-1, 1\}^n$ is fixed, it is immediately seen that the maximizer of $\sum_{i=1}^{n} \sigma_i f(X_i)$ over $\mathcal{H}$ takes the same form as (7), and $\max_{\|w\|_2 = 1} \sum_{i=1}^{n} \sigma_i (w^T \varphi(X_i)) = \sqrt{\sigma^T \Omega \sigma}$. Application of the expectation operator $E_\sigma$ over the choice of the Rademacher variables then gives
$$\hat{R}_n(\mathcal{H}) = \frac{2}{n} E_\sigma\left[\sqrt{\sigma^T \Omega \sigma}\right] \leq \frac{2}{n}\sqrt{E_\sigma\left[\sigma^T \Omega \sigma\right]} = \frac{2}{n}\sqrt{\sum_{i,j} E[\sigma_i \sigma_j]\, K(X_i, X_j)} = \frac{2}{n}\sqrt{\sum_{i=1}^{n} K(X_i, X_i)} = \frac{2}{n}\sqrt{\mathrm{tr}(\Omega)}, \qquad (11)$$
where the inequality is based on application of Jensen's inequality. This proves the Lemma.

Remark that in the case of a kernel with constant trace (as e.g. for the RBF kernel, where $\mathrm{tr}(\Omega) = n$), it follows from this result that also the (expected) Rademacher complexity satisfies $E[\hat{R}_n(\mathcal{H})] \leq \frac{2}{\sqrt{n}}$. In general, one has $E[\hat{R}_n(\mathcal{H})] \leq 2\sqrt{E[K(X, X)]/n}$, where $E[K(X, X)]$ equals the trace of the integral operator $T_K$ defined on $L_2(P_X)$ as $T_K(f)(\cdot) = \int K(\cdot, X) f(X)\, dP_X(X)$, as in [1]. Application of McDiarmid's inequality to the variable $Z = \sup_{f \in \mathcal{H}}\left(E[Y (w^T \varphi(X))] - \frac{1}{n}\sum_{i=1}^{n} Y_i (w^T \varphi(X_i))\right)$ gives, as in [8, 1]:

Lemma 2 (Deviation Inequality) Let $0 < B < \infty$ be a fixed constant such that $\sup_z \sqrt{K(z, z)} \leq B$, so that $|w^T \varphi(z)| \leq B$, and let $\delta \in \mathbb{R}_0^+$ be fixed. Then with probability exceeding $1 - \delta$, one has for any $w \in \mathbb{R}^d$ that
$$E[Y (w^T \varphi(X))] \geq \frac{1}{n}\sum_{i=1}^{n} Y_i (w^T \varphi(X_i)) - \hat{R}_n(\mathcal{H}) - 3B\sqrt{\frac{\ln(2/\delta)}{2n}}. \qquad (12)$$

Therefore it follows that one maximizes the expected margin by maximizing the empirical average margin, while controlling the empirical Rademacher complexity by the choice of the model class (kernel). In the case of RBF kernels, $B = 1$, resulting in a reasonably tight bound. It is now illustrated how one can obtain a practical upper bound on the 'certainty of prediction' using $f(x) = w_n^T \varphi(x)$.

Theorem 2 (Occurrence of Mistakes) Given an i.i.d. sample $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$, a constant $B \in \mathbb{R}$ such that $\sup_z \sqrt{K(z, z)} \leq B$, and a fixed $\delta \in \mathbb{R}_0^+$. Then, with probability exceeding $1 - \delta$, one has for all $w \in \mathbb{R}^d$ that
$$P\left(Y (w^T \varphi(X)) \leq 0\right) \leq \frac{B - E[Y (w^T \varphi(X))]}{B} \leq 1 - \frac{\sqrt{y^T \Omega y}}{nB} + \frac{\hat{R}_n(\mathcal{H})}{B} + 3\sqrt{\frac{\ln(2/\delta)}{2n}}, \qquad (13)$$
where the last inequality uses Lemma 2 together with the fact that the empirical average margin of the maximal average margin classifier $w_n$ equals $\sqrt{y^T \Omega y}/n$.

Proof: The proof follows directly from application of Markov's inequality to the positive random variable $B - Y (w^T \varphi(X))$, with expectation $B - E[Y (w^T \varphi(X))]$, estimated accurately by the sample average as in the previous lemma. More generally, one obtains with probability exceeding $1 - \delta$ that for any $w \in \mathbb{R}^d$ and for any $\rho$ such that $-B < \rho < B$,
$$P\left(Y (w^T \varphi(X)) \leq -\rho\right) \leq \frac{B}{B + \rho} - \frac{\sqrt{y^T \Omega y}}{n(B + \rho)} + \frac{\hat{R}_n(\mathcal{H})}{B + \rho} + \frac{3B}{B + \rho}\sqrt{\frac{\ln(2/\delta)}{2n}}. \qquad (14)$$

[Figure 1: Example of (a) the MAM classifier and (b) the SVM on the Ripley dataset (axes X1, X2). The contour lines represent the estimate of the certainty of prediction ('scores') as derived in Theorem 2 for the MAM classifier in (a), and as in Corollary 1 for the case of SVMs with $g(z) = \min(1, \max(-1, z))$, where $|z| < 1$ corresponds with the inner part of the margin of the SVM, in (b). While the contours in (a) give an overall score of the predictions, the scores in (b) focus towards the margin of the SVM.]
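To make the bound concrete, the quantities appearing in (10) and (13) can be computed directly from the kernel matrix. The sketch below is our own illustration (the kernel matrix Omega is assumed to be precomputed, e.g. with the rbf_kernel helper above), not part of the original exposition.

```python
import numpy as np

def mam_mistake_bound(Omega, y, delta=0.05):
    """Upper bound (13) on P(Y f(X) <= 0) for the MAM classifier.

    Omega : (n, n) kernel matrix on the training sample, y : (n,) labels in {-1, +1}.
    """
    n = len(y)
    B = np.sqrt(np.max(np.diag(Omega)))            # sample estimate of sup_z sqrt(K(z, z)); exactly 1 for RBF
    emp_margin = np.sqrt(y @ Omega @ y) / n        # empirical average margin of w_n
    rad_bound = 2.0 * np.sqrt(np.trace(Omega)) / n # trace bound (10) on hat{R}_n(H)
    deviation = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return 1.0 - emp_margin / B + rad_bound / B + deviation
```

For the RBF kernel the expression simplifies, since then $B = 1$ and $\mathrm{tr}(\Omega) = n$.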
This results in a practical assessment of the 'certainty' of a prediction as follows. First, note that the random variable $Y (w_n^T \varphi(x^*))$ for a fixed $X = x^*$ can take only two values: either $-|w_n^T \varphi(x^*)|$ or $|w_n^T \varphi(x^*)|$. Therefore $P(Y (w_n^T \varphi(x^*)) \leq 0) = P(Y (w_n^T \varphi(x^*)) = -|w_n^T \varphi(x^*)|) \leq P(Y (w_n^T \varphi(x^*)) \leq -|w_n^T \varphi(x^*)|)$, as $Y$ can only take the two values $-1$ or $1$. Thus the event '$Y \neq \mathrm{sign}(w_n^T \varphi(x^*))$' for a sample $X = x^*$ occurs with probability lower than the right-hand side of (14) with $\rho = |w_n^T \varphi(x^*)|$. When asserting this for a number $n_v \in \mathbb{N}$ of samples $X^* \sim P_X$ with $n_v \to \infty$, mispredictions would occur in at most this fraction of the $n_v$ cases. In this sense, one can use the latent variable $w_n^T \varphi(x^*)$ as an indication of how 'certain' the prediction is. Figure 1.a gives an example of the MAM classifier, together with the level plots indicating the certainty of prediction. Remark however that the described 'certainty of prediction' statement differs from a conditional statement of the risk given as $P(Y (w^T \varphi(X)) < 0 \mid X = x^*)$. The essential difference with the probabilistic estimates based on the density estimates resulting from the Parzen window estimator is that the results become independent of the data dimension, as one avoids estimating the joint distribution.

3.2 Transforming the Margin Distribution

Consider the case where the assumption of a reasonable constant $B$ such that $P(\|\varphi(X)\|_2 < B) = 1$ is unrealistic. Then a transformation of the random variable $Y (w^T \varphi(X))$ can be fruitful, using a monotonically increasing function $g : \mathbb{R} \to \mathbb{R}$ with a constant $B_g \leq B$ such that $|g(z)| \leq B_g$ and $g(0) = 0$. In the choice of a proper transformation, two counteracting effects should be traded off properly. On the one hand, a small choice of $B_g$ improves the bound as e.g. described in Lemma 2. On the other hand, such a transformation makes the expected value $E[g(Y (w^T \varphi(X)))]$ smaller than $E[Y (w^T \varphi(X))]$. Modifying Theorem 2 gives

Corollary 1 (Occurrence of Mistakes, bis) Given i.i.d. samples $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$ and a fixed $\delta \in \mathbb{R}_0^+$. Let $g : \mathbb{R} \to \mathbb{R}$ be a monotonically increasing function with Lipschitz constant $0 < L_g < \infty$, let $B_g \in \mathbb{R}$ be such that $|g(z)| \leq B_g$ for all $z$, and let $g(0) = 0$. Then with probability exceeding $1 - \delta$, one has for any $\rho$ such that $-B_g \leq \rho \leq B_g$ and any $w \in \mathbb{R}^d$ that
$$P\left(Y (w_n^T \varphi(X)) \leq -\rho\right) \leq \frac{B_g}{B_g + \rho} - \frac{\frac{1}{n}\sum_{i=1}^{n} g\left(Y_i (w_n^T \varphi(X_i))\right) - L_g \hat{R}_n(\mathcal{H}) - 3B_g\sqrt{\frac{\log(2/\delta)}{2n}}}{B_g + \rho}. \qquad (15)$$

This result follows straightforwardly from Theorem 2, using the property that $\hat{R}_n(g \circ \mathcal{H}) \leq L_g \hat{R}_n(\mathcal{H})$, see e.g. [1]. When $\rho = 0$, one has $P(Y (w_n^T \varphi(X)) \leq 0) \leq 1 - \frac{E[g(Y (w^T \varphi(X)))]}{B_g}$. Similarly as in the previous section, Corollary 1 can be used to score the certainty of prediction by considering for each $X = x^*$ the values of $g(w^T \varphi(x^*))$ and $g(-w^T \varphi(x^*))$. Figure 1.b gives an example based on the clipping transformation $g(z) = \min(1, \max(-1, z)) \in [-1, 1]$, such that $B_g = 1$. Note that this a-priori choice of the function $g$ does not depend on the (empirical) optimality criterion at hand.

3.3 Soft-margin SVMs and MAM classifiers

Except for the margin-based mechanism, the MAM classifier shares other properties with the soft-margin maximal margin classifier (SVM) as well. Consider the saturation function $g(z) = (1 - z)_+$, where $(\cdot)_+$ is defined as $(z)_+ = z$ if $z \geq 0$ and zero otherwise. Rewriting the MAM formulation (4) with slack variables $\xi_i$ for the terms $1 - Y_i (w^T \varphi(X_i))$, one obtains for a $C > 0$
$$\min_{w, \xi} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^T w = C \ \text{and} \ Y_i (w^T \varphi(X_i)) = 1 - \xi_i, \quad i = 1, \dots, n, \qquad (16)$$
which is equivalent to (4), as in the optimum $Y_i (w^T \varphi(X_i)) = 1 - \xi_i$ for all $i$. To make the relation with the SVM more explicit, consider the following formulation of (16):
$$\min_{w, \xi} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^T w \leq C \ \text{and} \ Y_i (w^T \varphi(X_i)) \geq 1 - \xi_i, \quad i = 1, \dots, n, \qquad (17)$$
which is similar to the support vector machine (see e.g. [14]), given by
$$\min_{w, \xi} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^T w \leq C \ \text{and} \ Y_i (w^T \varphi(X_i)) \geq 1 - \xi_i, \ \xi_i \geq 0, \quad i = 1, \dots, n. \qquad (18)$$
Thus, omission of the slack constraints $\xi_i \geq 0$ (i.e. of the saturation implemented by $g$) in the SVM formulation results in the Parzen window classifier.
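The equivalence claimed above can be checked numerically: solving the SVM-like program (17) without the positivity constraints on the slacks should recover the MAM direction $X^T y$ of (5) in the linear case. A small sketch under the assumption that the cvxpy package is available; the data and variable names are illustrative only.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, C = 50, 3, 1.0
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

# Formulation (17): hinge-type slacks, but without the constraints xi_i >= 0.
w, xi = cp.Variable(d), cp.Variable(n)
problem = cp.Problem(cp.Minimize(cp.sum(xi)),
                     [cp.sum_squares(w) <= C,
                      cp.multiply(y, X @ w) >= 1 - xi])
problem.solve()

# The optimizer should point in the MAM/Parzen direction X^T y of (5).
w_mam = X.T @ y / np.linalg.norm(X.T @ y)
print(np.allclose(w.value / np.linalg.norm(w.value), w_mam, atol=1e-4))
```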
4 Maximal Average Margin for Ordinal Regression

Along the same lines as [6], the maximal average margin principle can be applied to ordinal regression tasks. Let $(X, Y) \in \mathbb{R}^d \times \{1, \dots, m\}$ with distribution $P_{XY}$, and let $(X', Y')$ denote an independent copy of $(X, Y)$. The $w \in \mathbb{R}^d$ maximizing the probability of concordance $P\left(w^T(\varphi(X) - \varphi(X'))\,(Y - Y') > 0\right)$ can be found by solving for the maximal average margin between pairs as follows:
$$M = \max_{w}\, E\left[\frac{\mathrm{sign}(Y - Y')\, w^T\left(\varphi(X) - \varphi(X')\right)}{\|w\|_2}\right]. \qquad (19)$$
Given $n$ i.i.d. samples $\{(X_i, Y_i)\}_{i=1}^{n}$, empirical risk minimization is obtained by solving
$$\min_{w} -\frac{1}{n} \sum_{i,j=1}^{n} \mathrm{sign}(Y_j - Y_i)\, w^T\left(\varphi(X_j) - \varphi(X_i)\right) \quad \text{s.t.} \quad \|w\|_2 \leq 1. \qquad (20)$$
The Lagrangian with multiplier $\lambda \geq 0$ becomes $\mathcal{L}(w, \lambda) = -\frac{1}{n}\sum_{i,j} \mathrm{sign}(Y_j - Y_i)\, w^T(\varphi(X_j) - \varphi(X_i)) + \frac{\lambda}{2}(w^T w - 1)$. Let there be $n'$ couples $(i, j)$ with $Y_i > Y_j$, and let $D_y \in \{-1, 0, 1\}^{n' \times n}$ be such that $D_{y,ki} = 1$ and $D_{y,kj} = -1$ if the $k$-th couple equals $(i, j)$. Then, by switching the minimax problem to a maximin problem, the first order condition for optimality $\frac{\partial \mathcal{L}(w, \lambda)}{\partial w} = 0$ gives the expression
$$w_n = \frac{1}{\lambda n} \sum_{i,j=1}^{n} \mathrm{sign}(Y_j - Y_i)\left(\varphi(X_j) - \varphi(X_i)\right) = \frac{2}{\lambda n}\, \Phi^T D_y^T 1_{n'},$$
where $\Phi = (\varphi(X_1), \dots, \varphi(X_n))^T$. Now the parameter $\lambda$ can be found by substituting this expression in the constraint $w^T w = 1$, as before.
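The pairwise sum above need not be formed explicitly: in the linear case it collapses to a weighted sum of the samples, where the weight of sample $j$ counts the samples with a strictly smaller label minus those with a strictly larger label. The following sketch (our notation, assuming a plain NumPy setting) evaluates this closed form in O(nd) time, in line with the O(n) complexity claimed for the MAM ordinal regression rule.

```python
import numpy as np

def mam_ordinal_direction(X, y):
    """Direction proportional to sum_{i,j} sign(y_j - y_i) (x_j - x_i),
    computed without forming the O(n^2) pairwise sum."""
    levels, counts = np.unique(y, return_counts=True)   # sorted ordinal levels and their counts
    n = len(y)
    smaller = np.cumsum(counts) - counts                 # #samples with a strictly smaller label, per level
    larger = n - np.cumsum(counts)                       # #samples with a strictly larger label, per level
    c = (smaller - larger)[np.searchsorted(levels, y)]   # per-sample weight c_j
    w = 2.0 * (X.T @ c)                                  # pairwise sum collapses to 2 * sum_j c_j x_j
    return w / np.linalg.norm(w)                         # normalize so that ||w||_2 = 1
```

A kernel-based variant would replace $X^T c$ by $\sum_j c_j \varphi(X_j)$, evaluated through the kernel as in (7).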