An Analysis of Inference with the Universum

Fabian H. Sinz, Max Planck Institute for Biological Cybernetics, Spemannstrasse 41, 72076 Tübingen, Germany, fabee@tuebingen.mpg.de
Alekh Agarwal, University of California Berkeley, 387 Soda Hall, Berkeley, CA 94720-1776, alekh@eecs.berkeley.edu
Olivier Chapelle, Yahoo! Research, Santa Clara, California, chap@yahoo-inc.com
Bernhard Schölkopf, Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany, bs@tuebingen.mpg.de

Abstract

We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results.

1 Introduction

Learning algorithms need to make assumptions about the problem domain in order to generalise well. These assumptions are usually encoded in the regulariser or the prior. A generic learning algorithm usually makes rather weak assumptions about the regularities underlying the data; an example of such an assumption is smoothness. More elaborate prior knowledge, often needed for good performance, can be hard to encode in a regulariser or a prior that is also computationally efficient. Interesting hybrids between both extremes are regularisers that depend on an additional set of data available to the learning algorithm. A prominent example of data-dependent regularisation is semi-supervised learning [1], where an additional set of unlabelled data, assumed to follow the same distribution as the training inputs, is tied to the regulariser using the so-called cluster assumption.

A novel form of data-dependent regularisation was recently proposed by [11]. The additional dataset for this approach is explicitly not from the same distribution as the labelled data, but represents a third -- neither -- class. This kind of dataset was first proposed by Vapnik [10] under the name Universum, owing its name to the intuition that the Universum captures a general backdrop against which the problem at hand is solved. According to Vapnik, a suitable set for this purpose can be thought of as a set of examples that belong to the same problem framework, but about which the resulting decision function should not make a strong statement. Although the idea was initially proposed for transductive inference, the authors of [11] proposed an inductive classifier in which the decision surface is chosen such that the Universum examples are located close to it. Implementing this idea in an SVM, different choices of Universa proved to be helpful in various classification tasks.

Although the authors showed that different choices of Universa and loss functions lead to certain known regularisers as special cases of their implementation, there are still a few unanswered questions. On the one hand, it is not clear whether the good performance of their algorithm is due to the underlying original idea, or just a consequence of the employed algorithmic relaxation. On the other hand, except in special cases, the influence of the Universum data on the resulting decision hyperplane, and therefore criteria for a good choice of a Universum, are not known.
In the present paper we address the second question by analysing the influence of the Universum data on the resulting function, both in the implementation of [11] and in a least squares version of it which we derive in section 2. Clarifying the regularising influence of the Universum on the solution of the SVM can give valuable insight into which set of data points might be a helpful Universum and how to obtain it.

The paper is structured as follows. After briefly deriving the algorithms in section 2, we show in section 3 that the algorithm of [11] pushes the normal of the hyperplane into the orthogonal complement of the subspace spanned by the principal directions of the Universum set. Furthermore, we demonstrate that the least squares version of the Universum algorithm is equivalent to a hybrid between kernel Fisher Discriminant Analysis and kernel Oriented Principal Component Analysis. In section 4, we validate our analysis on toy experiments and give an example of how to use the geometric and algorithmic intuition gained from the analysis to construct a Universum set for a real world problem.

2 The Universum Algorithms

2.1 The Hinge Loss U-SVM

We start with a brief review of the implementation proposed in [11]. Let L = {(x_1, y_1), ..., (x_m, y_m)} be the set of labelled examples and let U = {z_1, ..., z_q} denote the set of Universum examples. Using the hinge loss H_a[t] = max{0, a - t} and f_{w,b}(x) = ⟨w, x⟩ + b, a standard SVM can compactly be formulated as

  min_{w,b}  (1/2) ||w||^2 + C_L Σ_{i=1}^m H_1[y_i f_{w,b}(x_i)].

In the implementation of [11] the goal of bringing the Universum examples close to the separating hyperplane is realised by also minimising the cumulative ε-insensitive loss I_ε[t] = max{0, |t| - ε} on the Universum points:

  min_{w,b}  (1/2) ||w||^2 + C_L Σ_{i=1}^m H_1[y_i f_{w,b}(x_i)] + C_U Σ_{j=1}^q I_ε[|f_{w,b}(z_j)|].   (1)

Noting that I_ε[t] = H_{-ε}[t] + H_{-ε}[-t], one can use the simple trick of adding the Universum examples twice with opposite labels and obtain an SVM-like formulation which can be solved with a standard SVM optimiser.

2.2 The Least Squares U-SVM

The derivation of the least squares U-SVM starts with the same general regularised error minimisation problem

  min_{w,b}  (1/2) ||w||^2 + (C_L/2) Σ_{i=1}^m Q_{y_i}[f_{w,b}(x_i)] + (C_U/2) Σ_{j=1}^q Q_0[f_{w,b}(z_j)].   (2)

Instead of using the hinge loss, we employ the quadratic loss Q_a[t] = ||t - a||^2, which is used in the least squares versions of SVMs [9]. Expanding (2) in terms of slack variables ξ and ζ yields

  min_{w,b,ξ,ζ}  (1/2) ||w||^2 + (C_L/2) Σ_{i=1}^m ξ_i^2 + (C_U/2) Σ_{j=1}^q ζ_j^2   (3)
  s.t.  ⟨w, x_i⟩ + b = y_i - ξ_i   for i = 1, ..., m
        ⟨w, z_j⟩ + b = 0 - ζ_j    for j = 1, ..., q.

Minimising the Lagrangian of (3) with respect to the primal variables w, b, ξ and ζ, and substituting their optimal values back into (3), yields a dual maximisation problem in terms of the Lagrange multipliers α. Since this dual problem is still convex, we can set its derivative to zero and thereby obtain the following linear system

  ( 0     1^T   ) ( b )     ( 0 )
  ( 1    K + C  ) ( α )  =  ( y )
                            ( 0 )

Here, K = ( K_{L,L}  K_{L,U} ; K_{L,U}^T  K_{U,U} ) denotes the kernel matrix between the input points in the sets L and U, and C denotes an identity matrix of appropriate size scaled with 1/C_L in dimensions associated with labelled examples and with 1/C_U in dimensions corresponding to Universum examples; the right hand side contains the labels y for the labelled examples and zeros for the Universum examples. The solution (α, b) can then be obtained by a simple matrix inversion. In the remaining part of this paper we denote this least squares Universum SVM by Uls-SVM.
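To make the computation concrete, a minimal numpy sketch of this solve, assuming precomputed kernel blocks K_LL, K_LU, K_UU (variable names are ours, not from the reference implementation), could look as follows:

    import numpy as np

    def fit_uls_svm(K_LL, K_LU, K_UU, y, C_L, C_U):
        # K_LL, K_LU, K_UU: kernel blocks for labelled (L) and Universum (U) points
        # y: labels in {-1, +1} for the labelled points
        y = np.asarray(y, dtype=float)
        m, q = len(y), K_UU.shape[0]
        K = np.block([[K_LL, K_LU],
                      [K_LU.T, K_UU]])
        # C: identity scaled by 1/C_L on labelled and 1/C_U on Universum dimensions
        C = np.diag(np.concatenate([np.full(m, 1.0 / C_L), np.full(q, 1.0 / C_U)]))
        ones = np.ones(m + q)
        A = np.block([[np.zeros((1, 1)), ones[None, :]],
                      [ones[:, None], K + C]])
        rhs = np.concatenate([[0.0], y, np.zeros(q)])   # labels for L, zeros for U
        sol = np.linalg.solve(A, rhs)
        b, alpha = sol[0], sol[1:]
        # decision function: f(x) = sum_i alpha_i k(x_i, x) + b, summing over L followed by U
        return alpha, b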
2.3 Related Ideas

Although [11] proposed the first algorithm that explicitly refers to Vapnik's Universum idea, there exist related approaches that we shall mention briefly. The authors of [12] describe an algorithm for the one-vs-one strategy in multiclass learning that additionally minimises the distance of the separating hyperplane to the examples that are in neither of the two classes. Although this is algorithmically equivalent to the U-SVM formulation above, their motivation is merely to sharpen the contrast between the different binary classifiers. In particular, they do not consider using a Universum for binary classification problems.

There are also two Bayesian algorithms that refer to non-examples or a neither class in the binary classification setting. [8] gives a probabilistic interpretation of a standard hinge loss SVM by establishing the connection between the MAP estimate of a Gaussian process with a Gaussian prior using a covariance function k and a hinge loss based noise model. In order to deal with the problem that the proposed likelihood does not integrate to one, the author introduces a third -- the neither -- class. A similar idea is used by [4], where a third class is introduced to tackle the problem that unlabelled examples used in semi-supervised learning do not contribute to discriminative models P_{Y|X}(y_i | x_i), since the parameters of the label distribution are independent of input points with unknown, i.e., marginalised, value of the label. To circumvent this problem, the authors of [4] introduce an additional -- neither -- class to establish a stochastic dependence between the parameters and the unobserved label in the discriminative model. However, neither of the Bayesian approaches actually assigns an observed example to the introduced third class.

3 Analysis of the Algorithm

The following two sections analyse the geometrical relation of the decision hyperplane learnt with one of the Universum SVMs to the Universum set. It will turn out that in both cases the optimal solutions tend to make the normal vector orthogonal to the principal directions of the Universum. In the extreme case, where w is completely orthogonal to U, the decision function defined by w is invariant to transformations that act on the subspace spanned by the elements of U. Therefore, the Universum should contain directions the resulting function should be invariant against. In order to increase readability we state all results for the linear case. However, our results generalise to the case where the x_i and z_j live in an RKHS spanned by some kernel.

3.1 U-SVM and Projection Kernel

For this section we start by considering a U-SVM with a hard margin on the elements of U. Furthermore, we use ε = 0 for the ε-insensitive loss. After showing the equivalence to using a standard SVM trained on the orthogonal complement of the subspace spanned by the z_j, we extend the result to the case with a soft margin on U.

Lemma. A U-SVM with C_U = ∞, ε = 0 is equivalent to training a standard SVM with the training points projected onto the orthogonal complement of span{z_j - z_0 : z_j ∈ U}, where z_0 is an arbitrary element of U.

Proof: Since C_U = ∞ and ε = 0, any w yielding a finite value of (1) must fulfil ⟨w, z_j⟩ + b = 0 for all j = 1, ..., q. So ⟨w, z_j - z_0⟩ = 0 and w is orthogonal to span{z_j - z_0 : z_j ∈ U}. Let P_U denote the projection operator onto the orthogonal complement of that subspace. From the previous argument, we can replace ⟨w, x_i⟩ by ⟨P_U w, x_i⟩ in the solution of (1) without changing it. Indeed, the optimal w in (1) will satisfy w = P_U w. Since P_U is an orthogonal projection we have P_U^T = P_U and hence ⟨P_U w, x_i⟩ = ⟨w, P_U^T x_i⟩ = ⟨w, P_U x_i⟩. Therefore, the optimisation problem in (1) is the same as a standard SVM where the x_i have been replaced by P_U x_i.

The special case the lemma refers to clarifies the role of the Universum in the U-SVM. Since the resulting w is orthogonal to an affine space spanned by the Universum points, it is invariant against features implicitly specified by directions of large variance in that affine space. Picturing the ⟨·, z_j⟩ as filters that extract certain features from given labelled or test examples x, using the Universum algorithm means suppressing the features specified by the z_j.

Finally, we generalise the result of the lemma by dropping the hard constraint assumption on the Universum examples, i.e. we consider the case C_U < ∞. Let w* and b* be the optimal solution of (1). We have

  C_U Σ_{j=1}^q |⟨w*, z_j⟩ + b*|  ≥  C_U min_b Σ_{j=1}^q |⟨w*, z_j⟩ + b|.

The right hand side can be interpreted as an "L1 variance". So the algorithm tries to find a direction w such that the variance of the projection of the Universum points onto that direction is small. As C_U approaches infinity, this variance approaches 0 and we recover the result of the above lemma.
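To illustrate the lemma in the linear case, the following numpy sketch (our own illustration, not code from [11]) builds the projector P_U from a matrix Z whose rows are the Universum points; a standard linear SVM trained on the projected inputs then corresponds to a U-SVM with C_U = ∞ and ε = 0:

    import numpy as np

    def universum_projection(Z):
        # rows of Z are the Universum points z_1, ..., z_q
        D = Z[1:] - Z[0]                       # differences z_j - z_0 spanning the subspace
        # orthonormal basis of span{z_j - z_0} via a thin SVD
        B_full, s, _ = np.linalg.svd(D.T, full_matrices=False)
        r = int(np.sum(s > 1e-10))             # numerical rank
        B = B_full[:, :r]
        # projector onto the orthogonal complement of that span
        return np.eye(Z.shape[1]) - B @ B.T

    # usage: project the labelled inputs before training any standard linear SVM
    # P_U = universum_projection(Z)
    # X_proj = X @ P_U        (P_U is symmetric, so right-multiplication projects rows)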
3.2 Uls-SVM, Fisher Discriminant Analysis and Principal Component Analysis

In this section we present the relation of the Uls-SVM to two classic learning algorithms: (kernel) oriented Principal Component Analysis (koPCA) and (kernel) Fisher Discriminant Analysis (kFDA) [5]. As it will turn out, the Uls-SVM is equivalent to a hybrid between both, up to a linear equality constraint. Since koPCA and kFDA can both be written as the maximisation of a Rayleigh quotient, we start with the Rayleigh quotient of the hybrid

  max_w  [ w^T (c_+ - c_-)(c_+ - c_-)^T w ] / [ w^T ( C_L Σ_{k=±} Σ_{i∈I_k} (x_i - c_k)(x_i - c_k)^T + C_U Σ_{j=1}^q (z_j - c̃)(z_j - c̃)^T ) w ],

where the numerator stems from kFDA, the first term of the denominator from kFDA and the second term from koPCA. Here, c_± denote the class means of the labelled examples and c̃ = (1/2)(c_+ + c_-) is the point between them. As indicated, the numerator is exactly the same as in kFDA, i.e. the between-class variance, while the denominator is a linear combination of the denominators from kFDA and koPCA, i.e. the within-class variances from kFDA and the noise variance from koPCA.

As noted in [6], the numerator is just a rank one matrix. For optimising the quotient it can be fixed to an arbitrary value while the denominator is minimised. Since the denominator might not have full rank, it needs to be regularised [6]. Choosing the regulariser to be ||w||^2, the problem can be rephrased as

  min_w  ||w||^2 + w^T ( C_L Σ_{k=±} Σ_{i∈I_k} (x_i - c_k)(x_i - c_k)^T + C_U Σ_{j=1}^q (z_j - c̃)(z_j - c̃)^T ) w   (4)
  s.t.   w^T (c_+ - c_-) = 2.

As we will see below, this problem can further be transformed into the quadratic program

  min_{w,b,ξ,ζ}  ||w||^2 + C_L ||ξ||^2 + C_U ||ζ||^2   (5)
  s.t.  ⟨w, x_i⟩ + b = y_i + ξ_i   for all i = 1, ..., m
        ⟨w, z_j⟩ + b = ζ_j        for all j = 1, ..., q
        Σ_{i∈I_k} ξ_i = 0          for k = ±.

Ignoring the constraint Σ_{i∈I_k} ξ_i = 0, this program is equivalent to the quadratic program (3) of the Uls-SVM.
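Since the numerator of the Rayleigh quotient is rank one, problem (4) admits a closed-form solution in the linear case: minimising w^T S w subject to w^T (c_+ - c_-) = 2 gives w proportional to S^{-1}(c_+ - c_-), where S denotes the regularised denominator matrix. A small numpy sketch of this computation (our own illustration; X_pos, X_neg and Z are hypothetical data matrices with one example per row):

    import numpy as np

    def hybrid_fda_opca(X_pos, X_neg, Z, C_L, C_U):
        c_pos, c_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
        c_tilde = 0.5 * (c_pos + c_neg)          # point between the class means
        d = c_pos - c_neg                        # rank-one numerator direction
        # within-class scatter (kFDA part) plus Universum scatter about c_tilde (koPCA part)
        S = C_L * ((X_pos - c_pos).T @ (X_pos - c_pos)
                   + (X_neg - c_neg).T @ (X_neg - c_neg))
        S += C_U * (Z - c_tilde).T @ (Z - c_tilde)
        S += np.eye(S.shape[0])                  # ||w||^2 regulariser as in (4)
        # minimise w^T S w subject to w^T d = 2  =>  w proportional to S^{-1} d
        w = np.linalg.solve(S, d)
        w *= 2.0 / (w @ d)
        b = -w @ c_tilde                         # offset, cf. the proof of the lemma below
        return w, b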
The following lemma establishes the relation of the Uls-SVM to kFDA and koPCA.

Lemma. For given C_L and C_U the optimisation problems (4) and (5) are equivalent.

Proof: Let w, b, ξ and ζ be the optimal solution of (5). Combining the first and last constraints, we get w^T c_± + b ∓ 1 = 0. This gives us w^T (c_+ - c_-) = 2 as well as b = -w^T c̃. Plugging ξ and ζ into (5) and using this value of b, we obtain the objective function of (4). So we have proved that the minimum value of (4) is not larger than that of (5).

Conversely, let w be the optimal solution of (4). Let us choose b = -w^T c̃, ξ_i = ⟨w, x_i⟩ + b - y_i and ζ_j = ⟨w, z_j⟩ + b. Again both objective functions are equal. We just have to check that Σ_{i: y_i = ±1} ξ_i = 0. But because w^T (c_+ - c_-) = 2, we have

  (1/m_±) Σ_{i: y_i = ±1} ξ_i = w^T c_± + b ∓ 1 = w^T c_± - w^T (c_+ + c_-)/2 ∓ 1 = w^T (c_± - c_∓)/2 ∓ 1 = 0.

The above lemma establishes a relation of the Uls-SVM to two classic learning algorithms. This further clarifies the role of the Universum set in the algorithmic implementation of Vapnik's idea as proposed by [11]. Since the noise covariance matrix of koPCA is given by the covariance of the Universum points centered on the average of the labelled class means, the role of the Universum as a data-dependent specification of principal directions of invariance is affirmed.

The koPCA term also shows that both the position and the covariance structure are crucial to a good Universum. To see this, we rewrite Σ_{j=1}^q (z_j - c̃)(z_j - c̃)^T as Σ_{j=1}^q (z_j - z̄)(z_j - z̄)^T + q (z̄ - c̃)(z̄ - c̃)^T, where z̄ = (1/q) Σ_{j=1}^q z_j is the Universum mean. The additive relationship between the covariance of the Universum about its mean and the distance between the Universum and training sample means, projected onto w, shows that either quantity can dominate depending on the data at hand.

In the next section, we demonstrate the theoretical results of this section on toy problems and give an example of how to use the insight gained from this section to construct an appropriate Universum.

4 Experiments

4.1 Toy Experiments

The theoretical results of section 3 show that the covariance structure of the Universum as well as its absolute position influence the result of the learning process. To validate this insight on toy data, we sample ten labelled sets of size 20, 50, 100 and 500 from two fifty-dimensional Gaussians. Both Gaussians have a diagonal covariance with low standard deviation (σ_{1,2} = 0.08) in the first two dimensions and high standard deviation (σ_{3,...,50} = 10) in the remaining 48. The two Gaussians are displaced such that the means µ_i^± = ±0.3 exceed the standard deviation by a factor of 3.75 in the first two dimensions but are 125 times smaller than it in the remaining ones. The values are chosen such that the Bayes risk is approximately 5%. Note that, by construction, the first two dimensions are the most discriminative.

We construct two kinds of Universa for this toy problem. For the first kind we use a zero-mean Gaussian with the same covariance structure as the Gaussians for the labelled data (σ_{3,...,50} = 10), but with a varied degree of anisotropy in the first two dimensions (σ_{1,2} = 0.1, 1.0, 10). According to the results of section 3, these Universa should be more helpful for larger anisotropy. For the second kind of Universa we use the same covariance as the labelled classes but shift them along the line between the means of the labelled Gaussians. This kind of Universum should have a positive effect on the accuracy for small displacements, but that effect should vanish with an increasing amount of translation.

Figure 1 shows the performance of linear U-SVMs for different amounts of training and Universum data. In the top row, the degree of isotropy increases from left to right, where σ = 10 refers to the completely isotropic case. In the bottom row, the amount of translation increases from left to right. As expected, performance converges to the performance of an SVM for high isotropy and large translations t. Note that large translations do not affect the accuracy as much as high isotropy does. However, this might be due to the fact that the variance along the principal components of the Universum is much larger in magnitude than the applied shift. We obtained similar results for the Uls-SVM. Also, the effect remains when employing an RBF kernel.
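For reference, the toy data and the two kinds of Universa described above can be generated along the following lines (a minimal sketch under our reading of the stated parameters; the original experiment code is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(0)
    # class means and standard deviations; the mean/std ratio is 3.75 in the first
    # two dimensions and 1/125 in the remaining 48 (our reading of the text above)
    mu = np.r_[np.full(2, 0.3), np.full(48, 0.08)]
    sigma = np.r_[np.full(2, 0.08), np.full(48, 10.0)]

    def sample_labelled(m):
        y = rng.choice([-1.0, 1.0], size=m)
        X = y[:, None] * mu + rng.standard_normal((m, 50)) * sigma
        return X, y

    def universum_anisotropy(q, s12):
        # first kind: zero mean, standard deviation s12 in the first two dimensions
        s = np.r_[np.full(2, s12), np.full(48, 10.0)]
        return rng.standard_normal((q, 50)) * s

    def universum_shifted(q, t):
        # second kind: labelled-class covariance, shifted by (t/2)(c_+ - c_-) = t * mu
        return rng.standard_normal((q, 50)) * sigma + t * mu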
[Figure 1 shows six panels of learning curves: mean error versus the number of labelled examples m, for σ = 0.1, 1.0, 10.0 (top row) and t = 0.1, 0.5, 0.9 (bottom row), each comparing an SVM (q = 0) with U-SVMs using q = 100 and q = 500 (top row) or q = 1000 (bottom row) Universum points.]

Figure 1: Learning curves of linear U-SVMs for different degrees of isotropy σ and different amounts of translation z ↦ z + (t/2)(c_+ - c_-). With increasing isotropy and translation the performance of the U-SVMs converges to the performance of a normal SVM.

Universum digit | Test error (%) | Mean output | Angle (degrees)
0               | 1.234          |  0.406      | 81.99
1               | 1.313          | -0.708      | 85.57
2               | 1.399          | -0.539      | 79.49
3               | 1.051          | -0.031      | 69.74
4               | 1.246          | -0.256      | 79.75
6               | 1.111          |  0.063      | 81.02
7               | 1.338          | -0.165      | 82.72
9               | 1.226          | -0.360      | 77.98

Table 1: See text for details. Without Universum, the test error is 1.419%. The correlation between the test error and the absolute value of the mean output (resp. the angle) is 0.71 (resp. 0.64); the p-value (i.e. the probability of observing such a correlation by chance) is 3% (resp. 5.5%). Note, for instance, that digits 3 and 6 are the best Universa and they are also the closest to the decision boundary.

4.2 Results on MNIST

Following the experimental work of [11], we took up the task of distinguishing between the digits 5 and 8 on the MNIST data. Training sets of size 1000 were used, and other digits served as Universum data. Using different digits as Universa, we recorded the test error (in percent) of the U-SVM. We also computed the mean output (i.e. ⟨w, x⟩ + b) of a normal SVM trained for binary classification between the digits 5 and 8, measured on the points from the Universum class. Another quantity of interest was the angle between the covariance matrices of the training and Universum data in the feature space. Note that for two covariance matrices C_X and C_Y corresponding to data matrices X and Y (centered about their means), the cosine of the angle is defined as trace(C_X C_Y) / sqrt(trace(C_X^2) trace(C_Y^2)). This quantity can be computed in feature space as trace(K_{XY} K_{XY}^T) / sqrt(trace(K_{XX}^2) trace(K_{YY}^2)), with K_{XY} the kernel matrix between the sets X and Y. These quantities are documented in Table 1. All reported results are averaged over 10 folds of cross-validation, with C = C_U = 100 and ε = 0.01.
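The angle between covariance matrices used above can be computed from kernel matrices alone. A small numpy sketch of this computation, assuming precomputed kernel matrices K_XX, K_YY and K_XY (names ours):

    import numpy as np

    def centering(n):
        return np.eye(n) - np.ones((n, n)) / n

    def covariance_angle(K_XX, K_YY, K_XY):
        # centering the kernel matrices realises "centered about their means" in feature space
        Hx, Hy = centering(K_XX.shape[0]), centering(K_YY.shape[0])
        Kxx = Hx @ K_XX @ Hx
        Kyy = Hy @ K_YY @ Hy
        Kxy = Hx @ K_XY @ Hy
        cos_angle = np.trace(Kxy @ Kxy.T) / np.sqrt(np.trace(Kxx @ Kxx) * np.trace(Kyy @ Kyy))
        return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))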
4.3 Classification of Imagined Movements in Brain Computer Interfaces

Brain computer interfaces (BCIs) are devices that allow a user to control a computer merely by using his brain activity [3]. The user indicates different states to a computer system by deliberately changing his state of mind according to different experimental paradigms. These states are to be detected by a classifier. In our experiments, we used data from electroencephalographic (EEG) recordings with an imagined-movement paradigm. In this paradigm the patient imagines the movement of his left or right hand to indicate the respective state.

In order to reverse the spatial blurring of the brain activity by the intermediate tissue of the skull, the signals from all sensors are demixed via an independent component analysis (ICA) applied to the concatenated lowpass filtered time series of all recording channels [2]. In the experiments below we used two BCI datasets. For the first set (DATA I) we recorded the EEG activity of three healthy subjects for an imagined-movement paradigm as described by [3]. The second set (DATA II) contains EEG signals from a similar paradigm [7].

We constructed two kinds of Universa. The first Universum, U_C3, consists of recordings from a third condition in the experiments that is not related to imagined movements. Since variations in signals from this condition should not carry any useful information about the imagined-movement task, the classifier should be invariant against them. The second Universum, U_nm, is physiologically motivated. In the case of the imagined-movement paradigm, the relevant signal is known to lie in the so-called µ-band from approximately 10-12 Hz and to be spatially located over the motor cortices. Unfortunately, signals in this frequency band are also related to visual activity, and independent components can be found that have a strong influence from sensors over the visual cortex. However, since ICA is unsupervised, those independent components could still contain discriminative information. In order to make the learning algorithm prefer the signals from the motor cortex, we construct a Universum U_nm by projecting the labelled data onto the independent components that have a strong influence from the visual cortex.

The machine learning experiments were carried out in two nested cross validation loops, where the inner loop was used for model selection and the outer one for testing. We exclusively used a linear kernel. Table 2 shows the mean zero-one loss for DATA I and DATA II and the constructed Universa.

DATA I
Algorithm       | FS                    | JH                    | JL
SVM             | 40.00 ± 7.70          | 40.00 ± 11.32         | 30.00 ± 15.54
U-SVM, U_C3     | 41.33 ± 7.06 (0.63)   | 34.58 ± 9.22 (0.07)   | 30.56 ± 17.22 (1.00)
U-SVM, U_nm     | 39.67 ± 8.23 (1.00)   | 37.08 ± 11.69 (0.73)  | 30.00 ± 16.40 (1.00)
LS-SVM          | 41.00 ± 7.04          | 40.42 ± 11.96         | 30.56 ± 15.77
Uls-SVM, U_C3   | 40.67 ± 7.04 (1.00)   | 37.08 ± 7.20 (0.18)   | 31.11 ± 17.01 (1.00)
Uls-SVM, U_nm   | 40.67 ± 6.81 (1.00)   | 37.92 ± 12.65 (1.00)  | 30.00 ± 15.54 (1.00)

DATA II
Algorithm       | S1                    | S2                    | S3
SVM             | 12.35 ± 6.82          | 35.29 ± 13.30         | 35.26 ± 14.05
U-SVM, U_C3     | 13.53 ± 6.83 (0.63)   | 32.94 ± 11.83 (0.63)  | 35.26 ± 14.05 (1.00)
U-SVM, U_nm     | 12.35 ± 7.04 (1.00)   | 27.65 ± 14.15 (0.13)  | 36.84 ± 13.81 (1.00)
LS-SVM          | 13.53 ± 8.34          | 33.53 ± 13.60         | 34.21 ± 12.47
Uls-SVM, U_C3   | 12.94 ± 6.68 (1.00)   | 32.35 ± 10.83 (0.38)  | 35.79 ± 15.25 (1.00)
Uls-SVM, U_nm   | 16.47 ± 7.74 (0.50)   | 31.18 ± 13.02 (0.69)  | 35.79 ± 15.25 (1.00)

Table 2: Mean zero-one test error scores for the BCI experiments. The mean was taken over ten single error scores. The p-values of a two-sided sign test against the SVM error scores are given in brackets.

On the DATA I dataset, there is no improvement in the error rates for the subjects FS and JL compared to an SVM without Universum. Therefore, we must assume that the employed Universa did not provide helpful information in those cases. For subject JH, U_C3 and U_nm yield an improvement for both Universum algorithms. However, the differences to the SVM error scores are not significantly better according to a two-sided sign test. The Uls-SVM performs worse than the U-SVM in almost all cases.
On the DATA II dataset, there was an improvement only for subject S2, using the U-SVM with the U_nm and U_C3 Universa (8% and 3% improvement, respectively). However, these differences are also not significant. As already observed for the DATA I dataset, the Uls-SVM performs consistently worse than its hinge loss counterpart.

The better performance of the U_nm Universum on the subjects JH and S2 indicates that additional information about the usefulness of features might in fact help to increase the accuracy of the classifier. The regularisation constant C_U for the Universum points was chosen as C = C_U = 0.1 in both cases. This means that the non-orthogonality of w on the Universum points was only weakly penalised, but had equal priority to classifying the labelled examples correctly. This could indicate that the spatial filtering by the ICA is not perfect and discriminative information might be spread over several independent components, even over those that are mainly non-discriminative. Using the U_nm Universum, and therefore gently penalising the use of these non-discriminative features, can help to improve the classification accuracy, although the factual usefulness seems to vary with the subject.

5 Conclusion

In this paper we analysed two algorithms for inference with a Universum as proposed by Vapnik [10]. We demonstrated that the U-SVM as implemented in [11] is equivalent to searching for a hyperplane whose normal lies in the orthogonal complement of the space spanned by the Universum examples. We also showed that the corresponding least squares Uls-SVM can be seen as a hybrid between the two well known learning algorithms kFDA and koPCA, where the Universum points, centered between the means of the labelled classes, play the role of the noise covariance in koPCA. Ideally, the covariance matrix of the Universum should thus contain some important invariant directions for the problem at hand. The position of the Universum set also plays an important role, and both our theoretical and experimental analyses show that the behaviour of the algorithm depends on the difference between the means of the labelled set and of the Universum set. The question whether the main influence of the Universum comes from the position or the covariance does not have a clear answer and is probably problem dependent.

From a practical point of view, the main contribution of this paper is to suggest how to select a good Universum set: it should contain invariant directions and be positioned "in between" the two classes. Therefore, as can partly be seen from the BCI experiments, a good Universum dataset needs to be carefully chosen and cannot be an arbitrary backdrop, as the name might suggest.

References

[1] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.

[2] N. J. Hill, T. N. Lal, M. Schröder, T. Hinterberger, B. Wilhelm, F. Nijboer, U. Mochty, G. Widman, C. E. Elger, B. Schölkopf, A. Kübler, and N. Birbaumer. Classifying EEG and ECoG signals without subject training for fast BCI implementation: Comparison of non-paralysed and completely paralysed subjects. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):183-186, 2006.

[3] T. N. Lal. Machine Learning Methods for Brain-Computer Interfaces. PhD thesis, University of Darmstadt, 2005. Logos Verlag Berlin, MPI Series in Biological Cybernetics, Vol. 12, ISBN 3-8325-1048-6.
[4] N. D. Lawrence and M. I. Jordan. Gaussian processes and the null-category noise model. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, chapter 8, pages 137-150. MIT Press, 2006.

[5] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. Smola, and K.-R. Müller. Invariant feature extraction and classification in kernel spaces. In Advances in Neural Information Processing Systems 12, pages 526-532, 2000.

[6] S. Mika, G. Rätsch, and K.-R. Müller. A mathematical programming approach to the kernel Fisher algorithm. In Advances in Neural Information Processing Systems, 2000.

[7] J. del R. Millán. On the need for on-line learning in brain-computer interfaces. IDIAP-RR 30, IDIAP, Martigny, Switzerland, 2003. Published in Proceedings of the International Joint Conference on Neural Networks, 2004.

[8] P. Sollich. Probabilistic methods for support vector machines. In Advances in Neural Information Processing Systems, 1999.

[9] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293-300, 1999.

[10] V. Vapnik. Transductive inference and semi-supervised learning. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, chapter 24, pages 454-472. MIT Press, 2006.

[11] J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik. Inference with the Universum. In Proceedings of the 23rd International Conference on Machine Learning, page 127, 2006.

[12] P. Zhong and M. Fukushima. A new support vector algorithm. Optimization Methods and Software, 21:359-372, 2006.