Second Order Bilinear Discriminant Analysis for single-trial EEG analysis Christoforos Christoforou Department of Computer Science The Graduate Center of the City University of New York 365 Fifth Avenue New York, NY 10016-4309 cchristoforou@gc.cuny.edu Paul Sajda Department of Biomedical Engineering Columbia University 351 Engineering Terrace Building, MC 8904 1210 Amsterdam Avenue New York, NY 10027 ps629@columbia.edu Lucas C. Parra Department of Biomedical Engineering The City College of The City University of New York Convent Avenue 138th Street New York,NY 10031, USA parra@ccny.cuny.edu Abstract Traditional analysis methods for single-trial classi cation of electro- encephalography (EEG) focus on two types of paradigms: phase locked methods, in which the amplitude of the signal is used as the feature for classi cation, e.g. event related potentials; and second order methods, in which the feature of interest is the power of the signal, e.g. event related (de)synchronization. The procedure for deciding which paradigm to use is ad hoc and is typically driven by knowledge of the underlying neurophysiology. Here we propose a principled method, based on a bilinear model, in which the algorithm simultaneously learns the best rst and second order spatial and temporal features for classi cation of EEG. The method is demonstrated on simulated data as well as on EEG taken from a benchmark data used to test classi cation algorithms for brain computer interfaces. 1 1.1 Introduction Utility of discriminant analysis in EEG Brain computer interface (BCI) algorithms [1][2][3][4] aim to decode brain activity, on a singletrial basis, in order to provide a direct control pathway between a user's intentions and a computer. Such an interface could provide locked in patients a more direct and natural control over a neuroprosthesis or other computer applications [2]. Further, by providing an additional communication 1 channel for healthy individuals, BCI systems can be used to increase productivity and ef ciency in high-throughput tasks [5, 6]. Single-trial discriminant analysis has also been used as a research tool to study the neural correlates of behavior. By extracting activity that differs maximally between two experimental conditions, the typically low signal-noise ratio of EEG can be overcome. The resulting discriminant components can be used to identify the spatial origin and time course of stimulus/response speci c activity, while the improved SNR can be leveraged to correlate variability of neural activity across trials to behavioral variability and behavioral performance [7, 5]. In essence, discriminant analysis adds to the existing set of multi-variate statistical tools commonly used in neuroscience research (ANOV , A Hoteling T 2 , Wilks' test). 1.2 Linear and quadratic approaches In EEG the signal-to-noise ratio of individual channels is low, often at -20dB or less. To overcome this limitation, all analysis methods perform some form of averaging, either across repeated trials, across time, or across electrodes. Traditional EEG analysis averages signals across many repeated trials for individual electrodes. A conventional method is to average the measured potentials following stimulus presentation, thereby canceling uncorrelated noise that is not reproducible from one trial to the next. This averaged activity, called an event related potential (ERP), captures activity that is time-locked to the stimulus presentation but cancels evoked oscillatory activity that is not locked in phase to the timing of the stimulus. Alternatively, many studies compute the oscillatory activity in speci c frequency bands by ltering and squaring the signal prior to averaging. Thus, changes in oscillatory activity are termed event related synchronization or desynchronization (ERS/ERD). Surprisingly, discriminant analysis methods developed thus far by the machine learning community have followed this dichotomy: First order methods in which the amplitude of the EEG signal is considered to be the feature of interest in classi cation corresponding to ERP and second order methods in which the power of the feature is considered to be of importance for classi cation corresponding to ERS/ERD. First order methods include temporal ltering + thresholding [2], hierarchical linear classi ers [5] and bilinear discriminant analysis [8, 9]. Second order methods include the logistic regression with a quadratic term [11] and the well known common spatial patterns method (CSP) [10] and its variants: common spatio-spectral patterns (CSSP)[12], and common sparse spectral spatial patterns (CSSSP)[13] . Choosing what kind of features to use traditionally has been an ad hoc process motivated by knowledge of the underlying neurophysiology and task. From a machine-learning point of view, it seems limiting to commit a priori to only one type of feature. Instead it would be desirable for the analysis method to extract the relevant neurophysiological activity de novo with minimal prior expectations. In this paper we present a new framework that combines both the rst order features and the second order features in the analysis of EEG. We use a bilinear formulation which can simultaneously extract spatial linear components as well as temporal ( ltered) features. 2 2.1 Second order bilinear discriminant analysis Problem setting Given a set of sample points to the EEG signal of D = {Xn , yn }N=1 , X RD × T , y {-1, 1} , where Xn corresponds n D channels and T sample points and yn indicate the class that corresponds right or left hand imaginary movement, stimulus versus control to one of two conditions (e.g. conditions, etc.), the task is then to predict the class label y for an unobserved trial X. 2.2 Second order bilinear model De ne a function, f (X; ) = Trace(UT XV) + Trace(AT (XB)(XB)T A) where and } (1) = {U RD × R , V RT × R , A RD × K B RT × T are the parameters of the model R is a scaling constant. We consider the following discriminative model; we model the 2 log-odds ratio of the posterior class probability to be the sum of a bilinear function with respect to the EEG signal amplitude and linear with respect to the second order statistics of the EEG signal: log 2.2.1 The Interpretation of the model P (y = +1|X) = f (X|) P (y = -1|X) (2) rst term of the equation (1) can be interpreted as a spatio-temporal projection of the signal, rst order statistics of the signal. Speci cally, the columns under the bilinear model, and captures the ur vk U represent R linear projections in space (rows of X). Similarly, each of the R columns of in matrix V represent linear projections in time (columns of X). By re-writing the term as: of Trace(UT XV) = Trace(VUT X) = Trace(WT X) where we de ned of elements of (3) X with the W = UV , it is easy to see that the bilinear projection is a linear combination rank - R constrained on W. This expression is linear in X and thus T captures directly the amplitude of the signal directly. In particular, the polarity of the signal (positive evoked response versus negative evoked response) will contribute signi cantly to discrimination if it is consistent across trials. This term, therefore, captures phase locked event related potentials in the EEG signal. The second term of equation (1), is a projection of the power of the the second order statistics of the signal. enforces in matrix ltered signal, which captures As before, each column of matrix A and B, represent components that project the data in space and time respectively. Depending on the structure one B different interpretations of the model can be archived. In the general case where no structure on r ank - T example if s B is assumed, B, the model captures a linear combination of the elements of a econd order matrix approximation of the signal then = XB(XB)T . In the case where ltered signal. For Toeplitz structure is enforced on B de nes a temporal lter on the signal and the model captures the linear combination of the power of the second order matrix of the B is xed to a Toeplitz matrix with coef cients corresponding to a 8Hz-12Hz band pass lter, then the second term is able to extract differences in the alpha-band which is known to be modulated during motor related tasks. Further, by learning The spatial weights several electrodes. Finally, the scaling factor B from the data, we may be able to identify new frequency bands that have so far not been identi ed in novel experimental paradigms. A together with the Trace operation ensure that the power is measured, not in individual electrodes, but in some component space that may re ect activity distributed across (which may seem super uous given the available degrees of freedom) is necessary once regularization terms are added to the log-likelihood function. 2.3 Logistic regression We use a logistic Rregression (LR) formalism as it is particularly convenient when imposing additional statistical properties on the matrices U, V, A, B such as smoothness or sparseness. In addition, in our experience, LR performs well in strongly overlapping high-dimensional datasets and is insensitive to outliers, the later being of particular concern when including quadratic features. Under the logistic regression model (2) the class posterior probability P (y |X; ) is modeled as (4) P (y |X; ) = and the resulting log likelihood is given by 1 1 + e-y(f (X;)+wo ) L() = - nN =1 log(1 + e-y(f (Xn ;)+wo ) ) (5) We minimize the negative log likelihood and add a log-prior on each of the columns of and parameters of U, V and A B that act as a regularization term, which is written as: rR =1 argmin -L() - U,V,A,B,wo (log p(ur ) + log p(vr )) - kK =1 t T =1 log p(ak ) - log(p(bt )) (6) 3 where the log-priors are given for each of the parameters as , log p(vk ) = K(u) RD×D , K [8]. uT K(v) uk , k (v ) T ×T R log p(ak ) , K(a) R = aT K(a) ak and k D ×D (b) T ×T ,K R log p(uk ) = uT K(u) uk k log p(bk ) = bT K(b) bk . k are kernel matrices that con- trol the smoothness of the parameter space. Details on the regularization procedure can be found in Analytic gradients of the log likelihood (5) with respect to the various parameters are given by: L() ur L() vr L() ar L() bt where we de ne = = = = nN =1 yn (Xn )Xn vr yn (Xn )ur Xn yn (Xn )(Xn B)(Xn B)T ar yn (Xn )XT AAT Xbt (7) nN =1 (8) 2 2 nN =1 (9) nN =1 (10) (Xn ) = 1 - P (y |X) = where ui , vi , ai , and bi correspond to the ith e-y(f (Xn ;)+wo ) 1 + e-y(f (Xn ;)+wo ) columns of U, V, A, B respectively. (11) 2.4 Fourier Basis for B If matrix B is constrained to have a circular toepliz structure then it can be represented as B = F-1 DF, where F-1 denotes the inverse Fourier matrix, and D is a diagonal complex-valued matrix of Fourier coef cients. In such a case, we can re-write equations (9) and (10) as L() ar L() di = = 2 2 nN =1 ^ yn (Xn )(Xn F-1 DF-T XT )ar n yn (Xn )(F-T XT AAT Xn F-1 )i,i di n (12) nN =1 (13) (14) where ^ D = DDT , and the parameters are now optimized with respect to Fourier coef cients Di,i . 3 3.1 di = An iterative minimization procedure can be used to solve the above minimization. Results Simulated data In order to validate our method and its ability to capture both linear and second order features, we generated simulated data that contained both types of features; namely ERP type of features and ERS/ERD type of features. The simulated signals were generated with a signal to noise ratio of -20dB which is a typical noise level for EEG. A total of 28 channels, 500 ms long signals and at a sampling frequency of 100Hz where generated, resulting in a matrix of X of 28 by 50 elements, for each trial. Data corresponding to a total of 1000 trials were generated; 500 trials contained only zero mean Gaussian noise (representing baseline conditions), with the other 500 trials having the signal of interest added to the noise (representing the stimulus condition): For channels 1-9 the signal was composed of a 10Hz sinusoid with random phase in each of the nine channels, and across trials. The 4 U component 1.5 0.4 0.3 amplitude 0.2 0.1 0 -0.1 V Component 1 0.5 0 -0.5 0 10 channels 20 30 0 50 100 150 200 250 300 time(m/s) 350 400 450 500 A component 1.5 0.15 0.1 1 amplitute 0.05 0 -0.05 -0.1 -0.15 -0.5 0 10 channels 20 30 -0.2 0 50 100 150 B component 0.5 0 200 250 300 time (m/s) 350 400 450 500 Figure 1: Spatial and temporal component extracted on simulated data for the linear term (top) and quadratic term (bottom). sinusoids were scaled to match the simulates an ERP type feature. -20dB SNR. This simulates an ERS type feature. For channels 10-18, a peak represented by a half cycle sinusoid was added at approximately 400 ms, which T The extracted components are shown in Figure 1. The linear component identi ed the ERP activity. components U (in this case only a colV has a temporal umn vector) has non-zero coef cients for channels 10 to 18 only, showing that the method correctly Furthermore, the associated temporal component pro le that matches the time course of the simulated evoked response. Similarly, the second order the spatial distribution of the non-phase locked activity. The temporal frequency domain and the resulting A have non-zero weights for only channels 1-9 showing that the method also identi ed lter B was trained in the lter is shown here in the time domain. It exhibits a dominant 10Hz component, which is indeed the frequency of the non-phase locked activity. 3.2 BCI competition dataset To evaluate the performance of the proposed method on real data we applied the algorithm to an EEG data set that was made available through The BCI Competition 2003 ([14], Data Set IV). EEG was recorded on 28 channels for a single subject performing self-paced key typing, that is, pressing the corresponding keys with the index and little ngers in a self-chosen order and timing (i.e. self-paced). Key-presses occurred at an average speed of 1 key per second. Trial matrices were extracted by epoching the data starting 630ms before each key-press. A total of 416 epochs were recorded, each of length 500ms. For the competition, the rst 316 epochs were to be used for classi er training, while the remaining 100 epochs were to be used as a test set. Data were recorded at 1000 Hz with a pass-band between 0.05 and 200 Hz, then downsampled to 100Hz sampling rate. For this experiment, the matrix 33Hz bandpass columns of B was xed to a Toeplitz structure that encodes a 10Hz- lter and only the parameters U, V , A and U and V w0 were trained. The number of lter were set to 1, where two columns were used for A. The temporal was selected based on prior knowledge of the relevant frequency band. This demonstrates the exibility of our approach to either incorporate prior knowledge when available or extract it from 5 U component 0.1 V component 0.05 0 -0.05 -0.1 0 100 200 300 time (m/s) Second Column of A 400 500 First Column of A Figure 2: Spatial and temporal component (top), and two spatial components for second order features (bottom) learned on the benchmark dataset data otherwise. Regularization parameters where chosen via a ve fold cross validation procedure (details can be found in [8]). The resulting components for this dataset are shown in Figure 2. Benchmark performance was measured on the test set which had not been used during either training or cross validation. our method on a new The number of misclassi ed trials in the test set was 13 which places rst place given the results of the competition which can be found onHence, our line http://ida. rst.fraunhofer.de/projects/bci/competition ii/results/index.html ([14]). method works as a classi er producing a state-of-the art result on a realistic data set. The receiveroperator characteristic (ROC) curve for cross validation and for the independent testset are shown in Figure 3. Figure 3.2 also shows the contribution of the linear and quadratic terms for every trial for the two types of key-presses. The gure shows that the two terms provide independent information and that in this case the optimal relative weighting factor is 1. 4 Conclusion In this paper we have presented a framework for uncovering spatial as well as temporal features in EEG that combine the two predominant paradigms used in EEG analysis: event related potentials and oscillatory power. These represent phase locked activity (where polarity of the activity matters), and non-phase locked activity (where only the power of the signal is relevant). We used the probabilistic formalism of logistic regression that readily incorporates prior probabilities to regularize the increased number of parameters. We have evaluated the proposed method on both simulated data, and a real BCI benchmark dataset, achieving state-of-the-art classi cation performance. The proposed method provides a basis for various future directions. For example, different sets of basis functions (other than a Fourier basis) can be enforced on the temporal decomposition of the data through the matrix B (e.g. wavelet basis). Further, the method can be easily generalized to 6 AUC : 0.96 1 0.9 0.8 0.7 True positive rate 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 0.4 0.6 False positive rate 0.8 1 True positive rate 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.2 AUC : 0.935 #errors:13 0.4 0.6 False positive rate 0.8 1 Figure 3: ROC curve with area under the curve 0.96 for the cross validation on the benchmark dataset (left). ROC curve with area under the curve 0.93, on the independent test set, for the benchmark dataset. There were a total of 13 errors on unseen data, which is less than any of the results previously reported, placing this method in rst place in the benchmark ranking. Training Set Testing set 10 second order term 5 0 -5 -10 -15 -20 second order term -10 0 first order term 10 5 0 -5 -10 -15 -10 -5 0 first order term 5 10 Figure 4: Scatter plot of the rst order term vs second order term of the model, on the training and testing set for the benchmark dataset ('+' left key, and 'o' right key). It is clear that the two types of features contain independent information that can help improve the classi cation performance. 7 multi-class problems by using a multinomial distribution on y. Finally, different regularizations (i.e L1 norm, L2 norm) can be applied to the different types of parameters of the model. References [1] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan. Brain-computer interfaces for communication and control. Clin Neurophysiol, 113(6):767791, June 2002. [2] N. Birbaumer, N. Ghanayim, T. Hinterberger, I. Iversen, B. Kotchoubey, A. Kubler, J. Perelmouter, E. Taub, and H. Flor. May 1999. [3] B. Blankertz, G. Curio, and K. uller. Classifying single trial eeg: Towards brain computer interfacing. In T. G. Diettrich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002., 2002. [4] B. Blankertz, G. Dornhege, C. Schfer, R. Krepki, J. Kohlmorgen, K. Mller, V. Kunzmann, F. Losch, and G. Curio. Boosting bit rates and error detection for the classi cation of fast-paced motor commands based on single-trial eeg analysis. IEEE Trans. Neural Sys. Rehab. Eng., 11(2):127131, 2003. [5] Adam D. Gerson, Lucas C. Parra, and Paul Sajda. Cortically-coupled computer vision for rapid image search. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14:174179, June 2006. [6] Lucas C. Parra, Christoforos Christoforou, Adam D. Gerson, Mads Dyrholm, An Luo, Mark Wagner, Marios G. Philiastides, and Paul Sajda. Spatiotemporal linear decoding of brain state: Application to performance augmentation in high-throughput tasks. IEEE, Signal Processing Magazine, January 2008. [7] Philiastides Marios G., Ratcliff Roger, and Sajda Paul. Neural representation of task dif culty and decision making during perceptual categorization: A timing diagram. 89658975, August 2006. [8] Mads Dyrholm, Christoforos Christoforou, and Lucas C. Parra. Bilinear discriminant component analysis. J. Mach. Learn. Res., 8:10971111, 2007. [9] Ryota Tomioka and Kazuyuki Aihara. Classifying matrices with a spectral regularization. In 24th International Conference on Machine Learning, 2007. ¨ [10] H. Ramoser, J. Muller-Gerking, and G. Pfurtscheller. Optimal spatial ltering of single trial EEG during Journal of Neuroscience, 26(35): A spelling device for the paralysed. Nature, 398(6725):2978, Mar February- imagined hand movement. IEEE Trans. Rehab. Eng., 8:441446, December 2000. [11] Ryota Tomioka, Kazuyuki Aihara, and Klaus-Robert Mller. Logistic regression for single trial eeg clas¨ si cation. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 13771384. MIT Press, Cambridge, MA, 2007. [12] S. Lemm, B. Blankertz, G. Curio, and K. Muller. Spatio-spectral lters for improving the classi cation of single trial eeg. IEEE Trans Biomed Eng., 52(9):15418, 2005., 2005. [13] Dornhege G., Blankertz B, and K.R. Krauledat M. Losch F. Curio G.Muller. Combined optimization of spatial and temporal 2006. [14] B. Blankertz, K.-R. Muller, G. Curio, T.M. Vaughan, G. Schalk, J.R. Wolpaw, A. Schlogl, C. Neuper, G. Pfurtscheller, T. Hinterberger, M. Schroder, and N. Birbaumer. The bci competition 2003: progress and perspectives in detection and discrimination of eeg single trials. Transactions on, 51(6):10441051, 2004. Biomedical Engineering, IEEE lters for improving brain-computer interfacing. IEEE Trans. Biomed. Eng. 2006, 8