A Reproducing Kernel Hilbert Space Framework for Pairwise Time Series Distances

Zhengdong Lu (zhengdon@csee.ogi.edu)
Todd K. Leen (tleen@csee.ogi.edu)
Yonghong Huang (huang@csee.ogi.edu)
Deniz Erdogmus (derdogmus@ieee.org)
OGI School, Oregon Health & Science University, 20000 NW Walker Rd., Beaverton, OR 97006 USA

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

A good distance measure for time series needs to properly incorporate the temporal structure, and should be applicable to sequences of unequal length. In this paper, we propose a distance measure as a principled solution to these two requirements. Unlike the conventional feature vector representation, our approach represents each time series with a summarizing smooth curve in a reproducing kernel Hilbert space (RKHS), and therefore translates distances between time series into distances between curves. Moreover, we propose to learn the kernel of this RKHS from a population of time series with discrete observations, using Gaussian process-based non-parametric mixed-effect models. Experiments on two vastly different real-world problems show that the proposed distance measure leads to improved classification accuracy over conventional distance measures.

1. Introduction

Time series classification is a supervised learning problem aimed at labeling temporally structured sequences of variable length. The most common approach reduces time series classification to a static problem by suitably transforming the input sequences into vectors in Euclidean space. One can either summarize each time series with attributes pertinent to classification (called feature extraction) (Keogh & Pazzani, 1998), or use a properly sampled and aligned subsequence (called sampling) (Parra et al., 2003). Unfortunately, feature extraction is still more art than science, and its performance depends heavily on the designer's domain knowledge and the particular heuristics implemented. The sampling method, although it preserves most of the information, ignores the important temporal structure of the series. Indeed, the sampled sequences, if treated as vectors in Euclidean space, lead to the same classifiers after any permutation of the vector entries. Moreover, the sampling strategy does not apply to situations where we have only sparse observations made at irregular times.

In this paper, we propose a principled non-parametric distance measure for time series, obtained by representing each time series with a smooth curve in a reproducing kernel Hilbert space (RKHS) whose kernel is learned from data. This new distance measure circumvents the limitations of the two strategies mentioned above.

Paper Roadmap. In Section 2, we review the Bregman divergence and generalize it to function space to obtain a proper distance measure between smooth curves. In Section 3, we propose a family of new distance measures for time series with only discrete observations. Section 4 is devoted to the non-parametric mixed-effect model, which further specifies the proposed distance measure. In Section 5, we apply the proposed distance measure to two real-world time series classification problems. Finally, we discuss related work in Section 6.

2. Gaussian Processes and Functional Bregman Divergence

The Bregman divergence is a natural generalization of the squared Euclidean distance and the KL-divergence.
A Bregman divergence corresponding to a strictly convex function φ(x) (called the seed function) is defined as

$$d_\varphi(x_1\,\|\,x_2) = \varphi(x_1) - \varphi(x_2) - \langle \nabla\varphi(x_2),\, x_1 - x_2\rangle. \qquad (1)$$

The Bregman divergence is closely connected to the exponential family (Banerjee et al., 2005). For any distribution in the exponential family

$$p(x;\theta) = \exp(\langle x, \theta\rangle - \psi(\theta))\, p_0(x), \qquad (2)$$

the log likelihood can be rewritten as

$$\log p(x;\theta) = -d_\varphi(x\,\|\,\mu(\theta)) + \varphi(x) + \log p_0(x), \qquad (3)$$

where φ is the conjugate function of ψ, φ(x) = sup_θ {⟨x, θ⟩ − ψ(θ)}, and µ(θ) = ∇ψ(θ) is the expectation parameter corresponding to θ. We go one step further and argue that d_φ(x_1 || x_2) is a proper model-weighted divergence between any x_1 and x_2. It is straightforward to show that for a multivariate Gaussian distribution N(a, Σ), the corresponding Bregman divergence is

$$d_\varphi(x_1\,\|\,x_2) = \tfrac{1}{2}(x_1 - x_2)^T \Sigma^{-1}(x_1 - x_2), \qquad (4)$$

which is also suggested in (Tipping, 1999) as a model-weighted distance for the Gaussian distribution.

2.1. Extension to Function Space

We generalize our discussion of the Bregman divergence and the exponential family to function spaces. To facilitate the discussion, we adopt the language of functional integrals, which, although allegedly not rigorously defined, provides a powerful technique for describing probability distributions on functions (Simon, 1979). Gaussian processes (GPs) (Rasmussen & Williams, 2006) generalize the multivariate Gaussian distribution to function space, modeling a function f with the probability¹

$$p[f] \propto \exp\!\big(-\tfrac{1}{2}\|f - f_0\|_H^2\big), \qquad (5)$$

with f_0 the mean function and ||·||_H the norm of the reproducing kernel Hilbert space (RKHS) H. We use K to denote the reproducing kernel, which is also the covariance function of the Gaussian process in Eq.(5) (Seeger, 2004). In regularization theory, the norm ||·||_H is often related to a particular type of smoothness, with large (even infinite) ||f||_H for non-smooth f.

¹ In the remainder of the paper, we use square brackets [·] to distinguish functionals from ordinary functions.

After generalizing Eq.(1) to the functional case (Frigyik et al., 2006), we get the Bregman divergence between functions f_1 and f_2 with a seed functional φ_g[·]:

$$d_g(f_1\,\|\,f_2) = \varphi_g[f_1] - \varphi_g[f_2] - \int D\varphi_g[f_2]\,(f_1(t) - f_2(t))\, dt,$$

where Dφ_g[f] is the Fréchet derivative. The Gaussian process expressed in Eq.(5) can be viewed as a member of the exponential family extended to distributions on functions (Altun et al., 2004). A direct generalization of Eq.(3) then leads to φ_g[f] = ½||f||²_H, which gives a GP-related divergence for smooth functions:

$$d_H(f_1\,\|\,f_2) = \tfrac{1}{2}\|f_1 - f_2\|_H^2. \qquad (6)$$
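To make Eq.(4) concrete: for a Gaussian with fixed covariance Σ, the seed function arising from the exponential-family construction above is (up to terms that cancel in the divergence) φ(x) = ½ x^T Σ^{-1} x, and substituting it into Eq.(1) recovers the Mahalanobis-type form. The following minimal Python sketch is our own illustration (not from the paper) and simply checks this numerically.

```python
import numpy as np

def bregman(phi, grad_phi, x1, x2):
    """Generic Bregman divergence d_phi(x1 || x2) from Eq. (1)."""
    return phi(x1) - phi(x2) - grad_phi(x2) @ (x1 - x2)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)        # a random symmetric positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)

# Seed function for N(a, Sigma) with fixed Sigma: phi(x) = 0.5 * x^T Sigma^{-1} x
phi = lambda x: 0.5 * x @ Sigma_inv @ x
grad_phi = lambda x: Sigma_inv @ x

x1, x2 = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(bregman(phi, grad_phi, x1, x2),
                  0.5 * (x1 - x2) @ Sigma_inv @ (x1 - x2))   # Eq. (4)
```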
3. Distance for Time Series

We consider k time series, using y_i to denote the N_i observations from the i-th time series made at times t_i:

$$y_i = [y_{i1}, \cdots, y_{iN_i}]^T, \qquad t_i = [t_{i1}, \cdots, t_{iN_i}]^T.$$

The subscript i on t_i and N_i indicates that the observation times, and even the number of observations, are generally different for each individual. The time series are called synchronized if all the t_i are the same.

We define a distance measure for such time series by associating the observations {t_i, y_i} with a (smooth) curve. We assume the observations for each individual i are generated from an independent Gaussian process f_i with the same covariance function K (and therefore the same H) and mean f_0. The observations are modeled as

$$y_{in} = f_i(t_{in}) + \epsilon_{in}, \qquad n = 1, 2, \cdots, N_i, \qquad (7)$$

where ε_in is white observation noise with standard deviation σ for all i and n. We choose to summarize each individual time series i with the expectation of f_i(t) given the discrete noisy observations {t_i, y_i}:

$$\hat{f}_i(t) = E[f_i(t)\,|\,y_i, f_0; t_i, K] \qquad (8)$$
$$\phantom{\hat{f}_i(t)} = f_0(t) + K(t, t_i)\big(K(t_i, t_i) + \sigma^2 I\big)^{-1}(y_i - f_{0,i}), \qquad (9)$$

where f_{0,i} = [f_0(t_{i1}), f_0(t_{i2}), ..., f_0(t_{iN_i})]^T collects the values of f_0 at times t_i, and K(t_i, t_i) is the N_i × N_i matrix with (n, m) entry K(t_{in}, t_{im}). With a smooth f_0 we have ||f̂_i||_H < +∞, which can be loosely interpreted as f̂_i being smooth according to K. Fig.1 gives an example of using such a curve f̂ to represent noisy observations.

[Figure 1. Using a smooth curve to represent noisy discrete observations (black crosses). The smooth curve is obtained using Eq.(9) with K a Gaussian kernel and f_0 = 0.]

We then use the distance between f̂_i and f̂_j as the distance between time series {t_i, y_i} and {t_j, y_j},² which is given by Eq.(6) as

$$d_{ij} = \tfrac{1}{2}\|\hat{f}_i - \hat{f}_j\|_H^2. \qquad (10)$$

² Although E[||f_i − f_j||²_H | y_i, y_j; t_i, t_j] seems a reasonable measure of distance, it goes to infinity, since with probability one a sample f from a Gaussian process with covariance function K has ||f||_H = ∞ (Seeger, 2004).

Since H is the RKHS induced by the kernel K, this distance measure is well defined:

$$d_{ij} = \tfrac{1}{2}\langle \hat{f}_i - \hat{f}_j,\; \hat{f}_i - \hat{f}_j\rangle_H = \tfrac{1}{2}\langle K(t, t_i)v_i - K(t, t_j)v_j,\; K(t, t_i)v_i - K(t, t_j)v_j\rangle_H,$$

where v_i = (K(t_i, t_i) + σ²I)^{-1}(y_i − f_{0,i}). Using the reproducing kernel property ⟨K(t_n, ·), K(t_m, ·)⟩_H = K(t_n, t_m) for any t_n, t_m, the distance simplifies to

$$d_{ij} = \tfrac{1}{2}\, v_i^T K(t_i, t_i)\, v_i + \tfrac{1}{2}\, v_j^T K(t_j, t_j)\, v_j - v_i^T K(t_i, t_j)\, v_j. \qquad (11)$$

It is important to note that this distance does not require the time series to be synchronized, an advantage when the sequences have different lengths or the observations are made at different times, as in our first experiment in Section 5. When the observations for all individuals are synchronized, we have t_i = t = [t_1, t_2, ..., t_N]^T, with N the number of observations for each individual. Letting K = K(t, t), we can rewrite d_ij as

$$d_{ij} = \tfrac{1}{2}\big(v_i^T K v_i + v_j^T K v_j - 2\, v_i^T K v_j\big) \qquad (12)$$
$$\phantom{d_{ij}} = \tfrac{1}{2}(v_i - v_j)^T K\,(v_i - v_j) \qquad (13)$$
$$\phantom{d_{ij}} = \tfrac{1}{2}(y_i - y_j)^T (K + \sigma^2 I)^{-1}\, K\, (K + \sigma^2 I)^{-1}(y_i - y_j). \qquad (14)$$

Temporal Structure. In Eqs.(11)-(14), the temporal regularity is incorporated into the distance via the kernel K. This is most clear when we notice that K models the correlation of the values of f at different times,

$$K(t_i, t_j) = E[(f(t_i) - f_0(t_i))\,(f(t_j) - f_0(t_j))].$$

The norm ||f̂_i − f̂_j||_H measures the irregularity defined by K, in contrast to the Euclidean distance ∫(f̂_i(t) − f̂_j(t))² dt, which captures only the pointwise difference between f̂_i and f̂_j. It is also important to notice that the particular temporal structure incorporated varies greatly with the choice of K: for example, the widely used Matérn (including Gaussian) and rational quadratic kernels promote different types and levels of smoothness. On the other hand, the temporal structure is often problem-specific and hard to determine beforehand. In the next section, we discuss learning this temporal structure from the data.
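Eqs.(9) and (11) are directly computable from the raw observations and do not require the two series to share observation times. The following minimal Python sketch is our own illustration, not code from the paper: it uses a Gaussian kernel with placeholder parameters, a placeholder noise level, and f_0 = 0 (as in Fig.1).

```python
import numpy as np

def gauss_kernel(s, t, scale=1.0, width=1.0):
    """Gaussian covariance function evaluated on two 1-D arrays of times."""
    d2 = (s[:, None] - t[None, :]) ** 2
    return scale * np.exp(-d2 / (2.0 * width ** 2))

def summary_curve(t_i, y_i, t_grid, sigma=0.1):
    """Summarizing curve of Eq. (9) on a grid of times, with f_0 = 0."""
    K_ii = gauss_kernel(t_i, t_i)
    v_i = np.linalg.solve(K_ii + sigma ** 2 * np.eye(len(t_i)), y_i)
    return gauss_kernel(t_grid, t_i) @ v_i

def rkhs_distance(t_i, y_i, t_j, y_j, sigma=0.1):
    """Squared RKHS distance d_ij of Eq. (11); the two series may have
    different lengths and different observation times (f_0 = 0 here)."""
    K_ii, K_jj = gauss_kernel(t_i, t_i), gauss_kernel(t_j, t_j)
    K_ij = gauss_kernel(t_i, t_j)
    v_i = np.linalg.solve(K_ii + sigma ** 2 * np.eye(len(t_i)), y_i)
    v_j = np.linalg.solve(K_jj + sigma ** 2 * np.eye(len(t_j)), y_j)
    return 0.5 * v_i @ K_ii @ v_i + 0.5 * v_j @ K_jj @ v_j - v_i @ K_ij @ v_j

# Two series observed at different, irregular times
rng = np.random.default_rng(1)
t1, y1 = np.sort(rng.random(8)), rng.standard_normal(8)
t2, y2 = np.sort(rng.random(5)), rng.standard_normal(5)
print(rkhs_distance(t1, y1, t2, y2))
```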
4. Non-parametric Mixed-effect Model

In Section 3, we assumed a Gaussian process with known mean and covariance function. In practice this is often not the case; instead, we may want to learn the characteristics of the Gaussian process from examples. One situation of interest to us is when a population of similar time series is available. This prior-learning scheme is known in statistics as empirical Bayes or hierarchical Bayes (Gelman, 2004). In particular, the model is called a mixed-effect model when the hyper-prior is Gaussian, in which case the maximum likelihood (ML) solution can be found with the Expectation-Maximization (EM) algorithm.

Traditional mixed-effect models are parametric: they assume a β-parameterized regression model for each individual. Since the model parameters vary across individuals, it is natural to consider them generated as the sum of a fixed and a random piece, β = β̄ + β_i, where β̄ is called the fixed effect and β_i, called the random effect, is assumed distributed N(0, D) with unknown covariance D. Fitting a mixed-effect model amounts to finding β̄, D, and the variance of the observation noise.

In non-parametric mixed-effect models, the individual regression models do not take a parametric form. Instead, we assume the observations are generated by k smooth curves {f_1, f_2, ..., f_k} fluctuating around a mean (fixed-effect) function f_0. We use f̃_i = f_i − f_0 to denote the deviation of f_i from f_0 (the random effect). The priors on f_0 and f̃_i can be summarized as

$$p_0[f_0] \propto \exp\!\big(-\tfrac{1}{2}\|f_0\|_{H_0}^2\big), \qquad (15)$$
$$p_f[\tilde{f}_i] \propto \exp\!\big(-\tfrac{1}{2}\|\tilde{f}_i\|_H^2\big), \qquad i = 1, 2, \cdots, k, \qquad (16)$$

where H and H_0 are generally different Hilbert spaces, with corresponding reproducing kernels K and K_0. We also assume the observation noise is white Gaussian with variance σ², from which it follows that

$$p(y_i\,|\,\tilde{f}_i, f_0; t_i) \propto \prod_{n=1}^{N_i} \exp\!\left(-\frac{(y_{in} - \tilde{f}_i(t_{in}) - f_0(t_{in}))^2}{2\sigma^2}\right).$$

We assume H_0 (and thus the form of p_0[·]) is predetermined, while the fixed effect f_0 is to be determined. Also unknown are the noise variance σ² and the Hilbert space H for the random effects (or, equivalently, K). Our learning task is therefore to jointly optimize over {f_0, K, σ} by maximizing the following probability of Y = {y_1, y_2, ..., y_k}:

$$p(Y\,|\,f_0; K, \sigma)\, p_0[f_0] = p_0[f_0] \prod_{i=1}^{k} \int \mathcal{D}\tilde{f}_i\, \big\{ p(y_i\,|\,\tilde{f}_i, f_0; \sigma)\, p_f[\tilde{f}_i] \big\}, \qquad (17)$$

where the integral over D f̃_i is a functional integral (Simon, 1979). Using the Gaussian property, Eq.(17) can be further reduced to a standard integral:

$$p(Y\,|\,f_0; K, \sigma)\, p_0[f_0] = p_0[f_0] \prod_{i=1}^{k} \int d f_i\, \big\{ p(y_i\,|\,f_i, f_0; \sigma)\, p(f_i; K) \big\}, \qquad (18)$$

where f_i = [f̃_i(t_{i1}), f̃_i(t_{i2}), ..., f̃_i(t_{iN_i})]^T collects the values of f̃_i at times t_i, and p(f_i; K) is a standard multivariate Gaussian:

$$p(f_i; K) = \frac{1}{(2\pi)^{N_i/2}\, |K(t_i, t_i)|^{1/2}} \exp\!\big(-\tfrac{1}{2} f_i^T K(t_i, t_i)^{-1} f_i\big). \qquad (19)$$

In general, there is no unique solution for K that maximizes p(Y|f_0; K, σ) p_0[f_0].
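Since both factors inside the integral of Eq.(18) are Gaussian in f_i, each factor of the product has the standard closed form y_i ~ N(f_{0,i}, K(t_i, t_i) + σ²I); together with the prior on f_0, this is the objective that the EM algorithm of Section 4.1 maximizes. The minimal Python sketch below is our own illustration of the per-series log-marginal term, with variable names of our own choosing.

```python
import numpy as np

def log_marginal(y_i, f0_i, K_ii, sigma):
    """Log of the Gaussian integral in Eq. (18): y_i ~ N(f0_i, K_ii + sigma^2 I)."""
    C = K_ii + sigma ** 2 * np.eye(len(y_i))
    L = np.linalg.cholesky(C)                  # C = L L^T
    r = np.linalg.solve(L, y_i - f0_i)         # whitened residual
    return (-0.5 * r @ r
            - np.log(np.diag(L)).sum()         # 0.5 * log |C|
            - 0.5 * len(y_i) * np.log(2.0 * np.pi))
```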
Indeed, it is easy to verify that if K'(t_{in}, t_{im}) = K(t_{in}, t_{im}) for every individual i and every time-index pair (n, m), then p(Y|f_0; K, σ) p_0[f_0] = p(Y|f_0; K', σ) p_0[f_0], no matter how K and K' differ elsewhere. This can be circumvented in two ways. First, we can restrict K to a particular parametric family, such as the widely used Gaussian kernel. Second, we can instead optimize only over the entries K(t_{in}, t_{im}) for all individuals i and time indices (n, m). Both strategies are addressed in this paper.

4.1. Optimization with the EM Algorithm

The task is to find the set M = {f_0, K, σ} that maximizes the probability p(Y|f_0; K, σ) p_0[f_0]. As shown in Eq.(18), we can rewrite the data likelihood p(y_i|f_0; K, σ) with {f̃_1, f̃_2, ..., f̃_k} as latent variables:

$$p(y_i\,|\,f_0; K, \sigma) = \int d f_i\, p(y_i\,|\,f_i, f_0, \sigma)\, p(f_i; K), \qquad (20)$$

which enables us to employ the EM algorithm to find M. In the following, we give the results of the expectation step (E-step) and the maximization step (M-step).

E-step: In each EM iteration,

$$Q(M, M^g) = E_{\{f_i\}|Y; M^g}\big[\log\{p(Y, \{f_i\}; M)\, p_0[f_0]\}\big] = \sum_{i=1}^{k} \int d f_i\, \log p(y_i, f_i; M)\, p(f_i\,|\,y_i; M^g) + \log p_0[f_0],$$

where M^g stands for the parameters from the last iteration. After some algebra, Q(M, M^g) can be rearranged into the form

$$Q(M, M^g) = -\tfrac{1}{2}\|f_0\|_{H_0}^2 - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{N_i} E_{f_i|Y; M^g}\big[(y_{ij} - \tilde{f}_i(t_{ij}) - f_0(t_{ij}))^2\big] + \sum_{i=1}^{k} \int d f_i\, \log p(f_i; M)\, p(f_i\,|\,y_i; M^g), \qquad (21)$$

where N is the total number of observations.

M-step: In the M-step, we find

$$M^* = \arg\max_{M} Q(M, M^g) \qquad (22)$$

and use M* to update the model parameters. The optimization in Eq.(22) can be divided into two separate parts: the first three terms on the right-hand side of Eq.(21) depend only on (f_0, σ), while the last (fourth) term depends only on K. To find f_0 and σ, we solve the optimization problem

$$(\sigma^*, f_0^*) = \arg\min_{\sigma, f_0} \left\{ \tfrac{1}{2}\|f_0\|_{H_0}^2 + N\log\sigma + \frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{N_i} E_{f_i|Y; M^g}\big[(y_{ij} - \tilde{f}_i(t_{ij}) - f_0(t_{ij}))^2\big] \right\}. \qquad (23)$$

In particular, with any fixed σ, maximizing Q(M, M^g) over f_0 becomes a regularized regression problem:

$$f_0^* = \arg\min_{f_0} \tfrac{1}{2}\|f_0\|_{H_0}^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{N_i} \big(y_{ij} - E_{f_i|y_i; M^g}[\tilde{f}_i(t_{ij})] - f_0(t_{ij})\big)^2.$$

The optimization over K is

$$K^* = \arg\max_{K \in \mathcal{K}} \sum_{i=1}^{k} \int d f_i\, \log p(f_i; K)\, p(f_i\,|\,y_i; K^g) \qquad (24)$$
$$\phantom{K^*} = \arg\max_{K \in \mathcal{K}} \; -\sum_{i=1}^{k} \Big\{ \tfrac{1}{2}\log|K(t_i, t_i)| + \tfrac{1}{2}\mathrm{tr}\big(K(t_i, t_i)^{-1}(C_i^g + \mu_i^g (\mu_i^g)^T)\big) \Big\}, \qquad (25)$$

where 𝒦 is the set of feasible K, µ_i is the posterior mean E[f_i | y_i; M], which can be calculated as

$$\mu_i = K(t_i, t_i)\big(K(t_i, t_i) + \sigma^2 I\big)^{-1}(y_i - f_{0,i}),$$

and C_i is the posterior covariance of f_i,

$$C_i = K(t_i, t_i) - K(t_i, t_i)\big(K(t_i, t_i) + \sigma^2 I\big)^{-1} K(t_i, t_i).$$
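The E-step quantities µ_i and C_i above are ordinary Gaussian-process posterior moments at the observation times and require only a few linear solves per series. A minimal Python sketch of our own (the current-iteration kernel matrix, noise level, and fixed-effect values are passed in; names are ours):

```python
import numpy as np

def posterior_moments(y_i, f0_i, K_ii, sigma):
    """E-step: posterior mean mu_i and covariance C_i of the random effect
    at the observation times, given the current K(t_i, t_i), sigma and f_0."""
    S = K_ii + sigma ** 2 * np.eye(len(y_i))
    G = np.linalg.solve(S, K_ii)       # (K_ii + sigma^2 I)^{-1} K_ii
    mu_i = G.T @ (y_i - f0_i)          # K_ii (K_ii + sigma^2 I)^{-1} (y_i - f0_i)
    C_i = K_ii - K_ii @ G              # K_ii - K_ii (K_ii + sigma^2 I)^{-1} K_ii
    return mu_i, C_i
```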
4.2. Parametric Covariance Estimation

We assume the covariance function K is of the parametric form K(x, y; θ) — for example, the Gaussian kernel with scale a and kernel width s,

$$K(x, y; \{a, s\}) = a \exp\!\left(-\frac{\|x - y\|^2}{2s^2}\right),$$

or, as suggested in (Lanckriet et al., 2004), a convex combination of a set of kernels {K_1, K_2, ..., K_M},

$$K(x, y; \theta) = \theta_1 K_1(x, y) + \theta_2 K_2(x, y) + \cdots + \theta_M K_M(x, y).$$

In this case, the optimization of K in the M-step reduces to the parameter estimation

$$\theta^* = \arg\max_{\theta}\; -\sum_{i=1}^{k} \Big\{ \tfrac{1}{2}\log|K(t_i, t_i; \theta)| + \tfrac{1}{2}\mathrm{tr}\big(K(t_i, t_i; \theta)^{-1}(C_i^g + \mu_i^g (\mu_i^g)^T)\big) \Big\}, \qquad (26)$$

where p(f_i; θ) = p(f_i; K(t_i, t_i; θ)). This parametric form of K is appealing in either of the following two situations:

· when the observations are sparse, since a parametric K is generally less prone to overfitting than the non-parametric estimate discussed in Section 4.3;

· when the time series are not synchronized (as in Section 5.1), since a parametric K allows out-of-sample extension.

4.3. Non-parametric Covariance Estimation

When all the time series are synchronized, we have t_i = t, i = 1, 2, ..., k. We can then replace K(t_i, t_i) in Eq.(25) with K ≐ K(t, t) and rewrite the optimization in matrix form:

$$K^* = \arg\max_{K \in \mathcal{P}}\; -\sum_{i=1}^{k} \Big\{ \tfrac{1}{2}\log|K| + \tfrac{1}{2}\mathrm{tr}\big(K^{-1}(C_i^g + \mu_i^g (\mu_i^g)^T)\big) \Big\}. \qquad (27)$$

If we let 𝒫 be the set of positive definite matrices, the solution of Eq.(27) is simply

$$K^* = \frac{1}{k}\sum_{i=1}^{k} \big(C_i^g + \mu_i^g (\mu_i^g)^T\big). \qquad (28)$$

The non-parametric fit of the kernel matrix K is appealing since it does not assume a particular form for the covariance matrix and thus can fully exploit the information in the samples. However, it can only be used when the time series are synchronized. One example of this modeling choice is given in Section 5.2.
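In the synchronized case, one pass of the non-parametric covariance update is just the posterior moments above followed by the average in Eq.(28). A minimal Python sketch of our own of that single M-step (the companion updates of f_0 and σ from Eq.(23) are omitted; all names are ours):

```python
import numpy as np

def em_update_K(Y, f0, K, sigma):
    """One non-parametric M-step for the covariance matrix, Eq. (28).

    Y     : (k, N) array of synchronized series (one row per individual)
    f0    : (N,)   current fixed effect evaluated at the common times
    K     : (N, N) current covariance matrix K(t, t)
    sigma : current noise standard deviation
    """
    k, N = Y.shape
    S = K + sigma ** 2 * np.eye(N)
    G = np.linalg.solve(S, K)              # (K + sigma^2 I)^{-1} K
    C = K - K @ G                          # posterior covariance, identical for all i
    K_new = np.zeros((N, N))
    for y_i in Y:
        mu_i = G.T @ (y_i - f0)            # posterior mean of the random effect
        K_new += C + np.outer(mu_i, mu_i)
    return K_new / k
```

Iterating this update together with the (f_0, σ) step of Eq.(23) gives the full EM loop for the synchronized case.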
5. Experiments

We tested the proposed distance measure on two real-world applications. The first is cognitive decline detection based on longitudinal clinical observations of motor ability. The second is a target identification system based on electroencephalograph (EEG) signals. In each experiment, we employ a support vector machine (SVM) (Burges, 1998) with the Gaussian kernel

$$G_{ij} = \exp\!\left(-\frac{d_{ij}}{2r^2}\right), \qquad (29)$$

where d_ij is the squared distance between time series i and j, and the kernel width r is obtained using cross-validation. It is easy to see that G is a Mercer kernel.

5.1. Cognitive Decline Detection Based on Longitudinal Data

Research by our group and others shows that motor changes, such as changes in walking and finger-tapping rates, can effectively predict cognitive decline several years before impairment is manifest (Camicioli et al., 1998). It would therefore be useful to build a system that detects cognitive decline (at least partially) from motor behavior, since motor behavior can be obtained by unintrusive in-home assessment (Hayes et al., 2004). Our research uses clinical motor behavior data from the Oregon Brain Aging Study (OBAS) (Green et al., 2000). All 143 subjects in the cohort were healthy at entry; when the data were drawn, 46 of them had developed mild cognitive impairment, while 97 remained cognitively healthy. We divide the subjects into the impaired group and the normal group according to their state when the data were drawn from the database.³ We intend to predict whether a subject will develop cognitive impairment based on his or her motor behavior before a clinical diagnosis (if any). In this experiment, the task reduces to predicting the group membership of each subject. This classification is difficult because the motor observations are sparse and noisy, as shown in Fig.2 (left panel). We examined the four motor behaviors summarized in Table 1. Usually, as subjects age or become impaired, seconds and steps increase, while tappingD and tappingN decrease.

³ This grouping is potentially inaccurate because cognitively healthy subjects may later develop dementia, a situation known as right censoring in survival analysis.

Table 1. Description of data.
  seconds  : number of seconds the subject takes to walk 9 m
  steps    : number of steps the subject takes to walk 9 m
  tappingD : number of taps the subject makes in 10 seconds with the dominant hand
  tappingN : number of taps the subject makes in 10 seconds with the non-dominant hand

[Figure 2. Left panel: sample spaghetti plots of seconds from the two groups. Right panel: the population of seconds data and the fit fixed-effect model (red line).]

We fit the non-parametric mixed-effect model to each motor behavior with the parameterized kernels

$$K_0(t_1, t_2) = \exp\!\left(-\frac{\|t_1 - t_2\|^2}{2 s_0^2}\right), \qquad K(t_1, t_2; \{a, s\}) = a \exp\!\left(-\frac{\|t_1 - t_2\|^2}{2 s^2}\right),$$

where s_0 is predetermined and {a, s} are to be learned. The right panel of Fig.2 shows the seconds time series from the 143 subjects (black) and the fit fixed effect (red line). Once the model is fit, the distance between any two subjects i and j is calculated as in Eq.(11).

For comparison, we also examined a parametric feature based on the least-squares (LSQ) fit coefficients of a linear regression: x_i = arg min_x Σ_{j=1}^{N_i} (x_0 + x_1 t_{ij} − y_{ij})² with x = [x_0, x_1]^T. This feature extraction is justified by the observation that the intercept and the slope of the motor behavior trajectory are predictors of future cognitive decline and dementia (Marquis et al., 2002). Based on the LSQ feature we get another distance measure, d_ij = ||x_i − x_j||². We employ an SVM classifier with kernels calculated as in Eq.(29).

[Figure 3. ROC curves of the SVM with two distance measures. SVM(P): SVM with the proposed distance. SVM(L): SVM with least-squares features.]

Fig.3 compares the ROC curves obtained with the proposed distance measure and with the Euclidean distance between LSQ features. The SVM with the proposed distance measure clearly outperforms the SVM with LSQ features in terms of the area under the curve (AUC). There are two reasons for the superiority of the proposed distance over the LSQ feature:

· simple heuristic features such as the intercept and the slope cannot capture enough information for the classification;

· the feature extraction is not robust to sparse and noisy observations.
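Both classifiers compared in Fig.3 consume a Gram matrix built from a pairwise squared distance via Eq.(29); the baseline simply swaps in the squared Euclidean distance between the least-squares [intercept, slope] coefficients. A minimal Python sketch of our own (the kernel width r would be chosen by cross-validation, as described above; names are ours):

```python
import numpy as np

def gram_from_distances(D, r):
    """Gaussian Gram matrix of Eq. (29) from a matrix of squared distances d_ij."""
    return np.exp(-D / (2.0 * r ** 2))

def lsq_feature(t_i, y_i):
    """Least-squares [intercept, slope] coefficients used as the baseline feature."""
    X = np.column_stack([np.ones_like(t_i), t_i])
    coef, *_ = np.linalg.lstsq(X, y_i, rcond=None)
    return coef                                     # x = [x0, x1]

def lsq_distance(t_i, y_i, t_j, y_j):
    """Baseline squared distance d_ij = ||x_i - x_j||^2 between LSQ features."""
    return float(np.sum((lsq_feature(t_i, y_i) - lsq_feature(t_j, y_j)) ** 2))
```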
This simplifying assumption instantly leads to the following results. · The optimal solution of f0 is simply the data mean k 1 f0 = k i=1 yi , as shown in Fig.4 (right panel). will then be used directly in linear classifiers. One obvious choice is the non-degenerated linear transformation xi = K1/2 (K + 2 I)-1 yi (30) where K 2 could be any matrix A RN ×N with AAT = K. We tested both the proposed distance and the (squared) Euclidean distance 4 ||yi - yj ||2 as the distance term dij in the Gaussian kernel G and compared the performance of the SVM with the two distance measures. In addition, we also tried a linear logistic classifier (LLC) with both the raw feature yi and ISO feature xi as the input. In our experiment, the SVM parameters and kernel width were selected using 10-fold cross validation. Due to the extremely low probability of targets and the high cost of misdetection, we aim for zero-miss and minimum false alarm rate (MFAR), which is defined as the percentage of false alarms among all classifications while all targets are correctly detected. We test both SVM and LLC on the 21 (=3 × 7) test sessions. Table 1 summarizes the detection results when different distance or features are used. The criteria of comparison include the average MFAR across the 21 sessions, the number of sessions with low MFAR ( 10%)and very low MFAR ( 2%). Clearly, the LLC with ISO features outperforms the LLC with raw feature by giving low average MFAR, more low MFAR sessions, and more very low MFAR sessions. The story is similar when using SVM as the classifier: the proposed distance outperforms the the Euclidean distance on all three criteria. Clearly the temporal structure is important in describing the EEG signal, and thus plays a crucial role in deciding the distances between EEG time series. The proposed distance measure successfully incorporates the temporal structure information learnt with a rather simple algorithm, and yields significantly better classification than the Euclidean distance that simply adds the index-by-index differences. 1 Based on the above two results, we can pick a and then calculate the optimal covariance K with Eq.(28) in one iteration. two examples 1 Target Distractor Magnitude(µv) 100 200 300 Time(ms) 400 500 · The data likelihood is independent of 2 as long ^ as itkis less than the smallest eigenvalue of K = 1 T i=1 (yi - f0 )(yi - f0 ) . k p opulation and fixed effect 1 0.5 Magnitude(µv) 0.5 0 0 -0.5 -0.5 -1 0 -1 0 100 200 300 Time(ms) 400 500 6. Related Work The connection between Bregman divergence and exponential family is first proposed by (Forster & Warmuth, 2000), and later used by several authors in deriving a proper distance measure for either clustering (Banerjee et al., 2005) or dimension reduction (Collins et al., 2001). Our work also depends heavily on the functional Bregman divergence, an idea first fully explored in (Frigyik et al., 2006). The non-parametric It would be fair to learn the covariance for a Mahalanobis distance. However, we have not performed this comparison for the sake of simplicity. 4 Figure 4. EEG data and the fit mixed-effect model. Left panel: Example of target-associated and distractorassociated EEG signals. Right panel: The population of EEG signals (black --) and the fit f0 (red curve). Once the optimal f0 and K are obtained, the distance between any time series i and j can be calculated using Eq.(14). 
6. Related Work

The connection between the Bregman divergence and the exponential family was first proposed by Forster and Warmuth (2000), and was later used by several authors to derive proper distance measures for clustering (Banerjee et al., 2005) and for dimensionality reduction (Collins et al., 2001). Our work also depends heavily on the functional Bregman divergence, an idea first fully explored in (Frigyik et al., 2006). The non-parametric mixed-effect model is a natural generalization of the hierarchical Bayesian Gaussian process of Schwaighofer et al. (2005) to functional form, in which synchronized and non-synchronized time series can be treated in a unified framework. This work can also be viewed as a particular example of functional data analysis (Ramsay & Silverman, 1997).
In particular, in an early effort towards functional PCA (Ramsay & Dalzell, 1991), the authors suggested mapping the discrete observations (t_i, y_i) to a smooth function through the regularized regression

$$\hat{f}_i(t) = \arg\min_{f} \sum_{n=1}^{N_i} (y_{in} - f(t_{in}))^2 + \lambda\,\|Df\|^2, \qquad (31)$$

where D is a linear operator. The solution to Eq.(31) coincides with the expectation in Eq.(9) if we let λ = σ² and take K to be the Green's function of the operator D*D. The differences, however, are that (1) our model also assumes a nonzero mean (fixed effect) f_0, and (2) the kernel K is learned from a population of time series.

Acknowledgments

ZL wants to thank Inderjit Dhillon, Arindam Banerjee, Jeff Kaye, and Misha Pavel for helpful discussions, and Robin Guariglia and Milar Moore for help with the OBAS data. TL, ZL, and DE were partially funded by the OHSU Behavioral and Intervention Commons project funded by Intel Corporation. YH is supported by NSF: ECS-0524835, ECS-0622239, and IIS-0713690.

References

Altun, Y., Hofmann, T., & Smola, A. (2004). Exponential families for conditional random fields. Uncertainty in Artificial Intelligence (UAI).
Banerjee, A., Merugu, S., Dhillon, I. S., & Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705-1749.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121-167.
Camicioli, R., Howieson, D., Oken, B., Sexton, G., & Kaye, J. (1998). Motor slowing precedes cognitive impairment in the oldest old. Neurology, 50.
Collins, M., Dasgupta, S., & Schapire, R. (2001). A generalization of principal component analysis to the exponential family. NIPS 13.
Forster, J., & Warmuth, M. K. (2000). Relative expected instantaneous loss bounds. COLT 13.
Frigyik, B., Srivastava, S., & Gupta, M. (2006). Functional Bregman divergence and Bayesian estimation of distributions. arXiv:cs/0611123.
Gelman, A. (2004). Bayesian data analysis (2nd ed.). Chapman & Hall.
Green, M., Kaye, J., & Ball, M. (2000). The Oregon Brain Aging Study: Neuropathology accompanying healthy aging in the oldest old. International Journal of Neural System, 54, 105-113.
Hayes, T., Pavel, M., & Kaye, J. (2004). An unobtrusive in-home monitoring system for detection of key motor changes preceding cognitive decline. 26th Annual International Conference of the IEEE EMBS.
Keogh, E., & Pazzani, M. (1998). An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. KDD 98 (pp. 239-241). ACM Press.
Lanckriet, G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72.
Marquis, S., Moore, M., Howieson, D. B., Sexton, G., Payami, H., Kaye, J. A., & Camicioli, R. (2002). Independent predictors of cognitive decline in healthy elderly persons. Arch. Neurol., 59, 601-606.
Parra, L., Alvino, C., Tang, A., Pearlmutter, B., Yeung, N., Osman, A., & Sajda, P. (2003). Single-trial detection in EEG and MEG: Keeping it linear. Neurocomputing, 52-54, 177-183.
Ramsay, J., & Silverman, B. (1997). Functional data analysis. Springer-Verlag.
Ramsay, J. O., & Dalzell, C. J. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society, Series B (Methodological), 53, 539-572.
Rasmussen, C., & Williams, C. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Schwaighofer, A., Tresp, V., & Yu, K. (2005). Learning Gaussian process kernels via hierarchical Bayes. NIPS 17.
Seeger, M. (2004). Gaussian processes for machine learning. International Journal of Neural Systems, 14, 69-106.
Simon, B. (1979). Functional integration and quantum physics. Academic Press.
Tipping, M. (1999). Deriving cluster analytic distance functions from Gaussian mixture models.