Stability Bounds for Non-i.i.d. Processes

Mehryar Mohri
Department of Computer Science, Courant Institute of Mathematical Sciences, New York, NY 10012
mohri@cs.nyu.edu

Afshin Rostamizadeh
Department of Computer Science, Courant Institute of Mathematical Sciences, New York, NY 10012
rostami@cs.nyu.edu

Abstract

The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, as is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weakens over time. It proves novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application to several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

1 Introduction

The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds [2, 3]. A learning algorithm is stable when the hypotheses it outputs differ only in a limited way under small changes to the training set. A key advantage of stability bounds is that they are tailored to specific learning algorithms, exploiting their particular properties. They do not depend on complexity measures such as the VC-dimension, covering numbers, or Rademacher complexity, which characterize a class of hypotheses independently of any algorithm.

But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). Note that the i.i.d. assumption is typically not tested or derived from a data analysis. In many machine learning applications this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, as is clear in system diagnosis or time series prediction problems. An example of time series data is stock pricing, where clearly the prices of different stocks on the same day, or of the same stock on different days, may be dependent.

This paper studies the scenario where the observations are drawn from a stationary mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations that weakens over time [8, 10, 16, 17]. Our proofs are based on the independent block technique commonly used in such contexts [17] and on a generalized version of McDiarmid's inequality [7]. We prove novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the usefulness of stability bounds to non-i.i.d. scenarios.
We also illustrate their application to general classes of learning algorithms, including Support Vector Regression (SVR) [15] and Kernel Ridge Regression [13].

Algorithms such as support vector regression (SVR) [14, 15] have been used in the context of time series prediction, in which the i.i.d. assumption does not hold, some with good experimental results [9, 12]. To our knowledge, the use of these algorithms in non-i.i.d. scenarios has not previously been supported by any theoretical analysis. The stability bounds we give for SVR and many other kernel regularization-based algorithms can thus be viewed as the first theoretical basis for their use in such scenarios.

In Section 2, we introduce the definitions for the non-i.i.d. problems we are considering and discuss the learning scenarios. Section 3 gives our main generalization bounds based on stability, including the full proofs and analysis. In Section 4, we apply these bounds to general kernel regularization-based algorithms, including Support Vector Regression and Kernel Ridge Regression.

2 Preliminaries

We first introduce some standard definitions for dependent observations in mixing theory [5] and then briefly discuss the learning scenarios in the non-i.i.d. case.

2.1 Non-i.i.d. Definitions

Definition 1. A sequence of random variables Z = {Z_t}_{t=-∞}^{+∞} is said to be stationary if for any t and any non-negative integers m and k, the random vectors (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution.

Thus, the index t, or time, does not affect the distribution of a variable Z_t in a stationary sequence. This does not imply independence, however. In particular, for i < j < k, Pr[Z_j | Z_i] may not equal Pr[Z_k | Z_i]. The following is a standard definition giving a measure of the dependence of the random variables Z_t within a stationary sequence. There are several equivalent definitions of this quantity; we adopt here that of [17].

Definition 2. Let Z = {Z_t}_{t=-∞}^{+∞} be a stationary sequence of random variables. For any i, j ∈ ℤ ∪ {-∞, +∞}, let σ_i^j denote the σ-algebra generated by the random variables Z_k, i ≤ k ≤ j. Then, for any positive integer k, the β-mixing and φ-mixing coefficients of the stochastic process Z are defined as

β(k) = sup_n E_{B ∈ σ_{-∞}^n} [ sup_{A ∈ σ_{n+k}^{+∞}} | Pr[A | B] - Pr[A] | ],   φ(k) = sup_{n, A ∈ σ_{n+k}^{+∞}, B ∈ σ_{-∞}^n} | Pr[A | B] - Pr[A] |.   (1)

Z is said to be β-mixing (φ-mixing) if β(k) → 0 (resp. φ(k) → 0) as k → ∞. It is said to be algebraically β-mixing (algebraically φ-mixing) if there exist real numbers β_0 > 0 (resp. φ_0 > 0) and r > 0 such that β(k) ≤ β_0/k^r (resp. φ(k) ≤ φ_0/k^r) for all k, and exponentially mixing if there exist real numbers β_0 (resp. φ_0 > 0) and β_1 (resp. φ_1 > 0) such that β(k) ≤ β_0 exp(-β_1 k^r) (resp. φ(k) ≤ φ_0 exp(-φ_1 k^r)) for all k.

Both β(k) and φ(k) measure the dependence of events on those that occurred more than k units of time in the past. β-mixing is a weaker assumption than φ-mixing. We will be using a concentration inequality that leads to simple bounds but that applies to φ-mixing processes only. However, the main proofs presented in this paper are given in the more general case of β-mixing sequences. This is a standard assumption adopted in previous studies of learning in the presence of dependent observations [8, 10, 16, 17]. As pointed out in [16], β-mixing seems to be "just the right" assumption for carrying over several PAC-learning results to the case of weakly dependent sample points. Several results have also been obtained in the more general context of α-mixing, but they seem to require the stronger condition of exponential mixing [11].
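As a small numerical illustration of Definition 2 (not part of the original paper), the Python sketch below compares an algebraic envelope β0·k^(-r) with an exponential envelope β0·exp(-β1·k^r) and reports the smallest gap k at which each falls below a tolerance; all parameter values are arbitrary choices made only for illustration.

```python
import math

def algebraic_envelope(k, beta0=1.0, r=2.0):
    """Algebraic mixing envelope beta_0 / k^r from Definition 2 (illustrative values)."""
    return beta0 / k ** r

def exponential_envelope(k, beta0=1.0, beta1=0.5, r=1.0):
    """Exponential mixing envelope beta_0 * exp(-beta_1 * k^r) (illustrative values)."""
    return beta0 * math.exp(-beta1 * k ** r)

def smallest_gap(envelope, tol):
    """Smallest k such that the mixing envelope drops below tol."""
    k = 1
    while envelope(k) >= tol:
        k += 1
    return k

if __name__ == "__main__":
    tol = 1e-3
    print("algebraic  :", smallest_gap(algebraic_envelope, tol))
    print("exponential:", smallest_gap(exponential_envelope, tol))
```

The exponentially mixing envelope reaches the tolerance for a much smaller gap, which is why exponential mixing is the stronger of the two assumptions mentioned above.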
Mixing assumptions can be checked in some cases, such as with Gaussian or Markov processes [10]. The mixing parameters can also be estimated in such cases.

Most previous studies use a technique originally introduced by [1] based on independent blocks of equal size [8, 10, 17]. This technique is particularly relevant when dealing with stationary β-mixing sequences. We will need a related but somewhat different technique since the blocks we consider may not have the same size. The following lemma is a special case of Corollary 2.7 from [17].

Lemma 1 (Yu [17], Corollary 2.7). Let μ ≥ 1 and suppose that h is a measurable function, with absolute value bounded by M, on a product probability space (∏_{j=1}^{μ} Ω_j, ∏_{i=1}^{μ} σ_{r_i}^{s_i}), where r_i ≤ s_i ≤ r_{i+1} for all i. Let Q be a probability measure on the product space with marginal measures Q_i on (Ω_i, σ_{r_i}^{s_i}), and let Q^{i+1} be the marginal measure of Q on (∏_{j=1}^{i+1} Ω_j, ∏_{j=1}^{i+1} σ_{r_j}^{s_j}), i = 1, ..., μ-1. Let β(Q) = sup_{1 ≤ i ≤ μ-1} β(k_i), where k_i = r_{i+1} - s_i, and let P = ∏_{i=1}^{μ} Q_i. Then,

| E_Q[h] - E_P[h] | ≤ (μ - 1) M β(Q).   (2)

The lemma gives a measure of the difference between the distribution of μ blocks in the case where the blocks are independent and the case where they are dependent. The distribution within each block is assumed to be the same in both cases. For a monotonically decreasing function β, we have β(Q) = β(k), where k = min_i(k_i) is the smallest gap between blocks.

2.2 Learning Scenarios

We consider the familiar supervised learning setting where the learning algorithm receives a sample of m labeled points S = (z_1, ..., z_m) = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y the set of labels (Y = ℝ in the regression case), both assumed to be measurable. For a fixed learning algorithm, we denote by h_S the hypothesis it returns when trained on the sample S.

The error of a hypothesis on a pair z ∈ X × Y is measured in terms of a cost function c : Y × Y → ℝ_+. Thus, c(h(x), y) measures the error of a hypothesis h on a pair (x, y), e.g., c(h(x), y) = (h(x) - y)^2 in the standard regression case. We will use the shorthand c(h, z) := c(h(x), y) for a hypothesis h and z = (x, y) ∈ X × Y, and will assume that c is upper bounded by a constant M > 0. We denote by R̂(h) the empirical error of a hypothesis h for a training sample S = (z_1, ..., z_m):

R̂(h) = (1/m) ∑_{i=1}^{m} c(h, z_i).   (3)

In the standard machine learning scenario, the sample pairs z_1, ..., z_m are assumed to be i.i.d., a restrictive assumption that does not always hold in practice. We will consider here the more general case of dependent samples drawn from a stationary mixing sequence Z over X × Y. As in the i.i.d. case, the objective of the learning algorithm is to select a hypothesis with small error over future samples. But, here, we must distinguish two versions of this problem. In the most general version, future samples depend on the training sample S, and thus the generalization error or true error of the hypothesis h_S trained on S must be measured by its expected error conditioned on the sample S:

R(h_S) = E_z[c(h_S, z) | S].   (4)

This is the most realistic setting in this context, and it matches time series prediction problems. A somewhat less realistic version is one where the samples are dependent, but the test points are assumed to be independent of the training sample S. The generalization error of the hypothesis h_S trained on S is then:

R(h_S) = E_z[c(h_S, z) | S] = E_z[c(h_S, z)].   (5)

This setting seems less natural since, if samples are dependent, then future test points must also depend on the training points, even if that dependence is relatively weak due to the time interval after which test points are drawn. Nevertheless, it is this somewhat less realistic setting that has been studied by all previous machine learning studies that we are aware of [8, 10, 16, 17], even when examining specifically a time series prediction problem [10]. Thus, the bounds derived in these studies cannot be applied to the more general setting. We will instead consider the most general setting, with the definition of the generalization error based on Eq. 4. Clearly, our analysis also applies to the less general setting just discussed.
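The two notions of generalization error in Eqs. 4 and 5 can be made concrete with a small simulation. The sketch below is purely illustrative and not part of the paper's analysis: a Gaussian AR(1) series stands in for a stationary mixing process, a trivial linear next-value predictor is fit by least squares, and it is evaluated both on future points of the same series (the dependent setting of Eq. 4) and on an independently drawn series (the setting of Eq. 5).

```python
import numpy as np

def ar1_series(n, rho=0.8, seed=None):
    """Stationary Gaussian AR(1) series; a stand-in for a beta-mixing process."""
    rng = np.random.default_rng(seed)
    s = np.empty(n)
    s[0] = rng.normal(scale=1.0 / np.sqrt(1 - rho ** 2))  # draw from the stationary distribution
    for t in range(1, n):
        s[t] = rho * s[t - 1] + rng.normal()
    return s

def make_pairs(s):
    """Supervised pairs z_t = (x_t, y_t) with y_t the next value of the series."""
    return s[:-1], s[1:]

def empirical_error(a, x, y):
    """R_hat(h) = (1/m) sum_i c(h, z_i) with squared loss and h(x) = a * x."""
    return np.mean((a * x - y) ** 2)

m, horizon = 1000, 200
series = ar1_series(m + horizon + 1, seed=0)
x_tr, y_tr = make_pairs(series[: m + 1])                     # training sample S
a_hat = np.dot(x_tr, y_tr) / np.dot(x_tr, x_tr)              # least-squares fit

x_dep, y_dep = make_pairs(series[m:])                        # future points of the same series (Eq. 4)
x_ind, y_ind = make_pairs(ar1_series(horizon + 1, seed=1))   # independent series (Eq. 5)

print("empirical error        :", empirical_error(a_hat, x_tr, y_tr))
print("dependent-test error   :", empirical_error(a_hat, x_dep, y_dep))
print("independent-test error :", empirical_error(a_hat, x_ind, y_ind))
```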
3 Non-i.i.d. Stability Bounds

This section gives generalization bounds for β̂-stable algorithms over a mixing stationary distribution. (The standard variable used for the stability coefficient is β; to avoid confusion with the β-mixing coefficient, we will use β̂ instead.) The first two subsections present our main proofs, which hold for β-mixing stationary distributions. In the third subsection, we will be using a concentration inequality that applies to φ-mixing processes only.

The condition of β̂-stability is an algorithm-dependent property first introduced in [4] and [6]. It was later used successfully by [2, 3] to show algorithm-specific stability bounds for i.i.d. samples. Roughly speaking, a learning algorithm is said to be stable if small changes to the training set do not produce large deviations in its output. The following gives the precise technical definition.

Definition 3. A learning algorithm is said to be (uniformly) β̂-stable if the hypotheses it returns for any two training samples S and S' that differ by a single point satisfy

∀z ∈ X × Y,  | c(h_S, z) - c(h_{S'}, z) | ≤ β̂.   (6)

[Figure 1 (a)-(d): Illustration of the sequences derived from S that are considered in the proofs.]

Many generalization error bounds rely on McDiarmid's inequality. But this inequality requires the random variables to be i.i.d. and thus is not directly applicable in our scenario. Instead, we will use a theorem that extends McDiarmid's inequality to general mixing distributions (Theorem 1, Section 3.3). To obtain a stability-based generalization bound, we will apply this theorem to Φ(S) = R(h_S) - R̂(h_S). To do so, we need to show, as with the standard McDiarmid's inequality, that Φ is a Lipschitz function and, to make the bound useful, bound E[Φ]. The next two subsections describe how we achieve both of these in this non-i.i.d. scenario.
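As a concrete, informal illustration of Definition 3 (not a substitute for the formal O(1/m) stability bounds of [3] used later), the following sketch trains kernel ridge regression, one of the algorithms treated in Section 4, on two samples that differ in a single point and measures the largest cost change over a grid of test pairs, a crude stand-in for the supremum over z. The Gaussian kernel, the synthetic data, and all parameter values are arbitrary assumptions made only for this illustration.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between 1-D point sets A and B (illustrative choice)."""
    return np.exp(-gamma * (A[:, None] - B[None, :]) ** 2)

def krr_fit(x, y, lam):
    """Kernel ridge regression: minimize (1/m) sum (h(x_i)-y_i)^2 + lam * ||h||_K^2."""
    m = len(x)
    K = gaussian_kernel(x, x)
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)
    return lambda t: gaussian_kernel(t, x) @ alpha

def stability_estimate(x, y, lam, i, x_new, y_new, grid_x, grid_y):
    """Crude estimate of beta_hat: max cost change over a test grid after swapping point i."""
    h = krr_fit(x, y, lam)
    x2, y2 = x.copy(), y.copy()
    x2[i], y2[i] = x_new, y_new
    h2 = krr_fit(x2, y2, lam)
    return np.max(np.abs((h(grid_x) - grid_y) ** 2 - (h2(grid_x) - grid_y) ** 2))

rng = np.random.default_rng(0)
grid_x = np.linspace(-3, 3, 50)
grid_y = np.sin(grid_x)
for m in (50, 100, 200, 400):
    x = rng.uniform(-3, 3, m)
    y = np.sin(x) + 0.1 * rng.normal(size=m)
    est = stability_estimate(x, y, lam=0.1, i=0, x_new=0.0, y_new=1.0,
                             grid_x=grid_x, grid_y=grid_y)
    print(m, est)  # the estimate should shrink roughly like 1/m for this algorithm
```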
3.1 Lipschitz Condition

As discussed in Section 2.2, in the most general scenario, test points depend on the training sample. We first present a lemma that relates the expected value of the generalization error in that scenario to the same expectation in the scenario where the test point is independent of the training sample. We denote by R(h_S) = E_z[c(h_S, z) | S] the expectation in the dependent case and by R̃(h_{S_b}) = E_{z̃}[c(h_{S_b}, z̃)] that expectation when the test point z̃ is assumed independent of the training sample, with S_b denoting a sequence similar to S but with the last b points removed. Figure 1(a) illustrates that sequence. The block S_b is assumed to have exactly the same distribution as the corresponding block of the same size in S.

Lemma 2. Assume that the learning algorithm is β̂-stable and that the cost function c is bounded by M. Then, for any sample S of size m drawn from a β-mixing stationary distribution and for any b ∈ {0, ..., m}, the following holds:

| E_S[R(h_S)] - E_S[R̃(h_{S_b})] | ≤ b β̂ + β(b) M.   (7)

Proof. The β̂-stability of the learning algorithm implies that

E_S[R(h_S)] = E_{S,z}[c(h_S, z)] ≤ E_{S,z}[c(h_{S_b}, z)] + b β̂.   (8)

The application of Lemma 1 yields

E_S[R(h_S)] ≤ E_{S,z̃}[c(h_{S_b}, z̃)] + b β̂ + β(b) M = E_S[R̃(h_{S_b})] + b β̂ + β(b) M.   (9)

The other side of the inequality of the lemma can be shown following the same steps.

We can now prove a Lipschitz bound for the function Φ.

Lemma 3. Let S = (z_1, ..., z_i, ..., z_m) and S^i = (z_1, ..., z_i', ..., z_m) be two sequences drawn from a β-mixing stationary process that differ only in the i-th point, i ∈ [1, m], and let h_S and h_{S^i} be the hypotheses returned by a β̂-stable algorithm when trained on each of these samples. Then, for any i ∈ [1, m], the following inequality holds:

| Φ(S) - Φ(S^i) | ≤ 2(b + 1) β̂ + 2 β(b) M + M/m.   (10)

Proof. To prove this inequality, we first bound the difference of the empirical errors as in [3], then the difference of the true errors. Bounding the difference of costs on the points on which S and S^i agree by β̂, and on the point on which they disagree by M, yields

| R̂(h_S) - R̂(h_{S^i}) | = (1/m) ∑_{j=1}^{m} | c(h_S, z_j) - c(h_{S^i}, z_j') |   (11)
  ≤ (1/m) ∑_{j ≠ i} | c(h_S, z_j) - c(h_{S^i}, z_j) | + (1/m) | c(h_S, z_i) - c(h_{S^i}, z_i') |   (12)
  ≤ β̂ + M/m.   (13)

Now, applying Lemma 2 to both generalization error terms and using β̂-stability results in

| R(h_S) - R(h_{S^i}) | ≤ | R̃(h_{S_b}) - R̃(h_{S^i_b}) | + 2 b β̂ + 2 β(b) M   (14)
  = | E_{z̃}[ c(h_{S_b}, z̃) - c(h_{S^i_b}, z̃) ] | + 2 b β̂ + 2 β(b) M   (15)
  ≤ β̂ + 2 b β̂ + 2 β(b) M.   (16)

The lemma's statement is obtained by combining inequalities 13 and 16.

3.2 Bound on E[Φ]

As mentioned earlier, to make the bound useful, we also need to bound E_S[Φ(S)]. This is done by analyzing independent blocks using Lemma 1.

Lemma 4. Let h_S be the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a stationary β-mixing distribution. Then, for all b ∈ [1, m], the following inequality holds:

E_S[ | Φ(S) | ] ≤ (6b + 1) β̂ + 3 β(b) M.   (17)

Proof. We first analyze the term E_S[R̂(h_S)]. Let S_i be the sequence S with the b points before and after point z_i removed. Figure 1(b) illustrates this definition. S_i is thus made of three blocks. Let S̃_i denote a similar set of three blocks, each with the same distribution as the corresponding block in S_i, but such that the three blocks are independent. In particular, the middle block, reduced to the single point z̃_i, is independent of the two others. By the β̂-stability of the algorithm,

E_S[R̂(h_S)] = E_S[ (1/m) ∑_{i=1}^{m} c(h_S, z_i) ] ≤ E_S[ (1/m) ∑_{i=1}^{m} c(h_{S_i}, z_i) ] + 2 b β̂.   (18)

Applying Lemma 1 to the first term of the right-hand side yields

E_S[R̂(h_S)] ≤ E_{S̃_i}[ (1/m) ∑_{i=1}^{m} c(h_{S̃_i}, z̃_i) ] + 2 b β̂ + 2 β(b) M.   (19)

Combining the independent block sequences associated to R̂(h_S) and R(h_S) will help us prove the lemma in a way similar to the i.i.d. case treated in [3]. Let S_b be defined as in Lemma 2. To deal with independent block sequences defined with respect to the same hypothesis, we will consider the sequence S_{i,b} = S_i ∩ S_b, which is illustrated by Figure 1(c). This can result in as many as four blocks. As before, we will consider a sequence S̃_{i,b} with a similar set of blocks, each with the same distribution as the corresponding block in S_{i,b}, but such that the blocks are independent.

Since three blocks of at most b points are removed from each hypothesis, by the β̂-stability of the learning algorithm, the following holds:

E_S[Φ(S)] = E_S[R(h_S) - R̂(h_S)] = E_{S,z}[ (1/m) ∑_{i=1}^{m} ( c(h_S, z) - c(h_S, z_i) ) ]   (20)
  ≤ E_{S_{i,b},z}[ (1/m) ∑_{i=1}^{m} ( c(h_{S_{i,b}}, z) - c(h_{S_{i,b}}, z_i) ) ] + 6 b β̂.   (21)

Now, the application of Lemma 1 to the difference of two cost functions, also bounded by M, as in the right-hand side, leads to

E_S[Φ(S)] ≤ E_{S̃_{i,b},z̃}[ (1/m) ∑_{i=1}^{m} ( c(h_{S̃_{i,b}}, z̃) - c(h_{S̃_{i,b}}, z̃_i) ) ] + 6 b β̂ + 3 β(b) M.   (22)

Now, since the points z̃ and z̃_i are independent and since the distribution is stationary, they have the same distribution and we can replace z̃_i with z̃ in the empirical cost. We can thus write

E_S[Φ(S)] ≤ E_{S̃_{i,b},z̃}[ (1/m) ∑_{i=1}^{m} ( c(h_{S̃_{i,b}}, z̃) - c(h_{S̃^i_{i,b}}, z̃) ) ] + 6 b β̂ + 3 β(b) M   (23)
  ≤ β̂ + 6 b β̂ + 3 β(b) M,   (24)

where S̃^i_{i,b} is the sequence derived from S̃_{i,b} by replacing z̃_i with z̃. The last inequality holds by the β̂-stability of the learning algorithm. The other side of the inequality in the statement of the lemma can be shown following the same steps.
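The derived sequences used in the two lemmas above (and sketched in Figure 1) involve some index bookkeeping. The hypothetical helper below is only an editorial illustration: it returns the indices kept by S_b (last b points dropped), S_i (the b points on either side of z_i dropped), and S_{i,b} = S_i ∩ S_b, and groups them into maximal consecutive blocks. The distributional statement that the corresponding block sequences S̃ are independent is, of course, not captured by index sets.

```python
def blocks(kept):
    """Group a set of kept indices into maximal consecutive runs (the 'blocks')."""
    runs, run = [], []
    for j in sorted(kept):
        if run and j != run[-1] + 1:
            runs.append(run)
            run = []
        run.append(j)
    if run:
        runs.append(run)
    return runs

def s_b(m, b):
    """S_b: the sample S with its last b points removed."""
    return set(range(m - b))

def s_i(m, i, b):
    """S_i: S with the b points before and after z_i removed (z_i itself is kept)."""
    removed = set(range(i - b, i)) | set(range(i + 1, i + b + 1))
    return set(range(m)) - removed

def s_i_b(m, i, b):
    """S_{i,b}: the intersection of S_i and S_b."""
    return s_i(m, i, b) & s_b(m, b)

if __name__ == "__main__":
    m, i, b = 20, 8, 3
    for name, kept in [("S_b", s_b(m, b)), ("S_i", s_i(m, i, b)), ("S_{i,b}", s_i_b(m, i, b))]:
        print(name, [(r[0], r[-1]) for r in blocks(kept)])
```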
3.3 Main Results

This section presents several theorems that constitute the main results of this paper. We will use the following theorem, which extends McDiarmid's inequality to φ-mixing distributions. (The original bound of [7] is expressed in terms of η-mixing coefficients. We adapt it to the case of stationary φ-mixing sequences by using the following straightforward inequality for a stationary process: η_{ij} ≤ 2 φ(j - i).)

Theorem 1 (Kontorovich and Ramanan [7], Thm. 1.1). Let Φ : Z^m → ℝ be a function defined over a countable space Z. If Φ is l-Lipschitz with respect to the Hamming metric for some l > 0, then the following holds for all ε > 0:

Pr_Z[ | Φ(Z) - E[Φ(Z)] | > ε ] ≤ 2 exp( -ε² / (2 m l² ‖Δ_m‖_∞²) ),   (25)

where ‖Δ_m‖_∞ ≤ 1 + 2 ∑_{k=1}^{m} φ(k).

Theorem 2 (General Non-i.i.d. Stability Bound). Let h_S denote the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a φ-mixing stationary distribution, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any b ∈ [0, m] and any ε > 0, the following generalization bound holds:

Pr_S[ R(h_S) - R̂(h_S) > ε + (6b + 1) β̂ + 6 M φ(b) ] ≤ 2 exp( -ε² (1 + 2 ∑_{i=1}^{m} φ(i))^{-2} / ( 2 m ( 2(b + 1) β̂ + 2 M φ(b) + M/m )² ) ).

Proof. The theorem follows directly from the application of Lemma 3 and Lemma 4 to Theorem 1.

The theorem gives a general stability bound for φ-mixing stationary sequences. If we further assume that the sequence is algebraically φ-mixing, that is, φ(k) = φ_0 k^{-r} for all k, for some r > 1, then we can solve for the value of b that optimizes the bound.

Theorem 3 (Non-i.i.d. Stability Bound for Algebraically Mixing Sequences). Let h_S denote the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from an algebraically φ-mixing stationary distribution, φ(k) = φ_0 k^{-r} with r > 1, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any ε > 0, the following generalization bound holds:

Pr_S[ R(h_S) - R̂(h_S) > ε + β̂ + (r + 1) 6 M φ(b) ] ≤ 2 exp( -ε² (4 + 2/(r - 1))^{-2} / ( 2 m ( 2 β̂ + (r + 1) 2 M φ(b) + M/m )² ) ),

where φ(b) = φ_0 ( β̂ / (r φ_0 M) )^{r/(r+1)}.

Proof. For an algebraically mixing sequence, the value of b minimizing the bound of Theorem 2 satisfies β̂ b = r M φ(b), which gives b = ( r φ_0 M / β̂ )^{1/(r+1)} and φ(b) = φ_0 ( β̂ / (r φ_0 M) )^{r/(r+1)}. The following term can be bounded as

1 + 2 ∑_{i=1}^{m} φ(i) = 1 + 2 ∑_{i=1}^{m} i^{-r} ≤ 1 + 2 ( 1 + ∫_1^m i^{-r} di ) = 1 + 2 ( 1 + (m^{1-r} - 1)/(1 - r) ).   (26)

For r > 1, the exponent of m is negative, and so we can bound this last term by 3 + 2/(r - 1). Plugging in this value and the minimizing value of b in the bound of Theorem 2 yields the statement of the theorem.
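To get a feel for how the quantities in Theorem 3 behave, the sketch below evaluates the optimal block size b and the resulting deviation term for an algebraically φ-mixing envelope, following the form of the bound as reconstructed above. All numerical values (φ0, r, M, δ, and the stability rate β̂ = 1/m) are arbitrary illustrative choices, not values from the paper.

```python
import math

def theorem3_terms(m, beta_hat, M, phi0, r, delta):
    """Evaluate the quantities of Theorem 3 for phi(k) = phi0 * k**(-r), r > 1 (illustrative)."""
    # Block size minimizing the bound of Theorem 2: beta_hat * b = r * M * phi(b).
    b = (r * phi0 * M / beta_hat) ** (1.0 / (r + 1))
    phi_b = phi0 * (beta_hat / (r * phi0 * M)) ** (r / (r + 1))
    # Shift of the deviation and the Lipschitz-type factor in the exponent.
    shift = beta_hat + 6 * (r + 1) * M * phi_b
    lipschitz = 2 * beta_hat + 2 * (r + 1) * M * phi_b + M / m
    # Solve 2*exp(-eps^2 * (4 + 2/(r-1))**(-2) / (2*m*lipschitz**2)) = delta for eps.
    eps = (4 + 2 / (r - 1)) * lipschitz * math.sqrt(2 * m * math.log(2 / delta))
    return b, shift + eps

for m in (10**3, 10**4, 10**5, 10**6):
    beta_hat = 1.0 / m   # O(1/m) stability, as for the algorithms of Section 4
    b, gap = theorem3_terms(m, beta_hat, M=1.0, phi0=1.0, r=2.0, delta=0.05)
    print(f"m={m:>8}  b*={b:8.1f}  bound on R - R_hat: {gap:.4f}")
```

With r = 2 the resulting gap shrinks only slowly (roughly like m^(-1/6) here), which illustrates why the convergence condition discussed next requires β̂ and φ(b) to be o(1/√m).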
In the case of a zero mixing coefficient (φ = 0 and b = 0), the bounds of Theorem 2 and Theorem 3 coincide with the i.i.d. stability bound of [3]. In order for the right-hand side of these bounds to converge, we must have β̂ = o(1/√m) and φ(b) = o(1/√m). For several general classes of algorithms, β̂ = O(1/m) [3]. In the case of the algebraically mixing sequences with r > 1 assumed in Theorem 3, β̂ = O(1/m) implies φ(b) = φ_0 ( β̂/(r φ_0 M) )^{r/(r+1)} = o(1/√m). The next section illustrates the application of Theorem 3 to several general classes of algorithms.

4 Application

We now present the application of our stability bounds to several algorithms in the case of an algebraically mixing sequence. Our bounds apply to all algorithms based on the minimization of a regularized objective function defined with the norm ‖·‖_K of a reproducing kernel Hilbert space, where K is a positive definite symmetric kernel:

argmin_{h ∈ H} (1/m) ∑_{i=1}^{m} c(h, z_i) + λ ‖h‖_K²,   (27)

under some general conditions, since these algorithms are β̂-stable with β̂ = O(1/m) [3]. Two specific instances of these algorithms are SVR, for which the cost function is based on the ε-insensitive cost:

c(h, z) = |h(x) - y|_ε = 0 if |h(x) - y| ≤ ε, and |h(x) - y| - ε otherwise,   (28)

and Kernel Ridge Regression [13], for which c(h, z) = (h(x) - y)².

Corollary 1. Assume a bounded output Y = [0, B], for some B > 0, and assume that K(x, x) ≤ κ² for all x, for some κ > 0. Let h_S denote the hypothesis returned by the algorithm when trained on a sample S drawn from an algebraically φ-mixing stationary distribution. Then, with probability at least 1 - δ, the following generalization bounds hold for

• Support Vector Regression (SVR):
  R(h_S) ≤ R̂(h_S) + 3κ²B/(2λm) + 5 ( 13κ²B/(2λ) + B ) √( ln(1/δ) / m );   (29)

• Kernel Ridge Regression:
  R(h_S) ≤ R̂(h_S) + 26κ²B²/(λm) + 5 ( 2κ²B²/λ + B² ) √( ln(1/δ) / m ).   (30)

Proof. It has been shown in [3] that for SVR β̂ ≤ κ²/(2λm) and for Kernel Ridge Regression β̂ ≤ 2κ²B²/(λm), with the corresponding upper bounds M on the cost function following from the boundedness of Y and of the kernel. Plugging these values into the bound of Theorem 3 and simplifying the expression (with some loss in tightness) by using the lower bound on r, r > 1, yields the statement of the corollary.

These bounds give, to the best of our knowledge, the first stability-based generalization bounds for SVR and Kernel Ridge Regression in a non-i.i.d. scenario. Similar bounds can be obtained for other families of algorithms, such as maximum entropy discrimination, which can be shown to have comparable stability properties [3]. Our bounds have the same convergence behavior as those derived by [3] in the i.i.d. case. In fact, they differ only by some constants. As in the i.i.d. case, they are non-trivial when the condition λ ≫ 1/√m on the regularization parameter holds for all large values of m.

It would be interesting to give a quantitative comparison of our bounds and the generalization bounds of [10] based on covering numbers for mixing stationary distributions, in the scenario where test points are independent of the training sample. In general, because the bounds of [10] are not algorithm-dependent, one can expect tighter bounds using stability, provided that a tight bound is given on the stability coefficient. The comparison also depends on how fast the covering number grows with the sample size and on trade-off parameters such as λ. For a fixed λ, the asymptotic behavior of our stability bounds for SVR and Kernel Ridge Regression is tight.
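The scenario that motivates Corollary 1, using SVR for time series prediction on dependent data, can be sketched as follows. This is an illustrative example only: the AR(1) data, the lag features, and the scikit-learn SVR hyperparameters (C, epsilon, RBF kernel) are arbitrary assumptions, and scikit-learn's C plays roughly the role of 1/(λm) in the objective of Eq. 27, up to scaling.

```python
import numpy as np
from sklearn.svm import SVR

def ar1_series(n, rho=0.9, seed=0):
    """Synthetic stationary AR(1) series, standing in for a mixing process."""
    rng = np.random.default_rng(seed)
    s = np.empty(n)
    s[0] = rng.normal(scale=1.0 / np.sqrt(1 - rho ** 2))
    for t in range(1, n):
        s[t] = rho * s[t - 1] + rng.normal()
    return s

def lagged_pairs(s, p=3):
    """x_t = (s_{t-p}, ..., s_{t-1}), y_t = s_t: the time series prediction setup."""
    X = np.column_stack([s[i : len(s) - p + i] for i in range(p)])
    return X, s[p:]

series = ar1_series(1200)
X, y = lagged_pairs(series)
m = 1000
X_train, y_train = X[:m], y[:m]
X_test, y_test = X[m:], y[m:]          # future points of the same series (the Eq. 4 setting)

# epsilon-insensitive regression; C corresponds roughly to 1/(lambda * m) in Eq. 27.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X_train, y_train)
pred = model.predict(X_test)
print("test MSE on dependent future points:", np.mean((pred - y_test) ** 2))
```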
5 Conclusion

We presented general stability bounds for mixing stationary sequences. These bounds apply to large classes of algorithms, including Support Vector Regression and Kernel Ridge Regression, and extend existing i.i.d. stability bounds to the case of weakly dependent observations. Since they are algorithm-specific, our bounds can often be tighter than other generalization bounds based on a complexity measure such as covering numbers. However, they can perhaps be further improved by considering weaker notions of stability, based for example on the expected difference of the hypotheses generated for similar samples instead of the maximum.

Acknowledgments

This work was partially funded by the New York State Office of Science, Technology and Academic Research (NYSTAR) and was also sponsored in part by the Department of the Army, Award Number W23RYX-3275N605. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick, MD 21702-5014, is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

[1] S. N. Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Math. Ann., 97:1–59, 1927.
[2] Olivier Bousquet and André Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural Information Processing Systems (NIPS 2000), 2001.
[3] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[4] Luc Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25:601–604, 1979.
[5] Paul Doukhan. Mixing: Properties and Examples. Springer-Verlag, 1994.
[6] Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Computational Learning Theory, pages 152–162, 1997.
[7] Leo Kontorovich and Kavita Ramanan. Concentration inequalities for dependent random variables via the martingale method, 2006.
[8] Aurélie Lozano, Sanjeev Kulkarni, and Robert Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, 2006.
[9] Davide Mattera and Simon Haykin. Support vector machines for dynamic reconstruction of a chaotic system. In Advances in Kernel Methods: Support Vector Learning, pages 211–241. MIT Press, Cambridge, MA, 1999.
[10] Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, April 2000.
[11] D. Modha and E. Masry. On the consistency in nonparametric estimation under mixing assumptions. IEEE Transactions on Information Theory, 44:117–133, 1998.
[12] Klaus-Robert Müller, Alex Smola, Gunnar Rätsch, Bernhard Schölkopf, Jens Kohlmorgen, and Vladimir Vapnik. Predicting time series with support vector machines. In Proceedings of the International Conference on Artificial Neural Networks (ICANN'97), Lecture Notes in Computer Science, pages 999–1004. Springer, 1997.
[13] Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 515–521. Morgan Kaufmann, 1998.
[14] Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[15] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[16] Mathukumalli Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer, 2003.
[17] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994.