Stability of Transductive Regression Algorithms

Corinna Cortes, Google Research, 76 Ninth Avenue, New York, NY 10011. corinna@google.com
Mehryar Mohri, Courant Institute of Mathematical Sciences and Google Research, 251 Mercer Street, New York, NY 10012. mohri@cims.nyu.edu
Dmitry Pechyony, Technion - Israel Institute of Technology, Haifa 32000, Israel. pechyony@cs.technion.ac.il
Ashish Rastogi, Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012. rastogi@cs.nyu.edu

Abstract

This paper uses the notion of algorithmic stability to derive novel generalization bounds for several families of transductive regression algorithms, both by using convexity and closed-form solutions. Our analysis helps compare the stability of these algorithms. It suggests that several existing algorithms might not be stable but prescribes a technique to make them stable. It also reports the results of experiments with local transductive regression demonstrating the benefit of our stability bounds for model selection, in particular for determining the radius of the local neighborhood used by the algorithm.

This paper deals with transductive regression, which arises in problems such as predicting the real-valued labels of the nodes of a known graph in computational biology, or the scores associated with known documents in information extraction or search engine tasks. Several algorithms have been devised for the specific setting of transductive regression (Belkin et al., 2004b; Chapelle et al., 1999; Schuurmans & Southey, 2002; Cortes & Mohri, 2007). Several other algorithms introduced for transductive classification can in fact be viewed as transductive regression ones, since their objective function is based on the squared loss, e.g., (Belkin et al., 2004a; 2004b).
1. Introduction

Many learning problems in information extraction, computational biology, natural language processing and other domains can be formulated as transductive inference problems (Vapnik, 1982). In the transductive setting, the learning algorithm receives both a labeled training set, as in the standard induction setting, and a set of unlabeled test points. The objective is to predict the labels of the test points. No other test points will ever be considered. This setting arises in a variety of applications. Often, the points to label are known but have not been assigned a label due to the prohibitive cost of labeling. This motivates the use of transductive algorithms, which leverage the unlabeled data during training to improve learning performance.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Cortes and Mohri (2007) also gave explicit VC-dimension generalization bounds for transductive regression that hold for all bounded loss functions and coincide with the tight classification bounds of Vapnik (1998) when applied to classification. This paper presents novel algorithm-dependent generalization bounds for transductive regression. Since they are algorithm-specific, these bounds can often be tighter than bounds based on general complexity measures such as the VC-dimension. Our analysis is based on the notion of algorithmic stability. In Sec. 2 we give a formal definition of the transductive regression setting and the notion of stability for transduction. Our bounds generalize the stability bounds given by Bousquet and Elisseeff (2002) for the inductive setting and extend to regression the stability-based transductive classification bounds of (El-Yaniv & Pechyony, 2006). Standard concentration bounds such as McDiarmid's bound (McDiarmid, 1989) cannot be readily applied to the transductive regression setting since the points are not drawn independently but uniformly without replacement from a finite set.
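The sampling scheme just described, a training set drawn uniformly without replacement from a fixed full sample, can be sketched as follows (a minimal illustration; the function name `transductive_split` is ours, not from the paper):

```python
import numpy as np

def transductive_split(n_points, m, rng):
    """Draw a training index set S of size m uniformly without replacement
    from the full sample {0, ..., n_points - 1}; the remaining u = n_points - m
    indices form the test set T. No point outside the full sample is ever used."""
    perm = rng.permutation(n_points)
    return np.sort(perm[:m]), np.sort(perm[m:])

rng = np.random.default_rng(0)
S, T = transductive_split(10, 4, rng)   # m = 4 labeled, u = 6 unlabeled
```

Because S and T partition a finite set, the training points are dependent, which is why a concentration bound for sampling without replacement is needed in place of McDiarmid's inequality.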
Instead, a generalization of McDiarmid's bound that holds for random variables sampled without replacement is used, as in (El-Yaniv & Pechyony, 2006). Sec. 3.1 gives a simpler proof of this bound. This concentration bound is used to derive a general transductive regression stability bound in Sec. 3.2. In Sec. 4, we present the stability coefficients for a family of local transductive regression algorithms. The analysis in this section is based on convexity. In Sec. 5, we study the stability of other transductive regression algorithms (Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003) based on their closed-form solutions and propose a modification of the seemingly unstable algorithms that makes them stable and guarantees a non-trivial generalization bound. Finally, Sec. 6 shows the results of experiments with local transductive regression demonstrating the benefit of our stability bounds for model selection, in particular for determining the radius of the local neighborhood used by the algorithm. This provides a partial validation of our bounds and analysis.

We denote by $c(h, x)$ the cost of an error of a hypothesis $h$ on a point $x$ labeled with $y(x)$. The cost function commonly used in regression is the squared loss $c(h, x) = (h(x) - y(x))^2$. In the remainder of this paper, we will assume a squared loss, but many of our results generalize to other convex cost functions. The training and test errors of $h$ are respectively $\widehat{R}(h) = \frac{1}{m}\sum_{k=1}^{m} c(h, x_k)$ and $R(h) = \frac{1}{u}\sum_{k=1}^{u} c(h, x_{m+k})$. The generalization bounds we derive are based on the notion of transductive algorithmic stability.

Definition 1 (Transduction $\beta$-stability). Let $L$ be a transductive learning algorithm and let $h$ denote the hypothesis returned by $L$ for $X(S, T)$ and $h'$ the hypothesis returned for $X(S', T')$. $L$ is said to be uniformly $\beta$-stable with respect to the cost function $c$ if there exists $\beta \geq 0$ such that for any two partitionings $X(S, T)$ and $X(S', T')$ that differ in exactly one training (and thus test) point, and for all $x \in X$,
$$|c(h, x) - c(h', x)| \leq \beta. \qquad (1)$$
2. Definitions

Let us first describe the transductive learning setting. Assume that a full sample $X$ of $m + u$ examples is given. The learning algorithm further receives the labels of a random subset $S$ of $X$ of size $m$, which serves as a training sample. The remaining $u$ unlabeled examples, $x_{m+1}, \ldots, x_{m+u} \in X$, serve as test data. We denote by $X(S, T)$ a partitioning of $X$ into the training set $S$ and the test set $T$. The transductive learning problem consists of predicting accurately the labels $y_{m+1}, \ldots, y_{m+u}$ of the test examples; no other test examples will ever be considered (Vapnik, 1998).¹ The specific problem where the labels are real-valued numbers, as in the case studied in this paper, is that of transductive regression. It differs from standard (induction) regression since the learning algorithm is given the unlabeled test examples beforehand and can thus exploit this information to improve performance.

3. Transduction Stability Bounds

3.1. Concentration Bound for Sampling without Replacement

Stability-based generalization bounds in the inductive setting are based on McDiarmid's inequality (1989). In the transductive setting, the points are drawn uniformly without replacement and thus are not independent. Therefore, McDiarmid's concentration bound cannot be readily used. Instead, a generalization of McDiarmid's bound for sampling without replacement is needed, as in El-Yaniv and Pechyony (2006). We will denote by $S_1^m$ a sequence of random variables $S_1, \ldots, S_m$ and write $S_1^m = x_1^m$ as a shorthand for the $m$ equalities $S_i = x_i$, $i = 1, \ldots, m$, and $\Pr[x_{i+1}^m \mid x_1^{i-1}, x_i] = \Pr[S_{i+1}^m = x_{i+1}^m \mid S_1^{i-1} = x_1^{i-1}, S_i = x_i]$.

¹Another natural setting for transduction is one where the training and test samples are both drawn according to the same distribution and where the test points, but not their labels, are made available to the learning algorithm.
However, as pointed out by Vapnik (1998), any generalization bound in the setting we analyze directly yields a bound for this other setting, essentially by taking the expectation.

Theorem 1 ((McDiarmid, 1989), 6.10). Let $S_1^m$ be a sequence of random variables, each $S_i$ taking values in the set $X$, and assume that a measurable function $\phi : X^m \to \mathbb{R}$ satisfies: for all $i \in [1, m]$ and all $x_i, x'_i \in X$,
$$\big| \mathbb{E}_{S_{i+1}^m}[\phi \mid S_1^{i-1}, S_i = x_i] - \mathbb{E}_{S_{i+1}^m}[\phi \mid S_1^{i-1}, S_i = x'_i] \big| \leq c_i.$$
Then, for all $\epsilon > 0$,
$$\Pr\big[ |\phi - \mathbb{E}[\phi]| \geq \epsilon \big] \leq 2 \exp\Big( \frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2} \Big).$$

The following is a concentration bound for sampling without replacement needed to analyze the generalization of transductive algorithms.

Theorem 2. Let $x_1^m$ be a sequence of random variables sampled from an underlying set $X$ of $m + u$ elements without replacement, and let $\phi : X^m \to \mathbb{R}$ be a symmetric function such that for all $i \in [1, m]$ and for all $x_1, \ldots, x_m \in X$ and $x'_1, \ldots, x'_m \in X$,
$$|\phi(x_1, \ldots, x_m) - \phi(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_m)| \leq c.$$
Then, for all $\epsilon > 0$,
$$\Pr\big[ |\phi - \mathbb{E}[\phi]| \geq \epsilon \big] \leq 2 \exp\Big( \frac{-2\epsilon^2}{\alpha(m, u) c^2} \Big), \quad \text{where } \alpha(m, u) = \frac{mu}{m + u - 1/2} \cdot \frac{1}{1 - 1/(2\max\{m, u\})}.$$

Proof. For a fixed $i \in [1, m]$, let $g(S_1^{i-1}) = \mathbb{E}_{S_{i+1}^m}[\phi \mid S_1^{i-1}, S_i = x_i] - \mathbb{E}_{S_{i+1}^m}[\phi \mid S_1^{i-1}, S_i = x'_i]$. For uniform sampling without replacement, the probability terms can be written as
$$\Pr[x_{i+1}^m \mid x_1^{i-1}, x_i] = \prod_{k=i}^{m-1} \frac{1}{m + u - k} = \frac{u!}{(m + u - i)!}.$$
Thus,
$$g(x_1^{i-1}) = \frac{u!}{(m + u - i)!} \Big[ \sum_{x_{i+1}^m} \phi(x_1^{i-1}, x_i, x_{i+1}^m) - \sum_{x'^m_{i+1}} \phi(x_1^{i-1}, x'_i, x'^m_{i+1}) \Big].$$
To compute the expression between brackets, we divide the set of permutations $\{x'^m_{i+1}\}$ into two sets: those that contain $x_i$ and those that do not. If a permutation $x'^m_{i+1}$ contains $x_i$, we can write it as $x'^{k-1}_{i+1} x_i x'^m_{k+1}$, where $k$ is such that $x'_k = x_i$. We then match it up with the permutation $x_i x'^{k-1}_{i+1} x'^m_{k+1}$ from the set $\{x_i x^m_{i+1}\}$. These two permutations contain exactly the same elements and, since the function $\phi$ is symmetric in its arguments, the difference of the values of $\phi$ on the two permutations is zero. In the other case, if a permutation $x'^m_{i+1}$ does not contain the element $x_i$, we simply match it up with the same permutation in $\{x^m_{i+1}\}$. The matching permutations appearing in the summation are then $x_i x'^m_{i+1}$ and $x'_i x'^m_{i+1}$, which clearly differ only with respect to $x_i$, so the difference of the values of $\phi$ in this case can be bounded by $c$. The number of such permutations is $\frac{(m + u - i - 1)!}{(u - 1)!}$, which leads to the upper bound
$$|g(x_1^{i-1})| \leq \frac{u!}{(m + u - i)!} \cdot \frac{(m + u - i - 1)!}{(u - 1)!}\, c = \frac{u}{m + u - i}\, c.$$
Then, combining Theorem 1 with the identity $\sum_{i=1}^m \frac{1}{(m + u - i)^2} \leq \frac{m}{(m + u - 1/2)(u - 1/2)}$ yields
$$\Pr\big[ |\phi - \mathbb{E}[\phi]| \geq \epsilon \big] \leq 2 \exp\Big( \frac{-2\epsilon^2}{\alpha_u(m, u) c^2} \Big), \quad \text{where } \alpha_u(m, u) = \frac{mu}{m + u - 1/2} \cdot \frac{1}{1 - 1/(2u)}.$$
The function $\phi$ is symmetric in $m$ and $u$ in the sense that selecting one of the sets uniquely determines the other set. The statement of the theorem then follows from a similar bound with $\alpha_m(m, u) = \frac{mu}{m + u - 1/2} \cdot \frac{1}{1 - 1/(2m)}$, taking the tighter of the two.

3.2. Transductive Stability Bound

To obtain a general transductive regression stability bound, we apply the concentration bound of Theorem 2 to the random variable $\Phi(S) = R(h) - \widehat{R}(h)$. To do so, we need to bound $\mathbb{E}_S[\Phi(S)]$, where $S$ is a random subset of $X$ of size $m$, and $|\Phi(S) - \Phi(S')|$, where $S$ and $S'$ are samples differing by exactly one point.

Lemma 1. Let $H$ be a bounded hypothesis set ($\forall x \in X$, $|h(x) - y(x)| \leq B$) and $L$ a $\beta$-stable algorithm returning the hypotheses $h$ and $h'$ for two training sets $S$ and $S'$ of size $m$ each, respectively, differing in exactly one point. Then,
$$|\Phi(S) - \Phi(S')| \leq 2\beta + \frac{B^2(m + u)}{mu}. \qquad (2)$$

Proof. By definition, $S$ and $S'$ differ in exactly one point. Let $x_i \in S$ and $x_{m+j} \in S'$ be the points in which the two sets differ. For each of the $m - 1$ common labeled points in $S$ and $S'$, and for each of the $u - 1$ common test points in $T$ and $T'$ (recall $T = X \setminus S$, $T' = X \setminus S'$), the difference in cost is bounded by $\beta$, while for $x_i$ and $x_{m+j}$ the difference in cost is bounded by $B^2$. It follows that
$$|\Phi(S) - \Phi(S')| \leq \frac{(u-1)\beta}{u} + \frac{(m-1)\beta}{m} + \frac{B^2}{u} + \frac{B^2}{m} \leq 2\beta + \frac{B^2(m+u)}{mu}.$$

Lemma 2. Let $h$ be the hypothesis returned by a $\beta$-stable algorithm $L$. Then, $|\mathbb{E}_S[\Phi(S)]| \leq \beta$.

Proof. By definition of $\Phi(S)$, its expectation is $\frac{1}{u}\sum_{k=1}^{u} \mathbb{E}_S[c(h, x_{m+k})] - \frac{1}{m}\sum_{k=1}^{m} \mathbb{E}_S[c(h, x_k)]$. Since $\mathbb{E}_S[c(h, x_{m+k})]$ is the same for all $k \in [1, u]$, and $\mathbb{E}_S[c(h, x_k)]$ is the same for all $k \in [1, m]$, for any $i$ and $j$,
$$\mathbb{E}_S[\Phi(S)] = \mathbb{E}_S[c(h, x_{m+j})] - \mathbb{E}_S[c(h, x_i)] = \mathbb{E}_S[c(h', x_i)] - \mathbb{E}_S[c(h, x_i)] = \mathbb{E}_{S, S' \subseteq X}\big[ c(h', x_i) - c(h, x_i) \big],$$
where $S'$ is the training set obtained from $S$ by swapping $x_i$ and $x_{m+j}$. By the $\beta$-stability of $L$, the right-hand side is bounded by $\beta$ in absolute value.

Theorem 3. Let $H$ be a bounded hypothesis set ($\forall x \in X$, $|h(x) - y(x)| \leq B$) and $L$ a $\beta$-stable algorithm. Let $h$ be the hypothesis returned by $L$ when trained on $X(S, T)$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$R(h) \leq \widehat{R}(h) + \beta + \Big( 2\beta + \frac{B^2(m+u)}{mu} \Big) \sqrt{\frac{\alpha(m, u)\ln\frac{1}{\delta}}{2}}.$$

Proof. The result follows directly from Theorem 2 and Lemmas 1 and 2.

This is a general bound that applies to any transductive algorithm. To apply it, the stability coefficient $\beta$, which depends on $m$ and $u$, needs to be determined. In the subsequent sections, we derive bounds on $\beta$ for a number of transductive regression algorithms (Cortes & Mohri, 2007; Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003).

4. Stability of Local Transductive Regression Algorithms

This section describes and analyzes a general family of local transductive regression algorithms (LTR) generalizing the algorithm of Cortes and Mohri (2007). LTR algorithms can be viewed as a generalization of the so-called kernel regularization-based learning algorithms to the transductive setting.
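Before specializing the stability coefficient to particular algorithms, note that the slack term of Theorem 3 is straightforward to evaluate numerically. The sketch below transcribes the bound directly; the input values for $\beta$, $B$, $m$, $u$, $\delta$ are hypothetical, chosen only for illustration:

```python
import math

def alpha(m, u):
    """alpha(m, u) of Theorem 2: mu/(m + u - 1/2) * 1/(1 - 1/(2 max{m, u}))."""
    return m * u / (m + u - 0.5) / (1.0 - 1.0 / (2 * max(m, u)))

def slack(beta, B, m, u, delta):
    """Slack term of Theorem 3, added to the empirical error:
    beta + (2*beta + B^2 (m+u)/(m u)) * sqrt(alpha(m, u) * ln(1/delta) / 2)."""
    return beta + (2 * beta + B ** 2 * (m + u) / (m * u)) * math.sqrt(
        alpha(m, u) * math.log(1 / delta) / 2.0
    )

# Hypothetical values: with beta = O(1/m), the slack shrinks as m = u grows.
print(slack(beta=1.0 / 100, B=1.0, m=100, u=100, delta=0.05))
```

Since $\alpha(m, u)$ grows roughly like $\min\{m, u\}$, a stability coefficient of order $O(1/m)$ yields a slack of order $O(1/\sqrt{m})$ when $m = u$.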
The objective function that they minimize is of the form
$$F(h, S) = \|h\|_K^2 + \frac{C}{m} \sum_{k=1}^{m} c(h, x_k) + \frac{C'}{u} \sum_{k=1}^{u} \tilde{c}(h, x_{m+k}), \qquad (3)$$
where $\|\cdot\|_K$ is the norm in the reproducing kernel Hilbert space (RKHS) with associated kernel $K$, $C \geq 0$ and $C' \geq 0$ are trade-off parameters, and $\tilde{c}(h, x) = (h(x) - \tilde{y}(x))^2$ is the error of the hypothesis $h$ on the unlabeled point $x$ with respect to a pseudo-target $\tilde{y}$.

Since, by Lemma 3 below, $|h(x)| \leq \kappa M \sqrt{C + C'}$, we immediately obtain the bound $|h(x) - y(x)| \leq M(1 + \kappa\sqrt{C + C'})$. Thus, we are in a position to apply Theorem 3 with $B = AM$, where $A = 1 + \kappa\sqrt{C + C'}$. We now derive a bound on the stability coefficient $\beta$. To do so, the key property we will use is the convexity of $h \mapsto c(h, x)$. Note, however, that in the case of $\tilde{c}$, the pseudo-targets may depend on the training set $S$. This dependency matters when we wish to apply convexity with two hypotheses $h$ and $h'$ obtained by training on different samples $S$ and $S'$. For convenience, for any two such fixed hypotheses $h$ and $h'$, we extend the definition of $\tilde{c}$ as follows: for all $t \in [0, 1]$,
$$\tilde{c}(th + (1-t)h', x) = \big( (th + (1-t)h')(x) - (t\tilde{y} + (1-t)\tilde{y}') \big)^2.$$
This allows us to use the same convexity property for $\tilde{c}$ as for $c$ for any two fixed hypotheses $h$ and $h'$, as verified by the following lemma, and does not affect the proofs otherwise.

Lemma 4. Let $h$ be a hypothesis obtained by training on $S$ and $h'$ a hypothesis obtained by training on $S'$. Then, for all $t \in [0, 1]$,
$$t\tilde{c}(h, x) + (1-t)\tilde{c}(h', x) \geq \tilde{c}(th + (1-t)h', x). \qquad (5)$$

Pseudo-targets are obtained from neighborhood labels $y(x)$ by a local weighted average. Neighborhoods can be defined as a ball of radius $r$ around each point in the feature space. We will denote by $\beta_{loc}$ the score-stability coefficient of the local algorithm used, that is, the maximal amount by which the two hypotheses differ on any given point when trained on samples disagreeing on one point. This notion is stronger than that of cost-based stability.
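A minimal numerical sketch of minimizing an objective of the form (3): by the representer theorem, the minimizer can be written as $h = \sum_i \alpha_i K(x_i, \cdot)$ over the full sample, which reduces (3) to the linear system below. The data, kernel width, and pseudo-targets here are hypothetical stand-ins; the paper's experiments use a dual solver and locally estimated pseudo-targets:

```python
import numpy as np

rng = np.random.default_rng(0)
m, u = 20, 20
X = rng.normal(size=(m + u, 1))
targets = np.sin(X[:, 0])     # labels for the m training points and, as a
                              # stand-in, pseudo-targets for the u test points

# Gaussian Gram matrix over the full sample (labeled + unlabeled points).
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)

C, C_prime = 1.0, 0.1
w = np.concatenate([np.full(m, C / m), np.full(u, C_prime / u)])
W = np.diag(w)

# Minimizing ||h||_K^2 + sum_i w_i (h(x_i) - y_i)^2 with h = K @ alpha gives
# the stationarity condition K @ ((I + W K) alpha - W y) = 0, solved by:
alpha_vec = np.linalg.solve(np.eye(m + u) + W @ K, W @ targets)
h = K @ alpha_vec             # predicted values on all m + u points
```

The objective is convex in the coefficient vector, so the stationary point above is the global minimizer; in particular its objective value never exceeds that of the zero hypothesis used in the proof of Lemma 3.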
Proof. Let $\tilde{y} = \tilde{y}(x)$ be the pseudo-target value at $x$ when the training set is $S$, and $\tilde{y}' = \tilde{y}'(x)$ when the training set is $S'$. For all $t \in [0, 1]$,
$$t\tilde{c}(h, x) + (1-t)\tilde{c}(h', x) - \tilde{c}(th + (1-t)h', x) = t(h(x) - \tilde{y})^2 + (1-t)(h'(x) - \tilde{y}')^2 - \big( t(h(x) - \tilde{y}) + (1-t)(h'(x) - \tilde{y}') \big)^2.$$
The statement of the lemma follows directly from the convexity of $x \mapsto x^2$ over the real numbers.

In this section, we use the bounded-labels assumption, that is, $\forall x \in S$, $|y(x)| \leq M$. We also assume that for any $x \in X$, $K(x, x) \leq \kappa^2$. We will use the following bound, valid for any hypothesis $h \in H$, based on the reproducing property and the Cauchy-Schwarz inequality:
$$\forall x \in X, \quad |h(x)| = |\langle h, K(x, \cdot) \rangle| \leq \|h\|_K \sqrt{K(x, x)} \leq \kappa \|h\|_K. \qquad (4)$$

Lemma 3. Let $h$ be the hypothesis minimizing (3). Assume that for any $x \in X$, $K(x, x) \leq \kappa^2$. Then, for any $x \in X$, $|h(x)| \leq \kappa M \sqrt{C + C'}$.

Proof. The proof is a straightforward adaptation of the technique of (Bousquet & Elisseeff, 2002) to LTR algorithms. By Eqn. (4), $|h(x)| \leq \kappa\|h\|_K$. Let $\mathbf{0}$ be the hypothesis assigning label zero to all examples. By definition of $h$, $F(h, S) \leq F(\mathbf{0}, S) \leq (C + C')M^2$. Using $\|h\|_K^2 \leq F(h, S)$ yields the statement.

Let $h$ be a hypothesis obtained by training on $S$ and $h'$ by training on $S'$, and let $\Delta = h' - h$. Then, for all $x \in X$,
$$|c(h, x) - c(h', x)| = \big| \Delta(x)\big( (h(x) - y(x)) + (h'(x) - y(x)) \big) \big| \leq 2M\big(1 + \kappa\sqrt{C + C'}\big)|\Delta(x)|.$$
By (4), for all $x \in X$, $|\Delta(x)| \leq \kappa\|\Delta\|_K$, thus for all $x \in X$,
$$|c(h, x) - c(h', x)| \leq 2M\big(1 + \kappa\sqrt{C + C'}\big)\kappa\|\Delta\|_K = 2AM\kappa\|\Delta\|_K. \qquad (6)$$

Lemma 5. Assume that for all $x \in X$, $|y(x)| \leq M$. Let $S$ and $S'$ be two samples differing by exactly one point. Let $h$ be the hypothesis returned by the algorithm minimizing the objective function $F(h, S)$, $h'$ the hypothesis obtained by minimization of $F(h, S')$, and let $\tilde{y}$ and $\tilde{y}'$ be the corresponding pseudo-targets. Then,
$$\|\Delta\|_K^2 \leq 2AM\Big( \kappa\|\Delta\|_K\Big(\frac{C}{m} + \frac{C'}{u}\Big) + \beta_{loc}\frac{C'}{u} \Big),$$
where $\Delta = h' - h$ and $A = 1 + \kappa\sqrt{C + C'}$.

Proof. We write $\tilde{c}(h_i, \tilde{y}_i)$ for $\tilde{c}(h, x_i)$ and $\tilde{c}(h'_i, \tilde{y}'_i)$ for $\tilde{c}(h', x_i)$.
By Lemma 3 and the bounded-labels assumption,
$$|\tilde{c}(h'_i, \tilde{y}'_i) - \tilde{c}(h_i, \tilde{y}_i)| = |\tilde{c}(h'_i, \tilde{y}'_i) - \tilde{c}(h'_i, \tilde{y}_i) + \tilde{c}(h'_i, \tilde{y}_i) - \tilde{c}(h_i, \tilde{y}_i)| \leq |(\tilde{y}_i - \tilde{y}'_i)(\tilde{y}_i + \tilde{y}'_i - 2h'_i)| + |(h'_i - h_i)(h'_i + h_i - 2\tilde{y}_i)|.$$
By the score-stability of the local estimates, $|\tilde{y}'(x_i) - \tilde{y}(x_i)| \leq \beta_{loc}$. Thus,
$$|\tilde{c}(h'_i, \tilde{y}'_i) - \tilde{c}(h_i, \tilde{y}_i)| \leq 2AM(\beta_{loc} + \kappa\|\Delta\|_K). \qquad (7)$$
Combining these inequalities with the standard comparison of the objectives $F(h, S)$ and $F(h', S')$, which involves terms of the form $C\big[c(h', x_i) - c(h, x_i)\big]/m$ and $C'\big[\tilde{c}(h', x_i) - \tilde{c}(h, x_i)\big]/u$, and using (6) leads, after simplification, to the statement of the lemma.

The proof of the following theorem is based on Lemma 4 and Lemma 5 and is reserved for a longer version of this paper.

Theorem 4. Assume that for all $x \in X$, $|y(x)| \leq M$, and that there exists $\kappa$ such that $\forall x \in X$, $K(x, x) \leq \kappa^2$. Further, assume that the local estimator has uniform stability coefficient $\beta_{loc}$. Let $A = 1 + \kappa\sqrt{C + C'}$. Then, LTR is uniformly $\beta$-stable with
$$\beta \leq 2(AM\kappa)^2\Big( \frac{C}{m} + \frac{C'}{u} \Big) + 2AM\kappa\sqrt{ (AM\kappa)^2\Big( \frac{C}{m} + \frac{C'}{u} \Big)^2 + \frac{2AMC'\beta_{loc}}{u} }.$$
Our experiments with LTR will demonstrate the benefit of this bound for model selection (Sec. 6).

5. Stability Based on Closed-Form Solutions

5.1. Unconstrained Regularization Algorithms

In this section, we consider a family of transductive regression algorithms that can be formulated as the following optimization problem:
$$\min_{\mathbf{h}} \; \mathbf{h}^T Q \mathbf{h} + (\mathbf{h} - \mathbf{y})^T C (\mathbf{h} - \mathbf{y}). \qquad (8)$$
Here $Q \in \mathbb{R}^{(m+u)\times(m+u)}$ is a symmetric regularization matrix; $C \in \mathbb{R}^{(m+u)\times(m+u)}$ is a symmetric matrix of empirical weights (in practice it is often a diagonal matrix); $\mathbf{y} \in \mathbb{R}^{(m+u)\times 1}$ contains the target values of the $m$ labeled points together with the pseudo-target values of the $u$ unlabeled points (in some formulations, the pseudo-target value is 0); and $\mathbf{h} \in \mathbb{R}^{(m+u)\times 1}$ is a column vector whose $i$th row is the predicted target value for $x_i$. The closed-form solution of (8) is given by
$$\mathbf{h} = (C^{-1}Q + I)^{-1}\mathbf{y}. \qquad (9)$$
The formulation (8) is quite general and includes as special cases the algorithms of (Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003). We present a general framework for bounding the stability coefficient of these algorithms and then examine the stability coefficient of each of them in turn.

For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, we denote by $\lambda_M(A)$ its largest eigenvalue and by $\lambda_m(A)$ its smallest. Then, for any $\mathbf{v} \in \mathbb{R}^{n \times 1}$, $\lambda_m(A)\|\mathbf{v}\|_2 \leq \|A\mathbf{v}\|_2 \leq \lambda_M(A)\|\mathbf{v}\|_2$. We will also use, in the proof of the following proposition, the fact that for symmetric matrices $A, B \in \mathbb{R}^{n \times n}$, $\lambda_M(AB) \leq \lambda_M(A)\lambda_M(B)$.

Proposition 1. Let $\mathbf{h}$ and $\mathbf{h}'$ solve (8) under test and training sets that differ in exactly one point, and let $C, C', \mathbf{y}, \mathbf{y}'$ be the corresponding empirical weight matrices and target value vectors. Then,
$$\|\mathbf{h}' - \mathbf{h}\|_2 \leq \frac{\|\mathbf{y}' - \mathbf{y}\|_2}{\frac{\lambda_m(Q)}{\lambda_M(C')} + 1} + \frac{\lambda_M(Q)\,\|C^{-1} - C'^{-1}\|_2\,\|\mathbf{y}\|_2}{\big(\frac{\lambda_m(Q)}{\lambda_M(C')} + 1\big)\big(\frac{\lambda_m(Q)}{\lambda_M(C)} + 1\big)}.$$

Proof. Let $\Delta = \mathbf{h}' - \mathbf{h}$ and $\Delta\mathbf{y} = \mathbf{y}' - \mathbf{y}$. Let $\bar{C} = (C^{-1}Q + I)$ and $\bar{C}' = (C'^{-1}Q + I)$. By definition,
$$\Delta = \bar{C}'^{-1}\mathbf{y}' - \bar{C}^{-1}\mathbf{y} = \bar{C}'^{-1}\Delta\mathbf{y} + (\bar{C}'^{-1} - \bar{C}^{-1})\mathbf{y} = \bar{C}'^{-1}\Delta\mathbf{y} + \bar{C}'^{-1}\big( (C^{-1} - C'^{-1})Q \big)\bar{C}^{-1}\mathbf{y}.$$
Thus,
$$\|\Delta\|_2 \leq \frac{\|\Delta\mathbf{y}\|_2}{\lambda_m(\bar{C}')} + \frac{\lambda_M(Q)\,\|C^{-1} - C'^{-1}\|_2\,\|\mathbf{y}\|_2}{\lambda_m(\bar{C}')\lambda_m(\bar{C})}. \qquad (10)$$
Furthermore, $\lambda_m(\bar{C}) \geq \frac{\lambda_m(Q)}{\lambda_M(C)} + 1$, and similarly for $\bar{C}'$. Plugging these bounds back into Eqn. (10) yields the statement.

Since $|h'(x) - h(x)|$ is bounded by $\|\mathbf{h}' - \mathbf{h}\|_2$, the proposition provides a bound on the score-stability of the transductive regression algorithms of Zhou et al. (2004), Wu and Schölkopf (2007), and Zhu et al. (2003). For each of these algorithms, the pseudo-targets used are zero. If we make the bounded-labels assumption ($\forall x \in X$, $|y(x)| \leq M$, for some $M > 0$), it is not difficult to show that $\|\mathbf{y}' - \mathbf{y}\|_2 \leq 2M$ and $\|\mathbf{y}\|_2 \leq \sqrt{m}M$. We now examine each algorithm in turn.

Consistency method (CM). In the CM algorithm (Zhou et al., 2004), the matrix $Q$ is a normalized Laplacian of a weight matrix $W \in \mathbb{R}^{(m+u)\times(m+u)}$ that captures the affinity between pairs of points in the full sample $X$. Thus, $Q = I - D^{-1/2}WD^{-1/2}$, where $D \in \mathbb{R}^{(m+u)\times(m+u)}$ is the diagonal matrix with $[D]_{i,i} = \sum_j$
Furthermore, matrices C and C are identical in CM, both diagonal matrices with (i, i)th entry equal to a positive constant µ > 0. Thus C-1 = C-1 and using Prop. 1, we obtain the following bond on the score-stability of the u CM algorithm: CM 2M . j following optimization problem: min hT Lh + m CX (h(xi ) - yi )2 m i=1 hH sub ject to: m+u X i=1 (11) h(xi ) = 0, Lo cal learning regularization (LL - Reg) In the o LL - Reg algorithm (Wu & Sch¨lkopf, 2007), the regularization matrix Q is (I - A)T (I - A), where I R(m+u)×(m+u) is an identity matrix and A R(m+u)×(m+u) is a non-negative weight matrix that captures the local similarity between all pairs of points in X . A is normalized, i.e. each of its rows sum to 1. Let Cl , Cu > 0 be two positive constants. The matrix C is a diagonal matrix with [C]i,i = Cl if xi S and Cu otherwise. Let Cmax = max{Cl , Cu } C and Cmin = min{. l , Cu }. Thus, C-1 - C-1 2 = 1 1 2 Cmin - Cmax By the Perron-Frobenius theorem, its eigenvalues lie in the interval (-1, 1] and M (A) 1. Thus, m (Q) 0 and M (Q) 4 and we have the following bound on the score stability of 1 he LL - Reg algorithm: LL-Reg 2M + t mM 1 4 mM Cmin - Cmax 2M + 4Cmin . where L R(m+u)×(m+u) is a smoothness matrix, e.g., the graph Laplacian, {yi | i [1, m]} are the target values of the m labeled nodes. The hypothesis set H in this case can be thought of as a hyperplane in Rm+u that is orthogonal to the vector 1 Rm+u . Maintaining the notation used in (Belkin et al., 2004a), we let PH denote the operator corresponding to the orthogonal pro jection on H . For a sample S drawn without replacement from X , define IS R(m+u)×(m+u) to be the diagonal matrix with [IS ]i,i = 1 if xi S and 0 otherwise. Similarly, let yS R(m+u)×1 be the column vector with [yS ]i,1 = yi if xi S and 0 otherwise. The closed-form solution on a training sample S is given by (Belkin et al., 2004a): ""-1 "m " yS . L + IS hS = PH C (12) Theorem 5. 
Gaussian Mean Fields (GMF). The GMF algorithm (Zhu et al., 2003) is very similar to LL-Reg and admits exactly the same stability coefficient.

Thus, the stability coefficients of the CM, LL-Reg, and GMF algorithms can be large. Without additional constraints on the matrix $Q$, these algorithms do not seem to be stable enough for the generalization bound of Theorem 3 to converge. A particular example of a constraint is the condition $\sum_{i=1}^{m+u} h(x_i) = 0$ used by Belkin et al.'s algorithm (2004a). In the next section, we give a generalization bound for this algorithm and then describe a general method for making the algorithms just examined stable.

5.2. Stability of Constrained Regularization Algorithms

This subsection analyzes constrained regularization algorithms such as the Laplacian-based graph regularization algorithm of Belkin et al. (2004a). Given a weighted graph $G = (X, E)$ in which edge weights represent the extent of similarity between vertices, the task consists of predicting the vertex labels. The hypothesis $h$ returned by the algorithm is the solution of the optimization problem (11).

Theorem 5. Assume that the vertex labels of the graph $G = (X, E)$ and the hypothesis $h$ obtained by optimizing Eqn. (11) are both bounded ($\forall x$, $|h(x)| \leq M$ and $|y(x)| \leq M$ for some $M > 0$). Let $A = 1 + \sqrt{C}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$R(h) \leq \widehat{R}(h) + \beta + \Big( 2\beta + \frac{(AM)^2(m+u)}{mu} \Big)\sqrt{\frac{\alpha(m, u)\ln\frac{1}{\delta}}{2}},$$
with $\alpha(m, u) = \frac{mu}{m+u-1/2} \cdot \frac{1}{1 - 1/(2\max\{m,u\})}$ and
$$\beta \leq \frac{4\sqrt{2}M^2}{m\lambda_2/C - 1} + \frac{4\sqrt{2m}\,M^2}{(m\lambda_2/C - 1)^2},$$
where $\lambda_2$ is the second smallest eigenvalue of the Laplacian.

Proof. The proof is similar to that of (Belkin et al., 2004a) but uses our general transductive regression bound instead.

The generalization bound we just presented differs in several respects from that of Belkin et al. (2004a). Our bound explicitly depends on both $m$ and $u$, while theirs shows only a dependency on $m$.
Also, our bound does not depend on the number of times a point is sampled in the training set (the parameter $t$), thanks to our analysis based on sampling without replacement. Contrasting the stability coefficient of Belkin et al.'s algorithm with the stability coefficient of LTR (Theorem 4), we note that it does not depend on $C'$ and $\beta_{loc}$. This is because unlabeled points do not enter the objective function, and thus $C' = 0$ and $\tilde{y}(x) = 0$ for all $x \in X$. However, the stability does depend on the second smallest eigenvalue $\lambda_2$, and the bound diverges as $\lambda_2$ approaches $C/m$. In all our regression experiments, we observed that this algorithm does not perform as well in comparison with LTR.

5.3. Making Seemingly Unstable Algorithms Stable

In Sec. 5.2, we saw that imposing additional constraints on the hypothesis, e.g., $\mathbf{h} \cdot \mathbf{1} = 0$, allowed one to derive non-trivial stability bounds. This idea can be generalized, and similar non-trivial stability bounds can be derived for "stable" versions of the algorithms presented in Sec. 5.1: CM, LL-Reg, and GMF. Recall that the stability bound of Prop. 1 is inversely proportional to the smallest eigenvalue $\lambda_m(Q)$. The main difficulty in applying the proposition to these algorithms is that $\lambda_m(Q) = 0$ in each case. Let $\mathbf{v}_m$ denote the eigenvector corresponding to $\lambda_m(Q)$ and let $\lambda_2$ be the second smallest eigenvalue of $Q$. One can modify (8) and constrain the solution to be orthogonal to $\mathbf{v}_m$ by imposing $\mathbf{h} \cdot \mathbf{v}_m = 0$. In the case of (Belkin et al., 2004a), $\mathbf{v}_m = \mathbf{1}$. This modification, motivated by the algorithm of (Belkin et al., 2004a), is equivalent to increasing the smallest eigenvalue to $\lambda_2$. As an example, by imposing the additional constraint, we can show that the stability coefficient of CM becomes bounded by $O(C/\lambda_2)$, instead of $\Omega(1)$. Thus, if $C = O(1/m)$ and $\lambda_2 = \Omega(1)$, it is bounded by $O(1/m)$ and the generalization bound converges as $O(1/\sqrt{m})$.
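This modification can be illustrated numerically: deflate the eigenvector $\mathbf{v}_m$ of the smallest eigenvalue of $Q$ and solve (8) in the orthogonal subspace. A minimal sketch with a toy unnormalized Laplacian, for which $\mathbf{v}_m$ is the constant vector, as in the case $\mathbf{v}_m = \mathbf{1}$ of Belkin et al. (2004a):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
A = np.abs(rng.normal(size=(n, n)))
W = A + A.T                      # positive symmetric affinities
np.fill_diagonal(W, 0.0)
Q = np.diag(W.sum(axis=1)) - W   # unnormalized graph Laplacian, lambda_m(Q) = 0

lam, V = np.linalg.eigh(Q)
v_m = V[:, 0]                    # eigenvector of the smallest eigenvalue

mu = 1.0
y = rng.normal(size=n)

# Constrained variant of (8): impose h . v_m = 0 by projecting the targets
# onto the subspace orthogonal to v_m. Since Q v_m = 0, Q commutes with the
# projection, the solution stays in that subspace, and the effective
# smallest eigenvalue in Prop. 1 becomes lambda_2 > 0.
P = np.eye(n) - np.outer(v_m, v_m)
h = np.linalg.solve(Q / mu + np.eye(n), P @ y)
```

For a connected graph, $\lambda_2 > 0$, so the resulting score-stability bound scales with $1/\lambda_2$ rather than being a constant of order $2M$.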
The values of the input variables were normalized to have mean zero and variance one. For the Boston Housing data set, the total number of examples was 506. For the Elevators and the Ailerons data sets, a random subset of 2000 examples was used; other random subsets of 2000 examples led to similar results. The Boston Housing experiments were repeated for 50 random partitions, while for the Elevators and the Ailerons data sets the experiments were repeated for 20 random partitions each. Since the target values for the Elevators and the Ailerons data sets were extremely small, they were scaled by a factor of 1000 and 100, respectively, in a pre-processing step.

In our experiments, we estimated the pseudo-target of a point $x' \in T$ as a weighted average of the labeled points $x \in N(x')$ in a neighborhood of $x'$: $\tilde{y}_{x'} = \sum_{x \in N(x')} \omega_x y_x / \sum_{x \in N(x')} \omega_x$. The weights are defined in terms of a similarity measure captured by a kernel $K$: $\omega_x = K(x', x)$. Let $m(r)$ be the number of labeled points in $N(x')$. Then, it is easy to show that $\beta_{loc} \leq 4\omega_{\max}M/(\omega_{\min}m(r))$, where $\omega_{\max} = \max_{x \in N(x')}\omega_x$ and $\omega_{\min} = \min_{x \in N(x')}\omega_x$. Thus, for a Gaussian kernel with parameter $\sigma$, $\beta_{loc} \leq 4M/(m(r)e^{-2r^2/\sigma^2})$. To estimate $\beta_{loc}$, one needs an estimate of $m(r)$, the number of samples in a ball of radius $r$ around an unlabeled point $x'$. In our experiments, we estimated $m(r)$ as the number of samples in a ball of radius $r$ around the origin. Since all features are normalized to mean zero and variance one, the origin is also the centroid of the set $X$.

We implemented a dual solution of LTR and used Gaussian kernels, whose parameter $\sigma$ was selected by cross-validation on the training set. Experiments were repeated across 36 different pairs of values $(C, C')$. For each pair, we varied the radius $r$ of the neighborhood used to determine the pseudo-target estimates from zero to the radius of the ball containing all points.
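The local estimator and the $\beta_{loc}(r)$ estimate just described can be sketched as follows (synthetic data; the helper names are ours and the neighborhood is assumed to be non-empty):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50
X_lab = rng.normal(size=(m, 3))
y_lab = rng.uniform(-1.0, 1.0, size=m)   # bounded labels, M = 1
M, sigma, r = 1.0, 1.0, 1.5

def pseudo_target(x, X_lab, y_lab, r, sigma):
    """Weighted average of the labels in the ball N(x) of radius r around x,
    with Gaussian weights omega = K(x, x_i); assumes N(x) is non-empty."""
    dist = np.linalg.norm(X_lab - x, axis=1)
    nbr = dist <= r
    omega = np.exp(-dist[nbr] ** 2 / sigma ** 2)
    return float(np.sum(omega * y_lab[nbr]) / np.sum(omega))

def beta_loc_bound(m_r, M, r, sigma):
    """Estimate used for model selection:
    beta_loc <= 4 M / (m(r) * exp(-2 r^2 / sigma^2)),
    where m(r) is the number of labeled points in the ball of radius r."""
    return 4.0 * M / (m_r * np.exp(-2.0 * r ** 2 / sigma ** 2))

y_tilde = pseudo_target(X_lab[0], X_lab, y_lab, r, sigma)
```

Note the trade-off that drives the model selection below: increasing $r$ grows $m(r)$ (more averaging, smaller training error) but also grows the exponential factor in the $\beta_{loc}$ estimate.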
Figure 1(a) shows the mean values of the test MSE of our experiments on the Boston Housing data set for typical values of $C$ and $C'$. Figures 1(b)-(c) show similar results for the Ailerons and Elevators data sets. For the sake of comparison, we also report results for induction. The relative standard deviations on the MSE are not indicated, but were typically of the order of 10%. LTR generally achieves a significant improvement over induction.

6. Experiments

6.1. Model Selection Based on the Bound

This section reports the results of experiments using our stability-based generalization bound for model selection for the LTR algorithm. A crucial parameter of this algorithm is the stability coefficient $\beta_{loc}(r)$ of the local algorithm, which computes pseudo-targets $\tilde{y}_x$ based on a ball of radius $r$ around each point. We derive an expression for $\beta_{loc}(r)$ and show, using extensive experiments with multiple data sets, that the value $r^*$ minimizing the bound is a remarkably good estimate of the best $r$ for the test error. This demonstrates the benefit of our generalization bound for model selection, avoiding the need for a held-out validation set.

The experiments were carried out on several publicly available regression data sets: Boston Housing, Elevators, and Ailerons². For each of these data sets, we used $m = u$, inspired by the observation that, all other parameters being fixed, the bound of Theorem 3 is tightest when $m = u$.

²www.liaad.up.pt/~ltorgo/Regression/DataSets.html

The generalization bound of Theorem 3 consists of the training error and a complexity term that depends on the parameters of the LTR algorithm $(C, C', M, m, u, \kappa, \sigma, \beta_{loc})$. Only two terms depend

Figure 1.
MSE against the radius $r$ of LTR for three data sets: (a) Boston Housing, (b) Ailerons, (c) Elevators. The small horizontal bar indicates the location (mean ± one standard deviation) of the minimum of the empirically determined $r$.

upon the choice of the radius $r$: $\widehat{R}(h)$ and $\beta_{loc}$. Thus, keeping all other parameters fixed, the theoretically optimal radius $r^*$ is the one that minimizes the training error plus the slack term. The figures also include plots of the training error combined with the complexity term, appropriately scaled. The empirical minimizer of the radius $r$ coincides with or is close to $r^*$. The optimal $r$ based on the test MSE is indicated with error bars.

6.2. Stable Versions of Unstable Algorithms

We refer to the stable version (Sec. 5.3) of the CM algorithm as CM-STABLE. We compared CM and CM-STABLE empirically on the same data sets, again using $m = u$. For the normalized Laplacian we used $k$-nearest-neighbor graphs based on Euclidean distance. The parameters $k$ and $C$ were chosen by five-fold cross-validation over the training set. The experiment was repeated 20 times with random partitions. The averaged mean squared errors, with standard deviations, are reported in the table below.

Dataset     CM                 CM-STABLE
Elevators   0.3228 ± 0.0264    0.3293 ± 0.0286
Ailerons    0.1149 ± 0.0081    0.1184 ± 0.0087
Housing     57.93 ± 6.5        57.92 ± 6.5

References

Belkin, M., Matveeva, I., & Niyogi, P. (2004a). Regularization and semi-supervised learning on large graphs. COLT (pp. 624-638).
Belkin, M., Niyogi, P., & Sindhwani, V. (2004b). Manifold regularization (Technical Report TR-2004-06). University of Chicago.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. JMLR, 2, 499-526.
Chapelle, O., Vapnik, V., & Weston, J. (1999). Transductive inference for estimating values of functions. NIPS 12 (pp. 421-427).
Cortes, C., & Mohri, M. (2007). On transductive regression. NIPS 19 (pp. 305-312).
El-Yaniv, R., & Pechyony, D. (2006). Stable transductive learning. COLT (pp.
35-49).
McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics (pp. 148-188). Cambridge: Cambridge University Press.
Schuurmans, D., & Southey, F. (2002). Metric-based methods for adaptive model selection and regularization. Machine Learning, 48, 51-84.
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Berlin: Springer.
Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley-Interscience.
Wu, M., & Schölkopf, B. (2007). Transductive classification via local learning regularization. AISTATS (pp. 628-635).
Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. NIPS 16 (pp. 595-602).
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML (pp. 912-919).

We conclude from this experiment that CM and CM-STABLE have the same performance. However, as we showed previously, CM-STABLE has a non-trivial risk bound and thus comes with a guarantee.

7. Conclusion

We presented a comprehensive analysis of the stability of transductive regression algorithms with novel generalization bounds for a number of algorithms. Since they are algorithm-dependent, our bounds are often tighter than those based on complexity measures such as the VC-dimension. Our experiments also show the effectiveness of our bounds for model selection and the good performance of LTR algorithms.