An RKHS for Multi-View Learning and Manifold Co-Regularization

Vikas Sindhwani (vsindhw@us.ibm.com), Mathematical Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 USA
David S. Rosenberg (drosen@stat.berkeley.edu), Department of Statistics, University of California, Berkeley, CA 94720 USA

Abstract

Inspired by co-training, many multi-view semi-supervised kernel methods implement the following idea: find a function in each of multiple Reproducing Kernel Hilbert Spaces (RKHSs) such that (a) the chosen functions make similar predictions on unlabeled examples, and (b) the average prediction given by the chosen functions performs well on labeled examples. In this paper, we construct a single RKHS with a data-dependent "co-regularization" norm that reduces these approaches to standard supervised learning. The reproducing kernel for this RKHS can be explicitly derived and plugged into any kernel method, greatly extending the theoretical and algorithmic scope of co-regularization. In particular, with this development, the Rademacher complexity bound for co-regularization given in (Rosenberg & Bartlett, 2007) follows easily from well-known results. Furthermore, more refined bounds given by localized Rademacher complexity can also be easily applied. We propose a co-regularization-based algorithmic alternative to manifold regularization (Belkin et al., 2006; Sindhwani et al., 2005a) that leads to major empirical improvements on semi-supervised tasks. Unlike the recently proposed transductive approach of (Yu et al., 2008), our RKHS formulation is truly semi-supervised and naturally extends to unseen test data.

1. Introduction

In semi-supervised learning, we are given a few labeled examples together with a large collection of unlabeled data from which to estimate an unknown target function. Suppose we have two hypothesis spaces, H^1 and H^2, each of which contains a predictor that well-approximates the target function. We know that predictors that agree with the target function also agree with each other on unlabeled examples. Thus, any predictor in one hypothesis space that does not have an "agreeing predictor" in the other can be safely eliminated from consideration. Due to the resulting reduction in the complexity of the joint learning problem, one can expect improved generalization performance.

These conceptual intuitions and their algorithmic instantiations together constitute a major line of work in semi-supervised learning. One of the earliest approaches in this area was "co-training" (Blum & Mitchell, 1998), in which H^1 and H^2 are defined over different representations, or "views", of the data, and trained alternately to maximize mutual agreement on unlabeled examples. More recently, several papers have formulated these intuitions as joint complexity regularization, or co-regularization, between H^1 and H^2, which are taken to be Reproducing Kernel Hilbert Spaces (RKHSs) of functions defined on the input space X. Given a few labeled examples {(x_i, y_i)}_{i in L} and a collection of unlabeled data {x_i}_{i in U}, co-regularization learns a prediction function

f^*(x) = \frac{1}{2}\left[ f^{1*}(x) + f^{2*}(x) \right]    (1)

where f^{1*} in H^1 and f^{2*} in H^2 are obtained by solving the following optimization problem:

(f^{1*}, f^{2*}) = \operatorname*{argmin}_{f^1 \in H^1,\, f^2 \in H^2}\; \gamma_1 \|f^1\|^2_{H^1} + \gamma_2 \|f^2\|^2_{H^2} + \mu \sum_{i \in U} \left[ f^1(x_i) - f^2(x_i) \right]^2 + \sum_{i \in L} V\!\left(y_i, f(x_i)\right)    (2)

In this objective function, the first two terms measure complexity by the RKHS norms \|\cdot\|^2_{H^1} and \|\cdot\|^2_{H^2} in H^1 and H^2 respectively, the third term enforces agreement among the predictors on unlabeled examples, and the final term evaluates the empirical loss of the mean function f = (f^1 + f^2)/2 on the labeled data with respect to a loss function V(·,·). The real-valued parameters γ_1, γ_2, and μ allow different tradeoffs between the regularization terms. L and U are index sets over labeled and unlabeled examples respectively.
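To make the objective in Eqn. 2 concrete, the following minimal NumPy sketch evaluates it for candidate functions f^1, f^2 represented by finite kernel expansions. Everything here — the data, the two kernels, the coefficient vectors, and the parameter values — is a synthetic assumption made for illustration; it is not code from the paper.

```python
import numpy as np

# Synthetic illustration of the co-regularization objective in Eqn. 2.
rng = np.random.default_rng(0)
n_lab, n_unlab = 5, 20
X = rng.normal(size=(n_lab + n_unlab, 2))           # inputs seen by both "views"
y = rng.choice([-1.0, 1.0], size=n_lab)             # labels for the labeled subset
L, U = np.arange(n_lab), np.arange(n_lab, n_lab + n_unlab)

def k1(A, B):  # view-1 kernel: Gaussian
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / 2.0)

def k2(A, B):  # view-2 kernel: linear
    return A @ B.T

# Candidate functions represented by expansions over all points:
# f1(x) = sum_j a[j] k1(x_j, x),  f2(x) = sum_j b[j] k2(x_j, x).
a = rng.normal(scale=0.1, size=n_lab + n_unlab)
b = rng.normal(scale=0.1, size=n_lab + n_unlab)
K1, K2 = k1(X, X), k2(X, X)
f1, f2 = K1 @ a, K2 @ b                             # function values on all points

gamma1, gamma2, mu = 1.0, 1.0, 0.5
squared_loss = lambda y_true, y_pred: np.sum((y_true - y_pred) ** 2)

objective = (gamma1 * a @ K1 @ a                    # gamma1 ||f1||^2_{H1}
             + gamma2 * b @ K2 @ b                  # gamma2 ||f2||^2_{H2}
             + mu * np.sum((f1[U] - f2[U]) ** 2)    # disagreement on unlabeled points
             + squared_loss(y, 0.5 * (f1[L] + f2[L])))  # loss of the mean predictor
print(objective)
```

The expansions range over all labeled and unlabeled points, which is the natural finite parameterization for this objective; any other parameterization of f^1, f^2 could be substituted in the same way.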
Several variants of this formulation have been proposed independently and explored in different contexts: linear logistic regression (Krishnapuram et al., 2005), regularized least squares classification (Sindhwani et al., 2005b), regression (Brefeld et al., 2006), support vector classification (Farquhar et al., 2005), Bayesian co-training (Yu et al., 2008), and generalization theory (Rosenberg & Bartlett, 2007).

The main theoretical contribution of this paper is the construction of a new "co-regularization RKHS," in which standard supervised learning recovers the solution to the co-regularization problem of Eqn. 2. Theorem 2.2 presents the RKHS and gives an explicit formula for its reproducing kernel. This "co-regularization kernel" can be plugged into any standard kernel method, giving convenient and immediate access to two-view semi-supervised techniques for a wide variety of learning problems. Utilizing this kernel, in Section 3 we give much simpler proofs of the results of (Rosenberg & Bartlett, 2007) concerning bounds on the Rademacher complexity and generalization performance of co-regularization. As a more algorithmic application, in Section 4 we consider the semi-supervised learning setting where examples live near a low-dimensional manifold embedded in a high-dimensional ambient euclidean space. Our approach, manifold co-regularization (CoMR), gives major empirical improvements over the manifold regularization (MR) framework of (Belkin et al., 2006; Sindhwani et al., 2005a).

The recent work of (Yu et al., 2008) considers a similar reduction. However, this reduction is strictly transductive and does not allow prediction on unseen test examples. By contrast, our formulation is truly semi-supervised and provides a principled out-of-sample extension.

2. An RKHS for Co-Regularization

We start by reformulating the co-regularization optimization problem, given in Eqn. 1 and Eqn. 2, in the following equivalent form, where we directly solve for the final prediction function f^*:

f^* = \operatorname*{argmin}_{\substack{f = f^1 + f^2 \\ f^1 \in H^1,\, f^2 \in H^2}}\; \frac{\gamma_1}{2}\|f^1\|^2_{H^1} + \frac{\gamma_2}{2}\|f^2\|^2_{H^2} + \frac{\mu}{2}\sum_{i \in U}\left[ f^1(x_i) - f^2(x_i) \right]^2 + \sum_{i \in L} V\!\left(y_i, \tfrac{1}{2} f(x_i)\right)    (3)

Consider the sum space of functions, H-tilde, given by

\tilde{H} = H^1 \oplus H^2 = \left\{ f \mid f(x) = f^1(x) + f^2(x),\; f^1 \in H^1,\, f^2 \in H^2 \right\}    (4)

and impose on it the data-dependent norm

\|f\|^2_{\tilde{H}} = \min_{\substack{f = f^1 + f^2 \\ f^1 \in H^1,\, f^2 \in H^2}}\; \gamma_1\|f^1\|^2_{H^1} + \gamma_2\|f^2\|^2_{H^2} + \mu\sum_{i \in U}\left[ f^1(x_i) - f^2(x_i) \right]^2    (5)

The minimization problem in Eqn. 3 can then be posed as standard supervised learning in H-tilde as follows,

f^* = \operatorname*{argmin}_{f \in \tilde{H}}\; \lambda \|f\|^2_{\tilde{H}} + \sum_{i \in L} V\!\left(y_i, \tfrac{1}{2} f(x_i)\right)    (6)

where λ = 1/2. Of course, this reformulation is not really useful unless H-tilde itself is a valid new RKHS. Let us recall the definition of an RKHS.

Definition 2.1 (RKHS). A reproducing kernel Hilbert space (RKHS) is a Hilbert space F that possesses a reproducing kernel, i.e., a function k : X × X → R for which the following hold: (a) k(x, ·) ∈ F for all x ∈ X, and (b) ⟨f, k(x, ·)⟩_F = f(x) for all x ∈ X and f ∈ F, where ⟨·, ·⟩_F denotes the inner product in F.
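Eqn. 6 is the crux of the reduction: once the reproducing kernel of H-tilde is available in closed form (Theorem 2.2 below gives it explicitly), co-regularization becomes ordinary supervised learning with a precomputed Gram matrix. A hedged sketch of that workflow follows; `coreg_kernel_matrix` is a hypothetical helper (a concrete version is sketched after Theorem 2.2), and SVC is only one example of a learner that accepts precomputed kernels.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch only: plug the co-regularization kernel into a standard kernel method.
# `coreg_kernel_matrix` is a hypothetical helper returning the Gram matrix of the
# kernel of Theorem 2.2 between two sets of points.
def fit_coreg_svm(X_lab, y_lab, X_test, coreg_kernel_matrix):
    K_train = coreg_kernel_matrix(X_lab, X_lab)    # kernel on labeled x labeled points
    K_test = coreg_kernel_matrix(X_test, X_lab)    # kernel on test x labeled points
    clf = SVC(kernel="precomputed", C=1.0)         # any precomputed-kernel method works
    clf.fit(K_train, y_lab)
    return clf.predict(K_test)
```

Any learner that consumes a Gram matrix directly — SVMs, kernel ridge regression, Gaussian processes — can be swapped in without touching the two-view machinery, which is the practical content of the reduction.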
In Theorem 2.2, we show that H-tilde is indeed an RKHS, and moreover we give an explicit expression for its reproducing kernel. Thus, it follows that although the domain of optimization in Eqn. 6 is nominally a function space, by the Representer Theorem we can express it as a finite-dimensional optimization problem.

2.1. Co-Regularization Kernels

Let H^1, H^2 be RKHSs with kernels given by k^1, k^2 respectively, and let H-tilde = H^1 ⊕ H^2 as defined in Eqn. 4. We have the following result.

Theorem 2.2. There exists an inner product on H-tilde for which H-tilde is an RKHS with norm defined by Eqn. 5 and reproducing kernel k-tilde : X × X → R given by

\tilde{k}(x, z) = s(x, z) - \mu\, d_x^T H\, d_z    (7)

where s(x, z) is the (scaled) sum of kernels,

s(x, z) = \gamma_1^{-1} k^1(x, z) + \gamma_2^{-1} k^2(x, z),

and d_x is a vector-valued function that measures the difference between the two views,

d_x = \gamma_1^{-1} k^1_{Ux} - \gamma_2^{-1} k^2_{Ux},

where k^i_{Ux} = \left[ k^i(x, x_j) \right]_{j \in U}^T denotes the column vector of evaluations of k^i between x and the unlabeled examples, and H is a positive-definite matrix given by H = (I + \mu S)^{-1}. Here, S is the Gram matrix of s(·,·) over the unlabeled examples, i.e.,

S = \gamma_1^{-1} K^1_{UU} + \gamma_2^{-1} K^2_{UU},

where K^i_{UU} = k^i(U, U) denotes the Gram matrix of k^i over the unlabeled examples.

We give a rigorous proof in Appendix A.
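The kernel of Theorem 2.2 is a few lines of dense linear algebra. The sketch below is a direct NumPy transcription of Eqn. 7 under the stated definitions of s, d_x, H, and S; the convention that `k1`/`k2` map two arrays of points to a Gram matrix is an assumption of this sketch rather than anything prescribed by the paper.

```python
import numpy as np

# NumPy transcription of the co-regularization kernel of Theorem 2.2 (Eqn. 7).
# k1, k2: kernel functions taking two arrays of points and returning a Gram matrix.
# X_unlab: the unlabeled examples indexed by U.
def coreg_kernel_matrix(X, Z, X_unlab, k1, k2, gamma1, gamma2, mu):
    # S = gamma1^-1 K1_UU + gamma2^-1 K2_UU  (Gram matrix of s(.,.) on U)
    S = k1(X_unlab, X_unlab) / gamma1 + k2(X_unlab, X_unlab) / gamma2
    H = np.linalg.inv(np.eye(len(X_unlab)) + mu * S)          # H = (I + mu S)^-1

    # s(x, z) = gamma1^-1 k1(x, z) + gamma2^-1 k2(x, z)
    s_XZ = k1(X, Z) / gamma1 + k2(X, Z) / gamma2

    # d_x = gamma1^-1 k1_Ux - gamma2^-1 k2_Ux, stacked as columns for x in X and z in Z
    D_X = k1(X_unlab, X) / gamma1 - k2(X_unlab, X) / gamma2   # |U| x |X|
    D_Z = k1(X_unlab, Z) / gamma1 - k2(X_unlab, Z) / gamma2   # |U| x |Z|

    # k~(x, z) = s(x, z) - mu d_x^T H d_z
    return s_XZ - mu * D_X.T @ H @ D_Z
```

On random inputs one can check numerically that the returned matrix is symmetric when X = Z, and that it reduces to the scaled sum of kernels s(x, z) when μ = 0.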
2.2. Representer Theorem

Theorem 2.2 says that H-tilde is a valid RKHS with kernel k-tilde. By the Representer Theorem, the solution to Eqn. 6 is given by

f^*(x) = \sum_{i \in L} \alpha_i\, \tilde{k}(x_i, x)    (8)

The corresponding components in H^1, H^2 can also be retrieved as

f^{1*}(x) = \sum_{i \in L} \alpha_i\, \gamma_1^{-1}\left[ k^1(x_i, x) - \mu\, d_{x_i}^T H\, k^1_{Ux} \right]    (9)

f^{2*}(x) = \sum_{i \in L} \alpha_i\, \gamma_2^{-1}\left[ k^2(x_i, x) + \mu\, d_{x_i}^T H\, k^2_{Ux} \right]    (10)

Note that H^1 and H^2 are defined on the same domain X, so that taking the mean prediction is meaningful. In a two-view problem one may begin by defining H^1, H^2 on different view spaces X^1, X^2 respectively. Such a problem can be mapped to our framework by extending H^1, H^2 to X = X^1 × X^2, re-defining f^1(x^1, x^2) = f^1(x^1) for f^1 ∈ H^1, and similarly for H^2. While we omit these technical details, it is important to note that in such cases, Eqns. 9 and 10 can be reinterpreted as predictors defined on X^1, X^2 respectively.

3. Bounds on Complexity and Generalization

By eliminating all predictors that do not collectively agree on unlabeled examples, co-regularization intuitively reduces the complexity of the learning problem. It is reasonable then to expect better test performance for the same amount of labeled training data. In (Rosenberg & Bartlett, 2007), the size of the co-regularized function class is measured by its empirical Rademacher complexity, and tight upper and lower bounds are given on the Rademacher complexity of the co-regularized hypothesis space. This leads to generalization bounds in terms of the Rademacher complexity. In this section, we derive these complexity bounds in a few lines using Theorem 2.2 and a well-known result on RKHS balls. Furthermore, we present improved generalization bounds based on the theory of localized Rademacher complexity.

3.1. Rademacher Complexity Bounds

Definition 3.1. The empirical Rademacher complexity of a function class A = {f : X → R} on a sample x_1, ..., x_ℓ ∈ X is defined as

\hat{R}_\ell(A) = \mathbb{E}_\sigma\left[ \sup_{f \in A} \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i f(x_i) \right]

where the expectation is with respect to σ = {σ_1, ..., σ_ℓ}, and the σ_i are i.i.d. Rademacher random variables, that is, P(σ_i = 1) = P(σ_i = -1) = 1/2.

We now cite a well-known result about the Rademacher complexity of a ball in an RKHS (see, e.g., (Boucheron et al., 2005)). Let H_r := {f ∈ H : ||f||_H ≤ r} denote the ball of radius r in H. Then we have the following:

Lemma 3.2. The empirical Rademacher complexity on the sample x_1, ..., x_ℓ ∈ X for the RKHS ball H_r is bounded as follows:

\frac{\sqrt{2}}{4}\,\frac{2r}{\ell}\sqrt{\operatorname{tr} K} \;\le\; \hat{R}_\ell(H_r) \;\le\; \frac{2r}{\ell}\sqrt{\operatorname{tr} K}

where K = \left[ k(x_i, x_j) \right]_{i,j=1}^{\ell} is the kernel matrix.

Let H be an arbitrary RKHS with kernel k(·,·), and denote the standard RKHS supervised learning objective function by Q(f) = \sum_{i \in L} V(y_i, f(x_i)) + \lambda \|f\|^2_H. Let f^* = argmin_{f ∈ H} Q(f). Then Q(f^*) ≤ Q(0) = \sum_{i \in L} V(y_i, 0). It follows that \lambda\|f^*\|^2 ≤ Q(0). Thus, if we have some a priori control on Q(0), then we can restrict the search for f^* to a ball in H of radius r = \sqrt{Q(0)/\lambda}.

For the co-regularization problem described in Eqns. 3 and 6, we have f^* ∈ H̃_r where r^2 = \ell \sup_y V(y, 0)/\lambda, with ℓ the number of labeled examples. We now state and prove bounds on the Rademacher complexity of H̃_r. The bounds here are exactly the same as those given in (Rosenberg & Bartlett, 2007). However, while they take a lengthy "bare-hands" approach, here we get the result as a simple corollary of Theorem 2.2 and Lemma 3.2.

Theorem 3.3. The empirical Rademacher complexity on the labeled sample x_1, ..., x_ℓ ∈ X for the RKHS ball H̃_r is bounded as follows:

\frac{\sqrt{2}}{4}\,\frac{2r}{\ell}\sqrt{\operatorname{tr} \tilde{K}} \;\le\; \hat{R}_\ell(\tilde{H}_r) \;\le\; \frac{2r}{\ell}\sqrt{\operatorname{tr} \tilde{K}}

where

\tilde{K} = \gamma_1^{-1} K^1_{LL} + \gamma_2^{-1} K^2_{LL} - \mu\, D_{UL}^T (I + \mu S)^{-1} D_{UL}
\quad\text{and}\quad
D_{UL} = \gamma_1^{-1} K^1_{UL} - \gamma_2^{-1} K^2_{UL}.

Proof. Note that K̃ is just the kernel matrix for the co-regularization kernel k̃(·,·) on the labeled data. The bound then follows immediately from Lemma 3.2.
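Theorem 3.3 makes the complexity bound directly computable from Gram matrices. The sketch below forms K̃ on a synthetic labeled sample and evaluates the upper bound (2r/ℓ)·sqrt(tr K̃) for μ = 0 and μ > 0; the kernels, the data, and the radius r are illustrative assumptions of the sketch.

```python
import numpy as np

# Evaluate the Rademacher upper bound (2r/l) sqrt(tr K~) of Theorem 3.3 on
# synthetic data, comparing mu > 0 against mu = 0 (no co-regularization).
rng = np.random.default_rng(1)
X_lab, X_unlab = rng.normal(size=(10, 2)), rng.normal(size=(100, 2))
gamma1 = gamma2 = 1.0
r = 1.0  # illustrative radius of the ball in the co-regularization RKHS

gauss = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / 2.0)
lin = lambda A, B: A @ B.T

def K_tilde_LL(mu):
    # K~ = g1^-1 K1_LL + g2^-1 K2_LL - mu D_UL^T (I + mu S)^-1 D_UL  (Theorem 3.3)
    S = gauss(X_unlab, X_unlab) / gamma1 + lin(X_unlab, X_unlab) / gamma2
    D_UL = gauss(X_unlab, X_lab) / gamma1 - lin(X_unlab, X_lab) / gamma2
    K_LL = gauss(X_lab, X_lab) / gamma1 + lin(X_lab, X_lab) / gamma2
    return K_LL - mu * D_UL.T @ np.linalg.solve(np.eye(len(X_unlab)) + mu * S, D_UL)

def rademacher_upper_bound(mu):
    ell = len(X_lab)
    return 2 * r / ell * np.sqrt(np.trace(K_tilde_LL(mu)))

print(rademacher_upper_bound(0.0), rademacher_upper_bound(10.0))
```

Because the correction subtracted from the trace is positive semidefinite, the printed bound is no larger for μ > 0 than for μ = 0, which previews the quantitative statement of Section 3.2 below.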
3.2. Co-Regularization Reduces Complexity

The co-regularization parameter μ controls the extent to which we enforce agreement between f^1 and f^2. Let H̃(μ) denote the co-regularization RKHS for a particular value of μ. From Theorem 3.3, we see that the Rademacher complexity for a ball of radius r in H̃(μ) decreases with μ by an amount determined by

\Delta(\mu) = \mu\, \operatorname{tr}\!\left( D_{UL}^T (I + \mu S)^{-1} D_{UL} \right)    (11)
            = \sum_{i=1}^{\ell} \rho_\mu^2\!\left( k^1_{Ux_i}, k^2_{Ux_i} \right)    (12)

where ρ_μ(·,·) is a metric on R^{|U|} defined by

\rho_\mu^2(s, t) = \mu \left( \gamma_1^{-1} s - \gamma_2^{-1} t \right)^T (I + \mu S)^{-1} \left( \gamma_1^{-1} s - \gamma_2^{-1} t \right).

We see that the complexity reduction, Δ(μ), grows with the ρ_μ-distance between the two different representations of the labeled points. Note that the metric ρ_μ is determined by S, which is the weighted sum of the Gram matrices of the two kernels on the unlabeled data.

3.3. Generalization Bounds

With Theorem 2.2 allowing us to express multi-view co-regularization problems as supervised learning in a data-dependent RKHS, we can now bring a large body of theory to bear on the generalization performance of co-regularization methods. We start by quoting the theorem proved in (Rosenberg & Bartlett, 2007). Next, we state an improved bound based on localized Rademacher complexity. Below, we denote the unit ball in H̃ by H̃_1.

Condition 1. The loss V(·,·) is Lipschitz in its first argument, i.e., there exists a constant A such that for all ŷ_1, ŷ_2, y: |V(ŷ_1, y) - V(ŷ_2, y)| ≤ A |ŷ_1 - ŷ_2|.

Theorem 3.4. Suppose the loss V(·,·) takes values in [0, 1] and satisfies Condition 1. Then, conditioned on the unlabeled data, for any δ ∈ (0, 1), with probability at least 1 - δ over the sample of labeled points (x_1, y_1), ..., (x_ℓ, y_ℓ) drawn i.i.d. from P, we have for any predictor f ∈ H̃_1 that

P\left[ V(f(x), y) \right] \;\le\; \frac{1}{\ell}\sum_{i=1}^{\ell} V(f(x_i), y_i) + 2A\, \hat{R}_\ell(\tilde{H}_1) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}.

We need two more conditions for the localized bound:

Condition 2. For any probability distribution P, there exists f^* ∈ H̃_1 satisfying P[V(f^*(x), y)] = inf_{f ∈ H̃_1} P[V(f(x), y)].

Condition 3. There is a constant B ≥ 1 such that for every probability distribution P and every f ∈ H̃_1 we have P(f - f^*)^2 ≤ B\, P\left( V[f(x), y] - V[f^*(x), y] \right).

In the following theorem, let P̂_ℓ denote the empirical probability measure for the labeled sample of size ℓ.

Theorem 3.5. [Cor. 6.7 from (Bartlett et al., 2002)] Assume that sup_{x ∈ X} k̃(x, x) ≤ 1 and that V satisfies the three conditions above. Let f̂ be any element of H̃_1 satisfying P̂_ℓ V[f̂(x), y] = inf_{f ∈ H̃_1} P̂_ℓ V[f(x), y]. There exists a constant c depending only on A and B such that, with probability at least 1 - 6e^{-η},

P\left[ V(\hat{f}(x), y) \right] - P\left[ V(f^*(x), y) \right] \;\le\; c\left( r^* + \frac{\eta}{\ell} \right),

where

r^* \;\le\; \min_{0 \le h \le \ell} \left( \frac{h}{\ell} + \sqrt{\frac{1}{\ell} \sum_{i > h} \hat{\lambda}_i} \right)

and where λ̂_1, ..., λ̂_ℓ are the eigenvalues of the labeled-data kernel matrix K̃_{LL} in decreasing order.

Note that while Theorem 3.4 bounds the gap between the expected and empirical performance of an arbitrary f ∈ H̃_1, Theorem 3.5 bounds the gap between the empirical loss minimizer over H̃_1 and the true risk minimizer in H̃_1. Since the localized bound only needs to account for the capacity of the function class in the neighborhood of f^*, the bounds are potentially tighter. Indeed, while the bound in Theorem 3.4 is in terms of the trace of the kernel matrix, the bound in Theorem 3.5 involves the tail sum of the kernel eigenvalues. If the eigenvalues decay very quickly, the latter is potentially much smaller.
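The quantities in Sections 3.2 and 3.3 are also easy to compute. The sketch below evaluates the complexity reduction Δ(μ) of Eqns. 11-12 and the eigenvalue tail sums of K̃_LL that drive the localized bound of Theorem 3.5, again on synthetic data with illustrative kernels and parameter values.

```python
import numpy as np

# Complexity reduction Delta(mu) (Eqns. 11-12) and eigenvalue tail sums of the
# labeled-data co-regularization kernel matrix (Theorem 3.5). Data, kernels, and
# parameters are synthetic assumptions made for this sketch.
rng = np.random.default_rng(2)
X_lab, X_unlab = rng.normal(size=(10, 2)), rng.normal(size=(100, 2))
gamma1, gamma2, mu = 1.0, 1.0, 1.0

gauss = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / 2.0)
lin = lambda A, B: A @ B.T

S = gauss(X_unlab, X_unlab) / gamma1 + lin(X_unlab, X_unlab) / gamma2
D_UL = gauss(X_unlab, X_lab) / gamma1 - lin(X_unlab, X_lab) / gamma2
K_LL = gauss(X_lab, X_lab) / gamma1 + lin(X_lab, X_lab) / gamma2

# Delta(mu) = mu * tr( D_UL^T (I + mu S)^-1 D_UL )   (Eqn. 11)
correction = mu * D_UL.T @ np.linalg.solve(np.eye(len(X_unlab)) + mu * S, D_UL)
K_tilde = K_LL - correction
print("Delta(mu) =", np.trace(correction))

# Tail sums of the eigenvalues of K~_LL, the quantity appearing in Theorem 3.5.
eigs = np.sort(np.linalg.eigvalsh(K_tilde))[::-1]      # decreasing order
tail_sums = np.cumsum(eigs[::-1])[::-1]                # tail_sums[h] = sum of eigs[h:], 0-indexed
print("trace =", eigs.sum(), "tail sum after 3 eigenvalues =", tail_sums[3])
```

Comparing the full trace with a tail sum after the first few eigenvalues gives a rough numerical sense of how much the localized bound can gain when the spectrum decays quickly.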
4. Manifold Co-Regularization

Consider the picture shown in Figure 1(a), where there are two classes of data points in the plane (R^2), lying on one of two concentric circles. The large, colored points are labeled while the smaller, black points are unlabeled. The picture immediately suggests two notions of distance that are very natural but radically different. For example, the two labeled points are close in the ambient euclidean distance on R^2, but infinitely far apart in terms of the intrinsic geodesic distance measured along the circles. Suppose that for this picture one had access to two kernel functions, k^1, k^2, that assign high similarity to nearby points according to euclidean and geodesic distance respectively. Because of the difference between the ambient and intrinsic representations, by co-regularizing the associated RKHSs one can hope to get good reductions in complexity, as suggested in Section 3.2. In Figure 1, we report the value of the complexity reduction (Eqn. 12) for four point clouds generated at increasing levels of noise off the two concentric circles. When the noise becomes large, the ambient and intrinsic notions of distance converge and the amount of complexity reduction decreases.

[Figure 1. Complexity reduction (μ = 1, Eqn. 12) with respect to the noise level σ: (a) Δ = 7957, σ = 0; (b) Δ = 7569, σ = 0.1; (c) Δ = 3720, σ = 0.2; (d) Δ = 3502, σ = 0.5. The choice of k^1, k^2 is discussed in the following subsections.]

The setting where data lies on a low-dimensional submanifold M embedded in a higher-dimensional ambient space X, as in the concentric circles case above, has attracted considerable research interest recently, almost orthogonally to multi-view efforts. The main assumption underlying manifold-motivated semi-supervised learning algorithms is the following: two points that are close with respect to geodesic distances on M should have similar labels. Such an assumption may be enforced by an intrinsic regularizer that emphasizes complexity along the manifold. Since M is truly unknown, the intrinsic regularizer is empirically estimated from the point cloud of labeled and unlabeled data. In the graph transduction approach, an nn-nearest-neighbor graph G is constructed which serves as an empirical substitute for M. The vertices V of this graph are the set of labeled and unlabeled examples. Let H_I denote the space of all functions mapping V to R, where the subscript I implies "intrinsic". Any function f ∈ H_I can be identified with the vector f = [f(x_i), x_i ∈ V]^T. One can impose on H_I the norm

\|f\|^2_I = \sum_{i,j} W_{ij}\left[ f(x_i) - f(x_j) \right]^2,

which provides a natural measure of smoothness for f with respect to the graph. Here, W denotes the adjacency matrix of the graph. When X is a euclidean space, a typical W is given by W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2) if i and j are nearest neighbors, and 0 otherwise. In practice, one may use a problem-dependent similarity matrix to set these edge weights. This norm can be conveniently written as a quadratic form f^T M f, where M is the graph Laplacian matrix defined as M = D - W, and D is the diagonal degree matrix with entries D_{ii} = \sum_j W_{ij}.

It turns out that H_I with the norm \|\cdot\|_I is an RKHS whose reproducing kernel k_I : V × V → R is given by k_I(x_i, x_j) = M^\dagger_{ij}, where M^\dagger denotes the pseudo-inverse of the Laplacian. Given H_I with its reproducing kernel, graph transduction solves the standard RKHS regularization problem

f^* = \operatorname*{argmin}_{f \in H_I}\; \|f\|^2_I + \sum_{i \in L} V(y_i, f(x_i)),

where y_i is the label associated with the node x_i. Note that the solution f^* is only defined over V, the set of labeled and unlabeled examples. Since graph transduction does not provide a function whose domain is the ambient space X, it is not clear how to make predictions on unseen test points x ∈ X. Possessing a principled "out-of-sample extension" distinguishes semi-supervised methods from transductive procedures.

4.1. Ambient and Intrinsic Co-Regularization

We propose a co-regularization solution for out-of-sample prediction. Conceptually, one may interpret the manifold setting as a multi-view problem where each labeled or unlabeled example appears in two "views": (a) an ambient view in X in terms of its euclidean co-ordinates x, and (b) an intrinsic view in G as a vertex index i. Let H_A be an RKHS defined over the ambient domain X with an associated kernel k_A : X × X → R. We can now combine ambient and intrinsic views by co-regularizing H_A, H_I. This is done by setting k^1 = k_A and k^2 = k_I in Eqn. 7 and solving Eqn. 6. The combined prediction function given by Eqn. 8 is the mean of an ambient component f^{1*} given by Eqn. 9 and an intrinsic component f^{2*} given by Eqn. 10. Even though f^{2*} is transductive and only defined on the labeled and unlabeled examples, the ambient component f^{1*} can be used for out-of-sample prediction. Due to co-regularization, this ambient component is (a) smooth in H_A, and (b) in agreement with a smooth function on the data manifold. We call this approach manifold co-regularization, and abbreviate it as CoMR.
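A compact sketch of the CoMR construction just described: build an nn-nearest-neighbor graph on the labeled and unlabeled points, form the Laplacian M and the intrinsic kernel k_I = M^+, and plug k^1 = k_A, k^2 = k_I into the co-regularization kernel of Eqn. 7, here evaluated on the labeled points. All data, neighborhood sizes, and parameter values are synthetic; this is an illustration of the construction, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))               # labeled + unlabeled points (vertices of G)
n_lab = 5
L, U = np.arange(n_lab), np.arange(n_lab, len(X))
gamma1, gamma2, mu, sigma, nn = 1.0, 1.0, 1.0, 1.0, 6

d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K_amb = np.exp(-d2 / (2 * sigma**2))        # ambient kernel k1 = k_A on all points

# nn-nearest-neighbor graph with Gaussian edge weights, symmetrized
W = np.zeros_like(d2)
for i in range(len(X)):
    nbrs = np.argsort(d2[i])[1:nn + 1]
    W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma**2))
W = np.maximum(W, W.T)
M = np.diag(W.sum(1)) - W                   # graph Laplacian M = D - W
K_int = np.linalg.pinv(M)                   # intrinsic kernel k2 = k_I = M^+ on the vertices

# Co-regularization kernel (Eqn. 7) with k1 = K_amb, k2 = K_int, evaluated on L x L.
S = K_amb[np.ix_(U, U)] / gamma1 + K_int[np.ix_(U, U)] / gamma2
H = np.linalg.inv(np.eye(len(U)) + mu * S)
D_UL = K_amb[np.ix_(U, L)] / gamma1 - K_int[np.ix_(U, L)] / gamma2
K_comr_LL = (K_amb[np.ix_(L, L)] / gamma1 + K_int[np.ix_(L, L)] / gamma2
             - mu * D_UL.T @ H @ D_UL)
print(K_comr_LL.shape)
```

For out-of-sample points, only the ambient component (Eqn. 9) is used, since the intrinsic kernel is defined only on the graph vertices.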
4.2. Manifold Regularization

In the manifold regularization (MR) approach of (Belkin et al., 2006; Sindhwani et al., 2005a), the following optimization problem is solved:

f^* = \operatorname*{argmin}_{f \in H_A}\; \gamma_1 \|f\|^2_{H_A} + \gamma_2\, f^T M f + \sum_{i \in L} V(y_i, f(x_i))    (13)

where f = [f(x_i), i ∈ L ∪ U]^T. The solution f^* is defined on X, and can therefore be used for out-of-sample prediction.

In Figure 2, we show a simple two-dimensional dataset where MR provably fails when H_A is the space of linear functions on R^2. The LINES dataset consists of two classes spread along perpendicular lines. In MR, the intrinsic regularizer is enforced directly on H_A. It can easily be shown that the intrinsic norm of a linear function f(x) = w^T x along the perpendicular lines is exactly the same as its ambient norm, i.e., \|f\|^2_I = \|f\|^2_{H_A} = w^T w. Due to this, MR simply ignores the unlabeled data and reduces to supervised training with regularization parameter γ_1 + γ_2. The linear function that gives maximally smooth predictions on one line also gives maximally non-smooth predictions on the other line.

One way to remedy this restrictive situation is to introduce slack variables ξ = (ξ_i)_{i ∈ L ∪ U} in Eqn. 13 with an ℓ_2 penalty, and instead solve

f^* = \operatorname*{argmin}_{f \in H_A,\, \xi}\; \gamma_1 \|f\|^2_{H_A} + \gamma_2 (f + \xi)^T M (f + \xi) + \mu \|\xi\|^2 + \sum_{i \in L} V(y_i, f(x_i)).

Re-parameterizing g = f + ξ, we can re-write the above problem as

f^* = \operatorname*{argmin}_{f \in H_A,\, g \in H_I}\; \gamma_1 \|f\|^2_{H_A} + \gamma_2 \|g\|^2_{H_I} + \mu \|f - g\|^2 + \sum_{i \in L} V(y_i, f(x_i)),

which may be viewed as a variant of the co-regularization problem in Eqn. 2 in which the empirical loss is measured for f alone. This motivates the view that CoMR adds extra slack variables to the MR objective function in order to better fit the intrinsic regularizer. Figure 2 shows that CoMR achieves better separation between the classes on the LINES dataset.

[Figure 2. Decision boundaries of (a) MR and (b) CoMR (using the quadratic hinge loss) on the LINES dataset.]

4.3. Experiments

In this section, we compare MR and CoMR. Similar to our construction of the co-regularization kernel, (Sindhwani et al., 2005a) provide a data-dependent kernel that reduces manifold regularization to standard supervised learning in an associated RKHS. We write the manifold regularization kernel in the following form,

\tilde{k}_{mr}(x, z) = \bar{s}(x, z) - \bar{d}_x^T \bar{H}\, \bar{d}_z    (14)

where \bar{s}(x, z) = \gamma_1^{-1} \bar{k}^1(x, z), \bar{d}_x = \gamma_1^{-1} \bar{k}^1_x with \bar{k}^1_x denoting the column vector of evaluations of \bar{k}^1 between x and the labeled and unlabeled examples, and \bar{H} = \left( \gamma_1^{-1} \bar{K}^1 + \gamma_2^{-1} \bar{K}^2 \right)^{-1}, where \bar{K}^1 is the Gram matrix of \bar{k}^1 = k_A over the labeled and unlabeled examples, and \bar{K}^2 = M^\dagger. We use the notation \bar{s}, \bar{d}, \bar{H} so that this kernel can easily be compared with the corresponding quantities in the co-regularization kernel of Eqn. 7. We empirically compare this kernel with the co-regularization kernel of Eqn. 7 for exactly the same choice of k^1, k^2.
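For comparison, the MR kernel of Eqn. 14 can be sketched in the same style. The helper below assumes, as in Eqn. 14, that the ambient kernel k_A and the graph Laplacian M are both available over the labeled and unlabeled points; the function signature and names are this sketch's own conventions, not the paper's.

```python
import numpy as np

# Sketch of the manifold regularization kernel of Eqn. 14, written so that
# s_bar, d_bar and H_bar mirror the corresponding quantities of Eqn. 7.
def mr_kernel_matrix(k_A, X, Z, X_data, M, gamma1, gamma2):
    # X_data: labeled + unlabeled points over which the graph Laplacian M is built
    K_bar1 = k_A(X_data, X_data)                                 # Gram matrix of k_A on the data
    K_bar2 = np.linalg.pinv(M)                                   # intrinsic Gram matrix M^+
    H_bar = np.linalg.inv(K_bar1 / gamma1 + K_bar2 / gamma2)     # (g1^-1 Kbar1 + g2^-1 Kbar2)^-1
    s_bar = k_A(X, Z) / gamma1                                   # s_bar(x, z) = g1^-1 k_A(x, z)
    D_X = k_A(X_data, X) / gamma1                                # columns are the vectors d_bar_x
    D_Z = k_A(X_data, Z) / gamma1
    return s_bar - D_X.T @ H_bar @ D_Z                           # Eqn. 14
```

Compared with the CoMR kernel, s-bar and d-bar involve only the ambient kernel, and the deformation runs over all labeled and unlabeled points with no μ parameter; this is what makes the two kernels directly comparable in the experiments below.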
Semi-supervised classification experiments were performed on five datasets, described in Table 1.

Table 1. Datasets with d features and c classes. 10 random data splits were created with l labeled, u unlabeled, t test, and v validation examples.

Dataset   d      c    l    u      t     v
LINES     2      2    2    500    250   250
G50C      50     2    50   338    112   50
USPST     256    10   50   1430   477   50
COIL20    241    20   40   1320   40    40
PCMAC     7511   2    50   1385   461   50

The LINES dataset is a variant of the two-dimensional problem shown in Figure 2, where we added random noise around the two perpendicular lines. The G50C, USPST, COIL20, and PCMAC datasets are well known and have frequently been used for empirical studies in the semi-supervised learning literature. They were used for benchmarking manifold regularization in (Sindhwani et al., 2005a) against a number of competing methods. G50C is an artificial dataset generated from two unit-covariance normal distributions with equal probabilities; the class means are adjusted so that the Bayes error is 5%. COIL20 consists of 32 × 32 gray-scale images of 20 objects viewed from varying angles. USPST is taken from the test subset of the USPS dataset of images of handwritten digits from 10 classes. PCMAC is used to set up a binary text categorization problem drawn from the 20-newsgroups dataset.

For each of the five datasets, we constructed random splits into labeled, unlabeled, test, and validation sets. The sizes of these sets are given in Table 1. For all datasets except LINES, we used the Gaussian ambient kernel k^1(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2) and an intrinsic graph kernel whose Gram matrix is of the form K^2 = (M^p + 10^{-6} I)^{-1}. Here, M is the normalized graph Laplacian constructed using nn nearest neighbors, and p is an integer. These parameters are tabulated in Table 4 for reproducibility. For more details on these parameters, see (Sindhwani et al., 2005a). We chose the squared loss for V(·,·). Manifold regularization with this choice is also referred to as Laplacian RLS, and it empirically performs as well as Laplacian SVMs. For multi-class problems, we used the one-versus-rest strategy. γ_1, γ_2 were varied on a grid of values, 10^{-6}, 10^{-4}, 10^{-2}, 1, 10, 100, and chosen with respect to validation error. The chosen parameters are also reported in Table 4.

Table 2. Error rates (in percentage) on test data.

Dataset   MR            CoMR
LINES     7.7  (1.2)    1.0  (1.5)
G50C      5.8  (2.8)    5.5  (2.3)
USPST     18.2 (1.5)    14.1 (1.6)
COIL20    23.8 (11.1)   14.8 (8.8)
PCMAC     11.9 (3.4)    8.9  (2.6)

Table 3. Error rates (in percentage) on unlabeled data.

Dataset   MR            CoMR
LINES     7.5  (1.0)    1.3  (2.0)
G50C      6.6  (0.8)    6.9  (1.0)
USPST     18.6 (1.4)    13.3 (1.0)
COIL20    37.5 (6.0)    14.8 (3.3)
PCMAC     11.0 (2.4)    9.4  (1.9)

Table 4. Parameters used. Note μ = 1 for CoMR. A linear kernel was used for the LINES dataset.

Dataset   nn   σ      p   MR (γ_1, γ_2)     CoMR (γ_1, γ_2)
LINES     10   -      1   0.01, 10^-6       10^-4, 100
G50C      50   17.5   5   1, 100            10, 10
USPST     10   9.4    2   0.01, 0.01        10^-6, 10^-4
COIL20    2    0.6    1   10^-4, 10^-6      10^-6, 10^-6
PCMAC     50   2.7    5   10, 100           1, 10
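The training recipe described above (squared loss, deformed kernel, one-versus-rest) reduces to ordinary kernel regularized least squares on a precomputed Gram matrix. The sketch below is a hedged outline of that recipe: the ridge parameter `lam` stands in for the weight on the RKHS norm and is treated here as a tuning constant rather than as the paper's exact setting, and the kernel matrices are assumed to come from Eqn. 7 (CoMR) or Eqn. 14 (MR).

```python
import numpy as np

# Generic kernel RLS with a precomputed Gram matrix plus a one-versus-rest
# wrapper, as a sketch of the experimental recipe. K_train is the kernel on
# labeled x labeled points; K_eval is the kernel between evaluation points
# (unlabeled or test) and labeled points.
lam = 1.0  # illustrative ridge weight

def rls_fit(K_train, targets):
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), targets)   # expansion coefficients

def rls_decision(K_eval, alpha):
    return K_eval @ alpha

def one_vs_rest_predict(K_train, K_eval, labels):
    classes = np.unique(labels)
    scores = np.column_stack([
        rls_decision(K_eval, rls_fit(K_train, np.where(labels == c, 1.0, -1.0)))
        for c in classes
    ])
    return classes[np.argmax(scores, axis=1)]
```

For MR this predictor is evaluated directly; for CoMR, the same expansion coefficients are plugged into Eqn. 9 to obtain the ambient component that is evaluated on unseen test points.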
Finally, we evaluated the MR solution and the ambient component of CoMR on an unseen test set. In Tables 2 and 3 we report the mean and standard deviation of the error rates on test and unlabeled examples with respect to 10 random splits. We performed a paired t-test at the 5% significance level to assess the statistical significance of the results; results shown in bold are statistically significant. Our experimental protocol makes MR and CoMR exactly comparable. We find that CoMR gives major empirical improvements over MR on all datasets except G50C, where both methods approach the Bayes error rate.

5. Conclusion

In this paper, we have constructed a single, new RKHS in which standard supervised algorithms are immediately turned into multi-view semi-supervised learners. This construction brings about significant theoretical simplifications and algorithmic benefits, which we have demonstrated in the context of generalization analysis and manifold regularization respectively.

References

Bartlett, P. L., Bousquet, O., & Mendelson, S. (2002). Localized Rademacher complexities. COLT 2002 (pp. 44-58). Springer-Verlag.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7, 2399-2434.

Berlinet, A., & Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers.

Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. COLT.

Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: P&S, 9, 323-375.

Brefeld, U., Gärtner, T., Scheffer, T., & Wrobel, S. (2006). Efficient co-regularised least squares regression. ICML (pp. 137-144).

Farquhar, J. D. R., Hardoon, D. R., Meng, H., Shawe-Taylor, J., & Szedmak, S. (2005). Two view learning: SVM-2K, theory and practice. NIPS.

Krishnapuram, B., Williams, D., Xue, Y., Hartemink, A., & Carin, L. (2005). On semi-supervised classification. NIPS.

Rosenberg, D., & Bartlett, P. L. (2007). The Rademacher complexity of co-regularized kernel classes. AISTATS.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005a). Beyond the point cloud: From transductive to semi-supervised learning. ICML.

Sindhwani, V., Niyogi, P., & Belkin, M. (2005b). A co-regularization approach to semi-supervised learning with multiple views. ICML Workshop on Learning with Multiple Views (pp. 824-831).

Yu, S., Krishnapuram, B., Rosales, R., Steck, H., & Rao, R. B. (2008). Bayesian co-training. NIPS.

A. Proof of Theorem 2.2

This theorem generalizes Theorem 5 in (Berlinet & Thomas-Agnan, 2004).

The Product Hilbert Space. We begin by introducing the product space

F = H^1 × H^2 = \{ (f^1, f^2) : f^1 ∈ H^1,\, f^2 ∈ H^2 \},

and an inner product on F defined by

\left\langle (f^1, f^2), (g^1, g^2) \right\rangle_F = \gamma_1 \langle f^1, g^1 \rangle_{H^1} + \gamma_2 \langle f^2, g^2 \rangle_{H^2} + \mu \sum_{i \in U} \left( f^1(x_i) - f^2(x_i) \right)\left( g^1(x_i) - g^2(x_i) \right).    (15)

It is straightforward to check that ⟨·,·⟩_F is a valid inner product. Moreover, we have the following:

Lemma A.1. F is a Hilbert space.

Proof. (Sketch.) We need to show that F is complete. Let (f^1_n, f^2_n) be a Cauchy sequence in F. Then f^1_n is Cauchy in H^1 and f^2_n is Cauchy in H^2. By the completeness of H^1 and H^2, we have f^1_n → f^1 in H^1 and f^2_n → f^2 in H^2, for some (f^1, f^2) ∈ F. Since H^1 and H^2 are RKHSs, convergence in norm implies pointwise convergence, and thus the co-regularization part of the distance also goes to zero.

H-tilde is a Hilbert Space. Recall the definition of H-tilde in Eqn. 4. Define the map u : F → H̃ by u(f^1, f^2) = f^1 + f^2. The map's kernel N := u^{-1}(0) is a closed subspace of F, and thus its orthogonal complement N^⊥ is also a closed subspace. We can consider N^⊥ as a Hilbert space with the inner product that is the natural restriction of ⟨·,·⟩_F to N^⊥. Define v : N^⊥ → H̃ as the restriction of u to N^⊥. Then v is a bijection, and we define an inner product on H̃ by

\langle f, g \rangle_{\tilde{H}} = \left\langle v^{-1}(f), v^{-1}(g) \right\rangle_F.    (16)

We conclude that H̃ is a Hilbert space isomorphic to N^⊥.
The Co-Regularization Norm. Fix any f ∈ H̃, and note that u^{-1}(f) = \{ v^{-1}(f) + n \mid n ∈ N \}. Since v^{-1}(f) and N are orthogonal, it is clear by the Pythagorean theorem that v^{-1}(f) is the element of u^{-1}(f) with minimum norm. Thus

\|f\|^2_{\tilde{H}} = \|v^{-1}(f)\|^2_F = \min_{(f^1, f^2) \in u^{-1}(f)} \|(f^1, f^2)\|^2_F.

We see that the inner product on H̃ induces the norm claimed in the theorem statement. We next check the two conditions for the validity of an RKHS (see Definition 2.1).

(a) k̃(z, ·) ∈ H̃ for all z ∈ X. Recall from Eqn. 7 that the co-regularization kernel is defined as

\tilde{k}(x, z) = \gamma_1^{-1} k^1(x, z) + \gamma_2^{-1} k^2(x, z) - \mu \left( \gamma_1^{-1} k^1_{Ux} - \gamma_2^{-1} k^2_{Ux} \right)^T \zeta_z,

where \zeta_z = H d_z = (I + \mu S)^{-1}\left( \gamma_1^{-1} k^1_{Uz} - \gamma_2^{-1} k^2_{Uz} \right). Define

h^1(x) = \gamma_1^{-1} k^1(x, z) - \mu\, \gamma_1^{-1} (k^1_{Ux})^T \zeta_z \quad\text{and}\quad h^2(x) = \gamma_2^{-1} k^2(x, z) + \mu\, \gamma_2^{-1} (k^2_{Ux})^T \zeta_z.

Note that h^1 ∈ span\{ k^1(z, ·), k^1(x_1, ·), ..., k^1(x_u, ·) \} ⊂ H^1, and similarly h^2 ∈ H^2. It is clear that k̃(z, ·) = h^1(·) + h^2(·), and thus k̃(z, ·) ∈ H̃.

(b) Reproducing Property. For convenience, we collect some basic properties of h^1 and h^2 in the following lemma.

Lemma A.2 (Properties of h^1 and h^2). Writing h^1(U) for the column vector with entries h^1(x_i), i ∈ U, and similarly for other functions on X, we have the following:

\langle f^1, h^1 \rangle_{H^1} = \gamma_1^{-1} f^1(z) - \mu\, \gamma_1^{-1} f^1(U)^T \zeta_z    (17)

\langle f^2, h^2 \rangle_{H^2} = \gamma_2^{-1} f^2(z) + \mu\, \gamma_2^{-1} f^2(U)^T \zeta_z    (18)

h^1(U) - h^2(U) = \zeta_z    (19)

Proof. The first two equations follow from the definitions of h^1 and h^2 and the reproducing kernel property. The last equation is derived as follows:

h^1(U) - h^2(U) = \gamma_1^{-1} k^1(U, z) - \mu\, \gamma_1^{-1} k^1(U, U)\, \zeta_z - \gamma_2^{-1} k^2(U, z) - \mu\, \gamma_2^{-1} k^2(U, U)\, \zeta_z
                = d_z - \mu S (I + \mu S)^{-1} d_z
                = \left[ I - \mu S (I + \mu S)^{-1} \right] d_z
                = (I + \mu S)^{-1} d_z
                = \zeta_z,

where the last line follows from the Sherman-Morrison-Woodbury inversion formula.

Since k̃(z, ·) = h^1(·) + h^2(·), it is clear that (h^1, h^2) = v^{-1}(k̃(z, ·)) + n for some n ∈ N. Writing v^{-1}(f) = (f^1, f^2) and using that v^{-1}(f) ∈ N^⊥, we have

\langle f, \tilde{k}(z, ·) \rangle_{\tilde{H}} = \left\langle v^{-1}(f), v^{-1}(\tilde{k}(z, ·)) \right\rangle_F
= \left\langle v^{-1}(f), (h^1, h^2) - n \right\rangle_F
= \left\langle v^{-1}(f), (h^1, h^2) \right\rangle_F
= \gamma_1 \langle f^1, h^1 \rangle_{H^1} + \gamma_2 \langle f^2, h^2 \rangle_{H^2} + \mu \left( f^1(U) - f^2(U) \right)^T \left( h^1(U) - h^2(U) \right)
= f^1(z) + f^2(z) - \mu \left( f^1(U) - f^2(U) \right)^T \zeta_z + \mu \left( f^1(U) - f^2(U) \right)^T \zeta_z    (from Lemma A.2)
= f(z).
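As a quick numerical companion to Lemma A.2, the following sketch builds h^1 and h^2 from their definitions for synthetic kernels and a random point z, and checks Eqn. 19, h^1(U) - h^2(U) = ζ_z, to floating-point precision. The kernels and parameter values are arbitrary choices made only for this check.

```python
import numpy as np

# Numerical sanity check of Eqn. 19: h1(U) - h2(U) = zeta_z = (I + mu S)^-1 d_z.
rng = np.random.default_rng(4)
X_unlab = rng.normal(size=(30, 2))
z = rng.normal(size=(1, 2))
gamma1, gamma2, mu = 2.0, 0.5, 3.0

gauss = lambda A, B: np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1) / 2.0)
lin = lambda A, B: A @ B.T + 1.0

K1_UU, K2_UU = gauss(X_unlab, X_unlab), lin(X_unlab, X_unlab)
k1_Uz, k2_Uz = gauss(X_unlab, z)[:, 0], lin(X_unlab, z)[:, 0]

S = K1_UU / gamma1 + K2_UU / gamma2
d_z = k1_Uz / gamma1 - k2_Uz / gamma2
zeta_z = np.linalg.solve(np.eye(len(X_unlab)) + mu * S, d_z)

# h1 and h2 evaluated on the unlabeled points, straight from their definitions
h1_U = k1_Uz / gamma1 - mu * (K1_UU @ zeta_z) / gamma1
h2_U = k2_Uz / gamma2 + mu * (K2_UU @ zeta_z) / gamma2

print(np.allclose(h1_U - h2_U, zeta_z))   # True, matching Eqn. 19
```

The same setup can be reused to spot-check Eqns. 17 and 18 for functions f^1, f^2 given by finite kernel expansions, since both sides of those identities are then computable from Gram matrices.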