An Information Theoretic Framework for Multi-view Learning

Karthik Sridharan and Sham M. Kakade
Toyota Technological Institute at Chicago
{karthik, sham}@tti-c.org

Abstract

In the multi-view learning paradigm, the input variable is partitioned into two different views $X_1$ and $X_2$ and there is a target variable $Y$ of interest. The underlying assumption is that either view alone is sufficient to predict the target $Y$ accurately. This provides a natural semi-supervised learning setting in which unlabeled data can be used to eliminate hypotheses from either view whose predictions tend to disagree with predictions based on the other view. This work explicitly formalizes an information theoretic multi-view assumption and studies the multi-view paradigm in the PAC style semi-supervised framework of Balcan and Blum [2006]. Underlying this PAC style framework is the assumption that an incompatibility function is known -- roughly speaking, this incompatibility function is a means to score how good a function is based on the unlabeled data alone. Here, we show how to derive incompatibility functions for certain loss functions of interest, so that minimizing this incompatibility over unlabeled data helps reduce expected loss on future test cases. In particular, we show how the class of empirically successful co-regularization algorithms falls into our framework and provide performance bounds (using the results in Rosenberg and Bartlett [2007], Farquhar et al. [2005]). We also provide a normative justification for canonical correlation analysis (CCA) as a dimensionality reduction technique. In particular, we show (for strictly convex loss functions of the form $\ell(w \cdot x, y)$) that we can first use CCA as a dimensionality reduction technique and (if the multi-view assumption is satisfied) this projection does not throw away much predictive information about the target $Y$ -- the benefit being that subsequent learning with a labeled set need only work in this lower dimensional space.
1 Introduction

The "multi-view" approach to learning has been receiving increasing attention as a paradigm for semi-supervised learning. The implicit assumption is that either view alone has sufficient information about the target $Y$. The basic intuition as to why this assumption is helpful is that the complexity of the learning problem could be reduced by eliminating hypotheses from each view that tend not to agree with each other -- which, crucially, can be done using unlabeled data. There are many natural applications for which this assumption is applicable. For example, consider a setting where it is easy to obtain pictures of objects from different camera angles and say our supervised task is one of object recognition. Intuitively, we can think of unlabeled data as providing examples of viewpoint invariance. One can even consider multi-modal views, e.g. identity recognition where the task might be to identify a person with one view being a video stream and the other an audio stream -- each of these views would be sufficient to determine the identity. In NLP, an example would be a paired document corpus, consisting of a document and its translation into another language, where the supervised task could be predicting some high level property of the document. The motivating example in Blum and Mitchell [1998] is a web-page classification task, where one view was the text in the page and the other was the hyper-link structure. This work explicitly formalizes a general information theoretic multi-view assumption. Based on this assumption, we seek to understand the reduction in label complexity from using unlabeled data. There are two natural classes of algorithms in the literature which can be considered multi-view algorithms: the co-regularization algorithms and algorithms based on CCA. For the former, we analyze the co-regularization algorithms of Sindhwani et al. [2005], Brefeld et al. [2006] (and the related SVM-2K algorithm of Farquhar et al.
[2005]) in a generalization of the PAC style semi-supervised framework of Balcan and Blum [2006]. Technically, this PAC model is for the 0/1 loss, but we generalize the framework to arbitrary loss functions. For the latter class of algorithms, we generalize the CCA results in Kakade and Foster [2007] to show how CCA can be used for dimensionality reduction when dealing with convex loss functions (under linear prediction). In the Discussion, we present a practical answer to the open problem presented in Balcan and Blum [2007] (presented at COLT 2007) using co-regularization algorithms, under the theory of surrogate loss functions [Bartlett et al., 2006], and we also discuss the connection to the Information Bottleneck method of Tishby et al. [1999]. In the remainder of the Introduction, we present our setting and main information theoretic assumption, and then summarize our contributions and related work.

1.1 A Multi-View Assumption

In the (multi-view) semi-supervised setting, we assume that we have $n$ labeled examples $S = \{(x_1^i, x_2^i, y^i)\}_{i=1}^{n}$ and $m$ unlabeled examples $U = \{(x_1^i, x_2^i)\}_{i=n+1}^{n+m}$, where $y^i \in Y$ and $x_v^i \in X_v$ for $v \in \{1, 2\}$, which are both sampled in an i.i.d. manner from some unknown underlying joint distribution (typically $m \gg n$). In particular, the joint underlying distribution is over $X_1 \times X_2 \times Y$. As usual, the goal is to predict $Y$, as measured with respect to some known loss function. Information theory provides the natural language in which to state an assumption for multi-view learning. In particular, the conditional mutual information $I(A : B \mid C)$ (between random outcomes $A$ and $B$ conditioned on $C$) measures how much information is shared between $A$ and $B$ conditioned on already knowing $C$; it can be viewed as how much knowing $A$ reduces our uncertainty about $B$, conditioned on already knowing $C$. We now state our first main assumption.
Assumption 1 (Multi-View Assumption) There exists an $\epsilon_{\mathrm{info}} > 0$ such that

$I(Y : X_2 \mid X_1) \le \epsilon_{\mathrm{info}}$ and $I(Y : X_1 \mid X_2) \le \epsilon_{\mathrm{info}}$

Let us try to get an intuitive feel for this assumption. The assumption states that (on average) if we already knew $X_1$ then there is little more information that we could gain about $Y$ from learning $X_2$ (and vice-versa) -- this small potential gain is quantified by $\epsilon_{\mathrm{info}}$. Hence, we can think of this assumption as stating that both $X_1$ and $X_2$ are (approximately) redundant with regard to their information about $Y$. Let us examine how the compatibility assumption made in the co-training case [Blum and Mitchell, 1998], where $Y \in \{0, 1\}$, is related to this assumption. There, it was assumed that a perfect prediction of $Y$ is possible using knowledge of either view alone. This implies the above conditions are satisfied with $\epsilon_{\mathrm{info}} = 0$, since conditioned on either view, the target $Y$ is already known (so there is no possible reduction in uncertainty from knowledge of the remaining view). However, note that under our assumption, neither view need accurately predict the target; the views need only carry (roughly) the same information about the target. Hence, the assumption is well suited to situations with noise. In fact, even if $\epsilon_{\mathrm{info}} = 0$, there need not exist perfect predictions of the target -- though in this case we would expect that the optimal predictions perfectly agree (as they carry the same information about $Y$), a point to which we return. The work in Blum and Mitchell [1998] also introduced a further conditional independence assumption, which states that $X_1$ and $X_2$ are independent conditioned on knowledge of $Y$. The work of Dasgupta et al. [2001], Abney [2004] shows how unreasonably strong this extra assumption is with regard to classification. In our work, we make no further assumptions on the underlying data distribution.
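To make the quantity in Assumption 1 concrete, the conditional mutual information can be estimated for discrete data with a simple plug-in estimator. The sketch below is our own illustration (the function name and toy data are not from the paper); it uses natural logarithms, so values are in nats.

```python
from collections import Counter
from math import log

def conditional_mutual_information(samples):
    """Plug-in estimate of I(Y : B | A) from discrete (a, b, y) triples."""
    n = len(samples)
    p_aby = Counter(samples)
    p_ab = Counter((a, b) for a, b, _ in samples)
    p_ay = Counter((a, y) for a, _, y in samples)
    p_a = Counter(a for a, _, _ in samples)
    cmi = 0.0
    for (a, b, y), c in p_aby.items():
        # I(Y:B|A) = sum_{a,b,y} p(a,b,y) log[ p(a) p(a,b,y) / (p(a,b) p(a,y)) ]
        cmi += (c / n) * log((p_a[a] * c) / (p_ab[(a, b)] * p_ay[(a, y)]))
    return cmi

# If the first view alone determines y, the second view adds no information,
# matching the epsilon_info = 0 (co-training) case discussed above.
data = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
print(conditional_mutual_information(data))  # -> 0.0
```

Conversely, on XOR-style data (where $Y$ needs both views) the same estimator returns a large value, so the multi-view assumption would fail there.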
1.2 Co-Regularization

There is a recent class of algorithms which control model complexity in the two view setting by co-regularizing [Sindhwani et al., 2005, Brefeld et al., 2006]. A related algorithm is the two view SVM-2K algorithm of Farquhar et al. [2005]. These classes of algorithms have all demonstrated empirical success. The question we seek to understand is how unlabeled data improves the performance of these algorithms. These co-regularization algorithms add an additional regularizer which penalizes using functions from either view which tend to disagree. The (kernelized) algorithm of Sindhwani et al. [2005], Brefeld et al. [2006] is to find two predictors $f_1$ and $f_2$ (where $f_1 : X_1 \to Y$ and $f_2 : X_2 \to Y$) which minimize the following co-regularized loss:

$\frac{1}{2}\left(\hat{E}_S[\ell(f_1(x_1), y)] + \hat{E}_S[\ell(f_2(x_2), y)]\right) + \lambda\left(\|f_1\|_K^2 + \|f_2\|_K^2\right) + \lambda_{co}\, \hat{E}_U\left[(f_1(x_1) - f_2(x_2))^2\right] \qquad (1)$

where $\|\cdot\|_K$ is a pre-specified norm over functions; $\hat{E}_S$ and $\hat{E}_U$ are empirical averages with respect to the labeled and unlabeled sets $S$ and $U$, respectively; and $\ell$ is some convex loss (such as the hinge loss or squared loss). The last term is the co-regularizer. Note that if $\lambda_{co} = 0$ then this problem just reduces to solving two independent (regularized) problems. The SVM-2K algorithm of Farquhar et al. [2005] is similar -- it essentially imposes an agreement constraint in the SVM objective function, based on the L1 norm (which allows for an efficient implementation). Rosenberg and Bartlett [2007] provide generalization bounds for co-regularization (using a co-regularizer that is the square loss) in terms of Rademacher complexities. Farquhar et al. [2005] also provide generalization bounds (again using Rademacher complexities) for the SVM-2K algorithm. These bounds characterize how much the complexity of the hypothesis space decreases with co-regularization. We can view these bounds as characterizing how much the variance of the algorithm decreases.
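For intuition, the objective in Equation 1 can be instantiated with linear predictors and the squared loss, in which case setting the gradient to zero gives a block linear system with a closed-form solution. The sketch below is our own simplification (the function name is ours; `lam` and `lam_co` stand for the norm penalty and the co-regularization weight), not the kernelized algorithm itself.

```python
import numpy as np

def coreg_least_squares(X1, X2, y, U1, U2, lam=0.1, lam_co=1.0):
    """Linear co-regularized least squares: minimize
    (1/2)(mean((X1 w1 - y)^2) + mean((X2 w2 - y)^2))
      + lam (|w1|^2 + |w2|^2) + lam_co * mean((U1 w1 - U2 w2)^2)
    by solving the stationarity conditions for (w1, w2) jointly."""
    n, m = len(y), len(U1)
    d1, d2 = X1.shape[1], X2.shape[1]
    A = np.zeros((d1 + d2, d1 + d2))
    A[:d1, :d1] = X1.T @ X1 / n + 2 * lam * np.eye(d1) + 2 * lam_co * U1.T @ U1 / m
    A[d1:, d1:] = X2.T @ X2 / n + 2 * lam * np.eye(d2) + 2 * lam_co * U2.T @ U2 / m
    A[:d1, d1:] = -2 * lam_co * U1.T @ U2 / m   # agreement coupling terms
    A[d1:, :d1] = -2 * lam_co * U2.T @ U1 / m
    b = np.concatenate([X1.T @ y / n, X2.T @ y / n])
    w = np.linalg.solve(A, b)
    return w[:d1], w[d1:]
```

With `lam_co = 0` this recovers two independent ridge regressions; increasing `lam_co` forces the two views' predictions to agree on the unlabeled set, which is exactly the role of the co-regularizer.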
In particular, as $\lambda_{co}$ increases, this has the effect of decreasing the variance (as a harder constraint is being imposed). While these are valid generalization bounds (which compare the empirical expectation of a predictor to the true expectation), they do not address the bias issue of how performance could decrease if $\lambda_{co}$ is increased too much. In particular, as $\lambda_{co}$ is increased, the algorithm is not as free to use certain hypotheses (which we can think of as the bias). Roughly speaking, these previous multi-view results quantify how model complexity is reduced, but they do not specify why this is reasonable to do. Hence, to understand how unlabeled data could improve performance, we must characterize how the co-regularization affects this bias-variance trade-off. We address these issues under the recent PAC framework for semi-supervised learning of Balcan and Blum [2006] -- though we generalize the setting to arbitrary loss functions (Balcan and Blum [2006] only considered the 0/1 loss). Their framework assumes an incompatibility function -- a function which scores how good hypotheses are based just on the underlying data distribution. They provide a general framework for characterizing how such an incompatibility function can reduce the need for labeled samples. Intuitively, one can view the co-regularizer as an incompatibility function, as it scores hypotheses based on unlabeled data -- if a pair of hypotheses disagrees strongly under the co-regularizer, it is unlikely that both are good predictors. One of our main contributions in analyzing these co-regularization algorithms is to show that the incompatibility function is really a derived property of the loss function -- the incompatibility function needs to satisfy a rather mild inverse Lipschitz condition.
Under relatively general conditions, incompatibility functions can be derived for many loss functions of interest -- we provide examples for the (regularized) hinge loss, the square loss, the 0/1 loss, and strictly convex losses. Interestingly (and rather subtly), our incompatibility function for the 0/1 loss makes use of Tsybakov's noise condition. We then explicitly use the Rademacher bounds in Rosenberg and Bartlett [2007], Farquhar et al. [2005] to provide performance bounds under the multi-view assumption. These bounds characterize the bias-variance trade-off. We explicitly quantify how to set the co-regularization parameter $\lambda_{co}$ in terms of $\epsilon_{\mathrm{info}}$, showing that an appropriate setting is $\lambda_{co} = O(1/\sqrt{\epsilon_{\mathrm{info}}})$. In particular, this shows it is appropriate that $\lambda_{co} \to \infty$ as $\epsilon_{\mathrm{info}} \to 0$, i.e. when the information theoretic assumption is as sharp as possible, we are permitted to co-regularize as hard as possible (without introducing any bias). For this case, the co-regularization algorithms obtain their maximal reduction in variance.

1.3 Dimensionality Reduction

While PCA is the time-honoured and simplest dimensionality reduction technique, there are few normative reasons as to why this technique is appropriate. The typical justification is that the top $k$ principal directions are those which best reconstruct the data, in a mean squared sense. One common criticism of this oft used justification is that a rescaling of the data could change the outcome of PCA. Canonical Correlation Analysis (CCA) [Hotelling, 1935] -- like PCA but for the two view setting -- also serves as a rather general and widely used dimensionality reduction technique. Roughly speaking, it uses the cross-correlation matrix between the two views to find the canonical directions -- those directions which are most correlated (in a normalized sense) between the views. As a dimensionality reduction procedure, one can take the top $k$ CCA directions, which, roughly speaking, preserve the most correlated coordinates.
However, unlike PCA, CCA is invariant to linear transformations of the data. (Under the linear transformations $x_1 \to L_1 x_1$ and $x_2 \to L_2 x_2$, the result of CCA does not change. This is because CCA works in terms of normalized correlation coefficients.) We define CCA more precisely in Section 3. In certain special cases, there are normative justifications for CCA as a dimensionality reduction technique. When $x_1$ and $x_2$ are jointly distributed as a Gaussian, the Gaussian Information Bottleneck method [Chechik et al., 2005] shows that CCA provides an appropriate compression scheme (under the Information Bottleneck criterion [Tishby et al., 1999]). In a semi-supervised multi-view setting, Kakade and Foster [2007] show that CCA provides the natural dimensionality reduction technique by which one can project $x$ onto a lower dimensional space (using CCA) and yet still retain predictive information about $y$. However, this work was rather specific to the square loss and used a multi-view assumption tailored to the square loss. This work provides a normative justification of CCA in a rather broad sense -- we generalize the work of Kakade and Foster [2007]. We consider a setting where we have a convex loss function of the form $\ell(w \cdot x, y)$, where either the loss function is strictly convex (e.g. log loss, square loss) or we use a strictly convex regularizer (e.g. hinge loss with L2 regularization). We show that, under the multi-view assumption above, if we perform CCA and project the data onto the top $k$ canonical directions (where $k$ is determined by the canonical eigenspectrum), then this projection loses little predictive information about $Y$. Hence, our subsequent supervised learning problem is simpler, as we can work with a lower dimensional space (with the knowledge that we have not thrown away useful predictive information in working with this lower dimensional space). We state this precisely in Section 3.
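Concretely, CCA can be computed by whitening each view and taking an SVD of the whitened cross-covariance; the singular values are the canonical correlations, and their invariance to per-view linear transformations is exactly the property noted above. A minimal sketch (our own implementation, with a small ridge term `reg` added for numerical stability):

```python
import numpy as np

def cca_directions(X1, X2, k, reg=1e-8):
    """Top-k CCA: returns (A, B, corrs) such that X1 @ A and X2 @ B
    are the (centered) canonical projections for each view."""
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    C11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    C22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    C12 = X1.T @ X2 / n
    # Whitening maps: X1 @ W1 has (approximately) identity covariance.
    W1 = np.linalg.inv(np.linalg.cholesky(C11)).T
    W2 = np.linalg.inv(np.linalg.cholesky(C22)).T
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, in decreasing order.
    U, s, Vt = np.linalg.svd(W1.T @ C12 @ W2)
    return W1 @ U[:, :k], W2 @ Vt[:k].T, s[:k]
```

On two views that share a common latent signal, the top canonical correlation is close to 1, while directions carrying only view-specific noise get correlations near 0 -- this is the "correlated coordinates" picture described above.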
2 Co-Regularization and Compatibility

We now consider the PAC style semi-supervised framework introduced in Balcan and Blum [2006] and generalize the framework to general loss functions. We work with a prediction space $\hat{Y}$ that need not be equal to the output space $Y$. The goal is to learn a pair of predictors $(f_1, f_2)$, where $f_1 : X_1 \to \hat{Y}$ and $f_2 : X_2 \to \hat{Y}$, based on the labeled and unlabeled data, such that the expected loss of either of these predictors is small. We work with loss functions (bounded in $[0, 1]$) of the form $\ell(f; (x_1, x_2, y))$ (usually the loss functions are of the more restricted form $\ell(f(x), y)$, though in some cases, e.g. Example 4, the more general form is appropriate). Denote by $L(f_1)$ the expected loss of $f_1$, i.e. $L(f_1) = E[\ell(f_1; (x_1, y))]$, and $L(f_2)$ is similarly defined. Let $F_1$ and $F_2$ denote the hypothesis classes of interest, consisting of functions from $X_1$ (and, respectively, $X_2$) to the prediction space $\hat{Y}$. Let a Bayes optimal predictor with respect to the loss $L$ based on the inputs $X_1, X_2$ be denoted by $y^*(X_1, X_2)$; so $y^* \in \operatorname{argmin}_f L(f)$, where the argmin is over all functions. Similarly, let $y_v^*$ for $v \in \{1, 2\}$ be a Bayes optimal predictor with respect to the loss $L$ based on the input $X_v$ alone.

2.1 Compatible Function Classes

As discussed in the Introduction, to leverage our information theoretic assumption, we would like to say that a near optimal predictor using information from one view tends to agree with a near optimal predictor from the other view. If this were the case, then the intuitive basis for an algorithm would be to find predictors from either view which tend to agree. However, quantifying this statement depends on the details of the loss function and the prediction space, since we need to specify a relationship between a measure of "closeness" in the loss and a measure of agreement between hypotheses.
We do this in the following assumption, which can be considered an inverse Lipschitz condition: it bounds how close two functions are in terms of how close their losses are.

Assumption 2 (Inverse Lipschitz Condition) There exists a symmetric function $d : \hat{Y} \times \hat{Y} \to \mathbb{R}^+$ and a monotonically increasing non-negative function $\Phi$ on the reals (with $\Phi(0) = 0$) such that for all $f$,

$E[d(f(x), y^*(x))] \le \Phi(L(f) - L(y^*))$

where the expectation is with respect to $x = (x_1, x_2)$, and $y^*$ is some Bayes optimal predictor with respect to the loss $L$. Furthermore, for $v \in \{1, 2\}$ and all $f_v$,

$E[d(f_v(x), y_v^*(x))] \le \Phi(L(f_v) - L(y_v^*))$

where $y_v^*$ is a Bayes optimal predictor using only knowledge of $x_v$.

While this assumption seems natural enough, we should note that there are some subtleties. For example, if we are dealing with binary prediction and the 0/1 loss function (the binary classification loss), consider the case where the target function is complete noise. Here, all predictors are Bayes optimal and have the maximal error rate of 0.5. Hence, predictors can be far from agreeing and yet all be optimal. In general, for the 0/1 loss, the higher the noise, the less near-optimal predictors need to agree. In the next Subsection, we consider this case in more detail (in Example 2), and we also consider other commonly used loss functions. While it is natural to assume that $d$ satisfies the triangle inequality, there are some natural choices of $d$ which do not. In particular, in some cases we would like to use $d(y, y') = (y - y')^2$, which does not satisfy the triangle inequality. Hence, we only assume a relaxed version of the triangle inequality.

Assumption 3 (Relaxed Triangle Inequality) For the function $d$, there exists a $c_d \ge 1$ such that for all $y_1, y_2, y_3 \in \hat{Y}$,

$d(y_1, y_2) \le c_d\left(d(y_1, y_3) + d(y_3, y_2)\right)$

We now introduce the incompatibility framework of Balcan and Blum [2006] for the multi-view setting. Here, we have a function $\chi : \hat{Y} \times \hat{Y} \to \mathbb{R}^+$, which we think of as scoring how incompatible two functions are. In particular, in this framework, one desires to use functions which are highly compatible. To formalize this, define the compatible function class with respect to the incompatibility function $\chi$ and some $t \ge 0$ as those pairs of functions which are compatible to the tune of $t$; more precisely:

$C_\chi(t) = \{(f_1, f_2) : f_1 \in F_1, f_2 \in F_2 \text{ and } E[\chi(f_1, f_2)] \le t\}$

where we are slightly abusing notation by referring to $\chi(f, f')$ as meaning $\chi(f(x_1, x_2), f'(x_1, x_2))$, which we do throughout. In order to characterize how good this compatibility class is, in terms of our multi-view assumption, we need to also define the Bayes regret:

$\epsilon_{\mathrm{bayes}} = \max\{L(\tilde{f}_1) - L(y_1^*),\; L(\tilde{f}_2) - L(y_2^*)\}$

where $\tilde{f}_v \in F_v$ is the optimal predictor for view $v$ within the hypothesis class $F_v$. Our first result shows that for a particular choice of $t$, the compatibility class contains a good pair of hypotheses.

Theorem 1 (Bias) If Assumptions 1, 2, and 3 are satisfied, then given a loss function bounded by 1 and if we set the incompatibility function to be $d$, i.e. $\chi = d$, then for $t = 2c_d^2\left(\Phi(\sqrt{\epsilon_{\mathrm{info}}}) + \Phi(\epsilon_{\mathrm{bayes}})\right)$, we have:

$\inf_{(f_1, f_2) \in C_\chi(t)} \frac{L(f_1) + L(f_2)}{2} \le L(y^*) + \epsilon_{\mathrm{bayes}} + \sqrt{\epsilon_{\mathrm{info}}}$

(The proof is provided in the Appendix.) Of course, for convex loss functions we have $L(\frac{f_1 + f_2}{2}) \le \frac{L(f_1) + L(f_2)}{2}$. The need for stating the bound in terms of the Bayes regret $\epsilon_{\mathrm{bayes}}$ is due to our information theoretic Assumption 1 not explicitly referring to any hypothesis classes $F_1$ and $F_2$. The square root dependence on $\epsilon_{\mathrm{info}}$ is a result of using Pinsker's inequality in the proof, which relates the L1 distance to the KL-divergence (see Cover and Thomas [1991]). Note that Balcan and Blum [2006] did not explicitly characterize the quality of the incompatibility class -- they assumed that $\chi$ was known and that a setting of $t$ was known such that $C_\chi(t)$ contained a 'good' predictor. Here, we derive our incompatibility function and we specify a value of $t$. Intuitively, this theorem characterizes the bias -- the reduction in performance from using $C_\chi(t)$ instead of the full hypothesis classes $F_1$ and $F_2$ -- in terms of the error $\epsilon_{\mathrm{info}}$. We now provide examples of $\chi$ and $\Phi$ pairs for commonly used loss functions, showing that our multi-view framework is quite general.

2.2 Examples of Loss/Incompatibility Pairs

Example 1 (Squared Loss) Let $\hat{Y}, Y = \mathbb{R}$. Consider the loss function $\ell(\hat{y}, y) = (\hat{y} - y)^2$. Here, we can choose the incompatibility function $\chi(y_1, y_2) = d(y_1, y_2) = (y_1 - y_2)^2$ and $\Phi(x) = x$. To see that this satisfies all the requisite assumptions, note that since $(a - b)^2 \le 2(a^2 + b^2)$, $d$ satisfies the relaxed triangle inequality with $c_d = 2$. Also, since $y_v^* = E[Y \mid X_v]$ and $y^* = E[Y \mid X_1, X_2]$, we have:

$E(f_v - y_v^*)^2 = E(f_v - y^*)^2 - E(y_v^* - y^*)^2$

and

$E(f - y^*)^2 = E(f - y)^2 - E(y^* - y)^2 = L(f) - L(y^*)$

so our inverse Lipschitz condition is satisfied with equality.

Example 2 (Zero-one Loss) Here, we have $\hat{Y}, Y = \{1, -1\}$ with $\ell(\hat{y}, y) = 1\{\hat{y} \ne y\}$. As discussed in the previous Subsection, there is no natural choice of $d$ and $\Phi$ for this loss function without further restrictions on the noise. Hence, let us assume that Tsybakov's noise condition [Tsybakov, 2004] holds for each view independently and for both views together, for some noise exponent $\alpha \in (0, 1]$, which we define below. Now we can choose the incompatibility function $\chi(y_1, y_2) = 1\{y_1 \ne y_2\}$ with $\Phi(x) = c x^\alpha$, where $c > 0$ (defined below). Here, $\chi$ is in fact a metric and hence satisfies the triangle inequality.
To see that the choice of $\Phi$ is appropriate, first note that by definition of Tsybakov's noise condition, for all $f_1 : X_1 \to \hat{Y}$, $f_2 : X_2 \to \hat{Y}$ and $f : X_1 \times X_2 \to \hat{Y}$, there exists $c > 0$ such that for $v \in \{1, 2\}$,

$\Pr\left(f_v(X_v)\left(\eta_v(X_v) - \tfrac{1}{2}\right) \le 0\right) \le c\left(L(f_v) - L(y_v^*)\right)^\alpha$

and

$\Pr\left(f(X_1, X_2)\left(\eta(X) - \tfrac{1}{2}\right) \le 0\right) \le c\left(L(f) - L(y^*)\right)^\alpha$

where $\eta_v$ and $\eta$ stand for $P(Y = 1 \mid X_v)$ and $P(Y = 1 \mid X_1, X_2)$ respectively. Now since $\mathrm{sign}(\eta(X) - \tfrac{1}{2})$ is the Bayes optimal predictor, $1\{f(X)(\eta(X) - \tfrac{1}{2}) \le 0\} = 1\{f(X) \ne y^*(X)\} = \chi(f, y^*)$, and thus, under Tsybakov's noise condition, Assumption 2 is satisfied.

Example 3 (Strictly Convex Losses) Consider a loss function $\ell(\hat{y}, y)$ where, for each $y$, $\ell(\cdot, y)$ is strictly convex with respect to a pseudo-metric $d$, with modulus of convexity $\delta$ (defined below). Let the prediction space $\hat{Y}$ and output space $Y$ be bounded subsets of $\mathbb{R}$. Here, $\chi(y_1, y_2) = \delta(d(y_1, y_2))$ satisfies Assumption 2 with $\Phi(x) = \frac{x}{2}$ (provided the modulus of convexity function satisfies $\delta(\epsilon) \ge \epsilon^p$ for some $p > 0$). In this case it is easy to check that $c_d = 1$ if $p < 1$ and $c_d = 2^{p-1}$ otherwise. To see this, we first define the modulus of convexity of the loss function with respect to the pseudo-metric $d$ (in its first argument). We say that for a given $y$, $\ell(\cdot, y)$ has modulus of convexity $\delta_y$ if

$\delta_y(\epsilon) = \inf\left\{\frac{\ell(\hat{y}, y) + \ell(\hat{y}', y)}{2} - \ell\left(\frac{\hat{y} + \hat{y}'}{2}, y\right) : d(\hat{y}, \hat{y}') \ge \epsilon\right\}$

where the inf is over $\hat{y}, \hat{y}' \in \hat{Y}$. We actually want to work with a uniform bound on this function, and so we define $\delta$ to be any function satisfying $\delta(\epsilon) \le \inf_{y \in Y} \delta_y(\epsilon)$. Now note that

$\frac{L(f_v) + L(y_v^*)}{2} - L\left(\frac{f_v + y_v^*}{2}\right) \ge E[\delta(d(f_v, y_v^*))]$ and $\frac{L(f) + L(y^*)}{2} - L\left(\frac{f + y^*}{2}\right) \ge E[\delta(d(f, y^*))]$

Since $L(\frac{f_v + y_v^*}{2}) \ge L(y_v^*)$ and $L(\frac{f + y^*}{2}) \ge L(y^*)$, we have that

$E[\chi(f_v, y_v^*)] = E[\delta(d(f_v, y_v^*))] \le \frac{L(f_v) - L(y_v^*)}{2}$ and $E[\chi(f, y^*)] = E[\delta(d(f, y^*))] \le \frac{L(f) - L(y^*)}{2}$

which shows our choice of $\chi$ and $\Phi$ is appropriate.

Remark 1 It is worth noting that whenever Assumption 2 is satisfied with $\chi(y_1, y_2) = g(d(y_1, y_2))$, where $d$ is some pseudo-metric and $g$ is an invertible convex function, then Assumption 2 is also satisfied with $\chi = d$ as the incompatibility function and $g^{-1} \circ \Phi$ as the corresponding $\Phi$. This is a simple consequence of Jensen's inequality.

Example 4 (L2 Regularized Losses) Say we have some loss function $\ell$ that is convex, and $\hat{Y} = \mathbb{R}$. Now consider the regularized loss functional for a certain RKHS function class $F$,

$\ell_\lambda(f; x, y) := \ell(f(x), y) + \lambda \|f\|_K^2 \qquad (2)$

Taking $\chi(y_1, y_2) = (y_1 - y_2)^2$, we can show that Assumption 2 is satisfied for the regularized loss with $\Phi(x) = \frac{(K + \lambda)^2}{2\lambda} x$, where $K := \sup_{x \in X} K(x, x)$ (note that here we overload the notation $K$, but it is clear from context). To see this, define for $f, f' \in F$ the metric

$d_{\lambda, x}(f, f') = |f(x) - f'(x)| + \|f - f'\|_K$

One can show that $E[\ell_\lambda(f)]$ is strictly convex with respect to $d_{\lambda, x}$ (Steinwart and Scovel [2006], Lemma 6.4) with modulus of convexity $\delta(\epsilon) = \frac{\lambda \epsilon^2}{(K + \lambda)^2}$. From this we see that

$\frac{E[\ell_\lambda(f; x, y)] - E[\ell_\lambda(f^*; x, y)]}{2} \ge E\left[\frac{\ell_\lambda(f; x, y) + \ell_\lambda(f^*; x, y)}{2} - \ell_\lambda\left(\frac{f + f^*}{2}; x, y\right)\right] \ge E[\delta(d_{\lambda, x}(f, f^*))] = E[\delta(|f(x) - f^*(x)| + \|f - f^*\|_K)] \ge E[\delta(|f(x) - f^*(x)|)] \ge \frac{\lambda\, E[(f(x) - f^*(x))^2]}{(K + \lambda)^2}$

Thus we see that for the regularized loss functional, the squared incompatibility satisfies Assumption 2 with our choice of $\Phi(x) = \frac{(K + \lambda)^2}{2\lambda} x$.

2.3 Convergence Bounds

We now characterize the sample complexity of an algorithm which uses labeled and unlabeled data sets sampled from the underlying distribution. Our framework again parallels that of Balcan and Blum [2006], broadened to include more general loss functions. The basic algorithm we consider is identical to that in Balcan and Blum [2006]. Given an unlabeled data set $U$, we define the empirical compatibility class as:

$\hat{C}_\chi(t) = \{(f_1, f_2) : f_1 \in F_1, f_2 \in F_2 \text{ and } \hat{E}_U[\chi(f_1, f_2)] \le t\}$

where the empirical expectation is:

$\hat{E}_U[\chi(f_1, f_2)] = \frac{1}{m} \sum_{(x_1, x_2) \in U} \chi(f_1(x_1), f_2(x_2))$

The algorithm simply minimizes the average loss of predictions over the labeled data subject to the constraint of choosing hypotheses from $\hat{C}_\chi(t)$. More precisely, for a given $t$, the algorithm simply chooses the best pair in this class:

$(\hat{f}_1, \hat{f}_2) = \operatorname*{argmin}_{(f_1, f_2) \in \hat{C}_\chi(t)} \hat{E}_S[\ell(f_1(x_1), y) + \ell(f_2(x_2), y)] \qquad (3)$

The co-regularization algorithm can be viewed as a dual version of this algorithm, which we consider in the following Subsection. As we are dealing with abstract hypothesis classes, as in Balcan and Blum [2006], we make an assumption about the learning complexity with respect to these abstract hypothesis classes -- we give examples shortly. This assumption is stated in terms of both $S$ and $U$, which allows us to use data-dependent sample complexity bounds (such as Rademacher bounds), which is important in the next Subsection (for the analysis of the co-regularization algorithms and SVM-2K).

Assumption 4 (Sample Complexity) For the hypothesis classes $F_1$ and $F_2$:

Unlabeled Case: With probability greater than $1 - \delta$ over the i.i.d. sampling of the unlabeled data set $U$, we have for all $(f_1, f_2) \in F_1 \times F_2$,

$E[\chi(f_1, f_2)] \le \hat{E}_U[\chi(f_1, f_2)] + G_\chi(F_1 \times F_2, U, \delta)$

where $G_\chi$ is some notion of the generalization ability of the function class.

Labeled Case: For any given unlabeled data set $U$, with probability greater than $1 - \delta$ over the i.i.d. sampling of the labeled data set $S$, we have for all pairs $(f_1, f_2) \in \hat{C}_\chi(t)$,

$|L(f_1) + L(f_2) - (\hat{L}(f_1) + \hat{L}(f_2))| \le G_\ell(\hat{C}_\chi(t), S, \delta)$

where $G_\ell$ is some notion of the generalization ability of the function class.

We now provide some standard sample complexity bounds.

Remark 2 (Examples of $G_\chi$ and $G_\ell$) Assumption 4 is satisfied in the following standard examples.

Finite Hypothesis Class: If the hypothesis classes are finite, then using Chernoff and union bounds we have

$G_\chi(\mathcal{H}, U, \delta) = O\left(\sqrt{\frac{\log(|\mathcal{H}|) + \log(1/\delta)}{m}}\right)$

and $G_\ell = G_\chi$.

Finite VC Class: If the hypotheses map to $[0, 1]$ and the VC dimension is finite, then

$G_\chi(\mathcal{H}, U, \delta) = O\left(\sqrt{\frac{VCdim(\mathcal{H}) + \log(1/\delta)}{m}}\right)$

and $G_\ell = G_\chi$.

Rademacher Bounds: For bounded loss and incompatibility functions, Rademacher bounds give us:

$G_\chi(\mathcal{H}, U, \delta) = O\left(R_m(\mathcal{H}) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}\right)$

and $G_\ell = G_\chi$. Here, $R_n(\mathcal{H}) = \frac{2}{n} E\left[\sup_{f \in \mathcal{H}} \sum_{i=1}^{n} \sigma_i f(x_i)\right]$, where the $\sigma_i$ are Rademacher variables.

We are now ready to state our main result on the complexity of our multi-view algorithm.

Theorem 2 Assume that the loss function $\ell$ is bounded by 1, the incompatibility function $\chi = d$, and that Assumptions 1, 2, 3 and 4 hold. Set

$t = 2c_d^2\left(\Phi(\sqrt{\epsilon_{\mathrm{info}}}) + \Phi(\epsilon_{\mathrm{bayes}})\right) + G_\chi(F_1 \times F_2, U, \delta/3)$

and let the pair $(\hat{f}_1, \hat{f}_2)$ be the output of the algorithm (as defined by Equation 3) with this setting of $t$. Then with probability greater than $1 - \delta$ over an i.i.d. sample of both the labeled data set $S$ and the unlabeled data set $U$, we have

$\frac{L(\hat{f}_1) + L(\hat{f}_2)}{2} \le L(y^*) + G_\ell(\hat{C}_\chi(t), S, \delta/3) + \epsilon_{\mathrm{bayes}} + \sqrt{\epsilon_{\mathrm{info}}}$

(The proof is provided in the Appendix.) This statement is analogous to the main complexity statements in the semi-supervised PAC framework of Balcan and Blum [2006]. In particular, the unlabeled complexity $G_\chi$ only alters the setting of $t$, just as in Balcan and Blum [2006]. The labeled complexity term $G_\ell$ appears as a penalization to the bound, again as in the semi-supervised PAC framework. The main difference is that we now specify the value of $t$ to be used and compare ourselves to the Bayes optimal. Note that in Balcan and Blum [2006] there is no explicit characterization of how much bias is introduced by using $\hat{C}_\chi(t)$ as opposed to the unconstrained hypothesis space. The information theoretic assumption is what allows us to make this explicit characterization: the term $\sqrt{\epsilon_{\mathrm{info}}}$ is the bias introduced by using the constrained hypothesis space rather than the unconstrained one. The benefit is that we could substantially reduce the variance. In particular, this variance reduction is reflected by the fact that the labeled complexity term $G_\ell$ only depends on the restricted hypothesis space $\hat{C}_\chi(t)$, rather than the full hypothesis space -- the former of which could have significantly less complexity. We now show how specific algorithms and analyses fit into this framework.

2.4 Algorithms

We now provide bounds for co-regularization algorithms and the SVM-2K algorithm of Farquhar et al. [2005]. For $v \in \{1, 2\}$, let $F_v$ be some RKHS with respect to the norm $\|\cdot\|_K$. Define $\ell_\lambda$ as in Example 4, i.e. $\ell_\lambda(f; x, y) := \ell(f(x), y) + \lambda \|f\|_K^2$, where $\ell(f(x), y)$ is convex. Define

$L_\lambda(f) := E[\ell_\lambda(f; (x_1, x_2, y))] \qquad (4)$

Also let

$f^* = \operatorname*{argmin}_f L_\lambda(f)$

where the argmin is over all functions (so $f^*$ is the Bayes optimal predictor). By the Representer Theorem, $f^*$ lives in the RKHS. This implies that $\epsilon_{\mathrm{bayes}} = 0$. Throughout this section we overload notation by using $K := \sup_{x \in X} K(x, x)$ (when it is clear from context).

Co-Regularization (with squared incompatibility)

The original co-regularization algorithm introduced in Sindhwani et al. [2005] and the co-regularized least squares regression of Brefeld et al. [2006] both minimize the objective in Equation 1. Recall that for the regularized convex loss functions in Example 4, we already showed that $\chi(f_1(x_1), f_2(x_2)) = (f_1(x_1) - f_2(x_2))^2$ satisfies Assumption 2. Therefore we see that Theorem 2 justifies these co-regularization algorithms under the information theoretic Assumption 1. Rosenberg and Bartlett [2007] provide an estimate of the Rademacher complexity of the kernel class for co-regularization in a transductive type setting (i.e. conditioned on the unlabeled data). The bound given is exactly of the form needed in Assumption 4.
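To see the constrained form of the algorithm in Equation 3 in the simplest possible setting, here is a brute-force version for finite hypothesis classes, using the 0/1 loss and the disagreement incompatibility from Example 2. This is our own toy illustration (function and variable names are ours), not the kernelized algorithms analyzed in this section.

```python
import numpy as np

def compatible_erm(F1, F2, X1L, X2L, yL, X1U, X2U, t):
    """Brute-force Equation 3 for finite classes: among pairs whose
    empirical incompatibility on the unlabeled data is at most t,
    return the pair minimizing the summed labeled 0/1 loss."""
    best, best_loss = None, float("inf")
    for f1 in F1:
        p1U = f1(X1U)
        for f2 in F2:
            # empirical incompatibility: disagreement rate on U
            if np.mean(p1U != f2(X2U)) > t:
                continue  # pair is not in the empirical compatibility class
            loss = np.mean(f1(X1L) != yL) + np.mean(f2(X2L) != yL)
            if loss < best_loss:
                best, best_loss = (f1, f2), loss
    return best
```

For example, with threshold classifiers on a single feature shared by both views, only pairs of nearly identical thresholds survive the compatibility constraint, and the labeled data then only has to distinguish among that much smaller set -- the variance reduction discussed above.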
The subtlety in using these complexity bounds is that the co-regularization algorithms are a dual formulation of our Algorithm (see Equation 3), the latter of which imposes a hard agreement constraint. Hence, to provide a bound we need find an appropriate setting of the parameter co . The following theorem does this. Corollary 3 Assume we are working in the transductive setting (where U is known and the underlying data distribution is uniform over U ). Let Clip be the Lipschitz constant for v v v the loss. Let KS ×S , KS ×U and KU ×U stand for the kernel matrix between labeled examples, between labeled and unlabeled examples, and unlabeled and unlabeled samples for view v {1, 2} respectively. Given > 0, if we set co = 4(K +2 info then for ) the pair of functions (f1 , f2 ) F1 × F2 returned by the coregularization algorithm (Equation 1), with probability at least 1 - over labeled samples, l n( 2 ) 1 b +f2 b L ( f1 2 ) L (f ) + 2+3 2 n + 2CLip Rn ( Where, Rn (C ( 1 co )) compares to the Bayes optimal predictor f , while Rosenberg and Bartlett [2007] only compare to the best function in C (t) (without any normative justification for how to set the parameter t).Our comparison to f leads to the additional penalty of info (and we specify a value of co in the bound). Note that the appropriate setting of co is O(1/ info ). In particular, this shows it is appropriate for co as info 0, i.e. when the information theoretic assumption is as sharp as possible, we are permitted to co-regularize as hard as possible (without introducing any bias). For this case, the co-regularization algorithms obtain their maximal reduction in variance. To convert the above corollary to an inductive bound (where U is a random sample) we need to establish an unlabeled complexity statement of the kind in Assumption 4. 
Note that if the prediction space is bounded, then it can be shown using covering number arguments (Zhang [2002]) that G(F1 × F2, U, δ) is at most c√(log(1/δ)/m), where c is a constant (depending on λ_co and K). Hence, by setting

t = 2c_d²(Φ(ε_bayes) + Φ(√ε_info)) + c√(log(1/δ)/m),

we obtain the required inductive statement.

Two View SVM. The SVM-2K approach proposed by Farquhar et al. [2005] can be formulated as the following optimization problem:

argmin_{(f1,f2) ∈ F1×F2} (1/2)(Ê_S[ℓ(f1(x1), y)] + Ê_S[ℓ(f2(x2), y)]) + λ‖f1‖²_K + λ‖f2‖²_K + λ_co Ê_U[|f1(x1) − f2(x2)|],   (5)

where ℓ is the hinge loss. Technically, the formulation in Farquhar et al. [2005] uses slack variables (more in line with the usual SVM formulation), but the above formulation is identical.¹ SVM-2K can be viewed as using the incompatibility function d(y1, y2) = |y1 − y2|. Recall that for the regularized convex loss functions in Example 4, we already showed that (f1(x1) − f2(x2))² satisfies Assumption 2. Hence, using Remark 1, we see that this incompatibility function for SVM-2K also satisfies Assumptions 2 and 3, with c_d = 1 and Φ(x) = ((K + √λ)²/λ)x. Hence, we get the following corollary.

Corollary 4. Assume we are working in the transductive setting (where U is known and the underlying data distribution is uniform over U). Given δ > 0, if we set λ_co = λ/(2(K + √λ)²√ε_info), then with probability at least 1 − δ over the labeled samples, for the pair of functions (f̂1, f̂2) ∈ F1 × F2 returned by the SVM-2K algorithm (Equation 5),

L((f̂1 + f̂2)/2) ≤ L(f*) + 2R_n(C(1/λ_co)) + 3√(ln(2/δ)/(2n)) + √ε_info,

where R_n(C(1/λ_co)) is the data-dependent Rademacher complexity. By Theorem 3 of Rosenberg and Bartlett [2007], R_n(C(1/λ_co)) ≤ R̄/√n, where

R̄² = (K + √λ)² (1/λ_co) (λ⁻¹ tr(K^1_{S×S}) + λ⁻¹ tr(K^2_{S×S}) − tr(Jᵀ(I + M)⁻¹J)),
J = λ⁻¹ K^1_{U×S} − λ⁻¹ K^2_{U×S},   M = λ⁻¹ K^1_{U×U} + λ⁻¹ K^2_{U×U}.

(The proof is provided in the Appendix.)

¹Technically, the SVM-2K algorithm has a parameter which allows a little more disagreement, but the algorithm we specify is equivalent to the SVM-2K algorithm with this parameter set to 0.
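To make Equation 5 concrete, here is a direct transcription of the SVM-2K objective for linear predictors. The names and data layout are ours, and the published algorithm solves this via a slack-variable quadratic program rather than by evaluating the objective directly:

```python
import numpy as np

def svm2k_objective(w1, w2, X1, X2, y, U1, U2, lam, lam_co):
    """SVM-2K objective (cf. Equation 5) for linear f_v(x) = x . w_v:
    the averaged hinge losses on the labeled sample, per-view norm
    penalties, and the L1 incompatibility |f1 - f2| on unlabeled data."""
    hinge1 = np.mean(np.maximum(0.0, 1.0 - y * (X1 @ w1)))
    hinge2 = np.mean(np.maximum(0.0, 1.0 - y * (X2 @ w2)))
    norms = lam * (w1 @ w1) + lam * (w2 @ w2)
    disagree = lam_co * np.mean(np.abs(U1 @ w1 - U2 @ w2))
    return 0.5 * (hinge1 + hinge2) + norms + disagree
```

Note that at w1 = w2 = 0 every hinge term equals 1 and the agreement penalty vanishes, so the objective is exactly 1 regardless of the data — matching the ℓ(0, y) = 1 normalization used in the Appendix.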
The main difference between our bounds and those in Rosenberg and Bartlett [2007] and Farquhar et al. [2005] is, again, the comparison class. In particular, Farquhar et al. [2005] show how to upper bound R_n(C(t)) as the solution to a particular optimization problem. The proof is essentially identical to that of the previous corollary and is not provided. The main extension in our work is that we compare the algorithm's performance to the loss of the Bayes optimal predictor f*, while Farquhar et al. [2005] only compare to the best function in C(t). Our comparison to f* leads to the additional penalty of √ε_info (and we specify a value of λ_co in the bound). The appropriate setting of λ_co is again O(1/√ε_info), which shows that the smaller ε_info gets, the harder we can co-regularize.

Denote the projection of x_v onto the leading canonical coordinates (the construction is given in Section 3) by π_cca(x_v). Let β^(v)_proj be the optimal linear predictor for view v using only the projected π_cca(x_v) as input. We now show that the loss in performance due to this projection is small if ε_info is small.

Theorem 6. Assume that Equation 6 holds, that Assumption 1 is satisfied, and that Assumptions 2 and 3 hold with respect to the squared incompatibility function. Then

L(β^(v)_proj) ≤ L(y_v) + ε_bayes + 4C(Φ(√ε_info) + Φ(ε_bayes))/(1 − λ_thresh),

where C satisfies Equation 6. (The proof is provided in the Appendix.) In particular, if the cutoff λ_thresh is 1/2, then the 1/(1 − λ_thresh) factor in the bound is 2. Let us consider the implications for learning with a random labeled data set S using π_cca(x_v). Here, the learning algorithm only needs to work with the coordinates that have sufficiently large λ_i; hence, the supervised learning problem is simpler, as we can work in a lower dimensional space. This theorem is analogous to the dimensionality reduction statements in Kakade and Foster [2007], though there the statements were restricted to the square loss (and to a multi-view assumption based on the square loss).
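Anticipating the CCA construction defined in Section 3, the project-then-threshold procedure behind Theorem 6 can be sketched numerically as follows. This is our own NumPy sketch: the whitening-plus-SVD route, all function names, and the small ridge term (standing in for exact knowledge of the covariances) are assumptions of the illustration, not part of the paper's analysis:

```python
import numpy as np

def inv_sqrt(C, eps=1e-8):
    """Inverse matrix square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(C)
    w = np.maximum(w, eps)
    return V @ np.diag(w ** -0.5) @ V.T

def cca(X1, X2, reg=1e-3):
    """Canonical basis for two centered views (cf. Definition 5).

    Returns per-view projection maps A, B and canonical correlations
    lam, sorted so that 1 >= lam[0] >= lam[1] >= ... >= 0."""
    n = X1.shape[0]
    C11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    C22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    C12 = X1.T @ X2 / n
    W1, W2 = inv_sqrt(C11), inv_sqrt(C22)
    U, lam, Vt = np.linalg.svd(W1 @ C12 @ W2)
    k = min(X1.shape[1], X2.shape[1])
    return W1 @ U[:, :k], W2 @ Vt.T[:, :k], np.clip(lam[:k], 0.0, 1.0)

def project_top(Xv, A, lam, lam_thresh=0.5):
    """Keep only canonical coordinates with correlation >= lam_thresh,
    i.e. the projection pi_cca of Theorem 6."""
    keep = lam >= lam_thresh
    return (Xv @ A)[:, keep]
```

In the returned basis, the two views' coordinates are (empirically, up to the ridge term) uncorrelated within each view and correlated λ_i across views, which is exactly the structure Definition 5 asks for.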
3 Dimensionality Reduction and CCA

Consider a setting where X = X1 × X2 is a real vector space (of finite or countably infinite dimension). Here, we work with linear predictors of the form wᵀx and convex losses of the form ℓ(wᵀx, y) that satisfy Assumptions 2 and 3 with respect to the squared incompatibility function. For example, most strictly convex loss functions can be used with the squared incompatibility function, including the square loss, log loss, exponential loss, and L2 regularized losses. Let L(w) = E[ℓ(wᵀx, y)]. For simplicity, we work in the transductive setting; in particular, we only assume knowledge of the second order statistics of the underlying data distribution (i.e., we know the covariance matrix of X). Assume that the loss function is twice differentiable and that its second derivative is bounded from above by some constant C, that is,

d²ℓ(z, y)/dz² ≤ C.   (6)

Note that this assumption is satisfied by common strictly convex losses. Define canonical correlation analysis (CCA) as follows:

Definition 5. The bases B1, B2 for X1 and X2 are the canonical bases for the two views if, for (x1, x2) expressed in these bases, the following hold:

1. Orthogonality conditions: for v ∈ {1, 2}, E[(x_v)_i (x_v)_j] = 1 if i = j and 0 otherwise.

2. Correlation conditions: E[(x1)_i (x2)_j] = λ_i if i = j and 0 otherwise.

4 Discussion

An Open Problem from Balcan and Blum [2007]. This problem (presented at COLT 2007) concerns the 0/1 loss, where it is assumed that classifiers from either view can perfectly predict the data (so the best classifiers agree completely on the unlabeled data). Furthermore, they assume that the classifiers are linearly separable. The question posed is whether an efficient algorithm can be found. A more general and practically relevant question is this case with noise, which of course makes the problem harder. Here, the optimal predictors (from either view) may not agree perfectly on the unlabeled data.
However, under Example 2, we know that choosing d to be the 0/1 loss is a suitable discrepancy function (with Φ defined in terms of the Tsybakov noise exponent). In practice, even in the single view case, one is rarely able to directly minimize the 0/1 loss. Instead, what one actually does is minimize a surrogate loss function, such as the hinge loss, logistic loss, or exponential loss. Furthermore, through the work of Bartlett et al. [2006], we have an understanding of how minimizing these surrogate losses relates to the 0/1 loss. In our framework, we are able to choose a discrepancy function tailored to our loss (as long as the discrepancy satisfies Assumption 2). Hence, if we are using a surrogate loss (for the 0/1 loss), then we should choose an incompatibility function that satisfies Assumption 2 with respect to this surrogate loss. We view both the co-regularization algorithms and the SVM-2K algorithm as solutions to this problem under the theory of surrogate losses (both algorithms use the surrogate hinge loss).

Here λ_i is the i-th canonical correlation coefficient. We assume without loss of generality that 1 ≥ λ_1 ≥ λ_2 ≥ ... ≥ 0. Now we present the main algorithm, which uses CCA as a dimensionality reduction technique. Consider some threshold 0 < λ_thresh < 1, and let i_thresh be the smallest i such that λ_i < λ_thresh. First, project x_v onto the subspace spanned by the first 1, ..., i_thresh canonical coordinates; denote this projection by π_cca(x_v).

4.1 Relations to the Information Bottleneck

We end with a note on the connection to the Information Bottleneck method. In this method, the goal is to compress X1 into a variable Z that has maximal information about X2. In particular, Z is a compression of X1 that retains all the information that X1 has about X2, that is,

Z = argmin_A I(A : X1)   s.t.   I(A : X2) = I(X1 : X2),

where the argmin is over compression functions A of X1. In the multi-view setting, if we find such a Z (with respect to X1 and X2), it can be shown that

I(Z : Y) ≥ I(X1 : Y) − ε_info.

This shows that Z loses little predictive information about Y. In this sense, the Information Bottleneck does not throw away much information relevant to Y, and it can be used as a semi-supervised algorithm. In fact, using Lemma 7, one can show that for any loss bounded by 1, the Bayes optimal predictor which uses only knowledge of Z has a regret of at most √ε_info with respect to the Bayes optimal predictor y*. An interesting direction to pursue is to learn with Z, rather than X_v, as the input to our learning algorithm, since Z has lower entropy. Two issues to consider are: 1) the mapping Z has an abstract range (so one needs to take care in how to learn a function from Z to Y), and 2) it is not clear how to implement the Information Bottleneck without knowledge of the underlying distribution.

Acknowledgements

We thank Gilles Blanchard for a number of helpful suggestions.

References

Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6:165–188, 2005.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.

Sanjoy Dasgupta, Michael L. Littman, and David A. McAllester. PAC generalization bounds for co-training. In NIPS, pages 375–382, 2001.

Jason D. R. Farquhar, David R. Hardoon, Hongying Meng, John Shawe-Taylor, and Sándor Szedmák. Two view learning: SVM-2K, theory and practice. In NIPS, 2005.

H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 26:139–142, 1935.

Sham M. Kakade and Dean P. Foster. Multi-view regression via canonical correlation analysis. In Nader H. Bshouty and Claudio Gentile, editors, COLT, volume 4539 of Lecture Notes in Computer Science, pages 82–96. Springer, 2007.

David Rosenberg and Peter L. Bartlett. The Rademacher complexity of co-regularized kernel classes. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.

V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Workshop on Learning with Multiple Views, Proceedings of the International Conference on Machine Learning, 2005.

Ingo Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 2006.

N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.

A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1), 2004.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.

Steven Abney. Understanding the Yarowsky algorithm. Computational Linguistics, 30(3):365–395, 2004.

Maria-Florina Balcan and Avrim Blum. A PAC-style model for learning from labeled and unlabeled data. In Semi-Supervised Learning, pages 111–126. MIT Press, 2006.

Maria-Florina Balcan and Avrim Blum. Open problems in efficient semi-supervised PAC learning. In Conference on Computational Learning Theory (COLT), 2007.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In COLT '98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, New York, NY, USA, 1998. ACM Press.

Ulf Brefeld, Thomas Gärtner, Tobias Scheffer, and Stefan Wrobel.
Efficient co-regularised least squares regression. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 137–144, New York, NY, USA, 2006. ACM Press.

A Proofs

First we state two lemmas that will be used in proving the theorems.

Lemma 7. For v ∈ {1, 2}, if the loss function is bounded by 1, then we have that |L(y*) − L(y_v)| ≤ √ε_info and |L(y_1) − L(y_2)| ≤ 2√ε_info.

Proof: Consider some function g : X → [0, 1] and two probability measures P and Q. We have

|∫ g(x) dQ − ∫ g(x) dP| = |∫ (1 − dP/dQ) g(x) dQ| ≤ ∫ |1 − dP/dQ| dQ ≤ √(D_KL(Q ‖ P)),   (7)

where the last step holds because the L1 variational distance is bounded by the square root of the KL divergence (Pinsker's inequality). Using this, for fixed (x1, x2) we have

|E_{Y|X1=x1} ℓ(y*(x1, x2), y) − E_{Y|X=(x1,x2)} ℓ(y*(x1, x2), y)| ≤ √(D_KL(P_{Y|X=(x1,x2)} ‖ P_{Y|X1=x1})).

Taking the expectation with respect to X = (X1, X2) and using Jensen's inequality twice (once on the left for the convex function |x| and once on the right for the concave function √x), we get

|E_X E_{Y|X1=x1} ℓ(y*(x1, x2), y) − L(y*)| ≤ √(E_X D_KL(P_{Y|X=(x1,x2)} ‖ P_{Y|X1=x1})).

Now note that since L(y_1) ≤ E_X E_{Y|X1=x1} ℓ(y*(x1, x2), y) and L(y_1) ≥ L(y*), we get

|L(y_1) − L(y*)| ≤ √(E_X D_KL(P_{Y|X=(x1,x2)} ‖ P_{Y|X1=x1})).

Also, E_X D_KL(P_{Y|X=(x1,x2)} ‖ P_{Y|X1=x1}) = I(Y : X2 | X1) ≤ ε_info, and so |L(y_1) − L(y*)| ≤ √ε_info. Similarly, |L(y_2) − L(y*)| ≤ √ε_info. The two inequalities together imply |L(y_1) − L(y_2)| ≤ 2√ε_info.

Lemma 8. For any f1, f2, assume that L(f1) − L(y_1) ≤ ε̃ and L(f2) − L(y_2) ≤ ε̃. Then, given Assumptions 1, 2 and 3, and that the loss function is bounded by B, we have

E[d(f1, f2)] ≤ 2c_d²(Φ(ε̃) + Φ(√ε_info)).

Proof: First note that by Assumptions 2 and 3, for f1 and f2 we have E[d(f1, y_1)] ≤ Φ(L(f1) − L(y_1)) and E[d(f2, y_2)] ≤ Φ(L(f2) − L(y_2)), and since Φ is monotonically increasing, E[d(f1, y_1)] ≤ Φ(ε̃) and E[d(f2, y_2)] ≤ Φ(ε̃). Again by Assumptions 2 and 3 (together with Lemma 7), for y*, E[d(y_1, y*)] ≤ Φ(L(y_1) − L(y*)) ≤ Φ(√ε_info) and E[d(y_2, y*)] ≤ Φ(L(y_2) − L(y*)) ≤ Φ(√ε_info). Since d satisfies the relaxed triangle inequality (Assumption 3), we get E[d(y_2, y_1)] ≤ 2c_d Φ(√ε_info). Again using the relaxed triangle inequality, we get the required result:

E[d(f1, f2)] ≤ c_d²(E[d(f1, y_1)] + E[d(y_1, y_2)] + E[d(f2, y_2)]) ≤ 2c_d²(Φ(ε̃) + Φ(√ε_info)).

Proof [of Theorem 1]: By the definition of ε_bayes, we have min_{f_v ∈ F_v} L(f_v) − L(y_v) ≤ ε_bayes; let (f1, f2) be a pair achieving this in both views. Using Lemma 8, E[d(f1, f2)] ≤ 2c_d²(Φ(ε_bayes) + Φ(√ε_info)). Therefore, setting t = 2c_d²(Φ(ε_bayes) + Φ(√ε_info)), we find that (f1, f2) ∈ C(t), and thus

min_{(f1,f2) ∈ C(t)} (L(f1) + L(f2))/2 ≤ (L(y_1) + L(y_2))/2 + ε_bayes.   (8)

Now by Lemma 7, for each v ∈ {1, 2}, L(y_v) − L(y*) ≤ √ε_info. Using this in Equation (8), we conclude that

min_{(f1,f2) ∈ C(t)} (L(f1) + L(f2))/2 ≤ L(y*) + ε_bayes + √ε_info.

Proof [of Theorem 2]: Let (f1^co, f2^co) ∈ C(t) be the minimizer of L(f1) + L(f2) over the class C(t). Using the labeled statement of Assumption 4, with probability at least 1 − δ over the sample S,

L̂(f1^co) + L̂(f2^co) − L(f1^co) − L(f2^co) ≤ G̃(C(t), S, δ).

Also, for any (f1, f2) ∈ C(t), with probability at least 1 − δ over the sample S,

L(f1) + L(f2) − L̂(f1) − L̂(f2) ≤ G̃(C(t), S, δ).

Hence, combining the two, for the pair (f̂1, f̂2) ∈ C(t) that minimizes L̂(f1) + L̂(f2), with probability at least 1 − 2δ over the sample S,

L(f̂1) + L(f̂2) − L(f1^co) − L(f2^co) ≤ 2G̃(C(t), S, δ).

Now let t* = 2c_d²(Φ(√ε_info) + Φ(ε_bayes)). If (f1, f2) ∈ C(t*), then E[d(f1, f2)] ≤ t*. However, applying the unlabeled statement of Assumption 4, with probability at least 1 − δ over the unlabeled dataset U, we have

Ê[d(f1, f2)] ≤ E[d(f1, f2)] + G(F1 × F2, U, δ).

Thus, with probability at least 1 − δ over the i.i.d. unlabeled sample, (f1, f2) ∈ Ĉ(t), and so, with probability 1 − δ over the unlabeled data,

min_{(f1,f2) ∈ Ĉ(t)} L(f1) + L(f2) ≤ min_{(f1,f2) ∈ C(t*)} L(f1) + L(f2).

Hence, using the result of Theorem 1, we conclude that with probability at least 1 − 3δ over both the labeled and unlabeled data,

L(f̂1) + L(f̂2) ≤ 2L(y*) + 2G̃(C(t), S, δ) + 2ε_bayes + 2√ε_info.

Proof [of Corollary 3]: First note that we can write any f1 ∈ F1 as (f1, 0) ∈ F1 × F2, and similarly any f2 ∈ F2 as (0, f2) ∈ F1 × F2, so that we can consider the joint RKHS defined by the sum of f1 and f2. From Example 4, we first of all have that for the regularized loss, Assumption 2 is satisfied by the squared incompatibility function (i.e., d(y1, y2) = (y1 − y2)²) with Φ(x) = ((K + √λ)²/λ)x. To establish a statement of the form of the labeled part of Assumption 4, define, as in Rosenberg and Bartlett [2007],

H(t) = {(f1, f2) : λ(‖f1‖²_K + ‖f2‖²_K) + λ_co Ê_U[(f1(x1) − f2(x2))²] ≤ 1},

and notice that the solution of the co-regularization algorithm is contained in this class. Further, again as in Rosenberg and Bartlett [2007], define J(t) = {x ↦ (f1(x1) + f2(x2))/2 : (f1, f2) ∈ H(t)}. Now we can directly use Theorem 2 of their paper (assuming ℓ is bounded by 1) to get that, with probability at least 1 − δ over the labeled samples, for all (f1, f2) ∈ C(t),

L(f1) + L(f2) ≤ L̂(f1) + L̂(f2) + 2C_Lip R_n(J(t)) + (2 + 3√(ln(2/δ)/2))/√n,   (9)

where, by Theorem 3 of Rosenberg and Bartlett [2007], R_n(J(t)) ≤ R̄/√n with

R̄² = (K + √λ)² t (λ⁻¹ tr(K^1_{S×S}) + λ⁻¹ tr(K^2_{S×S}) − tr(Jᵀ(I + M)⁻¹J)),
J = λ⁻¹ K^1_{U×S} − λ⁻¹ K^2_{U×S},   M = λ⁻¹ K^1_{U×U} + λ⁻¹ K^2_{U×U}.

Putting the regularization terms on both sides of the inequality in Equation 9, we get

E[ℓ(f1; x1, y) + ℓ(f2; x2, y)] ≤ Ê[ℓ(f1; x1, y) + ℓ(f2; x2, y)] + 4C_Lip R_n(J(t)) + (2 + 3√(ln(2/δ)/2))/√n.

This establishes the labeled statement of Assumption 4 that we were aiming for.
Also note that in this case ε_bayes = 0, since f* is in the RKHS (indeed, for the regularized loss to even be applicable, the function needs to live in the RKHS). Hence, if we restrict ourselves to the class C(t) with t = 8(K + √λ)²√ε_info/λ, then using Theorem 2 we see that we can obtain low regularized regret with respect to f*. Now, without loss of generality, assume that for the given loss ℓ(0, y) = 1. Using this in Equation 1, we see that λ_co Ê_U[(f1(x1) − f2(x2))²] ≤ 1, and so, using λ_co = 1/t, for any pair of functions (f1, f2) returned by the algorithm, Ê_U[d(f1, f2)] ≤ t. However, since we are in the transductive setting, Ê_U[d(f1, f2)] = E[d(f1, f2)]. Moreover, in the transductive case we do not need the unlabeled part of Assumption 4, and the labeled part is established using the result of Rosenberg and Bartlett [2007] (Equation 9). Hence, using Theorem 2, with probability at least 1 − δ over the labeled samples, for the pair (f̂1, f̂2) returned by the co-regularization algorithm,

(E[ℓ(f̂1; x1, y)] + E[ℓ(f̂2; x2, y)])/2 ≤ E[ℓ(f*; x1, x2, y)] + 2C_Lip R_n(J(t)) + (2 + 3√(ln(2/δ)/2))/√n + √ε_info.

Finally, using Jensen's inequality, the regularized loss of the average predictor is bounded by the average of the regularized losses of the two predictors, and hence the result.

Proof [of Theorem 6]: Without loss of generality, assume we are in the CCA basis. For each v ∈ {1, 2}, let β^(v) be the minimizer with respect to β of E[ℓ(βᵀx_v, y)].
From the result of Lemma 8, using the squared incompatibility function (c_d = 2 in this case), we have

E[(x1ᵀβ^(1) − x2ᵀβ^(2))²] ≤ 8(Φ(√ε_info) + Φ(ε_bayes)).

Expanding in the CCA basis,

E[(x1ᵀβ^(1) − x2ᵀβ^(2))²] = Σ_i [(β_i^(1))² + (β_i^(2))² − 2λ_i β_i^(1) β_i^(2)] ≥ Σ_i [(1 − λ_i)(β_i^(1))² + (1 − λ_i)(β_i^(2))²]

(the last step is due to the identity 2ab ≤ a² + b²). Hence we conclude that, for each v,

Σ_i (1 − λ_i)(β_i^(v))² ≤ 8(Φ(√ε_info) + Φ(ε_bayes)).   (10)

Let β_P^(v) be the projection of β^(v) onto the first i_thresh coordinates. Consider a twice differentiable loss function. By Taylor's theorem (second order), there exists some β̃ such that

ℓ(x_vᵀβ_P^(v), y) = ℓ(x_vᵀβ^(v), y) + (β_P^(v) − β^(v))ᵀ ∇ℓ(x_vᵀβ^(v), y) + (1/2)(β_P^(v) − β^(v))ᵀ ∇²ℓ(β̃ᵀx_v, y)(β_P^(v) − β^(v)).

Taking expectations, and noting that β^(v) is the minimizer of the expected loss (so the first order term vanishes in expectation), we find that

L(β_P^(v)) − L(β^(v)) = (1/2)(β^(v) − β_P^(v))ᵀ E[∇²ℓ(β̃ᵀx_v, y)](β^(v) − β_P^(v)).

Let β_res^(v) = β^(v) − β_P^(v). Since (β_P^(v))_i = (β^(v))_i for all i corresponding to correlation values above the threshold, β_res^(v) is zero in the first i_thresh coordinates and equal to β^(v) on the rest. Now note that for a loss function that is twice differentiable and a function of β̃ᵀx_v, the chain rule gives

∇²ℓ(β̃ᵀx_v, y) = (d²ℓ(z, y)/dz² evaluated at z = β̃ᵀx_v) · x_v x_vᵀ.

Using the assumption that the second derivative of the loss function is bounded by C, we then see that

L(β_P^(v)) − L(β^(v)) ≤ (C/2)(β_res^(v))ᵀ E[x_v x_vᵀ](β_res^(v)).

Note that since we are in the CCA basis, E[(x_v)_i (x_v)_j] = 0 when i ≠ j and 1 when i = j, so E[x_v x_vᵀ] = I. Hence, using Equation 10 (via the computation over the discarded coordinates carried out below), we can conclude that

L(β_P^(v)) − L(β^(v)) ≤ 4C(Φ(√ε_info) + Φ(ε_bayes))/(1 − λ_thresh).

Now, since L(β_proj^(v)) ≤ L(β_P^(v)), we conclude that

L(β_proj^(v)) − L(β^(v)) ≤ 4C(Φ(√ε_info) + Φ(ε_bayes))/(1 − λ_thresh).

Finally, since L(β^(v)) − L(y_v) ≤ ε_bayes, we have the required result.
In detail: for all i > i_thresh we have 1 − λ_i > 1 − λ_thresh, and so

L(β_P^(v)) − L(β^(v)) ≤ (C/2) ‖β_res^(v)‖² = (C/2) Σ_{i > i_thresh} (β_i^(v))² ≤ (C/2) Σ_{i > i_thresh} ((1 − λ_i)/(1 − λ_thresh)) (β_i^(v))² ≤ (C/(2(1 − λ_thresh))) Σ_i (1 − λ_i)(β_i^(v))² ≤ 4C(Φ(√ε_info) + Φ(ε_bayes))/(1 − λ_thresh),

where the last step uses Equation 10. This completes the proof.
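As a quick numerical sanity check of the Pinsker step used in the proof of Lemma 7, the following is our own verification script (using the standard statement of Pinsker's inequality, which carries a factor of 2 inside the square root; names and the toy distributions are ours):

```python
import numpy as np

def l1_dist(p, q):
    """L1 (variational) distance between two discrete distributions."""
    return float(np.abs(p - q).sum())

def kl(q, p):
    """KL divergence D(Q || P) in nats; supports assumed to match."""
    return float((q * np.log(q / p)).sum())

# Pinsker's inequality: ||P - Q||_1 <= sqrt(2 D(Q || P)),
# checked on randomly drawn pairs of strictly positive distributions.
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random(5) + 1e-3
    p /= p.sum()
    q = rng.random(5) + 1e-3
    q /= q.sum()
    assert l1_dist(p, q) <= np.sqrt(2.0 * kl(q, p)) + 1e-12
```

This is exactly the inequality that converts the conditional-information bound ε_info into the √ε_info regret terms appearing in Lemma 7 and Theorem 1.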