Bayesian Co-Training

Shipeng Yu, Balaji Krishnapuram, Romer Rosales, Harald Steck, R. Bharat Rao
CAD & Knowledge Solutions, Siemens Medical Solutions USA, Inc.
firstname.lastname@siemens.com

Abstract

We propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, based on a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids the local-maxima problems of some previous multi-view learning methods. Furthermore, it can automatically estimate how much each view should be trusted, and thus accommodate noisy or unreliable views. Experiments on toy data and real-world data sets illustrate the benefits of this approach.

1 Introduction

Data samples may sometimes be characterized in multiple ways; e.g., web-pages can be described both in terms of the textual content of each page and the hyperlink structure between them. [1] have shown that the error rate on unseen test samples can be upper bounded by the disagreement between the classification decisions obtained from independent characterizations (i.e., views) of the data. Thus, in the web-page example, the misclassification rate can be indirectly minimized by reducing the rate of disagreement between hyperlink-based and content-based classifiers, provided these characterizations are independent conditional on the class.

In many application domains class labels can be expensive to obtain and hence scarce, whereas unlabeled data are often cheap and abundantly available. Moreover, the disagreement between the class labels suggested by different views can be computed even on unlabeled data. Therefore, a natural strategy for using unlabeled data to minimize the misclassification rate is to enforce consistency between the classification decisions based on several independent characterizations of the unlabeled samples. For brevity, unless otherwise specified, we shall use the term co-training to describe the entire genre of methods that rely upon this intuition, although strictly it should only refer to the original algorithm of [2].

Some co-training algorithms jointly optimize an objective function which includes misclassification penalties (loss terms) for classifiers from each view and a regularization term that penalizes lack of agreement between the classification decisions of the different views. In recent times, this co-regularization approach has become the dominant strategy for exploiting the intuition behind multi-view consensus learning, largely displacing earlier alternating-optimization strategies.

We survey in Section 2 the major approaches to co-training, the theoretical guarantees that have spurred interest in the topic, and previously published concerns about its applicability in certain domains. We analyze the precise assumptions that have been made and the optimization criteria in order to better understand why these approaches succeed (or fail) in certain situations. Then in Section 3 we propose a principled undirected graphical model for co-training, which we call Bayesian co-training, and show that co-regularization algorithms provide one way of performing maximum-likelihood (ML) learning under this probabilistic model.
By explicitly highlighting previously unstated assumptions, Bayesian co-training provides a deeper understanding of the co-regularization framework, and we are also able to discuss certain fundamental limitations of multi-view consensus learning. In Section 4, we show that even simple, visually illustrated 2-D problems are sometimes not amenable to a co-training/co-regularization solution (no matter which specific model or algorithm is used, including ours). Empirical studies on two real-world data sets are also presented.

Summarizing our algorithmic contributions: co-regularization is exactly equivalent to the use of a novel co-training kernel for support vector machines (SVMs) and Gaussian processes (GPs), thus allowing one to leverage the large body of available literature for these algorithms. The kernel is intrinsically non-stationary, i.e., the level of similarity between any pair of samples depends on all the available samples, whether labeled or unlabeled, thus promoting semi-supervised learning. This approach is significantly simpler and more efficient than the alternating optimization used in previous co-regularization implementations. Furthermore, we can automatically estimate how much each view should be trusted, and thus accommodate noisy or unreliable views.

2 Related Work

Co-Training and Theoretical Guarantees: The iterative, alternating co-training method originally introduced in [2] works in a bootstrap mode: it repeatedly adds pseudo-labeled unlabeled samples into the pool of labeled samples, retrains the classifiers for each view, and pseudo-labels additional unlabeled samples on which at least one view is confident about its decision. The paper provided PAC-style guarantees that if (a) there exist weakly useful classifiers on each view of the data, and (b) these characterizations of the sample are conditionally independent given the class label, then the co-training algorithm can utilize the unlabeled data to learn arbitrarily strong classifiers.

[1] proved PAC-style guarantees that if (a) sample sizes are large, (b) the different views are conditionally independent given the class label, and (c) the classification decisions based on multiple views largely agree with each other, then with high probability the misclassification rate is upper bounded by the rate of disagreement between the classifiers based on each view. [3] tried to relax these strong theoretical requirements. They showed that co-training is useful if (a) there exist low-error-rate classifiers on each view, (b) these classifiers never make mistakes in classification when they are confident about their decisions, and (c) the two views are not too highly correlated, in the sense that there are at least some cases where one view makes a confident classification decision while the classifier on the other view does not.

While each of these theoretical guarantees is intriguing, the underlying assumptions are rather unrealistic in many application domains: classifiers that never make mistakes when confident, and views that are class-conditionally independent, are rarely found in practice. Nevertheless, empirical success has been reported.

Co-EM and Related Algorithms: The co-EM algorithm of [4] extended the original bootstrap approach of the co-training algorithm to operate simultaneously on all unlabeled samples in an iterative batch mode.
[5] used this idea with SVMs as base classifiers, and it was subsequently applied to unsupervised learning by [6]. However, co-EM also suffers from local-maxima problems, and while each iteration's optimization step is clear, co-EM is not really an expectation-maximization algorithm (i.e., it lacks a clearly defined overall log-likelihood that monotonically improves across iterations).

Co-Regularization: [7] proposed an approach for two-view consensus learning that simultaneously learns multiple classifiers by maximizing an objective function which penalizes misclassifications by any individual classifier and includes a regularization term penalizing a high level of disagreement between the different views. This co-regularization framework improves upon the co-training and co-EM algorithms by maximizing a convex objective function; however, the algorithm still depends on an alternating optimization that optimizes one view at a time. This approach was later adapted to two-view spectral clustering [8].

Relationship to Current Work: The present work provides a probabilistic graphical model for multi-view consensus learning; alternating-optimization-based co-regularization is shown to be just one algorithm that accomplishes ML learning in this model. A more efficient, alternative strategy is proposed here for fully Bayesian classification under the same model. In practice, this strategy offers several advantages: it is easily extended to multiple views, it accommodates noisy views which are less predictive of the class labels, and it reduces run-time and memory requirements.

3 Bayesian Co-Training

3.1 Single-View Learning with Gaussian Processes

A Gaussian process (GP) defines a nonparametric prior over functions in Bayesian statistics [9]. A random real-valued function $f : \mathbb{R}^d \to \mathbb{R}$ follows a GP, denoted by $\mathcal{GP}(h, \kappa)$, if for every finite set of data points $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$, $\mathbf{f} = \{f(\mathbf{x}_i)\}_{i=1}^n$ follows a multivariate Gaussian $\mathcal{N}(\mathbf{h}, \mathbf{K})$ with mean $\mathbf{h} = \{h(\mathbf{x}_i)\}_{i=1}^n$ and covariance $\mathbf{K} = \{\kappa(\mathbf{x}_i, \mathbf{x}_j)\}_{i,j=1}^n$. Normally we fix the mean function $h \equiv 0$, and take a parametric (and usually stationary) form for the kernel function $\kappa$ (e.g., the Gaussian kernel $\kappa(\mathbf{x}_k, \mathbf{x}_\ell) = \exp(-\rho\, \|\mathbf{x}_k - \mathbf{x}_\ell\|^2)$ with $\rho > 0$ a free parameter).

In a single-view, supervised learning scenario, an output or target $y_i$ is given for each observation $\mathbf{x}_i$ (e.g., $y_i \in \mathbb{R}$ for regression and $y_i \in \{-1, +1\}$ for classification). In the GP model we assume there is a latent function $f$ underlying the output, $p(y_i \mid \mathbf{x}_i) = \int p(y_i \mid f, \mathbf{x}_i)\, p(f)\, df$, with the GP prior $p(f) = \mathcal{GP}(h, \kappa)$. Given the latent function $f$, $p(y_i \mid f, \mathbf{x}_i) = p(y_i \mid f(\mathbf{x}_i))$ takes a Gaussian noise model $\mathcal{N}(f(\mathbf{x}_i), \sigma^2)$ for regression, and a sigmoid function $\lambda(y_i f(\mathbf{x}_i))$ for classification.

The dependency structure of the single-view GP model can be shown as an undirected graph as in Fig. 1(a). The maximal cliques of the graphical model are the fully connected nodes $(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))$ and the pairs $(y_i, f(\mathbf{x}_i))$, $i = 1, \ldots, n$. Therefore, the joint probability of the random variables $\mathbf{f} = \{f(\mathbf{x}_i)\}$ and $\mathbf{y} = \{y_i\}$ is defined as $p(\mathbf{f}, \mathbf{y}) = \frac{1}{Z}\, \psi(\mathbf{f}) \prod_i \psi(y_i, f(\mathbf{x}_i))$, with potential functions¹

$$\psi(\mathbf{f}) = \exp\!\Big(-\tfrac{1}{2}\, \mathbf{f}^\top \mathbf{K}^{-1} \mathbf{f}\Big), \qquad \psi(y_i, f(\mathbf{x}_i)) = \begin{cases} \exp\!\big(-\tfrac{1}{2\sigma^2}\, \|y_i - f(\mathbf{x}_i)\|^2\big) & \text{for regression,} \\ \lambda(y_i f(\mathbf{x}_i)) & \text{for classification,} \end{cases} \qquad (1)$$

and normalization factor $Z$ (hereafter $Z$ is defined such that the joint probability sums to 1).

¹The definition of $\psi$ in this paper has been overloaded to simplify notation, but its meaning should be clear from the function arguments.
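To make the single-view model concrete, the following is a minimal numerical sketch of GP regression under (1), using the Gaussian kernel defined above. The toy data, kernel width rho, and noise level sigma are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, rho=1.0):
    """Gaussian kernel kappa(x, x') = exp(-rho * ||x - x'||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-rho * d2)

# Toy 1-D regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
Xs = np.linspace(-3, 3, 100)[:, None]           # test inputs

sigma2 = 0.1 ** 2                               # Gaussian noise variance
K = gaussian_kernel(X, X)                       # train covariance
Ks = gaussian_kernel(Xs, X)                     # test-train covariance

# Standard GP posterior: mean = K_* (K + sigma^2 I)^{-1} y,
# cov = K_** - K_* (K + sigma^2 I)^{-1} K_*^T.
L = np.linalg.cholesky(K + sigma2 * np.eye(len(X)))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = Ks @ alpha                               # predictive mean at Xs
v = np.linalg.solve(L, Ks.T)
var = gaussian_kernel(Xs, Xs).diagonal() - (v ** 2).sum(0)  # predictive variance
```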
3.2 Undirected Graphical Model for Multi-View Learning

[Figure 1: Factor graphs for (a) the one-view and (b) the two-view models.]

In multi-view learning, suppose we have m different views of the same set of n data samples. Let $\mathbf{x}_i^{(j)} \in \mathbb{R}^{d_j}$ be the features for the i-th sample obtained using the j-th view, where $d_j$ is the dimensionality of the input space for view j. Then the vector $\mathbf{x}_i \triangleq (\mathbf{x}_i^{(1)}, \ldots, \mathbf{x}_i^{(m)})$ is the complete representation of the i-th data sample, and $\mathbf{x}^{(j)} \triangleq (\mathbf{x}_1^{(j)}, \ldots, \mathbf{x}_n^{(j)})$ represents all sample observations for the j-th view. As in single-view learning, let $\mathbf{y} = (y_1, \ldots, y_n)$, where $y_i$ is the single output assigned to the i-th data point.

One can clearly concatenate the multiple views into a single view and apply a single-view GP model, but the basic idea of multi-view learning is to introduce one function per view which uses only the features from that view, and then jointly optimize these functions such that they come to a consensus. Looking at this problem from a GP perspective, let $f_j$ denote the latent function for the j-th view (i.e., using features only from view j), and let $f_j \sim \mathcal{GP}(0, \kappa_j)$ be its GP prior in view j. Since a data sample i has only one single label $y_i$ even though it has features from multiple views (i.e., a latent function value $f_j(\mathbf{x}_i^{(j)})$ for each view j), the label $y_i$ should depend on all of these latent function values for data sample i. The challenge here is to make this dependency explicit in a graphical model.

We tackle this problem by introducing a new latent function, the consensus function $f_c$, to ensure conditional independence between the output $y$ and the m latent functions $\{f_j\}$ for the m views (see Fig. 1(b) for the undirected graphical model). At the functional level, the output $y$ depends only on $f_c$, and the latent functions $\{f_j\}$ depend on each other only via the consensus function $f_c$. That is, we have the joint probability

$$p(y, f_c, f_1, \ldots, f_m) = \frac{1}{Z}\, \psi(y, f_c) \prod_{j=1}^m \psi(f_j, f_c),$$

with some potential functions $\psi$. In the ground network with n data samples, let $\mathbf{f}_c = \{f_c(\mathbf{x}_i)\}_{i=1}^n$ and $\mathbf{f}_j = \{f_j(\mathbf{x}_i^{(j)})\}_{i=1}^n$. The graphical model leads to the following factorization:

$$p(\mathbf{y}, \mathbf{f}_c, \mathbf{f}_1, \ldots, \mathbf{f}_m) = \frac{1}{Z} \prod_i \psi(y_i, f_c(\mathbf{x}_i)) \prod_{j=1}^m \psi(\mathbf{f}_j)\, \psi(\mathbf{f}_j, \mathbf{f}_c). \qquad (2)$$

Here the within-view potential $\psi(\mathbf{f}_j)$ specifies the dependency structure within each view j, and the consensus potential $\psi(\mathbf{f}_j, \mathbf{f}_c)$ describes how the latent function in each view is related to the consensus function $f_c$. With a GP prior for each of the views, we can define the following potentials:

$$\psi(\mathbf{f}_j) = \exp\!\Big(-\tfrac{1}{2}\, \mathbf{f}_j^\top \mathbf{K}_j^{-1} \mathbf{f}_j\Big), \qquad \psi(\mathbf{f}_j, \mathbf{f}_c) = \exp\!\Big(-\frac{\|\mathbf{f}_j - \mathbf{f}_c\|^2}{2\sigma_j^2}\Big), \qquad (3)$$

where $\mathbf{K}_j$ is the covariance matrix of view j, i.e., $\mathbf{K}_j(\mathbf{x}_k, \mathbf{x}_\ell) = \kappa_j(\mathbf{x}_k^{(j)}, \mathbf{x}_\ell^{(j)})$, and $\sigma_j > 0$ is a scalar which quantifies how far the latent function $\mathbf{f}_j$ may be from $\mathbf{f}_c$. The output potential $\psi(y_i, f_c(\mathbf{x}_i))$ is defined the same as in (1) for regression or classification. A minimal code sketch of this factorization is given after the following remarks.

Some more insight may be gained by taking a careful look at these definitions: 1) The within-view potentials rely only on the intrinsic structure of each view, i.e., through the covariance $\mathbf{K}_j$ in a GP setting; 2) Each consensus potential actually defines a Gaussian over the difference of $\mathbf{f}_j$ and $\mathbf{f}_c$, i.e., $\mathbf{f}_j - \mathbf{f}_c \sim \mathcal{N}(\mathbf{0}, \sigma_j^2 \mathbf{I})$, and it can also be interpreted as assuming a conditional Gaussian for $\mathbf{f}_j$ with the consensus $\mathbf{f}_c$ as its mean.
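As referenced above, the factorization (2) with the Gaussian potentials of (3) translates directly into code. Below is a minimal sketch of the unnormalized log-joint for the regression case; the function name and interface are our own illustrative choices, not an implementation from the paper.

```python
import numpy as np

def log_joint(y, f_c, f_views, K_views, sigma_views, sigma_noise):
    """Unnormalized log of (2) with the Gaussian potentials of (3).

    y            : (n,) outputs
    f_c          : (n,) consensus function values
    f_views      : list of (n,) latent function values, one per view
    K_views      : list of (n, n) view covariance matrices K_j
    sigma_views  : list of consensus-potential scales sigma_j
    sigma_noise  : output noise std for the regression potential in (1)
    """
    # Output potentials psi(y_i, f_c(x_i)): Gaussian noise around f_c.
    lp = -0.5 * np.sum((y - f_c) ** 2) / sigma_noise ** 2
    for f_j, K_j, s_j in zip(f_views, K_views, sigma_views):
        # Within-view potential psi(f_j) = exp(-1/2 f_j^T K_j^{-1} f_j).
        lp += -0.5 * f_j @ np.linalg.solve(K_j, f_j)
        # Consensus potential psi(f_j, f_c) = exp(-||f_j - f_c||^2 / (2 s_j^2)).
        lp += -0.5 * np.sum((f_j - f_c) ** 2) / s_j ** 2
    return lp
```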
Alternatively, if we focus on $\mathbf{f}_c$, the joint consensus potentials effectively define a conditional Gaussian prior for $\mathbf{f}_c$, i.e., $\mathbf{f}_c \mid \mathbf{f}_1, \ldots, \mathbf{f}_m \sim \mathcal{N}(\boldsymbol{\mu}_c, \sigma_c^2 \mathbf{I})$, where

$$\boldsymbol{\mu}_c = \sigma_c^2 \sum_j \frac{\mathbf{f}_j}{\sigma_j^2}, \qquad \sigma_c^2 = \Big(\sum_j \frac{1}{\sigma_j^2}\Big)^{-1}. \qquad (4)$$

This can be easily verified as a product of Gaussians. It indicates that the prior mean of the consensus function $\mathbf{f}_c$ is a weighted combination of the latent functions from all the views, with the weights given by the inverse variance of each consensus potential: the higher the variance of a view, the smaller its contribution to the consensus function. More insight into this undirected graphical model can be gained from its marginals, which we discuss in detail in the following subsections. One advantage of this representation is that it allows us to see that many existing multi-view learning models are actually special cases of the proposed framework. In addition, this Bayesian interpretation also helps us understand both the benefits and the limitations of co-training.

3.3 Marginal 1: Co-Regularized Multi-View Learning

By taking the integral of (2) over $\mathbf{f}_c$ (and ignoring the output potential for the moment), we obtain the joint marginal distribution of the m latent functions:

$$p(\mathbf{f}_1, \ldots, \mathbf{f}_m) = \frac{1}{Z} \exp\!\Big(-\frac{1}{2} \sum_{j=1}^m \mathbf{f}_j^\top \mathbf{K}_j^{-1} \mathbf{f}_j - \frac{1}{2} \sum_{j<k} \frac{\|\mathbf{f}_j - \mathbf{f}_k\|^2}{\sigma_j^2 + \sigma_k^2}\Big). \qquad (5)$$
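Equation (5) can be sanity-checked numerically: for two views, marginalizing $\mathbf{f}_c$ out of the Gaussian defined by (2)-(3) via a Schur complement yields exactly the precision matrix implied by (5). The sketch below performs this check with NumPy; the random kernels and scales are arbitrary test values of ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
# Random positive-definite view covariances (arbitrary test values).
A1, A2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
K1, K2 = A1 @ A1.T + n * np.eye(n), A2 @ A2.T + n * np.eye(n)
s1, s2 = 0.7, 1.3                       # sigma_1, sigma_2
I = np.eye(n)

# Precision of (f_1, f_2, f_c) read off from the log of (2)-(3),
# ignoring the output potentials.
P = np.block([
    [np.linalg.inv(K1) + I / s1**2, np.zeros((n, n)),              -I / s1**2],
    [np.zeros((n, n)),              np.linalg.inv(K2) + I / s2**2, -I / s2**2],
    [-I / s1**2,                    -I / s2**2,  (1 / s1**2 + 1 / s2**2) * I],
])

# Marginalize f_c: Schur complement of the f_c block.
Pff, Pfc, Pcc = P[:2*n, :2*n], P[:2*n, 2*n:], P[2*n:, 2*n:]
P_marg = Pff - Pfc @ np.linalg.solve(Pcc, Pfc.T)

# Precision implied by (5) for m = 2 views.
s = s1**2 + s2**2
P5 = np.block([[np.linalg.inv(K1) + I / s, -I / s],
               [-I / s, np.linalg.inv(K2) + I / s]])

assert np.allclose(P_marg, P5)          # (5) holds exactly for two views
```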