A New Analysis of Co-Training

Wei Wang  wangw@lamda.nju.edu.cn
Zhi-Hua Zhou  zhouzh@lamda.nju.edu.cn
National Key Laboratory for Novel Software Technology, Nanjing University, China

Abstract

In this paper, we present a new analysis of co-training, a representative paradigm of disagreement-based semi-supervised learning methods. In our analysis the co-training process is viewed as a combinative label propagation over two views; this provides a possibility of bringing the graph-based and disagreement-based semi-supervised methods into a unified framework. With this analysis we obtain some insights that have not been disclosed by previous theoretical studies. In particular, we provide the sufficient and necessary condition for co-training to succeed. We also discuss the relationship to previous theoretical results and give some other interesting implications of our results, such as the combination of weight matrices and view split.

1. Introduction

Semi-supervised learning (Chapelle et al., 2006; Zhu, 2007) deals with methods for automatically exploiting unlabeled data in addition to labeled data to improve learning performance. During the past decade, many semi-supervised learning algorithms have been developed, e.g., S3VMs, graph-based methods and disagreement-based methods. Co-training (Blum & Mitchell, 1998) is a representative paradigm of disagreement-based methods (Zhou & Li, in press). In its initial form, co-training trains two classifiers separately on two sufficient and redundant views and lets the two classifiers label some unlabeled instances for each other. It has been found useful in many applications such as statistical parsing and noun phrase identification (Hwa et al., 2003; Steedman et al., 2003).

All machine learning methods work with specific assumptions, and semi-supervised learning methods are no exception. S3VMs and graph-based methods generally work with the cluster assumption or the manifold assumption. The cluster assumption concerns classification, while the manifold assumption can also be applied to tasks other than classification. Without such assumptions relating the labels to the distribution of the unlabeled data, semi-supervised learning has limited usefulness (Ben-David et al., 2008).

Like other semi-supervised methods, co-training also needs some assumptions to guarantee its success. When co-training was proposed, Blum & Mitchell (1998) proved that if the two sufficient and redundant views are conditionally independent of each other given the class label, co-training can be successful. Yu et al. (2008) proposed a graphical model for co-training based on the conditional independence assumption as well. Abney (2002) showed that weak dependence can also guarantee successful co-training. Later, a weaker assumption called ε-expansion was proved sufficient for iterative co-training to succeed (Balcan et al., 2005). The above studies give theoretical support to co-training working with two views. For tasks with only a single view, some effective variants have been developed (Goldman & Zhou, 2000; Zhou & Li, 2005). Wang & Zhou (2007) proved that if the two classifiers have large diversity, co-training style algorithms can succeed. This gives theoretical support to the success of single-view co-training variants, and also contributes to a further understanding of co-training with two views.

To the best of our knowledge, all previous analyses studied sufficient conditions for the success of co-training, while the sufficient and necessary condition remained untouched. In this paper, we provide a new analysis of co-training in which the learner in each view is viewed as label propagation, and thus the co-training process can be viewed as a combinative label propagation over the two views. Based on this new analysis, we obtain some insights that have not been discovered by previous theoretical studies. In particular, we provide the sufficient and necessary condition for co-training to succeed. We also discuss the relationship to previous theoretical results and give some other interesting implications of our results. Finally, we verify our theoretical findings empirically.
The rest of this paper is organized as follows. After introducing some preliminaries in Section 2, we provide our new analysis of co-training in Section 3 and discuss the relationship between our results and previous theoretical results in Section 4. Then we verify our theoretical results in Section 5 and study the implication of our theoretical results on view split in Section 6. Finally, we conclude the paper in Section 7.

2. Preliminaries

Suppose we have the example space X = X^1 × X^2, where X^1 and X^2 correspond to the two different views of the example space, respectively. Let D denote the distribution over X, and let C^1 and C^2 denote the concept classes over X^1 and X^2, respectively. Let Y = {-1, 1} denote the label space, and let c = (c^1, c^2), where c^1 ∈ C^1 and c^2 ∈ C^2, denote the underlying target concept. In co-training, we assume that the labels on examples are consistent with the target concepts c^1 and c^2, i.e., there is no example x = (x^1, x^2) in X such that c^1(x^1) ≠ c^2(x^2). Suppose that the data set is L ∪ U, where L = {⟨(x_1^1, x_1^2), y_1⟩, ..., ⟨(x_l^1, x_l^2), y_l⟩} ⊂ X × Y is the labeled data set and U = {(x_{l+1}^1, x_{l+1}^2), ..., (x_{l+u}^1, x_{l+u}^2)} ⊂ X is the unlabeled data set.

3. Graph View of Co-training

Co-training (Blum & Mitchell, 1998) trains two learners respectively from two different views and lets each learner label the most confident unlabeled instances so as to enlarge the training set of the other learner. Such a process can be repeated until some stopping condition is met. In this section, we show how co-training can be viewed as a combinative label propagation over the two views, and then give the sufficient and necessary condition for co-training to succeed.

Generally, assigning a label to an unlabeled instance x_t^v (v = 1, 2) based on a labeled example x_s^v can be viewed as estimating the conditional probability P(y(x_t^v) = y(x_s^v) | x_t^v, x_s^v). To control the confidence of the learner's estimation, we can set a threshold η_v > 0 (generally η_v = 1/2): if P(y(x_t^v) = y(x_s^v) | x_t^v, x_s^v) < η_v, we set P(y(x_t^v) = y(x_s^v) | x_t^v, x_s^v) = 0. Note that P(y(x_t^v) = y(x_s^v) | x_t^v, x_s^v) = 0 does not mean that x_s^v and x_t^v must have different labels. In this way, we can assign a label to x_t^v according to P(y(x_t^v) = y(x_s^v) | x_t^v, x_s^v) and the label of x_s^v. For two labeled examples x_w^v and x_q^v, if they have the same label we set P(y(x_w^v) = y(x_q^v) | x_w^v, x_q^v) = 1, and otherwise P(y(x_w^v) = y(x_q^v) | x_w^v, x_q^v) = 0. Let each entry P_{ij}^v of the matrix P^v correspond to P(y(x_i^v) = y(x_j^v) | x_i^v, x_j^v) (1 ≤ i, j ≤ n = l + u), and let f^v = (f_L^v; f_U^v) denote the label vector, initialized as f_L^v = Y_L and f_U^v = 0. Without loss of generality, P^v can be normalized to a probabilistic transition matrix according to Eq. 1:

P_{ij}^v ← P_{ij}^v / \sum_{m=1}^{n} P_{im}^v.   (1)
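As a concrete illustration of this construction, the sketch below builds a symmetric 1-NN affinity matrix (the estimate of P(y(x_i^v) = y(x_j^v) | x_i^v, x_j^v) used in the experiments of Section 5) and row-normalizes it as in Eq. 1. The function names and the choice of 1-NN as the similarity estimate are ours; any other estimate of the conditional probability could be plugged in.

```python
import numpy as np

def knn_affinity(X, k=1):
    """Symmetric 0/1 affinity: entry (i, j) is 1 if x_j is among the k
    nearest neighbours of x_i (or vice versa), 0 otherwise.  This mirrors
    the 1-NN estimate of P(y(x_i)=y(x_j)|x_i,x_j) used in Section 5."""
    n = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    np.fill_diagonal(d2, np.inf)                                # ignore self-matches
    P = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:
            P[i, j] = P[j, i] = 1.0                             # symmetrize
    return P

def row_normalize(P):
    """Eq. 1: P_ij <- P_ij / sum_m P_im, turning P into a probabilistic
    transition matrix.  Rows without any edge are left as all zeros."""
    row_sums = P.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0     # avoid division by zero for isolated nodes
    return P / row_sums
```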
In this way, the labels can be propagated from labeled examples to unlabeled instances according to the following process (Zhu, 2005): 1) propagate f^v = P^v f^v; 2) clamp the labeled data, f_L^v = Y_L; 3) repeat from step 1 until f^v converges. The labels of the unlabeled instances in U can then be assigned according to sign(f_U^v). For some unlabeled instance x_t^v, f_t^v = 0 means that label propagation on graph P^v has no idea about x_t^v. Thus, in each view the learner can be viewed as label propagation from labeled examples to unlabeled instances on graph P^v, and we focus on label propagation in this paper. The error err(f_U^v), the accuracy acc(f_U^v) and the uncertainty ⊥(f_U^v) of this graph-based method are counted on U as acc(f_U^v) = P_{x_t^v ∈ U}[f_U^v(x_t^v) · c^v(x_t^v) > 0], err(f_U^v) = P_{x_t^v ∈ U}[f_U^v(x_t^v) · c^v(x_t^v) < 0], and ⊥(f_U^v) = P_{x_t^v ∈ U}[f_U^v(x_t^v) = 0].

In one view, the labels can be propagated from the initial labeled examples to some unlabeled instances in U, and these newly labeled examples can be added into the other view. The other view can then propagate the labels of the initial labeled examples and these newly labeled examples to the remaining unlabeled instances in U. This process can be repeated until the stopping condition is met. Thus, the co-training algorithm can be re-described as the combinative label propagation process over the two views in Table 1, where S_k^1 ⊕ S_k^2 = (S_k^1 - S_k^2) ∪ (S_k^2 - S_k^1).

Table 1. Re-description of co-training
Input: Labeled examples L, unlabeled instances U and probabilistic transition matrices P^v (v = 1, 2).
Process:
  Perform label propagation from the labeled examples L to the unlabeled instances U on graph P^v and get the labeled example set S_0^v.
  iterate k = 0, 1, 2, ...
    if S_k^1 ⊕ S_k^2 = ∅ break; end if
    Perform label propagation from the labeled examples S_k^{3-v} ∩ (U - S_k^v) to the unlabeled instances U - S_k^v on graph P^v and get the labeled example set T_k^v;
    S_{k+1}^v = S_k^v ∪ T_k^v.
  end iterate
Output: f_U^v corresponding to S_k^v (v = 1, 2)
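To make the re-description concrete, here is a minimal sketch of the propagate-and-clamp process of Zhu (2005) together with a simplified variant of the Table 1 loop, in which each view clamps everything labeled so far (the original L plus whatever the other view has labeled) and re-propagates. The function names, the convergence tolerance and this simplification are our assumptions, not part of the paper.

```python
import numpy as np

def label_propagation(P, seed, clamp_mask, n_iter=1000, tol=1e-9):
    """Zhu (2005): repeat f <- P f, clamping the entries in clamp_mask to
    their seed values (+1/-1), until convergence.  seed is 0 elsewhere."""
    f = seed.astype(float).copy()
    for _ in range(n_iter):
        f_new = P @ f
        f_new[clamp_mask] = seed[clamp_mask]      # clamp the labeled data
        if np.max(np.abs(f_new - f)) < tol:
            return f_new
        f = f_new
    return f

def co_train(P1, P2, Y_L, labeled_mask):
    """Simplified sketch of Table 1: the two views alternately propagate
    labels and exchange newly labeled instances until their labeled sets
    S^1 and S^2 stop differing (S^1 (+) S^2 = empty set)."""
    P = [P1, P2]
    seed0 = np.where(labeled_mask, Y_L, 0.0)
    f = [label_propagation(P[v], seed0, labeled_mask) for v in range(2)]
    S = [set(np.flatnonzero(f[v])) for v in range(2)]        # S_0^v
    while S[0] ^ S[1]:                                       # symmetric difference non-empty
        S_old = [set(s) for s in S]
        for v in range(2):
            seed = np.sign(f[v])
            other_only = (np.sign(f[1 - v]) != 0) & (seed == 0)
            seed[other_only] = np.sign(f[1 - v])[other_only]  # labels coming from the other view
            seed[labeled_mask] = Y_L[labeled_mask]            # the original labels always win
            f[v] = label_propagation(P[v], seed, seed != 0)
            S[v] = set(np.flatnonzero(f[v]))                  # S_{k+1}^v
        if S == S_old:      # neither view labeled anything new, so stop
            break
    return f[0], f[1]
```

Reading off sign(f_U^v) gives the prediction of view v; entries that stay at 0 are the instances on which the process remains uncertain.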
3.1. Co-training with Perfect Graphs

Label propagation needs a graph, which is represented by the matrix P. In this paper, we focus on the co-training process with two graphs P^1 and P^2 constructed from the two views. How to construct a graph is an important issue studied in graph-based methods (Jebara et al., 2009; Maier et al., 2009) and is beyond the scope of this paper.

In this section we first assume that P^v (v = 1, 2) is a perfect graph, i.e., if P(y(x_t^v) = y(x_s^v) | x_t^v, x_s^v) > 0, then x_s^v and x_t^v must have the same label. It means that the learner is either "confident of labeling" or "having no idea". Before showing the sufficient and necessary condition for co-training with perfect graphs to succeed, we need Lemma 1 to indicate the relationship between label propagation and connectivity.

Lemma 1  Suppose that P is a perfect graph. An unlabeled instance x_{t_0} can be labeled by label propagation on graph P if and only if it can be connected with some labeled example x_{t_r} in graph P through a path R in the form of V_R = {t_0, t_1, ..., t_r}, where P_{t_ι t_{ι+1}} > 0 (ι = 0, ..., r-1).

Proof. It is well known (Zhu, 2005) that the label propagation process has the following closed-form solution for each connected component in graph P:

f_{U'} = (I - P_{U'U'})^{-1} P_{U'L'} Y_{L'}.   (2)

Here U' ∪ L' is a connected component in graph P, where U' ⊆ U and L' ⊆ L. If an unlabeled instance x_t cannot be connected with any labeled example, from Eq. 2 we know that f_t = 0. If x_{t_0} can be connected with some labeled example x_{t_r} through a path R in the form of V_R = {t_0, t_1, ..., t_r}, considering that P is a perfect graph we get |f_{t_0}| ≥ \prod_{ι=0}^{r-1} P_{t_ι t_{ι+1}} |y_{t_r}| > 0. Thus, x_{t_0} can be labeled with the label sign(f_{t_0}) by label propagation.

From Lemma 1 we know that when every unlabeled instance can be connected with some labeled example through a path in the perfect graph P, label propagation on graph P is successful. Now we give Theorem 1.

Theorem 1  Suppose P^v (v = 1, 2) is a perfect graph. f_U^v(x_t^v) · c^v(x_t^v) > 0 for all unlabeled instances x_t ∈ U (t = l+1, ..., l+u) if and only if S_k^1 ⊕ S_k^2 = ∅ does not occur in Table 1 until S_k^v = L ∪ U.

Proof. Here we give a proof by contradiction. Suppose that for any unlabeled instance x_t ∈ U (t = l+1, ..., l+u), f_U^v(x_t^v) · c^v(x_t^v) > 0. From Lemma 1 and the process in Table 1 we know that any unlabeled instance x_{t_0} ∈ U can be connected with some labeled example x_{t_r} ∈ L through a path R in the form of V_R = {t_0, t_1, ..., t_r}, where P^1_{t_ι t_{ι+1}} > 0 or P^2_{t_ι t_{ι+1}} > 0 (ι = 0, ..., r-1). If S_k^1 ⊕ S_k^2 = ∅ while S_k^v ≠ L ∪ U, there must exist some unlabeled instances in U - S_k^v. Considering that S_k^v is obtained by label propagation on graph P^v, from Lemma 1 we know that for any unlabeled instance x_h ∈ U - S_k^v there is no path between x_h and any labeled example x_d ∈ S_k^v in graph P^v, i.e., P_{hd}^v = 0. This contradicts the fact that any unlabeled instance in U can be connected with some labeled example in L through a path R. Therefore, if f_U^v(x_t^v) · c^v(x_t^v) > 0 for all unlabeled instances x_t, then S_k^1 ⊕ S_k^2 = ∅ does not occur until S_k^v = L ∪ U.

For the other direction, suppose the graph P^v contains τ_v connected components. If one example in some connected component is labeled, from Lemma 1 we know that all unlabeled instances in this connected component can be labeled by label propagation. If S_k^1 ⊕ S_k^2 = ∅ does not occur until S_k^v = L ∪ U, then in the k-th iteration of Table 1 the unlabeled instances in at least one connected component of either P^1 or P^2 will be labeled by label propagation. Thus, after at most τ_1 + τ_2 iterations all unlabeled instances in U are assigned labels by the process in Table 1. Considering that P^v in each view is a perfect graph, we get that for any unlabeled instance x_t ∈ U, f_U^v(x_t^v) · c^v(x_t^v) > 0.

Theorem 1 provides the sufficient and necessary condition for co-training with perfect graphs to succeed. With this theorem, for tasks with two views, if two perfect graphs can be constructed from the two views, we can decide whether co-training will be successful.
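The following toy example (data made up by us, not from the paper) illustrates the path condition used in the proof of Theorem 1: node 0 is labeled, view 1 only connects {0, 1, 2} and view 2 only connects {2, 3, 4}, so neither perfect graph alone can label every unlabeled instance, yet every instance is reachable from the labeled example through a path whose edges come from either view, and the combinative process of Table 1 therefore labels all of them.

```python
import numpy as np
from collections import deque

n = 5
P1 = np.zeros((n, n)); P2 = np.zeros((n, n))
for i, j in [(0, 1), (1, 2)]:
    P1[i, j] = P1[j, i] = 1.0        # view 1: chain 0-1-2
for i, j in [(2, 3), (3, 4)]:
    P2[i, j] = P2[j, i] = 1.0        # view 2: chain 2-3-4

def reachable_from_labeled(graphs, labeled):
    """Nodes connected to a labeled example by a path whose steps have
    positive weight in at least one of the given graphs."""
    seen, queue = set(labeled), deque(labeled)
    while queue:
        i = queue.popleft()
        for j in range(graphs[0].shape[0]):
            if j not in seen and any(G[i, j] > 0 for G in graphs):
                seen.add(j)
                queue.append(j)
    return seen

print(reachable_from_labeled([P1], labeled=[0]))      # {0, 1, 2}: view 1 alone is stuck
print(reachable_from_labeled([P2], labeled=[0]))      # {0}: view 2 alone is stuck
print(reachable_from_labeled([P1, P2], labeled=[0]))  # {0, 1, 2, 3, 4}: co-training succeeds
```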
3.2. Co-training with Non-perfect Graphs

In many real applications, it is generally hard to construct a perfect graph. In this section we discuss the case where the perfect graph assumption is waived. In label propagation on a non-perfect graph, an unlabeled instance may be connected with labeled examples belonging to different classes. As discussed in the proof of Lemma 1, the label propagation for each connected component in graph P has the closed form f_{U'} = (I - P_{U'U'})^{-1} P_{U'L'} Y_{L'}. Letting A = (I - P_{U'U'})^{-1}, we get Eq. 3 from Eq. 2:

f_t = \sum_{s ∈ L'} \sum_{j ∈ U'} A_{tj} P_{js} Y_s   (t ∈ U').   (3)

From Eq. 3 we know that in each connected component the contribution of the labeled example x_s (s ∈ L') to the unlabeled instance x_t (t ∈ U') is \sum_{j ∈ U'} A_{tj} P_{js} Y_s. We now define the positive contribution and the negative contribution for an unlabeled instance.

Definition 1  Let L' denote the labeled examples and U' the unlabeled instances belonging to a connected component in graph P. For an unlabeled instance x_t (t ∈ U'), the positive contribution to x_t is

\sum_{s ∈ L': Y_s = y_t} \sum_{j ∈ U'} A_{tj} P_{js} |Y_s|   (4)

and the negative contribution to x_t is

\sum_{s ∈ L': Y_s ≠ y_t} \sum_{j ∈ U'} A_{tj} P_{js} |Y_s|.   (5)

If the positive contribution is larger than the negative contribution, the unlabeled instance x_t will be labeled correctly by label propagation. (Throughout the paper we neglect the probability mass on the instances for which the non-zero positive contribution is exactly equal to the non-zero negative contribution.) Now we give Theorem 2 for co-training with non-perfect graphs.

Theorem 2  Suppose P^v (v = 1, 2) is a non-perfect graph. f_U^v(x_t^v) · c^v(x_t^v) > 0 for all unlabeled instances x_t ∈ U (t = l+1, ..., l+u) if and only if both (1) and (2) hold in Table 1: (1) S_k^1 ⊕ S_k^2 = ∅ does not occur until S_k^v = L ∪ U; (2) for any unlabeled instance in a connected component Ω_k^v, where Ω_k^v ⊆ (U - S_k^v) and Ω_k^v ∩ S_k^{3-v} ≠ ∅, its positive contribution is larger than its negative contribution.

Proof. Here we give a proof by contradiction. Suppose that for any unlabeled instance x_t ∈ U, f_U^v(x_t^v) · c^v(x_t^v) > 0. If S_k^1 ⊕ S_k^2 = ∅ while S_k^v ≠ L ∪ U, then for any unlabeled instance x = (x^1, x^2) in U - S_k^v we have f_U^v(x^v) = 0, which contradicts f_U^v(x^v) · c^v(x^v) > 0. If for some unlabeled instance x in a connected component Ω_k^v, where Ω_k^v ⊆ (U - S_k^v) and Ω_k^v ∩ S_k^{3-v} ≠ ∅, the positive contribution is no larger than the negative contribution, then f_U^v(x^v) · c^v(x^v) ≤ 0, which again contradicts f_U^v(x^v) · c^v(x^v) > 0. Conversely, if conditions (1) and (2) hold, with Definition 1 it is easy to get that for any unlabeled instance x_t ∈ U, f_U^v(x_t^v) · c^v(x_t^v) > 0.

Theorem 2 provides the sufficient and necessary condition for co-training with non-perfect graphs to succeed.
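To ground Definition 1, the sketch below evaluates the closed form of Eq. 2 on one connected component and splits each unlabeled instance's score into the mass received from the +1-labeled and from the -1-labeled examples; for an instance whose true label is y_t, the positive contribution of Eq. 4 is the part matching y_t and the negative contribution of Eq. 5 is the other part. Function and variable names are ours.

```python
import numpy as np

def component_contributions(P, U_idx, L_idx, Y_L):
    """For one connected component (unlabeled indices U_idx, labeled indices
    L_idx with labels Y_L in {-1, +1}) of a row-normalized graph P, return
    the propagated scores f_U (Eq. 2 / Eq. 3) together with the mass each
    unlabeled instance receives from +1- and from -1-labeled examples."""
    P_UU = P[np.ix_(U_idx, U_idx)]
    P_UL = P[np.ix_(U_idx, L_idx)]
    # I - P_UU is invertible when the component contains a labeled example
    A = np.linalg.inv(np.eye(len(U_idx)) - P_UU)
    C = A @ P_UL                        # C[t, s] = sum_j A_tj P_js   (Eq. 3)
    f_U = C @ Y_L
    from_pos = C @ np.maximum(Y_L, 0)   # mass contributed by +1 labeled examples
    from_neg = C @ np.maximum(-Y_L, 0)  # mass contributed by -1 labeled examples
    return f_U, from_pos, from_neg
```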
Note that in both Theorem 1 and Theorem 2, the requirement that S_k^1 ⊕ S_k^2 = ∅ does not occur until S_k^v = L ∪ U (v = 1, 2) is a necessary condition. In the remainder of this section we study what this necessary condition means and how to verify it before co-training. First, we introduce the combinative graph P^c in Eq. 6, which aggregates the graphs P^1 and P^2:

P_{ij}^c = max[P_{ij}^1, P_{ij}^2].   (6)

Then we give Theorem 3, which shows that the necessary condition is exactly that each unlabeled instance can be connected with some labeled example in graph P^c.

Theorem 3  S_k^1 ⊕ S_k^2 = ∅ does not occur in Table 1 until S_k^v = L ∪ U (v = 1, 2) if and only if each unlabeled instance x_{t_0} ∈ U can be connected with some labeled example x_{t_r} ∈ L in graph P^c through a path R^c in the form of V_{R^c} = {t_0, t_1, ..., t_r}, where P^c_{t_ι t_{ι+1}} > 0 (ι = 0, ..., r-1).

Proof. Neglecting the probability mass on the instances for which the non-zero positive contribution is equal to the non-zero negative contribution, similarly to the proof of Lemma 1 we get that an unlabeled instance can be labeled by label propagation on graph P if and only if it can be connected with some labeled example in graph P through a path.

If S_k^1 ⊕ S_k^2 = ∅ does not occur until S_k^v = L ∪ U, any unlabeled instance x_t ∈ U can be labeled by the process in Table 1, so x_t must belong to one of S_0^1, S_0^2, T_k^1 or T_k^2 for some k ≥ 0. Considering Eq. 6, the above discussion, and the fact that S_0^1, S_0^2, T_k^1 and T_k^2 have been obtained in a previous iteration by label propagation and are used as labeled examples in the next iteration, we get that x_{t_0} can be connected with some labeled example x_{t_r} ∈ L in graph P^c through a path R^c.

Conversely, if each unlabeled instance x_{t_0} ∈ U can be connected with some labeled example x_{t_r} ∈ L through a path R^c, then by Eq. 6 either P^1_{t_ι t_{ι+1}} > 0 or P^2_{t_ι t_{ι+1}} > 0 for ι = 0, ..., r-1. Because x_{t_r} is a labeled example, with the above discussion and the process in Table 1 we know that x_{t_{r-1}}, ..., x_{t_0} can be labeled by label propagation on either P^1 or P^2. Therefore, finally S_k^v = L ∪ U.
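Theorem 3 reduces the necessary condition to a pure connectivity test on the combinative graph, which can be run before co-training. A minimal sketch, assuming dense NumPy matrices and using SciPy's connected-components routine:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def necessary_condition_holds(P1, P2, labeled_idx):
    """Build P^c = max(P^1, P^2) (Eq. 6) and check Theorem 3's condition:
    every instance must lie in a connected component of P^c that contains
    at least one labeled example."""
    Pc = np.maximum(P1, P2)
    _, comp = connected_components((Pc > 0).astype(int), directed=False)
    labeled_comps = set(comp[list(labeled_idx)])
    return all(c in labeled_comps for c in comp)
```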
3.3. Co-training with ε-Good Graphs

It is somewhat over-optimistic to expect to learn the target concept perfectly using co-training with non-perfect graphs, while learning an approximately correct concept using co-training with approximately perfect graphs is more reasonable. In a perfect graph, all edges between the examples are reliable; in a non-perfect graph, it is hard to know which and how many edges are reliable. Restricting the reliability and allowing an ε-fraction of exceptions is more feasible in real applications. In this section, we focus on approximately perfect graphs and provide a sufficient condition for co-training to learn an approximately correct concept.

Let Ω_1^v, ..., Ω_{τ_v}^v (v = 1, 2) denote the τ_v connected components in graph P^v. The definitions of purity and ε-good graph are given as follows.

Definition 2  Let pur(Ω_i^v) denote the purity of the connected component Ω_i^v in graph P^v, i.e.,

pur(Ω_i^v) = max[ |{x^v : x^v ∈ Ω_i^v ∧ c^v(x^v) = 1}|, |{x^v : x^v ∈ Ω_i^v ∧ c^v(x^v) = -1}| ] / |Ω_i^v|.   (7)

If pur(Ω_i^v) ≥ 1 - ε for all 1 ≤ i ≤ τ_v, we say that P^v is an ε-good graph.

With an ε-good graph, predicting the labels of all connected components Ω_i^v correctly is sufficient to get a learner whose error rate is less than ε. The purity of the connected components reflects the reliability of the graph: the higher the purity, the more reliable the graph. With the purity, we can define the label of Ω_i^v as

c^v(Ω_i^v) = 1 if |{x^v : x^v ∈ Ω_i^v ∧ c^v(x^v) = 1}| ≥ |{x^v : x^v ∈ Ω_i^v ∧ c^v(x^v) = -1}|, and c^v(Ω_i^v) = -1 otherwise.

From Definition 1 we know that the contribution is related to the number of labeled examples in the connected component. In a connected component, if the labeled examples with label y (y ∈ {-1, 1}) are many more than the labeled examples with label -y, the unlabeled instances belonging to this connected component may be labeled with y. Based on this, we assume that graph P^v satisfies the following condition: in the connected component Ω_k^v of graph P^v where Ω_k^v ⊆ (U - S_k^v) and Ω_k^v ∩ S_k^{3-v} ≠ ∅, let f_k^{3-v} denote the learner corresponding to S_k^{3-v}; if

|{x_t : x_t ∈ Ω_k^v ∩ S_k^{3-v} ∧ f_k^{3-v}(x_t) · y > 0}| / |Ω_k^v| > |{x_t : x_t ∈ Ω_k^v ∩ S_k^{3-v} ∧ f_k^{3-v}(x_t) · y < 0}| / |Ω_k^v| + γ,

then the unlabeled instances belonging to Ω_k^v can be labeled with y by label propagation on graph P^v. Here γ ∈ [0, 1) can be thought of as a form of margin which controls the confidence in label propagation. With this assumption, we get Theorem 4, which provides a margin-like sufficient condition for co-training to learn an approximately correct concept with ε-good graphs.

Theorem 4  Suppose P^v (v = 1, 2) is an ε-good graph. acc(f_U^v) ≥ 1 - ε if both (1) and (2) hold in Table 1: (1) S_k^1 ⊕ S_k^2 = ∅ does not occur until S_k^v = L ∪ U; (2) in the connected component Ω_k^v, where Ω_k^v ⊆ (U - S_k^v) and Ω_k^v ∩ S_k^{3-v} ≠ ∅,

|{x_t : x_t ∈ Ω_k^v ∩ S_k^{3-v} ∧ f_k^{3-v}(x_t) · c^v(Ω_k^v) > 0}| / |Ω_k^v| > |{x_t : x_t ∈ Ω_k^v ∩ S_k^{3-v} ∧ f_k^{3-v}(x_t) · c^v(Ω_k^v) < 0}| / |Ω_k^v| + γ.
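As a diagnostic that needs the underlying concept (so it is only usable on benchmark data where the true labels are known), the following sketch computes the purity of Eq. 7 and the component labels c^v(Ω_i^v) defined above, and checks whether a graph is ε-good; the names are ours.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def purity_and_component_labels(P, y_true):
    """Purity (Eq. 7) and label c(Omega_i) of every connected component of
    graph P, given the underlying +1/-1 concept y_true."""
    _, comp = connected_components((P > 0).astype(int), directed=False)
    purities, labels = {}, {}
    for c in np.unique(comp):
        members = y_true[comp == c]
        n_pos = int(np.sum(members == 1))
        n_neg = int(np.sum(members == -1))
        purities[c] = max(n_pos, n_neg) / len(members)
        labels[c] = 1 if n_pos >= n_neg else -1
    return purities, labels

def is_epsilon_good(P, y_true, eps):
    """P is an eps-good graph when every component has purity >= 1 - eps."""
    purities, _ = purity_and_component_labels(P, y_true)
    return all(p >= 1 - eps for p in purities.values())
```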
4. Relationship to Previous Results

There are several theoretical analyses of co-training indicating that co-training can succeed if some condition holds, i.e., conditional independence, weak dependence, ε-expansion or large diversity. In this section we first discuss the relationship between our results and these previous results, and then discuss some other interesting implications of our results.

4.1. Conditional Independence

Blum & Mitchell (1998) proved that when the two sufficient views are conditionally independent given the class label, co-training can be successful. The conditional independence means that for the connected components Ω_i^1 of P^1 and Ω_j^2 of P^2, P(Ω_i^1 ∩ Ω_j^2) = P(Ω_i^1) P(Ω_j^2). Since S_k^v (v = 1, 2) is the union of some connected components of P^v, we have P(S_k^1 ∩ S_k^2) = P(S_k^1) P(S_k^2). It means that P(S_k^1 ⊕ S_k^2) = P(S_k^1)(1 - P(S_k^2)) + P(S_k^2)(1 - P(S_k^1)), which implies that condition (1) in Theorem 4 holds. In addition, Eqs. 8 and 9 can be obtained for ε-good graphs:

P[x_t ∈ Ω_k^v ∩ S_k^{3-v} ∧ f_k^{3-v}(x_t) · c^v(Ω_k^v) > 0] ≥ P(Ω_k^v) P(S_k^{3-v}) (1 - ε)   (8)

P[x_t ∈ Ω_k^v ∩ S_k^{3-v} ∧ f_k^{3-v}(x_t) · c^v(Ω_k^v) < 0] ≤ ε P(Ω_k^v) P(S_k^{3-v}).   (9)

Thus, we get that condition (2) in Theorem 4 holds with γ = P(S_k^{3-v})(1 - 2ε). However, in real applications the conditional independence assumption is too strong to be satisfied by the two views.

4.2. Weak Dependence

Abney (2002) found that weak dependence can lead to successful co-training. The weak dependence means that for the connected components Ω_i^1 of P^1 and Ω_j^2 of P^2, P(Ω_i^1 ∩ Ω_j^2) ≥ α P(Ω_i^1) P(Ω_j^2) for some α > 0. It implies that the number of examples in S_k^1 ⊕ S_k^2 is not very small, so condition (1) in Theorem 4 holds. For ε-good graphs, without loss of generality, assume that P(Ω_k^v ∩ S_k^{3-v}) = ε_1 P(Ω_k^v) P(S_k^{3-v}) and that

P[x_t ∈ Ω_k^v ∩ S_k^{3-v} ∧ f_k^{3-v}(x_t) · c^v(Ω_k^v) < 0] ≤ ε_2 P(Ω_k^v) P(S_k^{3-v})   (10)

for some ε_1 > 0 and ε_2 > 0. We then have

P[x_t ∈ Ω_k^v ∩ S_k^{3-v} ∧ f_k^{3-v}(x_t) · c^v(Ω_k^v) > 0] ≥ P(Ω_k^v ∩ S_k^{3-v}) - ε_2 P(Ω_k^v) P(S_k^{3-v}) = P(Ω_k^v) P(S_k^{3-v}) (ε_1 - ε_2).   (11)

Thus, we get that condition (2) in Theorem 4 holds with γ = P(S_k^{3-v})(ε_1 - 2ε_2).

4.3. ε-Expansion

Balcan et al. (2005) proposed ε-expansion and proved that it can guarantee the success of co-training. They assumed that the learner in each view is never "confident but wrong", which corresponds to the case with perfect graphs in Theorem 1. The ε-expansion condition requires S_k^1 and S_k^2 to satisfy Pr(S_k^1 ⊕ S_k^2) ≥ ε min[Pr(S_k^1 ∩ S_k^2), Pr(S̄_k^1 ∩ S̄_k^2)]. When ε-expansion is met, it is easy to see that the condition in Theorem 1 holds. Note that S_k^1 ⊕ S_k^2 ≠ ∅ is weaker than ε-expansion, since Pr(S_k^1 ⊕ S_k^2) does not need to have a lower bound with respect to some positive ε.

4.4. Large Diversity

Wang & Zhou (2007) showed that when the diversity between the two learners is larger than their errors, the performance of the learners can be improved by co-training style algorithms. Since with non-perfect graphs the learners in Table 1 have both error and uncertainty, it is very complicated to define the diversity between them; therefore, we only discuss co-training with perfect graphs here. For perfect graphs, the learners in Table 1 are "confident of labeling", so the error is 0. Thus, the requirement that the diversity between the two learners be larger than their errors means Pr(S_k^1 ⊕ S_k^2) > 0, which implies that the condition in Theorem 1 holds.

4.5. Other Implications

From the above discussions it can be seen that if any of the previous conditions holds, our condition also holds; this means that our results are more general and tighter. Our results also have other interesting implications.

Firstly, there have been some works which combine the weight matrices or the Laplacians of several graphs and then classify unlabeled instances according to the combination (Sindhwani et al., 2005; Argyriou et al., 2006; Zhang et al., 2006; Zhou & Burges, 2007), yet the underlying principle is not clear. To some extent, Theorem 3 provides theoretical support to these methods, i.e., they can be regarded as attempts to satisfy, as far as possible, the necessary condition for co-training with graphs to succeed.

Secondly, in tasks where two views do not exist, several single-view variants of co-training have been developed (Goldman & Zhou, 2000; Zhou & Li, 2005); to apply the standard two-view co-training directly, view split is a possible solution. This has been explored by Nigam & Ghani (2000) and Brefeld et al. (2005). Their studies show that when there are many features and the features have much redundancy, a random split of the features is able to generate two views that enable co-training to outperform several other single-view learning algorithms. However, it is evident that a random split will not be effective in most cases, and how to judge a method for view split is also an open problem. From Theorem 3 we know that the necessary condition for co-training to succeed is that each unlabeled instance can be connected with some labeled example in the combinative graph P^c. This implies a possible view split method, i.e., to select the view split which makes more unlabeled instances become connected with labeled examples in graph P^c. In Section 6, we report some preliminary experimental results on this method.

5. Verification of Theoretical Results

We use the artificial data set (Muslea et al., 2002) and the course data set (Blum & Mitchell, 1998) in the experiments. The artificial data set has two artificial views, which are created by randomly pairing two examples from the same class, and contains 800 examples. For controlling the connectivity between the two views, the number of clusters per class can be set as a parameter; here we use 2 clusters and 4 clusters, respectively. The course data set has two natural views, the pages view (i.e., the text appearing on the page) and the links view (i.e., the anchor text attached to hyper-links pointing to the page), and contains 1,051 examples.

We use 1-NN in each view to approximate the matrix P^v (v = 1, 2), i.e., if example x_s^v is the nearest neighbor of x_t^v, then P_{st}^v = P_{ts}^v = 1 and otherwise P_{st}^v = P_{ts}^v = 0. The combinative graph P^c is constructed according to Eq. 6, and P^1, P^2 and P^c are normalized according to Eq. 1. We randomly select some data to be used as the labeled data set L and use the remaining data to generate the unlabeled data set U; the error rate is calculated over U. To study the performance with different amounts of labeled examples, we run experiments with different sizes of L, from 10% to 50% with an interval of 5%. The experiments are repeated for 20 runs and the average results are shown in Figure 1.

From Figures 1(a), 1(c) and 1(e) we can see that on both data sets the performance of co-training is much better than that of the learner in each single view. This can be explained by our graph view of co-training.
As Figures 1(b), 1(d) and 1(f) show, the amount of unlabeled instances that are not connected with any labeled example in the graph P^c is much smaller than that in the graphs P^1 and P^2. So co-training can label not only the unlabeled instances that can be labeled by a single view, but also unlabeled instances that cannot be labeled by either single view.

[Figure 1 omitted: six panels, (a)/(b) artificial data with 2 clusters, (c)/(d) artificial data with 4 clusters, (e)/(f) course data.]
Figure 1. Results on the artificial data set and the course data set. The x-axis shows the amount of labeled examples (as the ratio of L); the y-axis in (a), (c) and (e) shows the error rate of co-training and of the learners using a single view; the y-axis in (b), (d) and (f) shows the amount (in ratio) of unlabeled data that are not connected with any labeled example in the graphs P^1, P^2 and P^c.

6. View Split for Single-view Data

As mentioned at the end of Section 4, our theoretical results imply a method which enables co-training to work on single-view data, i.e., to select the view split which makes more unlabeled instances become connected with labeled examples in graph P^c, and then generate two views for co-training to work on.

We use the course data set with only the pages view here, so the experimental data has a single view. We split the features of the pages view into two parts randomly, ten times, and use 1-NN to approximate the matrices. The combinative graph P^c is constructed according to Eq. 6. Let P denote the graph on the pages view; P and P^c are normalized according to Eq. 1, and L and U are generated in a way similar to that described in Section 5.
Among the ten view splits, we select the one which leads to the largest amount of unlabeled instances connected with labeled examples in the graph P^c. The results are shown in Figure 2(a), where we also present the performance of using the original pages view and of co-training with a random view split. From Figure 2(a) we can see that the performance of co-training with the selected view split is always better than that of using the single view, and is superior to co-training with a random view split except in very few cases where they are comparable.

To study the result further, we calculate the eigenvalues of the Laplacian matrices related to the graphs P, P_r^c and P_s^c (P_r^c corresponds to the combinative graph with the random view split and P_s^c to the combinative graph with the selected view split) and sort the absolute values of these eigenvalues in ascending order. The first 60 are plotted in Figure 2(b). Setting the threshold to 10^{-10}, we find that the Laplacian matrices related to the graphs P and P_r^c have 27 and 13 eigenvalues whose absolute values are smaller than the threshold, respectively, while the Laplacian matrix related to the graph P_s^c has only 9 such eigenvalues. This implies that the graph P has 27 connected components and the graph P_r^c has 13 connected components, while the graph P_s^c has only 9 connected components, an apparent improvement. In other words, through the selected view split, more unlabeled instances become connected with labeled examples in the graph P_s^c. This validates the usefulness of the simple view split method derived from our theoretical results.

[Figure 2 omitted: two panels.]
Figure 2. Results of the view split method. (a) The x-axis shows the amount of labeled examples (as the ratio of L), and the y-axis shows the error rate of the pages view, of co-training with a random split and of co-training with the graph-based (selected) split; (b) the x-axis shows the index of the ascendingly ordered eigenvalues, and the y-axis shows the absolute values of the eigenvalues of the Laplacians of P, P_r^c and P_s^c.
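A minimal sketch of this view-split heuristic, under our own naming and reusing the knn_affinity helper sketched in Section 3: split the features at random several times, build the two 1-NN graphs and their combinative graph P^c, and keep the split under which the largest number of unlabeled instances is connected with a labeled example.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
# knn_affinity: the 1-NN graph builder sketched in Section 3 (our helper)

def select_view_split(X, labeled_idx, n_trials=10, seed=0):
    """Try n_trials random splits of the features into two halves and return
    the split (two index arrays) maximizing the number of unlabeled
    instances connected with labeled examples in P^c (Eq. 6)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    unlabeled = [i for i in range(n) if i not in set(labeled_idx)]
    best_split, best_count = None, -1
    for _ in range(n_trials):
        perm = rng.permutation(d)
        v1, v2 = perm[: d // 2], perm[d // 2:]
        P1, P2 = knn_affinity(X[:, v1]), knn_affinity(X[:, v2])
        Pc = np.maximum(P1, P2)                                   # Eq. 6
        _, comp = connected_components((Pc > 0).astype(int), directed=False)
        labeled_comps = set(comp[list(labeled_idx)])
        count = sum(comp[i] in labeled_comps for i in unlabeled)  # connected unlabeled
        if count > best_count:
            best_split, best_count = (v1, v2), count
    return best_split
```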
7. Conclusion

In this paper, we provide a new analysis of co-training, based on which we obtain the sufficient and necessary condition for co-training to succeed. Although there have been many theoretical studies on co-training, to the best of our knowledge this is the first result on the sufficient and necessary condition for co-training. We also discuss the relationship between our results and previous theoretical results. Moreover, our results have some other interesting implications, such as the combination of weight matrices and view split.

Our results can be extended to multi-view cases, and a similar sufficient and necessary condition for multi-view learning can be obtained. Note that such an extension only cares about the multiple matrices rather than where these matrices come from, and therefore it is also suited to applications where there is only one view in the data set but multiple conditional probability matrices can be obtained in different concept spaces.

It is noteworthy that in previous semi-supervised learning studies, the disagreement-based and graph-based methods were developed separately, in two parallel threads, while our analysis provides a possibility of bringing them into a unified framework. This will be explored further in the future.

Acknowledgments

Supported by NSFC (60635030, 60721002), 973 Program (2010CB327903) and JiangsuSF (BK2008018).

References

Abney, S. Bootstrapping. In ACL, pp. 360-367, 2002.

Argyriou, A., Herbster, M., and Pontil, M. Combining graph Laplacians for semi-supervised learning. In NIPS 18, pp. 67-74, 2006.

Balcan, M.-F., Blum, A., and Yang, K. Co-training and expansion: Towards bridging theory and practice. In NIPS 17, pp. 89-96, 2005.

Ben-David, S., Lu, T., and Pal, D. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In COLT, pp. 33-44, 2008.

Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. In COLT, pp. 92-100, 1998.

Brefeld, U., Büscher, C., and Scheffer, T. Multi-view discriminative sequential learning. In ECML, pp. 60-71, 2005.

Chapelle, O., Schölkopf, B., and Zien, A. (eds.). Semi-Supervised Learning. MIT Press, 2006.

Goldman, S. and Zhou, Y. Enhancing supervised learning with unlabeled data. In ICML, pp. 327-334, 2000.

Hwa, R., Osborne, M., Sarkar, A., and Steedman, M. Corrected co-training for statistical parsers. In ICML Workshop, 2003.

Jebara, T., Wang, J., and Chang, S.-F. Graph construction and b-matching for semi-supervised learning. In ICML, pp. 441-448, 2009.

Maier, M., von Luxburg, U., and Hein, M. Influence of graph construction on graph-based clustering measures. In NIPS 21, pp. 1025-1032, 2009.

Muslea, I., Minton, S., and Knoblock, C. A. Active + semi-supervised learning = robust multi-view learning. In ICML, pp. 435-442, 2002.

Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. In CIKM, pp. 86-93, 2000.

Sindhwani, V., Niyogi, P., and Belkin, M. A co-regularization approach to semi-supervised learning with multiple views. In ICML Workshop, 2005.

Steedman, M., Osborne, M., Sarkar, A., Clark, S., Hwa, R., Hockenmaier, J., Ruhlen, P., Baker, S., and Crim, J. Bootstrapping statistical parsers from small data sets. In EACL, pp. 331-338, 2003.

Wang, W. and Zhou, Z.-H. Analyzing co-training style algorithms. In ECML, pp. 454-465, 2007.

Yu, S., Krishnapuram, B., Rosales, R., Steck, H., and Rao, R. B. Bayesian co-training. In NIPS 20, pp. 1665-1672, 2008.

Zhang, T., Popescul, A., and Dom, B. Linear prediction models with graph regularization for web-page categorization. In SIGKDD, pp. 821-826, 2006.

Zhou, D. and Burges, C. Spectral clustering and transductive learning with multiple views. In ICML, pp. 1159-1166, 2007.

Zhou, Z.-H. and Li, M. Tri-training: Exploiting unlabeled data using three classifiers. IEEE TKDE, 17(11):1529-1541, 2005.

Zhou, Z.-H. and Li, M. Semi-supervised learning by disagreement. KAIS, in press.

Zhu, X. Semi-supervised learning with graphs. PhD thesis, CS School, CMU, 2005.

Zhu, X. Semi-supervised learning literature survey. Technical Report 1530, Department of Computer Sciences, University of Wisconsin at Madison, 2007.