On Multi-View Active Learning and the Combination with Semi-Sup ervised Learning

Wei Wang wangw@lamda.nju.edu.cn Zhi-Hua Zhou zhouzh@lamda.nju.edu.cn National Key Laboratory for Novel Software Technology, Nanjing University, China

Abstract
Multi-view learning has become a hot topic during the past few years. In this paper, we first characterize the sample complexity of multi-view active learning. Under the expansion assumption, we get an exponential improvement in the sample complexity from usual O( 1 ) to O(log 1 ), requiring neither strong assumption on data distribution such as the data is distributed uniformly over the unit sphere in Rd nor strong assumption on hypothesis class such as linear separators through the origin. We also give an upper bound of the error rate when the -expansion assumption does not hold. Then, we analyze the combination of multi-view active learning and semi-supervised learning and get a further improvement in the sample complexity. Finally, we study the empirical behavior of the two paradigms, which verifies that the combination of multi-view active learning and semi-supervised learning is efficient.

supervised learning. Some approaches use a generative model for the classifier and employ EM to model the label estimation or parameter estimation process (Dempster et al., 1977; Miller & Uyar, 1997; Nigam et al., 2000); some approaches use the unlabeled data to regularize the learning process in various ways, e.g., defining a graph on the data set and then enforcing the label smoothness over the graph as a regularization term (Belkin et al., 2001; Zhu et al., 2003; Zhou et al., 2005); some approaches use the multi-view setting to train learners and then let the learners to label unlabeled examples (Blum & Mitchell, 1998; Goldman & Zhou, 2000; Zhou & Li, 2005). The multi-view setting is first formalized by Blum and Mitchell (1998), where there are several disjoint subsets of features (each subset is called as a view ), each of which is sufficient for learning the target concept. For example, the web page classification task has two views, i.e., the text appearing on the page itself and the anchor text attached to hyper-links pointing to this page (Blum & Mitchell, 1998); the speech recognition task also has two views, i.e., sound and lip motion (de Sa & Ballard, 1998). Another important paradigm for using unlabeled data to complement labeled data, which is the focus of this paper, is active learning (Cohn et al., 1994; Freund et al., 1997; Tong & Koller, 2001; Melville & Mooney, 2004). In active learning, the learners actively ask the user to label the most informative examples and hope to learn a good classifier with as few labeled examples as possible. There have been many theoretical analyses on the sample complexity of single-view active learning. For some simple learning tasks the sample complexity of active learning can be O(log 1 ) which is exponentially improved in contrast to O( 1 ) of passive learning taking into account the desired accuracy bound . Unfortunately, such an exponential improvement is not always achievable in active learning. Dasgupta (2006) illustrated that if the hypothesis class H is linear separators in R2 and if the data distribution is some density

1. Intro duction
Learning from labeled data is well-established in machine learning, but labeling the training data is time consuming, sometimes may be very expensive since it may need human efforts. In many machine learning applications, unlabeled data can often be obtained abundantly and cheaply, so there has recently been substantive interest in using large amount of unlabeled data together with labeled data to achieve better learning performance. There are two popular paradigms for using unlabeled data to complement labeled data. One is semiAppearing in Proceedings of the 25 th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).


On Multi-View Active Learning and the Combination with Semi-Sup ervised Learning

supported on the perimeter of the unit circle, there are some target hypotheses in H for which ( 1 ) labels are needed to find a classifier with error rate less than , no matter what active learning approach is used. Under the strong assumptions that the hypothesis class is linear separators through the origin, that the data is distributed uniformly over the unit sphere in Rd , and that the learning task is a realizable case (i.e., there exists a hypothesis perfectly separating the data), the sample complexity of active learning is O(d log 1 ) taking into account the desired accuracy bound (Freund et al., 1997; Dasgupta et al., 2005) 1 . For some known data distribution D and specific hypothesis class, Dasgupta (2006) gave the coarse sample complexity bounds for realizable active learning. The study of sample complexity of active learning for realizable case without strong assumptions on the data distribution and the hypothesis class remains an open problem. All the above results were obtained under the singleview setting. The first algorithm for active learning in multi-view setting is co-testing (Muslea et al., 2000; Muslea et al., 2006). It focuses on the set of contention points (i.e., unlabeled examples on which different views predict different labels) and asks the user to label some of them. This is somewhat related to Query-by-Committee (Freund et al., 1997) since cotesting also uses more than one learners to identify the most informative unlabeled examples to query, but the typical Query-by-Committee works under a singleview setting while co-testing exploits the multi-views explicitly. It was reported that co-testing outperforms existing active learners on a variety of real-world domains such as wrapper induction, Web page classification, advertisement removal and discourse tree parsing. To the best of our knowledge, however, there is no theoretical result on the sample complexity of multi-view active learning. In this paper, we first theoretically analyze the sample complexity of multi-view active learning under the expansion assumption which is first mentioned by Balcan et al. (2005) and prove that the sample complexity of multi-view active learning can be exponentially improved to O(log 1 ). A clear advantage is that we do not use strong assumptions which were employed in most previous studies, such as the hypothesis class is linear separators through the origin and the data is distributed uniformly over the unit sphere in Rd . In case the -expansion assumption does not hold, we give an upper bound of the error rate. Second, we analyze the combination of multi-view active learning and
e The O notation is used to hide factors log log( 1 ), log(d) and log( 1 ) 
1

semi-supervised learning and get an further improvement in the sample complexity. Finally, we study the empirical behavior of the two paradigms, which verifies that the combination of multi-view active learning and semi-supervised learning is more efficient than pure multi-view active learning. The rest of this paper is organized as follows. After introducing some preliminaries in Section 2, we analyze the sample complexity of multi-view active learning in Section 3. Then we analyze the sample complexity of the combination of multi-view active learning and semi-supervised learning in Section 4 and study the empirical behavior in Section 5. Finally we conclude the paper in Section 6.

2. Preliminaries
In the multi-view setting, an example x is described with several different disjoint sets of features. Without loss of generality, in this paper we only consider the two-view setting for the sake of simplicity. Suppose that the example space X = X1 × X2 is with some unknown distribution D, X1 and X2 are the two views, and Y = {-1, 1} is the label space. Let c = (c1 , c2 ) be the underlying target concept, where c1 and c2 are the underlying target concepts in the two views, respectively. Suppose that the example space is consistent, that is, there is no such example x = (x1 , x2 ) that c1 (x1 ) = c2 (x2 ) in X . Let H1 and H2 be the hypothesis class in each view, respectively. For any hj  Hj and x = (x1 , x2 ) we say xj  hj if and only if hj (xj ) = cj (xj ) (j = 1, 2). In this way any hypothesis in Hj can be thought of as a subset of Xj . In each round of iterative multi-view active learning, the learners ask the user to label some unlabeled examples and add them into the labeled training data. These newly labeled examples provide more information about the data distribution. In this paper, we consider the co-testing -like Paradigm 1 described in Table 1. In Paradigm 1, the learners ask the user to label some contention points to refine the classifiers. If the confident set of each view is expanding by considering the other view together, Paradigm 1 may succeed. Intuitively, we can use the -expansion assumption to analyze the process. Suppose S1  X1 and S2  X2 denote the examples that are correctly classified in each view, respectively. Let P r(S1  S2 ) denote the probability mass on examples that are correctly classified in both views, while P r(S1  S2 ) denotes the probability mass on examples that are correctly classified only in one view (i.e., examples disagreed by the two classifiers). Now we give


On Multi-View Active Learning and the Combination with Semi-Sup ervised Learning Input: Unlabeled data set U = {x1 , x2 , · · · , }, where each example xt is given as a pair (xt , xt ) 1 2 Pro cess: Ask the user to label m0 unlabeled examples drawn randomly from D to compose the labeled data set L Iterate i = 0, 1, · · · , s Train two classifiers hi and hi consistent with L in each view, respectively; 1 2 Apply hi and hi to the unlabeled data set U and find out the contention points set Qi ; 1 2 Ask the user to label mi+1 unlabeled examples drawn randomly from Qi , then add them into L and delete them from U . Output: hf inal = combine(hs , hs ) 1 2 Table 1. Paradigm 1: Multi-view active learning

our definition on -expansion. Definition 1 D is -expansion if for any S1  X1 , S2  X2 , we have P r(S1  S2 )   min[P r(S1  S2 ), P r(S1  S2 )]. We say that D is -expanding with respect to hypothesis class H1 ×H2 if the above holds for al l S1  H1 X1 , S2  H2 X2 (here we denote by Hj Xj the set {hXj : h  Hj } for j = 1, 2). Note that Definition 1 on -expansion is almost the same as that in Balcan et al. (2005). To guarantee the success of iterative co-training, they made several assumptions such as that the learning algorithm used in each view is confident about being positive and is able to learn from positive examples only, and that the distribution D+ over positive examples is expanding. There are many concept classes, however, are not learnable from positive examples only. Apparently, all problems which satisfy the definition of Balcan et al. (2005) also satisfy our definition. We will make use of the following lemma when deriving our sample complexity bound (Anthony & Bartlett, 1999). Lemma 1 Let H be a set of functions from X to {-1, 1} with finite VC-dimension V  1. Let P be an arbitrary, but fixed probability distribution over X × {-1, 1}. For any ,  > 0, if we draw a sample from P of size N ( ,  ) = 1 (4V log( 1 ) + 2 log( 2 )), then  a ith probability 1 -  , al l hypotheses with error  w re inconsistent with the data.

voting or winner-take-al l (Muslea et al., 2006). In this paper, we use the following simple combination scheme for binary classification: hi if hi (x1 ) = hi (x2 ) 1 (x1 ) 1 2 (1) hi om (x) = c random g uess if hi (x1 ) = hi (x2 ) 1 2 Assuming that the data distribution D is -expanding with respect to hypothesis class H1 × H2 , we will analyze how many labels the user should label to achieve classifiers with error rate no larger than . We consider i i the iterative process and let S1  X1 and S2  X2 i i where S1 and S2 corresponds to the classifiers hi  H1 1 and hi  H2 in the i-th round, respectively. The ini2 tial m0 unlabeled examples are randomly picked from D and labeled by the user according to the target concept c. Suppose m0 is sufficient for learning two classifiers h0 and h0 whose error rates are at most 1/4 1 2 (i.e., P r(S0 )  1 - 1/4 and P r(S0 )  1 - 1/4), and 1 2 thus P r(S0  S0 )  1/2. The -expansion condition 2 1 suggests P r(S0  S0 )  P r(S0  S0 ). 1 2 1 2 In each round of Paradigm 1, the learners ask the user to label some unlabeled examples according to the target concept c and add them into the labeled data set. Then the two classifiers are refined. Some example x in X might be predicted with different labels between the i-th and (i + 1)-th round. Intuitively, in order to get the classifiers improved in Paradigm 1, the reduced size of confident set should be no more than the size of contention set. Moreover, considering that there is no noise in the labeled data since all the labels are given by the user according to the target concept, and that the amount of labeled training examples are monotonically increasing, the asymptotic performance of PAC learners increase, we can assume that P r(Si+1 | Si  Si )  1 2 j P r(Si  Si ) 1 2 (j  {1, 2}) (2) 16P r(Si  Si ) 1 2

3. Sample Complexity of Multi-View Active Learning
There are many strategies to combine the classifiers in Paradigm 1, for example, weighted voting, majority


On Multi-View Active Learning and the Combination with Semi-Sup ervised Learning

Intuitively, by multiplying the denominator at the right-hand to the left-hand (16 is used for a faster convergence; it can be 2 for an easier understanding), Eq. 2 implies that the total reduced size of confident sets on both views after using the newly labeled contention points is no more than the size of contention set. Apparently, all problems that satisfy the assumption of Balcan et al. (2005) also satisfy Eq. 2. Now we give our main theorem. Theorem 1 For data distribution D -expanding with respect to hypothesis class H1 × H2 , let and  denote the final desired accuracy and confidence parameters. og  + If s = llog 81 and mi = 16 (4V log( 16 ) + 2 log( 8(s 1) ))   C (i = 0, 1, · · · , s), Paradigm 1 wil l generate a classifier with error rate no more than with probability 1 -  . Here, V = max[V C (H1 ), V C (H2 )] where V C (H) denotes the VC-dimension of the hypothesis class H and /4 1/ constant C = 1++/ . 1 Pro of. In Paradigm 1, we use Eq. 1 to combine the two classifiers, thus the error rate of the combined classifier hi om is c errorhi om c = P r(Si 1  Si ) 2 1 + P r(Si  Si ) 1 2 2

with probability 1 - 2(s 1) . Considering Eq. 2 we have +
i P r(Si  Si | S1-1  Si-1 )  1 2 2 i P r(S1-1  Si-1 ) 2 . i-1 8P r(S1  Si-1 ) 2

Since P r(Si  Si ) = P r(Si  Si | Si-1  Si-1 ) 1 2 1 2 1 2 ·P r(Si-1  Si-1 ) 1 2 +P r(Si  Si | Si-1  Si-1 ) 1 2 1 2 ·P r(Si-1  Si-1 ) 1 2 +P r(Si  Si | Si-1  Si-1 ) 1 2 1 2 i ·P r(S1-1  Si-1 ), 2 we have P r(Si  Si )  1 2  i P r(Si-1  Si-1 ) + P r(Si-1  S2-1 ) . 1 1 2 4

From Eq. 3 we can get that
i P r(Si-1  S2-1 )  P r(Si-1  Si-1 )/ . 1 1 2

Thus, considering
i P r(Si-1  Si-1 ) = P r(Si-1  S2-1 ) + P r(Si-1  Si-1 ), 1 2 1 1 2

 P r(Si  Si ) + P r(Si  Si ) 2 1 1 2 = P r(Si 1  Si ) 2
+ With m0 = log( 16 ) + 2 log( 8(s 1) )), using    Lemma 1 we have P r(S0 )  16 and P r(S0 )  16 1 2  with probability 1 - 4(s+1) . Generally, we have that i an arbitrary Sj (j = 1, 2) being consistent with the  examples in L has an error rate at most 16 with probi  ability 1 - 4(s+1) . So we have P r(S1  Si )  1 -  2 8 with probability 1 - 2(s 1) . Without loss of generality, + consider 0 <   1 and therefore 1 -  > 1 . Thus the 8 2 16  (4V

we have P r(Si  Si ) 1 2 P r(Si-1  Si-1 ) 1 2   =
i-1 i   S2-1 ) + P r(Si-1  Si-1 ) 1 2 4 P r (S1 i P r(Si-1  Si-1 ) + P r(S1-1  Si-1 ) 1 2 2 i  P r(Si-1  S2-1 ) + P r(Si-1  Si-1 )/ 1 1 2 4 i P r(Si-1  Si-1 ) + P r(S1-1  Si-1 )/ 1 2 2

/4 + 1/ . 1 + 1/

Now we get P r(Ss  Ss )  ( 1 2  So when s =
/4+1/ 1+1/

-expansion condition suggests P r(Si 1  Si ) 2 

P r(Si 1


Si ). 2

(3)

For i  1, the learners ask the user to label mi i unlabeled examples drawn randomly from S1-1  i-1 S2 according to the target concept c and obtain i i two new classifiers S1 and S2 . Similarly, if mi = 8(s+1) 16 16 )), using Lemma 1 we have  (4V log(  ) + 2 log(  P r(Si | Si-1  Si-1 )  j 2 1 with probability 1 -
 4(s+1) .

/4 + 1/ s ) P r(S0  S0 ) 1 2 1 + 1/  /4 + 1/ s ( ). 8 1 + 1/ where C is a constant and

llog 81

og


 (j  {1, 2}) 16 So we get that  8

< 1, we have P r(Ss  Ss )  . In other 1 2 words, we get a classifier hsom whose error rate is no c F ore than with probability 1 -  . m rom Theorem 1 we know that we only need to label s 1 1 i=0 mi = O (log log(log )) examples to get a classifier with error rate no more than with probability

C

P r(Si  Si | Si-1  Si-1 )  1 2 1 2


On Multi-View Active Learning and the Combination with Semi-Sup ervised Learning

1 -  . Thus, we achieve an exponential improvement in sample complexity from O( 1 ) to O(log 1 ) as in Dasgupta et al. (2005) and Balcan et al. (2007). Note that we have not assumed a specific data distribution and a specific hypothesis class which were assumed in the studies of Dasgupta et al. (2005) and Balcan et al. (2007). From the proof of Theorem 1 we can also know  that the proportion 16 in Eq. 2 can be relaxed to close  to 2 . Such relaxation will not affect the exponential improvement, but will reduce the convergence speed. Further, considering that not every data distribution D is -expanding with respect to hypothesis class H1 × H2 , we will give a coarse upper bound of the generalization error for Paradigm 1 for cases when the -expansion assumption does not hold. Let P r(Si  Si ) = i P r(Si  Si ) (i = 0, 1, · · ·). If the 1 1 2 2 -expansion assumption does not hold in Paradigm 1, for any > 0 and any integer N > 0, the size of the set {i : i > N  i < } is infinite. We set a parameter c > 0 as the stop condition. When P r(Si  Si ) is 1 2 less than c, we terminate the iteration in Paradigm 1. Now we make the definition on expanded region with respect to c. Definition 2 Let  c denote the expanded region with respect to c in Paradigm 1, 
c

the expanded region  , the better the improvement of Paradigm 1. Theorem 2 can also be applied to oneshot co-training (Balcan et al., 2005).

4. Sample Complexity of Combination of Multi-View Active Learning and Semi-Sup ervised Learning
We can try to reduce the sample complexity further by combining multi-view active learning with semisupervised learning. Previously this has been tried in some applications and led to good results (Zhou et al., 2006), yet to the best of our knowledge, there is no theoretical analysis which supports such argument. For computational simplicity, we consider the following case in this section. Suppose that the hypothesis class Hj is the subset of mappings from Xj to [-1, 1] and y = sig n(c(x)), c = (c1 , c2 ) is the underlying target concept, where c1 and c2 is the underlying target concept in each view, respectively. Let d(f , g ) denote the probability that the two classifiers f  Hj and g  Hj predict different labels on an example xj drawn randomly from Xj , then s d(f , g ) = P rxj Xj ig n f (xj ) = g sig n (xj ) .

= P r(S0  S0 ) - P r(Si  Si ), 1 2 1 2

where i = min{i : P r(Si  Si ) < c  i  1}. 2 1 After i rounds the region in which both classifiers wrongly predict becomes smaller and smaller, from P r(S0  S0 ) to P r(Si  Si ). This expanded region can 1 2 1 2 be thought of as an approximation of i =1 P r(Sk Sk ). 2 1 k Theorem 2 When the -expansion assumption does not hold, set c > 0 to terminate Paradigm 1. The error rate of hi om can be smal ler than h0om for  c + c c 0 0 1 2 (P r (S1  S2 ) - c). Pro of. Considering errorhi om c = P r(Si  1

Suppose that for any f , g  Hj , there exists some constant L1 > 0 to hold that |f (xj ) - g (xj )|  L1 · d(f , g ) · xj 2, where xj 2 denotes the 2-norm of xj . Without loss of generality, suppose that there exists some constant L2 > 0 to hold that xj 2  L2 for xj  Xj (j = 1, 2). Now we have the following theorem for Paradigm 2 which combines multi-view active learning with semi-supervised learning. Theorem 3 For data distribution D -expanding with respect to hypothesis class H1 × H2 , let and  denote the final desired accuracy and confidence parameters. og  + 1 1 If s = llog 81 , m0 = L (4V log( L ) + 2 log( 8(s 1) )) and
+ mi = 16 (4V log( 16 ) + 2 log( 8(s 1) )) (i = 1, 2, · · · ,),   Paradigm 2 wil l generate a classifier with error rate no more than with probability 1 -  .
C

Si ) + 1 P r(Si  Si ) and P r(Si  Si ) < c, we 1 2 2 1 2 2 have that errorh0om - errorhi om is larger than c c 1 T c + 2 (P r(S0  S0 ) - c).  1 2 heorem 2 implies that Paradigm 1 could not boost the performance to arbitrarily high and gives a coarse upper bound of the error rate, when the -expansion assumption does not hold. The improvement depends on the expanded region  and the disagreement between the initial two classifiers. The larger

Here, V = max[V C (H1 ), V C (H2 )] where V C (H) denotes the VC-dimension of the hypothesis class H, con/4 1/  1 stant C = 1++/ and constant L = min[ 16 , 16L1 L2 ]. 1 Pro of. In Paradigm 2, we also use Eq. 1 to com1 1 bine the two classifiers. With m0 = L (4V log( L ) + +  1 2 log( 8(s 1) )) where constant L = min[ 16 , 16L1 L2 ], us1 1 ing Lemma 1 we have P r(S0 )  L and P r(S0 )  L 1 2  with probability 1 - 4(s+1) . Generally, we have that an


On Multi-View Active Learning and the Combination with Semi-Sup ervised Learning Input: Unlabeled data set U = {x1 , x2 , · · · , }, where each example xt is given as a pair (xt , xt ) 1 2 Threshold thr Pro cess: Ask the user to label m0 unlabeled examples drawn randomly from D to compose the labeled data set L Iterate i = 0, 1, · · · , s Set counter ni+1 to 0. If D is expanding, set counter ni+1 to +; Otherwise, set counter ni+1 to 0; 1 2 2 Train two classifiers hi and hi consistent with L in each view, respectively; 1 2 Apply hi and hi to the unlabeled data set U and find out the contention points set Qi ; 1 2 for k = 1, · · · , mi+1 Draw an example xk = (xk , xk ) randomly from Qi ; 1 2 if |hi (xk )| > thr then y k = sig n(hi (xk )); 1 1 1 1 else if |hi (xk )| > thr then y k = sig n(hi (xk )); 2 2 2 2 else ask the user to label xk and ni+1 = ni+1 + 1; 1 1 Add (xk , y k ) into L and delete it from U and Qi . end for for w = 1, 2, · · · if ni+1  mi+1 - ni+1 break; 2 1 Draw an example xw = (xw , xw ) randomly from U - Qi ; 1 2 if |hi (xw )| > thr then y w = sig n(hi (xw )); 1 1 1 1 else if |hi (xw )| > thr then y w = sig n(hi (xw )); 2 2 2 2 else ask the user to label xw and ni+1 = ni+1 + 1; 2 2 Add (xw , y w ) into L and delete it from U . end for Output: hf inal = combine(hs , hs ) 1 2 Table 2. Paradigm 2: Combination of multi-view active learning and semi-supervised learning
i arbitrary Sj (j = 1, 2) being consistent with the exam1 ples in L has an error rate at most L with probability 1 - 4(s 1) . So, for any example x = (x1 , x2 ), +

the size of the region S1  S2 .

5. Empirical Study
In this section we empirically study the performance of the Paradigms 1 and 2 on a real-world data set, i.e., the course data set (Blum & Mitchell, 1998). This data set has two views (pages view and links view) and contains 1,051 examples each corresponds to a web page, and the task is to predict whether an unseen web page is a course page or not. There are 230 positive examples (roughly 22%). We randomly use 25% data as the test set and use the remaining 75% data as the unlabeled set U in Tables 1 and 2. Then, we randomly draw 10 positive and 30 negative examples from U to generate the initial m0 labeled examples. In practice, the thr in Paradigm 2 can be determined by cross validation on labeled examples. Here in our experiments, for the ease of comparison, we do not set thr and instead, we fix the number of examples to be queried in both Paradigms. Thus, we can study their performance under the same number of queries. In detail, in the i-th round, Paradigm 1 picks out two contention points randomly to query; while Paradigm

|hi (xj ) - cj (xj )|  L1 · L2 · d(hi , cj )  j j

1 . 16

We can set the threshold thr in Paradigm 2 to 116 . If |hi (xj )| > 116 , hi and cj make the same prediction on j j xj . When s = llog 81 , from the proof of Theorem
C

og


1 we have P r(Ss  Ss )  . Thus we get a classifier 1 2 hsom whose error rate is no more than with probac T bility 1 -  using Paradigm 2. s he sample complexity of Paradigm 2 is m0 + i=1 ni , 1 which is much smaller than that of Paradigm 1. From Theorem 3 we know that the sample complexity can be further reduced by combining multi-view active learning with semi-supervised learning, however, it needs a stronger assumption on the hypothesis class H1 × H2 . If this assumption holds, in contrast to Paradigm 1, when -expansion does not hold, we can s query i=1 (mi - ni ) more examples on which both 1 classifiers have small margin, which can help to reduce


On Multi-View Active Learning and the Combination with Semi-Sup ervised Learning
0.12

error rate

0.10

Paradigm 1 Paradigm 2 Random Sampling

0.08

0.06 40

72 104 number of queried labels

136

Figure 1. Comparison of the performances

O(log 1 . The -expansion assumption we employed ) is weaker than assumptions taken by previous theoretical studies on active learning, such as that the data is distributed uniformly over the unit sphere in Rd and that the hypothesis class is linear separators through the origin. We also give an upper bound of the error rate for cases where the -expansion assumption does not hold. Then, we analyze the combination of multi-view active learning with semi-supervised learning and get that such a combination can reduce the sample complexity further, which is verified by an empirical study. This provides an explanation to that why the method described in (Zhou et al., 2006) can lead to good results. Our work is the first theoretical analysis on the sample complexity of realizable multi-view active learning. Recently, non-realizable active learning, where there does not exist a hypothesis perfectly separating the data, starts to attract attention (Balcan et al., 2006; Balcan et al., 2007; Dasgupta et al., 2008). Extending our work to non-realizable multi-view active learning is a future work.

2 picks out the example with the smallest absolute sum of the two classifiers' outputs from Qi and U - Qi respectively to query, and picks out the example with the largest absolute sum of the two classifiers' outputs h om Qi and U .- Qi respectively to label as fr sig n i (x1 ) + hi (x2 ) That is, the two examples to 1 2 | be quer) ed in Paradigm 2 are arg minxQi hi (x1 ) + i 1 ) |i hi (x2 ) and arg minxU -Qi h1 (x1 ) + hi (x2 ) , while 2 2 the two examples Paradigm 2 labels for | itself by semi-su) ervised learning are arg maxxQi hi () 1 ) + p 1x |i i i h2 (x2 ) and arg maxxU -Qi h1 (x1 ) + h2 (x2 ) . We use Random Sampling as the baseline and implement the classifiers with SMO in WEKA (Witten & Frank, 2005). The experiments are repeated for 20 runs and Figure 1 plots the average error rates of the three methods against the number of examples that have been queried. It can be found from Figure 1 that with the same number of queried examples, although there are some fluctuation, the performance of Paradigm 1 is generally better than that of Random Sampling, while the performance of Paradigm 2 is better than that of the others. In particular, the advantage of Paradigm 2 becomes more prominent as the number of queries increases. This is not difficult to understand since with more labeled data the learners become stronger and thus the labels obtained from the semi-supervised learning process become more helpful. Overall, the empirical study verifies that comparing with pure active learning, the combination of multiview active learning and semi-supervised learning can reduce the sample complexity.

Acknowledgments
This research was supported by the National Science Foundation of China (60635030, 60721002), the Foundation for the Author of National Excellent Doctoral Dissertation of China (200343) and the National High Technology Research and Development Program of China (2007AA01Z169).

References
Anthony, M., & Bartlett, P. L. (Eds.). (1999). Neural network learning: Theoretical foundations. Cambridge, UK: Cambridge University Press. Balcan, M.-F., Beygelzimer, A., & Langford, J. (2006). Agnostic active learning. Proceedings of the 23rd International Conference on Machine Learning (pp. 65­72). Pittsburgh, PA. Balcan, M.-F., Blum, A., & Yang, K. (2005). Cotraining and expansion: Towards bridging theory and practice. In L. K. Saul, Y. Weiss and L. Bottou (Eds.), Advances in neural information processing systems 17, 89­96. Cambridge, MA: MIT Press. Balcan, M.-F., Broder, A. Z., & Zhang, T. (2007). Margin based active learning. Proceedings of the 20th Annual Conference on Learning Theory (pp. 35­50). San Diego, CA. Belkin, M., Matveeva, I., & Niyogi, P. (2001). Reg-

6. Conclusion
In this paper, we first characterize the sample complexity of multi-view active learning and get an exponential improvement in the sample complexity from O( 1 ) to


On Multi-View Active Learning and the Combination with Semi-Sup ervised Learning

ularization and semi-supervised learning on large graphs. Proceedings of the 17th Annual Conference on Learning Theory (pp. 624­638). Banff, Canada. Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the 11th Annual Conference on Computational Learning Theory (pp. 92­100). Madison, WI. Cohn, D. A., Atlas, L. E., & Ladner, R. E. (1994). Improving generalization with active learning. Machine Learning, 15, 201­221. Dasgupta, S. (2006). Coarse sample complexity bounds for active learning. In Y. Weiss, B. Sch¨lkopf o and J. Platt (Eds.), Advances in neural information processing systems 18, 235­242. Cambridge, MA: MIT Press. Dasgupta, S., Hsu, D., & Monteleoni, C. (2008). A general agnostic active learning algorithm. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in neural information processing systems 20, 353­360. Cambridge, MA: MIT Press. Dasgupta, S., Kalai, A. T., & Monteleoni, C. (2005). Analysis of perceptron-based active learning. Proceedings of the 18th Annual Conference on Learning Theory (pp. 249­263). Bertinoro, Italy. de Sa, V. R., & Ballard, D. H. (1998). Category learning through multi-modality sensing. Neural Computation, 10, 1097­1117. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1­38. Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28, 133­168. Goldman, S., & Zhou, Y. (2000). Enhancing supervised learning with unlabeled data. Proceedings of the 17th International Conference on Machine Learning (pp. 327­334). San Francisco, CA. Melville, P., & Mooney, R. J. (2004). Diverse ensembles for active learning. Proceedings of the 21st International Conference on Machine Learning (pp. 584­591). Banff, Canada. Miller, D. J., & Uyar, H. S. (1997). A mixture of experts classifier with learning based on both labelled and unlabelled data. In M. Mozer, M. I. Jordan and T. Petsche (Eds.), Advances in neural information processing systems 9, 571­577. Cambridge, MA: MIT Press.

Muslea, I., Minton, S., & Knoblock, C. A. (2000). Selective sampling with redundant views. Proceedings of the 17th National Conference on Artificial Intelligence (pp. 621­626). Austin, TX. Muslea, I., Minton, S., & Knoblock, C. A. (2006). Active learning with multiple views. Journal of Artificial Intel ligence Research, 27, 203­233. Nigam, K., McCallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103­134. Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 45­66. Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, CA: Morgan Kaufmann. 2nd edition. Zhou, D., Sch¨lkopf, B., & Hofmann, T. (2005). Semio supervised learning on directed graphs. In L. K. Saul, Y. Weiss and L. Bottou (Eds.), Advances in neural information processing systems 17, 1633­ 1640. Cambridge, MA: MIT Press. Zhou, Z.-H., Chen, K.-J., & Dai, H.-B. (2006). Enhancing relevance feedback in image retrieval using unlabeled data. ACM Transactions on Information Systems, 24, 219­244. Zhou, Z.-H., & Li, M. (2005). Semi-supervised learning with co-training. Proceedings of the 19th International Joint Conference on Artificial Intel ligence (pp. 908­913). Edinburgh, Scotland. Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semisupervised learning using Gaussian fields and harmonic functions. Proceedings of the 20th International Conference on Machine Learning (pp. 912­ 919). Washington, DC.