A general agnostic active learning algorithm Sanjoy Dasgupta UC San Diego dasgupta@cs.ucsd.edu Daniel Hsu UC San Diego djhsu@cs.ucsd.edu Claire Monteleoni UC San Diego cmontel@cs.ucsd.edu Abstract We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm's label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1 Intro duction Active learning addresses the issue that, in many applications, labeled data typically comes at a higher cost (e.g. in time, effort) than unlabeled data. An active learner is given unlabeled data and must pay to view any label. The hope is that significantly fewer labeled examples are used than in the supervised (non-active) learning model. Active learning applies to a range of data-rich problems such as genomic sequence annotation and speech recognition. In this paper we formalize, extend, and provide label complexity guarantees for one of the earliest and simplest approaches to active learning--one due to Cohn, Atlas, and Ladner [1]. The scheme of [1] examines data one by one in a stream and requests the label of any data point about which it is currently unsure. For example, suppose the hypothesis class consists of linear separators in the plane, and assume that the data is linearly separable. Let the first six data be labeled as follows. 11 00 1 0 1 0 1 0 The learner does not need to request the label of the seventh point (indicated by the arrow) because it is not unsure about the label: any straight line with the s and s on opposite sides has the seventh point with the s. Put another way, the point is not in the region of uncertainty [1], the portion of the data space for which there is disagreement among hypotheses consistent with the present labeled data. Although very elegant and intuitive, this approach to active learning faces two problems: 1. Explicitly maintaining the region of uncertainty can be computationally cumbersome. 2. Data is usually not perfectly separable. 1 Our main contribution is to address these problems. We provide a simple generalization of the selective sampling scheme of [1] that tolerates adversarial noise and never requests many more labels than a standard agnostic supervised learner would to learn a hypothesis with the same error. In the previous example, an agnostic active learner (one that does not assume a perfect separator exists) is actually stil l uncertain about the label of the seventh point, because all six of the previous labels could be inconsistent with the best separator. Therefore, it should still request the label. On the other hand, after enough points have been labeled, if an unlabeled point occurs at the position shown below, chances are its label is not needed. 1 0 11 00 1 0 1 0 11 00 1 0 11 00 1 0 1 0 1 0 11 00 11 00 11 00 11 00 The last claim may sound awfully suspicious, because S T is not i.i.d.! Indeed, this is in a sense the core sampling problem that has always plagued active learning: the labeled sample T might not be i.i.d. (due to the filtering of examples based on an adaptive criterion), while S only contains unlabeled examples (with made-up labels). Nevertheless, we prove that in our case, it is in fact correct to effectively pretend S T is an i.i.d. sample. A direct consequence is that the label complexity of our algorithm (the number of labels requested before achieving a desired error) is never much more than the usual sample complexity of supervised learning (and in some cases, is significantly less). An important algorithmic detail is the specific choice of generalization bound we use in deciding whether to request a label or not. The usual additive bounds with rate n-1/2 are too loose, e.g. we know in the zero-error case the rate should be n-1 . Our algorithm magnifies this small polynomial difference in the bound into an exponential difference in label complexity, so it is crucial for us to use a good bound. We use a normalized bound that takes into account the empirical error (computed on S T ) of the hypothesis in question. In this paper, we present and analyze a simple agnostic active learning algorithm for general hypothesis classes of bounded VC dimension. It extends the selective sampling scheme of Cohn et al. [1] to the agnostic setting, using normalized generalization bounds, which we apply in a simple but subtle manner. For certain hypothesis classes and distributions, our analysis yields improved label complexity guarantees over the standard sample complexity of supervised learning. We also demonstrate such improvements experimentally. 1.1 Related work To extend the notion of uncertainty to the agnostic setting, we divide the sampled data into two groups, S and T : S contains the data for which we have determined the label ourselves (we explain below how to ensure that they are consistent with the best separator in the class) and T contains the data for which we have explicitly requested a label. Now, somewhat counter-intuitively, the labels in S are completely reliable, whereas the labels in T could be inconsistent with the best separator. To decide if we are uncertain about the label of a new point x, we reduce to a supervised learning task: for each possible label y {±1}, we ^ learn a hypothesis hy consistent with the labels in S {(x, y )} and with minimal empirical ^ ^ error on T . If, say, the error of the hypothesis h+1 is much larger than that of h-1 , we can safely infer that the best separator must also label x with -1 without requesting a label; if the error difference is only modest, we explicitly request a label. Standard generalization bounds for an i.i.d. sample let us perform this test by comparing empirical errors on S T . Our algorithm extends the selective sampling scheme of Cohn et al. [1] (described above) to the agnostic setting. Most previous work on active learning either makes strong distributional assumptions (e.g. separability, uniform input distribution) [1­8], or is generally computationally prohibitive [2, 4, 9]. See [10] for a discussion of these results. A natural way to formulate active learning in the agnostic setting is to ask the learner to return a hypothesis with error at most + (where is the error of the best hypothesis in 2 the specified class) using as few labels as possible. A basic constraint on the label complexity was pointed out by Kaari¨inen [11], who showed that for any (0, 1/2), there are data ¨¨ a distributions that force any active learner that achieves error at most + to request (( /)2 ) labels. The first rigorously-analyzed agnostic active learning algorithm, called A2 , was developed recently by Balcan, Beygelzimer, and Langford [9]. Like Cohn-AtlasLadner [1], this algorithm uses a region of uncertainty, although the lack of separability complicates matters and A2 ends up explicitly maintaining an -net of the hypothesis space. Subsequently, Hanneke [12] characterized the label complexity of the A2 algorithm in terms of a parameter called the disagreement coefficient. Our work was inspired by both [1] and [9], and we have built heavily upon their insights. Our algorithm overcomes their complications by employing reductions to supervised learning.1 We bound the label complexity of our method in terms of the same parameter as used for A2 [12], and get a somewhat better dependence (linear rather than quadratic). 2 2.1 Preliminaries Learning framework and uniform convergence Let X be the input space, D a distribution over X × {±1} and H a class of hypotheses h : X {±1} with VC dimension vcdim(H) = d < (the finiteness ensures the nth shatter coefficient S (H, n) is at most O(nd ) by Sauer's lemma). We denote by DX the marginal of D over X . In our active learning model, the learner receives unlabeled data sampled from DX ; for any sampled point x, it can optionally request the label y sampled from the conditional distribution at x. This process can be viewed as sampling (x, y ) from D and revealing only x to the learner, keeping the label y hidden unless the learner explicitly requests it. The error of a hypothesis h under D is errD (h) = Pr(x,y)( D [h(x) = y ], and on a l finite sample Z X × {±1}, the empirical error of h is err(h, Z ) = x,y )Z 1 [h(x) = y ]/|Z |, where 1 [·] is the 0-1 indicator function. We assume for simplicity that the minimal error l = inf {errD (h) : h H} is achieved by a hypothesis h H. Our algorithm uses the following normalized uniform convergence bound [14, p. 200]. Lemma 1 (Vapnik and Chervonenkis [15]). Let F be a family of measurable functions f : Z {0, 1} over ( space Z . Denote by EZ f the empirical average of f over a subset a Z Z . Let n = 4/n) ln(8S (F , 2n)/ ). If Z is an i.i.d. sample of size n from a fixed distribution over Z , then, with probability at least 1 - , for al l f F : E E E E. 2 2 - min n f , n + n f Ef - EZ f min n + n f , n f Z Z 2.2 Disagreement co efficient We will bound the label complexity of our algorithm in terms of (a slight variation of ) the disagreement coefficient introduced in [12] for analyzing the label complexity of A2 . Definition 1. The disagreement metric on H is defined by (h, h ) = PrxDX [h(x) = h (x)]. The disagreement coefficient = (D, H, ) > 0 is P w rxDX [h B (h , r) s.t. h(x) = h (x)] = inf :r + r here B (h, r) = {h H : (h, h ) < r}, h = arg inf hH errD (h), and = errD (h ). The quantity bounds the rate at which the disagreement mass of the ball B (h , r) ­ the probability mass of points on which hypotheses in B (h , r) disagree with h ­ grows as a function of the radius r. Clearly, 1/( + ); furthermore, it is a constant bounded 1 It has been noted that the Cohn-Atlas-Ladner scheme can easily be made tractable using a reduction to supervised learning in the separable case [13, p. 68]. Although our algorithm is most naturally seen as an extension of Cohn-Atlas-Ladner, a similar reduction to supervised learning (in the agnostic setting) can be used for A2 [10]. 3 Algorithm 1 Input: stream (x1 , x2 , . . . , xm ) i.i.d. from DX Initially, S0 and T0 . For n = 1, 2, . . . , m: 1. For each y {±1}, let hy LEARNH (Sn-1 {(xn , y )}, Tn-1 ). ^ ^ ^ 2. If err(h-y , Sn-1 Tn-1 ) - err(hy , Sn-1 Tn-1 ) > n-1 for some y {±1} ^ ^ ^ (or if no such h-y is found) ^ Return hf = LEARNH (Sm , Tm ). then Sn Sn-1 {(xn , y )} and Tn Tn-1 . ^ 3. Else request yn ; Sn Sn-1 and Tn Tn-1 {(xn , yn )}. Figure 1: The agnostic selective sampling algorithm. See (1) for how to set n . independently of 1/( + ) in several cases previously considered in the literature [12]. For in example, if H is homogeneous l ear separators and DX is the uniform distribution over the unit sphere in Rd , then = ( d). 3 Agnostic selective sampling Here we state and analyze our general algorithm for agnostic active learning. The main techniques employed by the algorithm are reductions to a supervised learning task and generalization bounds applied to differences of empirical errors. 3.1 A general algorithm for agnostic active learning Figure 1 states our algorithm in full generality. The input is a stream of m unlabeled ~ examples drawn i.i.d from DX ; for the time being, m can be thought of as O((d/)(1 + /)) where is the accuracy parameter.2 For S, T X × {±1}, let LEARNH (S, T ) denote a supervised learner that returns a hypothesis h H consistent with S , and with minimum error on T . Algorithm 1 maintains two sets of labeled examples, S and T , each of which is initially empty. Upon receiving xn , it learns two hypotheses, hy = LEARNH (S {(xn , y )}, T ) for y {±1}, and then compares ^ ^ ^ their empirical errors on S T . If the difference is large enough3 , it is possible to infer how h labels xn (as we show in Lemma 3). In this case, the algorithm adds xn , with this inferred label, to S . Otherwise, the algorithm requests the label yn and adds (xn , yn ) to T . Thus, S contains examples with inferred labels consistent with h , and T contains examples with their requested labels. Because h might err on some examples in T , we just insist that LEARNH find a hypothesis with minimal error on T . Meanwhile, by construction, h is consistent with S , so we require LEARNH to only consider hypotheses consistent with S . 3.2 Bounds for error differences We still need to specify n , the threshold value for error differences that determines whether the algorithm requests a label or not. Intuitively, n should reflect how closely empirical errors on a sample approximate true errors on the distribution D. The setting of n can only depend on observable quantities, so we first clarify the distinction between empirical errors on Sn Tn and those with respect to the true (hidden) labels. ! Definition 2. Let Sn and Tn be as defined in Algorithm 1. Let Sn be the set of labeled examples identical to those in Sn , except with the true hidden labels swapped in. Thus, for ! example, Sn Tn is an i.i.d. sample from D of size n. Final ly, let ! err! (h) = err(h, Sn Tn ) n and ~ The O notation suppresses log 1/ and terms polylogarithmic in those that appear. If LEARNH cannot find a hypothesis consistent with S {(xn , y )} for some y , then it is clear that h (x) = -y . In this case, we simply add (xn , -y ) to S , regardless of n-1 . 2 3 errn (h) = err(h, Sn Tn ). 4 + l Definition 3. For a pair (h, h ) H × H, define gh,h (x, y ) = 1 [h(x) = y h (x) = y ] and - gh,h (x, y ) = 1 [h(x) = y h (x) = y ]. l ! It is straightforward to apply Lemma 1 to empirical errors on Sn Tn , i.e. to err! (h), but n we cannot use such bounds algorithmically: we do not request the true labels for points in Sn and thus cannot reliably compute err! (h). What we can compute are error differences n err! (h) - err! (h ) for pairs of hypotheses (h, h ) that agree on (and make the same mistakes n n ! ! on) Sn , since for such pairs, we have errn (h) - errn (h ) = errn (h) - errn (h ). + - With this notation, we have err(h, Z ) - err(h , Z ) = EZ [gh,h ] - EZ [gh,h ] for any Z X × - + {±1}. Now, applying Lemma 1 to G = {gh,h : (h, h ) H × H} = {gh,h : (h, h ) H × H}, 2 and noting that S (G , n) S (H, n) , gives the following lemma. ( 4/n) ln(8S (H, 2n)2 / ). With probability at least 1 - over an Lemma 2. Let n = i.i.d. sample Z of size n from D, we have for al l (h, h ) H × H, E E . + - 2 err(h, Z ) - err(h , Z ) errD (h) - errD (h ) + n + n Z [gh,h ] + Z [gh,h ] ( 4/n) ln(8(n2 + n)S (H, 2n)2 / ). Then, with probability at least Corollary 1. Let n = 1 - , for al l n 1 and al l (h, h ) H × H consistent with Sn , we have e e 2 errn (h) - errn (h ) errD (h) - errD (h ) + n + n ( rrn (h) + rrn (h )). ! Proof. Applying Lemma 2 to each Sn Tn (replacing with /(n2 + n)) and a union bound implies, with probability at least 1 - , the bounds in Lemma 2 hold simultaneously ! for all n 1 and all (h, h ) H2 with Sn Tn in place of Z . The corollary follows + ! ! l because errn (h) - errn (h ) = errn (h) - errn (h ); and because gh,h (x, y ) 1 [h(x) = y ] and + - gh,h (x, y ) 1 [h (x) = y ] for (h, h ) consistent with Sn , so ESn Tn [gh,h ] errn (h) and l ! - ESn Tn [gh,h ] errn (h ). ! Corollary 1 implies that we can effectively apply the normalized uniform convergence bounds from Lemma 1 to empirical error differences on Sn Tn , even though Sn Tn is not an i.i.d. sample from D. In light of this, we use the following setting of n : e ( e 2 n := n + n rrn (h+1 ) + rrn (h-1 ) 1) d ( ~ 4/n) ln(8(n2 + n)S (H, 2n)2 / ) = O( log n/n) as per Corollary 1. where n = 3.3 Correctness and fall-back analysis We now justify our setting of n with a correctness proof and fall-back guarantee. Lemma 3. With probability at least 1 - , the hypothesis h = arg inf hH errD (h) is consistent with Sn for al l n 0 in Algorithm 1. Proof. Apply the bounds in Corollary 1 and proceed by induction on n. The base case is trivial since S0 = . Now assume h is consistent with Sn . Suppose upon receiving xn+1 , we discover errn (h+1 ) - errn (h-1 ) > n . We will show that h (xn+1 ) = -1 (assume both h+1 and h-1 exist, since it is clear h (xn+1 ) = -1 if h+1 does not exist). Suppose for the sake of contradiction that h (xn+1 ) = +1. We know the that errn (h ) e rrn (h+1 ) (by e e 2 inductive hypothesis) and errn (h+1 ) - errn (h-1 ) > n + n ( rrn (h+1 ) + rrn (h-1 )). In 2 particular, errn (h+1 ) > n . Therefore, errn (h ) - errn (h-1 ) = (errn (h ) - errn (h+1 )) + (errn (h+1 ) - errn (h-1 )) e e e e e 2 > rrn (h+1 )( rrn (h ) - rrn (h+1 )) + n + n ( rrn (h+1 ) + rrn (h-1 )) e e e e 2 ) - rrn (h+1 )) + n + n ( rrn (h+1 ) + rrn (h-1 )) > n ( rrn (h e e 2 ) + = n + n ( rrn (h rrn (h-1 )). 5 Now Corollary 1 implies that errD (h ) > errD (h-1 ), a contradiction. Theorem 1. Let = inf hH errD (h) and d = vcdim(H). There exists a constant c > 0 such that the fol lowing holds. If Algorithm 1 is given a stream of m unlabeled examples, then with probability at least 1 - , th( algorithm returns a hypothesis with error at most e + c · ((1/m)(d log m + log(1/ )) + /m)(d log m + log(1/ ))). Proof. Lemma 3 implies that h is consistent with Sm with probability at least 1 - . Using the same bounds from Corollary 1 (already applied in Lemma 3) on h and hf etogether 2 with the fact errm (hf ) errm (h ), we have errD (hf ) + m + m + m rrD (hf ), 2 which in turn implies errD (hf ) + 3m + 2m . ~ So, Algorithm 1 returns a hypothesis with error at most + when m = O((d/)(1 + /)); this is (asymptotically) the usual sample complexity of supervised learning. Since the ~ algorithm requests at most m labels, its label complexity is always at most O((d/)(1 + /)). 3.4 Lab el complexity analysis We can also bound the label complexity of our algorithm in terms of the disagreement coefficient . This yields tighter bounds when is bounded independently of 1/( + ). The key to deriving our label complexity bounds based on is noting that the probability of requesting the (n + 1)th label is intimately related to and n (see [10] for the complete proof ). Lemma 4. There exists a constant c > 0 such that, with probability at least 1 - 2 , for al l n 1, the fol lowing holds. Let h (xn+1 ) = y where h = arg inf hH errD (h). Then, the ^ probability that Algorithm 1 requests the label yn+1 is Prxn+1 DX [Request yn+1 ] c · · ( + 2 2 n ), whered = (D, H, 3m + 2m ) is the disagreement coefficient, = errD (h ), and ~ n = O( log n/n) is as defined in Corol lary 1. Now we give our main label complexity bound for agnostic active learning. Theorem 2. Let m be the number of unlabeled data given to Algorithm 1, d = vcdim(H), 2 = inf hH errD (h), m as defined in Corol lary 1, and = (D, H, 3m + 2m ). There exists a constant c1 > 0 such that for any c2 1, with probability at least 1 - 2 : 2 1. If (c2 - 1)m , Algorithm 1 returns a hypothesis with error as bounded in Theorem 1 and the expected number of labels requested is at most d . 1 2 1 + c1 c2 · log m + log log m 2. Else, the same holds except the expected number of labels requested is at most . 1 2 1 + c1 · m + d log m + log log m Furthermore, if L is the expected number of labels requested as per above, then with probability 3 at least 1 - , the algorithm requests no more than L + L log(1/ ) labels. Proof. Follows from Lemma 4 and a Chernoff bound for the Poisson trials 1 [Request yn ]. l 2 With the substitution = 3m + 2m as per Theorem 1, Theorem 2 entails that for any hypothesis class and data distribution for which the disagreement coefficient = (D, H, ) is bounded independently of 1/( + ) (see [12] for some examples), Algorithm 1 only needs ~ ~ O(d log2 (1/)) labels to achieve error and O(d(log2 (1/) + ( /)2 )) labels to achieve error . The latter matches the dependence on / in the (( /)2 ) lower bound [11]. The linear dependence on improves on the quadratic dependence required by A2 [12]4 . For an illustrative consequence of this, suppose DX is the uniform distribution on the sphere 4 It may be possible to reduce A2 's quadratic dependence to a linear dependence by using normalized bounds, as we do here. 6 4000 3500 3000 2500 3500 3000 2500 2000 2000 1500 1500 1000 500 0 0 5000 10000 1000 500 0 0 5000 10000 (a) (b) (c) (d) Figure 2: (a & b) Labeling rate plots. The plots show the number of labels requested (vertical axis) versus the total number of points seen (labeled + unlabeled, horizontal axis) using Algorithm 1. (a) H = thresholds: under random misclassification noise with = 0 (solid), 0.1 (dashed), 0.2 (dot-dashed); under the boundary noise model with = 0.1 (lower dotted), 0.2 (upper dotted). (b) H = intervals: under random misclassification with (p+ , ) = (0.2, 0.0) (solid), (0.1, 0.0) (dashed), (0.2, 0.1) (dot-dashed), (0.1, 0.1) (dotted). (c & d) Locations of label requests. (c) H = intervals, h = [0.4, 0.6]. The top histogram shows the locations of first 400 label requests (the x-axis is the unit interval); the bottom histogram is for all (2141) label requests. (d) H = boxes, h = [0.15, 0.85]2 . The first 200 requests occurred at the ×s, the next 200 at the s, and the final 109 at the s. in Rd and H is homogeneous linear separators; in this case, = ( d). Then the label complexity of A2 depends at least quadratically on the dimension, whereas the corresponding dependence for our algorithm is d3/2 . 4 Exp eriments We implemented Algorithm 1 in a few simple cases to experimentally demonstrate the label complexity improvements. In each case, the data distribution DX was uniform over [0, 1]; the stream length was m = 10000, and each experiment was repeated 20 times with different random seeds. Our first experiment studied linear thresholds on the line. The target hypothesis was fixed to be h (x) = sign(x - 0.5). For this hypothesis class, we used two different noise models, each of which ensured inf hH errD (h) = errD (h ) = for a prespecified [0, 1]. The first model was random misclassification: for each point x DX , we independently labeled it h (x) with probability 1 - and -h (x) with probability . In the second model (also used in [7]), for each point x DX , we independently labeled it +1 with probability (x - 0.5)/(4 ) + 0.5 and -1 otherwise, thus concentrating the noise near the boundary. Our second experiment studied intervals on the line. Here, we only used random misclassification, but we varied the target interval length p+ = PrxDX [h (x) = +1]. The results show that the number of labels requested by Algorithm 1 was exponentially smaller than the total number of data seen (m) under the first noise model, and was polynomially smaller under the second noise model (see Figure 2 (a & b); we verified the polynomial vs. exponential distinction on separate log-log scale plots). In the case of intervals, we observe an initial phase (of duration roughly 1/p+ ) in which every label is requested, followed by a more efficient phase, confirming the known active-learnability of this class [4, 12]. These improvements show that our algorithm needed significantly fewer labels to achieve the same error as a standard supervised algorithm that uses labels for all points seen. As a sanity check, we examined the locations of data for which Algorithm 1 requested a label. We looked at two particular runs of the algorithm: the first was with H = intervals, p+ = 0.2, m = 10000, and = 0.1; the second was with H = boxes (d = 2), p+ = 0.49, m = 1000, and = 0.01. In each case, the data distribution was uniform over [0, 1]d , and the noise model was random misclassification. Figure 2 (c & d) shows that, early on, labels were requested everywhere. But as the algorithm progressed, label requests concentrated near the boundary of the target hypothesis. 7 5 Conclusion and future work We have presented a simple and natural approach to agnostic active learning. Our extension of the selective sampling scheme of Cohn, et al. [1] 1. simplifies the maintenance of the region of uncertainty with a reduction to supervised learning, and 2. guards against noise with a subtle algorithmic application of generalization bounds. Our algorithm relies on a threshold parameter n for comparing empirical errors. We prescribe a very simple and natural choice for n ­ a normalized generalization bound from supervised learning ­ but one could hope for a more clever or aggressive choice, akin to those in [6] for linear separators. Finding consistent hypotheses when data is separable is often a simple task. In such cases, reduction-based active learning algorithms can be relatively efficient (answering some questions posed in [16]). On the other hand, agnostic learning suffers from severe computational intractability for many hypothesis classes (e.g. [17]), and of course, agnostic active learning is at least as hard in the worst case. Our reduction is relatively benign in that the learning problems created are only over samples from the original distribution, so we do not create pathologically hard instances (like those arising from hardness reductions) unless they are inherent in the data. Nevertheless, an important research direction is to develop algorithms that only require solving tractable (e.g. convex) optimization problems. A similar reduction-based scheme may be possible. References [1] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201­221, 1994. [2] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133­168, 1997. [3] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In COLT, 2005. [4] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005. [5] S. Hanneke. Teaching dimension and the complexity of active learning. In COLT, 2007. [6] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007. [7] R. Castro and R. Nowak. Upper and lower bounds for active learning. In Al lerton Conference on Communication, Control and Computing, 2006. [8] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007. [9] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006. [10] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. UCSD Technical Report CS2007-0898, http://www.cse.ucsd.edu/djhsu/papers/cal.pdf, 2007. [11] M. K¨¨ri¨inen. Active learning in the non-realizable case. In ALT, 2006. aa a [12] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007. [13] C. Monteleoni. Learning with online constraints: shifting concepts and active learning. PhD Thesis, MIT Computer Science and Artificial Intelligence Laboratory, 2006. [14] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intel ligence, 3176:169­207, 2004. [15] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264­280, 1971. [16] C. Monteleoni. Efficient algorithms for general active learning. In COLT. Open problem, 2006. [17] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In FOCS, 2006. 8