Importance Weighted Active Learning Alina Beygelzimer IBM Research, 19 Skyline Drive, Hawthorne, NY 10532 beygel@us.ibm.com Sanjoy Dasgupta dasgupta@cs.ucsd.edu Department of Computer Science and Engineering, University of California, San Diego 9500 Gilman Drive, La Jolla, CA 92093 John Langford Yahoo! Research, 111 West 40th Street, New York, NY 10018 jl@yahoo-inc.com Abstract We present a practical and statistically consistent scheme for actively learning binary classifiers under general loss functions. Our algorithm uses importance weighting to correct sampling bias, and by controlling the variance, we are able to give rigorous label complexity bounds for the learning process. 1. Introduction An active learner interactively chooses which data points to label, while a passive learner obtains all the labels at once. The great hope of active learning is that interaction can substantially reduce the number of labels required, making learning more practical. This hope is known to be valid in certain special cases, where the number of label queries has been shown to be logarithmic in the usual sample complexity of passive learning; such cases include thresholds on a line, and linear separators with a spherically uniform unlabeled data distribution (Dasgupta et al., 2005). Many earlier active learning algorithms, such as (Cohn et al., 1994; Dasgupta et al., 2005), are not consistent when data is not perfectly separable under the given hypothesis class: even with an infinite labeling budget, they might not converge to an optimal predictor (see Dasgupta and Hsu (2008) for a discussion). This problem has recently been addressed in two threads of research. One approach (Balcan et al., 2006; Dasgupta et al., 2008; Hanneke, 2007) constructs Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s). learning algorithms that explicitly use sample complexity bounds to assess which hypotheses are still "in the running," and thereby assess the relative value of different unlabeled points (in terms of whether they help distinguish between the remaining hypotheses). These algorithms have the usual PAC-style convergence guarantees, but also have rigorous label complexity bounds that are in many cases significantly better than the bounds for passive learning. The second approach to active learning uses importance weights to correct sampling bias (Bach, 2007; Sugiyama, 2006). The PAC-guarantee active learning algorithms have yet to see practical use for several reasons. First, they are built explicitly for 0­1 loss and are not easily adapted to most other loss functions. This is problematic because in many applications, other loss functions are more appropriate for describing the problem, or make learning more tractable (as with convex proxy losses on linear representations). Second, these algorithms make internal use of generalization bounds that are often loose in practice, and they can thus end up requiring far more labels than are really necessary. Finally, they typically require an explicit enumeration over the hypothesis class (or an -cover thereof), which is generally computationally intractable. Importance weighted approaches have only been analyzed in limited settings. For example, Bach (2007) considers linear models and provides an analysis of consistency in cases where either (i) the model class fits the data perfectly, or (ii) the sampling strategy is non-adaptive (that is, a query doesn't depend on the sequence of previous queries). Also, the analysis in these works is asymptotic rather than yielding finite label bounds. Label complexity is of paramount importance in active learning, because otherwise simpler passive learning approaches can be used. Furthermore, Importance Weighted Active Learning a poor choice of importance weights can result in high label complexity. We address the problems above with a new active learning scheme that provably yields PAC-style label complexity guarantees. When presented with an unlabeled point xt , this scheme queries its label with a carefully chosen probability pt , taking into account the identity of the point and the history of labels seen so far. The points that end up getting labeled are then weighted according to the reciprocals of these probabilities (that is, 1/pt ), in order to remove sampling bias. We show (theorem 1) that this simple method guarantees statistical consistency: for any distribution and any hypothesis class, active learning eventually converges to the optimal hypothesis in the class. As in any importance sampling scenario, the biggest challenge is controlling the variance of the process. This depends crucially on how the sampling probability pt is chosen. Roughly, our strategy is to make it proportional to the spread of values h(xt ), as h ranges over the remaining candidate hypotheses. For this setting of pt , which we call loss-weighting, we have two results: a fallback guarantee that the label complexity is never much worse than that of supervised learning (theorem 2), and a rigorous label complexity bound (theorem 11). Previously, label complexity bounds for active learning were only known for 0­1 loss, and were based on the disagreement coefficient of the learning problem (Hanneke, 2007). We generalize this notion to general loss functions, and analyze label complexity in terms of it. We consider settings in which these bounds turn out to be roughly the square root of the sample complexity of supervised learning. In addition to these upper bounds, we show a general lower bound on the label complexity of active learning (theorem 12) that significantly improves the best previous such result (K¨¨ri¨inen, 2006). aa a We conduct practical experiments with two IWAL algorithms. The first is a specialization of IWAL with loss-weighting to the case of linear classifiers with convex loss functions; here, the algorithm becomes tractable via convex programming (section 7). The second uses a simple bootstrapping scheme that reduces active learning to passive learning without requiring much additional computation (section 7.2). In every case, these experiments yield substantial reductions in label complexity compared to passive learning, without compromising predictive performance. They suggest that IWAL is a practical scheme that can reduce the label complexity of active learning without sacrificing the statistical guarantees (like consistency) we take for granted in passive learning. Boosting and bagging-based algorithms of Abe and Mamitsuka (1998) are similar in spirit to our bootstrapping IWAL scheme in section 7.2, but they are not consistent in the presence of noise. 2. Preliminaries Let X be the input space and Y the output space. We consider active learning in the streaming setting where at each step t, a learner observes an unlabeled point xt X and has to decide whether to ask for the label yt Y . The learner works with a hypothesis space H = {h : X Z}, where Z is a prediction space. The algorithm is evaluated with respect to a given loss function l : Z × Y [0, ). The most common loss function is 0­1 loss, in which Y = Z = {-1, 1} and l(z, y) = 1(y = z) = 1(yz < 0). The following examples address the binary case Y = {-1, 1} with Z R: l(z, y) = (1 - yz)+ (hinge loss), l(z, y) = ln(1 + e-yz ) (logistic loss), l(z, y) = (y - z)2 = (1 - yz)2 (squared loss), and l(z, y) = |y - z| = |1 - yz| (absolute loss). Notice that all the loss functions mentioned here are of the form l(z, y) = (yz) for some function on the reals. We specifically highlight this subclass of loss functions when proving label complexity bounds. Since these functions are bounded (if Z is), we further assume they are normalized to output a value in [0, 1]. 3. The Importance Weighting Skeleton Algorithm 1 describes the basic outline of importanceweighted active learning (IWAL). Upon seeing xt , the learner calls a subroutine rejection-threshold (instantiated in later sections), which looks at xt and past history to return the probability pt of requesting yt . The algorithm maintains a set of labeled examples seen so far, each with an importance weight: if yt ends up being queried, its weight is set to 1/pt . Algorithm 1 IWAL (subroutine rejection-threshold) Set S0 = . For t from 1, 2, ... until the data stream runs out: 1. Receive xt . 2. Set pt = rejection-threshold(xt , {xi , yi , pi , Qi : 1 i < t}). 3. Flip a coin Qt {0, 1} with E[Qt ] = pt . If Qt = 1, request yt and set St = St-1 {(xt , yt , 1/pt )}, else St = St-1 . 4. Let ht = arg minhH (x,y,c)St c · l(h(x), y). Examples are assumed to be drawn i.i.d. from the un- Importance Weighted Active Learning derlying probability distribution D. The expected loss of h H on D is given by L(h) = E(x,y)D l(h(x), y). The importance weighted estimate of the loss at time T is T 1 Qt LT (h) = l(h(xt ), yt ), T t=1 pt where Qt is as defined in the algorithm. It is not hard to see that E[LT (h)] = L(h), with the expectation taken over all the random variables involved. Theorem 2 gives large deviation bounds for LT (h), provided that the probabilities pt are chosen carefully. Even though step 4 is stated as empirical risk minimization, any passive importance-weighted learning algorithm can be used to learn ht based on St . 3.1. A safety guarantee for IWAL A desirable property for a learning algorithm is consistency: Given an infinite budget of unlabeled and labeled examples, does the algorithm converge to the best predictor? Some early active learning algorithms (Cohn et al., 1994; Dasgupta et al., 2005) do not satisfy this baseline guarantee if the points cannot be classified perfectly by the given hypothesis class. We prove that IWAL algorithms are consistent, as long as pt is bounded away from 0. Comparing this result to the usual sample complexity bounds in supervised learning (for example, corollary 4.2 of (Langford, 2005)), we see that the label complexity is at most 2/p2 times that of a supervised min algorithm. For simplicity, the bound is given in terms of ln |H| rather than the VC dimension of H. The argument, which is a martingale modification of standard results, can be extended to VC spaces. Theorem 1. For all distributions D, for all finite hypothesis classes H, for any > 0, if there is a constant pmin > 0 such that pt pmin for all 1 t T , then ln |H| + ln 2 2 < . P max |LT (h) - L(h)| > hH pmin T Proof. Fix D. For a hypothesis h H, consider a sequence of random variables U1 , . . . , UT with Ut = Qt pt l(h(xt ), yt ) - L(h). Since pt pmin , |Ut | 1/pmin . The sequence Zt = i=1 Ui is a martingale, letting Z0 = 0. Indeed, for any 1 t T , E[Zt | Zt-1 , ..., Z0 ] = E [Ut + Zt-1 | Zt-1 , ..., Z0 ] = E [ l(h(xt ), yt ) - L(h) + Zt-1 | Zt-1 , ..., Z0 ] = Zt-1 . Observe that |Zt+1 - Zt | = |Ut+1 | 1/pmin for all 0 t < T . Using ZT = T (LT (h) - L(h)) and applying t Azuma's inequality (1967), we see that for any > 0, P |LT (h) - L(h)| > < 2e- 2 /2 pmin T . Setting = 2(ln |H| + ln(2/)) and taking a union bound over h H then yields the desired result. 4. Setting the rejection threshold Algorithm 2 gives a particular instantiation of the rejection threshold subroutine in IWAL. The subroutine maintains an effective hypothesis class Ht , which is initially all of H and then gradually shrinks by setting Ht+1 to the subset of Ht whose empirical loss isn't too much worse than L , the smallest empirical loss in Ht : t Ht+1 = {h Ht : Lt (h) L + t }. t (8/t) ln(2t(t + 1)|H|2 /) The allowed slack t = comes from a standard sample complexity bound. We will show that, with high probability, any optimal hypothesis h is always in Ht , and thus all other hypotheses can be discarded from consideration. For each xt , the loss-weighting scheme looks at the range of predictions on xt made by hypotheses in Ht and sets the sampling probability pt to (roughly) the size of this range: precisely, pt = maxf,gHt maxy l(f (xt ), y) - l(g(xt ), y). Since the loss values are normalized to lie in [0, 1], we can be sure that pt is also in this interval. Algorithm 2 loss-weighting (x, {xi , yi , pi , Qi : i < t}) 1. Initialize H0 = H. t-1 Qi 1 l(h(xi ), yi ) 2. Update Lt-1 = min hHt-1 t - 1 p i=1 i Set Ht to {h Ht-1 : 1 t-1 t-1 i=1 Qi l(h(xi ), yi ) L + t-1 t-1 pi 3. Return pt = maxf,gHt ,yY l(f (x), y) - l(g(x), y). 4.1. A generalization bound We start with a large deviation bound for each ht output by IWAL(loss-weighting). It is not a corollary of theorem 1 because it does not require the sampling probabilities be bounded below away from zero. Theorem 2. Pick any learning problem D and hypothesis class H, and let h H be a minimizer of the loss function with respect to D. For any > 0, with probability at least 1 - , for any T : (i) h HT , and Importance Weighted Active Learning (ii) L(f ) - L(g) 2T -1 for any f, g HT . In particular, if hT is the hypothesis output by IWAL(lossweighting), L(hT ) - L(h ) 2T -1 . We need the following lemma for the proof. Lemma 3. For all learning problems D, for all hypothesis classes H, for all > 0, with probability at least 1 - , for all T and all f, g HT , |LT (f ) - LT (g) - L(f ) + L(g)| T . Proof. Pick any T and f, g HT , and define Zt = Qt pt l(f (xt ), yt ) - l(g(xt ), yt ) - (L(f ) - L(g)). Then E [ Zt | Z1 , · · · , Zt-1 ] = E xt ,yt [ l(f (xt ), yt )- l(g(xt ), yt ) | Z1 , · · · , Zt-1 ] - (L(f ) - L(g)) = 0. Thus Z1 , Z2 , . . . is a martingale difference sequence. We can use Azuma's inequality to show that its sum is tightly concentrated, if the individual Zt are bounded. To check boundedness, observe that since f, g are in HT , they must also be in H1 , H2 , . . . , HT -1 . Thus for all t T , pt |l(f (xt ), yt ) - l(g(xt ), yt )|, whereupon 1 |Zt | pt |l(f (xt ), yt ) - l(g(xt ), yt )| + |L(f ) - L(g)| 2. We allow failure probability /2T (T + 1) at time T . Applying Azuma's inequality, we have P[|LT (f ) - LT (g) - L(f ) + L(g)| T ] T Dasgupta et al. (2008) studied this question for an active learning scheme under 0­1 loss. For learning problems with bounded disagreement coefficient (Hanneke, 2007), the number of queries was found to be O(T + d log2 T ), where d is the VC dimension of the function class, and is the best error rate achievable on the underlying distribution by that function class. We will soon see (section 6) that the term T is inevitable for any active learning scheme; the remaining term has just a polylogarithmic dependence on T . We generalize the disagreement coefficient to arbitrary loss functions and show that, under conditions similar to the earlier result, the number of queries is O T + dT log2 T , where is now the best achievable loss. The inevitable T is still there, and the second term is still sublinear. 5.1. Label Complexity: Main Issues Suppose the loss function is minimized by h H, with L = L(h ). Theorem 2 shows that at time t, the remaining hypotheses Ht include h and all have losses in the range [L , L +2t-1 ]. We now prove that under suitable conditions, the sampling probability pt has expected value L + t-1 . Thus the expected total number of labels queried upto time T is roughly T L T + t=1 t-1 L T + T ln |H|. To motivate the proof, consider a loss function l(z, y) = (yz); all our examples are of this form. Say is differentiable with 0 < C0 | | C1 . Then the sampling probability for xt is pt = = f,gHt f,gHt =P t=1 Zt T T 2e-T T /8 = 2 . T (T + 1)|H|2 Since HT is a random subset of H, it suffices to take a union bound over all f, g H. A union bound over T finishes the proof. Proof of theorem 2. Start by assuming that the 1 - probability event of lemma 3 holds. We first show by induction that h = arg minhH L(h) is in HT for all T . It holds at T = 1, since H1 = H0 = H. Now suppose it holds at T , and show that it is true at T +1. Let hT minimize LT over HT . By lemma 3, LT (h ) - LT (hT ) L(h ) - L(hT ) + T T . Thus LT (h ) L + T and hence h HT +1 . T Next, we show that for any f, g HT we have L(f )-L(g) 2T -1 . By lemma 3, since HT HT -1 , L(f ) - L(g) LT -1 (f ) - LT -1 (g) + T -1 L -1 + T T -1 - L -1 + T -1 = 2T -1 . Since hT , h HT , T we have L(hT ) L(h ) + 2T -1 . max max l(f (xt ), y) - l(g(xt ), y) y max max (yf (xt )) - (yg(xt )) y f,gHt y C1 max max |yf (xt ) - yg(xt )| = C1 max |f (xt ) - g(xt )| f,gHt hHt 2C1 max |h(xt ) - h (xt )|. So pt is determined by the range of predictions on xt by hypotheses in Ht . Can we bound the size of this range, given that any h Ht has loss at most L + 2t-1 ? 2t-1 L(h) - L Ex,y |l(h(x), y) - l(h (x), y)| - 2L Ex,y C0 |y(h(x) - h (x))| - 2L = C0 Ex |h(x) - h (x)| - 2L . So we can upperbound maxhHt Ex |h(x) - h (x)| (in terms of L and t-1 ), whereas we want to upperbound the expected value of pt , which is proportional 5. Label Complexity We showed that the loss of the classifier output by IWAL(loss-weighting) is similar to the loss of the classifier chosen passively after seeing all T labels. How many of those T labels does the active learner request? Importance Weighted Active Learning to Ex maxhHt |h(x)-h (x)|. The ratio between these two quantities is related to a fundamental parameter of the learning problem, a generalization of the disagreement coefficient (Hanneke, 2007). We flesh out this intuition in the remainder of this section. First we describe a broader class of loss functions than those considered above (including 0­1 loss, which is not differentiable); a distance metric on hypotheses, and a generalized disagreement coefficient. We then prove that for this broader class, active learning performs better than passive learning when the generalized disagreement coefficient is small. 5.2. A subclass of loss functions We give label complexity upper bounds for a class of loss functions that includes 0­1 loss and logistic loss but not hinge loss. Specifically, we require that the loss function has bounded slope asymmetry, defined below. Recall earlier notation: response space Z, classifier space H = {h : X Z}, and loss function l : Z ×Y [0, ). Henceforth, the label space is Y = {-1, +1}. Definition 4. The slope asymmetry of a loss function l : Z × Y [0, ) is Kl = sup z,z Z Definition 7. For any f, g H and distribution D define (f, g) = ExD maxy |l(f (x), y) - l(g(x), y)|. For any r 0, let B(f, r) = {g H : (f, g) r}. Suppose L = minhH L(h) is realized at h . We know that at time t, the remaining hypotheses have loss at most L + 2t-1 . Does this mean they are close to h in -distance? The ratio between the two can be expressed in terms of the slope asymmetry of the loss. Lemma 8. For any distribution D and any loss function with slope asymmetry Kl , we have (h, h ) Kl (L(h) + L ) for all h H. Proof. For any h H, (h, h ) = Ex maxy |l(h(x), y) - l(h (x), y)| Kl Ex,y |l(h(x), y) - l(h (x), y)| Kl (Ex,y [l(h(x), y)] + Ex,y [l(h (x), y)]) = Kl (L(h) + L(h )). 5.4. A generalized disagreement coefficient When analyzing the A2 algorithm (Balcan et al., 2006) for active learning under 0­1 loss, (Hanneke, 2007) found that its label complexity could be characterized in terms of what he called the disagreement coefficient of the learning problem. We now generalize this notion to arbitrary loss functions. Definition 9. The disagreement coefficient is the infimum value of such that for all r, ExD suphB(h ,r) supy |l(h(x), y)-l(h (x), y)| r. Here is a simple example for linear separators. Lemma 10. Suppose H consists of linear classifiers {u Rd : u B} and the data distribution D is uniform over the surface of the unit sphere in Rd . Suppose the loss function is l(z, y) = (yz) for differentiable with C0 | | C1 . Then the disagreement coeffi cient is at most (2C1 /C0 ) d. Proof. Let h be the optimal classifier, and h any other classifier with (h, h ) r. Let u , u be the corresponding vectors in Rd . Using lemma 5, r ExD supy |l(h(x), y) - l(h (x), y)| C0 ExD |h(x) - h (x)| = C0 ExD |(u - u ) · x| C0 u - u /(2 d). Thus for any h B(h , r), we have that the corresponding vectors satisfy u - u 2r d/C0 . We can now bound the disagreement coefficient. ExD suphB(h ,r) supy |l(h(x), y) - l(h (x), y)| C1 ExD suphB(h ,r) |h(x) - h (x)| C1 Ex sup{|(u - u ) · x| : u - u 2r d/C0 } C1 · 2r d/C0 . maxyY |l(z, y) - l(z , y)| . minyY |l(z, y) - l(z , y)| The slope asymmetry is 1 for 0­1 loss, and for hinge loss. For differentiable loss functions l(z, y) = (yz), it is easily related to bounds on the derivative. Lemma 5. Let l (z, y) = (zy), where is a differentiable function defined on Z = (-B, B) R. Suppose C0 | (z)| C1 for all z Z. Then for any z, z Z, and any y {-1, +1}, C0 |z - z | |l (z, y) - l (z , y)| C1 |z - z |. Thus l has slope asymmetry C1 /C0 . Proof. By the mean value theorem, there is some Z such that l (z, y) - l (z , y) = (yz) - (yz ) = ()(yz -yz ). Thus |l (z, y)-l (z , y)| = | ()|·|z - z |, and the rest follows from the bounds on . For instance, this immediately applies to logistic loss. Corollary 6. Logistic loss l(z, y) = ln(1 + e-yz ), defined on label space Y = {-1, +1} and response space [-B, B], has slope asymmetry at most 1 + eB . 5.3. Topologizing the space of classifiers We introduce a simple distance function on the space of classifiers. Importance Weighted Active Learning 5.5. Upper Bound on Label Complexity Finally, we give a bound on label complexity for learning problems with bounded disagreement coefficient and loss functions with bounded slope asymmetry. Theorem 11. For all learning problems D and hypothesis spaces H, if the loss function has slope asymmetry Kl , and the learning problem has disagreement coefficient , then for all > 0, with probability at least 1 - over the choice of data, the expected number of labels requested by IWAL(loss-weighting) during the first T iterations is at most 4 · Kl · (L T + O( T ln(|H|T /))), where L is the minimum loss achievable on D by H, and the expectation is over the randomness in the selective sampling. Proof. Suppose h H achieves loss L . Pick any time t. By theorem 2, Ht {h H : L(h) L + 2t-1 } and by lemma 8, Ht B(h , r) for r = Kl (2L + 2t-1 ). Thus, the expected value of pt (over the choice of x at time t) is at most ExD supf,gHt supy |l(f (x), y) - l(g(x), y)| 2 ExD suphHt supy |l(h(x), y) - l(h (x), y)| 2 ExD suphB(h ,r) supy |l(h(x), y) - l(h (x), y)| 2r = 4·Kl ·(L + t-1 ) . Summing over t = 1, . . . , T , T we get the lemma, using t=1 (1/ t) = O( T ). H is ; (b) any active learner seeking a classifier of error at most + must make (d 2 / 2 ) queries to succeed with probability at least 1/2. Proof. Pick a set of d points xo , x1 , x2 , . . . , xd-1 shattered by H. Here is a distribution over X × Y : point xo has probability 1 - , while each of the remaining xi has probability /(d - 1), where = 2( + 2 ). At xo , the response is always y = 1. At xi , i 1, the response is y = 1 with probability 1/2 + bi , where bi is either +1 or -1, and = 2 / = /( + 2 ) < 1/4. Nature starts by picking b1 , . . . , bd-1 uniformly at random. This defines the target hypothesis h : h (xo ) = 1 and h (xi ) = bi . Its error rate is · (1/2 - ) = . Any learner outputs a hypothesis in H and thus implicitly makes guesses at the underlying hidden bits bi . Unless it correctly determines bi for at least 3/4 of the points x1 , . . . , xd-1 , the error of its hypothesis will be at least + (1/4) · · (2) = + . Now, suppose the active learner makes at most c(d - 1)/ 2 queries, where c is a small constant (c 1/125 suffices). We'll show that it fails (outputs a hypothesis with error at least + ) with probability at least 1/2. We'll say xi is heavily queried if the active learner queries it at least 4c/ 2 times. At most 1/4 of the xi 's are heavily queried; without loss of generality, these are x1 , . . . , xk , for some k (d - 1)/4. The remaining xi get so few queries that the learner guesses each corresponding bit bi with probability less than 2/3; this can be derived from Slud's lemma (below), which relates the tails of a binomial to that of a normal. Let Fi denote the event that the learner gets bi wrong; so EFi 1/3 for i > k. Since k (d - 1)/4, the probability that the learner fails is given by P[F1 +· · ·+ Fd-1 (d-1)/4] P[Fk+1 +· · ·+Fd-1 (d-1)/4] P[B (d - 1)/4] P[Z 0] = 1/2, where B is a binomial((3/4)(d - 1), 1/3) random variable, Z is a standard normal, and the last inequality follows from Slud's lemma. Thus the active learner must make at least c(d - 1)/ 2 = (d 2 / 2 ) queries to succeed with probability at least 1/2. Lemma 13 (Slud (1977)). Let B be a Binomial (n, p) random variable with p 1/2, and let Z be a standard normal. For any k [np, n(1 - p)], P[B k] P[Z (k - np)/ np(1 - p)]. Theorem 12 uses the same example that is used for lower bounds on supervised sample complexity (section 14.4 of (Devroye et al., 1996)), although in that case the lower bound is d/ 2 . The bound for active learning is smaller by a factor of because the ac- 6. A lower bound on label complexity K¨¨ri¨inen (2006) showed that for any hypothesis class aa a H and any > > 0, there is a data distribution such that (a) the optimal error rate achievable by H is ; and (b) any active learner that finds h H with error rate at most + (with probability greater than 1/2) must make 2 / 2 queries. We now strengthen this lower bound to d 2 / 2 , where d is the VC dimension of H. Let's see how this relates to the label complexity rates of the previous section. It is well-known that if a supervised learner sees T examples (for any T > d/), its final hypothesis has error at most + d/T (Devroye et al., 1996) with high probability. Think of this as + for = d/T . Our lower bound now implies that an active learner must make at least d 2 / 2 = T queries. This explains the T leading term in all the label complexity bounds we have discussed. Theorem 12. For any , > 0 such that 2 1/4, for any input space X and hypothesis class H (of functions mapping X into Y = {+1, -1}) of VC dimension 1 < d < , there is a distribution over X × Y such that (a) the best error rate achievable by Importance Weighted Active Learning tive learner can avoid making repeated queries to the "heavy" point xo , whose label is immediately obvious. 7. Implementing IWAL IWAL(loss-weighting) can be efficiently implemented in the case where H is the class of bounded-length linear separators {u Rd : u 2 B} and the loss function is convex: l(z, y) = (yz) for convex . Each iteration of Algorithm 2 involves solving two optimization problems over a restricted hypothesis set Ht = t