Robust Bounds for Classification via Selective Sampling

Nicolò Cesa-Bianchi (cesa-bianchi@dsi.unimi.it)
DSI, Università degli Studi di Milano, Via Comelico 39, 20135 - Milano, Italy

Claudio Gentile (claudio.gentile@uninsubria.it)
DICOM, Università dell'Insubria, Via Mazzini 5, 21100 - Varese, Italy

Francesco Orabona (forabona@idiap.ch)
Idiap, Rue Marconi 19, Case Postale 592, CH-1920 Martigny, Switzerland

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We introduce a new algorithm for binary classification in the selective sampling protocol. Our algorithm uses Regularized Least Squares (RLS) as base classifier, and for this reason it can be efficiently run in any RKHS. Unlike previous margin-based semi-supervised algorithms, our sampling condition hinges on a simultaneous upper bound on bias and variance of the RLS estimate under a simple linear label noise model. This fact allows us to prove performance bounds that hold for an arbitrary sequence of instances. In particular, we show that our sampling strategy approximates the margin of the Bayes optimal classifier to any desired accuracy $\varepsilon$ by asking $O(d/\varepsilon^2)$ queries (in the RKHS case $d$ is replaced by a suitable spectral quantity). While these are the standard rates in the fully supervised i.i.d. case, the best previously known result in our harder setting was $O(d^3/\varepsilon^4)$. Preliminary experiments show that some of our algorithms also exhibit a good practical performance.

1. Introduction

A practical variant of the standard (fully supervised) online learning protocol is a setting where, at each prediction step, the learner can abstain from observing the current label. If the learner observes the label, which he can do by issuing a query, then the label value can be used to improve future predictions. If the label is predicted but not queried, then the learner never knows whether his prediction was correct. Thus, only queried labels are observed while all others remain unknown. This protocol is often called selective sampling, and we interchangeably use queried labels and sampled labels to denote the labels observed by the learner.

Given a general online prediction technique, like regularized least squares (RLS), we are interested in controlling the predictive performance as the query rate goes from fully supervised (all labels are queried) to fully unsupervised (no label is queried). This is motivated by observing that, in a typical practical scenario, one might want to control the accuracy of predictions while imposing an upper bound on the query rate. In fact, the number of observed labels usually has a very direct influence on basic computational aspects of online learning algorithms, such as running time and storage requirements.

In this work we develop semi-supervised variants of RLS for binary classification. We analyze these variants under no assumptions on the mechanism generating the sequence of instances, while imposing a simple linear noise model for the conditional label distribution. Intuitively, our algorithms issue a query when a common upper bound on bias and variance of the current RLS estimate is larger than a given threshold. Conversely, when this upper bound gets small, we infer via a simple large deviation argument that the margin of the RLS estimate on the current instance is close enough to the margin of the Bayes optimal classifier. Hence the learner can safely avoid issuing a query on that step.
In order to summarize our results, assume for the sake of simplicity that the Bayes optimal classifier --which for us is a linear classifier $u$-- has unknown margin $|u^\top x_t| \ge \varepsilon > 0$ on all instances $x_t \in \mathbb{R}^d$. Then, in our data model, the average (per-step) risk of the fully supervised RLS, asking $N_T = T$ labels in $T$ steps, is known to converge to the average risk of the Bayes optimal classifier at rate $\frac{d}{\varepsilon^2}\,T^{-1}$, excluding logarithmic factors. In this work we show that, using our semi-supervised RLS variant, we can replace $N_T = T$ with any desired query bound $N_T \approx d\,T^{\kappa}$ (for $0 \le \kappa \le 1$) while achieving a convergence rate of order $\frac{d}{\varepsilon^2}\,T^{-1} + \frac{1}{\varepsilon^{2/\kappa}}\,T^{-1}$. One might wonder whether these results could also be obtained just by running a standard RLS algorithm with a constant label sampling rate, say $T^{\kappa-1}$, independent of the sequence of instances. If we could prove for RLS an instantaneous regret bound like $d/T$, then the answer would be affirmative. However, the lack of assumptions on the way instances are generated makes it hard to prove any nontrivial instantaneous regret bound.

If the margin $\varepsilon$ is known, or, equivalently, our goal is to approximate the Bayes margin to some accuracy $\varepsilon$, then we show that the above strategies achieve, with high probability, any desired accuracy by querying only order of $d/\varepsilon^2$ labels (excluding logarithmic factors). Again, the reader should observe that this bound could not be obtained by, say, concentrating all queries on an initial phase of length $O(d/\varepsilon^2)$. In such a case, an obvious adversarial strategy would be to generate noninformative instances just in that phase. In short, if we require online semi-supervised learning algorithms to work in worst-case scenarios, we need to resort to nontrivial label sampling techniques.

We have run comparative experiments on both artificial and real-world medium-sized datasets. These experiments, though preliminary in nature, reveal the effectiveness of our sampling strategies even from a practical standpoint.

1.1. Related work and comparison

Selective sampling is a well-known semi-supervised online learning setting. Pioneering works in this area are (Cohn et al., 1990) and (Freund et al., 1997). More recent related results focusing on linear classification problems include (Balcan et al., 2006; Balcan et al., 2007; Cavallanti et al., 2009; Cesa-Bianchi et al., 2006b; Dasgupta et al., 2008; Dasgupta et al., 2005; Strehl & Littman, 2008), although some of these works analyze batch rather than online protocols. Most previous studies consider the case when instances are drawn i.i.d. from a fixed distribution, exceptions being the worst-case analysis in (Cesa-Bianchi et al., 2006b) and the very recent analysis in the KWIK learning protocol (Strehl & Littman, 2008). Both of these papers use variants of RLS working on arbitrary instance sequences. The work (Cesa-Bianchi et al., 2006b) is completely worst case: the authors make no assumptions whatsoever on the mechanism generating labels and instances; however, they are unable to prove bounds on the label query rate as we do here. The KWIK model of (Strehl & Littman, 2008) --see also the more general setup in (Li et al., 2008)-- is closest to the setting considered in this paper. There the goal is to approximate the Bayes margin to within a given accuracy $\varepsilon$. The authors assume arbitrary sequences of instances and the same linear stochastic model for labels as the one considered here.
A modification of the linear regressor in (Auer, 2002), combined with covering arguments, allows them to compete against an adaptive adversarial strategy for generating instances. Their algorithm, however, yields the significantly worse bound $O(d^3/\varepsilon^4)$ on the number of queries, and seems to work in the finite dimensional ($d < \infty$) case only. In contrast, our algorithms achieve the better query bound $O(d/\varepsilon^2)$ against oblivious adversaries. Moreover, our algorithms can be easily run in any infinite dimensional RKHS.

2. Preliminaries

In the selective sampling protocol for online binary classification, at each step $t = 1, 2, \dots$ the learner receives an instance $x_t \in \mathbb{R}^d$ and outputs a binary prediction for the associated unknown label $y_t \in \{-1, +1\}$. After each prediction the learner may observe the label $y_t$ only by issuing a query. If no query is issued at time $t$, then $y_t$ remains unknown. Since one expects the learner's performance to improve if more labels are observed, our goal is to trade off predictive accuracy against number of queries.

All results proven in this paper hold for any fixed individual sequence $x_1, x_2, \dots$ of instances, under the sole assumption that $\|x_t\| = 1$ for all $t \ge 1$. Given any such sequence, we assume the corresponding labels $y_t \in \{-1, +1\}$ are realizations of random variables $Y_t$ such that $E\,Y_t = u^\top x_t$ for all $t \ge 1$, where $u \in \mathbb{R}^d$ is a fixed and unknown vector such that $\|u\| = 1$. Note that $\mathrm{sgn}(\Delta_t)$, for $\Delta_t = u^\top x_t$, is the Bayes optimal classifier for this noise model. We study selective sampling algorithms that use $\mathrm{sgn}(\hat\Delta_t)$ to predict $Y_t$. The quantity $\hat\Delta_t = w_t^\top x_t$ is the margin computed via the RLS estimate

$$w_t = \big(I + S_{t-1}S_{t-1}^\top + x_t x_t^\top\big)^{-1} S_{t-1}\,Y^{t-1} \qquad (1)$$

defined over the matrix $S_{t-1} = \big[x'_1, \dots, x'_{N_{t-1}}\big]$ of the $N_{t-1}$ queried instances up to time $t-1$. The random vector $Y^{t-1} = \big(Y'_1, \dots, Y'_{N_{t-1}}\big)$ contains the observed labels (so that $Y'_k$ is the label of $x'_k$), and $I$ is the $d \times d$ identity matrix. We are interested in simultaneously controlling the cumulative regret

$$R_T = \sum_{t=1}^{T}\Big( P(Y_t\hat\Delta_t < 0) - P(Y_t\Delta_t < 0) \Big) \qquad (2)$$

and the number $N_T$ of queried labels, uniformly over $T$.

Let $A_t = I + S_{t-1}S_{t-1}^\top + x_t x_t^\top$. We introduce the following relevant quantities:

$$B_t = u^\top\big(I + x_t x_t^\top\big)A_t^{-1}x_t\,, \qquad q_t = \big\|S_{t-1}^\top A_t^{-1}x_t\big\|\,, \qquad r_t = x_t^\top A_t^{-1}x_t\,, \qquad s_t = \big\|A_t^{-1}x_t\big\|\,.$$
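To make these quantities concrete, the following minimal numpy sketch computes $\hat\Delta_t$, $r_t$, $s_t$ and $q_t$ directly from their definitions. This is not the authors' code: the function and variable names are ours, instances are assumed to be stored as columns of $S_{t-1}$, and a practical implementation would update $A_t^{-1}$ incrementally (or work in dual variables for the RKHS case) rather than re-solving a linear system at every step.

```python
import numpy as np

def rls_quantities(S_prev, y_prev, x_t):
    """Return (margin_hat, r_t, s_t, q_t) for the current instance x_t.

    S_prev : (d, n) matrix whose columns are the n queried instances so far
    y_prev : (n,) vector of the corresponding observed labels
    x_t    : (d,) current instance (assumed normalized, ||x_t|| = 1)
    """
    d = x_t.shape[0]
    A_t = np.eye(d) + S_prev @ S_prev.T + np.outer(x_t, x_t)
    A_inv_x = np.linalg.solve(A_t, x_t)          # A_t^{-1} x_t
    w_t = np.linalg.solve(A_t, S_prev @ y_prev)  # RLS weight vector, eq. (1)

    margin_hat = w_t @ x_t                       # \hat{Delta}_t
    r_t = x_t @ A_inv_x                          # x_t^T A_t^{-1} x_t
    s_t = np.linalg.norm(A_inv_x)                # ||A_t^{-1} x_t||
    q_t = np.linalg.norm(S_prev.T @ A_inv_x)     # ||S_{t-1}^T A_t^{-1} x_t||
    return margin_hat, r_t, s_t, q_t
```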
The following properties of the RLS estimate (1) have been proven in, e.g., (Cesa-Bianchi et al., 2006a).

Lemma 1 For each $t = 1, 2, \dots$ the following inequalities hold:
1. $E\,\hat\Delta_t = \Delta_t - B_t$;
2. $|B_t| \le s_t + r_t$;
3. $s_t \le \sqrt{r_t}$;
4. $q_t^2 \le r_t$;
5. for all $\varepsilon > 0$, $P\big(|\hat\Delta_t + B_t - \Delta_t| \ge \varepsilon\big) \le 2\exp\big(-\tfrac{\varepsilon^2}{2q_t^2}\big)$;
6. if $N_T$ is the total number of queries issued in the first $T$ steps, then
$$\sum_{\substack{1\le t\le T \\ Y_t\ \text{queried}}} r_t \;\le\; \sum_{i}\ln(1+\lambda_i) \;\le\; d\,\ln\Big(1+\frac{N_T}{d}\Big)\,,$$
where $\lambda_i$ is the $i$-th eigenvalue of the (Gram) matrix $S_T^\top S_T$ defined on the queried instances.

3. A new selective sampling algorithm

Algorithm 1 The BBQ selective sampler
Parameters: $0 \le \kappa \le 1$.
Initialization: weight vector $w = 0$.
for each time step $t = 1, 2, \dots$ do
  observe instance $x_t \in \mathbb{R}^d$;
  predict label $y_t \in \{-1, +1\}$ with $\mathrm{sgn}(w^\top x_t)$;
  if $r_t > t^{-\kappa}$ then
    query label $y_t$;
    update $w_t$ using $(x_t, y_t)$ as in (1).
  end if
end for

Our main theoretical result provides bounds on the cumulative regret and the number of queried labels for the selective sampling algorithm introduced in Algorithm 1. We call this algorithm the BBQ (Bound on Bias Query) algorithm. BBQ queries $x_t$ whenever $r_t$ is larger than a threshold vanishing as $t^{-\kappa}$, where $0 \le \kappa \le 1$ is an input parameter. This simple query condition builds on Property 5 of Lemma 1. This property shows that $\hat\Delta_t$ is likely to be close to the margin $\Delta_t$ of the Bayes optimal predictor when both the bias $B_t$ and the variance bound $q_t^2$ are small. Since these quantities are both bounded by (functions of) $r_t$ (see Properties 2, 3, and 4 of Lemma 1), this suggests that one can safely disregard $Y_t$ when $r_t$ is small.

According to our noise model, the label of $x_t$ is harder to predict if $|\Delta_t|$ is small. For this reason, our regret bound is split into a cumulative regret on "big margin" steps $t$, where $|\Delta_t| \ge \varepsilon$, and "small margin" steps, where $|\Delta_t| < \varepsilon$. On one hand, we bound the regret on small margin steps simply by $\varepsilon\,T_\varepsilon$, where $T_\varepsilon = \big|\{1 \le t \le T : |\Delta_t| < \varepsilon\}\big|$. On the other hand, we show that the overall regret can be bounded in terms of the best possible choice of $\varepsilon$, with no need for the algorithm to know this optimal value.

Theorem 1 If BBQ is run with input $\kappa \in [0, 1]$ then its cumulative regret $R_T$ after any number $T$ of steps satisfies
$$R_T \;\le\; \min_{0<\varepsilon<1}\left(\varepsilon\,T_\varepsilon \;+\; (2+e)\,\Big\lceil\frac{1}{\kappa}\Big\rceil!\,\Big(\frac{8}{\varepsilon^2}\Big)^{1/\kappa} \;+\; \Big(1+\frac{2}{e}\Big)\frac{8d}{\varepsilon^2}\,\ln\Big(1+\frac{N_T}{d}\Big)\right)\,.$$
The number of queried labels is $N_T = O\big(d\,T^{\kappa}\ln T\big)$.

It is worth observing that the bounds presented here hold in the finite dimensional ($d < \infty$) case only. One can easily turn them to work in any RKHS after switching to an eigenvalue representation of the cumulative regret --e.g., by using the middle bound in Property 6 of Lemma 1 rather than the rightmost one, as we did in the proof below. This essentially corresponds to analyzing Algorithm 1 in a dual variable representation. A similar comment holds for Remark 1 and Theorem 2 below.

In the rest of the paper we denote by $\{\pi\}$ the indicator function of a Boolean predicate $\pi$.

Proof: [of Theorem 1] Fix any $\varepsilon \in (0,1)$. As in (Cavallanti et al., 2009), we first observe that our label noise model allows us to upper bound the time-$t$ regret $P(Y_t\hat\Delta_t < 0) - P(Y_t\Delta_t < 0)$ as
$$P(Y_t\hat\Delta_t < 0) - P(Y_t\Delta_t < 0) \;\le\; \varepsilon\,\{|\Delta_t| < \varepsilon\} + P\big(\hat\Delta_t\Delta_t \le 0,\ |\Delta_t| \ge \varepsilon\big) \;\le\; \varepsilon\,\{|\Delta_t| < \varepsilon\} + P\big(|\hat\Delta_t - \Delta_t| \ge \varepsilon\big)\,.$$
Hence the cumulative regret (2) can be split as follows:
$$R_T \;\le\; \varepsilon\,T_\varepsilon + \sum_{t=1}^{T} P\big(|\hat\Delta_t - \Delta_t| \ge \varepsilon\big)\,. \qquad (3)$$
We proceed by expanding the indicator of $\big\{|\hat\Delta_t - \Delta_t| \ge \varepsilon\big\}$ with the introduction of the bias term $B_t$:
$$\big\{|\hat\Delta_t - \Delta_t| \ge \varepsilon\big\} \;\le\; \big\{|\hat\Delta_t + B_t - \Delta_t| \ge \varepsilon/2\big\} + \big\{|B_t| \ge \varepsilon/2\big\}\,.$$
Note that
$$\big\{|B_t| \ge \varepsilon/2\big\} \;\le\; \big\{\sqrt{2\,r_t} \ge \varepsilon/2\big\} \;\le\; e\,\exp\Big(\!-\frac{\varepsilon^2}{8\,r_t}\Big)\,,$$
the first inequality deriving from a combination of Properties 2 and 3 in Lemma 1 and then overapproximating, whereas the second one uses $\{b \le 1\} \le e^{1-b}$, which holds for all $b \ge 0$. Moreover, by Properties 4 and 5 in Lemma 1, we have that
$$P\big(|\hat\Delta_t + B_t - \Delta_t| \ge \varepsilon/2\big) \;\le\; 2\exp\Big(\!-\frac{\varepsilon^2}{8\,q_t^2}\Big) \;\le\; 2\exp\Big(\!-\frac{\varepsilon^2}{8\,r_t}\Big)\,.$$
We substitute this back into (3) and single out the steps where queries are issued. This gives
$$R_T \;\le\; \varepsilon\,T_\varepsilon + (2+e)\sum_{t\,:\,r_t \le t^{-\kappa}}\exp\Big(\!-\frac{\varepsilon^2}{8\,r_t}\Big) + (2+e)\sum_{t\,:\,r_t > t^{-\kappa}}\exp\Big(\!-\frac{\varepsilon^2}{8\,r_t}\Big)\,.$$
The second term is bounded as follows:
$$\sum_{t\,:\,r_t \le t^{-\kappa}}\exp\Big(\!-\frac{\varepsilon^2}{8\,r_t}\Big) \;\le\; \sum_{t=1}^{T}\exp\Big(\!-\frac{\varepsilon^2\,t^{\kappa}}{8}\Big) \;\le\; \int_0^{\infty}\exp\Big(\!-\frac{\varepsilon^2\,x^{\kappa}}{8}\Big)\,dx \;=\; \frac{\Gamma(1/\kappa)}{\kappa}\Big(\frac{8}{\varepsilon^2}\Big)^{1/\kappa}\,,$$
where $\Gamma(\cdot)$ is the Euler Gamma function $\Gamma(x) = \int_0^{\infty} e^{-t}\,t^{x-1}\,dt$. We further bound $\frac{\Gamma(1/\kappa)}{\kappa} = \Gamma(1+1/\kappa) \le \big\lceil 1/\kappa\big\rceil!$ using the monotonicity of $\Gamma$. For the third term we write
$$\sum_{t\,:\,r_t > t^{-\kappa}}\exp\Big(\!-\frac{\varepsilon^2}{8\,r_t}\Big) \;\le\; \sum_{t\,:\,r_t > t^{-\kappa}}\frac{8\,r_t}{e\,\varepsilon^2} \;\le\; \frac{8d}{e\,\varepsilon^2}\,\ln\Big(1+\frac{N_T}{d}\Big)\,.$$
The first step uses the inequality $e^{-x} \le \frac{1}{e\,x}$ for $x > 0$, while the second step uses Property 6 in Lemma 1. Finally, in order to derive a bound on the number $N_T$ of queried labels, we have
$$N_T \;=\; \sum_{t\,:\,r_t > t^{-\kappa}} 1 \;\le\; \sum_{t\,:\,r_t > t^{-\kappa}} t^{\kappa}\,r_t \;\le\; T^{\kappa}\sum_{t\,:\,r_t > t^{-\kappa}} r_t \;\le\; d\,T^{\kappa}\,\ln\Big(1+\frac{N_T}{d}\Big)\,,$$
where for the last inequality we used, once more, Property 6 in Lemma 1. Hence $N_T = O\big(d\,T^{\kappa}\ln T\big)$, and this concludes the proof.

It is important to observe that, if we disregard the margin term $\varepsilon\,T_\varepsilon$ (which is fully controlled by the adversary), the regret bound depends logarithmically on $T$ for any constant $\kappa > 0$:
$$R_T \;\le\; \varepsilon\,T_\varepsilon + O\Big(\frac{d\,\ln T}{\varepsilon^2} + \frac{1}{\varepsilon^{2/\kappa}}\Big)\,.$$
If $\kappa$ is set to 1 then our bound on the number of queries $N_T$ becomes vacuous, and the selective sampling algorithm essentially becomes fully supervised. This recovers the known regret bound for RLS in the fully supervised case, $R_T \le \varepsilon\,T_\varepsilon + O\big(\frac{d\,\ln T}{\varepsilon^2}\big)$.
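To fix ideas, the loop of Algorithm 1 can be sketched in a few lines of Python/numpy. This is a sketch under our own naming (run_bbq is not from the paper); it keeps the full $d \times d$ matrix $A_t$ in primal form, recomputes the estimate of eq. (1) at every step, and ignores the dual/kernel formulation discussed above.

```python
import numpy as np

def run_bbq(instances, labels, kappa):
    """Sketch of Algorithm 1 (BBQ): predict with the RLS margin and
    query the label y_t only when r_t > t^(-kappa)."""
    d = instances.shape[1]
    A = np.eye(d)                    # I + sum of x x^T over queried instances
    b = np.zeros(d)                  # S_{t-1} Y^{t-1}
    predictions, n_queries = [], 0
    for t, x in enumerate(instances, start=1):
        A_t = A + np.outer(x, x)     # A_t = I + S_{t-1} S_{t-1}^T + x_t x_t^T
        w = np.linalg.solve(A_t, b)  # RLS estimate, eq. (1)
        predictions.append(1 if w @ x >= 0 else -1)
        r_t = x @ np.linalg.solve(A_t, x)   # r_t = x_t^T A_t^{-1} x_t
        if r_t > t ** (-kappa):             # BBQ query condition
            y = labels[t - 1]               # the label is observed only here
            A += np.outer(x, x)
            b += y * x
            n_queries += 1
    return np.array(predictions), n_queries
```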
Remark 1 A randomized variant of BBQ exists that queries label $y_t$ with independent probability $r_t^{(1-\kappa)/\kappa} \in [0, 1]$. Through a similar bias-variance analysis as the one in Theorem 1 above, one can show that in expectation (over the internal randomization of this algorithm) the cumulative regret $R_T$ is bounded by $\min_{0<\varepsilon<1}\big(\varepsilon\,T_\varepsilon + O\big(\tfrac{L}{\varepsilon^{2/\kappa}}\big)\big)$, while the number of queried labels $N_T$ is $O\big(T^{\kappa}L^{1-\kappa}\big)$, being $L = d\ln T$. This bound is similar (though generally incomparable) to the one of Theorem 1.

4. A parametric performance guarantee

In the proof of Theorem 1 the quantity $\varepsilon$ acts as a threshold for the cumulative regret, which is split into a sum over steps $t$ such that $|\Delta_t| < \varepsilon$ (where the regret grows by less than $\varepsilon$) and a sum over the remaining steps. Most technicalities in the proof are due to the fact that the final bound depends on the optimal choice of this $\varepsilon$, which the algorithm need not know. On the other hand, if a specific value for $\varepsilon$ is provided in input to the algorithm, then the cumulative regret over steps $t$ such that $|\Delta_t| \ge \varepsilon$ can be bounded by any constant $\delta > 0$ using only order of $(d/\varepsilon^2)\ln(T/\delta)$ queries. In particular, when $\min_t |\Delta_t| \ge \varepsilon$, the above logarithmic bound implies that the per-step regret vanishes exponentially fast as a function of the number of queries. As we stated in the introduction, this result cannot be obtained as an easy consequence of known results, due to the adversarial nature of the instance sequence.

We now develop the above argument for a practically motivated variant of our BBQ selective sampler. Let us disregard for a moment the bias term $B_t$. In order to guarantee that $|\hat\Delta_t - \Delta_t| \le \varepsilon$ holds when no query is issued, it is enough to observe that Property 5 of Lemma 1 implies that $|\hat\Delta_t - \Delta_t| \le \sqrt{2\,r_t\,\ln\frac{2}{\delta}}$ with probability at least $1-\delta$. This immediately delivers a rule prescribing that no query be issued at time $t$ when $\sqrt{2\,r_t\,\ln\frac{2}{\delta}} \le \varepsilon$. A slightly more involved condition, one that better exploits the inequalities of Lemma 1, allows us to obtain a significantly improved practical performance. This results in the algorithm described in Algorithm 2. The algorithm, called Parametric BBQ, takes in input two parameters $\varepsilon$ and $\delta$, and issues a query at time $t$ whenever
$$[\varepsilon - r_t - s_t]_+ \;<\; q_t\,\sqrt{2\ln\frac{2\,t(t+1)}{\delta}}\,, \qquad (4)$$
where, here and throughout, $[x]_+ = \max\{0, x\}$.

Algorithm 2 The Parametric BBQ selective sampler
Parameters: $0 < \varepsilon, \delta < 1$.
Initialization: weight vector $w = 0$.
for each time step $t = 1, 2, \dots$ do
  observe instance $x_t \in \mathbb{R}^d$;
  predict label $y_t \in \{-1, +1\}$ with $\mathrm{sgn}(w^\top x_t)$;
  if $[\varepsilon - r_t - s_t]_+ < q_t\sqrt{2\ln\frac{2t(t+1)}{\delta}}$ then
    query label $y_t$;
    update $w_t$ using $(x_t, y_t)$ as in (1).
  end if
end for

Theorem 2 If Parametric BBQ is run with input $\varepsilon, \delta \in (0, 1)$ then:
1. with probability at least $1-\delta$, $|\hat\Delta_t - \Delta_t| \le \varepsilon$ holds on all time steps $t$ when no query is issued;
2. the number $N_T$ of queries issued after any number $T$ of steps is bounded as $N_T = O\big(\tfrac{d}{\varepsilon^2}\,\ln\tfrac{T}{\delta}\,\ln\ln\tfrac{T}{\delta}\big)$.

This theorem has been phrased so as to make it easier to compare to a corresponding result in (Strehl & Littman, 2008) for the KWIK ("Knows What It Knows") framework. In that paper, the authors use a modification of Auer's upper confidence linear regression algorithm for associative reinforcement learning (Auer, 2002). This modification allows them to compete against any adaptive adversarial strategy generating instance vectors $x_t$, but it yields the significantly worse bound $O(d^3/\varepsilon^4)$ on $N_T$ (in the KWIK setting $N_T$ is the number of times the prediction algorithm answers "I don't know"). Besides, their strategy seems to work in the finite dimensional ($d < \infty$) case only. In contrast, Parametric BBQ works against an oblivious adversary only, but it has the better bound $N_T = \tilde O(d/\varepsilon^2)$, with the $\tilde O$ notation hiding a mild (logarithmic) dependence on $T$. Moreover, Parametric BBQ can be readily run in an infinite dimensional ($d = \infty$) RKHS --recall the comment before the proof of Theorem 1. In fact, this is a quite important feature: the real-world experiments of Section 5 needed kernels in order to either attain a good empirical performance (on Adult) or use a reasonable amount of computational resources (on RCV1).
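Before moving on, we note that the query condition (4) is straightforward to implement once $r_t$, $s_t$ and $q_t$ are available (e.g., computed as in the sketch following their definitions in Section 2). The function name below is ours, and the snippet is only a sketch of the rule, not the authors' implementation.

```python
import numpy as np

def parametric_bbq_should_query(r_t, s_t, q_t, t, eps, delta):
    """Query condition (4) of Algorithm 2 (Parametric BBQ).

    Returns True when [eps - r_t - s_t]_+ < q_t * sqrt(2 ln(2 t (t+1) / delta)),
    i.e. when the confidence interval around the RLS margin is still too wide
    to certify |margin_hat - margin| <= eps.
    """
    slack = max(0.0, eps - r_t - s_t)                    # [eps - r_t - s_t]_+
    radius = q_t * np.sqrt(2.0 * np.log(2.0 * t * (t + 1) / delta))
    return slack < radius
```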
Remark 2 The bound on the number of queried labels in Theorem 2 is optimal up to logarithmic factors. In fact, it is possible to prove that there exists a sequence $x_1, x_2, \dots$ of instances and a number $\varepsilon_0 > 0$ such that: for all $\varepsilon \le \varepsilon_0$ and for any learning algorithm that issues $N = O(d/\varepsilon^2)$ queries there exist a target vector $u \in \mathbb{R}^d$ and a time step $t = O(d/\varepsilon^2)$ for which the estimate $\hat\Delta_t$ computed by the algorithm for $\Delta_t = u^\top x_t$ has the property $P\big(|\hat\Delta_t - \Delta_t| > \varepsilon\big) = \Omega(1)$. Hence, at least $\Omega(d/\varepsilon^2)$ queries are needed to learn any target hyperplane with arbitrarily small accuracy and arbitrarily high confidence.

Proof: [Theorem 2] Let $I \subseteq \{1, \dots, T\}$ be the set of time steps when a query is issued. Then, using Property 2 of Lemma 1, we can write
$$\sum_{t\notin I}\big\{|\hat\Delta_t - \Delta_t| > \varepsilon\big\} \;\le\; \sum_{t\notin I}\big\{|\hat\Delta_t + B_t - \Delta_t| > \varepsilon - |B_t|\big\} \;\le\; \sum_{t\notin I}\big\{|\hat\Delta_t + B_t - \Delta_t| > [\varepsilon - r_t - s_t]_+\big\}\,.$$
We first take expectations on both sides, and then apply Property 5 of Lemma 1 along with condition (4) rewritten as follows (no query is issued at time $t \notin I$, hence (4) is violated there):
$$[\varepsilon - r_t - s_t]_+ \;\ge\; q_t\,\sqrt{2\ln\frac{2\,t(t+1)}{\delta}}\,.$$
This gives
$$\sum_{t\notin I} P\big(|\hat\Delta_t - \Delta_t| > \varepsilon\big) \;\le\; \sum_{t\notin I} P\big(|\hat\Delta_t + B_t - \Delta_t| > [\varepsilon - r_t - s_t]_+\big) \;\le\; \sum_{t\notin I} 2\exp\Big(\!-\frac{\big([\varepsilon - r_t - s_t]_+\big)^2}{2\,q_t^2}\Big) \;\le\; \sum_{t\notin I}\frac{\delta}{t(t+1)} \;\le\; \sum_{t=1}^{\infty}\frac{\delta}{t(t+1)} \;=\; \delta\,.$$
By the union bound, this proves the first statement. In order to derive a bound on the number $N_T$ of queried labels, we proceed as follows. For every step $t \in I$ in which a query was issued we can write
$$\varepsilon - 2\sqrt{r_t} \;\le\; \varepsilon - r_t - s_t \;\le\; [\varepsilon - r_t - s_t]_+ \;<\; q_t\,\sqrt{2\ln\frac{2\,t(t+1)}{\delta}} \;\le\; \sqrt{2\,r_t\,\ln\frac{2\,t(t+1)}{\delta}}\,,$$
where we used Properties 3 and 4 of Lemma 1. Solving for $r_t$ and overapproximating we obtain
$$r_t \;\ge\; \frac{\varepsilon^2}{\Big(2+\sqrt{1+2\ln\frac{2\,t(t+1)}{\delta}}\Big)^2}\,. \qquad (5)$$
Similarly to the proof of Theorem 1, we then write
$$N_T\,\min_{t\in I} r_t \;\le\; \sum_{t\in I} r_t \;\le\; d\,\ln\Big(1+\frac{N_T}{d}\Big)\,.$$
Using (5) we get $N_T = O\big(\tfrac{d}{\varepsilon^2}\,\ln\tfrac{T}{\delta}\,\ln\ln\tfrac{T}{\delta}\big)$, which concludes the proof.

5. Experiments

In this section we report on preliminary experiments with the Parametric BBQ algorithm. The first test is a synthetic experiment to validate the model. We generated 10,000 random examples on the unit circle in $\mathbb{R}^2$. The labels of these examples were generated according to our noise model (see Section 2) using a randomly selected hyperplane $u$ with unit norm. We then set $\delta = 0.1$ and analyzed the behavior of the algorithm with various settings of $\varepsilon > 0$ and using a linear kernel. In Figure 1 the jagged blue line represents the maximum error over the example sequence, i.e., $\max_t |\hat\Delta_t - \Delta_t|$. (Although we stopped the plot at $\varepsilon = 1$, note that the maximum error is dominated by $|\hat\Delta_t|$, which can grow with the number $N_t$ of stored labels.) As predicted by Theorem 2, the maximum error remains below the straight line $y = \varepsilon$ (the maximum error predicted by the theory). In the same plot, the decreasing green line shows the number of queried labels, which closely follows the curve $\varepsilon^{-2}$ predicted by the theory.

Figure 1. Maximum error (jagged blue line) and number of queried labels (decreasing green line) on a synthetic dataset for Parametric BBQ with $\delta = 0.1$ and $0.15 \le \varepsilon \le 1$. The straight blue line is the theoretical upper bound on the maximum error provided by the theory.

This initial test reveals that the algorithm is dramatically underconfident, i.e., it is a lot more precise than it thinks. Moreover, the actual error is rather insensitive to the choice of $\varepsilon$. In order to leverage this, we ran the remaining tests using Parametric BBQ with a more extreme setting of parameters. Namely, we changed the query condition (the "if" condition in Algorithm 2) to
$$[1 - r_t - s_t]_+ \;<\; q_t\,\sqrt{2\ln\frac{2}{\delta}}\,, \qquad 0 < \delta < 1\,.$$
This amounts to setting the desired error to a default value of $\varepsilon = 1$ while making the number of queried labels independent of $T$.
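The synthetic test above is easy to replicate. The sketch below generates data from the linear noise model of Section 2 on the unit circle in $\mathbb{R}^2$; since the paper does not specify the sampling distribution, it assumes instances drawn uniformly on the circle and a random unit-norm target $u$, and the function name is ours.

```python
import numpy as np

def make_synthetic_data(n, seed=None):
    """Instances on the unit circle in R^2 with labels from the linear noise
    model E[Y_t] = u^T x_t, i.e. P(Y_t = +1) = (1 + u^T x_t) / 2."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(2)
    u /= np.linalg.norm(u)                                   # unit-norm target hyperplane
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # ||x_t|| = 1
    p_plus = (1.0 + X @ u) / 2.0                             # in [0, 1] since |u^T x| <= 1
    Y = np.where(rng.uniform(size=n) < p_plus, 1.0, -1.0)
    return X, Y, u
```

Feeding such a stream to the Parametric BBQ condition with $\delta = 0.1$ and a grid of $\varepsilon$ values would be the natural way to reproduce the kind of curves summarized in the Figure 1 caption.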
With the above setting, we compared Parametric BBQ to the second-order version of the label-efficient classifier (SOLE) of (Cesa-Bianchi et al., 2006b). This is a mistake-driven RLS algorithm that queries the label of the current instance with probability $1/(1 + b\,|\hat\Delta_t|)$, where $b > 0$ is a parameter and $\hat\Delta_t$ is the RLS margin. The other baseline algorithm is a vanilla sampler (called Random in the plots) that asks labels at random with constant probability $0 < p < 1$. Recall that SOLE does not come with a guaranteed bound on the number of queried labels. Random, on the other hand, has the simple expectation bound $E[N_T] = p\,T$.

For each algorithm we plot the F-measure (harmonic mean of precision and recall) against the fraction of queried labels. We control the fraction of queried labels by changing the parameters of the three algorithms ($\delta$ for Parametric BBQ, $b$ for SOLE, and $p$ for Random).

For the first real-world experiment we chose a9a (available at www.csie.ntu.edu.tw/cjlin/libsvmtools/), a subset of the census-income (Adult) database with 32,561 binary-labeled examples and 123 features. In order to bring all algorithms to a reasonable performance level, we used a Gaussian kernel with $\sigma^2 = 12.5$. The plots (Figure 2) show that less than 6% queries are enough for the three algorithms to saturate their performance. In the whole query range Parametric BBQ is consistently slightly better than SOLE, while Random has the worst performance.

Figure 2. F-measure against fraction of queried labels on the a9a dataset (32,561 examples in random order). The plotted curves are averages over 10 random shuffles.

For our second real-world experiment we used the first 40,000 newswire stories in chronological order from the Reuters Corpus Volume 1 dataset (RCV1). Each news story of this corpus is tagged with one or more labels from a set of 102 classes. A standard TF-IDF bag-of-words encoding was used to obtain 138,860 features. We considered the 50 most populated classes and trained 50 classifiers one-vs-all using a linear kernel. Earlier experiments, such as those reported in (Cesa-Bianchi et al., 2006b), show that RLS-based algorithms perform best on RCV1 when run in a mistake-driven fashion.

Figure 3. F-measure against fraction of queried labels averaged over the 50 most frequent categories of RCV1 (first 40,000 examples in chronological order).
For this reason, on this dataset we used a mistake-driven variant of Parametric BBQ, storing a queried label only when it is wrongly predicted. Figure 3 shows the (macro)averaged F-measure plotted against the average fraction of queried labels, where averages are computed over the 50 classifiers. Here the algorithms need over 35% of the labels to saturate. Moreover, Parametric BBQ performs worse than SOLE, although still better than Random. Since SOLE and Parametric BBQ are both based on the mistake-driven RLS classifier, any difference in performance is due to their different query conditions: SOLE is margin-based, while Parametric BBQ uses $q_t$ and related quantities. Note that, unlike the margin, $q_t$ does not depend on the queried labels, but only on the correlation between their corresponding instances. This fact, which helped us a lot in the analysis of BBQ, could make a crucial difference between domains like RCV1 (where instances are extremely sparse) and Adult (where instances are relatively dense). More experimental work is needed in order to settle this conjecture.

6. Conclusions and ongoing research

We have introduced a new family of online algorithms, the BBQ family, for selective sampling under (oblivious) adversarial environments. These algorithms naturally interpolate between fully supervised and fully unsupervised learning scenarios. A parametric variant (Parametric BBQ) of our basic algorithm is designed to work in a weakened KWIK framework (Li et al., 2008; Strehl & Littman, 2008) with improved bounds on the number of queried labels. We have made preliminary experiments. First, we validated the theory on an artificially generated dataset. Second, we compared a variant of Parametric BBQ to algorithms with similar guarantees, with encouraging results.

A few issues we are currently working on are the following. First, we are trying to see if a sharper analysis of BBQ exists which allows one to prove a regret bound of the form $\varepsilon\,T_\varepsilon + \frac{d}{\varepsilon^2}\ln T$ when $N_T = O(d\ln T)$. This bound would be a worst-case analog of the bound Cavallanti et al. (2009) have obtained in an i.i.d. setting. This improvement is likely to require refined bounds on the bias and variance of our estimators. Moreover, we would like to see if it is possible either to remove the $\ln T$ dependence in the bound on $N_T$ in Theorem 2 or to make Parametric BBQ work in adaptive adversarial environments (presumably at the cost of looser bounds on $N_T$). In fact, it is currently unclear to us how a direct covering argument could be applied in Theorem 2 which avoids the need for a conditionally independent structure of the involved random variables. On the experimental side, we are planning to perform a more thorough empirical investigation using additional datasets. In particular, since our algorithms can also be viewed as memory-bounded procedures, we would like to see how they perform when compared to budget-based algorithms, such as those in (Weston et al., 2005; Dekel et al., 2007; Cavallanti et al., 2007; Orabona et al., 2008). Finally, since our algorithms can be easily adapted to solve regression tasks, we are planning to test the BBQ family on standard regression benchmarks.
Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. The third author is supported by the DIRAC project under EC grant FP6-0027787. All authors acknowledge partial support by the PASCAL2 NoE under EC grant FP7-216886. This publication only reflects the authors' views.

References

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3, 397-422.

Balcan, M., Beygelzimer, A., & Langford, J. (2006). Agnostic active learning. Proceedings of the 23rd International Conference on Machine Learning (pp. 65-72).

Balcan, M., Broder, A., & Zhang, T. (2007). Margin-based active learning. Proceedings of the 20th Annual Conference on Learning Theory (pp. 35-50).

Cavallanti, G., Cesa-Bianchi, N., & Gentile, C. (2007). Tracking the best hyperplane with a simple budget Perceptron. Machine Learning, 69, 143-167.

Cavallanti, G., Cesa-Bianchi, N., & Gentile, C. (2009). Linear classification and selective sampling under low noise conditions. In Advances in Neural Information Processing Systems 21 (pp. 249-256).

Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (2006a). Incremental algorithms for hierarchical classification. Journal of Machine Learning Research, 7, 31-54.

Cesa-Bianchi, N., Gentile, C., & Zaniboni, L. (2006b). Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7, 1205-1230.

Cohn, R., Atlas, L., & Ladner, R. (1990). Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems 2 (pp. 566-573).

Dasgupta, S., Hsu, D., & Monteleoni, C. (2008). A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 21 (pp. 353-360).

Dasgupta, S., Kalai, A. T., & Monteleoni, C. (2005). Analysis of perceptron-based active learning. Proceedings of the 18th Annual Conference on Learning Theory (pp. 249-263).

Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2007). The Forgetron: A kernel-based Perceptron on a budget. SIAM Journal on Computing, 37, 1342-1372.

Freund, Y., Seung, S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28, 133-168.

Li, L., Littman, M., & Walsh, T. (2008). Knows what it knows: a framework for self-aware learning. Proceedings of the 25th International Conference on Machine Learning (pp. 568-575).

Orabona, F., Keshet, J., & Caputo, B. (2008). The Projectron: a bounded kernel-based Perceptron. Proceedings of the 25th International Conference on Machine Learning (pp. 720-727).

Strehl, A., & Littman, M. (2008). Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems 20 (pp. 631-638).

Weston, J., Bordes, A., & Bottou, L. (2005). Online (and offline) on an even tighter budget. Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 413-420).