Linear Classification and Selective Sampling Under Low Noise Conditions

Giovanni Cavallanti, DSI, Università degli Studi di Milano, Italy. cavallanti@dsi.unimi.it
Nicolò Cesa-Bianchi, DSI, Università degli Studi di Milano, Italy. cesa-bianchi@dsi.unimi.it
Claudio Gentile, DICOM, Università dell'Insubria, Italy. claudio.gentile@uninsubria.it

Abstract

We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate $\widetilde{O}\big(n^{-(1+\alpha)/(3+\alpha)}\big)$, with labels being sampled at the same rate (here $n$ denotes the sample size, and $\alpha \ge 0$ is the exponent in the low noise condition). We compare this convergence rate to the rate $\widetilde{O}\big(n^{-(1+\alpha)/(2+\alpha)}\big)$ achieved by the fully supervised algorithm using all labels. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors.

1 Introduction

In the standard online learning protocol for binary classification the learner receives a sequence of instances generated by an unknown source. Each time a new instance is received, the learner predicts its binary label and is then given the true label of the current instance before the next instance is observed. This protocol is natural in many applications, such as weather forecasting or stock market prediction, because Nature (or the market) spontaneously discloses the true label after each of the learner's guesses. In many other applications, however, obtaining labels may be an expensive process. Selective sampling is a variant of online learning that has been proposed to address this problem. In this modified protocol the true label of the current instance is never revealed unless the learner decides to issue an explicit query. The learner's performance is then measured with respect to both the number of mistakes (made on the entire sequence of instances) and the number of queries. A natural sampling strategy is one that tries to identify the labels that are likely to be useful to the algorithm, and queries only those. Such a strategy needs to combine a measure of the utility of an example with a measure of confidence. In the case of learning with linear functions, a statistic that has often been used to quantify both utility and confidence is the margin. In [10] this approach was employed to define a selective sampling rule that queries a new label whenever the margin of the current instance, with respect to the current linear hypothesis, is smaller (in magnitude) than an adaptively adjusted threshold. Margins were computed using a linear learning algorithm based on an incremental version of regularized linear least squares (RLS) for classification. Although this selective sampling algorithm is efficient, and has simple variants that work quite well in practice, its rate of convergence to the Bayes risk was never assessed in terms of natural distributional parameters, thus preventing a full understanding of the properties of this algorithm.
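Before the formal protocol of Section 2, a minimal sketch may help fix ideas. The following Python fragment (all names are ours, chosen purely for illustration) drives one run of the selective sampling protocol: the learner is charged for mistakes on every round, but sees a label only when it issues a query.

```python
def run_selective_sampling(sampler, stream, n_steps):
    """Drive the online selective-sampling protocol for n_steps rounds.

    `sampler` must expose predict(x) -> -1/+1, wants_label(x) -> bool,
    and update(x, y). `stream` yields (x, y) pairs; the label y is
    revealed to the sampler only upon a query, but mistakes are counted
    on every round, queried or not.
    """
    mistakes = queries = 0
    for _, (x, y) in zip(range(n_steps), stream):
        if sampler.predict(x) != y:
            mistakes += 1                 # regret is charged on all steps
        if sampler.wants_label(x):
            queries += 1                  # the sampler pays for this label
            sampler.update(x, y)
    return mistakes, queries
```

Any of the samplers discussed later can be plugged in through the three-method interface assumed here.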
We improve on those results in several ways, making three main contributions: (i) By coupling the Tsybakov low noise condition, used to parametrize the instance distribution, with the linear model of [10], which defines the conditional distribution of labels, we prove that the fully supervised RLS (all labels are queried) converges to the Bayes risk at rate $\widetilde{O}\big(n^{-(1+\alpha)/(2+\alpha)}\big)$, where $\alpha$ is the noise exponent in the low noise condition. (ii) Under the same low noise condition, we prove that the RLS-based selective sampling rule of [10] converges to the Bayes risk at rate $\widetilde{O}\big(n^{-(1+\alpha)/(3+\alpha)}\big)$, with labels being queried at the same rate. Moreover, we show that similar results can be established for a mistake-driven (i.e., space and time efficient) variant. (iii) We perform experiments on a real-world medium-size dataset showing that variants of our mistake-driven selective sampler compare favorably with other selective samplers proposed in the literature, such as the ones in [11, 16, 20].

Related work. Selective sampling, originally introduced by Cohn, Atlas and Ladner in [13, 14], differs from the active learning framework in that, in the latter, the learner has more freedom in selecting which instances to query. For example, in Angluin's adversarial learning with queries (see [1] for a survey), the goal is to identify an unknown boolean function $f$ from a given class, and the learner can query the labels (i.e., the values of $f$) at arbitrary boolean instances. Castro and Nowak [9] study a framework in which the learner also queries arbitrary domain points. However, in their case labels are stochastically related to instances (which are real vectors). They prove risk bounds in terms of nonparametric characterizations of both the regularity of the Bayes decision boundary and the behavior of the noise rate in its proximity. In fact, there exists a large statistical literature on adaptive sampling and sequential hypothesis testing (see for instance the detailed description in [9]) concerned with problems that share similarities with active learning. The idea of querying small margin instances when learning linear classifiers has been explored several times in different active learning contexts. Campbell, Cristianini and Smola [8], and also Tong and Koller [24], study a pool-based model of active learning, where the algorithm is allowed to interactively choose which labels to obtain from an i.i.d. pool of unlabeled instances. A landmark result in the selective sampling protocol is the query-by-committee algorithm of Freund, Seung, Shamir and Tishby [17]. In the realizable (noise-free) case, and under strong distributional assumptions, this algorithm is shown to require exponentially fewer labels than instances when learning linear classifiers (see also [18] for a more practical implementation). An exponential advantage in the realizable case is also obtained with a simple variant of the Perceptron algorithm by Dasgupta, Kalai and Monteleoni [16], under the sole assumption that instances are drawn from the uniform distribution over the unit ball in $\mathbb{R}^d$. In the general statistical learning case, with no assumptions on the joint distribution of labels and instances, selective sampling bears no such exponential advantage. For instance, Kääriäinen shows that, in order to approach the risk of the best linear classifier $f^*$ within error $\varepsilon$, at least order $(\eta/\varepsilon)^2$ labels are needed, where $\eta$ denotes the risk of $f^*$. A much more general nonparametric lower bound for active learning is obtained by Castro and Nowak [9].
General selective sampling strategies for the nonrealizable case have been proposed in [3, 4, 15]. However, none of these learning algorithms seems to be computationally efficient when learning linear classifiers in the general agnostic case.

2 Learning protocol and data model

We consider the following online selective sampling protocol. At each step $t = 1, 2, \dots$ the sampling algorithm (or selective sampler) receives an instance $x_t \in \mathbb{R}^d$ and outputs a binary prediction for the associated label $y_t \in \{-1, +1\}$. After each prediction, the algorithm has the option of "sampling" (issuing a query) in order to receive the label $y_t$. We call the pair $(x_t, y_t)$ an example. After seeing the label $y_t$, the algorithm can choose whether or not to update its internal state using the new information encoded by $(x_t, y_t)$. We assume instances $x_t$ are realizations of i.i.d. random variables $X_t$ drawn from an unknown distribution on the surface of the unit Euclidean sphere in $\mathbb{R}^d$, so that $\|x_t\| = 1$ for all $t \ge 1$. Following [10], we assume that labels $y_t$ are generated according to the following simple linear noise model: there exists a fixed and unknown vector $u \in \mathbb{R}^d$, with Euclidean norm $\|u\| = 1$, such that $\mathbb{E}\big[Y_t \mid X_t = x_t\big] = u^\top x_t$ for all $t \ge 1$. Hence $x_t$ has label $1$ with probability $(1 + u^\top x_t)/2$. Note that $\mathrm{SGN}(f^*)$, for $f^*(x) = u^\top x$, is the Bayes optimal classifier for this noise model. In the following, all probabilities $\mathbb{P}$ and expectations $\mathbb{E}$ are understood with respect to the joint distribution of the i.i.d. data process $(X_1, Y_1), (X_2, Y_2), \dots$. We use $\mathbb{E}_t$ to denote conditioning on the first $t$ examples $(X_1, Y_1), \dots, (X_t, Y_t)$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be an arbitrary measurable function. The instantaneous regret $R(f)$ is the excess risk of $\mathrm{SGN}(f)$ w.r.t. the Bayes risk, i.e., $R(f) = \mathbb{P}\big(Y f(X) < 0\big) - \mathbb{P}\big(Y f^*(X) < 0\big)$. Let $f_1, f_2, \dots$ be a sequence of real functions, where each $f_t$ is measurable w.r.t. the $\sigma$-algebra generated by $(X_1, Y_1), \dots, (X_{t-1}, Y_{t-1}), X_t$. When the conditioning is understood from the context, we write $f_t$ as a function of $x_t$ only. Let $R_{t-1}(f_t)$ be the instantaneous conditional regret of $f_t$, that is, the conditional excess risk given the first $t-1$ examples. Our goal is to bound the expected cumulative regret $\mathbb{E}\big[R_0(f_1) + R_1(f_2) + \cdots + R_{n-1}(f_n)\big]$ as a function of $n$ and other relevant quantities. Observe that, although the learner's predictions can only depend on the queried examples, the regret is computed over all time steps, including the ones when the selective sampler did not issue a query.

In order to model the distribution of the instances around the hyperplane $\{x : u^\top x = 0\}$, we use the Mammen-Tsybakov low noise condition [25]:

There exist $c > 0$ and $\alpha \ge 0$ such that $\mathbb{P}\big(|u^\top X| < \varepsilon\big) \le c\,\varepsilon^\alpha$ for all $\varepsilon > 0$.  (1)

When the noise exponent $\alpha$ is $0$ the low noise condition becomes vacuous. In order to study the case $\alpha \to \infty$, one can use the equivalent formulation of (1), see, e.g., [5]: $\mathbb{P}\big(f(X)\,u^\top X < 0\big) \le c\,\big(R(f)\big)^{\alpha/(1+\alpha)}$ for all measurable $f$. With this formulation, one can show that $\alpha = \infty$ implies the hard margin condition $|u^\top X| \ge 1/(2c)$ w.p. 1.

3 Algorithms and theoretical analysis

We consider linear classifiers predicting the value of $y_t$ through $\mathrm{SGN}(w_t^\top x_t)$, where $w_t \in \mathbb{R}^d$ is a dynamically updated weight vector which might be intended as the current estimate of $u$. Our $w_t$ is an RLS estimator defined over the set of previously queried examples. More precisely, let $N_{t-1}$ be the number of examples queried during the first $t - 1$ time steps, let $S_{t-1}$ be the $d \times N_{t-1}$ matrix of the queried instances up to time $t - 1$, and let $y_{t-1}$ be the vector of the corresponding labels. Then the RLS estimator is defined by

$$w_t = \big(I + S_{t-1} S_{t-1}^\top + x_t x_t^\top\big)^{-1} S_{t-1}\, y_{t-1},  \qquad (2)$$

where $I$ is the $d \times d$ identity matrix.
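To make the model and the estimator concrete, here is a minimal Python sketch (all function names are ours, and $u$ is assumed to have unit norm): it draws examples from the linear noise model above and computes the margin of the RLS estimator (2), whose inverted matrix includes the current instance in addition to the stored ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_example(u):
    """One draw from the linear noise model: X uniform on the unit
    sphere, P(Y = +1 | X = x) = (1 + u.x) / 2 (u has unit norm)."""
    x = rng.normal(size=u.shape[0])
    x /= np.linalg.norm(x)
    y = 1 if rng.random() < (1.0 + u @ x) / 2.0 else -1
    return x, y

def rls_margin(S, y_stored, x):
    """Margin w_t . x_t of the RLS estimator (2). S holds one stored
    instance per row; the inverted matrix also contains the current
    instance x, as in the variant of RLS discussed next."""
    d = x.shape[0]
    A = np.eye(d) + S.T @ S + np.outer(x, x)
    w = np.linalg.solve(A, S.T @ y_stored)
    return float(w @ x)
```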
Note that $w_t$ depends on the current instance $x_t$. The RLS estimator in this particular form was first considered by Vovk [26] and by Azoury and Warmuth [2]. Compared to standard RLS, the extra rank-one term $x_t x_t^\top$ in (2) acts by further reducing the variance of $w_t$. We use $\hat\Delta_t = w_t^\top x_t$ to denote the margin of the estimator, whenever $x_t$ is understood from the context. Thus $\hat\Delta_t$ is the current approximation to the Bayes margin $\Delta_t = u^\top x_t$. Note that $\hat\Delta_t$ is measurable w.r.t. the $\sigma$-algebra generated by $(X_1, Y_1), \dots, (X_{t-1}, Y_{t-1}), X_t$. Sometimes we write $\hat\Delta_{N_{t-1}}$ instead of $\hat\Delta_t$ to stress the dependence of $\hat\Delta_t$ on the number of queried examples.

The RLS estimator (2) can be stored in space $\Theta(d^2)$, which we need for the inverse of the matrix $I + S_{t-1} S_{t-1}^\top + x_t x_t^\top$. Moreover, using a standard formula for small-rank adjustments of inverse matrices, we can compute updates and predictions in time $\Theta(d^2)$. The algorithm in (2) can also be expressed in dual variable form. This is needed, for instance, when we want to use the feature expansion facility provided by kernel functions. In this case, at time $t$ the RLS estimator (2) can be represented in $\Theta(N_{t-1}^2)$ space; the update time is also quadratic in $N_{t-1}$.

Our first result establishes a regret bound for the fully supervised algorithm, i.e., the algorithm that predicts using RLS as in (2), queries the label of every instance, and stores all examples. This result is the baseline against which we measure the performance of our selective sampling algorithm. The regret bound is expressed in terms of the whole spectrum of the process covariance matrix $\mathbb{E}[X X^\top]$.

Theorem 1 Assume the low noise condition (1) holds with exponent $\alpha \ge 0$ and constant $c > 0$. Then the expected cumulative regret after $n$ steps of the fully supervised algorithm based on (2) is bounded by

$$c_\alpha \Big( \mathbb{E}\,\ln\big|I + S_n S_n^\top\big| \Big)^{\frac{1+\alpha}{2+\alpha}}\, n^{\frac{1}{2+\alpha}} .$$

This, in turn, is bounded from above by

$$c_\alpha \Big( \sum_{i=1}^{d} \ln(1 + n\lambda_i) \Big)^{\frac{1+\alpha}{2+\alpha}}\, n^{\frac{1}{2+\alpha}} \qquad \text{and by} \qquad c_\alpha \big( d\,\ln(1 + n) \big)^{\frac{1+\alpha}{2+\alpha}}\, n^{\frac{1}{2+\alpha}} .$$

Here $c_\alpha$ is a constant depending only on $\alpha$ and $c$, $|\cdot|$ denotes the determinant of the matrix at argument, and $\lambda_i$ is the $i$-th eigenvalue of $\mathbb{E}[X X^\top]$.

When $\alpha = 0$ (corresponding to a vacuous noise condition) the bound of Theorem 1 reduces, up to constants, to $\sqrt{n\,\mathbb{E}\ln|I + S_n S_n^\top|}$. When $\alpha \to \infty$ (corresponding to a hard margin condition) the bound gives the logarithmic behavior $\mathbb{E}\ln|I + S_n S_n^\top| = O(d \ln n)$. Notice that $\mathbb{E}\ln|I + S_n S_n^\top|$ is substantially smaller than $d \ln n$ whenever the spectrum of $\mathbb{E}[X X^\top]$ is rapidly decreasing. In fact, the second bound is clearly meaningful even when $d = \infty$, while the third one only applies to the finite-dimensional case. Fast rates of convergence have typically been proven for batch-style algorithms, such as empirical risk minimizers and SVM (see, e.g., [25, 23]), rather than for online algorithms.
A reference closer to our paper is Ying and Zhou [27], where the authors prove bounds for online linear classification using the low noise condition (1), though under different distributional assumptions.

Parameters: $\lambda > 0$.
Initialization: weight vector $w = 0$; storage counter $N = 0$.
At each time $t = 1, 2, \dots$ do the following:
1. Observe instance $x_t$;
2. Predict the label $y_t$ with $\mathrm{SGN}(\hat\Delta_t)$, where $\hat\Delta_t = w_t^\top x_t$ and $w_t$ is as in (2);
3. If $N$ has not yet reached the transient-regime length (of order $1/\lambda$, up to logarithmic factors), then query the label $y_t$ and schedule the current example for storage;
4. Else if $\hat\Delta_t^2\,N$ falls below a threshold of order $(\ln t)/\lambda$, then schedule the query of the label of the next instance $x_{t+1}$ and the storage of the corresponding example;
5. If the current example $(x_t, y_t)$ is scheduled to be stored, then increment $N$ and update $w$ using $(x_t, y_t)$.

Figure 1: The selective sampling algorithm (thresholds stated up to constant factors).

Our second result establishes a new regret bound, under low noise conditions, for the selective sampler introduced in [10]. This variant, described in Figure 1, queries all labels (and stores all examples) during an initial stage of length at least of order $1/\lambda$, where $\lambda$ denotes the smallest nonzero eigenvalue of the process covariance matrix $\mathbb{E}[X X^\top]$. When this transient regime is over, the sampler issues a query at time $t$ based on both the query counter $N_{t-1}$ and the margin $\hat\Delta_t$. Specifically, if evidence is collected that the number of stored examples is smaller than our current estimate of $1/\Delta_t^2$, that is, if $\hat\Delta_t^2\,N_{t-1}$ falls below a threshold of order $(\ln t)/\lambda$, then we query (and store) the label of the next instance $x_{t+1}$. Note that the margin threshold explicitly depends, through $\lambda$, on additional information about the data-generating process. This additional information is needed because, unlike the fully supervised classifier of Theorem 1, the selective sampler queries labels at random steps. This prevents us from bounding the sum of conditional variances of the involved RLS estimator through $\ln|I + S_n S_n^\top|$, as we can do when proving Theorem 1 (see below). Instead, we have to individually bound each conditional variance term via the smallest empirical eigenvalue of the correlation matrix. The transient regime in Figure 1 is exactly what is needed to ensure that this smallest empirical eigenvalue gets close enough to $\lambda$. Compared to the analysis contained in [10], we are able to better capture the two main aspects of the selective sampling protocol: first, we control the probability of making a mistake when we do not query labels; second, the algorithm is able to adaptively optimize the sampling rate by exploiting the additional information provided by the examples having small margin. The appropriate sampling rate clearly depends on the (unknown) amount of noise, which the algorithm implicitly learns on the fly. In this respect, our algorithm is more properly an adaptive sampler, rather than a selective sampler. Finally, we stress that it is fairly straightforward to add to the algorithm in Figure 1 a mistake-driven rule for storing examples. Such a rule provides that, when a small margin is detected, a query be issued (and the next example be stored) only if the current prediction is mistaken, i.e., only if $\mathrm{SGN}(\hat\Delta_t) \ne y_t$. This turns out to be highly advantageous from a computational standpoint, because of the sparsity of the computed solution. It is easy to adapt our analysis to obtain even for this algorithm the same regret bound as the one established in Theorem 2. However, in this case we can only give guarantees on the expected number of stored examples (which can indeed be much smaller than the actual number of queried labels).

Theorem 2 Assume the low noise condition (1) holds with exponent $\alpha \ge 0$, and assume the selective sampler of Figure 1 is run with parameter $\lambda$ equal to the smallest nonzero eigenvalue of $\mathbb{E}[X X^\top]$ (notice that the noise exponent $\alpha$ is unknown to the algorithm). Then, after $n$ steps, the expected cumulative regret and the expected number of queried labels are both bounded, up to logarithmic factors and constants depending on $d$ and $\lambda$, by $n^{\frac{2}{3+\alpha}}$.

The proof, sketched below, hinges on showing that $\hat\Delta_t$ is an almost unbiased estimate of the true margin $\Delta_t$, and relies on known concentration properties of i.i.d. processes. In particular, we show that our selective sampler is able to adaptively estimate the number of queries needed to keep the increase of the regret properly controlled when a query is not issued at time $t$.
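The following Python sketch puts the pieces together along the lines of Figure 1 (the variant that queries the next label). The transient length `n0` and the threshold scale `kappa` are illustrative placeholders rather than the tuned constants of the analysis, and `lam` stands for the smallest nonzero eigenvalue $\lambda$ of the process covariance matrix, which the algorithm receives as a parameter.

```python
import numpy as np

class SelectiveSampler:
    """Sketch of Figure 1. `lam` plays the role of the smallest nonzero
    eigenvalue of E[XX^T]; `n0` (transient length) and `kappa`
    (threshold scale) are illustrative placeholders."""

    def __init__(self, d, lam, kappa=1.0, n0=10):
        self.S = np.empty((0, d))    # stored instances, one per row
        self.ys = np.empty(0)        # labels of the stored instances
        self.lam, self.kappa, self.n0 = lam, kappa, n0
        self.t = 0
        self.store_next = False      # Step 4: query the *next* label

    def _margin(self, x):
        d = x.shape[0]
        A = np.eye(d) + self.S.T @ self.S + np.outer(x, x)  # as in (2)
        return float(np.linalg.solve(A, self.S.T @ self.ys) @ x)

    def predict(self, x):
        self.t += 1
        return 1 if self._margin(x) >= 0 else -1

    def wants_label(self, x):
        n = len(self.ys)
        if n < self.n0 or self.store_next:   # Step 3, or Step 4 fired
            return True
        # Step 4: the margin is "small" when fewer examples are stored
        # than our running estimate of 1/margin^2 would require
        if self._margin(x) ** 2 * n <= self.kappa * np.log(self.t + 1) / self.lam:
            self.store_next = True           # schedule the next label
        return False

    def update(self, x, y):                  # Step 5
        self.S = np.vstack([self.S, x[None, :]])
        self.ys = np.append(self.ys, y)
        self.store_next = False
```

Combined with the driver sketched in Section 1, this yields a complete, if simplified, simulation pipeline; the mistake-driven rule discussed above would additionally require the current prediction to be wrong before storing an example.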
The noiseless case ("the hard margin condition", $\alpha \to \infty$) is essentially the one considered in [10]. In that simple case, the sampling rate determined by the algorithm in Figure 1 has a natural interpretation: we sample often enough to make sure that the size of $\hat\Delta_t$'s confidence interval, at a confidence level increasing with $t$, is smaller than the Bayes margin $|\Delta_t|$ (observe that we must do so without knowing the exact value of this margin). This guarantees that at time $t$ the regret increases as $1/t$, so as to achieve a logarithmic cumulative regret. As expected, when we compare Theorem 2 to the fully supervised "yardstick" given in Theorem 1, we see that the cumulative regret of our selective sampler has a significantly worse dependence on the time horizon ($n^{\frac{2}{3+\alpha}}$ vs. $n^{\frac{1}{2+\alpha}}$). Besides, we have a looser dependence on the process spectrum (the eigenvalues of $\mathbb{E}[X X^\top]$), in that the bound of Theorem 2 depends (inversely) on the smallest process eigenvalue. Notice that, the way it is stated now, this bound only applies to the finite-dimensional ($d < \infty$) case. It turns out this is a fixable artifact of our analysis, rather than an intrinsic limitation of the selective sampling scheme in Figure 1; see Remark 3 below. On the other hand, Theorem 2 also shows that, under mild assumptions (and up to logarithmic factors), the instantaneous regret is $n^{-\frac{1+\alpha}{3+\alpha}}$. Since the total number $N$ of queried labels is of order $n^{\frac{2}{3+\alpha}}$, this implies an instantaneous regret vanishing polynomially fast, as $N^{-\frac{1+\alpha}{2}}$, in the number of queried labels. In contrast, the fully supervised algorithm (Theorem 1) has the slower rate $N^{-\frac{1+\alpha}{2+\alpha}}$. The gap between the two bounds becomes exponential when $\alpha \to \infty$, recovering the exponential improvement observed in previous work (e.g., [16, 17]).

Proof of Theorem 1. The proof proceeds by relating the classification regret to the square loss regret via a comparison theorem; the square loss regret is then controlled by applying a known pointwise bound. For all measurable $f$, introduce the square loss regret $R_{\mathrm{sq}}(f) = \mathbb{E}(Y - f(X))^2 - \mathbb{E}(Y - f^*(X))^2$, along with its conditional version $R_{\mathrm{sq},t-1}(f_t)$. We apply the comparison theorem from [5] with the $\psi$-transform function associated with the square loss. Under the low noise condition (1) this yields

$$R(f) \,\le\, c_\alpha\, \big( R_{\mathrm{sq}}(f) \big)^{\frac{1+\alpha}{2+\alpha}}$$
for all measurable $f$. We thus have

$$\mathbb{E} \sum_{t=1}^{n} R_{t-1}(f_t) \,\le\, c_\alpha\, n^{\frac{1}{2+\alpha}} \Big( \mathbb{E} \sum_{t=1}^{n} R_{\mathrm{sq},t-1}(f_t) \Big)^{\frac{1+\alpha}{2+\alpha}},$$

the last term following from Jensen's inequality. Further, we observe that in our probabilistic model $f^*(x) = u^\top x$ is Bayes optimal for the square loss. In fact, for any unit norm $w$ we have $\mathbb{E}\big[(Y - w^\top X)^2 \mid X\big] = (w^\top X - u^\top X)^2 + \mathbb{E}\big[(Y - u^\top X)^2 \mid X\big]$, since $\mathbb{E}[Y \mid X] = u^\top X$. Hence $\mathbb{E}\sum_t R_{\mathrm{sq},t-1}(f_t)$ is the cumulative square loss regret of the RLS estimator, which, in turn, can be bounded pointwise (see, e.g., [12, Theorem 11.8]) by a quantity whose cumulative value is of order $\ln|I + S_n S_n^\top|$. Putting everything together gives the first bound. Next, we take the expectation of the bound just obtained and apply Jensen's inequality twice: first to the concave function $x \mapsto x^{\frac{1+\alpha}{2+\alpha}}$ of a real argument, and then to the concave function $\ln|\cdot|$ of a (positive definite) matrix argument. Observing that $\mathbb{E}\big[S_n S_n^\top\big] = n\,\mathbb{E}\big[X X^\top\big]$ yields the second bound. The third bound derives from the second one just by using $\lambda_i \le \lambda_1 \le 1$ for all $i$.

Proof sketch of Theorem 2. We aim at bounding from above the cumulative regret which, according to our probabilistic model, can be shown to be at most $\mathbb{E}\sum_{t=1}^n |\Delta_t|\,\mathbf{1}\{\hat\Delta_t \Delta_t < 0\}$, up to lower-order terms. In turn, this sum is split into the contributions of: (I) the initial transient time steps; (II) the time steps on which we trigger the query of the next label (because $\hat\Delta_t^2$ is smaller than the threshold at time $t$); (III) the steps that do not trigger any query at all. We can thus write: cumulative regret $\le$ (I) + (II) + (III). Note that (I) + (II) is also a bound on the expected number of labels queried by the algorithm, so that (III) bounds the regret over non-sampled examples. In what follows, we sketch the way we bound each of the three terms separately. A bound on (I) is easily obtained, since (I) is at most the length of the transient regime, just because $|\Delta_t| \le 1$ for all $t$. To bound (II) and (III) we need to exploit the fact that the subsequence of stored instances and labels is a sequence of i.i.d. random variables distributed as $(X, Y)$. This allows us to carry out a (somewhat involved) bias-variance analysis showing that, for any fixed number $N$ of stored examples, $\hat\Delta_N$ is an almost unbiased estimator of $\Delta_t$, whose bias and variance tend to vanish as $N$ grows, once $N$ is sufficiently large. In particular, if $|\Delta_t| \ge \varepsilon$, then $\mathrm{SGN}(\hat\Delta_t) = \mathrm{SGN}(\Delta_t)$ with high probability as long as $N_{t-1}$ is of the order of $(\ln t)/(\lambda\,\varepsilon^2)$. The variance of $\hat\Delta_N$ is controlled by known results (the one we used is [22, Theorem 4.2]) on the concentration of the eigenvalues of an empirical correlation matrix around the eigenvalues of the process covariance matrix $\mathbb{E}[X X^\top]$. For such a result to apply, we have to impose a lower bound on the number of stored examples, which is exactly what the transient regime provides. By suitably combining these concentration results, we can bound term (II) by a quantity of order $(\ln n)/(\lambda\,\varepsilon^2)$ and term (III) by a quantity of order $n\,\varepsilon^{1+\alpha}$. Putting everything together, and choosing $\varepsilon$ of the order of $n^{-1/(3+\alpha)}$, gives the desired regret bound. The bound on the number of queried labels is obtained in a similar way.

Remark 3 The linear dependence on the dimension $d$ in Theorem 2 derives from a direct application of the concentration results in [22]. In fact, it is possible to take into account in a fairly precise manner the way the process spectrum decreases (e.g., [6, 7]), thereby extending the above analysis to the infinite-dimensional case. In this paper, however, we decided to stick to the simpler analysis leading to Theorem 2, since the resulting bounds would otherwise be harder to read, and would somehow obscure the understanding of the regret and sampling rate behavior as functions of $n$.
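For intuition, the choice of $\varepsilon$ in the proof sketch above can be made explicit by a back-of-the-envelope computation (our own recap, with constants and logarithmic factors suppressed). Steps with $|\Delta_t| \le \varepsilon$ cost at most $\varepsilon$ each and, by (1), there are about $n\,\varepsilon^\alpha$ of them, for a total regret of order $n\,\varepsilon^{1+\alpha}$, while keeping the sign of a margin of size $\varepsilon$ reliable requires about $(\ln n)/(\lambda\,\varepsilon^2)$ stored examples. Balancing the two contributions,

$$n\,\varepsilon^{1+\alpha} \;\approx\; \frac{\ln n}{\lambda\,\varepsilon^{2}} \qquad\Longrightarrow\qquad \varepsilon \;\approx\; n^{-\frac{1}{3+\alpha}}, \qquad n\,\varepsilon^{1+\alpha} \;\approx\; n^{\frac{2}{3+\alpha}},$$

which matches the common bound on the regret and on the number of queries in Theorem 2.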
4 Experimental analysis

In evaluating the empirical performance of our selective sampling algorithm, we consider two additional variants obtained by slightly modifying Step 4 in Figure 1. The first variant (which we just call SS, Selective Sampler) queries the current label instead of the next one. The rationale here is that we want to leverage the more informative content of small margin instances. The second variant is a mistake-driven version (referred to as SSMD, Selective Sampling Mistake Driven) that queries the current label (and stores the corresponding example) only if the label gets mispredicted. For clarity, the algorithm in Figure 1 will then be called SSNL (Selective Sampling Next Label), since it queries the next label whenever a small margin is observed. For all three algorithms we dropped the initial transient regime (Step 3 in Figure 1). We ran our experiments on an initial chronological segment of the newswire stories from the Reuters Corpus Volume 1 dataset (RCV1, [21]). Every example in this dataset is encoded as a vector of real attributes computed through a standard TF-IDF bag-of-words processing of the original news stories, and is tagged with zero or more labels from a set of 102 classes. The online categorization of excerpts from a newswire feed is a realistic learning problem for selective sampling algorithms, since a newswire feed consists of a large amount of uncategorized data with a high labeling cost. The classification performance is measured using a macroaveraged F-measure $2PR/(P + R)$, where $P$ is the precision (fraction of correctly classified documents among all documents that were classified positive for the given topic) and $R$ is the recall (fraction of correctly classified documents among all documents that are labelled with the given topic). All algorithms presented here are evaluated using dual variable implementations and linear kernels.

The results are summarized in Figures 2 and 3. The former only refers to (an average over) the 50 most frequent categories, while the latter includes them all. In Figure 2 (left) we show how SSMD compares to SSNL and to its most immediate counterpart, SS. In Figure 2 (right) we compare SSMD to other algorithms that are known to have good empirical performance, including the second-order version of the label efficient classifier (SOLE), as described in [11], and the DKMPERC variant of the DKM algorithm (see, e.g., [16, 20]). DKMPERC differs from DKM in that it adopts a standard Perceptron update rule. The Perceptron algorithm (PERC) and its second-order counterpart (SOP) are reported here as a reference, since they are designed to query all labels. In particular, SOP is a mistake-driven variant of the algorithm analyzed in Theorem 1. It is reasonable to assume that in a selective sampling setup we are interested in the performance achieved when the fraction of queried labels stays below some threshold, say 10%. In this range of sampling rates, SSMD has the steepest increase in the achieved F-measure, and surpasses any other algorithm. Unsurprisingly, as the number of queried labels gets larger, SSMD, SOLE and SOP exhibit similar behaviors.
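For concreteness, the macroaveraged F-measure used throughout this section can be computed as in the following sketch (our own code, with the convention that a topic with undefined precision or recall scores zero):

```python
import numpy as np

def macro_f_measure(y_true, y_pred):
    """Macroaveraged F-measure 2PR/(P+R). Inputs are 0/1 arrays of
    shape (n_documents, n_topics)."""
    scores = []
    for k in range(y_true.shape[1]):
        t, p = y_true[:, k], y_pred[:, k]
        tp = int(np.sum((t == 1) & (p == 1)))
        prec = tp / max(int(np.sum(p == 1)), 1)  # among predicted positive
        rec = tp / max(int(np.sum(t == 1)), 1)   # among actual positive
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(scores))
```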
Figure 2: Average F-measure obtained by the different algorithms at the end of the sequence, as a function of the fraction of queried labels (horizontal axis, up to 0.1). Left plot: SSMD vs. SSNL and SS; right plot: SSMD vs. DKMperc, SOLE, SOP, and PERC. The average only refers to the 50 most frequent categories. Points are obtained by repeatedly running each algorithm with different values of its parameters (in Figure 1, the relevant parameter is $\lambda$). Trend lines are computed as approximate cubic splines connecting consecutive points.

Figure 3: Left: correlation between the fraction of stored examples and the difficulty of each binary task, as measured by the separation margin (both normalized). Right: F-measure achieved on the different binary classification tasks, compared to the fraction of positive examples in each topic and to the fraction of queried labels (including the stored ones). In both plots, topics (horizontal axis) are sorted by decreasing frequency of positive examples. The two plots are produced by SSMD with a specific value of the $\lambda$ parameter; varying $\lambda$ does not significantly alter the reported trend.

Moreover, the less-than-ideal plot of SSNL seems to confirm the intuition that querying small margin instances provides a significant advantage. Under our test conditions, DKMPERC proved ineffective, probably because most tasks in the RCV1 dataset are not linearly separable. A similar behavior was observed in [20]. It is fair to remark that DKMPERC is a perceptron-like linear-threshold classifier, while the other algorithms considered here are based on the computationally more intensive ridge-regression-like procedure. In our selective sampling framework it is important to investigate how harder problems influence the sampling rate of an algorithm and, for each binary problem, to assess the impact of the number of positive examples on F-measure performance. Coarsely speaking, we would expect the hard topics to be the infrequent ones. Here we focus on SSMD, since it is reasonably the best candidate, among our selective samplers, for real-world problems. In Figure 3 (left) we report the fraction of examples stored by SSMD on each of the 102 binary learning tasks (i.e., on each individual topic, including the infrequent ones), and the corresponding levels of F-measure and queried labels (right). Note that in both plots topics are sorted by frequency, with the most frequent categories appearing on the left. We represent the difficulty of a learning task by the norm of the weight vector obtained by running the C-SVM algorithm on that task¹ (see the sketch below). Figure 3 (left) clearly shows that SSMD raises the storage rate on difficult problems.

¹ The actual values were computed using SVM-LIGHT [19] with default parameters. Since the examples in the Reuters Corpus Volume 1 are cosine normalized, the choice of default parameters amounts to indirectly setting the $C$ parameter to approximately 1.
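The difficulty measure above can be reproduced, for instance, with scikit-learn's LinearSVC as a stand-in for SVM-LIGHT (an approximation: LinearSVC uses a squared hinge loss by default, whereas the footnote's setup is a standard C-SVM with $C \approx 1$):

```python
import numpy as np
from sklearn.svm import LinearSVC

def task_difficulty(X, y):
    """Norm of the weight vector of a linear SVM trained on one binary
    topic: the larger the norm, the smaller the separation margin and
    the harder the task. X rows are assumed cosine-normalized TF-IDF
    vectors, y is in {-1, +1}; C = 1 mirrors the default-parameter
    SVM-LIGHT setup described in the footnote."""
    svm = LinearSVC(C=1.0).fit(X, y)
    return float(np.linalg.norm(svm.coef_))
```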
In particular, even if two different tasks have largely different numbers of positive examples, the storage rate achieved by SSMD on those tasks may be similar when the norms of the weight vectors computed by C-SVM are nearly the same. On the other hand, the right plot shows (to our surprise) that the achieved F-measure is fairly independent of the number of positive examples, but this independence is obtained at the cost of querying more and more labels. In other words, SSMD seems to realize the difficulty of learning infrequent topics and, in order to achieve a good F-measure performance, compensates by querying many more labels.

References
[1] D. Angluin. Queries revisited. In 12th ALT, pages 12–31. Springer, 2001.
[2] K.S. Azoury and M.K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.
[3] M.F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In 23rd ICML, pages 65–72. ACM Press, 2006.
[4] M.F. Balcan, A. Broder, and T. Zhang. Margin-based active learning. In 20th COLT, pages 35–50. Springer, 2007.
[5] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. JASA, 101(473):138–156, 2006.
[6] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66:259–294, 2007.
[7] M.L. Braun. Accurate error bounds for the eigenvalues of the kernel matrix. JMLR, 7:2303–2328, 2006.
[8] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In 17th ICML, pages 111–118. Morgan Kaufmann, 2000.
[9] R. Castro and R.D. Nowak. Minimax bounds for active learning. IEEE Trans. IT, 2008. To appear.
[10] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In 16th COLT, pages 373–387. Springer, 2003.
[11] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selective sampling for linear classification. JMLR, 7:1205–1230, 2006.
[12] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[13] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[14] D. Cohn, L. Atlas, and R. Ladner. Training connectionist networks with queries and selective sampling. In NIPS 2. MIT Press, 1990.
[15] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS 20, pages 353–360. MIT Press, 2008.
[16] S. Dasgupta, A.T. Kalai, and C. Monteleoni. Analysis of Perceptron-based active learning. In 18th COLT, pages 249–263. Springer, 2005.
[17] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2/3):133–168, 1997.
[18] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committee made real. NIPS, 18, 2005.
[19] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[20] C. Monteleoni and M. Kääriäinen. Practical online active learning for classification. In 24th IEEE CVPR, pages 249–263. IEEE Computer Society Press, 2007.
[21] Reuters. http://about.reuters.com/researchandstandards/corpus/, 2000.
[22] J. Shawe-Taylor, C.K.I. Williams, N. Cristianini, and J. Kandola.
On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Trans. IT, 51(7):2510–2522, 2005.
[23] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35:575–607, 2007.
[24] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In 17th ICML, pages 999–1006. Morgan Kaufmann, 2000.
[25] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
[26] V. Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.
[27] Y. Ying and D.X. Zhou. Online regularized classification algorithms. IEEE Transactions on Information Theory, 52:4775–4788, 2006.