Learning in the Limit with Adversarial Disturbances

Constantine Caramanis (Department of Electrical and Computer Engineering, The University of Texas at Austin, cmcaram@ece.utexas.edu) and Shie Mannor (Department of Electrical and Computer Engineering, McGill University, shie.mannor@mcgill.ca). This work was partially supported by NSF Grants CNS0721532, EFRI-0735905, and the Canada Research Chairs Program.

Abstract

We study distribution-dependent, data-dependent, learning in the limit with adversarial disturbance. We consider an optimization-based approach to learning binary classifiers from data under worst-case assumptions on the disturbance. The learning process is modeled as a decision-maker who seeks to minimize generalization error, given access only to possibly maliciously corrupted data. Two models for the nature of the disturbance are considered: disturbance in the labels of a certain fraction of the data, and disturbance that also affects the position of the data points. We provide distribution-dependent bounds on the amount of error as a function of the noise level for the two models, and describe the optimal strategy of the decision-maker, as well as the worst-case disturbance.

1 Introduction

Most of the work on learning in the presence of malicious noise has been within the PAC framework, focusing on a priori, distribution-independent bounds on generalization error and sample complexity. This work has not fully addressed the question of what a decision-maker must do when faced with a particular realization of the data, and perhaps some knowledge of the underlying distribution and the corrupting disturbance. The main contribution of this paper is the development of a robust optimization-based, algorithmic, data-dependent, distribution-dependent approach to minimizing the error of learning subject to adversarial disturbance. In the adversarial PAC setup, a decision-maker has access to IID samples from some source, except that a fraction of these points are altered by an adversary. There are several models for the noise, which we discuss below. The decision-maker is given ε > 0 and δ > 0 and attempts to learn an ε-optimal classifier with probability of at least 1 - δ. The emphasis in [KL93], as well as in several follow-up works (e.g., [BEK02, ACB98, CBDF+99, Ser03]), is on the sample complexity of learning in such setups and on particularly bad data sources. The algorithmic issue of the decision-maker's optimal strategy when faced with a certain disturbance level, i.e., a certain amount of possible data corruption, and a realization of the data has not been adequately explored; see [Lai88] for an initial discussion. While there are quite a few possible disturbance models that differ on the precise setup (what the adversary knows, what the adversary can do, and in which order), we focus on the strongest disturbance model, where the adversary has access to the actual distribution and can modify it adversarially within a constraint on the disturbance level. This "learning in the information limit" model is used to abstract away other issues such as finite samples or a limited adversary (see [CBDF+99] for a discussion of some relevant models). In this paper we consider two different noise models, with the intention of addressing the algorithmic aspects and the effect of the disturbance level. We note that we use the term disturbance rather than noise because in our model data are corrupted in a possibly adversarial way and the probabilistic aspect is essentially not relevant.
We deviate from the traditional learning setup in three major assumptions. First, we focus on the question of how the decision-maker should minimize error, rather than following PAC-style results of computing a priori bounds on that error. Moreover, our analysis is distribution-specific, and we do not focus on particularly bad data sources. Second, the noise level is not assumed small, and the decision-maker has to incur error in all but trivial problems (this has been studied in the malicious noise setup; see [CBDF+99]). Third, we do not ask how many samples are needed to obtain low generalization error; instead, we assume that the distribution of the samples is provided to the decision-maker (equivalently, one may think of this as considering the large-sample or "information-theoretic" limit). However, this distribution is corrupted by potentially persistent noise; we may think of it as having first been tampered with by an adversary. After observing the modified distribution, the decision-maker has to commit to a single classifier from some predefined set H. The performance of the classifier chosen by the decision-maker is measured on the original, true distribution (this is similar to the agnostic setup of [KSS92]). The questions are: what should the decision-maker do, and how much error will he incur in the worst case? In order to answer these questions we adopt a robust optimization-theoretic perspective, where we regard our decision-maker as trying to make optimal decisions while facing an adversary. Our aim is to provide an analysis by identifying optimal strategies, and to quantify the error as a function of the adversary's strategy, i.e., the nature of the corrupting disturbance. We refer to the disturbance as selected by an adversary merely as a conceptual device, and not in strict analogy to game theory. In particular, the decision-maker does not assume that the corrupting noise is chosen with any specific aim; rather, the decision-maker selects a strategy to protect himself in the worst-case scenario.

The true probability distribution is defined over the input space and the labels. We focus on the case of proper learning, where this amounts to a distribution and the true classifier. Then the adversary modifies the distribution of the input points and the labels. The decision-maker observes the modified distribution and chooses a classifier in H to minimize the worst-case error. We note the relationship with [KSS92], who use a slightly different model. In their model, the decision-maker chooses a classifier in H knowing that the true classifier is in some "touchstone" class T ⊆ H. They say that an algorithm facilitates learning (with respect to a loss function) if it learns a function from H that is close to a function from T in the usual PAC sense (i.e., with high probability and small error after observing a number of samples polynomial in one over the error and one over the confidence). As opposed to [KSS92] and most subsequent works, we do not focus on small noise, and we ignore the sample complexity aspect altogether. Instead, we focus on the policy chosen by the decision-maker and on the informational limits. In that respect, our work is most related to [CBDF+99], who considered the case of substantial noise. Their proposed strategy for dealing with noise, however, is based on randomizing between two strategies or using a majority vote (phase 2 of the randomized Algorithm SIH in [CBDF+99]). We propose a more principled approach to handling adversarial noise, leading to improved results.
If the noise level and characteristics are unlimited, the decision-maker cannot hope to do better than randomly guessing. We therefore limit the noise, and allow the adversary to change only a given fraction of the distribution, which we refer to as "the power of the adversary". An alternative view, which is common in robust optimization [BTN99], is to consider the power of the adversary as a design parameter. According to this view, the decision-maker tries to be resilient to a specified amount of uncertainty in the parameters of the problem.

The paper is structured as follows. In Section 2 we describe the setup. We define two types of adversaries: one that can only flip a fraction of the points, and one that can also move the points to another location. In Section 3 we consider the optimal solution pairs for the two different setups. We characterize the strategy of both the decision-maker and the adversary as a function of the level of noise (the power of the adversary) and the specific distribution that generates the data. Taking such a distribution-dependent perspective allows us to characterize the decision-maker's optimal strategy as the solution to a linear program if the adversary can only flip labels, or a robust optimization problem in the case of the more powerful adversary that can also modify the measure. We further bound the error that may be incurred and show that in the worst case, both adversaries can cause an error twice their power. In Section 4 we show how performance degrades as this power increases. A technical proof, along with a somewhat surprising worked-out example, are deferred to the online appendix [CM08].

2 Setup and Definitions

In this section we give the basic definitions of the noisy learning setup. We also formulate the optimization problem which characterizes the optimal policy of the decision-maker and the worst-case noise. The decision-maker, after observing the noisy data and knowing the power of the adversary, outputs a decision in the classifier space. The disagreement with the true classifier is the generalization error. The decision-maker's goal is to minimize this in the worst case. We allow our decision-maker to output a so-called mixed strategy; that is, rather than commit to a single classifier, the decision-maker can commit to a randomized strategy involving possibly multiple classifiers. Throughout this paper we focus on proper learning. We let H denote a predefined set of classifiers from which the true classifier is drawn, and from which the decision-maker must choose. Moreover, we assume that H is finite for the sake of simplicity and to avoid some (involved but straightforward) technicalities. Indeed, there are three natural extensions to our work that we postpone, primarily due to space limitations. First, while we focus on the proper learning setup, the non-proper setup (as in [KSS92]) seems to naturally follow our framework. Second, the case of an infinite set of classifiers H could be resolved by eliminating classifiers that are "close" according to the observed measure. This is particularly useful for the flip-only setup, where the adversary cannot make two classifiers substantially different. Finally, while we do not consider sample complexity, such results should not be too difficult to derive by imitating the arguments in [CBDF+99].

2.1 The Learning Model

In this paper, we deviate from the PAC learning setup, and consider an a priori fixed underlying distribution µ that generates the location (not the labels) of the training data. Thus the error calculations we make are a function of the power of the adversary and also of the fixed probability measure µ.
We use the symbol µ throughout this paper exclusively in reference to the true probability distribution which generates the location (not the label) of the points, and which is therefore used to determine the generalization error. Given a particular classifier ĥ, a true classifier h_true, and the underlying probability measure µ, the generalization error is given by the error function

    E_µ(h_true; ĥ) = µ{x : h_true(x) ≠ ĥ(x)}.

We can extend this definition to a probability measure over H, or, in game-theory terminology, a mixed strategy over H, given by a weighting vector α = (α_1, α_2, ...) where Σ_i α_i = 1 and α_i ≥ 0. In that case, denoting the space of mixed strategies by Δ_H, and a particular mixed strategy by α ∈ Δ_H, we have

    E_µ(h_true; α) = Σ_i α_i E_µ(h_true; h_i).

We note that the mixing α is often referred to as "probabilistic concepts" or "probabilistic hypotheses" in machine learning. In the context of learning with adversarial noise see [CBDF+99].

2.2 The Noise Model and the Decision-Maker

We next define the possible actions of the adversary and of the decision-maker. As discussed above, in this paper we do not consider sample complexity, and effectively consider the situation where the training sample is infinite in size (the information-theoretic limit). We model this situation by assuming that rather than training samples, the decision-maker receives a distribution for each of the two labels. Since the adversary modifies this object in various ways (noise is added to the observations), we make some formal definitions which facilitate discussion of this in the sequel.

Let X denote the space in which the training data exist. In the typical, finite training data model, the decision-maker has access to a collection of labelled points, {(x_i, l_i)}, where x_i ∈ X and l_i ∈ {+, -}. In our case, the decision-maker instead receives a probability measure over this space, an element of M(X × {+, -}) (M denotes the space of probability measures). We can represent such a measure by a triple (α, µ+, µ-), where µ+ and µ- are probability measures on X representing the distribution of the positively and negatively labelled points, respectively, and α ∈ [0, 1] is the weight (or probability) of the positively labelled region, with (1 - α) that of the negatively labelled region. The interpretation is that a point-label pair is generated by first choosing a label '+' or '-' with probability α or 1 - α, respectively, and then generating a point according to the corresponding distribution, µ+ or µ-. Thus, the underlying distribution µ generating the location of the points (not the labels) is given by µ = αµ+ + (1-α)µ-. Thus, if h_true is the true classifier, then in the absence of any noise we would observe (α, µ+, µ-), where µ+ is the scaled restriction of µ to the region h_true(+) = {x : h_true(x) = +}, and similarly for µ-:

    α = µ(h_true(+));    µ+ = µ·{h_true(+)} / α;    µ- = µ·{h_true(-)} / (1-α),

where if α = 0 there is no µ+, and if α = 1 there is no µ-. Indeed, the triple (α, µ+, µ-) is completely defined by µ and the true classifier h_true. Since µ is fixed, we write (α, µ+, µ-)_{h_true} to denote the triple determined by µ and h_true. Using this terminology, the adversary's action is a map

    T : M(X × {+, -}) → M(X × {+, -}),    (α, µ+, µ-) ↦ (α̂, µ̂+, µ̂-).

We use the hat symbol '^' throughout to denote the observation of the decision-maker. Therefore, while the true probability measure generating the point location is given, as above, by µ = αµ+ + (1-α)µ-, the decision-maker observes an underlying probability measure of the form µ̂ = α̂µ̂+ + (1-α̂)µ̂-.

The restrictions on this map determine the nature and level of the noise. We consider two models for the noise, i.e., two adversaries. First, we have a 'flip-only' adversary, corresponding to the noise model where the adversary can flip some fixed fraction of the labels. We also consider a stronger 'move-and-flip' adversary who can not only flip a constant fraction of the points, but may also change their location. For the flip-only adversary the underlying measure µ is the same as the observed measure µ̂. Therefore the decision-maker minimizes the worst-case error, where the worst case is over all possible h ∈ H. This need not be true for the move-and-flip adversary. In this case, the decision-maker has only partial information about the measure µ against which generalization error is computed, and hence the decision-maker must protect himself against the worst-case error, considering all possible classifiers h ∈ H as well as all possible underlying measures µ̃ consistent with the observations (α̂, µ̂+, µ̂-).

We do not intend measurability questions to be an issue in this paper. Therefore we assume throughout that all measures (and their images under the adversary's action) are measurable with respect to some natural σ-field G.

In each of the two cases above, the level of noise is determined by how different the output probability measure T(α, µ+, µ-) = (α̂, µ̂+, µ̂-) can be from the true probability measure (α, µ+, µ-). A natural measure for this is the notion of total variation. The distance, in total variation, between measures ν1 and ν2 is defined as

    ||ν1 - ν2||_TV = (1/2) sup { Σ_{i=1}^k |ν1(A_i) - ν2(A_i)| : k ∈ N, A_1, ..., A_k ∈ G, A_i ∩ A_j = ∅ for i ≠ j }.

This definition also holds for unnormalized measures. We extend this definition to triples (α, µ+, µ-) by

    ||(α, µ+, µ-) - (α̂, µ̂+, µ̂-)||_TV = ||αµ+ - α̂µ̂+||_TV + ||(1-α)µ- - (1-α̂)µ̂-||_TV.

Therefore, we have:

Definition 1 An adversary using policy T (either flip-only or move-and-flip) has power ε if, given any triple (α, µ+, µ-), his policy T satisfies ||T(α, µ+, µ-) - (α, µ+, µ-)||_TV ≤ ε. We abbreviate this, and simply write ||T|| ≤ ε.

We can now define the two notions of adversary introduced above.

Definition 2 A flip-only adversary of power ε can choose any policy T such that ||T|| ≤ ε, and (α̂, µ̂+, µ̂-) = T(α, µ+, µ-) satisfies µ̂ = α̂µ̂+ + (1-α̂)µ̂- = αµ+ + (1-α)µ- = µ.

Definition 3 A move-and-flip adversary of power ε can choose any policy T such that ||T|| ≤ ε.
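To make Definition 1 concrete, here is a small sketch (our own illustration, not part of the paper; the finite partition and the specific numbers are assumptions made only for the example) that computes the total variation distance between two triples supported on a common finite partition and checks whether a corruption is within power ε.

```python
import numpy as np

def tv(nu1, nu2):
    """Total variation distance between two (possibly unnormalized) measures
    given as weight vectors over a common finite partition:
    (1/2) * sum_i |nu1_i - nu2_i|."""
    return 0.5 * np.abs(np.asarray(nu1, dtype=float) - np.asarray(nu2, dtype=float)).sum()

def triple_tv(alpha, mu_pos, mu_neg, alpha_h, mu_pos_h, mu_neg_h):
    """||(alpha, mu+, mu-) - (alpha_hat, mu+_hat, mu-_hat)||_TV as defined above:
    the positively and negatively labelled parts are compared separately."""
    return (tv(alpha * np.asarray(mu_pos), alpha_h * np.asarray(mu_pos_h))
            + tv((1 - alpha) * np.asarray(mu_neg), (1 - alpha_h) * np.asarray(mu_neg_h)))

# Hypothetical example on a 3-cell partition: the adversary flips the labels
# of the 10% of the mass sitting in cell 2 (a flip-only corruption).
alpha,   mu_pos,   mu_neg   = 0.5, [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]
alpha_h, mu_pos_h, mu_neg_h = 0.4, [1.0, 0.0, 0.0], [0.0, 1/6, 5/6]
d = triple_tv(alpha, mu_pos, mu_neg, alpha_h, mu_pos_h, mu_neg_h)
print(round(d, 4))            # 0.1: the corruption has power 0.1
print(d <= 0.1 + 1e-12)       # True: within the budget of an eps = 10% adversary
```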
The decision-maker must base his decision on the 'noisy observations' he receives, in other words, on the triple (α̂, µ̂+, µ̂-) = T(α, µ+, µ-) which he sees. His goal is to minimize the worst-case generalization error, where the worst case is taken over consistent h ∈ H, and also over consistent measures µ̃.

We allow our decision-maker to play a so-called mixed strategy: rather than output a single classifier h ∈ H, he outputs a randomized strategy α, interpreted to mean that classifier h_i is chosen with probability α_i. We denote the set of these mixed strategies by Δ_H, and a particular mixed strategy by α ∈ Δ_H. Then, the decision-maker's strategy is a map

    D_{ε,H} : M(X × {+, -}) → Δ_H,    (α̂, µ̂+, µ̂-) ↦ α.

The idea is that if the decision-maker can eliminate some elements of H, but cannot identify a unique optimal choice, then the resulting strategy D_{ε,H} will output some measure supported over the ambiguous elements of H. We explicitly assume that the decision-maker's policy is a function of ε, the power of the adversary. In a worst-case formulation, a decision-maker without knowledge of ε is necessarily powerless. We also assume that the decision-maker knows whether the adversary has flip-only or move-and-flip power. We do not assume that the decision-maker has any knowledge of the underlying distribution µ that generates the location of the points. For the flip-only adversary, the decision-maker receives exact knowledge 'for free' since, by ignoring the {+, -}-labels, he obtains the true underlying distribution µ. Therefore in this case there is only a single consistent underlying measure, namely the correct measure µ, and the decision-maker need only protect against the worst-case h ∈ H. In the case of the move-and-flip adversary, however, the decision-maker receives only partial knowledge of the probability measure that generates the location of the points.

Given a strategy D of the decision-maker and a rule T for the adversary, we define the error for a given measure µ and a true classifier h_true as:

    Error(µ, h_true, ε, D, T) = E_µ(h_true; D(T((α, µ+, µ-)_{h_true}))).    (2.1)

2.3 An Optimization-Based Characterization

In this section we characterize the optimal policy of the decision-maker, and also the worst-case policy of the adversary, i.e., the worst-case noise, given the policy of the decision-maker. The noise-selecting adversary has access to the true triple (α, µ+, µ-), and seeks to maximize the true error incurred. The decision-maker sees only the corrupted version (α̂, µ̂+, µ̂-), and minimizes the worst-case error, where the worst case is taken over all possible, or consistent, triples (α̃, µ̃+, µ̃-) that the particular adversary with power ε (flip-only, or move-and-flip) could, under any policy, map to the observed triple (α̂, µ̂+, µ̂-). (We remark again that, unlike the game-theoretic setup, the decision-maker does not assume a rational adversary. We consider that case elsewhere.)

For the flip-only adversary, any consistent triple (α̃, µ̃+, µ̃-) the decision-maker considers must satisfy α̃µ̃+ + (1-α̃)µ̃- = µ̂ = µ. Therefore the worst case over all consistent triples becomes a worst case over all consistent classifiers. When facing the move-and-flip adversary, it may no longer be true that α̂µ̂+ + (1-α̂)µ̂- = µ. Therefore the decision-maker must consider the worst case over all consistent classifiers, and also over all consistent underlying measures µ̃ such that µ̃ = α̃µ̃+ + (1-α̃)µ̃- for some possible (α̃, µ̃+, µ̃-) with total variation at most ε from (α̂, µ̂+, µ̂-). We refer to this set of consistent underlying measures as

    Θ = Θ(ε, (α̂, µ̂+, µ̂-)).

We define the following two setups for a fixed measure µ on X, h_true ∈ H, and a value ε for the power of the adversary.

(S1) The flip-only setup:

    D1* = argmin_D  max_{T : ||T|| ≤ ε, T flip-only}  max_{h ∈ H}  Error(µ, h, ε, D, T)    (2.2)
    T1* = argmax_{T : ||T|| ≤ ε, T flip-only}  Error(µ, h_true, ε, D1*, T).

The decision-maker knows ε and H, and can infer µ since the adversary is flip-only. Thus he chooses D1* to minimize the worst-case error, where the worst case is over classifiers h ∈ H. The adversary has prior knowledge of µ, h_true and H, and of course ε, and chooses his strategy to maximize the true error, i.e., the error with respect to h_true and µ.

(S2) The move-and-flip setup:

    D2* = argmin_D  max_{T : ||T|| ≤ ε}  max_{h ∈ H, µ̃ ∈ Θ}  Error(µ̃, h, ε, D, T)    (2.3)
    T2* = argmax_{T : ||T|| ≤ ε}  Error(µ, h_true, ε, D2*, T).

Here the adversary is no longer constrained to pick T so that µ̂ = µ. In this case the decision-maker must choose a policy D2* to minimize the worst-case generalization error, with respect to h ∈ H and also measures µ̃ ∈ Θ. The adversary again tries to maximize the true error with respect to h_true and µ.

We use Error_i (i = 1, 2) to denote the error in S1 and S2 when µ, h_true, and ε are clear from the context, i.e., Error_i = Error(µ, h_true, ε, Di*, Ti*). We interpret the argmin/argmax as selecting an arbitrary optimal solution if there is more than one. The fact that the max and min in both (2.2) and (2.3) are attained by some rule requires a proof. We show below that this is indeed the case for both setups, since the respective rules can be computed by solving appropriate optimization problems.

S1 and S2 are not equivalent. We first show by example that the flip-only setup and the move-and-flip setup are not equivalent. This is the case even for two classifiers. Indeed, consider the case X = [-5, 5] ⊂ R, with threshold classifiers H = {h1, h2}, where h1(+) = [0, 5] and h2(+) = [1, 5]. Then the disagreement region is [0, 1). Suppose h1 is the true classifier, and that the true underlying measure µ is uniform on [-5, 5], so that µ([0, 1)) = 10%. For ε < 5%, Error1 = Error2 = 0. For ε ≥ 5%, however, both the flip-only and the move-and-flip adversaries can cause error. Suppose ε = 10%. In S1, the decision-maker knows the true µ, and hence knows that µ([0, 1)) = 10%. Thus regardless of the action of the adversary, the decision-maker's optimal strategy is (α1, α2) = (1/2, 1/2), and the error is therefore Error1 = 10/2 = 5%. In S2, however, the optimal strategy of the adversary is unique: flip the labels of all the points in [0, 1). The decision-maker sees µ̂([0, 1)) = 10%, but because the adversary has move-power, the decision-maker does not know µ exactly. His goal is to minimize the error in the worst case, where now the worst case is over classifiers, and also over possible underlying measures. From his observations, the decision-maker can only conclude that if h_true = h1 then 0% ≤ µ([0, 1)) ≤ 10%, and if h_true = h2 then 0% ≤ µ([0, 1)) ≤ 20%. The worst-case error corresponding to a strategy (α1, α2) is therefore max{10 α2; 20 α1}. Minimizing this objective function subject to α1 + α2 = 1 and α1, α2 ≥ 0, we find (α1, α2) = (1/3, 2/3), and the true error (as opposed to the worst-case error) is Error2 = (1/3)·0 + (2/3)·10 = 20/3, which is greater than Error1.
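The following short script (our own illustration, not part of the paper) verifies the arithmetic of this example: it minimizes the worst-case objective max{10 α2, 20 α1} over the simplex with scipy and reports the resulting true error.

```python
from scipy.optimize import linprog

# Worst-case mixture for the S2 example above: if h1 is true the error of
# playing h2 is at most 10*alpha2, and if h2 is true the error of playing h1
# is at most 20*alpha1.  Variables: x = (alpha1, alpha2, u); minimize u.
c = [0, 0, 1]
A_ub = [[0, 10, -1],    # 10*alpha2 <= u
        [20, 0, -1]]    # 20*alpha1 <= u
b_ub = [0, 0]
A_eq = [[1, 1, 0]]      # alpha1 + alpha2 = 1
b_eq = [1]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 3)
alpha1, alpha2, u = res.x
print(round(alpha1, 3), round(alpha2, 3))    # 0.333 0.667
print("worst-case error:", round(u, 3))      # 6.667, i.e., 20/3
print("true error:", round(10 * alpha2, 3))  # 6.667, versus Error1 = 5 in S1
```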
3 Optimal Strategy and Worst-Case Noise

In this section we consider S1 and S2, and determine optimal strategies for the decision-maker, and the optimal strategy for the adversary, i.e., the worst-case noise.

3.1 The Decision-Maker in S1

First we consider the decision-maker's optimal strategy for S1, i.e., in the face of the flip-only adversary. The decision-maker outputs a mixed strategy α ∈ Δ_H. The support of the weight vector α is the subset F of 'feasible' classifiers in H, which incur at most error ε. This set is often referred to as the "version space".

Definition 4 Given the output (α̂, µ̂+, µ̂-) = T(α, µ+, µ-) of a flip-only adversary with power ε, the set of feasible, and hence ambiguous, classifiers F = F(α̂, µ̂+, µ̂-) ⊆ H is given by

    F = {h ∈ H : α̂µ̂+(h(-)) + (1-α̂)µ̂-(h(+)) ≤ ε}.    (3.4)

Here we define h(+) to be the positively labelled region and h(-) the negatively labelled region, so that µ̂+(h(-)) is the measure of the positive labels observed in the region h(-). The measure of the region where the true classifier disagrees with the observed measure can be at most ε. That is,

    α̂µ̂+(h_true(-)) + (1-α̂)µ̂-(h_true(+)) ≤ ε.

This follows by our assumption that the adversary has power ε, and because αµ+(h_true(-)) + (1-α)µ-(h_true(+)) = 0. Therefore, F is the set of classifiers in H that could possibly be equal to h_true, and thus Definition 4 above indeed gives the set of feasible, and therefore ambiguous, classifiers. In particular, under the assumption of proper learning, h_true ∈ F.

Next, the decision-maker must compute the value of α_h for every h ∈ F, the feasible subset of classifiers. For any mixed strategy (sometimes referred to as a "probabilistic hypothesis") α ∈ Δ_H that the decision-maker might choose, the error incurred is

    E_µ(h_true; α) = Σ_{h ≠ h_true} α_h µ(Δ(h, h_true)),    (3.5)

where for any two classifiers h', h'' we define Δ(h', h'') = {x : h'(x) ≠ h''(x)} to be the region where they differ. The decision-maker, however, does not know h_true, and hence his optimal strategy is the one that minimizes the worst-case error, max_{h_true ∈ H} E_µ(h_true; α). In the case of the flip-only adversary, the decision-maker sees the probability measure (α̂, µ̂+, µ̂-), and since he knows that µ̂ = µ, he can correctly compute the value µ(Δ(h', h'')) for any two classifiers h', h''. In other words, the decision-maker knows the true weight of any region where two classifiers disagree, and therefore we can state the following result, which is a restatement of the above.

Proposition 5 The optimal policy of the decision-maker in S1 is given by computing the minimizer α of:

    min_α  max_{h_true ∈ F}  Σ_{h ≠ h_true} α_h µ(Δ(h, h_true)).    (3.6)

Enumerating the set F as {h1, ..., hk}, the optimal α is computed by solving the following linear optimization problem:

    min:  u
    s.t.: Σ_{i ≠ j} α_i µ(Δ(h_i, h_j)) ≤ u,    j = 1, ..., k
          Σ_i α_i = 1
          α_i ≥ 0,    i = 1, ..., k.

PROOF. The proof follows directly from the definition of the error associated to any mixed strategy α, given in (3.5).

We note that in [CBDF+99] the question of how to choose the best probabilistic hypothesis was considered. The solution there was to randomize between two (maximally apart) classifiers or to choose a majority vote. We now explain why this is suboptimal. Consider three linear classifiers in general position in the plane, H = {h1, h2, h3}, and suppose that there are 7 regions in the plane according to the agreement of the classifiers (assume that h1(+) ∩ h2(+) ∩ h3(+) ≠ ∅). Suppose that the decision-maker observes that µ̂+ has support only on h1(+) ∩ h2(+) ∩ h3(+) (assume that α̂ = 1 - 3ε and that ε < 1/4), and that µ̂- has equal support of ε on h1(-) ∩ h2(-) ∩ h3(+), h1(-) ∩ h2(+) ∩ h3(-), and h1(+) ∩ h2(-) ∩ h3(-). The example is constructed so that choosing any one classifier can, in the worst case, lead to an error of 2ε. It is easy to see that a majority vote would lead to a worst-case error of 2ε. Mixing between any two classifiers would lead to a worst-case error of 2ε as well. Mixing between the three classifiers, which is suggested by Proposition 5, leads to a worst-case error of 4ε/3, since we will get the classifier right with probability 1/3 and incur the 2ε loss with probability 2/3.
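To illustrate Proposition 5 concretely, the following sketch (ours, not from the paper; the value ε = 0.1 is arbitrary) solves the linear program above for the three-classifier example, in which every pairwise disagreement region has µ-mass 2ε. It recovers the uniform mixture and a worst-case error of 4ε/3, compared with 2ε for any single classifier.

```python
import numpy as np
from scipy.optimize import linprog

eps = 0.10                      # adversary power (hypothetical value)
k = 3                           # |F| = 3 feasible classifiers
# mu(Delta(h_i, h_j)) for the worked example: 2*eps for every pair i != j.
D = 2 * eps * (np.ones((k, k)) - np.eye(k))

# Variables: x = (alpha_1, ..., alpha_k, u); minimize u.
c = np.zeros(k + 1); c[-1] = 1.0

# For each candidate h_true = h_j:  sum_{i != j} alpha_i * D[i, j] <= u.
A_ub = np.hstack([D.T, -np.ones((k, 1))])
b_ub = np.zeros(k)

# The mixing weights form a probability vector.
A_eq = np.zeros((1, k + 1)); A_eq[0, :k] = 1.0
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (k + 1))
alpha, worst = res.x[:k], res.x[-1]
print("optimal mixture:", np.round(alpha, 3))   # ~ [1/3, 1/3, 1/3]
print("worst-case error:", round(worst, 4))     # ~ 4*eps/3 = 0.1333
print("any single classifier:", D.max())        # 2*eps = 0.2
```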
3.2 The Decision-Maker in S2

Next we consider the setup S2, with the more powerful move-and-flip adversary. Again, the goal of the decision-maker is to pick a mixed strategy α ∈ Δ_H that minimizes the error given in (3.5). The set F of ambiguous classifiers is as defined in (3.4). In this case, however, in addition to not knowing h_true, the decision-maker also does not know the underlying measure µ, and hence the values µ(Δ(h', h'')), exactly. As introduced in Section 2.3, we use Θ = Θ(ε, (α̂, µ̂+, µ̂-)) to denote the set of measures consistent with (α̂, µ̂+, µ̂-). Thus the decision-maker seeks to minimize the worst-case error, now over H and Θ.

Any points that have the wrong label with respect to h could have been both moved and flipped. Therefore, to compute the worst-case possible values of µ(Δ(h', h'')), for each classifier h the decision-maker considers, he must consider the observed measure of the points that have the correct label, and the wrong label, with respect to h. Thus we define:

    µ̂_h^wrong(Δ(h', h'')) = α̂µ̂+(Δ(h', h'') ∩ h(-)) + (1-α̂)µ̂-(Δ(h', h'') ∩ h(+)),    (3.7)
    µ̂_h^correct(Δ(h', h'')) = µ̂(Δ(h', h'')) - µ̂_h^wrong(Δ(h', h'')).

In Proposition 6 below, the decision-maker uses these quantities to compute his optimal strategy, i.e., the one that protects against the worst-case consistent classifier h ∈ F and underlying measure µ̃ ∈ Θ. The worst-case classifier h and measure µ̃ may depend on the action the decision-maker chooses. Thus, the decision-maker must solve a min-max linear program. In doing so, he implicitly computes the worst-case measure as well, by computing a saddle point.

Recall that in S2, in addition to the labels, the underlying measure µ is also corrupted. Therefore the decision-maker must compute the strategy α with respect to the worst-case feasible classifier, and the worst-case consistent values for µ(Δ(h', h'')), i.e., the worst-case values of µ̃(Δ(h', h'')) for µ̃ ∈ Θ. The worst case over µ̃ ∈ Θ depends on the worst case over h ∈ H. That is, if h1 is the true classifier, then the worst-case values for µ̃(Δ(h', h'')) may be different from the worst-case values if h2 is the true classifier. The worst-case values are computed using µ̂_h^correct(Δ(h', h'')) and µ̂_h^wrong(Δ(h', h'')). The idea is as follows: if some h is the true classifier, then any measure in the region Δ(h', h) that is incorrectly labelled with respect to h may have also been moved from some other region. Therefore, in the case that h = h_true, the weight of any particular region Δ(h', h) could be as large as the weight of the correctly labelled points under µ̂, namely µ̂_h^correct(Δ(h', h)), plus the weight (again under µ̂) of the mislabelled points with respect to h in all other regions, plus the additional weight that could be moved to Δ(h', h) using any 'unused' power of the adversary. The weight of the mislabelled points is

    α̂µ̂+(h(-)) + (1-α̂)µ̂-(h(+)).

The unused power is

    ε - [α̂µ̂+(h(-)) + (1-α̂)µ̂-(h(+))].

Therefore the weight (under µ̂) of the mislabelled points with respect to any h, plus the unused power, must be exactly ε. If h = h_true, consider some region Δ(h', h). The reasoning above tells us that the worst-case measure of this region is µ̂_h^correct(Δ(h', h)) + ε. The following lemma makes this intuition precise, and shows that this is indeed the case.

Lemma 7 Assume that Δ(h, h') ≠ ∅ for any h ≠ h'. Then, if h = h_true, we have

    µ̃(Δ(h, h')) ≤ µ̂_h^correct(Δ(h, h')) + ε.

This bound is tight in the sense that there is a measure (α̃, µ̃+, µ̃-) with total variation at most ε from the observations that attains the upper bound.

PROOF. We exhibit a triple (α̃, µ̃+, µ̃-) that satisfies ||(α̃, µ̃+, µ̃-) - (α̂, µ̂+, µ̂-)||_TV ≤ ε and attains the bound. Assume, without loss of generality, that Δ(h, h') ⊆ h(+), and let ν be any probability measure over X supported on Δ(h, h'). Let γ = ε - [α̂µ̂+(h(-)) + (1-α̂)µ̂-(h(+))] denote the unused power. The triple is obtained from the observation by removing the mass that is mislabelled with respect to h, namely α̂µ̂+ restricted to h(-) and (1-α̂)µ̂- restricted to h(+), together with γ of correctly labelled mass lying outside Δ(h, h'), and placing the total, which is exactly ε, on Δ(h, h') as positively labelled mass distributed according to ν; α̃ is the resulting weight of the positively labelled region. For this triple there exists a move-and-flip policy T with ||T|| ≤ ε such that T(α̃, µ̃+, µ̃-) = (α̂, µ̂+, µ̂-), hence the scalar upper bound is attainable.

For the vector case, if Δ(h, h1) ∩ ··· ∩ Δ(h, hk) ≠ ∅, there exists a triple (α̃, µ̃+, µ̃-) that satisfies

    (µ̃(Δ(h, h1)), ..., µ̃(Δ(h, hk))) = (µ̂_h^correct(Δ(h, h1)), ..., µ̂_h^correct(Δ(h, hk))) + ε(1, ..., 1).

This follows by replacing Δ(h, h') by Δ(h, h1) ∩ ··· ∩ Δ(h, hk) in the proof above. In general, however, the tightness result does not hold simultaneously for many classifiers. That is to say, given classifiers {h1, ..., hk} different from some h, if Δ(h, h1) ∩ ··· ∩ Δ(h, hk) = ∅ (as is in general the case), then, while the lemma tells us that µ̃(Δ(h, hi)) ≤ µ̂_h^correct(Δ(h, hi)) + ε for each i, there will be no measure µ̃ ∈ Θ which realizes these upper bounds simultaneously. Moreover, the worst-case values then depend on the decision-maker's particular choice of α. The α-dependent worst-case consistent values for µ(Δ(h', h'')) are computed implicitly in the robust LP below. With this intuition, and the result of the lemma, we can now state the proposition, and explicitly give the LP that yields the optimal strategy of the decision-maker.

Proposition 6 (a) The decision-maker's optimal policy is to compute the set F, and then compute the optimal weight vector α that is the minimizer of

    min_α max_{h_true} E_µ̃(h_true; α) = min_α max_{h_true} Σ_{h ≠ h_true} α_h µ̃(Δ(h, h_true)),    (3.8)

where the max is over h_true ∈ H and µ̃ ∈ Θ. The min and the max are both attained. (b) Moreover, the optimal strategy of the decision-maker is obtained as the solution to a robust linear optimization problem, which we reformulate as a single linear optimization.

PROOF. (of Proposition 6) The proof proceeds in three main steps: (i) First we show that the error, and hence the optimal strategy of the decision-maker, depends only on a finite-dimensional equivalence class of measures in Θ. The first part of the proof is to characterize this finite-dimensional set. (ii) Next, we establish the connection to robust optimization, and write a robust optimization problem that we claim yields the decision-maker's optimal strategy. Proving this claim is the second part of the proof. (iii) Finally, we show that the robust optimization problem may in fact be rewritten as a single LP, using duality theory of linear programming.

For F the set of ambiguous classifiers, the decision-maker's policy is given by

    min_α  max_{h_true ∈ F, µ̃ ∈ Θ}  E_µ̃(h_true; α) = min_α  max_{h_true ∈ F, µ̃ ∈ Θ}  Σ_{h ≠ h_true} α_h µ̃(Δ(h, h_true)),

where α is supported on F. While the worst case is over classifiers h ∈ F and all measures µ̃ ∈ Θ, the worst-case error incurred for any particular strategy α in fact depends only on the values of µ̃(Δ(h', h'')) for h', h'' ∈ F. Therefore we can consider equivalence classes of measures in Θ that have the same values µ̃(Δ(h', h'')). This reduces the inner maximization to a finite-dimensional one. Enumerate the set F as {h1, ..., hk}. Then for any fixed hj ∈ F, if h_true = hj, the regions whose measure is important for the error computation are those that can be written as

    ∩_{i ∈ S} Δ(hi, hj) ∩ ∩_{i ∉ S} Δ(hi, hj)^c,

for some nonempty S ⊆ {1, ..., k}. We use Δ(hi, hj)^c to denote the complement of the set. We define a variable δ̂_{S,j} to represent the amount of mass that can be added (in the worst case) to the region ∩_{i ∈ S} Δ(hi, hj) ∩ ∩_{i ∉ S} Δ(hi, hj)^c in the case where hj is the true classifier. We can consider these as components of a vector in R^{2^k - 1}, indexed by nonempty subsets S ⊆ {1, ..., k}. Any such vector corresponds to an equivalence class of measures µ̃ ∈ Θ that are indistinguishable to the decision-maker, in the sense that they induce precisely the same error. Given such a vector, the weight of the region Δ(hi, hj) is then µ̂_{hj}^correct(Δ(hi, hj)) + Σ_{S : i ∈ S} δ̂_{S,j}, and thus for a given α, the error would be

    Σ_{i ≠ j} α_i [ µ̂_{hj}^correct(Δ(hi, hj)) + Σ_{S ⊆ {1,...,k} : i ∈ S} δ̂_{S,j} ].

For any fixed j, the collection of variables {δ̂_{S,j}}_{S ⊆ {1,...,k}} must satisfy four properties in order to correspond to some measure µ̃ ∈ Θ. First, the variables must be nonnegative. Second, the sum over S of δ̂_{S,j} must be at most ε; this follows since the total amount of mass moved or flipped must be at most ε, by definition of the power of the adversary. Third, if the set ∩_{i ∈ S} Δ(hi, hj) ∩ ∩_{i ∉ S} Δ(hi, hj)^c is empty, then the corresponding variable δ̂_{S,j} must be zero. Finally, the weight of each region Δ(hi, hj) can be at most 100%, and thus we must have

    µ̂_{hj}^correct(Δ(hi, hj)) + Σ_{S : i ∈ S} δ̂_{S,j} ≤ 100%.

Therefore, if hj = h_true, the possible values of δ̂_{·,j} ∈ R^{2^k - 1} are given by:

    Δ(j) = { δ̂_{·,j} :  δ̂_{S,j} ≥ 0 for all nonempty S ⊆ {1, ..., k};
                        δ̂_{S,j} = 0 for all S such that ∩_{i ∈ S} Δ(hi, hj) ∩ ∩_{i ∉ S} Δ(hi, hj)^c = ∅;
                        Σ_S δ̂_{S,j} ≤ ε;
                        Σ_{S : i ∈ S} δ̂_{S,j} ≤ 100 - µ̂_{hj}^correct(Δ(hi, hj)),  i ≠ j }.

For every j, the set Δ(j) is a polytope. The decision-maker must choose some α that minimizes the worst-case error, where the worst case is over possible h_true ∈ F = {h1, ..., hk}, and then, once that hj is fixed, the worst case over all possible δ̂_{·,j} ∈ Δ(j). Therefore the optimal strategy of the decision-maker is the solution to the following robust optimization problem:

    min:  u
    s.t.: max_{δ̂_{·,j} ∈ Δ(j)}  Σ_{i ≠ j} α_i [ µ̂_{hj}^correct(Δ(hi, hj)) + Σ_{S : i ∈ S} δ̂_{S,j} ] ≤ u,    j = 1, ..., k
          Σ_i α_i = 1
          α_i ≥ 0.

First we prove that this robust optimization indeed yields the strategy of the decision-maker that minimizes the worst-case error. The proof follows by a combination of the methods used to prove Proposition 5 and Lemma 7. Certainly, for any hj ∈ F and µ̃ ∈ Θ, there exists a vector δ̂_{·,j} ∈ Δ(j) such that

    µ̃(Δ(hi, hj)) = µ̂_{hj}^correct(Δ(hi, hj)) + Σ_{S : i ∈ S} δ̂_{S,j},    i ≠ j.

The technique of Lemma 7 establishes the converse, namely, for any feasible vector δ̂_{·,j} ∈ Δ(j) there exists a measure µ̃ ∈ Θ consistent with the observed measure, and such that for any i ∈ {1, ..., k},

    µ̃(Δ(hi, hj)) = µ̂_{hj}^correct(Δ(hi, hj)) + Σ_{S : i ∈ S} δ̂_{S,j}.

Thus we have shown that the sets Δ(j) are indeed the sets we should be considering. Next we show that the optimization we write down is the correct one. The proof of this follows that of Proposition 5. Let α* be the minimizer of the expression above, and let u* be the optimal value of the optimization. If the decision-maker chooses some mixed strategy α' that is not a minimizer of the above, then there must exist some r ∈ {1, ..., k}, corresponding to some h_true ∈ F, and also a vector δ̂_{·,r} ∈ Δ(r) feasible for the above optimization, for which

    Σ_{i ≠ r} α'_i [ µ̂_{hr}^correct(Δ(hi, hr)) + Σ_{S : i ∈ S} δ̂_{S,r} ]  >  u*  ≥  max_j max_{δ̂_{·,j} ∈ Δ(j)} Σ_{i ≠ j} α*_i [ µ̂_{hj}^correct(Δ(hi, hj)) + Σ_{S : i ∈ S} δ̂_{S,j} ].

But then there must exist a measure µ̃ ∈ Θ consistent with the observed measure, for which

    µ̃(Δ(hi, hr)) = µ̂_{hr}^correct(Δ(hi, hr)) + Σ_{S : i ∈ S} δ̂_{S,r},

and thus we have

    max_{h_true ∈ H, µ̃ ∈ Θ} E_µ̃(h_true; α')  ≥  E_µ̃(hr; α')  >  max_{h_true ∈ H, µ̃ ∈ Θ} E_µ̃(h_true; α*).

Therefore, if µ̃ is indeed the true probability measure generating the location of the points, and if hr is the true classifier, then the error incurred by using strategy α' is strictly greater than the error incurred using strategy α*. Since both µ̃ and hr are consistent with the observed probability measure and labels, respectively, the mixed strategy α' does not minimize the worst-case error. On the other hand, by similar reasoning, if α' is not an optimal strategy, i.e., if it does not minimize the worst-case error as given in (3.8), then it is a strictly suboptimal solution to the linear optimization. This completes the proof that the robust optimization above indeed yields the strategy of the decision-maker which minimizes the worst-case error, where the worst case is over h ∈ F and also µ̃ ∈ Θ. This concludes the proofs of parts (i) and (ii).

We are left to prove the second part of the proposition, part (iii) in the outline, namely, that we can rewrite the robust optimization problem as a single LP. First, we remark that for each j, the set Δ(j) is a polytope, defined by equalities and inequalities among the variables δ̂_{S,j}. Writing these in vector form, we have:

    -[I] δ̂_{·,j} ≤ 0,
    [Q^(j)] δ̂_{·,j} = 0,
    (1, 1, ..., 1)' δ̂_{·,j} ≤ ε,
    [R^(j)] δ̂_{·,j} ≤ (100 - µ̂_{hj}^correct(Δ(h1, hj)), ..., 100 - µ̂_{hj}^correct(Δ(hk, hj)))'.    (3.10)

Here, I is the identity matrix, Q^(j) is the subset of the rows of the identity matrix corresponding to the sets S for which ∩_{i ∈ S} Δ(hi, hj) ∩ ∩_{i ∉ S} Δ(hi, hj)^c = ∅, and the generic row of R^(j), indexed by i, contains a 1 in every index S containing that particular i. Writing the equality as [Q^(j)] δ̂_{·,j} ≤ 0 and -[Q^(j)] δ̂_{·,j} ≤ 0, we can express the constraints defining Δ(j) more compactly as

    Δ(j) = { δ̂_{·,j} : A^(j) δ̂_{·,j} ≤ b^(j) },

where the matrix A^(j) and the vector b^(j) are given by the vector inequalities in (3.10) above:

    A^(j) = [ -I ; Q^(j) ; -Q^(j) ; R^(j) ; (1, 1, ..., 1) ],
    b^(j) = ( 0 ; 0 ; 0 ; 100 - µ̂_{hj}^correct(Δ(h1, hj)), ..., 100 - µ̂_{hj}^correct(Δ(hk, hj)) ; ε ).

Note that the robustification here is constraintwise-rectangular, that is, the uncertainty set has the form Δ = Δ(1) × ··· × Δ(k). Therefore, we can consider each constraint individually. Indeed, each inequality can be rewritten as

    max_{δ̂_{·,j} ∈ Δ(j)}  Σ_{i ≠ j} α_i Σ_{S : i ∈ S} δ̂_{S,j}  ≤  u - Σ_{i ≠ j} α_i µ̂_{hj}^correct(Δ(hi, hj)).

The objective function on the left is bilinear in the α_i and the δ̂_{S,j}. We have

    Σ_{i ≠ j} α_i Σ_{S : i ∈ S} δ̂_{S,j} = Σ_S ( Σ_{i ∈ S, i ≠ j} α_i ) δ̂_{S,j},

and hence, defining the vector c^(j) by c^(j)_S = Σ_{i ∈ S, i ≠ j} α_i, we can write the objective function as (c^(j))' δ̂_{·,j}. Thus we can consider the linear optimization:

    max:  (c^(j))' δ̂_{·,j}
    s.t.: A^(j) δ̂_{·,j} ≤ b^(j).    (3.9)

The linear programming dual to this program is then

    min:  (b^(j))' p^(j)
    s.t.: (A^(j))' p^(j) = c^(j)
          p^(j) ≥ 0.

Note that while the vector c^(j) is a linear function of α, the matrix A^(j) and the vector b^(j) are constant. Recalling that c^(j)_S = Σ_{i ∈ S, i ≠ j} α_i, the robust linear optimization problem determining the optimal strategy of the decision-maker can now be rewritten:

    min:  u
    s.t.: Σ_{i ≠ j} α_i µ̂_{hj}^correct(Δ(hi, hj)) + (b^(j))' p^(j) ≤ u,    j = 1, ..., k
          (A^(j))' p^(j) = c^(j),  with c^(j)_S = Σ_{i ∈ S, i ≠ j} α_i,    for all S, j = 1, ..., k
          p^(j)_S ≥ 0,    for all S, j = 1, ..., k
          Σ_i α_i = 1
          α_i ≥ 0.

The variables of optimization are {u, α_i, p^(j)_S}. The matrices A^(j) and the vectors b^(j) are constants, determined by (3.10). Therefore this is indeed a linear optimization. Thus the proof of parts (i), (ii) and (iii) is complete, as is the proof of the proposition.

From a complexity perspective, the linear program is exponential in the size of F, since all subsets S are considered. This complexity is not an artifact of our development, as even linear classification over non-separable data becomes combinatorial. Still, in spite of this exponential nature, the linear program admits several approximation schemes, such as constraint sampling. Moreover, pruning can be used for the classifiers in F; this is pursued elsewhere. We have thus derived the optimal policy of the decision-maker for both S1 and S2. We denote these as D1* and D2*, respectively.
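As an illustration of the structure of the robust problem (this sketch is ours and is not from the paper: the numbers in CORRECT are hypothetical, every cell of the partition is assumed nonempty, and the outer minimization is done by a coarse grid rather than by the dual LP reformulation above), the following code solves the inner maximization over Δ(j) as a linear program for each candidate true classifier and then searches for the mixture α minimizing the worst case.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

eps = 0.10
k = 3
# CORRECT[i][j] plays the role of mu_hat_{h_j}^correct(Delta(h_i, h_j)), i != j.
CORRECT = np.array([[0.00, 0.05, 0.10],
                    [0.05, 0.00, 0.05],
                    [0.10, 0.05, 0.00]])

# Nonempty subsets S of {0, ..., k-1} index the cells of the partition
# generated by the regions Delta(h_i, h_j); here every cell is assumed nonempty.
subsets = [S for r in range(1, k + 1) for S in itertools.combinations(range(k), r)]

def inner_max(alpha, j):
    """Worst-case added error if h_j is the true classifier:
    max over delta in Delta(j) of sum_S (sum_{i in S, i != j} alpha_i) delta_S."""
    c = np.array([sum(alpha[i] for i in S if i != j) for S in subsets])
    A = [np.ones(len(subsets))]                  # total added mass <= eps
    b = [eps]
    for i in range(k):                           # per-region caps at 100% (= 1.0 here)
        if i == j:
            continue
        A.append(np.array([1.0 if i in S else 0.0 for S in subsets]))
        b.append(1.0 - CORRECT[i][j])
    res = linprog(-c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, None)] * len(subsets))
    return -res.fun

def worst_case_error(alpha):
    return max(sum(alpha[i] * CORRECT[i][j] for i in range(k) if i != j)
               + inner_max(alpha, j) for j in range(k))

# Outer minimization over the simplex by a coarse grid (sufficient for k = 3).
best = None
for a1 in np.linspace(0, 1, 41):
    for a2 in np.linspace(0, 1 - a1, 41):
        alpha = np.array([a1, a2, 1 - a1 - a2])
        val = worst_case_error(alpha)
        if best is None or val < best[0]:
            best = (val, alpha)
print("worst-case error:", round(best[0], 4), "at alpha ~", np.round(best[1], 2))
```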
3.3 Bounding the Decision-Maker's Error

As defined above, the decision-maker's policy is a mixed strategy, i.e., a randomized policy. In the setting of the worst-case analysis which we consider, the decision-maker stands to benefit from the randomization. For example, suppose H = {h1, h2} and µ(Δ(h1, h2)) = 2ε, where the adversary's power is ε. We consider the general optimal strategy for the adversary in the next section. In this case, however, it is clear that the optimal strategy for both the flip-only and the move-and-flip adversary is to flip half of the 'points', or measure, in Δ(h1, h2). Then the decision-maker cannot distinguish between h1 and h2, and the optimal policy is (1/2)h1 + (1/2)h2. The expected worst-case error is (1/2)µ(Δ(h1, h2)) = ε. Without randomization, the worst-case error would have been 2ε. Thus there is a concrete benefit to randomization. The next proposition quantifies this benefit (it is similar to Proposition 4.1 from [CBDF+99], but has a slightly tighter lower bound: the lower bound of Proposition 4.1 from [CBDF+99] translates to Error_i ≥ ε/(2-ε), which is smaller than the bound of Proposition 8), and obtains bounds on the error an adversary with power ε can obtain in any possible setup.

Proposition 8 In both S1 and S2, for an adversary with power ε ≤ 1/2, there is a setup where Error_i ≥ (1-ε)2ε for i = 1, 2. On the other hand, we always have Error_i ≤ 2ε for i = 1, 2, and if F is finite we have Error_i ≤ (1 - 1/|F|)2ε for i = 1, 2.

PROOF. We need to show that the lower bound can be approached arbitrarily closely in the case of the weaker adversary (flip-only), and that the upper bound can never be exceeded by the more powerful adversary (move-and-flip). Let X be the unit circle in R^2, with µ the uniform measure on the disk. If ε = p/q is rational, divide the disk into q equal, numbered wedges, and define q classifiers, so that classifier i assigns positive labels to wedges {i, i+1, ..., i+p-1} mod q, and negative labels to the remaining q - p wedges. As in Figure 1, suppose the true classifier is h1. The optimal action of the adversary with power ε is to flip all positive labels. Now all classifiers are indistinguishable, and thus the decision-maker's optimal strategy is the uniform measure over all {hi}. The probability of full overlap with h_true is 1/q, the probability of no overlap is (q - 2p + 1)/q, and the probability of overlap r, for 0 < r < p, is 2/q. Computing the expectation, we have Error1 = 2p(q - p)/q^2 = (1-ε)2ε, as claimed. For ε irrational, we can approximate it arbitrarily closely with a rational number. In this case we can approach the lower bound arbitrarily closely.

Next we show that even the more powerful move-and-flip adversary can never exceed the upper bound. Observe that if the power of the adversary is ε, then for any two classifiers hi and hj in F, we must have µ(Δ(hi, hj)) ≤ 2ε. Then, if the decision-maker uses the possibly sub-optimal strategy of choosing α = (1/n, 1/n, ..., 1/n) (where n = |F|), then, since by definition Δ(h, h) = ∅ for all h, it follows from expression (3.5) above that the expected error will never exceed (1 - 1/n)2ε.

Figure 1: Here we have ε = 2/5, so p = 2 and q = 5. The figure on the left shows the correct labels. The adversary flips all +-labels to -. In the figure on the right, all classifiers are indistinguishable to the decision-maker. The decision-maker, therefore, outputs a randomized strategy that is uniform over all n classifiers (here n = 5).
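As a quick numerical check of the wedge construction (our own snippet, not from the paper), the following computes the expected error of the uniform mixture over the q cyclic classifiers directly and compares it with (1-ε)2ε.

```python
from fractions import Fraction

def wedge_error(p, q):
    """Expected error of the uniform mixture over the q cyclic 'wedge'
    classifiers when h_1 is true and every positive label has been flipped.
    Classifier i labels wedges {i, ..., i+p-1} (mod q) positive; each wedge
    has measure 1/q, so the error of h_i against h_1 is |symmetric difference| / q."""
    true_pos = {w % q for w in range(p)}
    total = Fraction(0)
    for i in range(q):
        pos_i = {(i + w) % q for w in range(p)}
        total += Fraction(len(pos_i ^ true_pos), q)   # symmetric difference of wedge sets
    return total / q                                   # uniform mixture: weight 1/q each

p, q = 2, 5                      # eps = p/q = 2/5, the case of Figure 1
eps = Fraction(p, q)
print(wedge_error(p, q))         # 12/25
print(2 * eps * (1 - eps))       # 12/25, i.e., (1 - eps) * 2 * eps
```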
3.4 The Adversary

First we consider S1 and the flip-only adversary. From Proposition 5, the optimal strategy of the decision-maker is specified by the subset of ambiguous classifiers, F. We call this α(F). Therefore the true error is also a function of F. By an abuse of notation, we denote this by

    E(F) = Σ_{h ≠ h_true} α_h(F) µ(Δ(h, h_true)).

Then the optimal strategy of the adversary is to create an ambiguous set F with as large an error as possible. Given any legal strategy T of the adversary, we denote by F_T the resulting set of ambiguous classifiers. Therefore we have:

Proposition 9 In S1, the adversary's optimal strategy is to maximize E(F):

    T1* = argmax_{T : ||T|| ≤ ε, T flip-only} E(F_T).    (3.11)

The max here is attained since there are only finitely many different sets F. If there is more than one map T corresponding to the optimal F (as in general there will be), we arbitrarily choose one. Therefore T1* is well-defined and is the optimal strategy for the adversary in S1, and the proposition follows.

Next we consider S2, and the case of the move-and-flip adversary. From Proposition 6, the decision-maker's optimal action is given by an LP that is a function of the ambiguity set F and the values {µ̂_h^correct(Δ(h', h''))} for h', h'' ∈ F. As above, we denote this optimal solution by α* = α*({µ̂_h^correct(Δ(h', h''))}), and the associated true generalization error is then E_µ(h_true; α*). For a given triple (α, µ+, µ-) and power ε of the adversary, not all ambiguity sets F and values for {µ̂_h^correct(Δ(h', h''))} are attainable. We define the set of such attainable values.

Definition 10 Let A be the set of values {µ̂_h^correct(Δ(h', h''))}, for h', h'' ∈ F for some F, such that there exists a triple (α̂, µ̂+, µ̂-) that meets three conditions:
(a) F must be the ambiguity set corresponding to (α̂, µ̂+, µ̂-), as in (3.4).
(b) The triple must satisfy
    µ̂_h^wrong(Δ(h', h'')) = α̂µ̂+(Δ(h', h'') ∩ h(-)) + (1-α̂)µ̂-(Δ(h', h'') ∩ h(+)),
    µ̂_h^correct(Δ(h', h'')) = µ̂(Δ(h', h'')) - µ̂_h^wrong(Δ(h', h'')).
(c) We must have ||(α, µ+, µ-) - (α̂, µ̂+, µ̂-)||_TV ≤ ε.

Lemma 11 (a) The set A is a finite union of polyhedral sets, and it is compact. (b) The function E_µ(h_true; α*({µ̂_h^correct(Δ(h', h''))})) is piecewise continuous, with finitely many discontinuities.

We defer the proof of this lemma to [CM08]. The next proposition gives the optimal policy of the adversary for S2:

Proposition 12 The adversary's optimal strategy T2* maps (α, µ+, µ-) to a triple (α̂, µ̂+, µ̂-) that matches the values µ̂_h^correct(Δ(h', h'')) from the solution to the (nonlinear) program:

    max:  E_µ(h_true; α*({µ̂_h^correct(Δ(h', h''))}))
    s.t.: {µ̂_h^correct(Δ(h', h''))} ∈ A.    (3.12)

PROOF. By Lemma 11, A is compact, and E_µ(h_true; ·) is piecewise continuous. Therefore, the optimal value is attained for some ambiguity set F and corresponding element {µ̂_h^correct(Δ(h', h''))} of A. By the definition of A, there exists at least one such map T2*, with ||T2*|| ≤ ε, that attains this value.

Thus the optimal policies for the decision-maker and the adversary are each given by respective optimization problems. In summary, we have:

Theorem 13 The pair of strategies (Di*, Ti*) (i = 1, 2) for the decision-maker and the adversary gives optimal solutions to S1 and S2, respectively.

4 Error and the Power of the Adversary

While we treat the noise as generated by an adversary, we may also consider it to be a design parameter chosen according to how we care to trade off optimality for robustness. Indeed, upon seeing some realization (α̂, µ̂+, µ̂-), the decision-maker may have partial knowledge of the level of noise. Equally, the decision-maker may specifically be interested in choosing a solution appropriate for some particular level of noise. For any fixed level ε̃, from the results in Section 3, the decision-maker obtains the resulting optimal policy. When ε̃ = 0, the optimal strategy of the decision-maker is to deterministically choose the single classifier that minimizes the empirical error. If indeed ε = 0, then this is the optimal strategy. As ε̃ grows, the optimal strategy of the decision-maker becomes increasingly random, and in the limit as ε̃ → 100%, the optimal policy approaches the uniform distribution over all classifiers.

For a fixed measure µ, class H, and h_true ∈ H, we consider the error as a function of ε̃. Graphing this function allows the decision-maker, in the scenario described above, to consider the tradeoff of robustness and optimality, and thus to choose the desirable design parameter ε̃ with respect to which the optimal mixed strategy is obtained. In addition, this graph provides other information that is of interest. The graph of the error is not continuous. Rather, it is piecewise continuous (not necessarily linear), with certain break points. The location of these break points is important, and it is a function of the structure of H. A particular solution of the decision-maker might be optimal for any ε̃ in some interval [ε1, ε2), but not optimal for ε2.

We consider the example from the end of Section 2.3, where h1 is the true classifier. There, the move-and-flip adversary is strictly more powerful than the flip-only adversary when ε > 5, and hence the setups S1 and S2 are not equivalent. The graphs in Figure 2 show Error_i(µ, h_true, ε, Ti*, Di*) for fixed µ and h_true, and varying values of ε. In the left side of Figure 2 we have the superimposed graphs for this example, for S1 and S2, for 0 ≤ ε ≤ 11. In the right side of Figure 2 we show the full graph of the true error Error2, for 0 ≤ ε ≤ 100. The graph for S2 is obtained by using the results of Propositions 6 and 12. The optimal policy of the move-and-flip adversary differs for the three regions 0 ≤ ε < 5, 5 ≤ ε ≤ 10, and 10 ≤ ε ≤ 100. In the first region, the adversary is powerless regardless of his action. In the second region, the optimal strategy is to flip ε% of the labels in Δ(h1, h2). For 10 ≤ ε ≤ 100, the adversary's optimal strategy is to flip all the points in Δ(h1, h2), and also to move and label '-' an (ε - 10)% fraction of the mass into Δ(h1, h2), so that µ̂(Δ(h1, h2)) = ε.

The decision-maker's policy, as given by Proposition 6, protects the decision-maker against the worst possible (consistent) triple (α̃, µ̃+, µ̃-). Solving the robust LP from the proposition reveals both the true error and the worst-case error. Both of these quantities may be of interest. In [CM08] we show, for this example, both the true error and the worst-case error for all values of ε. The true error exhibits numerous interesting properties. For instance, as shown in the figure, the true error is not monotonic in the power of the adversary (the worst-case error over measures and classifiers is, of course, monotonic). This is a direct consequence of Proposition 6. In [CM08] we pay particular attention to this and other properties of the graph. Also, we give the details of the computations.

Figure 2: The left graph shows the error incurred in S1 on the same axes as the error incurred in S2, for 0 ≤ ε ≤ 11. As soon as ε > 5, we see that the move-and-flip adversary is more powerful. Note that Error2 grows sublinearly for ε ≥ 5. The right graph shows the error for the more powerful adversary for 0 ≤ ε ≤ 100. The true error is not monotonic, as it decreases (non-linearly) for ε ≥ 50%.
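The following sketch is our own reconstruction of the computation behind Figure 2 (the detailed computations are deferred to [CM08], so this is an illustration under stated assumptions, not the paper's exact calculation): for the two-classifier example, the robust program of Proposition 6 reduces to the scalar bound of Lemma 7 capped at 100%, and the adversary's strategy in each of the three regions is the one described above. It reproduces the non-monotonic shape of the true error.

```python
def true_error(eps):
    """True error (in %) of the min-max strategy for the example of Section 2.3
    (h1 true, mu uniform on [-5,5], mu(Delta) = 10%) against the move-and-flip
    adversary's strategy described above, for adversary power eps in %."""
    if eps < 5:
        return 0.0                      # h2 is not feasible; F = {h1}
    if eps <= 10:
        flipped = eps                   # eps% of Delta is flipped to '-'
        correct_h1, correct_h2 = 10 - flipped, flipped
    else:
        # all of Delta is flipped and (eps - 10)% extra '-' mass is moved in
        correct_h1, correct_h2 = 0.0, eps
    # Worst-case weight of Delta under each candidate truth (Lemma 7, capped at 100%)
    b1 = min(correct_h1 + eps, 100.0)   # if h1 were true
    b2 = min(correct_h2 + eps, 100.0)   # if h2 were true
    # Minimize max{alpha2 * b1, alpha1 * b2} over the simplex: equalize the two terms.
    alpha2 = b2 / (b1 + b2)
    return alpha2 * 10.0                # true error: h1 is true and mu(Delta) = 10%

for eps in (0, 5, 10, 30, 50, 80, 100):
    print(eps, round(true_error(eps), 2))
# The values rise to 20/3 ~ 6.67 and then decrease for eps >= 50, matching the
# non-monotonic shape of the right panel of Figure 2.
```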
5 Discussion

This work takes a learning-in-the-information-theoretic-limit view of learning with adversarial disturbance. Our main contribution is the introduction of an optimization-theoretic algorithmic framework for finding a classifier in the presence of such disturbance. We characterized the optimal policy of the decision-maker (as a function of the data) in terms of a tractable and easily solved optimization problem. This is a first step in developing the theory for a range of setups. For example, the Bayesian setup may be of interest. Here, the decision-maker has a prior over the possible classifiers, and instead of minimizing generalization error with respect to the worst-case consistent classifier and (in S2) underlying measure µ̃, he considers minimizing the expected (under the Bayesian posterior) error. Extending this algorithmic approach to the game-theoretic setup, where the decision-maker plays against a rational adversary, is also of interest, and allows the possibility of more complex information structures.

Considering the noise level as a design parameter and viewing the resulting error as a function of it yielded surprising results that show how counterintuitive the min-max formulation of learning with adversarial noise can be. We showed, for a simple example, that while the worst-case error is monotone in the power of the adversary, the actual error (which depends on the particular underlying true probability measure) may not be monotone in the power of the adversary! This is because even though the adversary is more powerful, the decision-maker is also better prepared.

There are three natural extensions to our work that we did not pursue here, mostly due to space limits. First, while we considered the proper learning setup, the non-proper setup (as in [KSS92]) seems to naturally follow our framework. Second, the case of an infinite set of classifiers H could be resolved by eliminating classifiers that are "close" according to the observed measure. This is particularly useful for the flip-only setup, where the adversary cannot make two classifiers substantially different. Finally, while we do not consider sample complexity, such results should not be too difficult to derive by imitating the arguments in [CBDF+99].

References

[ACB98] P. Auer and N. Cesa-Bianchi. On-line learning with malicious noise and the closure algorithm. Annals of Mathematics and Artificial Intelligence, 23(1):83-99, 1998.
[BEK02] N. H. Bshouty, N. Eiron, and E. Kushilevitz. PAC learning with nasty noise. Theoretical Computer Science, 288(2):255-275, 2002.
[BTN99] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1-13, August 1999.
[CBDF+99] N. Cesa-Bianchi, E. Dichterman, P. Fischer, E. Shamir, and H. Ulrich Simon. Sample-efficient strategies for learning in the presence of noise. Journal of the ACM, 46(5):684-719, 1999.
[CM08] C. Caramanis and S. Mannor. Beyond PAC: A robust optimization approach for learning in the presence of noise: Online appendix. Available from http://users.ece.utexas.edu/~cmcaram/pubs/RobustLearningOnlineApp.pdf, 2008.
[KL93] M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807-837, 1993.
[KSS92] M. J. Kearns, R. E. Schapire, and L. Sellie. Toward efficient agnostic learning. In Computational Learning Theory, pages 341-352, 1992.
[Lai88] P. D. Laird. Learning from Good and Bad Data. Kluwer Academic Publishers, Norwell, MA, USA, 1988.
[Ser03] R. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:633-648, 2003.