Learning to Classify with Missing and Corrupted Features

Ofer Dekel, Microsoft Research, 1 Microsoft Way, Redmond, WA 98052 USA. OFERD@MICROSOFT.COM
Ohad Shamir, The Hebrew University, Jerusalem 91904, Israel. OHADSH@CS.HUJI.AC.IL

Abstract

After a classifier is trained using a machine learning algorithm and put to use in a real-world system, it often faces noise that did not appear in the training data. In particular, some subset of the features may be missing or may become corrupted. We present two novel machine learning techniques that are robust to this type of classification-time noise. First, we solve an approximation to the learning problem using linear programming. We analyze the tightness of our approximation and prove statistical risk bounds for this approach. Second, we define the online-learning variant of our problem, address this variant using a modified Perceptron, and obtain a statistical learning algorithm using an online-to-batch technique. We conclude with a set of experiments that demonstrate the effectiveness of our algorithms.

1. Introduction

Supervised machine learning techniques often play a central role in solving complex real-world classification problems.
First, we collect a training set of labeled examples and present this set to a machine learning algorithm. Then, the learning algorithm constructs a classifier, which can be put to use as a component in a working system. The process of collecting the training set and constructing the classifier is called the training phase, whereas everything that occurs after the hypothesis has been determined is called the classification phase. In many cases, the training phase can be performed under sterile and controlled conditions, and care can be taken to collect a high-quality training set. In contrast, the classification phase often takes place in the noisy and uncertain conditions of the real world, and some of the features that were available during the training phase may be missing or corrupted. In this paper, we explore the possibility of anticipating and preparing for this type of classification-time noise.

The problem of corrupted and missing features occurs in a variety of different classification settings. For example, say that our goal is to learn an automatic medical diagnosis system. Each instance represents a patient, each feature contains the result of a medical test performed on that patient, and the purpose of the system is to detect a certain disease. When constructing the training set, we go to the trouble of carefully performing every possible test on each patient. However, when the learned classifier is eventually deployed as part of a diagnosis system and applied to new patients, it is highly unlikely that all of the test results will be available. Technical difficulties may prevent certain tests from being performed. Different patients may have different insurance policies, each covering a different set of tests. A patient's blood sample may become contaminated, replacing the features that correspond to blood tests with random noise, while having no effect on other features. We would still like our diagnosis system to make accurate predictions.

Alternatively, our goal may be to train a fingerprint recognition system that controls the lock on a door. After a few days of flawless operation, a user with greasy fingers comes along and leaves an oily smudge on the fingerprint scanner panel. From then on, all of the features measured from the area under the smudge are either distorted or cannot be extracted at all. Ideally, the fingerprint recognition system should continue operating.

We take a worst-case approach to our problem, and assume that the set of affected features is chosen by an adversary individually per instance. More specifically, we assume that each feature is assigned an a-priori importance value, and that the adversary may remove or corrupt any feature subset whose total value is upper-bounded by a predefined parameter. In many natural settings, missing and damaged features are not actually chosen adversarially, but we find it beneficial to make our algorithms as robust as possible.

We present two different learning algorithms for our problem, each with its own pros and cons. The first approach formulates the learning problem as a linear program (LP), in a way that closely resembles the quadratic programming formulation of the Support Vector Machine (Vapnik, 1998). However, the number of constraints in this LP grows exponentially with the number of features. Using tools from convex analysis, we derive a related polynomial-size LP, and give conditions under which it is an exact reformulation of the original exponential-size LP. When these conditions do not hold, the polynomial-size LP still approximates the exponential-size LP, and we prove an upper bound on the approximation difference. Despite the fact that the distribution of training examples differs from the distribution of examples observed during the classification phase, we prove a statistical generalization bound for this approach. Letting m denote the size of our training set and n the number of features, our polynomial LP formulation uses O(mn) variables and O(mn) sparse constraints. Depending on the dataset, this can still be rather large for off-the-shelf LP solvers. We see this as a shortcoming of our first approach, which brings us to our second algorithmic approach. We define an online learning problem which is closely related to the original statistical learning problem. We devise a modified version of the Perceptron algorithm (Rosenblatt, 1958) for this online problem, and convert this Perceptron into a statistical learning algorithm using an online-to-batch conversion technique (Cesa-Bianchi et al., 2004). This approach benefits from the computational efficiency of the online Perceptron, and from the generalization properties and theoretical guarantees provided by the online-to-batch technique. Experimentally, we observe that the efficiency of our second approach seems to come at the price of accuracy.

Choosing an adequate regularization scheme is one of the keys to solving this problem successfully. Many existing machine learning algorithms, such as the Support Vector Machine, use $L_2$ regularization to promote statistical generalization. When $L_2$ regularization is used, the learning algorithm may put a large weight on one feature and compensate by putting a small weight on another. This promotes classifiers that focus their weight on the features that contribute the most. For example, in the degenerate case where one of the features actually equals the label, an $L_2$-regularized learning algorithm is likely to put most of its weight on that one feature. Some algorithms use $L_1$ regularization to further promote sparse solutions. In the context of our work, sparsity actually makes a classifier more susceptible to adversarial feature-corrupting noise. Here we prefer dense classifiers, which hedge their bets as much as possible. Both of the algorithms presented in this paper achieve this density by using an $L_\infty$ regularization scheme. It is interesting to note that the choice of the $L_\infty$ norm emerges as a natural one in the theoretical analysis of our first, LP-based learning approach.
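To illustrate the point, consider the following toy numpy experiment (ours, not the paper's; all constants are arbitrary choices). Sub-gradient descent on the hinge loss with an $L_2$ penalty tends to concentrate weight on the near-perfect feature, whereas projecting onto the box $[-r, r]^n$ after every step, an $L_\infty$ constraint, caps every coordinate and forces the weight to spread. The exact error numbers after the perfect feature is deleted will vary from run to run; the weight-concentration contrast is the reliable part.

```python
# A toy illustration (not the paper's code): feature 0 equals the (noisy)
# label, the remaining features are weakly informative. An L2 penalty lets
# the weight pile onto feature 0; an L-infinity box projection forces it
# to spread, which helps once feature 0 is deleted.
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 21
t = rng.choice([-1.0, 1.0], size=m)                  # true labels
y = t * np.where(rng.random(m) < 0.2, -1.0, 1.0)     # 20% label noise
X = t[:, None] + rng.normal(0.0, 2.0, size=(m, n))   # weak features
X[:, 0] = y                                          # feature 0 = observed label

def train(l2=0.0, box=None, epochs=500, eta=0.05):
    w = np.zeros(n)
    for _ in range(epochs):
        viol = y * (X @ w) < 1                       # hinge-loss violators
        w -= eta * (-(y[viol, None] * X[viol]).sum(0) / m + l2 * w)
        if box is not None:
            w = np.clip(w, -box, box)                # L-infinity projection
    return w

for name, w in [("L2 penalty", train(l2=0.1)), ("L-inf box", train(box=0.1))]:
    X_adv = X.copy(); X_adv[:, 0] = 0.0              # delete the perfect feature
    err = np.mean(y * (X_adv @ w) <= 0)
    print(f"{name}: |w_0| = {abs(w[0]):.2f}, "
          f"max |w_j| for j > 0 = {np.abs(w[1:]).max():.2f}, "
          f"error after deletion = {err:.3f}")
```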
1.1. Related Work

Previous papers on "noise-robust learning" mainly deal with the problem of learning from a noisy training set, a research topic which is essentially orthogonal to ours. The learning algorithms presented in (Dietterich & Bakiri, 1995) and (Gamble et al., 2007) try to be robust to general additive noise that appears at classification time, but not necessarily to feature deletion or corruption. (?) presents adversarial learning as a one-shot two-player game between the classifier and an adversary, and designs a robust learning algorithm from a Bayesian-learning perspective. Our approach shares the motivation of (?) but is otherwise significantly different. In the related field of online learning, where the training and classification phases are interlaced and cannot be distinguished, (Littlestone, 1991) proves that the Winnow algorithm can tolerate various types of noise, both adversarial and random. Our work is most closely related to the work in (Globerson & Roweis, 2006) and its more recent enhancement in (Teo et al., 2008). Our motivation is the same as theirs, and the approaches share some similarities. Our experiments, presented in Sec. 4, suggest that our algorithms achieve considerably better performance, but we would also like to emphasize more fundamental differences between the two approaches: We allow features to have different a-priori importance levels, and we take this information into account in our algorithm and analysis. Our approach uses $L_\infty$ regularization to promote a dense solution, whereas (Globerson & Roweis, 2006) uses $L_2$ regularization. Our second approach, which uses online-to-batch conversion techniques, is entirely novel. Finally, we prove statistical generalization bounds for our algorithms despite the change in distribution at classification time.

2. A Linear Programming Formulation

In this section, and throughout the paper, we use lower-case bold-face letters to denote vectors, and their plain-face counterparts to denote each vector's components. We also use the notation $[n]$ as shorthand for $\{1, \dots, n\}$.

2.1. Feature Deleting Noise

We first examine the case where features are missing at classification time. Let $\mathcal{X} \subseteq \mathbb{R}^n$ be an instance space and let $\mathcal{D}$ be a probability distribution on the product space $\mathcal{X} \times \{\pm 1\}$. We receive a training set $S = \{(x_i, y_i)\}_{i=1}^m$, sampled i.i.d. from $\mathcal{D}$, which we use to learn our classifier. We assign each feature $j \in [n]$ a value $v_j \ge 0$. Informally, we think of $v_j$ as the a-priori informativeness of feature $j$, or as the importance of feature $j$ to the classification task. It can also represent the cost of obtaining the feature (such as the price of a medical test). Next, we define the value of a subset $J$ of features as the sum of the values of the features in that subset, and we denote $V(J) = \sum_{j \in J} v_j$. For instance, we frequently use $V([n])$ when referring to $\sum_{j=1}^n v_j$, and $V([n] \setminus J)$ when referring to $\sum_{j \notin J} v_j$.
Next, we fix a noise-tolerance parameter $N \in [0, V([n])]$ and define $P = V([n]) - N$. During the classification phase, instances are generated in the following way: First, a pair $(x, y)$ is sampled from $\mathcal{D}$. Then, an adversary selects a subset of features $J \subseteq [n]$ such that $V([n] \setminus J) \le N$, and replaces $x_j$ with $0$ for all $j \notin J$. The adversary selects $J$ for each instance individually, and with full knowledge of the inner workings of our classifier. The noise-tolerance parameter $N$ essentially acts as an upper bound on the amount of damage the adversary is allowed to inflict. We would like to use the training set $S$ (which does not have missing features) to learn a binary classifier that is robust to this specific type of classification-time noise.

We focus on learning linear margin-based classifiers. A linear classifier is defined by a weight vector $w \in \mathbb{R}^n$ and a bias term $b \in \mathbb{R}$. Given an instance $x$, which is sampled from $\mathcal{D}$, and a set of coordinates $J$ left intact by the adversary, the linear classifier outputs $b + \sum_{j \in J} w_j x_j$. The sign of $b + \sum_{j \in J} w_j x_j$ constitutes the actual binary prediction, while $|b + \sum_{j \in J} w_j x_j|$ is understood as the degree of confidence in that prediction. A classification mistake occurs if and only if $y(b + \sum_{j \in J} w_j x_j) \le 0$, so we define the risk of the linear classifier $(w, b)$ as

$$R(w, b) \;=\; \Pr_{(x,y) \sim \mathcal{D}}\Big[\, \exists J \text{ with } V([n] \setminus J) \le N \ \text{ s.t. } \ y\Big(b + \sum_{j \in J} w_j x_j\Big) \le 0 \,\Big]. \qquad (1)$$
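Before turning to the learning problem, note that for a fixed linear classifier the adversary's best response is a knapsack-type problem: delete a feature set of total value at most $N$ so as to minimize the margin. When all feature values equal one, the budget is simply a feature count and a greedy choice is exact. Below is a minimal sketch of that special case (our illustration, not code from the paper; `delete_features` is a hypothetical helper name).

```python
# A sketch of the feature-deleting adversary for a *fixed* linear classifier,
# in the special case v_j = 1 for all j (the budget N is then a feature
# count). For general values v_j the choice becomes a knapsack problem and
# this greedy rule is only a heuristic.
import numpy as np

def delete_features(w, b, x, y, N):
    """Zero out at most N features of x so as to minimize y * (b + w @ x)."""
    contrib = y * w * x                 # feature j's contribution to the margin
    order = np.argsort(-contrib)        # most helpful features first
    x_adv = x.copy()
    for j in order[:N]:
        if contrib[j] <= 0:             # remaining features only help us; stop
            break
        x_adv[j] = 0.0
    return x_adv
```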
Since $\mathcal{D}$ is unknown, we cannot explicitly minimize Eq. (1). Thus, we turn to the empirical estimate of Eq. (1), the empirical risk, defined as

$$\frac{1}{m} \sum_{i=1}^m \bigg[\!\!\bigg[\ \min_{J : V([n] \setminus J) \le N}\ y_i \Big(b + \sum_{j \in J} w_j x_{i,j}\Big) \le 0\ \bigg]\!\!\bigg]\ , \qquad (2)$$

where $[[\pi]]$ denotes the indicator function of the predicate $\pi$. Minimizing the empirical risk directly constitutes a difficult combinatorial optimization problem. Instead, we formulate a linear program that closely resembles the formulation of the Support Vector Machine (Vapnik, 1998). We choose a margin parameter $\gamma > 0$ and a regularization parameter $C > 0$, and solve the problem

$$\begin{aligned} \min_{w, b, \xi}\quad & \frac{1}{m} \sum_{i=1}^m \xi_i & (3)\\ \text{s.t.}\quad & \forall i \in [m]\ \ \forall J : V([n] \setminus J) \le N \qquad y_i \Big(b + \sum_{j \in J} w_j x_{i,j}\Big) \ \ge\ \gamma \frac{V(J)}{P} - \xi_i\ ,\\ & \forall i \in [m]\quad \xi_i \ge 0\ , \qquad \|w\|_\infty \le C\ . \end{aligned}$$

The objective function of Eq. (3) is called the empirical hinge-loss obtained on the sample $S$. Since $\xi_i$ is constrained to be non-negative, each training example contributes a non-negative amount to the total loss. Moreover, the objective function of Eq. (3) upper-bounds $\gamma$ times the empirical risk of $(w, b)$. More specifically, for any feasible point $(w, b, \xi)$ of Eq. (3), $\xi_i$ upper-bounds $\gamma$ times the indicator function of the event $\min_{J : V([n] \setminus J) \le N} y_i (b + \sum_{j \in J} w_j x_{i,j}) \le 0$. To see this, note that for a given example $(x_i, y_i)$, if there exists a feature subset $J$ such that $V([n] \setminus J) \le N$ and $y_i (b + \sum_{j \in J} w_j x_{i,j}) \le 0$, then the first constraint in Eq. (3) enforces $\xi_i \ge \gamma V(J)/P$. The assumption $V([n] \setminus J) \le N$ now implies that $V(J) \ge P$, and therefore $\xi_i \ge \gamma$. If such a set $J$ does not exist, then the second constraint in Eq. (3) enforces $\xi_i \ge 0$.

The optimization problem above actually does more than minimize an upper bound on the empirical risk. It also requires the margin attained by the feature subset $J$ to grow in proportion to $V(J)$. While a true adversary would always inflict the maximal possible damage, our optimization problem also prepares for the case where less damage is inflicted, requiring the confidence of our classifier to increase as less noise is introduced. We also restrict $w$ to a hyper-box of radius $C$, which controls the complexity of the learned classifier and promotes dense solutions. Moreover, this constraint is easy to compute and makes our algorithms more efficient.

Although Eq. (3) is a linear program, it is immediately noticeable that the size of its constraint set may grow exponentially with the number of features $n$. For example, if $v_j = 1$ for all $j \in [n]$ and if $N$ is a positive integer, then the linear program contains over $\binom{n}{N}$ constraints per example. We deal with this problem below.

2.2. A Polynomial Approximation

Taking inspiration from (Carr & Lancia, 2000), we find an efficient approximate formulation of Eq. (3), which turns out to be an exact reformulation of Eq. (3) when $v_j \in \{0, 1\}$ for all $j \in [n]$. Specifically, we replace Eq. (3) with

$$\begin{aligned} \min\quad & \frac{1}{m} \sum_{i=1}^m \xi_i & (4)\\ \text{s.t.}\quad & \forall i \in [m] \qquad P \lambda_i - \sum_{j=1}^n \mu_{i,j} + y_i b \ \ge\ -\xi_i\ ,\\ & \forall i \in [m]\ \ \forall j \in [n] \qquad y_i w_j x_{i,j} - \gamma \frac{v_j}{P} \ \ge\ \lambda_i v_j - \mu_{i,j}\ ,\\ & \forall i \in [m]\ \ \forall j \in [n] \qquad \mu_{i,j} \ge 0 \ \text{ and } \ \lambda_i \ge 0\ ,\\ & \forall i \in [m]\quad \xi_i \ge 0\ , \qquad \|w\|_\infty \le C\ , \end{aligned}$$

where the minimization is over $w \in \mathbb{R}^n$, $b \in \mathbb{R}$, $\xi \in \mathbb{R}^m$, $\lambda \in \mathbb{R}^m$, and $\mu_1, \dots, \mu_m$, each in $\mathbb{R}^n$. The number of variables and the number of constraints in this problem are both $O(mn)$.
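To make the formulation concrete, here is a small sketch of Eq. (4) written as a standard-form LP using scipy.optimize.linprog (our illustration; the paper's experiments used GLPK, and a serious implementation should use sparse constraint matrices rather than the dense rows built here).

```python
# A sketch of the polynomial-size LP of Eq. (4). Variable order:
# [w (n), b (1), xi (m), lam (m), mu (m*n)]. Dense rows for clarity only.
import numpy as np
from scipy.optimize import linprog

def robust_lp(X, y, v, N, gamma, C):
    m, n = X.shape
    P = v.sum() - N
    mu0 = n + 1 + 2 * m                  # index where the mu block starts
    nvar = mu0 + m * n
    c = np.zeros(nvar)
    c[n + 1:n + 1 + m] = 1.0 / m         # objective: (1/m) * sum_i xi_i
    A, ub = [], []
    for i in range(m):
        # -P*lam_i + sum_j mu_ij - y_i*b - xi_i <= 0   (first constraint)
        row = np.zeros(nvar)
        row[n] = -y[i]
        row[n + 1 + i] = -1.0
        row[n + 1 + m + i] = -P
        row[mu0 + i * n:mu0 + (i + 1) * n] = 1.0
        A.append(row); ub.append(0.0)
        for j in range(n):
            # -y_i*x_ij*w_j + v_j*lam_i - mu_ij <= -gamma*v_j/P
            row = np.zeros(nvar)
            row[j] = -y[i] * X[i, j]
            row[n + 1 + m + i] = v[j]
            row[mu0 + i * n + j] = -1.0
            A.append(row); ub.append(-gamma * v[j] / P)
    bounds = ([(-C, C)] * n + [(None, None)]       # w in the box, b free
              + [(0, None)] * (2 * m + m * n))     # xi, lam, mu nonnegative
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(ub),
                  bounds=bounds, method="highs")
    return res.x[:n], res.x[n]                     # learned (w, b)
```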
The following theorem explicitly relates the optimization problem in Eq. (4) with the one in Eq. (3).

Theorem 1. If $(w^\star, b^\star, \xi^\star, \lambda^\star, \mu_1^\star, \dots, \mu_m^\star)$ is an optimal solution to Eq. (4), then $(w^\star, b^\star, \xi^\star)$ is a feasible point of Eq. (3), and therefore the value of Eq. (4) upper-bounds the value of Eq. (3). Moreover, if $v_j \in \{0,1\}$ for all $j \in [n]$, then $(w^\star, b^\star, \xi^\star)$ is an optimal solution to Eq. (3). Finally, if it does not hold that $v_j \in \{0,1\}$ for all $j \in [n]$, and assuming $\|x_i\|_\infty \le 1$ for all $i$, then the difference between the value of Eq. (4) and the value of Eq. (3) is at most $C/\gamma$.

As a first step towards proving Thm. 1, we momentarily forget about the optimization problem at hand and focus on another question: given a specific triplet $(w, b, \xi)$, is it a feasible point of Eq. (3) or not? More concretely, for each training example $(x_i, y_i)$, we would like to determine whether it holds for all $J$ with $V([n] \setminus J) \le N$ that

$$y_i \Big(b + \sum_{j \in J} w_j x_{i,j}\Big) \ \ge\ \gamma \frac{V(J)}{P} - \xi_i\ . \qquad (5)$$

We can answer this question by comparing $-\xi_i$ with the value of the following integer program:

$$\begin{aligned} \min_{\alpha \in \{0,1\}^n}\quad & y_i b + \sum_{j=1}^n \alpha_j \Big(y_i w_j x_{i,j} - \gamma \frac{v_j}{P}\Big) & (6)\\ \text{s.t.}\quad & \sum_{j=1}^n \alpha_j v_j \ \ge\ P\ . \end{aligned}$$

For example, if the value of this integer program is less than $-\xi_i$, then let $\alpha^\star$ be an optimal solution, and we have that $y_i (b + \sum_{j=1}^n \alpha_j^\star w_j x_{i,j}) < \gamma (\sum_{j=1}^n \alpha_j^\star v_j)/P - \xi_i$. Namely, the set $J = \{j \in [n] : \alpha_j^\star = 1\}$ violates Eq. (5). On the other hand, if there exists some $J$ with $V([n] \setminus J) \le N$ that violates Eq. (5), then its indicator vector is a feasible point of Eq. (6) whose objective value is less than $-\xi_i$.

Directly solving the integer program in Eq. (6) may be difficult, so instead we examine the properties of the following linear relaxation:

$$\begin{aligned} \min\quad & y_i b + \sum_{j=1}^n \alpha_j \Big(y_i w_j x_{i,j} - \gamma \frac{v_j}{P}\Big) & (7)\\ \text{s.t.}\quad & \forall j \in [n]\ \ 0 \le \alpha_j \le 1 \ \ \text{ and } \ \ \sum_{j=1}^n \alpha_j v_j \ \ge\ P\ . \end{aligned}$$

To analyze this relaxation, we require the following lemma.

Lemma 1. Fix an example $(x_i, y_i)$, a linear classifier $(w, b)$, and a scalar $\xi_i > 0$, and let $\theta$ be the value of Eq. (7) with respect to these choices. (a) If $\theta \ge -\xi_i$ then Eq. (5) holds. (b) In the special case where $v_j \in \{0,1\}$ for all $j \in [n]$ and where $N$ is an integer, $\theta \ge -\xi_i$ if and only if Eq. (5) holds. (c) There exists a minimizer of Eq. (7) with at most one coordinate in $(0, 1)$.

The proof of the lemma is straightforward but technical, and is omitted due to lack of space. Lemma 1 tells us that comparing the value of the linear program in Eq. (7) with $-\xi_i$ provides a sufficient condition for Eq. (5) to hold for the example $(x_i, y_i)$. Moreover, this condition becomes both sufficient and necessary in the special case where $v_j \in \{0,1\}$ for all $j \in [n]$.

We now proceed with proving the first part of Thm. 1 using claim (a) of Lemma 1. The remaining parts of the theorem follow similarly from claims (b) and (c) of the lemma.

Proof of Theorem 1. Let $(w^\star, b^\star, \xi^\star, \lambda^\star, \mu_1^\star, \dots, \mu_m^\star)$ be an optimal solution to the linear program in Eq. (4). Specifically, it holds for all $i$ in $[m]$ that $\lambda_i^\star$ and $\mu_i^\star$ are non-negative, that $P \lambda_i^\star - \sum_{j=1}^n \mu_{i,j}^\star + y_i b^\star \ge -\xi_i^\star$, and that $y_i w_j^\star x_{i,j} - \gamma v_j / P \ge \lambda_i^\star v_j - \mu_{i,j}^\star$ for all $j \in [n]$. Therefore, it also holds that the value of the following optimization problem

$$\begin{aligned} \max_{\lambda_i, \mu_i}\quad & P \lambda_i - \sum_{j=1}^n \mu_{i,j} + y_i b^\star & (8)\\ \text{s.t.}\quad & \forall j \in [n] \qquad y_i w_j^\star x_{i,j} - \gamma \frac{v_j}{P} \ \ge\ \lambda_i v_j - \mu_{i,j}\ ,\\ & \forall j \in [n] \qquad \mu_{i,j} \ge 0 \ \text{ and } \ \lambda_i \ge 0 \end{aligned}$$

is at least $-\xi_i^\star$. The strong duality principle of linear programming (Boyd & Vandenberghe, 2004) states that the value of Eq. (8) equals the value of its dual optimization problem, which is:

$$\begin{aligned} \min\quad & y_i b^\star + \sum_{j=1}^n \alpha_j \Big(y_i w_j^\star x_{i,j} - \gamma \frac{v_j}{P}\Big) & (9)\\ \text{s.t.}\quad & \forall j \in [n]\ \ 0 \le \alpha_j \le 1 \ \ \text{ and } \ \ \sum_{j=1}^n \alpha_j v_j \ \ge\ P\ . \end{aligned}$$

In other words, the value of Eq. (9) is also at least $-\xi_i^\star$. Using claim (a) of Lemma 1, we have that

$$y_i \Big(b^\star + \sum_{j \in J} w_j^\star x_{i,j}\Big) \ \ge\ \gamma \frac{V(J)}{P} - \xi_i^\star$$

holds for all $J$ with $V([n] \setminus J) \le N$. The optimization problem in Eq. (4) also constrains $\|w^\star\|_\infty \le C$ and $\xi_i^\star \ge 0$ for all $i \in [m]$; thus, $(w^\star, b^\star, \xi^\star)$ satisfies the constraints in Eq. (3). Since Eq. (3) and Eq. (4) have the same objective function, the value of Eq. (3) is upper-bounded by the value of Eq. (4).
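Since Eq. (7) is a fractional knapsack problem, its value $\theta$ can be computed with a greedy sweep instead of a generic LP solver. The sketch below (ours, not the paper's) makes the feasibility check of Lemma 1 concrete; note that the greedy minimizer it builds has at most one fractional coordinate, exactly as claim (c) asserts.

```python
# Greedy evaluation of the relaxation in Eq. (7):
#   minimize  y*b + sum_j a_j * (y*w_j*x_j - gamma*v_j/P)
#   s.t.      0 <= a_j <= 1  and  sum_j a_j*v_j >= P,  with P = V([n]) - N.
# Assumes v_j > 0 for all j (as in Sec. 2.3).
import numpy as np

def relaxation_value(w, b, x, y, v, N, gamma):
    P = v.sum() - N
    cost = y * w * x - gamma * v / P     # cost of raising alpha_j to one
    alpha = np.zeros(len(v))
    budget = P                           # feature value we must retain
    for j in np.argsort(cost / v):       # cheapest cost per unit of value first
        if cost[j] < 0:                  # profitable items are taken fully
            alpha[j] = 1.0
        elif budget > 0:                 # then forced items; the last may be
            alpha[j] = min(1.0, budget / v[j])  # fractional (Lemma 1, claim c)
        else:
            break
        budget -= alpha[j] * v[j]
    return y * b + alpha @ cost

# By claim (a) of Lemma 1, a triplet (w, b, xi_i) passes the feasibility
# check for example (x_i, y_i) whenever relaxation_value(...) >= -xi_i.
```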
2.3. Generalization Bounds

We now prove a generalization bound on the risk of the classifier learned in our framework, using PAC-Bayesian techniques (McAllester, 2003). Throughout, we assume that $\|x\|_\infty \le 1$ with probability 1 over $\mathcal{D}$. For simplicity, we also assume that the bias term $b$ is $0$, and that $v_j > 0$ for all $j$. These assumptions can be relaxed at the cost of a somewhat more complicated analysis. Given a classifier $w$, let $\ell_\gamma(w, x, y)$ denote the $\gamma$-loss attained on the example $(x, y)$, defined as

$$\ell_\gamma(w, x, y) \;=\; \bigg[\!\!\bigg[\ \min_{J : V([n] \setminus J) \le N}\ y \sum_{j \in J} w_j x_j \ <\ \gamma \frac{V(J)}{P}\ \bigg]\!\!\bigg]\ , \qquad (10)$$

where $[[\cdot]]$ again denotes the indicator function. Note that $\mathbb{E}[\ell_0(w, x, y)] = R(w, 0)$, where $R$ is defined in Eq. (1).

Theorem 2. Let $S$ be a sample of size $m$ drawn i.i.d. from $\mathcal{D}$. For any $\delta > 0$, with probability at least $1 - \delta$, it holds for all $w \in \mathbb{R}^n$ with $\|w\|_\infty \le C$ that the risk associated with $w$ is at most

$$\sup\bigg\{\, \epsilon \ :\ \mathrm{KL}\Big(\frac{1}{m} \sum_{i=1}^m \ell_\gamma(w, x_i, y_i)\ \Big\|\ \epsilon\Big) \ \le\ \frac{\Phi(m, \gamma, \delta)}{m-1} \,\bigg\}\ ,$$

where $\Phi(m, \gamma, \delta) = \ln(m/\delta) + \sum_{j=1}^n \ln\big(4 P C / (\gamma v_j)\big)$ and KL is the Kullback-Leibler divergence. The above is upper-bounded by the empirical $\gamma$-loss (which equals $\frac{1}{m} \sum_{i=1}^m \ell_\gamma(w, x_i, y_i)$), plus the additional term

$$\sqrt{\frac{2}{m} \sum_{i=1}^m \ell_\gamma(w, x_i, y_i)\ \frac{\Phi(m, \gamma, \delta)}{m-1}}\ +\ \frac{2\, \Phi(m, \gamma, \delta)}{m-1}\ .$$

Proof sketch. The proof follows along similar lines to the PAC-Bayesian bound for linear classifiers in (McAllester, 2003). First, define the axis-aligned box $B = \prod_{j=1}^n \big[w_j - \frac{\gamma v_j}{2P},\, w_j + \frac{\gamma v_j}{2P}\big] \cap [-C, C]^n$. We use the margin concept to upper-bound $\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell_0(w, x, y)]$ by the expected $\ell_{\gamma/2}$-loss over $\mathcal{D}$ of a classifier sampled uniformly from $B$. We can upper-bound this expected loss using the PAC-Bayesian theorem (McAllester, 2003), where the uniform distribution over $B$ is the posterior classifier distribution, and the uniform distribution over $[-C, C]^n$ is the prior. The bound we get is defined in terms of the average empirical $\ell_{\gamma/2}$-loss of a random classifier from $B$, plus a complexity term depending on the volume ratio between $B$ and $[-C, C]^n$. Finally, this average loss can be upper-bounded by the empirical $\ell_\gamma$-loss of $w$ by repeating the technique of the first stage. The weaker bound stated in the theorem follows from a lower bound on the KL divergence, presented in (McAllester, 2003).

It is interesting to note that $L_\infty$ regularization emerges as the most natural one in this setting, since it induces the most convenient type of margin for relating the $\ell_0$, $\ell_{\gamma/2}$, and $\ell_\gamma$ loss functions as described above. This lends theoretical support to our choice of the $L_\infty$ norm in our algorithms.

2.4. Feature Corrupting Noise

We now shift our attention to the case where a subset of the features is corrupted with random noise, and show that the same LP approach used to handle missing features can also deal with corrupted features, provided that the margin parameter $\gamma$ in Eq. (4) is sufficiently large. For simplicity, we shall assume that all features are supported on $[-1, 1]$ and have zero mean. Unlike the feature-deleting noise, we now assume that each feature selected by the adversary is replaced with noise sampled from some distribution, also supported on $[-1, 1]$ and having zero mean. The following theorem relates the risk of a classifier in the above setting to its expected $\ell_\gamma$-loss in the feature-deletion setting, where the latter can be bounded with Thm. 2.

Theorem 3. Let $C$, $N$ and $\delta$ be arbitrary positive reals, and let $\gamma$ be at least $C \sqrt{2 N \ln(1/\delta)}$. Assume that we solve Eq. (4) with parameters $\gamma$, $C$, $N$ and with $v_j = 1$ for all $j \in [n]$. Let $w$ be the resulting linear classifier, and assume for simplicity that the bias term $b$ is zero. Let $f$ be a random vector-valued function on $\mathcal{X}$, such that for every $x \in \mathcal{X}$, $f(x)$ is the instance $x$ after the feature-corruption scheme described above. Then, using $\ell_\gamma$ as defined in Eq. (10), for $(x, y)$ drawn randomly from $\mathcal{D}$, we have:

$$\Pr\big[\, y \langle w, f(x) \rangle \le 0 \,\big] \ \le\ \mathbb{E}\big[\ell_\gamma(w, x, y)\big] + \delta\ .$$

Proof. Let $(x, y)$ be an example, and let $J$ denote the feature subset which remains uncorrupted by the adversary. Using Hoeffding's bound and our assumption on $\gamma$, we have that $\Pr[\, y \sum_{j \notin J} w_j f_j(x) \le -\gamma \,]$ is upper-bounded by $\delta$. Therefore, with probability at least $1 - \delta$ over the randomness of $f$, $y \langle w, f(x) \rangle$ is equal to

$$y \sum_{j \in J} w_j x_j \;+\; y \sum_{j \notin J} w_j f_j(x) \ \ >\ \ y \sum_{j \in J} w_j x_j \,-\, \gamma\ . \qquad (11)$$

Thus, with probability at least $1 - \delta$, $\Pr(y \langle w, f(x) \rangle \le 0)$ is upper-bounded by $\mathbb{E}[\ell_\gamma(w, x, y)]$. Otherwise, with probability at most $\delta$, $\Pr(y \langle w, f(x) \rangle \le 0) \le 1$.

We conclude with an interesting observation. In the feature-corruption setting, making a correct prediction boils down to achieving a sufficiently large margin on the uncorrupted features. Let $r \in (0, 1)$ be a fixed ratio between $N$ and $n$, and let $n$ grow to infinity. Assuming a reasonable degree of feature redundancy, the term $y \sum_{j \in J} w_j x_j$ grows as $\Theta(|J|)$. On the other hand, Hoeffding's bound tells us that $y \sum_{j \notin J} w_j f_j(x)$ grows only as $O(\sqrt{N})$. Therefore, for $r$ arbitrarily close to $1$ and a large enough $n$, the first sum in Eq. (11) dominates the second. Namely, by setting $\gamma = \Theta(\sqrt{N})$ in Eq. (4), our ability to withstand feature corruption matches our ability to withstand feature deletion.
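The Hoeffding step in the proof is easy to sanity-check numerically. The following standalone snippet (our illustration, with arbitrary constants, using the margin value from Theorem 3 as reconstructed above) estimates how often the contribution of the corrupted features exceeds $\gamma$:

```python
# Monte Carlo check of the Hoeffding argument behind Theorem 3: with
# |w_j| <= C, the total contribution of N corrupted zero-mean features
# rarely exceeds gamma = C * sqrt(2 * N * ln(1/delta)).
import numpy as np

rng = np.random.default_rng(1)
n, N, C, delta = 1000, 200, 1.0, 0.05
w = rng.uniform(-C, C, size=n)
gamma = C * np.sqrt(2 * N * np.log(1 / delta))

corrupt = rng.choice(n, size=N, replace=False)   # adversary's corrupted set
noise = rng.uniform(-1, 1, size=(100000, N))     # zero-mean noise in [-1, 1]
overshoot = np.abs(noise @ w[corrupt])           # |sum_{j not in J} w_j f_j(x)|
print(f"P(overshoot >= gamma) ~= {np.mean(overshoot >= gamma):.5f}"
      f"  (Hoeffding guarantee: <= {delta})")
```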
3. Solving the Problem with the Perceptron

We now turn to our second learning algorithm, taking a different angle on the problem. We momentarily forget about the original statistical learning problem and instead define a related online prediction problem. In online learning there is no distinction between the training phase and the classification phase, so we cannot perfectly replicate the classification-time noise scenario discussed above. Instead, we assume that an adversary removes features from every instance that is presented to the algorithm. We address this online problem with a modified version of the Perceptron algorithm (Rosenblatt, 1958), and use an online-to-batch conversion technique to convert the online algorithm back into a statistical learning algorithm. The detour through online learning gives us efficiency, while the online-to-batch technique provides us with the statistical generalization properties we are interested in.

3.1. Perceptron with Projections onto the Cube

We start with a modified version of the well-known Perceptron algorithm (Rosenblatt, 1958), which observes a sequence of examples $(x_i, y_i)_{i=1}^m$, one example at a time, and incrementally builds a sequence $(w_i, b_i)_{i=1}^m$ of linear margin-based classifiers, while constraining them to a hyper-cube. Before processing example $i$, the algorithm has the vector $w_i$ and the bias term $b_i$ stored in its memory. An adversary takes the instance $x_i$ and reveals only a subset $J_i$ of its features to the algorithm, attempting to cause the online algorithm to make a prediction mistake. In choosing $J_i$, the adversary is restricted by the constraint $V([n] \setminus J_i) \le N$. Next, the algorithm predicts the label associated with $x_i$ to be

$$\mathrm{sign}\Big(b_i + \sum_{j \in J_i} w_{i,j}\, x_{i,j}\Big)\ .$$

After the prediction is made, the correct label $y_i$ is revealed and the algorithm suffers a hinge-loss $\ell(w, b, x, y)$, defined as

$$\ell(w, b, x, y) \;=\; \max_{J : V([n] \setminus J) \le N}\ \Big[\, \gamma \frac{V(J)}{P} - y \Big(b + \sum_{j \in J} w_j x_j\Big) \Big]_+\ , \qquad (12)$$

where $P = V([n]) - N$ and $[\cdot]_+$ denotes the hinge function $\max\{\cdot, 0\}$. Note that $\ell(w_i, b_i, x_i, y_i)$ upper-bounds $\gamma$ times the indicator of a prediction mistake on the current example, for any choice of $J_i$ made by the adversary. We choose to denote the loss by $\ell$ to emphasize the close relation between $\ell(w_i, b_i, x_i, y_i)$ and $\xi_i$ in Eq. (3). Due to our choice of loss function, we can assume that the adversary chooses the subset $J_i$ that inflicts the greatest loss.

The algorithm now uses the correct label $y_i$ to construct the pair $(w_{i+1}, b_{i+1})$, which is used to make the next prediction. If $\ell(w_i, b_i, x_i, y_i) = 0$, the algorithm defines $w_{i+1} = w_i$ and $b_{i+1} = b_i$. Otherwise, the algorithm defines $w_{i+1}$ using the following coordinate-wise update:

$$\forall j \in [n] \qquad w_{i+1,j} \;=\; \begin{cases} \big[\, w_{i,j} + \tau y_i x_{i,j} \,\big]_{\pm C} & \text{if } j \in J_i\ ,\\ w_{i,j} & \text{otherwise}\ , \end{cases}$$

and $b_{i+1} = [\, b_i + \tau y_i \,]_{\pm C}$, where $\tau = C \sqrt{\frac{n+1}{2m}}$ and $[\cdot]_{\pm C}$ abbreviates the function $\max\{\min\{\cdot, C\}, -C\}$. This update is nothing more than the standard Perceptron update with constant learning rate $\tau$, with an added projection step onto the hyper-cube of radius $C$. The specific value of $\tau$ used above is the value that optimizes the cumulative loss bound below. As in the previous section, restricting the online classifier to the hyper-cube helps us control its complexity, while promoting dense classifiers. It also comes in handy in the next stage, when we convert the online algorithm into a statistical learning algorithm.

Using a rather straightforward adaptation of standard Perceptron loss bounds to the case where the hypothesis is confined to the hyper-cube leads us to the following theorem, which compares the cumulative loss suffered by the algorithm with the cumulative loss suffered by any fixed hypothesis in the hyper-cube of radius $C$.

Theorem 4. Choose any $C > 0$, and let $w^\star \in \mathbb{R}^n$ and $b^\star \in \mathbb{R}$ be such that $\|w^\star\|_\infty \le C$ and $|b^\star| \le C$. Let $(x_i, y_i)_{i=1}^m$ be an arbitrary sequence of examples with $\|x_i\|_1 \le 1$ for all $i$. Assume that this sequence is presented to our modified Perceptron, and let $\ell(w_i, b_i, x_i, y_i)$ be as defined in Eq. (12). Then it holds that $\frac{1}{m} \sum_{i=1}^m \ell(w_i, b_i, x_i, y_i)$ is upper-bounded by

$$\frac{1}{m} \sum_{i=1}^m \ell(w^\star, b^\star, x_i, y_i) \;+\; C \sqrt{\frac{2 (n+1)}{m}}\ .$$

The next step is to convert our online algorithm into a statistical learning algorithm.

3.2. Converting Online to Batch

To obtain a statistical learning algorithm with risk guarantees, we assume that the sequence of examples presented to the modified Perceptron algorithm is a training set sampled i.i.d. from the underlying distribution $\mathcal{D}$. We turn to the simple averaging technique presented in (Cesa-Bianchi et al., 2004) and define $\bar{w} = \frac{1}{m} \sum_{i=1}^m w_{i-1}$ and $\bar{b} = \frac{1}{m} \sum_{i=1}^m b_{i-1}$. The pair $(\bar{w}, \bar{b})$ is called the average hypothesis, and defines our robust classifier.
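The following sketch (ours, not the paper's code) puts Secs. 3.1 and 3.2 together for the unit-value case $v_j = 1$, where the loss-maximizing subset $J_i$ can be found greedily: removing feature $j$ changes the bracketed term in Eq. (12) by $y w_j x_j - \gamma/P$, so the adversary deletes the $N$ features with the largest positive such gain.

```python
# Modified Perceptron with projection onto the hyper-cube (Sec. 3.1) and
# the online-to-batch average (Sec. 3.2), assuming v_j = 1 for all j.
import numpy as np

def worst_subset(w, x, y, N, gamma, P):
    """Boolean mask of the features kept by a loss-maximizing adversary."""
    gain = y * w * x - gamma / P            # loss increase from deleting j
    keep = np.ones(len(w), dtype=bool)
    for j in np.argsort(-gain)[:N]:
        if gain[j] > 0:
            keep[j] = False
    return keep

def cube_perceptron(X, Y, N, gamma, C, tau=None):
    m, n = X.shape
    P = n - N                               # V([n]) = n when v_j = 1
    if tau is None:
        tau = C * np.sqrt((n + 1) / (2 * m))  # the rate behind Thm. 4's bound
    w, b = np.zeros(n), 0.0
    w_avg, b_avg = np.zeros(n), 0.0
    for x, y in zip(X, Y):
        w_avg += w / m                      # average hypothesis (Sec. 3.2)
        b_avg += b / m
        J = worst_subset(w, x, y, N, gamma, P)
        loss = gamma * J.sum() / P - y * (b + w[J] @ x[J])
        if loss > 0:                        # Perceptron step, then clip to box
            w[J] = np.clip(w[J] + tau * y * x[J], -C, C)
            b = float(np.clip(b + tau * y, -C, C))
    return w_avg, b_avg
```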
We use the derivation in (Cesa-Bianchi et al., 2004) to prove that the average classifier provides an adequate solution to our original problem. Note that the loss function we use, defined in Eq. (12), is bounded and convex in its first two arguments. This allows us to apply (Cesa-Bianchi et al., 2004, Corollary 2) to relate the risk of $(\bar{w}, \bar{b})$ with the cumulative online loss suffered by the Perceptron. It also allows us to apply Hoeffding's bound to relate the expected loss of any fixed classifier $(w^\star, b^\star)$ with its empirical loss on the training set. Combining both bounds results in the following corollary.

Corollary 1. For any $\delta > 0$, with probability at least $1 - \delta$ over the random sampling of $S$, our algorithm constructs $(\bar{w}, \bar{b})$ such that $\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(\bar{w}, \bar{b}, x, y)]$ is at most

$$\min_{(w^\star, b^\star) \in \mathcal{H}}\ \mathbb{E}\big[\ell(w^\star, b^\star, x, y)\big] \;+\; \big(3C + \tilde{\gamma}\big) \sqrt{\frac{2 \big(n + 1 + \ln(2/\delta)\big)}{m}}\ ,$$

where $\tilde{\gamma} = \max_{J : V([n] \setminus J) \le N} \gamma V(J)/P$ and $\mathcal{H}$ is the set of all pairs $(w, b)$ such that $\|w\|_\infty \le C$ and $|b| \le C$.

Using the fact that the hinge loss upper-bounds $\gamma$ times the indicator function of a prediction mistake, regardless of the adversary's choice of the feature set, we have that the expected hinge loss upper-bounds $\gamma R(\bar{w}, \bar{b})$.

4. Experiments and Conclusions

We compare the performance of our two algorithms (LP-based and online-to-batch) with that of a linear $L_2$ SVM (Joachims, 1998) and with the results reported in (Globerson & Roweis, 2006). We used the GLPK package (http://www.gnu.org/software/glpk) to solve the LP formulation of our LP-based algorithm.

We begin with a highly illustrative sanity check. We generated a synthetic dataset of 1000 linearly separable instances in $\mathbb{R}^{20}$ and added label noise by flipping each label with probability 0.2. Then, we added two copies of the actual label as additional features to each instance, for a total of 22 features. We randomly split the data into equally sized training and test sets, and trained an SVM classifier on the training set. We set $v_j = 1$ for $j \in [20]$ and $v_{21} = v_{22} = 10$, expressing our prior knowledge that the last two features are more valuable. Using these feature values, we applied our technique with different values of the parameter $N$. We removed one or both of the high-value features from the test set and evaluated the classifiers. With only one feature removed, both SVM and our approach attained a test error of zero. With two features removed, the test error of the SVM classifier jumped to 0.477 ± 0.004 (over 100 random repetitions of the experiment), indicating that it essentially put all of its weight on the two perfect features. With the noise parameter set to $N = 20$, our approach attained a test error of only 0.22 ± 0.002. This is only marginally above the best possible error rate for this setting.

Figure 1. A comparison of our LP-based approach with the algorithm of (Globerson & Roweis, 2006) (GR) and with SVM on SPAM (left) and MNIST (right), with random noise. Both panels plot test error as a function of the number of deleted features.

Following the lead of (Globerson & Roweis, 2006), we conducted experiments using the SPAM and MNIST datasets. The SPAM dataset, taken from the UCI repository, is a collection of spam and non-spam e-mails. Spam can be detected by different word combinations, so we expect considerable feature redundancy in this dataset. The MNIST dataset is a collection of pixel-maps of handwritten digits. Again following (Globerson & Roweis, 2006), we focused on the binary problem of distinguishing the digit 4 from the digit 7. Adjacent pixels often contain redundant information, making MNIST well-suited for our needs.

On each dataset, we performed two types of experiments. The first type follows exactly the protocol used in (Globerson & Roweis, 2006). Namely, the algorithm is trained with a small training set of 50 instances, and its performance is tested in the face of random feature-deleting noise, which uniformly deletes $N$ non-zero features from each test instance, for various choices of $N$. Notice that this setting deviates from the adversarial setting considered so far; the reason for conducting this experiment is to compare our results with those reported in (Globerson & Roweis, 2006). A validation set is used for parameter tuning. We did not test our online-to-batch algorithm within this setting, since it has little advantage with such a small training set. The results are presented in Fig. 1, and show test error as a function of the number of deleted features. Compared to its competitors, our algorithm has a clear and substantial advantage.

The second type of experiment simulates more closely the adversarial setting discussed throughout the paper.
Using 10-fold cross-validation, we corrupted each test instance using a greedy adversary, which deletes the most valuable features of each instance until either the limit $N$ is reached or all useful features are deleted. One ninth of each training set was used for parameter tuning. Due to computational considerations when running our LP-based algorithm, we performed a variant of bagging: we randomly split the training set into chunks, trained on each chunk individually, and finally averaged the resulting weight vectors. In contrast, our online-to-batch algorithm trained on the entire training set at once, and so did the SVM algorithm. We repeated this process for different values of $N$. For the SPAM dataset, we repeated this entire experiment twice: once with feature values $v_j$ set uniformly to 1, and once with $v_j$ set using a mutual-information heuristic. Formally, we set

$$v_j \;=\; \frac{1}{Z}\, \max_{c \in \mathbb{R}}\ I\big(\, [[x_j > c]]\,;\ y \,\big)\ ,$$

where $Z$ is such that $\sum_j v_j = n$, and where $I([[x_j > c]]; y)$ is the mutual information between the predicate $[[x_j > c]]$ and the label $y$, over all examples in the training set. Intuitively, we are calculating the amount of information that each individual feature contains on the label, provided that we are looking only at linear threshold functions.
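A small sketch of this heuristic (ours, not the paper's code) is given below. The paper does not specify the candidate thresholds; here we assume, as one natural choice, midpoints between consecutive observed feature values.

```python
# Feature values v_j from the mutual-information heuristic of Sec. 4:
# v_j is the best I([[x_j > c]]; y) over thresholds c, normalized so that
# the values sum to n. Labels are assumed to be +1/-1.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_info(pred, y):
    """I(pred; y) for two binary variables, estimated from empirical counts."""
    joint = np.array([[np.mean(pred & (y > 0)), np.mean(pred & (y < 0))],
                      [np.mean(~pred & (y > 0)), np.mean(~pred & (y < 0))]])
    return entropy(joint.sum(1)) + entropy(joint.sum(0)) - entropy(joint.ravel())

def feature_values(X, y):
    m, n = X.shape
    v = np.zeros(n)
    for j in range(n):
        xs = np.unique(X[:, j])
        cuts = (xs[:-1] + xs[1:]) / 2.0         # midpoints between values
        v[j] = max((mutual_info(X[:, j] > c, y) for c in cuts), default=0.0)
    return n * v / v.sum()                       # normalize: sum(v) == n
```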
When experimenting with the MNIST dataset, we only used the values of $v_j$ set by our heuristic. This is a natural choice, since the features of MNIST are of markedly different importance levels. For example, the corner pixels, which are always zero, are completely uninformative, while other pixels may be very informative.

Figure 2. Experiments on SPAM with $v_j = 1$ for all $j$ (left) and with $v_j$ set by the mutual-information heuristic (center), and experiments on MNIST with $v_j$ set by the mutual-information heuristic (right). Each panel plots the test error of LP-Based, Online-Batch, and SVM as a function of $N$.

The results are presented in Fig. 2, and show test error as a function of $N$. Clearly, our algorithms have the advantage. SVM repeatedly puts all of its eggs in a small number of baskets, and is severely punished for this, while our technique anticipates the actions of the adversary and hedges its bets accordingly. Moreover, the results in Fig. 2 demonstrate the trade-offs between our LP-based and online-to-batch algorithms. Although we have handicapped the LP-based algorithm by chunking the training set, its performance is comparable, and sometimes superior, to that of the online-to-batch algorithm. With less chunking, or with no chunking at all, we expect its performance to be even better. We conclude that our proposed algorithms successfully withstand feature corruption at classification time, and considerably improve upon the current state of the art.

On a more general note, this work has interesting connections to a recent trend in machine learning research, which is to develop sparse classifiers supported on a small subset of the features. In our setting we are interested in the exact opposite, and the efficacy of using the $L_\infty$ norm is clearly demonstrated here. The trade-off between robustness and sparsity provides fertile ground for future research.

References

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Carr, R. D., & Lancia, G. (2000). Compact vs. exponential-size LP relaxations. SANDIA Report 2000-2170.

Cesa-Bianchi, N., Conconi, A., & Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50, 2050–2057.

Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.

Gamble, E., Macskassy, S., & Minton, S. (2007). Classification with pedigree and its applicability to record linkage. Workshop on Text-Mining & Link-Analysis.

Globerson, A., & Roweis, S. (2006). Nightmare at test time: robust learning by feature deletion. Proceedings of ICML 23 (pp. 353–360).

Joachims, T. (1998). Making large-scale support vector machine learning practical. In Advances in kernel methods: support vector learning. MIT Press.

Littlestone, N. (1991). Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. Proceedings of COLT 4 (pp. 147–156).

McAllester, D. A. (2003). Simplified PAC-Bayesian margin bounds. Proceedings of COLT 16 (pp. 203–215).

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–407.

Teo, C.-H., Globerson, A., Roweis, S., & Smola, A. (2008). Convex learning with invariances. Advances in NIPS 21.

Vapnik, V. N. (1998). Statistical learning theory. Wiley.