Small Sample Inference for Generalization Error in Classification Using the CUD Bound

Eric B. Laber and Susan A. Murphy
Department of Statistics, University of Michigan, Ann Arbor, MI 48104
{laber, samurphy}@umich.edu

Abstract

Confidence measures for the generalization error are crucial when small training samples are used to construct classifiers. A common approach is to estimate the generalization error by resampling and then assume the resampled estimator follows a known distribution to form a confidence set [Kohavi 1995, Martin 1996, Yang 2006]. Alternatively, one might bootstrap the resampled estimator of the generalization error to form a confidence set. Unfortunately, these methods do not reliably provide sets of the desired confidence. The poor performance appears to be due to the lack of smoothness of the generalization error as a function of the learned classifier, which results in a non-normal distribution of the estimated generalization error. We construct a confidence set for the generalization error by using a smooth upper bound on the deviation between the resampled estimate and the generalization error. The confidence set is formed by bootstrapping this upper bound. In cases in which the approximation class for the classifier can be represented as a parametric additive model, we provide a computationally efficient algorithm. This method exhibits superior performance across a series of test and simulated data sets.

1 Introduction

Measures of uncertainty are particularly critical in the small sample setting, where training set to training set variation is likely to be high. One example occurs in medical diagnostics in mental health, in which one may want to classify patients as "low risk" or "high risk" for relapse based on observed medical history. In these settings the signal to noise ratio is likely to be small. Thus, both the estimated generalization error and the confidence in this estimate must be considered, as the confidence might be very low due to the noise in the data. In addition, in this setting data must be collected from clinical trials and is consequently expensive to obtain. The cost of these trials, coupled with notoriously low adherence rates, leads to data sets which are too small to admit a test set. Any inference for the generalization error must be derived from a single training set.

In this paper, we introduce a new method for obtaining confidence bounds on the generalization error in the small sample setting. The paper is organized as follows. Section 2 provides the framework for this research and describes why this problem is difficult; we also survey several standard approaches to this problem and motivate the need for a new method. In Section 3 we derive a new bound, the constrained uniform deviation bound (CUD bound hereafter), on the deviation between the estimated and true generalization errors, and describe how to use this bound to form confidence sets. Section 4 gives an efficient algorithm for computing the new upper bound when the classifier can be represented as a parametric additive model. Section 5 describes an empirical study of several methods for constructing confidence sets and analyzes the results.

2 Framework

Our objective is to derive a confidence set for the generalization error under minimal assumptions. To this end, we only assume a training set $D = \{(x_i, y_i)\}_{i=1,\ldots,n}$ of $n$ iid draws from an (unknown) distribution $F$ over $X \times Y$, where $X$ is any feature space and $Y = \{-1, 1\}$ is the label space.
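To make the framework concrete, the following minimal sketch generates a toy training set $D$ of $n$ iid pairs with labels in $\{-1, 1\}$. The Gaussian feature distribution and logistic label mechanism are illustrative assumptions only; the paper places no such restrictions on $F$.

    import numpy as np

    def make_training_set(n=30, p=2, seed=0):
        """Draw a toy training set D = {(x_i, y_i)}, i = 1..n, of iid pairs.

        The Gaussian features and logistic label mechanism are illustrative
        assumptions only; the paper places no such restrictions on F.
        """
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(n, p))                        # features x_i in the feature space
        probs = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))
        y = np.where(rng.uniform(size=n) < probs, 1, -1)   # labels in {-1, +1}
        return X, y

    X_train, y_train = make_training_set(n=30)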
We also assume that there is a space of potential classifiers $\mathcal{G}$ from which we choose our classifier $f$ using some criterion that depends on our training set $D$. Equally important is what we do not assume. We make no assumptions about the structure of $\mathcal{G}$, nor do we assume that the Bayes classifier belongs to $\mathcal{G}$. Finally, we do not assume the existence of an underlying "true" deterministic function $y = g(x)$ which generates the data.

Given a training set $D$ we choose a classifier $f \in \mathcal{G}$ using some criterion (discussed in detail below). The classifier $f$ aims to predict the value of a label $y$ given feature $x$. Thus, a natural measure of the performance of $f$ is its generalization error
$$\epsilon(f) = E_{(X,Y)\sim F}\, I\{f(X) \neq Y\},$$
that is, the probability that $f$ will fail to correctly predict label $Y$ from feature $X$.

When data are abundant, inference for the generalization error is straightforward. In this setting, one partitions the available data into two sets $L$ and $T$, called a learning set and a testing set respectively. Estimation is done by choosing classifier $f \in \mathcal{G}$ using learning set $L$ and then using the empirical estimate
$$\hat{\epsilon}_{\mathrm{test}}(f) = \frac{1}{\#T}\sum_{(x_i,y_i)\in T} I\{f(x_i) \neq y_i\}.$$
The exact distribution of $\hat{\epsilon}_{\mathrm{test}}$ is known and confidence sets can be computed numerically [Langford 2005(1)]. Moreover, $\hat{\epsilon}_{\mathrm{test}}(f)$ is, under mild regularity conditions, asymptotically normal, making standard asymptotic inference applicable.

The convenient properties of $\hat{\epsilon}_{\mathrm{test}}(f)$ are a direct consequence of the independent test set $T$. We define the small sample setting as one in which the data are too meager to admit a test set. Without a test set one is forced to use the same data both to train and to evaluate the classifier. Estimates and confidence sets for the generalization error must be constructed using resampling (e.g. the bootstrap, cross-validation, etc.). However, when the training sets are small, the training set to training set distribution of these estimates may be highly non-normal, thus complicating the construction of confidence sets. This is particularly the case here, as the generalization error $\epsilon(f)$ is a non-smooth function of $f$.

It is worth pointing out that there exist alternative motivations for seeking confidence sets in the absence of a testing set. The formation of a test set essentially "wastes" observations which are not used in the training of the classifier. The omission of these observations can significantly deteriorate the performance of the learned classifier. Another motivation is the existence of learning algorithms which assume that training accuracy correctly reflects generalization error when selecting a classifier. Using high quality bounds on the generalization error in the algorithm can boost performance [Langford 2005(2)], particularly when the training data is small.

Relevant Work

Much of the work on small sample inference for the generalization error has focused on point estimates via resampling. Popular point estimates include the .632+ estimator given by Efron and Tibshirani [Efron 1995], the LV1O estimate given by Weiss [1991], and more recently the bolstered error estimate of Braga-Neto and Dougherty [Dougherty 2004]. A nice survey of generalization error estimates is given by Schiavo and Hand [Schiavo 2000].

In the literature, a variety of general approaches have been suggested for constructing confidence sets for the generalization error in the absence of a test set. We discuss them in turn.

1. Known distribution.
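As an illustration of why the test-set case is easy, the sketch below computes $\hat{\epsilon}_{\mathrm{test}}(f)$ for a classifier $f$ (passed as a callable returning labels in $\{-1, 1\}$), together with an exact interval based on the Binomial distribution of $\#T\cdot\hat{\epsilon}_{\mathrm{test}}(f)$ and the usual normal approximation. The Clopper-Pearson form used here is one standard exact construction and is not necessarily the one used in [Langford 2005(1)].

    import numpy as np
    from scipy import stats

    def test_set_error(f, X_test, y_test):
        """Empirical test error: the fraction of test points that f misclassifies."""
        return np.mean(f(X_test) != y_test)

    def exact_ci(err_hat, m, alpha=0.05):
        """Clopper-Pearson interval for eps(f), using #T * err_hat ~ Binomial(m, eps)."""
        k = int(round(err_hat * m))
        lo = stats.beta.ppf(alpha / 2, k, m - k + 1) if k > 0 else 0.0
        hi = stats.beta.ppf(1 - alpha / 2, k + 1, m - k) if k < m else 1.0
        return lo, hi

    def normal_ci(err_hat, m, alpha=0.05):
        """Asymptotic normal interval; reasonable only when m is large."""
        z = stats.norm.ppf(1 - alpha / 2)
        se = np.sqrt(err_hat * (1 - err_hat) / m)
        return max(err_hat - z * se, 0.0), min(err_hat + z * se, 1.0)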
One approach to this problem is to assume the resampled estimator follows a known distribution [Kohavi 1995, Yang 2006]. Typically this distribution is taken to be normal, so that a $1-\alpha$ confidence set for the estimated generalization error $\hat{\epsilon}(f)$ of classifier $f$ is given by
$$\hat{\epsilon}(f) \pm \Phi^{-1}(1-\alpha/2)\,\hat{\sigma}/\sqrt{n^*},$$
where $\hat{\sigma}$ is some estimate of the standard error and $n^*$ is the effective sample size¹ used to construct the resampled estimator. This approach mimics the case where $\hat{\epsilon}(f)$ is constructed from a training set and then a normal approximation would have been justified for the estimator of the generalization error based on an independent test set.

¹ Here we use effective sample size to mean the expected number of unique points used in estimation during a single step of a resampling procedure.

2. Bayesian Approach. This approach mimics the case where $\hat{\epsilon}(f)$ is constructed from a training set and then, using an independent test set, a Bayesian approach is used to construct a posterior confidence set. Here we specify a Beta prior on the generalization error $\epsilon(f)$ [Martin 1995],
$$\epsilon(f) \sim \mathrm{Beta}(\alpha_0, \beta_0).$$
If an independent test set $T = \{(x_i, y_i)\}_{i=1,\ldots,m}$ were available, then the random variables $z_i = I(f(x_i) \neq y_i)$ would be iid $\mathrm{Bernoulli}(\epsilon(f))$. The posterior $P(\epsilon(f)\,|\,z_1,\ldots,z_m)$ is also a Beta distribution,
$$\epsilon(f)\,|\,z_1,\ldots,z_m \sim \mathrm{Beta}\big(\alpha_0 + m\hat{\epsilon}(f),\ \beta_0 + m - m\hat{\epsilon}(f)\big).$$
Note that the posterior depends only on the number of misclassified points, $m\hat{\epsilon}(f)$, on the test set.

Suppose we do not have a test set. Then, given resampled estimate $\hat{\epsilon}(f)$, we can view $(n - n^*)\hat{\epsilon}(f)$ as the number of misclassified examples, where $n$ and $n^*$ are the full and effective sample sizes. One can think of this as training on a set of size $n^*$ and testing on a set of size $n - n^*$. Using this idea, the resampled model is given by
$$\epsilon(f) \sim \mathrm{Beta}(\alpha_0, \beta_0), \qquad \epsilon(f)\,|\,z_1,\ldots,z_m \sim \mathrm{Beta}(\tilde{\alpha}, \tilde{\beta}),$$
where
$$\tilde{\alpha} = \alpha_0 + \hat{\epsilon}(f)(n - n^*), \qquad \tilde{\beta} = \beta_0 + (1 - \hat{\epsilon}(f))(n - n^*).$$
The posterior given above can be used to construct confidence intervals for $\epsilon(f)$.

3. Direct Estimation. In this approach we repeatedly resample the data and from each resampled dataset we construct a point estimate of the generalization error (this may require further resampling). This process generates a sequence of estimates for the generalization error. A confidence set can be constructed, for example, by taking the quantiles of the generated sequence of resampled estimators.

The methods described above can work well in some settings. Distribution-based methods tend to work well when distributional assumptions are met, providing tight confidence sets and accurate coverage. However, these methods can be non-robust to large deviations from distributional assumptions and can suffer from unacceptable type I errors. Direct estimation avoids distributional assumptions but is unstable for small samples because of the non-smooth 0-1 loss. These problems are illustrated in the experiments section of the paper.

Ideally, confidence sets for the generalization error should satisfy the following criteria.

1. The confidence set should deliver the promised coverage for a wide variety of examples.
2. The confidence set should be efficiently computable.
3. The confidence set should be theoretically justified.

Given multiple algorithms for constructing confidence sets satisfying the above, algorithms that produce confidence sets of smallest size are preferred. We propose a new approach and compare this approach to the above methods using the first and second criteria in the experiments section.
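The sketch below illustrates, under simplifying assumptions, how each of the three constructions turns a resampled point estimate into an interval: the normal approximation with effective sample size $n^*$, the Beta posterior with the pseudo-counts $\tilde{\alpha}$ and $\tilde{\beta}$ given above, and direct quantiles of a vector of resampled estimates. The uniform Beta(1,1) prior and the interface are placeholders rather than the exact choices used later in the experiments.

    import numpy as np
    from scipy import stats

    def normal_approx_ci(eps_hat, sigma_hat, n_eff, alpha=0.05):
        """1. Known distribution: eps_hat +/- Phi^{-1}(1 - alpha/2) * sigma_hat / sqrt(n*)."""
        z = stats.norm.ppf(1 - alpha / 2)
        half = z * sigma_hat / np.sqrt(n_eff)
        return eps_hat - half, eps_hat + half

    def bayes_beta_ci(eps_hat, n, n_eff, a0=1.0, b0=1.0, alpha=0.05):
        """2. Bayesian approach: Beta posterior with a pseudo test set of size n - n*."""
        a_tilde = a0 + eps_hat * (n - n_eff)
        b_tilde = b0 + (1.0 - eps_hat) * (n - n_eff)
        return (stats.beta.ppf(alpha / 2, a_tilde, b_tilde),
                stats.beta.ppf(1 - alpha / 2, a_tilde, b_tilde))

    def direct_ci(resampled_estimates, alpha=0.05):
        """3. Direct estimation: quantiles of the resampled error estimates."""
        lo, hi = np.quantile(resampled_estimates, [alpha / 2, 1 - alpha / 2])
        return lo, hi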
3 The CUD Bound

We construct an upper bound on the deviation between the training error and the generalization error of a classifier using ideas from the large literature on constructing bounds on the so-called excess risk of a selected classifier [Shawe-Taylor 1998, Bartlett and Mendelson 2006]. We also allow the use of a convex surrogate, as in Zhang [2004], for the 0-1 loss. In particular, we use the idea of constructing a bound for the supremum of the deviation between the training error and generalization error over a small subset of the approximation space $\mathcal{G}$. This small subset is determined by the convex surrogate loss $L : (D, f) \mapsto \mathbb{R}$ used to construct the classifier ($\hat{f} = \arg\min_{f\in\mathcal{G}} L(D, f)$). A variety of surrogate loss functions may be used; a common choice is squared error loss, given by
$$L(D, f) = \frac{1}{n}\sum_{(x_i,y_i)\in D} (y_i - f(x_i))^2.$$
Some other examples are the hinge loss, logistic loss, exponential loss, and penalized quadratic loss.

The intuition for the upper bound is as follows. Suppose we knew (we don't) the limiting value of $\hat{f}$, say $f^*$, as the amount of training data becomes infinite; $f^*$ may not be the Bayes classifier. Also notice that $f^*$ does not depend on the training set. In addition, since we can think of $\hat{f}$ as being concentrated near $f^*$, this motivates the use of $f^*$ as the anchor for the small subset over which we take the supremum. In particular, we take the supremum over the set of classifiers belonging to a neighborhood of $f^*$. The size of this neighborhood should depend on how far $\hat{f}$ is from $f^*$ in terms of the difference in loss, $L(D, \hat{f}) - L(D, f^*)$.

To fix notation, let $\hat{\epsilon}_S(f)$ be the empirical error of $f$ on data set $S$, that is,
$$\hat{\epsilon}_S(f) = \frac{1}{\#S}\sum_{(x_i,y_i)\in S} I\{f(x_i) \neq y_i\},$$
so that $\hat{\epsilon}_D(\hat{f})$ is the training error. We have the following result.

CUD Bound. If $\eta_n$ is any positive function of the size $n$ of the training set $D$ and $g(x) = (x+1)\wedge 1$ (the minimum of $x+1$ and $1$), then
$$\big|\hat{\epsilon}_D(\hat{f}) - \epsilon(\hat{f})\big| \le \sup_{f\in\mathcal{G}} \big|\hat{\epsilon}_D(f) - \epsilon(f)\big|\, g\big(\eta_n(L(D,\hat{f}) - L(D,f))\big).$$

Proof. First notice that $g(x) = 1$ for all $x \in \mathbb{R}^+$. Moreover, since $\hat{f} = \arg\min_{f\in\mathcal{G}} L(D,f)$, we have $L(D,\hat{f}) - L(D,f) \le 0$ for every $f\in\mathcal{G}$, so that $g\big(\eta_n(L(D,\hat{f}) - L(D,f))\big) \le 1$, with equality when $f = \hat{f}$. The result follows from noticing
$$\big|\hat{\epsilon}_D(\hat{f}) - \epsilon(\hat{f})\big| = \big|\hat{\epsilon}_D(\hat{f}) - \epsilon(\hat{f})\big|\, g\big(\eta_n(L(D,\hat{f}) - L(D,\hat{f}))\big) \le \sup_{f\in\mathcal{G}} \big|\hat{\epsilon}_D(f) - \epsilon(f)\big|\, g\big(\eta_n(L(D,\hat{f}) - L(D,f))\big),$$
with the last inequality following because $\hat{f} \in \mathcal{G}$.

The intuition here is that we consider a small neighborhood around $f^*$, with $\eta_n$ chosen so that the size of the neighborhood shrinks at the same rate as the difference in loss $L(D,\hat{f}) - L(D,f^*)$ as the size of $D$ tends to infinity. It is the function $g$ which restricts the domain over which the supremum is taken. In particular,
$$\big|\hat{\epsilon}_D(\hat{f}) - \epsilon(\hat{f})\big| \le \sup_{f\in\mathcal{G}} \big|\hat{\epsilon}_D(f) - \epsilon(f)\big|\, g\big(\eta_n(L(D,\hat{f}) - L(D,f))\big) \le \sup_{f\in\mathcal{G}} \big|\hat{\epsilon}_D(f) - \epsilon(f)\big|\, I\big\{L(D,f) \le L(D,\hat{f}) + \eta_n^{-1}\big\} = \sup_{f\in W} \big|\hat{\epsilon}_D(f) - \epsilon(f)\big|,$$
where
$$W = \big\{f \in \mathcal{G} : L(D,f) \le L(D,\hat{f}) + \eta_n^{-1}\big\}.$$
From the preceding set of inequalities we see that $g$ serves as a continuous approximation to the indicator function. The rate $\eta_n$ will depend on the complexity of $\mathcal{G}$ and the training set size $n$. For example, if $\mathcal{G}$ has low complexity, such as a smoothly parameterized class with a finite dimensional parameter, then the rate can be expected to be on the order of $n$.

We now present the main result, which follows immediately from the upper bound.

Proposition 1. Let $\alpha \in (0, 1]$. If $Q_{1-\alpha}$ is the $1-\alpha$ quantile of
$$\sup_{f\in\mathcal{G}} \big|\hat{\epsilon}_D(f) - \epsilon(f)\big|\, g\big(\eta_n(L(D,\hat{f}) - L(D,f))\big),$$
then $P\big(|\hat{\epsilon}_D(\hat{f}) - \epsilon(\hat{f})| \le Q_{1-\alpha}\big) \ge 1-\alpha$.

Thus, the problem is reduced to computing $Q_{1-\alpha}$. This is done using the bootstrap.

3.1 Computing $Q_{1-\alpha}$

For any fixed $\alpha$ we estimate $Q_{1-\alpha}$ with its bootstrap estimate $\hat{Q}_{1-\alpha}$. A bootstrap sample $D^{(b)}$ from $D$ is an iid draw of size $n$ from the distribution which puts equal mass on each point in $D$. The bootstrap estimate is constructed by treating the original sample $D$ as the true population and each bootstrap draw as a random draw of size $n$ from the true population. The mapping between population and bootstrap quantities is straightforward and given in the table below. The reader is referred to [Efron 1994] or [Hall 1992] for details.

    Population                  Bootstrap
    $\epsilon(f)$               $\hat{\epsilon}_D(f)$
    $f^*$                       $\hat{f}$
    $L(D, \cdot)$               $L(D^{(b)}, \cdot)$
    $\hat{\epsilon}_D(f)$       $\hat{\epsilon}_{D^{(b)}}(f)$

Using the above, the procedure to compute $\hat{Q}_{1-\alpha}$ is as follows.

1. Construct $B$ bootstrap datasets $D^{(1)}, \ldots, D^{(B)}$ from $D$.

2. For $b = 1, \ldots, B$ compute
$$Q^{(b)} = \sup_{f\in\mathcal{G}} \big|\hat{\epsilon}_{D^{(b)}}(f) - \hat{\epsilon}_D(f)\big|\, g\big(\eta_n(L(D^{(b)},\hat{f}) - L(D^{(b)},f))\big).$$

3. Set $\hat{Q}_{1-\alpha}$ to the $1-\alpha$ sample quantile of $Q^{(1)}, \ldots, Q^{(B)}$.

We now substitute $\hat{Q}_{1-\alpha}$ into Proposition 1 in place of $Q_{1-\alpha}$ to construct the confidence set.
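A minimal sketch of steps 1-3 follows, under the simplifying assumption that the supremum over $\mathcal{G}$ is replaced by a maximum over a finite, user-supplied list of candidate classifiers (the next sections compute the supremum exactly for parametric additive models). The 0-1 error and surrogate loss are passed in as functions, and the resulting $\hat{Q}_{1-\alpha}$ yields the interval $\hat{\epsilon}_D(\hat{f}) \pm \hat{Q}_{1-\alpha}$.

    import numpy as np

    def g(x):
        """The CUD weight g(x) = (x + 1) ^ 1, i.e. min(x + 1, 1)."""
        return np.minimum(x + 1.0, 1.0)

    def cud_quantile(X, y, candidates, f_hat, err, loss, eta_n,
                     B=500, alpha=0.05, seed=0):
        """Bootstrap estimate of Q_{1-alpha}, with the sup taken over a finite candidate set.

        candidates    : list of classifiers f (a finite stand-in for G)
        f_hat         : the classifier fit to the full training data D
        err(X, y, f)  : 0-1 error of f on a data set
        loss(X, y, f) : surrogate loss of f on a data set
        """
        rng = np.random.default_rng(seed)
        n = len(y)
        err_D = np.array([err(X, y, f) for f in candidates])     # eps_hat_D(f)
        Q = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, n, size=n)                     # bootstrap sample D^(b)
            Xb, yb = X[idx], y[idx]
            err_Db = np.array([err(Xb, yb, f) for f in candidates])
            loss_Db = np.array([loss(Xb, yb, f) for f in candidates])
            # g(eta_n (L(D^(b), f_hat) - L(D^(b), f))) for every candidate f
            weight = g(eta_n * (loss(Xb, yb, f_hat) - loss_Db))
            Q[b] = np.max(np.abs(err_Db - err_D) * weight)
        return np.quantile(Q, 1 - alpha)                         # hat{Q}_{1-alpha}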
Initially, the computation of the sup in step 2 above appears to be computationally intensive. Depending on the form of the classifier space $\mathcal{G}$ and loss function $L(\cdot,\cdot)$, taking this sup may prove to be computationally infeasible. However, when the loss function $L(\cdot,\cdot)$ is convex and the classifier is approximated by a parametric additive model, computation is straightforward by way of a branch and bound algorithm, which we discuss next.

4 An Efficient Algorithm for Parametric Additive Models

In this section we consider the space of classifiers
$$\mathcal{G} = \Big\{f(x) = \mathrm{sign}\Big(\sum_{i=1}^p \beta_i\, b(\theta_i, x)\Big) : \beta_i, \theta_i \in \mathbb{R}\ \forall i\Big\},$$
where the $b(\cdot,\cdot)$ are basis functions indexed by the $\theta_i$'s and the $\beta_i$'s are the basis coefficients. We also assume a convex surrogate loss function $L(D, f)$, where convexity refers to convexity in $\beta = (\beta_1,\ldots,\beta_p)$ but not necessarily in $\theta = (\theta_1,\ldots,\theta_p)$. We assume that given training set $D$ we choose classifier
$$\hat{f}(x) = f(x; \hat{\beta}, \hat{\theta}), \qquad (\hat{\beta}, \hat{\theta}) = \arg\min_{\beta,\theta} L\big(D, f(\cdot;\beta,\theta)\big).$$
We now hold $\hat{\theta}$ fixed and compute the upper bound as a supremum over $\beta$.

CUD Bound (Parametric Additive Models). If $\eta_n$ is any positive function of the size $n$ of the training set $D$ and $g(x) = (x+1)\wedge 1$, then $\big|\hat{\epsilon}_D(f(\cdot;\hat{\beta},\hat{\theta})) - \epsilon(f(\cdot;\hat{\beta},\hat{\theta}))\big|$ is bounded above by
$$\sup_{\beta\in\mathbb{R}^p} \big|\hat{\epsilon}_D(f(\cdot;\beta,\hat{\theta})) - \epsilon(f(\cdot;\beta,\hat{\theta}))\big|\, g\Big(\eta_n\big(L(D, f(\cdot;\hat{\beta},\hat{\theta})) - L(D, f(\cdot;\beta,\hat{\theta}))\big)\Big).$$

The supremum is difficult to calculate because of the term with the absolute values; it is the absolute difference between two sums of indicator functions and is hence both non-smooth and non-convex. We now discuss how to transform this problem into a series of convex optimization problems with linear constraints.
To begin, suppose $D^{(b)}$ is a bootstrap dataset drawn from $D$. For any training point $(x_i, y_i)$ in $D$, let $\lambda_i$ be the number of copies of $(x_i, y_i)$ in $D^{(b)}$. Then $0 \le \lambda_i \le n$ for all $i$, and the distribution of each $\lambda_i$ is binomial with size $n$ and probability $\frac{1}{n}$. Then we can write
$$n\Big(\hat{\epsilon}_{D^{(b)}}\big(f(\cdot;\beta,\hat{\theta})\big) - \hat{\epsilon}_D\big(f(\cdot;\beta,\hat{\theta})\big)\Big) = \sum_{(x_i,y_i)\in D} (\lambda_i - 1)\, I\{f(x_i;\beta,\hat{\theta}) \neq y_i\} = \sum_{(x_i,y_i)\in D^{(b)} - D} (\lambda_i - 1)\, I\{f(x_i;\beta,\hat{\theta}) \neq y_i\}.$$
The above term does not depend on points with $\lambda_i = 1$, that is, those points which were selected exactly once in the bootstrap resampling. We let $D^{(b)} - D$ denote the set of training points $(x_i, y_i)$ that satisfy $\lambda_i \neq 1$. Notice that the number of points in $D^{(b)} - D$ is necessarily smaller than $n$, which translates into computational savings.

To see this, we will need to make use of the following equivalence relation. Given $\beta_1, \beta_2 \in \mathbb{R}^p$ and $S$ a subset of $D$, we say $\beta_1$ is congruent to $\beta_2$ modulo $S$ if $f(x;\beta_1,\hat{\theta}) = f(x;\beta_2,\hat{\theta})$ for all $x$ such that $(x,y)\in S$ for some $y$ (i.e. $\beta_1$ and $\beta_2$ lead to the same classification on $S$). Congruence modulo $S$ defines an equivalence relation and hence partitions $\mathbb{R}^p$. For any fixed bootstrap dataset $D^{(b)}$, let $S$ be the collection of distinct points in $D^{(b)} - D$ and let $M_1,\ldots,M_R$ be the resulting equivalence classes. Then, for any equivalence class $M_i$,
$$\big|\hat{\epsilon}_{D^{(b)}}\big(f(\cdot;\beta,\hat{\theta})\big) - \hat{\epsilon}_D\big(f(\cdot;\beta,\hat{\theta})\big)\big| = C(M_i) \quad \forall \beta \in M_i,$$
where $C(M_i)$ is a constant. Then, referring back to step 2 of the bootstrap procedure outlined in Section 3, we can write
$$Q^{(b)} = \sup_{\beta\in\mathbb{R}^p} \big|\hat{\epsilon}_{D^{(b)}}(f(\cdot;\beta,\hat{\theta})) - \hat{\epsilon}_D(f(\cdot;\beta,\hat{\theta}))\big| \times g\Big(\eta_n\big(L(D^{(b)}, f(\cdot;\hat{\beta},\hat{\theta})) - L(D^{(b)}, f(\cdot;\beta,\hat{\theta}))\big)\Big)$$
$$= \sup_i\, C(M_i) \times \sup_{\beta\in M_i} g\Big(\eta_n\big(L(D^{(b)}, f(\cdot;\hat{\beta},\hat{\theta})) - L(D^{(b)}, f(\cdot;\beta,\hat{\theta}))\big)\Big)$$
$$= \sup_i\, C(M_i) \times g\Big(\eta_n\big(L(D^{(b)}, f(\cdot;\hat{\beta},\hat{\theta})) - \inf_{\beta\in M_i} L(D^{(b)}, f(\cdot;\beta,\hat{\theta}))\big)\Big),$$
with the last equality following from the monotonicity of $g$. Since $L$ is convex in $\beta$, this would be a series of convex optimization problems if membership in $M_i$ could be expressed as a set of convex constraints. We now show how this can be done. Let $\beta_{M_i}$ be a representative member of equivalence class $M_i$. Then for any other $\beta \in M_i$ we must have
$$f(x_j;\beta_{M_i},\hat{\theta}) \sum_{k=1}^p \beta_k\, b(\hat{\theta}_k, x_j) \ge 0$$
for all $x_j$ such that $(x_j, y) \in D^{(b)} - D$ for some $y$. Since $f(x_j;\beta_{M_i},\hat{\theta})$ does not depend on $\beta$, we see that membership in $M_i$ is equivalent to satisfying a series of linear constraints, which is clearly convex.

We have shown that calculation of $Q^{(b)}$ amounts to a series of $R$ convex optimization problems. In practice, even though $n$ is small (e.g. $n \le 50$), $R$ can be very large (on the order of $2^n$ in the worst case), making direct computation of all $R$ convex optimization problems infeasible. Fortunately, this exhaustive computation can be avoided using a branch and bound algorithm.

To use branch and bound we must recursively partition the search space. To do this we arbitrarily label the feature vectors in $D^{(b)} - D$ by $x_1,\ldots,x_d$. The root of the tree represents all of $\mathbb{R}^p$. The first left child represents the set of all $\beta\in\mathbb{R}^p$ such that $\sum_{k=1}^p \beta_k b(\hat{\theta}_k, x_1) \ge 0$, while the first right child represents all $\beta\in\mathbb{R}^p$ such that $\sum_{k=1}^p \beta_k b(\hat{\theta}_k, x_1) < 0$. In general, if $S$ is a region defined by a node at level $j$ of the tree, then its left child represents the subspace of $S$ which satisfies $\beta\in S$ and $\sum_{k=1}^p \beta_k b(\hat{\theta}_k, x_{j+1}) \ge 0$; similarly, its right child is the subspace of $S$ which satisfies $\beta\in S$ and $\sum_{k=1}^p \beta_k b(\hat{\theta}_k, x_{j+1}) < 0$. Notice that the terminal nodes of this tree are either infeasible or one of the equivalence classes $M_1,\ldots,M_R$.

To complete the branch and bound algorithm we need to define upper and lower bounds on the value of the objective function
$$O(\beta) = \big|\hat{\epsilon}_{D^{(b)}}(f(\cdot;\beta,\hat{\theta})) - \hat{\epsilon}_D(f(\cdot;\beta,\hat{\theta}))\big| \times g\Big(\eta_n\big(L(D^{(b)}, f(\cdot;\hat{\beta},\hat{\theta})) - L(D^{(b)}, f(\cdot;\beta,\hat{\theta}))\big)\Big).$$
In particular, given a region $S$ of $\mathbb{R}^p$ defined by a node on the tree representing our partition, we require an upper bound $U(S)$ on $O(\beta)$,
$$\sup_{\beta\in S} O(\beta) \le U(S).$$
The upper bound $U(S)$ is constructed by bounding the two terms in $O(\beta)$ separately. An upper bound on $\big|\hat{\epsilon}_{D^{(b)}}(f(\cdot;\beta,\hat{\theta})) - \hat{\epsilon}_D(f(\cdot;\beta,\hat{\theta}))\big|$ can be obtained by noticing that if $S$ is a region of $\mathbb{R}^p$ defined by a node at level $j$, then the classification of the first $j$ points in $D^{(b)} - D$ is fixed on $S$. The upper bound is constructed by assuming the worst performance possible on the remaining $d - j$ points. To bound the second term, we compute
$$\sup_{\beta\in S} g\Big(\eta_n\big(L(D^{(b)}, f(\cdot;\hat{\beta},\hat{\theta})) - L(D^{(b)}, f(\cdot;\beta,\hat{\theta}))\big)\Big),$$
which is a convex optimization problem. The upper bound $U(S)$ is the product of these two upper bounds. In addition, we require the following lower bound,
$$L(S) \le \sup_{\beta\in S} O(\beta).$$
The lower bound $L(S)$ is obtained by plugging any feasible point in $S$ into $O(\beta)$. In practice, a natural choice is the arg sup of the second term in the objective function, which has already been computed during the construction of $U(S)$.

This algorithm, running on a standard desktop with a sample size of $n = 50$ and using a total of 500 bootstrap samples to construct a confidence set, runs in only a few minutes. While this is still an NP-hard problem,² in practice it is significantly less computationally intensive than evaluating all possible classifications. The reason is that the function $g$ allows for a significant reduction of the search space by restricting attention only to classifiers within a fixed distance of the selected classifier $\hat{f}$. To see this, notice that in the construction of $U(S)$, if the term
$$\sup_{\beta\in S} g\Big(\eta_n\big(L(D^{(b)}, f(\cdot;\hat{\beta},\hat{\theta})) - L(D^{(b)}, f(\cdot;\beta,\hat{\theta}))\big)\Big)$$
is non-positive, we can remove the region $S$ from our search space since we know $O(\hat{\beta}) \ge 0$.

² Update: In the case of linear classification with squared error loss, current work in progress has proved this runs in polynomial time, on the order of $O(n^{VC(\mathcal{G})})$.
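The following compressed sketch shows the structure of the recursion for a purely linear classifier (i.e. $b(\theta_k, x) = x_k$) with squared error loss. It relaxes the first factor exactly as described above, uses a generic constrained optimizer in place of a dedicated quadratic-programming solver for the second factor, and applies both pruning rules (a non-positive $g$ term, and an upper bound below the current incumbent). It is meant to illustrate the algorithm's structure, not to reproduce the authors' implementation.

    import numpy as np
    from scipy.optimize import minimize

    def g(x):
        """The CUD weight g(x) = min(x + 1, 1)."""
        return min(x + 1.0, 1.0)

    def sq_loss(beta, X, y):
        """Squared-error surrogate loss of the linear classifier on a point set."""
        return np.mean((y - X @ beta) ** 2)

    def min_loss_on_region(Xb, yb, signs, Xd, p):
        """min_beta L(D^(b), beta) subject to signs[i] * (beta . x_i) >= 0 for fixed points."""
        cons = [{"type": "ineq", "fun": (lambda b, i=i: signs[i] * (Xd[i] @ b))}
                for i in range(len(signs))]
        res = minimize(sq_loss, x0=np.zeros(p), args=(Xb, yb),
                       method="SLSQP", constraints=cons)
        # Treat failed or infeasible subproblems as prunable regions.
        return res.fun if res.success else np.inf

    def q_b(Xb, yb, Xd, yd, w, beta_hat, eta_n, n):
        """Branch-and-bound sketch of Q^(b) for a linear classifier f(x) = sign(beta . x).

        Xb, yb : the bootstrap sample D^(b)
        Xd, yd : the distinct points of D^(b) - D, with weights w_i = lambda_i - 1
        """
        w = np.asarray(w, dtype=float)
        p, d = Xb.shape[1], len(yd)
        L_hat = sq_loss(beta_hat, Xb, yb)   # L(D^(b), f(.; beta_hat, theta_hat))
        best = 0.0                          # valid incumbent, since O(beta_hat) >= 0

        def recurse(j, signs, fixed):
            nonlocal best
            # First factor: worst case over the d - j points whose classification is not yet fixed.
            rest = w[j:]
            u1 = max(fixed + rest[rest > 0].sum(), -(fixed + rest[rest < 0].sum())) / n
            # Second factor: exact sup of the g term over the region, using convexity of L in beta.
            u2 = g(eta_n * (L_hat - min_loss_on_region(Xb, yb, signs, Xd, p)))
            if u2 <= 0 or u1 * u2 <= best:
                return                      # prune: region cannot improve the incumbent
            if j == d:
                best = max(best, abs(fixed) / n * u2)
                return
            for s in (+1, -1):              # branch on the classification of point j
                err = float(s != yd[j])     # 1 if this branch misclassifies y_j
                recurse(j + 1, signs + [s], fixed + w[j] * err)

        recurse(0, [], 0.0)
        return best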
5 Experiments

In this section we describe a set of experiments designed to test the coverage and diameter of confidence sets constructed using the CUD bound. To form a baseline for comparison we also construct confidence sets using a handful of methods found in the literature. These consist of two non-parametric bootstrap methods, a normal approximation using the bootstrap [Kohavi 1995], a normal approximation using CV [Yang 2006], a generalization of a Bayesian approach [Martin 1995], and the inverted binomial [Langford 2005].³ An online supplemental appendix provides a full description of these methods and the simulation study, and provides links to the source code (www.stat.lsa.umich.edu/laber/UAI2008.html).

³ In this approach we are plugging in a resampled estimate and the effective sample size. That is, we are acting as if we had a test set.

The confidence sets were constructed for sample sizes of n = 30 and n = 50 using the datasets given in Table 1. All the data sets have binary labels and continuous features. The real datasets can be found at the UCI data repository (www.ics.uci.edu/mlearn). The simulated data sets were designed to investigate the performance of the new procedure in several scenarios of interest.

    Dataset                       Motivation
    Spam                          Not simulated
    Ionosphere                    Not simulated
    Heart                         Not simulated
    Diabetes                      Not simulated
    Abalone                       Not simulated
    Liver                         Not simulated
    Mammogram                     Not simulated
    Magic                         Not simulated
    Donut (simulated)             $f^*$ close to Bayes
    Outlier (simulated)           $f^*$ far from Bayes
    σ²Small (simulated)           $f^* = 0$, low noise, far from Bayes
    Three Points (simulated)      $\hat{\epsilon}(\hat{f})$ highly non-normal, $f^*$ far from Bayes

Table 1: Datasets used in the experiments along with the reason for their inclusion. Here $f^*$ is the limiting value of the classifier as the amount of training data becomes infinite.

For this series of experiments we fit a linear classifier using squared error loss. That is,
$$\mathcal{G} = \Big\{f(x) = \mathrm{sign}\Big(\sum_{i=1}^p \beta_i\, \phi_i(x)\Big) : \beta_i \in \mathbb{R}\Big\},$$
where the $\phi_i$ are transformations of the features (in all cases $\phi_i$ was either a polynomial or a projection onto a number of principal components). The surrogate loss function is given by
$$L\big(D, f(\cdot;\beta)\big) = \sum_{(x_i,y_i)\in D}\Big(\sum_{j=1}^p \beta_j \phi_j(x_i) - y_i\Big)^2,$$
and the scaling factor $\eta_n$ used in $g$ is chosen to be $n$.
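A minimal sketch of the classifier used in the experiments: the squared-error surrogate is minimized exactly by ordinary least squares on the basis expansion $\phi$, which is supplied here as a feature-map function (an assumption about the interface, not something specified in the paper).

    import numpy as np

    def fit_linear_classifier(X, y, phi=lambda X: X):
        """Minimize the squared-error surrogate sum_i (phi(x_i) . beta - y_i)^2 by least squares."""
        Phi = phi(X)                                        # n x p basis expansion
        beta_hat, *_ = np.linalg.lstsq(Phi, y.astype(float), rcond=None)
        return beta_hat

    def predict(beta, X, phi=lambda X: X):
        """Classify into {-1, +1} by the sign of the fitted linear score."""
        return np.where(phi(X) @ beta >= 0, 1, -1)

    def training_error(beta, X, y, phi=lambda X: X):
        """The 0-1 training error, the center of the CUD-bound interval."""
        return np.mean(predict(beta, X, phi) != y)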
For each data set the following procedure was used to estimate coverage probabilities.

1. Randomly divide the data into training set D and testing set T.
   · If the data set is real, randomly select n training points for D and use the remainder for T.
   · If the data set is simulated, generate n data points for D and generate 5000 data points for T.
2. Build the classifier $\hat{f} = \arg\min_{f\in\mathcal{G}} L(D, f)$ using training data D.
3. For each procedure, use the training data D and chosen classifier $\hat{f}$ to construct a confidence set with target coverage .950. The confidence interval is centered at $\hat{\epsilon}_D(\hat{f})$, taken to be the .632 estimate [Efron 1983] in the competing methods and the training error in the CUD bound.⁴
4. Using $\epsilon(\hat{f}) = \hat{\epsilon}_T(\hat{f})$, we record the coverage for each procedure.

⁴ The competing methods experience a significant reduction in performance if they are centered at the training error.

The above steps are repeated 1000 times for each data set and the average coverage computed. The results are shown in Tables 2 and 3 with the following abbreviations: quantile bootstrap (BS1) [Efron 1994], corrected bootstrap (BS2) [Efron 1994], Yang's CV estimate (Y) [Yang 2006], Kohavi's normal approximation (K) [Kohavi 1995], a generalization of a Bayesian approach from Martin (M) [Martin 1996], and the inverted binomial of Langford (L) [Langford 2005]. The results for the n = 50 case are similar and are given in Tables 4 and 5 of the appendix.

    n = 30
    Data         CUD    BS1    BS2    K      M      Y      L
    Spam         1.00   .478   .983   .782   .632   .996   .636
    Donut        .999   .880   .908   .631   .633   .937   .620
    Ionosphere   .998   .605   .926   .816   .757   .994   .747
    Heart        .999   .406   .981   .718   .475   .998   .458
    Diabetes     1.00   .654   .900   .912   .986   .997   .984
    Outlier      .988   .884   .736   .808   .838   .910   .961
    3 Pt.        .993   .827   .717   .844   .896   .753   .952
    Abalone      .983   .963   .737   .972   .991   .998   .997
    Liver        .961   .964   .878   .980   1.00   1.00   1.00
    Magic        .998   .918   .920   .961   .982   .991   .992
    σ²Small      .985   .958   .786   .961   .982   .996   1.00
    Mammogram    1.00   .678   .994   .646   .426   .983   .406

Table 2: Average coverage of the new method and competitors. Target coverage level was .950; those meeting or exceeding this level are highlighted in orange.

    n = 30
    Data         CUD    BS1    BS2    K      M      Y      L
    Spam         .469   .432   .432   .315   .314   .451   .354
    Donut        .485   .590   .590   .324   .324   .414   .364
    Ionosphere   .437   .429   .429   .301   .296   .501   .337
    Heart        .459   .468   .468   .319   .319   .433   .359
    Diabetes     .433   .305   .305   .307   .310   .312   .350
    Outlier      .555   .491   .491   .328   .329   .456   .368
    3 Pt.        .508   .476   .476   .318   .317   .457   .356
    Abalone      .617   .314   .314   .331   .331   .504   .371
    Liver        .559   .372   .372   .327   .326   .485   .366
    Magic        .606   .307   .307   .300   .283   .464   .324
    σ²Small      .595   .330   .330   .329   .330   .501   .351
    Mammogram    .464   .532   .532   .321   .321   .421   .361

Table 3: Average diameter of confidence sets constructed using the new method and competitors. The method with the smallest diameter and having correct coverage is highlighted in yellow.

The results in Tables 2 and 3 are promising for the new method. It was the only method to provide the desired coverage on all of the data sets and, in addition, possessed the smallest diameter for two of them. The CUD bound confidence set is conservative, as must be expected by its construction. However, the diameter of the new confidence set is far from trivial, even for this extremely small sample size. It is also of interest to learn when competing methods perform poorly.
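Before turning to that question, here is a skeleton of the coverage experiment described in steps 1-4 above, written for a single data set and a single interval-constructing method supplied as callables. Data loading, the .632 centering used by the competing methods, and the competing methods themselves are omitted.

    import numpy as np

    def estimate_coverage(X, y, fit, train_error, make_interval,
                          n_train=30, reps=1000, seed=0):
        """Monte Carlo coverage estimate for one data set and one interval method.

        fit           : fit(X, y) -> classifier f, callable returning labels in {-1, +1}
        train_error   : train_error(f, X, y) -> center of the interval
        make_interval : make_interval(f, X, y, center) -> (lo, hi), target coverage .95
        """
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(reps):
            # Step 1: random split into a small training set D and a test set T.
            perm = rng.permutation(len(y))
            tr, te = perm[:n_train], perm[n_train:]
            # Step 2: build the classifier on D only.
            f = fit(X[tr], y[tr])
            # Step 3: construct the confidence interval from D alone.
            center = train_error(f, X[tr], y[tr])
            lo, hi = make_interval(f, X[tr], y[tr], center)
            # Step 4: treat the held-out error as epsilon(f_hat) and record coverage.
            eps = np.mean(f(X[te]) != y[te])
            hits += int(lo <= eps <= hi)
        return hits / reps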
Figure 1 plots the empirical coverage of the top four methods against the mean absolute value of the training error's kurtosis (the standardized central fourth moment minus 3) for each of the data sets listed in Table 1. The kurtosis of a normal distribution is 0, and we use absolute kurtosis as a measure of deviation from normality. Figure 1 shows a trend of decreasing coverage as deviation from normality increases for the three distribution-based methods. This suggests these methods may not be robust to non-normal training error. In particular, we see that method K (which has the strongest normality assumptions) is the most sensitive to deviation from normality. In contrast, the CUD bound method has stable, if conservative, coverage, regardless of the normality of the training error.

[Figure 1: Coverage of the top four methods (New Method, Y, M, K) by absolute value of the kurtosis of the training error; a horizontal line marks the target coverage of .950.]

6 Conclusions and Future Work

In this paper we introduced a new method for constructing confidence sets for the generalization error in the small sample setting in classification. This bound is derived under the minimal assumption of a training set of iid draws from some unknown distribution F. In addition, we provided a computationally efficient algorithm for constructing this new confidence set when the classifier is approximated by a parametric additive model.
In preliminary experiments, the new confidence set exhibited superior coverage while maintaining a small diameter on a catalog of test and simulated data sets. Moreover, it was demonstrated that confidence sets based on distributional assumptions may not be robust to deviations from normality in the training samples.

Much work remains to be done. First, we need to provide theoretical guarantees for the level of confidence. We conjecture that a correct choice of $\eta_n$ will depend on the complexity of the approximation space used to construct the classifier. It is well known that bootstrap estimators perform better when estimating smooth functions. Selecting $\eta_n = \infty$ is similar to the standard quantile bootstrap (BS1), which is ineffective in the classification setting because it bootstraps the non-smooth 0-1 loss; this suggests an upper bound on $\eta_n$. Conversely, a smaller choice of $\eta_n$ leads to a bootstrapped estimate of a smoother quantity but also introduces conservatism. For example, setting $\eta_n = 0$ is equivalent to a uniform deviation bound on the training error, which can often be trivially loose; this suggests a lower bound on $\eta_n$ as well. Thus, $\eta_n$ must strike a balance between the smoothness required by the bootstrap and conservatism.

References

[1] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. J. Amer. Statist. Assoc., 101(473):138-156, Mar 2006.
[2] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463-482, 2002.
[3] Ulisses Braga-Neto and Edward Dougherty. Bolstered error estimation. Pattern Recognition, 37:1267-1281, 2004.
[4] B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc., 78(382):316-331, 1983.
[5] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, May 1994.
[6] B. Efron and R. Tibshirani. Cross-validation and the bootstrap: Estimating the error of a prediction rule. Technical report, 1995.
[7] Peter Hall. The Bootstrap and Edgeworth Expansion. Springer Series in Statistics. Springer, New York, 1992.
[8] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, pages 1137-1145, 1995.
[9] J. Langford. Combining train set and test set bounds. ICML, 2002.
[10] J. Langford. Tutorial on practical prediction theory for classification. J. Machine Learning Research, 6:273-306, 2005.
[11] J. Kent Martin and Daniel S. Hirschberg. Small sample statistics for classification error rates II: Confidence intervals and significance tests. Technical Report ICS-TR-96-22, 1996.
[12] Rosa A. Schiavo and David J. Hand. Ten more years of error rate research. International Statistical Review, 68(3):295-310, 2000.
[13] S. M. Weiss. Small sample error rate estimation for k-NN classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):285-289, Mar 1991.
[14] Yuhong Yang. Comparing learning methods for classification. Statistica Sinica, 16(2):635-657, 2006.
[15] Xi Zhang and Kang G. Shin. Statistical analysis of feedback-synchronization signaling delay for multicast flow control. In INFOCOM, pages 1152-1161, 2001.

Appendix: Experimental Results for n = 50

    n = 50
    Data         CUD    BS1    BS2    K      M      Y      L
    Spam         1.00   .172   .987   .415   .632   .996   .636
    Donut        .998   .837   .904   .450   .633   .882   .620
    Ionosphere   1.00   .292   .975   .575   .757   .985   .747
    Heart        .999   .072   .999   .186   .475   .998   .458
    Diabetes     .999   .654   .900   .912   .986   .984   .984
    Outlier      .984   .805   .893   .702   .838   .933   .961
    3 Pt.        .972   .800   .812   .648   .896   .624   .952
    Abalone      .988   .960   .832   .962   .991   .999   .997
    Liver        .986   .888   .888   .974   1.00   .988   1.00
    Magic        1.00   .824   .823   .950   .982   .998   .992
    σ²Small      .999   .955   .791   .960   .982   .996   1.00
    Mammogram    .999   .317   .999   .146   .426   .991   .406

Table 4: Average coverage of the new method and competitors. Target coverage level was .950; those meeting or exceeding this level are highlighted in orange.

    n = 50
    Data         CUD    BS1    BS2    K      M      Y      L
    Spam         .418   .377   .377   .250   .314   .329   .354
    Donut        .448   .529   .529   .258   .324   .331   .364
    Ionosphere   .386   .364   .364   .237   .296   .315   .337
    Heart        .409   .436   .436   .254   .319   .443   .359
    Diabetes     .433   .365   .365   .247   .310   .358   .350
    Outlier      .442   .439   .439   .262   .329   .328   .368
    3 Pt.        .440   .413   .413   .253   .317   .357   .356
    Abalone      .523   .248   .248   .264   .331   .391   .371
    Liver        .499   .304   .304   .260   .326   .382   .366
    Magic        .561   .209   .209   .227   .283   .350   .324
    σ²Small      .401   .265   .265   .236   .330   .389   .351
    Mammogram    .419   .511   .511   .255   .321   .320   .361

Table 5: Average diameter of confidence sets constructed using the new method and competitors. The method with the smallest diameter and having correct coverage is highlighted in yellow.