An Analysis of the Coupling between Training Set and Neighborhood Sizes for the kNN Classifier

J. Scott Olsson
Dept. of Math., University of Maryland, College Park
olsson@math.umd.edu

ABSTRACT
We consider the relationship between training set size and the parameter k for the k-Nearest Neighbors (kNN) classifier. When few examples are available, we observe that accuracy is sensitive to k and that the best k tends to increase with training size. We explore the subsequent risk that a k tuned on partitions will be suboptimal after aggregation and re-training. This risk is found to be most severe when little data is available. For larger training sizes, accuracy becomes increasingly stable with respect to k and the risk decreases.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Miscellaneous
General Terms: Experimentation, Measurement
Keywords: text classification, k-Nearest Neighbors, parameter tuning, parameter stability

[Figure 1: (a) R-precision vs. k for several example training set sizes; sizes are noted on curves. (b) Best R-precision and (c) best k from Setting II trials.]

1. INTRODUCTION
Before applying text classification to real world problems, practitioners generally carve up the available labeled data to evaluate the system's accuracy and to tune classification parameters. At the same time, it is widely believed that the most consistent way to improve performance in general is simply to add more training data. Thus, real world systems evaluated and tuned on partitioned data typically aggregate all the available data and retrain before application. This calls into question the stability of the classification parameters being extrapolated onto the post-aggregation problem.

We consider the parameter k in the well known kNN classifier [2], where k is the number of training examples (neighbors) used to determine a test document's labels. kNN is conceptually simple, scales well [3], and has performed strongly on several well studied test corpora [1], [2]. Our aim is to understand whether the best choice of k depends on the training size, where we consider the optimal choice to be the k which produces the largest R-precision. R-precision is a well understood metric from the ranked retrieval community which, for the classification problem, can be seen as a measure of the utility of a ranked list of hypothesized labels for a user. Formally, R-precision is the average proportion of labels correctly assigned to a document, where a document with R true labels has R labels hypothesized by the classifier.

2. EXPERIMENTS
Our implementation of kNN uses symmetric Okapi term weighting, $w(tf) = \frac{tf}{0.5 + 1.5\,(dl/avdl) + tf}$, where $w(tf)$ is the computed term weight, $tf$ is the term frequency of a word in a document, $dl$ is the length of the document the term appears in, and $avdl$ is the average document length. Term weights are multiplied by their inverse document frequency, $idf(t) = \log\frac{N - df(t) + 0.5}{df(t) + 0.5}$, where $df(t)$ is the document frequency of term $t$ and $N$ is the total number of training documents. The label scores contributed by the k nearest neighbors are weighted by the inner product of the test and neighbor document vectors.
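To make this weighting and scoring scheme concrete, the sketch below shows one way it could be implemented. It is a minimal illustration rather than the implementation used in these experiments: documents are assumed to be sparse dictionaries mapping terms to weights, and the function names (idf_table, okapi_vector, knn_label_scores) are introduced here only for exposition.

```python
import math
from collections import Counter, defaultdict

def idf_table(train_term_lists, n_docs):
    """Inverse document frequency: idf(t) = log((N - df(t) + 0.5) / (df(t) + 0.5))."""
    df = Counter(t for terms in train_term_lists for t in set(terms))
    return {t: math.log((n_docs - d + 0.5) / (d + 0.5)) for t, d in df.items()}

def okapi_vector(term_freqs, dl, avdl, idf):
    """Symmetric Okapi weight w(tf) = tf / (0.5 + 1.5*(dl/avdl) + tf),
    multiplied by the term's idf, for every term in one document."""
    return {t: (tf / (0.5 + 1.5 * (dl / avdl) + tf)) * idf.get(t, 0.0)
            for t, tf in term_freqs.items()}

def knn_label_scores(test_vec, train_vecs, train_labels, k):
    """Rank labels for a test document: take the k training vectors with the
    largest inner product against the test vector, and let each neighbor add
    that inner product to the score of every label it carries."""
    sims = sorted(
        ((sum(w * vec.get(t, 0.0) for t, w in test_vec.items()), i)
         for i, vec in enumerate(train_vecs)),
        reverse=True,
    )
    scores = defaultdict(float)
    for sim, i in sims[:k]:
        for label in train_labels[i]:
            scores[label] += sim
    return sorted(scores.items(), key=lambda item: -item[1])  # ranked label list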
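The ranked label lists produced this way are scored with R-precision as defined in the introduction: for a document with R true labels, the top R hypothesized labels are taken and the proportion of correct labels is averaged over the test documents. A small sketch under that reading, again with hypothetical helper names:

```python
def r_precision(ranked_labels, true_labels):
    """For a document with R true labels, hypothesize the top R ranked labels
    and return the proportion of them that are correct."""
    r = len(true_labels)
    if r == 0:
        return 0.0
    hypothesized = {label for label, _score in ranked_labels[:r]}
    return len(hypothesized & set(true_labels)) / r

def mean_r_precision(ranked_lists, true_label_sets):
    """Average R-precision over the test documents of one trial."""
    values = [r_precision(r, t) for r, t in zip(ranked_lists, true_label_sets)]
    return sum(values) / len(values)
```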
Our data consists of 200K documents from the RCV1-v2 newswire corpus [1], which we examine in two distinct settings. Setting I considers 500 trials at each of the fixed sizes of 100, 500, 1K, 10K, and 50K randomly sampled training documents. This narrow view allows us to examine variance in the best k. In Setting II, for each of 1500 trials, we sample D training documents, where D is a random variable uniformly distributed between 1 and 100K. This broad view aims to reveal trends over a range of set sizes. In both settings, 1K disjoint testing documents are sampled per trial.

For each trial, we search for the optimal k: beginning at k = 1, we increment k by one until fifteen consecutive increments have failed to improve upon the best R-precision yet seen. This approach requires the R-precision vs. k curve to be somewhat smooth, which we experimentally validated and illustrate in Figure 1a. R-precision tends to monotonically increase, peak, and then decrease as k increases, a trend exhibited regardless of the training set size. Our requirement that fifteen consecutive increments of k fail to improve the R-precision additionally mitigates the risk that a particular curve will peak in R-precision a second time. We also observe from Figure 1a that the curves become increasingly flat as set size increases; that is, for larger training sets, the risk of poorly choosing k decreases.
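One way to read this stopping rule is sketched below: k is walked upward from 1 and the search halts after fifteen consecutive increments fail to improve the best R-precision seen so far. The evaluate(k) callable is a hypothetical stand-in for classifying the 1K test documents of a trial with a given k and returning the resulting R-precision.

```python
def find_best_k(evaluate, patience=15):
    """Search for the optimal k: start at k = 1 and increment by one until
    `patience` consecutive increments fail to improve the best R-precision.

    `evaluate(k)` is a placeholder for running the kNN classifier over the
    trial's test pool with neighborhood size k and returning its R-precision.
    """
    best_k, best_rp = 1, evaluate(1)
    k, failures = 1, 0
    while failures < patience:
        k += 1
        rp = evaluate(k)
        if rp > best_rp:
            best_k, best_rp, failures = k, rp, 0
        else:
            failures += 1
    return best_k, best_rp
```

Because the curves in Figure 1a are largely unimodal, such a rule terminates shortly after the peak while guarding against a second, spurious peak.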
Figure 1b plots the peak R-precision obtained for each Setting II trial vs. training set size. Observe that, after roughly 1K training documents, increasing R-precision by 10% requires that the training set be enlarged by a factor of nearly 100. The need for so much data to improve R-precision motivates the common practice of aggregating all available data and retraining before the classifier is applied. R-precision increases faster below 1K training documents. At the same time, with fewer documents, the risk of poorly choosing k is increased.

[Figure 2: Setting I trials. Top and side boxplots show R-precision and k for each training size. Boxplot notches show 95% CIs for the mean.]

Figure 2 plots best k vs. R-precision for each of the Setting I trials. Trials with equal amounts of training data form clusters, and, unsurprisingly, the average best R-precision improves (with statistical significance) for each increase in training set size. Note that the mean optimal k, as well as the variance in optimal k, also increases at each step from 100 to 10K training documents, before dipping again at 50K.

Suppose only 1K labeled documents were available for an evaluation. Figure 2 roughly tells us that if we partition those 1K documents into two equally sized training and testing pools, search for an optimal k using the 500 training documents, and then aggregate all 1K documents for a real problem using that same k, this k will typically be much smaller than the true optimal k for the aggregated training set. And as we saw in Figure 1a, small deviations from the optimal k can result in large deviations from the best R-precision on small training set problems. Unfortunately, this deficiency will go undetected since, in such a case, we would have no labeled data left for further evaluation.

We suspect the optimal k at first increases because, for very little training data, a large k means a large proportion of the training documents will be used for labeling (i.e., the smoothing will be very aggressive); accordingly, the best k (and the variance in the best k) must be small. As the training set size increases initially, this small data problem is relaxed, and the best k tends to increase, dependent on some other property of the training data (e.g., perhaps the separability of documents having distinct labels). Variance in the best k likely also increases because, as seen in Figure 1a, near-peak R-precision is sustained over a broader range of k as training sizes increase.

On the other hand, if our training space were densely populated by example documents, we might expect the optimal k to be roughly k = 1. That is, if each test document had an identical document in training, that one nearest document in the training space would presumably have the appropriate labels attached to it (inconsistencies in human judgments might increase this limiting k slightly). However, because the number of documents needed to densely fill the space grows exponentially in the number of features (i.e., dimensions), and because we would expect the document space to be "semantically anisotropic" (i.e., more "meaning distance" is traversed along some dimensions than others), this theoretical behavior of the best k will never be observed in real, high-dimensional problems.

To determine the extent to which k may continue to depend on training set size for larger set sizes, we considered trials from Setting II, in which up to 100K training documents per trial were investigated. As before, we found that the best k increases with training size for smaller set sizes. For larger set sizes, however, k and training set size appear to be completely uncorrelated. It is therefore unclear whether aggregating large amounts of training and testing data poses a risk due to the instability of the best k (it is possible that the pre-aggregation training data sufficiently affects the aggregated statistics to prevent the optimal k from drifting far). Future work could include experimental trials of this scenario. At the least, it is clear that the best k could only remain fixed through aggregation if the as yet unknown properties which determine the best k themselves remain mostly unchanged. This suggests future work might explore the effects of partitioning strategies on parameter stability.

3. CONCLUSION
We have seen that we must be cautious in assuming that classification parameters tuned on a partition of the available data will remain optimal after aggregation. The choice of k can depend significantly on training size, particularly for problems with little available data. When more training data is used, it remains unclear what principally determines the optimal choice of k, although we have seen that this is generally of little concern, since R-precision becomes increasingly stable with respect to k as training sizes increase.

4. ACKNOWLEDGMENTS
The author would like to thank Douglas W. Oard for his helpful feedback. This work was funded in part by NSF IIS-0122466 (MALACH).

5. REFERENCES
[1] D. D. Lewis, et al. RCV1: A New Benchmark Collection for Text Categorization Research. J. Mach. Learn. Res., 5, 2004.
[2] Y. Yang. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, 1, 1999.
[3] Y. Yang, et al. A Scalability Analysis of Classifiers in Text Categorization. In SIGIR, 2003.