Nearest-Neighbor-Based Active Learning for Rare Category Detection

Jingrui He
School of Computer Science, Carnegie Mellon University
jingruih@cs.cmu.edu

Jaime Carbonell
School of Computer Science, Carnegie Mellon University
jgc@cs.cmu.edu

Abstract

Rare category detection is an open challenge for active learning, especially in the de novo case (no labeled examples), but of significant practical importance for data mining - e.g. detecting new financial transaction fraud patterns, where normal legitimate transactions dominate. This paper develops a new method for detecting an instance of each minority class via an unsupervised local-density-differential sampling strategy. Essentially, a variable-scale nearest neighbor process is used to optimize the probability of sampling tightly-grouped minority classes, subject to a local smoothness assumption on the majority class. Results on both synthetic and real data sets are very positive, detecting each minority class with only a fraction of the actively sampled points required by random sampling and by Pelleg's Interleave method, the prior best technique in the sparse literature on this topic.

1 Introduction

In many real world problems, the proportion of data points in different classes is highly skewed: some classes dominate the data set (majority classes), and the remaining classes may have only a few examples (minority classes). However, it is very important to detect examples from the minority classes via active learning. For example, in fraud detection tasks, most of the records correspond to normal transactions, and yet once we identify a new type of fraud transaction, we are well on our way to stopping similar future fraud transactions [2]. Another example is in astronomy: most of the objects in sky survey images are explainable by current theories and models, and only 0.001% of the objects are truly beyond the scope of current science and may lead to new discoveries [8]. Rare category detection is also a bottleneck in reducing the sampling complexity of active learning [1, 5]. The difference between rare category detection and outlier detection is that in rare category detection the examples from one or more minority classes are often self-similar, potentially forming compact clusters, whereas in outlier detection the outliers are typically scattered.

Currently, only a few methods have been proposed to address this challenge. For example, in [8], the authors assumed a mixture model to fit the data, and selected examples for labeling according to different criteria; in [6], the authors proposed a generic consistency algorithm, and proved upper and lower bounds for this algorithm in some specific situations. Most of the existing methods require that the majority classes and the minority classes be separable, or work best in the separable case. However, in real applications, the support regions of the majority and minority classes often overlap, which negatively affects the performance of these methods.

In this paper, we propose a novel method for rare category detection in the context of active learning. We typically start de novo, with no category labels, though our algorithm makes no such assumption. Different from existing methods, we aim to solve the hard case, i.e. we do not assume separability or near-separability of the classes. Intuitively, the method makes use of nearest neighbors to measure local density around each example.
In each iteration, the algorithm selects an example with the maximum change in local density on a certain scale, and asks the oracle for its label. The method stops once it has found at least one example from each class (given knowledge of the number of classes). When the minority classes form compact clusters and the majority class distribution is locally smooth, the method will select examples both on the boundary and in the interior of the minority classes, and it is proved to be effective theoretically. Experimental results on both synthetic and real data sets show the superiority of our method over existing methods.

The rest of the paper is organized as follows. In Section 2, we introduce our method and provide theoretical justification, first for binary classes and then for multiple classes. Section 3 gives experimental results. Finally, we conclude the paper in Section 4.

2 Rare category detection

2.1 Problem definition

Given a set of unlabeled examples $S = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, which come from $m$ distinct classes, i.e. $y_i \in \{1, \ldots, m\}$, the goal is to find at least one example from each class by requesting as few total labels as possible. For the sake of simplicity, assume that there is only one majority class, which corresponds to $y_i = 1$, and all the other classes are minority classes.

2.2 Rare category detection for the binary case

First let us focus on the simplest case where $m = 2$, and $\Pr[y_i = 1] \gg \Pr[y_i = 2] = p$, i.e. $p \ll 1$. Here, we assume that we have an estimate of the value of $p$ a priori. Next, we introduce our method for rare category detection based on nearest neighbors, which is presented in Algorithm 1. The basic idea is to find maximum changes in local density, which might indicate the location of a rare category.

The algorithm works as follows. Given the unlabeled set $S$ and the prior of the minority class $p$, we first estimate the number $K$ of minority class examples in $S$. Then, for each example, we record the distance to its $K$-th nearest neighbor, which can be computed efficiently with kd-trees [7]. The minimum of these distances over all the examples is assigned to $r'$. Next, we draw a hyper-ball centered at each example with radius $r'$, and count the number of examples enclosed by this hyper-ball, which is denoted $n_i$; $n_i$ is roughly proportional to the local density. To measure the change of local density around a certain point $x_i$, in each iteration of Step 3, we subtract the $n_j$ of neighboring points from $n_i$, and let the maximum value be the score of $x_i$. The example with the maximum score is selected for labeling by the oracle. If the example is from the minority class, we stop the iteration; otherwise, we enlarge the neighborhood within which the scores of the examples are re-calculated and continue.

Algorithm 1 Nearest-Neighbor-Based Rare Category Detection for the Binary Case (NNDB)
Require: $S$, $p$
1: Let $K = np$. For each example, calculate the distance to its $K$-th nearest neighbor. Set $r'$ to be the minimum value among all the examples.
2: $\forall x_i \in S$, let $NN(x_i, r') = \{x \mid x \in S, \|x - x_i\| \le r'\}$, and $n_i = |NN(x_i, r')|$.
3: for $t = 1 : n$ do
4:   $\forall x_i \in S$, if $x_i$ has not been selected, then $s_i = \max_{x_j \in NN(x_i, tr')} (n_i - n_j)$; otherwise, $s_i = -\infty$.
5:   Query $x = \arg\max_{x_i \in S} s_i$.
6:   If the label of $x$ is 2, break.
7: end for
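To make Steps 1-6 concrete, the following is a minimal Python sketch of NNDB, assuming Euclidean distance and using SciPy's cKDTree for the nearest-neighbor computations; the `oracle` callable and the `max_iter` guard are illustrative conveniences, not part of the algorithm as stated in the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def nndb(S, p, oracle, max_iter=None):
    """Sketch of NNDB: query the example with the largest change in local density.

    S      : (n, d) array of unlabeled points.
    p      : assumed prior of the minority class.
    oracle : callable mapping an index into S to its label (1 = majority, 2 = minority).
    Returns the index of the first queried minority-class example, or None.
    """
    n = len(S)
    K = max(1, int(np.ceil(n * p)))            # estimated number of minority examples
    tree = cKDTree(S)

    # Step 1: r' = minimum K-th nearest-neighbor distance over all examples
    # (k = K + 1 because each point is its own 0-th neighbor).
    dists, _ = tree.query(S, k=K + 1)
    r_prime = dists[:, K].min()

    # Step 2: local density n_i = number of points within radius r' of x_i.
    counts = np.array([len(tree.query_ball_point(x, r_prime)) for x in S])

    selected = np.zeros(n, dtype=bool)
    for t in range(1, (max_iter or n) + 1):
        # Steps 3-4: score s_i = max over x_j in NN(x_i, t * r') of (n_i - n_j).
        scores = np.full(n, -np.inf)
        for i in range(n):
            if selected[i]:
                continue
            nbrs = tree.query_ball_point(S[i], t * r_prime)
            scores[i] = counts[i] - counts[nbrs].min()
        # Step 5: query the example with the maximum score.
        q = int(np.argmax(scores))
        selected[q] = True
        if oracle(q) == 2:                     # Step 6: stop once the rare class is hit
            return q
    return None
```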
Before giving theoretical justification, we give an intuitive explanation of why the algorithm works. Assume that the minority class is concentrated in a small region and that the probability density function (pdf) of the majority class is locally smooth. Firstly, since the support region of the minority class is very small, it is important to find its scale. The $r'$ value obtained in Step 1 is used to calculate the local density $n_i$. Since $r'$ is based on the minimum $K$-th nearest neighbor distance, it is never so large as to smooth out changes of local density, and thus it is a good measure of the scale. Secondly, the score of a certain point, corresponding to the change in local density, is the maximum of the difference in local density between this point and all of its neighboring points. In this way, we are able to select not only points on the boundary of the minority class, but also points in its interior, given that the region is small. Finally, by gradually enlarging the neighborhood within which the scores are calculated, we can further explore the interior of the support region, and increase our chance of finding a minority class example.

2.3 Correctness

In this subsection, we prove that if the minority class is concentrated in a small region and the pdf of the majority class is locally smooth, the proposed algorithm will, with high probability, repeatedly sample in the region where minority class examples occur. Let $f_1(x)$ and $f_2(x)$ denote the pdfs of the majority and minority classes respectively, where $x \in \mathbb{R}^d$. To be precise, we make the following assumptions.

Assumptions
1. $f_2(x)$ is uniform within a hyper-ball $B$ of radius $r$ centered at $b$, i.e. $f_2(x) = \frac{1}{V(r)}$ if $x \in B$, and $f_2(x) = 0$ otherwise, where $V(r) \propto r^d$ is the volume of $B$.
2. $f_1(x)$ is bounded and positive in $B$, i.e. $f_1(x) \ge \frac{c_1 p}{(1-p)V(r)}$, $\forall x \in B$, and $f_1(x) \le \frac{c_2 p}{(1-p)V(r)}$, $\forall x \in \mathbb{R}^d$, where $c_1, c_2 > 0$ are two constants.

Notice that we are only dealing with the hard case where $f_1(x)$ is positive within $B$. In the separable case, where the support regions of the two classes do not overlap, other methods, such as the one proposed in [8], can be used to detect the minority class.

With the above assumptions, we have the following claim and theorem. Note that variants of the following proof apply if we assume a different minority class distribution, such as a tight Gaussian.

Claim 1. $\forall \epsilon, \delta > 0$, if
$$n \ge \max\Big\{\frac{1}{2c_1^2p^2}\log\frac{3}{\delta},\ \frac{1}{2(1-2^{-d})^2p^2}\log\frac{3}{\delta},\ \frac{1}{\epsilon^4 V(\frac{r}{2})^4}\log\frac{3}{\delta}\Big\},$$
where $V(\frac{r}{2})$ is the volume of a hyper-ball with radius $\frac{r}{2}$, then with probability at least $1-\delta$, $\frac{r}{2} \le r' \le r$ and $|\frac{n_i}{n} - E(\frac{n_i}{n})| \le \epsilon V(r')$, $1 \le i \le n$, where $V(r')$ is the volume of a hyper-ball with radius $r'$.

Proof. First, notice that the expected proportion of points falling inside $B$ satisfies $E(\frac{|NN(b,r)|}{n}) \ge (1+c_1)p$, and that the maximum expected proportion of points falling inside any hyper-ball of radius $\frac{r}{2}$ satisfies $\max_{x \in \mathbb{R}^d} E(\frac{|NN(x,\frac{r}{2})|}{n}) \le (1+c_2)2^{-d}p$. Then
$$\Pr\Big[r' > r \text{ or } r' < \frac{r}{2} \text{ or } \exists x_i \in S \text{ s.t. } \Big|\frac{n_i}{n} - E\Big(\frac{n_i}{n}\Big)\Big| > \epsilon V(r')\Big]$$
$$\le \Pr[r' > r] + \Pr\Big[r' < \frac{r}{2}\Big] + \Pr\Big[r' \ge \frac{r}{2} \text{ and } \exists x_i \in S \text{ s.t. } \Big|\frac{n_i}{n} - E\Big(\frac{n_i}{n}\Big)\Big| > \epsilon V(r')\Big]$$
$$\le \Pr[|NN(b,r)| < K] + \Pr\Big[\max_{x \in \mathbb{R}^d}\Big|NN\Big(x,\frac{r}{2}\Big)\Big| > K\Big] + n\Pr\Big[\Big|\frac{n_i}{n} - E\Big(\frac{n_i}{n}\Big)\Big| > \epsilon V(r')\ \Big|\ r' \ge \frac{r}{2}\Big]$$
$$= \Pr\Big[\frac{|NN(b,r)|}{n} < p\Big] + \Pr\Big[\max_{x \in \mathbb{R}^d}\frac{|NN(x,\frac{r}{2})|}{n} > p\Big] + n\Pr\Big[\Big|\frac{n_i}{n} - E\Big(\frac{n_i}{n}\Big)\Big| > \epsilon V(r')\ \Big|\ r' \ge \frac{r}{2}\Big]$$
$$\le e^{-2nc_1^2p^2} + e^{-2n(1-2^{-d})^2p^2} + 2ne^{-2n\epsilon^2 V(\frac{r}{2})^2},$$
where the last inequality is based on the Hoeffding bound. Requiring each of the three terms to be at most $\frac{\delta}{3}$ yields the three lower bounds on $n$ in the statement of the claim.
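For reference, the concentration result invoked above is the standard Hoeffding bound for the empirical mean $\bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i$ of $n$ independent random variables $Z_i \in [0,1]$:
$$\Pr\big[\bar{Z} - E(\bar{Z}) \ge t\big] \le e^{-2nt^2}, \qquad \Pr\big[\,|\bar{Z} - E(\bar{Z})| \ge t\,\big] \le 2e^{-2nt^2}, \qquad \forall t > 0.$$
Each of the three terms in the proof instantiates this bound with the empirical fraction of points falling inside the relevant hyper-ball and the corresponding deviation $t$.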
Based on Claim 1, we get the following theorem, which shows the effectiveness of the proposed method.

Main Theorem. If the following three conditions hold:
1. Let $B_2$ be the hyper-ball centered at $b$ with radius $2r$. The minimum distance between the points inside $B$ and the ones outside $B_2$ is not too large, i.e. $\min\{\|x_i - x_j\| \mid x_i, x_j \in S, \|x_i - b\| \le r, \|x_j - b\| > 2r\} \le \alpha$, where $\alpha$ is a positive parameter.
2. $f_1(x)$ is locally smooth, i.e. $\forall x, y \in \mathbb{R}^d$, $|f_1(x) - f_1(y)| \le \frac{\epsilon}{\alpha}\|x - y\|$, where $\epsilon \le \frac{p^2 \cdot OV(\frac{r}{2}, r)}{2^{d+1}V(r)^2}$, and $OV(\frac{r}{2}, r)$ is the volume of the overlapping region of two hyper-balls: one of radius $r$, the other of radius $\frac{r}{2}$ with its center on the sphere of the bigger one.
3. The number of examples is sufficiently large, i.e. $n \ge \max\{\frac{1}{2c_1^2p^2}\log\frac{3}{\delta},\ \frac{1}{2(1-2^{-d})^2p^2}\log\frac{3}{\delta},\ \frac{1}{(1-p)^4\epsilon^4 V(\frac{r}{2})^4}\log\frac{3}{\delta}\}$.
Then with probability at least $1-\delta$, after at most $\frac{2\alpha}{r}$ iterations, NNDB will query at least one example whose probability of coming from the minority class is at least $\frac{1}{3}$, and it will continue querying such examples until the $(\frac{2^d}{p(1-p)} - 2) \cdot \frac{\alpha}{r}$-th iteration.

Proof. Based on Claim 1, applied with $\epsilon$ replaced by $(1-p)\epsilon$, condition 3 guarantees that with probability at least $1-\delta$, $\frac{r}{2} \le r' \le r$ and $|\frac{n_i}{n} - E(\frac{n_i}{n})| \le (1-p)\epsilon V(r')$, $1 \le i \le n$. According to condition 2, $\forall x_i, x_j \in S$ s.t. $\|x_i - b\| > 2r$, $\|x_j - b\| > 2r$ and $\|x_i - x_j\| \le \alpha$, $E(\frac{n_i}{n})$ and $E(\frac{n_j}{n})$ are not affected by the minority class, and $|E(\frac{n_i}{n}) - E(\frac{n_j}{n})| \le (1-p)\epsilon V(r') \le (1-p)\epsilon V(r)$. Note that $\alpha$ is always bigger than $r$. Based on the above inequalities, we have
$$\Big|\frac{n_i}{n} - \frac{n_j}{n}\Big| \le \Big|\frac{n_i}{n} - E\Big(\frac{n_i}{n}\Big)\Big| + \Big|\frac{n_j}{n} - E\Big(\frac{n_j}{n}\Big)\Big| + \Big|E\Big(\frac{n_i}{n}\Big) - E\Big(\frac{n_j}{n}\Big)\Big| \le 3(1-p)\epsilon V(r). \qquad (1)$$
From inequality (1), it is not hard to see that $\forall x_i, x_j \in S$ s.t. $\|x_i - b\| > 2r$ and $\|x_i - x_j\| \le \alpha$, $\frac{n_i}{n} - \frac{n_j}{n} \le 3(1-p)\epsilon V(r)$; i.e. when $tr' = \alpha$,
$$\frac{s_i}{n} \le 3(1-p)\epsilon V(r). \qquad (2)$$
This is because if $\|x_j - b\| \le 2r$, the minority class may also contribute to $\frac{n_j}{n}$, and thus the score may be even smaller.

On the other hand, based on condition 1, there exist two points $x_k, x_l \in S$ s.t. $\|x_k - b\| \le r$, $\|x_l - b\| > 2r$, and $\|x_k - x_l\| \le \alpha$. Since the contribution of the minority class to $E(\frac{n_k}{n})$ is at least $\frac{p \cdot OV(\frac{r}{2}, r)}{V(r)}$, we have $E(\frac{n_k}{n}) - E(\frac{n_l}{n}) \ge \frac{p \cdot OV(\frac{r}{2}, r)}{V(r)} - (1-p)\epsilon V(r)$. Since $|\frac{n_i}{n} - E(\frac{n_i}{n})| \le (1-p)\epsilon V(r') \le (1-p)\epsilon V(r)$ for any example $x_i \in S$, we have
$$\frac{n_k}{n} - \frac{n_l}{n} \ge \frac{p \cdot OV(\frac{r}{2}, r)}{V(r)} - 3(1-p)\epsilon V(r) \ge \frac{p \cdot OV(\frac{r}{2}, r)}{V(r)} - \frac{3(1-p)p^2 \cdot OV(\frac{r}{2}, r)}{2^{d+1}V(r)}.$$
Since $p$ is very small, $\frac{p}{2} \ge \frac{3(1-p)p^2}{2^{d+1}}$; therefore $\frac{n_k}{n} - \frac{n_l}{n} > 3(1-p)\epsilon V(r)$, i.e. when $tr' = \alpha$,
$$\frac{s_k}{n} > 3(1-p)\epsilon V(r). \qquad (3)$$

In Step 4 of the proposed method, we gradually enlarge the neighborhood used to calculate the change of local density. When $tr' = \alpha$, based on inequalities (2) and (3), $\forall x_i \in S$ with $\|x_i - b\| > 2r$ we have $s_k > s_i$. Therefore, in this round of the iteration, we will pick an example from $B_2$. In order for $tr'$ to be equal to $\alpha$, the value of $t$ is $\frac{\alpha}{r'} \le \frac{2\alpha}{r}$.

If we further increase $t$ so that $tr' = c\alpha$, where $c > 1$, we have the following conclusion: $\forall x_i, x_j \in S$ s.t. $\|x_i - b\| > 2r$ and $\|x_i - x_j\| \le c\alpha$, $\frac{n_i}{n} - \frac{n_j}{n} \le (c+2)(1-p)\epsilon V(r)$, i.e. $\frac{s_i}{n} \le (c+2)(1-p)\epsilon V(r)$. As long as $\frac{p}{2} \ge \frac{(c+2)(1-p)p^2}{2^{d+1}}$, i.e. $c \le \frac{2^d}{p(1-p)} - 2$, we still have $s_k > s_i$ for every $x_i \in S$ with $\|x_i - b\| > 2r$, and we will pick examples from $B_2$. Since $r' \le r$, the method will continue querying examples in $B_2$ until the $(\frac{2^d}{p(1-p)} - 2) \cdot \frac{\alpha}{r}$-th iteration.

Finally, we show that the probability of picking a minority class example from $B_2$ is at least $\frac{1}{3}$. To this end, we need to calculate the maximum probability mass of the majority class within $B_2$.
Consider the worst case, where the maximum value of $f_1(x)$ occurs at $b$ and the pdf decreases at the maximum allowed rate $\frac{\epsilon}{\alpha}$ as $x$ moves away from $b$ along the radius, i.e. the shape of $f_1(x)$ is a cone in $(d+1)$-dimensional space. Since $f_1(x)$ must integrate to 1, i.e. $V(\frac{f_1(b)\alpha}{\epsilon}) \cdot \frac{f_1(b)}{d+1} = 1$, where $V(\frac{f_1(b)\alpha}{\epsilon})$ is the volume of a hyper-ball with radius $\frac{f_1(b)\alpha}{\epsilon}$, we have $f_1(b) = \big(\frac{d+1}{V(\frac{\alpha}{\epsilon})}\big)^{\frac{1}{d+1}}$. Therefore, the probability mass of the majority class within $B_2$ is at most
$$V(2r)\Big(f_1(b) - \frac{2r\epsilon}{\alpha}\Big) + V(2r)\cdot\frac{2r\epsilon}{\alpha(d+1)} < V(2r)f_1(b) = 2^d V(r)\Big(\frac{d+1}{V(\frac{\alpha}{\epsilon})}\Big)^{\frac{1}{d+1}} < 2p,$$
where $V(2r)$ is the volume of a hyper-ball with radius $2r$, and the last inequality uses the upper bound on $\epsilon$ from condition 2 together with $\alpha \ge r$ and the fact that $p$ is very small. Therefore, if we select a point at random from $B_2$, the probability that this point is from the minority class is at least $\frac{p}{p + (1-p)\cdot 2p} \ge \frac{p}{p + 2p} = \frac{1}{3}$.

2.4 Rare category detection for multiple classes

In subsection 2.2, we discussed rare category detection for the binary case. In this subsection, we focus on the case where $m > 2$. To be specific, let $p_1, \ldots, p_m$ be the priors of the $m$ classes, with $p_1 \gg p_i$, $\forall i \ne 1$. Our goal is to use as few label requests as possible to find at least one example from each class.

The method proposed in subsection 2.2 can be easily generalized to multiple classes, and is presented in Algorithm 2. In this algorithm, we are given the priors of all the minority classes. Using each $p_i$, we estimate the number $K_i$ of examples from this class, and calculate the corresponding $r'_i$ value in the same manner as in NNDB. Then, we calculate the local density at each example at each of the scales $r'_i$. In the outer loop starting at Step 9, we set $r'$ to the minimum of all the $r'_i$ whose corresponding classes have not been discovered yet, and record its index $s$. In the inner loop starting at Step 11, we gradually enlarge the neighborhood used to calculate the score of each example. This is the same as in NNDB, except that we preclude from selection the examples that are within a certain distance of any previously selected example. This heuristic helps avoid repeatedly selecting examples from the same discovered class. The inner loop stops when we find an example from an undiscovered class; we then update the $r'$ value and resume the inner loop. If the minority classes form compact clusters and are far apart from each other, NNDM is able to detect examples from each minority class with a small number of label requests.

Algorithm 2 Nearest-Neighbor-Based Rare Category Detection for Multiple Classes (NNDM)
Require: $S$, $p_2, \ldots, p_m$
1: for $i = 2 : m$ do
2:   Let $K_i = np_i$.
3:   For each example, calculate the distance between this example and its $K_i$-th nearest neighbor. Set $r'_i$ to be the minimum value among all the examples.
4: end for
5: Let $r'_1 = \max_{i=2}^{m} r'_i$.
6: for $i = 1 : m$ do
7:   $\forall x_j \in S$, let $NN(x_j, r'_i) = \{x \mid x \in S, \|x - x_j\| \le r'_i\}$, and $n_j^i = |NN(x_j, r'_i)|$.
8: end for
9: while not all the classes have been discovered do
10:   Let $r' = \min\{r'_i \mid 1 \le i \le m$, class $i$ not yet discovered$\}$, and let $s$ be the corresponding index, i.e. $r' = r'_s$.
11:   for $t = 1 : n$ do
12:     For each $x_k$ that has been selected and labeled $y_k$, set $s_i = -\infty$ for all $x_i \in S$ with $\|x_i - x_k\| \le r'_{y_k}$; for all the other examples, $s_i = \max_{x_j \in NN(x_i, tr')} (n_i^s - n_j^s)$.
13:     Query $x = \arg\max_{x_i \in S} s_i$.
14:     If $x$ belongs to a class that has not been discovered, break.
15:   end for
16: end while
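Along the same lines, here is a hedged Python sketch of NNDM. SciPy's cKDTree is again used for the neighborhood queries; the dictionary-based bookkeeping of discovered classes and labeled examples, the `oracle` callable, and the absence of guards against pathological non-termination are illustrative implementation choices rather than anything prescribed by the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def nndm(S, priors, oracle):
    """Sketch of NNDM.  priors: {class_label: prior} for the minority classes 2..m."""
    n = len(S)
    tree = cKDTree(S)

    # Per-class scale r'_c: minimum K_c-th nearest-neighbor distance, K_c = n * p_c.
    r = {}
    for c, p in priors.items():
        K = max(1, int(np.ceil(n * p)))
        dists, _ = tree.query(S, k=K + 1)
        r[c] = dists[:, K].min()
    r[1] = max(r.values())                     # scale used for the majority class

    # Local densities n_i^c at every scale r'_c.
    counts = {c: np.array([len(tree.query_ball_point(x, rc)) for x in S])
              for c, rc in r.items()}

    discovered = {}                            # class label -> index of first example found
    labeled = {}                               # queried index -> label returned by the oracle
    while any(c not in discovered for c in r):
        # Work at the smallest scale among the classes not yet discovered (Step 10).
        s = min((c for c in r if c not in discovered), key=lambda c: r[c])
        for t in range(1, n + 1):
            scores = np.full(n, -np.inf)
            for i in range(n):
                # Exclude points within r'_{y_j} of any already-labeled example x_j (Step 12).
                if any(np.linalg.norm(S[i] - S[j]) <= r[y] for j, y in labeled.items()):
                    continue
                nbrs = tree.query_ball_point(S[i], t * r[s])
                scores[i] = counts[s][i] - counts[s][nbrs].min()
            q = int(np.argmax(scores))         # Step 13: query the highest-scoring example
            y = oracle(q)
            labeled[q] = y
            if y not in discovered:            # Step 14: a new class has been found
                discovered[y] = q
                break                          # re-select the working scale
    return discovered
```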
In NNDB and NNDM, we need the priors of the minority classes as input. As we will see in the next section, our algorithms are robust against small perturbations in the priors.

3 Experimental results

In this section, we compare our methods (NNDB and NNDM) with the best method proposed in [8] (Interleave) and random sampling (RS) on both synthetic and real data sets. In Interleave, we use the number of classes as the number of components in the mixture model. For both Interleave and RS, we run the experiment multiple times and report the average results.

3.1 Synthetic data sets

Figure 1(a) shows a synthetic data set where the pdf of the majority class is Gaussian and the pdf of the minority class is uniform within a small hyper-ball. There are 1000 examples from the majority class and only 10 examples from the minority class. Using Interleave, we need to label 35 examples, using RS, we need to label 101 examples, and using NNDB, we only need to label 3 examples in order to sample one from the minority class; the selected examples are denoted as 'x' in Figure 1(b). Notice that the first 2 examples that NNDB selects are not from the correct region. This is because the number of examples from the minority class is very small, and the local density may be affected by the randomness in the data.

Figure 1: Synthetic Data Set 1. (a) Data Set; (b) Examples Selected by NNDB, denoted as 'x'.

In Figure 2(a), the X-shaped data consisting of 3000 examples correspond to the majority class, and the four characters 'NIPS' correspond to four minority classes, which consist of 138, 79, 118, and 206 examples respectively. Using Interleave, we need to label 1190 examples, using RS, we need to label 83 examples, and using NNDM, we only need to label 5 examples in order to get one from each of the minority classes; the selected examples are denoted as 'x' in Figure 2(b). Notice that in this example, Interleave is even worse than RS. This might be because some minority classes are located in a region where the density of the majority class is not negligible, and thus may be 'explained' by the majority-class mixture-model component.

Figure 2: Synthetic Data Set 2. (a) Data Set; (b) Examples Selected by NNDM, denoted as 'x'.

3.2 Real data sets

In this subsection, we compare the different methods on two real data sets: Abalone [3] and Shuttle [4]. The first data set consists of 4177 examples, described by 7-dimensional features. The examples come from 20 classes: the proportion of the largest class is 16.50%, and the proportion of the smallest class is 0.34%. For the second data set, we sub-sample the original training set to produce a smaller data set with 4515 examples, described by 9-dimensional features. The examples come from 7 classes: the proportion of the largest class is 75.53%, and the proportion of the smallest class is 0.13%.

The comparison results are shown in Figure 3(a) and Figure 3(b) respectively. From these figures, we can see that NNDM is significantly better than Interleave and RS: on the Abalone data set, to find all the classes, Interleave needs 280 label requests, RS needs 483 label requests, and NNDM only needs 125 label requests; on the Shuttle data set, to find all the classes, Interleave needs 140 label requests, RS needs 512 label requests, and NNDM only needs 87 label requests. This is because as the number of components becomes larger, the mixture model generated by Interleave is less reliable due to the lack of labeled examples, and thus we need to select more examples. Furthermore, the majority and minority classes may not be near-separable, which is a disaster for Interleave. On the other hand, NNDM does not assume a generative model for the data, and only focuses on the change in local density, which is more effective on these two data sets.

Figure 3: Learning Curves for Real Data Sets (classes discovered vs. number of selected examples, for NNDM, Interleave, and RS). (a) Abalone; (b) Shuttle.
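The learning curves in Figure 3 plot, for each labeling budget, how many distinct classes have been hit by the queries issued so far. A small helper of the following form (hypothetical, not taken from the paper) is enough to turn a query trace into such a curve:

```python
import numpy as np

def classes_discovered_curve(query_labels):
    """query_labels: oracle labels in the order the examples were selected.
    Returns an array c where c[t] is the number of distinct classes found
    within the first t + 1 label requests."""
    seen, curve = set(), []
    for y in query_labels:
        seen.add(y)
        curve.append(len(seen))
    return np.array(curve)

# For example, labels observed over the first six queries of some strategy:
print(classes_discovered_curve([1, 1, 3, 1, 2, 3]))   # -> [1 1 2 2 3 3]
```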
3.3 Imprecise priors

The proposed algorithms need the priors of the minority classes as input. In this subsection, we test the robustness of NNDM against modest mis-estimations of the class priors. The performance of NNDB is similar to NNDM, so we omit those results here. In the experiments, we use the same data sets as in subsection 3.2, and add/subtract 5%, 10%, and 20% from the true priors of the minority classes. The results are shown in Figure 4. From these figures, we can see that NNDM is very robust to small perturbations in the priors. For example, on the Abalone data set, if we subtract 10% from the true priors, only one more label request is needed in order to find all the classes.

Figure 4: Robustness Study (classes discovered vs. number of selected examples, for priors perturbed by -20%, -10%, -5%, +5%, +10%, and +20%). (a) Abalone; (b) Shuttle.

4 Conclusion

In this paper, we have proposed a novel method for rare category detection, useful for de novo active learning in serious applications. Different from existing methods, our method does not rely on the assumption that the data is near-separable. It works by selecting examples corresponding to regions with the maximum change in local density, and depending on scaling, it will select class-boundary or class-internal samples of the minority classes. The method can be scaled up using kd-trees [7]. The effectiveness of the proposed method is guaranteed by theoretical justification, and its superiority over existing methods is demonstrated by extensive experimental results on both synthetic and real data sets. Moreover, it is very robust to modest perturbations in estimating the true class priors.

References
[1] M. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proc. of the 23rd Int. Conf. on Machine Learning, pages 65-72, 2006.
[2] S. Bay, K. Kumaraswamy, M. Anderle, R. Kumar, and D. Steier. Large scale detection of irregularities in accounting data. In Proc. of the 6th Int. Conf. on Data Mining, pages 75-86, 2006.
[3] C. Blake and C. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[4] P. Brazdil and J. Gama. Statlog repository. http://www.niaad.liacc.up.pt/old/statlog/datasets/shuttle/shuttle.doc.html, 1991.
[5] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 19, 2005.
[6] S. Fine and Y. Mansour. Active sampling for multiple output identification. In The 19th Annual Conf. on Learning Theory, pages 620-634, 2006.
[7] A. Moore. A tutorial on kd-trees. Technical report, University of Cambridge Computer Laboratory, 1991.
[8] D. Pelleg and A. Moore.
Active learning for anomaly and rare-category detection. In Advances in Neural Information Processing Systems 18, 2004.