SIGIR 2007 Proceedings Poster Locality Discriminating Indexing for Document Classification Jiani Hu, Weihong Deng, Jun Guo, and Weiran Xu School of Information Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China {cughu, cvpr dwh}@126.com, guojun@bupt.edu.cn, xuweiran@pris.edu.cn prior class information. The novelty of the LDI algorithm comes from 1) the invader graph that effectively characterizes inter-class overlap of the class-specific manifolds; and 2) the locality discriminating criterion that defines the pro jection vector to best preserve the within-class local structures while suppress the between-class overlap. ABSTRACT This paper introduces a locality discriminating indexing (LD -I) algorithm for document classification. Based on the hypothesis that samples from different classes reside in classspecific manifold structures, LDI seeks for a pro jection which best preserves the within-class local structures while suppresses the between-class overlap. Comparative experiments show that the proposed method is able to derives compact discriminating document representations for classification. 2. THE ALGORITHM The document set is represented as a matrix X = [x1 , x2 , . . . , xN ], xi Rm , where N is the number of samples and m is the feature dimension. The class label of the sample xi is assumed to be ci {1, 2, . . . , Nc }, where Nc is the number of classes. Two graphs are introduced to construct class-specific manifolds. The k nearest-native graph, which is defined as a weighted graph Gn = {V, Sn }, with V = {xi }N 1 , i= n 1, if xj Nk (xi ) xi Nk (xj ) (1) Sn = ij 0, otherwise. where the set Nk (xi ) , called k nearest-native neighbors of xi , is composed of other k samples having the same label with xi and wished to be nearest to xi . Nearest-native graph consists of Nc unconnected subgraphs, and each subgraph preserves the local structure of a class-specific manifold. However, the class-specific manifold structures are usually overlapped in the high-dimensional space. To effectively characterize the manifold overlap, the invader graph is introduced, which is denoted as a weighted graph Gf = {V, Sf }, with V = {xi }N 1 , i= n 1, if xj Ik (xi ) xi Ik (xj ) f (2) Sij = 0, otherwise. where the set Ik (xi ) composes of the invaders of xi . The invaders of xi have two properties: 1) they reside in the local spherical space Rm (xi ) = {x : x - xi < i } with the radius as i = max xj - xi ; and 2) their labels xj Nk (xi ) Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing--indexing methods General Terms Algorithms Keywords document indexing, document classification, manifold analysis 1. INTRODUCTION Document representation and indexing is a fundamental problem for efficient classification, clustering, and retrieval. Latent Semantic Indexing (LSI) [1] and Locality Preserving Indexing (LPI) [2] are two well-known document indexing methods in vector space model. LSI essentially extracts the most representative features for document representation rather than the most discriminative features. LPI seeks a linear subspace which can preserve the local structure of data. Intuitively, the pro jection direction determined by LPI ensures that, if samples xi and xj are close, their projections yi and yj are close as well. It possibly happens that two near samples belonging to different classes may also result in close texts after the pro jection of LPI. To address these problems, we proposes a Locality Discriminating Indexing (LDI) for document representation and sequent classification. We assume the documents reside on diverse class-specific manifold structures that overlap one another, and aims to linearly represent the document in a manner which discount the inter-class overlap. Therefore, LDI explicitly considers both the local structure and the Copyright is held by the author/owner(s). SIGIR'07, July 23­27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007. are different from xi . Fig. 1 gives an illustration. Note that the invader graph is constructed by a -neighborhood adaptive to the local density around xi , which circumvents the parameter selection problem. It is ideal to make each sample and its k-nearest neighbors always belong to the same class in the pro jected space and its invaders far apart. Based on this locality discriminating criterion, the optimal pro jection vector can be determined as follows: P T T 2n ij (w xi - w xj ) Sij w = arg min P T T 2f w ij (w xi - w xj ) Sij 689 SIGIR 2007 Proceedings Poster local neighborhood Micro-average F1 value (a) TREC data 1 0.95 (b) Reuters data Micro-average F1 value 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.9 0.85 Xi LDI invaders native neighbors Xi 0.8 0.75 0.7 0.65 0 LDI LPI LSI Original 50 100 150 200 LDI LPI LSI Original 0 50 100 150 200 Dimension of document representation Dimension of document representation Figure 1: Schematic illustration of xi 's k(= 3) nearest-native neighb ors and two invaders b efore LDI pro jection (left) versus after pro jection (right). wT XLn XT w . (3) w wT XLf XT w The numerator characterizes the intra-class compactness by the sum of the distance between each data and its k-nearest natives, while the denominator models the inter-class overlap by the sum of distance between each data and its invaders. In equation (3), Ln = Dn - Sn is the Laplacian [3] of the nearest-native graph Gn , and Dn is a diagonal matrix Pn f f f with Dn = ii j Sij . And L = D - S is the Laplacian f f of the P ader graph G , and D is a diagonal matrix with inv Dfi = j Sfj . i i From above descriptions, the Locality Discriminating Indexing (LDI) algorithm can be described as follows: = arg min 1. LSI pro jection. We first pro ject the data set {xi }N 1 i= into the low-dimensional LSI subspace [1]. Let WLS I Rm× denote the transformation matrix of LSI. 2. Constructing the k nearest-native and invader graph according to equation (1) and (2). 3. Solving lo cality discriminating criterion in equation (3). Computing the eigenvector and eigenvalue of the following generalized eigenvalue problem: XLn XT w = XLf XT w. (4) Figure 2: Comparative classification p erformance of the LSI, LPI, LDI and original representation with increasing reduced dimensionality. We compare the proposed algorithm to the LSI, LPI and the original representation for text classification. We evaluate these algorithms using F1 measure based on precison/recall across documents (micro-average). In all the experiments, the F1 values are estimated by 5-fold cross-validation. Fig.2 shows the classification results with increasing dimension of the document representation. It can be observed from Fig.2 that LDI clearly outperforms other algorithms. In particular, LDI method achieves the highest micro-average F1 value when using only 7 features in TREC data. In summary, we propose locality discriminating indexing for document classification based on graph framework. We take experiments to evaluate the effectiveness of different document indexing methods. It is showed that the proposed algorithm derives discriminating features with low dimensionality and enhanced discrimination power. 4. ACKNOWLEDGMENTS This work was supported by National Natural Science Foundation of China (Grant No.60475007, 60675001). We'd like to thank the anonymous reviewers for their helpful comments. 5. REFERENCES [1] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6): 391­407, 1990. [2] X. He, D. Cai, H. Liu, and W. Y. Ma. Locality preserving indexing for document representation. In Proceedings of ACM SIGIR, 2004. [3] F. R. K. Chung. Spectral graph theory. American Mathematical Society, AMS Press, 1997. [4] F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1­47, 2002. Solutions of (4) can be simply obtained by finding the eigenvectors of (XLf XT )-1 XLn XT . 4. Output the final linear transformation matrix. Let w1 , w2 , . . . , w be the solutions of (4), ordered according to their eigenvalues, 0 1 2 . . . . Let d (d ) be the output dimension and WLDI = [w1 , w2 , . . . , wd ]. The linear transformation matrix is: W = WLS I · WLDI . (5) The low-dimensional representation using LDI is defined as follows: x y = WT x. 3. EXPERIMENTS AND CONCLUSIONS Experiments have been conducted on a subset of TREC ( TREC-5 & TREC-6) corpus and a subset of Reuters-21578 collection. The subset of TREC corpus contains 757 documents from 5 categories. The subset of Reuters-21578 has 1442 documents from 9 categories. Firstly, the documents are represented in the Vector Space Model using the normalized tf-idf weighting scheme [4] and subsequently transformed to the LSI, LPI, and LDI representations. The classifier for all document indexing algorithms is k-NN (k = 3). 690