Tensor Space Model for Document Analysis

Deng Cai, Dept. of Computer Science, UIUC, dengcai2@cs.uiuc.edu
Xiaofei He, Yahoo! Research Labs, hex@yahoo-inc.com
Jiawei Han, Dept. of Computer Science, UIUC, hanj@cs.uiuc.edu

ABSTRACT
The Vector Space Model (VSM) has been at the core of information retrieval for the past decades. VSM considers documents as vectors in a high-dimensional space. In such a vector space, techniques like Latent Semantic Indexing (LSI), Support Vector Machines (SVM), Naive Bayes, etc., can then be applied for indexing and classification. However, in some cases the dimensionality of the document space can be extremely large, which makes these techniques infeasible due to the curse of dimensionality. In this paper, we propose a novel Tensor Space Model for document analysis. We represent documents as second order tensors, or matrices. Correspondingly, a novel indexing algorithm called Tensor Latent Semantic Indexing (TensorLSI) is developed in the tensor space. Our theoretical analysis shows that TensorLSI is much more computationally efficient than conventional Latent Semantic Indexing, which makes it applicable to extremely large scale data sets. Experimental results on standard document data sets demonstrate the efficiency and effectiveness of our algorithm.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Indexing methods

General Terms: Algorithms, Theory, Experimentation, Performance

Keywords: Tensor Space Model, Vector Space Model, Latent Semantic Indexing, Tensor Latent Semantic Indexing

1. INTRODUCTION
Document indexing and representation has been a fundamental problem in information retrieval for many years. Most previous work is based on the Vector Space Model (VSM, [3]). The documents are represented as vectors, and each word corresponds to a dimension. Learning techniques such as Latent Semantic Indexing (LSI, [2]), Support Vector Machines, Naive Bayes, etc., can then be applied in such a vector space. The main reason for the popularity of VSM is probably the fact that most existing learning algorithms can only take vectors, rather than tensors, as their inputs.

When VSM is applied, one is often confronted with a document space $\mathbb{R}^n$ with an extremely large $n$. Let $x \in \mathbb{R}^n$ denote the document vector. Consider $n = 1{,}000{,}000$ and learning a linear function $g(x) = w^T x$. In most cases, learning $g$ in such a space is infeasible in the sense of computability. For example, when LSI is applied in such a space, one needs to compute the eigen-decomposition of a 1M x 1M matrix. (Here we assume that the number of documents $m$ is larger than $n$; when $m < n$, it suffices to compute the eigen-decomposition of an $m \times m$ matrix.)

Different from traditional Vector Space Model based document indexing and representation, in this paper we consider documents as matrices, or second order tensors. For a document set with $n$ words, we represent the documents as second order tensors (or, matrices) in $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$, where $n_1 \times n_2 \approx n$. For example, a 1,000,000-dimensional vector can be converted into a $1000 \times 1000$ matrix. Let $X \in \mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$ denote the document matrix. Naturally, a linear function in the tensor space can be represented as $f(X) = u^T X v$, where $u \in \mathbb{R}^{n_1}$ and $v \in \mathbb{R}^{n_2}$. Clearly, $f(X)$ has only $n_1 + n_2$ ($= 2000$ in our case) parameters, which is far fewer than the $n$ ($= 1{,}000{,}000$) of $g(x)$.
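To make the parameter saving concrete, here is a minimal sketch in Python/NumPy of folding a term vector into a matrix and evaluating the tensor-space linear function $f(X) = u^T X v$. The dimensions mirror the example above; the random arrays are stand-ins for real term weights, and the simple reshape is only an illustration, since the paper does not prescribe a particular word-to-position mapping.

    import numpy as np

    n1, n2 = 1000, 1000       # matrix shape chosen so that n1 * n2 = n
    n = n1 * n2               # vocabulary size from the example above

    x = np.random.rand(n)     # a document vector in R^n (stand-in for term weights)
    X = x.reshape(n1, n2)     # the same document viewed as a second order tensor

    # Vector-space linear function: n = 1,000,000 parameters.
    w = np.random.rand(n)
    g = w @ x                 # g(x) = w^T x

    # Tensor-space linear function: only n1 + n2 = 2,000 parameters.
    u = np.random.rand(n1)
    v = np.random.rand(n2)
    f = u @ X @ v             # f(X) = u^T X v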
Based on the tensor representation of documents, we propose a novel indexing algorithm called Tensor Latent Semantic Indexing (TensorLSI) that operates in the tensor space rather than the vector space. Sharing similar properties with conventional Latent Semantic Indexing (LSI) [2], TensorLSI tries to find the principal components of the tensor space. Let $\{u_i\}_{i=1}^{n_1}$ be a set of basis functions of $\mathbb{R}^{n_1}$ and $\{v_j\}_{j=1}^{n_2}$ be a set of basis functions of $\mathbb{R}^{n_2}$. It is easy to show that $\{u_i v_j^T\}$ forms a basis of $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$. Thus, TensorLSI aims at finding bases $\{u_i\}$ and $\{v_j\}$ such that the projections of the documents onto $\{u_i v_j^T\}$ best represent the documents in the sense of reconstruction error. It is important to note that, while searching for the optimal bases $\{u_i\}$ and $\{v_j\}$, we only need to compute the eigen-decompositions of two matrices of size $n_1 \times n_1$ and $n_2 \times n_2$. This property makes our algorithm particularly applicable when the number of words is extremely large. This work is fundamentally motivated by [4]. For more detailed document analysis using TSM, please see [1].

2. THE ALGORITHM
TensorLSI is fundamentally based on LSI. It tries to project the data onto the tensor subspace in which the reconstruction error is minimized. Given a set of document matrices $X_i \in \mathbb{R}^{n_1 \times n_2}$, $i = 1, \cdots, m$, let $Y_i = U^T X_i V$ denote the projection of $X_i$ in the tensor subspace $U \otimes V$. The reconstruction error for $X_i$ can be written as $\|X_i - U Y_i V^T\|$. Thus, the objective function of TensorLSI can be described as follows:

$$\min_{U,V} \sum_{i=1}^{m} \|X_i - U Y_i V^T\|^2 \qquad (1)$$

which is equivalent to the following:

$$\min_{U,V} \sum_{i=1}^{m} \|X_i - U U^T X_i V V^T\|^2 \qquad (2)$$

With some algebraic steps [1], we can show that the optimal $U$ and $V$ are given by solving the following two eigenvector problems:

$$\Big(\sum_{i=1}^{m} X_i X_i^T\Big) u = \lambda_u u, \qquad \Big(\sum_{i=1}^{m} X_i^T X_i\Big) v = \lambda_v v \qquad (3)$$

After obtaining the basis vectors $\{u_i\}$ ($i = 1, \cdots, n_1$) and $\{v_j\}$ ($j = 1, \cdots, n_2$), each $u_i v_j^T$ is a basis of the transformed tensor space, and $u_i^T X_t v_j$ ($t = 1, \cdots, m$) is the coordinate of $X_t$ corresponding to $u_i v_j^T$ in this tensor space. When we want to keep the first $k$ principal components of the documents in the transformed tensor space, we use the function

$$f(u_i, v_j) = \sum_{t=1}^{m} \mathrm{tr}(u_i^T X_t v_j v_j^T X_t^T u_i) = \sum_{t=1}^{m} (u_i^T X_t v_j)^2 \qquad (4)$$

to evaluate the importance of $u_i v_j^T$ with respect to $i$ and $j$. $f(u_i, v_j)$ reflects the importance of the tensor basis $u_i v_j^T$ in terms of reconstruction error. To keep the first $k$ principal components in the transformed tensor space, we sort $f(u_i, v_j)$ for all $i$ and $j$ in decreasing order and choose the first $k$ pairs.

Table 1 lists the computational complexity comparison of TensorLSI and LSI; a more detailed analysis can be found in [1].

Table 1: Complexity comparison of TensorLSI and LSI

            Time complexity    Minimum memory    Storage size
    TLSI    O(m n^1.5)         O(n)              O(2n)
    LSI     O((m+n) q^2)       O(q^2)            O(kn)

    (n is the number of features, m is the number of documents,
    q = min(m, n), and k is usually around several hundred.)
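The following is a minimal NumPy sketch of this procedure, assuming the documents are stacked in an array of shape (m, n1, n2); the function name tensor_lsi and the truncation parameter k are illustrative, not from the paper.

    import numpy as np

    def tensor_lsi(docs, k):
        """Project document matrices onto the top-k tensor bases (Eqs. (3)-(4)).

        docs: array of shape (m, n1, n2), one matrix per document.
        k: number of tensor basis pairs (u_i, v_j) to keep.
        """
        m, n1, n2 = docs.shape

        # Eq. (3): eigen-decompose sum_i X_i X_i^T (n1 x n1)
        # and sum_i X_i^T X_i (n2 x n2).
        Mu = np.einsum('tab,tcb->ac', docs, docs)   # sum_i X_i X_i^T
        Mv = np.einsum('tab,tac->bc', docs, docs)   # sum_i X_i^T X_i
        _, U = np.linalg.eigh(Mu)                   # columns are eigenvectors u_i
        _, V = np.linalg.eigh(Mv)                   # columns are eigenvectors v_j

        # Coordinates of every document in the full transformed tensor space:
        # Y_t = U^T X_t V, so Y[t, i, j] = u_i^T X_t v_j.
        Y = np.einsum('ai,tab,bj->tij', U, docs, V)

        # Eq. (4): importance of each basis u_i v_j^T,
        # f(u_i, v_j) = sum_t (u_i^T X_t v_j)^2.
        f = (Y ** 2).sum(axis=0)                    # shape (n1, n2)

        # Keep the k most important (i, j) pairs as the final representation.
        top = np.argsort(f, axis=None)[::-1][:k]
        rows, cols = np.unravel_index(top, f.shape)
        return Y[:, rows, cols]                     # shape (m, k)

Note that for the $1000 \times 1000$ example above, Mu and Mv are only $1000 \times 1000$, so the two eigen-decompositions are cheap compared with the $10^6 \times 10^6$ eigenproblem LSI would face on the same vocabulary.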
3. EXPERIMENTS
In this section, document clustering on the Reuters-21578 corpus (http://www.daviddlewis.com/resources/testcollections/reuters21578/) is used to show the effectiveness of our proposed algorithm. We chose k-means as our clustering algorithm and compared three methods:

· k-means on the original term-document matrix (Baseline)
· k-means after LSI (LSI)
· k-means after TensorLSI (TLSI)

The clustering result is evaluated by comparing the obtained label of each document with the label provided by the document corpus. The accuracy (AC) is used to measure the clustering performance [1].

The evaluations were conducted with different numbers of clusters, ranging from 2 to 10. For each given cluster number k, 50 tests were conducted on different randomly chosen categories, and the average performance was computed over these 50 tests. The average accuracy and the computation time (time on dimension reduction plus time on k-means) of the three methods are shown in Figure 1.

[Figure 1: (a) Clustering accuracy with respect to the number of classes. (b) Computation time (time on dimension reduction plus time on k-means) with respect to the number of classes. As can be seen, TensorLSI achieves accuracy comparable to LSI while being much faster than LSI.]

Both LSI and TensorLSI are significantly better than the baseline with respect to both clustering accuracy and computation time. For k = 2, 3, 4, 5, 6, LSI and TensorLSI achieved almost the same clustering accuracy; for k = 8, 9, 10, clustering using LSI is slightly better than clustering using TensorLSI. However, in all cases, the computation time (time on dimension reduction plus time on k-means) of TensorLSI is dramatically shorter than that of LSI.

4. CONCLUSIONS
A novel document representation and indexing method, called Tensor Latent Semantic Indexing, has been proposed in this paper. Different from conventional LSI, which considers documents as vectors, TensorLSI considers documents as second order tensors, or matrices. Based on the tensor representation, TensorLSI tries to find an optimal basis for the tensor subspace in terms of reconstruction error. Our theoretical analysis also shows that TensorLSI can be much more efficient than LSI in time, memory, and storage, especially when the number of documents is larger than the number of words.

5. REFERENCES
[1] D. Cai, X. He, and J. Han. Tensor space model for document analysis. Technical Report UIUCDCS-R-2006-2715, Computer Science Department, UIUC, April 2006.
[2] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[3] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975.
[4] J. Ye. Generalized low rank approximations of matrices. Machine Learning, 61(1-3):167-191, Nov. 2005.