Latent Semantic Analysis for Multiple-Type Interrelated Data Objects

Xuanhui Wang, Jian-Tao Sun, Zheng Chen, ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign: {xwang20, czhai}@cs.uiuc.edu
Microsoft Research Asia, Beijing, P.R. China: {jtsun, zhengc}@microsoft.com

ABSTRACT
Co-occurrence data is quite common in many real applications. Latent Semantic Analysis (LSA) has been successfully used to identify semantic relations in such data. However, LSA can only handle a single co-occurrence relationship between two types of objects. In practical applications, there are many cases where multiple types of objects exist and any pair of these objects could have a pairwise co-occurrence relation. All these co-occurrence relations can be exploited to alleviate data sparseness or to represent objects more meaningfully. In this paper, we propose a novel algorithm, M-LSA, which conducts latent semantic analysis by incorporating all pairwise co-occurrences among multiple types of objects. Based on the mutual reinforcement principle, M-LSA identifies the most salient concepts among the co-occurrence data and represents all the objects in a unified semantic space. M-LSA is general and we show that several variants of LSA are special cases of our algorithm. Experimental results show that M-LSA outperforms LSA on multiple applications, including collaborative filtering, text clustering, and text categorization.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Indexing methods
General Terms: Algorithms
Keywords: M-LSA, LSA, mutual reinforcement principle, multiple-type

1. INTRODUCTION
Co-occurrence data arises naturally and frequently in a variety of applications such as information retrieval and text mining. In most existing work on the analysis of co-occurrence data, only a single pairwise co-occurrence relationship between two types of objects is considered. For example, in information retrieval, the co-occurrence information between documents and words is used to rank documents with respect to queries [2]. In collaborative filtering, items are recommended to an active user based on historical co-occurrence data between users and items [14].

However, in most applications there exist multiple types of data objects, and each pair of them could have a pairwise co-occurrence relationship. For example, in the Web domain shown in Figure 1, users co-occur with Web pages by viewing, queries co-occur with Web pages by referencing, Web pages co-occur with words by containing, and so on.

[Figure 1: Example of multiple-type interrelated data objects (Users, Queries, Web pages, Words); each edge (viewing, issuing, referencing, containing) denotes a single co-occurrence relationship, illustrated with example co-occurrence matrices.]

With each kind of objects containing thousands of instances, each single co-occurrence relationship could be quite sparse. In the example above, using a single relationship, say, the (user, Web page) relation, to represent users may not be meaningful, since there are millions of Web pages and each user may only view a tiny portion of them. Latent Semantic Analysis (LSA) [10] was proposed to alleviate the data sparseness problem by representing objects in a low-dimensional semantic space, in which semantically related objects are expected to be near each other. Such a dimension reduction technique has been shown to improve performance in a variety of applications (e.g., [10, 13, 4]). However, the applicability of LSA is rather limited since it can only consider the co-occurrence relationship between two types of objects.

With multiple co-occurrence relationships available, it is beneficial to exploit all of them to identify the semantic relations and alleviate the data sparseness problem. In the example above, we can also exploit other relations such as (user, query) and (query, Web page) to better represent users, since users with similar interests tend to issue similar queries, and similar queries could refer to similar Web pages. All these co-occurrence relations could be complementary, so it is desirable to incorporate all of them in order to represent each type of objects more meaningfully.
Though promising, exploiting the co-occurrence relations among multiple types of objects is challenging: 1) It is not clear how to effectively utilize all the co-occurrence relations among multiple types of objects to overcome data sparseness. 2) There may exist hidden relations between any two types of objects, and these relations are complex since the information can also propagate through any co-occurrence path. Take Figure 1 as an example. There is no direct co-occurrence between users and words, but similar words can induce similar queries and Web pages, and thus, in turn, similar users. The similarity between words can be propagated to users through the path "words → queries → users" or "words → Web pages → users". Even more complex propagations are also possible.

To effectively utilize all the information among heterogeneous objects, we propose a novel and unified latent semantic analysis algorithm, M-LSA, to model all the objects in a unified framework and identify the latent semantic relations underlying all the co-occurrence data. By exploiting all the pairwise co-occurrence data simultaneously, M-LSA identifies the most salient or important concepts among them. These concepts span a unified low-dimensional semantic space, where each object is represented by a vector that reflects the strengths of its association with these concepts. Specifically, to identify important concepts, a natural belief is that important concepts are related to important objects. Based on this assumption, we utilize the mutual reinforcement principle, which underlies the traditional LSA, to identify the important objects of each type by leveraging all the co-occurrence relations quantitatively. This principle leads to an eigenvector problem, and the obtained eigenvectors are regarded as the latent concepts. We show that the M-LSA algorithm is a natural generalization of the traditional LSA from two to multiple types of objects.

The rest of this paper is organized as follows. Section 2 discusses previous work. We define our problem in Section 3. To solve this problem, we identify the mutual reinforcement principle underlying LSA and extend it to multiple types of objects in Section 4.
This principle naturally leads to a solution to our problem: M-LSA. Section 5 presents our experimental results on different applications. Finally, we conclude the paper in Section 6.

2. RELATED WORK
LSA was first introduced to address the synonymy and polysemy problems in information retrieval [10]. Since then, LSA has attracted much attention and several studies have analyzed it theoretically [3, 11, 21, 4]. For example, [11] and [21] used probabilistic models to study the effectiveness of LSA. More recently, [4] argued that spectral algorithms such as LSA can expand documents implicitly to improve retrieval accuracy. Some variants of LSA have also been proposed. Probabilistic LSA (PLSA) [15] applies a probabilistic aspect model to the co-occurrence data. Iterative Residual Rescaling (IRR) [1] was proposed to counteract LSA's tendency to ignore minor-class documents. Unlike LSA, Nonnegative Matrix Factorization (NMF) [19, 25] decomposes the matrix into two matrices with no negative values. Most of the work above only considers co-occurrence relations between two types of objects.

High-order co-occurrence data, or high-order tensors, are studied in multilinear algebra [18]. In [18], the High-Order Singular Value Decomposition (HOSVD) is proposed to factor high-order tensors; in contrast, we consider the pairwise co-occurrence relations between different types of objects in this paper, which is less computationally expensive than HOSVD. Several recent studies utilize pairwise co-occurrence data for different specific purposes such as object clustering [5] and similarity measurement [24, 16]. In [9], a unified approach is proposed to analyze both link and text information. Compared with these studies, our approach is more general and fundamental in that we provide a general, principled method for analyzing any multiple types of objects.

Our work is also related to the HITS algorithm [17] and a key-phrase extraction algorithm [27] in the sense of sharing the mutual reinforcement principle. HITS uses this principle to find good pages in a Web subgraph, and [27] uses it to identify salient key-phrases from a document. In this paper, we use this principle for multiple-type latent semantic analysis.

3. THE PROBLEM
The problem we study is to analyze the co-occurrence relationships among multiple types of objects. Suppose we have N types of objects {X1, X2, ..., XN} and each pair of them could have a pairwise co-occurrence relation. Formally, we construct an undirected graph G(V, E). V consists of N vertices, each corresponding to a type of objects. If there is a pairwise co-occurrence relation between two types of objects, we have an edge in E which connects the corresponding vertices. We call graph G a "multiple-type graph". In G, each vertex corresponds to a set of objects, and we use |Xi| to denote the number of objects of type Xi. For each edge eij ∈ E, we have a |Xi| × |Xj| co-occurrence matrix Mij. Each edge can also carry a weight αij to measure the importance of the co-occurrence relation between Xi and Xj. Note that G need not be a complete graph: an edge eij is absent if the corresponding co-occurrence data is unavailable or not meaningful for an application. For example, the graph G corresponding to Figure 1 contains 4 types of objects: users, queries, Web pages, and words. There are 5 co-occurrence relations in G, each denoted by an edge in Figure 1, and thus 5 co-occurrence matrices.

Intuitively, based on graph G, objects of any type (e.g., users) can be represented by objects of the other types to which that type is directly connected (e.g., Web pages and queries). However, this method does not effectively exploit all the information on a multiple-type graph. A more general method is to represent objects of any type by all the types of objects which have paths to them (e.g., representing users by words). However, due to the complex relations among data objects, there may be many paths between two types of objects (e.g., users and words), so this method is difficult to implement directly. To effectively utilize the information on a multiple-type graph G, our goal is to find latent semantic representations for each type of objects. Specifically, based on the co-occurrence data of G, we first identify the most salient concepts using the mutual reinforcement principle. These concepts span a semantic space, and we then represent each object in this unified low-dimensional space.
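The problem setup above amounts to a small data structure: a set of object types, a set of pairwise co-occurrence matrices Mij with Mji = Mij^T, and optional edge weights αij. A minimal Python sketch of such a container is shown below; the class name MultiTypeGraph, the field names, and the toy sizes are our own illustration rather than anything specified in the paper.

```python
import numpy as np

class MultiTypeGraph:
    """Minimal container for a multiple-type graph G(V, E): one vertex per object
    type, and a |Xi| x |Xj| co-occurrence matrix Mij (with an optional weight
    alpha_ij) for every observed edge e_ij."""

    def __init__(self, type_sizes):
        self.type_sizes = dict(type_sizes)   # e.g. {"user": 3, "query": 4, ...}
        self.matrices = {}                   # (i, j) -> Mij
        self.weights = {}                    # (i, j) -> alpha_ij

    def add_edge(self, i, j, M, alpha=1.0):
        M = np.asarray(M, dtype=float)
        assert M.shape == (self.type_sizes[i], self.type_sizes[j])
        self.matrices[(i, j)] = M
        self.matrices[(j, i)] = M.T          # Mji = Mij^T
        self.weights[(i, j)] = self.weights[(j, i)] = alpha

# Toy instance of the Figure 1 graph (the object counts are made up).
rng = np.random.default_rng(0)
G = MultiTypeGraph({"user": 3, "query": 4, "page": 5, "word": 6})
G.add_edge("user", "page", rng.poisson(1.0, (3, 5)))    # viewing
G.add_edge("user", "query", rng.poisson(1.0, (3, 4)))   # issuing
G.add_edge("query", "page", rng.poisson(1.0, (4, 5)))   # referencing
G.add_edge("page", "word", rng.poisson(1.0, (5, 6)))    # containing
G.add_edge("query", "word", rng.poisson(1.0, (4, 6)))   # containing
```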
4. M-LSA
In this section, we first describe the mutual reinforcement principle based on an analysis of the traditional LSA. We then extend it to multiple types of objects and present the M-LSA algorithm. Finally, we show that two variants of the traditional LSA are special cases of M-LSA. In the following, we denote matrices by upper-case letters (e.g., A, B), scalars by lower-case letters (e.g., a, b), and vectors by bold lower-case letters (e.g., a, b).

4.1 The Mutual Reinforcement Principle of LSA

4.1.1 Brief Review of LSA
LSA is based on a mathematical operation, the Singular Value Decomposition (SVD), which is akin to factor analysis. Consider the analysis of document-word co-occurrence data: if there are a total of n documents and m words in a document collection, the process starts with the creation of the co-occurrence matrix between the documents and words, A = [aij], with each entry aij representing the co-occurrence frequency of the i-th word in the j-th document. For the m × n matrix A, where without loss of generality m ≥ n and rank(A) = r, the SVD is defined as [12]:

A = U Σ V^T

where U = [u1, u2, ..., ur] is an m × r column-orthonormal matrix whose columns are called left singular vectors; Σ = diag[σ1, σ2, ..., σr] is an r × r diagonal matrix whose diagonal elements are positive singular values sorted in descending order; and V = [v1, v2, ..., vr] is an n × r column-orthonormal matrix whose columns are called right singular vectors. Given an integer k (k ≤ r), LSA uses the first k singular vectors to represent the documents and words in a k-dimensional space [10]. Each singular vector is regarded as a latent concept which captures a salient recurrent word combination pattern in the document collection; a document has a large index value for a concept if it contains the corresponding word pattern [6]. More precisely, LSA represents each document by a row of [σ1 v1, ..., σk vk] and each word by a row of [σ1 u1, ..., σk uk].
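As a concrete illustration of this review (our own sketch, not code from the paper), the snippet below computes the truncated SVD of a small word-document matrix with NumPy and forms the k-dimensional document and word representations just described; the matrix itself is random toy data.

```python
import numpy as np

# Toy word-document co-occurrence matrix A (m = 6 words, n = 4 documents).
rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(6, 4)).astype(float)

# SVD: A = U * diag(sigma) * V^T, singular values sorted in descending order.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent concepts to keep (k <= rank(A))

# LSA representations: each document is a row of [sigma_1 v_1, ..., sigma_k v_k]
# and each word is a row of [sigma_1 u_1, ..., sigma_k u_k].
doc_repr = Vt[:k].T * sigma[:k]    # shape (n, k) = (4, 2)
word_repr = U[:, :k] * sigma[:k]   # shape (m, k) = (6, 2)

print(doc_repr.shape, word_repr.shape)
```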
4.1.2 The Mutual Reinforcement Principle
In LSA, the first k singular vectors of A represent the k most important concepts in the document collection. Alternatively, we can assume that important concepts are related to both important documents and important words. To identify the most important concept from A, we associate an importance property with documents and words respectively and utilize the mutual reinforcement principle¹: an important document co-occurs with important words; an important word co-occurs with important documents. Numerically, if we associate an importance value vi with the i-th document and an importance value uj with the j-th word, the mutual reinforcement principle is expressed as:

vi = Σ_{j: j↔i} uj,    uj = Σ_{i: i↔j} vi

where j↔i means that the j-th word co-occurs with the i-th document. If we denote the importance of all documents by a vector v and the importance of all words by a vector u, we can express the mutual reinforcement principle as:

v = A^T u,    u = A v    (1)

It is easy to see that v = A^T A v and u = A A^T u. Therefore, u and v are the principal left and right singular vectors of A, respectively (please refer to [8] for a proof). Based on the co-occurrence data, the mutual reinforcement principle thus provides a reasonable way to factor out the most important concept. However, a collection generally contains multiple topics; the most important concept represents the most salient topic well, but not the other topics. Fortunately, the mutual reinforcement principle can also be extended to the non-principal singular vectors of A [17]. Specifically, we can obtain the first k singular vectors; each vector is then considered a latent concept, and the magnitude of the corresponding singular value represents the importance of the concept. LSA projects the documents and words onto the low-dimensional semantic space spanned by these concepts.

¹Similar principles are used in HITS [17] and [27].

4.1.3 Unified Importance and Concept Vectors
In LSA, the latent concepts are represented by v in the document space and u in the word space. Conceptually, however, they are associated with the same concept. Thus, our key idea for generalizing LSA to handle multiple co-occurrence matrices is to represent each concept as a single vector. We now describe the unified importance and concept vectors. Recall that in Equation (1) we have v and u. If we concatenate them into a unified importance vector [u, v]^T, Equation (1) can be rewritten as:

[u, v]^T = B · [u, v]^T    (2)

where B is defined as:

B = [ 0    A ]
    [ A^T  0 ]    (3)

It is easy to see that [u, v]^T is an eigenvector of B. Mathematically, [8] shows that the eigenvectors of B are closely related to the singular vectors of A: u and v in Equation (2) are the left and right singular vectors of A, respectively. Thus, computationally, the SVD of A is equivalent to finding the eigenvectors of B, while B gives us a (preferable) unified view of importance. With the notion of a unified concept vector, we can now explain LSA in this unified view. Let c = [u, v]^T be the unified concept vector, and denote the first k eigenvectors of B by {c1, ..., ck}. Each of them is a unified concept vector whose importance is precisely given by the corresponding eigenvalue λi of B, which is the same as the corresponding singular value of A [8]. Therefore, with this unified view, LSA represents each object by a row of the matrix:

[λ1 c1, ..., λk ck] = [ λ1 u1  ...  λk uk ]
                      [ λ1 v1  ...  λk vk ]

where the upper part of the matrix is for words and the lower part is for documents.
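The equivalence between the SVD of A and the eigen-decomposition of B can be checked numerically. The sketch below is our own illustration on a random toy matrix, assuming a non-degenerate leading singular value: it builds B as in Equation (3), takes its leading eigenpairs, and compares them with the singular values and vectors of A.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.poisson(1.0, size=(6, 4)).astype(float)   # toy word-document matrix
m, n = A.shape

# Unified matrix B from Equation (3).
B = np.block([[np.zeros((m, m)), A],
              [A.T, np.zeros((n, n))]])

# B is symmetric, so its eigenvalues are real; sort them in descending order.
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The leading eigenvalues of B coincide with the singular values of A, and the
# leading eigenvector splits into the principal left/right singular vectors
# (up to sign and a common normalization factor of sqrt(2)).
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(eigvals[:n], sigma))                          # True
u1, v1 = eigvecs[:m, 0], eigvecs[m:, 0]
print(np.allclose(np.abs(u1) * np.sqrt(2), np.abs(U[:, 0])))    # True
print(np.allclose(np.abs(v1) * np.sqrt(2), np.abs(Vt[0])))      # True
```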
4.2 The M-LSA Algorithm
To factor the latent semantic concepts out of the multiple co-occurrence relations represented by a multiple-type graph G, we first extend the mutual reinforcement principle of LSA to multiple-type graphs. On a multiple-type graph G with N vertices and a number of pairwise co-occurrence relationships, important objects of one type co-occur with important objects of the other types. The mutual reinforcement principle on a multiple-type graph is a natural generalization of that for two types of objects, and it provides a reasonable way to find the important latent concepts in the multiple co-occurrence data, as we describe below.

Formally, recall that we have N types of objects on graph G: {X1, X2, ..., XN}. For any two types of objects Xi and Xj, we have the co-occurrence matrix Mij (Mij = 0 if the edge eij is absent in G). It is easy to see that Mij = Mji^T. (For brevity, we only consider the co-occurrence data between different types of objects; the co-occurrence relationship within a single type of objects can be incorporated similarly.) Let us associate an importance value with each object. For the i-th type of objects Xi, we have one weight vector wi to denote their importance. The mutual reinforcement principle can then be expressed as:

wi = Σ_{j: j≠i} Mij wj    (4)

Taking the unified view of latent concepts, we use w = [w1, ..., wN]^T as the concatenated importance vector and define

R = [ 0    M12  ...  M1N ]
    [ M21  0    ...  M2N ]
    [ ...  ...  ...  ... ]
    [ MN1  MN2  ...  0   ]    (5)

as the unified co-occurrence matrix. We can then rewrite Equation (4) in matrix form:

w = R · w    (6)

It is easy to show that w converges to the principal eigenvector of the co-occurrence matrix R. Similar to LSA, since important objects have high weights in w, we regard w as the most important latent concept vector across all the co-occurrence relations. Each entry of w corresponds to an object, and its value can be regarded as the association weight between the object and this latent concept. Similarly, the first k eigenvectors of R represent the top k important concepts, which span a k-dimensional semantic space in which to represent all the objects. Specifically, let λ1 ≥ λ2 ≥ ... ≥ λk be the top k eigenvalues of R, with corresponding eigenvectors c1, c2, ..., ck. The symmetry of R guarantees that all eigenvalues are real, and λl gives precisely the importance of the corresponding concept vector cl (1 ≤ l ≤ k). Therefore, the i-th object can be represented by [λ1 c1i, λ2 c2i, ..., λk cki], where cli is the i-th entry of cl, i.e., the association weight between the i-th object and the l-th concept. All the objects are represented in the matrix [λ1 · c1, λ2 · c2, ..., λk · ck], with each row representing one object in the k-dimensional space.

The above analysis treats all the co-occurrence relationships equally, i.e., G is unweighted. This is not the best choice in many cases. For example, different co-occurrence relations might not be equally reliable on the same scale; some may contain more noise than others. Moreover, the entries in different co-occurrence matrices may have different scales and thus need to be normalized. Therefore, in general, we want to give different weights to different co-occurrence matrices, i.e., G is weighted. To incorporate these weights, we can directly replace Mij by αij · Mij in the matrix R of Equation (5) and then conduct the semantic analysis on this new matrix. These weights can serve as normalization factors or reflect the importance of different matrices for a specific application.
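To make the construction of the (weighted) unified matrix R and the resulting object representations concrete, here is a short sketch in the spirit of Equations (4) through (6). The function name m_lsa, the dictionary-based inputs, and the toy data are our own assumptions, not an implementation released with the paper.

```python
import numpy as np

def m_lsa(type_sizes, matrices, weights, k):
    """M-LSA sketch: assemble the unified matrix R of Equation (5) from the
    weighted pairwise co-occurrence matrices alpha_ij * M_ij, take the top-k
    eigenvectors of R, and represent every object by the row
    [lambda_1 * c_1i, ..., lambda_k * c_ki]."""
    types = list(type_sizes)
    offsets = np.concatenate(([0], np.cumsum([type_sizes[t] for t in types])))
    R = np.zeros((offsets[-1], offsets[-1]))
    for (i, j), M in matrices.items():
        a, b = types.index(i), types.index(j)
        block = weights.get((i, j), 1.0) * np.asarray(M, dtype=float)
        R[offsets[a]:offsets[a + 1], offsets[b]:offsets[b + 1]] = block
        R[offsets[b]:offsets[b + 1], offsets[a]:offsets[a + 1]] = block.T  # M_ji = M_ij^T

    eigvals, eigvecs = np.linalg.eigh(R)       # R is symmetric, eigenvalues are real
    order = np.argsort(eigvals)[::-1][:k]      # top-k eigenvalues lambda_1 >= ... >= lambda_k
    lam, C = eigvals[order], eigvecs[:, order]
    X = C * lam                                # scale each concept by its importance
    return {t: X[offsets[n]:offsets[n + 1]] for n, t in enumerate(types)}

# Toy example with three object types and two observed pairwise relations.
rng = np.random.default_rng(2)
sizes = {"user": 3, "query": 4, "page": 5}
mats = {("user", "page"): rng.poisson(1.0, (3, 5)),
        ("user", "query"): rng.poisson(1.0, (3, 4))}
reprs = m_lsa(sizes, mats, weights={("user", "page"): 0.5}, k=2)
print({t: r.shape for t, r in reprs.items()})  # {'user': (3, 2), 'query': (4, 2), 'page': (5, 2)}
```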
Since the αij's only represent the relative importance of the matrices, and their overall scale does not change the eigenvectors of matrix R, without loss of generality we could add a constraint on the αij's.

When there are only two types of objects, say, X1 and X2, there is only one co-occurrence matrix α12 · M12. In the M-LSA framework, we thus have

R = [ 0         α12 M12 ]  =  α12 · [ 0      M12 ]
    [ α12 M21   0       ]           [ M12^T  0   ]

which is α12 times the matrix B of Equation (3) with A = M12.
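This reduction, and the fact that the weight α12 only rescales eigenvalues without changing the concept directions, can be verified numerically. The following sketch is our own illustration on random toy data, assuming the top singular values are distinct.

```python
import numpy as np

rng = np.random.default_rng(3)
M12 = rng.poisson(1.0, size=(5, 4)).astype(float)   # toy two-type co-occurrence matrix
m, n = M12.shape

# For two object types, R is just alpha_12 times the matrix B used by plain LSA.
B = np.block([[np.zeros((m, m)), M12],
              [M12.T, np.zeros((n, n))]])
alpha = 7.0
R = alpha * B

vals_B, vecs_B = np.linalg.eigh(B)
vals_R, vecs_R = np.linalg.eigh(R)

# The top eigenvalues of R are alpha times the singular values of M12 ...
sigma = np.linalg.svd(M12, compute_uv=False)
print(np.allclose(np.sort(vals_R)[::-1][:n], alpha * sigma))          # True

# ... while the top concept directions are unchanged by the global weight alpha
# (compared up to sign, since eigenvectors are only defined up to sign).
print(np.allclose(np.abs(vecs_R[:, -3:]), np.abs(vecs_B[:, -3:])))    # True
```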