Probabilistic Dyadic Data Analysis with Local and Global Consistency

Deng Cai        DENGCAI@CAD.ZJU.EDU.CN
Xuanhui Wang    XWANG20@CS.UIUC.EDU
Xiaofei He      XIAOFEIHE@CAD.ZJU.EDU.CN

State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, 100 Zijinggang Road, 310058, China
Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave., Urbana, IL 61801

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Dyadic data arises in many real-world applications such as social network analysis and information retrieval. In order to discover the underlying or hidden structure in dyadic data, many topic modeling techniques have been proposed; typical algorithms include Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). The probability density functions obtained by both of these algorithms are supported on the Euclidean space. However, many previous studies have shown that naturally occurring data may reside on or close to an underlying submanifold. We introduce a probabilistic framework for modeling both the topical and geometrical structure of dyadic data that explicitly takes the local manifold structure into account. Specifically, the local manifold structure is modeled by a graph. The graph Laplacian, analogous to the Laplace-Beltrami operator on manifolds, is applied to smooth the probability density functions. As a result, the obtained probability distributions are concentrated around the data manifold. Experimental results on real data sets demonstrate the effectiveness of the proposed approach.
1. Introduction

Dyadic data refers to a domain where two sets of objects, row and column objects, are characterized by a matrix of numerical values which describe their mutual relationships. Such data arises in many real-world applications such as social network analysis and information retrieval (Hofmann et al., 1998). A common example is the term-document co-occurrence matrix. In order to discover the underlying or hidden structure in dyadic data, topic modeling techniques are usually applied to learn a probabilistic interpretation of the row and column objects. Two of the most popular approaches for this purpose are Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). In the dyadic aspect model applied to text analysis, a corpus of documents is modeled as a set of pairs (d, w), where d is a document index and w is a word index. Each document is represented as a distribution over the K settings of a latent variable z, and each setting of z corresponds to an underlying topic. Associated with each topic is a distribution over the words in the vocabulary. Thus, a document is seen as a distribution over topics, where each topic is described by a different distribution over words. A word is generated for a document by choosing a topic and then selecting a word according to the distribution over words for the chosen topic. PLSA has been shown to be a low-perplexity language model and to outperform Latent Semantic Indexing (LSI) (Deerwester et al., 1990) in terms of precision-recall on a number of document collections. However, the number of parameters of PLSA grows linearly with the number of documents, which suggests that PLSA is prone to overfitting (Blei et al., 2003). LDA was introduced to address this problem by placing a Dirichlet prior on the topic distributions. These two approaches yield impressive results on exploratory dyadic data analysis. However, neither of them takes into account the geometry of the spaces in which the objects (either column or row objects) reside. The learned probability distributions are simply supported on the ambient spaces.

Recent studies (Roweis & Saul, 2000; Belkin & Niyogi, 2001) have shown that naturally occurring data, such as texts and images, cannot possibly "fill up" the ambient Euclidean space; rather, it must concentrate around lower-dimensional structures. The goal of this paper is to extract this kind of low-dimensional structure and use it to regularize the learning of probability distributions. We construct a nearest-neighbor graph to model the underlying manifold structure. The graph Laplacian, analogous to the Laplace-Beltrami operator on manifolds, is then used as a smoothing operator applied to the conditional probability distributions P(z|d). In this way, two sufficiently close documents should have similar conditional probability distributions. We use the Kullback-Leibler divergence to measure the distance between two conditional probability distributions. The local consistency is incorporated into the probabilistic modeling framework through a regularizer. We discuss how to solve the regularized log-likelihood maximization problem using Expectation-Maximization techniques.

The rest of the paper is organized as follows. Section 2 provides background on dyadic data analysis. Our Locally-consistent Topic Modeling (LTM) approach is introduced in Section 3. A variety of experimental results are presented in Section 4. Finally, we give concluding remarks in Section 5.

2. Background

One of the popular approaches for dyadic data analysis is topic modeling, which has recently received a great deal of interest (Li & McCallum, 2006; Rosen-Zvi et al., 2004). Two of the most well-known topic modeling algorithms are Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). In the rest of the paper, we use text analysis to explain the algorithms.

The core of PLSA is a latent variable model for co-occurrence data which associates an unobserved topic variable z_k ∈ {z_1, ..., z_K} with the occurrence of a word w_j ∈ {w_1, ..., w_M} in a particular document d_i ∈ {d_1, ..., d_N}. As a generative model for word/document co-occurrences, PLSA is defined by the following scheme:

1. Select a document d_i with probability P(d_i);
2. Pick a latent topic z_k with probability P(z_k | d_i);
3. Generate a word w_j with probability P(w_j | z_k).

As a result one obtains an observed pair (d_i, w_j), while the latent topic variable z_k is discarded. Translating the data generation process into a joint probability model results in the expression

    P(d_i, w_j) = P(d_i) P(w_j | d_i),    where    P(w_j | d_i) = \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i).    (1)

The parameters can be estimated by maximizing the log-likelihood

    \mathcal{L} = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j)
                = \mathrm{const} + \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i),    (2)

where n(d_i, w_j) is the number of occurrences of term w_j in document d_i and the constant collects the terms involving P(d_i), which do not depend on the topic parameters. The above optimization problem can be solved by the standard EM algorithm.

Notice that there are NK + MK parameters {P(w_j|z_k), P(z_k|d_i)} which are independently estimated in the PLSA model. It is easy to see that the number of parameters in PLSA grows linearly with the number of training documents N. The linear growth in parameters suggests that the model is prone to overfitting (Blei et al., 2003). To address this issue, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) was proposed. LDA assumes that the probability distributions of documents over topics are generated from the same Dirichlet distribution with K parameters. The K + MK parameters in a K-topic LDA model do not grow with the size of the corpus. Thus, LDA does not suffer from the same overfitting issue as PLSA.
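To make the model in Eqs. (1)-(2) and the EM fitting mentioned above concrete, here is a minimal NumPy sketch of PLSA; it is an illustrative dense-array implementation rather than the authors' code, and names such as plsa_em and n_dw are our own.

import numpy as np

def plsa_em(n_dw, n_topics, n_iters=100, seed=0):
    """Fit PLSA on an N x M document-word count matrix n_dw by EM."""
    rng = np.random.default_rng(seed)
    N, M = n_dw.shape
    # Random initialization of P(w|z) (K x M) and P(z|d) (N x K), rows normalized.
    p_w_z = rng.random((n_topics, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((N, n_topics)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: P(z|d,w) is proportional to P(w|z) P(z|d) (Bayes' formula).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]              # N x K x M
        post = joint / np.clip(joint.sum(axis=1, keepdims=True), 1e-12, None)
        # M-step: re-estimate the parameters from the expected counts n(d,w) P(z|d,w).
        expected = n_dw[:, None, :] * post                         # N x K x M
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=2)
        p_z_d /= np.clip(p_z_d.sum(axis=1, keepdims=True), 1e-12, None)
    return p_w_z, p_z_d

The dense N x K x M intermediate is only acceptable for small corpora; it is kept here to mirror the equations as directly as possible.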
3. Probabilistic Topic Modeling with Local Consistency

Recent studies (Roweis & Saul, 2000; Belkin & Niyogi, 2001) have shown that naturally occurring data, such as texts and images, cannot possibly "fill up" the ambient Euclidean space; rather, it must concentrate around lower-dimensional structures. In this section, we describe a principled way to extract this kind of low-dimensional structure and use it to regularize the learning of probability distributions.

3.1. The Latent Variable Model with Graph Regularization

Recall that the documents d ∈ D are drawn according to the distribution P_D. One might hope that knowledge of the distribution P_D can be exploited for better estimation of the conditional distribution P(z|d). Nevertheless, if there is no identifiable relation between P_D and the conditional distribution P(z|d), the knowledge of P_D is unlikely to be very useful. Therefore, we make a specific assumption about the connection between P_D and the conditional distribution P(z|d). We assume that if two documents d_1, d_2 ∈ D are close in the intrinsic geometry of P_D, then the conditional distributions P(z|d_1) and P(z|d_2) are "similar" to each other. In other words, the conditional probability distribution P(z|d) varies smoothly along the geodesics in the intrinsic geometry of P_D. This assumption is also referred to as the manifold assumption (Belkin & Niyogi, 2001), which plays an essential role in developing various kinds of algorithms, including dimensionality reduction algorithms (Belkin & Niyogi, 2001) and semi-supervised learning algorithms (Belkin et al., 2006; Zhu & Lafferty, 2005).

Now we face two questions: 1) how to measure the distance between two distributions, and 2) how to model the local geometric structure of the data?

A popular way to measure the "distance" between two distributions is the Kullback-Leibler divergence (KL-divergence). Given two distributions P_i(z) and P_s(z), the KL-divergence between them is defined as

    D(P_i(z) \| P_s(z)) = \sum_z P_i(z) \log \frac{P_i(z)}{P_s(z)}.    (3)

It is important to note that the KL-divergence is not a distance metric because it is not symmetric.

Recent studies on spectral graph theory (Chung, 1997) and manifold learning theory (Belkin & Niyogi, 2001) have demonstrated that the local geometric structure can be effectively modeled through a nearest-neighbor graph on a scatter of data points. Consider a graph with N vertices where each vertex corresponds to a document in the corpus. Define the edge weight matrix W as follows:

    W_{is} = \begin{cases} 1, & \text{if } d_i \in N_p(d_s) \text{ or } d_s \in N_p(d_i), \\ 0, & \text{otherwise,} \end{cases}    (4)

where N_p(d_i) denotes the set of p nearest neighbors of d_i (with respect to the Euclidean distance). It is important to note that W can be constructed to incorporate more information. For example, we can naturally use the label information of part of the data: if we know the labels of both d_i and d_s, we can set W_{is} = 1 when d_i and d_s share the same label and W_{is} = 0 when they belong to different classes.
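The following is a small sketch of the graph construction in Eq. (4), including the optional label-based modification described above. It assumes scikit-learn is available, that documents are given as rows of a feature matrix (e.g., TF-IDF vectors), and that a label value of -1 marks an unlabeled document; the function and variable names are hypothetical.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_weight_matrix(features, p=5, labels=None):
    """p-nearest-neighbor weight matrix W of Eq. (4), optionally edited with labels."""
    N = features.shape[0]
    nn = NearestNeighbors(n_neighbors=p + 1).fit(features)   # Euclidean distance by default
    _, idx = nn.kneighbors(features)                          # a point is usually its own first neighbor
    W = np.zeros((N, N))
    for i in range(N):
        W[i, idx[i, 1:]] = 1.0
    W = np.maximum(W, W.T)        # symmetrize: d_i in N_p(d_s) or d_s in N_p(d_i)
    if labels is not None:        # optional use of partial label information
        known = np.flatnonzero(np.asarray(labels) >= 0)
        for i in known:
            for s in known:
                if i != s:
                    W[i, s] = 1.0 if labels[i] == labels[s] else 0.0
    return W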
Let P_i(z) = P(z | d_i). The following term can then be used to measure the smoothness of the conditional probability distribution P(z|d) along the geodesics in the intrinsic geometry of the data:

    R = \frac{1}{2} \sum_{i,s=1}^{N} \Big( D(P_i(z) \| P_s(z)) + D(P_s(z) \| P_i(z)) \Big) W_{is}.    (5)

By minimizing R, we obtain a conditional probability distribution which is sufficiently smooth on the intrinsic geometric structure of the documents. An intuitive explanation of minimizing R is that if two documents d_i and d_s are close (i.e., W_{is} is large), the distributions P(z|d_i) and P(z|d_s) are similar to each other.

Now we can define our new latent variable model. The new model adopts the generative scheme of PLSA and aims to maximize the regularized log-likelihood

    \mathcal{L}_{reg} = \mathcal{L} - \lambda R
                      = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i)
                        - \frac{\lambda}{2} \sum_{i,s=1}^{N} \Big( D(P_i(z) \| P_s(z)) + D(P_s(z) \| P_i(z)) \Big) W_{is},    (6)

where \lambda is the regularization parameter. Since this approach incorporates local consistency through a regularizer, we call it Locally-consistent Topic Modeling (LTM). It is important to note that this work is motivated by our previous work on LapPLSA (Cai et al., 2008; Mei et al., 2008). The major difference is that LapPLSA constructs the regularizer using the Euclidean distance, while in this work we use the divergence measure, which leads to a new objective function. We next show how to apply the EM algorithm to solve this optimization problem.
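For concreteness, a small sketch of evaluating the regularizer R of Eq. (5) for a given topic matrix is shown below; p_z_d stands for the N x K matrix whose i-th row is P(z|d_i), and the implementation is only illustrative.

import numpy as np

def symmetric_kl_regularizer(p_z_d, W, eps=1e-12):
    """R of Eq. (5): W-weighted symmetric KL-divergence between rows of p_z_d."""
    P = np.clip(p_z_d, eps, None)
    log_P = np.log(P)
    R = 0.0
    rows, cols = np.nonzero(W)
    for i, s in zip(rows, cols):                        # sum over all ordered pairs with W_is > 0
        kl_is = np.sum(P[i] * (log_P[i] - log_P[s]))    # D(P_i || P_s)
        kl_si = np.sum(P[s] * (log_P[s] - log_P[i]))    # D(P_s || P_i)
        R += 0.5 * W[i, s] * (kl_is + kl_si)
    return R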
3.2. Model Fitting with EM

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). EM alternates two steps: (i) an expectation (E) step, where posterior probabilities are computed for the latent variables based on the current estimates of the parameters, and (ii) a maximization (M) step, where the parameters are updated by maximizing the so-called expected complete-data log-likelihood, which depends on the posterior probabilities computed in the E-step. As in PLSA, we have NK + MK parameters {P(w_j|z_k), P(z_k|d_i)}, and the latent variables are the hidden topics z_k. For simplicity, we use \Psi to denote the collection of all NK + MK parameters.

E-step: The E-step for LTM is exactly the same as the E-step in PLSA. The posterior probabilities for the latent variables are P(z_k | d_i, w_j), which can be computed by simply applying Bayes' formula to Eq. (1) (Hofmann, 2001):

    P(z_k | d_i, w_j) = \frac{P(w_j | z_k) P(z_k | d_i)}{\sum_{l=1}^{K} P(w_j | z_l) P(z_l | d_i)}.    (7)

M-step: With simple derivations (Hofmann, 2001), one can obtain the relevant part of the expected complete-data log-likelihood for LTM:

    Q(\Psi) = Q_1(\Psi) + Q_2(\Psi)
            = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k | d_i, w_j) \log\big( P(w_j | z_k) P(z_k | d_i) \big)
              - \frac{\lambda}{2} \sum_{i,s=1}^{N} \Big( D(P_i(z) \| P_s(z)) + D(P_s(z) \| P_i(z)) \Big) W_{is}.    (8)

To obtain the M-step re-estimation equations, we need to maximize Q(\Psi) with respect to the parameters under the constraints \sum_{k=1}^{K} P(z_k | d_i) = 1 and \sum_{j=1}^{M} P(w_j | z_k) = 1.

Notice that Q(\Psi) has two parts. The first part is exactly the expected complete-data log-likelihood for PLSA. The second part is the regularization term, which only involves the parameters {P(z_k|d_i)}. Thus, the M-step re-estimation equation for {P(w_j|z_k)} is exactly the same as in PLSA (Hofmann, 2001):

    P(w_j | z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k | d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m) P(z_k | d_i, w_m)}.    (9)

Now let us derive the re-estimation equation for {P(z_k|d_i)}. In order to take care of the normalization constraints, Eq. (8) has to be augmented by appropriate Lagrange multipliers \tau_i:

    H = Q(\Psi) + \sum_{i=1}^{N} \tau_i \Big( 1 - \sum_{k=1}^{K} P(z_k | d_i) \Big).    (10)

Maximization of H with respect to {P(z_k|d_i)} leads to the following set of stationary equations:

    \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k | d_i, w_j)}{P(z_k | d_i)}
    - \frac{\lambda}{2} \sum_{s=1}^{N} \Big( \log\frac{P(z_k | d_i)}{P(z_k | d_s)} + 1 - \frac{P(z_k | d_s)}{P(z_k | d_i)} \Big) W_{is}
    - \tau_i = 0, \qquad 1 \le i \le N,\ 1 \le k \le K.    (11)

Because of the log term in the regularization part, it is hard to solve the above system of equations directly. Recall the motivation of the regularization term: we hope that if two documents d_i and d_s are close (i.e., W_{is} is large), the distributions P(z|d_i) and P(z|d_s) are similar to each other, i.e., P(z_k|d_i) is close to P(z_k|d_s) and P(z_k|d_i)/P(z_k|d_s) \approx 1. Thus, we can use the following approximation, based on the first-order Taylor expansion of the log function:

    \log(x) \approx 1 - \frac{1}{x}, \qquad x \to 1.

With this approximation, the equations in Eq. (11) can be written as

    \sum_{j=1}^{M} n(d_i, w_j) P(z_k | d_i, w_j) - \tau_i P(z_k | d_i) - \lambda \sum_{s=1}^{N} \big( P(z_k | d_i) - P(z_k | d_s) \big) W_{is} = 0,
    \qquad 1 \le i \le N,\ 1 \le k \le K.    (12)

We have

    \sum_{s=1}^{N} \big( P(z_k | d_i) - P(z_k | d_s) \big) W_{is}
    = P(z_k | d_i) \sum_{s=1}^{N} W_{is} - \sum_{s=1}^{N} P(z_k | d_s) W_{is}.    (13)

Let D denote the diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, D_{ii} = \sum_s W_{is}, and define L = D - W; L is usually referred to as the graph Laplacian (Chung, 1997). We also define the vector y_k = [P(z_k|d_1), ..., P(z_k|d_N)]^T. It is easy to verify that Eq. (13) equals the i-th element of the vector L y_k. Let \Lambda denote the N \times N diagonal matrix whose entries are \tau_i. The system of equations in Eq. (12) can then be rewritten as

    \begin{bmatrix} \sum_{j=1}^{M} n(d_1, w_j) P(z_k | d_1, w_j) \\ \vdots \\ \sum_{j=1}^{M} n(d_N, w_j) P(z_k | d_N, w_j) \end{bmatrix}
    - \Lambda y_k - \lambda L y_k = 0, \qquad 1 \le k \le K.    (14)

Let e \in \mathbb{R}^N denote the vector of all ones. With the normalization constraints, we know that \sum_{k=1}^{K} y_k = e. We can also easily verify that L e = 0. Thus, adding the K systems of equations in Eq. (14) together, we can compute the Lagrange multipliers:

    \tau_i = \sum_{k=1}^{K} \sum_{j=1}^{M} n(d_i, w_j) P(z_k | d_i, w_j) = n(d_i),

where n(d_i) = \sum_j n(d_i, w_j) is the document length. One can easily verify that the matrix \Lambda + \lambda L is positive definite. By solving the linear systems in Eq. (14), one obtains the M-step re-estimation equation for {P(z_k|d_i)}:

    y_k = (\Lambda + \lambda L)^{-1} \begin{bmatrix} \sum_{j=1}^{M} n(d_1, w_j) P(z_k | d_1, w_j) \\ \vdots \\ \sum_{j=1}^{M} n(d_N, w_j) P(z_k | d_N, w_j) \end{bmatrix}.    (15)

When the regularization parameter \lambda = 0, the above M-step re-estimation equation boils down to the M-step of the original PLSA. \Lambda + \lambda L is usually a sparse matrix, so efficient iterative algorithms (e.g., LSQR (Paige & Saunders, 1982)) can be used to solve the above linear systems instead of computing the matrix inverse. The E-step (Eq. 7) and the M-step (Eqs. 9 and 15) are alternated until a termination condition is met.
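A sketch of the M-step update of Eq. (15) with SciPy's sparse solvers is given below. It assumes the vectors b_k (stacked as the columns of B) have already been accumulated from the E-step posteriors of Eq. (7), and that doc_lengths holds n(d_i); the names are ours, and a direct sparse solve is used for clarity, although an iterative solver such as scipy.sparse.linalg.lsqr could be substituted in the spirit of the LSQR suggestion above.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def m_step_p_z_d(B, W, doc_lengths, lam):
    """Solve (Lambda + lam*L) y_k = b_k for every topic k; B is N x K."""
    W = sp.csr_matrix(W)
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    L = D - W                                                   # graph Laplacian L = D - W
    A = (sp.diags(np.asarray(doc_lengths, dtype=float)) + lam * L).tocsc()
    Y = np.column_stack([spsolve(A, B[:, k]) for k in range(B.shape[1])])
    # With tau_i = n(d_i), the exact solutions already satisfy sum_k y_k = e;
    # clipping and renormalizing only absorbs small numerical errors.
    Y = np.clip(Y, 1e-12, None)
    return Y / Y.sum(axis=1, keepdims=True)

In a full LTM implementation this routine would be called once per EM iteration, after the E-step of Eq. (7) and the P(w|z) update of Eq. (9).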
4. Experiments

We evaluate our LTM approach in two application domains: document clustering and classification.

4.1. Document Clustering

Clustering is one of the most crucial techniques for organizing data in an unsupervised manner. The hidden topics extracted by topic modeling approaches can be regarded as clusters, and the estimated conditional probability density function P(z_k|d_i) can be used to infer the cluster label of each datum. In this experiment, we investigate the use of topic modeling approaches for text clustering.

4.1.1. Data and Experimental Settings

Our empirical study was conducted on a subset of the Reuters-21578 text data set, provided by Reuters and corrected by Lewis (http://www.daviddlewis.com/resources/testcollections/reuters21578/). The 30 largest categories are chosen for our experiments, which include 8,067 documents and 18,832 distinct words. The clustering result is evaluated by comparing the obtained label of each document with that provided by the document corpus. The accuracy (AC) is used to measure the clustering performance (Xu et al., 2003). Given a document x_i, let r_i and s_i be the obtained cluster label and the label provided by the corpus, respectively. The AC is defined as

    AC = \frac{\sum_{i=1}^{n} \delta(s_i, map(r_i))}{n},

where n is the total number of documents, \delta(x, y) is the delta function that equals one if x = y and zero otherwise, and map(r_i) is the permutation mapping function that maps each cluster label r_i to the equivalent label from the data corpus. The best mapping can be found by using the Kuhn-Munkres algorithm (Lovász & Plummer, 1986).
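As an illustration, the AC metric and the optimal mapping can be computed with SciPy's assignment solver, which solves the same assignment problem as the Kuhn-Munkres algorithm; the function below is a sketch with hypothetical argument names.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """AC = (1/n) * maximum number of matched documents over label permutations."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    true_ids = np.unique(true_labels)
    cluster_ids = np.unique(cluster_labels)
    # Contingency table: count[t, c] = number of documents with true label t in cluster c.
    count = np.array([[np.sum((true_labels == t) & (cluster_labels == c))
                       for c in cluster_ids] for t in true_ids])
    row, col = linear_sum_assignment(-count)     # negate to maximize matched counts
    return count[row, col].sum() / len(true_labels)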
In order to randomize the experiments, we conduct the evaluations with the cluster number ranging from two to ten. For each given cluster number k, 20 test runs were conducted on different randomly chosen clusters, and we record both the average and the standard deviation. We evaluate and compare three topic modeling algorithms and three traditional clustering algorithms:

· Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 2001).
· Latent Dirichlet Allocation (LDA) (Blei et al., 2003).
· Locally-consistent Topic Modeling (LTM), the method proposed in this paper.
· The k-means clustering algorithm (Kmeans).
· Spectral clustering based on the normalized cut criterion (NCut) (Shi & Malik, 2000; Ng et al., 2001).
· Nonnegative Matrix Factorization based clustering (NMF) (Xu et al., 2003).

There are two parameters in our LTM approach: the number of nearest neighbors p and the regularization parameter \lambda. Throughout our experiments, we empirically set the number of nearest neighbors p to 5 and the value of the regularization parameter \lambda to 1000.

4.1.2. Performance Evaluation

Table 1 shows the clustering performance of the six approaches.

Table 1. Clustering performance (accuracy) on Reuters
  k     PLSA           LDA            LTM            kmeans         NCut           NMF
  2     0.775±0.144    0.803±0.155    0.892±0.128    0.865±0.160    0.866±0.138    0.854±0.149
  3     0.548±0.112    0.651±0.148    0.814±0.108    0.724±0.199    0.760±0.170    0.774±0.143
  4     0.570±0.145    0.621±0.159    0.775±0.133    0.695±0.179    0.748±0.133    0.725±0.145
  5     0.458±0.146    0.577±0.194    0.712±0.141    0.618±0.180    0.668±0.163    0.681±0.144
  6     0.447±0.108    0.535±0.129    0.675±0.115    0.596±0.212    0.672±0.157    0.659±0.126
  7     0.432±0.077    0.475±0.113    0.644±0.113    0.561±0.165    0.595±0.167    0.609±0.136
  8     0.363±0.091    0.381±0.093    0.534±0.097    0.408±0.147    0.433±0.128    0.480±0.100
  9     0.362±0.113    0.404±0.167    0.596±0.101    0.448±0.193    0.494±0.156    0.544±0.103
  10    0.378±0.069    0.452±0.137    0.576±0.092    0.524±0.172    0.528±0.145    0.544±0.106
  Avg.  0.481          0.544          0.691          0.604          0.641          0.652

We can see that both traditional topic modeling approaches (PLSA and LDA) fail to achieve good performance compared with the standard clustering methods. One reason is that both PLSA and LDA discover the hidden topics in the Euclidean space and fail to consider the discriminant structure. By incorporating the geometric structure information into a graph regularizer and preserving the local consistency, the LTM approach achieves significantly better performance than PLSA and LDA. The performance of LTM is also comparable with the other three clustering algorithms. This shows that considering the intrinsic geometric structure of the document space is important for learning a hidden topic model that better captures the semantic structure.

[Figure 1 omitted: two panels showing the clustering accuracy of LTM, PLSA, LDA, kmeans, NCut, and NMF as \lambda varies from 10^0 to 10^4 (left) and as p varies from 3 to 19 (right).]
Figure 1. The performance of LTM vs. the parameters \lambda and p. LTM is very stable with respect to the parameter \lambda: it achieves consistently good performance with \lambda varying from 100 to 10000. The performance is also stable when the parameter p is between 3 and 9, and then decreases as p increases.

Figure 1 shows how the performance of LTM varies with the parameters \lambda and p. LTM is very stable with respect to the parameter \lambda, achieving consistently good performance with \lambda varying from 100 to 10000. As described above, LTM uses a p-nearest-neighbor graph to capture the local geometric structure of the data. A document is more likely to share the same cluster membership with its p nearest neighbors when p is small; thus, it is expected that the performance decreases as p increases.

4.2. Document Classification

In the document classification problem, we wish to classify a document into two or more mutually exclusive categories. As in any classification problem, we may consider generative approaches or discriminative approaches. In particular, by using one topic model for each class, we obtain a generative model for classification. It is also of interest to use topic modeling approaches (PLSA, LDA and LTM) in the discriminative framework, and this is our focus in this section. A challenging aspect of the document classification problem is the choice of features. Treating individual words as features yields a rich but very large feature set (Joachims, 1998). One way to reduce this feature set is to use topic modeling approaches for dimensionality reduction. In particular, PLSA or LTM reduces any document to a fixed set of real-valued features P(z_k|d_i). It is of interest to see how much discriminatory information we lose in reducing the document description to these parameters.

4.2.1. Data and Experimental Settings

The experimental settings in this work are basically the same as those in (Blei et al., 2003). Our empirical study was conducted on a subset of the NIST Topic Detection and Tracking corpus (TDT2, http://www.nist.gov/speech/tests/tdt/tdt98/index.htm). This corpus consists of data collected during the first half of 1998 from six sources, including two newswires (APW, NYT), two radio programs (VOA, PRI), and two television programs (CNN, ABC).
We use the 10 largest categories in this experiment, which include 7,456 documents and 33,947 distinct words. In order to randomize the experiments, we conduct the evaluations with the training size for each category ranging from 2 to 20. For each case, 20 test runs were conducted on different randomly chosen labeled samples, and we record both the average and the standard deviation of the error rate.

In these experiments, we estimate the parameters of a PLSA (LDA) model on all the documents, without reference to their true class labels. We then train a support vector machine (SVM) on the low-dimensional representations and compare it with an SVM trained on all the word features. For LTM, the label information of the training set can naturally be used to construct the graph, so we train two kinds of LTM models. One is purely unsupervised and still uses a p-nearest-neighbor graph; the other utilizes the label information of the training set. In the latter case, we construct the graph in a semi-supervised manner, i.e., we modify the p-nearest-neighbor graph by removing edges between samples belonging to different categories and adding edges between samples belonging to the same category.

4.2.2. Performance Evaluation

Table 2 shows the classification error rate of the five approaches.

Table 2. Classification error rate on TDT2 (training size per category in the first column)
  # train  Word Feature    PLSA           LDA            LTM            LTM with Label
  2        0.292±0.060     0.180±0.044    0.171±0.055    0.078±0.042    0.074±0.040
  4        0.213±0.034     0.127±0.036    0.116±0.031    0.065±0.015    0.057±0.010
  6        0.151±0.049     0.099±0.025    0.099±0.034    0.066±0.016    0.057±0.008
  8        0.117±0.040     0.087±0.015    0.085±0.018    0.060±0.011    0.057±0.009
  10       0.118±0.034     0.088±0.013    0.086±0.022    0.060±0.013    0.055±0.007
  12       0.101±0.023     0.088±0.024    0.079±0.014    0.061±0.018    0.056±0.010
  14       0.095±0.019     0.085±0.020    0.079±0.017    0.062±0.014    0.059±0.011
  16       0.088±0.025     0.079±0.013    0.077±0.012    0.055±0.008    0.054±0.006
  18       0.076±0.021     0.076±0.014    0.072±0.012    0.056±0.010    0.053±0.007
  20       0.072±0.014     0.079±0.013    0.072±0.014    0.056±0.009    0.054±0.007
  Avg.     0.132           0.099          0.094          0.062          0.057

We can see that all three topic modeling approaches improve over the baseline (word features), especially when the number of training samples is small. By incorporating the geometric structure information into a graph regularizer, the LTM approach achieves significantly better performance than PLSA and LDA. When the label information of the training set is incorporated, LTM (with Label) obtains even better performance.

A key problem for all topic modeling approaches is how to estimate the number of hidden topics. Figure 2 shows how the performance of the three approaches varies with the number of topics. Compared with PLSA and LDA, the LTM model is less sensitive to the number of topics. This is another merit of applying LTM.

[Figure 2 omitted: classification error rate of Word feature, PLSA, LDA, and LTM as the number of topics varies from 5 to 100.]
Figure 2. The performance of the three algorithms as the number of topics varies.
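A minimal sketch of the discriminative setup described above (topic features plus an SVM) follows; it uses scikit-learn, the linear kernel is our assumption since the paper does not specify one, and the variable names are illustrative.

from sklearn.svm import SVC

def classify_with_topic_features(p_z_d, train_idx, train_labels, test_idx):
    """Train an SVM on the P(z|d) rows of the training documents and predict the rest."""
    clf = SVC(kernel="linear")
    clf.fit(p_z_d[train_idx], train_labels)
    return clf.predict(p_z_d[test_idx])

Here p_z_d would be the N x K topic matrix estimated on the whole corpus (by PLSA, LDA, or LTM), train_idx and test_idx index the labeled and held-out documents, and the word-feature baseline corresponds to replacing p_z_d with the raw term counts.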
5. Conclusion

We have presented a novel method for dyadic data analysis, called Locally-consistent Topic Modeling (LTM). LTM provides a principled way to incorporate the information in the intrinsic geometric structure of the data. Specifically, we adopt the popular manifold assumption, model the document space as a submanifold embedded in the ambient space, and directly perform topic modeling on this document manifold. As a result, LTM can have more discriminating power than traditional topic modeling approaches that discover the hidden topics in the Euclidean space, e.g., PLSA and LDA. Experimental results on text clustering and classification show that LTM provides a better representation in the sense of semantic structure.

Several questions remain to be investigated in our future work:

1. There is a parameter \lambda which controls the smoothness of our LTM model. LTM boils down to the original PLSA when \lambda = 0, and it is easy to see that P(z_k|d_i) will be the same for all documents when \lambda = +\infty. Thus, a suitable value of \lambda is critical to our algorithm. It remains unclear how to do model selection theoretically and efficiently.

2. We consider topic modeling on the document manifold and develop our approach based on PLSA. The idea of exploiting the manifold structure can also be naturally incorporated into other topic modeling algorithms, e.g., Latent Dirichlet Allocation.

3. It would be very interesting to explore different ways of constructing the document graph to incorporate other prior knowledge. There is no reason to believe that the nearest-neighbor graph is the only or the most natural choice. For example, for web page data it may be more natural to use the hyperlink information to construct the graph (see the sketch below).
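As a purely illustrative note on item 3, a hyperlink-based weight matrix could be assembled as follows; links is a hypothetical list of document index pairs, and this is only a sketch of the idea, not part of the proposed method.

import numpy as np

def build_hyperlink_graph(n_docs, links):
    """Weight matrix built from hyperlinks instead of nearest neighbors."""
    W = np.zeros((n_docs, n_docs))
    for i, s in links:
        W[i, s] = W[s, i] = 1.0   # treat each link as an undirected affinity
    return W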
Acknowledgments

This work was supported by the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT), the National Science Foundation of China under Grant 60875044, the National Key Basic Research Foundation of China under Grant 2009CB320801, and a Yahoo! Ph.D. Fellowship. Any opinions, findings, and conclusions or recommendations expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.

References

Belkin, M., & Niyogi, P. (2001). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14, 585-591. Cambridge, MA: MIT Press.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from examples. Journal of Machine Learning Research, 7, 2399-2434.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Cai, D., Mei, Q., Han, J., & Zhai, C. (2008). Modeling hidden topics on document manifold. CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management (pp. 911-920).

Chung, F. R. K. (1997). Spectral graph theory, vol. 92 of Regional Conference Series in Mathematics. AMS.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391-407.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39, 1-38.

Hofmann, T. (1999). Probabilistic latent semantic indexing. Proc. 1999 Int. Conf. on Research and Development in Information Retrieval (pp. 50-57). Berkeley, CA.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42, 177-196.

Hofmann, T., Puzicha, J., & Jordan, M. I. (1998). Learning from dyadic data. In Advances in Neural Information Processing Systems 11, 466-472.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. European Conference on Machine Learning (pp. 137-142).

Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. Proc. 2006 Int. Conf. Machine Learning (pp. 577-584).

Lovász, L., & Plummer, M. (1986). Matching theory. Budapest: Akadémiai Kiadó; Amsterdam: North-Holland.

Mei, Q., Cai, D., Zhang, D., & Zhai, C. (2008). Topic modeling with network regularization. WWW '08: Proceedings of the 17th International Conference on World Wide Web (pp. 101-110).

Ng, A. Y., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 849-856. Cambridge, MA: MIT Press.

Paige, C. C., & Saunders, M. A. (1982). LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software, 8, 43-71.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 487-494).

Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323-2326.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888-905.

Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. Proc. 2003 Int. Conf. on Research and Development in Information Retrieval (SIGIR '03) (pp. 267-273). Toronto, Canada.

Zhu, X., & Lafferty, J. (2005). Harmonic mixtures: Combining mixture models and graph-based methods for inductive and scalable semi-supervised learning. ICML '05: Proceedings of the 22nd International Conference on Machine Learning (pp. 1052-1059). Bonn, Germany.