Conditional Topic Random Fields

Jun Zhu (junzhu@cs.cmu.edu), Eric P. Xing (epxing@cs.cmu.edu)
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Generative topic models such as LDA are limited by their inability to exploit nontrivial input features, and many topic models assume that the topic assignments of different words are conditionally independent. Some work addresses the second limitation, but no existing work addresses both. This paper presents the conditional topic random field (CTRF) model, which can use arbitrary non-local features of words and documents and incorporates Markov dependencies between the topic assignments of neighboring words. We develop an efficient variational inference algorithm that scales linearly with the number of topics, and a maximum likelihood estimation (MLE) procedure for parameter estimation. For the supervised version of CTRF, we also develop an arguably more discriminative max-margin learning method. We evaluate CTRF on real review rating data, demonstrate its advantages over generative competitors, and show the advantages of max-margin learning over MLE.

1. Introduction

Probabilistic topic models, such as the latent Dirichlet allocation (LDA) model (Blei et al., 2003), have been widely used for inferring a low-dimensional representation that captures the latent semantics of textual or image documents. Such low-dimensional representations can be used for classifying, clustering, or structurally browsing large corpora. However, most existing topic models share two key characteristics that limit their utility.

First, they are essentially not "feature-based" models, due to their generative nature. More precisely, a topic model specifies the joint likelihood of all the introduced variables, which prevents flexible incorporation of nontrivial features of the data, such as non-local contextual or summary features of an article or image, because directly modeling such features as random variables would result in a prohibitively large state space that makes inference and learning very difficult, if not impossible. One may find convincing arguments for instead preferring a feature-based model in various applications in the celebrated paper on conditional random fields (CRFs) (Lafferty et al., 2001). Second, with some exceptions (Gruber et al., 2007; Wallach, 2006; Chen et al., 2009), many topic models assume that the topic assignments of different text or image words are conditionally independent and do not depend on the ordering of words. Such oversimplifying assumptions can be harmful in critical applications such as scene classification.

Recently, a number of attempts have been made to address the conditional-independence limitation. At the data-representation level, "bags of region pairs" (Gökalp & Aksoy, 2007) or "doublets" (Sivic et al., 2005) have been used to incorporate important spatial information in computer vision tasks, and a bi-gram language model (Wallach, 2006) has been used to capture word-ordering information for text mining. At the latent-topic level, structured topic models with Markov properties (Gruber et al., 2007; Verbeek & Triggs, 2007; Wang et al., 2009b) or with latent permutations (Chen et al., 2009) have been proposed to capture the correlations between topic assignments of neighboring words. However, such models are all generative in nature and cannot flexibly exploit non-trivial features of the entity in question. To our knowledge, very little progress has been made toward enriching feature usage in a topic model, and no successful attempt exists that jointly enhances feature usage and relaxes the conditional-independence assumption. Perhaps the most relevant work
along this direction is the Dirichlet-multinomial regression (DMR) model (Mimno & McCallum, 2008), which can use arbitrary document-level features to refine the Dirichlet priors in LDA. However, due to the fully generative nature of the underlying topic model, DMR cannot incorporate word-level features. Moreover, DMR assumes that the topic assignments of different words are conditionally independent. Thus this model still largely suffers from the two limitations discussed above.

In this paper, we propose a new model called the conditional topic random field (CTRF), which addresses both the feature and the independence limitations in a simple but rich statistical framework, and we present efficient algorithms for training CTRFs. Underlying our proposed model is a key reliance on a conditional, rather than generative, scheme for defining the likelihood of observations, which parallels the well-known contrast between a CRF and an HMM. A conditional model specifies a conditional likelihood of the observed documents and can incorporate arbitrary overlapping input features. (See (Lafferty et al., 2001) for a discussion of why this is feasible in a conditional model but not in a conventional generative model.) Our proposed model employs the general structure of a GLM to define a conditional distribution of latent topic assignments over words, rather than the conventional multinomial model used in LDA. It retains all the remaining structure of the original topic models while incorporating arbitrary input features about words and documents; when new features are included, there is no need to alter this general design, so all the proposed inference and learning algorithms still apply. Moreover, CTRF directly incorporates Markov dependencies between the topic assignments of neighboring words, based on the same GLM principle. Depending on the nature of the training data and the application, like LDA, CTRFs can be specialized into an unsupervised or a supervised version. We present a maximum likelihood estimation (MLE) algorithm for unsupervised CTRFs. For supervised CTRFs, which aim to discover predictive latent topic representations by incorporating commonly available side information, the same MLE method still applies, but we also present a max-margin learning algorithm that is arguably preferable for discriminative tasks. Finally, we demonstrate the advantages of CTRFs and of max-margin learning on real review rating data.

The paper is structured as follows. Sec. 2 presents CTRFs. Sec. 3 presents inference and MLE estimation. Sec. 4 presents max-margin learning for supervised CTRFs. Sec. 5 presents empirical results and Sec. 6 concludes.
Figure 1. (a) The word generating model in LDA; (b) a generative model for a word and features; and (c) a conditional word generating model with features.

2. Conditional Topic Random Fields

For clarity and self-containedness, we begin with a brief recap of the standard LDA and supervised LDA.

2.1. LDA and sLDA

The latent Dirichlet allocation (LDA) model (Blei et al., 2003) defines the document likelihood using a hierarchical Bayesian scheme. Specifically, a topic proportion vector is defined for each document and drawn from a Dirichlet distribution, and each document word is sampled from a topic-specific word distribution selected by a random draw of the word's topic assignment from the topic proportions. Supervised LDA (sLDA) (Blei & McAuliffe, 2007) introduces a response variable to incorporate commonly available side information (e.g., review rating scores) for discovering predictive topics. Let $K$ be the number of topics; $N$ be the number of terms in the vocabulary; $\beta$ be a $K \times N$ matrix; and each $\beta_k$ be a distribution over the $N$ terms. For a document $d$, the generating procedure of sLDA is:

1. Draw a topic proportion vector $\theta_d \mid \alpha \sim \mathrm{Dir}(\alpha)$.
2. For each word $n$: (a) draw a topic assignment $z_{dn} \mid \theta_d \sim \mathrm{Multi}(\theta_d)$; (b) draw a word $w_{dn} \mid z_{dn}, \beta \sim \mathrm{Multi}(\beta_{z_{dn}})$.
3. Draw a response variable $y \mid \bar{z}_d, \eta, \sigma^2 \sim p(y \mid \bar{z}_d, \eta, \sigma^2)$, where $\bar{z}_d = \frac{1}{N}\sum_{n=1}^{N} z_{dn}$.

Here, $z_{dn}$ is a $K$-dimensional indicator vector that represents the topic assignment of $w_{dn}$, and $\theta_d$ is the mixture proportion over topics of document $d$. The sLDA model can be used for regression by defining $p$ as a normal distribution, or for classification by defining $p$ as a logistic regression model (Wang et al., 2009a). To learn an sLDA, both maximum likelihood estimation (Blei & McAuliffe, 2007) and max-margin learning (Zhu et al., 2009) have been developed.
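To make the sLDA generating procedure above concrete, here is a minimal forward-sampling sketch in Python/NumPy. It is an illustration only: the topic matrix beta, the Dirichlet parameter alpha, and the response parameters eta and sigma2 are placeholders supplied by the caller, not values estimated in the paper.

```python
import numpy as np

def sample_slda_document(alpha, beta, eta, sigma2, n_words, rng=None):
    """Forward-sample one document from the sLDA generative process.

    alpha  : (K,) Dirichlet hyperparameter
    beta   : (K, V) topic-word distributions (rows sum to 1)
    eta    : (K,) regression coefficients for the response
    sigma2 : response noise variance
    """
    rng = rng or np.random.default_rng()
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                               # 1. topic proportions
    z = rng.choice(K, size=n_words, p=theta)                   # 2(a). topic per word
    words = np.array([rng.choice(V, p=beta[k]) for k in z])    # 2(b). word per topic
    z_bar = np.bincount(z, minlength=K) / n_words              # empirical topic frequencies
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))               # 3. Gaussian response
    return words, z, y
```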
2.2. Conditional Topic Random Fields

The basic idea of CTRF can be understood from a simple word generating model. In LDA or sLDA, each word is represented as a mixture of latent topics $Z$, as illustrated in Figure 1(a), where the generating probability is estimated from a document-word count matrix. However, using the word and its count frequency, as LDA does, is insufficient for resolving a word's meaning ambiguity. For example, in a hotel review the word "good" can describe a positive aspect, but it can also describe a negative aspect when used with a denying word, as in "the service is not good". Therefore, exploiting a rich set of input features (e.g., non-local contextual features) is expected to yield better models, both in the latent topics they discover and in their performance on prediction tasks such as regression.

To incorporate a rich set of useful features in a word generating model, a fully generative approach is shown in Figure 1(b), where we use $a = \{a_1, \cdots, a_M\}$ to denote a set of local and global features, such as POS tags. In this generative model, we need to assume that the features are conditionally independent and define a conditional distribution $p(a_i \mid z)$ for each feature $i$, such as a multinomial distribution for quantized code-words (Sivic et al., 2005) or a Gaussian distribution for continuous data (Welling et al., 2004; Xing et al., 2005). There have been extensive discussions of why such models are difficult to learn: the possibly prohibitive parameterization cost of defining $p(w, a_{1:M} \mid z)$, the need for large training corpora to reliably estimate these parameters, and the complexity of unsupervised training due to the coupling of all features when $Z$ is unobserved. More critically, the inherited conditional-independence assumptions in such a model are unrealistic and limit its utility.

An alternative approach to incorporating rich features in a word generating model is conditional modeling, which treats the features as conditions and directly defines the word generating distribution, as illustrated in Figure 1(c). The conditional model defines a distribution $p(w \mid a_{1:M}) = \sum_z p(z \mid a_{1:M})\, p(w \mid z)$, which is again a mixture of latent topic components, but the mixing weights are defined via a richer log-linear model rather than a mere multinomial. As discussed above, conditional models can incorporate arbitrary input features without changing the model structure or the inference algorithm when new features are added.

Based on the conditional word generating model of Figure 1(c), a conditional topic model (CdTM) can be defined as illustrated in Figure 2 (Left). Here, we assume that each document consists of $S$ sentences and each sentence has $N$ words. The generating procedure of CdTM is similar to that of LDA. The difference is that, instead of using a multinomial distribution, the mixing weights over topics in CdTM are defined with a generalized linear model (GLM):
\[ p(z_{sn} \mid \lambda, a) = \frac{\exp\{\lambda^\top f(z_{sn}, a)\}}{\sum_{z_{sn}} \exp\{\lambda^\top f(z_{sn}, a)\}}, \]
where $f$ is a vector of feature functions defined on arbitrary features related to words and documents. As a Bayesian model, the latent parameter variable $\lambda$ can have an arbitrary prior. Here, we choose an independent multivariate normal distribution, i.e., $p(\lambda \mid \mu, \Sigma) = \prod_m p(\lambda_m \mid \mu_m, \Sigma_m) = \prod_m \mathcal{N}(\mu_m, \Sigma_m)$. One nice consequence of this choice is that a CdTM with a normal prior generalizes the correlated topic model (CTM) (Blei & Lafferty, 2006), which uses only one trivial feature that equals one for every word.

Figure 2. (Left) Supervised conditional topic models; (Right) supervised linear-chain CTRF.

Although the CdTM addresses the feature limitation, it still suffers from the conditional-independence limitation. To address both limitations, we further propose the conditional topic random field (CTRF). The linear-chain CTRF is shown in Figure 2 (Right), where the topic assignments of the words in the same sentence mutually influence each other through a conditional random field (Lafferty et al., 2001). The generating procedure of a CTRF is:

1. For $m \in \{1, \cdots, M\}$, sample $\lambda_m \sim p(\lambda_m \mid \mu_m, \Sigma_m)$.
2. For $s \in \{1, \cdots, S\}$: (a) sample $z_s \sim P_{\mathrm{ctrf}}(z_s \mid \lambda, a)$; (b) for $n \in \{1, \cdots, N\}$, sample $w_{sn} \sim \mathrm{Multi}(\beta_{z_{sn}})$,

where $P_{\mathrm{ctrf}}(z_s \mid \lambda, a)$ is a conditional topic random field over the topic assignments of all the words in one sentence. By random field theory (Lafferty et al., 2001), $P_{\mathrm{ctrf}}(z_s \mid \lambda, a)$ has the log-linear form
\[ p(z_s \mid \lambda, a) = \exp\{\lambda^\top f(a, z_s) - A_s(\lambda, a)\}, \]
where $A_s(\lambda, a) = \log\big(\sum_{z_s} \exp(\lambda^\top f(a, z_s))\big)$ is the log-partition function. For a linear-chain CTRF, we have both singleton feature functions $f(a, z_{sn})$ and pairwise feature functions $f(a, z_{sn}, z_{sn+1})$, and $f(a, z_s) = \sum_n [f(a, z_{sn}) + f(a, z_{sn}, z_{sn+1})]$ is the cumulative feature value on a sentence. For simplicity, we assume that a pairwise feature function $f(a, z_{sn}, z_{sn+1})$ equals zero if $z_{sn} \neq z_{sn+1}$. This assumption is reasonable in many applications; for example, in text documents we expect neighboring words in the same sentence to be assigned to the same topic when they share certain feature patterns. (Gruber et al. (2007) made a stricter assumption, namely that all words in the same sentence are in the same topic.) With this assumption, each $\lambda_m$ is a $K$-dimensional vector whose elements are indexed by the (singleton or pairwise) topic assignment. We use $f_m$ to denote the $K$-dimensional column vector of which only one element is non-zero, according to the topic assignment; $f$ is the $MK$-dimensional vector obtained by stacking the $f_m$; $\lambda$ is the corresponding stacking of the $\lambda_m$; and $I$ and $P$ are the index sets of singleton and pairwise features, respectively.
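As an illustration of the linear-chain distribution $P_{\mathrm{ctrf}}$ above, the following sketch scores a topic sequence for one sentence and computes the log-partition function $A_s(\lambda, a)$ with a forward recursion. Under the diagonal pairwise assumption (pairwise features vanish unless neighboring topics agree), each forward step costs O(K), so the recursion is O(NK). The dense (N, K) score arrays are a simplified stand-in for the inner products $\lambda^\top f(\cdot)$, not the paper's exact data layout.

```python
import numpy as np
from scipy.special import logsumexp

def ctrf_log_partition(node_score, pair_score):
    """Exact log-partition A_s for a linear-chain CTRF whose pairwise features
    vanish when neighboring topics differ (diagonal transitions).

    node_score : (N, K), node_score[n, k] = lambda^T f(a, z_sn = k)
    pair_score : (N-1, K), pair_score[n, k] = lambda^T f(a, z_sn = k, z_s,n+1 = k)
    """
    N, K = node_score.shape
    log_alpha = node_score[0].astype(float)
    for n in range(1, N):
        total = logsumexp(log_alpha)              # sum over all previous topics
        stay = log_alpha + pair_score[n - 1]      # bonus for keeping the same topic
        # sum_j alpha_{n-1}(j) psi(j,k) = exp(total) + exp(stay_k) - exp(log_alpha_k)
        log_alpha = node_score[n] + total + np.log1p(
            np.exp(stay - total) - np.exp(log_alpha - total)
        )
    return logsumexp(log_alpha)

def ctrf_log_prob(z, node_score, pair_score):
    """Log-probability of a topic sequence z (values in 0..K-1) under P_ctrf."""
    N = len(z)
    score = node_score[np.arange(N), z].sum()
    score += sum(pair_score[n, z[n]] for n in range(N - 1) if z[n] == z[n + 1])
    return score - ctrf_log_partition(node_score, pair_score)
```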
Now we explain the response variable $Y$ in Figure 2, which has been ignored on purpose so far. As in sLDA (Blei & McAuliffe, 2007), the variable $Y$ incorporates commonly available supervised side information, such as the rating scores of hotel reviews or the scene categories associated with images (Wang et al., 2009a). By exploiting supervised side information, the supervised CTRF (sCTRF) is expected to discover predictive topic representations, as has been demonstrated for supervised topic models (Blei & McAuliffe, 2007; Zhu et al., 2009; Wang et al., 2009a). Depending on the properties of $Y$, its generating model can be defined accordingly. For continuous $Y$, as in sLDA, $p(y \mid \bar{z}, \eta, \sigma^2) = \mathcal{N}(\eta^\top \bar{z}, \sigma^2)$, where $\bar{z} = \frac{1}{NS}\sum_{s,n} z_{sn}$; for categorical $Y$, the distribution can be a logistic regression model. Here, we consider the continuous case; the extension to the discrete case can be done as in (Zhu et al., 2009; Wang et al., 2009a). Since the supervised CTRF subsumes the unsupervised CTRF and the CdTM, we will stick to the sCTRF when presenting the posterior inference and parameter estimation algorithms, and highlight the necessary modifications for the other models.
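For concreteness, here is a small sketch of the continuous response model just described: given hard topic assignments for all words of a document, it forms the empirical topic frequency vector $\bar{z}$ and evaluates the Gaussian response log-likelihood. The (S, N) array shape is an assumption made for the sketch.

```python
import numpy as np

def response_log_likelihood(z, y, eta, sigma2, K):
    """Gaussian response model p(y | z_bar) = N(eta^T z_bar, sigma2).

    z : (S, N) integer array of topic assignments for all words in a document
    y : observed response (e.g., log rating score)
    """
    z_bar = np.bincount(z.ravel(), minlength=K) / z.size   # empirical topic frequencies
    mean = eta @ z_bar
    return -0.5 * np.log(2 * np.pi * sigma2) - (y - mean) ** 2 / (2 * sigma2)
```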
3. Posterior Inference and Estimation

The difficulty of inferring the posterior distribution of the latent variables and estimating the unknown parameters of a CTRF arises from two sources. First, the topic distributions of different words in the same sentence are strongly correlated through the topic random field. Second, the conditional distribution $p(z_s \mid \lambda, a)$ contains a log-partition function that involves a summation over an exponential number of latent topic assignments. Since exact inference is intractable, we develop an efficient variational inference algorithm to approximate the posterior distribution. Our method relies on a forward-backward message passing procedure on the linear chain, which scales linearly with the number of topics. For parameter estimation, we present the standard maximum likelihood estimation (MLE), which applies to both supervised and unsupervised CTRFs.

3.1. Posterior Inference

An Approximate Bound: By applying Jensen's inequality, we obtain a lower bound on the log-likelihood of a document under an sCTRF, i.e., $\log p(y, w \mid \mu, \Sigma, \beta, \sigma^2, \eta) \geq L$, where
\[ L \triangleq \sum_m \mathbb{E}[\log p(\lambda_m \mid \mu_m, \Sigma_m)] + \sum_s \mathbb{E}[\log p(z_s \mid \lambda, a)] + \sum_{s,n} \mathbb{E}[\log p(w_{sn} \mid z_{sn}, \beta)] + \mathbb{E}[\log p(y \mid \bar{z}, \eta, \sigma^2)] + H(q), \]
and $H(q) = -\mathbb{E}[\log q]$ is the entropy of the variational distribution $q$. Here, we make the mean-field assumption about $q$, which therefore has the factorized form
\[ q(\lambda, z \mid \gamma, \nu^2, \phi) = \prod_{m,k} q(\lambda_{mk} \mid \gamma_{mk}, \nu^2_{mk}) \prod_{s,n} q(z_{sn} \mid \phi_{sn}), \]
where $\gamma_{mk}$ and $\nu^2_{mk}$ are the mean and variance of the univariate normal $q(\lambda_{mk} \mid \gamma_{mk}, \nu^2_{mk})$, and $\phi_{sn}$ is the $K$-dimensional parameter of the topic distribution of $z_{sn}$. We use $\nu^2$ to denote the vector obtained by stacking the $\nu^2_{mk}$, analogous to $\lambda$, and define $V \triangleq \mathrm{diag}(\nu^2)$. With this assumption, all the terms except the second can be computed efficiently, as in the CTM (Blei & Lafferty, 2006); we omit the details to save space. For the second term, we have $\mathbb{E}[\log p(z_s \mid \lambda, a)] = \gamma^\top \mathbb{E}[f(a, z_s)] - \mathbb{E}[A_s(\lambda, a)]$, which cannot be computed efficiently because of the expectation of the log-partition function $A_s(\lambda, a)$. Therefore, we further relax the lower bound $L$. Let $h_s(q) \triangleq \sum_{z_s} \mathbb{E}[\exp(\lambda^\top f(a, z_s))]$. Using the inequality $\log x \leq a^{-1}x - 1 + \log a$ for any $a > 0$, we have
\[ \mathbb{E}[A_s(\lambda, a)] \leq \zeta_s^{-1} h_s(q) - 1 + \log \zeta_s, \quad (1) \]
where $\zeta_s$ is a positive variational parameter. With the assumption on $q$ and a little algebra, we have
\[ \mathbb{E}[\exp(\lambda^\top f(a, z_s))] = \exp\Big\{\gamma^\top f(a, z_s) + \tfrac{1}{2} f(a, z_s)^\top V f(a, z_s)\Big\}. \]
For the univariate CdTM, $h_s(q)$ can be computed exactly. However, for a non-trivial chain-structured random field, the quadratic term $f(a, z_s)^\top V f(a, z_s)$ couples the topic assignments of all the words in a sentence, so computing $h_s(q)$ exactly is intractable. Here, we approximate this term by treating the per-position features $f_n(a, z_s)$ as independent:
\[ \mathbb{E}[\exp(\lambda^\top f(a, z_s))] \approx \prod_n \mathbb{E}[\exp(\lambda^\top f_n(a, z_s))] = \prod_n \exp\Big\{\gamma^\top f_n(a, z_s) + \tfrac{1}{2} f_n(a, z_s)^\top V f_n(a, z_s)\Big\}, \quad (2) \]
where $f_n(a, z_s) = f(a, z_{sn}) + f(a, z_{sn}, z_{sn+1})$. Now we can develop a forward-backward message passing procedure to compute $h_s(q)$. Specifically, using $f_n^2(a, z_s)$ to denote the element-wise product $f_n(a, z_s) \cdot f_n(a, z_s)$, we have
\[ h_s(q) = \sum_{z_s} \exp\Big\{\sum_n \gamma^\top f_n(a, z_s) + \tfrac{1}{2}\sum_n (\nu^2)^\top f_n^2(a, z_s)\Big\}. \]
Since both $f_n$ and $f_n^2$ depend only on the cliques (i.e., the nodes and edges of the linear-chain CTRF), $h_s(q)$ and its gradients $\nabla_\gamma h_s(q)$ and $\nabla_{\nu^2} h_s(q)$ can be computed efficiently with a forward-backward message passing procedure (Lafferty et al., 2001), with complexity $O(NK)$ because the transition matrices are diagonal under our assumption on the pairwise feature functions. Our empirical studies show that this simple approximation works well; developing a tighter approximation is left for future work.

Now, substituting the inequality (1) and the approximation (2) into the variational lower bound $L$, we obtain an approximate bound $\hat{L}$, which is a function of $(q, \zeta, \mu, \Sigma, \beta, \sigma^2, \eta)$.

Inference: Given the parameters $(\mu, \Sigma, \beta, \sigma^2, \eta)$, inference is done with a coordinate ascent method, iteratively optimizing $\hat{L}$ with respect to each variational parameter while holding all the others fixed. Let $q^{old}$ be the current solution of $q$; the update rule for $\zeta_s$ is simply $\zeta_s = h_s(q^{old})$.
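To illustrate how $h_s(q)$ (and hence the $\zeta_s$ update) can be computed with the same forward recursion used for the chain partition function, the sketch below folds the variational means and variances into per-clique scores and reuses the `ctrf_log_partition` routine from the earlier sketch. It assumes binary feature activations stored as index lists (`active_singleton[n]`, `active_pairwise[n]`), which is our own simplified data layout, not the paper's.

```python
import numpy as np
# reuses ctrf_log_partition from the earlier linear-chain sketch

def log_h_s(gamma, nu2, active_singleton, active_pairwise, K):
    """Compute log h_s(q) = log sum_{z_s} exp{ sum_n gamma^T f_n + 0.5 nu2^T f_n^2 }.

    gamma, nu2          : (M, K) variational means / variances, one row per feature function
    active_singleton[n] : indices of binary singleton features that fire at position n
    active_pairwise[n]  : indices of binary pairwise features that fire on edge (n, n+1)
    """
    N = len(active_singleton)
    node_score = np.zeros((N, K))
    pair_score = np.zeros((N - 1, K))
    for n in range(N):
        for m in active_singleton[n]:       # binary feature: f_m^2 = f_m
            node_score[n] += gamma[m] + 0.5 * nu2[m]
        if n < N - 1:
            for m in active_pairwise[n]:    # contributes only if neighboring topics agree
                pair_score[n] += gamma[m] + 0.5 * nu2[m]
    # same O(NK) forward recursion as for A_s; zeta_s = exp(log_h_s(...)) at q_old
    return ctrf_log_partition(node_score, pair_score)
```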
For the parameters $\phi_{sn}$, we have the update rule
\[ \phi_{snk} \propto \exp\Big\{\log \beta_{k, w_{sn}} + \sum_{m \in I} \gamma_m f_m(a, k) + \sum_{i \in \mathcal{N}_n} m_{i \to n}(k) + \frac{y\,\eta_k}{SN\sigma^2} - \frac{2\eta^\top \phi_{-sn}\,\eta_k + \eta_k^2}{2S^2N^2\sigma^2}\Big\}, \quad (3) \]
where $\mathcal{N}_n$ is the set of neighbors of node $n$; $m_{i \to n}(k) = \sum_j \phi_{sij} \sum_{m \in P} \gamma_m f_m(a, j, k)$ is the mean-field message from node $i$; and $\phi_{-sn} = \sum_{ij} \phi_{ij} - \phi_{sn}$. This update rule makes explicit the additive contributions of the different parts of the model to the topic distribution: the first term comes from the multinomial word generating model; the second considers the local features; the third incorporates the mutual dependency between neighbors; and the last two come from the supervised response model. For the unsupervised CTRF, the update rule contains only the first three terms.

For the parameters $\gamma$ and $\nu^2$, we cannot obtain closed-form update rules, so we apply gradient methods. The gradients are
\[ \nabla_{\gamma_m} \hat{L} = -\Sigma_m^{-1}(\gamma_m - \mu_m) + \sum_s \mathbb{E}[f_m(a, z_s)] - \sum_s \zeta_s^{-1} \nabla_{\gamma_m} h_s(q), \]
\[ \nabla_{\nu^2_{mk}} \hat{L} = -\frac{1}{2}\big(\Sigma_m^{-1}\big)_{kk} - \sum_s \zeta_s^{-1} \nabla_{\nu^2_{mk}} h_s(q) + \frac{1}{2\nu^2_{mk}}. \]
Note that the Newton method used in the CTM (Blei & Lafferty, 2006) optimizes each dimension of $\nu^2$ separately in the log-domain. This coordinate ascent procedure is expensive, especially when the number of features is large. Here, we use the much faster L-BFGS method (Liu & Nocedal, 1989) to optimize $\gamma$ and $\log \nu^2$ jointly.

3.2. Max-Likelihood Estimation

Given a training set $\mathcal{D}$ in which each document $d$ is associated with a true response $y_d$, parameter learning estimates the topics $\beta$, the Gaussian parameters $(\mu, \Sigma)$, and the response model parameters $(\eta, \sigma^2)$. The most common method is maximum likelihood estimation (MLE), which has been used to learn LDA and sLDA. For MLE, we optimize the approximate bound $\sum_d \hat{L}_d$ with a variational EM procedure, which alternates an E-step and an M-step until the bound converges. In the E-step, we perform variational inference for each document as discussed above. In the M-step, we maximize $\sum_d \hat{L}_d$ with respect to the unknown parameters; this can be done as in the CTM (Blei & Lafferty, 2006) for $(\mu, \Sigma, \beta)$ and as in sLDA (Blei & McAuliffe, 2007) for $(\eta, \sigma^2)$. We omit the details due to space limitations.

4. Joint Max-Margin and Max-Likelihood Learning for sCTRF

For supervised CTRFs, our goal is to discover latent topic representations that are suitable for prediction tasks. Besides MLE, another, arguably more discriminative, approach is max-margin learning, which has been applied to learn discriminative latent topic models in MedLDA (Zhu et al., 2009). In this section, we present such a margin-based discriminative approach for learning supervised CTRFs. With the approximate bound $\hat{L}$, we formally define a joint max-margin and max-likelihood estimator of the sCTRF as the solution to the problem
\[ \min_{\Theta,\, q,\, \zeta,\, \xi,\, \xi^*} \; -\sum_{d=1}^{D} \hat{L}_d(\Theta, q) + \frac{1}{2}\|\eta\|_2^2 + C \sum_{d=1}^{D} (\xi_d + \xi_d^*) \]
\[ \text{s.t. } \forall d: \quad y_d - \eta^\top \mathbb{E}[\bar{Z}_d] \leq \epsilon + \xi_d, \qquad -y_d + \eta^\top \mathbb{E}[\bar{Z}_d] \leq \epsilon + \xi_d^*, \qquad \xi_d, \xi_d^* \geq 0, \]
where $C$ is a regularization constant; $\epsilon$ is a precision parameter; $\xi$ and $\xi^*$ are non-negative slack variables; and $\Theta$ denotes all the parameters $(\mu, \Sigma, \beta, \sigma^2, \eta)$. This constrained optimization problem can be solved efficiently with Lagrangian methods. We introduce a pair of Lagrange multipliers $(\mu_d, \mu_d^*)$ for the two constraints associated with each document and perform alternating minimization of the Lagrangian functional. With the same mean-field assumption on $q$, we obtain the same gradients for $\gamma$ and $\nu^2$ as in the posterior inference above. For $\phi$, the update rule is similar to Eq. (3) but has an additional term:
\[ \phi_{snk} \propto \exp\Big\{\log \beta_{k, w_{sn}} + \sum_{m \in I} \gamma_m f_m(a, k) + \sum_{i \in \mathcal{N}_n} m_{i \to n}(k) + \frac{y\,\eta_k}{SN\sigma^2} - \frac{2\eta^\top \phi_{-sn}\,\eta_k + \eta_k^2}{2S^2N^2\sigma^2} + \frac{(\mu_d - \mu_d^*)\,\eta_k}{SN}\Big\}, \]
where the last term arises from the max-margin constraints and plays the role of regularizing the topic assignments. Once the model makes a bad prediction for document $d$, one of the Lagrange multipliers $\mu_d$ and $\mu_d^*$ becomes non-zero, and the last term in the exponential biases the topic assignments toward a better prediction. This regularization effect yields a more predictive topic representation. For $\mu$, $\Sigma$, $\beta$, and $\sigma^2$, the update rules are the same as in MLE. Finally, the parameter $\eta$ can be estimated efficiently by solving a standard support vector regression problem, as in MedLDA (Zhu et al., 2009).
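Since the text leaves the $\eta$-step at "a standard SVR problem", here is a hedged sketch of one way to carry it out: with the variational expectations $\mathbb{E}[\bar{z}_d]$ fixed, fit $\eta$ by $\epsilon$-insensitive linear support vector regression. The use of scikit-learn's LinearSVR is purely illustrative; the paper does not specify a solver, and the function and argument names below are our own.

```python
import numpy as np
from sklearn.svm import LinearSVR

def estimate_eta(expected_zbar, y, C=1.0, epsilon=0.1):
    """Fit the response weights eta by epsilon-insensitive linear SVR.

    expected_zbar : (D, K) matrix whose rows are E[z_bar_d] under the variational posterior
    y             : (D,) observed responses (e.g., log rating scores)
    """
    # no intercept, matching the model's eta^T z_bar mean
    svr = LinearSVR(C=C, epsilon=epsilon, fit_intercept=False)
    svr.fit(expected_zbar, y)
    return svr.coef_    # K-dimensional eta
```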
5. Experiments

In this section, we report empirical results of the supervised CTRF on real review rating data. Our goals are to demonstrate that sCTRF can discover good topic representations and make accurate predictions.

5.1. Data Set

We build a real data set by randomly crawling hotel reviews from TripAdvisor (http://www.tripadvisor.com/), where each review is associated with a global rating score and five aspect rating scores for the aspects Value, Rooms, Location, Cleanliness, and Service. This data set is very interesting and can be used for many data mining tasks, for example, extracting the textual mentions of each aspect (Titov & McDonald, 2008). In these experiments, we focus on predicting the global rating scores of reviews. To avoid too short and too long reviews, we only keep reviews whose character length is between 1500 and 6000. On TripAdvisor, global ratings range from 1 to 5. We randomly select 1000 reviews for each rating, so the data set consists of 5000 reviews in total, which we partition uniformly into training and testing sets. For each review, we use the NLProcessor (http://www.infogistics.com/textanalysis.html) to do part-of-speech (POS) tagging and noun phrase (NP) chunking, and we extract the following features (a small extraction sketch follows below):

- POS-Tag: We distinguish four types of POS tags: Adjective, Noun, Adverb, and Verb. Each type includes all its subcategories; e.g., Adjective includes "JJ" (adjective), "JJR" (comparative adjective), and "JJS" (superlative adjective).
- WordNet: WordNet (http://wordnet.princeton.edu/) is a large lexical database of English. We navigate it from a few seeds of positive (e.g., good, excellent) and negative (e.g., bad, painful) words, and identify whether a word is positive or negative based on synonym and antonym relationships. Words without a strong relationship to the seeds are treated as neutral. For a positive or negative word, we also check whether a denying word (e.g., not, no) appears before it within a word distance of 4.
- NP-Chunking: We define pairwise feature functions for words that are in the same noun or verb phrase, or that are connected by the conjunction "and" or "or".

After removing a standard list of stop words and terms whose count frequency is less than 5, we build a dictionary with 12,000 terms.
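The following sketch illustrates the word-level feature extraction described above. The seed lexicons and the feature encoding are toy stand-ins (the paper expands a few seeds through WordNet and uses the NLProcessor tagger); only the negation window of 4 and the POS grouping follow the text.

```python
# Toy lexicons standing in for the WordNet-derived polarity sets described above.
POSITIVE = {"good", "excellent", "great", "clean", "friendly"}
NEGATIVE = {"bad", "painful", "dirty", "rude", "broken"}
DENYING = {"not", "no", "never", "n't"}

def word_features(tokens, pos_tags, window=4):
    """Return one feature dict per token.

    tokens   : lowercased words of one sentence
    pos_tags : POS tag per token (from an external tagger such as NLProcessor)
    window   : how far back to look for a denying word (4 in the paper)
    """
    feats = []
    for i, (w, tag) in enumerate(zip(tokens, pos_tags)):
        f = {"pos_type": tag[:2]}                                  # JJ/NN/RB/VB groups
        polarity = "pos" if w in POSITIVE else "neg" if w in NEGATIVE else "neutral"
        if polarity != "neutral":
            negated = any(t in DENYING for t in tokens[max(0, i - window):i])
            if negated:
                polarity = {"pos": "re-pos", "neg": "re-neg"}[polarity]
        f["wordnet"] = polarity
        feats.append(f)
    return feats
```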
5.2. Prediction Accuracy

As in (Blei & McAuliffe, 2007), we treat the prediction of rating scores as a regression problem and take the logarithm of the scores to make the response variables approximately normal. We compare sCTRF with sCdTM, sLDA, MedLDA, and the hidden topic Markov model (HTMM) (Gruber et al., 2007). For the unsupervised HTMM, we feed the discovered topic representations into a linear support vector regression (SVR) (Smola & Schölkopf, 2003) to make it consistent with the regression model integrated in MedLDA and sCTRF. For sCTRF, we report two sets of results, obtained with maximum likelihood estimation and with max-margin training.

Figure 3. The predictive R² of different models (sCTRF (margin), sCdTM (margin), sCTRF (MLE), MedLDA, sLDA, HTMM+SVR) as a function of the number of topics (5 to 25).

Fig. 3 shows the predictive R² (Blei & Lafferty, 2006) scores of the different models. First, since supervised topic models can leverage the side information (e.g., rating scores) to discover latent topic representations, they generally outperform the decoupled two-step procedure used with unsupervised topic models. Second, the feature-based sCdTM (with max-margin training) outperforms MedLDA, which uses only the word count feature. Third, sCTRF (max-margin) slightly outperforms sCdTM (max-margin) because of the incorporation of Markov dependencies (similar conclusions hold for the MLE method). Therefore, the superior performance of sCTRF (MLE or max-margin) over sLDA and MedLDA comes from incorporating both rich input features and Markov dependencies. Finally, the max-margin methods (e.g., sCTRF (max-margin)) generally outperform the likelihood-based methods (e.g., sCTRF (MLE)) due to the regularization effect introduced by the margin constraints. For a linear SVR with word count features, the predictive R² is about 0.56, which is comparable to the best performance of MedLDA but worse than the best performance of max-margin sCTRF.
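The evaluation metric can be computed as below. This assumes the usual definition of predictive R² (one minus the ratio of residual to total sum of squares on held-out documents), which is how we read the cited usage; the paper itself does not restate the formula.

```python
import numpy as np

def predictive_r2(y_true, y_pred):
    """Predictive R^2 on a held-out set: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```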
5.3. Characterization of Topic Modeling

In this section, we show some interesting properties of sCTRF as a topic model, which provide insight into its strong regression performance.

5.3.1. Average Topic Distribution

Figure 4. The average distribution over topics for documents with the same rating score. The top row is for max-margin sCTRF and the bottom row is for MedLDA. For each model, we use 5 (left), 10 (middle), and 15 (right) topics.

Fig. 4 shows the average mixture distribution over topics for the two best representatives, max-margin sCTRF and MedLDA. The average is taken over testing documents that have the same rating score. We denote the rating scores from small to large by R1, ..., R5, and we use curves connecting the topic-probability points to show the trends of probability change. The curves of sCTRF show a consistent, smooth change from R1 to R5. For example, when the topic number is 10 or 15, the lowest rating R1 has high probabilities on the first several topics and low probabilities on the last several topics, while the highest rating R5 shows the opposite pattern. In fact, as shown in Table 1 for the 10-topic sCTRF, topics T1 and T2 are about the negative side of a hotel, while topics T9 and T10 are about the positive side. Therefore, reviews with a low rating score have high probability of describing negative aspects (e.g., T1) and low probability of describing positive aspects (e.g., T10). Overall, as the rating moves from R1 to R5, the probability mass shifts smoothly toward positive topics and away from negative topics. When the topic number is small (e.g., 5), the topics do not show a regular negative/positive pattern because of the dominating effect of the Default feature, as we shall see. The smooth change of the probability curves in the max-margin sCTRF results in better prediction under a linear regression model, as shown in Fig. 3. However, for MedLDA, the probability curves tend to mix together and do not show a consistently smooth change. Moreover, the topics discovered by MedLDA do not show a similar positive/negative pattern, as detailed below. (All results are automatic, without any sorting of the topics.) Similarly, the likelihood-based sCTRF obtains a better topic representation than sLDA and a better predictive R²; details are deferred to a full version.

5.3.2. Feature Weights and Sample Topics

Figure 5. The weights of five features (Default, Pos-JJ, Re-Pos-JJ, Neg-JJ, Re-Neg-JJ) in max-margin sCTRF with topic number 5 (left), 10 (middle), and 15 (right).

Fig. 5 shows the weights of the feature functions defined with five features: Default (equal to one for any word); Pos-JJ (positive adjective); Neg-JJ (negative adjective); Re-Pos-JJ (positive adjective preceded by a denying word); and Re-Neg-JJ (negative adjective preceded by a denying word). Note that the Default feature is equivalent to the word counts used in LDA models. Again, we use curves connecting the topic-weight points to show the trends. When the topic number is small, the Default feature dominates; when the topic number is large (e.g., 10 or 15), the Default feature tends to discover prominent latent topics that are common to all documents, e.g., T3 to T8 in Table 1. In contrast, both the positive and the negative adjective features tend to discover topics that are more discriminative for rating prediction, e.g., T1 and T10.

Table 1. Top words in different topics discovered by the 10-topic max-margin sCTRF and the 10-topic MedLDA. For example, the sCTRF topic T1 contains: told, dirty, room, front, asked, hotel, bad, small, worst, poor, manager, called, rude, reception, broken.

Table 1 shows the top words for the topics discovered by the 10-topic max-margin sCTRF and MedLDA.
For the sCTRF, the topics show a regular positive/negative pattern. For example, topics T1 and T2 cover negative aspects of a hotel while T9 and T10 describe positive aspects; in between, topics T3 to T8 tend to be neutral and describe the most common aspects, such as room, food, and staff. Overall, from T1 to T10, the positiveness of the topics increases while the negativeness decreases. This is consistent with the change of the Pos-JJ and Neg-JJ feature weights over topics shown in Fig. 5. In contrast, the topics discovered by MedLDA do not show a regular pattern of positiveness or negativeness. For example, positive words such as "good", "nice", or "great" appear in most of the topics, and within each topic the words are not very coherent, in terms of either POS tags or meaning. MedLDA discovers such less coherent topics because it uses only the Default (i.e., word count) feature, which tends to pick up the common aspects of a hotel, as shown in Fig. 5.

6. Conclusions

We have presented the conditional topic random field (CTRF), a general framework that incorporates arbitrary input features into latent topic models together with Markov dependencies between the topic assignments of neighboring words. We developed efficient inference and MLE parameter estimation algorithms, and, for the supervised CTRF, an arguably more discriminative max-margin learning method. On real review rating data, we demonstrated the interesting topic-modeling characteristics of CTRF and the advantages of max-margin training on prediction tasks.

Acknowledgements

This work is supported by ONR N000140910758, NSF IIS-0713379, NSF Career DBI-0546594, and an Alfred P. Sloan Research Fellowship to EPX.

References

Blei, D. & Lafferty, J. Correlated topic models. NIPS, 2006.
Blei, D. & McAuliffe, J. Supervised topic models. NIPS, 2007.
Blei, D., Ng, A., & Jordan, M. Latent Dirichlet allocation. JMLR, (3):993-1022, 2003.
Chen, H., Branavan, S.R.K., Barzilay, R., & Karger, D. Content modeling using latent permutations. JAIR, (36):129-163, 2009.
Gökalp, D. & Aksoy, S. Scene classification using bag-of-regions representations. CVPR, 2007.
Gruber, A., Rosen-Zvi, M., & Weiss, Y. Hidden topic Markov models. AISTATS, 2007.
Lafferty, J., McCallum, A., & Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001.
Liu, D. & Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, (45):503-528, 1989.
Mimno, D. & McCallum, A. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. UAI, 2008.
Sivic, J., Russell, B., Efros, A., Zisserman, A., & Freeman, W. Discovering objects and their location in images. ICCV, 2005.
Smola, A. & Schölkopf, B. A tutorial on support vector regression. Statistics and Computing, 2003.
Titov, I. & McDonald, R. A joint model of text and aspect ratings for sentiment summarization. ACL, 2008.
Verbeek, J. & Triggs, B. Region classification with Markov field aspect models. CVPR, 2007.
Wallach, H. Topic modeling: Beyond bag-of-words. ICML, 2006.
Wang, C., Blei, D., & Fei-Fei, L. Simultaneous image classification and annotation. CVPR, 2009a.
Wang, C., Thiesson, B., Meek, C., & Blei, D. Markov topic models. AISTATS, 2009b.
Welling, M., Rosen-Zvi, M., & Hinton, G. Exponential family harmoniums with an application to information retrieval. NIPS, 2004.
Xing, E., Yan, R., & Hauptmann, A.
Mining associated text and images with dual-wing harmoniums. UAI, 2005.
Zhu, J., Ahmed, A., & Xing, E. MedLDA: Maximum margin supervised topic models for regression and classification. ICML, 2009.