Supervised topic models

David M. Blei
Department of Computer Science, Princeton University, Princeton, NJ
blei@cs.princeton.edu

Jon D. McAuliffe
Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA
mcjon@wharton.upenn.edu

Abstract

We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.

1 Introduction

There is a growing need to analyze large collections of electronic text. The complexity of document corpora has led to considerable interest in applying hierarchical statistical models based on what are called topics [3, 12, 8, 5]. Formally, a topic is a probability distribution over terms in a vocabulary. Informally, a topic represents an underlying semantic theme; a document consisting of a large number of words might be concisely modelled as deriving from a smaller number of topics. Such topic models provide useful descriptive statistics for a collection, which facilitates tasks like browsing, searching, and assessing document similarity. Most topic models, such as latent Dirichlet allocation (LDA) [3], are unsupervised: only the words in the documents are modelled. The goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection. In this work, we develop supervised topic models, where each document is paired with a response.
The goal is to infer latent topics predictive of the response. Unsupervised LDA has previously been used to construct features for classification. The hope was that LDA topics would turn out to be useful for categorization, since they act to reduce data dimension [3, 6]. However, when the goal is prediction, fitting unsupervised topics may not be a good choice. Consider predicting a movie rating from the words in its review. Intuitively, good predictive topics will differentiate words like "excellent", "terrible", and "average," without regard to genre. But unsupervised topics may correspond to genres, if that is the dominant structure in the corpus.

The distinction between unsupervised and supervised topic models is mirrored in existing dimension-reduction techniques. For example, consider regression on unsupervised principal components versus partial least squares and projection pursuit [9], which both search for covariate linear combinations most predictive of a response variable. In text analysis, McCallum et al. developed a joint topic model for words and categories [10], and Blei and Jordan developed an LDA model to predict caption words from images [2].

This paper is organized as follows. We first develop the supervised latent Dirichlet allocation model (sLDA) for document-response pairs. We derive parameter estimation and prediction algorithms for the real-valued response case. Then we extend these techniques to handle diverse response types, using generalized linear models. We demonstrate our approach on two real-world problems. First, we use sLDA to predict movie ratings based on the text of the reviews. Second, we use sLDA to predict the number of "diggs" that a web page will receive in the www.digg.com community, a forum for sharing web content of mutual interest. The digg count prediction for a page is based on the page's description in the forum.
In both settings, we find that sLDA provides much more predictive power than regression on unsupervised LDA features. The sLDA approach also improves on the lasso, a modern regularized regression technique.

2 Supervised latent Dirichlet allocation

In topic models, we treat the words of a document as arising from a set of latent topics, that is, a set of unknown distributions over the vocabulary. Documents in a corpus share the same set of $K$ topics, but each document uses a mix of topics unique to itself. Thus, topic models are a relaxation of classical document mixture models, which associate each document with a single unknown topic. See [7] for a recent review. Here we build on latent Dirichlet allocation (LDA) [3], a topic model that serves as the basis for many others. In LDA, we treat the topic proportions for a document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic.

In supervised latent Dirichlet allocation (sLDA), we add to LDA a response variable associated with each document. As mentioned, this variable might be the number of stars given to a movie, a count of the users in an on-line community who marked an article interesting, or the category of a document. We jointly model the documents and the responses, in order to find latent topics that will best predict the response variables for future unlabeled documents.

We emphasize that sLDA accommodates various types of response: unconstrained real values, real values constrained to be positive (e.g., failure times), ordered or unordered class labels, nonnegative integers (e.g., count data), and other types. However, the machinery used to achieve this generality complicates the presentation. So we first give a complete derivation of sLDA for the special case of an unconstrained real-valued response.
Then, in Section 2.4, we present the general version of sLDA, and explain how it handles diverse response types.

Focus now on the case $y \in \mathbb{R}$. Fix for a moment the model parameters: the $K$ topics $\beta_{1:K}$ (each $\beta_k$ a vector of term probabilities), the Dirichlet parameter $\alpha$, and the response parameters $\eta$ and $\sigma^2$. Under the sLDA model, each document and response arises from the following generative process:

1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.
2. For each word:
   (a) Draw topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$.
   (b) Draw word $w_n \mid z_n, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_n})$.
3. Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathrm{N}(\eta^\top \bar{z}, \sigma^2)$.

Here we define $\bar{z} := (1/N) \sum_{n=1}^{N} z_n$. The family of probability distributions corresponding to this generative process is depicted as a graphical model in Figure 1.

Notice the response comes from a normal linear model. The covariates in this model are the (unobserved) empirical frequencies of the topics in the document. The regression coefficients on those frequencies constitute $\eta$. Note that a linear model usually includes an intercept term, which amounts to adding a covariate that always equals one. Here, such a term is redundant, because the components of $\bar{z}$ always sum to one.

By regressing the response on the empirical topic frequencies, we treat the response as nonexchangeable with the words. The document (i.e., words and their topic assignments) is generated first, under full word exchangeability; then, based on the document, the response variable is generated. In contrast, one could formulate a model in which $y$ is regressed on the topic proportions $\theta$. This treats the response and all the words as jointly exchangeable. But as a practical matter, our chosen formulation seems more sensible: the response depends on the topic frequencies which actually occurred in the document, rather than on the mean of the distribution generating the topics.
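The three-step generative process above can be simulated directly. The following is a minimal sketch, not part of the paper; the function name and the toy dimensions are ours, and it assumes dense NumPy arrays.

```python
import numpy as np

def simulate_slda_document(alpha, beta, eta, sigma2, n_words, rng):
    """Simulate one (document, response) pair from the sLDA generative process."""
    K, V = beta.shape                                  # K topics over a V-term vocabulary
    theta = rng.dirichlet(alpha)                       # 1. topic proportions ~ Dir(alpha)
    z = rng.choice(K, size=n_words, p=theta)           # 2(a). per-word topic assignments
    words = np.array([rng.choice(V, p=beta[k]) for k in z])   # 2(b). words from topics
    z_bar = np.bincount(z, minlength=K) / n_words      # empirical topic frequencies z_bar
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))       # 3. response ~ N(eta' z_bar, sigma2)
    return words, z, y

rng = np.random.default_rng(0)
K, V = 3, 10
beta = rng.dirichlet(np.ones(V), size=K)
eta = np.array([1.0, 0.0, -1.0])
words, z, y = simulate_slda_document(np.ones(K) / K, beta, eta, 0.25, 50, rng)
```

Note that the response depends on `z_bar`, the realized empirical topic frequencies, not on `theta`, matching the nonexchangeable formulation discussed above.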
Moreover, estimating a fully exchangeable model with enough topics allows some topics to be used entirely to explain the response variables, and others to be used to explain the word occurrences. This degrades predictive performance, as demonstrated in [2].

Figure 1: (Left) The graphical model representation of supervised latent Dirichlet allocation (sLDA). (Right) A 10-topic sLDA model fit to the movie review data of Section 3, showing the top words from each topic plotted against the corresponding regression coefficient.

We treat $\alpha$, $\beta_{1:K}$, $\eta$, and $\sigma^2$ as unknown constants to be estimated, rather than random variables. We carry out approximate maximum-likelihood estimation using a variational expectation-maximization (EM) procedure, which is the approach taken in unsupervised LDA as well [3].

2.1 Variational inference (E-step)

Given a document and response, the posterior distribution of the latent variables is

\[
p(\theta, z_{1:N} \mid w_{1:N}, y, \alpha, \beta_{1:K}, \eta, \sigma^2)
= \frac{p(\theta \mid \alpha) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K}) \right) p(y \mid z_{1:N}, \eta, \sigma^2)}
{\int d\theta\, p(\theta \mid \alpha) \sum_{z_{1:N}} \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K}) \right) p(y \mid z_{1:N}, \eta, \sigma^2)} .
\tag{1}
\]

The normalizing value is the marginal probability of the observed data, i.e., the document $w_{1:N}$ and response $y$. This normalizer is also known as the likelihood, or the evidence. As with LDA, it is not efficiently computable. Thus, we appeal to variational methods to approximate the posterior.
Variational objective function. We maximize the evidence lower bound (ELBO) $\mathcal{L}(\cdot)$, which for a single document has the form

\[
\log p(w_{1:N}, y \mid \alpha, \beta_{1:K}, \eta, \sigma^2) \geq
\mathcal{L}(\gamma, \phi_{1:N}; \alpha, \beta_{1:K}, \eta, \sigma^2)
= \mathrm{E}[\log p(\theta \mid \alpha)]
+ \sum_{n=1}^{N} \mathrm{E}[\log p(Z_n \mid \theta)]
+ \sum_{n=1}^{N} \mathrm{E}[\log p(w_n \mid Z_n, \beta_{1:K})]
+ \mathrm{E}[\log p(y \mid Z_{1:N}, \eta, \sigma^2)]
+ \mathrm{H}(q) .
\tag{2}
\]

Here the expectation is taken with respect to a variational distribution $q$. We choose the fully factorized distribution,

\[
q(\theta, z_{1:N} \mid \gamma, \phi_{1:N}) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n),
\tag{3}
\]

where $\gamma$ is a $K$-dimensional Dirichlet parameter vector and each $\phi_n$ parametrizes a categorical distribution over $K$ elements. Notice $\mathrm{E}[Z_n] = \phi_n$.

The first three terms and the entropy of the variational distribution are identical to the corresponding terms in the ELBO for unsupervised LDA [3]. The fourth term is the expected log probability of the response variable given the latent topic assignments,

\[
\mathrm{E}[\log p(y \mid Z_{1:N}, \eta, \sigma^2)]
= -\frac{1}{2} \log(2\pi\sigma^2) - \frac{\mathrm{E}\left[(y - \eta^\top \bar{Z})^2\right]}{2\sigma^2}
\tag{4}
\]
\[
= -\frac{1}{2} \log(2\pi\sigma^2) - \frac{y^2 - 2y\, \eta^\top \mathrm{E}[\bar{Z}] + \eta^\top \mathrm{E}[\bar{Z}\bar{Z}^\top]\, \eta}{2\sigma^2} .
\tag{5}
\]

The first expectation is

\[
\mathrm{E}[\bar{Z}] = \bar{\phi} := \frac{1}{N} \sum_{n=1}^{N} \phi_n .
\tag{6}
\]

The second expectation is

\[
\mathrm{E}[\bar{Z}\bar{Z}^\top] = \frac{1}{N^2} \left( \sum_{n=1}^{N} \sum_{m \neq n} \phi_n \phi_m^\top + \sum_{n=1}^{N} \mathrm{diag}\{\phi_n\} \right) .
\tag{7}
\]

To see (7), notice that for $m \neq n$, $\mathrm{E}[Z_n Z_m^\top] = \mathrm{E}[Z_n]\, \mathrm{E}[Z_m]^\top = \phi_n \phi_m^\top$, because the variational distribution is fully factorized. On the other hand, $\mathrm{E}[Z_n Z_n^\top] = \mathrm{diag}(\mathrm{E}[Z_n]) = \mathrm{diag}(\phi_n)$, because $Z_n$ is an indicator vector.

For a single document-response pair, we maximize (2) with respect to $\phi_{1:N}$ and $\gamma$ to obtain an estimate of the posterior. We use block coordinate-ascent variational inference, maximizing with respect to each variational parameter vector in turn.

Optimization with respect to $\gamma$. The terms that involve the variational Dirichlet $\gamma$ are identical to those in unsupervised LDA, i.e., they do not involve the response variable $y$. Thus, the coordinate ascent update is as in [3],

\[
\gamma^{\mathrm{new}} \leftarrow \alpha + \sum_{n=1}^{N} \phi_n .
\tag{8}
\]

Optimization with respect to $\phi_j$. Define $\phi_{-j} := \sum_{n \neq j} \phi_n$. Given $j \in \{1, \ldots, N\}$, the ELBO as a function of $\phi_j$ alone is

\[
\mathcal{L}(\phi_j) = \phi_j^\top \mathrm{E}[\log \theta \mid \gamma]
+ \phi_j^\top \mathrm{E}[\log p(w_j \mid \beta_{1:K})]
+ \frac{y}{N\sigma^2}\, \phi_j^\top \eta
- \frac{\eta^\top \mathrm{E}[\bar{Z}\bar{Z}^\top]\, \eta}{2\sigma^2}
- \phi_j^\top \log \phi_j + \mathrm{const} .
\]

Log of a vector means the vector of logs. The term with the quadratic form simplifies, keeping only the terms involving $\phi_j$, to

\[
-\frac{1}{2N^2\sigma^2} \left( 2 (\eta^\top \phi_{-j}) (\eta^\top \phi_j) + (\eta \circ \eta)^\top \phi_j \right) .
\tag{9}
\]

Thus the gradient is

\[
\frac{\partial \mathcal{L}}{\partial \phi_j}
= \mathrm{E}[\log \theta \mid \gamma]
+ \mathrm{E}[\log p(w_j \mid \beta_{1:K})]
+ \frac{y}{N\sigma^2}\, \eta
- \frac{1}{2N^2\sigma^2} \left( 2 (\eta^\top \phi_{-j})\, \eta + (\eta \circ \eta) \right)
- \log \phi_j - 1 .
\tag{10}
\]

Using this gradient to maximize the Lagrangian, which incorporates the constraint that the components of $\phi_j$ sum to one, we obtain the coordinate update

\[
\phi_j^{\mathrm{new}} \propto \exp\left\{
\mathrm{E}[\log \theta \mid \gamma]
+ \mathrm{E}[\log p(w_j \mid \beta_{1:K})]
+ \frac{y}{N\sigma^2}\, \eta
- \frac{1}{2N^2\sigma^2} \left( 2 (\eta^\top \phi_{-j})\, \eta + (\eta \circ \eta) \right)
\right\} .
\tag{11}
\]

Exp of a vector means the vector of exponentials. Here, the proportionality symbol means the components of $\phi_j^{\mathrm{new}}$ are computed according to (11), then normalized to sum to one. Note that $\mathrm{E}[\log \theta_i \mid \gamma] = \Psi(\gamma_i) - \Psi\!\left(\sum_k \gamma_k\right)$, where $\Psi(\cdot)$ is the digamma function.

The central difference between LDA and sLDA lies in this update. As in LDA, the $j$th word's variational distribution over topics depends on the word's topic probabilities under the actual model (determined by $\beta_{1:K}$). But $w_j$'s variational distribution, and those of all other words, affect the probability of the response, through the expected residual sum of squares $\mathrm{E}[(y - \eta^\top \bar{Z})^2]$ from (4). The end result is that the update (11) also encourages $\phi_j$ to decrease this expected residual sum of squares. Note that the update (11) depends on the variational parameters $\phi_{-j}$ of all other words. Consequently, unlike LDA, the $\phi_j$ cannot be updated in parallel, and distinct occurrences of the same term must be treated separately.

2.2 Parameter estimation (M-step)

The corpus-level ELBO lower bounds the joint log likelihood across documents, which is the sum of the per-document log likelihoods. In the E-step, we estimate the approximate posterior distribution for each document-response pair using the variational inference algorithm described above.
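The coordinate-ascent updates (8) and (11) can be sketched in a few lines for a single document. This is an illustrative implementation under our own naming conventions, assuming the normal-response case and dense NumPy arrays; the paper itself does not prescribe this initialization or iteration count.

```python
import numpy as np
from scipy.special import digamma

def slda_e_step(words, beta, alpha, eta, sigma2, y, n_iter=50):
    """Coordinate-ascent variational E-step for one document (normal response).

    words: length-N array of term indices; beta: K x V topic matrix.
    Runs the phi update (11) and the gamma update (8) until n_iter sweeps."""
    N, K = len(words), beta.shape[0]
    phi = np.full((N, K), 1.0 / K)               # q(z_n | phi_n), uniform start
    gamma = alpha + N / K                        # common Dirichlet initialization
    for _ in range(n_iter):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())  # E[log theta | gamma]
        for j in range(N):
            phi_minus_j = phi.sum(axis=0) - phi[j]           # sum_{n != j} phi_n
            log_phi = (e_log_theta
                       + np.log(beta[:, words[j]])           # E[log p(w_j | beta)]
                       + (y / (N * sigma2)) * eta
                       - (2 * (eta @ phi_minus_j) * eta + eta * eta)
                         / (2 * N**2 * sigma2))              # update (11), in log space
            log_phi -= log_phi.max()                         # stabilize before exp
            phi[j] = np.exp(log_phi) / np.exp(log_phi).sum()
        gamma = alpha + phi.sum(axis=0)                      # update (8)
    return phi, gamma
```

Note the inner loop updates each $\phi_j$ sequentially, since (11) depends on $\phi_{-j}$; as the text observes, these updates cannot be run in parallel.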
In the M-step, we maximize the corpus-level ELBO with respect to the model parameters $\beta_{1:K}$, $\eta$, and $\sigma^2$. For our purposes, it suffices simply to fix $\alpha$ to $1/K$ times the ones vector. In this section, we add document indexes to the previous section's quantities, so $y$ becomes $y_d$ and $\bar{Z}$ becomes $\bar{Z}_d$.

Estimating the topics. The M-step updates of the topics $\beta_{1:K}$ are the same as for unsupervised LDA, where the probability of a word under a topic is proportional to the expected number of times that it was assigned to that topic [3],

\[
\hat{\beta}_{k,w}^{\mathrm{new}} \propto \sum_{d=1}^{D} \sum_{n=1}^{N} 1(w_{d,n} = w)\, \phi_{d,n}^{k} .
\tag{12}
\]

Here again, proportionality means that each $\hat{\beta}_k^{\mathrm{new}}$ is normalized to sum to one.

Estimating the regression parameters. The only terms of the corpus-level ELBO involving $\eta$ and $\sigma^2$ come from the corpus-level analog of (5). Define $y = y_{1:D}$ as the vector of response values across documents, and let $A$ be the $D \times K$ matrix whose rows are the vectors $\bar{Z}_d^\top$. Then the corpus-level version of (5) is

\[
\mathrm{E}[\log p(y \mid A, \eta, \sigma^2)]
= -\frac{D}{2} \log(2\pi\sigma^2)
- \frac{1}{2\sigma^2}\, \mathrm{E}\left[ (y - A\eta)^\top (y - A\eta) \right] .
\tag{13}
\]

Here the expectation is over the matrix $A$, using the variational distribution parameters chosen in the previous E-step. Expanding the inner product, using linearity of expectation, and applying the first-order condition for $\eta$, we arrive at an expected-value version of the normal equations:

\[
\mathrm{E}[A^\top A]\, \eta = \mathrm{E}[A]^\top y
\quad \Longrightarrow \quad
\hat{\eta}^{\mathrm{new}} \leftarrow \left( \mathrm{E}[A^\top A] \right)^{-1} \mathrm{E}[A]^\top y .
\tag{14}
\]

Note that the $d$th row of $\mathrm{E}[A]$ is just $\bar{\phi}_d$ from (6). Also, $\mathrm{E}[A^\top A] = \sum_d \mathrm{E}[\bar{Z}_d \bar{Z}_d^\top]$, with each term having a fixed value from the previous E-step as well, given by (7).

We now apply the first-order condition for $\sigma^2$ to (13) and evaluate the solution at $\hat{\eta}^{\mathrm{new}}$:

\[
\hat{\sigma}^2_{\mathrm{new}} \leftarrow \frac{1}{D}\, \mathrm{E}\left[ (y - A\hat{\eta}^{\mathrm{new}})^\top (y - A\hat{\eta}^{\mathrm{new}}) \right] .
\tag{15}
\]

From (15) it is clear that $\hat{\sigma}^2_{\mathrm{new}} \geq 0$. Substituting the definition of $\hat{\eta}^{\mathrm{new}}$ and simplifying, we obtain

\[
\hat{\sigma}^2_{\mathrm{new}} = \frac{1}{D} \left( y^\top y - y^\top \mathrm{E}[A] \left( \mathrm{E}[A^\top A] \right)^{-1} \mathrm{E}[A]^\top y \right) .
\tag{16}
\]

It is tempting to try a further simplification of (16): in an ordinary least squares (OLS) regression of $y$ on the columns of $\mathrm{E}[A]$, the analog of (16) would just equal $1/D$ times the sum of the squared residuals $(y - \mathrm{E}[A]\hat{\eta}^{\mathrm{new}})$. However, that identity does not hold here, because the inverted matrix in (16) is $\mathrm{E}[A^\top A]$, rather than $\mathrm{E}[A]^\top \mathrm{E}[A]$. This illustrates that the update (14) is not just OLS regression of $y$ on $\mathrm{E}[A]$. In particular, the residual $y - \mathrm{E}[A]\hat{\eta}^{\mathrm{new}}$ is not orthogonal to the columns of $\mathrm{E}[A]$.

2.3 Prediction

Our focus in applying sLDA is prediction. Specifically, we wish to compute the expected response value, given a new document $w_{1:N}$ and a fitted model $\{\alpha, \beta_{1:K}, \eta, \sigma^2\}$:

\[
\mathrm{E}[Y \mid w_{1:N}, \alpha, \beta_{1:K}, \eta, \sigma^2]
= \eta^\top\, \mathrm{E}[\bar{Z} \mid w_{1:N}, \alpha, \beta_{1:K}] .
\tag{17}
\]

The identity follows easily from iterated expectation. We approximate the posterior mean of $\bar{Z}$ using the variational inference procedure of the previous section. But here, the terms depending on $y$ are removed from the $\phi_j$ update in (11). Notice this is the same as variational inference for unsupervised LDA: since we averaged the response variable out of the right-hand side in (17), what remains is the standard unsupervised LDA model for $Z_{1:N}$ and $\theta$. Thus, given a new document, we first compute $\mathrm{E}_q[Z_{1:N}]$ under the variational posterior distribution of the latent variables $Z_n$. Then, we estimate the response with

\[
\mathrm{E}[Y \mid w_{1:N}, \alpha, \beta_{1:K}, \eta, \sigma^2]
\approx \eta^\top\, \mathrm{E}_q[\bar{Z}] = \eta^\top \bar{\phi} .
\tag{18}
\]

2.4 Diverse response types via generalized linear models

Up to this point, we have confined our attention to an unconstrained real-valued response variable. In many applications, however, we need to predict a categorical label, or a non-negative integral count, or a response with other kinds of constraints. Sometimes it is reasonable to apply a normal linear model to a suitably transformed version of such a response. When no transformation results in approximate normality, statisticians often make use of a generalized linear model, or GLM.
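Given the per-document moments (6) and (7) from the E-step, the regression updates (14) and (16) and the prediction rule (18) reduce to a few lines of linear algebra. This is a sketch under our own naming conventions; the moment matrices are assumed precomputed.

```python
import numpy as np

def m_step_regression(phi_bars, E_AtA, y):
    """Regression M-step: expected normal equations (14) and variance update (16).

    phi_bars: D x K matrix whose d-th row is E[Z_bar_d], i.e. E[A] (eq. 6).
    E_AtA:    K x K matrix, sum_d E[Z_bar_d Z_bar_d^T], i.e. E[A^T A] (eq. 7)."""
    EA = phi_bars
    eta = np.linalg.solve(E_AtA, EA.T @ y)       # update (14)
    D = len(y)
    sigma2 = (y @ y - y @ EA @ eta) / D          # update (16)
    return eta, sigma2

def predict_response(phi_bar_new, eta):
    """Approximate expected response for a new document, eq. (18)."""
    return eta @ phi_bar_new
```

Note that `E_AtA` is not `EA.T @ EA`: it carries the extra diagonal terms from (7), which is exactly why (14) is not OLS regression of $y$ on $\mathrm{E}[A]$.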
In this section, we describe sLDA in full generality, replacing the normal linear model of the earlier exposition with a GLM formulation. As we shall see, the result is a generic framework which can be specialized in a straightforward way to supervised topic models having a variety of response types.

There are two main ingredients in a GLM: the "random component" and the "systematic component." For the random component, one takes the distribution of the response to be an exponential dispersion family with natural parameter $\zeta$ and dispersion parameter $\delta$:

\[
p(y \mid \zeta, \delta) = h(y, \delta) \exp\left\{ \frac{\zeta y - A(\zeta)}{\delta} \right\} .
\tag{19}
\]

For each fixed $\delta$, (19) is an exponential family, with base measure $h(y, \delta)$, sufficient statistic $y$, and log-normalizer $A(\zeta)$. The dispersion parameter provides additional flexibility in modeling the variance of $y$. Note that (19) need not be an exponential family jointly in $(\zeta, \delta)$.

In the systematic component of the GLM, we relate the exponential-family parameter $\zeta$ of the random component to a linear combination of covariates, the so-called linear predictor. For sLDA, the linear predictor is $\eta^\top \bar{z}$. In fact, we simply set $\zeta = \eta^\top \bar{z}$. Thus, in the general version of sLDA, the previous specification in step 3 of the generative process is replaced with

\[
y \mid z_{1:N}, \eta, \delta \sim \mathrm{GLM}(\bar{z}, \eta, \delta) ,
\tag{20}
\]

so that

\[
p(y \mid z_{1:N}, \eta, \delta) = h(y, \delta) \exp\left\{ \frac{\eta^\top \bar{z}\, y - A(\eta^\top \bar{z})}{\delta} \right\} .
\tag{21}
\]

The reader familiar with GLMs will recognize that our choice of systematic component means sLDA uses only canonical link functions. In future work, we will relax this constraint. We now have the flexibility to model any type of response variable whose distribution can be written in exponential dispersion form (19). As is well known, this includes many commonly used distributions: the normal; the binomial (for binary response); the Poisson and negative binomial (for count data); the gamma, Weibull, and inverse Gaussian (for failure time data); and others.
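As an illustrative check, not part of the paper's derivation, the Poisson distribution fits form (19) with $h(y, \delta) = 1/y!$, natural parameter $\zeta = \log \lambda$, log-normalizer $A(\zeta) = e^{\zeta}$, and $\delta = 1$. The sketch below verifies that the exponential dispersion form reproduces the standard Poisson pmf.

```python
import math

def poisson_pmf_glm(y, zeta, delta=1.0):
    """Poisson pmf written in exponential dispersion form (19):
    h(y, delta) = 1/y!, A(zeta) = exp(zeta), zeta = log(rate)."""
    h = 1.0 / math.factorial(y)
    return h * math.exp((zeta * y - math.exp(zeta)) / delta)

# Agreement with the standard pmf exp(-lam) * lam^y / y! at rate lam = 3.
lam = 3.0
zeta = math.log(lam)
for y in range(6):
    standard = math.exp(-lam) * lam ** y / math.factorial(y)
    assert abs(poisson_pmf_glm(y, zeta) - standard) < 1e-12
```

The same exercise works for the other families listed above, with their own choices of $h$ and $A$.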
Each of these distributions corresponds to a particular choice of $h(y, \delta)$ and $A(\zeta)$. For example, it is easy to show that the normal distribution corresponds to $h(y, \delta) = (1/\sqrt{2\pi\delta}) \exp\{-y^2/(2\delta)\}$ and $A(\zeta) = \zeta^2/2$. In this case, the usual parameters $\mu$ and $\sigma^2$ just equal $\zeta$ and $\delta$, respectively.

Figure 2: Predictive $R^2$ (left) and per-word held-out log likelihood (right) for the movie and Digg data, plotted against the number of topics for sLDA and LDA (see Section 3).

Variational inference (E-step). The distribution of $y$ appears only in the cross-entropy term (4). The gradient of the ELBO with respect to $\phi_j$ becomes

\[
\frac{\partial \mathcal{L}}{\partial \phi_j}
= \mathrm{E}[\log \theta \mid \gamma]
+ \mathrm{E}[\log p(w_j \mid \beta_{1:K})]
- \log \phi_j - 1
+ \frac{y}{N\delta}\, \eta
- \frac{1}{\delta} \frac{\partial}{\partial \phi_j} \mathrm{E}\left[ A(\eta^\top \bar{Z}) \right] .
\tag{22}
\]

Thus, the key to variational inference in sLDA is obtaining the gradient of the expected GLM log-normalizer. Sometimes there is an exact expression, such as the normal case illustrated in (10). As another example, the Poisson GLM leads to an exact gradient, which we omit for brevity. Other times, no exact gradient is available. In a longer paper [4], we apply both variational bounding techniques and the multivariate delta method for moments [1] to obtain approximate gradients. The GLM contribution to the gradient determines whether the $\phi_j$ coordinate update itself has a closed form, as it does in the normal case (11) and the Poisson case (omitted). If the update is not closed-form, we use numerical optimization, supplying a gradient obtained from one of the methods described in the previous paragraph.

Parameter estimation (M-step). The topic parameter estimates are given by (12), as before.
For the corpus-level ELBO, the gradient with respect to $\eta$ becomes

\[
\frac{\partial \mathcal{L}}{\partial \eta}
= \frac{1}{\delta} \left( \sum_{d=1}^{D} y_d\, \mathrm{E}_q[\bar{Z}_d]
- \sum_{d=1}^{D} \mathrm{E}_q\left[ \mu(\eta^\top \bar{Z}_d)\, \bar{Z}_d \right] \right) .
\tag{23}
\]

The appearance of $\mu(\cdot) = \mathrm{E}_{\mathrm{GLM}}[Y \mid \zeta = \cdot]$ follows from exponential family properties. This GLM mean response is a known function of $\bar{Z}_d$ in all standard cases. However, $\mathrm{E}_q[\mu(\eta^\top \bar{Z}_d)\, \bar{Z}_d]$ has an exact solution only in some cases (e.g., normal, Poisson). In other cases, we approximate the expectation with methods similar to those applied for the $\phi_j$ coordinate update. Reference [4] has details, including estimation of $\delta$ and prediction, where we encounter the same issues.

3 Empirical results

We evaluated sLDA on two prediction problems. First, we consider "sentiment analysis" of newspaper movie reviews. We use the publicly available data introduced in [11], which contains movie reviews paired with the number of stars given. While Pang and Lee treat this as a classification problem, we treat it as a regression problem. With a 5000-term vocabulary chosen by tf-idf, the corpus contains 5006 documents and comprises 1.6M words.

Second, we introduce the problem of predicting web page popularity on Digg.com. Digg is a community of users who share links to pages by submitting them to the Digg homepage, with a short description. Once submitted, other users "digg" the links they like. Links are sorted on the Digg homepage by the number of diggs they have received. Our Digg data set contains a year of link descriptions, paired with the number of diggs each received during its first week on the homepage. (This corpus will be made publicly available at publication.) We restrict our attention to links in the technology category. After trimming the top ten outliers, and using a 4145-term vocabulary chosen by tf-idf, the Digg corpus contains 4078 documents and comprises 94K words.

For both sets of response variables, we transformed to approximate normality by taking logs.
This makes the data amenable to the continuous-response model of Section 2; for these two problems, generalized linear modeling turned out to be unnecessary. We initialized $\beta_{1:K}$ to uniform topics, $\sigma^2$ to the sample variance of the response, and $\eta$ to a grid on $[-1, 1]$ in increments of $2/K$. We ran EM until the relative change in the corpus-level likelihood bound was less than 0.01%. In the E-step, we ran coordinate-ascent variational inference for each document until the relative change in the per-document ELBO was less than 0.01%. For the movie review data set, we illustrate in Figure 1 a matching of the top words from each topic to the corresponding coefficient $\eta_k$.

We assessed the quality of the predictions with "predictive $R^2$." In a cross-validation, we define this quantity as the fraction of variability in the out-of-fold response values which is captured by the out-of-fold predictions:

\[
\mathrm{pR}^2 := 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2} .
\]

We compared sLDA to unsupervised LDA, followed by linear regression on the $\bar{\phi}_d$ from LDA. This is the regression equivalent of using LDA topics as classification features [3, 6]. Figure 2 (left) illustrates that sLDA provides improved predictions on both data sets. Moreover, this improvement does not come at the cost of document model quality. The per-word held-out likelihood comparison in Figure 2 (right) shows that sLDA fits the document data as well as or better than LDA. Note that the Digg prediction is significantly harder than the movie review sentiment prediction, and that the homogeneity of Digg technology content leads the model to favor a small number of topics.

Finally, we compared sLDA to the lasso, a discriminative regularized regression technique. On the Digg data, the lasso's optimal model complexity yielded a cross-validated pR$^2$ of 0.088. The best sLDA pR$^2$ was 0.095, an 8.0% relative improvement. On the movie data, the best lasso pR$^2$ was 0.457, versus 0.500 for sLDA, a 9.4% relative improvement.
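The predictive $R^2$ defined above is straightforward to compute; the following sketch (function name ours) makes the definition concrete.

```python
import numpy as np

def predictive_r2(y_true, y_pred):
    """Predictive R^2: fraction of out-of-fold response variability
    captured by the out-of-fold predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # sum (y - y_hat)^2
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # sum (y - y_bar)^2
    return 1.0 - ss_res / ss_tot

# A perfect predictor scores 1; always predicting the fold mean scores 0.
y = [1.0, 2.0, 3.0, 4.0]
assert predictive_r2(y, y) == 1.0
assert predictive_r2(y, [2.5] * 4) == 0.0
```

Unlike in-sample $R^2$, this quantity can be negative when predictions are worse than the out-of-fold mean.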
Note moreover that the lasso provides only a prediction rule, whereas sLDA models latent structure useful for other purposes.

4 Discussion

We have developed sLDA, a statistical model of labelled documents. The model accommodates the different types of response variable commonly encountered in practice. We presented a variational procedure for approximate posterior inference, which we then incorporated in an EM algorithm for maximum-likelihood parameter estimation. We studied the model's predictive performance, our main focus, on two real-world problems. In both cases, we found that sLDA moderately improved on the lasso, a state-of-the-art regularized regression method. Compared to linear regression after an unsupervised LDA analysis, sLDA gave a twofold and a 10-fold improvement in predictive accuracy on the two problems. Yet the topic structure recovered by sLDA had higher held-out likelihood than LDA on one problem, and equivalent held-out likelihood on the other. These results illustrate the benefits of supervised dimension reduction when prediction is the ultimate goal.

References

[1] P. Bickel and K. Doksum. Mathematical Statistics. Prentice Hall, 2000.
[2] D. Blei and M. Jordan. Modeling annotated data. In SIGIR, pages 127-134. ACM Press, 2003.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.
[4] D. Blei and J. McAuliffe. Supervised topic models. In preparation, 2007.
[5] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In UAI, 2004.
[6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In IEEE Computer Vision and Pattern Recognition, 2005.
[7] T. Griffiths and M. Steyvers. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, editors, Latent Semantic Analysis: A Road to Meaning. 2006.
[8] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. Integrating topics and syntax. In NIPS 17, 2005.
[9] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
[10] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: Generative/discriminative training for clustering and classification. In AAAI, 2006.
[11] B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL, 2005.
[12] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, pages 487-494, 2004.