MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification

Jun Zhu  jun-zhu@mails.tsinghua.edu.cn
Amr Ahmed  amahmed@cs.cmu.edu
Eric P. Xing  epxing@cs.cmu.edu
Dept. of Comp. Sci & Tech, TNList Lab, Tsinghua University, Beijing 100084, China
School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA

Abstract

Supervised topic models utilize documents' side information for discovering predictive low-dimensional representations of documents; existing models apply likelihood-based estimation. In this paper, we present a max-margin supervised topic model for both continuous and categorical response variables. Our approach, maximum entropy discrimination latent Dirichlet allocation (MedLDA), utilizes the max-margin principle to train supervised topic models and estimate predictive topic representations that are arguably more suitable for prediction. We develop efficient variational methods for posterior inference and demonstrate qualitatively and quantitatively the advantages of MedLDA over likelihood-based topic models on movie review and 20 Newsgroups data sets.

1. Introduction

Statistical topic models have recently gained much popularity for managing large collections of documents by discovering low-dimensional representations that capture the latent semantics of a collection. This low-dimensional representation can then be used for tasks like classification and clustering, or merely as a tool to structurally browse an otherwise unstructured collection. Latent Dirichlet allocation (LDA) (Blei et al., 2003) is an example of such models for textual documents. LDA posits that each document is an admixture of latent topics, where each topic is a unigram distribution over a given vocabulary. The admixture proportion is document-specific and is distributed as a latent Dirichlet random variable.

When LDA is used for classification tasks, the document-specific mixing proportions are usually fed to a downstream classifier such as an SVM. This two-step procedure is rather suboptimal: the side information of the documents, such as the category of a document or the numerical rating of a movie review, is not used in discovering the low-dimensional representation, which can therefore be sub-optimal for prediction. Developing a low-dimensional representation that retains as much information as possible about the response variable has been studied in text modeling (McCallum et al., 2006) and image analysis (Blei & Jordan, 2003). Recently, supervised variants of LDA have been proposed, including supervised LDA (sLDA) (Blei & McAuliffe, 2007) and discriminative LDA (DiscLDA) for classification (Lacoste-Jullien et al., 2008). While sLDA and DiscLDA share the same goal (uncovering the latent structure in a document collection while retaining predictive power for supervised tasks), they differ in their training procedures: sLDA is trained by maximizing the joint likelihood of data and response variables, while DiscLDA is trained to maximize the conditional likelihood of the response variables. In this paper, we propose a max-margin discriminative variant of supervised topic models for both regression and classification.
In contrast to the above two-stage procedure of using topic models for prediction tasks, the proposed maximum entropy discrimination latent Dirichlet allocation (MedLDA) is an integration of max-margin learning and hierarchical Bayesian topic models, obtained by optimizing a single objective function with a set of expected margin constraints. MedLDA is a special instance of PoMEN (partially observed maximum entropy discrimination Markov networks) (Zhu et al., 2008b), which was proposed to combine max-margin learning and structured hidden variables in Markov networks, applied here to discovering latent topic representations of documents. In MedLDA, the parameters of the regression or classification model are learned in a max-margin sense, and the discovery of latent topics is coupled with the max-margin estimation of the model parameters. This interplay yields latent topic representations that are more suitable for supervised prediction tasks. We develop an efficient and easy-to-implement variational method for MedLDA; in fact, its running time for classification is comparable to that of an unsupervised LDA. This property stems from the fact that the MedLDA classification model directly optimizes the margin and does not suffer from a normalization factor, which generally makes learning hard in fully generative models such as sLDA.

The paper is structured as follows. Sec. 2 presents MedLDA for both regression and classification, with efficient variational EM algorithms. Sec. 3 generalizes MedLDA to other latent variable topic models. Sec. 4 presents an empirical comparison between MedLDA and likelihood-based topic models. Finally, Sec. 5 concludes the paper with future research directions.

2. Max-Entropy Discrimination LDA

In this section, we present the MedLDA model for both regression and classification. We first review supervised topic models.

2.1. (Un)Supervised Topic Models

The unsupervised LDA (latent Dirichlet allocation) (Blei et al., 2003) is a hierarchical Bayesian model, where topic proportions for a document are drawn from a Dirichlet distribution and words in the document are repeatedly sampled from a topic which itself is drawn from those topic proportions. Supervised topic models (sLDA) (Blei & McAuliffe, 2007) introduce a response variable to LDA for each document, as illustrated in Figure 1.

Figure 1. Supervised topic model (Blei & McAuliffe, 2007).

Let $K$ be the number of topics and $M$ be the number of terms in the vocabulary. $\beta$ denotes a $K \times M$ matrix, where each $\beta_k$ is a distribution over the $M$ terms. For the regression problem, where the response variable $y \in \mathbb{R}$, the generative process of sLDA is as follows:

1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.
2. For each word $n$:
   (a) Draw a topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$.
   (b) Draw a word $w_n \mid z_n, \beta \sim \mathrm{Mult}(\beta_{z_n})$.
3. Draw a response variable: $y \mid z_{1:N}, \eta, \delta^2 \sim N(\eta^\top \bar{z}, \delta^2)$, where $\bar{z} = \frac{1}{N}\sum_{n=1}^{N} z_n$.

To estimate the unknown constants $(\alpha, \beta, \eta, \delta^2)$, sLDA maximizes the joint likelihood $p(\mathbf{y}, \mathbf{W} \mid \alpha, \beta, \eta, \delta^2)$, where $\mathbf{y}$ is the vector of response variables in a corpus $\mathcal{D}$ and $\mathbf{W}$ are all the words. Given a new document, the expected response value is the prediction:

$E[Y \mid w_{1:N}, \alpha, \beta, \eta, \delta^2] = \eta^\top E[\bar{Z} \mid w_{1:N}, \alpha, \beta, \delta^2],$  (1)

where $E[X]$ denotes an expectation with respect to the posterior distribution of the random variable $X$ or its variational approximation.

DiscLDA (Lacoste-Jullien et al., 2008) is a discriminative variant of supervised topic models for classification, where the unknown parameters (i.e., a linear transformation matrix) are learned by maximizing the conditional likelihood of the response variables. Below, we present a max-margin variant of supervised topic models, which discovers predictive topic representations that are more suitable for supervised prediction tasks, e.g., regression and classification.
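To make the generative semantics concrete, the following is a minimal sketch of the sLDA generative process for a single document, written in Python with NumPy. The variable names (alpha, beta, eta, delta2, N) mirror the symbols above; the particular dimensions and vocabulary in the usage example are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def generate_slda_document(alpha, beta, eta, delta2, N, rng=np.random.default_rng(0)):
    """Sample one document and its response under the sLDA generative process.

    alpha : (K,) Dirichlet hyperparameter
    beta  : (K, M) topic-word distributions (rows sum to 1)
    eta   : (K,) regression coefficients
    delta2: response variance
    N     : number of words in the document
    """
    K, M = beta.shape
    theta = rng.dirichlet(alpha)                             # step 1: topic proportions
    z = rng.choice(K, size=N, p=theta)                       # step 2(a): topic assignments
    w = np.array([rng.choice(M, p=beta[zn]) for zn in z])    # step 2(b): words
    z_bar = np.bincount(z, minlength=K) / N                  # empirical topic frequencies
    y = rng.normal(eta @ z_bar, np.sqrt(delta2))             # step 3: response
    return w, z, y

# Toy usage with arbitrary sizes (illustrative only).
K, M = 5, 100
rng = np.random.default_rng(1)
beta = rng.dirichlet(np.ones(M), size=K)
w, z, y = generate_slda_document(np.ones(K), beta, rng.normal(size=K), 0.1, N=50)
```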
2.2. Learning MedLDA for Regression

Instead of learning a point estimate of $\eta$ as in sLDA, we take a Bayesian-style approach and learn a distribution $q(\eta)$ in a max-margin manner. For prediction, we average over all possible models:

$E[Y \mid w_{1:N}, \alpha, \beta, \delta^2] = E[\eta^\top \bar{Z} \mid w_{1:N}, \alpha, \beta, \delta^2].$  (2)

The question underlying the averaging prediction rule (2) is how to devise an appropriate loss function and constraints that integrate the max-margin concepts into latent topic discovery. In the sequel, we present the maximum entropy discrimination latent Dirichlet allocation (MedLDA) based on the PoMEN (partially observed maximum entropy discrimination Markov networks) framework (Zhu et al., 2008b). PoMEN is an elegant combination of max-margin learning with structured hidden variables in Markov networks; MedLDA is a special case of PoMEN that learns latent topic models to discover the latent semantic structure of document collections.

For regression, MedLDA is defined as an integration of a Bayesian sLDA, where the parameter $\eta$ is sampled from a prior $p_0(\eta)$, and $\epsilon$-insensitive support vector regression (SVR) (Smola & Schölkopf, 2003). Thus, MedLDA defines a joint distribution $p(\theta, \mathbf{z}, \eta, \mathbf{y}, \mathbf{W} \mid \alpha, \beta, \delta^2) = p_0(\eta)\, p(\theta, \mathbf{z}, \mathbf{y}, \mathbf{W} \mid \alpha, \beta, \eta, \delta^2)$, where the second term is the same as in sLDA, that is, $p(\theta, \mathbf{z}, \mathbf{y}, \mathbf{W} \mid \alpha, \beta, \eta, \delta^2) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) \big(\prod_{n=1}^{N} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)\big)\, p(y_d \mid \eta^\top \bar{z}_d, \delta^2)$. The marginal likelihood on $\mathcal{D}$ is $p(\mathbf{y}, \mathbf{W} \mid \alpha, \beta, \delta^2)$. Since directly optimizing the log marginal likelihood is intractable, as in sLDA we optimize an upper bound $L(q)$, where $q(\theta, \mathbf{z}, \eta \mid \gamma, \phi)$ is a variational distribution approximating the posterior $p(\theta, \mathbf{z}, \eta \mid \alpha, \beta, \delta^2, \mathbf{y}, \mathbf{W})$.

Thus, the integrated learning problem is defined as:

P1 (MedLDA$^r$):  $\min_{q, \alpha, \beta, \delta^2, \xi, \xi^*} \; L(q) + C \sum_{d=1}^{D} (\xi_d + \xi_d^*)$
s.t. $\forall d$:  $y_d - E[\eta^\top \bar{Z}_d] \le \epsilon + \xi_d \;(\mu_d)$;  $-y_d + E[\eta^\top \bar{Z}_d] \le \epsilon + \xi_d^* \;(\mu_d^*)$;  $\xi_d \ge 0 \;(v_d)$;  $\xi_d^* \ge 0 \;(v_d^*)$,

where $\mu, \mu^*, v, v^*$ (in parentheses) are the corresponding Lagrange multipliers; $L(q) = -E[\log p(\theta, \mathbf{z}, \eta, \mathbf{y}, \mathbf{W} \mid \alpha, \beta, \delta^2)] - H(q(\mathbf{z}, \theta, \eta))$; $H(q)$ is the entropy of $q$; $\xi, \xi^*$ are slack variables absorbing errors in the training data; and $\epsilon$ is the precision.

Algorithm 1  Variational MedLDA$^r$
Input: corpus $\mathcal{D} = \{(\mathbf{y}, \mathbf{W})\}$, constants $C$ and $\epsilon$, and topic number $K$.
Output: Dirichlet parameters $\gamma$, posterior distribution $q(\eta)$, parameters $\beta$ and $\delta^2$.
repeat
  /**** E-step ****/
  for $d = 1$ to $D$: update $\gamma_d$ as in Eq. (3); for $i = 1$ to $N$: update $\phi_{di}$ as in Eq. (4).
  Solve the dual problem D1 to get $q(\eta)$, $\mu$ and $\mu^*$.
  /**** M-step ****/
  Update $\beta$ using Eq. (5) and $\delta^2$ using Eq. (6); $\alpha$ is fixed as $1/K$ times the ones vector.
until convergence

The rationale underlying MedLDA$^r$ is as follows: given the current model $p(\theta, \mathbf{z}, \eta, \mathbf{y}, \mathbf{W} \mid \alpha, \beta, \delta^2)$, we want to find a latent topic representation and a model distribution (as represented by the distribution $q$) which, on one hand, tend to predict correctly on the data with a sufficiently large margin and, on the other hand, tend to explain the data well (i.e., minimize a variational upper bound of the negative log-likelihood). This interplay yields a topic representation that is more suitable for max-margin learning, as explained below.
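Problem P1 penalizes a prediction only when it falls outside an $\epsilon$-tube around the response, exactly as in SVR. As a small illustration (not part of the paper), the following Python snippet computes the slack implied by the two margin constraints for one document, given the expected topic representation $E[\bar{Z}_d]$ and the mean of $q(\eta)$; the variable names are assumptions made for exposition.

```python
import numpy as np

def epsilon_insensitive_slack(y_d, e_zbar_d, eta_mean, eps):
    """Slack (xi_d + xi_d_star) implied by the margin constraints of P1 for a
    single document, using the expected prediction E[eta]^T E[Z_bar_d]."""
    pred = eta_mean @ e_zbar_d
    xi = max(0.0, y_d - pred - eps)        # violation of  y_d - E[eta^T Zbar_d] <= eps + xi_d
    xi_star = max(0.0, pred - y_d - eps)   # violation of -y_d + E[eta^T Zbar_d] <= eps + xi_d*
    return xi + xi_star

# Example: expected prediction 0.7 against response 0.9 with eps = 0.1 gives slack 0.1.
print(epsilon_insensitive_slack(0.9, np.array([0.3, 0.7]), np.array([0.0, 1.0]), 0.1))
```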
2.2.1. Variational EM-Algorithm

The constrained problem P1 is generally intractable. Thus, we make additional independence assumptions about $q$. As in standard topic models, we assume $q(\theta, \mathbf{z}, \eta \mid \gamma, \phi) = q(\eta) \prod_{d=1}^{D} q(\theta_d \mid \gamma_d) \prod_{n=1}^{N} q(z_{dn} \mid \phi_{dn})$, where $\gamma_d$ is a $K$-dimensional vector of Dirichlet parameters and each $\phi_{dn}$ is a categorical distribution over the $K$ topics. Then, $E[Z_{dn}] = \phi_{dn}$ and $E[\eta^\top \bar{Z}_d] = E[\eta]^\top \big(\frac{1}{N}\sum_{n=1}^{N} \phi_{dn}\big)$. We can develop an EM algorithm that iteratively solves the following two steps:

E-step: infer the posterior distribution of the hidden variables (i.e., $\theta$, $\mathbf{z}$, and $\eta$).
M-step: estimate the unknown parameters (i.e., $\alpha$, $\beta$, and $\delta^2$).

The essential difference between MedLDA and sLDA lies in the E-step for inferring the posterior distribution of $\mathbf{z}$ and $\eta$, because of the margin constraints in P1. As we shall see in Eq. (4), these constraints bias the expected topic proportions towards ones that are more suitable for max-margin learning. Since the constraints in P1 are not on the unknown parameters ($\alpha$, $\beta$, and $\delta^2$), the M-step is similar to that of sLDA. We outline the algorithm in Alg. 1 and explain it in detail below. Specifically, we form the Lagrangian of P1,

$L = L(q) + C\sum_{d=1}^{D}(\xi_d + \xi_d^*) - \sum_{d=1}^{D} \mu_d\big(\epsilon + \xi_d - y_d + E[\eta^\top \bar{Z}_d]\big) - \sum_{d=1}^{D}\Big(\mu_d^*\big(\epsilon + \xi_d^* + y_d - E[\eta^\top \bar{Z}_d]\big) + v_d \xi_d + v_d^* \xi_d^*\Big) - \sum_{d=1}^{D}\sum_{i=1}^{N} c_{di}\Big(\sum_{j=1}^{K}\phi_{dij} - 1\Big),$

where the last term is due to the normalization condition $\sum_{j=1}^{K}\phi_{dij} = 1, \forall i, d$, and iteratively solve the following steps:

Optimize $L$ over $\gamma$: Since the constraints in P1 are not on $\gamma$, we obtain the same update formula as in sLDA for each document $d$ separately:

$\gamma_d \leftarrow \alpha + \sum_{n=1}^{N} \phi_{dn}.$  (3)

Optimize $L$ over $\phi$: For each document $d$ and each word $i$, setting $\partial L / \partial \phi_{di} = 0$ yields:

$\phi_{di} \propto \exp\Big( E[\log \theta_d \mid \gamma_d] + E[\log p(w_{di} \mid \beta)] + \frac{y_d}{N\delta^2} E[\eta] - \frac{2\, E[\eta\eta^\top]\,\Phi_{d,-i} + E[\eta \circ \eta]}{2N^2\delta^2} + \frac{\mu_d - \mu_d^*}{N} E[\eta] \Big),$  (4)

where $\Phi_{d,-i} = \sum_{n \ne i} \phi_{dn}$ and exponentiating a vector means taking the exponential of each of its components. The first two terms in the exponential are the same as in unsupervised LDA. The essential differences of MedLDA$^r$ from sLDA lie in the last three terms in the exponential of $\phi_{di}$. First, the third and fourth terms are similar to those of sLDA, but appear in an expected version because we are learning the distribution $q(\eta)$. The second-order expectations $E[\eta\eta^\top]$ and $E[\eta \circ \eta]$ mean that the covariances of $\eta$ affect the distribution over topics. This makes our approach significantly different from a point-estimation method like sLDA, where no expectations or covariances are involved in updating $\phi_{di}$. Second, the last term comes from the max-margin regression formulation. For a document $d$ that lies around the decision boundary, i.e., a support vector, either $\mu_d$ or $\mu_d^*$ is non-zero, and the last term biases $\phi_{di}$ towards a distribution that favors a more accurate prediction on the document. Moreover, the last term is the same for all words in the document and thus directly affects the latent representation of the document, i.e., $\gamma_d$. Therefore, the latent representation learned by MedLDA$^r$ is more suitable for max-margin learning.
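To make the role of the extra terms in Eq. (4) explicit, here is a schematic NumPy implementation of the $\phi_{di}$ update for one word, assuming the expectations $E[\log\theta_d \mid \gamma_d]$ (via the digamma function), $E[\eta]$, and $E[\eta\eta^\top]$ are already available. It is a sketch of the update as reconstructed above, not the authors' code; names such as `log_beta_w` and `phi_d_minus_i` are illustrative.

```python
import numpy as np
from scipy.special import digamma

def update_phi_di(gamma_d, log_beta_w, phi_d_minus_i, y_d, N, delta2,
                  e_eta, e_eta_eta, mu_d, mu_d_star):
    """One coordinate update of Eq. (4) for word i of document d.

    gamma_d        : (K,) variational Dirichlet parameters of document d
    log_beta_w     : (K,) log beta[:, w_di], log-probability of the observed word per topic
    phi_d_minus_i  : (K,) sum of phi_dn over all other words n != i
    e_eta          : (K,) E[eta] under q(eta)
    e_eta_eta      : (K, K) E[eta eta^T] under q(eta)
    mu_d, mu_d_star: dual variables of the two margin constraints for document d
    """
    e_log_theta = digamma(gamma_d) - digamma(gamma_d.sum())        # E[log theta_d | gamma_d]
    lda_part = e_log_theta + log_beta_w                            # unsupervised-LDA terms
    slda_part = (y_d / (N * delta2)) * e_eta \
        - (2.0 * e_eta_eta @ phi_d_minus_i + np.diag(e_eta_eta)) / (2.0 * N**2 * delta2)
    margin_part = ((mu_d - mu_d_star) / N) * e_eta                 # max-margin bias term
    log_phi = lda_part + slda_part + margin_part
    phi = np.exp(log_phi - log_phi.max())                          # numerically stable normalization
    return phi / phi.sum()
```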
Optimize $L$ over $q(\eta)$: Let $A$ be the $D \times K$ matrix whose rows are the vectors $\bar{Z}_d^\top$. Setting the functional derivative $\partial L/\partial q(\eta) = 0$, we get:

$q(\eta) = \frac{p_0(\eta)}{Z} \exp\Big( \eta^\top \sum_{d=1}^{D}\big(\mu_d - \mu_d^* + \tfrac{y_d}{\delta^2}\big) E[\bar{Z}_d] - \frac{\eta^\top E[A^\top A]\,\eta}{2\delta^2} \Big),$

where $E[A^\top A] = \sum_{d=1}^{D} E[\bar{Z}_d \bar{Z}_d^\top]$ and $E[\bar{Z}_d \bar{Z}_d^\top] = \frac{1}{N^2}\big(\sum_{n=1}^{N}\sum_{m \ne n} \phi_{dn}\phi_{dm}^\top + \sum_{n=1}^{N} \mathrm{diag}\{\phi_{dn}\}\big)$. Plugging $q(\eta)$ into $L$, we get the dual problem of P1:

D1:  $\max_{\mu, \mu^*} \; -\log Z - \epsilon \sum_{d=1}^{D}(\mu_d + \mu_d^*) + \sum_{d=1}^{D} y_d(\mu_d - \mu_d^*)$   s.t. $\forall d$: $\mu_d, \mu_d^* \in [0, C]$.

In MedLDA$^r$, we can choose different priors to introduce different regularization effects. For the standard normal prior $p_0(\eta) = N(0, I)$, the posterior is also normal: $q(\eta) = N(\mu_\eta, \Sigma)$, where $\mu_\eta = \Sigma \sum_{d=1}^{D}(\mu_d - \mu_d^* + \tfrac{y_d}{\delta^2})E[\bar{Z}_d]$ is the mean and $\Sigma = (I + \tfrac{1}{\delta^2} E[A^\top A])^{-1}$ is a $K \times K$ covariance matrix. Computation of $\Sigma$ can be achieved robustly through a Cholesky decomposition of $\delta^2 I + E[A^\top A]$, an $O(K^3)$ procedure. Another example is the Laplace prior, which can lead to a shrinkage effect (Zhu et al., 2008a) that is useful in sparse problems. In this paper, we focus on the normal prior.

For the standard normal prior, the dual problem D1 is a quadratic programming problem:

$\max_{\mu, \mu^*} \; -\frac{1}{2} a^\top \Sigma\, a - \epsilon \sum_{d=1}^{D}(\mu_d + \mu_d^*) + \sum_{d=1}^{D} y_d(\mu_d - \mu_d^*)$   s.t. $\forall d$: $\mu_d, \mu_d^* \in [0, C]$,

where $a = \sum_{d=1}^{D}(\mu_d - \mu_d^* + \tfrac{y_d}{\delta^2})E[\bar{Z}_d]$. This problem can be solved with any standard QP solver, although such solvers may not be very efficient. To leverage recent developments in support vector regression, we note that D1 can be reformulated in the following primal form, which is a standard SVR problem and can be solved with existing algorithms like SVM-light (Joachims, 1999) to obtain $\mu_\eta$ and the dual parameters $\mu$ and $\mu^*$:

$\min_{\mu_\eta, \xi, \xi^*} \; \frac{1}{2}\mu_\eta^\top \Sigma^{-1}\mu_\eta - \mu_\eta^\top\Big(\sum_{d=1}^{D}\tfrac{y_d}{\delta^2}E[\bar{Z}_d]\Big) + C\sum_{d=1}^{D}(\xi_d + \xi_d^*)$
s.t. $\forall d$:  $y_d - \mu_\eta^\top E[\bar{Z}_d] \le \epsilon + \xi_d$;  $-y_d + \mu_\eta^\top E[\bar{Z}_d] \le \epsilon + \xi_d^*$;  $\xi_d, \xi_d^* \ge 0$.

Now we estimate the unknown parameters $\alpha$, $\beta$, and $\delta^2$; here, we assume $\alpha$ is fixed.

Optimize $L$ over $\beta$: The update equations are the same as for sLDA:

$\beta_{k,w} \propto \sum_{d=1}^{D}\sum_{n=1}^{N} 1(w_{dn} = w)\,\phi_{dnk}.$  (5)

Optimize $L$ over $\delta^2$: This step is similar to that of sLDA but in an expected version. The update rule is:

$\delta^2 = \frac{1}{D}\Big( \mathbf{y}^\top\mathbf{y} - 2\,\mathbf{y}^\top E[A]E[\eta] + E\big[\eta^\top E[A^\top A]\eta\big] \Big),$  (6)

where $E[\eta^\top E[A^\top A]\eta] = \mathrm{tr}\big(E[A^\top A]\,E[\eta\eta^\top]\big)$.
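For the normal prior discussed above, once the dual variables $\mu, \mu^*$ are available, the Gaussian posterior $q(\eta)$ can be computed in closed form. Below is a minimal NumPy sketch of that step, assuming `e_zbar` holds the $D \times K$ matrix of expected topic frequencies $E[\bar{Z}_d]$ and `e_ata` holds $E[A^\top A]$; it illustrates the Cholesky-based computation mentioned above rather than reproducing the authors' implementation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def posterior_q_eta(e_zbar, e_ata, y, mu, mu_star, delta2):
    """Compute q(eta) = N(mu_eta, Sigma) under the standard normal prior.

    e_zbar : (D, K) rows are E[Z_bar_d]
    e_ata  : (K, K) E[A^T A] = sum_d E[Z_bar_d Z_bar_d^T]
    y      : (D,) responses; mu, mu_star : (D,) dual variables from D1
    """
    K = e_ata.shape[0]
    # Sigma = (I + E[A^T A]/delta2)^{-1} = delta2 * (delta2*I + E[A^T A])^{-1};
    # factor delta2*I + E[A^T A] once via Cholesky (O(K^3)).
    cho = cho_factor(delta2 * np.eye(K) + e_ata)
    b = e_zbar.T @ (mu - mu_star + y / delta2)     # sum_d (mu_d - mu_d* + y_d/delta2) E[Z_bar_d]
    mu_eta = cho_solve(cho, delta2 * b)            # Sigma @ b
    Sigma = cho_solve(cho, delta2 * np.eye(K))
    return mu_eta, Sigma
```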
2.3. Learning MedLDA for Classification

For classification, the response variables $y$ are discrete. For brevity, we only consider multi-class classification, where $y \in \{1, \dots, M\}$. The binary case can be easily defined based on a binary SVM, and the optimization problem can be solved similarly.

As we have stated, fully generative topic models such as sLDA contain a normalization factor, which generally makes learning intractable except in special cases such as the normal distribution used in the regression case. In (Blei & McAuliffe, 2007), variational methods or higher-order Taylor expansions are applied to approximate the normalization factor of a GLM. In our max-margin formulation, since the target is to directly minimize a hinge loss, we do not need a fully generative model. Instead, we define a partially generative model on $(\theta, \mathbf{z}, \mathbf{W})$ only, as in unsupervised LDA, and for the classification model (i.e., from $\bar{Z}_d$ to $Y_d$) we apply the max-margin principle, which does not require a normalized distribution. Thus, in this case the likelihood of the corpus $\mathcal{D}$ is $p(\mathbf{W} \mid \alpha, \beta)$.

Specifically, for classification we assume the discriminant function $F$ is linear, that is, $F(y, \mathbf{z}_{1:N}, \eta) = \eta_y^\top \bar{z}$, where $\bar{z} = \frac{1}{N}\sum_n z_n$ as in the regression model, $\eta_y$ is a class-specific $K$-dimensional parameter vector associated with class $y$, and $\eta$ is the $MK$-dimensional vector obtained by stacking the $\eta_y$. Equivalently, $F$ can be written as $F(y, \mathbf{z}_{1:N}, \eta) = \eta^\top f(y, \bar{z})$, where $f(y, \bar{z})$ is a feature vector whose components from $(y-1)K + 1$ to $yK$ are those of the vector $\bar{z}$ and all others are 0. From each single $F$, a prediction rule can be derived as in an SVM. Here, we consider the general case of learning a distribution $q(\eta)$, and for prediction we average over all possible models and the latent topics:

$\hat{y} = \arg\max_y E[\eta^\top f(y, \bar{Z}) \mid \alpha, \beta].$  (7)

Similar to the regression model, we define the integrated latent topic discovery and multi-class classification model as follows:

P2 (MedLDA$^c$):  $\min_{q, q(\eta), \alpha, \beta, \xi} \; L(q) + KL\big(q(\eta)\,\|\,p_0(\eta)\big) + C\sum_{d=1}^{D}\xi_d$
s.t. $\forall d,\ \forall y \ne y_d$:  $E[\eta^\top \Delta f_d(y)] \ge 1 - \xi_d$;  $\xi_d \ge 0$,

where $q(\theta, \mathbf{z} \mid \gamma, \phi)$ is a variational distribution; $L(q) = -E[\log p(\theta, \mathbf{z}, \mathbf{W} \mid \alpha, \beta)] - H(q(\theta, \mathbf{z}))$ is a variational upper bound of $-\log p(\mathbf{W} \mid \alpha, \beta)$; $\Delta f_d(y) = f(y_d, \bar{Z}_d) - f(y, \bar{Z}_d)$; and $\xi$ are slack variables. $E[\eta^\top \Delta f_d(y)]$ is the "expected margin" by which the true label $y_d$ is favored over a prediction $y$.

The rationale underlying MedLDA$^c$ is similar to that of MedLDA$^r$: we want to find a latent topic representation $q(\theta, \mathbf{z} \mid \gamma, \phi)$ and a parameter distribution $q(\eta)$ which, on one hand, tend to predict as accurately as possible on the training data while, on the other hand, tend to explain the data well. The KL-term in P2 is a regularizer of the distribution $q(\eta)$.
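The stacked feature map $f(y, \bar{z})$ and the averaging prediction rule (7) above are straightforward to express in code. The snippet below is a small illustration under the assumption that `e_eta` stores the mean of $q(\eta)$ as an $M \times K$ array (one row per class), which is equivalent to the stacked $MK$-vector used in the text; classes are indexed from 0 here.

```python
import numpy as np

def stacked_feature(y, z_bar, num_classes):
    """f(y, z_bar): an (M*K,) vector whose y-th block of size K equals z_bar."""
    K = z_bar.shape[0]
    f = np.zeros(num_classes * K)
    f[y * K:(y + 1) * K] = z_bar           # classes indexed 0..M-1 in this sketch
    return f

def predict(e_eta, z_bar):
    """Prediction rule (7): argmax_y E[eta]^T f(y, z_bar) = argmax_y E[eta_y]^T z_bar.

    e_eta : (M, K) rows are E[eta_y]; z_bar : (K,) expected topic frequencies.
    """
    return int(np.argmax(e_eta @ z_bar))

# Example: with two classes and z_bar concentrated on topic 0, class 1 wins below.
e_eta = np.array([[0.1, 0.9], [0.8, 0.2]])
print(predict(e_eta, np.array([0.7, 0.3])))   # -> 1
```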
2.3.1. Variational EM-Algorithm

As in MedLDA$^r$, we can develop a similar variational EM algorithm. Specifically, we assume that $q$ is fully factorized, as in standard unsupervised LDA. Then, $E[\eta^\top f(y, \bar{Z}_d)] = E[\eta]^\top f\big(y, \frac{1}{N}\sum_{n=1}^{N}\phi_{dn}\big)$. We form the Lagrangian of P2,

$L = L(q) + KL\big(q(\eta)\,\|\,p_0(\eta)\big) + C\sum_{d=1}^{D}\xi_d - \sum_{d=1}^{D} v_d \xi_d - \sum_{d=1}^{D}\sum_{y \ne y_d} \mu_d(y)\big(E[\eta^\top \Delta f_d(y)] + \xi_d - 1\big) - \sum_{d=1}^{D}\sum_{i=1}^{N} c_{di}\Big(\sum_{j=1}^{K}\phi_{dij} - 1\Big),$

where the last term is from the normalization condition $\sum_{j=1}^{K}\phi_{dij} = 1, \forall i, d$, and iteratively optimize $L$ with respect to $\gamma$, $\phi$, $q(\eta)$ and $\beta$. Since the constraints in P2 are not on $\gamma$ or $\beta$, their update rules are the same as in MedLDA$^r$ and we omit the details here. We explain the optimization of P2 over $\phi$ and $q(\eta)$, which reveals the insights of the max-margin topic model:

Optimize $L$ over $\phi$: Since $q$ is fully factorized, we can perform the optimization on each document separately. Setting $\partial L/\partial\phi_{di} = 0$, we have:

$\phi_{di} \propto \exp\Big( E[\log\theta_d \mid \gamma_d] + E[\log p(w_{di} \mid \beta)] + \frac{1}{N}\sum_{y \ne y_d}\mu_d(y)\, E[\eta_{y_d} - \eta_y] \Big).$  (8)

The first two terms in Eq. (8) are the same as in unsupervised LDA, and the last term is due to the max-margin formulation of P2; it reflects our intuition that the discovered latent topic representation is influenced by the max-margin estimation. For examples that lie around the decision boundary, i.e., support vectors, some of the Lagrange multipliers are non-zero, and the last term acts as a regularizer that biases the model towards discovering a latent representation that tends to make more accurate predictions on these difficult examples. Moreover, this term is the same for all words in a document and thus directly affects the latent representation of the document (i.e., $\gamma_d$), yielding a discriminative latent representation which, as we shall see in Section 4, is more suitable for the classification task.

Optimize $L$ over $q(\eta)$: As in the regression model, we get the dual problem of P2:

D2:  $\max_{\mu} \; -\log Z + \sum_{d=1}^{D}\sum_{y \ne y_d}\mu_d(y)$   s.t. $\forall d$: $\sum_{y \ne y_d}\mu_d(y) \in [0, C]$,

and the posterior $q(\eta) = \frac{1}{Z} p_0(\eta)\exp(\eta^\top \hat{\mu})$, where $\hat{\mu} = \sum_{d=1}^{D}\sum_{y \ne y_d}\mu_d(y)\,E[\Delta f_d(y)]$.

Again, we can choose different priors in MedLDA$^c$ for different regularization effects; we consider the normal prior in this paper. For the standard normal prior $p_0(\eta) = N(0, I)$, we can show that $q(\eta)$ is a normal distribution with a shifted mean, i.e., $q(\eta) = N(\hat{\mu}, I)$, and the dual problem D2 is the same as the dual of a standard multi-class SVM, which can be solved using existing SVM methods (Crammer & Singer, 2001):

$\max_{\mu} \; -\frac{1}{2}\Big\|\sum_{d=1}^{D}\sum_{y \ne y_d}\mu_d(y)\,E[\Delta f_d(y)]\Big\|_2^2 + \sum_{d=1}^{D}\sum_{y \ne y_d}\mu_d(y)$   s.t. $\forall d$: $\sum_{y \ne y_d}\mu_d(y) \in [0, C]$.
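Because D2 coincides with a standard multi-class SVM dual under the normal prior, the $q(\eta)$ sub-step can in practice be delegated to an off-the-shelf Crammer-Singer solver applied to the expected topic features $E[\bar{Z}_d]$ (the experiments in Section 4 use the SVM$^{struct}$ package for this purpose). The following sketch uses scikit-learn's LinearSVC as an assumed stand-in solver; it recovers only the posterior mean, not the dual variables $\mu_d(y)$ needed by Eq. (8), so it illustrates the connection rather than reproducing the sub-step faithfully.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_q_eta_mean(e_zbar, labels, C):
    """Approximate the mean of q(eta) for MedLDA^c by solving a Crammer-Singer
    multi-class SVM on the expected topic representations E[Z_bar_d].

    e_zbar : (D, K) expected topic frequencies per document
    labels : (D,) class labels in {0, ..., M-1}
    Returns an (M, K) array whose rows play the role of E[eta_y].
    """
    clf = LinearSVC(multi_class="crammer_singer", C=C, fit_intercept=False)
    clf.fit(e_zbar, labels)
    return clf.coef_          # (M, K) class-specific weight vectors
```

A solver that exposes its dual variables (e.g., an SVM$^{struct}$-style implementation) is needed when the multipliers themselves must be fed back into the $\phi$ update of Eq. (8).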
3. MedTM: a general framework

We have presented MedLDA, which integrates the max-margin principle with an underlying LDA model, which can be supervised or unsupervised, for discovering predictive latent topic representations of documents. The same principle can be applied to other generative topic models, such as correlated topic models (CTM) (Blei & Lafferty, 2005), as well as undirected random fields, such as exponential family harmoniums (EFH) (Welling et al., 2004).

Formally, max-entropy discrimination topic models (MedTM) can be generally defined as:

P (MedTM):  $\min_{q(H), q(\Psi), \Lambda, \xi} \; L(q(H)) + KL\big(q(\Psi)\,\|\,p_0(\Psi)\big) + U(\xi)$   s.t. expected margin constraints,

where $H$ are hidden variables (e.g., $(\theta, \mathbf{z})$ in LDA); $\Psi$ are the parameters of the model pertaining to the prediction task (e.g., $\eta$ in sLDA); $\Lambda$ are the parameters of the underlying topic model (e.g., the Dirichlet parameter $\alpha$); $L$ is a variational upper bound of the negative log-likelihood associated with the underlying topic model; and $U$ is a convex function over slack variables. For the general MedTM model, we can develop a variational EM-algorithm similar to that of MedLDA. Note that $\Psi$ can be a part of $H$. For example, the underlying topic model of MedLDA$^r$ is a Bayesian sLDA; in this case, $H = (\theta, \mathbf{z}, \eta)$, $\Psi = \eta$, and the term $KL(q(\eta)\,\|\,p_0(\eta))$ is contained in its $L$.

4. Experiments

In this section, we provide qualitative as well as quantitative evaluations of MedLDA on text modeling, classification and regression.

4.1. Text Modeling

We study text modeling with MedLDA on the 20 Newsgroups data set, with a standard list of stop words (from http://mallet.cs.umass.edu/) removed. The data set contains postings in 20 related categories. We compare with the standard unsupervised LDA. We fit the data set to a 110-topic MedLDA$^c$ model, which exploits the supervised category information, and a 110-topic unsupervised LDA.

Figure 2. t-SNE 2D embedding of the topic representation discovered by MedLDA$^c$ (above) and the unsupervised LDA (below).

Figure 3. Top topics under each class as discovered by the MedLDA and LDA models.

Figure 2 shows the 2D embedding of the expected topic proportions of MedLDA$^c$ and LDA produced by the t-SNE stochastic neighborhood embedding (van der Maaten & Hinton, 2008), where each dot represents a document and color-shape pairs represent class labels. The max-margin based MedLDA$^c$ clearly produces a better grouping and separation of the documents in different categories. In contrast, the unsupervised LDA does not produce a well separated embedding, and documents in different categories tend to mix together. A similar embedding was presented in (Lacoste-Jullien et al., 2008), where the transformation matrix in their model is pre-designed; the results of MedLDA$^c$ in Figure 2 are automatically learned.

It is also interesting to examine the discovered topics and their association with class labels. In Figure 3, we show the top topics in four classes (comp.graphics, sci.electronics, politics.mideast, and misc.forsale) as discovered by both MedLDA and LDA. Moreover, we depict the per-class distribution over topics for each model; this distribution is computed by averaging the expected latent representations of the documents in each class.
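The per-class distributions over topics shown in Figure 3 come from that simple averaging step. A minimal sketch of the computation, assuming `e_theta` holds the expected topic proportions (e.g., normalized $\gamma_d$) for every document:

```python
import numpy as np

def per_class_topic_distribution(e_theta, labels, num_classes):
    """Average the expected topic proportions of the documents in each class.

    e_theta : (D, K) expected topic proportions per document
    labels  : (D,) class labels in {0, ..., num_classes-1}
    Returns a (num_classes, K) array; row c is the per-class distribution over topics.
    """
    avg = np.vstack([e_theta[labels == c].mean(axis=0) for c in range(num_classes)])
    return avg / avg.sum(axis=1, keepdims=True)   # renormalize so each row sums to 1
```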
This behavior is in fact due to the regularization effect en- forced over as shown in Eq. (8). On the other hand, LDA seems to discover topics that model the fine details of documents with no regard to their discrimination power (i.e. it discovers different variations of the same topic which results in a flat per-class distribution over topics). For instance, in the class comp.graphics, MedLDA mainly models documents in this class using two salient, discriminative topics (T69 and T11) whereas LDA results in a much flatter distribution. Moreover, in the cases where LDA and MedLDA discover comparably the same set of topics in a given class (like politics.mideast and misc.forsale), MedLDA results in a sharper low dimensional representation. 4.2. Prediction Accuracy In this subsection, we provide a quantitative evaluation of the MedLDA on prediction performance. 4.2.1. Classification We perform binary and multi-class classification on the 20 Newsgroup data set. To obtain a baseline, we first fit all the data to an LDA model, and then use the latent representation of the training4 documents as features to build a binary/multi-class SVM classifier. We 4 We use the training/testing split in: http://people.csail.mit.edu/jrennie/20Newsgroups/ MedLDA: Max-margin Topic Models for Regression and Classification 0.3 0.2 0.15 0.55 0.5 0.45 -6.32 -6.34 -6.36 -6.38 -6.4 0.1 Predictive R2 0.1 0.05 0 MedLDA MedLDA+SVM DiscLDA LDA+SVM (baseline) 20 40 60 80 100 0.4 0.35 0.3 0.25 0.2 0.15 0.1 5 10 15 20 MedLDA (full) MedLDA (partial) sLDA LDA+SVR 25 30 0 MedLDA MedLDA+SVM DiscLDA sLDA LDA+SVM (baseline) 0 10 20 30 40 -0.1 -0.05 -0.1 Per-word Likelihood 0.2 Relatvie Ratio Ralative Ratio -6.42 -6.44 -6.46 -0.2 # Topics # Topics MedLDA (full) MedLDA (partial) sLDA LDA 5 10 15 20 25 30 (a) (b) Figure 4. Relative improvement ratio against LDA+SVM for: (a) binary and (b) multi-class classification. # Topics # Topics Figure 5. Predictive R (left) and per-word likelihood (right) of different models on the movie review dataset. 2 denote this baseline by LDA+SVM. For a model M, we evaluate its performance using the relative improvement ratio, i.e., precision(M) - precision(LDA+SV M ) . precision(LDA+SV M ) Binary Classification: As in (Lacoste-Jullien et al., 2008), the binary classification is to distinguish postings of the newsgroup alt.atheism and the postings of the group talk.religion.misc. We compare MedLDAc with sLDA, DiscLDA and LDA+SVM. For sLDA, to the best of our knowledge, the classification model has not been evaluated. Therefore, we fit an sLDA regression model using the binary representation (0/1) of the class, and use a threshold 0.5 to make prediction. For MedLDAc , to see whether a second-stage max-margin classifier can improve the performance, we also build a method MedLDA+SVM, similar to LDA+SVM. For all the above methods that utilize the class label information, they are fit ONLY on the training data. We use the SVM-light (Joachims, 1999) to build SVM classifiers and to estimate q() in MedLDAc . The parameter C is chosen via 5 fold cross-validation during the training from {k 2 : k = 1, · · · , 8}. For each model, we run the experiments for 5 times and take the average as the final results. The relative improvement ratios of different models w.r.t topic numbers are shown in Figure 4(a). For the recently proposed DiscLDA (Lacoste-Jullien et al., 2008), since the implementation is not available, the results are taken from the original paper for both DiscLDA and LDA+SVM. 
We can see that the max-margin based MedLDA$^c$ works better than sLDA, DiscLDA and the two-step method LDA+SVM. Since MedLDA$^c$ integrates the max-margin principle in its training, the combination of MedLDA and SVM does not yield additional benefits on this task. We believe that the slight differences between MedLDA and MedLDA+SVM are due to the tuning of the regularization parameters: for efficiency, we do not change the regularization constant $C$ during the training of MedLDA$^c$, and the performance could be improved by selecting a good $C$ at different iterations, because the data representation is changing.

Multi-class Classification: We perform multi-class classification on 20 Newsgroups with all the categories. We compare MedLDA$^c$ with MedLDA+SVM, LDA+SVM, and DiscLDA. We use the SVM$^{struct}$ package (http://svmlight.joachims.org/svm_multiclass.html) with a 0/1 loss to solve the sub-step of learning $q(\eta)$ and to build the SVM classifiers for LDA+SVM and MedLDA+SVM. The results are shown in Figure 4(b), where the results of DiscLDA are again taken from (Lacoste-Jullien et al., 2008). We can see that all the supervised topic models discover more predictive topics for classification, and the max-margin based MedLDA$^c$ achieves significant improvements with an appropriate number of topics (e.g., 80). Again, we believe that the slight difference between MedLDA$^c$ and MedLDA+SVM is due to parameter tuning.

4.2.2. Regression

We evaluate the MedLDA$^r$ model on the movie review data set. As in (Blei & McAuliffe, 2007), we take logs of the response values to make them approximately normal. We compare MedLDA$^r$ with the unsupervised LDA and sLDA. As stated above, the underlying topic model in MedLDA$^r$ can be either an LDA or an sLDA; we have implemented both, denoted by MedLDA (partial) and MedLDA (full), respectively. For LDA, we use its low-dimensional representation of documents as input features to a linear SVR and denote this method by LDA+SVR. The evaluation criterion is predictive $R^2$ (pR$^2$) as defined in (Blei & McAuliffe, 2007).

Figure 5. Predictive $R^2$ (left) and per-word likelihood (right) of different models on the movie review data set.

Figure 5 shows the results together with the per-word likelihood. We can see that the supervised MedLDA and sLDA obtain much better results than the unsupervised LDA, which ignores the supervised responses. By using max-margin learning, MedLDA (full) obtains slightly better results than the likelihood-based sLDA, especially when the number of topics is small (e.g., 15). Indeed, when the number of topics is small, the latent representation of sLDA alone does not result in a highly separable problem, and the integration of max-margin training helps discover a more discriminative latent representation with the same number of topics. In fact, the number of support vectors (i.e., documents that have at least one non-zero Lagrange multiplier) decreases dramatically at $T = 15$ and stays nearly the same for $T > 15$, which, with reference to Eq. (4), explains why the relative improvement over sLDA decreases as $T$ increases. This behavior suggests that MedLDA can discover more predictive latent structures for difficult, non-separable problems. Comparing the two variants of MedLDA$^r$, we see an obvious improvement of MedLDA (full) over MedLDA (partial). This is because for MedLDA (partial) the update rule for $\phi$ does not contain the third and fourth terms of Eq. (4); those terms couple the max-margin estimation and the latent topic discovery more tightly. Finally, a linear SVR on the empirical word frequencies obtains a pR$^2$ of 0.458, worse than those of sLDA and MedLDA.
4.2.3. Time Efficiency

Figure 6. Training time.

For binary classification, MedLDA$^c$ is much more efficient than sLDA and is comparable with LDA+SVM, as shown in Figure 6. The slowness of sLDA may be due to the mismatch between its normal assumption and the non-Gaussian binary response variables, which prolongs the E-step. For multi-class classification, the training time of MedLDA$^c$ is mainly spent on solving a multi-class SVM problem and is thus comparable to that of LDA. For regression, the training time of MedLDA (full) is comparable to that of sLDA, while MedLDA (partial) is more efficient.

5. Conclusions and Discussions

We have presented maximum entropy discrimination LDA (MedLDA), which uses the max-margin principle to train supervised topic models. MedLDA integrates the max-margin principle into the latent topic discovery process by optimizing a single objective function with a set of expected margin constraints.
This integration yields a predictive topic representation that is more suitable for regression or classification. We develop efficient variational methods for MedLDA. The empirical results on the movie review and 20 Newsgroups data sets show the promise of MedLDA for text modeling and prediction accuracy.

MedLDA represents a first step towards integrating the max-margin principle into supervised topic models, and under the general MedTM framework presented in Section 3 several improvements and extensions are on the horizon. Specifically, due to the nature of MedTM's joint optimization formulation, advances in either max-margin training or better variational bounds for inference can be easily incorporated. For instance, the mean-field variational upper bound in MedLDA can be improved by using the tighter collapsed variational bound (Teh et al., 2006), which achieves results comparable to collapsed Gibbs sampling (Griffiths & Steyvers, 2004). Moreover, as the experimental results suggest, incorporating a more expressive underlying topic model enhances the overall performance. Therefore, we plan to integrate and utilize other underlying topic models, such as the fully generative sLDA model, in the classification case.

Acknowledgements

This work was done while J.Z. was visiting CMU under support from NSF DBI-0546594 and DBI-0640543 awarded to E.X.; J.Z. is also supported by Chinese NSF Grants 60621062 and 60605003; National Key Foundation R&D Projects 2003CB317007, 2004CB318108 and 2007CB311003; and the Basic Research Foundation of the Tsinghua National TNList Lab.

References

Blei, D., & Jordan, M. (2003). Modeling annotated data. Inter. Conf. on Info. Retrieval, 127-134.
Blei, D., & Lafferty, J. (2005). Correlated topic models. Neur. Info. Proc. Sys., 147-154.
Blei, D., & McAuliffe, J. D. (2007). Supervised topic models. Neur. Info. Proc. Sys., 121-128.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. J. of Mach. Learn. Res., 993-1022.
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. J. of Mach. Learn. Res., 265-292.
Griffiths, T., & Steyvers, M. (2004). Finding scientific topics. Proc. of the National Academy of Sciences, 5228-5235.
Joachims, T. (1999). Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning, MIT Press, 169-184.
Lacoste-Jullien, S., Sha, F., & Jordan, M. I. (2008). DiscLDA: Discriminative learning for dimensionality reduction and classification. NIPS, 897-904.
McCallum, A., Pal, C., Druck, G., & Wang, X. (2006). Multi-conditional learning: generative/discriminative training for clustering and classification. AAAI, 433-439.
Smola, A., & Schölkopf, B. (2003). A tutorial on support vector regression. Statistics and Computing, 199-222.
Teh, Y. W., Newman, D., & Welling, M. (2006). A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. NIPS, 1353-1360.
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. JMLR, 2579-2605.
Welling, M., Rosen-Zvi, M., & Hinton, G. (2004). Exponential family harmoniums with an application to information retrieval. NIPS, 1481-1488.
Zhu, J., Xing, E., & Zhang, B. (2008a). Laplace maximum margin Markov networks. ICML, 1256-1263.
Zhu, J., Xing, E., & Zhang, B. (2008b). Partially observed maximum entropy discrimination Markov networks. Neur. Info. Proc. Sys., 1977-1984.