Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors

David Andrzejewski  andrzeje@cs.wisc.edu
Xiaojin Zhu  jerryzhu@cs.wisc.edu
Mark Craven  craven@biostat.wisc.edu
Department of Computer Sciences, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706 USA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Users of topic modeling methods often have knowledge about the composition of words that should have high or low probability in various topics. We incorporate such domain knowledge using a novel Dirichlet Forest prior in a Latent Dirichlet Allocation framework. The prior is a mixture of Dirichlet tree distributions with special structures. We present its construction, and inference via collapsed Gibbs sampling. Experiments on synthetic and real datasets demonstrate our model's ability to follow and generalize beyond user-specified domain knowledge.

1. Introduction

Topic modeling, using approaches such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), has enjoyed popularity as a way to model hidden topics in data. However, in many applications, a user may have additional knowledge about the composition of words that should have high probability in various topics. For example, in a biological application, one may prefer that the words "termination", "disassembly" and "release" appear with high probability in the same topic, because they all describe the same phase of biological processes. Furthermore, a biologist could automatically extract these preferences from an existing biomedical ontology, such as the Gene Ontology (GO) (The Gene Ontology Consortium, 2000). As another example, an analyst may run topic modeling on a corpus of people's wishes, inspect the resulting topics, and notice that "into, college" and "cure, cancer" all appear with high probability in the same topic. The analyst may want to interactively express the preference that the two sets of words should not appear together, re-run topic modeling, and incorporate additional preferences based on the new results. In both cases, we would like these preferences to guide the recovery of latent topics. Standard LDA lacks a mechanism for incorporating such domain knowledge.

In this paper, we propose a principled approach to the incorporation of such domain knowledge into LDA. We show that many types of knowledge can be expressed with two primitives on word pairs. Borrowing names from the constrained clustering literature (Basu et al., 2008), we call the two primitives Must-Links and Cannot-Links, although there are important differences. We then encode the set of Must-Links and Cannot-Links associated with the domain knowledge using a Dirichlet Forest prior, replacing the Dirichlet prior over the topic-word multinomial p(word|topic). The Dirichlet Forest prior is a mixture of Dirichlet tree distributions with very specific tree structures. Our approach has several advantages: (i) A Dirichlet Forest can encode Must-Links and Cannot-Links, something impossible with Dirichlet distributions. (ii) The user can control the strength of the domain knowledge by setting a parameter η, allowing domain knowledge to be overridden if the data strongly suggest otherwise. (iii) The Dirichlet Forest lends itself to efficient inference via collapsed Gibbs sampling, a property inherited from the conjugacy of Dirichlet trees. We present experiments on several synthetic datasets and two real domains, demonstrating that the resulting topics not only successfully incorporate the specified domain knowledge, but also generalize beyond it by including/excluding other related words not explicitly mentioned in the Must-Links and Cannot-Links.
2. Related Work

We review LDA using the notation of Griffiths and Steyvers (2004). Let there be T topics. Let w = w_1 ... w_n represent a corpus of D documents, with a total of n words. We use d_i to denote the document of word w_i, and z_i the hidden topic from which w_i is generated. Let φ_j^(w) = p(w | z = j), and θ_j^(d) = p(z = j) for document d. The LDA generative model is then:

θ^(d) ~ Dirichlet(α)    (1)
z_i | θ^(d_i) ~ Multinomial(θ^(d_i))    (2)
φ_j ~ Dirichlet(β)    (3)
w_i | z_i, φ ~ Multinomial(φ_{z_i})    (4)

where α and β are hyperparameters for the document-topic and topic-word Dirichlet distributions, respectively. For simplicity we will assume symmetric α and β, but asymmetric hyperparameters are also possible.

Previous work has modeled correlations in the LDA document-topic mixtures θ using the logistic Normal distribution (Blei & Lafferty, 2006), DAG (Pachinko) structures (Li & McCallum, 2006), or the Dirichlet Tree distribution (Tam & Schultz, 2007). In addition, the concept-topic model (Chemudugunta et al., 2008) employs domain knowledge through special "concept" topics, in which only a particular set of words can be present. Our work complements the previous work by encoding complex domain knowledge on words (especially arbitrary Cannot-Links) into a flexible and computationally efficient prior.

3. Topic Modeling with Dirichlet Forest

Our proposed model differs from LDA in the way φ is generated. Instead of (3), we have

q ~ DirichletForest(β, η)
φ ~ DirichletTree(q)

where q specifies a Dirichlet tree distribution, β plays a role analogous to the topic-word hyperparameter in standard LDA, and η ≥ 1 is the "strength parameter" of the domain knowledge. Before discussing DirichletForest(β, η) and DirichletTree(q), we first explain how knowledge can be expressed using Must-Link and Cannot-Link primitives.

3.1. Must-Links and Cannot-Links

Must-Links and Cannot-Links were originally proposed for constrained clustering to encourage two instances to fall into the same cluster or into separate clusters, respectively. We borrow the notion for topic modeling. Informally, the Must-Link primitive prefers that two words tend to be generated by the same topic, while the Cannot-Link primitive prefers that two words tend to be generated by separate topics. However, since any topic is a multinomial over words, any two words (in general) always have some probability of being generated by the topic. We therefore propose the following definitions:

Must-Link(u, v): Two words u, v have similar probability within any topic, i.e., φ_j^(u) ≈ φ_j^(v) for j = 1 ... T. It is important to note that the probabilities can be both large or both small, as long as they are similar. For example, for the earlier biology example we could say Must-Link(termination, disassembly).

Cannot-Link(u, v): Two words u, v should not both have large probability within any topic. It is permissible for one to have a large probability and the other small, or both small. For example, one primitive for the wish example can be Cannot-Link(college, cure).

Many types of domain knowledge can be decomposed into a set of Must-Links and Cannot-Links. We demonstrate three types in our experiments. We can Split two or more sets of words from a single topic into different topics by placing Must-Links within the sets and Cannot-Links between them. We can Merge two or more sets of words from different topics into one topic by placing Must-Links among the sets. Given a common set of words which appear in multiple topics (such as stopwords in English, which tend to appear in all LDA topics), we can Isolate them by placing Must-Links within the common set, and then placing Cannot-Links between the common set and the other high-probability words from all topics. It is important to note that our Must-Links and Cannot-Links are preferences instead of hard constraints.
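To make this compilation concrete, the following Python sketch shows one way the Split, Merge, and Isolate operations described above could be expanded into explicit Must-Link and Cannot-Link word pairs. The function and variable names are illustrative assumptions for this example, not the authors' implementation.

```python
from itertools import combinations, product

def must_links_within(word_sets):
    """Must-Link every pair of words inside each given set."""
    links = set()
    for ws in word_sets:
        links.update(combinations(sorted(ws), 2))
    return links

def split_op(word_sets):
    """Split: Must-Links within each set, Cannot-Links between different sets."""
    ml = must_links_within(word_sets)
    cl = set()
    for s1, s2 in combinations(word_sets, 2):
        cl.update(product(sorted(s1), sorted(s2)))
    return ml, cl

def merge_op(word_sets):
    """Merge: Must-Links among all words drawn from the different sets."""
    return must_links_within([set().union(*word_sets)]), set()

def isolate_op(common_words, other_top_words):
    """Isolate: Must-Links within the common set, Cannot-Links from it to the
    other high-probability words."""
    ml = must_links_within([common_words])
    others = set(other_top_words) - set(common_words)
    return ml, set(product(sorted(common_words), sorted(others)))

# Example: the Split operation used later on the wish corpus.
ml, cl = split_op([{"go", "school", "into", "college"},
                   {"cancer", "free", "cure", "well"}])
```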
3.2. Encoding Must-Links

It is well known that the Dirichlet distribution is limited in that all words share a common variance parameter, and are mutually independent except for the normalization constraint (Minka, 1999). However, for Must-Link(u, v) it is crucial to control the two words u, v differently than other words. The Dirichlet tree distribution (Dennis III, 1991) is a generalization of the Dirichlet distribution that allows such control. It is a tree with the words as leaf nodes; see Figure 1(a) for an example. Let γ^(k) be the Dirichlet tree edge weight leading into node k. Let C(k) be the immediate children of node k in the tree, L the leaves of the tree, I the internal nodes, and L(k) the leaves in the subtree under k. To generate a sample φ ~ DirichletTree(γ), one first draws a multinomial at each internal node s ∈ I from Dirichlet(γ^{C(s)}), i.e., using the weights from s to its children as the Dirichlet parameters. One can think of it as re-distributing the probability mass reaching s by this multinomial (initially, the mass is 1 at the root). The probability φ^(k) of a word k ∈ L is then simply the product of the multinomial parameters on the edges from k to the root, as shown in Figure 1(b). It can be shown (Dennis III, 1991) that this procedure gives φ ~ DirichletTree(γ) with density

p(\phi \mid \gamma) = \prod_{k}^{L} \left(\phi^{(k)}\right)^{\gamma^{(k)} - 1} \prod_{s}^{I} \left[ \frac{\Gamma\!\left(\sum_{k}^{C(s)} \gamma^{(k)}\right)}{\prod_{k}^{C(s)} \Gamma\!\left(\gamma^{(k)}\right)} \left(\sum_{k}^{L(s)} \phi^{(k)}\right)^{\Delta(s)} \right]

where Γ(·) is the standard gamma function, and the notation ∏_k^L means ∏_{k ∈ L}. The function Δ(s) ≡ γ^(s) − ∑_{k ∈ C(s)} γ^(k) is the difference between the in-degree and out-degree of internal node s. When this difference Δ(s) = 0 for all internal nodes s ∈ I, the Dirichlet tree reduces to a Dirichlet distribution.

Like the Dirichlet, the Dirichlet tree is conjugate to the multinomial. It is possible to integrate out φ to get a distribution over word counts directly, similar to the multivariate Pólya distribution:

p(w \mid \gamma) = \prod_{s}^{I} \left[ \frac{\Gamma\!\left(\sum_{k}^{C(s)} \gamma^{(k)}\right)}{\Gamma\!\left(\sum_{k}^{C(s)} \left(\gamma^{(k)} + n^{(k)}\right)\right)} \prod_{k}^{C(s)} \frac{\Gamma\!\left(\gamma^{(k)} + n^{(k)}\right)}{\Gamma\!\left(\gamma^{(k)}\right)} \right]    (5)

Here n^(k) is the number of word tokens in w that appear in L(k).

We encode Must-Links using a Dirichlet tree. Note that our definition of Must-Link is transitive: Must-Link(u, v) and Must-Link(v, w) imply Must-Link(u, w). We thus first compute the transitive closures of the expressed Must-Links. Our Dirichlet tree for Must-Links has a very simple structure: each transitive closure is a subtree, with one internal node and the words in the closure as its leaves. The weights from the internal node to its leaves are ηβ. The root connects to these internal nodes s with weight β|L(s)|, where |·| represents the set size. In addition, the root directly connects to other words not in any closure, with weight β. For example, the transitive closure for a Must-Link(A,B) on vocabulary {A,B,C} is simply {A,B}, corresponding to the Dirichlet tree in Figure 1(a).
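As a concrete illustration of this construction, here is a minimal Python sketch (assuming a simple tuple-based tree representation of our own devising, not the authors' code) that builds the Must-Link Dirichlet tree just described and samples φ from it by drawing a Dirichlet at each internal node and multiplying branch probabilities down to the leaves.

```python
import numpy as np

def mustlink_tree(vocab, closures, beta, eta):
    """Children of the root for the Must-Link Dirichlet tree: one internal node per
    Must-Link transitive closure (root edge beta*|closure|, leaf edges eta*beta),
    plus a direct leaf edge of weight beta for every word outside all closures."""
    covered = set().union(*closures) if closures else set()
    children = [("closure", sorted(c), beta * len(c), eta * beta) for c in closures]
    children += [("word", w, beta) for w in vocab if w not in covered]
    return children

def sample_dirichlet_tree(children, rng=None):
    """Draw phi: a Dirichlet at each internal node, with branch probabilities
    multiplied down to the leaves (the two-level case used for Must-Links)."""
    rng = rng or np.random.default_rng()
    mass = rng.dirichlet([c[2] for c in children])   # split the unit mass at the root
    phi = {}
    for m, child in zip(mass, children):
        if child[0] == "word":
            phi[child[1]] = m
        else:                                        # redistribute m within the closure
            words, leaf_weight = child[1], child[3]
            for w, p in zip(words, rng.dirichlet([leaf_weight] * len(words))):
                phi[w] = m * p
    return phi

# Figure 1(a): Must-Link(A,B) on vocabulary {A,B,C} with beta = 1, eta = 50.
phi = sample_dirichlet_tree(mustlink_tree(["A", "B", "C"], [{"A", "B"}], 1.0, 50.0))
```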
To understand this encoding of Must-Links, consider first the case when the domain knowledge strength parameter is at its weakest, η = 1. Then in-degree equals out-degree for any internal node s (both are β|L(s)|), and the tree reduces to a Dirichlet distribution with symmetric prior β: the Must-Links are turned off in this case. As we increase η, the re-distribution of probability mass at s (governed by a Dirichlet under s) has increasing concentration ηβ|L(s)| but the same uniform base measure. This tends to redistribute the mass evenly within the transitive closure represented by s. Therefore, the Must-Links are turned on when η > 1. Furthermore, the mass reaching s is independent of η, and can still have a large variance. This properly encodes the fact that we want Must-Linked words to have similar, but not always large, probabilities. Otherwise, Must-Linked words would be forced to appear with large probability in all topics, which is clearly undesirable. This is impossible to represent with Dirichlet distributions. For example, the blue dots in Figure 1(c) are samples from the Dirichlet tree in Figure 1(a), plotted on the probability simplex of dimension three. While it is always true that p(A) ≈ p(B), their total probability mass can be anywhere from 0 to 1. The most similar Dirichlet distribution is perhaps the one with parameters (50,50,1), which generates samples close to (0.5, 0.5, 0) (Figure 1(d)).

Figure 1. Encoding Must-Links and Cannot-Links with a Dirichlet Forest. (a) A Dirichlet tree encoding Must-Link(A,B) with β = 1, η = 50 on vocabulary {A,B,C}. (b) A sample from this Dirichlet tree. (c) A large set of samples from the Dirichlet tree, plotted on the 3-simplex. Note p(A) ≈ p(B), yet they remain flexible in actual value, which is desirable for a Must-Link. (d) In contrast, samples from a standard Dirichlet with comparable parameters (50,50,1) force p(A) ≈ p(B) ≈ 0.5, and cannot encode a Must-Link. (e) The Cannot-Link-graph for Cannot-Link(A,B) and Cannot-Link(B,C). (f) The complementary graph, with two maximal cliques {A,C} and {B}. (g) The Dirichlet subtree for clique {A,C}. (h) The Dirichlet subtree for clique {B}. (i) Samples from the mixture model on (g,h), encoding both Cannot-Links, again with β = 1, η = 50.

3.3. Encoding Cannot-Links

Cannot-Links are considerably harder to handle. We first transform them into an alternative form that is amenable to Dirichlet trees. Note that Cannot-Links are not transitive: Cannot-Link(A,B) and Cannot-Link(B,C) do not entail Cannot-Link(A,C). We define a Cannot-Link-graph where the nodes are words (when there are Must-Links, all words in a Must-Link transitive closure form a single node in this graph), and the edges correspond to the Cannot-Links. The connected components of this graph are then independent of each other when encoding Cannot-Links. We will use this property to factor a Dirichlet-tree selection probability later. For example, the two Cannot-Links (A,B) and (B,C) form the graph in Figure 1(e) with a single connected component {A,B,C}.

Consider the subgraph on connected component r. We define its complement graph by flipping the edges (on to off, off to on), as shown in Figure 1(f). Let there be Q(r) maximal cliques M_r1 ... M_rQ(r) in this complement graph. In the following, we simply call them "cliques", but it is important to remember that they are maximal cliques of the complement graph, not of the original Cannot-Link-graph. In our example, Q(r) = 2, M_r1 = {A, C}, and M_r2 = {B}. These cliques have the following interpretation: each clique (e.g., M_r1 = {A, C}) is a maximal subset of words in the connected component that can "occur together". That is, these words are allowed to simultaneously have large probabilities in a given topic without violating any Cannot-Link preferences. By the maximality of these cliques, allowing any word outside the clique (e.g., "B") to also have a large probability will violate at least one Cannot-Link (in this example, two).
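The transformation just described can be computed with standard graph routines. The sketch below is an illustration that assumes the networkx library (not the authors' own code); it builds the Cannot-Link-graph, splits it into connected components, complements each component, and enumerates the maximal cliques M_r1 ... M_rQ(r).

```python
import networkx as nx

def cannotlink_cliques(cannot_links):
    """For each connected component r of the Cannot-Link-graph, return the nodes of r
    together with the maximal cliques M_r1 ... M_rQ(r) of its complement graph."""
    g = nx.Graph()
    g.add_edges_from(cannot_links)
    result = []
    for nodes in nx.connected_components(g):
        comp = nx.complement(g.subgraph(nodes))               # flip edges within the component
        cliques = [sorted(c) for c in nx.find_cliques(comp)]  # maximal cliques of the complement
        result.append((sorted(nodes), cliques))
    return result

# Running example: Cannot-Link(A,B) and Cannot-Link(B,C) give one component {A,B,C}
# whose complement has the maximal cliques {A,C} and {B} (Figure 1(e,f)).
print(cannotlink_cliques([("A", "B"), ("B", "C")]))
```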
We discuss the encoding for this single connected component r now, deferring discussion of the complete encoding to Section 3.4. We create a mixture model of Q(r) Dirichlet subtrees, one for each clique. Each topic selects exactly one subtree according to probability

p(q) ∝ |M_rq|,  q = 1 ... Q(r).    (6)

Conceptually, the selected subtree indexed by q tends to redistribute nearly all probability mass to the words within M_rq. Since there is no mass left for other cliques, it is impossible for a word outside clique M_rq to have a large probability. Therefore, no Cannot-Link will be violated. In reality, the subtrees are soft rather than hard, because Cannot-Links are only preferences.

The Dirichlet subtree for M_rq is structured as follows. The subtree's root connects to an internal node s with weight ηβ|M_rq|. The node s connects to the words in M_rq, each with weight β. The subtree's root also directly connects to words not in M_rq (but in the connected component r), each with weight β. This will send most probability mass down to s, and then flexibly redistribute it among the words in M_rq. For example, Figures 1(g,h) show the Dirichlet subtrees for M_r1 = {A, C} and M_r2 = {B}, respectively. Samples from this mixture model are shown in Figure 1(i), representing multinomials in which no Cannot-Link is violated. Such behavior is not achievable by a Dirichlet distribution, or by a single Dirichlet tree. (Dirichlet distributions with very small concentration do have some selection effect. For example, Beta(0.1, 0.1) tends to concentrate probability mass on one of the two variables. However, such priors are weak: the "pseudo counts" in them are too small because of the small concentration. The posterior will be dominated by the data, and we would lose any encoded domain knowledge.) Finally, we mention that although in the worst case the number of maximal cliques Q(r) in a connected component of size |r| can grow exponentially as O(3^{|r|/3}) (Griggs et al., 1988), in our experiments Q(r) is no larger than 3, due in part to Must-Linked words "collapsing" to single nodes in the Cannot-Link-graph.

3.4. The Dirichlet Forest Prior

In general, our domain knowledge is expressed by a set of Must-Links and Cannot-Links. We first compute the transitive closure of the Must-Links. We then form a Cannot-Link-graph, where a node is either a Must-Link closure or a word not present in any Must-Link. Note that the domain knowledge must be "consistent", in that no pair of words is simultaneously Cannot-Linked and Must-Linked (either explicitly or implicitly through Must-Link transitive closure). Let R be the number of connected components in the Cannot-Link-graph. Our Dirichlet Forest consists of ∏_{r=1}^{R} Q(r) Dirichlet trees, represented by the template in Figure 2. Each Dirichlet tree has R branches beneath the root, one for each connected component. The trees differ in which subtrees they include under these branches. For the r-th branch, there are Q(r) possible Dirichlet subtrees, corresponding to cliques M_r1 ... M_rQ(r). Therefore, a tree in the forest is uniquely identified by an index vector q = (q^(1), ..., q^(R)), where q^(r) ∈ {1, ..., Q(r)}.
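The following sketch (again illustrative, reusing the hypothetical cannotlink_cliques helper from the earlier example) shows how one might draw the index vector q by sampling one clique per connected component with probability proportional to its size, as in (6), and how the Cannot-Link subtree weights described in Section 3.3 could be laid out.

```python
import numpy as np

def cannotlink_subtree(component_words, clique, beta, eta):
    """Edge weights of the Dirichlet subtree for clique M_rq: root -> internal node s
    with weight eta*beta*|M_rq|, s -> each clique word with weight beta, and
    root -> each remaining word of the component with weight beta."""
    return {"s": eta * beta * len(clique),
            "clique": {w: beta for w in clique},
            "others": {w: beta for w in component_words if w not in clique}}

def sample_forest_index(components, rng=None):
    """Draw q = (q^(1), ..., q^(R)): for each connected component pick one maximal
    clique with probability proportional to its size, as in Eq. (6)."""
    rng = rng or np.random.default_rng()
    q = []
    for _, cliques in components:                    # output of cannotlink_cliques()
        sizes = np.array([len(c) for c in cliques], dtype=float)
        q.append(int(rng.choice(len(cliques), p=sizes / sizes.sum())))
    return q
```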
Figure 2. Template of Dirichlet trees in the Dirichlet Forest.

To draw a Dirichlet tree q from the prior DirichletForest(β, η), we select the subtrees independently, because the R connected components are independent with respect to Cannot-Links: p(q) = ∏_{r=1}^{R} p(q^(r)). Each q^(r) is sampled according to (6), and corresponds to choosing a solid box for the r-th branch in Figure 2. The structure of the subtree within the solid box has been defined in Section 3.3. The black nodes may be a single word, or a Must-Link transitive closure having the subtree structure shown in the dotted box. The edge weight leading into most nodes k is γ^(k) = β|L(k)|, where L(k) is the set of leaves under k. However, for edges coming out of a Must-Link internal node or going into a Cannot-Link internal node, the weights are multiplied by the strength parameter η. These edges are marked in Figure 2.

We now define the complete Dirichlet Forest model, integrating out ("collapsing") θ and φ. Let n_j^(d) be the number of word tokens in document d that are assigned to topic j. The topic assignment z is generated the same as in LDA:

p(\mathbf{z} \mid \alpha) = \left( \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^T} \right)^{D} \prod_{d=1}^{D} \frac{\prod_{j=1}^{T} \Gamma\!\left(n_j^{(d)} + \alpha\right)}{\Gamma\!\left(n_\cdot^{(d)} + T\alpha\right)}.

There is one Dirichlet tree q_j per topic j = 1 ... T, sampled from the Dirichlet Forest prior p(q_j) = ∏_{r=1}^{R} p(q_j^(r)). Each Dirichlet tree q_j implicitly defines its tree edge weights γ_j using β, η, and its tree structure L_j, I_j, C_j(·). Let n_j^(k) be the number of word tokens in the corpus assigned to topic j that appear under node k in the Dirichlet tree q_j. The probability of generating the corpus w, given the trees q_{1:T} ≡ q_1 ... q_T and the topic assignment z, can be derived using (5):

p(\mathbf{w} \mid q_{1:T}, \mathbf{z}, \beta, \eta) = \prod_{j=1}^{T} \prod_{s}^{I_j} \left[ \frac{\Gamma\!\left(\sum_{k}^{C_j(s)} \gamma_j^{(k)}\right)}{\Gamma\!\left(\sum_{k}^{C_j(s)} \left(\gamma_j^{(k)} + n_j^{(k)}\right)\right)} \prod_{k}^{C_j(s)} \frac{\Gamma\!\left(\gamma_j^{(k)} + n_j^{(k)}\right)}{\Gamma\!\left(\gamma_j^{(k)}\right)} \right].

Finally, the complete generative model is

p(\mathbf{w}, \mathbf{z}, q_{1:T} \mid \alpha, \beta, \eta) = p(\mathbf{w} \mid q_{1:T}, \mathbf{z}, \beta, \eta)\, p(\mathbf{z} \mid \alpha) \prod_{j=1}^{T} p(q_j).
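Under the flat per-node representation assumed below (each internal node stored as its child edge weights γ and child counts n), the collapsed corpus probability above can be evaluated in log space as a sum of log-gamma terms. This is only a sketch of the computation under that assumed representation, not the authors' code.

```python
import numpy as np
from scipy.special import gammaln

def log_collapsed_likelihood(topic_trees):
    """log p(w | q_{1:T}, z, beta, eta). Each topic's tree is given as a list of
    internal nodes, each node as (gamma, n): the prior weights of its child edges
    and the word counts falling under each child."""
    total = 0.0
    for internal_nodes in topic_trees:
        for gamma, n in internal_nodes:
            gamma, n = np.asarray(gamma, float), np.asarray(n, float)
            total += gammaln(gamma.sum()) - gammaln((gamma + n).sum())
            total += (gammaln(gamma + n) - gammaln(gamma)).sum()
    return total
```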
4. Inference for Dirichlet Forest

Because a Dirichlet Forest is a mixture of Dirichlet trees, which are conjugate to multinomials, we can efficiently perform inference by Markov chain Monte Carlo (MCMC). Specifically, we use collapsed Gibbs sampling similar to Griffiths and Steyvers (2004). However, in our case the MCMC state is defined by both the topic labels z and the tree indices q_{1:T}, and an MCMC iteration consists of a sweep through both z and q_{1:T}. We present the conditional probabilities for collapsed Gibbs sampling below.

(Sampling z_i): Let n_{-i,j}^(d) be the number of word tokens in document d assigned to topic j, excluding the word at position i. Similarly, let n_{-i,j}^(k) be the number of word tokens in the corpus that are under node k in topic j's Dirichlet tree, excluding the word at position i. For candidate topic labels v = 1 ... T, we have

p(z_i = v \mid \mathbf{z}_{-i}, q_{1:T}, \mathbf{w}) \propto \left(n_{-i,v}^{(d)} + \alpha\right) \prod_{s}^{I_v(\uparrow i)} \frac{\gamma_v^{(C_v(s \downarrow i))} + n_{-i,v}^{(C_v(s \downarrow i))}}{\sum_{k}^{C_v(s)} \left(\gamma_v^{(k)} + n_{-i,v}^{(k)}\right)},

where I_v(↑i) denotes the subset of internal nodes in topic v's Dirichlet tree that are ancestors of leaf w_i, and C_v(s↓i) is the unique node that is s's immediate child and an ancestor of w_i (including w_i itself).

(Sampling q_j): Since the connected components are independent, sampling the tree q_j factors into sampling the cliques q_j^(r) for each connected component r. For candidate cliques q = 1 ... Q(r), we have

p(q_j^{(r)} = q \mid \mathbf{z}, q_{-j}, q_j^{(-r)}, \mathbf{w}) \propto |M_{rq}| \prod_{s}^{I_{j, r=q}} \left[ \frac{\Gamma\!\left(\sum_{k}^{C_j(s)} \gamma_j^{(k)}\right)}{\Gamma\!\left(\sum_{k}^{C_j(s)} \left(\gamma_j^{(k)} + n_j^{(k)}\right)\right)} \prod_{k}^{C_j(s)} \frac{\Gamma\!\left(\gamma_j^{(k)} + n_j^{(k)}\right)}{\Gamma\!\left(\gamma_j^{(k)}\right)} \right],

where I_{j,r=q} denotes the internal nodes below the r-th branch of tree q_j when clique M_rq is selected.

(Estimating φ and θ): After running MCMC for sufficient iterations, we follow standard practice (e.g., Griffiths & Steyvers, 2004) and use the last sample (z, q_{1:T}) to estimate φ and θ. Because a Dirichlet tree is a conjugate distribution, its posterior is a Dirichlet tree with the same structure and updated edge weights. The posterior for the Dirichlet tree of the j-th topic is post γ_j^(k) = γ_j^(k) + n_j^(k), where the counts n_j^(k) are collected from z, q_{1:T}, w. We estimate φ_j by the first moment under this posterior (Minka, 1999):

\phi_j^{(w)} = \prod_{s}^{I_j(\uparrow w)} \frac{\mathrm{post}\,\gamma_j^{(C_j(s \downarrow w))}}{\sum_{k}^{C_j(s)} \mathrm{post}\,\gamma_j^{(k)}}.    (7)

The parameter θ is estimated the same way as in standard LDA: θ_j^(d) = (n_j^(d) + α) / (n_·^(d) + Tα).
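As a sketch of the z_i update above (assuming a pre-computed root-to-leaf path representation per topic; this is not the authors' implementation), the conditional can be evaluated by walking the ancestors of w_i in each topic's tree and multiplying the edge ratios:

```python
import numpy as np

def resample_topic(doc_topic_counts, alpha, paths, rng=None):
    """Collapsed Gibbs update for one token z_i; all counts exclude token i.
    paths[v] describes, for topic v, the internal nodes on the root-to-leaf path of
    the current word in that topic's Dirichlet tree, each as (gamma, n, child_index),
    where child_index points at the child leading toward the word."""
    rng = rng or np.random.default_rng()
    T = len(paths)
    p = np.empty(T)
    for v in range(T):
        weight = doc_topic_counts[v] + alpha
        for gamma, n, c in paths[v]:
            gamma, n = np.asarray(gamma, float), np.asarray(n, float)
            weight *= (gamma[c] + n[c]) / (gamma + n).sum()
        p[v] = weight
    return int(rng.choice(T, p=p / p.sum()))   # sample the new topic assignment
```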
5. Experiments

Synthetic Corpora: We present results on synthetic datasets to show how the Dirichlet Forest (DF) incorporates different types of knowledge. Recall that DF with η = 1 is equivalent to standard LDA (verified with the code of Griffiths & Steyvers, 2004). Previous studies often take the last MCMC sample (z and q_{1:T}), and discuss the topics φ_{1:T} derived from that sample. Because of the stochastic nature of MCMC, we argue that more insight can be gained if multiple independent MCMC samples are considered. For each dataset, and each DF with a different η, we run a long MCMC chain with 200,000 iterations of burn-in, and take out a sample every 10,000 iterations afterward, for a total of 200 samples. We have some indication that our chain is well mixed, as we observe all expected modes, and samples with "label switching" (i.e., equivalent up to label permutation) occur with near equal frequency. For each sample, we derive its topics φ_{1:T} with (7) and then greedily align the φ's from different samples, permuting the T topic labels to remove the label-switching effect. Within a dataset, we perform PCA on the baseline (η = 1) and project all samples into the resulting space to obtain a common visualization (each row in Figure 3; points are dithered to show overlap).

Figure 3. PCA projections of permutation-aligned samples for the four synthetic data experiments. (Panels: ML(B,C) at η = 1, 10, 50; CL(A,B) at η = 1, 500, 1500; Isolate(B) at η = 1, 500, 1000; Split(AB,CD) at η = 1, 100, 500.)

Must-Link(B,C): The corpus consists of six documents over a vocabulary of five "words". The documents are ABAB, CDCD, and EEEE, each represented twice. We let T = 2, α = 0.5, β = 0.01. LDA produces three kinds of φ_{1:T}: roughly a third of the time the topics are around [A/2 B/2 | C/4 D/4 E/2], which is shorthand for φ_1 = (1/2, 1/2, 0, 0, 0), φ_2 = (0, 0, 1/4, 1/4, 1/2) on the vocabulary ABCDE. Another third are around [A/4 B/4 E/2 | C/2 D/2], and the final third around [A/4 B/4 C/4 D/4 | E]. They correspond to clusters 1, 2 and 3, respectively, in the upper-left panel of Figure 3. We add a single Must-Link(B,C). When η = 10, the data still override our Must-Link somewhat, because clusters 1 and 2 do not disappear completely. As η increases to 50, the Must-Link overrides the data and clusters 1 and 2 vanish, leaving only cluster 3. That is, running DF and taking the last sample is very likely to obtain the [A/4 B/4 C/4 D/4 | E] topics. This is what we want: B and C are present or absent together in the topics, and they also "pull" A, D along, even though A, D are not in the knowledge we added.

Cannot-Link(A,B): The corpus has four documents, ABCCABCC and ABDDABDD, twice each; T = 3, α = 1, β = 0.01. LDA produces six kinds of φ_{1:T} evenly: [B/2 D/2 | A | C], [A/2 B/2 | C | D], [A/2 D/2 | B | C], [B/2 C/2 | A | D], [A/2 C/2 | B | D], [C/2 D/2 | A | B], corresponding to clusters 1-5 and the "lines". We add a single Cannot-Link(A,B). As the DF strength η increases, cluster 2 [A/2 B/2 | C | D] disappears, because it involves a topic (A/2 B/2) that violates the Cannot-Link. Other clusters become uniformly more likely.

Isolate(B): The corpus has four documents, all of which are ABC; T = 2, α = 1, β = 0.01. LDA produces three clusters evenly: [A/2 C/2 | B], [A/2 B/2 | C], [B/2 C/2 | A]. We add Isolate(B), which is compiled into Cannot-Link(B,A) and Cannot-Link(B,C). The DF's samples concentrate on cluster 1, [A/2 C/2 | B], which indeed isolates B into its own topic.

Split(AB,CD): The corpus has six documents, ABCDEEEE and ABCDFFFF, each present three times; α = 0.5, β = 0.01. LDA with T = 3 produces a large portion of topics around [A/4 B/4 C/4 D/4 | E | F] (not shown). We add Split(AB,CD), which is compiled into Must-Link(A,B), Must-Link(C,D), Cannot-Link(B,C), and increase T to 4. However, DF with η = 1 (i.e., LDA with T = 4) produces a large variety of topics: e.g., cluster 1 is [A/4 3B/8 3D/8 | A/8 7F/8 | C | E], cluster 2 is [C/8 7D/8 | 3A/8 3B/8 C/4 | E | F], and cluster 7 is [A/2 B/2 | C/2 D/2 | E | F]. That is, simply adding one more topic does not clearly separate AB and CD. On the other hand, with η increasing, DF eventually concentrates on cluster 7, which satisfies the Split operation.

Wish Corpus: We now consider interactive topic modeling with DF. The corpus we use is a collection of 89,574 New Year's wishes submitted to The Times Square Alliance (Goldberg et al., 2009). Each wish is treated as a document, downcased but without stopword removal. For each step in our interactive example, we set α = 0.5, β = 0.1, η = 1000, and run MCMC for 2000 iterations before estimating the topics from the final sample. The domain knowledge in DF accumulates across the steps.

Step 1: We run LDA with T = 15. Many of the most probable words in the topics are conventional ("to", "and") or corpus-specific ("wish", "2008") stopwords, which obscure the meaning of the topics.

Step 2: We manually create a 50-word stopword list and issue an Isolate preference. This is compiled into Must-Links among this set and Cannot-Links between this set and all other words in the top 50 for all topics. T is increased to 16.
After running DF, we end up with two stopword topics. Importantly, with the stopwords explained by these two topics, the top words for the other topics become much more meaningful.

Step 3: We notice that one topic conflates two concepts, enter college and cure disease (top 8 words: "go school cancer into well free cure college"). We issue Split("go,school,into,college", "cancer,free,cure,well") to separate the concepts. This is compiled into Must-Links within each quadruple, and a Cannot-Link between them. T is increased to 18. After running DF, one of the topics clearly takes on the "college" concept, picking up related words which we did not explicitly encode in our prior. Another topic does likewise for the "cure" concept (many wishes are like "mom stays cancer free"). Other topics have minor changes.

Step 4: We then notice that two topics correspond to romance concepts. We apply Merge("love, forever, marry, together, loves", "meet, boyfriend, married, girlfriend, wedding"), which is compiled into Must-Links between these words. T is decreased to 17. After running DF, one of the romance topics disappears, and the remaining one corresponds to the merged romance topic ("lose", "weight" were in one of them, and remain so). Other previous topics survive with only minor changes.

Table 1 shows the wish topics after these four steps, where we place the DF operations next to the most affected topics and color-code the words explicitly specified in the domain knowledge.

Table 1. Wish topics from interactive topic modeling. The left column gives the topic label or the DF operation placed next to the most affected topics (Merge, success, life, money, people, iraq, joy, family, vote, Isolate, god, peace, spam, Isolate, Split, Split); the right column gives each topic's top words sorted by φ = p(word|topic):
love lose weight together forever marry meet
health happiness family good friends prosperity
life happy best live time long wishes ever years
as do not what someone so like don much he out
make money up house work able pay own lots no
people stop less day every each other another
home safe end troops iraq bring war return
love true peace happiness dreams joy everyone
happy healthy family baby safe prosperous better
hope president paul ron than person bush
and to for a the year in new all my
god bless jesus everyone loved know heart christ peace world earth
win lottery around save com call if u 4 www 2 3 visit 1
i to wish my for and a be that the
job go great school into good college hope move mom
hope cancer free husband son well dad cure

Yeast Corpus: Whereas the previous experiment illustrates the utility of our approach in an interactive setting, we now consider a case in which we use background knowledge from an ontology to guide topic modeling. Our prior knowledge is based on six concepts. The concepts transcription, translation and replication characterize three important processes that are carried out at the molecular level. The concepts initiation, elongation and termination describe phases of the three aforementioned processes. Combinations of concepts from these two sets correspond to concepts in the Gene Ontology (e.g., GO:0006414 is translational elongation, and GO:0006352 is transcription initiation). We guide our topic modeling using Must-Links among a small set of words for each concept. Moreover, we use Cannot-Links among words to specify that we prefer (i) transcription, translation and replication to be represented in separate topics, and (ii) initiation, elongation and termination to be represented in separate topics. We do not set any preferences between the "process" topics and the "phase" topics, however.
The corpus that we use for our experiments consists of 18,193 abstracts selected from the MEDLINE database for their relevance to yeast genes. We induce topic models using DF to encode the Must-Links and Cannot-Links described above, and use standard LDA as a control. We set T = 100, α = 0.5, β = 0.1, η = 5000. For each word that we use to seed a concept, Table 2 shows the topics that include it among their 50 most probable words. We make several observations about the DF-induced topics.

Table 2. Yeast topics. The left column shows the seed words in the DF model. The middle columns indicate the topics in which at least 2 seed words are among the 50 highest-probability words for LDA; the "o" column gives the number of other topics (not shared by another word). The right columns show the same topic-word relationships for the DF model. Seed words: transcription, transcriptional, template, translation, translational, tRNA, replication, cycle, division, initiation, start, assembly, elongation, termination, disassembly, release, stop.
First, each concept is represented by a small number of topics, and the Must-Link words for each topic all occur as highly probable words in these topics. Second, the Cannot-Link preferences are obeyed in the final topics. Third, the topics combine the process and phase concepts compositionally. For example, DF Topic 4 represents transcription initiation and DF Topic 8 represents replication initiation. Moreover, the topics that are significantly influenced by the prior typically include highly relevant terms among their most probable words. For example, the top words in DF Topic 4 include "TATA", "TFIID", "promoter", and "recruitment", which are all specifically germane to the composite concept of transcription initiation. In the case of standard LDA, the seed concept words are dispersed across a greater number of topics, and highly related words, such as "cycle" and "division", often do not fall into the same topic. Many of the topics induced by ordinary LDA are semantically coherent, but the specific concepts suggested by our prior do not naturally emerge without using DF.

Acknowledgments: This work was supported by NIH/NLM grants T15 LM07359 and R01 LM07050, and the Wisconsin Alumni Research Foundation.

References

Basu, S., Davidson, I., & Wagstaff, K. (Eds.). (2008). Constrained clustering: Advances in algorithms, theory, and applications. Chapman & Hall/CRC.
Blei, D., & Lafferty, J. (2006). Correlated topic models. Advances in Neural Information Processing Systems 18 (pp. 147-154). Cambridge, MA: MIT Press.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
Chemudugunta, C., Holloway, A., Smyth, P., & Steyvers, M. (2008). Modeling documents by combining semantic concepts with unsupervised statistical learning. Intl. Semantic Web Conf. (pp. 229-244). Springer.
Dennis III, S. Y. (1991). On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Communications in Statistics - Theory and Methods, 20, 4069-4081.
Goldberg, A., Fillmore, N., Andrzejewski, D., Xu, Z., Gibson, B., & Zhu, X. (2009). May all your wishes come true: A study of wishes and how to recognize them. Human Language Technologies: Proc. of the Annual Conf. of the North American Chapter of the Assoc. for Computational Linguistics. ACL Press.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proc. of the Natl. Academy of Sciences of the United States of America, 101, 5228-5235.
Griggs, J. R., Grinstead, C. M., & Guichard, D. R. (1988). The number of maximal independent sets in a connected graph. Discrete Math., 68, 211-220.
Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. Proc. of the 23rd Intl. Conf. on Machine Learning (pp. 577-584). ACM Press.
Minka, T. P. (1999). The Dirichlet-tree distribution (Technical Report). http://research.microsoft.com/~minka/papers/dirichlet/minka-dirtree.pdf
Tam, Y.-C., & Schultz, T. (2007). Correlated latent semantic model for unsupervised LM adaptation. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (pp. 41-44).
The Gene Ontology Consortium (2000). Gene Ontology: Tool for the unification of biology. Nature Genetics, 25, 25-29.