Spherical Topic Models

Joseph Reisinger JOERAII @ CS . UTEXAS . EDU Austin Waters AUSTIN @ CS . UTEXAS . EDU Bryan Silverthorn BSILVERT @ CS . UTEXAS . EDU Raymond J. Mooney MOONEY @ CS . UTEXAS . EDU Department of Computer Science, 1 University Station C0500, University of Texas, Austin, TX 78712

Abstract
We introduce the Spherical Admixture Model (SAM), a Bayesian topic model for arbitrary 2 normalized data. SAM maintains the same hierarchical structure as Latent Dirichlet Allocation (LDA), but models documents as points on a high-dimensional spherical manifold, allowing a natural likelihood parameterization in terms of cosine distance. Furthermore, SAM can model word absence/presence at the document level, and unlike previous models can assign explicit negative weight to topic terms. Performance is evaluated empirically, both through human ratings of topic quality and through diverse classification tasks from natural language processing and computer vision. In these experiments, SAM consistently outperforms existing models.

model the absence of words, only their presence, as document likelihood is based on the multinomial distribution. In contrast, a multivariate Bernoulli likelihood can model word absence, e.g., distinguishing between "true absences" and "missing presences," but is incapable of modeling frequency (McCallum & Nigam, 1998). In this paper, we introduce the Spherical Admixture Model (SAM), a class of topic models that represent data using directional distributions on the unit hypersphere (Mardia & Jupp, 2000), modeling both word frequency and word presence/absence. Specifically, we derive an admixture model with a von Mises-Fisher likelihood, which has been demonstrated to model sparse data such as text more accurately than corresponding multinomial models (Banerjee et al., 2005; Zhong & Ghosh, 2005).
SAM offers several other major benefits over LDA . First, documents are modeled as arbitrary unit vectors, allowing for richer feature representations (e.g. tf-idf or t-test feature weighting). Second, document-topic similarity is measured in terms of weighted cosine distance, defining similarity in terms of the directions of their word frequency vectors, which provides significant robustness to feature noise. Third, by exploiting the entire support of the von Mises-Fisher distribution, topics are able to assign negative weights to words: for example, seeing "neurons" in a NIPS abstract might imply that we should expect to see "SVM" significantly less often on average. Finally, despite its increased complexity, SAM admits an efficient variational Bayesian inference procedure.

1. Introduction
Unsupervised admixture, or topic models, such as Latent Dirichlet Allocation (LDA; Blei et al., 2003) build compact descriptions of document collections in terms of a small set of semantically coherent topics. Individual documents are decomposed as mixtures over the topic set, with each document maintaining its own set of mixture parameters. Unlike standard mixture models, where each mixture component (topic) is responsible for explaining all of the variation in a subset of the corpus, admixture models allow mixture components to share responsibility, often resulting in a significantly better generative model of the data.
LDA is a fully Bayesian extension of Latent Semantic Analysis (LSA; Hofmann, 1999), representing documents directly as word counts, modeling them implicitly as weighted averages on the multinomial simplex. Unlike similar methods such as the Aspect-Bernoulli Model (ABM; Bingham et al., 2009), LDA is unable to directly

Appearing in Proceedings of the 27 th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

We evaluate SAM along two dimensions: (1) as a topic model using the human evaluation methods described in Chang et al. (2009) and (2) as a dimensionality reduction method on three real-world tasks, classifying Usenet posts from the CMU news-20 collection, detecting thematic shifts in the Italian text of Niccol` Machiavelli's Il o Principe, and classifying natural scenes in the 13-scene database (Fei-Fei & Perona, 2005). We find that SAM significantly outperforms LDA both in terms of human interpretability of topics and in its ability to capture salient semantic variation across all three corpora. This paper is divided into six sections: Section 2 covers re-

Spherical Topic Models

lated models, Section 3 introduces SAM and its variational approximation, Section 4 gives experimental results, Section 5 discusses future work, and Section 6 concludes.

2. Background
2.1. Spherical Mixture Models In this section and those subsequent, we adopt the terminology of topic models: data consists of D individual "documents," where each document is a set of "words" from a known vocabulary V . Probabilistic models of text have been built around the multinomial distribution and the von Mises-Fisher (vMF) distribution (Mardia & Jupp, 2000), and these distributions are associated with different representations of textual data. The multinomial distribution is the most straightforward model of discrete data. It assigns probabilities to integer vectors of event counts, which, for textual data, are typically raw non-normalized word counts in N|V | . The vMF distribution instead has its support on S , the unit (d-1)-sphere embedded in Rd . Its density is f (x; ľ, ) = cd () exp ľ x , where ľ is the mean direction with ||ľ|| = 1,   0 is the concentration paramed/2-1 is a normalization factor, and ter, cd () = (2)d/2 I d/2-1 () Ir (ˇ) is the modified Bessel function of the first kind and order r. vMF distributions have been used to model tf and tf-idf representations of text documents 2 -normalized onto S|V |-1 (Banerjee et al., 2005), and other directional data (Mardia & Jupp, 2000). The vMF distribution can be thought of as an Sd-1 analog of the multivariate Gaussian with spherical covariance, parameterized by cosine distance rather than Euclidean distance. Cosine distance computes similarity in terms of the directions of 2 -normalized feature vectors and corresponds to the normalized correlation coefficient. Evidence suggests that this type of directional measure is often superior to Euclidean distance in high dimensions (Manning & Sch¨ tze, 2000; Zhong & Ghosh, 2005). u The vMF is sensitive to word absence in a way that the multinomial is not: For example, let  = [1/3, 1/3, 1/3] be a multinomial parameter vector. Documents D1 = [1, 1, 1] and D2 = [3, 0, 0] are equiprobable under Mult(); in fact, all three-word documents have equal probability in this example. However, because D1 and D2 have different cosine distances from , the documents have different densities under a corresponding vMF. Although this is a simple example, it represents a larger issue with the multinomial. Inspired by the role of cosine distance in information retrieval, Banerjee et al. (2005) introduced the mixture of von Mises-Fisher distributions (movMF). The
d-1

movMF model treats each normalized document tf or tf-idf vector as drawn from a single vMF distribution centered on one cluster mean, selected by a common multinomial distribution. The likelihood of a T document d is f (d|)= t=1 t vMF(d|ľt , t ), where =(, ľ1 , 1 , . . . , ľT , T ),  is the parameterization of the multinomial over topics, and each ľ and  parameterizes the vMF distribution for a cluster. movMF generalizes classic clustering methods parameterized by cosine distance: when each cluster concentration  is taken to infinity, movMF becomes equivalent to spherical k-means (Banerjee et al., 2005). The movMF model successfully integrates a directional measure of similarity into a probabilistic setting, but its mixture model assumption--that each document is associated with a single cluster--is fundamentally restrictive. 2.2. Latent Dirichlet Allocation Admixture models such as LDA relax the assumption that each document is drawn exclusively from a single mixture component; instead, documents are drawn from a weighted average over all components. In LDA, this weighted average is implicit in the model structure (Blei et al., 2003). Each document wd maintains a separate multinomial distribution  d over topics . For each word token wi,d a topic index zi,d is drawn from  d and then wi,d is drawn from the corresponding topic multinomial zi,d . The generative model is given by:
 d | t | zi,d | d wi,d |zi,d     Dirichlet(), Dirichlet(), Mult( d ), Mult(zi,d ), d  D, t  T, i  |wd |, i  |wd |, (topic proportions) (topics) (topic indicators) (words)

where  and  are hyperparameters smoothing the perdocument topic distributions and per-topic word distributions respectively. As an admixture model, LDA relaxes the assumption that each document is drawn exclusively from a single mixture component. This flexibility allows it to uncover more fine-grained document structure than traditional mixture models. Furthermore, by marginalizing the topic indicators zi,d out of the model, LDA can be shown to draw each document from a multinomial whose parameters are a weighted average of the topics. The same intuition will be used to develop SAM as a weighted average over 2 -normalized topic means.

3. The Spherical Admixture Model
The Spherical Admixture Model (SAM), developed below, is a topic model for arbitrary 2 -normalized data. Like the movMF model, it is built on a probability distribution parameterized by cosine distance and capable of taking into account the absence of words; like LDA, it decomposes in-

Spherical Topic Models Table 1. Top positive and negative term weights learned by SAM on the NIPS corpus and Wikipedia. (+) shows the highest weighted words and (-) shows lowest weighted within each topic. Unlike LDA, SAM is able to represent words that are anti-correlated with the topic, rather than just unrelated. These correlations appear meaningful: In the case of Wikipedia, negatively weighted words are often related but not directly relevant to the topic. (+) svm kernel margin machines support (-) network experts units target clusters NIPS (+) genetic fitness crossover population search (-) mlp tree matrix discriminant lemma (+) navy ships naval submarines aircraft (-) airport airlines flights bus satellites Wikipedia (+) (-) album label singles chart song opera actor films players conservatory (+) india temple dynasty indian khan (-) germany borough england france parish

 
T


D w

z w

T


 v

 
mixing proportions for document d, t is the mean of topic t, and vd is the observed vector for document d. Each topic t is an arbitrary vector on the unit hypersphere S|V |-1 . Topics are equally capable of making words more or less likely: positive entries in a topic mean vector increase the weights of corresponding words in each perdocument mean, and negative entries reduce those weights (see Table 1 for an example from the NIPS and Wikipedia datasets). The empirical results in Section 4 demonstrate that this flexibility can help capture useful structure in data. 3.2. Variational Inference Given a document corpus, we are interested in inferring the posterior distribution of the topic means, topics, and perdocument topic proportions: p(, , ľ|v, , m, , 0 , ). Computing the exact posterior is intractable, thus we develop an efficient variational mean-field method to perform approximate inference in SAM. In variational mean-field methods, the true posterior is approximated by another distribution with a simpler, factored parametric form. An EM procedure is used to update the parameters of the approximate posterior and the model hyperparameters so that a lower bound on the log likelihood increases with each iteration (Jordan et al., 1999). We approximate the posterior as the factored distribution ~ ~ ~ ~ q(, , ľ|ľ, , ) = q(|ľ, )q(|)q(ľ|m, 0 ), ~ and assume the factors have the parametric forms ~ ~ q(t )=vMF(t |ľ, ), q(d )=Dir(d |), and ~ ~ ~ q(ľt )=vMF(ľt |mt , 0 ). Here, ľ, m, and  are the ~ free variational parameters. Given this factorization, a ~ ~ ~ lower bound L(ľ, , m) on the log likelihood is given by: ~ ~ ~ L(ľ, , m) = - = + - Eq [log p(v, , , ľ)] ~ ~ ~ Eq [log q(, , ľ; , , m)] Eq [log p(v|, )] + Eq [log p(|ľ, )] Eq [log p()] + Eq [log p(ľ)] ~ Eq [log q(|ľ, )] - Eq [log q(|~ )]  (1)


D

(a) LDA (b) SAM Figure 1. Graphical models for LDA and SAM.

dividual documents over multiple topics. 3.1. Model Definition
SAM is a Bayesian admixture model of normalized vectors on S|V |-1 . It is therefore not possible to define the admixture in terms of topic indicators for individual words in each document, as is done by LDA. SAM instead uses a weighted directional average to combine topics. To draw a collection of documents in SAM,

1. Draw a set of T topics  on the unit hypersphere; 2. For each document d, draw topic weights  d from a Dirichlet with hyperparameter ; 3. Draw a document vector vd from a vMF with mean Ż d = Avg(,  d ) and concentration . Representing the T topics as columns of matrix , and  d as a column vector, the weighted directional average  d Ż def is written as d = Avg(,  d ) = d .1 The complete generative model for SAM is given by:
ľ|0 t |ľ,   d | Ż d |,  d Ż vd |d ,     =  vMF(m, 0 ), vMF(ľ, ), Dirichlet(), Avg(,  d ),´ ` Ż vMF d ,  , t  T, d  D, d  D, d  D, (corpus mean) (topics) (topic proportions) (spherical average) (documents)

where ľ is the corpus mean direction,  controls the concentration of topics around ľ, the elements of  d are the
(Buss-Fillmore spherical average) This procedure does not yield the vector that minimizes the weighted sum of geodesic distances to the mean. Buss & Fillmore (2001) introduce the spherP def ical average AvgBF (, ) = arg minq i  i dS (i , q), where dS (p, q) is the geodesic distance between p, q  Sd . This definition is desirable, but must be computed iteratively.
1

- Eq [log q(ľ|m, 0 )]. ~

Spherical Topic Models

Note that the expectations in this expression are taken over the variational distribution q. The E step of variational EM consists of optimizing the expression for the log-likelihood lower bound (1) with respect to each of the free parameters d,i , ľt , and m. Similarly, in ~ ~ ~ the M step, eq. (1) is optimized with respect to each of the hyperparameters , m, , 0 , and . The EM procedure consists of alternating E and M steps until some suitable convergence criterion is reached. In this work, we use gradient ascent to update the variational topic means ľ and per-document topic proportions ~ d in the E step. For convenience, we define d,0 = ~ ~ k d,j and d = Eq [Avg(, d )] vd , where d  ~ j=1 {1 . . . D} ranges over the documents. Taking gradients of eq. (1) with respect to the variational parameters, we have: dL d~ d,i  =  d d +  (~ d,0 )(~ d,0 - 0 )   d~ d,i  - (~ d,i )(~ d,i - i )  
D ľt L ~ ľt d ~ d=1

The derivatives of Sd are: dSd d~ d,j  = 1 + 2(1 - Av ()2 )~ d,j + 2Av ()2 d ľ ľj  ~ ~ ~ d,0 (~ d,0 + 1) ~  2~ d,0 + 1  - Sd d,0 (d,0 + 1) ~ ~ 2(1 - AV ()2 )~ j + 2AV ()2 d,j ľd ľ ~ ~~ d,0 (~ d,0 + 1) ~ 

ľj Sd ~

=

Unlike the variational topics and topic proportions, the variational corpus mean m has a simple closed-form up~ date rule. The gradient of (1) with respect to m is: ~
T m L = 0 AV (0 )m + AV ()AV (0 ) ~ t=1

ľt + 2m, ~ ~

= AV ()AV (0 ) mt +  ~

where  is a Lagrange multiplier used to enforce the constraint that m must have unit 2 norm. Setting the gradi~ ent to zero and solving, we attain the closed-form update T rule m  0 m + AV () t=1 ľt . Update rules for ~ ~ the model hyperparameters can be derived using a process very similar to that above.2

Here  is the digamma function and AD (c) denotes the mean resultant length of a vMF distribution of dimension D with concentration c. This quantity can be approximated stably in high dimension using the approach of Abramowitz and Stegum, cf. Elkan (2006). Because d itself does not have a closed form, we use the approximation:
-1

4. Experiments
We evaluate SAM along two dimensions: First, the semantic coherence and relevance of its topics is compared against LDA using the subjective methods described by Chang et al. (2009). Second, its performance as a dimensionality reduction method is evaluated on three real-world tasks: classifying Usenet posts from the CMU news-20 collection, detecting thematic shifts in the Italian text of Niccol` Machiavelli's Il Principe, and classifying natural o scenes in the 13-scene database (Fei-Fei & Perona, 2005). Four models are compared: ˇ LDA ­ The Latent Dirichlet Allocation model, outlined in Section 2.2.3 ˇ movMF ­ The mixture of von-Mises Fisher clustering with soft assignments (Banerjee et al., 2005). ˇ SAM [S] ­ SAM with topic means in S|V |-1 that can contain both negative and positive entries. ˇ SAM [S+ ] ­ SAM with topics and spherical combinations restricted to the positive orthant of the unit hypersphere, ablating the model's ability to assign negative term weights in topics.
A MATLAB implementation is available at http://www. cs.utexas.edu/~austin 3 Collapsed Gibbs sampler with an asymmetric Dirichlet prior on the topic proportions, cf. Wallach et al. (2009); implemented in HBC (Daum´ III, 2009). e
2

E[Avg(, d )]

 E[d ] E

d  d

(2) (3)

 E[d ] E[d  d ]-1/2 .

The last factor is the expected squared norm of the random vector d , which we will refer to as Sd . These expectations can be computed in closed form using known properties of the Dirichlet and vMF distributions, yielding: d  AV ()~ d,0 Sd -1 where d,0 + (1 - AV ()2 ) d 2 + Av ()2 ||~ d ||2 ~ ~i ľ~ . Sd = d,0 (~ d,0 + 1) ~  Differentiating with respect to d,j and ľj , respectively, ~ ~ yields: dd AV () = d~ d,j  d,0 ~
ľj d ~ -1/2

(~ d ) vd , ľ~

ľj - ľd /~ d,0 ~ ~~  ľd dSd ~~  - 3/2  Sd 2Sd d~ d,j d,j vd ~ (~ d ) vd ľ~  - ˇ 3/2 Sd 2Sd
ľj Sd ~

vd

=

AV () d,0 ~

Spherical Topic Models
1.0 0.5 0.0 LDA
SAM

SAM (tf)

SAM (tf-idf)

sion task: the top five words from a topic are shuffled and a single intruder word is added to the set, drawn from the high probability words in a different topic. The rater is then asked to identify the intruder. As the semantic coherence and distinctness from other topics increases, this task becomes easier. Using LDA, raters were able to correctly identify the intruder words in 67.1% of cases (50 per model). Using SAM topics, raters were able to identify the intruders in 82.7% of cases with tf-idf features and 80.4% of cases with tf features (Figure 2). Both SAM results differ significantly from LDA (p<0.05; Student's t-test), indicating that SAM topics, when represented as the top weighted terms, are more semantically coherent than LDA topics. Topic relevance is evaluated through a forced-choice experiment: evaluators are presented with one of 100 randomly selected articles from Wikipedia and are asked to judge which of two topics is most relevant. Topics are ranked from both models and paired together for presentation: i.e. the highest weighted topic from SAM is paired with the highest weighted topic from LDA, etc. After discarding trials with low inter-rater agreement ( < 0.4; 47 trials), topics drawn from SAM are preferred roughly 3:2 over topics from LDA (0.616 ą 0.08), indicating that SAM topics are more relevant on average. 4.2. Classification Tasks In this section we compare the models through their performance as dimensionality reduction methods. The topic or cluster proportions inferred by each model are evaluated as features in several multiclass classification tasks. In all experiments in this paper, LDA is run with an asymmetric  prior and symmetric  prior, optimized using a hybrid Gibbs-EM empirical Bayes procedure. SAM uses  = 1500, 2 -normalized tf or tf-idf document representations and inference is performed with the Variational EM (VEM) procedure discussed in section 3.2. A simple Adaptive Metropolis-Hastings (MH) sampler for SAM is also evaluated, with  =  = 0.1.6 The total number of topics is fixed at 50.7 All results reported use Logistic Regression with a ridge estimator (le Cessie & van Houwelingen, 1992) and use 10×10-fold cross-validation. 4.2.1. CMU 20 N EWSGROUPS This first classification task is derived from the CMU news20 data set. Each news post is treated as a document
(Adaptive Metropolis-Hastings) Proposals for t are drawn from N (t , diag()) and projected onto the unit hypersphere.
7 Accuracy increases with T , but the main results here do not change significantly for T > 50. 6

(easy)
SAM

(hard)
LDA

1: vishnu, tamil, kerala, singh, nadu, meteorologist 2: oxidation, protein, potassium, footballers, hydrogen, symptoms 1: saloon, huron, burlington, county, mississippi, wl 2: tang, hong, howe, wu, kong, leone 1: male, mammals, empire, plants, species, birds 2: court, crimes, police, law, security, jazz 1: brother, sister, manga, anime, ride, orchestra 2: water, earth, power, energy, production, oil

(easy)
LDA

(hard)

Figure 2. (top) Boxplot showing summarizing human rater accuracy in the word-intruder task. (bottom) Examples of word intrusion questions that human raters found easy or difficult. Intruder words are shown in bold.

Quantitative evaluation measures common in clustering, such as normalized mutual information (Banerjee & Basu, 2007), are inappropriate in topic modeling because inferred topics do not necessarily correspond to pure partitions of the document collection. Furthermore, SAM and LDA cannot be compared directly in terms of perplexity, as they inhabit fundamentally different base measures.4 Instead, we focus our evaluation on qualitative corpus exploration and classifier accuracy, comparing topic proportion features derived from SAM and LDA to standard bag-of-words features (Blei et al., 2003). 4.1. Topic Interpretability Since SAM and LDA are incomparable in terms of perplexity, we instead evaluate the coherence and relevance of topics generated by both methods directly with human raters, adapting the procedure described by Chang et al. (2009). All experiments described in this section use Amazon's Mechanical Turk. Both LDA and SAM are trained on a random 10K document subset of Wikipedia.5 Unlike Chang et al. (2009), we perform significantly less aggressive postprocessing, and named entities are kept intact, making the inference task more difficult. In both experiments, each topic model is used to infer 50 topics. Responses are averaged over 8 human raters. Topic coherence is evaluated using a simple word intru(Measurability) vMF distributions are continuous, while multinomials are discrete; hence, neither perplexity nor the likelihood ratio test are applicable. 5 (Wikipedia dataset) Snapshot taken on 9/29/09; wikitext markup is removed, as are articles with fewer than 100 words. The 10K document subset has a vocabulary size of 16552 unique words and a total of 2M tokens.
4

Spherical Topic Models Table 2. Classification accuracy and 95% confidence intervals on the three news-20 tasks. SAM topic proportions make better features, particularly in more semantically tight domains. Since no significant difference was found between SAM [S] and SAM [S+ ], only SAM [S] is shown. Model different Bag-of-Words (tf) Bag-of-Words (tf-idf) Topic Only
LDA

Accuracy (%) similar 85.3 ą 0.7 85.9 ą 0.5 78.5 ą 2.7 64.5 ą 0.6 74.2 ą 0.4 81.2 ą 0.4 85.9 ą 0.3 85.7 ą 0.7 84.9 ą 0.5 84.9 ą 0.5 86.3 ą 0.5 88.1 ą 0.5

Table 3. Logistic regression accuracy using inferred features on the four-class Il Principe thematic shift detection task. Standard tf representations are used in all models. SAM infers significantly better features overall. Model Accuracy (%) Overall prin. war cond. Italy Bag-of-Words
LDA

same 75.9 ą 0.6 77.5 ą 0.8 66.3 ą 2.6 59.4 ą 0.4 56.0 ą 0.6 70.5 ą 0.5 75.0 ą 0.4 75.6 ą 0.8 75.8 ą 0.8 75.3 ą 0.6 75.6 ą 0.6 78.1 ą 0.6

91.3 ą 0.4 91.7 ą 0.3 87.8 ą 0.6 71.4 ą 0.3 71.9 ą 0.3 88.6 ą 0.4 93.3 ą 0.3 91.8 ą 0.4 91.1 ą 0.3 91.4 ą 0.5 91.9 ą 0.4 94.1 ą 0.3

movMF (tf) movMF (tf-idf) SAM (tf) SAM (tf-idf) Topic + Bag-of-Words
LDA

movMF MH SAM [S+ ] MH SAM [S] VEM SAM [S+ ] VEM SAM [S]

57.9 ą 3.4 57.3 ą 3.0 49.6 ą 8.3 46.1 ą 6.9 59.4 ą 5.4 58.7 ą 0.6 65.2 ą 0.3

60.5 59.4 47.6 46.5 60.9 64.9 71.3

71.3 63.9 11.7 31.8 51.7 71.1 65.1

55.3 58.1 55.8 54.4 64.8 60.8 62.5

45.1 34.9 0.0 8.3 31.4 13.9 50.6

movMF (tf) movMF (tf-idf) SAM (tf) SAM (tf-idf)

its performance (75.0% accuracy vs. 77.5% accuracy) despite the information lost in the reduction from 3000 features to only 50. Neither LDA nor movMF can match this accuracy. The performance gap between SAM and LDA suggests that generative models based on vMF distributions are better suited to capturing fine-grained semantic variation in text than are multinomial models. 4.2.2. D ETECTING T HEMATIC S HIFTS IN Il Principe Both SAM and LDA perform well when the corpus covers a wide variety of topics. To more precisely illustrate their differences, then, it is instructive to compare them in classification tasks where small semantic distinctions are important. In this section we perform supervised textual segmentation (identifying thematic shifts in discourse; cf. Hearst (1994)) on Niccol` Machiavelli's Il Principe. Since the o book is short, singly-authored, and thematically tight, topics must be fine-grained to be helpful. For training the topic models, documents are taken to be individual paragraphs of text. For classification, each paragraph is assigned one of four labels corresponding to the main themes of the book: (1) the types of principalities (chapters I - XI), (2) the types of armies (chapters XII - XIV), (3) the character and conduct of Princes (chapters XV- XXIII), and (4) the current political situation in Italy (circa 1505; chapters XXIV- XXVI). This split yields a challenging 4-way classification problem.9
SAM [S] discovers the best features in all settings; SAM significantly outperforms LDA and movMF, cutting relative classification error by by 18.5% (Table 3; significance determined using Fisher's Least Significant Difference test). Broken down by class, SAM sees the largest relative reductions in error for Italy, the most thematically ambiguous section. SAM [S] also outperforms SAM [S+ ] by a significant margin, highlighting the utility of explicitly representing negative term weights. Finally, since the Adaptive MH and VEM versions of SAM were run using the
9 (Il Principe dataset) The base text is the original Italian version, converted to lowercase with stopwords removed. A total of 128 paragraphs are extracted; 39.8% are labeled principalities, 37.5% are labeled conduct, 12.5% armies and 9.3% Italy.

and labeled with its newsgroup. Following Banerjee & Basu (2007), three subsets of news-20 are used for evaluation: (1) news-20-different, with posts from the unrelated groups rec.sport.baseball, sci.space and alt.atheism; (2) news-20-similar, with posts from the more similar groups rec.sport.baseball, talk.politics.guns and talk.politics.misc; and (3) news-20-same, with posts from the highly related groups comp.os.ms-windows.misc, comp.windows.x and comp.graphics. These domains span corpora with varying degrees of subject similarity, making it possible to measure how well SAM and LDA identify meaningful topics that capture fine-grained semantic structure. Each model is evaluated based on the performance of its topic proportions as features for classification, using raw bag-of-words features as the baseline. Table 2 summarizes the experimental results. In general, SAM finds better features than the other models, performing about as well as raw bag-of-words. The difference between SAM and LDA persists even as the task becomes more semantically tight, indicating that it finds more meaningful distinctions between finer-grained topics.8 Furthermore, features derived from tf-idf SAM significantly improve classification accuracy in news-20-different and news-20similar when combined with raw bag-of-words features, unlike LDA (news-20-different: 94.1% accuracy vs. 91.8% accuracy for LDA and 91.7% accuracy for bag-of-words only; news-20-similar: 88.1% accuracy vs. 85.7% accuracy for LDA and 85.9% accuracy for bag-of-words only). The bag-of-words representation is best for the semantically tight news-20-same dataset, but SAM nearly matches
(Classifier robustness) The results do not change significantly when replacing Logistic Regression with an SVM or Naive Bayes; implementations from Weka (Witten & Frank, 2005).
8

Spherical Topic Models

same convergence criterion, the results indicate that the approximations made in VEM SAM do not significantly affect performance, despite the fact that MH sampling takes significantly longer (10 hours as opposed to 20 minutes). The number of topics chosen has a large impact on performance for T < 50, but performance is relatively stable for T > 50. Hence selecting T using a nonparametric prior may be justifiable (Teh et al., 2006). 4.2.3. 13 NATURAL S CENE C ATEGORIES The final task involves classifying visual images according to their natural scene type, e.g. living room, coast, forest, etc, using the 13 scene database proposed by FeiFei & Perona (2005). We divide the full 13 class visual scene recognition task, 13-scene-full, into four separate 4class problems: 13-scene-different (including livingroom, MITstreet, CALsuburb, and MITopencountry), 13-scenesimilar (MITinsidecity, MITstreet, CALsuburb, MITtallbuilding), 13-scene-outdoor (MITcoast, MITforest, MITmountain, MITopencountry), and 13-scene-indoor (bedroom, kitchen, livingroom, PARoffice), ordered by their classification difficulty. We follow Fei-Fei and Perona's preprocessing steps: densely sampling patches, computing 128-dimensional SIFT descriptors, then clustering the descriptors using spherical k-means. The resulting clusters are treated as visual words, and each image is represented by its visual word counts. Note that despite the fact that a similar visual-bag-of-words representation is employed in this task, it differs fundamentally from the previous tasks in terms of sparsity: Figure 3 indicates that most visual words tend to occur in most scenes, leading to denser document representations. Thus, the comparative results obtained in this domain can be considered an ablation of SAM's ability to model the lack of features. Using 200 visual words, we find that SAM significantly outperforms LDA across all scene recognition tasks (Figure 4) when 10% of the data is used for training a Logistic Regression classifier. As more training data is used, the performance of LDA and SAM converge, indicating that SAM may perform better relative to LDA with less data. We find that neither topic model significantly outperforms a simple visual-bag-of-words representation for |V | = 200; for |V | = 1500, however, both significantly outperform visual-bag-of-words. With dense features, SAM provides a smaller benefit relative to LDA, as cosine distance and KLdivergence correspond more closely.

Histograms of document frequency 1000 principe news-20-different 13-scene |V|=200 13-scene |V|=1500

# of terms

100

10

1 1 10 100 1000 document frequency 10000

Figure 3. Histograms showing the number of terms with a given document frequency; 13-scene is significantly more dense in terms of vocabulary coverage than the textual domains.

used with SAM, such as modeling infinite-capacity (Teh et al., 2006) or correlated topics (Blei & Lafferty, 2005). Furthermore, SAM can be extended to model word burstiness explicitly, as in DCM - LDA (Doyle & Elkan, 2009), by exploiting the conjugate prior of Nu~ ez-Antonio & n Guti´ rrez-Pe~ a (2005). Second, although sparse document e n vectors improve the efficiency of the inference methods presented here, the topic vectors themselves are not sparse, leading to storage overhead that scales as O(|V | ˇ T ). Such overhead is undesirable with larger corpora. One possible solution is a spherical admixture with sparse topic representations (i.e., each topic only spans a subspace of the full S|V |-1 ), leading to more efficient inference and lower storage overhead.

6. Conclusion
This paper has developed SAM, an admixture model that decomposes spherically distributed data into weighted combinations of component vMF distributions. Unlike previous spherical models, SAM is a fully Bayesian admixture model that allows multiple component vMFs to explain different aspects of the data. Unlike previous admixture models, SAM uses directional distributions parameterized by cosine distance, can explicitly assign negative weight to topic terms, and models document-level word absence. In both subjective human studies and dimensionality reduction experiments, SAM was found to produce more relevant topic features than did either the movMF spherical mixture model or LDA, particularly on data where fine-grained topic distinctions are important. Three properties--cosine distance, negative terms, and word absence/presence--were shown to contribute to its performance.

5. Future Work
opens up a new class of admixture models: those based on spherical distributions. From that class originate three important avenues for future work. First, most extensions to LDA proposed in recent literature can easily be
SAM

Acknowledgements
We would like to thank Arindam Banerjee for early discussions and the movMF implementation and Kristen Grauman for input on the vision domain. This work was par-

similar |V|=200 lar |V|=200
Accuracy
LDA SAM

Spherical Topic Models

diff. 79.3 85.0

sim. 68.5 74.4

outdoor 60.9 68.4

indoor 43.6 50.2

all 43.4 50.3

Chang, Jonathan, Boyd-Graber, Jordan, Wang, Chong, Gerrish, Sean, and Blei, David M. Reading tea leaves: How humans interpret topic models. In NIPS, 2009. Daum´ III, Hal. HBC: Hierarchical Bayes compiler, 2009. e http://hal3.name/HBC. Doyle, Gabriel and Elkan, Charles. Accounting for word burstiness in topic models. In ICML, 2009. Elkan, Charles. Clustering documents with an exponentialfamily approximation of the Dirichlet compound multinomial distribution. In ICML, 2006. Fei-Fei, Li and Perona, Pietro. A Bayesian hierarchical model for learning natural scene categories. CVPR, 2005. Hearst, Marti A. Multi-paragraph segmentation of expository text. In ACL, 1994. Hofmann, Thomas. Probabilistic latent semantic indexing. In Research and Development in Information Retrieval, 1999.

[S]

50 aining %

Bag-of-Words Bag-of-Words LDA Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Bag-of-Words LDA SAM LDA SAM Tommi S., and Saul, Lawrence K. An introduction Figure 4. (top) Classification accuracy for 13-scene with |V | = 50 100 SAM to variational methods for graphical models. Machine 200. (bottom) Learning curves for all classes, |V | = 200 and 100 training % Learning, 37(2), 1999. 100 |V | = 1500.

tially supported by a Google Research Award and an NSF Graduate Research Fellowship to the first author. Experiments were run on the Mastodon Cluster, provided by NSF Grant EIA-0303609.

le Cessie, S. and van Houwelingen, J.C. Ridge estimators in logistic regression. Applied Statistics, 41(1), 1992. Manning, Christopher and Sch¨ tze, Hinrich. Foundations u of Statistical Natural Language Processing. 2000. Mardia, Kanti V. and Jupp, Peter E. Directional Statistics. Wiley, 2000. McCallum, Andrew and Nigam, Kamal. A comparison of event models for Naive Bayes text classification. In AAAI Workshop on Learning for Text Categorization, 1998. Nu~ ez-Antonio, Gabriel and Guti´ rrez-Pe~ a, Eduardo. A n e n Bayesian analysis of directional data using the von Mises-Fisher distribution. Communications in Statistics ­ Simulation and Computation, 34, 2005. Teh, Yee W., Jordan, Michael I., Beal, Matthew J., and Blei, David M. Hierarchical Dirichlet processes. JASA, 101, 2006. Wallach, Hanna, Mimno, David, and McCallum, Andrew. Rethinking LDA: Why priors matter. In NIPS. 2009. Witten, Ian H. and Frank, Eibe. Data Mining: Practical Machine Learning Tools and Techniques. 2005. Zhong, Shi and Ghosh, Joydeep. Generative model-based document clustering: A comparative study. KAIS, 8(3), 2005.

References
Banerjee, Arindam and Basu, Sugato. Topic models over text streams: A study of batch and online unsupervised learning. In SDM, 2007. Banerjee, Arindam, Dhillon, Inderjit S., Ghosh, Joydeep, and Sra, Suvrit. Clustering on the unit hypersphere using von Mises-Fisher distributions. JMLR, 6, 2005. Bingham, Ella, Kab´ n, Ata, and Fortelius, Mikael. The a aspect Bernoulli model: multiple causes of presences and absences. Pattern Analysis and Applications, 12(1), 2009. Blei, David, Ng, Andrew, and Jordan, Michael. Latent Dirichlet allocation. JMLR, 3, 2003. Blei, David M. and Lafferty, John D. Correlated topic models. In NIPS, 2005.
5 4

Buss, Samuel R. and Fillmore, Jay P. Spherical averages and applications to spherical splines and interpolation. ACM Transactions on Graphics, 20(2), 2001.