Memory Bounded Inference in Topic Models

Ryan Gomes  gomes@vision.caltech.edu
Dept. of Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125 USA

Max Welling  welling@ics.uci.edu
Bren School of Information and Computer Science, University of California at Irvine, Irvine, CA 92697 USA

Pietro Perona  perona@vision.caltech.edu
Dept. of Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125 USA

Abstract

What type of algorithms and statistical techniques support learning from very large datasets over long stretches of time? We address this question through a memory bounded version of a variational EM algorithm that approximates inference in a topic model. The algorithm alternates two phases, "model building" and "model compression", in order to always satisfy a given memory constraint. The model building phase expands its internal representation (the number of topics) as more data arrives, using Bayesian model selection. Compression is achieved by merging data-items into clumps and caching only their sufficient statistics. Empirically, the resulting algorithm is able to handle datasets that are orders of magnitude larger than the standard batch version.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

1. Introduction

Consider a collection of surveillance cameras monitoring an airport. The cameras learn a model of their environment without supervision.
Moreover, they learn for many years without significant interruption. Gradually, as more data is captured, the cameras build a joint model of visual object categories. This problem is akin to the way children learn to understand the world through a continuous process of mostly unsupervised learning. As children grow up they build an increasingly sophisticated internal representation of object categories that continuously restructures itself.

In this paper we ask ourselves: what statistical techniques are suitable for this "lifelong learning" task? First, we need a class of models that can naturally expand as more data arrives, i.e. their capacity should not be bounded a priori. Second, these models should allow efficient learning algorithms, both in terms of time and space. For instance, we should not have to store every single piece of information that has been captured. Our technique must produce a sequence of model estimates that reflect new information as it arrives, and the time required to produce each model update must scale modestly as more data is acquired. Finally, we require that the sequence of learned models be sufficiently similar to those that would be produced by a batch algorithm with access to the entire history of data observed at the time of each model update.

Nonparametric Bayesian techniques such as the Dirichlet Process (DP) (Ferguson, 1973) and the Hierarchical Dirichlet Process (HDP) (Teh et al., 2006) satisfy our first desideratum, in that they naturally increase their model complexity with the available data. However, most existing nonparametric Bayesian approaches are batch algorithms: they require every single data-point to be stored and revisited during learning. A batch algorithm could be applied naively to the continuous learning scenario, but all data would need to be cached and a new batch learning process would be run on the entire dataset to produce each model update. This would violate our second criterion, in that the time and space requirements would grow unacceptably as the system ages.

Here we propose a more flexible setup, where we impose a bound on the available memory but still allow the model order to increase with more data. We compress the data and the internal representation of the model without losing much in terms of model accuracy. The effect is that time and space requirements scale much more gradually over the lifetime of the system. The memory bound does impose a limit on the total capacity of the model, but this trade-off is flexible and can be adjusted online, i.e. as the model is learned. Experiments with a memory bounded variational approximation to HDP show that this technique can handle datasets many times larger than the standard implementations, and results in substantially shorter run-times.

2. A Memory Bounded Variational Topic Model

At a high level, the idea is to use a variational approximation related to LDA (Blei et al., 2003) and HDP (Teh et al., 2006). Memory savings are achieved by "clumping" together data-cases. That is, we constrain groups of data-points to have equal topic assignment variational distributions: q(z_ij) = q(z_i'j') = q(z_c) when points x_ij and x_i'j' are members of the clump c. This allows us to achieve memory savings, because variational optimization performed under this constraint requires only the sufficient statistics of the data-cases in a clump, and the system can forget the exact identities of the summarized data-points. Similarly, we also clump entire documents (or images) by tying their variational distributions over topics: q(\pi_j) = q(\pi_{j'}) = q(\pi_s) if documents j and j' belong to the same document group s. This tying of variational distributions guarantees that learning optimizes a lower bound to the exact free energy objective function, where the bound becomes looser with more tying. This idea was also leveraged in (Verbeek et al., 2003) and (Kurihara et al., 2006) to accelerate learning of mixtures of Gaussians and DP mixtures of Gaussians using KD-trees. In the following we will talk about documents, but we note that this refers to other structured objects, such as images, as well.

2.1. The Variational Topic Model

The following Bayesian topic model is our starting point,

p(x, z, \pi, \beta, \alpha) = \prod_k p(\beta_k|\lambda) \prod_j \Big[ \prod_i p(x_{ij}|z_{ij}; \beta)\, \pi_{j, z_{ij}} \Big] \mathcal{D}(\pi_j; \alpha) \prod_k p(\alpha_k)    (1)

Here x_ij is word i in document j and z_ij denotes the topic that generated x_ij. \pi_j denotes the mixture of topics that generated the words in document j, with \sum_k \pi_{jk} = 1. The \pi_j are distributed according to a Dirichlet distribution with parameter \alpha. Boldface symbols denote vector valued quantities. In this expression we assume that p(x|z, \beta) is in the exponential family (strictly speaking, the exponential family includes additional multiplicative base-measure terms such as h(x) in the likelihood and its analogue in the prior; we leave these out to simplify the derivation and because for most well known distributions they are simply 1, although it is straightforward to include them),

p(x|z = k, \beta) = \exp\Big( \sum_l \beta_{kl} \phi_l(x) - A_k(\beta_k) \Big)    (2)

and that p(\beta|\lambda) is conjugate to p(x|z, \beta),

p(\beta_k|\lambda) = \exp\Big( \sum_l \lambda_l \beta_{kl} - \lambda_0 A_k(\beta_k) - B(\lambda) \Big)    (3)

The posterior distributions over \beta, \pi, z are approximated variationally as

q(\beta) = \prod_k q(\beta_k; \mu_k)    (4)
q(\pi) = \prod_j \mathcal{D}(\pi_j; \gamma_j)    (5)
q(z) = \prod_j \prod_i q(z_{ij})    (6)

where we have introduced variational parameters {\mu_{kl}, \gamma_{kj}, q_{ijk}}, the latter subject to \sum_k q_{ijk} = 1. Furthermore, \mathcal{D} denotes a Dirichlet distribution, while q(\beta_k; \mu_k) is also conjugate to p(x|z = k, \beta),

q(\beta_k; \mu_k) = \exp\Big( \sum_l \mu_{kl} \beta_{kl} - \mu_{k0} A_k(\beta_k) - B_k(\mu_k) \Big)    (7)

By writing down the variational free energy and minimizing it over the variational parameters, we find the following intuitive updates,

\mu_{kl} = F_{kl} + \lambda_l; \quad \mu_{k0} = N_k + \lambda_0; \quad \gamma_{kj} = N_{kj} + \alpha_k    (8)

with

F_{kl} = \sum_{ij} q_{ijk}\, \phi_l(x_{ij})    (9)
N_k = \sum_{ij} q_{ijk}, \quad N_{kj} = \sum_i q_{ijk}    (10)

and

q_{ijk} = \frac{1}{Z_{ij}} \exp\Big[ \sum_l E[\beta_{kl}|\mu_{kl}]\, \phi_l(x_{ij}) \Big] \exp\big[ \psi(\gamma_{kj}) \big] \exp\big[ -E[A_k(\beta_k)|\mu_{k0}] \big]    (11)

where \psi denotes the digamma function, Z_ij enforces the constraint \sum_k q_{ijk} = 1, and the expectations are over q(\beta).
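A minimal sketch of the updates (8)-(11) for a single document, assuming Gaussian topic distributions (the setting used in our experiments) and a plug-in Gaussian log-density in place of the exact conjugate expectations E[\beta_{kl}|\mu_{kl}] and E[A_k(\beta_k)|\mu_{k0}]; names and the plug-in simplification are illustrative, not part of the derivation above:

import numpy as np
from scipy.special import digamma
from scipy.stats import multivariate_normal

def variational_updates(X, means, covs, alpha, n_iter=20):
    """Eqs. (8)-(11) for one document, with a plug-in Gaussian log-density
    standing in for the exact exponential-family expectations."""
    N, K = X.shape[0], len(means)
    # plug-in approximation to the expected log-likelihood of each word under each topic
    loglik = np.column_stack([
        multivariate_normal.logpdf(X, mean=means[k], cov=covs[k]) for k in range(K)])
    gamma = np.full(K, alpha + N / K)          # gamma_kj, initialized uniformly
    for _ in range(n_iter):
        # eq. (11): q_ijk proportional to exp(E[log p(x_ij|beta_k)] + psi(gamma_kj))
        logq = loglik + digamma(gamma)
        logq -= logq.max(axis=1, keepdims=True)
        q = np.exp(logq)
        q /= q.sum(axis=1, keepdims=True)      # Z_ij normalization
        # eqs. (8) and (10): gamma_kj = N_kj + alpha_k with N_kj = sum_i q_ijk
        gamma = q.sum(axis=0) + alpha
    # eq. (9), first-order statistics only: F_k = sum_i q_ijk x_ij (second moments omitted)
    F = q.T @ X
    Nk = q.sum(axis=0)
    return q, gamma, F, Nk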
To learn the parameters {\alpha_k} we first introduce gamma priors,

p(\alpha) = \prod_k \mathcal{G}(\alpha_k; a, b)    (12)

Using the bounds in (Minka, 2000), we can derive the following update if we first insert the updates for \mu and \gamma into the free energy,

\alpha_k \leftarrow \frac{ (a-1) + \alpha_k \sum_j \big( \psi(\gamma_{kj}) - \psi(\alpha_k) \big) }{ b + \sum_j \big( \psi(\gamma_j) - \psi(\alpha) \big) }    (13)

with \gamma_j = \sum_k \gamma_{kj}, N_j = \sum_k N_{kj}, and \alpha without a subscript denoting \sum_k \alpha_k.

2.2. Optimizing the Number of Topics K

Our strategy for searching for a good value of K is to truncate the topic distributions by setting q(z_ij > K) = 0 (see also (Teh et al., 2008)). This has the effect that most terms in the free energy with k > K cancel, the exception being the prior terms p(\alpha_k), k > K. For these terms we know that the value of \alpha_k minimizing the free energy is given by the MAP value of the gamma prior, \alpha_k = (a-1)/b, for k > K. Inserting this back into the free energy, we accumulate K_max - K terms

\epsilon = a \log b - \log \Gamma(a) + (a-1) \log \frac{a-1}{b} - (a-1)    (14)

where K_max is the maximum number of topics. It is guaranteed that there exists a solution with lower free energy if we increase K. The reason is that we relax a self-imposed constraint on the variational parameters (namely that q(z_ij > K) = 0). As K increases, the relative improvement in free energy quickly attenuates. The final value of K is obtained by thresholding this relative improvement. The nesting property (models with larger K are better) is the same as for the variational approximations to the DP in (Kurihara et al., 2006) and the HDP in (Teh et al., 2008). This raises the question whether we can take the infinite limit in our model as well. The problem is that (K_max - K)\epsilon diverges as K_max goes to infinity. This can be traced back to the fact that we should have added a proper prior p(K), which would have diminished the contribution at large K. Instead we choose an improper, constant prior to avoid the need to estimate likely values of K a priori. However, it is still possible to work with infinite free energies because we are only interested in the relative change in free energy after increasing K, which is a finite quantity. In our experiments we chose a = 1 and b = 0.5, so that the MAP prior value of \alpha_k is 0.

2.3. Clumping Data-Items and Documents

We will now tie some of the variational distributions {q_{ijk}} across different data-items, within and across documents (images), to a "clump distribution" q_{ck}. Similarly, we tie some document specific distributions over topics {q(\pi_j)} into a document group distribution q(\pi_s). Note that since we impose constraints on the variational distributions, this has the effect of loosening the variational bound.

Define D_s to be the number of documents in document group s, N_c the number of data-items in word clump c, N_cs the number of words in document group s and word clump c, and finally \bar\phi_l^c = \sum_{ij \in c} \phi_l(x_{ij}). In terms of these we further define

N_{ks} = \sum_c q_{ck} N_{cs}    (15)
N_k = \sum_c q_{ck} N_c    (16)
F_{kl} = \sum_c q_{ck}\, \bar\phi_l^c    (17)

With these definitions we derive the following "clumped" update rules for the variational parameters \mu_{kl} and \gamma_{ks},

\mu_{kl} = F_{kl} + \lambda_l    (18)
\mu_{k0} = N_k + \lambda_0    (19)
\gamma_{ks} = \frac{N_{ks}}{D_s} + \alpha_k    (20)

and

q_{ck} = \frac{1}{Z_c} \exp\Big[ \sum_l E[\beta_{kl}|\mu_{kl}]\, \frac{\bar\phi_l^c}{N_c} \Big] \exp\big[ -E[A_k(\beta_k)|\mu_{k0}] \big] \exp\Big[ \sum_s \frac{N_{cs}}{N_c}\, \psi(\gamma_{ks}) \Big]    (21)

The update for \alpha becomes

\alpha_k \leftarrow \frac{ (a-1) + \alpha_k \sum_s D_s \big( \psi(\gamma_{ks}) - \psi(\alpha_k) \big) }{ b + \sum_s D_s \big( \psi(\gamma_s) - \psi(\alpha) \big) }    (22)

with \gamma_s = \sum_k \gamma_{ks}. An expression for the free energy, after inserting expressions (18), (19) and (20), is given by eq. (29) in the appendix.
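A minimal sketch of the clumped updates (15)-(21), assuming the expectations E[\beta_{kl}|\mu_{kl}] and E[A_k(\beta_k)|\mu_{k0}] are supplied by the chosen exponential-family model; array names and shapes are illustrative:

import numpy as np
from scipy.special import digamma

def clumped_updates(phi_bar, N_c, N_cs, D_s, exp_beta, exp_A, gamma, lam, lam0, alpha):
    """One round of eqs. (15)-(21). exp_beta[k, l] and exp_A[k] stand in for
    E[beta_kl | mu_kl] and E[A_k(beta_k) | mu_k0]; gamma has shape (K, S)."""
    # eq. (21): clump responsibilities q_ck
    logq = (phi_bar / N_c[:, None]) @ exp_beta.T          # sum_l E[beta_kl] phi_bar_l^c / N_c
    logq += (N_cs / N_c[:, None]) @ digamma(gamma).T      # sum_s (N_cs / N_c) psi(gamma_ks)
    logq -= exp_A[None, :]                                # - E[A_k(beta_k)]
    logq -= logq.max(axis=1, keepdims=True)
    q_ck = np.exp(logq)
    q_ck /= q_ck.sum(axis=1, keepdims=True)               # Z_c normalization
    # eqs. (15)-(17): clump-level counts and sufficient statistics
    N_ks = q_ck.T @ N_cs                                  # shape (K, S)
    N_k = q_ck.T @ N_c                                    # shape (K,)
    F_kl = q_ck.T @ phi_bar                               # shape (K, L)
    # eqs. (18)-(20): natural-parameter and Dirichlet updates
    mu = F_kl + lam
    mu0 = N_k + lam0
    gamma_new = N_ks / D_s[None, :] + alpha[:, None]
    return q_ck, mu, mu0, gamma_new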
3. Incremental Learning with a Memory Constraint

Our algorithm processes data in small groups composed of E documents, which we refer to as epochs. After the arrival of each epoch the algorithm proceeds in two stages: a model building phase, during which a new model estimate is produced, and a compression phase, in which decisions are made as to which words and documents to clump. The sufficient statistics of each clump are computed, and data summarized by clumps are purged from memory. The assignment distributions q(z) of purged data and the topic distributions q(\pi) of merged documents are discarded as well. The clump sufficient statistics are retained along with the current model estimate, which serves as a starting point for the next round of learning.

3.1. Model Building Phase

The model building phase optimizes the free energy under the parameter tying constraints induced by the choice of clumps in previous compression phases. We perform a split-merge procedure similar to (Ueda et al., 1999) to determine the number of topics, using the heuristics in that work to rank topic suitability for splits and merges. In our experiments we use Gaussian topic distributions, so splits are proposed along the principal component of the topic. The split proposals are refined by restricted variational updates. That is, equations (21), (18), (19), (20), and (22) are iterated, but only for data-points whose highest responsibility is to the split topic, and the points may be assigned only to the two descendant topics. Merges are carried out by instantiating a new topic with the data-points with highest responsibility to the merged topics. A total of 10 splits and 10 merges are proposed and evaluated by the resultant change in free energy (eq. 29). The top ranked change is then used to initialize full variational updates (which involve all data-points). The model building phase halts once the change in free energy divided by its previous value falls below a threshold, which was chosen to be 10^-5 in our experiments. The procedure is summarized in Algorithm 3.1.

Model Building Phase (Algorithm 3.1)
Input: Previous model {\mu_{kl}, \gamma_{ks}, \alpha_k, \bar\phi_l^c, N_cs, D_s}, and current epoch of E documents.
Initialize \gamma_{kj} = \alpha_k for j = |S| + 1, ..., |S| + E
Iterate eqs. (21), (18), (19), (20), and (22) until convergence
repeat
  Rank splits and merges according to the criteria in (Ueda et al., 1999)
  for i = 1 to 10 do
    Split the i-th ranked candidate topic along its principal component
    Restricted iteration of eqs. (21), (18), (19), and (20) until convergence
    Evaluate the change in eq. (29) resulting from the split
  end for
  for i = 1 to 10 do
    Merge the i-th ranked pair of topics
    Evaluate the change in eq. (29) resulting from the merge
  end for
  Select the split or merge that yielded the largest change in eq. (29)
  Iterate eqs. (21), (18), (19), and (20) until convergence
until the change in eq. (29) is less than a threshold
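A sketch of how a split proposal might be initialized along a topic's principal component before the restricted updates are run; the jitter scale and the choice to copy the parent covariance into both children are illustrative assumptions, not prescribed by the algorithm above:

import numpy as np

def split_topic_along_pc(X, resp_k, jitter=0.5):
    """Seed a split of one Gaussian topic along its principal component.
    X: points (or clump means) with highest responsibility to the topic;
    resp_k: their responsibilities."""
    w = resp_k / resp_k.sum()
    mean = w @ X
    diff = X - mean
    cov = (diff * w[:, None]).T @ diff            # responsibility-weighted covariance
    evals, evecs = np.linalg.eigh(cov)
    pc = evecs[:, -1] * np.sqrt(evals[-1])        # principal component direction, scaled
    # two child topics displaced in opposite directions along the principal component
    child_means = np.stack([mean + jitter * pc, mean - jitter * pc])
    child_cov = cov.copy()                        # both children start from the parent covariance
    return child_means, child_cov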
3.2. Compression Phase

The goal of the compression phase is to determine groups of data-points that are to be summarized by clumps, and to identify documents that are to be merged into document groups. Clumps are identified using a greedy top-down splitting procedure. Because data-points summarized by clumps are ultimately discarded, the compression process is irreversible. Therefore it is of fundamental importance to predict the locations of future data when deciding which points to clump. In order to estimate this, we rank cluster splits according to a modified free energy (eq. 30) in which the data sample size is artificially increased by a factor T_pts / \sum_c N_c and the number of documents is scaled by T_docs / \sum_s D_s, where T_pts and T_docs are the target numbers of data-points and documents expected during the lifetime of the system. This is equivalent to using the empirical data distribution as a predictive model of future data. If we determine clumps using the standard free energy, the algorithm fails to split large groups of points that are likely to split once more data has arrived. Instead, it wastes memory by placing "stray" points in their own clumps.

We initialize the process by hard assigning each clump or data-point to the cluster with highest responsibility during the previous model building phase. We then proceed through each cluster and split it along the principal component, and refine this split by iterating the restricted variational update equations for the points in the cluster. The updates are modified by the data magnification factors:

\mu_{kl} = \frac{T_{pts}}{\sum_c N_c} F_{kl} + \lambda_l    (23)
\mu_{k0} = \frac{T_{pts}}{\sum_c N_c} N_k + \lambda_0    (24)
\alpha_k \leftarrow \frac{ (a-1) + \alpha_k \frac{T_{docs}}{\sum_s D_s} \sum_s D_s \big( \psi(\gamma_{ks}) - \psi(\alpha_k) \big) }{ b + \frac{T_{docs}}{\sum_s D_s} \sum_s D_s \big( \psi(\gamma_s) - \psi(\alpha) \big) }    (25)

The updates for q_ck and \gamma_ks are unchanged. After the clusters are refined, the data-points are hard assigned to the sub-cluster with greatest responsibility, and the proposed split is ranked according to the resultant change in eq. (30). We then greedily split the cluster with highest rank. The process repeats itself, with new clusters ranked in the same way described above. We cache the results of each split evaluation to avoid redundant computation. After we have reached a given memory bound, we extract the partitions resulting from this recursive splitting procedure as our new clumps.

Each clump must store sufficient statistics for a full covariance Gaussian component, which requires (d^2 + 3d)/2 values, where d is the dimension of the feature space. In addition, |S| (the number of document groups) values must be stored to represent the counts N_cs for each clump. Note that from this perspective it only makes sense to create clumps within a cluster if it contains more than (d+3)/2 + |S|/d data-points. If not, it is more efficient to store the individual data-points, and we refer to them as "singlets". The total memory cost of summarizing the data is then

M_C = \frac{d^2 + 3d}{2}\, |N_c > 1| + |S|\, |N_c > 1| + d\, |N_c = 1|    (26)

where |N_c > 1| is the number of clumps with more than one data-item in them, and |N_c = 1| is the number of singlets. The clump compression procedure is summarized in Algorithm 3.2.

Clump Compression (Algorithm 3.2)
Input: Output from the model building phase {q_{ck}, \bar\phi_l^c, N_cs, D_s}, current epoch of E documents, and memory bound M.
Hard partition clumps: r_c = arg max_k q_ck
while M_C < M (eq. 26) do
  for i = 1 to K do
    Split the i-th cluster along its principal component
    Iterate the data-magnified restricted updates until convergence
    Hard partition clumps into child clusters
    Evaluate the change in eq. (30) resulting from the split
  end for
  Select the split that yielded the largest change in eq. (30)
  K = K + 1
end while

Document merging provides another way of controlling the memory cost, by reducing the number of image groups |S|. We use the following simple heuristic to rank the suitability of merging document groups s and s':

DM_{s,s'} = \frac{ \sum_k E[\pi_{sk}]\, E[\pi_{s'k}] }{ \|E[\pi_s]\|\, \|E[\pi_{s'}]\| }    (27)

Clumping and document merging enable a number of potential schemes for controlling space and time costs, depending on the application. We note that the time complexity per variational iteration scales as O(K(|N_c > 1| + |N_c = 1|) + |S|K), and the space required to store the q(z_c) distributions is O(K(|N_c > 1| + |N_c = 1|)).
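The memory cost of eq. (26) and the merge heuristic of eq. (27) translate directly into code; a minimal sketch (names illustrative):

import numpy as np

def clump_memory_cost(clump_sizes, d, num_doc_groups):
    """Eq. (26): storage cost, in stored values, of a proposed clumping.
    Full-covariance Gaussian sufficient statistics need (d^2 + 3d)/2 values per
    clump plus |S| counts N_cs; singlets store the raw d-dimensional point."""
    sizes = np.asarray(clump_sizes)
    n_clumps = int((sizes > 1).sum())
    n_singlets = int((sizes == 1).sum())
    return (d * d + 3 * d) / 2 * n_clumps + num_doc_groups * n_clumps + d * n_singlets

def doc_group_similarity(E_pi_s, E_pi_t):
    """Eq. (27): cosine similarity between expected topic proportions of two
    document groups, used to rank candidate document merges."""
    return E_pi_s @ E_pi_t / (np.linalg.norm(E_pi_s) * np.linalg.norm(E_pi_t))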
4. Experiments

We test our approach with two machine vision experiments. The first is an image segmentation task, and the second is an object recognition and retrieval task.

4.1. Joint Image Segmentation

Our first experiment is a joint image segmentation problem. The dataset is the Faces-Easy category of the Caltech 101 image dataset (Fei-Fei et al., 2004), consisting of 435 images. Each image contains a face centered in the image, but the lighting conditions and background vary. In terms of the vocabulary of the preceding sections, each image is a document and each pixel in the image is a word. Pixels are represented as five dimensional vectors of the following features: X and Y position relative to the center of the image, and three color coordinates in the CIELAB colorspace. The goal of our experiment is to find similar image regions across the multiple images in an unsupervised way. We emphasize that our main objective is to study the efficiency of our algorithm, not to produce a state of the art image segmentation algorithm.

The images were scaled to 200 by 160 pixels in size. Thus, the total size of the dataset is 32,000 pixels per image, times 435 images, times 5 features per pixel, which equals 69,600,000 real numbers. Each pixel also requires an assignment distribution. Our baseline implementation (i.e. a batch algorithm that processes all images in memory at once and does not use pixel clumping or image merging) was only able to jointly segment 30 images simultaneously before running out of memory. The majority of memory is used to store the assignment distributions of pixels, and this is problematic as the number of topics increases during learning, since the space requirements scale as O(NK), where N is the total number of pixels and K is the number of topics.
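A sketch of the pixel "words" used in this experiment, i.e. a five dimensional feature per pixel (position relative to the image center plus CIELAB color); the exact coordinate scaling is an assumption, since it is not specified above:

import numpy as np
from skimage.color import rgb2lab

def pixel_words(rgb_image):
    """Turn one image into its 'words': (x, y) relative to the image center
    followed by the three CIELAB color coordinates, one row per pixel."""
    h, w, _ = rgb_image.shape
    lab = rgb2lab(rgb_image).reshape(-1, 3)
    ys, xs = np.mgrid[0:h, 0:w]
    xy = np.column_stack([xs.ravel() - w / 2.0, ys.ravel() - h / 2.0])
    return np.hstack([xy, lab])          # shape (h*w, 5)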
We first compare the memory bounded approach to the baseline implementation on a joint segmentation task of 30 images, in order to judge the impact of the pixel clumping approximation. We vary the upper limit on the number of clumps used to summarize the data during the compression phase, and compare the free energy bounds produced by the memory bounded algorithm to those produced by the baseline implementation. We define the free energy ratio as 1 - |F_batch - F_mb| / |F_batch|, where F_batch and F_mb are the converged free energies of the batch and memory bounded algorithms. This process was repeated for different subsets of 30 images from the dataset. In the memory bounded approach, images were processed in epochs of five images at a time. Figure 1 summarizes the results.

Figure 1. Image Segmentation experiment. Left: Free energy ratio as a function of the number of clumps permitted by the memory bound. Right: Free energy ratio versus the number of image groups relative to the total number of images processed.

We find that performance tends to saturate beyond a certain number of clumps. We also note a significant run time advantage of the memory bounded algorithm over the batch method: the average run time of the batch method was 3.09 hours, versus 0.68 hours for the memory bounded approach.

Next we study the impact of image (document) merges on the relative performance of the memory bounded algorithm versus the baseline batch algorithm, while varying the maximum number of image (document) groups permitted. The results are shown in figure 1. We find little qualitative difference between segmentations produced by the baseline and memory bounded algorithms. The possible exception is the case when the memory bounded algorithm is run with a large number of image merges, in which case the algorithm seemed to discover fewer topics than the batch algorithm and the memory bounded algorithm with only word clumping. Example image segmentations and clump distributions are shown in figure 2.

Figure 2. Top row, from left to right: an example segmentation produced by the baseline method, by the memory bounded algorithm with image groups amounting to 30% of the total images and 125 clumps, and by the memory bounded algorithm with no images merged and 125 clumps. Row 2: Example clump distributions; pixels of the same color are summarized in a single clump. Row 3: Segmentations corresponding to the clumps in row 2.

Finally, we demonstrate the memory bounded algorithm on the full dataset of 435 images, which is more than an order of magnitude larger than can be handled with the baseline algorithm. We process images in epochs of 10 images at a time, for a total of 44 learning rounds. The upper limit on the number of clumps was set to 1000, which was likely many more than required since there were only 85 inferred topics. Because the number of documents was relatively small, we chose not to use document merges. The total run time of the algorithm was 15 hours. Figure 3 shows the number of topics as a function of the number of images processed, and the run time required during each learning round. The run time is longer during learning rounds in which more new topics are discovered, because more split-merge operations are necessary.

Figure 3. Joint segmentation of 435 faces. The left plot shows the number of topics recovered as the system processes images. The right plot shows the run time (in minutes) for each learning round. This fluctuates with the number of new topics discovered during each round and tends to increase gradually with the total number of topics.

The memory required for the memory bounded algorithm was 22 MB to store the current image epoch and clumps, less than 1 MB for the current model estimate, and 235 MB for assignment distributions, for a total of 257 MB. In contrast, the baseline batch implementation would have required 531 MB to store all 435 images, 8.8155 GB to store assignment distributions for each pixel assuming 85 topics, and less than 1 MB for the model, for a total of 9.3 GB. (All memory amounts assume double precision floating point.) The memory bounded implementation therefore achieved a memory savings factor of about 38 with very little loss in accuracy.
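The batch memory figures quoted above can be checked with a few lines of arithmetic (double precision, 8 bytes per value; 1 MB = 2^20 bytes, 1 GB = 2^30 bytes):

images, pixels, features, topics = 435, 200 * 160, 5, 85
data_bytes = images * pixels * features * 8        # raw pixel features
assign_bytes = images * pixels * topics * 8        # per-pixel assignment distributions
print(f"raw data:    {data_bytes / 2**20:.0f} MB")   # ~531 MB
print(f"assignments: {assign_bytes / 2**30:.2f} GB") # ~8.82 GB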
Figure 4 shows example joint segmentations produced by the memory bounded algorithm. These images were retrieved by first computing responsibilities for every image in the dataset with respect to the final model estimate produced by the memory bounded algorithm. Then the images were sorted according to those that have the most pixels assigned to the largest topic. The largest topic indeed corresponds to a face, and is represented by the olive green segment in the figure. Other topics shared across images include hair and certain backgrounds.

Figure 4. Examples of joint segmentation produced after processing all Caltech Face images. Pixels that are the same color have highest responsibility to the same topic. These images were retrieved by sorting images according to those that have the most pixels assigned to the largest topic, which is the olive green colored face segment in each image.

4.2. Object Recognition and Retrieval

Our object recognition and retrieval experiment involves all 101 object categories in the Caltech 101 dataset. We randomly select 3000 training images and 1000 test images. We extract 128-dimensional SIFT (Lowe, 2004) local appearance descriptors from 500 randomly chosen locations in each image. The scale of each feature is also chosen randomly. In the language of topic models, each feature descriptor is a word, and the collection of feature descriptors in an image forms a document. This image representation is known as "bag-of-features", because images are modeled as unordered collections of feature descriptors whose geometric positions are ignored. This dataset proved too large to compare directly to the batch algorithm.

We train a single topic model on all training images, using epochs of 60 images at a time. Because hundreds of topics are discovered, we use diagonal covariance Gaussians and adjust equation (26) accordingly. Given a test image \tilde{x}, retrieval is performed by ranking each training image's similarity to the test image. To develop the similarity measure we begin with \sum_i \log p(\tilde{x}_{ij}|x), which is the log-probability that the detections in the test image were generated by training image j, given the training set. We then variationally lower bound this quantity to obtain a test free energy, and drop all constant terms not involving the test image and index j. Finally, we lower bound this quantity again by assuming that detections in the test image are hard assigned to the topic with highest responsibility (this leads to an expression that is much faster to evaluate, with negligible impact on retrieval performance). The retrieval score is:

\mathrm{score}(j) = \sum_i \max_k \Big[ \sum_l E[\beta_{kl}|\mu_{kl}]\, \phi_l(\tilde{x}_{ij}) - E[A_k(\beta_k)|\mu_{k0}] + \psi(\gamma_{kj}) - \psi\big( \textstyle\sum_k \gamma_{kj} \big) \Big]    (28)

Here the expectations are with respect to q(\beta) learned during training, and \mu_{kl} and \gamma_{kj} are from training as well. The \gamma_{kj} are re-estimated for images that were merged into a document group during training.

Figure 5. Object Recognition and Retrieval. Left: Training set free energy as a function of the memory bound. Right: 1-NN classification accuracy as a function of the memory bound (measured as the equivalent number of data-points that could be stored in the same space).
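A sketch of the hard-assignment score in eq. (28) for one training image, with placeholder arrays standing in for the trained expectations (illustrative names):

import numpy as np
from scipy.special import digamma

def retrieval_score(test_feats, exp_beta, exp_A, gamma_j):
    """Eq. (28): score of training image j for a test image. exp_beta[k, l] and
    exp_A[k] stand in for E[beta_kl | mu_kl] and E[A_k(beta_k) | mu_k0];
    test_feats holds phi_l(x~_ij) for each detection i."""
    # per-detection, per-topic terms inside the max of eq. (28)
    terms = (test_feats @ exp_beta.T            # sum_l E[beta_kl] phi_l(x~_ij)
             - exp_A[None, :]                   # - E[A_k(beta_k)]
             + digamma(gamma_j)[None, :]        # + psi(gamma_kj)
             - digamma(gamma_j.sum()))          # - psi(sum_k gamma_kj)
    return terms.max(axis=1).sum()              # hard-assign each detection, then sum over detections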
We compute nearest neighbor (1-NN) classification accuracy by classifying the test image to the class label of the highest scoring image in the training set. Figure 5 shows the training set free energy and 1-NN classification accuracy as a function of the memory bound M (measured as the equivalent number of data-points that could be stored in the same space). Because we used diagonal covariance matrices, there were enough clumps even at low levels of memory to maintain comparable classification performance. We note that the training free energy increases with memory as expected, and that the 1-NN accuracy tends to saturate as memory increases.

Figure 6 shows the 1-NN accuracy and training free energy when the percentage of document groups relative to the number of total images processed is varied (the memory bound M is held fixed at 10000). We note that the classification performance suffers substantially when only small numbers of document groups are permitted. We use a heuristic for determining which documents to merge (eq. 27); it is possible that a better motivated criterion (perhaps derived from the free energy) would give better performance.

Figure 6. Object Recognition and Retrieval. Left: Training set free energy versus the ratio of document groups to the total number of images processed. Right: 1-NN classification accuracy versus the ratio of document groups to the total number of images processed.

5. Conclusion

Machine learning has largely focused on algorithms that run for a relatively short period of time, fitting models of finite capacity to a dataset of fixed size. We believe that this scenario is unrealistic if we aim at building truly intelligent systems. We have identified nonparametric Bayesian models as promising candidates that expand their model complexity in response to new incoming data. The flip-side is that nonparametric Bayesian algorithms are "example-based" and as such require one to cache and repeatedly process every data-case ever seen. The objectives of infinite, adaptive model capacity on the one hand and efficiency, in both time and space, on the other therefore seem to be fundamentally at odds with each other.

In this paper we have made a first step towards resolving this issue by introducing a class of models that adapt their model complexity to incoming data, but are able to do so at a fraction of the memory requirements and processing times necessary for their batch counterparts. There is no magic of course: with a fixed memory budget there is a limit to how complex the model can be, but we have shown that one can learn much larger models reliably with much less memory than a naive implementation would allow. Moreover, our learning algorithm allows a flexible trade-off between memory requirements and model complexity that can be adjusted online.

Intuitively, our method may be thought of as a two level clustering process. At the bottom level, data is clustered into clumps in order to limit time and space costs. At the top level, clumps are clustered to form topics in order to ensure good generalization performance.

Potential application areas of the techniques introduced here are manifold.
For instance, we can imagine learning topic models from very large text corpora or the world wide web to understand its structure and facilitate fast searching algorithms. Another exciting direction is to build a taxonomy of visual object categories from a continuous stream of video data captured by surveillance cameras.

5.1. Acknowledgements

We thank the anonymous reviewers for their helpful comments. This material is based on work supported by the National Science Foundation under grant numbers 0447903 and 0535278, the Office of Naval Research under grant numbers 00014-06-1-0734 and 00014-06-1-0795, and the National Institutes of Health Predoctoral Training in Integrative Neuroscience grant number T32 GM007737.

5.2. Appendix

The following expressions for the free energy are used in the main text. Note that they are only valid after the updates for \mu and \gamma have been performed. Here \alpha = \sum_k \alpha_k and N_s = \sum_k N_{ks}.

F = K B(\lambda) - \sum_k B_k(F_k + \lambda)
  + \sum_{ks} D_s \log \frac{\Gamma(\alpha_k)}{\Gamma(\alpha_k + N_{ks}/D_s)}
  - \sum_s D_s \log \frac{\Gamma(\alpha)}{\Gamma(\alpha + N_s/D_s)}
  + \sum_{ck} N_c q_{ck} \log q_{ck}
  - \sum_k \big( (a-1) \log \alpha_k - b \alpha_k \big)
  - (K_{max} - K) \Big( (a-1) \log \frac{a-1}{b} - (a-1) \Big)
  - K_{max} \big( a \log b - \log \Gamma(a) \big)    (29)

F = K B(\lambda) - \sum_k B_k\Big( \frac{T_{pts}}{\sum_c N_c} F_k + \lambda \Big)
  + \frac{T_{docs}}{\sum_s D_s} \sum_{ks} D_s \log \frac{\Gamma(\alpha_k)}{\Gamma(\alpha_k + N_{ks}/D_s)}
  - \frac{T_{docs}}{\sum_s D_s} \sum_s D_s \log \frac{\Gamma(\alpha)}{\Gamma(\alpha + N_s/D_s)}
  + \frac{T_{pts}}{\sum_c N_c} \sum_{ck} N_c q_{ck} \log q_{ck}
  - \sum_k \big( (a-1) \log \alpha_k - b \alpha_k \big)
  - (K_{max} - K) \Big( (a-1) \log \frac{a-1}{b} - (a-1) \Big)
  - K_{max} \big( a \log b - \log \Gamma(a) \big)    (30)

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. IEEE CVPR Workshop on Generative Model Based Vision (WGMBV).

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209-230.

Kurihara, K., Welling, M., & Vlassis, N. (2006). Accelerated variational Dirichlet process mixtures. NIPS.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 91-110.

Minka, T. (2000). Estimating a Dirichlet distribution (Technical report).

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. To appear in Journal of the American Statistical Association.

Teh, Y. W., Kurihara, K., & Welling, M. (2008). Collapsed variational inference for HDP. Advances in Neural Information Processing Systems.

Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. (1999). SMEM algorithm for mixture models.

Verbeek, J., Nunnink, J., & Vlassis, N. (2003). Accelerated variants of the EM algorithm for Gaussian mixtures (Technical report). University of Amsterdam.