The Infinite Gamma-Poisson Feature Model

Michalis K. Titsias
School of Computer Science, University of Manchester, UK
mtitsias@cs.man.ac.uk

Abstract

We address the problem of factorial learning, which associates a set of latent causes or features with the observed data. Factorial models usually assume that each feature has a single occurrence in a given data point. However, there are data such as images where latent features have multiple occurrences, e.g. a visual object class can have multiple instances shown in the same image. To deal with such cases, we present a probability model over non-negative integer valued matrices with possibly an infinite number of columns. This model can play the role of the prior in a nonparametric Bayesian learning scenario where both the latent features and the number of their occurrences are unknown. We use this prior together with a likelihood model for unsupervised learning from images using a Markov chain Monte Carlo inference algorithm.

1 Introduction

Unsupervised learning using mixture models assumes that one latent cause is associated with each data point. This assumption can be quite restrictive, and a useful generalization is to consider factorial representations which assume that multiple causes have generated the data [13, 5]. Factorial models are widely used in modern unsupervised learning algorithms. Examples include the multiple topics models (e.g. [1, 2]) for documents and the multiple objects models (e.g. [14, 7]) for image data. Algorithms for learning factorial models should deal with the problem of specifying the size of the representation. Bayesian learning, and especially nonparametric methods such as the Indian buffet process [3], can be very useful for solving this problem.

Factorial models usually assume that each feature occurs once in a given data point. This is insufficient for modelling the precise generation mechanism of data such as images. An image can contain views of multiple object classes, such as cars and humans, and each class may have multiple occurrences in the image. When we recognize images it is useful to predict not only the presence or absence of object classes but also the number of occurrences of each class. To deal with features having multiple occurrences, we introduce a probability distribution over sparse non-negative integer valued matrices with a possibly unbounded number of columns. Each matrix row corresponds to a data point and each column to a feature, similarly to the binary matrix used in the Indian buffet process [3]. Each element of the matrix can be zero or a positive integer and expresses the number of times a feature occurs in a specific data point. This model is derived by taking the infinite limit of a finite Gamma-Poisson distribution, similarly to how the infinite mixture model is derived [10, 12].

The infinite Gamma-Poisson model can play the role of the prior in a nonparametric Bayesian learning scenario where both the latent features and the number of their occurrences are unknown. Given this prior, we consider a likelihood model which is suitable for explaining the visual appearance and location of local image patches. Introducing a prior for the parameters of this likelihood model, we apply Bayesian learning using a Markov chain Monte Carlo inference algorithm and show results on some image data.
2 The finite Gamma-Poisson model

Let $X = \{X_1, \ldots, X_N\}$ be some data where each data point $X_n$ is a set of attributes. In section 4 we specify $X_n$ to be a collection of local image patches. We assume that each data point is associated with a set of latent features and that each feature can have multiple occurrences. Let $z_{nk}$ denote the number of times feature $k$ occurs in the data point $X_n$. Given $K$ features, $Z = \{z_{nk}\}$ is an $N \times K$ non-negative integer valued matrix that collects together all the $z_{nk}$ values so that each row corresponds to a data point and each column to a feature. Given that $z_{nk}$ is drawn from a Poisson with a feature-specific parameter $\lambda_k$, $Z$ follows the distribution

$P(Z \mid \{\lambda_k\}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \frac{\lambda_k^{z_{nk}} \exp\{-\lambda_k\}}{z_{nk}!} = \prod_{k=1}^{K} \frac{\lambda_k^{m_k} \exp\{-N \lambda_k\}}{\prod_{n=1}^{N} z_{nk}!}$,   (1)

where $m_k = \sum_{n=1}^{N} z_{nk}$. We further assume that each $\lambda_k$ parameter follows a Gamma distribution that favors sparsity (in a sense that will be explained shortly):

$G(\lambda_k; \tfrac{\alpha}{K}, 1) = \frac{\lambda_k^{\frac{\alpha}{K} - 1} \exp\{-\lambda_k\}}{\Gamma(\frac{\alpha}{K})}$.   (2)

The hyperparameter $\alpha$ itself is given a vague Gamma prior $G(\alpha; \alpha_0, \beta_0)$. Using the above equations we can easily integrate out the parameters $\{\lambda_k\}$ as follows:

$P(Z \mid \alpha) = \prod_{k=1}^{K} \frac{\Gamma(m_k + \frac{\alpha}{K})}{\Gamma(\frac{\alpha}{K})\, (N+1)^{m_k + \frac{\alpha}{K}}\, \prod_{n=1}^{N} z_{nk}!}$,   (3)

which shows that, given the hyperparameter $\alpha$, the columns of $Z$ are independent.

The above distribution favors sparsity as $K$ increases. This can be shown by taking the expectation of the sum of all the elements of $Z$. Since the columns are independent, this expectation is $K \sum_{n=1}^{N} E(z_{nk})$, and $E(z_{nk})$ is given by

$E(z_{nk}) = \sum_{z_{nk}=0}^{\infty} z_{nk}\, NB(z_{nk}; \tfrac{\alpha}{K}, \tfrac{1}{2}) = \frac{\alpha}{K}$,   (4)

where $NB(z_{nk}; r, p)$, with $r > 0$ and $0 < p < 1$, denotes the negative binomial distribution over non-negative integers,

$NB(z_{nk}; r, p) = \frac{\Gamma(r + z_{nk})}{z_{nk}!\, \Gamma(r)}\, p^{r} (1-p)^{z_{nk}}$,   (5)

which has mean equal to $\frac{r(1-p)}{p}$. Using Equation (4), the expectation of the sum of the $z_{nk}$s is $N\alpha$ and is independent of the number of features. As $K$ increases $Z$ becomes sparser, and $\alpha$ controls the sparsity of this matrix.
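The finite model is easy to simulate, which also makes the sparsity argument above concrete. The following Python sketch is our own illustration, not part of the paper, and all function and variable names are ours: it draws $Z$ from Equations (1)-(2) and checks empirically that the expected total count stays near $N\alpha$ as $K$ grows.

```python
import numpy as np

def sample_finite_gamma_poisson(N, K, alpha, seed=None):
    """Draw an N x K non-negative integer matrix Z from the finite model of Section 2."""
    rng = np.random.default_rng(seed)
    lam = rng.gamma(shape=alpha / K, scale=1.0, size=K)   # lambda_k ~ G(alpha/K, 1)
    return rng.poisson(lam, size=(N, K))                  # z_nk ~ Poisson(lambda_k)

if __name__ == "__main__":
    N, alpha = 50, 3.0
    for K in (10, 100, 1000):
        totals = [sample_finite_gamma_poisson(N, K, alpha, seed=s).sum()
                  for s in range(500)]
        # E[sum_nk z_nk] = N * alpha regardless of K; a larger K only spreads the
        # same total mass over more (and hence sparser) columns.
        print(K, np.mean(totals), "expected", N * alpha)
```

Running this shows the average total count hovering around $N\alpha = 150$ for every $K$, while the fraction of non-zero entries shrinks as $K$ grows.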
3 The infinite limit

We are interested in considering the infinite version of the above distribution with an unbounded number of features. The MCMC inference algorithm of section 5 requires conditionals of the form $P(z_{nk} \mid Z_{-(nk)}, \alpha)$, where $Z_{-(nk)} = Z \setminus z_{nk}$. Next we compute such conditionals by taking the infinite limit of the corresponding conditionals for the finite model. This approach of expressing infinite models was introduced by Neal [10] for the infinite mixture model.

We start from the finite case where $Z$ has $K$ columns. Since the columns of $Z$ are produced independently, it holds that $P(z_{nk} \mid Z_{-(nk)}, \alpha) = P(z_{nk} \mid Z_{-n,k}, \alpha)$, where $Z_{-n,k} = Z_k \setminus z_{nk}$ and $Z_k$ denotes column $k$ of $Z$. We have

$P(z_{nk} \mid Z_{-n,k}, \alpha) = \int_0^{\infty} P(z_{nk} \mid \lambda_k)\, p(\lambda_k \mid Z_{-n,k})\, d\lambda_k = \frac{\Gamma(m_{-n,k} + \frac{\alpha}{K} + z_{nk})}{z_{nk}!\, \Gamma(m_{-n,k} + \frac{\alpha}{K})} \left(\frac{N}{N+1}\right)^{m_{-n,k} + \frac{\alpha}{K}} \left(\frac{1}{N+1}\right)^{z_{nk}}$,   (6)

where $m_{-n,k} = \sum_{n' \neq n} z_{n'k}$. Note that the above distribution is the negative binomial $NB(z_{nk}; m_{-n,k} + \frac{\alpha}{K}, \frac{N}{N+1})$. We assume that the first $K_+$ features are represented, i.e. $m_k > 0$ for $k \leq K_+$, while the remaining $K - K_+$ features are unrepresented, i.e. $m_k = 0$ for $k > K_+$. To express the limit $K \rightarrow \infty$ we recognize two cases. When $m_{-n,k} > 0$, the limit of Equation (6) gives the negative binomial $NB(z_{nk}; m_{-n,k}, \frac{N}{N+1})$, which has mean $\frac{m_{-n,k}}{N}$. When $m_{-n,k} = 0$, the limit is zero. This holds for all unrepresented features $k > K_+$ and also for all unique features that occur only in the data point $n$ (i.e. $m_{-n,k} = 0$ but $z_{nk} > 0$). For all such $z_{nk}$s we express the conditional distribution of their sum $g_n$. Note that these $z_{nk}$s are independent since they belong to different columns of $Z$, and Equation (6) says that $z_{nk} \mid Z_{-n,k} \sim NB(z_{nk}; \frac{\alpha}{K}, \frac{N}{N+1})$. The sum of independent negative binomially distributed random variables with the same parameter $p$ (here $p = \frac{N}{N+1}$) and possibly different $r$ values (here $r = \frac{\alpha}{K}$) follows a negative binomial with the same $p$ and the $r$ values added [6]. Using this property and taking the infinite limit, we find that $g_n \sim NB(g_n; \alpha, \frac{N}{N+1})$. This distribution gives the prior number of total occurrences of features that are unique to the data point $n$, and its mean value is equal to $\frac{\alpha}{N}$. Interestingly, $NB(g_n; \alpha, \frac{N}{N+1})$ plays a similar role to that of $\mathrm{Poisson}(\alpha/N)$ used in the Indian buffet process [3].

Finally, we express the infinite conditional $p(\alpha \mid Z)$, since it is required when we update $\alpha$ using the MCMC algorithm of section 5. Rewriting Equation (3) using the recursive property of the Gamma function, we obtain

$p(\alpha \mid Z) \propto G(\alpha; \alpha_0, \beta_0)\, \frac{\alpha^{K_+} \prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} (j + \frac{\alpha}{K})}{(N+1)^{\alpha}}$.

Taking the infinite limit we have

$p(\alpha \mid Z) \propto G(\alpha; \alpha_0, \beta_0)\, \frac{\alpha^{K_+}}{(N+1)^{\alpha}}$,   (7)

which shows that $\alpha$ depends only on the number of represented features, the number of training data and the hyperparameters $\alpha_0$ and $\beta_0$.
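For concreteness, the two prior conditionals that the sampler of Section 5 will need can be written down directly. The helper below is a sketch under our own naming (not the paper's code); it uses scipy's negative binomial, whose parameterization matches Equation (5) with `n = r`.

```python
import numpy as np
from scipy.stats import nbinom

def cond_znk_pmf(z, m_minus, N):
    """P(z_nk = z | Z_{-n,k}) in the infinite limit, for a feature with
    m_minus = m_{-n,k} occurrences in the other rows."""
    if m_minus == 0:               # unrepresented elsewhere: point mass at zero
        return 1.0 if z == 0 else 0.0
    return nbinom.pmf(z, m_minus, N / (N + 1.0))   # NB(z; m_{-n,k}, N/(N+1))

def sample_gn(alpha, N, seed=None):
    """Prior draw of g_n, the total count of occurrences unique to one data point."""
    return int(nbinom.rvs(alpha, N / (N + 1.0), random_state=seed))

# With N = 10, a feature seen m_{-n,k} = 5 times in the other images has prior
# mean 5/10 = 0.5 occurrences here, and E[g_n] = alpha/N, mirroring the
# Poisson(alpha/N) term of the Indian buffet process.
```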
4 A likelihood model for images

An image can contain multiple objects of different classes. Also, each object class can have more than one occurrence, i.e. multiple instances of the class may appear simultaneously in the image. Unsupervised learning should deal with the unknown number of object classes in the images and also with the unknown number of occurrences of each class in each image separately. If object classes are the latent features, what we wish to infer is the underlying feature occurrence matrix $Z$. Next we consider an observation model that is a combination of latent Dirichlet allocation [1] and Gaussian mixture models. Such a combination has been used before [14]. An important characteristic of our likelihood model is that it is defined conditioned on the feature occurrences $Z$, and thus the infinite Gamma-Poisson prior can be used for inferring the number of object classes and their occurrences.

Each image $n$ is represented by $d_n$ local patches that are detected in the image, so that $X_n = (Y_n, W_n) = \{(y_{ni}, w_{ni}),\ i = 1, \ldots, d_n\}$. $y_{ni}$ is the two-dimensional location of patch $i$ and $w_{ni}$ is an indicator vector (i.e. it is binary and satisfies $\sum_{\ell=1}^{L} w_{ni\ell} = 1$) that points into a set of $L$ possible visual appearances. $X$, $Y$ and $W$ denote all the data, the locations and the appearances, respectively.

A local patch in image $n$ belongs to a certain occurrence of a feature. We use the double index $kj$ to denote the $j$-th occurrence of feature $k$, where $j = 1, \ldots, z_{nk}$ and $k \in \{k : z_{nk} > 0\}$. We introduce the assignment indicator variable $s_{ni} = \{s_{ni}^{kj}\}$, which can take $M_n = \sum_{k=1}^{K} z_{nk}$ values and indicates the feature occurrence of patch $i$. $S_n$ denotes all the assignments for the image $n$ and $S$ the total set of these variables. To generate $S$ we first choose image-specific multinomial parameters $\theta_n = \{\theta_{nkj}\}$ from a symmetric Dirichlet $D(\frac{\gamma}{M_n}, \ldots, \frac{\gamma}{M_n})$ and then draw from $P(S \mid \{\theta_n\}, Z) = \prod_{n=1}^{N} P(S_n \mid \theta_n, Z_n)$, where

$P(S_n \mid \theta_n, Z_n) = \prod_{k: z_{nk}>0} \prod_{j=1}^{z_{nk}} (\theta_{nkj})^{\rho_{nkj}}$,   (8)

$\rho_{nkj} = \sum_{i=1}^{d_n} s_{ni}^{kj}$ and $Z_n$ is the row $n$ of $Z$. For the appearances $W$, we assume that each feature $k$ has a separate multinomial distribution with parameters $\pi_k = (\pi_{k1}, \ldots, \pi_{kL})$ that describes the appearance of the local patches of that feature. Each $\pi_k$ is drawn from the same symmetric Dirichlet $D(\beta, \ldots, \beta)$ and the distribution of $W$ is

$P(W \mid \{\pi_k\}, S, Z) = \prod_{k=1}^{K} \prod_{\ell=1}^{L} (\pi_{k\ell})^{c_{k\ell}}$,   (9)

where $c_{k\ell} = \sum_{n=1}^{N} \sum_{i=1}^{d_n} \sum_{j=1}^{z_{nk}} s_{ni}^{kj} w_{ni\ell}$.

To express a suitable distribution for the location $y_{ni}$, we assume that each occurrence of a feature forms a Gaussian cluster of image patch locations. Thus $y_{ni}$ follows an image-specific Gaussian mixture with $M_n$ components. We assume that the component $kj$ has mean $m_{nkj}$ and covariance $\Sigma_{nkj}$; $m_{nkj}$ describes the object location and $\Sigma_{nkj}$ the object shape. Given that any object can be anywhere in the image and can have arbitrary scale and orientation, $(m_{nkj}, \Sigma_{nkj})$ should be drawn from a quite vague prior. We consider a conjugate Normal-Wishart prior common to all features and images, so that $m_{nkj} \sim \mathcal{N}(m_{nkj} \mid \mu, \Sigma_{nkj})$ and $\Sigma_{nkj}^{-1} \sim W(\Sigma_{nkj}^{-1} \mid v, V)$. Finally, $Y$ is produced by $P(Y \mid \{m_n, \Sigma_n\}, S, Z) = \prod_{n=1}^{N} p(Y_n \mid m_n, \Sigma_n, S_n, Z_n)$, where

$p(Y_n \mid m_n, \Sigma_n, S_n, Z_n) = \prod_{i=1}^{d_n} \prod_{k: z_{nk}>0} \prod_{j=1}^{z_{nk}} \left[\mathcal{N}(y_{ni} \mid m_{nkj}, \Sigma_{nkj})\right]^{s_{ni}^{kj}}$   (10)

and $m_n$ and $\Sigma_n$ collect all the means and covariances of the clusters in the image $n$. From the joint distribution of all variables we marginalize out the parameters $\{\theta_n\}$, $\{\pi_k\}$, $\{m_n\}$ and $\{\Sigma_n\}$ and obtain $p(X, S \mid Z) = P(W \mid S, Z) \prod_{n=1}^{N} p(Y_n \mid S_n, Z_n) P(S_n \mid Z_n)$, where

$P(W \mid S, Z) = \prod_{k=1}^{K} \frac{\Gamma(L\beta)}{\Gamma(c_k + L\beta)} \prod_{\ell=1}^{L} \frac{\Gamma(c_{k\ell} + \beta)}{\Gamma(\beta)}$,   (11)

$p(Y_n \mid S_n, Z_n) = \prod_{k: z_{nk}>0} \prod_{j=1}^{z_{nk}} \frac{\pi^{-\rho_{nkj}}\, |V_{nkj}|^{\frac{v_{nkj}}{2}}\, \Gamma(\frac{v_{nkj}}{2})\, \Gamma(\frac{v_{nkj}-1}{2})}{(\rho_{nkj} + 1)\, |V|^{\frac{v}{2}}\, \Gamma(\frac{v}{2})\, \Gamma(\frac{v-1}{2})}$,   (12)

$P(S_n \mid Z_n) = \frac{\Gamma(\gamma)}{\Gamma(d_n + \gamma)} \prod_{k: z_{nk}>0} \prod_{j=1}^{z_{nk}} \frac{\Gamma(\rho_{nkj} + \frac{\gamma}{M_n})}{\Gamma(\frac{\gamma}{M_n})}$,   (13)

where $c_k = \sum_{\ell=1}^{L} c_{k\ell}$, $v_{nkj} = v + \rho_{nkj}$, $V_{nkj}^{-1} = V^{-1} + \rho_{nkj} \hat{\Sigma}_{nkj} + \frac{\rho_{nkj}}{\rho_{nkj}+1} (\bar{y}_{nkj} - \mu)(\bar{y}_{nkj} - \mu)^{T}$, and $\bar{y}_{nkj}$ and $\hat{\Sigma}_{nkj}$ are the sample mean and covariance matrix of the locations assigned to the $kj$ cluster in the image $n$. The hyperparameters $(\gamma, \beta, \mu, v, V)$ take fixed values that give vague priors. When we consider an infinite number of features we simply have to replace $K$ with the number of represented features $K_+$ in all the above expressions. Marginalizing out the assignments $S$ is generally intractable, and the MCMC algorithm discussed next produces samples from the posterior $P(S, Z \mid X)$.
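The collapsed terms above are simple enough to evaluate directly. The sketch below is our own code with our own function names, written for two-dimensional locations: it computes the log of one appearance factor of Equation (11) and the log of one location factor of Equation (12), using the Normal-Wishart posterior quantities defined after Equation (13). It should be read as an illustration consistent with our reconstruction of those equations, not as the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln

def log_appearance_term(counts_kl, beta):
    """Log of one factor of Eq. (11); counts_kl[l] = number of patches of
    feature k that have appearance l (a length-L integer vector)."""
    counts_kl = np.asarray(counts_kl, dtype=float)
    L, c_k = counts_kl.size, counts_kl.sum()
    return (gammaln(L * beta) - gammaln(c_k + L * beta)
            + np.sum(gammaln(counts_kl + beta) - gammaln(beta)))

def log_location_term(Y, mu, v, V):
    """Log of one factor of Eq. (12): marginal likelihood of the 2-D locations Y
    (one row per patch) of a single feature occurrence, Normal-Wishart prior."""
    rho = Y.shape[0]
    ybar = Y.mean(axis=0)
    S = (Y - ybar).T @ (Y - ybar)                       # rho * sample covariance
    Vn_inv = (np.linalg.inv(V) + S
              + rho / (rho + 1.0) * np.outer(ybar - mu, ybar - mu))
    vn = v + rho
    return (-rho * np.log(np.pi) - np.log(rho + 1.0)
            - 0.5 * vn * np.log(np.linalg.det(Vn_inv))   # + (vn/2) log|V_nkj|
            - 0.5 * v * np.log(np.linalg.det(V))
            + gammaln(0.5 * vn) + gammaln(0.5 * (vn - 1))
            - gammaln(0.5 * v) - gammaln(0.5 * (v - 1)))
```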
5 MCMC inference

Inference with our model involves expressing the posterior $P(S, Z \mid X)$ over the feature occurrences $Z$ and the assignments $S$. We rely on MCMC sampling from conditional posterior distributions. Section 5.1 explains how we sample the assignment variables $S_n$ given the other variables. Here we present the sampling algorithm for the $Z$ values.

Suppose we wish to express the posterior conditional for some $z_{nk}$. Simply writing $P(z_{nk} \mid S, Z_{-(nk)}, X)$ is not useful, since when $z_{nk}$ changes, the number of states the assignments $S_n$ can take also changes. This is clear since $z_{nk}$ is a structural variable that affects the number of components $M_n = \sum_{k=1}^{K_+} z_{nk}$ of the mixture model associated with image $n$ and its assignments $S_n$. On the other hand, the dimensionality of the assignments $S_{-n} = S \setminus S_n$ corresponding to the remaining images is not affected when $z_{nk}$ changes. To deal with this we marginalize out $S_n$ and compute the posterior conditional $P(z_{nk} \mid S_{-n}, Z_{-(nk)}, X)$. Using the conditional independencies induced by our model, we have

$P(z_{nk} \mid S_{-n}, Z_{-(nk)}, W, Y_n) \propto P(z_{nk} \mid Z_{-n,k}, \alpha) \sum_{S_n} P(W \mid S, Z)\, p(Y_n \mid S_n, Z_n)\, P(S_n \mid Z_n)$,   (14)

where $P(z_{nk} \mid Z_{-n,k}, \alpha)$ for the infinite case is computed as described in section 3, while computing the sum requires an approximation. This sum is a marginal likelihood, and we apply importance sampling using as importance distribution the posterior conditional $P(S_n \mid S_{-n}, Z, W, Y_n)$ [11]. In our implementation we consider a single sample drawn from the importance distribution, so that the sum is approximated by $P(W \mid S_n, S_{-n}, Z)\, p(Y_n \mid S_n, Z_n)$, where $S_n$ is a sample accepted after a burn-in period using the MCMC algorithm presented in section 5.1.

The MCMC algorithm processes the rows of $Z$ iteratively and updates their values. A single step can change a value of $Z$ by one. Initially $Z$ is such that $M_n = \sum_{k=1}^{K_+} z_{nk} \geq 1$ for any $n$, which means that at least one mixture component explains the data of each image. The proposal distribution for changing the $z_{nk}$s should ensure that this constraint is satisfied. Suppose we update the row $n$ of $Z$. For each $k$ with $m_{-n,k} > 0$ we make a random choice between attempting to increase or decrease $z_{nk}$ using the probabilities $p_{\mathrm{inc}}(i)$ and $p_{\mathrm{dec}}(i) = 1 - p_{\mathrm{inc}}(i)$, with $p_{\mathrm{inc}}(i) = 0.5$ for $i = 1, 2, \ldots$ and $p_{\mathrm{inc}}(0) = 1$. The proposal for the new $z^*_{nk}$ is

$Q(z^*_{nk} \mid Z_n) = p_{\mathrm{inc}}(z_{nk} 1[M_n > 1])^{1[z^*_{nk} = z_{nk}+1]}\; p_{\mathrm{dec}}(z_{nk} 1[M_n > 1])^{1[z^*_{nk} = z_{nk}-1]}$,

where $1[\cdot]$ is one when the condition inside the brackets is true and zero otherwise. Note that the above distribution always attempts to increase $z_{nk}$ when $M_n = 1$ or $z_{nk} = 0$. The acceptance probability in the Metropolis-Hastings step is $\min\left(1, A \frac{Q(z_{nk} \mid Z^*_n)}{Q(z^*_{nk} \mid Z_n)}\right)$, where

$A = \frac{P(W \mid S^*_n, S_{-n}, z^*_{nk}, Z_{-(nk)})\, p(Y_n \mid S^*_n, z^*_{nk}, Z_{n,-k})\, P(z^*_{nk} \mid Z_{-n,k}, \alpha)}{P(W \mid S_n, S_{-n}, z_{nk}, Z_{-(nk)})\, p(Y_n \mid S_n, z_{nk}, Z_{n,-k})\, P(z_{nk} \mid Z_{-n,k}, \alpha)}$   (15)

and $Z_{n,-k} = Z_n \setminus z_{nk}$. $S^*_n$ is found by applying the MCMC algorithm of section 5.1. If $z^*_{nk}$ is accepted, then $S^*_n$ is also accepted as the current set of assignments.

Finally, we draw a sample from the set of unique and unrepresented features. Let $U_n = \{z_{nk} : m_{-n,k} = 0,\ z_{nk} > 0\}$ be the set of unique feature values, $g_n$ their sum and $\mathcal{K}_n = \{k : z_{nk} \in U_n\}$ the associated indices. $U_n$ is a partition of the integer $g_n$, and we wish to propose a new partition $U^*_n$ so that a certain $z_{nk}$ may increase or decrease by one. First we propose a new $g^*_n$ using the negative binomial $NB(g^*_n \mid \alpha, \frac{N}{N+1})$ truncated around $g_n 1[M_n > 1]$. In other words, when $M_n > 1$ and $g_n > 0$ the distribution is truncated so that $g^*_n \in \{g_n - 1, g_n, g_n + 1\}$, and when $M_n = 1$ or $g_n = 0$, then $g^*_n \in \{g_n, g_n + 1\}$. When $g^*_n = g_n$, the new unique set is $U^*_n = U_n$. When $g^*_n = g_n + 1$, we use the Chinese restaurant process to define the new partition $U^*_n$, so that a feature $k \in \mathcal{K}_n$ is chosen with probability $\frac{z_{nk}}{g_n + \alpha}$ or an unrepresented feature $K_+ + 1$ is chosen with probability $\frac{\alpha}{g_n + \alpha}$; the chosen feature has its $z_{nk}$ value increased. When $g^*_n = g_n - 1$, we choose a feature $k \in \mathcal{K}_n$ with probability $\frac{z_{nk}}{g_n}$ and decrease its $z_{nk}$ value. The proposal is $Q(U^*_n \mid Z_n) = Q_1(g^*_n \mid g_n 1[M_n > 1])\, Q_2(U^*_n \mid g^*_n, U_n)$, where $Q_1$ is the truncated negative binomial,

$Q_2(U^*_n \mid g^*_n = g_n + 1, U_n) = \prod_{k \in \mathcal{K}_n} \left(\frac{z_{nk}}{g_n + \alpha}\right)^{1[z^*_{nk} = z_{nk}+1]} \left(\frac{\alpha}{g_n + \alpha}\right)^{1[z^*_{n,K_+ + 1} = 1]}$

and $Q_2(U^*_n \mid g^*_n = g_n - 1, U_n) = \prod_{k \in \mathcal{K}_n} \left(\frac{z_{nk}}{g_n}\right)^{1[z^*_{nk} = z_{nk}-1]}$. If the proposed $U^*_n$ differs from $U_n$ in the value of the feature $k$, the acceptance probability is $\min\left(1, A \frac{Q(U_n \mid Z^*_n)}{Q(U^*_n \mid Z_n)}\right)$, where $A$ is

$A = \frac{P(W \mid S^*_n, S_{-n}, z^*_{nk}, Z_{-(nk)})\, p(Y_n \mid S^*_n, z^*_{nk}, Z_{n,-k})\, NB(g^*_n; \alpha, \frac{N}{N+1})}{P(W \mid S_n, S_{-n}, z_{nk}, Z_{-(nk)})\, p(Y_n \mid S_n, z_{nk}, Z_{n,-k})\, NB(g_n; \alpha, \frac{N}{N+1})}$.   (16)

Note that when $U^*_n = U_n$, the sample is accepted without needing to evaluate $A$.

Finally, we add a few Metropolis-Hastings updates for the hyperparameter $\alpha$ using the posterior conditional given by Equation (7). This distribution is log-concave, so adaptive rejection sampling can also be used. A complete MCMC iteration consists of a scan of all the rows of $Z$ plus a few steps for updating $\alpha$.
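As a rough illustration of the per-entry update for a represented feature ($m_{-n,k} > 0$, ignoring the unique-feature move), the skeleton below follows the increase/decrease proposal and the acceptance ratio of Equation (15). It is a sketch of the control flow only: the callable `log_lik` is a stand-in supplied by the reader for the log of $P(W \mid \cdot)\, p(Y_n \mid \cdot)$, which in the paper is estimated by the importance-sampling step described above; `mh_step_znk`, `log_lik` and `rng` (a numpy `Generator`) are our own names.

```python
import numpy as np
from scipy.stats import nbinom

def mh_step_znk(znk, Mn, m_minus, N, log_lik, rng):
    """One Metropolis-Hastings update of z_nk for a represented feature.
    log_lik(z) returns log P(W|.) + log p(Y_n|.) evaluated at z_nk = z."""
    # Forward proposal: forced increase when z_nk = 0 or M_n = 1, else +/-1 at random.
    if znk == 0 or Mn == 1:
        z_new, q_fwd = znk + 1, 1.0
    else:
        z_new, q_fwd = (znk + 1, 0.5) if rng.random() < 0.5 else (znk - 1, 0.5)
    Mn_new = Mn - znk + z_new
    # Reverse proposal probability: chance of proposing z_nk back from the new state.
    q_rev = 1.0 if (z_new == 0 or Mn_new == 1) else 0.5
    p = N / (N + 1.0)                      # prior conditional NB(m_{-n,k}, N/(N+1))
    log_prior_ratio = nbinom.logpmf(z_new, m_minus, p) - nbinom.logpmf(znk, m_minus, p)
    log_A = log_lik(z_new) - log_lik(znk) + log_prior_ratio + np.log(q_rev / q_fwd)
    return z_new if np.log(rng.random()) < log_A else znk
```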
5.1 Sampling the assignments

In this section we discuss how we sample $S_n$ using Gibbs sampling and Metropolis updates. As explained earlier, we sample $S_n$ any time we attempt to change a $z_{nk}$ value. We first initialize $S_n$ using its previous value as follows. When we propose $z^*_{nk} = z_{nk} - 1$, we select the occurrence that corresponds to the cluster having the fewest local patches and randomly assign these patches to the remaining feature occurrences. When $z^*_{nk} = z_{nk} + 1$, then either a new feature has been added or a new occurrence of an existing feature has been added. In the first case we choose a portion of patches whose appearance distribution diverges from the posterior appearance distributions of the existing features and assign these patches to the new feature. When a new occurrence of an existing feature has been added, we randomly select a portion of patches whose appearance distribution is similar to the posterior appearance distribution of that feature and assign these patches to the new occurrence.

Once $S_n$ is initialized we apply Gibbs sampling and iteratively sample each $s_{ni}$ conditioned on the remaining variables and assignments $S_{-(ni)} = S \setminus s_{ni}$:

$P(s_{ni}^{kj} = 1 \mid S_{-(ni)}, Z, W, Y_n) \propto \left(\rho_{nkj}^{-(ni)} + \frac{\gamma}{M_n}\right) \frac{c_{k\ell}^{-(ni)} + \beta}{c_k^{-(ni)} + L\beta}\; \frac{(\rho_{nkj}^{-(ni)} + 1)\, \Gamma(\frac{v_{nkj}}{2})\, |V_{nkj}|^{\frac{v_{nkj}}{2}}}{\pi\, (\rho_{nkj}^{-(ni)} + 2)\, \Gamma(\frac{v_{nkj}-2}{2})\, |V_{nkj}^{-(ni)}|^{\frac{v_{nkj}-1}{2}}}$,   (17)

where $\ell$ is such that $w_{ni\ell} = 1$, the superscript $-(ni)$ means that the patch $i$ of the image $n$ is excluded from the corresponding count or statistic, and the pair $(v_{nkj}, V_{nkj})$ is computed given that $s_{ni}^{kj} = 1$. The above conditional is the product of two terms. The first corresponds to the collapsed Gibbs sampling update of the Bayesian LDA model [4] and the second to the collapsed Gibbs sampling update of a Bayesian Gaussian mixture model [9].

We also consider special Metropolis swap moves that can change whole groups of assignments simultaneously. The idea is that a cluster of patches might have been assigned to the wrong feature during the Gibbs updates. Swap moves can correct this by attempting to flip the assignments of two occurrences associated with two different features. Using a symmetric proposal distribution, we randomly choose two features $k_1$ and $k_2$ from the set $\{k : z_{nk} > 0\}$ and two occurrences $j_1$ and $j_2$ associated with each of the selected features. Then we define a new $S^*_n$ so that all the patches that were previously assigned to the cluster $k_1 j_1$ are now assigned to the cluster $k_2 j_2$, and those assigned to $k_2 j_2$ are now assigned to $k_1 j_1$. $S^*_n$ is accepted with probability $\min(1, A)$, where

$A = \frac{P(S^*_n \mid S_{-n}, Z, W, Y_n)}{P(S_n \mid S_{-n}, Z, W, Y_n)} = \prod_{k \in \{k_1, k_2\}} \frac{\Gamma(c_k + L\beta)}{\Gamma(c^*_k + L\beta)} \prod_{\ell=1}^{L} \frac{\Gamma(c^*_{k\ell} + \beta)}{\Gamma(c_{k\ell} + \beta)}$.   (18)

The probability $A$ simplifies because the swap move affects only the appearance distributions of the features $k_1$ and $k_2$, while the partition of patch locations into clusters and the appearance distributions of the remaining features do not change. The whole sampling algorithm alternates between Gibbs scans and swap moves. After a few iterations we accept the final sample $S^*_n$.
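Since the swap move only touches the appearance counts of the two features involved, its acceptance ratio can be computed from those counts alone. The following sketch evaluates the log of Equation (18); `counts_before`, `counts_after` and `beta` are our own names for the appearance count vectors of the two swapped features and the Dirichlet hyperparameter.

```python
import numpy as np
from scipy.special import gammaln

def log_swap_acceptance(counts_before, counts_after, beta):
    """Log of the ratio A in Eq. (18). counts_before/after map each of the two
    swapped features to its length-L appearance count vector, before and after
    exchanging the patches of the two chosen clusters."""
    log_A = 0.0
    for k in counts_before:
        c_old = np.asarray(counts_before[k], dtype=float)
        c_new = np.asarray(counts_after[k], dtype=float)
        L = c_old.size
        log_A += gammaln(c_old.sum() + L * beta) - gammaln(c_new.sum() + L * beta)
        log_A += np.sum(gammaln(c_new + beta) - gammaln(c_old + beta))
    return min(0.0, log_A)   # log of min(1, A); accept with this log-probability
```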
6 Experiments

In the first experiment we use a set of 10 artificial images. We consider four features that have the regular shapes shown in Figure 1. The discrete patch appearances correspond to pixels and can take 20 possible grayscale values. Each feature has its own multinomial distribution over the appearances. To generate an image, we first decide to include each feature with probability 0.5. Then, for each included feature, we randomly select the number of occurrences from the range [1, 3]. For each feature occurrence we select the pixels using the appearance multinomial and place the respective feature shape in a random location so that feature occurrences do not occlude each other. The first row of Figure 1 shows a training image (left), the locations of pixels (middle) and the discrete appearances (right). The MCMC algorithm was initialized with $K_+ = 1$, $\alpha = 1$ and $z_{n1} = 1$, $n = 1, \ldots, 10$. The third row of Figure 1 shows how $K_+$ (left) and the sum of all $z_{nk}$s (right) evolve through the first 500 MCMC iterations. Within the first 20 iterations the algorithm has visited the matrix $Z$ that was used to generate the data, and it then stabilizes. For 86% of the samples, $K_+$ is equal to four. For the state $(Z, S)$ that is most frequently visited, the second row of Figure 1 shows the localizations of all the different feature occurrences in three images. Each ellipse is drawn using the posterior mean values for a pair $(m_{nkj}, \Sigma_{nkj})$ and illustrates the predicted location and shape of a feature occurrence. Note that ellipses with the same color correspond to different occurrences of the same feature.

Figure 1: The first row shows a training image (left), the locations of pixels (middle) and the discrete appearances (right). The second row shows the localizations of all feature occurrences in three images; below each image the corresponding row of Z is also shown. The third row shows how K+ (left) and the sum of all znk's (right) evolve through the first 500 MCMC iterations.

In the second experiment we consider 25 real images from the UIUC¹ cars database. We used the patch detection method presented in [8] and constructed a dictionary of 200 visual appearances by clustering the SIFT [8] descriptors of the patches using K-means. Locations of detected patches are shown in the first row (left) of Figure 2. We partially labeled some of the images. In particular, for 7 out of the 25 images we annotated the car views using bounding boxes (Figure 2). This allows us to specify seven elements of the first column of the matrix $Z$ (the first feature will correspond to the car category). These $z_{nk}$ values, plus the assignments of all patches inside the boxes, do not change during sampling. Also, the patches that lie outside the boxes in all annotated images are not allowed to be part of car occurrences. This is achieved by applying partial Gibbs sampling updates and swap moves. The MCMC algorithm is initialized with $K_+ = 1$; after 30 iterations it stabilizes and then fluctuates between nine and twelve features. To keep the plots uncluttered, Figure 2 shows the detections and localizations of only the three most dominant features (including the car category) in five non-annotated images. The red ellipses correspond to different occurrences of the car feature, the green ones to a tree feature and the blue ones to a street feature.

Figure 2: The left-most plot on the first row shows the locations of detected patches and the bounding boxes in one of the annotated images. The remaining five plots show examples of detections and localizations of the three most dominant features (including the car category) in five non-annotated images.
7 Discussion

We presented the infinite Gamma-Poisson model, a nonparametric prior for non-negative integer valued matrices with an infinite number of columns. We discussed the use of this prior for unsupervised learning where multiple features are associated with the data and each feature can have multiple occurrences within each data point. Our main interest for future research is to enhance our method for solving visual object classification problems. The infinite Gamma-Poisson prior can be used for other purposes as well. For example, an interesting application is Bayesian matrix factorization, where an observation matrix is decomposed into a product of two or more matrices, one of them being a non-negative integer valued matrix.

References

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3, 2003.
[2] W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In UAI, 2004.
[3] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS 18, 2006.
[4] T. L. Griffiths and M. Steyvers. Finding scientific topics. In PNAS, 2004.
[5] G. E. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In NIPS 6, 1994.
[6] N. L. Johnson, S. Kotz, and A. W. Kemp. Univariate Discrete Distributions. Wiley, 1993.
[7] N. Jojic and B. J. Frey. Learning flexible sprites in video layers. In CVPR, 2001.
[8] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[9] S. MacEachern. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics B, 23:727-741, 1994.
[10] R. M. Neal. Bayesian mixture modeling. In 11th International Workshop on Maximum Entropy and Bayesian Methods of Statistical Analysis, pages 197-211, 1992.
[11] M. A. Newton and A. E. Raftery. Approximate Bayesian inference by the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B, 3:3-48, 1994.
[12] C. E. Rasmussen. The infinite Gaussian mixture model. In NIPS 12. MIT Press, 2000.
[13] E. Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7:51-71, 1995.
[14] E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. Describing visual scenes using transformed Dirichlet processes. In NIPS 18, 2006.

¹ Available from http://l2r.cs.uiuc.edu/cogcomp/Data/Car/.