The Phylogenetic Indian Buffet Process: A Non-Exchangeable Nonparametric Prior for Latent Features

Kurt T. Miller, EECS, University of California, Berkeley, CA 94720, tadayuki@cs.berkeley.edu
Thomas L. Griffiths, Psychology and Cognitive Science, University of California, Berkeley, CA 94720, tom_griffiths@berkeley.edu
Michael I. Jordan, EECS and Statistics, University of California, Berkeley, CA 94720, jordan@cs.berkeley.edu

Abstract

Nonparametric Bayesian models are often based on the assumption that the objects being modeled are exchangeable. While appropriate in some applications (e.g., bag-of-words models for documents), exchangeability is sometimes assumed simply for computational reasons; non-exchangeable models might be a better choice for some applications based on their subject matter. Drawing on ideas from graphical models and phylogenetics, we describe a non-exchangeable prior for a class of nonparametric latent feature models that is nearly as efficient computationally as its exchangeable counterpart. Our model is applicable to the general setting in which the dependencies between objects can be expressed using a tree, where edge lengths indicate the strength of relationships. We demonstrate an application to modeling probabilistic choice.
1 INTRODUCTION

Nonparametric Bayesian analysis provides a way to define probabilistic models in which structural aspects of the model, such as the number of classes or features possessed by a set of objects, are treated as unknown, unbounded, and random, and thus viewed as part and parcel of the posterior inference problem. This elegant treatment of structural uncertainty has led to increasing interest in nonparametric Bayesian approaches as alternatives to model selection procedures.

Nonparametric Bayesian methods are based on prior distributions that are defined on infinite collections of random variables, i.e., prior distributions expressed as general stochastic processes. This generality provides a rich language in which to express prior knowledge. In practice, however, the need to integrate over these stochastic processes at inference time imposes a strong constraint on the kinds of models that can be considered. In particular, nonparametric Bayesian models are often based on an assumption of exchangeability: the joint probability of the set of entities being modeled by the prior is assumed to be invariant to permutation. A particular example of exchangeability is the "bag-of-words" assumption widely used in the modeling of document collections.

In this paper we aim to extend the range of nonparametric Bayesian modeling by presenting a non-exchangeable prior for a class of nonparametric latent feature models. Our point of departure is the Indian buffet process (IBP), a generative process that defines a prior on sparse binary matrices (Griffiths and Ghahramani, 2006). This process, which will be described in more depth later, can be understood through a culinary metaphor in which diners sequentially enter a buffet line and select which dishes to try. As each person moves through the buffet line, they try each previously sampled dish with probability proportional to the number of people who have already tried it. This process can be shown to be exchangeable from the fact that it is obtained as a conditionally independent set of draws from a Bernoulli process with parameters drawn from an underlying stochastic process known as the beta process (Thibaux and Jordan, 2007).

To obtain a useful non-exchangeable, nonparametric model while retaining the computational tractability of the IBP, we consider a model in which relationships among the diners are expressed by a tree. In this stochastic process, the Phylogenetic Indian Buffet Process (pIBP), the probability that a diner chooses a dish in the buffet line depends not only on the number of previous diners who have chosen that dish, but also on how closely related the current diner is to each of the previous diners. We exploit efficient probabilistic calculations on trees (Pearl, 1988) to obtain a tractable algorithm for taking relatedness into account when computing posterior updates under the pIBP prior.

The tree representation is a rich language for expressing non-exchangeability. In particular, factorial and nested models of the kind used in experimental design are readily expressed as trees. Group structure as used in the hierarchical Dirichlet process (Teh et al., 2005) and hierarchical beta process (Thibaux and Jordan, 2007) can also be expressed as trees, as can a variety of other partially exchangeable models (Diaconis, 1988). In biological data analysis we may be able to exploit known phylogenetic or genealogical relationships among species or characters. More generally, we may have similarity data available for a set of objects that can be used to build a tree representation. The pIBP uses such representations to induce a prior on featural representations such that objects that are related in the tree will tend to share features (see Figure 1).

Figure 1: The Phylogenetic Indian Buffet Process. (a) A tree expresses dependencies among featural representations of objects. (b) The Indian Buffet Process is a special case of the pIBP in which all branches meet at the root.

It is important to distinguish our approach from previous nonparametric Bayesian work based on random trees (Neal, 2001; Teh et al., 2008). In that work, trees are averaged over and objects remain exchangeable. We condition on a known, fixed tree and our objects are not exchangeable.

While we develop the pIBP in the context of the IBP for concreteness, the idea is actually much broader. The key is that using a tree to express relationships among non-exchangeable random variables allows us to exploit the sum-product algorithm in defining the updates for an MCMC sampler. This insight extends the scope of nonparametric Bayesian models without significantly increasing the computational burden associated with inference.

The paper is organized as follows. Section 2 presents a short review of the IBP and then provides a detailed description of the pIBP. Section 3 discusses MCMC inference in models using the pIBP as a prior. Section 4 presents an application of the pIBP to models of human choice, and shows how combining nonparametric methods with a tree-based prior improves performance. Section 5 presents our conclusions.
2 THE PHYLOGENETIC INDIAN BUFFET PROCESS

Griffiths and Ghahramani (2006) proposed the Indian Buffet Process as a prior distribution on sparse binary matrices Z, where the rows of Z correspond to objects and the columns of Z correspond to a set of features or attributes describing these objects. As with other nonparametric Bayesian models, the IBP can be interpreted through a culinary metaphor. In this metaphor, objects correspond to people and features correspond to an infinite array of dishes at an Indian buffet. The first person samples Poisson(α) dishes, where α is a parameter. The ith customer then samples each previously sampled dish with probability proportional to the number of people who have already sampled it, and samples Poisson(α/i) new dishes. This process defines an exchangeable distribution on equivalence classes of Z, the binary matrix with a one at cell (i, k) when the ith customer chooses the kth dish. Figure 1(b) shows a matrix generated from this process.

The IBP can be derived as the infinite limit of a beta-Bernoulli model. Consider a finite model in which there are K features, and let the probability that an object possesses feature k be Bernoulli(θ_k). Endowing θ_k with a Beta(α/K, 1) distribution, we obtain the IBP in the limit as K → ∞.

The Phylogenetic Indian Buffet Process uses a similar construction. As with the IBP, we will use the term "pIBP" to refer both to the distribution on binary matrices and to a generative process that induces this distribution. We first describe the distribution on finite matrices at the heart of the pIBP, and then describe the process that results from letting the number of features go to infinity.

2.1 A PRIOR ON FINITE MATRICES

We begin by defining a generative process for Z, an N × K binary matrix, where N is the number of objects and K is a fixed, finite number of features. We use the following notation. Let z_ik denote the (i, k) entry of Z, let z_k be the kth column of Z, let z_(-i)k denote the entries of z_k except z_ik, and let Z_-(ik) denote the entries of the full Z matrix except z_ik.

As in the IBP, we associate a parameter θ_k with each column, chosen from a Beta(α/K, 1) prior distribution, where α is a hyperparameter. Given θ_k, the marginal probability that any particular entry in column k is one is equal to θ_k. Columns are generated independently.

We diverge from the IBP, however, in the specification of the joint probability distribution for the column z_k. In the IBP, the entries of z_k are chosen independently given θ_k. In the pIBP, the entries of z_k are dependent, with the pattern of dependence captured by a stochastic process on a rooted tree similar to models used in phylogenetics. In this tree, the N objects being modeled are at the leaves, and lengths are assigned to edges in such a way that the total edge length from the root to any leaf is equal to one.

To generate the entries of the kth column, we proceed as follows. Assign the value zero to the root node of the tree. Along any path from the root to a leaf, let this value change to a one with exponential rate λ_k, where λ_k = -log(1 - θ_k). That is, along an edge of length t, the probability of changing from a zero to a one is 1 - exp(-λ_k t). Once the value has changed to a one along any path from the root, all leaves below that point are assigned the value one. The parameterization λ_k = -log(1 - θ_k) is convenient because it ensures that θ_k remains the marginal probability that any single feature is equal to one. To see this, simply note that since every leaf node is at distance one from the root, for any entry in the matrix,

p(z_ik = 1 | θ_k) = 1 - exp(-(-log(1 - θ_k))) = θ_k,

which also guarantees that we recover the beta-Bernoulli model in the special case where all branches join at the root, as in Figure 1(b). It is a simple corollary that any set of objects characterized by a set of branches that meet at a single point will be exchangeable within that set, meaning that the tree can be used to capture notions of partial exchangeability.
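To make the finite-K construction concrete, here is a minimal Python sketch that draws a single column z_k from this tree-structured process. The nested-dict tree encoding (with 'length', 'children', and leaf 'index' fields) and the three-object example tree are our own illustrative assumptions, not notation from the paper; the paper only requires that root-to-leaf path lengths sum to one.

```python
import numpy as np

def sample_column(node, theta_k, rng, parent_is_one=False):
    """Recursively sample one column z_k of the finite pIBP prior on a rooted tree.

    Assumed (hypothetical) encoding: each node is a dict with 'length' (edge length
    to its parent), 'children' (list of nodes), and, for leaves, 'index' (object row).
    The root's own edge length is zero, so the root stays at zero as required.
    """
    lam = -np.log(1.0 - theta_k)                      # lambda_k = -log(1 - theta_k)
    if parent_is_one:
        value = True                                  # once a one, everything below is one
    else:
        # a zero switches to a one on an edge of length t with prob 1 - exp(-lambda_k * t)
        value = rng.random() < 1.0 - np.exp(-lam * node['length'])
    if not node['children']:                          # leaf: record z_ik
        return {node['index']: int(value)}
    z = {}
    for child in node['children']:
        z.update(sample_column(child, theta_k, rng, parent_is_one=value))
    return z

# Illustrative tree: objects 0 and 1 share an internal branch; object 2 attaches at the root.
tree = {'length': 0.0, 'children': [
    {'length': 0.5, 'children': [
        {'length': 0.5, 'children': [], 'index': 0},
        {'length': 0.5, 'children': [], 'index': 1}]},
    {'length': 1.0, 'children': [], 'index': 2}]}

rng = np.random.default_rng(0)
theta_k = rng.beta(1.0 / 5, 1.0)                      # theta_k ~ Beta(alpha/K, 1), e.g. alpha=1, K=5
print(sample_column(tree, theta_k, rng))              # e.g. {0: 1, 1: 1, 2: 0}
```

Because objects 0 and 1 share an internal branch, a switch to one on that shared branch gives them the feature together, which is exactly the sharing behavior the prior is designed to induce.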
2.2 CONDITIONAL PROBABILITIES

Now that we have defined the generative model on finite matrices, we show how to evaluate conditional probabilities in this model. We treat the tree as a directed graph with a variable at each interior node and z_ik at each leaf i. Then, given θ_k, or equivalently λ_k, if there is an edge of length t from a parent node x to a child node y, we have

p(y = 0 | x = 0, θ_k) = exp(-λ_k t)
p(y = 1 | x = 0, θ_k) = 1 - exp(-λ_k t)
p(y = 0 | x = 1, θ_k) = 0
p(y = 1 | x = 1, θ_k) = 1

as the conditional probabilities that define a tree-structured graphical model. Expressing the process as a graphical model makes it possible to efficiently compute the conditional probabilities that are relevant for posterior inference. Specifically, we will need to evaluate

p(z_ik | z_(-i)k, θ_k) for z_ik ∈ {0, 1}    (1)

and

p(z_(-i)k | z_ik, θ_k),    (2)

which are trivial in the beta-Bernoulli model due to the conditional independence of the z_ik, but more challenging in the pIBP, where the z_ik are no longer conditionally independent. To compute (1), we use the sum-product algorithm (Pearl, 1988). To calculate (2), we use the chain rule of probability to obtain a set of terms similar to (1), the difference being that the posterior in each term conditions on only a subset of the other variables. Each term can be reduced to a simple sum-product calculation by marginalizing over all variables that do not appear in that term, which is easy to do since all such variables appear at the leaves of the tree. Both (1) and (2) can be calculated in O(N) time by a dynamic program.
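The O(N) dynamic program for (1) can be sketched as an upward (leaf-to-root) message pass that, for each node and each state in {0, 1}, accumulates the probability of the observed leaves below it; the conditional for z_ik then comes from running the pass for both completions of the column and normalizing. This is a minimal illustration under the same hypothetical tree encoding as the previous sketch, not the authors' implementation; unobserved leaves are marginalized out by giving them a uniform message.

```python
import numpy as np

def edge_probs(lam, t):
    """Transition matrix P(child | parent) for an edge of length t: a zero switches
    to a one with probability 1 - exp(-lambda_k * t); a one never switches back."""
    stay0 = np.exp(-lam * t)
    return np.array([[stay0, 1.0 - stay0],   # parent = 0
                     [0.0,   1.0]])          # parent = 1

def upward_message(node, z, lam):
    """Return [P(observed leaves below | node = 0), P(observed leaves below | node = 1)]."""
    if not node['children']:
        v = z.get(node['index'])             # leaf value: 0, 1, or None (unobserved, marginalized)
        return np.ones(2) if v is None else np.eye(2)[v]
    msg = np.ones(2)
    for child in node['children']:
        msg *= edge_probs(lam, child['length']) @ upward_message(child, z, lam)
    return msg

def conditional_prob(tree, z_others, i, theta_k):
    """p(z_ik = 1 | z_(-i)k, theta_k), computed by two upward passes (the root is fixed at zero)."""
    lam = -np.log(1.0 - theta_k)
    joint = [upward_message(tree, dict(z_others, **{i: v}), lam)[0] for v in (0, 1)]
    return joint[1] / (joint[0] + joint[1])

tree = {'length': 0.0, 'children': [            # same illustrative three-object tree as before
    {'length': 0.5, 'children': [
        {'length': 0.5, 'children': [], 'index': 0},
        {'length': 0.5, 'children': [], 'index': 1}]},
    {'length': 1.0, 'children': [], 'index': 2}]}

# Object 1 shares a branch with object 0, so observing z_0k = 1 pulls p(z_1k = 1 | ...) well
# above the marginal theta_k = 0.3, while observing z_0k = 0 pulls it below the marginal.
print(conditional_prob(tree, {0: 1, 2: 0}, i=1, theta_k=0.3))
print(conditional_prob(tree, {0: 0, 2: 1}, i=1, theta_k=0.3))
```

Computing p(z_(-i)k | z_ik, θ_k) for (2) reuses the same pass through the chain-rule decomposition described above, and caching the per-node messages is what later brings the per-entry update in Section 3.1 down to O(1).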
2.3 A GENERATIVE PROCESS

The pIBP can be described as a sequential generative process that arises when we let K go to infinity in the distribution derived in Section 2.1. This process can again be understood in terms of a culinary metaphor, in which each row of Z is viewed as the choices made by a diner in a buffet line, and in which we specify how each diner chooses their dishes based on the dishes chosen by previous diners. We give an overview of this process here, leaving detailed mathematical derivations for later sections.

Consider a large extended family that is about to choose dishes at a buffet. Assume that we are given a tree describing the genealogical relationships of the family members, and assume that dining preferences are related to genealogy. In particular, family members who are more closely related have more similar preferences. Therefore, as each diner moves through the buffet line, their choice of dishes will depend more on the selections of previous diners who are closely related to them and less on the selections of other diners.

The pIBP generative process is specified as follows. The first diner (arbitrarily chosen) starts at the head of a buffet line that has infinitely many dishes. This person tries Poisson(α) dishes and also adds a brief annotation to each of these dishes: a value θ_k drawn uniformly from [0, 1]. This note, through its previously described equivalent representation λ_k = -log(1 - θ_k), will allow us to efficiently compute the probability that subsequent diners choose the kth dish using the sum-product algorithm.

Each subsequent diner enters the buffet line and, based on the annotations attached to the dishes as well as the identity of previous diners, samples the kth dish according to the probability (1), where z_(-i)k indicates which of the previous diners have chosen the kth dish. Through the stochastic process on the tree, if closely related diners have tried a dish, the current diner is more likely to sample it as well. The preferences of all diners who have not yet entered the buffet line are ignored, which is accomplished by running sum-product only on the minimal subtree from the root containing the current diner and all previous diners.

Each diner also samples a number of new dishes. If t_i is the length of the branch connecting diner i to the rest of the minimal subtree just described and Σt is the total length of the rest of this subtree, then diner i tries

Poisson(α (ψ(Σt + t_i + 1) - ψ(Σt + 1)))

new dishes, where ψ(·) is the digamma function. They also add an annotation θ_k to each of the new dishes that will be used for future inferences, where the density of θ_k is proportional to (1 - (1 - θ_k)^{t_i}) (1 - θ_k)^{Σt} θ_k^{-1}. This process repeats until all diners have gone through the buffet line, defining a matrix Z as in the IBP.

Though this process is not exchangeable, we can let any family member go first and obtain the same marginal distribution. This means that each row of Z has a Poisson(α) number of non-zero columns, yielding a sparse matrix as in the IBP. The IBP is the special case of the pIBP corresponding to the tree shown in Figure 1(b); this fact can be proved using identities of the digamma function on the integers.
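The claim that the IBP is recovered on the tree of Figure 1(b) can be checked numerically. On that star-shaped tree, the branch for diner i has length t_i = 1 and the minimal subtree of the previous i - 1 diners has total length Σt = i - 1, so the new-dish rate α(ψ(Σt + t_i + 1) - ψ(Σt + 1)) collapses to α/i by the identity ψ(n + 1) - ψ(n) = 1/n. A quick check, assuming SciPy is available:

```python
from scipy.special import digamma

alpha = 2.0
for i in range(1, 6):
    t_i, rest = 1.0, float(i - 1)     # star tree: each branch has length 1, so the rest of
                                      # the minimal subtree for diner i has total length i - 1
    pibp_rate = alpha * (digamma(rest + t_i + 1.0) - digamma(rest + 1.0))
    print(i, pibp_rate, alpha / i)    # the pIBP rate matches the IBP's Poisson(alpha / i) rate
```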
3 INFERENCE BY MCMC

We now consider how to perform posterior inference in models using the pIBP as a prior. As with the IBP, exact inference is intractable, but the model is amenable to approximate inference via Markov chain Monte Carlo (MCMC) (Robert and Casella, 2004).

An important point is that even though we are dealing with a potentially infinite number of columns in Z, we only need to keep track of the non-zero columns, a property shared by other nonparametric Bayesian models. Henceforth, let Z refer to the non-zero columns of the matrix. The fact that the number of ones in any given row has a Poisson distribution means that the number of non-zero columns is finite (with probability one) and generally small.

Given the matrix Z, we assume that data X are generated through a likelihood function p(X | Z). The likelihood may introduce additional parameters that must be sampled as part of the overall MCMC procedure; we will not discuss such parameters in our presentation.

Unlike the IBP, where θ_k can always be integrated out, inference in the pIBP requires treating θ_k as an auxiliary variable, sampling it when needed and integrating it out when possible. By sampling θ_k for non-zero columns of Z rather than integrating it out, we are able to exploit the sum-product algorithm as described in Section 2.2. Updating θ_k and all z_ik for each column takes O(N + mN) time, where m is the total number of times a z_ik in column k changes value. Once the chain has mixed well, m is typically small, so the time complexity is only slightly worse than that of the IBP, which is O(N).

Given an initial choice of the non-zero columns of Z and the corresponding θ_k for each of these columns, we construct a Markov chain in which, at each step, we only need to sample each variable from its conditional distribution given all others. We now describe how to sample each of the variables, first considering the variables for "old" columns--those with non-zero entries--and then turning to the addition of "new" columns.

3.1 SAMPLING z_ik FOR OLD COLUMNS

The probability of each z_ik given all other variables is

p(z_ik | Z_-(ik), θ_k, X, α) ∝ p(X | Z_-(ik), z_ik) p(z_ik | z_(-i)k, θ_k),    (3)

where the first term is the probability of X given a full assignment of the parameters and depends on the specific model being used. The term p(z_ik | z_(-i)k, θ_k) can be computed efficiently using the sum-product algorithm as described in Section 2.2. By appropriately caching messages from the sum-product algorithm, this evaluation can be reduced to O(1) time. We evaluate (3) for z_ik = 0 and z_ik = 1 and sample z_ik from the corresponding posterior distribution. If the value of z_ik changes, we then update the messages for sum-product in O(N) time. If a column ever becomes entirely zero, we drop it from Z.

3.2 SAMPLING θ_k FOR OLD COLUMNS

We only sample θ_k for the old columns of Z, a fact that will be useful in subsequent calculations. The posterior distribution of each θ_k is independent of all other θ_k and depends only on the kth column of Z. When resampling θ_k, let z_ik be a non-zero entry in the kth column. Then

p(θ_k | z_k, α) ∝ p(z_(-i)k | θ_k, z_ik) p(θ_k | z_ik, α).    (4)

Section 2.2 describes how to compute p(z_(-i)k | θ_k, z_ik) efficiently in O(N) time using the sum-product algorithm. To obtain p(θ_k | z_ik, α), we compute

p(θ_k | z_ik, α) ∝ p(z_ik | θ_k) p(θ_k | α) = Bernoulli(z_ik; θ_k) Beta(θ_k; α/K, 1) ∝ Beta(θ_k; 1 + α/K, 1) → Beta(1, 1) = 1{θ_k ∈ [0, 1]}

as K → ∞. We see that we can evaluate p(θ_k | z_k, α) up to a normalizing constant efficiently, so we can sample a new θ_k using a Metropolis-Hastings step. Specifically, given a proposed value θ* for θ_k, the ratio of the posterior probabilities of θ* and θ_k is

p(θ* | z_k, α) / p(θ_k | z_k, α) = [p(z_(-i)k | θ*, z_ik) p(θ* | z_ik, α)] / [p(z_(-i)k | θ_k, z_ik) p(θ_k | z_ik, α)] = [p(z_(-i)k | θ*, z_ik) / p(z_(-i)k | θ_k, z_ik)] 1{θ* ∈ [0, 1]},

so if we use q(θ* | θ_k) as the proposal distribution, then the Metropolis-Hastings acceptance ratio is

min(1, [q(θ_k | θ*) p(z_(-i)k | θ*, z_ik)] / [q(θ* | θ_k) p(z_(-i)k | θ_k, z_ik)] · 1{θ* ∈ [0, 1]}).    (5)

There are many options for q. In our experiments, we used q(θ* | θ_k) = N(θ_k, σ_k²), where σ_k² = c θ_k (1 - θ_k) + ε with c = 0.06 and ε = 0.08.
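The following is a compact sketch of the Metropolis-Hastings update for θ_k in an old column, under the acceptance ratio (5). It assumes a caller-supplied loglik_column(theta) returning log p(z_(-i)k | θ, z_ik), for example via the message-passing sketch of Section 2.2; the proposal follows the N(θ_k, σ_k²) form with σ_k² = c θ_k(1 - θ_k) + ε, where the constant names are our labels for the two reported values. This is an illustrative sketch, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def mh_update_theta(theta_k, loglik_column, c=0.06, eps=0.08, rng=None):
    """One Metropolis-Hastings update of theta_k for an old column (Section 3.2).

    loglik_column(theta) is assumed to return log p(z_(-i)k | theta, z_ik); the
    Beta(1, 1) prior only contributes the indicator that theta stays in [0, 1].
    """
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(c * theta_k * (1.0 - theta_k) + eps)            # proposal scale at theta_k
    theta_star = rng.normal(theta_k, sigma)
    if not 0.0 < theta_star < 1.0:                                  # indicator 1{theta* in [0, 1]}
        return theta_k
    sigma_back = np.sqrt(c * theta_star * (1.0 - theta_star) + eps)
    log_accept = (loglik_column(theta_star) - loglik_column(theta_k)   # likelihood ratio
                  + norm.logpdf(theta_k, theta_star, sigma_back)       # q(theta_k | theta*)
                  - norm.logpdf(theta_star, theta_k, sigma))           # q(theta* | theta_k)
    return theta_star if np.log(rng.random()) < log_accept else theta_k
```

In a full sampler this update alternates with the z_ik updates of Section 3.1, and the same routine can serve Section 3.4 once loglik_column is replaced by the log of the unnormalized density in (8).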
3.3 SAMPLING THE NEW COLUMNS

In addition to sampling the values of z_ik in old columns, we need to consider the possibility that one of the infinitely many all-zero columns acquires a single entry that becomes a one. As mentioned in Section 3.2, we only sample values of θ for non-zero columns, so when sampling new columns we must integrate out θ.

For finite K, we can directly compute the probability p(z_ik = 1 | z_(-i)k = 0) for each all-zero column. If t_i is the length of the edge that ends at the ith object and Σt is the total length of all other edges in the tree, then we get

p(z_ik = 1 | z_(-i)k = 0) ∝ ∫_0^1 p(z_ik = 1 | z_(-i)k = 0, θ) p(z_(-i)k = 0 | θ) p(θ) dθ
  = ∫_0^1 (1 - (1 - θ)^{t_i}) (1 - θ)^{Σt} (α/K) θ^{α/K - 1} dθ
  = (α/K) Γ(α/K) Γ(Σt + 1) / Γ(Σt + α/K + 1) - (α/K) Γ(α/K) Γ(Σt + t_i + 1) / Γ(Σt + t_i + α/K + 1),

where Γ(·) is the gamma function, and similarly

p(z_ik = 0 | z_(-i)k = 0) ∝ (α/K) Γ(α/K) Γ(Σt + t_i + 1) / Γ(Σt + t_i + α/K + 1),

which gives us

p(z_ik = 1 | z_(-i)k = 0) = 1 - [Γ(Σt + t_i + 1) Γ(Σt + α/K + 1)] / [Γ(Σt + 1) Γ(Σt + t_i + α/K + 1)].

Therefore, the event that we sample z_ik = 1 in any particular all-zero column is Bernoulli with the above probability. Treating the value α/K as a variable in the expression for p(z_ik = 1 | z_(-i)k = 0) and doing a first-order Taylor expansion about zero, we get

p(z_ik = 1 | z_(-i)k = 0) = (α/K) (ψ(Σt + t_i + 1) - ψ(Σt + 1)) + o(1/K),

where ψ(·) is the digamma function. As K grows, the probability that any particular all-zero column becomes non-zero goes to zero. On the other hand, we have a growing number of these Bernoulli variables. Using the fact that Binomial(K, λ/K) → Poisson(λ) as K → ∞, we get that, for each row i, the number of new non-zero columns with a one in the ith row is distributed

Poisson(α (ψ(Σt + t_i + 1) - ψ(Σt + 1))).    (6)

Putting this all together, instead of sampling z_ik for each of the infinitely many all-zero columns individually, we sample K_i^new, the number of these columns that will have z_ik = 1. The distribution of K_i^new is

p(K_i^new | X, Z, α) ∝ P(X | Z_new) P(K_i^new | α),    (7)

where P(K_i^new | α) is given by Equation (6). To compute P(X | Z_new), we must augment Z with K_i^new new columns that have a one in only the ith row; this yields Z_new. Though K_i^new can be arbitrarily large, (7) decays rapidly as K_i^new grows, so we can truncate our evaluation at a finite number of columns and sample K_i^new from the corresponding multinomial.

3.4 SAMPLING θ_k FOR NEW COLUMNS

For each of the new columns generated in the previous step, we must sample an initial value of θ_k. Using the same notation as before for edge lengths, if we are sampling θ_k for a new column in which only the ith element is non-zero, then we are sampling from

p(θ_k | z_k) = p(θ_k | z_(-i)k = 0, z_ik = 1) ∝ (1 - (1 - θ_k)^{t_i}) (1 - θ_k)^{Σt} θ_k^{-1}.    (8)

To obtain a sample from this distribution, we use the Metropolis-Hastings sampler from Section 3.2, replacing equation (4) with (8).

3.5 SAMPLING α

Following Görür et al. (2006), we place a gamma prior, G(1, 1), on α. Noting that α only influences Z through the number of non-zero columns K_+, we compute

p(α | Z) ∝ p(Z | α) p(α) = Poisson(K_+; α (ψ(1 + Σt) - ψ(1))) · G(1, 1) ∝ G(K_+ + 1, ψ(1 + Σt) - ψ(1) + 1),

where Σt is the total edge length in the tree.

Figure 2: Hypothesis about a tree on preferences (Rumelhart and Greeno, 1971); the objects are grouped into Politicians, Athletes, and Movie Stars.

4 AN APPLICATION TO CHOICE

Choice models play important roles in both econometrics (McFadden, 2000) and cognitive psychology (Luce, 1959). They describe what happens when people are given two or more options and are asked to choose one of them. In this section, we restrict our attention to choices between pairs of objects, though the methods presented here can be applied more generally.

Even in the simple case of binary decisions, people's choices are not deterministic. The Elimination By Aspects (EBA) model is a popular attempt to explain this variation (Tversky, 1972). EBA hypothesizes that choices are based on a weighted combination of the features of objects. Keeping our earlier notation, let Z be a feature matrix where z_ik = 1 if the ith object has the kth feature and z_ik = 0 otherwise. Each feature has a corresponding weight w_k; the higher the weight, the more influence that feature has. The EBA model defines the probability of choosing object i over object j as

p_ij = Σ_k w_k z_ik (1 - z_jk) / [Σ_k w_k z_ik (1 - z_jk) + Σ_k w_k (1 - z_ik) z_jk].    (9)

For comparison with previous results (Görür et al., 2006), we assume extra noise in people's choices, with p̃_ij = (1 - ε) p_ij + 0.5 ε.
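Below is a small sketch of the EBA choice probabilities in equation (9), including the noise mixture used for comparison with Görür et al. (2006). The toy Z and w are made-up inputs for illustration, and the function name is ours.

```python
import numpy as np

def eba_choice_probs(Z, w, eps=0.0):
    """Pairwise choice probabilities under Elimination By Aspects (equation (9)),
    with the noise mixture p~_ij = (1 - eps) * p_ij + 0.5 * eps."""
    Z = np.asarray(Z, dtype=float)                       # N x K binary feature matrix
    w = np.asarray(w, dtype=float)                       # K feature weights
    # A[i, j] = sum_k w_k z_ik (1 - z_jk): total weight of features i has but j lacks.
    A = np.einsum('k,ik,jk->ij', w, Z, 1.0 - Z)
    with np.errstate(invalid='ignore'):
        p = A / (A + A.T)                                # equation (9); NaN when i and j share all features
    return (1.0 - eps) * p + 0.5 * eps

# Toy usage: object 0 has a uniquely heavy feature, so it is usually chosen over the others.
Z = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
w = np.array([2.0, 0.5, 0.5])
print(eba_choice_probs(Z, w, eps=0.1))
```

In the application described here, Z is given the pIBP prior (using a tree such as the one hypothesized in Figure 2), and these pairwise probabilities enter the likelihood of the observed choice counts discussed next.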
If X̃ is the observed choice matrix, where x̃_ij contains how many times object i was chosen over object j, then for any given w and Z the probability of X̃ is

P(X̃ | Z, w) = ∏_{i=1}^{N}