Bayesian Out-Trees

Tony Jebara
Columbia University
New York, NY 10027
jebara@cs.columbia.edu

Abstract

A Bayesian treatment of latent directed graph structure for non-iid data is provided where each child datum is sampled with a directed conditional dependence on a single unknown parent datum. The latent graph structure is assumed to lie in the family of directed out-tree graphs which leads to efficient Bayesian inference. The latent likelihood of the data and its gradients are computable in closed form via Tutte's directed matrix tree theorem using determinants and inverses of the out-Laplacian. This novel likelihood subsumes iid likelihood, is exchangeable and yields efficient unsupervised and semi-supervised learning algorithms. In addition to handling taxonomy and phylogenetic datasets, the out-tree assumption performs surprisingly well as a semi-parametric density estimator on standard iid datasets. Experiments with unsupervised and semi-supervised learning are shown on various UCI and taxonomy datasets.

1 INTRODUCTION

Many machine learning methods use graph connectivity structure to constrain the dependencies between random variables or between the samples in a dataset. If the graph structure is latent, Bayesian inference or heuristics are used to recover it. This article explores a distribution over a family of directed graphs known as out-trees where Bayesian inference remains efficient. Furthermore, the directed graph connectivity across samples is helpful not only for structured datasets but iid datasets as well.

Graph structure is useful to constrain dependencies between random variables (Pearl, 1988; Meila & Jaakkola, 2006), dependencies across samples¹ in a dataset (Roweis & Saul, 2000; Carreira-Perpinan & Zemel, 2004) or even a heterogeneous combination of the two. It may be acceptable to heuristically choose a single graph structure for some problems (Roweis & Saul, 2000) but, in many settings, a Bayesian treatment of latent graph structure can be more precise (Friedman, 1998; Friedman & Koller, 2003; Kemp et al., 2003; Neal, 2003). Tree structures are a particularly efficient family of subgraphs and are relevant in many real-world non-iid datasets spanning dynamics, genetics, biology, decision-making, disease transmission and natural language processing (Leitner et al., 1996; Helmbold & Schapire, 1997; Willems et al., 1995; Mau et al., 1999; Koo et al., 2007). Bayesian inference over undirected trees is efficient (Meila & Jaakkola, 2006); however, graph families beyond undirected trees may require approximate inference. For example, recovering an optimal graph is NP-hard for graphs in families with more than one parent per node and requires approximation methods like MCMC sampling which may have slow mixing times (Friedman & Koller, 2003).

¹ In manifold learning, dependencies across samples in a dataset are often constrained using k-nearest neighbor and/or maximum weight spanning tree subgraphs.

This article uses graphs primarily to constrain dependency across exchangeable samples in a dataset. We will assume a latent graph structure was responsible for generating the data and assume it lies in the family of directed out-trees. Like undirected trees, this family of directed graphs also benefits from efficient Bayesian inference algorithms. However, the directed aspect of out-trees is not only beneficial for non-iid structured datasets like taxonomy trees: it also (surprisingly) improves density estimation for iid datasets. We conjecture that the directed tree graph structure assumption acts as a flexible semiparametric estimator that overcomes mismatch between a parametric model and an otherwise iid dataset.

This paper is organized as follows. Section 2 describes how out-trees may emerge sequentially in nature and then presents a computationally convenient generative model for them. Section 3 describes Bayesian inference under the latent out-tree assumption and introduces Tutte's directed matrix tree theorem for efficient inference. Section 4 describes an unsupervised maximum likelihood approach to refining the parameters for the out-tree model. Section 5 describes a semi-supervised approach where the out-tree model can be used to transfer inductive bias from inputs to labels. A variational Bayesian setup for integrating over both structure and parameters is described in Section 6. Experiments with unsupervised learning and semi-supervised learning are shown in Section 9 and followed by a brief discussion.

[Figure 1: Sampled out-trees using N(X_t | Σ_{c|π} X_{π(t)}, I).]

2 THE GENERATIVE MODEL

Consider a real-world example of a directed out-tree: the genealogical dataset of neuroscience PhD graduates². This is a population dataset containing t = 1, . . . , T input samples. Each sample or individual t in this dataset is a node which has a single parent π(t), the main doctoral advisor for student t. Also, each node t has a corresponding attribute vector X_t associated with it which describes the student (dissertation topic area, year of graduation, institute, etc.).

² Available online via http://neurotree.org.

One realistic way such an out-tree is generated is in a sequential or temporal manner. A root node labeled t = 1 is sampled with a corresponding vector X_1 of attributes. Knowing the value X_1 for the root, a number of its children nodes are sampled with attributes that depend on the current settings of the parent, X_1. These attribute vectors are drawn from a conditional asymmetric distribution p(X | X_1).
This is because an advisor is likely to generate students with related dissertation topics, within a finite number of years of his graduation, and so on. This conditional mimics the process of mutation in phylogeny. The children then become parents themselves and go on to generate further descendants by sampling again from the conditional distribution p(X_t | X_{π(t)}). Assume each attribute vector X_t is a 3D Euclidean vector describing the PhD graduate. One choice for the conditional distribution or mutation is the conditioned Gaussian N(X_t | Σ_{c|π} X_{π(t)}, I) with a fixed correlation parameter Σ_{c|π} and identity variance. Figure 1 shows some synthetic out-trees with 3D attribute vectors generated by this conditional.

While the above sequential method for generating real-world data is interesting, computational considerations will encourage us to follow a slightly different generative modeling assumption. The data in Figure 1 was actually generated via a non-sequential recipe as follows. We assume an integer T is given that indicates the total number of nodes. Then, from a prior distribution over trees p(T), an undirected tree T is chosen to connect the T nodes. Then, we choose a root node from the set of nodes 1, . . . , T. We then obtain an out-tree by choosing all edges to point away from the root. Next, the attributes of the root are sampled from a marginal distribution p(X_r). Then, traversing from parent to child along the out-tree, we sample the child's attribute vector X_t according to a conditional (mutation) distribution that depends on its parent X_{π(t)}, denoted by p(X_t | X_{π(t)}).

As is often the case in learning, we assume that some aspects of this generative process are hidden and must be recovered via inference. For instance, we will assume that the tree structure T, the choice of the root and so on are not available to the learner. Instead, as in many real-world datasets, we can only access the node attributes X_1, . . . , X_T given in some arbitrary ordering. One challenge is to recover the marginal p(X_r) and conditional distributions p(X_t | X_{π(t)}) from the data. Another challenge is to recover information about the lost connectivity structure T. Another challenge is to recover missing information in some of the attribute vectors, i.e. hidden elements in X_1, . . . , X_T. This article presents efficient Bayesian inference approaches to these problems.

3 EFFICIENT DISTRIBUTIONS OVER OUT-TREES

We are given a training dataset containing input samples X_t for t = 1 . . . T in some arbitrary order. One quantity to evaluate or manipulate is the likelihood of the dataset p(X_1, . . . , X_T | θ, T) given some model. A popular method to recover a model of the dataset is to find the model that maximizes the likelihood score. An additional standard assumption most unsupervised methods make is that the dataset is composed of independently identically distributed samples. In other words, p(X_1, . . . , X_T | θ, T) = \prod_{t=1}^T p(X_t | θ). This iid assumption can often be inappropriate for real datasets.

What is a more minimal set of assumptions on the likelihood function we can make? Likelihoods should be non-negative and sum to unity if we integrate over all X_1, . . . , X_T. In addition, since the data arrives in an arbitrary order, a likelihood should be invariant to permutation of the arguments {X_1, . . . , X_T} for any given finite dataset size T. This property is called finite exchangeability. It is less strict than infinite exchangeability which is in turn less strict than iid sampling.³ The next section derives a likelihood that satisfies these properties yet generalizes the iid setting by assuming data was sampled according to an out-tree data structure. The generative model assumes that we first form a complete tree with T nodes (the number T is known a priori) and then children are sampled from their parents using conditional distributions according to the out-tree.

³ Another typical assumption most likelihoods require (which is not strictly necessary) is consistency. For instance, a likelihood should produce consistent marginals, i.e. \sum_{X_T} p(X_1, \ldots, X_T | \theta, T) = p(X_1, \ldots, X_{T-1} | \theta, T-1). In this article, because of the explicit a priori dependence on the number of samples T, such consistency statements will not be pursued.

More formally, define an out-tree as an acyclic graph T with a set of T vertices X = X_1, . . . , X_T and directed edges such that each node X_t has at most one parent node X_{π(t)} and the root has no parents. Note, here we abuse notation and take X_t to refer to the node corresponding to the t'th sample as well as its attribute vector interchangeably. Rooted out-trees are trees with directed edges pointing away from a well-defined root. For instance, X_1 ← X_2 ← X_3 is an out-tree rooted at X_3. Conversely, rooted in-trees have all directed edges from other nodes point towards the root. The previous 3-chain example is thus also an in-tree rooted at node X_1. Many directed trees are neither in-trees nor out-trees. For instance, the tree X_1 → X_2 ← X_3 → X_4 is a valid directed tree but neither a rooted in-tree nor a rooted out-tree. For each choice of a root, the set of rooted out-trees forms a disjoint set of T^{T-2} directed trees. Therefore, there are T^{T-1} out-trees for T nodes.
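To make the non-sequential recipe of Section 2 and the T^{T-1} count above concrete, the following sketch samples one of the T^{T-1} out-trees uniformly at random (a uniform labeled tree via a random Prüfer sequence, plus a uniformly chosen root with all edges oriented away from it) and then samples attributes down the tree. The standard-normal root marginal and the matrix A standing in for the correlation parameter Σ_{c|π} are illustrative assumptions, not the paper's fitted parameters.

```python
import numpy as np

def random_out_tree(T, rng):
    """Pick one of the T^(T-1) out-trees uniformly: a uniform labeled tree
    (random Pruefer sequence) plus a uniformly chosen root, with every edge
    oriented away from the root. Returns (root, parent) with parent[root] = None."""
    parent = [None] * T
    if T == 1:
        return 0, parent
    seq = [int(v) for v in rng.integers(0, T, size=max(T - 2, 0))]
    degree = [1] * T
    for v in seq:
        degree[v] += 1
    edges = []
    leaves = sorted(i for i in range(T) if degree[i] == 1)
    for v in seq:                          # standard Pruefer decoding
        u = leaves.pop(0)                  # smallest current leaf
        edges.append((u, v))
        degree[u] -= 1
        degree[v] -= 1
        if degree[v] == 1:
            leaves = sorted(leaves + [v])
    u, w = [i for i in range(T) if degree[i] == 1]
    edges.append((u, w))
    root = int(rng.integers(0, T))         # uniform root choice
    adj = [[] for _ in range(T)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    stack, seen = [root], {root}
    while stack:                           # orient edges away from the root
        node = stack.pop()
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                parent[nb] = node
                stack.append(nb)
    return root, parent

def sample_attributes(root, parent, A, D, rng):
    """Section 2 mutation process: X_root ~ N(0, I) (illustrative marginal),
    each child X_t ~ N(A X_parent(t), I), sampled only after its parent."""
    T = len(parent)
    children = [[] for _ in range(T)]
    for t, p in enumerate(parent):
        if p is not None:
            children[p].append(t)
    X = np.zeros((T, D))
    X[root] = rng.standard_normal(D)
    stack = [root]
    while stack:
        node = stack.pop()
        for c in children[node]:
            X[c] = A @ X[node] + rng.standard_normal(D)
            stack.append(c)
    return X

rng = np.random.default_rng(0)
A = 0.9 * np.eye(3)          # stand-in for the correlation parameter Sigma_{c|pi}
root, parent = random_out_tree(T=200, rng=rng)
X = sample_attributes(root, parent, A, D=3, rng=rng)
print("root =", root, "  X shape:", X.shape)
```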
If we knew the latent out-tree structure T that generated our T samples, the likelihood of the data under the generative assumptions of Section 2 would factorize as a product of conditionals of each node given its parent. However, in general, the structure is unknown. Consider treating structure as a random variable and using Bayes' rule to obtain a posterior distribution over out-trees as follows:

p(T | X) = \frac{p(X | T) p(T)}{p(X)} = \frac{\prod_{t=1}^T p(X_t | X_{\pi(t)}) \, p(T)}{p(X_1, \ldots, X_T)}.

A typical assumption is that the prior over out-tree structures is chosen to be uniform, yielding p(T) = 1/card(Ω) = 1/T^{T-1}. This is merely a normalized constant distribution over all possible out-trees. We rewrite the posterior over out-trees as follows:

p(T | X) = \frac{p(X | T)}{p(X) \, T^{T-1}} = \frac{1}{Z} \prod_{t=1}^T p(X_t | X_{\pi(t)})    (1)

where we have defined the partition function Z that ensures that the likelihood term sums to unity over all possible out-trees:

Z = p(X) \, T^{T-1} = \sum_{T \in \Omega} \prod_{t=1}^T p(X_t | X_{\pi(t)}).

Here, T enumerates over the set of all out-trees, Ω. This is an unwieldy computation since there are T^{T-1} possible out-trees connecting T observation vertices. Instead, we consider breaking up the summation into all possible choices of the root of the out-tree r = 1 . . . T and a summation over the subset Ω_r of all T^{T-2} out-trees rooted at node r. It is straightforward to show that all subsets of out-trees with different roots are distinct, in other words Ω_i ∩ Ω_j = {} if i ≠ j. Furthermore, their union forms the set of all out-trees Ω = \cup_{j=1}^T Ω_j. Thus, the partition function Z is given by the following sum:

Z = \sum_{r=1}^T \sum_{T_r \in \Omega_r} \prod_{t=1}^T p(X_t | X_{\pi(t)}) = \sum_{r=1}^T p(X_r) \sum_{T_r \in \Omega_r} \prod_{t \neq r} p(X_t | X_{\pi(t)})

where we have used the property that the root has no parent node. To efficiently recover Z we will instead recover the individual components of the above sum over r:

Z_r = \sum_{T_r \in \Omega_r} \prod_{t \neq r} p(X_t | X_{\pi(t)})

by making an appeal to the directed variant of Kirchhoff's Matrix Tree Theorem, namely Tutte's Directed Matrix Tree Theorem (West, 1996). The directed matrix tree theorem does not quite sum over all directed trees. It sums over a subset: rooted out-trees. To apply Tutte's theorem we compute an asymmetric weight matrix β of size T × T populated by all pairwise conditional probabilities β_{uv} = p(X_u | X_v). Note that we will assume β_{vv} = 0 since there are no edges between a node and itself. The β matrix allows us to rewrite Z_r as a product over edges in the tree instead of a product over nodes:

Z_r = \sum_{T_r \in \Omega_r} \prod_{(v \to u) \in T_r} \beta_{uv}.

The out-tree Laplacian matrix Q is then obtained as follows:

Q = \mathrm{diag}(\beta \mathbf{1}) - \beta.

Here, take 1 to be the ones column vector and note that the diag(v) operator gives a diagonal matrix with v on its diagonal. Note that this out-Laplacian is not symmetric since β is not symmetric. Similarly, the in-Laplacian is given by Q_{in} = \mathrm{diag}(\mathbf{1}^\top \beta) - \beta.
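The weight matrix and out-Laplacian above translate directly into a few lines of numpy. The sketch below assumes the conditional Gaussian p(X_u | X_v) = N(X_u | A X_v, σ²I) from Section 2, with A and σ² as placeholder parameters; any other conditional density could be plugged into the same construction.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Density of N(mean, var * I) evaluated at the vector x."""
    d = x.shape[-1]
    diff = x - mean
    return np.exp(-0.5 * np.dot(diff, diff) / var) / (2 * np.pi * var) ** (d / 2)

def weight_matrix(X, A, var):
    """beta[u, v] = p(X_u | X_v), with beta[v, v] = 0 (no self edges)."""
    T = X.shape[0]
    beta = np.zeros((T, T))
    for u in range(T):
        for v in range(T):
            if u != v:
                beta[u, v] = gauss_pdf(X[u], A @ X[v], var)
    return beta

def out_laplacian(beta):
    """Q = diag(beta 1) - beta; the rows of Q sum to zero."""
    return np.diag(beta.sum(axis=1)) - beta

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))                  # toy data, T = 6, D = 3
beta = weight_matrix(X, A=0.9 * np.eye(3), var=1.0)
Q = out_laplacian(beta)
print("max |row sum of Q| =", np.abs(Q.sum(axis=1)).max())   # ~0 up to round-off
```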
The directed matrix tree theorem asserts that the number (or weight) of out-trees rooted at node r is Z_r and is given by the matrix cofactor [Q]_r obtained by deleting the r'th row and r'th column of the matrix Q. The precise formula is:

Z_r = |[Q]_r| = |[\mathrm{diag}(\beta \mathbf{1}) - \beta]_r|.

Reinserting this formula into the above gives the total partition function as:

Z = \sum_{r=1}^T p(X_r) Z_r = \sum_{r=1}^T p(X_r) \, |[\mathrm{diag}(\beta \mathbf{1}) - \beta]_r|

which is now efficient to evaluate. Interestingly, Z is the sum of determinants of the minors of the Laplacian. If β is symmetric, all terms in the summation above are identical and we need only work with a single determinant of the T × T matrix. A symmetric β would emerge, for example, if we chose symmetric conditional distributions p(X_u | X_v) = p(X_v | X_u). In addition, it is known that the log determinant of a symmetric Laplacian matrix is a concave function of the edge-weights (Jakobson & Rivin, 2002). In the asymmetric case, however, the log-partition function does not preserve concavity.

A naive implementation recovers Z in O(T^4); however, it is possible in cubic time as in (Jebara & Long, 2005; Koo et al., 2007) via straightforward applications of Woodbury's formula. This is done by creating a matrix of size (T + 1) × (T + 1) called Q̂:

\hat{Q} = \begin{bmatrix} 1 & \mathbf{p}^\top \\ -\mathbf{p} & Q \end{bmatrix}

where p is a unit-normalized vector of length T whose entries are proportional to the root probabilities: p(r) = p(X_r) / \sum_{r'=1}^T p(X_{r'}). The partition function can then be computed by a single evaluation of the matrix determinant as Z = (\sum_{r=1}^T p(X_r)) |\hat{Q}|, which is O((T + 1)^3), although faster methods are also possible (Kaltofen & Villard, 2004). This is an improvement over the summation of smaller determinants which required O(T^4). In addition, for numerical reasons, we use the logarithm of the partition function. This is recovered via the trace of the matrix logarithm (or the sum of the log-singular values after an SVD) as:

\ln Z = \ln\Big( \sum_{r=1}^T p(X_r) \Big) + \mathrm{tr} \ln \begin{bmatrix} 1 & \mathbf{p}^\top \\ -\mathbf{p} & Q \end{bmatrix}    (2)

which takes O((T + 1)^3) time. This computation is more efficient than enumerating over all T^{T-1} out-trees as in the normalized posterior of Equation 1 and makes it possible to consider datasets beyond a thousand points. To scale further, a wide set of approximate methods for large matrices can be leveraged including Nystrom methods (Williams & Seeger, 2001; Drineas & Mahoney, 2005) and column sampling. These methods will be investigated in future work but were not necessary for initial experiments.

4 MAXIMUM tdid LIKELIHOOD

An interesting property of the partition function Z is that it forms a finitely exchangeable tdid or tree dependent identically distributed likelihood. The likelihood of the data is p(X) = Z T^{1-T} or, more explicitly:

p(X_1, \ldots, X_T) = \frac{1}{T^{T-1}} \sum_{r=1}^T p(X_r) \, |[\mathrm{diag}(\beta \mathbf{1}) - \beta]_r|    (3)

which degenerates into the iid likelihood when the conditional dependence between parent and child nodes is extinguished.

Theorem 1 If the conditional dependence of a child node given a parent node degenerates into the marginal, p(X_t | X_{\pi(t)}) → p(X_t), the tdid likelihood simplifies into the iid likelihood.

Proof 1 Work backwards by writing the tdid likelihood in terms of a product over nodes:

p(X_1, \ldots, X_T) = \frac{1}{T^{T-1}} \sum_{r=1}^T p(X_r) \sum_{T_r \in \Omega_r} \prod_{t \neq r} p(X_t | X_{\pi(t)}).

Removing the dependence on the parent produces:

p(X_1, \ldots, X_T) = \frac{1}{T^{T-1}} \sum_{r=1}^T \sum_{T_r \in \Omega_r} \prod_{t=1}^T p(X_t)

which then simplifies into the iid likelihood:

p(X_1, \ldots, X_T) = \frac{T^{T-2} \sum_{r=1}^T \prod_{t=1}^T p(X_t)}{T^{T-1}} = \prod_{t=1}^T p(X_t).

Thus a generalization of iid likelihood emerges by integrating over a latent out-tree sampling structure. One natural way of performing unsupervised learning is to maximize this tdid likelihood to recover, for instance, a good setting of the parameters that govern the conditional distribution of child given parent.
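For a handful of points the quantities above can be checked end to end: the cofactor sum, the single (T+1)×(T+1) determinant of Q̂ from Equation 2, and an explicit enumeration of all T^{T-1} out-trees should agree, and replacing the conditional by the marginal should reproduce the iid product as in Theorem 1. The Gaussian marginal N(0, I) and conditional N(A X_v, σ²I) below are illustrative placeholder choices, not fitted parameters.

```python
import itertools
import numpy as np

def gauss_pdf(x, mean, var):
    d = x.shape[-1]
    diff = x - mean
    return np.exp(-0.5 * np.dot(diff, diff) / var) / (2 * np.pi * var) ** (d / 2)

def build_model(X, A, var):
    """Root marginal p(X_r) = N(0, I) and conditional p(X_u | X_v) = N(A X_v, var I)."""
    T, D = X.shape
    p_root = np.array([gauss_pdf(X[r], np.zeros(D), 1.0) for r in range(T)])
    beta = np.zeros((T, T))
    for u in range(T):
        for v in range(T):
            if u != v:
                beta[u, v] = gauss_pdf(X[u], A @ X[v], var)
    return p_root, beta

def Z_cofactors(p_root, beta):
    """Z = sum_r p(X_r) det([Q]_r), deleting row r and column r of Q."""
    Q = np.diag(beta.sum(axis=1)) - beta
    Zr = [np.linalg.det(np.delete(np.delete(Q, r, 0), r, 1)) for r in range(len(p_root))]
    return float(np.dot(p_root, Zr))

def Z_single_det(p_root, beta):
    """Z = (sum_r p(X_r)) * det(Qhat), Qhat = [[1, p^T], [-p, Q]] with p normalized."""
    Q = np.diag(beta.sum(axis=1)) - beta
    rho = p_root / p_root.sum()
    Qhat = np.block([[np.ones((1, 1)), rho[None, :]], [-rho[:, None], Q]])
    return float(p_root.sum() * np.linalg.det(Qhat))

def Z_brute_force(p_root, beta):
    """Enumerate every out-tree: pick a root r and a parent for each other node,
    keeping only assignments in which every node reaches r through its parents."""
    T, total = len(p_root), 0.0
    for r in range(T):
        others = [t for t in range(T) if t != r]
        for choice in itertools.product(range(T), repeat=T - 1):
            pa, ok = dict(zip(others, choice)), True
            for t in others:
                seen, node = set(), t
                while node != r and ok:
                    if node in seen:
                        ok = False
                    seen.add(node)
                    node = pa[node]
                if not ok:
                    break
            if ok:
                w = p_root[r]
                for t in others:
                    w *= beta[t, pa[t]]
                total += w
    return total

rng = np.random.default_rng(1)
T, D = 4, 2
X = rng.standard_normal((T, D))
p_root, beta = build_model(X, A=0.7 * np.eye(D), var=1.0)
print(Z_cofactors(p_root, beta), Z_single_det(p_root, beta), Z_brute_force(p_root, beta))

# Theorem 1: with p(X_u | X_v) := p(X_u), the tdid likelihood collapses to the iid product.
beta_iid = np.tile(p_root[:, None], (1, T)) * (1 - np.eye(T))
tdid = Z_cofactors(p_root, beta_iid) / T ** (T - 1)
print(tdid, np.prod(p_root))    # the two numbers should match
```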
Equation 3 acts as a novel maximum likelihood estimator. We rewrite the tdid likelihood to make the dependence on the conditional distribution's parameters θ more explicit:

p(X_1, \ldots, X_T | \theta) = \frac{Z(\theta)}{T^{T-1}}.

This likelihood satisfies certain desiderata outlined earlier. First, it is invariant to reordering of {X_1, . . . , X_T} and therefore is finitely exchangeable. Furthermore, it is easy to verify that p(X) ≥ 0 and sums to unity when integrated over all possible X_1, . . . , X_T.

We next consider maximum likelihood unsupervised learning where we find a θ that produces a large p(X | θ). We maximize the log tdid likelihood using gradient ascent on the parameters. The gradient for any scalar parameter θ_i is given by the chain rule applied to Equation 2:

\frac{\partial \ln Z}{\partial \theta_i} = \frac{1}{\sum_{r=1}^T p(X_r)} \sum_{r=1}^T \frac{\partial p(X_r)}{\partial \theta_i} + \mathrm{tr}\Big( \hat{Q}^{-1} \frac{\partial \hat{Q}}{\partial \theta_i} \Big).

The main computational requirement for evaluating the gradient is the O((T + 1)^3) matrix inversion. However, the computation of the gradient can easily be approximated for further efficiency. Given an initial guess for θ, it is possible to follow the gradient or perform line-search. Line search is convenient since evaluating determinants is faster than matrix inversion. Note that maximum likelihood with incomplete information is being performed since we never require anything more than the {X_1, . . . , X_T} population data (there is no additional information about the tree structure).

While any exponential family distribution could be used to specify the marginal and conditional distributions, we focus on the Gaussian case. The following marginal-conditional decomposition of its parameters holds: θ = {µ_c, µ_π, Σ_{c|π}, Σ_{cc}, Σ_{ππ}}. These are two vectors in R^D and three matrices in R^{D×D}. We further assume the matrices Σ_{cc} and Σ_{ππ} are positive definite. This gives the following Gaussian probability for the root nodes:

p(X_r | \theta) = N(X_r | \mu_\pi, \Sigma_{\pi\pi}),

and the following conditional Gaussian probability of child given parent:

p(X_t | X_{\pi(t)}, \theta) = N(X_t | \Sigma_{c|\pi} X_{\pi(t)} + \mu_c, \Sigma_{cc}).

If we are given a parameter setting θ and if all X_t variables are observed, it is straightforward to apply these formulae. The resulting probabilities are inserted into the out-Laplacian which efficiently recovers the likelihood value or the gradients for unsupervised learning. Given the gradient and the likelihood evaluation, we can now readily maximize the tdid likelihood. However, since it is not concave in the exponential family case except when tdid degenerates into iid, we prefer to initialize with the iid solution. For example, in the Gaussian case, a reasonable initialization for θ is to learn the model under iid assumptions for the seed model and then perform maximum tdid likelihood thereafter.

Once we have learned a model θ from training data X, we evaluate the test likelihood on new data X̃ = {X̃_1, . . . , X̃_U} according to:

p(\tilde{X} | X, \theta) = \frac{p(X, \tilde{X} | \theta)}{p(X | \theta)} = \frac{Z_{X \tilde{X}}(\theta)}{(T + U)^{(T + U - 1)}} \cdot \frac{T^{(T - 1)}}{Z_X(\theta)}.

It is straightforward to show that this quantity still integrates to one when we integrate over X̃. This simply involves computing the partition function for the test data aggregated with the training data, Z_{X X̃}, relative to the partition function for the training data alone, Z_X. Both these quantities involve the determinant formula we outlined. Contrast the above test likelihood score to the traditional test likelihood score produced by an iid model, which simplifies due to factoring:

p_{iid}(\tilde{X} | X, \theta) = \frac{\prod_{j=1}^T p(X_j | \theta) \prod_{i=1}^U p(\tilde{X}_i | \theta)}{\prod_{j=1}^T p(X_j | \theta)} = \prod_{i=1}^U p(\tilde{X}_i | \theta).

In iid, each test point is independent of the training data and all other test points given the model parameters θ. Thus, the tdid likelihood is semiparametric since, in addition to depending on parameters θ, there is a non-parametric dependence on other training and test points. In fact, if we set the conditional Gaussians such that Σ_{c|π} = I and µ_c = 0, tdid can mimic non-parametric Parzen estimation.
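A minimal sketch of the fitting and scoring loop in Section 4, assuming the conditional N(X_t | a X_{π(t)}, σ²I) with a single scalar correlation a (a simplification of Σ_{c|π}) and a standard-normal root marginal, so that the gradient of ln Z can be approximated by a finite difference instead of the analytic trace formula above. The held-out score at the end is the ratio of the two partition functions in the test-likelihood expression.

```python
import numpy as np

def log_Z(X, a, var):
    """ln Z = ln(sum_r p(X_r)) + ln det Qhat  (Equation 2), computed with slogdet."""
    T, D = X.shape
    p_root = np.exp(-0.5 * np.sum(X * X, axis=1)) / (2 * np.pi) ** (D / 2)  # N(0, I) marginal
    diff = X[:, None, :] - a * X[None, :, :]                                # X_u - a X_v
    beta = np.exp(-0.5 * np.sum(diff * diff, axis=2) / var) / (2 * np.pi * var) ** (D / 2)
    np.fill_diagonal(beta, 0.0)
    Q = np.diag(beta.sum(axis=1)) - beta
    rho = p_root / p_root.sum()
    Qhat = np.block([[np.ones((1, 1)), rho[None, :]], [-rho[:, None], Q]])
    return np.log(p_root.sum()) + np.linalg.slogdet(Qhat)[1]

def fit_correlation(X, var=1.0, a0=0.0, step=0.05, iters=100, eps=1e-4):
    """Gradient ascent on ln Z with a finite-difference gradient; since
    ln p(X) = ln Z - (T-1) ln T, maximizing ln Z maximizes the tdid likelihood."""
    a = a0
    for _ in range(iters):
        g = (log_Z(X, a + eps, var) - log_Z(X, a - eps, var)) / (2 * eps)
        a += step * g
    return a

def test_log_likelihood(X_train, X_test, a, var=1.0):
    """ln p(Xtest | Xtrain) = ln Z_{train+test} - (T+U-1) ln(T+U) - ln Z_train + (T-1) ln T."""
    T, U = len(X_train), len(X_test)
    both = np.vstack([X_train, X_test])
    return (log_Z(both, a, var) - (T + U - 1) * np.log(T + U)
            - log_Z(X_train, a, var) + (T - 1) * np.log(T))

rng = np.random.default_rng(2)
X_train, X_test = rng.standard_normal((60, 2)), rng.standard_normal((20, 2))
a_hat = fit_correlation(X_train)
print("fitted a:", a_hat, "  held-out score:", test_log_likelihood(X_train, X_test, a_hat))
```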
5 SEMI-SUPERVISED OUT-TREES

Another application of the latent out-tree assumption is in semi-supervised learning problems (Kemp et al., 2003) where output labels are given in addition to input samples. Consider a label y_t which is generated from a mutation process over the branches of the tree T just as the attributes of the node X_t are also mutations from their parents. This mutation process defines a distribution over possible labels. For instance, y_t may indicate if an individual in a population has diabetes and X_t is a vector of anatomical features for the individual. Instead of generating a label that depends only on the input, we could also consider dependence of a child label on a parent input and a parent label. In other words, for training samples X_t and corresponding labels y_t for t = 1 . . . T, we have the following likelihood for a known out-tree T:

p(X_1, y_1, \ldots, X_T, y_T | T) = \prod_{t=1}^T p(X_t, y_t | X_{\pi(t)}, y_{\pi(t)}).    (4)

The derivations for the partition function proceed as before but now involve a directed Laplacian built from edge weights using conditionals between both inputs and outputs, i.e. β_{uv} = p(X_u, y_u | X_v, y_v). We may make more restrictive factorization assumptions on these conditional relationships; for instance, y_t might depend only on X_t. A more interesting assumption is that y_t is independent of input data altogether and is only conditionally dependent on its parent's label y_{π(t)}. In other words, the mutation conditional simplifies into p(y_t | y_{π(t)}) p(X_t | X_{π(t)}). In an iid classification setting, this radical assumption would make learning impossible since output data is independent from input data. However, in a latent out-tree setting, inputs and outputs are only conditionally independent given tree structure. When tree structure is unknown, dependence between inputs and outputs emerges without an explicit relationship between input and output spaces (parametric or otherwise). This is because X and y are sampled given the parent random variable T, i.e. X ← T → y. Therefore, observing the input induces a posterior on T which subsequently induces a posterior on labels. This is particularly useful when we cannot make explicit assumptions about the parametric relationship between the input and output spaces. In these settings, the latent out-tree may be a good source of inductive bias to couple inputs to outputs, especially if the number of labeled outputs is small.
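Under the factored mutation assumption p(y_t | y_{π(t)}) p(X_t | X_{π(t)}), the edge weights of the joint Laplacian are simply the elementwise product of an input weight matrix and a label agreement matrix. A small sketch, where the stickiness parameter eta and the Gaussian input conditional are placeholder choices:

```python
import numpy as np

def joint_weights(X, y, A, var, eta):
    """beta[u, v] = p(y_u | y_v) * p(X_u | X_v) under the factored mutation model."""
    T, D = X.shape
    diff = X[:, None, :] - X[None, :, :] @ A.T          # X_u - A X_v for every pair
    beta_x = np.exp(-0.5 * np.sum(diff * diff, axis=2) / var) / (2 * np.pi * var) ** (D / 2)
    beta_y = np.where(y[:, None] == y[None, :], eta, 1.0 - eta)   # sticky label mutation
    beta = beta_x * beta_y
    np.fill_diagonal(beta, 0.0)
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 2))
y = rng.integers(0, 2, size=8)
beta = joint_weights(X, y, A=0.8 * np.eye(2), var=1.0, eta=0.9)
print(beta.shape)
```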
To recover the settings of the unobserved y_t labels, one approach is to maximize likelihood while integrating over T. Assume we have observed the labels for the training points Y = {y_1, . . . , y_T} but not for the test points Ỹ = {ỹ_1, . . . , ỹ_U}. To predict labels, we need the conditional posterior over unobserved labels given all other observed data, p(Ỹ | X, X̃, Y, θ), as follows:

p(\tilde{Y} | X, \tilde{X}, Y, \theta) = \frac{p(X, \tilde{X}, Y, \tilde{Y} | \theta)}{\sum_{\tilde{Y}} p(X, \tilde{X}, Y, \tilde{Y} | \theta)}.

Instead of maximizing the above conditional latent likelihood, we simply maximize the joint latent likelihood (a common simplification in many learning frameworks) to recover parameters and unknown labels:

p(X, \tilde{X}, Y, \tilde{Y} | \theta) = \frac{Z(\tilde{Y}, \theta)}{(T + U)^{(T + U - 1)}}

where we have rewritten the partition function in terms of both the unknown Ỹ and the unknown θ which need to be specified to construct the out-Laplacian Q̂. We initialize both randomly and then maximize the partition function using gradient ascent for θ as in the unsupervised learning case and maximize over unknown labels Ỹ by greedily flipping individual labels to increase the partition function. This iterative hill climbing scheme produces a final set of parameters and labels. While investigating a label flip in the output, it is useful to avoid full matrix inversion and full matrix determinants since each label flip only involves a rank 1 change to the out-Laplacian matrix Q̂ and thus each step of hill climbing requires no more than quadratic time. This and intermediate caching of results allows efficient prediction of labels.
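The label-inference loop described above can be sketched as follows: hold the input conditional fixed, build the joint weights for a candidate labeling, and greedily reassign each unknown label to whichever value most increases ln Z, repeating until no single change helps. For clarity, the sketch recomputes the full determinant for every candidate rather than using the rank-one update mentioned in the text; the uniform label prior at the root, the Gaussian input conditional, and all parameter values are illustrative assumptions.

```python
import numpy as np

def log_Z_joint(X, y, A, var, eta, n_classes=2):
    """ln Z for the joint (input, label) out-tree model of Equation 4."""
    T, D = X.shape
    diff = X[:, None, :] - X[None, :, :] @ A.T
    beta = np.exp(-0.5 * np.sum(diff * diff, axis=2) / var) / (2 * np.pi * var) ** (D / 2)
    beta *= np.where(y[:, None] == y[None, :], eta, 1.0 - eta)
    np.fill_diagonal(beta, 0.0)
    # joint root marginal p(X_r) p(y_r): N(0, I) input times a uniform label prior
    p_root = np.exp(-0.5 * np.sum(X * X, axis=1)) / (2 * np.pi) ** (D / 2) / n_classes
    Q = np.diag(beta.sum(axis=1)) - beta
    rho = p_root / p_root.sum()
    Qhat = np.block([[np.ones((1, 1)), rho[None, :]], [-rho[:, None], Q]])
    return np.log(p_root.sum()) + np.linalg.slogdet(Qhat)[1]

def infer_labels(X, y_obs, unknown, A, var=1.0, eta=0.9, n_classes=2, sweeps=10):
    """Greedy hill climbing on ln Z over the unknown labels (known labels stay fixed)."""
    rng = np.random.default_rng(0)
    y = y_obs.copy()
    y[unknown] = rng.integers(0, n_classes, size=len(unknown))   # random initialization
    best = log_Z_joint(X, y, A, var, eta, n_classes)
    for _ in range(sweeps):
        improved = False
        for t in unknown:
            for c in range(n_classes):
                if c == y[t]:
                    continue
                old, y[t] = y[t], c
                score = log_Z_joint(X, y, A, var, eta, n_classes)
                if score > best:
                    best, improved = score, True
                else:
                    y[t] = old
        if not improved:
            break
    return y, best

# toy usage: two Gaussian blobs, every other label hidden
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 1, (15, 2)), rng.normal(2, 1, (15, 2))])
y_true = np.repeat([0, 1], 15)
unknown = list(range(0, 30, 2))
y_hat, score = infer_labels(X, y_true.copy(), unknown, A=np.eye(2))
print("errors on hidden labels:", int(np.sum(y_hat[unknown] != y_true[unknown])))
```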
6 VARIATIONAL BAYES

So far we have considered the latent tdid likelihood which involves agnostically integrating over all structures T. However, we have assumed that the parameters θ are given or are recovered by a point estimator such as maximum likelihood. A more thorough Bayesian approach is to consider integrating over both parameters θ and structure T after introducing a prior distribution on both. In such a setting, the nonparametric density estimator becomes reminiscent of other nonparametric Bayesian methods such as Dirichlet processes (Teh et al., 2004; Neal, 2003; Ferguson, 1973) and infinite mixture models (Rasmussen, 1999; Beal et al., 2002). The joint integration over parameters and structure recovers the evidence of the data p(X) or, equivalently, the log-evidence E = ln p(X) (Friedman & Koller, 2003). Consider first splitting the parameters θ into those adjusting the distribution over the root, θ_m, and those adjusting the distribution of child given parent, θ_c. We assume the root p(X_r | θ_m) and conditional distributions p(X_t | X_{π(t)}, θ_c) are in the exponential family and the priors on their parameters p(θ) = p_m(θ_m) p_c(θ_c) are conjugate. Integrating with a uniform structure prior p(T) = 1/T^{T-1} and making the natural assumption that the prior over structure and parameters factorizes yields:

E = \ln \int_\theta \sum_{r=1}^T \sum_{T_r \in \Omega_r} p(X_r | \theta_m) \prod_{t \neq r} p(X_t | X_{\pi(t)}, \theta_c) \, p(T) \, p(\theta)

which unfortunately is an intractable quantity. We instead consider a lower bound on the evidence. This is done by introducing variational distributions, for instance the distribution q(r) over choices for the root, and applying Jensen:

E \geq \sum_{r=1}^T q(r) \ln \int_\theta p(X_r | \theta_m) \sum_{T_r \in \Omega_r} \prod_{t \neq r} p(X_t | X_{\pi(t)}, \theta_c) \, p(\theta) + H(q) - (T - 1) \ln T.

We also introduce variational distributions over the r = 1, . . . , T out-trees each rooted at node r, which we denote by q_r(T_r), and a variational distribution over the parameters q_c(θ_c). Re-applying Jensen's inequality produces:

E \geq \sum_{r} q(r) \ln \int_{\theta_m} p(X_r | \theta_m) p(\theta_m) - (T - 1) \ln T + \sum_{r, T_r} q(r) q_r(T_r) \int_{\theta_c} q_c(\theta_c) \sum_{t \neq r} \ln p(X_t | X_{\pi(t)}, \theta_c) + H(q) + \sum_r q(r) H(q_r) - KL(q_c \| p_c).

Above, H denotes the Shannon entropy and KL denotes the Kullback-Leibler divergence. Update rules for each variational distribution iteratively maximize the lower bound by taking derivatives and setting to zero. We update the density over out-trees rooted at node r via:

q_r(T_r) = \frac{1}{Z_r} \, e^{\sum_{t \neq r} \int_{\theta_c} q_c(\theta_c) \ln p(X_t | X_{\pi(t)}, \theta_c)}.

As in Section 3, this can be rewritten as a product over edges in the out-tree T_r and summarized simply by a T × T matrix whose off-diagonal entries are given by \int_{\theta_c} q_c(\theta_c) \ln p(X_u | X_v, \theta_c). For exponential family p(X_u | X_v, θ_c), such integrals are easy to solve (Box & Tiao, 1992). Each Z_r is also straightforward to recover using Tutte's theorem. The update for the q(r) distribution is:

q(r) \propto \Big( \int_{\theta_m} p(X_r | \theta_m) p(\theta_m) \Big) \, e^{H(q_r)} \times e^{\sum_{T_r} q_r(T_r) \sum_{t \neq r} \int_{\theta_c} q_c(\theta_c) \ln p(X_t | X_{\pi(t)}, \theta_c)}

where the entropy H(q_r) and the expectation over q_r(T_r) are efficient to compute from the weight matrix (Meila & Jaakkola, 2006). Furthermore, the integrals \int_{\theta_m} p(X_r | \theta_m) p(\theta_m) are known for exponential families. We update the distribution over parameters via:

q_c(\theta_c) \propto p_c(\theta_c) \, e^{\sum_{r, T_r} q(r) q_r(T_r) \sum_{t \neq r} \ln p(X_t | X_{\pi(t)}, \theta_c)} = p_c(\theta_c) \prod_{u \neq v} p(X_u | X_v, \theta_c)^{W_{uv}}, \quad W_{uv} = \sum_{r, T_r} q(r) q_r(T_r) \, \delta\big((v \to u) \in T_r\big).

This is simply the prior times a product over all pairs of data-points' likelihoods with different weights W_uv. These weights are recovered easily from the current weight matrix. A variational Bayesian treatment is thus possible over joint out-tree structure and parameters. The method allows us to refine a lower bound on evidence and permits full (nonparametric) Bayesian estimation with out-trees.

7 STATIONARY MUTATION

It is helpful to distinguish some important differences between the Bayesian averaging over out-trees here and the Bayesian inference of tree belief networks presented in (Jaakkola et al., 1999; Meila & Jaakkola, 2006). While elegantly providing a tractable computation of the Bayesian inference, (Meila & Jaakkola, 2006) makes no requirement on the stationarity of the conditional distribution p(X_t | X_{π(t)}, θ_c), which is a key distinguishing component of the out-tree framework. In other words, in (Jaakkola et al., 1999; Meila & Jaakkola, 2006) each conditional has its own parameter θ_t and the samples are drawn from custom conditionals p(X_t | X_{π(t)}, θ_t). If one assumes decomposable priors, the Bayesian evidence and Bayesian inference jointly over parameter and structure is elegantly tractable. However, it explores distinct θ_t for all conditionals. This article introduces the constraint that θ_t = θ_{t'} = θ_c for all t = 1, . . . , T except for the root node. This constraint greatly restricts the model and assumes that the mutation distribution is stationary across all samples. In other words, the parameters of the conditional are fixed. This is a key difference and permits us to recover the iid setting as a special case when the conditional dependence is extinguished. The unrestricted Bayesian inference method in (Meila & Jaakkola, 2006) can be seen as a step in the variational Bayesian procedure since the update rule for the distribution q_c(θ_c) collapses the individual conditionals into a single θ_c model.

8 HILBERT GAUSSIANS

In addition to the Gaussian, many exponential family choices for the marginal distribution over root attributes p(X_r | θ_m) and for the conditional p(X_u | X_v, θ_c) are possible and computationally convenient. One useful feature of the Gaussian is that it is readily converted into conditional form and leads to a flexible linear relationship between parent and child which is determined primarily by the variable Σ_{c|π}. To go beyond this linear relationship, we may use a mapping φ(·) on the features or attributes of the parent node, which in no way changes the normalization properties of the Gaussian. Thus, we may consider first mapping the parent's features into a higher dimensional vector representation φ(X_{π(t)}) of dimensionality R^H and then learning a matrix Σ_{c|π} of size D × H. This is a generalized linear conditional model that can capture more complex relationships between the parent and child nodes. In such a setting, the conditional Gaussian relationship need not be represented explicitly in the space of φ(X_{π(t)}) but only implicitly in kernelized form over each dimension of the input space:

p(X_t | X_{\pi(t)}) = \prod_{d=1}^D N\Big( X_t(d) \,\Big|\, \sum_{\tau=1}^T \omega_{\tau, d} \, k(X_{\pi(t)}, X_\tau) + \mu_d, \; \sigma_d \Big)

where the unconditional means µ_d, variances σ_d and weights ω_{τ,d} are scalars for τ = 1, . . . , T and d = 1, . . . , D, the latter of which indexes the dimensions of the attributes. Furthermore, the function k(·, ·) can be any kernel that maps a pair of inputs in the sample space into a scalar measurement of affinity. This gives a general way of extending the linearity assumptions in the conditional model. Instead of making the conditional dependence linear in the values of the parent attributes, we can explore linearity in any features of the parent attributes, which leads to another source of nonparametric flexibility in the estimator.
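The kernelized conditional above is essentially a per-dimension generalized linear model on kernel features of the parent. A brief sketch of evaluating it with an RBF kernel; the kernel choice, bandwidth, and weight values are placeholders, and in practice ω, µ and σ would be fit by the same maximum tdid likelihood machinery as before.

```python
import numpy as np

def rbf_kernel(x, Xbasis, gamma=1.0):
    """k(x, X_tau) for all basis points X_tau."""
    return np.exp(-gamma * np.sum((Xbasis - x) ** 2, axis=1))

def hilbert_conditional(x_child, x_parent, Xbasis, omega, mu, sigma2, gamma=1.0):
    """p(X_t | X_parent) = prod_d N( X_t(d) | sum_tau omega[tau, d] k(X_parent, X_tau) + mu[d], sigma2[d] )."""
    k = rbf_kernel(x_parent, Xbasis, gamma)      # length-T vector of affinities
    mean = k @ omega + mu                        # one conditional mean per dimension d
    resid = x_child - mean
    return np.prod(np.exp(-0.5 * resid ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2))

rng = np.random.default_rng(5)
T, D = 20, 3
Xbasis = rng.standard_normal((T, D))             # the training points act as the basis
omega = 0.1 * rng.standard_normal((T, D))
mu, sigma2 = np.zeros(D), np.ones(D)
print("p(X_1 | X_0) =", hilbert_conditional(Xbasis[1], Xbasis[0], Xbasis, omega, mu, sigma2))
```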
9 EXPERIMENTS

To visualize the out-tree model's ability to fit data, we estimated marginal and conditional Gaussian parameters using the latent likelihood for the UCI Spiral dataset in Figure 2(a). Once the θ = {µ_c, µ_π, Σ_{c|π}, Σ_{cc}, Σ_{ππ}} parameters were recovered (a total of 27 scalar degrees of freedom), they were used to generate synthetic datasets of 600 samples in Figures 2(b), (c) and (d). In the last example, the matrix Σ_{cc} was reduced to sample a spiral with less noise. Notice how the datasets can produce slightly different spirals that may have more or fewer turns but still maintain the appropriate overall shape.

[Figure 2: Spiral data density estimation. (a) Original Spiral data; (b) new samples from θ; (c) resampling; (d) resampling with small Σ_{cc}.]

We next show more quantitative unsupervised density estimation experiments on standard UCI datasets where a large test log-likelihood implies a better density estimate. These experiments closely follow the format in (Jebara et al., 2007). Table 1 summarizes the results with various Gaussian models including the marginal-conditional Gaussian model for the tdid out-tree approach. On 4 standard datasets (we only use one class for labeled datasets), the test log-likelihood was evaluated after using a variety of density estimators. These estimators include a nonparametric Parzen RBF estimator with a varying scale parameter σ. In addition, mixtures of 1 to 5 Gaussians were fit using Expectation Maximization to maximize iid likelihood. Comparisons are also shown with semiparametric density estimators like the infinite mixture of Gaussians (Rasmussen, 1999).

Table 1: Gaussian test log-likelihoods using RBF Parzen estimators, EM mixtures of Gaussians, the infinite Gaussian mixture model, and the tdid estimator.

Dataset (D, T)       Parzen    GMM-1     GMM-2     GMM-3     GMM-4     GMM-5     GMM-∞     tdid
Spiral (3, 534)      -5.61e3   -1.36e3   -1.36e3   -1.19e3   -7.98e2   -6.48e2   -4.86e2   -3.91e2
Heart (13, 139)      -1.94e3   -2.02e4   -3.23e4   -2.50e4   -1.68e4   -3.15e4   -4.02e2   -5.29e2
Diabetes (8, 268)    -6.25e3   -2.12e5   -2.85e5   -4.48e5   -2.03e5   -3.40e5   -8.22e2   -8.87e2
Liver (6, 200)       -3.41e3   -2.53e4   -1.88e4   -2.79e4   -2.62e4   -3.23e4   -4.56e2   -4.99e2
Cross-validation was used to choose σ and the EM local minimum (from ten initializations) for the Parzen and EM approaches respectively. Similarly, cross-validation was used to early-stop the Bayesian out-tree maximum likelihood gradient ascent procedure, although this did not have a large effect on performance. Train, cross-validation and test split sizes were 80%, 10% and 10% respectively. The 10-fold averaged test log-likelihoods show that the new method outperforms traditional mixture modeling and Parzen estimators and is comparable to semiparametric methods such as the infinite Gaussian mixture (GMM-∞) model (Rasmussen, 1999). Despite the cubic time linear algebra steps for tdid estimation, the infinite Gaussian mixture model was the most computationally demanding method.

In a semi-supervised learning problem, we evaluated how well the latent out-tree structure works for classification and its ability to perform inductive transfer. First, θ is learned from only input samples in an unsupervised manner and the Gaussian parameters Σ_{c|π} and Σ_{cc} = σ²I are recovered as above by maximizing the partition function. Then, observed labels are used in the following conditionals to construct the out-Laplacian:

p(X_u | X_v) = N(X_u | \Sigma_{c|\pi} X_v, \sigma^2 I)
p(y_u | y_v) = \eta \, I(y_u = y_v) + (1 - \eta) \, I(y_u \neq y_v).

Here, the mutation on the outputs is independent of the mutation on the inputs and is simply built from indicator functions with a parameter η which controls the stickiness of the label across parent to child. Since only some labels are known, the unknown ones are initialized randomly and improved by maximizing Z. This is done with Σ_{c|π} and Σ_{cc} locked from their unsupervised values. The unknown discrete labels are greedily explored to further increase the partition function. For comparison, support vector machines were trained on the labeled X and y data.

To evaluate the semi-supervised learning method, two taxonomic datasets (Salamanders and Crustaceans) were used as in (Kemp et al., 2003). These are groups of T = 30 and T = 56 species nodes and D = 19 and D = 74 attribute dimensions respectively. Each has a number of discrete attributes describing the external anatomy of each species. These datasets do not have class labels. Therefore each attribute was in turn used as a label to be predicted from the input data. Each dataset therefore generates D discrete prediction tasks. To reduce dimensionality and avoid redundancies (some attributes have the same settings as the target predictions), the remaining D - 1 attributes were converted into 3D coordinates using PCA before being used as inputs (both the SVM and the out-tree method are similarly hindered by the resulting loss of information). The input attributes for each problem are a set of 3D vectors X_1, . . . , X_T and the targets are the original discrete-valued y_1, . . . , y_T labels.

Each task was split into training and testing components and the out-tree model was fit and used to find labels. Results were compared with an SVM baseline classifier using different kernel functions. In all experiments, for a given number of labeled examples, half the unlabeled examples were used for cross-validation of η for the out-tree model and C for the SVM. The remaining half of the unseen labels were used for testing. Figure 3 shows the average error rate on random folds for all tasks as the number of training examples is varied for the salamander species taxonomy. Similarly, Figure 4 shows the crustaceans taxonomy. The out-tree has a statistically significant advantage over both linear and RBF SVMs when classifying data that obeys a directed tree structure.

[Figure 3: Labeling error rates (averaged over tasks) for Out-Trees and SVMs for the salamanders taxonomy. Error Rate versus Number of Training Examples; legend: Tree, SVM linear, SVM σ = 1, SVM σ = 2, SVM σ = 4.]

[Figure 4: Labeling error rates (averaged over tasks) for Out-Trees and SVMs for the crustaceans taxonomy. Error Rate versus Number of Training Examples; legend: Tree, SVM linear, SVM σ = 1, SVM σ = 2, SVM σ = 4.]

10 DISCUSSION

This article described a Bayesian treatment of a latent directed out-tree connectivity on non-iid data-points. This led to a generative model appropriate for taxonomy and tree data as well as an interesting semiparametric density estimator for datasets in general. The matrix tree theorem was extended to directed trees and enjoys the same efficient Bayesian inference properties as its undirected counterpart. A novel tdid likelihood emerges which permits the recovery of both a marginal density on nodes as well as the conditional of each node given its latent parent. The new likelihood is exchangeable and is a direct generalization of iid likelihood. It degenerates into iid when conditional dependencies between children and parents collapse. A variational Bayesian treatment is also possible by integrating over both parameters and out-tree structures jointly. Experiments with unsupervised and semi-supervised learning were promising.

Acknowledgments

The author thanks A. Howard, C. Kemp, P. Long, J. Tenenbaum and the anonymous reviewers for their comments and for providing data. This work was supported by ONR Award N000140710507 (Mod No: 07PR04918-00) and NSF Award IIS-0347499.
References

Beal, M., Ghahramani, Z., & Rasmussen, C. (2002). The infinite hidden Markov model. NIPS.

Box, G., & Tiao, G. (1992). Bayesian inference in statistical analysis. John Wiley & Sons.

Carreira-Perpinan, M., & Zemel, R. (2004). Proximity graphs for clustering and manifold learning. NIPS.

Drineas, P., & Mahoney, M. (2005). On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6, 2153-2175.

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1, 209-230.

Friedman, N. (1998). The Bayesian structural EM algorithm. UAI.

Friedman, N., & Koller, D. (2003). Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 2, 95-125.

Helmbold, D., & Schapire, R. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning, 27, 56-68.

Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination. NIPS.

Jakobson, D., & Rivin, I. (2002). Extremal metrics on graphs. Forum Math, 14.

Jebara, T., & Long, P. (2005). Tree dependent identically distributed learning (Technical Report CUCS-040-05). Columbia University, Computer Science.

Jebara, T., Song, Y., & Thadani, K. (2007). Density estimation under independent similarly distributed sampling assumptions. NIPS.

Kaltofen, E., & Villard, G. (2004). Computing the sign or the value of the determinant of an integer matrix, a complexity survey. J. Comp. Applied Math, 162, 133-146.

Kemp, C., Griffiths, T., Stromsten, S., & Tenenbaum, J. (2003). Semi-supervised learning with trees. NIPS.

Koo, T., Globerson, A., Carreras, X., & Collins, M. (2007). Structured prediction models via the matrix-tree theorem. EMNLP.

Leitner, T., Escanilla, D., Franzen, C., Uhlen, M., & Albert, J. (1996). Accurate reconstruction of a known HIV-1 transmission history by phylogenetic tree analysis. Proceedings of the National Academy of Sciences, 93, 10864-10869.

Mau, B., Newton, M., & Larget, B. (1999). Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics, 55, 1-12.

Meila, M., & Jaakkola, T. (2006). Tractable Bayesian learning of tree belief networks. Statistics and Computing, 16, 77-92.

Neal, R. (2003). Density modeling and clustering using Dirichlet diffusion trees. Bayesian Statistics, 7, 619-629.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann.

Rasmussen, C. (1999). The infinite Gaussian mixture model. NIPS.

Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290.

Teh, Y., Jordan, M., Beal, M., & Blei, D. (2004). Hierarchical Dirichlet processes. NIPS.

West, D. (1996). Introduction to graph theory. Prentice Hall.

Willems, F., Shtarkov, Y., & Tjalkens, T. (1995). The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41, 653-664.

Williams, C., & Seeger, M. (2001). Using the Nystrom method to speed up kernel machines. NIPS.