Hidden Common Cause Relations in Relational Learning

Ricardo Silva Gatsby Computational Neuroscience Unit UCL, London, UK WC1N 3AR rbas@gatsby.ucl.ac.uk

Wei Chu Center for Computational Learning Systems Columbia University, New York, NY 10115 chuwei@cs.columbia.edu

Zoubin Ghahramani Department of Engineering University of Cambridge, UK CB2 1PZ zoubin@eng.cam.ac.uk

Abstract
When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies.

1 Contribution
Prediction problems, such as classification, can be easier when class labels share a relational dependency that is not accounted for by the input features [10]. If the variables to be predicted are attributes of objects in a relational database, such dependencies are often postulated from the relations that exist in the database. This paper proposes and evaluates a new method for building classifiers that uses information concerning the relational structure of the problem.

Consider the following standard example, adapted from [3]. There are different webpages, each one labeled according to some class (e.g., "student page" or "not a student page"). Features such as the word distribution within the body of each page can be used to predict each webpage's class. However, webpages do not exist in isolation: there are links connecting them. Two pages having a common set of links is evidence for similarity between such pages. For instance, if W1 and W3 both link to W2, this is commonly considered to be evidence for W1 and W3 having the same class. One way of expressing this dependency is through the following Markov network [5]:


Now at the Statistical Laboratory, University of Cambridge. E-mail: silva@statslab.cam.ac.uk


[Figure: Markov network over feature vertices F1, F2, F3 and label vertices C1, C2, C3.]

Here Fi are the features of page Wi, and Ci is its respective page label. Other edges linking F variables to C variables (e.g., F1 - C2) can be added without affecting the main arguments presented in this section. The semantics of the graph, for a fixed input feature set {F1, F2, F3}, are as follows: C1 is marginally dependent on C3, but conditionally independent given C2. Depending on the domain, this might be either a suitable or unsuitable representation of relations. For instance, in some domains it could be the case that the most sensible model would state that C1 is only informative about C3 once we know what C2 is: that is, C1 and C3 are marginally independent, but dependent given C2. This can happen if the existence of a relation (Ci, Cj) corresponds to the existence of hidden common causes generating this pair of random variables.

Consider the following example, loosely based on a problem described by [12]. We have three objects, Microsoft (M), Sony (S) and Philips (P). The task is a regression task where we want to predict the stock market price of each company given its profitability from last year. The given relationships are that M and S are direct competitors (due to the videogame console market), as well as S and P (due to the TV set market).
[Figure 1 graphic omitted; vertices: M.Profit, S.Profit, P.Profit, M.Stock, S.Stock, P.Stock.]

Figure 1: (a) Assumptions that relate Microsoft, Sony and Philips stock prices through hidden common cause mechanisms, depicted as unlabeled gray vertices; (b) A graphical representation for generic hidden common cause relationships using bi-directed edges; (c) A depiction of the same relationship skeleton by a Markov network model, which has different probabilistic semantics.

It is expected that several market factors that affect stock prices are unaccounted for by the predictor variable Past Year Profit. For example, a shortage of Microsoft consoles is a hidden common factor for both Microsoft's and Sony's stock. Another hidden common cause would be a high price for Sony's consoles. Assume here that these factors have no effect on Philips' stock value. A depiction of several hidden common causes that correspond to the relations Competitor(M, S) and Competitor(S, P) is given in Figure 1(a) as unlabeled gray vertices.

Consider a linear regression model for this setup. We assume that for each object $O_i \in \{M, S, P\}$, the stock price $O_i.\mathit{Stock}$, centered at the mean, is given by

$$O_i.\mathit{Stock} = \beta \times O_i.\mathit{Profit} + \epsilon_i \qquad (1)$$

where each $\epsilon_i$ is a Gaussian random variable. The fact that there are several hidden common causes between M and S can be modeled by the covariance of $\epsilon_m$ and $\epsilon_s$, $\sigma_{ms}$. That is, unlike in standard directed Gaussian models, $\sigma_{ms}$ is allowed to be non-zero. The same holds for $\sigma_{sp}$. Covariances of error terms of unrelated objects should be zero ($\sigma_{mp} = 0$). This setup is very closely related to the classic seemingly unrelated regression model popular in economics [12].

A graphical representation for this type of model is the directed mixed graph (DMG) [9, 11], with bi-directed edges representing the relationship of having hidden common causes between a pair of vertices. This is shown in Figure 1(b). Contrast this to the Markov network representation in Figure 1(c). The undirected representation encodes that $\epsilon_m$ and $\epsilon_p$ are marginally dependent, which does not correspond to our assumptions (footnote 1). Moreover, the model in Figure 1(b) states that once we observe Sony's stock price, Philips' stock (and profit) should have a non-zero association with Microsoft's profit: this follows from an extension of d-separation to DMGs [9]. This is expected from the assumptions (Philips' stock should tell us something about Microsoft's once we know Sony's, but not before it), but does not hold in the graphical model in Figure 1(c). While it is tempting to use Markov networks to represent relational models (free of concerns raised by cyclic directed representations), it is clear that there are problems for which they are not a sensible choice.

This is not to say that Markov networks are not the best representation for large classes of relational problems. Conditional random fields [4] are well-motivated Markov network models for sequence learning. The temporal relationship is closed under marginalization: if we do not measure some steps in the sequence, we will still link the corresponding remaining vertices accordingly, as illustrated in Figure 2. Directed mixed graphs are not a good representation for this sequence structure.
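Returning to the stock example of Figure 1, the contrast between panels (b) and (c) can be checked numerically for Gaussian error terms. The sketch below is illustrative only: the value 0.5 and the names `cov_dmg`, `prec_markov` and `conditional_cov_mp_given_s` are assumptions, not taken from the paper. It builds a DMG-style covariance matrix with a zero (M, P) entry and a Markov-network-style precision matrix with a zero (M, P) entry, and prints the marginal and conditional (given S) covariance between the M and P error terms in each case.

```python
import numpy as np

# Variable order: M, S, P (error terms of the three stock prices).
# Hypothetical value, chosen only so that both matrices are positive definite.
rho = 0.5

# Bi-directed (DMG) semantics: the missing edge M <-> P is a zero
# in the covariance matrix.
cov_dmg = np.array([[1.0, rho, 0.0],
                    [rho, 1.0, rho],
                    [0.0, rho, 1.0]])

# Markov network semantics: the missing edge M - P is a zero
# in the inverse covariance (precision) matrix.
prec_markov = np.array([[1.0, -rho, 0.0],
                        [-rho, 1.0, -rho],
                        [0.0, -rho, 1.0]])
cov_markov = np.linalg.inv(prec_markov)


def conditional_cov_mp_given_s(cov):
    """Covariance of (M, P) after conditioning on S, via the Schur complement."""
    return cov[0, 2] - cov[0, 1] * cov[1, 2] / cov[1, 1]


for name, cov in [("DMG", cov_dmg), ("Markov", cov_markov)]:
    print(name,
          "marginal cov(M, P) = %.3f," % cov[0, 2],
          "cov(M, P | S) = %.3f" % conditional_cov_mp_given_s(cov))
```

Under these assumed numbers, the DMG-style model has M and P uncorrelated marginally but correlated once S is given, while the Markov-style model shows the reverse pattern.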
[Figure 2 graphic omitted; vertices: X1, ..., X5 and Y1, ..., Y5.]

Figure 2: (a) A conditional random field (CRF) graph for sequence data; (b) A hypothetical scenario where two of the time slices are not measured, as indicated by dashed boxes; (c) The resulting CRF graph for the remaining variables, which corresponds to the same criteria for construction of (a).

To summarize, the decision between using a Markov network or a DMG reduces to the following modeling issue: if two unlinked object labels $y_i$, $y_j$ are statistically associated when some chain of relationships exists between $y_i$ and $y_j$, then the Markov network semantics should apply (as in the case for temporal relationships). However, if the association arises only given the values of the other objects in the chain, then this is accounted for by the dependence semantics of the directed mixed graph representation.

The DMG representation propagates training data information through other training points. The Markov network representation propagates training data information through test points. Propagation through training points is relevant in real problems. For instance, in a webpage domain where each webpage has links to pages of several kinds (e.g., [3]), a chain of intermediate points between two class labels $y_i$ and $y_j$ is likely to be more informative if we know the values of the labels in this chain. The respective Markov network would ignore all training points in this chain besides the endpoints.

In this paper, we introduce a non-parametric classification model for relational data that factorizes according to a directed mixed graph. Sections 2 and 3 describe the model and contrast it to a closely related approach which bears a strong analogy to the Markov network formulation. Experiments in text classification are described in Section 4.

2 Model
Chu et al. [2] describe an approach for Gaussian process classification using relational information, which we review and compare to our proposed model.

Previous approach: relational Gaussian processes through indicators - For each point $x$ in the input space $\mathcal{X}$, there is a corresponding function value $f_x$. Given observed input points $x_1, x_2, \ldots, x_n$, a Gaussian process prior over $f = [f_1, f_2, \ldots, f_n]^\top$ has the shape

$$P(f) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} f^\top \Sigma^{-1} f\right) \qquad (2)$$

1 For Gaussian models, the absence of an edge in the undirected representation (i.e., Gaussian Markov random fields) corresponds to a zero entry in the inverse covariance matrix, whereas in the DMG it corresponds to a zero in the covariance matrix [9].


[Figure 3 graphic omitted; vertices: X1, X2, X3, f1, f2, f3 and Y1, Y2, Y3, with indicator and error-term vertices.]

Figure 3: (a) A prediction problem where $y_3$ is unknown and the training set is composed of the other two datapoints. Dependencies between $f_1$, $f_2$ and $f_3$ are given by a Gaussian process prior and not represented in the picture. Indicators $\xi_{ij}$ are known and set to 1; (b) The extra associations that arise by conditioning on $\xi = 1$ can be factorized as the Markov network model here depicted, in the spirit of [9]; (c) Our proposed model, which ties the error terms and has origins in known statistical models such as seemingly unrelated regression and structural equation models [11].

In Equation (2), the $ij$-th entry of $\Sigma$ is given by a Mercer kernel function $K(x_i, x_j)$ [8]. The idea is to start from a standard Gaussian process prior, and add relational information by conditioning on relational indicators. Let $\xi_{ij}$ be an indicator that assumes different values, e.g., 1 or 0. The indicator values are observed for each pair of data points $(x_i, x_j)$: they are an encoding of the given relational structure. A model for $P(\xi_{ij} = 1 \mid f_i, f_j)$ is defined. This evidence is incorporated into the Gaussian process by conditioning on all indicators $\xi_{ij}$ that are positive. Essentially, the idea boils down to using $P(f \mid \xi = 1)$ as the prior for a Gaussian process classifier.

Figure 3(a) illustrates a problem with datapoints $\{(x_1, y_1), (x_2, y_2), (x_3, y_3)\}$. Gray vertices represent unobserved variables. Each $y_i$ is a binary random variable, with conditional probability given by

$$P(y_i = 1 \mid f_i) = \Phi(f_i / \sigma) \qquad (3)$$

where $\Phi(\cdot)$ is the standard normal cumulative function and $\sigma$ is a hyperparameter. This can be interpreted as the cumulative distribution of $f_i + \epsilon_i$, where $f_i$ is given and $\epsilon_i$ is a normal random variable with zero mean and variance $\sigma^2$.

In the example of Figure 3(a), one has two relations: $(x_1, x_2)$, $(x_2, x_3)$. This information is incorporated by conditioning on the evidence $(\xi_{12} = 1, \xi_{23} = 1)$. Observed points $(x_1, y_1)$, $(x_2, y_2)$ form the training set. The prediction task is to estimate $y_3$. Notice that $\xi_{12}$ is not used to predict $y_3$: the Markov blanket for $f_3$ includes $(f_1, f_2, \xi_{23}, y_3, \epsilon_3)$ and the input features. Essentially, conditioning on $\xi = 1$ corresponds to a pairwise Markov network structure, as depicted in Figure 3(b) [9] (footnote 2).

Our approach: mixed graph relational model - Figure 3(c) illustrates our proposed setup. For reasons that will become clear in the sequel, we parameterize the conditional probability of $y_i$ as

$$P(y_i = 1 \mid g_i, v_i) = \Phi(g_i / \sqrt{v_i}) \qquad (4)$$

where $g_i = f_i + \zeta_i$. As before, Equation (4) can be interpreted as the cumulative distribution of $g_i + \epsilon^*_i$, with $\epsilon^*_i$ as a normal random variable with zero mean and variance $v_i = \sigma^2 - \sigma^2_{\zeta_i}$, the last term being the variance of $\zeta_i$. That is, we break the original error term as $\epsilon_i = \epsilon^*_i + \zeta_i$, where $\epsilon^*_i$ and $\epsilon^*_j$ are independent for all $i \neq j$. Random vector $\zeta$ is a multivariate normal with zero mean and covariance matrix $\Sigma_\zeta$. The key aspect in our model is that the covariance of $\zeta_i$ and $\zeta_j$ is non-zero only if objects $i$ and $j$ are related (that is, the bi-directed edge $y_i \leftrightarrow y_j$ is in the relational graph).

Parameterizing $\Sigma_\zeta$ for relational problems is non-trivial and discussed in the next section. In the example of Figure 3, one noticeable difference between our model 3(c) and the standard Markov network model 3(b) is that now the Markov blanket for $f_3$ includes error terms for all variables (both $\epsilon$ and $\zeta$ terms), following the motivation presented in Section 1.
2 In the figure, we are not representing explicitly that $f_1$, $f_2$ and $f_3$ are not independent (the prior covariance matrix $\Sigma$ is complete). The figure is meant as a representation of the extra associations that arise when conditioning on $\xi = 1$, and the way such associations factorize.


As before, the prior for $f$ in our setup is the Gaussian process prior (2). This means that $g$ has the following Gaussian process prior (implicitly conditioned on $x$):

$$P(g) = \frac{1}{(2\pi)^{n/2} |R|^{1/2}} \exp\left(-\frac{1}{2} g^\top R^{-1} g\right) \qquad (5)$$

where $R = K + \Sigma_\zeta$ is the covariance matrix of $g = f + \zeta$, with $K_{ij} = K(x_i, x_j)$.
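As a consistency check on the error decomposition of Section 2, the short derivation below integrates $\zeta_i$ out of Equation (4) for a single data point considered in isolation. It is a sketch that uses only the definitions $g_i = f_i + \zeta_i$ and $v_i = \sigma^2 - \sigma^2_{\zeta_i}$ together with the standard fact that the difference of two independent zero-mean normal variables is normal with the summed variance; it is not part of the original derivation.

```latex
\begin{align*}
P(y_i = 1 \mid f_i)
  &= \int \Phi\!\left(\frac{f_i + \zeta_i}{\sqrt{v_i}}\right)
     \mathcal{N}(\zeta_i \mid 0, \sigma^2_{\zeta_i}) \, d\zeta_i \\
  &= P(\epsilon^*_i - \zeta_i \le f_i),
     \qquad \epsilon^*_i - \zeta_i \sim \mathcal{N}(0,\, v_i + \sigma^2_{\zeta_i}) \\
  &= \Phi\!\left(\frac{f_i}{\sqrt{v_i + \sigma^2_{\zeta_i}}}\right)
   = \Phi\!\left(\frac{f_i}{\sigma}\right).
\end{align*}
```

So each label, taken on its own, keeps the likelihood of Equation (3); what changes is the joint distribution, through the off-diagonal entries of $\Sigma_\zeta$.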

3 Parametrizing a mixed graph model for relational classification
For simplicity, in this paper we will consider only relationships that induce positive associations between labels. Ideally, the parameterization of $\Sigma_\zeta$ has to fulfill two desiderata: (i) it should respect the marginal independence constraints as encoded by the graphical model (i.e., zero covariance for vertices that are not adjacent), and be positive definite; (ii) it has to be parsimonious in order to facilitate hyperparameter selection, both computationally and statistically. Unlike the multivariate analysis problems in [11], the size of our covariance matrix grows with the number of data points. As shown by [11], exact inference in models with covariance matrices with zero-entry constraints is computationally demanding. We provide two alternative parameterizations that are not as flexible, but which lead to covariance matrices that are simple to compute and easy to implement. We will work under the transductive scenario, where training and all test points are given in advance. The corresponding graph thus contains unobserved and observed label nodes.

3.1 Method I
The first method is an automated method to relax some of the independence constraints, while guaranteeing positive-definiteness, with a parameterization that depends on a single scalar $\gamma$. This allows for more efficient inference and is done as follows:

1. Let G be the corresponding bi-directed subgraph of our original mixed graph, and let $U_0$ be a matrix with n × n entries, n being the number of nodes in G;
2. Set $[U_0]_{ij}$ to be the number of cliques in G where $y_i$ and $y_j$ appear together;
3. Set $[U_0]_{ii}$ to be the number of cliques containing $y_i$, plus a small constant $\Delta$;
4. Set U to be the corresponding correlation matrix obtained by interpreting $U_0$ as a covariance matrix and rescaling it.

Finally, set $\Sigma_\zeta = \gamma U$, where $\gamma \in [0, 1]$ is a given hyperparameter.

Matrix U is always guaranteed to be positive definite: it is equivalent to obtaining the covariance matrix of y from a linear latent variable model, where there is an independent standard Gaussian latent variable as a common parent to every clique, and every observed node $y_i$ is given by the sum of its parents plus an independent error term of variance $\Delta$. Marginal independencies are respected, since independent random variables will never be in the same clique in G. In practice, this method cannot be used as is, since the number of cliques will in general grow at an exponential rate as a function of n. Instead, we first triangulate the graph: in this case, extracting cliques can be done in polynomial time. This is a relaxation of the original goal, since some of the original marginal independence constraints will not be enforced due to the triangulation (footnote 3).

3 The need for an approximation is not a shortcoming only of the DMG approach. Notice that the relational Gaussian process of [2] also requires an approximation of its relational kernel.

3.2 Method II
The method suggested in the previous section is appealing under the assumption that vertices that appear in many common cliques are more likely to have more hidden common causes, and hence should have stronger associations. However, sometimes the triangulation introduces bad artifacts, with many marginal independence constraints being violated. This will often result in poor prediction performance. A cheap alternative approach is not generating cliques, and instead


getting a marginal covariance matrix from a different latent variable model. In this model, we create an independent standard Gaussian variable for each edge $y_i \leftrightarrow y_j$ instead of each clique. No triangulation will be necessary, and all marginal independence constraints will be respected. This, however, has shortcomings of its own: for all pairs $(y_i, y_j)$ connected by an edge, it will be the case that $[U_0]_{ij} = 1$, while $[U_0]_{ii}$ can be as large as n. This means that the resulting correlation in $U_{ij}$ can be close to zero even if $y_i$ and $y_j$ are always in the same cliques. In Section 4, we will choose between Methods I and II according to the marginal likelihood of the model.

[Figure 4 graphic omitted; three matrix plots with axis ticks from 10 to 100.]

Figure 4: (a) The link matrix for the political books dataset. (b) The relational kernel matrix obtained with the approximated Method I. (c) The kernel matrix obtained with Method II, which tends to produce much weaker associations but does not introduce spurious relations.

3.3 Algorithm
Recall that our model is a Gaussian process classifier with error terms $\epsilon_i$ of variance $\sigma^2$ such that $\epsilon_i = \epsilon^*_i + \zeta_i$. Without loss of generality, we will assume that $\sigma = 1$. This results in the following parameterization of the full error covariance matrix:

$$\Sigma_\epsilon = (1 - \gamma) I + \gamma U \qquad (6)$$

where I is an n × n identity matrix. Matrix $(1 - \gamma) I$ corresponds to the covariance matrix $\Sigma_{\epsilon^*}$.

The usefulness of separating $\epsilon$ as $\epsilon^*$ and $\zeta$ becomes evident when we use an expectation-propagation (EP) algorithm [7] to perform inference in our relational classifier. Instead of approximating the posterior of $f$, we approximate the posterior density $P(g \mid D)$, $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ being the given training data. The approximate posterior has the form $Q(g) \propto P(g) \prod_i \tilde{t}_i(g_i)$, where $P(g)$ is the Gaussian process prior with kernel matrix $R = K + \Sigma_\zeta$ as defined in the previous section. Since the covariance matrix $\Sigma_{\epsilon^*}$ is diagonal, the true likelihood of $y$ given $g$ factorizes over each datapoint: $P(y \mid g) = \prod_{i=1}^{n} P(y_i \mid g_i)$, and standard EP algorithms for Gaussian process classification can be used [8] (with the variance given by $\Sigma_{\epsilon^*}$ instead of $\Sigma_\epsilon$, and kernel matrix $R$ instead of $K$).

The final algorithm defines a whole new class of relational models, depends on a single hyperparameter $\gamma$ which can be optimized by grid search in [0, 1], and requires virtually no modification of code written for EP-based Gaussian process classifiers (footnote 4).
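The constructions of Section 3 can be sketched in a few lines. The code below is a minimal illustration, not the authors' MATLAB/Octave implementation: the function names, the toy three-node chain, and the feature matrix `x` are assumptions made for the example. It builds $U_0$ from a list of vertex groups (cliques of the triangulated graph for Method I, single edges for Method II), rescales it to the correlation matrix U, and forms $\Sigma_\zeta$, $\Sigma_\epsilon$ and $R$ for a given $\gamma$. Clique extraction and triangulation are assumed to be done elsewhere.

```python
import numpy as np


def correlation_from_groups(n, groups, delta=1e-4):
    """Build U from a latent variable model with one latent per group.

    `groups` lists vertex-index collections: cliques of the (triangulated)
    bi-directed graph for Method I, or single edges for Method II.
    `delta` is the small constant added to the diagonal of U0.
    """
    z = np.zeros((n, len(groups)))
    for k, members in enumerate(groups):
        z[list(members), k] = 1.0
    # U0[i, j] = number of groups containing both i and j;
    # U0[i, i] = number of groups containing i, plus delta.
    u0 = z @ z.T + delta * np.eye(n)
    scale = 1.0 / np.sqrt(np.diag(u0))
    return u0 * np.outer(scale, scale)  # covariance rescaled to correlation


def relational_covariances(kernel, u, gamma):
    """Return (Sigma_zeta, Sigma_eps, R) for a given gamma in [0, 1]."""
    n = u.shape[0]
    sigma_zeta = gamma * u
    sigma_eps = (1.0 - gamma) * np.eye(n) + gamma * u   # Equation (6)
    r = kernel + sigma_zeta                             # R = K + Sigma_zeta
    return sigma_zeta, sigma_eps, r


# Hypothetical toy chain y0 <-> y1 <-> y2 with a linear kernel.
x = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
kernel = x @ x.T
edges = [(0, 1), (1, 2)]   # on a chain, the cliques are exactly the edges
u = correlation_from_groups(3, edges)
sigma_zeta, sigma_eps, r = relational_covariances(kernel, u, gamma=0.5)
print(np.round(u, 3))
```

On this chain the two methods coincide; they differ once the graph contains cliques of three or more vertices, where Method I counts shared cliques and Method II counts only the single shared edge. In either case the entry for the unlinked pair (0, 2) stays at zero.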

4 Results
We now compare three different methods in relational classification tasks. We will compare a standard Gaussian process classifier (GPC), the relational Gaussian process (RGP) of [2] and our method, the mixed graph Gaussian process (XGP). A linear kernel $K(x, z) = x \cdot z$ is used, as described by [2]. We set $\Delta = 10^{-4}$ and the hyperparameter $\gamma$ is found by a grid search in the space {0.1, 0.2, 0.3, ..., 1.0} maximizing the approximate EP marginal likelihood (footnote 5).

4 We provide MATLAB/Octave code for our method in http://www.statslab.cam.ac.uk/silva.
5 For triangulation, we used the MATLAB implementation of the Reverse Cuthill McKee vertex ordering available at http://people.scs.fsu.edu/burkardt/m src/rcm/rcm.html
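The grid search over $\gamma$ can be sketched as follows. This is an illustration under a loudly stated assumption: the paper scores each $\gamma$ by the approximate EP marginal likelihood of the classifier, whereas the sketch swaps in the closed-form Gaussian log evidence with covariance $K + \Sigma_\epsilon$ so that the example stays short and self-contained. The function names and the toy data are hypothetical.

```python
import numpy as np


def gaussian_log_evidence(y, cov):
    """Log density of y under N(0, cov); a stand-in for the EP evidence."""
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol, y)
    return (-0.5 * alpha @ alpha
            - np.log(np.diag(chol)).sum()
            - 0.5 * len(y) * np.log(2.0 * np.pi))


def select_gamma(y, kernel, u, grid=None):
    """Pick gamma on the grid {0.1, ..., 1.0} maximizing the stand-in score."""
    if grid is None:
        grid = np.arange(1, 11) / 10.0
    n = len(y)
    scores = [gaussian_log_evidence(
                  y, kernel + (1.0 - g) * np.eye(n) + g * u)
              for g in grid]
    best = int(np.argmax(scores))
    return grid[best], scores


# Hypothetical toy problem: three points on a chain, labels coded as +1 / -1.
y = np.array([1.0, 1.0, -1.0])
x = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
kernel = x @ x.T
u = np.array([[1.0, 0.6, 0.0],
              [0.6, 1.0, 0.6],
              [0.0, 0.6, 1.0]])
best_gamma, scores = select_gamma(y, kernel, u)
print(best_gamma)
```

Replacing `gaussian_log_evidence` with the evidence returned by an EP-based Gaussian process classifier would give the selection rule described in the text; the same loop can also compare the U matrices of Methods I and II.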


Table 1: The averaged AUC scores of citation prediction on test cases of the Cora database are recorded along with standard deviation over 100 trials. "n" denotes the number of papers in one class. "Citations" denotes the citation count within the two paper classes.

Group   n          Citations   GPC             GPC with Citations   XGP
5vs1    346/488    2466        0.905 ± 0.031   0.891 ± 0.022        0.945 ± 0.053
5vs2    346/619    3417        0.900 ± 0.032   0.905 ± 0.044        0.933 ± 0.059
5vs3    346/1376   3905        0.863 ± 0.040   0.893 ± 0.017        0.883 ± 0.013
5vs4    346/646    2858        0.916 ± 0.030   0.887 ± 0.018        0.951 ± 0.042
5vs6    346/281    1968        0.887 ± 0.054   0.843 ± 0.076        0.955 ± 0.041
5vs7    346/529    2948        0.869 ± 0.045   0.867 ± 0.041        0.926 ± 0.076

4.1 Political books
We consider first a simple classification problem where the goal is to classify whether a particular book is of liberal political inclination or not. The features of each book are given by the words in the Amazon.com front page for that particular book. The choice of books, labels, and relationships are given in the data collected by Valdis Krebs and available at http://www-personal.umich.edu/ mejn/netdata. The data containing book features can be found at http://www.statslab.cam.ac.uk/silva. There are 105 books, 43 of which are labeled as liberal books. The relationships are pairs of books which are frequently purchased together by the same customer. Notice this is an easy problem, where labels are strongly associated if they share a relationship.

We performed evaluation by sampling 100 times from the original pool of books, assigning half of them as training data. The evaluation criterion was the area under the curve (AUC) for this binary problem. This is a problem where Method I is suboptimal. Figure 4(a) shows the original binary link matrix.
Figure 4(b) depicts the corresponding $U_0$ matrix obtained with Method I, where entries closer to red correspond to stronger correlations. Method II gives a better performance here (Method I was better in the next two experiments). The AUC result for GPC was 0.92, while both RGP and XGP achieved 0.98 (the difference between XGP and GPC having a standard deviation of 0.02).

4.2 Cora
The Cora collection [6] contains over 50,000 computer science research papers including bibliographic citations. We used a subset in our experiment. The subset consists of 4,285 machine learning papers categorized into 7 classes. The second column of Table 1 shows the class sizes. Each paper was preprocessed as a bag-of-words, a vector of "term frequency" components scaled by "inverse document frequency", and then normalized to unit length. This follows the pre-processing used in [2]. There is a total of 20,082 features. For each class, we randomly selected 1% of the labelled samples for training and tested on the remainder. The partition was repeated 100 times.

We used the fact that the database is composed of fairly specialized papers as an illustration of when XGP might not be as good as RGP (whose AUC scores are very close to 1), since the population of links tends to be better separated between different classes (but this also means that the task is fairly easy, and differences disappear very rapidly with increasing sample sizes). The fact that there is very little training data also favors RGP, since XGP propagates information through training points. Still, XGP does better than the non-relational GPC. Notice that adding the citation adjacency matrix as a binary input feature for each paper does not improve the performance of the GPC, as shown in Table 1. Results for other classes are of similar qualitative nature and not displayed here.

4.3 WebKB
The WebKB dataset consists of homepages from 4 different universities: Cornell, Texas, Washington and Wisconsin [3].
Each webpage belongs to one out of 7 categories: student, professor, course, project, staff, department and "other". The relations come from actual links in the webpages. There is relatively high heterogeneity of types of links in each page: in terms of mixed graph modeling, this linkage mechanism is explained by a hidden common cause (e.g., a student and a course page are associated because that person's interest in enrolling as a student also creates demand for a course). The heterogeneity also suggests that two unlinked pages should not, on average, have an association if they link to a common page W . However, observing the type of page W might create


the association. We compare how the three algorithms perform when trying to predict if a webpage is of class "other" or not (the other classifications are easier, with smaller differences; results are omitted for space purposes). The proportion of "other" to non-"other" is about 4:1, which makes the area under the curve (AUC) a more suitable measure of success. We used the same 100 subsamples from [2], where 10% of the whole data is sampled from the pool for a specific university, and the remainder is used for test. We also used the same features as in [2], pre-processed as described in the previous section. The results are shown in Table 2. Both relational Gaussian processes are far better than the non-relational GPC. XGP gives significant improvements over RGP in all four universities.

Table 2: Comparison of the three algorithms on the task "other" vs. "not-other" in the WebKB domain. Results for GPC and RGP taken from [2]. The same partitions for training and test are used to generate the results for XGP. Mean and standard deviation of AUC results are reported.

             Numbers                 Other or Not
University   Other   All    Link     GPC             RGP             XGP
Cornell      617     865    13177    0.708 ± 0.021   0.884 ± 0.025   0.917 ± 0.022
Texas        571     827    16090    0.799 ± 0.021   0.906 ± 0.026   0.949 ± 0.015
Washington   939     1205   15388    0.782 ± 0.023   0.877 ± 0.024   0.923 ± 0.016
Wisconsin    942     1263   21594    0.839 ± 0.014   0.899 ± 0.015   0.941 ± 0.018
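Since every experiment in this section reports AUC, a small sketch of how that score can be computed may be useful. It uses the pairwise-ranking definition (the fraction of positive/negative pairs that are ordered correctly, with ties counted as one half); the function name and the toy labels and scores are hypothetical and are not taken from the experiments.

```python
import numpy as np


def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 1/2)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]   # all positive-minus-negative score gaps
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 1, 0]              # hypothetical 4:1 class imbalance
scores = [0.9, 0.8, 0.4, 0.7, 0.5]    # hypothetical classifier outputs
print(auc_score(labels, scores))
```

Because the score depends only on how positives rank against negatives, it does not reward a classifier for simply predicting the majority class, which is the concern raised above for the "other" vs. "not-other" task.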

5 Conclusion
We introduced a new family of relational classifiers by extending a classical statistical model [12] to non-parametric relational classification. This is inspired by recent advances in relational Gaussian processes [2] and Bayesian inference for mixed graph models [11]. We showed empirically that modeling the type of latent phenomena that our approach postulates can sometimes improve prediction performance in problems traditionally approached by Markov network structures.

Several interesting problems can be treated in the future. It is clear that there are many different ways by which the relational covariance matrix can be parameterized. Intermediate solutions between Methods I and II, approximations through matrix factorizations and graph cuts are only a few among many alternatives that can be explored. Moreover, there is a relationship between our model and multiple kernel learning [1], where one of the kernels comes from error covariances. This might provide alternative ways of learning our models, including multiple types of relationships.

Acknowledgements: We thank Vikas Sindhwani for the preprocessed Cora database.

References
[1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. 21st International Conference on Machine Learning, 2004.
[2] W. Chu, V. Sindhwani, Z. Ghahramani, and S. Keerthi. Relational learning with Gaussian processes. Neural Information Processing Systems, 2006.
[3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. Proceedings of AAAI'98, pages 509-516, 1998.
[4] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 18th International Conference on Machine Learning, 2001.
[5] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[6] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of Internet portals with machine learning. Information Retrieval Journal, 3:127-163, 2000.
[7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD Thesis, MIT, 2001.
[8] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[9] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962-1030, 2002.
[10] P. Sen and L. Getoor. Link-based classification. Report CS-TR-4858, University of Maryland, 2007.
[11] R. Silva and Z. Ghahramani. Bayesian inference for Gaussian mixed graph models. UAI, 2006.
[12] A. Zellner. An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 1962.