Learning the Bayesian Network Structure: Dirichlet Prior versus Data

Harald Steck
Computer-Aided Diagnosis (IKM CKS)
Siemens Medical Solutions, 51 Valley Stream Parkway E51
Malvern, PA 19355, USA
harald.steck@siemens.com

Abstract

In the Bayesian approach to structure learning of graphical models, the equivalent sample size (ESS) in the Dirichlet prior over the model parameters was recently shown to have an important effect on the maximum-a-posteriori estimate of the Bayesian network structure. In our first contribution, we theoretically analyze the case of large ESS-values, which complements previous work: among other results, we find that the presence of an edge in a Bayesian network is favoured over its absence even if both the Dirichlet prior and the data imply independence, as long as the conditional empirical distribution is notably different from uniform. In our second contribution, we focus on realistic ESS-values, and provide an analytical approximation to the 'optimal' ESS-value in a predictive sense (its accuracy is also validated experimentally): this approximation provides an understanding as to which properties of the data have the main effect in determining the 'optimal' ESS-value.

1 INTRODUCTION

The posterior probability and the marginal likelihood of the graph are among the most popular scoring functions for learning Bayesian network structures, e.g. [Buntine, 1991, Heckerman et al., 1995]. Both of them require the value of a free parameter to be specified by the researcher: the so-called equivalent sample size (ESS), which originates from the Dirichlet prior over the model parameters, cf. [Buntine, 1991, Heckerman et al., 1995].

In the elaborate experiments on 20 UCI data sets in [Silander et al., 2007], it was shown that the chosen ESS-value has a decisive effect on the resulting maximum-a-posteriori (MAP) graph estimate: in the experiment shown in Figure 1 of [Silander et al., 2007], the number of edges increases monotonically from zero to the maximum as the ESS-value grows. These two extremes (empty and complete graph) are indeed reached for several data sets, and they are almost reached for the remaining data sets in [Silander et al., 2007].

The effect of the ESS-value on the MAP graph estimate was noticed earlier in [Steck & Jaakkola, 2002], and one of its main contributions was the theoretical analysis of the case of small ESS-values. That paper also provided a result for the case ESS → ∞: the log Bayes factor converges to zero in this limit, which means that it favours neither the presence nor the absence of an edge. Given this inconclusive behavior in the limit, our first contribution in this paper is concerned with the case of large but finite ESS-values: in our theoretical analysis in Section 3, we derive the properties of the given data under which the presence of an edge in the graph is favoured. This contribution concerning the case of large but finite ESS-values, combined with the results for small ESS-values [Steck & Jaakkola, 2002], implies immediately that, under these conditions, the number of edges increases when the ESS-value grows, although not necessarily monotonically.

Our second contribution is concerned with the case of realistic / intermediate ESS-values (Section 4): our goal is to understand as to which properties of a given data set determine the 'optimal' ESS-value (in the predictive sense). We achieve this by deriving an explicit analytical approximation (its validity is also assessed experimentally).
2 BRIEF REVIEW OF BAYESIAN NETWORK LEARNING

A Bayesian network model comprises a directed acyclic graph G and the model parameters θ, which are conditional probabilities; see, e.g., [Cowell et al., 1999] for an overview. It describes a joint probability distribution p(X|θ, G) over the vector of random variables X = (X_1, ..., X_n), where n is the number of variables. We consider discrete random variables X_i with a multinomial distribution in this paper. According to the graph G, the joint distribution factors recursively,

    p(X|θ, G) = ∏_{i=1}^{n} θ_{X_i|Π_i},

where Π_i denotes the set of parents of variable/node X_i in the graph G; the (joint) states of Π_i will be denoted by π_i, and the states of X_i by x_i.

Various approaches to learning the graph G from given data D have been developed in the literature, e.g., see [Cowell et al., 1999] for an overview. The Bayesian approach appears to be the most popular. As a scoring function, it employs the posterior probability of the graph G given the data, p(G|D) ∝ p(D|G) p(G), or the marginal likelihood of the graph, p(D|G). Both are equivalent if the prior distribution over the graphs, p(G), is chosen to be uniform. The marginal likelihood of the graph can be expressed in closed form when using the Dirichlet prior over the model parameters [Buntine, 1991, Heckerman et al., 1995]. More importantly, the Dirichlet prior is the only distribution that ensures likelihood equivalence [Heckerman et al., 1995], a desirable property in structure learning. In particular, likelihood equivalence holds only if the hyper-parameters α_{x_i,π_i} of the Dirichlet prior can be expressed as α_{x_i,π_i} = α · q_{x_i,π_i} for all x_i, π_i and for all i = 1, ..., n; q is a prior distribution over X, and α is a positive constant independent of i, the so-called scale-parameter or equivalent sample size (ESS). When q is chosen to be a uniform distribution, one arrives at the BDeu score [Buntine, 1991]: q_{x_i,π_i} = 1/(|X_i| · |Π_i|), where | · | denotes the number of (joint) states of the variable(s). Hence,

    α_{x_i,π_i} = α / (|X_i| · |Π_i|),   (1)

so that the ESS α is the only remaining free parameter in the marginal likelihood of the graph G: p(D|α, G) equals [Buntine, 1991, Heckerman et al., 1995]

    p(D|α, G) = ∏_{i=1}^{n} ∏_{π_i} [ Γ(α_{π_i}) / Γ(N_{π_i} + α_{π_i}) ] ∏_{x_i} [ Γ(N_{x_i,π_i} + α_{x_i,π_i}) / Γ(α_{x_i,π_i}) ],   (2)

where Γ is the Gamma function; the first product extends over all the variables, while the second one is over all the joint states π_i of the parents Π_i of variable X_i, and the third product is over all the states x_i of X_i. The cell counts N_{x_i,π_i} in the contingency tables serve as sufficient statistics; the sample size is N = Σ_{x_i,π_i} N_{x_i,π_i}, and α_{π_i} = Σ_{x_i} α_{x_i,π_i}, N_{π_i} = Σ_{x_i} N_{x_i,π_i}. In this Bayesian approach with the Dirichlet prior, the regularized parameter estimates are [Buntine, 1991]

    θ̄_{x_i|π_i} ≡ E_{p(θ_{x_i|π_i}|D,G)}[θ_{x_i|π_i}] = (N_{x_i,π_i} + α_{x_i,π_i}) / (N_{π_i} + α_{π_i}),   (3)

which is the expectation with respect to the parameters' posterior distribution.

For fixed data D and ESS α, finding the maximum-a-posteriori (MAP) estimate of the graph G, or the maximum with respect to the marginal likelihood in Eq. 2, is an NP-complete problem [Chickering, 1996]. One thus has to resort to heuristic search strategies, like local search (e.g., [Heckerman et al., 1995]), as to find a close-to-optimal graph. If the number of variables is not prohibitively large, the exact solution, i.e., the globally optimal graph G, can be found in reasonable computation time due to recent advances in structure learning [Silander & Myllymäki, 2006, Koivisto & Sood, 2004].

Besides this popular Bayesian score, various other scoring functions have been devised for structure learning, including the Bayesian Information Criterion (BIC) [Schwarz, 1978, Haughton, 1988], the Akaike Information Criterion (AIC) [Akaike, 1973], and the Minimum Description Length (MDL) [Rissanen, 1978]. We will use the AIC in our second contribution in Section 4.2.
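To make Eqs. 1-3 concrete, the following Python sketch (an illustration added here, not part of the original paper; function names and the NumPy/SciPy dependency are our own choices) computes the BDeu contribution of a single family (X_i, Π_i) to the log marginal likelihood in Eq. 2, together with the regularized parameter estimates of Eq. 3, from a contingency table of counts N_{x_i,π_i}. The full log marginal likelihood of a graph is the sum of such family scores over all variables.

    import numpy as np
    from scipy.special import gammaln  # log Gamma function

    def bdeu_family_log_score(counts, alpha):
        """Contribution of one family (X_i, Pi_i) to log p(D | alpha, G), cf. Eq. 2.

        counts: array of shape (num_parent_states, num_child_states),
                counts[pi, x] = N_{x_i, pi_i}.
        alpha:  equivalent sample size (ESS).
        """
        counts = np.asarray(counts, dtype=float)
        q_pi, r_x = counts.shape
        a_x_pi = alpha / (r_x * q_pi)          # Eq. 1: alpha_{x_i, pi_i}
        a_pi = alpha / q_pi                    # alpha_{pi_i} = sum_x alpha_{x_i, pi_i}
        n_pi = counts.sum(axis=1)              # N_{pi_i}
        score = np.sum(gammaln(a_pi) - gammaln(n_pi + a_pi))
        score += np.sum(gammaln(counts + a_x_pi) - gammaln(a_x_pi))
        return score

    def regularized_estimates(counts, alpha):
        """Posterior-mean parameter estimates of Eq. 3."""
        counts = np.asarray(counts, dtype=float)
        q_pi, r_x = counts.shape
        a_x_pi = alpha / (r_x * q_pi)
        a_pi = alpha / q_pi
        return (counts + a_x_pi) / (counts.sum(axis=1, keepdims=True) + a_pi)

    # toy example: binary child with a 3-state parent
    counts = np.array([[10, 2], [4, 4], [1, 9]])
    print(bdeu_family_log_score(counts, alpha=1.0))
    print(regularized_estimates(counts, alpha=1.0))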
3 LARGE ESS-VALUE: ITS EFFECT ON THE LEARNED GRAPH

In this section we present our first contribution, namely as to how a large but finite value of the ESS affects the learned optimal Bayesian network structure. This complements the theoretical results for small ESS-values derived in [Steck & Jaakkola, 2002] and the experimental results for 'intermediate' ESS-values obtained in [Silander et al., 2007].

3.1 UNIFORMITY-MEASURE

In this section, we define a new 'uniformity' measure and discuss its properties. This measure (and its properties) determines the result of structure learning for large ESS-values, as we will see in the next section.

Definition: We define the uniformity-measure U of a conditional multivariate distribution p(A, B|Π) over two random variables A and B given a set of variables Π, as follows:

    U(p(A,B|Π)) = Σ_{a,b,π} p(a,b,π) [ |A,B,Π| · p(a,b,π) − |A,Π| · p(a,π) − |B,Π| · p(b,π) + |Π| · p(π) ],   (4)

where |A,B,Π| denotes the number of joint states. Obviously, this definition is equivalent to:

    U(p(A,B|Π)) = Σ_π |Π| p(π)² Σ_{a,b} p(a,b|π) [ |A,B| · p(a,b|π) − |A| · p(a|π) − |B| · p(b|π) + 1 ]
                = |A,B,Π| Σ_{a,b,π} p(a,b,π)² − |A,Π| Σ_{a,π} p(a,π)² − |B,Π| Σ_{b,π} p(b,π)² + |Π| Σ_π p(π)²,   (5)

where U is rewritten in terms of conditional probabilities in the first line, and in terms of squared probabilities in the second line. The interesting property of U (in each representation) is that it is a weighted sum of four terms, where the weights are the numbers of (joint) states of the variables.

Proposition 1: The measure U has the following three basic properties:
· symmetry: U(p(A,B|Π)) = U(p(B,A|Π))
· non-negativity: U(p(A,B|Π)) ≥ 0
· minimality: U(p(A,B|Π)) = 0 if and only if
  - (conditional) independence: p(A,B|Π) = p(A|Π) p(B|Π),
  - and, for each state π with p(π) > 0, at least one of the marginal distributions is uniform: p(A|π) = 1/|A| or p(B|π) = 1/|B|.

Concerning the minimality, note that U > 0 if a state π with p(π) > 0 exists such that neither one of p(A|π) or p(B|π) is uniform: this includes distributions where A and B are conditionally independent, but both marginal distributions are skewed. In other words, U is not necessarily equal to zero for independent variables.
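Before the formal proof, a small numerical illustration may help; the following sketch (added here for illustration, not from the paper, with NumPy assumed and function names our own) evaluates U via the squared-probability form in the second line of Eq. 5 and shows the behaviour described in Proposition 1.

    import numpy as np

    def uniformity_measure(p_abp):
        """U(p(A,B|Pi)) via the squared-probability form in Eq. 5.

        p_abp: array of shape (|A|, |B|, |Pi|) with the joint p(a, b, pi).
        """
        p_abp = np.asarray(p_abp, dtype=float)
        nA, nB, nP = p_abp.shape
        p_ap = p_abp.sum(axis=1)       # p(a, pi)
        p_bp = p_abp.sum(axis=0)       # p(b, pi)
        p_p = p_abp.sum(axis=(0, 1))   # p(pi)
        return (nA * nB * nP * np.sum(p_abp ** 2)
                - nA * nP * np.sum(p_ap ** 2)
                - nB * nP * np.sum(p_bp ** 2)
                + nP * np.sum(p_p ** 2))

    # A and B conditionally independent, both marginals skewed -> U > 0
    pa = np.array([0.9, 0.1])
    pb = np.array([0.8, 0.2])
    p = np.einsum('a,b->ab', pa, pb)[:, :, None]   # single parent state
    print(uniformity_measure(p))                   # > 0 despite independence

    # A uniform, B arbitrary, independent -> U = 0 (minimality in Prop. 1)
    p0 = np.einsum('a,b->ab', np.array([0.5, 0.5]), pb)[:, :, None]
    print(uniformity_measure(p0))                  # ~ 0 (up to rounding)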
Proof: The symmetry is obvious. As to prove the other two properties, we focus on the representation in terms of conditional probabilities in Eq. 5; obviously, if these properties hold for each state π, then they also hold for U, the sum. As the remainder of the proof is understood to be conditional on a state π with p(π) > 0, we now omit π from the notation.

A necessary condition for minimization of U under the normalization constraint Σ_{a,b} p(a,b) = 1 (accounted for by introducing the Lagrange multiplier λ) is that all the first partial derivatives vanish, i.e., for all states a, b:

    2 |A,B| p(a,b) − 2 |A| p(a) − 2 |B| p(b) + λ = 0.   (6)

With the normalization constraint it follows immediately that λ = 2. It follows, for the particular choice of considering the difference between each pair of these equations pertaining to the same state b but different states a and a′, that

    p(a,b) − p(a′,b) = [ p(a) − p(a′) ] / |B|.

Hence, if p(a) = p(a′) for all a, a′, then p(A) is uniform and hence p(A,B) = p(A) p(B) with arbitrary p(B). Otherwise, i.e., if p(a) ≠ p(a′) for some a, a′, then p(A,B) = p(A) p(B), where p(B) has to be uniform, and p(A) can be arbitrary. Conversely, it can be verified easily that such a distribution indeed fulfills the original condition in Eq. 6. Moreover, U = 0 for such a distribution. Finally, it can be shown that the second derivative is positive definite, which completes the proof.

Besides the interesting fact that U is a weighted sum where the numbers of states function as the weights, we also like to mention that each of the four terms themselves is related to well-known quantities: the squared probabilities in Eq. 5 suggest a relationship to the well-known Gini index, H^G(p(X)) = 1 − Σ_x p_x², which is used as an impurity measure when learning decision trees. It can also be viewed as a special case of the Tsallis entropy H^T_r(p(X)) = (1 − Σ_x p_x^r)/(r − 1) with parameter r = 2 [Tsallis, 2000]. The Tsallis entropy is used in statistical physics, and can be understood as the leading-order Taylor expansion of the Rényi entropy, H^R_r(p(X)) = (log Σ_x p_x^r)/(1 − r), which is a generalization of the Shannon entropy. In the limit r → 1, the Tsallis entropy and the Rényi entropy both coincide with the Shannon entropy.

3.2 LEADING-ORDER APPROXIMATION OF BAYES FACTOR

In this section, we theoretically analyze the behavior of the Bayesian approach to structure learning, as reviewed in Section 2, for the case of large but finite ESS-values. As to determine the most likely graph structure, note that the marginal likelihood p(D|α, G+) of a graph G+ is important only relative to the marginal likelihood p(D|α, G−) of a competing graph G−. In particular, we consider two graphs in the following that are identical except for the edge A ← B, which is present in G+ and absent in G−. Let Π denote the parents of A in graph G−; this implies that, in G+, the parents of A are Π and B. The presence of the edge A ← B given the parents Π is favoured over its absence (i.e., graph G+ over G−) if the log Bayes factor log p(D|α,G+)/p(D|α,G−) is positive; and if negative, the absence of A ← B is favoured.

Given complete data (i.e., no missing data), the marginal likelihood factorizes (cf. Eq. 2), so that most terms in the Bayes factor cancel out, and it depends only on the variables A and B and the parents Π:

    log [ p(D|α,G+) / p(D|α,G−) ] = Σ_{a,b,π} log [ Γ(N_{a,b,π} + α_{a,b,π}) / Γ(α_{a,b,π}) ]
        − Σ_{a,π} log [ Γ(N_{a,π} + α_{a,π}) / Γ(α_{a,π}) ]
        − Σ_{b,π} log [ Γ(N_{b,π} + α_{b,π}) / Γ(α_{b,π}) ]
        + Σ_π log [ Γ(N_π + α_π) / Γ(α_π) ].   (7)

Note that this Bayes factor is symmetric in A and B, as expected for independence tests; the asymmetry of the edge A ← B is caused by Π being the parents of A (rather than of B) in the graph.
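For concreteness, here is a small sketch (an illustration added here under the BDeu assumption of a uniform q, not code from the paper; names are our own) that evaluates the exact log Bayes factor of Eq. 7 from the contingency counts N_{a,b,π}.

    import numpy as np
    from scipy.special import gammaln

    def log_bayes_factor(counts, alpha):
        """Exact log p(D|alpha,G+)/p(D|alpha,G-) of Eq. 7 under the BDeu score.

        counts: array of shape (|A|, |B|, |Pi|) with N_{a,b,pi}.
        alpha:  equivalent sample size (ESS).
        """
        counts = np.asarray(counts, dtype=float)
        nA, nB, nP = counts.shape

        def term(n_y, a_y):
            # sum_y log Gamma(N_y + alpha_y) / Gamma(alpha_y)
            return np.sum(gammaln(n_y + a_y) - gammaln(a_y))

        lbf = term(counts, alpha / (nA * nB * nP))            # + over (a, b, pi)
        lbf -= term(counts.sum(axis=1), alpha / (nA * nP))    # - over (a, pi)
        lbf -= term(counts.sum(axis=0), alpha / (nB * nP))    # - over (b, pi)
        lbf += term(counts.sum(axis=(0, 1)), alpha / nP)      # + over pi
        return lbf

    # toy example: two binary variables, binary parent set, strong dependence
    counts = np.array([[[20, 5], [5, 20]],
                       [[5, 20], [20, 5]]], dtype=float)
    for alpha in (1.0, 10.0, 100.0):
        print(alpha, log_bayes_factor(counts, alpha))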
Now we can present our main result for large but finite ESS-values:

Proposition 2: Given the BDeu score, the leading-order approximation of the Bayes factor of the two graphs G+ and G− defined above reads for large ESS-values (α ≫ N):

    log [ p(D|α,G+) / p(D|α,G−) ] = (N / 2α) · [ N · U(p̂(A,B|Π)) − d_F ] + O(N²/α²),   (8)

where U is the uniformity-measure as defined above, p̂(A,B|Π) is the empirical distribution implied by the data, i.e., p̂(a,b,π) = N_{a,b,π}/N, and d_F = |Π| (|A| − 1)(|B| − 1) are the (well-known) degrees of freedom.¹

¹ Here, d_F must not be corrected for zero-cell counts, as it originates from the prior in the BDeu score, cf. Eq. 10. Note that this is different from d_G^eff in Section 4.2.

Before we give the proof at the end of this section, let us first discuss interesting insights that follow from this approximation. First, note that if U were replaced by the (conditional) mutual information I, then the term N · I − d_F would be identical to the Akaike Information Criterion (AIC) [Akaike, 1973]. This analogy suggests that, in Eq. 8, d_F serves as a penalty for model complexity. In other words, N · U(p̂) − d_F > 0 means that U(p̂) is notably (or significantly) larger than zero, while N · U(p̂) − d_F < 0 refers to U(p̂) being not notably larger than zero. Given the properties of the new measure U (as discussed in Section 3.1), it hence follows directly:

Corollary 1: Given the BDeu score, there exists a value α⁺ ∈ R such that for all finite α > α⁺ the presence of the edge A ← B given the parents Π is favoured over its absence if N · U(p̂(A,B|Π)) − d_F > 0, i.e., if the empirical distribution p̂ implies
· a notable dependence between A and B given Π,
· or a notable skewness (i.e., non-uniformity) of both distributions p̂(A|Π) and p̂(B|Π). Note that A and B can be conditionally independent here.
Conversely, the absence of the edge A ← B given the parents Π is favoured for all large ESS-values α if N · U(p̂(A,B|Π)) − d_F < 0, i.e., if the empirical distribution p̂ implies that
· A and B are not notably dependent given Π,
· and for each state π with p̂(π) > 0 there is one marginal distribution p̂(A|π) or p̂(B|π) that is not notably different from a uniform distribution.

Second, given that this applies to any edge in the graph, it follows immediately that the complete graph achieves the largest marginal likelihood if a sufficiently large (but finite) ESS-value is chosen, assuming that, for each variable conditioned on any set of parents, the data implies notable dependence or a sufficiently skewed distribution. Note, however, that the latter condition may not necessarily hold for large sets of parents when the given data set is small; this is because many zero-cell counts occur if the joint number of states of the variables is larger than the number of samples in the data.
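The decision criterion of Corollary 1 is easy to evaluate directly from the data; the following sketch (an illustration added here, not the paper's code) computes N · U(p̂(A,B|Π)) − d_F from the counts, reusing the squared-probability form of Eq. 5, and demonstrates the second bullet of Corollary 1 on independent-but-skewed data.

    import numpy as np

    def large_ess_criterion(counts):
        """N * U(p_hat(A,B|Pi)) - d_F, cf. Eq. 8 and Corollary 1.

        counts: array of shape (|A|, |B|, |Pi|) with N_{a,b,pi}.
        Positive => edge favoured for all sufficiently large (finite) ESS-values.
        """
        counts = np.asarray(counts, dtype=float)
        nA, nB, nP = counts.shape
        N = counts.sum()
        p = counts / N                                   # empirical p_hat(a, b, pi)
        U = (nA * nB * nP * np.sum(p ** 2)
             - nA * nP * np.sum(p.sum(axis=1) ** 2)
             - nB * nP * np.sum(p.sum(axis=0) ** 2)
             + nP * np.sum(p.sum(axis=(0, 1)) ** 2))
        d_F = nP * (nA - 1) * (nB - 1)                   # degrees of freedom
        return N * U - d_F

    # independent but skewed binary A and B (cf. Corollary 1, second bullet)
    N = 100
    pa, pb = np.array([0.9, 0.1]), np.array([0.9, 0.1])
    counts = N * np.einsum('a,b->ab', pa, pb)[:, :, None]  # expected counts
    print(large_ess_criterion(counts))   # > 0: edge favoured despite independence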
Proof of Proposition 2: Each of the four terms in the Bayes factor in Eq. 7 takes the form Σ_y log [ Γ(N_y + α_y) / Γ(α_y) ], where y denotes a (joint) state of a set Y of random variables. If α_y ≫ N_y, we obtain for each y (using the recursion Γ(z + 1) = z Γ(z) in the first line):

    log [ Γ(N_y + α_y) / Γ(α_y) ] = Σ_{k=1}^{N_y} log(k − 1 + α_y)
        = Σ_{k=1}^{N_y} [ log α_y + log(1 + (k − 1)/α_y) ]
        = N_y log α_y + Σ_{k=1}^{N_y} (k − 1)/α_y + O(N_y²/α_y²)
        = N_y log α_y + N_y (N_y − 1)/(2 α_y) + O(N_y²/α_y²).   (9)

Inserting this approximation into the log Bayes factor of Eq. 7, we obtain (note that α_{a,b,π} = α/|A,B,Π| for the BDeu score, where | · | denotes the number of joint states of the random variables):

    log [ p(D|α,G+) / p(D|α,G−) ] = Σ_{a,b,π} N_{a,b,π} log [ α_{a,b,π} α_π / (α_{a,π} α_{b,π}) ]
        + (1/2α) Σ_{a,b,π} N_{a,b,π} [ |A,B,Π| · N_{a,b,π} − |A,Π| · N_{a,π} − |B,Π| · N_{b,π} + |Π| · N_π ]
        − (N/2α) [ |A,B,Π| − |A,Π| − |B,Π| + |Π| ]
        + O(N²/α²),   (10)

where the first term vanishes because α_{A,B,Π} is uniform for the BDeu score, the second term equals N² · U(p̂(A,B|Π))/(2α),² and the third term equals d_F · N/(2α).

² Because of N_{a,b,π} = N · p̂(a,b,π).

3.3 ILLUSTRATION

This section provides a simple example that shows how the combination of a uniform (prior) distribution and a skewed empirical distribution can create dependence, while each individual distribution implies independence. The key is that, in this Bayesian approach with the Dirichlet prior, the log Bayes factor is essentially based on a weighted sum of two distributions (see Section 2 and Eq. 3): the empirical distribution p̂ (with weight N) and the uniform distribution q of the 'virtual' data points due to the Dirichlet prior (with weight α). Obviously, the sum of two distributions (where each implies independence) does not necessarily result in a distribution that implies independence: N p̂(A) p̂(B) + α q(A) q(B) ≠ c · p(A) p(B) for any distribution p and any constant c in general. The deviation from this equality is typically not statistically significant for uniform q and skewed p̂ as long as the ESS-value is sufficiently small; but as the ESS-value increases in this weighted sum, the deviation can become statistically significant, as illustrated in the following example: let us consider an artificial data set D with N = 100 samples implying independence between two binary variables A and B, i.e., the empirical distribution factorizes like p̂(A,B) = p̂(A) p̂(B); for simplicity, we assume that p̂(A) = p̂(B). We control the skewness of the marginal distributions with a parameter z ∈ [0, 0.5]: p̂(A = 1) = z and p̂(A = 0) = 1 − z; z = 0.5 implies a uniform distribution, while the skewness of the distribution grows as z decreases. Values in the range [0.5, 1] are symmetric to the case considered here. Figure 1 illustrates how the log Bayes factor (cf. Eq. 7) increases from negative values (implying independence) to positive values (suggesting dependence) as the empirical distribution becomes increasingly skewed (i.e., as z decreases).

Figure 1 (log Bayes factor on the vertical axis, about −3 to 1, versus z on the horizontal axis, 0 to 0.5): This example illustrates how the dependence increases with the skewness implied by the data for several ESS-values: 100, 10, 1 (solid curves, from top to bottom); 1,000 and 10,000 (dashed curves, from top to bottom).
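This example is straightforward to reproduce numerically; the sketch below (an illustration under the stated assumptions, added here and using expected rather than integer cell counts, which the Gamma function accepts) evaluates the exact log Bayes factor of Eq. 7 for the artificial data set as a function of the skewness parameter z, for several ESS-values, and qualitatively reproduces the curves of Figure 1.

    import numpy as np
    from scipy.special import gammaln

    def log_bayes_factor_2x2(counts, alpha):
        """Eq. 7 for two binary variables and an empty parent set (|Pi| = 1)."""
        a_ab, a_m = alpha / 4.0, alpha / 2.0            # BDeu hyper-parameters
        N = counts.sum()
        lbf = np.sum(gammaln(counts + a_ab) - gammaln(a_ab))
        lbf -= np.sum(gammaln(counts.sum(axis=1) + a_m) - gammaln(a_m))
        lbf -= np.sum(gammaln(counts.sum(axis=0) + a_m) - gammaln(a_m))
        lbf += gammaln(N + alpha) - gammaln(alpha)
        return lbf

    N = 100.0
    for alpha in (1.0, 10.0, 100.0, 1000.0, 10000.0):
        row = []
        for z in (0.5, 0.4, 0.3, 0.2, 0.1, 0.0):
            pA = np.array([1.0 - z, z])                 # p_hat(A) = p_hat(B)
            counts = N * np.outer(pA, pA)               # independent, skewness z
            row.append(round(log_bayes_factor_2x2(counts, alpha), 2))
        print(alpha, row)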
Figure 1 shows that, as the ESS-value increases, the degree of skewness of p̂ has to diminish as to prevent the log Bayes factor from implying dependence. As an aside, note that the curves are quite flat for large and small ESS-values, where either one of the distributions dominates the sum; in contrast, the maximum at z = 0 is reached for the ESS-value α = N = 100, i.e., when both distributions are weighted equally.

4 UNDERSTANDING THE OPTIMAL ESS-VALUE

Having theoretically analyzed the case of large ESS-values in the first part of this paper, and the case of small ESS-values in [Steck & Jaakkola, 2002], the remainder of this paper focuses on the 'optimal' ESS-value (in a predictive sense): we tackle the question as to why the 'optimal' ESS-value is about 50 for some of the 20 UCI data sets in the elaborate experiments in [Silander et al., 2007] (their results are also reproduced in columns α_M and I in our Table 1 for comparison), while it ranges between 2 and 13 for the remaining data sets. This section aims to provide an answer to this question, i.e., the goal is to understand as to which properties of the data determine the 'optimal' ESS-value.

4.1 OPTIMIZATION OF GRAPH AND ESS

Given that the ESS-value can have a crucial effect on the learned network structure [Silander et al., 2007, Steck & Jaakkola, 2002], we now depart from the orthodox Bayesian approach of choosing the prior--including the ESS-value--without having seen the data. In the remainder of this paper we treat the ESS as a latent random variable to be learned in light of the data D. Various objective functions for determining the 'optimal' ESS-value were discussed in [Silander et al., 2007], and found to yield essentially the same result. In this paper, we choose to jointly optimize the graph structure and the ESS-value:

    (α*, G*) = arg max_{(α,G)} p(D|α, G).   (11)

While joint optimization of Eq. 11 is computationally very expensive, a simple coordinate-wise ascent is more tractable. This approach comprises two alternate update-steps in each round k = 0, 1, ... (until convergence):

1. optimize the graph for a fixed ESS-value: G*_k = arg max_G p(D|α_{k−1}, G),
2. optimize the ESS-value for a fixed graph: α_k = arg max_α p(D|α, G*_k).

The first step can be solved by any standard algorithm for structure learning (cf. the review in Section 2). Concerning the initialization of the graph G*_0 in this iterative algorithm, there are several options. A convenient choice is to learn the graph G*_0 that optimizes the Bayesian Information Criterion (BIC) [Schwarz, 1978, Haughton, 1988]. This criterion not only is a large-sample approximation to the log marginal likelihood in Eq. 2, but it also is independent of the ESS, i.e., it does not contain a free parameter. It can hence be used for initialization of G*_0 when no initial ESS-value is known, and promises to yield an initial graph G*_0 that is already close to the optimum with respect to the marginal likelihood. The main contribution of the remainder of this paper is to derive an analytical approximation for the optimal α in the second update-step.
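The second update-step can also be carried out directly, by maximizing the exact log marginal likelihood over α for the fixed graph; the sketch below (an illustration added here, not the paper's implementation; the graph is represented as a list of parent sets, and a simple grid search over α is assumed) shows one way to do this.

    import numpy as np
    from scipy.special import gammaln

    def bdeu_log_marginal(data, parents, alpha):
        """log p(D | alpha, G), cf. Eq. 2, for complete discrete data.

        data:    integer array of shape (N, n), states coded 0..r_i-1.
        parents: list of length n; parents[i] is a tuple of parent indices of X_i.
        """
        data = np.asarray(data)
        cards = data.max(axis=0) + 1                 # number of states per variable
        score = 0.0
        for i, pa in enumerate(parents):
            q = int(np.prod([cards[j] for j in pa])) if pa else 1
            r = int(cards[i])
            pa_idx = np.zeros(len(data), dtype=int)  # joint parent-state index
            for j in pa:
                pa_idx = pa_idx * cards[j] + data[:, j]
            counts = np.zeros((q, r))
            np.add.at(counts, (pa_idx, data[:, i]), 1.0)
            a_xp, a_p = alpha / (r * q), alpha / q
            score += np.sum(gammaln(a_p) - gammaln(counts.sum(axis=1) + a_p))
            score += np.sum(gammaln(counts + a_xp) - gammaln(a_xp))
        return score

    def optimize_ess(data, parents, grid=np.logspace(-1, 3, 200)):
        """Step 2 of the coordinate-wise ascent: alpha_k = argmax_alpha p(D|alpha, G*_k)."""
        scores = [bdeu_log_marginal(data, parents, a) for a in grid]
        return grid[int(np.argmax(scores))]

    # toy example: X0 -> X1, 200 samples of two binary variables
    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 2, 200)
    x1 = (x0 ^ (rng.random(200) < 0.2)).astype(int)   # noisy copy of x0
    data = np.stack([x0, x1], axis=1)
    print(optimize_ess(data, parents=[(), (0,)]))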
4.2 OPTIMAL ESS FOR GIVEN GRAPH

This section provides an understanding of the most important properties of the data that affect the value of the optimal ESS. We derive an analytical approximation to the optimal ESS-value for a fixed graph G, cf. step 2 above. Step 2 maximizes predictive accuracy in the prequential sense [Heckerman et al., 1995, Dawid, 1984], when the data points are considered to arrive individually in a sequence. As this optimization problem is difficult to solve, we now replace it by a similar, but frequentist, objective, departing from Bayesian statistics: we minimize the test error, as is commonly done in cross-validation. Even though this objective is not the same as the original one, it can be expected to yield a sufficiently accurate approximation in as far as we only aim to understand the difference between ESS-values of about 50 versus about 10 or lower for the various data sets in [Silander et al., 2007]. Note that this assumption and the following ones are validated experimentally on 20 UCI data sets in Section 4.3.

In the following, we combine two approximations to the test error as to obtain an explicit approximation to the optimal ESS-value α*. We carry out both approximations w.r.t. a 'reference point' in the space of distributions. For convenience, we choose this to be the distribution p̂(X|G) implied by the Bayesian network model with graph G and maximum-likelihood parameter estimates θ̂. The test error of this model with respect to the true distribution p(X) reads

    T[p(X), p̂(X|G)] = − Σ_x p(x) log p̂(x|G)   (12)

when using the log loss. As p is unknown, this cannot be evaluated. As is well known [Akaike, 1973], however, the test error can be approximated by the training error, −E_{p̂(X)}[log p̂(X|G)], and a penalty term for model complexity:

    T[p(X), p̂(X|G)] = −E_{p̂(X)}[log p̂(X|G)] + d_G^eff / N + O(1/N²),   (13)

where

    E_{p̂(X)}[log p̂(X|G)] = Σ_x p̂(x) log p̂(x|G) = Σ_{i=1}^{n} Σ_{x_i} Σ_{π_i} (N_{x_i,π_i}/N) log(N_{x_i,π_i}/N_{π_i})   (14)

equals the maximum log-likelihood divided by the sample size, and d_G^eff is the effective number of parameters of the Bayesian network model:

    d_G^eff = Σ_{i=1}^{n} [ Σ_{x_i,π_i} I(N_{x_i,π_i}) − Σ_{π_i} I(N_{π_i}) ],   (15)

where I(·) is the indicator function: I(a) = 1 if a > 0 and I(a) = 0 otherwise. If the data set is sufficiently large so that all cell counts N_{x_i,π_i} are positive, then d_G^eff equals the well-known number of parameters, which is given for Bayesian networks by Σ_i (|X_i| − 1) |Π_i|, where | · | denotes the number of (joint) states of the variable(s).

We obtain the second approximation to the test error in Eq. 12 as follows: we assume the (unknown) true distribution p is well-approximated by the functional form of the regularized parameter estimates in Eq. 3 using the optimal value α*:

    p_{x_i,π_i} ≈ p̃_{x_i,π_i},   (16)

where

    p̃_{x_i,π_i} = (N_{x_i,π_i} + α* · q_{x_i,π_i}) / (N + α*) = p̂(x_i,π_i) + (α*/N) (q_{x_i,π_i} − N_{x_i,π_i}/N) + O((α*/N)²).   (17)

This first-order Taylor expansion about the maximum-likelihood estimate (our 'reference point') assumes α* ≪ N, which will be justified at the end of this section. Inserting Eqs. 16 and 17 into Eq. 12 now results in our second approximation,

    T[p(X), p̂(X|G)] ≈ −E_{p̂(X)}[log p̂(X|G)] − (α*/N) Σ_x (q_x − N_x/N) log p̂(x|G) + O((α*/N)²).

Next we equate this approximation with the one in Eq. 13, and finally arrive at an explicit approximation to the optimal ESS-value,

    α* ≈ d_G^eff / ( E_{p̂(X)}[log p̂(X|G)] − E_{q(X)}[log p̂(X|G)] ) + O(α*²/N),   (18)

where E_{q(X)}[log p̂(X|G)] is the expectation with respect to the prior distribution q, analogous to Eq. 14; regarding robustness against zero cell counts, we use N⁺_{x_i,π_i} = max{N_{x_i,π_i}, 1} in place of N_{x_i,π_i} in practice. A small numerical sketch of Eq. 18 is given after the interpretation below.

Note that the denominator in Eq. 18 is indeed positive if the prior distribution q has a larger entropy H than the empirical distribution p̂ does, which is the case for the BDeu score. This is obvious from

    E_{p̂(X)}[log p̂(X|G)] − E_{q(X)}[log p̂(X|G)] = H(q(X|G)) − H(p̂(X|G)) + KL(q(X|G) || p̂(X|G)),

given that the Kullback-Leibler divergence and the entropy H are non-negative.

Interpretation of Eq. 18: This explicit approximation of the optimal ESS provides an understanding of the main properties of the data determining the optimal ESS-value:

· skewness and dependence: the denominator of Eq. 18 tends to increase when the entropy (or negative likelihood) of the empirical distribution factored according to the graph structure, p̂(X|G), decreases: this is the case if there are strong dependencies along the edges, or if the conditional distributions of the variables are very skewed. For such data sets, one can hence expect small values of α*. Conversely, data sets that imply neither a skewed distribution nor extremely strong dependencies will result in increased values of α*.

· sample size: Eq. 18 does not explicitly depend on the sample size N. There is an indirect relation, however: the effective number of parameters, which is a measure of model complexity, tends to increase with N; a more complex optimal model also entails an increased maximum likelihood, and hence this also affects the denominator; one may hence expect both the numerator and the denominator to grow in a similar way as N increases, which suggests a weak dependence of α* on N. Apart from that, when N is sufficiently large, the model complexity tends to approach its 'true' value and not increase further. Thus, for a sufficiently large data set, α* becomes independent of N. Consequently, our earlier assumption α* ≪ N indeed holds for sufficiently large N. Note that this behavior is also consistent with the one expected in the asymptotic limit (N → ∞), namely α*/N → 0, so that the effect of regularization vanishes.

· number of nodes: Eq. 18 implies that α* can be expected to be unaffected by the number n of nodes in the (sparse) graph on average, because both the numerator and the denominator are additive w.r.t. the nodes, and hence on average grow proportionally to each other when adding additional nodes to the network.
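As announced above, here is a small sketch of Eq. 18 (an illustration only; variable and function names are our own, and q is taken to be the uniform BDeu prior): it computes d_G^eff (Eq. 15), the two expectations of the factored log-likelihood, and the resulting approximation, using the zero-cell correction N⁺ = max{N, 1} mentioned above inside the logarithms.

    import numpy as np

    def approx_optimal_ess(data, parents):
        """Analytical approximation to the optimal ESS, cf. Eq. 18 (uniform prior q)."""
        data = np.asarray(data)
        N = len(data)
        cards = data.max(axis=0) + 1
        d_eff = 0.0          # effective number of parameters, Eq. 15
        e_phat = 0.0         # E_{p_hat(X)}[log p_hat(X|G)], Eq. 14
        e_q = 0.0            # E_{q(X)}[log p_hat(X|G)], q uniform over X
        for i, pa in enumerate(parents):
            q_states = int(np.prod([cards[j] for j in pa])) if pa else 1
            r = int(cards[i])
            pa_idx = np.zeros(N, dtype=int)
            for j in pa:
                pa_idx = pa_idx * cards[j] + data[:, j]
            counts = np.zeros((q_states, r))
            np.add.at(counts, (pa_idx, data[:, i]), 1.0)
            d_eff += np.sum(counts > 0) - np.sum(counts.sum(axis=1) > 0)
            counts_plus = np.maximum(counts, 1.0)              # N+ = max{N, 1}
            log_theta = np.log(counts_plus / counts_plus.sum(axis=1, keepdims=True))
            e_phat += np.sum((counts / N) * log_theta)
            e_q += np.mean(log_theta)    # uniform q over the joint states (x_i, pi_i)
        return d_eff / (e_phat - e_q)

    # toy example: X0 -> X1 with skewed, strongly dependent data -> small alpha
    rng = np.random.default_rng(0)
    x0 = (rng.random(500) < 0.9).astype(int)
    x1 = (x0 ^ (rng.random(500) < 0.05)).astype(int)
    data = np.stack([x0, x1], axis=1)
    print(approx_optimal_ess(data, parents=[(), (0,)]))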
4.3 EXPERIMENTS

As an additional confirmation of our assumptions underlying the approximations in Section 4.2, we determined the optimal value α̂ based on our iterative algorithm using Eq. 18 on 20 UCI data sets [Hettich & Bay, 1999]. We used the same pre-processed data as was used in [Silander et al., 2007] (imputation of missing values, discretization of continuous variables), and compared to their exact results. Table 1 summarizes the results, and confirms the validity of our approximation. It is obvious that our approximate α̂ agrees very well with the exact results of [Silander et al., 2007], which were obtained there under heavy computational costs (in our Table 1, α_M is the exact solution of the maximization problem in Eq. 11, and I is the range of ESS-values that all yield the same MAP graph as α_M does [Silander et al., 2007]): while our approximation does not always agree precisely with the exact value α_M, it correctly identifies the data sets where the optimal ESS is about 50, as opposed to about 10 or lower for the remaining data sets. This suggests that our analytical approximation, and the underlying assumptions, capture the main effects that influence the optimal ESS-value. Moreover, note that the sample size N varies between about 100 and 60,000, and the number of variables n ranges from 5 to 15. Table 1 shows that both N and n have no obvious effect on the optimal ESS-value, as expected from our analytical approximation (see discussion in Section 4.2).

Table 1: Experimental validation of our analytical approximations in Section 4.2 (α̂ is our approximation, k the number of iterations until convergence). See text for details.

Data      |     N |  n |        I | α_M |  α̂ | k
balance   |   625 |  5 |   1...100 |  48 | 44 | 1
iris      |   150 |  5 |     1...3 |   2 |  2 | 2
thyroid   |   215 |  6 |     2...2 |   2 |  3 | 5
liver     |   345 |  7 |     3...6 |   4 |  3 | 3
ecoli     |   336 |  8 |    7...10 |   8 |  8 | 3
abalone   |  4177 |  9 |     6...6 |   6 |  7 | 3
diabetes  |   768 |  9 |     3...5 |   4 |  3 | 3
post op   |    90 |  9 |     3...5 |   3 |  3 | 3
yeast     |  1484 |  9 |     1...6 |   6 |  6 | 2
cancer    |   286 | 10 |    6...10 |   8 |  7 | 2
shuttle   | 58000 | 10 |     1...3 |   3 |  3 | 2
tictac    |   958 | 10 |   51...62 |  51 | 60 | 2
bc wisc   |   699 | 11 |    7...15 |   8 |  5 | 3
glass     |   214 | 11 |     5...6 |   6 |  6 | 4
page      |  5473 | 11 |     3...3 |   3 |  3 | 2
heart cl  |   303 | 14 |   13...16 |  13 |  9 | 3
heart hu  |   294 | 14 |     5...6 |   5 |  5 | 3
heart st  |   270 | 14 |    7...10 |  10 | 10 | 4
wine      |   178 | 14 |     8...8 |   8 |  7 | 3
adult     | 32561 | 15 |   48...58 |  50 | 49 | 3
Moreover, note that the results of our two contributions are very similar to each other (see Sections 3 and 4): the decisive properties of the data are the implied dependencies, or--in case of independence--the skewness of the implied distribution. With this insight, it is not surprising that an increase in the optimal ESS-value is related to a reduction in the maximum number of edges in the graph attainable in the experiments of [Silander et al., 2007]: the data sets with a large optimal ESS-value of about 50 are exactly the data sets for which the maximum number of edges in the graph (achieved by increasing the ESS-value) is less than 80% of the edge-count of the complete graph (see the column 'range' in Table 1 of [Silander et al., 2007]). Both have the same cause, namely a data set that implies neither strong dependencies nor skewness.

As an aside, note that our iterative algorithm (cf. Section 4.1, using the approximation in Eq. 18) is computationally efficient, as it converges within a small number of iterations, cf. k in Table 1; as a convergence criterion we required |α_k − α_{k−1}| < 0.1.

5 CONCLUSIONS

This paper presents two contributions that shed light on how the learned graph is affected by the value of the equivalent sample size (ESS).

First, we analyzed theoretically the case of large but finite ESS-values, which complements the results for small values in the literature. Among other results, it was surprising to find that the presence of an edge in a Bayesian network is favoured over its absence even if both the Dirichlet prior and the data imply independence, as long as the conditional empirical distribution is notably different from uniform.

Our second contribution provides an understanding of which properties of the given data determine the optimal ESS-value in a predictive sense (when considered as a free parameter). Our analytical approximation (which we also validated experimentally) shows that the optimal ESS-value is approximately independent of the number of variables in the (sparse) graph and of the sample size. Moreover, the optimal ESS-value is small if the data implies a skewed distribution or strong dependencies along the edges of the graph. Interestingly, this condition concerning the optimal ESS-value is very similar to the one derived for large but finite ESS-values.

Finally, having shown the crucial effect of the ESS-value on the graph structure that maximizes the Bayesian score, this suggests that a similar effect can be expected concerning the posterior distribution over the graphs, and hence for Bayesian model averaging. If one embarks on this popular Bayesian approach, one hence has to choose the prior with great care.

Acknowledgements

I am grateful to R. Bharat Rao for encouragement and support of this work, to Tomi Silander and Petri Myllymäki for providing me with their pre-processed data, and to the anonymous reviewers for valuable comments.

References

[Akaike, 1973] Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. Pages 267-81 of: Petrov, B. N., & Csaki, F. (eds), Second International Symposium on Information Theory.

[Buntine, 1991] Buntine, W. 1991. Theory refinement on Bayesian networks. Pages 52-60 of: D'Ambrosio, B., Smets, P., & Bonissone, P. (eds), Proceedings of the 7th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.

[Chickering, 1996] Chickering, D. M. 1996. Learning Bayesian networks is NP-complete. Pages 121-30 of: Proceedings of the International Workshop on Artificial Intelligence and Statistics.

[Cowell et al., 1999] Cowell, R. G., Dawid, A. P., Lauritzen, S. L., & Spiegelhalter, D. J. 1999. Probabilistic Networks and Expert Systems. Springer.

[Dawid, 1984] Dawid, A. P. 1984. Statistical theory: The prequential approach. Journal of the Royal Statistical Society, Series A, 147, 277-305.

[Haughton, 1988] Haughton, D. M. A. 1988. On the choice of a model to fit data from an exponential family. The Annals of Statistics, 16(1), 342-55.

[Heckerman et al., 1995] Heckerman, D., Geiger, D., & Chickering, D. M. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243.

[Hettich & Bay, 1999] Hettich, S., & Bay, D. 1999. The UCI KDD archive. http://kdd.ics.uci.edu.

[Koivisto & Sood, 2004] Koivisto, M., & Sood, K. 2004. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5, 549-73.

[Rissanen, 1978] Rissanen, J. 1978. Modeling by shortest data description. Automatica, 14, 465-71.

[Schwarz, 1978] Schwarz, G. 1978. Estimating the dimension of a model. The Annals of Statistics, 6(2), 461-64.

[Silander & Myllymäki, 2006] Silander, T., & Myllymäki, P. 2006. A simple approach for finding the globally optimal Bayesian network structure. Pages 445-52 of: Proceedings of the Conference on Uncertainty in Artificial Intelligence.

[Silander et al., 2007] Silander, T., Kontkanen, P., & Myllymäki, P. 2007. On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence.

[Steck & Jaakkola, 2002] Steck, H., & Jaakkola, T. S. 2002. On the Dirichlet prior and Bayesian regularization. In: Advances in Neural Information Processing Systems (NIPS) 15.

[Tsallis, 2000] Tsallis, C. 2000. Entropic nonextensivity: A possible measure of complexity. arXiv:cond-mat/0010150v1.