Structure Learning of Bayesian Networks using Constraints

Cassio P. de Campos cassio@idsia.ch Dalle Molle Institute for Artificial Intelligence (IDSIA), Galleria 2, Manno 6928, Switzerland Zhi Zeng Qiang Ji Rensselaer Polytechnic Institute (RPI), 110 8th St., Troy NY 12180, USA zengz@rpi.edu jiq@rpi.edu

Abstract
This paper addresses exact learning of Bayesian network structure from data and expert's knowledge based on score functions that are decomposable. First, it describes useful properties that strongly reduce the time and memory costs of many known methods such as hill-climbing, dynamic programming and sampling variable orderings. Secondly, a branch and bound algorithm is presented that integrates parameter and structural constraints with data in a way to guarantee global optimality with respect to the score function. It is an any-time procedure because, if stopped, it provides the best current solution and an estimation about how far it is from the global solution. We show empirically the advantages of the properties and the constraints, and the applicability of the algorithm to large data sets (up to one hundred variables) that cannot be handled by other current methods (limited to around 30 variables).

1. Introduction
A Bayesian network (BN) is a probabilistic graphical model that relies on a structured dependency among random variables to represent a joint probability distribution in a compact and efficient manner. It is composed by a directed acyclic graph (DAG) where nodes are associated to random variables and conditional probability distributions are defined for variables given their parents in the graph. Learning the graph (or structure) of a BN from data is one of the
Appearing in Proceedings of the 26 th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

most challenging problems in such models. Best exact known methods take exponential time on the number of variables and are applicable to small settings (around 30 variables). Approximate procedures can handle larger networks, but usually they get stuck in local maxima. Nevertheless, the quality of the structure plays a crucial role in the accuracy of the model. If the dependency among variables is not properly learned, the estimated distribution may be far from the correct one. In general terms, the problem is to find the best structure (DAG) according to some score function that depends on the data (Heckerman et al., 1995). There are other approaches to learn a structure that are not based on scoring (for example taking some statistical similarity among variables), but we do not discuss them in this paper. The research on this topic is active, e.g. (Chickering, 2002; Teyssier & Koller, 2005; Tsamardinos et al., 2006). Best exact ideas (where it is guaranteed to find the global best scoring structure) are based on dynamic programming (Koivisto et al., 2004; Singh & Moore, 2005; Koivisto, 2006; Silander & Myllymaki, 2006), and they spend time and memory proportional to n · 2n , where n is the number of variables. Such complexity forbids the use of those methods to a couple of tens of variables, mostly because of memory consumption. In the first part of this paper, we present some properties of the problem that bring a considerable improvement on many known methods. We perform the analysis over some well known criteria: Akaike Information Criterion (AIC), and the Minimum Description Length (MDL), which is equivalent to the Bayesian Information Criterion (BIC). However, results extrapolate to the Bayesian Dirichlet (BD) scoring (Cooper & Herskovits, 1992) and some derivations under a few assumptions. We show that the search space of possible structures can be reduced drastically without losing the global optimality guarantee and that the memory requirements are very small in many practical cases

Structure Learning of Bayesian Networks using Constraints

(we show empirically that only a few thousand scores are stored for a problem with 50 variables and one thousand instances). As data sets with many variables cannot be efficiently handled (unless P=NP, as the problem is known to be NP-hard (Chickering et al., 2003)), a desired property of a method is to produce an any-time solution, that is, the procedure, if stopped at any moment, provides an approximate solution, while if run until it finishes, a global optimum solution is found. However, the most efficient exact methods are not any-time. We propose a new any-time exact algorithm using a branch-andbound (B&B) approach with caches. Scores are computed during the initialization and a poll is built. Then we perform the search over the possible graphs iterating over arcs. Although iterating over orderings is probably faster, iterating over arcs allows us to work with constraints in a straightforward way. Because of the B&B properties, the algorithm can be stopped at any-time with a best current solution found so far and an upper bound to the global optimum, which gives a kind of certificate to the answer and allows the user to stop the computation when she believes that the current solution is good enough. (Suzuki, 1996) has proposed a B&B method, but it is not a global exact algorithm, instead the search is conducted after a node ordering is fixed. Our method does not rely on a predefined ordering and finds a global optimum structure considering all possible orderings.

Given a complete data set D = {D1 , . . . , DN } of with N instances, with Dt = {xk1 , . . . , xkn } a instance of n,t 1,t all variables, the goal of structure learning is to find a G that maximizes a score function such as MDL or AIC. max sD (G) = max(LD () - t · W ),
G 

where  represents all parameters of the model (and n thus depends on the graph G), t = i=1 (qi · (ri - 1)) is the number of free parameters, W is criterion-specific (W = log N in MDL and W = 1 in AIC), and LD is 2 the log-likelihood function:
n qi ri
ijk ijk ,

LD () = log
i=1 j=1 k=1

n

(1)

where nijk indicates how many elements of D contain both xk and paj . This function can be writi i n ten as LD () = i=1 LD,i (i ), where LD,i (i ) = qi ri j=1 k=1 nijk log ijk . From now on, the subscript D is omitted for simplicity. An important property of such criteria is that they are decomposable, that is, they can be applied to each n node Xi separately: maxG s(G) = maxG i=1 si (PAi ), where si (PAi ) = Li (PAi )-ti (PAi )·W , with Li (PAi ) = maxi Li (i ) (i is the parameter vector related to Xi , so it depends on the choice of PAi ), and ti (PAi ) = qi ·(ri -1). Because of this property and to avoid computing such functions several times, we create a cache that contains si (PAi ) for each Xi and each parent set PAi . Note that this cache may have an exponential size on n, as there are 2n-1 subsets of {X1 , . . . , Xn } \ {Xi } to be considered as parent sets. This gives a total space and time of O(n · 2n ) to build the cache. Instead, the following results show that this number is much smaller in many practical cases. Lemma 1 Let Xi be a node of G , a DAG for a BN where PAi = J . Suppose J  J is such that si (J) > si (J ). Then J is not the parent set of Xi in the optimal DAG. Proof. Take a graph G that differs from G only on PAi = J, which is also a DAG (as the removal of some arcs does not create cycles) and s(G) = j=i sj (PAj )+si (J) > j=i sj (PAj )+si (J ) = s(G ). Hence any DAG G such that PAi = J has a subgraph G with a better score than G , and thus J is not the optimal parent configuration for Xi . Lemma 1 is quite simple but very useful to discard elements from the cache of Xi . However, it does not tell anything about supersets of J , that is, we still need to compute all the possible parent configurations

2. Bayesian networks
A BN represents a single joint probability density over a collection of random variables. It can be defined as a triple (G, X , P), where G = (VG , EG ) is a DAG with VG a collection of n nodes associated to random variables X (a node per variable), and EG a collection of arcs; P is a collection of conditional probability densities p(Xi |PAi ) where PAi denotes the parents of Xi in the graph (PAi may be empty), respecting the relations of EG . We assume throughout that variables are categorical. In a BN every variable is conditionally independent of its non-descendants given its parents (Markov condition). This structure induces a joint probability distribution by the expression p(X1 , . . . , Xn ) = i p(Xi |PAi ). Before proceeding, we define some notations. Let ri  2 be the number of discrete categories of Xi , qi the number of elements in PAi (the number of configurations of the parent set, that is, qi = Xt PAi rt ) and  be the entire vector of parameters such as ijk = p(xk |paj ), where i i i  {1, . . . , n}, j  {1, ..., qi }, k  {1, ..., ri } (hence xk  Xi and paj  PAi ). i i

Structure Learning of Bayesian Networks using Constraints

and later verify which of them can be removed. Next theorems handle this issue. Theorem 1 Using MDL or AIC as score function and assuming N  4, take G and G DAGs such that G is a subgraph of G . If G is such that jPAi rj  N , for some Xi , and Xi has a proper superset of parents in G w.r.t. G, then G is not an optimal structure. Proof.1 Take a DAG G such that J = PAi for a node Xi , and take G equal to G except that it contains an extra node in J new = PAi , that is, in G we have J new = J {Xe }. Note that the difference in the scores of the two graphs are restricted to si (·). In the graph G , Li (J new ) will certainly not decrease and ti (J new ) will increase, both with respect to the values for G. The difference in the scores will be si (J new ) - si (J), which equals to Li (J new ) - ti (J new ) - (Li (J) - ti (J)) 
qi ri

(all combinations up to log N parents). Although it does not help us to improve the theoretical size bound, Lemma 2 gives us even less elements. Lemma 2 Let Xi be a node with J  J two possible parent sets such that ti (J ) + si (J) > 0. Then J and all supersets J  J are not optimal parent configurations for Xi . Proof. Because Li (·) is a negative function, ti (J ) + si (J) > 0  -ti (J ) - si (J) < 0  (Li (J ) - ti (J )) - si (J) < 0  si (J ) < si (J). Using Lemma 1, we have that J is not the optimal parent set for Xi . The result also follows for any J  J, as we know that ti (J ) > ti (J ). Thus, the idea is to check the validity of Lemma 2 every time the score of a parent set J of Xi is about to be computed, discarding J and all supersets whenever possible. This result allows us to stop computing scores for J and all its supersets. Lemma 1 is stronger, but regards a comparison between exactly two parent configuration. Nevertheless, Lemma 1 can be applied to the final cache to remove all certainly useless parent configurations. As we see in Section 5, the practical size of the cache after these properties is small even for large networks. Lemma 1 is also valid for other decomposable functions, including BD and derivations (e.g. BDe, BDeu), so the benefits shall apply to those scores too, and the memory requirements will be reduced. The other theorems need assumptions about the initial N and the choice of priors. Further discussion is left for future work because of lack of space.

-
j=1 i=1 qi ri

nijk log ijk - ti (J new ) + ti (J)  nijk nijk log nij nij
qi

nij
j=1

-
i=1

- ti (J new ) + ti (J) 

nij H(ij ) - ti (J new ) + ti (J) 
j=1 qi

nij log ri - qi · (re - 1) · (ri - 1) · W
j=1

The first step uses the fact that Li (J new ) is negative, n ^ the second step uses that fact that ijk = nijk , with ij ri nij = nijk , is the value that maximizes Li (·), i=1 and the last step uses the fact that the entropy of a discrete distribution is less than the log of its number of categories. Finally, G is a better graph than G if the last equation is negative, which happens if qi · (re - 1) · (ri - 1) · W  N log ri . Because ri  2  ri - 1  log ri , and N  4  log N  1 (the W of the 2 MDL case), we have that qi = jJ rj  N ensures that si (J new ) < si (J), which implies that the graph G cannot be optimal. Corollary 1 In the optimal structure G, each node has at most O(log N ) parents. Proof. It follows directly from Theorem 1 and the fact that ri  2, for all Xi . Theorem 1 and Corollary 1 ensures that the cache n-1 stores at most O( log N ) elements for each variable
1 Another similar proof appears in (Bouckaert, 1994), but it leads directly to the conclusion of Corollary 1. The intermediate result is algorithmically important.

3. Constraints
An additional way to reduce the space of possible DAGs is to consider some constraints provided by experts. We work with two main types of constraints: constraints on parameters that define rules about the probability values inside the local distributions of the network, and structural constraints that specify where arcs may or may not be included. 3.1. Parameter Constraints We work with a general definition of parameter constraint, where any convex constraint is allowed. If i,PAi is the parameter vector of the node Xi with parent set PAi , then a convex constraint is defined as h(i,PAi )  0, where h : i,PAi  R is a convex function over i,PAi . This definition includes many well known constraints, for example from Qualitative Probabilistic Networks (QPN) (Wellman, 1990): qualitative influences define some knowledge about the state of

Structure Learning of Bayesian Networks using Constraints

a variable given the state of another, which roughly means that observing a greater state for a parent Xa of a variable Xb makes more likely to have greater states in Xb (for any parent configuration except for Xa ). For . example, bj2 2  bj1 2 , where jk = {xk , paj } and j is a b an index ranging over all parent configurations except for Xa . In this case, observing x2 makes more likely to a have x2 . A negative influence is obtained by replacing b the inequality operator  by , and a zero influence is obtained by changing inequality to an equality. Other constraints such as synergies (Wellman, 1990) are also linear and local to a single node. Although we allow the parameter constraints that are general, we have the following restriction about them: if a constraint is specified for a node Xi and a set of parents J, then the actual parent set PAi has to be a superset of J. Furthermore, we have a peculiar interpretation for each constraint C as follows: if J  PAi (proper subset), then the parameter constraint must hold for all configurations of the parents of Xi that do not belong to J. For example, suppose X1 has X2 and X3 as parents (all of them binary), and the following constraint h was defined on X1 : p(x2 |x2 x2 ) + 2 · p(x2 |x2 x1 )  1. If a new node X4 1 2 3 1 2 3 is included as parent of X1 , the constraint h becomes the two following constraints: p(x2 |x2 x2 x1 ) + 2 · p(x2 |x2 x1 x1 )  1 2 3 4 1 2 3 4 1,

· arc(Xi , Xj ) indicates that the node Xi must be a parent of Xj . · Operators or () and not (¬) are used to form the rules. The and operator is not explicitly used as we assume that each constraint is in disjunctive normal form. For example, the constraints i=c,j=c ¬arc(Xi , Xj ) and indegree(Xc , 0, eq) impose that only arcs from node Xc to the others are possible, and that Xc is a root node, that is, a Naive Bayes structure will be learned. The procedure will also act as a feature selection procedure by letting some variables unlinked. Note that the symbol  just employed is not part of the language but is used for easy of expose (in fact it is necessary to write down every constraint defined by such construction). As another example, the constraints j=c indegree(Xj , 3, lt), indegree(Xc , 0, eq), and j=c indegree(Xj , 0, eq)arc(Xc , Xj ) ensure that all nodes have Xc as parent, or no parent at all. Besides Xc , each node may have at most one other parent, and Xc is a root node. This learns the structure of a Tree-augmented Naive (TAN) classifier, also performing a kind of feature selection (some variables may end up unlinked). In fact, it learns a forest of trees, as we have not imposed that all variables must be linked. 3.3. Dealing with constraints All constraints in previous examples can be imposed during the construction of the cache, because they involve just a single node each. In essence, parent sets of a node Xi that do violate some constraint are not stored in the cache, and this can be checked during the cache construction. On the other hand, constraints such as arc(X1 , X2 )  arc(X2 , X3 ) cannot be imposed in that stage, as they impose a non-local condition (the arcs go to distinct variables, namely X2 and X3 ), because the cache construction is essentially a local procedure with respect to each variable. Such constraints that involve distinct nodes can be verified during the B&B phase, so they are addressed later. Regarding parameter constraints, we compute the scores using a constrained optimization problem, i.e. maximize the score function subject to simplex equality constraints and all parameter constraints defined by the user. max Li (i ) - ti (PAi )
i

p(x2 |x2 x2 x2 ) + 2 · p(x2 |x2 x1 x2 )  1, 1 2 3 4 1 2 3 4 that is, h holds for each state of X4 . For example if another parent X5 is included, then four constraints would be enforced with all possible combinations. This interpretation for constraints is in line with the definition of qualitative constraints of QPNs, and most importantly, it allows us to treat the constraints in a principled way for each set of parents. It means that the constraint must hold for all configurations of parents not involved in the constraint, which can be also interpreted as other parents are not relevant and the constraint is valid for each one of their configurations. 3.2. Structural constraints Besides probabilistic constraints, we work with structural constraints on the possible graphs. These constraints help to reduce the search space and are available in many situations. We work with the following rules: · indegree(Xj , k, op), where op  {lt, eq} and k an integer, means that the node Xj must have less than (when op = lt) or equal to (when op = eq) k parents.

subject to

j=1...qi z=1...mhi
ri k=1 ijk

gij (ij ) = 0, hiz (i )  0,

(2)

where gij (ij ) = -1+ imposes that distributions defined for each variable given a parent configura-

Structure Learning of Bayesian Networks using Constraints

tion sum one over all variable states, and the mhi convex constraints hiz define the space of feasible parameters for the node Xi . This is possible because: (1) we have assumed that a constraint over p(xk |xk11 , . . . , xktt ) i i i forces Xi1 , . . . , Xit  PAi , that is, when a parameter constraint is imposed, the parent set of the node must contain at least the variables involved in the constraint; (2) the optimization is computed for every possible parent set, that is, PAi is known in the moment to write down the optimization problem, which is solved for each Xi and each set PAi . We use the optimization package of (Birgin et al., 2000). Theorem 2 Using MDL or AIC as score function and assuming N  4, take G and G as DAGs such that G is a subgraph of G . Suppose that both G and G respect the same set of parameter and structural constraints. If G is such that jPAi rj  N , for some Xi , and Xi has a proper superset of parents in G w.r.t. G, then G is not an optimal structure. Proof. Just note that all derivations in Theorem 1 are also valid in the case of constraints. The only difn ^ ference that deserves a comment is ijk = nijk , which ij may be an unfeasible point for the optimization (2), because the latter contains parameter constraints that might reduce the parameter space (besides the normal constraints of the maximum log-likelihood problem). ^ As ijk is just used as an upper value for the loglikelihood function, and the constrained version can just obtain smaller objective values than the unconn strained version, nijk is an upper bound also for the ij constrained case. Thus, the derivation of Theorem 1 is valid even with constraints. Corollary 1 and Lemmas 1 and 2 are also valid in this setting. The proof of Corollary 1 is straightforward, as it only depends on Theorem 1, while for Lemmas 1 and 2 we need just to ensure that all the parent configurations that are discussed there respect the constraints.

between nodes, a mark stating that the arc must be present, or is prohibited, or is free (may be present or not); Q is a priority queue of triples (G, H, s), ordered by s (initially it contains a single triple with G, H and s just mentioned; and finally (Gbest , sbest ) is the best DAG and score found so far (sbest is initialized with -). The main loop is as follows: While Q is not empty, do 1. Remove the peek (Gcur , Hcur , scur ) of Q. If s  sbest (worst than an already known solution), then start the loop again. If Gcur is a DAG and satisfies all structural constraints, update (Gbest , sbest ) with (Gcur , scur ) and start the loop again. 2. Take v = (Xa1  Xa2  . . .  Xaq+1 ), with a1 = aq+1 , is a directed cycle of Gcur . 3. For y = 1, . . . , q, do · Mark on Hcur that the arc Xay  Xay+1 is prohibited. · Recompute (G, s) from (Gcur , scur ) such that the parents of Xay+1 in G comply with this restriction and with Hcur . Furthermore, the subgraph of G formed by arcs that are demanded by Hcur (those that have a mark must exist) must comply with the structural constraints (it might be impossible to get such graph. In such case, go to the last bullet). Use the values in the cache C(Xay+1 , PAay+1 ) to avoid recomputing scores. · Include the triple (G, Hcur , s) into Q. · Mark on Hcur that the arc Xay  Xay+1 must be present and that the sibling arc Xay+1  Xay is prohibited, and continue. The algorithm uses a B&B search where each case to be solved is a relaxation of a DAG, that is, they may contain cycles. At each step, a graph is picked up from a priority queue, and it is verified if it is a DAG. In such case, it is a feasible structure for the network and we compare its score against the best score so far (which is updated if needed). Otherwise, there must be a directed cycle in the graph, which is then broken into subcases by forcing some arcs to be absent/present. Each subcase is put in the queue to be processed. The procedure stops when the queue is empty. Note that every time we break a cycle, the subcases that are created are independent, that is, the sets of graphs that respect H for each subcase are disjoint. We obtain this fact by properly breaking the cycles: when v = (Xa1  Xa2  . . .  Xaq+1 ) is detected, we create q

4. Constrained B&B algorithm
In this section we describe the B&B algorithm used to find the best structure of the BN and comment on its complexity, correctness, and some extensions and particular cases. The notation (and initialization of the algorithm) is as follows: C : (Xi , PAi )  R is the cache with the scores for all the variables and their possible parent configurations (using Theorem 1 and Lemmas 1 and 2 to have a reduced size); G is the graph created taking the best parent configuration for each node without checking for acyclicity (so it is not necessarily a DAG), and s is the score of G; H is an initially empty matrix containing, for each possible arc

Structure Learning of Bayesian Networks using Constraints

subcases such that the first does not contain Xa1  Xa2 (but may contain the other arcs of that cycle), the second case certainly contains Xa1  Xa2 , but Xa2  Xa3 is prohibited (so they are disjoint because of the difference in the presence of the first arc), and so on such that the y-th case certainly contains Xay  Xay +1 for all y < y and prohibits Xay  Xay+1 . This idea ensures that we never process the same graph twice. So the algorithm runs at most i |C(Xi )| steps, where |C(Xi )| is the size of the cache for Xi . B&B can be stopped at any time and the current best solution as well as an upper bound for the global best score are available. This stopping criterion might be based on the number of steps, time and/or memory consumption. Moreover, the algorithm can be easily parallelized. We can split the content of the priority queue into many different tasks. No shared memory needs to exist among tasks if each one has its own version of the cache. The only data structure that needs consideration is the queue, which from time to time must be balanced between tasks. With a messagepassing idea that avoids using process locks, the gain of parallelization is linear in the number of tasks. As far as we know, best known exact methods are not easily parallelized, they do not deal with constraints, and they do not provide lower and upper estimates of the best structure if stopped early. If run until it ends, the proposed method gives a global optimum solution for the structure learning problem. Some particular cases of the algorithm are worth mentioning. If we fix an ordering for the variables such that all the arcs must link a node towards another non-precedent in the ordering (this is a common idea in many approximate methods), the proposed algorithm does not perform any branch, as the ordering implies acyclicity, and so the initial solution is already the best. The performance would be proportional to the time to create the cache. On the other hand, bounding the maximum number of parents of a node is relevant only for hardest inputs, as it would imply a bound on the cache size, which is already empirically small.

data sets are: adult (15 variables and 30162 instances), car (7 variables and 1728 instances) letter (17 variables and 20000 instances), lung (57 variables and 27 instances), mushroom (23 variables and 1868 instances), nursery (9 variables and 12960 instances), Wisconsin Diagnostic Breast Cancer or wdbc (31 variables and 569 instances), zoo (17 variables and 101 instances). No constraints are employed in this phase as we intend to show the benefits of the properties earlier discussed. Table 1 presents the cache construction results, applying Theorem 1 and Lemmas 1 and 2. Its columns show the data set name, the number of steps the procedure spends to build the cache (a step equals to a call to the score function for a single variable and a parent configuration), the time in seconds, the size of the generated cache (number of scores stored, the memory consumption is actually O(n) times that number), and finally the size of the cache if all scores were computed. Note that the reduction is huge. Although in the next we are going to discuss three distinct algorithms, the benefits of the application of these results imply in performance gain for other algorithms in the literature to learn BN structures. It is also possible to analyze the search space reduction implied by these results by looking columns 2 and 3 of Table 2.
Table 1. Cache sizes (number of stored scores) and time (in seconds) to build them for many networks and data sizes. Steps represent the number of local (single node given a parent set) score evaluations. name steps time(s) size n2n adult 30058 182.09 672 217.9 car 335 0.09 24 28.8 letter 534230 2321.46 41562 220.1 lung 43592 1.33 3753 261.8 mushroom 140694 72.13 8217 226.5 nursery 1905 3.94 49 211.2 wdbc 1692158 351.04 7482 235 zoo 9118 0.31 1875 220.1

5. Experiments
We perform experiments to show the benefits of the reduced cache and search space and the gains of constraints.2 First, we use data sets available at the UCI repository (Asuncion & Newman, 2007). Lines with missing data are removed and continuous variables are discretized over the mean into binary variables. The
2 The software is available online through the web address http://www.ecse.rpi.edu/cvrl/structlearning.html

In Table 2, we show results of three distinct algorithms: the B&B described in Section 4, the dynamic programming (DP) idea of (Silander & Myllymaki, 2006), and an algorithm that picks variable orderings randomly and then find the best structure such that all arcs link a node towards another that is not precedent in the ordering. This last algorithm (named OS) is similar to K2 algorithm with random orderings, but it is always better because a global optimum is found for each ordering.3 The scores obtained by each algorithm (in percentage against the value obtained by B&B) and
We have run a hill-climbing approach (which is also benefited by ideas presented in this paper), but its accuracy was worse than OS. We omit it because of lack of space.
3

Structure Learning of Bayesian Networks using Constraints Table 2. Comparison of MDL scores among B&B, dynamic programming (DP), and ordering sampling (one thousand times). Fail means that it could not solve the problem within 10 million steps. DP and OS scores are in percentage w.r.t. the score of B&B (positive percentage means worse than B&B and negative percentage means better). search reduced B&B DP OS network space space score gap time(s) score time(s) score time(s) adult 2210 271 -286902.8 5.5% 150.3 0.0% 0.77 0.1% 0.17 car 242 210 -13100.5 0.0% 0.01 0.0% 0.01 0.0% 0.01 letter 2272 2188 -173716.2 8.1% 574.1 -0.6% 22.8 1.0% 0.75 lung 23192 2330 -1146.9 2.5% 907.1 Fail Fail 1.0% 0.13 mushroom 2506 2180 -12834.9 15.3% 239.8 Fail Fail 1.0% 0.12 nursery 272 217 -126283.2 0.0% 0.04 0.0% 0.04 0.0% 0.04 wdbc 2930 2216 -3053.1 13.6% 333.5 Fail Fail 0.8% 0.13 zoo 2272 2111 -773.4 0.0% 5.2 0.0% 3.5 1.0% 0.03

the corresponding spent time are presented (excluding the cache construction). A limit of ten million steps is given to each method (steps here are considered as the number of queries to the cache). It is also presented the reduced space where B&B performs its search, as well as the maximum gap of the solution. This gap is obtained by the relaxed version of the problem. So we can guarantee that the global optimal solution is within this gap (even though the solution found by the B&B may already be the best, as shown in the first line of the table). With the reduced cache presented here, finding the best structure for a given ordering is very fast, so it is possible to run OS over millions of orderings in a short period of time.
Table 3. B&B procedure learning TANs. Time (in seconds) to find the global optimum, cache size (number of stored scores) and (reduced) space for B&B search. network time(s) cache size space adult 0.26 114 239 car 0.01 14 26.2 letter 0.32 233 261 lung 0.26 136 251 mushroom 0.71 398 288 nursery 0.06 26 212 wdbc 361.64 361 299

will not provide any solution, while B&B still gives an approximation and a bound to the global optimum. About OS, if we sample one million times instead of one thousand as done before, its results improve and the global optimum is found also for adult and mushroom sets. Still, OS provides no guarantee or estimation about how far is the global optimum (here we know it has achieved the optimum because of the exact methods). It is worth noting that both DP and OS are benefited by the smaller cache. Table 3 shows the results when we employ constraints to force the final network to be a Tree-augmented Naive Bayes (zoo was run, but it is not included because the unconstrained learned network was already TAN). Here the class is isolated in the data set and constraints are included as described in Section 3.2. Note that the cache size, the search space and consequently the time to solve the problems have all decreased. Finally, Table 4 has results for random data sets with predefined number of nodes and instances. A randomly created BN with at most 3n arcs is used to sample the data. Because of that, we are able to generate random parameter and structural constraints that are certainly valid for this true BN (approximately n/2 constraints for each case). The table contains the total time to run the problem and the size of the cache, together with the percentage of gain when using constraints. Note that the code was run in parallel with a number of tasks equals to n, otherwise an increase by a factor of n must be applied to the results in the table. We can see that the gain is recurrent in all cases (the constrained version has also less gap in all cases, although such number is not shown).

Some additional comments are worth. DP can solve the mushroom set in less than 10 minutes if we drop the limit of steps. The expectation for wdbc is around four days. Hence, we cannot expect to obtain an answer in larger cases, such as lung. It is clear that, in worst case, the number of steps of DP is smaller than that of B&B. Nevertheless, B&B eventually bounds some regions without processing them, provides an upper bound at each iteration, and does not suffer from memory exhaustion as DP. This makes the method applicable even to very large settings. Still, DP seems a good choice for small n. When n is large (more than 35), DP will not finish in reasonable time, and hence

6. Conclusions
This paper describes a novel algorithm for learning BN structure from data and expert's knowledge. It integrates structural and parameter constraints with

Structure Learning of Bayesian Networks using Constraints Table 4. Results on random data sets generated from random networks. Time to solve (10 million steps) and size of the cache are presented for the normal unconstrained case and the percentage of gain when using constraints. nodes(n)/ unconstrained constrained gain gap time(s) cache time cache instances 30/100 0% 0.06 125 67% 11.6% 0% 2.7 143 47.5% 26.5% 30/500 50/100 0% 0.26 310 31.4% 16.1% 50/500 0% 20.66 231 57.2% 29.8% 70/100 0% 4.58 1205 36.9% 18.8% 70/500 1.1% 356.9 666 38.4% 21.9% 100/100 0.5% 9.05 2201 47.5% 23.5% 1.4% 1370.4 726 50.2% 33.0% 100/500

Nonmonotone spectral projected gradient methods on convex sets. SIAM J. on Optimiz., 10, 1196­1211. Bouckaert, R. (1994). Properties of bayesian belief network learning algorithms. Conf. on Uncertainty in Artificial Intelligence (pp. 102­109). M. Kaufmann. Chickering, D., Meek, C., & Heckerman, D. (2003). Large-sample learning of bayesian networks is nphard. Conf. on Uncertainty in Artificial Intelligence (pp. 124­13). M. Kaufmann. Chickering, D. M. (2002). Optimal structure identification with greedy search. J. of Machine Learning Research, 3, 507­554. Cooper, G., & Herskovits, E. (1992). A bayesian method for the induction of probabilistic networks from data. Mach. Learning, 9, 309­347. Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning bayesian networks: The combination of knowledge and statistical data. Mach. Learning, 20, 197­243. Koivisto, M. (2006). Advances in exact bayesian structure discovery in bayesian networks. Conf. on Uncertainty in Artificial Intelligence (pp. 241­248) AUAI Press. Koivisto, M., Sood, K., & Chickering, M. (2004). Exact bayesian structure discovery in bayesian networks. J. of Machine Learning Research, 5, 2004. Silander, T., & Myllymaki, P. (2006). A simple approach for finding the globally optimal bayesian network structure. Conf. on Uncertainty in Artificial Intelligence. (pp. 445­452) AUAI Press. Singh, A. P., & Moore, A. W. (2005). Finding optimal bayesian networks by dynamic programming (Technical Report). Carnegie Mellon Univ. CALD-05-106. Suzuki, J. (1996). Learning bayesian belief networks based on the minimum description length principle: An efficient algorithm using the B&B technique. Int. Conf. on Machine Learning (pp. 462­470). Teyssier, M., & Koller, D. (2005). Ordering-based search: A simple and effective algorithm for learning bayesian networks. Conf. on Uncertainty in Artificial Intelligence. (pp. 584­590) AUAI Press. Tsamardinos, I., Brown, L. E., & Aliferis, C. (2006). The max-min hill-climbing bayesian network structure learning algorithm. Mach. Learning, 65, 31­78. Wellman, M. P. (1990). Fundamental concepts of qualitative probabilistic networks. Artificial Intelligence, 44, 257­303.

data through a B&B procedure that guarantees global optimality with respect a decomposable score function. It is an any-time procedure in the sense that, if stopped early, it provides the best current solution found so far and a maximum error of such solution. The software is available as described in the experiments. We also describe properties of the structure learning problem based on scoring DAGs that enable the B&B procedure presented here as well as other methods to work over a reduced search space and memory. Such properties allow the construction of a cache with all possible local scores of nodes and their parents without large memory consumption. Because of the properties and the characteristics of the B&B method, even without constraints the B&B is more efficient than state-of-the-art exact methods for large domains. We show through experiments with randomly generated data and public data sets that problems with up to 70 nodes can be exactly processed in reasonable time, and problems with 100 nodes are handled within a small worst case error. These results surpass by far current methods, and may also help to improve other approximate methods and may have interesting practical applications, which we will pursue in future work.

Acknowledgments
This work is supported in part by the grant W911NF06-1-0331 from the U.S. Army Research Office. The first author thanks also the project Ticino in Rete.

References
Asuncion, A., & Newman, D. (2007). UCI machine learning repository. http://www.ics.uci. edu/~mlearn/MLRepository.html. Birgin, E. G., Mart´ inez, J. M., & Raydan, M. (2000).