Explanation Trees for Causal Bayesian Networks

Ulf H. Nielsen¹   uln@zurich.ibm.com
Jean-Philippe Pellet¹,²   jep@zurich.ibm.com
André Elisseeff¹   ael@zurich.ibm.com

¹ Business Optimization Group, IBM Research GmbH, 8803 Rüschlikon, Switzerland
² Machine Learning Group, Swiss Federal Institute of Technology Zurich, 8092 Zurich, Switzerland

Abstract

Bayesian networks can be used to extract explanations about the observed state of a subset of variables. In this paper, we explicate the desiderata of an explanation and confront them with the concept of explanation proposed by existing methods. The necessity of taking causal approaches into account when a causal graph is available is discussed. We then introduce causal explanation trees, based on the construction of explanation trees using the measure of causal information flow (Ay and Polani, 2006). This approach is compared to several other methods on known networks.

1 INTRODUCTION

A Bayesian network (BN, Pearl, 1988) is an algebraic tool to compactly represent the joint probability distribution of a set of variables V by exploiting conditional independence amongst variables. It represents all variables in a directed acyclic graph (DAG), where the absence of arcs between nodes denotes (conditional) independence. In addition to graphically representing the structure of the dependencies between the variables, BNs allow inference tasks to be solved more efficiently. In this paper, we discuss the extraction of explanations in causal BNs (Pearl, 2000; Spirtes et al., 2001), that is, BNs where the arcs depict direct cause-effect relationships between variables.

Generally, explanations in BNs can be classified in three categories (Lacave and Diez, 2002), depending on the focus of the explanation:

· Explanation of evidence. Given a subset of observed (instantiated) variables O ⊆ V, what is the state of (some of) the other variables V \ O that best explains the observed assignment O = o?

· Explanation of the reasoning process. When we have received some evidence and belief states are updated by probabilistic inference, what was the reasoning process by which we arrived at this state?

· Explanation of the model, which provides insight into the static components of a network such as (conditional) independence relationships, causal mechanisms, etc.

We shall focus our attention on the explanation of evidence: we wish to explain why variables in O took on specific observed values, using assignments in V \ O. To this purpose, we discuss in section 2 the requirements of such an explanation. In section 3, we list the standard approaches to evidence explanation as well as some recent methods to make explanations more concise, and explain some of their drawbacks. We then present causal explanation trees in section 4, and detail experiments and comparisons in section 5.

NOTATION. Boldface capitals denote sets of random variables or nodes in a graph, depending on the context. V is the set of all variables in the analysis. Italicized capitals such as X, Xi, Y ∈ V are random variables or nodes, and calligraphic capitals such as 𝒳, 𝒴 are their respective domains. Vectors are denoted by boldface lowercase letters, such as e or p; scalars are in italics. Unless otherwise stated, the scalars x, y are assumed to be a value of their respective uppercase random variable. The probability distribution of a random variable X is denoted p(X), and we write p(X = x) or p(x) for the probability of x. We only work with discrete variables.
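As a concrete illustration of the factorization a BN encodes (a sketch added here for clarity, not part of the original exposition), the following Python fragment stores a small hypothetical three-variable network Rain → Sprinkler, Rain → GrassWet, Sprinkler → GrassWet as conditional probability tables and evaluates the joint probability as the product of each node's CPT entry given its parents. The network and its numbers are purely illustrative.

```python
# Minimal sketch (not from the paper): a discrete Bayesian network stored as
# CPTs, with the joint probability obtained from the DAG factorization
#   p(x_1, ..., x_d) = prod_j p(x_j | pa_j).
# The three-node network and its numbers are purely illustrative.
from itertools import product

# Each variable maps (own value, tuple of parent values) -> probability.
parents = {"Rain": (), "Sprinkler": ("Rain",), "GrassWet": ("Rain", "Sprinkler")}
cpt = {
    "Rain":      {(True, ()): 0.2, (False, ()): 0.8},
    "Sprinkler": {(True, (True,)): 0.01, (False, (True,)): 0.99,
                  (True, (False,)): 0.40, (False, (False,)): 0.60},
    "GrassWet":  {(True, (True, True)): 0.99, (False, (True, True)): 0.01,
                  (True, (True, False)): 0.80, (False, (True, False)): 0.20,
                  (True, (False, True)): 0.90, (False, (False, True)): 0.10,
                  (True, (False, False)): 0.00, (False, (False, False)): 1.00},
}

def joint(assignment):
    """p(v) as the product of p(x_j | pa_j) over all nodes."""
    prob = 1.0
    for var, pars in parents.items():
        pa_vals = tuple(assignment[p] for p in pars)
        prob *= cpt[var][(assignment[var], pa_vals)]
    return prob

# Marginal p(GrassWet = True) by summing the joint over all assignments.
p_wet = sum(joint(dict(zip(parents, vals)))
            for vals in product([True, False], repeat=3)
            if dict(zip(parents, vals))["GrassWet"])
print(p_wet)
```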
2 AN IDEALIZED EXPLANATION

Even though many variables O may be observed, the explanation can be focused on only a specific subset E ⊆ O. The state E = e is then called the explanandum. The set of explanatory variables H ⊆ V can include both observed and unobserved variables, and an explanation is an assignment H = h (compatible with O = o for variables both in H and in O). Suppose we have an explanation H = h for E = e: the intuitive interpretation of this result is that manually setting H = h will be a favourable configuration to observe E = e.

[Figure: the sets involved in an explanation: within the set of all variables V, the explanandum E = e, the observed variables O = o, and the explanatory variables H.]

We insist on the distinction between the explanandum e and the observations o (Chajewska and Halpern, 1997). Observations are all our knowledge about the current state of a system, and this might not coincide exactly with what we want explained. Consider for example the case where we wish to know why the grass is wet while we know it has been raining. We do not seek an explanation for why it has rained, only for why the grass is wet. A perfectly valid explanation is that the grass is wet because it is raining, if no other factors can sufficiently explain the facts. An algorithm respecting this should then determine, for each variable in O, whether its observed state is relevant to explain e, and for each unobserved variable in V \ O, whether knowing its state adds explanatory power to the proposed explanation. This excludes methods which marginalize out O, preventing these variables from being part of an explanation.

To explain why a given system is observed in a given state, we must intuitively convey some information about the causal mechanisms that lead to the observation made. If we observe that it is raining and some explanation tells us that it rains because the grass is wet, we do not find it a good explanation, as it contradicts our understanding of how the system works: the grass being wet cannot make it rain. As Halpern and Pearl (2005) discuss, explanations need to be causal to be consistent with users' knowledge of the mechanisms of the system. It is therefore important that the explanations are given in a data-generating direction, such that users can infer interventional rules from the given explanations (for instance, "if I can make it rain somehow, then I know that the grass will be wet", as opposed to an impossible "let me make the grass wet so as to make it rain").

Causal methods are subject to the availability of causal information. In this paper, we extract the causal information from causal BNs, but in general, the approach is adaptable to any causal model that can predict the effect of interventions on certain variables. In addition to assuming that the relationships between the variables V can be represented by a fully oriented causal BN, we assume that the corresponding joint probability distribution is faithful and causally sufficient (Pearl, 2000; Spirtes et al., 2001). Faithfulness of the distribution ensures that there is a unique graph whose arcs depict all (conditional) dependencies of the distribution, and only those. Causal sufficiency forbids hidden common causes for variables in V, such that we can build a DAG whose arcs represent direct causation. Although most expert-designed BNs are naturally oriented causally, the output of structure learning algorithms is often a partially directed graph and may need additional expert knowledge to be fully oriented.

To summarize, we wish our explanations to give us causal information by detailing the mechanisms that lead to the explanandum, using all the available information we have about the state of the network.
3 EXISTING METHODS

This section reviews and discusses some of the major techniques to find explanations.

3.1 MOST PROBABLE EXPLANATION & VARIANTS

A common noncausal measure of explanatory power is the conditional probability of the explanatory variables H given the explanandum e. The most probable explanation (MPE) approach (Pearl, 1988) then considers as the best explanation h* = arg max_h p(h | e) (or, alternatively, looks for the k best explanations by maximizing this probability). The explanandum e is in the case of MPE equal to the full set of observations O = o, and the set H is V \ E. This list can be long and uninformative because of its lack of conciseness; moreover, it is hard to distinguish between long explanations, whose respective probabilities are low anyway and close to one another.

In the partial abduction approach (Shimony, 1991), the set of explanatory variables we look for is a strict subset H ⊂ V \ E. The set of variables X = V \ H \ E excluded from the explanation is then marginalized out before the maximum is computed: arg max_h Σ_x p(h, x | e). This is the maximum a posteriori (MAP) model approach. The excluded variables X are selected either by a user, or via automated analysis of the network; automatically selecting the relevant explanatory variables is a nontrivial issue (Shimony, 1991). Partial abduction is computationally more expensive than standard MPE, because it cannot be readily solved by message-passing algorithms, but approximations exist (e.g., Park, 2002). On the other hand, it globally leads to more concise explanations than MPE.

Further efforts to make explanations more concise include de Campos et al. (2001), where the k most probable explanations are found and then simplified based on relevance and probabilistic criteria; and Henrion and Druzdzel (1990), where partial assignments are also allowed, but only within a predefined tree that limits the set of possible explanations. An explanation is then a path from the root of the tree to a leaf, denoting variable assignments for each branch taken. This is known as scenario-based explanation. The best explanation is the one with the highest posterior probability.

There are several concerns with these approaches, MPE/MAP or scenario-based, which maximize some conditional probability of the explanatory variables (Chajewska and Halpern, 1997). First, they do not distinguish the explanandum and the observations, such that the additional state information that is not meant to be explained is excluded from a possible explanation. Furthermore, there is no distinction between observing an explanatory variable X in a certain state x and forcing it to have the value x.¹ Thus, depending on the choice of the explanatory variables, the intuitive interpretation (as described in the previous section) stating that setting H = h will be a favourable configuration for observing E = e does not hold. MPEs and, to a lesser extent, MAP model explanations are not robust: little changes in the network will often change the result of the analysis, even though the changes occur in parts of the network largely independent of the explanandum (Chan and Darwiche, 2006). Common to the methods in this subsection is that they order explanations by p(h, e) (this is equivalent to ordering by p(h | e), as p(e) is constant for a given e): this joint probability cannot be considered alone to determine the explanatory power of h on e. Some of these problems are illustrated by the experiments in section 5.

¹ The difference between observation and intervention is fundamental to causality and is best described with the example of Simpson's paradox in Pearl (2000), chap. 6.
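To make the two criteria above concrete, here is a small sketch (not from the paper) of MPE and MAP explanations found by brute-force enumeration over a made-up three-variable joint distribution; the variable names, domains, and probabilities are hypothetical, and real implementations use message passing rather than enumeration.

```python
# Minimal sketch (not from the paper): MPE and MAP by brute-force enumeration
# over a small made-up joint distribution p(V). Maximizing p(h, e) is
# equivalent to maximizing p(h | e), since p(e) is constant for a given e.
from itertools import product

variables = ["A", "B", "C"]          # hypothetical network variables
domains = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}

def joint(a):                        # made-up joint p(A, B, C), sums to 1
    table = {(0, 0, 0): .05, (0, 0, 1): .10, (0, 1, 0): .05, (0, 1, 1): .20,
             (1, 0, 0): .15, (1, 0, 1): .05, (1, 1, 0): .25, (1, 1, 1): .15}
    return table[(a["A"], a["B"], a["C"])]

def assignments(vs):
    return [dict(zip(vs, vals)) for vals in product(*(domains[v] for v in vs))]

def mpe(evidence):
    """MPE: arg max over all non-evidence variables of p(h | e)."""
    hidden = [v for v in variables if v not in evidence]
    return max(assignments(hidden), key=lambda h: joint({**h, **evidence}))

def map_explanation(explanatory, evidence):
    """MAP: arg max_h sum_x p(h, x | e), marginalizing the excluded set X."""
    excluded = [v for v in variables
                if v not in explanatory and v not in evidence]
    def score(h):
        return sum(joint({**h, **x, **evidence}) for x in assignments(excluded))
    return max(assignments(explanatory), key=score)

print(mpe({"C": 1}))                       # full assignment to A and B
print(map_explanation(["A"], {"C": 1}))    # concise explanation over A only
```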
3.2 SE ANALYSIS

In SE analysis, Jensen (2001) additionally considers the sensitivity of an explanation h with respect to the explanandum. Less sensitive explanations ensure that little changes in the network's parameters will not lead to severely different explanations, so that the explanation is stable with respect to the specification of the network. SE analysis also works by comparing two explanations h_i and h_j, usually with Bayes' factor or the likelihood ratio (Jeffreys, 1961):

\text{Bayes' factor} = \frac{\text{posterior ratio}}{\text{prior ratio}} = \frac{p(h_i \mid e) \,/\, p(h_j \mid e)}{p(h_i) \,/\, p(h_j)} = \frac{p(e \mid h_i)}{p(e \mid h_j)}.

The empirical interpretation of Bayes' factor given by Jeffreys (1961) is that if it is less than 1, it is in favor of h_j; if less than 3, it is a slight support for h_i; if it is between 3 and 12, it is a positive support; and if higher than 12, it is a strong support for h_i.

In Yuan and Lu (2007), Bayes' factor is used to search for explanations consisting of only a few variables by ranking them by their Bayes' factor, computed as the ratio between the probability of the explanation given the explanandum and its opposite:

\text{Bayes' factor} = \frac{p(h \mid e)}{1 - p(h \mid e)}.

An exhaustive search is performed over all subsets of the hypothesis, and the explanations are shown to be more concise in a sample network than MPE, Shimony's (1991) MAP, and the simplifications described by de Campos et al. (2001). A similar criticism as before can be applied to these methods: additional observations are discarded, and the causal directionality is ignored in the selection of the relevant explanatory variables.
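The following sketch (not from the paper) evaluates both Bayes'-factor variants above on a made-up joint distribution over two binary variables; the distribution, names, and thresholds used for the qualitative labels are illustrative only, the latter following Jeffreys' (1961) scale as quoted above.

```python
# Minimal sketch (not from the paper): the two Bayes'-factor variants of
# Section 3.2 evaluated on a made-up joint distribution p(H, E) over two
# binary variables. Values and names are illustrative only.

# Joint probabilities p(H = h, E = e), indexed as (h, e).
joint = {(True, True): 0.30, (True, False): 0.10,
         (False, True): 0.15, (False, False): 0.45}

def p_h(h):                      # prior p(h)
    return sum(v for (hh, _), v in joint.items() if hh == h)

def p_h_given_e(h, e):           # posterior p(h | e)
    p_e = sum(v for (_, ee), v in joint.items() if ee == e)
    return joint[(h, e)] / p_e

def jeffreys_bf(h_i, h_j, e):
    """Posterior ratio divided by prior ratio, i.e. p(e | h_i) / p(e | h_j)."""
    posterior_ratio = p_h_given_e(h_i, e) / p_h_given_e(h_j, e)
    prior_ratio = p_h(h_i) / p_h(h_j)
    return posterior_ratio / prior_ratio

def odds_bf(h, e):
    """Yuan-and-Lu-style ratio p(h | e) / (1 - p(h | e))."""
    post = p_h_given_e(h, e)
    return post / (1.0 - post)

def jeffreys_label(bf):
    """Qualitative reading of the factor following Jeffreys (1961)."""
    if bf < 1:
        return "in favor of the alternative"
    if bf < 3:
        return "slight support"
    if bf <= 12:
        return "positive support"
    return "strong support"

bf = jeffreys_bf(True, False, e=True)
print(bf, jeffreys_label(bf), odds_bf(True, e=True))
```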
3.3 EXPLANATION TREES

The method of Flores (2005) constructs a set of best explanations while at the same time giving a preference for concise explanations, summarizing the results of the analysis in an explanation tree. We describe this method in more detail, as the causal explanation tree method (described in section 4) is based on a similar representation.

Definition 1 An explanation tree for an explanandum E = e is a tree in which every node X is an explanatory variable (with X ∈ V \ E), and each branch out of X is a specific instantiation x ∈ 𝒳 of X. A path from the root to a leaf is then a series of assignments X = x, Y = y, ..., Z = z, summarized as P = p, which constitutes a full explanation.

Flores's (2005) algorithm, summarized in Algorithm 1, builds such an explanation tree. Starting with an empty tree, the variable to use as the next node is selected to be the one that, given the explanandum, reduces the uncertainty the most in the rest of the explanatory variables according to some measure. The nodes that are on the path being grown are added to a conditioning set, so that the part of the explanatory space they already account for is taken into account. Two stopping criteria are used to determine when to stop growing the tree: the minimum posterior probability of the constructed path of the current branch, and the minimum amount of uncertainty reduction that must be achieved by adding a new variable. Among all explanations represented by the final tree, the best one is the one with the largest posterior probability p(p | e).

Algorithm 1 Flores's (2005) Explanation Tree
 1: function T = ExplanationTree(H, e, p; α, β)
    Input:   H : set of explanatory variables
             E = e : explanandum
             P = p : path of variable assignments
             α, β : stopping criteria (minimum uncertainty reduction, minimum path probability)
    Output:  T : an explanation tree
 2: X ← arg max_{X ∈ H} Σ_{Y ∈ H} Inf(X; Y | e, p)
 3: if max_{Y ∈ H\X} Inf(X; Y | e, p) < α or p(p | e) < β then
 4:   return ∅
 5: end if
 6: T ← new tree with root X
 7: for each x ∈ domain(X) do
 8:   T' ← ExplanationTree(H \ X, e, p ∪ {x})
 9:   add a branch x to T with subtree T' and
10:   assign it the label p(p, x | e)
11: end for
12: return T

The algorithm is parametrized with the measure of uncertainty reduction Inf(X; Y | e, p).² For our implementation, we used the conditional mutual information. The mutual information I(X; Y | z) is a symmetrical measure of how much reduction in uncertainty about Y we get by knowing X in the context Z = z, defined as:

I(X; Y \mid z) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x \mid z) \, p(y \mid x, z) \log \frac{p(y \mid x, z)}{p(y \mid z)}.   (1)

If X and Y are independent given Z = z, we have I(X; Y | z) = 0. If X fully determines Y, then knowing one is enough to know the other and full information is shared.

² See Flores (2005) for additional cases where the max at line 3 is replaced by min or avg, and Inf is the Gini index.
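As an illustration of (1), here is a sketch (not from the paper) that computes I(X; Y | z) from a table of p(X, Y | z) for one fixed context Z = z; the table and the choice of base-2 logarithms (bits) are assumptions made for the example.

```python
# Minimal sketch (not from the paper): conditional mutual information as in
# Eq. (1), computed from a table of p(x, y | z) for one fixed context Z = z.
# The table below is made up for illustration; base-2 logs give bits.
from math import log2

# p(X = x, Y = y | Z = z) for binary X, Y in a fixed context z.
p_xy_given_z = {(0, 0): 0.40, (0, 1): 0.10,
                (1, 0): 0.15, (1, 1): 0.35}

def cond_mutual_information(p_xy):
    """I(X; Y | z) = sum_x sum_y p(x|z) p(y|x,z) log [ p(y|x,z) / p(y|z) ]."""
    xs = {x for x, _ in p_xy}
    ys = {y for _, y in p_xy}
    p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}
    p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}
    info = 0.0
    for x in xs:
        for y in ys:
            p_y_given_x = p_xy[(x, y)] / p_x[x]
            if p_y_given_x > 0:
                info += p_x[x] * p_y_given_x * log2(p_y_given_x / p_y[y])
    return info

print(cond_mutual_information(p_xy_given_z))  # 0 iff X, Y independent given z
```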
Explanation trees are interesting in that they can present many mutually exclusive explanations in a compact form. Flores (2005) also argues that explanations as constructed by Algorithm 1 are reasonable and more sensible than (k-)MPE in the sense that, on simple networks, the returned explanations are those that we expect. Four elements, however, are subject to discussion.

First, on line 2 of Algorithm 1, variables are added to the tree in order of how much information they provide about the remaining variables in the set of explanatory variables. But this does not measure the information the added variables share with the explanandum. Moreover, the explanandum actually grows as the tree is constructed, since there is no difference between p and e at line 2. Thus, this maximization cannot be interpreted as selecting variables that reduce the uncertainty in the explanandum.

Second, the algorithm makes no distinction between explanandum and observations. To try to fix this, we could either additionally condition on the observations O = o, or marginalize out O altogether. The former case is no different from adding o to the explanandum e, and the latter case excludes all X ∈ O from explanations, such that both cases are unsatisfactory.

Third, the criterion to choose the best explanation is the probability of the explanation path given the evidence, p(p | e), and not how likely the system is to have produced the evidence we are trying to explain with a configuration p, that is, p(e | p). Both measures are linked, but since several explanations can cover almost an equal share of the explanation space and often only one will be included in the explanation tree, the criterion p(p | e) will miss explanations which could have explained the evidence well, but do not cover as large a fraction of the explanation space.

Fourth, causal considerations are ignored: there is no distinction between ancestors and descendants of a variable, such that we can get explanations of the type "it rains because the grass is wet."

From an end-user perspective, though, trees are a good solution for representing several competing explanations compactly and readably. We introduce in section 4 a modified approach, which can address the issues discussed here.

4 CAUSAL EXPLANATION TREES

Like the previous method, causal explanation trees take advantage of a tree representation. The tree is grown so as to ensure that explanations in any path are causal: variables can be selected as explanatory only if they causally influence the explanandum.

Before defining the causal criterion used in this approach, we need to define the concept of postintervention distribution (Pearl, 2000, p. 72). A standard conditional probability of the form p(e | x) gives the probability (or probability density) of e when X = x is observed. It does not represent, however, the probability of e if we manually force variable X to have value x. Causally, we are interested in the intervention on X, which we denote by do(X = x), rather than the observation of x. In causal BNs, the tool used to evaluate the effect of these conditionings is Pearl's (1995) do-calculus, which uses the structure of the causal graph to evaluate the postintervention distribution.

Definition 2 Given a causal Bayesian network B in the sense of Pearl (2000, p. 23) over variables V = {X_1, ..., X_d}, the postintervention distribution p(v | do(X_i = x_i')), also denoted p(v | x̂_i'), after an intervention do(X_i = x_i'), can be expressed as:

p(x_1, \ldots, x_d \mid \hat{x}'_i) = \begin{cases} \prod_{j \neq i} p(x_j \mid pa_j) & \text{if } x_i = x'_i, \\ 0 & \text{if } x_i \neq x'_i, \end{cases}   (2)

where pa_j is the values of the graphical parents (i.e., direct causes) of the node X_j in B.

The truncated factorization of (2) states that the probability distribution is computed as if the manipulated variable X_i had no incoming causal influence (i.e., no direct causes), and as if p(X_i = x_i') had probability one. This makes sense, as forcing X_i to have a certain value effectively ignores its direct causes and natural distribution.
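A small sketch of the truncated factorization (2) on a hypothetical two-node chain follows (not the paper's implementation; the network, names, and numbers are made up): the intervened variable's own CPT factor is dropped and its value is clamped to the forced one.

```python
# Minimal sketch (not from the paper): the post-intervention distribution of
# Eq. (2) by truncated factorization, p(v | do(X_i = x_i')): drop the CPT
# factor of the intervened variable and clamp it to the forced value.
# The chain Smoking -> Cancer and its numbers are made up for illustration.
from itertools import product

parents = {"Smoking": (), "Cancer": ("Smoking",)}
cpt = {
    "Smoking": {(True, ()): 0.3, (False, ()): 0.7},
    "Cancer":  {(True, (True,)): 0.20, (False, (True,)): 0.80,
                (True, (False,)): 0.02, (False, (False,)): 0.98},
}

def post_intervention(assignment, do):
    """p(x_1, ..., x_d | do) = prod over non-intervened j of p(x_j | pa_j),
    and 0 whenever the assignment contradicts the intervention."""
    if any(assignment[var] != val for var, val in do.items()):
        return 0.0
    prob = 1.0
    for var, pars in parents.items():
        if var in do:
            continue  # the intervened variable's own factor is dropped
        prob *= cpt[var][(assignment[var], tuple(assignment[p] for p in pars))]
    return prob

# p(Cancer = True | do(Smoking = True)): here it simply equals the CPT entry
# p(Cancer = True | Smoking = True), since Smoking is Cancer's only parent.
p = sum(post_intervention(dict(zip(parents, vals)), {"Smoking": True})
        for vals in product([True, False], repeat=2)
        if dict(zip(parents, vals))["Cancer"])
print(p)  # 0.20
```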
With this concept, we can now define the causal information flow (Ay and Polani, 2006), which will be our measure of the causal contribution of explanatory variables towards our explanandum.

Definition 3 The causal information flow from X to Y given the interventions do(Z = z), written I(X → Y | ẑ), is:

I(X \to Y \mid \hat{z}) = \sum_{x \in \mathcal{X}} p(x \mid \hat{z}) \sum_{y \in \mathcal{Y}} p(y \mid \hat{x}, \hat{z}) \log \frac{p(y \mid \hat{x}, \hat{z})}{p'(y \mid \hat{z})},   (3)

where p'(y \mid \hat{z}) = \sum_{x' \in \mathcal{X}} p(x' \mid \hat{z}) \, p(y \mid \hat{x}', \hat{z}).

The expression I(X → Y | ẑ) measures the amount of information flowing from X to Y if we intervene on Z, setting it to z (i.e., if we block the causal flow on all paths going through Z). Note that (3) is, in essence, similar to (1). For faithful probability distributions, I(X → Y | ẑ) = 0 if and only if all (if any) directed paths from X to Y go through Z in the corresponding causal graph. For binary variables, if Y is a deterministic function of X regardless of Z, then I(X → Y | ẑ) = 1.

Using the information flow criterion brings us two advantages over standard (conditional) mutual information: first, we automatically only consider variables that can causally influence the explanandum. Second, when selecting the i-th variable on a tree branch, we take into account the previously selected variables Y_1, ..., Y_{i-1} causally, as they enter the conditioning set of variables that have been intervened on.

In our application, we use the causal information flow to decide which variable should be added to the tree being built. This is shown in Algorithm 2. At line 2, we do-condition on the already built path p, and we allow inputting additional observed variables O in an additional conditioning set. We replace Y in (3) with the state of the explanandum e, suppress the corresponding summation, and divide by the prior probability p(e | o, p̂), so that we end up computing

I(X \to e \mid o, \hat{p}) = \sum_{x \in \mathcal{X}} \frac{p(x \mid o, \hat{p}) \, p(e \mid o, \hat{x}, \hat{p})}{p(e \mid o, \hat{p})} \log \frac{p(e \mid o, \hat{x}, \hat{p})}{\sum_{x' \in \mathcal{X}} p(x' \mid o, \hat{p}) \, p(e \mid o, \hat{x}', \hat{p})},

ensuring that the expected value Σ_{e ∈ ℰ} p(e | o, p̂) I(X → e | o, p̂) equals I(X → E | o, p̂).

Using this criterion, the explanation tree is then built as follows: the root node is selected as being arg max_X I(X → e | o), i.e., the node which has the maximum information flow to the state of the explanandum. The important part is that we may condition on o without confusing observation and explanandum. Furthermore, we also allow selection of an observed variable X ∈ O in addition to unobserved variables, consistently with our desiderata. When X ∈ O is observed and we know its value to be x, we must then compute the pointwise causal information flow from X = x:

I(x \to e \mid o', \hat{p}) = \log \frac{p(e \mid o', \hat{p}, \hat{x})}{\sum_{x' \in \mathcal{X}} p(x' \mid o', \hat{p}) \, p(e \mid o', \hat{x}', \hat{p})},

with o' being o without the observation X = x.

The tree is then grown recursively: for each possible value x ∈ 𝒳, a branch is added to the root. For each new leaf, the next explanatory variable is selected as arg max_Y I(Y → e | o, x̂), and so on, where the do-conditioning set always reflects the selected variable values from the root to the current leaf. We use only one stopping criterion, the minimum information flow γ we accept as a causal information contribution. The algorithm furthermore allows explicitly restricting the search set for explanatory variables (defaulting to H = V \ {E}). Finally, each leaf is labeled with log( p(e | o, p̂) / p(e | o) ) (where we make sure that variables selected in p are removed from o if needed). This measures how much performing the interventions p̂ changes the probability of the explanandum (given the observations) with respect to the prior probability of the explanandum. Higher values indicate better explanations; negative values indicate that the probability of the explanandum actually decreases with the proposed explanation.

Algorithm 2 Causal Explanation Tree
 1: function T = CausalExplTree(H, o, e, p̂; γ)
    Input:   H : set of explanatory variables
             O = o : observation set
             E = e : explanandum
             p̂ : path of interventions
             γ : stopping criterion
    Output:  T : a causal explanation tree
 2: X ← arg max_{X ∈ H} I(X → e | o, p̂)
 3: if I(X → e | o, p̂) < γ then return ∅
 4: T ← new tree with root X
 5: for each x ∈ domain(X) do
 6:   T' ← CausalExplTree(H \ {X}, o, e, p̂ ∪ {x̂})
 7:   add a branch x to T with subtree T' and
 8:   assign it the contribution log( p(e | o, p̂, x̂) / p(e | o) )
 9: end for
10: return T
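To make the criterion of line 2 concrete, here is a sketch (not the authors' implementation) of the causal information flow of (3) in the simplest possible setting: X is a root node, there are no observations o and no prior interventions p̂, so that p(e | x̂) reduces to p(e | x) and the pointwise and expected versions can be computed directly from two small tables. The numbers happen to equal p(Recovery | Sex) in the Drug network used later in Section 5, marginalizing over Drug; base-2 logarithms are assumed.

```python
# Minimal sketch (not from the paper): causal information flow I(X -> E) of
# Eq. (3) in the simplest case, where X is a root cause of E, Z is empty and
# nothing is observed, so that p(e | do(x)) = p(e | x).
from math import log2

p_x = {"male": 0.5, "female": 0.5}          # p(x): X has no parents
p_e_given_do_x = {                          # p(e | do(x)) for a binary E
    "male":   {"recovery": 0.625, "death": 0.375},
    "female": {"recovery": 0.275, "death": 0.725},
}

def information_flow(p_x, p_e_do_x):
    """I(X -> E) = sum_x p(x) sum_e p(e|do(x)) log [ p(e|do(x)) / p'(e) ],
    with p'(e) = sum_x' p(x') p(e|do(x'))."""
    states_e = next(iter(p_e_do_x.values())).keys()
    p_prime = {e: sum(p_x[x] * p_e_do_x[x][e] for x in p_x) for e in states_e}
    return sum(p_x[x] * p_e_do_x[x][e] * log2(p_e_do_x[x][e] / p_prime[e])
               for x in p_x for e in states_e)

def pointwise_flow(p_x, p_e_do_x, e):
    """Per-state version used at line 2 of Algorithm 2, here with empty o, p."""
    p_prime_e = sum(p_x[x] * p_e_do_x[x][e] for x in p_x)
    # Since X is a root and o, p are empty, p(e) = sum_x p(x) p(e|x) = p'(e).
    p_e = p_prime_e
    return sum(p_x[x] * p_e_do_x[x][e] / p_e * log2(p_e_do_x[x][e] / p_prime_e)
               for x in p_x)

print(information_flow(p_x, p_e_given_do_x))          # expected flow, in bits
print(pointwise_flow(p_x, p_e_given_do_x, "recovery"))  # flow to E = recovery
```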
In practice, computing a causal information flow of the type I(X → e | o, p̂) at line 2 of Algorithm 2 requires knowing the distributions p(X | o, p̂), p(E | o, p̂), and p(E | o, X̂, p̂) (i.e., p(E | o, x̂, p̂) for all x ∈ 𝒳). Additionally, p(e | o) is needed to label the leaves, but as it is only dependent on e and o, we compute it only once. We can further avoid unnecessary computations by using the graphical reachability criterion from a candidate node X to E, blocking paths going through O ∪ P. The inference steps were implemented using the factor-graph message-passing algorithm (Frey et al., 2001). The complexity of this algorithm, in terms of the number of calls to an inference engine per node in the constructed tree, is O(nd), where n = |H| is the number of explanatory variables and d is the average domain size of the variables, e.g., 2 for binary variables. For comparison, Flores's (2005) approach is O(n²d²).

5 EXPERIMENTS

We compare causal explanation trees (CET) with parameter γ = 0 to Most Probable Explanation (MPE), Bayes' factor (BF) following Yuan and Lu (2007), and standard (noncausal) explanation trees (ET) with parameters α = 0.02 and β = 0. We test the approach on three simple networks³ to compare the relevance of the explanations. A more extended version of these experiments and comments can be found in Nielsen (2007).

³ The conditional probability tables have been omitted in the figures of the two larger BNs due to lack of space and can be found at http://www.zurich.ibm.com/~uln/causalexpl/.

Drug (Figure 1). This network comes from Pearl (2000, chap. 6). It represents the outcome of an experiment designed to check the efficiency of a new drug on male and female patients. The males have a natural recovery rate of 70%; taking the drug decreases it to 60%. Similarly, 30% of females recover naturally, but only 20% when given the drug. Thus, both the absence of drug and being a male can explain a good recovery rate.

Figure 1: The Drug network, with CPTs:
p(Sex): f 0.5, m 0.5.
p(Drug | Sex): f: yes (d) 0.25, no (¬d) 0.75; m: yes (d) 0.75, no (¬d) 0.25.
p(Recovery | Sex, Drug): f, d: recovery 0.2, death 0.8; f, ¬d: recovery 0.3, death 0.7; m, d: recovery 0.6, death 0.4; m, ¬d: recovery 0.7, death 0.3.

In Figure 2, we try to explain a recovery. All approaches correctly realize that Sex = m largely accounts for the recovery. However, ET selects Sex = m, Drug = yes as the best explanation according to the leaves' labels, just like MPE. This contradicts the natural idea of explanation, since the drug has a negative impact on the recovery. CET labels the leaves more sensibly: branches where the drug was not given have a higher rank. Moreover, the branches where Sex = f have a negative label, indicating that they actually decrease the probability of recovery. Although the first two BF explanations are sensible, the third one mistakenly selects Drug = yes as an explanation.

Figure 2: Drug: explain Recovery = recovery.
(a) Causal explanation tree: Sex = female: -0.710 (then Drug = yes: -1.170, Drug = no: -0.585); Sex = male: 0.474 (then Drug = yes: 0.415, Drug = no: 0.637).
(b) Explanation tree: Sex = female: 0.306 (then Drug = yes: 0.056, Drug = no: 0.250); Sex = male: 0.694 (then Drug = yes: 0.500, Drug = no: 0.194).
(c) MPE: p(Sex = m, Drug = yes | Recovery = rec.) = 0.5.
(d) BF: BF(Sex = m) = 2.27; BF(Sex = m, Drug = no) = 1.68; BF(Drug = yes) = 1.25.
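As a worked check of the labels in Figure 2(a) (a sketch, not the authors' implementation), the following computes p(recovery | do(Sex), do(Drug)) directly from the CPTs of Figure 1 and labels each branch with the log-ratio to the prior p(recovery) = 0.45. Since Sex is a root and Drug's only parent is Sex, intervening on both reduces to a CPT lookup, and intervening on Sex alone averages over the Drug CPT; base-2 logarithms are assumed, which matches the reported values.

```python
# Worked sketch (not the authors' implementation): reproducing the causal
# explanation tree labels of Figure 2(a) for the Drug network of Figure 1.
# Branch/leaf labels are log( p(e | do(path)) / p(e) ), with base-2 logs.
from math import log2

p_sex = {"m": 0.5, "f": 0.5}
p_drug_given_sex = {"f": {"yes": 0.25, "no": 0.75},
                    "m": {"yes": 0.75, "no": 0.25}}
p_rec_given_sex_drug = {("f", "yes"): 0.2, ("f", "no"): 0.3,
                        ("m", "yes"): 0.6, ("m", "no"): 0.7}

# Prior p(Recovery = recovery), with no interventions: 0.45.
p_rec = sum(p_sex[s] * p_drug_given_sex[s][d] * p_rec_given_sex_drug[(s, d)]
            for s in p_sex for d in ("yes", "no"))

def p_rec_do(sex=None, drug=None):
    """p(recovery | do(Sex = sex), do(Drug = drug)); Sex is a root, so
    do(Sex = s) coincides with conditioning, and do(Drug = d) cuts the
    Sex -> Drug arc, leaving a lookup or an average over the Drug CPT."""
    if sex is not None and drug is not None:
        return p_rec_given_sex_drug[(sex, drug)]
    if sex is not None:
        return sum(p_drug_given_sex[sex][d] * p_rec_given_sex_drug[(sex, d)]
                   for d in ("yes", "no"))
    raise ValueError("intervene on at least Sex in this sketch")

for s in ("f", "m"):
    print(f"Sex = {s}: {log2(p_rec_do(sex=s) / p_rec):+.3f}")
    for d in ("yes", "no"):
        print(f"  Drug = {d}: {log2(p_rec_do(sex=s, drug=d) / p_rec):+.3f}")
# Expected: f -0.710 (yes -1.170, no -0.585); m +0.474 (yes +0.415, no +0.637)
```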
Academe (Figure 3). This network depicts the relationships between various marks given to students following a course. The Final mark is determined by Other outside factors and an intermediate mark (T.P. mark), which is in turn determined by the student's abilities in Theory and Practice as well as Extra curricular activities in the tested subject.

Figure 3: The Academe network. Nodes: Theory {good, average, bad}, Practice {good, average, bad}, Extra {yes, no}, T.P. mark {pass, fail}, Other {positive, negative}, Global mark {pass, fail}, Final mark {pass, fail}.

In Figure 4, the explanandum was set to Final mark = fail; i.e., we want to explain why a student failed the course. T.P. mark and Global mark have been excluded from the possible explanatory variables in the two tree algorithms, as they are modeling artifacts.

ET tells us that Theory = bad is the best explanation. We could have expected Practice to also be part of alternate explanations, as it influences the final mark very similarly to Theory. This is what CET does, including Practice to explain the final failure when Theory is average or good. MPE includes Practice = good in its long list of states, which does not seem intuitively likely. BF provides more concise explanations but, like ET, ignores Practice altogether, although a bad practice can account for failure equally well.

Figure 4: Academe: explain Final mark = fail.
(a) Causal explanation tree and (b) explanation tree (tree diagrams over Theory, Practice, and Other; numeric branch labels omitted).
(c) MPE: p(Theory = bad, T.P. mark = fail, Global mark = fail, Extra = no, Other = positive, Practice = good | Final mark = fail) = 0.208.
(d) BF: BF(Theory = bad) = 3.02; BF(Theory = bad, Extra = no) = 2.78; BF(Theory = bad, Other = negative) = 2.53.

Asia (Figure 5). This network (Lauritzen and Spiegelhalter, 1988) models the relationships between two indicators, X-ray results and dyspnea, of severe diseases for a person. Tuberculosis (more likely if a visit to Asia occurred) and lung cancer (more likely when the person smokes) both increase abnormal X-ray results and dyspnea; bronchitis also causes increased dyspnea. TbOrCa is a modeling artifact, excluded from the analysis in the two tree algorithms.

Figure 5: The Asia network. Nodes: Visit to Asia {yes, no}, Tuberculosis {yes, no}, Smoking {yes, no}, Lung Cancer {yes, no}, TbOrCa {yes, no}, Bronchitis {yes, no}, X-Ray {normal, abnormal}, Dyspnea {yes, no}.

In Figure 6, we try to explain abnormal X-ray results. Whereas both tree algorithms select Lung cancer as the best explanation, they differ on how to explain the abnormal X-rays when it is absent: CET selects, justifiably, Tuberculosis, but ET uses Dyspnea and then Bronchitis, which are not causes of X-ray and cannot explain it, especially not when we know that no lung cancer is present. MPE surprisingly excludes a visit to Asia, when this is expected to make abnormal X-rays more likely through Tuberculosis. BF, while still providing very concise explanations, is stuck on the middle node TbOrCa and, like ET, ignores the important Tuberculosis.

Figure 6: Asia: explain X-ray = abnormal.
(a) Causal explanation tree and (b) explanation tree (tree diagrams over Lung cancer, Tuberculosis, Dyspnea, and Bronchitis; numeric branch labels omitted).
(c) MPE: p(Bronchitis = yes, Dyspnea = yes, Lung cancer = yes, TbOrCa = yes, Smoker = yes, Tuberculosis = yes, Visit to Asia = no | X-ray = abnormal) = 0.24.
(d) BF: BF(TbOrCa = yes) = 19.60; BF(TbOrCa = yes, Visit to Asia = no) = 19.21; BF(TbOrCa = yes, Lung cancer = yes) = 16.42.
In Figure 7, we try to explain the presence of dyspnea for a smoker. While CET is still able to select Smoker as an explanatory variable, ET can only add this observation to the explanandum and thus cannot select it. Instead, the best explanation according to ET is normal X-rays, which does not seem very likely. Although ET does select the important Lung cancer and Tuberculosis, it ignores the largest factor according to BF and CET, namely Bronchitis. Here too, CET selects the more intuitively interpretable explanations.

Figure 7: Asia: explain Dyspnea = yes | Smoker = yes.
(a) Causal explanation tree and (b) explanation tree (tree diagrams over Smoking, X-ray result, Bronchitis, Lung cancer, and Tuberculosis; numeric branch labels omitted).
(c) MPE: p(Bronchitis = yes, X-ray = normal, Lung cancer = no, TbOrCa = no, Smoker = yes, Tuberculosis = no, Visit to Asia = no | Dyspnea = yes) = 0.46.
(d) BF: BF(Bronchitis = yes) = 6.14; BF(Bronchitis = yes, Visit to Asia = no) = 5.89; BF(Bronchitis = yes, Tuberculosis = no) = 5.84.

6 CONCLUSION

We have presented an approach to explanation in causal BNs: causal explanation trees. Explanations are presented as a tree, compactly representing several explanations and making the result more readable than a (possibly long) list. Assuming that the BN is causal allows us to use the causal information flow criterion to build the tree. This leads to more sensible explanations, in that we only explain a given state with variables that can causally influence it. The approach makes an explicit distinction between observation and explanandum. This lets the user input all available knowledge about the network as observations, while still focusing on explaining one of them, and allows the observed variables to be selected as part of a good explanation. The algorithm labels the leaves so as to reflect how a proposed explanation changes the probability of the explanandum, making the tree easy to interpret. Causal explanation trees, unlike other techniques, do not condition on the explanandum to maximize the probability of the explanatory variables p(h | e), but focus on p(e | h) instead, by means of the causal information flow. This allows them to compare favorably to MPE, Bayes' factor, and Flores's (2005) noncausal explanation trees on the tested networks: the returned explanations are intuitive, and those with a positive label in the tree are ensured to increase the probability of the explanandum.
References

Ay, N. & Polani, D. (2006). Information flows in causal networks. Technical report, Max Planck Institute for Mathematics in the Sciences.
Chajewska, U. & Halpern, J. Y. (1997). Defining explanation in probabilistic systems. In: UAI-97, pages 62-71. Morgan Kaufmann.
Chan, H. & Darwiche, A. (2006). On the robustness of most probable explanations. In: UAI-06.
de Campos, L. M., Gámez, J. A., & Moral, S. (2001). Simplifying explanations in Bayesian belief networks. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 9.
Flores, M. J. (2005). Bayesian Networks Inference: Advanced Algorithms for Triangulation and Partial Abduction. PhD thesis, Universidad de Castilla-La Mancha.
Frey, B. J., Kschischang, F. R., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47.
Halpern, J. Y. & Pearl, J. (2005). Causes and explanations: A structural-model approach. Part II: Explanations. The British Journal for the Philosophy of Science.
Henrion, M. & Druzdzel, M. J. (1990). Qualitative propagation and scenario-based scheme for exploiting probabilistic reasoning. In: UAI, pages 17-32.
Jeffreys, H. (1961). Theory of Probability. Oxford University Press.
Jensen, F. V. (2001). Bayesian Networks and Decision Graphs. Springer.
Lacave, C. & Diez, F. J. (2002). A review of explanation methods for Bayesian networks. Knowledge Engineering Review, 17(2):107-127.
Lauritzen, S. L. & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50:157-224.
Nielsen, U. H. (2007). On causal explanations in Bayesian networks. Master's thesis, IT University of Copenhagen.
Park, J. D. (2002). MAP complexity results and approximation methods. In: UAI-02, pages 388-396.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669-709.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
Shimony, S. E. (1991). Explanation, irrelevance, and statistical independence. In: AAAI, pages 482-487.
Spirtes, P., Glymour, C., & Scheines, R. (2001). Causation, Prediction, and Search, Second Edition. The MIT Press.
Yuan, C. & Lu, T.-C. (2007). Finding explanations in Bayesian networks. In: The 18th International Workshop on Principles of Diagnosis.