Refractor Importance Sampling

Haohai Yu and Robert A. van Engelen
Department of Computer Science, Florida State University, Tallahassee, FL 32306-4530, USA
{hyu,engelen}@cs.fsu.edu

Abstract

In this paper we introduce Refractor Importance Sampling (RIS), an improvement to reduce error variance in Bayesian network importance sampling propagation under evidential reasoning. We prove the existence of a collection of importance functions that are close to the optimal importance function under evidential reasoning. Based on this theoretic result we derive the RIS algorithm. RIS approaches the optimal importance function by applying localized arc changes to minimize the divergence between the evidence-adjusted importance function and the optimal importance function. The validity and performance of RIS are empirically tested with a large set of synthetic Bayesian networks and two real-world networks.

1 Introduction

The Bayesian Network (BN) [Pearl, 1988] formalism is one of the dominant representations for modeling uncertainty in intelligent systems [Neapolitan, 1990, Russell and Norvig, 1995]. A BN is a probabilistic graphical model of a joint probability distribution over a set of statistical variables. Bayesian inference on a BN answers probabilistic queries about the variables and their influence relationships. The posterior probability distribution is computed using belief updating methods [Pearl, 1988, Guo and Hsu, 2002]. Exact inference is NP-hard [Cooper, 1990]. Thus, exact methods only admit relatively small networks or simple network configurations in the worst case. Approximations are also NP-hard [Dagum and Luby, 1993]. However, approximate inference methods have anytime [Garvey and Lesser, 1994] and/or anywhere [Santos et al., 1995] properties that make these methods more attractive compared to exact methods.

Stochastic simulation algorithms, also called stochastic sampling or Monte Carlo (MC) algorithms, form one of the most prominent subclasses of approximate inference algorithms, of which Logic Sampling [Henrion, 1988] was the first and simplest sampling algorithm. Likelihood weighting [Fung and Chang, 1989] was designed to overcome the poor performance of logic sampling under evidential reasoning with unlikely evidence. Markov Chain Monte Carlo (MCMC) forms another important group of stochastic sampling algorithms. Examples in this group are Gibbs sampling, Metropolis sampling, and hybrid-MC sampling [Geman and Geman, 1984, Gilks et al., 1996, MacKay, 1998, Pearl, 1987, Chavez and Cooper, 1990]. Stratified sampling [Bouckaert, 1994], hypercube sampling [Cheng and Druzdzel, 2000c], and quasi-MC methods [Cheng and Druzdzel, 2000b] generate random samples from uniform distributions using various methods to improve sampling results. The importance sampling methods [Rubinstein, 1981] are widely used in Bayesian inference. Self Importance Sampling (SIS) [Shachter and Peot, 1990] and Adaptive Importance Sampling (AIS-BN) [Cheng and Druzdzel, 2000a] are among the most effective algorithms.

In this paper we prove that the importance functions of an evidence-updated BN can only approach the optimal importance function when the BN graph structure is modified according to the observed evidence. This implies the existence of a collection of importance functions with minimum divergence to the optimal importance function under evidential reasoning. Based on this result we derive our Refractor Importance Sampling (RIS) class of algorithms.
In contrast to the AIS-BN and SIS methods, RIS removes the lower bound that prevents the updated importance function from approaching the optimal importance function. This is achieved by a graphical structure "refractor", consisting of a localized network structure change that minimizes the divergence between the evidence-adjusted importance function and the optimal importance function.

The remainder of this paper is organized as follows. Section 2 proves the existence of a lower bound on the divergence to the optimal importance function under evidential reasoning with a BN. The lower bound is used to derive the class of RIS algorithms introduced in Section 3. Section 4 empirically verifies the properties of the RIS algorithms on a large set of synthetic networks and two real-world networks, and compares the results to other importance sampling algorithms. Finally, Section 5 summarizes our conclusions and describes our future work.

2 Importance Function Divergence

In this section we first give BN definitions and briefly review importance sampling. We then give a KL-divergence lower bound for importance sampling error variance. We prove the existence of a collection of importance functions that approach the optimal importance function by adjusting both the quantitative and qualitative components of a BN under dynamic updating with evidence.

2.1 Definitions

The following definitions and notations are used.

Def. 1 A Bayesian network BN = (G, Pr) is a DAG G = (V, A) with vertices V and arcs A, A ⊆ V × V. Pr is the joint probability distribution over the discrete random variables (vertices) V defined by Pr(V) = Π_{V ∈ V} Pr(V | π(V)), where π(V) denotes the set of parents of a vertex V. The conditional probability tables (CPT) of the BN assign values to Pr(V | π(V)) for all V ∈ V. The graph G induces the d-separation criterion [Pearl, 1988], denoted by ⟨X, Y | Z⟩, which implies that X and Y are conditionally independent in Pr given Z, with X, Y, Z ⊆ V.

Def. 2 Let BN = (G, Pr) be a Bayesian network.
· The combined parent set of X ⊆ V is defined by π(X) = (∪_{X ∈ X} π(X)) \ X.
· Let An(·) denote the transitive closure of π(·), i.e. the ancestor set of a vertex. The combined ancestor set of X ⊆ V is defined by An(X) = (∪_{X ∈ X} An(X)) \ X.
· Let δ : V → IN denote a topological order of the vertices such that Y ∈ An(X) ⟹ δ(Y) < δ(X). The ahead set of a vertex X ∈ V given δ is defined by Ah(X) = {Y ∈ V | δ(Y) < δ(X)}.

2.2 Importance Sampling

Importance sampling is an MC method to improve the convergence speed and reduce the error variance with probability density functions. Let g(X) be a function of m variables X = {X_1, ..., X_m} over the domain IR^m, such that computing g(X) for any X is feasible. Consider the problem of approximating I = ∫ g(X) dX using a sampling technique. Importance sampling approaches this problem by rewriting I = ∫ [g(X)/f(X)] f(X) dX, where f(X) is a probability density function over IR^m, often referred to as the importance function. In order to achieve the minimum error variance σ²_f(X) = (∫ |g(X)| dX)² − I², the importance function should be f(X) = |g(X)| (∫ |g(X)| dX)^{-1}, see [Rubinstein, 1981]. Note that when g(X) > 0 the optimal probability density function is f(X) = g(X) I^{-1} and σ²_f(X) = 0. It is obvious that in most cases it is impossible to obtain the optimal importance function.

The SIS [Shachter and Peot, 1990] and AIS-BN [Cheng and Druzdzel, 2000a] sampling algorithms are effective methods for approximate Bayesian inference. These methods attempt to approach the optimal importance function through learning by dynamically adjusting the importance function during sampling with evidence. To this end, AIS-BN heuristically changes the CPT values of a BN, a technique that has been shown to significantly improve the convergence rate of the approximation to the exact solution. We use the following definitions for sake of exposition.

Def. 3 Let BN = (G, Pr) be a Bayesian network with G = (V, A) and evidence e for variables E ⊆ V. A posterior BN^e of the BN is some (new) network defined as BN^e = (G^e, Pr^e) with graph G^e over variables V \ E, such that BN^e exactly models the posterior joint probability distribution Pr^e = Pr(· | e).

A typical example of a posterior BN^e is a BN combined with an updated posterior state as defined by exact inference algorithms, e.g. using evidence absorption [van der Gaag, 1996]. Approximations of BN^e are used by importance sampling algorithms. These approximations consist of the original BN with all evidence vertices ignored from further consideration.
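To make the role of such an approximate importance function concrete, the following minimal sketch (illustrative only, not the implementation evaluated in this paper; the variable names and CPT numbers are invented) estimates a posterior marginal on the two-vertex network A → B with evidence B = 1 by sampling A from its prior CPT and weighting each sample by the likelihood of the evidence:

```python
import random

pr_a = {0: 0.7, 1: 0.3}                      # CPT of A: Pr(A)
pr_b_given_a = {0: {0: 0.9, 1: 0.1},         # CPT of B: Pr(B | A=0)
                1: {0: 0.2, 1: 0.8}}         #           Pr(B | A=1)

def estimate_posterior_a1(n_samples=100_000, b_obs=1):
    """Importance sampling with the evidence vertex B clamped to b_obs."""
    weight_sum = 0.0
    weighted_hits = 0.0
    for _ in range(n_samples):
        a = 0 if random.random() < pr_a[0] else 1    # sample A ~ Pr(A)
        w = pr_b_given_a[a][b_obs]                   # importance weight Pr(B=b_obs | A=a)
        weight_sum += w
        weighted_hits += w * (a == 1)
    return weighted_hits / weight_sum                # estimate of Pr(A=1 | B=b_obs)

print(estimate_posterior_a1())   # exact value: 0.3*0.8 / (0.7*0.1 + 0.3*0.8) ~ 0.774
```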
Def. 4 Let BN = (G, Pr) be a Bayesian network with G = (V, A) and evidence e for variables E ⊆ V. The evidence-simplified ESBN_e of BN is defined by ESBN_e = (G_e, Pr_e), where G_e = (V_e, A_e), V_e = V \ E, and A_e = {(X, Y) | (X, Y) ∈ A, Y ∉ E}.

The joint probability distribution Pr_e of an evidence-simplified BN approximates Pr^e. For example, SIS and AIS-BN adjust the CPTs of the original BN.

2.3 KL-Divergence Bounds

We give a lower bound on the KL-divergence [Kullback, 1959] of the evidence-simplified Pr_e from the exact Pr^e. The lower bound is valid for all variations of Pr_e, including those generated by importance sampling algorithms that adjust the CPT.

Theorem 1 Let ESBN_e = (G_e, Pr_e) be an evidence-simplified BN given evidence e for E ⊆ V. If Pr_e(V | π_e(V)) = Pr(V | π(V), e) for all V ∈ V, then the KL-divergence between Pr_e and Pr^e is minimal and given by

Σ_{X ∈ X} Σ_{cfg(X, π(X))} Pr(x, π(x) | e) ln Pr(x | π(x)) + Σ_{X ∈ X} Σ_{cfg(X, π(X))} Pr(x, π(x) | e) ln [1 / Pr_e(x | π_e(x))] + Σ_{cfg(π(E))} Pr(π(e) | e) ln Pr(e | π(e)) − ln Pr(e),   (1)

where X = V \ E and cfg(·) ranges over the configurations of its argument.

Proof. See Appendix A. □

Theorem 1 bounds the error variance from below, which is empirically verified for SIS and AIS-BN in the results of Section 4. The divergence in Eq. (1) is zero when specific conditions are met, as stated below.

Corollary 1 Let ESBN_e = (G_e, Pr_e) be an evidence-simplified BN given evidence e for E ⊆ V. If π(E) ∩ (V \ E) = ∅, then Pr_e = Pr^e.

Proof. See Appendix B. □

Hence, the optimal importance function is obtained when all evidence vertices are clustered as roots in G.
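Conversely, when an evidence vertex has non-evidence parents, the minimal divergence of Theorem 1 is in general no longer zero. The following sketch (binary variables with invented CPT values, purely illustrative) evaluates that minimum for the converging connection A → E ← B with evidence on E: the evidence-simplified network keeps A and B disconnected, so the divergence-minimizing choice prescribed by Theorem 1 is the product of the posterior marginals, which cannot represent the dependence between A and B induced by observing their common child.

```python
import itertools, math

pr_a = [0.6, 0.4]                                   # Pr(A)
pr_b = [0.5, 0.5]                                   # Pr(B)
pr_e1 = {(0, 0): 0.05, (0, 1): 0.6,                 # Pr(E=1 | A, B)
         (1, 0): 0.7,  (1, 1): 0.95}

# Exact posterior Pr(A, B | E=1).
joint = {(a, b): pr_a[a] * pr_b[b] * pr_e1[(a, b)]
         for a, b in itertools.product([0, 1], repeat=2)}
z = sum(joint.values())
post = {ab: p / z for ab, p in joint.items()}

# Best evidence-simplified approximation: product of the posterior marginals.
marg_a = [post[(0, 0)] + post[(0, 1)], post[(1, 0)] + post[(1, 1)]]
marg_b = [post[(0, 0)] + post[(1, 0)], post[(0, 1)] + post[(1, 1)]]
approx = {(a, b): marg_a[a] * marg_b[b] for a, b in itertools.product([0, 1], repeat=2)}

kl = sum(p * math.log(p / approx[ab]) for ab, p in post.items())
print(f"minimal KL-divergence (Theorem 1): {kl:.4f}")   # strictly positive here
```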
We will now show how Pr_e can approach the optimal Pr^e without restrictions. For sake of explanation, the following widely-held assumption is reiterated:

Assumption 1 The topological order δ of a BN and the order δ^e of its posterior version BN^e are consistent. That is, δ^e(Y) < δ^e(X) ⟺ δ(Y) < δ(X) for all X, Y ∈ V \ E.

Assumption 1 is reasonable for the following facts: 1. According to the chain rule, a BN can be built up in any topological order and all of these orders describe the same joint probability distribution. 2. Although there has never been a widely accepted definition of what causality is, it is widely accepted that observing evidence for random variables should not change the causality relationship between the variables.

Theorem 2 Let BN^e = (G^e, Pr^e) be the posterior of a BN = (G, Pr) given evidence e for E ⊆ V. If X ∈ V \ E and X ∉ An(E), then Pr^e(X | Ah_e(X)) = Pr(X | π(X)). The evidence vertices in π(X) take the configurations fixed by e, that is, Pr(X | π(X)) = Pr(X | π(X) \ E, e_1, ..., e_m) for all e_i ∈ π(X) ∩ E.

Proof. See [Cheng and Druzdzel, 2000a]. □

Hence, to compute the posterior probability of a vertex that is not an ancestor of an evidence vertex, there is no need to change the parents of the vertex or its CPT. For vertices that are ancestors of evidence vertices, we use Bayes' formula and d-separation to explore the effects of evidence on those vertices. Without loss of generality, only one evidence vertex is considered. The result applies to an evidence vertex set by transitivity.

Lemma 1 Let BN^e = (G^e, Pr^e) be the posterior of a BN = (G, Pr) given evidence e = {e} for E ∈ V. Let X ∈ An(E). Then,

Pr^e(X | Ah_e(X)) = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] Pr(X | π(X)).

Proof. Because Ah_e(X) = Ah(X) by Assumption 1, we have

Pr^e(X | Ah_e(X)) = Pr^e(X, Ah(X)) / Pr^e(Ah(X)) = Pr(X, Ah(X) | e) / Pr(Ah(X) | e) = [Pr(e | X, Ah(X)) Pr(X, Ah(X))] / [Pr(e | Ah(X)) Pr(Ah(X))] = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] Pr(X | Ah(X)) = [Pr(e | X, Ah(X)) / Pr(e | Ah(X))] Pr(X | π(X)),

by using Theorem 2. □

Theorem 2 and Lemma 1 show that if we have Pr(e | X, Ah(X)) and Pr(e | Ah(X)) for all X ∈ An(E), we can derive Pr^e(X | Ah_e(X)) to compute

Pr^e(V) = Π_{V ∉ An(E)} Pr(V | π(V)) Π_{V ∈ An(E)} Pr^e(V | Ah_e(V))

for the optimal importance function. However, there are two problems in deriving Pr^e. Firstly, Ah(X) is too large to construct a posterior BN^e for Pr^e in practice. Secondly, instead of the exact Pr^e(X | Ah_e(X)) we only have an estimate obtained by importance sampling.

The parent sets of X ∈ An(E) can be minimized by exploiting d-separation. Let Γ_e(X) ⊆ {X} ∪ Ah(X) denote the minimal vertex set that d-separates the evidence E and {X} ∪ Ah(X), thus ⟨E, {X} ∪ Ah(X) | Γ_e(X)⟩. We will refer to Γ_e(X) as the "shield" of X given E. Let Λ_e(X) ⊆ Ah(X) denote the minimal vertex set that d-separates the evidence E and Ah(X), thus ⟨E, Ah(X) | Λ_e(X)⟩. We explore the relationship between Γ_e(X) and Λ_e(X) below.

Lemma 2 Let BN^e = (G^e, Pr^e) be the posterior of a BN = (G, Pr) given evidence e = {e} for E ∈ V. Then, Λ_e(X) ⊆ (Γ_e(X) \ {X}) ∪ π(X) for all X ∈ An(E).

Proof. See Appendix C. □

Therefore, we can approach the optimal importance function Pr^e(X | Ah_e(X)) by estimation of Pr^e(X | (Γ_e(X) \ {X}) ∪ π(X)) from importance samples.

3 Refractor Importance Sampling

The RIS algorithm modifies the BN structure according to the shield Γ_e(X) for vertices X ∈ An(E) by expanding the parent set of X and adjusting its CPT accordingly. Visually in the graph, RIS refracts arcs away from the evidence vertices, which inspired the choice of name for the method. The algorithms and general procedure of RIS are introduced in this section.

3.1 Computing the Shield

Alg. 1 computes Γ_e(X) in O(|A|) worst-case time, assuming An(·) is determined in unit time (e.g. using a lookup table). Function topSort topologically sorts the set Ah(X) by the topological order δ over V of the BN. Note that the shield Γ_e(X) can be computed in advance for each X ∈ V given the evidence nodes E.

    Input: evidence E ⊆ V and X ∈ An(E)
    Output: the set S = Γ_e(X)
    Data: array A, queue Q
    A ← topSort(Ah(X)); S ← {X};
    for i ← |A| downto 1 do
        Q ← ∅; push(Q, A[i]);
        while Q ≠ ∅ do
            V ← pop(Q);
            if V ∈ E then S ← S ∪ {A[i]}; break;
            if V ∉ S ∧ V ∈ An(E) ∧ X ∉ An(V) then push(Q, children(V));
        end
    end
Algorithm 1: Computing the Shield Γ_e(X)
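Algorithm 1 treats An(·), children(·), and topSort as primitives. A small helper sketch for precomputing these lookup tables (assuming the DAG is given as a mapping from each vertex to its list of parents; all names are illustrative) could look as follows:

```python
from collections import defaultdict
from graphlib import TopologicalSorter

def preprocess(parents):
    """parents: dict mapping every vertex to the list of its parents."""
    children = defaultdict(list)                 # children(V) lookup
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)
    order = list(TopologicalSorter(parents).static_order())   # parents before children
    delta = {v: i for i, v in enumerate(order)}  # topological order delta(V)
    an = {}                                      # ancestor sets An(V)
    for v in order:
        an[v] = set()
        for p in parents[v]:
            an[v] |= {p} | an[p]
    return children, delta, an

# Example: the chain A -> B -> E.
children, delta, an = preprocess({"A": [], "B": ["A"], "E": ["B"]})
print(an["E"], delta)    # {'A', 'B'} and a consistent topological order
```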
3.2 Refractor Procedure

Alg. 2 modifies the graphical structure of the BN. The time complexity of this algorithm is O(|V||A|) if |E| ≪ |V|, otherwise it is O(|V|²|A|). The CPT of a vertex X is updated by populating the entries for the expanded parent set (Γ_e(X) \ {X}) ∪ π(X) using sampling data (described in Section 3.4).

    Input: BN = (G, Pr), evidence e for E ⊆ V
    Output: refractored BN_e
    foreach E ∈ E do
        foreach X ∈ An(E) do
            expand the parent set of X to (Γ_e(X) \ {X}) ∪ π(X);
            update the CPT of X;
        end
    end
Algorithm 2: Refractor Procedure

Figure 1: Refractor Example

Fig. 1 shows an example BN refractored using Alg. 2. E is the evidence node. Here, Γ_e(C) = {A} and Γ_e(B) = {A}. Arcs A → B and A → C are added. Note that arc A → B adjusts for the fact that the influence relationship between A and B has changed through evidence E. Arc E → D is no longer required and can be removed as in [van der Gaag, 1996].

3.3 General RIS Procedure

RIS utilizes both the qualitative and quantitative properties of a BN to approach the optimal importance function. The general procedure of RIS is:
1. The structure of the BN is modified by Algorithms 1 and 2. The CPTs of (a subset of) the ancestor vertices of evidence vertices are expanded.
2. Update the CPT values through some specific learning algorithm (see Section 3.4 for details).
3. Sample the BN with an importance sampling algorithm using the new importance function.

3.4 Variations of RIS

Step 1 modifies the BN structure significantly, especially when the ancestor sets of evidence vertices are large, e.g. when evidence vertices are leafs. This increases the complexity of the BN. However, the effect of evidence on other vertices is attenuated when the path length between the evidence and the vertices increases [Henrion, 1989]. Therefore, instead of modifying all ancestors An(E) of evidence E in Step 1, it is generally sufficient to select a subset of ancestors such as the combined parent set π(E).

Steps 1 and 2 are independent, because any importance function learning algorithm can be applied in Step 2. Steps 2 and 3 can be combined by using the same importance sampling algorithm for learning and inference. In our experiments, we used SIS and AIS-BN for both learning and inference (Steps 2 and 3), referred to as RISSIS and RISAIS, respectively. AIS-BN will be referred to as AIS.

4 Results

This section presents the experimental results of RISSIS and RISAIS compared to SIS and AIS for synthetic networks and two real-world networks.

4.1 Measurement

The MSE (mean squared error) metric was used to measure the error of the importance sampling results compared to the exact solution:

MSE = [1 / Σ_{X_i ∈ X} n_i] Σ_{X_i ∈ X} Σ_{j=1}^{n_i} (Pr_e(x_ij) − Pr^e(x_ij))²,

where X = V \ E and x_ij ranges over the n_i states of X_i. We also measured the KL-divergence of the approximate and exact posterior probability distributions:

KL-divergence = Σ_{cfg(X)} Pr^e(x) ln [Pr^e(x) / Pr_e(x)].

Recall that Theorem 1 gives a lower bound for the KL-divergence of the posterior probability distributions of SIS and AIS, which is indicated in the results by the PostKLD lower bound from Eq. (1).
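A direct transcription of the two metrics (a sketch only, assuming the exact and approximate posterior marginals and joint configurations are available as plain dictionaries; the names are illustrative):

```python
import math

def mse(exact_marg, approx_marg):
    """exact_marg, approx_marg: dict X_i -> list of posterior marginals per state."""
    total_states = sum(len(states) for states in exact_marg.values())
    sq_err = sum((approx_marg[v][j] - exact_marg[v][j]) ** 2
                 for v, states in exact_marg.items() for j in range(len(states)))
    return sq_err / total_states

def kl_divergence(exact_joint, approx_joint):
    """exact_joint, approx_joint: dict configuration of X = V \\ E -> probability."""
    return sum(p * math.log(p / approx_joint[x])
               for x, p in exact_joint.items() if p > 0)

# Toy usage with one binary and one ternary variable (made-up numbers):
exact  = {"X1": [0.2, 0.8], "X2": [0.1, 0.3, 0.6]}
approx = {"X1": [0.25, 0.75], "X2": [0.15, 0.25, 0.6]}
print(mse(exact, approx))        # 0.002
```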
The number of samples is taken as a measure of running time instead of CPU time in our experimental implementation. Recall that the overhead of RIS is fixed at startup when the evidence set can be predetermined. Furthermore, the RIS overhead is limited to collecting the updated CPT values during sampling (and learning in the case of RISAIS). The reported sampling frequencies for AIS (and RISAIS) are for calculating the posterior results. Because AIS separates the importance function learning stage from the sampling stage, the actual number of samples taken for AIS (total sampling for the importance function and sampling for the results) is twice that of SIS. Recommended parameters [Cheng and Druzdzel, 2000a] are used in AIS and RISAIS.

4.2 Test Cases

Because computing the MSE is expensive and PostKLD is exponential in the number of vertices, small-sized synthetic BNs with random variables with two or three states, |V| = 20 vertices, and |A| = 30 arcs were evaluated in our experiments. The CPT for each variable is randomly generated with a uniform distribution over the probability interval [0.1, 0.9], with a bias for the extreme probabilities in the intervals (0, 0.1) and (0.9, 1). For the experiments we generated 100 different synthetic BNs with these characteristics.

We also verified RIS with two real-world BNs: Alarm-37 [Beinlich et al., 1989] and HeparII-70 [Onisko, 2003]. The probability distributions of these networks are more extreme compared to the synthetic BNs. For each of the two BNs, 20 sets of evidence variables are randomly chosen, each with 10 evidence variables. For Alarm-37 and HeparII-70 we choose to limit the refractoring to the parent nodes π(E) of the evidence set instead of An(E), see Section 3.4.

4.3 Results for Synthetic Test Cases

We compared the MSE of four algorithms: AIS, RISAIS, SIS, and RISSIS. For this comparison a selection of 21 BNs from the generated synthetic test case suite was made. The other 79 test cases have a PostKLD below 0.1, which according to Theorem 1 means that the RIS advantage is limited.

Fig. 2 shows the results for the 21 synthetic BNs, where the sample frequency is varied from 1,000 to 19,000 in increments of 1,000. The dark columns in the figures represent the ratio of lowest-MSE cases for RISAIS versus AIS and RISSIS versus SIS. A ratio of 50% or higher indicates that the RIS algorithm has lower error variance than the non-RIS algorithm. For RISAIS this is the case for all but one of the 19 measurements taken. In total, the MSE is lowest for RISAIS in 61.4% of the cases on average over all samples. For RISSIS this is the case for all but four of the 19 measurements taken. In total, the MSE is lowest for RISSIS in 58.4% of the cases on average over all samples.

Figure 2: Synthetic BN Results: Ratio of Lowest MSE for RISAIS Versus AIS, and RISSIS Versus SIS (y-axis: cases with lower MSE; x-axis: sampling size × 1000).

In fact, it is to be expected that the higher the PostKLD lower bound, the better the RIS algorithms should perform. In order to determine the impact of an increasing PostKLD, we selected all 100 synthetic BN test cases and measured the KL-divergence after 11,000 samples. Fig. 3 shows the result for RISAIS, where the 100 BNs are ranked according to the PostKLD. Recall that the PostKLD is the lower bound on the KL-divergence of AIS. From the figure it can be concluded that AIS does not approach the exact solution for a significant number of test cases, whereas RISAIS is not limited by the bound due to the BN refractoring.

Figure 3: Synthetic BN Results: KL-Divergence of RISAIS and AIS with PostKLD Lower Bound.
It should be noted that around points 1 and 26 in Fig. 3 the KL-divergence of RISAIS is worse compared to AIS. We believe the reason is that AIS heuristically changes the original CPT, which has a negative impact on the RIS algorithm's ability to adjust the CPT to the optimal importance function.

Fig. 4 shows the result for RISSIS, where the 100 BNs are ranked according to the PostKLD. Interestingly, the RISSIS and SIS results are better on average than those of RISAIS and AIS. Note that the PostKLD lower bound is the same for AIS and SIS. However, in this study SIS appears to approach the PostKLD more closely than AIS. Also here we can conclude that SIS does not approach the exact solution for a significant number of test cases, whereas RISSIS is not limited by the bound due to the BN refractoring.

Figure 4: Synthetic BN Results: KL-Divergence of RISSIS and SIS with PostKLD Lower Bound.

4.4 Results for Alarm-37 and HeparII-70

Fig. 5 shows the results for Alarm-37 and HeparII-70, where the sample frequency is varied from 1,000 to 19,000 in increments of 1,000. The dark columns in the figures represent the ratio of lowest-MSE cases for RISAIS versus AIS and RISSIS versus SIS. A ratio of 50% or higher indicates that the RIS algorithm has lower error variance than the non-RIS algorithm. For RISAIS this is the case for all but one of the 19 measurements taken. In total, the MSE is lowest for RISAIS in 56.7% of the cases on average over all samples. For RISSIS this is the case for all 19 measurements taken. In total, the MSE is lowest for RISSIS in 60.3% of the cases on average over all samples.

Figure 5: Ratio of Lowest MSE for RISAIS Versus AIS, and RISSIS Versus SIS (Alarm-37 and HeparII-70; y-axis: cases with lower MSE; x-axis: sampling size × 1000).

The combined results show that the RIS algorithms have reduced error variance for the synthetic networks and the two real-world networks. The PostKLD lower bound limits the ability of AIS and SIS to approach the optimal importance function. By contrast, the RIS approach successfully eliminates this limitation of AIS and SIS, thereby providing an improvement that reduces error variance in BN importance sampling propagation under evidential reasoning.

5 Conclusions

In order to approach the optimal importance function for importance sampling propagation under evidential reasoning with a Bayesian network, a modification of the network's structure is necessary to eliminate the lower bound on the error variance. To this end, the proposed RIS algorithms refractor the network and adjust the conditional probability tables to minimize the divergence to the optimal importance function. The validity and performance of the RIS approach were empirically tested with a set of synthetic networks and two real-world networks.

Additional improvements of RIS are possible to achieve a better accuracy/cost ratio. The goal is to find an effective subset of the full shield of an ancestor vertex of an evidence vertex, or to select a limited subset of the ancestors of evidence vertices that are refractored. Also, some of the additionally introduced arcs could be removed with the arc removal algorithm [van Engelen, 1997] when they present a weak influence. Such strategies would reduce the complexity of the refractored network while still ensuring higher accuracy over current importance sampling algorithms.

References

[Beinlich et al., 1989] Beinlich, I., Suermondt, G., Chavez, R., and Cooper, G. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the 2nd European Conference on AI and Medicine, Berlin. Springer-Verlag.

[Bouckaert, 1994] Bouckaert, R. R. (1994). A stratified simulation scheme for inference in Bayesian belief networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 110–117.

[Chavez and Cooper, 1990] Chavez, R. M. and Cooper, G. F. (1990). A randomized approximation algorithm for probabilistic inference on Bayesian belief networks. Networks, 20:661–685.

[Cheng and Druzdzel, 2000a] Cheng, J. and Druzdzel, M. J. (2000a). AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research, 13:155–188.

[Cheng and Druzdzel, 2000b] Cheng, J. and Druzdzel, M. J. (2000b). Computational investigations of low-discrepancy sequences in simulation algorithms for Bayesian networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 72–81. Morgan Kaufmann Publishers.

[Cheng and Druzdzel, 2000c] Cheng, J. and Druzdzel, M. J. (2000c). Latin hypercube sampling in Bayesian networks. In Proceedings of the 13th International Florida Artificial Intelligence Research Symposium Conference (FLAIRS-2000), pages 287–292, Orlando, Florida. AAAI Publishers.

[Cooper, 1990] Cooper, G. F. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405.
[Dagum and Luby, 1993] Dagum, P. and Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141–153.

[Fung and Chang, 1989] Fung, R. and Chang, K. C. (1989). Weighting and integrating evidence for stochastic simulation in Bayesian networks. In Proceedings of the 5th Conference on Uncertainty in Artificial Intelligence, pages 209–219, New York, N.Y. Elsevier Science Publishing Company, Inc.

[Garvey and Lesser, 1994] Garvey, A. J. and Lesser, V. R. (1994). A survey of research in deliberative real-time artificial intelligence. Real-Time Systems, 6(3):317–347.

[Geman and Geman, 1984] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741.

[Gilks et al., 1996] Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London.

[Guo and Hsu, 2002] Guo, H. and Hsu, W. (2002). A survey on algorithms for real-time Bayesian network inference. In The joint AAAI-02/KDD-02/UAI-02 workshop on Real-Time Decision Support and Diagnosis Systems, Edmonton, Alberta, Canada.

[Henrion, 1988] Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Proceedings of the 2nd Conference on Uncertainty in Artificial Intelligence, pages 149–163, New York. Elsevier Science.

[Henrion, 1989] Henrion, M. (1989). Some practical issues in constructing belief networks. In Kanal, L., Levitt, T., and Lemmer, J., editors, Proceedings of the 3rd Conference on Uncertainty in Artificial Intelligence, pages 161–173, North Holland. Elsevier Science.

[Kullback, 1959] Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons, New York.

[MacKay, 1998] MacKay, D. (1998). Introduction to Monte Carlo methods. In Jordan, M., editor, Learning in Graphical Models. MIT Press.

[Neapolitan, 1990] Neapolitan, R. E. (1990). Probabilistic Reasoning in Expert Systems. John Wiley and Sons, NY.
[Onisko, 2003] Onisko, A. (2003). Probabilistic Causal Models in Medicine: Application to Diagnosis of Liver Disorders. PhD thesis, Institute of Biocybernetics and Biomedical Engineering, Polish Academy of Science, Warsaw.

[Pearl, 1987] Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2):245–257.

[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., San Mateo, CA.

[Rubinstein, 1981] Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. John Wiley and Sons, Hoboken, NJ.

[Russell and Norvig, 1995] Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall Series in Artificial Intelligence. Prentice Hall.

[Santos et al., 1995] Santos, E. J., Shimony, S. E., Solomon, E., and Williams, E. (1995). On a distributed anytime architecture for probabilistic reasoning. Technical report, AFIT/EN/TR95-02, Department of Electrical and Computer Engineering, Air Force Institute of Technology.

[Shachter and Peot, 1990] Shachter, R. D. and Peot, M. A. (1990). Simulation approaches to general probabilistic inference on belief networks. In Proceedings of the 5th Conference on Uncertainty in Artificial Intelligence, volume 5.

[Shannon, 1956] Shannon, C. E. (1956). The bandwagon. IRE Transactions on Information Theory, IT-2:3.

[van der Gaag, 1996] van der Gaag, L. (1996). On evidence absorption for belief networks. International Journal of Approximate Reasoning, 15(3):265–286.

[van Engelen, 1997] van Engelen, R. (1997). Approximating Bayesian belief networks by arc removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(8):916–920.

Appendix A Proof of Theorem 1

Proof. We use the KL-divergence (cross entropy) [Kullback, 1959] to measure the difference between the posterior BN^e and ESBN_e. The KL-divergence is E_1[ln (Pr_1(V) / Pr_2(V))] = Σ_{cfg(V)} Pr_1(v) ln [Pr_1(v) / Pr_2(v)]. Hence, the KL-divergence between the posterior BN^e and ESBN_e is Σ_{cfg(X)} Pr^e(x) ln [Pr^e(x) / Pr_e(x)], where X = V \ E. This is further simplified as follows:

Σ_{cfg(X)} Pr^e(x) ln [Pr^e(x) / Pr_e(x)]
= Σ_{cfg(X)} Pr(x | e) ln [Pr(x | e) / Pr_e(x)]
= Σ_{cfg(X)} Pr(x | e) ln [ Π_{X_j ∈ X} Pr(x_j | π(x_j)) Π_{E_i ∈ E} Pr(e_i | π(e_i)) / (Pr(e) Π_{X_j ∈ X} Pr_e(x_j | π_e(x_j))) ]
= Σ_{cfg(X)} Pr(x | e) [ Σ_{X_j ∈ X} ln Pr(x_j | π(x_j)) + Σ_{X_j ∈ X} ln (1 / Pr_e(x_j | π_e(x_j))) + Σ_{E_i ∈ E} ln Pr(e_i | π(e_i)) ] − ln Pr(e)
= Σ_{X_j ∈ X} Σ_{cfg(X_j, π(X_j))} Pr(x_j, π(x_j) | e) ln Pr(x_j | π(x_j)) + Σ_{X_j ∈ X} Σ_{cfg(X_j, π(X_j))} Pr(x_j, π(x_j) | e) ln [1 / Pr_e(x_j | π_e(x_j))] + Σ_{cfg(π(E))} Pr(π(e) | e) ln Π_{E_i ∈ E} Pr(e_i | π(e_i)) − ln Pr(e),

where in the last step the sums Σ_{cfg(X \ {X_j, π(X_j)})} Pr(x | e) over the remaining variables collapse to the marginals Pr(x_j, π(x_j) | e). This is Eq. (1). The first, third, and fourth terms in Eq. (1) are determined by the original probability distribution, so

Σ_{X_j ∈ X} Σ_{cfg(X_j, π(X_j))} Pr(x_j, π(x_j) | e) ln Pr(x_j | π(x_j)) + Σ_{cfg(π(E))} Pr(π(e) | e) ln Π_{E_i ∈ E} Pr(e_i | π(e_i)) − ln Pr(e)

is constant. To minimize the difference between the posterior BN^e and ESBN_e, the only choice is to minimize the term

Σ_{X_j ∈ X} Σ_{cfg(X_j, π(X_j))} Pr(x_j, π(x_j) | e) ln [1 / Pr_e(x_j | π_e(x_j))] = Σ_{X_j ∈ X} Σ_{cfg(π(X_j))} Σ_{cfg(X_j)} Pr(x_j, π(x_j) | e) ln [1 / Pr_e(x_j | π_e(x_j))].

This is equal to minimizing the term for each X_j ∈ X and each possible configuration of π(X_j):

Σ_{cfg(X_j)} Pr(x_j, π(x_j) | e) ln [1 / Pr_e(x_j | π_e(x_j))] = Pr(π(x_j) | e) Σ_{cfg(X_j)} Pr(x_j | π(x_j), e) ln [1 / Pr_e(x_j | π(x_j) \ e)].

We have Σ_{cfg(X_j)} Pr_e(x_j | π(x_j) \ e) = 1. According to Shannon's information theory [Shannon, 1956], to minimize Σ_{cfg(X_j)} Pr(x_j | π(x_j), e) ln [1 / Pr_e(x_j | π(x_j) \ e)] we should set Pr_e(x_j | π(x_j) \ e) = Pr(x_j | π(x_j), e). This proves Theorem 1. □
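The final step of the proof is Gibbs' inequality: for a fixed distribution p, the cross entropy Σ_x p(x) ln (1/q(x)) is minimized by q = p. A quick numeric illustration of this fact (arbitrary numbers, for intuition only):

```python
import math

def cross_entropy(p, q):
    return sum(pi * math.log(1.0 / qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]                 # plays the role of Pr(x_j | pi(x_j), e)
candidates = {"q = p": [0.7, 0.2, 0.1],
              "q = uniform": [1/3, 1/3, 1/3],
              "q = skewed": [0.5, 0.4, 0.1]}
for name, q in candidates.items():
    print(f"{name:12s} cross entropy = {cross_entropy(p, q):.4f}")
# Smallest value is obtained for q = p, as used in the last step above.
```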
B Proof of Corollary 1

Proof. Let X = V \ E, then

Σ_{cfg(V \ E)} Pr^e(v \ e) ln [Pr^e(v \ e) / Pr_e(v \ e)]
= Σ_{cfg(X)} Pr(x | e) ln [Pr(x | e) / Pr_e(x)]
= Σ_{cfg(X)} Pr(x | e) ln [ Π_{X_j ∈ X} Pr(x_j | π(x_j)) Π_{E_i ∈ E} Pr(e_i | π(e_i)) / (Pr(e) Π_{X_j ∈ X} Pr_e(x_j | π_e(x_j))) ].

Since π(E) ∩ (V \ E) = ∅, for all X_j ∈ X we have Pr(x_j | π(x_j), e) = Pr(x_j | π(x_j)); following Theorem 1, set Pr_e(x_j | π_e(x_j)) = Pr(x_j | π(x_j)) to minimize the divergence. Then the expression above reduces to

Σ_{cfg(X)} Pr(x | e) ln [ Π_{E_i ∈ E} Pr(e_i | π(e_i)) / Pr(e) ].

Also from π(E) ∩ (V \ E) = ∅, for every E_i ∈ E we have π(E_i) ⊆ E, so Π_{E_i ∈ E} Pr(e_i | π(e_i)) = Pr(e), and therefore Σ_{cfg(X)} Pr(x | e) ln [ Π_{E_i ∈ E} Pr(e_i | π(e_i)) / Pr(e) ] = 0. The KL-divergence between Pr_e and Pr^e is zero, thus Pr_e(v \ e) = Pr^e(v \ e) according to [Kullback, 1959]. □

C Proof of Lemma 2

Proof. For every X_k ∈ Ah(X_j) \ Γ_e(X_j), consider the following three cases.

Case 1: If a directed path X_k ⇝ E exists, then we show that this path is d-separated by (Γ_e(X_j) \ {X_j}) ∪ π(X_j). There are two possibilities. First, X_k ⇝ E passes through X_j; then it must pass through one of the parents of X_j, so π(X_j) d-separates the path. Second, X_k ⇝ E does not pass through X_j; then the path must be d-separated by Γ_e(X_j) \ {X_j}, so (Γ_e(X_j) \ {X_j}) ∪ π(X_j) d-separates the path.

Case 2: If paths N ⇝ X_k and N ⇝ E exist, so that N ∈ Ah(X_j) is a common ancestor connecting X_k and E, then according to Case 1 the set (Γ_e(X_j) \ {X_j}) ∪ π(X_j) d-separates the directed path N ⇝ E and therefore blocks this connection as well.

Case 3: If paths X_k ⇝ B and E ⇝ B exist, then by the topological order ({B} ∪ descendants of B) ∩ ((Γ_e(X_j) \ {X_j}) ∪ π(X_j)) = ∅, so the converging connection at B is blocked and (Γ_e(X_j) \ {X_j}) ∪ π(X_j) d-separates this path.

From Cases 1 to 3 we see that ⟨E, Ah(X_j) | (Γ_e(X_j) \ {X_j}) ∪ π(X_j)⟩, so Λ_e(X_j) ⊆ (Γ_e(X_j) \ {X_j}) ∪ π(X_j). □