Sensitivity analysis in decision circuits Debarun Bhattacharjya and Ross D. Shachter Department of Management Science and Engineering Stanford University Stanford, CA 94305, USA E-mail: debarunb@stanford.edu, shachter@stanford.edu Abstract Decision circuits have been developed to perform efficient evaluation of influence diagrams [Bhattacharjya and Shachter, 2007], building on the advances in arithmetic circuits for belief network inference [Darwiche, 2003]. In the process of model building and analysis, we perform sensitivity analysis to understand how the optimal solution changes in response to changes in the model. When sequential decision problems under uncertainty are represented as decision circuits, we can exploit the efficient solution process embodied in the decision circuit and the wealth of derivative information available to compute the value of information for the uncertainties in the problem and the effects of changes to model parameters on the value and the optimal strategy. answer. For example, suppose we vary one parameter keeping all other parameters constant. For what range of this varying parameter is the current optimal strategy still optimal? How does the value change as this parameter is varied? What if the structure of the influence diagram changes and an uncertainty that would not have been revealed before we make a decision is now observed before the first decision is made? These are only some of the queries on the model that we seek to answer. Several issues in sensitivity analysis of Bayesian belief networks have been studied, using arithmetic circuits [Darwiche, 2003] for efficient solutions [Chan and Darwiche, 2002; 2004; 2006]. Our work builds on this body of research. Arithmetic circuits are graphical representations that have been shown to be efficient at performing inference on belief networks. Decision circuits promise similar benefits in the context of sequential decision problems. In sections 2, 3 and 4 we briefly discuss some preliminaries, and review literature on circuits and sensitivity analysis in influence diagrams. In section 5, we introduce key ideas of performing sensitivity analysis with decision circuits for normal form influence diagrams, i.e. influence diagrams involving a single decision node with no parents. We show how to perform some basic sensitivity analysis with decision circuits, such as: plotting the certain equivalent in a one-way sensitivity analysis, finding the range of a parameter over which the current optimal stays optimal, and computing the value of information of uncertainties that are not affected by decisions, using partial derivatives. This serves as an introduction to section 6, where we present results for influence diagrams that may contain multiple decision nodes. We show the challenges and necessary modifications from the previous section. Finally, section 7 describes our conclusions and directions for future work. 1 INTRODUCTION Influence diagrams are powerful communication tools and computational aids for the analysis of practical decision problems [Howard and Matheson, 1984]. Decision circuits are a recent graphical representation that have been introduced for the efficient evaluation of influence diagrams [Bhattacharjya and Shachter, 2007]. In this paper, we show that they are also useful for efficient sensitivity analysis in influence diagrams. The phrase sensitivity analysis refers, in general, to understanding how the output for a system varies as a result of changes in the system's input(s). For influence diagrams, one may be concerned about how the optimal solution and the certain equivalent (C E ) change with respect to a change in the parameters, i.e. the probabilities and the utilities, or a change in the informational assumptions of the problem. There are many questions that we can use sensitivity analysis to a) W B Report G yes yes W sunny rainy sunny rainy Prob. "sunny" 0.8 0.2 1 1 "rainy" 0.2 0.8 0 0 U Weather Prob. sunny 0.6 rainy 0.4 Value ($) B yes yes no no yes yes no no W sunny rainy sunny rainy sunny rainy sunny rainy 35 45 95 -5 40 50 100 0 R b) B W no no U R G B W G yes yes yes yes no Figure 1: (a) A belief network; (b) An influence diagram. U no no no 2 PRELIMINARIES We assume that the reader is familiar with graphical models such as Bayesian belief networks and influence diagrams (see Shachter (2007) for an overview). Consider the two examples shown in Figure 1. Figure 1a presents a belief network with two nodes, labelled W (Weather) and B (Bring umbrella). We are interested in knowing whether a friend will bring an umbrella, and we believe that it is easier to model this if we condition on the weather. Figure 1b shows the influence diagram for a decision problem in which our friend chooses whether to bring an umbrella based on her belief about the weather and her preferences, represented by the node U (Utility). She will observe a weather Report (R) before she makes her decision. We use these examples to demonstrate the concepts in later sections. We make a distinction between extensive and normal form influence diagrams. When there is only one decision node and it has no parents, the diagram is said to be in normal form [Savage, 1954; Raiffa, 1968]. We extend that definition to allow for evidence, i.e. that we observe certain variables taking on specific values. Alternatively, when decisions are represented by separate nodes, or when the diagram has a decision node with at least one parent, the diagram is said to be in extensive form. There can be a large number of strategies in an influence diagram, one for each possible combination of observed uncertainties and decision alternatives. Although it is possible to convert any extensive form influence diagram into a normal form influence diagram, it may not be efficient to do so. The influence diagram shown in Figure 1b is in extensive form because the decision node B has a parent R. In this paper we assume that the influence diagram has a single value node, referred to as utility node U . When making decisions, we choose the alternative that maximizes the probability that utility variable U = 1. The parents of utility node U , pa(U), are called the value attributes and we assess P (U = 1|pa(U)) = u(v (pa(U))) where u(.) is a von Neumann-Morgenstern utility function [von Neumann and Morgenstern, 1947] such that U = 1 is at least Figure 2: An influence diagram example with numbers. as good and U = 0 is at least as bad as anything that can happen, and v (.) characterizes the value of the attributes in terms of a single numeraire, which we assume is dollars. Therefore, the certain equivalent (C E ) of an uncertain V , given by u-1 (E [u(V )]), represents the certain payment that the decision-maker finds indifferent to V . Our sensitivity results will be expressed in terms of the certain equivalent because the utility values used for the internal computations have no intrinsic meaning. We also assume that the utility function u(.) is strictly increasing and continuously differentiable. The most common utility functions are linear, u(v ) = av + b, and exponential, u(v ) = -ae-v/ + b, where a > 0 and > 0, both of which allow us to express the value of information exactly in closed form. We will present some sensitivity analysis results for extensive form influence diagrams with the help of the influence diagram shown in Figure 2. In this example, our friend will decide whether to Gather evidence (G) and purchase a weather report. Her information gathering decision and the report will be known to her when she decides whether to bring her umbrella. If our friend does not gather evidence, the report is not informative and always states "sunny". The conditional probability tables and the value function v (.) are shown in the figure. We assume that the decision maker has an exponential utility function u(.) with risk tolerance = .02. The optimal strategy is to gather evidence and bring the umbrella when the report says "rainy" and not to bring the umbrella when the report says "sunny". The C E of the decision problem is $52.5. 3 ARITHMETIC AND DECISION CIRCUITS We review basic concepts regarding arithmetic and decision circuits in this section. Throughout this paper, variables are denoted by upper-case letters (X ) and their values by lower-case letters (x). A bold-faced letter indicates a set of variables. If X is a variable with parents Pa(X), then X Pa(X) is called the family for variable X . The values of a binary variable X are denoted x and x. ¯ 3.1 Arithmetic Circuits linear functions. An arithmetic circuit is a rooted, directed acyclic graph whose leaf nodes are constants or variables and all other nodes represent either summation or multiplication. The size of an arithmetic circuit is the number of edges it contains. The value of the multi-linear function is computed at the root of the circuit by evaluating the circuit in an upward pass, starting from the leaves and ending at the root. The result is denoted as f (e), where f (e) = P (e) (see Lemma 1). We can calculate partial derivatives by differentiating the circuit through a subsequent downward pass, in which the parents are visited before the children. The upward and downward passes are also referred to as sweeps. For further details, please see Darwiche (2003). Compact arithmetic circuits have been devised for belief networks that had previously been intractable [Darwiche, 2002; Chavira and Darwiche, 2005]. The circuit is compiled offline, where both the local structure (in the form of determinism and context-specific independence) as well as the conditional independencies of the graph at the global level are exploited. Several inference queries on the network can then be processed online, through subsequent operations on the compiled arithmetic circuit. Efficient sensitivity analysis of the belief network is possible with sweeps of the compiled circuit. 3.2 Decision Circuits Belief networks are associated with a unique multilinear function over two kinds of variables, evidence indicators and network parameters. An evidence indicator x is a binary (0-1) variable, with x = 0 whenever X has been observed taking another value, i.e. it is observed not to be x. There is an evidence indicator associated with each possible instantiation x of each network variable X . A network parameter x|pa(X) represents a conditional probability, x|pa(X) = P (x|pa(X)), and there is a network parameter for each possible instantation xpa(X) of family X Pa(X). Each term in the multi-linear function corresponds to one instantiation z of all the network variables Z, involving the product of all evidence indicators and network parameters consistent with z. The multi-lineax function for a belief network is given zr as f = pa(X)z x x|pa(X) where the sum is over every instantiation of all variables in the network and xpa(X) z represents all families consistent with z. For example, consider the belief network of Figure 1a. Suppose that W and B are binary variables with states w and w, and b and ¯ respectively. ¯ b The multi-linear function for this network is: f = w b w b|w + w b w b|w + w ¯w ¯|w + w ¯w ¯|w . ¯ ¯ ¯ b ¯ b¯ ¯ b b The multi-linear function is a useful construct for answering inference queries in belief networks. By setting the evidence indicators to 0 or 1, we can find the probability of observing any set of network variables E. For instance, if we assign evidence to be e = ¯ by b setting b = 0 and all the other three evidence indicators as 1, the function returns P (¯) = w ¯|w + w ¯|w . b ¯ b¯ b In general, the evidence indicators help in summing the appropriate entries in the joint probability distribution table for computing the probability of the evidence, P (e). Furthermore, the partial derivatives of the multi-linear function also provide solutions to several common probabilistic inference queries. We list two lemmas with important relationships between inference queries and the multi-linear function, as proven in Darwiche (2003): Lemma 1. For evidence e, we have: P (e) = f (e). Lemma 2. For every variable X and evidence e such f that X E, we have: P (x, e) = x (e). / Arithmetic circuits are graphical structures that efficiently represent, evaluate, and differentiate multi- Decision circuits are arithmetic circuits augmented with maximization nodes. They represent the dynamic programming function corresponding to a sequential decision problem. The size of a decision circuit is the number of edges it contains. Figure 3 presents a decision circuit corresponding to the influence diagram shown in Figure 1b. Decision circuits for influence diagrams can be constructed in the variable elimination order [Bhattacharjya and Shachter, 2007], similar to a construction technique for arithmetic circuits [Darwiche, 2000]. The evidence indicator d for decision D is initialized to 0 only if the alternative d is no longer available to the decision maker. The network parameter d|pa(D) for a decision D is initialized to 1 if the alternative is conditionally available under scenario pa(D), and 0 otherwise. Once compiled, the decision circuit can be evaluated in an upward sweep analogous to evaluation in arithmetic circuits. Decision circuits are evaluated with evidence e = e {U = 1} when e is observed. The best outcome U = 1 is also deemed to be observed since this is an M E U problem and therefore the goal is to find optimal policies that maximize the probability of the best outcome given the evidence. The value + max max * b| r + * b| r + * b b + * b| r + b| r * r + * + * * * * + * * + r|w r|w w w w r | w r | w r w [Howard, 1983]. Sensitivity plots displaying the certain equivalents of different strategies can help identify the important variables. Sensitivity analysis for decision problems can be broadly classified in terms of whether one parameter is varied while others are kept constant (one-way sensitivity analysis ) or when multiple parameters are simultaneously varied (n-way sensitivity analysis ). One-way sensitivity analysis throws light upon the critical model variables whereas n-way sensitivity analysis provides insights into the general robustness of the model. Sensitivity analysis in belief networks explores the sensitivity of inference queries such as the probability of evidence and conditional marginal probabilities given the evidence, to the conditional probabilities of the network [Laskey, 1995; Castillo et al, 1997; Kjaerulff and van der Gaag, 2000; van der Gaag and Renooij, 2001]. Sensitivity to inference queries using arithmetic circuits has also been studied [Chan and Darwiche, 2002; 2004] as has the sensitivity of Most Probable Explanations to parameter changes [Chan and Darwiche, 2006]. Our work differs from this line of research in that it uses decision circuits to determine sensitivity of the optimal strategy and the certain equivalent to parameter changes in decision problems. To perform sensitivity analysis for decision problems, we will vary the model output with respect to metaparameters, similar to Chan and Darwiche (2002) and Nielsen and Jensen (2003). For a variable X , we analyze sensitivity to all parameters of the form x|pa(X) as linear functions of a meta-parameter . For example if X is a binary variable with no parents, we can set x = and x = 1 - . In this paper we ¯ assume that there are K meta-parameters, denoted k , k = 1, ..., K , each between 0 and 1, and that these meta-parameters are drawn explicitly in the decision circuit, although it is also possible to allow the metaparameters to be implicit [Chan and Darwiche, 2002]. We assume that each meta-parameter is associated with only one variable. This ensures that the root value of the decision circuit is a piecewise-linear function of each meta-parameter. This can be extended to cases where the model inputs are non-linear functions of the meta-parameters, such as sensitivity to risk tolerance. * ** * u |b w * ** * u u u |b w u |b w u |b w u |b w u |b w u |b w u |b w Figure 3: Decision circuit for the influence diagram shown in Figure 1b. of the root node of the circuit is denoted g (e ). The optimal strategy is computed on the upward sweep at the maximization nodes, where the alternative d with the highest value is chosen, breaking ties arbitrarily. The network parameter d|u is set to 0 for all other alternatives d. The circuit can be differentiated in a subsequent downward sweep, by treating the maximization nodes as summation nodes. Although optimal policies are determined on the upward sweep, the M E U and C E are calculated also using results from the downward sweep. Specifically, from Bhattacharjya and Shachter (2007): Lemma 3. M E U = g u g (e ) (e )+ g (e u ¯ ) . Lemma 4. For utility function u(.), we have: C E = u-1 (M E U ). 4 SENSITIVITY ANALYSIS IN INFLUENCE DIAGRAMS The conditional probabilities in an influence diagram can be difficult to assess due to the paucity of data, expert judgments about key uncertainties, the decision maker's imprecision regarding preferences, and several other practical reasons. As a result, it is often beneficial to inspect the change in the outputs of the decision model based on variations in the inputs of the model. Such issues fall under the umbrella of sensitivity analysis. Sensitivity analysis has been an essential aspect of decision analysis throughout the field's development. Sensitivity analysis aids in identifying the model's critical elements, forming the basis for iterative refinement of the model, and can also be used after the analysis for defending a particular strategy to the decision maker 5 ANALYZING NORMAL FORM INFLUENCE DIAGRAMS When there is only one decision node in an influence diagram and it has no parents, the diagram is said to be in normal form. We extend this definition to allow the observation that E = e. We discuss sensitivity analysis for normal form influence diagrams in this section. 5.1 Partial derivatives Consider an influence diagram with a single decision node D that has no parents. Let X be an arbitrary uncertain variable, with parents Pa (Note that D can be a parent of X ). The dynamic programming function corresponding to this normal zorm influence diagram fx is given by g = maxd d d ,pa(X)z x|pa(X) x where d is an alternative for decision node D. In normal form, each alternative represents a strategy. Let E Ud be the expected utility for strategy d. Then M E U = maxd E Ud , optimal strategy d = arg maxd E Ud and C E = u-1 (M E U ). We will discuss many of the normal form sensitivity results using the following theorem, which outlines the significance of derivatives of g (e ) and g (e). These are the root values of the decision circuit evaluated at evidence e and e, respectively. Theorem 1. If evidence e = e {U = 1} and e are swept through a decision circuit constructed for a normal form influence diagram, then for any node v : = (i) E Ud v and their derivatives. The single and double derivatives of the form used in Theorem 1 can be obtained in time linear in the size of the circuit [Darwiche, 2000]. 2 ge) The double derivatives d( v are computed by sweep g ing (e , for all strategies d, down the circuit. Thus d U the time complexity for obtaining Ev d for all nodes is O((NS )(dc)) where NS is the number of strategies and dc is the size of the decision circuit. Note that the slope of C Ed with respect to v is not constant unless the utility function is linear; it will depend on the value of C Ed in general. ) 5.2 One-way sensitivity analysis plots In this sub-section, we show how to create a graph of the certain equivalent for the decision problem as a function of a particular meta-parameter when all others are kept at their reference values. The expected utility for a particular strategy is a multi-linear function of all the meta-parameters. Therefore, the expected utility for a particular strategy d is a linear function of a particular meta-parameter k , keeping all other meta-parameters at their reference values. We denote this linear function as E Ud = d (k ). We already have a point on this line from the initial sweep used to evaluate the circuit, which we de0 0 note as (k , d ). We also have the slope of this line, since it equals the partial derivative from part (i) of Theorem 1 choosing v = k . We denote the slope as d,k . Plotting the decision problem's C E with respect to changes in k entails applying the inverse utility function u-1 (.) to the maximum of the lines for all strategies. If the required resolution is , then the number of points required between 0 and 1 for the plot is 1/ . The time complexity for preparing the plot for all meta-parameters, once we have the lines for expected utilities of all strategies with respect to all meta-parameters, is O(( 1 )(K )(NS )) where K is the number of meta-parameters and NS is the number of strategies. Another approach to plotting the C E against a metaparameter is the standard method of sample points, where the decision problem is re-evaluated for every sample point over a range. If the required resolution is , then the time complexity of this approach for obtaining plots for all meta-parameters is O(( 1 )(K )(dc)). 5.3 Admissible intervals = v g (e ) d g (e) g (e) 2 g (e ) g (e) g (e ) d v - v d (g (e))2 . (ii) If M E U P I is the maximal expected utility with perfecx information on uncertainty X, then: M E U P I = t E maxd Ud . x (iii) C Ed v = u (C Ed ) . E Ud v Proof. (i): On the downward sweep the maximization nod es are tz eated as summation nodes, therefore d rx g= d d ,pa(X)z x|pa(X) x for the downward sweep. Differentiating g with respect to d and evaluating at evidence e results in P (U = 1, e|d) since only terms associated with alternative d remain in the summation. The expected utility of alternative d is g) 1 P (U = 1|e, d), therefore E Ud = P (e) (e . The red sult follows by recognizing that P (e) = g (e) (e is not responsive to any strategy d), and differentiating with respect to node v using the quotient rule of differentiation. x (ii): M E U P I = P (x|e) maxd P (U = 1|e, x, d) x = maxd P (U = 1, x|e, d) x E = maxd Ud . x We have used some results from the proof of part (i), and the fact that the uncertain variable X is not affected by the decision, thus P (x|e) = P (x|e, d)d. (iii) E Ud = u(C Ed ); the result follows from differentiating both sides with respect to v using the chain rule of differentiation. U Theorem 1 presents a recipe for computing Ev d and C Ed for any node v in the circuit, using g (e), g (e ) v Another sensitivity question in the spirit of one-way sensitivity analysis is the following one. Suppose a certain meta-parameter is allowed to vary, keeping all other meta-parameters constant at their reference values. What is the range of this meta-parameter over which the current optimal strategy remains optimal? We call this range the admissible interval for a metaparameter. In other words, we investigate how robust the optimal strategy is, with respect to changes in any meta-parameter. The admissible intervals are easy to obtain for normal form influence diagrams, based on the ideas from the previous subsection. Since we have the lines for the expected utilities of all strategies with respect to all meta-parameters, we can obtain the admissible intervals Ik for all meta-parameters simultaneously. Suppose + and - are the admissible positive and nega0 tive changes to k from its reference value k such that d stays optimal. We present the following theorem, without proof, for computing +, - and the intervals Ik . The proof entails finding the points at which another strategy overtakes d and recognising that k lies between 0 and 1. The time complexity for computing these intervals for all meta-parameters, once we have the lines for expected utilities of all strategies with respect to all meta-parameters, is O((K )(NS )). Theorem 2. For meta-parameter k , the admissible , 0 -0 d positive change + = minds.t.d,k >d ,k d d d - ,k ,k exponential, then the value of information of the uncertain variable X is the increase in the certain equivalent u-1 (M E U P I ) - u-1 (M E U ). In general, this is usually a good approximation for the value of information, even if the decision maker's utility function has a form other than linear or exponential [Raiffa, 1968]. Once we have the results from Theorem 1, computing the value of information involves summing and maximizing the partial derivatives. If the number of variables analyzed for value of information is var, and assuming that the maximum number of possibilities for all of these variables is bounded by some constant, then the time complexity for obtaining the value of information for all these variables is O((NS )( var)). 6 ANALYZING EXTENSIVE FORM INFLUENCE DIAGRAMS the admissible negative , change - = 0 -0 d maxds.t.d,k