Inference for Multiplicative Models

Ydo Wexler    Christopher Meek
Microsoft Research, Microsoft Corporation
Redmond, WA 98052, USA

Abstract

The paper introduces a generalization of known probabilistic models such as log-linear and graphical models, called here multiplicative models. These models, which express probabilities via products of parameters, are shown to capture multiple forms of contextual independence between variables, including those encoded by decision graphs and noisy-OR functions. An inference algorithm for multiplicative models is provided and its correctness is proved. The complexity analysis of the inference algorithm uses a more refined parameter than the tree-width of the underlying graph, and shows that the computational cost does not exceed that of the variable elimination algorithm in graphical models. The paper ends with examples where using the new models and algorithm is computationally beneficial.

1 Introduction

Probabilistic models that represent associations and/or interactions among random variables have been heavily applied in the past century in various fields of science and engineering. The statistical methods originating with the work of Fisher (1925, 1956) [6, 7] culminated in the log-linear models, which describe the association patterns among a set of categorical variables without specifying any variable as a response (dependent) variable [1]. A specific type of probabilistic model, probabilistic graphical models, can be visually described as an interaction graph, and embodies independence assumptions in the domain of interest [15].
Their main attraction is that the independences encoded in the structure of the model allow one to specify the joint distribution indirectly as a product of functions ψi(Di), each depending only on a limited set of variables Di. Algorithms that compute the posterior distribution conditioned on evidence, called inference algorithms, exploit this structure, avoiding a direct computation of the joint probabilities [5, 19]. The complexity of such algorithms depends on the topology of the model, and is exponential in the tree-width of the underlying graph. The common distinction within graphical models is between undirected graphical models [15], a subset of log-linear models, where there are no restrictions on the functions ψi, and Bayesian networks (BNs) [19], in which every function is a conditional distribution ψi(Di) = P(Xi | Πi), where Πi is the set of parent variables of Xi in the model. Another type of probabilistic model that can be represented visually, called factor graphs, extends undirected graphical models and incorporates many of the desired properties of graphical models [14].

Aside from the independences that are imposed by the model's structure, often there exist additional independences stemming from the specific values of the functions. These independences are not systematically exploited by the traditional inference algorithms, resulting in an unnecessary computational cost. For such non-structural independences we use the name context-specific independence (CSI), which was suggested in previous studies [2, 20]. We note that the term CSI takes here a more general meaning, as it is not restricted to any specific type of independence.

Several studies have suggested changes in the traditional representation of graphical models in order to capture context-specific independences. These include similarity networks suggested by Heckerman (1991) [12], multinets (Geiger & Heckerman 1996) [9], asymmetric influence diagrams (Fung & Shachter 1990) [8], and structured representations of the functions based on decision trees (Boutilier et al. 1996 [2], Poole & Zhang 2003 [20]). Other studies resorted to revised representations for specific functions (e.g., the Quickscore algorithm by Heckerman 1989 for noisy-OR functions [11]).

Although the new representations proved useful from an empirical viewpoint, they lack the ability to encompass a wide variety of CSI. In addition, the theoretical complexity of inference using these representations remained a function only of the topology of the graph underlying the model.

In this paper we approach the problem of inference from a more general perspective. We introduce a set of models called multiplicative models in which the functions that account for the dependency of variables are in a multiplicative representation, where the value of an instance is a product over a set of parameters. We show that multiplicative models generalize log-linear models, factor graphs, and graphical models. In addition, we show that multiplicative models can capture multiple forms of CSI, including CSIs captured via decision trees, decision graphs, and noisy-OR functions. This leads to the question of whether an inference algorithm that takes advantage of these independences can be constructed without additional cost. We provide such an algorithm, and show how different types of independences are utilized in this procedure to reduce the needed computations. The inference algorithm provided herein simplifies over the inference algorithm suggested by Poole & Zhang (2003) [20] when applied to Bayesian networks, by avoiding the use of tables and table-splitting operations. The more general nature of the algorithm also enables it to deal with different representations, and thus account for CSI that cannot be represented by decision trees and decision graphs. We prove the correctness of the inference procedure and give a new notion of complexity instead of the exponent of the tree-width, which is commonly used to describe the complexity of inference in graphical models. The new time complexity is shown to be less than or equal to the standard complexity.

2 Multiplicative models

We propose a generalization of graphical models, factor graphs and log-linear models which represents the dependency of variables in the model via the notion of multiplicative models. In these models the value of an instance in the dependency function is a product over a specific set of parameters.

The definition relies on the concept of a lattice. A lattice (L, ⊑, ∧, ∨) is a partially ordered set (poset) with respect to some relation ⊑, in which for every two elements l1, l2 ∈ L their least upper bound is denoted as l1 ∨ l2 and their greatest lower bound is denoted as l1 ∧ l2.

We usually use upper case letters to denote random variables and sets of random variables, and lower case letters to denote their values. For a variable V we denote its domain, i.e., the set of possible values it can take, by dom(V). For a set of variables D = {Vi}, i = 1, ..., n, the notation dom(D) corresponds to the cross product of the domains dom(Vi), i = 1, ..., n.

Let D = {Vi}, i = 1, ..., n, be a set of n multivalued variables, and let the function ψ(D) : dom(D) → R specify the values in a full table for the set D; then the following is a definition for a mapping function of D.

Definition 1 (Mapping function) A function f is called a mapping function of D with respect to the lattice L if it is defined as f : dom(Z) → L for every Z ⊆ D, and maps partial instances Z = z onto L.

We use this definition to define a lattice multiplicative model of ψ(D).

Definition 2 (Lattice multiplicative model) A model Γ = {S, Λ} of a function ψ(D) is called a lattice multiplicative model with respect to a lattice (L, ⊑, ∧, ∨) and a mapping function f, if S ⊆ L, Λ = {λs ∈ R : s ∈ S} and

    ψ(D = d) = ∏_{s ∈ S : s ⊑ f(d)} λs.

The set S is called the structure of the model, and the set Λ is called the parameters of the model. In multiplicative models, elements s ∈ S for which λs = 1 can be removed from S.

Here we focus on a lattice L which is a set of propositional clauses over the variables and their values, and call this model a propositional multiplicative model, or simply a multiplicative model. In this model, the operators on the lattice are ∧ and ∨.
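To make Definition 2 concrete, the following is a minimal Python sketch (ours, not part of the paper) of evaluating ψ(d) in a propositional multiplicative model. It assumes each structure element is a conjunction of value assignments stored as a dict, so that s ⊑ f(d) holds exactly when d agrees with every assignment in s; the toy model and its numbers are hypothetical.

```python
from math import prod

# An element s (a conjunctive clause) is a dict {variable: value}.
# An instance d is a dict assigning a value to every variable in D.

def satisfies(d, s):
    """True iff s is implied by the full conjunction f(d), i.e. s is below f(d)."""
    return all(d[var] == val for var, val in s.items())

def psi(d, structure, params):
    """psi(D=d): product of lambda_s over all s in S with s below f(d)."""
    return prod(lam for s, lam in zip(structure, params) if satisfies(d, s))

# A toy model over two binary variables A and B (hypothetical parameters):
S = [{"A": 1}, {"B": 1}, {"A": 1, "B": 1}]
L = [2.0, 3.0, 0.5]
print(psi({"A": 1, "B": 1}, S, L))  # 2.0 * 3.0 * 0.5 = 3.0
print(psi({"A": 1, "B": 0}, S, L))  # only {"A": 1} fires -> 2.0
print(psi({"A": 0, "B": 0}, S, L))  # empty product -> 1.0
```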
The mapping function used for this model is called the propositional mapping function, and is defined as follows.

Definition 3 (Propositional mapping function) A mapping function f is called a propositional mapping function of D with respect to the lattice L if, for every set Z ⊆ D, the function f maps every partial instance Z = z into the conjunction ⋀_{Vi ∈ Z} (Vi = vi), where vi is the projection of z onto the variable Vi.

Definition 4 (Propositional multiplicative model) A lattice multiplicative model Γ = {S, Λ} of a function ψ(D) is called a propositional multiplicative model with respect to a lattice (L, ⊑, ∧, ∨) and a propositional mapping function f, if the elements of L are propositional clauses over the variables in D, and for two clauses c and c' we denote c ⊑ c' if c is implied by c'.

Example 1 Consider a set D which contains two ternary variables A and B. The corresponding lattice contains propositional clauses over A and B, and for the two clauses c = (A = 0) and c' = (A = 0) ∧ (B = 2) we denote c ⊑ c'. The corresponding mapping function maps the instance A = 0, B = 2 into the propositional clause (A = 0) ∧ (B = 2), and the partial instance A = 0 into the clause (A = 0).

In this definition, the standard model which uses full-table representations of the functions ψ(D), such as graphical models, and handles each instance separately, is also a multiplicative model, with the set S containing all mappings f(d) of instances D = d, and with values λ_{f(d)} = ψ(d). Another well-known model that falls into Definition 2 is the log-linear model.

2.1 Log-linear models

Log-linear models are usually used to analyze categorical data, and are a direct generalization of undirected graphical models. These models, which have been heavily used for statistical analysis for the past four decades, describe the association patterns among a set of categorical variables without specifying any variable as a response (dependent) variable, treating all variables symmetrically [1]. Formally, a log-linear model specifies the natural log of the expected frequency of values d for a set of variables D as a linear combination of the main effect λ^{Vi}_{vi} of every variable Vi ∈ D and, if |D| > 1, interaction effects λ^S_s of every subset of variables S ⊆ D, where the instances s are consistent with d. For example, suppose that we want to investigate relationships between three categorical variables, A, B and C; then the full log-linear model is

    ln(F_{abc}) = μ + λ^A_a + λ^B_b + λ^C_c + λ^{AB}_{ab} + λ^{AC}_{ac} + λ^{BC}_{bc} + λ^{ABC}_{abc}

where μ is the overall mean of the natural log of the expected frequencies. Clearly, in the log-linear models instances are partially ordered by inclusion of their sets and by consistency of instantiations. To formalize log-linear models as multiplicative models, for every subset Z ⊆ D and for every instantiation Z = z such that λ^Z_z ≠ 0, the set S contains the clause ⋀_{V ∈ Z} (V = v), where v is the projection of z onto the variable V. In addition, we set the parameters of the model to λ_∅ = e^μ and λ_{f(z)} = e^{λ_z}.
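As an illustration (ours, with hypothetical effect values), the following sketch performs the conversion just described, exponentiating each log-linear effect into a multiplicative parameter, and checks that the resulting model satisfies ψ(d) = e^{ln F_d}.

```python
import math

# Log-linear effects for two binary variables A, B (hypothetical numbers):
mu = 0.1
effects = {                        # partial instance -> log-scale effect lambda_z
    (("A", 1),): 0.7,
    (("B", 1),): -0.2,
    (("A", 1), ("B", 1)): 0.4,
}

# Multiplicative model: one clause per effect term, lambda_{f(z)} = exp(lambda_z).
structure = [dict(z) for z in effects]
params = [math.exp(lam) for lam in effects.values()]
lambda_empty = math.exp(mu)        # the empty clause fires for every instance

def psi(d):
    val = lambda_empty
    for s, lam in zip(structure, params):
        if all(d[v] == x for v, x in s.items()):
            val *= lam
    return val

d = {"A": 1, "B": 1}
log_lin = mu + 0.7 - 0.2 + 0.4     # ln F_{11} from the linear form
assert math.isclose(psi(d), math.exp(log_lin))
print(psi(d))
```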
2.2 Context-specific independence

With the introduction of graphical models, and in particular Bayesian networks (BNs), and the proof that inference in these models is NP-hard [4], several studies looked for further independences encoded in models that can potentially reduce the amount of work needed for inference [12, 9]. The notion of context-specific independence (CSI) was then introduced by Smith et al. (1993) [23] and Boutilier et al. (1996) [2]. Context-specific independence corresponds to regularities within probabilistic models based on the values assigned in the model. Formally, we say that the sets of variables X and Y are contextually independent in the context of C = c given Z if

    P(X, Y | Z = z, C = c) = P(X | Z = z, C = c) · P(Y | Z = z, C = c)    (1)

for every value Z = z. One aspect of this equation is that if X and Y are contextually independent given Z, then

    P(X | Y = y1, Z = z, C = c) = P(X | Y = y2, Z = z, C = c)    (2)

for any two values y1, y2 of Y, which appear as repetitive values in conditional probability tables, such as those used in BNs. These repetitions, which are the basis of compact representations like decision trees and graphs, were exploited for inference in BNs [2, 20]. Another kind of CSI which was exploited for enhanced inference in BNs is the independence in noisy-OR functions.

A noisy-OR function is a conditional probability function of a binary effect variable E given a set of m binary cause variables C = {C1, ..., Cm}. The conditional probabilities of the function are

    P(E = 0 | C1, ..., Cm) = c0 · ∏_{i : Ci = 1} P(E = 0 | Ci = 1),

where c0 is a constant and the values P(E = 0 | Ci = 1) are some real numbers.
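A minimal sketch (ours, not the paper's) of the noisy-OR conditional probability above, assuming the inhibition probabilities qi = P(E = 0 | Ci = 1) and the leak constant c0 are given:

```python
from math import prod

def noisy_or_p_e0(causes, q, c0=1.0):
    """P(E=0 | C1..Cm): product of inhibition probabilities of the active causes.

    causes: list of 0/1 cause values; q[i] = P(E=0 | Ci=1); c0 is the leak constant.
    """
    return c0 * prod(q[i] for i, c in enumerate(causes) if c == 1)

q = [0.3, 0.6, 0.9]                      # hypothetical inhibition probabilities
print(noisy_or_p_e0([1, 0, 1], q))       # 0.3 * 0.9 = 0.27
print(1 - noisy_or_p_e0([1, 1, 1], q))   # P(E=1 | all causes active)
```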
For any particular CSI of the sets of variables X and Y in the context C = c given the set Z, as in Eq. 1, there exists a multiplicative model that captures this independence. Such a model is any multiplicative model whose structure does not contain elements s that involve variables from both X and Y such that there exists an instance Z = z for which s ∧ (Z = z) ≠ ⊥ and s ∧ (C = c) ≠ ⊥. We now define two types of multiplicative models that capture two different types of common CSIs.

2.2.1 Positive models

Representing the dependency of variables using log-linear models has some desirable properties, such as being general while ensuring the existence of a maximum likelihood without enforcing dependencies to be strictly positive. However, in the representation discussed in Section 2.1 the log-linear models use more parameters than necessary [3, 13]. Take for example the log-linear model for two binary variables A and B. Assuming all possible effects exist, the corresponding log-linear model uses eight parameters, λ^A_0, λ^A_1, λ^B_0, λ^B_1, λ^{AB}_{00}, λ^{AB}_{01}, λ^{AB}_{10}, λ^{AB}_{11}, rather than the four parameters in a standard representation as a full table. Another representation of the log-linear models that accounts for these redundancies uses only parameters which involve non-zero instantiations of variables [10]. In the above example the only parameters used in this representation are λ^A_1, λ^B_1 and λ^{AB}_{11}. We describe this representation of log-linear models as a multiplicative model, which we call here the positive model.

Definition 5 (Positive model) A positive model of a function ψ(D) is a multiplicative model with respect to the lattice (L, ⊑, ∧, ∨) and a (propositional) mapping function f in which S contains only elements s = f(z) where Z ⊆ D and no variable in Z = z is set to zero.

Log-linear models, and thus positive models, are known to capture conditional and contextual independences [16].

Example 2 An example is a function over two binary variables A and B where ψ(0,0) · ψ(1,1) = ψ(0,1) · ψ(1,0). This implies that A is independent of B and the function can be written as ψ(A, B) = ψ(A) · ψ(B). In the corresponding positive model the parameter λ_{(A=1)∧(B=1)} = ψ(0,0)·ψ(1,1) / (ψ(0,1)·ψ(1,0)) = 1. Thus, this independence is captured in the model.

Example 3 In a more complex function with three binary variables A, B and C, every pair of variables is independent whenever the third variable is set to zero. For this function the corresponding positive model assigns λ_{(V=1)∧(U=1)} = 1 for every pair of variables V, U ∈ {A, B, C} where V ≠ U.

2.2.2 Decision trees and graphs as multiplicative models

Common structures for representing functions with contextual independence are decision trees (DTs) and decision graphs (DGs) [22, 18]. These structures capture contextual independences that are the result of repetitive values, as specified in Eq. 2. Several studies have used decision trees to enhance inference in graphical models [2, 20]. We show how DTs and DGs fall into the category of multiplicative models.

For a function ψ(D) over a set of variables D, a decision tree T that represents ψ(D) is a tree with variables from D at internal nodes and values from ψ(D) at the leaves. Every edge from a variable V to a child in T corresponds to a different set of values H_v ⊆ dom(V), and can be represented as a set of clauses ∨_{v ∈ H} (V = v). A value at the end of a path p = v1 v2 ··· vm, where vi is some value of Vi, equals the value of ψ(d = v1 v2 ··· vm vm+1 ··· vn), where Vj = vj for m < j ≤ n is any possible value of Vj. We note that in a decision tree every instance D = d is mapped to a single path in the tree. An example of a decision tree that encodes a function over the variables A, B, C, D is shown in Figure 1.

One can choose to use decision graphs [18] instead of decision trees. These are more compact structures that can encode more distributions. For a function ψ(D) over a set of variables D, a decision graph G that represents ψ(D) is a directed graph with sets of variables from D at internal nodes and values from ψ(D) at the leaves. Similar to decision trees, every edge from a set of variables W to a child corresponds to a different set of values H_w ⊆ dom(W), and can be represented as a set of clauses ∨_{w ∈ H} (W = w). A value at the end of a path p equals the value of ψ(d), where d is an instance of D consistent with the sets of values encoded by p. Again, as in decision trees, we note that in a decision graph every instance D = d is mapped to a single path in the graph.

Definition 6 (Decision-graph model) A decision graph model of a function ψ(D) is a multiplicative model with respect to the lattice (L, ⊑, ∧, ∨) and a mapping function f where every two elements s1, s2 ∈ S satisfy s1 ∧ s2 = ⊥, and ∨_{s ∈ S} s = ⊤, where ⊥ = false and ⊤ = true.

For a specific decision graph G that represents ψ(D), the decision graph model of G is Γ(G), in which the structure contains one clause for every path from the root to a leaf in G, namely the conjunction of the clauses on the edges of the path. For every such path s, we set λs to the value at the end of the path. We note that in this model for every instance D = d there is only one element s ∈ S such that s ⊑ f(d).

Figure 1: (left) A full table over the binary variables A, B, C, D that specifies the value of the function for each instance. (right) A decision tree corresponding to the function on the left. Under every node in the tree appears the corresponding proposition in the decision-tree representation, and below it the corresponding proposition in a positive representation of the propositions in the decision tree.

    A  B  C  D  ψ(A,B,C,D)
    0  0  0  0  0.4
    0  0  0  1  0.4
    0  0  1  0  0.4
    0  0  1  1  0.4
    0  1  0  0  0.8
    0  1  0  1  0.8
    0  1  1  0  0.8
    0  1  1  1  0.8
    1  0  0  0  0.1
    1  0  0  1  0.1
    1  0  1  0  0.032
    1  0  1  1  0.08
    1  1  0  0  0.1
    1  1  0  1  0.1
    1  1  1  0  0.65
    1  1  1  1  0.08
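The function in Figure 1 can be read off directly as a decision-graph model: one mutually exclusive conjunction per root-to-leaf path, with λs equal to the leaf value. The sketch below (ours; the path set is inferred from the table) checks that ψ(d) recovers the full table with exactly one element firing per instance.

```python
from itertools import product

# One element per root-to-leaf path of the tree in Figure 1 (inferred from the table).
model = [
    ({"A": 0, "B": 0},                 0.4),
    ({"A": 0, "B": 1},                 0.8),
    ({"A": 1, "C": 0},                 0.1),
    ({"A": 1, "C": 1, "D": 1},         0.08),
    ({"A": 1, "B": 0, "C": 1, "D": 0}, 0.032),
    ({"A": 1, "B": 1, "C": 1, "D": 0}, 0.65),
]

def psi(d):
    fired = [lam for s, lam in model if all(d[v] == x for v, x in s.items())]
    assert len(fired) == 1   # decision-graph models: the paths partition dom(D)
    return fired[0]

for a, b, c, dd in product([0, 1], repeat=4):
    print(a, b, c, dd, psi({"A": a, "B": b, "C": c, "D": dd}))
```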
3 Inference for multiplicative models

Consider a model that encodes the probability distribution P(x) = ∏_i ψi(di), with sets Di = {X_{i1}, ..., X_{im_i}}, and multiplicative models Γi = {Si, Λi} over all the functions ψi(Di) with respect to a lattice (L, ⊑, ∧, ∨). We first show how to perform inference and compute a probability of a set of query variables Q using a multiplicative model. In particular, we perform inference for a multiplicative model via the variable elimination scheme (Zhang & Poole 1996 [24], Dechter 1999 [5]), which was originally suggested for inference in BNs. Then, we prove the correctness of the algorithm and analyze its time complexity.

We define an operation M(V, {Γi}), which given a variable V ∈ X and a set of models {Γi}, i = 1, ..., m, over X, returns a model over the variables X \ V. This operator is analogous to marginalization in standard inference algorithms. In addition, for a model Γ we define a relevance indicator Is(V) for each element s ∈ S and each variable V in D, which is set to 1 if there exists a pair of instances d1, d2 of D that differ only by the value of V and for which s ⊑ f(d1) but not s ⊑ f(d2). Otherwise, Is(V) is set to 0.

These operations allow us to write an inference procedure which computes the probability of a set of query variables in a multiplicative model, given as Algorithm 1. The algorithm operates like the bucket-elimination algorithm [5]: given an order on the variables we iterate over them (Line 2) and marginalize out one variable at a time (Line 9). Only elements that involve the current variable are considered in the marginalization.

Algorithm 1: VE for multiplicative models
Input: A model with n variables Xi (i = 1, ..., n) and m functions ψi(Di ⊆ X) that encode the distribution P(X); a set of multiplicative models Γi = {Si, Λi} with respect to a mapping function f, where Γi models ψi(Di); and a set of k query variables Q = {Xi : i ≤ k}.
Output: The distribution P(Q).
 1  t = m + 1;
 2  for j = k + 1 to n do
 3      for i = 1 to t - 1 do
 4          Si[j] ← {s : s ∈ Si, Is(Xj) = 1};
 5          Λi[j] ← {λs : s ∈ Si[j], λs ∈ Λi};
 6          Si ← Si \ Si[j];
 7          Λi ← Λi \ Λi[j];
 8      end for;
 9      {St, Λt} ← M(Xj, {Si[j], Λi[j]});
10      t = t + 1;
11  end for;
12  P(Q = q) ← ∏_i ∏_{si ∈ Si : si ⊑ f(q)} λ_{si};
13  return P(Q);

Figure 2: Algorithm for variable elimination with a multiplicative model.

Note that for graphical models, in which the elements of Si are a mapping of instances of the functions over Di, this algorithm is exactly the known variable elimination algorithm, in its implementation as bucket elimination [5], where the sets Si[j] are the tables in the bucket of the variable Xj.
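The following is a sketch (ours) of the outer loop of Algorithm 1 for conjunctive dict clauses, for which the relevance indicator Is(X) reduces to "s mentions X". The `eliminate` argument stands in for the operation M(V, {Γi}) of Algorithm 2 and is assumed to be supplied by the caller.

```python
def variable_elimination(models, order, eliminate):
    """models: list of (structure, params) pairs; order: the non-query variables.

    `eliminate` is assumed to take a variable plus the relevant sub-models and
    return a new (structure, params) pair, as Algorithm 2 does.
    """
    models = [(list(S), list(L)) for S, L in models]
    for X in order:                            # Line 2: one variable at a time
        buckets = []
        for k, (S, L) in enumerate(models):    # Lines 3-8: split off relevant parts
            rel = [(s, lam) for s, lam in zip(S, L) if X in s]      # I_s(X) = 1
            keep = [(s, lam) for s, lam in zip(S, L) if X not in s]
            models[k] = ([s for s, _ in keep], [lam for _, lam in keep])
            buckets.append(([s for s, _ in rel], [lam for _, lam in rel]))
        models.append(eliminate(X, buckets))   # Line 9: marginalize X out
    return models                              # encodes P(Q); read off as in Line 12
```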
A general algorithm for computing M(V, {Γi}) is given as Algorithm 2. We use there the notation s↓V, for an element s ∈ S and a variable V, to denote the clause obtained from s by removing all terms that specify a value for V. For example, if s = (V = 0) ∧ (U = 0) then s↓V = (U = 0).

The algorithm has two main parts: up to Line 5 the algorithm generates the set R of possible new elements in the model; from Line 6 on it computes the new parameters λr, where at each iteration a "minimal" element of R is chosen, and it retains those elements r with parameters λr ≠ 1. To compute the possible new elements, Lines 2 and 3 first create a closure under the ∧ operator of each structure Si. Then, in Line 5, all conjunctions of terms from the different closures constitute the set of possible new elements.

Algorithm 2: M(V, {Γi})
Input: A variable V and a set of representations Γi = {Si, Λi}, i = 1, ..., t, with respect to a lattice (L, ⊑, ∧, ∨), where Is(V) = 1 for every i and s ∈ Si.
Output: A representation Γ = {S, Λ}.
 1  S ← ∅; Λ ← ∅;
 2  for i = 1 to t do
 3      Ri ← {⋀_{s ∈ S'} s : S' ⊆ Si};
 4  end for;
 5  R ← {⋀_{i ≤ t} ri↓V : ri ∈ Ri};
 6  while R ≠ ∅ do
 7      /* r is a minimal element in R */
 8      r ← an element of Min(R) = {r : r ∈ R, ¬∃ r' ∈ R \ {r} s.t. r' ⊑ r};
 9      λr ← [∑_v ∏_i ∏_{s ∈ Si : s ⊑ r ∧ (V=v)} λs] / ∏_{r' ∈ S : r' ⊑ r} λ_{r'};
10      R ← R \ {r};
11      if λr ≠ 1 then S ← S ∪ {r}; Λ ← Λ ∪ {λr};
12  end while;
13  return {S, Λ};

Figure 3: Algorithm for computing the operation M(V, {Γi}).

In analogy to inference in graphical models, this operation is equivalent to the operation of table multiplication, often denoted as ⊗. In these models the set R is the set of instances in the table after marginalization. We note that for some models, like graphical models, Lines 2-5 are trivial and are executed implicitly, since the elements in R are known to be all instances of a full table over the remaining variables.
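Below is a direct, unoptimized Python transcription (ours) of Algorithm 2, under the same dict-clause assumption as before; the closure enumeration is exponential and is written for clarity only. The demo marginalizes a full-table model over two binary variables, recovering ψ'(U) = Σ_v ψ(v, U) as a product of the retained elements.

```python
from itertools import combinations, product
from math import prod, isclose

BOT = None  # the lattice bottom (an inconsistent conjunction)

def conj(a, b):
    """Conjunction of two dict-clauses; BOT if they assign a variable differently."""
    if a is BOT or b is BOT:
        return BOT
    out = dict(a)
    for k, v in b.items():
        if out.get(k, v) != v:
            return BOT
        out[k] = v
    return out

def below(s, r):
    """s below r, i.e. r implies s: every assignment in s also appears in r."""
    return all(r.get(k) == v for k, v in s.items())

def eliminate(V, dom_v, models):
    """M(V, {Gamma_i}); models is a list of (structure, params) pairs."""
    closures = []
    for S, _ in models:                   # Lines 2-4: closure of each Si under AND
        Ri = {frozenset()}
        for k in range(1, len(S) + 1):
            for idx in combinations(range(len(S)), k):
                c = {}
                for j in idx:
                    c = conj(c, S[j])
                    if c is BOT:
                        break
                if c is not BOT:
                    Ri.add(frozenset(c.items()))
        closures.append(Ri)
    R = set()                             # Line 5: cross-conjunctions, V dropped
    for combo in product(*closures):
        c = {}
        for x in combo:
            c = conj(c, dict(x))
        if c is not BOT:
            c.pop(V, None)                # the projection r down V
            R.add(frozenset(c.items()))
    R = [dict(fs) for fs in R]
    new_S, new_L = [], []
    while R:                              # Lines 6-12: parameters, minimal first
        r = next(x for x in R if not any(below(y, x) and y != x for y in R))
        num = sum(prod(lam for S, L in models for s, lam in zip(S, L)
                       if below(s, conj(r, {V: v})))
                  for v in dom_v)
        den = prod(lam for s, lam in zip(new_S, new_L) if below(s, r))
        R.remove(r)
        if not isclose(num / den, 1.0):
            new_S.append(r)
            new_L.append(num / den)
    return new_S, new_L

# Marginalizing V out of a full table over V, U recovers psi'(U) = sum_v psi(v, U):
S1 = [{"V": 0, "U": 0}, {"V": 0, "U": 1}, {"V": 1, "U": 0}, {"V": 1, "U": 1}]
L1 = [0.1, 0.2, 0.3, 0.4]
for s, lam in zip(*eliminate("V", [0, 1], [(S1, L1)])):
    print(s, lam)   # the elements' product gives 0.4 for U=0 and 0.6 for U=1
```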
3.1 Correctness of the inference procedure

We prove the correctness of Algorithm 1 by showing that the algorithm maintains the property that, after iterating over a set of variables U, the models Γi = {Si, Λi} encode the probability distribution P(X \ U). At the beginning of the algorithm every model Γi represents the corresponding function ψi(Di). Thus,

    P(X = x) = ∏_i ψi(Di = di) = ∏_i ∏_{s ∈ Si : s ⊑ f(di)} λs.

Assume that after removing the set of variables U we are left with the set X' = X \ U, and we now wish to eliminate a variable V ∈ X'. We write the probability of an instance x_v of X' \ V, which is the projection of an instance X' = x' onto X' \ V, via the parameters:

    P(x_v) = ∑_{V=v} P(x') = ∑_{V=v} ∏_i ∏_{s ∈ Si : s ⊑ f(x')} λs.

We can decompose the product into terms that involve the variable V and those which do not; the latter do not depend on the value v. Denoting ωi(x_v) = ∏_{s ∈ Si : s ⊑ f(x'), Is(V)=0} λs and ω(x_v) = ∏_i ωi(x_v), we get

    P(x_v) = ω(x_v) · ∑_{V=v} ∏_i ∏_{s ∈ Si : s ⊑ f(x'), Is(V)=1} λs.    (3)

Now, let us examine what the algorithm encodes after removing variable V, and show that it equals Eq. 3. While the elements that do not involve variable V are not changed, the elements that do involve V are removed and the elements st ∈ St are added. Therefore, after applying Algorithm 2 for V, the remaining sets encode P̂(x_v) = ω(x_v) · ω̂(x_v), where ω̂(x_v) = ∏_{st ∈ St : st ⊑ f(x_v)} λ_{st}. To express ω̂(x_v) in the terms of Algorithm 2, recall that St ⊆ R and that if an element s ∈ R and s ∉ St then λs = 1. Thus, we can rewrite ω̂(x_v) using elements of R as ω̂(x_v) = ∏_{r ∈ R : r ⊑ f(x_v)} λr.

From Lines 2-5 in Algorithm 2, there is one element r* ⊑ f(x_v) in R for which every r ∈ R such that r ⊑ f(x_v) also satisfies r ⊑ r*. First, to show there is such an element r*, we recall from Line 5 that all elements in R can be written as r = ⋀_{i ≤ t} ri, where ri ∈ Ri, and Ri is the closure of Si under the ∧ operator. Consider the set of elements ri* ⊑ f(x_v), i = 1, ..., t, for which all other elements ri ∈ Ri such that ri ⊑ f(x_v) satisfy ri ⊑ ri*. Then, the element r* = ⋀_{i ≤ t} ri* is such that every r ∈ R with r ⊑ f(x_v) also satisfies r ⊑ r*. Now, assume by contradiction that there were two such elements, r1*, r2* ∈ R. Then from the definition of r1* and r2* we get r1* ⊑ r2* and r2* ⊑ r1*, yielding r1* = r2*.

Thus, from Line 9 in Algorithm 2,

    ω̂(x_v) = λ_{r*} · ∏_{r' ∈ R : r' ⊑ r*, r' ≠ r*} λ_{r'} = ∑_{V=v} ∏_i ∏_{s ∈ Si : s ⊑ (r* ∧ (V=v))} λs,

where the last equality is due to the fact that the denominator in the computation of λ_{r*} is ∏_{r' ⊑ r*} λ_{r'}. In the terms of Algorithm 1, the set {s : s ⊑ (r* ∧ (V = v)), s ∈ Si} can be rewritten as {s : s ⊑ f(x'), s ∈ Si, Is(V) = 1}. Thus, we can write

    ω̂(x_v) = ∑_{V=v} ∏_i ∏_{s ∈ Si : s ⊑ f(x'), Is(V)=1} λs,

and from Eq. 3 we get P̂(x_v) = P(x_v). Namely, the new models encode P(X' \ V).

3.2 Incorporating evidence

In many practical scenarios we observe the value of some of the variables in the model, and wish to incorporate this evidence. The multiplicative models allow us to do so in a most natural way. Consider a set E of evidence nodes for which we observed the values E = e, and a multiplicative model Γ = {S, Λ}. Then, in order to incorporate the evidence into Γ, we adjust the elements in S by s' = s ∧ ⋀_{V ∈ E} (V = v), where v is the projection of e onto the variable V ∈ E. Then, we remove every element not consistent with the evidence, i.e., every element for which s' = ⊥.

3.3 Complexity of inference

It is well known that inference in graphical models is NP-hard and that its cost is exponential in the tree-width of the underlying graph [4]. We analyze the time complexity of the inference procedure for multiplicative models given in Algorithm 1. As a by-product we refine the standard complexity and provide a new complexity bound which is based on the representation used. One can then say that the complexity of the problem is the minimum complexity among all possible representations.

3.3.1 Diameter of multiplicative models

The structure of a multiplicative model determines the amount of computation needed to obtain the value ψ(d) of a single instantiation of values to the variables in a set D. Although at first glance it seems that for a model Γ = {S, Λ} of a function ψ(D) the number of operations needed to obtain the values of all instances D = d amounts to a total of ∑_{D=d} |{s : s ⊑ f(d)}|, the real number of operations can be dramatically lower, and we denote it by ρ(Γ). For hierarchical models, in which if an element s is not in the structure of the model then all elements s' with s ⊑ s' are also not in the model, Good (1963) provides a method that computes all such values in time |S| log |S| [10]. We denote the ratio between the number of computations and the number of elements in S, which is the size of the model, by diam(Γ) = ρ(Γ)/|S|, and name it the diameter of Γ.

From a computational perspective, it is clearly beneficial to use models with a small diameter, as this directly leads to fewer operations whenever we want to either obtain a value of ψ or update the values λs. Examples of models with a diameter of 1 are graphical models and decision graph models, in which for every element s ∈ S the only element s' such that s' ⊑ s is s itself. On the other hand, the diameter of a positive model can be as high as (log2 |S|)/2. This maximum is achieved for a positive model of m binary variables, when all 2^m parameters do not equal one, and hence all possible elements are in S. In this scenario the diameter is exactly m/2.

Although in the worst scenario the diameter of a positive model can be large, often this is not the case, and the diameter is typically bounded to be very small.

Example 4 Consider as an example the Potts model [21], in which a function ψ(D) over a set D = {Vi}, i = 1, ..., n, decomposes according to ψ(D = d) = c0 ∏_{i,j} ψ(vi, vj), where vi and vj are projections of d onto the variables Vi and Vj respectively, and c0 is a constant. Although in general a positive model over n binary variables has a diameter of n/2, in this example the structure of the positive model includes only elements that involve at most two variables. Therefore, the diameter of the model is bounded by two. Similarly, in a more complex scenario where the function decomposes into functions of k-tuples of variables, the diameter will be bounded by k.
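A small sketch (ours) of the positive-model structure for a pairwise, Potts-style decomposition, showing the property used in Example 4: every structure element involves at most two variables, which is what bounds the diameter by two.

```python
from itertools import combinations

def potts_positive_structure(n, edges):
    """Positive-model structure for a pairwise function over n binary variables.

    Only non-zero instantiations appear (Definition 5): one main-effect element
    per variable and one pairwise element per interacting pair.
    """
    S = [{i: 1} for i in range(n)]           # main effects (V_i = 1)
    S += [{i: 1, j: 1} for i, j in edges]    # pairwise effects (V_i = 1) AND (V_j = 1)
    return S

edges = list(combinations(range(4), 2))      # a fully connected example
S = potts_positive_structure(4, edges)
print(max(len(s) for s in S))                # 2: no element involves more than 2 vars
```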
Consider a tree decomposition of the graph in which there is an edge between a pair of variables V, U if there exists an element s in one of the models for which Is(V) · Is(U) = 1. We denote by S(W) = {s↓(X \ Z) : s ∈ ∪i Si} the set of parts of elements in the models Γi that involve variables from the set of graph vertices Z which is mapped onto the tree node W. Further denoting by S̄(W) the closure of S(W) under the ∧ operator, we say that the complexity of the algorithm for this tree decomposition is the maximum over the nodes W in the tree of |S̄(W)| · diam(S̄(W)), as described in Section 3.3.1. Then, the overall complexity of the algorithm is the complexity for the tree decomposition that yields the minimum for this term.

To see that this is indeed the time complexity of the algorithm, consider the elements in a set R in Algorithm 2. The number of elements there does not exceed the number of elements in S̄(W) for the corresponding tree decomposition, where W maps onto the variables that appear in R. Most of the computation stems from computing the products in Line 9, and these can be done for the entire set of elements of R in time proportional to |R| · diam(R). Therefore, having the ability to choose an elimination order, the complexity of the algorithm is |S̄(W)| · diam(S̄(W)), maximized over all nodes W in a tree decomposition and minimized over all possible such decompositions.

3.4 Benefits of inference for multiplicative models

Different multiplicative models capture different contextual independences, and hence specify different numbers of parameters. Take for example the function over four binary variables A, B, C, D with values according to the table in Figure 1. The structure of the corresponding decision-tree model contains six elements, while the structure of the corresponding positive model contains eight elements. In this latter model, the CSI captured in the decision tree, namely that the value of ψ is independent of B given that A, C and D are set to one, does not have any effect. This variation and the structure of the model affect the run time of the inference algorithm.
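The element counts above can be checked numerically. The sketch below (ours) recovers the positive-model parameters of the Figure 1 function by Möbius inversion over the table, in the spirit of Good [10], and confirms that exactly eight parameters differ from one.

```python
from itertools import combinations
from math import log, exp, isclose

VARS = ("A", "B", "C", "D")
psi = {  # the full table from Figure 1, keyed by (a, b, c, d)
    (0,0,0,0): 0.4, (0,0,0,1): 0.4, (0,0,1,0): 0.4,   (0,0,1,1): 0.4,
    (0,1,0,0): 0.8, (0,1,0,1): 0.8, (0,1,1,0): 0.8,   (0,1,1,1): 0.8,
    (1,0,0,0): 0.1, (1,0,0,1): 0.1, (1,0,1,0): 0.032, (1,0,1,1): 0.08,
    (1,1,0,0): 0.1, (1,1,0,1): 0.1, (1,1,1,0): 0.65,  (1,1,1,1): 0.08,
}

# Moebius inversion: ln lambda_z = sum over w subset of z of
# (-1)^{|z|-|w|} * ln psi(w extended with zeros).
lam = {}
for k in range(len(VARS) + 1):
    for z in combinations(range(4), k):
        val = sum((-1) ** (len(z) - len(w)) *
                  log(psi[tuple(1 if i in w else 0 for i in range(4))])
                  for j in range(len(z) + 1) for w in combinations(z, j))
        lam[z] = exp(val)

nonunit = {z: v for z, v in lam.items() if not isclose(v, 1.0)}
print(len(nonunit))   # 8, matching the count stated in Section 3.4
for z, v in sorted(nonunit.items()):
    print("lambda_" + ("{}" if not z else "".join(VARS[i] for i in z)), "=", round(v, 4))
```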
An example where there are substantial computational savings when using the proposed inference algorithm can be found in a model such as the QMR-DT network [17], which is comprised of the noisy-OR functions mentioned in Section 2.2. The QMR-DT network is a two-level, or bipartite, BN where all variables are binary. The top level of the graph contains nodes for the diseases C, and the bottom level contains nodes for the findings E. The conditional probabilities in the network, P(Ei = ei | Πi), where Πi are the parents of finding Ei in the network, are represented by noisy-OR functions.

Heckerman (1989) developed an algorithm, called Quickscore, which takes advantage of the independence of the cause variables in the context of a negative finding Ei = 0, and uses it to speed up inference in the QMR-DT network [11].

For every noisy-OR function P(E | C1, ..., Cm), a structure of a multiplicative model that captures this independence does not contain elements s with s ∧ (E = 0) ≠ ⊥ for which Is(Ci) = 1 and Is(Cj) = 1, for any i ≠ j ≤ m.

In addition, running Algorithm 1 using multiplicative models with structures

    Si = {(Ei = 1) ∧ ⋀_{Ci ∈ Πi} (Ci = ci) : ci ∈ dom(Ci)}
         ∪ {(Ei = 0) ∧ (Ci = 1) : Ci ∈ Πi}
         ∪ {(Ei = 0) ∧ ⋀_{Ci ∈ Πi} (Ci = 0)}

is identical to the Quickscore algorithm and gains the same savings automatically.
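As a purely structural illustration (ours), the sketch below builds the Quickscore-style structure Si above for one finding and lists its elements consistent with E = 0: apart from the single "all causes off" element, each such element involves only one cause variable, which is the CSI that Quickscore exploits.

```python
from itertools import product

def noisy_or_structure(E, causes):
    """The structure S_i from Section 3.4 for one noisy-OR finding E,
    with elements represented as {variable: value} dicts."""
    S = [{E: 1, **dict(zip(causes, vals))}           # (E=1) AND a full parent instantiation
         for vals in product([0, 1], repeat=len(causes))]
    S += [{E: 0, C: 1} for C in causes]              # (E=0) AND (C_i = 1)
    S.append({E: 0, **{C: 0 for C in causes}})       # (E=0) AND all causes off
    return S

S = noisy_or_structure("E", ["C1", "C2", "C3"])
e0 = [s for s in S if s.get("E") == 0]
print([s for s in e0 if len(s) == 2])   # (E=0) AND (C_i = 1): one cause each
print([s for s in e0 if len(s) > 2])    # only the single all-causes-off element
```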
References

[1] Y. Bishop, E. Fienberg, and P. Holland. Discrete Multivariate Analysis. MIT Press, 1975.
[2] C. Boutilier et al. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence, pages 115-123, 1996.
[3] R. Christensen. Log-Linear Models and Logistic Regression. Springer, 1997.
[4] G. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42:393-405, 1990.
[5] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113:41-85, 1999.
[6] R. Fisher. Statistical Methods for Research Workers. Macmillan, 1925.
[7] R. Fisher. Statistical Methods and Scientific Inference. Oliver and Boyd, 1956.
[8] R. Fung and R. Shachter. Contingent influence diagrams. Working Paper, Dept. of Engineering-Economic Systems, Stanford University, 1990.
[9] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45-74, 1996.
[10] I. Good. Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. The Annals of Mathematical Statistics, 34:911-934, 1963.
[11] D. Heckerman. A tractable inference algorithm for diagnosing multiple diseases. UAI, 230:362-367, 1989.
[12] D. Heckerman. Probabilistic Similarity Networks. MIT Press, 1991.
[13] D. Knoke and P. Burke. Log-Linear Models. Sage Publications, 1980.
[14] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498-519, 2001.
[15] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[16] J. Lindsey. Conditional independence and log linear models for multi-dimensional contingency tables. Quality and Quantity, 8(4):377-390, 1974.
[17] B. Middleton et al. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base: Part II. Evaluation of diagnostic performance. Methods of Information in Medicine, 30:256-267, 1991.
[18] J. J. Oliver. Decision graphs - an extension of decision trees. Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics, pages 343-350, 1993.
[19] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[20] D. Poole and N. Zhang. Exploiting contextual independence in probabilistic inference. JAIR, 18:263-313, 2003.
[21] R. Potts. Some generalized order-disorder transformations. Proceedings of the Cambridge Philosophical Society, 48:106-109, 1952.
[22] S. Safavian and D. Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660-674, 1991.
[23] J. Smith, S. Holtzman, and J. Matheson. Structuring conditional relationships in influence diagrams. Operations Research, 41(2):280-297, 1993.
[24] N. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference. JAIR, 5:301-328, 1996.