Budgeted Distribution Learning of Belief Net Parameters

Liuyang Li (liuyang@ualberta.ca)
Barnabás Póczos (poczos@ualberta.ca)
Csaba Szepesvári (szepesva@ualberta.ca)
Russ Greiner (rgreiner@ualberta.ca)
Department of Computing Science, University of Alberta, AB, Canada, T6G 2E8

Abstract

Most learning algorithms assume that a training dataset is given initially. We address the common situation where data is not available initially, but can be obtained, at a cost. We focus on learning Bayesian belief networks (BNs) over discrete variables. As such BNs are models of probability distributions, we consider the "generative" challenge of learning the parameters, for a fixed structure, that best match the true distribution. We focus on the budgeted learning setting, where there is a known fixed cost $c_i$ for acquiring the value of the $i$th feature for any specified instance, and a known total budget to spend acquiring all information. After formally defining this problem from a Bayesian perspective, we first consider non-sequential algorithms that must decide, before seeing any results, which features of which instances to probe. We show this is NP-hard, even if all variables are independent, then prove that the greedy allocation algorithm iga is optimal here when the costs are uniform, but can otherwise be sub-optimal. We then show that general (sequential) policies perform better than non-sequential ones, and explore the challenges of learning the parameters for general belief networks in this sequential setting, describing conditions for when the obvious round-robin algorithm will, versus will not, work optimally. We also explore the effectiveness of this and various other heuristic algorithms.

1. Introduction

Consider the challenge of producing a Bayesian belief network (BN) (Verma & Pearl, 1991) for modeling a particular medical situation -- e.g., for encoding the various interactions between a given set of symptoms and diseases. Here, experts have identified the relevant symptom and disease variables $X = (X_1, \ldots, X_d)$, and have structured them as nodes in a graph, whose directed arcs represent their dependencies. However, they have not specified the actual parameters (here, conditional probability table (CPtable) values, as these variables are all discrete), but have provided a prior distribution for each CPtable row $\theta_{X_i|u_i}$, the parameter of the conditional distribution of the variable $X_i$ given a specific assignment $u_i$ to its parents $U_i$ (Tong & Koller, 2001b). Fortunately, there are many known techniques for estimating these parameters, given a data sample (Heckerman, 1999). Unfortunately, we do not initially have any data (perhaps this is just the start of a funded study). We do, however, have access to a set of patients $\{X^1, X^2, \ldots\}$, whose individual features we can "probe", at a cost. That is, we can acquire the value $x_i^j$ of feature $i$ for patient $j$, at known cost $c_i$. The total cost of a sequence of $K$ probes $x_{i_1}^{j_1}, \ldots, x_{i_K}^{j_K}$ is then $\sum_{k=1}^{K} c_{i_k}$. Our funders have provided a total budget $B$ to spend on such probes; so we can consider any probe sequence where $\sum_{k=1}^{K} c_{i_k} \le B$. Our task, now, is to use this budget effectively, to (sequentially) purchase the probes that allow us to obtain good estimates of the parameters -- in particular, to find the parameters that most closely match the true distribution; see Section 2. We refer to this as Budgeted Distribution Learning (bdl). Section 2 provides the relevant framework.
Section 3 then focuses on the simple case where the variables are independent, dealing first with "non-sequential algorithms" that must decide on all of the probes before seeing any of their responses. We provide an efficient greedy algorithm, iga, that provably returns the optimal allocation in the unit-cost case; we then show that this task is NP-hard when the costs are arbitrary. We also show that this optimal allocation is not the optimal policy, and provide empirical studies that illustrate the behaviour of this iga algorithm, as well as several obvious sequential approaches. Section 4 then extends these results to general belief networks. We describe conditions for when the simple round-robin algorithm (which acquires full data instances) will produce optimal results. We also explore the effectiveness of this and various other heuristic algorithms. The extended technical report (LPSzG, 2010) contains the proofs and other material; in particular, it shows that many of these results hold for Gaussian distributions, as well as Dirichlet. We close this section by summarizing related work -- in particular, placing our system within the context of active learning, and contrasting our system with discriminative budgeted learning.

1.1. Related Work

While most learning algorithms begin with a given data set, there is today a large literature on active learning (Muslea, 2002) and experimental design (Melas, 2006), which explores the challenges of first acquiring this training data. The standard example is "active discriminative label learning" (ADLL), in which the system attempts to produce a classifier -- i.e., a function that maps the attributes of each instance $x^j = (x_1^j, \ldots, x_n^j)$ to a label $y^j$ -- and assumes the active learner initially has access to the attributes of a large number of instances $\{x^j\}_j$, and needs to pay funds to purchase the labels. Lizotte et al.'s (2003) "budgeted discriminative attribute learning" model (BDAL) also seeks a good classifier; it differs from ADLL by initially providing the learner with all of the labels $\{y^j\}_j$ and allowing the learner to purchase the attribute values $\{x_i^j\}_{i,j}$ (called "probes") that it specifies. This task is potentially more difficult, as the BDAL learner has to consider dependencies among the attributes $\{x_i\}$ as well as dependencies connecting each attribute with the label $y$. Another complexity is that different attributes can have different costs. This work also differs from ADLL by imposing a hard limit on the purchases -- i.e., allowing the learner only a fixed budget to spend acquiring information. Our bdl framework resembles BDAL as both are acquiring data to find the best parameters for a given belief net structure. Our goal, however, is a good generative model, rather than BDAL's accurate discriminator (that is, a good classifier). This eliminates the distinction between attribute versus label (they are all just "features"), and explains why our learner starts with no data at all (recall that BDAL assumes the learner initially has the labels $\{y^j\}_j$). The simplest version of our bdl system deals with the trivial belief net where all variables are independent (i.e., the nodes are not connected); see Section 3.
This task relates directly to "bandit problems" (Berry & Fristedt, 1985), which basically use a set of trials to identify the optimal "bandit" -- i.e., the "slot machine" that returns the maximal expected payout. While standard bandit problems combine exploration and exploitation (each "probe" or "trial" provides both information about the quality of the bandit played, and also a reward), our model differs as it uses its budget purely for exploration -- i.e., it does not acquire any reward during this time. Madani et al. (2004) investigated this "budgeted variant" of the standard bandit problem. As its loss function depends only on the single bandit selected, this system was able to essentially ignore apparently-inferior bandits. By contrast, our bdl loss function depends on all of the bandits (here, variables in the BN); hence, we need to learn information about all variables, rather than just one. Our bdl also relates to the interventional active learning of generative BNs (IAL) framework (Tong & Koller, 2001a;b; Murphy, 2001; Steck & Jaakkola, 2002). Here, the learner has the option of setting the values of a fixed set of features ("interventions"), then requesting the values of the remaining features of each instance, which it receives at "unit" cost (i.e., the sum of the costs of the remaining features is a constant). The IAL objective is typically to quickly learn the parameters (or the structure) of the belief network that is as close to the correct distribution as possible; we use their criterion (expected KL divergence; Equation 1) to evaluate the quality of our estimated parameters. Our bdl differs as (1) we do not get to set any values, but can only observe the results of our probes; (2) we have an explicit budget; and (3) we have the option of purchasing only a subset of the features for an instance. We will see that this makes computing the posterior distributions more complicated.

2. Budgeted Distribution Learning

In our bdl setting, we start with a parametric model $p_X(\cdot|\theta)$ that defines a distribution over the random variables $X = (X_1, \ldots, X_d)^T$, $X_i \in \mathbb{R}$, and a prior $p_\theta(\cdot)$ over the parameters $\theta \in \mathbb{R}^m$. In what follows, $\theta$ typically denotes a random variable drawn from $p_\theta(\cdot)$ and $X$ denotes a random variable drawn from $p_X(\cdot|\theta)$. (In the context of Figure 1, $X$ is the set of four variables $C, S, R, W$, and $\theta = (\theta_C, \theta_{S|c=T}, \theta_{S|c=F}, \ldots, \theta_{W|r=F,s=F})$ is the set of 9 parameters shown in Figure 1(b), corresponding to CPtable entries of this belief net structure.) Further, we are given a fixed budget $B \in \mathbb{R}^+$ (here $B = 20$).

[Figure 1 shows the sprinkler network Cloudy -> {Sprinkler, Rain} -> Wet Grass: (a) an empty data matrix over C, S, R, W with Budget = 20; (b) the prior knowledge -- a Beta(1, 1) prior on each of the 9 CPtable rows, with probe costs Cost(C) = 1, Cost(S) = 2, Cost(R) = 1, Cost(W) = 2; (c) the partially filled data matrix after the probes, with Budget = 0; (d) the updated posteriors, e.g., $\theta_C \sim$ Beta(2.64, 3.36); (e) the resulting point estimates of the CPtables. Figure 1. Budgeted learning of the parameters of the sprinkler network (Russell & Norvig, 2002).]
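To make this setup concrete, the following is a minimal sketch (our illustration in Python, with hypothetical names, not the authors' code) of the Figure 1 prior knowledge: a Beta(1, 1) prior for each of the 9 CPtable rows, the per-feature probe costs, and the budget.

```python
# A minimal sketch of the Figure 1 setup (hypothetical representation).
# Each CPtable row X_i | u_i carries an independent Beta(alpha, beta) prior.

priors = {                      # (alpha, beta) for each CPtable row
    ("C", ()):             (1, 1),
    ("S", ("C=F",)):       (1, 1),
    ("S", ("C=T",)):       (1, 1),
    ("R", ("C=F",)):       (1, 1),
    ("R", ("C=T",)):       (1, 1),
    ("W", ("S=F", "R=F")): (1, 1),
    ("W", ("S=F", "R=T")): (1, 1),
    ("W", ("S=T", "R=F")): (1, 1),
    ("W", ("S=T", "R=T")): (1, 1),
}

cost = {"C": 1, "S": 2, "R": 1, "W": 2}   # probe cost c_i per feature
budget = 20                                # total budget B

def point_estimate(a, b):
    """Posterior mean of a Beta(a, b) row -- the point estimate of Figure 1(e)."""
    return a / (a + b)
```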
We start with no data (Figure 1(a)), and must employ some "strategy" to collect the data. Moreover, we know it costs $c_i = \mathrm{Cost}(X_i) \in \mathbb{R}_{\ge 0}$ to see the value of variable $X_i$ for any specified data instance. It is helpful to view the data set $D$ as a growing matrix, where each row corresponds to a data instance and each column corresponds to an attribute (Fig. 1(a), 1(c)). We start with an empty matrix, where the values of the cells can only be determined through probes. A probe is defined as a purchase of the value of the $i$th attribute of the $j$th instance, cell $(i, j)$, at cost $c_i$. Letting $X^j$ refer to the $j$th instance (i.e., the $j$th row of this data set), a probe can be applied to a previously probed instance $X^j$ (e.g., probe (5, 3) after probing (2, 3)) or to an un-probed instance, which will increase the number of rows of $D$ by one. The cost of a set of probes $A = \{(i_k, j_k)\}_{k=1}^{|A|}$ is just the simple sum $c(A) = \sum_k c_{i_k}$. When the budget is exhausted (i.e., when $c(A) = B$), bdl then passes the collected data to a parameter learning system. The task of the bdl learner is to make the probes wisely, so that the distribution of $X$, based on these learnt parameters, is estimated as accurately as possible.

Let $A \subseteq \{(i, j) : 1 \le i \le d,\ j \ge 1\}$ be the attribute values probed by the learner, and $X_A = (X_i^j)_{(i,j) \in A}$ be the responses obtained to these probes. We assume that the random variables $X, X^1, X^2, \ldots$ are drawn from $p_X(\cdot|\theta)$, conditionally independently of each other given $\theta$. Let $\theta_{X_A}$ denote a random variable drawn from the posterior distribution $p_\theta(\cdot|X_A)$, and let $\bar\theta_{X_A} = E[\theta|X_A]$ denote its expected value. Here, we use the objective function proposed by Tong & Koller (2001b):

$J(A) = E\left[\, \mathrm{KL}\!\left( p_X(\cdot\,|\,\theta_{X_A}) \,\big\|\, p_X(\cdot\,|\,\bar\theta_{X_A}) \right) \right]. \qquad (1)$

Hence, when the posterior $\theta_{X_A}$ is well concentrated around its mean $\bar\theta_{X_A}$, we expect the cost $J(A)$ to be small. (LPSzG, 2010) proves that this cost equals $J(A) = E\left[ \mathrm{KL}\!\left( p_X(\cdot|\theta) \,\|\, p_X(\cdot|\bar\theta_{X_A}) \right) \right]$, where the expectation is taken over the prior.

We explore this task along two axes: one dimension is whether the variables are independent (Section 3) or not (Section 4); the other, whether the learner is sequential. In the sequential (on-line) framework the set $A$ can be selected in an incremental manner: when deciding about the $(k+1)$st probe, the learner can use the results of the previous $k$ probes. In the non-sequential (aka off-line, allocation) version, the set $A$ must be selected initially, and in particular, without knowing the values of any of the probes. Most of our theoretical results deal with non-sequential algorithms.

3. Independent Variable Model

In this section, we will assume that the features are independent of one another:

Assumption A1. The joint distribution of the variables $(X_1, \ldots, X_d)$ is the product of the marginal distributions. Further, the parameter vector $\theta = (\theta_1, \ldots, \theta_d)^T$ includes one parameter $\theta_i$ for each attribute $X_i$, in that the distribution of $X_i$ depends only on $\theta_i$. Formally, for any $x = (x_1, \ldots, x_d)^T$, we have $p_X(x|\theta) = \prod_{i=1}^d p_{X_i}(x_i|\theta_i)$. The prior also factorizes: $p_\theta(\theta) = \prod_{i=1}^d p_{\theta_i}(\theta_i)$.

As a simple example to illustrate these conditions, consider the case when $\theta_i \in [0, 1]$, $p_{\theta_i}(\cdot)$ is a Beta distribution, and $p_{X_i}(\cdot|\theta_i)$ is a Bernoulli distribution with parameter $\theta_i$. We can view this as having $d$ independent coins, each with a prior on its bias. Our task is to learn as much about the distributions of the coins as is possible given a budget, assuming that each time coin $i$ is flipped, the learner incurs a cost of $c_i > 0$.
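The bookkeeping implied by this setting is straightforward; here is a minimal sketch (a hypothetical helper, not from the paper) of the growing data matrix and the budget constraint $c(A) = \sum_k c_{i_k} \le B$.

```python
# Growing data matrix D: D[j][i] holds the probed value of attribute i in
# instance j, and is absent if cell (i, j) was never purchased.
from collections import defaultdict

class ProbeLedger:
    def __init__(self, costs, budget):
        self.costs = costs            # c_i for each attribute i
        self.budget = budget          # remaining budget
        self.D = defaultdict(dict)    # j -> {i: value}

    def can_afford(self, i):
        return self.costs[i] <= self.budget

    def record(self, i, j, value):
        """Charge c_i and store the response of probe (i, j)."""
        assert self.can_afford(i), "budget exhausted"
        self.budget -= self.costs[i]
        self.D[j][i] = value
```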
3.1. Optimal Allocation Algorithm, iga

Given this independence, it makes sense to consider just the subset of probes associated with the $i$th random variable: $A_i = \{(i, j) : (i, j) \in A\}$. We also let $V = \{(i, j) : 1 \le i \le d,\ j \ge 1\}$ be the set of all possible probes. Under these independence assumptions, the cost $J$ decomposes into the sum of costs defined for the individual variables:¹

Proposition 1. For any $A = A_1 \cup \cdots \cup A_d \subseteq V$, $J(A) = \sum_{i=1}^d J_i(A_i)$, where $J_i(A_i) = E\left[\mathrm{KL}\!\left( p_{X_i}(\cdot|\theta_i) \,\|\, p_{X_i}(\cdot|E[\theta_i|X_{A_i}]) \right)\right]$ is the objective function (1) applied to $X_i$, and $X_{A_i}$ is the set of values of the $X_i$ feature in the data sample $D = \{X^1, X^2, \ldots\}$.

Proposition 2. For any $1 \le i \le d$, $|A_i| = |A_i'|$ implies $J_i(A_i) = J_i(A_i')$.

That is, the cost associated with the $i$th attribute depends only on how many times that attribute was probed. We therefore define $J_i(k)$ to be the cost associated with the $i$th attribute after it has been probed $k$ times, and note that the cost of any allocation $A$ is the same as the cost of the corresponding "compact allocation" of the form $A' = \bigcup_{i=1}^d \{(i, j) : 1 \le j \le |A_i|\}$, which depends only on the cardinalities $a_i = |A_i|$, $i = 1, \ldots, d$.

Definition 1. Let $J$ be a set function mapping subsets of $V$ to reals. We say that $J$ is monotone (non-increasing) if $A \subseteq A' \subseteq V$ implies $J(A) \ge J(A')$, and is supermodular if $A \subseteq A' \subseteq V$ and $v \in V$ implies $J(A) - J(A \cup \{v\}) \ge J(A') - J(A' \cup \{v\})$. (That is, adding $\{v\}$ to a small set $A$ reduces the cost by more than adding $\{v\}$ to the larger set $A'$.)

Let $\mathrm{Beta}_x(\alpha, \beta)$ denote the density of a Beta distributed variable evaluated at $x$, and let $\mathrm{Ber}_x(\theta)$ denote the probability mass function of a Bernoulli random variable with parameter $\theta$, evaluated at $x$.

Proposition 3. The objective function $J_i$ for a variable $X_i$ with prior distribution $p_{\theta_i}(\theta_i) = \mathrm{Beta}_{\theta_i}(\alpha, \beta)$, $\alpha, \beta \in \mathbb{Z}^+$, and model likelihood $\mathrm{Ber}_{X_i}(\theta_i)$ is strictly monotonically decreasing in the number of probes, and supermodular.

Let $\Delta_i(k) = J_i(k) - J_i(k+1)$ be the change of cost associated with the $i$th attribute. The iga algorithm, shown in Figure 2, initially computes these $\Delta_j(0)$ values for each variable $j$, selects the largest $j^* = \arg\max_j \{\Delta_j(0)/c_j\}$, and assigns one probe to that variable. It then computes the expected $\Delta_{j^*}(1)/c_{j^*}$ value for that variable, and again finds the largest value along this modified frontier (replacing $\Delta_{j^*}(0)/c_{j^*}$ with $\Delta_{j^*}(1)/c_{j^*}$). The expectation is based on the chance of reaching a node, given the priors.

¹ Recall that the proofs of this claim, and the following ones, can all be found on the website (LPSzG, 2010).

IGA( budget B; costs c_k; reductions Δ_k(·) )
  s := 0
  a_1 := 0, ..., a_d := 0
  while s < B do
    j := arg max_j { Δ_j(a_j) / c_j }
    a_j := a_j + 1
    s := s + c_j
  end while
  RETURN (a_1, ..., a_d)
Figure 2. Incremental Greedy Allocation algorithm, iga.

To illustrate, consider the two unit-cost binary variables shown in Figure 3, with a budget $B = 2$. iga first computes $\Delta_A(0)$ and $\Delta_B(0)$. As $\Delta_A(0) \approx 1.51\mathrm{E}{-4} < \Delta_B(0) \approx 1.54\mathrm{E}{-4}$, iga allocates 1 probe to B. It then computes $\Delta_B(1)$ as the difference between the weighted sum of the 3 nodes at level 2 and the weighted sum at level 1, which here is $\Delta_B(1) \approx 1.49\mathrm{E}{-4}$. As this is less than $\Delta_A(0)$, iga allocates 1 probe to A. Having spent its entire budget, iga terminates, returning the allocation: 1 probe to A and 1 to B.

One challenge is computing $\Delta_j(k+1)$ from $\Delta_j(k)$. Fortunately, this is efficient in general -- e.g., when considering Beta priors, this takes only $O(k)$ time, where $k \le B$ (assuming each $c_i \ge 1$).
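For the Beta/Bernoulli case of Proposition 3, $J_i(k)$ and $\Delta_i(k)$ can be computed explicitly by averaging the posterior risk over the Beta-Binomial prior-predictive distribution of the outcomes. The sketch below is our illustration, not the authors' code; it assumes integer $\alpha, \beta$ (as in Proposition 3), so the digamma function reduces to a harmonic sum, and it reproduces the numbers shown in Figure 3.

```python
# Closed-form J_i(k) and Delta_i(k) for a Beta(alpha, beta) prior with a
# Bernoulli likelihood (a sketch; alpha, beta are positive integers).
from math import comb, exp, lgamma, log

EULER_GAMMA = 0.5772156649015329

def psi(n):                      # digamma at a positive integer n
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, n))

def risk(a, b):
    """E_{theta ~ Beta(a,b)}[ KL( Ber(theta) || Ber(mean) ) ], mean = a/(a+b)."""
    m = a / (a + b)
    return (m * (psi(a + 1) - psi(a + b + 1) - log(m))
            + (1 - m) * (psi(b + 1) - psi(a + b + 1) - log(1 - m)))

def J(alpha, beta, k):
    """Expected risk after k probes: average of risk over the Beta-Binomial
    prior-predictive distribution of the number h of positive outcomes."""
    log_B = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    total = 0.0
    for h in range(k + 1):
        log_p = (log(comb(k, h)) + lgamma(alpha + h) + lgamma(beta + k - h)
                 - lgamma(alpha + beta + k) - log_B)
        total += exp(log_p) * risk(alpha + h, beta + k - h)
    return total

def delta(alpha, beta, k):       # Delta_i(k) = J_i(k) - J_i(k+1)
    return J(alpha, beta, k) - J(alpha, beta, k + 1)

# Reproduces Figure 3: J(1,49,0) ~ 0.008455, J(28,28,0) ~ 0.0088489,
# delta(1,49,0) ~ 1.51e-4, delta(28,28,0) ~ 1.54e-4.
```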
Overall, the entire algorithm, over $n$ variables and a budget of $B$, requires only $O((n + B)B \ln n)$ time, and only $O(n)$ space. Now consider the following assumptions:

Assumption A2. (i) All costs are equal and, in particular (without loss of generality), $c_i \equiv 1$. (ii) The objective functions $J_i$ for each attribute are both monotone and supermodular.

Proposition 3 shows that Assumption A2(ii) is satisfied by the Beta prior distribution and Bernoulli model likelihood.²

Proposition 4. Given Assumptions A1 and A2, iga computes an optimal allocation.

While iga is the optimal allocation policy, it is not the optimal policy. To illustrate, note that the optimal policy for Figure 3 (for $B = 2$) would first probe A; and if it is tails, probe A again, and otherwise probe B.

² (LPSzG, 2010) shows it also holds for Gaussian distributions; we conjecture that it holds more generally.

[Figure 3 shows two unit-cost binary variables: A, with prior Beta(1, 49) and $J_A = 0.008455$, and B, with prior Beta(28, 28) and $J_B = 0.0088489$. Beside each DAG of reachable posteriors (e.g., for A: 1/50 -- Beta(2, 49) and 49/50 -- Beta(1, 50)), the figure displays the $J$ value at each level; the indented differences include $\Delta_A(0) \approx 1.5146\mathrm{E}{-4}$, $\Delta_B(0) \approx 1.5390\mathrm{E}{-4}$, and the level-1 reductions $1.4599\mathrm{E}{-4}$ and $1.4864\mathrm{E}{-4}$. Figure 3. An example to illustrate iga. We display $J$ for each variable, at each level, beside each DAG. Indented numbers are the difference between each pair of $J$s. The notation "$a$ -- Beta($c$, $d$)" means that the probability of arriving at this node is $a$, and the posterior distribution of the parameter is Beta($c$, $d$).]

Everything above assumes that the variables have uniform cost. Otherwise:

Proposition 5. It is NP-hard to compute the optimal budgeted allocation policy (for determining the parameters that minimize the objective function $J(\cdot)$ of Equation 1), even if the variables are independent (Assumption A1).

3.2. Other (Sequential) Algorithms

We can consider many other algorithms. Round-robin attempts to probe each variable the same number of times, until it terminates on the final iteration. Random just selects each variable uniformly at random. Finally, the adaptive greedy algorithm (aga) policy picks $\arg\max_j \Delta_j(1)/c_j$ -- i.e., the variable with the largest expected one-step change of cost (per probe cost) associated with each attribute, given its current posterior distribution (i.e., based on the responses observed to all previous probes). It then probes this variable once, updates its posterior using the outcome (using a standard Bayesian update), and iterates until the learner runs out of budget.

We can use Figure 3 to help contrast these two algorithms. Recall iga would probe A and B once each here; aga will do likewise. Now consider a slightly different problem, where B's prior is changed to Beta(28, 29). The optimal policy here is the same as for the original example. However, aga would find this optimal policy here, but iga would still allocate one probe to each of the two variables, which is suboptimal. This illustrates the benefit of adaptivity.

Unfortunately, aga might not work well for the general problem, with arbitrary costs. Assume one variable has cost $c_1 = 1$ and the largest one-step reduction $\Delta_1(1) = r$, while each of the remaining $d - 1$ variables has cost $c_j = 1/(d-1)$ and $\Delta_j(1) = r - \epsilon$. If the budget $B = 1$, then aga would probe the first coin and obtain a reduction of $r$; a better strategy, however, could probe each of the other coins once, obtaining a reduction of at least $(d-1)(r - \epsilon)$, which is essentially $O(d)$ better than aga. (See the sketch below for a runnable version of the allocation loop.)
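Here is a minimal runnable version of the Figure 2 loop (our sketch, not the authors' code). It assumes a function `delta(j, k)` returning $\Delta_j(k)$, e.g., `delta(alpha_j, beta_j, k)` from the Beta/Bernoulli computation above, and uses a heap for the frontier, which is where the $\ln n$ factor in the time bound quoted above comes from.

```python
import heapq

def iga(budget, costs, delta):
    """Incremental Greedy Allocation (Figure 2).
    costs[j] is c_j; delta(j, k) is the reduction Delta_j(k) from probing
    variable j a (k+1)st time. Returns the allocation (a_1, ..., a_d)."""
    d = len(costs)
    alloc = [0] * d
    # max-heap on Delta_j(a_j)/c_j, realized as a min-heap on the negation
    frontier = [(-delta(j, 0) / costs[j], j) for j in range(d)]
    heapq.heapify(frontier)
    spent = 0
    while spent < budget:            # same stopping rule as Figure 2
        _, j = heapq.heappop(frontier)
        alloc[j] += 1
        spent += costs[j]
        heapq.heappush(frontier, (-delta(j, alloc[j]) / costs[j], j))
    return alloc
```

The sequential aga variant would instead recompute the one-step reduction from the current posterior after observing each probe's outcome, rather than from the prior alone.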
3.3. Experiments on Independent Variables

In this section, we report empirical results of a series of experiments that compare the effectiveness of these various algorithms for the independent variable model in the case when the priors are informative. The first two experiments used 5 independent binary variables, with priors Beta($\alpha_i$, $\beta_i$). We generated non-uniform costs from a uniform discrete distribution on $\{1, 2, \ldots, 5\}$, where each integer cost has a probability of $1/5$. We generated non-uniform priors by first drawing an effective sample size $e_i$ uniformly from $\{10, 11, \ldots, 30\}$, then drawing $\alpha_i$ uniformly from $\{1, 2, \ldots, e_i - 1\}$, and setting $\beta_i = e_i - \alpha_i$. The true $\theta$s are then generated from these Beta($\alpha_i$, $\beta_i$) priors. We ran 4,000 experiments for each budget $B \in \{1, \ldots, 50\}$; Fig. 4(a)–4(b) plot the empirical means of the cost values $J$ achieved. We repeated these experiments on a larger network using 10 independent nodes, and the fixed budget $B = 200$. Fig. 4(c)–4(d) show how the empirical means of the estimated costs $J$ (that we can achieve after spending our budget $B = 200$) converge in the number of runs for the different algorithms. As expected, we see that algorithms that use the distributional information (iga and aga) perform much better than ones that do not, round-robin and random. We ran a Wilcoxon signed rank test on this data, and found that in all of these subfigures, aga and iga are significantly better than Random and Round-robin for essentially every number of probes (at the $\alpha = 0.05$ significance level).

[Figure 4 shows four panels of KL-divergence curves for Random, Round-robin, aga, and iga: (a) uniform cost and (b) different cost, as a function of the budget $B \in \{1, \ldots, 50\}$; (c) uniform cost and (d) different cost, as a function of the number of runs at $B = 200$. Figure 4. Empirical studies using informative priors.]

4. General Distributions

This section provides results when the joint probability distribution of $X$ is given by a general (discrete) belief network, which can include dependencies among the variables. In general, a belief network specifies the joint probability distribution $p_X(\cdot|\theta)$ over the random variable $X = (X_1, \ldots, X_d)^T$ in a parsimonious manner. We assume we are given a labeled directed acyclic graph, $G = \langle V, E \rangle$ ($V = \{1, \ldots, d\}$, $E \subseteq V \times V$), where node $i$ is associated with the random variable $X_i$. Let $U_i$ be the set of parent nodes of node $i$. The graph $G$ encodes the conditional independencies of $X$: each node is independent of its non-descendants, given its parents (Pearl, 1988). As we assume that $X$ is discrete valued, we need to specify, for each node $X_i$, the conditional distributions $\theta_{x_i|u_i} = p_{X_i|U_i}(x_i \,|\, u_i; \theta)$, called the conditional probability tables. We can then write the probability of any complete tuple $x = (x_1, \ldots, x_d)^T$ as a simple product $p_X(x|\theta) = \prod_{i=1}^d p_{X_i}(x_i \,|\, u_i, \theta_i) = \prod_i \theta_{x_i|u_i}$. We assume that these parameters $\{\theta_{X_i|u_i}\}$ are initially independent, which means the belief network corresponds to a product of Dirichlet distributions.
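A small sketch (our illustration, with a hypothetical representation) of this factorization, together with the complete-data Dirichlet update discussed next: each CPtable row keeps a vector of Dirichlet counts, the joint probability of a complete tuple is the product of CPtable entries, and a complete instance simply increments one count per variable.

```python
# Hypothetical representation: parents[i] is the tuple of parent indices of
# node i; theta[i][u][v] is the CPtable entry theta_{X_i = v | U_i = u};
# counts[i][u] is the Dirichlet count vector of the row X_i | U_i = u.

def joint_prob(x, parents, theta):
    """p_X(x | theta) = prod_i theta_{x_i | u_i} for a complete tuple x."""
    p = 1.0
    for i in range(len(x)):
        u = tuple(x[q] for q in parents[i])   # parent assignment u_i
        p *= theta[i][u][x[i]]
    return p

def update_complete(x, parents, counts):
    """Complete-data posterior update: Dirichlet(alpha) -> Dirichlet(alpha + data
    counts); e.g., Beta(a, b) -> Beta(a + 1, b) when x matches X_i = + and U_i = u."""
    for i in range(len(x)):
        u = tuple(x[q] for q in parents[i])
        counts[i][u][x[i]] += 1
```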
Given a data set $D$, we can compute the posterior distribution; if this $D$ consists of complete instances (i.e., specifies a value for each variable), then the network remains factored as a product of Dirichlet distributions, each of which is updated: if the parameter is initially $\theta_{X_i|U_i=u} \sim \mathrm{Beta}(\alpha, \beta)$, and $D$ includes $a$ instances matching $X_i = +$ and $U_i = u$ and $b$ instances matching $X_i = -$ and $U_i = u$, then the posterior is $\theta_{X_i|U_i=u} \sim \mathrm{Beta}(\alpha + a, \beta + b)$ (Heckerman, 1999). Unfortunately, if there are omissions in the training data (that is, there is a training instance that includes the values for some, but not all, of the features), then the posterior distribution can become a mixture, which in general does not have a simple analytic form. Indeed, if a data instance specifies only $d - k$ values (i.e., there are $k$ omissions), then the posterior distribution corresponds to a mixture of $2^k$ products of Dirichlet distributions (if each variable is binary). We therefore need a way to estimate this objective function. (As computing $J(\cdot)$ requires estimating the distribution itself, and not just its mean values, we cannot use EM (Dempster et al., 1977).) See Section 4.3.

4.1. Complete 2-Node Belief Networks with BDe Priors and Uniform Costs

A network has BDe Dirichlet priors if the CPtable entries are "matching" (Tong & Koller, 2001b). For example, in the structure $A \rightarrow B$, the parameters $\theta_A \sim \mathrm{Beta}(3, 4)$, $\theta_{B|+a} \sim \mathrm{Beta}(1, 2)$, and $\theta_{B|-a} \sim \mathrm{Beta}(3, 1)$ are BDe, as the "effective" number of pseudo-instances corresponding to $+a$ here is 3, based both on the "3" in $\theta_A$'s first parameter and the fact that $\theta_{B|+a}$ has an effective sample size of $1 + 2 = 3$. Similarly, the "effective" number of pseudo-instances corresponding to $-a$ here is 4, based both on the "4" in $\theta_A$'s second parameter and $\theta_{B|-a}$'s effective sample size of $3 + 1 = 4$.

Proposition 6. For a complete 2-node belief network with BDe Beta priors and uniform costs, when the budget $B$ is an even number, an allocation algorithm that takes full data instances gives a posterior distribution with the minimum expected risk $J(\cdot)$.

This claim shows that, for some situations, the best allocation algorithm involves the obvious round-robin approach: if our budget is $B$ and there are $d = 2$ variables, probe each variable $B/d$ times.

4.2. Non-BDe Priors, Non-uniform Costs, Incomplete Structures

The BDe and uniform cost constraints are crucial for a round-robin-like algorithm, which takes full data instances, to perform well. As one counter-example, consider the non-BDe distribution in Fig. 5(a). Here, $X$ and $Y$ are basically independent of each other. As we are very certain about the probability of $Y$ but we know little about $X$, and their costs are the same, it makes sense to allocate more probes to $X$. (That is, if the budget $B = 2$, we will do much better probing $X$ twice, rather than the round-robin approach of probing $X$ once and $Y$ once.) Now consider the BDe distribution shown in Fig. 5(b), where the costs are very different. As $X$ and $Y$ are highly correlated, and the cost of probing $X$ is significantly cheaper, we clearly get more "information per unit cost" by probing $X$ more. (So given the budget $B = 101$, we would do much better probing $X$ 101 times, rather than probing $X$ once and $Y$ once.)
Finally, this RoundRobin also requires that the structure be complete -- i.e., that every node is connected to every other. As a counter-example, imagine $X$ and $Y$ are independent, where $\theta_X \sim \mathrm{Beta}(1, 999)$ and $\theta_Y \sim \mathrm{Beta}(500, 500)$. Given the budget $B = 2$, we will do much better probing $Y$ twice, rather than the round-robin approach of probing $X$ once and $Y$ once.

[Figure 5 shows two 2-node networks $X \rightarrow Y$: (a) uniform costs, non-BDe priors -- Cost(X) = Cost(Y) = 1, $\theta_X \sim$ Beta(1, 1), $\theta_{Y|X=F} \sim$ Beta(1, 99), $\theta_{Y|X=T} \sim$ Beta(1, 99); (b) different costs, BDe priors -- Cost(X) = 1, Cost(Y) = 100, $\theta_X \sim$ Beta(100, 100), $\theta_{Y|X=F} \sim$ Beta(99, 1), $\theta_{Y|X=T} \sim$ Beta(1, 99). Figure 5. Situations where RoundRobin is suboptimal.]

4.3. Estimating J(·) from Partial Data

The above counter-examples show that the round-robin algorithm is not always optimal, which means the optimal algorithm will probably produce partial instances. To evaluate such algorithms, we need to compute the objective function $J(\cdot)$ on the resulting distribution; as discussed above, this is tricky as the resulting distribution is no longer simple. This section provides a sampling algorithm for estimating this expected KL value.

In general, we can write the final data sample as $D = \{D_o, D_m\}$, where $D_o$ and $D_m$ are the observed and missing elements, respectively. Our goal is to obtain the observations $D_o$ that lead to a posterior over $\theta$ that is well concentrated around its mean (in expectation) -- i.e., that minimize

$J(D_o) = E_{\theta_1 \sim p_\theta(\cdot|D_o)}\left[\, \mathrm{KL}\!\left( p_X(\cdot|\theta_1) \,\|\, p_X(\cdot|\bar\theta) \right) \right], \qquad (2)$

where $\bar\theta = E_{p_\theta(\cdot|D_o)}[\theta]$. Observe that the posterior is a mixture of Dirichlets:

$p(\theta \,|\, D_o) = \sum_{D_m} p(\theta \,|\, D_o, D_m)\; p(D_m \,|\, D_o). \qquad (3)$

Since the number of components of the posterior can be exponential in the number of omissions in the dataset, $|D_m|$, its analytical computation is not feasible for a dataset with a large number of omissions. We therefore define the $\hat J$ sampling algorithm (shown in Figure 6) to estimate the objective function in (2), and to serve as the basis for the obvious Greedy algorithm: at each time, consider probing each unspecified feature $X_i^j$. Each of the $k \ge 2$ possible outcomes of this probe will occur with probability $p(X_i^j \,|\, \theta, D_o)$. While we do not know the true $\theta$, we can use the estimate $\hat\theta(X^j \cap D_o)$ based on the data, where $X^j \cap D_o$ are the features specified in the partial instance $X^j$. Then, for each $\{X_i^j = v\}$ outcome, we compute $\hat J(p_X(\cdot|\hat\theta);\, D_o \cup \{X_i^j = v\};\, \ldots)$ as an estimate of the resulting objective function. The average $\hat J(i, j) = \sum_v p(X_i^j = v \,|\, \hat\theta)\, \hat J(p_X(\cdot|\hat\theta);\, D_o \cup \{X_i^j = v\};\, \ldots)$ estimates the risk that would remain after probing this $X_i^j$; we then probe the optimal $(\hat\imath, \hat\jmath) = \arg\min_{i,j} \hat J(i, j)$.

Ĵ( p_X(·|θ0); D_o; S )    % θ0 = initial parameters, D_o = observed data, S = number of imputed datasets
  for i = 1..S do
    Draw the ith imputed data sample D_m^i from p_X(· | D_o, θ0)
    Draw θ̂(i) from p_θ(· | D_o, D_m^i)    % each θ̂(i) is a product of Dirichlet distr'ns
  end for
  θ̄ := (1/S) Σ_{i=1}^S θ̂(i)    % = mean[θ̂], an estimate of E_{p_θ(·|D_o)}[θ]
  for i = 1..S do
    K(i) := Σ_{n=1}^d Σ_{u_n} p_X(u_n | θ̂(i)) Σ_{x_n} θ̂(i)_{x_n|u_n} ln( θ̂(i)_{x_n|u_n} / θ̄_{x_n|u_n} )
      % n indexes the variables; u_n ranges over instantiations of the parents of the nth variable X_n
      % θ̂(i)_{x_n|u_n} = parameter of θ̂(i) associated with [X_n = x_n, U_n = u_n]; similarly for θ̄_{x_n|u_n}
  end for
  RETURN Ĵ := (1/S) Σ_{i=1}^S K(i)    % = mean[K]
Figure 6. Sampling algorithm for estimating the EKL of a posterior distribution.
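The inner $K(i)$ term of Figure 6 is just the KL divergence between two fully-parameterized networks that share the same structure. A brute-force sketch of that step (our illustration, not the authors' code; it omits the outer Monte Carlo loop over imputed datasets, and assumes the hypothetical representation used earlier: `parents[i]` lists node $i$'s parents, `card[i]` is its number of values, and `theta[i][u][v]` is a CPtable entry; $p(u_n|\hat\theta)$ is computed by enumerating the joint, which is fine for small networks such as "Cancer"):

```python
from itertools import product
from math import log

def joint(x, parents, theta):
    """p_X(x | theta) for a complete tuple x."""
    p = 1.0
    for i in range(len(x)):
        u = tuple(x[q] for q in parents[i])
        p *= theta[i][u][x[i]]
    return p

def ekl(theta_hat, theta_bar, parents, card):
    """K(i) of Figure 6: KL( p_X(.|theta_hat) || p_X(.|theta_bar) ),
    decomposed over the CPtables of a shared structure."""
    kl = 0.0
    for n in range(len(card)):
        for u in product(*(range(card[q]) for q in parents[n])):
            # marginal probability of parent assignment u under theta_hat,
            # by brute-force enumeration of the joint
            p_u = sum(joint(x, parents, theta_hat)
                      for x in product(*(range(c) for c in card))
                      if tuple(x[q] for q in parents[n]) == u)
            kl += p_u * sum(theta_hat[n][u][v]
                            * log(theta_hat[n][u][v] / theta_bar[n][u][v])
                            for v in range(card[n]))
    return kl
```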
4.4. Empirical Studies on General BNs

We performed many experiments using 3 algorithms: the Greedy defined above, as well as the obvious Random and Round-robin, over a number of belief network structures. Here we report our findings on the "Cancer" structure (Fig. 7(a)), a BN with 5 correlated binary variables and 11 parameters (Pearl, 1988). Fig. 7(b) compares these 3 algorithms (where Greedy uses $S = 5$) with a budget $B = 10$. The plots show how the running averages of the $\hat J(\ldots)$ functions converge for each algorithm. We can see that Greedy performs better than the other two algorithms, which shows that considering the defined risk can help the learner to obtain a better estimate of the parameters. A Wilcoxon signed rank test confirmed that Greedy significantly outperformed Random and Round-robin at the $\alpha = 0.05$ significance level.

[Figure 7 shows (a) the Cancer belief network over the binary variables Cancer, Calcium Increase, Brain Tumor, Papilledema, and Coma, and (b) KL-divergence curves for Random, Round-robin, and Greedy as a function of the number of runs. Figure 7. (a) Cancer belief network. (b) KL divergences from the true parameters.]

5. Conclusions

This paper theoretically and empirically explores the budgeted distribution learning task, where the learner is allowed to sequentially probe specified attributes of data instances to obtain their values, with the goal of learning the joint distribution over the attributes, subject to the fixed total budget. We first studied the simple case where all variables are independent, and proved that the simple iga is the optimal allocation algorithm when the costs are uniform, but that even this simple allocation task is NP-hard for general costs. We then showed that sequential algorithms can do better than allocation algorithms, and presented empirical studies to show that algorithms that use the distributional information (iga and aga) perform much better than ones that do not (round-robin and random). We then considered more general belief networks with dependencies, and showed restricted situations where the obvious round-robin algorithm is guaranteed to be the optimal allocation algorithm, followed by examples where round-robin is not optimal. This means we will need to consider partially specified training instances, which lead to complex posterior distributions that do not have a simple closed form. We provided a stochastic algorithm that approximates the evaluation criterion (expected KL divergence), which we used to produce a greedy algorithm. While the round-robin algorithm does work well in practice for certain belief nets, our empirical results show that our greedy algorithm can be significantly superior.

The results of these explorations reveal a number of insights about the challenges of acquiring the information needed to learn a distribution. The webpage (LPSzG, 2010) shows that these results hold for other relevant distributions as well. This paper presents many initial steps in addressing the challenges of the budgeted distribution learning framework. However, much remains to be done. For example, our results show that, except for a few special cases, learning in the Bayesian framework can be hard. In the future, we plan to study the challenge of approximating this solution.

Acknowledgements

This work was supported in part by AICML, AITF (formerly iCore and AIF), NSERC and the PASCAL2 Network of Excellence under EC grant no. 216886. Cs. Szepesvári is on leave from SZTAKI, Hungary.
References

Berry, D. and Fristedt, B. Bandit Problems: Sequential Allocation of Experiments. Springer, 1985.

Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

Heckerman, D. A tutorial on learning with Bayesian networks. In Learning in Graphical Models, pp. 79–119. MIT Press, 1999.

Lizotte, D., Madani, O., and Greiner, R. Budgeted learning of Naive-Bayes classifiers. In UAI, pp. 378–385, 2003.

LPSzG, 2010. http://sites.google.com/site/BudgetedDistributionLearning/.

Madani, O., Lizotte, D., and Greiner, R. Active model selection. In UAI, pp. 357–365, 2004.

Melas, V. B. Functional Approach to Optimal Experimental Design. Springer, 2006.

Murphy, K. Active learning of causal Bayes net structure. Technical report, UCB, 2001.

Muslea, I. Active learning with multiple views. PhD thesis, USC, 2002.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall, 2002.

Steck, H. and Jaakkola, T. Unsupervised active learning in large domains. In UAI, pp. 469–476, 2002.

Tong, S. and Koller, D. Active learning for structure in Bayesian networks. In IJCAI, pp. 863–869, 2001a.

Tong, S. and Koller, D. Active learning for parameter estimation in Bayesian networks. In NIPS, pp. 647–653, 2001b.

Verma, T. and Pearl, J. Equivalence and synthesis of causal models. In UAI, pp. 255–270, 1991.