Learning From Measurements in Exponential Families

Percy Liang (pliang@cs.berkeley.edu)
Computer Science Division, University of California, Berkeley, CA 94720, USA

Michael I. Jordan (jordan@cs.berkeley.edu)
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA

Dan Klein (klein@cs.berkeley.edu)
Computer Science Division, University of California, Berkeley, CA 94720, USA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

Given a model family and a set of unlabeled examples, one could either label specific examples or state general constraints; both provide information about the desired model. In general, what is the most cost-effective way to learn? To address this question, we introduce measurements, a general class of mechanisms for providing information about a target model. We present a Bayesian decision-theoretic framework, which allows us to both integrate diverse measurements and choose new measurements to make. We use a variational inference algorithm, which exploits exponential family duality. The merits of our approach are demonstrated on two sequence labeling tasks.

[Figure 1 shows an example ad in which each word carries a label, e.g., "View of Los Gatos Foothills" labeled feat, "Available July 1" labeled avail, and "2 bedroom 1 bath" labeled size.]
Figure 1. A sequence labeling task: Given a sequence of words from a Craigslist housing ad, label each word according to the type of information it provides: address, availability, contact, features, size, etc.

1. Introduction

Suppose we are faced with a prediction problem and a set of unlabeled examples. The traditional approach in machine learning is to label some of these examples and then fit a model to that labeled data. However, recent work has shown that specifying general constraints on model predictions can be more efficient for identifying the desired model (Chang et al., 2007; Mann & McCallum, 2008). In practice, one might want to use both labels and constraints, though previously these two sources have been handled in different ways.

In this paper, we adopt a unified statistical view in which both labels and constraints are seen as ways of providing information about an unknown model. To this end, we introduce measurements, which subsume the notions of labels, partial labels, and general constraints on model predictions. Formally, a measurement is the expectation of a function (called a measurement feature) over the outputs of the unlabeled examples. A measurement provides a glimpse of the hidden outputs, thus providing partial information about the underlying model.

As a motivating application, consider the sequence labeling task shown in Figure 1. Given a Craigslist ad (a sequence of words) as input, the task is to output a label for each word indicating the semantic field to which it belongs (e.g., address, size, availability, etc.). Past research on this task has shown that in addition to obtaining labels of full sequences, it is particularly efficient to directly impose soft, cross-cutting constraints on the predictions of the model, for example, "the word bedroom is labeled as size at least 90% of the time."

Given both labels and constraints, how do we integrate them in a coherent manner? Additionally, how do we compare the value of information of various labelings and constraints? To address these questions, we present a Bayesian decision-theoretic framework. Our setup follows the principles of Bayesian experimental design (see Chaloner and Verdinelli (1995) for an overview) but generalizes traditional designs in that we receive information not directly, via labeled data, but indirectly, via measurements.

In order to scale up to large datasets, we present a variational approximation which exploits properties of Fenchel duality for exponential families. Our approximation is similar to the framework of Graça et al. (2008) for handling constraints on model predictions in a generative setting. Our variational objective can be optimized by solving a saddle point problem. Empirically, we tested our method on a synthetic dataset and two natural language datasets (Craigslist ads and part-of-speech tagging), showing that we can integrate various types of measurements in a coherent way and also improve performance by actively selecting the measurements.
2. Measurements

Consider a prediction task, where 𝒳 denotes the set of possible inputs and 𝒴 denotes the set of possible outputs. We start with a sequence of inputs X = (X_1, ..., X_n), but unlike supervised learning, we do not observe the corresponding hidden outputs Y = (Y_1, ..., Y_n). Instead, we propose making k measurements on the data as follows:

    τ = Σ_{i=1}^n φ(X_i, Y_i) + W,    (1)

where φ(x, y) ∈ R^k is a vector of measurement features, τ ∈ R^k is a vector of (observed) measurement values, and W is some measurement noise.

The purpose of measurements is to provide a unified language for specifying partial information about Y. Traditional methods do deal with missing data, but the setting there is usually that of partial labels on individual examples. In contrast, we consider a more general space of mechanisms for partial supervision. Importantly, a measurement is an aggregate statistic that can span multiple examples.

In practice, measurement values can arise in two ways. The first is via real measurements obtained from the data; examples include labels of individual examples or aggregate values from a real experiment (e.g., pooling in genetics or census-taking). The second is via pseudo-measurements, which are set by hand to reflect prior knowledge, perhaps by "measuring" in a thought experiment. In binary classification, declaring that the fraction of positive examples is at least 95% is an example of a pseudo-measurement. The difference between pseudo- and real measurements is purely a conceptual one, as the two types are handled the same way inferentially.
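To make the abstraction in (1) concrete, the following minimal Python sketch (ours, for illustration only; the data, function names, and noise level are not from the paper) generates a single measurement value from hypothetical labeled data, using a label-proportion feature and box noise:

```python
import random

# Hypothetical word/label pairs; in the actual setting the labels are hidden,
# and tau is supplied directly (a real measurement or a pseudo-measurement).
data = [("bedroom", "size"), ("available", "avail"), ("bath", "size")]

def phi_label_proportion(x, y, b="size"):
    """Measurement feature I[y = b]: one component of phi, counting label b."""
    return 1.0 if y == b else 0.0

def measure(data, phi, eps=0.5):
    """Eq. (1): tau = sum_i phi(X_i, Y_i) + W, with box noise W ~ U[-eps, eps]."""
    total = sum(phi(x, y) for x, y in data)
    return total + random.uniform(-eps, eps)

tau = measure(data, phi_label_proportion)  # a noisy count of "size" labels
```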
We now give examples of measurements more formally:

Fully-Labeled Example. To represent the output of some input X_i ∈ 𝒳, let the components of φ include φ(x, y) = I[x = X_i, y = b] for all b ∈ 𝒴.[1] Then the corresponding components of τ entirely determine Y_i. While these measurements are sums over all n examples, τ can be computed by just inspecting X_i.

[1] The indicator function is I[a] = 1 if a = true and 0 otherwise.

Partially-Labeled Example. Suppose that we only observe f(Y_i), a partial version of Y_i; for example, let Y_i be a sequence and f(Y_i) a subsequence. If φ includes φ(x, y) = I[x = X_i, f(y) = b] for all b ∈ f(𝒴), then τ reveals Y_i up to f(Y_i).

Labeled Predicate. We can determine the outputs of all inputs x for which f(x) = 1 by measuring φ(x, y) = I[f(x) = 1, y = b] for all b ∈ 𝒴. An example of a labeled predicate[2] in document classification occurs when x is a document and f(x) = 1 if x contains the word "market." Druck et al. (2008) showed that labeling these predicates can be more cost-effective than labeling full examples. For sequence labeling tasks, we typically want to provide the frequency of some label b over all positions where the word at that position is some a (Mann & McCallum, 2008). For this, make measurements of the form φ(x, y) = Σ_{i=1}^ℓ I[x_i = a, y_i = b], where ℓ is the length of the sequence. In Quadrianto et al. (2008), examples are partitioned into sets, and all φ(x, y) = I[f(x) = a, y = b] are measured, where f(x) is the set to which x belongs.

[2] Druck et al. (2008) uses the term feature instead of predicate. We use predicate to denote an indicator function of the input x, reserving feature for functions on (x, y).

Label Proportions. Measuring φ(x, y) = I[y = b] for all b ∈ 𝒴 yields the proportions of each output label. This is the information used in expectation regularization (Mann & McCallum, 2007).

Structured Label Constraints. Sometimes we have structural constraints on the outputs. In the Craigslist task, for instance, the output is a sequence y = (y_1, ..., y_ℓ). Domain knowledge tells us that each label either appears in a contiguous block or not at all. To capture this constraint, we use pseudo-measurement features φ(x, y) = Σ_{i=1}^{ℓ−1} I[x = a, y_i ≠ b, y_{i+1} = b] for each a ∈ 𝒳 and label b. We set the measurement values to τ = 0 and their measurement noises to independent W ~ −U[0, 1]. These quantities ensure that transitions into b, which mark the beginning of a new block for b, happen between 0 and 1 times (i.e., at most once). See Graça et al. (2008) for other types of constraints on structured outputs.

Label Preferences. Suppose we don't know the exact proportions but strongly believe that b* is the most common label. This information can be encoded by the pseudo-measurements φ(x, y) = I[y = b*] − I[y = b] for b ∈ 𝒴, setting τ = 0 with noise W ~ −U[0, n]. These quantities ensure that Σ_{i=1}^n I[Y_i = b*] ≥ Σ_{i=1}^n I[Y_i = b] for all b ∈ 𝒴. These preferences can also be adapted to operate conditioned on predicates.

It is often natural to obtain measurements of different types. We want to combine all the diverse measurements in a coherent way. This is important since there will naturally be varying amounts of redundancy across measurements. Furthermore, we would like a mechanism for determining which measurements to make next, accounting for both their costs and possible benefits. How to achieve these two goals in a principled way is the focus of the next section.
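For illustration, two of the measurement types above can be written as feature constructors in Python (a sketch; the helper names are ours, not the paper's):

```python
def labeled_predicate(a, b):
    """phi(x, y) = sum_i I[x_i = a, y_i = b]: counts how often label b occurs
    at positions where the word is a (Mann & McCallum, 2008)."""
    def phi(x, y):
        return float(sum(xi == a and yi == b for xi, yi in zip(x, y)))
    return phi

def label_preference(b_star, b):
    """phi(x, y) = I[y = b_star] - I[y = b]; measured with tau = 0 and noise
    W ~ -U[0, n], this encodes that b_star is at least as common as b."""
    def phi(x, y):
        return float(y == b_star) - float(y == b)
    return phi

phi_bedroom_size = labeled_predicate("bedroom", "size")
print(phi_bedroom_size("2 bedroom 1 bath".split(), ["size"] * 4))  # 1.0
```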
3. A Bayesian Framework

In this section, we present a Bayesian framework for measurements, which provides a unified way of both estimating model parameters given fixed measurements (Section 3.1) and optimally choosing new measurements (Section 3.2).

3.1. From Measurements to Model

Our goal is to learn a predictor based on observed measurements. For the predictor, we use conditional exponential families, which include a broad class of prediction models, e.g., linear regression, logistic regression, and conditional random fields (Lafferty et al., 2001). A conditional exponential family distribution is defined as follows:

    p_θ(y | x) = exp{⟨σ(x, y), θ⟩ − A(θ; x)}    (2)

for x ∈ 𝒳, y ∈ 𝒴, where σ(x, y) ∈ R^d is a vector of model features, θ ∈ R^d is a vector of model parameters, and A(θ; x) = log ∫ e^{⟨σ(x,y), θ⟩} dy is the log-partition function.

Specifically, we would like to infer the model parameters θ from the measurement values τ and inputs X. Recall that the outputs Y are hidden. For guidance on how to perform this inference, we define the following Bayesian model (Figure 2(b)):

    p(θ, Y, τ | X, φ) = p(θ) ∏_{i=1}^n p_θ(Y_i | X_i) · p(τ | X, Y, φ).    (3)

For computational reasons, we assume the parameter prior and the noise distribution have log-concave densities: log p(θ) = −h_θ(θ) + constant and log p(τ | X, Y, φ) = −h_τ(τ − φ_X(Y)) + constant, where h_θ and h_τ are even convex functions, and φ_X(Y) = Σ_{i=1}^n φ(X_i, Y_i). For example, we could use a Gaussian prior on θ (h_θ(θ) = ||θ||₂²/2) and independent box noise (h_τ(u) = ∞[∃j : |u_j| > ε_j]).[3]

[3] Here ∞[a] = ∞ if a = true and 0 otherwise.

Given (3), we can obtain the posterior p(θ | τ, X, φ) by marginalization. It is conceptually useful to decompose this marginalization into two steps: We first combine the pseudo-measurements τ_pseudo with a preliminary prior p(θ) to obtain a new prior p(θ | τ_pseudo, X, φ). Then we combine this prior with the real measurements to obtain the final posterior p(θ | τ, X, φ). This situation is analogous to multinomial estimation with a conjugate Dirichlet prior: pseudocounts (concentration parameters) determine the prior, which combines with real counts to form the posterior.

We make one conceptual point regarding the relationship between measurement features and model features. While the two are the same type of mathematical object, they play different roles. Consider features f_b(x, y) = Σ_{i=1}^ℓ I[word x_i ends in -room, y_i = b] for all labels b. As measurement features, f would indicate that words ending in -room are labeled according to the observed values τ. As model features, f would only indicate that words ending in -room are labeled similarly. In this way, measurement features (along with τ) provide direct information whereas model features provide indirect information. In general, measurement features should be finer-grained than model features, since finer features are easier to measure but coarser features generalize better.[4]

[4] A feature f1 is finer than another feature f2 if f2(x, y) = 0 implies f1(x, y) = 0.
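As a concrete instance of (2), the sketch below implements a conditional exponential family over a small finite output space; with these particular (invented) features it reduces to binary logistic regression:

```python
import math

def log_partition(theta, sigma, x, Y):
    """A(theta; x) = log sum_y exp(<sigma(x, y), theta>) over a finite set Y."""
    scores = [sum(t * s for t, s in zip(theta, sigma(x, y))) for y in Y]
    m = max(scores)  # stabilized log-sum-exp
    return m + math.log(sum(math.exp(s - m) for s in scores))

def p_theta(y, x, theta, sigma, Y):
    """Eq. (2): p_theta(y | x) = exp{<sigma(x, y), theta> - A(theta; x)}."""
    score = sum(t * s for t, s in zip(theta, sigma(x, y)))
    return math.exp(score - log_partition(theta, sigma, x, Y))

# Toy example: Y = {0, 1} with two model features (a weight on x and a bias).
Y = [0, 1]
sigma = lambda x, y: [x, 1.0] if y == 1 else [0.0, 0.0]
print(p_theta(1, 2.0, [0.5, -0.3], sigma, Y))  # sigmoid(0.7) ~ 0.668
```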
3.2. Active Measurement Selection

We now have a handle on how to learn from measurements, but how do we choose the optimal measurements to make in the first place? To talk about optimality, we must define a utility function. For us, this involves predictive accuracy. First, define r(y, ŷ) to be the reward (e.g., label accuracy, or equivalently, negative Hamming loss) if the actual output is y and we predict ŷ. If we use the Bayes-optimal predictor to make predictions on a new example X′ with true output Y′, the expected reward is as follows:

    R(φ, τ) = E_{p*(X′)} [ max_{Ŷ} E_{p(Y′, θ | X′, τ, X, φ)} [r(Y′, Ŷ)] ].    (4)

In short, R(φ, τ) measures our satisfaction with having made measurements (φ, τ). We also introduce C(φ), the cost of measuring φ. Then the net (expected) utility is the difference:

    U(φ, τ) = R(φ, τ) − C(φ).    (5)

In practice, we choose measurements in a sequential fashion. Suppose we have already made measurements (φ0, τ0) and want to choose the next φ yielding the highest expected utility. However, since we do not know what measurement value τ we will obtain, we must integrate over τ. Thus, the best subsequent measurement (feature) is given by the following:

    φ* = argmax_{φ∈Φ} U(φ),  where  U(φ) = E_{p(τ | X, φ, φ0, τ0)} [U((φ0, φ), (τ0, τ))],    (6)

and Φ is the set of candidate measurement features. Note that φ* is obtained via one-step lookahead, so it is only Bayes-optimal if φ* is the final measurement.

This completes the description of our measurement framework. Most of the computations above are intractable, so the remainder of this paper will focus on designing practical approximations.

We pause briefly to compare our framework with traditional experimental design (active learning). In both, there is an unknown parameter θ which governs p_θ(y | x). However, in traditional design, one chooses a set of inputs X_1, ..., X_n, whereupon the outputs Y_1, ..., Y_n are revealed, and inference is then made on θ. In our measurement framework, we choose measurement features φ, whereupon the measurement values τ are revealed, and inference is then made on θ through the latent outputs Y, which must be integrated out. Figure 2 illustrates the distinction.

[Figure 2: the two graphical models are omitted here.]
Figure 2. In traditional experimental design (a), one selects X and observes Y. In our measurement framework (b), one selects φ and observes τ.

4. Approximation Methods

We now present methods for making the Bayesian principles described in the previous section practical. We first present an approximate inference algorithm for computing the posterior given fixed measurements (Section 4.1). We then present a method for actively choosing measurements (Section 4.2).

4.1. Approximate Inference

Our approximate computation of the true posterior p(Y, θ | τ, X, φ) proceeds in three steps. First, we apply a standard mean-field factorization (Section 4.1.1). Next, we relax the contribution of the measurements and apply Fenchel duality to obtain a workable objective function (Section 4.1.2). Finally, we present a strategy to optimize this function (Section 4.1.3).

4.1.1. Mean-Field Factorization

Following standard variational principles, we turn (approximate) posterior computation into an optimization problem over a tractable set of distributions Q:

    min_{q∈Q} KL(q(Y, θ) || p(Y, θ | τ, X, φ)).    (7)

We use a mean-field approximation with a degenerate distribution over θ:

    Q = {q(Y, θ) : q(Y, θ) = q(Y) δ_{θ̃}(θ)}.    (8)

Now let us expand the original optimization problem (7) using (8) and (3):

    min_{q(Y), θ}  −H(q(Y)) + E_{q(Y)}[h_τ(τ − φ_X(Y))] − Σ_{i=1}^n E_{q(Y)}[log p_θ(Y_i | X_i)] + h_θ(θ).    (9)

4.1.2. Relaxation and Fenchel Duality

One problem is that the contribution of the measurements to the posterior (the second term of (9)) couples all the outputs Y_1, ..., Y_n. To progress towards tractability, we replace E_{q(Y)}[h_τ(τ − φ_X(Y))] in (9) with h_τ(τ − E_{q(Y)}[φ_X(Y)]), which is a lower bound by Jensen's inequality. In doing so, we no longer guarantee a lower bound on the marginal likelihood. However, this relaxation does let us rewrite (9) using Fenchel duality, which allows us to optimize over a vector β ∈ R^k rather than over an entire distribution q(Y). Note that optimizing q(Y) while holding θ fixed is exactly a maximum (cross-)entropy problem subject to approximate moment-matching constraints.

By Fenchel duality, the optimal q(Y) belongs to an exponential family (cf. Dudík et al. (2007); Graça et al. (2008)):

    q(Y) = ∏_{i=1}^n q_{θ,β}(Y_i | X_i),    (10)
    q_{θ,β}(y | x) = exp{⟨σ(x, y), θ⟩ + ⟨φ(x, y), β⟩ − B(θ, β; x)},    (11)

where B(θ, β; x) = log ∫ e^{⟨σ(x,y), θ⟩ + ⟨φ(x,y), β⟩} dy is the associated log-partition function for q_{θ,β}(y | x). See Appendix A for the derivation.
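In code, the augmented family (11) differs from (2) only in that the measurement features, weighted by the dual vector β, are added to the score (a sketch under the same finite-output assumptions as before; sigma and phi return feature vectors as lists):

```python
import math

def B(theta, beta, sigma, phi, x, Y):
    """B(theta, beta; x) = log sum_y exp{<sigma(x,y), theta> + <phi(x,y), beta>}."""
    def score(y):
        return (sum(t * s for t, s in zip(theta, sigma(x, y))) +
                sum(b * f for b, f in zip(beta, phi(x, y))))
    m = max(score(y) for y in Y)
    return m + math.log(sum(math.exp(score(y) - m) for y in Y))

def q(y, x, theta, beta, sigma, phi, Y):
    """Eq. (11): q_{theta,beta}(y | x), the measurement-augmented family."""
    s = (sum(t * v for t, v in zip(theta, sigma(x, y))) +
         sum(b * v for b, v in zip(beta, phi(x, y))))
    return math.exp(s - B(theta, beta, sigma, phi, x, Y))
```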
We can now reformulate (9) as the following saddle point problem:

    min_{θ∈R^d} max_{β∈R^k} L(θ, β),    (12)
    L(θ, β) = ⟨τ, β⟩ − Σ_{i=1}^n B(θ, β; X_i) + Σ_{i=1}^n A(θ; X_i) − h_τ*(β) + h_θ(θ),

where h_τ*(β) = sup_{u∈R^k} {⟨u, β⟩ − h_τ(u)} is the Fenchel conjugate of h_τ. If we assume independent box noise, h_τ*(β) = Σ_j ε_j |β_j|.

Let (θ̃, β̃) be the solution to this saddle point problem. This pair specifies the approximate posterior q(Y, θ) via (10), (11), and (8).

4.1.3. Optimization

We use a gradient-based approach to optimize L(θ, β) in (12). The gradients can be computed using standard moment-generating properties of log-partition functions:

    ∇_β L(θ, β) = τ − Σ_{i=1}^n E_{q_{θ,β}(Y | X_i)}[φ(X_i, Y)] − ∇h_τ*(β),    (13)
    ∇_θ L(θ, β) = Σ_{i=1}^n ( E_{p_θ(Y | X_i)}[σ(X_i, Y)] − E_{q_{θ,β}(Y | X_i)}[σ(X_i, Y)] ) + ∇h_θ(θ).    (14)

Note that at (θ̃, β̃), both sets of moment-matching constraints (approximately) hold: (1) the measurement feature expectations under q_{θ̃,β̃} are close to the observed values τ, indicating that we have represented the measurements faithfully; and (2) the model feature expectations under q_{θ̃,β̃} are close to those of p_θ̃, indicating that we have learned a good model.

Because n is typically large, we use stochastic approximations to the gradient. In particular, we take alternating stochastic gradient steps in θ and β. At the end, we return an average of the parameter values obtained along the way to provide stability.

The alternating quality of our algorithm is similar in spirit to Chang et al. (2007). However, they maintain a list of candidate Y's instead of a distribution q(Y). They also use a penalty for violating constraints (the analog of our β), which must be set manually. We only require specifying the form of the measurement noise, which is more natural; from this, β is learned automatically.

Note that the only computations we need are expected feature vectors, which are standard quantities needed in any case for gradient-based optimization procedures. In contrast, the use of Generalized Expectation Criteria requires computing the covariance between σ and φ, which is more complex and expensive for graphical models (Mann & McCallum, 2008).

4.1.4. Intuitions

For simplicity, assume zero measurement noise and a flat improper prior on θ. Let P = {p_θ(y | x) : θ ∈ R^d} denote our model family. Let Q (different from before) be the set of all conditional distributions which are consistent with our measurements (φ, τ). Our variational approximation can be interpreted as finding a q_{θ,β}(y | x) ∈ Q and a p_θ(y | x) ∈ P such that KL(q || p) is minimized.[5] Intuitively, all external information is fed into Q, a staging area, which ensures we work with coherent distributions. This information is then transferred to our model family P, which allows us to generalize beyond our observations.

[5] In the language of information geometry, optimizing q with p fixed is an I-projection; optimizing p with q fixed is an M-projection.

When the measurement features are the same as the model features (φ = σ), the problem reduces to standard supervised learning by maximum entropy duality. In particular, Q ∩ P contains the unique solution. Another way to obtain supervised learning is to measure φ(x, y) = I[x = a, y = b] for all a ∈ 𝒳, b ∈ 𝒴 (cf. Section 2). Then Q is a single point which typically lies outside P.

Druck et al. (2008) incorporate measurements using Generalized Expectation Criteria, an objective function that penalizes some notion of distance (e.g., KL-divergence) between E_{p_θ(Y | X)}[φ(X, Y)] and the measurement values τ. Even when φ = σ, their objective function does not reduce to supervised learning and thus does not resolve redundant measurements in a coherent way.
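The alternating stochastic updates implied by (13) and (14) can be sketched as follows, reusing p_theta and q from the sketches above. The step size, the 1/n scaling of the τ and conjugate terms in the per-example gradient, and the flat prior (h_θ = 0) are our simplifications, not choices fixed by the paper:

```python
import random

def expect(dist, feat, x, Y):
    """Componentwise E_dist[feat(x, Y)] over a finite output space Y."""
    out = [0.0] * len(feat(x, Y[0]))
    for y in Y:
        w = dist(y, x)
        for j, v in enumerate(feat(x, y)):
            out[j] += w * v
    return out

def sgd_saddle(X, Y, sigma, phi, tau, eps, d, k, steps=1000, lr=0.05):
    """Alternating stochastic gradient steps on L(theta, beta) of eq. (12):
    ascend in beta via (13), descend in theta via (14). For box noise, the
    conjugate h_tau*(beta) = sum_j eps_j |beta_j| has subgradient
    eps_j * sign(beta_j)."""
    n = len(X)
    theta, beta = [0.0] * d, [0.0] * k
    sign = lambda v: (v > 0) - (v < 0)
    for _ in range(steps):
        x = X[random.randrange(n)]
        p_x = lambda y, xx: p_theta(y, xx, theta, sigma, Y)
        q_x = lambda y, xx: q(y, xx, theta, beta, sigma, phi, Y)
        e_phi_q = expect(q_x, phi, x, Y)
        e_sig_q = expect(q_x, sigma, x, Y)
        e_sig_p = expect(p_x, sigma, x, Y)
        for j in range(k):  # eq. (13), scaled to a single example
            beta[j] += lr * (tau[j] / n - e_phi_q[j] - eps[j] * sign(beta[j]) / n)
        for j in range(d):  # eq. (14) with a flat prior on theta
            theta[j] -= lr * (e_sig_p[j] - e_sig_q[j])
    return theta, beta  # (the paper additionally averages iterates for stability)
```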
4.2. Approximate Active Measurement Selection

In traditional design, if p_θ(y | x) is a linear regression model and p(θ) is Gaussian, then U(φ) has a closed-form expression. If a non-conjugate p(θ) is employed, for example, a sparsity prior for compressed sensing, one must resort to approximations such as expectation propagation (Seeger & Nickisch, 2008). In our measurement setting, inference is further complicated by the marginalization over Y, so let us apply our posterior approximations to measurement selection (Section 3.2).

First, consider the expected reward (4). Since we do not have access to the true test distribution p*(x), we use a heldout set of unlabeled examples X̃_1, ..., X̃_m. Define p̂(x) = (1/m) Σ_{i=1}^m δ_{X̃_i}(x) to be the corresponding empirical distribution. Second, we replace the true posterior p(θ | τ, X, φ) with a point estimate (θ̃, β̃), thereby obtaining an approximate utility:

    Û(φ, τ) = E_{p̂(X′)} max_{Ŷ} E_{p_θ̃(Y′ | X′)}[r(Y′, Ŷ)] − C(φ).    (15)

The final step is to marginalize out τ. Suppose we have already made measurements (φ0, τ0). The true posterior p(Y, θ | X, φ0, τ0) is currently approximated by q̂0(Y, θ), represented by (θ̃0, β̃0). Using this approximation leads to q̂0(τ) = E_{q̂0(Y)}[p(τ | X, Y, φ)] as a substitute for p(τ | X, φ, φ0, τ0). Though we can compute Û((φ0, φ), (τ0, τ)) for a fixed τ, we cannot integrate Û over τ analytically. Thus, we use a Monte Carlo approximation: draw t samples from q̂0(τ) by first drawing Y from q̂0(Y) and then sampling τ according to (1). Let q̃0(τ) be the empirical distribution formed from these samples. Now U(φ) from (6) can then be approximated with the following:

    Û(φ) = E_{q̃0(τ)}[Û((φ0, φ), (τ0, τ))].    (16)

The pseudocode for our algorithm is given in Figure 3.

    Algorithm for Active Measurement Selection
    (θ̃0, β̃0) ← 0;  φ0 ← ∅;  τ0 ← ∅
    while more measurements are desired:
      for each candidate measurement feature φ ∈ Φ:
        draw t samples τ from q̂0(τ), which is specified by (θ̃0, β̃0)
        for each sampled measurement value τ:
          (θ̃, β̃) ← ApproxInference((φ0, φ), (τ0, τ))
          u_{φ,τ} ← E_{p̂(X′)} max_{Ŷ} E_{p_θ̃(Y′ | X′)}[r(Y′, Ŷ)] − C(φ)
        u_φ ← (1/t) Σ_τ u_{φ,τ}
      φ* ← argmax_φ u_φ
      obtain measurement value τ* = φ*_X(Y) + W
      φ0 ← (φ0, φ*);  τ0 ← (τ0, τ*)
      (θ̃0, β̃0) ← ApproxInference(φ0, τ0)
    Output θ̃0

Figure 3. Pseudocode for choosing measurements in a sequential manner based on our variational approximation.

This procedure is similar in spirit to the active learning algorithm proposed by Roy and McCallum (2001), where examples were chosen iteratively to minimize the expected loss on heldout data under a Naïve Bayes model.

One potential weakness with our approach is that we do not maintain any uncertainty in θ in the variational approximation. If we were doing parameter estimation, this approximation would be entirely useless, since what drives experimental design in that case is the reduction of uncertainty in θ. However, our utility function is predictive accuracy. Intuitively, what drives our method is the reduction of uncertainty in the predictions based on θ̃. For this, the magnitude of β̃ does provide some guidance.
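Figure 3 translates into the following schematic Python (our rendering; approx_inference, sample_tau, expected_reward, and cost stand for the task-specific subroutines described above, and their signatures are our invention):

```python
def select_measurements(candidates, approx_inference, sample_tau,
                        expected_reward, cost, rounds=5, t=3):
    """Sequential selection as in Figure 3: score each candidate phi by the
    Monte Carlo utility (16), then actually make the best measurement."""
    phi0, tau0 = [], []
    params = approx_inference(phi0, tau0)             # (theta~, beta~)
    for _ in range(rounds):
        def utility(phi):
            us = []
            for _ in range(t):                        # t hypothetical values of tau
                tau = sample_tau(params, phi)         # draw Y ~ q0, then apply eq. (1)
                p = approx_inference(phi0 + [phi], tau0 + [tau])
                us.append(expected_reward(p) - cost(phi))  # eq. (15) on heldout data
            return sum(us) / t                        # eq. (16)
        best = max(candidates, key=utility)
        tau_star = sample_tau(params, best)           # stand-in for a real measurement
        phi0.append(best)
        tau0.append(tau_star)
        params = approx_inference(phi0, tau0)
    return params
```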
5. Experiments

We now present empirical results. In Section 5.1, we show how the measurement framework can effectively integrate both labeled data and labeled predicates. In Section 5.2, we actively choose the measurements.

5.1. Learning from Measurements

For the Craigslist task introduced in Section 1, we use a linear-chain conditional random field (CRF), which is a conditional exponential family where the input x = (x_1, ..., x_ℓ) is a sequence of words, the output y = (y_1, ..., y_ℓ) is a sequence of labels, and the model features are σ(x, y) = Σ_{i=1}^ℓ σ1(y_i, x, i) + Σ_{i=1}^{ℓ−1} σ2(y_i, y_{i+1}). The components of the node features σ1(y_i, x, i) are indicator functions of the form I[y_i = a, s(x_i) = b], where a ranges over the 11 possible labels, and s(·) is either the identity function or a function mapping each word to one of 100 clusters. To create these clusters, we ran the Brown word clustering algorithm.[6]

[6] In order to capture topical similarity, three-word sequences (x_{i−d}, x_i, x_{i+d}) were created for each sequence, position i = 1, ..., ℓ, and offset d = 1, 2, 3. The word clusters obtained from these three-word sequences essentially capture the same type of structure in the data as the SVD features used by Haghighi and Klein (2006) and Mann and McCallum (2008).

We started with n = 1000 unlabeled examples and considered two types of measurements: fully-labeled examples and labeled predicates, where we provide the frequency of the most common label for a word type.[7] The labeled examples were chosen at random, and we chose three "prototypes" for each of the 11 labels based on the 100 available labeled training examples (see Haghighi and Klein (2006) for details). We optimized L(θ, β) for 50 iterations (50n stochastic steps).

[7] The frequency was measured on the 100 available labeled examples and extrapolated to the rest. We assumed zero measurement noise.

Table 1. CCR07 (Chang et al., 2007) and MM08 (Mann & McCallum, 2008) outperform our method when there are few examples, but we achieve the best overall number with 100 examples.

    # labeled examples               10     25     100
    CCR07                            74.7   78.5   81.7
    MM08                             74.6   77.2   80.5
    Ours, no labeled predicates      67.7   75.6   81.5
    Ours, + 33 labeled predicates    71.4   76.5   82.5

Table 1 shows that the performance of our method improves as we add more labeled examples and predicates. To the best of our knowledge, our 82.5% is the best result published so far on this task. More interestingly, compared to past work, we get larger gains as we label more examples, which suggests that our measurement framework is integrating the diverse, increasing information more effectively.

5.2. Active Measurement Selection

5.2.1. Synthetic Dataset

Consider the following multiclass classification problem: the output space is 𝒴 = {1, ..., 4}, and the input space is 𝒳 = ∪_{y∈𝒴} {(x1, x2) : i ∈ {1, ..., 5}, x1 = (y, i), x2 ∈ {y} × {1, ..., 2^{i−1}}}. Inputs are generated uniformly from 𝒳, and each input x is assigned a label y which is extracted from x with probability 0.9 and drawn uniformly from 𝒴 with probability 0.1. We consider two types of measurements: fully-labeled examples and labeled x2-predicates. For simplicity, assume all measurements have the same cost.

We started with n = 100 unlabeled examples and no measurements. Following Figure 3, for each candidate measurement feature, we drew t = 3 samples of τ. Given each hypothetical measurement, we ran approximate inference for 10 iterations, warm-starting from the previous parameter setting (θ̃, β̃). Our utility Û was computed on a heldout set of 500 examples. Then the best measurement feature was added, followed by 10 more iterations of approximate inference. Finally, we evaluated test accuracy on 1000 fresh examples.
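Under our reading of the (extraction-garbled) input-space definition above, the synthetic data could be generated roughly as follows; note the sketch samples (y, i) uniformly rather than sampling exactly uniformly over 𝒳, so treat it as an approximation:

```python
import random

def gen_example():
    """One draw from the synthetic task of Section 5.2.1 (as reconstructed):
    x1 = (y, i) with i in {1,...,5}, x2 in {y} x {1, ..., 2^(i-1)}; the label
    equals the class y encoded in x with prob. 0.9, else uniform over {1,...,4}."""
    y = random.randint(1, 4)
    i = random.randint(1, 5)
    x = ((y, i), (y, random.randint(1, 2 ** (i - 1))))
    label = y if random.random() < 0.9 else random.randint(1, 4)
    return x, label

data = [gen_example() for _ in range(100)]  # n = 100 unlabeled examples
```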
We compared the full algorithm we just described with two alternatives: (1) choosing the next measurement at random, and (2) choosing the example or predicate with the highest entropy.[8] Figure 4 shows the results, averaged over 10 trials.

[8] The entropy of a predicate is the sum of the entropies of the label posteriors at words for which the predicate is nonzero.

[Figure 4: plots of test accuracy versus the number of measurements k, with one panel for labeling examples and one for labeling predicates, are omitted here.]
Figure 4. Comparison of three methods on the synthetic dataset: iteratively choosing the next measurement at random, based on entropy, or by running our full algorithm.

We see that both the entropy-based heuristic and the full algorithm provide substantial gains over random, and moreover, the full algorithm provides a slight edge over entropy. One property that entropy fails to capture is the propagation effect: two measurements might have the same entropy, but they could have different degrees of impact on other examples through re-estimating the model. However, the full algorithm does come with a significant computational cost, so for the experiments in the next section we used entropy.

5.2.2. Part-of-Speech Tagging

Now we turn to part-of-speech tagging.[9] Using standard capitalization, suffix, word form, and word cluster features applied to the previous, current, and next words, we seek to predict the tag of the current word. We considered two types of measurements: (1) tagging a whole sentence and (2) providing the frequency of the most common tag for a word type, where the word type is one of the 100 most frequent.

[9] We used the Wall Street Journal (WSJ) portion of the Penn Treebank: sections 0-21 for training, sections 22-24 for testing.

We started with 1000 unlabeled training examples and labeled 10 examples at random. Then we went through the candidate measurements, evaluating them using entropy,[10] and adding the best five each round, after which a single iteration of approximate inference was run.

[10] The entropy was normalized by the number of occurrences of the word.
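A sketch of the entropy heuristic described in footnotes 8 and 10 (the posterior interface is hypothetical; the trained model would supply the per-position label posteriors):

```python
import math

def entropy(dist):
    """Shannon entropy of a label distribution given as probabilities."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def predicate_score(predicate, positions, posterior, normalize=True):
    """Score a word-type predicate by summing the entropies of the label
    posteriors at positions where it fires, optionally normalized by the
    number of occurrences (footnote 10). `positions` is a list of
    (sentence, index) pairs; `posterior(s, i)` returns a label distribution."""
    fires = [(s, i) for (s, i) in positions if predicate(s, i)]
    if not fires:
        return 0.0
    total = sum(entropy(posterior(s, i)) for (s, i) in fires)
    return total / len(fires) if normalize else total
```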
Figure 5 shows the results: on both types of measurements, entropy outperforms choosing words at random. A simple baseline which chooses the most frequent words underperforms even random, presumably due to lack of diversity.

[Figure 5: plots of test accuracy versus the number of measurements k, with one panel for labeling examples and one for labeling word types, are omitted here.]
Figure 5. On part-of-speech tagging, choosing measurements based on entropy outperforms choosing randomly and choosing based on frequency.

6. Conclusion

Our ultimate goal is "efficient learning": narrowing in on the desired model with as little human effort as possible, whether it be by labeling examples or specifying constraints. Measurement-based learning allows us to integrate all of these in a coherent way. Furthermore, it is the first framework to directly target our ultimate goal by quantifying what it means to learn efficiently.

Acknowledgments

We thank Zoubin Ghahramani for helpful comments. We wish to acknowledge support from MURI Grant N00014-06-1-0734.

References

Borwein, J. M., & Zhu, Q. J. (2005). Techniques of variational analysis. Springer.
Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10, 273-304.
Chang, M., Ratinov, L., & Roth, D. (2007). Guiding semi-supervision with constraint-driven learning. Association for Computational Linguistics (ACL) (pp. 280-287).
Druck, G., Mann, G., & McCallum, A. (2008). Learning from labeled features using generalized expectation criteria. ACM Special Interest Group on Information Retrieval (SIGIR) (pp. 595-602).
Dudík, M., Phillips, S. J., & Schapire, R. E. (2007). Maximum entropy density estimation. Journal of Machine Learning Research, 8, 1217-1260.
Graça, J., Ganchev, K., & Taskar, B. (2008). Expectation maximization and posterior constraints. Advances in Neural Information Processing Systems (NIPS) (pp. 569-576).
Haghighi, A., & Klein, D. (2006). Prototype-driven learning for sequence models. North American Association for Computational Linguistics (NAACL) (pp. 320-327).
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling data. International Conference on Machine Learning (ICML) (pp. 282-289).
Mann, G., & McCallum, A. (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. International Conference on Machine Learning (ICML) (pp. 593-600).
Mann, G., & McCallum, A. (2008). Generalized expectation criteria for semi-supervised learning of conditional random fields. Human Language Technology and Association for Computational Linguistics (HLT/ACL) (pp. 870-878).
Quadrianto, N., Smola, A. J., Caetano, T. S., & Le, Q. V. (2008). Estimating labels from label proportions. International Conference on Machine Learning (ICML) (pp. 776-783).
Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. International Conference on Machine Learning (ICML) (pp. 441-448).
Seeger, M., & Nickisch, H. (2008). Compressed sensing and Bayesian experimental design. International Conference on Machine Learning (ICML) (pp. 912-919).

A. Derivation of (12)

Let f(q) = −H(q(Y)) − E_{q(Y)}[Σ_{i=1}^n ⟨σ(X_i, Y_i), θ⟩], g(u) = h_τ(u − τ), and A(q) = E_{q(Y)}[φ_X(Y)]. Minimization of (9) with respect to q is equivalent to minimization of f(q) + g(Aq) + constant. By strong duality (Theorem 4.4.3 of Borwein and Zhu (2005)),

    inf_q {f(q) + g(Aq)} = sup_β {−f*(A*β) − g*(−β)}.

The conjugate functions[12] are as follows:

    f*(A*β) = log ∫ e^{Σ_{i=1}^n (⟨σ(X_i, y_i), θ⟩ + ⟨φ(X_i, y_i), β⟩)} dy = Σ_{i=1}^n B(θ, β; X_i),
    −g*(−β) = ⟨τ, β⟩ − h_τ*(β).

Performing the algebra yields (12).

[12] The conjugate of g(u) is g*(β) = sup_u {⟨u, β⟩ − g(u)}.
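As a quick numerical sanity check (ours, not part of the paper) of the box-noise conjugate used in (12) and in the derivation above, one can approximate the supremum over the box by a grid and compare it to Σ_j ε_j |β_j|:

```python
import itertools

def conjugate_box(beta, eps, grid=21):
    """Approximate h_tau*(beta) = sup_{|u_j| <= eps_j} <u, beta> on a grid."""
    axes = [[-e + 2 * e * t / (grid - 1) for t in range(grid)] for e in eps]
    return max(sum(u * b for u, b in zip(point, beta))
               for point in itertools.product(*axes))

beta, eps = [0.7, -1.2], [0.5, 2.0]
print(conjugate_box(beta, eps))                    # ~2.75, attained at a corner
print(sum(e * abs(b) for e, b in zip(eps, beta)))  # 0.5*0.7 + 2.0*1.2 = 2.75
```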