An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators

Percy Liang (pliang@cs.berkeley.edu)
Computer Science Division, University of California, Berkeley, CA, USA

Michael I. Jordan (jordan@cs.berkeley.edu)
Computer Science Division and Department of Statistics, University of California, Berkeley, CA, USA

Abstract

Statistical and computational concerns have motivated parameter estimators based on various forms of likelihood, e.g., joint, conditional, and pseudolikelihood. In this paper, we present a unified framework for studying these estimators, which allows us to compare their relative (statistical) efficiencies. Our asymptotic analysis suggests that modeling more of the data tends to reduce variance, but at the cost of being more sensitive to model misspecification. We present experiments validating our analysis.

1. Introduction

Probabilistic models play a prominent role in domains such as natural language processing, bioinformatics, and computer vision, where they provide methods for jointly reasoning about many interdependent variables. For prediction tasks, one generally models a conditional distribution over outputs given an input. There can be reasons, however, for pursuing alternatives to conditional modeling. First, we might be able to leverage additional statistical strength present in the input by using generative methods rather than discriminative ones. Second, the exact inference required for a full conditional likelihood could be intractable; in this case, one might turn to computationally more efficient alternatives such as pseudolikelihood (Besag, 1975).

The generative-discriminative distinction has received much attention in machine learning. The standing intuition is that while discriminative methods achieve lower asymptotic error, generative methods might be better when training data are limited. This intuition is supported by the theoretical comparison of Naive Bayes and logistic regression (Ng & Jordan, 2002) and the recent empirical success of hybrid methods (McCallum et al., 2006; Lasserre et al., 2006).

Computational concerns have also spurred the development of alternatives to the full likelihood; these methods can be seen as optimizing an alternate objective or performing approximate inference during optimization. Examples include pseudolikelihood (Besag, 1975), composite likelihood (Lindsay, 1988), tree-reweighted belief propagation (Wainwright et al., 2003), piecewise training (Sutton & McCallum, 2005), agreement-based learning (Liang et al., 2008), and many others (Varin, 2008). We can think of all these schemes as simply different estimators operating in a single model family.

In this work, we analyze the statistical properties of a class of convex composite likelihood estimators for exponential families, which contains the generative, discriminative, and pseudolikelihood estimators as special cases.

The main focus of our analysis is on prediction error. Standard tools from learning theory based on uniform convergence typically provide only upper bounds on this quantity. Moreover, they generally express estimation error in terms of the overall complexity of the model family.[1] In our case, since all estimators operate in the same model family, these tools are inadequate for comparing different estimators. Instead, we turn to asymptotic analysis, a mainstay of theoretical statistics. There is much relevant statistical work on the estimators that we treat; note in particular that Lindsay (1988) used asymptotic arguments to show that composite likelihoods are generally less efficient than the joint likelihood. The majority of these results are, however, focused on parameter estimation. In the current paper, our focus is on prediction, and we also consider model misspecification.

We draw two main conclusions from our analysis. First, when the model is well-specified, conditioning on fewer variables increases statistical efficiency; this to some extent accounts for the better generalization enjoyed by generative estimators and the worse performance of pseudolikelihood estimators. Second, model misspecification can severely increase both the approximation and estimation errors of generative estimators. We confirm our theoretical results by comparing the three estimators on a toy example, which lets us verify the asymptotics directly, and on a Markov model for part-of-speech tagging.

[1] There are more advanced techniques such as local Rademacher complexities, which focus on the relevant regions of the model family, but these typically apply only to empirical risk minimization.
2. Exponential Family Estimators

In structured prediction tasks, we are interested in learning a mapping from an input space $X$ to an output space $Y$. Probabilistic modeling is a common platform for solving such tasks, allowing for the natural handling of missing data and the incorporation of latent variables. In this paper, we focus on regular exponential families, which define distributions over an outcome space $Z$ as follows:

    $p_\theta(z) = \exp\{\phi(z)^\top \theta - A(\theta)\}$  for $z \in Z$,    (1)

where $\phi(z) \in \mathbb{R}^d$ is a vector of sufficient statistics (features), $\theta \in \mathbb{R}^d$ is a vector of parameters, and $A(\theta) = \log \int e^{\phi(z)^\top \theta} \mu(dz)$ is the log-partition function. In our case, the outcomes are input-output pairs: $z = (x, y)$ and $Z = X \times Y$.

Exponential families include a wide range of popular models used in machine learning. For example, for a conditional random field (CRF) (Lafferty et al., 2001) defined on a graph $G = (V, E)$, we have an output variable for each node ($y = \{y_i\}_{i \in V}$), and the features are

    $\phi(x, y) = \sum_{i \in V} \phi_{\mathrm{node}}(y_i, x, i) + \sum_{(i,j) \in E} \phi_{\mathrm{edge}}(y_i, y_j)$.

From the density $p_\theta(z)$, we can compute event probabilities as follows:

    $p_\theta(z \in s) = \exp\{A(\theta; s) - A(\theta)\}$,    (2)

where $A(\theta; s) = \log \int e^{\phi(z)^\top \theta}\, 1[z \in s]\, \mu(dz)$ is a conditional log-partition function.

2.1. Composite Likelihood Estimators

In this paper, we consider a class of composite likelihood estimators (Lindsay, 1988), which is incidentally equivalent to the multi-conditional learning framework of McCallum et al. (2006). A composite likelihood consists of a weighted sum of component likelihoods, each of which is the probability of one subset of variables conditioned on another. In this work, we only consider the case where the first set is all the variables. We adopt the following more fundamental way of specifying the components: each component $r$ is defined by a partitioning of the outcome space $Z$. We represent a partitioning by an associated equivalence function that maps each $z \in Z$ to its partition:

Definition 1 (Equivalence function). An equivalence function $r$ is a measurable map from $Z$ to measurable subsets of $Z$ such that for each $z \in Z$ and $z' \in r(z)$, $r(z') = r(z)$.

The component likelihood associated with $r$ takes the following form:

    $p_\theta(z \mid z \in r(z)) = \exp\{\phi(z)^\top \theta - A(\theta; r(z))\}$.    (3)

By maximizing this quantity, we are intuitively taking probability mass away from some neighborhood $r(z)$ of $z$ and putting it on $z$.

Without loss of generality, assume the component weights sum to 1, so we can think of taking an expectation over a random component $R$ drawn from some fixed distribution $P_r$. We then define the criterion function:

    $m_\theta(z) = \mathbb{E}_{R \sim P_r} \log p_\theta(z \mid z \in R(z))$.    (4)

Given data points $Z^{(1)}, \dots, Z^{(n)}$ drawn i.i.d. from some true distribution $p^*$ (not necessarily in the exponential family), the maximum composite likelihood estimator is defined by averaging the criterion function over these data points:

    $\hat\theta = \operatorname{argmax}_\theta\ \hat{\mathbb{E}}\, m_\theta(Z)$,    (5)

where $\hat{\mathbb{E}}\, m_\theta(Z) = \frac{1}{n} \sum_{i=1}^{n} m_\theta(Z^{(i)})$.

We can now place the three estimators of interest in our framework:

Generative: We have one component $r_g(x, y) = X \times Y$, which has a single partition, the whole outcome space.

Fully discriminative: We have one component $r_d(x, y) = \{x\} \times Y$. The outcomes in each partition have the same value of $x$ but different $y$.

Pseudolikelihood discriminative: Assume $y = \{y_i\}_{i \in V}$. For each $i \in V$, we have a component $r_i(x, y) = \{(x', y') : x' = x,\ y'_j = y_j \text{ for } j \neq i\}$. $P_r$ is uniform over these components.
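The construction above is compact enough to spell out in code. The following Python sketch (ours, not from the paper; the three-feature toy outcome space and the parameter values are arbitrary choices for illustration) enumerates a small $Z = X \times Y$ and evaluates the component likelihood (3) and the criterion (4) for the generative, fully discriminative, and pseudolikelihood partitionings.

```python
import itertools
import numpy as np

# Toy outcome space Z = X x Y with x in {0, 1} and y = (y1, y2) in {0, 1}^2.
Z = [(x, y1, y2) for x, y1, y2 in itertools.product([0, 1], repeat=3)]

def phi(z):
    """Sufficient statistics (features) of an outcome z = (x, y1, y2)."""
    x, y1, y2 = z
    return np.array([float(y1 == y2), float(x == y1), float(x == y2)])

def component_loglik(theta, z, subset):
    """log p_theta(z | z in subset), eq. (3): normalize exp(phi^T theta) over the subset."""
    scores = {zp: phi(zp) @ theta for zp in subset}
    log_norm = np.log(sum(np.exp(s) for s in scores.values()))  # A(theta; subset)
    return scores[z] - log_norm

# Equivalence functions r(z) for the three estimators.
r_generative = lambda z: Z                                      # one block: the whole outcome space
r_discriminative = lambda z: [zp for zp in Z if zp[0] == z[0]]  # same x, all y

def r_pseudo(i):
    """Pseudolikelihood component i: same x, all y-coordinates fixed except y_i."""
    def r(z):
        return [zp for zp in Z
                if zp[0] == z[0] and all(zp[1 + j] == z[1 + j] for j in (0, 1) if j != i)]
    return r

def criterion(theta, z, components):
    """m_theta(z), eq. (4): uniform average of the component log-likelihoods."""
    return np.mean([component_loglik(theta, z, r(z)) for r in components])

theta = np.array([1.0, 0.5, 0.5])
z = (1, 1, 0)
print("generative       :", criterion(theta, z, [r_generative]))
print("fully discrim.   :", criterion(theta, z, [r_discriminative]))
print("pseudolikelihood :", criterion(theta, z, [r_pseudo(0), r_pseudo(1)]))
```

Averaging this criterion over a sample and maximizing over $\theta$ gives the estimator (5) for each choice of components.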
2.2. Prediction and Evaluation

Given a parameter estimate $\hat\theta$, we make predictions based on $p_{\hat\theta}(y \mid x)$. In this paper, we evaluate our model according to the log-loss; the risk is the expected log-loss:

    $R(\theta) = \mathbb{E}_{(X,Y) \sim p^*}[-\log p_\theta(Y \mid X)]$.    (6)

The quality of an estimator is determined by the gap between the risk of the estimate, $R(\hat\theta)$, and the Bayes risk $R^* = H(Y \mid X)$. It will be useful to relate these two via the risk of $\theta^* = \operatorname{argmax}_\theta \mathbb{E}_{Z \sim p^*} m_\theta(Z)$, which leads to the following standard decomposition:

    $R(\hat\theta) - R^* = \underbrace{(R(\hat\theta) - R(\theta^*))}_{\text{estimation error}} + \underbrace{(R(\theta^*) - R^*)}_{\text{approximation error}}$.    (7)

The left-hand side is the total error. The estimation error is due to having only finite data; the approximation error is due to the intrinsic suboptimality of the estimator.[2]

[2] Note that $\theta^*$ is not necessarily the minimum risk parameter vector in the model family.

3. Asymptotic Analysis

We first compute the asymptotic estimation errors of composite likelihood estimators in general (Sections 3.1 and 3.2). Then we use these results to compare the estimators of interest (Sections 3.3 and 3.4).

In this paper, we assume that our exponential family is identifiable.[3] Also assume that our estimators converge ($\hat\theta \xrightarrow{P} \theta^*$) and are consistent when the model is well-specified (if $p^* = p_{\theta^\dagger}$ for some $\theta^\dagger$, then $\theta^* = \theta^\dagger$). Note, however, that in general we do not assume that our model is well-specified.

[3] In the non-identifiable case, the analysis becomes more cluttered, but the results are essentially the same, since predictions depend only on the distributions induced by the parameters. See the longer version of this paper for an in-depth discussion.

Our asymptotic analysis is driven by Taylor expansions, so we need to compute a few derivatives. The derivatives of the log-partition function are moments of the sufficient statistics (a standard result; see Wainwright and Jordan (2003)):

    $\nabla A(\theta; s) = \mathbb{E}_{Z \sim p_\theta(\cdot \mid \cdot \in s)}[\phi(Z)]$    (8)
    $\nabla^2 A(\theta; s) = \mathrm{var}_{Z \sim p_\theta(\cdot \mid \cdot \in s)}[\phi(Z)]$.    (9)

From these moments, we can obtain the derivatives of $m$ and $R$; to simplify notation, we express them in terms of random variables whose distributions are defined in Table 1.

Table 1. Notation used in the paper.
    $\theta$ : model parameters
    $p^*$ : true distribution of the data
    $m_\theta$ : criterion function (defines the estimator)
    $R(\theta)$ : risk (expected log-loss)
    $\hat\theta = \operatorname{argmax}_\theta \hat{\mathbb{E}}\, m_\theta(Z)$ : empirical parameter estimate
    $\theta^* = \operatorname{argmax}_\theta \mathbb{E}\, m_\theta(Z)$ : limiting parameter vector
    Random variables for the asymptotic analysis:
    $R \sim P_r$ : choose a composite likelihood component
    $r_d(x, y) = \{x\} \times Y$ : fully discriminative component
    $\phi = \phi(Z),\ Z \sim p^*$ : sample from the true distribution
    $\phi^t = \phi(Z^t),\ Z^t \sim p^*(\cdot \mid \cdot \in R(Z))$ : from the true distribution
    $\phi^m = \phi(Z^m),\ Z^m \sim p_{\theta^*}(\cdot \mid \cdot \in R(Z))$ : for estimation
    $\phi^e = \phi(Z^e),\ Z^e \sim p_{\theta^*}(\cdot \mid \cdot \in r_d(Z))$ : for prediction

The derivatives are then

    $\nabla m_{\theta^*}(Z) = \phi - \mathbb{E}(\phi^m \mid Z)$    (10)
    $\nabla^2 m_{\theta^*}(Z) = -\mathbb{E}[\mathrm{var}(\phi^m \mid R(Z)) \mid Z]$    (11)
    $\nabla R(\theta^*) = \mathbb{E}(\phi^e - \phi)$    (12)
    $\nabla^2 R(\theta^*) = \mathbb{E}\,\mathrm{var}(\phi^e \mid Z)$.    (13)
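Identities (8) and (9) are easy to sanity-check numerically. The sketch below (our own illustration on an arbitrary toy outcome space, not part of the paper) compares a central-difference gradient of the conditional log-partition function $A(\theta; s)$ with the conditional mean of the features; a similar check with second differences recovers (9).

```python
import itertools
import numpy as np

Z = [(x, y1, y2) for x, y1, y2 in itertools.product([0, 1], repeat=3)]

def phi(z):
    x, y1, y2 = z
    return np.array([float(y1 == y2), float(x == y1), float(x == y2)])

def log_partition(theta, subset):
    """A(theta; s) = log sum_{z in s} exp(phi(z)^T theta)."""
    return np.log(sum(np.exp(phi(z) @ theta) for z in subset))

def conditional_moments(theta, subset):
    """Mean and covariance of phi under p_theta(. | . in subset)."""
    w = np.array([np.exp(phi(z) @ theta) for z in subset])
    w /= w.sum()
    F = np.array([phi(z) for z in subset])
    mean = w @ F
    cov = (F - mean).T @ ((F - mean) * w[:, None])
    return mean, cov

theta = np.array([0.3, -0.7, 1.1])
s = [z for z in Z if z[0] == 1]        # condition on x = 1 (the fully discriminative case)
mean, cov = conditional_moments(theta, s)

eps = 1e-5
grad_fd = np.array([(log_partition(theta + eps * e, s) -
                     log_partition(theta - eps * e, s)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad_fd, mean, atol=1e-6))   # eq. (8): grad A(theta; s) = E[phi(Z) | Z in s]
```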
3.1. Asymptotics of the Parameters

We first analyze how fast $\hat\theta$ converges to $\theta^*$ by computing the asymptotic distribution of $\hat\theta - \theta^*$. In Section 3.2 we use this result to get the asymptotic distribution of the estimation error $R(\hat\theta) - R(\theta^*)$.

The following standard lemma will prove to be very useful in our analysis:

Lemma 1. For random vectors $X$, $Y$, $Z$, we have

    $\mathrm{var}(X \mid Z) = \mathbb{E}[\mathrm{var}(X \mid Y, Z) \mid Z] + \mathrm{var}[\mathbb{E}(X \mid Y, Z) \mid Z]$.

The important implication of this lemma is that conditioning on another variable $Y$ reduces the variance of $X$. This already hints at how conditioning on more variables can lead to poorer estimators: conditioning reduces the variance of the data, which can make it harder to learn about the parameters.

The following theorem gives us the asymptotic variance of a general composite likelihood estimator:

Theorem 1 (Asymptotic distribution of the parameters). Assume $\hat\theta \xrightarrow{P} \theta^*$. Then

    $\sqrt{n}\,(\hat\theta - \theta^*) \rightarrow N(0, \Sigma)$.    (14)

The asymptotic variance is

    $\Sigma = \Gamma^{-1} + \Gamma^{-1}(C_c + C_m)\Gamma^{-1}$,    (15)

where $\Gamma = \mathbb{E}\,\mathrm{var}(\phi^m \mid R(Z))$ is the sensitivity, $C_c = \mathbb{E}\,\mathrm{var}[\mathbb{E}(\phi^m \mid R(Z)) \mid Z]$ is the component correction, and $C_m = \mathbb{E}[\mathrm{var}(\phi^t \mid Z) - \mathrm{var}(\phi^m \mid Z)] + \mathbb{E}\big[(\mathbb{E}(\phi^t \mid Z) - \mathbb{E}(\phi^m \mid Z))(\mathbb{E}(\phi^t \mid Z) - \mathbb{E}(\phi^m \mid Z))^\top\big]$ is the misspecification correction.

Proof. The standard asymptotic normality result for M-estimators (Theorem 5.21 of van der Vaart (1998)), which includes composite likelihood estimators, gives us the asymptotic variance:

    $\Sigma = (\mathbb{E}\nabla^2 m)^{-1}\,(\mathbb{E}\,\nabla m\, \nabla m^\top)\,(\mathbb{E}\nabla^2 m)^{-1}$.    (16)

The remainder of the proof simply re-expresses $\Sigma$ in terms of more interpretable quantities. Algebraic manipulation of (10) yields

    $\mathbb{E}\,\nabla m\, \nabla m^\top = \mathbb{E}\big[(\phi - \mathbb{E}(\phi^t \mid Z)) + (\mathbb{E}(\phi^t \mid Z) - \mathbb{E}(\phi^m \mid Z))\big]^{\otimes 2}$,

where $v^{\otimes 2}$ denotes $v v^\top$. The cross terms cancel conditioned on $Z$, and $\mathbb{E}[\phi - \mathbb{E}(\phi^t \mid Z)]^{\otimes 2} = \mathbb{E}[\phi^t - \mathbb{E}(\phi^t \mid Z)]^{\otimes 2}$, so

    $\mathbb{E}\,\nabla m\, \nabla m^\top = C_m + \mathbb{E}\,\mathrm{var}(\phi^m \mid Z)$.    (17)

We then apply Lemma 1 to decompose the second term of the right-hand side:

    $\mathbb{E}\,\mathrm{var}(\phi^m \mid Z) = \mathbb{E}\,\mathrm{var}(\phi^m \mid R(Z)) + \mathbb{E}\,\mathrm{var}[\mathbb{E}(\phi^m \mid R(Z)) \mid Z]$.    (18)

Substitute (18) into (17) to get an expression for $\mathbb{E}\,\nabla m\, \nabla m^\top$; (11) already provides one for $\mathbb{E}\nabla^2 m$. Substitute these two expressions into (16) and simplify to get (15).

The decomposition in (15) allows us to make several qualitative judgments. First, the sensitivity $\Gamma = \mathbb{E}\,\mathrm{var}(\phi^m \mid R(Z))$ is the expected amount of variation in the features given $Z$ and $R$ (equivalently, given $R(Z)$). The larger the sensitivity, the more the data can tell us about the parameters, and thus the lower the asymptotic variance will be.

The component correction $C_c$ intuitively measures how different the feature expectations $\mathbb{E}(\phi^m \mid R(Z))$ are across the various components. $C_c$ is zero for the generative and fully discriminative estimators, but the pseudolikelihood discriminative estimator pays a penalty for having more than one component.

The misspecification correction $C_m$ is zero when the model is well-specified (in this case, $\phi^m \mid Z$ and $\phi^t \mid Z$ have the same distribution), but is in general nonzero under model misspecification. In this latter case, one incurs a nonzero approximation error (defined in (7)) as expected, but we see that there is also a nonzero effect on the estimation error.
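The role of the sensitivity $\Gamma$, and the implication of Lemma 1 that conditioning on more of the data shrinks it, can be seen numerically. In the sketch below (our own toy model, assumed well-specified so that $\phi^m$ and $\phi^t$ coincide), $\Gamma = \mathbb{E}\,\mathrm{var}(\phi \mid R(Z))$ is computed for the generative, fully discriminative, and pseudolikelihood partitionings; the traces are non-increasing, matching the intuition that finer partitions leave less feature variation from which to learn.

```python
import itertools
import numpy as np

Z = [(x, y1, y2) for x, y1, y2 in itertools.product([0, 1], repeat=3)]

def phi(z):
    x, y1, y2 = z
    return np.array([float(y1 == y2), float(x == y1), float(x == y2)])

theta_star = np.array([1.0, 0.5, 0.5])
p = np.array([np.exp(phi(z) @ theta_star) for z in Z])
p /= p.sum()                                     # true distribution = model at theta*, i.e. well-specified
F = np.array([phi(z) for z in Z])

def cond_var(idx):
    """var(phi | Z in block) under p_theta*, for a block given by a list of indices."""
    w = p[idx] / p[idx].sum()
    m = w @ F[idx]
    return (F[idx] - m).T @ ((F[idx] - m) * w[:, None])

def sensitivity(block_key):
    """Gamma = E var(phi | R(Z)) for a single component whose blocks share block_key(z)."""
    blocks = {}
    for i, z in enumerate(Z):
        blocks.setdefault(block_key(z), []).append(i)
    return sum(p[idx].sum() * cond_var(idx) for idx in blocks.values())

Gamma_g = sensitivity(lambda z: 0)               # generative: one block, the whole space
Gamma_d = sensitivity(lambda z: z[0])            # fully discriminative: blocks share x
Gamma_p = 0.5 * (sensitivity(lambda z: (z[0], z[2]))     # component freeing y1 (x, y2 fixed)
                 + sensitivity(lambda z: (z[0], z[1])))  # component freeing y2 (x, y1 fixed)

# Traces are non-increasing: coarser partitions leave more feature variation to learn from.
print(np.trace(Gamma_g), np.trace(Gamma_d), np.trace(Gamma_p))
```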
3.2. Asymptotics of the Risk

The following theorem turns Theorem 1 from a statement about the asymptotic distribution of the parameters into one about the risk:

Theorem 2 (Asymptotic distribution of the risk). Let $\Sigma$ be the asymptotic variance as defined in (15). Denote $\nabla R = \nabla R(\theta^*)$ and $\nabla^2 R = \nabla^2 R(\theta^*)$. Then

    $\sqrt{n}\,(R(\hat\theta) - R(\theta^*)) \xrightarrow{d} N(0,\ \nabla R^\top \Sigma\, \nabla R)$.    (19)

Furthermore, if $\nabla R = 0$, then

    $n\,(R(\hat\theta) - R(\theta^*)) \xrightarrow{d} \tfrac{1}{2}\,\mathrm{tr}\, W\big(\nabla^2 R^{\frac12}\, \Sigma\, \nabla^2 R^{\frac12},\ 1\big)$,    (20)

where $W(V, n)$ denotes the Wishart distribution with scale matrix $V$ and $n$ degrees of freedom.

Proof. Perform a Taylor expansion of the risk function around $\theta^*$:

    $R(\hat\theta) = R(\theta^*) + \nabla R^\top (\hat\theta - \theta^*) + \tfrac{1}{2}(\hat\theta - \theta^*)^\top \nabla^2 R\, (\hat\theta - \theta^*) + o(\|\hat\theta - \theta^*\|^2)$.    (21)

We use a standard argument known as the delta method (van der Vaart, 1998). Multiplying (21) on both sides by $\sqrt{n}$, rearranging terms, and applying Slutsky's theorem, we get (19). However, when $\nabla R = 0$, the first-order term of the expansion (21) is zero, so we must consider the second-order term to get a non-degenerate distribution. Note that $\nabla^2 R$ is positive semidefinite. Multiplying (21) by $n$ and rearranging yields

    $n\,(R(\hat\theta) - R(\theta^*)) = \tfrac{1}{2}\,\mathrm{tr}\big([\nabla^2 R^{\frac12} \sqrt{n}(\hat\theta - \theta^*)][\nabla^2 R^{\frac12} \sqrt{n}(\hat\theta - \theta^*)]^\top\big) + o_P(1)$.

Since $\nabla^2 R^{\frac12} \sqrt{n}(\hat\theta - \theta^*) \xrightarrow{d} N(0,\ \nabla^2 R^{\frac12} \Sigma\, \nabla^2 R^{\frac12})$, applying the continuous mapping theorem with the outer product function yields a Wishart as the limiting distribution. Thus, $n(R(\hat\theta) - R(\theta^*))$ is asymptotically equal in distribution to $\tfrac{1}{2}$ times the trace of a sample from that Wishart distribution.

We can also understand (20) in the following way. Let $V = \nabla^2 R^{\frac12} \Sigma\, \nabla^2 R^{\frac12}$. Then $\tfrac{1}{2}\,\mathrm{tr}\, W(V, 1)$ is the distribution of a weighted sum of independent $\chi^2_1$ variables, where the weights are half the eigenvalues of $V$. The mean of this distribution is $\tfrac{1}{2}\,\mathrm{tr}(V)$ and the variance is $\tfrac{1}{2}\,\mathrm{tr}(V^2)$.

An important question is when we obtain the ordinary $O(n^{-\frac12})$ convergence of (19) versus the much better $O(n^{-1})$ convergence of (20). A sufficient condition for $O(n^{-1})$ convergence is $\nabla R(\theta^*) = 0$. When the model is well-specified, this is true for any consistent estimator. Even if the model is misspecified, the fully discriminative estimator still achieves the $O(n^{-1})$ rate. The reason is that whenever a training criterion $m_\theta$ is the same (up to constants) as the test criterion $R(\cdot)$, $\nabla R$ vanishes and we obtain the $O(n^{-1})$ rate. This is in concordance with a related observation made by Wainwright (2006) that it is better to use the same inference procedure at both training and test time.

When the model is well-specified, there is another appealing property that holds if the training and test criteria are the same up to constants: the asymptotic distribution of the risk depends on only the dimensionality of the exponential family, not the actual structure of the model. In particular, for composite likelihood estimators with one component, $\Sigma = \Gamma^{-1} = (-\mathbb{E}\nabla^2 m)^{-1} = (\nabla^2 R)^{-1}$. Therefore $\nabla^2 R^{\frac12} \Sigma\, \nabla^2 R^{\frac12} = I_d$, and so $n(R(\hat\theta) - R(\theta^*)) \xrightarrow{d} \tfrac{1}{2}\,\mathrm{tr}\, W(I_d, 1) \stackrel{d}{=} \tfrac{1}{2}\chi^2_d$, where $d$ is the number of parameters. This result is essentially another way of looking at the fact that the likelihood ratio test statistic is asymptotically $\chi^2$ distributed.
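The $\tfrac{1}{2}\chi^2_d$ limit is easy to check by simulation. The sketch below (our own minimal example with $d = 1$, not an experiment from the paper) fits a Bernoulli model by maximum likelihood and verifies that $n(R(\hat\theta) - R(\theta^*))$ has mean close to $d/2 = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
p_star, n, trials = 0.3, 500, 20000

def risk(q):
    """Expected log-loss of Bernoulli(q) under the true Bernoulli(p_star)."""
    return -(p_star * np.log(q) + (1 - p_star) * np.log(1 - q))

excess = []
for _ in range(trials):
    y = rng.random(n) < p_star
    q_hat = np.clip(y.mean(), 1e-12, 1 - 1e-12)   # maximum likelihood estimate of the success probability
    excess.append(n * (risk(q_hat) - risk(p_star)))

print(np.mean(excess))   # close to d/2 = 0.5, the mean of (1/2) * chi^2_1
```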
3.3. Comparing Estimation Errors

In the previous section, we analyzed the asymptotics of a single estimator. Now, given two estimators, we would like to be able to tell which one is better. In order to compare two estimators, it is convenient if they converge to the same limit. In this section, we ensure this by assuming that the model is well-specified and that our estimators are consistent. Since all parameter estimates are used in the same way for prediction, it suffices to analyze the relative efficiencies of the parameter estimates. The following theorem says that coarser partitionings of $Z$ generally lead to more efficient estimators:

Theorem 3 (Asymptotic relative efficiency). Let $\hat\theta_1$ and $\hat\theta_2$ be two consistent estimators with asymptotic variances $\Sigma_1$ and $\Sigma_2$ as defined in (15). Assume that $R_1$ is constant ($\hat\theta_1$ has exactly one component) and $R_1(z) \supseteq R_2(z)$ for all $z \in Z$. If the model is well-specified, then $\Sigma_1 \preceq \Sigma_2$ ($\hat\theta_1$ is no worse than $\hat\theta_2$).

Proof. We first show that $\Gamma_1^{-1} \preceq \Gamma_2^{-1}$, where $\Gamma_1$ and $\Gamma_2$ are the sensitivities of the two estimators. Because the model is well-specified, $\Gamma_k = \mathbb{E}\,\mathrm{var}(\phi^t \mid R_k(Z))$ for $k = 1, 2$. The assumption $R_1(Z) \supseteq R_2(Z)$ means that $R_2(Z)$ provides more information about $Z$ than $R_1(Z)$; formally, the $\sigma$-fields satisfy $\sigma(R_1(Z)) \subseteq \sigma(R_2(Z))$. Thus, we can use Lemma 1 to decompose the variance:

    $\Gamma_1 = \mathbb{E}\,\mathrm{var}(\phi^t \mid R_2(Z)) + \mathbb{E}\,\mathrm{var}[\mathbb{E}(\phi^t \mid R_2(Z)) \mid R_1(Z)]$.

The first term of the right-hand side is exactly $\Gamma_2$ and the second term is positive semidefinite, so $\Gamma_1 \succeq \Gamma_2$, which implies $\Gamma_1^{-1} \preceq \Gamma_2^{-1}$. Let $C_{c1}$ and $C_{c2}$ be the component corrections of the two estimators. Note that $C_{c1} = 0$ because $R_1$ is constant, so $C_{c1} \preceq C_{c2}$. The misspecification corrections are both zero. Putting these results together yields the theorem.

One might wonder whether we really need $R_1$ to be constant. Is it not enough to assume only that $R_1(z) \supseteq R_2(z)$ (for some coupling of $R_1$ and $R_2$)? The answer is no, as the following counterexample shows.

Counterexample. Let $Z = \{1, 2, 3\}$. The general shape of the distribution is given by the single feature $\phi(1) = 1$, $\phi(2) = 3$, $\phi(3) = 2$, and a scalar parameter $\theta$ controls the peakiness of the distribution. Let the true parameter be $\theta^* = 1$. Consider two estimators: $\hat\theta_1$ has two components, $r_{1a} = \{\{1, 2\}, \{3\}\}$ and $r_{1b} = \{\{1\}, \{2, 3\}\}$; $\hat\theta_2$ also has two components, $r_{2a} = \{\{1, 2\}, \{3\}\}$ and $r_{2b} = \{\{1\}, \{2\}, \{3\}\}$. Coupling $r_{1a}$ with $r_{2a}$ and $r_{1b}$ with $r_{2b}$, we have $R_1(z) \supseteq R_2(z)$. However, we computed $\Sigma$ and found that $\Sigma_1 \approx 4.19$ and $\Sigma_2 \approx 3.15$, so $\hat\theta_2$ actually has lower asymptotic variance although it has finer partitionings. To explain this, note that the contribution of $r_{2b}$ to the criterion function is zero, so the second estimator is equivalent to using the single component $r_{2a}$ ($= r_{1a}$); the first estimator therefore suffers by using the additional component $r_{1b}$. In general, while we would still expect coarser partitionings to be better even for estimators with many components, this counterexample shows that we must exercise caution.

3.4. Comparing Estimators

Finally, we use Theorem 3 to compare the estimation and approximation errors of the generative ($\hat\theta_g$), fully discriminative ($\hat\theta_d$), and pseudolikelihood discriminative ($\hat\theta_p$) estimators. The subscripts $g$, $d$, $p$ will be attached to other variables to refer to the quantities associated with the corresponding estimators. In the following corollaries, we use the word "lower" loosely to mean "no more than," although in general we expect the inequality to be strict.

Corollary 1 (Generative versus fully discriminative). (1) If the model is well-specified, $\hat\theta_g$ has lower asymptotic estimation error than $\hat\theta_d$; both have zero approximation error. (2) If the model is misspecified, $\hat\theta_d$ has lower approximation and asymptotic estimation errors than $\hat\theta_g$.

Proof. For (1), since $R_d(z) \subseteq R_g(z)$, we have $\Sigma_g \preceq \Sigma_d$ by Theorem 3. Zero approximation error follows from consistency. For (2), since the discriminative estimator achieves the minimum risk in the model family, it has the lowest approximation error. Also, by Theorem 2 and the ensuing discussion, it always converges at an $O(n^{-1})$ rate, whereas the generative estimator will in general converge at an $O(n^{-\frac12})$ rate.

Note that there is a qualitative change of asymptotics in going from the well-specified to the misspecified scenario. This discontinuity demonstrates one weakness of asymptotic analyses: we would expect that for a very minor model misspecification, the generative estimator would still dominate the discriminative estimator for moderate sample sizes, but even a small misspecification is magnified in the asymptotic limit.

In the following toy example where the model is well-specified, we see concretely that the generative estimator has smaller asymptotic estimation error.

Example. Consider a model where $x$ and $y$ are binary variables, with features $\phi(x, y) = (1[x = 0, y = 1],\ 1[x = 1, y = 1])$ and true parameters $\theta^* = (0, 0)$. We can compute

    $\Gamma_g = \mathrm{var}(\phi) = \frac{1}{16}\begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix}$  and  $\nabla^2 R(\theta^*) = \Gamma_d = \mathbb{E}\,\mathrm{var}(\phi \mid X) = \frac{1}{16}\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$.

The mean asymptotic estimation error (scaled by $n$) of the generative estimator is $\tfrac{1}{2}\mathrm{tr}(\nabla^2 R\, \Gamma_g^{-1}) = \tfrac{3}{4}$, while that of the discriminative estimator is $\tfrac{1}{2}\mathrm{tr}(\nabla^2 R\, \Gamma_d^{-1}) = 1$.
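The quantities in this example can be verified with a few lines of code. The sketch below (our own check, not from the paper) enumerates the four outcomes at $\theta^* = (0, 0)$, computes $\Gamma_g = \mathrm{var}(\phi)$ and $\nabla^2 R(\theta^*) = \Gamma_d = \mathbb{E}\,\mathrm{var}(\phi \mid X)$, and evaluates the two scaled mean estimation errors $\tfrac{1}{2}\mathrm{tr}(\nabla^2 R\, \Gamma^{-1})$.

```python
import itertools
import numpy as np

outcomes = list(itertools.product([0, 1], [0, 1]))                    # (x, y)
phi = lambda x, y: np.array([float(x == 0 and y == 1), float(x == 1 and y == 1)])
F = np.array([phi(x, y) for x, y in outcomes])
p = np.full(4, 0.25)                                                  # at theta* = (0, 0) the model is uniform

mean = p @ F
Gamma_g = (F - mean).T @ ((F - mean) * p[:, None])                    # var(phi)

Gamma_d = np.zeros((2, 2))                                            # E var(phi | X)
for x_val in (0, 1):
    idx = [i for i, (x, _) in enumerate(outcomes) if x == x_val]
    w = p[idx] / p[idx].sum()
    m = w @ F[idx]
    Gamma_d += p[idx].sum() * (F[idx] - m).T @ ((F[idx] - m) * w[:, None])

print(Gamma_g * 16)                                 # [[ 3, -1], [-1, 3]]
print(Gamma_d * 16)                                 # [[ 2,  0], [ 0, 2]]
H = Gamma_d                                         # Hessian of the risk at theta*, eq. (13)
print(0.5 * np.trace(H @ np.linalg.inv(Gamma_g)))   # 0.75  (generative)
print(0.5 * np.trace(H @ np.linalg.inv(Gamma_d)))   # 1.0   (fully discriminative)
```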
We now show that fully discriminative estimators are statistically superior to pseudolikelihood discriminative estimators in all regimes, although pseudolikelihood is of course computationally more efficient.

Corollary 2 (Fully discriminative versus pseudolikelihood discriminative). (1) If the model is well-specified, $\hat\theta_d$ has lower asymptotic estimation error than $\hat\theta_p$; both have zero approximation error. (2) If the model is misspecified, $\hat\theta_d$ has lower approximation and asymptotic estimation errors than $\hat\theta_p$.

Proof. For (1), since $R_p(z) \subseteq R_d(z)$, we have $\Sigma_d \preceq \Sigma_p$ by Theorem 3. Zero approximation error follows from consistency. For (2), the same arguments as in the corresponding part of the proof of Corollary 1 apply.

4. Experiments

In this section, we validate our theoretical analysis empirically. First, we evaluate the three estimators on a simple graphical model, which allows us to plot the actual asymptotics of the estimation error (Section 4.1). Then we show that in the non-asymptotic regime, the qualitative predictions of the asymptotic analyses remain valid (Section 4.2).

4.1. A Simple Graphical Model

Consider a four-node binary-valued graphical model where $z = (x_1, x_2, y_1, y_2)$. The true distribution is a Markov random field parametrized by $(\lambda, \beta, \alpha)$, with log-potentials

    $\phi(z)^\top \theta = \lambda\, 1[y_1 = y_2] + \beta\,(1[x_1 = y_1] + 1[x_2 = y_2]) + \alpha\,(1[x_1 = y_2] + 1[x_2 = y_1])$.

To emulate misspecification, we set $\alpha$ to be nonzero when generating the data and force $\alpha = 0$ during parameter estimation.

In the first experiment, we estimated the variance (over 10K trials) of the estimation error as we increased the number of data points. We set $\lambda = \beta = 1$ for the true model. When $\alpha = 0$ (the model is well-specified), Figures 1(a)-(c) show that scaling the variance by $n^2$ yields a constant; this implies that all three estimators achieve $O(n^{-1})$ convergence. When the model is misspecified with $\alpha = 0.5$ (Figures 1(d)-(f)), there is a sharp difference between the rates of the generative and discriminative estimators. The fully discriminative estimator still enjoys the $O(n^{-1})$ convergence; scaling by $n$ reveals that the generative and pseudolikelihood discriminative estimators attain only an $O(n^{-\frac12})$ rate, as predicted by Theorem 2 (Figure 1(f)). Note that the generative estimator is affected most severely.

Figures 1(g)-(h) demonstrate the non-asymptotic impact of varying the parameters of the graphical model in terms of the total error. In (g), as we increase the amount of misspecification $\alpha$, the error increases for all estimators, but most sharply for the generative estimator. In (h), as we increase the strength of the edge potential $\lambda$, the pseudolikelihood discriminative estimator suffers, the fully discriminative estimator is unaffected, and the generative estimator actually improves.

[Figure 1 omitted. Caption: Asymptotics of the simple four-node graphical model, comparing the generative, fully discriminative, and pseudolikelihood discriminative estimators. In (a)-(c), $\lambda = \beta = 1$ and $\alpha = 0$; we plot the variance of the estimation error against $n$, scaled by 1, $n$, and $n^2$. In (d)-(f), we repeat with $\alpha = 0.5$. In (g), we take $n = 20000$ examples, $\lambda = \beta = 1$, and vary $\alpha$. In (h), we take $n = 20000$, $\beta = 1$, $\alpha = 0$, and vary $\lambda$.]
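For readers who want to reproduce the flavor of this experiment, the sketch below is a minimal re-implementation of the setup under our own assumptions: the parameter names $(\lambda, \beta, \alpha)$ follow the reconstruction above, and the sample size, plain gradient ascent, step size, and iteration count are our choices rather than the paper's. It draws data from the misspecified truth ($\alpha = 0.5$), forces $\alpha = 0$ in the fitted model, and maximizes the generative and fully discriminative criteria, illustrating that the two estimators converge to different limits under misspecification.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
Z = np.array(list(itertools.product([0, 1], repeat=4)))        # rows are (x1, x2, y1, y2)

def feats(z):
    x1, x2, y1, y2 = z
    return np.array([float(y1 == y2),                          # edge potential    (coefficient lambda)
                     float(x1 == y1) + float(x2 == y2),        # direct potentials (coefficient beta)
                     float(x1 == y2) + float(x2 == y1)])       # cross potentials  (coefficient alpha)

F = np.array([feats(z) for z in Z])
true_w = np.array([1.0, 1.0, 0.5])                             # lambda = beta = 1, alpha = 0.5
p_true = np.exp(F @ true_w)
p_true /= p_true.sum()

n = 20000
data = rng.choice(len(Z), size=n, p=p_true)                    # indices of sampled outcomes
counts = np.bincount(data, minlength=len(Z)).astype(float)
emp = counts @ F[:, :2] / n                                    # empirical means of the two model features

def gen_grad(theta):
    """Gradient of the joint (generative) log-likelihood: empirical mean minus model mean."""
    q = np.exp(F[:, :2] @ theta)
    q /= q.sum()
    return emp - q @ F[:, :2]

def disc_grad(theta):
    """Gradient of the conditional log-likelihood of (y1, y2) given (x1, x2)."""
    g = np.zeros(2)
    for x1, x2 in itertools.product([0, 1], [0, 1]):
        mask = (Z[:, 0] == x1) & (Z[:, 1] == x2)
        q = np.exp(F[mask, :2] @ theta)
        q /= q.sum()
        g += counts[mask] @ (F[mask, :2] - q @ F[mask, :2])
    return g / n

def fit(grad_fn, steps=2000, lr=0.5):
    theta = np.zeros(2)                      # alpha is forced to zero: only (lambda, beta) are fit
    for _ in range(steps):
        theta = theta + lr * grad_fn(theta)
    return theta

print("generative     :", fit(gen_grad))
print("discriminative :", fit(disc_grad))
```

The pseudolikelihood criterion can be added analogously by conditioning, for each data point, on everything except a single $y_i$.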
4.2. Part-of-Speech Tagging

In this section, we present experiments on part-of-speech (POS) tagging. In POS tagging, the input is a sequence of words $x = (x_1, \dots, x_\ell)$ and the output is a sequence of POS tags $y = (y_1, \dots, y_\ell)$, e.g., noun, verb, etc. (There are 45 tags in total.) We consider the model specified by the following features (roughly 2 million in total):

    $\phi(x, y) = \sum_{i=1}^{\ell} \phi_{\mathrm{node}}(y_i, x_i) + \sum_{i=1}^{\ell - 1} \phi_{\mathrm{edge}}(y_i, y_{i+1})$,    (22)

where the node features $\phi_{\mathrm{node}}(y_i, x_i)$ are a vector of indicator functions of the form $1[y_i = a, x_i = b]$, and the edge features $\phi_{\mathrm{edge}}(y_i, y_{i+1})$ are a vector of indicator functions of the form $1[y_i = a, y_{i+1} = b]$. Trained generatively, this model is essentially an HMM, but slightly more expressive. Trained (fully) discriminatively, this model is a CRF.

We used the Wall Street Journal (WSJ) portion of the Penn Treebank, with sections 0-21 for training (38K sentences) and sections 22-24 for testing (5.5K sentences). Table 2(a) shows that the discriminative estimators perform better than the generative one. This is not surprising, given that the model is misspecified (language does not come from an HMM).

To verify that the generative estimator is superior when the model is well-specified, we used the learned generative model from the previous experiment to sample 1000 synthetic training and 1000 synthetic test examples. We then applied the estimators as before to this artificial data. Table 2(b) shows that the generative estimator now has an advantage over the fully discriminative estimator, with the pseudolikelihood discriminative estimator falling in between.

Table 2. Part-of-speech tagging results. Discriminative estimators outperform the generative estimator (on both test accuracy and log-loss) when the model is misspecified, but the reverse is true when the model is well-specified.

(a) Real data (misspecified)
                    Accuracy             Log-loss
                    Train     Test       Train     Test
    Gen.            0.940     0.935      4.628     4.945
    Fully dis.      0.977     0.956      1.480     3.120
    Pseudo dis.     0.975     0.955      1.562     3.170

(b) Synthetic data (well-specified)
                    Accuracy             Log-loss
                    Train     Test       Train     Test
    Gen.            0.989     0.898      0.570     7.297
    Fully dis.      0.992     0.879      0.407     12.431
    Pseudo dis.     0.990     0.891      0.469     10.840
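To make the feature map (22) concrete, the sketch below (our own illustration; the tiny vocabulary and tag set are invented, and a real implementation would use sparse data structures for the roughly 2 million features) accumulates the node and edge indicator counts for a single tagged sentence. Trained generatively, these counts enter a joint likelihood over $(x, y)$, giving the HMM-like model; trained discriminatively, the same $\phi$ enters the conditional likelihood of $y$ given $x$, giving the CRF.

```python
import numpy as np

# Invented toy vocabulary and tag set; the real model uses the 45-tag Penn Treebank tag set.
words = ["the", "dog", "barks"]
tags = ["DT", "NN", "VB"]
W, T = len(words), len(tags)

def features(x, y):
    """phi(x, y) of eq. (22): node indicators 1[y_i = a, x_i = b] and edge indicators 1[y_i = a, y_{i+1} = b]."""
    node = np.zeros((T, W))
    edge = np.zeros((T, T))
    for i, (xi, yi) in enumerate(zip(x, y)):
        node[tags.index(yi), words.index(xi)] += 1
        if i + 1 < len(y):
            edge[tags.index(yi), tags.index(y[i + 1])] += 1
    return np.concatenate([node.ravel(), edge.ravel()])

x = ["the", "dog", "barks"]
y = ["DT", "NN", "VB"]
print(features(x, y).sum())   # 3 node indicators + 2 edge indicators fire, so 5.0
```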
To verify that the generative estimator is superior when the model is well-specified, we used the learned generative model in the previous experiment to sample 1000 synthetic training and 1000 synthetic test examples. We then applied the estimators as before on this artificial data. Table 2(b) shows that the generative es- Accuracy Log-loss Train Test Train Test Gen. 0.940 0.935 4.628 4.945 Fully dis. 0.977 0.956 1.480 3.120 Pseudo dis. 0.975 0.955 1.562 3.170 (a) Real data (misspecified) Accuracy Log-loss Train Test Train Test Gen. 0.989 0.898 0.570 7.297 Full dis. 0.992 0.879 0.407 12.431 0.469 10.840 Pseudo dis. 0.990 0.891 (b) Synthetic data (well-specified) Table 2. Part-of-speech tagging results. Discriminative estimators outperform the generative estimator (on both test accuracy and log-loss) when the model is misspecified, but the reverse is true when the model is well-specified. An Asymptotic Analysis of Generative, Discriminative, and Pseudolikeliho o d Estimators timator has an advantage over the fully discriminative estimator, and both are better than the pseudolikelihood estimator. 5. Discussion and Extensions We believe our analysis captures the essence of the generative-discriminative distinction: by modeling the input, we reduce the variance of the parameter estimates. In related work, Ng and Jordan (2002) showed that Naive Bayes requires exponentially fewer examples than logistic regression to obtain the same estimation error. The key property needed in their proof was that the Naive Bayes estimator decouples into d independent closed form optimization problems, which does not seem to be the defining property of generative estimation. In particular, this property does not apply to general globally-normalized generative models, but one would still expect those models to have the advantages of being generative. Given that the generative and discriminative estimators are complementary, one natural question is how to interpolate between the two to get the benefits of both. Our framework naturally suggests two ways to go about this. First, we could vary the coarseness of the partitioning. Generative and discriminative estimators differ only in this coarseness and there is a range of intermediate choices corresponding to conditioning on more or fewer of the input variables. Second, we could take a weighted combination of estimators (e.g., Bouchard and Triggs (2004); McCallum et al. (2006)). For one-parameter models, Lindsay (1988) derived the optimal weighting of the component likelihoods, but unfortunately these results cannot be applied directly in practice. It would also be interesting to perform a similar asymptotic analysis on other estimators used in practice, for example marginal likelihoods with latent variables, tree-reweighted belief propagation (Wainwright et al., 2003; Wainwright, 2006), piecewise training (Sutton & McCallum, 2005), etc. Another important extension is to curved exponential families, which account for many of the popular generative models based on directed graphical models. tive, and pseudolikelihood estimators as special cases. Our work provides new theoretical support for existing intuitions and a basis for developing new estimators which balance the tradeoff between computational and statistical efficiency. Acknowledgments We thank Peter Bartlett for useful discussions and Simon Lacoste-Julien for comments. We also wish to acknowledge NSF grant 0509559 and a grant from Microsoft Research. References Besag, J. (1975). 
6. Conclusion

We have analyzed the asymptotic distributions of composite likelihood estimators in the exponential family. The idea of considering different partitionings of the outcome space allows a clean and intuitive characterization of the asymptotic variances, which enables us to compare the commonly used generative, discriminative, and pseudolikelihood estimators as special cases. Our work provides new theoretical support for existing intuitions and a basis for developing new estimators which balance the tradeoff between computational and statistical efficiency.

Acknowledgments

We thank Peter Bartlett for useful discussions and Simon Lacoste-Julien for comments. We also wish to acknowledge NSF grant 0509559 and a grant from Microsoft Research.

References

Besag, J. (1975). The analysis of non-lattice data. The Statistician, 24, 179-195.

Bouchard, G., & Triggs, B. (2004). The trade-off between generative and discriminative classifiers. International Conference on Computational Statistics (pp. 721-728).

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. International Conference on Machine Learning (ICML).

Lasserre, J. A., Bishop, C. M., & Minka, T. P. (2006). Principled hybrids of generative and discriminative models. Computer Vision and Pattern Recognition (CVPR) (pp. 87-94).

Liang, P., Klein, D., & Jordan, M. I. (2008). Agreement-based learning. Advances in Neural Information Processing Systems (NIPS).

Lindsay, B. (1988). Composite likelihood methods. Contemporary Mathematics, 80, 221-239.

McCallum, A., Pal, C., Druck, G., & Wang, X. (2006). Multi-conditional learning: Generative/discriminative training for clustering and classification. Association for the Advancement of Artificial Intelligence (AAAI).

Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems (NIPS).

Sutton, C., & McCallum, A. (2005). Piecewise training of undirected models. Uncertainty in Artificial Intelligence (UAI).

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.

Varin, C. (2008). On composite marginal likelihoods. Advances in Statistical Analysis, 92, 1-28.

Wainwright, M. (2006). Estimating the "wrong" graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research, 7, 1829-1859.

Wainwright, M., Jaakkola, T., & Willsky, A. (2003). Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. Artificial Intelligence and Statistics (AISTATS).

Wainwright, M., & Jordan, M. I. (2003). Graphical models, exponential families, and variational inference (Technical Report). Department of Statistics, University of California at Berkeley.