Analyzing the Errors of Unsupervised Learning

Percy Liang    Dan Klein
Computer Science Division, EECS Department
University of California at Berkeley
Berkeley, CA 94720
{pliang,klein}@cs.berkeley.edu

Abstract

We identify four types of errors that unsupervised induction systems make and study each one in turn. Our contributions include (1) using a meta-model to analyze the incorrect biases of a model in a systematic way, (2) providing an efficient and robust method of measuring distance between two parameter settings of a model, and (3) showing that local optima issues which typically plague EM can be somewhat alleviated by increasing the number of training examples. We conduct our analyses on three models: the HMM, the PCFG, and a simple dependency model.

1 Introduction

The unsupervised induction of linguistic structure from raw text is an important problem both for understanding language acquisition and for building language processing systems such as parsers from limited resources. Early work on inducing grammars via EM encountered two serious obstacles: the inappropriateness of the likelihood objective and the tendency of EM to get stuck in local optima. Without additional constraints on bracketing (Pereira and Schabes, 1992) or on allowable rewrite rules (Carroll and Charniak, 1992), unsupervised grammar learning was ineffective.

Since then, there has been a large body of work addressing the flaws of the EM-based approach. Syntactic models empirically more learnable than PCFGs have been developed (Clark, 2001; Klein and Manning, 2004). Smith and Eisner (2005) proposed a new objective function; Smith and Eisner (2006) introduced a new training procedure. Bayesian approaches can also improve performance (Goldwater and Griffiths, 2007; Johnson, 2007; Kurihara and Sato, 2006).

Though these methods have improved induction accuracy, at the core they all still involve optimizing non-convex objective functions related to the likelihood of some model, and thus are not completely immune to the difficulties associated with early approaches. It is therefore important to better understand the behavior of unsupervised induction systems in general.

In this paper, we take a step back and present a more statistical view of unsupervised learning in the context of grammar induction. We identify four types of error that a system can make: approximation, identifiability, estimation, and optimization errors (see Figure 1). We try to isolate each one in turn and study its properties.

Approximation error is caused by a mismatch between the likelihood objective optimized by EM and the true relationship between sentences and their syntactic structures. Our key idea for understanding this mismatch is to "cheat" and initialize EM with the true relationship and then study the ways in which EM repurposes our desired syntactic structures to increase likelihood. We present a meta-model of the changes that EM makes and show how this tool can shed some light on the undesired biases of the HMM, the PCFG, and the dependency model with valence (Klein and Manning, 2004).

Identifiability error can be incurred when two distinct parameter settings yield the same probability distribution over sentences. One type of non-identifiability present in HMMs and PCFGs is label symmetry, which even makes computing a meaningful distance between parameters NP-hard. We present a method to obtain lower and upper bounds on such a distance.
Estimation error arises from having too few training examples, and optimization error stems from EM getting stuck in local optima. While it is to be expected that estimation error should decrease as the amount of data increases, we show that optimization error can also decrease. We present striking experiments showing that if our data actually comes from the model family we are learning with, we can sometimes recover the true parameters by simply running EM without clever initialization. This result runs counter to the conventional attitude that EM is doomed to local optima; it suggests that increasing the amount of data might be an effective way to partially combat local optima.

2 Unsupervised models

Let x denote an input sentence and y denote the unobserved desired output (e.g., a parse tree). We consider a model family $\mathcal{P} = \{p_\theta(x, y) : \theta \in \Theta\}$. For example, if $\mathcal{P}$ is the set of all PCFGs, then the parameters $\theta$ would specify all the rule probabilities of a particular grammar. We sometimes use $\theta$ and $p_\theta$ interchangeably to simplify notation. In this paper, we analyze the following three model families:

In the HMM, the input x is a sequence of words and the output y is the corresponding sequence of part-of-speech tags.

In the PCFG, the input x is a sequence of POS tags and the output y is a binary parse tree with yield x. We represent y as a multiset of binary rewrites of the form $y \rightarrow y_1 y_2$, where $y$ is a nonterminal and $y_1, y_2$ can be either nonterminals or terminals.

In the dependency model with valence (DMV) (Klein and Manning, 2004), the input $x = (x_1, \ldots, x_m)$ is a sequence of POS tags and the output y specifies the directed links of a projective dependency tree. The generative model is as follows: for each head $x_i$, we generate an independent sequence of arguments to the left and to the right from a direction-dependent distribution over tags. At each point, we stop with a probability parametrized by the direction and whether any arguments have already been generated in that direction. See Klein and Manning (2004) for a formal description.

In all our experiments, we used the Wall Street Journal (WSJ) portion of the Penn Treebank. We binarized the PCFG trees and created gold dependency trees according to the Collins head rules. We trained 45-state HMMs on all 49208 sentences, 11-state PCFGs on WSJ-10 (7424 sentences), and DMVs on WSJ-20 (25523 sentences) (Klein and Manning, 2004). We ran EM for 100 iterations with the parameters initialized uniformly (always plus a small amount of random noise). We evaluated the HMM and PCFG by mapping model states to Treebank tags to maximize accuracy.
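To make the DMV's head-outward generative story concrete, here is a minimal sketch of the argument-generation loop for a single head and direction. The table layouts and all names here are ours, hypothetical stand-ins for the model's parameters rather than the original implementation:

```python
import random

def sample_dmv_arguments(head, direction, stop_prob, arg_prob):
    """Sample one head's argument sequence in one direction under the DMV.

    stop_prob[(head, direction, adjacent)] is the probability of stopping,
    where adjacent=True means no argument has been generated yet in this
    direction; arg_prob[(head, direction)] maps argument tags to
    probabilities.
    """
    args, adjacent = [], True
    while random.random() >= stop_prob[(head, direction, adjacent)]:
        dist = arg_prob[(head, direction)]
        tags = list(dist)
        args.append(random.choices(tags, weights=[dist[t] for t in tags])[0])
        adjacent = False  # the stop distribution changes after the first argument
    return args
```

Recursing on each sampled argument tag, in both directions, would then yield a full projective dependency tree.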
3 Decomposition of errors

Now we will describe the four types of errors (Figure 1) more formally. Let $p^*(x, y)$ denote the distribution which governs the true relationship between the input x and output y. In general, $p^*$ does not live in our model family $\mathcal{P}$. We are presented with a set of n unlabeled examples $x^{(1)}, \ldots, x^{(n)}$ drawn i.i.d. from the true $p^*$. In unsupervised induction, our goal is to approximate $p^*$ by some model $p_\theta \in \mathcal{P}$ in terms of strong generative capacity. A standard approach is to use the EM algorithm to optimize the empirical likelihood $\hat{\mathbb{E}} \log p_\theta(x)$, where the expectation $\hat{\mathbb{E}} f(x) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f(x^{(i)})$ denotes averaging some function f over the training data. However, EM only finds a local maximum, which we denote $\hat\theta_{\text{EM}}$, so there is a discrepancy between what we get ($p_{\hat\theta_{\text{EM}}}$) and what we want ($p^*$). We will define this discrepancy later, but for now, it suffices to remark that the discrepancy depends on the distribution over y, whereas learning depends only on the distribution over x. This is an important property that distinguishes unsupervised induction from more standard supervised learning or density estimation scenarios.

Now let us walk through the four types of error bottom up. First, $\hat\theta_{\text{EM}}$, the local maximum found by EM, is in general different from $\hat\theta^* \in \operatorname{argmax}_\theta \hat{\mathbb{E}} \log p_\theta(x)$, any global maximum, which we could find given unlimited computational resources. Optimization error refers to the discrepancy between $\hat\theta^*$ and $\hat\theta_{\text{EM}}$. Second, our training data is only a noisy sample from the true $p^*$. If we had infinite data, we would choose an optimal parameter setting under the model, $\theta_2^* \in \operatorname{argmax}_\theta \mathbb{E} \log p_\theta(x)$, where now the expectation $\mathbb{E}$ is taken with respect to the true $p^*$ instead of the training data. The discrepancy between $\hat\theta^*$ and $\theta_2^*$ is the estimation error. Note that $\theta_2^*$ might not be unique. Let $\theta_1^*$ denote the maximizer of $\mathbb{E} \log p_\theta(x)$ that has the smallest discrepancy with $p^*$. Since $\theta_1^*$ and $\theta_2^*$ have the same value under the objective function, we would not be able to choose $\theta_1^*$ over $\theta_2^*$, even with infinite data or unlimited computation. Identifiability error refers to the discrepancy between $\theta_1^*$ and $\theta_2^*$. Finally, the model family $\mathcal{P}$ has fundamental limitations. Approximation error refers to the discrepancy between $p^*$ and $p_{\theta_1^*}$. Note that $\theta_1^*$ is not necessarily the best $\theta \in \mathcal{P}$. If we had labeled data, we could find a parameter setting in $\mathcal{P}$ which is closer to $p^*$ by optimizing joint likelihood $\mathbb{E} \log p_\theta(x, y)$ (generative training) or even conditional likelihood $\mathbb{E} \log p_\theta(y \mid x)$ (discriminative training).

[Figure 1: The discrepancy between what we get ($\hat\theta_{\text{EM}}$) and what we want ($p^*$) can be decomposed into four types of errors. The box represents our model family $\mathcal{P}$, which is the set of possible parametrized distributions we can represent. Best(S) returns the $\theta \in S$ which has the smallest discrepancy with $p^*$.]

In the remaining sections, we try to study each of the four errors in isolation. In practice, since it is difficult to work with some of the parameter settings that participate in the error decomposition, we use computationally feasible surrogates so that the error under study remains the dominant effect.
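To keep the four quantities straight, the decomposition can be summarized in a single chain (our rendering of the schematic content of Figure 1):

\[
\underbrace{\hat\theta_{\text{EM}}}_{\text{what we get}}
\;\leftrightarrow\;
\hat\theta^* \in \operatorname*{argmax}_\theta \hat{\mathbb{E}} \log p_\theta(x)
\;\leftrightarrow\;
\theta_2^* \in \operatorname*{argmax}_\theta \mathbb{E} \log p_\theta(x)
\;\leftrightarrow\;
\theta_1^*
\;\leftrightarrow\;
\underbrace{p^*}_{\text{what we want}}
\]

where the four gaps are, from left to right, optimization error (Section 7), estimation error (Section 6), identifiability error (Section 5), and approximation error (Section 4), and $\theta_1^* = \text{Best}(\operatorname{argmax}_\theta \mathbb{E} \log p_\theta(x))$.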
4 Approximation error

We start by analyzing approximation error, the discrepancy between $p^*$ and $p_{\theta_1^*}$ (the model found by optimizing likelihood), a point which has been discussed by many authors (Merialdo, 1994; Smith and Eisner, 2005; Haghighi and Klein, 2006). Here, we think of the discrepancy between p and $p^*$ as the error incurred when using p for prediction on examples generated from $p^*$; in symbols, $\mathbb{E}_{(x,y) \sim p^*} \, \text{loss}(y, \operatorname{argmax}_{y'} p(y' \mid x))$.

To confront the question of specifically how the likelihood diverges from prediction accuracy, we perform the following experiment: we initialize EM with the supervised estimate $\hat\theta_{\text{gen}} = \operatorname{argmax}_\theta \hat{\mathbb{E}} \log p_\theta(x, y)$, which acts as a surrogate for $p^*$. (For all our models, the supervised estimate is solved in closed form by taking ratios of counts.) As we run EM, the likelihood increases but the accuracy decreases (Figure 2 shows this trend for the PCFG; the HMM and DMV models behave similarly). We believe that the initial iterations of EM contain valuable information about the incorrect biases of these models.

[Figure 2: For the PCFG, when we initialize EM with the supervised estimate $\hat\theta_{\text{gen}}$, the log-likelihood increases over the EM iterations but the labeled F1 decreases.]

However, EM is changing hundreds of thousands of parameters at once in a non-trivial way, so we need a way of characterizing the important changes. One broad observation we can make is that the first iteration of EM reinforces the systematic mistakes of the supervised initializer. In the first E-step, the posterior counts that are computed summarize the predictions of the supervised system. If these match the empirical counts, then the M-step does not change the parameters. But if the supervised system predicts too many JJs, for example, then the M-step will update the parameters to reinforce this bias.

4.1 A meta-model for analyzing EM

We would like to go further and characterize the specific changes EM makes. An initial approach is to find the parameters that changed the most during the first iteration (weighted by the corresponding expected counts computed in the E-step). For the HMM, the three most changed parameters are the transitions 2:DT→8:JJ, START→0:NNP, and 8:JJ→3:NN (here, 2:DT means state 2 of the HMM, which was greedily mapped to DT). If we delve deeper, we can see that 2:DT→3:NN (the parameter with the 10th largest change) fell and 2:DT→8:JJ rose. After checking with a few examples, we can then deduce that some nouns were retagged as adjectives. Unfortunately, this type of ad-hoc reasoning requires considerable manual effort and is rather subjective.

Instead, we propose using a general meta-model to analyze the changes EM makes in an automatic and objective way. Instead of treating parameters as the primary object of study, we look at predictions made by the model and study how they change over time. While a model is a distribution over sentences, a meta-model is a distribution over how the predictions of the model change.

Let R(y) denote the set of parts of a prediction y that we are interested in tracking. Each part $(c, l) \in R(y)$ consists of a configuration c and a location l. For a PCFG, we define a configuration to be a rewrite rule (e.g., $c = \text{PP} \rightarrow \text{IN NP}$), and a location $l = [i, k, j]$ to be a span $[i, j]$ split at k, where the rewrite c is applied. In this work, each configuration is associated with a parameter of the model, but in general, a configuration could be a larger unit such as a subtree, allowing one to track more complex changes. The size of a configuration governs how much the meta-model generalizes from individual examples.

Let $y^{(i,t)}$ denote the model prediction on the i-th training example after t iterations of EM. To simplify notation, we write $R_t = R(y^{(i,t)})$. The meta-model explains how $R_t$ became $R_{t+1}$. (If the same part appears in both $R_t$ and $R_{t+1}$, we remove it from both sets.) In general, we expect a part in $R_{t+1}$ to be explained by a part in $R_t$ that has a similar location, and furthermore, we expect the locations of the two parts to be related in some consistent way. The meta-model uses two notions to formalize this idea: a distance $d(l, l')$ and a relation $r(l, l')$. For the PCFG, $d(l, l')$ is the number of positions among i, j, k that differ from the corresponding ones in $l'$, and $r((i, k, j), (i', k', j')) = (\text{sign}(i - i'), \text{sign}(j - j'), \text{sign}(k - k'))$ is one of $3^3$ values.
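In code, the two notions are small. A sketch, with locations represented as (i, k, j) triples:

```python
def d(l, l_prime):
    """PCFG location distance: how many of the coordinates (i, k, j)
    differ between the two locations."""
    return sum(a != b for a, b in zip(l, l_prime))

def sign(v):
    return (v > 0) - (v < 0)

def r(l, l_prime):
    """PCFG location relation: one of 3^3 = 27 values describing where
    the old location l sits relative to the new location l_prime."""
    (i, k, j), (i2, k2, j2) = l, l_prime
    return (sign(i - i2), sign(j - j2), sign(k - k2))
```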
We define a migration as a triple $(c, c', r(l, l'))$; this is the unit of change we want to extract from the meta-model. Our meta-model provides the following generative story of how $R_t$ becomes $R_{t+1}$: each new part $(c', l') \in R_{t+1}$ chooses an old part $(c, l) \in R_t$ with some probability that depends on (1) the distance between the locations l and l' and (2) the likelihood of the particular migration. Formally:

\[
p_{\text{meta}}(R_{t+1} \mid R_t) = \prod_{(c', l') \in R_{t+1}} \sum_{(c, l) \in R_t} Z_{l'}^{-1} \, e^{-\alpha d(l, l')} \, p(c' \mid c, r(l, l')),
\]

where $Z_{l'} = \sum_{(c, l) \in R_t} e^{-\alpha d(l, l')}$ is a normalization constant, and $\alpha$ is a hyperparameter controlling the possibility of distant migrations (set to 3 in our experiments). We learn the parameters of the meta-model with an EM algorithm similar to the one for IBM model 1. Fortunately, the likelihood objective is convex, so we need not worry about local optima.
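Since the mixture structure mirrors IBM model 1, the meta-model's EM updates fit in a few lines. The following is our reconstruction (reusing d and r from the sketch above), not the authors' implementation:

```python
import math
from collections import defaultdict

def train_meta_model(examples, alpha=3.0, iterations=20):
    """EM for the meta-model, analogous to IBM model 1.

    examples: list of (R_t, R_t1) pairs; each R is a list of (c, l) parts,
    with parts shared between the two sets already removed. Returns the
    migration probabilities p[(c, c_new, rel)] = p(c_new | c, rel) and the
    expected migration counts used to rank migrations.
    """
    p = defaultdict(lambda: 1.0)  # effectively uniform on the first E-step
    for _ in range(iterations):
        counts = defaultdict(float)
        for R_t, R_t1 in examples:
            for c_new, l_new in R_t1:
                # E-step: posterior responsibility of each old part for this
                # new part; the normalizer Z_{l'} cancels out of the posterior.
                w = [math.exp(-alpha * d(l, l_new)) * p[(c, c_new, r(l, l_new))]
                     for c, l in R_t]
                total = sum(w)
                for (c, l), w_i in zip(R_t, w):
                    counts[(c, c_new, r(l, l_new))] += w_i / total
        # M-step: renormalize so that p(. | c, rel) sums to one over c_new.
        norm = defaultdict(float)
        for (c, c_new, rel), v in counts.items():
            norm[(c, rel)] += v
        p = defaultdict(lambda: 1.0)
        p.update({k: v / norm[(k[0], k[2])] for k, v in counts.items()})
    return p, counts
```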
4.2 Results of the meta-model

We used our meta-model to analyze the approximation errors of the HMM, DMV, and PCFG. For these models, we initialized EM with the supervised estimate $\hat\theta_{\text{gen}}$ and collected the model predictions as EM ran. We then trained the meta-model on the predictions between successive iterations. The meta-model gives us an expected count for each migration. Figure 3 lists the migrations with the highest expected counts.

From these migrations, we can see that EM tries to explain x better by making the corresponding y more regular. In fact, many of the HMM migrations on the first iteration attempt to resolve inconsistencies in gold tags. For example, noun adjuncts (e.g., stock-index), tagged as both nouns and adjectives in the Treebank, tend to become consolidated under adjectives, as captured by migration (B). EM also re-purposes under-utilized states to better capture distributional similarities. For example, state 24 has migrated to state 40 (N), both of which are now dominated by proper nouns. State 40 initially contained only #, but was quickly overrun with distributionally similar proper nouns such as Oct. and Chapter, which also precede numbers, just as # does.

[Figure 3: The prominent migrations that occur during the first 5 iterations of EM for the HMM, DMV, and PCFG, as recovered by our meta-model. Migrations are sorted within each iteration by their expected counts under the meta-model, and the top 3 are shown; iteration 0 corresponds to the correct outputs. Panel (a) shows the top HMM migrations (e.g., migration (D) means an NN→NN transition is replaced by JJ→NN); panel (b) the top DMV migrations (e.g., migration (A) means a DT attaches to the closer NN); panel (c) the top PCFG migrations (e.g., migration (D) means an NP→NNP NP rewrite is replaced by NP→NNP NNP, where the new NNP right child spans less than the old NP right child).]

DMV migrations also try to regularize model predictions, but in a different way: in terms of the number of arguments. Because the stop probability is different for adjacent and non-adjacent arguments, it is statistically much cheaper to generate one argument rather than two or more. For example, if we train a DMV on only DT JJ NN, it can fit the data perfectly by using a chain of single arguments, but a perfect fit is not possible if NN generates both DT and JJ (which is the desired structure); this explains migration (J). Indeed, we observed that the variance of the number of arguments decreases with more EM iterations (for NN, from 1.38 to 0.41). In general, low-entropy conditional distributions are preferred. Migration (H) explains how adverbs now consistently attach to verbs rather than modals. After a few iterations, the modal has committed itself to generating exactly one verb to the right, which is statistically advantageous because there must be a verb after a modal, while the adverb is optional. This leaves the verb to generate the adverb.
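To spell out the DT JJ NN example, write $s_a^h$ and $s_n^h$ for head h's adjacent and non-adjacent leftward stop probabilities (our notation, and our bookkeeping of the DMV factors), and drop the factors common to both analyses. The two analyses then score roughly:

\[
P_{\text{flat}} \propto (1 - s_a^{\text{NN}}) \, p(\text{JJ} \mid \text{NN}) \, (1 - s_n^{\text{NN}}) \, p(\text{DT} \mid \text{NN}) \, s_n^{\text{NN}},
\]
\[
P_{\text{chain}} \propto (1 - s_a^{\text{NN}}) \, p(\text{JJ} \mid \text{NN}) \, s_n^{\text{NN}} \, (1 - s_a^{\text{JJ}}) \, p(\text{DT} \mid \text{JJ}) \, s_n^{\text{JJ}}.
\]

The flat (desired) analysis asks the same non-adjacent stop distribution both to continue (for DT) and then to stop, so the product $(1 - s_n^{\text{NN}}) \, s_n^{\text{NN}}$ can never exceed 1/4, whereas the chain analysis uses each stop parameter with one consistent outcome and can drive every factor to 1. This is the pressure behind migration (J).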
The PCFG migrations regularize categories in a manner similar to the HMM, but with the added complexity of changing bracketing structures. For example, sentential adverbs are re-analyzed as VP adverbs (A). Sometimes, multiple migrations explain the same phenomenon. (We could consolidate these migrations by using larger configurations, but at the risk of decreased generalization.) For example, migrations (B) and (C) indicate that PPs that previously attached to NPs are now raised to the verbal level. Tree rotation is another common phenomenon, leading to many left-branching structures (D, G, H). The migrations that happen during one iteration can also trigger additional migrations in the next. For example, the raising of the PP (B, C) inspires more of the same raising (E). As another example, migration (I) regularizes TO VB infinitival clauses into PPs, and this momentum carries over to the next iteration with even greater force (J).

In summary, the meta-model facilitates our analyses by automatically identifying the broad trends. We believe that the central idea of modeling the errors of a system is a powerful one which can be used to analyze a wide range of models, both supervised and unsupervised.

5 Identifiability error

While approximation error is incurred when likelihood diverges from accuracy, identifiability error is concerned with the case where likelihood is indifferent to accuracy. We say a set of parameters $S \subseteq \Theta$ is identifiable (in terms of x) if $p_\theta(x) \neq p_{\theta'}(x)$ for every $\theta, \theta' \in S$ where $\theta \neq \theta'$. (For our three model families, $\Theta$ is identifiable in terms of (x, y), but not in terms of x alone.) In general, identifiability error is incurred when the set of maximizers of $\mathbb{E} \log p_\theta(x)$ is non-identifiable. (We emphasize that non-identifiability is in terms of x, so two parameter settings could still induce the same marginal distribution on x (weak generative capacity) while having different joint distributions on (x, y) (strong generative capacity). Recall that discrepancy depends on the latter.)

Label symmetry is perhaps the most familiar example of non-identifiability and is intrinsic to models with hidden labels (HMM and PCFG, but not DMV). We can permute the hidden labels without changing the objective function or even the nature of the solution, so there is no reason to prefer one permutation over another. While seemingly benign, this symmetry actually presents a serious challenge in measuring discrepancy (Section 5.1).

Grenager et al. (2005) augment an HMM to allow emission from a generic stopword distribution at any position with probability q. Their model would definitely not be identifiable if q were a free parameter, since we can set q to 0 and just mix in the stopword distribution with each of the other emission distributions to obtain a different parameter setting yielding the same overall distribution. This is a case where our notion of desired structure is absent in the likelihood, and a prior over parameters could help break ties.

The above non-identifiabilities apply to all parameter settings, but another type of non-identifiability concerns only the maximizers of $\mathbb{E} \log p_\theta(x)$. Suppose the true data comes from a K-state HMM. If we attempt to fit an HMM with K+1 states, we can split any one of the K states and maintain the same distribution on x. Or, if we learn a PCFG on the same HMM data, then we can choose either the left- or right-branching chain structures, which both mimic the true HMM equally well.

5.1 Permutation-invariant distance

KL-divergence is a natural measure of discrepancy between two distributions, but it is somewhat nontrivial to compute: for our three recursive models, it requires solving fixed point equations, and it becomes completely intractable in the face of label symmetry. Thus we propose a more manageable alternative:

\[
d_\mu(\theta \,\|\, \theta') \stackrel{\text{def}}{=} \sum_j \frac{\mu_j}{\sum_{j'} \mu_{j'}} \, |\theta_j - \theta'_j|, \tag{1}
\]

where we weight the difference between the j-th components of the parameter vectors by $\mu_j$, the j-th expected sufficient statistic with respect to $p_\theta$ (the expected counts computed in the E-step). (Without this factor, rarely used components could contribute to the sum as much as frequently used ones, thus making the distance overly pessimistic.)

Unlike KL, our distance $d_\mu$ is only defined on distributions in the model family and is not invariant to reparametrization. Like KL, $d_\mu$ is asymmetric, with the first argument holding the status of being the "true" parameter setting. In our case, the parameters are conditional probabilities, so $0 \le d_\mu(\theta \,\|\, \theta') \le 1$, and we can interpret $d_\mu$ as an expected difference between these probabilities.

Unfortunately, label symmetry can wreak havoc on our distance measure $d_\mu$. Suppose we want to measure the distance between $\theta$ and $\theta'$. If $\theta'$ is simply $\theta$ with the labels permuted, then $d_\mu(\theta \,\|\, \theta')$ would be substantial even though the distance ought to be zero. We define a revised distance to correct for this by taking the minimum distance over all label permutations:

\[
D_\mu(\theta \,\|\, \theta') = \min_\pi d_\mu(\theta \,\|\, \pi(\theta')), \tag{2}
\]

where $\pi(\theta')$ denotes the parameter setting resulting from permuting the labels according to $\pi$. (The DMV has no label symmetries, so just $d_\mu$ works.)

For mixture models, we can compute $D_\mu(\theta \,\|\, \theta')$ efficiently as follows. Note that each term in the summation of (1) is associated with one of the K labels. We can form a $K \times K$ matrix M, where each entry $M_{ij}$ is the distance between the parameters involving label i of $\theta$ and label j of $\theta'$. $D_\mu(\theta \,\|\, \theta')$ can then be computed by finding a minimum-weight bipartite matching on M using the $O(K^3)$ Hungarian algorithm (Kuhn, 1955).

For models such as the HMM and PCFG, computing $D_\mu$ is NP-hard, since the summation in $d_\mu$ (1) contains both first-order terms, which depend on one label (e.g., emission parameters), and higher-order terms, which depend on more than one label (e.g., transitions or rewrites). We cannot capture these problematic higher-order dependencies in M. However, we can bound $D_\mu(\theta \,\|\, \theta')$ as follows. We create M using only first-order terms and find the best matching (permutation) to obtain a lower bound $\underline{D}_\mu$ and an associated permutation $\pi_0$ achieving it. Since $D_\mu(\theta \,\|\, \theta')$ takes the minimum over all permutations, $d_\mu(\theta \,\|\, \pi(\theta'))$ is an upper bound for any $\pi$, in particular for $\pi = \pi_0$. We then use a local search procedure that changes $\pi$ to further tighten the upper bound. Let $\overline{D}_\mu$ denote the final value.
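Both the matching step and the local-search tightening are easy to sketch with an off-the-shelf assignment solver; the function names and the shape of full_d_mu are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def d_mu_bounds(M, full_d_mu):
    """Bound D_mu in the presence of label symmetry.

    M: K x K matrix whose (i, j) entry is the first-order part of d_mu
    restricted to label i of theta and label j of theta'. full_d_mu(perm)
    evaluates the full d_mu(theta || perm(theta')), including higher-order
    terms, under a label permutation. Returns (lower, upper, permutation).
    """
    # Lower bound: optimal assignment on the first-order terms only
    # (with distances as entries, this is a minimum-cost matching).
    rows, cols = linear_sum_assignment(M)
    lower = M[rows, cols].sum()
    # Upper bound: any concrete permutation upper-bounds the min over
    # permutations; start from pi_0 and greedily swap label pairs
    # whenever a swap tightens the bound.
    perm = list(cols)
    upper = full_d_mu(perm)
    improved = True
    while improved:
        improved = False
        for a in range(len(perm)):
            for b in range(a + 1, len(perm)):
                cand = list(perm)
                cand[a], cand[b] = cand[b], cand[a]
                v = full_d_mu(cand)
                if v < upper:
                    upper, perm, improved = v, cand, True
    return lower, upper, perm
```

For mixture models, where only first-order terms exist, the lower bound is exact.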
6 Estimation error

Thus far, we have considered approximation and identifiability errors, which have to do with flaws of the model. The remaining errors have to do with how well we can fit the model. To focus on these errors, we consider the case where the true model is in our family ($p^* \in \mathcal{P}$). To keep the setting as realistic as possible, we do supervised learning on real labeled data to obtain $\theta^* = \operatorname{argmax}_\theta \hat{\mathbb{E}} \log p_\theta(x, y)$. We then throw away our real data and let $p^* = p_{\theta^*}$. Now we start anew: sample new artificial data from $\theta^*$, learn a model using this artificial data, and see how close we get to recovering $\theta^*$.

In order to compute estimation error, we need to compare $\theta^*$ with $\hat\theta^*$, the global maximizer of the likelihood on our generated data. However, we cannot compute $\hat\theta^*$ exactly. Let us therefore first consider the simpler supervised scenario. Here, $\hat\theta_{\text{gen}}$ has a closed-form solution, so there is no optimization error. Using our distance $D_\mu$ (defined in Section 5.1) to quantify estimation error, we see that, for the HMM, $\hat\theta_{\text{gen}}$ quickly approaches $\theta^*$ as we increase the amount of data (Table 1).

# examples                                                       500      5K       50K      500K
$\underline{D}_\mu(\theta^* \,\|\, \hat\theta_{\text{gen}})$     0.003    6.3e-4   2.7e-4   8.5e-5
$\overline{D}_\mu(\theta^* \,\|\, \hat\theta_{\text{gen}})$      0.005    0.001    5.2e-4   1.7e-4
$\underline{D}_\mu(\theta^* \,\|\, \hat\theta_{\text{gen-EM}})$  0.022    0.018    0.008    0.002
$\overline{D}_\mu(\theta^* \,\|\, \hat\theta_{\text{gen-EM}})$   0.049    0.039    0.016    0.004

Table 1: Lower and upper bounds on the distance from the true $\theta^*$ for the HMM as we increase the number of examples.

In the unsupervised case, we use the following procedure to obtain a surrogate for $\hat\theta^*$: initialize EM with the supervised estimate $\hat\theta_{\text{gen}}$ and run EM for 100 iterations. Let $\hat\theta_{\text{gen-EM}}$ denote the final parameters, which should be representative of $\hat\theta^*$. Table 1 shows that the estimation error of $\hat\theta_{\text{gen-EM}}$ is an order of magnitude higher than that of $\hat\theta_{\text{gen}}$, which is to be expected since $\hat\theta_{\text{gen-EM}}$ does not have access to labeled data. However, this error can also be driven down given a moderate number of examples.
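For reference, the closed-form supervised estimate for the HMM really is just ratios of counts. A sketch, with the START/STOP bookkeeping being our own convention:

```python
from collections import Counter, defaultdict

def supervised_hmm_estimate(tagged_sentences):
    """Closed-form supervised estimate theta_hat_gen for an HMM:
    transition and emission probabilities as ratios of counts.
    tagged_sentences: list of [(word, tag), ...] sequences."""
    trans, emit = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "START"
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            prev = tag
        trans[(prev, "STOP")] += 1

    def normalize(counts):
        # Divide each count by the total for its conditioning context.
        totals = defaultdict(float)
        for (context, _), v in counts.items():
            totals[context] += v
        return {k: v / totals[k[0]] for k, v in counts.items()}

    return normalize(trans), normalize(emit)  # p(tag'|tag), p(word|tag)
```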
7 Optimization error

Finally, we study optimization error, which is the discrepancy between the global maximizer $\hat\theta^*$ and $\hat\theta_{\text{EM}}$, the result of running EM starting from a uniform initialization (plus some small noise). As before, we cannot compute $\hat\theta^*$, so we use $\hat\theta_{\text{gen-EM}}$ as a surrogate. Also, instead of comparing $\hat\theta_{\text{gen-EM}}$ and $\hat\theta_{\text{EM}}$ with each other, we compare each of their discrepancies with respect to $\theta^*$.

Let us first consider optimization error in terms of prediction error. The first observation is that there is a gap between the prediction accuracies of $\hat\theta_{\text{gen-EM}}$ and $\hat\theta_{\text{EM}}$, but this gap shrinks considerably as we increase the number of examples. Figures 4(a,b,c) support this for all three model families: for the HMM, both $\hat\theta_{\text{gen-EM}}$ and $\hat\theta_{\text{EM}}$ eventually achieve around 90% accuracy; for the DMV, 85%. For the PCFG, $\hat\theta_{\text{EM}}$ still lags $\hat\theta_{\text{gen-EM}}$ by 10%, but we believe that more data can further reduce this gap. Figure 4(d) shows that these trends are not particular to artificial data. On real WSJ data, the gap between $\hat\theta_{\text{gen-EM}}$ and $\hat\theta_{\text{EM}}$ also diminishes for the HMM.

[Figure 4: Compares the performance of $\hat\theta_{\text{EM}}$ (EM with a uniform initialization, including several random restarts) against $\hat\theta_{\text{gen-EM}}$ (EM initialized with the supervised estimate) as the number of examples grows: (a) HMM accuracy (artificial data), (b) DMV directed F1 (artificial data), (c) PCFG labeled F1 (artificial data), (d) HMM accuracy (real data), (e) distance $D_\mu(\theta^* \,\|\, \cdot)$ instead of accuracy (HMM, artificial data), (f) log-likelihood and accuracy over the iterations of a sample EM run on 500K examples under supervised vs. uniform initialization.]

To reaffirm the trends, we also measure distance $D_\mu$. Figure 4(e) shows that the distance from $\hat\theta_{\text{EM}}$ to the true parameters decreases, but the gap between $\hat\theta_{\text{gen-EM}}$ and $\hat\theta_{\text{EM}}$ does not close as decisively as it did for prediction error.

It is quite surprising that by simply running EM with a neutral initialization, we can accurately learn a complex model with thousands of parameters. Figure 4(f) shows how both likelihood and accuracy, which both start quite low, improve substantially over time for the HMM on artificial data.

Carroll and Charniak (1992) report that EM fared poorly with local optima. We do not claim that there are no local optima, but only that the likelihood surface that EM is optimizing can become smoother with more examples. With more examples, there is less noise in the aggregate statistics, so it might be easier for EM to pick out the salient patterns. Srebro et al. (2006) made a similar observation in the context of learning Gaussian mixtures. They characterized three regimes: one where EM was successful in recovering the true clusters (given lots of data), another where EM failed but the global optimum was successful, and the last where both failed (without much data).

There is also a rich body of theoretical work on learning latent-variable models. Specialized algorithms can provably learn certain constrained discrete hidden-variable models, some in terms of weak generative capacity (Ron et al., 1998; Clark and Thollard, 2005; Adriaans, 1999), others in terms of strong generative capacity (Dasgupta, 1999; Feldman et al., 2005). But with the exception of Dasgupta and Schulman (2007), there is little theoretical understanding of EM, let alone of its behavior on complex model families such as the HMM, PCFG, and DMV.

8 Conclusion

In recent years, many methods have improved unsupervised induction, but these methods must still deal with the four types of errors we have identified in this paper. One of the main contributions of this paper is the idea of using the meta-model to diagnose approximation error. Using this tool, we can better understand model biases and hopefully correct for them.
We also introduced a method for measuring distances in the face of label symmetry and ran experiments exploring the effectiveness of EM as a function of the amount of data. Finally, we hope that setting up the general framework to understand the errors of unsupervised induction systems will aid the development of better methods and further analyses.

References

P. W. Adriaans. 1999. Learning shallow context-free languages under simple distributions. Technical report, Stanford University.

G. Carroll and E. Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. In Workshop Notes for Statistically-Based NLP Techniques, pages 1-13.

A. Clark and F. Thollard. 2005. PAC-learnability of probabilistic deterministic finite state automata. JMLR, 5:473-497.

A. Clark. 2001. Unsupervised induction of stochastic context free grammars with distributional clustering. In CoNLL.

S. Dasgupta and L. Schulman. 2007. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. JMLR, 8.

S. Dasgupta. 1999. Learning mixtures of Gaussians. In FOCS.

J. Feldman, R. O'Donnell, and R. A. Servedio. 2005. Learning mixtures of product distributions over discrete domains. In FOCS, pages 501-510.

S. Goldwater and T. Griffiths. 2007. A fully Bayesian approach to unsupervised part-of-speech tagging. In ACL.

T. Grenager, D. Klein, and C. D. Manning. 2005. Unsupervised learning of field segmentation models for information extraction. In ACL.

A. Haghighi and D. Klein. 2006. Prototype-based grammar induction. In ACL.

M. Johnson. 2007. Why doesn't EM find good HMM POS-taggers? In EMNLP/CoNLL.

D. Klein and C. D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In ACL.

H. W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97.

K. Kurihara and T. Sato. 2006. Variational Bayesian grammar induction for natural language. In International Colloquium on Grammatical Inference.

B. Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20:155-171.

F. Pereira and Y. Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In ACL.

D. Ron, Y. Singer, and N. Tishby. 1998. On the learnability and usage of acyclic probabilistic finite automata. Journal of Computer and System Sciences, 56:133-152.

N. Smith and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In ACL.

N. Smith and J. Eisner. 2006. Annealing structural bias in multilingual weighted grammar induction. In ACL.

N. Srebro, G. Shakhnarovich, and S. Roweis. 2006. An investigation of computational and informational limits in Gaussian mixture clustering. In ICML, pages 865-872.