Online EM for Unsupervised Models

Percy Liang    Dan Klein
Computer Science Division, EECS Department
University of California at Berkeley
Berkeley, CA 94720
{pliang,klein}@cs.berkeley.edu

Abstract

The (batch) EM algorithm plays an important role in unsupervised induction, but it sometimes suffers from slow convergence. In this paper, we show that online variants (1) provide significant speedups and (2) can even find better solutions than those found by batch EM. We support these findings on four unsupervised tasks: part-of-speech tagging, document classification, word segmentation, and word alignment.

1 Introduction

In unsupervised NLP tasks such as tagging, parsing, and alignment, one wishes to induce latent linguistic structures from raw text. Probabilistic modeling has emerged as a dominant paradigm for these problems, and the EM algorithm has been a driving force for learning models in a simple and intuitive manner.

However, on some tasks, EM can converge slowly. For instance, on unsupervised part-of-speech tagging, EM requires over 100 iterations to reach its peak performance on the Wall Street Journal (Johnson, 2007). The slowness of EM is mainly due to its batch nature: Parameters are updated only once after each pass through the data. When parameter estimates are still rough or if there is high redundancy in the data, computing statistics on the entire dataset just to make one update can be wasteful.

In this paper, we investigate two flavors of online EM--incremental EM (Neal and Hinton, 1998) and stepwise EM (Sato and Ishii, 2000; Cappé and Moulines, 2009), both of which involve updating parameters after each example or after a mini-batch (subset) of examples. Online algorithms have the potential to speed up learning by making updates more frequently. However, these updates can be seen as noisy approximations to the full batch update, and this noise can in fact impede learning. This tradeoff between speed and stability is familiar to online algorithms for convex supervised learning problems--e.g., Perceptron, MIRA, stochastic gradient, etc. Unsupervised learning raises two additional issues: (1) since the EM objective is nonconvex, we often get convergence to different local optima of varying quality; and (2) we evaluate on accuracy metrics which are at best loosely correlated with the EM likelihood objective (Liang and Klein, 2008). We will see that these issues can lead to surprising results.

In Section 4, we present a thorough investigation of online EM, mostly focusing on stepwise EM since it dominates incremental EM. For stepwise EM, we find that choosing a good stepsize and mini-batch size is important but can fortunately be done adequately without supervision. With a proper choice, stepwise EM reaches the same performance as batch EM, but much more quickly. Moreover, it can even surpass the performance of batch EM. Our results are particularly striking on part-of-speech tagging: Batch EM crawls to an accuracy of 57.3% after 100 iterations, whereas stepwise EM shoots up to 65.4% after just two iterations.

2 Tasks, models, and datasets

In this paper, we focus on unsupervised induction via probabilistic modeling. In particular, we define a probabilistic model p(x, z; θ) of the input x (e.g., a sentence) and hidden output z (e.g., a parse tree) with parameters θ (e.g., rule probabilities).
Given a set of unlabeled examples x^(1), ..., x^(n), the standard training objective is to maximize the marginal log-likelihood of these examples:

  ℓ(θ) = Σ_{i=1}^n log p(x^(i); θ).        (1)

A trained model θ̂ is then evaluated on the accuracy of its predictions: argmax_z p(z | x^(i); θ̂) against the true output z^(i); the exact evaluation metric depends on the task. What makes unsupervised induction hard at best and ill-defined at worst is that the training objective (1) does not depend on the true outputs at all.

We ran experiments on four tasks described below. Two of these tasks--part-of-speech tagging and document classification--are "clustering" tasks. For these, the output z consists of labels; for evaluation, we map each predicted label to the true label that maximizes accuracy. The other two tasks--segmentation and alignment--only involve unlabeled combinatorial structures, which can be evaluated directly.

Part-of-speech tagging  For each sentence x = (x_1, ..., x_ℓ), represented as a sequence of words, we wish to predict the corresponding sequence of part-of-speech (POS) tags z = (z_1, ..., z_ℓ). We used a simple bigram HMM trained on the Wall Street Journal (WSJ) portion of the Penn Treebank (49208 sentences, 45 tags). No tagging dictionary was used. We evaluated using per-position accuracy.

Document classification  For each document x = (x_1, ..., x_ℓ) consisting of ℓ words,¹ we wish to predict the document class z ∈ {1, ..., 20}. Each document x is modeled as a bag of words drawn independently given the class z. We used the 20 Newsgroups dataset (18828 documents, 20 classes). We evaluated on class accuracy.

¹ We removed the 50 most common words and words that occurred fewer than 5 times.

Word segmentation  For each sentence x = (x_1, ..., x_ℓ), represented as a sequence of English phonemes or Chinese characters without spaces separating the words, we would like to predict a segmentation of the sequence into words z = (z_1, ..., z_{|z|}), where each segment (word) z_i is a contiguous subsequence of 1, ..., ℓ. Since the naïve unigram model has a degenerate maximum likelihood solution that makes each sentence a separate word, we incorporate a penalty for longer segments:

  p(x, z; θ) ∝ Π_{k=1}^{|z|} p(x_{z_k}; θ) e^{−|z_k|^β},

where β > 1 determines the strength of the penalty (a decoding sketch for this model appears at the end of this section). For English, we used β = 1.6; for Chinese, β = 2.5. To speed up inference, we restricted the maximum segment length to 10 for English and 5 for Chinese. We applied this model to the Bernstein-Ratner corpus from the CHILDES database used in Goldwater et al. (2006) (9790 sentences) and the Academia Sinica (AS) corpus from the first SIGHAN Chinese word segmentation bakeoff (we used the first 100K sentences). We evaluated using F1 on word tokens. To the best of our knowledge, our penalized unigram model is new and actually beats the more complicated model of Johnson (2008) 83.5% to 78%, which had been the best published result on this task.

Word alignment  For each pair of translated sentences x = (e_1, ..., e_{n_e}, f_1, ..., f_{n_f}), we wish to predict the word alignments z ∈ {0, 1}^{n_e × n_f}. We trained two IBM model 1s using agreement-based learning (Liang et al., 2008). We used the first 30K sentence pairs of the English-French Hansards data from the NAACL 2003 Shared Task, 447+37 of which were hand-aligned (Och and Ney, 2003). We evaluated using the standard alignment error rate (AER).
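To make the penalized unigram segmentation model above concrete, the following is a small Python sketch of Viterbi-style decoding under it; this is our own illustration, not code from the paper, and training with EM would instead compute expected counts over segmentations with an analogous dynamic program. The function name, the fallback log-probability for unseen words, and the fixed word-probability table are assumptions of the sketch.

def best_segmentation(chars, word_logprob, beta=1.6, max_len=10):
    # Find argmax_z of  sum_k [ log p(x_{z_k}; theta) - |z_k|**beta ],
    # i.e. the highest-scoring segmentation under the penalized unigram model.
    # chars: the sentence as a sequence of phonemes/characters (no spaces)
    # word_logprob: dict mapping candidate words to log p(word; theta)
    UNK = -1e9  # crude fallback for words not in the table (assumption of this sketch)
    n = len(chars)
    best = [float("-inf")] * (n + 1)   # best[j]: best score of any segmentation of chars[:j]
    back = [0] * (n + 1)               # back[j]: start index of the last segment ending at j
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = "".join(chars[i:j])
            score = best[i] + word_logprob.get(word, UNK) - (j - i) ** beta
            if score > best[j]:
                best[j], back[j] = score, i
    segments, j = [], n                # recover the segmentation from the backpointers
    while j > 0:
        segments.append("".join(chars[back[j]:j]))
        j = back[j]
    return list(reversed(segments)), best[n]

For instance, with the toy table {"the": -1.0, "dog": -2.0, "thedog": -2.5} and beta = 1.6, the superlinear length penalty makes the split "the dog" score higher than the single segment "thedog", even though the latter has higher unigram probability (-2.5 vs. -3.0 in total).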
3 EM algorithms

Given a probabilistic model p(x, z; θ) and unlabeled examples x^(1), ..., x^(n), recall that we would like to maximize the marginal likelihood of the data (1). Let φ(x, z) denote a mapping from a fully-labeled example (x, z) to a vector of sufficient statistics (counts in the case of multinomials) for the model. For example, one component of this vector for HMMs would be the number of times state 7 emits the word "house" in sentence x with state sequence z. Given a vector of sufficient statistics μ, let θ(μ) denote the maximum likelihood estimate. In our case, θ(μ) simply consists of probabilities obtained by normalizing each block of counts. This closed-form solution is one of the features that makes EM (both batch and online) attractive.

Batch EM
  μ ← initialization
  for each iteration t = 1, ..., T:
    μ′ ← 0
    for each example i = 1, ..., n:
      s′_i ← Σ_z p(z | x^(i); θ(μ)) φ(x^(i), z)    [inference]
      μ′ ← μ′ + s′_i                               [accumulate new]
    μ ← μ′                                         [replace old with new]

Incremental EM (iEM)
  s_i ← initialization for i = 1, ..., n
  μ ← Σ_{i=1}^n s_i
  for each iteration t = 1, ..., T:
    for each example i = 1, ..., n in random order:
      s′_i ← Σ_z p(z | x^(i); θ(μ)) φ(x^(i), z)    [inference]
      μ ← μ + s′_i − s_i ;  s_i ← s′_i             [replace old with new]

Stepwise EM (sEM)
  μ ← initialization;  k ← 0
  for each iteration t = 1, ..., T:
    for each example i = 1, ..., n in random order:
      s′_i ← Σ_z p(z | x^(i); θ(μ)) φ(x^(i), z)    [inference]
      μ ← (1 − η_k) μ + η_k s′_i ;  k ← k + 1      [towards new]

3.1 Batch EM

In the (batch) EM algorithm, we alternate between the E-step and the M-step. In the E-step, we compute the expected sufficient statistics μ′ across all the examples based on the posterior over z under the current parameters θ(μ). In all our models, this step can be done via a dynamic program (for example, forward-backward for POS tagging). In the M-step, we use these sufficient statistics μ′ to re-estimate the parameters. Since the M-step is trivial, we represent it implicitly by θ(·) in order to concentrate on the computation of the sufficient statistics. This focus will be important for online EM, so writing batch EM in this way accentuates the parallel between batch and online.

3.2 Online EM

To obtain an online EM algorithm, we store a single set of sufficient statistics μ and update it after processing each example. For the i-th example, we compute sufficient statistics s′_i. There are two main variants of online EM algorithms, which differ in exactly how the new s′_i is incorporated into μ. The first is incremental EM (iEM) (Neal and Hinton, 1998), in which we not only keep track of μ but also the sufficient statistics s_1, ..., s_n for each example (μ = Σ_{i=1}^n s_i). When we process example i, we subtract out the old s_i and add the new s′_i. Sato and Ishii (2000) developed another variant, later generalized by Cappé and Moulines (2009), which we call stepwise EM (sEM). In sEM, we interpolate between μ and s′_i based on a stepsize η_k (k is the number of updates made to μ so far).

The two algorithms are motivated in different ways. Recall that the log-likelihood can be lower bounded as follows (Neal and Hinton, 1998):

  ℓ(θ) ≥ L(q_1, ..., q_n, θ) := Σ_{i=1}^n [ Σ_z q_i(z | x^(i)) log p(x^(i), z; θ) + H(q_i) ],    (2)

where H(q_i) is the entropy of the distribution q_i(z | x^(i)). Batch EM alternates between optimizing L with respect to q_1, ..., q_n in the E-step (represented implicitly via the sufficient statistics μ′) and with respect to θ in the M-step. Incremental EM alternates between optimizing with respect to a single q_i and θ.
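To make the sEM update concrete, here is a minimal Python sketch (our own illustration, not the implementation used in the paper) of stepwise EM for a toy mixture-of-unigrams document model, where the sufficient statistics μ are class counts and class-word counts and θ(μ) just normalizes each block. The function name, the random initialization, and the stepsize schedule η_k = (k + 2)^(−α) (discussed below) are choices made for this sketch.

import numpy as np

def stepwise_em(docs, num_classes, vocab_size, T=5, alpha=0.7, seed=0):
    # Stepwise EM (sEM) for a toy mixture-of-unigrams model.
    # docs: list of documents, each a list of word ids in range(vocab_size).
    rng = np.random.RandomState(seed)
    mu_class = rng.rand(num_classes) + 1.0              # random init breaks symmetry
    mu_word = rng.rand(num_classes, vocab_size) + 1.0
    k = 0
    for t in range(T):
        for i in rng.permutation(len(docs)):
            x = docs[i]
            # theta(mu): normalize each block of counts
            log_prior = np.log(mu_class / mu_class.sum())
            log_cond = np.log(mu_word / mu_word.sum(axis=1, keepdims=True))
            # E-step on one example: posterior over the class z
            log_post = log_prior + log_cond[:, x].sum(axis=1)
            post = np.exp(log_post - log_post.max())
            post /= post.sum()
            # s'_i: expected sufficient statistics for this example
            s_class = post
            s_word = np.zeros((num_classes, vocab_size))
            for w in x:
                s_word[:, w] += post
            # stepwise update: mu <- (1 - eta_k) mu + eta_k s'_i
            eta = (k + 2) ** (-alpha)
            mu_class = (1 - eta) * mu_class + eta * s_class
            mu_word = (1 - eta) * mu_word + eta * s_word
            k += 1
    return mu_class, mu_word

Accumulating the per-example statistics into μ′ and only replacing μ after the full pass would recover batch EM; this is the sense in which the algorithms differ only in how s′_i is folded into μ.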
Stepwise EM is motivated from the stochastic approximation literature, where we think of approximating the update μ′ in batch EM with a single sample s′_i. Since one sample is a bad approximation, we interpolate between s′_i and the current μ. Thus, sEM can be seen as stochastic gradient in the space of sufficient statistics.

Stepsize reduction power α  Stepwise EM leaves open the choice of the stepsize η_k. Standard results from the stochastic approximation literature state that Σ_{k=0}^∞ η_k = ∞ and Σ_{k=0}^∞ η_k² < ∞ are sufficient to guarantee convergence to a local optimum. In particular, if we take η_k = (k + 2)^(−α), then any 0.5 < α ≤ 1 is valid. The smaller the α, the larger the updates, and the more quickly we forget (decay) our old sufficient statistics. This can lead to swift progress but also generates instability.

Mini-batch size m  We can add some stability to sEM by updating on multiple examples at once instead of just one. In particular, we partition the n examples into mini-batches of size m and run sEM, treating each mini-batch as a single example. Formally, for each i = 0, m, 2m, 3m, ..., we first compute the sufficient statistics s′_{i+1}, ..., s′_{i+m} on x^(i+1), ..., x^(i+m) and then update μ using s′_{i+1} + ··· + s′_{i+m}. The larger the m, the less frequent the updates, but the more stable they are. In this way, mini-batches interpolate between a pure online (m = 1) and a pure batch (m = n) algorithm.²

Fast implementation  Due to sparsity in NLP, the sufficient statistics s′_i of an example are nonzero for only a small fraction of their components. For iEM, the time required to update μ with s′_i depends only on the number of nonzero components of s′_i. However, the sEM update is μ ← (1 − η_k)μ + η_k s′_i, and a naïve implementation would take time proportional to the total number of components. The key to a more efficient solution is to note that θ(μ) is invariant to scaling of μ. Therefore, we can store S = μ / Π_{j≤k}(1 − η_j), so that each update reduces to a single scalar rescaling plus a sparse addition to S.
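As an illustration of this lazy-rescaling trick (our own sketch, with hypothetical class and method names, not code from the paper), μ can be stored up to a global scale factor so that only the nonzero components of s′_i are touched, and θ(μ) can be computed from the stored table directly since a global scale cancels when each block is normalized:

from collections import defaultdict

class SparseStepwiseStats:
    # Store mu up to a global scale: mu = scale * table.
    # The update  mu <- (1 - eta) * mu + eta * s'_i  becomes
    #   scale <- (1 - eta) * scale
    #   table[key] <- table[key] + eta * value / scale   (only nonzero entries of s'_i)
    # Since theta(mu) is invariant to scaling of mu, the M-step can normalize
    # `table` directly; `scale` is only needed to weight future updates.
    def __init__(self):
        self.table = defaultdict(float)   # rescaled sufficient statistics S
        self.scale = 1.0

    def update(self, sparse_stats, eta):
        # sparse_stats: dict {component: expected count} for one example (s'_i)
        self.scale *= (1.0 - eta)
        for key, value in sparse_stats.items():
            self.table[key] += eta * value / self.scale

    def get(self, key):
        # actual value of a component of mu (rarely needed)
        return self.scale * self.table[key]

One practical caveat of this sketch: the scale Π_j(1 − η_j) shrinks towards zero as updates accumulate, so a real implementation would periodically fold the scale back into the table to avoid numerical underflow.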