Improving A Simple Bigram HMM Part-of-Speech Tagger by Latent Annotation and Self-Training

Zhongqiang Huang, Vladimir Eidelman, Mary Harper
Laboratory for Computational Linguistics and Information Processing
Institute for Advanced Computer Studies, University of Maryland, College Park
Human Language Technology Center of Excellence, Johns Hopkins University
{zqhuang,vlad,mharper}@umiacs.umd.edu

Abstract

In this paper, we describe and evaluate a bigram part-of-speech (POS) tagger that uses latent annotations, and then investigate using additional genre-matched unlabeled data for self-training the tagger. The use of latent annotations substantially improves the performance of a baseline HMM bigram tagger, outperforming a trigram HMM tagger with sophisticated smoothing. The performance of the latent tagger is further enhanced by self-training with a large set of unlabeled data, even in situations where standard bigram or trigram taggers do not benefit from self-training when trained on greater amounts of labeled training data. Our best model obtains a state-of-the-art Chinese tagging accuracy of 94.78% when evaluated on a representative test set of the Penn Chinese Treebank 6.0.

1 Introduction

Part-of-speech (POS) tagging, the process of assigning every word in a sentence a POS tag (e.g., NN (common noun) or JJ (adjective)), is a prerequisite for many advanced natural language processing tasks. Building upon the large body of research on improving tagging performance for various languages using various models (e.g., (Thede and Harper, 1999; Brants, 2000; Tseng et al., 2005b; Huang et al., 2007)) and the recent work on PCFG grammars with latent annotations (Matsuzaki et al., 2005; Petrov et al., 2006), we investigate the use of fine-grained latent annotations for Chinese POS tagging. While state-of-the-art tagging systems have achieved accuracies above 97% on English, Chinese POS tagging (Tseng et al., 2005b; Huang et al., 2007) has proven to be more challenging, and it is the focus of this study.

The value of the latent variable approach for tagging is that it can learn more fine-grained tags to better model the training data. Liang and Klein (2008) analyzed the errors of unsupervised learning using EM and found that both estimation and optimization errors decrease as the amount of unlabeled data increases. In our case, the learning of latent annotations through EM may likewise benefit from a large set of automatically labeled data to improve tagging performance. Semi-supervised, self-labeled data has been used effectively to train acoustic models for speech recognition (Ma and Schwartz, 2008); however, early investigations of self-training for POS tagging had mixed outcomes. Clark et al. (2003) reported positive results with little labeled training data but negative results as the amount of labeled training data increased. Wang et al. (2007) reported that self-training improves a trigram tagger's accuracy, but that tagger was trained with only a small amount of in-domain labeled data.

In this paper, we investigate whether the performance of a simple bigram HMM tagger can be improved by introducing latent annotations and whether self-training can improve it further. To the best of our knowledge, this is the first attempt to use latent annotations together with self-training to enhance the performance of a POS tagger.
2 Model

POS tagging using a hidden Markov model can be considered an instance of Bayesian inference, wherein we observe a sequence of words and need to assign them the most likely sequence of POS tags. Let $t_1^i$ denote the tag sequence $t_1, \dots, t_i$ and $w_1^i$ denote the word sequence $w_1, \dots, w_i$. Given the first-order Markov assumption of a bigram tagger, the best tag sequence $\tau(w_1^n)$ for a sentence $w_1^n$ can be computed efficiently as (we assume that boundary symbols exist implicitly):

$$\tau(w_1^n) = \arg\max_{t_1^n} p(t_1^n \mid w_1^n) = \arg\max_{t_1^n} \prod_i p(t_i \mid t_{i-1})\, p(w_i \mid t_i)$$

with a set of transition parameters $\{p(b \mid a)\}$, for transitioning to tag $b$ from tag $a$, and a set of emission parameters $\{p(w \mid a)\}$, for generating word $w$ from tag $a$. A simple HMM tagger is trained by pulling counts from labeled data and normalizing them to obtain the conditional probabilities.

It is well known that the independence assumption of a bigram tagger is too strong in many cases. A common practice for weakening the independence assumption is to use a second-order Markov assumption, i.e., a trigram tagger. This is similar to explicitly annotating each POS tag with the preceding tag. Rather than using explicit annotation, we could use latent annotations to split the POS tags, similarly to the introduction of latent annotations to PCFG grammars (Matsuzaki et al., 2005; Petrov et al., 2006). For example, the NR tag may be split into NR-1 and NR-2, and correspondingly the POS tag sequence of "Mr./NR Smith/NR saw/VV Ms./NR Smith/NR" could be refined as "Mr./NR-2 Smith/NR-1 saw/VV-2 Ms./NR-2 Smith/NR-1".

The objective of training a bigram tagger with latent annotations is to find the transition and emission probabilities associated with the latent tags such that the likelihood of the training data is maximized. Unlike training a standard bigram tagger, where the POS tags are observed, in the latent case the latent tags are not observable, so a variant of the EM algorithm is used to estimate the parameters. Given a sentence $w_1^n$ and its tag sequence $t_1^n$, consider the $i$-th word $w_i$ and its latent tag $a_x \in a = t_i$ (which means $a_x$ is a latent tag of tag $a$, the $i$-th tag in the sequence), and the $(i{+}1)$-th word $w_{i+1}$ and its latent tag $b_y \in b = t_{i+1}$. The forward, $\alpha_{i+1}(b_y) = p(w_1^{i+1}, b_y)$, and backward, $\beta_i(a_x) = p(w_{i+1}^n \mid a_x)$, probabilities can be computed recursively:

$$\alpha_{i+1}(b_y) = \sum_x \alpha_i(a_x)\, p(b_y \mid a_x)\, p(w_{i+1} \mid b_y)$$
$$\beta_i(a_x) = \sum_y p(b_y \mid a_x)\, p(w_{i+1} \mid b_y)\, \beta_{i+1}(b_y)$$

In the E step, the posterior probabilities of co-occurrence events can be computed as:

$$p(a_x, w_i \mid w) \propto \alpha_i(a_x)\, \beta_i(a_x)$$
$$p(a_x, b_y \mid w) \propto \alpha_i(a_x)\, p(b_y \mid a_x)\, p(w_{i+1} \mid b_y)\, \beta_{i+1}(b_y)$$

In the M step, the above posterior probabilities are used as weighted observations to update the transition and emission probabilities, where $c(\cdot)$ represents the (fractional) count of an event:

$$p(b_y \mid a_x) = c(a_x, b_y) \Big/ \sum_{b_y} c(a_x, b_y), \qquad p(w \mid a_x) = c(a_x, w) \Big/ \sum_w c(a_x, w)$$
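To make the E and M steps concrete, the following is a minimal sketch of one EM iteration over a set of tagged sentences. It is our illustration, not the authors' implementation: the function name, data structures, and the assumption that every needed transition and emission entry is already initialized (e.g., uniformly after a tag split) are ours.

```python
from collections import defaultdict

def em_iteration(sentences, latents, p_trans, p_emit):
    """One EM iteration for a latent-annotation bigram HMM.

    sentences: list of [(word, tag), ...] with observed coarse tags
    latents:   dict mapping a coarse tag to its latent tags,
               e.g. latents['NR'] == ['NR-1', 'NR-2']
    p_trans:   dict (a_x, b_y) -> p(b_y | a_x)
    p_emit:    dict (a_x, w)   -> p(w | a_x)
    Returns the re-estimated (p_trans, p_emit).
    """
    c_trans = defaultdict(float)  # fractional counts c(a_x, b_y)
    c_emit = defaultdict(float)   # fractional counts c(a_x, w)

    for sent in sentences:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        n = len(sent)

        # Forward pass: alpha[i][a_x] = p(w_1..w_i, a_x).  The coarse tags
        # are observed, so position i ranges only over latents[tags[i]].
        alpha = [{} for _ in range(n)]
        for ax in latents[tags[0]]:
            alpha[0][ax] = p_emit[ax, words[0]]
        for i in range(n - 1):
            for by in latents[tags[i + 1]]:
                alpha[i + 1][by] = p_emit[by, words[i + 1]] * sum(
                    alpha[i][ax] * p_trans[ax, by] for ax in latents[tags[i]])

        # Backward pass: beta[i][a_x] = p(w_{i+1}..w_n | a_x).
        beta = [{} for _ in range(n)]
        for ax in latents[tags[n - 1]]:
            beta[n - 1][ax] = 1.0
        for i in range(n - 2, -1, -1):
            for ax in latents[tags[i]]:
                beta[i][ax] = sum(
                    p_trans[ax, by] * p_emit[by, words[i + 1]] * beta[i + 1][by]
                    for by in latents[tags[i + 1]])

        # Sentence likelihood p(w_1^n), used to normalize the posteriors.
        z = sum(alpha[n - 1][ax] * beta[n - 1][ax] for ax in latents[tags[n - 1]])

        # E step: accumulate posterior counts of emission and transition events.
        for i in range(n):
            for ax in latents[tags[i]]:
                c_emit[ax, words[i]] += alpha[i][ax] * beta[i][ax] / z
        for i in range(n - 1):
            for ax in latents[tags[i]]:
                for by in latents[tags[i + 1]]:
                    c_trans[ax, by] += (alpha[i][ax] * p_trans[ax, by] *
                                        p_emit[by, words[i + 1]] *
                                        beta[i + 1][by] / z)

    # M step: normalize fractional counts into conditional probabilities.
    trans_tot, emit_tot = defaultdict(float), defaultdict(float)
    for (ax, by), c in c_trans.items():
        trans_tot[ax] += c
    for (ax, w), c in c_emit.items():
        emit_tot[ax] += c
    new_trans = {(ax, by): c / trans_tot[ax] for (ax, by), c in c_trans.items()}
    new_emit = {(ax, w): c / emit_tot[ax] for (ax, w), c in c_emit.items()}
    return new_trans, new_emit
```

Because the coarse tags are observed in the labeled data, the dynamic program sums only over the latent refinements of each observed tag, which keeps each iteration inexpensive.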
A hierarchical split-and-merge method, similar to (Petrov et al., 2006), is used to gradually increase the number of latent annotations while allocating them adaptively to the places where they would produce the greatest increase in training likelihood. For example, we observe heavy splitting in categories such as NN (common noun) and VV (verb), which cover a wide variety of words, but only minimal splitting in categories like IJ (interjection) and ON (onomatopoeia). Whereas tag transition events are frequent, allowing extensive optimization using EM, word-tag co-occurrences are sparser and more likely to suffer from over-fitting. To handle this problem, we map all words with frequency less than a threshold $\theta$ (tuned on the development set) to the symbol unk, and for each latent tag we accumulate the word-tag statistics of these rare words into

$$c_r(a_x, \text{unk}) = \sum_{w : c(w) < \theta} c(a_x, w)$$

These statistics are redistributed among the rare words ($w : c(w) < \theta$) to compute their emission probabilities:

$$c(a_x, w) = c_r(a_x, \text{unk}) \cdot c(a, w) / c_r(a, \text{unk})$$
$$p(w \mid a_x) = c(a_x, w) \Big/ \sum_w c(a_x, w)$$

The impact of this rare word handling method is investigated in Section 3; a sketch of the redistribution follows.
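Concretely, the redistribution can be implemented as below. The naming conventions are ours (latent tags written as 'NR-2', with the coarse tag recovered by stripping the suffix), and the function is an illustration of the two equations above under those assumptions, not the authors' code.

```python
from collections import defaultdict

def smooth_rare_emissions(c_emit, theta):
    """Redistribute rare-word emission counts among each tag's latent tags.

    c_emit: dict (latent_tag, word) -> fractional count c(a_x, w) from EM
    theta:  rare-word frequency threshold (tuned on the development set)
    Returns smoothed emission probabilities p(w | a_x).
    """
    coarse = lambda ax: ax.rsplit('-', 1)[0]  # 'NR-2' -> 'NR'

    # Word frequencies c(w), summed over all latent tags.
    c_word = defaultdict(float)
    for (ax, w), c in c_emit.items():
        c_word[w] += c

    # c_r(a_x, unk) per latent tag, plus the same statistics aggregated
    # at the coarse-tag level: c_r(a, unk) and c(a, w).
    cr = defaultdict(float)
    cr_coarse = defaultdict(float)
    c_coarse = defaultdict(float)
    for (ax, w), c in c_emit.items():
        if c_word[w] < theta:
            cr[ax] += c
            cr_coarse[coarse(ax)] += c
            c_coarse[coarse(ax), w] += c

    # Each rare word receives a share of c_r(a_x, unk) proportional to its
    # coarse-tag count: c(a_x, w) = c_r(a_x, unk) * c(a, w) / c_r(a, unk).
    smoothed = dict(c_emit)
    for (ax, w) in c_emit:
        a = coarse(ax)
        if c_word[w] < theta and cr_coarse[a] > 0:
            smoothed[ax, w] = cr[ax] * c_coarse[a, w] / cr_coarse[a]

    # Normalize to obtain p(w | a_x) = c(a_x, w) / sum_w c(a_x, w).
    total = defaultdict(float)
    for (ax, w), c in smoothed.items():
        total[ax] += c
    return {(ax, w): c / total[ax] for (ax, w), c in smoothed.items()}
```

Tying the rare words of a coarse tag together in this way prevents EM from committing a rarely seen word to a single latent tag on scant evidence.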
A character-based unknown word model, similar to the one described in (Huang et al., 2007), is used to handle unknown Chinese words during tagging. Sentences are tagged with our model using a decoding method similar to the max-rule-product method of (Petrov and Klein, 2007).

3 Experiments

The Penn Chinese Treebank 6.0 (CTB6) (Xue et al., 2005) is used as the labeled data in our study. CTB6 contains news articles, which are used as the primary source of labeled data in our experiments, as well as broadcast news transcriptions. Since the news articles were collected during different time periods from different sources with a diversity of topics, in order to obtain a representative train-development-test split, we divide them into blocks of 10 files in sorted order and, for each block, use the first file for development, the second for test, and the remaining files for training. The broadcast news data exhibits many of the characteristics of newswire text (it contains many nonverbal expressions, e.g., numbers and symbols, and is fully punctuated) and so is also included in the training data set. We also utilize a much larger set of unlabeled sentences in the self-training experiments. They are selected from sources similar to the newswire articles, and are normalized (Zhang and Kahn, 2008) and word segmented (Tseng et al., 2005a). See Table 1 for a summary of the data used.

             Sentences       Words
  Train         24,416     678,811
  Dev            1,904      51,229
  Test           1,975      52,861
  Unlabeled    210,000   6,254,947

Table 1: The number of sentences and words in the data.

[Figure 1: The learning curves of the bigram tagger with latent annotations on the development set (token accuracy vs. number of latent annotations, 50-400; curves for Bigram+LA:1, Bigram+LA:2, and the trigram baseline).]

Figure 1 plots the learning curves of two bigram taggers with latent annotations (Bigram+LA:2 uses the special handling of rare words described in Section 2, while Bigram+LA:1 does not) and compares their performance with a state-of-the-art trigram HMM tagger (Huang et al., 2007) that uses trigram transition and emission models together with bidirectional decoding. Both bigram taggers initially have much lower tagging accuracy than the trigram tagger, due to their strong but invalid independence assumption. As the number of latent annotations increases, the bigram taggers are able to learn more from the context based on the latent annotations, and their performance improves significantly, outperforming the trigram tagger. The performance gap between the two bigram taggers suggests that over-fitting occurs in the word emission model when more latent annotations are available for optimization; sharing statistics among rare words alleviates some of the sparseness while supporting the modeling of deeper dependencies among more frequent events. In the later experiments, we use Bigram+LA to denote the Bigram+LA:2 tagger.

Figure 2 compares the self-training capability of three models (the bigram tagger with and without latent annotations, and the aforementioned trigram tagger) using different sizes of labeled training data and the full set of unlabeled data. For each model, a tagger is first trained on the allocated labeled training data and is then used to tag the unlabeled data. A new tagger is then trained on the combination of the allocated labeled training data and the newly automatically labeled data. (We always balance the size of the manually and automatically labeled data, through duplication for the trigram tagger or posterior weighting for the bigram taggers with or without latent annotations, as this provides superior performance.) The procedure is sketched below.
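One round of the self-training loop can be summarized by the following sketch, where `train` and `tag` stand in for whichever tagger is being evaluated, and the uniform down-weighting is our simplification of the duplication and posterior-weighting schemes noted above.

```python
def self_train(labeled, unlabeled, train, tag):
    """One round of self-training, as in the Figure 2 experiments.

    labeled:   gold-tagged sentences
    unlabeled: untagged sentences
    train:     function taking (sentence, weight) pairs -> tagger model
    tag:       function taking (model, sentence) -> tagged sentence
    """
    # 1. Train an initial tagger on the labeled data alone.
    model = train([(sent, 1.0) for sent in labeled])

    # 2. Use it to tag the unlabeled data.
    auto = [tag(model, sent) for sent in unlabeled]

    # 3. Retrain on the combination.  The automatically labeled data is
    #    down-weighted so its total weight matches the manual data, a
    #    stand-in for the balancing described above.
    w = len(labeled) / len(auto)
    return train([(s, 1.0) for s in labeled] + [(s, w) for s in auto])
```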
[Figure 2: The performance of the three taggers on the development set, before and after self-training, as a function of the fraction of CTB6 labeled training data (curves for Bigram, Bigram+ST, Trigram, Trigram+ST, Bigram+LA, and Bigram+LA+ST).]

There are two interesting observations that distinguish the bigram tagger with latent annotations from the other two taggers. First, although all of the taggers improve as more labeled training data becomes available, the performance gap between the bigram tagger with latent annotations and the other two taggers also increases. This is because more latent annotations can be used to take advantage of the additional training data to learn deeper dependencies. Second, the bigram tagger with latent annotations benefits much more from self-training, even though it already has the highest performance among the three taggers before self-training. The bigram tagger without latent annotations benefits little from self-training: except for a slight improvement when there is only a small amount of labeled training data, self-training slightly hurts tagging performance as the amount of labeled data increases. The trigram tagger benefits from self-training initially but eventually exhibits a pattern similar to the bigram tagger's when trained on the full labeled set. The performance of the latent bigram tagger, in contrast, improves consistently with self-training. Although the gain decreases for models trained on larger training sets, since stronger models are harder to improve, self-training still contributes significantly to model accuracy.

The final tagging performance on the test set is reported in Table 2. All of the improvements are statistically significant (p < 0.005).

  Tagger          Token Accuracy (%)
  Bigram                 92.25
  Trigram                93.99
  Bigram+LA              94.53
  Bigram+LA+ST           94.78

Table 2: The performance of the taggers on the test set.

It is worth mentioning that we initially added latent annotations to a trigram tagger, rather than a bigram tagger, in order to build from a stronger starting point; however, this did not work well. A trigram tagger requires sophisticated smoothing to handle data sparsity, and introducing latent annotations exacerbates the sparsity problem, especially for trigram word emissions. Moreover, the uniform extension of a bigram tagger to a trigram tagger ignores whether the additional context is helpful for a given tag and supported by enough data, and it cannot exploit context beyond the previous two tags. In contrast, the bigram tagger with latent annotations is able to learn different granularities for tags based on the training data.

4 Conclusion

In this paper, we showed that the accuracy of a simple bigram HMM tagger can be substantially improved by introducing latent annotations together with proper handling of rare words. We also showed that this tagger is able to benefit from self-training, despite the fact that other models, such as standard bigram or trigram HMM taggers, do not. In future work, we will investigate automatic data selection methods to choose the materials that are most suitable for self-training, and we will evaluate the effect of the amount of automatically labeled data.

Acknowledgments

This work was supported by NSF IIS-0703859 and DARPA HR0011-06-C-0023 and HR0011-06-2-001. Any opinions, findings, and/or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

References

T. Brants. 2000. TnT: a statistical part-of-speech tagger. In ANLP.
S. Clark, J. R. Curran, and M. Osborne. 2003. Bootstrapping POS taggers using unlabelled data. In CoNLL.
Z. Huang, M. Harper, and W. Wang. 2007. Mandarin part-of-speech tagging and discriminative reranking. In EMNLP.
P. Liang and D. Klein. 2008. Analyzing the errors of unsupervised learning. In ACL.
J. Ma and R. Schwartz. 2008. Factors that affect unsupervised training of acoustic models. In Interspeech.
T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilistic CFG with latent annotations. In ACL.
S. Petrov and D. Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL.
S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In ACL.
S. M. Thede and M. P. Harper. 1999. A second-order hidden Markov model for part-of-speech tagging. In ACL.
H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning. 2005a. A conditional random field word segmenter. In SIGHAN Workshop on Chinese Language Processing.
H. Tseng, D. Jurafsky, and C. Manning. 2005b. Morphological features help POS tagging of unknown words across language varieties. In SIGHAN Workshop on Chinese Language Processing.
W. Wang, Z. Huang, and M. Harper. 2007. Semi-supervised learning for part-of-speech tagging of Mandarin transcribed speech. In ICASSP.
N. Xue, F. Xia, F. Chiou, and M. Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering.
B. Zhang and J. G. Kahn. 2008. Evaluation of Decatur text normalizer for language model training. Technical report, University of Washington.