Unsupervised Segmentation of Chinese Text by Use of Branching Entropy

Zhihui Jin and Kumiko Tanaka-Ishii
Graduate School of Information Science and Technology, University of Tokyo

Abstract

We propose an unsupervised segmentation method based on an assumption about language data: that the increasing point of entropy of successive characters is the location of a word boundary. A large-scale experiment was conducted by using 200 MB of unsegmented training data and 1 MB of test data, and precision of 90% was attained with recall being around 80%. Moreover, we found that the precision was stable at around 90% independently of the learning data size.

1 Introduction

The theme of this paper is the following assumption:

The uncertainty of tokens coming after a sequence helps determine whether a given position is at a boundary. (A)

Intuitively, as illustrated in Figure 1, the variety of successive tokens at each character inside a word monotonically decreases according to the offset length, because the longer the preceding character n-gram, the longer the preceding context and the more it restricts the appearance of possible next tokens. For example, it is easier to guess which character comes after "natura" than after "na". On the other hand, the uncertainty at the position of a word border becomes greater, and the complexity increases, as the position is out of context. With the same example, it is difficult to guess which character comes after "natural ". This suggests that a word border can be detected by focusing on the differentials of the uncertainty of branching.

Figure 1: Intuitive illustration of a variety of successive tokens and a word boundary

In this paper, we report our study on applying this assumption to Chinese word segmentation by formalizing the uncertainty of successive tokens via the branching entropy (which we mathematically define in the next section). Our intention in this paper is above all to study the fundamental and scientific statistical property underlying language data, so that it can be applied to language engineering.

The above assumption (A) dates back to the fundamental work of Harris (Harris, 1955), who observed that when the number of different tokens coming after every prefix of a word marks the maximum value, the location corresponds to a morpheme boundary. Recently, with the increasing availability of corpora, this property underlying language has been tested through segmentation into words and morphemes. Kempe (Kempe, 1999) reports a preliminary experiment to detect word borders in German and English texts by monitoring the entropy of successive characters for 4-grams. Also, the second author of this paper (Tanaka-Ishii, 2005) has shown how Japanese and Chinese can be segmented into words by formalizing the uncertainty with the branching entropy. Even though the test data in that work was limited to a small amount, the report suggested that assumption (A) holds better when each of the sequence elements forms a semantic unit. This motivated our work to conduct a further, larger-scale test in the Chinese language, which is the only human language consisting entirely of ideograms (i.e., semantic units). In this sense, the choice of Chinese as the language in our work is essential. If the assumption holds well, the most important and direct application is unsupervised text segmentation into words.
Many works in unsupervised segmentation so far could be interpreted as formulating assumption (A) in a similar sense: branching stays low inside words but increases at a word or morpheme border. None of these works, however, is directly based on (A), and they introduce other factors within their overall methodologies. Some works are based on in-word branching frequencies formulated in an original evaluation function, as in (Ando and Lee, 2000) (boundary precision = 84.5%, recall = 78.0%, tested on 12500 Japanese ideogram words). Sun et al. (Sun et al., 1998) use mutual information (boundary precision = 91.8%, no report for recall, 1588 Chinese characters), and Feng et al. (Feng et al., 2004) incorporate branching counts in the evaluation function to be optimized for obtaining boundaries (word precision = 76%, recall = 78%, 2000 sentences). From the performance results listed here, we can see that unsupervised segmentation is more difficult, by far, than supervised segmentation; therefore, the algorithms are complex, and previous studies have tended to be limited in terms of both the test corpus size and the target. In contrast, as assumption (A) is simple, we keep this simplicity in our formalization and directly test the assumption on a large-scale test corpus of 1001 KB of manually segmented data, with a training corpus consisting of 200 MB of Chinese text.

Chinese is such an important language that supervised segmentation methods are already very mature. The current state-of-the-art segmentation software developed by (Low et al., 2005), which ranks as the best in the SIGHAN bakeoff (Emerson, 2005), attains word precision and recall of 96.9% and 96.8%, respectively, on the PKU track. There is also free software such as (Zhang et al., 2003) whose performance is also high. Even then, as most supervised methods learn on manually segmented newspaper data, when the input text is not from newspapers, the performance can be insufficient. Given that the construction of learning data is costly, we believe the performance can be raised by combining the supervised and unsupervised methods.

Consequently, this paper verifies assumption (A) in a fundamental manner for Chinese text and addresses the questions of why and to what extent (A) holds when applying it to the Chinese word segmentation problem. We first formalize assumption (A) in a general manner.

2 The Assumption

Given a set of elements \chi and a set of n-gram sequences \chi_n formed of \chi, the conditional entropy of an element occurring after an n-gram sequence X_n is defined as

H(X \mid X_n) = -\sum_{x_n \in \chi_n} P(x_n) \sum_{x \in \chi} P(x \mid x_n) \log P(x \mid x_n),    (1)

where P(x) = P(X = x) indicates the probability of occurrence of x, and P(x|x_n) = P(X = x | X_n = x_n).

A well-known observation on language data states that H(X|X_n) decreases as n increases (Bell et al., 1990). For example, Figure 2 shows how H(X|X_n) shifts when n increases from 1 to 8 characters, where n is the length of a word prefix. This is calculated for all words existing in the test corpus, with the entropy being measured in the learning data (the learning and test data are defined in Section 4).

Figure 2: Decrease in H(X|X_n) for Chinese characters when n is increased

This phenomenon indicates that X becomes easier to estimate as the context X_n gets longer. This can be intuitively understood: it is easy to guess that "e" will follow after "Hello! How ar", but it is difficult to guess what comes after the short string "He".
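As a minimal sketch (not the implementation used in our experiments; the maximum-likelihood counts and the toy string are illustrative assumptions), equation (1) can be estimated from raw text as follows:

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(text, n):
    """Maximum-likelihood estimate of H(X | X_n) in eq. (1): the entropy of
    the next character given the preceding n-gram, averaged over n-grams."""
    ngram_counts = Counter()            # counts of each n-gram x_n
    next_counts = defaultdict(Counter)  # counts of each character x after x_n
    for i in range(len(text) - n):
        xn = text[i:i + n]
        ngram_counts[xn] += 1
        next_counts[xn][text[i + n]] += 1

    total = sum(ngram_counts.values())
    h = 0.0
    for xn, c_xn in ngram_counts.items():
        p_xn = c_xn / total
        # inner sum of eq. (1): -sum_x P(x | x_n) log P(x | x_n)
        local = -sum((c / c_xn) * math.log2(c / c_xn)
                     for c in next_counts[xn].values())
        h += p_xn * local
    return h

# On language data, H(X | X_n) decreases as n grows (cf. Figure 2).
sample = "abracadabra " * 200  # toy stand-in for a real training corpus
print([round(conditional_entropy(sample, n), 3) for n in range(1, 5)])
```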
The last term -log P(x|x_n) in formula (1) indicates the information of a token x coming after x_n, and thus the branching after x_n. The latter half of the formula, the local entropy value for a given x_n,

h(x_n) = H(X \mid X_n = x_n) = -\sum_{x \in \chi} P(x \mid x_n) \log P(x \mid x_n),    (2)

indicates the average information of branching for a specific n-gram sequence x_n. As our interest in this paper is this local entropy, we denote H(X|X_n = x_n) simply as h(x_n) in the rest of this paper.

The decrease in H(X|X_n) globally indicates that, given an n-length sequence x_n and another (n+1)-length sequence y_{n+1}, the following inequality holds on average:

h(x_n) > h(y_{n+1}).    (3)

One reason why inequality (3) holds for language data is that there is context in language, and y_{n+1} carries a longer context as compared with x_n. Therefore, if we suppose that x_n is the prefix of x_{n+1}, then it is very likely that

h(x_n) > h(x_{n+1})    (4)

holds, because the longer the preceding n-gram, the longer the preceding context. For example, it is easier to guess what comes after x_6 = "natura" than what comes after x_5 = "natur". Therefore, the decrease in H(X|X_n) can be expressed as the concept that if the context is longer, the uncertainty of the branching decreases on average. Then, taking the logical contraposition, if the uncertainty does not decrease, the context is not longer, which can be interpreted as the following:

If the entropy of successive tokens increases, the location is at a context border. (B)

For example, in the case of x_7 = "natural", the entropy h("natural") should be larger than h("natura"), because it is uncertain which character will follow x_7. In the next section, we utilize assumption (B) to detect context boundaries.

3 Boundary Detection Using the Entropy of Branching

Assumption (B) gives a hint on how to utilize the branching entropy as an indicator of a context boundary. When two semantic units, both longer than 1, are put together, the entropy would appear as in the first graph of Figure 3. The first semantic unit is from offsets 0 to 4, and the second is from 4 to 8, with each unit formed by elements of \chi. In the figure, one possible transition of the branching degree is shown, where the plot at k on the horizontal axis denotes the entropy h(x_{0,k}), and x_{n,m} denotes the substring between offsets n and m.

Figure 3: Our model for boundary detection based on the entropy of branching

Ideally, the entropy would take a maximum at 4, because it will decrease as k is increased in the ranges of k < 4 and 4 < k < 8, and at k = 4, it will rise. Therefore, the position at k = 4 is detected as a "local maximum value" when monitoring h(x_{0,k}) over k. The boundary condition after such observation can be redefined as the following:

B_max: Boundaries are locations where the entropy is locally maximized.
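To make the detection of such local maxima concrete, the following sketch (ours for illustration; the table of toy entropy values stands in for a model estimated from a corpus) returns the offsets where h(x_{0,k}) peaks, i.e., the candidate boundaries under B_max:

```python
def bmax_boundaries(chunk, h, max_n=6):
    """Candidate boundaries under B_max: offsets k at which h(x_{0,k}) is a
    local maximum.  `h` maps an n-gram string to its branching entropy h(x_n)
    of eq. (2); n-grams absent from the table are skipped."""
    ks = range(1, min(len(chunk), max_n) + 1)
    hs = {k: h.get(chunk[:k]) for k in ks}
    out = []
    for k in ks:
        prev, cur, nxt = hs.get(k - 1), hs.get(k), hs.get(k + 1)
        if None not in (prev, cur, nxt) and cur > prev and cur >= nxt:
            out.append(k)  # a boundary between the k-th and (k+1)-th characters
    return out

# Toy numbers mimicking the first graph of Figure 3: h falls inside the unit
# covering offsets 0-4, peaks at the boundary k = 4, then falls again.
toy_h = {"a": 3.0, "ab": 2.4, "abc": 1.9, "abcd": 2.8, "abcde": 2.1, "abcdef": 1.6}
print(bmax_boundaries("abcdefgh", toy_h))  # -> [4]
```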
A similar method is proposed by Harris (Harris, 1955), where morpheme borders can be detected by using the local maximum of the number of different tokens coming after a prefix. This only holds, however, for semantic units longer than 1. Units often have a length of 1, especially in our case with Chinese characters as elements, so that there are many one-character words. If a unit has length 1, the situation will look like the second graph in Figure 3, where three semantic units, x_{0,4}, x_{4,5}, and x_{5,8}, are present, with the middle unit having length 1. First, at k = 4, the value of h increases. At k = 5, the value may increase or decrease, because a longer context results in an uncertainty decrease, though an uncertainty decrease does not necessarily mean a longer context. When h increases at k = 5, the situation will look like the second graph. In this case, the condition B_max will not suffice, and we need a second boundary condition:

B_increase: Boundaries are locations where the entropy is increased.

On the other hand, when h decreases at k = 5, then even B_increase cannot be applied to detect k = 5 as a boundary. We have other chances to detect k = 5, however, by considering h(x_{i,k}), where 0 < i < k. The value h(x_{i,k}) would hopefully rise for some i if the boundary at k = 5 is important, although h(x_{i,k}) can increase or decrease at k = 5, just as in the case of h(x_{0,k}). According to inequality (3), a similar trend should be present for plots of h(x_{i,k}): assuming that h(x_{0,n}) > h(x_{0,n+1}), we then have

h(x_{i,n}) > h(x_{i,n+1}), for 0 < i < n.    (5)

Therefore, when the target language consists of many one-element units, B_increase is crucial for collecting all boundaries. Note that the boundaries detected by B_max are included in those detected by the condition B_increase, and also that B_increase is a boundary condition representing assumption (B) more directly.

So far, we have considered only regular-order processing: the branching degree is calculated for the element following x_n. We can also consider the reverse order, which involves calculating h for the element preceding x_n. In the case of the preceding element, the question is whether the head of x_n forms the beginning of a context boundary. Next, we move on to explain how we actually applied the above formalization to the problem of Chinese segmentation.

4 Data

The whole data for training amounted to 200 MB, from the Contemporary Chinese Corpus of the Center for Chinese Linguistics at Peking University (Center for Chinese Linguistics, 2006). It consists of several years of the People's Daily newspaper, contemporary Chinese literature, and some popular Chinese magazines. Note that as our method is unsupervised, this learning corpus is just text without any segmentation.

The test data were constructed by selecting sentences from the manually segmented People's Daily corpus of Peking University. In total, the test data amount to 1001 KB, consisting of 147026 Chinese words. The word boundaries indicated in the corpus were used as our gold standard.

As punctuation clearly marks boundaries in Chinese text, we pre-processed the test data by segmenting sentences at punctuation locations to form text fragments. Then, from all fragments, n-grams of up to 6 characters were obtained. The branching entropies for all these n-grams existing within the test data were obtained from the 200 MB of training data. We used 6 as the maximum n-gram length because Chinese words with a length of more than 5 characters are rare; therefore, scanning the n-grams up to a length of 6 was sufficient. Another reason is that we actually conducted the experiment up to 8-grams, but the performance did not improve over using 6-grams. Using this list of n-grams ranging from unigrams to 6-grams and their branching entropies, the test data were processed so as to obtain the word boundaries.
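The following sketch illustrates this preprocessing; the punctuation set, helper names, and sample sentence are merely illustrative and do not reproduce our actual tools:

```python
import re

# Common Chinese punctuation marks used to cut sentences into fragments
# (an assumed, non-exhaustive set for illustration).
PUNCT = "，。！？；：、“”（）《》"

def fragments(sentence):
    """Split a test sentence at punctuation, keeping only the text pieces."""
    return [f for f in re.split("[" + re.escape(PUNCT) + "]", sentence) if f]

def ngrams_to_score(fragment, max_n=6):
    """All substrings of length 1..max_n whose branching entropy must be
    looked up in the model estimated from the unsegmented training corpus."""
    return {fragment[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(fragment) - n + 1)}

# Example with a made-up test sentence: one fragment, 57 distinct n-grams.
for frag in fragments("未来发展的目标和指导方针。"):
    print(frag, len(ngrams_to_score(frag)))
```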
5 Analysis for Small Examples

Figure 4 shows an actual graph of the entropy shift for the input phrase wei lai fa zhan de mu biao he zhi dao fang zhen (the aim and guideline of future development). The upper figure shows the entropy shift for the forward case, and the lower figure shows the entropy shift for the backward case. Note that for the backward case, the branching entropy was calculated for the characters preceding x_n.

Figure 4: Entropy shift for a small example (forward and backward)

In the upper figure, there are two lines, each plotting the branching entropy after the substrings starting from a given character. The leftmost line plots the entropies of the prefixes starting from the first character; it has two increasing points, both falling on word boundaries of the phrase. The second line plots the entropies of substrings starting from a later character; its three increasing locations again fall only on word boundaries. The lower figure is analogous: its two lines plot the branching entropy before the substrings ending with a given suffix, read from back to front, and their increasing points (as seen from back to front) likewise indicate only word boundaries. If we consider all the increasing points in all four lines and take their set union, we obtain the fully correct segmentation of the phrase, which is 100% correct in terms of both recall and precision.

In fact, as there are 12 characters in this input, there should be 12 lines, one starting from each character, for all substrings. For readability, however, we show only two lines each for the forward and backward cases. Also, the maximum length of a line is 6, because we only took 6-grams out of the learning data. If we consider all the increasing points in all 12 lines and take the set union, then we again obtain 100% precision and recall. It is remarkable that all 12 lines indicate only correct word boundaries. Also, note how the correct full segmentation is obtained with only the partial information from 4 of the 12 lines. Based on this observation, we next explain the algorithm that we used for a larger-scale experiment.

6 Algorithm for Segmentation

Having determined the entropy for all n-grams in the learning data, we could scan through each chunk of test data in both the forward order and the backward order to determine the locations of segmentation. As our intention in this paper is above all to study the innate linguistic structure described by assumption (B), we do not want to add any artifacts other than this assumption. For such exact verification, we have to scan through all possible substrings of an input, which amounts to O(n^2) computational complexity, where n indicates the input length in characters. Usually, however, h(x_{m,n}) becomes impossible to measure when n - m becomes large. Also, as noted in the previous section, words longer than 6 characters are very rare in Chinese text. Therefore, given a string x, all n-grams of length no more than 6 are scanned, and the points where the boundary condition holds are output as boundaries.

As for the boundary conditions, we have B_max and B_increase, and we also utilize B_ordinary, where location n is considered as a boundary when the branching entropy h(x_n) is simply above a given threshold. Precisely, there are three boundary conditions:

B_max: h(x_n) > val_max, where h(x_n) takes a local maximum,
B_increase: h(x_{n+1}) - h(x_n) > val_delta,
B_ordinary: h(x_n) > val,

where val_max, val_delta, and val are arbitrary thresholds.
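The scanning procedure with B_increase can be sketched as follows (ours for illustration; h_fwd and h_bwd denote lookup tables of forward and backward branching entropies estimated from the training corpus, and the exact offset conventions follow our reading of the description above):

```python
def bincrease_boundaries(frag, h_fwd, h_bwd, val_delta=0.0, max_n=6):
    """Condition B_increase applied in both scanning directions.

    h_fwd[x] is the branching entropy of the character following substring x,
    h_bwd[x] that of the character preceding x.  An internal offset b is
    reported as a boundary if, for some forward start i or backward end j,
    the entropy rises by more than val_delta when the substring is extended
    to touch b.  Substrings missing from the tables are skipped.
    """
    L = len(frag)
    boundaries = set()
    for b in range(1, L):
        # Forward: compare h after frag[i:b] with h after the shorter frag[i:b-1].
        for i in range(max(0, b - max_n), b - 1):
            shorter, longer = h_fwd.get(frag[i:b - 1]), h_fwd.get(frag[i:b])
            if None not in (shorter, longer) and longer - shorter > val_delta:
                boundaries.add(b)
                break
        # Backward: compare h before frag[b:j] with h before the shorter frag[b+1:j].
        for j in range(b + 2, min(L, b + max_n) + 1):
            shorter, longer = h_bwd.get(frag[b + 1:j]), h_bwd.get(frag[b:j])
            if None not in (shorter, longer) and longer - shorter > val_delta:
                boundaries.add(b)
                break
    return sorted(boundaries)
```

Boundaries detected in either direction are merged, mirroring the set-union treatment of the small example in Section 5.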
7 Large-Scale Experiments

7.1 Definition of Precision and Recall

Usually, when precision and recall are addressed in the Chinese word segmentation domain, they are calculated based on the number of words. For example, consider a correctly segmented sequence "aaa|bbb|ccc|ddd", with a, b, c, d being characters and "|" indicating a word boundary. Suppose that the machine's result is "aaabbb|ccc|ddd"; then the correct words are only "ccc" and "ddd", giving a count of 2. Therefore, the precision is 2 divided by the number of words in the result (i.e., 3, for the words "aaabbb", "ccc", "ddd"), giving 67%, and the recall is 2 divided by the total number of words in the gold standard (i.e., 4, for the words "aaa", "bbb", "ccc", "ddd"), giving 50%. We call these values the word precision and recall, respectively, throughout this paper.

In our case, we use slightly different measures, the boundary precision and recall, which are based on the number of correct boundaries. These scores are also utilized in previous works on unsupervised segmentation (Ando and Lee, 2000; Sun et al., 1998). Precisely,

Precision = N_correct / N_test,    (6)
Recall = N_correct / N_true,    (7)

where N_correct is the number of correct boundaries in the result, N_test is the number of boundaries in the result, and N_true is the number of boundaries in the gold standard. For example, in the case of the machine result being "aaabbb|ccc|ddd", the precision is 100% and the recall is 75% (counting also the boundary that closes the final word, the result contains three of the four true boundaries). Thus, the output "aaabbb|ccc|ddd" contains no imprecise boundary. The crucial reason for using the boundary precision and recall is that boundary detection and word extraction are not exactly the same task. In this sense, assumption (A) or (B) is a general assumption about a boundary (of a sentence, phrase, word, or morpheme). Therefore, the boundary precision and recall measure serves for directly measuring boundaries. Note that all precision and recall scores from now on in this paper are boundary precision and recall. Even in comparing the supervised methods with our unsupervised method later, the precision and recall values are all recalculated as boundary precision and recall.
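The boundary scores of equations (6) and (7) reduce to simple set operations, as in the following sketch (ours; boundaries are represented as integer offsets, and the handling of empty sets is an arbitrary choice of this illustration):

```python
def boundary_precision_recall(predicted, gold):
    """Boundary precision and recall of eqs. (6) and (7).  `predicted` and
    `gold` are sets of boundary offsets for the same character sequence."""
    n_correct = len(predicted & gold)
    precision = n_correct / len(predicted) if predicted else 1.0
    recall = n_correct / len(gold) if gold else 1.0
    return precision, recall

# "aaa|bbb|ccc|ddd" has boundaries {3, 6, 9, 12} when the boundary closing the
# final word is counted; the output "aaabbb|ccc|ddd" has {6, 9, 12}, which
# reproduces the 100% precision and 75% recall quoted above.
print(boundary_precision_recall({6, 9, 12}, {3, 6, 9, 12}))  # (1.0, 0.75)
```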
7.2 Precision and Recall

The precision and recall graph is shown in Figure 5. The horizontal axis is the precision and the vertical axis is the recall. The three lines, from right to left (top to bottom), correspond to B_increase (0.0 <= val_delta <= 2.4), B_max (4.0 <= val_max <= 6.2), and B_ordinary (4.0 <= val <= 6.2). All are plotted at threshold intervals of 0.1. For every condition, the larger the threshold, the higher the precision and the lower the recall.

Figure 5: Precision and recall for B_increase, B_max, and B_ordinary

We can see how B_increase and B_max keep high precision as compared with B_ordinary. We can also see that a boundary is more easily detected when it is judged from nearby values of h(x_n), as in B_increase and B_max, than from the absolute value alone. For B_increase, in particular, when val_delta = 0.0, the precision and recall are still at 0.88 and 0.79, respectively. Upon increasing the threshold to val_delta = 2.4, the precision rises above 0.96, at the cost of a low recall of 0.29. As for B_max, we observe a similar tendency, but with lower recall due to the smaller number of local maximum points as compared with the number of increasing points. Thus, we see how B_increase attains the best performance among the three conditions. This supports the correctness of assumption (B). From now on, we consider only B_increase and proceed through our other experiments.

Next, we investigated how the training data size affects the precision and recall (Figure 6). This time, the horizontal axis is the amount of learning data, varying from 10 KB up to 200 MB on a log scale, and the vertical axis shows the precision and recall. The boundary condition is B_increase with val_delta = 0.1. We can see how the precision always remains high, whereas the recall depends on the amount of data. The precision is stable at a remarkably high value even when the branching entropy is obtained from a very small corpus of 10 KB. Also, the linear increase in the recall suggests that with more than 200 MB of data, we could expect an even higher recall. As the horizontal axis is on a log scale, however, we would need gigabytes of data to gain the last several percent of recall.

Figure 6: Precision and recall depending on training data size

7.3 Error Analysis

According to our manual error analysis, the three most frequent error types were the following:

Numbers: dates, years, quantities (example: 1998 written in Chinese number characters).

One-character words (examples: function words meaning "at", "again", "toward", "and").

Compound Chinese words (example: a word meaning "open mind" being segmented into "open" and "mind").

The reason for the bad results with numbers is probably that the branching entropy for digits is less biased than for usual ideograms. Also, for one-character words, our method is limited, as we explained in Section 3. Both of these problems, however, can be solved by applying special preprocessing for numbers and one-character words, given that many of the one-character words are functional characters, which are limited in number. Such improvements remain for our future work.

The third error type, in fact, is one that could be judged as correct segmentation. In the case of "open mind", it was not segmented into two words in the gold standard; therefore, our result was judged as incorrect. This could, however, be judged as correct. The structures of Chinese words and phrases are very similar, and there are no clear criteria for distinguishing between a word and a phrase. The unsupervised method determines the structure and segments words and phrases into smaller pieces. Manual recalculation of the accuracy taking such cases into account also remains for our future work.

8 Conclusion

We have reported an unsupervised Chinese segmentation method based on the branching entropy. This method is based on the assumption that "if the entropy of successive tokens increases, the location is at a context border." The entropies of n-grams were learned from an unsegmented 200-MB corpus, and the actual segmentation was conducted directly according to the above assumption on 1 MB of test data. We found that the precision was as high as 90%, with recall being around 80%. We also found a notable tendency for the precision to remain high regardless of the size of the learning data.

There are two important directions for our future work. The first is to figure out how to combine the supervised and unsupervised methods. In particular, as the performance of the supervised methods can be insufficient for data that are not from newspapers, there is the possibility of combining the supervised and unsupervised methods to achieve a higher accuracy for general data. The second direction is to verify our basic assumption in other languages.
In particular, we should undertake experimental studies in languages written with phonogram characters.

References

R. K. Ando and L. Lee. 2000. Mostly-unsupervised statistical segmentation of Japanese: Applications to kanji. In ANLP-NAACL.

T. C. Bell, J. G. Cleary, and I. H. Witten. 1990. Text Compression. Prentice Hall.

Center for Chinese Linguistics. 2006. Chinese corpus. Visited 2006; searchable from http://ccl.pku.edu.cn/YuLiao_Contents.Asp, part of it freely available from http://www.icl.pku.edu.cn.

T. Emerson. 2005. The second international Chinese word segmentation bakeoff. In SIGHAN.

H. D. Feng, K. Chen, C. Y. Kit, and X. T. Deng. 2004. Unsupervised segmentation of Chinese corpus using accessor variety. In IJCNLP, pages 255-261.

Z. S. Harris. 1955. From phoneme to morpheme. Language, pages 190-222.

A. Kempe. 1999. Experiments in unsupervised entropy-based corpus segmentation. In Workshop of EACL in Computational Natural Language Learning, pages 7-13.

J. K. Low, H. T. Ng, and W. Guo. 2005. A maximum entropy approach to Chinese word segmentation. In SIGHAN.

M. Sun, D. Shen, and B. K. Tsou. 1998. Chinese word segmentation without using lexicon and hand-crafted training data. In COLING-ACL.

K. Tanaka-Ishii. 2005. Entropy as an indicator of context boundaries: an experiment using a web search engine. In IJCNLP, pages 93-105.

H. P. Zhang, H. Y. Yu, D. Y. Xiong, and Q. Liu. 2003. HHMM-based Chinese lexical analyzer ICTCLAS. In SIGHAN. Visited 2006, available from http://www.nlp.org.cn.