Self-Organizing n-gram Model for Automatic Word Spacing

Seong-Bae Park, Yoon-Shik Tae, Se-Young Park
Department of Computer Engineering
Kyungpook National University
Daegu 702-701, Korea
{sbpark, ystae, sypark}@sejong.knu.ac.kr

Abstract

Automatic word spacing is one of the important tasks in Korean language processing and information retrieval. Since word spacing in Korean has a number of confusing cases, spacing mistakes appear in many texts, including news articles. This paper presents a highly accurate method for automatic word spacing based on a self-organizing n-gram model. The method is basically a variant of the n-gram model, but it achieves high accuracy by adapting the context size automatically. To find the optimal context size, the proposed method increases the context size when the contextual distribution after the increase does not agree with that of the current context. It also decreases the context size when the distribution of the reduced context is similar to that of the current context. This approach achieves high accuracy by considering higher-dimensional data only when necessary, and the increased computational cost is compensated by the reduced context size. The experimental results show that the self-organizing structure of the n-gram model enhances the basic model.

1 Introduction

Even though Korean makes wide use of Chinese characters, which are ideograms, it has a word spacing model unlike Chinese and Japanese. Word spacing in Korean, however, is not a simple task, although the basic rule for it is simple. The basic rule asserts that all content words should be spaced. However, there are a number of exceptions due to various postpositions and endings. For instance, it is difficult to distinguish some postpositions from incomplete nouns. Such exceptions induce many word spacing mistakes, even in news articles.

The problem with inaccurate word spacing is that it is fatal to language processing and information retrieval. Incorrect word spacing results in incorrect morphological analysis. For instance, let us consider a famous Korean sentence: "아버지가방에들어가신다." The true word spacing for this sentence is "아버지가 방에 들어가신다.", whose meaning is that my father entered the room. If the sentence is written as "아버지 가방에 들어가신다.", it means that my father entered the bag, which is totally different from the original meaning. That is, since morphological analysis is the first step in most NLP applications, sentences with incorrect word spacing must be corrected before further processing. In addition, wrong word spacing results in incorrect index terms in information retrieval. Thus, correcting sentences with incorrect word spacing is a critical task in Korean information processing.

One of the simplest and strongest models for automatic word spacing is the n-gram model. In spite of the advantages of the n-gram model, its problems should also be considered to achieve high performance. The main problem of the model is that it is usually built with a fixed window size n. A small value of n represents a narrow context in modeling, which generally results in poor performance. However, it is also difficult to increase n for better performance due to data sparseness. Since the corpus size is physically limited, it is highly possible that many n-grams which do not appear in the corpus exist in the real world.
The goal of this paper is to provide a new method for automatic word spacing based on an n-gram model. The proposed method automatically adapts the window size n. That is, the method begins with a bigram model, shrinks to a unigram model when data sparseness occurs, and grows to a trigram, 4-gram, and so on when more specific information is required to determine word spacing. In a word, the proposed model organizes the window size n online, and achieves high accuracy by removing both data sparseness and lack of information.

The rest of the paper is organized as follows. Section 2 surveys the previous work on automatic word spacing and the smoothing methods for n-gram models. Section 3 describes the general way to perform automatic word spacing with an n-gram model, and Section 4 proposes a self-organizing n-gram model to overcome some drawbacks of n-gram models. Section 5 presents the experimental results. Finally, Section 6 draws conclusions.

2 Previous Work

Much previous work has explored the possibility of automatic word spacing. While most of it reports high accuracy, it can be divided into two methodological categories: the analytic approach and the statistical approach. The analytic approach is based on the results of morphological analysis. Kang used fundamental morphological analysis techniques (Kang, 2000), and Kim et al. distinguished each word by the morphemic information of postpositions and endings (Kim et al., 1998). The main drawbacks of this approach are that (i) the analytic step is very complex, and (ii) it is expensive to construct and maintain the analytic knowledge.

On the other hand, the statistical approach extracts from corpora the probability that a space is put between two syllables. Since this approach obtains the necessary information automatically, it requires neither linguistic knowledge of syllable composition nor the cost of knowledge construction and maintenance. In addition, because it does not rely on a morphological analyzer, it produces robust results even for unknown words. Many previous studies using corpora are based on bigram information. According to (Kang, 2004), modern Korean uses a large number of distinct syllables, so the number of possible syllable bigrams is enormous. In order to obtain stable statistics for all bigrams, a very large corpus is required. If a higher-order n-gram is adopted for better accuracy, the required corpus size grows exponentially.

The main drawback of the n-gram model is that it suffers from data sparseness however large the corpus is. That is, there are many n-grams whose frequency is zero. To avoid this problem, many smoothing techniques have been proposed for constructing n-gram models (Chen and Goodman, 1996). Most of them belong to one of two categories. One is to pretend that each n-gram occurs once more than it actually did (Mitchell, 1997). The other is to interpolate n-grams with lower-dimensional data (Jelinek and Mercer, 1980; Katz, 1987). However, these methods artificially modify the original distribution of the corpus. Thus, the final probabilities used in learning with n-grams are distorted by the smoothing technique.

A maximum entropy model can be considered as another way to avoid zero probabilities in n-gram models (Rosenfeld, 1996). Instead of constructing separate models and then interpolating them, it builds a single, combined model to capture all the information provided by various knowledge sources. Even though the maximum entropy approach is simple, general, and strong, it is computationally very expensive. In addition, its performance depends mainly on the relevance of the knowledge sources, since prior knowledge of the target problem is very important (Park and Zhang, 2002). Thus, when prior knowledge is not clear and computational cost is an important factor, n-gram models are more suitable than a maximum entropy model.

Adapting features or contexts has been an important issue in language modeling (Siu and Ostendorf, 2000). In order to incorporate long-distance features into a language model, (Rosenfeld, 1996) adopted triggers, and (Mochihashi and Matsumoto, 2006) used a particle filter. However, these methods are restricted to a specific language model. Instead of long-distance features, some other researchers tried local context extension. For this purpose, (Schütze and Singer, 1994) adopted the variable memory Markov model proposed by (Ron et al., 1996), (Kim et al., 2003) applied selective extension of features to POS tagging, and (Dickinson and Meurers, 2005) expanded the context of n-gram models to find errors in syntactic annotation. In these methods, only neighboring words or features of the target n-grams became candidates to be added into the context. Since they required more information for better performance or for detecting errors, only context extension was considered.

3 Automatic Word Spacing by n-gram Model

The problem of automatic word spacing can be regarded as a binary classification task. Let a sentence be given as S = w_1 w_2 ... w_N. If i.i.d. sampling is assumed, the data from this sentence are given as (x_1, y_1), ..., (x_N, y_N), where x_i ∈ R^n and y_i ∈ {true, false}. In this representation, x_i is a contextual representation of a syllable w_i. If a space should be put after w_i, then y_i, the class of x_i, is true. It is false otherwise. Therefore, automatic word spacing is the task of estimating a function f: R^n → {true, false}. That is, our task is to determine whether a space should be put after a syllable w_i expressed as x_i with its context.

The probabilistic method is one of the strongest and most widely used methods for estimating f. That is, for each w_i,

  f(x_i) = argmax_{y ∈ {true, false}} P(y | x_i),

where P(y | x_i) is rewritten as

  P(y | x_i) = P(x_i | y) P(y) / P(x_i).

Since P(x_i) is independent of finding the class of x_i, f(x_i) is determined by multiplying P(x_i | y) and P(y). That is,

  f(x_i) = argmax_{y ∈ {true, false}} P(x_i | y) P(y).

In an n-gram model, x_i is expressed with the n neighboring syllables around w_i. Typically, n is taken to be two or three, corresponding to a bigram or trigram respectively. x_i corresponds to w_{i-1} w_i when n = 2. In the same way, it is w_{i-2} w_{i-1} w_i when n = 3. The simple and easy way to estimate P(x_i | y) is to use the maximum likelihood estimate over a large corpus. For instance, consider the case n = 2. Then, the probability P(x_i | y) is represented as P(w_{i-1} w_i | y), and is computed by

  P(w_{i-1} w_i | y) = C(w_{i-1} w_i & y) / C(y),   (1)

where C is a counting function.

Determining the context size, the value of n, in n-gram models is closely related to the corpus size. The larger n is, the larger the corpus required to avoid data sparseness. In contrast, though low-order n-grams do not suffer severely from data sparseness, they do not reflect the language characteristics well either. Typically, researchers have used n = 2 or n = 3, and achieved high performance in many tasks (Bengio et al., 2003).

[Figure 1: The performance of n-gram models (unigram through 10-gram) according to the value of n in automatic word spacing; accuracy (%) plotted against the number of training examples.]

Figure 1 supports that bigrams and trigrams outperform lower-order (n = 1) and higher-order n-grams in automatic word spacing. All the experimental settings for this figure follow those in Section 5. In this figure, the bigram model shows the best accuracy and the trigram achieves the second best, whereas the unigram model results in the worst accuracy. Since the bigram model is best, the self-organizing n-gram model explained below starts from a bigram.
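The classification rule above together with the maximum likelihood estimate of Equation (1) reduces to simple counting. The following sketch illustrates the idea for the bigram case; it is a minimal sketch, assuming a toy corpus represented as lists of (syllable, space-after?) pairs, and the function and variable names are illustrative rather than taken from the original implementation.

```python
from collections import defaultdict

def train_bigram_spacer(corpus):
    """Estimate the counts behind P(w_{i-1} w_i | y) and P(y), as in Equation (1).

    corpus: list of sentences, each a list of (syllable, bool) pairs,
            where the bool is True if a space follows the syllable.
    """
    bigram_count = defaultdict(int)   # C(w_{i-1} w_i & y)
    class_count = defaultdict(int)    # C(y)
    for sentence in corpus:
        for i in range(1, len(sentence)):
            prev_syll, _ = sentence[i - 1]
            syll, y = sentence[i]
            bigram_count[(prev_syll, syll, y)] += 1
            class_count[y] += 1
    return bigram_count, class_count

def classify(prev_syll, syll, bigram_count, class_count):
    """Return True if a space should follow `syll`: argmax_y P(x | y) P(y)."""
    total = sum(class_count.values())
    def score(y):
        prior = class_count[y] / total                                    # P(y)
        likelihood = bigram_count[(prev_syll, syll, y)] / class_count[y]  # Eq. (1)
        return likelihood * prior
    return score(True) > score(False)

# Toy usage with hypothetical data: each syllable is labeled with whether a space follows it.
toy = [[("a", False), ("b", True), ("c", False), ("d", True)]]
bc, cc = train_bigram_spacer(toy)
print(classify("a", "b", bc, cc))   # True: the bigram "ab" was followed by a space
```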
4 Self-Organizing n-gram Model

To tackle the problem of the fixed window size in n-gram models, we propose a self-organizing structure for them.

4.1 Expanding n-grams

When n-grams are compared with (n+1)-grams, their performance in many tasks is lower than that of the (n+1)-grams (Charniak, 1993). At the same time, the computational cost for (n+1)-grams is far higher than that for n-grams. Thus, it is justified to use (n+1)-grams instead of n-grams only when higher performance is expected. In other words, the (n+1)-grams should be different from the n-grams; otherwise, the performance would not differ. Since our task is attacked with a probabilistic method, the difference can be measured with conditional distributions. If the conditional distributions of the n-grams and the (n+1)-grams are similar to each other, there is no reason to adopt the (n+1)-grams.

Let P(y | x_n) be the class-conditional probability given by n-grams and P(y | x_{n+1}) that given by (n+1)-grams. Then, the difference d(n, n+1) between them is measured by the Kullback-Leibler divergence. That is,

  d(n, n+1) = KL(P(y | x_n) || P(y | x_{n+1})),

which is computed by

  d(n, n+1) = sum_{z ∈ {true, false}} p(z | x_n) log ( p(z | x_n) / p(z | x_{n+1}) ).   (2)

A d(n, n+1) that is larger than a predefined threshold θ_EXP implies that P(y | x_{n+1}) is different from P(y | x_n). In this case, the (n+1)-grams are used instead of the n-grams. Figure 2 depicts an algorithm that determines how large the n-gram window should be.

  Function HowLargeExpand(x_n)
  Input: x_n: an n-gram
  Output: an integer for the expanding size
  1. Retrieve the (n+1)-gram x_{n+1} for x_n.
  2. Compute P(y | x_{n+1}).
  3. If KL(P(y | x_n) || P(y | x_{n+1})) < θ_EXP Then return 0.
  4. return HowLargeExpand(x_{n+1}) + 1.

  Figure 2: A function that determines how large a window size should be used.

It recursively finds the optimal expanding window size. For instance, let bigrams (n = 2) be used at first. When the difference between bigrams and trigrams (n = 3) is larger than θ_EXP, the difference between trigrams and 4-grams (n = 4) is checked again. If it is less than θ_EXP, then this function returns 1 and trigrams are used instead of bigrams. Otherwise, it considers even higher n-grams.
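As a concrete illustration of the expansion test, the sketch below computes the divergence of Equation (2) from raw counts and applies the θ_EXP check for a single step; the paper's HowLargeExpand applies this check recursively. This is a minimal sketch assuming dictionaries of context counts, and the names (counts_n, theta_exp, and so on) are illustrative rather than taken from the paper.

```python
import math

def class_distribution(counts, context):
    """P(y | context) from raw counts; counts maps (context, y) -> frequency."""
    freq = {y: counts.get((context, y), 0) for y in (True, False)}
    total = sum(freq.values())
    if total == 0:
        return None                      # unseen context: no reliable distribution
    return {y: freq[y] / total for y in (True, False)}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the two spacing classes, as in Equation (2)."""
    return sum(p[y] * math.log((p[y] + eps) / (q[y] + eps)) for y in (True, False))

def should_expand(counts_n, counts_n1, ctx_n, ctx_n1, theta_exp):
    """Move to (n+1)-grams only if their class distribution differs enough."""
    p_n = class_distribution(counts_n, ctx_n)
    p_n1 = class_distribution(counts_n1, ctx_n1)
    if p_n is None or p_n1 is None:      # fall back to the current window size
        return False
    return kl_divergence(p_n, p_n1) > theta_exp
```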
4.2 Shrinking n-grams

Shrinking n-grams is accomplished in the direction opposite to expanding them. After comparing the n-grams with the (n-1)-grams, the (n-1)-grams are used instead of the n-grams only when they are similar enough. The difference d(n, n-1) between the n-grams and the (n-1)-grams is, once again, measured by the Kullback-Leibler divergence. That is,

  d(n, n-1) = KL(P(y | x_n) || P(y | x_{n-1})).

If d(n, n-1) is smaller than another predefined threshold θ_SHR, then the (n-1)-grams are used instead of the n-grams. Figure 3 shows the algorithm which determines how deeply the shrinking goes.

  Function HowSmallShrink(x_n)
  Input: x_n: an n-gram
  Output: an integer for the shrinking size
  1. If n = 1 Then return 0.
  2. Retrieve the (n-1)-gram x_{n-1} for x_n.
  3. Compute P(y | x_{n-1}).
  4. If KL(P(y | x_n) || P(y | x_{n-1})) > θ_SHR Then return 0.
  5. return HowSmallShrink(x_{n-1}) - 1.

  Figure 3: A function that determines how small a window size should be used.

The main structure of this algorithm is equivalent to that in Figure 2. It also recursively finds the optimal shrinking window size, but the window cannot be reduced further when the current model is a unigram.

The merit of shrinking n-grams is that it constructs a model with a lower dimensionality. Since the maximum likelihood estimate is used in calculating the probabilities, this helps to obtain stable probabilities. According to the well-known curse of dimensionality, the required data density is reduced exponentially by reducing dimensions. Thus, if the lower-dimensional model does not differ much from the higher-dimensional one, it is highly possible that the probabilities from the lower-dimensional space are more stable than those from the higher-dimensional space.

4.3 Overall Self-Organizing Structure

For a given i.i.d. sample x_i, there are three possibilities for changing the n-grams. The first is not to change them. This occurs when both d(n, n+1) < θ_EXP and d(n, n-1) > θ_SHR are met, that is, when expanding results in a distribution too similar to that of the current n-grams and the distribution after shrinking is too different from that of the current n-grams. The remaining possibilities are expanding and shrinking. The application order between them can affect the performance of the proposed method. In this paper, expanding is checked prior to shrinking, as shown in Figure 4.

  Function ChangingWindowSize(x_n)
  Input: x_n: an n-gram
  Output: an integer for the change of the window size
  1. Set exp := HowLargeExpand(x_n).
  2. If exp > 0 Then return exp.
  3. Set shr := HowSmallShrink(x_n).
  4. If shr < 0 Then return shr.
  5. return 0.

  Figure 4: A function that determines the changing window size of n-grams.

The function ChangingWindowSize first calls HowLargeExpand. A non-zero return value of HowLargeExpand implies that the window size of the current n-grams should be enlarged. Otherwise, ChangingWindowSize checks whether the window size should be shrunk by calling HowSmallShrink. If HowSmallShrink returns a negative integer, the window size should be shrunk to (n + shr). If both functions return zero, the window size should not be changed.

The reason why HowLargeExpand is called prior to HowSmallShrink is that the expanded n-grams handle more specific data. (n+1)-grams, in general, help to obtain higher accuracy than n-grams, since (n+1)-gram data are more specific than n-gram data. However, it is time-consuming to consider higher-order data, since the number of kinds of data increases. The time increase due to expanding is compensated by shrinking: after shrinking, only lower-order data are considered, and the processing time for them decreases.
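The control flow of Figure 4 can be written directly as code. The sketch below is a minimal rendering in which the expand and shrink tests are passed in as callables (for example, functions built along the lines of the earlier sketches); the names are illustrative and the stand-in tests in the usage line are purely hypothetical.

```python
def changing_window_size(x_n, how_large_expand, how_small_shrink):
    """Figure 4: decide how the window size of the n-gram x_n should change.

    how_large_expand(x_n) -> non-negative int (0 means "do not expand")
    how_small_shrink(x_n) -> non-positive int (0 means "do not shrink")
    Expansion is checked first; a positive expansion wins over any shrink.
    """
    exp = how_large_expand(x_n)
    if exp > 0:
        return exp          # enlarge the window by `exp`
    shr = how_small_shrink(x_n)
    if shr < 0:
        return shr          # reduce the window by `-shr`
    return 0                # keep the current window size

# Example with trivial stand-in tests (purely illustrative):
print(changing_window_size(("a", "b"), lambda x: 0, lambda x: -1))   # -1: shrink by one
```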
4.4 Sequence Tagging

Since natural language sentences are sequential by nature, word spacing can be considered as a special POS tagging task (Lee et al., 2002), for which a hidden Markov model is usually adopted. The best sequence of word spacing tags for a sentence is defined as

  y*_{1:N} = argmax_{y_{1:N}} P(y_{1:N} | x_{1:N})
           = argmax_{y_{1:N}} P(x_{1:N} | y_{1:N}) P(y_{1:N}) / P(x_{1:N})
           = argmax_{y_{1:N}} P(x_{1:N} | y_{1:N}) P(y_{1:N}),

where N is the sentence length. If we assume that the syllables are independent of each other, P(x_{1:N} | y_{1:N}) is given by

  P(x_{1:N} | y_{1:N}) = prod_{i=1}^{N} P(x_i | y_i),

which can be computed using Equation (1). In addition, by the Markov assumption, the probability of the current tag y_i depends only on the preceding tag. That is,

  P(y_{1:N}) = prod_{i=1}^{N} P(y_i | y_{i-1}).

Thus, the best sequence is determined by

  y*_{1:N} = argmax_{y_{1:N}} prod_{i=1}^{N} P(x_i | y_i) · P(y_i | y_{i-1}).   (3)

Since this equation follows the Markov assumption, the best sequence is found by applying the Viterbi algorithm.
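Because the tag set is just {true, false}, the Viterbi search over Equation (3) is tiny. The sketch below assumes that the emission probabilities P(x_i | y_i) have already been computed per syllable and that a first-order transition table P(y_i | y_{i-1}) and an initial distribution are available; all names are illustrative and not taken from the paper's implementation.

```python
def viterbi_spacing(emissions, transition, prior):
    """Best spacing tag sequence for one sentence under Equation (3).

    emissions:  list of dicts, emissions[i][y] = P(x_i | y_i = y)
    transition: dict, transition[(prev, cur)] = P(y_i = cur | y_{i-1} = prev)
    prior:      dict, prior[y] = P(y_1 = y)
    """
    tags = (True, False)
    delta = {y: prior[y] * emissions[0][y] for y in tags}   # best score ending in y
    backpointers = []
    for i in range(1, len(emissions)):
        new_delta, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: delta[p] * transition[(p, cur)])
            new_delta[cur] = delta[prev] * transition[(prev, cur)] * emissions[i][cur]
            pointers[cur] = prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final tag.
    best = max(tags, key=lambda y: delta[y])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```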
5 Experiments

5.1 Data Set

The data set used in this paper is the HANTEC corpus version 2.0 distributed by KISTI (http://www.kisti.re.kr). From this corpus, we extracted only the HKIB94 part, which consists of 22,000 news articles published in 1994 by Hankook Ilbo. The reason why HKIB94 was chosen is that the word spacing of news articles is relatively more accurate than that of other texts. Even though this data set is composed of 12,523,688 Korean syllables in total, the number of unique syllables is just 2,037 after removing all special symbols, digits, and English alphabet characters.

The data set is divided into three parts: training (70%), held-out (20%), and test (10%). The held-out set is used only to estimate θ_EXP and θ_SHR. The number of instances in the training set is 8,766,578, that in the held-out set is 2,504,739, and that in the test set is 1,252,371. Among the 1,252,371 test cases, the number of positive instances is 348,278 and that of negative instances is 904,093. Since about 72% of the test cases are negative, this is the baseline of automatic word spacing.

5.2 Experimental Results

To evaluate the performance of the proposed method, two well-known machine learning algorithms are compared with it: (i) a decision tree and (ii) a support vector machine. We use C4.5 release 8 (Quinlan, 1993) for decision tree induction and SVM-light (Joachims, 1998) for support vector machines. For all experiments with decision trees and support vector machines, the context size is set to two, since the bigram shows the best performance in Figure 1.

Table 1 gives the experimental results of the various methods, including the machine learning algorithms and the self-organizing n-gram model. The 'self-organizing bigram' in this table is the model proposed in this paper. The normal n-grams achieve an accuracy of around 88%, while the decision tree and the support vector machine produce around 89%. The self-organizing n-gram model achieves 91.31%. The accuracy improvement of the self-organizing n-gram model is about 19% over the baseline, about 3% over the normal n-gram models, and about 2% over the decision tree and the support vector machine.

  Methods                   Accuracy (%)
  baseline                  72.19
  bigram                    88.34
  trigram                   87.59
  self-organizing bigram    91.31
  decision tree             88.68
  support vector machine    89.10

  Table 1: The experimental results of various methods for automatic word spacing.

In order to organize the context size for n-grams online, the two operations of expanding and shrinking were proposed. Table 2 shows how much the number of errors is affected by their application order. The number of errors made when expanding is applied first is 108,831, while that when shrinking is applied first is 114,343. That is, if shrinking is applied ahead of expanding, 5,512 additional errors are made. Thus, it is clear that expanding should be considered first.

  Order                      No. of Errors
  Expanding then Shrinking   108,831
  Shrinking then Expanding   114,343

  Table 2: The number of errors caused by the application order of context expanding and shrinking.

The errors made after expanding can be explained by two factors: (i) the expressive power of the model and (ii) data sparseness. Since Korean is a partially free word-order language and the omission of words is very frequent, an n-gram model that captures only local information cannot express the target task sufficiently. In addition, the class-conditional distribution after expanding can be very different from that before expanding due to data sparseness. In such cases, the expansion should not be applied, since the distribution after expanding is not trustworthy. However, only the difference between the two distributions is considered in the proposed method, so errors can still be caused by data sparseness.

[Figure 5: The effect of the number of training examples in the self-organizing n-gram model; accuracy (%) plotted against the number of training examples.]

Figure 5 shows that the number of training instances does not matter much in computing the probabilities of the n-grams. Even though the accuracy increases slightly, the accuracy difference after 900,000 instances is not significant. This implies that the errors made by the proposed method come not from a lack of training instances but from a lack of expressive power for the target task. This result also complies with Figure 1.

5.3 Effect of Right Context

All the experiments above considered the left context only. However, Kang reported that a probabilistic model using both left and right context outperforms one that uses the left context only (Kang, 2004). In his work, the word spacing probability P(w_i, w_{i+1}) between two adjacent syllables w_i and w_{i+1} is given as

  P(w_i, w_{i+1}) = 0.2 · P_L(w_{i-1} w_i) + 0.6 · P_M(w_i w_{i+1}) + 0.2 · P_R(w_{i+1} w_{i+2}),   (4)

where P_L, P_M, and P_R are each computed from the syllable frequencies.
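A weighted interpolation of this kind is straightforward to compute once the three pairwise probability tables are available. The sketch below is illustrative only: p_left, p_mid, and p_right are hypothetical lookup tables, and the weights argument stands in for either the fixed coefficients of Equation (4) or tuned values such as the alphas introduced next.

```python
def spacing_probability(syllables, i, p_left, p_mid, p_right, weights):
    """Interpolated probability of a space between syllables[i] and syllables[i+1],
    in the spirit of Equation (4).

    p_left, p_mid, p_right: hypothetical dicts mapping a syllable pair to a
    spacing probability; weights: the three interpolation coefficients.
    """
    w_l, w_m, w_r = weights
    left = p_left.get((syllables[i - 1], syllables[i]), 0.0) if i >= 1 else 0.0
    mid = p_mid.get((syllables[i], syllables[i + 1]), 0.0)
    right = (p_right.get((syllables[i + 1], syllables[i + 2]), 0.0)
             if i + 2 < len(syllables) else 0.0)
    return w_l * left + w_m * mid + w_r * right
```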
In order to reflect this idea of bidirectional context in the proposed model, the model is enhanced by modifying P(w_{i-1} w_i | y) in Equation (1). That is, the likelihood P(w_{i-1} w_i | y) is expanded to be

  P(w_{i-1} w_i | y) = α_1 · P(w_{i-2} w_{i-1} | y) + α_2 · P(w_{i-1} w_i | y) + α_3 · P(w_i w_{i+1} | y).

Since the coefficients of Equation (4) were determined arbitrarily (Kang, 2004), they are replaced here with parameters α whose values are determined using the held-out data.

The change of accuracy according to the context is shown in Table 3. When only the right context is used, the accuracy is 88.26%, which is worse than using the left context only. That is, the original n-gram is a relatively good model. However, when both left and right context are used, the accuracy becomes 92.54%. The accuracy improvement from using the additional right context is 1.23%. This result coincides with the previous report (Lee et al., 2002). The α's that achieve this accuracy are α_1 = 0.5, α_2 = 0.3, and α_3 = 0.2.

  Context              Accuracy (%)
  Left Context Only    91.31
  Right Context Only   88.26
  Both Contexts        92.54

  Table 3: The effect of using both left and right context.

5.4 Effect of Considering Tag Sequence

The state-of-the-art performance on Korean word spacing is achieved with a hidden Markov model. According to previous work (Lee et al., 2002), the hidden Markov model shows the best performance when it sees two previous tags and two previous syllables. For simplicity in the experiments, only one previous tag is considered in Equation (3). The performance comparison between the normal HMM and the proposed method is given in Table 4. The proposed method considers a varying number of previous syllables, whereas the normal HMM has a fixed context. Thus, the proposed method is denoted in Table 4 as 'self-organizing HMM.'

  Method                 Accuracy (%)
  Normal HMM             92.37
  Self-Organizing HMM    94.71

  Table 4: The effect of considering a tag sequence.

The accuracy of the self-organizing HMM is 94.71%, while that of the normal HMM is just 92.37%. Even though the normal HMM considers more previous tags (two), the accuracy of the self-organizing model is 2.34% higher than that of the normal HMM. Therefore, the proposed method that considers the sequence of word spacing tags achieves higher accuracy than any other method reported so far.

6 Conclusions

In this paper we have proposed a new method for learning word spacing in Korean by adaptively organizing the context size. Our method is based on the simple n-gram model, but the context size n is changed as needed. When the increased context is much different from the current one, the context size is increased. In the same way, the context is decreased if the decreased context is not very different from the current one. The benefits of this method are that it can consider a wider context by increasing the context size as required, and that it saves computational cost thanks to the reduced context.

The experiments on the HANTEC corpus showed that the proposed method improves the accuracy of the trigram model by 3.72%. Even compared with some well-known machine learning algorithms, it achieved an improvement of 2.63% over decision trees and 2.21% over support vector machines. In addition, we showed two ways of further improving the proposed method: considering the right context and considering the word spacing tag sequence. By considering the left and right context at the same time, the accuracy is improved by 1.23%, and the consideration of the word spacing tag sequence gives an accuracy improvement of 2.34%.

The n-gram model is one of the most widely used methods in natural language processing and information retrieval. In particular, it is one of the most successful language models, which is a key technique in language and speech processing. Therefore, the proposed method can be applied not only to word spacing but also to many other tasks. Even though word spacing is one of the important tasks in Korean information processing, it is only a simple task in many other languages such as English, German, and French. However, due to its generality, the importance of the proposed method still holds for such languages.
Acknowledgements

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2005-202-D00465).

References

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, Vol. 3, pp. 1137-1155.

E. Charniak. 1993. Statistical Language Learning. MIT Press.

S. Chen and J. Goodman. 1996. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 310-318.

M. Dickinson and W. Meurers. 2005. Detecting Errors in Discontinuous Structural Annotation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 322-329.

F. Jelinek and R. Mercer. 1980. Interpolated Estimation of Markov Source Parameters from Sparse Data. In Proceedings of the Workshop on Pattern Recognition in Practice.

T. Joachims. 1998. Making Large-Scale SVM Learning Practical. LS8, Universität Dortmund.

S.-S. Kang. 2000. Eojeol-Block Bidirectional Algorithm for Automatic Word Spacing of Hangul Sentences. Journal of KISS, Vol. 27, No. 4, pp. 441-447. (in Korean)

S.-S. Kang. 2004. Improvement of Automatic Word Segmentation of Korean by Simplifying Syllable Bigram. In Proceedings of the 15th Conference on Korean Language and Information Processing, pp. 227-231. (in Korean)

S. Katz. 1987. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 35, No. 3, pp. 400-401.

K.-S. Kim, H.-J. Lee, and S.-J. Lee. 1998. Three-Stage Spacing System for Korean in Sentence with No Word Boundaries. Journal of KISS, Vol. 25, No. 12, pp. 1838-1844. (in Korean)

J.-D. Kim, H.-C. Rim, and J. Tsujii. 2003. Self-Organizing Markov Models and Their Application to Part-of-Speech Tagging. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 296-302.

D.-G. Lee, S.-Z. Lee, H.-C. Rim, and H.-S. Lim. 2002. Automatic Word Spacing Using Hidden Markov Model for Refining Korean Text Corpora. In Proceedings of the 3rd Workshop on Asian Language Resources and International Standardization, pp. 51-57.

T. Mitchell. 1997. Machine Learning. McGraw Hill.

D. Mochihashi and Y. Matsumoto. 2006. Context as Filtering. Advances in Neural Information Processing Systems 18, pp. 907-914.

S.-B. Park and B.-T. Zhang. 2002. A Boosted Maximum Entropy Model for Learning Text Chunking. In Proceedings of the 19th International Conference on Machine Learning, pp. 482-489.

R. Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.

D. Ron, Y. Singer, and N. Tishby. 1996. The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine Learning, Vol. 25, No. 2, pp. 117-149.

R. Rosenfeld. 1996. A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer, Speech and Language, Vol. 10, pp. 187-228.

H. Schütze and Y. Singer. 1994. Part-of-Speech Tagging Using a Variable Memory Markov Model. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 181-187.

M. Siu and M. Ostendorf. 2000. Variable N-Grams and Extensions for Conversational Speech Language Modeling. IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 1, pp. 63-75.