Robust Extraction of Named Entity Including Unfamiliar Word Masatoshi Tsuchiya Shinya Hida Seiichi Nakagawa Information and Media Center / Department of Information and Computer Sciences, Toyohashi University of Technology tsuchiya@imc.tut.ac.jp, {hida,nakagawa}@slp.ics.tut.ac.jp Abstract Table 1: Statistics of NE Types of IREX Corpus NE Type ARTIFACT DATE LOCATION MONEY ORGANIZATION PERCENT PERSON TIME Total Frequency (%) 747 (4.0) 3567 (19.1) 5463 (29.2) 390 (2.1) 3676 (19.7) 492 (2.6) 3840 (20.6) 502 (2.7) 18677 This paper proposes a novel method to extract named entities including unfamiliar words which do not occur or occur few times in a training corpus using a large unannotated corpus. The proposed method consists of two steps. The first step is to assign the most similar and familiar word to each unfamiliar word based on their context vectors calculated from a large unannotated corpus. After that, traditional machine learning approaches are employed as the second step. The experiments of extracting Japanese named entities from IREX corpus and NHK corpus show the effectiveness of the proposed method. 1 Introduction It is widely agreed that extraction of named entity (henceforth, denoted as NE) is an important subtask for various NLP applications. Various machine learning approaches such as maximum entropy(Uchimoto et al., 2000), decision list(Sassano and Utsuro, 2000; Isozaki, 2001), and Support Vector Machine(Yamada et al., 2002; Isozaki and Kazawa, 2002) were investigated for extracting NEs. All of them require a corpus whose NEs are annotated properly as training data. However, it is difficult to obtain an enough corpus in the real world, because there are increasing the number of NEs like personal names and company names. For example, a large database of organization names(Nichigai Associates, 2007) already contains 171,708 entries and is still increasing. Therefore, a robust method to extract NEs including unfamiliar words which do not occur or occur few times in a training corpus is necessary. This paper proposes a novel method of extracting NEs which contain unfamiliar morphemes using a large unannotated corpus, in order to resolve the above problem. The proposed method consists 125 of two steps. The first step is to assign the most similar and familiar morpheme to each unfamiliar morpheme based on their context vectors calculated from a large unannotated corpus. The second step is to employ traditional machine learning approaches using both features of original morphemes and features of similar morphemes. The experiments of extracting Japanese NEs from IREX corpus and NHK corpus show the effectiveness of the proposed method. 2 Extraction of Japanese Named Entity 2.1 Task of the IREX Workshop The task of NE extraction of the IREX workshop (Sekine and Eriguchi, 2000) is to recognize eight NE types in Table 1. The organizer of the IREX workshop provided a training corpus, which consists of 1,174 newspaper articles published from January 1st 1995 to 10th which include 18,677 NEs. In the Japanese language, no other corpus whose NEs are annotated is publicly available as far as we know.1 2.2 Chunking of Named Entities It is quite common that the task of extracting Japanese NEs from a sentence is formalized as a chunking problem against a sequence of mor1 The organizer of the IREX workshop also provides the testing data to its participants, however, we cannot see it because we did not join it. Proceedings of ACL-08: HLT, Short Papers (Companion Volume), pages 125­128, Columbus, Ohio, USA, June 2008. c 2008 Association for Computational Linguistics phemes. For representing proper chunks, we employ IOB2 representation, one of those which have been studied well in various chunking tasks of NLP (Tjong Kim Sang, 1999). This representation uses the following three labels. B I O Current token is the beginning of a chunk. Current token is a middle or the end of a chunk consisting of more than one token. Current token is outside of any chunk. and bigrams, Vm = f (m, m0 ), f (m, m0 , m0 ), f (m0 , m), f (m0 , m0 , m), ··· f (m, mN ), · · · f (m, mN , mN ), ··· f (mN , m), · · · f (mN , mN , m) Actually, we prepare the 16 derived labels from the label B and the label I for eight NE types, in order to distinguish them. When the task of extracting Japanese NEs from a sentence is formalized as a chunking problem of a sequence of morphemes, the segmentation boundary problem arises as widely known. For example, the NE definition of IREX tells that a Chinese character " (bei)" must be extracted as an NE means America from a morpheme " (hou-bei)" which means visiting America. A naive chunker using a morpheme as a chunking unit cannot extract such kind of NEs. In order to cope this problem, (Uchimoto et al., 2000) proposed employing translation rules to modify problematic morphemes, and (Asahara and Matsumoto, 2003; Nakano and Hirai, 2004) formalized the task of extracting NEs as a chunking problem of a sequence of characters instead of a sequence of morphemes. In this paper, we keep the naive formalization, because it is still enough to compare performances of proposed methods and baseline methods. where M {m0 , m1 , . . . , mN } is a set of all morphemes of the unannotated corpus, f (mi , mj ) is a frequency that a sequence of a morpheme mi and a morpheme mj occurs in the unannotated corpus, and f (mi , mj , mk ) is a frequency that a sequence of morphemes mi , mj and mk occurs in the unannotated corpus. Suppose an unfamiliar morpheme mu M MF , where MF is a set of familiar morphemes that occur frequently in the annotated corpus. The most similar morpheme mu to the morpheme mu measured ^ with their context vectors is given by the following equation, mu = argmax sim(Vmu , Vm ), ^ mMF , (1) where sim(Vi , Vj ) is a similarity function between context vectors. In this paper, the cosine function is employed as it. 3.2 Features The feature set Fi at i-th position is defined as a tuple of the morpheme feature M F (mi ) of the i-th morpheme mi , the similar morpheme feature S F (mi ), and the character type feature C F (mi ). T Fi = M F (mi ), S F (mi ), C F (mi ) 3 Robust Extraction of Named Entities Including Unfamiliar Words The proposed method of extracting NEs consists of two steps. Its first step is to assign the most similar and familiar morpheme to each unfamiliar morpheme based on their context vectors calculated from a large unannotated corpus. The second step is to employ traditional machine learning approaches using both features of original morphemes and features of similar morphemes. The following subsections describe these steps respectively. 3.1 Assignment of Similar Morpheme he morpheme feature M F (mi ) is a pair of the surface string and the part-of-speech of mi . The similar morpheme feature S F (mi ) is defined as S F (mi ) = F (mi ) if mi M MF ^ , M F (mi ) otherwise M A context vector Vm of a morpheme m is a vector consisting of frequencies of all possible unigrams 126 where mi is the most similar and familiar morpheme ^ to mi given by Equation (1). The character type feature C F (mi ) is a set of four binary flags to indicate that the surface string of mi contains a Chinese character, a hiragana character, a katakana character, and an English alphabet respectively. When we identify the chunk label ci for the ith morpheme mi , the surrounding five feature sets Fi-2 , Fi-1 , Fi , Fi+1 , Fi+2 and the preceding two chunk labels ci-2 , ci-1 are refered. Morpheme Feature (English P OS translation) (kyou) (today) Noun­Adverbial (no) gen Particle (Ishikari) (Ishikari) Noun­Proper (heiya) (plain) Noun­Generic (no) gen Particle (tenki) (weather) Noun­Generic (ha) top Particle (hare) (fine) Noun­Generic Similar Morpheme Feature (English P OS translation) (kyou) (today) Noun­Adverbial (no) gen Particle (Kantou) (Kantou) Noun­Proper (heiya) (plain) Noun­Generic (no) gen Particle (tenki) (weather) Noun­Generic (ha) top Particle (hare) (fine) Noun­Generic Character Type Feature 1, 0, 0, 0 0, 1, 0, 0 1, 0, 0, 0 1, 0, 0, 0 0, 1, 0, 0 1, 0, 0, 0 0, 1, 0, 0 1, 1, 0, 0 Chunk Label O O B-LOCATION I-LOCATION O O O O Figure 1: Example of Training Instance for Proposed Method Feature set Chunk label - Parsing Direction - Fi-2 Fi-1 Fi Fi+1 Fi+2 ci-2 ci-1 ci Table 2: NE Extraction Performance of IREX Corpus Proposed CRF S VM 0.487 0.518 0.921 0.909 0.866 0.863 0.951 0.610 0.774 0.766 0.936 0.863 0.825 0.842 0.901 0.903 0.842 0.834 Baseline CRF S VM 0.458 0.457 0.916 0.916 0.847 0.846 0.937 0.937 0.744 0.742 0.928 0.928 0.788 0.787 0.902 0.901 0.821 0.820 NExT 0.682 0.696 0.895 0.506 0.821 0.672 0.800 0.732 Figure 1 shows an example of training instance of the proposed method for the sentence " (kyou) (no) (Ishikari) (heiya) (no) (tenki) (ha) (hare)" which means "It is fine at Ishikari-plain, today". " (Kantou)" is assigned as the most similar and familiar morpheme to " (Ishikari)" which is unfamiliar in the training corpus. ARTIFACT DATE LOCATION MONEY ORGANIZATION PERCENT PERSON TIME Total Table 3: Statistics of NE Types of NHK Corpus NE Type DATE LOCATION MONEY ORGANIZATION PERCENT PERSON TIME Total Frequency (%) 755 (19%) 1465 (36%) 124 (3%) 1056 (26%) 55 (1%) 516 (13%) 101 (2%) 4072 4 Experimental Evaluation 4.1 Experimental Setup IREX Corpus is used as the annotated corpus to train statistical NE chunkers, and MF is defined experimentally as a set of all morphemes which occur five or more times in IREX corpus. Mainichi Newspaper Corpus (1993­1995), which contains 3.5M sentences consisting of 140M words, is used as the unannotated corpus to calculate context vectors. MeCab2 (Kudo et al., 2004) is used as a preprocessing morphological analyzer through experiments. In this paper, either Conditional Random Fields(CRF)3 (Lafferty et al., 2001) or Support Vector Machine(SVM)4 (Cristianini and Shawe-Taylor, 2000) is employed to train a statistical NE chunker. 4.2 Experiment of IREX Corpus Table 2 shows the results of extracting NEs of IREX corpus, which are measured with F-measure through 5-fold cross validation. The columns of "Proposed" show the results with S F , and the ones of "Baseline" show the results without S F . The column of "NExT" shows the result of using NExT(Masui et http://mecab.sourceforge.net/ http://chasen.org/taku/software/CRF++/ 4 http://chasen.org/taku/software/ yamcha/ 3 2 al., 2002), an NE chunker based on hand-crafted rules, without 5-fold cross validation. As shown in Table 2, machine learning approaches with S F outperform ones without S F . Please note that the result of SVM without S F and the result of (Yamada et al., 2002) are comparable, because our using feature set without S F is quite similar to their feature set. This fact suggests that S F is effective to achieve better performances than the previous research. CRF with S F achieves better performance than SVM with S F , although CRF and SVM are comparable in the case without S F . NExT achieves poorer performance than CRF and SVM. 4.3 Experiment of NHK Corpus Nippon Housou Kyoukai (NHK) corpus is a set of transcriptions of 30 broadcast news programs which were broadcasted from June 1st 1996 to 12th. Table 3 shows the statistics of NEs of NHK corpus which were annotated by a graduate student except 127 Table 4: NE Extraction Performance of NHK Corpus Proposed CRF S VM 0.630 0.595 0.837 0.825 0.988 0.660 0.662 0.636 0.538 0.430 0.794 0.813 0.250 0.224 0.746 0.719 Baseline CRF S VM 0.571 0.569 0.797 0.811 0.971 0.623 0.601 0.598 0.539 0.435 0.752 0.787 0.200 0.247 0.702 0.697 NExT 0.523 0.741 0.996 0.612 0.254 0.622 0.260 0.615 vestigate incorporating S F and these features in the near future. DATE LOCATION MONEY ORGANIZATION PERCENT PERSON TIME Total References Masayuki Asahara and Yuji Matsumoto. 2003. Japanese named entity extraction with redundant morphological analysis. In Proc. of HLT­NAACL '03, pages 8­15. Nello Cristianini and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press. Hideki Isozaki and Hideto Kazawa. 2002. Efficient support vector classifiers for named entity recognition. In Proc. of the 19th COLING, pages 1­7. Hideki Isozaki. 2001. Japanese named entity recognition based on a simple rule generator and decision tree learning. In Proc. of ACL '01, pages 314­321. Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Appliying conditional random fields to japanese morphological analysis. In Proc. of EMNLP2004, p ag es 2 3 0 ­ 2 3 7 . John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML, pages 282­289. Fumito Masui, Shinya Suzuki, and Junichi Fukumoto. 2002. Development of named entity extraction tool NExT for text processing. In Proceedings of the 8th Annual Meeting of the Association for Natural Language Processing, pages 176­179. (in Japanese). Keigo Nakano and Yuzo Hirai. 2004. Japanese named entity extraction with bunsetsu features. Transactions of Information Processing Society of Japan, 45(3):934­941, Mar. (in Japanese). Nichigai Associates, editor. 2007. DCS Kikan-mei Jisho. Nichigai Associates. (in Japanese). Manabu Sassano and Takehito Utsuro. 2000. Named entity chunking techniques in supervised learning for japanese named entity recognition. In Proc. of the 18th COLING, pages 705­711. Satoshi Sekine and Yoshio Eriguchi. 2000. Japanese named entity extraction evaluation: analysis of results. In Proc. of the 18th COLING, pages 1106­1110. E. Tjong Kim Sang. 1999. Representing text chunks. In Proc. of the 9th EACL, pages 173­179. Kiyotaka Uchimoto, Ma Qing, Masaki Murata, Hiromi Ozaku, Masao Utiyama, and Hitoshi Isahara. 2000. Named entity extraction based on a maximum entropy model and transformation rules. Journal of Natural Language Processing, 7(2):63­90, Apr. (in Japanese). Hiroyasu Yamada, Taku Kudo, and Yuji Matsumoto. 2002. Japanese named entity extraction using support vector machine. Transactions of Information Processing Society of Japan, 43(1):44­53, Jan. (in Japanese). Table 5: Extraction of Familiar/Unfamiliar NEs CRF (Proposed) CRF (Baseline) Familiar 0.789 0.757 Unfamiliar 0.654 0.556 Other 0.621 0.614 for ARTIFACT in accordance with the NE definition of IREX. Because all articles of IREX corpus had been published earlier than broadcasting programs of NHK corpus, we can suppose that NHK corpus contains unfamiliar NEs like real input texts. Table 4 shows the results of chunkers trained from whole IREX corpus against NHK corpus. The methods with S F outperform the ones without S F . Furthermore, performance improvements between the ones with S F and the ones without S F are greater than Table 2. The performance of CRF with S F and one of CRF without S F are compared in Table 5. The column "Familiar" shows the results of extracting NEs which consist of familiar morphemes, as well as the column "Unfamiliar" shows the results of extracting NEs which consist of unfamiliar morphemes. The column "Other" shows the results of extracting NEs which contain both familiar morpheme and unfamiliar one. These results indicate that S F is especially effective to extract NEs consisting of unfamiliar morphemes. 5 Concluding Remarks This paper proposes a novel method to extract NEs including unfamiliar morphemes which do not occur or occur few times in a training corpus using a large unannotated corpus. The experimental results show that S F is effective for robust extracting NEs which consist of unfamiliar morphemes. There are other effective features of extracting NEs like N -best morpheme sequences described in (Asahara and Matsumoto, 2003) and features of surrounding phrases described in (Nakano and Hirai, 2004). We will in128