SIGIR 2007 Proceedings Session 27: Domain Specific NLP

Detecting, Categorizing and Clustering Entity Mentions in Chinese Text

Wenjie Li1, Donglei Qian1,2, Qin Lu1, Chunfa Yuan2
1 Department of Computing, The Hong Kong Polytechnic University
cswjli@comp.polyu.edu.hk, csluqin@comp.polyu.edu.hk
2 Department of Computer Science and Technology, Tsinghua University, China
qdl05@mails.tsinghua.edu.cn, cfyuan@mail.tsinghua.edu.cn

ABSTRACT
The work presented in this paper is motivated by the practical need for content extraction, and by the data source and evaluation benchmark available from the ACE program. The Chinese Entity Detection and Recognition (EDR) task is of particular interest to us. This task presents several language-independent and language-dependent challenges, e.g. those arising from the complexity of the extraction targets and from the word segmentation problem. In this paper, we propose a novel solution to alleviate the problems specific to this task. Mention detection takes advantage of machine learning approaches and character-based models. It handles different types of entity mentions and different constituent units (i.e. extents and heads) separately. Mentions referring to the same entity are linked together by a rule-based pairwise clustering algorithm integrating most-specific-first and closest-first strategies. The types of mentions and entities are determined by head-driven classification approaches. The implemented system achieves an ACE value of 66.1 when evaluated on the EDR 2005 Chinese corpus, one of the top-tier results. Alternative approaches to mention detection and clustering are also discussed and analyzed.

Categories and Subject Descriptors
I.7 [DOCUMENT AND TEXT PROCESSING]: Miscellaneous

General Terms
Algorithms, Languages

Keywords
entity mentions in Chinese, mention detection, mention categorization, and mention clustering

1.
INTRODUCTION

Today's global web of electronic information provides a resource of unbounded information-bearing potential. Fully exploiting this potential, however, requires the ability to extract content information from human language automatically. The task of Entity Detection and Recognition (EDR) is one of the most important tasks in information extraction suggested by the Automatic Content Extraction (ACE) program, whose objective is to develop extraction technologies to support automatic processing of human language data. The ACE EDR 2005 task requires detecting the seven types of ACE-defined entities mentioned in the source languages (Arabic, Chinese or English), linking together the mentions that refer to the same entities, and recognizing the types and roles of the mentions and entities identified. It is a task fundamental to many applications, such as text summarization, question answering and machine translation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. Copyright 2007 ACM 978-1-59593-597-7/07/0007 ...$5.00.

The ACE EDR task is motivated by and addresses the same issues as the MUC program that preceded it. However, it defines the targets of extraction in terms of objects (i.e. logical objects in the world) rather than in terms of words. To simulate a real-world application, EDR presents three new challenges. First, the named entity task, as defined in MUC, is to identify those words that are names of entities. In ACE, on the other hand, the corresponding task is to identify all mentions of an entity, whether named (e.g.
George Bush), nominal (e.g. our president), or pronominal (e.g. he). The complicated constitution of nominal mentions makes them more difficult to detect correctly. Second, according to the ACE guidelines, not only the extent but also the head of each mention must be identified. The nested, rather than linear, structure problem then has to be addressed in mention detection. Third, as the targets of extraction are the entities, the mentions corresponding to the same entity must be linked together into the entity (i.e. the set of coreferring mentions). Practical coreference resolution is a must. However, not all entities are named entities. For example, half of the entities in the EDR 2005 Chinese corpus do not contain any name mention. This presents a challenge to traditional coreference resolution methods. Furthermore, there also exist some language-specific challenges arising from the word segmentation problem when working on Chinese.

In this paper, we propose a novel solution to alleviate the aforesaid language-dependent and language-independent difficulties and investigate the use of several classification and clustering approaches, including Conditional Random Fields (CRF) and Support Vector Machines (SVM). The major differences of our work from the others lie in three aspects. First, named, nominal and pronominal mentions are detected separately, in order to avoid the noise introduced into the learning process by the different linguistic characteristics they exhibit. Second, heads and extents are detected separately. Heads are normally detected more accurately, and they play an important role in the subsequent categorization and clustering processes. Third, we do not treat the three types of mentions equally when linking them together into entities. A pairwise clustering algorithm integrating most-specific-first and closest-first rules is suggested.
The final system is developed as a sequence of three separated processes: (1) mention detection, which finds all mentions and their nested heads, if available; (2) mention categorization, which classifies the entity types and the roles of the detected mentions; and (3) mention clustering, which merges the mentions into the clusters corresponding to the entities present in the text. The remainder of the paper is organized as follows. Section 2 briefly reviews the related work on named entity recognition and coreference resolution. Sections 3 to 5 detail the approaches applied in the three parts of our Chinese EDR system. Section 6 then presents the experiment set-up and evaluation results. When evaluated on the EDR 2005 Chinese corpus, the system achieves top-tier results. Section 7 discusses and analyzes alternative classification-based approaches to mention detection and clustering. Finally, Section 8 concludes the paper.

The clustering algorithms then partitioned the mentions into entity clusters based on the pairwise coreference decisions, or went one step further to consider the relationships between the mentions under concern and the entity clusters already constructed [12, 2]. The clustering algorithms were mostly greedy, also known as best-first, clustering algorithms, though closest-first and aggressive-merge clustering algorithms have also been analyzed. The mention pairs under consideration were normally a mention and its possible preceding antecedents; in [5], however, all symmetric pairs were included. The features used in the classification models could be the content match of the two mentions (e.g. their edit distance, the character/word matching percentage, etc.), their relative positions (e.g. the distance, the order), the gender match, which is particular to pronouns, or combinations of these. 3.
SEQUENTIAL TAGGING BASED MENTION DETECTION

3.1 Separated Models for Mention Extents and Heads with Different Types

According to the ACE guidelines, entity mentions are categorized into three common types, namely named mentions (NAM), nominal mentions (NOM) and pronominal mentions (PRO). A NAM is a mention headed by a proper name; it is the most specific kind of mention. A NOM or PRO, in contrast, is headed by a common noun or refers to a person or thing that is previously specified or understood from the context. They are non-specific mentions, though to different degrees. Amongst the three types of mentions, the number of distinct PRO forms is quite limited; thus, the best results can be expected provided sufficient dictionary entries are available. NAM and NOM, however, are normally longer and can be expressed more flexibly. Given the different characteristics they present, we believe it is appropriate to detect the boundaries of the three types of mentions separately, in order to avoid the noise they would otherwise introduce into one another. This idea has been verified experimentally in [10].

For each mention, both the extent (i.e. the whole mention) and its head must be identified. It is natural to find extents first, and then identify the nested heads by analyzing phrase structures. The problem is that if extent detection is less, or even much less, accurate than head detection, the errors it propagates will certainly have a negative impact on head detection in this sequential processing mode. Our solution is to detect extents and heads independently, in parallel. At the least, high accuracy from head detection can be expected. Then, extent and head boundaries are checked for consistency, and the heads are combined with the extents they correspond to. During the combination, head and extent boundaries constrain each other. This provides a second chance for wrongly detected boundaries to be corrected. 2.
RELATED WORK

Entity detection and recognition has close ties to named entity recognition and coreference resolution, which have been the focus of attention in the recent past due to the increasing demand for content extraction. Named entity recognition has been studied extensively in English. Recent research has mainly focused on machine learning approaches, including Hidden Markov Models [13], Maximum Entropy [5], Robust Risk Minimization [3], Support Vector Machines [4], and Transformation-Based Learning [14]. With these approaches, the recognition task was transformed into a sequential tagging or a multi-class classification problem. The lexical, syntactic and semantic features of the words were used as input to various learning algorithms. Most state-of-the-art systems also incorporated external word knowledge into the learning models, such as gazetteers and various types of dictionaries.

The same task in Chinese, however, is more difficult than in English due to the word segmentation problem. Previous studies have shown that most Chinese segmentation errors (about 90%) stem from unknown words, which are mostly names. A word-based Chinese named entity recognition model might take advantage of the boundary information added by word segmentation, but it would also unavoidably suffer from segmentation errors, which can hardly be recovered. A straightforward solution is to adapt a general-purpose segmentation system to the specific task [14]. An alternative solution is to develop models based on characters. It was observed in [6] that pure character-based models could outperform word-based models. Even so, word information was still considered useful. For example, tags indicating the relative position of a character to or in a segmented word [1], or the possible words surrounding the currently focused character within a window [3], were used as features integrated into the character-based models.
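As an aside, the relative-position tags of [1] can be derived from segmenter output with a few lines of code. The sketch below is purely illustrative (the four-tag B/I/E/S scheme is one common convention, not necessarily the exact tag set used in [1]):

```python
def position_tags(words):
    """Map each character of a segmented sentence to its position in its
    word: S = single-character word, B = begin, I = inside, E = end."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags

# A toy segmentation of a six-character sentence into words of length 2, 1, 3.
print(position_tags(["AB", "C", "DEF"]))  # ['B', 'E', 'S', 'B', 'I', 'E']
```

Such tags can then be attached to each character as additional features in a character-based model.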
Recent research in coreference resolution has exhibited a shift from knowledge-based approaches to data-driven approaches [11, 9, 7]. The central idea was to recast coreference resolution as a binary classification task. The classifiers were trained to determine whether two mentions corefer.

3.2 Detection by Integrating Sequential Tagging into Character-based Models

We transform mention detection into a sequential tagging task. Six character-based CRF models are developed with FlexCRFs¹ for the extents and heads of the three types of mentions. We regard each single character as a token and encode character-related information (such as surface N-gram patterns) and word-related information (such as dictionaries and word segmentation outputs) as the associated features. The sequential tagging is encoded with the typical "B-I-O" scheme. Tag "B" is for the token that begins an extent or a head, "I" for the tokens that are inside, and "O" for the tokens that are outside. Given an observation sequence o = {o_1, ..., o_T} and a label sequence s = {s_1, ..., s_T}, CRFs define the following conditional probability:

    P(s|o) = (1/Z(o)) exp( Σ_{t=1}^{T} Σ_{k=1}^{K} λ_k f_k(o(t), s_{t-1}, s_t) )

where Z(o) is the normalization factor applied to all label sequences, f_k denotes one of the K edge and state feature functions, and λ_k is the feature weight associated with f_k. Given a training set {(o^1, s^1), ..., (o^N, s^N)}, the feature weights are trained to maximize the log-likelihood

    L = Σ_{i=1}^{N} log P(s^i|o^i) − Σ_{k=1}^{K} λ_k^2 / (2σ^2)

where the second term is used as the smoothing function, and σ^2 denotes the variance of all feature weights. For an input observation sequence o, the optimal tagging sequence is obtained by the following formula using the Viterbi algorithm:

    s* = argmax_s P(s|o)

Table 1 presents the features we select and investigate for both extents and heads, where index 0 indicates the currently focused character, and indices -n/n indicate the characters n positions to the left or right of the current character (n ∈ {1, 2}). In Table 1, the seven character-based N-gram features are applied to the three types of mentions and are deemed the fundamental features. The dictionary-based and/or segmentation-based features are used in combination with the character-based features when they are necessary (see Section 6.1). We manually construct three dictionaries, particularly for detecting person names, location names and pronouns. They are lists of Chinese surnames, of countries and capital cities in the world together with China's provinces and cities, and of Chinese pronouns.

Word information associated with a character can be encoded in two ways: by checking whether word boundaries are inserted around the focused character by a segmentation system (we call these segmentation-based features), or by checking whether the character together with its surrounding characters can form words in given dictionaries (we call these word-based features). The latter requires a larger search space, involves more uncertainty, and resulted in unsatisfactory performance in our experiments. Taking "中国人" (Chinese people) as an example, the character "人" (person) can be a part of three potential words. The segmented boundary, if it is correct, can more accurately indicate where this character is located (e.g. at the beginning or the end) in a particular word. The segmentation-based features we use consider the left or the right boundaries. They examine whether boundaries are inserted by the segmentation system immediately preceding or following the character of concern. To obtain these features, a dictionary-based bi-directional maximum matching segmentation system is implemented.

3.3 Extent and Head Combination

When mention heads and extents are detected independently, their boundaries are sometimes inconsistent. In other words, the extent and the head of the same mention might overlap due to incorrect boundary identification. There are also cases where the head of an extent is missed, or vice versa. We apply heuristic rules to correct the inconsistent or missing cases, and combine the heads with the extents they belong to. The rules we use include removing surplus extents if no heads are found to associate with them, adding extents (identical to the heads) if heads have no associated extents, and expanding extents so that entire heads are included if heads overlap extent boundaries. Heads have high priority in rectifying extents during combination.

4. HEAD-DRIVEN MENTION CATEGORIZATION

Entity types are additional information used to characterize the entities mentioned. In ACE EDR 2005, the entities are categorized into the following seven types: person, organization, geo-political entity, location, facility, vehicle and weapon. They are further divided into 45 subtypes. Since the mentions are instances of the entities, they naturally inherit the type attributes of the entities they refer to. We therefore identify the types of the mentions instead of the entities in the first place, and then use them as clues for mention clustering in the subsequent process. Type identification is a typical classification task. The classifier we choose is the Support Vector Machine (SVM), which has been successfully used for classification and regression in natural language processing and text categorization.

¹ Freely downloadable from http://www.jaist.ac.jp/~hieuxuan/flexcrfs/flexcrfs.html
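The combination heuristics of Section 3.3 can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: spans are assumed end-exclusive, and a head is associated with the first extent it intersects.

```python
def combine(extents, heads):
    """Sketch of the Section 3.3 heuristics. extents/heads are (start, end)
    spans, end-exclusive. Returns (extent, head) pairs; heads take priority
    when rectifying extent boundaries."""
    combined = []
    unmatched_heads = set(heads)
    for ext in extents:
        # A head associates with an extent if their spans intersect.
        cands = [h for h in heads if h[0] < ext[1] and ext[0] < h[1]]
        if not cands:
            continue  # surplus extent with no head: removed
        head = cands[0]
        unmatched_heads.discard(head)
        # Expand the extent so that the entire head is included.
        combined.append(((min(ext[0], head[0]), max(ext[1], head[1])), head))
    for h in sorted(unmatched_heads):
        combined.append((h, h))  # head with no extent: extent = head
    return combined

print(combine(extents=[(0, 4), (10, 12)], heads=[(2, 6), (20, 22)]))
# [((0, 6), (2, 6)), ((20, 22), (20, 22))]
```

Here the first extent is expanded to cover its overlapping head, the headless second extent is removed, and the extentless head becomes its own extent.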
Given a training set {(x_1, y_1), ..., (x_l, y_l)}, where x_i ∈ R^n and y_i ∈ {1, −1} (i = 1, 2, ..., l), the SVM algorithm solves the following primal problem:

    min_{w,b,ξ} (1/2)||w||^2 + C Σ_{i=1}^{l} ξ_i
    subject to y_i(w · x_i + b) ≥ 1 − ξ_i; ξ_i ≥ 0, i = 1, ..., l

Its dual is:

    max_α Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i y_i α_j y_j K(x_i, x_j)
    subject to Σ_{i=1}^{l} α_i y_i = 0; α_i ∈ [0, C], i = 1, ..., l

The decision function is

    f(x) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )

where C denotes the extra cost for errors in non-separable cases. We use the following radial basis kernel function:

    K(x_i, x) = exp(−γ |x_i − x|^2)

Table 1: Features used with CRF for Mention (Extent and Head) Detection

Character-based Features
  Character Unigram and Bigram: c_-2, c_-1, c_0, c_1, c_2, c_-1 c_0, c_0 c_1
Dictionary-based Features
  Surname Unigram: IS_SURNAME(c_0)
  Location Name Bigram: IN_LOCATION_NAME(c_-1 c_0), IN_LOCATION_NAME(c_0 c_1)
  Pronoun Unigram and Bigram: IS_PRONOUN(c_0), IS_PRONOUN(c_-1 c_0), IS_PRONOUN(c_0 c_1)
Segmentation-based Features
  Segmented Left/Right Boundary: LEFT_BOUNDARY_BEFORE(c_0), RIGHT_BOUNDARY_AFTER(c_0)

As classical SVM classifiers generate only binary results, we use the one-against-the-rest approach to produce multi-class classification outputs. We implement a set of SVM classifiers to identify the mention type first and then the subtypes under the type each mention belongs to². Based on our previous experiments, it is reasonable to believe that head information is more related to the type (or subtype) of a mention than the information provided by a diversified extent. All classifiers are therefore trained on the characters appearing in mention heads or in the head context ([−2, +2]). Considering that the same character plays different roles when appearing at different positions (e.g.
inside the head and next to the boundaries, inside the head but away from the boundaries, outside the head but within the context window), different weights are assigned to it according to where it is located (see Table 2). Head boundary characters are more informative.

Table 2: Features and Assigned Weights
Position             | Feature Scope                  | Weight
Beside Head Boundary | c_0, c_n                       | 1
Inside Head          | c_1 - c_{n-1}                  | 0.8
In Head Context      | c_-2, c_-1, c_{n+1}, c_{n+2}   | 0.5

Note that this time the classifiers are not built separately for NAM, NOM and PRO, as they are in mention detection. Previous attempts to build separate classifiers failed to generate better results. We wonder whether a performance improvement could be achieved when a larger training corpus is available.

² In fact, ACE EDR 2005 also requires categorizing the entity classes and roles. Although in this paper we introduce the approaches and experiments for entity type categorization, the same approaches also apply to class and role categorization.

5. RULE BASED MENTION CLUSTERING

An entity is a set of mentions which refer to the same object. The mentions coreferring to one another are observed to have the following common characteristics. First, in terms of the content it conveys, NAM is the most specific in expressing an entity; NOM takes second place; PRO is the most non-specific, given its flexibility in coreferring to any other specific mention. On the other hand, from the discourse point of view, most pronouns corefer to entities mentioned in the preceding one or two sentences, so coreference resolution of PRO is preferred in the local context. NAM and NOM, in contrast, are allowed to corefer freely within a document, which is considered the global context.

Four simple yet effective clustering rules are specially designed to accommodate the above-mentioned characteristics. The rules mainly use types (and/or subtypes), content matches of heads, and relative positions, particularly for PRO, to decide whether or not to link two mentions together. The process of clustering is like a snowball rolling process. Rules are applied starting from the most specific mention (i.e. NAM) pairs. NAM-NAM pairs are examined first; both the linked and the dangling NAM mentions are recognized as named entities. Then, NOM mentions are examined to see if they can be added to any existing named entity, i.e. NOM-NAM pairs are examined. If the rule allows, they become part of the named entities. The next rule goes on to check NOM-NOM pairs. Similarly, both the linked and dangling NOM mentions are recognized as nominal entities if they are not directly or indirectly connected with any NAM mention. Finally, the most non-specific PRO mentions are examined and added to the existing named or nominal entities if they can be linked to any preceding NOM or NAM mention. When more than one link is allowed, the closest one is picked. If no link is allowed, PRO-PRO pairs are examined and the pronominal entities are constructed last. Figure 1 illustrates two named entities and one nominal entity.

Figure 1: Linked Mention and Entity Sets

Note that two nominal mentions are not allowed to link together if they have been included in two different named entities (e.g. NOM1 and NOM2 are not linked). This constraint is to avoid merging two different named entities together (such as "President Clinton" and "President Bush"). The advantage of using a rule-based approach is to ensure a high precision.

6. EXPERIMENT AND EVALUATION

The corpora used for training and evaluating the learning models are provided by the Linguistic Data Consortium (LDC) for the ACE 2005 Chinese EDR task. There are 3,907 entities and 9,198 mentions in the evaluation corpus. Among them, 46.6%, 50.8% and 2.5% of the entities are named, nominal and pronominal entities respectively. The pie graphs in Figure 2 show the mention distributions.
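Before turning to the evaluation, the most-specific-first, closest-first procedure of Section 5 can be sketched as follows. This is a simplified illustration: the `corefer` predicate stands in for the paper's four rules, and the toy same-head test below is our own. The constraint that nominal mentions in different named entities must not be linked is omitted for brevity.

```python
def cluster_mentions(mentions, corefer):
    """Most-specific-first clustering sketch. mentions are (id, kind)
    tuples in document order, kind in {"NAM", "NOM", "PRO"}; corefer(a, b)
    stands in for the rule-based linking decision. NAM pairs are resolved
    first, then NOM, then PRO; each mention joins the closest compatible
    entity or starts a new (possibly singleton) entity."""
    entities = []  # list of entities, each a list of (id, kind)
    for wanted in ("NAM", "NOM", "PRO"):
        for m in [m for m in mentions if m[1] == wanted]:
            # Closest-first: prefer the most recently extended entity.
            for ent in reversed(entities):
                if any(corefer(m, other) for other in ent):
                    ent.append(m)
                    break
            else:
                entities.append([m])  # dangling mention forms its own entity
    return entities

# Toy rule: mentions corefer iff they share the same head string.
mentions = [("Clinton", "NAM"), ("president", "NOM"), ("Clinton", "NAM")]
same_head = lambda a, b: a[0] == b[0]
print(cluster_mentions(mentions, same_head))
# [[('Clinton', 'NAM'), ('Clinton', 'NAM')], [('president', 'NOM')]]
```

Processing NAM before NOM and PRO means the most reliable links anchor the entities, mirroring the snowball-rolling description above.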
6.1 Evaluation on Mention Detection

The purposes of conducting the following sets of experiments are to reveal how the proposed features contribute to mention head and extent detection, and how effective the suggested sequential tagging approach is.

6.2 Evaluation on Mention Categorization

We use the accuracy criterion to evaluate the performance of classification, and set up the experiments on the true mentions extracted from the ACE evaluation corpus instead of on the mentions detected by our proposed methods. Tables 5 and 6 present the accuracies of types and subtypes separately.

Table 5: Type Classification Accuracy based on TRUE Mentions
Type | Accuracy
NAM  | 89.87
NOM  | 91.36
PRO  | 87.83
ALL  | 90.38

The overall type accuracy of 90.38% is quite competitive. But the accuracy of LOC is much lower than the accuracies of the other subtypes. We checked the errors related to LOC mentions. The reasons are twofold: there are fewer training samples for LOC, and possibly its subtype categories are not well defined.

Figure 2: Mention Distribution in Evaluation Corpus

The results are evaluated according to the standard Precision-Recall-F-measure criteria. They are summarized in Table 3. As expected, head detection is more accurate than extent detection for all types of mentions, and PRO detection is more accurate than NAM and NOM detection. But the differences between head and extent detection results are not significant for NAM and PRO, although we do see a 25% difference for NOM. This can be easily explained in terms of data distribution. Many mentions may share the same head and extent boundaries. The percentage of such cases is about 90% for NAM or PRO, but only 35% for NOM.
It is interesting to find that, if we regard character N-grams as the basic set of features for all types of mentions, the value added by word segmentation features is more remarkable than that added by dictionary-based features in both head and extent detection. It appears that segmentation-based features can discover additional patterns which are not covered by dictionary-based features, even for PRO mentions. This is unanticipated. The reasons may be twofold. The incompleteness of the name lists is obviously one reason; however, it is impossible to make them complete in any way. When taking a closer look at the results, we find that segmentation-based features can indirectly recover some errors made in mention detection by correctly grouping non-mention words together. Take the phrase glossed as "returning native Jin Men people" as an example, where "Jin Men" is a location name. Without segmentation-based features, a wrong name span is detected. After adding segmentation-based features, the location name is correctly detected, simply because the surrounding characters are grouped together as a common general noun. What surprises us is that N-gram features alone do not work at all in detecting PRO extents, or even heads.

Table 4 presents the evaluations on mentions after combining their extents and heads. The overall F-measure of 70.95% is promising, although there is still room for improvement. One may ask what happens if mentions of different types are detected together with one unified classifier for either heads or extents. Previous experiments have shown that the unified classifiers perform worse than the separate ones [10].

Table 4: Mention Detection Results after Head and Extent Combination
     | Precision | Recall | F-measure
NAM  | 81.40     | 75.00  | 78.07
NOM  | 61.05     | 59.14  | 60.08
PRO  | 89.93     | 85.73  | 87.78
ALL  | 73.00     | 69.01  | 70.95

6.3 Evaluation on Mention Clustering

The mention clustering algorithm is evaluated by the Precision-Recall-F-measure criteria, and tested on true mentions (true boundaries and true types/subtypes).
As mention clustering is the last step in the EDR task, once it is done, the final results are available. Therefore, we evaluate using the provided ACE EDR evaluation tool. The result is shown in Figure 3³. The top F-measures come from LOC, GPE and ORG. Overall, the F-measure of 87.4% is quite a good performance.

Figure 3: ACE Evaluation Results based on TRUE Mentions and Types

Compared with the learning models, handcrafted rules are more accurate and lead to high precision. In fact, we have implemented both ways. The rule-based approach ends up with a better F-measure and saves over two days of training time. It has the advantage of being more flexible in integrating constraints and handling special cases. Yet, it is still weak in handling the coreference resolution of PRO, especially those PRO mentions which have no coreferring NAM or NOM in the text at all (i.e. they belong to PRO entities). This is the most serious problem, which other researchers may also encounter.

6.4 ACE EDR Evaluation

Finally, the whole system is evaluated. The features used in mention detection are those asterisked in Table 3.

³ FA: spurious entities output; Miss: entities missed; Err: entities detected but misrecognized; Tot: reference entities.
    Pre = (Tot − Miss − Err) / (Tot + FA − Miss)
    Rec = (Tot − Miss − Err) / Tot

Table 3: Mention Extent and Head Detection Results before Combination

Mention Type = NAM
Feature Used              | Extent Alone (P / R / F) | Head Alone (P / R / F)
Character                 | 76.86 / 69.86 / 73.19    | 81.46 / 73.22 / 77.12
Character + Dictionary    | 76.69 / 71.21 / 73.85    | 79.75 / 75.03 / 77.32
Character + Segmentation  | 78.23 / 71.50 / 74.71    | 82.19 / 75.10 / 78.49
All*                      | 78.21 / 72.66 / 75.33    | 83.15 / 75.32 / 79.04

Mention Type = NOM
Feature Used              | Extent Alone (P / R / F) | Head Alone (P / R / F)
Character                 | 65.26 / 59.81 / 62.42    | 79.26 / 77.45 / 78.34
Character + Segmentation* | 64.55 / 64.11 / 64.33    | 83.57 / 78.49 / 80.95

Mention Type = PRO
Feature Used              | Extent Alone (P / R / F) | Head Alone (P / R / F)
Character                 | 0.00 / 0.00 / 0.00       | 0.00 / 0.00 / 0.00
Character + Dictionary    | 93.39 / 84.31 / 88.62    | 93.58 / 87.71 / 90.55
Character + Segmentation  | 92.85 / 85.86 / 89.22    | 94.35 / 87.71 / 90.91
All*                      | 92.32 / 85.73 / 88.90    | 93.49 / 88.22 / 90.78

Table 6: Subtype Classification Accuracy based on TRUE Mentions
     | FAC   | GPE   | LOC   | ORG   | PER   | VEH   | WEA
NAM  | 91.42 | 94.35 | 78.26 | 92.27 | 97.85 | 71.42 | 100.0
NOM  | 89.66 | 94.17 | 75.33 | 94.25 | 76.90 | 98.04 | 94.26
PRO  | 63.63 | 87.67 | 83.33 | 58.33 | 94.43 | 75.00 | 83.33
ALL  | 89.40 | 94.12 | 76.35 | 92.12 | 85.96 | 93.29 | 93.90

All the ACE participating systems are evaluated and ranked according to EDR values (under the "Cost" columns in Figure 4), though the standard Precision-Recall-F-measure criteria are given as well. The EDR values in Figure 4 are normalized values with different cost parameters assigned to different types of mentions. Among all types, GPE has the highest value. It benefits from the use of external location lists in mention detection. The FAC and WEA values are comparatively low due to the small number of training samples. The PER value is also low. We simply use a Chinese surname list in PER detection.
The list might be insufficient compared to the characters appearing in Chinese names and in translated foreign names. We consider introducing additional word lists as a complement in the future. Table 7 compares the ACE value of our system with the official results of the best and worst ACE 2005 Chinese EDR participating systems. Our system is in the leading group.

Table 7: Comparisons with Other ACE CEDR Systems
        | ACE Value
Best    | 69.2
Worst   | 3.8
Average | 51.2
Ours    | 66.1

7. DISCUSSION

7.1 Weighted Bayes Classification based Mention Detection

Training CRF models for sequential tagging based mention detection is quite time-consuming. Given limited hardware and software resources, the following question naturally comes to mind: can we make use of the same set of features, but simply assume that the characters are independent and focus on one single character at a time, so as to avoid long-distance dependencies in the sequence? If this assumption held, mention detection would become much more efficient. We implement this idea by considering boundary detection as a classification problem and adopting a weighted Bayes classification algorithm. Any character can be followed (or preceded) by a left boundary, a right boundary, or neither. We train two binary Bayes classifiers, one for left boundaries (B^L) and one for right boundaries (B^R); the left and right boundaries are identified independently. The decision rule is given below, where n is the number of features used, f_i denotes the i-th feature, and λ_i is the weight of feature f_i, tuned experimentally:

    B^{L/R}_NB = argmax_{B^{L/R} ∈ {T,F}} P(B^{L/R}) Π_{i=1}^{n} P(f_i | B^{L/R})^{λ_i}

The features used in the B^L or B^R classifiers are listed in Table 8.
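A minimal sketch of the weighted Bayes decision rule above, computed in log space for numerical stability (the probability tables here are toy numbers, not estimates from the corpus):

```python
import math

def weighted_nb(prior, cond, feats, weights):
    """Weighted naive-Bayes decision for one boundary classifier:
    argmax over labels of P(B) * prod_i P(f_i|B)^lambda_i.
    prior: {label: P(B)}; cond: {(feature, label): P(f|B)}, assumed to be
    estimated from training counts beforehand."""
    best, best_score = None, -math.inf
    for label in prior:
        score = math.log(prior[label]) + sum(
            w * math.log(cond[(f, label)]) for f, w in zip(feats, weights))
        if score > best_score:
            best, best_score = label, score
    return best

# Toy numbers: one feature strongly favouring a boundary (label True).
prior = {True: 0.3, False: 0.7}
cond = {("f1", True): 0.9, ("f1", False): 0.2}
print(weighted_nb(prior, cond, ["f1"], [1.0]))  # True
```

Raising each conditional probability to the weight λ_i lets informative features dominate the product, which is the only difference from plain naive Bayes.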
Figure 4: ACE Evaluation Results based on Identified Mentions and Types

Note that, even though the features considered in Bayes classification are mostly the same as those used in sequential tagging, bigram pronoun features are not necessary, and the IN/IS features are replaced with START and END features. Different from sequential tagging, where boundaries are guaranteed to be properly paired, the left/right boundaries identified by the Bayes classifiers are independent; pairing is thus required. We have experimented with minimum-distance and maximum-distance pairing. The former performs better. The experiment results are illustrated in Table 9.

Comparing Tables 3 and 9, sequential tagging detection significantly outperforms classification-based detection for all three types of mentions. The increments range from 3% to 18%. The precisions of the two approaches are distributed quite similarly among the three mention types, but the recall of the sequential tagging approach, especially for NAM, is significantly higher than that of the classification approach. The sequential tagging approach does not require an additional boundary matching process, and thus avoids the additional errors introduced by pairing boundaries in the classification approach. Taking NAM head detection as an example, the F-measures of the left and right boundaries alone are 86.34% and 78.78% respectively before pairing. However, the F-measure drops significantly to 69.6% after pairing. We have tried many other ways to combine the boundaries; unfortunately, none approaches the 79% achieved by the sequential tagging approach. The independence assumption definitely does not hold.

Compared with the evaluation results shown in Figure 3, the learning-based algorithm detects four times as many spurious entities as the rule-based algorithm does, and about 1/6.5 as many missing entities. The lower precision leads to the overall lower F-measure.
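To illustrate why link-based merging is precision-sensitive, the following union-find sketch of aggressive-merge (transitive) clustering shows a single spurious pairwise link fusing two true entities:

```python
def aggressive_merge(n, links):
    """Union-find sketch of aggressive-merge clustering over n mentions:
    every pairwise coreference link merges two clusters, so one spurious
    link can fuse two true entities into one."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for a, b in links:
        parent[find(a)] = find(b)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Two true entities {0, 1} and {2, 3}; one spurious link (1, 2) fuses them.
print(aggressive_merge(4, [(0, 1), (2, 3)]))          # [[0, 1], [2, 3]]
print(aggressive_merge(4, [(0, 1), (2, 3), (1, 2)]))  # [[0, 1, 2, 3]]
```

The incremental rule-based procedure avoids this failure mode by never linking a mention into an entity once it has been committed to another.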
Figure 5: Learning-based Coreference Resolution Results based on TRUE Mentions and Typ es Features used in both algorithms are almost the same except for those adapted for a particular algorithm. But the difference between their F-measures is more than 10%. After taking a closer look, we find the main difference lies in the way of applying these features. Note that the mentions are merged in SVM-based algorithm depends on the links identified among all the mentions. A wrongly identified link may merge two true entities as an incorrect one. However, the merging in rule-based algorithm starts from the most specific mentions, forms and completes the entity sets incrementally. In such a way, the mentions which have been included in an entity set are not allowed to link to any mentions already included in another entity. 7.2 Learning-based Mention Clustering Recent researches in coreference resolution have yielded data-driven systems that rival their hand-crafted counterparts in performance [8]. In our study, the learning-based mention clustering is also implemented in attempt to solving the resolution problem under a more generalized framework. A SVM classifier encodes the linking rules mentioned in Section 5 into the features and decides whether any two mentions detected corefer or not. The features include one real value feature, i.e. the degree of head similarity, and four binary features, i.e. whether two mentions occur in the same sentence, or one before another, whether they are apposition, whether they are of the same entity subtype and whether one is the abbreviation of the other. The coreferred mentioned are then linked together as an entity. Algorithms similar to this one are called aggressive-merge-clustering in [8]. SVM-based coreference resolution actually encodes additional information by considering the content similarity and the position information in a more flexible way. Thus, it achieves quite a high recall. However, it suffers from low precision. 
Figure 5 shows entity (after mention clustering) precision and recall which are based on the best SVM parameters tuned experimentally4 . Compared with the evaluation re4 8. CONCLUSION This paper presents our recent work on Chinese entity detection and tracking. A novel solution is proposed to alleviate the language-independent and language-dependent problems special in this task. The system is implemented in the sequence of three separated processes, namely mention detection, mention categorization and mention clustering. Detection takes advantages of machine learning approaches and character-based models. It manipulates different types of entities being mentioned and different constitution units (i.e. extents and heads) separately. Mentions referring to the same entity are linked together by integrating mostspecific-first and closed-first rule based pairwise clustering algorithms. Types are identified by head-driven classification approaches. The system achieves ACE value of 66.1, which has been one of the top-tier results. The preliminary success encourages us to further explore how can effectively integrate three processes together in the future. 9. ACKNOWLEDGMENTS The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong (parameter in the kernel function). 
Tunable SVM parameters are C (extra cost for errors) and 653 SIGIR 2007 Proceedings Session 27: Domain Specific NLP Table 8: Bayes Classifier Features Boundary Left Features Boundary Right Features c-2 , c-1 , c0 , c1 , c2 , c-1 c0 , c0 c1 c-2 , c-1 , c0 , c1 , c2 , c-1 c0 , c0 c1 START OF LOCATION NAME(c-1 c0 ) END OF LOCATION NAME(c-1 c0 ) START OF LOCATION NAME(c0 c1 ) END OF LOCATION NAME(c0 c1 ) IS SURNAME(c0 ) IS SURNAME(c0 ) START OF PRONOUN(c0 ) END OF PRONOUN(c0 ) LEFT BOUNDARY BEFORE(c0 ) RIGHT BOUNDARY AFTER(c0 ) Table 9: Mention Extent and Head Detection Results by Weighted Bayes Classification Mention Type = NAM Feature Used Extent Alone Head Alone Precision Recall F-measure Precision Recall F-measure Character 80.21 57.84 67.21 75.31 60.60 67.16 Character + Dictionary 79.45 58.73 67.53 76.71 60.19 67.45 Character + Segmentation 80.64 62.51 70.43 81.30 60.10 69.11 All 80.10 63.14 70.62 81.05 60.99 69.60 Mention Type = NOM Extent Alone Head Alone Precision Recall F-measure Precision Recall F-measure 66.19 50.24 57.13 83.10 69.96 75.97 64.46 51.90 57.50 83.90 70.46 76.59 Mention Type = PRO Extent Alone Head Alone Precision Recall F-measure Precision Recall F-measure 91.94 74.05 82.04 91.78 75.80 83.02 92.09 81.58 86.51 93.01 81.81 87.05 94.65 75.74 84.14 95.74 76.56 84.46 93.09 82.23 87.32 95.10 80.66 87.56 S. Roukos. A mention-synchronous coreference resolution algorithm based on the bell tree. In proceedings of ACL, pages 136­143, 2004. V. Ng. Machine learning for coreference resolution: From local classification to global ranking. In proceedings of ACL, pages 157­164, 2005. V. Ng and C. Cardie. Improving machine learning approaches to coreference resolution. In proceedings of ACL, pages 104­111, 2002. D. Qian, W. Li, C. Yuan, Q. Lu, and M. Wu. Applying machine learning to chinese named entity detection and tracking. In proceedings of CiCling, pages 154­165, 2007. W. Soon, H. Ng, and C. Lim. 
Machine learning approach to coreference resolution of noun phrases. In proceedings of Computational Linguistics, pages 521­544, 2001. X. Yang, J. Su, G. Zhou, and C. Tan. An np-cluster based approach to coreference resolution. In proceedings of COCLING, pages 23­27, 2004. G. Zhou and J. Su. Named entity recognition using an hmm-based chunk tagge. In proceedings of ACL, pages 473­480, 2002. Y. Zhou, C. Huang, J. Gao, and L. Wu. Transformation based chinese entity detection and tracking. In proceedings of IJCNLP, pages 232­237, 2005. Feature Used Character Character + Segmentation Feature Used Character Character + Dictionary Character + Segmentation All Kong (pro ject number: CERG PolyU5211/05E) and partially supported by a grant from the National Natural Science Foundation of China (pro ject number: 60573186). 10. REFERENCES [8] [1] W. Chen, Y. Zhang, and H. Isahra. Chinese named entity recognition with conditional random fields. In proceedings of SIGHAN, pages 118­121, 2006. [2] R. Florian, H. Hassan, A. Ittycheriah, H. Jing, N. Kambhatla, X. Luo, N. Nicolov, and S. Roukos. A statistical model for multilingual entity detection and tracking. In proceedings of HLT/NAACL, pages 1­8, 2004. [3] H. Guo, J. Jiang, G. Hu, and T. Zhang. Chinese named entity recognition based on multilevel linguistics features. In proceedings of IJCNLP, pages 90­99, 2005. [4] H. Isozaki and H. Kazawa. Efficient support vector classifiers for named entity recognition. In proceedings of IJCNLP, pages 1­7, 2002. [5] A. Ittycheriah, L. Lita, N. Kambhatla, N. Nicolov, S. Roukos, and M. Stys. Identifying and tracking entity mentions in a maximum entropy framework. In proceedings of HLT/NAACL, pages 40­42, 2003. [6] H. Jing, R. Florian, X. Luo, T. Zhang, and A. Ittycheriah. Howtogetachinesename (entity): Segmentation and combination issues. In proceedings of EMNLP, pages 200­207, 2003. [7] X. Luo, A. Ittycheriah, H. Jing, N.Kambhatla, and [9] [10] [11] [12] [13] [14] 654