Topic Segmentation Using Weighted Lexical Links (WLL)

Laurianne Sitbon
Computer Science Lab. LIA, University of Avignon
Avignon, 84000, France
laurianne.sitbon@univ-avignon.fr

Patrice Bellot
Computer Science Lab. LIA, University of Avignon
Avignon, 84000, France
patrice.bellot@univ-avignon.fr

ABSTRACT
This paper presents two new lexical-chain approaches to topic segmentation, using weighted lexical chains (WLC) or weighted lexical links (WLL) between repeated occurrences of lemmas along a text. The main advantage of these new approaches is the suppression of the empirical parameter called hiatus in lexical chain processing. An evaluation according to the WindowDiff measure on a large automatically built corpus shows slight improvements of WLL compared to state-of-the-art methods based on lexical chains.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing
General Terms: Performance
Keywords: Linear topic segmentation, Unsupervised segmentation, Lexical links, Lexical chains.

1. INTRODUCTION
Linear topic segmentation consists in placing boundaries inside a text at the positions where the topic changes. An approach based on lexical chains, introduced in [4], was recently integrated in LCseg [3]. Results indicate superior performance over state-of-the-art unsupervised methods based on lexical cohesion such as C99 [1], which uses a local ranking of the sentence similarity matrix. This unsupervised approach uses word distribution and collocations in the text. It is topic and style independent and, to a certain extent, language independent.

Our first proposal considers that a weight can introduce a differentiation between syntactic categories of repeated words and between lemmas occurring at different density levels, leading to the concept of weighted lexical chains (WLC). Our second proposal extends the concept of lexical chains to weighted lexical links (WLL). Lexical links give an adapted weight to each repetition in the text and, in contrast to lexical chains, can be either bounded or unbounded.

2. SEGMENTATION WITH WLL
A lexical chain links repetitions of a lemma when the gap between two repetitions is smaller than a certain value, called the hiatus. A lemma can generate several chains or none, depending on the hiatus and on the distribution of its occurrences. The classical use of this structure is to determine whether the chain of a lemma is active at a given position in the text, i.e. whether there is an occurrence of the lemma at a distance (in words or sentences) smaller than the hiatus h. We consider that the influence of each lemma covers at least h sentences before and after each occurrence.

Figure 1: Lexical chain construction for four lemmas A, B, C and D along a text (vertical lines indicate sentence boundaries). [figure not reproduced]

Figure 1 illustrates the process of lexical chain construction for a lemma and how lexical chains can occur in a text. The vertical lines indicate sentence boundaries. The hiatus used for this example is 3 sentences. The occurrences of lemma "A" illustrate the importance of the hiatus in determining the boundaries of a chain. The chains of lemmas "D" and "B" show that these two lemmas are repeated in two distinct parts of the text that may belong to two different topics. The two chains of lemma "C" illustrate that a single lemma can occur frequently in several distinct parts of the text. This example of four chained lemmas also suggests a topic boundary between the last sentence where the lexical chain of lemma "A" is active and the first sentence activating the chain of lemma "B", since no lexical chain is active at this position.
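As an illustration of the chain construction just described, the following sketch groups the sentence positions of each lemma's occurrences into chains that are split wherever the gap exceeds the hiatus. The function name, data layout and toy example are ours and are not taken from the authors' implementation.

```python
from typing import Dict, List, Tuple

def build_lexical_chains(occurrences: Dict[str, List[int]],
                         hiatus: int = 3) -> Dict[str, List[Tuple[int, int]]]:
    """Group the sentence positions of each lemma into chains.

    A new chain is started whenever the gap between two successive
    occurrences of the same lemma exceeds the hiatus (in sentences).
    Returns, for each lemma, a list of (start_sentence, end_sentence) chains.
    """
    chains: Dict[str, List[Tuple[int, int]]] = {}
    for lemma, positions in occurrences.items():
        positions = sorted(positions)
        lemma_chains = []
        start = prev = positions[0]
        for pos in positions[1:]:
            if pos - prev > hiatus:          # gap too large: close the current chain
                lemma_chains.append((start, prev))
                start = pos
            prev = pos
        lemma_chains.append((start, prev))
        chains[lemma] = lemma_chains
    return chains

# Toy example in the spirit of Figure 1: lemma "C" yields two chains,
# lemma "A" a single one.
occ = {"A": [0, 1, 2, 4, 5, 6], "C": [3, 4, 5, 10, 11, 12]}
print(build_lexical_chains(occ, hiatus=3))
```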
Intuitively, the nearer the repetitions of a given lemma, the more significant the lemma is for the topic. However, the classical use of lexical chains is binary: a lexical chain is either active or not, which does not capture differences in lemma significance. That is why we propose to weight the chains according to their lemma density. Thus, at a given point of the text, a chain is active to a varying degree rather than in the previous binary fashion. We also hypothesise that this weight can depend on the syntactic or semantic category of the lemma (proper nouns may be more representative of a topic than other categories). The weight of a chain is computed as:

$W(chain, m) = cat_m \times nb_m \times \log(L_{text} / L_{chain})$   (1)

where $cat_m$ is the syntactic weight of the lemma, $nb_m$ is the number of occurrences of lemma m in the chain, $L_{text}$ is the length of the text and $L_{chain}$ is the length of the chain.

The example given for lemma "B" in Figure 1 suggests that an adaptive hiatus could break the final link in the chain and increase its weight. The adaptive hiatus is computed as the average distance between the occurrences of the lemma in the whole text. This defines the notion we refer to as weighted lexical chains (WLC).

Another option allowed by the weighting scheme is to use no hiatus at all. This means that any chain starts at the first occurrence of its lemma and ends at the last one. It also means that there is exactly one chain associated with each lemma. As the notion of chains supposes that one knows the maximal length of the links composing it, this new principle is referred to as weighted lexical links (WLL).

After processing WLC or WLL, we compute a similarity score between successive sentences [4]. The score compares the active lexical chains or links of the preceding and following sentences:

$sim(A, B) = \frac{\sum_m w(A, m) \times w(B, m)}{\sqrt{\sum_m w^2(A, m) \times \sum_m w^2(B, m)}}$   (2)

where $w(X, m)$ is the weight of the lexical chain or link C activated in sentence X and corresponding to lemma m, given by $W(C, m)$ in Equation 1.

Lastly, boundaries are selected between the least similar pairs of sentences, where $sim(A, B)$ is lower than the empirically derived threshold:

$sim_{limit} = \mu + K \times \sigma^2$   (3)

where $\mu$ and $\sigma^2$ are the mean and the variance of all computed similarities [3]. K is a parameter that can be empirically fixed or that can be computed dynamically to satisfy a user need such as an average segment size.
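The weighting, similarity and boundary-selection steps of Equations 1-3 can be sketched as follows. This is a minimal illustration assuming $cat_m = 1$ for every lemma and pre-computed per-sentence weight dictionaries; the function and variable names are ours, not taken from the authors' tool.

```python
import math
from typing import Dict, List

def link_weight(nb_occurrences: int, link_length: int, text_length: int,
                cat: float = 1.0) -> float:
    """Equation 1: W = cat_m * nb_m * log(L_text / L_chain)."""
    return cat * nb_occurrences * math.log(text_length / max(link_length, 1))

def sentence_similarity(wa: Dict[str, float], wb: Dict[str, float]) -> float:
    """Equation 2: cosine similarity between the weights of the chains or
    links active in two successive sentences."""
    common = set(wa) & set(wb)
    num = sum(wa[m] * wb[m] for m in common)
    den = math.sqrt(sum(w * w for w in wa.values()) *
                    sum(w * w for w in wb.values()))
    return num / den if den else 0.0

def select_boundaries(similarities: List[float], k: float) -> List[int]:
    """Equation 3: keep a boundary after sentence i whenever sim(i, i+1)
    falls below mu + K * sigma^2 (mean and variance of all similarities)."""
    mu = sum(similarities) / len(similarities)
    var = sum((s - mu) ** 2 for s in similarities) / len(similarities)
    threshold = mu + k * var
    return [i for i, s in enumerate(similarities) if s < threshold]
```

In practice, K would either be fixed empirically or adjusted until the number of selected boundaries yields the desired average segment size.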
3. EVALUATION
Our test corpus is composed of 50 composite texts built from extracts of Le Monde, a French newspaper. Each document is composed of ten extracts from articles randomly chosen in different thematic categories. The extract size varies between 9 and 11 sentences. The corpus can be considered a valid reference since the segments composing each document are taken from different texts, which ensures thematic variability. The advantages of this corpus over a manually annotated one are the amount of data we can test on, a lower human annotation cost, and objectivity.

The WindowDiff measure [5] evaluates the effectiveness of a segmentation tool by comparing, for each document of the corpus, the initial boundaries with the boundaries computed by our segmenter. By moving an analysis window over equivalent sentences in a pair (reference document, hypothesised document), WindowDiff measures the difference between the number of boundaries between positions i and i + k in the hypothesised document h and the number of boundaries in the reference document r. Thus, if the difference is null, the words at positions i and i + k can be considered to lie locally in the same segments (as if the text were reduced to the current analysis window). Short added segments are penalised like any others. The measure is the normalised sum of the differences between the numbers of boundaries:

$WD(r, h) = \frac{\sum_i |b(r_i, r_{i+k}) - b(h_i, h_{i+k})|}{N - k}$   (4)

where $b(x_i, x_j)$ is the number of boundaries between positions i and j in text x, and N is the number of sentences in the text. The lower the WindowDiff measure, the more effective the segmenter.
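A minimal sketch of the WindowDiff computation of Equation 4, operating on lists of boundary positions; the boundary encoding and variable names are illustrative assumptions rather than the evaluation script actually used.

```python
from typing import List

def window_diff(reference: List[int], hypothesis: List[int],
                n_sentences: int, k: int) -> float:
    """Equation 4: normalised sum, over a sliding window of size k, of the
    differences between the number of reference boundaries and the number
    of hypothesised boundaries falling inside the window.

    `reference` and `hypothesis` are lists of boundary indices; a boundary
    at index i separates sentence i from sentence i + 1.
    """
    def boundaries_between(bounds: List[int], i: int, j: int) -> int:
        # b(x_i, x_j): number of boundaries between positions i and j in x
        return sum(1 for b in bounds if i <= b < j)

    total = 0
    for i in range(n_sentences - k):
        total += abs(boundaries_between(reference, i, i + k)
                     - boundaries_between(hypothesis, i, i + k))
    return total / (n_sentences - k)

# The lower the score, the closer the hypothesised segmentation is to the
# reference, e.g. window_diff([10, 20, 30], [9, 20, 31], n_sentences=40, k=5)
```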
The WLC, WLL and LC (classical lexical chains [3], used as a reference) approaches have been jointly implemented in the LIA_topic_seg segmentation tool. The part-of-speech tagger LIA_tagger¹ performs the lemmatisation (only lemmas of the syntactic categories verbs, adjectives, nouns and proper nouns are considered) and only lemmas occurring at least twice are kept. The approaches differ in hiatus determination (pre-defined at 11 sentences for LC; set to a number of sentences larger than the longest text, here 120 sentences, to simulate no hiatus for WLL; automatically computed as the average distance between two occurrences for WLC) and in weighting. The weight of each chain (or link) is 1 for the LC implementation, and is computed with Equation 1 (with $cat_m = 1$ for all lemmas) for the WLL and WLC approaches.

¹ http://www.lia.univ-avignon.fr/chercheurs/bechet/download_fred.html

The results of segmentation with the LC, WLL and WLC approaches are given in Table 1. They show that the WLL method performs slightly better than LC.

Table 1: WindowDiff evaluation for segmentation with the LC approach (hiatus of 11 sentences), the WLL approach (no hiatus), and the WLC approach (the hiatus is computed for each repeated lemma)

              LC       WLL      WLC
WindowDiff    0.3272   0.3187   0.3454

4. CONCLUSION AND FUTURE WORK
We proposed an effective segmentation method based on the new concept of weighted lexical links (WLL), an improvement on the lexical chains approach. The method has been implemented in the freely available modular tool LIA_topic_seg as part of the Technolangue OURAL project supported by the French Ministry of Research (http://projetoural.org/). Future work will aim to improve results by determining the best weighting parameters. Previous experiments that over-weighted named entities have already shown that this adversely affects performance. The weighting equation could include the idf (inverse document frequency) of the lemma. The mutual information with other lemmas occurring in the same sentence could also improve results [2].

5. REFERENCES
[1] F. Y. Y. Choi. Advances in domain independent linear text segmentation. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, USA, 2000.
[2] O. Ferret. Using collocations for topic segmentation and link detection. In Proceedings of the 19th International Conference on Computational Linguistics, volume 1, Taipei, Taiwan, 2002. Association for Computational Linguistics.
[3] M. Galley, K. McKeown, E. Fosler-Lussier, and H. Jing. Discourse segmentation of multi-party conversation. In Proceedings of ACL'03, Sapporo, Japan, 2003.
[4] M. A. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of ACL'94, Las Cruces, NM, 1994.
[5] L. Pevzner and M. A. Hearst. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19-36, 2002.