SIGIR 2007 Proceedings Poster

User-Oriented Text Segmentation Evaluation Measure

Martin Franz, J. Scott McCarley, Jian-Ming Xu
IBM T. J. Watson Research Center
1101 Kitchawan Road, Yorktown Heights, NY, USA
{franzm, jsmc, jianxu}@us.ibm.com

ABSTRACT

The paper describes a user-oriented performance evaluation measure for text segmentation. Experiments show that the proposed measure differentiates well between error distributions with varying user impact.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval

General Terms

Algorithms, Experimentation

Keywords

Story Segmentation

1. INTRODUCTION

Presentation of results in search engines that index audio or video broadcast news requires that the broadcast be automatically divided into short segments. Preferably, these short segments should be topically coherent and aligned with the user's idea of news "stories." Evaluating the accuracy of the text segmentation component thus requires a metric that reflects the manner in which the content is presented to the user. We emphasize that the user seeks the contents of the story, not the boundaries between stories. Story boundary errors prevent the user from accessing the content of the story, or expose the user to content from the wrong story.

[Figure 1: Misplaced story boundaries cause spurious content to be presented to a user.]

The aim of the work described in this paper is to find a segmentation performance measure to be used in designing segmentation algorithms for an integrated broadcast news processing system. The desired properties of the measure include: reflecting end-user behavior and interaction with the presented material in an information retrieval system, measuring segmentation quality in user-oriented units (words or tokens), and avoiding data-specific parameters.

Most currently used story segmentation measures, such as Pk [2], TDT Cseg [3], and WindowDiff [4], are based on counting the number of incorrectly proposed boundaries. The count is smoothed by a sliding-window scheme that assigns partial credit to system boundaries at incorrect positions close to the reference boundaries. These approaches have several shortcomings. First, the user is primarily interested in the content of the stories, not the boundaries between the stories. Miss and false alarm rates of boundaries do not clearly measure what content is missed by the user, or incorrectly presented to the user. Second, the width of the sliding window is an arbitrary parameter, and whether it should be a constant or a constant fraction of a typical story length is unclear. Thus results over long projects with evolving data sources are difficult to compare. In the next section, we propose a segmentation metric that directly measures the content that, due to incorrect story boundaries, is missed by the user or incorrectly presented to the user.

2. CONTENT-BASED MEASURE

When a user is presented a story with incorrect boundaries, there are two types of errors. First, the user misses some of the story because either the system-proposed start of the story was too late or the system-proposed story end came too soon. Second, the user is presented with some content from a different story (false alarms) if the system-proposed story starts too soon or ends too late.

Our measure of segmenter performance is computed by averaging the number of words of missed content and the number of words of false alarm content over all word positions in the corpus:

  $R_{Miss} = \frac{1}{N}\sum_{w} Miss(w), \qquad R_{FA} = \frac{1}{N}\sum_{w} FA(w)$   (1)

where Miss(w) and FA(w) are the numbers of missed and false alarm words, respectively, at a given word position w, and N is the total number of word positions in the test set. R_Miss and R_FA are measured in words, which means that the values indicate the average numbers of missed and spurious words presented to a user retrieving a story.

The simplest assumption behind choosing this metric, rather than one of several possible generalizations, is that all word positions in the corpus are equally likely to be the target of the user, and that all word positions are equally valuable to the user. Depending upon the application, the metric could be generalized by 1) assigning idf-like weights to individual dictionary words, or decreasing the weight of repeated words; 2) replacing or combining the per-word (micro) averaging with per-story length-fraction (macro) averaging; 3) assigning importance-based weights to leading/trailing missed/FA content. Also, missed and false alarm content could be measured by time rather than by number of words. In this paper we validate the simplest measure of this family rather than exploring all possibilities.
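To make Eq. (1) concrete, here is a minimal sketch of the measure in Python. The representation of a segmentation as a sorted list of segment-start word indices, and all function and variable names, are assumptions for illustration; the paper does not prescribe an implementation.

```python
# Sketch of the content-based measure (Eq. 1), under the assumption
# that each segmentation is a sorted list of segment-start word indices.
import bisect

def segment_of(starts, n_words, w):
    """Return the (start, end) word span of the segment containing position w."""
    i = bisect.bisect_right(starts, w) - 1
    start = starts[i]
    end = starts[i + 1] if i + 1 < len(starts) else n_words
    return start, end

def content_based_measure(ref_starts, sys_starts, n_words):
    """Average missed and false-alarm words over all word positions."""
    total_miss = total_fa = 0
    for w in range(n_words):
        r0, r1 = segment_of(ref_starts, n_words, w)  # true story span
        s0, s1 = segment_of(sys_starts, n_words, w)  # span shown to the user
        overlap = max(0, min(r1, s1) - max(r0, s0))
        total_miss += (r1 - r0) - overlap  # reference words the user misses
        total_fa += (s1 - s0) - overlap    # spurious words shown to the user
    return total_miss / n_words, total_fa / n_words

# Example: one true boundary at word 10, system places it at word 12.
r_miss, r_fa = content_based_measure([0, 10], [0, 12], 20)
print(r_miss, r_fa)
```

In the example, a boundary placed two words late costs every affected retrieval position both missed reference words and spurious system words, which is exactly the user-facing cost the measure is designed to expose.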
3. VALIDATION EXPERIMENTS

To show that the proposed measure is sensitive to degradation that directly impacts user experience, we measure the performance of a segmentation system with introduced noise. Our system is a maximum-entropy-based segmenter, trained to estimate the segmentation probability at utterance boundaries from a variety of features, an approach similar to the systems described in [1]. The training and test data are extracted from the PRI subset of the TDT-4 corpus [5]. The training set contains the chronologically first 44 shows (1157 stories); the test set contains the chronologically last 18 shows (474 stories).

We introduce noise into the system by swapping the system-calculated story boundary probabilities at 200 pairs of utterance boundaries (a sketch of this procedure appears at the end of this section). Swapping probabilities ensures that the number of boundaries produced by a noisy system is identical to the number produced by the baseline (no-noise) system, even as a threshold is varied to probe the false alarm/miss tradeoff. We choose the swaps from two distributions. In the first case, the swap points are distributed uniformly in the corpus; thus stories are likely to be broken in the middle. In the second case, all the swap points are located within ten utterances of a true story boundary, effectively moving story boundaries a small distance. The second case is believed to be less disruptive to the user experience.

The Miss/FA curves obtained using a window-based measure as described in [3] and the proposed measure are shown in Figures 2 and 3, respectively. The square markers correspond to "practical" operating points, where the average length of the produced stories matches the test data. For comparison purposes, a naive baseline of equally spaced segments (of length matching the average story length) yields R_Miss = 548 and R_FA = 92. We observe that near the "practical" operating points, the window-based measure does not clearly distinguish between the two types of noise, whereas the more user-disruptive uniform noise clearly impacts the proposed measure more than the near-boundary noise.

[Figure 2: Sliding window measure. P(Miss) vs. P(FA) curves for the baseline, near-boundary noise, and uniform noise conditions.]

[Figure 3: Content based measure. Miss vs. FA curves, in words, for the baseline, near-boundary noise, and uniform noise conditions.]
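The probability-swap scheme lends itself to a short sketch. How swap points are paired and how boundary probabilities are stored are not specified in the paper, so the layout below (a flat list of per-utterance-boundary probabilities and random pairing of sampled positions) is an assumption for illustration.

```python
# Hedged sketch of the probability-swap noise scheme; the pairing
# strategy and data layout are assumptions, not taken from the paper.
import random

def swap_noise(probs, positions, n_swaps=200, rng=random):
    """Swap boundary probabilities at randomly paired positions.

    Because the values are only permuted, thresholding the noisy
    scores yields exactly as many boundaries as the originals.
    """
    noisy = list(probs)
    picks = rng.sample(positions, 2 * n_swaps)
    for a, b in zip(picks[::2], picks[1::2]):
        noisy[a], noisy[b] = noisy[b], noisy[a]
    return noisy

# Case 1: swap points drawn uniformly over all utterance boundaries.
# Case 2: swap points restricted to within ten utterances of a true boundary.
def near_boundary_positions(true_boundaries, n_positions, width=10):
    near = set()
    for t in true_boundaries:
        near.update(range(max(0, t - width), min(n_positions, t + width + 1)))
    return sorted(near)
```

Because the noisy scores are a permutation of the originals, sweeping a decision threshold traces out a Miss/FA curve with the same number of proposed boundaries as the baseline at every operating point, which is what makes the two noise conditions directly comparable.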
4. CONCLUSION

We have introduced a new segmentation measure that directly measures whether appropriate content is presented to the user, and have shown that this measure is sensitive to the particular types of errors that directly impact the user.

5. ACKNOWLEDGMENTS

This work was partially supported by the Defense Advanced Research Projects Agency under contract No. HR0011-06-2-0001. The views and findings contained in this material are those of the authors and do not necessarily reflect the position or policy of the U.S. government, and no official endorsement should be inferred.

6. REFERENCES

[1] J. Allan, editor. Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers, Norwell, Massachusetts, 2002.
[2] D. Beeferman, A. Berger, and J. D. Lafferty. Statistical models for text segmentation. Machine Learning, 34(1-3):177-210, 1999.
[3] G. Doddington. The topic detection and tracking phase 2 (TDT2) evaluation plan. In DARPA Broadcast News Transcription and Understanding Workshop, pages 223-229, 1998.
[4] L. Pevzner and M. A. Hearst. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19-36, 2002.
[5] S. Strassel and M. Glenn. Creating the annotated TDT4 Y2003 evaluation corpus. http://www.nist.gov/speech/tests/tdt/tdt2003/papers/ldc.ppt, 2003.