SIGIR 2007 Proceedings Poster

Story Segmentation of Broadcast News in Arabic, Chinese and English Using Multi-Window Features

Martin Franz, IBM T. J. Watson Research Center, 1101 Kitchawan Rd., Yorktown Heights, NY, USA (franzm@us.ibm.com)
Jian-Ming Xu, IBM T. J. Watson Research Center, 1101 Kitchawan Rd., Yorktown Heights, NY, USA (jianxu@us.ibm.com)

ABSTRACT
The paper describes a maximum entropy based story segmentation system for Arabic, Chinese and English. In experiments with broadcast news data from TDT-3, TDT-4, and corpora collected in the DARPA GALE project, we obtain a substantial performance gain by using multiple overlapping windows for text-based features.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval

General Terms
Algorithms, Experimentation

Keywords
Story Segmentation

Copyright is held by the author/owner(s). SIGIR '07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION
Automatic identification of story boundaries in broadcast news (BN) is important in processing large volumes of unstructured data, where it is essential to identify topically homogeneous units of material to be indexed. A recent overview can be found in [7]. The paper describes our experiments to develop algorithms for application in a near-real-time system for processing Arabic and Chinese BN in the DARPA GALE project.

2. MAXIMUM ENTROPY MODEL
We treat story segmentation as a binary classification problem, where the model is trained to estimate the likelihood of a story boundary occurring at an utterance boundary, given the surrounding context and other features. For our purpose we define utterances as intervals of speech divided by occurrences of non-speech material (typically silence). The classifier is a maximum entropy model [2], using the following categories of features:

Lexical features include unigrams, bigrams and trigrams of tokens found in the text windows surrounding the proposed story boundary. For Arabic and English we convert words into tokens using stemmers ([4, 5], respectively), while the Chinese tokens are formed as overlapping character bigrams.

Text similarity features are based on the similarity of the text windows surrounding the proposed boundary. The similarity scores are computed using a symmetrized version of the Okapi [6] formula.

For the above mentioned text-based features we experiment with various lengths of the text windows, as well as with combinations of multiple windows of different lengths.

Prosody features include features based on the duration of the non-speech signal at the proposed boundary, utterance duration and word count, and the rate of speech in the material surrounding the proposed boundary.

Position in the show features reflect the time of the proposed boundary relative to the beginning of the programming block.

For the data collected in the GALE project, we also included speaker and video shot change features.

3. EXPERIMENTS AND RESULTS
To measure segmentation performance, we use the sliding window metric established in the Topic Detection and Tracking (TDT) project [1]. In contrast with TDT, we do not distinguish between NEWS, TEASER, and MISCELLANEOUS material, except in the experiments where we follow the conditions of the TDT-3 segmentation task. Selecting an operating point (OP) on the system's ROC curve is an application specific decision. We report the Miss and False Alarm (FA) rates at the OP at which the average story length matches the ground truth average story length of the test data, and Cseg values [1] at the lowest-Cseg OP.

The English subset of the TDT-4 [8] corpus consists of 450 files, each based on a news program 30 or 60 minutes long. In our experiments we use the chronologically first 350 files (248 hours) as a training set, and the last 50 files (30 hours) as a test set. Figure 1(a) and Table 1 show performance indicators for various versions of the system. The baseline system uses only the silence duration, left/right similarity feature windows of 20 tokens, and lexical feature windows of 50 tokens. The rest of the results show the cumulative effect of additional features. We obtain a substantial performance improvement with similarity and lexical features based on multiple windows. Adding the features based on the position in the show and on sentence duration yields a modest further improvement.

[Figure 1: P(Miss) vs. P(FA) curves comparing baseline and full systems: (a) TDT-4, English; (b) TDT-4, Arabic and Chinese (VOA); (c) TDT-3, English and Chinese, including the best TDT-3 systems. Square markers correspond to operating points where the average story length matches the test data.]

Table 1: TDT-4 Miss (M), FA (F), and Cseg (C) values for the baseline (B) and full (F) systems.

           Arabic            Chinese           English
        M    F    C       M    F    C       M    F    C
  B   .26  .045  .062   .20  .020  .042   .30  .061  .071
  F   .18  .041  .044   .15  .018  .029   .21  .041  .048

Additionally, to show the effect of multi-window features, we test the system with the following changes from the baseline: for the text similarity features, we use window lengths of 20, 50, 100 and 200 tokens; the resulting Cseg values range from 0.066 to 0.080, while combining all four window lengths yields a Cseg of 0.063.
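To make the text similarity features concrete, the sketch below implements a symmetrized Okapi-style similarity between the two token windows flanking a candidate boundary. The paper does not give its exact parameterization, so this is a minimal illustration assuming common BM25-style weights (k1, b, average window length) and a caller-supplied IDF table; it is not the authors' implementation.

```python
from collections import Counter

def okapi_score(query_tokens, doc_tokens, idf, k1=1.2, b=0.75, avg_len=100.0):
    """One-directional Okapi (BM25-style) score of doc_tokens against
    query_tokens. `idf` maps token -> inverse document frequency
    (assumed precomputed on a background corpus)."""
    tf = Counter(doc_tokens)
    norm = k1 * (1.0 - b + b * len(doc_tokens) / avg_len)
    score = 0.0
    for t in set(query_tokens):
        f = tf.get(t, 0)
        if f:
            score += idf.get(t, 0.0) * f * (k1 + 1.0) / (f + norm)
    return score

def symmetrized_okapi(left, right, idf, **kw):
    """Symmetrized similarity of the two windows around a candidate
    boundary: score each window against the other and sum."""
    return okapi_score(left, right, idf, **kw) + okapi_score(right, left, idf, **kw)
```

A low symmetrized similarity between the left and right windows indicates a topic change, which the classifier can weigh as evidence for a story boundary; computing the score at several window lengths yields the multi-window similarity features.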
For the lexical features, we use window lengths of 10, 20 and 50 tokens; the resulting Cseg values range from 0.069 to 0.075, while combining all three window lengths yields a Cseg of 0.060.

In experiments with the Arabic and Chinese TDT-4 data, we used the Voice of America (VOA) subset of the corpus. The training and test sets contain 44 and 24 shows for Arabic, and 45 and 19 shows for Chinese, each show being an hour long. Figure 1(b) compares the baseline system with the system that includes the multi-window, position in the show, and sentence duration features. We observe improvements over the baseline similar to those on the English TDT-4 data.

To compare performance with previously published results, we test our segmenter under the conditions of the TDT-3 [1] segmentation task. Figure 1(c) compares a baseline system, using a set of features similar to the state of the art system in TDT-3 [3], with a full system using the multi-window features. The lowest Cseg values for the full version of our system are 0.0593 for English and 0.0528 for Chinese, compared with the 0.0810 and 0.0670 reported by the state of the art system in TDT-3 (also shown in Figure 1(c)).

For the GALE project we trained segmentation models using data from multiple news programs from Al Arabiya and Al Jazeera (Arabic), and Phoenix Infonews (Chinese). For each source we collected and annotated approximately 50 hours of data, spanning a month of programming. In addition to the features used on the TDT data, we included features based on speaker and video shot changes, but they did not yield a consistent performance improvement. Tested on the most recent 20% of the annotated sets, the Cseg values range between 0.034 and 0.067 for Arabic and between 0.042 and 0.061 for Chinese.

4. CONCLUSION
Our story segmentation algorithm shows competitive performance on Arabic, Chinese and English data, primarily due to the use of multi-window features. The performance gain is consistent across the three languages.
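As a concrete illustration of the multi-window lexical features, the sketch below collects unigram, bigram and trigram indicator features from windows of 10, 20 and 50 tokens on each side of a candidate boundary. The feature-name scheme and the plain token lists are illustrative assumptions; the actual system uses stemmer-based tokens for Arabic and English and overlapping character bigrams for Chinese.

```python
def lexical_features(tokens, boundary, window_lengths=(10, 20, 50), max_n=3):
    """Sketch of multi-window lexical feature extraction.

    tokens   -- token sequence of the show
    boundary -- index of the candidate utterance boundary in `tokens`
    Returns a set of string feature names; the naming format
    ("L<win>_..." / "R<win>_...") is hypothetical, not from the paper.
    """
    feats = set()
    for w in window_lengths:
        left = tokens[max(0, boundary - w):boundary]   # w tokens before the boundary
        right = tokens[boundary:boundary + w]          # w tokens after the boundary
        for side, win in (("L", left), ("R", right)):
            for n in range(1, max_n + 1):              # unigrams..trigrams
                for i in range(len(win) - n + 1):
                    feats.add(f"{side}{w}_{'_'.join(win[i:i + n])}")
    return feats
```

Over all window lengths jointly, these binary indicator features (together with the similarity, prosody, and position features) form the input to the maximum entropy classifier.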
Future work will focus on investigating additional audio and video features.

5. ACKNOWLEDGMENTS
This work was partially supported by the Defense Advanced Research Projects Agency under contract No. HR0011-06-2-0001. The views and findings contained in this material are those of the authors and do not necessarily reflect the position or policy of the U.S. government, and no official endorsement should be inferred.

6. REFERENCES
[1] The 1999 topic detection and tracking TDT-3 evaluation project. http://www.nist.gov/speech/tests/tdt/tdt99/index.htm.
[2] A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[3] M. Franz, J. S. McCarley, T. Ward, and W. J. Zhu. Segmentation and detection at IBM: Hybrid statistical models and two-tiered clustering. In TDT-3 Workshop, 2000.
[4] Y.-S. Lee, K. Papineni, S. Roukos, O. Emam, and H. Hassan. Language model based Arabic word segmentation. In ACL '03: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 399-406, 2003.
[5] M. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
[6] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC. In Text REtrieval Conference, pages 21-30, 1992.
[7] A. Rosenberg and J. Hirschberg. Story segmentation of broadcast news in English, Mandarin and Arabic. In HLT-NAACL, pages 125-128, New York, NY, 2006.
[8] S. Strassel and M. Glenn. Creating the annotated TDT4 Y2003 evaluation corpus. http://www.nist.gov/speech/tests/tdt/tdt2003/papers/ldc.ppt, 2003.