SIGIR 2007 Proceedings, Session 9: Topic Detection and Tracking

Analyzing Feature Trajectories for Event Detection

Qi He (qihe@pmail.ntu.edu.sg), Kuiyu Chang (ASKYChang@ntu.edu.sg), Ee-Peng Lim (ASEPLim@ntu.edu.sg)
School of Computer Engineering, Nanyang Technological University, Block N4, Nanyang Avenue, Singapore 639798

ABSTRACT
We consider the problem of analyzing word trajectories in both the time and frequency domains, with the specific goal of identifying important and less-reported, periodic and aperiodic words. A set of words with identical trends can be grouped together to reconstruct an event in a completely unsupervised manner. The document frequency of each word across time is treated like a time series, where each element is the document frequency-inverse document frequency (DFIDF) score at one time point. In this paper, we 1) first apply spectral analysis to categorize features for different event characteristics: important and less-reported, periodic and aperiodic; 2) model aperiodic features with a Gaussian density and periodic features with Gaussian mixture densities, and subsequently detect each feature's burst by a truncated Gaussian approach; 3) propose an unsupervised greedy event detection algorithm to detect both aperiodic and periodic events. All of the above methods can be applied to time series data in general. We extensively evaluated our methods on the 1-year Reuters News Corpus [3] and showed that they were able to uncover meaningful aperiodic and periodic events.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Experimentation.
Keywords: feature categorization, event detection, DFT, Gaussian

SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. Copyright 2007 ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION
There are more than 4,000 online news sources in the world. Manually monitoring all of them for important events has become difficult or practically impossible. In fact, the topic detection and tracking (TDT) community has for many years been trying to come up with a practical solution to help people monitor news effectively. Unfortunately, the holy grail is still elusive, because the vast majority of TDT solutions proposed for event detection [20, 5, 17, 4, 21, 7, 14, 10] are either too simplistic (based on cosine similarity [5]) or impractical due to the need to tune a large number of parameters [9]. The ineffectiveness of current TDT technologies can be easily illustrated by subscribing to any of the many online news alert services, such as the industry-leading Google News Alerts [2], which generates more than 50% false alarms [10]. As further proof, portals like Yahoo take a more pragmatic approach by requiring all machine-generated news alerts to go through a human operator for confirmation before being sent out to subscribers.

Instead of attacking the problem with variations of the same hammer (cosine similarity and TFIDF), a fundamental understanding of the characteristics of news stream data is necessary before any major breakthroughs can be made in TDT. Thus in this paper, we look at news stories and feature trends from the perspective of analyzing time-series word signals. Previous work like [9] has attempted to reconstruct an event from its representative features. However, in many predictive event detection tasks (i.e., retrospective event detection), there is a vast set of potential features for only a fixed set of observations (i.e., the obvious bursts). Of these features, often only a small number are expected to be useful. In particular, we study the novel problem of analyzing feature trajectories for event detection, borrowing a well-known technique from signal processing: identifying distributional correlations among all features by spectral analysis. To evaluate our method, we subsequently propose an unsupervised event detection algorithm for news streams.

[Figure 1: Feature correlation (DFIDF:time) between (a) Easter and April, and (b) Unaudited and Ended.]

As an illustrative example, consider the correlation between the words Easter and April from the Reuters Corpus¹. From the plot of their normalized DFIDF in Figure 1(a), we observe the heavy overlap between the two words circa 04/1997, which means they probably both belong to the same event during that time (Easter feast). In this example, the hidden event Easter feast is a typical important aperiodic event over 1-year data. Another example is given by Figure 1(b), where both the words Unaudited and Ended exhibit similar behaviour over periods of 3 months. These two words actually originated from the same periodic event, net income-loss reports, which are released quarterly by publicly listed companies. Other observations drawn from Figure 1 are: 1) the bursty period of April is much longer than that of Easter, which suggests that April may also appear in other events during the same period; 2) Unaudited has a higher average DFIDF value than Ended, which indicates that Unaudited is more representative of the underlying event.

¹The Reuters Corpus is the default dataset for all examples.
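Trajectories like those in Figure 1 are sequences of per-day DFIDF scores. The following is a minimal Python sketch of how such a trajectory can be computed from a toy stream of daily document sets; the corpus and the function name are hypothetical, and the score follows the paper's normalized DFIDF definition from Section 3.2:

```python
import math

def dfidf_trajectory(daily_docs, feature):
    """Daily DFIDF of one word feature:
    (local DF / #docs that day) * log(total #docs / global DF),
    normalized so the trajectory sums to 1 (cf. Section 3.2)."""
    T = len(daily_docs)
    df_t = [sum(1 for doc in docs if feature in doc) for docs in daily_docs]
    n_t = [len(docs) for docs in daily_docs]
    N, DF = sum(n_t), sum(df_t)
    raw = [(df_t[t] / n_t[t]) * math.log(N / DF) if DF and n_t[t] else 0.0
           for t in range(T)]
    s = sum(raw)
    return [y / s for y in raw] if s else raw

# toy 3-day stream; each document is a set of words (hypothetical data)
days = [
    [{"easter", "april"}, {"soccer"}],
    [{"april"}, {"easter", "april"}, {"soccer"}],
    [{"soccer"}, {"soccer"}],
]
traj = dfidf_trajectory(days, "april")
assert abs(sum(traj) - 1.0) < 1e-9   # normalized like the paper's y_f(t)
```

Correlations such as Easter/April can then be checked by directly comparing these normalized sequences day by day.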
These two examples are but the tip of the iceberg among all the word trends and correlations hidden in a news stream like Reuters. If a large number of them can be uncovered, they could significantly aid TDT tasks. In particular, this highlights the significance of mining correlated features for detecting the corresponding events. To summarize, we postulate that: 1) an event is described by its representative features; a periodic event has a list of periodic features and an aperiodic event has a list of aperiodic features; 2) representative features from the same event share similar distributions over time and are highly correlated; 3) an important event has a set of active (largely reported) representative features, whereas an unimportant event has a set of inactive (less-reported) representative features; 4) a feature may be included by several events with overlaps in time frames. Based on these observations, we can either mine representative features given an event or detect an event from a list of highly correlated features. In this paper, we focus on the latter, i.e., how correlated features can be uncovered to form an event in an unsupervised manner.

1.1 Contributions
This paper has three main contributions:
• To the best of our knowledge, our approach is the first to categorize word features for heterogeneous events. Specifically, every word feature is categorized into one of the following five feature types based on its power spectrum strength and periodicity: 1) HH (high power and high/long periodicity): important aperiodic events; 2) HL (high power and low periodicity): important periodic events; 3) LH (low power and high periodicity): unimportant aperiodic events; 4) LL (low power and low periodicity): non-events; and 5) SW (stopwords), a higher power and periodicity subset of LL comprising stopwords, which carries no information.
• We propose a simple and effective mixture density-based approach to model and detect feature bursts.
• We come up with an unsupervised event detection algorithm to detect both aperiodic and periodic events. Our algorithm has been evaluated on a real news stream to show its effectiveness.

2. RELATED WORK
This work is largely motivated by a broader family of problems collectively known as Topic Detection and Tracking (TDT) [20, 5, 17, 4, 21, 7, 14, 10]. However, most TDT research so far has been concerned with clustering/classifying documents into topic types, identifying novel sentences [6] for new events, etc., without much regard to analyzing the word trajectory with respect to time. Swan and Allan [18] first attempted using co-occurring terms to construct an event. However, they only considered named entities and noun phrase pairs, without considering their periodicities. In contrast, our paper considers all of the above.

Recently, there has been significant interest in modeling an event in text streams as a "burst of activities" by incorporating temporal information. Kleinberg's seminal work described how bursty features can be extracted from text streams using an infinite automaton model [12], which inspired a whole series of applications such as Kumar's identification of bursty communities from Weblog graphs [13], Mei's summarization of evolutionary themes in text streams [15], and He's clustering of text streams using bursty features [11]. Nevertheless, none of the existing work specifically identified features for events, except for Fung et al. [9], who clustered bursty features to identify various bursty events. Our work differs from [9] in several ways: 1) we analyze every single feature, not only bursty features; 2) we classify features along two categorical dimensions (periodicity and power), yielding altogether five primary feature types; 3) we do not restrict each feature to belong exclusively to only one event.

Spectral analysis techniques have previously been used by Vlachos et al. [19] to identify periodicities and bursts from query logs. Their focus was on detecting multiple periodicities from the power spectrum graph, which were then used to index words for "query-by-burst" search. In this paper, we use spectral analysis to classify word features along two dimensions, namely periodicity and power spectrum, with the ultimate goal of identifying both periodic and aperiodic bursty events.

3. DATA REPRESENTATION
Let T be the duration/period (in days) of a news stream, and let F represent the complete word feature space in the classical static Vector Space Model (VSM).

3.1 Event Periodicity Classification
Within T, there may exist certain events that occur only once, e.g., Tony Blair being elected Prime Minister of the U.K., and other recurring events of various periodicities, e.g., weekly soccer matches. We thus categorize all events into two types, aperiodic and periodic, defined as follows.

Definition 1. (Aperiodic Event) An event is aperiodic within T if it only happens once.

Definition 2. (Periodic Event) If events of a certain event genre occur regularly with a fixed periodicity $P \le \lfloor T/2 \rfloor$, we say that this particular event genre is periodic, with each member event qualified as a periodic event.

Note that the definition of "aperiodic" is relative, i.e., it holds only for a given T, and may be invalid for any other $T' > T$. For example, the event Christmas feast is aperiodic for $T \le 365$ but periodic for $T \ge 730$.

3.2 Representative Features
Intuitively, an event can be described very concisely by a few discriminative and representative word features, and vice versa; e.g., "hurricane", "sweep", and "strike" could be representative features of a Hurricane genre event.
Likewise, a set of strongly correlated features could be used to reconstruct an event description, assuming that strongly correlated features are representative. The representation vector of a word feature is defined as follows:

Definition 3. (Feature Trajectory) The trajectory of a word feature f can be written as the sequence $y_f = [y_f(1), y_f(2), \ldots, y_f(T)]$, where each element $y_f(t)$ is a measure of feature f at time t, which could be defined using the normalized DFIDF score²

$$y_f(t) = \frac{DF_f(t)}{N(t)} \times \log\left(\frac{N}{DF_f}\right),$$

where $DF_f(t)$ is the number of documents (local DF) containing feature f on day t, $DF_f$ is the total number of documents (global DF) containing feature f over T, $N(t)$ is the number of documents on day t, and N is the total number of documents over T.

²We normalize $y_f(t)$ as $y_f(t) = y_f(t) / \sum_{i=1}^{T} y_f(i)$ so that it can be interpreted as a probability.

4. IDENTIFYING FEATURES FOR EVENTS
In this section, we show how representative features can be extracted for (un)important or (a)periodic events.

4.1 Spectral Analysis for Dominant Period
Given a feature f, we decompose its feature trajectory $y_f = [y_f(1), y_f(2), \ldots, y_f(T)]$ into a sequence of T complex numbers $[X_1, \ldots, X_T]$ via the discrete Fourier transform (DFT):

$$X_k = \sum_{t=1}^{T} y_f(t)\, e^{-\frac{2\pi i}{T}(k-1)t}, \quad k = 1, 2, \ldots, T.$$

The DFT represents the original time series as a linear combination of complex sinusoids, as shown by the inverse discrete Fourier transform (IDFT):

$$y_f(t) = \frac{1}{T} \sum_{k=1}^{T} X_k\, e^{\frac{2\pi i}{T}(k-1)t}, \quad t = 1, 2, \ldots, T,$$

where the Fourier coefficient $X_k$ denotes the amplitude of the sinusoid with frequency k/T. The original trajectory can be reconstructed from just the dominant frequencies, which can be determined from the power spectrum using the popular periodogram estimator. The periodogram is the sequence of squared magnitudes of the Fourier coefficients, $\|X_k\|^2$, $k = 1, 2, \ldots, \lceil T/2 \rceil$, which indicates the signal power at frequency k/T in the spectrum. From the power spectrum, the dominant period is chosen as the inverse of the frequency with the highest power, as follows.

Definition 4. (Dominant Period) The dominant period (DP) of a given feature f is $P_f = T / \arg\max_k \|X_k\|^2$.

Accordingly, we have

Definition 5. (Dominant Power Spectrum) The dominant power spectrum (DPS) of a given feature f is $S_f = \|X_k\|^2$, with $\|X_k\|^2 \ge \|X_j\|^2$ for all $j \ne k$.

4.2 Categorizing Features
The DPS of a feature trajectory is a strong indicator of its activeness at the specified frequency; the higher the DPS, the more likely the feature is to be bursty. Combining DPS with DP, we therefore categorize all features into four types:
• HH: high $S_f$, aperiodic or long-term periodic ($P_f > \lfloor T/2 \rfloor$);
• HL: high $S_f$, short-term periodic ($P_f \le \lfloor T/2 \rfloor$);
• LH: low $S_f$, aperiodic or long-term periodic;
• LL: low $S_f$, short-term periodic.
The boundary between long-term and short-term periodicity is set to $\lfloor T/2 \rfloor$. However, distinguishing between a high and low DPS is not straightforward; this will be tackled later.

Properties of Different Feature Sets
To better understand the properties of HH, HL, LH and LL, we select four features, Christmas, soccer, DBS and your, as illustrative examples. Since the boundary between the high and low power spectrum is unclear, the chosen examples cover a relatively wide range of power spectrum values.

Figure 2(a) shows the DFIDF trajectory of Christmas, with a distinct burst around Christmas day. For the 1-year Reuters dataset, Christmas is classified as a typical aperiodic event with $P_f = 365$ and $S_f = 135.68$, as shown in Figure 2(b). Clearly, the value $S_f = 135.68$ is reasonable for a well-known bursty event like Christmas.

[Figure 2: Feature "Christmas" with relatively high $S_f$ (135.68) and long-term $P_f$ (365).]

The DFIDF trajectory of soccer is shown in Figure 3(a), from which we can observe a regular burst every 7 days, verified by its computed value of $P_f = 7$ shown in Figure 3(b). Using the domain knowledge that soccer has more matches every Saturday, which makes it a typical and heavily reported periodic event, we consider its value of $S_f = 155.13$ to be high.

[Figure 3: Feature "soccer" with relatively high $S_f$ (155.13) and short-term $P_f$ (7).]

From the DFIDF trajectory of DBS in Figure 4(a), we can immediately deduce DBS to be an infrequent word with a trivial burst on 08/17/1997 corresponding to DBS Land Raffles Holdings plans. This is confirmed by the long period $P_f = 365$ and low power $S_f = 0.3084$ shown in Figure 4(b). Moreover, since this aperiodic event is reported in only a few news stories over a very short time of a few days, we say that its low power value $S_f = 0.3084$ is representative of unimportant events.

[Figure 4: Feature "DBS" with relatively low $S_f$ (0.3084) and long-term $P_f$ (365).]

The most confusing example is shown in Figure 5 for the word feature your, which looks very similar to the graph of soccer in Figure 3. At first glance, we may be tempted to group both your and soccer into the same category of HL or LL, since both distributions look similar and have the same dominant period of approximately a week. However, further analysis indicates that the periodicity of your is due to the difference in document counts between weekdays (average 2,919 per day) and weekends³ (average 479 per day). One would have expected the "periodicity" of a stopword like your to be one day. Moreover, despite our DFIDF normalization, the weekday/weekend imbalance still prevailed; stopwords occur 4 times more frequently on weekends than on weekdays. Thus, the DPS remains the only distinguishing factor between your ($S_f = 9.42$) and soccer ($S_f = 155.13$). However, it would be dangerous to simply conclude that a power value of S = 9.42 corresponds to a stopword feature.

[Figure 5: Feature "your", easily confused with feature "soccer" ($P_f = 7$, $S_f = 9.42$).]

³The "weekends" here also include public holidays falling on weekdays.

Before introducing our solution to this problem, let us look at another LL example, shown in Figure 6 for beenb, which is actually a confirmed typo. We therefore classify beenb as a noisy feature that does not contribute to any event. Clearly, the trajectory of your is very different from that of beenb, which means the former has to be considered separately.

[Figure 6: Feature "beenb" with relatively low $S_f$ (1.20E-05) and short-term $P_f$ (8).]

Stop Words (SW) Feature Set
Based on the above analysis, we realize that there must be another feature set between HL and LL that corresponds to the set of stopwords. Features from this set have moderate DPS and a low but known dominant period. Since it is hard to distinguish this feature set from HL and LL based on DPS alone, we introduce another factor, the average DFIDF ($\overline{DFIDF}$). As shown in Figure 5, features like your usually have a lower DPS than a HL feature like soccer, but a much higher $\overline{DFIDF}$ than an LL noisy feature such as beenb. Since such properties are usually characteristic of stopwords, we group features like your into the newly defined stopword (SW) feature set. Since setting the DPS and $\overline{DFIDF}$ thresholds for identifying stopwords is more of an art than a science, we propose a heuristic algorithm, HS (Algorithm 1). The basic idea is to use only news stories from weekdays to identify stopwords. The SW set is initially seeded with a small set of 29 popular stopwords utilized by the Google search engine.

Algorithm 1 Heuristic Stopwords detection (HS)
Input: seed SW set, weekday trajectories of all words
1: From the seed set SW, compute the maximum DPS as UDPS, the maximum $\overline{DFIDF}$ as UDFIDF, and the minimum $\overline{DFIDF}$ as LDFIDF.
2: for each $f_i \in F$ do
3:   Compute the DFT of $f_i$.
4:   if $S_{f_i} \le UDPS$ and $\overline{DFIDF}_{f_i} \in [LDFIDF, UDFIDF]$ then
5:     $f_i \rightarrow SW$
6:     $F = F - \{f_i\}$
7:   end if
8: end for

Overview of Feature Categorization
After the SW set is generated, all stopwords are removed from F. We then set the boundary between high and low DPS to be the upper bound of the SW set's DPS. An overview of all five feature sets is shown in Figure 7.

[Figure 7: The 5 feature sets for events.]

5. IDENTIFYING BURSTS FOR FEATURES
Since only features from HH, HL and LH are meaningful and could potentially be representative of some event, we prune all features classified as LL or SW. In this section, we describe how bursts can be identified from the remaining features. Unlike Kleinberg's burst identification algorithm [12], we can identify both significant and trivial bursts without the need to set any parameters.

5.1 Detecting Aperiodic Features' Bursts
For each feature in HH and LH, we truncate its trajectory by keeping only the bursty period, which is modeled with a Gaussian distribution.
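This single-Gaussian burst model can be sketched as follows. Fitting the Gaussian by the trajectory's first two moments is an assumption on my part (the paper does not spell out the estimator), and the toy burst below is synthetic:

```python
import math

def burst_period(y):
    """Treat a normalized trajectory y (sums to 1) as a distribution over
    days, fit a single Gaussian by its first two moments, and report the
    bursty period [mu - sigma, mu + sigma] as in Section 5.1. Moment
    fitting is an assumed estimator, not taken from the paper."""
    days = range(1, len(y) + 1)
    mu = sum(t * p for t, p in zip(days, y))
    var = sum(p * (t - mu) ** 2 for t, p in zip(days, y))
    sigma = math.sqrt(var)
    return mu - sigma, mu + sigma

# a synthetic aperiodic burst centered on day 18 (hypothetical data)
y = [0.0] * 40
for t, p in [(16, 0.1), (17, 0.2), (18, 0.4), (19, 0.2), (20, 0.1)]:
    y[t - 1] = p
lo, hi = burst_period(y)
assert 16 < lo < 18 < hi < 20   # burst tightly brackets day 18
```

Truncating the trajectory to [lo, hi] keeps only the bursty period, as the text describes.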
For example, Figure 8 shows the word feature Iraq with a burst circa 09/06/1996 being modeled as a Gaussian. Its bursty period is defined as $[\mu_f - \sigma_f, \mu_f + \sigma_f]$, as shown in Figure 8(b).

[Figure 8: Modeling Iraq's time series as a truncated Gaussian with $\mu$ = 09/06/1996 and $\sigma$ = 6.26.]

5.2 Detecting Periodic Features' Bursts
Since we have computed the DP of a periodic feature f, we can easily model its periodic feature trajectory $y_f$ using a mixture of $K = \lfloor T/P_f \rfloor$ Gaussians:

$$f\big(y_f = y_f(t) \mid \theta_f\big) = \sum_{k=1}^{K} \alpha_k \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(y_f(t)-\mu_k)^2}{2\sigma_k^2}},$$

where the parameter set $\theta_f = \{\alpha_k, \mu_k, \sigma_k\}_{k=1}^{K}$ comprises:
• $\alpha_k$, the probability of assigning $y_f(t)$ to the kth Gaussian, with $\alpha_k > 0$ for all $k \in [1, K]$ and $\sum_{k=1}^{K} \alpha_k = 1$;
• $\mu_k$/$\sigma_k$, the mean/standard deviation of the kth Gaussian.
The well-known Expectation Maximization (EM) [8] algorithm is used to compute the mixing proportions $\alpha_k$ as well as the individual Gaussian density parameters $\mu_k$ and $\sigma_k$. Each Gaussian represents one periodic event, and is modeled similarly as in Section 5.1.

6. EVENTS FROM FEATURES
After identifying and modeling bursts for all features, the next task is to paint a picture of each event with a potential set of representative features.

6.1 Feature Correlation
If two features $f_i$ and $f_j$ are representative of the same event, they must satisfy the following necessary conditions:
1. $f_i$ and $f_j$ are identically distributed: $y_{f_i} \sim y_{f_j}$.
2. $f_i$ and $f_j$ have a high document overlap.

Measuring Feature Distribution Similarity
We measure the similarity between two features $f_i$ and $f_j$ using the discrete KL-divergence, defined as follows.

Definition 6. (Feature Similarity) $KL(f_i, f_j) = \max(KL(f_i \| f_j), KL(f_j \| f_i))$, where

$$KL(f_i \| f_j) = \sum_{t=1}^{T} f\big(y_{f_i}(t) \mid \theta_{f_i}\big) \log \frac{f\big(y_{f_i}(t) \mid \theta_{f_i}\big)}{f\big(y_{f_j}(t) \mid \theta_{f_j}\big)}. \quad (1)$$

Since the KL-divergence is not symmetric, we define the similarity between $f_i$ and $f_j$ as the maximum of $KL(f_i \| f_j)$ and $KL(f_j \| f_i)$. The similarity between two aperiodic features can be computed using a closed form of the KL-divergence [16]. The same discrete KL-divergence formula of Eq. 1 is employed to compute the similarity between two periodic features. Next, we define the overall similarity among a set of features R using the maximum inter-feature KL-divergence value, as follows.

Definition 7. (Set Similarity) $KL(R) = \max_{f_i, f_j \in R} KL(f_i, f_j)$.

Document Overlap
Let $M_i$ be the set of all documents containing feature $f_i$. Given two features $f_i$ and $f_j$, the overlapping document set containing both features is $M_i \cap M_j$. Intuitively, the higher $|M_i \cap M_j|$, the more likely it is that $f_i$ and $f_j$ are highly correlated. We define the degree of document overlap between two features $f_i$ and $f_j$ as follows.

Definition 8. (Feature DF Overlap) $d(f_i, f_j) = \frac{|M_i \cap M_j|}{\min(|M_i|, |M_j|)}$.

Accordingly, the DF overlap among a set of features R is also defined.

Definition 9. (Set DF Overlap) $d(R) = \min_{f_i, f_j \in R} d(f_i, f_j)$.

6.2 Unsupervised Greedy Event Detection
We use features from HH to detect important aperiodic events, features from LH to detect less-reported/unimportant aperiodic events, and features from HL to detect periodic events. All of them share the same algorithm. Given a bursty feature $f_i \in HH$, the goal is to find highly correlated features from HH. The set of features similar to $f_i$ can then collectively describe an event. Specifically, we need to find a subset $R_i$ of HH that minimizes the following cost function:

$$C(R_i) = \frac{KL(R_i)}{d(R_i) \sum_{f_j \in R_i} S_{f_j}}, \quad R_i \subseteq HH. \quad (2)$$

The underlying event e (associated with the burst of $f_i$) can then be represented by $R_i$ as

$$y(e) = \sum_{f_j \in R_i} \frac{S_{f_j}}{\sum_{f_u \in R_i} S_{f_u}}\, y_{f_j}. \quad (3)$$

The burst analysis for event e is exactly the same as for a feature trajectory.

The cost in Eq. 2 can be minimized using our unsupervised greedy (UG) event detection algorithm, described in Algorithm 2.

Algorithm 2 Unsupervised Greedy event detection (UG)
Input: HH, document index for each feature.
1: Sort and select features in descending DPS order: $S_{f_1} \ge S_{f_2} \ge \ldots \ge S_{f_{|HH|}}$.
2: k = 0.
3: for each $f_i \in HH$ do
4:   k = k + 1.
5:   Init: $R_i \leftarrow \{f_i\}$, $C(R_i) = 1/S_{f_i}$, and $HH = HH - \{f_i\}$.
6:   while HH is not empty do
7:     $m = \arg\min_m C(R_i \cup \{f_m\})$.
8:     if $C(R_i \cup \{f_m\}) < C(R_i)$ then
9:       $R_i \leftarrow R_i \cup \{f_m\}$ and $HH = HH - \{f_m\}$.
10:    else
11:      break.
12:    end if
13:  end while
14:  Output $e_k$ as in Eq. 3.
15: end for

The UG algorithm allows a feature to be contained in multiple events, so that we can detect several events happening at the same time. Furthermore, trivial events containing only year/month features (e.g., an event containing only the single feature Aug could be identified over a 1-year news stream) could be removed, although such events have inherently high cost and should already be ranked very low. Note that our UG algorithm requires only one data-dependent parameter, the boundary between the high and low power spectrum, to be set once, and this parameter can be easily estimated using the HS algorithm (Algorithm 1).

7. EXPERIMENTS
In this section, we study the performance of our feature categorization method and event detection algorithm. We first introduce the dataset and experimental setup, then subjectively evaluate the categorization of features into HH, HL, LH, LL and SW. Finally, we study the (a)periodic event detection problem with Algorithm 2.

7.1 Dataset and Experimental Setup
The Reuters Corpus contains 806,791 English news stories from 08/20/1996 to 08/19/1997 at a day resolution.
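Before turning to the corpus details, the greedy minimization of Eq. 2 (Algorithm 2, Section 6.2) can be sketched in Python. The feature names, trajectories, document sets, and DPS values below are hypothetical, and the similarity simplifies Definition 6 by applying the discrete KL-divergence directly to the normalized trajectories rather than to fitted Gaussian densities:

```python
import math

def sym_kl(p, q, eps=1e-12):
    """max(KL(p||q), KL(q||p)) on normalized trajectories (a simplification
    of Definition 6, which uses fitted densities)."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return max(kl(p, q), kl(q, p))

def cost(R, traj, docs, dps):
    """Eq. 2: C(R) = KL(R) / (d(R) * sum of DPS over R); a singleton set
    has cost 1/S_f, matching Algorithm 2's initialization."""
    feats = list(R)
    if len(feats) == 1:
        return 1.0 / dps[feats[0]]
    pairs = [(a, b) for a in feats for b in feats if a != b]
    KL = max(sym_kl(traj[a], traj[b]) for a, b in pairs)       # Def. 7
    d = min(len(docs[a] & docs[b]) / min(len(docs[a]), len(docs[b]))
            for a, b in pairs)                                 # Def. 9
    S = sum(dps[f] for f in feats)
    return KL / (d * S) if d > 0 else float("inf")

def greedy_events(HH, traj, docs, dps):
    """Algorithm 2 (UG): seed each event with the strongest remaining
    feature, then grow it while the cost keeps dropping."""
    events, pool = [], sorted(HH, key=lambda f: -dps[f])
    while pool:
        R = {pool.pop(0)}
        while pool:
            m = min(pool, key=lambda f: cost(R | {f}, traj, docs, dps))
            if cost(R | {m}, traj, docs, dps) < cost(R, traj, docs, dps):
                R.add(m)
                pool.remove(m)
            else:
                break
        events.append(R)
    return events

# three hypothetical HH features: two correlated, one unrelated
traj = {"dole": [0.5, 0.5, 0.0, 0.0],
        "bob":  [0.45, 0.55, 0.0, 0.0],
        "easter": [0.0, 0.0, 0.5, 0.5]}
docs = {"dole": {1, 2, 3}, "bob": {2, 3, 4}, "easter": {7, 8}}
dps = {"dole": 10.0, "bob": 8.0, "easter": 5.0}
events = greedy_events(["dole", "bob", "easter"], traj, docs, dps)
assert events == [{"dole", "bob"}, {"easter"}]
```

Note how the zero document overlap between easter and the dole/bob pair drives the joint cost to infinity, so the greedy growth stops and two separate events are emitted.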
Version 2 of the open source Lucene software [1] was used to tokenize the news text content and generate the document-word vectors. In order to preserve the time-sensitive past/present/future tenses of verbs and the differences between lower-case nouns and upper-case named entities, no stemming was done. Since dynamic stopword removal is one of the functionalities of our method, no stopwords were removed either. We did, however, remove non-English characters, after which the number of word features amounted to 423,433. All experiments were implemented in Java and conducted on a 3.2 GHz Pentium 4 PC running Windows 2003 Server with 1 GB of memory.

7.2 Categorizing Features
We downloaded 34 well-known stopwords utilized by the Google search engine as our seed training features: a, about, an, are, as, at, be, by, de, for, from, how, in, is, it, of, on, or, that, the, this, to, was, what, when, where, who, will, with, la, com, und, en and www. We excluded the last five stopwords as they are uncommon in news stories. By analyzing only the news stories over the 259 weekdays, we computed the upper bound of the power spectrum for stopwords as 11.18, with the corresponding $\overline{DFIDF}$ ranging from 0.1182 to 0.3691. Any feature f satisfying $S_f \le 11.18$ and $0.1182 \le \overline{DFIDF}_f \le 0.3691$ over the weekdays is considered a stopword. In this manner, 470 stopwords were found and removed, as visualized in Figure 9. Some detected stopwords are A (P = 65, S = 3.36, $\overline{DFIDF}$ = 0.3103), At (P = 259, S = 1.86, $\overline{DFIDF}$ = 0.1551), GMT (P = 130, S = 6.16, $\overline{DFIDF}$ = 0.1628) and much (P = 22, S = 0.80, $\overline{DFIDF}$ = 0.1865).

[Figure 9: Distribution of SW (stopwords) in the HH, HL, LH, and LL regions.]

After the removal of these stopwords, the distributions of weekday and weekend news are more or less matched, and in the ensuing experiments we make use of the full corpus (weekdays and weekends). The upper-bound power spectrum value of 11.18 from the stopword training was selected as the boundary between the high and low power spectrum. The boundary between high and low periodicity was set to $\lceil 365/2 \rceil = 183$. All 422,963 (= 423,433 - 470) word features were categorized into four feature sets: HH (69 features), HL (1,087 features), LH (83,471 features), and LL (338,806 features), as shown in Figure 10. In Figure 10, each gray level denotes the relative density of features in a square region, measured by $\log_{10}(1 + D_k)$, where $D_k$ is the number of features within the k-th square region.

[Figure 10: Distribution of categorized features over the four quadrants (shading in log scale).]

From the figure, we can make the following observations:
1. Most features have low S and are easily distinguishable from the features having a much higher S, which allows us to separate important (a)periodic events from trivial events by selecting features with high S.
2. Features in the HH and LH quadrants are aperiodic and are nicely separated (big horizontal gap) from the periodic features, which allows aperiodic and periodic events to be reliably detected independently.
3. The (vertical) boundary between the high and low power spectrum is not as clear-cut, and the exact value will be application specific.

By checking the scatter of the SW features over HH, HL, LH, and LL, as shown in Figure 9, we found that 87.02% (409/470) of the detected stopwords originated from LL. The LL classification and high $\overline{DFIDF}$ scores of stopwords agree with the generally accepted notion that stopwords are equally frequent over all time. Therefore, setting the boundary between the high and low power spectrum using the upper bound $S_f$ of SW is a reasonable heuristic.
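The per-feature P and S values quoted above (e.g., soccer's P = 7, S = 155.13) come from the periodogram of Section 4.1. A minimal NumPy sketch of Definitions 4 and 5 follows; the weekly signal below is synthetic and illustrative, not a corpus feature:

```python
import numpy as np

def dominant_period_and_power(y):
    """Definitions 4 and 5: P_f = T / argmax_k ||X_k||^2 and
    S_f = max_k ||X_k||^2, over frequency bins k = 1..T/2
    (bin 0 is the DC term and is excluded)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    power = np.abs(np.fft.fft(y)[1 : T // 2 + 1]) ** 2
    k = int(np.argmax(power)) + 1   # dominant frequency bin
    return T / k, float(power[k - 1])

# synthetic weekly (period-7) trajectory over one year
t = np.arange(365)
P, S = dominant_period_and_power(1.0 + np.cos(2 * np.pi * t / 7))
assert abs(P - 7.0) < 0.2 and S > 0   # dominant period is ~7 days
```

Because 365 is not a multiple of 7, the peak lands in the nearest bin (k = 52), giving P ≈ 7.02 rather than exactly 7; thresholding S against the stopword-derived bound of 11.18 then reproduces the high/low power split used above.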
7.3 Detecting Aperiodic Events
We evaluate our two hypotheses: 1) important aperiodic events can be defined by a set of HH features, and 2) less-reported aperiodic events can be defined by a set of LH features. Since no benchmark news streams exist for event detection (TDT datasets are not proper streams), we evaluate the quality of the automatically detected events by comparing them to manually confirmed events found by searching through the corpus.

Among the 69 HH features, we detected 17 important aperiodic events, shown in Table 1 ($e_1$-$e_{17}$). Note that the entire identification took less than 1 second, after removing events containing only a month feature. Among the 17 events, other than the overlaps between $e_3$ and $e_4$ (both describe the same hostage event) and between $e_{11}$ and $e_{16}$ (both about company reports), the 14 identified events are extremely accurate and correspond very well to the major events of the period, for example, the defeat of Bob Dole, the election of Tony Blair, and the missile attack on Iraq. Recall that the features selected for one event should minimize the cost in Eq. 2; as a result, 1) few of the selected features span different events, and 2) not all features relevant to an event will be selected. For example, the feature Clinton is representative of $e_{12}$, but since Clinton relates to many other events, its time-domain signal is far different from those of other representative features like Dole and Bob. The number of documents of a detected event is roughly estimated by the number of indexed documents containing its representative features. We can see that all 17 important aperiodic events are popularly reported events.

After 742 minutes of computation, we detected 23,525 less-reported aperiodic events from the 83,471 LH features. Table 1 lists the top 5 detected aperiodic events ($e_{18}$-$e_{22}$) with respect to the cost.
We found that these 5 events are actually very trivial events with only a few news reports, and are usually subsumed by some larger topics. For example, e22 is one of the rescue events in an airplane hijack topic. One advantage of our UG algorithm for discovering less-reported aperiodic events is that we are able to precisely detect the true event period.

7.4 Detecting Periodic Events
Among the 1,087 HL features, 330 important periodic events were detected within 10 minutes of computing time. Table 1 lists the top 5 detected periodic events with respect to the cost (e23 - e27). All of the detected periodic events are indeed valid, and correspond to real-life periodic events. The GMM is able to detect and estimate the bursty period nicely, although it cannot distinguish the slight difference between every Monday-Friday and all weekdays, as shown in e23. We also notice that e26 is actually a subset of e27 (soccer games), which is acceptable since the Sheffield league results are announced independently every weekend.

8. CONCLUSIONS
This paper took a whole new perspective of analyzing feature trajectories as time-domain signals. By considering the word document frequencies in both time and frequency domains, we were able to derive many new characteristics about news streams that were previously unknown, e.g., the different distributions of stopwords during weekdays and weekends. For the first time in the area of TDT, we applied a systematic approach to automatically detect important and less-reported, periodic and aperiodic events. The key idea of our work lies in the observations that (a)periodic events have (a)periodic representative features and (un)important events have (in)active representative features, differentiated by their power spectra and time periods. To address the real event detection problem, a simple and effective mixture-density-based approach was used to identify feature bursts and their associated bursty periods. We also designed an unsupervised greedy algorithm to detect both aperiodic and periodic events, which was successful in detecting real events, as shown in the evaluation on a real news stream. We have not made any benchmark comparison against another approach, simply because there is no previous work on the addressed problem. Future work includes evaluating the recall of detected events on a labeled news stream, and comparing our model against the closest equivalent methods, which currently are limited to those of Kleinberg [12] (which can only detect certain types of bursty events depending on parameter settings), Fung et al. [9], and Swan and Allan [18]. Nevertheless, we believe our simple and effective method will be useful for all TDT practitioners, and will be especially useful for the initial exploratory analysis of news streams.

9. REFERENCES
[1] Apache Lucene-core 2.0.0, http://lucene.apache.org.
[2] Google News Alerts, http://www.google.com/alerts.
[3] Reuters corpus, http://www.reuters.com/researchandstandards/corpus/.
[4] J. Allan. Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers, 2002.
[5] J. Allan, V. Lavrenko, and H. Jin. First story detection in TDT is hard. In CIKM, pages 374-381, 2000.
[6] J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In SIGIR, pages 314-321, 2003.
[7] T. Brants, F. Chen, and A. Farahat. A system for new event detection. In SIGIR, pages 330-337, 2003.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.
[9] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu. Parameter free bursty events detection in text streams. In VLDB, pages 181-192, 2005.
[10] Q. He, K. Chang, and E.-P. Lim. A model for anticipatory event detection. In ER, pages 168-181, 2006.
[11] Q. He, K. Chang, E.-P. Lim, and J. Zhang. Bursty feature representation for clustering text streams. In SDM, accepted, 2007.
[12] J. Kleinberg. Bursty and hierarchical structure in streams. In SIGKDD, pages 91-101, 2002.
[13] R. Kumar, J. Novak, P. Raghavan, and A. Tomkins. On the bursty evolution of blogspace. In WWW, pages 159-178, 2005.
[14] G. Kumaran and J. Allan. Text classification and named entities for new event detection. In SIGIR, pages 297-304, 2004.
[15] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In SIGKDD, pages 198-207, 2005.
[16] W. D. Penny. Kullback-Leibler divergences of normal, gamma, dirichlet and wishart densities. Technical report, 2001.
[17] N. Stokes and J. Carthy. Combining semantic and syntactic document classifiers to improve first story detection. In SIGIR, pages 424-425, 2001.
[18] R. Swan and J. Allan. Automatic generation of overview timelines. In SIGIR, pages 49-56, 2000.
[19] M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In SIGMOD, pages 131-142, 2004.
[20] Y. Yang, T. Pierce, and J. Carbonell. A study of retrospective and on-line event detection. In SIGIR, pages 28-36, 1998.
[21] Y. Yang, J. Zhang, J. Carbonell, and C. Jin. Topic-conditioned novelty detection. In SIGKDD, pages 688-693, 2002.

Table 1: All important aperiodic events (e1 - e17), top 5 less-reported aperiodic events (e18 - e22), and top 5 important periodic events (e23 - e27).

Detected Event and Bursty Period | Doc # | True Event
e1 (Sali,Berisha,Albania,Albanian,March) 02/02/1997-05/29/1997 | 1409 | Albania's president Sali Berisha lost in an early election and resigned, 12/1996-07/1997.
e2 (Seko,Mobutu,Sese,Kabila) 03/22/1997-06/09/1997 | 2273 | Zaire's president Mobutu Sese Seko fought the native rebellion and failed on 05/16/1997.
e3 (Marxist,Peruvian) 11/19/1996-03/05/1997 | 824 | Peru rebels (Tupac Amaru Revolutionary Movement) led a hostage siege in Lima in early 1997.
e4 (Movement,Tupac,Amaru,Lima,hostage,hostages) 11/16/1996-03/20/1997 | 824 | The same as e3.
e5 (Kinshasa,Kabila,Laurent,Congo) 03/26/1997-06/15/1997 | 1378 | Zaire was renamed the Democratic Republic of Congo on 05/16/1997.
e6 (Jospin,Lionel,June) 05/10/1997-07/09/1997 | 605 | Following the early general elections circa 06/1997, Lionel Jospin was appointed Prime Minister on 06/02/1997.
e7 (Iraq,missile) 08/31/1996-09/13/1996 | 1262 | U.S. fired missiles at Iraq on 09/03/1996 and 09/04/1996.
e8 (Kurdish,Baghdad,Iraqi) 08/29/1996-09/09/1996 | 1132 | Iraqi troops fought with a Kurdish faction circa 09/1996.
e9 (May,Blair) 03/24/1997-07/04/1997 | 1049 | Tony Blair became the Prime Minister of the United Kingdom on 05/02/1997.
e10 (slalom,skiing) 12/05/1996-03/21/1997 | 253 | Slalom games of Alpine skiing in 01/1997-02/1997.
e11 (Interim,months) 09/24/1996-12/31/1996 | 3063 | Tokyo released company interim results for the past several months in 09/1996-12/1996.
e12 (Dole,Bob) 09/09/1996-11/24/1996 | 1599 | Bob Dole lost the 1996 US presidential election.
e13 (July,Sen) 06/25/1997-06/25/1997 | 344 | Cambodia's Prime Minister Hun Sen launched a bloody military coup in 07/1997.
e14 (Hebron) 10/15/1996-02/14/1997 | 2098 | Hebron was divided into two sectors in early 1997.
e15 (April,Easter) 02/23/1997-05/04/1997 | 480 | Easter feasts circa 04/1997 (for Western and Orthodox churches).
e16 (Diluted,Group) 04/27/1997-07/20/1997 | 1888 | Tokyo released all 96/97 group results in 04/1997-07/1997.
e17 (December,Christmas) 11/17/1996-01/26/1997 | 1326 | Christmas feast in late 12/1996.
e18 (Kolaceva,winter,Together,promenades,Zajedno,Slobodan,Belgrade,Serbian,Serbia,Draskovic,municipal,Kragujevac) 1/25/1997 | 3 | University students organized a vigil on Kolaceva street against the government on 1/25/1997.
e19 (Tutsi,Luvengi,Burundi,Uvira,fuel,Banyamulenge,Burundian,Kivu,Kiliba,Runingo,Kagunga,Bwegera) 10/19/1996 | 6 | Fresh fighting erupted around Uvira between Zaire armed forces and Banyamulenge Tutsi rebels on 10/19/1996.
e20 (Malantacchi,Korea,Guy,Rider,Unions,labour,Trade,unions,Confederation,rammed,Geneva,stoppages,Virgin,hire,Myongdong,Metalworkers) 1/11/1997 | 2 | Marcello Malantacchi, secretary general of the International Metalworkers Federation, and Guy Rider, who heads the Geneva office of the International Confederation of Free Trade Unions, attacked the new labour law of South Korea on 1/11/1997.
e21 (DBS,Raffles) 8/17/1997 | 9 | The listing plans of Raffles Holdings, a unit of Singapore's DBS Land, on 8/17/1997.
e22 (preserver,fuel,Galawa,Huddle,Leul,Beausse) 11/24/1996 | 3 | A woman and her baby were rescued after a hijacked Ethiopian plane ran out of fuel and crashed into the sea near Le Galawa beach on 11/24/1996.
e23 (PRICE,LISTING,MLN,MATURITY,COUPON,MOODY,AMT,FIRST,ISS,TYPE,PAY,BORROWER) Monday-Friday/week | 7966 | Announce bond price on all weekdays.
e24 (Unaudited,Ended,Months,Weighted,Provision,Cost,Selling,Revenues,Loss,Income,except,Shrs,Revs) every season | 2264 | Net income-loss reports released by companies in every season.
e25 (rating,Wall,Street,Ian) Monday-Friday/week | 21767 | Stock reports from Wall Street on all weekdays.
e26 (Sheffield,league,scoring,goals,striker,games) every Friday, Saturday and Sunday | 574 | Match results of the Sheffield soccer league were published on Friday, Saturday and Sunday 10 times more often than on the other 4 days.
e27 (soccer,matches,Results,season,game,Cup,match,victory,beat,played,play,division) every Friday, Saturday and Sunday | 2396 | Soccer games held on Friday, Saturday and Sunday 7 times more often than on the other 4 days.
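The bursty periods of periodic events such as e23 - e27 come from modeling each periodic feature's trajectory with a Gaussian mixture. The sketch below is a generic weighted-EM stand-in for that idea: it fits a k-component mixture over time positions weighted by DFIDF mass, with a deterministic quantile initialization. The function and its details are assumptions, and the paper's truncated-Gaussian burst-detection step is not reproduced.

```python
import numpy as np

def weighted_gmm_1d(x, w, k=2, iters=200):
    """Weighted EM for a 1-D Gaussian mixture over time points x, where w
    is the DFIDF mass at each point. Returns (priors, means, sigmas)."""
    x = np.asarray(x, float)
    w = np.asarray(w, float)
    w = w / w.sum()
    # Deterministic init: component means at evenly spaced weighted quantiles.
    mu = np.interp((np.arange(k) + 0.5) / k, np.cumsum(w), x)
    sigma = np.full(k, x.std() / k)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each time point.
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
               / (sigma * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: parameter updates weighted by the DFIDF mass w.
        nk = (w[:, None] * r).sum(axis=0)
        mu = (w[:, None] * r * x[:, None]).sum(axis=0) / nk
        var = (w[:, None] * r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(np.maximum(var, 1e-6))
        pi = nk
    return pi, mu, sigma

# A feature that bursts twice a year, e.g. seasonal company reports:
days = np.arange(365, dtype=float)
traj = (np.exp(-0.5 * ((days - 90) / 8.0) ** 2) +
        np.exp(-0.5 * ((days - 270) / 8.0) ** 2))
pi, mu, sigma = weighted_gmm_1d(days, traj, k=2)
# Each bursty period can be read off as roughly [mu_j - 2*sigma_j, mu_j + 2*sigma_j].
print(mu, sigma)
```

On this toy trajectory the two component means land near days 90 and 270 with sigmas near 8, so the two seasonal bursty periods fall out directly from the fitted components.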