Enhancing Topic Tracking with Temporal Information

Baoli Li, Wenjie Li, Qin Lu
Department of Computing, The Hong Kong Polytechnic University
Hung Hom, Kowloon, Hong Kong
csblli@gmail.com, {cswjli, csluqin}@comp.polyu.edu.hk

ABSTRACT
In this paper, we propose a new strategy with time granularity reasoning for utilizing temporal information in topic tracking. Compared with previous work, our approach has four distinguishing characteristics. First, we determine a set of topic times for a target topic from the given on-topic stories, which helps to avoid the negative influence of irrelevant times. Second, we take time granularity variance into account when deciding whether a coreference relationship exists between two times. Third, both the publication time and the times mentioned in the text are considered. Finally, as time is only one attribute of a topic, we increase the similarity between a story and a target topic only when they are related both temporally and semantically. Experiments on two TDT corpora show that our method makes good use of temporal information in news stories.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information Filtering.

General Terms
Algorithms.

Keywords
Topic Tracking, Temporal Information Processing.

Copyright is held by the author/owner(s). SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.

1. INTRODUCTION
Temporal information is an important attribute of a topic, and a topic usually exists for a limited period. Therefore, many researchers have explored how to utilize temporal information in topic tracking [1][2][3]. Most past research relies on a story's publication time. The similarity between two stories is adjusted according to the difference between their publication times or simply their sequence numbers. This strategy assumes that a news story is published on the same day that the reported events took place. However, this assumption does not hold in most cases.

To overcome this problem, recent researchers have applied temporal information processing to detect and normalize temporal expressions in text [2][3]. When normalized time values are available, a reasoning mechanism for comparing two times is required. Kim et al. rely on an exact match strategy [2], in which "July, 2005" and "July 7, 2005" are considered two distinct times that have no relationship with each other. Makkonen et al. propose a mathematical scheme to measure the similarity between two overlapping time intervals [3]. For example, the similarity between "July, 2005" and "July 7, 2005" is 1/31, which means that they refer to an identical time with a probability of 3.23%.

However, all these methods ignore the phenomenon of time granularity variance when people express a time in a temporally ordered news stream. As a matter of fact, people tend to use temporal expressions with different granularities to refer to an identical event as time passes. For example, when an explosion has just happened, we may refer to its time by temporal expressions like "today" or "yesterday"; but a few days later, we are likely to mention the same event as "the explosion happened last week". Obviously, we cannot take it for granted that a coarser temporal expression is less likely to refer to an event than a finer one.

In this paper, we propose a new time reasoning strategy for comparing two times. It considers time granularity variance and may treat "July, 2005" as referring to "July 7, 2005" under some conditions. We apply this time reasoning strategy to enhance topic tracking.

2. METHOD
Our method follows the common practice of adjusting the similarity between an incoming story and a target topic according to their temporal relatedness. When comparing two times, we use a novel reasoning mechanism that takes into account the phenomenon of time granularity variance in the news stream. This time reasoning strategy is incorporated into the widely used centroid-based method, as outlined in Figure 1.

TRAINING:
  T1. recognize and normalize temporal expressions in the given on-topic stories;
  T2. determine a set of topic times (Topic_Time_Set);
  T3. construct and normalize the on-topic stories' vectors, and designate their normalized centroid as the topic vector (T_Vector);
TRACKING:
  R1. recognize and normalize temporal expressions in each incoming news story;
  R2. for each normalized new time (Test_Time) and each topic time (Topic_Time) in Topic_Time_Set, estimate the coreference level (TC_Level) between them;
  R3. assign the highest time coreference level (TC_LevelHighest) over all pairs of Test_Time and Topic_Time as the temporal coreference level between the incoming story and the target topic;
  R4. construct and normalize the vector (S_Vector) for the incoming story;
  R5. compute the cosine similarity value (Sim) between T_Vector and S_Vector;
  R6. if TC_LevelHighest > LNull, increase the similarity value Sim by an increment;
  R7. use the final Sim value to make the decision and output it as the confidence value.

Figure 1. Overview of the proposed topic tracking method.

In the following subsections, we detail steps T2, R2, and R6 of Figure 1.

2.1 Determining a Set of Topic Times
We use a set of times to describe a topic's temporal attribute. It may contain one or several times. Generally, not all times in the given on-topic stories are relevant to the target topic, so we need to filter out times that are unlikely to be topic times. Because news stories are timely, a time is considered a candidate topic time only when it is not far away in the past (e.g. within the previous ten days) or it is in the future. If a given on-topic story does not contain a valid topic time, its publication time is regarded as a topic time.
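The selection rule of Section 2.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all times are simplified to day granularity and represented as `datetime.date`, "not far away in the past" is measured relative to each story's own publication time, the ten-day window is treated as inclusive, and the function names `topic_times_for_story` and `build_topic_time_set` are hypothetical.

```python
from datetime import date, timedelta


def topic_times_for_story(pub_time, mentioned_times, max_days_back=10):
    """Candidate topic times contributed by one on-topic story."""
    # Assumption: the window is anchored at the story's publication time.
    earliest = pub_time - timedelta(days=max_days_back)
    # Keep times in the recent past (inside the window) or in the future.
    valid = [t for t in mentioned_times if t >= earliest]
    # Fall back to the publication time when no mentioned time qualifies.
    return valid if valid else [pub_time]


def build_topic_time_set(stories, max_days_back=10):
    """Union of candidate topic times over all given on-topic stories."""
    topic_time_set = set()
    for pub_time, mentioned_times in stories:
        topic_time_set.update(
            topic_times_for_story(pub_time, mentioned_times, max_days_back)
        )
    return topic_time_set


stories = [
    # Published 9 July 2005: one recent time, one old time, one future time.
    (date(2005, 7, 9), [date(2005, 7, 7), date(2001, 9, 11), date(2005, 7, 20)]),
    # Published 10 July 2005: only an old time, so the publication time is used.
    (date(2005, 7, 10), [date(1999, 1, 1)]),
]
topic_time_set = build_topic_time_set(stories)
```

In this example the old times (2001 and 1999) are discarded, and the second story contributes only its publication time.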
2.2 Estimating Temporal Coreference Level
It seems impossible to establish a rigorous reasoning mechanism that considers the phenomenon of time granularity variance in the news stream. Our time granularity reasoning scheme, which decides the coreference level between two times, consists of 11 heuristic rules derived from quantitative and qualitative analyses.

We define five temporal coreference levels, i.e. LNull, LYear, LMonth, LWeek, and LDay, from weak to strong respectively. LNull indicates no relationship between two times.

Given a target topic time (Topic_Time), we first determine the set G_set of all possible granularities that a temporal expression (Test_Time) in an incoming story S can take when referring to Topic_Time. It is done by 13 heuristic rules based on the Topic_Time's granularity and the difference between the publication time (SRL_Time) of the story S and Topic_Time. The smaller the difference between SRL_Time and Topic_Time, the finer the temporal expression we are likely to use. For example, if the story S was released on the same day as Topic_Time, the temporal expressions in S that refer to Topic_Time are very likely to take granularity DAY rather than YEAR. The time granularities can be YEAR (GYear), MONTH (GMonth), WEEK (GWeek), or DAY (GDay), from coarse to fine respectively.

If the granularities of Topic_Time and Test_Time are identical, they are compared with the exact match strategy. Otherwise, the finer time is zoomed out to the same granularity as the other one for comparison. The zoomed-out granularity should be within G_set. Finally, the coreference level between Topic_Time and Test_Time is assigned the granularity level on which they match.

2.3 Increasing Similarity
In the centroid-based method, a central topic vector is constructed as the topic representative. Each incoming story is then evaluated against this centroid while tracking. The similarity between an incoming story and the centroid is computed as a decision value. If the value exceeds a predefined threshold, the new story is labeled as on-topic. We choose the cosine of two vectors $\vec{x} = (x_1, x_2, \ldots, x_m)$ (m is the size of the feature space) and $\vec{y} = (y_1, y_2, \ldots, y_m)$ as the similarity, as follows:

$$ Sim(\vec{x}, \vec{y}) = \sum_{i} \left( \frac{x_i}{\sqrt{\sum_{j} x_j^2}} \cdot \frac{y_i}{\sqrt{\sum_{j} y_j^2}} \right) = \sum_{i} O_i \qquad (1) $$

In equation (1), $O_i$ is the contribution of the feature $f_i$ to the total similarity between the two vectors.

We assume that the temporal attribute contributes to the similarity as much as the most important feature when a story is temporally related to the target topic. On the other hand, the time dimension takes effect only when other important semantic dimensions match. This guarantees that we do not add an increment for semantically irrelevant stories. Obviously, news stories released on the same day are not necessarily of the same topic, although they are very likely to be temporally related.

At step R6 in Figure 1, the increment is determined by the largest summand $O_i$ in equation (1) whose corresponding feature $f_i$ belongs to the important feature set (IFS) describing the target topic. Formally:

$$ \mathrm{Increment} = \max_{i} (O_i), \quad \text{where } i \in \{1, 2, \ldots, m\} \text{ and } f_i \in \mathrm{IFS} \qquad (2) $$

In principle, the increment can be weighted according to the temporal coreference level between the incoming story and the target topic. At present, we take a uniform weight of 1 for all coreference levels.

To determine the most important features of a topic from the given on-topic stories, we adopt the method proposed in [4], and use the χ² metric to choose the top k features.

3. EXPERIMENTS
We test the proposed method on the TDT2 Mandarin corpus and the Mandarin part of the TDT3 corpus. The minimal topic-weighted normalized detection cost, i.e. the optimal value at the best possible threshold, is used for measuring performance. In our study, we use one on-topic story for training.

To verify the effectiveness of our proposed topic tracking algorithm, we experiment with the following methods on the two datasets.

C_NTR: uses the baseline centroid method, and does not treat temporal expressions in news stories specially.
C_WT_NR: uses the centroid method with temporal information processing, but without time granularity reasoning.
C_WTR_AllVT: our proposed method as described in Section 2.
C_WTR_ManT: like the previous one, but takes human-specified topic times for each topic; the manually specified topic times are derived from the "WHEN" attribute of the seminal event for each topic; this method indicates the upper bound of our algorithm.

Table 1. Minimal normalized detection costs on two datasets.

Method         TDT2 Mandarin V3.2    TDT3 Mandarin part
C_NTR          0.1882                0.1259
C_WT_NR        0.1862                0.1279
C_WTR_AllVT    0.1785                0.1196
C_WTR_ManT     0.1689                0.1155

In the pre-processing stage, we adopt a rule-based method (finite state automata) to extract and normalize temporal expressions. This module achieves 86.8% recall and 89.7% precision on 100 randomly selected news stories according to our specification. We take 5 for the parameter k when applicable.

Table 1 shows the optimal performance of each method. The C_WT_NR method does not consistently show an advantage over the baseline C_NTR method: its cost is lower on TDT2 but higher on TDT3. In contrast, our proposed method, which incorporates time granularity reasoning, outperforms the baseline on both the TDT2 and TDT3 corpora. The C_WTR_AllVT method reduces the cost by 5.15% and 5%, respectively. The best performance is obtained with human-annotated topic times (i.e. the C_WTR_ManT method). The topic-weighted minimal normalized detection cost drops 10.26%, from 0.1882 to 0.1689, on the TDT2 corpus, and decreases 8.26%, from 0.1259 to 0.1155, on the TDT3 corpus. These upper bounds of performance exhibit the great potential of our proposed method.

4. CONCLUSION
This paper presents a new strategy with time granularity reasoning for using temporal information in topic tracking. Experiments on two TDT corpora show that our proposed algorithm is promising. We expect that the proposed method, especially the time reasoning strategy, is also applicable to other subtasks of TDT.

5. ACKNOWLEDGEMENT
This work was supported by the Research Grants Council of Hong Kong (CERG reference number PolyU5181/03E).

6. REFERENCES
[1] James Allan et al. 1998. Topic Detection and Tracking Pilot Study: Final Report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop.
[2] Pyung Kim et al. 2004. Usefulness of temporal information automatically extracted from news articles for topic tracking. ACM TALIP, 3(4), pages 227-242.
[3] Juha Makkonen et al. 2004. Simple Semantics in Topic Detection and Tracking. Information Retrieval, Vol. 7, pages 347-368.
[4] Baoli Li et al. 2005. Profile-based Event Tracking. In Proceedings of the SIGIR-2005 Conference, pages 631-632.
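As an illustration of the pairwise comparison in Section 2.2 and of the increment in Section 2.3 (steps R2, R5 and R6 of Figure 1), the following Python sketch may help. It is not the authors' implementation and rests on several assumptions. Each time is a hypothetical `(first_day_of_period, granularity)` pair. G_set is passed in as an input because the heuristic rules that produce it are not listed in the paper. Weeks are taken to be ISO weeks. The G_set check is applied only when zooming out, following a literal reading of the text. Vectors are assumed to be non-zero. The names `Level`, `zoom`, `coreference_level`, `contributions` and `final_similarity` are illustrative.

```python
import math
from datetime import date
from enum import IntEnum


class Level(IntEnum):
    """Coreference levels and granularities, weak/coarse to strong/fine."""
    NULL = 0
    YEAR = 1
    MONTH = 2
    WEEK = 3
    DAY = 4


def zoom(day, granularity):
    """Key of the YEAR, MONTH, WEEK or DAY period that contains `day`."""
    if granularity == Level.YEAR:
        return (day.year,)
    if granularity == Level.MONTH:
        return (day.year, day.month)
    if granularity == Level.WEEK:
        iso = day.isocalendar()
        return (iso[0], iso[1])  # assumption: ISO year and ISO week number
    return (day.year, day.month, day.day)


def coreference_level(topic_time, test_time, g_set):
    """Step R2 for one pair of (first_day_of_period, granularity) times."""
    (topic_day, topic_g), (test_day, test_g) = topic_time, test_time
    coarser = min(topic_g, test_g)
    # Zooming out is only allowed to a granularity inside G_set.
    if topic_g != test_g and coarser not in g_set:
        return Level.NULL
    # Exact match at the common (coarser) granularity.
    if zoom(topic_day, coarser) == zoom(test_day, coarser):
        return coarser
    return Level.NULL


def contributions(x, y):
    """Per-feature terms O_i of Equation (1); their sum is the cosine."""
    norm_x = math.sqrt(sum(v * v for v in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    return [(xi / norm_x) * (yi / norm_y) for xi, yi in zip(x, y)]


def final_similarity(x, y, ifs, tc_level_highest):
    """Steps R5 and R6: cosine plus the Equation (2) increment."""
    o = contributions(x, y)
    sim = sum(o)
    if tc_level_highest > Level.NULL and ifs:
        # Largest contribution among important features, uniform weight 1.
        sim += max(o[i] for i in ifs)
    return sim


# "July 7, 2005" (topic) versus "July, 2005" (story), MONTH allowed in G_set.
level = coreference_level(
    (date(2005, 7, 7), Level.DAY),
    (date(2005, 7, 1), Level.MONTH),
    {Level.DAY, Level.WEEK, Level.MONTH},
)
print(level.name, final_similarity([3.0, 4.0], [3.0, 4.0], {1}, level))
```

Here `ifs` holds the indices of the important features. When no important feature contributes to the cosine, the maximum in Equation (2) is zero and the similarity is unchanged, which mirrors the statement that no increment is added for semantically irrelevant stories.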