SIGIR 2007 Proceedings — Poster

Novelty Detection Using Local Context Analysis

Ronald T. Fernández
Grupo de Sistemas Inteligentes
Departamento de Electrónica y Computación
Universidad de Santiago de Compostela, Spain
ronald.teijeira@rai.usc.es

David E. Losada
Grupo de Sistemas Inteligentes
Departamento de Electrónica y Computación
Universidad de Santiago de Compostela, Spain
dlosada@usc.es

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models, Information Filtering

General Terms
Experimentation

Keywords
Local Context Analysis, Novelty Detection

Copyright is held by the author/owner(s). SIGIR '07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION

The aim of this work is to determine the utility of Local Context Analysis (LCA) [5] for the retrieval of relevant and novel sentences. LCA has been successful in different areas, and we check here whether this method is also useful to drive the selection of novel material. We adopt the Novelty task as defined in the TREC conference [2, 4, 3]. Given a set of documents associated with a topic, the task consists of finding the relevant and novel sentences. This problem is interesting for many areas, such as text summarization, web information access, and question answering.

Some researchers have proposed that the estimation of novelty for a given sentence should be based on the set of seen sentences that share common meanings [6]. In this way, the degree of redundancy of a sentence s_i is not influenced by past sentences that are totally unrelated to s_i. The intuition is that novelty estimation might be more robust if focused on this set of terms. In our work we pursue a similar idea, because we apply LCA to focus the estimation of novelty on query-related terms.

2. THE NOVELTY TASK AND LCA

The groups participating in the Novelty task start from a common ranking of documents for each query. Two different subtasks are proposed: 1) to produce a ranking of relevant sentences, and 2) to filter out redundant sentences from this ranking. Successful algorithms tested for this task usually apply some popular IR model to rank sentences given a query (e.g. variants of tf-idf applied at the sentence level [1]). Next, in order to estimate how redundant the sentences are, some methods have been proposed to compute the overlap between each sentence and the previously seen sentences. We have chosen two baseline methods which are simple and robust [1]: NewWords and SetDif.

NewWords counts the number of words in the current sentence, s_i, which did not occur in the previously seen sentences:

    N_{nw}(s_i | s_1, \ldots, s_{i-1}) = \left| W_{s_i} \setminus \bigcup_{j=1}^{i-1} W_{s_j} \right|

where W_{s_i} is the set of words in the sentence s_i. SetDif computes the number of different words between each sentence s_i and the previously seen sentence that is the most similar to s_i:

    N_{sd}(s_i | s_1, \ldots, s_{i-1}) = \min_{1 \le j \le i-1} N_{sd}(s_i | s_j)

    N_{sd}(s_i | s_j) = \left| W_{s_i} \setminus W_{s_j} \right|

We propose variants of these methods that estimate the novelty score focusing on query-related terms. We expect to improve performance when the novelty scores consider only terms that are highly related to the query.

2.1 LCA

LCA is a method based on the idea that a common term from the top-ranked relevant documents (or passages) will tend to co-occur with query terms within the top-ranked documents (or passages) [5]. We apply LCA to produce a set of query-related terms, and the novelty scores are adjusted accordingly. The importance of a term t in the top-ranked sentences is computed as:

    bel(q, t) = \prod_{t_i \in q} \left( \delta + \log(af(t, t_i)) \cdot idf_t / \log(n) \right)^{idf_i}

where t is a term, n is the number of top-ranked sentences p_1, \ldots, p_n, N is the number of sentences in the collection, N_i is the number of sentences containing the term t_i, ft_{ij} is the number of occurrences of the term t_i in the sentence p_j, ft_j is the number of occurrences of the term t in the sentence p_j, idf_i = \min(1.0, \log_{10}(N/N_i)/5.0) (idf_t is defined analogously for t), af(t, t_i) = \sum_{j=1}^{n} ft_{ij} \cdot ft_j, and \delta = 0.1 is a constant that avoids a zero bel value.

This measure can be applied to rank terms in decreasing order of estimated importance given a query. Selecting the top-ranked terms, we can form a query-oriented vocabulary (T_q). Using this vocabulary, we compute NewWords and SetDif for each sentence as follows:

    N^{LCA}_{nw}(s_i | s_1, \ldots, s_{i-1}) = \left| W^{LCA}_{s_i,q} \setminus \bigcup_{j=1}^{i-1} W^{LCA}_{s_j,q} \right|

    N^{LCA}_{sd}(s_i | s_1, \ldots, s_{i-1}) = \min_{1 \le j \le i-1} N^{LCA}_{sd}(s_i | s_j)

    N^{LCA}_{sd}(s_i | s_j) = \left| W^{LCA}_{s_i,q} \setminus W^{LCA}_{s_j,q} \right|

where W^{LCA}_{s_i,q} = W_{s_i} \cap T_q.

3. EXPERIMENTS

We used the three collections of data which were made available in the context of the TREC Novelty tracks in 2002, 2003 and 2004 [2, 4, 3]. In 2002 and 2003, the ranking of documents provided by NIST consists only of relevant documents. In 2004, the collection is more realistic because the document rankings contain relevant and irrelevant material.

To generate an initial ranking of sentences we applied a variation of tf-idf which proved successful in the past [1]. Given these rankings, the top 25 ranked sentences¹ are mined to select important terms using LCA. This gives us the query-oriented vocabulary T_q and, subsequently, sentences are re-ranked using N^{LCA}_{nw} and N^{LCA}_{sd}. The top 10% of sentences in this ranking are used for evaluation. We ran experiments with varying sizes of this vocabulary to check the stability of the method. The evaluation measures applied are precision at 5 sentences (P@5) and precision at 10 sentences (P@10).

¹ Preliminary experiments showed that 25 sentences is a reasonable number for estimating the query-oriented vocabulary.

In Table 1 we show the performance values for NewWords and for NewWords with LCA under different vocabulary sizes (10, 50, 100 and all terms in the top 25 ranked sentences). Analogously, in Table 2 we report results for the SetDif method. Results marked with a star are statistically significant according to a t-test at the p < .05 level.

Table 1: NewWords vs. NewWords with LCA

                       |         NW LCA
                 NW    | 10 t.   50 t.   100 t.  all t.
  T2002  P@5    0.200  | 0.204   0.229   0.245   0.237
         P@10   0.180  | 0.151   0.190   0.222*  0.235*
  T2003  P@5    0.596  | 0.532   0.552   0.572   0.596
         P@10   0.572  | 0.478*  0.538   0.562   0.580
  T2004  P@5    0.224  | 0.248   0.288*  0.284*  0.256
         P@10   0.252  | 0.190*  0.246   0.264   0.274

Table 2: SetDif vs. SetDif with LCA

                       |         SD LCA
                 SD    | 10 t.   50 t.   100 t.  all t.
  T2002  P@5    0.208  | 0.216   0.220   0.241   0.233
         P@10   0.184  | 0.188   0.214   0.229   0.233*
  T2003  P@5    0.568  | 0.564   0.540   0.564   0.584
         P@10   0.580  | 0.536   0.544   0.558   0.590
  T2004  P@5    0.236  | 0.256   0.296   0.308*  0.264
         P@10   0.256  | 0.220   0.262   0.272   0.286

In 2003, the baseline performs very well because of the high proportion of relevant sentences in the collection [4]. Hence, it is very difficult to improve the results, because any reasonable sentence retrieval strategy yields a good top 10. In the other two collections, the application of LCA yielded significant improvements. The results indicate that the larger the vocabulary, the better the precision. With 10 terms the method does not estimate redundancy satisfactorily, because all the decisions are made on the basis of very few terms. On the other hand, if the vocabulary contains all the terms in the top 25 sentences, then redundancy is estimated successfully. LCA seems useful in terms of P@5, but its utility is questionable if the aim is to retrieve 10 good sentences. In that case, simply selecting all the terms in the top 25 sentences is the most robust approach. To the best of our knowledge, this sort of vocabulary selection, which is a form of pseudo-relevance feedback for novelty purposes, has not been applied in the literature.

4. CONCLUSIONS

We have presented the results of our attempts to identify relevant and novel sentences in a ranked list of documents using different methods and their LCA variants. Although NewWords and SetDif are competitive methods for novelty detection, our results indicate that precision at top ranks might be further improved if redundancy decisions are made in terms of a more focused vocabulary. Nevertheless, it is still unclear whether such a vocabulary should be selected using LCA. Given our current results, a simple method (based on extracting the terms appearing in the top 25 sentences) performs well and does not require LCA. In the future we will keep studying the effects of vocabulary size on novelty detection.

5. ACKNOWLEDGEMENTS

This work was partially supported by national project TIN2005-08521-C02-01 and Galician network 2006/23. David E. Losada belongs to the "Ramón y Cajal" program, whose funds come from MEC and the FEDER program.

6. REFERENCES

[1] J. Allan, C. Wade, and A. Bolivar. Retrieval and novelty detection at the sentence level. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003), pages 314-321, 2003.
[2] D. Harman. Overview of the TREC 2002 Novelty Track. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), 2002.
[3] I. Soboroff. Overview of the TREC 2004 Novelty Track. In Proceedings of the 13th Text REtrieval Conference (TREC 2004), 2004.
[4] I. Soboroff and D. Harman. Overview of the TREC 2003 Novelty Track. In Proceedings of the Twelfth Text REtrieval Conference (TREC 2003), 2003.
[5] J. Xu and W. B. Croft. Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems, 18(1):79-112, 2000.
[6] L. Zhao, M. Zhang, and S. Ma. The nature of novelty detection. Inf. Retr., 9(5):521-541, 2006.
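Appendix (illustrative). The NewWords and SetDif measures, and their LCA-restricted variants, reduce to simple set operations. The following is a minimal Python sketch under our own reading of the definitions above; the function names (new_words, set_dif, lca_variant) are ours, not from the paper or from any TREC tooling.

```python
def new_words(sentence, seen):
    """NewWords: count words of the current sentence that did not occur
    in any previously seen sentence.
    sentence: set of words; seen: list of word sets (earlier sentences)."""
    prev = set().union(*seen) if seen else set()
    return len(sentence - prev)

def set_dif(sentence, seen):
    """SetDif: set difference against the most similar previous sentence,
    i.e. the minimum per-sentence difference."""
    if not seen:
        return len(sentence)  # first sentence: every word is new
    return min(len(sentence - s) for s in seen)

def lca_variant(measure, sentence, seen, tq):
    """LCA variant of either measure: restrict the sentence and its
    history to the query-oriented vocabulary T_q before scoring."""
    return measure(sentence & tq, [s & tq for s in seen])
```

For example, with history [{a,b,c}, {b,c,d}] and current sentence {a,d,e}, NewWords counts only e as new, while SetDif reports two unseen words against either previous sentence; restricting to T_q = {a, e} makes both variants score 1.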
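Appendix (illustrative). The bel(q, t) term-scoring step of Section 2.1 can likewise be sketched in a few lines. This is a minimal sketch of the formula as reconstructed above, not the authors' implementation; all names (bel, doc_freq, delta) are ours, and the handling of the zero co-occurrence case (falling back to the delta constant alone) is our assumption about how log(0) is avoided.

```python
import math
from collections import Counter

def bel(query_terms, t, top_sentences, collection_size, doc_freq, delta=0.1):
    """Score candidate term t against the query terms, following the LCA
    belief formula: bel(q,t) = prod_i (delta + log(af(t,ti)) idf_t / log n)^idf_i.

    top_sentences   : list of token lists (the n top-ranked sentences)
    collection_size : N, total number of sentences in the collection
    doc_freq        : dict term -> number of collection sentences containing it
    """
    n = len(top_sentences)
    counts = [Counter(s) for s in top_sentences]

    def idf(term):
        # idf clipped at 1.0, as in the paper's definition
        return min(1.0, math.log10(collection_size / doc_freq[term]) / 5.0)

    score = 1.0
    for ti in query_terms:
        # af(t, ti): co-occurrence of t with query term ti over the top sentences
        af = sum(c[ti] * c[t] for c in counts)
        if af > 0:
            factor = delta + math.log(af) * idf(t) / math.log(n)
        else:
            factor = delta  # assumption: delta alone keeps the belief non-zero
        score *= factor ** idf(ti)
    return score
```

Sorting candidate terms by this score in decreasing order and keeping the top k yields the query-oriented vocabulary T_q used by the LCA variants; terms that co-occur more often with the query terms receive higher belief.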