SIGIR 2007 Proceedings Poster

A Multi-Criteria Content-based Filtering System

Gabriella Pasi
Dipartimento di Informatica Sistemistica e Communicazione
Università degli Studi di Milano
+39-02-6448-7847
gabriella.pasi@unimib.it

Gloria Bordogna
IDPA Consiglio Nazionale delle Ricerche
+39-035-622-4262
gloria.bordogna@idpa.cnr.it

Robert Villa
Department of Computer Science
University of Glasgow
+44-(0)141-330-2998
villar@dcs.gla.ac.uk

ABSTRACT
In this paper we present a novel filtering system, based on a new model which reshapes the aims of content-based filtering. The filtering system has been developed within the EC project PENG [3], aimed at providing news professionals, such as journalists, with a system supporting both filtering and retrieval capabilities. In particular, we suggest that in tackling the problem of information overload, it is necessary for filtering systems to take into account multiple aspects of incoming documents in order to estimate their relevance to a user's profile, and in order to help users better understand documents, as distinct from solely attempting either to select relevant material from a stream or to block inappropriate material. To do this, a filtering model based on multiple criteria has been defined, drawing on the ideas gleaned in the project requirements stage. The filtering model is briefly described in this paper.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information Filtering.

General Terms
Design, Algorithms, Human Factors

Keywords
Content-based filtering, models, frameworks, requirements gathering

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

1. INTRODUCTION
Content-based information filtering systems aim to select relevant information from a continuous and large data stream which is being pushed towards a user, based on knowledge of the user's interests encoded into a user profile. Such systems are typically designed to manage large volumes of dynamically generated information, such as news streams, or more recently, RSS feeds and 'blogs'. The information stored in a user's profile represents a "stable" and structured information need, typically static over a relatively long period of time compared to user queries addressed to an Information Retrieval system [1]. Content-based filtering systems include [2,5], and previous entries into the TREC filtering track [4].

Systems such as those above consider the filtering task as a hard classification task: the aim is to determine, for each new document presented to the system, whether that document is relevant or not based on the user's profile, and any other information the system can gather based on user feedback. The results of a filtering system are the selected documents, presented to the user, who may then provide relevance feedback to the system which may be used to aid future relevance judgments.

The intuition behind such systems is that, to support a user confronted with information overload, the aim should be the accurate identification of relevant information. While this is a laudable goal, in practice we conjecture that such a filtering framework does not satisfactorily model the requirements typical of many filtering scenarios. We base this conjecture on the experiences in the PENG project, which we outline in the next section.

2. A FILTERING FRAMEWORK
PENG [3], an EU funded project, aimed to provide an integrated environment for news professionals (journalists and editors) including a filtering component for news streams. Requirements gathering was carried out principally through interviews with fourteen journalists and other news professionals. Through these interviews, it was found that some of the assumptions underlying the classical filtering framework do not appear to hold, at least in the domain of news gathering. Findings included:
- the fear which many journalists had of filtering systems blocking access to potentially useful information
- relevant material is selected based on many different criteria, which may be personal or related to the working environment. A particularly important part of this is the reliability of the source of the information
- the assumption of filtering as selection was found to be lacking. Rather than tackling the information overload problem through aiding a "blind" selection of material, it was found that it was more important to support the users in understanding the relevance estimate of incoming data.
The first and last points may be considered as related: fear of missing important information may be considered as a lack of knowledge about the data arriving.

This initial requirements gathering led us to define a new model for filtering incoming news, based on the consideration of multiple criteria. The output of the filtering system is a ranked list of items, which can be dynamically re-organized by users based on their multi-faceted needs. In doing this we simultaneously make the filtering task more difficult and yet easier. The filtering system must now provide customizable ranked lists, placing a document in a position relative to other documents, attempting both to provide a degree of explanation (structure) and to allow flexible methods of structuring the results. Yet this also makes the filtering easier and more controllable, since the system does not need to make binary relevance assessments and, through the extra structure, the user is better able to explore results, lessening the risk of missing important data. Core to this is the user as an active participant.
3. THE FILTERING MODEL
The PENG filtering model has two main components corresponding to the framework introduced above.

The first is a profile matching component, which compares the incoming documents to a user profile. This corresponds most closely to conventional filtering, but has two important differences. First, the output of the matching is a ranked list, maintained over time, the length of the list being as large as the available processing resources allow. Secondly, the aim of the matching is not to determine if a document is strictly relevant to a profile, but rather to disregard the documents which are very unlikely to be relevant, keeping as many potentially relevant documents in the system as possible (it is of course impractical to keep a complete ranked list for all received documents).

The second component is the re-ranking and structuring component, which enables the user to visualize the matching results in different ways, with the aim of providing a greater indication of how a new document relates to other documents.

Both of the above stages operate based on a number of different criteria. Currently the matching stage computes an overall Retrieval Status Value (RSV) for each incoming document based on the consideration of the following criteria:
- Aboutness, which corresponds to the conventional cosine similarity between profile and document.
- Coverage, which measures how much of the user profile is entailed by the contents of the latest document. This has been modelled by means of fuzzy inclusion [6].
- Reliability, which refers to the source of the document and is aimed at filtering out documents which do not come from sources of the required reliability specified by the user in her/his profile.
Aboutness and coverage scores are combined to compute the RSV. In the filtering system, documents and information content specified in the user's profile are represented using the classical vector space model.

The re-ranking stage utilizes the following criteria:
- Novelty: how much new information does the document provide compared to the existing documents? An existing example of using novelty is [2].
- Timeliness: does the document reflect the most up to date aspects of the user profile? Timeliness is evaluated as the conformity of the considered document to a time-window useful to the user, stored in her/his profile.
The criteria in the re-ranking stage are used in the following way: each filtered document has an associated RSV score (as explained above) and two additional scores, one indicating its novelty with respect to the user profile and the other indicating its fulfillment of the temporal constraints specified in the user profile. In the current version of the filtering system, novelty and timeliness are evaluated for each filtered document, and these two scores are used for re-ranking filtered documents with respect to one of the two constraints specified above. We are also studying the possibility of combining these parameters to produce an overall ranking score.

4. ONGOING WORK
Work is currently ongoing in developing the techniques required to evaluate the proposed filtering framework and model, including altering conventional filtering evaluation as used in [4] to take account of the new aims of this style of filtering.

5. ACKNOWLEDGMENTS
This work has been carried out in the PENG Specific Targeted Research Project (IST-004597) funded within the Sixth Program Framework of the European Research.

6. REFERENCES
[1] Belkin, N. J. and Croft, W. B. (1992) Information filtering and Information Retrieval: Two sides of the same Coin? In Communications of the ACM, 35, 12.
[2] Gabrilovich E., Dumais S., and Horvitz E. (2004) Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty, In WWW2004, May 2004, New York.
[3] Pasi G., Villa R. (2005) The PENG Project overview, in IDDI-05-DEXA Workshop, Copenhagen.
[4] Robertson S., Soboroff I. (2002) The TREC 2002 Filtering Track Report, NIST Special Publication 500-251: The Eleventh Text REtrieval Conference.
[5] Bell, Timothy A. H. and Moffat, Alistair (1996) The Design of a High Performance Information Filtering System, In SIGIR'96, Zurich, Switzerland.
[6] Miyamoto S. (1990) Fuzzy IR and clustering techniques, Kluwer.
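
As an illustration of the matching and re-ranking stages of the filtering model, the following is a minimal Python sketch and not the PENG implementation. Several points are assumptions that the text does not specify: term weights are taken to lie in [0, 1]; coverage uses one common fuzzy inclusion measure (the sum of term-wise minima divided by the sum of profile weights), which may differ from the formulation in [6]; aboutness and coverage are combined by a weighted average with a hypothetical parameter `alpha`; reliability is applied as a simple threshold; and novelty and timeliness scores are supplied as inputs because their computation is not detailed here. All function and field names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

Vector = Dict[str, float]  # term -> weight, assumed to lie in [0, 1]


def aboutness(profile: Vector, doc: Vector) -> float:
    """Cosine similarity between the profile and document vectors."""
    dot = sum(w * doc.get(term, 0.0) for term, w in profile.items())
    norm_p = math.sqrt(sum(w * w for w in profile.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_p == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_p * norm_d)


def coverage(profile: Vector, doc: Vector) -> float:
    """Degree to which the profile is included in the document.

    Assumed fuzzy inclusion measure: sum(min(p_t, d_t)) / sum(p_t).
    """
    total = sum(profile.values())
    if total == 0.0:
        return 0.0
    overlap = sum(min(w, doc.get(term, 0.0)) for term, w in profile.items())
    return overlap / total


def compute_rsv(profile: Vector, doc: Vector, alpha: float = 0.5) -> float:
    """Combine aboutness and coverage (assumed: weighted average)."""
    return alpha * aboutness(profile, doc) + (1.0 - alpha) * coverage(profile, doc)


@dataclass
class ScoredDoc:
    doc_id: str
    rsv: float
    novelty: float
    timeliness: float


def match(profile: Vector, doc_id: str, doc: Vector,
          source_reliability: float, min_reliability: float,
          novelty: float, timeliness: float,
          alpha: float = 0.5) -> Optional[ScoredDoc]:
    """Matching stage: drop unreliable sources, otherwise score the document."""
    if source_reliability < min_reliability:
        return None
    return ScoredDoc(doc_id, compute_rsv(profile, doc, alpha), novelty, timeliness)


def rerank(docs: List[ScoredDoc], criterion: str = "rsv") -> List[ScoredDoc]:
    """Re-ranking stage: order filtered documents by one chosen criterion."""
    if criterion not in ("rsv", "novelty", "timeliness"):
        raise ValueError(f"unknown criterion: {criterion}")
    return sorted(docs, key=lambda d: getattr(d, criterion), reverse=True)
```

In this sketch a user who switches the view from relevance to novelty would simply call `rerank(docs, "novelty")` on the same filtered list, which mirrors the idea that the ranked list is re-organized on demand and that no binary relevance decision is taken beyond the reliability threshold.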