SIGIR 2007 Proceedings Poster An Effective Method for Finding Best Entry Points in Semi-Structured Documents Eugen Popovici VALORIA Lab. University of South-Brittany BP 573, 56015 Vannes, France Pierre-François Marteau VALORIA Lab. University of South-Brittany BP 573, 56015 Vannes, France Gildas Ménier VALORIA Lab., University of South-Brittany BP 573, 56015 Vannes, France {eugen.popovici, pierre-francois.marteau, gildas.menier}@univ-ubs.fr ABSTRACT Focused structured document retrieval employs the concept of best entry point (BEP), which is intended to provide optimal starting-point from which users can browse to relevant document components [4]. In this paper we describe and evaluate a method for finding BEPs in XML documents. Experiments conducted within the framework of INEX 2006 evaluation campaign on the Wikipedia XML collection [2] shown the effectiveness of the proposed approach. required in the Best In Context task of INEX 2006. The aim of the task is to first identify relevant articles (the fetching phase), and then to identify the element corresponding to the best entry points for the fetched articles (the browsing phase). In the fetching phase, articles should be ranked according to their topical relevance. In the browsing phase, we have a single element whose opening tag corresponds to the best entry point for starting to read the relevant information in the article. Recent BEPs studies [3] suggested that the users have a strong preference for the most specific, most focused components that contain the most amount of relevant information and the least amount of irrelevant content. They also identified several types of BEPs of which "Start reading here" BEPs ­ i.e. a leaf level entry point, representing the point where the users would prefer to be directed to and where they would want to start reading the text. This is the BEP type the most related to the INEX 2006 Best In Context task requirements. This was the most popular BEP type occurring in the Shakespeare collection. 62% of the "Start reading here" BEPs were the first leaf nodes occurring in a sequence of relevant leaf nodes. This emphasizes the importance of the document order for the automatic detection of BEPs. Starting from these experimental observations we tuned our algorithm for detecting the BEPs to return the first most relevant nonoverlapping focused element for the requested topic. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval ­ retrieval models, search process. General Terms Algorithms, Experimentation. Keywords Best entry points, XML, element retrieval, INEX. 1. INTRODUCTION In traditional (flat) information retrieval, results are typically presented as a list of matching documents. For structured or focused information retrieval this alone is not a viable option: The user also needs to know where in the documents he can find the relevant text. Structured document retrieval makes use of document components as the basis of the retrieval process, rather than complete documents. Within this context the concept of Best Entry Points (BEPs) has been defined as document components from which the user can browse to obtain optimal access to relevant document components [6]. BEPs algorithms were studied in [4] on experimental data from the Shakespeare user study, which developed and evaluated different approaches to the problem of automatic identification of BEPs. The basic characteristics of BEPs are presented in [6] while [7] investigates different types of best entry points in structured document retrieval, and their usage and effectiveness in real information search tasks. In this paper we focus on automatic detection of the one "true" BEP ­ i.e. a unique BEP per semi-structured document as Copyright is held by the author/owner(s). SIGIR'07, July 23­27, 2007, Ámsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007. 2. DETECTING BEPs Each element in an XML document may be composed of a set of possible nested XML elements, textual pieces of information, unordered pairs, or a mixture of such items. XML documents are generally represented as rooted, ordered, and labeled trees in which each node corresponds to an element and each edge represent a parent-child relationship. Leaf nodes are document components that correspond to the last elements of the hierarchical relationship chain. We compute the relevance score for all the leaves elements of the XML tree containing at least one of the researched terms. The relevance value vi[0..1] of a node ei is computed as the normalized sum of the IDFs of the content only (CO) request terms k that occur inside the element boundaries: 2.1 Textual Content Ranking Scheme vi (ei , CO ) = k k . k where k is the number of terms k in the CO request, k is an IDF weighting factor specifying the discriminating power of the term k in the collection : k = 1 ­ log((1+|Dk|)/(1+|D|)) ; where 851 SIGIR 2007 Proceedings Poster approach. At indexing time, the most frequent presentation tags and jump tags (i.e. emph, collectionlink, languagelink, etc.) were ignored. The documents were preprocessed for stopword removal and stemming and no constraints were used for phrase search. Table 1. INEX 2006 Best In Context Task official evaluation results. The ranks / 77 submissions are given in parentheses. Metric BEPD A=0.01 A=0.1 A=1 A=10 0.5587 (7) A=100 0.7572 (7) 0.1963 (1) 0.2564 (2) 0.3628 (6) EPRUM 0.0399 (1) 0.0568 (9) 0.0858 (16) 0.1467 (20) 0.2175 (35) Figure 1. INEX 2006 Best In Context Task Results, Metric:EPRUM-BEP-Exh-BEPDistance, A=0.01. |Dk| is the number of documents in which k is occurring ; |D| the total number of documents in the collection ; and a normalization constant = 1 / k (k ) ; 2.2 Removing Overlap & Selecting the BEP We implemented a focused retrieval approach in which we consider the XML element containing a researched term as the basic and implicitly valid unit of retrieval regardless of its size. However, for XML nodes with mixed content like paragraphs or sections including presentation tags (i.e. italics, emphasized, etc.) overlapping elements may occur in the result list. To find the most exhaustive and specific element in a path we use a two phase process. First, we aggregate the relevance of the elements in order to reflect the relevance of their descendant elements (if any). The weights are calculated in a bottom-up manner starting from leaves to the highest non overlapping nodes composing the answer. The relevance of a node is computed as the arithmetic average of all its descendant relevant nodes including its own relevance. Second the most relevant element is selected recursively as long as it not overlaps with an already selected element. The first most relevant non-overlapping element in the document order is selected as the BEP for the document. Its weight is propagated at the document level and used to rank the files by relevance. The runs are evaluated with different values for the A parameter (see Table 1). Note that high values of A (e.g. 10) does not discriminate whether the retrieved element is near to or far from the BEP. Whereas, low values of A (e.g. 0.1) favour runs that return elements very close to a BEP [1]. The proposed approach was ranked on the 1st place out of 77 official submissions by both the BEPD and EPRUM metrics when using A=0.01. Similar to the proposed approach, [5] reported evaluation results that used a vague interpretation of the structural constraints specified in the CAS topics and different methods for aggregating the nodes relevance values. These extensions did not augment the retrieval performances of the base method. 4. CONCLUSION We have described and evaluated an efficient and effective method to automatically detect BEPs in semi-structured documents. This may help users to obtain optimal access to relevant information inside an XML document. In the future we plan to compare the benefits brought by our BEPs approach versus the performances of a standard `plain' IR engine from the users' point of view. 5. REFERENCES [1] M. Lalmas, G. Kazai, J. Kamps, J. Pehcevski, B. Piwowarski and S. Robertson. INEX 2006 Evaluation Measures. In INEX'06, pages 20­34, 2006. [2] L. Denoyer and P. Gallinari. The Wikipedia XMLCorpus. SIGIR Forum, 2006. [3] G. Kazai and E. Ashoori. What does Shakespeare have to do with INEX? In SIGIR 2006 Workshop on XML Element Retrieval Methodology, pages 20­26, 2006. [4] M. Lalmas and J. Reid. Automatic identification of best entry points for focused structured document retrieval. In CIKM'03, pages 540­543, NY, USA, 2003. [5] E. Popovici, G. Ménier, and P.-F. Marteau. SIRIUS XML IR System at INEX 2006: Approximate Matching of Structure and Textual Content. In INEX'06, pages 185­199, 2006. [6] J. Reid, M. Lalmas, K. Finesilver, and M. Hertzum. Best entry points for structured document retrieval: Part I: characteristics. Inf. Process. Manage., 42(1):74­88, 2006. [7] J. Reid, M. Lalmas, K. Finesilver, and M. Hertzum. Best entry points for structured document retrieval: Part II: types, usage and effectiveness. Inf. Process. Manage., 42(1):89­ 105, 2006. 3. EVALUATION The proposed approach was evaluated within the framework of INEX 2006 evaluation campaign. The test collection [2] is composed of 659,388 English articles in XML format extracted from Wikipedia and totaling 4.6 GB. A set of 111 content only (CO) and content and structure (CAS) topics with their associated BEPs assessments were used for the evaluation. In INEX 2006, the assessors were asked to indicate one and only one BEP for every document that has relevant content. Runs were evaluated with two metrics: i) a set based measure, BEPD (For BEPDistance); and ii) an extension of precision recall (EPRUM) [1]. 3.1 Experimental Results We present in Figure 1 and Table 1 the INEX 2006 Best In Context task official results compared with the proposed 852