Measuring Similarity of Semi-structured Documents with Context Weights

Christopher C. Yang, Nan Liu
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
Ph: (852) 2609 8239
Email: {yang,nliu}@se.cuhk.edu.hk

ABSTRACT
In this work, we study similarity measures for text-centric XML documents based on an extended vector space model that considers both document content and structure. Experimental results on a benchmark collection show the superior performance of the proposed measure over a baseline that ignores the structural knowledge of XML documents.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - retrieval models.

General Terms
Models, Experimentation

Keywords
XML Search and Retrieval

1. INTRODUCTION
In traditional IR, both documents and queries are expressed in free text, which provides no clue about the logical structure of the documents. In contrast, an XML document has a hierarchical structure, where the different tags of nodes indicate different semantic properties of the text underneath, such as it being the "title" of an article. Its hierarchical nature also reveals the semantic relationships among different element types. The richer representation of XML allows users to formulate their information needs more precisely by imposing constraints on both content and structure. In recent years, there has been increasing interest in XML search and retrieval in the IR community, such as the INEX evaluation initiative [3]. However, most existing research focuses on finding the most relevant component of an XML document to return as the answer to a short query. The problem of document-to-document comparison has not been well studied, although it is important for many applications such as similarity search, clustering, and classification.

2. VECTOR SPACE MODEL FOR XML
The vector space model is a standard model for text retrieval, in which queries and documents are both represented by vectors in a high-dimensional space whose axes correspond to the set of terms in the document corpus. Each coordinate of a document vector is determined by w_d(t), the weight of term t in document d, which is commonly computed by a score of the tf*idf family. The relevance of a document to a query is usually evaluated by the similarity of their corresponding vectors, such as the cosine measure:

  \cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_t w_x(t)\, w_y(t)}{\sqrt{\sum_t w_x(t)^2}\, \sqrt{\sum_t w_y(t)^2}}    (1)

Several recent works [1,3,4] proposed extensions of the regular vector space model for XML documents so that both content and structure are taken into account. The basic idea underlying these approaches is to use pairs of the form (t, c) instead of single terms t as the dimensions of the vector space, where a term t is further differentiated by the context c of its appearance, typically identified by the path leading to the term from the root of the hierarchical structure of the XML document. Thus the weight of an individual term w_x(t) is replaced by the weight of the term in context, denoted w_x(c, t). Moreover, when computing vector similarity, it has been proposed that terms occurring in different but similar contexts should also be accounted for, by utilizing a context resemblance measure:

  cr(c_x, c_y) = \begin{cases} \dfrac{1 + \min(|c_x|, |c_y|)}{1 + \max(|c_x|, |c_y|)} & \text{if } c_x \text{ or } c_y \text{ is a prefix of the other} \\ 0 & \text{otherwise} \end{cases}

The cosine similarity between two XML documents then becomes

  \cos(x, y) = \frac{\sum_{(t, c_x) \in x} \sum_{(t, c_y) \in y} w_x(c_x, t)\, w_y(c_y, t)\, cr(c_x, c_y)}{\sqrt{\sum_{(c, t)} w_x(c, t)^2}\, \sqrt{\sum_{(c, t)} w_y(c, t)^2}}

To instantiate this similarity measure, we need to specify how the weight w_x(c, t) is computed. In [1], w_x(c, t) is computed as tf_x(c, t) × idf(c, t), where idf(c, t) = log(|N| / |N(c, t)|), |N| is the total number of documents in the collection, |N(c, t)| is the number of documents containing (c, t), and tf_x(c, t) is the number of occurrences of (t, c) in x.
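The following is a minimal sketch of this measure, assuming each document has already been reduced to a dictionary mapping (context path, term) pairs to weights w(c, t); the function names and toy documents are illustrative and not part of the original system:

import math

def context_resemblance(cx, cy):
    # cr(cx, cy): (1 + min(|cx|, |cy|)) / (1 + max(|cx|, |cy|)) if one path is a
    # prefix of the other, 0 otherwise.  Contexts are tuples of tags, e.g. ('article', 'title').
    shorter, longer = (cx, cy) if len(cx) <= len(cy) else (cy, cx)
    if tuple(longer[:len(shorter)]) != tuple(shorter):   # neither is a prefix of the other
        return 0.0
    return (1 + len(shorter)) / (1 + len(longer))

def xml_cosine(x, y):
    # Structure-aware cosine: each document is a dict mapping (context, term) pairs
    # to weights w(c, t), e.g. produced by w(c) * tf(c, t) * idf(c, t).
    numerator = 0.0
    for (cx, tx), wx in x.items():
        for (cy, ty), wy in y.items():
            if tx == ty:                                  # same term, possibly different contexts
                numerator += wx * wy * context_resemblance(cx, cy)
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return numerator / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Hypothetical toy documents: "warrior" occurs under related but different paths.
doc_a = {(('article', 'title'), 'warrior'): 0.8, (('article', 'body'), 'bronze'): 0.3}
doc_b = {(('article', 'title', 'sub'), 'warrior'): 0.6, (('article', 'body'), 'bronze'): 0.2}
print(xml_cosine(doc_a, doc_b))

Terms that match in identical contexts contribute with full weight (cr = 1), while matches in merely related contexts are discounted by the prefix-based resemblance.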
To further exploit the structural information, [4] introduced the concept of "weighted term frequency" to reflect that different locations carry different importance when comparing two documents. For example, in an article, a word occurring in the "title" is more important than one in a "paragraph", which in turn is more important than one in the "reference". In [2], each tag x_i is manually assigned an importance weight λ(x_i), and the weight w(c) for a particular context c with tree path x_0 x_1 ... x_n is estimated by w(c) = \prod_{i=0}^{n} λ(x_i). The term weight w_x(c, t) is then replaced by w(c) × tf_x(c, t) × idf(c, t).

Instead of manually assigning importance weights to the element tags and computing the importance of a context as the product of the weights of the elements on its path, we propose to apply a genetic algorithm to search for the optimal context weights w(c), based on a set of training documents for which the relevant documents have been manually selected. Genetic algorithms (GAs) provide an optimization procedure motivated by biological evolution. Given the current population of hypotheses, a GA generates successor hypotheses by mutating and recombining the best currently known hypotheses, so that the members of the population approach the optimal hypothesis in the long run.

Given the collection of XML documents, we first find the set of all possible contexts in which a term may occur, denoted by C. Each hypothesis represents a set of |C| numbers in the range [0, 1], each of which corresponds to the importance weight of a particular context and is encoded by a 5-bit binary string; a hypothesis is therefore encoded by a 5×|C|-bit binary string. To evaluate a hypothesis, we instantiate the similarity measure with its set of context weights, use it to rank the collection for each of the training documents, and compute the mean average precision (MAP) as the fitness of the hypothesis. Mutation is simulated by randomly inverting some bits of the representation, and crossover is implemented by swapping bits randomly sampled from two input hypotheses. The detailed algorithm is presented in Table 1; a code sketch of the procedure is given after the table.

P ← Generate p hypotheses randomly
For i = 1 to MAX_ITER do
  (1) Create a new generation P'
  (2) For each h in P, compute Fitness(h)
  (3) Rank the h's by Fitness(h)
  (4) Compute a probability distribution Pr(h) such that the probability of a hypothesis h is proportional to its rank produced in (3)
  (5) Probabilistically select (1 - r)·p members of P to add to P', following Pr(h)
  (6) Probabilistically select r·p/2 pairs of hypotheses from P following Pr(h), apply crossover to each pair, and add the offspring to P'
  (7) Choose m percent of the members of P' uniformly at random; for each, invert k randomly selected bits of its binary string representation
  (8) P ← P'
Return the hypothesis in P with the highest fitness

Table 1: Algorithm for optimizing context weights. The parameters p, r, m, and k control the population size, the crossover rate, the mutation rate, and the number of bits inverted by each mutation.
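The selection, crossover, and mutation steps of Table 1 can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the fitness callback, which is assumed to instantiate the similarity measure with the decoded context weights, rank the collection for the training documents, and return the MAP, must be supplied by the caller, and the default parameter values are assumptions.

import random

BITS_PER_WEIGHT = 5   # each context weight is a 5-bit code for a value in [0, 1]

def decode(hypothesis, contexts):
    # Turn a bit list into a dict of context -> weight in [0, 1].
    weights = {}
    for i, c in enumerate(contexts):
        bits = hypothesis[i * BITS_PER_WEIGHT:(i + 1) * BITS_PER_WEIGHT]
        weights[c] = int(''.join(map(str, bits)), 2) / (2 ** BITS_PER_WEIGHT - 1)
    return weights

def optimize_context_weights(contexts, fitness, p=50, r=0.6, m=0.1, k=2, max_iter=100):
    # GA of Table 1; fitness(weights) is assumed to return the MAP obtained with
    # the given context weights on the training documents.
    length = BITS_PER_WEIGHT * len(contexts)
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(p)]
    for _ in range(max_iter):
        # (2)-(4): rank hypotheses by fitness; selection probability proportional to rank
        ranked = sorted(population, key=lambda h: fitness(decode(h, contexts)), reverse=True)
        probs = [len(ranked) - i for i in range(len(ranked))]
        # (5): carry (1 - r) * p members over to the next generation
        next_gen = [h[:] for h in random.choices(ranked, weights=probs, k=int((1 - r) * p))]
        # (6): crossover on r * p / 2 pairs -- swap randomly sampled bit positions
        for _ in range(int(r * p / 2)):
            a, b = (h[:] for h in random.choices(ranked, weights=probs, k=2))
            for pos in random.sample(range(length), length // 2):
                a[pos], b[pos] = b[pos], a[pos]
            next_gen += [a, b]
        # (7): mutate m percent of the members by inverting k random bits
        for h in random.sample(next_gen, max(1, int(m * len(next_gen)))):
            for pos in random.sample(range(length), k):
                h[pos] ^= 1
        population = next_gen                             # (8)
    return decode(max(population, key=lambda h: fitness(decode(h, contexts))), contexts)

Rank-based selection (rather than selection proportional to raw fitness) keeps the search pressure stable even when the MAP values of competing hypotheses are close.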
3. EVALUATION
The dataset used for evaluation consists of 1,894 XML documents describing items in the Museum of Qin Terracotta Warriors and Horses, with a total size of 7.8 MB. Each document contains the following types of elements: title, description, record type, resource type, object condition, location, repository, reference, and source. We chose 40 documents for evaluation and manually selected their relevant documents from the collection. To study the effectiveness of the structure-based similarity measure for XML documents, we compare it with a baseline that ignores structure and compares two documents using the regular vector space model, with each term weighted by tf_x(t) × idf(t). For the structure-based measure, we compare two schemes for computing w_x(c, t), both based on w(c) × tf_x(c, t) × idf(c, t). The first scheme, denoted structure I, assigns the same weight of 1.0 to every context c. The second scheme, denoted structure II, uses the context weights w(c) tuned by the genetic algorithm. For structure II, we evaluate performance through 8-fold cross-validation: in each run, 32 documents are used for training and the remaining 8 for testing, and the final performance is computed by averaging the performance on the testing data over the 8 runs.

The experimental results are summarized in Figure 1, based on which we can make the following observations:

· Utilizing structure shows clearly superior performance over the baseline: both structure I and structure II are above the baseline at nearly all recall points.

· The context weights are useful for improving the structure-based similarity measure, as shown by the improvement in average precision from structure I to structure II.

[Figure 1: Performance of different similarity measures (precision-recall curves): baseline (avg. precision 0.536), structure I (avg. precision 0.565), structure II (avg. precision 0.574).]

4. REFERENCES
[1] D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer. "Searching XML Documents via XML Fragments". In Proceedings of SIGIR 2003, Toronto, Canada, 2003.
[2] V. Kakade and P. Raghavan. "Encoding XML in Vector Spaces". In Proceedings of ECIR 2005, Santiago, Spain, 2005.
[3] Initiative for the Evaluation of XML Retrieval (INEX). http://qmir.dcs.qmul.ac.hk/INEX/
[4] S. Liu, Q. Zhu and W. W. Chu. "Configurable Indexing and Ranking for XML Information Retrieval". In Proceedings of SIGIR 2003, Toronto, Canada.