WWW 2007 / Poster Paper Topic: Semantic Web Integrating Web Directories by Learning their Structures Christopher C. Yang The Chinese University of Hong Kong Jianfeng Lin The Chinese University of Hong Kong yang@se.cuhk.edu.hk ABSTRACT Documents in the Web are often organized using category trees by information providers (e.g. CNN, BBC) or search engines (e.g. Google, Yahoo!). Such category trees are commonly known as Web directories. The category tree structures from different internet content providers may be similar to some extent but are usually not exactly the same. As a result, it is desirable to integrate these category trees together so that web users only need to browse through a unified category tree to extract information from multiple providers. In this paper, we address this problem by capturing structural information of multiple category trees, which are embedded with the knowledge of professional in organizing the documents. Our experiments with real Web data show that the proposed technique is promising. 1) 2) 3) 4) jflin@se.cuhk.edu.hk Extend the Bäyes rule to determine the category relationship between categories from different category trees. Develop four decision rules to map a category from the source category tree to a category in the master category tree. Develop an integration technique that satisfies the constraints imposed by the structures of the source category trees. The integration technique is able to expand or modify the master category tree by learning the organization of documents from the source category trees. 2. PROBLEM DEFINITION For category tree integration problem, There exists a source category tree T s = {C s , C s , ..., C s } and a master category tree 1 2 |T | s Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval ­ Retrieval models, Search process T = {C , C , ..., C } . Both trees have a set of categories with m m 1 m 2 m |T m | General Terms: Algorithms, Experimentation, Theory Keywords: Category tree, Integration, Hierarchical structure 1. INTRODUCTION Category trees and web directory are often used to organize information in the web. However, different participants on the internet often construct and maintain different category trees with different structures to facilitate the organization and retrieval of information. For example, Yahoo! and Google have edited their directories in their own way. Category tree integration arises in a variety of situations, ranging from B2B and B2C e-business, personal information management, supply chain management etc. The number of category trees on the internet is so large that manual integration is tedious, error-prone or even impossible. Previous work about data sharing and integration mainly focus on ontology integration [2][3] and schema matching [4], but ontology and schema capture the structure of specific semantic. On the other hand, category trees capture the parent and child relationship between topics of documents. Agrawal and Srikant [1] first explore how to integrate web categories trees. Unfortunately, such approach only considers flat hierarchical structures where documents are assigned to the leave nodes. It does not take the hierarchical structure of a category tree into consideration during category tree integration. Our approach captures the knowledge of category tree structures that are generated by information professionals in the process of integrating category trees. The contributions can be summarized as: Copyright is held by the author/owner(s). WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. certain hierarchical structure, and a set of documents are assigned to each category. The only relationship between these categories within a tree is the subsumption relationship between parents and the children. When the source category tree is integrated with the master category tree, two integration operators can be applied to the category of the source category tree, Cis : · · Map: Cis may be mapped to a existing category in the master category tree, C m , noted as Map(Cis ; C m ) ; or j j Add: Cis may be mapped to an expanded category in the m master category tree, , noted as C|T |+1 m Add (C ; C s i m new ,C m parent ,C m child m ) , where C parent is the parent of m m Cnew and Ccm ild is the child of Cnew ; if Ccm ild is omitted, h h m Cnew is add as a leaf category At the current stage of research, we do not consider splitting or merging nodes. However, we shall investigate splitting when the source node can be split to map with two master nodes and merging when two source nodes can be merged to map with one master nodes in our future work. 3. INTEGRATION TECHNIQUES 3.1 Category Relationships The mapping algorithm is based on the relationships between categories in the master and source category tree. We adopt the Bäyes rule P(A|B), to determine the category relationships [5]. P( A | B) = n u m b er o f d o c u m en t s i n B p r ed i c t ed t o b e i n A number of documents in B 1239 WWW 2007 / Poster Paper We identify 5 types of relationships and the relationships between categories, and they should be determined as follows: · · · · · Topic: Semantic Web in these cases. The correct position of S'j will be further affected by its descendants. Our experiments show that this rule is useful. We adopt a top down, level based method to run the algorithm. The nodes are processed in breadth first order. In this process, when given a node Sj, we will find one rule out of four to fire based on the condition of the rules. Match ( A, B ) : P ( A | B ) thH P ( B | A) thH Disjoint ( A, B ) : P ( A | B ) thL P ( B | A) thL Subconcept ( A, B ) : P ( A | B ) < thH P ( B | A) thH Superconcept ( A, B ) : P ( A | B ) thH P ( B | A) < thH Overlap ( A, B ) : thL