WWW 2007 / Poster Paper

Topic: Semantic Web

Integrating Web Directories by Learning their Structures
Christopher C. Yang
The Chinese University of Hong Kong

Jianfeng Lin
The Chinese University of Hong Kong

yang@se.cuhk.edu.hk
jflin@se.cuhk.edu.hk

ABSTRACT
Documents on the Web are often organized into category trees by information providers (e.g. CNN, BBC) or search engines (e.g. Google, Yahoo!). Such category trees are commonly known as Web directories. The category trees of different Internet content providers may be similar to some extent but are usually not identical. It is therefore desirable to integrate these category trees so that Web users need to browse only a single unified category tree to find information from multiple providers. In this paper, we address this problem by capturing the structural information of multiple category trees, which embed the knowledge of professionals in organizing documents. Our experiments with real Web data show that the proposed technique is promising.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Retrieval models, Search process

General Terms: Algorithms, Experimentation, Theory
Keywords: Category tree, Integration, Hierarchical structure

Copyright is held by the author/owner(s). WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION
Category trees and Web directories are often used to organize information on the Web. However, different participants on the Internet construct and maintain category trees with different structures to facilitate the organization and retrieval of information. For example, Yahoo! and Google each edit their directories in their own way. The need for category tree integration arises in a variety of situations, including B2B and B2C e-business, personal information management, and supply chain management. The number of category trees on the Internet is so large that manual integration is tedious, error-prone, or even impossible. Previous work on data sharing and integration mainly focuses on ontology integration [2][3] and schema matching [4], but ontologies and schemas capture the structure of specific semantics. Category trees, on the other hand, capture the parent-child relationships between topics of documents. Agrawal and Srikant [1] were the first to explore how to integrate Web category trees. Unfortunately, their approach only considers flat hierarchical structures in which documents are assigned to the leaf nodes; it does not take the hierarchical structure of a category tree into consideration during integration. Our approach captures the knowledge embedded in category tree structures, which are created by information professionals, in the process of integrating category trees. The contributions can be summarized as:
1) Extend the Bayes rule to determine the relationship between categories from different category trees.
2) Develop four decision rules to map a category from the source category tree to a category in the master category tree.
3) Develop an integration technique that satisfies the constraints imposed by the structures of the source category trees.
4) The integration technique is able to expand or modify the master category tree by learning the organization of documents from the source category trees.

2. PROBLEM DEFINITION
In the category tree integration problem, there exist a source category tree $T^s = \{C^s_1, C^s_2, \ldots, C^s_{|T^s|}\}$ and a master category tree $T^m = \{C^m_1, C^m_2, \ldots, C^m_{|T^m|}\}$. Both trees have a set of categories with a certain hierarchical structure, and a set of documents is assigned to each category. The only relationship between categories within a tree is the subsumption relationship between parents and children. When the source category tree is integrated with the master category tree, one of two integration operators can be applied to a category $C^s_i$ of the source category tree:
· Map: $C^s_i$ may be mapped to an existing category $C^m_j$ in the master category tree, denoted $Map(C^s_i; C^m_j)$; or
· Add: $C^s_i$ may be mapped to an expanded category $C^m_{|T^m|+1}$ in the master category tree, denoted $Add(C^s_i; C^m_{new}, C^m_{parent}, C^m_{child})$, where $C^m_{parent}$ is the parent of $C^m_{new}$ and $C^m_{child}$ is the child of $C^m_{new}$; if $C^m_{child}$ is omitted, $C^m_{new}$ is added as a leaf category.

At the current stage of research, we do not consider splitting or merging nodes. In future work, however, we shall investigate splitting, where a source node is split to map to two master nodes, and merging, where two source nodes are merged to map to one master node.
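The two integration operators can be illustrated with a small tree data structure. The following is a minimal sketch, not the authors' implementation: the `Category` class, its field names, the `mapping` dictionary, and the choice to copy the source category's name and documents into the new master node are all assumptions made for illustration. The sketch also assumes that the optional child passed to `add_category` is a direct child of the given parent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Category:
    """One node of a category tree; the field names are illustrative."""
    name: str
    documents: List[str] = field(default_factory=list)
    children: List["Category"] = field(default_factory=list)
    parent: Optional["Category"] = field(default=None, repr=False)

    def attach(self, child: "Category") -> "Category":
        """Make `child` a child of this category and return it."""
        child.parent = self
        self.children.append(child)
        return child


def map_category(source: Category, target: Category,
                 mapping: Dict[str, str]) -> None:
    """Map: record that a source category corresponds to an existing master category."""
    mapping[source.name] = target.name


def add_category(source: Category, parent: Category,
                 mapping: Dict[str, str],
                 child: Optional[Category] = None) -> Category:
    """Add: expand the master tree with a new category under `parent`.

    If `child` is given, the new category is spliced between `parent`
    and `child`; otherwise it is added as a leaf.
    """
    # Assumption: the new master node copies the source name and documents.
    new = Category(name=source.name, documents=list(source.documents))
    if child is not None:
        # Assumption: `child` is a direct child of `parent`.
        parent.children.remove(child)
        new.attach(child)
    parent.attach(new)
    mapping[source.name] = new.name
    return new


# Usage with made-up category names.
master = Category("Science")
biology = master.attach(Category("Biology"))
mapping: Dict[str, str] = {}
map_category(Category("Science"), master, mapping)
add_category(Category("Life Sciences"), master, mapping, child=biology)
add_category(Category("Physics"), master, mapping)
print([c.name for c in master.children])
```

Keeping the mapping in a separate dictionary leaves the source tree untouched, which makes it easy to inspect which operator was applied to each source category.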

3. INTEGRATION TECHNIQUES
3.1 Category Relationships
The mapping algorithm is based on the relationships between categories in the master and source category trees. We adopt the Bayes rule, P(A|B), to determine the category relationships [5]:

$$P(A \mid B) = \frac{\text{number of documents in } B \text{ predicted to be in } A}{\text{number of documents in } B}$$
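The ratio above can be estimated by running a classifier for category A over the documents of category B and counting how many are accepted. The following is a minimal sketch under stated assumptions: `conditional_probability` and `toy_classifier_for_a` are hypothetical names, the keyword test is only a stand-in for a trained text classifier, and returning 0 for an empty category is a choice made here, not something the paper specifies.

```python
from typing import Callable, Iterable


def conditional_probability(docs_in_b: Iterable[str],
                            predicted_in_a: Callable[[str], bool]) -> float:
    """Estimate P(A|B) as the fraction of B's documents classified into A."""
    docs = list(docs_in_b)
    if not docs:
        # Assumption: an empty category provides no evidence, so return 0.
        return 0.0
    hits = sum(1 for doc in docs if predicted_in_a(doc))
    return hits / len(docs)


def toy_classifier_for_a(doc: str) -> bool:
    """Hypothetical stand-in for a trained classifier of category A."""
    return "gene" in doc or "cell" in doc


# Made-up documents assigned to category B.
docs_b = ["gene expression", "cell biology", "stock market", "gene therapy"]
p_a_given_b = conditional_probability(docs_b, toy_classifier_for_a)
print(p_a_given_b)
```

Passing the classifier in as a function keeps the estimate independent of the classification method, so any binary document classifier can be substituted.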



We identify five types of relationships between categories. They are determined as follows:
· $Match(A, B)$: $P(A \mid B) \ge th_H \wedge P(B \mid A) \ge th_H$
· $Disjoint(A, B)$: $P(A \mid B) \le th_L \wedge P(B \mid A) \le th_L$
· $Subconcept(A, B)$: $P(A \mid B) < th_H \wedge P(B \mid A) \ge th_H$
· $Superconcept(A, B)$: $P(A \mid B) \ge th_H \wedge P(B \mid A) < th_H$
· $Overlap(A, B)$: $th_L < P(A \mid B) < th_H \wedge 0 < P(B \mid A) \le th_H$, or $0 < P(A \mid B) < th_H \wedge th_L < P(B \mid A) < th_H$

$th_H$ and $th_L$ are parameters. Ideally they would be 1 and 0, respectively, but in a real system $th_H$ is usually set slightly below 1 and $th_L$ slightly above 0.

3.2 Decision Rules and Algorithm
Rule learning methods usually attempt to select the best from all possible covering rules according to some minimality criterion. In this work, we develop four rules to integrate categories. The objective of the rules is to maintain the structure of the source category tree while integrating it with the master category tree.

Rule 1 (Figure 1, Left): Maintaining Parent-Child Relationship
Given $Map(S_j, M_i)$, $S'_j \in Child(S_j)$ and $M'_i \in Descendant(M_i)$, if $Match(S'_j, M'_i)$, then $Map(S'_j, M'_i)$.

Rule 2 (Figure 1, Right): Expanding With A New Branch
Given $Map(S_j, M_i)$ and $S'_j \in Child(S_j)$, if $S'_j$ is disjoint with every $M'_j \in Descendant(M_i)$, i.e. $Disjoint(S'_j, M'_j)$, then $Add(S'_j; M_{new}, M_i)$, and all the descendants of $S'_j$ are also added, remaining descendants of $S'_j$ in the master category tree.

Figure 1. Rule 1, Rule 2

Rule 3 (Figure 2, Left): Expanding As A Subconcept
Given $Map(S_j, M_i)$, $S'_j \in Child(S_j)$ and $M'_i \in Descendant(M_i)$, if $SubConcept(S'_j, M'_i)$, then $Add(S'_j; M_{new}, M'_i)$.

Rule 4 (Figure 2, Right): Expanding As A Superconcept
Given $Map(S_j, M_i)$, $S'_j \in Child(S_j)$ and $M'_i \in Descendant(M_i)$, if $SuperConcept(S'_j, M'_i)$, then $Add(S'_j; M_{new}, M_i, M'_i)$.

Figure 2. Rule 3, Rule 4

Rules 1-4 are the basic rules of category integration, but Rule 2 has a shortcoming. When a category, say $S'_j$, is incorrectly mapped to an expanded category $M_{new}$, all its descendants will also have incorrect mappings. We develop some adjustment rules in these cases. The correct position of $S'_j$ will be further affected by its descendants. Our experiments show that this rule is useful. We adopt a top-down, level-based method to run the algorithm. The nodes are processed in breadth-first order. In this process, when given a node $S_j$, we find one rule out of the four to fire based on the conditions of the rules.

4. EXPERIMENTS
We collected 10 data sets as experimental data from Yahoo! and the Open Directory Project. Each data set consists of two category trees, one from Yahoo! and one from the Open Directory Project, serving as the source and master category tree, respectively. The root nodes of the two category trees match each other. The trees are rooted at "Science", "Shopping" and "Society". The average numbers of categories in the source and master category trees are 7.1 and 9.7, respectively. The "number of documents in B predicted to be in A" in Section 3.1 is decided by an automatic text classifier. We use SVMlight, developed by Joachims [6], which is a fast and effective implementation of SVM. We measure the accuracy of the integration result by counting how many categories are correctly processed. Our experiments on the ten data sets show that the technique obtains 85% accuracy on average, and two of the ten data sets even reach 100% accuracy.

5. CONCLUSION
In this work, we explore how to make use of the implicit information embedded in the hierarchical category tree structure to integrate different category trees. Nodes in one category tree are mapped to or inserted at the proper position in the other. We use real-world data to conduct our experiments and obtain good results. For simplicity, we omit the node splitting and merging problem in this poster, although it is also important and hard. We will extend our techniques to handle it in future research.

6. REFERENCES
[1] R. Agrawal and R. Srikant. On Integrating Catalogs. In Proceedings of the WWW10 Conference, May 1-5, 2001, Hong Kong, pp. 603-612.
[2] H. Li and K. Yamanishi. Text Classification using ESC-Based Stochastic Decision Lists. In CIKM, Kansas City, MO, USA, 1999.
[3] N.F. Noy and M.A. Musen. Anchor-PROMPT: Using Non-Local Context for Semantic Matching. In Proceedings of the Workshop on OIS at IJCAI, 2001.
[4] E. Rahm and P.A. Bernstein. A Survey of Approaches to Automatic Schema Matching. The VLDB Journal 10(4):334-350, 2001.
[5] S. Zhu, C.C. Yang and W. Lam. CatRelate: A New Hierarchical Document Category Integration Algorithm by Learning Category Relationship. ICADL 2004, LNCS 3334, pp. 280-289, 2004.
[6] T. Joachims. Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.
