On Hierarchical Web Catalog Integration with Conceptual Relationships in Thesaurus

Ing-Xiang Chen, Jui-Chi Ho, Cheng-Zen Yang
Yuan Ze University, 135 Yuan-Tung Rd., Chungli, Taiwan, 320
{sean,ricky,czyang}@syslab.cse.yzu.edu.tw

ABSTRACT
Web catalog integration is an interesting problem in current digital content management. Past studies have shown that using a flattened structure with auxiliary information extracted from the source catalog can improve the integration results. However, a flattened structure ignores the hierarchical relationships, and thus the performance improvement of catalog integration may be reduced. In this paper, we propose an enhanced hierarchical catalog integration (EHCI) approach with conceptual thesauri extracted from the source catalog. The results show that our enhanced hierarchical integration approach effectively boosts the accuracy of hierarchical catalog integration.

Categories and Subject Descriptors
H.3.2 [Information Storage]: Record classification; H.3.5 [On-line Information Services]: Web-based services

General Terms
Algorithms, Experimentation

Keywords
hierarchical catalog integration, conceptual relationships, thesaurus

Copyright is held by the author/owner(s). SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. ACM 1-59593-369-7/06/0008.

1. INTRODUCTION
With the explosive growth of various kinds of Web information, an integrated Web catalog is becoming an important service for on-line vendors and Internet users [1, 2, 4]. An early study has shown that only about 20% of the categorized sites retrieved from both Yahoo! and Google catalogs are the same [2], which means that users may need to spend much effort browsing different Web catalogs to obtain the required materials.

As noted in [1], catalog integration is not just a classification task, because the integration accuracy can be highly improved when some implicit source information is exploited. An early approach was proposed to enhance the Naive Bayes classifier with implicit source information [1]. Other studies (e.g., [2, 4]) employ SVMs to improve accuracy. Since these studies only consider a flattened integration structure, they completely ignore the hierarchical relationships between the categories and subcategories in the destination catalog, and thus the integration accuracy is restrained.

Past studies have shown that exploiting a hierarchical structure in classification may bring better results than using a flattened structure [3]. However, this has not been verified for Web catalog integration. In this paper, we propose an enhanced hierarchical catalog integration (EHCI) approach with conceptual relationships extracted from the source catalog thesaurus to improve the integration performance. We applied SVM classifiers to the EHCI approach in our experiments and compared its performance with that of a simple hierarchical catalog integration approach (SHCI), which is designed with reference to [3] and [5]. The experimental results show that the EHCI approach effectively boosts the accuracy of hierarchical catalog integration over all categories.

2. HIERARCHICAL INTEGRATION
We assume that two hierarchical catalogs participate in the integration process. Figure 1 illustrates the process of automated hierarchical catalog integration. One is the source catalog $S$ with a set of $m$ categories $S_1, S_2, \ldots, S_m$, their subcategories, and so on down to the lowest-layered subcategories. The other is the destination catalog $D$ with a set of $n$ categories $D_1, D_2, \ldots, D_n$, their subcategories, and so on down to the lowest-layered subcategories. The integration process is performed by merging each document $d$ in $S$ into a corresponding destination category in $D$. That is, for each layered directory in the hierarchy, the training documents in each directory are used to train directory classifiers and local classifiers that help integrate each document $d$ into a corresponding directory. Only the documents integrated into the corresponding layered categories and subcategories are viewed as correctly integrated.

Figure 1: The process of hierarchical integration. [Diagram: source catalog S with categories S_1, S_2, ..., S_m and subcategories S_m1, S_m2, ..., S_mj; destination catalog D with categories D_1, D_2, ..., D_n, subcategories D_21, D_22, ..., D_2k, and sub-subcategories D_221, D_222, ...]

Table 1: The experimental data.

Categories   T_{Y-G}   C_Y    t_Y    T_{G-Y}   C_G    t_G
Autos           1823    148    442      1094    312    462
Movies          7776   1035   1592      5174   1165   1395
Outdoors        1724    100    234      2308    523    221
Photo           1399     80    246       615    158    227
Software        1940    109    712      5693   1185    701
Total          14662   1472   3226     14884   3343   3006

[Plot for Figure 2: bar chart of integration accuracy (0% to 90%) for the categories Autos, Movies, Outdoors, Photo, and Software; series SHCI_Y2G, EHCI_Y2G, SHCI_G2Y, EHCI_G2Y.]

3. THE INTEGRATION APPROACH
In hierarchical catalog integration, a one-against-rest strategy is used at each decision point in the hierarchy to integrate documents into the matching directory, and the destination classifiers are trained with the hierarchy labels for further enhancement. To improve the integration accuracy, a weight formula is designed to extract the semantic concepts existing in the source catalog. In Equation 1, the weight of each thesaurus is exponentially decreased and accumulated according to the increased layers to represent the semantic concept extracted from the source labels. Equation 1 calculates the feature weight of each document, in which $L_i$ represents the relevant label weight assigned exponentially as $1/2^i$, and $f_x$ represents the occurrence ratio of feature $x$ in the document. If feature $x$ appears among the label features, $L_x$ is the label weight of the layer in which $x$ is located. Otherwise, $L_x$ is zero.
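As a rough illustration of the weighting described for Equation 1, the following Python sketch mixes a normalized label weight with the occurrence ratio of each term. The function name `feature_weights`, the coefficient name `lam`, the ordering of `label_layers` (index i carries weight 1/2^i), the rule that a term found in several layers takes the first matching layer, and the choice to add label terms that are absent from the document are all assumptions made for illustration; the paper does not specify them.

```python
from collections import Counter


def feature_weights(doc_terms, label_layers, lam=0.05):
    """Return {term: weight} in the shape of Equation 1 (illustrative sketch).

    doc_terms    -- list of tokens in the document
    label_layers -- list of sets of label terms; layer i gets weight 1 / 2**i
                    (this ordering is an assumption)
    lam          -- mixing coefficient; 0.05 is the value reported in the text
    """
    layer_weights = [1.0 / 2 ** i for i in range(len(label_layers))]
    norm = sum(layer_weights)  # sum of L_i over all layers

    counts = Counter(doc_terms)
    total = sum(counts.values())

    # Assumption: label terms are added to the document's feature set.
    terms = set(counts)
    for layer in label_layers:
        terms |= set(layer)

    weights = {}
    for term in terms:
        # f_x: occurrence ratio of the term in the document
        f_x = counts[term] / total if total else 0.0
        # L_x: weight of the first layer containing the term, else zero
        l_x = 0.0
        for i, layer in enumerate(label_layers):
            if term in layer:
                l_x = layer_weights[i]
                break
        label_part = l_x / norm if norm else 0.0
        weights[term] = lam * label_part + (1 - lam) * f_x
    return weights


if __name__ == "__main__":
    # Hypothetical document and label path, for demonstration only.
    doc = ["engine", "car", "engine", "review"]
    layers = [{"sedan"}, {"car"}, {"autos"}]
    for term, w in sorted(feature_weights(doc, layers).items()):
        print(term, round(w, 4))
```

With a small `lam`, the occurrence ratio dominates and the label terms act as a light bias toward the source hierarchy, which matches the role the text gives to the 0.05 setting.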

$L_i$ represents the label weight at a depth of $i$ layers, and the weight correspondingly decreases down to the top layer with an $n$-layer depth. With such a thesaurus weighting scheme, the conceptual relationships of the layered source categories can be transformed and added into the test documents. Likewise, to build enhanced classifiers in the destination categories, the EHCI scheme is used in the destination catalog. With the enhancement of the features and native category label information, the classifiers can thus be trained to be more distinctive when classifying the documents into the correct categories. The threshold $\lambda$ is heuristically set to 0.05 to accommodate the weights of the source thesauri when enhancing the destination classifiers, and the corresponding value in the native destination category is set the same way.

\[
\mathit{FeatureWeight} = \lambda \times \frac{L_x}{\sum_{i=0}^{n} L_i} + (1 - \lambda) \times f_x \qquad (1)
\]

4. EXPERIMENTAL RESULTS
The EHCI approach described in the previous section was applied to a collection of 17888 pages retrieved from the Yahoo! catalog and a collection of 17890 pages retrieved from the Google catalog. As shown in Table 1, a set of 1472 classes in the Yahoo! catalog ($C_Y$) and a set of 3343 classes in the Google catalog ($C_G$) were organized according to the original hierarchy to a depth of six layers. The test documents are selected by intersecting the documents of Yahoo! with those of Google, namely $Y \cap G$. The numbers of test documents $t_Y$ and $t_G$ differ because some test documents may appear in more than one class simultaneously. The training documents of the Yahoo! catalog ($T_{Y-G}$) and the Google catalog ($T_{G-Y}$) are gathered by subtracting the intersected documents $Y \cap G$. In this experiment, the documents are integrated both from Yahoo! into Google and from Google into Yahoo!.

Figure 2 illustrates that EHCI outperforms the SHCI approach in both Yahoo!-to-Google and Google-to-Yahoo! catalog integration. The result further shows that the conceptual relationships are consistently effective in catalog integration.

Figure 2: The integration accuracy.

5. CONCLUDING REMARKS
This paper reports our studies on the effects of a hierarchical scheme to enhance the integration accuracy. By exploiting the hierarchical relationships between categories and subcategories, the improvement of integration accuracy is very promising. It shows that a hierarchical integration scheme is effective for Web catalog integration, and our EHCI approach consistently achieves improvements on real-world catalogs with SVM classifiers.

6. ACKNOWLEDGMENTS
This work was supported in part by the National Science Council of R.O.C. under grant NSC-94-2213-E-155-050.

7. REFERENCES
[1] R. Agrawal and R. Srikant. On integrating catalogs. Proc. WWW10, pages 603-612, May 2001.
[2] I.-X. Chen, J.-C. Ho, and C.-Z. Yang. An iterative approach for web catalog integration with support vector machines. Proc. AIRS'05, pages 703-708, Oct. 2005.
[3] S. Dumais and H. Chen. Hierarchical classification of web content. Proc. SIGIR'00, pages 256-263, Jul. 2000.
[4] S. Sarawagi, S. Chakrabarti, and S. Godbole. Cross-training: Learning probabilistic mappings between topics. Proc. SIGKDD'03, pages 177-186, Aug. 2003.
[5] A. Sun, E.-P. Lim, and W.-K. Ng. Performance measurement framework for hierarchical text classification. JASIST, 54(11):1014-1028, Jun. 2003.