WWW 2007 / Poster Paper Topic: Systems Mirror Site Maintenance Based on Evolution Associations of Web Directories Ling Chen L3S/University of Hannover Appelstr. 9a 30167 Hannover, Germany Sourav Bhowmick SCE, Nanyang Technological University Singapore, 639798 Wolfgang Nejdl L3S/University of Hannover Appelstr. 9a 30167 Hannover, Germany lchen@l3s.de ABSTRACT assourav@ntu.edu.sg nejdl@l3s.de Mirroring Web sites is a well-known technique commonly used in the Web community. A mirror site should b e updated frequently to ensure that it reflects the content of the original site. Existing mirroring tools apply page-level strategies to check each page of a site, which is inefficient and exp ensive. In this pap er, we prop ose a novel site-level mirror maintenance strategy. Our approach studies the evolution of Web directory structures and mines association rules b etween ancestor-descendant Web directories. Discovered rules indicate the evolution correlations b etween Web directories. Thus, when maintaining the mirror of a Web site (directory), we can optimally skip sub directories which are negatively correlated with it in undergoing significant changes. The preliminary exp erimental results show that our approach improves the efficiency of the mirror maintenance process significantly while sacrificing slightly in keeping the "freshness" of the mirrors. Categories and Sub ject Descriptors: H.2.8 [Database Management]: Database Applications[data mining] General Terms: Measurement, Performance, Exp erimentation. Keywords: Web evolution, evolution correlation, mirror maintenance. A mirror site is a replica of an original Web site and is often located in a different place throughout the Internet. There are lots of reasons for Web mirroring such as load reduction, b etter availability, and faster access etc [3]. Web mirroring can b e generally classified as internal mirroring and external mirroring. The former refers to the situation that b oth the original site and mirror sites are set up and maintained by the same organization, such as www.akamai.com. The latter means that the original site and mirror sites are autonomous. For example, many Linux user groups in different countries, such as www.linuxforum.net in China, mirror the site www.samba.org to provide quick access for Linux fans in their own countries. A mirror site should b e up dated regularly with resp ect to the original site. For internal mirroring, this process can b e p erformed efficiently by synchronizing mirror sites only if the original site evolves. However, for external mirroring, Copyright is held by the author/owner(s). WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. 1. INTRODUCTION the server holding the mirror site has to scan the original site frequently to detect the changes. To maintain the freshness of mirror sites, www.samba.org suggests external mirror sites scan it once p er day. Obviously, daily scan of the whole original site is not a light workload for the server holding the mirror site, esp ecially when the server holds more than one mirror sites. Hence, optimized mirror maintenance strategies are needed so that mirroring severs do not need to scan the whole original site every day. In this pap er, we prop ose a site-level mirror maintenance strategy based on the historical evolution of the original Web site. Since existing Web mirroring tools, like "rsync" [1], usually mirror a site according to its Web site directory tree, we study the evolutionary characteristics of Web site directory structure. Particularly, we discover Negative Evolution Association Rules (N E AR) of ancestor-descendant sub directories from the evolution history of the site. A NEAR sa sb · · · sk Žsk+1 , where sa through sk+1 is a sequence of Web directories with ancestor-descendant relationships, indicates that sa sb · · · sk frequently change together while sk+1 rarely undergoes significant changes together with its ancestor directories sa sb · · · sk . Based on discovered rules, when maintaining the mirror of a changed target directory (e.g., sa ), its sub directories which frequently change together with it (e.g., sb · · · sk ) should b e maintained together. However, the mirror maintenance process can b e optimized by skipping sub directories which have negative evolution associations with it (e.g., sk+1 ). 2. NEAR MINING The file system of a Web site can b e represented as a directory tree S = N, E , where N is the set of nodes corresp onding to files or directories in the server, E is the set of edges representing the consisting-of relationship b etween corresp onding directories and files. Accordingly, each Web sub directory can b e represented as a subtree. A directory subtree sa is an ancestor of another subtree sb , denoted as sa sb , if there is a path from the root of sa to the root of sb . Let T = t1 , t2 , . . . , tn be a sequence of time p oints with a particular time granularity. Given a directory tree S of some original Web site, we study the sequence of historical versions of S on T , which is denoted as S t1 , S t2 , . . . , S tn . Before defining NEARs, we first define three evolution metrics, Degree of Change, Frequency of Change and Correlation of Change. Given two versions of a tree structure, S ti and S ti+1 , let D(S ti , S ti+1 ) b e the edit distance [5] b etween the two versions. The Degree of Change of the tree, denoted t ( ti S i+1 as DoC (S ti , S ti+1 ), is DSSi ,S ti+1 |) , where |S ti S ti+1 | is the |t 1297 WWW 2007 / Poster Paper BR OP OR CR Topic: Systems 1.2 1 0.8 0.6 0.4 0.2 0 1 2 BR OP OR CR 1.2 1 0.8 0.6 0.4 1.2 1 0.8 0.6 0.4 BR OP OR CR 0.2 0.2 0 3 4 5 6 7 8 9 10 20 21 22 23 24 25 26 27 28 29 0 mi sc c om edu or g gov net v er s ion distance (a) variation of distance Siz e of training data (b) variation of training data size (c) performance on domains Figure 1: Performance of Mirror Maintenance. size of the merged tree of S ti and S ti+1 . The value of DoC which allows to skip mirroring a particular sub directory, disranges from 0 to 1. The greater the value of DoC, the more covered NEARs can b e used together with existing tools. significantly the tree changes. Basically, when p erforming a top-down traversal on the tarGiven a sequence of tree versions S t1 , S t2 , . . . , S tn , get directory tree, our optimized mirror maintenance approach either skips a sub directory based on NEARs or mirthe Frequency of Change of a set of subtrees X = {s1 , s2 , rors pages using existing tools. The details of prop osed mir. . . , sm }, with resp ect to some DoC threshold , denoted n -1 m Gj ror maintenance approach can b e referred to [2]. =1 as F oC (X ), is jn-1 , wher e Gj = i=1 Gji and Gji = 1 tj tj + 1 3. EXPERIMENTAL RESULTS , if DoC (si , si ) We evaluate the p erformance of our optimized mirror main. That is, F oC of a set of 0, otherwise tenance approach based on real data collected by A. Ntoulas subtrees is the fraction of versions where the set of subet al. [4], who downloaded pages from 154 "p opular" Web trees undergo significant changes together. The greater the sites every week from Octob er 2002 until Octob er 2003, for a value of FoC, the more frequently the set of subtrees change total of 51 weeks. The following three exp eriments are contogether. We adopt the -coefficient to define the Correducted and the resp ective results are rep orted in Figure 1 lation of Change b etween two subtree sets. The C oC b e(a), (b) and (c). (i) NEARs mined from the first 20 versions tween two subtree sets X and Y (X Y = ), with reare used to maintain mirrors and the quality of the mainsp ect to some DoC threshold , denoted as C oC (X, Y ), tained version is evaluated with resp ect to the (20 + k)th F oC (X Y )-F oC (X )F oC (Y ) . According to is (k 1) version. We vary the value of k to study the temp oF oC (X )(1-F oC (X ))F oC (Y )(1-F oC (Y )) ral validity of discovered rules. Four evaluation metrics are the definition of -coefficient, if C oC (X, Y ) is greater than used: Bypass Ratio (B R), Overal l Precision (OP ), Overzero, X and Y are p ositively correlated in undergoing sigal l Recal l (OR), and Change Recal l (C R). Bypass Ratio is nificant changes. Otherwise, they are negatively correlated. the fraction of Web pages skipp ed by our approach. OverBased on the evolution metrics describ ed ab ove, Negative al l Precision and Overal l Recal l measure the accuracy of our Evolution Patterns can b e defined as follows. Given DoC approach with resp ect to the whole test version. Change Rethreshold , F oC threshold , and C oC threshold (0 cal l measures the accuracy of our approach with resp ect to , 1, 0), P = X, Y , where X = s1 , s2 , . . . , sk , changes to the test version. As a good mirror maintenance Y = sk+1 , and si si+1 (1 i k), is a Negative Evoapproach, values of all the four metrics should b e as high lution Pattern (N E P ) if (i) F oC (X ) ; (ii) F oC (X as p ossible. We observed that generally, our approach skips Ys) < ; (iii) C oC (X, Y ) - . Given a NEP s1 s2 · · · sk around one third pages of a site and loses slight "freshness". k+1 , valid NEAR s1 s2 · · · sk Žsk+1 can b e derived if The p erformance does not deteriorate with the increase of k. its confidence is no less than some sp ecified threshold. Sim(ii) NEARs are mined from the first k versions and the subilar to traditional association rule mining, the confidence of F o C ( X ) -F o C ( X Y ) sequent version is used as the test data. The results show the NEAR can b e computed as . F o C (X ) that accuracy increases while efficiency decreases slightly. (iii) We evaluate the effectiveness of our approach on sites 2.1 Overview of Algorithm from different domains, such as .com, .g ov , .or g etc. The Given a sequence of historical tree versions and the set of results indicate our approach is applicable to sites from difrelated thresholds, the mining of NEARs can b e decomp osed ferent domains. into the following two phases. Phase I: Historical change organization. In this 4. CONCLUSIONS phase, we first use an algorithm WD-Diff, which modifies the In this pap er, we prop osed an optimized mirror maintenance approach based on NEARs discovered from the evoexisting algorithm X-Diff [5], to efficiently detect changes b elution history of a Web site. Promising exp erimental results tween each two successive Web directory tree versions. We showed the effectiveness of this approach. then organize detected changes into a sp ecial data structure, 5. REFERENCES called General Delta Tree (GDT ), which not only records [1] Rsync. In http://samba.anu.edu.au/rsync/. the change information of subtrees but also preserves the [2] Web mirroring pro ject. In http://www.l3s.de/~ ancestor-descendant relationships b etween subtrees. lchen/mirror.pdf. [3] K. Bharat and A. Z. Bro der. Mirror, mirror on the web: A study Phase II: Rule generation. We discover NEARs from of host pairs with replicated content. Computer Networks, 1999. the GDT with resp ect to the set of given thresholds. A [4] A. Ntoulas, J. Cho, and C. Olston. What's new on the web?: the divide-and-conquer strategy is employed to discover NEARs evolution of the web from a search engine p ersp ective. In starting with different Web directory subtrees. The details WWW, 2004. [5] Y. Wang, D. J. DeWitt, and J.-Y. Cai. X-diff: An effective of the algorithm can b e referred to [2]. change detection algorithm for xml do cuments.. In ICDE, 2003. Since most of existing mirroring tools offer the option 1298