Do Not Crawl in the DUST: Different URLs with Similar Text

Ziv Bar-Yossef
Dept. of Electrical Engineering, Technion, Haifa 32000, Israel
Google Haifa Engineering Center, Israel
zivby@ee.technion.ac.il

Idit Keidar
Dept. of Electrical Engineering, Technion, Haifa 32000, Israel
idish@ee.technion.ac.il

Uri Schonfeld
Dept. of Computer Science, University of California, Los Angeles, CA 90095, USA
shuri@shuri.org

ABSTRACT
We consider the problem of dust: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering dust; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines dust effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from information about dust to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.

Categories and Subject Descriptors: H.3.3: Information Search and Retrieval.
General Terms: Algorithms.
Keywords: search engines, crawling, URL normalization, duplicate detection, anti-aliasing.

A short version of this paper appeared as a poster at WWW2006 [21]. A full version, including proofs and more experimental results, is available as a technical report [2]. Supported by the European Commission Marie Curie International Re-integration Grant and by Bank Hapoalim. This work was done while the author was at the Technion. WWW 2007, May 8-12, 2007, Banff, Alberta, Canada.

1. INTRODUCTION
The DUST problem. The web is abundant with dust: Different URLs with Similar Text [17, 10, 20, 18]. For example, the URLs http://google.com/news and http://news.google.com return similar content. Adding a trailing slash or /index.html to either returns the same result. Many web sites define links, redirections, or aliases, such as allowing the tilde symbol ~ to replace a string like /people. A single web server often has multiple DNS names, and any can be typed in the URL. As the above examples illustrate, dust is typically not random, but rather stems from some general rules, which we call DUST rules, such as replacing "~" with "/people", or omitting "/index.html" at the end of the URL.
dust rules are typically not universal. Many are artifacts of a particular web server implementation. For example, URLs of dynamically generated pages often include parameters; which parameters impact the page's content is up to the software that generates the pages. Some sites use their own conventions; for example, a forum site we studied allows accessing story number "num" both via the URL http://domain/story?id=num and via http://domain/story_num. Our study of the CNN web site has discovered that URLs of the form http://cnn.com/money/whatever get redirected to http://money.cnn.com/whatever. In this paper, we focus on mining dust rules within a given web site. We are not aware of any previous work tackling this problem.
Standard techniques for avoiding dust employ universal rules, such as adding http:// or removing a trailing slash, in order to obtain some level of canonization. Additional dust is found by comparing document sketches. However, this is conducted on a page by page basis, and all the pages must be fetched in order to employ this technique. By knowing dust rules, one can reduce the overhead of this process. Knowledge about dust rules can be valuable for search engines for additional reasons: dust rules allow for a canonical URL representation, thereby reducing overhead in crawling, indexing, and caching [17, 10], and increasing the accuracy of page metrics, like PageRank. For example, in one crawl we examined, the number of URLs fetched would have been reduced by 26%.
We focus on URLs with similar contents rather than identical ones, since different versions of the same document are not always identical; they tend to differ in insignificant ways, e.g., counters, dates, and advertisements. Likewise, some URL parameters impact only the way pages are displayed (fonts, image sizes, etc.) without altering their contents.
Detecting DUST from a URL list. Contrary to initial intuition, we show that it is possible to discover likely dust rules without fetching a single web page. We present an algorithm, DustBuster, which discovers such likely rules from a list of URLs. Such a URL list can be obtained from many sources, including a previous crawl or web server logs. (Increasingly many web server logs are available nowadays to search engines via protocols like Google Sitemaps [13].) The rules are then verified (or refuted) by sampling a small number of actual web pages.
At first glance, it is not clear that a URL list can provide reliable information regarding dust, as it does not include actual page contents. We show, however, how to use a URL list to discover two types of dust rules: substring substitutions, which are similar to the "replace" function in editors, and parameter substitutions. A substring substitution rule α → β replaces an occurrence of the string α in a URL by the string β. A parameter substitution rule replaces the value of a parameter in a URL by some default value. Thanks to the standard syntax of parameter usage in URLs, detecting parameter substitution rules is fairly straightforward. Most of our work therefore focuses on substring substitution rules.
DustBuster uses three heuristics, which together are very effective at detecting likely dust rules and distinguishing them from invalid ones. The first heuristic leverages the observation that if a rule is common in a web site, then we can expect to find in the URL list multiple examples of pages accessed both ways. For example, in the site where story?id= can be replaced by story_, we are likely to see many different URL pairs that differ only in this substring; we say that such a pair of URLs is an instance of the rule "story?id=" → "story_". The set of all instances of a rule is called the rule's support. Our first attempt to uncover dust is therefore to seek rules that have large support.
Nevertheless, some rules that have large support are not dust rules. For example, in one site we found instances such as http://movie-forum.com/story_100 and http://politics-forum.com/story_100, which support the invalid rule "movie-forum" → "politics-forum". Another example is "1" → "2", which emanates from instances like pic-1.jpg
and pic-2.jpg, story_1 and story_2, and lect1 and lect2, none of which are dust. Our second and third heuristics address the challenge of eliminating such invalid rules. The second heuristic is based on the observation that invalid rules tend to flock together: for example, in most instances of "1" → "2", one could also replace the "1" by other digits. We therefore ignore rules that come in large groups.
Further eliminating invalid rules requires calculating the fraction of dust in the support of each rule. How could this be done without inspecting page content? Our third heuristic uses cues from the URL list to guess which instances are likely to be dust and which are not. In case the URL list is produced from a previous crawl, we typically have document sketches [7] available for each URL in the list. These sketches can be used to estimate the similarity between documents and thus to eliminate rules whose support does not contain sufficiently many dust pairs. In case the URL list is produced from web server logs, document sketches are not available. The only cue about the contents of URLs in these logs is the sizes of these contents. We thus use the size field from the log to filter out instances (URL pairs) that have "mismatching" sizes. The difficulty with size-based filtering is that the size of a dynamic page can vary dramatically, e.g., when many users comment on an interesting story or when a web page is personalized. To account for such variability, we compare the ranges of sizes seen in all accesses to each page. When the size ranges of two URLs do not overlap, they are unlikely to be dust.
Having discovered likely dust rules, another challenge that needs to be addressed is eliminating redundant ones. For example, the rule "http://site-name/story?id=" → "http://site-name/story_" will be discovered, along with many consisting of substrings thereof, e.g., "?id=" → "_". However, before performing validations, it is not obvious which rule should be kept in such situations: the latter could be either valid in all cases, or invalid outside the context of the former. We are able to use support information from the URL list to remove many redundant likely dust rules. We remove additional redundancies after performing some validations, and thus compile a succinct list of rules.
Canonization. Once the correct dust rules are discovered, we exploit them for URL canonization. While the canonization problem is NP-hard in general, we have devised an efficient canonization algorithm that typically succeeds in transforming URLs to a site-specific canonical form.
Experimental results. We experiment with DustBuster on four web sites with very different characteristics. Two of our experiments use web server logs, and two use crawl outputs. We find that DustBuster can discover rules very effectively from moderate sized URL lists, with as few as 20,000 entries. Limited sampling is then used in order to validate or refute each rule. Our experiments show that up to 90% of the top ten rules discovered by DustBuster prior to the validation phase are found to be valid, and in most sites 70% of the top 100 rules are valid. Furthermore, the dust rules discovered by DustBuster may account for 47% of the dust in a web site, and using DustBuster can reduce a crawl by up to 26%.
Roadmap. The rest of this paper is organized as follows. Section 2 reviews related work. We formally define the dust detection and canonization problems in Section 3.
Section 4 presents the basic heuristics our algorithm uses. DustBuster and the canonization algorithm appear in Section 5. Section 6 presents experimental results. We end with some concluding remarks in Section 7.

2. RELATED WORK
The standard way of dealing with dust is using document sketches [6, 11, 7, 22, 9, 16, 15], which are short summaries used to determine similarities among documents. To compute such a sketch, however, one needs to fetch and inspect the whole document. Our approach cannot replace document sketches, since it does not find dust across sites or dust that does not stem from rules. However, it is desirable to use our approach to complement document sketches in order to reduce the overhead of collecting redundant data. Moreover, since document sketches do not give rules, they cannot be used for URL canonization, which is important, e.g., to improve the accuracy of page popularity metrics.
One common source of dust is mirroring. A number of previous works have dealt with automatic detection of mirror sites on the web [3, 4, 8, 19]. We deal with the complementary problem of detecting dust within one site. A major challenge that site-specific dust detection must address is efficiently discovering prospective rules out of a daunting number of possibilities (all possible substring substitutions). In contrast, mirror detection is given pairs of sites to compare, and only needs to determine whether they are mirrors.
Our problem may seem similar to mining association rules [1], yet the two problems differ substantially. Whereas the input of such mining algorithms consists of complete lists of items that belong together, our input includes individual items from different lists. The absence of complete lists renders techniques used therein inapplicable to our problem.
One way to view our work is as producing an Abstract Rewrite System (ARS) [5] for URL canonization via dust rules. For ease of readability, we have chosen not to adopt the ARS terminology in this paper.

3. PROBLEM DEFINITION
URLs. We view URLs as strings over an alphabet Σ of tokens. Tokens are either alphanumeric strings or non-alphanumeric characters. In addition, we require every URL to start with the special token ^ and to end with the special token $ (^ and $ are not included in Σ). A URL u is valid if its domain name resolves to a valid IP address and its contents can be fetched by accessing the corresponding web server (the http return code is not in the 4xx or 5xx series). If u is valid, we denote by doc(u) the returned document.
DUST. Two valid URLs u1, u2 are called dust if their corresponding documents, doc(u1) and doc(u2), are "similar". To this end, any method of measuring the similarity between two documents can be used. For our implementation and experiments, we use the popular shingling resemblance measure due to Broder et al. [7].
DUST rules. We seek general rules for detecting when two URLs are dust. A dust rule φ is a relation over the space of URLs; φ may be many-to-many. Every pair of URLs belonging to φ is called an instance of φ. The support of φ, denoted support(φ), is the collection of all its instances. Our algorithm focuses primarily on detecting substring substitution rules. A substring substitution rule α → β is specified by an ordered pair of strings (α, β) over the token alphabet Σ. (In addition, we allow these strings to simultaneously start with the token ^ and/or to simultaneously end with the token $.) Instances of substring substitution rules are defined as follows:

Definition 3.1 (Instance of a rule). A pair (u1, u2) of URLs is an instance of a substring substitution rule α → β if there exist strings p, s such that u1 = pαs and u2 = pβs.

For example, the pair of URLs http://www.site.com/index.html and http://www.site.com is an instance of the dust rule "/index.html$" → "$".
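To make the definition concrete, the following character-level toy (our illustration, not code from the paper) checks whether a URL pair is an instance of a given substring substitution rule by trying every occurrence of α as a possible split point; the paper itself works over tokens rather than raw characters.

# A character-level toy (our illustration, not code from the paper) of
# Definition 3.1: (u1, u2) is an instance of the substring substitution rule
# alpha -> beta iff some envelope (p, s) satisfies u1 = p+alpha+s and
# u2 = p+beta+s. URLs are assumed to already carry the ^ and $ delimiters.

def is_instance(u1, u2, alpha, beta):
    start = 0
    while True:
        i = u1.find(alpha, start)
        if i < 0:
            return False
        p, s = u1[:i], u1[i + len(alpha):]
        if p + beta + s == u2:
            return True
        start = i + 1

print(is_instance("^http://www.site.com/index.html$", "^http://www.site.com$",
                  "/index.html$", "$"))  # True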
The DUST problem. Our goal is to detect dust and eliminate redundancies in a collection of URLs belonging to a given web site S. This is solved by a combination of two algorithms, one that discovers dust rules from a URL list, and another that uses them in order to transform URLs to their canonical form. A URL list is a list of records consisting of: (1) a URL; (2) the http return code; (3) the size of the returned document; and (4) the document's sketch. The last two fields are optional. This type of list can be obtained from web server logs or from a previous crawl. The URL list is a (non-random) sample of the URLs that belong to the web site.
For a given web site S, we denote by US the set of URLs that belong to S. A dust rule φ is said to be valid w.r.t. S if for each u1 ∈ US and for each u2 s.t. (u1, u2) is an instance of φ, u2 ∈ US and (u1, u2) is dust. A DUST rule detection algorithm is given a list L of URLs from a web site S and outputs an ordered list of dust rules. The algorithm may also fetch pages (which may or may not appear in the URL list). The ranking of rules represents the confidence of the algorithm in the validity of the rules.
Canonization. Let R be an ordered list of dust rules that have been found to be valid w.r.t. some web site S. We would like to define what is a canonization of the URLs in US using the rules in R. To this end, we assume that there is some standard way of applying every rule φ ∈ R, so that applying φ to any URL u ∈ US results in a URL φ(u) that also belongs to US (this assumption holds true in all the data sets we experimented with). The rules in R naturally induce a labeled graph GR on US: there is an edge from u1 to u2 labeled by φ if and only if (u1, u2) is an instance of φ. Note that adjacent URLs in GR correspond to similar documents. For the purpose of canonization, we assume that document dissimilarity respects at least a weak form of the triangle inequality, so that URLs connected by short paths in GR are similar URLs too. Thus, if GR has a bounded diameter (as it does in the data sets we encountered), then every two URLs connected by a path are similar. A canonization that maps every URL u to some URL that is reachable from u thus makes sense, because the original URL and its canonical form are guaranteed to be dust.
A set of canonical URLs is a subset CUS ⊆ US that is reachable from every URL in US (equivalently, CUS is a dominating set in the transitive closure of GR). A canonization is any mapping C : US → CUS that maps every URL u ∈ US to some canonical URL C(u), which is reachable from u by a directed path. Our goal is to find a small set of canonical URLs and a corresponding canonization, which is efficiently computable. Finding the minimum size set of canonical URLs is intractable, due to the NP-hardness of the minimum dominating set problem (cf. [12]). Fortunately, our empirical study indicates that for typical collections of dust rules found in web sites, efficient canonization is possible.
Thus, although we cannot design an algorithm that always obtains an optimal canonization, we will seek one that maps URLs to a small set of canonical URLs, and always terminates in polynomial time.
Metrics. We use three measures to evaluate dust detection and canonization. The first measure is precision: the fraction of valid rules among the rules reported by the dust detection algorithm. The second, and most important, measure is the discovered redundancy: the amount of redundancy eliminated in a crawl. It is defined as the difference between the number of unique URLs in the crawl before and after canonization, divided by the former. The third measure is coverage: given a large collection of URLs that includes dust, what percentage of the duplicate URLs is detected. The number of duplicate URLs in a given URL list is defined as the difference between the number of unique URLs and the number of unique document sketches. Since we do not have access to the entire web site, we measure the achieved coverage within the URL list. We count the number of duplicate URLs in the list before and after canonization, and the difference between them divided by the former is the coverage.
One of the standard measures of information retrieval is recall. In our case, recall would measure what percent of all correct dust rules is discovered. However, it is clearly impossible to construct a complete list of all valid rules to compare against, and therefore, recall is not directly measurable in our case.

4. BASIC HEURISTICS
Our algorithm for extracting likely string substitution rules from the URL list uses three heuristics: the large support heuristic, the small buckets heuristic, and the similarity likeliness heuristic. Our empirical results provide evidence that these heuristics are effective on web sites of varying scopes and characteristics.
Large support heuristic. The support of a valid DUST rule is large. For example, if a rule "index.html$" → "$" is valid, we should expect many instances witnessing to this effect, e.g., www.site.com/d1/index.html and www.site.com/d1/, and www.site.com/d3/index.html and www.site.com/d3/. We would thus like to discover rules of large support. Note that valid rules of small support are not very interesting anyway, because the savings gained by applying them are negligible.
Finding the support of a rule on the web site requires knowing all the URLs associated with the site. Since the only data at our disposal is the URL list, which is unlikely to be complete, the best we can do is compute the support of rules in this URL list. That is, for each rule α → β, we can find the number of instances (u1, u2) of α → β for which both u1 and u2 appear in the URL list. We call these instances the support of α → β in the URL list and denote them by supportL(α → β). If the URL list is long enough, we expect this support to be representative of the overall support of the rule on the site. Note that since |supportL(α → β)| = |supportL(β → α)| for every α and β, our algorithm cannot know whether both rules are valid or just one of them is. It therefore outputs the pair (α, β) instead. Finding which of the two directions is valid is left to the final phase of DustBuster.
Given a URL list L, how do we compute the size of the support of every possible rule? To this end, we introduce a new characterization of the support size. Consider a substring α of a URL u = pαs.
We call the pair (p, s) the envelope of α in u. For example, if u = http://www.site.com/index.html and α = "index", then the envelope of α in u is the pair of strings "^http://www.site.com/" and ".html$". By Definition 3.1, a pair of URLs (u1, u2) is an instance of a substitution rule α → β if and only if there exists a shared envelope (p, s) so that u1 = pαs and u2 = pβs. For a string α, denote by EL(α) the set of envelopes of α in URLs that satisfy the following conditions: (1) these URLs appear in the URL list L; and (2) the URLs have α as a substring. If α occurs in a URL u several times, then u contributes as many envelopes to EL(α) as the number of occurrences of α in u. The following theorem, proven in the full draft of the paper, shows that under certain conditions, |EL(α) ∩ EL(β)| equals |supportL(α → β)|. As we shall see later, this gives rise to an efficient procedure for computing support size, since we can compute the envelope sets of each substring separately, and then by join and sort operations find the pairs of substrings whose envelope sets have large intersections.

Theorem 4.1. Let α ≠ β be two non-empty and non-semiperiodic strings. Then, |supportL(α → β)| = |EL(α) ∩ EL(β)|.

A string α is semiperiodic if it can be written as α = γ^k γ' for some string γ, where |α| > |γ|, k ≥ 1, γ^k is the string obtained by concatenating k copies of the string γ, and γ' is a (possibly empty) prefix of γ [14]. If α is not semiperiodic, it is non-semiperiodic. For example, the strings "/////" and "a.a.a" are semiperiodic, while the strings "a.a.b" and "%////" are not.
Unfortunately, the theorem does not hold for rules α → β where one of the strings is either semiperiodic or empty. For example, let α be the semiperiodic string "a.a" and β = "a". Let u1 = http://a.a.a/ and let u2 = http://a.a/. There are two ways in which we can substitute α with β in u1 and obtain u2. Similarly, let α be "a." and β be the empty string. There are two ways in which we can substitute α with β in u1 to obtain u2. This means that in each example the instance (u1, u2) is associated with two envelopes in EL(α) ∩ EL(β), and not just one. Thus, when α or β are semiperiodic or empty, |EL(α) ∩ EL(β)| can overestimate the support size. On the other hand, such examples are quite rare, and in practice we expect a minimal gap between |EL(α) ∩ EL(β)| and the support size.
Small buckets heuristic. While most valid dust rules have large support, the converse is not necessarily true: there can be rules with large support that are not valid. One class of such rules is substitutions among numbered items, e.g., (lect1.ps, lect2.ps), (lect1.ps, lect3.ps), and so on. We would like to somehow filter out the rules with "misleading" support. The support for a rule can be thought of as a collection of recommendations, where each envelope (p, s) ∈ EL(α) ∩ EL(β) represents a single recommendation. Consider an envelope (p, s) that is willing to give a recommendation to any rule, for example "^http://" → "^". Naturally its recommendations lose their value. This type of support only leads to many invalid rules being considered. This is the intuitive motivation for the following heuristic to separate the valid dust rules from invalid ones. If an envelope (p, s) belongs to many envelope sets EL(α1), EL(α2), ..., EL(αk), then it contributes to the intersections EL(αi) ∩ EL(αj) for all 1 ≤ i ≠ j ≤ k. The substrings α1, α2, ..., αk constitute what we call a bucket. That is, for a given envelope (p, s), bucket(p, s) is the set of all substrings α such that pαs appears in L. An envelope pertaining to a large bucket supports many rules. The small buckets heuristic states that much of the support of valid DUST substring substitution rules is likely to belong to small buckets.
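This envelope characterization suggests a simple way to tally support sizes: record, for every substring, the envelopes in which it occurs, and count shared envelopes per substring pair while skipping overflowing buckets. The sketch below is a toy in-memory illustration of that idea under simplifying assumptions, not the authors' implementation (which, as described in Section 5, works by sorting large text files); the names tokenize, S_MAX and T_OVERFLOW are ours, with the latter two standing in for the paper's parameters S and T.

import re
from collections import defaultdict
from itertools import combinations

S_MAX = 4        # maximum substring length considered, in tokens
T_OVERFLOW = 6   # buckets larger than this are ignored as overflowing

def tokenize(url):
    # alphanumeric runs or single non-alphanumeric characters, wrapped in ^ and $
    return ("^", *re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url), "$")

def buckets(urls):
    """Map each envelope (prefix, suffix) to the set of substrings seen inside it."""
    table = defaultdict(set)
    for url in urls:
        toks = tokenize(url)
        for length in range(0, S_MAX + 1):
            for i in range(len(toks) - length + 1):
                alpha = toks[i:i + length]
                table[(toks[:i], toks[i + length:])].add(alpha)
    return table

def support_sizes(urls):
    """Estimate |E_L(alpha) ∩ E_L(beta)| for substring pairs via shared envelopes."""
    support = defaultdict(int)
    for bucket in buckets(urls).values():
        if len(bucket) < 2 or len(bucket) > T_OVERFLOW:
            continue  # singletons witness no rule; overflowing buckets are dropped
        for a, b in combinations(sorted(bucket), 2):
            support[(a, b)] += 1  # one shared envelope = one witnessed instance
    return support

urls = ["http://site.com/d1/index.html", "http://site.com/d1/",
        "http://site.com/d2/index.html", "http://site.com/d2/"]
top = sorted(support_sizes(urls).items(), key=lambda kv: -kv[1])[:3]
for (a, b), n in top:
    print("".join(a) or "(empty)", "<->", "".join(b), ": support", n)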
Similarity likeliness heuristic. The above two heuristics use the URL strings alone to detect dust. In order to raise the precision of the algorithm, we use a third heuristic that better captures the "similarity dimension", by providing hints as to which instances are likely to be similar: the likely similar support of a valid DUST rule is large. We show below that using cues from the URL list we can determine which URL pairs in the support of a rule are likely to have similar content, i.e., are likely similar, and which are not. The likely similar support, rather than the complete support, is used to determine whether a rule is valid or not. For example, in a forum web site we examined, the URL list included two sets of URLs, http://politics.domain/story_num and http://movies.domain/story_num, with different numbers. The support of the invalid rule "http://politics.domain" → "http://movies.domain" was large, yet since the corresponding stories were very different, the likely similar support of the rule was found to be small.
How do we use the URL list to estimate similarity between documents? The simplest case is that the URL list includes a document sketch, such as the shingles of Broder et al. [7], for each URL. Such sketches are typically available when the URL list is the output of a previous crawl of the web site. When available, document sketches are used to indicate which URL pairs are likely similar. When the URL list is taken from web server logs, document sketches are not available. In this case we use document sizes (document sizes are usually given by web server software). We determine two documents to be similar if their sizes "match". Size matching, however, turns out to be quite intricate, because the same document may have very different sizes when inspected at different points of time or by different users. This is especially true when dealing with forums or blogging web sites. Therefore, if two URLs have different "size" values in the URL list, we cannot immediately infer that these URLs are not dust. Instead, for each unique URL, we track all its occurrences in the URL list, and keep the minimum and the maximum size values encountered. We denote the interval between these two numbers by Iu. A pair of URLs, u1 and u2, in the support are considered likely similar if the intervals Iu1 and Iu2 overlap. Our experiments show that this heuristic is very effective in improving the precision of our algorithm, often increasing precision by a factor of two.
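When only web server logs are available, the likely-similar test therefore reduces to interval overlap on observed content sizes. The following small sketch (ours, not the paper's code, with hypothetical example URLs) shows the idea.

# Each URL keeps the [min, max] range of content sizes seen across its log
# entries; two URLs are considered likely similar only if their ranges overlap.

def size_ranges(log_records):
    """log_records: iterable of (url, size) pairs taken from a web server log."""
    ranges = {}
    for url, size in log_records:
        lo, hi = ranges.get(url, (size, size))
        ranges[url] = (min(lo, size), max(hi, size))
    return ranges

def likely_similar(u1, u2, ranges):
    """True iff the observed size intervals I_u1 and I_u2 overlap."""
    lo1, hi1 = ranges[u1]
    lo2, hi2 = ranges[u2]
    return lo1 <= hi2 and lo2 <= hi1

log = [("http://x/story?id=10", 4200), ("http://x/story?id=10", 4350),
       ("http://x/story_10", 4300), ("http://x/story_99", 17100)]
r = size_ranges(log)
print(likely_similar("http://x/story?id=10", "http://x/story_10", r))  # True
print(likely_similar("http://x/story?id=10", "http://x/story_99", r))  # False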
5. DUSTBUSTER
In this section we describe DustBuster, our algorithm for discovering site-specific dust rules. DustBuster has four phases. The first phase uses the URL list alone to generate a short list of likely dust rules. The second phase removes redundancies from this list. The next phase generates likely parameter substitution rules and is discussed in the full draft of the paper. The last phase validates or refutes each of the rules in the list, by fetching a small sample of pages.

5.1 Detecting likely DUST rules
Our strategy for discovering likely dust rules is the following: we compute the size of the support of each rule that has at least one instance in the URL list, and output the rules whose support exceeds some threshold MS. Based on Theorem 4.1, we compute the size of the support of a rule α → β as the size of the set EL(α) ∩ EL(β). That is roughly what our algorithm does, but with three reservations: (1) Based on the small buckets heuristic, we avoid considering certain rules by ignoring large buckets in the computation of envelope set intersections. Buckets bigger than some threshold T are called overflowing, and all envelopes pertaining to them are denoted collectively by O and are not included in the envelope sets. (2) Based on the similarity likeliness heuristic, we filter support by estimating the likelihood of two documents being similar. We eliminate rules by filtering out instances whose associated documents are unlikely to be similar in content. That is, for a given instance u1 = pαs and u2 = pβs, the envelope (p, s) is disqualified if u1 and u2 are found unlikely to be similar using the tests introduced in Section 4. These techniques are provided as a boolean function LikelySimilar, which returns false only if the documents of the two input URLs are unlikely to be similar. The set of all disqualified envelopes is then denoted Dα,β. (3) In practice, substitutions of long substrings are rare. Hence, our algorithm considers substrings of length at most S, for some given parameter S. To conclude, our algorithm computes, for every two substrings α, β that appear in the URL list and whose length is at most S, the size of the set (EL(α) ∩ EL(β)) \ (O ∪ Dα,β).

1:  Function DetectLikelyRules(URLList L)
2:    create table ST (substring, prefix, suffix, size range/doc sketch)
3:    create table IT (substring1, substring2)
4:    create table RT (substring1, substring2, support size)
5:    for each record r ∈ L do
6:      for ℓ = 0 to S do
7:        for each substring α of r.url of length ℓ do
8:          p := prefix of r.url preceding α
9:          s := suffix of r.url succeeding α
10:         add (α, p, s, r.size range/r.doc sketch) to ST
11:   group tuples in ST into buckets by (prefix, suffix)
12:   for each bucket B do
13:     if (|B| = 1 or |B| > T) continue
14:     for each pair of distinct tuples t1, t2 ∈ B do
15:       if (LikelySimilar(t1, t2))
16:         add (t1.substring, t2.substring) to IT
17:   group tuples in IT into rule supports by (substring1, substring2)
18:   for each rule support R do
19:     t := first tuple in R
20:     add tuple (t.substring1, t.substring2, |R|) to RT
21:   sort RT by support size
22:   return all rules in RT whose support size is ≥ MS

Figure 1: Discovering likely DUST rules.

Our algorithm for discovering likely dust rules is described in Figure 1. The algorithm gets as input the URL list L. We assume the URL list has been pre-processed so that: (1) only unique URLs have been kept; (2) all the URLs have been tokenized and include the preceding ^ and succeeding $; (3) all records corresponding to errors (http return codes in the 4xx and 5xx series) have been filtered out; (4) for each URL, the corresponding document sketch or size range has been recorded. The algorithm uses three tables: a substring table ST, an instance table IT, and a rule table RT. Their attributes are listed in Figure 1. In principle, the tables can be stored in any database structure; our implementation uses text files.
In lines 5-10, the algorithm scans the URL list, and records all substrings of lengths 0 to S of the URLs in the list. For each such substring α, a tuple is added to the substring table ST. This tuple consists of the substring α, as well as its envelope (p, s), and either the URL's document sketch or its size range. The substrings are then grouped into buckets by their envelopes (line 11). Our implementation does this by sorting the file holding the ST table by the second and third attributes. Note that two substrings α, β appear in the bucket of (p, s) if and only if (p, s) ∈ EL(α) ∩ EL(β).
In lines 12-16, the algorithm enumerates the envelopes found. An envelope (p, s) contributes 1 to the intersection of the envelope sets EL(α) ∩ EL(β) for every α, β that appear in its bucket. Thus, if the bucket has only a single entry, we know (p, s) does not contribute any instance to any rule, and thus can be tossed away. If the bucket is overflowing (its size exceeds T), then (p, s) is also ignored (line 13). In lines 14-16, the algorithm enumerates all the pairs (α, β) of substrings that belong to the bucket of (p, s). If it seems likely that the documents associated with the URLs pαs and pβs are similar (through size or document sketch matching) (line 15), (α, β) is added to the instance table IT (line 16).
The number of times a pair (α, β) has been added to the instance table is exactly the size of the set (EL(α) ∩ EL(β)) \ (O ∪ Dα,β), which is our estimated support for the rules α → β and β → α. Hence, all that is left to do is compute these counts and sort the pairs by their count (lines 17-22). The algorithm's output is an ordered list of pairs, each pair representing two likely dust rules (one in each direction). Only rules whose support is large enough (bigger than MS) are kept in the list. The full draft of the paper contains the following analysis:

Proposition 5.1. Let n be the number of records in the URL list and let m be the average length (in tokens) of URLs in the URL list. The above algorithm runs in Õ(mnST²) time and uses Õ(mnST²) storage space, where Õ suppresses logarithmic factors.

Note that mn is the input length and S and T are usually small constants, independent of m and n. Hence, the algorithm's running time and space are only (quasi-)linear.
5.2 Eliminating redundant rules
By design, the output of the above algorithm includes many overlapping pairs. For example, when running on a forum site, our algorithm finds the pair (".co.il/story?id=", ".co.il/story_"), as well as numerous pairs of substrings of these, such as ("story?id=", "story_"). Note that every instance of the former pair is also an instance of the latter. We thus say that the former refines the latter. It is desirable to eliminate redundancies prior to attempting to validate the rules, in order to reduce the cost of validation. However, when one likely dust rule refines another, it is not obvious which should be kept. In some cases, the broader rule is always true, and all the rules that refine it are redundant. In other cases, the broader rule is only valid in specific contexts identified by the refining ones. In some cases, we can use information from the URL list in order to deduce that a pair is redundant.
When two pairs have exactly the same support in the URL list, this gives a strong indication that the latter, seemingly more general rule is valid only in the context specified by the former rule. We can thus eliminate the latter rule from the list. We next discuss in more detail the notion of refinement and show how to use it to eliminate redundant rules.

Definition 5.2 (Refinement). A rule φ1 refines a rule φ2 if support(φ1) ⊆ support(φ2).

That is, φ1 refines φ2 if every instance (u1, u2) of φ1 is also an instance of φ2. Testing refinement for substitution rules turns out to be easy, as captured in the following lemma (proven in the full draft of the paper):

Lemma 5.3. A substitution rule α' → β' refines a substitution rule α → β if and only if there exists an envelope (γ, δ) s.t. α' = γαδ and β' = γβδ.

The characterization given by the above lemma immediately yields an efficient algorithm for deciding whether a substitution rule α' → β' refines a substitution rule α → β: we simply check that α is a substring of α', replace α by β, and check whether the outcome is β'. If α has multiple occurrences in α', we check all of them. Note that our algorithm's input is a list of pairs rather than rules, where each pair represents two rules. When considering two pairs (α, β) and (α', β'), we check refinement in both directions.
Now, suppose a rule φ1 was found to refine a rule φ2. Then support(φ1) ⊆ support(φ2), implying that also supportL(φ1) ⊆ supportL(φ2). Hence, if |supportL(φ1)| = |supportL(φ2)|, then supportL(φ1) = supportL(φ2). If the URL list is sufficiently representative of the web site, this gives an indication that every instance of the refined rule φ2 that occurs on the web site is also an instance of the refinement φ1. We choose to keep only the refinement φ1, because it gives the full context of the substitution.
One small obstacle to using the above approach is the following. In the first phase of our algorithm, we do not compute the exact size of the support |supportL(α → β)|, but rather calculate the quantity |(EL(α) ∩ EL(β)) \ (O ∪ Dα,β)|. It is possible that α' → β' refines α → β and supportL(α' → β') = supportL(α → β), yet |(EL(α') ∩ EL(β')) \ (O ∪ Dα',β')| < |(EL(α) ∩ EL(β)) \ (O ∪ Dα,β)|. In practice, if the supports are identical, the difference between the calculated support sizes should be small. We thus eliminate the refined rule, even if its calculated support size is slightly above the calculated support size of the refining rule. However, to increase the effectiveness of this phase, we run the first phase of the algorithm twice, once with a lower overflow threshold Tlow and once with a higher overflow threshold Thigh. While the support calculated using the lower threshold is more effective in filtering out invalid rules, the support calculated using the higher threshold is more effective in eliminating redundant rules.
The algorithm for eliminating refined rules from the list appears in Figure 2. The algorithm gets as input a list of pairs, representing likely rules, sorted by their calculated support size. It uses three tunable parameters: (1) the maximum relative deficiency, MRD; (2) the maximum absolute deficiency, MAD; and (3) the maximum window size, MW. MRD and MAD determine the maximum difference allowed between the calculated support sizes of the refining rule and the refined rule, when we eliminate the refined rule. MW determines how far down the list we look for refinements.
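Lemma 5.3 reduces refinement checking to plain string manipulation. The sketch below (ours, character-level for brevity) tries every occurrence of α inside the refining rule's left-hand side as a candidate envelope (γ, δ).

# alpha_r -> beta_r refines alpha -> beta iff some occurrence of alpha inside
# alpha_r, with envelope (gamma, delta), turns into beta_r when alpha is
# replaced by beta. All occurrences of alpha are tried.

def refines(alpha_r, beta_r, alpha, beta):
    start = 0
    while True:
        i = alpha_r.find(alpha, start)
        if i < 0:
            return False
        gamma, delta = alpha_r[:i], alpha_r[i + len(alpha):]
        if gamma + beta + delta == beta_r:
            return True
        start = i + 1

print(refines(".co.il/story?id=", ".co.il/story_", "story?id=", "story_"))  # True
print(refines("?id=", "_", ".co.il/story?id=", ".co.il/story_"))            # False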
1:  Function EliminateRedundancies(pairs list R)
2:    for i = 1 to |R| do
3:      if (already eliminated R[i]) continue
4:      for j = 1 to min(MW, |R| - i) do
5:        if (R[i].size - R[i+j].size > max(MRD · R[i].size, MAD)) break
6:        if (R[i] refines R[i+j])
7:          eliminate R[i+j]
8:        else if (R[i+j] refines R[i]) then
9:          eliminate R[i]
10:         break
11:   return R

Figure 2: Eliminating redundant rules.

The algorithm scans the list from top to bottom. For each rule R[i] which has not been eliminated yet, the algorithm scans a "window" of rules below R[i]. Suppose s is the calculated size of the support of R[i]. The window size is chosen so that (1) it never exceeds MW (line 4); and (2) the difference between s and the calculated support size of the lowest rule in the window is at most the maximum between MRD · s and MAD (line 5). Now, if R[i] refines a rule R[j] in the window, the refined rule R[j] is eliminated (line 7), while if some rule R[j] in the window refines R[i], R[i] is eliminated (line 9). It is easy to verify that the running time of the algorithm is at most |R| · MW. In our experiments, this algorithm reduces the set of rules by over 90%.

5.3 Validating DUST rules
So far, the algorithm has generated likely rules from the URL list alone, without fetching even a single page from the web site. Fetching a small number of pages for validating or refuting these rules is necessary for two reasons. First, it can significantly improve the final precision of the algorithm. Second, the first two phases of DustBuster, which discover likely substring substitution rules, cannot distinguish between the two directions of a rule. The discovery of the pair (α, β) can represent both α → β and β → α. This does not mean that in reality both rules are valid or invalid simultaneously. It is often the case that only one of the directions is valid; for example, in many sites removing the substring index.html is always valid, whereas adding one is not. Only by attempting to fetch actual page contents can we tell which direction is valid, if any. The validation phase of DustBuster therefore fetches a small sample of web pages from the web site in order to check the validity of the rules generated in the previous phases.
The validation of a single rule is presented in Figure 3. The algorithm is given as input a likely rule R and a list of URLs from the web site, and decides whether the rule is valid. It uses two parameters: the validation count, N (how many samples to use in order to validate each rule), and the refutation threshold, ε (the minimum fraction of counterexamples to a rule required to declare the rule invalid).

1:  Function ValidateRule(R, L)
2:    positive := 0
3:    negative := 0
4:    while (positive < (1 - ε)N and negative < εN) do
5:      u := a random URL from L on which applying R results in a different URL
6:      v := outcome of application of R to u
7:      fetch u and v
8:      if (fetch u failed) continue
9:      if (fetch v failed or DocSketch(u) ≠ DocSketch(v))
10:       negative := negative + 1
11:     else
12:       positive := positive + 1
13:   if (negative ≥ εN)
14:     return false
15:   return true

Figure 3: Validating a single likely rule.

In order to determine whether a rule is valid, the algorithm repeatedly chooses random URLs from the given test URL list until hitting a URL on which applying the rule results in a different URL (line 5). The algorithm then applies the rule to the random URL u, resulting in a new URL v. The algorithm then fetches u and v. Using document sketches, such as the shingling technique of Broder et al. [7], the algorithm tests whether u and v are similar. If they are, the algorithm accounts for u as a positive example attesting to the validity of the rule. If v cannot be fetched, or they are not similar, then it is accounted as a negative example (lines 9-12). The testing is stopped when either the number of negative examples surpasses the refutation threshold or when the number of positive examples is large enough to guarantee the number of negative examples will not surpass the threshold.
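For concreteness, the following condensed sketch mirrors the loop of Figure 3; it is our paraphrase rather than the authors' code, with fetch and doc_sketch left as placeholders for a real HTTP client and a shingling-based document sketch, and EPS, N standing for the paper's ε and N.

import random

# As in line 5 of Figure 3, the rule is assumed to change at least one URL in
# the list; otherwise the sampling loop would not terminate.

EPS, N = 0.05, 100

def validate_rule(apply_rule, urls, fetch, doc_sketch):
    positive = negative = 0
    while positive < (1 - EPS) * N and negative < EPS * N:
        u = random.choice(urls)
        v = apply_rule(u)
        if v == u:
            continue                 # rule does not change this URL; resample
        doc_u = fetch(u)
        if doc_u is None:
            continue                 # fetching u failed; try another sample
        doc_v = fetch(v)
        if doc_v is None or doc_sketch(doc_u) != doc_sketch(doc_v):
            negative += 1            # counterexample to the rule
        else:
            positive += 1            # supporting example
    return negative < EPS * N        # valid unless the refutation threshold was hit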
One could ask why we declare a rule valid even if we find (a small number of) counterexamples to it. There are several reasons: (1) the document sketch comparison test sometimes makes mistakes, since it has an inherent false negative probability; (2) dynamic pages sometimes change significantly between successive probes (even if the probes are made at short intervals); and (3) the fetching of a URL may sometimes fail at some point in the middle, after part of the page has been fetched. By choosing a refutation threshold smaller than one, we can account for such situations.
Figure 4 shows the algorithm for validating a list of likely dust rules. Its input consists of a list of pairs representing likely substring transformations, (R[i].α, R[i].β), and a test URL list L. For a pair of substrings (α, β), we use the notation α > β to denote that either |α| > |β| or |α| = |β| and α succeeds β in the lexicographical order. In this case, we say that the rule α → β shrinks the URL. We give precedence to shrinking substitutions. Therefore, given a pair (α, β), if α > β, we first try to validate the rule α → β. If this rule is valid, we ignore the rule in the other direction since, even if this rule turns out to be valid as well, using it during canonization is only likely to create cycles, i.e., rules that can be applied an infinite number of times because they cancel out each other's changes. If the shrinking rule is invalid, though, we do attempt to validate the opposite direction, so as not to lose a valid rule. Whenever one of the directions of (α, β) is found to be valid, we remove from the list all pairs refining (α, β): once a broader rule is deemed valid, there is no longer a need for refinements thereof. By eliminating these rules prior to validating them, we reduce the number of pages we fetch. We assume that each pair in R is ordered so that R[i].α > R[i].β.

1:  Function Validate(rules list R, test URLList L)
2:    create an empty list of rules LR
3:    for i = 1 to |R| do
4:      for j = 1 to i - 1 do
5:        if (R[j] was not eliminated and R[i] refines R[j])
6:          eliminate R[i] from the list
7:          break
8:      if (R[i] was eliminated)
9:        continue
10:     if (ValidateRule(R[i].α → R[i].β, L))
11:       add R[i].α → R[i].β to LR
12:     else if (ValidateRule(R[i].β → R[i].α, L))
13:       add R[i].β → R[i].α to LR
14:     else
15:       eliminate R[i] from the list
16:   return LR

Figure 4: Validating likely rules.

The running time of the algorithm is at most O(|R|² + N|R|). Since the list is assumed to be rather short, this running time is manageable. The number of pages fetched is O(N|R|) in the worst case, but much smaller in practice, since we eliminate many redundant rules after validating rules they refine.

5.4 URL canonization
The canonization algorithm receives a URL u and a list of valid dust rules R. The idea behind this algorithm is very simple: it repeatedly applies to u all the rules in R, until there is an iteration in which u is unchanged, or until a predetermined maximum number of iterations has been reached. For details see the full draft of the paper.
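A minimal sketch of this fixed-point iteration, under the simplifying assumption that each valid rule is applied as a plain substring replacement in its shrinking direction, could look as follows (illustration only; the constant and example URL are ours, with MAX_ITER mirroring the cap of 10 iterations used in the experiments).

MAX_ITER = 10

def canonize(url, rules):
    """rules: list of (alpha, beta) substring substitutions found to be valid."""
    for _ in range(MAX_ITER):
        previous = url
        for alpha, beta in rules:
            url = url.replace(alpha, beta)
        if url == previous:
            break                    # fixed point reached: no rule changed the URL
    return url

rules = [("/index.html$", "$"), ("story?id=", "story_")]
print(canonize("^http://site.com/story?id=42/index.html$", rules))
# prints: ^http://site.com/story_42$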
As the general canonization problem is hard, we cannot expect this polynomial time algorithm to always produce a minimal canonization. Nevertheless, our empirical study shows that the savings obtained using this algorithm are high. We believe that this common case success stems from two features. First, our policy of choosing shrinking rules whenever possible typically eliminates cycles. Second, our elimination of refinements of valid rules leaves a small set of rules, most of which do not affect each other.

6. EXPERIMENTAL RESULTS
Experiment setup. We experiment with DustBuster on four web sites: a dynamic forum site, an academic site (www.ee.technion.ac.il), a large news site (cnn.com) and a smaller news site (nydailynews.com). In the forum site, page contents are highly dynamic, as users continuously add comments. The site supports multiple domain names. Most of the site's pages are generated by the same software. The news sites are similar in their structure to many other news sites on the web. The large news site has a more complex structure, and it makes use of several sub-domains as well as URL redirections. The academic site is the most diverse: it includes both static pages and dynamic software-generated content. Moreover, individual pages and directories on the site are constructed and maintained by a large number of users (faculty members, lab managers, etc.).
In the academic and forum sites, we detect likely dust rules from web server logs, whereas in the news sites, we detect likely dust rules from a crawl log. Table 1 depicts the sizes of the logs used. In the crawl logs each URL appears once, while in the web server logs the same URL may appear multiple times. In the validation phase, we use random entries from additional logs, different from those used to detect the rules. The canonization algorithm is tested on yet another log, different from the ones used to detect and validate the rules.

Web Site          Log Size   Unique URLs
Forum Site        38,816     15,608
Academic Site     344,266    17,742
Large News Site   11,883     11,883
Small News Site   9,456      9,456
Table 1: Log sizes.

Parameter settings. The following DustBuster parameters were carefully chosen in all our experiments. Our empirical results suggest that these settings lead to good results (see more details in the full draft of the paper). The maximum substring length, S, was set to 35 tokens. The maximum bucket size used for detecting dust rules, Tlow, was set to 6, and the maximum bucket size used for eliminating redundant rules, Thigh, was set to 11. In the elimination of redundant rules, we allowed a relative deficiency, MRD, of up to 5%, and an absolute deficiency, MAD, of 1. The maximum window size, MW, was set to 1100 rules. The value of MS, the minimum support size, was set to 3. The algorithm uses a validation count, N, of 100 and a refutation threshold, ε, of 5%-10%. Finally, the canonization uses a maximum of 10 iterations. Shingling [7] is used in the validation phase to determine similarity between documents.
Detecting likely DUST rules and eliminating redundant ones. DustBuster's first phase scans the log and detects a very long list of likely dust rules. Subsequently, the redundancy elimination phase dramatically shortens this list. Table 2 shows the sizes of the lists before and after redundancy elimination. It can be seen that in all of our experiments, over 90% of the rules in the original list have been eliminated.

Web Site          Rules Detected   Rules Remaining after 2nd Phase
Forum Site        402              37 (9.2%)
Academic Site     26,899           2,041 (7.6%)
Large News Site   12,144           1,243 (9.76%)
Small News Site   4,220            96 (2.3%)
Table 2: Rule elimination in second phase.

In Figure 5, we examine the precision level in the short list of likely rules produced at the end of these two phases in three of the sites. Recall that no page contents are fetched in these phases. As this list is ordered by likeliness, we examine the precision@k; that is, for each top k rules in this list, the curves show which percentage of them are later deemed valid (by DustBuster's validation phase) in at least one direction. We observe that, quite surprisingly, when similarity-based filtering is used, DustBuster's detection phase achieves a very high precision rate even though it does not fetch even a single page. In the forum site, out of the 40-50 detected rules, over 80% are indeed valid. In the academic site, over 60% of the 300-350 detected rules are valid, and of the top 100 detected rules, over 80% are valid. In the large news site, 74% of the top 200 rules are valid.
This high precision is achieved, to a large extent, thanks to the similarity-based filtering (size matching or shingle matching), as shown in Figures 5(b) and 5(c). The log includes invalid rules. For example, the forum site includes multiple domains, and the stories in each domain are different. Thus, although we find many pairs http://domain1/story_num and http://domain2/story_num with the same num, these represent different stories. Similarly, the academic site has pairs like http://site/course1/lect-num.ppt and http://site/course2/lect-num.ppt, although the lectures are different. Such invalid rules are not detected, since stories/lectures typically vary in size. Figure 5(b) illustrates the impact of size matching in the academic site. We see that when size matching is not employed, the precision drops by around 50%. Thus, size matching reduces the number of accesses needed for validation. Nevertheless, size matching has its limitations: valid rules (such as "ps" → "pdf") are missed, at the price of increasing precision. Figure 5(c) shows similar results for the large news site. When we do not use shingles-filtered support, the precision at the top 200 drops to 40%. Shingles-based filtering reduces the list of likely rules by roughly 70%. Most of the filtered rules turned out to be indeed invalid.

[Figure 5: Precision@k of likely dust rules detected in DustBuster's first two phases without fetching actual content. Panels plot precision against the top k rules: (a) Forum, 4 different logs. (b) Academic site, impact of size matching. (c) Large news site, impact of shingle matching, 4 shingles used.]

[Figure 6 panels plot precision against the number of validations: (a) Forum, 4 different logs. (b) Academic, 4 different logs.]
Figure 6: Precision among rules that DustBuster attempted to validate vs. number of validations used (N).

Validation. We now study how many validations are needed in order to declare that a rule is valid; that is, we study what the parameter N in Figure 4 should be set to. To this end, we run DustBuster with values of N ranging from 0 to 100, and check which percentage of the rules found to be valid with each value of N are also found valid when N=100. The results from conducting this experiment on the likely dust rules found in 4 logs from the forum site and 4 from the academic site are shown in Figure 6 (similar results were obtained for the other sites). In all these experiments, 100% precision is reached after 40 validations. Moreover, results obtained in different logs are consistent with each other.
At the end of the validation phase, DustBuster outputs a list of valid substring substitution rules without redundancies. Table 3 shows the number of valid rules detected on each of the sites. The list of 7 rules found in one of the logs in the forum site is depicted in Figure 7 below. These 7 rules or refinements thereof appear in the outputs produced using each of the studied logs. Some studied logs include 1-3 additional rules, which are insignificant (have very small support). Similar consistency is observed in the academic site outputs. We conclude that the most significant dust rules can be adequately detected using a fairly small log with roughly 15,000 unique URLs.

Web Site          Valid Rules Detected
Forum Site        7
Academic Site     52
Large News Site   62
Small News Site   5
Table 3: The number of rules found to be valid.

1  ".co.il/story_" → ".co.il/story?id="
2  "&LastView=&Close=" → ""
3  ".php3?" → "?"
4  ".il/story_" → ".il/story.php3?id="
5  "&NewOnly=1&tvqz=2" → "&NewOnly=1"
6  ".co.il/thread_" → ".co.il/thread?rep="
7  "http://www.../story_" → "http://www.../story?id="
Figure 7: The valid rules detected in the forum site.

Coverage. We now turn our attention to coverage, or the percentage of duplicate URLs discovered by DustBuster, in the academic site. When multiple URLs have the same document sketch, all but one of them are considered duplicates. In order to study the coverage achieved by DustBuster, we use two different logs from the same site: a training log and a test log. We run DustBuster on the training log in order to learn dust rules and we then apply these rules on the test log. We count what fraction of the duplicates in the test log are covered by the detected dust rules. We detect duplicates in the test log by fetching the contents of all of its URLs and computing their document sketches. Figure 8 classifies these duplicates. As the figure shows, 47.1% of the duplicates in the test log are eliminated by DustBuster's canonization algorithm using rules discovered on another log. The rest of the dust can be divided among several categories: (1) duplicate images and icons; (2) replicated documents (e.g., papers co-authored by multiple faculty members and whose copies appear on each of their web pages); and (3) "soft errors", i.e., pages with no meaningful content, such as error message pages, empty search results pages, etc.

[Figure 8: DUST classification, academic site.]

Savings in crawl size. The next measure we use to evaluate the effectiveness of the method is the discovered redundancy, i.e., the percent of the URLs we can avoid fetching in a crawl by using the dust rules to canonize the URLs. To this end, we performed a full crawl of the academic site, and recorded in a list all the URLs fetched. We performed canonization on this list using dust rules learned from the crawl, and counted the number of unique URLs before (Ub) and after (Ua) canonization. The discovered redundancy is then given by (Ub - Ua)/Ub. We found this redundancy to be 18% (see Table 4), meaning that the crawl could have been reduced by that amount. In the two news sites, the dust rules were learned from the crawl logs and we measured the reduction that can be achieved in the next crawl. By setting a slightly more relaxed refutation threshold (ε = 10%), we obtained reductions of 26% and 6%, respectively. In the case of the forum site, we used four logs to detect dust rules, and used these rules to reduce a fifth log. The reduction achieved in this case was 4.7%.

Web Site                 Reduction Achieved
Academic Site            18%
Small News Site          26%
Large News Site          6%
Forum Site (using logs)  4.7%
Table 4: Reductions in crawl size.

7. CONCLUSIONS
We have introduced the problem of mining site-specific dust rules. Knowing about such rules can be very useful for search engines: it can reduce crawling overhead by up to 26% and thus increase crawl efficiency. It can also reduce indexing overhead. Moreover, knowledge of dust rules is essential for canonizing URL names, and canonical names are very important for statistical analysis of URL popularity based on PageRank or traffic. We presented DustBuster, an algorithm for mining dust very effectively from a URL list. The URL list can either be obtained from a web server log or a crawl of the site.

Acknowledgments. We thank Tal Cohen and the forum site team, and Greg Pendler and the http://ee.technion.ac.il admins for providing us with access to web logs and for technical assistance. We thank Israel Cidon, Yoram Moses, and Avigdor Gal for their insightful input.

8. REFERENCES
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 20th VLDB, pages 487-499, 1994.
[2] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: different URLs with similar text. Technical Report CCIT Report #601, Dept. Electrical Engineering, Technion, 2006.
[3] K. Bharat and A. Z. Broder. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content. Computer Networks, 31(11-16):1579-1590, 1999.
[4] K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. IEEE Data Engin. Bull., 23(4):21-26, 2000.
[5] M. Bognar. A survey on abstract rewriting. Available online at www.di.ubi.pt/~desousa/1998-1999/logica/mb.ps, 1995.
[6] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In Proc. 14th SIGMOD, pages 398-409, 1995.
[7] A. Z. Broder, S. C. Glassman, and M. S. Manasse. Syntactic clustering of the web. In Proc. 6th WWW, pages 1157-1166, 1997.
[8] J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In Proc. 19th SIGMOD, pages 355-366, 2000.
[9] E. Di Iorio, M. Diligenti, M. Gori, M. Maggini, and A. Pucci. Detecting Near-replicas on the Web by Content and Hyperlink Analysis. In Proc. 11th WWW, 2003.
[10] F. Douglis, A. Feldman, B. Krishnamurthy, and J. Mogul. Rate of change and other metrics: a live study of the world wide web. In Proc. 1st USITS, 1997.
[11] H. Garcia-Molina, L. Gravano, and N. Shivakumar. dSCAM: Finding document copies across multiple databases. In Proc. 4th PDIS, pages 68-79, 1996.
[12] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[13] Google Inc. Google Sitemaps. http://sitemaps.google.com.
[14] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[15] T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. J. Amer. Soc. Infor. Sci. Tech., 54(3):203-215, 2003.
[16] N. Jain, M. Dahlin, and R. Tewari. Using bloom filters to refine web search results. In Proc. 7th WebDB, pages 25-30, 2005.
[17] T. Kelly and J. C. Mogul. Aliasing on the world wide web: prevalence and performance implications. In Proc. 11th WWW, pages 281-292, 2002.
[18] S. J. Kim, H. S. Jeong, and S. H. Lee. Reliable evaluations of URL normalization. In Proc. 4th ICCSA, pages 609-617, 2006.
[19] H. Liang. A URL-String-Based Algorithm for Finding WWW Mirror Host. Master's thesis, Auburn University, 2001.
[20] F. McCown and M. L. Nelson. Evaluation of crawling policies for a web-repository crawler. In Proc. 17th HYPERTEXT, pages 157-168, 2006.
[21] U. Schonfeld, Z. Bar-Yossef, and I. Keidar. Do not crawl in the DUST: different URLs with similar text. In Proc. 15th WWW, pages 1015-1016, 2006.
[22] N. Shivakumar and H. Garcia-Molina. Finding Near-Replicas of Documents and Servers on the Web. In Proc. 1st WebDB, pages 204-212, 1998.