Structure-Driven Crawler Generation by Example Marcio L. A. Vidal ´ Altigran S. da Silva Edleno S. de Moura Joao M. B. Cavalcanti ~ Universidade Federal do Amazonas Depar tamento de Ciência da Computacao ¸~ Manaus ­ Brazil {mvidal,alti,edleno,john}@dcc.ufam.edu.br ABSTRACT Many Web IR and Digital Library applications require a crawling process to collect pages with the ultimate goal of taking advantage of useful information available on Web sites. For some of these applications the criteria to determine when a page is to b e present in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides a b etter criteria to guide the crawling process than their content. In this pap er, we present a structure-driven approach for generating Web crawlers that requires a minimum effort from users. The idea is to take as input a sample page and an entry p oint to a Web site and generate a structure-driven crawler based on navigation patterns, sequences of patterns for the links a crawler has to follow to reach the pages structurally similar to the sample page. In the exp eriments we have carried out, structure-driven crawlers generated by our new approach were able to collect all pages that match the samples given, including those pages added after their generation. Categories and Subject Descriptors H.3 [Information Systems]: Information Storage and Retrieval--Clustering ; General Terms Algorithms, Exp erimentation Keywords Digital Libraries, Tree Edit Distance, Web Crawlers 1. INTRODUCTION An ever increasing numb er of applications on the Web target at processing collections of similar pages obtained from Web sites. The ultimate goal is to take advantage of the valuable information these pages implicitly contain to p erform tasks such as creating digital libraries, querying, searching, data extraction, data mining and feature analysis. For some of these applications, the criteria to determine when a page is to b e present in a collection are related to Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6­11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008 ...$5.00. the page content, i.e. words, phrases, etc. However, there are other situations in which the features of the inner structure of the pages provide b etter criteria to define a collection than their content. For instance, consider an application that requires collecting pages containing information ab out Jazz artists whose pages are available in the E-Jazz Web Site1 . To define a set of content-related features that encompasses all the artists would b e a hard task, since artists are usually related to some jazz genre or to a musical instrument, and there will b e several distinct styles and instruments to b e considered. On the other hand, there will also b e non-artist related pages that share a same sub ject with several artist pages. For instance, the page in Figure 1(a) is an artist page that shares the same sub ject with the page in Figure 1(b), which is a Jazz genre page. Furthermore, the page in Figure 1(a) shares no common sub ject with the page in Figures 1(c), despite the fact that b oth of them are artist pages. In such situations using content-related features to characterize the pages to b e collected is likely to fail. In such cases, representing the pages to b e collected using features related with content is not appropriate. Motivated by this problem, we prop ose a new Web crawling technique that is based on the structure of the Web pages instead of their content. While many work in the literature have addressed the issue of content-driven Web crawling, the use of page structure as a criterion to guide the traversal of crawlers has b een almost neglected. Nevertheless, this is an interesting alternative to solve practical problems in several imp ortant Web applications. The applications we consider include: automatically providing data-rich Web pages for wrapp ers which generally rely on structural patterns for p erforming extraction of implicit data [3, 12, 19]; structureaware Web page searching and querying [2, 10]; building of digital libraries [4, 16]; Web usage mining [8] and other applications where a content-driven crawler is not the optmal choice. In this pap er we present an approach for generating structure-driven crawlers that requires a minimum effort from users, since it relies on a handful of examples (usually one) of the pages to b e fetched. To accomplish this, given a sample page and an entry p oint to a Web site, the tool greedily traverses the Web site looking for target pages, i.e., pages that are structurally similar to the sample page. Next, it records all paths that lead to target pages and generates a navigation pattern [13] which is comp osed by sequences of patterns of links a crawler has to follow to reach the target 1 http://www.ejazz.com.br 292 (a) pages. Finally, we automatically generate a crawler based on these patterns. From this p oint on, the crawler can b e used to fetch pages that are structurally similar to the sample page, even if new similar pages are added later. A remarkable feature of the structure-driven approach is that in many applications the structural criteria are the only option for producing an effective crawler. Furthermore, the structure criteria is usually more precise and safer than content criteria. As a result, while content-driven crawlers seldom achieve higher quality levels, the structure-driven approach tends to b e extremely precise without losing information. Indeed, in the exp eriments we have carried out involving several crawling sessions over 11 real Web sites, structure-driven crawlers generated by our approach were able to collect almost all pages that matched the samples given, including those pages added long after their generation, achieving 100% precision and at least 95% recall in all cases. The remainder of the pap er is organized as follows. Section 2 covers research initiatives related to ours. Section 3 reviews the concept of tree edit distance and the algorithm we use to evaluate the structural similarity of Web pages. In Section 4 we focus on the details of the structure-driven crawling approach and the techniques used for generating structure-driven Web crawlers. Section 5 describ es the results of exp eriments. Finally, in Section 6 we present our conclusions and suggestions for future work. (b) Figure 1: Sample pages from the e-jazz Web site. (c) prop erties of link collections available in the pages. More precisely, they use the set of paths b etween the root of the page and the HTML tags to characterize the structure. Pages that share the same paths are considered structurally similar. Here we adopt a tree edit distance algorithm prop osed in [12] called RTDM (Restricted Top-Down Mapping). An adaptation of this algorithm is deployed in order to verify which pages are to b e collected by the crawler based on their structural similarity with the given sample page. The present work is also related to the work on focused craw ling [6]. In focused crawling the main idea is to have crawlers that only fetch Web pages that are relevant to a given topic. To sp ecify this topic, a set of example Web pages is provided. To drive the crawl process through the Web, a classifier is used to evaluate the relevance of the pages found with resp ect to the focus topics. We refer to this kind of focused crawler as content-driven, since the classifier considers the similarity b etween the content of the pages found during the crawl and the content of the page given as example. Distinctly, the crawlers addressed in the present pap er are termed structure-driven, since they rely on the structural similarity of pages. The use of features other than page content have b een considered in recent pap ers, such as [1, 14]. Among the features taken as evidence for determining how relevant a page is to b e fetched are the tokens present in URLs, the nature of Web pages that refer to a given page, and the linking structure of the sites. However, in none of these work the internal structure of the pages was considered as a piece of evidence of similarity. It is imp ortant to notice that in the present pap er we deal with a problem that is slightly distinct from the one addressed by typical focused crawlers work. While focused crawlers seek to crawl the whole Web, or a region of interest on the Web, we aim at generating crawlers that continuously fetch pages from sp ecific Web sites known in advance according to an implicit set of structural criteria. 2. RELATED WORK The work we present in this pap er is motivated by previous work on automatic Web crawling for feeding wrapp ers with data-rich pages. More sp ecifically, in [13] the concept of Navigation Pattern NP was introduced as a way of guiding the navigation of a crawler through a target Web site to fetch pages of interest for data extraction tasks. The main novelty we introduce in the present pap er is that, while in that previous work NPs are fixed and manually generated, our approach automatically generates NPs and the role of the user is limited to providing examples of the pages to b e fetched. The internal structure of the DOM trees associated to Web pages was previously used in [9] and [12] to create clusters of pages that are structurally similar. Like this pap er, the main motivation of b oth work is to organize the pages to feed data extraction processes p erformed by wrapp ers. In the present pap er, however, we deal with a related but different problem: how to collect pages of a single class of structurally similar pages according to a sample page given as input. Another noteworthy distinction b etween this article and previous work is the way the structural similarity b etween pages is evaluated. In [9] the authors rely on the layout 3. STRUCTURAL SIMILARITY In this section we review the concept of tree edit distance (TED) [17], which is a key concept in our approach and describ e how it is used to compare two given Web pages according to their structure. Intuitively, the edit distance b etween two trees TA and TB is the cost associated with the minimal set of op erations needed to transform TA into TB . Thus, in general, we consider that the similarity b etween TA and TB is inverse to their edit distance. In our work we assume that the structure of a Web page can b e describ ed by a tree, in particular a labeled ordered rooted tree. A rooted tree is a tree whose root vertex is fixed. Ordered rooted trees are rooted trees in which the relative order of the children is fixed for each vertex. Lab eled ordered 293 rooted trees have a lab el attached to each of their vertex. In its traditional formulation, the TED problem considers three op erations: (a) vertex removal, (b) vertex insertion, and (c) vertex replacement. To each of these op erations, a cost is assigned. The solution of this problem consists of determining the minimal set of op erations (i.e., the one with the minimum cost) to transform one tree into another. In our work we consider a restricted top-down version of the TED problem which restricts the removal and insertion op erations to take place only in the leaves of the trees [7]. Such a formulation is particularly useful for dealing with the HTML pages represented by their underlying DOM trees. In our approach for generating structure-driven crawlers we adopt an algorithm called RTDM [12], to calculate the tree edit distance b etween two pages according the restricted top-down formulation. To determine the restricted topdown TED b etween two trees T1 and T2 , the RTDM algorithm first finds all identical sub-trees that occur at the same level of the input trees. This step is p erformed in linear time using a graph of equivalence classes. However, RTDM is based on a p ost-order traversal of the trees. Although much simpler, this approach is applicable b ecause we only look for the identical sub-trees of the same level. This first step of the algorithm has linear cost, with resp ect to the numb er of vertices in the trees. Once the vertices in the trees are group ed in equivalence classes, an adaptation of Yang's algorithm [18] is applied to obtain the minimal restricted top-down mapping b etween the trees. This second step of the algorithm has a worst case complexity of O(n1 n2 ), but, in practice, it p erforms much b etter due to the fact that it only deals with restricted top-down sets of op erations. For more details on the RTDM algorithm we refer the interested reader to [12]. Figure 2: Overall structure of the E-Jazz Web site to include new links to these new pages. Thus, these new links and pages will have to b e taken into account in future crawling processes. Two parameters are required as input to generate a crawler capable of accomplishing such task: an URL indicating a sample page to serve as an example for the pages to b e fetched and another URL indicating an entry p oint to the site where these pages can b e found. In our example, page 0/2/0/0 serves as the sample page, while page 0/2 serve as the entry p oint. The creation of a crawler is accomplished in two phases: site mapping and navigation pattern generation. In the site mapping phase, all paths starting from the entry p oint are traversed to look for target pages. A page is considered as a target when it is structurally similar to the given sample page. Notice that, the traversing is limited to the same Web site. Every path from the entry p oint that leads to a target page is recorded. Thus, if we consider a Web site as a graph, the site mapping phase generates as output a minimum spanning tree where all leaves are target pages. This tree is called Target Pages Map (TPM). In Figure 2 the output of the mapping corresp onds to the tree in the area delimited with a dashed line. In the second phase, navigation pattern generation, the goal is to create a generic representation of the TPM. This generic representation is a list of regular expressions, where each regular expression represents the links occurring in a page the crawler has to follow to reach the target pages. This generic representation is called a Navigation Pattern (NP). A NP is meant to guide a crawler in the traversal of a site to fetch the target pages. Thus, it corresp onds to a single path into the TPM. However, as many paths leading to target pages can exist, we choose the one that p otentially leads to the greatest numb er of target pages. Notice that, if correctly generated, regular expressions in the NP will meet our requirement of accounting for new links added to the pages in the site, even after the mapping phase has b een completed, since it uses regular expressions to identify links that lead to target pages. As an example, Table 1 shows a very simple but real navigation pattern for the E-Jazz Web site. In this table, ex- 4. CRAWLER GENERATION In this section we detail the process of semi-automatically generating structure-driven crawlers prop osed in our approach. We b egin by illustrating the process by means of a simple example and then we proceed to discuss further details on the two phases involved in it: site mapping and navigation pattern generation. 4.1 Overview and a Simple Example Throughout this section, we illustrate some of the concepts we use by taking the E-Jazz Web site as an example. The site features information on jazz styles, instruments, recordings and artists. We concentrate on the artist-related p ortion. The overall structure of the site is presented in a simplified form in Figure 2. In Figure 2, pages are lab eled for reference in the discussion that follows. Assume that all pages lab elled 0/2/i (1 i k) are similar to page 0/2/0, i.e., they all contain links to artists pages. We call these pages, artist hub pages. Also, assume that all pages lab elled 0/2/i/j (1 i K, 1 j Ki ) are similar to page 0/2/0/0, i.e., they are all artist pages. Now, supp ose we want to fetch all artist pages available on the E-Jazz Web site. To accomplish this task, a crawler would b egin its traversal at page 0/2. Next, the crawler would have to access all artist hub pages and finally, it would fetch all artist pages. Notice that new artist pages can b e added to the Web site. This implies that some artist hub pages are often up dated 294 pression E1 is applied to the entry page. It matches only links that lead to artist hub pages. Next, the expression E2 is applied to each artist hub page and matches all the links in these pages that lead to all the artist pages and only to these pages. Notice that no user intervention other than providing the sample page and the entry p oint is required. In the following sections we detail the two phases required by our structure-driven approach for producing a structuredriven crawler. 4.2 Site Mapping In the site mapping phase, we consider a Web site as a directed graph S = P, L , where P is the set of pages in the Web site S and L is a set of pairs x, y such that a link exists from page x to page y in S . The ultimate goal of this phase is to generate a tree called Target Pages Map (TPM) defined as follows. Definition 1. Let S be a Web site as defined above. Let e, p S respectively be the entry point and the sample page supplied by the user. We define a Target Pages Map (TPM) as a tree T that is a subset of the minimum spanning tree of S where e is the root and the set of leaf nodes is formed exactly by the set of nodes pi from S that are structural ly similar to p. To generate a TPM we simply crawl the Web site starting from the entry p oint using a breadth-first traversal procedure. This procedure is describ ed in Figure 3. It takes as input an entry p oint e of a Web site and a sample page p to b e used for comparison. Initially, in Line 7 all links in e are enqueued in Q. Next, the crawling process goes on with the procedure iterating over this queue in the loop of Lines 8­15. The breadth-first traversal is the b est choice for mapping the target Web site, since this strategy is known to provide a b etter coverage of the sites traversed [15]. This iteration ends when the queue is empty or the Stop function returns a true value. This function encodes a set of practical constraints for the crawling, for instance, the maximum numb er of pages fetched or the maximum numb er of levels to b e reached. In our current implementation these constraints are set by means of user-defined parameters, which are fairly simple to b e estimated. Notice that these parameters are used only for the site mapping phase and are not required once the resulting crawler is generated. In Lines 9 and 10 the first element in the queue is taken and compared with p using the RTDMSim function. This comparison, is based on the structure of these pages, i.e., we verify how similar their underlying DOM trees are using the RTDM algorithm presented in Section 3. If the current page x and the sample page p are considered structurally similar, we add x to a set X of target pages (Line 11), i.e., pages that must b e collected. If x and p are not similar, the links in x are enqueued (Line 12) and the crawling goes on. At the end of the iteration, the set X will have all target pages found stored in it. After that, in the iteration of Lines 16­19 all nodes and arcs present in the paths from the entry p oint to target pages in X will b e added to the target page map (TPM) T . Notice that the M apping procedure obtains a sub-tree of the minimal spanning tree of the site. In our running example, the TPM generated would corresp ond to the tree delimited by the shaded area in Figure 2. Notice that page lab eled 0/2/0/1 is directly linked by an 1 2 3 4 5 6 7 8 9 10 11 12 15 16 17 18 19 20 Mapping input: An entry point e of a Web site S and a sample page p output: A TPM T begin let Q be queue Q links extracted from e while Q = Stop x Dequeue(Q) if RTDMSim(p,x) then X X {x} else Q Q links extracted from x end foreach x X do let C = e, v1 , . . . , vk , x be the path from e to x Add the nodes and arcs in C to T . end end Figure 3: Procedure used in the site mapping phase. entry p oint page 0/2, while all other similar pages near by are linked by the artist hub page lab eled 0/2/0. Actually, there is a link in the Web site from page 0/2/0 to page 0/2/0/1. However it was not considered, in this case, b ecause the crawler has found the link from 0/2 to 0/2/0/1 first. Such "shortcuts" are common in some Web sites and we will explain how we address these anomalies later. 4.3 Navigation Pattern Generation The goal of this phase is to build a Navigation Pattern (NP) from the TPM generated in the mapping phase. As we have already mentioned, the NP consists of a list of regular expressions that represent the links in a page the crawler has to follow to reach the target pages. In order to generate it, we have to generalize the URLs of the pages in the paths of the TPM using regular expressions, and select among all p ossible paths from the entry p oints the "b est" one that leads to desired target pages. The first step in this phase consists of grouping the nodes in the TPM tree that represent pages with URL that are considered similar. We will p ostp one the discussion on the notion of URL similarity for the moment. The procedure for p erforming this grouping is presented in Figure 4. The Group procedure is initially applied to the root of the TPM tree and then recursively applied to its child nodes. In this procedure, (x) denotes the URL of node x, and C (x) denotes the set of child nodes of node x. For a given node r , Lines 3­6 define a partitioning on the set of child nodes of r , where each partition Gi contains only nodes corresp onding to pages with a similar URL. To evaluate the similarity b etween two URLs we use the URLsim function (Line 6). This function will b e detailed later. 1 Group(r ) 2 b egin 3 let G = {G1 , G2 , . . . , Gk }, k |C (r )| such that: 4 G1 . . . Gk = C (r ) 5 Gi Gj = 6 c Gi iff cm Gi and URLsim((c ), (cm )) 7 foreach Gi = {ci1 . . . cik } do i 8 = URLpattern((ci1 ) . . . (cik )) 19 0 11 12 13 14 ni , ci1 . . . cik i Remove all cij Gi as a child of r Add ni as a child of r call Group(ni ) end end i Figure 4: Grouping nodes procedure. 295 Exp. Id E1 E2 Regular Expression http://www.ejazz.com.br/artistas/default_^[a- zA- Z]+$ http://www.ejazz.com.br/detalhes- artistas.asp?cd=^d+$ Applied in Entry Point Artist Hub Page Table 1: Example of a simple navigation pattern. A Perl-like syntax is used. Next, in Line 11, a new node ni is created from the nodes in Gi as follows: the URL in ni corresp onds to a regular expression that generalizes the set of URL from the nodes in Gi according to function URLpattern. This function will b e detailed later. The set of children of ni is comp osed by all children of the nodes in Gi . Then, all the nodes in Gi are removed from the set of r children (Line 10), and replaced by the node ni (Line 11). In summary, we replace all nodes corresp onding to pages with similar URLs by a single node that represents all of them, so that there will b e such a node replacing each partition of similar children. Finally, in Line 12, the Group procedure is recursively called for the new node ni . The execution of the Group procedure is illustrated in Figure 5, where the left side (a), illustrates a TPM T and the rigth side (b) illustrates the tree T generated as the result of applying this procedure. Consider that nodes with the same fill pattern corresp ond to pages with similar URLs. Each node represented by a capital letter (A, B, C, ...) in T , except for the root r, groups all nodes from T represented by non capital letters (a1, b2, b2, ...). Notice that nodes lab eled A, C and D actually represent single nodes, a, c and d, resp ectively. 4.3.1 Evaluating Similarity Between URLs We consider that a URL is a string formed by several substrings separated by a "/". We call these substrings levels. Thus, let u = u1 /u2 / . . . /un (n 1) and v = v1 /v2 / . . . /vm (m 1) b e two URLs. URLs u and v are considered similar if: (1) they have the same numb er of levels, i.e., n = m; (2) the first level in u and v are equal, i.e., u1 = v1 ; and (3) there are at most K levels in the same p osition in u and v (except for the first) that are not equal, i.e., let = { ui , vi | ui = vi }, then | | K . By p erforming preliminary exp eriments with several real URLs, we have found that a value of K = 2 results in an accurate and safe similarity evaluation. This result was confirmed by our final exp eriments. Thus, the function URLsim called in Line 6 of algorithm of Figure 4 is true for two given URL u and v if they satisfy the three conditions ab ove. 4.3.2 URL Pattern Generation We now detail the procedure used to generate a pattern that represents a set of URLs. We recall that the URLs are assumed to b e similar in the same sense as describ ed ab ove and that these URLs differ at most in K levels. Thus, our strategy for generating URL patterns consists in building regular expressions that match all strings occurring in each level where the URLs differ from each other. In order to do this we introduce the notion of a Regex tree. Definition 3. A Regex tree R is a hierarchy (a tree) in which each node n is associated with a regular expression en over an alphabet that denotes a language Ln , such that for each internal node a in R whose children are {c1 , . . . , ck } we have Lc1 . . . Lck = La and Lc1 . . . Lck = . (a) (b ) Figure 5: Illustration of the execution of the Group procedure. After the execution of the grouping procedure we will p ossibly have more than one path leading to the target pages. This is depicted in Figure 5 where three p ossible paths are found: r to F, r to G and r to H. From theses paths, we choose the one that leads to the greatest numb er of target pages. In our example, we see that this condition is satisfied by path r to H. Thus, this path will finally b e elected as the navigation pattern NP. This notion is formalized in Definition 2. Definition 2. Let S be a Web site and let T be a TPM for S obtained when e, p S are respectively the entry point and the sample page supplied by the user. A N P is the path in the tree T obtained after applying the Group procedure over T , which begins in e and ends in the node of T that corresponds to the greatest number of target pages in T So far, we have p ostp oned the discussion on how we measure the similarity b etween two URLs using function URLsim (Line 6) and on how we generate a pattern for representing a set of similar URL using function URLpattern (Line 8). Now we detail these two functions. Figure 6: A Regex tree. A Perl-like syntax is used. The Regex tree we adopt in our work is presented in Figure 6. It is an adaptation of the so called Delimiters Tree defined in [11]. According to Definition 3, the Regex tree is built in such a way that each node defines a language that is disjoint with the languages in its sibling nodes. In particular, each leaf node defines a language that is disjoint with the languages in its sibling leaf nodes. Thus, when applied to the tokens in a URL, a given token will match exactly one leaf node. This prop erty allows us to use the leaf nodes in the Regex tree to tokenize a URL or a URL level. Furthermore, each internal node defines a language that is formed by the union of the languages of its child nodes. This prop erty allows us to find a single regular expression that matches all tokens in a same p osition occurring in a 296 set of URL. Let lt and ls b e two leaf nodes matched by two distinct tokens t and s. The node a that is the deep est common ancestor of lt and ls defines a regular expression that matches t and s. The complete procedure for generating an URL pattern is describ ed in Figure 7, where the symb ol "·" is used to denote the string concatenation op eration. To explain this procedure, we will use the set of sample URLs presented in Figure 8. When taking as input these URLs, the URLPattern procedure iterates in the loop of Lines 7­18 over the levels in the URLs that are distinct. As illustrated in Figure 8(a), only levels 6 and 7 will b e processed in this loop. We will focus on level 7, since it is more illustrative of the procedure functionality. This level is detailed in Figure 8(b). According to Line 8 there is no common prefix b etween all ui [7], but there is common suffix ".html" b etween them. In the loop of Lines 9­11, we tokenize each infix fi [7] generating the tokens present in Figure 8(b). Next, in the loop of Lines 12­16 each set of tokens in the same p osition of the infixes is given as an argument for calling the TokenRegex function. This means that in Figure 8(b), each column la^ b eled fi [7][j ] will b e given as an argument of TokenRegex. The result of each calling app ears in the line lab eled . For ^ column fi [7][1] all tokens will match the same leaf in the Regex tree of Figure 6. Thus, the regular expression in this ^ node is used. For column fi [7][2], as the same toke rep eats for all lines, it is simply considered as the regular expression. ^ In the case of the tokens in column fi [7][5], notice that the tokens "8" and "D" match distinct leafs in the Regex tree and the deep est common ancestor corresp onds to the node whose regular expression is"\w". Thus, this regular expression is used. expression 1 in the navigation pattern are enqueued. From this p oint on, and while the queue is not empty, the pages in the site are visited in a breadth-first way. If the condition in Line 15 fails, it means that a target page in the last level (n) was reached. Thus, in Line 18, this page is stored. At the end, all target pages from the Web site are fetched. 5. EXPERIMENTAL RESULTS In this section we present the results of an exp erimental evaluation we carried out with the prop osed structure-driven crawling approach. For this we used 11 representative real Web sites which present collections of data-rich Web pages characterized by having a common structure. We have only used sites with a large numb er of pages and an expressive amount of target pages to b e fetched. The list of the sites used, along with brief description and the definition of the target pages we wanted to fetch, is presented in Table 2. For each site, our exp eriments consisted in first generating a navigation pattern using a single sample page. Next, the structure-driven crawler using this navigation pattern was executed over the site. The results are presented in Table 3. In Table 3, the "Manual" column shows the numb er of pages found in each site as b eing target pages. These numb ers were obtained by exhaustive manual navigation through the p ortions of each site where the target pages were located. The "Automatic" column shows the numb er of target pages automatically identified and fetched by the generated crawler. The recall p ercentage is also shown in this column. It is worth noticing that, in all cases, we reached 100% precision, since all pages fetched were manually correctly verified as target pages. The columns "Generation" and "Crawling" show the numb er of links traversed, resp ectively, for generating the crawler during the site mapping phase and for fetching the target pages during the crawling. Notice that, in all cases, the numb ers in the "Crawling" column are smaller than the numb ers in the "Generation" column. This occurs b ecause, during crawling, only the links matching the regular expression in the navigation pattern are traversed. We run each generated crawler over the corresp onding Web site of Table 2 two more times. For some of them, it was detected that new pages were added after the generation of the crawler. Table 4 shows the results of these two additional crawling processes over each site. We notice that the recall figures in theO "Automatic" column remain similar to the ones in Table 3. The same can b e said with resp ect to the numb er of links traversed during the crawls, which are presented in the "Traversed"column. The column "New Pages" accounts for the new target pages found by the crawler. This will occur to all new pages as long as they retain the same structure of the sample page supplied for the crawler generation. Notice that if the structure changes, a new crawler can b e easily generated. 4.4 Structure-Driven Crawler In the previous section we have describ ed how to create a navigation pattern (NP) to reach target pages in a Web site. In this section we will explain how to use the NP generated to drive a crawler by means of the procedure presented in Figure 9. Basically, it consists of a breath-first traversal of the target Web site where the links to b e followed are selected according to the regular expressions in the NP. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Structure-Driven Crawler input:An entry point e of a Web site S A sample page p A navigation pattern N P output: The set P of pages structurally similar to p begin let Q be a queue let N P = 1 , . . . , n foreach link l from e that match 1 do Q Q { l, 2 } end while Q = do u, j Dequeue(Q) g Download(u) if j = n then foreach link l from g that match j do Q Q { l, j + 1 } else P P {g } end 6. CONCLUSION AND FUTURE WORK We have prop osed and evaluated a new approach for automatically generating structure-driven Web crawlers. The new approach uses the structure of Web pages instead of their content to determine which pages should b e collected or not. Our exp eriments indicate that the new structuredriven approach can b e extremely effective, resulting in high precision and recall levels when crawling sp ecific domain pages in data-rich Web sites. The prop osed method also has the advantage of requiring only a few examples to learn how to identify the set of Web pages that should b e fetched. Figure 9: Structure-driven crawler The algorithm uses a queue Q whose elements are pairs u, j , where u is an URL and j indicates the current level b eing traversed in the Web site. In the loop of Lines 9­11 the links from the entry page that match the first regular 297 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 1 2 3 4 5 6 7 8 9 10 12 13 14 URLPattern input: U = {u1 , u2 , . . . , un }, a set of URLs output: u , a URL pattern begin u u1 let ui [k ] be the k -th level of URL ui foreach k such that { u1 [k ], . . . , un [k ] | ui [k ] = uj [k ] for some 1 i, j n} do let ui [k ] = p · fi [k ] · s, where p (s) is the common prefix (suffix) between all ui [k ] for i 1 to n do fi [k ][1], fi [k ][2], . . . , fi [k ][mi,k ] Tokenize(fi [k ]) end for j 1 to MAX {m1,k , m2,k , . . . , mn,k } do ^ ^ ^ x TokenRegex(f1 [k ][j ], f2 [k ][j ], . . . , fn [k ][j ]) ^ [k ][j ] = f1 [k ][j ] if j mk,i or tf1 [k ][j ] = otherwise where fi ·x end u [ k ] s · · p end end Tokenize input: s, a string output: T = {s1 , s2 , . . . , sn }, a set of tokens begin let R be a Regex tree i0 while s = "" do i + +; si the leftmost largest substring from s that matches a leaf in R s s - si end end 1 2 3 5 6 7 8 9 10 11 12 15 17 TokenRegex input: T = {t1 , t2 , . . . , tn }, a set of tokens output: pT , a regex matching the tokens in T begin if ti = tj for all ti , tj = then pT ti else let R be a Regex tree let (ti ) the leaf node that matches token ti = a the deepest an ancestor of all (ti ) p T ea if there is some ti = then pT pt · end Figure 7: Procedure for generating an URL pattern Site www.ejazz.com.br informatik.uni-trier.de/ley/db/conf/vldb www.olympic.org www1.folha.uol.com.br/folha/turismo www1.folha.uol.com.br/folha/dinheiro dot.kde.org www.amazon.com www.wallstreetandtech.com www.cnn.com/weather sports.yahoo.com www.nasa.gov/home Description Pages on jazz genre, artists, instruments, etc. VLDB Conference section of DBLP International Olympic Committee Web site Traveling section from "Folha de Sao Paulo" newspaper ~ Money section from "Folha de Sao Paulo" newspaper ~ Package release section from the KDE Desktop Manager Amazon Essential CDs section WallStreet and Technology News Weather in several cities around the world Sports section from Yahoo Nasa official site Target Pages Artists VLDB conferences Olympic heroes News News Package announcements CD Last news Weather forecast European soccer teams News from Nasa Table 2: List of the Web sites used in the experiments. Site E-jazz VLDB OLYMPIC Traveling Money KDE CDs WallStreet CNN Yahoo Sports NASA Target Pages Manual Automatic 149 149(100%) 30 30(100%) 335 328(98%) 301 301(100%) 470 468(99%) 30 30(100%) 416 398(96%) 261 253(97%) 51 49(96%) 38 37(97%) 339 325(95%) Links Traversed Generation Crawling 2213 199 70 32 395 379 348 335 550 528 120 31 440 426 1579 283 485 65 1307 45 687 389 Site Travel Money KDE CDs WallStreet NASA Crawl 1 2 1 2 1 2 1 2 1 2 1 2 Pages Manual 314 303 486 482 29 29 409 418 267 272 334 337 Fetched Automatic 308(98%) 291(96%) 478(98%) 476(99%) 29(100%) 29(100%) 394(96%) 412(98%) 257(96%) 267(98%) 320(95%) 323(95%) Links 335 310 497 497 31 34 492 487 271 273 339 341 New Pages 7 11 84 80 14 19 4 18 17 25 12 13 Table 3: Results of the experiments with each generated crawler. In fact, we have used just one example in our exp eriments, while in the content-driven approach it is usually necessary to provide a significant set of examples in the learning phase, usually a few dozen [5, 14, 16]. The structure-driven approach is complementary to the traditional content-driven approach, in the sense that it is more suited for sites that are data intensive and, at the same time, present regular structure. This means that our new Table 4: Results of crawling after the adding of new target pages. method is the b est option for a restricted set of crawling tasks. However, it is imp ortant to notice that this kind of data intensive Web sites is b ecoming more p opular as the Web grows. As future work we plan to increase the expressiveness of the navigation patterns by adding the capability of dealing with sites in the so-called Hidden Web, i.e, those sites whose 298 (a) u1 u2 u3 ui [1] www.informatik.uni- trier.de www.informatik.uni- trier.de www.informatik.uni- trier.de www.informatik.uni- trier.de ^ fi [7][1] Chaves Ugher Mendes [a - Z ]+ ui [2] ley ley ley ley ui [3] db db db db ui [4] indices indices indices indices ui [5] a- tree a- tree a- tree a- tree ^ fi [7][4] * ui [6] c u m [a- Z]+ ^ fi [7][5] 8 D \w* ui [7] Chaves:Silvio_8=.html Ugher:Mangabeira_D=.html Mendes:Sergio.html [a- Z]+:[a- Z]+_*\w*.html s .html .html .html .html (b) u1 [7] u2 [7] u3 [7] ^ fi [7][2] : : : : ^ fi [7][3] Silvio Mangabeira Sergio [a - Z ]+ Figure 8: Example of a simple navigation pattern. A Perl-like syntax is used. pages are automatically generated from results of queries to databases using arguments taken from forms. Another future extension we envision is the combination of the contentdriven and structure-driven approaches, creating a hybrid strategy that combines the advantages of using content and structure. The idea is to produce methods that are more flexible than the structure-driven method prop osed here, while still achieving high precision and recall levels. [8] Cooley, R. The use of web structure and content to identify sub jectively interesting web usage patterns. ACM Transactions on Internet Technology 3, 2 (2003), 93­116. [9] Crescenzi, V., Merialdo, P., and Missier, P. Clustering web pages based on their structure. Data and Know ledge Engineering 54, 3 (2004), 277­393. [10] Davulcu, H., et al. A layered architecture for querying dynamic web content. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Philadelphia, PY, USA, 1999), pp. 491­502. [11] de Castro Reis, D., et al. A framework for generating attribute extractors for web data sources. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval (Lisb on, Portugal, 2002), pp. 210­226. [12] de Castro Reis, D., et al. Automatic web news extraction using tree edit distance. In Proceedings of the 13th international conference on World Wide Web (New York, NY, USA, 2004), pp. 502­511. [13] Lage, J. P., et al. Automatic generation of agents for collecting hidden web pages for data extraction. Data an Know ledge Engineering 49, 2 (2004), 177­196. [14] Liu, H., Milios, E. E., and Janssen, J. Probabilistic models for focused web crawling. In Procendings of the ACM CIKM International Workshop on Web Information and Data Management (Washington, DC, USA, 2004), pp. 16­22. [15] Najork, M., and Wiener, J. L. Breadth-first crawling yields high-quality pages. In Proceedings of the 10th International World Wide Web Conference (Hong Kong, China, 2001), pp. 114­118. [16] Qin, J., Zhou, Y., and Chau, M. Building domain-sp ecific web collections for scientific digital libraries: a meta-search enhanced focused crawling method. In Joint Conference on Digital Libraries (Tuscon, AZ, USA, 2004), pp. 135­141. [17] Selkow, S. M. The tree-to-tree editing problem. Information Processing Letters 6 (Dec. 1977), 184­186. [18] Yang, W. Identifying syntactic differences b etween two programs. Software ­ Practice And Experience 21, 7 (July 1991), 739­755. [19] Zhai, Y., and Liu, B. Web data extraction based on partial tree alignment. In Proceedings of the 14th international conference on World Wide Web (Chiba, Japan, 2005), pp. 76­85. 7. ACKNOLEDGEMENTS This work is partially supp orted by pro jects GERINDO (CNPq/CT-INFO 552.087/02-5) and SIRIAA (CNPq/CTAmazonia 55.3126/2005-9) and individual grants by CNPq ^ (303032/2004-9 Altigran S. da Silva) (303576/2004-9 Edleno S. de Moura) and FAPEAM (Marcio Vidal). This research is also sp onsored by UOL (www.uol.com.br), through its "UOL Bolsa Pesquisa" program, Proc. num. 200503301456. 8. REFERENCES [1] Aggarwal, C. C., Al-Garawi, F., and Yu, P. S. On the design of a learning crawler for topical resource discovery. ACM Transactions on Information Systems 19, 3 (2001), 286­309. [2] Ahnizeret, K., et al. Information retrieval aware web site modelling and generation. In Proceedings of the 23rd International Conference on Conceptual Modeling (Shanghai, China, 2004), pp. 402­419. [3] Arasu, A., and Garcia-Molina, H. Extracting structured data from web pages. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (San Diego, CA, USA, 2003), pp. 337­348. [4] Calado, P., et al. Web-dl: an exp erience in building digital libraries from the web. In Proceedings of the ACM International Conference on Information and Know ledge Management (McLean, VA, USA, 2002), pp. 675­677. [5] Chakrabarti, S., Punera, K., and Subramanyam, M. Accelerated focused crawling through online relevance feedback. In Proceedings of the 11th International World Wide Web Conference (Honolulu, HI, USA, 2002), pp. 148­159. [6] Chakrabarti, S., van den Berg, M., and Dom, B. Focused crawling: A new approach to topic-sp ecific web resource discovery. Computer Networks 31, 11-16 (1999), 1623­1640. [7] Chawathe, S. S. Comparing hierarchical data in external memory. In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases (Edinburgh, Scotland, U.K., 1999), pp. 90­101. 299