WWW 2007 / Track: Search / Session: Web Graphs

Random Web Crawls

Toufik Bennouas (Criteo R&D, 6, boulevard Saint Denis, 75010 Paris, France, t.bennouas@criteo.com)
Fabien de Montgolfier (LIAFA, Université Paris 7, 2, place Jussieu, Case 7014, 75251 Paris, France, fm@liafa.jussieu.fr)

ABSTRACT
This paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph whose vertices are the pages and whose edges are the hypertextual links. Of course a Web crawl has a very special structure; we recall some known results about it. We then propose a model generating similar structures. Our model simply simulates a crawl, i.e. it builds and crawls the graph at the same time. The generated graphs have many of the known properties of Web crawls. Our model is simpler than most random Web graph models, but captures the same properties. Notice that it models the crawling process instead of the page-writing process of Web graph models.

Categories and Subject Descriptors: I.6.m [Simulation and Modeling]: Miscellaneous
General Terms: Theory
Keywords: web graph, crawling, crawl order, model, hyperlink structure

1. INTRODUCTION
The Web is a fascinating object that has been studied very extensively in recent years. Among the many research problems it opens, an interesting one is topological: describing the shape of the Web [11]. Understanding the hyperlink structure allowed the design of the most powerful search engines, like Google, famous because it uses the PageRank algorithm of Brin and Page [21], as well as other ranking methods like HITS by Kleinberg [15], cyber-community detection [19, 14], and many other applications.

However, we only know parts of the Web. Crawlers are programs that automatically browse the Web and cache the "most relevant" information, especially document URLs and their hyperlinks. This is performed recursively, the analysis of crawled pages yielding new valid URLs. But bandwidth limitations, HTML errors, unreferenced pages, removed or modified pages, and the existence of dynamic pages (generated from requests in the URL or from a session mechanism) make it very hard, if not impossible, to output "the" Web: crawlers instead produce partial and biased images. A crawl is not even a snapshot of the Web, since crawling takes a long time while the pages change rapidly. From these observations one can try to infer the properties of the "real" Web, the underlying object, but this is hard since the biases are not well known. So we cannot say that the properties of the Web graph are known, only that some properties of Web crawls are known. The objects we deal with in this paper are therefore Web crawls, and not the Web itself.

In Section 2 we recall some of the most commonly admitted properties of Web crawls. In order to explain why these graphs have such properties, many hypotheses from sociology or computer science have been proposed. Many models describe random graphs, and some of them (see Section 3) are specifically designed to model Web graphs, i.e. the hyperlink structure of the Web. The authors usually compare measurements on their random graphs with existing crawls of the Web and conclude how accurate their model is [2, 4, 6, 7, 9, 10, 17, 18].

We also propose a model generating random Web crawls, and show that our random crawls share many properties with real crawls. But our approach is quite different from that of random Web graph models. We do not try to model the page-writing process, using sociological assumptions about how people link their own pages to existing ones; we try to model the page-crawling process itself. So we do not suppose that pages link preferentially to well-known pages, nor that the links of a page are likely copies of the links of another page, or that kind of thing. We instead postulate only two things about a crawl:
· the in- and out-degrees of the pages follow Zipf laws (a.k.a. power laws), and
· the graph is output by a crawler.
We present our model in detail in Section 4. In Section 5 we show that our random crawls have most of the main crawl properties presented in Section 2.

2. WEB CRAWL PROPERTIES
Web crawls can be quite large objects (for instance Google currently claims more than 8 billion pages in its database) but are very sparse graphs, since the average degree is around 7 links per page [8]. Here crawls are directed graphs. They are not necessarily connected, since a connecting page may be removed from the Web and then deleted from the crawl. They have very few sources (pages with no incoming links, either submitted by people or unlinked after a while) and a lot of sinks (pages not crawled yet or with no external hyperlink).

2.1 Connectivity and Clustering
Small-world graphs, defined by Watts and Strogatz [24] and studied by many authors since, are graphs that fulfill the following two properties:
1. the characteristic path length (average distance between two vertices) is small: O(log n) or O(log log n);
2. the clustering coefficient (probability that two vertices sharing a common neighbor are linked) is high: O(1).
This definition can be applied to directed graphs by ignoring arc directions. The characteristic path length of the Web (in fact, of its crawls) seems small (about 16 clicks [8] or 19 [3]), consistent with the O(log n) requirement. Its clustering coefficient is high: existing computations of its exact value differ, but it is admitted to be greater than 0.1, while random graphs (Erdős-Rényi, see Section 3.1) with the same average degree have a clustering coefficient close to 0.

The crawl diameter (maximum distance between two pages) is potentially infinite, because a dynamic page labeled by n in its URL may refer to a dynamic page labeled by n + 1, but since Web crawlers usually perform BFS (see Section 4.1) the diameter of crawls may actually be small.
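The two small-world statistics above are easy to estimate on an actual crawl graph. The following is a minimal sketch (not the authors' code), assuming the networkx library and a crawl stored as an edge list in a hypothetical file crawl.txt; the exact path-length computation is only feasible on modest graphs, so real measurements would sample vertex pairs instead.

```python
# Minimal sketch: measuring the clustering coefficient and the characteristic
# path length of a crawl graph, assuming networkx and an edge list "crawl.txt".
import networkx as nx

# Load the crawl as a directed graph, then ignore arc directions,
# as done above for the small-world definition.
D = nx.read_edgelist("crawl.txt", create_using=nx.DiGraph)
G = D.to_undirected()

# Clustering coefficient: probability that two vertices sharing a
# common neighbour are themselves linked (reported > 0.1 for crawls).
clustering = nx.average_clustering(G)

# Characteristic path length: average distance between two vertices,
# computed on the largest connected component (crawls need not be connected).
# This exact computation is quadratic; sample pairs for large crawls.
largest_cc = G.subgraph(max(nx.connected_components(G), key=len))
path_length = nx.average_shortest_path_length(largest_cc)

print(f"clustering coefficient: {clustering:.3f}")
print(f"characteristic path length: {path_length:.2f}")
```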
2.2 Degree distribution
Zipf laws (a.k.a. power laws) are probability laws such that Prob(X = d) is proportional to d^(-β), i.e.
log(Prob(X = d)) = C - β log(d)
for some constant C. If the in-degree (respectively out-degree) distribution of a graph follows a Zipf law, Prob(X = d) is the probability for a vertex to have in- (resp. out-) degree d. In other words, the number of vertices with degree d is k·d^(-β), where k depends on the number of vertices n. A graph class such that the degrees of almost all graphs follow a Zipf law is called scale-free, because some parameters like β are scale invariant. Scale-free graphs have been extensively studied [4, 6, 17, 18]. Many graphs modeling social networks, interactions between objects (proteins, people, neurons...) or other networks seem to have the scale-free property. For Web crawls, a measurement by Broder et al. [8] on a 200,000,000-page crawl shows that the in- and out-degrees follow Zipf laws, with exponents βin = 2.1 for in-degree and βout = 2.72 for out-degree.
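As an illustration of such distributions, here is a small sketch (not from the paper) that draws in-degrees from a Zipf law with the exponent reported in [8] and checks the exponent with a rough log-log least-squares fit; a maximum-likelihood estimator would be preferable in practice.

```python
# Small sketch: drawing degrees from a Zipf law and estimating its exponent.
# numpy's zipf sampler (Prob(X = d) ~ d**(-beta)) is assumed.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n = 1_000_000
beta_in, beta_out = 2.1, 2.72

in_deg = rng.zipf(beta_in, size=n)    # Prob(X = d) ~ d**(-beta_in)
out_deg = rng.zipf(beta_out, size=n)  # Prob(X = d) ~ d**(-beta_out)

# Rough exponent check: least-squares slope of log(count) vs log(degree).
# (A maximum-likelihood estimator would be more robust in practice.)
counts = Counter(in_deg)
d = np.array(sorted(counts))
c = np.array([counts[k] for k in d], dtype=float)
slope, _ = np.polyfit(np.log(d), np.log(c), 1)
print(f"estimated in-degree exponent: {-slope:.2f}")  # close to 2.1
```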
2.3 Strongly connected components and the Bow-Tie structure
According to Broder, Kumar et al. [8], the Web has a Bow-Tie structure: a quarter of the pages are in a Giant Strongly Connected Component (GSCC), a quarter are the "in" pages, leading to the GSCC but not linked from it, another quarter are the "out" pages, reachable from the GSCC but not linking to it, and the last quarter is not related to the GSCC. This famous assertion was reported even by Nature [23] but, over the last four years, an increasing number of people have come to suspect it is a crawling artifact. According to the same survey, the distribution of the sizes of strongly connected components follows a Zipf law with exponent roughly 2.5.

2.4 Cores
Another well-known property of crawls is the existence of cores. A core is a dense directed bipartite subgraph, consisting of many hub pages (or fans) pointing to many authorities. It is supposed [19, 16] that such cores are the central structure of cyber-communities, sets of pages about the same topic. The authorities are the most relevant pages, but they do not necessarily point to one another (because of competition, for instance), while the hubs list most of them. Starting from this assumption, the HITS algorithm [15] ranks the pages containing a given keyword according to a hub factor and an authority factor. Kumar et al. [19, 18] enumerated over 200,000 bipartite cores in a 200,000,000-page crawl of the Web. Core sizes (counting hubs, authorities, or both) follow Zipf laws with exponents between 1.09 and 1.4.

2.5 Spectral properties and the PageRank factor
Another ranking method, the most popular since it does not depend on given keywords, is Google's PageRank factor [21]. It is an accessibility measure of a page: briefly, the PageRank of a page is the probability for a random surfer to be on this page after a very long surf. It can be computed by basic linear algebra algorithms. The PageRank distribution also follows a Zipf law, with the same exponent as the in-degree distribution [22]. Pages with high PageRank are very visible, since they are effectively popular on the Web and are linked from other pages with high PageRank. A crawler therefore finds them easily [5, 20], while it may miss low-ranked pages. This is indeed a useful bias for search engine crawlers!
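The "basic linear algebra" computation mentioned above is usually a power iteration. The following is a minimal sketch of PageRank on a toy adjacency list; the damping factor 0.85 and the uniform redistribution of the rank of sinks are conventional choices, not taken from this paper.

```python
# Minimal power-iteration sketch of PageRank as introduced in [21],
# on a toy adjacency list (list of successor lists).
import numpy as np

def pagerank(out_links, damping=0.85, iters=100):
    n = len(out_links)
    rank = np.full(n, 1.0 / n)           # start from the uniform distribution
    for _ in range(iters):
        new = np.zeros(n)
        for page, succs in enumerate(out_links):
            if succs:                     # spread rank along outgoing links
                share = rank[page] / len(succs)
                for s in succs:
                    new[s] += share
            else:                         # sink: redistribute uniformly
                new += rank[page] / n
        rank = (1 - damping) / n + damping * new
    return rank                           # sums to 1, like the normalization above

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2 (3 is a source, none is a sink).
print(pagerank([[1, 2], [2], [0], [2]]))
```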
3. RANDOM GRAPH MODELS

3.1 Basic models: Erdős-Rényi
For a long time the most used random graph model was the Erdős-Rényi model [13]. The random graph depends on two parameters, the number of vertices n and the probability p for two vertices to be linked. The existence of each edge is a random variable independent of the others. For suitable values (p = d/n), E.-R. graphs indeed have a characteristic path length of O(log n), but a very small clustering coefficient (o(1)) and a degree distribution following a Poisson law rather than a Zipf law. They therefore do not accurately describe crawls. Other models have since been proposed in which attachment is not independent.

3.2 Incremental generation models
Most random Web graph models [4, 6, 7, 17, 18] propose an incremental construction of the graph. When the existence of a link is probed, it depends on the existing links. That process models the creation of the Web across time. In some models all the links going out of a page are inserted at once; in others the insertion is incremental.

3.3 Preferential attachment models
The first evolving graph model (BA) was given by Barabási and Albert [4]. The main idea is that new nodes are more likely to attach to existing nodes with high degrees. This model is now referred to as an example of a preferential attachment model. They concluded that the model generates graphs whose in-degree distribution follows a Zipf law with exponent β = 3.

Another preferential attachment model, called the Linearized Chord Diagram (LCD), was given in [6]. In this model a new vertex is created at each step and connects to existing vertices with a constant number of edges. A vertex is selected as the endpoint of an edge with probability proportional to its in-degree, with an appropriate normalization factor. In-degrees follow a Zipf law with exponent roughly 2 when out-degrees are constant (equal to 7).

In the ACL model [2], each vertex is assigned an in-weight (respectively out-weight) that depends on its in-degree (respectively out-degree). A vertex is selected as the endpoint of an edge with probability proportional to its weight. In these models edges are added but never deleted. The CL-del model [9] and the CFV model [10] incorporate in their design both the addition and deletion of nodes and edges.

3.4 Copy models
A model was proposed in [17] to explain other relevant properties of the Web, especially the great number of cores, since the ACL model generates graphs which on average contain few cores. The linear growth copying model of Kumar et al. [18] postulates that a Web page author copies an existing page when writing their own, including the hyperlinks. In this model, each new page has a master page from which it copies a given number of links. The master page is chosen proportionally to in-degree. The other links of the new page are then added following uniform or preferential attachment. The result is a graph with all the properties of the previous models, plus the existence of many cores.

These models often use many parameters needing fine tuning, together with sociological assumptions on how Web pages are written. We propose a model based on a computer science assumption instead: the Web graphs we know are produced by crawlers. This allows us to design a simpler (it depends on only two parameters, obtained from experiments) and very accurate model of Web crawls.
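To make the preferential attachment idea concrete, here is a simplified sketch in the spirit of the models of Section 3.3 (not the exact LCD construction of [6]): each new vertex emits m edges whose targets are chosen with probability proportional to in-degree plus one, using the usual repeated-targets list trick. As in LCD, loops and multi-edges may occur.

```python
# Simplified preferential attachment sketch (illustration only).
import random

def preferential_attachment(n, m=7, seed=0):
    random.seed(seed)
    edges = []
    # "targets" holds each vertex once per received edge (plus once for the
    # +1 smoothing), so uniform sampling from it is proportional to
    # in-degree + 1 (the usual constant-time sampling trick).
    targets = [0]
    for v in range(1, n):
        for _ in range(m):
            u = random.choice(targets)
            edges.append((v, u))
            targets.append(u)    # u's in-degree grew: one more copy of u
        targets.append(v)        # the newcomer becomes a possible target too
    return edges

edges = preferential_attachment(10_000)
```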
4. A WEB CRAWL MODEL
In this section we present crawling strategies and derive our Web crawl model from them. It aims to mimic the crawling process itself, rather than the page-writing process as Web graph models do.

4.1 Web crawl strategies
Let us consider a theoretical crawler. We suppose the crawler visits each page only once. The benefit is to avoid modeling the disappearance of pages or links across time, because the law it follows is still debated (is the lifetime of a page related to its popularity, or to its degree properties?). When scanning a page, the crawler gets at once the set of its outgoing links. At any time, the (potentially infinite) set of valid URLs is divided into:
1. Crawled: the corresponding pages were visited and their outgoing links are known;
2. Unvisited: a link to this URL has been found but not probed yet;
3. Erroneous: the URL was probed but points to a non-existing or non-HTML file (some search engines index such files, but they contain no URLs and are not interesting for our purposes);
4. Unknown: the URL was never encountered.
The crawling algorithm basically chooses and removes from its Unvisited set a URL to crawl, and then adds the unprobed outgoing links of the page, if any, to the Unvisited set. The crawling strategy is the way the Unvisited set is managed. It may be (a sketch of these policies follows at the end of this subsection):
· DFS (depth-first search): the strategy is LIFO and the data structure is a stack;
· BFS (breadth-first search): the strategy is FIFO and the data structure is a queue;
· DEG (highest degree): the most pointed-to URL is chosen; the data structure is a priority queue (a heap);
· RND (random): a uniformly random URL is chosen.
We suppose the crawled pages are ordered by their discovery date. Only the crawled pages are considered when discussing structural properties. Notice that the first three strategies can only be correctly implemented on a single computer. The most powerful crawlers are distributed over many computers and their strategy is hard to define; it is usually something between BFS and random.
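The four frontier policies can be sketched as follows (an illustration, not the authors' code), for a crawler run over a graph given as an adjacency dictionary; the DEG policy uses a lazy heap whose stale entries are simply skipped.

```python
# Minimal sketch of the DFS / BFS / DEG / RND frontier policies of Section 4.1,
# crawling a graph given as an adjacency dict {url: [successor urls]}.
import heapq
import random
from collections import deque

def crawl(graph, seed, strategy="BFS", rng=None):
    rng = rng or random.Random(0)
    crawled, order = set(), []
    frontier = deque([seed])                # BFS / DFS / RND frontier
    heap = [(0, seed)]                      # DEG frontier: (-in-degree, url)
    indeg = {seed: 0}
    seen = {seed}

    def pop():
        if strategy == "DEG":
            while heap:
                _, u = heapq.heappop(heap)  # stale entries are skipped lazily
                if u not in crawled:
                    return u
            return None
        if not frontier:
            return None
        if strategy == "BFS":
            return frontier.popleft()       # FIFO queue
        if strategy == "DFS":
            return frontier.pop()           # LIFO stack
        i = rng.randrange(len(frontier))    # RND: uniform choice
        frontier[i], frontier[-1] = frontier[-1], frontier[i]
        return frontier.pop()

    while (url := pop()) is not None:
        crawled.add(url)
        order.append(url)
        for succ in graph.get(url, []):
            indeg[succ] = indeg.get(succ, 0) + 1
            heapq.heappush(heap, (-indeg[succ], succ))
            if succ not in seen and succ not in crawled:
                seen.add(succ)
                frontier.append(succ)
    return order

# Example: crawl({"a": ["b", "c"], "b": ["c"], "c": ["a"]}, "a", strategy="DEG")
```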
4.2 Model description
Our model mimics a crawler strategy. It works in two steps: first constructing the set of pages, then adding the hyperlinks.

Constructing the set of pages. Each page p has two fields: its in-degree din(p) and its out-degree dout(p). In the first step of the crawl construction, we set a value for each of them. The in-degree and out-degree are set according to two independent Zipf laws. The exponent of each law is a parameter of the model; our model therefore depends on two parameters, βin (for in-degree) and βout (for out-degree). These values are well known for real crawls: following [8], we take βin = 2.1 and βout = 2.72. We shall have to choose pages at random according to their in-degree. To do so, n pages (the maximal size of the crawl) are generated and their in- and out-degrees are set. Then a multiset L is created in which each page p is duplicated din(p) times. The size of this multiset is the maximal number of hyperlinks. Each time we need to choose a page at random according to the in-degree law, we just remove one element from L.

Constructing the hyperlinks. Now the page degrees are pre-set, but the graph topology is not yet defined. An algorithm simulating a crawl adds the links. There are in fact four algorithms, depending on which crawling strategy is simulated. The generic algorithm is simply:
1. Remove a page p from the Unvisited Set and mark p as crawled;
2. Remove the first dout(p) pages from L;
3. Set these pages as the pages pointed to by p;
4. Add the unvisited ones to the Unvisited Set;
5. Go to 1.
The Unvisited Set is seeded with one or more pages. The way it is managed depends on which crawling strategy is simulated, i.e. which algorithm is chosen:
· for the DFS algorithm, the Unvisited Set is a stack (LIFO);
· for the BFS algorithm, it is a queue (FIFO);
· for the DEG algorithm, it is a priority queue (heap);
· for the RND algorithm, a random page is extracted from the Unvisited Set.
Because the average out-degree of a page is large enough, the crawling process does not stop until almost all pages have been crawled. The progress of the crawl (expressed in percent) is the fraction of crawled pages over n. As the number of crawled pages approaches n, artifacts appear because no more unknown pages are available. In our experiments (see the next section) we sometimes go up to 100% progress, but results are more realistic before 30%, while the crawl can still expand toward unknown pages.

Our model differs radically from preferential attachment or copy models because the neighborhood of a page is set not at page-writing time but at crawling time. So a page may point to known or unknown pages alike. A sketch of the whole generation process is given below.
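Putting the two steps together, here is a compact sketch of the generation process for the BFS variant. It is an illustration under simplifying assumptions that are not in the paper: degrees are drawn from Zipf laws truncated at 999, the multiset L is shuffled so that popping its last element follows the in-degree law, and the seed page is also drawn through L.

```python
# Compact sketch of the random crawl generator of Section 4.2 (BFS variant).
import random
from collections import deque

def random_crawl(n, beta_in=2.1, beta_out=2.72, seed=0):
    rng = random.Random(seed)
    support = range(1, 1000)                         # truncated degree range
    w_in = [d ** -beta_in for d in support]
    w_out = [d ** -beta_out for d in support]
    d_in = rng.choices(support, weights=w_in, k=n)   # step 1: pre-set degrees
    d_out = rng.choices(support, weights=w_out, k=n)

    L = [p for p in range(n) for _ in range(d_in[p])]
    rng.shuffle(L)                                   # popping from L follows the in-degree law

    links, crawled = {}, set()
    unvisited = deque([L.pop()])                     # seed page
    while unvisited and L:
        p = unvisited.popleft()                      # BFS: the Unvisited Set is a FIFO queue
        if p in crawled:
            continue
        crawled.add(p)
        targets = [L.pop() for _ in range(min(d_out[p], len(L)))]
        links[p] = targets                           # p's out-links are fixed at crawl time
        for q in targets:
            if q not in crawled:
                unvisited.append(q)
    return links, crawled

links, crawled = random_crawl(10_000)
```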
5. RESULTS
We present here simulation results using the different strategies, showing how the measurements evolve across time. Thanks to the scale-free effect, the actual number of pages does not matter, as long as it is big enough. We have used several graphs of different sizes but with the same exponents βin = 2.1 and βout = 2.72 (experimental values from [8]). Unless otherwise specified, we present results from BFS, the most used crawling strategy, and simulations of up to 20,000,000 crawled pages.

5.1 Degree distribution
At any step of the crawl, the actual degree distribution follows a Zipf law with the given parameters (2.1 and 2.72) with very small deviation (see Figures 1 and 2). This result is independent of the crawl strategy (BFS, etc.). It demonstrates that our generated crawls really are scale-free graphs.
Figure 1: Out-degree distribution at three steps of a BFS.
Figure 2: In-degree distribution at three steps of a BFS.

5.2 Small-world properties
The distribution of path lengths (Figure 3) clearly follows a Gaussian law for the BFS, DEG and RAND strategies. This distribution is plotted at progress 30%, but it does not change much across time, as shown in Figure 5. DFS produces far greater distances between vertices, and the distribution follows an unknown law (Figure 4). The diameter of DFS crawls is about 10% of the number of vertices! This is because DFS crawls are like long thin trees; it is why DFS is not used by real crawlers, and why this paper focuses on the three other crawling strategies. The clustering coefficient (Figure 6, computed on a 500,000-page simulation) is high and does not decrease much as the crawl grows. Our crawls definitely are small-world graphs.
Figure 3: Distribution of path lengths for BFS, DEG and RAND.
Figure 4: Distribution of path lengths for DFS (log/log scale).
Figure 5: Evolution of diameter and average path length across time for BFS, DEG and RAND.
Figure 6: Evolution of the clustering coefficient across time.

5.3 Bow-tie structure?
The relative sizes of the four bow-tie components (SCC, IN, OUT and OTHER) are roughly the same for the BFS, DEG and even RAND (but not DFS) strategies (Figure 7). When using only one seed, the size of the largest SCC converges toward two thirds of the size of the graph. These proportions differ from the crawl observations of [8], since the "in" and "others" parts are smaller. But with many seeds (which may be seen as many pages submitted to the crawler portal), the "in" component is larger and can contain up to one quarter of the pages. Our model thus replicates the bow-tie topology of genuine crawls quite well.
Figure 7: Evolution of the size of the largest SCC (left) and of the OUT component (right) across time. One seed, up to 500,000 pages in the crawl.
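The bow-tie decomposition used here can be computed from the largest strongly connected component by forward and backward reachability. A minimal sketch, assuming networkx and illustrated on a toy random digraph rather than on one of our generated crawls:

```python
# Minimal sketch of the bow-tie decomposition: largest SCC, the IN pages that
# reach it, the OUT pages reached from it, and the remaining OTHER pages.
import networkx as nx

def bow_tie(G: nx.DiGraph):
    scc = max(nx.strongly_connected_components(G), key=len)
    rep = next(iter(scc))                    # any representative of the SCC
    out = nx.descendants(G, rep) - scc       # reachable from the SCC
    in_ = nx.ancestors(G, rep) - scc         # reaching the SCC
    other = set(G) - scc - out - in_
    return {"SCC": scc, "IN": in_, "OUT": out, "OTHER": other}

# Example on a toy digraph: component sizes as fractions of the graph.
G = nx.gnp_random_graph(1000, 0.004, directed=True, seed=1)
sizes = {name: len(part) / len(G) for name, part in bow_tie(G).items()}
print(sizes)
```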
5.4 Cores
We used Agrawal's practical algorithm [1] for core enumeration (notice that the maximal core problem is NP-complete). Figure 8 gives the number of cores of a given minimal size for a crawl of up to 25,000 vertices. As shown, the number of cores depends strongly on the exponents of the Zipf laws, since high exponents mean sparser graphs. Our simulated crawls thus contain many cores, as real crawls do. Figure 9 shows that the number of (4,4)-cores (at least four hubs and four authorities) is proportional to n and after a while stays between n/100 and n/50.
Figure 8: Number of small cores (over 1000 pages).
hubs auth cores     hubs auth cores
  2    2   220        3    6     5
  2    3    83        3    7     2
  2    4    37        4    4    40
  2    5    14        4    5    14
  2    6    14        4    6     5
  2    7    14        4    7     2
  3    3    84        5    5    12
  3    4    37        5    6     3
  3    5    14        6    6     3
Figure 9: (4,4) cores distribution.

5.5 PageRank distribution and "quality pages" crawling speed
Figure 10 shows the PageRank distribution (PageRank is normalized to 1 and the logarithms are therefore negative). We found results similar to the observations of Pandurangan et al. [22]: the distribution is a Zipf law with exponent 2.1, and the crawl quickly converges to this value. Figure 11 shows the sum of the PageRank of the crawled pages across time (the PageRank is computed at the end of the crawl, so it varies from 0 at the beginning to 1 when the crawl stops). In very few steps, the BFS and DEG strategies find the very small set of pages that holds most of the total PageRank. This property of real BFS crawlers has been known since Najork and Wiener [20]. Our results can be compared to the crawling strategy experiments of Boldi et al. [5].
Figure 10: PageRank distribution.
Figure 11: PageRank capture.

5.6 Discovery speed and the frontier
Figure 12 shows another dynamical property: the discovery rate, i.e. the probability that the extremity of a link has already been crawled. It converges toward 40% for all strategies. This is an interesting scale-free property: after a while, the probability that a URL points to a new page remains high, about 60%. This "expander" property is very useful for real crawlers. The simulation shows that it does not depend only on the dynamic nature of the Web, but also on the crawling process itself. Figure 13 focuses on a well-known topological property of crawls that our simulations also reproduce: the very high number of sinks, regardless of crawl size. Notice that their existence is a problem for practical PageRank computation [21]. In other words, the large "out" component of the bow tie is very broad and short. Eiron et al. survey the ranking of the Web frontier [12].
Figure 12: Evolution of the discovery rate.
Figure 13: Evolution of the proportion of sinks (pages with no crawled successors) among crawled pages.
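Both measurements are straightforward to compute from a crawl trace. A small sketch (not the authors' code), assuming the crawl is given as a list of (page, out_links) pairs in discovery order, with the discovery-rate and sink definitions taken from the text above:

```python
# Small sketch: discovery rate and fraction of sinks of a crawl trace.
def discovery_rate_and_sinks(crawl):
    crawled_so_far, old_hits, links = set(), 0, 0
    for page, out_links in crawl:
        crawled_so_far.add(page)
        for target in out_links:
            links += 1
            if target in crawled_so_far:   # link extremity already crawled
                old_hits += 1
    # A sink is a crawled page none of whose successors were crawled (Figure 13).
    sinks = sum(1 for _, out in crawl
                if not any(t in crawled_so_far for t in out))
    return old_hits / links, sinks / len(crawl)

# Example: rate, sink_fraction = discovery_rate_and_sinks(crawl_in_discovery_order)
```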
6. CONCLUSION
As said in Section 2, a good crawl model should output graphs with the following properties:
1. highly clustered;
2. with a short characteristic path length;
3. with in- and out-degree distributions following Zipf laws;
4. with many sinks;
5. such that high-PageRank vertices (as computed in the final graph) are crawled early;
6. with a bow-tie structure.
As shown in Section 5, our model meets all these objectives. Property 3 is of course ensured by the model, but the other ones are results of the generating process. The basic assumption on degree distributions, together with the crawling strategy, is enough to mimic the properties observed in large real crawls. This is conceptually simpler than other models that have the same properties, like the Copy model [17]. The Bow-Tie structure we observe differs from [8] in that the largest strongly connected component is larger. But together with the other topological properties measured, it shows that our very simple model reproduces the topology of real crawls quite well, with fewer assumptions than [6] or [18]. Our approach is different from that of Web graph models, which mimic the page-writing process instead of the page-crawling process, yet it gives similar results. It points out that we need more numerical or other measures on graphs in order to analyze their structure.

BFS, RAND and DEG strategies are the most used in simple crawlers. We show that they produce very similar results for topological aspects. For dynamical aspects (PageRank capture for instance) BFS and DEG seem better, but they are harder to implement in a real crawler. DFS is definitely bad, and for this reason is not used by crawlers. Parallel crawlers use, however, more sophisticated strategies that were not modeled here.

So our random Web crawl model can be compared with the existing random Web graph models [2, 4, 6, 7, 9, 10, 17, 18]. But unlike them, it is not based on sociological assumptions about how pages are written; it is based on an assumption about the law followed by the page degrees and, for the structural properties, on the single assumption that the graph is output by a crawler. The design is thus quite different from that of random Web graph models, but the results are the same.

We can interpret this conclusion in a pessimistic way: it is hard to tell what the biases of crawling are. Indeed we have not supposed that the Web graph has any specific property other than degrees following a Zipf law, and yet our random crawls have all the properties of real crawls. This means that one can crawl anything whose degrees follow a Zipf law, not only the Web, and obtain crawls with the specific properties of Web crawls. So comparing the results of a Web graph model with real crawls might not be enough to assert that the model captures properties of the Web.
7. REFERENCES
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB '94, pages 487-499. Morgan Kaufmann Publishers Inc., 1994.
[2] W. Aiello, F. Chung, and L. Lu. A random graph model for massive graphs. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, pages 171-180, 2000.
[3] R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the world-wide web. Nature, 401:130-131, September 1999.
[4] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[5] P. Boldi, M. Santini, and S. Vigna. Do your worst to make the best: Paradoxical effects in PageRank incremental computations. In Workshop on Algorithms and Models for the Web-Graph, 2004.
[6] B. Bollobás, O. Riordan, J. Spencer, and G. Tusnády. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, 18(3):279-290, May 2001.
[7] A. Bonato. A survey of models of the web graph. In Proceedings of Combinatorial and Algorithmic Aspects of Networking, August 2004.
[8] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener, A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Graph structure of the web. In Proceedings of the Ninth International Conference on World Wide Web, pages 309-320. Foretec Seminars, Inc., 2000.
[9] F. Chung and L. Lu. Coupling on-line and off-line analyses for random power law graphs. Internet Mathematics, 1(4):409-461, 2003.
[10] C. Cooper, A. Frieze, and J. Vera. Random deletion in a scale-free random graph process. Internet Mathematics, 1(4):463-483, 2003.
[11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin. The shape of the Web and its implications for searching the Web, 2000.
[12] N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In WWW Conference, 2004.
[13] P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, volume 6, 1959.
[14] H. Ino, M. Kudo, and A. Nakamura. Partitioning of web graphs by community topology. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 661-669, 2005.
[15] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[16] J. M. Kleinberg. Hubs, authorities, and communities. ACM Computing Surveys, 31(4es):5, 1999.
[17] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, page 57. IEEE Computer Society, 2000.
[18] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In VLDB, pages 639-650, 1999.
[19] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. In WWW Conference, 1999.
[20] M. Najork and J. L. Wiener. Breadth-first crawling yields high-quality pages. In Proceedings of the 10th International World Wide Web Conference, Hong Kong, 2001.
[21] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Computer Science Department, Stanford University, 1998.
[22] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize web structure. In 8th Annual International Conference on Computing and Combinatorics, pages 330-339, 2002.
[23] Nature, 405:113, 11 May 2000.
[24] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440-442, 1998.