Extraction and Classification of Dense Communities in the Web

Yon Dourisboure, Filippo Geraci, Marco Pellegrini
Istituto di Informatica e Telematica - CNR, via Moruzzi 1, Pisa, Italy
yon.dourisboure@iit.cnr.it, filippo.geraci@iit.cnr.it, marco.pellegrini@iit.cnr.it
(Also with the Dipartimento di Ingegneria dell'Informazione, Università di Siena, Italy. Work partially supported by the EU Research and Training Network COMBSTRU (HPRN-CT-2002-00278) and by the Italian Registry of ccTLD ".it".)

ABSTRACT
The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information and services, and there is a growing interest in tools for understanding collective behaviors and emerging phenomena in the WWW. In this paper we focus on the problem of searching and classifying communities in the web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the web graph (where web pages are nodes and hyperlinks are arcs). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm to web graphs built on three publicly available large crawls of the web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the web graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density). It is still about 80% even for communities of twenty nodes with density over 50% of the arcs present. At the lower extremes the algorithm catches 35% of dense communities made of ten nodes. We complete our Community Watch system by clustering the communities found in the web graph into groups that are homogeneous by topic and labelling each group by representative keywords.

Categories and Subject Descriptors
F.2.2 [Nonnumerical Algorithms and Problems]: Computations on Discrete Structures; H.2.8 [Database Applications]: Data Mining; H.3.3 [Information Search and Retrieval]: Clustering

General Terms
Algorithms, Experimentation

Keywords
Web graph, Communities, Dense Subgraph.

1. INTRODUCTION
Why are cyber-communities important? Searching for social structures in the World Wide Web has emerged as one of the foremost research problems related to the breathtaking expansion of the World Wide Web. Since the pioneering work of Gibson, Kleinberg and Raghavan [19] there has been a keen academic as well as industrial interest in developing efficient algorithms for collecting, storing and analyzing the pattern of pages and hyperlinks that form the World Wide Web. Nowadays many communities of the real world that want to have a major impact and recognition are represented in the Web. Thus the detection of cyber-communities, i.e. sets of sites and pages sharing a common interest, also improves our knowledge of the world in general.
Cyber-communities as dense subgraphs of the web graph. The most popular way of defining cyber-communities is based on the interpretation of WWW hyperlinks as social links [10]. For example, the web page of a conference contains a hyperlink to each of its sponsors; similarly, the home page of a car lover contains links to all famous car manufacturers. In this way, the Web is modelled by the web graph, a directed graph in which each vertex represents a web page and each arc represents a hyperlink between the two corresponding pages. Intuitively, cyber-communities correspond to dense subgraphs of the web graph.
An open problem.
Monika Henzinger, in a recent survey on algorithmic challenges in web search engines [26], remarks that the Trawling algorithm of Kumar et al. [31] is able to enumerate dense bipartite graphs in the order of tens of nodes, and states this open problem: "In order to more completely capture these cyber-communities, it would be interesting to detect much larger bipartite subgraphs, in the order of hundreds or thousands of nodes. They do not need to be complete, but should be dense, i.e. they should contain at least a constant fraction of the corresponding complete bipartite subgraphs. Are there efficient algorithms to detect them? And can these algorithms be implemented efficiently if only a small part of the graph fits in main memory?"
Theoretical results. From a theoretical point of view, the dense k-subgraph problem, i.e. finding the densest subgraph with k vertices in a given graph, is clearly NP-hard (this is easy to see by a reduction from the max-clique problem). Some approximation algorithms with a non-constant approximation factor can be found in the literature, for example in [24, 14, 13], none of which seems to be of practical applicability. Studies about the inherent complexity of obtaining a constant-factor approximation algorithm are reported in [25] and [12].
Some heuristic methods. In the literature there are a few heuristic methods to extract communities from the web (or from large graphs in general). The most important and ground-breaking algorithm is due to Kumar et al. [31], where the authors aim at enumerating complete bipartite subgraphs with very few vertices and then extend them to dense bipartite subgraphs by local searches (based on the HITS ranking algorithm). The technique in [31] is aimed at detecting small complete bipartite communities, of the order of ten vertices, while the subsequent community expansion is guided by the hub and authority scores of the HITS algorithm (regardless of further density considerations). In [16] Flake, Lawrence, Giles and Coetzee use the notion of maximum flow to extract communities, but they are limited to communities for which an initial seed node is available. In [20] Gibson, Kumar and Tomkins use a new sampling method (shingling), based on the notion of min-wise independent permutations introduced in [7], to evaluate the similarity of the neighborhoods of vertices and then extract very large and very dense subgraphs of the web-host graph. This technique is specifically aimed at detecting very large and dense subgraphs in a graph of quite large average degree, such as the web-host graph.
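As a point of reference for how such neighborhood-similarity sampling works, the sketch below estimates the Jaccard overlap of two successor sets with min-wise hashing; the hash family, the number of hash functions, and the toy sets are illustrative assumptions, not the exact construction used in [20] or [7].

```python
import random

def minhash_signature(s, seeds, prime=2_147_483_647):
    """Return one min-hash value per (a, b) seed for a set of integer node ids."""
    return [min(((a * x + b) % prime) for x in s) for a, b in seeds]

def estimated_jaccard(sig_u, sig_v):
    """Fraction of coordinates on which two signatures agree (estimates |A∩B|/|A∪B|)."""
    agree = sum(1 for x, y in zip(sig_u, sig_v) if x == y)
    return agree / len(sig_u)

# Toy successor sets of two vertices of a host graph (hypothetical node ids).
succ_u = {1, 2, 3, 4, 5, 6}
succ_v = {4, 5, 6, 7, 8}

random.seed(0)
seeds = [(random.randrange(1, 1 << 30), random.randrange(0, 1 << 30)) for _ in range(128)]
sig_u = minhash_signature(succ_u, seeds)
sig_v = minhash_signature(succ_v, seeds)
print(estimated_jaccard(sig_u, sig_v))   # close to the true Jaccard overlap 3/8
```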
The authors in [20, Section 4.2] remark that (with a reasonable set of parameters) the shingling method is effective for dense subgraphs of over 50 nodes but breaks down below 24 nodes. Thus there is room for improvement via alternative approaches.
Our contribution. In this paper we propose two new simple characterizations of dense subgraphs. From these characterizations we derive a new heuristic based on a two-step filtering approach. In the first filtering step we estimate efficiently the average degree and the similarity of the neighbor sets of the vertices of a candidate community. This initial filtering is very efficient since it is based only on degree counting. The second filtering step is an iterative refinement of the candidate community aimed at removing small-degree vertices (relative to the target average density), thus increasing the average degree of the remaining "core" community. We test our algorithm on very large snapshots of the web graph (both the global web graph and some large national domains) and we give experimental evidence of the effectiveness of the method. We have coupled the community extraction algorithm with a clustering tool that groups the communities found into groups that are homogeneous by topic, and we provide a useful user interface for exploring the community data. The user interface of the Community Watch system is publicly available at http://comwatch.iit.cnr.it. To the best of our knowledge this is the first publicly available tool to visualize cyber-communities.
Target size. In our method the user supplies a target threshold t and the algorithm lists all the communities found with average degree at least t. Naturally, the lower the value of t, the more communities will be found and the slower the method. In our experiments the method is still effective for values of t quite close to the average degree of the web graphs (say within a factor of 2), and for communities of a few tens of nodes. Our heuristic is particularly efficient at detecting communities of large and medium size, while the method in [31] is explicitly targeted towards communities with a small complete bipartite core-set.
Final applications. The detection of dense subgraphs of the web graph might serve as a stepping stone towards several broader goals. One possible goal is to improve the performance of critical tools in the WWW infrastructure such as crawlers and the indexing and ranking components of search engines. In this case dense subgraphs are often associated with negative phenomena such as the Tightly Knit Community (TKC) effect [34], link-farm spamming [23], and data duplication (mirroring) [2]. In this paper, following [33], we place instead the accent on the "positive" aspect of cyber-communities: our intent at the moment is to provide an exploratory tool capable of extracting a synthetic description of the current status and current trends in the social structure of the WWW.
Visualization of the Communities. Given a single dense community it is easy, by manual inspection, to gain some hint as to its general area of interest and purpose; however, gaining insight on hundreds (or thousands) of communities can become a tiresome task. Therefore we have coupled our dense-subgraph extraction algorithm with a visualization tool that supports this exploratory approach. This tool is based on the efficient clustering/labelling system described in detail in [17, 18].
In a nutshell, from each community, using standard IR techniques, we extract a vector of representative words with weights related to the word frequencies (word-vector). A clustering algorithm is applied to the word-vectors and we obtain groups of communities that are homogeneous by topic; moreover, a list of representative keywords is generated for each cluster so as to guide the user in assessing the intrinsic topic of each cluster of communities.
Mirrors and Link-farms. Information retrieval on the WWW is complicated by the phenomenon of "data replication" (mirroring) and by several forms of spamming (e.g. link farms). For mirrors, off-line detection of such structures using the techniques in [2] implies pairwise comparisons of all (or, if some heuristic filtering is used, most) pairs of web sites, which is an expensive computation. Link-farm detection involves techniques borderline with those used for community detection. In our context, however, efficiency and effectiveness of the community detection algorithm are not really impaired by such borderline phenomena. For this reason we do not attempt to filter out these phenomena before applying our algorithms. Instead we envision these steps (mirror detection and link-farm detection) as a post-processing phase of our Community Watch system. In particular, since we perform both the community detection and the community clustering efficiently, we can apply mirror and link-farm detection separately and independently in each cluster, thus retaining the overall system scalability.

2. PREVIOUS WORK
Given the hypertext nature of the WWW, one can approach the problem of finding cyber-communities by using as main source the textual content of the web pages, the hyperlink structure, or both. Among the methods for finding groups of coherent pages based only on text content we can mention [8]. Recommendation systems usually collect information on social networks from a variety of sources (not only link structure), e.g. [29]. Problems of a similar nature appear in the areas of social network analysis, citation analysis and bibliometrics, where, however, given the relatively smaller data sets involved (relative to the WWW), efficiency is often not a critical issue [35]. Since the pioneering work [19] the prevailing trend in the Computer Science community is to use mainly the link structure as the basis of the computation. Previous literature on the problem of finding cyber-communities using link-based analysis of the web graph can be broadly split into two large groups. In the first group are methods that need an initial seed of a community to start the process of community identification. Assuming the availability of a seed for a possible community naturally directs the computational effort to the region of the web graph closest to the seed and suggests the use of sophisticated but computationally intensive techniques, usually based on max-flow/min-cut approaches. In this category we can list the work of [19, 15, 16, 27, 28]. The second group of algorithms does not assume any seed and aims at finding all (or most) of the communities by exploring the whole web graph. In this category falls the work of [31, 30, 36, 32, 20]. Certain particular artifacts in the WWW called "link farms", whose purpose is to bias search engines' PageRank-type ranking algorithms, are a very particular type of "artificial" cyber-community that is traced using techniques bordering with those used to find dense subgraphs in general. See for example [37, 3].
Abello et al. [1] propose a method based on local searches with random restarts to escape local minima, which is quite computationally intensive. A graph representing point-to-point telecommunications with 53M nodes and 170M edges is used as input. The equipment used is a multiprocessor machine with 10 200MHz processors and a total of 6GB of RAM. A timing result of roughly 36 hours is reported in [1] for an experiment handling a graph obtained by removing all nodes of degree larger than 30, thus, in effect, operating on a reduced graph of 9K nodes and 320K edges. Even discounting for the difference in equipment, we feel that the method in [1] would not scale well to searching for medium-density and medium-size communities in graphs as large as those we are able to handle (up to 20M nodes and 180M edges after cleaning). Girvan and Newman [21] define a notion of local density based on counting the number of shortest paths in a graph sharing a given edge. This notion, though powerful, entails algorithms that do not scale well to the size of the web graph. The spectral methods described in [9] also lack scalability (in [9] the method is applied to graphs from psychological experiments with 10K nodes and 70K edges). A system similar in spirit to the one proposed in this paper is Campfire, described in [33], which is based on the Trawling algorithm for finding the dense core, on HITS for community expansion, and on an indexing structure of community keywords that can be queried by the user. Our system is different from Campfire first of all in the algorithms used to detect communities, but also in the final user interface: we provide a clustering/labelling interface that is suitable for giving a global view of the available data.

3. PRELIMINARIES

3.1 Notions and notation
A directed graph G = (V, E) consists of a set V of vertices and a set E of arcs, where an arc is an ordered pair of vertices. The web graph is the directed graph representing the Web: vertices are pages and arcs are hyperlinks. Let u, v be any vertices of a directed graph G; if there exists an arc a = (u, v), then a is an outlink of u and an inlink of v. Moreover, v is called a successor of u, and u a predecessor of v. For every vertex u, $N^+(u)$ denotes the set of its successors and $N^-(u)$ the set of its predecessors. Then the outdegree and the indegree of u are respectively $d^+(u) = |N^+(u)|$ and $d^-(u) = |N^-(u)|$. Let X be any subset of V; the successors and the predecessors of X are respectively defined by $N^+(X) = \bigcup_{u \in X} N^+(u)$ and $N^-(X) = \bigcup_{u \in X} N^-(u)$. Observe that $X \cap N^+(X) \neq \emptyset$ is possible. A graph G = (V, E) is called a complete bipartite graph if V can be partitioned into two disjoint subsets X and Y such that, for every vertex u of X, the set of successors of u is exactly Y, i.e., $\forall u \in X,\ N^+(u) = Y$. Consequently, for every node $v \in Y$ its predecessor set is X. Finally, let $\tilde{N}(u)$ be the set of vertices that share at least one successor with u: $\tilde{N}(u) = \{ w \in V \mid N^+(u) \cap N^+(w) \neq \emptyset \}$.
Two more useful definitions. Define for sets A and B the relation $A \approx_\epsilon B$ when $|A \,\triangle\, B| \leq \epsilon |B|$, for a constant $\epsilon$. Define for positive numbers a, b the relation $a \approx_\epsilon b$ when $|a - b| \leq \epsilon |a|$, for a constant $\epsilon$. When the constant $\epsilon$ can be inferred from the context the subscript is omitted.

3.2 Definitions of Web Community
The basic argument linking the (informal) notion of web communities and the (formal) notion of dense subgraphs is developed and justified in [31]. It is summarized in [31] as follows: "Web communities are characterized by dense directed bipartite subgraphs". Without entering into a formal definition of density, in [31] the hypothesis is stated that "a random large enough and dense enough bipartite subgraph of the Web almost surely has a core", i.e. a complete bipartite subgraph of size (i, j) for some small integer values i and j. A standard definition of $\gamma$-density, as used for example in [20], is as follows: a $\gamma$-dense bipartite subgraph of a graph G = (V, E) is a disjoint pair of sets of vertices $X, Y \subseteq V$ such that $|\{(x, y) \in E \mid x \in X,\ y \in Y\}| \geq \gamma |X||Y|$, for a real parameter $\gamma \in [0,1]$. Note that $\gamma |Y|$ is also a lower bound on the average out-degree of a node of X. Similarly, a dense quasi-clique is a subset $X \subseteq V$ such that $|\{(x, y) \in E \mid x \in X,\ y \in X\}| \geq \gamma \binom{|X|}{2}$, for a real parameter $\gamma \in [0,1]$, as in [1, 14]. The notion of a core of a dense subgraph in [31] is consistent with the notion of $\gamma$-density for values of $\gamma$ large enough, where the notions of "almost surely", (i, j)-core, "large enough" and "dense enough" must be interpreted as functions of $\gamma$. Our formulation unifies the notion of a $\gamma$-dense bipartite subgraph and of a clique as a pair of not necessarily disjoint sets of vertices $X, Y \subseteq V$ such that $\forall x \in X,\ |N^+(x) \cap Y| \geq \gamma |Y|$ and $\forall y \in Y,\ |N^-(y) \cap X| \geq \gamma' |X|$, for two constants $\gamma$ and $\gamma'$. Our definition implies the one in [20], and conversely any $\gamma$-dense subgraph in the sense of [20] contains a dense subgraph in the sense of our definition (it is sufficient to eliminate from X the nodes of outdegree smaller than $\gamma|Y|$, and from Y those of indegree smaller than $\gamma|X|$). Thus a community in the web is defined by two sets of pages: the set Y of the centers of the community, i.e. pages sharing a common topic, and the set X of the fans, i.e. pages that are interested in the topic. Typically, every fan contains a link to most of the centers; at the same time, there are few links among centers (often for commercial reasons) and among fans (fans may not know each other).

4. HEURISTIC FOR LARGE DENSE SUBGRAPHS EXTRACTION

4.1 Description
The definition of $\gamma$-dense subgraph can be used to test whether a pair of sets $X, Y \subseteq V$ is a $\gamma$-dense subgraph (both bipartite and clique). However, it cannot be used to efficiently find a $\gamma$-dense subgraph (X, Y) embedded in G. In the rest of this section we discuss a sequence of properties and then proceed by relaxing them up to the point of having properties that can be computed directly on the input graph G. These properties hold exactly (with equality) for an isolated complete bipartite graph (and clique), and hold approximately for an isolated $\gamma$-dense graph, where the measure of approximation is related to the parameter $\gamma$. At the end, however, we need a final relaxation step in which we consider the subgraphs as embedded in G.

4.1.1 Initial intuitive outline
First of all, let us give an initial intuition of the reason why our heuristic might work. Let G = (V, E) be a sparse directed graph, and let (X, Y) be a $\gamma$-dense subgraph within G. Then, let u be any vertex of X. Since (X, Y) is a $\gamma$-dense subgraph, by definition we have $\forall u \in X,\ N^+(u) \approx Y$, and symmetrically $\forall v \in Y,\ N^-(v) \approx X$. For values $\gamma > 0.5$ the pigeonhole principle ensures that any two nodes u and v of X always share a successor in Y, thus $X \subseteq \tilde{N}(u)$, and, if every vertex of Y has at least a predecessor in X, also $Y \subseteq N^+(\tilde{N}(u))$. The main idea now is to estimate quickly, for every vertex u of G, the degree of similarity of $N^+(u)$ and $N^+(\tilde{N}(u))$. In the case of an isolated complete bipartite graph $N^+(u) = Y$ and $N^+(\tilde{N}(u)) = Y$. For an isolated $\gamma$-dense bipartite graph, we have $N^+(u) \approx Y$ and $N^+(\tilde{N}(u)) = Y$. The conjecture is that, when the $\gamma$-dense bipartite graph is a subgraph of G, and thus we only have the weaker relationship $Y \subseteq N^+(\tilde{N}(u))$, the excess $N^+(\tilde{N}(u)) \setminus Y$ is small compared to Y, so that the comparison of the two sets is still significant for detecting the presence of a dense subgraph.
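To make this outline concrete, the following sketch (our own illustration, not the paper's implementation) computes $\tilde{N}(u)$ and the candidate center set $N^+(\tilde{N}(u))$ on a toy graph stored as successor/predecessor dictionaries; the representation and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def build_transpose(succ):
    """Predecessor lists N-(.) from successor lists N+(.)."""
    pred = defaultdict(set)
    for u, outs in succ.items():
        for v in outs:
            pred[v].add(u)
    return pred

def shared_successor_set(u, succ, pred):
    """N~(u): vertices sharing at least one successor with u."""
    return {w for v in succ.get(u, ()) for w in pred[v]}

def candidate_centers(u, succ, pred):
    """N+(N~(u)): successors of the vertices that share a successor with u."""
    return {y for w in shared_successor_set(u, succ, pred) for y in succ.get(w, ())}

# Toy web graph: fans f1..f3 point to centers c1..c3, plus a background "noise" vertex.
succ = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2"},
    "f3": {"c2", "c3"},
    "z":  {"c3", "x"},          # noise vertex touching the community
}
pred = build_transpose(succ)

u = "f1"
print(shared_successor_set(u, succ, pred))          # {'f1', 'f2', 'f3', 'z'}
print(candidate_centers(u, succ, pred) - succ[u])   # excess beyond N+(u): {'x'}
```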
Figure 1: A complete bipartite subgraph (X, Y) with |X| = |Y| = x, and some "noise" vertices Z.

4.1.2 The isolated complete case
To gain in efficiency, instead of evaluating the similarity of successor sets, we estimate the similarity of out-degrees by counting. In a complete bipartite graph (X, Y) we have that $\forall u \in X,\ N^+(u) = Y$; therefore $\forall u, v \in X,\ N^+(u) = N^+(v)$. The set of vertices sharing a successor with u is $\tilde{N}(u) = X$, and moreover $N^+(\tilde{N}(u)) = Y$. Passing from relations among sets to relations among cardinalities, we have that $\forall u, v \in X,\ d^+(u) = d^+(v)$, and the degree of any node coincides with the average out-degree:
$$ d^+(u) = \frac{1}{|\tilde{N}(u)|} \sum_{v \in \tilde{N}(u)} d^+(v). $$

4.1.3 The isolated $\gamma$-dense case
In a $\gamma$-dense bipartite graph we still have $\tilde{N}(u) = X$, but now $\gamma|Y| \leq d^+(v) \leq |Y|$ for every $v \in X$. Thus we can conclude that
$$ \Big| d^+(u) - \frac{1}{|\tilde{N}(u)|} \sum_{v \in \tilde{N}(u)} d^+(v) \Big| \;\leq\; (1 - \gamma)\,|Y| \;\leq\; \frac{1-\gamma}{\gamma}\, d^+(u). $$
For $\gamma \to 1$ the difference tends to zero. Finally, assuming that for a $\gamma$-dense bipartite subgraph of G the excesses $\tilde{N}(u) \setminus X$ and $N^+(\tilde{N}(u)) \setminus Y$ give a small contribution, we can still use the above test as evidence of the presence of a dense subgraph. At this point we pause, state our first criterion, and subject it to criticism in order to improve it.
Criterion 1. If $d^+(u)$ and $|\tilde{N}(u)|$ are big enough and
$$ d^+(u) \;\approx\; \frac{1}{|\tilde{N}(u)|} \sum_{v \in \tilde{N}(u)} d^+(v), $$
then $\big(\tilde{N}(u),\ N^+(\tilde{N}(u))\big)$ might contain a community.

4.1.4 A critique of Criterion 1
Unfortunately, Criterion 1 cannot be used yet in this form. One reason is that computing $\tilde{N}(u)$ for every vertex u of big enough outdegree in the web graph G is not scalable. Moreover, the criterion is not robust enough with respect to noise from the graph. Assume that the situation depicted in Figure 1 occurs: $u \in X$, (X, Y) induces a complete bipartite graph with $|Z| = |X| = |Y| = x$, and each vertex of Y has one more predecessor of degree 1 in Z. Then $\tilde{N}(u) = X \cup Z$, so $\frac{1}{|\tilde{N}(u)|} \sum_{v \in \tilde{N}(u)} d^+(v) = \frac{x+1}{2}$, which is far from $d^+(u) = x$, so (X, Y) will not be detected.

4.1.5 Overcoming the drawbacks of Criterion 1
Because of the shortcomings of Criterion 1 we describe a second criterion that is more complex to derive but computationally more effective and robust. As before, we start with the case of the isolated complete bipartite graph. Consider a node $u \in X$: clearly $N^+(u) = Y$, and $\forall y \in N^+(u),\ N^-(y) = X$, thus $\forall w \in N^-(y),\ N^+(w) = Y$. Turning to the cardinalities: for a node $u \in X$, $\forall y \in N^+(u),\ \forall w \in N^-(y),\ d^+(w) = |Y|$. Thus also the average value of the outdegrees of the nodes in $N^-(y)$ is $|Y|$. In formulae: given $u \in X$ and $y \in N^+(u)$,
$$ \frac{1}{d^-(y)} \sum_{w \in N^-(y)} d^+(w) = |Y|. $$
Next we average over all $y \in N^+(u)$, obtaining: given $u \in X$,
$$ \frac{1}{\sum_{y \in N^+(u)} d^-(y)} \sum_{y \in N^+(u)} \sum_{w \in N^-(y)} d^+(w) = |Y|. $$
Finally, since $d^+(u) = |Y|$, we have the equality
$$ \frac{1}{\sum_{y \in N^+(u)} d^-(y)} \sum_{y \in N^+(u)} \sum_{w \in N^-(y)} d^+(w) = d^+(u). $$
Next we see how the above equality transforms for isolated $\gamma$-dense graphs. Consider a node $u \in X$: now $N^+(u) \subseteq Y$ and, for a node $v \in Y$, $N^-(v) \subseteq X$. Thus we get the bounds
$$ \gamma^2 |X||Y| \;\leq\; \sum_{y \in N^+(u)} d^-(y) \;\leq\; |X||Y|, \qquad \gamma^3 |Y|^2 |X| \;\leq\; \sum_{y \in N^+(u)} \sum_{w \in N^-(y)} d^+(w) \;\leq\; |Y|^2 |X|. $$
Thus the ratio of the two quantities is in the range $[\gamma^3 |Y|,\ |Y|/\gamma^2]$. On the other hand $\gamma|Y| \leq d^+(u) \leq |Y|$. Therefore the difference of the two terms is bounded by $|Y|\,\frac{1-\gamma^3}{\gamma^2}$, which is bounded by $d^+(u)\,\frac{1-\gamma^3}{\gamma^3}$. Again, for $\gamma \to 1$ the difference tends to zero. Thus, in an approximate sense, the relationship is preserved for isolated $\gamma$-dense bipartite graphs. Clearly now we make a further relaxation by considering the sets $N^+(\cdot)$ and $N^-(\cdot)$ as referred to the overall graph G, instead of just the isolated pair (X, Y).
Criterion 2. If $d^+(u)$ and $|\tilde{N}(u)|$ are big enough and
$$ d^+(u) \;\approx\; \frac{1}{\sum_{y \in N^+(u)} d^-(y)} \sum_{y \in N^+(u)} \sum_{w \in N^-(y)} d^+(w), $$
then $\big(\tilde{N}(u),\ N^+(\tilde{N}(u))\big)$ might contain a community.

4.1.6 Advantages of Criterion 2
There are several advantages in using Criterion 2. The first advantage is that the relevant summations are defined over sets $N^+(\cdot)$ and $N^-(\cdot)$ that are encoded directly in the graphs G and $G^T$. We compute $\tilde{N}(u)$ in the second phase only for vertices that are likely to belong to a community. The second advantage is that the result of the inner summation can be pre-computed, stored and reused. We just need to store two tables of size n (n = |V|): one containing the values of $\sum_{v \in N^-(w)} d^+(v)$, the other containing the indegrees. Thirdly, Criterion 2 is much more robust to noise than Criterion 1, since the outdegree of every vertex of X is counted many times. For example, in the situation depicted in Figure 1 we obtain the following result: $\forall u \in X$ and $\forall w \in N^+(u)$, $\sum_{v \in N^-(w)} d^+(v) = x^2 + 1$. Thus, $\forall u \in X$,
$$ \frac{1}{\sum_{w \in N^+(u)} d^-(w)} \sum_{w \in N^+(u)} \sum_{v \in N^-(w)} d^+(v) = \frac{x(x^2+1)}{x(x+1)} \approx x. $$

4.1.7 Final refinement step
Finally, let u be a vertex that satisfies Criterion 2. We construct explicitly the two sets $\tilde{N}(u)$ and $N^+(\tilde{N}(u))$. Then we extract the community (X, Y) contained in $\big(\tilde{N}(u),\ N^+(\tilde{N}(u))\big)$ thanks to an iterative loop in which we remove from $\tilde{N}(u)$ all vertices v for which $N^+(v) \cap N^+(\tilde{N}(u))$ is small, and we remove from $N^+(\tilde{N}(u))$ all vertices w for which $N^-(w) \cap \tilde{N}(u)$ is small.

Algorithm RobustDensityEstimation
  Input: a directed graph G = (V, E); a threshold for degrees
  Result: a set S of dense subgraphs detected by vertices of outdegree > threshold
  begin
    Init:
      forall u of G do
        forall v ∈ N-(u) do
          TabSum[u] ← TabSum[u] + d+(v)
        end
      end
    Search:
      forall u that is not already a fan of a community and such that d+(u) > threshold do
        sum ← 0; nb ← 0;
        forall v ∈ N+(u) do
          sum ← sum + TabSum[v]; nb ← nb + d-(v);
        end
        if sum/nb ≈ d+(u) and nb > d+(u) × threshold then
          S ← S ∪ ExtractCommunity(u);
        end
      end
      return S;
  end

Figure 2: RobustDensityEstimation performs the main filtering step.

4.2 Algorithms
In Figures 2 and 3 we give the pseudo-code of our heuristic. Algorithm RobustDensityEstimation detects the vertices that satisfy the filtering formula of Criterion 2; then function ExtractCommunity computes $\tilde{N}(u)$ and $N^+(\tilde{N}(u))$ and extracts the community of which u is a fan. These two algorithms are a straightforward application of the formula in Criterion 2.
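For concreteness, a minimal Python rendering of this filtering step is sketched below, assuming an adjacency-dictionary representation of G and its transpose; the tolerance parameter eps and the extract_community callback are illustrative stand-ins for the paper's approximate-equality test and ExtractCommunity, not the authors' implementation.

```python
def robust_density_estimation(succ, pred, threshold, eps=0.3, extract_community=None):
    """Return candidate fan vertices u whose neighborhood passes the Criterion 2 test.

    succ[u] / pred[u] are the successor and predecessor sets N+(u) / N-(u).
    """
    nodes = set(succ) | set(pred)
    out_deg = {u: len(succ.get(u, ())) for u in nodes}
    in_deg = {u: len(pred.get(u, ())) for u in nodes}

    # TabSum[w] = sum of d+(v) over the predecessors v of w (precomputed once).
    tab_sum = {w: sum(out_deg.get(v, 0) for v in pred.get(w, ())) for w in nodes}

    candidates = []
    for u, du in out_deg.items():
        if du <= threshold:
            continue
        total = sum(tab_sum.get(v, 0) for v in succ[u])   # double summation of Criterion 2
        nb = sum(in_deg.get(v, 0) for v in succ[u])       # normalizing factor
        if nb > du * threshold and abs(total / nb - du) <= eps * du:
            candidates.append(u)
            if extract_community is not None:
                extract_community(u)
    return candidates
```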
Observe that the last loop of function ExtractCommunity removes logically from the graph all arcs of the current community, but not the vertices. Moreover, a vertex can be a fan of one community and a center of several communities. In particular, it can be a fan and a center of the same community, so we are able to detect dense quasi-bipartite subgraphs as well as quasi-cliques.

4.3 Handling of overlapping communities
Our algorithm can also capture partially overlapping communities. This case may happen when we have older communities that are in the process of splitting, or newly formed communities in the process of merging. However, overlapping centers and overlapping fans are treated differently, since the algorithm is not fully symmetric in handling fans and centers.
Communities sharing fans. The case depicted in Figure 4(a) is that of overlapping fans. If the overlap $X \cap X'$ is large with respect to $X \cup X'$, then our algorithm will just return the union of the two communities $(X \cup X',\ Y \cup Y')$. Otherwise, when the overlap $X \cap X'$ is not large, the algorithm will return two communities: either the pairs (X, Y) and $(X' \setminus X,\ Y')$, or the pairs $(X', Y')$ and $(X \setminus X',\ Y)$. So we will report both communities having their fan-sets overlapping, but the representative fan sets will be split. The notion of large/small overlap is a complex function of the degree threshold and other parameters of the algorithm. In either case we do not miss any important structure of our data.
Communities sharing centers. Note that the behavior is different in the case of overlapping centers. A vertex can be a center of several communities. Thus, in the case depicted in Figure 4(b), if the overlap $Y \cap Y'$ is big with respect to $Y \cup Y'$, then we will return the union of the two communities $(X \cup X',\ Y \cup Y')$; otherwise we will return exactly the two overlapping communities (X, Y) and $(X', Y')$. In either case we do not miss any important structure of our data.

4.4 Complexity analysis
We now perform a semi-empirical complexity analysis in the standard RAM model. The graph G and its transpose $G^T$ are assumed to be stored in main memory in such a way as to be able to access a node in time O(1) and the links incident to it in time O(1) per link. We need O(1) extra storage per node to store the in-degree, the out-degree, a counter TabSum, and a tag bit. Algorithm RobustDensityEstimation visits each edge at most once and performs O(1) operations per edge, thus it has cost O(|V| + |E|), except for the cost of the invocations of the ExtractCommunity function. Potentially the total time cost of the invocations of ExtractCommunity is large; however, experimentally the time cost grows only linearly with the number of communities found. This behavior can be explained as follows. We measured that less than 30% of the invocations do not result in the construction of a community (see Table 5), and that the inner refinement loop converges on average in less than 3 iterations (see Table 4). If the number of nodes and edges of a community found by ExtractCommunity for u is proportional by a constant to the size of the bipartite subgraph $\big(\tilde{N}(u),\ N^+(\tilde{N}(u))\big)$, then we are allowed to charge all operations within invocations of ExtractCommunity to the size of the output. Under these conditions each edge is charged on average a constant number of operations, thus explaining the observed overall behavior O(|V| + |E| + |Output|).

Function ExtractCommunity
  Input: a vertex u of a directed graph G = (V, E);
         a slackness parameter ε
  Result: a community of which u is a fan
  begin
    Initialization:
      forall v ∈ N+(u) do
        forall w ∈ N-(v) that is not already a fan of a community do
          if d+(w) > (1 - ε) d+(u) then mark w as potential fan
        end
      end
      forall potential fan v do
        forall w ∈ N+(v) do mark w as potential center; end
      end
    Iterative refinement:
      repeat
        unmark potential fans of small local outdegree;
        unmark potential centers of small local indegree;
      until the numbers of potential fans and centers have not changed significantly
    Update global data structures:
      forall potential fan v do
        forall w ∈ N+(v) that is also a potential center do
          TabSum[w] ← TabSum[w] - d+(v);
          d-(w) ← d-(w) - 1;
        end
      end
      return (potential fans, potential centers);
  end

Figure 3: ExtractCommunity extracts the dense subgraph.

Figure 4: Two cases of community intersection: (a) communities sharing fans; (b) communities sharing centers.

4.5 Scalability
The algorithm we described, including the initial cleaning steps, can easily be converted to work in the streaming model, except for the procedure ExtractCommunity, which seems to require random access to data in core memory. Here we want to estimate, with a "back of the envelope" calculation, the limits of this approach using core memory. Broder et al. [6] in the year 2000 estimated the size of the indexable web graph at 200M pages and 1.5G edges (thus an average degree of about 7.5 links per page, which is consistent with the average degree 8.4 of the WebBase data of 2001). A more recent estimate by Gulli and Signorini [22] in 2005 gives a count of 11.5G pages. The latest index-size war ended with Google claiming an index of 25G pages. The average degree of the web graph has been increasing recently due to the dynamic generation of pages with high degree, and some measurements give a count of 40 (S. Vigna and P. Boldi, personal communication). The initial cleaning phase reduces the WebBase graph by a factor 0.17 in the node count and 0.059 in the edge count. Using these coefficients, the cleaned web graph might have 4.25G nodes and 59G arcs. The compression techniques in [5] achieve, for the WebBase dataset, an overall performance of 3.08 bits/edge. These coefficients applied to our cleaned web graph give a total of 22.5 GB to store the graph. Storing both the graph G and its transpose we need to double the storage (although here some saving might be achieved), thus reaching an estimate of about 45 GB. With current technology this amount of core memory can certainly be provided by state-of-the-art multiprocessor mainframes (e.g. the IBM System z9 sells in configurations ranging from 8 to 64 GB of RAM).

5. TESTING EFFECTIVENESS
By construction, algorithms RobustDensityEstimation and ExtractCommunity return a list of dense subgraphs (where size and density are controlled by the parameters t and ε). Using standard terminology from Information Retrieval, we can say that full precision is guaranteed by default. In this section we estimate the recall properties of the proposed method. This task is complex since we have no efficient alternative method for obtaining a guaranteed ground truth. Therefore we proceed as follows. We add some arcs to the graph representing the Italian domain of the year 2004, so as to create new dense subgraphs. Afterwards, we observe how many of these new "communities" are detected by the algorithm, which is run blindly with respect to the artificially embedded communities.
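A minimal sketch of this evaluation protocol, under our own assumptions about the edge-set representation and naming, could look as follows: it plants a random bipartite subgraph of prescribed density among fresh vertices and adds its arcs to the graph.

```python
import random

def plant_bipartite_community(edges, fans, centers, density, rng=random):
    """Add to `edges` (a set of (u, v) arcs) a random bipartite subgraph
    with the given fan/center vertex lists and expected arc density."""
    planted = set()
    for x in fans:
        for y in centers:
            if rng.random() < density:
                planted.add((x, y))
    edges |= planted
    return planted

# Example: plant a 20x20 community with density drawn from the "medium" range [0.5, 0.75].
rng = random.Random(42)
edges = set()                                  # stands in for the arcs of the .it 2004 graph
fans = [f"fan{i}" for i in range(20)]          # fresh vertices, disjoint from existing communities
centers = [f"center{i}" for i in range(20)]
planted = plant_bipartite_community(edges, fans, centers, rng.uniform(0.5, 0.75), rng)
print(len(planted), "arcs planted")
# Recall is then measured by checking whether the blind run of the extraction
# algorithm reports a community covering these fans and centers.
```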
The number of edges added is of the order of only 50,000, and it is likely that the nature of a graph with 100M edges is not affected. In the first experiment, about detecting bipartite communities, we introduce 480 dense bipartite subgraphs. More precisely, we introduce 10 bipartite subgraphs for each of the 48 categories representing all possible combinations of number of fans, number of centers, and density, where the number of fans is chosen in {10, 20, 40, 80}, the number of centers is chosen in {10, 20, 40, 80}, and the density is randomly chosen in the ranges [0.25, 0.5] (low), [0.5, 0.75] (medium), and [0.75, 1] (high). Moreover, the fans and centers of every new community are chosen so that they do not intersect any community found in the original graph nor any other new community. Table 1 shows how many added communities are found on average over 53 experiments. For every one of the 48 types, the maximum recall number is 10.

Table 1: Number of added bipartite communities found with threshold=8, depending on number of fans, centers, and density (maximum 10 per entry).

  Low density       # Fans:  10    20    40    80
    # Centers = 80            0    5.2   9.6   10
    # Centers = 40            0    5.4   9.5   9.9
    # Centers = 20            0    2.7   5.4   6
    # Centers = 10            0    0     0     0

  Medium density    # Fans:  10    20    40    80
    # Centers = 80           1.2   8.4   9.7   10
    # Centers = 40           0.7   8     9.7   9.9
    # Centers = 20           0.9   7.9   9.6   9.9
    # Centers = 10           0.1   0.8   1.9   3.2

  High density      # Fans:  10    20    40    80
    # Centers = 80           5.7   8.6   9.5   9.8
    # Centers = 40           5.4   8.6   9.7   9.8
    # Centers = 20           4.6   8.4   9.6   9.9
    # Centers = 10           3.3   6.5   9     9.7

In the second experiment, about detecting cliques, we introduce ten cliques for each of 12 classes representing all possible combinations of number of pages in {10, 20, 30, 40} and density randomly chosen in the ranges [0.25, 0.5], [0.5, 0.75], and [0.75, 1]. Table 2 shows how many such cliques are found on average over 70 experiments. Again the maximum recall number per entry is 10.

Table 2: Number of added clique communities found with threshold=8, depending on number of pages and density (maximum 10 per entry).

  # Pages   Low    Med    High
     40     9.6    9.8    9.7
     30     8.5    9.4    9.3
     20     3.6    7.6    8.3
     10     0      0.1    3.5

The cleaned .it 2004 graph used for the test has an average degree of roughly 6 (see Section 6). A small bipartite graph of 10-by-10 nodes, or a small clique of 10 nodes at 50% density, has an average degree of 5. The breakdown of the degree-counting heuristic at these low thresholds is easily explained by the fact that such small and sparse communities are effectively hard to distinguish from the background graph by simple degree counting.

6. LARGE COMMUNITIES IN THE WEB
In this section we apply our algorithm to the task of extracting and classifying real large communities in the web.

6.1 Data set
For our experiments we have used data from the Stanford WebBase project [11] and data from the WebGraph project [5, 4]. Raw data is publicly available at http://law.dsi.unimi.it/. More precisely, we apply our algorithm to three graphs: the graph that represents a snapshot of the Web of the year 2001 (118M pages and 1G links); the graph that represents a snapshot of the Italian domain of the year 2004 (41M pages and 1.15G arcs); and the graph that represents a snapshot of the United Kingdom domain of the year 2005 (39M pages and 0.9G links). Since we search for communities through the study of social links, we first remove all nepotistic links, i.e. links between two pages that belong to the same domain (this is a standard cleaning step used also in [31]). Once these links are removed, we also remove all isolated pages, i.e. pages with both outdegree and indegree equal to zero. Observe that we do not remove anything else from the graph; for example, we do not need to remove small-outdegree pages and large-indegree pages, as is usually done for efficiency reasons, since our algorithm handles these cases efficiently and correctly. We obtain the reduced data sets shown in Table 3.

Table 3: The reduced data sets. Number of nodes, edges and average degree.

  Web 2001:  20.1M pages,  59.4M links,  av. deg. 3
  .it 2004:  17.3M pages, 104.5M links,  av. deg. 6
  .uk 2005:  16.3M pages, 183.3M links,  av. deg. 11

6.2 Communities extraction
Figure 5 presents the results obtained with the three graphs presented above. The y axis shows how many communities are found, and the x axis represents the value of the parameter threshold. Moreover, communities are partitioned by density into four categories (shown in grey-scale) corresponding to the density intervals [1, 0.75], ]0.75, 0.5], ]0.5, 0.25], ]0.25, 0.00].

Figure 5: Number of communities found by Algorithm RobustDensityEstimation as a function of the degree threshold, for (a) Web 2001, (b) Italy 2004, and (c) United Kingdom 2005. The grey scale denotes a partition of the communities by density.

Table 4 reports the time needed for the experiments on an Intel Pentium IV 3.2 GHz single-processor computer with 3.5 GB of RAM. The data sets, although large, were in a cleverly compressed format and could be stored in main memory. The column "# loops" shows the average number of iterative refinements done for each community in Algorithm ExtractCommunity. Depending on the fan outdegree threshold, the time ranges from a few minutes to just above two hours for the most intensive computation.

Table 4: Measurements of performance. Number of communities found, total computing time and average number of cleaning loops per community.

  Thresh.   Web 2001 (# com. / # loops / Time)   Italy 2004 (# com. / # loops / Time)   Uk 2005 (# com. / # loops / Time)
    10         5686 / 2.7 / 2h12min                  1099 / 2.7 / 30min                     4220 / 2.5 / 1h10min
    15         2412 / 2.8 / 1h03min                   452 / 2.8 / 17min                     2024 / 2.6 / 38min
    20         1103 / 2.8 / 31min                     248 / 2.8 / 10min                     1204 / 2.7 / 27min
    25          616 / 2.6 / 19min                     153 / 2.8 /  7min                      767 / 2.7 / 20min

Table 5 shows the effectiveness of the degree-based filter, since in the large tests only 6% to 8% of the invocations of ExtractCommunity do not return a community. Note that this false-positive rate of the first stage does not impact much on the algorithm's efficiency nor on its effectiveness: the false positives of the first stage are caught anyhow by the second stage.

Table 5: Number and percentage of useless calls to ExtractCommunity.

  Thresh.   Web 2001 (Num. / perc.)   Italy 2004 (Num. / perc.)   Uk 2005 (Num. / perc.)
    10          364 /  6%                  34 / 3%                     377 /  8%
    15          135 /  5%                  24 / 5%                     331 / 14%
    20          246 / 18%                  24 / 9%                     526 / 30%
    25          148 / 19%                   4 / 3%                     323 / 30%

Interestingly, Table 7 shows the coverage of the communities with respect to the nodes of sufficiently high degree. In the two national domains the percentage of nodes covered by a community is above 90%, while it is just below 60% for the web graph (of 2001). Table 6 shows the distribution of size and density of the communities found. The web 2001 data set seems richer in communities with few fans (range [10-25]) and poorer in communities with many fans (range over 100), and this might explain the lower coverage. Table 6 reports how many communities are found with the threshold equal to 10, in the three data sets, as a function of number of fans, centers, and density. Low, medium and high densities are respectively the ranges [0.25, 0.5], [0.5, 0.75], and [0.75, 1].
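As a small illustration of how the density of an extracted pair (X, Y) can be computed and bucketed into the intervals used above, consider the following sketch; the function names and the toy data are our own.

```python
def bipartite_density(fans, centers, succ):
    """Fraction of the |X||Y| possible fan-to-center arcs that are present."""
    if not fans or not centers:
        return 0.0
    present = sum(len(succ.get(x, set()) & centers) for x in fans)
    return present / (len(fans) * len(centers))

def density_category(d):
    """Bucket a density value into the grey-scale intervals used for Figure 5."""
    if d > 0.75:
        return "high (0.75, 1]"
    if d > 0.5:
        return "medium (0.5, 0.75]"
    if d > 0.25:
        return "low (0.25, 0.5]"
    return "very low (0, 0.25]"

# Toy community: 2 fans, 3 centers, 5 of the 6 possible arcs present.
succ = {"f1": {"c1", "c2", "c3"}, "f2": {"c1", "c2"}}
d = bipartite_density({"f1", "f2"}, {"c1", "c2", "c3"}, succ)
print(d, density_category(d))   # 0.833..., high
```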
7. VISUALIZATION OF COMMUNITIES
The compressed data structure of [5] storing the web graph does not hold any information about the textual content of the pages. Therefore, once the list of URLs of fans and centers for each community has been created, a non-recursive crawl of the WWW focussed on this list of URLs has been performed in order to recover textual data for the communities. What we want is an approximate description of the community topics. The intuition is that the topic of a community is well described by its centers. As a good summary of the content of a center page we extract the text contained in the title tag of the page. We treat fan pages in a different way. The full content of a fan page is probably not interesting, because a fan page can contain different topics, or might even be part of different communities. We extract only the anchor text of the link to a center page, because it is a good textual description of the edge from the fan to a center in the community graph. For each community we build a weighted set of words by collecting all the words extracted from centers and fans. The weight of each word takes into account whether the word comes from a center and/or a fan and whether it is repeated. All the words in a stop-word list are removed. We then build a flat clustering of the communities. For the clustering we use the k-center algorithm described in [18, 17]. As a metric we adopt the Generalized Jaccard distance (a weighted form of the standard Jaccard distance). This paper focusses on the algorithmic principles and testing of a fast and effective heuristic for detecting large-to-medium size dense subgraphs in the web graph. The examples of clusters reported in this section are to be considered as anecdotal evidence of the capabilities of the Community Watch system. We plan on using the Community Watch tool for a full-scale analysis of portions of the web graph as future research. In Table 8 we show some high-quality clusters of communities found by the Community Watch tool in the data set UK2005, among the communities detected with threshold t = 25 (767 communities). Further filtering of communities with too few centers reduces the number of items (communities) to 636.

Table 8: Some notable clusters of communities in the data set UK2005 for t = 25. Parameters used for filtering and clustering: # fans = 0-1000, # centers = 10-max, average degree = 10-max, target = 70 clusters (55 done). Communities in the filtered data set: 636. We report, for each cluster, its id number, keywords with weights, the number of communities in the cluster, and how many of these are relevant to the prevalent type.

  Cl. ID   Cl. Keywords                                   # Comm.   # Rel. Comm.   Prevalent Type
    1      poker (0.66), casino (1.0), games (0.80)          8           8         gambling
    2      phone (0.71), nokia (1.0), motorola (0.31)       17          17         mobile phones
   10      men (0.15), lingerie (0.16), women (0.23)        14          14         clothing/underwear
   14      antique (1.0), auction (0.11), search (0.19)      5           4         antiques
   22      car (1.0), hire (0.29), cheap (0.13)             25          20         car sales/rental
   25      hotel (0.65), holiday (1.0), travel (0.22)       36          34         tourism/travel
   27      delivery (0.31), flowers (1.0), gifts (0.66)      8           8         gifts and flowers
   31      credit (0.54), loans (1.0), insurance (0.56)     36          28         financial services
   32      city (0.59), council (1.0), community (0.31)      7           6         city councils

The full listing can be inspected using the Community Watch web interface, publicly available at http://comwatch.iit.cnr.it.
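The Generalized Jaccard distance mentioned above has a simple closed form, illustrated by the sketch below; the toy word-vectors and the nearest-representative assignment are our own illustrative simplifications, not the k-center procedure of [17, 18].

```python
def generalized_jaccard_distance(u, v):
    """1 - sum_i min(u_i, v_i) / sum_i max(u_i, v_i) for weighted word-vectors."""
    keys = set(u) | set(v)
    num = sum(min(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    den = sum(max(u.get(k, 0.0), v.get(k, 0.0)) for k in keys)
    return 1.0 - (num / den if den else 0.0)

def assign_to_centers(vectors, center_ids):
    """Toy assignment step: map every community to its closest representative."""
    return {
        cid: min(center_ids, key=lambda c: generalized_jaccard_distance(vec, vectors[c]))
        for cid, vec in vectors.items()
    }

# Toy word-vectors for three communities (weights from titles and anchor text).
vectors = {
    "comm1": {"poker": 0.7, "casino": 1.0, "games": 0.8},
    "comm2": {"casino": 0.9, "games": 0.5, "bonus": 0.2},
    "comm3": {"hotel": 0.6, "holiday": 1.0, "travel": 0.3},
}
print(generalized_jaccard_distance(vectors["comm1"], vectors["comm2"]))  # ~0.48, smaller than the distance to comm3 (1.0)
print(assign_to_centers(vectors, ["comm1", "comm3"]))
```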
8. CONCLUSIONS AND FUTURE WORK
In this paper we tackle the problem of finding dense subgraphs of the web graph. We propose an efficient heuristic method that is shown experimentally to be able to discover about 80% of the communities having about 20 fans/centers, even at medium density (above 50%). The effectiveness increases and approaches 100% for larger and denser communities. For communities of fewer than 20 fans/centers (say 10 fans and 10 centers) our algorithm is still able to detect a sizable fraction of the communities present (about 35%) whenever these are at least 75% dense. Our method is effective for a medium range of community size/density which is not well detected by the current technology. One can cover the whole spectrum of communities by applying first our method to detect large and medium size communities and then, on the residual graph, the Trawling algorithm to find the smaller communities left. The efficiency of the Trawling algorithm is likely to be boosted by its application to a residual graph purified of the larger communities, which otherwise tend to be re-discovered several times. We plan the coupling of our heuristic with the Trawling algorithm as future work. One open problem is that of devising an efficient version of ExtractCommunity in the data stream model, in order to cope with instances of the web graph stored in secondary memory.

Table 6: Distribution of the detected communities depending on number of fans (ranges [10,25[, [25,50[, [50,100[, over 100), number of centers (same ranges), and low/medium/high density, for t = 10 (5686 communities in Web 2001, 1099 in Italy 2004, 4220 in Uk 2005).

Table 7: Coverage of communities found in the web graphs. The leftmost column shows the threshold value. For each data set, the first column is the number of pages with d+ > t, and the second and third columns are the number and percentage of pages that have been found to be a fan of some community.

  Thresh.   Web 2001 (# Total / # in Com. / Perc.)   Italy 2004 (# Total / # in Com. / Perc.)   Uk 2005 (# Total / # in Com. / Perc.)
    10        984,290 / 581,828 / 59%                  3,331,358 / 3,031,723 / 91%                4,085,309 / 3,744,159 / 92%
    15        550,206 / 286,629 / 52%                  2,225,414 / 2,009,107 / 90%                3,476,321 / 3,172,338 / 91%
    20        354,971 / 164,501 / 46%                  1,761,160 /   642,960 / 37%                2,923,794 / 2,752,726 / 94%
    25        244,751 / 105,500 / 43%                    487,866 /   284,218 / 58%                2,652,204 / 2,503,226 / 94%

9. REFERENCES
[1] J. Abello, M. G. C. Resende, and S. Sudarsky. Massive quasi-clique detection. In Latin American Theoretical Informatics (LATIN), pages 598-612, 2002.
[2] K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society of Information Science, 51(12):1114-1122, 2000.
[3] M. Bianchini, M. Gori, and F. Scarselli. Inside PageRank. ACM Trans. Inter. Tech., 5(1):92-128, 2005.
[4] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience, 34(8):711-726, 2004.
[5] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW '04, pages 595-601, 2004.
[6] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33(1-6):309-320, 2000.
[7] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630-659, 2000.
[8] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web.
In Selected papers from the Sixth International Conference on World Wide Web, pages 1157-1166, Essex, UK, 1997. Elsevier Science Publishers Ltd.
[9] A. Capocci, V. D. P. Servedio, G. Caldarelli, and F. Colaiori. Communities detection in large networks. In WAW 2004: Algorithms and Models for the Web-Graph: Third International Workshop, pages 181-188, 2004.
[10] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the link structure of the World Wide Web. Computer, 32(8):60-67, 1999.
[11] J. Cho and H. Garcia-Molina. WebBase and the Stanford InterLib project. In 2000 Kyoto International Conference on Digital Libraries: Research and Practice, 2000.
[12] U. Feige. Relations between average case complexity and approximation complexity. In Proc. of STOC 2002, Montreal, 2002.
[13] U. Feige and M. Langberg. Approximation algorithms for maximization problems arising in graph partitioning. Journal of Algorithms, 41:174-211, 2001.
[14] U. Feige, D. Peleg, and G. Kortsarz. The dense k-subgraph problem. Algorithmica, 29(3):410-421, 2001.
[15] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In KDD '00, pages 150-160, New York, NY, USA, 2000. ACM Press.
[16] G. W. Flake, S. Lawrence, C. L. Giles, and F. Coetzee. Self-organization of the web and identification of communities. IEEE Computer, 35(3):66-71, 2002.
[17] F. Geraci, M. Maggini, M. Pellegrini, and F. Sebastiani. Cluster generation and cluster labelling for web snippets. In SPIRE 2006, pages 25-36, Glasgow, UK, October 2006. Volume 4209 of LNCS.
[18] F. Geraci, M. Pellegrini, P. Pisati, and F. Sebastiani. A scalable algorithm for high-quality clustering of web snippets. In Proceedings of the 21st Annual ACM Symposium on Applied Computing (SAC 2006), pages 1058-1062, Dijon, France, April 2006.
[19] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In HYPERTEXT '98, pages 225-234, New York, NY, USA, 1998. ACM Press.
[20] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB '05, pages 721-732. VLDB Endowment, 2005.
[21] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, pages 7821-7826, 2002.
[22] A. Gulli and A. Signorini. The indexable web is more than 11.5 billion pages.
In WWW (Special interest tracks and posters), pages 902-903, 2005.
[23] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web, 2005.
[24] Q. Han, Y. Ye, H. Zhang, and J. Zhang. Approximation of dense k-subgraph, 2000. Manuscript.
[25] J. Hastad. Clique is hard to approximate within $n^{1-\epsilon}$. Acta Mathematica, 182:105-142, 1999.
[26] M. Henzinger. Algorithmic challenges in web search engines. Internet Mathematics, 1(1):115-126, 2002.
[27] N. Imafuji and M. Kitsuregawa. Finding a web community by maximum flow algorithm with HITS score based capacity. In DASFAA 2003, pages 101-106, 2003.
[28] H. Ino, M. Kudo, and A. Nakamura. Partitioning of web graphs by community topology. In WWW '05, pages 661-669, New York, NY, USA, 2005. ACM Press.
[29] H. Kautz, B. Selman, and M. Shah. Referral Web: Combining social networks and collaborative filtering. Communications of the ACM, 40(3):63-65, 1997.
[30] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In VLDB '99, pages 639-650, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[31] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. Computer Networks (Amsterdam, Netherlands: 1999), 31(11-16):1481-1493, 1999.
[32] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Method and system for trawling the world-wide web to identify implicitly-defined communities of web pages. US patent 6886129, 2005.
[33] S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In The VLDB Journal, pages 639-650, 1999.
[34] R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks (Amsterdam, Netherlands: 1999), 33(1-6):387-401, 2000.
[35] M. Newman. The structure and function of complex networks. SIAM Review, 45(2):167-256, 2003.
[36] P. K. Reddy and M. Kitsuregawa. An approach to relate the web communities through bipartite graphs. In WISE 2001, pages 301-310, 2001.
[37] B. Wu and B. D. Davison. Identifying link farm spam pages. In WWW '05, pages 820-829, New York, NY, USA, 2005. ACM Press.