SIGIR 2007 Proceedings Session 17: Spam Spam Spam DiffusionRank: A Possible Penicillin for Web Spamming Haixuan Yang, Irwin King, and Michael R. Lyu Dept. of Computer Science and Engineering The Chinese University of Hong Kong Shatin, NT, Hong Kong {hxyang,king,lyu}@cse.cuhk.edu.hk ABSTRACT While the PageRank algorithm has proven to be very effective for ranking Web pages, the rank scores of Web pages can be manipulated. To handle the manipulation problem and to cast a new insight on the Web structure, we propose a ranking algorithm called DiffusionRank. DiffusionRank is motivated by the heat diffusion phenomena, which can be connected to Web ranking because the activities flow on the Web can be imagined as heat flow, the link from a page to another can be treated as the pipe of an air-conditioner, and heat flow can embody the structure of the underlying Web graph. Theoretically we show that DiffusionRank can serve as a generalization of PageRank when the heat diffusion coefficient tends to infinity. In such a case 1/ = 0, DiffusionRank (PageRank ) has low ability of anti-manipulation. When = 0, DiffusionRank obtains the highest ability of anti-manipulation, but in such a case, the web structure is completely ignored. Consequently, is an interesting factor that can control the balance between the ability of preserving the original Web and the ability of reducing the effect of manipulation. It is found empirically that, when = 1, DiffusionRank has a Penicillin-like effect on the link manipulation. Moreover, DiffusionRank can be employed to find group-to-group relations on the Web, to divide the Web graph into several parts, and to find link communities. Experimental results show that the DiffusionRank algorithm achieves the above mentioned advantages as expected. Categories and Sub ject Descriptors: H.3.3 [Information Systems]: Information Search and Retrieval; G2.2 [Discrete Mathematics]: Graph Theory General Terms: Algorithms. Keywords: Random Graph, PageRank, DiffusionRank 1. INTRODUCTION While the PageRank algorithm [13] has proven to be very effective for ranking Web pages, inaccurate PageRank results are induced because of web page manipulations by peo- ple for commercial interests. The manipulation problem is also called the Web spam, which refers to hyperlinked pages on the World Wide Web that are created with the intention of misleading search engines [7]. It is reported that approximately 70% of all pages in the .biz domain and about 35% of the pages in the .us domain belong to the spam category [12]. The reason for the increasing amount of Web spam is explained in [12]: some web site operators try to influence the positioning of their pages within search results because of the large fraction of web traffic originating from searches and the high potential monetary value of this traffic. From the viewpoint of the Web site operators who want to increase the ranking value of a particular page for search engines, Keyword Stuffing and Link Stuffing are being used widely [7, 12]. From the viewpoint of the search engine managers, the Web spam is very harmful to the users' evaluations and thus their preference to choosing search engines because people believe that a good search engine should not return irrelevant or low-quality results. There are two methods being employed to combat the Web spam problem. Machine learning methods are employed to handle the keyword stuffing. To successfully apply machine learning methods, we need to dig out some useful textual features for Web pages, to mark part of the Web pages as either spam or non-spam, then to apply supervised learning techniques to mark other pages. For example, see [5, 12]. Link analysis methods are also employed to handle the link stuffing problem. One example is the TrustRank [7], a link-based method, in which the link structure is utilized so that human labelled trusted pages can propagate their trust scores trough their links. This paper focuses on the link-based method. The rest of the materials are organized as follows. In the next section, we give a brief literature review on various related ranking techniques. We establish the Heat Diffusion Model (HDM) on various cases in Section 3, and propose DiffusionRank in Section 4. In Section 5, we describe the data sets that we worked on and the experimental results. Finally, we draw conclusions in Section 6. 2. LITERATURE REVIEW Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'07, July 23­27, 2007, Amsterdam, The Netherlands. Copyright 2007 ACM 978-1-59593-597-7/07/0007 ...$5.00. The importance of a Web page is determined by either the textual content of pages or the hyperlink structure or both. As in previous work [7, 13], we focus on ranking methods solely determined by hyperlink structure of the Web graph. All the mentioned ranking algorithms are established on a graph. For our convenience, we first give some notations. Denote a static graph by G = (V , E ), where V = {v1 , v2 , . . . , vn }, E = {(vi , vj ) | there is an edge from vi to 431 SIGIR 2007 Proceedings Session 17: Spam Spam Spam vj }. Ii and di denote the in-degree and the out-degree of page i respectively. 2.1 PageRank The importance of a Web page is an inherently sub jective matter, which depends on the reader's interests, knowledge and attitudes [13]. However, the average importance of all readers can be considered as an ob jective matter. PageRank tries to find such average importance based on the Web link structure, which is considered to contain a large amount of statistical data. The Web is modelled by a directed graph G in the PageRank algorithms, and the rank or "importance" xi for page vi V is defi( ed recursively in terms of pages n which point to it: xi = j,i)E aij xj , where aij is assumed to be 1/dj if there is a link from j to i, and 0 otherwise. Or in matrix terms, x = Ax. When the concept of "random jump" is introduced, the matrix form is changed to x = [(1 - )g1T + A]x, (1) All the above topics are related to our work. The readers can find that our model is a generalization of PageRank in order to resist Web manipulation, that we inherit the first part of TrustRank, that we borrow the concept of ranking on the manifold to introduce our model, and that heat diffusion is a main scheme in this paper. 3. HEAT DIFFUSION MODEL Heat diffusion provides us with another perspective about how we can view the Web and also a way to calculate ranking values. In this paper, the Web pages are considered to be drawn from an unknown manifold, and the link structure forms a directed graph, which is considered as an approximation to the unknown manifold. The heat kernel established on the Web graph is considered as the representation of the relationship between Web pages. The temperature distribution after a fixed time period, induced by a special initial temperature distribution, is considered as the rank scores on the Web pages. Before establishing the proposed models, we first show our motivations. where is the probability of following the actual link from a page, (1 - ) is the probability of taking a "random jump", and g is a stochastic vector, i.e., 1T g = 1. Typically, = 1 0.85, and g = n 1 is one of the standard settings, where 1 is the vector of all ones [6, 13]. 3.1 Motivations There are two points to explain that PageRank is susceptible to web spam. · Over-demo cratic. There is a belief behind PageRank --all pages are born equal. This can be seen from the equal voting ability of one page: the sum of each column is equal to one. This equal voting ability of all pages gives the chance for a Web site operator to increase a manipulated page by creating a large number of new pages pointing to this page since all the newly created pages can obtain an equal voting right. · Input-indep endent. For any given non-zero initial input, the iteration will converge to the same stable distribution corresponding to the maximum eigenvalue 1 of the transition matrix. This input-independent property makes it impossible to set a special initial input (larger values for trusted pages and less values even negative values for spam pages) to avoid web spam. The input-independent feature of PageRank can be further explained as follows. P = [(1 - )g1T + A] is a positive stochastic matrix if g is set to be a positive stochastic vector (the uniform distribution is one of such settings), and so the largest eigenvalue is 1 and no other eigenvalue whose absolute value is equal to 1, which is guaranteed by the Perron Theorem [11]. Let y be the eigenvector corresponding to 1, then we have Py = y. Let {xk } be the sequence generated from the iterations xk+1 = Pxk , and x0 is the initial input. If {xk } converges to x, then xk+1 = Pxk implies that x must satisfy Px = x. Since the only maximum eigenvalue is 1, we have x = cy where c is a constant, and if both x and y are normalized by their sums, then c = 1. The above discussions show that PageRank is independent of the initial input x0 . In our opinion, g and are ob jective parameters determined by the users' behaviors and preferences. A, and g are the "true" web structure. While A is obtained by a crawler and the setting = 0.85 is accepted by the people, we think that g should be determined by a user behavior investigation, something like [1]. Without any prior knowl1 edge, g has to be set as g = n 1. 2.2 TrustRank TrustRank [7] is composed of two parts. The first part is the seed selection algorithm, in which the inverse PageRank was proposed to help an expert of determining a good node. The second part is to utilize the biased PageRank, in which the stochastic distribution g is set to be shared by all the trusted pages found in the first part. Moreover, the initial input of x is also set to be g. The justification for the inverse PageRank and the solid experiments support its advantage in combating the Web spam. Although there are many variations of PageRank, e.g., a family of link-based ranking algorithms in [2], TrustRank is especially chosen for comparisons for three reasonss: (1) it is designed for combatting spamming; (2) its fixed parameters make a comparison easy; and (3) it has a strong theoretical relations with PageRank and DiffusionRank. 2.3 Manifold Ranking In [17], the idea of ranking on the data manifolds was proposed. The data points represented as vectors in Euclidean space are considered to be drawn from a manifold. From the data points on such a manifold, an undirected weighted graph is created, then the weight matrix is given by the Gaussian Kernel smoothing. While the manifold ranking algorithm achieves an impressive result on ranking images, the biased vector g and the parameter k in the general personalized PageRank in [17] are unknown in the Web graph setting; therefore we do not include it in the comparisons. 2.4 Heat Diffusion Heat diffusion is a physical phenomena. In a medium, heat always flow from position with high temperature to position with low temperature. Heat kernel is used to describe the amount of heat that one point receives from another point. Recently, the idea of heat kernel on a manifold is borrowed in applications such as dimension reduction [3] and classification [9, 10, 14]. In these work, the input data is considered to lie in a special structure. 432 SIGIR 2007 Proceedings Session 17: Spam Spam Spam TrustRank model does not follow the "true" web structure by setting a biased g, but the effects of combatting spamming are achieved in [7]; PageRank is on the contrary in some ways. We expect a ranking algorithm that has an effect of anti-manipulation as TrustRank while respecting the "true" web structure as PageRank. We observe that the heat diffusion model is a natural way to avoid the over-democratic and input-independent feature of PageRank. Since heat always flows from a position with higher temperatures to one with lower temperatures, points are not equal as some points are born with high temperatures while others are born with low temperatures. On the other hand, different initial temperature distributions will give rise to different temperature distributions after a fixed time period. Based on these considerations, we propose the novel DiffusionRank. This ranking algorithm is also motivated by the viewpoint for the Web structure. We view all the Web pages as points drawn from a highly complex geometric structure, like a manifold in a high dimensional space. On a manifold, heat can flow from one point to another through the underlying geometric structure in a given time period. Different geometric structures determine different heat diffusion behaviors, and conversely the diffusion behavior can reflect the geometric structure. More specifically, on the manifold, the heat flows from one point to another point, and in a given time period, if one point x receives a large amount of heat from another point y , we can say x and y are well connected, and thus x and y have a high similarity in the sense of a high mutual connection. We note that on a point with unit mass, the temperature and the heat of this point are equivalent, and these two terms are interchangeable in this paper. In the following, we first show the HDM on a manifold, which is the origin of HDM, but cannot be employed to the World Wide Web directly, and so is considered as the ideal case. To connect the ideal case and the practical case, we then establish HDM on a graph as an intermediate case. To model the real world problem, we further build HDM on a random graph as a practical case. Finally we demonstrate the DiffusionRank which is derived from the HDM on a random graph. 3.3 On an Undirected Graph On an undirected graph G, the edge (vi , vj ) is considered as a pipe that connects nodes vi and vj . The value fi (t) describes the heat at node vi at time t, beginning from an initial distribution of heat given by fi (0) at time zero. f (t) (f(0)) denotes the vector consisting of fi (t) (fi (0)). We construct our model as follows. Suppose, at time t, each node i receives M (i, j, t, t) amount of heat from its neighbor j during a period of t. The heat M (i, j, t, t) should be proportional to the time period t and the heat difference fj (t) - fi (t). Moreover, the heat flows from node j to node i through the pipe that connects nodes i and j . Based on this consideration, we assume that M (i, j, t, t) = (fj (t) - fi (t))t. As a result, the heat difference at node i between time t + t and time t will be equal to the sum of the heat that it receives from all its neighbors. This is formulated as j fi (t + t) - fi (t) = (fj (t) - fi (t))t, (2) :(j,i)E where E is the set of edges. To find a closed form solution to Eq. (2), we express it in a matrix form: (f (t + t) - f (t))/t = Hf (t), where d(v ) denotes the degree of the d node v . In the limit t 0, it becomes dt f (t) = Hf (t). tH f (0), especially we have Solving it, we obtain f (t) = e -d(vj ), j = i, 1, (vj , vi ) E , (3) f (1) = e H f (0), Hij = 0, otherwise, where e H is defined as e H = I + H + ! H2 + ! H3 + · · · . 2 3 2 3 3.4 On a Directed Graph The above heat diffusion model must be modified to fit the situation where the links between Web pages are directed. On one Web page, when the page-maker creates a link (a, b) to another page b, he actually forces the energy flow, for example, people's click-through activities, to that page, and so there is added energy imposed on the link. As a result, heat flows in a one-way manner, only from a to b, not from b to a. Based on such consideration, we modified the heat diffusion model on an undirected graph as follows. On a directed graph G, the pipe (vi , vj ) is forced by added energy such that heat flows only from vi to vj . Suppose, at time t, each node vi receives RH = RH (i, j, t, t) amount of heat from vj during a period of t. We have three assumptions: (1) RH should be proportional to the time period t; (2) RH should be proportional to the the heat at node vj ; and (3) RH is zero if there is no link from vj to vi . As a j result, vi will receive :(vj ,vi )E j fj (t)t amount of heat from all its neighbors that points to it. On the other hand, node vi diffuses DH (i, t, t) amount of heat to its subsequent nodes. We assume that: (1) The heat DH (i, t, t) should be proportional to the time period t. (2) The heat DH (i, t, t) should be proportional to the the heat at node vi . (3) Each node has the same ability of diffusing heat. This fits the intuition that a Web surfer only has one choice to find the next page that he wants to browse. (4) The heat DH (i, t, t) should be uniformly distributed to its subsequent nodes. The real situation is more complex than what we assume, but we have to make this simple assumption in order to make our model concise. As a result, node vi will diffuse fi (t)t/di amount of heat to any of its 3.2 Heat Flow On a Known Manifold If the underlying manifold is known, the heat flow throughout a geometric manifold with initial conditions can be described by the following second order differential equation: f (x,t) - f (x, t) = 0, where f (x, t) is the heat at location x t at time t, and f is the Laplace-Beltrami operator on a function f . The heat diffusion kernel Kt (x, y) is a special solution to the heat equation with a special initial condition--a unit heat source at position y when there is no heat in other positions. Based on this, the heat kernel Kt (x, y) describes the heat distribution at time t diffusing from the initial unit heat source at position y, and thus describes the connectivity (which is considered as a kind of similarity) between x and y. However, it is very difficult to represent the World Wide Web as a regular geometry with a known dimension; even the underlying is known, it is very difficult to find the heat kernel Kt (x, y), which involves solving the heat equation with the delta function as the initial condition. This motivates us to investigate the heat flow on a graph. The graph is considered as an approximation to the underlying manifold, and so the heat flow on the graph is considered as an approximation to the heat flow on the manifold. 433 SIGIR 2007 Proceedings Session 17: Spam Spam Spam subsequent nodes, and each of its subsequent node should receive fi (t)t/di amount of heat. Therefore j = /dj . To sum up, the heat difference at node vi between time t + t and time t will be equal to the sum of the heat that it receives, deducted by what it diffusesj This is formulated as . fi (t + t) - fi (t) = - fi (t)t + :(vj ,vi )E /dj fj (t)t. Similarly, we obtain j = i, -1, 1/dj , (vj , vi ) E , f (1) = e H f (0), Hij = (4) 0, otherwise. 4. DIFFUSIONRANK For a random graph, the matrix (I + N R)N or e R can measure the similarity relationship between nodes. Let fi (0)= 1, fj (0) = 0 if j = i, then the vector f (0) represent the unit heat at node vi while all other nodes has zero heat. For such f (0) in a random graph, we can find the heat distribution at time 1 by using Eq. (5) or Eq. (6). The heat distribu tion is exactly the i-th row of the matrix of (I + N R)N or R e . So the ith-row j th-column element hij in the matrix (I + tR)N or e R means the amount of heat that vi can receive from vj from time 0 to 1. Thus the value hij can be used to measure the similarity from vj to vi . For a static graph, similarly the matrix (I + N H)N or e H can measure the similarity relationship between nodes. The intuition behind is that the amount h(i, j ) of heat that a page vi receives from a unit heat in a page vj in a unit time embodies the extent of the link connections from page vj to page vi . Roughly speaking, when there are more uncrossed paths from vj to vi , vi will receive more heat from vj ; when the path length from vj to vi is shorter, vi will receive more heat from vj ; and when the pipe connecting vj and vi is wide, the heat will flow quickly. The final heat that vi receives will depend on various paths from vj to vi , their length, and the width of the pipes. 3.5 On a Random Directed Graph For real world applications, we have to consider random edges. This can be seen in two viewpoints. The first one is that in Eq. (1), the Web graph is actually modelled as a random graph, there is an edge from node vi to node vj with a probability of (1 - )gj (see the item (1 - )g1T ), and that the Web graph is predicted by a random graph [15, 16]. The second one is that the Web structure is a random graph in essence if we consider the content similarity between two pages, though this is not done in this paper. For these reasons, the model would become more flexible if we extend it to random graphs. The definition of a random graph is given below. Definition 1. A random graph RG = (V , P = (pij )) is defined as a graph with a vertex set V in which the edges are chosen independently, and for 1 i, j |V | the probability of (vi , vj ) being an edge is exactly pij . The original definition of random graphs in [4], is changed slightly to consider the situation of directed graphs. Note that every static graph can be considered as a special random graph in the sense that pij can only be 0 or 1. On a random graph RG = (V , P), where P = (pij ) is the probability of the edge (vi , vj ) exists. In such a random graph, the expected heat difference at node i between time t + t and time t will be equal to the sum of the expected heat that it receives from all its antecedents, deducted by the expected heat that it diffuses. Since the probability of the link (vj , vi ) is pj i , the expected heat flow from node j to node i should be multiplied byj pj i , and so we have fi (t + t) - fi (t) = - fi (t) t + + + :(vj ,vi )E pj i fj (t)t/RD (vj ), where RD (vi ) is the exk pected out-degree of node vi , it is defined as pik . Similarly we have j = i; -1, f (1) = e R f (0), Rij = (5) pj i /RD + (vj ), j = i. When the graph is large, a direct computation of e R is time-consuming, and we adopt its discrete approximation: f (1) = (I + R)N f (0). N (6) Algorithm 1 DiffusionRank Function Input: The transition matrix A; the inverse transition matrix U; the decay factor I for the inverse PageRank; the decay factor B for PageRank; number of iterations MI for the inverse PageRank; the number of trusted pages L; the thermal conductivity coefficient . Output: DiffusionRank score vector h. 1: s = 1 2: for i = 1 TO MI do 1 3: s = I · U · s + (1 - I ) · n · 1 4: end for 5: Sort s in a decreasing order: = Rank ({1, . . . , n}, s) 6: d = 0, C ount = 0, i = 0 7: while C ount L do if (i) is evaluated as a trusted page then 8: 9: d( (i)) = 1, C ount + + 10: end if 11: i++ 12: end while 13: d = d/|d| 14: h = d 15: Find the iteration number MB according to 16: for i = 1 TO MB do 1 17: h = (1 - MB )h + MB (B · A · h + (1 - B ) · n · 1) 18: end for 19: RETURN h 4.1 Algorithm For the ranking task, we adopt the heat kernel on a random graph. Formally the DiffusionRank is described in Algorithm 1, in which,the element Uij in the inverse transition matrix U is defined to be 1/Ij if there is a link from i to j , and 0 otherwise. This trusted pages selection procedure by inverse PageRank is completely borrowed from TrustRank [7] except for a fix number of the size of the trusted set. Although the inverse PageRank is not perfect in its abil- The matrix (I + N R)N in Eq. (6) and matrix e R in Eq. (5) are called Discrete Diffusion Kernel and the Continuous Diffusion Kernel respectively. Based on the Heat Diffusion Models and their solutions, DiffusionRank can be established on undirected graphs, directed graphs, and random graphs. In the next section, we mainly focus on DiffusionRank in the random graph setting. 434 SIGIR 2007 Proceedings Session 17: Spam Spam Spam ity of determining the maximum coverage, it is appealing because of its polynomial execution time and its reasonable intuition--we actually inverse the original link when we try to build the seed set from those pages that point to many pages that in turn point to many pages and so on. In the algorithm, the underlying random graph is set as 1 P = B · A + (1 - B ) · n · 1n×n , which is induced by the Web graph. As a result, R = -I + P. In fact, the more general setting for DiffusionRank is P = 1 B · A + (1 - B ) · n · g · 1T . By such a setting, DiffusionRank is a generalization of TrustRank when tends to infinity and when g is set in the same way as TrustRank. However, the second part of TrustRank is not adopted by us. In our model, g should be the true "teleportation" determined by the user's browse habits, popularity distribution over all the Web pages, and so on; P should be the true model of the random nature of the World Wide Web. Setting g according to the trusted pages will not be consistent with the basic idea of Heat Diffusion on a random graph. We simply set g = 1 only because we cannot find it without any priori knowledge. Remark. In a social network interpretation, DiffusionRank first recognizes a group of trusted people, who may not be highly ranked, but they know many other people. The initially trusted people are endowed with the power to decide who can be further trusted, but cannot decide the final voting results, and so they are not dictators. we need to first set f (0) for such an application as follows. In f (0) = (f1 (0), f2 (0), . . . , fn (0))T , if i {j1 , j2 , . . . , js }, then fi (0) = 1, and 0 otherwise. Then we employ Eq. (5) to calculate f (1) = (f1 (1), f2 (1), . . . , fn (1))T , finally we sum those fj (1) where j {i1 , i2 , . . . , it }. Fig. 1 (a) shows the results generated by the DiffusionRank. We consider five groups--five departments in our Engineering Faculty: CSE, MAE, EE, IE, and SE. is set to be 1, the numbers in Fig. 1 (a) are the amount of heat that they diffuse to each other. These results are normalized by the total number of each group, and the edges are ignored if the values are less than 0.000001. The group-to-group relations are therefore detected, for example, we can see that the most strong overall tie is from EE to IE. While it is a natural application for DiffusionRank because of the easy interpretation by the amount heat from one group to another group, it is difficult to apply other ranking techniques to such an application because they lack such a physical meaning. 4.2.3 Graph cut Third, it can be used to partition the Web graph into several parts. A quick example is shown below. The graph in Fig. 1 (b) is an undirected graph, and so we employ the Eq. (3). If we know that node 1 belongs to one community and that node 12 belongs to another community, then we can put one unit positive heat source on node 1 and one unit negative heat source on node 12. After time 1, if we set = 0.5, the heat distribution is [0.25, 0.16, 0.17, 0.16, 0.15, 0.09, 0.01, -0.04, -0.18 -0.21, -0.21, -0.34], and if we set = 1, it will be [0.17, 0.16, 0.17, 0.16, 0.16, 0.12, 0.02, -0.07, -0.18, -0.22, -0.24, -0.24]. In both settings, we can easily divide the graph into two parts: {1, 2, 3, 4, 5, 6, 7} with positive temperatures and {8, 9, 10, 11, 12} with negative temperatures. For directed graphs and random graphs, similarly we can cut them by employing corresponding heat solution. 4.2 4.2.1 Advantages Two closed forms Next we show the four advantages for DiffusionRank. First, its solutions have two forms, both of which are closed form. One takes the discrete form, and has the advantage of fast computing while the other takes the continuous form, and has the advantage of being easily analyzed in theoretical aspects. The theoretical advantage has been shown in the proof of theorem in the next section. 4.2.4 Anti-manipulation Fourth, it can be used to combat manipulation. Let G2 contain trusted Web pages (j1 , j2 , . . . , js ), then for each page v hi,jv is the heat that page i receives from G2, and can i, be computed by the discrete approximation of Eq. (4) in the case of a static graph or Eq. (6) in the case of a random graph, in which f (0) is set to be a special initial heat distribution so that the trusted Web pages have unit heat while all the others have zero heat. In doing so, manipulated Web page will get a lower rank unless it has strong in-links from the trusted Web pages directly or indirectly. The situation is quite different for PageRank because PageRank is inputindependent as we have shown in Section 3.1. Based on the fact that the connection from a trusted page to a "bad" page should be weak­less uncross paths, longer distance and narrower pipe, we can say DiffusionRank can resist web spam if we can select trusted pages. It is fortunate that the trusted pages selection method in [7]­the first part of TrustRank can help us to fulfill this task. For such an application of DiffusionRank, the computation complexity for Discrete Diffusion Kernel is the same as that for PageRank in cases of both a static graph and a random graph. This can be seen in Eq. (6), by which we need N iterations and for each iteration we need a multiplication operation between a matrix and a vector, while in Eq. (1) we also need a multiplication operation between a matrix and a vector for each iteration. (a) Group to Group Relations (b) An undirected graph Figure 1: Two graphs 4.2.2 Group-group relations Second, it can be naturally employed to detect the groupgroup relation. For example, let G2 and G1 denote two groups, containing pau es (j1 , j2 , . . . , js ) and (i1 , i2 , . . . , it ), g respectively. Then ,v hiu ,jv is the total amounts of heat that G1 receives from G2, where hiu ,jv is the iu -th row jv -th column element of the heat kernel. More specifically, 435 SIGIR 2007 Proceedings Session 17: Spam Spam Spam 4.3 The Physical Meaning of plays an important role in the anti-manipulation effect of DiffusionRank. is the thermal conductivity­the heat diffusion coefficient. If it has a high value, heat will diffuse very quickly. Conversely, if it is small, heat will diffuse slowly. In the extreme case, if it is infinitely large, then heat will diffuse from one node to other nodes immediately, and this is exactly the case corresponding to PageRank. Next, we will interpret it mathematically. Theorem 1. When tends to infinity and f (0) is not the zero vector, e R f (0) is proportional to the stable distribution produced by PageRank. 1 Let g = n 1. By the Perron Theorem [11], we have shown that 1 is the largest eigenvalue of P = [(1 - )g1T + A], and that no other eigenvalue whose absolute value is equal to 1. Let x be the stable distribution, and so Px = x. x is the eigenvector corresponding to the eigenvalue 1. Assume the n - 1 other eigenvalues of P are |2 | < 1, . . . , |n | < 1, we can find an invertible matrix S = ( x S1 ) such that 1 0 2 S-1 PS = (7) . .. 0 0 . 00 0 n For a given threshold , find N such that ||((I + N R)N - e R )f (0)|| < for any f (0) whose sum is one. Since it is difficult to solve this problem, we propose a heuristic motivated by the following observations. When R = -I + P, by Eq. (7), we have (I + N R)N = (I + N (-I + N P)) = 1 (2 -1) N 0 (1 + ) N S-1 S. (9) .. 0 . 0 (n -1) N 0 0 0 (1 + ) N Comparing Eq. (8) and Eq. (9), we observe that the eigen n values of (I + N R)N - e R are (1 + (N-1) )N - e (n -1) . We propose a heuristic method to determine N so that the difference between the eigenvalues are less than a threshold for only positive s. We also observe that if = 1, < 1, then |(1 + (-1) )N - N (-1) e | < 0.005 if N 100, and |(1 + (-1) )N - e (-1) | < N 0.01 if N 30. So we can set N = 30, or N = 100, or others according to different accuracy requirements. In this paper, we use the relatively accurate setting N = 100 to make the real eigenvalues in (I + N R)N - e R less than 0.005. 5. EXPERIMENTS In this section, we show the experimental data, the methodology, the setting, and the results. Since e R =e (-I+P) 1 0 S -1 0 0 = .. 0 . e (n -1) e (2 -1) 0 0 S, (8) 5.1 Data Preparation Our input data consist of a toy graph, a middle-size realworld graph, and a large-size real-world graph. The toy graph is shown in Fig. 2 (a). The graph below it shows node 1 is being manipulated by adding new nodes A, B , C, . . . such that they all point to node 1, and node 1 points to them all. The data of two real Web graph were obtained from the domain in our institute in October, 2004. The total number of pages found are 18,542 in the middle-size graph, and 607,170 in the large-size graph respectively. The middle-size graph is a subgraph of the large-size graph, and they were obtained by the same crawler: one is recorded by the crawler in its earlier time, and the other is obtained when the crawler stopped. all eigenvalues of the matrix e R are 1, e (2 -1) , . . . , e (n -1) . When , they become 1, 0, . . . , 0, which means that 1 is the only nonzero eigenvalue of e R when . We can see that when , e R e R f (0) = e R f (0), and so e R f (0) is an eigenvector of e R when . On the other hand, 2 3 2 e R x = (I+ R+ ! R2 + ! R3 +. . .)x = Ix+ Rx+ ! R2 x+ 2 3 2 + . . . = x since Rx = (-I + P)x = -x + x = 0, and hence x is the eigenvector of e R for any . Therefore both x and e R f (0) are the eigenvectors corresponding the unique eigenvalue 1 of e R when , and consequently x = ce R f (0). By this theorem, we see that DiffusionRank is a generalization of PageRank. When = 0, the ranking value is most robust to manipulation since no heat is diffused and the system is unchangeable, but the Web structure is completely ignored since e R f (0) = e0R f (0) = If (0) = f (0); when = , DiffusionRank becomes PageRank, it can be manipulated easily. We expect an appropriate setting of that can balance both. For this, we have no theoretical result, but in practice we find that = 1 works well in Section 5. Next we discuss how to determine the number of iterations if we employ the discrete heat kernel. 3 R3 x 3! 5.2 Methodology The algorithms we run include PageRank, TrustRank and DiffusionRank. All the rank values are multiplied by the number of nodes so that the sum of the rank values is equal to the number of nodes. By this normalization, we can compare the results on graphs with different sizes since the average rank value is one for any graph after such normalization. We will need value difference and pairwise order difference as comparison measures. Their definitions are listed as follows. Value Difference. The value difference betn een A = w {Ai }n 1 and B = {Bi }n 1 is measured as i=1 |Ai - Bi |. i= i= Pairwise Order Difference. The order difference between A and B is measured as the number of significant order differences between A and B. The pair (A[i], A[j ]) and (B [i], B [j ]) is considered as a significant order difference if one of the following cases happens: both A[i] > [ <]A[j ] + 0.1 and B [i] [ ]A[j ]; both A[i] [ ]A[j ] and B [i] > [ < ]A[j ] + 0.1. 4.4 The Number of Iterations While we enjoy the advantage of the concise form of the exponential heat kernel, it is better for us to calculate DiffusionRank by employing Eq. (6) in an iterative way. Then the problem about determining N ­the number of iterations arises: 436 SIGIR 2007 Proceedings Session 17: Spam Spam Spam Rank of the Manipulatd Node-1 12 50 40 30 20 10 0 50 40 30 20 10 0 0 2 1 5 10 DiffusionRank-Trust 4 PageRank TrustRanl-Trust 4 Rank of the Manipulatd Node-2 50 40 30 20 10 0 0 DiffusionRank-Trust 4 PageRank TrustRanl-Trust 4 Rank of the Manipulatd Node-3 50 40 30 20 10 0 0 DiffusionRank-Trust 4 PageRank TrustRanl-Trust 4 6 3 4 8 Value Difference A 6 Trust set={1} Trust set={2} Trust set={3} Trust set={4} Trust set={5} Trust set={6} 50 100 50 100 50 100 Rank of the Manipulatd Node-4 Rank of the Manipulatd Node-5 B 6 C 1 3 4 5 2 4 50 40 30 20 10 0 Rank of the Manipulatd Node-6 DiffusionRank-Trust 3 PageRank TrustRanl-Trust 3 DiffusionRank-Trust 4 PageRank TrustRanl-Trust 4 50 40 30 20 10 0 DiffusionRank-Trust 4 PageRank TrustRanl-Trust 4 2 ... 0 0 1 2 3 Gamma 4 5 6 (a) (b) 0 50 100 Number of New Added Nodes 0 50 100 Number of New Added Nodes 0 50 100 Number of New Added Nodes Figure 2: (a) The toy graph consisting of six no des, and no de 1 is b eing manipulated by adding new no des A, B , C, . . . (b) The approximation tendency to PageRank by DiffusionRank Figure 3: The rank values of the manipulated no des on the toy graph 8000 180 PageRank DiffusionRank-uniform DiffusionRank0 DiffusionRank1 DiffusionRank2 DiffusionRank3 TrustRank0 TrustRank1 TrustRank2 TrustRank3 160 Rank of the Manipulatd Node 5.3 Experimental Set-up Rank of the Manipulatd Node 7000 6000 5000 4000 3000 2000 1000 0 10000 The experiments on the middle-size graph and the largesize graphs are conducted on the workstation, whose hardware model is Nix Dual Intel Xeon 2.2GHz with 1GB RAM and a Linux Kernel 2.4.18-27smp (RedHat7.3). In calculating DiffusionRank, we employ Eq. (6) and the discrete approximation of Eq. (4) for such graphs. The related tasks are implemented using C language. While in the toy graph, we employ the continuous diffusion kernel in Eq. (4) and Eq. (5), and implement related tasks using Matlab. For nodes that have zero out-degree (dangling nodes), we employ the method in the modified PageRank algorithm [8], in which dangling nodes of are considered to have random links uniformly to each node. We set = I = B = 0.85 in all algorithms. We also set g to be the uniform distribution in both PageRank and DiffusionRank. For DiffusionRank, we set = 1. According to the discussions in Section 4.3 and Section 4.4, we set the iteration number to be MB = 100 in DiffusionRank, and for accuracy consideration, the iteration number in all the algorithms is set to be 100. 140 120 100 80 60 40 20 PageRank DiffusionRank TrustRank DiffusionRank-uniform 8000 6000 4000 2000 Number of New Added Points 0 2000 4000 6000 8000 10000 Number of New Added Points (a) (b) Figure 4: (a) The rank values of the manipulated no des on the middle-size graph; (b) The rank values of the manipulated no des on the large-size graph 5.4 Approximation of PageRank We show that when tends to infinity, the value differences between DiffusionRank and PageRank tend to zero. Fig. 2 (b) shows the approximation property of DiffusionRank, as proved in Theorem 1, on the toy graph. The horizontal axis of Fig. 2 (b) marks the value, and vertical axis corresponds to the value difference between DiffusionRank and PageRank. All the possible trusted sets with L = 1 are considered. For L > 1, the results should be the linear combination of some of these curves because of the linearity of the solutions to heat equations. On other graphs, the situations are similar. 5.5 Results of Anti-manipulation In this section, we show how the rank values change as the intensity of manipulation increases. We measure the intensity of manipulation by the number of newly added points that point to the manipulated point. The horizontal axes of Fig. 3 stand for the numbers of newly added points, and vertical axes show the corresponding rank values of the manipulated nodes. To be clear, we consider all six situations. Every node in Fig. 2 (a) is manipulated respectively, and its corresponding values for PageRank, TrustRank (TR), DiffusionRank (DR) are shown in the one of six sub-figures in Fig. 3. The vertical axes show which node is being manipulated. In each sub-figure, the trusted sets are computed below. Since the inverse PageRank yields the results [1.26, 0.85, 1.31, 1.36, 0.51, 0.71]. Let L = 1. If the manipulated node is not 4, then the trusted set is {4}, and otherwise {3}. We observe that in all the cases, rank values of the manipulated node for DiffusionRank grow slowest as the number of the newly added nodes increases. On the middle-size graph and the large-size graph, this conclusion is also true, see Fig. 4. Note that, in Fig. 4 (a), we choose four trusted sets (L = 1), on which we test DiffusionRank and TrustRank, the results are denoted by DiffusionRanki and TrustRanki (i = 0, 1, 2, 3 denotes the four trusted set); in Fig. 4 (b), we choose one trusted set (L = 1). Moreover, in both Fig. 4 (a) and Fig. 4 (b), we show the results for DiffusionRank when we have no trusted set, and we trust all the pages before some of them are manipulated. We also test the order difference between the ranking order A before the page is manipulated and the ranking order P A after the page is manipulated. Because after manipu- 437 SIGIR 2007 Proceedings Session 17: Spam Spam Spam lation, the number of pages changes, we only compare the common part of A and P A. This experiment is used to test the stability of all these algorithms. The less the order difference, the stabler the algorithm, in the sense that only a smaller part of the order relations is affected by the manipulation. Figure 5 (a) shows that the order difference values change when we add new nodes that point to the manipulated node. We give several settings. We find that when = 1, the least order difference is achieved by DiffusionRank. It is interesting to point out that as increases, the order difference will increase first; after reaching a maximum value, it will decrease, and finally it tends to the PageRank results. We show this tendency in Fig. 5 (b), in which we choose three different settings--the number of manipulated nodes are 2,000, 5,000, and 10,000 respectively. From these figures, we can see that when < 2, the values are less than those for PageRank, and that when > 20, the difference between PageRank and DiffusionRank is very small. After these investigations, we find that in all the graphs we tested, DiffusionRank (when = 1) is most robust to manipulation both in value difference and order difference. The trust set selection algorithm proposed in [7] is effective for both TrustRank and DiffusionRank. 3 x 10 5 7. ACKNOWLEDGMENTS We thank Patrick Lau, Zhenjiang Lin and Zenglin Xu for their help. This work is fully supported by two grants from the Research Grants Council of the Hong Kong Special administrative Region, China (Pro ject No. CUHK4205/04E and Pro ject No. CUHK4235/04E). 8. REFERENCES [1] E. Agichtein, E. Brill, and S. T. Dumais. Improving web search ranking by incorp orating user b ehavior information. In E. N. Efthimiadis, S. T. Dumais, D. Hawking, and K. J¨rvelin, a editors, Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 19­26, 2006. [2] R. A. Baeza-Yates, P. Boldi, and C. Castillo. Generalizing pagerank: damping functions for link-based ranking algorithms. In E. N. Efthimiadis, S. T. Dumais, D. Hawking, and K. J¨rvelin, editors, Proceedings of the 29th Annual a International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 308­315, 2006. [3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373­1396, Jun 2003. [4] B. Bollob´s. Random Graphs. Academic Press Inc. (London), a 1985. [5] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning (ICML), pages 89­96, 2005. [6] N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In Proceeding of the 13th World Wide Web Conference (WWW), pages 309­318, 2004. [7] Z. Gy¨ngyi, H. Garcia-Molina, and J. Pedersen. Combating o ¨ web spam with trustrank. In M. A. Nascimento, M. T. Ozsu, D. Kossmann, R. J. Miller, J. A. Blakeley, and K. B. Schiefer, editors, Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), pages 576­587, 2004. [8] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploiting the blo ck structure of the web for computing pagerank. Technical rep ort, Stanford University, 2003. [9] R. I. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In C. Sammut and A. G. Hoffmann, editors, Proceedings of the Nineteenth International Conference on Machine Learning (ICML), pages 315­322, 2002. [10] J. Lafferty and G. Lebanon. Diffusion kernels on statistical manifolds. Journal of Machine Learning Research, 6:129­163, Jan 2005. [11] C. R. MacCluer. The many pro ofs and applications of p erron's theorem. SIAM Review, 42(3):487­498, 2000. [12] A. Ntoulas, M. Na jork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th international conference on World Wide Web (WWW), pages 83­92, 2006. [13] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Rep ort Pap er SIDL-WP-1999-0120 (version of 11/11/1999), Stanford Digital Library Technologies Pro ject, 1999. [14] H. Yang, I. King, and M. R. Lyu. NHDC and PHDC: Non-propagating and propagating heat diffusion classifiers. In Proceedings of the 12th International Conference on Neural Information Processing (ICONIP), pages 394­399, 2005. [15] H. Yang, I. King, and M. R. Lyu. Predictive ranking: a novel page ranking approach by estimating the web structure. In Proceedings of the 14th international conference on World Wide Web (WWW) - Special interest tracks and posters, pages 944­945, 2005. [16] H. Yang, I. King, and M. R. Lyu. Predictive random graph ranking on the web. In Proceedings of the IEEE World Congress on Computational Intel ligence (WCCI), pages 3491­3498, 2006. [17] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Scholkopf. Ranking on data manifolds. In S. Thrun, L. Saul, ¨ and B. Sch¨lkopf, editors, Advances in Neural Information o Processing Systems 16 (NIPS 2003), 2004. 2.5 x 10 5 2.5 Pairwise Order Difference Pairwise Order Difference 2 PageRank DiffusionRank-Gamma=1 DiffusionRank-Gamma=2 DiffusionRank-Gamma=3 DiffusionRank-Gamma=4 DiffusionRank-Gamma=5 DiffusionRank-Gamma=15 TrustRank DiffusionRank: when added 2000 nodes DiffusionRank: when added 5000 nodes DiffusionRank: when added 10000 nodes PageRank 2 1.5 1.5 1 1 0.5 0.5 0 0 1000 2000 3000 4000 5000 6000 7000 Number of New Added Points 8000 9000 10000 0 0 5 10 Gamma 15 20 (a) (b) Figure 5: (a) Pairwise order difference on the middle-size graph, the least it is, the more stable the algorithm; (b) The tendency of varying 6. CONCLUSIONS We conclude that DiffusionRank is a generalization of PageRank, which is interesting in that the heat diffusion coefficient can balance the extent that we want to model the original Web graph and the extent that we want to reduce the effect of link manipulations. The experimental results show that we can actually achieve such a balance by setting = 1, although the best setting including varying i is still under further investigation. This anti-manipulation feature enables DiffusionRank to be a candidate as a penicillin for Web spamming. Moreover, DiffusionRank can be employed to find group-group relations and to partition Web graph into small communities. All these advantages can be achieved in the same computational complexity as PageRank. For the special application of anti-manipulation, DiffusionRank performs best both in reduction effects and in its stability among all the three algorithms. 438